.. some thoughts on Citizendium

Open-source communities have quite a lot of antagonism toward their open-source ‘rivals’, instead of seeing them as partners against Greater Evils. I imagine that bootstrapping a project like Citizendium is a huge task, so I followed some of the discussions in their forums:

  • It’s a nightmare. Only Mozilla Thunderbird gets more disrespect from me. – Lead developer Jason, describing the software they use, MediaWiki.
  • Can we like anonymize some stuff and submit Mediawiki to WorseThanFailure? – Technical liasion [sic] Zachary suggests.

Of course, being forced to run an open-source package from your greatest ‘rival’ is pain, oh pain. The Citizendium team even forked the software and called it CaesarWiki.
This is how improvements to the fork are described:

Well, hypothetically, we can do whatever we want in terms of improving MediaWiki, including working on the difference engine.
However, I think that it’s more likely that any changes in that area will filter down from work done by the MediaWiki team.
They have a lot more developer time (in developer-hours/month) and a lot more expertise with MediaWiki.

Of course, having a paid lead developer who does not understand the core principles of how the software functions (disrespect, remember?) doesn’t help with real improvements. Of course, half a year ago, big work was ahead:

Ideally, I would like to rewrite mediawiki from the ground up in OO style. Since that may not work well, the best way is to wrap it in a bow and let the “present” develop into something pretty over time.

Xoops was given as an example of a package that scales, has some security, and even uses caching, so integrating it with MediaWiki would make MediaWiki scale. That’s a sure way forward. Of course, one of the biggest mistakes the Wikipedia folks have made is the choice of LAMP:

To not box ourself in like Wikipedia has done with Mediawiki, PHP and MySQL, we need to pursue modular, easy to use and easy to maintain and update solutions. No one needs network and system admins spinning dinner plates on sticks all day.

It is quite difficult to understand how people who never talked to us know that much about our operations. Back when this was written, Wikipedia had one full-time employee working on the system; a few others did the work whenever they (we) wished, and that usually was creative (of course, sometimes artistic) work. Anyway, to run away from evil MySQL to PostgreSQL (PG), this set of arguments was used:

Disadvantage: no years of heavy use to test it. Advantage: fewer workarounds, easier to scale overall, incredibly knowledgeable community ready to help out.

Of course, at Wikipedia we failed to scale. Now, what made me slightly envious is the discussion about security and operating personnel: having a pool of developers scattered around the world, with a floating 24/7 schedule, is priceless, and we really can’t afford that at Wikipedia. At one moment all of us were in Europe; now just Brion is sitting in Florida (which is not that far away either).

Anyway, though I believe in Wikipedia evolution more than in Citizendium revolution, I wouldn’t reject advice. The project may be quite interesting, and if its content can be reused by other projects, it just adds value to the Web. We’re probably rookies in software engineering, but it has been a long path to build the Wikipedia platform. Some of us learnt the technologies used specifically for the project. I’m not sure we earned the disrespect we’re getting, but I still think that the antagonism is harming Citizendium, not us.

On books, examples and wizards

I really like Flickr. I believe it is one of the greatest services on the Web (my album is there :), and it runs MySQL. But I don’t understand why their database folks think nobody else knows how to build stuff. I have seen nice books and presentations from Yahoo! (oh wait!) and other guys who have been building big systems and engineered their solutions properly, so they survived.

Technical literature was noticed:

Then there are whole books on the subject of capacity and scalability for the database layer.

Yes, they are good reads.

Or…

Then there are novels from developers that in many cases really don’t know the tricks of the DBMS they are working with, and create elaborate abstraction layers that automatically generate SQL for the DB in question from objects and such.

Some of these layers are made with efficient SQL in mind, or always allow their functionality to be overridden with efficient SQL.

So, it is easy to answer this question:

But, with all these people who tell you how to do it, actually can they prove that it works under a constant high workload for many people all at the same time.

I believe these people can.

Then parts of Flickr’s operation are described:

You may be thinking to yourself yea right say you can do 20K + transactions per seconds that must be a crap load of expensive hardware all running, where all the data is served out of memory.

With a proper data layout, a single $10,000 system may handle 10,000 queries per second. Of course, hitting disk may decrease efficiency, so one may end up paying $2-5 of hardware per query per second of capacity. I’m not sure Flickr would consider a $100k database setup expensive hardware. Here again, “all data served from memory” may sound expensive, but mostly systems serve just “most data from memory”. Like ours, which is running on those $10k-class machines (disclosure: ~10 of them) and serving >30,000 queries per second at the moment. And that is efficient.
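The arithmetic behind those figures can be spelled out. This is only a back-of-the-envelope sketch in Python: the function name is made up for illustration, and reading the "2-5$" figure as dollars of hardware per query-per-second of capacity is an assumption. The dollar and throughput numbers are the ones quoted above.

```python
def dollars_per_qps(hardware_cost, queries_per_second):
    """Hardware dollars spent per query-per-second of capacity."""
    return hardware_cost / queries_per_second

# One well-laid-out $10,000 box handling 10,000 queries/s.
ideal = dollars_per_qps(10_000, 10_000)

# About ten $10k-class machines serving a bit over 30,000 queries/s.
cluster = dollars_per_qps(10 * 10_000, 30_000)

# Budget for 20,000 transactions/s at the pessimistic $5 per query/s.
budget_20k = 20_000 * 5

print(f"ideal: ${ideal:.2f}, cluster: ${cluster:.2f}, 20K budget: ${budget_20k:,}")
```

So even at the pessimistic end, 20K transactions per second comes out around the $100k mark, and the ~10 machine cluster lands inside the $2-5 range.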

This blows minds and wipes away the stuff we know:

All of our database connections are real time. Our load balancer for the database is written in 13 lines of PHP code.

There are lots of posts detailing how fast MySQL connections are and how database pooling isn’t necessary anymore. Our load balancer is actually 651 lines of PHP code, but it still connects to the database at each request. And it takes less than a millisecond to connect, which is quite an affordable cost.
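To give a feel for why such a balancer can stay small, here is a minimal sketch in Python. It is not the PHP code discussed above, just the general idea: pick a server by weighted random choice, then open a fresh connection for that request. The server names, the weights and the `connect` callback are all hypothetical, and a real setup would also have to care about replication lag and dead hosts.

```python
import random

# Hypothetical replicas and relative load weights (not real hosts).
SERVERS = {"db1": 100, "db2": 200, "db3": 200}

def pick_server(servers, rng=random):
    """Choose one server name, with probability proportional to its weight."""
    names = list(servers)
    weights = [servers[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

def handle_request(connect, servers=SERVERS, rng=random):
    """Open a new connection for this request; no pool is kept around."""
    host = pick_server(servers, rng)
    return connect(host)

# Stand-in for a real driver's connect call, so the sketch runs anywhere.
def fake_connect(host):
    return f"connection to {host}"

print(handle_request(fake_connect))
```

Connecting on every request keeps the balancer stateless, which is only reasonable as long as connecting stays as cheap as described above.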

I am certainly interested in all the specifics of Flickr’s design: it is a nice application, a perfect example, and it seems to work. Still, I don’t believe that we should deny any other knowledge, or blindly follow wizards and their examples. Every application differs, every community has different patterns and wishes, so we should rather follow what people need and create good products for them, like Flickr. Sometimes even a one-man (or all-volunteer) engineering team may do miracles, especially when there are open platforms to build on.

It is hard to swallow the endless possibilities, that are provided by new type of services. I’m not sure wizardry is that difficult to swallow these days. In the modern software world there are no orthodox or unorthodox designs. There are just ones that work and ones that don’t.

spread: bad example of open source

The Spread toolkit is an example of an open-source project that had better not exist. It is reliable multicast, it has APIs in multiple programming languages, and it can provide a message queueing facility you can run and forget. There’s even a MySQL Message API based on it: you can use sync and async messaging between a bunch of MySQL servers. Using Spread may give you lots of possibilities in deploying a distributed system.

At Wikipedia’s content cluster we could use lots of synchronization based on Spread, but…

3. All advertising materials (including web pages) mentioning features or use
   of this software, or software that uses this software, must display the following
   acknowledgment:

   "This product uses software developed by Spread Concepts LLC for use in the Spread toolkit.
    For more information about Spread see http://www.spread.org"

That would mean that if we used Spread somewhere in the cluster, we’d be showing ads for a university project on every page (or at least that is what ‘must display’ sounds like). Of course, as a university project, it might want some advertisement, but I think it would get far more of it without the viral advertisement clause; it is still the only framework of its kind out there.

An additional problem in such a situation is that, being half-free (or.. adware), it half-fills the community’s need for a proper messaging toolkit. Starting a similar project when there’s Spread might not look attractive.. Of course, there’s always a bunch of IRC servers: you would find lots of a system’s messaging needs efficiently implemented there, just without reliability and guarantees. But probably the best way would be simply asking the Spread authors to release it under the GPL or any other proper open source license? :)

php4: not supported, use php5

Tim wrote to the PHP internals list, asking:

is there any intention to backport this simple but important bugfix to PHP 4?
Many PHP users are still using PHP 4, and it's not a very well advertised fact that
it does not properly support arrays with more than 64K entries.

Markus Berger responds:

Just change to 5.

It seems that the MediaWiki HEAD branch will drop PHP 4 support soon.

ways to (not) attract tv viewers

This year December 31st is a Saturday. It means that I’m lazy in the morning, procrastinating on all tasks and doing whatever is absolutely useless. So I compared what the two major TV channels in Lithuania are offering their beloved viewers on New Year’s Eve. I just took all the movies (and some full-length animated ones) they’re showing, checked their IMDB ratings, and used some formulas everyone knows to determine whether anyone should watch TV.

Metric           LNK    TV3
Count              8      9
Rating sum      36.6   47.0
Rating average   4.6    5.2
Maximum          6.6    7.4
Minimum          3.6    3.3
SqDev            6.1   10.7
Median           4.3    5.4

First of all, you may end up in terror, as both channels have sub-4 rating movies. Generally speaking, this is very very bad. Besides, the average is terribly low as well, though TV3 might be in slightly better shape. If you get up early in the morning, you may watch ‘Chicken Run’, which is quite a good animated flick. You’re saved if you have cable or satellite TV, or if you’ve got enough booze to erase all memories :) I just wonder how major TV channels could get so much crap…

And this is the proof:

LNK                     IMDB  | TV3                       IMDB
09:45 Groove squad       3.9  | 08:00 Der Weihnachtswolf   5.5
11:10 Good burger        4.2  | 09:45 Chicken Run          7.4
12:50 Mr. Nice Guy       4.8  | 11:10 The Cat In The Hat   3.3
14:25 Joe's Apartment    5.0  | 12:45 Dunston Checks In    4.6
15:55 On Deadly Ground   3.6  | 14:25 Shallow Hal          6.0
19:00 Der Clown          4.1  | 16:35 Nutty Professor II   4.5
20:50 Rush Hour          6.6  | 19:08 Gorgeous             5.4
01:30 Who's your daddy   4.4  | 00:05 Commando             5.7
                              | 01:55 Swimfan              4.6
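
The summary figures can be recomputed from this listing. Below is a minimal Python sketch; it assumes that "SqDev" means the sum of squared deviations from the mean (that reading reproduces the 6.1 and 10.7 in the metrics table), and the variable and function names are purely illustrative.

```python
from statistics import mean, median

# IMDB ratings copied from the listing above.
lnk = [3.9, 4.2, 4.8, 5.0, 3.6, 4.1, 6.6, 4.4]
tv3 = [5.5, 7.4, 3.3, 4.6, 6.0, 4.5, 5.4, 5.7, 4.6]

def summarize(ratings):
    """Return the metrics from the comparison table, rounded to one decimal."""
    avg = mean(ratings)
    return {
        "count": len(ratings),
        "sum": round(sum(ratings), 1),
        "average": round(avg, 1),
        "maximum": max(ratings),
        "minimum": min(ratings),
        # Assumed meaning of "SqDev": sum of squared deviations from the mean.
        "sqdev": round(sum((r - avg) ** 2 for r in ratings), 1),
        "median": round(median(ratings), 1),
    }

for channel, ratings in (("LNK", lnk), ("TV3", tv3)):
    print(channel, summarize(ratings))
```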