domas mituzas

one day in london

Woo, had one day (actually one and a half ;-) in London. I had a perfect guide, who could walk fast and show whole center on foot in single afternoon. Of course, Trafalgar square was a must for me (as I’ve read about it in detail more than ten years ago).

We were still walking late at night:

Besides riding double-deckers, we did take a tube, where steep and long spiral staircases are common, if you want to get out. And right, you don’t have to forget you’re in London – taking a minor tube journey may surprise you with heavy shower, though before that you could not see clouds :) Here, we went to British Museum, it’s all wet:

Inside you may spend hours, and variety of cultures represented may force you into deep discussions about all worlds being together (and, heh, Muslim cartoons).. Here, your stairs to the wisdom:

So, next early morning, full of impressions I was riding a cab home…

I always try to get emergency exit seats (not because I’m afraid, though), so every time I have to hear what should I do in case of disaster. This time it was funny:

Steward: Please follow instructions how to open emergency door in case of disaster
Me: Oh sure, if windows doesn’t fall off first
Steward: ?????
I gently touch the decoration frame around the window and it falls off. Window stays intact.
Steward: ?!!!?!!!!?
Me: :-D

wordpress: friendly look at query log

WordPress is no doubt one of most popular web packages (besides MediaWiki, ha ha ha) out there. It is easy to install, it is easy to use. And it doesn’t run huge sites, it is not designed to. Anyway, even as it is not designed to, I’m running it for my blog, an old version of course (2.0 is scary for me).

I have general query log enabled (for now it’s easier to truncate it, than to disable it on running server), so of course I spot all sorts of funny queries running on my system (people hit the site when I want to test some stuff…). So now I looked at what WordPress does in order to greet my fellow visitors. Some of my comments might be rubbish, as I did not do any profiling on this, but some of them might provide insights for anyone developing web application (or at least blog).

First of all, it wants to know who are privileged users on the site. I don’t know why, but it does this:

SELECT * FROM wp_users WHERE user_level > 0

Of course, no index on user_level, so even if you do not have any privileged users, if you have enabled registration (or even force people to register to post comments, eh), you will have server scanning whole table – there’s no index on user_level. You might want to add it.

Next step: options.

SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'

Here it reads all options one would need. It’s fun – there’re 60 options that are autoloaded, and their values weight 500 bytes in total. The ones that are not autoloaded (hey, no index here either) – 11 rows, 150k of data. Even more – it’s just 4 rss feeds serialized and put into database. There are multiple lessons here:

No need to put options in database when you can have it serialized on disk.
No need to put bulk RSS feed data together with lots of small options – use other table for that.
Here again, even if putting bulk data together, use indexes for getting proper subset. Anyway, above points are somewhat more important

Further: categories

SELECT * FROM wp_categories

Um, small dataset, let it live. Of course, one could build self-join to have a recursive tree. Doesn’t matter. Sorry for noise!

Fun begins. Posts!!!

SELECT DISTINCT * FROM wp_posts
WHERE 1=1 AND
post_date_gmt < = '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2
AND post_status != 'draft' AND
    post_status != 'static')
GROUP BY wp_posts.ID
ORDER BY post_date DESC
LIMIT 0, 10

First of all, we select DISTINCT * (this table has primary key, it can’t have dupe rows), then we GROUP BY by unique ID (which again doesn’t make sense), then we eliminate all possible order by index situations, oh wait, there’s no index on post_date. I’d really like if it just used “publish” status for front page and would not care about anything else. This is publishing system with workflows! :)

Mapping categories:

 SELECT DISTINCT
post_id, cat_ID FROM wp_categories, wp_post2cat
WHERE category_id = cat_ID AND post_id IN
(31,30,29,28,27,26,25,24,23,22)

It could be done by… joining with above wp_posts query, scanning an index, then using GROUP BY and GROUP_CONCAT() for category names. Or even having that denormalized (aggregated) inside posts table, but that’s already not that nice. Forget it :)

Counting comments… I know this is always very important – everyone needs to know exact count:

SELECT ID, COUNT( comment_ID ) AS ccount
FROM wp_posts
LEFT JOIN wp_comments ON (
comment_post_ID = ID  AND
comment_approved =  '1')
WHERE ID IN (31,30,29,28,27,26,25,24,23,22)
GROUP BY ID

What happens more often, reading comment count or adding a comment? If later, skip this section, otherwise – put comment count in posts table and increment it once comment is visible. Incrementing and decrementing is slightly cheaper than scanning a table after you’re flooded with hate/love messages.

Um, then… meta!
SELECT post_id, meta_key, meta_value FROM wp_postmeta
WHERE post_id IN(31,30,29,28,27,26,25,24,23,22)
ORDER BY post_id, meta_key

Maybe I’m not one of those semantic web folks, but I’d rather skip querying empty table, less IPC – faster site is! And I’ve seen multiple queries for every article displayed against this table. It’s empty! No need to check it again!

Dates.. months.. years… counts…

SELECT DISTINCT
   YEAR(post_date) AS `year`,
   MONTH(post_date) AS `month`,
   count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

This is where aggregation may help. Once new month starts, statistics for old month (and even current one) can be written into separate table. Otherwise one would have to scan all posts, order them, sum and present that to you. Every time.

Here we can count articles once again:

SELECT COUNT(ID) FROM wp_posts
WHERE 1=1 AND post_date_gmt < = '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2 AND
post_status != 'draft' AND post_status != 'static')

Some of above comments apply :)

Static pages… This is content management system – it has to allow to create pages! For that we:

SELECT * FROM wp_posts WHERE post_status = 'static'
ORDER BY post_title ASC

So, for showing links to internal content (ten or twenty bytes on your html) it sends all text of all internal content to application. Well, WordPress keeps all text in posts table, so… * includes that. Of course, there’s no composite index on (status,title), so a traditional order by happens. Of course, no filesort optimizations apply (as rows are too wide to add them to filesort buffer), so table hits multiple times during query.

Afterwards it has some of ‘links’ magic, which I skipped. I don’t have jumbo blogrolls (though I’d love jumbo rolls in sushi bar :), so those queries look evil, but cause no harm :) Eventually there’s some additional counting:

SELECT DISTINCT
YEAR(post_date) AS `year`,
MONTH(post_date) AS `month`, count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

Um, not again… dejavu? :)

And then…

 Quit \o/

This was just a look at load of single page. There’re more various issues once other types of pageviews are done.
Anyway, major issues:

Do not use DISTINCT or GROUP BY if you do not have to.
Use indexes for sorting, row selection and offset retrieval. Create multiple indexes.
Avoid putting bulk data together with metadata.
Count and save. Reuse.

What would be my favorite front page query?

SELECT id,name,content,comment_count,
    group_concat(category) as categories
FROM posts
    LEFT JOIN postcategories USING (postid)
WHERE date< 'offset' AND status='published'
ORDER BY date DESC LIMIT 10

It would be absolutely efficient with (status,date) index, it would not need to scan comments table to get count, it would not have to scan whole posts table then group it or order it, and it would be faster than 3 slow queries on tables that may grow.

I had inner voice inside (from the hosting business past) – but MySQL has query cache! Yes, it has, but putting timestamps into queries eliminates possibilities of caching. And one should not rely on query caching, one should rely on data and index caching. This is where databases shine.

Final question: Will upgrade to WordPress 2.0 solve these issues? Won’t I be trapped in yet another featuritis?

more of spring in Vilnius

memories as rabbits

This is one of most funny memory back from school. Seven years ago a classmate managed to produce this magnificent picture, where you can recognize everyone, whom you’ve been for years with. Their faces, expressions, poses, gestures, properties and thoughts are in here. Some day I’ll really want to see works of the artist :) Click on thumb to get a bigger pic (and sure, easily identify myself).

100 processes ought to be enough for anybody

I couldn’t resist to rework the famous phrase to portray 100 process per user limit in MacOSX. One would ask why would you ever want to run 100 programs. You don’t – it’s enough that every running Dashboard widget is a process. Every terminal window has multiple processes attached, of course, every application running (and they hide safely in dock even without windows). There’re as well various OS helper applications. At the moment when I type, there’re like 90 processes running on my system, and I’m not even doing any development work. There is a solution for that, hidden in operating system internals, but what should regular user do?

Imagine Apple keynote:

Presenter: Here I’ll add magnificient amazing superb Dashboard widget. Oh wait, it doesn’t add, something is wrong, haha, it happens sometimes, doesn’t it?
Voice from crowd: There’s a nice undocumented file that has nice undocumented option that you can change!
Presenter: Yay, see, our system can do pretty anything, even run more than 100 processes!

Solution:
Create a /etc/sysctl.conf file and add following lines there:

kern.maxproc=2000
kern.maxprocperuid=1000

Then type following commands:

sysctl -w kern.maxproc=2000
sysctl -w kern.maxprocperuid=1000

Then you’d have to log out (or restart) and login. Puff, magic capabilities activated. Thanks, Steve!

it is better to stay inside

db abstraction: zend framework vs mediawiki

So now as Zend have released their framework I couldn’t resist to look at it’s database classes (there’s much more other stuff in it) and compare with our own bits in mediawiki.

Thumbs up for (what is a must in abstraction layer) plain query building functions. Those are easy to understand and seem to work :)

One of issues with SQL programming is that people queries like text instead of ‘structured queries’. This is where Zend’s $db->quoteInto() is overused – even where there’s no need to people are forced to embed SQL bits, instead of thinking just about data flow and relations.

In order to add additional criteria in MW, you’d do:

$where[]=array('field'=>$value);

Zend Framework’s query builder offers slightly different approach. There’s Select object, which still eats text as parameters:

$select->where('field = ?',$value);

What raised my eyebrows – $select->from('foo', '*');. First of all, SELECT * is evil. On the other hand, it is interesting design decision to put it together with ‘FROM’. How would one issue ‘SELECT 1’ query this way? :)

The $db->insert() calls appeared to be identical in both libraries, you just pass an array of key=>value pairs and a table to put all that into. The $db->update though had same arguments, but different evaluations of those. For Zend you have to pass just a plaintext query in form of $db->quoteInto(‘field=’,$value). MW though allows to pass arrays of conditions:

$where = array ( 'group' => $groupid, 'parent' => array ( 1,2,3,4));

Of course you may pass a plain string to MW’s DB abstraction, type detection helps a lot there :)

Same differences applies to delete method. People used to tell, that you don’t need any query building once you’ve got placeholder replacement in SQL queries. Usually what is missed is that abstractions are not just for portability or security, but also for easier way to do things. If plain text is to be replaced into array passing, doing it completely would help more (I’m not advocating to split strings into arrays of chars though ;-) It still looks better when one appends an item to an array than string concatenation.

‘SELECT’ builder has much more differences. We packed handling of ORDER BY, GROUP BY, HAVING, LIMIT, etc into $options, while Zend provides separate methods for that. I immediately missed ‘SELECT … FOR UPDATE’ in documentation – this is one of most important options in proper web applications.

The OFFSET/LIMIT support (and even recommendation for paging) requires serious attitude readjustment. Scanning result sets _is_ expensive and numeric offsets shouldn’t be shown as a ‘nice way out’.

Lack of outer JOINs (as well as lack of workarounds in documentation) was surprising too.

Of course, providing table as an object was a nice thing to do, especially with possibility to add additional logic to a table object:

class RoundTable extends Zend_Db_Table {
    public function insert($data) {
        if (empty($data['created_on'])) { $data['created_on'] = time(); }
        return parent::insert($data); }

It is nice to see such library pushed – as it is in it’s 0.1.1 state, there’s lots of room for improvement in future :)

There’re other libraries out there, but those are undocumented, unpackaged and not that well maintained. And of course, there’s Ruby On Rails, which was poorly mimicked in Zend’s Active Record implementation. I don’t like ORM myself, but with hype around it, Zend could try to make their ORM better (or rather, existing, as there’re no relations in their objects). Even Django looks much better in this field. So now I’m just waiting for 1.0 ;-)

spread: bad example of open source

The Spread toolkit is one of examples, where opensource project should better not exist. It is reliable multicast, it has APIs in multiple programming languages, and can provide message queueing facility you can run and forget. There’s even MySQL Message API based on it – you can use sync and async messaging between bunch of MySQL servers. Using Spread may give you lots of possibilities in deploying distributed system.

At Wikipedia’s content cluster we could use lots of synchronization based on Spread, but…

3. All advertising materials (including web pages) mentioning features or use
   of this software, or software that uses this software, must display the following
   acknowledgment:

   "This product uses software developed by Spread Concepts LLC for use in the Spread toolkit.
    For more information about Spread see http://www.spread.org"

That would mean that if we used Spread somewhere in cluster, we’d be showing adds for university project on every page (or at least that is what ‘must display’ sounds like). Of course, as some university project, it might want some advertisement, but I think it would get far more of it, if it was without viral advertisement clause – it is still the only framework of a kind out there.

Additional problem in such situation is that being half-free (or.. adware) it half-fills the need of proper messaging toolkit for community. Starting similar project when there’s Spread might not look attractive.. Of course, there’s always bunch of IRC servers – you would find lots of systems messaging needs efficiently implemented there, just without reliability and guarantees. But probably the best way would be simply asking Spread authors to release it under GPL or any other proper open source license? :)

printing envelopes in macosx

I had to print address labels on an envelope (people still need to use snail mail sometimes). I used to be spoiled by MS Office, so I started OpenOffice and tried to do something there. I failed:

It didn’t print “landscape” whatever I tried to do, didn’t save settings, barf.
It complained that locale “” did not exist, so I couldn’t have my street name spelled correctly.
Ah, sure, it crashed several times, it’s 2.0 experimental Mac Intel build, so I guess it must do that, in order to show how difficult QA is. :)

So I opened simple text editor and tried to do it from there. I failed too, because didn’t understand how to transform simple text files into rotated envelope print. It was either rescaled or simply wrong somewhere else. That counts as failure too. I tried to do that from TextMate too, but it didn’t like my tries either.

So I searched the web for hints. It wasn’t a real search, because first title of first hit did tell me exactly what to do: Address Book. The solution is simple, select recipient, print. Sometimes I still am too power-oriented :)

popping a stack in mysql

Warning: this post describes practices that rely on side effects and may provide unstable results.

I was asked how to treat a table in MySQL as a stack and pop oldest entry from it. Of course there is always traditional approach – you lock table, fetch oldest record, delete it, commit. It isn’t that fun, though. This is what I came up to (and had multiple corrections):

mysql> delete from jobs where (@id:=jobs_id) limit 1; select @id;
Query OK, 1 row affected (0.00 sec)

+------+
| @id  |
+------+
| 3    |
+------+
1 row in set (0.00 sec)

mysql> delete from jobs where (@id:=jobs_id) limit 1; select @id;
Query OK, 1 row affected (0.00 sec)

+------+
| @id  |
+------+
| 4    |
+------+
1 row in set (0.00 sec)

In case of transactions used, you’d need to commit as fast as you can, as DELETE would acquire locks (but that’s the price of ACID ;-) – concurrent pops are not that easy.

Appending to stack is easy though – just.. INSERT. If you do not want transactions, you would still want InnoDB and that is where side issues kick in.

InnoDB stores all records clustered together with primary key, so table scans also happen in order of PK. If you would apply ORDER BY, there would be no guarantee that MySQL would not do filesort (if it would always pick indexes, this method would be even more awesome ;-). In case of filesort, last @a value would be set before actually sorting the data set, so a different value would be returned than deleted.

MyISAM doesn’t care at all about order, unless you specify it to. Doing the pop trick without ORDER BY would treat the stack just as pool of records and would delete any single row (and return it of course). Adding ORDER BY would set @values again to pre-filesort state.

That reminds me another trick (has been floating around for a while), that allows atomic incrementing of a field and fetching a value:

mysql> update counters set counter_value=(@val:=counter_value+1); select @val;
    -> update counters set counter_value=(@val:=counter_value+1); select @val;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
+------+
| @val |
+------+
| 562  |
+------+
1 row in set (0.00 sec)
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
+------+
| @val |
+------+
| 563  |
+------+
1 row in set (0.00 sec)

Yet another warning: As it’s modifying same row, only one transaction would be able to edit the record. So this should be done either not in transaction, or on HEAP/MyISAM/Blackhole engines.

	markcallaghan (@mark… on MySQL does not need SQL
	markcallaghan (@mark… on MySQL does not need SQL
	Domas Mituzas on MySQL does not need SQL
	Marc on MySQL does not need SQL
	Nils Meyer on linux memory management for…