wordpress: friendly look at query log

WordPress is no doubt one of most popular web packages (besides MediaWiki, ha ha ha) out there. It is easy to install, it is easy to use. And it doesn’t run huge sites, it is not designed to. Anyway, even as it is not designed to, I’m running it for my blog, an old version of course (2.0 is scary for me).

I have general query log enabled (for now it’s easier to truncate it, than to disable it on running server), so of course I spot all sorts of funny queries running on my system (people hit the site when I want to test some stuff…). So now I looked at what WordPress does in order to greet my fellow visitors. Some of my comments might be rubbish, as I did not do any profiling on this, but some of them might provide insights for anyone developing web application (or at least blog).

First of all, it wants to know who are privileged users on the site. I don’t know why, but it does this:

SELECT * FROM wp_users WHERE user_level > 0

Of course, no index on user_level, so even if you do not have any privileged users, if you have enabled registration (or even force people to register to post comments, eh), you will have server scanning whole table – there’s no index on user_level. You might want to add it.

Next step: options.

SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'

Here it reads all options one would need. It’s fun – there’re 60 options that are autoloaded, and their values weight 500 bytes in total. The ones that are not autoloaded (hey, no index here either) – 11 rows, 150k of data. Even more – it’s just 4 rss feeds serialized and put into database. There are multiple lessons here:

  • No need to put options in database when you can have it serialized on disk.
  • No need to put bulk RSS feed data together with lots of small options – use other table for that.
  • Here again, even if putting bulk data together, use indexes for getting proper subset. Anyway, above points are somewhat more important

Further: categories

SELECT * FROM wp_categories

Um, small dataset, let it live. Of course, one could build self-join to have a recursive tree. Doesn’t matter. Sorry for noise!

Fun begins. Posts!!!

SELECT DISTINCT * FROM wp_posts
WHERE 1=1 AND
post_date_gmt < = '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2
AND post_status != 'draft' AND
    post_status != 'static')
GROUP BY wp_posts.ID
ORDER BY post_date DESC
LIMIT 0, 10

First of all, we select DISTINCT * (this table has primary key, it can’t have dupe rows), then we GROUP BY by unique ID (which again doesn’t make sense), then we eliminate all possible order by index situations, oh wait, there’s no index on post_date. I’d really like if it just used “publish” status for front page and would not care about anything else. This is publishing system with workflows! :)

Mapping categories:

 SELECT DISTINCT
post_id, cat_ID FROM wp_categories, wp_post2cat
WHERE category_id = cat_ID AND post_id IN
(31,30,29,28,27,26,25,24,23,22)

It could be done by… joining with above wp_posts query, scanning an index, then using GROUP BY and GROUP_CONCAT() for category names. Or even having that denormalized (aggregated) inside posts table, but that’s already not that nice. Forget it :)

Counting comments… I know this is always very important – everyone needs to know exact count:

SELECT ID, COUNT( comment_ID ) AS ccount
FROM wp_posts
LEFT JOIN wp_comments ON (
comment_post_ID = ID  AND
comment_approved =  '1')
WHERE ID IN (31,30,29,28,27,26,25,24,23,22)
GROUP BY ID

What happens more often, reading comment count or adding a comment? If later, skip this section, otherwise – put comment count in posts table and increment it once comment is visible. Incrementing and decrementing is slightly cheaper than scanning a table after you’re flooded with hate/love messages.

Um, then… meta!
SELECT post_id, meta_key, meta_value FROM wp_postmeta
WHERE post_id IN(31,30,29,28,27,26,25,24,23,22)
ORDER BY post_id, meta_key

Maybe I’m not one of those semantic web folks, but I’d rather skip querying empty table, less IPC – faster site is! And I’ve seen multiple queries for every article displayed against this table. It’s empty! No need to check it again!

Dates.. months.. years… counts…

SELECT DISTINCT
   YEAR(post_date) AS `year`,
   MONTH(post_date) AS `month`,
   count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

This is where aggregation may help. Once new month starts, statistics for old month (and even current one) can be written into separate table. Otherwise one would have to scan all posts, order them, sum and present that to you. Every time.

Here we can count articles once again:

SELECT COUNT(ID) FROM wp_posts
WHERE 1=1 AND post_date_gmt < = '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2 AND
post_status != 'draft' AND post_status != 'static')

Some of above comments apply :)

Static pages… This is content management system – it has to allow to create pages! For that we:

SELECT * FROM wp_posts WHERE post_status = 'static'
ORDER BY post_title ASC

So, for showing links to internal content (ten or twenty bytes on your html) it sends all text of all internal content to application. Well, WordPress keeps all text in posts table, so… * includes that. Of course, there’s no composite index on (status,title), so a traditional order by happens. Of course, no filesort optimizations apply (as rows are too wide to add them to filesort buffer), so table hits multiple times during query.

Afterwards it has some of ‘links’ magic, which I skipped. I don’t have jumbo blogrolls (though I’d love jumbo rolls in sushi bar :), so those queries look evil, but cause no harm :) Eventually there’s some additional counting:

SELECT DISTINCT
YEAR(post_date) AS `year`,
MONTH(post_date) AS `month`, count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

Um, not again… dejavu? :)

And then…

 Quit \o/ 

This was just a look at load of single page. There’re more various issues once other types of pageviews are done.
Anyway, major issues:

  • Do not use DISTINCT or GROUP BY if you do not have to.
  • Use indexes for sorting, row selection and offset retrieval. Create multiple indexes.
  • Avoid putting bulk data together with metadata.
  • Count and save. Reuse.

What would be my favorite front page query?

SELECT id,name,content,comment_count,
    group_concat(category) as categories
FROM posts
    LEFT JOIN postcategories USING (postid)
WHERE date< 'offset' AND status='published'
ORDER BY date DESC LIMIT 10

It would be absolutely efficient with (status,date) index, it would not need to scan comments table to get count, it would not have to scan whole posts table then group it or order it, and it would be faster than 3 slow queries on tables that may grow.

I had inner voice inside (from the hosting business past) – but MySQL has query cache! Yes, it has, but putting timestamps into queries eliminates possibilities of caching. And one should not rely on query caching, one should rely on data and index caching. This is where databases shine.

Final question: Will upgrade to WordPress 2.0 solve these issues? Won’t I be trapped in yet another featuritis?

db abstraction: zend framework vs mediawiki

So now as Zend have released their framework I couldn’t resist to look at it’s database classes (there’s much more other stuff in it) and compare with our own bits in mediawiki.

Thumbs up for (what is a must in abstraction layer) plain query building functions. Those are easy to understand and seem to work :)

One of issues with SQL programming is that people queries like text instead of ‘structured queries’. This is where Zend’s $db->quoteInto() is overused – even where there’s no need to people are forced to embed SQL bits, instead of thinking just about data flow and relations.

In order to add additional criteria in MW, you’d do:

$where[]=array('field'=>$value);

Zend Framework’s query builder offers slightly different approach. There’s Select object, which still eats text as parameters:

$select->where('field = ?',$value);

What raised my eyebrows – $select->from('foo', '*');. First of all, SELECT * is evil. On the other hand, it is interesting design decision to put it together with ‘FROM’. How would one issue ‘SELECT 1’ query this way? :)

The $db->insert() calls appeared to be identical in both libraries, you just pass an array of key=>value pairs and a table to put all that into. The $db->update though had same arguments, but different evaluations of those. For Zend you have to pass just a plaintext query in form of $db->quoteInto(‘field=’,$value). MW though allows to pass arrays of conditions:

$where = array ( 'group' => $groupid, 'parent' => array ( 1,2,3,4));

Of course you may pass a plain string to MW’s DB abstraction, type detection helps a lot there :)

Same differences applies to delete method. People used to tell, that you don’t need any query building once you’ve got placeholder replacement in SQL queries. Usually what is missed is that abstractions are not just for portability or security, but also for easier way to do things. If plain text is to be replaced into array passing, doing it completely would help more (I’m not advocating to split strings into arrays of chars though ;-) It still looks better when one appends an item to an array than string concatenation.

‘SELECT’ builder has much more differences. We packed handling of ORDER BY, GROUP BY, HAVING, LIMIT, etc into $options, while Zend provides separate methods for that. I immediately missed ‘SELECT … FOR UPDATE’ in documentation – this is one of most important options in proper web applications.

The OFFSET/LIMIT support (and even recommendation for paging) requires serious attitude readjustment. Scanning result sets _is_ expensive and numeric offsets shouldn’t be shown as a ‘nice way out’.

Lack of outer JOINs (as well as lack of workarounds in documentation) was surprising too.

Of course, providing table as an object was a nice thing to do, especially with possibility to add additional logic to a table object:

class RoundTable extends Zend_Db_Table {
    public function insert($data) {
        if (empty($data['created_on'])) { $data['created_on'] = time(); }
        return parent::insert($data); }

It is nice to see such library pushed – as it is in it’s 0.1.1 state, there’s lots of room for improvement in future :)

There’re other libraries out there, but those are undocumented, unpackaged and not that well maintained. And of course, there’s Ruby On Rails, which was poorly mimicked in Zend’s Active Record implementation. I don’t like ORM myself, but with hype around it, Zend could try to make their ORM better (or rather, existing, as there’re no relations in their objects). Even Django looks much better in this field. So now I’m just waiting for 1.0 ;-)

printing envelopes in macosx

I had to print address labels on an envelope (people still need to use snail mail sometimes). I used to be spoiled by MS Office, so I started OpenOffice and tried to do something there. I failed:

  • It didn’t print “landscape” whatever I tried to do, didn’t save settings, barf.
  • It complained that locale “” did not exist, so I couldn’t have my street name spelled correctly.
  • Ah, sure, it crashed several times, it’s 2.0 experimental Mac Intel build, so I guess it must do that, in order to show how difficult QA is. :)

So I opened simple text editor and tried to do it from there. I failed too, because didn’t understand how to transform simple text files into rotated envelope print. It was either rescaled or simply wrong somewhere else. That counts as failure too. I tried to do that from TextMate too, but it didn’t like my tries either.

So I searched the web for hints. It wasn’t a real search, because first title of first hit did tell me exactly what to do: Address Book. The solution is simple, select recipient, print. Sometimes I still am too power-oriented :)

popping a stack in mysql

Warning: this post describes practices that rely on side effects and may provide unstable results.

I was asked how to treat a table in MySQL as a stack and pop oldest entry from it. Of course there is always traditional approach – you lock table, fetch oldest record, delete it, commit. It isn’t that fun, though. This is what I came up to (and had multiple corrections):

mysql> delete from jobs where (@id:=jobs_id) limit 1; select @id;
Query OK, 1 row affected (0.00 sec)

+------+
| @id  |
+------+
| 3    |
+------+
1 row in set (0.00 sec)

mysql> delete from jobs where (@id:=jobs_id) limit 1; select @id;
Query OK, 1 row affected (0.00 sec)

+------+
| @id  |
+------+
| 4    |
+------+
1 row in set (0.00 sec)

In case of transactions used, you’d need to commit as fast as you can, as DELETE would acquire locks (but that’s the price of ACID ;-) – concurrent pops are not that easy.

Appending to stack is easy though – just.. INSERT. If you do not want transactions, you would still want InnoDB and that is where side issues kick in.

InnoDB stores all records clustered together with primary key, so table scans also happen in order of PK. If you would apply ORDER BY, there would be no guarantee that MySQL would not do filesort (if it would always pick indexes, this method would be even more awesome ;-). In case of filesort, last @a value would be set before actually sorting the data set, so a different value would be returned than deleted.

MyISAM doesn’t care at all about order, unless you specify it to. Doing the pop trick without ORDER BY would treat the stack just as pool of records and would delete any single row (and return it of course). Adding ORDER BY would set @values again to pre-filesort state.

That reminds me another trick (has been floating around for a while), that allows atomic incrementing of a field and fetching a value:

mysql> update counters set counter_value=(@val:=counter_value+1); select @val;
    -> update counters set counter_value=(@val:=counter_value+1); select @val;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
+------+
| @val |
+------+
| 562  |
+------+
1 row in set (0.00 sec)
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
+------+
| @val |
+------+
| 563  |
+------+
1 row in set (0.00 sec)

Yet another warning: As it’s modifying same row, only one transaction would be able to edit the record. So this should be done either not in transaction, or on HEAP/MyISAM/Blackhole engines.

so how does mysql generate revenue?

In order to talk in UC2006 I have to somehow enter United States. Being from East means you are treated as immigrant, unless you prove you are not. In order to prove that at consulate you have to bring income tax declarations, bank statements, lists of financial transactions and job approvals.

Oddly, for me questions at the consulate were:

  • How does MySQL AB generate revenue?
  • What is special about MySQL’s products?

Answering those questions seemed to satisfy Uncle Sam. Later I wondered, did they really know or verify, what MySQL is? Asking such questions did seem quite… educated!

So now with my fingerprints stored in databases I’ve got green light to visit US and talk about our experiences there :)