Open transactions are bad

Two identical queries can have quite different execution times, even though the data layout, table format, everything… is exactly the same:

mysql> select count(*) from transtest;
| count(*) |
|    10000 |
1 row in set (0.00 sec)

mysql> select count(*) from transtest;
| count(*) |
|    10000 |
1 row in set (2.88 sec) <---- oh noes!!!

The reason for the difference is very simple, though often forgotten. Multiversioned (MVCC) tables may hold multiple versions (hence the name) of each row.
Even more, the engine will keep every row version that the oldest active transaction might still need, just in case it requests some snapshot from the past.
Think of it as a sort of time machine, just somewhat less efficient :)

If some old transaction is still running, heavily edited tables will accumulate many stored row versions.
In the case above, the second query had to read a million row versions to produce the count, while the first one was working in a clean environment.
Once the transaction holding the old versions was rolled back, the query took 0.00 sec again.

Databases (or DataACIDs) will take additional measures, often unexposed, to provide consistent data, so we may treat this as a feature, not a bug.

Conclusion: avoid long-running transactions, especially ones doing nothing. Oh well, avoid long-running transactions that hold locks too, but that is a different question already. If you spot anything sleeping – kill the slacker. At the application level. :-)
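
Spotting the slackers from the MySQL console is easy enough – a minimal sketch (the connection id 42 below is, of course, just an example):

```sql
-- Long-running connections in the "Sleep" state that still hold
-- an open transaction are the usual suspects.
SHOW FULL PROCESSLIST;

-- Kill the offender by the Id column from the output above.
KILL 42;
```

Better yet, make the application close or commit its transactions so it never comes to this.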

wordpress: friendly look at query log

WordPress is no doubt one of the most popular web packages (besides MediaWiki, ha ha ha) out there. It is easy to install, it is easy to use. And it doesn’t run huge sites – it is not designed to. Anyway, even though it is not designed to, I’m running it for my blog – an old version, of course (2.0 is scary for me).

I have the general query log enabled (for now it’s easier to truncate it than to disable it on a running server), so of course I spot all sorts of funny queries running on my system (people hit the site whenever I want to test some stuff…). So now I looked at what WordPress does in order to greet my fellow visitors. Some of my comments might be rubbish, as I did not do any profiling on this, but some of them might provide insights for anyone developing a web application (or at least a blog).

First of all, it wants to know who the privileged users on the site are. I don’t know why, but it does this:

SELECT * FROM wp_users WHERE user_level > 0

Of course, there is no index on user_level, so even if you do not have any privileged users – and especially if you have enabled registration (or even force people to register to post comments, eh) – the server will be scanning the whole table. You might want to add an index.
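
Adding the missing index is a one-liner (the index name here is my own):

```sql
ALTER TABLE wp_users ADD INDEX user_level_idx (user_level);
```

With it, the query touches only the handful of rows that actually have user_level > 0.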

Next step: options.

SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'

Here it reads all the options one would need. It’s fun – there are 60 options that are autoloaded, and their values weigh 500 bytes in total. The ones that are not autoloaded (hey, no index here either) – 11 rows, 150k of data. Even more – it’s just 4 RSS feeds serialized and put into the database. There are multiple lessons here:

  • No need to put options in database when you can have it serialized on disk.
  • No need to put bulk RSS feed data together with lots of small options – use other table for that.
  • Here again, even if putting bulk data together, use indexes for getting the proper subset. Anyway, the above points are somewhat more important.
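
A sketch of both fixes – the index is trivial, and the feed cache table is just my guess at a reasonable layout (no such table exists in WordPress):

```sql
-- Let the autoload query use an index instead of a full scan
ALTER TABLE wp_options ADD INDEX autoload_idx (autoload);

-- Keep bulky serialized feed data away from the small options
-- (hypothetical table)
CREATE TABLE wp_feed_cache (
    feed_url   VARCHAR(255) NOT NULL PRIMARY KEY,
    feed_data  MEDIUMTEXT   NOT NULL,
    fetched_at DATETIME     NOT NULL
);
```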

Further: categories

SELECT * FROM wp_categories

Um, small dataset, let it live. Of course, one could build a self-join to get a recursive tree. Doesn’t matter. Sorry for the noise!

Fun begins. Posts!!!

SELECT DISTINCT * FROM wp_posts WHERE
post_date_gmt <= '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2
AND post_status != 'draft' AND
    post_status != 'static')
GROUP BY wp_posts.ID
ORDER BY post_date DESC
LIMIT 0, 10

First of all, we select DISTINCT * (this table has a primary key, so it can’t have duplicate rows), then we GROUP BY the unique ID (which again makes no sense), then we eliminate all possible order-by-index situations – oh wait, there’s no index on post_date anyway. I’d really like it if the front page just used the “publish” status and did not care about anything else. This is a publishing system with workflows! :)
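
If the front page really only needs published posts, a composite index plus a much simpler query would do – a sketch, not what WordPress actually runs:

```sql
ALTER TABLE wp_posts ADD INDEX status_date_idx (post_status, post_date);

-- No DISTINCT, no GROUP BY; the index covers both the row
-- selection and the ORDER BY, so no filesort either.
SELECT * FROM wp_posts
WHERE post_status = 'publish'
ORDER BY post_date DESC
LIMIT 0, 10;
```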

Mapping categories:

SELECT post_id, cat_ID FROM wp_categories, wp_post2cat
WHERE category_id = cat_ID AND post_id IN (…)

It could be done by joining with the above wp_posts query, scanning an index, then using GROUP BY and GROUP_CONCAT() for the category names. Or even by having that denormalized (aggregated) inside the posts table, but that’s already not that nice. Forget it :)
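
The GROUP_CONCAT() variant I have in mind would look roughly like this (assuming wp_categories still carries its cat_name column):

```sql
-- One row per post, category names folded into a single column
SELECT post_id, GROUP_CONCAT(cat_name) AS categories
FROM wp_post2cat
JOIN wp_categories ON category_id = cat_ID
WHERE post_id IN (31,30,29,28,27,26,25,24,23,22)
GROUP BY post_id;
```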

Counting comments… I know this is always very important – everyone needs to know the exact count:

SELECT ID, COUNT( comment_ID ) AS ccount
FROM wp_posts
LEFT JOIN wp_comments ON (
comment_post_ID = ID  AND
comment_approved =  '1')
WHERE ID IN (31,30,29,28,27,26,25,24,23,22)

What happens more often – reading the comment count or adding a comment? If the latter, skip this section; otherwise – put the comment count in the posts table and increment it once a comment becomes visible. Incrementing and decrementing is slightly cheaper than scanning a table after you’re flooded with hate/love messages.
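
A sketch of that counter (the column name is my own invention here):

```sql
-- One-time: add the column and backfill it from existing comments
ALTER TABLE wp_posts ADD COLUMN comment_count INT NOT NULL DEFAULT 0;
UPDATE wp_posts AS p
SET comment_count = (
    SELECT COUNT(*) FROM wp_comments AS c
    WHERE c.comment_post_ID = p.ID AND c.comment_approved = '1'
);

-- Then, whenever a comment becomes visible on post 31:
UPDATE wp_posts SET comment_count = comment_count + 1 WHERE ID = 31;
```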

Um, then… meta!

SELECT post_id, meta_key, meta_value FROM wp_postmeta
WHERE post_id IN(31,30,29,28,27,26,25,24,23,22)
ORDER BY post_id, meta_key

Maybe I’m not one of those semantic web folks, but I’d rather skip querying an empty table – less IPC means a faster site! And I’ve seen multiple queries per displayed article against this table. It’s empty! No need to check it again!

Dates.. months.. years… counts…

SELECT
   YEAR(post_date) AS `year`,
   MONTH(post_date) AS `month`,
   count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

This is where aggregation may help. Once a new month starts, statistics for the old month (and even for the current one) can be written into a separate table. Otherwise one has to scan all the posts, order them, sum them up and present that to you. Every time.
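
A possible shape for such an archive table, refreshed from cron (names are mine – there is no such table in WordPress):

```sql
CREATE TABLE wp_post_counts (
    `year`  SMALLINT NOT NULL,
    `month` TINYINT  NOT NULL,
    posts   INT      NOT NULL,
    PRIMARY KEY (`year`, `month`)
);

-- Rebuild the counts; REPLACE keeps reruns idempotent
REPLACE INTO wp_post_counts
SELECT YEAR(post_date), MONTH(post_date), COUNT(ID)
FROM wp_posts
WHERE post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date);
```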

Here we can count articles once again:

WHERE 1=1 AND post_date_gmt <= '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2 AND
post_status != 'draft' AND post_status != 'static')

Some of the above comments apply :)

Static pages… This is a content management system – it has to allow creating pages! For that we:

SELECT * FROM wp_posts WHERE post_status = 'static'
ORDER BY post_title ASC

So, for showing links to internal content (ten or twenty bytes in your HTML) it sends all the text of all internal content to the application. Well, WordPress keeps the full post text in the posts table, so… * includes that. Of course, there’s no composite index on (status, title), so a traditional ORDER BY happens. And no filesort optimizations apply (the rows are too wide to fit into the filesort buffer), so the table gets hit multiple times during the query.
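
Both problems are cheap to fix: add a composite index and stop asking for columns you won’t show. A sketch (post_title is a TEXT column in this schema, so the index needs a prefix length):

```sql
ALTER TABLE wp_posts
    ADD INDEX status_title_idx (post_status, post_title(64));

-- Only the bytes that actually end up in the HTML
SELECT ID, post_title FROM wp_posts
WHERE post_status = 'static'
ORDER BY post_title ASC;
```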

Afterwards it does some ‘links’ magic, which I skipped. I don’t have jumbo blogrolls (though I’d love jumbo rolls in a sushi bar :), so those queries look evil but cause no harm :) Eventually there’s some additional counting:

SELECT
YEAR(post_date) AS `year`,
MONTH(post_date) AS `month`, count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

Um, not again… déjà vu? :)

And then…

 Quit \o/ 

This was just a look at the load of a single page. There are more issues once other types of pageviews come in.
Anyway, the major issues:

  • Do not use DISTINCT or GROUP BY if you do not have to.
  • Use indexes for sorting, row selection and offset retrieval. Create multiple indexes.
  • Avoid putting bulk data together with metadata.
  • Count and save. Reuse.

What would be my favorite front page query?

SELECT id,name,content,comment_count,
    group_concat(category) as categories
FROM posts
    LEFT JOIN postcategories ON (postid = id)
WHERE date < 'offset' AND status = 'published'
GROUP BY id
ORDER BY date DESC
LIMIT 10

It would be absolutely efficient with a (status, date) index: it would not need to scan the comments table to get a count, it would not have to scan the whole posts table and then group or order it, and it would be faster than three slow queries against tables that may grow.
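
The index that dream query relies on is another one-liner (names as in my sketch schema):

```sql
ALTER TABLE posts ADD INDEX status_date_idx (status, `date`);
```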

I had an inner voice (from my hosting business past) saying – but MySQL has a query cache! Yes, it has, but putting timestamps into queries eliminates any possibility of caching. And one should not rely on query caching anyway – one should rely on data and index caching. This is where databases shine.

Final question: will an upgrade to WordPress 2.0 solve these issues? Or will I be trapped in yet another case of featuritis?
