wordpress: friendly look at query log

WordPress is no doubt one of the most popular web packages out there (besides MediaWiki, ha ha ha). It is easy to install and easy to use. And it doesn’t run huge sites – it is not designed to. Anyway, even though it is not designed to, I’m running it for my blog, an old version of course (2.0 is scary for me).

I have the general query log enabled (for now it’s easier to truncate it than to disable it on a running server), so of course I spot all sorts of funny queries running on my system (people hit the site when I want to test some stuff…). So now I looked at what WordPress does in order to greet my fellow visitors. Some of my comments might be rubbish, as I did not do any profiling on this, but some of them might provide insights for anyone developing a web application (or at least a blog).

First of all, it wants to know who the privileged users on the site are. I don’t know why, but it does this:

SELECT * FROM wp_users WHERE user_level > 0

Of course, there is no index on user_level, so even if you do not have any privileged users, if you have enabled registration (or even force people to register to post comments, eh), you will have the server scanning the whole table. You might want to add an index.
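
Something like this would do (assuming user_level stays a plain numeric column, as in the query above):

-- hypothetical fix: index the column used for the privileged-users check
ALTER TABLE wp_users ADD INDEX user_level (user_level);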

Next step: options.

SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'

Here it reads all the options one would need. It’s fun – there are 60 options that are autoloaded, and their values weigh 500 bytes in total. The ones that are not autoloaded (hey, no index here either) – 11 rows, 150k of data. Even more – it’s just 4 RSS feeds serialized and put into the database. There are multiple lessons here:

  • No need to put options in the database when you can have them serialized on disk.
  • No need to put bulk RSS feed data together with lots of small options – use another table for that.
  • Here again, even if putting bulk data together, use an index for getting the proper subset (a minimal sketch follows after this list). Anyway, the above points are somewhat more important.
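
For that last point, a minimal sketch (the index name is arbitrary):

-- hypothetical: let the autoload = 'yes' lookup use an index
ALTER TABLE wp_options ADD INDEX autoload (autoload);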

Further: categories

SELECT * FROM wp_categories

Um, small dataset, let it live. Of course, one could build a self-join to get a recursive tree. Doesn’t matter. Sorry for the noise!

Fun begins. Posts!!!

SELECT DISTINCT * FROM wp_posts
WHERE 1=1 AND
post_date_gmt <= '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2
AND post_status != 'draft' AND
    post_status != 'static')
GROUP BY wp_posts.ID
ORDER BY post_date DESC
LIMIT 0, 10

First of all, we select DISTINCT * (this table has a primary key, it can’t have dupe rows), then we GROUP BY the unique ID (which again doesn’t make sense), then we have killed any chance of using an index for the ordering – oh wait, there’s no index on post_date anyway. I’d really like it if it just used the “publish” status for the front page and did not care about anything else. This is a publishing system with workflows! :)
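
A rough sketch of what I’d rather see here (assuming the front page only ever needs published posts; the index name is made up):

-- hypothetical index so the ORDER BY can be resolved without sorting
ALTER TABLE wp_posts ADD INDEX status_date (post_status, post_date);

-- no DISTINCT, no GROUP BY, no author special-casing on the front page
SELECT * FROM wp_posts
WHERE post_status = 'publish'
AND post_date_gmt <= '2006-03-25 19:15:59'
ORDER BY post_date DESC
LIMIT 10;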

Mapping categories:

 SELECT DISTINCT
post_id, cat_ID FROM wp_categories, wp_post2cat
WHERE category_id = cat_ID AND post_id IN
(31,30,29,28,27,26,25,24,23,22)

It could be done by… joining with the wp_posts query above, scanning an index, then using GROUP BY and GROUP_CONCAT() for category names (a sketch below). Or even having that denormalized (aggregated) inside the posts table, but that’s already not that nice. Forget it :)
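
The join version could look roughly like this (column names such as cat_name are from memory – treat them as placeholders):

-- hypothetical: fetch category names per post in one pass
SELECT p.ID, GROUP_CONCAT(c.cat_name) AS categories
FROM wp_posts p
JOIN wp_post2cat pc ON pc.post_id = p.ID
JOIN wp_categories c ON c.cat_ID = pc.category_id
WHERE p.ID IN (31,30,29,28,27,26,25,24,23,22)
GROUP BY p.ID;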

Counting comments… I know this is always very important – everyone needs to know the exact count:

SELECT ID, COUNT( comment_ID ) AS ccount
FROM wp_posts
LEFT JOIN wp_comments ON (
comment_post_ID = ID  AND
comment_approved =  '1')
WHERE ID IN (31,30,29,28,27,26,25,24,23,22)
GROUP BY ID

What happens more often, reading the comment count or adding a comment? If the latter, skip this section; otherwise – put a comment count in the posts table and increment it once a comment becomes visible. Incrementing and decrementing is slightly cheaper than scanning a table after you’re flooded with hate/love messages.
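
A hedged sketch of the counter approach (comment_count is treated as a hypothetical column to be added):

-- hypothetical denormalized counter
ALTER TABLE wp_posts ADD COLUMN comment_count INT UNSIGNED NOT NULL DEFAULT 0;

-- bump it when a comment becomes visible…
UPDATE wp_posts SET comment_count = comment_count + 1 WHERE ID = 31;

-- …and drop it when a comment is unapproved or deleted
UPDATE wp_posts SET comment_count = comment_count - 1 WHERE ID = 31;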

Um, then… meta!

SELECT post_id, meta_key, meta_value FROM wp_postmeta
WHERE post_id IN(31,30,29,28,27,26,25,24,23,22)
ORDER BY post_id, meta_key

Maybe I’m not one of those semantic web folks, but I’d rather skip querying an empty table – less IPC means a faster site! And I’ve seen multiple queries against this table for every article displayed. It’s empty! No need to check it again!

Dates.. months.. years… counts…

SELECT DISTINCT
   YEAR(post_date) AS `year`,
   MONTH(post_date) AS `month`,
   count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

This is where aggregation may help. Once a new month starts, statistics for the old month (and even the current one) can be written into a separate table. Otherwise one has to scan all posts, order them, sum them up and present that to you. Every time.
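
One possible shape for such an aggregate (all names made up):

-- hypothetical pre-aggregated archive counts, refreshed when a month closes
CREATE TABLE wp_archive_counts (
    `year`  SMALLINT NOT NULL,
    `month` TINYINT NOT NULL,
    posts   INT NOT NULL,
    PRIMARY KEY (`year`, `month`)
);

INSERT INTO wp_archive_counts (`year`, `month`, posts)
SELECT YEAR(post_date), MONTH(post_date), COUNT(*)
FROM wp_posts
WHERE post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ON DUPLICATE KEY UPDATE posts = VALUES(posts);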

Here we can count articles once again:

SELECT COUNT(ID) FROM wp_posts
WHERE 1=1 AND post_date_gmt <= '2006-03-25 19:15:59' AND
(post_status = "publish" OR post_author = 2 AND
post_status != 'draft' AND post_status != 'static')

Some of the above comments apply :)

Static pages… This is a content management system – it has to allow creating pages! For that we:

SELECT * FROM wp_posts WHERE post_status = 'static'
ORDER BY post_title ASC

So, for showing links to internal content (ten or twenty bytes in your HTML) it sends all the text of all internal content to the application. Well, WordPress keeps all text in the posts table, so… * includes that. Of course, there’s no composite index on (status, title), so a traditional ORDER BY happens. Of course, no filesort optimizations apply (as rows are too wide to fit into the filesort buffer), so the table is hit multiple times during the query.
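
A sketch of how this could be cheaper (index name arbitrary; post_title may need a prefix length if it is a text column, and post_name is assumed to be the slug):

-- hypothetical composite index so the sort comes straight from the index
ALTER TABLE wp_posts ADD INDEX status_title (post_status, post_title(64));

-- fetch only what the page list actually needs, not the full post text
SELECT ID, post_title, post_name FROM wp_posts
WHERE post_status = 'static'
ORDER BY post_title ASC;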

Afterwards it does some ‘links’ magic, which I skipped. I don’t have jumbo blogrolls (though I’d love jumbo rolls in a sushi bar :), so those queries look evil, but cause no harm :) Eventually there’s some additional counting:

SELECT DISTINCT
YEAR(post_date) AS `year`,
MONTH(post_date) AS `month`, count(ID) as posts
FROM wp_posts
WHERE post_date < '2006-03-25 21:15:11' AND
post_status = 'publish'
GROUP BY YEAR(post_date), MONTH(post_date)
ORDER BY post_date DESC

Um, not again… déjà vu? :)

And then…

 Quit \o/ 

This was just a look at the load of a single page. There are various other issues once other types of pageviews are handled.
Anyway, major issues:

  • Do not use DISTINCT or GROUP BY if you do not have to.
  • Use indexes for sorting, row selection and offset retrieval. Create multiple indexes.
  • Avoid putting bulk data together with metadata.
  • Count and save. Reuse.

What would be my favorite front page query?

SELECT id,name,content,comment_count,
    group_concat(category) as categories
FROM posts
    LEFT JOIN postcategories USING (postid)
WHERE date < 'offset' AND status='published'
ORDER BY date DESC LIMIT 10

It would be absolutely efficient with a (status, date) index, it would not need to scan the comments table to get a count, it would not have to scan the whole posts table and then group or order it, and it would be faster than 3 slow queries on tables that may grow.
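
The supporting index for that hypothetical schema would simply be:

-- hypothetical (status, date) index for the front page query above
ALTER TABLE posts ADD INDEX status_date (status, date);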

I had an inner voice (from my hosting business past) saying – but MySQL has a query cache! Yes, it has, but putting timestamps into queries eliminates any possibility of caching. And one should not rely on query caching anyway; one should rely on data and index caching. This is where databases shine.

Final question: will upgrading to WordPress 2.0 solve these issues? Or will I just be trapped in yet another case of featuritis?

memories as rabbits

This is one of the funniest memories I have from school. Seven years ago a classmate managed to produce this magnificent picture, where you can recognize everyone you’ve been with for years. Their faces, expressions, poses, gestures, traits and thoughts are all in here. Some day I’d really like to see more works by the artist :) Click on the thumbnail to get a bigger pic (and sure, easily identify me).

100 processes ought to be enough for anybody

I couldn’t resist reworking the famous phrase to portray the 100-processes-per-user limit in Mac OS X. One might ask why you would ever want to run 100 programs. You don’t – it’s enough that every running Dashboard widget is a process. Every terminal window has multiple processes attached, as does, of course, every running application (and they hide safely in the Dock even without windows). There are also various OS helper processes. At the moment I’m typing this, there are about 90 processes running on my system, and I’m not even doing any development work. There is a solution for that, hidden in the operating system internals, but what should a regular user do?

Imagine an Apple keynote:

  • Presenter: Here I’ll add a magnificent, amazing, superb Dashboard widget. Oh wait, it doesn’t add, something is wrong, haha, it happens sometimes, doesn’t it?
  • Voice from the crowd: There’s a nice undocumented file with a nice undocumented option that you can change!
  • Presenter: Yay, see, our system can do pretty much anything, even run more than 100 processes!

Solution:
Create an /etc/sysctl.conf file and add the following lines there:

kern.maxproc=2000
kern.maxprocperuid=1000

Then type the following commands (as root):

sysctl -w kern.maxproc=2000
sysctl -w kern.maxprocperuid=1000

Then you’d have to log out (or restart) and log back in. Poof, magic capabilities activated. Thanks, Steve!

db abstraction: zend framework vs mediawiki

So now that Zend has released their framework, I couldn’t resist looking at its database classes (there’s much more other stuff in it) and comparing them with our own bits in MediaWiki.

Thumbs up for plain query building functions (a must in an abstraction layer). Those are easy to understand and seem to work :)

One of the issues with SQL programming is that people treat queries like text instead of ‘structured queries’. This is where Zend’s $db->quoteInto() is overused – even where there’s no need to, people are forced to embed SQL bits, instead of thinking just about data flow and relations.

In order to add additional criteria in MW, you’d do:

$where[]=array('field'=>$value);

Zend Framework’s query builder offers a slightly different approach. There’s a Select object, which still eats text as parameters:

$select->where('field = ?',$value);

What raised my eyebrows – $select->from('foo', '*');. First of all, SELECT * is evil. On the other hand, it is an interesting design decision to put the column list together with ‘FROM’. How would one issue a ‘SELECT 1’ query this way? :)

The $db->insert() calls appeared to be identical in both libraries: you just pass an array of key=>value pairs and a table to put all that into. $db->update() though had the same arguments, but evaluated them differently. For Zend you have to pass just a plaintext condition in the form of $db->quoteInto('field = ?', $value). MW though allows passing arrays of conditions:

$where = array ( 'group' => $groupid, 'parent' => array ( 1,2,3,4));

Of course, you may pass a plain string to MW’s DB abstraction too – type detection helps a lot there :)

The same differences apply to the delete method. People used to say that you don’t need any query building once you’ve got placeholder replacement in SQL queries. What is usually missed is that abstractions are not just for portability or security, but also an easier way to do things. If plain text is to be replaced by array passing, doing it completely would help more (I’m not advocating splitting strings into arrays of chars though ;-). It still looks better to append an item to an array than to concatenate strings.

The ‘SELECT’ builder has many more differences. We packed handling of ORDER BY, GROUP BY, HAVING, LIMIT, etc. into $options, while Zend provides separate methods for that. I immediately missed ‘SELECT … FOR UPDATE’ in the documentation – this is one of the most important options in proper web applications.
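
For reference, the plain-SQL pattern I mean (table and column names made up; needs a transactional engine such as InnoDB):

-- hypothetical read-modify-write that needs the row locked to be safe
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
UPDATE accounts SET balance = balance - 10 WHERE id = 42;
COMMIT;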

The OFFSET/LIMIT support (and even the recommendation to use it for paging) requires serious attitude readjustment. Scanning result sets _is_ expensive, and numeric offsets shouldn’t be shown as a ‘nice way out’.
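
A sketch of what I mean (made-up table; the second query seeks from the last value seen on the previous page instead of skipping rows):

-- with a numeric offset the server reads and throws away 10000 rows
SELECT id, title FROM posts ORDER BY date DESC LIMIT 10000, 10;

-- seeking on an indexed column touches only the 10 rows it returns
SELECT id, title FROM posts
WHERE date < '2006-03-01 00:00:00'
ORDER BY date DESC
LIMIT 10;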

Lack of outer JOINs (as well as lack of workarounds in documentation) was surprising too.

Of course, providing a table as an object was a nice thing to do, especially with the possibility of adding additional logic to a table object:

class RoundTable extends Zend_Db_Table {
    public function insert($data) {
        // fill in the creation timestamp if the caller didn't set one
        if (empty($data['created_on'])) { $data['created_on'] = time(); }
        return parent::insert($data);
    }
}

It is nice to see such a library pushed – as it is in its 0.1.1 state, there’s lots of room for improvement in the future :)

There are other libraries out there, but those are undocumented, unpackaged and not that well maintained. And of course, there’s Ruby on Rails, which is poorly mimicked in Zend’s Active Record implementation. I don’t like ORM myself, but with the hype around it, Zend could try to make their ORM better (or rather, make it exist at all, as there are no relations in their objects). Even Django looks much better in this field. So now I’m just waiting for 1.0 ;-)

spread: bad example of open source

The Spread toolkit is one of those examples where an open source project would be better off not existing. It provides reliable multicast, it has APIs in multiple programming languages, and it can provide a message queueing facility you can run and forget. There’s even a MySQL Message API based on it – you can use sync and async messaging between a bunch of MySQL servers. Using Spread may give you lots of possibilities when deploying a distributed system.

At Wikipedia’s content cluster we could use lots of synchronization based on Spread, but…

3. All advertising materials (including web pages) mentioning features or use
   of this software, or software that uses this software, must display the following
   acknowledgment:

   "This product uses software developed by Spread Concepts LLC for use in the Spread toolkit.
    For more information about Spread see http://www.spread.org"

That would mean that if we used Spread somewhere in the cluster, we’d be showing ads for a university project on every page (or at least that is what ‘must display’ sounds like). Of course, as a university project, it might want some advertisement, but I think it would get far more of it without the viral advertisement clause – it is still the only framework of its kind out there.

An additional problem in such a situation is that, being half-free (or… adware), it half-fills the community’s need for a proper messaging toolkit. Starting a similar project when there’s Spread might not look attractive. Of course, there’s always a bunch of IRC servers – you’d find lots of systems’ messaging needs efficiently implemented there, just without reliability and guarantees. But probably the best way would be simply asking the Spread authors to release it under the GPL or any other proper open source license? :)