WordPress is no doubt one of most popular web packages (besides MediaWiki, ha ha ha) out there. It is easy to install, it is easy to use. And it doesn’t run huge sites, it is not designed to. Anyway, even as it is not designed to, I’m running it for my blog, an old version of course (2.0 is scary for me).
I have general query log enabled (for now it’s easier to truncate it, than to disable it on running server), so of course I spot all sorts of funny queries running on my system (people hit the site when I want to test some stuff…). So now I looked at what WordPress does in order to greet my fellow visitors. Some of my comments might be rubbish, as I did not do any profiling on this, but some of them might provide insights for anyone developing web application (or at least blog).
First of all, it wants to know who are privileged users on the site. I don’t know why, but it does this:
SELECT * FROM wp_users WHERE user_level > 0
Of course, no index on user_level, so even if you do not have any privileged users, if you have enabled registration (or even force people to register to post comments, eh), you will have server scanning whole table – there’s no index on user_level. You might want to add it.
Next step: options.
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'
Here it reads all options one would need. It’s fun – there’re 60 options that are autoloaded, and their values weight 500 bytes in total. The ones that are not autoloaded (hey, no index here either) – 11 rows, 150k of data. Even more – it’s just 4 rss feeds serialized and put into database. There are multiple lessons here:
- No need to put options in database when you can have it serialized on disk.
- No need to put bulk RSS feed data together with lots of small options – use other table for that.
- Here again, even if putting bulk data together, use indexes for getting proper subset. Anyway, above points are somewhat more important
Further: categories
SELECT * FROM wp_categories
Um, small dataset, let it live. Of course, one could build self-join to have a recursive tree. Doesn’t matter. Sorry for noise!
Fun begins. Posts!!!
SELECT DISTINCT * FROM wp_posts WHERE 1=1 AND post_date_gmt < = '2006-03-25 19:15:59' AND (post_status = "publish" OR post_author = 2 AND post_status != 'draft' AND post_status != 'static') GROUP BY wp_posts.ID ORDER BY post_date DESC LIMIT 0, 10
First of all, we select DISTINCT * (this table has primary key, it can’t have dupe rows), then we GROUP BY by unique ID (which again doesn’t make sense), then we eliminate all possible order by index situations, oh wait, there’s no index on post_date. I’d really like if it just used “publish” status for front page and would not care about anything else. This is publishing system with workflows! :)
Mapping categories:
SELECT DISTINCT post_id, cat_ID FROM wp_categories, wp_post2cat WHERE category_id = cat_ID AND post_id IN (31,30,29,28,27,26,25,24,23,22)
It could be done by… joining with above wp_posts query, scanning an index, then using GROUP BY and GROUP_CONCAT() for category names. Or even having that denormalized (aggregated) inside posts table, but that’s already not that nice. Forget it :)
Counting comments… I know this is always very important – everyone needs to know exact count:
SELECT ID, COUNT( comment_ID ) AS ccount FROM wp_posts LEFT JOIN wp_comments ON ( comment_post_ID = ID AND comment_approved = '1') WHERE ID IN (31,30,29,28,27,26,25,24,23,22) GROUP BY ID
What happens more often, reading comment count or adding a comment? If later, skip this section, otherwise – put comment count in posts table and increment it once comment is visible. Incrementing and decrementing is slightly cheaper than scanning a table after you’re flooded with hate/love messages.
Um, then… meta!
SELECT post_id, meta_key, meta_value FROM wp_postmeta
WHERE post_id IN(31,30,29,28,27,26,25,24,23,22)
ORDER BY post_id, meta_key
Maybe I’m not one of those semantic web folks, but I’d rather skip querying empty table, less IPC – faster site is! And I’ve seen multiple queries for every article displayed against this table. It’s empty! No need to check it again!
Dates.. months.. years… counts…
SELECT DISTINCT YEAR(post_date) AS `year`, MONTH(post_date) AS `month`, count(ID) as posts FROM wp_posts WHERE post_date < '2006-03-25 21:15:11' AND post_status = 'publish' GROUP BY YEAR(post_date), MONTH(post_date) ORDER BY post_date DESC
This is where aggregation may help. Once new month starts, statistics for old month (and even current one) can be written into separate table. Otherwise one would have to scan all posts, order them, sum and present that to you. Every time.
Here we can count articles once again:
SELECT COUNT(ID) FROM wp_posts WHERE 1=1 AND post_date_gmt < = '2006-03-25 19:15:59' AND (post_status = "publish" OR post_author = 2 AND post_status != 'draft' AND post_status != 'static')
Some of above comments apply :)
Static pages… This is content management system – it has to allow to create pages! For that we:
SELECT * FROM wp_posts WHERE post_status = 'static' ORDER BY post_title ASC
So, for showing links to internal content (ten or twenty bytes on your html) it sends all text of all internal content to application. Well, WordPress keeps all text in posts table, so… * includes that. Of course, there’s no composite index on (status,title), so a traditional order by happens. Of course, no filesort optimizations apply (as rows are too wide to add them to filesort buffer), so table hits multiple times during query.
Afterwards it has some of ‘links’ magic, which I skipped. I don’t have jumbo blogrolls (though I’d love jumbo rolls in sushi bar :), so those queries look evil, but cause no harm :) Eventually there’s some additional counting:
SELECT DISTINCT YEAR(post_date) AS `year`, MONTH(post_date) AS `month`, count(ID) as posts FROM wp_posts WHERE post_date < '2006-03-25 21:15:11' AND post_status = 'publish' GROUP BY YEAR(post_date), MONTH(post_date) ORDER BY post_date DESC
Um, not again… dejavu? :)
And then…
Quit \o/
This was just a look at load of single page. There’re more various issues once other types of pageviews are done.
Anyway, major issues:
- Do not use DISTINCT or GROUP BY if you do not have to.
- Use indexes for sorting, row selection and offset retrieval. Create multiple indexes.
- Avoid putting bulk data together with metadata.
- Count and save. Reuse.
What would be my favorite front page query?
SELECT id,name,content,comment_count, group_concat(category) as categories FROM posts LEFT JOIN postcategories USING (postid) WHERE date< 'offset' AND status='published' ORDER BY date DESC LIMIT 10
It would be absolutely efficient with (status,date) index, it would not need to scan comments table to get count, it would not have to scan whole posts table then group it or order it, and it would be faster than 3 slow queries on tables that may grow.
I had inner voice inside (from the hosting business past) – but MySQL has query cache! Yes, it has, but putting timestamps into queries eliminates possibilities of caching. And one should not rely on query caching, one should rely on data and index caching. This is where databases shine.
Final question: Will upgrade to WordPress 2.0 solve these issues? Won’t I be trapped in yet another featuritis?
Awesome! :-D
Of course, we can’t be too hard on them, since their team doesn’t include a seasoned MySQL professional like yourself to advise (read: beat the living crap out of) people using inefficient queries.
I reckon some stuff are scaring, but encouraging denormalization (the count of comments :)… Has WordPress any page-caching possibilities ?
You contacted the author and offered to help them optimize their code, right?
Right??
I tried on #wordpress ;-) Nobody paid any attention, there were zillions of people. Right! :)
Pooly, if you looked at kernel internals or any other high-performance system, maintaining counters is much easier way than traversing lists, arrays, hashes or whatever else for obtaining the number. Same is with database. It’s possible to recalculate data from existing rows, but maintaining the counter and incrementing/decrementing (heh, transactions are short-lived anyway) is not that much of burden.
What I’m looking at here – most of issues are not designed to scale. If you have many comments, site is slower. If you have many posts, site is slower. If you have many users, site is slower. If you have many static pages, site is slower. Whatever you scale, it is slower. And databases are supposed to solve that :)
Hi,
thanks for the extra comment, I get the point !
(I then have some optimization to do on my website concerning last post and comments count ;)
Cheers !
Great feedback. It looks like your comments are being looked at here.
good job :)
Domas,
Excellent work, although you’ve just scratched the surface! I too would love to help make WordPress a better database citizen. Where’s the action taking place?
Libertus,
most of stuff can be raised at #wordpress in Freenode IRC network – devs are there, devs are friendly, once you poke them directly ;-)
Domas,
I’ve been indirectly poking for a while. I’m ready for some action. I’m sitting in #wordpress-dev right now. I’ll join #wordpress too. :)
Thanks for the great article. It seems that the wordpress devs are really taking your advice to heart :)
What would be my favorite front page query?
SELECT id,name,content,comment_count,group_concat(category) as categories
FROM posts LEFT JOIN postcategories USING (postid)
WHERE date
Doesn’t this have implications for posts in more than one category being returned more than once because of the JOIN? Sorry, my SQL skills are rusty, so I’m not sure… :)
John, group_concat() would put all category names into single comma separated field – easy to use,..
Nice, I’ve learned some database tricks from your entry and I’m sure I’d be using them on my projects.
Using group_concat would significally up the mysql version they are requiring since group_concat wasn’t adding until 4.1. Not that I think thats a bad thing but it could cause problems for everyone who is running a server on debian and other linux distro’s that upgrade things like mysql at an incredibily slow rate.
Hi Domas,
i read in the 2.1-Releasepost on wordpress.org that you worked on the SQL-results.
I am very interesting in the changes you made. In my opinion it is a great idea to do some more writing about it.
The biggest problem with wordpress-performance IMO are plugins. So if your changes/ideas will improve some plugins’ sql this will really benefit the community!
Thanks!
~Tobias
PS: Please make shure the wordpress-team links to your hopefully upcomming posts so that I dont miss them :)