Quite often fast databases, super-duper backend caching layers and other fancy stuff don’t help if you don’t serve your customer right. Take Twitter, for example. The service gets lots and lots of clicks, with people following each other in endless loops and trees, and probably serves the occasional page-view too.
I noticed that every click seemed somewhat sluggish, so I started looking into it (sometimes this gets me a free lunch or so ;-)
Indeed, every click seemed to reload quite a bit of static content (like CSS and JavaScript from their ‘assets’ service). A pageview carrying actual information took about 2s to serve, but the static content delayed the page presentation by another three to six seconds.
Now, I can’t say Twitter didn’t try to optimize this. Their images are loaded from S3 and have decent caching (even though the datacenter is far away from Europe), but something they completely control and own, and which should cost them the least, ends up being the major slow-down.
What did they do right? They put timestamp markers into the URLs of all included JavaScript and stylesheet files, so it is really easy to switch to new files (those URLs are all dynamically generated by their application for every pageview).
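I don’t know what their implementation looks like; as a minimal sketch of the idea in Python (the host and paths here are made up):

```python
import os

ASSET_ROOT = "/var/www/assets"            # hypothetical document root for static files
ASSET_HOST = "http://assets.example.com"  # hypothetical assets host

def asset_url(path):
    """Return a cache-busting URL like /stylesheets/all.css?1237929664,
    where the number is the file's last-modified timestamp."""
    mtime = int(os.path.getmtime(os.path.join(ASSET_ROOT, path.lstrip("/"))))
    return "%s%s?%d" % (ASSET_HOST, path, mtime)

# e.g. asset_url("/stylesheets/all.css")
```

Because the stamp comes from the file itself rather than from the current request, every pageview gets the same URL until the file actually changes, which is what makes far-future caching headers worth anything in the first place.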
What did they do wrong? Let’s look at the response headers for the slow content:
```
Accept-Ranges: bytes
Cache-Control: max-age=315360000
Connection: close
Content-Encoding: gzip
Content-Length: 2385
Content-Type: text/css
Date: Wed, 25 Mar 2009 21:12:21 GMT
Expires: Sat, 23 Mar 2019 21:12:21 GMT
Last-Modified: Tue, 24 Mar 2009 21:21:04 GMT
Server: Apache
Vary: Accept-Encoding
```
It probably looks perfectly valid (expires in ten years, Cache-Control present), but…
- Cache-Control simply forgot to say this is “public” data (see the corrected headers after this list).
- ETag header could help too, especially if no ‘public’ is specified.
- Update: Different pages get different timestamp values for the included files – so all those caching headers don’t do much good ;-)
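For comparison, this is roughly what I’d expect those responses to carry instead; only Cache-Control changes and an ETag is added (the ETag value here is made up):

```
Accept-Ranges: bytes
Cache-Control: public, max-age=315360000
Connection: close
Content-Encoding: gzip
Content-Length: 2385
Content-Type: text/css
Date: Wed, 25 Mar 2009 21:12:21 GMT
ETag: "2385-49c94a60"
Expires: Sat, 23 Mar 2019 21:12:21 GMT
Last-Modified: Tue, 24 Mar 2009 21:21:04 GMT
Server: Apache
Vary: Accept-Encoding
```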
And of course, if those files were any closer to Europe (right now they seem to travel a long, long way from San Jose, California), I’d forgive the lack of keep-alive. Just serve those few files off a CDN, dammit.
They could at least use Amazon’s CloudFront as a CDN, which is easy to set up. Proper timestamps for assets would probably save them more than a lunch in data-transfer fees :)
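Sticking with the sketch above: serving the same stamped URLs from a CloudFront distribution that fronts the assets host is mostly a hostname change (the distribution domain below is, of course, made up):

```python
import os

ASSET_ROOT = "/var/www/assets"                # hypothetical document root
CDN_HOST = "http://d1234abcd.cloudfront.net"  # hypothetical CloudFront distribution domain

def cdn_asset_url(path):
    """Same mtime-stamped URL as before, just served from the CDN edge."""
    mtime = int(os.path.getmtime(os.path.join(ASSET_ROOT, path.lstrip("/"))))
    return "%s%s?%d" % (CDN_HOST, path, mtime)
```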
The culprit here must be the Vary header. Most browsers interpret it as “Cache-Control: must-revalidate” no matter what other headers you send along. In fact, I’ve just tested twitter.com on my Firefox 3.0.7 and the CSS files don’t even seem to be cached at all.
well, we use Vary at Wikipedia a lot, and it doesn’t seem to be causing any caching problems (as long as we handle the rest of the headers properly).
Including an ETag won’t do much of anything if they already have Expires and max-age set. With just an ETag set, anytime a repeat visitor hits a page with a static asset already cached, the browser will send an If-None-Match (INM) request and the server will respond with a 304. If you have just Expires or max-age set, the browser won’t even re-validate the object; it will serve it straight from the browser cache, saving some bandwidth.
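To make the difference concrete, here is a quick conditional-request check with Python’s http.client (the host and path are made up):

```python
import http.client

HOST, PATH = "assets.example.com", "/stylesheets/all.css"  # hypothetical asset

# First fetch: a full 200 response; remember the validator.
conn = http.client.HTTPConnection(HOST)
conn.request("GET", PATH)
first = conn.getresponse()
etag = first.getheader("ETag")
first.read()
conn.close()

if etag:
    # Revalidation still costs a round-trip, but the 304 carries no body.
    conn = http.client.HTTPConnection(HOST)
    conn.request("GET", PATH, headers={"If-None-Match": etag})
    print(conn.getresponse().status)  # expect 304 if nothing changed
else:
    print("no ETag; with Expires/max-age set, a fresh cache skips the request entirely")
```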
I don’t think adding “public” to Cache-Control would affect anything either, as that IIRC just makes HTTP-authenticated responses cacheable.
is that why you don’t have a twitter account? :D
Thank you for the expiration-date hint, but a lot of the traffic is produced via the API, not just the web interface. xD