Tim O’Reilly wanted to hear Amazon S3 success stories and numbers. In short, Amazon S3 lets organizations outsource their storage, and they say one can save lots of money that way. I wondered how that applies to the Wikipedia environment and did some calculations.
The cost structure is very simple: $0.15 per gigabyte stored per month and $0.20 per gigabyte transferred. It is not an easy exercise to convert that into the costs one already has in one’s head, so it took a bit of work before I could produce any summaries or conclusions.
Some numbers may differ slightly from the actual Wikimedia operation. I’m not sure how much of our cost structure I can disclose, so I’ll just use some sane or widely known figures. Here’s what we get…
So, the idea was to offload media storage and serving to S3. Our cluster averages 2Gbps of traffic, with peaks at 3Gbps; say 50% of that is images. That means we have to handle image traffic peaking at 1.5Gbps if we maintain our own systems, or serve about 10TB of data daily (1Gbps on average) via Amazon S3. At $0.20 per gigabyte, that is roughly 300TB a month and a $60000 monthly bill for S3 traffic; add a few hundred dollars for storage (we have just a few terabytes of data) and the total is about $60300 per month.
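The S3 arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope check, not an official pricing calculator: the function names are made up for illustration, a 30-day month and decimal units (1TB = 1000GB) are assumed, and the 2TB of stored media is a guess consistent with "a few terabytes" and a storage bill of a few hundred dollars.

```python
# Prices quoted in the post (USD).
STORAGE_PER_GB_MONTH = 0.15
TRANSFER_PER_GB = 0.20


def gbps_to_tb_per_day(gbps):
    """Convert a sustained rate in Gbps to terabytes per day (decimal units)."""
    bytes_per_second = gbps * 1e9 / 8
    return bytes_per_second * 86400 / 1e12


def s3_monthly_bill(daily_transfer_tb, stored_tb, days=30):
    """Monthly S3 bill for a given daily transfer volume and stored volume."""
    transfer_gb = daily_transfer_tb * 1000 * days
    stored_gb = stored_tb * 1000
    return transfer_gb * TRANSFER_PER_GB + stored_gb * STORAGE_PER_GB_MONTH


# Average cluster traffic is 2Gbps and about half of it is images.
average_image_gbps = 2 * 0.5
daily_tb = gbps_to_tb_per_day(average_image_gbps)

# The post rounds the daily volume down to 10TB; 2TB stored is an assumption.
bill = s3_monthly_bill(daily_transfer_tb=10, stored_tb=2)

print(f"Image traffic: {daily_tb:.1f} TB/day")
print(f"S3 monthly bill: ${bill:,.0f}")
```

With the unrounded daily volume the transfer part of the bill comes out somewhat higher, so the $60000 figure is, if anything, slightly kind to S3.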
Calculating Wikimedia’s costs is more complicated. We have to take storage servers into account ($30000), add a bunch of cache servers ($100000) and some mediocre routing gear ($30000), pay $5000 a month for racks and electricity, and of course get 1.5Gbps of bandwidth: at $15/Mbps (Cogent may offer $10 and a free iPod!) that is an additional $22500 a month. With hardware costs distributed over 12 months, the bill totals about $41000 per month.
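The same estimate for running our own gear, again as a rough Python sketch. The figures are the ones quoted above; the variable and function names are illustrative only, and 1.5Gbps is taken as 1500Mbps.

```python
# One-off hardware purchases (USD), spread over 12 months as in the post.
HARDWARE = {
    "storage servers": 30000,
    "cache servers": 100000,
    "routing gear": 30000,
}
AMORTIZATION_MONTHS = 12

RACKS_AND_POWER_PER_MONTH = 5000
PEAK_IMAGE_MBPS = 1500      # 1.5Gbps of image traffic at peak
PRICE_PER_MBPS = 15         # USD per Mbps per month


def self_hosted_monthly_bill(price_per_mbps=PRICE_PER_MBPS):
    """Monthly cost: amortized hardware plus racks, power and bandwidth."""
    hardware = sum(HARDWARE.values()) / AMORTIZATION_MONTHS
    bandwidth = PEAK_IMAGE_MBPS * price_per_mbps
    return hardware + RACKS_AND_POWER_PER_MONTH + bandwidth


total = self_hosted_monthly_bill()
print(f"Self-hosted monthly bill: ${total:,.0f}")

# The cheaper transit offer mentioned in passing ($10/Mbps).
print(f"At $10/Mbps: ${self_hosted_monthly_bill(10):,.0f}")
```

Bandwidth is sized for the peak rate here, while S3 bills for the volume actually transferred, so the two models are not charged on the same basis.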
Some of the hardware is of course shared with other tasks, and some numbers may not be accurate, but the final number is still much lower. Additionally, we get a distributed CDN (faster response times!), efficient invalidation, our own statistics and dynamic access miss rules (404 handlers). Oh, and while Amazon serves 16000 requests per second at its total peak, we do over 30000 (that’s including text pages).
The biggest cost one pays for S3 appears to be bandwidth rather than storage, so if the traffic persists, having one’s own infrastructure seems to be cheaper. Unless I missed something.