Rant on search crawlers

This isn’t even remotely funny. Every major search crawler provides different Accept-Encoding headers that make it bypass cache and always hit the backend. It is easy to hack Squid to disregard spaces between options (as IE puts them in headers: gzip, deflate, and Mozilla does not: gzip,deflate), but some of these things make caching hell:

msnbot: Accept-Encoding: identity;q=1.0
googlebot: Accept-Encoding: gzip
yahoo (slurp): Accept-Encoding: gzip, x-gzip

Add Opera with it’s Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0 and KHTML with Accept-Encoding: x-gzip, x-deflate, gzip, deflate, and you get a hell where bold normalization solutions have to be applied. I guess we just have to treat it as single-bit ‘gzip’ and ‘plain’ difference, and screw everything else.

Update: squid patch :)

	markcallaghan (@mark… on MySQL does not need SQL
	markcallaghan (@mark… on MySQL does not need SQL
	Domas Mituzas on MySQL does not need SQL
	Marc on MySQL does not need SQL
	Nils Meyer on linux memory management for…

Related