Rant on search crawlers

This isn’t even remotely funny. Every major search crawler sends a different Accept-Encoding header, and since each distinct header value is treated as a separate cache variant, their requests miss the cache and always hit the backend. It is easy to hack Squid to disregard spaces between options (IE puts them in its headers: gzip, deflate, and Mozilla does not: gzip,deflate), but some of these make caching hell:

  • msnbot: Accept-Encoding: identity;q=1.0
  • googlebot: Accept-Encoding: gzip
  • yahoo (slurp): Accept-Encoding: gzip, x-gzip

Add Opera with its Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0 and KHTML with Accept-Encoding: x-gzip, x-deflate, gzip, deflate, and you get a hell where only bold normalization will do. I guess we just have to treat it as a single-bit difference, ‘gzip’ versus ‘plain’, and screw everything else.
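
To make the single-bit idea concrete, here is a rough sketch in Python. It is not the Squid patch, only an illustration of the normalization: the function name normalize_accept_encoding is made up for this example, and it assumes that x-gzip counts as gzip, that a q value of 0 means "not acceptable", and that everything else (deflate, identity, the * wildcard) can simply be ignored.

```python
def normalize_accept_encoding(header):
    """Collapse an Accept-Encoding header to 'gzip' or 'plain'.

    Hypothetical helper for illustration; not taken from Squid.
    """
    if not header:
        return "plain"
    for item in header.split(","):
        # Each item looks like "gzip" or "gzip;q=0.5"; spaces are irrelevant.
        parts = [p.strip().lower() for p in item.split(";")]
        if parts[0] not in ("gzip", "x-gzip"):
            continue  # assumption: only the gzip family matters
        q = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # treat an unparsable weight as a refusal
        if q > 0:
            return "gzip"
    return "plain"


if __name__ == "__main__":
    for ua_header in (
        "gzip, deflate",                            # IE
        "gzip,deflate",                             # Mozilla
        "identity;q=1.0",                           # msnbot
        "gzip, x-gzip",                             # yahoo (slurp)
        "deflate, gzip, x-gzip, identity, *;q=0",   # Opera
    ):
        print(ua_header, "->", normalize_accept_encoding(ua_header))
```

With something like this in front of the cache, every header listed above collapses to one of two cache keys, so the crawlers share cached objects with ordinary browsers instead of each getting their own variant.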

Update: squid patch :)
