Rant on search crawlers

This isn’t even remotely funny. Every major search crawler sends a different Accept-Encoding header, which makes its requests bypass the cache and always hit the backend. It is easy to hack Squid to disregard spaces between options (IE puts them in its headers: “gzip, deflate”, Mozilla does not: “gzip,deflate”), but some of these things make caching hell:

  • msnbot: Accept-Encoding: identity;q=1.0
  • googlebot: Accept-Encoding: gzip
  • yahoo (slurp): Accept-Encoding: gzip, x-gzip

Add Opera with its Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0 and KHTML with Accept-Encoding: x-gzip, x-deflate, gzip, deflate, and you get a hell where bold normalization solutions have to be applied. I guess we just have to treat it as a single-bit ‘gzip’ vs ‘plain’ difference, and screw everything else.
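For illustration, here is a minimal sketch (in Python, not the actual Squid code) of that single-bit normalization: whatever the client sent gets collapsed to either ‘gzip’ or ‘plain’ before it is used for caching decisions. The function name is made up, and q-values are deliberately ignored – that is the “screw everything else” part.

  # Minimal sketch of "single-bit" Accept-Encoding normalization.
  # Not the Squid patch itself -- just an illustration of the idea.
  def normalize_accept_encoding(header):
      """Collapse any Accept-Encoding header into 'gzip' or 'plain'."""
      if not header:
          return "plain"
      # split on commas, strip whitespace and any ";q=..." parameters
      encodings = {token.split(";")[0].strip().lower()
                   for token in header.split(",")}
      return "gzip" if {"gzip", "x-gzip"} & encodings else "plain"

  # msnbot:    "identity;q=1.0"                          -> plain
  # googlebot: "gzip"                                    -> gzip
  # slurp:     "gzip, x-gzip"                            -> gzip
  # Opera:     "deflate, gzip, x-gzip, identity, *;q=0"  -> gzip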

Update: squid patch :)

Smart software and dynamic links

“Know the stack” is a required mantra in the web app world, but the most important component – the clients (aka browsers) – is quite often forgotten. We have learnt to deal with IE bugs (MS definition: behavior to help web-masters create better content :), adapt various CSS fixes for different browsers, provide hints for robots (robots.txt must die…), etc. But a new breed of clients has shown up – desktop indexing software that decides to help with internet indexing too – and we didn’t know some of its behavior.

One search engine’s desktop searching software decided to traverse a few meta-links to RSS feeds on our pages. It encountered the &amp; escape sequence in links (well, the HTML-standard way to write an ampersand everywhere, even in links), decided that this must be some error, did not unescape it, and followed the link as-is. This resulted in a non-RSS page with a meta link to RSS embedded. All the usual web links “a=b&c=d” became “a=b&amp;c=d” – which ended up as options ignored, non-RSS versions of pages served, meta links with newly generated RSS links (with all the &amp; spam) embedded, and the desktop indexing software happily stuck in a recursive loop.

This resulted in an additional gigabit of traffic from users running the product. As it came during the general end-of-year decline in load, we didn’t feel it too much, but it still raised awareness of the issue:

Never have infinite links, as there will be some product which might follow them infinitely.

Put that product on many PCs of regular users (oh yay, that’s the status quo) and a very nice unintentional DDoS happens.

Every link written on a website has to be normalized and canonicalized: unknown options stripped, known options ordered and filtered.

Especially if such a link is written on every page of the site. Then even standards-ignorant or buggy products will not end up in crazy loops. And by the way, let’s welcome desktop indexing software to the stack of buggy clients we have to care about.
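A rough sketch of what such normalization could look like (Python, purely illustrative – the allow-list of known options is hypothetical): unknown parameters are dropped and known ones are emitted in a fixed order, so even a client that never unescapes &amp; keeps landing on the same canonical URL instead of spiralling into new ones.

  # Illustrative link canonicalization; KNOWN_PARAMS is a hypothetical allow-list.
  from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

  KNOWN_PARAMS = {"title", "feed", "action"}   # hypothetical

  def canonicalize(url):
      parts = urlsplit(url)
      # keep only known options, in a fixed (sorted) order
      query = [(k, v) for k, v in sorted(parse_qsl(parts.query))
               if k in KNOWN_PARAMS]
      return urlunsplit((parts.scheme, parts.netloc, parts.path,
                         urlencode(query), ""))

  # A link mangled by a client that never unescaped &amp; still collapses
  # back to the same canonical form, so the recursion stops:
  # canonicalize("http://example.org/index.php?feed=rss&amp;junk=1")
  #   -> "http://example.org/index.php?feed=rss"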

A perfect Christmas story

This made me laugh. Providers were fighting over who would give better ‘free SMS’ plans. Then my GSM provider decided that they had enough resources for an even better campaign – they offered to add 0.02 LTL to the account balance for every SMS received. So, smart kids saw a business opportunity – they started spamming SMS messages from one phone (free SMS!) to another (get paid for SMS received). Smarter kids started automating the process with their computers (though I didn’t see too many “how to use kannel” guides – most public solutions were using GUI tools and automated mouse movers :).

The best part is that the smartest kids immediately found ways to cash out the ‘GSM LTLs’ – by using ‘call to pay’ service providers, and getting 50% cash efficiency.

One GSM provider (Omnitel, a TeliaSonera company) reacted by establishing a daily SMS limit that works (only 6 LTL worth of SMSes per day), whereas the other provider (Tele2) established a limit that doesn’t (phones would get disconnected only the next day).

And of course, this brought down the GSM providers, or at least their SMS networks – at Christmas. Way to go, marketing people. Way to go.

For all the international people: 1 EUR = 3.4528 LTL

Edit: 0.02 LTL per SMS
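For the curious, a back-of-the-envelope calculation (Python, per phone, per day, under Omnitel’s cap) using only the figures mentioned above – purely illustrative:

  LTL_PER_SMS = 0.02        # credited per received SMS
  DAILY_CAP_LTL = 6.0       # Omnitel's daily limit
  CASHOUT_EFFICIENCY = 0.5  # via 'call to pay' services
  LTL_PER_EUR = 3.4528

  sms_per_day = DAILY_CAP_LTL / LTL_PER_SMS       # 300 messages
  cash_ltl = DAILY_CAP_LTL * CASHOUT_EFFICIENCY   # 3 LTL
  cash_eur = cash_ltl / LTL_PER_EUR               # ~0.87 EUR per day, per phone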

Google does encyclopedia: Knol

It is all still closed, but an announcement by the VP of Engineering tells us Google is launching their idea of an encyclopedia – and looking for people who can write authoritative articles. No word on licensing apart from “we want to disseminate it as widely as possible”, though the author-centric view is closer to what Citizendium does than to what Wikipedia wants to do.

Ad revenue sharing poses many interesting questions, especially in a collaborative effort. As Wikipedia now provides page view statistics, Google (or knollers) may just work on top of the cream pages (by knowing search trends), and end up with very distorted overall content. For now it is closed, invite-only, so we can’t tell anything more. Time will tell. It is good to know more organizations believe in aggregating and disseminating knowledge – it is Wikipedia’s mission, and it is nice to have partners. :-) Though of course, there might be some tensions with the Search Quality team…

On guts and I/O schedulers

Benchmarks and guts sometimes contradict each other. Like, a benchmark tells you that “the performance difference is not big”, but the guts tell otherwise (something like “OH YEAH URGHH”). I was wondering why some servers are much faster than others, and apparently different kernels had different I/O schedulers. Setting ‘deadline’ (the Ubuntu Server default) works miracles compared to ‘cfq’ (the Fedora, and probably Ubuntu standard kernel, default) on our traditional workload.
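For reference, a small sketch (Python, device name is an assumption) of checking and flipping the scheduler at runtime through sysfs; it needs root, and the change does not survive a reboot – for that, the elevator= kernel boot parameter is the usual route:

  DEV = "sda"                          # assumed device name
  path = "/sys/block/%s/queue/scheduler" % DEV

  print(open(path).read().strip())     # e.g. "noop anticipatory deadline [cfq]"

  with open(path, "w") as f:           # needs root; not persistent across reboots
      f.write("deadline")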

Now all we need is to show some numbers, to please the gut-based thinking (though it is always pleased anyway):

Deadline:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           4.72    0.00    7.95   18.18   69.15

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s
sda          0.00   0.10 91.30 31.30 3147.20 1796.00

    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
  1573.60   898.00    40.32     0.98    7.98   3.65  44.80

CFQ:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           4.65    0.00    7.62   38.26   49.48

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s
sda          0.00   0.10 141.26 38.86 4563.44 2571.03

    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
  2281.72  1285.51    39.61     7.61   42.52   5.38  96.98

Though the load slightly rises and drops, the await/svctm numbers are always better on deadline. The box runs a high-concurrency (multiple background InnoDB readers), high-volume (>3000 SELECT/s), read-only (aka slave) workload on a ~200 GB dataset, on top of a 6-disk RAID0 with write-behind cache. Whatever the next benchmarks say, my guts will still fanatically believe that deadline rocks.

Optimization operator

I have introduced this to quite a few colleagues in the form of a question: “what is the optimization operator in C++/PHP/…?”
The answers varied a lot; people would come up with branch-prediction stuff (likely(), etc.) and many other ideas, though never the right one.
The answer is pretty straightforward, and it works in quite a lot of programming languages:

//

Simply commenting out code optimizes things better than any other way. Go, try it.