Rant on search crawlers

This isn’t even remotely funny. Every major search crawler sends a different Accept-Encoding header, which makes its requests bypass the cache and always hit the backend. It is easy to hack Squid to disregard spaces between options (IE puts a space in the header: gzip, deflate, while Mozilla does not: gzip,deflate), but some of these values make caching hell:

  • msnbot: Accept-Encoding: identity;q=1.0
  • googlebot: Accept-Encoding: gzip
  • yahoo (slurp): Accept-Encoding: gzip, x-gzip

Add Opera with its Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0 and KHTML with Accept-Encoding: x-gzip, x-deflate, gzip, deflate, and you get a hell where bold normalization has to be applied. I guess we just have to treat it as a single-bit ‘gzip’ vs ‘plain’ difference and screw everything else; a sketch of that is below.
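
A minimal sketch in Python of that single-bit idea (not the actual Squid patch, just an illustration; I ignore q-value ranking apart from an explicit q=0 refusal):

    def normalize_accept_encoding(header: str) -> str:
        """Collapse an Accept-Encoding header into a cache key: 'gzip' or 'identity'."""
        for part in header.split(","):
            token = part.strip().lower()
            name, _, params = token.partition(";")
            if name.strip() not in ("gzip", "x-gzip"):
                continue
            q = 1.0                       # default quality if none is given
            for param in params.split(";"):
                key, _, value = param.strip().partition("=")
                if key == "q":
                    try:
                        q = float(value)
                    except ValueError:
                        pass              # malformed q-value, keep the default
            if q > 0:
                return "gzip"
        return "identity"

    # The whole crawler zoo collapses into two cache variants:
    assert normalize_accept_encoding("identity;q=1.0") == "identity"                        # msnbot
    assert normalize_accept_encoding("gzip") == "gzip"                                      # googlebot
    assert normalize_accept_encoding("gzip, x-gzip") == "gzip"                              # yahoo slurp
    assert normalize_accept_encoding("deflate, gzip, x-gzip, identity, *;q=0") == "gzip"    # opera

With a Vary: Accept-Encoding response, the cache then stores exactly two copies of each page instead of one per crawler.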

Update: squid patch :)

Smart software and dynamic links

“Know the stack” is a required mantra in the web app world, but the most important component – clients (aka browsers) – is quite often forgotten. We have learnt to deal with IE bugs (MS definition: behavior that helps webmasters create better content :), adapt various CSS fixes for different browsers, provide hints for robots (robots.txt must die…), etc. But a new breed of clients has shown up – desktop indexing software that has decided to help with internet indexing too – and we didn’t know some of its behavior.

One search engine’s desktop search software decided to traverse a few meta-links to RSS feeds on our pages. It encountered the &amp; escape sequence in links (well, the HTML-standard way to write an ampersand everywhere, even in links), decided that this must be some error, did not unescape it, and followed the link as written. The result was a non-RSS page with a meta link to RSS embedded in it. All the usual web links of the form “a=b&c=d” became “a=b&amp;c=d” – so the options were ignored, non-RSS versions of pages were served, meta links with newly generated RSS links (with all the &amp; spam) were embedded, and the desktop indexing software happily ended up in a recursive loop.
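
For the record, the fix on the client side is a one-liner: entity-decode the href scraped out of the HTML before following it. A tiny Python illustration (the URL is made up), showing what the buggy client fetched versus what it should have fetched:

    import html
    from urllib.parse import urlsplit, parse_qs

    # The href exactly as it appears in the page source (correctly escaped HTML).
    href = "http://example.org/feed.php?a=b&amp;c=d"

    correct = html.unescape(href)   # what a well-behaved client requests
    broken = href                   # what the buggy indexer actually requested

    print(parse_qs(urlsplit(correct).query))   # {'a': ['b'], 'c': ['d']}
    print(parse_qs(urlsplit(broken).query))    # {'a': ['b'], 'amp;c': ['d']} - option mangled, feed flag lost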

All of this resulted in an additional gigabit of traffic from users running the product. As it was the general end-of-year decline in load, we didn’t feel it too much, but it still raised awareness of the issue:

Never have infinite links, as there will be some product which might follow them infinitely.

Put that product on the PCs of many regular users (oh yay, that’s the status quo) and a very nice unintentional DDoS happens.

Every link written on the website has to be normalized and canonicalized: unknown options stripped, known options ordered and filtered.

Especially if such a link is written on every page of the site. Then even standards-ignorant or buggy products will not end up in crazy loops. And by the way, let’s welcome desktop indexing software to the stack of buggy clients we have to care about.
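
As for the normalization itself, here is a rough sketch of what I mean (Python, with a made-up whitelist of known options; the real list obviously depends on the application): drop anything we do not understand and always emit the surviving options in one fixed order, so two spellings of the same link can never multiply into an endless URL space.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # Hypothetical whitelist: the only options this application actually understands.
    KNOWN_OPTIONS = ("title", "action", "feed")

    def canonical_link(url: str) -> str:
        """Strip unknown options and emit known ones in a fixed order."""
        parts = urlsplit(url)
        options = dict(parse_qsl(parts.query))
        kept = [(name, options[name]) for name in KNOWN_OPTIONS if name in options]
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

    # Both spellings collapse to the same canonical form:
    print(canonical_link("http://example.org/index.php?feed=rss&title=Foo&utm_junk=1"))
    print(canonical_link("http://example.org/index.php?title=Foo&feed=rss"))
    # -> http://example.org/index.php?title=Foo&feed=rss  (both times)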

Google does encyclopedia: Knol

It is all still closed, but an announcement by the VP of Engineering tells us Google is launching its idea of an encyclopedia – and looking for people who can write authoritative articles. No word on licensing apart from “we want to disseminate it as widely as possible”, though the author-centric view is closer to what Citizendium wants to do than Wikipedia.

Ad revenue sharing poses many interesting questions, especially in a collaborative effort. As Wikipedia now provides page view statistics, Google (or knollers) may just work on top of the cream pages (by knowing search trends) and end up with very distorted overall content. For now it is closed, invite-only, so we can’t tell anything more. Time will show. It is good to know more organizations believe in aggregating and disseminating knowledge – it is Wikipedia’s mission, and it is nice to have partners. :-) Though of course, there might be some tension with the Search Quality team…

Optimization operator

I have introduced this to quite a few colleagues in the form of a question: “what is the optimization operator in C++/PHP/…?”
The answers varied a lot; people would come up with branch-prediction stuff (likely(), etc.) and many other ideas, though never the right one.
The answer was pretty straightforward. It works in quite a lot of programming languages:

//

Simply commenting out code optimizes things better than any other way. Go, try it.

Weird wit by Google translation technology

I was translating some document from German to English that had my surname in it.
It got translated to ‘Beesley’, and I immediately thought of Angela Beesley, chair of the Wikimedia Advisory Board. I started playing more and found that:

  • French ‘Domas Mituzas’ to English translates as ‘Anthere fall’
  • ‘Mituzas’ in German is ‘Schindler’ (Matthias?:)
  • Spanish ‘Domas Mituzas’ to English translates as ‘Anthere Anthere’ (every wikipedian has a bit of Florence inside :)
  • English to Portuguese renders me as “Domas Lessig” (I have creative commons t-shirt :)
  • English to Chinese is “florence 100,000”…

That’s what Web 3.0 is all about. Tampering with my personality. Who am I? :)

Citizendium revisited

Just spotted an amazing article on how Citizendium built a better infrastructure than Wikipedia’s. There are lots of fascinating details there, like…

They went with PostgreSQL for a number of reasons, including better scalability. PostgreSQL is an MVCC database. Unlike Wikipedia, Citizendium never has to lock the database for reads and writes. MySQL can do a lot of things quick and replicate them to slave servers, but PostgreSQL excels at complex functions and full features like JOINs and can do complicated categories and full text searches faster than Wikipedia.

If PG can function without locks, it must definitely be more scalable. InnoDB uses mutexes, spinlocks, etc. – and that internal locking can be a bottleneck in many cases. Additionally, if a row is updated, a lock on the record is acquired. It is still a question how PG maintains ACID without any locks; got to research that more.
I’m aware that MySQL isn’t the best at full-text search out there – but Wikipedia uses Lucene for full-text search, so it is somewhat strange to hear that the Citizendium platform is faster in that regard. And… I’m not sure where JOIN performance is really faster there – especially when we do lots of covering-index-based joins. Probably the key word there is ‘complex’, though I’m not sure what that means :-)
The first reason not to use MySQL was:

First, to be different from Wikipedia.

Indeed, I always support critical thinking! Though this one:

Finally, we felt from reading various mailing lists over mediawiki development that mediawiki was hitting the ceiling of the features MySQL can provide as a backend.

IIRC that came from a single post on a single mailing list, from someone who is not running the Wikipedia backend. Mhm.
Of course, their monthly traffic is equal to our single-minute traffic, so some views might differ…
