“Know the stack” is an essential mantra in the web app world, but the most important component, the clients (aka browsers), is quite often forgotten. We have learnt to deal with IE bugs (MS definition: behavior to help web-masters create better content :), adapt various CSS fixes for different browsers, provide hints for robots (robots.txt must die…), etc. But a new breed of clients has shown up: desktop indexing software that decides to help with internet indexing too, and we did not know some of its behavior.
One search engine’s desktop search software decided to traverse a few meta links to RSS feeds on our pages. It encountered the &amp; escape sequence in the links (well, the HTML-standard way to write an ampersand everywhere, even in links), decided that this must be some error, did not unescape it, and followed the link as written. This resulted in a non-RSS page with a meta link to RSS embedded. All the usual web links “a=b&c=d” became “a=b&amp;c=d”, so the options were ignored, non-RSS versions of pages were served, newly generated RSS meta links (with all the &amp; spam) were embedded in them, and the desktop indexing software happily ended up in recursion & loop.
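To make the mechanism concrete, here is a minimal sketch of it in Python. Everything in it is an assumption for illustration: the `/page` path, the `feed` option, and the `render_feed_href` helper are hypothetical stand-ins for a site that echoes every received option back into its feed link, not our actual code. The parsed results in the comments assume a current Python, where only `&` separates query options.

```python
import html
from urllib.parse import urlsplit, parse_qsl, urlencode

def query_params(url):
    """Parse a URL's query string roughly the way a server would."""
    return dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))

def render_feed_href(params):
    """Hypothetical server: echo every received option into the feed link,
    then HTML-escape it, as a page template would."""
    return html.escape("/page?" + urlencode({**params, "feed": "rss"}))

href = render_feed_href({"a": "b"})          # '/page?a=b&amp;feed=rss'

# A well-behaved client unescapes the attribute value before following it.
correct = query_params(html.unescape(href))  # {'a': 'b', 'feed': 'rss'}

# The buggy client follows the raw text, so "feed" arrives as "amp;feed".
buggy = query_params(href)                   # {'a': 'b', 'amp;feed': 'rss'}

# The server never sees "feed", serves the non-RSS page, and writes a new
# feed link that carries the junk option along. Every round adds more.
seen = []
for _ in range(3):
    seen.append(href)
    href = render_feed_href(query_params(href))
```

Each entry in `seen` is a new, longer URL, so a crawler that only remembers visited URLs never notices it is going in circles.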
This resulted in an additional gigabit of traffic from users of the product. As it was the usual end-of-year decline in load, we didn’t feel it too much, but still, it raised awareness of the issue:
Never have infinite links, as there will be some product which might follow them infinitely.
Put that product on many PCs of regular users (oh yay, that’s the status quo) and a very nice unintentional DDoS happens.
Every link written on a website has to be normalized and canonicalized: unknown options stripped, known options ordered and filtered.
Especially if such a link is written on every page of the site. Then even standards-ignorant or buggy products will not end up doing crazy loops. And by the way, let’s welcome desktop indexing software to the stack of buggy clients we have to care about.
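The normalization rule above can be sketched roughly as follows. This is an illustration under assumptions, not the code we run: `KNOWN_OPTIONS` and `canonical_link` are hypothetical names, and the whitelist contents are made up. The idea is simply that only known options survive, always in the same order, so whatever junk a client sends, the link written back to the page is the same.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical whitelist of options the site understands, in canonical order.
KNOWN_OPTIONS = ("a", "c", "feed")

def canonical_link(url):
    """Return the URL with unknown options dropped and known ones ordered."""
    parts = urlsplit(url)
    received = dict(parse_qsl(parts.query))
    kept = [(name, received[name]) for name in KNOWN_OPTIONS if name in received]
    # Rebuild the URL without a fragment; an empty query leaves no trailing "?".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

# The mangled "amp;junk" option is unknown, so it is stripped; a and c are reordered.
clean = canonical_link("/page?c=d&amp;junk=1&a=b")
```

Because the function is idempotent, running it on its own output changes nothing, which is what breaks the loop: the set of links a page can emit becomes finite. The result still has to be HTML-escaped (for example with `html.escape`) when it is written into the page.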