More than three years ago one prominent Wikipedia community member, who didn’t have enough time to work on our technology, but had enough of it to criticize it, had a problem. He was working for a super-computing lab that boasted one of fastest world’s clusters (IBM blue-something), and to relieve the stress of some of his hard work, was trying to solve a computer science mystery:
Given a dictionary, find words that have most anagrams. With few-hundred-thousand word dictionary, on 8-way machine this should complete in less than 12 hours.
The supercomputing lab inspired solution was something like this (I may be wrong, years pass though):
- Put all the words into an array
- Start generating letter sequences: aaaa, aaab, aaac, …
- For every generated word, traverse the array, find how many matches it hits
- Get sequences with most matches
See, the problem was not how to make this algorithm efficient, but how to parallelize it, or maybe rewrite in C (thats what they do in supercomputing labs). It took few minutes to write following code:
wordlist=open("/usr/share/dict/words") anagrams={} for line in wordlist: word=line.strip() sorted=list(word) sorted.sort() sorted="".join(sorted) if sorted not in anagrams: anagrams[sorted]={'count':1,'words':[word,]} else: anagrams[sorted]['count']+=1 anagrams[sorted]['words'].append(word)
(I just canonized words into base-anagram-form – sorted letters, and ran collision test)
It took maybe half a second to run it on my 1.2ghz ibook back then. The end of the story back then was that we get one critic less.
Now, why I remember this story today – Yasufumi Kinoshita just did exactly the same to us, MySQL users. Simply, brute force is not the answer, if you can spare the minute and rethink stuff. Thanks, for reminding this.
This is one of Jon Bentley’s “aha algorithms”, as found in “Programming pearls”.
I wrote some notes about that in 2001
http://www.perlmonks.org/?node_id=135345
Cheers
Giuseppe
In the end it all comes down to Algorithms 101 :)