On valgrind and tcmalloc

I already wrote about tcmalloc, and how it helped with memory fragmentation. This time had some experience with extended features – memory profiling and leak checker. With tcmalloc it is possible to get an overview as well as detailed reports of what areas of memory are allocated for what uses. Even more, it can detect and report any memory that leaked. Of course, valgrind does that too. With one slight difference:

  • valgrind slows down applications 20-40 times (my tests ran 4s instead of 14ms)
  • tcmalloc does not. Same 14ms.

I wrote some MySQL UDFs for profiling and heap checking management, so can extract per-thread single-test stuff. Will try to clean up and release. Would be shame not to.

LAMPS on steroids

I’m not sure if I’m the first coining in ‘LAMPS’ – scaled out LAMP environment with Squid in front, but it sounds cool. Squid is major component in content distribution systems, reducing the load from all the backend systems dramatically (especially with proper caching rules). We had various issues in past, where we used code nobody else seemed to be using – cache coordination, purges and of course, load.

Quite a few problems resulted in memory leaks, but one was particularly nasty: Squid processes under high load started leaking CPU cycles somewhere. After deploying profiling for squid we actually ended up seeing that the problem is inside libc. Once we started profiling libc, one of initial assumptions appeared to be true – our heap was awfully fragmented, slowing down malloc().

Here comes our steroids part: Google has developed a drop-in malloc replacement, tcmalloc, that is really efficient. Space efficient, cpu efficient, lock efficient. This is probably mostly used (and sophisticated) libc function, that was suffering performance issues not that many people wanted to actually tackle. The description sounded really nice, so we ended up using it for our suffering Squids.

The results were what we expected – awesome :) Now the nice part is that the library is optimized for multi-threaded applications, doing lots of allocations for small objects without too much of lock contention, and uses spinlocks for large allocations. MySQL exactly fits the definition, so just by using simple drop-in replacement you may achieve increased performance over standard libc implementations.

For any developers working on high-performance applications, Google performance tools provide easy ways to access information that was PITA to work on before. Another interesting toy they have is embedded http server providing run-time profiling info. I’m already wondering if we’d should combine that with our profiling framework. Yummy. Steroids.