catching top waits

Modern systems are complicated beasts with lots of interdependent activity between threads, programs and kernels. Figuring out some problems is nearly impossible without some mix of time machine and crystal ball that tells you exactly what happened.

Did your cgroups CPU limit inject a sleep in the middle of mmap_sem acquisition by ‘ps’? Is everyone waiting for a mutex that is held by someone who is waiting for a DNS response? Did you forget to lock your libnss.* libraries into memory and end up stalling in an unexpected place under memory pressure?

I’ve grabbed Brendan Gregg’s offcpu profiler and gutted it to the point where all it does is record the longest wait per stack trace, along with the timestamp of that longest wait.
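To make the idea concrete, here is a minimal Python sketch of the bookkeeping only. It is not the tool itself, and it does not trace anything: the event shape (`OffCpuEvent`), the names `top_waits` and `ranked`, and the choice to record the moment the wait started as "the timestamp" are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

# A stack trace is modelled as a tuple of frame names (hypothetical shape).
Stack = Tuple[str, ...]


class OffCpuEvent(NamedTuple):
    stack: Stack
    start_ns: int  # when the thread went off-CPU
    end_ns: int    # when it got back on-CPU


class TopWait(NamedTuple):
    duration_ns: int
    start_ns: int  # assumption: the timestamp kept is the start of the wait


def top_waits(events: Iterable[OffCpuEvent]) -> Dict[Stack, TopWait]:
    """Keep only the single longest wait seen for each stack trace."""
    longest: Dict[Stack, TopWait] = {}
    for ev in events:
        duration = ev.end_ns - ev.start_ns
        best = longest.get(ev.stack)
        if best is None or duration > best.duration_ns:
            longest[ev.stack] = TopWait(duration, ev.start_ns)
    return longest


def ranked(longest: Dict[Stack, TopWait]) -> List[Tuple[Stack, TopWait]]:
    """Order stacks by their longest wait, worst first."""
    return sorted(longest.items(), key=lambda kv: kv[1].duration_ns, reverse=True)
```

Keeping one maximum per stack instead of a running sum is what makes a single pathological stall stand out, and the timestamp lets you line that stall up against logs or other traces.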

Some debugging sessions that would’ve taken hours, days or weeks before are now a matter of a few minutes. It is still quite hacky, so I present it without too much additional polishing.

I do have plans in place to have a visualizing frontend for the output some day.
