Yesterday I wrote about how certain build optimizations can make a performance difference, and I decided to dig a bit deeper into a quite interesting field: profile-guided binary optimization. There are quite a few interesting projects out there, like LLVM (I hear it is used extensively on the iPhone?), which analyze the run-time profile of compiled code and can make just-in-time adjustments to the binary. Apparently you don't need that fancy technology; plain old gcc will do.
The whole plan is (a concrete command sketch follows the list):
- Compile all code with -fprofile-generate in {C|CXX|LD}FLAGS
- Start the resulting instrumented binary
- Run your application/benchmark against that binary
- Recompile all code with -fprofile-use (the steps above will have scattered lots of .gcda files around the source tree)
- PROFIT!!! (note the omission of the “???” step)
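For the record, on a fairly standard make-driven build the cycle looks something like this. This is a minimal sketch: the binary name, the benchmark invocation and the way flags are passed are placeholders, so adjust them to however your build actually wires CFLAGS/CXXFLAGS/LDFLAGS through.

    # 1. build instrumented binaries; the flag has to reach both compile and link steps
    make CFLAGS="-O2 -fprofile-generate" CXXFLAGS="-O2 -fprofile-generate" \
         LDFLAGS="-fprofile-generate"
    # 2./3. start the instrumented binary, drive a representative workload at it,
    #       then shut it down cleanly so the counters get flushed out as *.gcda files
    ./mysqld &
    ./run-benchmark          # placeholder for your sysbench (or whatever) run
    mysqladmin shutdown
    # 4. force a rebuild of the objects (leave the .gcda files alone!) so gcc
    #    can pick the profile data up
    make clean
    make CFLAGS="-O2 -fprofile-use" CXXFLAGS="-O2 -fprofile-use" \
         LDFLAGS="-fprofile-use"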
How much profit? I measured a ~7% sysbench performance increase (and would probably see a much bigger gain in CPU-bound benchmarks). YMMV. Can such PGO be useful for every user out there? Maybe, but the best results come from looking at actual usage patterns, though of course lots of those patterns are similar everywhere.
Also, I am painting the actual profiling process in rosier colors than it deserves. Apparently gcc/gcov profiles tend to get corrupted in multithreaded applications, so I had to do multiple profile/build passes until I managed to assemble the final binary. :-)
Now I have to figure out how to use the -combine flag in gcc and treat the whole MySQL codebase as one huge .c file (apparently compilers can make much, much better decisions then).
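If I read the docs right, -combine just means handing every translation unit to a single gcc invocation, usually together with -fwhole-program. A sketch (file names made up, and as far as I can tell this only works for C sources, which may be a problem for MySQL):

    # hand all translation units to one gcc invocation so it can optimize across
    # file boundaries; -fwhole-program additionally lets it assume nothing outside
    # these files calls into them
    gcc -O2 -combine -fwhole-program a.c b.c c.c -o prog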
domas, I’m afraid this will totally trash the -fno-omit-frame-pointer you praised so much in the previous post :) PGO is also able to inline cross-module, so the call stacks in hot places could be a big surprise too
Vlad, nooooeessss! does it really inline cross-module? I wonder how that works (and if it does). On the other hand, frame pointers are still there… ;-)
Sun Studio C/C++ also supports PGO, as apparently do Intel’s compilers and various others.
http://technopark02.blogspot.com/2005/07/sun-studio-cc-profile-feedback.html
A certain commercial DBMS vendor has tried profile-guided optimizations and has never shipped optimized binaries. Apparently +7% on one workload turns into -7% on another.
Which is not really surprising.
When reading the blog post I was thinking exactly that: to make use of it, every user would have to profile *his* application under his typical load. When optimizing for everybody, the speedup will be much smaller, if there is any at all.
Domas,
the above was merely speculation :)
I only know how the Microsoft compiler does PGO, and there the code is generated and optimized by the linker rather than the compiler. Dunno if GCC uses similar technology.
How cross-module inlining works for the Microsoft compiler: the key is that executable code is generated at link time rather than at compile time.
The compiler still produces *.obj files, but those are the output of the compiler frontend in some internal form, not executable code. The linker acts as the compiler backend for all these .obj files. Since the linker has enough information about module interdependencies, and with profiles it also knows the hot places, it can inline a hot function into another module.
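Roughly, the build sequence is something like the sketch below; I am writing the flag spellings from memory, so treat them as approximate:

    rem compile to the intermediate form, no native code yet
    cl /O2 /GL /c module1.cpp module2.cpp
    rem link with PGO instrumentation, then run a training workload,
    rem which drops the profile (.pgc) files
    link /LTCG:PGINSTRUMENT module1.obj module2.obj /OUT:app.exe
    app.exe
    rem relink: the linker does the real code generation, using the profiles
    rem to decide things like cross-module inlining
    link /LTCG:PGOPTIMIZE module1.obj module2.obj /OUT:app.exe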
This sounds somewhat similar to the -combine flag for GCC you mentioned. During code generation GCC would know about whole-program behavior and could actually make “cross-module” inlining decisions.