Fine tuning of compiler options to increase application performance
Posted on : 21-03-2011 | By : Alexander Permyakov | In : Uncategorized
Performance is essential for video analytics applications, since the algorithms are usually computationally heavy and such systems are expected to work in near real time. On the one hand, performance can be increased by improving or changing the algorithms. This is the main route, since it can improve performance dramatically. On the other hand, a little more performance can be gained in a relatively simple way: by using a good compiler and tuning its options. Let's see how this works in real programs.
For the first example I used the LAME encoder (http://lame.sourceforge.net/). Why LAME? First, it is open source, so I can recompile it with different compilers and options. Second, its performance is simple to measure: it is the time required to re-encode an MP3 file. Third, it gives consistent, repeatable results, which makes it easier to see how different compiler options affect speed.
The tests were run under Windows on computers with the following CPUs:
Intel Pentium 4 3GHz
Intel Core 2 Duo 2.8 GHz
AMD Athlon II X4 635 2.9 GHz overclocked to 3.3 GHz
Intel Core i5 (2500) 3.3 GHz
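Compilation was done with Visual Studio 9 and GCC 4.5.1 (using MinGW).
Encoding time was measured 10 times, and the average value is shown in the tables.
As the 0.00% baseline I used safe options (-O3 -march=prescott -fomit-frame-pointer -mfpmath=sse) that will work on most modern AMD and Intel CPUs. In the Core 2 Duo and Athlon tables the 0.00% row uses that CPU's own -march value instead of prescott. Negative percentages mean less time than the baseline (faster); positive percentages mean more time (slower). Option -march=core2 may use SSSE3 instructions, so the resulting code may fail to run on AMD CPUs and on the Intel Pentium 4 family.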
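The measurement procedure described above (10 timed runs, the average reported, and each row expressed as a percentage of the baseline time) can be sketched in Python as follows. This is only an illustration, not the harness used for the tables: `average_time`, `percent_vs_baseline`, `encode_once`, and the example `lame` command line are hypothetical names introduced here.

```python
import subprocess
import time


def average_time(run, repeats=10):
    """Call run() `repeats` times and return the mean wall-clock time in seconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        total += time.perf_counter() - start
    return total / repeats


def percent_vs_baseline(seconds, baseline_seconds):
    """Relative time difference in percent; negative means faster than the baseline."""
    return (seconds - baseline_seconds) / baseline_seconds * 100.0


def encode_once(cmd):
    """Run one encoding job. `cmd` is an assumed command line,
    e.g. ["lame", "in.mp3", "out.mp3"]; adjust it to the build under test."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Two averages taken from the Pentium 4 table: the baseline options, and
    # the same options plus -ffast-math and profile-guided optimization.
    baseline = 14.328020
    tuned = 13.206153
    print(f"{percent_vs_baseline(tuned, baseline):.2f} %")  # prints "-7.83 %"
```

To time a real build, something like `average_time(lambda: encode_once(cmd))` would be called once per compiler configuration, with the same input file each time.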
Intel Pentium 4 3GHz
Compiler | Compiler options | Average time | Time vs. baseline |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math -profile-use | 13.206153 sec | -7.83 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -profile-use | 13.537400 sec | -5.52 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 13.999892 sec | -2.29 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 14.328020 sec | 0.00 % |
GCC4.5.1 | -O3 -fomit-frame-pointer -mfpmath=sse | 14.621770 sec | 2.05 % |
Visual_Studio_9 | /GS- /fp:fast /O2 | 14.646769 sec | 2.22 % |
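- Optimizing for the Prescott architecture (-march=prescott) gives a 2% speed increase.
- -ffast-math gives 2% more.
- Profile-guided optimization (listed as -profile-use in the tables; GCC's documented spelling of the flag is -fprofile-use) gives a 5% speed increase.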
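Profile-guided optimization is a two-pass build: compile with instrumentation, run a representative workload so the program records a profile, then recompile using that profile. The sketch below assembles those steps as GCC command lines with the documented -fprofile-generate and -fprofile-use flags. The source file name, output name, and training command are hypothetical placeholders, and this is not necessarily how the builds in the tables were produced.

```python
import subprocess

# Baseline options taken from the article.
BASE_FLAGS = ["-O3", "-march=prescott", "-fomit-frame-pointer", "-mfpmath=sse"]


def pgo_commands(sources, output, training_cmd, base_flags=BASE_FLAGS):
    """Return the three commands of a GCC profile-guided build, in order."""
    # Pass 1: instrumented build that writes profile data when it runs.
    instrumented = ["gcc", *base_flags, "-fprofile-generate", *sources, "-o", output]
    # Pass 2: rebuild the same sources, letting GCC read the recorded profile.
    optimized = ["gcc", *base_flags, "-fprofile-use", *sources, "-o", output]
    return [instrumented, list(training_cmd), optimized]


def run_pgo_build(sources, output, training_cmd):
    """Execute the instrumented build, the training run, and the final build."""
    for cmd in pgo_commands(sources, output, training_cmd):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; only prints the commands, does not run them.
    for cmd in pgo_commands(["encoder.c"], "encoder", ["./encoder", "sample.mp3", "out.mp3"]):
        print(" ".join(cmd))
```

The training run should resemble real usage, since the compiler optimizes for the paths the profile shows to be hot.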
Intel Core 2 Duo 2.8 GHz
Compiler | Compiler options | Average time | Time vs. baseline |
GCC4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math -profile-use | 7.818235 sec | -6.65 % |
GCC4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -profile-use | 7.824039 sec | -6.58 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math -profile-use | 7.893243 sec | -5.75 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -profile-use | 7.976644 sec | -4.75 % |
GCC4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math | 8.234858 sec | -1.67 % |
GCC4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse | 8.374867 sec | 0.00 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 8.415269 sec | 0.48 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 8.423270 sec | 0.58 % |
Visual_Studio_9 | /GS- /fp:fast /O2 | 8.814092 sec | 5.24 % |
GCC4.5.1 | -O3 -fomit-frame-pointer -mfpmath=sse | 9.224519 sec | 10.15 % |
- Optimizing for the Core 2 architecture (-march=core2) gives a 10% speed increase.
- -ffast-math gives only a 1-2% increase.
- Profile-guided optimization gives a 6% increase.
Intel Core i5 (2500) 3.3 GHz
GCC 4.5.1 has no dedicated -march option for Core i3, i5, and i7 CPUs. Option -march=core2 can be used for them.
Compiler | Compiler options | Average time | Time vs. baseline |
GCC4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math -profile-use | 4.059390 sec | -8.52 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math -profile-use | 4.093767 sec | -7.75 % |
GCC4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math | 4.156268 sec | -6.34 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 4.200015 sec | -5.35 % |
GCC4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -profile-use | 4.253143 sec | -4.15 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -profile-use | 4.321892 sec | -2.61 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 4.437519 sec | 0.00 % |
GCC4.5.1 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse | 4.468770 sec | 0.70 % |
Visual_Studio_9 | /GS- /fp:fast /O2 | 4.737522 sec | 6.76 % |
GCC4.5.1 | -O3 -fomit-frame-pointer -mfpmath=sse | 4.815647 sec | 8.52 % |
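- Optimizing for a specific architecture gives up to a 9% speed increase over the build with no -march option; on this CPU -march=core2 and -march=prescott perform about the same.
- -ffast-math gives a 6% increase.
- Profile-guided optimization gives a 4% increase.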
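The gain from -ffast-math comes with a caveat: among other relaxations, it lets the compiler reorder floating-point operations, and floating-point addition is not associative. The Python sketch below only illustrates that arithmetic fact with IEEE 754 doubles; it does not reproduce what GCC actually emits, and the helper names are made up for this example.

```python
def left_first(a, b, c):
    """Evaluate (a + b) + c."""
    return (a + b) + c


def right_first(a, b, c):
    """Evaluate a + (b + c)."""
    return a + (b + c)


if __name__ == "__main__":
    # Small rounding difference between the two groupings.
    print(left_first(0.1, 0.2, 0.3) == right_first(0.1, 0.2, 0.3))  # False

    # Large difference: 1.0 is lost when it is first added to -1e16.
    print(left_first(1e16, -1e16, 1.0))   # 1.0
    print(right_first(1e16, -1e16, 1.0))  # 0.0
```

For an audio encoder such differences are usually harmless, but code that relies on exact floating-point behavior should be checked before this option is enabled.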
AMD Athlon II X4 635 2.9 GHz overclocked to 3.3 GHz
Compiler | Compiler options | Average time | Time vs. baseline |
GCC4.5.1 | -O3 -march=amdfam10 -fomit-frame-pointer -mfpmath=sse -ffast-math -profile-use | 6.078386 sec | -6.14 % |
GCC4.5.1 | -O3 -march=amdfam10 -fomit-frame-pointer -mfpmath=sse -ffast-math | 6.170114 sec | -4.73 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math -profile-use | 6.308954 sec | -2.58 % |
GCC4.5.1 | -O3 -march=amdfam10 -fomit-frame-pointer -mfpmath=sse -profile-use | 6.388826 sec | -1.35 % |
GCC4.5.1 | -O3 -march=amdfam10 -fomit-frame-pointer -mfpmath=sse | 6.476186 sec | 0.00 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -profile-use | 6.527979 sec | 0.80 % |
Visual_Studio_9 | /GS- /fp:fast /O2 | 6.942938 sec | 7.21 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 7.293316 sec | 12.62 % |
GCC4.5.1 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 7.372564 sec | 13.84 % |
GCC4.5.1 | -O3 -fomit-frame-pointer -mfpmath=sse | 7.661477 sec | 18.30 % |
- Optimizing for the amdfam10 architecture (-march=amdfam10) gives an 18% speed increase.
- -ffast-math gives 5%.
- Profile-guided optimization gives only 1%.
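The per-CPU -march choices used in the tables above can be collected into a small lookup that falls back to the safe prescott setting. This is a sketch under assumptions: the CPU labels used as dictionary keys and the `gcc_flags` helper are hypothetical, and only the -march values themselves come from the article.

```python
SAFE_MARCH = "prescott"  # the article's safe default for modern AMD and Intel CPUs

# Keys are made-up labels for the test machines; values are GCC -march names.
MARCH_BY_CPU = {
    "pentium4": "prescott",
    "core2": "core2",
    "core-i5": "core2",       # GCC 4.5.1 has no dedicated Core i3/i5/i7 option
    "athlon-ii": "amdfam10",
}


def gcc_flags(cpu=None, fast_math=False):
    """Build the option list from the article for a given CPU label."""
    march = MARCH_BY_CPU.get(cpu, SAFE_MARCH)
    flags = ["-O3", f"-march={march}", "-fomit-frame-pointer", "-mfpmath=sse"]
    if fast_math:
        flags.append("-ffast-math")
    return flags


if __name__ == "__main__":
    print(" ".join(gcc_flags("athlon-ii", fast_math=True)))
    print(" ".join(gcc_flags()))  # unknown CPU: falls back to -march=prescott
```

A binary built with a CPU-specific -march value should only be shipped to machines of that family; the fallback is the one to use when the target CPU is not known in advance.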
Total results
Optimizing for a particular architecture combined with profile-guided optimization may give up to a 20% speed increase.
As I already said, LAME is a simple example. Let's see how these options affect a real video analytics application.
For the second example I used a critical part of a real video analytics application (myAudience). It uses the Boost, OpenCV, and FFmpeg libraries and runs in several threads. Compared with the LAME encoder, measuring performance for this application was not so simple. Moreover, because measurements in a dynamic multithreaded environment are imprecise, the results were less consistent. So I have prepared just one table that shows the general picture as I understand it.
Compilation was done with GCC 4.5.2 and GCC 4.1.2 on CentOS 5.5.
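One way to read the "up to 20%" figure is as the reduction in encoding time between the slowest GCC build (no -march option) and the fastest one in each LAME table. The sketch below does that arithmetic with the averages from the tables; the dictionary layout and function name are illustrative only.

```python
# (slowest GCC time, fastest GCC time) in seconds, copied from the LAME tables:
# the build without -march versus the best-performing option set.
TIMES = {
    "Pentium 4": (14.621770, 13.206153),
    "Core 2 Duo": (9.224519, 7.818235),
    "Core i5": (4.815647, 4.059390),
    "Athlon II X4": (7.661477, 6.078386),
}


def time_reduction_percent(slow, fast):
    """Percentage of the slower time saved by the faster build."""
    return (slow - fast) / slow * 100.0


if __name__ == "__main__":
    for cpu, (slow, fast) in TIMES.items():
        print(f"{cpu}: {time_reduction_percent(slow, fast):.1f} % less time")
```

With these numbers the Athlon shows the largest reduction, a little over 20%, and the Pentium 4 the smallest, just under 10%.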
Intel Core i5 (2500) 3.3 GHz
Compiler | Compiler options | Average time | Time vs. baseline |
GCC4.5.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math -profile-use | 19591.16 | -4.90 % |
GCC4.5.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -profile-use | 19873.74 | -3.53 % |
GCC4.5.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math | 20010.55 | -2.86 % |
GCC4.5.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse | 20410.36 | -2.09 % |
GCC4.5.2 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 20410.36 | -0.92 % |
GCC4.1.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse -ffast-math | 20532.91 | -0.33 % |
GCC4.5.2 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 20600.55 | 0.00 % |
GCC4.1.2 | -O3 -march=core2 -fomit-frame-pointer -mfpmath=sse | 20816.26 | 1.05 % |
GCC4.1.2 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse -ffast-math | 21962.44 | 6.61 % |
GCC4.1.2 | -O3 -march=prescott -fomit-frame-pointer -mfpmath=sse | 22221.88 | 7.87 % |
What can we conclude from this? A few things:
- GCC 4.5.2 is a little faster than GCC 4.1.2, and it also let me use profile-guided optimization and the "amdfam10" and "atom" architecture options.
- Profile-guided optimization gives about a 4% speed increase.
- -ffast-math gives about a 2% speed increase.
As you can see, tuning compiler options gives a real improvement in performance. It is sometimes not huge, but it is almost free, so it is worth keeping in mind.
I played around with these kinds of options a long time ago. Now, thanks to your post, I'm going to give it a new try.
Best regards 😉
Nice post, some of those settings made a big difference, and it's good to see that newer versions of GCC are faster than VS2008 🙂