This article examines how to improve software execution speed on embedded devices with multi-core ARM mobile processors, by migrating existing software code to parallel computing with the OpenMP API. The improvements are implemented and tested on Raspberry Pi 2 and mobile Android platforms, although the same principles and technologies apply directly to optimizing desktop and server software as well.
Some time ago I was porting audio processing software to Android devices and faced the challenge of making a calculation-intensive algorithm run fast enough even on low-end Android devices.
The case concerned an audio signal processing algorithm that runs very smoothly on a regular laptop computer, occupying about one percent of the laptop's CPU capacity while transposing the pitch of multi-channel audio in real time. When ported to a low-end smartphone, the same algorithm was on the very verge of executing fast enough for a smooth real-time audio experience.
This performance difference was of course due to different CPU capabilities. Despite the very impressive development of low-power mobile CPUs during the past decade or so, the processors used in mobile and embedded devices are still optimized for low cost and low power consumption, and their raw performance falls rather far behind the processors used in personal computers, which allow much more permissive cost and power design budgets.
Mobile trend towards multi-core CPUs
The mobile device world has recently gone through a transition from single-core to multi-core CPUs similar to what the PC world saw during the past decade: while single-core CPUs were still mainstream in smart handsets a few years ago, nowadays even low-end mobile handsets feature two or more CPU cores, flagship smartphone models from Samsung, LG, Sony and others already feature eight-core CPUs, and no doubt we shall see even higher core counts in future products. The same trend applies to embedded microcontroller families as well.
This puzzle led to an examination of how to improve the execution performance of C++ software originally written for traditional single-CPU execution, by modifying the processing algorithms to utilize several CPU cores in parallel. The OpenMP API was selected as the parallel programming technology due to its transparent support for different operating systems, compiler tools and hardware environments, so that it nowadays works even in mobile devices.
Parallel programming for multi-core CPUs
Packing several CPU cores into a multi-core unit provides a significant theoretical increase in computation horsepower, yet the full power of a multi-core unit gets harnessed only if the target software is also (re)designed to utilize multiple CPU cores in parallel.
The key to utilizing multiple CPU cores is to design the program so that it divides the execution of algorithms into several execution threads that run simultaneously. This is called parallel programming.
Designing an efficient parallel algorithm is a non-trivial task, because algorithms typically require specific stages to execute in a specific, well-defined order. There are practical challenges in arranging different parts of an algorithm to execute efficiently in parallel, while ensuring that the results can eventually be combined together in the correct order at the end.
Luckily there are software tools designed to ease the parallel programming task.
OpenMP
OpenMP is among the most prominent SMP parallel programming technologies. OpenMP is a specification of programming language extensions that allow the software developer to tag hints into the program source code about which functions and loops can benefit from splitting the execution path into several parallel threads. Based on these tags added by the developer, the compiler takes care of the rest: it splits execution into separate parallel threads for multi-core execution and synchronizes the final results of each thread back together in a consistent way.
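To give a flavour of what such a tag looks like, here is a minimal stand-alone sketch (not from the SoundTouch sources) where a single #pragma line asks the compiler to spread a summing loop over the available CPU cores:

#include <cstdio>

int main()
{
    const int N = 1000000;
    double sum = 0.0;

    // The pragma is the only OpenMP-specific line: the compiler splits the
    // loop iterations across the CPU cores, and the reduction clause combines
    // the per-thread partial sums back into 'sum' at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
    {
        sum += i * 0.5;
    }

    printf("sum = %.0f\n", sum);
    return 0;
}

Compiled with an OpenMP-enabled compiler (for example g++ -fopenmp), the loop runs on all cores; compiled without the flag, the pragma is ignored and the loop runs single-threaded with the same result.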
OpenMP has several benefits:
- High-level abstraction makes it easy to use in new and existing software, without the developer needing to worry too much about how the threaded execution is managed internally
- Widely adopted: the OpenMP specification is developed by a non-profit organization consisting of industrial corporations and institutions, so it is widely supported, and reasonably recent versions of common C/C++ and Fortran compilers support at least the elementary OpenMP API features. For example, GCC from version 4.2 onward and Microsoft Visual Studio from 2008 onward support OpenMP
- Supports various platforms: thanks to the standard API specification, the same optimizations work across different compilers and hardware/OS platforms. The same source code can thus be compiled for desktop, mobile and embedded platforms alike.
- Transparent API allows building the same software with and without parallel optimizations enabled. This is beneficial in some cases, for example when the same source code is also used on platforms that do not have an OpenMP-capable compiler available. In such a case a legacy compiler simply ignores the OpenMP directives and the software executes normally using just a single CPU core. In some cases it is also useful to build a non-parallel compilation for easier debugging or for comparison purposes (see the sketch below).
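As a minimal sketch of this transparency (again not from the SoundTouch code), the standard _OPENMP preprocessor macro can be used to keep a single code base that builds both with and without OpenMP:

#include <cstdio>

#ifdef _OPENMP          // defined automatically by OpenMP-capable compilers
#include <omp.h>
#endif

void reportParallelism()
{
#ifdef _OPENMP
    // Built with e.g. "g++ -fopenmp": report how many threads OpenMP may use.
    printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
#else
    // Built without -fopenmp: the pragmas elsewhere in the source are ignored
    // and everything simply runs in a single thread.
    printf("OpenMP not available, running single-threaded\n");
#endif
}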
While OpenMP is an easy and versatile approach to parallel programming on SMP multi-core/multi-CPU computers, other parallel programming techniques exist too. For example:
- Cilk, Intel-sponsored C/C++ extensions for parallel programming; an alternative technology to OpenMP.
- MPI (Message Passing Interface), a technology for distributed parallel programming. MPI enables building large-scale parallel programs that can expand a common calculation task beyond a shared-memory unit to a whole cluster of networked computers. MPI can be used together with OpenMP, so that OpenMP handles parallel execution inside a single SMP computer unit, while MPI distributes execution across a cluster of such units.
- CUDA, an Nvidia-sponsored programming model for using GPUs (graphics processing units) for general-purpose computing. Contemporary graphics processing units in PCs and mobile devices consist of a wide set of multi-core execution units that nowadays can also be used for general-purpose calculation besides drawing graphics. The hardware and programming model of GPUs is much more restricted than what OpenMP enables, but because a single GPU can contain from tens up to thousands of parallel execution units, certain special calculation tasks can gain remarkable performance acceleration with CUDA.
Optimizing software runtime performance
The principle of software performance optimization is straightforward: detect where the program spends most of its execution time, and improve those parts to run faster.
Typically a relatively small number of hot-spot routines use up most of the execution time, so that just a handful of critical functions consume over 80% of the overall execution time. The key to software performance optimization is thus to identify these hot-spot routines and focus the optimization effort on rewriting these few routines to execute better, which then results in faster overall execution of the software.
Let's follow this approach to migrate the existing C/C++ software to execute on parallel CPU cores.
The main steps for performance optimization are as follows:
- Step 1: Define a performance benchmark. To measure and identify which parts of the software use the most CPU time, i.e. hold the most optimization potential, it's necessary to define a realistic and repeatable benchmark scenario for measuring execution performance before and after the optimization improvements
- Step 2: Measure how the software uses the execution time. Run the software through the benchmark scenario while using a software profiling tool to record information about the execution.
- Step 3: Analyse the results to identify which routines to improve. Use the profiling tool to analyse the recorded data and see in which routines the CPU spent most of the time
- Step 4: Implement code improvements. Knowing the hot-spot routines with the biggest contribution to the software's run time, improve these routines to execute faster. In this case, this step involves making the hot-spot routines utilize multiple CPU cores in parallel.
- Step 5: Control by comparing results before/after the optimization changes. Repeat the benchmark scenario to get feedback on the efficiency of the optimizations, and to see how much further optimization potential may remain
Let’s do this!
Step 1. Benchmark setup
Target setup
Raspberry Pi 2 was chosen as the working environment for the OpenMP optimization:
- It runs a full-fledged Linux OS, so a versatile set of free software development tools is available, which allows easy testing and tuning of the software
- With its four-core ARM CPU, Raspberry Pi 2 is a good example of a product that brings multi-core CPUs into ever more affordable electronics
- Common smartphone devices (Android, iPhone and Windows Phone) use multi-core ARM processors similar to the Raspberry Pi's, so performance improvements achieved on the Raspberry Pi 2 platform are expected to translate also to smartphones and tablets.
As of writing, the GCC compiler version 4.6 included in Raspberry Pi 2's "Raspbian" operating system distribution supports the OpenMP 3.0 specification, which is recent enough for this examination.
The target software
The target software to optimize is SoundTouch, an open-source audio processing library and application written in C/C++. It makes a good example for parallel optimization because it's openly available for several platforms, it was originally designed and written for traditional single-core execution, and it involves relatively calculation-intensive DSP routines.
SoundTouch processes digital audio streams, so to give it a fair amount of work for benchmark testing, let the benchmark scenario be retuning a 1000-second CD-quality (44.1kHz/16bit/stereo) audio track from a 440Hz to a 432Hz base frequency with the SoundStretch utility. Converted to the semitone scale, this corresponds to lowering the key by 0.318 semitones.
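For reference, the semitone figure follows directly from the equal-tempered pitch scale: 12 · log2(432 / 440) ≈ −0.318 semitones.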
The shell command to run the benchmark scenario in Raspberry / Linux is accordingly:
time ./soundstretch original.wav /dev/null -pitch=-0.318
In this case we direct the output to /dev/null instead of a real output file to minimize the impact of mass storage I/O on the benchmark result. This is because the Raspberry Pi 2 uses a consumer-grade flash memory SD card as mass storage, whose relatively low write speed might unnecessarily bias the measured processing times.
The "time" command in front of the main command is a standard GNU tool that reports the total execution time after the main program finishes.
Visit the SoundTouch site to download the source codes. The build instructions for various platforms are provided in the README file included in the source code package.
Step 2. Measure how the software uses the execution time
The measure step is done using software profiling. Profiling tools and how to perform software profiling on the Raspberry Pi 2 are discussed in more detail in another article.
Based on a comparison of the available free profiling tools, gperftools was chosen as the most suitable toolkit for this OpenMP optimization task.
Let's run the benchmark scenario with gperftools profiling enabled, and then run the google-pprof tool to analyse the results:
time ./soundstretch test.wav /dev/null -pitch=-0.318
google-pprof --text ./soundstretch soundstretch.prof
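For completeness, one common way to produce the soundstretch.prof file is to preload the gperftools profiler library and point the CPUPROFILE environment variable at the output file; the library path below is an assumption for a Raspbian system and may differ on yours, and the profiling article mentioned above covers the setup in more detail:

CPUPROFILE=soundstretch.prof \
LD_PRELOAD=/usr/lib/arm-linux-gnueabihf/libprofiler.so.0 \
./soundstretch test.wav /dev/null -pitch=-0.318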
Step 3. Analyse profiling results to identify routines to improve
Here are the profiler results before the OpenMP improvements:
real 121.939 s
user 121.210 s
sys 0.610 s

Total: 12183 samples
7713  63.3%  63.3%  7713  63.3%  soundtouch::TDStretch::calcCrossCorr
3366  27.6%  90.9%  3366  27.6%  soundtouch::FIRFilter::evaluateFilterStereo
 399   3.3%  94.2%   399   3.3%  soundtouch::InterpolateCubic::transposeStereo
 182   1.5%  95.7%   182   1.5%  WavInFile::read
 175   1.4%  97.1%   175   1.4%  WavOutFile::write
 155   1.3%  98.4%   155   1.3%  memcpy
 106   0.9%  99.3%   106   0.9%  soundtouch::TDStretch::seekBestOverlapPositionFull
  51   0.4%  99.7%    51   0.4%  read
This output lists the functions ordered by consumed CPU time: each row shows the number of profiling samples attributed to the function, its share of the total time and the cumulative share, with the remaining columns repeating the figures including time spent in called functions.
We can see that the first two functions together used over 90% of all the execution time:
- TDStretch::calcCrossCorr() : 63.3% of CPU time / 77.13 seconds
- FIRFilter::evaluateFilterStereo() : 27.6% of CPU time / 33.66 seconds
This is a good example of the optimization rule of thumb discussed earlier: typically just a few hot-spot functions consume most of the CPU time. It's clear that the optimization effort should focus on these two functions; cutting down the execution time in these two hot-spots will provide a good reduction in the overall execution time.
On the other hand, all the other functions in the program together consume less than 10% of the execution time, so even very thorough optimization of any or all of them would yield only a fractional improvement in the overall execution time.
Step 4. Implement OpenMP improvements
Based on the "before" profiling results we know that about 63% of the execution time is spent within the single function calcCrossCorr. By examining the source code we can see that this function is called from within a "for" loop in a single other function in the program.
Let’s now see how we can use OpenMP to optimize this loop to execute in several CPU cores in parallel.
Inside function TDStretch::seekBestOverlapPositionFull():
int bestOffs = 0;
double bestCorr = FLT_MIN;

for (i = 1; i < seekLength; i ++)
{
    double corr;
    // Calculates correlation value for the mixing position corresponding to 'i'
    corr = calcCrossCorr(refPos + channels * i, pMidBuffer, norm);

    // heuristic rule to slightly favour values close to mid of the range
    double tmp = (double)(2 * i - seekLength) / (double)seekLength;
    corr = ((corr + 0.1) * (1.0 - 0.25 * tmp * tmp));

    if (corr > bestCorr)
    {
        bestCorr = corr;
        bestOffs = i;
    }
}
Browse this source code file online here.
OpenMP Syntax
There are good OpenMP programming tutorials available online, describing in detail how to use the various features of the OpenMP API. See for example the Tutorials at OpenMP.org
All OpenMP source-code level instructions begin with the directive #pragma omp. #pragma is a standard C/C++ language directive for special compiler definitions. The advantage of this approach is that compilers that support OpenMP know what to do with the #pragma omp directives, while compilers that don't support OpenMP can still compile the same source code correctly and simply ignore the OpenMP #pragma statements.
In this case we need only two of OpenMP's #pragma directives. The first of these is:
#pragma omp parallel for
This directive instructs the compiler to split execution of the following for loop into several parallel threads, allowing each thread to utilize a separate CPU core. After the for loop, program execution continues as usual in a single thread.
This directive is used by adding it before a ‘for’ loop as follows:
int bestOffs = 0;
double bestCorr = FLT_MIN;

#pragma omp parallel for
for (i = 1; i < seekLength; i ++)
{
    double corr;
    // Calculates correlation value for the mixing position corresponding to 'i'
    corr = calcCrossCorr(refPos + channels * i, pMidBuffer, norm);

    // heuristic rule to slightly favour values close to mid of the range
    double tmp = (double)(2 * i - seekLength) / (double)seekLength;
    corr = ((corr + 0.1) * (1.0 - 0.25 * tmp * tmp));

    // Check if we've found new best correlation value
    if (corr > bestCorr)
    {
        bestCorr = corr;
        bestOffs = i;
    }
}
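As a side note not covered above: by default OpenMP creates roughly one worker thread per available CPU core, and this can be inspected or overridden at run time. A small hypothetical sketch (not part of SoundTouch):

#include <cstdio>
#include <omp.h>   // OpenMP runtime library, available when building with -fopenmp

void showThreadSetup()
{
    // Default upper limit, typically the number of CPU cores; it can also be
    // overridden with the OMP_NUM_THREADS environment variable.
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        // Each worker thread prints its own id; output order is non-deterministic.
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
}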
Compiler switches
Next, let's add the proper compiler switches to the project's Makefile to build an OpenMP-enabled compilation. For the gcc compiler this flag is -fopenmp, and it needs to be added to both the compiler and linker options.
These can be added to the Makefile CXXFLAGS/LDFLAGS variables, or alternatively given on the command line when launching the build:
make -j CXXFLAGS=-fopenmp LDFLAGS=-fopenmp
That's it! This is in principle all that is required to make the given 'for' loop execute its iterations in parallel. This single directive would in this case already be enough to divide the execution into multiple threads and thus run on multiple CPU cores in parallel!
…however: multi-thread access protection
However, there's still a small catch. Once the contents of the "for" loop execute in several parallel threads and cores simultaneously, it's possible that several threads attempt to modify the shared variables bestCorr and bestOffs at the same time. This is called a multi-thread access conflict: two threads reading and modifying a shared variable at the very same moment, which can cause unpredictable results.
What is typically seen when multiple threads modify shared variable values simultaneously is that the produced calculation results are incorrect, and they can also be unpredictable, so that the same algorithm may produce randomly different results on different runs, even with the same input and parameter data.
To avoid possible multi-thread variable access conflicts and ensure coherent results, we'll need the following OpenMP directive to protect shared variable access:
#pragma omp critical
This is a directive for controlled access to shared data. It ensures that the "critical section" immediately following the directive can be executed by only a single thread at a time.
If several threads reach the critical section simultaneously, only one thread at a time gets to execute the code inside it; all other threads wanting to enter the critical section pause and wait their turn until the previous thread(s) have exited the critical section block.
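As a minimal generic sketch of the directive (separate from the SoundTouch code), the critical section below serializes updates to a shared container that is not thread-safe by itself:

#include <cmath>
#include <vector>

std::vector<double> collectRoots(int n)
{
    std::vector<double> results;   // shared between all worker threads

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        double r = std::sqrt((double)i);   // per-thread work on private data

        // Only one thread at a time may modify the shared vector;
        // note that the insertion order is non-deterministic.
        #pragma omp critical
        results.push_back(r);
    }
    return results;
}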
Critical section performance consideration
Apply the critical section directive with some caution to avoid degrading performance, because a critical section can cause parallel threads to pause and wait. If parallel threads end up spending too much of their time waiting at a critical section, the resulting code may not run any faster than a single-thread/single-core version, or in the worst case can even become slower than the original single-thread execution.
For this reason we add an additional "if" check before the critical section in the code, so that the critical section is entered only when it is actually necessary to modify the shared variable values:
#pragma omp parallel for
for (i = 1; i < seekLength; i ++)
{
    double corr;
    // Calculates correlation value for the mixing position corresponding to 'i'
    corr = calcCrossCorr(refPos + channels * i, pMidBuffer, norm);

    // heuristic rule to slightly favour values close to mid of the range
    double tmp = (double)(2 * i - seekLength) / (double)seekLength;
    corr = ((corr + 0.1) * (1.0 - 0.25 * tmp * tmp));

    // Check if we've found new best correlation value
    if (corr > bestCorr)
    {
        // For optimal performance, enter critical section only in case
        // that best value found. In such case repeat 'if' condition as it's
        // possible that other parallel execution may have already updated the
        // bestCorr value in the mean time
        #pragma omp critical
        if (corr > bestCorr)
        {
            bestCorr = corr;
            bestOffs = i;
        }
    }
}
That’s it for this function!
In a similar fashion we improve the other detected hot-spot function, evaluateFilterStereo, by again adding the directive #pragma omp parallel for in front of a for loop. In this case it's not even necessary to add a critical section, because here two parallel threads never try to modify the same data simultaneously:
uint FIRFilter::evaluateFilterStereo(SAMPLETYPE *dest, const SAMPLETYPE *src, uint numSamples) const
{
    int j, end;
    double dScaler = 1.0 / (double)resultDivider;

    end = 2 * (numSamples - length);

    #pragma omp parallel for   // This is the OpenMP directive added in this example!
    for (j = 0; j < end; j += 2)
    {
        const SAMPLETYPE *ptr;
        LONG_SAMPLETYPE suml, sumr;
        uint i;

        suml = sumr = 0;
        ptr = src + j;
        for (i = 0; i < length; i += 4)
        {
            // loop is unrolled by factor of 4 here for efficiency
            suml += ptr[2 * i + 0] * filterCoeffs[i + 0] +
                    ptr[2 * i + 2] * filterCoeffs[i + 1] +
                    ptr[2 * i + 4] * filterCoeffs[i + 2] +
                    ptr[2 * i + 6] * filterCoeffs[i + 3];
            sumr += ptr[2 * i + 1] * filterCoeffs[i + 0] +
                    ptr[2 * i + 3] * filterCoeffs[i + 1] +
                    ptr[2 * i + 5] * filterCoeffs[i + 2] +
                    ptr[2 * i + 7] * filterCoeffs[i + 3];
        }
        dest[j] = (SAMPLETYPE)(suml * dScaler);
        dest[j + 1] = (SAMPLETYPE)(sumr * dScaler);
    }
    return numSamples - length;
}
Browse this source code file online here.
That’s it!
Step 5. Compare before/after optimization results
Let’s now profile the software again in the same benchmark scenario to see how the OpenMP improvements impact the results.
Here are the new gperftools results with the OpenMP optimizations:
real 39.647 s
user 156.040 s
sys 0.630 s

Total: 3968 samples
1934  48.7%  48.7%  1934  48.7%  soundtouch::TDStretch::calcCrossCorr
1058  26.7%  75.4%  1058  26.7%  gomp_barrier_wait_end
 827  20.8%  96.2%   827  20.8%  soundtouch::FIRFilter::evaluateFilterStereo
  85   2.1%  98.4%    85   2.1%  gomp_team_barrier_wait_end
  35   0.9%  99.3%    35   0.9%  soundtouch::TDStretch::seekBestOverlapPositionFull
  11   0.3%  99.5%    11   0.3%  soundtouch::InterpolateCubic::transposeStereo
   9   0.2%  99.8%     9   0.2%  WavInFile::read
The table below compares the before/after results. We can conclude that the OpenMP optimizations reduced the real-time execution duration from about 122 seconds to just below 40 seconds, which is better than a three-fold improvement in execution time. That's quite a good improvement for adding just a few optimization clauses into two functions!
            | Before      | With OpenMP
Real time   | 121.939 sec | 39.647 sec
User time   | 121.210 sec | 156.040 sec
System time | 0.610 sec   | 0.630 sec
Execution times at the hot-spot function level are shown below. We can see that the execution times of the two OpenMP-optimized functions have decreased to about one-fourth of the original, which is to be expected as they now run on four CPU cores in parallel!
                                            | Before    | With OpenMP
soundtouch::TDStretch::calcCrossCorr        | 77.13 sec | 19.34 sec
gomp_barrier_wait_end                       | –         | 10.58 sec
soundtouch::FIRFilter::evaluateFilterStereo | 33.66 sec | 8.27 sec
gomp_team_barrier_wait_end                  | –         | 0.85 sec
Result considerations
Surveying the results above, we notice that while the Raspberry Pi 2 has four CPU cores and the optimized functions indeed now spend only a fourth of their original time, the overall run-time improvement is about three-fold, not the four-fold that a perfect speedup would give.
The reason for the difference is that splitting execution of an algorithm into four parallel threads unavoidably introduces some multi-threading overhead and thread wait times for synchronizing results between the threads. We can see this overhead in the profile as two new functions that were not present in the original results: gomp_barrier_wait_end and gomp_team_barrier_wait_end.
These are OpenMP thread synchronization functions that manage thread synchronization around the critical section, and also the waiting at the end of parallel sections, where all threads must complete their work before the results of the parallel tasks are pulled together. Due to this additional threading overhead, the overall OpenMP improvement in this case is about a three-fold reduction in runtime, not four-fold.
We can see the same OpenMP overhead in yet another way by looking at the overall "user time" figures. The "user time" tells how much CPU time the program occupied on all utilized CPU cores together. In this case the accumulated CPU user time of the parallel OpenMP run increased to 156 seconds from the 121 seconds of the original run: spread across four cores, those 156 seconds correspond to roughly 39 seconds of wall-clock time, about three times faster than 121 seconds on a single core. The parallelized version thus completes the work much faster by using four CPU cores, yet it also consumes some 35 seconds of extra CPU work for managing the threading overhead.
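As a rough cross-check (not part of the original measurements), Amdahl's law gives a similar figure: the two optimized functions covered about 91% of the original run time, so the best achievable speedup on four cores is roughly 1 / ((1 − 0.91) + 0.91 / 4) ≈ 3.1, which is close to the measured 121.9 s / 39.6 s ≈ 3.1×.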
Conclusion
We just improved existing C++ software that was originally written for single-threaded execution so that it utilizes several CPU cores in parallel, by adding just a few simple OpenMP API directives into the source code. The same source code can now be compiled with OpenMP parallel processing support, or without OpenMP if so desired.
Based on the benchmark tests, these OpenMP improvements yield a three-fold improvement in real-world processing speed on the Raspberry Pi 2 platform with its four-core ARM processor. What an easy yet remarkable improvement!
See the other post for OpenMP multi-core parallel benchmark results on Android devices …
P.S. Comparison with the Six-Sigma improvement process
As a side note, the steps that we just performed to optimize the software match the Six-Sigma DMAIC process steps:
- Recognize: detect the issue – in this case, the need to utilize multiple CPU cores
- Define – in this case, defining a tuning benchmark of having SoundStretch process a 1000-second audio clip
- Measure – in this case we ran the gperftools profiler to record runtime information
- Analyse – analysed the profiling results to identify in which functions the software spends most of its time
- Improve – implemented OpenMP optimizations for the detected hot-spot functions
- Control – repeated the profiling after the improvements to analyse the impact of the optimizations; if necessary, the measure-analyse-improve-control steps can be repeated
We can thus conclude that we have just completed a Six-Sigma improvement project! Please congratulate yourself and feel free to buy yourself a green belt from your local apparel store!
What research work can we do using Pi clusters?
Yes, 32-board and even 120-board Raspberry Pi clusters have been built.
Raspberry Pi clusters can make nice academic platforms for examining how to build, develop and optimize software for large computer clusters, as Raspberry Pi setups support SIMD-level, thread-level and cluster-level optimization techniques (particularly the Pi models 2 & 3 that have multi-core CPUs).
However, Raspberry Pi boards are optimized for low-cost/low-power use and thus feature somewhat feeble CPUs. If looking for serious real-world cluster computing horsepower, one will find much better performance for the price in setups based on more advanced CPUs. Currently the ultimate choice would be a Xeon Phi system, yet one might also be able to build an interesting academic computing cluster using <100 USD Intel x86 stick PCs...
Nice example, thanks.
Nicely done tutorial.
I’m an old hand at computer programming, but new to parallel code. This has been very helpful. I have a “stack” of 3 x Raspberry Pi boards ( 2 x Pi M2 and 1 x Pi M3 ) that I’m using as my learning tool.
FWIW, I did some comparisons of the raw CPU power of various SBC boards and their cost using Wiki data for Dhrystone performance. The Pi M3 scores 11,040 and the Odroid XU4 42,640 kilo-Dhrystones/sec. But the big jump comes in $ per k-Dhrystones/sec.
The Pi M3 is good at $3.17, but the XU4 scored highest at $1.64.
I did not compute values for Odroid's newer 4 x XU4 cluster product as I don't have one. They also removed some of the I/O devices (like HDMI) from those boards. (But it is mounted to a nice unified heatsink / housing.)
I now feel ready to "give it a try" with some OpenMP test cases and compare actual performance (limited by memory et al.)
Thanks again.