OpenMP Parallel computing in Android

This is a follow-up to an earlier post about parallel programming on embedded devices. It shows how to use OpenMP for parallel programming on Android devices.

Using OpenMP in Android

Although OpenMP is not officially supported on the Android platform as of this writing, C/C++ code that uses OpenMP parallel optimizations can be made to work on Android with a few simple tricks.

Step 1: Use NDK to compile C/C++ for Android

Java is the default programming language for Android application development, and it works fine for general-purpose application logic. However, C/C++ routines can also be included in Android Java applications: compile them into a shared library with the Android NDK (“Native Development Kit”) and call them from the application through JNI, the Java Native Interface.

Using C/C++ code can be convenient in Android, because:

  • A lot of useful C/C++ software already exists, and reusing it in Android/Java applications can be desirable.
  • Native C/C++ code can run several times faster than Java code on Android, so computation-intensive algorithms are often better written in C/C++ than in Java.

To compile C/C++ code for Android:

  • Install Android NDK tools, available for free download here
  • Create an Android NDK project, copy in your C/C++ files and edit Android.mk to include them in the compilation.
  • Use the Android NDK’s ndk-build script to build the source code into library binaries.

As in the earlier OpenMP parallel programming article, let’s use the SoundTouch library as the example, this time as an Android native C++ library. The SoundTouch source code package contains a ready-made Android NDK example project for building the SoundTouch library into NDK library binaries, and a simple Android example application that processes audio files on Android devices using this library. See the SoundTouch Android README for instructions on how to build the example libraries and app for Android.
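For orientation, the native side of a JNI call could look roughly like the sketch below. It is not taken from SoundTouch: the routine sum_of_squares, the Java package com.example.demo and the class NativeLib are all hypothetical names chosen for illustration. The calculation is kept in a plain C++ function, and the JNI glue is compiled only where <jni.h> is available, as it is under the NDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Plain C++ routine with no Android dependencies (hypothetical example).
int64_t sum_of_squares(const int32_t* values, size_t count)
{
    assert(values != nullptr || count == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < count; ++i)
    {
        sum += static_cast<int64_t>(values[i]) * values[i];
    }
    return sum;
}

// The JNI glue below is compiled only when <jni.h> can be found.
#ifdef __has_include
#  if __has_include(<jni.h>)
#include <jni.h>

// Assumed Java-side declaration (hypothetical package and class):
//   package com.example.demo;
//   public class NativeLib {
//       public static native long sumOfSquares(int[] values);
//   }
extern "C" JNIEXPORT jlong JNICALL
Java_com_example_demo_NativeLib_sumOfSquares(JNIEnv* env, jclass, jintArray values)
{
    const jsize count = env->GetArrayLength(values);
    jint* elems = env->GetIntArrayElements(values, nullptr);
    if (elems == nullptr)
    {
        return 0;   // array not accessible; a Java exception is pending
    }
    const jlong result = sum_of_squares(
        reinterpret_cast<const int32_t*>(elems), static_cast<size_t>(count));
    // JNI_ABORT: release without copying back, the data was only read
    env->ReleaseIntArrayElements(values, elems, JNI_ABORT);
    return result;
}
#  endif
#endif
```

Keeping the calculation separate from the JNI wrapper means the same routine can be built and checked on a desktop compiler before it is packaged with ndk-build.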

Step 2: Set up OpenMP compiler flags

The Android NDK uses a gcc cross-compiler to compile the C/C++ source code into Android binaries. To enable OpenMP support, edit the NDK project’s Android.mk file and add the -fopenmp switch to both the compiler and linker settings:

LOCAL_CFLAGS += -fopenmp
LOCAL_LDFLAGS += -fopenmp
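As a minimal sketch of what these flags switch on, the loop below is annotated with an OpenMP pragma. The function dot_product is a hypothetical helper for illustration, not part of SoundTouch. When compiled with -fopenmp the loop iterations are divided among threads; without the flag the pragma is ignored and the same code runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: dot product of two equal-length buffers.
double dot_product(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;

    // Each thread accumulates a private partial sum; the reduction clause
    // adds the partial sums together when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
    {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the code stays valid without OpenMP, the same source file can be built for targets where the -fopenmp switch is not available.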

Step 3: Workaround for Android native threading bug

Android NDK v10 (and earlier versions, and possibly future ones as well) has an issue: the thread-local storage that OpenMP uses for its threading state is not properly initialized for threads other than the application’s main thread.

This means that OpenMP-optimized functions work fine when invoked from the Android application’s main UI thread. Invoking OpenMP calculations from any other thread, however, crashes the whole application with a laconic “fatal signal 11” caused by a memory access violation.

This is undesirable, particularly because intensive calculations should not run on the main thread, where they would make the UI unresponsive.

A workaround is to make a quick JNI function call from the main UI thread at the beginning of the program to save a pointer to the main thread’s OpenMP threading storage, and later use this saved pointer to initialize OpenMP support properly in other threads. With this arrangement, OpenMP optimizations also work when executed from the application’s background threads.

Here is an example of the workaround code:

#include <pthread.h>
extern pthread_key_t gomp_tls_key;
static void * _p_gomp_tls = NULL;   // custom storage for thread state pointer

...

// call this piece of code from App Main UI thread at beginning of the App execution, 
// and then again in JNI call that will invoke the OpenMP calculations
{
    // get pointer to the current thread's thread storage 
    void *ptr = pthread_getspecific(gomp_tls_key);
    if (ptr == NULL)
    {
        // it's empty, thus set the thread storage based on earlier stored _p_gomp_tls
        pthread_setspecific(gomp_tls_key, _p_gomp_tls);
    }
    else
    {
        // storage not empty, thus store it into _p_gomp_tls for later use
        _p_gomp_tls = ptr;
    }
}

See the function _init_threading in soundtouch-jni.cpp for another example of this workaround.

That’s it! With these steps your app will be all set for utilizing OpenMP APIs.

OpenMP benchmark results in Android

To get an idea of how much OpenMP helps on Android, the same SoundTouch benchmark tests that were run earlier on the Raspberry Pi 2 platform were repeated on a variety of Android devices. The results are here:

Benchmark duration in seconds:

Android Device | CPU                               | Cores | No OpenMP (s) | With OpenMP (s) | OpenMP speed-up factor
Sony Acro S    | ARM v7-A 1.5 GHz (Scorpion) ¹     | 2     | 99.9          | 64.4            | 1.6x
Sony Tablet Z  | ARM v7-A 1.5 GHz (Krait) ¹        | 4     | 73.5          | 27.1            | 2.7x
Samsung J5     | ARM v8 1.2 GHz (Cortex-A53) ¹     | 4     | 91.2          | 31.1            | 2.9x
Acer Iconia A1 | Intel x86 Atom Z3745 1.33 GHz ²   | 4     | 29.8          | 13.6            | 2.2x

¹ The ARM builds used the ARM instruction set, selected with the -marm compiler switch, which gives about 20% faster code than the NDK’s default Thumb code.
² The SSE optimizations were disabled in the x86 build for the sake of a fairer comparison. With both SSE and OpenMP optimizations enabled, the Atom benchmark run duration dropped further to 6.6 seconds.
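The speed-up factor in the table is the run time without OpenMP divided by the run time with OpenMP. The sketch below spells out that arithmetic. The parallel_efficiency helper is an addition for illustration and is not part of the original benchmark: it divides the speed-up by the core count, so that 1.0 would mean perfect scaling.

```cpp
#include <cassert>

// Speed-up factor: run time without OpenMP divided by run time with OpenMP.
double speedup_factor(double seconds_without, double seconds_with)
{
    assert(seconds_without > 0.0 && seconds_with > 0.0);
    return seconds_without / seconds_with;
}

// Parallel efficiency (illustrative addition): speed-up per core,
// where 1.0 would mean every core is fully utilized.
double parallel_efficiency(double speedup, int cores)
{
    assert(cores > 0);
    return speedup / cores;
}
```

For the Samsung J5 row, speedup_factor(91.2, 31.1) gives about 2.93, and dividing by its four cores gives an efficiency of roughly 0.73.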

Conclusion

We can see that Android devices do benefit from OpenMP optimizations. The improvement depends on the processor generation, and devices with a quad-core CPU naturally gain more than devices with a dual-core CPU. In this benchmark, the best case was a four-core ARM processor that ran almost three times faster with OpenMP-optimized code.

Curiously, the four-core Intel x86 Atom processor sees a smaller improvement from OpenMP than the four-core ARM processors. One possible reason is that in this SoundTouch benchmark the overall workload gets split into about 20,000 parallelizable tasks, so the processing time of a single task is relatively short. The Atom CPU is considerably more complex and powerful than its ARM counterparts (indeed, this Atom unit is based on the x86 Silvermont microarchitecture, which is also used as a building block in the Xeon Phi teraflop-monster processors of some of the most powerful supercomputers in the world). The more sophisticated hardware lets the Atom complete the benchmark several times faster than the ARM devices in a clock-for-clock comparison, yet it appears to suffer more from OS-level threading overhead when the workload is split into tasks of very short duration.
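To illustrate what task granularity means here, the following sketch contrasts two ways of parallelizing the same job. It is purely illustrative and does not reflect how SoundTouch is structured; the function names and the block-summing workload are assumptions. The first version opens a parallel region for every short block, so the threading cost is paid once per block. The second opens a single region and lets each thread process whole blocks.

```cpp
#include <cassert>
#include <vector>

// Fine-grained: a new parallel region is opened for every short block, so
// the thread wake-up and synchronization cost is paid once per block.
long long sum_fine_grained(const std::vector<int>& data, long block_len)
{
    assert(block_len > 0);
    const long n = static_cast<long>(data.size());
    long long total = 0;
    for (long begin = 0; begin < n; begin += block_len)
    {
        const long end = (begin + block_len < n) ? begin + block_len : n;
        long long block_sum = 0;
        #pragma omp parallel for reduction(+:block_sum)
        for (long i = begin; i < end; ++i)
        {
            block_sum += data[i];
        }
        total += block_sum;
    }
    return total;
}

// Coarse-grained: one parallel region for the whole job; each thread
// processes entire blocks, so the threading cost is paid only once.
long long sum_coarse_grained(const std::vector<int>& data, long block_len)
{
    assert(block_len > 0);
    const long n = static_cast<long>(data.size());
    const long blocks = (n + block_len - 1) / block_len;
    long long total = 0;
    #pragma omp parallel for reduction(+:total) schedule(static)
    for (long b = 0; b < blocks; ++b)
    {
        const long begin = b * block_len;
        const long end = (begin + block_len < n) ? begin + block_len : n;
        for (long i = begin; i < end; ++i)
        {
            total += data[i];
        }
    }
    return total;
}
```

Both functions return the same result. They differ only in how often threads have to be started and synchronized, which is the kind of overhead that matters more as individual tasks get shorter.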

ps.

Cray X-MP/24

Old-school parallel processing. The Cray X-MP/24 supercomputer from 1982 delivered 400 MFLOPS, comparable to or just slightly below present-day smartphones.