StarTools for Android: ARM coming of age
Disclaimer: The following constitutes the informal opinion and cursory research of the author, who is the first to admit he is not an expert in the area of CPU architectures. That said, if you have any issues with the content below, do use the contact form.
From the very outset, StarTools' code and UI were created to be portable, and not rely on any one Operating System feature, 3rd party library, framework or processor architecture. As a matter of fact, almost everything in StarTools was written from scratch over a period of almost 10 years. This has been partly for my own personal educational purposes (I like to understand all facets of Computer Science by writing something from scratch at least once in my life) and partly because existing solutions didn't appeal to me for one reason or another (bloated, too slow, not portable, not flexible enough, prohibitive licensing conditions, etc.).
Previous experience with coding for different architectures (6502, Z80, 68K, ARM, PPC, i386) on various machines and platforms (C64, MSX, Palm OS, Windows CE, MacOS, Linux, Windows, iOS, Android, etc.) had instilled a strong discipline; being careful when aligning data structures, mindful of using floating-point operations (speed), using integer arithmetic where possible, trying to use powers of two when dividing and multiplying (easily shiftable), being cache friendly (avoiding vertical scanline code), considering endianness, respecting small stack sizes, even not taking the availability of global variables for granted, etc. - these all had pretty much become second nature.
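As a rough illustration of the "integer arithmetic with powers of two" habit described above, here is a minimal C sketch. The function name `blend_u8` and the 0..256 weight convention are assumptions made for this example and are not taken from StarTools' source.

```c
#include <stdint.h>

/* Hypothetical helper: blend two 8-bit samples using an integer weight.
   weight_b must be in the range 0..256 (0 = all a, 256 = all b).
   Because the weights sum to 256, a power of two, the final division
   is a right shift by 8 instead of a divide or a floating-point multiply. */
static uint8_t blend_u8(uint8_t a, uint8_t b, uint32_t weight_b)
{
    uint32_t weight_a = 256u - weight_b;
    /* +128 rounds to nearest before the shift discards the low 8 bits. */
    return (uint8_t)((a * weight_a + b * weight_b + 128u) >> 8);
}
```

Calling `blend_u8(100, 200, 128)` mixes the two samples half and half without touching the FPU, which is the kind of operation older ARM cores handle comfortably.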
Now these are things that you normally wouldn't bother with (I know I probably wouldn't!), if all you wanted to do was create a type of application that traditionally only featured on desktop/Intel environments, such as an image processing application for astronomical data. However, having a keen professional and personal interest in what's going on in the mobile sector, I've always had bigger plans for StarTools.
ARM is fast becoming the 'second' CPU architecture to take into consideration when creating your multi-platform application. From my PalmOS and WinCE days, I knew the following simple things about ARM CPUs;
- Fast Integer arithmetic (except for division)
- Fast barrel shifter (pipeline efficiency)
- No branch prediction
- Small pipelines
- Typically paired with less, and slower, memory
- RISC means that the compiler can do a better job of optimising your code and the need for handcoded assembler is less of an issue.
Now, things have changed a little since then (as they have on Intel); we now have a dedicated floating point unit on v7 devices, which helps a lot. And we have branch prediction with the slightly longer pipelines, with the latest Cortex A9 design going fully superscalar. Bigger L1 and L2 cache sizes help and we even have SIMD-like extensions on some CPUs.
What hasn't changed though is ARM's pervasiveness and low cost. Case in point are the $40 ARM-powered 'Android TV sticks' that turn any TV with HDMI into a fully fledged computer. Incredibly, at the time of this writing, the first 2GB RAM quad-core models have been spotted. These sorts of specs seemingly offer incredible bang for your buck. StarTools is all about my personal mission to make astrophotography possible for more people by lowering the bar to entry - it seemed like these developments could play a big role in that.
Reading an article some months back (which, for the life of me, I cannot find) about the state of ARM vs Intel, the imminent 'war' between the two architectures, and how some manufacturers have started to create and deploy ARM-based servers, I decided it was time to investigate running StarTools on an ARM platform. The article I read included some detailed synthetic benchmark scores (here are some similar ones) for various ARM-based boards (including the Exynos 5 found in the Google Chromebook) versus an Intel Atom board.
While the ARM-powered boards (especially those of the older generation) appeared to exhibit the well published shortcomings (slow memory access, slow FPU), the more recent offerings (Cortex A9 and especially A15) were starting to look quite competitive in some areas. I found this article here that sheds some light on comparative performance between the different generations of ARM designs.
If anything is clear from the various inter-architecture comparisons though, it is that the different strengths and weaknesses vary wildly even between ARM cores of the same generation. These variations are compounded by the different chip makers creating their own take on ARM's design, for example, leading to A9-like cores with A15-like features.
Encouraged by some of these synthetic benchmarks and numbers, taking into account that StarTools was coded with ARM's strengths and weaknesses in mind where possible, plus knowing that StarTools performs OK on Atom (netbook) devices, I decided it was time to give it a go.
Choosing an ARM platform for my experiment was easy; it was going to be Android. Android is incredibly pervasive and at the same time open enough to even have a chance of working for this sort of project; side-loading is possible, it has a proper (user-accessible) file system, often comes with media expansion slots, comes on a range of hardware and, in line with StarTools' mission, is low-cost. Plus, if you tend to find yourself at the geeky end of the spectrum of users like me, its dead-easy rootability is a big plus!
As soon as the first Android NDK (Native Development Kit) was released back in 2009, I had a play with it, creating my usual abstraction layer that exposes a framebuffer, input devices, system events (such as rotation, exit, etc.) and file/resources system. Back then, Android Activities could not be run natively yet (this became possible with Android 2.3+), so I had to build my own interface to relay events back and forth between Java and native code. Not rocket science, just a bit of work.
Various methods to perform frame buffer access didn't seem like they were going to be future-proof at the time (there now is an officially supported method for Android 2.2+), so instead I decided to use OpenGL ES and render to a texture. This was fine as long as I didn't need insane frame rates. The UI library for StarTools only ever updates the screen when something has changed, so that wasn't a problem.
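The "only update when something has changed" approach can be sketched with a simple dirty flag. This is a minimal illustration in C, not the StarTools UI library itself: the `ui_surface` type and both function names are hypothetical, and the actual texture upload is left as a comment.

```c
#include <stdbool.h>

/* Hypothetical UI surface state; names are illustrative only. */
typedef struct {
    bool dirty;        /* set when any widget changed since the last frame */
    unsigned uploads;  /* counts how often the texture was re-uploaded */
} ui_surface;

/* Widgets call this whenever their appearance changes. */
static void ui_mark_changed(ui_surface *s)
{
    s->dirty = true;
}

/* Called once per frame; returns true if the texture was refreshed. */
static bool ui_present(ui_surface *s)
{
    if (!s->dirty)
        return false;   /* nothing changed: skip the costly upload */
    /* A real port would copy the framebuffer into the OpenGL ES texture
       here (for example with glTexSubImage2D) and redraw a quad. */
    s->uploads++;
    s->dirty = false;
    return true;
}
```

With this pattern the texture upload cost is only paid on frames where the UI actually changed, which is why modest frame rates are not a concern for this kind of application.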
Next, I had to jump through some hoops to be able to bundle resources with the APK, access them from native space (for a more up-to-date & definitive answer on the best method for this, have a look here), as well as work around a 1MB decompression buffer/cache limit on Android 1.6. However, once that was sorted, what I ended up with - in the form of a 'hello world' calculator app - performed quite well and has stood the test of time, working on anything that I have tried over the years that runs Android 1.6 and up.
All this groundwork came in handy a few days ago when I decided I'd have a stab at porting StarTools. The "Hello World" calculator app implemented the same UI library that StarTools uses (which was created to be so light-weight that it runs on Palm devices with a 512KB heap). That UI library had evolved a good bit during StarTools' development (with the addition of new widgets, controls and a new font renderer), but the bulk of the code entry points and setup was still the same.
Perhaps the biggest difference was the use of 24-bit 8:8:8 RGB graphics, versus the old 16-bit 5:6:5 graphics. The latter would result in some serious banding issues, too severe for any serious image processing. Fortunately, all was properly abstracted in the code and the changes required were minimal.
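To see where the banding comes from, consider what 5:6:5 packing does to an 8-bit channel: red and blue keep only 5 bits (32 levels) and green 6 bits (64 levels), instead of 256 levels each. The following C sketch shows the standard packing; the function names are made up for this example.

```c
#include <stdint.h>

/* Pack 8:8:8 RGB into a 16-bit 5:6:5 pixel by dropping the low bits. */
static uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Expand the 5-bit red field back to 8 bits by replicating its top bits
   into the vacated low bits, so that 31 maps to 255 and 0 maps to 0. */
static uint8_t red_from_rgb565(uint16_t pixel)
{
    uint8_t r5 = (uint8_t)((pixel >> 11) & 0x1F);
    return (uint8_t)((r5 << 3) | (r5 >> 2));
}
```

Red values 96 through 103 all collapse to the same 5-bit value, so a smooth gradient in faint nebulosity turns into visible steps once it has been stretched.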
After adding a log2 function (shouldn't have relied on it being available - my bad), it booted!
After a few additional small fixes, implementing a makeshift 'load' function and resizing some controls for smaller screens, the app was working indistinguishably from the Desktop version.
The only snag was that I found that, when checking the number of cores in the about page, the number of cores detected varied between 1 and 4 for my Samsung Galaxy S3. Apparently I had run into this bug (feature?).
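One plausible explanation, offered here as an assumption rather than a diagnosis, is that the count reflects the cores that happen to be online while the device powers cores up and down. On POSIX-like systems the two notions can be queried separately, as in this C sketch (the wrapper function names are hypothetical):

```c
#include <unistd.h>

/* Number of cores the system is configured with; falls back to 1
   if the query is unsupported or fails. */
static long configured_core_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_CONF);
    return n > 0 ? n : 1;
}

/* Number of cores currently online; this can change at runtime on
   devices that hot-unplug idle cores to save power. */
static long online_core_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}
```

Whether `_SC_NPROCESSORS_CONF` stays stable depends on the C library. Some Android versions may report online cores for both values, so treat this as an illustration of the distinction, not a guaranteed fix.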
All-in-all, it took just 6 hours to port StarTools to Android, including setting up the Android SDK and NDK and getting the old 'Hello World' codebase to play nice with the small changes in the tools since 2009.
Now, StarTools makes for a very interesting example of a real-world application that heavily taxes a system in a great many 'real' ways - to me it is much more interesting than the synthetic benchmarks that are out there. Rather than testing only one aspect of computing or one task, we get to see if a heavy-duty image processing application feels/runs faster as a whole.
So, are current Android devices (and ARM CPUs by extension) ready for prime-time when it comes to astronomical image processing?
To find out, I pitted my netbook, an ASUS 1001PX (running Linux Mint 14 32-bit) against my phone, a Samsung Galaxy S III (running Android 4.2.1 CyanogenMod 10.1). Both sport 1GB of DDR2 RAM.
Full specs of the CPUs;
- N450 (Launch date Q1 2010), 1 Core (Hyper Threading), 1.66GHz, 32KB Instruction cache/24KB data cache, 512KB L2
- Exynos 4412 (Launch date Q2 2012), 4 Cores, 1.4GHz, 32KB Instruction cache/32KB data cache, 1MB L2
StarTools was compiled with MMX, SSE and SSE2 enabled for the Linux32 version, and VFP3 (where available) enabled for the Android version. As a test file, I used a 732x722 TIFF file derived from a B&W H-alpha stack of M42 acquired by Josh Lake (as used in this video).
To come up with a quick-and-dirty measure of StarTools' performance, I used an oldskool stopwatch to test 5 complex algorithms in StarTools, giving each algorithm 5 runs and averaging the results. I picked the algorithms for their complexity and their varying reliance on different aspects of CPU performance; integer arithmetic, floating-point arithmetic, memory access intensity and multi-threaded optimisation.
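The averaging step is nothing more than an arithmetic mean over the five stopwatch readings per algorithm. As a minimal C sketch (the function name is an assumption for this example):

```c
#include <stddef.h>

/* Arithmetic mean of a set of timed runs, in seconds.
   Returns 0.0 for an empty set rather than dividing by zero. */
static double mean_seconds(const double *runs, size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += runs[i];
    return count ? total / (double)count : 0.0;
}
```

Averaging over several runs smooths out both stopwatch reaction time and one-off effects such as background activity on the device.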
Now to appreciate the results, you need to understand that StarTools is not your average PhotoShop or The GIMP. First of all, every pixel is recorded as 32 bits x 3 (RGB) = 96 bits, rather than your standard 8 bits x 3 (RGB) = 24 bits. StarTools is shifting around 4 times as much memory before we even get started. Then there is the complexity of the algorithms - we astrophotographers are not easily impressed with Unsharp Mask. Deconvolution is where it's at! Shadows/Midtones/Highlights? Psssh. Think Retinex/Local Histogram Equalisation hybrids, etc. Not only is the arithmetic much more complex than the simple filters in something like PhotoShop, the amount of data that gets thrown back and forth is enormous. You need all the RAM and CPU cycles you can get.
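To put a number on that, the size of one uncompressed copy of an image follows directly from its dimensions and pixel format. A small C sketch, assuming three channels and whole-byte channel widths (the function name is hypothetical):

```c
#include <stdint.h>

/* Bytes needed for one uncompressed copy of an image.
   bits_per_channel is assumed to be a multiple of 8. */
static uint64_t image_bytes(uint32_t width, uint32_t height,
                            uint32_t channels, uint32_t bits_per_channel)
{
    return (uint64_t)width * height * channels * (bits_per_channel / 8u);
}
```

For the 732x722 test image, `image_bytes(732, 722, 3, 32)` gives 6,342,048 bytes, against 1,585,512 bytes at 8 bits per channel. Any algorithm that keeps several intermediate copies multiplies that figure again.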
So, without further ado, the results were as follows;
HDR Equalize
5.0 seconds for the Atom CPU
5.7 seconds for the ARM CPU
The HDR Equalize algorithm incorporates a number of different algorithms; a local histogram equalizer, a bicubic scaler, an anti-aliasing filter and a noise filter. The algorithms are all very memory and arithmetic intensive. The mathematical operations are a mix of floating point and integer operations. The algorithms are multi-core optimised.
Even though the Atom CPU only has a single core (with Hyper Threading), it still edges ahead on this test.
HDR Reveal
14.5 seconds for the Atom CPU
7.5 seconds for the ARM CPU
The HDR Reveal algorithm also incorporates a number of different algorithms; a local histogram equaliser, a local histogram optimiser, a bicubic scaler and an anti-aliasing filter. The algorithms are all very memory and integer-arithmetic intensive. A comparatively smaller number of floating point operations is used. The algorithms are multi-core optimised.
Here the ARM CPU pulls clearly ahead, finishing in roughly half the time, which is probably due to the brute force of the quad core powering through the integer arithmetic. The floating point operations, seemingly the Atom's forte, are too few and far between to let the Atom make up for its lack of concurrent processing.
Denoise Scale Extraction
19.6 seconds for the Atom CPU
10.7 seconds for the ARM CPU
The Denoise Scale Extraction step of the de-noising algorithm decomposes the image into a number of different detail scales. The algorithms used are both very memory and very integer math intensive. Very few floating-point operations are used. The algorithms are multi-core optimised.
Again, the ARM CPU makes use of its multiple cores to quickly get through the task at hand. Atom's single core is struggling with the load in comparison.
Life Isolate preset (after doing an Autodev)
2m30 for the Atom CPU
1m55 for the ARM CPU
The Life module's Isolate preset calculates and applies a Point Spread Function to the image in certain areas. The algorithms used are reasonably floating point, very integer and very, very memory intensive. The algorithms are multi-core optimised (3 cores max).
A pattern starts to emerge; the comparatively smaller number of floating point operations has the ARM CPU edge ahead. The intensity of the memory read and write operations levels the playing field a little, while not all cores are used (intentionally, because of the memory bottleneck).
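Capping the thread count for memory-bound work, as with the 3-core maximum mentioned above, amounts to taking the smaller of the detected core count and a per-algorithm limit. A minimal C sketch with a hypothetical function name:

```c
/* Hypothetical helper: pick the number of worker threads for a stage,
   never exceeding max_workers and never returning less than 1 when
   max_workers is at least 1. */
static unsigned worker_count(unsigned detected_cores, unsigned max_workers)
{
    if (detected_cores == 0)
        detected_cores = 1;   /* treat a failed detection as one core */
    return detected_cores < max_workers ? detected_cores : max_workers;
}
```

On a quad core with a limit of 3, `worker_count(4, 3)` leaves one core idle on purpose, since extra threads would mostly wait on memory.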
Synth, scope creation, Newtonian preset
6.0 seconds for the Atom CPU
10.3 seconds for the ARM CPU
The Synth module allows the user to create a virtual telescope model and apply its corresponding point spread function to any point lights in the image, augmenting those point lights with physically-correct modeled diffraction spikes. The algorithms used to preview the point spread function and virtual scope are very floating point intensive, with a moderate amount of general purpose memory manipulation. The algorithms are mostly non-multicore optimised.
Mostly single-threaded code and lots of floating point operations mean that Atom has this one in the bag.
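The five timings above are easier to compare as ratios. The following C sketch tabulates them (with the minute-based Life result converted to seconds) and computes how much sooner the ARM device finished. The struct and function names are made up for this example.

```c
/* One row per benchmark, using the timings reported above. */
typedef struct {
    const char *name;
    double atom_seconds;
    double arm_seconds;
} benchmark_row;

static const benchmark_row results[] = {
    { "HDR Equalize",               5.0,   5.7 },
    { "HDR Reveal",                14.5,   7.5 },
    { "Denoise Scale Extraction",  19.6,  10.7 },
    { "Life Isolate",             150.0, 115.0 },  /* 2m30 vs 1m55 */
    { "Synth (Newtonian)",          6.0,  10.3 },
};

/* A value above 1.0 means the ARM device finished sooner than the Atom. */
static double arm_speedup(const benchmark_row *row)
{
    return row->atom_seconds / row->arm_seconds;
}
```

The ratios work out to roughly 0.88, 1.93, 1.83, 1.30 and 0.58 respectively, so the ARM device is ahead in three of the five tests and behind in the two with the heaviest floating point load.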
Discussion of the results
A clear trend emerges from the tests;
- The Achilles' heel of the Exynos 4412's Cortex A9-based design is floating point operations.
- Multiple cores on ARM are very useful in situations where lots of integer math needs to be performed, in which case the Exynos runs rings around the Atom CPU.
- Memory intensive tasks limit the usefulness of multiple cores on ARM (as they seem to be mostly waiting to get access).
- In situations where tasks are equally memory, floating-point math and integer math intensive, and as long as the tasks are multi-threading friendly, the quad core ARM performs roughly on par with the single core (HT) Atom.
So, to answer the question 'are current Android devices (and ARM CPUs by extension) ready for prime-time when it comes to something like astronomical image processing?', the answer should be an unequivocal 'yes'.
My overall impression is that StarTools felt just as fast on the Atom as it did on the Exynos 4412 and the benchmark results bear that out.
Much has been said about the need for putting multiple cores in smartphones, with some claiming they are a waste. However, with increasingly complex content creation applications like StarTools, they really do make a difference and are not just a gimmick. Integer arithmetic is still ARM's strong point, while floating point is clearly still its downfall. The Cortex A15 design is supposedly addressing the latter, so if anyone wants to donate a Chromebook to port StarTools to :), I'd be happy to redo my tests!
All said and done however, the most limiting issue is currently still the available RAM (a mere 1GB in the case of my S3, but 2GB is slowly becoming the new standard) in Android devices. Even so, there is not much scope to go beyond these memory amounts, as the ARM instruction set can still only address 32 bits' worth of locations. Of course, 64-bit ARM processors are on the horizon, while A15 cores can at least use more than 4GB system-wide (but not per app). The lack of a swap file/partition on standard Android installs further reduces the amount of memory StarTools has available (this can be addressed though if you're feeling particularly adventurous), since everything (e.g. system + other apps) needs to reside in RAM along with StarTools. Long story short, what's holding us back right now is memory, but certainly not performance!
Even today's generation of ARM devices (running Android) shows promise as a capable processing platform. Admittedly to my surprise, my 6-month old phone actually outperformed my 2-year old netbook in a number of areas. I am pleased that all the hard work of being vigilant and mindful of other architectures has paid off - had I been 'lazy' and just stuck with floating point all the way for StarTools, instead of switching back to integer wherever possible, then performance would have demonstrably suffered on ARM.
More importantly, though, it seems that the day when you can buy a cheap $40 TV dongle and start processing images from your cheap $50 consumer digital camera or $5 webcam is almost upon us!
You can find the Tech Demo APK in the download section.
You can find the M42 test image that was used for the benchmark here.
Copyright notice: the Android robot is reproduced from work created and shared by Google and used according to terms described in the Creative Commons 3.0 Attribution License.