Chasing demons on Android

This was originally posted over at sinch.com, however I’m reposting it here since I wrote it

So, debugging native code on Android… not the most pleasant experience I’ve had.

As you may know, Sinch has this neat Android and iOS SDK for doing VOIP (voice over IP) and IM (instant messaging). To reduce the amount of code they both share, we have written a common layer in C++ that is then cross-compiled for each platform, which is then wrapped in a thin layer of Java/Obj-C code. One of our big third-party dependencies is boost, which gives us shared_ptr, bind and other nice stuff (since we haven’t migrated to C++11 yet).

We’ve been having a not-really-reproducible crash hanging over our heads on Android for quite some time that finally ended up on my table last week. It was a lot of random crashes that seemed to point to our JNI layer handling the Java <> C++ integration.

In the beginning, it looked like we were doing something wrong in our memory management, and deleted objects that we still had pointers pointing to. This stemmed mostly from the fact that our platform specific wrapper code is what decides when certain C++ objects are no longer needed. At first glance, this code looked rather suspicious:

However, after some investigation, I concluded that it is actually doing what it’s supposed to (keeping otherSharedPtr alive until s is deleted).

Up until this point, we didn’t have any clear steps to reproduce the issue. Based on the reports available, I set up a stress test that resulted in a fair bit of memory pressure and that spawned a few threads. Finally, the issue was more or less reproducible within a few minutes.

First step: gdb

After getting GDB setup, I started running my test case again, hoping for similar repeated backtraces. Unfortunately, when it finally crashed (it took a bit longer while running under a debugger), it would only give me single frame backtraces. Interestingly, adb logcat would still give more or less accurate backtraces once I exited the debugger. Why? No idea…

Sinch my gdb-foo was clearly weak, I went back to looking through the saved up backtraces running with verbose logging to see what they had in common. In more than a few it always crashed when we were trying to do something with Session objects (the same one mentioned earlier). However, after provoking a few more crashes, the real pattern emerged: instead of Invalid indirect reference 0x42157c30 in decodeIndirectRef (SEGV) (which I saw in the few old reports concerning this issue), I started seeing more and more of the same kind:

Heap corruption. Great.

Enter: valgrind

Valgrind is this magnificent monster when it comes to… well, most any thing you might want to do to your application or library. To get the most out of valgrind, you want debug symbols for your code. Easy, just compile AOSP from scratch! There are plenty of good posts on the subject; I went with this guide.

After spending a few hours on cloning, setting up the build environment and then building AOSP, I noticed I missed the step where you enable the use of the proprietary drivers… No worries, just rebuild… For reference, a build from scratch took roughly 2hrs on the i7 w/ 8GB RAM and SSD I used.

I set valgrind up, more or less like this.

Finally, it was time to run my test case through this magnificent monster! I started it via the script and waited anxiously to see what would happen. BOOM. SIGILL. Doh! Some googling, a fresh valgrind checkout from source, and off we go again! This time it would ran for a bit longer, but still crashed and burned in a glorious fire. Turns out, ART and valgrind do not make the best of friends… Checking out the latest version of AOSP was a mistake since it only supports ART as the runtime, instead of tried and true dalvik.

At this point, I’m royally tired of building stuff from scratch, so I download the factory images for the Nexus 4, flash it and hope I haven’t messed the phone up for good with my engineering build. After waiting on the boot screen for what felt like an eternity, it finally boots up in working condition. I ended up just frankensteining valgrind in there and forgetting about symbols for everything. After setting up valgrind yet again, I started my test app and waited… No crash while launching the app! Yay! I start my test case and valgrind starts spitting out errors. However, the errors are in libdvm, /dev/ashmem and other non-symbolicated files. Easy enough to ignore (or suppress via valgrind if you like that sort of thing). And then I wait. From my experience with running it under gdb I know it might take a while longer to see the error, so I wait patiently for an hour… and then two… and three. After four hours, I begin to accept what I already knew at some level. It’s a race condition somewhere, and gdb and valgrind slows everything down so much it’s probably not going to happen very often, or at all.

After a fair bit of self-doubt and questioning what I’m doing, I go back to looking at my saved up backtraces. What I notice now is that quite a few of the backtraces involve boost. It’s not that surprising, since we use plenty of shared_ptrs, but at this point I’m willing to blame someone else for a while. A bit of searching later and I stumble upon libtorrents bugtracker….

Turns out that boosts shared_ptr uses a lock-free platform specific implementation by default. Recompiling boost with BOOST_SP_USE_PTHREADS seems to have fixed all of the weird crashes. I found a few old references mentioning issues with certain ARM CPUs. Without actually deep-diving and verifying if there is indeed a problem in either shared_ptrs spinlock on ARM, or the stdlib shipped on Android devices, I’m quite content with saying that at least we are not seeing any more crashes :) (and yes, the thought that using pthreads is slowing something down to avoid the race-condition crossed my mind…)

TL;DR

If you are using boost::shared_ptr on Android, compile with BOOST_SP_USE_PTHREADS to avoid troubles.