Tuesday, January 24, 2006

Thread Safety is hard to maintain

I am constantly amazed by how hard it is to write thread safe code and have it stay thread safe. This week I had the pleasure of debugging two very different thread safety problems, variations on the theme with very different origins.

Actually, as I think about it, the origins of the problems are fairly similar. In both cases, the root cause was a failure to document the thread safe architecture appropriately. In the first case, the degradation was gradual: the bugs were introduced slowly over time as multiple developers who did not understand the architectural requirements added or modified functionality. The worst part is that reference counting errors in which references are never released are nearly impossible to find. The same holds true for race conditions caused by a failure to apply read/write locks or mutexes consistently.
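To make the leaked reference case concrete, here is a minimal sketch in Python. It is not the code from the system I was debugging; the RefCounted class and both lookup functions are hypothetical names invented for illustration. It assumes a manually counted object and a later change that added an early return without a matching release.

```python
import threading


class RefCounted:
    """Hypothetical manually reference-counted resource (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.refs = 1        # the creator holds the first reference
        self.freed = False

    def acquire(self):
        with self._lock:
            self.refs += 1

    def release(self):
        with self._lock:
            self.refs -= 1
            if self.refs == 0:
                self.freed = True   # stand-in for actually freeing the resource


def lookup_leaky(obj, key):
    obj.acquire()
    if key is None:
        return None          # early return added later: the reference is never released
    obj.release()
    return key.upper()


def lookup_fixed(obj, key):
    obj.acquire()
    try:
        if key is None:
            return None
        return key.upper()
    finally:
        obj.release()        # runs on every exit path


leaky = RefCounted()
lookup_leaky(leaky, None)
leaky.release()              # the creator drops its own reference

fixed = RefCounted()
lookup_fixed(fixed, None)
fixed.release()

print(leaky.refs, leaky.freed)   # 1 False
print(fixed.refs, fixed.freed)   # 0 True
```

Nothing fails in the leaky version. The object simply never goes away, which is why a test suite that only checks return values will not notice it.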

These errors are extremely easy to introduce, and unless the code is audited very thoroughly, even a well written test suite is unlikely to trigger them. Deadlocks and undercounted references are much easier to identify because they typically result in a hard error.

The second problem was even more hideous. The architect made an assumption about the order in which threads would be allocated in an application. It never occurred to her/him that the libraries linked into the application might one day be multi-threaded themselves. A data structure that was meant to allow thread safe access to objects by allocating a single bit in a table to each thread ended up, due to math truncation, associating multiple threads with the same bits. The end result was that, while not a single line in the program changed, the thread safety of the application went out the window.
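A rough sketch of how this kind of collision can happen, again in Python and again with invented names. The table width (TABLE_BITS), the slot_mask function, and the thread numbers are all assumptions for illustration, and the truncation is modeled as simply dropping the high bits of the thread number. The real data structure may well have truncated differently.

```python
TABLE_BITS = 8   # assumed table width: one ownership bit per thread


def slot_mask(thread_number):
    """Map a thread number to its bit; high bits are silently truncated."""
    index = thread_number & (TABLE_BITS - 1)   # keeps only the low 3 bits
    return 1 << index


def colliding_threads(thread_numbers):
    """Return pairs of thread numbers that ended up sharing a bit."""
    owner_of = {}
    collisions = []
    for n in thread_numbers:
        mask = slot_mask(n)
        if mask in owner_of:
            collisions.append((owner_of[mask], n))
        else:
            owner_of[mask] = n
    return collisions


# What the design assumed: the application's threads are numbered 0..7.
as_designed = list(range(8))

# Assumed scenario: a library starts four threads of its own in the middle,
# so the application's later threads are numbered 8..11 instead of 4..7.
with_library_threads = [0, 1, 2, 3, 8, 9, 10, 11]

print(colliding_threads(as_designed))            # []
print(colliding_threads(with_library_threads))   # [(0, 8), (1, 9), (2, 10), (3, 11)]
```

The application code is identical in both runs; only the numbering of the threads changed. A range check on the thread number at the point where the bit is computed would have turned the silent sharing into a hard error.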

I don't know if better documentation would have helped in this second instance, but it is clear that heavily used production servers started to bounce because of an undocumented assumption.

My advice to anyone who is developing and maintaining multi-threaded systems is to be careful. What looks on the surface like a trivial change may very well disrupt a delicately balanced system. Whenever possible, find someone else who understands the architecture of the system to review each and every line, paying careful attention to the expansion of macros and the side effects that function calls might have.