The threat of threads – how do you avoid threading bugs?
Improving an application's performance can often only be achieved by splitting the code into multiple threads, which is the only way to exploit the computational horsepower of a multi-core processor. However, introducing threads may cause race conditions. Threads are often not executed in the order in which they were created, and they may run at different speeds as they race to completion. If this is not properly accounted for, the result is unexpected and incorrect program behavior.
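To make the race concrete, here is a minimal sketch in Python using the standard `threading` module. The names `Counter`, `use_lock`, and `run` are hypothetical and chosen only for this illustration. Several threads increment one shared counter; without a lock, the read and the write of an increment can interleave with another thread's, so updates may be lost.

```python
import threading


class Counter:
    """Shared counter. `use_lock` is a hypothetical switch for this demo."""

    def __init__(self, use_lock):
        self.value = 0
        self._lock = threading.Lock() if use_lock else None

    def increment(self):
        if self._lock is None:
            # Unprotected read-modify-write: another thread may run between
            # the read and the write, and its update is then overwritten.
            current = self.value
            self.value = current + 1
        else:
            # The lock makes the read-modify-write one indivisible step.
            with self._lock:
                current = self.value
                self.value = current + 1


def run(use_lock, n_threads=4, n_increments=50_000):
    """Increment one counter from several threads and return the final value."""
    counter = Counter(use_lock)

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Calling `run(True)` always returns `n_threads * n_increments`. Calling `run(False)` may return a smaller number, and whether it does on a given run depends on the interpreter and the scheduler, which is exactly why such bugs are hard to reproduce.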
Deadlocks occur when threads wait on each other indefinitely, causing the program to hang. Since the rate at which threads execute depends on the operating system's scheduler and the system load, such threading bugs are almost always hard to detect, hard to reproduce, and hard to isolate. As a result, threading bugs are notorious for being found late in the design cycle, which makes them very costly to repair and may even lead to very painful recalls.
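One common way to rule out the kind of deadlock described above is to make every thread acquire its locks in the same global order. The following is a sketch under stated assumptions, not a definitive recipe: `Account`, `transfer`, and `run_opposing_transfers` are hypothetical names, and ordering the locks by `account_id` is just one possible choice of order.

```python
import threading


class Account:
    """Hypothetical account with its own lock, used only for this sketch."""

    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move `amount` from src to dst while holding both account locks."""
    if src is dst:
        raise ValueError("src and dst must be different accounts")
    # Always lock the account with the lower id first. Two opposite
    # transfers then cannot each hold one lock while waiting for the other.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_opposing_transfers(rounds=10_000):
    """Run a->b and b->a transfers concurrently and return both balances."""
    a = Account(1, 1000)
    b = Account(2, 1000)

    def a_to_b():
        for _ in range(rounds):
            transfer(a, b, 1)

    def b_to_a():
        for _ in range(rounds):
            transfer(b, a, 1)

    threads = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance
```

If `transfer` instead locked `src` first and `dst` second, the two threads could each grab one lock and wait forever for the other. Whether that happens would depend on timing, so the program might pass many test runs before hanging.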
In 2010, Toyota recalled several million vehicles because of faulty accelerator pedals and software, costing the company $2B. Intel's Pentium floating-point divide bug caused the most famous recall in the semiconductor industry, resulting in Intel taking a charge of half a billion dollars. The damage to Toyota's and Intel's reputations was probably even greater.
Bugs can be very expensive, and the cost of fixing them increases the further you get in the software development cycle. A bug found during system integration is much more expensive to fix than one the programmer finds while writing the code. A bug found in the field is much more expensive to fix than one found during release-candidate testing. As witnessed by Intel and Toyota, the difference in cost can be several orders of magnitude.

Threading code is like spinning plates?
Threading bugs don't come out of nowhere though; they are coded by software engineers. Engineers know this, of course, and therefore often simply avoid writing multi-threaded code. A few years back, a colleague of mine wanted to speed up his processor simulator by partitioning the code into multiple threads. After he added the threading library calls in what seemed to be the appropriate places, the simulator ran significantly faster. During extensive testing, however, it turned out that the additional threading constructs caused the simulator to operate incorrectly and sometimes hang. My colleague never really got to the source of this problem. Months later he ended up simply stripping out the threading code and trying to find different ways to speed up his simulator.
This is not always an option though. Due to consumers' insatiable appetite for faster and more efficient code, and the growing number of multi-core systems on the market, programmers increasingly cannot avoid multi-threading their code.
I believe programmers often can't grasp the complexity of splitting up large chunks of code in a correct and efficient manner. The brain simply isn't big enough and isn't built for tasks that require keeping track of hundreds or thousands of small pieces of information at the same time. Without highly automated analysis and partitioning tools, parallelization is a lost cause, and that is why I believe Vector Fabrics is addressing a key market need.
How do you avoid those hard-to-find threading bugs?