I recently ran the gold linker under Thread Sanitizer. It’s a nice plugin for Valgrind which looks for race conditions in multi-threaded programs. To describe it briefly, it builds Happens-Before relationships based on mutex operations and warns when it notices a write and a read/write to the same memory location without a Happens-Before relationship. This approach can yield false positives to be sure, but it does a very nice job of identifying real problems.
It was able to identify one real bug in gold, one problem that led to less efficient link time, and several cases where several threads would set a shared memory location to the same value. The latter cases are not a problem on x86 architectures, though they could be on other processors. In order to get clean Thread Sanitizer results in the future, I fixed all of the cases so that I could get a clean run of gold, at least with the default settings.
The real bug that it found was a typical multi-threaded bug: the code looked fine, but it had a well-hidden error. Gold uses a workqueue of tasks to execute, with a pool of worker threads. Many of the tasks are run using a blocker token. The blocker token is set to the number of tasks that precede it. As each task completes, it decrements the blocker count. When the count goes to zero, the next set of tasks can start. This is a simple way to parallelize linker operations, in which one set of operations (e.g., process the symbol tables) must be run before the next set (e.g., process the relocations) can begin. Naturally I paid close attention to the blocker behaviour when a task completed, and there were no problems there. The problem arose in setting the blocker count when the tasks started. The code was doing a loop of “increment the blocker count” and then “queue the task.” What I forgot was that the process of queuing the task actually lets another thread in the pool start working on it immediately. When the task completed, it decremented the blocker count with a lock. But if the task completed fast enough, the initial code was still running the loop queuing up new tasks, and thus incrementing the blocker count. I didn’t think that I needed to lock the increment, since I wasn’t expecting any task to actually complete before I started all of the tasks. A dumb mistake—just the kind of mistake one makes in multi-threaded programming.
Gold is written in C++. In Go I would of course have each task communicate its completion on a channel. The locking would be handled by the runtime, and there would be no chance for me to make the same sort of error. If you write multi-threaded code, and you can’t use Go, you should definitely check out Thread Sanitizer.
Leave a Reply
You must be logged in to post a comment.