Multi-cores are here, and they are here to stay. Industry trends show that each individual core is likely to become smaller and slower (see my post to understand the reason). Improving performance of a single program with multi-core requires that the program be split into threads that can run on multiple cores concurrently. In effect, this pushes the problem of finding parallelism in the code to the programmers. I have noticed that many hardware designers do not understand the MT challenges (since they have never written MT apps). This post is to show them the tip of this massive iceberg.
Update 5/26/2011: I have also written a case study for parallel programming which may interest you.
Why finding parallelism is hard?
Some jobs are easy to parallelize, e.g., if it takes one guy 8 hours to paint a room, then two guys working in parallel can paint it in four hours. Similarly, two software threads can convert a picture from color to grayscale 2x faster by working on different halves of the picture concurrently. Note: Programs that fall in this category are already being parallelized, e.g., scientific computing workloads, graphics, photoshop, and even open-source apps like ImageMagick.
There also other programs that are sequential in nature, e.g., two guys will not be able to cook 2x faster than one guy because the task isn’t fully parallelizable: there are inter-task dependencies and the cooks end up waiting for each other at times. Unfortunately, a lot of programs have artificial inter-task dependencies because the programmers wrote them with an ST mind set. For example, consider this code excerpt from the H.264 reference code (I have removed unnecessary details to highlight my point):
macroclock_output_data mb; //Global variable for (...) // The OUTER loop decode_slice(...); if(mb->last_mb_in_slice) break; void decode_slice(...) ... mb = ...
Notice how the variable mb is written every iteration and no iteration uses the mb written by the previous iterations. However, mb was declared as a global variable probably to avoid its repeated allocation and deallocation. This is a reasonable ST optimization. However, from an MT standpoint, the loop iterations of the OUTER loop now have a dependency among each other and they cannot be run in parallel. To parallelize this code, the programmer has to first identify that the dependency is artificial. He/She then has to inspect 1000s of lines of code to ensure that this assumption isn’t mistaken. Lastly, he/she has to change the entire code to make mb a local per-iteration variable. All this is difficult to achieve (I parallelized H.264 for this paper).
So here is the status: Leaving the artificial dependencies in the code limits parallelism, mistakenly removing a real one breaks the program, and reaching the perfect balance requires prohibitive effort. Since its hard to identify all dependencies correctly the first time, they make errors and debugging begins.
Why is debugging difficult?
Debugging multi-threaded code is very hard because bugs show up randomly. Consider the following:
Say two threads T0 and T1 need to increment the variable X. The C/C++/JAVA code for this will be
X = X + 1
Their assembly code will look as follows: (instructions are labeled A-F).
|A||Load X , R0||D||Load A, R0|
|B||Increment R0||E||Increment R0|
|C||Store R0, X||F||Store R0, A|
The programmer wants X to be incremented by 2 after both threads are done. However, when the threads run concurrently, their instructions can interleave in any order. the final value of X depends on the interleaving (assume X was 0 before the two threads tried to increment). For example,
ABCDEF: X = 2 (correct)
DEAFBC: X=1 (incorrect)
ADBCEF: X = 1 (incorrect)
Basically, there is a dependency that D shall not execute before C (or A should not happen before F). The programmer has missed this dependency. However, the code does work fine half the times making it impossible to track the bug or test a fix. Moreover, traditional debugging techniques like printf and gdb become useless because they perturb the system, thereby changing the code behavior and often times masking the bug.
Why is optimizing for performance so important and challenging?
The sole purpose of MT is performance. It is very common that the first working version of the parallel code is slower than the serial version. There are two common reasons:
Still too many dependencies (real or artificial): Programmers often iteratively remove dependencies and sometimes even re-write the whole program to reduce these dependencies.
Contention for a hardware resource: Threads can also get serialized if there is contention for some hardware resource, such as a shared cache. Programmers have to identify and reduce this contention. Note that identifying these bottlenecks is especially challenging because hardware performance counters are not reliable.
After several iterations, the code becomes good enough for the performance target.
The work does not end here..
Unlike ST code which would get faster every process generation, MT code has complex non-deterministic interactions that can make its performance swing long ways when hardware changes. For example, I had a branch-and-bound algorithm (a 16-Puzzle solver) which would slow down with more cores because the algorithm would end up on a different path when more threads were running. Even a simple kernel like histogram computation can behave very differently with different inputs or machine configuration (see this paper). Thus parallel programmers are also burdened with the task of making their code robust to changes.
My goal here was not to teach parallel programming but merely provide a flavor of what it takes to write a good parallel program. It is indeed a complex job and I assert that it is not possible to appreciate the challenges without actually writing a parallel program. For those who have not written one yet, here is my call for action:
Write a parallel program to compute the dot-product of two arrays and get it to scale perfectly, 4x speedup on 4-cores, etc. It is simple but you will learn more than you expect.
Search the keywords: pthreads or winthreads to learn the syntax on linux or windows respectively. Share your experiences!
Update 5/30/2011: Thank you everyones for your feedback and readership. I have written a third post on parallel programming. It describes a profile-based solution for choosing the best task granularity in parallel programs.