Writing parallel code is all about finding parallelism in an algorithm. What limits parallelism are the dependencies among different portions of the code. Understanding the dependencies in a program early on can help programmers (a) determine the amount of available parallelism, and (b) choose the best parallel programming paradigm for the program. In this post, I try to lay out a taxonomy of dependencies in loops and how it plays into parallel programming.
What is my definition of a task?
It is any piece of work that needs to be done and often takes more than one instruction. The instructions in a task are closely coupled such that it logically makes sense to group them together.
What are task dependencies?
There are two types of dependencies at the task granularity.
Data Dependency: A task K is said to be data-dependent on task J if K needs data generated by J. For example,
J: foo = bar + 3;
K: lama = foo + 3;
Note that this dependency is ordered: the dependent task K can neither run before J nor in parallel with J.
Un-ordered Dependency: Two tasks J and K are said to have an un-ordered dependency if they both read-modify-write the same data.
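A minimal sketch in Python (the names J, K, and count are illustrative): both tasks read-modify-write the same shared counter.

```python
count = 0

def J():
    global count
    count += 1  # read-modify-write of the shared count

def K():
    global count
    count += 1  # read-modify-write of the same count

# Either serial order gives the same result (count == 2), so the
# dependency is un-ordered. But a parallel interleaving of the
# reads and writes could lose an update, so J and K must not run
# at the same time.
J()
K()
```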
Note that J and K can be processed in any order but they cannot be processed in parallel.
Thus, two tasks can run concurrently only if they are independent. The goal in parallel programming is to remove as many dependencies as possible by refactoring the code or the algorithm.
How does it apply to loops?
Let's use the following loop as an example:
for i = 1 to N:
    A(i)
    B(i)
    C(i)
A single core will run the loop in this order: A0, B0, C0, A1, B1, C1, A2, … (A0 = task A in iteration 0, B0 = task B in iteration 0, etc.)
Loops have two types of dependencies:
Intra-Iteration: Typically, tasks within an iteration of the loop are dependent on each other for data. For example, B0 depends on A0.
Inter-iteration: Sometimes different loop iterations share data or hardware resources, which creates ordered or un-ordered dependencies among them. For example, A1 depends on A0.
How does it apply to parallel programming?
I have seen four common types of loops.
1. No parallelism exists in a loop where every task is data-dependent on the previous tasks. Programmers should refactor the code to remove those dependencies before writing any parallel code. If no dependencies can be removed, then the loop should be left untouched.
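As an illustrative sketch (the recurrence itself is made up), each iteration consumes the value produced by the previous one, so no two iterations can overlap:

```python
def iterate_map(x0, n, c=0.5):
    """Compute the recurrence x_{i+1} = x_i**2 + c for n steps."""
    x = x0
    for _ in range(n):
        # this iteration needs the x produced by the previous
        # iteration, so the loop is inherently serial
        x = x * x + c
    return x
```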
2. Loops with independent iterations are easy to parallelize. In this case, programmers should consider using SIMD and watch out for false sharing and off-chip bandwidth.
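A sketch of such a loop using a thread pool for illustration (in CPython the GIL limits CPU-bound speedup from threads; in practice you would reach for a process pool, SIMD intrinsics, or a vectorizing compiler, but the structure is the same):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # each iteration reads and writes only its own element, so
    # iterations can run in any order, or all at once
    return x * x

def parallel_squares(xs):
    with ThreadPoolExecutor() as pool:
        return list(pool.map(square, xs))
```

Because the iterations touch disjoint data, this is exactly the shape of loop a compiler can also vectorize with SIMD instructions.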
3. In a typical loop, some tasks are independent of other iterations while other tasks have inter-iteration dependencies. In my experience, such dependencies are usually un-ordered, which means they can be enforced using critical sections. For example, if task B in the loop above carries such a dependency, it should be put inside a critical section.
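A sketch of this pattern in Python (the tasks and names are illustrative): task A has no inter-iteration dependency and runs freely in parallel, while task B's un-ordered update of a shared total is enforced with a lock.

```python
import threading

total = 0
lock = threading.Lock()

def iteration(i, data):
    a = data[i] * 2        # task A: independent across iterations
    global total
    with lock:             # task B: un-ordered inter-iteration
        total += a         # dependency, enforced with a critical section

def run_loop(data):
    threads = [threading.Thread(target=iteration, args=(i, data))
               for i in range(len(data))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```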
4. If you cannot remove ordered dependencies, borrow a lesson from hardware designers, i.e., use pipeline parallelism. For example, the loop above is well-suited for pipeline parallelism: one thread can process all instances of A in order, a second thread can be in charge of processing the Bs, and a last thread can be in charge of processing the Cs. Note that if B is substantially longer than A or C, it is possible to use multiple threads for the "B-stage," but the Bs will then complete out of order and C will need a re-ordering structure to put them back in order.
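A minimal Python sketch of such a pipeline (the stage and pipeline names are illustrative): one thread per stage, with queues carrying each iteration's result to the next stage in order, and a None sentinel to shut the pipeline down.

```python
import queue
import threading

def stage(fn, inq, outq):
    # one thread per stage: consume items in order, apply this
    # stage's task, and hand each result to the next stage
    while True:
        item = inq.get()
        if item is None:     # sentinel: propagate shutdown downstream
            outq.put(None)
            return
        outq.put(fn(item))

def pipeline(items, stage_fns):
    qs = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    threads = [threading.Thread(target=stage, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stage_fns)]
    for t in threads:
        t.start()
    for item in items:       # feed iterations into the first stage
        qs[0].put(item)
    qs[0].put(None)
    results = []
    while (r := qs[-1].get()) is not None:
        results.append(r)
    for t in threads:
        t.join()
    return results
```

With three stage functions A, B, and C, up to three iterations are in flight at once while each stage still sees its inputs in program order.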
I am sure there are many other types of loops out there. Other examples?