To maximize concurrency, all threads should be programmed to complete their work at the same time. Balancing the load among threads requires the programmer to predict the latency of each task, which is often impossible due to unpredictable OS/hardware effects. Consequently, programmers split the work into small tasks and use work-queues to distribute the work dynamically. Work-queues exist in many programming frameworks, such as Grand Central Dispatch, Intel TBB, Cilk, and OpenMP. While work-queues improve load-balancing, they introduce the overhead of adding and removing tasks to and from the work-queue. Thus, if each individual task is too small, the work-queue overhead becomes prohibitive, and if tasks are too long, there is a risk of imbalance. This post (1) analyzes these trade-offs, (2) provides a method for choosing the best task size at run-time, and (3) explains some recent advances in work-queues that minimize their overhead.
Need for dynamic partitioning
Suppose there are four tasks (T1-T4) that need to be completed by two threads. Now suppose that the tasks have the following execution times.
[Table: Task Name vs. Time (units) for tasks T1-T4]
These tasks will run on a single thread as follows:
For simplicity, assume that these tasks are completely independent. In parallel programming, if we naively assign T1 and T2 to Thread 0 and T3 and T4 to Thread 1, they will run on two threads as follows:
Notice how Thread 1 sits idle while Thread 0 is still working, which reduces the benefit of multi-threading.
Now suppose we use dynamic partitioning. In dynamic partitioning, the four tasks are added to a to-do list (known as the work-queue). Each time a thread becomes idle, it picks up the next task in the work-queue and processes it. Our four tasks will run as follows with dynamic partitioning:
At the start, Threads 0 and 1 pick up T1 and T2 respectively. When Thread 0 finishes T1, it starts work on T3. Both threads become idle again at the same time and Thread 1 picks up T4. Notice how the work finishes a lot faster with dynamic partitioning.
Implementing dynamic partitioning
Each thread in dynamic partitioning is a generic worker thread. It runs a simple loop as follows:
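A minimal sketch of that loop in C (the names `task_t`, `work_queue_t`, and `dequeue` are illustrative; the queue itself is sketched in the next section):

```c
/* A generic worker: repeatedly pick up the next pending task and run it. */
typedef struct task {
    void (*run)(void *);     /* the work to perform */
    void *arg;               /* its argument */
} task_t;

typedef struct work_queue work_queue_t;   /* defined below */
int dequeue(work_queue_t *q, task_t *t);  /* returns 0 once the queue is empty */

void *worker_thread(void *arg)
{
    work_queue_t *q = arg;
    task_t t;
    while (dequeue(q, &t))   /* grab the next task from the work-queue... */
        t.run(t.arg);        /* ...and process it */
    return NULL;
}
```

In this sketch an empty queue means the kernel is done; a real runtime would typically block or spin instead of exiting.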
A work-queue implements an enqueue operation to add a task to the queue and a dequeue operation to remove one. A work-queue can be implemented using many different data structures. The simplest is a bounded buffer: an array with head and tail pointers. Since the queue is read and written by multiple threads, the operations need to be wrapped inside locks (or made thread-safe some other way).
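For instance, a lock-protected bounded buffer might look like this (a sketch of one plausible implementation, not the only way to make the queue thread-safe):

```c
#include <pthread.h>

#define QSIZE 1024

/* Bounded buffer: head is the next task to dequeue, tail the next free
 * slot, and a single mutex makes both operations thread-safe. */
struct work_queue {
    task_t buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
};

int enqueue(work_queue_t *q, task_t t)
{
    pthread_mutex_lock(&q->lock);
    int ok = (q->count < QSIZE);          /* refuse if the buffer is full */
    if (ok) {
        q->buf[q->tail] = t;
        q->tail = (q->tail + 1) % QSIZE;
        q->count++;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

int dequeue(work_queue_t *q, task_t *t)
{
    pthread_mutex_lock(&q->lock);
    int ok = (q->count > 0);              /* any task pending? */
    if (ok) {
        *t = q->buf[q->head];
        q->head = (q->head + 1) % QSIZE;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}
```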
Several optimizations have been proposed to reduce the contention for the work-queue. The most prominent is work-stealing, which I discuss later.
What is the best task size?
This brings us to a very interesting challenge. The size of each task is a very important design decision because there are two competing goals:
1. We want to make each task as small as possible for best load-balancing. The longer the tasks, the longer we can get stuck waiting for that last thread.
2. We want each task to be longer to better amortize the work-queue overhead.
The following is a first-order equation that describes this phenomenon.
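Assuming the enqueue/dequeue overhead is paid once per task and that, at the end of the kernel, a fraction K of one task runs with no other work to overlap it, the parallel execution time is:

$$Time_{parallel} = \frac{ST\_cycles}{P}\left(1 + \frac{dq\_enq}{task\_length}\right) + K \times task\_length$$

and the speedup is simply $ST\_cycles / Time_{parallel}$,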
where:
- task_length: average task length in cycles
- dq_enq: sum of the dequeue and enqueue latencies
- P: the number of worker threads (cores)
- ST_cycles: the number of cycles the kernel takes as a single thread
- K: the fraction of a task that will need to run alone at the end of the kernel
If we differentiate w.r.t. task_length, equate it to zero, and solve for task_length, we get:
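$$task\_length_{optimal} = \sqrt{\frac{ST\_cycles \times dq\_enq}{P \times K}}$$

Intuitively, the overhead term falls as $1/task\_length$ while the imbalance term grows linearly with it, so the optimum lies at their crossover.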
The graph below shows the impact of task_length on overall speedup. For this graph, I set:
dq_enq = 70        // measured on my Core 2 Quad using something similar to the FastForward queue
P = 10             // a hypothetical machine with 10 cores
K = 0.5            // a guess based on some empirical data
ST_cycles = 100000
Notice how increasing the task size is a win initially, but continuing to increase task_length eventually becomes a loss. Thus, while reducing the overhead of the queue can allow shorter tasks, choosing the actual task size is still a difficult decision.
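The model is also easy to evaluate numerically. Here is a small sketch, using the same hypothetical parameters as the graph, that prints the closed-form optimum and sweeps task_length to show the rise-then-fall shape:

```c
#include <math.h>
#include <stdio.h>

/* Predicted speedup from the first-order model above. */
static double predicted_speedup(double task_length, double dq_enq,
                                double P, double K, double ST_cycles)
{
    double parallel = (ST_cycles / P) * (1.0 + dq_enq / task_length)
                      + K * task_length;
    return ST_cycles / parallel;
}

int main(void)
{
    const double dq_enq = 70, P = 10, K = 0.5, ST_cycles = 100000;

    /* Closed-form optimum from differentiating the model. */
    double best = sqrt(ST_cycles * dq_enq / (P * K));
    printf("optimal task_length ~ %.0f cycles -> %.2fx speedup\n",
           best, predicted_speedup(best, dq_enq, P, K, ST_cycles));

    /* Sweep task_length to see the speedup rise and then fall. */
    for (double t = 100; t <= 100000; t *= 10)
        printf("task_length %6.0f -> %.2fx speedup\n",
               t, predicted_speedup(t, dq_enq, P, K, ST_cycles));
    return 0;
}
```

Compiled with `cc model.c -lm`, this prints an optimum near 1,183 cycles for these parameters.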
Choosing the best task_size
Measure queue overhead: The work-queue overhead can be measured by writing a program in which the actual work quantum is empty. dq_enq is then the total number of cycles divided by the number of tasks processed.
Note: To measure cycles, use the cycle-count register on the core. On x86, this register can be read with the rdtsc instruction, as follows:
```c
static unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
```
Watch out! rdtsc is unreliable across parallel threads because the cycle-count registers of different cores aren't synchronized.
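Putting these pieces together, here is a sketch of the overhead measurement on a single thread (which also sidesteps the rdtsc synchronization problem); N is an arbitrary repetition count chosen to amortize the two rdtsc reads:

```c
/* The work quantum is empty, so all measured cycles are queue overhead. */
static void empty_task(void *arg) { (void)arg; }

/* Estimate dq_enq: push N no-op tasks through the queue and divide the
 * elapsed cycles by N. Uses the queue and rdtsc() defined earlier. */
double measure_dq_enq(work_queue_t *q)
{
    enum { N = 100000 };
    task_t t = { empty_task, NULL }, out;

    unsigned long long start = rdtsc();
    for (int i = 0; i < N; i++) {
        enqueue(q, t);          /* add an empty task... */
        dequeue(q, &out);       /* ...and immediately remove it */
        out.run(out.arg);       /* "execute" the empty quantum */
    }
    return (double)(rdtsc() - start) / N;
}
```

Note that this measures the uncontended latency; under real contention the effective dq_enq will be somewhat higher.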
Measure ST_cycles: Run the loop as a single thread with no queue overhead and use the CPU cycle counter to measure the cycles spent in the entire loop.
Guess K: I need help from a probability expert to figure out K. I am sure it is a function of the variation in task lengths and can be computed. For now, I always use 0.5.
Plug the numbers into my simplistic analytic model to get a ballpark estimate of how long your tasks ought to be. Next, measure the average task length and decide whether you need to split or combine your tasks to get closer to the optimal task latency.
In general, a faster work-queue can simplify the decision for task_size.
Work-stealing (Cilk, Intel Thread Building Blocks)
A major overhead of a work-queue is that it can introduce contention among threads. To reduce this contention, researchers have proposed the notion of distributed queues where multiple queues are used instead of a single shared queue. Since only one or a few threads work with each queue, the contention is reduced.
The most promising of these approaches is work-stealing. First used in MIT's Cilk, work-stealing has been picked up by several programming frameworks. In work-stealing, each thread is assigned its own private work-queue for which it does not need to contend. Work assigned to the thread is enqueued in this queue. However, if a thread finishes the work it was assigned, it "steals" work from another thread that still has work pending in its queue. This gives the best of both worlds: for most of the program, threads work on their own queues, but when one thread finishes and load imbalance becomes a problem, it can steal work and re-balance the load.
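To make the idea concrete, here is a minimal lock-based sketch (production implementations such as Cilk's THE protocol or the Chase-Lev deque avoid locking on the owner's fast path; all names here are illustrative):

```c
#include <pthread.h>

#define NTHREADS 8
#define DQSIZE   1024

/* One deque per thread: the owner pushes and pops at the tail while
 * thieves steal from the head, so they rarely compete for the same end. */
typedef struct {
    task_t buf[DQSIZE];          /* task_t as defined earlier */
    int head, tail;              /* tasks live in buf[head .. tail-1] */
    pthread_mutex_t lock;
} deque_t;

static deque_t dq[NTHREADS];

void init_deques(void)
{
    for (int i = 0; i < NTHREADS; i++)
        pthread_mutex_init(&dq[i].lock, NULL);
}

int push_local(int self, task_t t)      /* owner adds work to its own deque */
{
    deque_t *d = &dq[self];
    pthread_mutex_lock(&d->lock);
    int ok = (d->tail < DQSIZE);
    if (ok) d->buf[d->tail++] = t;
    pthread_mutex_unlock(&d->lock);
    return ok;
}

int pop_local(int self, task_t *t)      /* owner takes from the tail */
{
    deque_t *d = &dq[self];
    pthread_mutex_lock(&d->lock);
    int ok = (d->tail > d->head);
    if (ok) *t = d->buf[--d->tail];
    pthread_mutex_unlock(&d->lock);
    return ok;
}

int steal(int self, task_t *t)          /* thief takes from a victim's head */
{
    for (int i = 1; i < NTHREADS; i++) {
        deque_t *d = &dq[(self + i) % NTHREADS];
        pthread_mutex_lock(&d->lock);
        int ok = (d->tail > d->head);
        if (ok) *t = d->buf[d->head++];
        pthread_mutex_unlock(&d->lock);
        if (ok) return 1;
    }
    return 0;                           /* no pending work anywhere */
}

void *stealing_worker(void *arg)
{
    int self = (int)(long)arg;
    task_t t;
    while (pop_local(self, &t) || steal(self, &t))
        t.run(t.arg);                   /* own work first, then steal */
    return NULL;
}
```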
Using work-queues is preferable to static partitioning because it makes the program's performance robust to outside events. However, using work-queues requires choosing the task_size. I presented a simple profile-based solution to this problem, which can help developers decide the task size for their parallel kernels.
Future: It is also possible to develop a simple hill-climbing algorithm that chooses the best task_size automatically at run-time. I have tried such an algorithm in OpenMP and it works great!
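As an illustration (my sketch of the general idea, not the exact OpenMP experiment), the core of such an algorithm can be as small as this: after each batch of tasks, nudge the task size in one direction and reverse course whenever throughput drops.

```c
/* One hill-climbing step: grow or shrink task_size multiplicatively,
 * reversing direction whenever the last change hurt throughput. */
double adapt_task_size(double task_size, double throughput)
{
    static double prev_throughput = 0.0;
    static double step = 1.25;           /* multiplicative step; tunable */

    if (throughput < prev_throughput)    /* last move made things worse, */
        step = 1.0 / step;               /* so reverse direction */
    prev_throughput = throughput;
    return task_size * step;
}
```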