May 29, 2011

To maximize concurrency, all threads should be programmed to complete their work at the same time. Balancing the load among threads requires the programmer to predict the latency of each task, which is often impossible due to unpredictable OS/hardware effects. Consequently, programmers split the work into small tasks and use work-queues to distribute the work dynamically. Work-queues exist in many programming frameworks, such as Grand Central Dispatch, Intel TBB, Cilk, and OpenMP. While work-queues improve load-balancing, they introduce the overhead of adding and removing tasks to and from the work-queue. Thus, if each individual task is too small, the work-queue overhead becomes prohibitive, and if it is too large, there is a risk of load imbalance. This post (1) analyzes these trade-offs, (2) provides a method for choosing the best task size at run-time, and (3) explains some recent advances in work-queues that minimize their overhead.

Need for dynamic partitioning

Suppose there are four tasks (T1-T4) that need to be completed by 2 threads. Now suppose that the tasks have the following execution times.

Task Name Time (units)
T1 7
T2 11
T3 4
T4 1

 

These tasks will run on a single thread as follows:

[Figure: T1–T4 running sequentially on a single thread]

For simplicity, assume that these tasks are completely independent. In parallel programming, if we naively assign T1 and T2 to Thread 0 and T3 and T4 to Thread 1, they will run on two threads as follows:

[Figure: static partitioning — Thread 0 runs T1 and T2 while Thread 1 finishes T3 and T4 early and then sits idle]

Notice how Thread 1 sits idle while Thread 0 is still working. This idle time reduces the benefit of multi-threading.

Now suppose we use dynamic partitioning. In dynamic partitioning, the four tasks are added to a to-do list (known as the work-queue). Each time a thread becomes idle, it picks up the next task in the work-queue and processes it. Our four tasks will run as follows using dynamic scheduling:

[Figure: dynamic partitioning — Thread 0 runs T1 then T3; Thread 1 runs T2 then T4]

At the start, Threads 0 and 1 pick up T1 and T2 respectively. When Thread 0 finishes T1, it starts work on T3. Both threads become idle again at the same time and Thread 1 picks up T4.  Notice how the work finishes a lot faster with dynamic partitioning.

Implementing dynamic partitioning

Worker threads

Each thread in dynamic partitioning is a generic worker thread. It runs a simple loop as follows:

[Figure: worker thread loop]
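
A minimal sketch of such a loop in C (task_t, wq_t, and wq_dequeue are placeholder names I am using for illustration; a matching queue sketch appears in the next section):

#include <stdbool.h>
#include <stddef.h>

/* Placeholder task and queue types; a fuller queue sketch follows below. */
typedef struct {
    void (*run)(void *arg);   /* function that performs the task */
    void *arg;                /* task-specific data */
} task_t;

typedef struct wq wq_t;
bool wq_dequeue(wq_t *q, task_t *out);   /* returns false when the queue is empty */

/* Generic worker: pull the next task from the shared queue and run it
   until no work remains. */
void *worker_thread(void *arg)
{
    wq_t *queue = (wq_t *)arg;
    task_t task;

    while (wq_dequeue(queue, &task))
        task.run(task.arg);

    return NULL;
}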

Work-queue implementation

A work-queue provides an enqueue operation to add a task to the queue and a dequeue operation to remove a task from the queue. A work-queue can be implemented using many different data structures. The simplest is a bounded buffer implemented as an array with head and tail pointers. Since it will be read and written by multiple threads, the operations need to be wrapped inside locks (or made thread-safe in some other way).

[Figure: bounded-buffer work-queue with lock-protected enqueue and dequeue]
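
As an illustration only (not the code from the figure), a lock-protected bounded buffer using POSIX threads might look like this; wq_t, wq_enqueue, and wq_dequeue are placeholder names:

#include <pthread.h>
#include <stdbool.h>

#define WQ_CAPACITY 1024

typedef struct {
    void (*run)(void *arg);
    void *arg;
} task_t;

typedef struct wq {
    task_t buf[WQ_CAPACITY];
    int head;              /* index of the next task to dequeue */
    int tail;              /* index of the next free slot */
    int count;             /* number of tasks currently queued */
    pthread_mutex_t lock;  /* protects all of the fields above */
} wq_t;

void wq_init(wq_t *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
}

/* Add a task; returns false if the buffer is full. */
bool wq_enqueue(wq_t *q, task_t task)
{
    bool ok = false;
    pthread_mutex_lock(&q->lock);
    if (q->count < WQ_CAPACITY) {
        q->buf[q->tail] = task;
        q->tail = (q->tail + 1) % WQ_CAPACITY;
        q->count++;
        ok = true;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Remove a task; returns false if the queue is empty. */
bool wq_dequeue(wq_t *q, task_t *out)
{
    bool ok = false;
    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        *out = q->buf[q->head];
        q->head = (q->head + 1) % WQ_CAPACITY;
        q->count--;
        ok = true;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

A real implementation also has to decide what to do when the queue is full or empty: block, spin, or return immediately as this sketch does.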

Several optimizations have been proposed to reduce the contention for the work-queue. The most prominent is work-stealing, which I discuss later.

What is the best task size?

This brings us to a very interesting challenge. The size of each task is a very important design decision because there are two competing goals:

1. We want to make each task as small as possible for best load-balancing. The longer the tasks, the longer we can get stuck waiting for that last thread.

2. We want each task to be longer to better amortize the work-queue overhead.

The following is a first-order equation that describes this phenomenon.

task_length: average task length in cycles
dq_enq: sum of the dequeue and enqueue latencies
ST_cycles: the number of cycles the kernel takes as a single thread
P: the number of worker threads (cores)
K: the fraction of a task that will need to run alone at the end of the kernel

[Equation: first-order model of execution time as a function of task_length]
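
Written out, a first-order model consistent with these definitions would be roughly the following (this is my reconstruction from the variable definitions above, not necessarily the exact expression in the figure; LaTeX notation):

T_{parallel} \approx \frac{ST\_cycles}{P}
             + \frac{ST\_cycles}{task\_length}\cdot\frac{dq\_enq}{P}
             + K \cdot task\_length,
\qquad
\text{Speedup} = \frac{ST\_cycles}{T_{parallel}}

The first term is the useful work spread over P threads, the second is the queue overhead paid once per task, and the third is the load-imbalance tail at the end of the kernel.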

If we differentiate w.r.t. task_length, equate it to zero, and solve for task_length, we get:

[Equation: optimal task_length]
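
Using the reconstructed model above (so this carries the same assumptions), setting the derivative to zero gives:

\frac{dT_{parallel}}{d\,task\_length}
  = -\frac{ST\_cycles \cdot dq\_enq}{P \cdot task\_length^{2}} + K = 0
\quad\Longrightarrow\quad
task\_length_{opt} = \sqrt{\frac{ST\_cycles \cdot dq\_enq}{P \cdot K}}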

The graph below shows the impact of task_length on overall speedup. For this graph, I set
dq_enq = 70 // measured on my core2quad using something similar to the FastForward queue
P = 10 //a hypothetical machine with 10 cores
K = 0.5 //a guess based on some empirical data
ST_cycles = 100000
[Figure: speedup vs. task_length for the parameters above]

Notice how increasing the task size helps initially, but continuing to increase task_length eventually hurts. Thus, while reducing the overhead of the queue allows shorter tasks, choosing the actual task size is still a difficult decision.
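
For readers who want to reproduce the shape of this curve, a small C program can sweep task_length through the reconstructed model above (the absolute numbers are only as good as the model and the guessed K):

#include <stdio.h>

int main(void)
{
    const double dq_enq    = 70.0;      /* queue overhead per task (cycles) */
    const double P         = 10.0;      /* number of cores */
    const double K         = 0.5;       /* fraction of a task left running alone */
    const double ST_cycles = 100000.0;  /* single-threaded kernel length */

    for (double len = 100.0; len <= 20000.0; len += 100.0) {
        double t_parallel = ST_cycles / P
                          + (ST_cycles / len) * (dq_enq / P)
                          + K * len;
        printf("task_length=%7.0f  speedup=%5.2f\n", len, ST_cycles / t_parallel);
    }
    return 0;
}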

Choosing the best task_size

Measure the queue overhead: The work-queue overhead can be measured by writing a program where the actual work quantum is empty. dq_enq is then equal to the total number of cycles divided by the number of iterations.

Note: To measure cycles, use the cycle count register on the core. On x86, this register can be read using the rdtsc instruction, as follows:

static unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    /* Read the 64-bit time-stamp counter into EDX:EAX. */
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

Watch out! rdtsc is unreliable across parallel threads because the cycle count registers of different cores aren't synchronized.
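
As a usage example, dq_enq can be estimated single-threaded (so the rdtsc caveat above does not apply) by timing a loop of empty enqueue/dequeue pairs; this sketch reuses the placeholder wq_t queue from earlier:

/* Rough estimate of dq_enq: cycles per enqueue+dequeue pair. */
unsigned long long measure_dq_enq(wq_t *q, unsigned iterations)
{
    task_t empty = { 0 };   /* the "work quantum" is empty */
    task_t out;
    unsigned long long start, end;

    start = rdtsc();
    for (unsigned i = 0; i < iterations; i++) {
        wq_enqueue(q, empty);
        wq_dequeue(q, &out);
    }
    end = rdtsc();

    return (end - start) / iterations;
}

Note that this measures the uncontended cost; under contention from multiple threads the real overhead will be higher.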

Measure ST_cycles: Run the loop as a single thread with no queue overhead and use the CPU cycle counter to measure the cycles spent in the entire loop.

Guess K: I need help from a probability expert to figure out K. I am sure it is a function of task-length variation and can be computed. For now, I always use 0.5.

Plug the numbers into my simplistic analytic model to get a ballpark estimate of how long your tasks ought to be. Next, measure the average task length and decide if you need to split or combine your tasks to get closer to the optimal task latency.

In general, a faster work-queue can simplify the decision for task_size.

Work-stealing (Cilk, Intel Threading Building Blocks)

A major overhead of a work-queue  is that it can introduce contention among threads. To reduce this contention, researchers have proposed the notion of distributed queues where multiple queues are used instead of a single shared queue. Since only one or a few threads work with each queue, the contention is reduced.

The most promising of these approaches is work-stealing. First used in MIT's Cilk, work-stealing has been picked up by several programming frameworks. In work-stealing, each thread is assigned its own private work-queue for which it does not need to contend. Work assigned to the thread is enqueued in this queue. However, if a thread finishes the work it was assigned, it "steals" work from one of the other threads that still has work pending in its queue. This gives the best of both worlds: for most of the program, threads work on their own queues, but when one thread finishes and load imbalance becomes a problem, it can steal work and re-balance the load.

[Figure: work-stealing — each thread has a private queue; idle threads steal from others]
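
To make the idea concrete, here is a deliberately simplified sketch of work-stealing in C. Real implementations (Cilk's THE protocol, the Chase-Lev deque) avoid locks on the common path; this version uses a plain mutex per queue just to show the structure, and all names are placeholders:

#include <pthread.h>
#include <stdbool.h>

#define DQ_CAPACITY 1024
#define NUM_THREADS 4

typedef struct {
    void (*run)(void *arg);
    void *arg;
} task_t;

typedef struct {
    task_t buf[DQ_CAPACITY];
    int head;                 /* steal end (oldest task) */
    int tail;                 /* owner end (newest task) */
    pthread_mutex_t lock;
} deque_t;

static deque_t queues[NUM_THREADS];

void ws_init(void)
{
    for (int i = 0; i < NUM_THREADS; i++) {
        queues[i].head = queues[i].tail = 0;
        pthread_mutex_init(&queues[i].lock, NULL);
    }
}

/* The owner pushes and pops at the tail (LIFO keeps its own cache warm). */
static bool push_bottom(deque_t *d, task_t t)
{
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->tail - d->head < DQ_CAPACITY) {
        d->buf[d->tail % DQ_CAPACITY] = t;
        d->tail++;
        ok = true;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

static bool pop_bottom(deque_t *d, task_t *out)
{
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->tail > d->head) {
        d->tail--;
        *out = d->buf[d->tail % DQ_CAPACITY];
        ok = true;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Thieves take from the head, i.e., the oldest (usually largest) task. */
static bool steal_top(deque_t *d, task_t *out)
{
    bool ok = false;
    pthread_mutex_lock(&d->lock);
    if (d->tail > d->head) {
        *out = d->buf[d->head % DQ_CAPACITY];
        d->head++;
        ok = true;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Worker: drain the local queue first, then try to steal from the others. */
void *ws_worker(void *arg)
{
    int me = *(int *)arg;
    task_t t;

    for (;;) {
        if (pop_bottom(&queues[me], &t)) {
            t.run(t.arg);
            continue;
        }
        bool stole = false;
        for (int victim = 0; victim < NUM_THREADS && !stole; victim++) {
            if (victim != me && steal_top(&queues[victim], &t)) {
                t.run(t.arg);
                stole = true;
            }
        }
        if (!stole)
            break;  /* no work anywhere; a real runtime would back off or block */
    }
    return NULL;
}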

Conclusion

Using work-queues is preferable to static partitioning as it makes the program’s performance robust to outside events. However, using work-queues requires choosing the task_size. I presented a simple profile-based solution to this problem which can help developers decide the task size for their parallel kernels.

Future: It is also possible to develop a simple hill-climbing algorithm to choose the best task-size automatically at run-time. I have tried such an algorithm in OpenMP and it works great!
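
That code is not shown here, but the idea can be sketched: after each parallel phase, perturb the task size, keep the change if the measured cost improves, and reverse direction otherwise. A hypothetical tuner in C (the cost metric is whatever timing the runtime already collects, e.g., cycles per iteration):

/* Hypothetical run-time tuner: hill-climb on the task size between phases. */
typedef struct {
    long   task_size;   /* e.g., loop iterations per task */
    long   step;        /* signed step applied each phase */
    double last_cost;   /* cost of the previous phase, < 0 if none yet */
} tuner_t;

void tuner_init(tuner_t *t, long initial_size, long step)
{
    t->task_size = initial_size;
    t->step      = step;
    t->last_cost = -1.0;
}

/* Call after each parallel phase; returns the task size for the next phase. */
long tuner_update(tuner_t *t, double cost)
{
    if (t->last_cost >= 0.0 && cost > t->last_cost)
        t->step = -t->step;          /* got worse: search in the other direction */
    t->last_cost = cost;

    t->task_size += t->step;
    if (t->task_size < 1)
        t->task_size = 1;            /* never shrink below one iteration */
    return t->task_size;
}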

  10 Responses to “Parallel programming: How to choose the best task-size?”

  1. The worker thread model seems inefficient in certain respects. One is contention for the queue as you point out and successfully mitigate, but you also need to determine the appropriate number of worker threads. If you are the only process running you can probably assign one thread per CPU and things will be OK, but otherwise you may be either losing available CPU time or introducing unnecessary context switches.

    Perhaps the OS should be handling this for you, preferably with assistance from the hardware? So that instead of creating queues yourself you’re submitting jobs into a thread queue? Or are there other issues with this sort of approach?

    (When I say “assistance from the hardware” I’m imagining a CPU core that can prepare for execution of the next thread in advance, so that switching from the completed thread to the next thread takes place as quickly as possible – and in particular without the need for the core to execute any kernel-mode code first.)

    • Hi Harry,

      Thanks for another great question. Btw, how do you find my example on the dead-code elimination?

      Choosing the number of threads is actually a bit simpler in the work-queue model. The reason is that each thread is very generic and the to-do work sits in a work-queue which can easily be targeted to different threads. Thus, work-queues allow us to dynamically change the number of threads if something changes during execution. In contrast, if you statically partition the work, then you are unable to adjust the number of threads easily if something changes during execution. Exactly as you point out, the right answer is for a run-time system to choose the number of threads without programmer intervention. In fact, I have written a couple of papers on this very topic that may interest you:

      Feedback-Driven Threading

      Feedback-Driven Pipelining

      I should also point out that taking help from the OS is already combined with work-queues in Apple’s Grand Central Dispatch. This is what makes it a very nice framework.

      Your point about hardware assistance is also well-taken. Prefetching the data needed by the next thread can reduce migration cost substantially. I have not seen any research or industry work on this topic. It could be an exciting thing to look into. I am also thinking of other applications in low-power domains where quickly switching between apps can improve user experience a lot.

  2. Wouldn’t it also be sensible, where possible, to try to divide the work up into tasks of varying sizes and do the shortest ones last?

    • Hi Harry,

      Sorry for the late response. I wrote a reply before going to sleep last night but I was too sleepy to hit submit:( Here is my response:

      The algorithm you describe is indeed a good algorithm and works pretty well in most cases. In fact, OpenMP’s guided scheduling uses a very similar algorithm: it assigns threads larger chunks of iterations at the start and smaller chunks towards the end. The only limitation of this approach is that the algorithm has to be able to tell when the loop is about to end, which is often not possible in complex algorithms, e.g., the maze example. I still think a flavor of your algorithm can be refined for such workloads. We need to brainstorm more. Keep the insights coming:-)

      Aater

      • The credit for Guided Self Scheduling (GSS) goes to David Kuck and Polychronopoulos (1987). http://portal.acm.org/citation.cfm?id=40941

        GSS seems to make sense for some irregular kernels e.g. maze solving or bfs in general. For example, if we keep track of the size of the work-queue (which has an overhead), then as GSS suggests, “size/numThreads” gives a good chunk size. In practice you may want to do “size/ (c*numThreads)”, where c is some small constant used for avoiding creating too big a chunk.

