May 292011

To maximize concurrency, all threads should be programmed to complete their work at the same time. Balancing the load among threads requires the programmers to predict the latency of each task, which is often impossible due to unpredictable OS/hardware effects.  Consequently, programmers split the work into small tasks and use work-queues to distribute the work dynamically. Work-queues exist in many programming paradigms like Grand Central Dispatch, Intel TBB, Cilk, Open MP, etc. While work-queues improve load-balancing, they introduce the overhead of adding and removing tasks to and from the work-queue. Thus, if each individual task is too small, the work-queue overhead becomes prohibitive and if its too long then there is risk of imbalance. This post (1) analyzes these trade-offs, (2) provides a method for choosing the best task size at run-time, and (3) explains some recent advances in work-queues that minimize their overhead.

Continue reading “Parallel programming: How to choose the best task-size?” »

May 262011

This post is a follow up on my post titled why parallel programming is hard. To demonstrate parallel programming, I present a case study of parallelizing a kernel which computes a histogram. I use Open MP for parallelization (see the reason here). The post first introduces some basic parallel programming concepts and then deep dives into performance optimizations.

Update 5/30/2011: I have written my next post on parallel programming. It discusses the motivation, implementation, and trade-offs of dynamic work scheduling.

Continue reading “Writing and Optimizing Parallel Programs — A complete example” »

May 252011

I confess. I have an ulterior motive behind this post. Eventually I want to write a parallel programming tutorial which demonstrates the performance trade-offs in parallel programs. Since the focus of that tutorial is on performance, I prefer using the parallel programming framework with the least syntactic distractions.I think I will choose Open MP because it seems to be the cleanest alternative for parallelizing regular for-loops. This post is to familiarize readers with Open MP before the tutorial such that the tutorial can focus solely on performance optimizations.

Continue reading “Open MP vs pthreads” »

May 202011

Multi-cores are here, and they are here to stay. Industry trends show that each individual core is likely to become smaller and slower (see my post to understand the reason). Improving performance of a single program with multi-core requires that the program be split into threads that can run on multiple cores concurrently. In effect, this pushes the problem of finding parallelism in the code to the programmers. I have noticed that many hardware designers do not understand the MT challenges (since they have never written MT apps). This post is to show them the tip of this massive iceberg.
Update 5/26/2011: I have also written a case study for parallel programming which may interest you.

Continue reading “What makes parallel programming hard?” »