Jul 042011

Writing parallel code is all about finding parallelism in an algorithm. What limits parallelism are the dependencies among different code portions. Understanding the dependencies in a program early on can help the programmers (a) determine the amount of available parallelism, and (b) chose  the best parallel programming paradigm for the program. In this post, I try to layout a taxonomy of dependencies in loops and how it plays into parallel programming.

Continue reading “Parallel Programming: On Using Dependence Information” »

Jun 292011
Loop control flow

I got into a debate with a computer science professor a few months ago when I made a controversial blanket statement that “the code inside loop bodies is the only code that matters for performance.” I should provide some context: I was discussing how multi-threading is about speeding up loops and I don’t care about straight line code which is only gonna execute once (or just a few times). My argument is that programmers do not write billions of lines of straight line code. Its the repetition of code (via loops or recursion) that makes the code “slow.” In fact, I can argue that any time we wait on a computer program to do something useful, we are in fact waiting on a loop (e.g., grep, loading emails, spell checking, photo editing, database transactions, HTML rendering, you name it). It is a rather silly argument but I would like to see some counter arguments/examples. Question: Is parallel programming all about loops/recursions or are there cases where code that executes only once is worth optimizing?

Please note that if the code executing once has a function call which has a loop in it then that counts as a loop, not straight line code. Comments would be great but at least take the time to indicate your vote below.

Are loops the only code that matters for performance?

View Results

Loading ... Loading ...



Jun 262011

After reading my post on the shortcomings of Amdahl’s law, A reader of Future Chips blog, Bjoern Knafla (@bjoernknafla), on Twitter suggested that I should add a discussion on Gustafson’s law on my blog. Fortunately, I have the honor of meeting Dr. John Gustafson in person when he came to Austin in 2009. The following are my mental notes from my discussion with him.  It is a very simple concept which should be understood by all parallel programmers and computer architects designing multicore machines.

Continue reading “Parallel Programming: Amdahl’s Law or Gustafson’s Law” »

Jun 262011

I received a interesting comment from ParallelAxiom, a parallel programming expert, on my post titled “When Amdahl’s law is inapplicable?” His comment made me re-think my post. I must show an example to hammer my point. Thus, I have added an example to the original post and I am adding this new post just so the RSS subscribers can see this update as well. Please look at the original article first in case you have not read it.

Continue reading “Parallel Programming: An example of “When Amdahl’s law is inapplicable?”” »

Jun 252011

I see a lot of industry and academic folks use the term Amdahl’s law without understanding what it really means. Today I will discuss what Gene Amdahl said in 1967, what has become of it, and how it is often misused.

Update: 6/25/2011: On a suggestion from Bjoern Knafla, I have added a closely related article on Gustafson’s law.

Continue reading “Parallel Programming: When Amdahl’s law is inapplicable?” »

Jun 182011

In shared memory systems, multiple threads are not allowed to update shared data concurrently, known as the mutual exclusion principle. Instead, accesses to shared data are encapsulated in regions of code guarded by synchronization primitives (e.g. locks). Such guarded regions of code are called critical sections. The semantics of a critical section dictate that only one thread can execute it at a given time. Any other thread that requires access to shared data must wait for the current thread to complete the critical section.

Continue reading “Parallel Programming: Understanding the impact of Critical Sections” »

Jun 142011

Parallel programming consists of four distinct phases: finding the parallelism, writing the code, debugging the code, and optimizing the code. In my opinion, frameworks like the Apple’s Grand Central Dispatch and Intel’s TBB only help with writing the code; they do not help with finding parallelism, or debugging, or optimizations (e.g., false-sharing, thread waiting, etc). I think that the difficulty in finding the parallelism , which can be an insurmountable barrier for many inexperienced parallel programmers, is often underestimated. In this post, I try to explain this challenge using a couple of parallel programming puzzles.

Continue reading “Parallel Programming: Why new frameworks only solve a part of the problem?” »

Jun 032011

I just want to share a quick tip for new parallel programmers. I have found that using a while(1) loop makes parallel code easier to write, read, and debug. This post just explains why I think its easier. In fact, I am also very curious if the experienced programmers agree with me or not.

Continue reading “Quick Tip: Why while(1) can make parallel code better?” »

Jun 032011

maze I have been busy constructing another parallel programming example this week. My target was to find a problem which can be explained in one sentence, coded in less than 10 lines of C, uses a work-queue, and is simple to parallelize but hard to optimize. The one I finally picked is a maze. I am presenting  my first stab at it today but it is still work in progress. Feedback and suggestions are greatly appreciated.

Continue reading “Parallel Programming: Parallelizing (somewhat) Complex Code (Part 1)” »

May 292011

To maximize concurrency, all threads should be programmed to complete their work at the same time. Balancing the load among threads requires the programmers to predict the latency of each task, which is often impossible due to unpredictable OS/hardware effects.  Consequently, programmers split the work into small tasks and use work-queues to distribute the work dynamically. Work-queues exist in many programming paradigms like Grand Central Dispatch, Intel TBB, Cilk, Open MP, etc. While work-queues improve load-balancing, they introduce the overhead of adding and removing tasks to and from the work-queue. Thus, if each individual task is too small, the work-queue overhead becomes prohibitive and if its too long then there is risk of imbalance. This post (1) analyzes these trade-offs, (2) provides a method for choosing the best task size at run-time, and (3) explains some recent advances in work-queues that minimize their overhead.

Continue reading “Parallel programming: How to choose the best task-size?” »