Jul 24, 2011

In this post, I describe pipeline parallelism, a powerful method for extracting parallelism from loops that are otherwise difficult to parallelize. Pipeline parallelism is underrated because of the misconception that it applies only to “streaming” workloads, a claim that is vague and misleading. I discuss the uses, trade-offs, and performance characteristics of pipelined workloads.

You may feel that you already know pipeline parallelism, so let's start with a short quiz.

Pipeline Parallelism

This quiz tests your understanding of pipeline parallelism.


Now let's discuss pipeline parallelism.



A common approach to parallelism is to split each loop such that multiple iterations are run in parallel. This approach is called work-sharing. For example, consider the following loop:

while(!done) {
    Read block from file;
    Compress the block;
    Write block to file;
}

In the traditional work-sharing approach, threads run different iterations concurrently. Each thread runs an iteration in its entirety and then proceeds to the next iteration.
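As a concrete illustration, here is a minimal Python sketch of the work-sharing approach. It assumes the blocks are independent: each worker runs a whole iteration (here, compressing one block end to end), with the "read" and "write" steps modeled as in-memory data so the sketch stays self-contained. All names are illustrative, not from the original post.

```python
# Work-sharing sketch: each worker thread runs entire iterations.
import zlib
from concurrent.futures import ThreadPoolExecutor

def run_iteration(block):
    # One full loop iteration: "read" is modeled as receiving the block,
    # and "write" as returning the compressed bytes.
    return zlib.compress(block)

blocks = [bytes([i]) * 4096 for i in range(8)]   # stand-in for file blocks

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() hands whole iterations to worker threads and preserves order.
    compressed = list(pool.map(run_iteration, blocks))

# Round-trip check: every block decompresses back to the original.
assert all(zlib.decompress(c) == b for c, b in zip(compressed, blocks))
```

Note that this only works cleanly because the iterations were made independent; as the post explains next, ordered loops need a different approach.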

Pipeline Parallelism

An alternative approach is pipeline parallelism: we split each loop iteration into stages, and threads operate concurrently on different stages from different iterations. Each stage is assigned one or more worker threads and an in-queue that stores the work to be processed by that stage. Each worker thread runs the following loop:

while(!program ended):
    If work is available in the in-queue assigned to the thread:
        Dequeue work
        Execute stage
        Enqueue work in the in-queue of the next stage
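The worker loop above can be sketched in Python using the thread-safe `queue.Queue`. The `STOP` sentinel and `stage_fn` names are illustrative assumptions, not part of the original post; a blocking `get()` stands in for the "if work is available" check.

```python
# Sketch of one pipeline worker's loop using thread-safe queues.
import queue
import threading

STOP = object()  # sentinel that tells a worker the program has ended

def worker(in_queue, out_queue, stage_fn):
    while True:
        work = in_queue.get()          # dequeue work (blocks until available)
        if work is STOP:
            if out_queue is not None:
                out_queue.put(STOP)    # propagate shutdown to the next stage
            break
        result = stage_fn(work)        # execute this stage
        if out_queue is not None:
            out_queue.put(result)      # enqueue into the next stage's in-queue

# Tiny demo: a single stage that doubles its input.
in_q, out_q = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(in_q, out_q, lambda x: x * 2))
t.start()
for i in range(3):
    in_q.put(i)
in_q.put(STOP)
t.join()
results = [out_q.get() for _ in range(3)]   # [0, 2, 4]
```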

For example, to expose pipeline parallelism in the previously described loop, we can split it into three segments S1-S3:

S1: Read block from file
S2: Compress
S3: Write compressed block to file

Now, we can create a pipeline. Suppose we have three cores C1-C3 executing the three stages S1-S3, respectively. An iteration proceeds as follows: C1 runs S1 and inserts the block into S2's in-queue. C2 compresses the block and enqueues it for writing by S3. C3 then writes the block. Thus, C1, C2, and C3 can work concurrently on stages from different iterations, thereby increasing performance.
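The three-stage pipeline can be sketched as follows, with one Python thread per stage and in-memory byte blocks standing in for real file I/O. The queue names and the `DONE` sentinel are illustrative; this is a minimal sketch, not the post's actual framework (which, per the comments below, uses OpenMP).

```python
# Three-stage pipeline S1-S3, one thread per stage, queues between stages.
import queue
import threading
import zlib

DONE = object()
blocks = [bytes([i]) * 4096 for i in range(8)]   # stand-in for file blocks
q12, q23 = queue.Queue(), queue.Queue()          # in-queues of S2 and S3
output = []

def s1_read():
    for block in blocks:                  # S1: "read" each block
        q12.put(block)
    q12.put(DONE)

def s2_compress():
    while (block := q12.get()) is not DONE:
        q23.put(zlib.compress(block))     # S2: compress
    q23.put(DONE)

def s3_write():
    while (item := q23.get()) is not DONE:
        output.append(item)               # S3: "write" the compressed block

threads = [threading.Thread(target=f) for f in (s1_read, s2_compress, s3_write)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Round-trip check: the written output matches the original blocks.
assert [zlib.decompress(c) for c in output] == blocks
```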

To understand the benefit, suppose that S1 takes 1 time unit, S2 takes 4 units, and S3 takes 1 unit, so one iteration takes 6 time units. Now suppose we run a total of 8 iterations, which takes 48 units on a single thread, as shown in the figure below (see the set of bars labeled NumCores=1). With pipeline parallelism (second set of bars), each iteration is split across cores. Since S1, S2, and S3 from different iterations can run concurrently with respect to each other, execution time drops. Notice how the slowest stage, S2, dominates the overall time: S1 and S3 spend much of their time waiting for S2 to finish.
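These numbers can be checked with simple arithmetic. In a basic model (one core per stage, unbounded queues, no queuing overhead), the first iteration fills the pipeline in sum-of-stages time, after which the slowest stage gates progress:

```python
# Back-of-the-envelope check of the timings above.
stage_times = [1, 4, 1]   # S1, S2, S3 in time units
iterations = 8

# Single thread: every iteration pays the full 6 units.
serial_time = iterations * sum(stage_times)                      # 8 * 6 = 48

# Pipelined, one core per stage: fill the pipeline once (6 units),
# then the slowest stage (S2, 4 units) completes one iteration at a time.
pipelined_time = sum(stage_times) + (iterations - 1) * max(stage_times)

print(serial_time, pipelined_time)   # 48 34
```

This simplified model ignores queue contention and cache effects, but it captures why S2 dominates.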

Pipeline parallelism is powerful because it can expose parallelism in ordered loops whose iterations are not independent and cannot run concurrently. By splitting each loop iteration into stages, we expose intra-iteration parallelism.

In pipeline parallelism, variables shared by different instances of the same stage stay pinned to the core running that stage, so their locality improves. In contrast, variables written in one stage, say S1, and read in a later stage, say S2, must migrate from C1 to C2, which increases cache misses. Thus, whether a workload should be pipelined depends on the ratio of inter-stage-shared variables to intra-stage variables.

Performance Modeling

The throughput of a pipeline stage is the number of iterations it processes in a given amount of time. Thus, with a single core per stage, the throughput Ti of pipeline stage i can be defined as:

Ti = 1 / ti

where ti is the time stage i takes to process one iteration.

The higher the throughput, the shorter the execution time. The overall throughput T of the whole pipeline is limited by the throughput of its slowest stage. Therefore:

T = min(T1, T2, ..., Tn)

Thus, for example, if the slowest stage of the compression pipeline is S2 (compress), then performance is determined solely by the throughput of S2: K blocks will be compressed in K/T2 time, i.e., K × (the time taken to execute S2).

Multiple cores per stage

When multiple cores are assigned to a pipeline stage, its throughput increases (ideally) linearly. Thus, a common way to speed up a pipelined workload is to assign more cores to the slowest stages to balance their throughputs. In practice, programmers are often unable to balance a pipeline completely, which leaves threads waiting and performance on the table.
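Under this model, a stage running on ci cores has throughput roughly Ti = ci / ti, and the pipeline still runs at the minimum stage throughput. A small sketch (the function name is illustrative) shows why moving cores to the slow stage helps:

```python
# Core-assignment model: stage i with c_i cores and per-iteration time t_i
# has throughput c_i / t_i; the pipeline runs at the minimum stage throughput.
def pipeline_throughput(stage_times, cores):
    return min(c / t for c, t in zip(cores, stage_times))

stage_times = [1, 4, 1]   # S1, S2, S3 from the running example

# One core per stage: S2 is the bottleneck at 1/4 iteration per time unit.
print(pipeline_throughput(stage_times, [1, 1, 1]))   # 0.25

# Six cores with four on the slow stage: the pipeline is balanced and
# throughput rises to 1 iteration per time unit.
print(pipeline_throughput(stage_times, [1, 4, 1]))   # 1.0
```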


Now you can try taking the quiz at the top.


  20 Responses to “Parallel Programming: Do you know Pipeline Parallelism?”

  1. Excellent article!

    I think your last point could be made clearer with another graph, like the ones you used in the “Pipeline Parallelism” session. That same 6 core example with four of the cores assigned to S2 instead of split evenly between all stages makes for a rather striking example!

  2. Wonderful article! It helped me a lot. But I am still confused about the pipeline implementation: one stage is one thread, right? Then how can you pin a thread to a certain core? Or do you have an implementation example? Thanks!

    • Hey Liu,

      Thanks for reading. You can assign one (or more) cores per stage. I implement my pipeline framework as follows:

      - I spawn as many threads as there are cores. I use processor-affinity flags in OpenMP to pin threads to cores such that thread 0 corresponds to core 0 and so on. Search for KMP_AFFINITY on Google.
      - Each stage is assigned one work-queue.
      - I implement a “work scheduler” which decides which thread should run which stage.
      - Each worker thread loops as follows: call the scheduler to learn which stage to process work from, dequeue work, process it, and put the output in the next stage's queue.

      I hope this helps. I can share code if you want.


      • Hi Aater,

        Thanks for your reply. I got your idea. But I did have difficulty in coding with the “work queue” and “work scheduler” thing. If you can share code, I will be really grateful. My email: fujun.liu@uky.edu Thanks

  3. Hi i keep reading your articles, they are great and giving nice ideas.

    From what I understand, assigning more cores to the slower stages will of course increase performance, but it will also cause more loads/stores and perhaps more branch mispredictions, so the gain may not be linear (though any extra performance helps). On the other hand, comparing 7zip vs. WinRAR: WinRAR is more memory-bandwidth-bound while 7zip utilizes cores better. So I assume they take totally different approaches?

  5. Congrats! It’s indeed a great article. Thanks for sharing.

  6. Good explanation.
    Can you mention a few killer example code where we can apply pipeline parallelism?

  8. Regarding the chart with the three comparisons… Correct me if I’m wrong, but wouldn’t using six cores in sequence like the bottom one still beat the using six cores with two cores per stage example? Visually, just look at it. Even though there would be two cores doing two extra jobs, it would still beat the top example… You can’t really compare six cores to a one core example and call it faster. To me that’s an obvious “no duh.” Now I’m even more confused, lol.

  10. The answer to question 1 is wrong, or at least the question is phrased badly. Consider the case of 1 iteration. It will clearly not take 40 cycles to execute. The slowest stage determines the throughput once the pipeline is up and running, but does not account for filling and emptying the pipeline.

    In fact it might be a good example to explain the difference between throughput (40 cycles), latency (120 cycles) and overall time (40N+80 cycles)

  11. Great article with very clear explanation. The picture helps a lot to understand further. Thank you for sharing.

  13. Very good article, But the one thing that needs some explanation I feel is how the ratio of (# of interstage shared variables) / (# of intrastage variables) affects pipelining. If this is an index (say we call it K), I feel for efficient processing, the condition must be K >=1. If someone further clarifies this, it will be helpful. Anyway, nice & more importantly easy to read explanation of pipelined computation.

