Jun 25, 2011
 

I see a lot of industry and academic folks use the term Amdahl’s law without understanding what it really means. Today I will discuss what Gene Amdahl said in 1967, what has become of it, and how it is often misused.

Update: 6/25/2011: On a suggestion from Bjoern Knafla, I have added a closely related article on Gustafson’s law.

Gene Amdahl’s insight

Amdahl’s law was derived from Gene Amdahl’s 1967 paper at the AFIPS computer conference. On a side note, I am fascinated by the opening sentence of this paper:

For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution.

Deja vu — apparently, the death of single-thread performance is not new. But I digress from the topic.

The following sentence from the paper is the basis of the infamous Amdahl’s law:

… the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude.

It is important to note that the paper did not have an equation. Nor did it talk about what types of serial bottlenecks exist, what parallel programming paradigm he was referring to, etc.

The equation

Somewhere along the road, we decided to convert Amdahl’s insight into a law and characterize it using an equation. I could not find its origin, but the equation goes as follows:

Speedup(N) = 1 / ((1 - p) + p/N)

where p is the fraction of the program that can be parallelized and N is the number of threads.
This equation makes seven simplifying assumptions:

1. The world is black and white: the number of executing threads is either 1 or N; there is nothing in between. This is often false, e.g., in Google MapReduce, the number of threads in the Map phase may be N, but the number of threads in the Reduce phase is often smaller than N yet greater than 1.

2. The parallel portion has perfect speedup. This is not true because contention for shared data (critical sections) and shared resources (caches, memory bandwidth) often prevents the program from reaching perfect linear speedup.

3. The parallel portion has infinite scaling, i.e., performance never saturates. This is incorrect because contention for shared data and shared resources increases with the number of threads. This contention can reach a point where adding more threads does not increase performance (or even reduces it).

4. There is no thread creation/deletion overhead. Allocating and de-allocating threads is expensive and this overhead increases linearly with the number of threads.

5. The length of the serial, single-threaded portion is independent of the number of threads. The single-threaded work often consists of splitting up the work for the parallel portion, and it is often a function of how many threads will be spawned during the parallel phase. Furthermore, more threads can lead to more inter-core communication, thereby extending the length of the serial portion (Updated 12/3/2011: thanks to Jim Dempsey @ QuickThreads for this comment).

6. Serial and parallel code run at the same rate. This need not hold, e.g., on a chip with heterogeneous cores, the serial portion may run on a faster core than the parallel portions.

7. The serial portion cannot be overlapped with the parallel threads. Many workloads embed serial portions inside parallel portions in order to overlap their execution with parallel work, e.g., by exploiting pipeline parallelism between serial and parallel portions.

The main distinction that reduces the scope of Amdahl’s law is #7.
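For concreteness, here is a small sketch (mine, not part of the original post) that simply evaluates the equation above; the parallel fraction p = 0.95 is an arbitrary illustrative value:

#include <stdio.h>

/* Textbook Amdahl speedup: p is the parallel fraction, n the thread count. */
static double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double p = 0.95;                  /* illustrative parallel fraction */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d   speedup = %6.2f\n", n, amdahl_speedup(p, n));
    printf("N -> inf   speedup = %6.2f\n", 1.0 / (1.0 - p));  /* asymptotic limit */
    return 0;
}

Even with 1024 threads the predicted speedup stays just under 1/(1 - p) = 20, which is the asymptotic limit of the equation.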

What is Amdahl’s Law?

Most literature quotes the equation, not the actual insight, as Amdahl’s law, so let’s stick with that.

Who can use Amdahl’s law?

Amdahl’s law is only applicable to certain fork-join programming paradigms. Specifically, it applies to workloads where some code runs as a single thread followed by some embarrassingly parallel code, e.g., matrix-matrix multiply or other HPC kernels.
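As an illustration of that pattern, here is a minimal fork-join sketch (my own, with arbitrary sizes), written with OpenMP: a single-threaded setup phase followed by an embarrassingly parallel loop, in this case a naive matrix multiply:

#include <stdlib.h>

#define DIM 512

int main(void) {
    double *a = malloc(DIM * DIM * sizeof *a);
    double *b = malloc(DIM * DIM * sizeof *b);
    double *c = malloc(DIM * DIM * sizeof *c);
    if (!a || !b || !c) return 1;

    /* Serial portion: one thread prepares the inputs. */
    for (int i = 0; i < DIM * DIM; i++) { a[i] = i % 7; b[i] = i % 13; c[i] = 0.0; }

    /* Parallel portion: the outer-loop iterations are independent,
       so this part can (ideally) scale with the number of threads. */
    #pragma omp parallel for
    for (int i = 0; i < DIM; i++)
        for (int k = 0; k < DIM; k++)
            for (int j = 0; j < DIM; j++)
                c[i * DIM + j] += a[i * DIM + k] * b[k * DIM + j];

    free(a); free(b); free(c);
    return 0;
}

Here the initialization is the serial fraction and the parallel loop is the portion the equation assumes speeds up perfectly, which is exactly the black-and-white split the equation models.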

Who should avoid using the term “Amdahl’s law”?

As with any analytic model, Amdahl’s law should only be used when a workload fits the programming model it assumes. There are many programs that do not fit this model. Fundamentally, Amdahl’s law assumes that any code which cannot be parallelized is always on the critical path of the program. This is not the case in many modern programming paradigms, where some non-parallel code sections can run in parallel with other independent code. For example:

1. Critical sections: critical sections are portions of code that only one thread can execute at a given time; other threads wanting to enter the critical section must wait. Critical sections can serialize a variable number of threads depending on the contention for them; this contention is sometimes zero and sometimes very high. Thus, Amdahl’s law does not apply to critical-section-intensive workloads such as databases (see the sketch after this list).

2. Serial stages in a pipeline/task-parallel workload: a serial stage only leads to serialization if it is on the critical path of the program. It may have zero or more threads waiting on it at any given time. For example, all graphics workloads have a thread that produces work for the other threads. This serial thread does not become a bottleneck unless it cannot feed the other threads fast enough. However, if it does become the bottleneck, then the parallel portion ceases to matter. Thus, the Amdahl equation does not characterize the behavior of these kernels.

3. Regions of limited parallelism: these exist in programs due to contention for hardware resources or thread creation/deletion overhead. The Amdahl equation does not apply to them because it assumes that the serial part runs as exactly one thread. For example, Google MapReduce does not follow the Amdahl model.

The above constructs completely change the first-order analytic model of a program and do not fit Amdahl’s equation. Yet we often use serial portions and Amdahl’s law synonymously. I must admit that this is one of my pet peeves, and I believe we need to fix it as a community. It is best to use more specific terms for the different types of serial code in order to avoid misunderstanding.
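To make the first case above concrete, here is a minimal OpenMP sketch (my own; the constants are arbitrary) in which each thread alternates private work with a short critical section. How much the critical section serializes execution depends on how often threads collide in it (try shrinking WORK), which is exactly the variable contention that a fixed serial fraction cannot express:

#include <omp.h>
#include <stdio.h>

int main(void) {
    const long iters = 200000;   /* iterations per thread */
    const long WORK  = 2000;     /* private work per iteration; smaller => more contention */
    long shared_counter = 0;

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        volatile long sink = 0;                /* keeps the private work from being optimized away */
        for (long i = 0; i < iters; i++) {
            for (long w = 0; w < WORK; w++)    /* parallel, private work */
                sink += w;
            #pragma omp critical               /* serialized only when threads actually collide here */
            shared_counter++;
        }
    }
    printf("threads = %d   time = %.3f s   counter = %ld\n",
           omp_get_max_threads(), omp_get_wtime() - t0, shared_counter);
    return 0;
}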

 

Concrete Example (added after the comment from ParallelAxiom on 6/25/11; I also posted the following separately for RSS subscribers):

Think about pipeline workloads. Let’s use a simple example of a workload called Rank. Rank compares a set of input strings in a file to a given search string and returns the N strings most similar to the search string.

There are three stages in a particular implementation of Rank:

S1. Read input string — this is sequential
S2. Compute similarity score by comparing to search string — multiple strings can be compared concurrently
S3. Insert string in a heap sorted according to the similarity score. Drop the element with the lowest similarity score in case the heap has N+1 elements after the insertion. — this is a critical section in a naive implementation.

Now, in the above code, the three stages could be written as three distinct loops that run one after the other, in which case S1 and S3 would be your Amdahl serial portions (where only a single thread exists). However, I can be smart and write this code as a pipeline where S1, S2, and S3 run concurrently and communicate via work-queues.

Let’s suppose each iteration of S1 takes 1K cycles, S2 takes 10K cycles, and S3 takes 2K cycles. Then, as I increase the number of cores, the throughput of this pipeline eventually becomes limited by S3 because even if I speed up S2 to 1K cycles per iteration (by giving it 10 cores), S3 will not speed up. Thus, once I have, say, 5 cores assigned to S2, more cores will not help with performance.

Now, naively (and incorrectly), people call this Amdahl’s law. It is not Amdahl’s law because the above cannot be characterized by the Amdahl equation. If we use Amdahl’s equation, the serial code is 3K cycles and the parallel code is 10K cycles. Thus, with infinite cores:

Cycles_per_iteration_with_P_cores = 3K + 10K/P
Thus, the fastest we can go is 3K cycles per iteration. This is obviously wrong.

The equation which characterizes this pipeline case is as follows:

Cycles_per_iteration = MAX(Cycles_per_S1, Cycles_per_S2/cores_assigned_to_S2, Cycles_per_S3);
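For comparison, here is a small sketch (mine, not part of the original post) that evaluates both first-order models with the stage costs above; the helper names are made up for illustration:

#include <stdio.h>

/* What the (misapplied) Amdahl equation predicts per iteration:
   serial cycles plus parallel cycles divided by the core count. */
static double amdahl_cycles(double serial, double parallel, int cores) {
    return serial + parallel / cores;
}

/* First-order pipeline model: throughput is set by the slowest stage. */
static double pipeline_cycles(double s1, double s2, double s3, int cores_for_s2) {
    double s2_eff = s2 / cores_for_s2;
    double worst = s1;
    if (s2_eff > worst) worst = s2_eff;
    if (s3 > worst) worst = s3;
    return worst;
}

int main(void) {
    for (int p = 1; p <= 16; p *= 2)
        printf("P = %2d   Amdahl: %6.0f cycles/iter   pipeline: %6.0f cycles/iter\n",
               p, amdahl_cycles(3000, 10000, p),
               pipeline_cycles(1000, 10000, 2000, p));
    return 0;
}

With enough cores assigned to S2, the pipeline model bottoms out at 2K cycles per iteration (the cost of S3), whereas the misapplied Amdahl equation never predicts anything better than 3K.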

This is my poster-child example of why Amdahl’s equation does not work for all non-parallel code.

 

Update: The following is a conversation I had about this example with @jmuffat on Twitter. I think it helps clarify an important point of this example:

@jmuffat said:  Amdahl’s law knows only 2 types: serial or //. Your example shows things that can run concurrently and calls them serial…
My response: That is the point of my article. Amdahl’s shouldn’t be used when the serial code is off the critical path, but people often make this mistake. There are hundreds of “experts” who say that any code that cannot be multithreaded is an “Amdahl’s bottleneck.” I have a paper on critical sections. Despite my insistence, people still call critical sections Amdahl’s, but as you say, they aren’t.


Note: Hardware architects mis-characterize Amdahl’s law more often, which is why I have classified this post under “Software for Hardware Guys.”

  Responses to “Parallel Programming: When Amdahl’s law is inapplicable?”

  2. Interesting post.

    I think that yet another assumption of Amdahl’s law will soon be brought into question, namely that every parallel part executes at the same speed as the serial part.

    I think you hit the nail on the head when you talked about mixed core CPUs ( I think in your paper about accelerating an mp4 decoder ). I believe we’ll have, in the not too distant future, chips with C16-64 ( read 16 full cores + 64 fast but restricted cores ).

    Integration of CPU/GPU is an early step towards these hybrids. In the end I think whether they survive depends on the available tools. Unfortunately, for the standard Java programmer the current interfaces, OpenCL or CUDA, are next to impossible to use.

    Regarding the #7 you mentioned above: I liken it to a program running on an OS. Simply because the program might have a slow, serial part, the other parts of the OS do not slow down.

    This may or may not be transferable to a single program depending on the task at hand. I think most of the programs written today are more complex than simply answering one question ( calculate PI ).

    So while one task might be more or less bound by Amdahl’s law ( request->DB->business logic->response ), multiple concurrent tasks in their entirety may exceed Amdahl’s limit. It all depends on how you look at it.

    However, even in that case Amdahl is right, as the serial part between the different tasks is close to zero.

    So I guess after all this arguing about it I just convinced myself that you are missing the point wrt Amdahl’s law and your point #7. If you can execute it in parallel with other threads, then the dependency must be NULL.

    But … of course SW which makes assumptions about expected input variables, and runs threads in parallel to have a result at hand when the input variables are known, will kill this argument. Isn’t it fun to think you have 1000s of spare cores available … Maybe we should look into how the brain handles concurrency :)

    My head spins, got to eat breakfast.

    • Hey,

      Interesting comment. Now my head is spinning as well. lol.

      “So I guess after all this arguing about it I just convinced myself that you are missing the point wrt Amdahl’s law and your point #7. If you can execute it in parallel with other threads, then the dependency must be NULL.”

      I may be misunderstanding you but the above sounds incorrect. I think you are trying to say that if a serial part can run in parallel with something else then there must be no dependency between the parallel part and the serial part. Hence, Amdahl is right.

      The first part of the assumption is wrong. Just because you can run a “non-parallelizable” code snippet in parallel with other code does not mean there is no dependency. Think about pipeline workloads. Let’s use a simple example of a workload called Rank. Rank compares a set of input strings in a file to a given search string and returns the N strings most similar to the search string.

      There are three stages:

      S1. Read input string — this is sequential
      S2. Compute similarity score by comparing to search string — multiple strings can be compared concurrently
      S3. Insert string in a heap sorted according to the similarity score. Drop the element with the lowest similarity score in case the heap has N+1 elements after the insertion. — this is a critical section in a naive implementation.

      Now in the above code, there are three distinct loops that can be run one after the other, in which case S1 and S3 will be your Amdahl serial portions, where only a single thread exists. However, I can be smart and write this code as a pipeline where S1, S2, and S3 run concurrently and communicate via work-queues. Per your argument (if I get it correctly), S1 and S3 are then no longer serial. This is not true.

      Let’s suppose each iteration of S1 takes 1000 cycles, S2 takes 10000 cycles, and S3 takes 2000 cycles. Then, as I increase the number of cores, the throughput of this pipeline eventually becomes limited by S3 because even if I speed up S2 to 1000 cycles per iteration (by giving it 10 cores), S3 will not speed up. Thus, once I have, say, 5 cores assigned to S2, more cores will not help with performance.

      Now, naively (and incorrectly), people call this Amdahl’s law. It is not Amdahl’s law because the above cannot be characterized by the Amdahl equation. If we use Amdahl’s equation, the serial code is 3K cycles and the parallel code is 10K cycles. Thus, with infinite cores:

      Cycles_per_iteration_with_P_cores = 3K + 10K/P
      Thus, the fastest we can go is 3K cycles per iteration. This is obviously wrong.

      The equation which characterizes this case is as follows:

      Cycles_per_iteration = MAX(Cycles_per_S1, Cycles_per_S2/cores_assigned_to_S2, Cycles_per_S3);

      I think this is my poster-child example of why Amdahl’s doesn’t work for all non-parallel code. I am making it a part of the post. Let me know if this answers your concern (I hope it does :-) )

      Aater

  3. Hmmm,

    I think you are trying to be clever with the code to work around Amdahl ( poor guy ). However if you use IPS ( Inter Process Synchronization ) or IPC in the threads then you can consider the cycles spent for the communication to be your serial part ( the info has to come from somewhere ), plus reading in the string.

    Assuming you infinitize the work in the middle ( S2 ) you’ll still end up with the summation of the serial parts. Though the more you communicate, the longer it takes.

    So I would say that if S3 depends on the result of S2, which in turn depends on the result of S1, then your above mentioned program takes
    1000 Cycles + 0 ( infinite cores, with no overhead ) + 2000 Cycles
    Take one string for example: you have to read in the string first before you write it to the output, unless you start working on the string before you finish reading it in, in which case you just lowered the granularity but you did not destroy gravity … I mean Amdahl’s law.

    Can we talk now about energy, Schrödinger’s cat and the theory of everything … more fun. But hey this post is close though :)

    • “So I would say that if S3 depends on the result of S2, which in turn depends on the result of S1, then your above mentioned program takes
      1000 Cycles + 0 ( infinite cores, with no overhead ) + 2000 Cycles”

      Nope, this is wrong, and I finally see our disconnect. It’s the classic latency vs. throughput argument. You are right that each input string will take (at least) 3000 cycles to process, but that is irrelevant in a pipeline. What’s important is that I will be able to process strings at a rate of 1 per 2K cycles. Thus, 100 strings will take 200K cycles (ignoring the trailing edges).

      In a pipeline, the throughput is equal to the throughput of the slowest stage, which is S3 in our example. It is the whole critical path argument.

    • By the way, if you are interested in seeing more maths on a pipeline, see this paper of mine:
      Feedback Driven Pipeline Parallelism

  4. Oha,

    I see now what you mean. It is a matter of granularity, I think, or of how you look at the problem. Either way, ole Amdahl is in our way. BTW, I don’t like his ‘law’ at all and am not trying to defend it per se … just chiming in.

    See, the way to describe the problem space of your app to Amdahl’s pleasing is to say you have an inherently parallel problem which has a serial runtime of at least 3000 cycles.
    Fig 1.d in your paper says it all.

    So if you have 2000 strings, your absolute minimum for all 2000 strings is 3000 cycles.

    In other words, you could have 2000 number crunchers with infinite cores available; then your total run time would be 3000 cycles. However, reality is in our way.
    The perfect runtime:
    1*1000 + 0 + 1*2000
    The approximate runtime ( leaving out many things ):
    1*1000 + X*overhead1 + 10K/P + Y*overhead2 + 1*2000 + Z*overhead3
    where one can argue about what each overhead means and what belongs to it.

    Just like video decoding, where one task might read in data while another thread of the same task is still munching on the previous dataset.

    In general, the math behind it becomes quite annoying when applied to a truly complex problem because you have to handle interwoven granularity properly ( e.g. pre-fetching data ), but I would not put the runtime of this program at ( 3000 cycles + 10K/P ) * 2000 input strings, nor would I put it at a flat 3000 cycles.

    The answer is somewhere in between.

    BTW, the pipeline issue is much more real than you may believe, and I had to fight this type of issue a lot with e.g. buffered UDP/TCP sockets. Once the buffer size is exceeded ( because the worker threads can’t handle the data fast enough ) you’ll run into trouble that is non-trivial to solve ( i.e. dropping packets, crashing, or analyzing the data to decide based on business logic ).

    Last word for today: pipeline frameworks are plentiful, and two I like are FastFlow and ØMQ ( a more generic message-passing API ).

    • Hey,

      Not sure if we are still on the same page. A pipeline analytic model which does not have a MAX or a MIN cannot be correct (at first order). In any kind of pipeline processing, even within a core, the overall time is always characterized by a MAX/MIN function. You cannot simply sum the contributions (which you seem to be doing here, unless I am mistaken). It is a bottleneck analysis where the slowest bottleneck dominates, and if you speed the slowest one up, then the next one becomes the limiter. The answer for this loop, as I run it on a real 16-core SMP machine, agrees with my analytic model almost 100%. (I measure the overhead of pushing and popping on the queue to be about 70 cycles; I am using my own work-queue algorithm, which I think is better than the one used outside.)

      I understand it’s a simple pipeline, but that is why I use it to prove the first-order model. I wrote a pipeline framework of my own for the Pipeline Parallelism paper. I hear you, it’s no fun :-) In fact, I can see why your problem must be harder with network I/O involved.

  5. I agree that Amdahl’s Law is frequently misused, as it only presents a simple computing model that most likely will not capture all the details of a program. However, it is the simplicity of Amdahl’s Law that makes it so powerful.

    The usefulness of Amdahl’s Law can be better shown in the following example:

    Imagine that you are solving a complex problem and you are developing the application from zero (which you estimate will result in, say, tens of thousands of lines of code). Naturally, the first thing that you do is break your problem down into smaller and more manageable units. It is often the case that you decompose the problem into a sequence of specialized modules. It is important to notice that at this point you don’t worry about parallelism, as at first you are interested in the correctness of your application and getting all the features right.

    Once you are the owner of your (rather complex) serial code, you look into the advantages of parallelization. You don’t need to start your parallel application from zero; that is too risky and can be costly too. So you start by looking into your serial code, which was built as a “sequence of specialized modules”. Here is when you can use Amdahl’s Law to look at which modules should be parallelized and what the possible gains are (by setting an upper limit under a best-case scenario).

    For example, consider that after a simple analysis you notice that your application has one module that is impossible (or rather extremely difficult) to parallelize and that this module takes 25% of the overall time of the sequential application, while the other 75% of the application is embarrassingly parallel. In this case, Amdahl’s Law tells you that under the best conditions the maximum speedup will be 4x, regardless of the number of cores/processors you have.
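    (For reference, plugging these numbers into the equation above: Speedup(N) = 1 / (0.25 + 0.75/N), which evaluates to roughly 3.37x at 16 cores and 3.82x at 64 cores, and approaches 1/0.25 = 4 as N grows.)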

    • I think we both agree completely.

      Amdahl’s law is powerful and useful as long as you know when to use it. In your example, it is very applicable because the example follows the Amdahl model: one part is embarrassingly parallel and the other part has zero parallelism. Amdahl’s law is your best friend here.

      Let me extend your example now. Let’s say that after you are done with the embarrassingly parallel code, you want to squeeze some juice out of the serial code as well. You stare at it and realize that it can be split into tasks, where some tasks will be dependent on others while others will be able to run in parallel. Now Amdahl’s equation cannot be used because this new kernel is neither serial nor completely parallel. My pet peeve is that people continue to call the second kernel Amdahl-limited and try to use Amdahl’s equation with it.

      By the way, I am amused at how your parallelization methodology is the same as mine — write serial, parallelize incrementally. There are many who would say that we should “think in parallel” from the start.

        In the case that you describe, where you identify a group of well-defined tasks with dependencies, it would be interesting to see if it is feasible to describe the tasks and their dependencies as a DAG (directed acyclic graph) and then schedule the execution of the tasks so that the execution order minimizes the critical path of the DAG while using all the available resources in the system.

        About “think in parallel”: software development is more about risk management than coding. Producing full-featured serial software that is bug-free is quite a challenge, and adding parallelism to the mix is asking for trouble (in my experience). However, I think that as you get more experience with parallel programming you get to see which design features make your application easier or harder to parallelize. With enough experience you can “think in parallelism” while you design your serial code so that upgrading to parallel is easier (but probably not easy).

  6. you’re selling Amdahl short. all he’s saying is that execution time approaches serial as you apply increasing parallelism. there’s always some serial, and with infinite cores that are infinitely scalable with no overhead, you’re stuck with what’s left. changing the description by decomposing the part you called serial before is no disproof of Amdahl’s observation – if anything, you’re going to reiterate it.

    in other words: overlap however much you want; your performance will always be bounded by the pieces that don’t overlap as much. like a lot of “laws”, it’s pretty close to tautology.

    • I totally agree. It does not disprove Amdahl’s insight (which is why I quoted his insight separately). It is a disproof of the commonly used “Amdahl’s law equation,” which wasn’t his own work. The equation should not be used unless the workload follows the black-and-white serial/parallel model.

  7. The formula for Amdahl’s law, as well as the phrase “Amdahl’s law,” came from Willis H. Ware in 1972 in a RAND report titled “The Ultimate Computer.” See IEEE Spectrum Vol. 9, No. 3, pp 84-91. The foils Gene used when he presented included an algebraic formula showing how speeds add harmonically, but the formula did not actually appear in the paper, puzzling many who have gone back to read it.

    I’ve had the good fortune to meet Gene Amdahl a number of times, and directly ask him about things that don’t show up in his 1967 paper. He was debating Dan Slotnick about the merits of the ILLIAC IV design, and pointed out that the OS overhead on a SIMD machine gives rise to high serial fractions. Gene does not believe there is a very large serial limit if you have true MIMD as in modern message-passing architectures. Gene is embarrassed at being credited with a formula that shows how speeds add, which is just 8th-grade story problem stuff; he also never meant his observation to be misused, as it usually is, as an excuse not to improve some part of a computer’s performance.

    It’s funny how people make the leap from an algebraic tautology to the assumption that the serial fraction is “probably pretty big,” for which there exists no mathematical foundation whatsoever. It could be parts per trillion. In fact, for the entries on the TOP500 list of computer speeds that use hundreds of thousands of processors, you can work backward to figure out the serial fraction (which is what most people do, since it’s very difficult to derive any other way) and prove that the fraction must be really, really small.
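    (A sketch of that back-calculation, using the equation from the post: inverting Speedup(N) = 1 / ((1 - p) + p/N) gives a serial fraction (1 - p) = (1/Speedup - 1/N) / (1 - 1/N), sometimes called the Karp-Flatt metric. For a hypothetical machine reporting a speedup of 150,000 on 200,000 processors, this works out to roughly (1/150000 - 1/200000) / (1 - 1/200000) ≈ 1.7e-6.)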

    • I did not know about the SIMD part of Gene’s insight. That clarifies several points in his paper. Thanks.

      I agree with most of what you said, but I want to add to your third point about the serial fraction being small. Perhaps I am misinterpreting, but I disagree with your conclusion that serial portions are generally small. There are many examples of programs that do have long serial portions, e.g., you will find long serial portions in commonly used open-source databases, branch-and-bound kernels, web browsers, etc. TOP500 workloads are a bit biased because they are super optimized; most programs I have seen are not like that. As parallel programming becomes common, I expect to see more and more mediocre parallel programs.

  12. Has anyone ever encountered negative scaling, i.e., time-to-completion increases (or at least does not decrease) as OpenMP threads are added:
    On a 16-core Opteron, gcc, CentOS, measured by `time`:

    cores  real                        user
    16     21m34.479 = 1294.479 sec    169m37.168 = 10177.168 sec
    8      19m37.265 = 1177.265 sec    81m49.617  = 4909.617 sec
    4      20m42.487 = 1242.487 sec    45m45.416  = 2745.416 sec
    2      20m15.627 = 1215.627 sec    28m1.094   = 1681.094 sec
    1      18m39.303 = 1119.303 sec    17m50.986  = 1070.986 sec

    This is a benchmark listing of a code that was converted to OpenMP from Pthreads. The Pthreads version shows evidence of scaling, i.e., the speedup curve has a positive slope and shows a speedup of 6 at 16 cores (about 90% parallelism).

    Of course OpenMP and Pthreads can have entirely separate realms of applicability. I am just wondering if there is a classic case out there that shows the behavior I am seeing.

    I have ideas as to why. Would like to hear from others on this.

    Thanks

    • It’s very easy to set up your program so that it drops in performance from the ST version to the MT version. One of the easiest ways to run into this problem is cache coherence: if more than one core tries to write to the same cache line, they will keep invalidating each other’s caches and taking misses. Make sure that the way you split your data leaves as little sharing as possible, especially false sharing.
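      A minimal OpenMP sketch of the false-sharing effect described above (illustrative only; it assumes a 64-byte cache line and something like gcc -fopenmp to build):

      #include <omp.h>
      #include <stdio.h>

      #define ITERS 100000000L

      struct padded { long v; char pad[64 - sizeof(long)]; };   /* one counter per cache line */

      int main(void) {
          long same_line[2] = {0, 0};                  /* the two counters likely share one cache line */
          struct padded own_line[2] = {{0}, {0}};      /* each counter gets its own cache line */

          double t0 = omp_get_wtime();
          #pragma omp parallel num_threads(2)
          {
              int id = omp_get_thread_num();
              for (long i = 0; i < ITERS; i++) same_line[id]++;   /* false sharing: the line ping-pongs */
          }
          double t1 = omp_get_wtime();
          #pragma omp parallel num_threads(2)
          {
              int id = omp_get_thread_num();
              for (long i = 0; i < ITERS; i++) own_line[id].v++;  /* no sharing */
          }
          double t2 = omp_get_wtime();
          printf("false sharing: %.2f s   padded: %.2f s   (sums: %ld %ld)\n",
                 t1 - t0, t2 - t1,
                 same_line[0] + same_line[1], own_line[0].v + own_line[1].v);
          return 0;
      }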
