I received an interesting comment from ParallelAxiom, a parallel programming expert, on my post titled “When Amdahl’s law is inapplicable?” His comment made me rethink the post, and I realized I needed an example to drive the point home. I have added one to the original post, and I am publishing this new post so that RSS subscribers can see the update as well. If you have not read the original article, please look at it first.
Think about pipeline workloads. Let’s use a simple example: a workload called Rank. Rank compares a set of input strings in a file to a given search string and returns the N strings most similar to the search string.
There are three stages in a particular implementation of Rank:
S1. Read input string — this is sequential
S2. Compute similarity score by comparing to search string — multiple strings can be compared concurrently
S3. Insert the string into a heap sorted by similarity score; if the heap has N+1 elements after the insertion, drop the element with the lowest score — this is a critical section in a naive implementation.
Now, in the above workload, the three stages could be written as three distinct loops run one after the other, in which case S1 and S3 would be your Amdahl serial portions (where only a single thread runs). However, I can be smarter and write this code as a pipeline where S1, S2, and S3 run concurrently and communicate via work queues.
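As a rough illustration of what such a pipeline looks like, here is a minimal Python sketch using threads and work queues. The function names and the toy similarity score are mine for illustration, not from any real implementation of Rank:

```python
import heapq
import queue
import threading

N = 3  # keep the N most similar strings

def similarity(s, search):
    # Toy similarity score: number of matching characters by position.
    return sum(a == b for a, b in zip(s, search))

def rank_pipeline(lines, search, n_workers=4):
    q_in, q_out = queue.Queue(), queue.Queue()

    def s2_worker():
        # S2: score strings concurrently; many workers can run at once.
        while True:
            s = q_in.get()
            if s is None:  # poison pill: stop this worker
                break
            q_out.put((similarity(s, search), s))

    workers = [threading.Thread(target=s2_worker) for _ in range(n_workers)]
    for w in workers:
        w.start()

    # S1: read input strings (sequential producer).
    for s in lines:
        q_in.put(s)
    for _ in workers:
        q_in.put(None)

    # S3: maintain the top-N heap (single consumer, so no lock needed here).
    heap = []
    for _ in range(len(lines)):
        heapq.heappush(heap, q_out.get())
        if len(heap) > N:
            heapq.heappop(heap)  # drop the lowest-similarity element

    for w in workers:
        w.join()
    return sorted(heap, reverse=True)
```

Because S3 is drained by a single consumer thread here, it needs no lock, but it still processes one scored string at a time — which is exactly why it can become the bottleneck stage.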
Let’s suppose each iteration of S1 takes 1K cycles, S2 takes 10K cycles, and S3 takes 2K cycles. Then, as I increase the number of cores, the throughput of this pipeline will eventually become limited by S3: even if I speed up S2 to 1K cycles per iteration (by giving it 10 cores), S3 will not speed up. Thus, once I have assigned about 5 cores to S2, more cores will not help performance.
Now, people naively (and incorrectly) call this Amdahl’s law. It is not, because the above cannot be characterized by Amdahl’s equation. If we apply Amdahl’s law, the serial code is 3K cycles and the parallel portion is 10K cycles. Thus, with P cores:
Cycles_per_iteration_with_P_cores = 3K + 10K/P
Thus, even with infinite cores, the fastest possible speed is 3K cycles per iteration. This is obviously wrong.
The equation which characterizes this pipeline case is as follows:
Cycles_per_iteration = MAX(Cycles_per_S1, Cycles_per_S2/cores_assigned_to_S2, Cycles_per_S3)
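The two models can be compared directly with the numbers above (1K, 10K, and 2K cycles for S1, S2, and S3). This small Python check is my own restatement of the post’s arithmetic:

```python
S1, S2, S3 = 1_000, 10_000, 2_000  # cycles per iteration of each stage

def amdahl_cycles(p):
    # The (incorrect) Amdahl model: treats S1 + S3 = 3K cycles as serial.
    return (S1 + S3) + S2 / p

def pipeline_cycles(p):
    # The pipeline model: throughput is set by the slowest stage,
    # and only S2 scales with the cores assigned to it.
    return max(S1, S2 / p, S3)

for p in (1, 2, 5, 100):
    print(p, amdahl_cycles(p), pipeline_cycles(p))
```

Amdahl’s model never drops below 3K cycles per iteration, while the pipeline reaches 2K cycles per iteration at 5 cores (where 10K/5 matches S3) and stays there, confirming that extra cores beyond that point buy nothing.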
This is my poster-child example of why Amdahl’s law does not apply to all non-parallel code.
For reference: in a pipeline, it matters less how long it takes to process one string (the 3K cycles in this case). What’s important is the rate, or throughput, at which strings are processed.
Update: The following is a conversation I had about this example with @jmuffat on Twitter. I think it helps clarify an important point of this example:
@jmuffat said: Amdahl’s law knows only 2 types: serial or // [parallel]. Your example shows things that can run concurrently and calls them serial…
My response: That is exactly the point of my article. Amdahl’s law shouldn’t be used when serial code is off the critical path, but people often make this mistake. There are hundreds of “experts” who say that any code that cannot be multithreaded is an “Amdahl’s bottleneck.” I have a paper on critical sections. Despite my insistence, people still call critical sections Amdahl’s bottlenecks, but as you say, they aren’t.