From the comments on my post which criticized Amdahl’s law, I got some very clear indication that awareness needs to be raised about the different types of serial bottlenecks that exist in modern parallel programs. I want to highlight their differences and explain why different bottlenecks ought to be treated differently from both a theoretical and practical standpoint.
In my definition, a serial bottleneck is code which can lead to thread serialization, i.e., it can cause threads to wait on each other. At a broader level, there are two types of serial bottlenecks:
Fully serial, Always on Critical Path: These are code portions where only a single thread exists. These bottlenecks cause other threads to wait every time they execute. These include kernels which cannot be parallelized at all.
These single threaded regions are the classic Amdahl’s bottlenecks. They have an important property that they always end up on the critical program path, i.e., if you take a single-threaded region and speed it up by a 100 cycles, the overall program execution time reduces by a 100 cycles. In addition to that, they have the following attributes.
-Easier to detect (program is in a fully serial portion if number of alive threads == 1)
-Do not impact the performance of parallel program portions (since they do not run concurrently with the parallel portion)
-Are less common since most people realize their disadvantages
-Shortening them always provides performance benefit (may be small, but its >0)
Partially serial: These include serial portions which are embedded in the parallel portions. I call them partial bottlenecks because (a) they block a variable number of threads (between 0 to N-1), and (b) they can be on or off the critical path thus shortening them may or may not benefit overall performance. These bottlenecks generally arise when threads communicate or contend for shared resources or shared data. The classic example of a partial bottleneck is a critical section. Only one thread can execute a critical section at a given time, all other threads wanting to execute the critical section must wait. However, threads not wanting to execute the critical section can continue to work. Thus, the impact on performance of the critical section depends on the number of waiting threads which in turn depends on the contention for the critical section. A highly contended critical section stalls a lot of threads while a critical section with no contention is practically the same as parallel code.
The following distinguishes partial-bottlenecks from fully-serial bottlenecks:
-The existence and severity of these bottlenecks is highly dependent on the input set, machine configuration, the number of cores, communication latencies, cache sizes, etc.
-These bottlenecks are much harder to identify for the programmer since they only surface at run-time
-They impact only the parallel program portions
-Are very common in servers and sometimes in HPC kernels
-Shortening them is only beneficial if they are causing serialization
-They limit thread scalability, i.e., they can cause the performance to peak at a given number of threads such that more threads reduce performance
I feel that fully-serial bottlenecks are discussed more because of historical reasons. Classic parallel computers consist of loosely connected processors which require hundreds or thousands of cycles to communicate (sometimes via ethernet). When writing code for such a machine, it is logical for the programmers to eliminate thread communication from the parallel portions and leave as fully-serial the kernels where it is impossible to remove (or minimize) thread communication. This trade-off has changed with the advent of multi-core: cores are tightly integrated and thread communication is a smaller overhead. Thus, we can expect more programs to contain partially-serial code with thread communication. Hence this post
Who should this concern?
If you are a theoretical computer scientist, you should know that Amdahl’s law (the equation that is known as the law) only applies to the fully serial portions and not to the partially serial portions. The partial bottlenecks have a very different behavior which has different equations (See my examples of critical sections, pipeline workloads, and task parallelism).
If you are designing hardware for running parallel programs, understanding the differences between these bottlenecks is pivotal. I will give an example from my personal experience. When I was first asked to architect a heterogeneous code chip for parallel programs, I considered only the fully-serial bottlenecks and hence the first architecture I designed had a single fast core and many slow cores. I wrote an OS scheduler which turned on the fast core only in the serial phase and turned on the many small cores only in the parallel region. It worked and I was done! However, as I learned more about parallel programs, and learned about these fine-grain bottlenecks (which are often a bigger issue), the landscape changed completely. I needed enough on-chip power to keep both slow and fast cores on at the same time. I had to design mechanisms to detect these fine-grain serial portions. I had to take thread migration overheads into account. I had to decide how small my small cores can be depending on how scalable my parallel portion is. Thus, I urge hardware designers to understand this distinction as designing multicores without knowing about these finer-grain bottlenecks can lead to very sub-optimal decisions. (Side note for developers: you will be surprised at how few chip architects know this).
If you do performance analysis, you should know that while you can characterize the fully-serial bottlenecks easily, you cannot expect that staring at the code or counting instructions will tell you much about the partially parallel portions since their severity is a function of contention and very dependent on run-time behavior. (Side note for programmers: this makes life very hard for us computer architects because deterministic performance simulations become next to impossible).
If you are an application programmer, you should already know these bottlenecks and everything about them. If not, you learn asap because you will find it useful.
I just want to encourage hardware designers and novice parallel programmers to understand and embrace the trade-offs/opportunities/challenges posed by these partial bottlenecks.