Jun 18, 2011

In shared memory systems, multiple threads are not allowed to update shared data concurrently; this is known as the principle of mutual exclusion. Instead, accesses to shared data are encapsulated in regions of code guarded by synchronization primitives (e.g., locks). Such guarded regions of code are called critical sections. The semantics of a critical section dictate that only one thread can execute it at a given time; any other thread that requires access to the shared data must wait for the current thread to leave the critical section.

There exist two types of critical sections in programs. I call them update critical sections and reduction critical sections.

Update Critical Sections
Update critical sections occur in the midst of a parallel kernel. They protect shared data that multiple threads read-modify-write during the kernel’s execution, rather than at its end. Their execution can be overlapped with the execution of non-critical-section code. The critical section protecting the maze data in the article on “Parallelizing Complex Programs” is a good example of an update critical section.
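As a minimal sketch (in Python, with a hypothetical shared counter standing in for the shared data), an update critical section looks like this: the per-iteration work runs concurrently across threads, and only the read-modify-write of the shared data is serialized by the lock.

```python
import threading

# Hypothetical shared state protected by a single lock (an update critical section).
shared_total = 0
lock = threading.Lock()

def kernel(iterations):
    global shared_total
    for i in range(iterations):
        private = i * i          # non-critical-section work: runs concurrently
        with lock:               # update critical section: one thread at a time
            shared_total += private

threads = [threading.Thread(target=kernel, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_total)  # equals 4 * sum(i*i for i in range(1000))
```

Without the lock, the `shared_total += private` read-modify-write could interleave across threads and lose updates; with it, the final total is deterministic.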

For simplicity, let’s assume a kernel that has only one critical section. Each iteration of the loop spends one unit of time inside the critical section and three units of time outside it. The following chart demonstrates the execution timeline of this critical-section-intensive application.

[Figure: execution timeline of the kernel with 1, 2, 4, and 8 threads]

When a single thread executes, only 25% of the execution time is spent in the critical section. If the same loop is split across two threads, the execution time halves. Similarly, increasing the number of threads to four further reduces execution time. At four threads, the critical section is always busy, so the system becomes critical-section limited, and further increasing the number of threads from four to eight does not reduce the execution time.

We can capture this using a simple equation. Suppose

Tcs = time spent inside the critical section per iteration
Tnocs = time spent outside the critical section per iteration
Tp = execution time with P threads
N = number of iterations


Then Tp can be computed as:

Tp = N × max( (Tcs + Tnocs) / P , Tcs )

We compute Pcs, i.e., the number of threads required to saturate the execution time, by solving the above equation for P:

Pcs = (Tcs + Tnocs) / Tcs
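Assuming the model Tp = N × max((Tcs + Tnocs)/P, Tcs), a small Python sketch evaluates it for the example above (Tcs = 1, Tnocs = 3, so Pcs = 4):

```python
# Sketch of the update-critical-section model:
#   Tp = N * max((Tcs + Tnocs) / P, Tcs)
def t_parallel(p, n=1, tcs=1.0, tnocs=3.0):
    # Time is limited either by dividing the work across P threads
    # or by the always-busy critical section, whichever is larger.
    return n * max((tcs + tnocs) / p, tcs)

p_cs = (1.0 + 3.0) / 1.0   # Pcs = (Tcs + Tnocs) / Tcs = 4 threads
print(t_parallel(1), t_parallel(2), t_parallel(4), t_parallel(8))
# Time halves from 1 to 2 threads, but 8 threads is no faster than 4.
```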

Fine-Grain Critical Sections

To reduce contention for critical sections, many applications use different locks to protect disjoint data. Since these critical sections protect disjoint data, they can execute concurrently, thereby increasing throughput. In such software, the longest, most contended critical section is the performance limiter. This can be captured using the following equation:

Tp = max( N × (Tcs + Tnocs) / P , max over all critical sections i of ( Ni × Tcs,i ) )

where Ni is the number of times critical section i executes and Tcs,i is its length.

Note that this model is rather simplistic but it still conveys that long, frequently occurring critical sections can limit performance as the number of threads increases.
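A hedged sketch of fine-grain locking in Python, using a hypothetical array of buckets with one lock per bucket so that updates to disjoint buckets can proceed concurrently:

```python
import threading

# Hypothetical fine-grain locking: one lock per bucket, so threads contend
# only when they touch the same bucket, not on every update.
NUM_BUCKETS = 8
buckets = [0] * NUM_BUCKETS
locks = [threading.Lock() for _ in range(NUM_BUCKETS)]

def update(values):
    for v in values:
        b = v % NUM_BUCKETS            # the buckets partition the shared data
        with locks[b]:                 # serialize only within bucket b
            buckets[b] += 1

threads = [threading.Thread(target=update, args=(range(100),)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(buckets))  # 400
```

A single global lock would serialize every update; per-bucket locks leave only same-bucket updates serialized, which is why the longest, most contended lock becomes the limiter.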

Reduction Critical Sections

Reduction critical sections occur at the end of kernels and are used to combine the intermediate results computed by individual threads. The key difference from update critical sections is that reduction critical sections occur at the end of a kernel, so their execution cannot be overlapped with non-critical-section code. Since every thread executes the reduction critical section, the total time spent executing critical sections increases linearly with the number of threads. Furthermore, as the number of threads increases, the fraction of execution time spent in the parallelized portion of the code shrinks. Thus, as the number of threads increases, the total time spent inside critical sections grows while the total time spent outside them falls. Consequently, critical sections begin to dominate the execution time, and the overall execution time eventually starts to increase. A simple example is the histogram kernel I analyzed in my article on “Parallelizing and Optimizing Parallel Programs.”
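In the spirit of the histogram example, here is a minimal Python sketch of a reduction critical section: each thread fills a private histogram with no synchronization, then merges it into the shared histogram under a lock at the end of the kernel. Every thread executes the merge, so the serialized reduction time grows with the thread count.

```python
import threading

# Hypothetical shared histogram combined via a reduction critical section.
NUM_BINS = 4
histogram = [0] * NUM_BINS
lock = threading.Lock()

def kernel(data):
    private = [0] * NUM_BINS
    for x in data:                 # parallel portion: no synchronization
        private[x % NUM_BINS] += 1
    with lock:                     # reduction critical section: serialized,
        for b in range(NUM_BINS):  # executed once per thread at kernel end
            histogram[b] += private[b]

threads = [threading.Thread(target=kernel, args=(range(100),)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(histogram)  # [100, 100, 100, 100]
```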

For simplicity, let us assume that a kernel executes for 10 time units with a single thread and that 2 of those 10 units are spent in the reduction. The following figure shows its execution as the number of threads increases.

[Figure: execution timeline of the reduction kernel as the number of threads increases]

Notice how the parallel region shrinks with more threads while the critical sections begin to dominate the execution. We can capture this using a simple analytic model:

Tp = Tnocs / P + P × Tcs

We compute Pcs, i.e., the number of threads at which the execution time is minimized, by setting the derivative of the above equation with respect to P to zero:

Pcs = sqrt( Tnocs / Tcs )
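Assuming the model Tp = Tnocs/P + P × Tcs, a small Python sketch evaluates it for the 10-unit example (Tnocs = 8, Tcs = 2):

```python
import math

# Sketch of the reduction model: the parallel work shrinks as Tnocs / P while
# the serialized reductions grow as P * Tcs.
def t_parallel(p, tcs=2.0, tnocs=8.0):
    return tnocs / p + p * tcs

p_cs = math.sqrt(8.0 / 2.0)   # minimum at Pcs = sqrt(Tnocs / Tcs) = 2 threads
print(t_parallel(1), t_parallel(2), t_parallel(4))
# 10.0 at 1 thread, 8.0 at 2 threads, then rising again: 10.0 at 4 threads.
```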


Cache Locality: A Second-Order Effect

The above analysis assumes that the execution time of each instance of the critical section is independent of the number of cores. In practice, this assumption does not hold: the latency of a critical section increases with the number of cores for two reasons. First, critical sections incur cache misses when fetching shared data resident in another core’s cache. With more cores, shared data bounces around more frequently, increasing the probability of a cache miss. Second, the cache miss latency between cores grows with the number of cores due to longer wires and more interconnect hops. This increase in both the number of misses and the cost of each miss further increases the overhead of critical sections.


When there is contention for shared data, execution of threads gets serialized, which reduces performance.  As the number of threads increases, the contention for critical sections also increases. Therefore, in applications that have significant data synchronization (e.g. Mozilla Firefox, MySQL database, and operating system kernels), critical sections limit both performance (at a given number of threads) and scalability.

Looking forward …


It is important to either shorten critical sections or create fine-grain critical sections that do not suffer high contention. Parallel programming is already a daunting task, and expecting programmers to also shrink or eliminate critical sections is unreasonable. I believe that hardware can assist in this matter. Solutions like hardware transactional memory (TM) have been proposed to shorten critical sections by letting them run concurrently as long as doing so does not violate correctness. I plan to cover TM in a future post. Another, orthogonal solution that I proposed in my paper titled “Accelerating Critical Section Execution … (ACS)” is to accelerate critical sections using a faster core on a chip with heterogeneous cores. A combination of TM and ACS can practically eliminate the overhead of critical sections, thereby relieving the programmer of this burden.


  3 Responses to “Parallel Programming: Understanding the impact of Critical Sections”

  1. In the third case, where the data needs to be consolidated, wouldn’t it usually be preferable to do this from a single thread rather than using critical sections? The processing time would be the same (at least to a first-order approximation) and you’d save the overhead of the critical sections themselves. Plus in many cases I suspect the program would be conceptually simpler, particularly if the threads running in parallel were all launched from the thread that is going to consolidate the data.

  2. You had mentioned that running the critical part on a faster core in a heterogeneous processor is a solution. I would like to ask: do you think some specific pattern of code exists in the critical parts of all applications that can be parallelized? If we can find some similar pattern in the critical sections of all applications, we can try to come up with a specific micro-architecture to speed up the faster core (a real performance gain considering Amdahl’s law). Do I sound correct in my thought?

  3. [...] Understanding the Impact of Critical Sections – on parallel programming and the limitations of scalability due to locks [...]
