Aug 09, 2011

Like other prediction mechanisms, branch predictors are better at predicting strongly biased branch outcomes.

This rule is well understood and commonly exploited; e.g., the Intel Itanium compiler assumes that prediction accuracy = MAX(percentage_taken, percentage_not_taken) when performing its profile-guided optimizations. Thus, to improve branch prediction, we should increase the number of strongly biased branches while reducing the number of oscillating ones. This post shows a simple trick to do so.

Why are biased branches predicted better?

Predictors typically use a 2-bit saturating counter to track a branch. Let's suppose we have the resources to assign every branch its own 2-bit counter (a PAp predictor). Suppose a branch is continuously taken and the counter saturates at the strongly-taken state (bits: 11). The prediction accuracy will be 100%. However, there will be a misprediction each time the branch is not taken, bumping the counter down to bits: 10; the counter will bump back up to 11 the next time the branch is taken. Thus, each not-taken instance of the branch leads to one misprediction, and the prediction accuracy will roughly equal the percentage of times the branch is taken.

Since the branch outcomes are randomly distributed, it is possible for two consecutive instances to be not taken. This bumps the counter down to the weakly-not-taken state (bits: 01) and leads to three mispredictions: two because of the not-taken instances, and one because the next taken instance will mispredict. Using simple probability, one can see that a branch that is only 70% taken is far more likely to be not taken twice in a row than a branch that is 95% taken.

Thus, a general rule is that the less biased a branch is, the lower its prediction accuracy because of the 2-bit counter behavior.

Please note that the above analysis applies equally if the branch is biased not-taken.

What can software do?

Let me explain with an example. Let's suppose that X is a random variable uniformly distributed between 0 and 99, and I want to run the following code:

if (X > 5 && X < 95) //branch is taken ~89% of the time

However, if I write the code as:

if(X > 5) //branch is taken ~94% of the time
if(X < 95) //branch is taken ~95% of the time

The branch predictor has a better shot at predicting both of these branches accurately, which may lead to better performance: the counters assigned to the two branches are more likely to stay saturated at taken, because two consecutive not-taken outcomes are less likely for each.

In general, any time you are ANDing/ORing conditions in an if-statement, consider whether the combined branch or the separated branches will be more biased, and pick the more biased version.

Play with a 2-bit counter

I have created this small animation so that you can play with a 2-bit counter.


  10 Responses to “Quick Post: Software Trick to Improve Branch Prediction”

  1. If you are doing such micro-optimizations (and are sure that the branch is executed enough times to warrant such attention), you should be looking at your compiler output.

    I tried a couple of compilers and various optimization levels on my PowerPC machine and usually got the same results from both versions. Sometimes the compiler emitted two tests and two branches; other times it combined the results of the two tests into one branch. In all but one of the cases, though, the details of the source code had no effect.

    • Hi Bob,

      I agree. In fact, I was going to add a disclaimer about that to the post but it was 2AM and it slipped my mind. Thanks for testing this out on PowerPC.

      I did my testing on Intel hardware running Mac OS and did find a difference in the output. I didn’t try different compilers/machines, so I am thankful that you did.

      I am intrigued that different optimizations produced different code for this branch. Were they branch-specific optimizations? Profile-guided? Can you please share more details for our reference?


      • Also, what compiler did you use? GCC? Intel? MS? Do they produce different results?

        • gcc & IBM’s xlc both under AIX.

          In all my xlc runs, there was no difference in the object code between the two versions of your function. With default optimization, it generated a fairly straightforward compare/branch/compare/branch sequence. With optimization on, it gave compare/compare/condition-NAND/branch, with the compares hoisted above most of the subroutine prologue.

          gcc without optimization was similar to xlc. With optimization (just -O), the combined condition used the arithmetic hack below, while the separated condition gave two compares and two branches.

  2. I just tried this myself, using different optimization levels from -O0 to -O3. The compiler was:

    i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)

    With -O0, the compiler generated exactly the same code for both forms of the source conditional (the one-if-statement and nested-if-statement forms).

    With -O1, the compiler still generated identical code for both forms of the source conditional, although the generated code was much better in this case. Notably, it contained only a single conditional branch, apparently due to an arithmetic trick on the compiler's part.

    With -O2 and -O3, the generated code was identical to -O1.

    Here’s the source code I used, plus a disassembly of the generated code:

    • Interesting. I tried the same with ICC on Linux this time since I have Linux at work.

      ICC would always generate two branches no matter what I did. The only way I could get it to generate one branch was to write my code as:

      Useful lesson: if you do want a single branch because you are ORing conditions and the combination is more likely to be biased, then you can use this convoluted way.

      By the way, another interesting find (I still need to verify it carefully): if I wrote !(!a || !b), the compiler produced fewer instructions than for (a&b). More on this in a later post, I guess, unless you guys have insights on this already.

    • I saw the same math trick when using gcc with optimization.

      It subtracts 6 from “X” and then does an unsigned compare against 88, which allows just one compare and one branch.

        • Now that I think about it, these data points are an example of a general theme: compilers do surprising things. Sometimes the surprises are positive, like this math trick gcc is doing; sometimes the surprises are negative. My personal approach to development is “profile early, profile often”: i.e., run profiling tools throughout the development process to understand where the time is being spent.

          That leads me to a related question: what tools do you folks use to measure events such as branch mispredictions and L1 instruction cache misses? Most of my speed-critical development work has been on Linux, where OProfile has been my tool of choice. What tools do Mac and Linux developers use when they need profiling data from the CPU-specific performance counter registers?

          • I have used VTune in the past, but I have also gone to the extent of using a microprocessor simulator to get a handle on performance (that only works if you work for Intel). I also once developed a cache simulator using Pin (a binary-instrumentation tool) to study the branch and cache behavior of certain workloads.

