May 28, 2011

In response to my post on parallel programming, a slashdot reader wrote: “if hardware is to give us small cores, then they ought to provide bigger caches.” This accurate and astute comment brought me to the realization that we must discuss what large/small cores really mean to both hardware and software. In today’s post, I discuss this issue from a single-thread perspective.

Most early processors were in-order cores. This changed in the mid-1990s with the introduction of the Alpha 21264 and the Intel P6 (a.k.a. Pentium Pro), and out-of-order cores have dominated general-purpose computing ever since. Recently, however, the industry has seen a strong push to bring back in-order cores. For example, IBM’s Power 6 went the in-order route after a decade of out-of-order designs, Intel is promoting in-order cores in the Atom and Knights families, and almost all ARM processors are in-order. Given this transition, I feel the need to raise awareness of the differences between these two types of cores and how they impact software.

What is In-Order?

In-order cores process instructions in program order. Consider the following instructions:

1: load r0, [X]    ; r0 = MEMORY[X]
2: add r1, r1, r2  ; r1 = r1 + r2
3: add r3, r4, r5  ; r3 = r4 + r5

An in-order processor will fetch, execute, and retire these instructions in the exact order in which they are written in the program–known as the program order. The main drawback of in-order processing is that if a particular instruction takes a long time to complete, all of the following instructions suffer the delay as well. For example, if the load on line 1 takes 100 cycles, then the add on line 2, which does not source any of its operands from the load, must still wait 100 cycles for the load to finish.

Think of in-order processors as single-lane highways–if one car stops, every car behind it will also stall.

This brings us to out-of-order processing.

What is Out-of-Order?

In out-of-order processing, instructions are fetched and retired in order but the actual execution happens out-of-order. It is important to note that the programmer is still given the illusion of in-order processing, because the fetch and retire remain in-order. Using practices derived from Tomasulo’s Algorithm, out-of-order cores allow independent instructions to make progress even if a step ahead of them stalls. For example, if the load on line 1 stalls while waiting for memory, the independent add on line 2 can still be processed during the wait, thus saving substantial time.

Why Build Out-of-Order?

Out-of-order processing can better tolerate long-latency instructions because of its ability to overlap their execution with other independent instructions.

For Reference: Long-latency operations are common and include integer divides (roughly 70 cycles), L1 cache misses (around 10 cycles), and L2 misses/memory accesses (100-200 cycles).

An out-of-order core also eliminates the false dependencies that are created when the compiler has to re-use a register. For example, if the compiler puts variable “x” in register “r0” and later uses “r0” for an independent variable “y,” then naive in-order hardware will enforce that the operations on “x” finish before the operations on “y” can begin, thereby creating a false dependency. The out-of-order core employs register renaming to eliminate this issue. Note: this benefit is bigger in ISAs with fewer registers because compilers have to re-use registers more frequently. As register counts increase (x86 went from 8 to 16 general-purpose registers with x86-64), the benefits of renaming diminish.

Why Return to In-Order?

There are four reasons:

1. Energy Efficiency: Out-of-order cores contain structures that consume large amounts of energy per-instruction–e.g., the scheduler. This energy-inefficiency is unacceptable in battery-life-limited markets, such as cell phones, tablets, etc.

2. Lower Min-Power: In-order cores have smaller transistor counts, hence they can be reduced to very low power-levels by decreasing frequency. Out-of-order cores cannot be brought down to the same power-levels even at lower frequencies.

3. Higher Frequency: In-order cores can reach higher peak frequencies than out-of-order cores because out-of-order cores contain structures that are difficult to pipeline, such as the scheduler. Conversely, in-order cores are easier to pipeline deeply–e.g., the in-order IBM Power chips offer the highest frequencies available on today’s market.

4. Multi-Core Suitability: With multi-core, the assumption is that programmers will write parallel code. Once software has parallelized the work, further employing out-of-order hardware to parallelize instructions is energy-inefficient. This makes in-order cores even more suitable for multi-core. Sun’s Niagara, Intel’s Knights family, and the expected many-core ARM chips all follow this approach.

How Does This Impact Software?

First, let us discuss the software that will not be affected: code in which out-of-order scheduling made no beneficial impact in the first place. For example, the following linked-list-chasing code gains nothing from out-of-order processing–even though the load of “p” frequently misses in the cache, the out-of-order core cannot overlap the loads because each load requires the previous load to finish (in order to get the load address).

1: while ( p != NULL )
2:  p = p->next;

Now let’s discuss software that will be impacted. Consider the following code example:

1: for (int i = 0; i < N; i++)  /*Typo fixed. Thanks to Caleb!*/
2:   A[i] = B[rand()%N];

This code has several parallelism possibilities that the out-of-order core can extract. For example, while servicing the load for an element of “B” in iteration “i,” the out-of-order core can simultaneously be fetching the next element of “B” and even the one after that.

-Notice that I am using a random index in “B” to ensure that there is no benefit from hardware prefetching. Read more on this below.

The out-of-order core provides benefit where the prefetcher does not work and there is independent work to be done. Part of this benefit can be recovered with simple software techniques.

How Can Code Be Optimized for In-Order Cores?

First and foremost: over the years, our compilers have innovated at a slower rate because out-of-order hardware was able to perform many traditional optimizations automatically. In my opinion, this needs to change, and compiler researchers must innovate again.

-I conducted a simple experiment (using a simulator) in which code compiled with GCC ran 3.5 times faster on an out-of-order core than on an in-order core; however, when I modified the assembly code with a simple in-order-specific optimization, the in-order core’s performance improved to within 30% of the out-of-order core.

In addition to the compiler work, the application/library programmers should become more performance-conscious. Below is a list of a few examples that come to my mind.

1. Reducing Memory Footprint: Most of us now think of memory as “free.” Yes, you can purchase inexpensive memory, but by using extra memory, you may be sacrificing performance. Programmers should always think about memory usage–e.g., remember that a binary tree stores an extra pointer (8-bytes) for every node of data compared to a linked list.  Furthermore, every leaf node essentially wastes these 8-bytes by storing a NULL (only 1-bit of information). There are countless other examples that support this assertion. In general, it can be better to increase compute in order to save trips to memory. Programmers ought to be aware of these trade-offs and must incorporate them in their design decisions.

2. Loop Unrolling: The idea behind this technique is to write the loop with multiple iterations as one–e.g., the above loop can be written as:

1: for(int i = 0; i < N; i+=2){
2:   A[i] = B[rand()%N];
3:   A[i+1] = B[rand()%N];
4: }

This not only reduces the number of branches per iteration, but also gives the compiler more opportunities to schedule instructions and expose instruction-level parallelism. Loop unrolling was previously of less value because out-of-order cores unroll loops automatically. With in-order cores, programmers should either write unrolled code or make loops unroll-friendly (by making iterations as independent as possible) so that the compiler can do the unrolling.

-Compiler-friendly code: This note goes hand-in-hand with my comment above. While it is not an in-order specific optimization, the importance of compiler-friendly code is well-worth mentioning. I have noticed that programmers often neglect to consider the compiler when writing code. I see a lot of people declaring unnecessary pointers. Compilers are quite inept at handling pointers because they cannot decipher where the pointer is actually pointing. Consequently, compilers turn off several optimizations, thus sacrificing measurable performance. I will write more about that issue later this week…

3. Assisting the Hardware Prefetcher: Most cores, in-order or out-of-order, have a hardware prefetcher. A prefetcher detects regular memory access patterns and prefetches the memory locations it predicts the program will need. When its predictions are correct, the prefetcher can virtually eliminate cache misses. Unfortunately, hardware prefetchers can only track very simple memory access patterns, such as strides and streams. For this reason, it is sometimes worthwhile to increase the amount of computation in order to use regular data structures such as arrays–especially when using an in-order core. The following loop (without “rand()” as the index into “B”) runs much faster due to the prefetcher’s help.

1: for(int i = 0; i < N; i++){
2:   int c = rand() % N;
3:   A[i] = B[i] + c;
4: }

4. Software Prefetching: Programmers can insert software prefetches for access patterns that the hardware is unable to detect. In general, software prefetches are inserted such that the next element can be prefetched while the current one is still being processed. Consider the following code:

1: int x0, x1, x2;
2: x2 = rand() % N; x1 = rand() % N;   /*x1 must also be initialized here. Thanks to David Emerson for this*/
3: for(int i = 0; i < N; i++){
4:   x0 = rand()%N;
5:   prefetch(B[x0]);
6:   A[i] = B[x2];
7:   x2 = x1;
8:   x1 = x0;
9: }

The above code prefetches the value of every “B” element two iterations ahead–i.e., iteration “i” prefetches for iteration “i + 2.” This means that when the processor gets to iteration “i + 2,” the “B” element is already in the cache, thereby preventing a cache miss. Please note that prefetching only two iterations ahead is generally not enough to cover the memory latency; I used two iterations solely for ease of illustration.


As in-order cores make their way back into general-purpose computing, compilers and programmers need to brush up on the common optimizations that became less relevant during the out-of-order era. I detailed a few examples of such optimizations, but there are many others that programmers should be aware of when writing code for in-order cores. This reinforces my point that programmers must understand the underlying hardware in order to write efficient code.

  13 Responses to “Back to One-Lane Roads: Programming the In-Order Cores”

  1. Good post, thanks. One question, why is the comparison clause in the third code block i > N instead of i < N?

  2. great article, thanks!
    in the last example, I think x1 should also be initialized before the loop begins

    • Hey David,

      Thanks for reading and pointing out the x1 issue. My code ran fine without the x1 initialization because x1 was initialized to 0 being a global. Nice catch.

      I have made the fix. Please take a look above.



  4. Hi there,

Thanks for your comment. Cell BE is one of those machines that does require a lot of leg work by programmers. In fact, I have heard Peter Hofstee (Cell’s chief architect) say that this was by design. Unfortunately, I am not on top of Cell’s architecture, and a search on Google did not provide much information about the concurrent read and write feature. Can you please point me to some document or tell me more about the context the feature is mentioned in? I will be glad to read up and explain. Same goes for the FPU.

    Looking forward to your reply.


    • Hello Aeter,
      I think it’s a shame that Cell is being discontinued. I thought it was going to be a good architecture, maybe evolving the SPEs to handle AVX instructions and fully implement single-precision floating point.

      I researched a bit, and the SPE has a thing called dual-issue instructions. I don’t know exactly what it is, but I found some links, one explaining the Cell architecture:

      And another with a good blog post on porting a Mandelbrot program optimized to run on SPEs:

      I think I have a good PDF explaining the gaps in the Cell BE SPEs’ single-precision floating-point implementation (like not signaling NaN), but I can’t find it right now.

      I bet your next article will explain SIMD programming, right? That will be a good article.
      So far you have mentioned program parallelism across cores; maybe you can mention parallelism at the instruction level and things like memory alignment and streaming.

      Thanks for your reply Aeter.

      • Hi,

        Sorry my reply is a bit late. Please don’t worry about the typo. Thanks for the pointers. I will read up on these documents this weekend.

        Re: dual-issue: Like every other term in architecture, dual-issue is also overloaded. However, most people use dual-issue to mean that two instructions can be fetched and retired every cycle. For example, the original Pentium (P5) was dual-issue because it fetched two instructions/cycle.

        I will be writing about SIMD and GPU coding soon so stay tuned.


    • Sorry for the typo :-(
      I mean Aater :-)

  5. One note on the OOO/compiler subject. A very nice thing about OOO is that it makes code more uniformly fast. With in-order cores, code performance becomes closely tied to the microarchitecture, and the compiler must have precise knowledge of the specific core the code is running on in order to generate decent output. OOO changes this, and allows one binary to run fast on very different microarchitectures.

    • Hey again!

      I agree. One can argue that the sole purpose of the OOO cores was to accommodate mediocre code. However, in-order cores seem to be coming back. Thus, programmers will need hardware knowledge in their toolbox.

