Jul 05, 2011

Hardware prefetching is an important performance-enhancing feature of today's microprocessors. It has been shown to improve performance by 10-30% without requiring any programmer effort. However, a programmer can throw away this benefit by making naive design decisions. Avoiding the common pitfalls only requires a 10,000-foot view of how a hardware prefetcher works. Providing that view is the goal of my post.

What is Prefetching?

Prefetching means fetching data closer to the processor before it is needed, so that the data is readily available when the processor needs it. Without prefetching, the processor has to stall and wait for the data. Prefetching is applied at many levels, including disk to DRAM, DRAM to on-chip caches, and between different levels of caches. Prefetching from disk to DRAM is done by the operating system, and I will leave it to an OS expert to enlighten us on that topic (any volunteers?). Prefetching into caches is typically done by a hardware prefetcher, which is what I describe below.

How does a hardware prefetcher work?

A hardware prefetcher is a relatively simple structure. It eavesdrops on the memory accesses going from the processor to the caches: it monitors which memory locations the processor is currently accessing, predicts which locations will be needed in the future, and issues prefetches to those locations. Predicting future accesses is a very difficult problem. Most common prefetchers today can only detect constant-stride patterns, i.e., when the processor is stepping through memory at a fixed interval. For example, the prefetcher will be able to prefetch the elements of A and B in the following code because both strides are fixed:

for(int i = 0; i < N; i++)    // A is written every 4th element, B read every 3rd:
    A[i*4] = B[i*3];          // both are constant strides the prefetcher can track

However, the prefetcher is not good at predicting random accesses. Thus, if your program jumps all over memory, or follows links that are scattered across it, the prefetcher becomes useless. For example, the following code will throw off the prefetcher completely.

for(int i = 0; i < N; i++)    // the index into B is random, so the
    A[i*4] = B[rand()%N];     // reads of B cannot be predicted or prefetched

Note that both of my examples above refer to prefetching data. For completeness, I should add that processors also employ an instruction prefetcher to bring instructions into the instruction cache (I-Cache). The instruction prefetcher is less interesting from our perspective for two reasons. First, it does its job very well because instructions are stored in consecutive locations, which makes prefetching very effective. Second, programmers have very little control over instruction placement in memory.

 

What can a programmer do?

There is nothing a programmer needs to do proactively to leverage prefetching. Just keep the following in mind when writing your code:

Use arrays as much as possible. Prefer arrays and simple loops; lists, trees, and graphs have complex traversals which can confuse the prefetcher.
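
To see the difference, compare the two loops below. This is a minimal sketch; the struct and function names are mine, purely for illustration:

#include <stddef.h>

/* Array sweep: addresses grow by a fixed stride, so the prefetcher
   can run ahead of the loop. */
long sum_array(const long *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Pointer chase: the next address is unknown until the current node
   arrives from memory, so the prefetcher cannot help. */
struct node { long value; struct node *next; };

long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}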

Avoid long strides. Prefetchers detect strides only within a certain range because detecting longer strides requires much more hardware storage. In my experience, a stride of more than 1000 bytes is ignored by the prefetcher. Thus, it can sometimes help to block your algorithm even from a prefetching standpoint.
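
As a hypothetical illustration of blocking, consider summing the columns of a large row-major matrix. The sizes and block factor below are made up and would need tuning:

#include <stddef.h>

#define ROWS  4096   /* illustrative sizes */
#define COLS  4096
#define BLOCK   64   /* columns per block; must divide COLS here */

/* Naive column-by-column walk: consecutive accesses are
   COLS * sizeof(double) = 32 KB apart, far beyond the stride range
   a typical prefetcher tracks. */
void col_sums_naive(const double *a, double *sum) {
    for (size_t j = 0; j < COLS; j++) {
        sum[j] = 0.0;
        for (size_t i = 0; i < ROWS; i++)
            sum[j] += a[i * COLS + j];
    }
}

/* Blocked walk: the inner loop touches BLOCK consecutive elements
   (unit stride) before jumping to the next row, a pattern the
   prefetcher handles well. */
void col_sums_blocked(const double *a, double *sum) {
    for (size_t j = 0; j < COLS; j++)
        sum[j] = 0.0;
    for (size_t jb = 0; jb < COLS; jb += BLOCK)
        for (size_t i = 0; i < ROWS; i++)
            for (size_t j = jb; j < jb + BLOCK; j++)
                sum[j] += a[i * COLS + j];
}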

Use memory pools. If you must use a linked data structure, pre-allocate a contiguous block of memory for its elements and serve future insertions from this pool. This is not a perfect solution, but it is likely to recover some prefetching benefit.
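
Here is a minimal sketch of such a pool; all names are hypothetical:

#include <stdlib.h>

struct node { int value; struct node *next; };

struct pool {
    struct node *slab;   /* one contiguous allocation for all nodes */
    size_t next_free;    /* index of the next unused slot */
    size_t capacity;
};

/* Grab all the memory up front so that nodes allocated together
   end up next to each other. Returns 0 on allocation failure. */
int pool_init(struct pool *p, size_t capacity) {
    p->slab = malloc(capacity * sizeof(struct node));
    p->next_free = 0;
    p->capacity = capacity;
    return p->slab != NULL;
}

/* Serve insertions from the slab instead of calling malloc. */
struct node *pool_alloc(struct pool *p) {
    if (p->next_free == p->capacity)
        return NULL;     /* pool exhausted */
    return &p->slab[p->next_free++];
}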

Re-use memory locations. It is often the case that we “accidentally” allocate lists/trees in contiguous locations initially, because malloc tends to return consecutive locations thanks to its bump-pointer allocation. However, as we make subsequent insertions and deletions to the list (calling malloc and free), the list gets scattered around memory. It is not always possible, but it is a good idea to re-use a list’s deleted elements for newly incoming elements.
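
One way to do this is to extend the pool sketch above with a free list, so deleted nodes are handed back out before any fresh slots are used (again, a sketch under my own naming, building on pool_alloc from the previous example):

/* Deleted nodes are pushed onto a free list... */
static struct node *free_list = NULL;

void pool_free(struct node *n) {
    n->next = free_list;
    free_list = n;
}

/* ...and reused before touching fresh slab slots, so the list stays
   packed in the same region of memory. */
struct node *pool_alloc_reusing(struct pool *p) {
    if (free_list != NULL) {
        struct node *n = free_list;
        free_list = n->next;
        return n;
    }
    return pool_alloc(p);   /* fall back to the slab */
}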

More information for the curious

Hardware prefetchers are not always your friend. They operate in one of three states:

-Not detecting any pattern, so staying quiet

-Detecting the correct pattern, so improving performance

-Detecting a wrong pattern, so reducing performance

The last one is less intuitive and requires some explanation. If the prefetcher locks on to a wrong pattern, it starts fetching incorrect cache lines into the cache. This causes two problems: (1) it wastes precious memory bandwidth fetching useless lines, and (2) the useless lines occupy limited cache storage, evicting other useful lines.

Techniques have been proposed to reduce the negative effects of prefetching by turning off the prefetcher when it begins to hurt. In fact, some of the seminal work in this area came from a colleague of mine. If the topic interests you, you should read his paper on Feedback-Directed Prefetching.

  3 Responses to “What programmers need to know about hardware prefetching?”

  1. Nothing is usually the correct answer: But you can help sometimes, as some architectures provide a cache prefetch instruction, which can be inserted when, on occasion, you know better than the hardware. For example, the dcbt instruction in the IBM POWER architecture: http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=/com.ibm.aix.aixassem/doc/alangref/dcbt.htm

  2. This article is incomplete. There are three different kinds of prefetching that should be distinguished:

    1. Hardware prefetching which the article discusses, which detects easy-to-recognize access patterns and prefetches data that it anticipates will be used next. Using unit strides, where consecutive memory addresses are accessed, is usually better for performance.

    2. Hardware prefetching that is programmer-directed. Similar to the DMA engine on the Cell BE, but prefetches data into a cache. I cannot give details, but I know of one implementation which allows the programmer to set a starting address, length, and stride, and indicate whether the data is temporal (to be held a long time) or not. The processor, when it is not loading/storing actual data, uses idle cycles of its memory unit to prefetch the data according to the stream specified by the programmer or compiler. This is explicitly-directed hardware prefetching, and requires explicit direction by either the programmer or compiler.

    3. Software prefetching. This is more commonly available than the other two methods. Here, a prefetch instruction, sometimes similar to a load/store instruction, is inserted into the code by the programmer or compiler, to specifically prefetch one cache line of data. Temporal locality hints are sometimes specified. This type of prefetching is required to get the most performance out of most RISC, EPIC and superscalar processors, but is hard when the access patterns are irregular. Choosing the right prefetch distance, or number of iterations or bytes “ahead” that the data are prefetched as against the current iteration, is important.

  3. [...] What programmers need to know about hardware prefetching? (futurechips.org) [...]
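
For readers who want to experiment with the explicit prefetching described in the responses above, GCC and Clang expose it portably as the __builtin_prefetch intrinsic (which compiles to instructions such as dcbt on POWER). A minimal sketch; the prefetch distance of 16 is arbitrary and must be tuned per machine:

/* Software prefetching: ask for data PREFETCH_DISTANCE iterations
   ahead of where the loop is currently reading. */
#define PREFETCH_DISTANCE 16   /* illustrative; tune for your machine */

long sum_with_prefetch(const long *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* args: address, 0 = read, 3 = high temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}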
