Hardware prefetching is an important performance-enhancing feature of today’s microprocessors. It has been shown to improve performance by 10-30% without requiring any programmer effort. However, a programmer can throw away this benefit by making naive design decisions. Avoiding the common pitfalls only requires a 10,000-foot view of how a hardware prefetcher works. Providing that view is the goal of my post.
What is Prefetching?
Prefetching is the concept of fetching data closer to the processor before it is needed. The intent is that the data will be readily available when the processor will need it. Without prefetching, the processor will have to stall and wait for the data. Prefetching is applied at many levels, including disk to DRAM, DRAM to on-chip caches, and between different levels of caches. Prefetching from disk to DRAM is done by the operating system and I will leave it for an OS expert to enlighten us on that topic (any volunteers?). Prefetching to caches is typically done by a hardware prefetcher which is what I describe below.
How does a hardware prefetcher work?
A hardware prefetcher is a relatively simple structure. The prefetcher eavesdrops on the memory accesses going from the processor to the caches. It monitors which memory locations are being accessed by the processor currently, predicts which memory locations will be needed in the future, and issues prefetches to those memory locations. Predicting the future accesses is a very difficult problem to solve. Most common prefetchers today can only detect constant-stride patterns, i.e., when the processor is striding through a number of memory locations. For example, the prefetcher will be able to prefetch the elements of A and B in the following code because the stride is fixed:
for(int i = 0; i < N; i++) A[i*4] = B[i*3];
However, the prefetcher is not good at predicting random accesses. Thus, if your program jumps all over memory, or follows links which are scattered all over memory, the prefetcher becomes useless. For example, the following code will defeat the prefetcher completely:
for(int i = 0; i < N; i++) A[i*4] = B[rand()%N];
Note that my above examples both refer to prefetching data. For completeness, I should add that processors also employ an instruction prefetcher to get instructions into the instruction cache (I-Cache). The instruction prefetcher is less interesting from our perspective for two reasons. First, it does its job very well because instructions are stored in consecutive locations which allows the prefetcher to be very effective. Second, programmers have very little control over instruction placement in memory.
What can a programmer do?
There is nothing a programmer needs to do proactively to leverage prefetching. Just keep the following in mind when writing your code:
Use arrays as much as possible. Prefer arrays and simple loops. Lists, trees, and graphs have complex traversals which can confuse the prefetcher.
Avoid long strides. Prefetchers detect strides only in a certain range because detecting longer strides requires a lot more hardware storage. In my experience, a stride of more than 1000 bytes is ignored by the prefetcher. Thus, it can sometimes help to block your algorithm even from a prefetching standpoint.
Use memory pools. If you must use a linked data structure, pre-allocate contiguous memory blocks for its elements and serve future insertions from this pool. This is not a perfect solution but it is likely to recover some prefetching benefit.
Re-use memory locations. It is often the case that we “accidentally” allocate lists/trees in contiguous locations initially, because malloc tends to return contiguous locations due to its bump-pointer allocation. However, as we make subsequent insertions and deletions (calling malloc and free), the list gets scattered around memory. It isn’t always possible, but it is a good idea to re-use deleted elements for newly inserted ones.
More information for the curious
Hardware prefetchers are not always your friend. They operate in one of three states:
- Not detecting any pattern, and therefore staying quiet
- Detecting the correct pattern, and improving performance
- Detecting the wrong pattern, and reducing performance
The last one is less intuitive and requires some explanation. If the prefetcher locks on to the wrong pattern, it starts fetching incorrect cache lines into the cache. This has two issues: (1) it wastes precious memory bandwidth in fetching useless lines, and (2) the useless lines occupy limited cache storage evicting other useful lines.
Techniques have been proposed to reduce the negative effects of prefetching by turning off the prefetcher when it begins to hurt. In fact, some of the seminal work in this area came from a colleague of mine. If the topic interests you, you should read his paper on Feedback-Directed Prefetching.