Two readers (HWM and Amateur Professional) have rightly called me out: my recent article, “Ten things every programmer must know,” provides only a list of items to study without conveying why learning those items is important. They make a very valid point, and the purpose of this post is to provide that motivation. The short answer is that the importance of efficient code has grown in recent years, and it is hard to write efficient code without knowing the hardware.
Update 6/19/2011: I have written a computer science self-assessment for developers to test their knowledge about computer science.
Why do we need efficient code?
It was idealistic to expect the average programmer to understand basic hardware five years ago because there was little (or no) motivation: most programmers were coding for desktops and laptops with large, high-frequency cores that got faster every year. In recent years, however, changes in CPU design and the emergence of new platforms have changed mainstream programming. Today’s coders write code for multi-cores, smartphones, and GPUs. Multi-core CPUs require parallel programming, which is far more sensitive to hardware configuration than single-threaded code. iPhones, iPads, and Android devices house leaner cores and limited batteries, which makes efficient coding a must. GPUs are even more sensitive to code quality and often require the entire algorithm to be hardware-centric. In my opinion, the emergence of these contemporary platforms provides more than enough motivation for programmers to learn basic hardware.
It makes sense financially…
My hypothesis is this: knowing what lies “under the hood” not only helps programmers write better code, it also allows them to understand and debug their code’s behavior faster. Don’t get me wrong, I am not asserting that every programmer should be an expert in hardware. I just believe everyone must have some basic understanding of the car they are driving in order to avoid accidents. I also argue that the amount of information needed is small and can be picked up easily (see my list here if you have not already). It is my position that learning these concepts will have a net positive impact on coders’ productivity and the efficiency of the resulting code. In turn, this will lead to better user experiences and longer battery life on client-side platforms. I must also add that more efficient programming of back-end servers directly translates into dollars, the most important metric, these days. A simple example: if I speed up my code by 10%, the amount I pay Amazon EC2 for compute drops by 10%.
Is hardware-aware coding better?
Yes, and I will back up my claim with three simple examples.
1. The importance of blocking a matrix-matrix multiply kernel is very well understood. I personally find it difficult to motivate blocking without explaining locality, caches, cache replacement, and memory bandwidth. If we explain blocking without that motivation, programmers are unable to see the entire picture.
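To make the discussion concrete, here is a minimal sketch (my own illustration, not code from the article) of a naive matrix-matrix multiply next to a blocked one. The dimension N and block size BS are assumptions for the demo; in practice BS is tuned so that a few BSxBS tiles fit in the cache.

```c
#include <stddef.h>

#define N  64   /* matrix dimension (kept small for the demo) */
#define BS 16   /* block size; tune so the working tiles fit in cache */

/* Naive triple loop: the inner loop streams down a column of B,
 * so each B element is fetched from memory with little cache reuse. */
void matmul_naive(double A[N][N], double B[N][N], double C[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Blocked version: works on BSxBS tiles, so each tile of A and B is
 * reused many times from cache before being evicted. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            C[i][j] = 0.0;
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t jj = 0; jj < N; jj += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t j = jj; j < jj + BS; j++) {
                        double sum = C[i][j];
                        for (size_t k = kk; k < kk + BS; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}
```

Both functions compute the same result; the blocked one simply reorders the work to exploit cache locality, which is exactly the motivation that gets lost when blocking is taught as a recipe.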
When asked to speed up the following kernel, a candidate I was interviewing replied “blocking.”
for (int i = 0; i < N; i++)
    A[i] = B[i];
For reference: this kernel cannot benefit from blocking because blocking exploits reuse in the cache, and this kernel has no reuse.
I think the root cause is that we are teaching programming concepts like blocking without teaching the motivation behind them.
2. It is well known that striding through a 2D array in the wrong order can kill spatial locality in the cache, stress the TLBs, and throw off hardware prefetching, rendering a program up to 10x slower, not to mention the extra energy consumed in the memory system. A programmer who doesn’t understand hardware is more likely to make this mistake and may never know what they did wrong. Expert readers may think that everyone understands this concept, but that is not true; e.g., the benchmark art in the SPEC CPU 2000 suite accesses a two-dimensional array in the wrong order in its inner loop. Similarly, a programmer who does not understand hardware may not see any difference between arrays of structs and structs of arrays, which has been shown to be a major performance factor in GPU programming (GPU Computing Gems Emerald Edition (Applications of GPU Computing Series)). One can argue that it’s the compiler’s job to fix the above; however, compilers have their own handicaps, especially when an uninformed programmer “tricks” them with misleading information.
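A minimal sketch of the traversal issue (my illustration; the array size is an arbitrary assumption): both functions below compute the same sum over a row-major C array, but the second walks down columns, so consecutive accesses land on different cache lines.

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

/* Row-major-friendly order: consecutive accesses fall on the same
 * cache line, so spatial locality and prefetching both help. */
double sum_row_order(double (*a)[COLS]) {
    double s = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Column order: each access jumps COLS * sizeof(double) bytes ahead,
 * defeating spatial locality, prefetching, and stressing the TLB. */
double sum_col_order(double (*a)[COLS]) {
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}
```

The results are identical; only the memory access pattern differs, and that pattern is invisible unless you know how caches and row-major layout interact.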
3. Branch prediction, predication, and the motivation behind conditional moves can only be appreciated if one understands pipelined processors and branch predictors. A general rule I often hear is that branches in the inner loop should be avoided. On its own, this is an incomplete statement. Many branches are easy to predict in hardware, and leaving such branches in the code is often better and more practical than adding redundant computation to remove them (see this paper). It is difficult for programmers to identify easy-to-predict branches without understanding branch prediction.
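As a sketch of the trade-off (my own example; function names are hypothetical): the first loop below contains a data-dependent branch that a predictor gets wrong roughly half the time on random input, while the second replaces it with a select that compilers typically lower to a conditional move.

```c
#include <stdint.h>
#include <stddef.h>

/* Branchy version: whether the if is taken depends on the data, so on
 * random input the branch predictor mispredicts frequently. */
int64_t sum_over_threshold_branchy(const int32_t *a, size_t n, int32_t t) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > t)
            sum += a[i];
    return sum;
}

/* Predicated version: the comparison yields a value to select on, and
 * compilers commonly emit a conditional move, so there is no branch
 * to mispredict -- at the cost of always doing the addition. */
int64_t sum_over_threshold_branchless(const int32_t *a, size_t n, int32_t t) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > t) ? a[i] : 0;
    return sum;
}
```

Which version wins depends on how predictable the data is, which is exactly why the blanket rule “avoid branches in the inner loop” is incomplete.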
By the way, virtual functions and switch/case statements fall into this category as well. Programmers use both without realizing that they are often translated into indirect branch instructions. Indirect jumps are much harder to predict than direct jumps, and most lean cores do not employ the expensive indirect-jump predictors needed to handle them effectively.
The above examples were all first-order, single-threaded examples. My example on parallel programming demonstrates how coherence and cache-line sizes also become first-order effects when writing code for multi-cores.
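One classic cache-line effect is false sharing, sketched below (my illustration using POSIX threads; the 64-byte line size is an assumption, as is the struct layout). Two threads each increment their own counter, yet if the counters sit on the same cache line, the line ping-pongs between cores on every write; padding each counter to its own line removes the contention without changing the result.

```c
#include <pthread.h>
#include <stdint.h>

#define ITERS 1000000

/* Padding each counter out to 64 bytes (an assumed cache-line size)
 * keeps the two counters on separate lines, avoiding false sharing. */
struct padded_counter {
    volatile int64_t value;
    char pad[64 - sizeof(int64_t)];
};

static struct padded_counter counters[2];

/* Each thread increments only its own counter, so the code is correct
 * with or without padding; only the coherence traffic differs. */
static void *worker(void *arg) {
    struct padded_counter *c = arg;
    for (int i = 0; i < ITERS; i++)
        c->value++;
    return NULL;
}

/* Runs one thread per counter and returns the combined total. */
int64_t run_counters(void) {
    pthread_t t[2];
    counters[0].value = 0;
    counters[1].value = 0;
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return counters[0].value + counters[1].value;
}
```

The functional result is identical either way, which is precisely why a programmer who has never heard of cache coherence can stare at perfectly correct code and not understand why it scales poorly.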
I want to reiterate my point that programmers must learn basic hardware concepts. These simple concepts, most of which can be learned within hours, can improve the overall quality of the software being written. I once again call on professors to ensure that software programmers are at least taught basic hardware, and vice versa.