This post is inspired by an old discussion I had with Jim Larus at Microsoft Research before ASPLOS 2009. Jim believes that computer architects are building the wrong computers. He thinks that we add features to processors that no programmer cares about, which forces programmers to rely solely on hardware for performance. His poster-child example was that programmers want sequential consistency, yet processors keep moving farther and farther away from it. While I completely disagreed with him at the time, I have come to understand his position better over the years. In this article, I will go against my own kind and argue that we, computer architects, are indeed using the wrong metrics to design future microprocessor chips.
In my post on factors that influence computer design, I showed the triangle of performance, power, and programmer effort. These factors, while broadly understood, are often not applied in their essence. Most architects design CPUs holding programmer effort constant: we use existing benchmarks with no changes. Thus, if the code was hammered to fit the shoehorn we designed last time, our next design can only be a better shoehorn; we will never invent the “wheel,” per se. Don’t agree with me? Take a look at any of our top conferences, like ISCA or Micro. The papers published there take pride in the fact that they used SPEC or PARSEC workloads unmodified. Yes, this makes for controlled scientific experiments, which are necessary for acceptance by the community, but these studies miss an important variable: programmer effort. SPEC workloads were written for high-performance out-of-order cores because that was the name of the game when they were written. The software pays no attention to instruction scheduling or similar optimizations, as it was written in an era when power was less relevant and programmers relied on the compiler and hardware to extract performance. If I use SPEC to design my computer, I am likely to end up with a heavyweight out-of-order core again, because the workloads were written to be sub-optimal. Had the same workloads been written to run on DSPs, their algorithms would have been very different; e.g., H.264 can easily be optimized 100x for simple in-order cores.
My message: we as a community need to start appreciating where we are wrong and fix it. I understand that it’s difficult to measure programmer effort quantitatively, which makes us architects stay away from it. I ask: is it necessary to quantify everything into a bar chart? Can we not argue about certain things qualitatively? Yes, it’s “non-scientific,” but perhaps it can lead to some scientific metrics down the road. It may sound fluffy, but never forget that the industry does use this fluffy metric to make real decisions, because it cannot afford to ignore this factor. I will conclude with an appeal: architecture papers must at least discuss how their new idea will affect programmer effort, and how the idea itself would fare if the workloads were in fact changed.
I will close with a message from Doug Carmean’s talk at UT Austin on 3/23/2011, “Computer architecture is the same as Hardware/Software co-design.”
The most important factor in serial processing is, of course, processor IPC (instructions per clock) versus clock speed. The main problem is that for quite some time IPC did not advance and appeared to have plateaued until very recently. Clock speed advancements also hit a virtual wall of process development and heat versus current. The focus then shifted to core parallelism as a “band aid” to counter this stalemate, in much the same way Hyperthreading was first used as a band aid to offset the long pipeline of Pentium 4 and Netburst processors.
Many tasks can be accelerated by running them in parallel, but many cannot. As this has become clearer over the last few years, more breakthroughs in IPC and clock speed have come through, pushing serial performance up another solid notch. It’s a balancing act, unfortunately. Intel’s recent advance in Sandy Bridge is a GREAT example of increasing IPC per core by rearranging the decode stages for more efficiency and caching only what has to be cached, rather than keeping multiple working copies of the same data in the pipeline. This allowed a better internal reallocation of processing resources, boosting single-threaded performance by anywhere from nothing at all up to the 30% commonly quoted in the media as SnB’s IPC advantage over the previous Nehalem-based architecture.
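To put rough numbers on that balancing act, here is a minimal sketch using Amdahl's law, a standard textbook model that the comment above does not name; I am bringing it in as an assumption. The function names and the sample workload (half the runtime parallelizable) are hypothetical, and the 1.3x per-core factor simply borrows the upper end of the 30% figure quoted above.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only parallel_fraction of the runtime scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def speedup_with_faster_cores(parallel_fraction: float, cores: int,
                              per_core_factor: float) -> float:
    """Same model, but every core is per_core_factor times faster (assumed uniform gain)."""
    return per_core_factor * amdahl_speedup(parallel_fraction, cores)


# Hypothetical workload: half of the runtime can be parallelized.
more_cores = amdahl_speedup(0.5, 8)                    # 8 baseline cores
faster_cores = speedup_with_faster_cores(0.5, 4, 1.3)  # 4 cores, each 30% faster
```

Under these assumed numbers, four faster cores come out ahead of eight baseline cores, which is the sense in which serial performance still matters once the parallel part has been spread out.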
Single-thread performance has begun marching forward again, albeit slower and more cautiously than during the march to Netburst, and rightly so. Because of the large, multi-year push for parallelism in software development, there have been a significant number of advances in multithreaded software, meaning that investment is still very valid. But Intel, and to a lesser extent AMD as well, are shifting focus back towards ensuring that serial processing does not stagnate the way it did for some time.
I therefore believe that there may be no need to push for faster serial performance when that goal is already becoming the balanced focus it needs to be. AMD’s Bulldozer/Llano concept of an SMT/SMP hybrid approach is meant to balance these parallel and serial loads with an architecture that addresses the balance directly. Intel’s excellent progress in this area with Sandy Bridge gives me real hope that they’re on the ball with this thinking as well.
Interesting thought. I would say that Intel and AMD push for serial thread performance for two reasons.
1. They know how to
2. Most apps remain single threaded
So your point is well taken that we no longer need to “push” single-thread performance, since it’s already on everyone’s radar. SMT/SMP is clearly a win, and it does provide this serial/parallel balance.
I would add one side point to that: parallel processing is also going in the direction of SIMD (the whole GPGPU thing). Thus, chips should most certainly include more SIMD than before, because that can cater to embarrassingly parallel workloads. On-die GPUs are interesting but not enough, because the latency of sending work to the GPU means you only ship tasks that are large enough to amortize the overhead of sending them. My next rave: closer SIMD units! (you will see a post soon :)
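The amortization argument can be written down as a simple break-even model. This is only a sketch under my own assumptions: a fixed dispatch overhead per task, a constant speedup on the accelerator, and made-up numbers in the usage lines; none of the names or values below come from a real GPU or SIMD unit.

```python
import math


def offload_time(cpu_time_s: float, accel_speedup: float, overhead_s: float) -> float:
    """Time to run a task on the accelerator, including the cost of sending it there."""
    return overhead_s + cpu_time_s / accel_speedup


def worth_offloading(cpu_time_s: float, accel_speedup: float, overhead_s: float) -> bool:
    """True when offloading beats simply running the task on the CPU."""
    return offload_time(cpu_time_s, accel_speedup, overhead_s) < cpu_time_s


def break_even_cpu_time(accel_speedup: float, overhead_s: float) -> float:
    """Smallest task (in CPU seconds) for which offloading starts to pay off.

    Solves t = overhead + t / speedup for t.
    """
    if accel_speedup <= 1.0:
        return math.inf  # the accelerator never wins under this model
    return overhead_s / (1.0 - 1.0 / accel_speedup)


# Hypothetical numbers: same 2x accelerator, far versus close to the core.
far_unit = break_even_cpu_time(accel_speedup=2.0, overhead_s=1.0)
close_unit = break_even_cpu_time(accel_speedup=2.0, overhead_s=0.1)
```

In this model the break-even task size scales linearly with the dispatch overhead, so cutting the overhead tenfold lets tasks ten times smaller benefit. That is the case for closer SIMD units.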
Side note: I just published an article titled Multi-core or multi-nonsense. Funny how you call multi-core a band-aid. You will find the article interesting.
Looking forward to it!
It’s up.
http://www.futurechips.org/2011/05/17/multi-core-multi-nonsense-multi-opportunit.html
Terrific work! This is the type of information that should be shared around the web. Shame on the search engines for not positioning this post higher!
Hi Mike,
I really feel encouraged after reading your comment. Thanks.
As for the search engines, I agree but I am hoping it will improve with time as we get more visitors and they link to our site from their sites. We do need a lot more collaboration between different parts of CS/EE to improve both sides and remove the inefficiencies. Perhaps the word is best spread peer to peer. I will continue to write so please keep your feedback coming.
-Thanks,
Aater
Thanks for the useful information.