The latest Top 500 list of the fastest computers in the world just came out, and the Japanese K computer is both the fastest and the most energy-efficient machine on it. It is hard to build computers that are both fast and energy-efficient, so I set out to understand what Fujitsu has done right. This quick post is a summary of my investigation. For the very impatient: my crude, experience-based analysis says that the special-purpose instructions and highly specialized functional units in the core give them their edge.
The following paragraph at HPCwire.com gave me my first clue:
The exceptional energy efficiency of K is provided courtesy of the 8-core SPARC64 VIIIfx processor, a 58 watt chip that delivers 128 peak gigaflops. That’s nearly up to the standards of an HPC-style GPU, a processor which basically does nothing but FLOPS. For comparison, an IBM Power7 CPU provides about 256 gigaflops, but consumes 200 watts, while IBM’s other HPC chip, the PowerPC A2 SoC used in Blue Gene/Q looks to be around twice as energy-efficient as the current crop of GPUs.
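To make the comparison in the quote concrete, here is a quick sketch that turns the quoted peak gigaflops and wattage into gigaflops per watt. The numbers are the ones from the HPCwire paragraph; the function name is just my own, and peak figures say nothing about sustained performance.

```python
def gflops_per_watt(peak_gflops, watts):
    """Peak floating-point throughput per watt of chip power."""
    return peak_gflops / watts

# Figures as quoted from HPCwire above (peak numbers, not sustained).
sparc64_viiifx = gflops_per_watt(128, 58)
power7 = gflops_per_watt(256, 200)

print(f"SPARC64 VIIIfx: {sparc64_viiifx:.2f} GFLOPS/W")
print(f"IBM Power7:     {power7:.2f} GFLOPS/W")
print(f"Ratio:          {sparc64_viiifx / power7:.2f}x")
```

That works out to roughly 2.2 GFLOPS per watt for the SPARC64 VIIIfx against 1.28 for the Power7, or about 1.7 times better at the chip level.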
Since the power efficiency seems to be in the core, I set out to understand what it is about the SPARC64 VIIIfx that makes it so energy-efficient. The best resource I found was the 2009 Hot Chips talk from Fujitsu. The VIIIfx chip looks as follows:
There are 8 cores and a shared L2 cache in the middle of the chip. I found five unique features in this chip.
1. Lack of threading?
Given that hardware multi-threading or SMT is known to benefit simple in-order cores a lot, it is unexpected to see an HPC chip with no support for threading. However, the reason they did not do it becomes clear if you look at their ISA extensions that I discuss next.
2. HPC – ACE (Arithmetic Computational Extensions)
HPC-ACE is Fujitsu’s set of extensions to SPARC that adds large register sets and SIMD. The large register sets increase the number of FP registers from 32 to 256 and the number of integer registers from 160 to 192. This is where the problem lies. Since each SMT hardware context requires its own set of registers, adding another SMT context is very expensive when the number of registers is this high. Thus, the designers chose to have a single thread per core. For reference, you can see that the FP registers (FPR + FUB) are a big chunk of the core. The integer register file is not marked but it should be of comparable size. Thus, the core logic is only a small fraction of the core, and hence threading is not feasible.
For Reference: Threading is done to increase utilization of the core logic. If core logic is only a small fraction of the chip, threading is infeasible.
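A back-of-the-envelope sketch of why the register count makes SMT expensive. It assumes every register is 64 bits wide (my assumption, not a number from the Fujitsu slides) and counts only architectural registers, ignoring renaming registers and the extra read/write ports that usually dominate register file area.

```python
BYTES_PER_REGISTER = 8  # assumption: 64-bit integer and FP registers

def register_state_bytes(fp_regs, int_regs, contexts=1):
    """Architectural register storage needed for `contexts` hardware threads."""
    return (fp_regs + int_regs) * BYTES_PER_REGISTER * contexts

# Register counts taken from the text above.
base_sparc = register_state_bytes(fp_regs=32, int_regs=160)
hpc_ace = register_state_bytes(fp_regs=256, int_regs=192)
hpc_ace_smt2 = register_state_bytes(fp_regs=256, int_regs=192, contexts=2)

print(f"Base SPARC, 1 thread: {base_sparc} bytes")
print(f"HPC-ACE, 1 thread:    {hpc_ace} bytes")
print(f"HPC-ACE, 2 threads:   {hpc_ace_smt2} bytes")
```

Under these assumptions a single HPC-ACE context already holds more than twice the register state of a base SPARC context, and a second SMT context would double it again.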
3. Software Controlled Cache
This is a rather unique feature of this machine. They have introduced software-controlled caches, much like a GPU. They have added instructions to the ISA to bring data into the cache. They also use the old-time concept of a sectored cache, which I found very enticing. I will skip the details of the sectored cache for now.
4. Conditional Moves
Just like the x86 CMOV instructions, Fujitsu has introduced the concept of predicated operations in SPARC. An instruction can write its result conditionally, based on the value of a particular register. This eliminates branches in the inner loops of HPC kernels.
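To illustrate the transformation (not the hardware), here is a toy sketch of the same inner loop written with a branch and then in predicated form, where both candidate values are available and a predicate selects which one is written. The function names are made up for this example, and Python will not show any performance difference; the point is only that the second loop body has no control-flow decision in it.

```python
def clamp_branchy(values, limit):
    """Inner loop with a data-dependent branch."""
    out = []
    for x in values:
        if x > limit:
            out.append(limit)
        else:
            out.append(x)
    return out

def clamp_predicated(values, limit):
    """Same loop, with the branch replaced by a predicate-driven select."""
    out = []
    for x in values:
        p = int(x > limit)                   # predicate: 1 or 0
        out.append(p * limit + (1 - p) * x)  # select without branching
    return out

data = [1, 5, 9, 3]
print(clamp_branchy(data, 4))
print(clamp_predicated(data, 4))
```

Both versions produce the same result; on real hardware the predicated form lets the pipeline keep streaming through the loop without having to guess which way a branch will go.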
5. A surprisingly deep pipeline
My first assumption was that the core is a shallow 4-5 stage pipeline. The reason is that HPC machines are designed for throughput, which does not go well with deep pipelines because they reduce energy efficiency by adding flip-flop power and speculation. However, I was surprised to see an impressive 15-stage pipeline, which looks as follows:
And the winner is …
The lack of threading, the presence of a large register file, and a deep pipeline are all power-hungry features. Thus, they cannot explain the power efficiency of the K. I am inclined to believe that the core is low-power because of the ACE instructions. Probably, Fujitsu has used super-optimized units for the instructions they found to be common in HPC workloads and tailored the whole pipeline around them. By the way, this makes a very strong case for functional heterogeneous computing.