Jun 20, 2011

The Top 500 list of the fastest computers in the world just came out, and the Japanese K-computer is both the fastest and the most energy-efficient computer on it. It is hard to build computers that are both fast and energy-efficient, so I set out to understand what Fujitsu has done right. This quick post is a summary of my investigation. For the very impatient: my crude, experience-based analysis says that the special-purpose instructions and highly specialized functional units in the core give them their edge.

The following paragraph at HPCwire.com gave me my first clue:

The exceptional energy efficiency of K is provided courtesy of the 8-core SPARC64 VIIIfx processor, a 58 watt chip that delivers 128 peak gigaflops. That’s nearly up to the standards of an HPC-style GPU, a processor which basically does nothing but FLOPS. For comparison, an IBM Power7 CPU provides about 256 gigaflops, but consumes 200 watts, while IBM’s other HPC chip, the PowerPC A2 SoC used in Blue Gene/Q looks to be around twice as energy-efficient as the current crop of GPUs.
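
To put the two CPUs in that quote side by side, here is a small C sketch that simply divides peak gigaflops by watts. It uses only the figures quoted above; the function names are mine, and peak numbers say nothing about sustained efficiency on real codes.

```c
#include <assert.h>

/* Peak energy efficiency in GFLOPS per watt. */
double gflops_per_watt(double peak_gflops, double watts)
{
    assert(watts > 0.0);  /* a chip that draws no power is not a valid input */
    return peak_gflops / watts;
}

/* SPARC64 VIIIfx as quoted by HPCwire: 128 peak GFLOPS at 58 W. */
double sparc64_viiifx_efficiency(void)
{
    return gflops_per_watt(128.0, 58.0);
}

/* IBM Power7 as quoted by HPCwire: about 256 GFLOPS at 200 W. */
double power7_efficiency(void)
{
    return gflops_per_watt(256.0, 200.0);
}
```

By these peak figures the VIIIfx comes out at roughly 2.2 GFLOPS per watt against 1.28 for the Power7, even though the Power7 has twice the raw peak.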

Since the power efficiency seems to be in the core, I set out to understand what it is about the SPARC64 VIIIfx that makes it so energy-efficient. The best resource I found was the 2009 Hot Chips talk from Fujitsu. The VIIIfx chip looks as follows:

Screen shot 2011-06-20 at 3.26.10 AM

There are 8 cores and a shared L2 cache in the middle of the chip. I found five unique features in this chip.

1. Lack of threading?

Given that hardware multi-threading or SMT is known to benefit simple in-order cores a lot, it is unexpected to see an HPC chip with no support for threading. However, the reason they did not do it becomes clear if you look at their ISA extensions that I discuss next.

2. HPC – ACE (Arithmetic Computational Extensions)

HPC-ACE is Fujitsu’s set of extensions to SPARC that adds large register sets and SIMD. The large register sets increase the number of FP registers from 32 to 256 and the number of integer registers from 160 to 192. This is where the problem lies. Since each SMT hardware context requires its own set of registers, adding another SMT context is very expensive when the number of registers is this high. Thus, the designers chose to have a single thread per core. For reference, you can see that the FP registers (FPR + FUB) are a big chunk of the core. The integer register file is not marked, but it should be of comparable size. Thus, the core logic is only a small fraction of the core, and hence threading is not worthwhile.

For Reference: Threading is done to increase utilization of the core logic. If core logic is only a small fraction of the chip, threading is infeasible.
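
A back-of-the-envelope sketch in C of how the architectural register state grows with the register counts above and with the number of SMT contexts. The 8-byte register width is my assumption, and the helper name is hypothetical. Raw bytes also understate the real cost, because register files are heavily multi-ported and take far more area per bit than a cache.

```c
#include <assert.h>
#include <stddef.h>

enum { REG_BYTES = 8 };  /* assumption: every register is 64 bits wide */

/* Bytes of register state needed for a given number of SMT contexts. */
size_t regfile_bytes(size_t fp_regs, size_t int_regs, size_t smt_contexts)
{
    assert(smt_contexts >= 1);  /* a core always has at least one context */
    return (fp_regs + int_regs) * REG_BYTES * smt_contexts;
}
```

Under that assumption, `regfile_bytes(32, 160, 1)` gives 1536 bytes before the extensions, `regfile_bytes(256, 192, 1)` gives 3584 bytes with them, and a second context, `regfile_bytes(256, 192, 2)`, doubles that to 7168 bytes.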

Screen shot 2011-06-20 at 3.31.58 AM

3. Software Controlled Cache

This is a rather unique feature of this machine. They have introduced software-controlled caches, much like a GPU. They have added instructions to the ISA that let software bring data into the cache. They also use the old concept of a sectored cache, which I found very enticing. I will skip the details of the sectored cache for now.
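
To give a feel for what "software brings data into the cache" means, here is a minimal C sketch using the generic `__builtin_prefetch` builtin from GCC and Clang. This is not the Fujitsu instruction, only an analogous mechanism, and the prefetch distance of 16 elements is a made-up value that would need tuning on any real machine.

```c
#include <assert.h>
#include <stddef.h>

enum { PREFETCH_AHEAD = 16 };  /* hypothetical prefetch distance, in elements */

/* Sum an array while asking the hardware to fetch upcoming elements early. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* 0 = prefetch for reading, 3 = high temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

The prefetch is only a hint: the result is the same with or without it, and only the timing of the memory traffic changes.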

4. Conditional Moves

Just like the x86 CMOV instructions, Fujitsu has introduced the concept of predicated operations in SPARC. An instruction can write its result conditionally, based on the value of a particular register. This eliminates branches in the inner loops of HPC kernels.
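
The idea can be modeled in plain C. The sketch below is only an illustration of the semantics, not HPC-ACE code: `select_i32` and the clamp functions are names I made up, and a real compiler is free to turn either version into branches or conditional moves. Converting the masked value back to `int32_t` relies on the usual two's-complement behavior of mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Branching version: each comparison is a potential mispredicted branch. */
int32_t clamp_branchy(int32_t x, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Predicated-style select: mask is all ones when cond is true, else zero. */
int32_t select_i32(int cond, int32_t a, int32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}

/* Same result as clamp_branchy, but with no control flow in the body. */
int32_t clamp_branchless(int32_t x, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    x = select_i32(x < lo, lo, x);
    return select_i32(x > hi, hi, x);
}
```

Both operands of the select are always computed, and the condition only decides which one is kept. That is the trade a predicated instruction makes: a little wasted work in exchange for no branch to mispredict.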

5. A surprisingly deep pipeline

My first assumption was that the core is a shallow 4-5 stage pipeline. The reason is that HPC machines are designed for throughput, which does not go well with deep pipelines because they reduce energy efficiency by adding flop power and speculation. However, I was surprised to see an impressive 15-stage deep pipeline which looks as follows:

Screen shot 2011-06-20 at 3.42.38 AM


And the winner is …

The lack of threading, the presence of a large register file, and a deep pipeline are all power-hungry features. Thus, they cannot explain the power-efficiency of the K. I am inclined to believe that the core is low-power because of the ACE instructions. Probably, Fujitsu has used super-optimized units for the instructions they found to be common in HPC workloads and tailored the whole pipeline around it. By the way, this makes a very strong case for functional heterogeneous computing.

  7 Responses to “Why the K-computer is the fastest and energy-efficient?”

  1. why do you think a deep pipeline is power-hungry? I’m guessing you’re thinking of netburst, which was deep to achieve high clocks – a correlation. at most this suggests that Fujitsu might be able to run their chip at much higher clock if they were prepared to take the heat ;)

    • :-) I wasn’t thinking NetBurst although it’s a good example. “Deeper pipelines are relatively power hungry” is just a general rule of thumb we architects use. There are two reasons for this:

      1. Additional misspeculation penalty: Branch misprediction penalty goes up with a deeper pipeline because you have to throw away more work on every misprediction.
      2. More flop power: Each pipeline stage requires an additional set of flops that burns additional power.

      A great resource is this paper by my colleague who was a Pentium 4 architect: Increasing Processor Performance by Implementing Deeper Pipelines

      Yes, Fujitsu could run it faster if they could take the heat:-). Have you seen the video of the guy who overclocked Pentium 4 to 7GHz using Nitro cooling?

  2. Deeper pipelines burn more power because more logic switches more frequently. I say more logic because pipelining ultimately adds to the base logic/capacitance (in addition to the flip-flops), and switching more frequently because of increased pipeline frequency (activity factor). However, if the benefit (reduction in runtime, and thus longer core-shutdown periods with almost 0 power) exceeds the cost (more instantaneous power), it may provide overall lower average power consumption (low overall energy consumption). I suspect they are not worried about branch mispredictions since it’s HPC code.

    Also, I think that deep pipelining is not the only or major culprit in the high power consumption of NetBurst. E.g. modern chips aggressively shut down (power-gate) different structures, including the whole core. This brings significant power savings. This technology was not there in NetBurst’s time.

