The U8-Series Microarchitecture

We’ve had the pleasure of being briefed on the key aspects of the U8 microarchitecture, and we’ll be able to have a more in-depth look (albeit high-level) at how the new CPU design functions.

At the highest level, the U8 is a 3-wide issue out-of-order CPU with a pipeline depth of 12 stages, feeding 3 execution units. It’s a pretty traditional OoO-design and the noteworthy design choice here is the core’s use of physical register files instead of an architectural one, such as seen in initial Arm designs such as the A72.

One thing to note as we’re covering the microarchitecture is that SiFive didn’t disclose the exact sizes of some of the structures, which is somewhat natural given the core’s purported scalable configuration design where one can change many aspects of the IP, and we’re only covering the generic U8-Series microarchitecture as individual implementations (Such as an U84) will have different configurations.

The fetch unit of the core is able to request instructions out of the L1I at 16 bytes per cycle and put it into the fetch queue of the front-end. The RISC-V ISA has a variable instruction encoding size, so it’s not possible to map this to an exact number on instructions as one can on the Arm ISA, but if we naively assume a 32-bit average, it would correspond to 4 instructions per cycle. Of course, this isn’t surprising as the decoder on the U8 is 4-wide, feeding expanded instructions into the instruction queue.

The interesting thing here about the core is that the instruction queue is only able to issue 3 instructions out to the rename stage. Having the fetch width being higher than your issuing rate helps in the case of branch mispredictions and bubbles and allows the front-end to catch up with the execution backend, something we’ve also seen in other cores; however, we never quite saw an implementation in which the decoder was wider than the issue rate (Actually, only Intel's recent Tremont microarchitecture would also fit this characteristic). Beyond it being a deliberate design decision for the balance of the microarchitecture, maybe it’s also a forward-looking implementation on the part of the decoder whilst we may see wider issue configurations in future U8 designs.

Moving on to the mid-core, we see a traditional design into the rename stage, a re-order buffer and three dispatch engines feeding into the execution pipelines. The diagram here is a bit misleading in terms of the arrows going into the issue queues – it doesn’t mean that it’s only one instruction per issue queue, the core can still dispatch up to 3 instructions into the integer issue queues for example.

It would have been interesting to hear about the exact structure sizes on this part of the core but SiFive didn’t cover these details during the presentation.

On the integer execution block, we see that it’s actually composed of three execution pipelines. Each has its own issue queue, feeding into three ALU pipelines with different capabilities. One pipeline serves just as a regular ALU, a second one shares the port with the branch unit, while the third pipeline is a more complex one capable of integer multiplication and division.

Unfortunately, SiFive didn’t go into any detail of the floating-point pipelines or the L/S units. On the FP side, things should be relatively simple in terms of the execution capabilities, at least on the U84 core. Currently, RISC-V does not have any SIMD/Vector instructions as that ISA extension has not been finalized yet. SiFive explains that this might happen at the end of the year, and the U87 is poised to adopt the new vector capabilities next year.

SiFive and RISC-V Performance Targets, PPA and Conclusion
Comments Locked

68 Comments

View All Comments

  • AshlayW - Saturday, November 2, 2019 - link

    Thanks for the read :) very informative
  • hecksagon - Sunday, November 3, 2019 - link

    A bit of research shows that Intel, AMD, and all the major ARM SOC vendors are using SRAM for all cache. Intel does use some DRAM in its Iris Pro graphics, which can be used by the CPU cache.
  • peevee - Tuesday, November 5, 2019 - link

    "The registers and such aren't static though"

    They absolutely are.
  • name99 - Wednesday, October 30, 2019 - link

    I'd put it differently.
    Pragmatism equals design wins, and I see ARM as the most pragmatic company out there.

    Intel insisted on x86 uber alles, even where it made no sense (Larrabee, mobile) and paid the price.

    RISC-V has insisted on a certain kind of intellectual purity that makes no sense in terms of commerce, or the future properties of CPU manufacturing (plentiful transistors).

    ARM on the other hand, has always done a really masterful job of adding enough new functionality to get what they need at not too much cost, of changing the ISA when appropriate (but not too often), of accepting that some markets (like ARM-M) need different types of vector/DSP extensions from what's appropriate for ARM-A.
  • bji - Wednesday, October 30, 2019 - link

    But this is not the MIPS ISA. It's a new ISA. Why are you mentioning MIPS?
  • name99 - Wednesday, October 30, 2019 - link

    It feels very much like MIPS. (Since it comes from much the same people, substantially [too much so, IMHO] unchanged in those beliefs since the 1980s.)
  • Wilco1 - Wednesday, October 30, 2019 - link

    It's a MIPS variant indeed. This is why it's so funny when people try to claim it's a modern ISA - it's literally based on 80's RISCs. Same people, same minimalistic approach to reducing instructions at the cost of larger codesize and lower performance. No lessons learnt from MIPS...
  • bji - Wednesday, October 30, 2019 - link

    ARM is literally 80's RISC too. What is your point? That nobody who have designed an ISA in the past can make a better ISA in the future?
  • Wilco1 - Wednesday, October 30, 2019 - link

    My point is it's the same people repeating the exact same mistakes. It has the same issues as MIPS like no register offset addressing or base with update. Some things are worse, for example branch ranges and immediate ranges are smaller than MIPS. That's what you get when you're stuck in the 80's dogma of making decode as simple as possible...

    Arm never did things like the other RISCs. Is it possible to learn and do better today? Sure, look at AArch64 for example.
  • name99 - Thursday, October 31, 2019 - link

    That's an exceptionally silly riposte. Are you unaware that ARM has constantly evolved their instruction set, not just tweaks but experimenting with substantial changes (like Thumb and Thumb2)?

    There is a HUGE amount of learning that informed ARMv8, from the dropping of predication and shifting everywhere, to the way constants are encoded, to high-impact ideas like load/store pair and their particular version of conditional selection, to the codification of the memory ordering rules.
    Look at SVE as the newest version of something very different from what they were doing earlier.

Log in

Don't have an account? Sign up now