AMD is on a crusade to significantly cut the cost of transferring bits between memory and compute, and it plans to do so by putting RAM on top of the CPU/GPU.
Company CEO Dr. Lisa Su recently delivered a high-level presentation at the International Solid-State Circuits Conference (ISSCC) 2023, speaking extensively about the need to cut the amount of energy (expressed in joules) spent per floating-point operation (FLOP).
Otherwise, as she puts it, a zettaflop-capable supercomputer will need a nuclear power station to keep running, and that’s neither realistic nor sustainable.
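To see why energy per operation dominates the discussion, the arithmetic can be sketched in a few lines of Python. Sustained power is simply operations per second multiplied by joules per operation. The efficiency values below are hypothetical round numbers chosen for illustration, not figures from Su's presentation, and `power_watts` is a made-up helper name. For scale, a large nuclear reactor produces on the order of 1 GW.

```python
# Illustrative arithmetic only: the efficiency figures are assumptions,
# not numbers taken from the ISSCC presentation.

ZETTAFLOP = 1e21  # floating-point operations per second


def power_watts(flops_per_second: float, gflops_per_watt: float) -> float:
    """Sustained power draw for a machine of the given speed and efficiency.

    GFLOPS per watt is the same as billions of operations per joule,
    so power = (operations per second) / (operations per joule).
    """
    flops_per_joule = gflops_per_watt * 1e9
    return flops_per_second / flops_per_joule


for efficiency in (50, 500, 2000):  # hypothetical GFLOPS per watt
    gigawatts = power_watts(ZETTAFLOP, efficiency) / 1e9
    print(f"{efficiency:>5} GFLOPS/W -> {gigawatts:.1f} GW")
```

Even under the most optimistic of these assumed efficiencies, a zettaflop machine would draw a sizeable fraction of a gigawatt, which is why shaving joules off every operation matters more than raw speed.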
Distance
The biggest improvements in performance-per-watt, Su believes, will be achieved by reducing the physical distance between the memory and where computation takes place (either on the CPU or the GPU). She used the example of the MI300 accelerator, a next-generation AMD Instinct APU whose unified HBM (high-bandwidth memory) delivers significant power savings.
Concurrently, AMD has already integrated processing-in-memory to reduce the energy required to access data.
Su’s presentation noted that “Key algorithmic kernels can be executed directly in memory, saving precious communication energy”, and for that AMD is collaborating with Samsung Electronics, whose expertise in DRAM is undeniable.
Closer is better
Memory-on-chip is already mainstream: AMD packs it in its Ryzen 9 7950X3D and, before that, in its Ryzen 7 5800X3D (note that this memory is the faster and more expensive SRAM rather than DRAM). HBM is present in AMD’s Instinct MI accelerators and in Nvidia’s popular A100 accelerator, the brains behind ChatGPT. Apple’s M-series also places its memory (LPDDR rather than HBM) next to the processor, but on the package rather than on the chip die.
Eventually, HPC will move to memory-on-chip at full scale. It is the most straightforward low-hanging fruit, as workloads that demand extremely large amounts of bandwidth push power requirements (and associated costs) up the priority list.
Fujitsu’s A64FX processor, launched in 2019, is a true trailblazer, merging dozens of Arm cores with 32GB of HBM2 memory in the same package and offering a whopping 1TBps of bandwidth. With HBM3 already available on Nvidia’s Hopper H100 enterprise GPU, things will get even more interesting. Rambus plans to go beyond the HBM3 specs and hinted, last April, at up to 1.05TBps of bandwidth.
Increased interest in HBM, the clout of the 1-ton gorilla that is Apple and the never-ending quest for bandwidth without needing an exotic power supply (and equally exotic cooling system) mean that HBM, in the long run, is likely to supplant DIMM (and GDDR) as the main memory format. Blame it on Apple.
Dr. Su expects the first Zettascale supercomputer to be unveiled before 2035: that leaves us with 12 years to find the perfect solution unless AI gets there first.