from Hacker News

The AMD Radeon Instinct MI300A's Giant Memory Subsystem

by pella on 1/18/25, 12:28 PM with 99 comments

  • by btown on 1/18/25, 3:12 PM

    I've often thought that one of the places AMD could distinguish itself from NVIDIA is bringing significantly higher amounts of VRAM (or memory systems that are as performant as what we currently know as VRAM) to the consumer space.

    A card with a fraction of the FLOPS of cutting-edge graphics cards (and ideally proportionally less power consumption), but with 64-128GB of VRAM-equivalent, would be a game-changer for letting people experiment with large multi-modal models, and would seriously incentivize researchers to build the next generation of tensor abstraction libraries for both CUDA and ROCm/HIP. And for gaming, you could break new ground on high-resolution textures. AMD would be back in the game.
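    As a rough back-of-envelope sketch of why that capacity range matters for model experimentation (the parameter counts and bytes-per-parameter below are illustrative assumptions, not figures from the article or the comment):

        // Weight-only memory footprint: billions of params * bytes/param == GB.
        // Activations, KV cache, and optimizer state add more on top of this.
        #include <cstdio>

        int main() {
            const double params_billion[] = {7, 13, 70};       // assumed model sizes
            const double bytes_per_param[] = {2.0, 1.0, 0.5};  // fp16, int8, 4-bit
            const char *precision[] = {"fp16", "int8", "q4"};
            for (double p : params_billion)
                for (int i = 0; i < 3; i++)
                    std::printf("%3.0fB params @ %-4s: %6.1f GB\n",
                                p, precision[i], p * bytes_per_param[i]);
            return 0;
        }

    A 70B-parameter model is roughly 140 GB of weights at fp16 and ~35 GB at 4-bit, which is exactly the range where a 64-128GB card becomes interesting and a 24GB consumer card does not.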

    Of course, if it's not real VRAM, it needs to be at least somewhat close on the latency and bandwidth front, so let's pop on over and see what's happening in this article...

    > An Infinity Cache hit has a load-to-use latency of over 140 ns. Even DRAM on the AMD Ryzen 9 7950X3D shows less latency. Missing Infinity Cache of course drives latency up even higher, to a staggering 227 ns. HBM stands for High Bandwidth Memory, not low latency memory, and it shows.

    Welp. Guess my wish isn't coming true today.
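    For context on how numbers like that 140 ns / 227 ns figure are typically obtained: load-to-use latency is usually measured with a dependent pointer chase, where each load's address comes from the previous load. A minimal CPU-side sketch of the idea (buffer size, iteration count, and the single-cycle permutation are my choices, not the article's actual methodology):

        // Dependent pointer chase: the next address is only known once the previous
        // load completes, so average time per iteration ~= load-to-use latency.
        #include <chrono>
        #include <cstdio>
        #include <numeric>
        #include <random>
        #include <utility>
        #include <vector>

        int main() {
            const size_t n = (256ull << 20) / sizeof(size_t);  // ~256 MB working set (assumed)
            std::vector<size_t> next(n);
            std::iota(next.begin(), next.end(), size_t{0});

            // Sattolo's algorithm: a random single-cycle permutation, so the chase
            // walks the whole buffer instead of getting stuck in a short, cached loop.
            std::mt19937_64 rng{42};
            for (size_t i = n - 1; i > 0; i--) {
                std::uniform_int_distribution<size_t> pick(0, i - 1);
                std::swap(next[i], next[pick(rng)]);
            }

            const size_t iters = 20'000'000;
            size_t idx = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (size_t i = 0; i < iters; i++)
                idx = next[idx];                               // serial dependency chain
            auto t1 = std::chrono::steady_clock::now();

            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
            std::printf("avg load-to-use latency: %.1f ns (sink=%zu)\n", ns, idx);
            return 0;
        }

    The article's point is that even an Infinity Cache hit on the MI300A (~140 ns) comes out slower than a desktop CPU's trip to DRAM.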

  • by mk_stjames on 1/18/25, 4:06 PM

    So the 300A is an accelerator coupled with a full 24-core EPYC and 128GB of HBM, all on a single chip (or a package of chiplets, whatever).

    Why is it I can't buy a single one of these, on a motherboard, in a workstation-format case, to use as an insane workstation? Assuming you could program for the accelerator part, there is an entire world of x86-bound CAD, engineering, and entertainment-industry software (rendering, etc.) where people want a single desktop machine with 128GB+ of fast RAM to number-crunch.

    There are Blender artists out there that build dual and quad RTX4090 machines with Threadrippers for $20k+ in components all day, because their render jobs pay for it.

    There are engineering companies that would not bat an eye at dropping $30k on a workstation if it meant they could spin around 80-gigabyte CATIA models of cars or aircraft loaded in RAM quicker. I know this at least because I sure as hell did with several HP Z-series machines costing whole-Toyota-Corolla prices over the years...

    But these combined APU chips are relegated to these server units. In the end, is this a driver problem? Just a software problem? A chicken-and-egg problem where no one is developing the support because the hardware isn't on the market, and the hardware isn't on the market because AMD thinks there is no use case?

    Edit: and note that the use cases I mentioned don't really depend on latency the way gamers need to hit framerates. The cache-miss latency mentioned in the article doesn't matter as much for these types of compute applications, where the main problem is just loading and unloading massive amounts of data. Things like offline renders and post-processing of CFD simulations. Not necessarily a video output framerate.
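    That intuition can be made concrete with Little's law: to keep memory bandwidth saturated you need roughly (bandwidth x latency) bytes in flight. A quick sketch, assuming the commonly quoted ~5.3 TB/s peak HBM bandwidth for MI300-class parts and a 128-byte access granularity (both assumptions on my part; the 227 ns figure is from the article):

        // Little's law: bytes in flight = bandwidth * latency.
        // Throughput workloads hide latency by keeping that much data outstanding;
        // a single dependent-load chain (like a latency test) cannot.
        #include <cstdio>

        int main() {
            const double bandwidth_Bps = 5.3e12;   // assumed peak HBM bandwidth, ~5.3 TB/s
            const double latency_s     = 227e-9;   // worst-case latency quoted in the article
            const double access_bytes  = 128.0;    // assumed access granularity

            const double inflight = bandwidth_Bps * latency_s;
            std::printf("bytes in flight to saturate bandwidth: ~%.0f KB\n", inflight / 1024);
            std::printf("outstanding 128-byte accesses needed:  ~%.0f\n", inflight / access_bytes);
            return 0;
        }

    That works out to roughly 1.2 MB, or on the order of ten thousand outstanding accesses, which a GPU running an offline render or a CFD post-processing pass across tens of thousands of threads can supply easily; a latency-sensitive game frame has a much harder time.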

  • by erulabs on 1/18/25, 10:48 PM

    It's interesting that two simultaneous and contradictory views are held by AI engineers:

    - Software is over

    - An impenetrable software moat protects Nvidia's market capitalization

  • by neuroelectron on 1/18/25, 3:27 PM

    > Still, core to core transfers are very rare in practice. I consider core to core latency test results to be just about irrelevant to application performance. I’m only showing test results here to explain the system topology.

    How exactly are "applications" developed for this? Or is that all proprietary knowledge? TinyBox has resorted to writing their own drivers for the 7900 XTX.
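    For what it's worth, the documented (non-proprietary) path for this hardware is ROCm/HIP, which is deliberately close to CUDA; custom drivers like TinyBox's are the exception rather than the requirement. A minimal sketch of what HIP code looks like (standard HIP runtime calls, error checking omitted, nothing MI300A-specific):

        // Minimal ROCm/HIP example: allocate, copy, launch a kernel, copy back.
        #include <hip/hip_runtime.h>
        #include <cstdio>
        #include <vector>

        __global__ void axpy(float a, const float *x, float *y, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = a * x[i] + y[i];
        }

        int main() {
            const int n = 1 << 20;
            std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

            float *dx = nullptr, *dy = nullptr;
            hipMalloc((void**)&dx, n * sizeof(float));
            hipMalloc((void**)&dy, n * sizeof(float));
            hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
            hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

            // hipLaunchKernelGGL(kernel, gridDim, blockDim, sharedMemBytes, stream, args...)
            // (hipcc also accepts the CUDA-style kernel<<<grid, block>>>(...) syntax)
            hipLaunchKernelGGL(axpy, dim3((n + 255) / 256), dim3(256), 0, 0, 3.0f, dx, dy, n);

            hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
            std::printf("y[0] = %f\n", hy[0]);  // expect 5.0: 3*1 + 2
            hipFree(dx);
            hipFree(dy);
            return 0;
        }

    Built with hipcc, the same source can also target NVIDIA GPUs, which is the portability argument AMD offers in place of a CUDA-style moat.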

  • by ChuckMcM on 1/18/25, 9:34 PM

    That is quite a thing. I've been out of the 'design loop' for chips like this for a while, so I don't know if they still do full-chip simulations prior to tapeout, but woah, trying to simulate that thing would take quite the compute complex in itself. Hats off to AMD for getting it out the door.

  • by amelius on 1/18/25, 2:44 PM

    I'm curious why this space hasn't been patented to death.

  • by buyucu on 1/19/25, 12:30 PM

    MI300 is an insanely good GPU. There is nothing that Nvidia sells that even comes close. The H100 only has 80GB of memory, whereas the MI300X has 192GB. If you are training large models, AMD is the way to go.

  • by behnamoh on 1/18/25, 6:21 PM

    AMD is done; no one uses their GPUs for AI because AMD was too dumb to understand the value of software lock-in the way Nvidia did with CUDA.