from Hacker News

Basic Facts about GPUs

by ibobev on 6/24/25, 12:15 PM with 81 comments

  • by elashri on 6/24/25, 2:52 PM

    Good article summarizing a good chunk of information that people should have some idea about. I just want to comment that the title is a little misleading, because it is really describing the design choices NVIDIA makes in its GPU architectures, which are not always what other vendors do.

    For example, the arithmetic intensity break-even point (ridge point) is very different once you leave NVIDIA-land. Take the AMD Instinct MI300: up to 160 TFLOPS FP32 paired with ~6 TB/s of HBM3/3E bandwidth gives a ridge point near 27 FLOPs/byte, roughly double the A100's ~13 FLOPs/byte. The larger on-package HBM (128-256 GB) also shifts the practical trade-offs between tiling depth and occupancy. That said, these parts are very expensive and do not have CUDA (which can be good and bad at the same time).
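
    As a minimal sketch, the ridge-point arithmetic above is just peak compute divided by memory bandwidth; the figures below are the rough numbers quoted in this comment, not vendor spec sheets:

      # Ridge point: the arithmetic intensity (FLOPs/byte) at which a kernel
      # stops being bandwidth-bound and becomes compute-bound.
      def ridge_point(peak_tflops: float, bandwidth_tbps: float) -> float:
          return peak_tflops / bandwidth_tbps  # TFLOP/s over TB/s -> FLOPs/byte

      print(ridge_point(160.0, 6.0))    # MI300-class figures: ~26.7 FLOPs/byte
      print(ridge_point(19.5, 1.555))   # A100 FP32 over ~1.55 TB/s HBM2e: ~12.5 FLOPs/byte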

  • by eapriv on 6/24/25, 3:09 PM

    Spoiler: it’s not about how GPUs work, it’s about how to use them for machine learning computations.
  • by LarsDu88 on 6/24/25, 7:40 PM

    Maybe this should be titled "Basic Facts about Nvidia GPUs", as the warp terminology is a feature of modern Nvidia GPUs.

    Again, I emphasize "modern"

    An NVIDIA GPU from circa 2003 is completely different and has baked-in circuitry specific to the rendering pipelines used for video games at that time.

    So most of this post is not quite general to all "GPUs", which are a much broader category of devices that don't necessarily support the kind of general-purpose computation we use modern Nvidia GPUs for.

  • by Agentlien on 6/25/25, 4:01 PM

    I wasn't expecting the strong CUDA/ML focus. My own work is primarily in graphics and performance in video games; while this is all familiar and useful, it feels like a very different view of the hardware than mine.
  • by SoftTalker on 6/24/25, 2:16 PM

    Contrasting colors. Use them!
  • by saagarjha on 6/27/25, 9:30 AM

    > The “Peak Compute” roof of 19.5 TFLOPS is an ideal, achievable only with highly optimized instructions like Tensor Core matrix multiplications and high enough power limits.

    As mentioned below, 19.5 TFLOPS is the FP32 compute roofline, which doesn't use Tensor Cores. If you want to use those, you need to drop to FP16, and you can get substantially improved performance.
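
    A minimal sketch of the roofline being discussed, assuming approximate A100 figures (19.5 TFLOPS FP32, ~312 TFLOPS dense FP16 Tensor Core, ~1.55 TB/s HBM2e on the 40 GB part):

      # Roofline: attainable throughput is capped by both the memory roof
      # (bandwidth * arithmetic intensity) and the relevant compute roof.
      def attainable_tflops(ai: float, peak_tflops: float, bw_tbps: float) -> float:
          return min(peak_tflops, bw_tbps * ai)

      for ai in (1, 8, 13, 64, 256):
          print(ai,
                attainable_tflops(ai, 19.5, 1.555),    # FP32 roof
                attainable_tflops(ai, 312.0, 1.555))   # FP16 Tensor Core roof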

  • by gdiamos on 6/25/25, 12:27 PM

    Wow - the title is "basic facts" - but it should be "key insights"

    You wouldn't believe how many PhDs I've met who have no idea what a roofline is.

  • by bjornsing on 6/25/25, 5:20 AM

    So how are we doing with whole program optimization on the compiler level? Feels kind of backwards that people are optimizing these LLM architectures, one at a time.
  • by kittikitti on 6/24/25, 1:58 PM

    This is a really good introduction and I appreciate it. When I was building my AI PC, the deep-dive research into GPUs took a few days, but this lays it all out in front of me. It's especially great because it touches on high-value applications like generative artificial intelligence. A notable diagram from the page that I wasn't able to find represented well elsewhere was the memory hierarchy of the A100 GPUs. The diagrams were very helpful. Thank you for this!
  • by geoffbp on 6/25/25, 5:40 AM

    “Arithmetic Intensity (AI)”

    Hmm

  • by b0a04gl on 6/24/25, 2:06 PM

    been running llama.cpp and vllm on the same 4070, trying to batch more prompts for serving. llama.cpp was lagging badly once I hit batch size 8 or so, even though GPU usage looked fine. vllm handled it way better.

    later found vllm uses a paged KV cache with a layout that matches how the GPU wants to read: fully coalesced, without strided jumps. llama.cpp was using a flat layout that's fine for a single prompt but breaks L2 access patterns when batching.

    reshaped the KV tensors in llama.cpp to interleave: made them [head, seq, dim] instead of [seq, head, dim], closer to how vllm feeds data into its fused attention kernel. 2x speedup right there for the same ops.

    the GPU was never the bottleneck. it was the memory layout not aligning with the SMs' expected access stride. vllm just defaults to layouts that make better use of shared memory and reduce global reads. that's the real reason it scales better per batch.
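
    A rough illustration (in PyTorch, not llama.cpp's actual C/C++ code; the shapes and names are made up for the sketch) of the head-major relayout described above:

      import torch

      seq_len, n_heads, head_dim = 4096, 32, 128

      # Flat per-token layout: fine for a single prompt, but reads for one head
      # across the sequence are strided by n_heads * head_dim elements.
      k_flat = torch.randn(seq_len, n_heads, head_dim)        # [seq, head, dim]

      # Head-major layout: each head's (seq, dim) slab is contiguous, so a kernel
      # streaming one head's keys walks memory sequentially (coalesced reads).
      k_head_major = k_flat.permute(1, 0, 2).contiguous()     # [head, seq, dim]

      assert torch.equal(k_head_major[5], k_flat[:, 5, :])    # same data, new stride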

    this took a good 2+ days, and I had to dig under the nice-looking GPU graphs to find the real bottlenecks. it was wildly trial and error tbf.

    > anybody got an idea how to do this kind of experiment in hot-reload mode without so much hassle??

  • by neuroelectron on 6/24/25, 7:09 PM

    ASCII diagrams, really?