from Hacker News

SIMD < SIMT < SMT: Parallelism in Nvidia GPUs (2011)

by shipp02 on 6/10/24, 6:05 AM with 37 comments

  • by Remnant44 on 6/11/24, 6:20 PM

    I think the principal thing that has changed since this article was written is that each category has taken inspiration from the others.

    For example, SIMD instruction sets gained gather/scatter and even masking of instructions for divergent flow (in AVX-512, which consumers never get to play with). These can really simplify writing explicit SIMD and make it more GPU-like.
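
    Roughly what that looks like with AVX-512 intrinsics (my sketch, not from the article): a compare writes a mask register, and the mask predicates a gather, so a single instruction touches sixteen addresses but only in the lanes you asked for.

      #include <immintrin.h>

      /* Gather table[idx[i]] into out[i], but only for lanes where sel[i] > 0;
         the inactive lanes keep the 0.0f passed as the fallback source. */
      void masked_gather(const float *table, const int *idx,
                         const float *sel, float *out) {
          __m512i vidx = _mm512_loadu_si512(idx);              /* 16 indices */
          __m512  vsel = _mm512_loadu_ps(sel);
          __mmask16 m  = _mm512_cmp_ps_mask(vsel, _mm512_setzero_ps(), _CMP_GT_OS);
          __m512  v    = _mm512_mask_i32gather_ps(_mm512_setzero_ps(), m,
                                                  vidx, table, 4 /* scale */);
          _mm512_storeu_ps(out, v);
      }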

    Conversely, GPUs gained a much higher emphasis on caching, sustained divergent flow via independent program counters, and subgroup instructions, which are essentially explicit SIMD in disguise.
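
    The subgroup bit is the clearest case. A minimal CUDA sketch (mine, not the article's) of a warp-level sum: each "thread" is really a lane, and the shuffle is a cross-lane move, which is about as explicitly SIMD as it gets.

      // Sum a value across the 32 lanes of a warp; lane 0 ends up with the total.
      __device__ float warp_sum(float v) {
          for (int offset = 16; offset > 0; offset >>= 1)
              v += __shfl_down_sync(0xffffffffu, v, offset);
          return v;
      }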

    SMT on the other hand... seems like it might be on the way out completely. While still quite effective for some workloads, it seems like quite a lot of complexity for only situational improvements in throughput.

  • by Const-me on 6/11/24, 7:53 PM

    > How many register sets does a typical SMT processor have? Er, 2, sometimes 4

    Way more of them. Pipelines are deep, and different in-flight instructions need different versions of the same registers.

    For example, my laptop has an AMD Zen 3 processor. Each core has 192 physical scalar registers, while the ISA only defines ~16 general-purpose scalar registers. This gives 12 register sets; they are shared by both threads running on the core.

    It's similar with SIMD vector registers. Apparently each core has 160 32-byte vector registers. Because the AVX2 ISA defines 16 vector registers, this gives 10 register sets per core, again shared by the 2 threads.

  • by narrowbyte on 6/11/24, 4:37 PM

    Quite an interesting framing. A couple of things have changed since 2011:

    - SIMD (at least Intel's AVX-512) does have usable gather/scatter, so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD

    - likewise for pervasive masking support and "Single instruction, multiple flow paths" (see the sketch below)

    In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post: https://news.ycombinator.com/item?id=40625579. SIMT requires staying closer to the "embarrassingly" parallel end of the spectrum, while SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.
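
    To illustrate the flow-paths point above, a minimal sketch (mine, assuming AVX-512F) of how an if/else becomes a mask register plus predicated instructions:

      #include <immintrin.h>

      /* Scalar version:  if (x[i] > 0) y[i] += x[i]; else y[i] -= 1.0f;   */
      void branchy16(const float *x, float *y) {
          __m512 vx = _mm512_loadu_ps(x);
          __m512 vy = _mm512_loadu_ps(y);
          __mmask16 m = _mm512_cmp_ps_mask(vx, _mm512_setzero_ps(), _CMP_GT_OS);
          vy = _mm512_mask_add_ps(vy, m, vy, vx);                 /* "then" lanes */
          vy = _mm512_mask_sub_ps(vy, (__mmask16)~m, vy,          /* "else" lanes */
                                  _mm512_set1_ps(1.0f));
          _mm512_storeu_ps(y, vy);
      }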

  • by HALtheWise on 6/11/24, 5:27 PM

    For a SIMD architecture that supports scatter/gather and instruction masking (like Arm SVE), could a compiler or language allow you to write "Scalar-style code" that compiles to SIMD instructions? I guess this is just auto-vectorization, but I'd be interested in explicit tagging of code regions, possibly in combination with restrictions on what operations are allowed.
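
    Something like that exists today (a sketch, not tied to the article): OpenMP's "simd" directive lets you tag a loop, write scalar-style code inside it, and have the compiler map the branch onto predication/masking on targets like SVE or AVX-512; ISPC makes the whole language work that way.

      // Compile with e.g. -fopenmp-simd; the branch becomes a per-lane mask.
      void saxpy_if(int n, float a, const float *x, float *y) {
      #pragma omp simd
          for (int i = 0; i < n; ++i) {
              if (x[i] > 0.0f)
                  y[i] += a * x[i];
          }
      }
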
  • by jabl on 6/11/24, 8:50 PM

    A couple of related questions:

    - It has been claimed that several GPU vendors, under the covers, convert the SIMT programming model (graphics shaders, CUDA, OpenCL, whatever) into something like a SIMD ISA that the underlying hardware supports. Why is that? Why not have something SIMT-like as the underlying HW ISA? The conceptual beauty of SIMT seems to be that you don't need to duplicate the entire scalar ISA for vectors like you do with SIMD; you just need a few thread control instructions (fork, join, etc.) to tell the HW to switch between scalar and SIMT mode. So why haven't vendors gone with this? Is there some hidden complexity that makes SIMT hard to implement efficiently, despite the nice high-level programming model?

    - How do these higher level HW features like Tensor cores map to the SIMT model? It's sort of easy to see how SIMT handles a vector: each thread handles one element of the vector. But if you have HW support for something like a matrix multiplication, what then? Or does each SIMT thread have access to a 'matmul' instruction, so that all the threads in a warp running concurrently can each run a matmul?
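
    For what it's worth, the way CUDA exposes tensor cores is closer to "the warp as a whole owns the operation": the matrix tiles live in fragments spread across the warp's registers, and the 32 threads cooperatively issue one matrix-multiply-accumulate. A rough sketch with the warp-level WMMA API (mine, simplified to a single 16x16x16 tile):

      #include <mma.h>
      #include <cuda_fp16.h>
      using namespace nvcuda;

      // All 32 threads of the warp execute this together; mma_sync maps
      // onto the tensor-core instruction for the whole tile at once.
      __global__ void one_tile(const half *a, const half *b, float *c) {
          wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
          wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
          wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

          wmma::fill_fragment(fc, 0.0f);
          wmma::load_matrix_sync(fa, a, 16);
          wmma::load_matrix_sync(fb, b, 16);
          wmma::mma_sync(fc, fa, fb, fc);
          wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
      }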

  • by mkoubaa on 6/11/24, 7:21 PM

    This type of parallelism is sort of like a FLOPS metric: optimizing the amount of wall time the GPU actually spends doing computation is just as important (if not more so). There are synchronization and pipelining tools in CUDA and Vulkan, but they are scary at first glance.
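
    A small sketch of the kind of pipelining meant here (mine, not from the article; it assumes pinned host memory so the copies can actually overlap): two CUDA streams double-buffering chunks, so the copy for one chunk overlaps the kernel for another.

      #include <cuda_runtime.h>

      __global__ void process(float *d, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) d[i] *= 2.0f;                     // stand-in for real work
      }

      // host: 'chunks' chunks of 'elems' floats; dev holds two chunk-sized buffers
      void run(float *host, float *dev, int chunks, int elems) {
          cudaStream_t s[2];
          cudaStreamCreate(&s[0]);
          cudaStreamCreate(&s[1]);
          for (int c = 0; c < chunks; ++c) {
              cudaStream_t st = s[c % 2];
              float *h = host + (size_t)c * elems;
              float *d = dev + (size_t)(c % 2) * elems;   // double buffer
              cudaMemcpyAsync(d, h, elems * sizeof(float),
                              cudaMemcpyHostToDevice, st);
              process<<<(elems + 255) / 256, 256, 0, st>>>(d, elems);
              cudaMemcpyAsync(h, d, elems * sizeof(float),
                              cudaMemcpyDeviceToHost, st);
          }
          cudaStreamSynchronize(s[0]);
          cudaStreamSynchronize(s[1]);
          cudaStreamDestroy(s[0]);
          cudaStreamDestroy(s[1]);
      }
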
  • by James_K on 6/12/24, 1:26 AM

    > Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

    Got me laughing at the first line.