by shipp02 on 6/10/24, 6:05 AM with 37 comments
by Remnant44 on 6/11/24, 6:20 PM
For example, SIMD instruction sets gained gather/scatter and even per-lane masking for divergent control flow (in AVX-512, which consumers never get to play with). These can really simplify writing explicit SIMD and make it more GPU-like.
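To make the masking point concrete, here is a minimal AVX-512 intrinsics sketch (the function and array names are just illustrative) of a per-lane conditional update, where a mask register plays the role a branch would in scalar code:

```
#include <immintrin.h>

// Scalar intent: if (x[i] > 0) y[i] += x[i];
// The branch becomes a 16-bit lane mask, and the add only lands in lanes
// whose mask bit is set -- the same trick a GPU uses for divergent flow.
void masked_add(float* y, const float* x, int n) {
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        __mmask16 m = _mm512_cmp_ps_mask(vx, _mm512_setzero_ps(), _CMP_GT_OQ);
        vy = _mm512_mask_add_ps(vy, m, vy, vx);  // masked-off lanes keep old vy
        _mm512_storeu_ps(y + i, vy);
    }
    // remainder loop omitted for brevity
}
```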
Conversely, GPUs placed a much higher emphasis on caching, gained sustained divergent flow via independent program counters, and added subgroup instructions, which are essentially explicit SIMD in disguise.
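A minimal CUDA sketch of that subgroup point (the helper name is illustrative): a warp-level reduction written as per-thread code, but really a horizontal SIMD operation across the warp's 32 lanes:

```
// Each lane adds in the value held by the lane `offset` positions above it,
// halving the distance each step; after five steps lane 0 holds the warp sum.
// Per-thread syntax, but effectively an explicit 32-wide SIMD reduction.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}
```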
SMT, on the other hand... seems like it might be on the way out completely. While still quite effective for some workloads, it adds quite a lot of complexity for only situational improvements in throughput.
by Const-me on 6/11/24, 7:53 PM
Way more of them. Pipelines are deep, and different in-flight instructions need different versions of the same registers.
For example, my laptop has an AMD Zen 3 processor. Each core has 192 scalar physical registers, while the ISA only defines ~16 general-purpose scalar registers. That gives 12 register sets, shared by both threads running on the core.
It's similar with SIMD vector registers. Apparently each core has 160 32-byte vector registers. Because the AVX2 ISA defines 16 vector registers, this gives 10 register sets per core, again shared by the 2 threads.
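A toy loop (purely illustrative) shows why so many physical copies of one architectural register are useful:

```
// Every iteration reuses the same architectural register for `tmp`, yet the
// iterations are independent. Renaming gives each in-flight copy of `tmp` its
// own physical register, so many iterations can overlap in the pipeline
// instead of serializing on the register name.
void scale(int* dst, const int* src, int n) {
    for (int i = 0; i < n; ++i) {
        int tmp = src[i] * 3;
        dst[i] = tmp + 1;
    }
}
```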
by narrowbyte on 6/11/24, 4:37 PM
- SIMD (at least Intel's AVX-512) does have usable gather/scatter (see the sketch below), so "Single instruction, multiple addresses" is no longer a flexibility win for SIMT vs SIMD
- likewise for pervasive masking support and "Single instruction, multiple flow paths"
In general, I think of SIMD as more flexible than SIMT, not less, in line with this other post https://news.ycombinator.com/item?id=40625579. SIMT requires staying closer to the "embarrassingly" parallel end of the spectrum, while SIMD can be applied in cases where understanding the opportunity for parallelism is very non-trivial.
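For reference, the gather mentioned above looks roughly like this (names and sizes are illustrative); one instruction reads from sixteen unrelated addresses:

```
#include <immintrin.h>

// y[i] = table[idx[i]] for 16 lanes at once: a single gather instruction,
// sixteen independent addresses. A masked variant (_mm512_mask_i32gather_ps)
// covers the "multiple flow paths" bullet as well.
void gather16(float* y, const float* table, const int* idx) {
    __m512i vidx = _mm512_loadu_si512(idx);
    __m512  vals = _mm512_i32gather_ps(vidx, table, sizeof(float));
    _mm512_storeu_ps(y, vals);
}
```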
by HALtheWise on 6/11/24, 5:27 PM
by jabl on 6/11/24, 8:50 PM
- It has been claimed that several GPU vendors, under the covers, convert the SIMT programming model (graphics shaders, CUDA, OpenCL, whatever) into something like a SIMD ISA that the underlying hardware actually supports. Why is that? Why not have something SIMT-like as the underlying HW ISA? The conceptual beauty of SIMT seems to be that you don't need to duplicate the entire scalar ISA for vectors like you do with SIMD; you just need a few thread-control instructions (fork, join, etc.) to tell the HW to switch between scalar and SIMT mode. So why haven't vendors gone with this? Is there some hidden complexity that makes SIMT hard to implement efficiently, despite the nice high-level programming model? (See the divergence sketch after these questions.)
- How do higher-level HW features like tensor cores map to the SIMT model? It's sort of easy to see how SIMT handles a vector: each thread handles one element of the vector. But if you have HW support for something like matrix multiplication, what then? Or does each SIMT thread have access to a 'matmul' instruction, so all the threads in a warp that run concurrently can run matmuls concurrently? (See the wmma sketch below.)
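On the first question, a sketch (not any vendor's actual ISA) of what the SIMT-to-SIMD lowering looks like for a divergent branch:

```
// Per-thread (SIMT) view: an ordinary if/else.
// Hardware view: the warp computes a lane mask for the condition, executes
// the `then` side under that mask, then the `else` side under its complement.
// The branch has effectively been lowered to masked SIMD execution.
__global__ void clamp_or_double(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;          // runs only in lanes where the mask is set
        else
            x[i] = 2.0f * x[i];   // runs only in the complementary lanes
    }
}
```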
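On the tensor-core question, in CUDA at least the matmul is exposed as a warp-level collective rather than a per-thread instruction: the 32 threads jointly hold opaque fragments of the tiles and issue one multiply-accumulate together. A minimal nvcuda::wmma sketch with 16x16x16 fp16 tiles and hard-coded leading dimensions:

```
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of C = A * B. The fragments are spread
// across the warp's 32 threads in a hardware-defined layout, and mma_sync is
// a single collective operation issued by the whole warp, not per thread.
__global__ void tile_matmul(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);    // 16 = leading dimension
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);      // tensor-core multiply-accumulate
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```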
by mkoubaa on 6/11/24, 7:21 PM
by James_K on 6/12/24, 1:26 AM
Got me laughing at the first line.