by rck on 3/8/23, 8:25 PM with 44 comments
by ddren on 3/8/23, 11:12 PM
by londons_explore on 3/8/23, 11:18 PM
It's as if sometimes one triangle was rendered before the vsync, while the other was rendered after it.
by obl on 3/8/23, 9:39 PM
In actual hardware shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.
While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient. I don't think it's publicly documented how the "packing" of quads into lanes is done in the rasterizer for modern GPUs. I'd guess something opportunistic (maybe per tile) that takes advantage of the general spatial coherency of triangles in mesh order.
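To put a rough number on the 2x2 quad overhead alone, here is a toy Python sketch. It assumes a fullscreen rectangle split along one diagonal, even screen dimensions, pixel-center sampling, and a simplified tie-break on the diagonal. The name `count_quad_lanes` is made up for illustration, and none of this models how real hardware packs quads into warps, which is the undocumented part.

```python
def count_quad_lanes(width, height):
    """Return (shaded_lanes, covered_pixels) for a fullscreen rectangle
    split into two triangles along the diagonal from (0, 0) to (width, height).

    Toy model: every 2x2 quad touched by a triangle costs 4 lanes for that
    triangle, even if only some of its pixels are actually covered.
    """
    quads_a = 0  # quads touched by the triangle on one side of the diagonal
    quads_b = 0  # quads touched by the triangle on the other side
    for qy in range(0, height, 2):
        for qx in range(0, width, 2):
            in_a = in_b = False
            for dy in (0, 1):
                for dx in (0, 1):
                    x, y = qx + dx, qy + dy
                    # Edge function at the pixel center, scaled by 2 to
                    # stay in integers: (x + 0.5) * H - (y + 0.5) * W.
                    e = (2 * x + 1) * height - (2 * y + 1) * width
                    if e > 0:
                        in_a = True
                    else:  # simplified tie-break: ties go to triangle B
                        in_b = True
            quads_a += in_a
            quads_b += in_b
    shaded_lanes = 4 * (quads_a + quads_b)
    covered_pixels = width * height
    return shaded_lanes, covered_pixels


if __name__ == "__main__":
    print(count_quad_lanes(8, 8))  # (80, 64)
```

On an 8x8 target only the four quads on the diagonal are shaded twice, so 16 of the 80 lanes are helper lanes. The wasted fraction shrinks as the resolution grows, since only quads straddling the diagonal pay the cost.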
by ttoinou on 3/8/23, 10:29 PM
by nsajko on 3/9/23, 12:10 AM
Sounds like something that would be within the margin of error? Seems especially meaningless because it's just the average of the timings, instead of something that would visualize the distribution, like a histogram or KDE.
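A minimal sketch of what I mean, in Python with only the standard library. The helper names `summarize` and `histogram` are made up, and the sample values are placeholders rather than real measurements. The idea is to look at the spread and the shape of the timings for each variant, not only the mean.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of timings."""
    return statistics.mean(samples), statistics.stdev(samples)


def histogram(samples, bins=5):
    """Count how many samples fall into each of `bins` equal-width buckets."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all samples are equal
    counts = [0] * bins
    for s in samples:
        # Clamp so the maximum value lands in the last bucket.
        i = min(int((s - lo) / width), bins - 1)
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Placeholder frame times in milliseconds, NOT real measurements.
    two_triangles = [1.02, 1.00, 1.05, 0.98, 1.10, 1.01]
    one_triangle = [1.00, 0.99, 1.04, 0.97, 1.08, 1.00]
    for name, samples in (("two", two_triangles), ("one", one_triangle)):
        mean, spread = summarize(samples)
        print(name, round(mean, 3), round(spread, 3), histogram(samples, bins=4))
```

If the difference between the means is small compared to the spread, the average alone doesn't tell you much.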
by lukko on 3/9/23, 1:58 PM
The good thing about having 4 vertices is that you can just put a vertex position and a set of texture coordinates (x, y) on each one, and the texture can be mapped exactly.
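Roughly what I have in mind, as a Python sketch. The layout and the names (`FULLSCREEN_QUAD`, `INDICES`, `uv_for_position`) are just for illustration, and it assumes clip-space positions in [-1, 1] with the texture origin at the bottom-left, which varies between APIs.

```python
# Each vertex: ((x, y) position in clip space, (u, v) texture coordinate).
FULLSCREEN_QUAD = [
    ((-1.0, -1.0), (0.0, 0.0)),  # bottom-left
    ((1.0, -1.0), (1.0, 0.0)),   # bottom-right
    ((1.0, 1.0), (1.0, 1.0)),    # top-right
    ((-1.0, 1.0), (0.0, 1.0)),   # top-left
]

# Two triangles sharing the diagonal between vertex 0 and vertex 2.
INDICES = [(0, 1, 2), (0, 2, 3)]


def uv_for_position(x, y):
    """Map a clip-space position in [-1, 1] to a texture coordinate in [0, 1]."""
    return (x + 1.0) / 2.0, (y + 1.0) / 2.0


if __name__ == "__main__":
    # The corners of the texture line up one-to-one with the four vertices.
    for position, uv in FULLSCREEN_QUAD:
        print(position, uv, uv_for_position(*position) == uv)
```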
by teucris on 3/8/23, 10:46 PM
In the 3D graphics space, this kind of knuckle-shaving is deeply revered!
by ladon86 on 3/8/23, 10:13 PM
If not, is there any possibility that dividing a fullscreen quad into _more_ triangles would actually end up faster?
by ww520 on 3/9/23, 2:08 AM