from Hacker News

Full screen triangle optimization

by rck on 3/8/23, 8:25 PM with 44 comments

by ddren on 3/8/23, 11:12 PM
I wonder how this is implemented in the GPU. From my time working on a 3D renderer a long time ago, triangles with offscreen vertices would be clipped into smaller triangles, so in the end you would still be rendering multiple triangles anyway. I imagine it would also be possible to clip the scanlines instead.
by londons_explore on 3/8/23, 11:18 PM
A bigger reason to do this is that on some (shoddy) hardware, the user sees a tear line along the diagonal of the triangles.
It's as if sometimes one triangle was rendered before the vsync, while the other was rendered after it.
by obl on 3/8/23, 9:39 PM
```
  In actual hardware shading is done 32 or 64 pixels at a time, not four. The problem above just got worse.
```
While it's true that there is "wasted" execution in 2x2 quads for derivative computation, it's absolutely not the case that all lanes of a hardware thread (warp / wavefront) have to come from the same triangle. That would be insanely inefficient.
I don't think it's publicly documented how the "packing" of quads into lanes is done in the rasterizer on modern GPUs. I'd guess something opportunistic (maybe per tile) that takes advantage of the general spatial coherency of triangles in mesh order.
by ttoinou on 3/8/23, 10:29 PM
Why didn't they ever implement a rectangle primitive to be drawn instead of a triangle? Anyway, here the perf impact is negligible.
by nsajko on 3/9/23, 12:10 AM
> In my microbenchmark the single triangle approach was 0.2% faster than two.
Sounds like something that would be within the margin of error? Seems especially meaningless because it's just the average of the timings, instead of something that would visualize the distribution, like a histogram or KDE.
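To make the margin-of-error point concrete, here is a rough sketch of how two sets of frame timings could be compared beyond a bare average. The function names (`summarize`, `relative_speedup`, `difference_is_within_noise`) are made up for illustration, and the "two standard errors" threshold is a crude rule of thumb that I'm assuming here, not anything the article used.

```python
import math
import statistics


def summarize(timings_ms):
    """Describe the spread of a timing sample, not just its centre."""
    cuts = statistics.quantiles(timings_ms, n=20)  # cut points at 5%, 10%, ..., 95%
    return {
        "median": statistics.median(timings_ms),
        "p5": cuts[0],
        "p95": cuts[-1],
        "stdev": statistics.stdev(timings_ms),
    }


def relative_speedup(baseline_ms, candidate_ms):
    """Fraction by which the candidate's mean time is lower than the baseline's."""
    base = statistics.fmean(baseline_ms)
    return (base - statistics.fmean(candidate_ms)) / base


def difference_is_within_noise(a_ms, b_ms):
    """Crude check: is the gap between the means under two standard errors?"""
    std_err = math.sqrt(
        statistics.variance(a_ms) / len(a_ms) + statistics.variance(b_ms) / len(b_ms)
    )
    return abs(statistics.fmean(a_ms) - statistics.fmean(b_ms)) < 2 * std_err
```

If `difference_is_within_noise` returns `True` for the two-triangle and one-triangle samples, a 0.2% gap in the means doesn't tell you much on its own; plotting the raw samples as a histogram would show the same thing visually.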
by lukko on 3/9/23, 1:58 PM
This is interesting, but also wouldn't the texture mapping / UVs be more confusing and possibly outweigh the benefit of micro-optimisation?
The good thing about having 4 vertices is that you can just put a vertex position and a set of texture coordinates (x, y) on each one, and the texture maps exactly.
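For what it's worth, the UVs for the single-triangle version can be worked out numerically. The sketch below assumes the common convention of vertices at (-1,-1), (3,-1), (-1,3) in normalized device coordinates with UVs (0,0), (2,0), (0,2), derived from the vertex index; I can't confirm that's exactly what the article does. The helper names are hypothetical.

```python
def fullscreen_triangle_vertex(i):
    """Return (position, uv) for vertex i in {0, 1, 2} of one oversized triangle."""
    u = float((i << 1) & 2)  # 0, 2, 0
    v = float(i & 2)         # 0, 0, 2
    position = (u * 2.0 - 1.0, v * 2.0 - 1.0)  # (-1,-1), (3,-1), (-1,3)
    return position, (u, v)


def interpolate_uv(x, y):
    """Interpolate the UV at NDC point (x, y) with barycentric weights."""
    (p0, t0), (p1, t1), (p2, t2) = (fullscreen_triangle_vertex(i) for i in range(3))
    # The two legs of this triangle are axis-aligned, so the weights separate per axis.
    w1 = (x - p0[0]) / (p1[0] - p0[0])
    w2 = (y - p0[1]) / (p2[1] - p0[1])
    w0 = 1.0 - w1 - w2
    u = w0 * t0[0] + w1 * t1[0] + w2 * t2[0]
    v = w0 * t0[1] + w1 * t1[1] + w2 * t2[1]
    return (u, v)
```

Evaluating `interpolate_uv` at the four screen corners gives UVs of exactly 0 or 1 on each axis, so under this convention the visible rectangle is textured the same way a four-vertex quad would be. The UVs above 1 only occur in the part of the triangle that lies off screen.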
by teucris on 3/8/23, 10:46 PM
> In my microbenchmark the single triangle approach was 0.2% faster than two. We are definitely deep into micro-optimization territory here :)
In the 3D graphics space, this kind of knuckle-shaving is deeply revered!
by ladon86 on 3/8/23, 10:13 PM
Would this still be true on a tiled rendering GPU, i.e. mobile?
If not, is there any possibility that dividing a fullscreen quad into _more_ triangles would actually end up faster?
by ww520 on 3/9/23, 2:08 AM
That's a pretty neat trick, letting the GPU exclude the out-of-bounds regions of the enlarged triangle and render only the visible rectangle.
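A quick way to convince yourself the enlarged triangle really covers the whole screen is to run the usual edge-function test on the screen corners. This is only a sketch: the vertex positions (-1,-1), (3,-1), (-1,3) are an assumed convention, and the function names are made up.

```python
# Oversized triangle in normalized device coordinates, counter-clockwise.
TRIANGLE = ((-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0))


def edge(a, b, p):
    """Signed area term: non-negative when p is on or to the left of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(p, triangle=TRIANGLE):
    """True if point p lies inside or on the boundary of the triangle."""
    a, b, c = triangle
    return edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0


def triangle_area(triangle=TRIANGLE):
    """Area of the triangle from the same edge function."""
    a, b, c = triangle
    return edge(a, b, c) / 2.0
```

All four corners of the [-1, 1] square pass `covers`, with (1, 1) sitting exactly on the hypotenuse. `triangle_area()` is 8 against a screen area of 4, so half of the triangle falls outside the viewport and never needs shading.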