from Hacker News

Bunnymark GL in Jai – 200k sprites at 200fps [video]

by farzher on 5/1/22, 5:54 PM with 68 comments

  • by daenz on 5/3/22, 11:56 PM

    It's been a while since I've done game engine work, but is this impressive? The first thing that comes to mind is that they're using instanced rendering. This lets the CPU deal with only one sprite while telling the GPU to render many instances of it, with each instance looking up its transformation matrix in a GPU buffer. All the CPU has to do is update that mapped buffer with new position data (or do something cleverer to derive the transformations). A rough sketch of what I mean is at the end of this comment.

    Am I missing something that makes the video novel?
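
    For reference, the core of the pattern I have in mind, sketched in C++/OpenGL (illustrative only; the quad setup and names like updatePositions are placeholders, not necessarily what the demo does):

      #include <vector>
      // Assumes a GL context/loader is set up and a VAO is bound, with a unit
      // quad in attribute 0 and a shader reading attribute 1 as the
      // per-instance sprite position.
      struct Sprite { float x, y; };
      std::vector<Sprite> sprites(200000);

      GLuint instanceVbo;
      glGenBuffers(1, &instanceVbo);
      glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
      glBufferData(GL_ARRAY_BUFFER, sprites.size() * sizeof(Sprite),
                   nullptr, GL_STREAM_DRAW);          // allocate once
      glEnableVertexAttribArray(1);
      glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(Sprite), (void*)0);
      glVertexAttribDivisor(1, 1);                    // advance once per instance

      // Each frame: simulate on the CPU, re-upload positions, one draw call.
      updatePositions(sprites);                       // your simulation (placeholder)
      glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
      glBufferSubData(GL_ARRAY_BUFFER, 0,
                      sprites.size() * sizeof(Sprite), sprites.data());
      glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, (GLsizei)sprites.size());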

  • by xaedes on 5/3/22, 9:33 PM

    Nice demo! We need more of this approach.

    You really can achieve amazing stuff with just plain OpenGL, for example, optimized for your rendering needs. With today's GPU acceleration capabilities we could have town-building games with huge maps and millions of entities. Instead it's mostly used just to make fancy graphics.

    I am actually trying to build something like that at the moment [1]. A big, big world with hundreds of millions of sprites is achievable and runs smoothly; video RAM is the limit. Admittedly it is not optimized to display those hundreds of millions of sprites all at once, maybe just a few million. That would be a bit too chaotic for a game anyway, I guess.

    [1] https://www.youtube.com/watch?v=6ADWXIr_IUc

  • by aappleby on 5/4/22, 3:14 AM

    Very rough guesstimates:

    200000 * 200 * 2 = 80M tris/sec

    200000 * 200 * 32x32 px = 40 Gpix/sec (if no occlusion culling)

    Neither of those numbers is particularly huge for modern GPUs.

    I'd wager that a compute-shader plus mesh-shader-based version of this could hit 2M sprites at 200 fps, though at some point we'd have to argue about what counts as "cheating": if I do a clustered occlusion query that results in my pipeline discarding an invisible batch of 128 sprites, does that still count as "rendering" them?

  • by quadcore on 5/3/22, 11:53 PM

    Using goroutines, I also made 10k 2D rabbits wander around a map using 5% of my laptop's CPU (they'd sleep a lot, admittedly). One goroutine per rabbit; pretty amazing when you think about it. That's when Go really got me.

    edit: oh, they do rabbits in the video as well. What a bunny coincidence.

    edit2: the goroutines weren't making the draw calls, by the way; they were just moving the rabbits. The draw calls were still issued from a regular for loop, in case you were wondering.

  • by _aavaa_ on 5/1/22, 7:07 PM

    By the looks of it, this is in Jonathan Blow's Jai language.

    How are you finding working with it? Have you done a similar thing in C++ to compare the results and the process of writing it?

    200k at 200fps on an 8700k with a 1070 seems like a lot of rabbits. Are there similar benchmarks to compare against in other languages?

  • by juancn on 5/3/22, 10:39 PM

    Neat. Isn't this akin to 400k triangles on a GPU? So as long as you do instancing it doesn't seem too difficult (performance-wise) in itself. Even with many sprites, texture mapping should handle getting the pixels to the screen.

    My guess is that the rendering is not the hardest part, although it's kinda cool.

  • by chmod775 on 5/1/22, 7:48 PM

    Bit of a tangent and a useless thought experiment, but I think you could render an arbitrarily large number of such bunnies, or at least as many as you can fit in RAM and simulate (rough sketch of the CPU pass at the end of this comment). On the CPU, for each frame, iterate over all bunnies. Do the simulation for that bunny, and at the pixel corresponding to its position, store its information in a texture if it sits above the bunny currently stored there (just its logical position; don't write it into all the pixels its sprite covers!). Then on the GPU have a pixel shader look up (in the surrounding pixels) the topmost bunny for the current pixel and draw it (or just draw all the overlaps using the z-buffer). For your source texture, use 0 for no bunny and other values to indicate the bunny's z-position.

    The CPU work would be O(n) and the rendering/GPU work O(m*k), where n is the number of bunnies, m is the display resolution and k is the size of our bunny sprite.

    The advantage of this (in real applications utterly useless[1]) method is that CPU work only increases linearly with the number of bunnies, you get to discard bunnies you don't care about really early in the process, and GPU work is constant regardless of how many bunnies you add.

    It's conceptually similar to rendering voxels, except you're not tracing rays deep, but instead sweeping wide.

    As long as your GPU is fine with sampling that many surrounding pixels, you're exploiting the capabilities of both your CPU and GPU quite well. The CPU work can also be parallelized: each thread operates on a subset of the bunnies and on its own texture, and only in the final step are the textures combined into one (which can also be done in parallel!). I wouldn't be surprised if modern CPUs could handle millions of bunnies while modern GPUs would just shrug, as long as the sprite is small.

    [1] In reality you don't have sprites of constant size, and this method can't properly deal with transparency of any kind. The size of your sprites is directly limited by how many surrounding pixels your shader looks up during rendering, even if you add support for multiple sprites/sprite sizes using other channels of your texture.
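
    To make the CPU pass concrete, a minimal sketch (made-up names; one "id + depth" texel per screen pixel, re-uploaded each frame for the pixel shader to sample; the real layout could of course pack things differently):

      #include <cstdint>
      #include <vector>

      struct Bunny { float x, y, z; };                     // logical position + depth
      struct Texel { uint32_t bunnyId; float z; };         // bunnyId 0 = no bunny here

      // O(n) over bunnies: write each bunny into the single pixel at its logical
      // position, keeping only the topmost bunny per pixel.
      void buildTopmostTexture(const std::vector<Bunny>& bunnies,
                               std::vector<Texel>& texels, int width, int height) {
          for (Texel& t : texels) t = {0, 0.0f};           // clear: no bunny anywhere
          for (uint32_t i = 0; i < bunnies.size(); ++i) {
              const Bunny& b = bunnies[i];
              int px = (int)b.x, py = (int)b.y;
              if (px < 0 || px >= width || py < 0 || py >= height) continue;
              Texel& t = texels[py * width + px];
              if (t.bunnyId == 0 || b.z > t.z)
                  t = {i + 1, b.z};                        // ids are 1-based; 0 = empty
          }
          // texels then gets uploaded as a texture (e.g. glTexSubImage2D); the
          // pixel shader scans the k surrounding texels for the topmost bunny
          // covering the current pixel and draws its sprite.
      }

    For the parallel variant, each thread would run this over its own subset of bunnies into its own texel array, and the arrays would be merged afterwards by keeping the topmost entry per pixel.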

  • by farzher on 5/1/22, 5:54 PM

    I finally got around to writing an OpenGL "bunnymark" to check how fast computers are.

    I got 200k sprites at 200fps on a 1070 (while recording). I'm not sure anyone could survive that many vampires.

  • by liftm on 5/3/22, 11:43 PM

    Does this work with large semi-transparent objects? (My experience with 2D game engines from about ten years ago was that 10k objects weren't really a problem, unless you were trying to make clouds or fog from ~200x100 px half-transparent images. Have 100 of those and you'd run at 5 FPS.)

  • by sqrt_1 on 5/3/22, 10:35 PM

    I assume each sprite is moved on the CPU and the position data is passed to the GPU for rendering.

    Curious how you are passing the data to the GPU - do you have a single dynamic vertex buffer that is uploaded each frame?

    Is the vertex data a single position per sprite, with the GPU generating the quad from it?
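
    (For reference, the kind of thing I have in mind would look roughly like this; just a sketch with made-up names, not necessarily how the demo does it: a streaming buffer of one vec2 per sprite, with the vertex shader deriving the quad corner from gl_VertexID.)

      // Hypothetical vertex shader: one vec2 per instance (attribute divisor 1),
      // no per-vertex attributes at all; the corner comes from gl_VertexID and
      // each sprite is drawn as a 4-vertex triangle strip per instance.
      const char* spriteVertexShader = R"(
          #version 330 core
          layout(location = 0) in vec2 instancePos;   // sprite centre, updated each frame
          uniform vec2 spriteSize;                    // sprite extent in clip space
          out vec2 uv;
          void main() {
              vec2 corner = vec2(gl_VertexID & 1, gl_VertexID >> 1);  // (0,0)..(1,1)
              uv = corner;
              gl_Position = vec4(instancePos + (corner - 0.5) * spriteSize, 0.0, 1.0);
          }
      )";
      // Matching CPU side would be something like (again, an assumption):
      //   glBufferSubData(GL_ARRAY_BUFFER, 0, count * 2 * sizeof(float), positions);
      //   glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, count);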

  • by andrewmcwatters on 5/4/22, 12:19 AM

    How much time is spent in Jai? How much time is spent presenting the graphics? Unfortunately, graphics benchmarks like this are hard because they don't tell us much. You have to profile these two parts separately.

  • by jaqalopes on 5/4/22, 12:15 AM

    Gotta be honest, this is beyond my current comprehension, but seeing the visuals on this while stoned was a trippy pleasure.

  • by SemanticStrengh on 5/3/22, 10:58 PM

    Yes, although the performance is probably largely due to occlusion? Also, the sprites do not collide with their environment.

  • by jancsika on 5/4/22, 4:45 AM

    Is there a way to do it as 1 sprite with 200k SVG filters applied to it at 1fps?

  • by adanto6840 on 5/4/22, 1:27 AM

    Anecdote: In Unity, using DrawMeshInstancedIndirect, you can get >100k sprites _in motion_ and still maintain >100 FPS.

    Using some slight shader/buffer trickery, and depending on what you're trying to do (as is always the case with games & rendering at this scale), you can easily get multiples of that -- and still stay >100 FPS.

    I agree, more of this approach would be great. And I am totally flabbergasted at how abysmally poor the performance is with SpriteRenderer, Unity's built-in sprite rendering component.

    That said, it's doable to get relatively high performance with existing engines -- and the benefits they come with -- even if you can definitely, easily even, do better by "going direct".