It may not be scan-converting and rasterizing them, but it is passing them through the vertex and geometry shaders, which means the GPU reads them from the buffers and has to do the transformation at least. You have to do vertex processing for each visible object twice (though the second time is just pass-through). That's a lot of redundant memory traffic: you read each object, write it to another location, then read it from there to render it.
Again: how is this faster than just regular rendering via a deferred renderer?
The thing you're missing, Alfonse, is that the transform feedback pass is just drawing a long list of GL_POINTS (with rasterization disabled). Each point's vertex attributes are the entire object's transform and bounding volume (in my case that's a mat4x3 for the transform and a vec4 for the sphere). The output of this transform feedback pass is a list of vertex attributes for each lod (I just output the mat4x3; the sphere has done its job), intended to be used in a glDrawElementsInstanced as the per-instance data, not the mesh data.
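For anyone trying to picture it, here's a rough sketch of what the geometry shader for such a pass could look like, assuming a GL 4.0-style setup with one transform feedback stream per lod (the vertex shader is just a pass-through). The attribute and uniform names, the two-lod split and the distance threshold are illustrative only, not peterfilm's actual code:

// Culling/lod-selection geometry shader sketch (GL 4.0 multi-stream transform feedback).
static const char* cullGeometryShaderSrc = R"glsl(
#version 400
layout(points) in;
layout(points, max_vertices = 1) out;

// Passed through from the vertex shader: one point per object.
in vec4 vXform0[];   // the three rows of the mat4x3 object transform
in vec4 vXform1[];
in vec4 vXform2[];
in vec4 vSphere[];   // xyz = bounding sphere centre, w = radius
                     // (assumed world-space here; it could equally be object-space
                     // and transformed by the matrix first)

uniform vec4  frustumPlanes[6]; // xyz = plane normal, w = distance
uniform vec3  cameraPos;
uniform float lod1Distance;     // switch to the coarser lod beyond this distance

// Stream 0 -> lod 0 instance buffer, stream 1 -> lod 1 instance buffer.
layout(stream = 0) out vec4 lod0Row0;
layout(stream = 0) out vec4 lod0Row1;
layout(stream = 0) out vec4 lod0Row2;
layout(stream = 1) out vec4 lod1Row0;
layout(stream = 1) out vec4 lod1Row1;
layout(stream = 1) out vec4 lod1Row2;

void main()
{
    vec3  centre = vSphere[0].xyz;
    float radius = vSphere[0].w;

    // Sphere/frustum test: drop the point if it's fully outside any plane.
    for (int i = 0; i < 6; ++i)
        if (dot(frustumPlanes[i].xyz, centre) + frustumPlanes[i].w < -radius)
            return;

    // Lod selection by distance: emit the transform to the matching stream.
    if (distance(cameraPos, centre) < lod1Distance) {
        lod0Row0 = vXform0[0];
        lod0Row1 = vXform1[0];
        lod0Row2 = vXform2[0];
        EmitStreamVertex(0);
    } else {
        lod1Row0 = vXform0[0];
        lod1Row1 = vXform1[0];
        lod1Row2 = vXform2[0];
        EmitStreamVertex(1);
    }
}
)glsl";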
You might think this is a CPU job, but when you're talking about tens of thousands of instanced objects being passed over the bus each frame (more if you take the shadow passes into account), you can start to see the saving of doing this simple bounds/lod test on the GPU itself and then telling it to draw from the list it's just generated. To be honest I'm not that bothered about the frustum culling, as I have a quad tree to cull the majority on the CPU anyway; it's the lod selection that's the real gain - that realistically has to be done per-instance, whereas frustum culling can be batched like I do in my quad tree.
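A sketch of the host side that could go with the shader above; the buffer and VAO names are hypothetical and error handling is omitted. The program's transform feedback varyings are assumed to have been declared with GL_INTERLEAVED_ATTRIBS, with gl_NextBuffer separating the stream 0 and stream 1 outputs so each stream lands in its own buffer binding:

// Hypothetical host-side culling pass: one GL_POINT per object goes in,
// per-instance transforms for each lod come out via transform feedback.
void runGpuCullPass(GLuint cullProgram, GLuint cullVao,
                    GLuint lod0InstanceBuf, GLuint lod1InstanceBuf,
                    GLsizei objectCount)
{
    glUseProgram(cullProgram);
    glBindVertexArray(cullVao); // attributes: mat4x3 rows + bounding sphere per point

    // Capture stream 0 into the lod 0 buffer and stream 1 into the lod 1 buffer.
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, lod0InstanceBuf);
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 1, lod1InstanceBuf);

    glEnable(GL_RASTERIZER_DISCARD);         // nothing gets rasterized in this pass
    glBeginTransformFeedback(GL_POINTS);
    glDrawArrays(GL_POINTS, 0, objectCount); // one point per object
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);
}

The captured buffers would then be bound as per-instance vertex attributes on each mesh's VAO (with glVertexAttribDivisor(attrib, 1)) and drawn with glDrawElementsInstanced.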
Last edited by peterfilm; 07-12-2012 at 02:42 AM.
here's some numbers:-
CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
NOTE: this is just the culling/lod selection. I've commented out the drawing code.
So as you can see, it's definitely worth doing the culling on the GPU!
Just that pesky readback that spoils the party and drags the fps down significantly. (By readback I mean that the drawing code has to get the value of the GL_PRIMITIVES_GENERATED query in order to feed it into the primCount parameter of glDrawElementsInstanced to actually draw the mesh instances themselves.)
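In code, that stall looks something like the following; it assumes a GL_PRIMITIVES_GENERATED query was begun and ended per stream around the culling pass (glBeginQueryIndexed/glEndQueryIndexed), and the query, VAO and count names are hypothetical:

// The readback: the CPU has to wait for the query result before it can
// issue the instanced draw, which is where the lost frame time goes.
GLuint survivors = 0;
glGetQueryObjectuiv(lod0Query, GL_QUERY_RESULT, &survivors); // blocks until the GPU has the count

glUseProgram(meshProgram);
glBindVertexArray(lod0MeshVao); // mesh attributes + captured per-instance transforms
glDrawElementsInstanced(GL_TRIANGLES, lod0IndexCount, GL_UNSIGNED_INT,
                        nullptr, survivors); // primCount comes straight from the query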
Last edited by peterfilm; 07-12-2012 at 02:59 AM.
Looking at the numbers I find the discrepancy quite astonishing, but I don't quite follow the data flow. Do you mind lining up your GPU approach as a list of subsequent operations for dumb people like me?
Edit: If possible, add the CPU path as well, so people can compare the approaches.
Edit 2: By no means do I intend to be judgmental here! It simply looks quite intriguing and I'd like to see how it works.
I'd gladly do that, but aqnuep has already done a splendid job of writing this stuff up on his blog.
It's got diagrams and everything! Ignore the hi-z business for now.
Disclosure: I'd already got this stuff working before I found his blog (I was looking for optimisations), so please don't think I'm a copycat (not that there'd be anything wrong with that; I just want to retain some kudos for the idea... god knows I get few enough of them).
Thank you (and aqnuep of course)! I thought I had already read that, but it was actually the earlier instance culling post.
"ignore the hi-z business for now"
No I will not!
Dark Photon: What do you make of that ~1.2 ms gain? If you're tight on budget it seems reasonable. Otherwise ... I don't know.
BTW, shame on me for being blinded by those sneaky FPS.
It'd be good to have data on which specific GPU and CPU this test was done on to ground these benchmarks. Peter?
I like the spirit of AMD_query_buffer_object. I'm all for nuking GPU pipeline bubbles and keeping the work blasting as fast as possible on the GPU. The author list on that extension is interesting too :-)
Maybe AMD and NVidia can work out a deal here: AMD implements NV_vertex_buffer_unified_memory (batch buffers bindless only; no shader pointers) in exchange for NVidia implementing AMD_query_buffer_object. Result: Everybody gets improved perf from their GPUs. :-)
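For what it's worth, here's a sketch of how AMD_query_buffer_object could kill the readback in peterfilm's case, assuming ARB_draw_indirect (GL 4.0) is available for the draw itself; the buffer and query names are hypothetical. The idea is to write the query result straight into the instanceCount field of an indirect draw command, so the count never has to travel back to the CPU:

// Indirect draw command layout from ARB_draw_indirect. count, firstIndex and
// baseVertex are assumed to have been filled in for the lod 0 mesh already;
// baseInstance must be zero before GL 4.2.
struct DrawElementsIndirectCommand {
    GLuint count;
    GLuint instanceCount;  // filled in by the query result below
    GLuint firstIndex;
    GLint  baseVertex;
    GLuint baseInstance;
};

// Route the primitives-generated result into the command's instanceCount field.
glBindBuffer(GL_QUERY_BUFFER_AMD, lod0IndirectBuf);
glGetQueryObjectuiv(lod0Query, GL_QUERY_RESULT,
                    (GLuint*)offsetof(DrawElementsIndirectCommand, instanceCount));
glBindBuffer(GL_QUERY_BUFFER_AMD, 0);

// Issue the draw; the GPU reads back the count it produced itself.
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, lod0IndirectBuf);
glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr);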
Last edited by Dark Photon; 07-12-2012 at 06:20 AM.