Thread: transform feedback + glDrawElementsInstanced

  1. #11
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    941
    Quote Originally Posted by Alfonse Reinheart View Post
    And... this is supposed to be fast? Using a geometry shader and performing per-triangle frustum culling/LOD selection, while using transform feedback? How is this faster than just rendering the models using traditional CPU-based methods of whole-object culling and LOD? You have this whole read/write/read loop going on in the shader. That requires an additional buffer just to write this intermediate data that you then render.
    No, nobody said that. You perform per-instance or per-object frustum culling/LOD selection using a geometry shader. That's orders of magnitude less work than the actual rendering.

    Quote Originally Posted by Alfonse Reinheart View Post
    Also in general, when I think performance, I don't think geometry shaders.
    While using a geometry shader does have its cost, it's not evil in itself.
    Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
    Technical Blog: http://www.rastergrid.com/blog/

  2. #12
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,723
    Quote Originally Posted by aqnuep View Post
    No, nobody said that. You perform per-instance or per-object frustum culling/LOD selection using a geometry shader. That's orders of magnitude less work than the actual rendering.
    How exactly? It is actually rendering. In order for the output primitive count to match the input primitive count, you have to be outputting the primitives you want to render. Which means that this pass is drawing all of the triangles for every LOD for every object that exists in the scene.

    It may not be scan converting and rasterizing them. But it is passing them through the vertex and geometry shaders. Which means the GPU reads them from the buffers and has to do transformation at least. You have to do vertex processing for each visible object twice (though the second time is just pass-through). That's a lot of redundant reading of memory. You read each object, write it to another location, then read it from there to render it.

    Again: how is this faster than just regular rendering via a deferred renderer?

  3. #13
    Junior Member Regular Contributor peterfilm's Avatar
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    The thing you're missing, Alfonse, is that the transform feedback pass just draws a long list of GL_POINTS with rasterization disabled. Each point carries the object's entire transform and bounding volume as vertex attributes (in my case a mat4x3 for the transform and a vec4 for the sphere). The output of the pass is a list of vertex attributes for each LOD (I output only the mat4x3; the sphere has done its job), intended to be used by glDrawElementsInstanced as the per-instance data, not the mesh data.
    You might think this is a CPU job, but when you're talking about tens of thousands of instanced objects being passed over the bus each frame (more if you take the shadow passes into account), you can start to see the saving from doing this simple bounds/LOD test on the GPU itself and then telling it to draw from the list it has just generated. To be honest, I'm not that bothered about the frustum culling, since I have a quad tree to cull the majority on the CPU anyway. The LOD selection is the real gain: realistically it has to be done per instance, whereas frustum culling can be batched, as I do in my quad tree.
    Last edited by peterfilm; 07-12-2012 at 02:42 AM.
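
Editor's sketch: the per-instance test described in the post above can be written out on the CPU to make the logic concrete. This is not peterfilm's shader. It is a minimal C reference under stated assumptions: the sphere is stored as (centre.xyz, radius) in a vec4 as the post says, while the plane convention, the `lod_far` distance thresholds, and the names `select_lod` and `count_per_lod` are hypothetical. In the GPU version the same decision would presumably be made once per input point in the geometry shader, which would then emit the instance transform into the output list for the chosen LOD.

```c
#include <assert.h>

/* Bounding sphere as described in the post: xyz = centre, w = radius.
   The same struct is reused for planes (a, b, c, d). */
typedef struct { float x, y, z, w; } Vec4;

/* Assumed convention: plane normals are unit length and point into the
   frustum, so a*x + b*y + c*z + d >= 0 for points on the inside.
   Returns -1 if the sphere lies entirely outside any plane, otherwise a
   LOD index picked by distance from the eye. lod_far[i] is the far limit
   of LOD i and has lod_count - 1 entries (hypothetical thresholds). */
int select_lod(Vec4 sphere, const Vec4 *planes, int plane_count,
               const float eye[3], const float *lod_far, int lod_count)
{
    assert(lod_count > 0);
    for (int i = 0; i < plane_count; ++i) {
        float dist = planes[i].x * sphere.x + planes[i].y * sphere.y
                   + planes[i].z * sphere.z + planes[i].w;
        if (dist < -sphere.w)
            return -1;              /* fully outside this plane: culled */
    }
    float dx = sphere.x - eye[0];
    float dy = sphere.y - eye[1];
    float dz = sphere.z - eye[2];
    float d2 = dx * dx + dy * dy + dz * dz;   /* squared distance, no sqrt */
    for (int lod = 0; lod + 1 < lod_count; ++lod) {
        if (d2 < lod_far[lod] * lod_far[lod])
            return lod;
    }
    return lod_count - 1;           /* beyond the last threshold */
}

/* Counts the visible instances per LOD. counts must hold lod_count ints.
   Each count is the instance count a draw of that LOD's mesh would need. */
void count_per_lod(const Vec4 *spheres, int n,
                   const Vec4 *planes, int plane_count,
                   const float eye[3], const float *lod_far, int lod_count,
                   int *counts)
{
    for (int l = 0; l < lod_count; ++l)
        counts[l] = 0;
    for (int i = 0; i < n; ++i) {
        int lod = select_lod(spheres[i], planes, plane_count,
                             eye, lod_far, lod_count);
        if (lod >= 0)
            ++counts[lod];
    }
}
```

Comparing squared distances avoids a square root per instance. A sphere that merely intersects a plane is kept, so the test is conservative: it never culls something that is partly visible.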

  4. #14
    Junior Member Regular Contributor peterfilm's Avatar
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    here's some numbers:-

    instances:-
    26781

    CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
    590fps

    GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
    1995fps

    NOTE: this is just the culling/lod selection. I've commented out the drawing code.

    So as you can see, it's definitely worth doing the culling on the GPU!
    It's just that pesky readback that spoils the party and drags the fps down significantly. By readback I mean that the drawing code has to fetch the GL_PRIMITIVES_GENERATED query result in order to feed it into the primCount parameter of glDrawElementsInstanced to actually draw the mesh instances.
    Last edited by peterfilm; 07-12-2012 at 02:59 AM.

  5. #15
    Advanced Member Frequent Contributor
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    894
    Looking at the numbers, I find the discrepancy quite astonishing, but I don't quite follow the data flow. Do you mind laying out your GPU approach as a list of sequential operations for dumb people like me?

    Edit: If possible, add the CPU path as well, so people can compare the approaches.

    Edit 2: By no means do I intend to be judgmental here! It simply looks quite intriguing and I'd like to see how it works.

  6. #16
    Junior Member Regular Contributor peterfilm's Avatar
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    I'd gladly do that, but aqnuep has already done a splendid job of writing this stuff up on his blog.
    it's got diagrams and everything! ignore the hi-z business for now.
    http://rastergrid.com/blog/2010/10/g...-geometry-lod/

    disclosure: i'd already got this stuff working before i found his blog (looking for optimisations), so please don't think i'm a copy cat (not that there'd be anything wrong with that, I just want to retain some kudos for the idea...god knows i get little enough of them).

  7. #17
    Advanced Member Frequent Contributor
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    894
    Thank you (and aqnuep of course)! I thought I read that but it was actually the earlier instance culling post.

    Quote Originally Posted by peterfilm View Post
    ignore the hi-z business for now.
    No I will not!

  8. #18
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    2,882
    Quote Originally Posted by peterfilm View Post
    CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
    590fps

    GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
    1995fps
    So 1.69 ms/frame for CPU and 0.501 ms/frame for GPU. Net savings: 1.19 ms across 26781 instances (aka 0.44 ms per 10,000 instances).

    (FPS really is a horrible way to bench. Non-linear. Interesting thread though!)
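
Editor's sketch: the conversion in the post above, and the reason FPS is called non-linear, can be checked with a few lines of C. The helper names `frame_ms` and `fps_after_saving` are hypothetical; the only inputs taken from the thread are the 590 fps and 1995 fps figures.

```c
#include <assert.h>

/* Frame time in milliseconds for a given frame rate. */
double frame_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Frame rate that results from removing saved_ms from every frame
   of an application currently running at fps. */
double fps_after_saving(double fps, double saved_ms)
{
    double t = frame_ms(fps) - saved_ms;
    assert(t > 0.0);   /* the saving must be smaller than the frame time */
    return 1000.0 / t;
}
```

With the thread's figures, `frame_ms(590.0)` is about 1.695 ms and `frame_ms(1995.0)` about 0.501 ms, a saving of about 1.19 ms. The same saving applied to a frame that runs at 60 fps gives `fps_after_saving(60.0, 1.19)`, roughly 64.6 fps. A jump of 1405 fps in the micro-benchmark is therefore worth fewer than 5 fps in a full 60 fps frame, which is why milliseconds are the more useful unit.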

  9. #19
    Advanced Member Frequent Contributor
    Join Date
    Apr 2010
    Location
    Germany
    Posts
    894
    Dark Photon: What do you make of that ~1.2 ms gain? If you're tight on budget it seems reasonable. Otherwise ... I don't know.

    BTW, shame on me for being blinded by those sneaky FPS.

  10. #20
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    2,882
    Quote Originally Posted by thokra View Post
    What do you make of that ~1.2 ms gain?
    Well, if you've got really loose framerate requirements it might not be so important. But for those who have 16.66 ms to do everything or they're dead, 1.2 ms is a lot of time and worth reclaiming.

    It'd be good to have data on which specific GPU and CPU this test was done on to ground these benchmarks. Peter?

    I like the spirit of AMD_query_buffer_object. I'm all for nuking GPU pipeline bubbles and keeping the work blasting as fast as possible on the GPU. The author list on that extension is interesting too :-)

    Maybe AMD and NVidia can work out a deal here: AMD implements NV_vertex_buffer_unified_memory (batch buffers bindless only; no shader pointers) in exchange for NVidia implementing AMD_query_buffer_object. Result: Everybody gets improved perf from their GPUs. :-)
    Last edited by Dark Photon; 07-12-2012 at 06:20 AM.
