
Thread: transform feedback + glDrawElementsInstanced

  1. #1
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124

    transform feedback + glDrawElementsInstanced

    To avoid the query object stall when combining EXT_transform_feedback with glDrawElementsInstanced, the usual recommendation seems to be the ARB_draw_indirect extension. But for the life of me I can't find any information on how to get transform feedback to populate the GL_DRAW_INDIRECT_BUFFER that the extension's new draw functions read from.
    I've seen people talk about OpenCL, but how do I get OpenGL's transform feedback mechanism to do it?
    thanks.

    (I've deliberately littered this post with the keyword breadcrumbs I've been searching with for people with the same question!)

  2. #2
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    985
    What do you mean by query object stall with transform feedback and DrawElementsInstanced exactly? What's your use case? Do you feed back vertex array data or instance data using transform feedback?

    If you feed back vertex array data then you should use DrawTransformFeedback to do a non-indexed rendering of the fed back vertex array data.

    If you feed back instance data then you would need atomic counters in the vertex shader or geometry shader, though I'm not aware of any driver supporting non-fragment shader atomic counters currently.
    However, on AMD hardware you can use the new GL_AMD_query_buffer_object extension to feed back the result of a primitive query to a draw indirect buffer in a non-blocking manner. Example #4 in the spec might be just what you are looking for.
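    For readers who want to see where the query result has to land, here is a minimal sketch of the indirect command layout from ARB_draw_indirect and the byte offset of its primCount field. The helper name primCountOffset is made up for illustration. As I read the AMD extension, you would bind the indirect buffer to GL_QUERY_BUFFER_AMD and pass this offset (cast to a pointer) as the last argument of glGetQueryObjectuiv, but treat that as an assumption and check it against example #4 in the spec.

```cpp
#include <cstddef>
#include <cstdint>

// One indexed indirect draw command, field order as listed in ARB_draw_indirect.
struct DrawElementsIndirectCommand {
    std::uint32_t count;               // number of indices per instance
    std::uint32_t primCount;           // number of instances: the field the query result should fill
    std::uint32_t firstIndex;          // first index within the bound element array
    std::int32_t  baseVertex;          // value added to every index
    std::uint32_t reservedMustBeZero;  // reserved by the extension
};

static_assert(sizeof(DrawElementsIndirectCommand) == 5 * sizeof(std::uint32_t),
              "commands must be tightly packed to match the GPU-side layout");

// Hypothetical helper: byte offset of primCount for the command at commandIndex,
// measured from the start of the indirect buffer.
constexpr std::size_t primCountOffset(std::size_t commandIndex) {
    return commandIndex * sizeof(DrawElementsIndirectCommand)
         + offsetof(DrawElementsIndirectCommand, primCount);
}
```

    With one command per LOD stored back to back, primCountOffset(lod) gives the destination for that LOD's primitive query.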
    Disclaimer: This is my personal profile. Whatever I write here is my personal opinion and none of my statements or speculations are anyhow related to my employer and as such should not be treated as accurate or valid and in no case should those be considered to represent the opinions of my employer.
    Technical Blog: http://www.rastergrid.com/blog/

  3. #3
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    yes i'd just been reading the AMD_query_buffer_object extension just now! spooky. Frustratingly this extension is not supported on the nvidia quadro 4000 even though it's exactly what i need (example #4 could have been written with me in mind).
    yes i'm trying to do frustum culling and lod selection on the gpu, just as you have done in your demos and just as I talk about in my other forum thread (where the question was performance).
    now I've got everything writing to multiple streams, one stream for each lod, and the culling/lod selection is very fast indeed (still approx 50 million tests per second, but with multiple streams i don't have to do multiple passes over the same instance data!) - but i've now identified the GL_PRIMITIVES_GENERATED query as a pretty significant bottleneck. This is why I'm looking for ways of getting the primitives generated count to the draw command without the CPU readback.

  4. #4
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    btw, when i say a significant bottleneck i mean it drags the overall framerate below what I get by doing the culling/lod on the CPU and using glMapBufferRange() to upload the results. So unless I can sort this out, I'll be abandoning the GPU approach.

  5. #5
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    985
    Well, you have at least two options:

    1. Use AMD_query_buffer_object if you can limit your target audience to AMD hardware (however, I hope that NVIDIA will implement it soon too).
    2. Use the visibility results of the previous frame to avoid the stall (you can even have a 2 frame delay). Obviously, this might result in popping artifacts, however, if your camera is not moving super fast and if you have decent frame rates, that one or two frame delay should not have any visible effect on your rendering.
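    The latency trick in option 2 can be sketched without any GL at all. In the code below a plain integer stands in for each query object, and the class name DelayedCounts is hypothetical. In a real renderer each slot would hold a query issued in an earlier frame, whose result should be available by the time it is read.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Ring of per-frame instance counts. Each slot stands in for a query object
// that was issued Delay frames ago (assumption: a real renderer would read
// the GL query result here instead of a stored integer).
template <std::size_t Delay>
class DelayedCounts {
    static_assert(Delay > 0, "need at least one frame of latency");

public:
    // Store the count produced by this frame's culling pass and return the
    // count stored Delay frames earlier (0 until the ring has filled up).
    std::uint32_t push(std::uint32_t newest) {
        const std::uint32_t oldest = slots_[head_];
        slots_[head_] = newest;
        head_ = (head_ + 1) % Delay;
        return oldest;
    }

private:
    std::array<std::uint32_t, Delay> slots_{};  // zero-initialised
    std::size_t head_ = 0;                      // slot to overwrite next
};
```

    Note that the fed-back instance data has to be kept alive for the same number of frames as the counts, so the buffers need the same ring treatment.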

  6. #6
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    well that's where it gets complicated (option 2 i mean). You see the instance renderer is used in a number of cull/renders - multiple viewports, quad buffered stereo, cascaded shadow maps.... it's just not practical to have a vbo for each lod for each cull/render phase. Apart from the memory wastage, there's also the code complexity.
    Ah well, life eh.

  7. #7
    Junior Member Regular Contributor peterfilm
    Join Date
    Sep 2009
    Location
    UK
    Posts
    124
    i really love the simplicity of that AMD extension. The idea of the GL writing the query result into a buffer so we can then bind that buffer to the GL_DRAW_INDIRECT_BUFFER target is just gorgeous.

    It's bizarre that it seems to be so difficult to do frustum culling (and waaay more importantly, lod selection) on the GPU - I mean, OpenGL is supposed to be primarily for graphics and this is one of the oldest requirements for any graphics application. I don't see the reason why I should have to use CUDA/OpenCL combined with some fudge buffer sharing mechanism between the two APIs to do such a simple thing.

    NVidia, just implement the extension already, for the love of god.

  8. #8
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    Quote: "It's bizarre that it seems to be so difficult to do frustum culling (and waaay more importantly, lod selection) on the GPU"
    Um, why? Frustum culling is, at its core, a very different operation. GPUs are for drawing triangles. Culling is about doing arbitrary computations to determine a binary value.

    Also, I'm curious as to exactly how writing the query result (which is either the number of fragments that pass or a true/false value) allows you to do LOD selection. Frustum culling I can kind of understand, sort-of. You can write a 0 value when the query is not visible. But how exactly does LOD selection work?

    Quote: "I don't see the reason why I should have to use CUDA/OpenCL combined with some fudge buffer sharing mechanism between the two API's to do such a simple thing."
    Because OpenGL is for rendering and GPGPU APIs are for generic computations. Frustum culling and LOD selection are generic computations that are used to feed rendering.

    I'm not saying it's a bad extension. But personally, I'd say that LOD selection is something that the CPU should be doing, considering how dirt simple it is (distance fed into a table).
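    For reference, "distance fed into a table" amounts to something like the following sketch. The switch distances and the name selectLod are made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical switch distances, sorted ascending:
// LOD 0 below 25 units, LOD 1 below 100, LOD 2 below 400, LOD 3 beyond that.
constexpr std::array<float, 3> kLodSwitchDistance = {25.0f, 100.0f, 400.0f};

// Returns the index of the first table entry greater than the distance,
// which is also the LOD index (the table size means "coarsest LOD").
std::size_t selectLod(float distance) {
    const auto it = std::upper_bound(kLodSwitchDistance.begin(),
                                     kLodSwitchDistance.end(), distance);
    return static_cast<std::size_t>(it - kLodSwitchDistance.begin());
}
```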

    Quote: "NVidia, just implement the extension already, for the love of god."
    Personally, if NVIDIA's going to implement any of AMD's recent extensions, I'd rather see multi_draw_indirect, sample_positions, or depth_clamp_separate.

  9. #9
    Advanced Member Frequent Contributor
    Join Date
    Dec 2007
    Location
    Hungary
    Posts
    985
    Quote: Originally posted by Alfonse Reinheart
    "Also, I'm curious as to exactly how writing the query result (which is either the number of fragments that pass or a true/false value) allows you to do LOD selection. Frustum culling I can kind of understand, sort-of. You can write a 0 value when the query is not visible. But how exactly does LOD selection work?"
    You don't use an occlusion query, but a primitive query. You perform view frustum culling in the geometry shader and perform LOD selection and output the instance data (if the object is visible) to the transform feedback stream corresponding to the selected LOD.
    By using a primitive query for each transform feedback stream and by writing the result of the queries to the primCount fields of an indirect draw buffer you can perform the whole rendering without any CPU-GPU roundtrip.
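    To make the data flow concrete, here is a CPU-side reference sketch of what that culling pass computes. It is not shader code: on the GPU the same logic would sit in the geometry shader, with something like EmitStreamVertex(lod) taking the place of push_back. The bounding-sphere test, the LOD distances, and all names are assumptions made for illustration.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Plane  { float nx, ny, nz, d; };    // inside when n.p + d >= 0
struct Sphere { float x, y, z, radius; };  // per-instance bounding sphere

constexpr std::size_t kLodCount = 3;       // assumed: one feedback stream per LOD

// True if the sphere is at least partly on the inner side of every plane.
bool visible(const Sphere& s, const std::vector<Plane>& frustum) {
    for (const Plane& p : frustum) {
        const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius) return false;
    }
    return true;
}

// Hypothetical LOD rule: distance from a camera placed at the origin.
std::size_t lodFor(const Sphere& s) {
    const float dist = std::sqrt(s.x * s.x + s.y * s.y + s.z * s.z);
    if (dist < 50.0f)  return 0;
    if (dist < 200.0f) return 1;
    return 2;
}

// Each visible instance is appended to the stream of its LOD. The size of
// streams[i] plays the role of the primitive query result for stream i,
// i.e. the value that ends up in primCount for LOD i.
std::array<std::vector<Sphere>, kLodCount>
cullAndBin(const std::vector<Sphere>& instances, const std::vector<Plane>& frustum) {
    std::array<std::vector<Sphere>, kLodCount> streams;
    for (const Sphere& s : instances) {
        if (visible(s, frustum)) streams[lodFor(s)].push_back(s);
    }
    return streams;
}
```

    The point is that the culling pass touches one small record per instance, and only the per-stream counts have to reach the draw commands.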

    Quote: Originally posted by Alfonse Reinheart
    "Personally, if NVIDIA's going to implement any of AMD's recent extensions, I'd rather see multi_draw_indirect, sample_positions, or depth_clamp_separate."
    NVIDIA already implemented AMD_multi_draw_indirect a while ago. Btw, the query buffer and multi draw indirect extensions can be used together to further limit the number of draw calls needed for what peterfilm wants to implement.

  10. #10
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    Quote: "You perform view frustum culling in the geometry shader and perform LOD selection and output the instance data (if the object is visible) to the transform feedback stream corresponding to the selected LOD. By using a primitive query for each transform feedback stream and by writing the result of the queries to the primCount fields of an indirect draw buffer you can perform the whole rendering without any CPU-GPU roundtrip."
    And... this is supposed to be fast? Using a geometry shader and performing per-triangle frustum culling/LOD selection, while using transform feedback? How is this faster than just rendering the models using traditional CPU-based methods of whole-object culling and LOD? You have this whole read/write/read loop going on in the shader. That requires an additional buffer just to write this intermediate data that you then render.

    Also in general, when I think performance, I don't think geometry shaders.

    Also, why not just use glDrawTransformFeedback or its stream version to render it?
