Thread: Official feedback on OpenGL 4.0 thread

  1. #151
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,732

    Re: Official feedback on OpenGL 4.0 thread

    I hope they extend the bindless idea to textures/samplers etc. and remove the need for texture units at all.
    I think you've misunderstood what the point of bindless is, and how it creates optimizations.

    Bindless (as far as vertex attributes go) works as a performance optimization because of cache issues. When you render a scene in a game, you have to go through every object in an arbitrary order, bind everything to the context, and render. When you bind a buffer object name for the purpose of rendering, the driver has to convert this name into an actual object pointer, then fetch the actual GPU address from this object (because buffer objects have state other than just an object pointer). Each memory access is almost guaranteed to be a cache miss, since the last time this memory was accessed was on the last frame (and the entire game logic loop has likely run since then, thus emptying the cache).

    The key that makes bindless work for vertex attributes is that the only state you need for rendering is the GPU address. So if the application stores the GPU address, the driver doesn't need to do any memory accesses at all. Outside of actually putting that GPU address in the graphics FIFO, of course.
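To put the pointer-chasing argument in concrete terms, here is a rough sketch in plain C. It is a toy model, not how any real driver is written: `ToyBuffer`, `ToyDriver`, `resolve_by_name` and `resolve_bindless` are made-up names, and the `mem_reads` counter only stands in for the dependent memory reads that would likely miss the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a driver-side buffer object. */
typedef struct {
    uint64_t gpu_address; /* the only field vertex fetch actually needs */
    size_t   size;        /* other state that lives next to it */
    int      usage_hint;
} ToyBuffer;

/* Hypothetical driver state: a table mapping buffer names to objects. */
typedef struct {
    ToyBuffer **table;     /* indexed by buffer name */
    size_t      count;
    unsigned    mem_reads; /* counts dependent reads of driver memory */
} ToyDriver;

/* Classic path: name -> object pointer -> GPU address.
   Two dependent reads, each a likely cache miss. */
uint64_t resolve_by_name(ToyDriver *drv, unsigned name)
{
    ToyBuffer *obj;
    uint64_t addr;

    assert(name < drv->count);
    obj = drv->table[name];
    drv->mem_reads++;
    addr = obj->gpu_address;
    drv->mem_reads++;
    return addr;
}

/* Bindless-style path: the application already holds the address,
   so no driver-side object memory is touched. */
uint64_t resolve_bindless(ToyDriver *drv, uint64_t stored_address)
{
    (void)drv;
    return stored_address;
}
```

In this model the by-name path bumps `mem_reads` twice per bind, while the bindless path leaves it untouched. That is the whole optimization in miniature.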

    This is not the case for sampler objects or texture objects. Sampler objects contain only state; a pointer to the object would only save you one cache miss at best. Texture objects have a GPU address of the textures, but even then, they have crucial state associated with it that the texture accessing unit needs to know (the range of available LODs). So again, you need to read from the actual texture object.

    As for "removing the need for texture units", why would you want this? All that means is that you have to bind textures and samplers to programs instead of the context. And binding to the context is faster.

    There are effectively 3 kinds of textures: global textures (textures like shadow maps that are used for all objects in a scene), instance textures (the textures for a particular mesh that uses a particular program), and program textures (textures that must be used with a particular program. Lookup tables that don't change between users of a program). Trying to make everything into a program-local texture is not helpful.

    If I want to change what the global texture is currently, I know what texture unit index I assigned it. So I can just bind a different texture there and every program I use from there on will use it. Under what you're wanting, I must go through every program and change what texture they use.

    With per-instance textures, under the current scheme, I can pre-assign the diffuse texture to texture unit 0, the specular map to texture unit 1, etc. And as long as I set all of the program sampler uniforms correctly, I never have to change the sampler uniform state when rendering different models with the same program. Under your system, with GLSL assigning uniform locations, for each program I use, I have to store what uniform location the diffuse, specular, etc texture is and bind them to that location.
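A small sketch of that convention, modeled in plain C so it runs without a GL context. Everything here is hypothetical (`ToyContext`, `ToyProgram`, the `UNIT_*` indices). In real code, `bind_texture` would correspond to `glActiveTexture` plus `glBindTexture`, and the per-program unit fields to sampler uniforms set once with `glUniform1i`.

```c
#include <assert.h>

/* Application-wide convention for texture unit indices (assumed). */
enum { UNIT_DIFFUSE = 0, UNIT_SPECULAR = 1, UNIT_SHADOW = 2, NUM_UNITS = 3 };

/* Toy context: one bound texture name per texture unit. */
typedef struct {
    unsigned bound_texture[NUM_UNITS];
} ToyContext;

/* Toy program: each sampler uniform holds a unit index, set once at setup. */
typedef struct {
    int diffuse_unit;
    int specular_unit;
    int shadow_unit;
} ToyProgram;

/* Bind a texture name to a unit in the context. */
void bind_texture(ToyContext *ctx, int unit, unsigned texture)
{
    assert(unit >= 0 && unit < NUM_UNITS);
    ctx->bound_texture[unit] = texture;
}

/* Which shadow map a program would sample at draw time. */
unsigned shadow_map_seen_by(const ToyContext *ctx, const ToyProgram *prog)
{
    return ctx->bound_texture[prog->shadow_unit];
}

/* Which diffuse texture a program would sample at draw time. */
unsigned diffuse_seen_by(const ToyContext *ctx, const ToyProgram *prog)
{
    return ctx->bound_texture[prog->diffuse_unit];
}
```

Rebinding `UNIT_SHADOW` once changes what every program sees, and swapping the diffuse texture per model never touches program state.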

    So no: textures and samplers are fine as is.

  2. #152
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    2,882

    Re: Official feedback on OpenGL 4.0 thread

    Quote Originally Posted by Stephen A
    We all know that OpenGL implementations are extremely unstable (much more so than Direct3d implementations, for whatever reason).
    Where NVidia is concerned, sorry, but this is blatantly false.

    But if you factor in "other vendors"..., I'll have to defer to someone else's experience there.

    We're in a unique position with embedded systems where we pick the hardware, so we can just take the best hardware+software combo out there at the time (features, performance, stability, pricing, volume availability, etc.) and run with it.

    How large is the mean performance gain in typical OpenGL 3.2 applications right now? ... What are the real numbers like?
    Several of us have reported 2X speed-ups without changing anything else. And at least my tests were on a real shipping app and database written intelligently (VBOs, frustum culling, state sorting, batch combining, etc.) but where decent frustum culling is still critical to regulate useless GPU vertex/fill hit. The mod for bindless batches is a trivial source change, and yielded a whopping 2X perf boost with real use cases!

    That's much less expensive than spending the dev time trying to contort your rendering pipe through some other means to push more polys (in an effort to compensate for the CPU-side waste you have without bindless). Oftentimes this comes to the detriment of frustum culling efficiency, meaning more wasted "junk" thrown down the pipe due to irrelevant batches which take valuable GPU cycles to discard.

    (7x sounds great, but it is only valid for solely batch-limited applications.
    Right. 2X is awesome though. You've just doubled the throughput of your GPU, for $0 cost in hardware and nearly no cost in software changes. Merely by eliminating some needless "pointer chasing" in the driver which otherwise kills your CPU cache and radically cuts your CPU and GPU utilization.

    Is there a way to achieve a similar performance gain without thoroughly breaking the current buffer/attribute/uniform model?
    Display lists. ...Oh, wait. Those were obsoleted in GL 3.0. And they're very time consuming to build, rendering them useless for run-time built/loaded geometry.

    And there's "instancing" but this assumes you're rendering a bunch of copies of the same thing -- makes for much more boring scenes, AND reduces your CPU culling efficiency AND complicates LOD.

    Why are bindless graphics superior to geometry instancing?
    See previous paragraph, but more on that below.

    As to the former point (instancing being a boring bunch of copies), yeah, you can use texture lookups (or more "instance" vtx attribs with ARB_instanced_arrays) and more shader logic/expense to try and vary the appearance a little per instance in the shader. Works, but what a pain. Recall, why are we barking up that tree anyway? Because of batch submission overhead, which is what bindless greatly reduces.

    As to the latter (reduced culling efficiency), with instancing we just have to suck it up and deal with it. With instancing, you're shoving larger groups of "stuff" down the pipe to try and reduce batch submission overhead, so the irrelevant portion of that is larger and just gonna soak up cycles. You can try and reduce some of that by trying to do on-GPU frustum culling and dynamic rewriting of batches. But what a pain and even more polys to render to try and avoid rendering other polys! Instancing is useful but not a "silver bullet". It also complicates LOD.

    If we could just do more culling/LOD/batch submission work faster on the CPU, then we could avoid a lot of this on-GPU inefficiency or contortion. That's exactly what batch bindless gives you.

  3. #153
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,732

    Re: Official feedback on OpenGL 4.0 thread

    Display lists. ...Oh, wait. Those were obsoleted in GL 3.0.
    He was talking about a new way. That is, a new extension that would provide the beneficial effects of bindless without the negative effects of it.

    Both VAR and buffer objects are a way to deal with user-allocated graphics-owned memory. But buffer objects have a good abstraction, while VAR doesn't.

    Because of batch submission overhead, which is what bindless greatly reduces.
    I don't see how bindless takes away from instancing or is in any way fundamentally superior to it.

    Instancing, regardless of method (draw_instanced or instanced_arrays), is a way of drawing multiple copies of a mesh with different material properties. In essence, it has nothing to do with batch submission overhead (ie: calling gl*Pointer) and has everything to do with state change overhead. Setting parameters on a program takes time, and instancing is a way of removing this time-taking step.

    Bindless doesn't help with changing program parameters. Indeed, if you're trying to render 10,000 copies of a mesh with bindless, you'll find it no faster than without bindless. It simply doesn't help with the fact that you need to, after every draw call, make at least one glUniform call.

    Bindless reduces the cache overhead of making gl*Pointer calls. That's all it does.

    As to the latter (reduced culling efficiency), with instancing we just have to suck it up and deal with it.
    I don't see how culling is more or less efficient with instancing. Unless you're doing instancing by just rendering all instances all the time, you're presumably using a streaming buffer object to upload an instance list to the GPU every frame, or at least every other frame. Can't you do your culling when building this list?
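Roughly what I mean by culling while building the list, as a plain C sketch. The types and the `build_visible_list` name are made up for illustration, and the sphere-versus-plane test is just one common choice. The resulting array is what you would stream into a buffer object each frame before an instanced draw.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* Plane convention (assumed): dot(n, p) + d >= 0 means "inside". */
typedef struct { Vec3 n; float d; } Plane;

/* Per-instance data: a bounding sphere is enough for this sketch. */
typedef struct { Vec3 center; float radius; } Instance;

static float plane_distance(const Plane *pl, Vec3 p)
{
    return pl->n.x * p.x + pl->n.y * p.y + pl->n.z * p.z + pl->d;
}

/* Copies every instance whose bounding sphere is not fully outside
   any plane into out[] and returns how many were kept.
   out must have room for count entries. */
size_t build_visible_list(const Instance *all, size_t count,
                          const Plane *planes, size_t plane_count,
                          Instance *out)
{
    size_t visible = 0;
    size_t i, p;

    assert(out != NULL || count == 0);
    for (i = 0; i < count; ++i) {
        int inside = 1;
        for (p = 0; p < plane_count; ++p) {
            if (plane_distance(&planes[p], all[i].center) < -all[i].radius) {
                inside = 0; /* sphere lies entirely behind this plane */
                break;
            }
        }
        if (inside)
            out[visible++] = all[i];
    }
    return visible;
}
```

The return value is the instance count to pass to the instanced draw call, so culled instances never reach the GPU at all.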

    If we could just do more culling/LOD/batch submission work faster on the CPU, then we could avoid a lot of this on-GPU inefficiency or contortion. That's exactly what batch bindless gives you.
    No it doesn't. Indeed, bindless has essentially nothing to do with GPU and everything to do with CPU inefficiency. The CPU cache issues are what drives the performance increases for bindless graphics, not anything that has to do with the GPU.

  4. #154
    Member Regular Contributor
    Join Date
    Aug 2008
    Posts
    381

    Re: Official feedback on OpenGL 4.0 thread

    Quote Originally Posted by Alfonse Reinheart
    I think you've misunderstood what the point of bindless is, and how it creates optimizations.
    Removing texture units wouldn't really be for direct speed optimizations, at least not in the same way bindless graphics does, rather to simplify scene-graphs etc. I guess I just saw a similarity between bindless + my wish to no longer need texture units.

    Quote Originally Posted by Alfonse Reinheart
    As for "removing the need for texture units", why would you want this? All that means is that you have to bind textures and samplers to programs instead of the context. And binding to the context is faster.
    It may currently be faster binding textures to context for rendering, but this can potentially be avoided completely in the rendering stage + instead done once per program/kernel at setup time.
    If you don't know what order materials are to be applied in, then currently the simplest way to render each scene object may be to have code something like:

    Code :
    for each scene object
    {
      // Apply material
      // Apply textures
      for each texture/sampler used by material
      {
        glActiveTexture(GL_TEXTURE0 + texture.location); // takes an enum, not a bare unit index
        glBindTexture(convertToGLEnum(texture.type), texture.handle);
        glBindSampler(texture.location, texture.sampler);
      }
      glUseProgram(object.material.prog);
      #set UBOs etc#
      #set any subroutines?#
      // Apply geometry
      glBindVertexArray(object.geometry.vao);
      // Draw
      glDraw...
    }

    If you want to restore state after unapplying the material, then you'd also need to cache the bound textures + save/restore them.
    You could also have a pre-render pass that figures out what materials each scene object uses, then sorts by material/depth/etc to minimize state changes/overdraw + decides what texture units textures will be bound to (taking into account the max combined texture image units with minimum values of 2/16/32/48/48/80 in OpenGL 2.1/3.0/3.1/3.2/3.3/4.0).
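A minimal sketch of such a pre-render sort in plain C, assuming each draw can be reduced to a few integer handles. `DrawItem`, `sort_by_state` and `count_program_changes` are hypothetical names, and a real engine would probably fold depth and more state into the sort key.

```c
#include <assert.h>
#include <stdlib.h>

/* One queued draw, reduced to the handles that matter for sorting. */
typedef struct {
    unsigned program;
    unsigned texture;
    unsigned vao;
} DrawItem;

/* Order by program first, then by texture, so expensive switches cluster. */
static int compare_items(const void *a, const void *b)
{
    const DrawItem *x = (const DrawItem *)a;
    const DrawItem *y = (const DrawItem *)b;

    if (x->program != y->program)
        return x->program < y->program ? -1 : 1;
    if (x->texture != y->texture)
        return x->texture < y->texture ? -1 : 1;
    return 0;
}

void sort_by_state(DrawItem *items, size_t count)
{
    assert(items != NULL || count == 0);
    qsort(items, count, sizeof items[0], compare_items);
}

/* How many program switches a straight walk over the list would issue. */
size_t count_program_changes(const DrawItem *items, size_t count)
{
    size_t changes = 0;
    size_t i;

    for (i = 0; i < count; ++i) {
        if (i == 0 || items[i].program != items[i - 1].program)
            ++changes;
    }
    return changes;
}
```

With four draws alternating between two programs, the unsorted walk switches program four times and the sorted walk twice.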

    Quote Originally Posted by Alfonse Reinheart
    There are effectively 3 kinds of textures: global textures (textures like shadow maps that are used for all objects in a scene), instance textures (the textures for a particular mesh that uses a particular program), and program textures (textures that must be used with a particular program. Lookup tables that don't change between users of a program). Trying to make everything into a program-local texture is not helpful.

    If I want to change what the global texture is currently, I know what texture unit index I assigned it. So I can just bind a different texture there and every program I use from there on will use it. Under what you're wanting, I must go through every program and change what texture they use.

    With per-instance textures, under the current scheme, I can pre-assign the diffuse texture to texture unit 0, the specular map to texture unit 1, etc. And as long as I set all of the program sampler uniforms correctly, I never have to change the sampler uniform state when rendering different models with the same program. Under your system, with GLSL assigning uniform locations, for each program I use, I have to store what uniform location the diffuse, specular, etc texture is and bind them to that location.
    The same could perhaps be done if texture units were removed, texture handles or addresses were passed directly to programs, and textures to be shared across multiple programs use uniform blocks like other globally shared data.
    Code :
    uniform sampler2D[*]tex1;
    uniform Shadow
    {
      sampler2D[*]shadowTex;
    };
    uniform MyMaterial
    {
      sampler2D[*]diffuse;
      sampler2D[*]ambient;
      sampler2D[*]specular;
      ...
    };

    The code for rendering all scene objects would then be:

    Code :
    for each scene object
    {
      // Apply material
      glUseProgram(object.material.prog);
      #set UBOs etc#
      // Apply geometry
      glBindVertexArray(object.geometry.vao);
      // Draw
      glDraw...
    }

    And for further reducing what needs to be done at render time, at the cost of more resources required, you could have 1 program per material instance with all attached uniforms/UBOs, but if programs are too heavyweight in their current form to have 1 material instance per scene object, then maybe there should be a more lightweight kernel object, something like:
    Code :
    glBuildProgram(prog);
    glGenKernelFromProgram(prog, ["main",] 1, kernel);
    Then each scene object could have its own kernel object with attached uniforms/UBOs, set up once per scene, rather than each time a program is used.

    Code :
    for each scene object
    {
      // Apply kernel
      glUseKernel(object.material.kernel);
      // Apply geometry
      glBindVertexArray(object.geometry.vao);
      // Draw
      glDraw...
    }

    Quote Originally Posted by Alfonse Reinheart
    So no: textures and samplers are fine as is.
    GL maintained texture units may be better for global textures than other alternatives, but since UBOs now seem to be the way to handle shared program parameters, then perhaps samplers should also go into them (not sure if an elegant solution to shared global state really exists unless CPU+GPU get closer). Some of the reasons I don't like the current system of texture unit => sampler2D:
    Harder to learn/understand (non-intuitive for beginners, although a bit easier now without also needing to Enable/Disable texture binding points per texture unit)
    Extra code complexity
    Extra limitations for OpenGL programs
    Can't be used in uniform blocks

  5. #155
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,732

    Re: Official feedback on OpenGL 4.0 thread

    You could also have a pre-render pass that figures out what materials each scene object uses, then sorts by material/depth/etc to minimize state changes/overdraw + decides what texture units textures will be bound to (taking into account the max combined texture image units with minimum values of 2/16/32/48/48/80 in OpenGL 2.1/3.0/3.1/3.2/3.3/4.0).
    I don't know what this thing you have about the max combined texture image units is. Do you think that this is some fictitious API-created limitation rather than a real hardware limitation? If you compile and link a shader that uses more samplers than the hardware allows, you won't even get to the rendering; you will get a linker error. So if you have a compiled and linked program, you necessarily have enough texture units for it.

    And state sorting has been a part of all scene graphs that are serious about performance for quite some time. The sorting algorithms have changed, but the need for sorting itself hasn't.

    The same could perhaps be done if texture units were removed, texture handles or addresses were passed directly to programs, and textures to be shared across multiple programs use uniform blocks like other globally shared data.
    Except that per-instance textures require:

    1: Storing what uniform location the instance textures go in.

    2: Changing program uniforms for every instance.

    Neither of these is necessary under the current system.

    GL maintained texture units may be better for global textures than other alternatives, but since UBOs now seem to be the way to handle shared program parameters, then perhaps samplers should also go into them
    Uniform buffers are also not set directly into programs; they are set the exact same way as texture objects. You bind uniform buffers to the context, and you tell the program which buffer index in the context to use.

    So I don't really see any difference here.

  6. #156
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    2,882

    Re: Official feedback on OpenGL 4.0 thread

    Quote Originally Posted by Alfonse Reinheart
    Because of batch submission overhead, which is what bindless greatly reduces.
    I don't see how bindless takes away from instancing or is in any way fundamentally superior to it.
    This isn't "instancing vs. bindless -- a fight to the death". They're both useful. And they tend to be most useful in different circumstances.

    When you want a bunch of slightly-mutated copies of each other and can eat the "cull waste", you want instancing. When you don't (you really want different objects), and you just want to reduce batch overhead (to pump more batches), then you want batch bindless (instancing is like hammering in a screw here).

    Of course, you can use batch bindless in both cases. It's just that it gives the most benefit in the latter case.

    I didn't say bindless was superior to instancing, or vice versa. I said instancing is a useful tool, but it is not a silver bullet. Bindless does allow you more flexibility in some circumstances (particularly non-instanced) because the CPU time wasted launching batches is greatly reduced. That means the cases in which you are forced to use instancing to "hammer in the screw" are less, and you can get finer grained CPU culling (and more state changes) with some of that reclaimed CPU time (read more complex scenes, not just a bunch of copies of stuff, and more efficient use of the GPU).

    Net, batch bindless means less wasted time on the CPU (just submitting batches) and less wasted time on the GPU (due to the resulting pipeline "bubbles" and through sending more irrelevant instanced "cruft" down the pipe than you need to, which the GPU is just going to throw away, after wasting cycles on it).

    Indeed, bindless has essentially nothing to do with GPU and everything to do with CPU inefficiency.
    If you discount the time the GPU is just sitting there twiddling its thumbs waiting on the CPU (which I don't/can't), then you are correct (...when talking about batch bindless specifically).

    Instancing...in essence, ... has nothing to do with batch submission overhead (ie: calling gl*Pointer) and has everything to do with state change overhead.
    If that were true, we would always just call glDrawElements or similar in a loop. But we don't. Why? Batch overhead. The more we can give the GPU to chew on at once, the less chance it'll be waiting on the CPU. And the more CPU time we'll have left over to do other things.

    (...well, assuming you're not eating bunches of CPU memory bandwidth transferring huge amounts of data to the GPU every frame.)

    As to the latter (reduced culling efficiency)...
    I don't see how culling is more or less efficient with instancing. Unless you're doing instancing by just rendering all instances all the time...
    In the simplest case for static instances, that's exactly what you'd like to do: here's a list of things -- just draw them. That way you can put the batch data on the GPU and just leave it there. Cheap, fast draw submission. Though if you let instance groups get too big, this flies in the face of LOD and culling efficiency which can net you a loss without careful balancing. Which brings us to...

    ...you're presumably using a streaming buffer object to upload an instance list to the GPU every frame, or at least every other frame. Can't you do your culling when building this list?
    Similar to what Rob was recently describing with his "streaming VBO" write-up. No, not yet. But given the LOD/cull shortcomings of static instancing, in cases where instancing ("copies of something") is what we want, we may end up doing something like this. (Though it's not without its problems. The cull and LOD state of the instance group can totally change in one frame in our world.)

    However, there are many other circumstances (complex scenes) where instancing is just hammering in a screw. This is where batch bindless is the most useful. No pipeline/engine contortions. Just get rid of the pointless CPU overhead so we can push more varied content.

  7. #157
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,732

    Re: Official feedback on OpenGL 4.0 thread

    I said instancing is a useful tool, but it is not a silver bullet.
    That's why I don't understand why you brought it up; nobody ever claimed instancing was a silver bullet or that bindless wasn't more general-purpose.

    If you discount the time the GPU is just sitting there twiddling its thumbs waiting on the CPU (which I don't/can't), then you are correct (...when talking about batch bindless specifically).
    Even so, it is important to denote the actual source of the problem, not the apparent source. And as you point out, the actual problem is the CPU, not the GPU.

    And why are you unable to test CPU time?

    If that were true, we would always just call glDrawElements or similar in a loop. But we don't. Why? Batch overhead.
    Except that "calling glDrawElements or similar in a loop" is not instancing. Without some kind of state change between glDrawElements calls, you will be drawing the exact same thing each time, which is not particularly useful. The purpose of instancing is to remove the state change overhead for the state changes necessary to do instanced rendering.

    It has nothing to do with the overhead of glDrawElements itself and everything to do with the overhead of glBindTexture, glUniform, and other state changes.

  8. #158
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    2,882

    Re: Official feedback on OpenGL 4.0 thread

    Quote Originally Posted by Alfonse Reinheart
    I said instancing is a useful tool, but it is not a silver bullet.
    That's why I don't understand why you brought it up; nobody ever claimed instancing was a silver bullet or that bindless wasn't more general-purpose.
    "Stephen A" said:
    Quote Originally Posted by Stephen A
    Why are bindless graphics superior to geometry instancing?
    which precipitated the whole instancing vs. bindless discussion. Hopefully this has clarified it for him.

    Quote Originally Posted by Alfonse Reinheart
    Quote Originally Posted by Dark Photon
    If that were true, we would always just call glDrawElements or similar in a loop. But we don't. Why? Batch overhead.
    Except that "calling glDrawElements or similar in a loop" is not instancing. Without some kind of state change between glDrawElements calls, you will be drawing the exact same thing each time
    Right, if you didn't change a vertex attribute and/or index data ptr. But that would be useless, so for instancing you would change them, and arguably this hits batch setup. This approach is similar to, but less efficient than, ARB_instanced_arrays, which does the same thing more efficiently under the covers with vertex stream frequency dividers. You could also do something similar with state changes (uniform/etc.).

    But anyway, we're picking nits and I think on the same page here.

  9. #159
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,732

    Re: Official feedback on OpenGL 4.0 thread

    Right, if you didn't change a vertex attribute and/or index data ptr.
    Is that how you render an object in different places? By changing a vertex attribute?

    No; usually, if you want to render the same object in two places, you set some uniforms, render it, change some uniforms and render it again. You don't bind or unbind vertex attributes, so bindless saves you nothing here.

    A bit more on-topic, after looking at the ARB_instanced_arrays spec, I'm not even sure what it even means anymore. The divisor used to mean something else, but they seem to have changed the spec. The divisor used to divide the index of the attribute, so as to match some old D3D 9 functionality. Now it has interactions with the instance value, which is entirely different from how it used to work.

    I'm not sure it was a good idea for the ARB to repurpose an extension like this.

  10. #160
    Super Moderator Frequent Contributor Groovounet's Avatar
    Join Date
    Jul 2004
    Posts
    936

    Re: Official feedback on OpenGL 4.0 thread

    Quote Originally Posted by Alfonse Reinheart
    Right, if you didn't change a vertex attribute and/or index data ptr.
    Is that how you render an object in different places? By changing a vertex attribute?

    No; usually, if you want to render the same object in two places, you set some uniforms, render it, change some uniforms and render it again. You don't bind or unbind vertex attributes, so bindless saves you nothing here.

    A bit more on-topic, after looking at the ARB_instanced_arrays spec, I'm not even sure what it even means anymore. The divisor used to mean something else, but they seem to have changed the spec. The divisor used to divide the index of the attribute, so as to match some old D3D 9 functionality. Now it has interactions with the instance value, which is entirely different from how it used to work.

    I'm not sure it was a good idea for the ARB to repurpose an extension like this.
    I don't think it actually changed. For each instance, the attributes with a divisor start back at the beginning just like any attribute. At least, that's the way I understand it, but maybe you can point me to a further interaction with instanced draw calls.
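For reference, here is how I read the divisor rule in the current spec text, written as a tiny C function. This is only a sketch of my understanding: `attribute_element` is a made-up helper, and it ignores base vertex and any other offsets.

```c
#include <assert.h>

/* Which array element an attribute reads for a given vertex of a given
   instance, as I understand the instanced-arrays rule:
     divisor == 0 : the attribute advances once per vertex
     divisor  > 0 : the attribute advances once every `divisor` instances */
unsigned attribute_element(unsigned vertex_index,
                           unsigned instance,
                           unsigned divisor)
{
    if (divisor == 0)
        return vertex_index;
    return instance / divisor; /* integer division rounds down */
}
```

So with a divisor of 2, instances 0 and 1 read element 0, instances 2 and 3 read element 1, and so on. An attribute with divisor 0 walks the vertex indices again for every instance. The divisor itself is set per attribute with `glVertexAttribDivisor` (or `glVertexAttribDivisorARB` when going through the extension).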

    With all the 'instancing' techniques we might have to be more accurate when we use this word 'instancing'.
