Part of the Khronos Group
OpenGL.org


Thread: rendering process optimization

  1. #1
    Junior Member Newbie
    Join Date
    Aug 2015
    Posts
    23

    rendering process optimization

    I'm rendering textured quads using an FBO and a streaming VBO. I then apply lights, using another streaming VBO. I'm looking to optimize my rendering process, and I'd like feedback on whether my ideas are a good trade-off, plus any other ideas you might have.

    Currently, my rendering process looks like this:

    Code :
    render(){
     
    texturedQuadShader.bind();
    texturedQuadVBO.bind();
    texturedQuadVBO.upload();
    texturedQuadVBO.drawElements();
     
    //particles
    particleShader.bind();
    particleVBO.bind();
    particleVBO.upload();
    particleVBO.drawElements();
     
    //lights
    lightShader.bind();
    lightVBO.bind();
    lightVBO.upload();
    lightVBO.drawElements();
    }

    The reason I'm doing this is because the vertex attributes differ between textured quads, particles and lights.

    Code :
    Textured Quads:
    coords = 2 floats
    texCoords = 2 shorts (normalized)
    color = 4 bytes (normalized)
    total = 16 bytes
     
    Particles:
    coords = 2 floats
    normal = 3 floats
    color = 4 bytes (normalized)
    total = 24 bytes
     
    Lights:
    positionXYZ = 3 floats
    radiusXY = 2 floats
    color = 4 floats
    total = 36 bytes
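    As a quick check of the totals above, here is a minimal Java sketch that adds up the three layouts and derives per-attribute byte offsets. The class, method, and constant names are hypothetical and not taken from the original code; it only assumes 4-byte floats, 2-byte shorts, and tightly packed attributes.

```java
// Sketch only: sizes of the three vertex formats listed in the post.
// All names here are hypothetical; attributes are assumed tightly packed.
final class VertexLayoutSketch {
    static final int FLOAT_BYTES = 4;
    static final int SHORT_BYTES = 2;
    static final int BYTE_BYTES = 1;

    // Each attribute is {componentCount, bytesPerComponent}.
    static final int[][] TEXTURED_QUAD = {
        {2, FLOAT_BYTES}, {2, SHORT_BYTES}, {4, BYTE_BYTES}
    };
    static final int[][] PARTICLE = {
        {2, FLOAT_BYTES}, {3, FLOAT_BYTES}, {4, BYTE_BYTES}
    };
    static final int[][] LIGHT = {
        {3, FLOAT_BYTES}, {2, FLOAT_BYTES}, {4, FLOAT_BYTES}
    };

    /** Size in bytes of one vertex (the stride for an interleaved layout). */
    static int stride(int[][] attributes) {
        return offsetOf(attributes, attributes.length);
    }

    /** Byte offset of the attribute at the given index within one vertex. */
    static int offsetOf(int[][] attributes, int index) {
        int offset = 0;
        for (int i = 0; i < index; i++) {
            offset += attributes[i][0] * attributes[i][1];
        }
        return offset;
    }
}
```

    The stride and the per-attribute offsets are the values that would be passed as the stride and pointer arguments of glVertexAttribPointer for each format.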

    STEP 1:

    Just have one VBO. It would then look like this:

    Code :
    init(){
     
    VBO.bind();
    }
     
    render(){
     
    VBO.upload();
    texturedQuadShader.bind();
    texturedQuadVBO.drawElements();
     
    //particles
    particleShader.bind();
    particleVBO.drawElements();
     
    //lights
    lightShader.bind();
    lightVBO.drawElements();
    }

    Tradeoff:
    pros: I never have to switch VBOs. I'll keep one bound during execution.
    cons: I'll have to come up with a universal vertex attribute format for all my primitives. This will increase my textured quads' vertex size from 16 bytes to maybe 20 (which doesn't align as nicely). It will also require some more computation in the vertex shaders to transform the input into something usable.

    STEP 2:

    In addition to step 1, just have one super-shader with a couple of if-statements and uniforms (booleans isLight, isTexturedQuad, isParticle, etc.).

    Code :
    init(){
     
    VBO.bind();
    shader.bind();
    }
     
    render(){
     
    VBO.upload();
     
    setUniform(isTexturedQuad = true);
    texturedQuadVBO.drawElements();
     
    setUniform(isParticle = true);
    particleVBO.drawElements();
     
    setUniform(isLight = true);
    lightVBO.drawElements();
    }

    Tradeoff:
    pros: one shader bound at all times.
    cons: 3 if-statements in the vertex shader.

    So, what do you think? Would I benefit from implementing these ideas? Is there a smarter way of doing things?

  2. #2
    Senior Member OpenGL Pro
    Join Date
    Jan 2007
    Posts
    1,728
    With a single VBO you don't need one common vertex format. You can quite easily use a different set of glVertexAttribPointer calls for one region of it to what you use for another region.
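    To illustrate the per-region idea, here is a minimal Java sketch that computes where each region would start if the three vertex formats were packed back to back in one buffer. The class and method names are hypothetical, and the vertex counts in the usage note are made up for illustration.

```java
// Sketch only: byte offsets of differently formatted regions in one shared VBO.
// Names are hypothetical; regions are assumed to be packed back to back.
final class SharedBufferRegions {
    /**
     * Returns the byte offset at which each region starts.
     * strides[i] is the vertex size of region i, vertexCounts[i] its vertex count.
     */
    static long[] regionOffsets(int[] strides, int[] vertexCounts) {
        long[] offsets = new long[strides.length];
        long cursor = 0;
        for (int i = 0; i < strides.length; i++) {
            offsets[i] = cursor;
            cursor += (long) strides[i] * vertexCounts[i];
        }
        return offsets;
    }
}
```

    For example, with strides of 16, 24 and 36 bytes and 80 000, 1 000 and 10 vertices, the regions would start at bytes 0, 1 280 000 and 1 304 000. Before drawing a region, each glVertexAttribPointer call would use that region's stride and pass the region's start offset plus the attribute's offset within the vertex as its pointer argument.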

    However, and from your description, none of these are actually bottlenecks at all. You have 3 VBO changes and 3 shader changes per frame, which is incredibly low: you're not going to gain anything by reducing those numbers.

    If your program is running slow then you'll need to look elsewhere for optimization potential. Most likely candidates are your VBO uploads (if you get these wrong you can easily cut performance to about one-third as you'd break the ability of the GPU to run asynchronously) or inefficient shader code.

  3. #3
    Junior Member Newbie
    Join Date
    Aug 2015
    Posts
    23
    Quote Originally Posted by mhagain
    With a single VBO you don't need one common vertex format. You can quite easily use a different set of glVertexAttribPointer calls for one region of it to what you use for another region.

    However, and from your description, none of these are actually bottlenecks at all. You have 3 VBO changes and 3 shader changes per frame, which is incredibly low: you're not going to gain anything by reducing those numbers.

    If your program is running slow then you'll need to look elsewhere for optimization potential. Most likely candidates are your VBO uploads (if you get these wrong you can easily cut performance to about one-third as you'd break the ability of the GPU to run asynchronously) or inefficient shader code.
    Very well! I had no idea what was considered low or not, so this was the answer I was looking for. Care to look at my VBO code?

    Code :
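        public void bindAndUpload(){
            // Nothing was queued this frame, so there is nothing to upload.
            if (count == 0){
                return;
            }
     
            // Switch the client-side buffer from writing to reading.
            buffer.flip();
     
            glBindVertexArray(vertexArrayID);
     
            glBindBuffer(GL_ARRAY_BUFFER, attributeElementID);
     
            for (int i = 0; i < NR_OF_ATTRIBUTES; i++){
                glEnableVertexAttribArray(i);
            }
     
            // Overwrite the start of the VBO with this frame's vertex data.
            glBufferSubData(GL_ARRAY_BUFFER, 0, count*ELEMENT_SIZE, buffer);
     
        }
     
        // Draws the quads queued since the previous flush, up to (not including) 'last'.
        public void flush(int last){
     
            if (count == 0){
                return;
            }
            // 6 indices per quad; tmpCount*24 is the byte offset into the index
            // buffer (6 indices * 4 bytes per GL_UNSIGNED_INT).
            glDrawElements(GL_TRIANGLES, (last-tmpCount)*6, GL11.GL_UNSIGNED_INT, tmpCount*24);
            tmpCount = last;
     
        }
     
        // Draws all remaining quads, then resets the buffer for the next frame.
        public void flush(){
     
            if (count == 0){
                return;
            }
            glDrawElements(GL_TRIANGLES, (count-tmpCount)*6, GL11.GL_UNSIGNED_INT, tmpCount*24);
            clear();
        }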

    I have two flush methods so that I can switch the stencil test between "layers" in my frame. E.g. first I render the GUI, setting a stencil bit, then I render the actual model. This way I can also control which lights are applied to which "layer", but that's beyond the scope of my question.
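    The arithmetic used by the two flush methods can be written out as a small Java sketch. It assumes, as the code above suggests, an index buffer with 6 consecutive GL_UNSIGNED_INT indices per quad; the class and method names are hypothetical.

```java
// Sketch only: index count and byte offset for drawing a sub-range of quads.
// Assumes 6 indices per quad stored as 4-byte unsigned ints; names are hypothetical.
final class QuadRange {
    static final int INDICES_PER_QUAD = 6; // two triangles per quad
    static final int BYTES_PER_INDEX = 4;  // GL_UNSIGNED_INT

    /** Number of indices needed to draw quads firstQuad (inclusive) to lastQuad (exclusive). */
    static int indexCount(int firstQuad, int lastQuad) {
        return (lastQuad - firstQuad) * INDICES_PER_QUAD;
    }

    /** Byte offset into the index buffer at which firstQuad's indices begin. */
    static long indexByteOffset(int firstQuad) {
        return (long) firstQuad * INDICES_PER_QUAD * BYTES_PER_INDEX;
    }
}
```

    Drawing a GUI layer of 100 quads followed by a model layer of 150 quads would then be two draw calls: 600 indices at byte offset 0, and 900 indices at byte offset 2 400.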

    I am aware that I upload the VBO and start drawing elements straight afterwards, so the draw has to wait for the upload to complete. I've tried "double buffering", but there was not much of a performance boost, and it added a 1-frame input delay that I didn't like much.

    As I said, I'm using a fully streamed VBO. I've seen examples of high-end machines rendering up to 1 000 000 vertices with static VBOs. My Nvidia 660 starts stuttering when I draw 20 000 16x16 textured quads. I don't know if that's a good or bad number.
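    To put a number on the streaming load, here is a rough Java sketch of the upload size for this case. It assumes 4 vertices per quad (indexed drawing) and the 16-byte textured-quad vertex from the first post; the 60 frames per second in the usage note is an assumption, and the names are hypothetical.

```java
// Sketch only: how much vertex data is streamed per frame and per second.
// Names are hypothetical; inputs are taken from or assumed about the posts.
final class UploadEstimate {
    /** Bytes of vertex data uploaded for one frame. */
    static long bytesPerFrame(int quads, int verticesPerQuad, int bytesPerVertex) {
        return (long) quads * verticesPerQuad * bytesPerVertex;
    }

    /** Bytes of vertex data uploaded per second at a given frame rate. */
    static long bytesPerSecond(long bytesPerFrame, int framesPerSecond) {
        return bytesPerFrame * framesPerSecond;
    }
}
```

    For 20 000 quads this gives 1 280 000 bytes per frame, or 76 800 000 bytes per second at an assumed 60 frames per second. Whether that amount is the limiting factor is something only a measurement can tell.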

    My shader is simply:
    Vertex:
    transform coordinates
    Fragment:
    sample 2 textures (diffuse and normal)
    multiply with color input
    write to 2 textures.

  4. #4
    Member Regular Contributor malexander
    Join Date
    Aug 2009
    Location
    Ontario
    Posts
    369
    Quote Originally Posted by jaketehsnake
    I am aware that I upload the VBO and start drawing elements straight afterwards, so the draw has to wait for the upload to complete. I've tried "double buffering", but there was not much of a performance boost, and it added a 1-frame input delay that I didn't like much.
    You don't need to delay the frame data at all with double buffering. What it means is to write to two different VBOs, one every other frame, and draw with the VBO you just filled. If you write to the same VBO, it causes a stall because glBufferSubData() must wait until the GPU has finished rendering the previous frame with the previous contents. So what you want to do is:

    Code :
    glBindBuffer(GL_ARRAY_BUFFER, attributeElementID[0]);
    glBufferSubData(...);
    glDrawElements();
     
    // next frame:
    glBindBuffer(GL_ARRAY_BUFFER, attributeElementID[1]);
    glBufferSubData(...);
    glDrawElements();

    You can also triple buffer this way, which some presentations have suggested is the safe number to get rid of GPU stalls.
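    The alternation between two or three VBOs can be kept in one small helper. This is a minimal Java sketch, assuming the buffer IDs have already been created elsewhere; the class name and the IDs in the usage note are hypothetical.

```java
// Sketch only: round-robin selection of the VBO to fill and draw each frame.
// The class is hypothetical; buffer IDs are assumed to come from glGenBuffers.
final class BufferRing {
    private final int[] bufferIds;
    private int next = 0;

    BufferRing(int... bufferIds) {
        if (bufferIds.length == 0) {
            throw new IllegalArgumentException("at least one buffer is required");
        }
        this.bufferIds = bufferIds.clone();
    }

    /** Returns the buffer to fill and draw from this frame, then advances to the next one. */
    int acquire() {
        int id = bufferIds[next];
        next = (next + 1) % bufferIds.length;
        return id;
    }
}
```

    Each frame would call acquire() once, bind the returned ID, upload with glBufferSubData, and draw. Passing two IDs gives double buffering and three gives triple buffering. One thing to watch when a VAO is used: binding a different buffer to GL_ARRAY_BUFFER does not by itself change where the attributes are read from, so the attribute pointers would need to be set again after the bind, or one VAO kept per buffer.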

  5. #5
    Junior Member Newbie
    Join Date
    Aug 2015
    Posts
    23
    Quote Originally Posted by malexander
    You don't need to delay the frame data at all with double buffering. What it means is to write to two different VBOs, one every other frame, and draw with the VBO you just filled. If you write to the same VBO, it causes a stall because glBufferSubData() must wait until the GPU has finished rendering the previous frame with the previous contents. So what you want to do is:

    Code :
    glBindBuffer(GL_ARRAY_BUFFER, attributeElementID[0]);
    glBufferSubData(...);
    glDrawElements();
     
    // next frame:
    glBindBuffer(GL_ARRAY_BUFFER, attributeElementID[1]);
    glBufferSubData(...);
    glDrawElements();

    You can also triple buffer this way, which some presentations have suggested is the safe number to get rid of GPU stalls.
    Thanks! I'll implement this right away.

  6. #6
    Senior Member OpenGL Pro
    Join Date
    Jan 2007
    Posts
    1,728
    Another way of handling this, if the buffer size doesn't change and if you overwrite the entire buffer each frame, is to use glBufferData. The driver should automatically handle the double-(or triple-)buffering for you and you won't need to allocate any extra buffers yourself.
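    One common variant of this idea is to re-specify the buffer's storage with glBufferData each frame and then fill it. The Java sketch below shows only the call order. The BufferCalls interface is a hypothetical stand-in for the two GL calls, introduced so the sketch can run without a GL context; it is not a real binding API.

```java
// Sketch only: re-specify the buffer store each frame, then upload into it.
// BufferCalls is a hypothetical stand-in for glBufferData / glBufferSubData.
final class OrphanUploadSketch {
    interface BufferCalls {
        /** Stands in for glBufferData with a size and no initial data. */
        void bufferData(long sizeInBytes);

        /** Stands in for glBufferSubData on the currently bound buffer. */
        void bufferSubData(long offset, java.nio.ByteBuffer data);
    }

    /** Requests fresh storage of the same capacity, then writes this frame's vertices. */
    static void upload(BufferCalls gl, long capacityBytes, java.nio.ByteBuffer vertices) {
        if (vertices.remaining() > capacityBytes) {
            throw new IllegalArgumentException("vertex data exceeds buffer capacity");
        }
        gl.bufferData(capacityBytes);  // same size every frame
        gl.bufferSubData(0, vertices); // fill the new storage from the start
    }
}
```

    Keeping the size (and usage hint) identical from frame to frame is what may let the driver hand out new storage while the GPU is still reading the old one; how well this works depends on the driver.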

    Since you mentioned "streaming" in your OP I'm going to assume that this doesn't apply to your current use case, but it's worth bearing in mind in case any future uses can match this pattern.

  7. #7
    Senior Member OpenGL Guru Dark Photon
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    4,167
    Hey jake, first step I think would be to nail down what your bottleneck is. First, bench rendering from static VBOs pre-uploaded to the GPU (i.e. not uploading -- aka streaming -- every frame). For max perf (since you mention NVidia), I'd either wrap each batch in a display list, or use bindless vertex attributes to launch the batches; however, you have so few batches per frame it probably won't make a difference whether you do or not.

    Once you have that baseline metric on your GPU for render time w/o upload (e.g. frame time w/o vsync, for a specific frame render scenario), then you can bench that against the "with upload" (streaming) case to see to what degree you're upload bound or bound by something else, and to measure alternate streaming implementations against. If you are upload bound, I see a number of things we could do to speed you up. For a primer, read this: Buffer_Object_Streaming. And that's just a start. On NVidia, you can really make the performance of streaming VBOs fly, especially if you support reuse. Been there; done that. And there's been some new GL features added since then too!
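    The comparison described above can be sketched in Java as follows. The renderFrame callback is assumed to render one complete frame, including the buffer swap with vsync off; all names are hypothetical, and the numbers in the usage note are made up.

```java
// Sketch only: compare frame time without uploads against frame time with streaming.
// Names are hypothetical; renderFrame is assumed to include the buffer swap.
final class UploadCostSketch {
    /** Average wall-clock time per frame in milliseconds over the given number of frames. */
    static double averageFrameMillis(Runnable renderFrame, int frames) {
        long start = System.nanoTime();
        for (int i = 0; i < frames; i++) {
            renderFrame.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / frames;
    }

    /** Extra time per frame that the streaming case costs over the static baseline. */
    static double uploadMillis(double staticFrameMillis, double streamingFrameMillis) {
        return streamingFrameMillis - staticFrameMillis;
    }

    /** Fraction of the streaming frame time attributable to the upload. */
    static double uploadShare(double staticFrameMillis, double streamingFrameMillis) {
        return uploadMillis(staticFrameMillis, streamingFrameMillis) / streamingFrameMillis;
    }
}
```

    With a made-up baseline of 4 ms and a streaming frame time of 10 ms, the upload would account for 6 ms, or 60% of the frame, which would point at the streaming path. A share close to zero would point elsewhere.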

  8. #8
    Senior Member OpenGL Pro
    Join Date
    Jan 2007
    Posts
    1,728
    The other thing I'm pondering here is those flush methods. Do they imply that what you list as "drawElements" in your OP may in fact be making more than 1 call to flush? If you have lots and lots of draw calls per frame then performance will go through the floor.

  9. #9
    Junior Member Newbie
    Join Date
    Aug 2015
    Posts
    23
    Quote Originally Posted by mhagain
    The other thing I'm pondering here is those flush methods. Do they imply that what you list as "drawElements" in your OP may in fact be making more than 1 call to flush? If you have lots and lots of draw calls per frame then performance will go through the floor.
    They are called maybe 1-10 times per loop iteration. That shouldn't be a problem, should it?

  10. #10
    Senior Member OpenGL Pro
    Join Date
    Jan 2007
    Posts
    1,728
    Quote Originally Posted by jaketehsnake
    They are called maybe 1-10 times per loop iteration. That shouldn't be a problem, should it?

    By "loop iteration" I assume you mean "frame", so that's 1-10 draw calls per frame, which isn't a problem.

    You say you're drawing textured quads. Are they large? And do they overlap? And do they blend with each other?
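    These questions seem aimed at the fragment workload, which grows with quad size and overlap. A rough Java sketch of the numbers from post #3 follows. It assumes "16x16" means 16x16 pixels on screen, and the 1920x1080 framebuffer in the usage note is an assumption, not something stated in the thread; the names are hypothetical.

```java
// Sketch only: fragments generated per frame and average overdraw.
// Names are hypothetical; quad size is assumed to be in screen pixels.
final class FillEstimate {
    /** Total fragments produced by the given number of equally sized quads. */
    static long coveredFragments(int quads, int widthPx, int heightPx) {
        return (long) quads * widthPx * heightPx;
    }

    /** Average number of times each framebuffer pixel would be shaded. */
    static double averageOverdraw(long fragments, int screenWidth, int screenHeight) {
        return (double) fragments / ((long) screenWidth * screenHeight);
    }
}
```

    20 000 quads of 16x16 pixels come to 5 120 000 fragments per frame. On an assumed 1920x1080 framebuffer that averages about 2.5 shaded fragments per pixel, each sampling two textures and writing to two render targets, with blending adding a framebuffer read on top if it is enabled.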
