rendering process optimization

I’m rendering textured quads using an FBO and a streaming VBO. I then apply lights using another streaming VBO. I’m looking to optimize my rendering process, but I’d like some feedback on whether my ideas are a good trade-off, and perhaps other ideas if you have them.

My rendering process currently looks like this:


render(){

    //textured quads
    texturedQuadShader.bind();
    texturedQuadVBO.bind();
    texturedQuadVBO.upload();
    texturedQuadVBO.drawElements();

    //particles
    particleShader.bind();
    particleVBO.bind();
    particleVBO.upload();
    particleVBO.drawElements();

    //lights
    lightShader.bind();
    lightVBO.bind();
    lightVBO.upload();
    lightVBO.drawElements();
}

The reason I’m doing this is that the vertex attributes differ between textured quads, particles, and lights.


Textured Quads:
coords = 2 floats
texCoords = 2 shorts (normalized)
color = 4 bytes (normalized)
total = 16 bytes

Particles:
coords = 2 floats
normal = 3 floats
color = 4 bytes (normalized)
total = 24 bytes

Lights:
positionXYZ = 3 floats
radiusXY = 2 floats
color = 4 floats
total = 36 bytes
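
For reference, packing a single textured-quad vertex into the streaming ByteBuffer looks roughly like this (a simplified sketch; x/y, u/v and r/g/b/a are placeholders for the actual values):

    // one textured-quad vertex = 16 bytes
    buffer.putFloat(x).putFloat(y);        // coords: 2 floats            (8 bytes)
    buffer.putShort(u).putShort(v);        // texCoords: 2 norm. shorts   (4 bytes)
    buffer.put(r).put(g).put(b).put(a);    // color: 4 norm. bytes        (4 bytes)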

STEP 1:

Just have one VBO. It would then look like this:


init(){

    VBO.bind();
}

render(){

    VBO.upload();

    //textured quads
    texturedQuadShader.bind();
    texturedQuadVBO.drawElements();

    //particles
    particleShader.bind();
    particleVBO.drawElements();

    //lights
    lightShader.bind();
    lightVBO.drawElements();
}


Trade-off:
Pros: I never have to switch VBOs; one stays bound for the whole frame.
Cons: I’ll have to come up with a universal vertex attribute format for all my primitives. This would grow my textured-quad vertex from 16 bytes to maybe 20 (which isn’t as nicely aligned). It would also require some extra computation in the vertex shaders, transforming the input into something usable.

STEP 2:

In addition to step 1, have just one super-shader with a couple of if-statements and uniforms (booleans isLight, isTexturedQuad, isParticle, etc.).


init(){

    VBO.bind();
    shader.bind();
}

render(){

    VBO.upload();

    //textured quads
    setUniform(isTexturedQuad = true);
    texturedQuadVBO.drawElements();

    //particles
    setUniform(isParticle = true);
    particleVBO.drawElements();

    //lights
    setUniform(isLight = true);
    lightVBO.drawElements();
}


Trade-off:
Pros: one shader bound at all times.
Cons: 3 if-statements in the vertex shader.

So, what do you think? Would I benefit from implementing these ideas? Is there a smarter way of doing things?

With a single VBO you don’t need one common vertex format. You can quite easily use a different set of glVertexAttribPointer calls for one region of the buffer than you use for another.
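
For example, something along these lines (just a sketch; the offsets, attribute indices and counts are made up, and each region’s pointers are respecified right before its draw call):

    // textured-quad region: 2 floats + 2 normalized shorts + 4 normalized bytes = 16-byte stride
    glBindBuffer(GL_ARRAY_BUFFER, sharedVBO);
    glVertexAttribPointer(0, 2, GL_FLOAT,         false, 16, quadRegionOffset);
    glVertexAttribPointer(1, 2, GL_SHORT,         true,  16, quadRegionOffset + 8);
    glVertexAttribPointer(2, 4, GL_UNSIGNED_BYTE, true,  16, quadRegionOffset + 12);
    texturedQuadShader.bind();
    glDrawElements(GL_TRIANGLES, quadIndexCount, GL_UNSIGNED_INT, 0);

    // particle region: 2 floats + 3 floats + 4 normalized bytes = 24-byte stride
    glVertexAttribPointer(0, 2, GL_FLOAT,         false, 24, particleRegionOffset);
    glVertexAttribPointer(1, 3, GL_FLOAT,         false, 24, particleRegionOffset + 8);
    glVertexAttribPointer(2, 4, GL_UNSIGNED_BYTE, true,  24, particleRegionOffset + 20);
    particleShader.bind();
    glDrawElements(GL_TRIANGLES, particleIndexCount, GL_UNSIGNED_INT, particleIndexByteOffset);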

However, and from your description, none of these are actually bottlenecks at all. You have 3 VBO changes and 3 shader changes per frame, which is incredibly low: you’re not going to gain anything by reducing those numbers.

If your program is running slow then you’ll need to look elsewhere for optimization potential. Most likely candidates are your VBO uploads (if you get these wrong you can easily cut performance to about one-third as you’d break the ability of the GPU to run asynchronously) or inefficient shader code.

[QUOTE=mhagain;1279325]With a single VBO you don’t need one common vertex format. You can quite easily use a different set of glVertexAttribPointer calls for one region of the buffer than you use for another.

However, and from your description, none of these are actually bottlenecks at all. You have 3 VBO changes and 3 shader changes per frame, which is incredibly low: you’re not going to gain anything by reducing those numbers.

If your program is running slow then you’ll need to look elsewhere for optimization potential. Most likely candidates are your VBO uploads (if you get these wrong you can easily cut performance to about one-third as you’d break the ability of the GPU to run asynchronously) or inefficient shader code.[/QUOTE]

Very well! I had no idea what was considered low and what wasn’t, so this was exactly the answer I was looking for. Care to look at my VBO code?


    public void bindAndUpload(){
        if (count == 0){
            return;
        }

        buffer.flip();

        glBindVertexArray(vertexArrayID);
        glBindBuffer(GL_ARRAY_BUFFER, attributeElementID);

        for (int i = 0; i < NR_OF_ATTRIBUTES; i++){
            glEnableVertexAttribArray(i);
        }

        // upload everything written this frame: count vertices of ELEMENT_SIZE bytes
        glBufferSubData(GL_ARRAY_BUFFER, 0, count*ELEMENT_SIZE, buffer);
    }

    public void flush(int last){
        if (count == 0){
            return;
        }
        // draw the quads in [tmpCount, last): 6 indices per quad, 4-byte (GL_UNSIGNED_INT) indices
        glDrawElements(GL_TRIANGLES, (last-tmpCount)*6, GL11.GL_UNSIGNED_INT, tmpCount*24);
        tmpCount = last;
    }

    public void flush(){
        if (count == 0){
            return;
        }
        // draw whatever remains, then reset for the next frame
        glDrawElements(GL_TRIANGLES, (count-tmpCount)*6, GL11.GL_UNSIGNED_INT, tmpCount*24);
        clear();
    }

I have two flush methods so that I can switch the stencil test between “layers” in my frame. E.g. first I render the GUI while setting a stencil bit, then I render the actual model. This way I can also control which lights are applied to which “layer”, but that’s beyond the scope of my question.
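
Roughly, a frame looks like this (simplified; guiQuadCount and the exact stencil settings are just illustrative):

    bindAndUpload();                            // upload everything for this frame once

    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);  // GUI layer marks the stencil
    flush(guiQuadCount);                        // draws quads [0, guiQuadCount)

    glStencilFunc(GL_EQUAL, 0, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);     // model layer only where the GUI didn’t draw
    flush();                                    // draws the remaining quads, then clear()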

I am aware that I upload the VBO and then start drawing elements straight after, so the draw has to wait for the upload to complete. I’ve tried “double buffering”, but there wasn’t much of a performance boost, and it added a one-frame input delay that I didn’t like much.

As I said, I’m using a fully streamed VBO. I’ve seen examples of high-end machines rendering up to 1,000,000 vertices with static VBOs. My NVIDIA 660 starts stuttering when I draw 20,000 16x16 textured quads. I don’t know if that’s a good or bad number.

My shader is simply:
Vertex:
transform coordinates
Fragment:
sample 2 textures (diffuse and normal)
multiply with color input
write to 2 textures.

You don’t need to delay the frame data at all with double buffering. What it means is writing to two different VBOs, alternating every frame, and drawing from the VBO you just filled. If you write to the same VBO, it causes a stall because glBufferSubData() must wait until the GPU has finished rendering the previous frame with the previous contents. So what you want to do is:


glBindBuffer(GL_ARRAY_BUFFER, attributeElementID[0]);
glBufferSubData(...);
glDrawElements();

// next frame:
glBindBuffer(GL_ARRAY_BUFFER, attributeElementID[1]);
glBufferSubData(...);
glDrawElements();

You can also triple-buffer this way; some presentations have suggested three is the safe number to get rid of GPU stalls.
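
In your bindAndUpload() that could look roughly like this (a sketch; attributeElementIDs would be an array of two or three buffer IDs created at init, and frame a member you advance each frame):

    // round-robin between the streaming buffers so this frame’s upload never
    // waits on the buffer the GPU is still drawing from
    frame = (frame + 1) % attributeElementIDs.length;
    glBindBuffer(GL_ARRAY_BUFFER, attributeElementIDs[frame]);
    glBufferSubData(GL_ARRAY_BUFFER, 0, buffer);
    // ...then re-specify the vertex attribute pointers for the newly bound
    //    buffer (or keep one VAO per buffer)...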

[QUOTE=malexander;1279338]You don’t need to delay the frame data at all with double buffering. What it means is writing to two different VBOs, alternating every frame, and drawing from the VBO you just filled. If you write to the same VBO, it causes a stall because glBufferSubData() must wait until the GPU has finished rendering the previous frame with the previous contents. So what you want to do is:


glBindBuffer(GL_ARRAY_BUFFER, attributeElementID[0]);
glBufferSubData(...);
glDrawElements();

// next frame:
glBindBuffer(GL_ARRAY_BUFFER, attributeElementID[1]);
glBufferSubData(...);
glDrawElements();

You can also triple-buffer this way; some presentations have suggested three is the safe number to get rid of GPU stalls.[/QUOTE]

Thanks! I’ll implement this right away.

Another way of handling this, if the buffer size doesn’t change and if you overwrite the entire buffer each frame, is to use glBufferData. The driver should automatically handle the double-(or triple-)buffering for you and you won’t need to allocate any extra buffers yourself.

Since you mentioned “streaming” in your OP I’m going to assume that this doesn’t apply to your current use case, but it’s worth bearing in mind in case any future uses can match this pattern.
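
In other words, something like this each frame (a sketch; BUFFER_SIZE_BYTES stands for whatever fixed size you allocate at init):

    glBindBuffer(GL_ARRAY_BUFFER, attributeElementID);

    // hand the driver a brand-new data store every frame; it can give you fresh
    // memory while the GPU is still reading last frame's copy
    glBufferData(GL_ARRAY_BUFFER, buffer, GL_STREAM_DRAW);

    // a common "orphan then fill" variant of the same idea:
    // glBufferData(GL_ARRAY_BUFFER, BUFFER_SIZE_BYTES, GL_STREAM_DRAW);
    // glBufferSubData(GL_ARRAY_BUFFER, 0, buffer);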

Hey jake, the first step, I think, would be to nail down what your bottleneck is. First, bench rendering from static VBOs pre-uploaded to the GPU (i.e. not uploading – aka streaming – every frame). For max perf (since you mention NVidia), I’d either wrap each batch in a display list or use bindless vertex attributes to launch the batches; however, you have so few batches per frame it probably won’t make a difference whether you do or not.
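
For that baseline, something along these lines is enough (a sketch; capturedFrameData is a placeholder for a ByteBuffer holding one representative frame’s vertex data):

    // at init: upload one captured frame's worth of vertex data as a static VBO
    int staticVBO = glGenBuffers();
    glBindBuffer(GL_ARRAY_BUFFER, staticVBO);
    glBufferData(GL_ARRAY_BUFFER, capturedFrameData, GL_STATIC_DRAW);

    // per frame: identical attribute setup and glDrawElements calls as now, but
    // with no glBufferSubData; the frame-time difference vs. your streaming
    // path approximates your per-frame upload cost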

Once you have that baseline metric on your GPU for render time w/o upload (e.g. frame time w/o vsync, for a specific frame render scenario), then you can bench that against the “with upload” (streaming) case to see to what degree you’re upload bound or bound by something else, and to measure alternate streaming implementations against. If you are upload bound, I see a number of things we could do to speed you up. For a primer, read this: Buffer_Object_Streaming. And that’s just a start. On NVidia, you can really make the performance of streaming VBOs fly, especially if you support reuse. Been there; done that. And there’s been some new GL features added since then too!

The other thing I’m pondering here is those flush methods. Do they imply that what you list as “drawElements” in your OP may in fact be making more than 1 call to flush? If you have lots and lots of draw calls per frame then performance will go through the floor.

They are called maybe 1-10 times per loop iteration. That shouldn’t be a problem, should it?

By “loop iteration” I assume you mean “frame”, so that’s 1-10 draw calls per frame, which isn’t a problem.

You say you’re drawing textured quads. Are they large? And do they overlap? And do they blend with each other?

You are correct, mhagain. Terminology isn’t my strong suit.

The quads aren’t that large. It’s a tile-based game I’m doing. It looks like this:

[screenshot attachment: the tile map with top and bottom GUI panels]

where the “map” is built from 16x16 “texels”, each rendered within a 64x64 quad, so I get 4x scale.

The reason I’m doing multiple draw calls per frame is so that I don’t render tiles “behind” the GUI, like the top and bottom panels in the image. I treat the map and the GUI as different “layers”. It’s also a neat way of being able to apply different lights to these layers.

And yes, alas, I’m using blending. I discard fully transparent texels in the fragment shader, but blend if they’re not.

The quads do overlap a lot. The tiles in the picture are usually made up of three layers: grass/dirt -> forest/mountain -> road/building. On top of that there can be an entity or something similar.

I assume you’re fishing for me to skip blending. I have actually tried. I’ve tried rendering the scene front to back with a stencil test, but that gave me some undesired results, like no soft edges. I’ve also tried implementing my own blend function where I read from the same texture I’m rendering to, to check whether alpha was < 1.0 and then do my own blend, but that ended up a flickering nightmare. (I know now that that’s a no-go in OpenGL.)

DarkPython: Good advice! I’ll look into it.

It’s more that I’m sniffing around large overlapping quads - blending is just icing on the cake, but it’s typical in this sort of use case.

Lots of overlapping quads can mean that your bottleneck may actually be fillrate. There’s not much you can do CPU-side about that.

[QUOTE=mhagain;1279352]It’s more that I’m sniffing around large overlapping quads - blending is just icing on the cake, but it’s typical in this sort of use case.

Lots of overlapping quads can mean that your bottleneck may actually be fillrate. There’s not much you can do CPU-side about that.[/QUOTE]

Indeed. I’ve now noticed that if I fill the screen with 64x64 quads, I get stuttering, but if I fill it with 16x16 quads it’s fine, even though it’s 16x more work. I would normally say the fewer primitives, the faster the execution, but apparently not. Why is this? Is it a thread synchronization thing? Is there nothing to be done? I’m thinking I could actually split up each quad into 4 or 6 triangles instead of one, but that just seems counter-intuitive and stupid.

But if I fill it with 16x16 quads it’s fine, even though it’s 16x more work.

But it’s not. It’s only 16 times the number of vertices. As far as rasterization, fragment shader, and per-sample operations are concerned, it’s the same amount of work: one per pixel.

And assuming you’re not doing much in your vertex shader, that’s where most of your “work” is actually being done: fetching from textures. That happens once per pixel. So the two cases will generally have more or less the same performance.

Why is this?

It could be any number of things. You may even be misidentifying “stuttering”; what you perceive as stuttering may simply be some form of tearing when rendering at faster framerates than your monitor displays. Without actual timing measurements, there’s no way to be sure.

My main concern here is this: if all you’re doing is rendering quads that fill up the screen exactly once, you shouldn’t be seeing any visible loss of framerate at all. Even on embedded GPUs, you should be getting hundreds of FPS. So if you’re seeing a significant change in rendering time (measured, not eye-balled), that suggests that something very strange is going on.

[QUOTE=Alfonse Reinheart;1279407]But it’s not. It’s only 16 times the number of vertices. As far as rasterization, fragment shader, and per-sample operations are concerned, it’s the same amount of work: one per pixel.

And assuming you’re not doing much in your vertex shader, that’s where most of your “work” is actually being done: fetching from textures. That happens once per pixel. So the two cases will generally have more or less the same performance.

It could be any number of things. You may even be misidentifying “stuttering”; what you perceive as stuttering may simply be some form of tearing when rendering at faster framerates than your monitor displays. Without actual timing measurements, there’s no way to be sure.

My main concern here is this: if all you’re doing is rendering quads that fill up the screen exactly once, you shouldn’t be seeing any visible loss of framerate at all. Even on embedded GPUs, you should be getting hundreds of FPS. So if you’re seeing a significant change in rendering time (measured, not eye-balled), that suggests that something very strange is going on.[/QUOTE]

Without any glFlush / glFinish / VSync I do get 200-300 FPS, but naturally there is tearing going on. Stuttering occurs when I activate VSync, which almost always causes the FPS to drop from 60 to 30. And with glFinish, the rendering process usually takes up 60-100% of my frame time. This doesn’t happen if I draw the 16x16 quads, though.

What’s strange is that when I monitor my GPU with GPU-Z while rendering and experiencing 30 FPS, the GPU clock stays at ~150 MHz (which is the lowest), while the load is at 80-100%. In other graphical applications the clock rises a bit and the load is low.

Ok, so for some reason your program (on that CPU, GPU, and configuration) is sometimes overrunning 16.6ms (one 60Hz frame), and flipping down from 60Hz to 30Hz, which you’re perceiving to be stuttering. You just need to figure out why and fix it.

What’s strange is that when I monitor my GPU with GPU-Z while rendering and experiencing 30 FPS, the GPU clock stays at ~150 MHz (which is the lowest), while the load is at 80-100%. In other graphical applications the clock rises a bit and the load is low.

Disable dynamic GPU clock speed (PowerMizer, or whatever it’s called on your GPU), and nail it to the highest clock speeds. Does this help your problem? If so, it’s GPU side.

If not, look CPU-side. If this is a desktop GPU, try:


   << render frame >>
  glFinish();
  SwapBuffers();
  glFinish();

Time everything from the beginning to end of this code segment, and report the result (in milliseconds).
Also time everything from beginning to right after the first glFinish().

This should get rid of the often-annoying behavior of some drivers to start capturing GL commands for the “next” frame before the “current” frame has actually swapped. This can lead to a number of different stuttering/popping-like artifacts in your application due to variable latency and CPU blocking at random points in the frame, not to mention messing up your per-frame statistics collection.
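
In code, that timing might look something like this (a sketch assuming a GLFW-style swap; renderFrame() and window are placeholders for your own loop):

    long t0 = System.nanoTime();

    renderFrame();                  // << render frame >>
    glFinish();
    long t1 = System.nanoTime();    // beginning of frame -> first glFinish

    glfwSwapBuffers(window);
    glFinish();                     // wait until the swap has actually completed
    long t2 = System.nanoTime();    // beginning of frame -> end of swap

    System.out.printf("render: %.2f ms, full frame: %.2f ms%n",
                      (t1 - t0) / 1e6, (t2 - t0) / 1e6);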

If you can’t find it, distill the essence of what you’re doing into a short stand-alone test program (using GLUT, GLFW, or whatever), and post it for folks here to download, compile, run, and give you more detailed feedback on what you’re doing. At least post some code snippets of what you’re doing to get more tips.

Hello Python,

So, timing the snippet you provided with VSync disabled eats ~17% of 1/60th of a second.
Without glFinish I render at 300+ FPS, and the clock then goes up to max. This is with my advanced shader that has multiple passes, dynamic lighting, etc.
With a simple shader I get 700 FPS, or 12% of 1/60th of a second. Are these OK numbers? If they are, then I suppose I have some kind of driver/GPU issue. I only get lag sometimes, and when I don’t, I usually notice that the clock has increased somewhat.

[QUOTE=jaketehsnake;1279451]Hello Python,

So, timing the snippet you provided with VSync disabled eats ~17% of 1/60th of a second.
Without glFinish I render at 300+ FPS, and the clock then goes up to max. This is with my advanced shader that has multiple passes, dynamic lighting, etc.
With a simple shader I get 700 FPS, or 12% of 1/60th of a second. Are these OK numbers?[/QUOTE]

Sounds pretty good to me. That’s ~2.8ms when you’ve got 16.66ms to render a frame.

If this is consistent and your frame times are always < 16.66ms/frame, then you’re fast enough to ensure that you always make the swap period. Apply the old maxim of performance optimization: once you’re fast enough, “stop optimizing!”.

Now turn VSync back on, but keep the glFinish after SwapBuffers in-place (if you’re on a desktop GPU). Do you see consistent 16.66ms frame times? If so, you’re done.

If they are, then I suppose I have some kind of driver/GPU issue. I only get lag sometimes, and when I don’t, I usually notice that the clock has increased somewhat.

Hmm… You should capture the frame times for one of those frames when you see a lag (with VSync on). Is it exceeding 16.66ms?

(BTW, what GPU is this (make/model)? If it’s not a desktop GPU, more needs to be said here for you to understand what you’re seeing.)

Hello again Python,

Swap with glFinish is practically always at 14%, and rarely, but sometimes, goes up to 20%. I suppose that’s consistent enough. Well, it never goes anywhere near 100% at least, which would cause a dropped frame if VSync were enabled. So, I suppose I’m done. Should I keep the glFinish even now that I’m finished, though? I’ve seen different opinions on this, and that it’s just a debugging tool. I’m on a desktop NVIDIA GTX 660M, but I want to spread my application across multiple desktop environments.

Embarrassingly enough, I can’t seem to reproduce the lag with VSync anymore. I’ve been tinkering with my code, and maybe I fixed it, or maybe it will come back some day and haunt me. The only candidate I can think of is the input system. I used GLFW’s callbacks to process an input event as soon as it occurred. Now I’ve changed that so that the events queue up and I process them all in one function call in a controlled fashion. It could be that, before, an input event came in the middle of some critical CPU -> GPU work.

But before, what I would see was that SwapBuffers took 200% minus whatever else went on in the loop. And it wasn’t occasional; it either occurred for the entire run or not at all. So Mr VSync clearly sometimes had the notion that the GPU couldn’t keep up with 60Hz and decided to go with 30Hz instead. But given our experiments, Mr VSync is clearly dead wrong.

It sounds like you might be.

Should I keep the glFinish even now that I’m finished, though? I’ve seen different opinions on this, and that it’s just a debugging tool. I’m on a desktop NVIDIA GTX 660M, but I want to spread my application across multiple desktop environments.

Your call. Here are some reasons why you might want to leave the glFinish after SwapBuffers in. All of them derive from the fact that (on a desktop or Tegra mobile GPU), glFinish after SwapBuffers synchronizes the CPU’s draw thread with the video scan-out clock. That’s a good thing! Note: video scan-out clock is also called the vertical sync clock, aka VSync clock. This is the clock used for timing when SwapBuffers actually happens, which determines when the user actually gets to see that cool frame you just rendered.

At the beginning of the frame, applications often sample the state of the simulation world and the user input controls to determine what to render (what the new camera position/orientation is, where the entities are, what state the effects are in, etc.). If you leave the glFinish() after SwapBuffers in place, AND you tune your rendering so that you always make the VSync period, THEN you can be relatively sure that this “beginning of frame” processing will happen at very regular and consistent intervals: every 16.66ms, if you are running at a standard 60Hz scan-out rate with SwapInterval 1. Time it and see! The end-to-end latency of the system (from user input to frame displayed) is regular like clockwork. This has the effect of making what is rendered very smooth (not jumpy) and consistent in terms of frame-to-frame differences. This feels good to a user.

However, if you do not leave the glFinish() after SwapBuffers(), then your CPU draw thread is NOT synchronized with the GPU’s video scan-out clock, so your “beginning of frame” processing will happen at seemingly pseudo-random intervals. For example, you might hit beginning-of-frame 3ms after the last time, then 5ms after that for the next, then 22ms after that, 13ms after that, etc. In other words, your CPU draw thread is running out of sync with the rendering. This makes it difficult-to-impossible to generate a result that looks and feels smooth to the user (no stuttering, lags, stepping, popping, etc.).

Why would it be so erratic? Without glFinish() after SwapBuffers(), the GL driver will often queue the SwapBuffers request for later, and immediately return, letting you go ahead with beginning-of-frame processing and GL call submission for “the next” frame, before the previous frame has completed rendering much less been displayed to the user (i.e. well before the SwapBuffers has actually been performed!). It might even read a full-frame or more ahead of “reality” (what’s been displayed to the user thus far). The GL driver will block on seemingly random calls depending on driver/GPU-specific and driver-internal criteria you can’t control (e.g. command queue fills up, GPU pipeline is backlogged, etc.). Move the camera to a simpler scene, and the driver may be able to read much further ahead into subsequent frames. Move the camera to view more complex scenes, and the driver might only be able to read half a frame ahead. So you just end up blocking in random places. This results in your frame inputs being sampled at random intervals in time, giving your system an erratic end-to-end latency.

Another reason to leave the Finish after SwapBuffers in is that it’s very useful to have per-frame CPU timing statistics in place to diagnose performance problems (e.g. frame overruns, where it took the CPU+GPU more than 16.66ms to render a frame). With SwapBuffers+Finish, it’s easy to see when this happens and, from there, to track down the offending bottlenecks in that specific frame. If your “beginning of frame” is completely uncorrelated with the scan-out (VSync) clock, then your CPU frame timing is not nearly as useful. Yes, you can use GPU timers (timer queries), but there are a number of problems with that. GPU time is only half the story; what matters is the aggregate CPU+GPU time, and synchronizing the CPU with the GPU at end-of-frame is an easy way to get that.

Embarrassingly enough, I can’t seem to reproduce the lag with VSync anymore. I’ve been tinkering with my code, and maybe I fixed it, or maybe it will come back some day and haunt me. The only candidate I can think of is the input system. I used GLFW’s callbacks to process an input event as soon as it occurred. Now I’ve changed that so that the events queue up and I process them all in one function call in a controlled fashion. It could be that, before, an input event came in the middle of some critical CPU -> GPU work.

Try removing the Finish after SwapBuffers briefly for testing. Does your lag gremlin come back? :)

But before, what I would see was that SwapBuffers took 200% minus whatever else went on in the loop. And it wasn’t occasional; it either occurred for the entire run or not at all. So Mr VSync clearly sometimes had the notion that the GPU couldn’t keep up with 60Hz and decided to go with 30Hz instead. But given our experiments, Mr VSync is clearly dead wrong.

Possibly. But I wouldn’t be so quick to pin the blame on Mr. VSync. My bet is Mr. GL driver and the internal “read-ahead” buffering I described above. Without the Finish after Swap, I’d completely expect the behavior you’re seeing.

CAVEAT: Again, let me caveat that the glFinish after SwapBuffers we’re discussing, to synchronize the CPU draw thread with the GPU output, is only a reasonable approach on desktop or Tegra GPUs (sort-last architecture). A Finish after SwapBuffers is a really bad idea on other mobile GPUs (sort-middle architecture, sometimes called tile-based GPUs), which have a completely different design with much longer GPU draw latencies. On those GPUs, a Finish after Swap can easily double your frame times!