Huge render time for simple draw calls

_Silver · August 5, 2015, 9:37am

Hello everyone,

Currently working on a robot simulation project, I come to you after many days of work to get into OpenGL, build my environment and optimize render time. My aim is to get several objects (spheres, cubes, tubes, perhaps some simple VRML models in special cases) displayed and updated quickly. Most of them would be static and I would never have more than around 50 to 100 of them, so it should be simple, right?

Here is how I proceeded: for reasons unrelated to the OpenGL integration in my project, my classes are structured as shown in the following figure. I simplified it and isolated the visual / OpenGL-related classes for this help request.
[ATTACH=CONFIG]1206[/ATTACH]

For now, my objects are generated from a config file with the following format :

OBSTACLE			=	>SPHERE
POSITION			=	1,1,0
ROTATION			=	90,90,0
TRANSPARENCY			=	1.0
COLOR				=	0,60,100
DRAWTYPE			= 	>STREAM
RADIUS				=	0.1

OBSTACLE			=	>SPHERE
POSITION			=	0.0,0,1
ROTATION			=	10,20,40
TRANSPARENCY			=	1.0
COLOR				=	255,0,0
DRAWTYPE			= 	>STREAM
RADIUS				=	0.1

OBSTACLE			=	>PATH
POSITION			=	0,0,1
ROTATION			=	20,0,0
TRANSPARENCY			=	0.6
COLOR				=	0,120,220
DRAWTYPE			= 	>STATIC
RADIUS				=	0.05

From this, I instantiate my classes, generate the vertices, colors, normal vectors, etc. An example (screenshot while I set a movement trajectory for both spheres) looks like this :
[ATTACH=CONFIG]1207[/ATTACH]

First, I displayed around 10 shapes like this following the simplest tutorials, with simple diffuse + ambient lighting. Sending the uniforms then took approximately 100ms per frame (CPU clock)… But I feel it isn’t a surprise to start like that since I’m a beginner.

Now, I implemented VBOs, VAOs and UBOs to get better results and it did work. I’m not satisfied yet though since it takes between 20 and 50ms to render a single frame (CPU clock) with the previous config (3 objects : 1 tube & 2 spheres).

I feel I am doing at least a few things wrong, which leads to this unsatisfying computing time.

Here is some initialization code and the main loop I use to render the scene :


	/************************************/
	/******** INITIALIZING STUFF ********/
	/************************************/
	bool loop = true;

	/* Sunlight parameters */
	sunlight.fAmbientIntensity = glm::vec4(0.25);
	sunlight.vColor = glm::vec4(1.0f, 1.0f, 1.0f, 0.0f);
	glm::vec3 sunlightPos(10.0f,11.0f,12.0f);
	glm::vec3 sunlightTmp = -glm::normalize(sunlightPos);
	sunlight.vDirection = glm::vec4(sunlightTmp.x, sunlightTmp.y, sunlightTmp.z, 0.0f);

	glm::mat4 projection;
   	glm::mat4 modelview;
	glm::mat4 normalMatrix;
	glm::mat4 camera;
	
   	projection = glm::perspective(70.0, (double) getWindowWidth()/getWindowHeight(), 1.0, 100.0);
	modelview = glm::mat4(1.0);
	camera = glm::lookAt(glm::vec3(2, 2, 2), glm::vec3(0, 0, 0), glm::vec3(0, 1, 0));

	initUbos();	

	glBindBuffer(GL_UNIFORM_BUFFER, lightUboId);
		glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(Light), &sunlight);
	glBindBuffer(GL_UNIFORM_BUFFER, 0);

	glBindBuffer(GL_UNIFORM_BUFFER, staticMatricesUboId);
		glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(glm::mat4), &projection);
	glBindBuffer(GL_UNIFORM_BUFFER, 0);

		/****************************/
		/******** MAIN LOOP *********/
		/****************************/
		while(loop){

		/* Cleaning window */
		glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
		

		for(int i=0;i<env.getNObstacles();i++){

			modelview = glm::mat4(1.0);

			glUseProgram(env.getObstacle(i).getShape()->getGraph().getShader().getProgramId());

				/* Translate element*/
				modelview = glm::translate(modelview,glm::vec3(env.getObstacle(i).getShape()->getPosition(0),env.getObstacle(i).getShape()->getPosition(1),env.getObstacle(i).getShape()->getPosition(2)));		
				/* Rotate element */
				modelview = glm::rotate(modelview,(float) env.getObstacle(i).getShape()->getRotation(2), glm::vec3(0.0,0.0,1.0)); // Z axis
				modelview = glm::rotate(modelview,(float) env.getObstacle(i).getShape()->getRotation(1), glm::vec3(0.0,1.0,0.0)); // Y axis
				modelview = glm::rotate(modelview,(float) env.getObstacle(i).getShape()->getRotation(0), glm::vec3(1.0,0.0,0.0)); // X axis
				
				/* Normal matrix - for lighting purposes */
				normalMatrix = glm::transpose(glm::inverse(modelview));

				modelview = camera*modelview;

				glBindVertexArray(env.getObstacle(i).getShape()->getGraph().getVaoId());

						glBindBuffer(GL_UNIFORM_BUFFER, streamingMatricesUboId);
							glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(glm::mat4), &modelview);
							glBufferSubData(GL_UNIFORM_BUFFER, sizeof(glm::mat4), sizeof(glm::mat4), &normalMatrix);
						glBindBuffer(GL_UNIFORM_BUFFER, 0);

					glDrawArrays(GL_TRIANGLES, 0, env.getObstacle(i).getShape()->getGraph().getN());

				glBindVertexArray(0);

			glUseProgram(0);
		
		}

        // Refreshing window	
        SDL_GL_SwapWindow(window); 

	}

Just in case, here is my vertex shader :

#version 150

in vec3 in_Vertex;
in vec4 in_Color;
in vec3 in_Normal;

layout(std140) uniform StaticMatrices
{ 
   mat4 projection;
};

layout(std140) uniform StreamingMatrices
{ 
   mat4 modelview;
   mat4 normalMatrix;
};

out vec4 color;
smooth out vec3 vNormal;

void main() {

    gl_Position = projection * modelview * vec4(in_Vertex, 1.0);
    color = in_Color;
	vec4 vTemp = normalMatrix * vec4(in_Normal,0.0);
	vNormal = vTemp.xyz;
	
}

And my fragment shader :

#version 150

in vec4 color;
smooth in vec3 vNormal;

out vec4 out_Color;

layout(std140) uniform Light
{ 
   vec3 vColor; 
   vec3 vDirection; 
   float fAmbientIntensity; 
};

void main() {

    float fDiffuseIntensity = max(0.0, dot(normalize(vNormal), -vDirection)); 
	out_Color = color*vec4(vColor*(fAmbientIntensity+fDiffuseIntensity), 1.0);

}

Here is my computer’s configuration in case I am being too optimistic about my render resources :
[ul]
[li]CPU : Intel Core 2 Duo CPU E6750 2.66GHz[/li]
[li]RAM : 4 GB[/li]
[li]GPU : NVIDIA GeForce GT 630[/li]
[li]OS : Windows 7 SP1 32-bit[/li]
[li]IDE : Visual Studio Express 2012[/li]
[/ul]

What do you think I am doing wrong that wastes so many resources?

Many thanks for your time. I hope I have been concise and provided all the necessary data and if not, I shall answer quickly to complete my post.

Silver

GClements · August 5, 2015, 12:10pm

How many vertices and triangles in the scene?

mhagain · August 5, 2015, 4:15pm

Can you post your initUbos function, and can you definitely confirm that this function is called every frame, please?

_Silver · August 6, 2015, 1:37am

Hello and thank you for your answers.

1- Every cube has 36 vertices and 12 triangles. In the case where I display two spheres and a tube: my spheres have 61440 vertices and 61440/3=20480 triangles (no vertices shared yet). My tube has 237006 vertices (79002 triangles). It currently takes around 10 to 15ms to display a single sphere or tube. I tested setting my recursive sphere generator’s loop count to 0 to generate a 20-triangle sphere (icosahedron): it also takes 10 to 15ms to display (CPU clock).

2- My bad, initUbos() is not in the loop: I added some initialization code before the loop in the posted code. I will edit that immediately.

initUbos() code : (shader programs are bound to UBO after their compilation and linking)

void SSS_Graphics_Scene::initUbos(){

	lightUboBindingPoint = 1;
	staticMatricesUboBindingPoint = 2;
	streamingMatricesUboBindingPoint = 3;

	if(glIsBuffer(lightUboId) == GL_TRUE)
        glDeleteBuffers(1, &lightUboId);

	if(glIsBuffer(staticMatricesUboId) == GL_TRUE)
        glDeleteBuffers(1, &staticMatricesUboId);

	if(glIsBuffer(streamingMatricesUboId) == GL_TRUE)
        glDeleteBuffers(1, &streamingMatricesUboId);

	glGenBuffers(1, &lightUboId);
	glGenBuffers(1, &staticMatricesUboId);
	glGenBuffers(1, &streamingMatricesUboId);

	glBindBuffer(GL_UNIFORM_BUFFER, lightUboId);
		glBufferData(GL_UNIFORM_BUFFER, sizeof(Light), 0, GL_STATIC_DRAW);
	glBindBuffer(GL_UNIFORM_BUFFER, 0);

	glBindBuffer(GL_UNIFORM_BUFFER, staticMatricesUboId);
		glBufferData(GL_UNIFORM_BUFFER, sizeof(glm::mat4), 0, GL_STATIC_DRAW);
	glBindBuffer(GL_UNIFORM_BUFFER, 0);

	glBindBuffer(GL_UNIFORM_BUFFER, streamingMatricesUboId);
		glBufferData(GL_UNIFORM_BUFFER, 2*sizeof(glm::mat4), 0, GL_STREAM_DRAW);
	glBindBuffer(GL_UNIFORM_BUFFER, 0);

	glBindBufferBase(GL_UNIFORM_BUFFER, lightUboBindingPoint, lightUboId); 
	glBindBufferBase(GL_UNIFORM_BUFFER, staticMatricesUboBindingPoint, staticMatricesUboId); 
	glBindBufferBase(GL_UNIFORM_BUFFER, streamingMatricesUboBindingPoint, streamingMatricesUboId); 

}
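
The sphere figures in point 1 are consistent with an icosahedron subdivided five times, assuming each pass of the generator splits every triangle into four (the generator itself is not shown, so this is an assumption). A minimal sketch of the arithmetic, with `IcosphereCounts` and `icosphereCounts` as hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// Counts for an icosahedron subdivided `levels` times, assuming each
// pass splits every triangle into four smaller ones.
struct IcosphereCounts {
    std::uint64_t triangles;        // 20 * 4^levels
    std::uint64_t unsharedVertices; // 3 per triangle, as drawn with glDrawArrays
    std::uint64_t sharedVertices;   // 10 * 4^levels + 2, as with an index buffer
};

inline IcosphereCounts icosphereCounts(unsigned levels) {
    assert(levels < 16 && "keeps 4^levels well inside 64 bits");
    const std::uint64_t pow4 = std::uint64_t(1) << (2 * levels); // 4^levels
    IcosphereCounts c;
    c.triangles = 20 * pow4;
    c.unsharedVertices = 3 * c.triangles;
    c.sharedVertices = 10 * pow4 + 2; // from Euler's formula V - E + F = 2
    return c;
}
```

With `levels = 5` this gives the 20480 triangles and 61440 unshared vertices quoted above, while sharing vertices would bring the same sphere down to 10242 distinct positions.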

_Silver · August 11, 2015, 5:14am

Up

I haven’t found any more optimization to make yet

_Silver · August 13, 2015, 7:29am

Hello everyone,

I am happy to report that after cleaning my code and optimizing VBOs/VAOs/UBOs/etc., my frames display in less than 4ms.
[b]EDIT : nope, it’s still 40ms… Conversion mistake.

Switching from 2000 to 500000 vertices makes almost no difference. Any ideas ? :s

Can I specify precisely when I want buffers to be refreshed, or do I have to rely on glBufferSubData calls and the STATIC/DYNAMIC/STREAM options? It may help a lot.[/b]

Thank you for your time

Dark_Photon · August 13, 2015, 5:21pm

[QUOTE=Silver;1278889][b]EDIT : nope, it’s still 40ms… Conversion mistake.

Switching from 2000 to 500000 vertices makes almost no difference. Any ideas ? :s[/b][/QUOTE]

I can think of quite a few things, but the first thing is to identify your primary bottleneck.

Even before that though, let me ask the more pragmatic question: do you “really” need 20480 triangles in a sphere and 79002 triangles in a tube? In other words, is this a practical exercise or a theoretical exercise?

As to identifying your primary bottleneck…

First off, if you can vary the vertex count and see no difference in performance, you’re obviously not vertex limited. If you reduce the size of the viewport a bit (while still fitting the same field-of-view into it containing the same batches) and you see a reduction in frame time, you may be fill limited.

Given that it sounds like you may be a bit new to this, it’s almost certain that you’re instead CPU limited. It’s pretty common when you’re first starting out, so don’t feel bad.

One common way that you end up CPU limited is you have too many state changes between batches. Best case: you set up your GL state and then have zero, or near zero, state changes between draw calls (setting up and enabling vertex attributes and the index list being the exception). Let’s see what you’re doing in your inner draw loop. If there’s much going on between your draw calls, that’s a candidate for a CPU bottleneck. Try getting rid of some in such a way that your vertex and fragment load is the same, and recheck performance.

Another common way is that your batches are too small. Depending on how you provide batches to the GPU, there’s a point beyond which you’re completely bound by the number of batch draw calls; adding triangles per batch is basically “free” (covered by the pipelining of the CPU/GPU system). So you may be there.

There’s all kinds of other things we can suggest to increase performance (reduce draw time) once you get past being CPU bound (if in fact you are CPU bound). But we’ll get there…

Post a short GLUT test program that illustrates what you’re doing exactly, and I’m sure you’ll have plenty of folks chiming in with how you can improve your code.

_Silver · August 14, 2015, 5:00am

Hello Dark Photon and many thanks for your answer, I have to admit that I’m a bit lost and confused at that point of my work because all I seem to do is sending data to shaders.

It is a practical exercise. My code that generates the triangles takes resolution parameters as inputs, so I can easily change it. Those values are extreme and I chose them to be sure I was not vertex limited.
The number of triangles needed for a tube depends on the resolution of a section’s perimeter and the resolution of the tube trajectory. I just want to have a pretty result and I can indeed reduce those. I know I can also optimize the number of vertices used (with shared vertices I don’t necessarily need a 3:1 vertex/triangle ratio), but I think that’s far from being my bottleneck.

I have to investigate and learn a little bit to be able to rule out this “fill limit” possibility. I’ll update this quickly.

[QUOTE=Dark Photon;1278897]One common way that you end up CPU limited is you have too many state changes between batches. Best case: you set up your GL state and then have zero, or near zero, state changes between draw calls (setting up and enabling vertex attributes and the index list being the exception). Let’s see what you’re doing in your inner draw loop. If there’s much going on between your draw calls, that’s a candidate for a CPU bottleneck. Try getting rid of some in such a way that your vertex and fragment load is the same, and recheck performance.

Another common way is that your batches are too small. Depending on how you provide batches to the GPU, there’s a point beyond which you’re completely bound by the number of batch draw calls; adding triangles per batch is basically “free” (covered by the pipelining of the CPU/GPU system). So you may be there.[/QUOTE]
Here my lack of understanding clearly limits me. Right now, here is what I do :
Before loop

[ul]
[li]Fill VBO with vertices and colors[/li]
[li]Fill VAO with calls[/li]
[/ul]
During loop (max. 3-20 times depending on amount of objects)

[ul]
[li]Generate transformation matrices[/li]
[li]Fill UBO with matrices[/li]
[li]Draw the triangles[/li]
[/ul]
I have the feeling that even though most of my objects are static, unnecessary work is being done with buffers. This impression was reinforced by the fact that sending a uniform matrix (without UBOs) was taking around 5-10ms on its own… and I don’t see much difference in execution time when I set all my objects’ buffers to streaming instead of static.

When I measure execution time object by object and frame by frame with the CPU clock, almost 100% of the frame display time is spent between the clocks I set around glDrawArrays (see the following code). It may make no sense to measure this though: between CPU execution time, GPU execution time and CPU/GPU data transfers, I don’t really know what my timers actually measure…

cubo = std::clock();

glBindBuffer(GL_UNIFORM_BUFFER, streamingMatricesUboId);
							
	glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(glm::mat4), &modelview);
	glBufferSubData(GL_UNIFORM_BUFFER, sizeof(glm::mat4), sizeof(glm::mat4), &normalMatrix);

glBindBuffer(GL_UNIFORM_BUFFER, 0);

ceubo = std::clock();

cdraw = std::clock();

glDrawArrays(GL_TRIANGLES, 0, env.getObstacle(i).getShape()->getGraph().getN());

cedraw = std::clock();

d1 = 1000 * (ceubo - cubo) / (double) CLOCKS_PER_SEC;
d2 = 1000 * (cedraw - cdraw) / (double) CLOCKS_PER_SEC;

std::cout << "UBO : " << d1 << "ms" << std::endl;
std::cout << "Draw : " << d2 << "ms" << std::endl << std::endl;

Anyway I definitely got much to learn ! Thank you for your answer.

Your dedicated Silver

EDIT : OK. After many measurements : each object takes around 10-15ms to display, which leads to the final 50ms frame display.
99% of execution time lies in :

for(int i=0;i<env.getNObstacles();i++){
 
			modelview = glm::mat4(1.0);
 
			glUseProgram(env.getObstacle(i).getShape()->getGraph().getShader().getProgramId());
 
				/* Translate element*/
				modelview = glm::translate(modelview,glm::vec3(env.getObstacle(i).getShape()->getPosition(0),env.getObstacle(i).getShape()->getPosition(1),env.getObstacle(i).getShape()->getPosition(2)));		
				/* Rotate element */
				modelview = glm::rotate(modelview,(float) env.getObstacle(i).getShape()->getRotation(2), glm::vec3(0.0,0.0,1.0)); // Z axis
				modelview = glm::rotate(modelview,(float) env.getObstacle(i).getShape()->getRotation(1), glm::vec3(0.0,1.0,0.0)); // Y axis
				modelview = glm::rotate(modelview,(float) env.getObstacle(i).getShape()->getRotation(0), glm::vec3(1.0,0.0,0.0)); // X axis
 
				/* Normal matrix - for lighting purposes */
				normalMatrix = glm::transpose(glm::inverse(modelview));
 
				modelview = camera*modelview;
 
				glBindVertexArray(env.getObstacle(i).getShape()->getGraph().getVaoId());
 
						glBindBuffer(GL_UNIFORM_BUFFER, streamingMatricesUboId);
							glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(glm::mat4), &modelview);
							glBufferSubData(GL_UNIFORM_BUFFER, sizeof(glm::mat4), sizeof(glm::mat4), &normalMatrix);
						glBindBuffer(GL_UNIFORM_BUFFER, 0);
 
					glDrawArrays(GL_TRIANGLES, 0, env.getObstacle(i).getShape()->getGraph().getN());
 
				glBindVertexArray(0);
 
			glUseProgram(0);
 
		}

I put many clocks to have detailed data and here is what it gives :

[ul]
[li]Modelview init to identity + glUseProgram : 5 to 8 ms[/li]
[li]modelview & normalMatrix operations : 0ms[/li]
[li]VAO block & draw : 3 to 6 ms[/li]
[li]glUseProgram(0) : 0ms[/li]
[/ul]
Which makes around 8 to 14ms execution time.

Yet I have results like this (example) :

Modelview init to identity + glUseProgram : 5ms
modelview & normalMatrix operations : 0ms
VAO block & draw : 5ms
glUseProgram(0) : 0ms
All together (measured separately) : 22 ms
around 12ms of execution time missing somewhere for every mesh displayed … I don’t know where.
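
One possible reason for the 0ms readings and for the time that seems to go missing is the measurement itself: `std::clock` can tick quite coarsely on some platforms, and OpenGL commands are queued, so a CPU timer around `glDrawArrays` mostly measures the cost of submitting the call. A minimal sketch of a finer-grained timer built on `std::chrono::steady_clock` (C++11); `Stopwatch` and `timeMs` are hypothetical names, not part of the code above:

```cpp
#include <cassert>
#include <chrono>

// Wall-clock stopwatch based on a monotonic clock.
class Stopwatch {
public:
    Stopwatch() : start_(std::chrono::steady_clock::now()) {}

    void reset() { start_ = std::chrono::steady_clock::now(); }

    // Milliseconds elapsed since construction or the last reset().
    double elapsedMs() const {
        const std::chrono::duration<double, std::milli> d =
            std::chrono::steady_clock::now() - start_;
        return d.count();
    }

private:
    std::chrono::steady_clock::time_point start_;
};

// Runs `work` once and returns how long it took in milliseconds.
// Assumption: when timing GL calls, `work` itself ends with glFinish(),
// otherwise only the submission cost is measured.
template <class F>
double timeMs(F&& work) {
    Stopwatch sw;
    work();
    const double ms = sw.elapsedMs();
    assert(ms >= 0.0); // steady_clock never goes backwards
    return ms;
}
```

Ending the timed callable with `glFinish()` stalls the pipeline, so this kind of measurement is only suitable for diagnostics and not for the shipping render loop.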

mhagain · August 14, 2015, 7:11am

env.getObstacle(i).getShape()->getGraph().getShader().getProgramId()
env.getObstacle(i).getShape()->getPosition(0)
env.getObstacle(i).getShape()->getPosition(1)
env.getObstacle(i).getShape()->getPosition(2)
env.getObstacle(i).getShape()->getRotation(0)
env.getObstacle(i).getShape()->getRotation(1)
env.getObstacle(i).getShape()->getRotation(2)
env.getObstacle(i).getShape()->getGraph().getVaoId()
env.getObstacle(i).getShape()->getGraph().getN()

What do these do? Do getObstacle/getShape/getGraph/etc create or destroy any objects at runtime? If you cache and reuse the result of getObstacle(i) rather than getting it every time, does performance improve? Likewise for getShape and all the rest?

_bob · August 14, 2015, 7:22am

OpenGL Instancing could be nice in this case…

IonutCava · August 14, 2015, 7:33am

As mhagain said, try to cache the current shape instead of requesting it every single time. I assume your classes have all sorts of abstractions in place, so you are performing a hell of a lot of indirections.
Also, you can reduce the number of calls in your loop a lot. For example, why set modelview matrix to identity if you assign it a new value later? Just use the version of “glm::translate” that only takes 3 scalar values.
glBindBuffer(GL_UNIFORM_BUFFER, 0), glBindVertexArray(0), glUseProgram(0) are all redundant API calls. Just call them after the loop. You don’t need to reset them every time just to have them set to something else in the next iteration.

All of these things hurt performance A LOT.

And your usage of a UBO is far from optimal. Try to upload ALL of the matrices in one go (in a pre-pass), then index them in the shader based on whatever per-draw index you wanna use.
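
To illustrate the point about redundant API calls, here is a minimal sketch of a small cache that skips a bind when the requested object is already bound. `BindCache` is a hypothetical helper, not something from the code in this thread, and it assumes nothing else changes the binding behind its back:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Remembers the last id handed to a bind function and skips repeats.
class BindCache {
public:
    explicit BindCache(std::function<void(unsigned)> bind)
        : bind_(std::move(bind)), current_(0), valid_(false) {
        assert(static_cast<bool>(bind_)); // an empty callable would be a bug
    }

    // Returns true when the underlying bind call was actually issued.
    bool bind(unsigned id) {
        if (valid_ && id == current_) {
            return false; // already bound, nothing to do
        }
        bind_(id);
        current_ = id;
        valid_ = true;
        return true;
    }

private:
    std::function<void(unsigned)> bind_;
    unsigned current_;
    bool valid_;
};
```

In the draw loop this could wrap the program switch, for example `BindCache program([](unsigned id) { glUseProgram(id); });` followed by `program.bind(programId)` per object, so that objects sharing a shader trigger a single `glUseProgram` call.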

_Silver · August 14, 2015, 7:51am

[QUOTE=mhagain;1278904]

env.getObstacle(i).getShape()->getGraph().getShader().getProgramId()
env.getObstacle(i).getShape()->getPosition(0)
env.getObstacle(i).getShape()->getPosition(1)
env.getObstacle(i).getShape()->getPosition(2)
env.getObstacle(i).getShape()->getRotation(0)
env.getObstacle(i).getShape()->getRotation(1)
env.getObstacle(i).getShape()->getRotation(2)
env.getObstacle(i).getShape()->getGraph().getVaoId()
env.getObstacle(i).getShape()->getGraph().getN()

What do these do? Do getObstacle/getShape/getGraph/etc create or destroy any objects at runtime? If you cache and reuse the result of getObstacle(i) rather than getting it every time, does performance improve? Likewise for getShape and all the rest?[/QUOTE]
No, they don’t create any objects at all, they are only getters. It may be better to cache getObstacle(i)->getShape() at every frame (“may”, because I haven’t studied allocation & assignment versus getters): I did it, but as one might guess, it brought no significant improvement in execution time.

Good thought though, thanks. I could do that in a cleaner way, like sending the current shape to a function that would display a single mesh.

_Silver · August 14, 2015, 8:04am

[QUOTE=IonutCava;1278906]As mhagain said, try to cache the current shape instead of requesting it every single time. I assume your classes have all sorts of abstractions in place, so you are performing a hell of a lot of indirections.
Also, you can reduce the number of calls in your loop a lot. For example, why set modelview matrix to identity if you assign it a new value later? Just use the version of “glm::translate” that only takes 3 scalar values.
glBindBuffer(GL_UNIFORM_BUFFER, 0), glBindVertexArray(0), glUseProgram(0) are all redundant API calls. Just call them after the loop. You don’t need to reset them every time just to have them set to something else in the next iteration.

All of these things hurt performance A LOT.

And your usage of a UBO is far from optimal. Try to upload ALL of the matrices in one go (in a pre-pass), then index them in the shader based on whatever per-draw index you wanna use.[/QUOTE]

Thanks for this feedback. I didn’t know about the glm::translate definition with the 3 values and the binding calls redundancy !

My current UBO usage is based on what I understood: I separated light/projection from modelview/normalMatrix because light/projection don’t change between objects. I will look into a one-go upload and indexing in the shader, thanks.

_Silver · August 14, 2015, 8:11am

Wow, thank you, I’ve come across it several times without really grasping what it was, but it’s a hell of an optimization I could make.

IonutCava · August 14, 2015, 8:41am

The data you are uploading into the UBO is fine, it’s just that you can upload more of it at a time and get a huge bump in performance. This isn’t a trivial task if you have never attempted it before, so just leave it for later. A simple approach to keep in mind might just be to use glBindBufferRange so you wouldn’t need to change your shaders. Just tighten the loop as much as you can and report back.
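
A minimal sketch of the CPU side of the glBindBufferRange approach: all per-object blocks are packed into one byte array, each starting at an aligned offset, so the whole array can be uploaded once per frame. `ObjectMatrices`, `alignUp` and `packBlocks` are hypothetical names; the block layout is assumed to mirror the `StreamingMatrices` block from the vertex shader, and the alignment value is assumed to come from `glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, ...)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Per-object uniform data: two column-major 4x4 float matrices (128 bytes),
// assumed to match the StreamingMatrices block in the shader.
struct ObjectMatrices {
    float modelview[16];
    float normalMatrix[16];
};

// Rounds `size` up to the next multiple of `alignment`.
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}

// Packs every object's block into one byte array; block i starts at
// i * alignUp(sizeof(ObjectMatrices), offsetAlignment).
inline std::vector<unsigned char> packBlocks(
        const std::vector<ObjectMatrices>& objects,
        std::size_t offsetAlignment) {
    const std::size_t stride = alignUp(sizeof(ObjectMatrices), offsetAlignment);
    std::vector<unsigned char> bytes(stride * objects.size(), 0);
    for (std::size_t i = 0; i < objects.size(); ++i) {
        std::memcpy(bytes.data() + i * stride, &objects[i], sizeof(ObjectMatrices));
    }
    return bytes;
}
```

The packed array would then go to the UBO with a single `glBufferSubData` call per frame, and each draw would select its slice with `glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo, i * stride, sizeof(ObjectMatrices))`, leaving the shaders unchanged.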

_Silver · September 2, 2015, 5:30am

UPDATE : I have changed uniform management to a single UBO, updated for every object : gained a few ms.

I think I have found something terribly wrong I was doing but you will perhaps be able to confirm. This is how I was updating every object’s VBO :

glBufferSubData(GL_ARRAY_BUFFER, 0, verticesSizeBytes, vertices);
glBufferSubData(GL_ARRAY_BUFFER, verticesSizeBytes, fragSizeBytes, frag);
glBufferSubData(GL_ARRAY_BUFFER, verticesSizeBytes + fragSizeBytes, vNormalSizeBytes, vNormal);

Looking at OpenGL instancing, I wonder whether sending positions / colors / normals separately, as opposed to storing each vertex’s position/color/normal “together” in memory, makes that huge time difference.

Alfonse_Reinheart · September 2, 2015, 6:42am

There are two issues with the code you provided. The first is that you’re doing 3 uploads when you really only need one. The second is that you’re not interleaving your vertex attributes. The latter can make things perform faster.

Interleaving is a separate thing from instancing, even with instanced vertex arrays.
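
A minimal sketch of what interleaving means on the CPU side: the three separate arrays are merged into one array of per-vertex structs, which can then be uploaded in a single call. The `Vertex` layout (vec3 position, vec4 color, vec3 normal) is an assumption based on the shader inputs posted earlier, and `interleave` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One vertex with all of its attributes stored contiguously.
struct Vertex {
    float position[3];
    float color[4];
    float normal[3];
};

// Merges separate attribute arrays (3, 4 and 3 floats per vertex)
// into a single interleaved array.
inline std::vector<Vertex> interleave(const std::vector<float>& positions,
                                      const std::vector<float>& colors,
                                      const std::vector<float>& normals) {
    const std::size_t count = positions.size() / 3;
    assert(positions.size() == count * 3);
    assert(colors.size() == count * 4);
    assert(normals.size() == count * 3);

    std::vector<Vertex> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t k = 0; k < 3; ++k) out[i].position[k] = positions[3 * i + k];
        for (std::size_t k = 0; k < 4; ++k) out[i].color[k] = colors[4 * i + k];
        for (std::size_t k = 0; k < 3; ++k) out[i].normal[k] = normals[3 * i + k];
    }
    return out;
}
```

The result can be uploaded with one `glBufferData(GL_ARRAY_BUFFER, out.size() * sizeof(Vertex), out.data(), GL_STATIC_DRAW)`, and each `glVertexAttribPointer` call then uses `sizeof(Vertex)` as the stride and `offsetof(Vertex, color)` or `offsetof(Vertex, normal)` as the attribute offset.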

_Silver · September 2, 2015, 8:52am

Thank you, I have now implemented both interleaving and instancing and have a 30 to 50ms frame display time depending on the number of objects and their resolution, with a complex multibody dynamics simulation running simultaneously and one of the objects being teleoperated (not taken into account in the time measurements, but undeniably using lots of resources). The computer I currently work on being quite old and slow, I believe I will stop there or maybe make a final optimization of the vertex generation (shared cube vertices, etc.).
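
For the shared-vertex optimization mentioned above, a minimal sketch of turning an unshared triangle list into a vertex array plus an index buffer. `Position`, `IndexedMesh` and `buildIndexed` are hypothetical names, and only bit-identical positions are merged; vertices that share a position but need different normals or colors (a flat-shaded cube, for instance) would have to stay separate.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

typedef std::array<float, 3> Position;

struct IndexedMesh {
    std::vector<Position> vertices;     // each distinct position stored once
    std::vector<std::uint32_t> indices; // three entries per triangle
};

// Converts a triangle list with repeated positions into an indexed mesh.
inline IndexedMesh buildIndexed(const std::vector<Position>& triangleSoup) {
    assert(triangleSoup.size() % 3 == 0);
    IndexedMesh mesh;
    std::map<Position, std::uint32_t> seen; // position -> index in mesh.vertices
    for (const Position& p : triangleSoup) {
        auto it = seen.find(p);
        if (it == seen.end()) {
            const std::uint32_t index =
                static_cast<std::uint32_t>(mesh.vertices.size());
            it = seen.insert(std::make_pair(p, index)).first;
            mesh.vertices.push_back(p);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

The indices would go into a `GL_ELEMENT_ARRAY_BUFFER`, and the draw call would become `glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0)` in place of `glDrawArrays`.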

Thank you everyone for your good advice and support !