
Thread: VAO framerate issues

  1. #11 Junior Member Newbie (Join Date: Jul 2014 | Posts: 15)

    Try this:

    glBindBuffer(GL_ARRAY_BUFFER, vertexbuffer);
    glBufferData(GL_ARRAY_BUFFER, sizeof(g_vertex_buffer_data), NULL, GL_STATIC_DRAW);

    and then:


    Code :
    //upload data
     
    glGenVertexArrays(1, &VertexArrayID);
    glBindVertexArray(VertexArrayID);
     
    glEnableVertexAttribArray(0);
    // GL_ARRAY_BUFFER is still bound to vertexbuffer from the glBindBuffer call above
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
     
    int attempts = 0;
    do{
        // map the orphaned buffer store and copy the vertex data straight into it
        float * ptr = (float*) glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
        for(size_t i = 0; i < sizeof(g_vertex_buffer_data)/sizeof(g_vertex_buffer_data[0]); i++)
            ptr[i] = g_vertex_buffer_data[i];
     
        if(++attempts > 5){
            printf("given up after %d attempts", attempts);
            exit(EXIT_FAILURE);
        }
    }
    // glUnmapBuffer returns GL_FALSE if the data store was corrupted while mapped; retry in that case
    while( glUnmapBuffer(GL_ARRAY_BUFFER) == GL_FALSE );
     
     
    //...
     
    //render loop
    glBindVertexArray(VertexArrayID);
    glBindBuffer(GL_ARRAY_BUFFER, vertexbuffer);
    glDrawArrays(GL_TRIANGLES, 0, vaoSize/3);



    to upload the data to the GPU. This should force the driver to put the VBO in GPU memory (if there is any; remember that laptops may have an integrated graphics card with no dedicated memory, which is an issue you can't fix in any way other than changing the computer).
    Last edited by DarioCiao!; 07-05-2014 at 06:13 AM.

  2. #12 Junior Member Newbie (Join Date: Jun 2014 | Posts: 5)
    Back again after some days... thanks for your clever answers.
    Still stuck on the issue.

    What I've done so far:
    - rewrote the code to make it as simple as possible; it is attached at the end of this post for you to review if you want.
    - used a profiler (CodeXL, which works on Linux), but it didn't help me much.

    Quote:
    This is extremely unlikely to be your bottleneck because you're using a VBO, which should therefore be storing your vertex data on the GPU itself. In other words: with a VBO there should be no transfer between system memory and GPU memory happening at all.
    I'm starting to think the same as you. I'm investigating other possibilities but don't see what else it could be, and I'm not 100% sure the data is indeed stored on the GPU.

    Quote:
    If you're drawing terrain, there are lots of algorithms out there for not drawing the full 2001*2001 vertices, which would probably provide the best speed-up.
    I totally agree and the solutions you gave are meaningful, but I really want to know why I can only display such a low number of polygons on a gaming card (1.5 GB of dedicated memory).

    To sum up, here is the performance I get so far:
    - VAO only (non-indexed): 24 million vertices | 8 million triangles -> 20 FPS (terrain: 2001*2001 vertex grid | 1 quad made of 2 triangles using 6 vertices)
    - VBO (indexed): 96 million indices + 16 million vertices | 32 million triangles -> 15 FPS (terrain: 4001*4001 vertex grid | 1 quad made of 2 triangles sharing indexed vertices; see the index-generation sketch below)
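
    For reference, the indexed layout can be made concrete with a small sketch (my own illustration, not code from the attached project; buildGridIndices, W and H are hypothetical names): each quad of a W*H vertex grid contributes two triangles, i.e. six indices, all sharing the W*H vertices.

    Code :
    #include <vector>
    #include <cstdint>
     
    // Illustration only: build the index list for a W x H vertex grid,
    // two triangles (6 indices) per quad, all quads sharing the W*H vertices.
    std::vector<std::uint32_t> buildGridIndices(int W, int H)
    {
        std::vector<std::uint32_t> indices;
        indices.reserve(static_cast<std::uint64_t>(W - 1) * (H - 1) * 6); // 2 triangles * 3 indices per quad
        for(int y = 0; y < H - 1; ++y){
            for(int x = 0; x < W - 1; ++x){
                std::uint32_t i0 = static_cast<std::uint32_t>(y * W + x); // top-left corner of the quad
                std::uint32_t i1 = i0 + 1;                                // top-right
                std::uint32_t i2 = i0 + static_cast<std::uint32_t>(W);    // bottom-left (next row)
                std::uint32_t i3 = i2 + 1;                                // bottom-right
                indices.insert(indices.end(), { i0, i2, i1, i1, i2, i3 });
            }
        }
        return indices; // W = H = 4001 gives 4000*4000*6 = 96 million indices, matching the numbers above
    }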

    Quote:
    You don't describe your data much aside from saying it's terrain. What kind of terrain is it? Are you drawing all ~24 million vertices per frame? Are all ~24 million visible on-screen at the same time? Are you getting much overdraw? Have you a complex vertex shader? Have you a complex fragment shader? These are all far more likely to be causing poor performance.
    You can check my code below; this is the simplest scene you can make: no shaders, no textures (wireframe), and yes, all 24 million vertices are visible at the same time.
    And here is a screenshot to make it more concrete:

    [Screenshot attached: Game screenshot.jpg]

    Finally, here is my whole code (compiled under Linux). It should also compile under Windows (not tested), maybe with some minor changes, as I'm using only open-source frameworks.
    (Rename the files with the right extensions instead of .txt, or download the zip including the source + Linux binary.)
    If anyone is willing to review, compile or test the code, let me know what you think and whether it reproduces the same issue.
    Many thanks,
    Attached Files

  3. #13 Junior Member Newbie (Join Date: Jul 2014 | Posts: 15)
    Right now I don't know why, but I can't get my Linux VM to apt-get glfw; anyway, copy-pasting the relevant GL parts into SFML I get no strange performance penalty. Have you tried to upload the data with
    glBufferData(..., NULL, ...);
    and
    glMapBuffer();
    glUnmapBuffer();

    ?
    glBufferData with the NULL parameter avoids the creation of client-side arrays (which is what should be limiting the bandwidth). If that doesn't work, then just go core profile 3.3 and stop using deprecated functionality. A compact sketch of this upload pattern is shown below.
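
    For reference, here is a compact sketch of that upload pattern, in the same style as the code in post #11 (it assumes the vertexbuffer and g_vertex_buffer_data names from the earlier posts, and memcpy needs <cstring>; this is an illustration, not code from the thread):

    Code :
    // allocate the buffer store with NULL data, so no client-side copy is created
    glBindBuffer(GL_ARRAY_BUFFER, vertexbuffer);
    glBufferData(GL_ARRAY_BUFFER, sizeof(g_vertex_buffer_data), NULL, GL_STATIC_DRAW);
     
    // map the store and fill it directly
    void * ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if(ptr){
        memcpy(ptr, g_vertex_buffer_data, sizeof(g_vertex_buffer_data));
        // glUnmapBuffer returns GL_FALSE if the store was corrupted; re-upload in that case
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }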

  4. #14 malexander, Member Regular Contributor (Join Date: Aug 2009 | Location: Ontario | Posts: 328)
    Perhaps try breaking up the VBOs into smaller buffers and issuing several draw calls. I ran into a similar performance issue with a very large mesh (20M points, 80M verts). Breaking the mesh into submeshes of 1M points improved performance by 2-3x on Nvidia hardware. On AMD hardware, it's the difference between taking many seconds to display and displaying in realtime. Hardware seems to prefer drawing many smaller meshes rather than one equivalent large mesh.

    For example, you could break your grid into 100x100 subgrids, arranged in a 20x20 fashion. Toss each subgrid into its own VAO, then just bind, draw and repeat. This would have the added benefit of making it easy to frustum-cull the subgrids that are not visible. You get some duplication of data along the edges where the subgrids meet, but the small-batch and culling optimizations should more than make up for it.
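
    A rough sketch of what that could look like for the 2001*2001 non-indexed grid from this thread (my own illustration; TILES, SUB, the heightAt() helper and the single position attribute are assumptions, not code from the thread):

    Code :
    // 20x20 arrangement of subgrids, each 101x101 vertices (100x100 quads);
    // border rows/columns are duplicated between neighbouring tiles.
    const int TILES = 20;
    const int SUB   = 101;
     
    GLuint  vao[TILES * TILES];
    GLuint  vbo[TILES * TILES];
    GLsizei vertCount[TILES * TILES];
     
    glGenVertexArrays(TILES * TILES, vao);
    glGenBuffers(TILES * TILES, vbo);
     
    for(int ty = 0; ty < TILES; ++ty){
        for(int tx = 0; tx < TILES; ++tx){
            int t = ty * TILES + tx;
     
            // build the non-indexed triangle list for this tile (needs <vector>)
            std::vector<float> verts;
            verts.reserve((SUB - 1) * (SUB - 1) * 6 * 3);
            for(int y = 0; y < SUB - 1; ++y){
                for(int x = 0; x < SUB - 1; ++x){
                    float gx = (float)(tx * (SUB - 1) + x); // position in the full grid
                    float gy = (float)(ty * (SUB - 1) + y);
                    // two triangles per quad, six vertices, z from a heightmap lookup
                    const float quad[6][2] = { {gx, gy}, {gx, gy + 1}, {gx + 1, gy},
                                               {gx + 1, gy}, {gx, gy + 1}, {gx + 1, gy + 1} };
                    for(const auto & p : quad){
                        verts.push_back(p[0]);
                        verts.push_back(p[1]);
                        verts.push_back(heightAt(p[0], p[1])); // hypothetical heightmap function
                    }
                }
            }
            vertCount[t] = (GLsizei)(verts.size() / 3);
     
            // each subgrid gets its own VAO + VBO
            glBindVertexArray(vao[t]);
            glBindBuffer(GL_ARRAY_BUFFER, vbo[t]);
            glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(float), verts.data(), GL_STATIC_DRAW);
            glEnableVertexAttribArray(0);
            glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
        }
    }
     
    // render loop: bind/draw each subgrid; frustum culling could skip invisible tiles here
    for(int t = 0; t < TILES * TILES; ++t){
        glBindVertexArray(vao[t]);
        glDrawArrays(GL_TRIANGLES, 0, vertCount[t]);
    }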

  5. #15 Junior Member Newbie (Join Date: Jun 2014 | Posts: 5)

    Some update:

    I did some more tests and made some progress,
    at least enough to discover that my performance was even worse than I thought! Indeed, I hadn't noticed that I was doing clipping in my code, so NOT all polygons were being displayed on screen.
    After disabling clipping, my framerate now drops to less than 4 FPS to display 32 million triangles (16 million vertices, 96 million indices)!!
    As was guessed, the bottleneck is not the bandwidth between system and GPU; at least that is one hypothesis eliminated.
    But the question remains: what, then, is the culprit?

    @DarioCiao: thanks for trying; if you need help I can post the packages needed to compile under Linux.
    And I didn't try to use glMapBuffer since, as I know now, bandwidth is not the issue.
