Thread: VAO framerate issues

  1. #1
    Junior Member Newbie
    Join Date
    Jun 2014
    Posts
    5

    VAO framerate issues

    Hello,
    I'm building a basic ground out of a bunch of squares, each made of two triangles.
    I'm storing it in a large array of vertices of size 2001 (length) * 2001 (width) * 6 (vertices for the two triangles) * 3 (vertex coords).
    So the total number of triangles is 24 024 006 (or 72 072 017 vertices).
    I know this could be largely optimised (avoiding re-using the same vertices, using a VBO), but what bothers me here is the framerate I get (20 FPS), as I think it should be no problem for fairly recent hardware (an Nvidia GTX 460M) to display over 24 million polygons every frame.
    I wonder where the problem is and whether I'm doing something wrong here.
    For instance, is it good practice to use a single large buffer or should it be split into several smaller buffers?
    Here is some of the code showing the way I'm doing it:

    init part:
    Code :
    GLuint VertexArrayID;
    glGenVertexArrays(1, &VertexArrayID);
    glBindVertexArray(VertexArrayID);
 
    const int range = 1000;
    // total number of floats: 2001*2001 quads, 6 vertices per quad, 3 coords per vertex
    const int vaoSize = (2*range+1)*(2*range+1)*6*3;
    // ~288 MB of vertex data: far too large for a stack array, so use heap storage (std::vector from <vector>)
    std::vector<GLfloat> g_vertex_buffer_data(vaoSize);
    cout << "vao mem consumption: " << g_vertex_buffer_data.size() * sizeof(GLfloat) << " bytes" << endl;
    size_t index = 0;
    for(int x = -range; x <= range; x++){
        for(int z = -range; z <= range; z++){
            // FIRST TRIANGLE
            // vertex 1
            g_vertex_buffer_data[index++] = x;
            g_vertex_buffer_data[index++] = 0.0f;
            g_vertex_buffer_data[index++] = z;
            // vertex 2
            g_vertex_buffer_data[index++] = x+1;
            g_vertex_buffer_data[index++] = 0.0f;
            g_vertex_buffer_data[index++] = z;
            // vertex 3
            g_vertex_buffer_data[index++] = x;
            g_vertex_buffer_data[index++] = 0.0f;
            g_vertex_buffer_data[index++] = z+1;
 
            // SECOND TRIANGLE
            // vertex 4
            g_vertex_buffer_data[index++] = x+1;
            g_vertex_buffer_data[index++] = 0.0f;
            g_vertex_buffer_data[index++] = z;
            // vertex 5
            g_vertex_buffer_data[index++] = x+1;
            g_vertex_buffer_data[index++] = 0.0f;
            g_vertex_buffer_data[index++] = z+1;
            // vertex 6
            g_vertex_buffer_data[index++] = x;
            g_vertex_buffer_data[index++] = 0.0f;
            g_vertex_buffer_data[index++] = z+1;
        }
    }
    cout << "total vertices: " << index/3 << " | total triangles: " << index/9 << endl;
 
    // This will identify our vertex buffer
    GLuint vertexbuffer;
 
    // Generate 1 buffer, put the resulting identifier in vertexbuffer
    glGenBuffers(1, &vertexbuffer);
 
    // The following commands will talk about our 'vertexbuffer' buffer
    glBindBuffer(GL_ARRAY_BUFFER, vertexbuffer);
 
    // Give our vertices to OpenGL (GL_STATIC_DRAW hints that the data will not be modified afterwards)
    glBufferData(GL_ARRAY_BUFFER, g_vertex_buffer_data.size() * sizeof(GLfloat),
                 g_vertex_buffer_data.data(), GL_STATIC_DRAW);


    Then, in the main loop, I use this code to display the array, along with some other usual OpenGL calls:
    Code :
    // 1st attribute buffer: vertices
    glEnableVertexAttribArray(0);
    glBindBuffer(GL_ARRAY_BUFFER, vertexbuffer);
    glVertexAttribPointer(
            0,                  // attribute 0. No particular reason for 0, but must match the layout in the shader.
            3,                  // size (3 floats per vertex)
            GL_FLOAT,           // type
            GL_FALSE,           // normalized?
            0,                  // stride
            (void*)0            // array buffer offset
    );
 
    // Draw the triangles! vaoSize/3 is the number of vertices (vaoSize counts floats, 3 per vertex).
    glDrawArrays(GL_TRIANGLES, 0, vaoSize/3);
     
    		glDisableVertexAttribArray(0);

    If you have any clue about where this performance issue comes from, I would really appreciate it!

  2. #2
    Junior Member Newbie
    Join Date
    Jun 2014
    Posts
    5
    Argh, I just lost the message I wrote because of the URLs forbidden in messages!!! (to the moderators: this is very inconvenient!!)
    Answering my own question, I think the bottleneck is the memory bandwidth between system RAM and the GPU.
    According to an article I cannot link here (no URLs), the data transfer rate between system RAM and GPU RAM is around 6 GB/s.
    [EDIT]
    There is an error in my first post: the numbers of vertices and triangles I calculated are wrong...
    There are 24 024 006 vertices and 8 008 002 triangles.
    (Just in case someone was looking at it.)
    Hence 24 million vertices = 24 million * 3 (coords x, y, z) * 4 (size of a float) = 288 million bytes sent from RAM to the GPU each frame.
    [/EDIT]
    In my case I get 20 FPS, and 20 * 288 MB = 5.76 GB/s, which is pretty much the data transfer rate to the GPU.
    Which would explain why I only get "20 FPS".
    Anyone willing to confirm this calculation?
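    Writing the arithmetic out as a snippet, just to make the assumptions explicit (plain numbers, nothing OpenGL-specific):
    Code :
    // Back-of-the-envelope check of the bandwidth estimate.
    const double bytesPerFrame = 24e6 * 3 * 4;              // 24 million vertices * 3 coords * 4 bytes = 288 MB
    const double fps           = 20.0;
    const double gbPerSecond   = bytesPerFrame * fps / 1e9; // = 5.76 GB/s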
    Last edited by etienne85; 07-01-2014 at 03:50 PM.

  3. #3
    Intern Contributor
    Join Date
    Mar 2014
    Posts
    50
    Quote Originally Posted by etienne85 View Post
    If you have any clue about where this performance issue comes from, I would really appreciate it!

    What about your shader? After all that's where the real work is done.

  4. #4
    Junior Member Newbie
    Join Date
    Jul 2014
    Posts
    15

    reply

    I get 40+ FPS on my old (5-year-old) laptop rendering 8 million triangles (non-indexed, so 24 million vertices).

    A common issue my girlfriend had with her NVIDIA laptop (assuming you have a laptop; integrated cards were born there, but nowadays desktops sometimes have integrated graphics too) is that by default it used its integrated Intel card instead of the GTX card for every application except games detected by the NVIDIA drivers (and when not connected to mains power it always used the integrated chip). You just need to go to the NVIDIA control panel and make sure the executables you create use the dedicated graphics card (it's intuitive to do, but you do need to do it). A funny side note: her antivirus has an executable named AVP.exe and the NVIDIA drivers detected it as the game Alien Versus Predator... I don't know if they ever fixed that.

    This is not a feature I like much (OK, power saving and extending the life of the graphics card by not using it for everyday stuff is a good thing):
    what if you create a game and players don't play it because they don't know how to turn on the graphics card, since your game isn't detected by the drivers?

    To track down the issue:
    Create a program (as simple as possible, just crude OpenGL calls) that reproduces the issue (50 lines of code?):

    1) create the buffers
    2) fill them with terrain data and upload to the GPU (just a flat grid of triangles, don't even bother filling in random values)
    3) use a shader which just outputs a single color (a minimal example is sketched at the end of this post)
    4) render loop


    This is important because there may be secondary code you haven't shown that is messing things up, so before doing it the hard way, make the code work the simple way in a single file that we can happily compile. And learn to profile (this is not easy, especially in graphics applications): what if your framerate is dropping because there is a "for" loop in your app that is eating CPU cycles? Profiling would show that loop immediately.

    Check if the framerate still drops. If it doesn't, start adding features (shaders, texturing) to see where the bottleneck is. If it does, post the code here so we can check for common errors; if the code is correct then the problem is your machine (if you can play AAA games at a decent level of detail your machine is OK and you definitely have to investigate your code).
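    For step 3, a minimal single-color shader pair could look like the sketch below (just an illustration in C++ raw string literals; attribute location 0 matches the glVertexAttribPointer call in the first post, the MVP uniform name is my own choice):
    Code :
    // Minimal "flat color" shaders for a repro case (sketch only).
    const char* vertexShaderSrc = R"(
        #version 330 core
        layout(location = 0) in vec3 position;   // attribute 0, as set up with glVertexAttribPointer
        uniform mat4 MVP;                         // model-view-projection matrix
        void main() {
            gl_Position = MVP * vec4(position, 1.0);
        }
    )";
 
    const char* fragmentShaderSrc = R"(
        #version 330 core
        out vec4 color;
        void main() {
            color = vec4(0.2, 0.8, 0.2, 1.0);     // one flat color, no lighting, no texturing
        }
    )";

    Compile and link these with the usual glCreateShader / glCompileShader / glLinkProgram calls; the point is just to rule the shader stage out as a bottleneck.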
    Last edited by DarioCiao!; 07-02-2014 at 04:31 AM. Reason: forget a part

  5. #5
    Junior Member Newbie
    Join Date
    Jun 2014
    Posts
    5
    Thanks for the answers. For the record, I'm not using shaders, so no extra calculations there.
    Likewise, I don't have a second card in my laptop (I think you are referring to the NVIDIA Optimus system, which switches between the slow and the fast card), and even though I'm on Linux, games run very well.

    I get 40+ FPS on my old (5-year-old) laptop rendering 8 million triangles (non-indexed, so 24 million vertices).
    I'm surprised. Do you mean per frame OR per second? If the former, then we have the same data, but in my case I find the bottleneck to be the bandwidth between RAM (where the vertex buffer is stored) and the GPU.
    Indeed (correct me if I'm wrong), 24 million vertices uploaded EACH frame to the graphics card, knowing 1 vertex has 3 floats and each float uses 4 bytes, is 24 million * 3 * 4 = 288 million bytes per frame = 288 MB/frame.
    Hence at 20 FPS the data rate is 5.76 GB/s, which I think corresponds to the RAM-to-GPU bandwidth (to be confirmed), and in your case it would be about 11.5 GB/s, which seems far above mine.

    Tonight I switched to an indexed implementation (element buffer): reducing the number of vertices to 4 million (divided by a factor of 6) for the same ground area covered and the same number of triangles. As expected, the framerate was back to normal.

    To experiment, I pushed the limit to see what happens.
    Quadrupling the surface I get 16 million vertices and 96 million indices => data rate is 16 million * 3 * 4 bytes + 96 million * 4 bytes = 576 MB/frame.
    Now I get 15 FPS => about 8.6 GB/s for 32 million triangles.
    oO this is surprising... I find a different bandwidth than before (8.6 GB/s vs 5.76 GB/s); I wonder if this is the right way to calculate it?
    In this case I assume we are transferring both the vertex and the index buffers to the GPU each frame. So I wonder if the vertex buffer can instead remain in GPU memory, saving bandwidth? I need to investigate this as I'm still quite new to OpenGL.
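    From what I understand so far, the idea would be something like this (just a sketch of what I plan to try; the names vertices, indices, vertexCount and indexCount are mine):
    Code :
    // --- init, done once: upload vertex + index data with a "static" hint ---
    GLuint vbo, ibo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexCount * 3 * sizeof(GLfloat),
                 vertices.data(), GL_STATIC_DRAW);
 
    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexCount * sizeof(GLuint),
                 indices.data(), GL_STATIC_DRAW);
 
    // --- every frame: no glBufferData call, so nothing should be re-sent over the bus ---
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, (void*)0);
    glDisableVertexAttribArray(0);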
    Any ideas about these results? If you want, I can post my code for you to check (it just needs some cleaning).
    Best Regards,

  6. #6
    Junior Member Newbie
    Join Date
    Jul 2014
    Posts
    15

    reply

    My result is 40 frames per second rendering 8 million triangles every frame. This is not surprising because I have a very old graphics card. Since you are new, I'll clarify things a bit.

    Indeed your bottleneck is RAM bandwidth because you are not using VBOs.

    Modern OpenGL is ALL about using VBOs for storing vertex AND index data. With VBOs you can store data efficiently on the GPU (or stream it from RAM when you really need to). This is useful for two reasons:

    1) you don't have to wait for data coming from RAM
    2) you are not limited by the RAM (bus) bandwidth

    Don't even bother using client-side arrays or immediate-mode drawing.

    VAOs. VAOs are used together with VBOs (so I probably misread your question, because you asked about VAOs but seemed to say you were not using VBOs).

    VAOs are used to "memorize" the configuration (configuration is NOT the same as binding) of VBOs. VAOs are especially useful on NVIDIA cards, which suffer more from repeated glVertexAttribPointer calls. The VBO configuration is something that just eats CPU cycles without actually moving any data.
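    As a rough illustration (just a sketch; "vao" and "vertexCount" are my own names, "vertexbuffer" is from the first post), the attribute setup is recorded into the VAO once at init, and the render loop only needs to bind the VAO again:
    Code :
    // init: record the attribute configuration into the VAO once
    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, vertexbuffer);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
    glBindVertexArray(0);
 
    // render loop: no glVertexAttribPointer / glEnableVertexAttribArray calls needed any more
    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    glBindVertexArray(0);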

    The GL_STATIC_DRAW flag is what tells the driver to store the data on the GPU (this is just a hint! there is no guarantee that the data is moved to the GPU).

    Laptops with no dedicated VRAM are heavily limited by bandwidth and will not gain much benefit from a good graphics chip.

    Common bottlenecks:
    CPU cycles (too many draw calls or too many GL calls)
    Memory quantity (RAM and VRAM; running out causes GL_OUT_OF_MEMORY errors)
    Bus bandwidth (limited to a few GB/s)
    Fill rate (pixel operations done by the GPU)

    Also note that there is no benefit in rendering more vertices at a time than the number of pixels on the screen. (Even when doing heavy tessellation I still have small triangles covering 4-5 pixels, which on my 1366x768 screen means I have to render just ~230,000 vertices every frame; any extra vertex can't add visible detail.)

    You should also keep in mind that you may have millions of vertices stored on the GPU but actually render only a small part of them (look up frustum culling, for example).

    Your graphics card's specifications say:
    Vertex operations: 56700 MVertices/sec, which means that theoretically, to achieve 60 FPS, you could process 945 MVertices per frame (a purely theoretical limit; in practice it will be much lower, but still far higher than 32 million).
    Last edited by DarioCiao!; 07-04-2014 at 07:58 AM.

  7. #7
    Intern Contributor
    Join Date
    Mar 2014
    Location
    San Jose, CA
    Posts
    58
    Quote Originally Posted by DarioCiao! View Post
    Indeed your bottleneck is RAM bandwidth because you are not using VBOs.
    The code in the original post is using VBOs.

  8. #8
    Junior Member Newbie
    Join Date
    Jul 2014
    Posts
    15

    reply

    Quote Originally Posted by reto.koradi View Post
    The code in the original post is using VBOs.
    Oops, you are right: this sentence made me think he was not using VBOs (even though I had looked at the code the first time):

    I know this could be largely optimised (avoiding re-using the same vertices, using a VBO), but what bothers me here is the framerate I get (20 FPS)
    Then he should submit stand-alone code that compiles and reproduces the problem, at least on his machine.

  9. #9
    Senior Member OpenGL Pro
    Join Date
    Jan 2007
    Posts
    1,215
    Quote Originally Posted by etienne85 View Post
    Argh, I just lost the message I wrote because of the URLs forbidden in messages!!! (to the moderators: this is very inconvenient!!)
    Answering my own question, I think the bottleneck is the memory bandwidth between system RAM and the GPU.
    According to an article I cannot link here (no URLs), the data transfer rate between system RAM and GPU RAM is around 6 GB/s.
    This is extremely unlikely to be your bottleneck because you're using a VBO, which should therefore be storing your vertex data on the GPU itself. In other words: with a VBO there should be no transfer between system memory and GPU memory happening at all.

    I say "should be" because sometimes your driver might decide to store your VBO in system memory. OpenGL just specifies GL_STATIC_DRAW as a hint: you're telling the driver that the contents of the VBO aren't going to change, and the driver will make a decision on where to store it based on that. At some point while your program is running your driver might decide to move it to different storage depending on how you use it, but since you're just filling it once and then never going near it again, this won't happen.

    Another case where the driver might decide to store it in system memory is if there isn't enough GPU memory available for it. Perhaps you've got some very large textures?

    But in your case you can say that your VBO is most likely stored in GPU memory and you need to look elsewhere for your bottleneck.

    You don't describe your data much aside from saying it's terrain. What kind of terrain is it? Are you drawing all ~24 million vertices per frame? Are all ~24 million visible on-screen at the same time? Are you getting much overdraw? Have you a complex vertex shader? Have you a complex fragment shader? These are all far more likely to be causing poor performance.

  10. #10
    Member Regular Contributor
    Join Date
    Aug 2008
    Posts
    456
    You're right on the amount of data being used.

    If the vertex data is the same for each vertex that is shared between triangles, then using indexed drawing (glDrawElements) would help reduce the amount of data you are transferring & would allow the GPU to re-use some calculations. Sometimes tex-coords aren't shared between adjacent triangles even though the position is the same, so indexed rendering wouldn't help in that case.

    Non-indexed size calculation:
    data size = width * height * vertices per quad * vertex size
    = 2001 * 2001 * 6 * 12 bytes
    = ~288 MB per draw call

    Indexed size calculation:
    data size = total vertex data size + total index data size
    = width * height * vertex size + width * height * vertices per quad * index size
    = 2001 * 2001 * 12 + 2001 * 2001 * 6 * 4 bytes
    = ~48 MB + ~96 MB
    = ~144 MB per draw call

    Another benefit of using indexed rendering is that if you draw the vertices as indices (0,1,2, 1,3,2, ...), then whenever the GPU comes across a repeated index it can simply look up the previously cached vertex result rather than re-calculating it. If you instead provide the GPU with repeated data ((0.1f, 0.1f, 0.0f), (0.2f, 0.1f, 0.0f), (0.1f, 0.2f, 0.0f), (0.2f, 0.1f, 0.0f), (0.2f, 0.2f, 0.0f), (0.1f, 0.2f, 0.0f), ...), it is typically cheaper for it to re-calculate the vertex data than to compare every byte of submitted data against recent vertices.
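    As a rough sketch of how the shared vertices and index buffer could be built for a grid (my own variable names, assuming std::vector and the GL headers are available):
    Code :
    // Shared vertices plus an index buffer for a (width x height) grid of quads.
    std::vector<GLfloat> gridVertices;              // (width+1)*(height+1) shared vertices, 3 floats each
    for (int z = 0; z <= height; ++z)
        for (int x = 0; x <= width; ++x) {
            gridVertices.push_back((GLfloat)x);
            gridVertices.push_back(0.0f);           // flat ground
            gridVertices.push_back((GLfloat)z);
        }
 
    std::vector<GLuint> gridIndices;                // 6 indices per quad, re-using the shared vertices
    for (int z = 0; z < height; ++z)
        for (int x = 0; x < width; ++x) {
            GLuint i = z * (width + 1) + x;         // index of this quad's (x, z) corner
            // first triangle
            gridIndices.push_back(i);
            gridIndices.push_back(i + 1);
            gridIndices.push_back(i + width + 1);
            // second triangle (re-uses two of the vertices above)
            gridIndices.push_back(i + 1);
            gridIndices.push_back(i + width + 2);
            gridIndices.push_back(i + width + 1);
        }
 
    // Upload gridIndices into a GL_ELEMENT_ARRAY_BUFFER and draw with:
    // glDrawElements(GL_TRIANGLES, (GLsizei)gridIndices.size(), GL_UNSIGNED_INT, (void*)0);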

    With larger vertex sizes (if you were to add color/tex coords etc per vertex) then indexed rendering would gain even more over non-indexed rendering.

    If you're drawing terrain, there are lots of algorithms out there for not drawing the full 2001*2001 vertices, which would probably provide the best speed-up.

    If you are just drawing a regular grid using a heightmap and don't want to use any approximations, then there is only width*height*sizeof(float) = 2001*2001*4 = ~16 MB of actual data, so perhaps you could pass this height as a single float vertex attribute (or a texture) and generate the X & Y coordinates in the vertex shader from gl_VertexID (similar to this example, but you would also need to set the Z coordinate from the height attribute, whereas they set it to zero). Since the vertex data is quite small in this case, you could use glDrawArrays/glDrawArraysInstanced for this approach (16 MB of vertex data, but no vertex caching), which would probably beat indexed rendering (16 MB of vertex data + 96 MB of index data, with vertex caching), although it might be worth checking indexed vs non-indexed to see which is faster.
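    A rough sketch of that kind of vertex shader (my own names; here I assume indexed drawing, so gl_VertexID is the grid vertex index, and I keep the height on the Y axis as in the original code):
    Code :
    // Vertex shader sketch: X/Z are reconstructed from gl_VertexID, only the height is stored per vertex.
    const char* terrainVS = R"(
        #version 330 core
        layout(location = 0) in float height;   // one float per grid vertex
        uniform int gridWidth;                   // vertices per row, e.g. 2001
        uniform mat4 MVP;
        void main() {
            // with indexed drawing, gl_VertexID is the index fetched from the element buffer
            float x = float(gl_VertexID % gridWidth);
            float z = float(gl_VertexID / gridWidth);
            gl_Position = MVP * vec4(x, height, z, 1.0);
        }
    )";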

    Perhaps with indexed instanced rendering (glDrawElementsInstanced), using a small chunk of the grid at a time and still generating the X & Y coordinates, you could get a balance between keeping the index data size down and being able to make use of the vertex cache. Using fewer than 64k vertices per instance would allow you to reduce the index size to 2 bytes instead of 4 per index.

    Any time you generate coordinates on the fly, you'll need to be careful to do it in such a way that floating-point inaccuracies don't creep in. For example:
    Code :
    X1 = (gl_InstanceID + gl_VertexID) * tile_width
    would probably generate the same value for any combination of gl_InstanceID + gl_VertexID: (2 & 3), (3 & 2), (4 & 1), (1 & 4) but
    Code :
    X2 = gl_InstanceID*tile_width + gl_VertexID*tile_width
    would probably result in slightly different values for each combination (2 & 3), (3 & 2), (4 & 1), (1 & 4), which would result in cracks appearing in your grid.
