Performance problem with a small voxel engine



lorant
11-11-2011, 11:02 AM
Hi all,

I'm working on a small voxel engine. To check whether performance is correct, I currently render a scene somewhat similar to Minecraft. My problem is that my framerate seems lower than it should be, compared with what the game achieves on the same computer.

My scene is divided into 16 * 16 chunks, and each chunk contains 16 * 128 * 16 voxels. AFAIK this is the same setup as Minecraft.

I also use the same technique: I build a mesh for each chunk, discarding all the quads that lie between adjacent non-empty cubes. There is no occlusion culling or frustum culling yet (both are planned, but first I want to get the base of the engine right).
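For reference, the face-discarding pass can be sketched roughly like this (a minimal standalone version; the 4*4*4 chunk size and the voxel storage are simplified placeholders for my actual code):

```cpp
#include <array>

// Toy chunk: 4*4*4 here instead of 16*128*16, same idea.
constexpr int W = 4, H = 4, D = 4;

bool solid(const std::array<bool, W * H * D>& v, int x, int y, int z) {
    if (x < 0 || x >= W || y < 0 || y >= H || z < 0 || z >= D)
        return false;                        // outside the chunk counts as empty
    return v[(x * H + y) * D + z];
}

// Count the quads that survive: a face is kept only when the
// neighbouring voxel in that direction is empty.
int count_visible_quads(const std::array<bool, W * H * D>& v) {
    static const int dir[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};
    int quads = 0;
    for (int x = 0; x < W; ++x)
        for (int y = 0; y < H; ++y)
            for (int z = 0; z < D; ++z) {
                if (!solid(v, x, y, z)) continue;
                for (auto& d : dir)
                    if (!solid(v, x + d[0], y + d[1], z + d[2]))
                        ++quads;             // this face borders empty space
            }
    return quads;
}
```

A face is emitted (here, just counted) only when the neighbouring voxel is empty, so a fully solid chunk produces only its surface quads.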

I use one VAO and two VBOs per chunk; one contains 4 vertices per quad (12 floats), the other contains 4 (identical) one-byte color indices. The actual color is fetched in the fragment shader from a 1D texture.

There is only one big index buffer, used for all VBOs.

All of these buffers are constructed only once.
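The shared index buffer itself is trivial to build once, sized for the worst-case quad count; something like this (sizes illustrative):

```cpp
#include <cstdint>
#include <vector>

// Build the one global index buffer, shared by every chunk's VBO.
// Quad q always occupies vertices 4q..4q+3; two triangles per quad.
std::vector<std::uint32_t> build_quad_indices(std::size_t max_quads) {
    std::vector<std::uint32_t> idx;
    idx.reserve(max_quads * 6);
    for (std::uint32_t q = 0; q < max_quads; ++q) {
        std::uint32_t b = q * 4;
        idx.insert(idx.end(), {b, b + 1, b + 2, b + 2, b + 3, b});
    }
    return idx;
}
```

Because quad n always occupies vertices 4n..4n+3, the same IBO works for every chunk regardless of its quad count.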

Here's my render code:



glUseProgram(program);

for(int i = 0; i < world_width; ++i){
    for(int j = 0; j < world_depth; ++j){
        mat4 mvp = chunks[i][j]->get_model_matrix() * view_projection_matrix;
        glUniformMatrix4fv(uniform_mvp_matrix, 1, GL_FALSE, mvp.c_ptr());

        glBindVertexArray(chunks[i][j]->vertex_array);
        glDrawElements(GL_TRIANGLES, chunks[i][j]->nb_quads * 6, GL_UNSIGNED_INT, 0);
    }
}


As for the content of the scene, it's generated from simple Perlin noise, so that a large proportion of cubes are adjacent. The size of the scene takes into account the fact that I don't have frustum culling yet (in Minecraft on the "far" setting there are 33*33 chunks around the player).

I made a short video to give you a better idea of what I'm rendering:

http://youtu.be/jij3T3rIoDg

The framerate is barely 40 fps (on a Radeon 4850), whereas in Minecraft I get at least 60 fps (with occlusion culling deactivated), often more. Also, I think the scenes in the game are more complex (i.e. smaller homogeneous zones). I've seen other videos of similar coding attempts with much better framerates.

So I think there's a problem with my implementation, but I have a hard time figuring out what it is. It's not the shaders, since they are very basic (no lighting), and replacing them with even more basic ones doesn't improve anything.

Any idea on how I could find out the cause of the low framerate?

_arts_
11-11-2011, 12:26 PM
You can't expect fast rendering without frustum culling or occlusion queries. This is where you can get the major improvement: frustum culling lets you render only what's inside the frustum (say, about 4 times less than without it), and occlusion queries let you easily and quickly remove occluded geometry.

Also make sure your algorithm for culling hidden faces is very fast. Since you do Minecraft-like rendering, occlusion queries might not be all that necessary for you, depending on the algorithm you have.

Finally, use VBOs with index arrays.

lorant
11-11-2011, 12:56 PM
Thanks for the answer.

I will of course add frustum culling, but I don't think it's the problem here, since the scene is about the size of what fits inside the view frustum in Minecraft (probably smaller, in fact). As for occlusion, as you said the benefit is much smaller than in usual situations (and in my tests I deactivated occlusion in the game).

My algorithm for discarding hidden (i.e. inside solid) faces is only run once, when the buffers are created, not at rendering time; so it should not affect framerate.

I do use indexed VBOs, with the additional "trick" that it's the same index buffer for all VBOs (since they all contain only quads, all constructed in the same order).

I plan to try various optimizations, including the use of a geometry shader. But first I'd like to understand why my framerate is lower than Minecraft's, when AFAIK the technique is the same.

_arts_
11-12-2011, 01:10 AM
Do you render quads? If so, it's a very bad idea: most of the time they are not supported by the hardware. Use triangles instead.

Second, from what you said about your index buffer trick, I think you make a draw call for every face (or every voxel)? Try to reduce the number of calls. You can also have a look at the glMultiDrawElements functions.

I didn't understand what you said about the frustum. Do you mean you always see the full 'world'?

Finally, have a look at instancing. I'm not very sure it will help (since you'd be instancing very few polygons), but it might. (Maybe someone here will disagree with this.)

lorant
11-12-2011, 01:49 AM
My primitives are triangles. When I say quad, I mean a face of a cube voxel; sorry for the confusion.

I have only one draw call per chunk; this is the core of the technique used by Minecraft: constructing a single mesh for each chunk of 16*128*16 voxels, discarding the invisible faces.

The "trick" I refer to is that, since all my VBOs contain only long lists of faces, each with 4 vertices in the same order, I can use one global IBO for all of them.

In my test the full world is always rendered, but it's a quarter the size of the scenes in Minecraft, so that should be about what is left after frustum culling, or even smaller.

As for instancing, others have tried it, and for this particular case it seems to hurt performance. The reason is that a lot of faces can be discarded (when they are between adjacent cubes), and with cube instancing you can't do that (except maybe in the vertex shader). I'm not sure if instancing quads instead of sending indexed vertices would improve the framerate; maybe that's worth a try. A geometry shader is probably a better option, though.

_arts_
11-12-2011, 03:38 AM
You can try to have a single VAO/VBO for all your chunks. This would reduce the number of VAO bindings from 256 (16*16) to 1 (if I understood correctly how you manage them). (Edit: this should not be negligible.)

Also, I hadn't noticed it before, but why do you send a uniform for each chunk? This can affect performance too. Can't you have the same modelview matrix for the whole scene? I'm actually unsure how you manage it. From what I understood, you use this modelview matrix to place your voxels in the world. But since they are static, it's a bit useless.

OK for the trick you refer to. Actually it only saves memory; it has no impact on performance.

For optimization purposes, you can also try using unsigned shorts instead of unsigned ints for the index buffer.

lorant
11-12-2011, 08:28 AM
Unfortunately, I can't have a single VAO/VBO, because at a later stage I will need to frequently load new parts of the landscape and unload old ones. (Plus, voxels may change frequently.)

That's also why I have a model matrix for each chunk. The landscape will be huge, and I want the render coordinates to always be centered around the camera so that there's no risk of rounding errors. So when I load a new group of chunks, I can shift the position of old ones just by updating the model matrix.

Although now I think about it, I only really need a translation vector, not a full matrix.

I changed to unsigned shorts for the indices, with no effect on framerate. Will this save some GPU memory, or are they stored as ints anyway?

I played a little with the Perlin noise, adding harmonics at both higher and lower frequencies, and with some settings I can improve the framerate while still retaining a sufficiently detailed scene. Because this kind of renderer is not bound by the number of solid voxels, but by the surface between solid and empty voxels, it's hard to tell what is a complex scene and what is not.

Still, some people seem to achieve much better performance than me, for scenes that can't be much simpler than mine (for example, here: http://youtu.be/l8w2V3gPC7I ). So I'm still a bit suspicious about my implementation.

Anyway, thanks for the suggestions, I think I will move to the next stage now, and try a geometry shader approach.

lorant
11-12-2011, 01:16 PM
I have a new question…

I've implemented a basic geometry shader, which takes a list of points as input, and for each one emits a cube.

Obviously I have poor performance: I need to construct only the visible faces, i.e. discard those that are between adjacent solid voxels.

In order to do that, I need to access my voxel data in the geometry shader, and I wonder what would be the best way.

My voxel data is:

- one byte per voxel
- voxels are grouped in chunks of 16 * 128 * 16
- the scene contains 16 * 16 of these chunks

I can't have just one big 3D texture, because chunks will be loaded/unloaded on a regular basis.

Texture arrays seem to be limited to 2D textures.

The only option I see is to create one texture object per chunk, and to bind the correct one before each glDraw call. That's 256 texture binds per frame; I'm not sure whether this will hurt performance or not.

Is there any other way?

_arts_
11-13-2011, 02:01 AM
Unfortunately, I can't have a single VAO/VBO

You really should keep only a few VBOs. Some people here would even say to have a single VBO. You can quite easily have one or a few VBOs and update sub-parts of it/them.


I changed to unsigned shorts for the indices, with no effect on framerate. Will this save some GPU memory, or are they stored as ints anyway?

This saves GPU memory, since shorts are half the size of full integers. It can make rendering slightly faster too, depending on the situation.


The landscape will be huge, and I want the render coordinates to always be centered around the camera so that there's no risk of rounding errors.

I don't understand how you see things here. Commonly, we place static geometry (I consider it static as it doesn't move, it can only be destroyed) statically, meaning we give it global positions, in world coordinates. This will always be faster than having to transform all the geometry (with T&L or by uploading new vertex positions to the buffer). If you have to transform each voxel, there will inevitably be a performance drop.


I played a little with the Perlin noise [...] I can improve the framerate while still retaining a sufficiently detailed scene. Because this kind of renderer is not bound by the number of solid voxels, but by the surface between solid and empty voxels, it's hard to tell what is a complex scene and what is not.

How much was the improvement? It seems to me that, depending on the function, you have a scene with more or less overdraw.

The way you do things means that:

1. you bind VBOs too often;
2. you send uniforms too often, and add needless extra transforms to each chunk;
3. the index array is of limited use, since your chunks don't share physical vertices. If you use a global VBO with shared physical vertices and a global index buffer, you can gain a significant improvement here too;
4. you have more or less occlusion depending on the function you use to create your scene. Managing occlusion is not a useless step. Depending on your algorithm, you might be interested in OpenGL occlusion queries. Just removing adjacent faces might not be enough if the scene has lots of holes (parts where some cubes have no face adjacencies but still have faces hidden by other cubes).

Also remember that when one looks at a cube, one can see at most 3 of its faces. So for each cube you render, at least half of the faces will be hidden from view. You can gain a lot of improvement here too.

Without improving all of these points, you can't expect to have a fast enough renderer.

As for the geometry shader, people generally avoid it because it is not very efficient. But I'm not good enough in this area to say more about it.

lorant
11-13-2011, 03:44 AM
You really should keep only a few VBOs. Some people here would even say to have a single VBO. You can quite easily have one or a few VBOs and update sub-parts of it/them.
The problem is that the VBOs for different chunks have very different sizes, and I need to be able to replace old chunks with new ones when the camera moves. With a single VBO, I'd have to allocate the maximum possible size for each sub-part, which is huge (almost 400,000 vertices) and most of the time wasted (but occasionally needed).



Commonly, we place static geometry (I consider it static as it doesn't move, it can only be destroyed) statically, meaning we give it global positions, in world coordinates.
I wanted the GPU to work in camera space, because the world will be potentially huge, and in some cases I think that would introduce rounding errors. I need the world coordinates to have at least 5 digits, so if I'm not mistaken that would leave only 2 digits of precision for the GPU, which is probably not enough?



How much was the improvement? It seems to me that, depending on the function, you have a scene with more or less overdraw.
I could achieve almost 60 fps. The current scene is only a placeholder that tries to be similar to Minecraft terrain in order to compare performance. But "similar" is difficult to evaluate for this kind of renderer, so when in doubt I try to choose what looks like the worst case.



3. the index array is of limited use, since your chunks don't share physical vertices. If you use a global VBO with shared physical vertices and a global index buffer, you can gain a significant improvement here too;
I don't understand why I would share more vertices with a global VBO… Currently I only share vertices within a face, because each voxel needs its own color, and for lighting I'll need different data for each face. Doing more optimization (sharing between similar voxels) would slow down chunk updates, which I want to avoid.



4. you have more or less occlusion depending on the function you use to create your scene. Managing occlusion is not a useless step.
In my case this is a double-edged sword, since the scene may be modified at any time, so what is occluded can suddenly become visible. Also, in this kind of game it's common to reach a viewpoint where you can see everything. Minecraft added occlusion culling very late, and mainly for laptops: it runs really fine with occlusion deactivated on desktops. I will add occlusion later, but I don't want to rely on it.

My goal at this point is to have an engine as good as Minecraft, using the same techniques, so that I can see if I can improve on that.



Also remember that when one looks at a cube, one can see at most 3 of its faces. So for each cube you render, at least half of the faces will be hidden from view.
That's a really good point. I relied on GL_CULL_FACE for that, but doing it myself will reduce the VBOs sizes.

Thanks again for all the suggestions!

EDIT: Actually, I spoke too fast. It can't reduce the VBO sizes, since I don't want to update them when the camera moves. So the only place I could discard the faces not facing the camera would be in the vertex shader, and GL_CULL_FACE probably does a better job than I would.

Or maybe I should sort the faces in my VBOs to group them by orientation, and split each glDrawElements call into six glDrawRangeElements calls (or maybe glDrawElementsBaseVertex)? Would three draw calls over sub-parts of the buffer be faster than one call over the complete buffer?
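The selection of which three orientation groups to draw would at least be cheap; a hypothetical sketch, assuming the groups are ordered +X, -X, +Y, -Y, +Z, -Z and ignoring perspective edge cases:

```cpp
#include <array>

// Given the view direction, return the indices of the three orientation
// groups whose faces can point towards the camera. A +X face has normal
// (1,0,0) and is front-facing only when the view direction has dx < 0.
std::array<int, 3> visible_face_groups(float dx, float dy, float dz) {
    return { dx < 0.0f ? 0 : 1,   // +X group or -X group
             dy < 0.0f ? 2 : 3,   // +Y group or -Y group
             dz < 0.0f ? 4 : 5 }; // +Z group or -Z group
}
```

Each returned index would map to one glDrawRangeElements call over that group's range of the buffer.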

aqnuep
11-13-2011, 11:09 AM
I wanted the GPU to work in camera space, because the world will be potentially huge, and in some cases I think that would introduce rounding errors. I need the world coordinates to have at least 5 digits, so if I'm not mistaken that would leave only 2 digits of precision for the GPU, which is probably not enough?
Well, while there is indeed a risk of rounding errors when you have a big world, you are rendering only cubes, and for axis-aligned cubes single-precision floating point numbers should be adequate for a very huge world.

You need to calculate how large the biggest chunk can be that is still rendered accurately with single-precision floats. It is definitely not 16*128*16 in size but much, much larger. Maybe your whole world would fit that way.
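You can even measure the spacing of representable single-precision values at a given coordinate with std::nextafter; at x = 20000 the spacing is still only about 2 millimeters, assuming 1 unit = 1 meter:

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it: the
// absolute precision available at that coordinate.
float spacing_at(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

Floats in [2^14, 2^15) all have a spacing of 2^-9 = 0.001953125, so a cube grid at world coordinate 20000 is still placed to within a couple of millimeters.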

Alfonse Reinheart
11-13-2011, 11:20 AM
I wanted the GPU to work in camera space, because the world will be potentially huge, and in some cases I think that would introduce rounding errors. I need the world coordinates to have at least 5 digits, so if I'm not mistaken that would leave only 2 digits of precision for the GPU, which is probably not enough?

Each chunk should be in its own model space, relative to a local origin. Details can be found here. (http://www.arcsynthesis.org/gltut/Positioning/Tut07%20The%20Perils%20of%20World%20Space.html)

aqnuep
11-13-2011, 11:48 AM
Each chunk should be in its own model space, relative to a local origin. Details can be found here. (http://www.arcsynthesis.org/gltut/Positioning/Tut07%20The%20Perils%20of%20World%20Space.html)
Yes, I think that's why he's replacing the modelToWorld matrix uniform between draw calls, though I'm not convinced that the granularity he uses is justified.

Also, you can use other tricks to batch multiple commands, and use arrays in uniform blocks in the vertex shader, though such techniques may not be an option on the target hardware.

Alfonse Reinheart
11-13-2011, 12:08 PM
To be honest though, all he needs is the right vertex shader.


I need the world coordinates to have at least 5 digits

No, you don't. Your world is made up of blocks, all of which have the same size. Therefore, your world's granularity only needs to be in sizes of blocks.

Your block coordinates should be in integers (either signed unnormalized shorts, or just floats that happen to use integer values). (0,0), (1,1), etc. You then apply a scale to them, as part of the initial transformation, to scale them up to the world size you actually need. But because you don't stop at world-space, your full model-to-camera transform should have all of the precision you need. You go from integer coordinates directly to camera space. Since both the integer coordinates and camera space are close to the camera, you don't have any precision problems.

So you can use all 7 floating-point digits before precision becomes a concern.

lorant
11-13-2011, 12:31 PM
Each chunk should be in its own model space, relative to a local origin.
Yes, that's what I'm currently doing.

_arts_ suggested I change that to world coordinates in order to avoid setting the model matrix uniform 256 times per frame (if I understood correctly). This would be possible since the chunks never move, but I was afraid of rounding errors. Your link seems to confirm that.


No, you don't. Your world is made up of blocks, all of which have the same size. Therefore, your world's granularity only needs to be in sizes of blocks.

Your block coordinates should be in integers (either signed unnormalized shorts, or just floats that happen to use integer values). (0,0), (1,1), etc. You then apply a scale to them, as part of the initial transformation, to scale them up to the world size you actually need. But because you don't stop at world-space, your full model-to-camera transform should have all of the precision you need. You go from integer coordinates directly to camera space. Since both the integer coordinates and camera space are close to the camera, you don't have any precision problems.

That's also what I do: my cubes are 1 unit in size. My fear of rounding errors was about the case suggested above: storing the buffers in world space instead of model space.


So you can use all 7 floating-point digits before precision becomes a concern.

So the GPU doesn't need some "room" to do its work? That might be an argument in favor of world space, even if this is unorthodox.


I'm not convinced that the granularity he uses is justified.

I don't think I can increase the size of chunks, because they will be modified on a regular basis, so I need to limit the size of VBOs updates, and the work needed on the CPU to construct the mesh.

EDIT: I did a quick test, and removing the 256 calls to glUniformMatrix for the model matrix only gains me a few fps: I go from 44-45 to 46-47. So I prefer to keep my buffers in model space, where I don't have to worry about rounding errors.

Dan Bartlett
11-13-2011, 03:08 PM
If you render the opaque chunks roughly sorted from nearest to furthest away you might gain a bit. You could do this by calculating the distance to each chunk, or by splitting the world into quadrants (or octants) around the viewpoint and rendering each quadrant (or octant) with the loop variables heading away from the viewpoint. So instead of something like this:



for (int x = 0; x < world_width; x++)
    for (int y = 0; y < world_depth; y++)
        render_chunk(x, y);

you could use:


int viewpoint_chunk_x = clamp(calculate_viewpoint_chunk_x(), 0, world_width - 1);
int viewpoint_chunk_y = clamp(calculate_viewpoint_chunk_y(), 0, world_depth - 1);

for (int x = viewpoint_chunk_x; x < world_width; x++)
    for (int y = viewpoint_chunk_y; y < world_depth; y++)
        render_chunk(x, y);

for (int x = viewpoint_chunk_x; x < world_width; x++)
    for (int y = viewpoint_chunk_y - 1; y >= 0; y--)
        render_chunk(x, y);

for (int x = viewpoint_chunk_x - 1; x >= 0; x--)
    for (int y = viewpoint_chunk_y; y < world_depth; y++)
        render_chunk(x, y);

for (int x = viewpoint_chunk_x - 1; x >= 0; x--)
    for (int y = viewpoint_chunk_y - 1; y >= 0; y--)
        render_chunk(x, y);


While inside the volume, some form of frustum culling would help a lot too.

Also, try running it through a profiler, perhaps a normal profiler + an OpenGL specific one. Perhaps the fragment shader could be an issue too, since this is the "inner loop" and anything that could be moved outside of that could provide a boost.

lorant
11-14-2011, 12:04 AM
If you render the opaque chunks roughly sorted from nearest to furthest away you might gain a bit.

I had already tried it, it made no difference on performance.



While inside the volume, some form of frustum culling would help a lot too.

Yes I'll add it at a later stage, but the current size of the scene is what will be in the view frustum in the end (I will load chunks centered around the camera, with a view distance of at least 16 chunks). The reason I didn't implement frustum culling yet is because I want to try several different formats for the voxels (i.e. various combinations of arrays and octrees).


Also, try running it through a profiler, perhaps a normal profiler + an OpenGL specific one.

I have trouble running gDEBugger on my 64-bit Linux, and trouble getting SDL 1.3 + GLEW to work on the Windows side... :(

I'll probably switch to SDL 1.2.


Perhaps the fragment shader could be an issue too, since this is the "inner loop" and anything that could be moved outside of that could provide a boost.

If I replace my (already simple) fragment shader with a completely straightforward one (with only one assignment), I get exactly the same framerate. If I do the same thing with the vertex shader, I get a very small gain (from 50 fps to 51 or 52). So I guess that means I'm CPU bound, but my render function is really simple: for each of the 256 chunks I do a matrix product, set the model matrix uniform, bind the VAO and call glDrawElements.

_arts_
11-14-2011, 01:23 AM
Remove your CPU matrix multiplication and try the same test without uniforms. That can be a good start to find out whether you're CPU limited. But actually I don't believe you are: you have a slightly better graphics card than mine, so I expect you have at least the same kind of CPU (here I have an Athlon X2).

Do you currently update your VBOs once before rendering, or do you do this regularly?


_arts_ suggested I change that to world coordinates in order to avoid setting the model matrix uniform 256 times per frame (if I understood correctly). This would be possible since the chunks never move, but I was afraid of rounding errors. Your link seems to confirm that.

You can have a precision of 10^-3 with a world size of (20000-1)*(20000-1), which looks more than far enough to me for a Minecraft game (it allows a world 20 km wide with a precision of 1 millimeter).

You can improve a bit more with no uniforms, a single VBO (so single calls to glVertexAttrib*), fewer calls to DrawElements, and more clever indices. And as Dan Bartlett suggested, draw from front to back. But you're right, that's only about optimization. All you could gain in the end might be on the order of 3-5% :)

Also, give instancing a try.

And as a side note, since your shaders are really simple, expect a drop when you add lighting (mainly per-pixel lighting), at least if you're not CPU limited.

PS: can you keep the same world for your tests? I see that you often have different results (you started at 40 fps and now you're at about 50 fps), which I guessed was due to changes in your Perlin noise functions? That absolutely won't help.

lorant
11-14-2011, 02:28 AM
PS: can you keep the same world for your tests? I see that you often have different results (you started at 40 fps and now you're at about 50 fps), which I guessed was due to changes in your Perlin noise functions? That absolutely won't help.
Sorry, I forgot to mention: since my first post I removed the quads at the edge of the world, which is why I am now at 50 fps, with the same noise function and settings.



Remove your CPU matrix multiplication and try the same test without uniforms. That can be a good start to find out whether you're CPU limited.
I tried that, but I gain only a few fps (from 50 to 51 or 52 with the original noise function). That's why I prefer to stay with model space coordinates and keep the model matrix for each chunk (also, see below).



You can have a precision of 10^-3 with a world size of (20000-1)*(20000-1), which looks more than far enough to me for a Minecraft game (it allows a world 20 km wide with a precision of 1 millimeter).
20km could be enough for me, but Minecraft goes beyond that. Also, I've found in this post from Notch (http://notch.tumblr.com/post/3746989361/terrain-generation-part-1) that he uses local (camera space) coordinates for rendering, so that does not explain the difference I have in framerate.



Do you currently update your VBOs once before rendering, or do you do this regularly?
All my VBOs are constructed and uploaded only once.


You can improve a bit more with no uniforms, a single VBO (so single calls to glVertexAttrib*), fewer calls to DrawElements, and more clever indices.
Ah yes, I completely forgot about interleaved buffers... That might very well be the source of my problem. I'll test that.

What do you mean by "more clever indices"?


Also, give instancing a try.
From what I've read about others' attempts, that would only hurt performance in my case. This kind of renderer relies on the fact that a huge proportion of faces are discarded, so I can't instance cubes, and AFAIK instancing quads is pointless.


And as a side-note, since your shaders are really simple, expect a drop when you'll add lighting (mainly per-pixel lighting), at least if you're not CPU limited.
That's why I try very hard to get the base engine right! ^ ^

Also, I'll decide what kind of lighting I'll implement depending on the performance I can get. In a voxel world, I think there's a lot of way you can "cheat" about the lighting and still get decent looking and dynamic results.

lorant
11-14-2011, 03:34 AM
Yay!

Moving to interleaved buffers (32 bits aligned) made the framerate jump from 50 fps to more than 130... I feel silly now, I should have thought of that. But I had no idea it would have so much impact!

I'll probably play a little with the geometry shader approach to see if I can improve on that, but that's already good enough.

Thanks everyone for your help!

Dan Bartlett
11-14-2011, 05:16 AM
Good to hear you got it sorted.

Out of interest, does having the one-byte color index values aligned on 4-byte boundaries but still in a separate buffer object give any improvement over having them tightly packed? Although interleaved is usually better if possible.

Have you remembered to switch back-face culling on too? (Never mind, I've already seen you mention this in another post.)

Maybe now that the major bottleneck is gone, the other optimizations such as rendering front to back will have an effect, although that one would matter more if rendering fragments were expensive.

If matrix multiplication becomes a bottleneck (although 256 matrix multiplications shouldn't be), one trick you could use, if the chunks are always axis-aligned with the world, is to simplify the matrix multiplication.

If you have a constant view-projection matrix across all chunks:


[a e i m]
[b f j n]
[c g k o]
[d h l p]

And you are multiplying it by the chunk model matrix that is simply a translation from the origin:


[1 0 0 x]
[0 1 0 y]
[0 0 1 z]
[0 0 0 1]

Then the matrix multiplication can be simplified to:


[a e i m][1 0 0 x] [a e i ax+ey+iz+m]
[b f j n][0 1 0 y] = [b f j bx+fy+jz+n]
[c g k o][0 0 1 z] [c g k cx+gy+kz+o]
[d h l p][0 0 0 1] [d h l dx+hy+lz+p]

Only the last column varies across each of the chunks.
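In code, that shortcut might look like this (a sketch assuming column-major storage as OpenGL expects, so elements 12..15 are the last column):

```cpp
#include <array>

using Mat4 = std::array<float, 16>; // column-major: m[col * 4 + row]

// vp is the constant view-projection matrix; (x, y, z) is the chunk's
// translation from the origin. Only the last column of vp * T(x,y,z)
// needs computing; the other three columns are copied unchanged.
Mat4 chunk_mvp(const Mat4& vp, float x, float y, float z) {
    Mat4 r = vp;                             // columns 0..2 unchanged
    for (int row = 0; row < 4; ++row)
        r[12 + row] = vp[0 + row] * x + vp[4 + row] * y
                    + vp[8 + row] * z + vp[12 + row];
    return r;
}
```

That replaces a full 4x4 multiply per chunk with four multiply-add rows.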

Does removing the glBindVertexArray call have much of an impact on performance? If it does, then putting everything into one buffer object, as other people have mentioned, might help: use glBufferSubData to stream in new chunks and glDrawRangeElements to draw each visible chunk. If using more recent extensions, glMapBufferRange (to write to a range of the buffer) and glDrawElementsBaseVertex (to allow use of a shared index buffer) could be useful.

_arts_
11-14-2011, 05:29 AM
Moving to interleaved buffers (32 bits aligned) made the framerate jump from 50 fps to more than 130...

What exactly do you mean by interleaved buffers?

lorant
11-14-2011, 06:05 AM
Out of interest, does having the one-byte color index values aligned on 4-byte boundaries but still in a separate buffer object give any improvement over having them tightly packed? Although interleaved is usually better if possible.
You're right! I just replaced the one-byte color index with a 32-bit integer, and I also get a framerate of 130 fps. So alignment was the problem.


Does removing the glBindVertexArray call have much of an impact on performance? If it does, then putting everything into one buffer object, as other people have mentioned, might help: use glBufferSubData to stream in new chunks and glDrawRangeElements to draw each visible chunk. If using more recent extensions, glMapBufferRange (to write to a range of the buffer) and glDrawElementsBaseVertex (to allow use of a shared index buffer) could be useful.
The problem is that different chunks can have _very_ different sizes, and the maximum size is huge. And because there will be frequent chunk loads/unloads, I don't see how I can put everything in a single buffer without doing some scary memory management to avoid fragmentation.

I will do the matrix optimization, as the CPU will have more work to do in the real application.


What exactly do you mean by interleaved buffers?
Having coordinates and color in the same buffer, "mixed" in this order: coords of vertex 1, color of vertex 1, coords of vertex 2, color of vertex 2, and so on.

But as Dan Bartlett suggested, the real problem was that my color buffer was using one-byte colors. Aligning the color buffer on 32 bits takes more memory, but is way faster.

The good news is I'll probably find some use for the extra bytes when I implement lighting.
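For the record, the interleaved layout I ended up with looks roughly like this (field names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// One interleaved vertex: 12 bytes of position plus a color index padded
// to 4 bytes, so the stride is 16 and every attribute is 32-bit aligned.
struct Vertex {
    float        x, y, z;
    std::int32_t color; // low byte = color index; 3 spare bytes for later
};
static_assert(sizeof(Vertex) == 16, "unexpected padding");
```

Both attribute pointers then use a stride of sizeof(Vertex), with offsetof(Vertex, color) as the color offset.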

Dan Bartlett
11-14-2011, 06:56 AM
I've come across the alignment problem when using 3 bytes for colors (red, green and blue), tightly packed. It's better to include the extra byte for alpha and use a stride of 4, even if the alpha field is unused (just fill it with 255).

Looking at http://developer.amd.com/media/gpu_assets/ATI_OpenGL_Programming_and_Optimization_Guide.pdf for data types that take up 2 bytes you'll also have performance problems if you use 1 or 3 elements, since each will take up 2 or 6 bytes and would therefore need 2 bytes padding to be 32-bit aligned for better performance (at least on ATi hardware, and probably on others too).

lorant
11-14-2011, 11:50 PM
That document is very interesting!

However, that just doesn't match what is going on with my code.

I've done some more tests, and the only way to get good performance is to use 32-bit integers for the color index. If I use 32-bit aligned shorts or bytes, performance drops. This happens whether I use two buffers or a single interleaved one (the drop is larger in the latter case).

Which is exactly the opposite of what is stated in the ATI document: that I should avoid using integers in VBOs.

So either I have something horribly wrong in my code, or there's something I don't understand correctly in the recommendations.

If I'm not mistaken, the only places I could be doing something wrong are the vertex shader and the buffer format. The vertex shader is really simple:



#version 330 core

in vec3 position;
in int color;

uniform mat4 mvp_matrix;

flat out int vs_color; // integer vertex outputs must be flat-qualified in GLSL 3.30
out vec3 vs_position;

void main(void)
{
    gl_Position = mvp_matrix * vec4(position, 1.0);
    vs_position = position;
    vs_color = color;
}


If I use a color buffer containing 32-bit integers:



vector<int32_t> colors;

glVertexAttribIPointer(color_attribute, 1, GL_INT, sizeof(int32_t), 0);

glBufferData(GL_ARRAY_BUFFER, colors.size() * sizeof(int32_t),
             &colors[0], GL_STATIC_DRAW);

...then I get good performance (130 fps).

If I use 32-bit aligned bytes instead:


struct VertexColor {
    int8_t  color;
    int8_t  pad1;
    int16_t pad2;
};
vector<VertexColor> colors;

glVertexAttribIPointer(color_attribute, 1, GL_BYTE, sizeof(VertexColor), 0);

glBufferData(GL_ARRAY_BUFFER, colors.size() * sizeof(VertexColor),
             &colors[0], GL_STATIC_DRAW);

...everything is correctly displayed, but performance drops to 50 fps.

The important thing is that I know how to get a good framerate, but I'd like to understand what is going on…

Alfonse Reinheart
11-15-2011, 12:00 AM
However, that just doesn't match what is going on with my code.

Of course it doesn't. You are using integral attributes: attributes that are actually integers, rather than those that are converted to floats.

That document is for R300, R400, and R500, none of which could use integral attributes. Though, it should be noted that this is the first time you've said you're using integral attributes.

When I said that you should use integers, I did not mean using `glVertexAttribIPointer`. I meant using the regular `glVertexAttribPointer` with integer types and no normalization.

The vertex shader still gets "floats", but the data in the buffer are "integers".

lorant
11-15-2011, 12:36 AM
Ok, I understand now. Thanks for the explanation. And I'm sorry I forgot to mention using int in the shader...

I switched to integral attributes to be able to use texelFetch in the fragment shader. I should have posted the code, but in my mind it was completely straightforward...

Alfonse Reinheart
11-15-2011, 11:12 AM
I switched to integral attributes to be able to use texelFetch in the fragment shader.

So you switched to using integral vertex attributes, so your fragment shader could use texelFetch. Couldn't you just use a cast in the fragment shader?

lorant
11-15-2011, 12:15 PM
I just tried with floats and a cast, performance is the same, at least on my ATI.

I prefer to keep integers, because in the future I'll probably use the 3 unused bytes (using masks) for some lighting data, and I'm afraid that conversions from CPU int, to CPU/GPU float, to GPU int could introduce errors.

(Also, I may move texelFetch to the vertex shader, depending on whether I choose vertex or pixel lighting.)