Does the order in which a model's vertices are loaded influence performance?

Hi everyone

I'm trying to figure out a strange behaviour in my code. Two almost identical models, which differ only in whether their vertices are specified in a spatially sorted or unsorted order, show very different performance.

I have 2 models:

  1. let's call it the 'original': it's made of about 640K vertices.
  2. a LOD of the original, obtained by an octree subdivision… it has almost 600K vertices.

Most importantly, I'm doing point-based rendering with point sprites (oriented discs taken from a texture atlas). Every vertex has a normal and a color; positions, normals and colors are loaded into 3 different VBOs.
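
For context, the upload is roughly like this (a simplified sketch; GLEW and all the names below are placeholders, not my exact code):

    #include <GL/glew.h>   // assuming GLEW (or similar) exposes the VBO entry points
    #include <vector>

    // One VBO each for positions, normals and colors (3 floats per vertex).
    GLuint vboPos = 0, vboNrm = 0, vboCol = 0;

    static GLuint makeVbo(const std::vector<float>& data)
    {
        GLuint vbo = 0;
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, data.size() * sizeof(float),
                     data.data(), GL_STATIC_DRAW);
        return vbo;
    }

    void uploadModel(const std::vector<float>& positions,
                     const std::vector<float>& normals,
                     const std::vector<float>& colors)
    {
        vboPos = makeVbo(positions);   // the vertex order in here is exactly
        vboNrm = makeVbo(normals);     // what differs between the two models
        vboCol = makeVbo(colors);
    }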

My code is quite simple:

activate depth test
activate alpha test (set function)

//render loop
clear color and depth buffer
activate shaders
do the rendering (point sprite state, draw call, etc.) -> then it's up to the vertex and fragment shaders
disable shaders
glutPostRedisplay();
glutSwapBuffers();
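
In more concrete (but simplified) terms, the display callback looks roughly like this; the program handle, VBO names and counts are placeholders for my actual variables:

    GLuint  program;       // point-sprite shaders, compiled and linked elsewhere
    GLuint  vboPos;        // position VBO (uploaded as above)
    GLsizei vertexCount;   // number of points in the current model

    void display()
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        glUseProgram(program);

        glEnable(GL_POINT_SPRITE);                  // per-fragment sprite texcoords
        glEnable(GL_VERTEX_PROGRAM_POINT_SIZE);     // size is written by the vertex shader
        glTexEnvi(GL_POINT_SPRITE, GL_COORD_REPLACE, GL_TRUE);

        glBindBuffer(GL_ARRAY_BUFFER, vboPos);
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, 0);
        // (the normal and color VBOs are bound the same way, omitted here)

        glDrawArrays(GL_POINTS, 0, vertexCount);    // all the points of the model

        glDisableClientState(GL_VERTEX_ARRAY);
        glUseProgram(0);

        glutPostRedisplay();
        glutSwapBuffers();
    }

The depth test and the alpha test are enabled once at initialisation, as in the outline above.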

Recording in a benchmark the time needed to render each frame, I noticed that the original model performs better than its LOD… at least 2 times better…
When loading the two different models the procedure is exactly the same and nothing else changes…
Playing around with the code, I noticed that my LOD has its vertices specified in a certain spatial order (as a result of the octree subdivision), while the original does not…
If I shuffle the vertex positions in my LOD, the two models' performance becomes comparable, as one would expect…

I made further tests and noticed that, with the depth test disabled, the LOD model appears more 'consistent', as if its front face were compact, while the original looks far fuzzier…
Is that because the first (or last) vertices drawn are spatially contiguous?
Many unavoidable artifacts affect my rendering, such as aliasing… I was wondering whether some of these issues might discard many fragments and lead to faster rendering, while the LOD, being sorted, is less affected by them but slows down…

If you can help I would be very grateful (it’s an urgent matter)

best regards

One possible factor: if primitives overlap, that imposes an ordering dependency. Primitives must either be processed in the order in which they are specified, or the implementation must at least ensure that the result is as if they had been processed in order. If primitives don’t overlap, then the result is trivially independent of the processing order, which makes processing easier to parallelise.

Thank you for the reply.

Anyway, I haven't really understood the following:

Primitives must either be processed in the order in which they are specified, or the implementation must at least ensure that the result is as if they had been processed in order

What do you mean precisely by an ordering dependency? How can that affect parallelization? (My data is loaded into the VBO contiguously, as 3 floats (xyz) per vertex.)
At first I thought this was related to depth testing… but it seems it is not…

I have only point sprites, without any connectivity, and their size is set in the vertex shader via gl_PointSize. Anyway, I think the points inevitably overlap most of the time, especially because that is the behaviour I pursued and implemented (a point should be large enough to reconstruct the surface it belongs to).
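
For reference, the relevant part of a vertex shader doing this looks roughly as follows (a compatibility-profile GLSL sketch; the uniform name and the attenuation formula are illustrative, not my exact code):

    // vertex shader (GLSL, compatibility profile)
    #version 120
    uniform float u_pointScale;   // scale factor set from the application
    void main()
    {
        vec4 eyePos = gl_ModelViewMatrix * gl_Vertex;
        gl_Position = gl_ProjectionMatrix * eyePos;
        // bigger when close, smaller when far, so that neighbouring
        // sprites overlap enough to cover the sampled surface
        gl_PointSize = u_pointScale / -eyePos.z;
        gl_FrontColor = gl_Color;
    }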

The loading order of the vertices does not have an impact on the rendering. However, the order in which you send these vertices to GL does, especially if depth testing is enabled. Using any spatial partitioning structure should normally help you send the vertices in a more coherent order, so that rendering is (normally) improved compared to an erratic vertex order. Since you are using octrees, how the vertices are grouped inside the nodes is relevant, as GClements said.

But since you render sprites, I would rather suspect a wrong send order of the vertices to GL.

Thank you too, Silence.

Why do you think that, in the case of sprites, a sparse order would be better?
And anyway, why exactly should spatial coherence help performance?

You will generally want to draw in 'front-to-back' order, to let the hardware discard z-failed fragments as early as possible. The more occlusion you have, the more this holds.

One thing you can try, to check this, is to rotate your camera around the scene (and thus go through your tree in a different order) and see if (and how) the framerate changes.
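
For instance, something along these lines, assuming you keep one glDrawArrays range per octree node (the Node structure and all the names here are placeholders, not your code):

    #include <GL/glut.h>
    #include <algorithm>
    #include <vector>

    struct Node {
        float   cx, cy, cz;   // node centre
        GLint   first;        // first point of this node in the VBO
        GLsizei count;        // number of points in this node
    };

    // Draw the nodes nearest-first, so occluded sprites are more likely to
    // fail the depth test early instead of being shaded and then overwritten.
    void drawFrontToBack(std::vector<Node>& nodes,
                         float camX, float camY, float camZ)
    {
        std::sort(nodes.begin(), nodes.end(),
                  [&](const Node& a, const Node& b) {
                      float da = (a.cx - camX) * (a.cx - camX)
                               + (a.cy - camY) * (a.cy - camY)
                               + (a.cz - camZ) * (a.cz - camZ);
                      float db = (b.cx - camX) * (b.cx - camX)
                               + (b.cy - camY) * (b.cy - camY)
                               + (b.cz - camZ) * (b.cz - camZ);
                      return da < db;   // nearest node first
                  });

        for (const Node& n : nodes)
            glDrawArrays(GL_POINTS, n.first, n.count);
    }

If the framerate varies a lot with the draw order (or the camera position), that tells you how much of the cost is occlusion-related.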

This makes sense to me, but why do I see the same behaviour even when I disable the depth test?

If a draw call renders multiple primitives which modify a given pixel, the pixel’s value at the end of the draw call must be that resulting from the last primitive which included that pixel.

If depth tests are enabled, the rendering order still matters in cases where both primitives have the same depth value for the pixel (if the depth comparison is GL_LESS or GL_GREATER, the second primitive will fail the test and the value from the first primitive will be used; if the comparison is GL_LEQUAL or GL_GEQUAL, the second primitive will pass the depth test and the value from the second primitive will be used).

Additionally, depth tests and blending involve a read-modify-write operation on the framebuffer. For each primitive, the value read must be that written by the preceding primitive.

But if two primitives can easily be determined not to overlap, then none of this matters. The two primitives can be rendered in either order or in parallel, which may allow for higher utilisation of the GPU.

I certainly read your first post badly: it made me believe that the issue disappeared when you disabled depth testing, which obviously is not the case. So definitely, what I said above was not relevant for you…

So here are my other two cents; I'm not sure they will help you:

How do things go if you disable alpha testing?
Do you have the same number of draw calls with and without the octree?
Same question for buffer bindings, shader bindings and uniform updates?
Do you make use of transparency (blending), or only alpha testing?
Do the alpha test directly in the fragment shader and discard the fragment if appropriate (see the sketch below).
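
For the last point, a fragment shader along these lines replaces a classic glAlphaFunc(GL_GREATER, 0.1)-style alpha test (just a sketch; the sampler name, the threshold and the atlas lookup are illustrative, not your code):

    // fragment shader (GLSL, compatibility profile)
    #version 120
    uniform sampler2D u_atlas;   // your texture atlas
    void main()
    {
        // gl_PointCoord is the per-sprite coordinate; remap it into the
        // correct atlas sub-region however you already do for your discs
        vec4 c = texture2D(u_atlas, gl_PointCoord);
        if (c.a <= 0.1)          // same threshold as glAlphaFunc(GL_GREATER, 0.1)
            discard;             // cut away everything outside the disc
        gl_FragColor = vec4(gl_Color.rgb, c.a);
    }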

For GClements:
So in both cases, sparse data would perform better? So, theoretically, you would expect the worst case to be a model of N points all with the same coordinates, both with and without the depth test?
Anyway, how does this connect to my case, where shuffling the data positions (the shuffle is done before uploading to the VBO; using a different indexing during the draw doesn't work) significantly alters performance?

For Silence:
Let's say I'm definitely more interested in the case with the depth test enabled, since that's what I'm going to use, and that's where I have to explain the performance difference… disabling it was just an experiment…

  • disabling the alpha test doesn't change much
  • I debugged with a horrible cout; it prints the same number of points (at least the counts passed to glDrawArrays…)
  • both approaches run under identical conditions; they execute the same code
  • no blending, only alpha testing (I need it to make the sprite represent a circle, cutting out the border)
  • no, it is the 'fixed pipeline' alpha test; I don't perform it in the shaders… the only other state is backface culling, which is now off for these tests:
    glEnable(GL_ALPHA_TEST);
    glAlphaFunc(GL_GREATER, 0.1);

[QUOTE=GClements;1283604]If a draw call renders multiple primitives which modify a given pixel, the pixel’s value at the end of the draw call must be that resulting from the last primitive which included that pixel.

If depth tests are enabled, the rendering order still matters in cases where both primitives have the same depth value for the pixel (if the depth comparison is GL_LESS or GL_GREATER, the second primitive will fail the test and the value from the first primitive will be used; if the comparison is GL_LEQUAL or GL_GEQUAL, the second primitive will pass the depth test and the value from the second primitive will be used).

Additionally, depth tests and blending involve a read-modify-write operation on the framebuffer. For each primitive, the value read must be that written by the preceding primitive.

But if two primitives can easily be determined not to overlap, then none of this matters. The two primitives can be rendered in either order or in parallel, which may allow for higher utilisation of the GPU.[/QUOTE]

Really interesting. And since the OP uses large point sizes, this happens even more often (the overlapped area grows with the square of the point size).

Let me add another important detail.
I have a LOD manager that subdivides a model into 8^x cubes (here x=3, i.e. 512 cubes) and renders them based on the camera's distance from each of these cubes.

While ordering and loading the vertex data into the cubes, the same problem appeared: a slowdown during rendering.
As a test, I set only the highest-level LOD to be rendered, i.e. the LOD model. Shuffling the order in which the draw calls were issued for the sub-cubes produced no substantial change…
Again, it seems like it's the order of the data inside the VBO that makes the difference!!!

For this test with the LOD system, I loaded the vertices of the LOD model and partitioned the space into 512 cubes…
The drawing code then checks the camera-to-cube-centre distance and, if the value is inside a threshold (set to infinity in this test, so every element is always drawn), draws the cube's content.
Then I tried shuffling the order in which the cubes were drawn, without any real change… (with 512 cubes, maybe the data inside each one is not enough for the shuffle to matter the way it did when I shuffled the whole model manually before…?)

This makes me think that it is the order of the data inside the VBO that makes the difference?