Unexpected frame time spike (instanced drawing, vertex shader)

I’m drawing a particle system using instancing with the following vertex shader and am noticing very poor performance.

The frame time goes back down to normal when I uncomment the last line of main() and comment out the line above it.

I also notice CPU usage going up considerably, where it’s virtually nil otherwise. I’ve tried playing around with precision and const qualifiers without any luck.

Just wondering if anyone has recognized this problem before and if there is a work-around.

This is on Windows 8.1 using a Quadro K1100M “Forceware” driver. I’ll be happy to post more info if asked.

struct Particle
{
    vec4 pos;
    vec4 vel;
    vec4 col;
    float age;
};

layout (binding=0) buffer particle_data
{
    Particle particles[];
};

// declare a const array of offset for the vertex position coordinates (so each instance can easily be rendered as a small box)
mediump const vec4 vert_pos_offsets[4] = {
    vec4(-0.005,  0.005, 0.0, 0.0),
    vec4(-0.005, -0.005, 0.0, 0.0),
    vec4( 0.005,  0.005, 0.0, 0.0),
    vec4( 0.005, -0.005, 0.0, 0.0)
};

out mediump vec4 color;
void main(void)
{
    Particle p = particles[gl_InstanceID];
    color = p.col;
    gl_Position = p.pos + vert_pos_offsets[gl_VertexID];
    // gl_Position = p.pos;
}

[QUOTE=teeburt;1263609]I’m drawing a particle system using instances with the following vertex shader and am noticing very awful performance. … Windows 8.1 using a Quadro K1100M “Forceware” driver. …


...
// declare a const array of offset for the vertex position coordinates (so each instance can easily be rendered as a small box)  
mediump const vec4 vert_pos_offsets[4] = {
  vec4(-0.005,  0.005, 0.0, 0.0),
  vec4(-0.005, -0.005, 0.0, 0.0),
  vec4( 0.005,  0.005, 0.0, 0.0),
  vec4( 0.005, -0.005, 0.0, 0.0)
};   
...
gl_Position = p.pos + vert_pos_offsets[gl_VertexID];
...

[/QUOTE]

Sounds like Ilian’s results here:

See if his advice gets you some benefit.

Also, what you have doesn’t look quite like the right syntax. See the attached thread.

GLSL 4.20 (or ARB_shading_language_420pack) added the ability to use C-style brace initializers like that. And considering his hardware (and his use of a buffer variable), his shader ought to be able to use that.
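For reference, a sketch of the two forms side by side (pick one or the other; not tested against the poster’s driver). The constructor syntax has worked since GLSL 1.20, while the brace form needs 4.20 or the extension:

```glsl
// Constructor syntax: valid since GLSL 1.20
const vec4 vert_pos_offsets[4] = vec4[4](
    vec4(-0.005,  0.005, 0.0, 0.0),
    vec4(-0.005, -0.005, 0.0, 0.0),
    vec4( 0.005,  0.005, 0.0, 0.0),
    vec4( 0.005, -0.005, 0.0, 0.0)
);

// C-style braces: requires GLSL 4.20+ or ARB_shading_language_420pack
const vec4 vert_pos_offsets_braced[4] = {
    vec4(-0.005,  0.005, 0.0, 0.0),
    vec4(-0.005, -0.005, 0.0, 0.0),
    vec4( 0.005,  0.005, 0.0, 0.0),
    vec4( 0.005, -0.005, 0.0, 0.0)
};
```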

[QUOTE=Dark Photon;1263610]Sounds like Ilian’s results here:

See if his advice gets you some benefit.

[/QUOTE]

Yes, that looks similar to what I am experiencing. Unfortunately Ilian’s solution has no effect for me. I also tried moving the const array into a uniform buffer, and the problem is still there.

I will try testing on some other platforms to see if it keeps showing up. It’s hard to see how just using gl_VertexID would bog things down so terribly.

It’s probably not the use of gl_VertexID. It’s the fact that “vert_pos_offsets[gl_VertexID]” has to do actual work. It has to be an array access. Whereas the other version doesn’t require accessing an array.

Your shader doesn’t do very much. It simply fetches some memory, then stores it in output variables. So something relatively small like that will have a large impact in the VS’s performance.

[QUOTE]It’s probably not the use of gl_VertexID. It’s the fact that “vert_pos_offsets[gl_VertexID]” has to do actual work. It has to be an array access. Whereas the other version doesn’t require accessing an array.[/QUOTE]

How much time can this take? Let’s play hangman.

I’ll buy a “U”.

Perhaps my real question is, what would make this array access take so long… [STRIKE]Should I look for a sync issue with the compute shader that updates the buffer data?[/STRIKE]

EDIT: Nevermind that - the question is: why is it expensive to read from a constant array using gl_VertexID as the index?

The first thing I’d do would be to look into the KHR_debug output; Nvidia may give some hints if the shader hits a slow path.

teeburt, a curious question: you said you were on “Windows 8.1 using a Quadro K1100M”, right? Are you using OpenGL directly, or OpenGL ES with some emulator? (If so, which?)

[QUOTE]why is it expensive to read from a constant array using gl_VertexID as the index?[/QUOTE]

I’d try adding a constant offset to the vertex position. Likely the vertex shader is skipped entirely if nothing is done in it, so it’d be the overhead of invoking the vertex shader that causes the time spike, not the simple addition that is applied, nor the array access by itself.

Thanks for your help, all. I finally found the problem: accessing the structured buffer was slow due to the structure size (stride).

Going back to my example, you can see the total size of the struct is 16+16+16+4 == 52 bytes.

//BAD
struct Particle
{
    vec4 pos;
    vec4 vel;
    vec4 col;
    float age;
};

Huge no-no!! I added some padding to make it an even 64 (on the CPU). Now running at full frame rate. Lesson learned! At least it wasn’t a one-liner.

By the way, this is on OpenGL 4.3 (desktop) using Nvidia drivers. Linux and Windows both have the problem.

//GOOD
struct Particle
{
    vec4 pos;
    vec4 vel;
    vec4 col;
    float age;
    float pad1;
    float pad2;
    float pad3;
};

What layout was your buffer variable using?

The only thing I specified was the binding point (my example actually was the entire shader). I don’t understand much about the layout specifier, but if it applies here that would be enlightening.

I should make it clear: I only had to change the size of the struct on the CPU to see a huge increase in performance.

It’s not so much of a performance question as “your code shouldn’t have worked before.” And if you really aren’t specifying the layout… your code certainly doesn’t work now. Oh, I’m sure it appears to work, but one driver update could change all that. Taking it to someone else’s machine could break it too.

There are specific rules for how buffer-backed interface blocks (uniform and shader storage) will lay out their members. By default, the layout is entirely up to the implementation. So even with your padded struct, OpenGL makes no guarantee about how your buffer block is laid out. Which means you can’t declare a C/C++ struct and expect it to match. You managed to find a layout that matches what your current implementation does, but a different implementation could do something different.

If you want a layout that is well-defined, one that you can base a C/C++ struct off of, use std140 or std430.

That was very enlightening - thank you. I am surprised it worked at all. Now I am using std140 and kept the padding in my C struct.

I also tried std430, with no padding (using the 4.5 core spec as a guide), but it did not perform well.

When dealing with random memory access of intrinsic vector types, alignment can make a huge difference in performance, as unaligned data can increase the number of cache misses dramatically - at least this is what one learns from Intel’s Programming Guide for CPUs.
The padding makes each vec4 lie on a properly aligned memory location.

[QUOTE=hlewin;1263714]When dealing with random memory access of intrinsic vector types, alignment can make a huge difference in performance, as unaligned data can increase the number of cache misses dramatically - at least this is what one learns from Intel’s Programming Guide for CPUs.
The padding makes each vec4 lie on a properly aligned memory location.[/QUOTE]

This is what I assume as well. I didn’t profile for cache misses yet though so I can’t be sure. Actually, I expect to run into the same problem for indirect rendering. It would be nice to know how to find a good ‘word size’ for a wide range of devices.

EDIT: I mean indirect rendering with shader storage blocks - either way it seems like it’s the same problem of aligning memory correctly for the vertex shader.

My guess would be that GL buffers are always at least 16-byte aligned internally. Maybe even 32 for cards supporting doubles.
I’d say align any vec4 on a 16-byte boundary and any dvec4 on a 32-byte boundary. For the CPU, the argument is that this makes a vector reside in one cache line rather than spanning two, which may mislead the memory unit so that the second half isn’t in cache when it is needed. With sequential block access, alignment does not play such a big role, since consecutive accesses can be predicted by the memory unit: after reading a few words in a linear fashion, it will prefetch the next words by good guess.

Yeah, the trouble is supporting multiple platforms. I even considered writing a separate compute shader to copy from std140 into an ‘unqualified’ buffer layout, to see if the implementations are choosing a fast layout. But this is doubtful, since there’s no way to indicate whether I prefer to save time or save space…

This topic was automatically closed 183 days after the last reply. New replies are no longer allowed.