I’m drawing a particle system using instancing with the following vertex shader and am noticing terrible performance.
The frame time goes back down to normal when I uncomment the last line of main() and comment out the line above it.
I also notice the CPU use going up considerably where it’s virtually nil otherwise. I’ve tried playing around with precision and const keywords without any luck.
Just wondering if anyone has recognized this problem before and if there is a work-around.
This is on Windows 8.1 using a Quadro K1100M “Forceware” driver. Will be happy to post more info if asked.
// declare a const array of offsets for the vertex position coordinates (so each instance can easily be rendered as a small box)
const mediump vec4 vert_pos_offsets[4] = {
vec4(-0.005, 0.005, 0.0, 0.0),
vec4(-0.005, -0.005, 0.0, 0.0),
vec4( 0.005, 0.005, 0.0, 0.0),
vec4( 0.005, -0.005, 0.0, 0.0)
};
out mediump vec4 color;
void main(void)
{
Particle p = particles[gl_InstanceID];
color = p.col;
gl_Position = p.pos + vert_pos_offsets[gl_VertexID];
// gl_Position = p.pos;
}
[QUOTE=teeburt;1263609]I’m drawing a particle system using instances with the following vertex shader and am noticing very awful performance. … Windows 8.1 using a Quadro K1100M “Forceware” driver. …
...
// declare a const array of offsets for the vertex position coordinates (so each instance can easily be rendered as a small box)
const mediump vec4 vert_pos_offsets[4] = {
vec4(-0.005, 0.005, 0.0, 0.0),
vec4(-0.005, -0.005, 0.0, 0.0),
vec4( 0.005, 0.005, 0.0, 0.0),
vec4( 0.005, -0.005, 0.0, 0.0)
};
...
gl_Position = p.pos + vert_pos_offsets[gl_VertexID];
...
[/QUOTE]
GLSL 4.20 (or ARB_shading_language_420pack) added the ability to use C-style initializer lists for arrays like that; before that, you would have to use the array constructor syntax, vec4[4](...). And considering his hardware (and his use of a buffer variable), his shader ought to be able to use the C-style form.
Yes, that looks similar to what I am experiencing. Unfortunately Ilian’s solution has no effect for me. I also tried moving the const array into a uniform buffer, and the problem is still there.
I will try and test on some other platforms to see if it keeps showing up. Hard to see how just using gl_VertexID would bog things down so terribly.
It’s probably not the use of gl_VertexID. It’s the fact that “vert_pos_offsets[gl_VertexID]” has to do actual work: an indexed array access. The other version doesn’t access an array at all.
Your shader doesn’t do very much; it simply fetches some memory and stores it in output variables. So in a shader that small, even something relatively minor like that array access can have a large impact on the VS’s performance.
[QUOTE]It’s probably not the use of gl_VertexID. It’s the fact that “vert_pos_offsets[gl_VertexID]” has to do actual work. It has to be an array access. Whereas the other version doesn’t require accessing an array.[/QUOTE]
Perhaps my real question is, what would make this array access take so long… [STRIKE]Should I look for a sync issue with the compute shader that updates the buffer data?[/STRIKE]
EDIT: Nevermind that - the question is: why is it expensive to read from a constant array using gl_VertexID as the index?
teeburt, a curious question: you said you were on "Windows 8.1 using a Quadro K1100M", right? Are you using OpenGL directly, or OpenGL ES through some emulator? (If so, which?)
[QUOTE]why is it expensive to read from a constant array using gl_VertexID as the index?[/QUOTE]
I’d try adding a constant offset to the vertex position, e.g. gl_Position = p.pos + vec4(0.005, 0.005, 0.0, 0.0);. Likely the vertex shader is skipped (or optimized down to a pass-through) if nothing is done there.
In that case it would be the overhead of actually invoking the vertex shader that causes the time spike, not the simple addition that is applied, nor the array access by itself.
The only thing I specified was the binding point (my example actually was the entire shader). I don’t understand much about the layout specifier, but if it applies here that would be enlightening.
I should make it clear: I only had to change the size of the struct on the CPU to see a huge increase in performance.
It’s not so much a performance question as “your code shouldn’t have worked before.” And if you really aren’t specifying the layout, your code certainly doesn’t work now. Oh, I’m sure it appears to work, but one driver update could change all that, and taking it to someone else’s machine could break it too.
There are specific rules for how buffer-backed interface blocks (uniform and shader storage) lay out their members. By default, the layout is entirely up to the implementation. So even with your padded struct, OpenGL makes no guarantee about how your buffer block is laid out, which means you can’t declare a C/C++ struct and expect it to match. You managed to find a layout that matches what your current implementation does, but a different implementation could do something different.
If you want a layout that is well-defined, one that you can base a C/C++ struct off of, use std140 or std430.
When dealing with random memory access of intrinsic vector types, alignment can make a huge difference in performance, as unaligned data can increase the number of cache misses dramatically - at least this is what one learns from Intel’s optimization guides for its CPUs.
The padding makes each vec4 lie on a properly aligned memory location.
[QUOTE=hlewin;1263714]When dealing with random memory access of intrinsic vector types, alignment can make a huge difference in performance, as unaligned data can increase the number of cache misses dramatically - at least this is what one learns from Intel’s optimization guides for its CPUs.
The padding makes each vec4 lie on a properly aligned memory location.[/QUOTE]
This is what I assume as well, though I haven’t profiled for cache misses yet, so I can’t be sure. Actually, I expect to run into the same problem with indirect rendering. It would be nice to know how to find a good ‘word size’ for a wide range of devices.
EDIT: I mean indirect rendering with shader storage blocks - either way it seems like it’s the same problem of aligning memory correctly for the vertex shader.
My guess would be that GL buffers are always at least 16-byte aligned internally - maybe even 32 for cards supporting doubles.
I’d say align any vec4 on a 16-byte boundary and any dvec4 on a 32-byte boundary. For the CPU, the argument is that this makes a vector reside in a single cache line rather than spanning two, which can mislead the memory unit so that the second half isn’t in cache when it is needed. With sequential block access, alignment does not play such a big role, since consecutive accesses can be predicted by the memory unit: having read a few words in a linear fashion, it will prefetch the next words by good guess.
Yeah, the trouble is supporting multiple platforms. I even considered writing a separate compute shader that copies from std140 into an ‘unqualified’ buffer layout, to see whether the implementation chooses a faster layout. But this is doubtful, since there’s no way to indicate whether I’d prefer to save time or save space…