Pass-through vertex shader vs fixed pipeline: SLOW

Hi,

I am displaying a very simple model at about 63 fps with the fixed pipeline. The model is just plain triangles, with no textures.

The same model with a passthrough vertex shader attached (see below) only runs at about 30 fps. I tried VBOs and display lists, same result.

I was expecting the silicon-driven fixed pipeline to be faster, but I wasn’t expecting such a large difference.

My graphics card is a GeForce GT 430.

Does this result make sense to you? Another question I have is: would this performance penalty decrease with higher-end graphics cards? My guess is that expensive cards like the GTX 480 might have more vertex processing units, but their fixed pipeline “units” might also be faster - I’m not sure!

Thanks,
Fred

Passthrough vertex shader source code:

#version 400 compatibility
void main(void)
{
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}

Have you compared the passthrough vertex shader’s performance under the core profile (you will need to handle the attributes and the matrix uniforms yourself) rather than the compatibility profile?
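On the application side that roughly means something like this (just a sketch; prog, vbo, ibo, mvp and the "in_vertex"/"mvp" names are all made up, not your code):

/* Sketch: core-profile style setup, feeding the matrix and vertex position
   yourself instead of relying on gl_ModelViewProjectionMatrix / gl_Vertex. */
GLint mvpLoc = glGetUniformLocation(prog, "mvp");
GLint posLoc = glGetAttribLocation(prog, "in_vertex");

glUseProgram(prog);
glUniformMatrix4fv(mvpLoc, 1, GL_FALSE, mvp);   /* mvp = 16 floats, column-major */

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableVertexAttribArray(posLoc);
glVertexAttribPointer(posLoc, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);

glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, (void*)0);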

How are your geometry and other OpenGL states set up? Maybe it’s something else down the pipeline that is forcing the software path.

Do you use this with a fragment shader?

Hi,

When using both passthrough vertex and fragment shaders, the fps is the same.

I did some tests with a Mobility Radeon HD 5470 and the results are similar, though the penalty is smaller. The raw model is displayed at 45 fps, and at 32 fps with pass-through shaders attached.

I am not using vertex attributes.

I am not using a core GL 3.x profile, created with wglCreateContextAttribsARB(), just a regular GL 1.x profile created with wglCreateContext().

Using a GL 3.x profile and vertex attributes is a bit tricky for me but it is doable. Do you guys think this could make a noticeable difference? From what I can understand, GL 3.x demands that you always have shaders bound. Now the question is, will a GL 3.x application with shaders bound be just as fast as a GL 2.x application with the fixed pipeline?
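If I go that route, I believe the context creation part would look roughly like this (untested sketch, based on the wglext.h definitions; error checking omitted):

/* Untested sketch: creating a GL 3.2 core context on Windows.
   A temporary legacy context is needed first to fetch the ARB entry point. */
HGLRC tmp = wglCreateContext(hdc);
wglMakeCurrent(hdc, tmp);

PFNWGLCREATECONTEXTATTRIBSARBPROC wglCreateContextAttribsARB =
    (PFNWGLCREATECONTEXTATTRIBSARBPROC)wglGetProcAddress("wglCreateContextAttribsARB");

const int attribs[] = {
    WGL_CONTEXT_MAJOR_VERSION_ARB, 3,
    WGL_CONTEXT_MINOR_VERSION_ARB, 2,
    WGL_CONTEXT_PROFILE_MASK_ARB,  WGL_CONTEXT_CORE_PROFILE_BIT_ARB,
    0
};
HGLRC ctx = wglCreateContextAttribsARB(hdc, 0, attribs);

wglMakeCurrent(NULL, NULL);
wglDeleteContext(tmp);
wglMakeCurrent(hdc, ctx);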

Cheers,
Fred

Do you re-bind shader(s) each frame?

I wonder what would happen if, instead of using gl_ModelViewProjectionMatrix, you precalculated the MVP matrix just once in software and sent it to the shader as a uniform? I suspect that your shader compiler is not optimizing out this multiplication properly and is actually recalculating MVP per-vertex instead (even if it doesn’t change).
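In other words, something like this once per frame on the CPU (just a sketch; mat4_mul and the matrix arrays are placeholders for whatever math code you already have, and "mvp" is a made-up uniform name):

/* Sketch: build the MVP matrix once on the CPU and upload it as a plain uniform,
   so the shader only does a single mat4 * vec4 per vertex. */
float mvp[16];
mat4_mul(mvp, projection, modelview);          /* hypothetical helper: mvp = P * MV */

glUseProgram(prog);
GLint loc = glGetUniformLocation(prog, "mvp"); /* cache this at init in real code */
glUniformMatrix4fv(loc, 1, GL_FALSE, mvp);

with the vertex shader then reduced to gl_Position = mvp * gl_Vertex;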

Yes, I have a glUseProgram(progid) at the beginning, then a glUseProgram(0) at the end of each frame.

63 FPS and 30 FPS to display a simple model are both very low; what else are you doing? If that were really all you were doing, you should expect hundreds or thousands of frames per second on that level of hardware. Are you doing anything else that forces software rendering, and do you have up-to-date drivers (from NVIDIA) for your card?

Is

void main(void)
{
    gl_Position = ftransform();
}

just as slow?

“Yes, I have a glUseProgram(progid) at the beginning, then a glUseProgram(0) at the end of each frame.”

And why on earth do you do that??
If you want to compare performance, try not to add differences between the tests…

“And why on earth do you do that?? If you want to compare performance, try not to add differences between the tests…”

The ‘raw’ model test has no glUseProgram since it doesn’t use shaders. For the shader test, I do want to keep the glUseProgram calls, as I will eventually have multiple shaders in use within a frame. Switching shaders by calling glUseProgram multiple times will be inevitable.

I do not believe these two calls are responsible for the 30 fps framerate drop I have, but I might be wrong.
And removing them is tricky in my situation.

It is just as slow. With a very small model, I see a very, very, very minor performance gain (1000 fps vs 950 fps).

The 63 fps model is relatively big. Lots and lots of draw calls (I use either display lists or VBOs, no difference whatsoever), unoptimized. Just plain vanilla triangles.

The large model might very well be falling back to software emulation on the vertex pipeline then, so display lists or VBOs would be expected to make no difference. And with software T&L, unordered triangle soup is a very bad recipe for performance, as you’ll get no vertex caching and potentially multiple passes over the same vertices. Add in lots and lots of draw calls (which kill you on the CPU side) and things get even worse.
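To illustrate the draw-call side of it: at load time you could concatenate all of those little meshes into one VBO and one index buffer and render the whole model with a single call. A rough sketch (the array names and counts are made up, not your code):

/* Sketch: merge per-mesh data at load time so the whole model is one draw call.
   allVertices/allIndices/totalVertexBytes/totalIndexCount are hypothetical names;
   each mesh's indices must be rebased to address the combined vertex array. */
GLuint vbo, ibo;
glGenBuffers(1, &vbo);
glGenBuffers(1, &ibo);

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, totalVertexBytes, allVertices, GL_STATIC_DRAW);

glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, totalIndexBytes, allIndices, GL_STATIC_DRAW);

/* Per frame: one bind and one draw instead of hundreds. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (void*)0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glDrawElements(GL_TRIANGLES, totalIndexCount, GL_UNSIGNED_INT, (void*)0);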

Did you try calculating MVP in software and sending it once per frame as a uniform instead of using gl_ModelViewProjectionMatrix yet?

The small model runs at 1450 fps normally. So I am losing 450 fps because of the vertex shader. So quite a big drop here too.

Trying that now.

I’m almost sure that glUseProgram is not "guilty" for the fps drop.
You have some other problem. Try to find it using glGetError. I bet it will return some error that causes the low performance. With a vertex shader like yours you should get even higher performance than with the FFP. There is no doubt that the default shaders are very well optimized, but they are more complicated than yours.
It would be much easier to debug the application with the debug_output extension.
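For the glGetError route, a trivial helper you can drop in after suspect calls might look like this (just a sketch; GL headers assumed to be included):

#include <stdio.h>

/* Sketch: drain and print any pending GL errors, tagged with the call site. */
static void checkGL(const char *where)
{
    GLenum err;
    while ((err = glGetError()) != GL_NO_ERROR)
        fprintf(stderr, "GL error 0x%04X after %s\n", err, where);
}

/* usage: checkGL("glUseProgram"); checkGL("glDrawElements"); ... */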

Do you recompile/relink the program each frame? Although the program is simple, recompiling/relinking can take a considerable amount of time.

I have just tried that and I’m getting the same results. I can only test on the ATI platform today (GL4-class Mobility Radeon HD 5470). My framerate goes from 45 fps raw to 33 fps with a vertex shader.

I am trying to use vertex attributes as well as a GL3 context now.

No, I just link once. I then only use glUseProgram.

Is the fixed function pipeline implemented with shaders behind the scenes, in my situation (GT 430 / Radeon HD 5xxx)? You seem to be saying this is the case - is it?

BTW I don’t believe my VS runs in software emulation. First, I don’t know why that would be the case, and second, my model would be way, way slower with software T&L.

Graphics hardware was programmable long before shaders; with shaders, that hardware just became open. On the other hand, it would be very difficult to keep the default fixed functionality as a totally separate rendering path while developing and optimizing another path that uses shaders, especially when a huge number of users rely on the programmable path.

I bet you hit some fallback path. If you post some code fragments, maybe we could say more about it.

I did 4 more tests.

Configuration: Windows XP SP3 32-bit, GeForce GT 430, Drivers 266.58.

Test 1) GL 1.x context, Display Lists: 64 fps
Test 2) GL 1.x context, VBO: 45 fps
Test 3) GL 1.x context, VBO + shaders: 22 fps
Test 4) GL 3.2 context, VBO + shaders: 22 fps

Vertex Attributes are used in tests 3 and 4.

Shaders code:

// Vertex shader

#version 150 core

in vec3 in_vertex;
out vec3 out_color;

uniform mat4 modelViewProjectionMatrix;

void main(void)
{
    gl_Position = modelViewProjectionMatrix * vec4(in_vertex, 1.0);
    out_color = vec3(1.0, 0.0, 0.0);
}

// Fragment shader

#version 150 core

in vec3 out_color;
out vec4 out_fragcolor;

void main(void)
{
    out_fragcolor = vec4(out_color, 1.0);
}

The GL 1.x context is created with wglCreateContext(HDC). The 3.2 context is created with the ARB extension.

Here is a GL trace of a single frame of my application when running under a GL 3.2 core profile context:

| SwapBuffers(1A011B83)
| glViewport(0, 0, 751, 704)
| glScissor(0, 0, 751, 704)
| glEnable(GL_SCISSOR_TEST)
| glColorMask(TRUE, TRUE, TRUE, TRUE)
| glClearColor(0.200000, 0.200000, 0.400000, 1.000000)
| glClearDepth(1.000000)
| glDepthMask(TRUE)
| glClear(GL_DEPTH_BUFFER_BIT|GL_COLOR_BUFFER_BIT)
| glEnable(GL_DEPTH_TEST)
| glDisable(GL_SCISSOR_TEST)
| glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
| glColorMask(TRUE, TRUE, TRUE, TRUE)
| glUseProgram(3)
| glUniformMatrix4fv(0, 1, FALSE, 0607CDA0)
| glBindBuffer(GL_ARRAY_BUFFER, 1)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 2)
| glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 4)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 3)
| glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 6)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 5)
| glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 8)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 7)
| glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 10)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 9)
| glDrawElements(GL_TRIANGLES, 54, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 12)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 11)
| glDrawElements(GL_TRIANGLES, 54, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 14)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 13)
| glDrawElements(GL_TRIANGLES, 60, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 16)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 15)
| glDrawElements(GL_TRIANGLES, 60, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glUseProgram(0)
| glGetError()
| glDisable(GL_DEPTH_TEST)
| glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
| glColorMask(TRUE, TRUE, TRUE, TRUE)
| glViewport(0, 0, 800, 600)
| glGetError()
| SwapBuffers(1A011B83)

Note that the framework I am using unbinds the vertex and element buffers after each draw (glBindBuffer(GL_ARRAY_BUFFER, 0) and glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)). I wonder if this could be the cause of the slowdown, but I would be very surprised if it were.
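For what it’s worth, I suppose I could capture each mesh’s buffer bindings in a VAO once at load time (and as far as I understand a core profile wants a VAO bound anyway), so the per-frame loop becomes just a bind and a draw. Something like this, untested, with my own placeholder names:

/* Untested sketch: one VAO per mesh, set up once at load time. */
GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);

glBindBuffer(GL_ARRAY_BUFFER, meshVbo);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, meshIbo);  /* element binding is stored in the VAO */

glBindVertexArray(0);

/* Per frame, per mesh: */
glBindVertexArray(vao);
glDrawElements(GL_TRIANGLES, meshIndexCount, GL_UNSIGNED_INT, (void*)0);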

Any ideas?

Cheers,
Fred

Should I conclude from these test results that VBO switching is what kills my framerate?

But in this case why would attaching simple shaders make the framerate drop from 45 to 22 fps? That makes little sense.