Pass-through vertex shader vs fixed pipeline: SLOW



fred_em
03-23-2011, 01:28 AM
Hi,

I am displaying a very simple model at about 63 fps with the fixed pipeline. The model just has plain triangles and no textures.

The same model with a passthrough vertex shader attached (see below) only runs at about 30 fps. I tried both VBOs and display lists; same result.

Even though I was expecting the silicon-driven fixed pipeline test to be faster, I wasn't expecting such a difference.

My graphics card is a GeForce GT 430.

Does this result make sense to you? Another question: would this performance penalty shrink with higher-end graphics cards? My guess is that expensive cards like the GTX 480 might have more vertex processor units, but their fixed pipeline "units" might also be faster - I'm not sure!

Thanks,
Fred

Passthrough vertex shader source code:

#version 400 compatibility
void main(void)
{
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}

mobeen
03-23-2011, 02:59 AM
Have you compared the passthrough vertex shader performance with the core profile (you will need to handle attributes and uniforms for matrices yourself) and not the compatibility profile?

How are your geometry and other OpenGL states set up? Maybe it's something else down the pipeline that is forcing the software path.
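
To make "handle attributes and uniforms yourself" concrete, here is a rough, untested sketch of the CPU side (prog, vbo, mvp, vertexCount, a_position and u_mvp are all placeholder names; context creation and error checking are omitted):

/* Core profile setup: generic attributes and an explicit matrix uniform
 * replace gl_Vertex and gl_ModelViewProjectionMatrix. A VAO must also be
 * created and bound (glGenVertexArrays / glBindVertexArray) in core GL. */
GLint mvpLoc = glGetUniformLocation(prog, "u_mvp");
GLint posLoc = glGetAttribLocation(prog, "a_position");

glUseProgram(prog);
glUniformMatrix4fv(mvpLoc, 1, GL_FALSE, mvp);   /* mvp = float[16], column-major */

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableVertexAttribArray(posLoc);
glVertexAttribPointer(posLoc, 3, GL_FLOAT, GL_FALSE, 0, (void *)0);

glDrawArrays(GL_TRIANGLES, 0, vertexCount);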

tksuoran
03-23-2011, 03:35 AM
Do you use this with a fragment shader?

fred_em
03-23-2011, 05:44 AM
Hi,

When using both passthrough vertex and fragment shaders, the fps is the same.

I did some tests with a Mobility Radeon HD 5470 and the results are similar, though the penalty is smaller. The raw model is displayed at 45 fps, and at 32 fps with pass-through shaders attached.

I am not using vertex attributes.

I am not using a core GL 3.x context created with wglCreateContextAttribsARB(), just a regular GL 1.x-style context created with wglCreateContext().

Using a GL 3.x profile and vertex attributes is a bit tricky for me, but it is doable. Do you guys think this could make a noticeable difference? From what I understand, a GL 3.x core profile demands that you always have shaders bound. Now the question is: will a GL 3.x application with shaders bound be just as fast as a GL 2.x application with the fixed pipeline?

Cheers,
Fred

ZbuffeR
03-23-2011, 05:56 AM
Do you re-bind shader(s) each frame ?

mhagain
03-23-2011, 06:01 AM
I wonder what would happen if, instead of using gl_ModelViewProjectionMatrix, you precalculated the MVP matrix once in software and sent it to the shader as a uniform. I suspect that your shader compiler is not optimizing this properly and is actually recalculating MVP per-vertex (even though it doesn't change).
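
Roughly something like this, in case it helps (untested sketch; multiplyMat4 is a made-up helper, prog and u_mvp are made-up names, and the shader would then just do "gl_Position = u_mvp * gl_Vertex;"):

/* Build MVP once per frame on the CPU and upload it as a uniform, instead
 * of letting the shader read gl_ModelViewProjectionMatrix. */
GLfloat modelview[16], projection[16], mvp[16];

glGetFloatv(GL_MODELVIEW_MATRIX, modelview);
glGetFloatv(GL_PROJECTION_MATRIX, projection);
multiplyMat4(mvp, projection, modelview);       /* mvp = P * MV, column-major */

glUseProgram(prog);
glUniformMatrix4fv(glGetUniformLocation(prog, "u_mvp"), 1, GL_FALSE, mvp);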

fred_em
03-23-2011, 06:58 AM
Do you re-bind shader(s) each frame ?
Yes, I have a glUseProgram(progid) at the beginning, then a glUseProgram(0) at the end of each frame.

Dan Bartlett
03-23-2011, 07:23 AM
63 FPS and 30 FPS to display a simple model are both very low; what else are you doing? If that was all you were doing, you should expect hundreds or thousands of frames per second on that level of hardware. Are you doing anything else that forces software rendering, and do you have up-to-date drivers (from NVIDIA) for your card?

Is

void main(void)
{
    gl_Position = ftransform();
}

just as slow?

ZbuffeR
03-23-2011, 07:35 AM
Do you re-bind shader(s) each frame ?
Yes, I have a glUseProgram(progid) at the beginning, then a glUseProgram(0) at the end of each frame.

And why on earth do you do that ??
If you want to compare performance, try not to add differences between the tests...

fred_em
03-23-2011, 07:50 AM
Do you re-bind shader(s) each frame ?
Yes, I have a glUseProgram(progid) at the beginning, then a glUseProgram(0) at the end of each frame.

And why on earth do you do that ??
If you want to compare performance, try not to add differences between the tests...
The 'raw' model test has no glUseProgram since it doesn't use shaders. For the shader test, I do want to keep the glUseProgram calls, as I will eventually have multiple shaders being used in a frame. Switching shaders by calling glUseProgram multiple times will be inevitable.

I do not believe these two calls are responsible for the 30 fps framerate drop I have, but I might be wrong.
And removing them is tricky in my situation.

fred_em
03-23-2011, 07:54 AM
63 FPS and 30 FPS to display a simple model are both very low; what else are you doing? If that was all you were doing, you should expect hundreds or thousands of frames per second on that level of hardware. Are you doing anything else that forces software rendering, and do you have up-to-date drivers (from NVIDIA) for your card?

Is

void main(void)
{
    gl_Position = ftransform();
}

just as slow?
It is just as slow. With a very small model, I see a very, very, very minor performance gain (1000 fps vs 950 fps).

The 63 fps model is relatively big. Lots and lots of draw calls (I use either display lists or VBOs, no difference whatsoever), unoptimized. Just plain vanilla triangles.

mhagain
03-23-2011, 08:03 AM
The large model might very well be falling back to software emulation on the vertex pipeline then, so display lists or VBOs would be expected to make no difference. And with software T&L, unordered triangle soup is a very bad recipe for performance, as you'll get no vertex caching and potentially multiple passes over the same vertexes. Add in lots and lots of draw calls (which kill you on the CPU side) and things get even worse.

Did you try calculating MVP in software and sending it once per frame as a uniform instead of using gl_ModelViewProjectionMatrix yet?

fred_em
03-23-2011, 08:58 AM
The large model might very well be falling back to software emulation on the vertex pipeline
The small model runs at 1450 fps normally. So I am losing 450 fps because of the vertex shader. Quite a big drop here too.


Did you try calculating MVP in software and sending it once per frame as a uniform instead of using gl_ModelViewProjectionMatrix yet?
Trying that now.

Aleksandar
03-23-2011, 11:02 AM
I do not believe these two calls are responsible for the 30 fps framerate drop I have, but I might be wrong.
And removing them is tricky in my situation.
I'm almost sure that glUseProgram is not "guilty" for the fps drop.
You have some other problem. Try to find it using glGetError. I bet it will return some error that is causing the low performance. With a vertex shader like yours you should get even higher performance than with the FFP. There is no doubt that the default shaders are very well optimized, but they are more complicated than yours.
It would be much easier to debug the application with the debug_output extension.
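
For example, a tiny helper you can sprinkle around the suspect calls (just a sketch; checkGLError is a made-up name):

/* Call after suspect GL calls; glGetError is drained in a loop because
 * several errors can be queued up. */
#include <stdio.h>
#include <GL/gl.h>   /* on Windows, include <windows.h> before this */

static void checkGLError(const char *where)
{
    GLenum err;
    while ((err = glGetError()) != GL_NO_ERROR)
        fprintf(stderr, "GL error 0x%04x at %s\n", (unsigned)err, where);
}

/* usage: checkGLError("after glUseProgram"); checkGLError("after glDrawElements"); */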

Do you recompile/relink the program each frame? Although the program is simple, recompilation/relinking can require a considerable amount of time.

fred_em
03-24-2011, 06:12 AM
Did you try calculating MVP in software and sending it once per frame as a uniform instead of using gl_ModelViewProjectionMatrix yet?
I have just tried and I'm getting the same results. I can only test on the ATI platform today (GL4-class Mobility Radeon HD 5470). My framerate goes from 45 fps raw to 33 fps with a vertex shader.

I am trying to use vertex attributes as well as a GL3 context now.

fred_em
03-24-2011, 06:38 AM
Do you recompile/relink the program each frame? Although the program is simple, recompilation/relinking can require a considerable amount of time.
No, I just link once. I then only use glUseProgram.

fred_em
03-24-2011, 06:43 AM
With a vertex shader like yours you should get even higher performance than with the FFP.
Is the fixed function pipeline implemented with shaders behind the scenes, in my situation (GT 430 / Radeon HD 5xxx)? You seem to be saying this is the case - is it?

BTW I don't believe my VS runs in software emulation. First, I don't know why that would be the case, and second, my model would be way, way slower with software T&L.

Aleksandar
03-24-2011, 08:38 AM
Is the fixed function pipeline implemented with shaders behind the scenes, in my situation (GT 430 / Radeon HD 5xxx)? You seem to be saying this is the case - is it?
Graphics hardware was programmable long before shaders; with shaders, that hardware just became open. On the other hand, it would be very difficult to keep the default fixed functionality as a totally separate rendering path and develop and optimize another path that uses shaders, especially when a huge number of users rely on the programmable path.


BTW I don't believe my VS runs in software emulation. First, I don't know why that would be the case, and second, my model would be way, way slower with software T&L.
I bet you hit some fallback path. If you post some code fragments, maybe we could say more about it.

fred_em
04-05-2011, 06:30 AM
I did 4 more tests.

Configuration: Windows XP SP3 32-bit, GeForce GT 430, Drivers 266.58.

Test 1) GL 1.x context, Display Lists: 64 fps
Test 2) GL 1.x context, VBO: 45 fps
Test 3) GL 1.x context, VBO + shaders: 22 fps
Test 4) GL 3.2 context, VBO + shaders: 22 fps

Vertex Attributes are used in tests 3 and 4.

Shaders code:

// Vertex shader

#version 150 core

in vec3 in_vertex;
out vec3 out_color;

uniform mat4 modelViewProjectionMatrix;

void main(void)
{
    gl_Position = modelViewProjectionMatrix * vec4(in_vertex, 1.0);
    out_color = vec3(1.0, 0.0, 0.0);
}

// Fragment shader

#version 150 core

in vec3 out_color;
out vec4 out_fragcolor;

void main(void)
{
    out_fragcolor = vec4(out_color, 1.0);
}

The GL 1.x context is created with wglCreateContext(HDC). The 3.2 context is created with the ARB extension.
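
For completeness, the 3.2 context creation looks roughly like this (simplified sketch, error handling stripped; the constants and the function-pointer typedef come from wglext.h, and a legacy context must already be current so wglGetProcAddress can resolve the entry point):

#include <windows.h>
#include <GL/gl.h>
#include <GL/wglext.h>

HGLRC createGL32Context(HDC hdc)
{
    PFNWGLCREATECONTEXTATTRIBSARBPROC wglCreateContextAttribsARB =
        (PFNWGLCREATECONTEXTATTRIBSARBPROC)wglGetProcAddress("wglCreateContextAttribsARB");

    const int attribs[] = {
        WGL_CONTEXT_MAJOR_VERSION_ARB, 3,
        WGL_CONTEXT_MINOR_VERSION_ARB, 2,
        WGL_CONTEXT_PROFILE_MASK_ARB,  WGL_CONTEXT_CORE_PROFILE_BIT_ARB,
        0                                /* attribute list terminator */
    };

    HGLRC rc = wglCreateContextAttribsARB(hdc, 0, attribs);
    wglMakeCurrent(hdc, rc);
    return rc;
}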

Here is a GL trace of a single frame of my application when running under a GL 3.2 core profile context:

| SwapBuffers(1A011B83)
| glViewport(0, 0, 751, 704)
| glScissor(0, 0, 751, 704)
| glEnable(GL_SCISSOR_TEST)
| glColorMask(TRUE, TRUE, TRUE, TRUE)
| glClearColor(0.200000, 0.200000, 0.400000, 1.000000)
| glClearDepth(1.000000)
| glDepthMask(TRUE)
| glClear(GL_DEPTH_BUFFER_BIT|GL_COLOR_BUFFER_BIT)
| glEnable(GL_DEPTH_TEST)
| glDisable(GL_SCISSOR_TEST)
| glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
| glColorMask(TRUE, TRUE, TRUE, TRUE)
| glUseProgram(3)
| glUniformMatrix4fv(0, 1, FALSE, 0607CDA0)
| glBindBuffer(GL_ARRAY_BUFFER, 1)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 2)
| glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 4)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 3)
| glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 6)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 5)
| glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 8)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 7)
| glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 10)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 9)
| glDrawElements(GL_TRIANGLES, 54, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 12)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 11)
| glDrawElements(GL_TRIANGLES, 54, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 14)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 13)
| glDrawElements(GL_TRIANGLES, 60, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ARRAY_BUFFER, 16)
| glVertexAttribPointer(0, 3, GL_FLOAT, FALSE, 0, 00000000)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 15)
| glDrawElements(GL_TRIANGLES, 60, GL_UNSIGNED_INT, 00000000)
| glBindBuffer(GL_ARRAY_BUFFER, 0)
| glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)
| glUseProgram(0)
| glGetError()
| glDisable(GL_DEPTH_TEST)
| glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)
| glColorMask(TRUE, TRUE, TRUE, TRUE)
| glViewport(0, 0, 800, 600)
| glGetError()
| SwapBuffers(1A011B83)

Note that the framework I am using unbinds the vertex and element buffers after each draw (glBindBuffer(GL_ARRAY_BUFFER, 0) and glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0)). I wonder if this could be the cause of the slowdown, but I would be very surprised if it was.

Any ideas?

Cheers,
Fred

fred_em
04-05-2011, 07:24 AM
Shall I conclude from these test results that the VBO switch is the thing that kills my framerate?

But in this case why would attaching simple shaders make the framerate drop from 45 to 22 fps? That makes little sense.

dukey
04-05-2011, 08:17 AM
Probably older rendering paths are better optimised.

mhagain
04-05-2011, 08:17 AM
VBO switching is generally considered a bad thing, but you're not doing nearly enough of it to have this kind of perf impact IMO. All the same, there's definitely huge room for optimization there (especially the unbind/rebind thing, which is just plain weird), and that likely explains the perf difference you've noticed between VBOs and display lists. In fact, everything here really should be going into a single pair of buffer objects rather than creating multiple objects (especially for such tiny data sets - I see many with a mere 3 elements in there).
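
Roughly what I have in mind (untested sketch; the MeshRange bookkeeping is made up, and glDrawElementsBaseVertex needs GL 3.2 / ARB_draw_elements_base_vertex - on older GL you'd bias the indices at load time instead):

/* Pack every small mesh into one big VBO + one big IBO at load time, then
 * draw each mesh from an offset instead of rebinding buffers per draw.
 * bigVBO, bigIBO, allPositions, allIndices, the counts and meshes[] are
 * assumed to have been filled in already. */
struct MeshRange { int firstIndex; int indexCount; int baseVertex; };

/* load time */
glBindBuffer(GL_ARRAY_BUFFER, bigVBO);
glBufferData(GL_ARRAY_BUFFER, totalVerts * 3 * sizeof(GLfloat), allPositions, GL_STATIC_DRAW);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, bigIBO);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, totalIndices * sizeof(GLuint), allIndices, GL_STATIC_DRAW);

/* per frame: bind once, draw many */
glBindBuffer(GL_ARRAY_BUFFER, bigVBO);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, bigIBO);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void *)0);

for (int i = 0; i < meshCount; ++i)
    glDrawElementsBaseVertex(GL_TRIANGLES, meshes[i].indexCount, GL_UNSIGNED_INT,
                             (void *)(meshes[i].firstIndex * sizeof(GLuint)),
                             meshes[i].baseVertex);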

My psychic hat tells me that this framework originally drew (and likely still does with GL 1.x) using strips and fans, and was converted over to VBOs by just replacing each discrete strip or fan with its own VBO. You also tend to see this kind of thing with a certain style of OO code where the developer has implemented everything - down to the finest level of detail - as a class, and has made the design decision that each such class should be totally self-contained, walled off, and know nothing at all about the outside world. Not good at all for this kind of use case.

I'm not sure that I understand why 2 SwapBuffers calls seem to be made per frame. That looks highly dubious - unless the first one was inadvertently logged from the previous frame, of course.

If you've got a stencil buffer it really should be cleared at the same time as the depth buffer, because it's most likely combined with the depth buffer as a single 32-bit depth/stencil buffer (in D24S8 format).
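
i.e. something along these lines:

/* Clear depth and stencil together so the driver can fast-clear the
 * combined D24S8 surface in one go. */
glClearColor(0.2f, 0.2f, 0.4f, 1.0f);
glClearDepth(1.0);
glClearStencil(0);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);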

None of this explains the shader performance problem. Maybe try dropping your #version in each shader?

fred_em
04-05-2011, 09:55 AM
The two SwapBuffers calls are a mistake of mine in the cut/paste. There really is only one SwapBuffers.

The GL trace I showed is for a very small object. The 64/45/22 fps object is a very large one, so your explanation does make sense.

The VBO switch is the killer thing here.

One very important question remains unanswered though: in my original test, I only used display lists. I was comparing the framerate of the raw object and of the same object with simple vertex+fragment shaders attached, and my framerate was dropping from 64 fps to 30 fps.

This remains a mystery.

mhagain
04-05-2011, 10:28 AM
What about taking your #version down to 110, or did you try that already with your GL 2.1 test? I'd be interested in knowing if that has any bearing on performance. Where I'm coming from is trying to rule stuff out and identify the point at which the perf drop-off happens. The renderer is certainly not efficient anyway, but all the same you should definitely not get that kind of thing happening.

Trying progressively more complex models may also be a good idea. In particular, I'm thinking of the crossover point around the 64k-vertex mark.

fred_em
04-06-2011, 05:53 AM
Omitting the version number in the shader or specifying #version 110 has no effect on the framerate.

I have found the explanation for the huge drop in framerate. I mean, not the one caused by the VBO switches, but the one that had yet to be explained, where I had passthrough shaders and display lists (30 fps) vs. just plain display lists (64 fps).

The framework I am using is OpenSceneGraph. To use shaders/programs I can either assign them to individual graph leaves or to the top node of the graph. When I assign a unique (application-wide) shader to individual graph leaves, the OpenGL trace shows that OSG is clever enough to avoid unnecessary shader switches, resulting in the rendering just being:

glUseProgram(id)
glCallList(id1)
glCallList(id2)
glCallList(id3)
glCallList(id4)
...
glUseProgram(0)

However, the work done on the CPU probably changes significantly, because OSG still has to 'diff' the state changes across graph leaves.
Specifying the shader on the topmost node shows the exact same GL trace, but no framerate drop, probably because there is almost no work left on the CPU side.

So my guess is that the CPU was making everything stall.

I was misled by the fact that the GL trace was 'clean' when working on individual graph leaves, and assumed the CPU work was almost zero. It wasn't.
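
For reference, attaching the program once at the top of the graph looks roughly like this in OSG (from memory, so treat it as a sketch; vertSrc/fragSrc are the shader strings shown above and rootNode is my scene root):

#include <osg/Program>
#include <osg/Shader>
#include <osg/StateSet>

// Build the program once and attach it to the root node's StateSet,
// instead of to every leaf's StateSet.
osg::ref_ptr<osg::Program> program = new osg::Program;
program->addShader(new osg::Shader(osg::Shader::VERTEX,   vertSrc));
program->addShader(new osg::Shader(osg::Shader::FRAGMENT, fragSrc));

osg::StateSet *ss = rootNode->getOrCreateStateSet();
ss->setAttributeAndModes(program.get(), osg::StateAttribute::ON);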

Cheers,
Fred

Inventor3D
04-11-2011, 06:18 AM
I realize there has already been some discussion of the performance of a vertex shader vs. fixed functionality. I would like to find out if (or confirm that) vertex shaders are ALWAYS slower. I ran a 30,000 mesh at 240 fps with fixed functionality. Then I attached just a vertex shader, using just this code:

void main()
{
    gl_Position = ftransform();
}

The result is that fps is cut in half.

If vertex shaders are always slower, then I'll have to work with the fragment shader alone. This is unfortunate because I can't pass varying variables to the fragment shader without using a vertex shader (right?). So, to do lighting, I'm passing the normals through the secondaryColor (which is annoying)...

Any advice from an expert would bring me much delight.

mhagain
04-11-2011, 06:23 AM
In the normal case vertex shaders should actually be faster, especially a simple passthrough vertex shader.

Why? Because on modern hardware the fixed pipeline is emulated through shaders, that's why. So given that you're always running a vertex shader anyway, simplifying that shader to either one that only meets your specific needs, or to a simple passthrough, should be faster.

The only exception to this rule is if your implementation is emulating vertex shaders in software.

I know people say that FPS counts shouldn't be used, but at the same time, the performance you're getting seems quite low for the mesh you're rendering. I can easily hit similar speeds for similar counts on a crappy office PC with crappy integrated Intel graphics. So definitely something other than just the use of vertex shaders is wrong here.

Inventor3D
04-11-2011, 06:28 AM
Interesting. Does this mean that my implementation must be emulating the vertex shader in software? I'm a newbie.

mhagain
04-11-2011, 06:37 AM
Check my edit - the next obvious question would be: what hardware do you have?

Inventor3D
04-11-2011, 07:11 AM
Thanks. I have an i3. Windows properties reports: WDC WD3200BEVT ATA device. I got the computer last summer... Are you suggesting that on different hardware, the fixed functionality wouldn't be faster than this?

void main()
{
    gl_Position = ftransform();
}

The "mesh" could be going slower because it's a terrain in which no triangles are hidden from view; also it includes multi-texturing. But even without texturing, the results are essentially the same.

mhagain
04-11-2011, 07:18 AM
WDC WD3200BEVT ATA
Hmmmm, that's a hard disk; I don't think they can do 3D graphics too well. ;)

What about your display adapter?

Inventor3D
04-11-2011, 07:34 AM
I just tested it on another laptop: this one was bought just a few months ago, but it's also an i3. Same results. Somehow I don't think that my implementation is emulating vertex shaders in software.

Inventor3D
04-11-2011, 07:35 AM
Under display adapter, Device Manager says "Intel(R) HD Graphics"

BionicBytes
04-11-2011, 07:59 AM
Device Manager says "Intel(R) HD Graphics"
OMG: whenever you see Intel and odd/strange GL behaviour, we all cringe!
Intel make absolutely terrible OpenGL drivers. I bet with 99% certainty that's your problem right there. You really, really need to test on an NVIDIA or AMD card to get a proper idea of how your app is performing.

Inventor3D
04-11-2011, 08:26 AM
That's very helpful. I don't know if I may ask this here, but do you know if I can add an nVidia card to my laptop? Or must I buy a whole new laptop?

mhagain
04-11-2011, 08:31 AM
It depends on the laptop, but I really doubt it's possible; it's a long time since I checked out this particular market, but laptops with replaceable graphics were never very common. Your options are more or less: (1) buy a new laptop, (2) accept that you're going to get sucky performance with OpenGL on Intel graphics, or (3) switch to D3D (which still sucks on Intel, just not as much).

BionicBytes
04-11-2011, 09:24 AM
There is no way you can replace your GFX card or add a new one to a laptop.
The new range of DX11-class h/w has been released recently, so now is a good time to buy a new laptop with a decent onboard video card (AMD or nVidia). I have laptops at home with GeForce 8 and others with the AMD 48xx series - both are excellent.

In the end you get what you pay for. Buy a DX11 / OpenGL 4.1 laptop so you can develop something decent, but don't spend mega bucks, as these things do go out of date within a few years!

Inventor3D
04-11-2011, 10:32 AM
Thanks for the advice and helpful info. I'll try to get a hold of an nVidia card for my desktop and test it from there.

Inventor3D
04-12-2011, 03:12 PM
I tested it with an nVidia card... and sure enough, the vertex shader goes just as fast as the fixed pipeline. So, that's cool. :)