Matrix Palette Index

Thanks in advance for any help/advice

I am trying to move my skinning animation from the CPU into the vertex shader.

I am passing the matrix palette as a uniform array:


const int BoneCount = 86;
uniform mat4 Matrix_Palette[BoneCount];
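
For reference, on the C side I upload the whole palette in one call at the array's base location. This is just a sketch; "program" and "palette" are placeholder names for my shader program and the per-frame array of bone matrices.

    // Sketch: upload BoneCount column-major mat4s into the Matrix_Palette array uniform.
    GLint paletteLoc = glGetUniformLocation(program, "Matrix_Palette");
    glUniformMatrix4fv(paletteLoc, BoneCount, GL_FALSE, (const GLfloat *)palette);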

and I am passing the palette indices as a vec4 attribute:


attribute vec4 indices;

My problem is that using the attribute instead of a constant index value drops the framerate in half…

e.g.


    vec4 skinned_position = Matrix_Palette[int(indices.x)] * position;  // 30 fps
    // vs.
    vec4 skinned_position = Matrix_Palette[0] * position;               // 60 fps
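
For context, the full skinning path in the vertex shader looks roughly like the sketch below. The weights attribute and the four-bone blend are assumptions about where I'm headed with the vec4 indices; the attribute/uniform names are placeholders, not necessarily my real code.

    attribute vec4 position;
    attribute vec4 indices;   // four palette indices per vertex
    attribute vec4 weights;   // four blend weights per vertex (placeholder)

    uniform mat4 Projection;
    uniform mat4 Modelview;
    const int BoneCount = 86;
    uniform mat4 Matrix_Palette[BoneCount];

    void main()
    {
        // blend the vertex by up to four palette entries
        vec4 skinned = (Matrix_Palette[int(indices.x)] * position) * weights.x
                     + (Matrix_Palette[int(indices.y)] * position) * weights.y
                     + (Matrix_Palette[int(indices.z)] * position) * weights.z
                     + (Matrix_Palette[int(indices.w)] * position) * weights.w;
        gl_Position = Projection * Modelview * skinned;
    }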


Please help! This is driving me nuts.

Half… what is your GPU?
Maybe vsync is on, and your code was running at barely 60-61 fps; with this modification it dropped to 59 fps and got clamped to 30.

It’s an ATI Radeon HD 5750 on Mac OS X Snow Leopard.

I think the problem is using a variable or vertex attribute to index the uniform array.

I can even do double the work with


    vec4 skinned_position = Matrix_Palette[0] * position * 0.5;
    vec4 skinned_position2 = Matrix_Palette[1] * position * 0.5;    
    vec4 outposition = skinned_position + skinned_position2;

at 60 fps.

while


    int j = 1;
    vec4 skinned_position2 = Matrix_Palette[j] * position * 0.5;
    gl_Position = Projection * Modelview * skinned_position2;

still runs at 29 fps.

grrrrr
The problem does not occur on a MacBook Pro with an NVIDIA GeForce 9600M GT.

So… what about vsync? Is that on? You should turn it off.

Or you should use GL_ARB_timer_query to do better performance profiling.
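
Where the driver exposes the extension, usage looks roughly like this (just a sketch):

    // GL_ARB_timer_query sketch: measure GPU time spent in a block of draw calls.
    GLuint timerQuery;
    glGenQueries(1, &timerQuery);

    glBeginQuery(GL_TIME_ELAPSED, timerQuery);
    // ... draw calls to be measured ...
    glEndQuery(GL_TIME_ELAPSED);

    GLuint64 elapsedNs = 0;
    glGetQueryObjectui64v(timerQuery, GL_QUERY_RESULT, &elapsedNs);  // blocks until the result is ready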

The app uses CVDisplayLink to drive the render loop:
http://developer.apple.com/library/mac/#qa/qa1385/_index.html

So “vsync” is inherently on.

Doesn’t really matter - the fact remains using non-constant array indexing is causing a massive slowdown on the ATI card.

I am only rendering a VBO of about 3k vertices.

I am filing a Radar bug report with Apple.

grrrrrrr

> Doesn’t really matter - the fact remains using non-constant array indexing is causing a massive slowdown on the ATI card.

No, it’s not. It’s causing your application to go from 61 fps to 59 fps. Because vsync is on, 59 fps is not actually possible, so it drops to the next lowest synchronous framerate: 30.

Rule #1 of graphics profiling: Do not profile while vsync is on. If “NSTimer” or whatever is vsync’d, then don’t use it when profiling.
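
On OS X the swap interval is a context parameter, so something along these lines should turn it off (a sketch, assuming you can get at the CGLContextObj; with an NSOpenGLContext you would set NSOpenGLCPSwapInterval instead). Note that CVDisplayLink will still call you back at the display rate, so you also need to time the GL work inside the frame rather than just counting callbacks.

    // Sketch: disable vsync by setting the swap interval to 0 on the current CGL context.
    GLint swapInterval = 0;
    CGLSetParameter(CGLGetCurrentContext(), kCGLCPSwapInterval, &swapInterval);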

> I am filing a Radar bug report with Apple.

That won’t help you fix your code.

I am sorry - I think you are failing to understand my point.

I can do double the processing work (over 3k matrix-vector multiplies):


    vec4 skinned_position = Matrix_Palette[0] * position * 0.5;
    vec4 skinned_position2 = Matrix_Palette[1] * position * 0.5;    
    vec4 outposition = skinned_position + skinned_position2;

at 60 fps.

while


    int j = 1;
    vec4 skinned_position2 = Matrix_Palette[j] * position * 0.5;
    gl_Position = Projection * Modelview * skinned_position2;

is still locked at 29 fps.

Thanks for trying to help - but recommending only that I disable vsync really isn’t helping me at all…

I realize the additional processing is causing it to miss the vsync and that is why it is 30 fps. My point is that I am doing very little processing in the shader. The fact that an array index causes it to miss the vsync IS the problem. I don’t care that it might cost 10 fps or 30 fps - it is WAY too expensive.

Most importantly, the fact that it only happens on the significantly faster ATI desktop card and not on the slower NVIDIA mobile chip is further evidence that there is a driver issue.

Thanks again for your help, and let me know if you have any ideas other than disabling vsync.

> I think you are failing to understand my point.

No, I understand your point just fine. The simple fact is this: until you turn off vsync, all profiling data you get must be considered suspect. That is, nothing you see about how an application performs can be considered reliable if vsync is enabled.

I know it looks like something odd is going on. But unless vsync is turned off, timing data simply cannot be considered a reliable measure of anything that’s actually happening.

This should not be a difficult thing to do. You could probably have done it in the time it took you to compose that last message.

Sadly, Apple has not implemented ARB_timer_query (part of OpenGL 3.3), so there aren’t very many options available for profiling OpenGL.

I made some progress (I still haven’t found an easy way to disable vsync when using CVDisplayLink - it’s not like the old days on Windows or Linux or even the Mac)…

But if I lower the bone count to 50, everything works pegged at 60 fps. :)

Then I stumbled across an old post on this board describing a similar issue:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=267068

It is not exactly the same issue, but it involves an ATI card not handling more than 70 or so mat4 uniforms.

I checked on this card: it reports GL_MAX_VERTEX_UNIFORM_COMPONENTS = 4096,

which should still be enough for 256 mat4s??
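
For anyone checking the same thing, the limit can be queried like this (a sketch; each mat4 costs 16 components, so 4096 components works out to 256 mat4s minus whatever other uniforms like Projection and Modelview consume):

    // Sketch: query the vertex-shader uniform budget.
    GLint maxComponents = 0;
    glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS, &maxComponents);
    // maxComponents / 16 = rough number of mat4 uniforms that can fit.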

The model I am testing with has 86 bones, but I was going to target around 32, so 50 works for me, I guess…

But I still might file a Radar with Apple…

I guess for some reason they don’t plan on anyone declaring a really big array like
uniform mat4 Matrix_Palette[BoneCount];

or perhaps it is some sort of GPU limitation.

That GPU is capable of computing hundreds of times more indexed matrix*vector multiplications than you’re doing. (I’ve run vertex-shader skinning like that on an HD 2600: 500k vertices @ 60 fps.)
So I’m inclined to think that, for whatever reason, the driver is falling back to software.
Being able to disable vsync to quickly see vast differences in performance (and thus spot software fallbacks) is quite important, so do try to find a way. :)

A quick fix might be changing “const int BoneCount = 86;” to “#define BoneCount 86”; I vaguely remember having performance issues with such constants on GeForces on some driver versions.

I second that.

Besides frame timings being useless, sub-frame timings are totally hosed because of how the GL driver queues ahead into subsequent frames after SwapBuffers, blocking on seemingly random calls when some implementation-dependent limit is reached.

Under some drivers you can put a glFinish() after your SwapBuffers call, which will forcibly wait for the vsync. This will make your sub-frame timings more useful, but it will still leave you with useless frame timings, and it is probably driver-dependent behavior. Using NV_fence/ARB_sync is yet another way to do this.
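
With ARB_sync, the equivalent looks roughly like this (a sketch; a 2.1 context on Snow Leopard may not expose GL_ARB_sync, in which case glFinish() is the fallback):

    // Sketch: make the CPU wait until the GPU has finished everything issued so far.
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000ull);  // wait up to 1 second
    glDeleteSync(fence);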
