ATI_separate_stencil slower than conventional stencil calls?

kard · August 15, 2003, 12:06am

I get some strange test results when I use ATI_separate_stencil instead of the conventional calls to perform a Z-pass shadow volume algorithm.

With the extension enabled, performance is lower than with the conventional OpenGL stencil functions on my system (PIII 800, ATI 9700 Pro). Here are the two code paths:

Conventional calls:

glColorMask(GL_FALSE,GL_FALSE,GL_FALSE,GL_FALSE);

glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_ALWAYS,0,~0);
glStencilMask(~0);

glCullFace(GL_FRONT);
glStencilOp(GL_KEEP,GL_KEEP,GL_INCR);
DrawShadowVolume();

glCullFace(GL_BACK);
glStencilOp(GL_KEEP,GL_KEEP,GL_DECR);
DrawShadowVolume();

With the ATI_separate_stencil extension:

glColorMask(GL_FALSE,GL_FALSE,GL_FALSE,GL_FALSE);
glDisable(GL_CULL_FACE);

glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_ALWAYS,0,~0);
glStencilMask(~0);

glStencilOpSeparateATI(GL_FRONT,GL_KEEP,GL_KEEP,GL_INCR_WRAP_EXT);
glStencilOpSeparateATI(GL_BACK,GL_KEEP,GL_KEEP,GL_DECR_WRAP_EXT);
DrawShadowVolume();

Both of them produce indistinguishable and accurate visual results.

Any ideas? Is it possible that such an extension is CPU-dependent?

[This message has been edited by kard (edited 08-15-2003).]

Humus · August 15, 2003, 3:34am

How large is the performance difference?

kard · August 15, 2003, 6:26am

Not a huge one… 5 to 15% slower. But I expected an improvement, not a slowdown.

zeckensack · August 15, 2003, 6:54am

Is there a reason why you use GL_INCR/GL_DECR in the ‘conventional’ code and GL_INCR_WRAP/GL_DECR_WRAP with the extension?
IMO you should use wrap for both. Maybe this produces the performance delta.

zeckensack · August 15, 2003, 7:07am

Originally posted by kard:
Not a huge one… 5 to 15% slower. But I expected an improvement, not a slowdown.

There will only be an improvement if you’re fill limited during volume rendering. Move around the lights to enlarge the volume, bump up resolution, that sort of thing.

kard · August 15, 2003, 7:39am

Is there a reason why you use GL_INCR/GL_DECR in the ‘conventional’ code and GL_INCR_WRAP/GL_DECR_WRAP with the extension?
IMO you should use wrap for both.

You’re right, but the first piece of code is extracted from a more general path, and the GL_XXXX_WRAP operations are not necessary there, as stencil values are usually incremented before they are decremented. Our engine falls back to the Z-fail configuration if the viewer passes through the shadow volumes.

On the other hand, the GL_XXXX_WRAP operations are indispensable with separate_stencil, as there is no way to force the front faces to be rendered first.

Maybe this produces the performance delta

I will try this…

kard · August 15, 2003, 9:18am

Originally posted by zeckensack:
Is there a reason why you use GL_INCR/GL_DECR in the ‘conventional’ code and GL_INCR_WRAP/GL_DECR_WRAP with the extension?
IMO you should use wrap for both. Maybe this produces the performance delta.

OK, I replaced GL_INCR and GL_DECR with GL_INCR_WRAP_EXT and GL_DECR_WRAP_EXT in the conventional path, and the performance difference is still the same.

Humus · August 15, 2003, 12:04pm

Originally posted by zeckensack:
There will only be an improvement if you’re fill limited during volume rendering. Move around the lights to enlarge the volume, bump up resolution, that sort of thing.

I think you mean that there will only be an improvement if you’re mostly geometry limited. Fillrate demand doesn’t change, but the geometry load is halved.

zeckensack · August 15, 2003, 12:10pm

Originally posted by Humus:
I think you mean that there will only be an improvement if you’re mostly geometry limited. Fillrate demand doesn’t change, but the geometry load is halved.

Heh, yes, what was I thinking …
Thanks for the correction.

dorbie · August 15, 2003, 1:21pm

Do you need to wrap? Use glClearStencil to specify a mid value for the clear (or probably somewhat less, since cumulative increments from multiple objects make the increment more dangerous than the decrement) and try it without the wrap. You should be OK so long as you don’t overflow or underflow. This should allow you to use the two-sided stencil extension without worrying about decrements on a zero buffer while still avoiding the wrap.

The big question is whether it will slow the clear or slow your stencil-tested fill (due to the non-zero value in both cases). The cure may be worse than the problem, or it may be a win. Good luck, and let us know if it is a win, lose or draw.

[This message has been edited by dorbie (edited 08-15-2003).]

kard · August 15, 2003, 10:35pm

Originally posted by Humus:
I think you mean that there will only be an improvement if you’re mostly geometry limited. Fillrate demand doesn’t change, but the geometry load is halved.

I totally agree with you, and I’m certainly more limited by fill rate than by geometry. But could this point justify any slowdown? In every case ATI_separate_stencil should be at least as efficient, as the number of fragment operations is the same but the vertex processing is half that of the conventional path.

Did I miss something?

Humus · August 15, 2003, 11:31pm

True, it shouldn’t slow down though. If the slowdown were small, such as the 5% figure you mentioned or below, I could accept that it might be some additional driver overhead, but up to 15% seems too much.

imported_jwatte · August 16, 2003, 12:02am

Maybe there is some internal pipeline or cache which gets choked when every fragment has an operation, but which works more smoothly (more coherently?) when you’re in regular stencil operation?

Could separate stencil perhaps disable early Z, while regular stencil doesn’t?

As was pointed out, the number of fragment operations is the same between the two approaches, but the geometry load is half in the separate stencil case. Thus, unless you know you are geometry limited, I’d just as soon not use separate stencil.

kard · August 16, 2003, 12:42am

Originally posted by Humus:
True, it shouldn’t slow down though. If the slowdown were small, such as the 5% figure you mentioned or below, I could accept that it might be some additional driver overhead, but up to 15% seems too much.

Moreover, the average slowdown is close to 15%…

zeckensack · August 16, 2003, 6:30am

Another wild guess:
Have you tried using StencilFuncSeparateATI (using same params for front and back)?

I know, the way the spec is written this really shouldn’t matter. glStencilFunc is specced to set both front and back state, so that the function could be implemented as a wrapper that calls glStencilFuncSeparateATI twice.

Maybe that’s not quite working, so it could be worth a test.

kard · August 16, 2003, 6:58am

Originally posted by zeckensack:
Another wild guess:
Have you tried using StencilFuncSeparateATI (using same params for front and back)?

I originally followed the specs and used the glStencilFuncSeparateATI function. I replaced it with the usual glStencilFunc when I suspected that the slowdown was due to that call.

But I didn’t notice any difference.

Humus · August 16, 2003, 12:22pm

Have you tried increasing the amount of geometry used to see if at some point the performance delta goes to zero or double sided starts to get faster?

kard · August 17, 2003, 12:58am

Originally posted by Humus:
Have you tried increasing the amount of geometry used to see if at some point the performance delta goes to zero or double sided starts to get faster?

OK, I ran some tests with various geometry settings, and here are the results:

Scene vertices | Silhouette vertices | Slowdown (-) or gain (+)
         60584 |               24159 | - 9 %
        222477 |               67511 | - 12 %
        578864 |              109883 | - 4 %
       1484985 |              221467 | + 1 %

So it seems that with an ultra-high-geometry database there is indeed a minor gain (1%). But such a level of geometric detail is not yet practical (the stencil shadow algorithm needs 3191437 vertices to be rendered, plus the huge amount of fill rate). I wonder whether this extension is useful with the current implementation.

[This message has been edited by kard (edited 08-17-2003).]

Humus · August 17, 2003, 9:28am

OK, it’s hard to know, but it may be some trouble with the driver. Looks reasonable. I’d suggest you send the app with comments to devrel@ati.com and hope they can find something out, whether it’s the driver or something else.