Discussing shadows

I’ve been working on some methods to do shadows using the stencil buffer, but performance and a slight technical detail are making things difficult.

My method is sort of similar to the papers presented by Mark Kilgard (Crow77), but in my case I don’t need to figure out a silhouette, flatten the object onto a plane, or even care about polygon orientation.
The only drawback is fillrate limitations.
What I find weird is that if the window is about 500x500, I get over 500 FPS for a single light, but 18 FPS for 3 lights. Making the window smaller helps…

I would like to ask what methods are working for you.

I will definitely have to check out this Doom 3 thing since people are saying it will have dynamic lighting with stencil shadow volumes and such.

V-man

That’s quite a difference.

It suggests you have some combinatorial pass thing going on instead of clearing stencil between lights and adding the light to the framebuffer for each illumination pass. You should be able to get linear performance in the number of lights, or better in practice, since the depth buffer fill gets amortized over all lights. That’s just brute force; what you ultimately do is restrict rendering for each light to the parts of a scene illuminated by that light.

Either that or something is pathologically slow on your platform, for example a stencil clear without a depth clear, or the way you choose to clear stencil. Or blended rendering is significantly slower than non-blended. It’s difficult to guess without more information.
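To illustrate that last point, one way to restrict each light is a scissor rect around the light’s area of influence. Just a rough sketch; computeLightScreenRect() is a made-up helper that would project the light’s bounding sphere into window coordinates:

glEnable(GL_SCISSOR_TEST);
for (int i = 0; i < numLights; ++i)
{
    int x, y, w, h;
    computeLightScreenRect(lights[i], &x, &y, &w, &h);   // hypothetical helper
    glScissor(x, y, w, h);             // clears and draws are clipped to this rect
    glClear(GL_STENCIL_BUFFER_BIT);    // only the scissored region gets cleared
    // ... shadow volume pass and additive lighting pass for light i ...
}
glDisable(GL_SCISSOR_TEST);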

Originally posted by dorbie:
[b]That’s quite a difference.

It suggests you have some combinatorial pass thing going on instead of clearing stencil between lights and adding the light to the framebuffer for each illumination pass. You should be able to get linear performance in the number of lights, or better in practice, since the depth buffer fill gets amortized over all lights. That’s just brute force; what you ultimately do is restrict rendering for each light to the parts of a scene illuminated by that light.

Either that or something is pathologically slow on your platform, for example a stencil clear without a depth clear, or the way you choose to clear stencil. Or blended rendering is significantly slower than non-blended. It’s difficult to guess without more information.[/b]

Actually, I’m doing a brute force technique and will soon attempt a combinatorial technique.
Brute force being:

for each light do
{
    clear stencil
    generate stencil info
    turn on light, turn on blend(GL_ONE, GL_ONE) to accumulate lights
    render scene with GL_EQUAL
}
I should be able to combine 4 or more lights without clearing the stencil with my newer idea.
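In actual GL calls the loop above comes out roughly like this. Just a sketch of one common variant: drawSceneAmbient(), drawShadowVolumes(), drawSceneLit() and enableOnlyLight() are placeholder names, and I’m skipping the front/back face increment/decrement detail.

glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
drawSceneAmbient();                      // lays down depth + ambient color

glDepthMask(GL_FALSE);                   // depth stays fixed for the light passes
for (int i = 0; i < numLights; ++i)
{
    glClear(GL_STENCIL_BUFFER_BIT);      // fresh stencil for each light

    // mark shadowed pixels in stencil, color writes off
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 0, ~0u);
    glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);
    drawShadowVolumes(i);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

    // additive lighting pass where stencil says "not in shadow"
    glStencilFunc(GL_EQUAL, 0, ~0u);
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
    glDepthFunc(GL_EQUAL);               // only re-shade pixels already in the depth buffer
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);         // accumulate this light's contribution
    enableOnlyLight(i);
    drawSceneLit();
    glDisable(GL_BLEND);
    glDepthFunc(GL_LESS);
    glDisable(GL_STENCIL_TEST);
}
glDepthMask(GL_TRUE);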

I was on a GF2 MX 400 before (500 FPS with 1 light and 18 FPS with 3 lights).
Now on a GF2 GTS (600 FPS with 1 light, 450 FPS with 2 lights, and 33 FPS with 3 lights).

I tried:

glDepthMask(GL_FALSE);
glClear(GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
glDepthMask(GL_TRUE);

instead of
glClear(GL_STENCIL_BUFFER_BIT);

no effect.

I’m considering render_to_texture. I’m thinking that a pure texture hack (never using the stencil) should solve this problem.

//EDIT// Did I mention I’m only doing the shadows of 2 cubes onto a square surface, so this is only 26 triangles, all made with TRIANGLE_STRIP.

V-man

[This message has been edited by V-man (edited 06-01-2002).]

Are you sure the stencil clear is the problem? What happens if you try to remove it (except invalid shadows, of course)? Is it still slow?

Y.

You’re doing the right thing as far as I can tell. I’m all for rendering to a shadow texture :-), but stencil should work and it’s all you need.

The knee between 2 & 3 lights vs 1 & 2 blows a lot of theories out of the water.

FWIW I think more complex overlapping geometry becomes problematic without a stencil clear.

Maybe watch your state very carefully and try to isolate the cause by turning off individual passes.

Strange.

Do you turn off the light after each pass? You can get away with one light and just reposition it between passes. Each pass should only have its own light on. I wouldn’t expect multiple lights to matter all that much with such simple geometry, so I’m grasping at straws.

In your testing, are you using the same light type for all three lights? Also, are you always setting LIGHT0 rather than enabling a different light on each pass? (I don’t think that would be the issue, but I could imagine something dumb happening in that case.) Also, have you tried reorienting your lights? This could have a big impact.

Finally, as for partial buffer clears, like the stencil-only clear or the stencil/depth clear with depth masked, I would not expect them to run slow on modern HW, but I would expect full clears to possibly run extra fast.
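By “setting LIGHT0 on each pass” I mean something like this sketch (lightPos/lightColor are just whatever per-light data you keep around):

glEnable(GL_LIGHT0);
for (int i = 0; i < numLights; ++i)
{
    glLightfv(GL_LIGHT0, GL_POSITION, lightPos[i]);  // transformed by the current modelview
    glLightfv(GL_LIGHT0, GL_DIFFUSE, lightColor[i]);
    // ... stencil + additive lighting pass for light i ...
}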

-Evan

OK, the problem is that I’m an idiot (maybe).

I don’t have one of those infinite loops (idle()) in my project, so I plugged in FPS-counting code and decided to use my mouse and keyboard to cause OnPaint to be called. It turns out that for 2 lights there is no problem. For three lights, somehow it causes an interruption and a huge delay between timeGetTime() calls. But why is that??? I used a WM_TIMER to get OnPaint called. Getting ~390 FPS for 3 lights now.

Yes, this isn’t actual FPS, it just measures how long it takes to render + SwapBuffers.

time1=timeGetTime();
RenderScene();
SwapBuffers();
totaltime=timeGetTime()-time1;
//totaltime gets averaged over 25 frames
//FPS is computed and displayed

I have been thinking about ways to do render_to_texture shadows. But the big question is, do you just project the shadow of every single object onto every other object (for n objects, you will have n*n passes)?
It’s easy to figure out what’s behind the object and cut down on render time, but still, that’s going to require a whole lot of blending passes if many objects are behind one another and the light is at the very front. I don’t see an easy solution yet…

V-man

[This message has been edited by V-man (edited 06-02-2002).]

It may be some issue with the FIFO size, or perhaps the dispatch queue in the driver and where you take your timestamp.

Measure time from frame to frame, AFTER the swap call is made. Better still, unless you have some heavy-duty MFC stuff you want to build in, forget about the OnDraw callbacks & roll your own loop.

Hmm… It occurs to me that when you go over the performance limit where the swap call starts blocking, you may get the actual frame time (roughly), but if the frame time is less than the event-loop interval you’ll get just the time taken to issue the calls.

So the knee kicks in more or less where the app starts looping faster than your card can draw the scene issued in the OnDraw callback. When swap isn’t blocking you get a tiny time; when it starts blocking you get something close to the actual draw time.

Test this: try a faster event timer and see if you get a slower (but more accurate) measured frame rate (provided there’s no other bottleneck on the CPU).

This is because you measure from the start of a frame to the end of that frame. Measuring frame to frame, you’d at least see the event timer interval as your frame rate (roughly) when your loop was slower than the draw.
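Frame-to-frame it would be something like this sketch:

static DWORD lastTime = 0;
RenderScene();
SwapBuffers(hDC);
DWORD now = timeGetTime();
if (lastTime != 0)
    frameTime = now - lastTime;   // includes any blocking inside SwapBuffers
lastTime = now;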

[This message has been edited by dorbie (edited 06-03-2002).]

Here are some benchmarks I did. The numbers were fluctuating, so I took an average; there is a certain percentage of error in them.

The results I get with RenderScene() + glFlush() are the kind of numbers I like.

INFRAME counting for all, RenderScene() + SwapBuffers()

300 x 300 = 90000 pixels 95 FPS timer = 10 msec
500 x 500 = 250000 pixels 34 FPS timer = 10 msec
600 x 600 = 360000 pixels 23 FPS timer = 10 msec
700 x 700 = 490000 pixels 16.6 FPS timer = 10 msec

300 x 300 = 90000 pixels 370 FPS timer = 50 msec
500 x 500 = 250000 pixels 345 FPS timer = 50 msec
600 x 600 = 360000 pixels 337 FPS timer = 50 msec
700 x 700 = 490000 pixels 16.7 FPS timer = 50 msec

300 x 300 = 90000 pixels 363 FPS timer = 100 msec
500 x 500 = 250000 pixels 360 FPS timer = 100 msec
600 x 600 = 360000 pixels 360 FPS timer = 100 msec
700 x 700 = 490000 pixels 360 FPS timer = 100 msec

INFRAME counting for all, RenderScene() + glFlush()

300 x 300 = 90000 pixels 430 FPS timer = 10 msec
500 x 500 = 250000 pixels 390 FPS timer = 10 msec
600 x 600 = 360000 pixels 390 FPS timer = 10 msec
700 x 700 = 490000 pixels 390 FPS timer = 10 msec

300 x 300 = 90000 pixels 360 FPS timer = 50 msec
500 x 500 = 250000 pixels 360 FPS timer = 50 msec
600 x 600 = 360000 pixels 360 FPS timer = 50 msec
700 x 700 = 490000 pixels 410 FPS timer = 50 msec

Time between frames instead of start to finish, or call glFinish after swap, NOT glFlush. It’s heavyweight but it’ll do for now to get you some sensible timing information.

I’m pretty sure my theory is close to what’s going on.
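I.e. roughly, using the same variables as your earlier snippet:

time1 = timeGetTime();
RenderScene();
SwapBuffers(hDC);
glFinish();                        // blocks until the card has really finished
totaltime = timeGetTime() - time1;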

glFinish will just tell it to finish. The driver already knows it should finish, so what would be the point?

In this case (flushing the command buffer), I wanted to know how long it takes for the rasterization to complete. I will try the frame-to-frame count also.

The idea is that there is a kind of collision going on with SwapBuffers calls, so the driver halts the latest swap to finish up the previous swap. Correct? Why doesn’t it just dismiss the latest SwapBuffers call?

V-man

To know that rasterization has completed, read back a single framebuffer pixel before your final timeGetTime() call. I.e.:

start = timeGetTime();
draw();
swapBuffers();
readOnePixel();
stop = timeGetTime();
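Where readOnePixel() can be as simple as this sketch:

void readOnePixel()
{
    GLubyte pixel[4];
    // reading back from the framebuffer forces the driver/GPU to finish rendering first
    glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel);
}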

Trust me V-man :)

There’s no need to read back a pixel; read the man page for glFinish, it is not the same as glFlush. It does exactly what I intend it to do, and it does the right thing for your timing. glFlush merely dispatches buffered commands, so your timing will still be bogus unless swap blocks, i.e. it will make almost no difference. glFinish will block on everything completing, including fragment rasterization, and will therefore time your rendering.

The issue is blocking vs non blocking swap.

Think about the swap call blocking vs non blocking in your code and the implications for your timing results. You want to FORCE a blocking swap because right now it only blocks when you manage to fill the FIFO with a full frame. This only happens when your timer loops faster than your card can draw a frame, it therefore explains your strange timing results.

Anyhoo, try it and see, it’ll demonstrate what I’m saying.

You either time frame to frame, force a block on swap or live with bogus times.

P.S. I think the swap performs an implicit flush.

[This message has been edited by dorbie (edited 06-09-2002).]

Damn, I made a bobo there with the flush thing. Finish is what I wanted (to block of course).
Flush is useless on a PC, right? (client and server on the same motherboard).

I will try to benchmark some more situations when I get the chance.
Tried some tests on a Release EXE (maximize performance). Some code gets optimized extremely well by VC++. For example, using arrays of structs isn’t a good idea: 2 times slower than a simple array.

struct Vertex3D
{
    float x, y, z;
};

Vertex3D vertices[number];    // <-- not a good idea

float vertices[number*3];     // <-- much better
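For reference, the two access patterns look like this (calling the plain float array flat here just to keep the names apart), and sizeof will tell you whether the struct got padded:

float ax = vertices[i].x;       // array of structs: 12-byte stride (more if padded)
float bx = flat[i*3 + 0];       // flat float array: index arithmetic by hand

// worth checking what the compiler actually did; padding depends on pack/align settings
printf("sizeof(Vertex3D) = %u\n", (unsigned)sizeof(Vertex3D));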

V-man

Huh?

Are you sure? Wouldn’t have thought there would be much difference, if any. That’s really poor on the compiler’s side, if it is slower.

The only difference is that the index needs to be multiplied by 12 instead of 4. Wouldn’t have thought it cost that much with correctly pipelined address calculation.

You can use Visual Studio’s #pragma pack(… ) directive and/or the __declspec(align(…)) directive; that should help.

(Don’t remember what did what right now ;])
Good luck.
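E.g. something like this (rough sketch, MSVC-specific):

// force 16-byte alignment; sizeof becomes 16 (one float's worth of padding)
__declspec(align(16)) struct Vertex3D16
{
    float x, y, z;
};

// or control member packing explicitly:
#pragma pack(push, 4)
struct Vertex3DPacked
{
    float x, y, z;   // exactly 12 bytes
};
#pragma pack(pop)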

Originally posted by Nutty:
[b]Huh?

The only difference is that the index needs to be multiplied by 12 instead of 4. Wouldn’t have thought it cost that much with correctly pipelined address calculation.[/b]

Of course, multiplying by 4 is a single shift, while multiplying by 12 is two shifts and an add. One can easily check if this is the cause of the slowdown by adding a dummy float at the end of the structure (or using __declspec(align)), in order to make it 16 bytes.

I believe that VC++ 6.0 defaults to 8-byte struct member alignment, so his struct should already be padded to 16 bytes. Check with sizeof. You can change it in the project settings or with preprocessor directives.

When you use timeGetTime, do you set the interval with timeBeginPeriod?
You can try QueryPerformanceCounter for timing; it is more expensive than timeGetTime, but it has better precision.
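Something along these lines (sketch; the timeXxx calls need mmsystem.h and winmm.lib):

timeBeginPeriod(1);                      // ask for ~1 ms timeGetTime resolution
// ... run the app ...
timeEndPeriod(1);

// or use the high-resolution counter:
LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);
RenderScene();
SwapBuffers(hDC);
glFinish();                              // make sure the frame is really done
QueryPerformanceCounter(&t1);
double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;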