SLI Optimization

Hi,

I've been playing with an SLI machine for a couple of days now, but I cannot get good performance from it.

I have read all the SLI notes and conference talks from the NVIDIA website.

Here is what I do in my program:

a) Resize the screen to a bigger resolution; render my model from the light's position and, using an FBO, save the depth map (for shadow mapping).
b) Render my model again, compare the depths for the cast shadows, and save the result into another texture using an FBO.
c) Blur this texture to get soft edges on the shadows.
d) Resize to the desired screen size, then render the model with the light model and all the effects, mapping the soft shadows from the previous steps.

Now I have followed the steps from the NVIDIA GPU programming guide:

  1. Limit OpenGL Rendering to a Single Window – Done.
  2. Request PFD_SWAP_EXCHANGE – Done.
  3. Avoid Front Buffer Rendering – I need to look into this.
  4. Limit pbuffer Usage – I'm not using any pbuffers.
  5. Render Directly into Textures Instead of Using glCopyTexSubImage – I'm only using FBO render-to-texture.
  6. Use Vertex Buffer Objects or Display Lists – I do use VBOs.
  7. Limit Texture Working Set – Done.
  8. Render the Entire Frame; Do Not Use glViewport or glScissor – see my question below.
  9. Limit Data Readback – Done.
  10. Never Call glFinish() – Done.

Out of all these steps, the one that confuses me is not using glViewport: if I want to render my cast shadows at a higher quality, I need glViewport to resize the viewport so I can render into a bigger texture. Also, when doing GPU work it is necessary to align the texture to the screen, which again requires glViewport.
So how can I avoid using such a useful function?

Also, do I need to change how I call SwapBuffers()? I'm doing something like this at the moment:

  1. PeekMessage()
  2. If there is a message, process it; otherwise, update and then do all the rendering calls (steps a to d)
  3. SwapBuffers()
  4. Back to step 1
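
The loop above can be sketched as minimal Win32 code like this (UpdateScene() and RenderFrame() are hypothetical placeholders for the update logic and for rendering steps a to d):

```c
#include <windows.h>

void UpdateScene(void);   /* placeholder: your update logic        */
void RenderFrame(void);   /* placeholder: rendering steps a) to d) */

void RunLoop(HDC hdc)
{
    MSG msg;
    BOOL running = TRUE;
    while (running) {
        /* Drain every pending message before rendering. */
        while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
            if (msg.message == WM_QUIT) { running = FALSE; break; }
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
        if (running) {
            UpdateScene();
            RenderFrame();
            SwapBuffers(hdc);  /* one swap per fully rendered frame */
        }
    }
}
```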

Please let me know if I'm doing something wrong or if you have any other tips.

Thanks.

Your steps a) and d) sound like the culprit.
You hopefully don't mean you set a different onscreen resolution every time?
You can render into FBOs sized bigger than your onscreen resolution directly.
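
For instance, the shadow depth target can be an offscreen FBO bigger than the window, created once, so the display mode never changes (a fragment, assuming a current GL context and the GL_EXT_framebuffer_object entry points of that era; windowWidth/windowHeight stand in for the real window size):

```c
/* One-time setup: a 2048x2048 depth-only FBO for the shadow map. */
GLuint fbo, depthTex;

glGenTextures(1, &depthTex);
glBindTexture(GL_TEXTURE_2D, depthTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT24,
             2048, 2048, 0, GL_DEPTH_COMPONENT, GL_UNSIGNED_INT, NULL);

glGenFramebuffersEXT(1, &fbo);
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_DEPTH_ATTACHMENT_EXT,
                          GL_TEXTURE_2D, depthTex, 0);
glDrawBuffer(GL_NONE);   /* depth-only pass, no color buffer */
glReadBuffer(GL_NONE);

/* Per frame, shadow pass: the viewport matches the FBO, not the window. */
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
glViewport(0, 0, 2048, 2048);
/* ... render the scene from the light's position ... */

/* Back to the window for the final pass. */
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);
glViewport(0, 0, windowWidth, windowHeight);
```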

[quote]Out of all these steps, the one that confuses me is not using glViewport: if I want to render my cast shadows at a higher quality, I need glViewport to resize the viewport so I can render into a bigger texture. Also, when doing GPU work it is necessary to align the texture to the screen, which again requires glViewport.
So how can I avoid using such a useful function?[/quote]

I understand this as: a smaller viewport size limits the effectiveness of SLI.
If you need to render at different sizes, what about having differently sized FBOs?
Your algorithm sounds like a ping-pong mechanism using two textures. You should try to use the "fullscreen" size for everything, both the screen and the FBOs.

The algorithm looks like it needs a lot of fillrate, so split frame rendering (SFR) might provide better scaling in your case.

You should only call SwapBuffers() if you actually rendered a new frame. With swap_exchange, the back buffer contents are undefined after the swap (in SLI AFR they are on a different board!).

Check if the load indicators you can enable in the control panel indicate any scaling in your application.

Originally posted by Relic:
[b]Your steps a) and d) sound like the culprit.
You hopefully don't mean you set a different onscreen resolution every time?
You can render into FBOs sized bigger than your onscreen resolution directly.

[quote]Out of all these steps, the one that confuses me is not using glViewport: if I want to render my cast shadows at a higher quality, I need glViewport to resize the viewport so I can render into a bigger texture. Also, when doing GPU work it is necessary to align the texture to the screen, which again requires glViewport.
So how can I avoid using such a useful function?[/quote]

I understand this as: a smaller viewport size limits the effectiveness of SLI.
If you need to render at different sizes, what about having differently sized FBOs?
Your algorithm sounds like a ping-pong mechanism using two textures. You should try to use the "fullscreen" size for everything, both the screen and the FBOs.

The algorithm looks like it needs a lot of fillrate, so split frame rendering (SFR) might provide better scaling in your case.

You should only call SwapBuffers() if you actually rendered a new frame. With swap_exchange, the back buffer contents are undefined after the swap (in SLI AFR they are on a different board!).

Check if the load indicators you can enable in the control panel indicate any scaling in your application.[/b]

Hi again,

I normally switch to 2048x2048 to get the depth map (it makes the shadows look much better). I will try using a much bigger FBO and see if it works.

Yes, SFR gives more performance than AFR.

Also, when I set the swap_exchange flag the performance is the same. Is this correct?

And finally, regarding the load indicators: in AFR mode, what do the vertical lines represent? (In my application the SFR indicator usually stays in the middle.)

Thanks.

I meant: don't call ChangeDisplaySettings().
You switch what to 2048x2048? An FBO? Not sure what you mean by "switch".
I thought you need two differently sized render to texture surfaces anyway.

You need to check if the PFD_SWAP_EXCHANGE is set in the pixelformat index you somehow chose.
ChoosePixelFormat is too dumb to look at that flag. Do an explicit DescribePixelFormat call to fill the pfd structure with the actual PIXELFORMATDESCRIPTOR belonging to the selected index. ChoosePixelFormat doesn’t fill those in.
(It’s evil and broken. Write your own. :wink: )

I've seen the load balance indicator explanation in one of the documents on the official NVIDIA driver download page. Yup, the last document at the bottom of this page: http://www.nvidia.com/object/winxp_2k_81.98.html

In AFR you need to have a big green bar between the two white lines or you don’t have any overlapping work. In SFR the horizontal line shows where the two boards split their work. In the middle is perfectly fine.

Originally posted by Relic:
[b]I meant: don't call ChangeDisplaySettings().
You switch what to 2048x2048? An FBO? Not sure what you mean by "switch".
I thought you need two differently sized render-to-texture surfaces anyway.[/b]

Yep, I have two FBOs:

  1. 2048x2048 for shadows, and
  2. one that matches the screen size for effects.

I guess I need to join them. (I need a little fix to do shadow mapping with rectangle textures.)

Originally posted by Relic:
[b]You need to check if the PFD_SWAP_EXCHANGE is set in the pixelformat index you somehow chose.
ChoosePixelFormat is too dumb to look at that flag. Do an explicit DescribePixelFormat call to fill the pfd structure with the actual PIXELFORMATDESCRIPTOR belonging to the selected index. ChoosePixelFormat doesn't fill those in.
(It's evil and broken. Write your own. :wink: )[/b]

This is how my code looks (copied from NeHe).

Originally posted by Relic:
[b]I've seen the load balance indicator explanation in one of the documents on the official NVIDIA driver download page. Yup, the last document at the bottom of this page: http://www.nvidia.com/object/winxp_2k_81.98.html

In AFR you need to have a big green bar between the two white lines or you don't have any overlapping work. In SFR the horizontal line shows where the two boards split their work. In the middle is perfectly fine.[/b]

Oops, how did I miss that? My mistake. It looks like there is only a little overlapping work.

Well, thanks a lot.

That’s the pixelformat you ask for, not the one you got. Check DescribePixelFormat() after you got the id.

Make sure you ask for 24 or 32 color bits.
Do not request a 16 bit depth buffer, use 24.
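
A request along those lines might be sketched like this (hdc is the window's device context; the driver is still free to hand back something else, which is why the DescribePixelFormat() check matters):

```c
PIXELFORMATDESCRIPTOR pfd = {0};
pfd.nSize      = sizeof(PIXELFORMATDESCRIPTOR);
pfd.nVersion   = 1;
pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL |
                 PFD_DOUBLEBUFFER | PFD_SWAP_EXCHANGE;
pfd.iPixelType = PFD_TYPE_RGBA;
pfd.cColorBits = 32;   /* 24 or 32 color bits, never 16 */
pfd.cDepthBits = 24;   /* 24-bit depth buffer, not 16   */

int pf = ChoosePixelFormat(hdc, &pfd);
SetPixelFormat(hdc, pf, &pfd);
/* Then query what was actually selected with DescribePixelFormat(). */
```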

NeHe has a lot of those “render in the idle loop” examples. IMO, a good tutorial should implement a decent WM_PAINT window message handler instead.
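
Such a handler might be sketched like this (RenderFrame() is a placeholder for the actual drawing, and hdc is the GL device context saved at window setup):

```c
case WM_PAINT:
{
    PAINTSTRUCT ps;
    BeginPaint(hwnd, &ps);   /* validates the dirty region      */
    RenderFrame();           /* steps a) to d)                  */
    SwapBuffers(hdc);        /* one swap per rendered frame     */
    EndPaint(hwnd, &ps);
    return 0;
}
/* For continuous animation, request the next frame yourself:   */
/* InvalidateRect(hwnd, NULL, FALSE);                           */
```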

Originally posted by Relic:
[b]That’s the pixelformat you ask for, not the one you got. Check DescribePixelFormat() after you got the id.

Make sure you ask for 24 or 32 color bits.
Do not request a 16 bit depth buffer, use 24.

NeHe has a lot of those “render in the idle loop” examples. IMO, a good tutorial should implement a decent WM_PAINT window message handler instead.[/b]
Great, I will try that; let's hope that fixes the performance. Thanks, I'll keep you posted.