NV FBO Performance, Part 2

In this test case, I’m seeing a 5950 Ultra and even a 5900 XT smoke a 6800 Ultra, so I’d sure appreciate some insight as to what I’m doing wrong.

The usage scenario is volume particle lighting, à la Harris. For example:
[ol]
[li]Set up a 32x32 color-only FBO[/li]
[li]For all 183188 particles
[ul]
[li]Read back a 1x1 to 8x8 pixel region centered on the particle location (glReadPixels)[/li]
[li]Render a QUAD particle into the buffer with alpha blend[/li]
[/ul]
[/li]
[/ol]
Now I haven’t even tried any optimization yet because I’m completely baffled by the stats I’m getting. Here they are:

[ul]
[li]12960 ms - 5950 Ultra (FBO)[/li]
[li]23274 ms - 6800 Ultra (FBO)[/li]
[li]5268 ms - 5950 Ultra (system framebuffer)[/li]
[li]5547 ms - 6800 Ultra (system framebuffer)[/li]
[/ul]
The first two are rendered to a 32x32 color-only FBO. The latter two are rendered to the bottom-left 32x32 corner of the default framebuffer (MSAA disabled, of course).
This immediately prompts two questions:

[ul]
[li]Why is the older card faster with each technique?[/li]
[li]Why is the system framebuffer path faster than FBOs?[/li]
[/ul]
This is all on the same system with the same app and same rendering path – only the graphics card has been changed.
Anyone have an idea what’s going on here? --Thanks!

NVidia Driver: 1.0-7667
NVidia Cfg: AGPGART, 8x, Fast Writes, SBA

Did you uninstall and reinstall the driver when you changed the gfx cards?

Is the color attachment of the FBO a texture or a renderbuffer?

If it’s a texture, I guess the glReadPixels call could be a lot slower than on the system framebuffer. I don’t think FBOs are optimized for readback.

Have you tried the same without glReadPixels? Try to achieve the same effect with RTT if possible…

Jan:
Did you uninstall and reinstall the driver when you changed the gfx cards?
No, I installed the driver with the 6800 Ultra in-place (so it should have been optimal, but wasn’t); then I just dropped in the 5950 Ultra.

Is the driver install card-specific?

Well, I don’t know if the install is card-specific. Maybe not, but I wouldn’t simply assume it isn’t.

Jan.

The first two are rendered to a 32x32 color-only FBO.
Um, why 32x32? I could imagine that some hardware would have trouble rendering to hyper-small framebuffers.

Why is the system framebuffer path faster than FBOs?
Rendering to a texture will be slower than rendering to a framebuffer. If you just need a rendertarget for reading or something, you should use a renderbuffer, not a texture. Use a texture only if you need to texture the results onto something else.

Plus, you’re using a 32x32 target, which, as I mentioned, may not be well accelerated.

hi

It seems glReadPixels is faster on a GeForce 5900 XT than on a GeForce 6800 Ultra.

We made this observation during our PBO tests, too.

Thanks to Overmind, Korval, and Jan for the suggestions.

Here are the results of the latest tests. Still no silver bullet:

[ol]
[li]Use a renderbuffer instead of a texture.
RESULT: No difference.[/li]
[li]Render to a larger texture than 32x32.
RESULT: No difference when rendering to the lower-left 32x32 of a 256x256 or 512x512 texture (rendering to a larger region of this texture would mean more readback bandwidth).[/li]
[li]Use glGetTexImage to read the texture instead of glReadPixels.
RESULT: So far I haven’t been able to get reasonable pixel data back from this.[/li]
[/ol]

Overmind: Is the color attachment of the FBO a texture or a renderbuffer?
A texture. Changing it to a renderbuffer didn’t affect performance.

Have you tried the same without glReadPixels?
Simple timers around glReadPixels reveal that 91-95% of the latency is in (or masked by) the glReadPixels call, and commenting it out reduces the total time to 5-17% of the original time. So it is mostly the sheer latency of glReadPixels.

However, all this doesn’t yet explain why an NV3x smokes a top-of-the-line NV40 on this test (by nearly 2X when reading from an FBO), or why a glReadPixels from an FBO is 2X-4X slower than from the system framebuffer.

Try to achieve the same effect with RTT if possible…
Yeah, I’ve been thinking about that. Even if I was doing a 4-pass 8x8 downsample per particle on the GPU before rendering the particle, it probably wouldn’t be any slower.

The unfortunate but intuitive feature of this lighting algorithm is that the readback is needed before rendering each particle to determine how much light has made it through the volume to this point. So it essentially ping-pongs back and forth between the CPU and the GPU (for all particles: get lighting from texture area; attenuate lighting in texture behind particle).
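To make the data dependency concrete, here is a CPU-only sketch of that ping-pong loop. It is not the code from this thread: the list-of-lists buffer stands in for the 32x32 FBO, averaging the footprint stands in for the glReadPixels readback, and multiplying by (1 - opacity) stands in for the alpha-blended quad. The function names, the tuple layout, and the square (2 * radius + 1) footprint are all assumptions made for illustration.

```python
SIZE = 32  # light buffer is SIZE x SIZE, matching the 32x32 FBO


def clamp_region(cx, cy, half, size=SIZE):
    """Clip a square region centred on (cx, cy) to the buffer bounds."""
    x0, x1 = max(cx - half, 0), min(cx + half + 1, size)
    y0, y1 = max(cy - half, 0), min(cy + half + 1, size)
    return x0, x1, y0, y1


def light_particles(particles, size=SIZE):
    """Return the light reaching each particle, in input order.

    particles: list of (x, y, radius, opacity) tuples, sorted nearest to the
    light first. Assumes every particle centre lies inside the buffer.
    """
    buf = [[1.0] * size for _ in range(size)]  # 1.0 = unattenuated light
    received = []
    for x, y, radius, opacity in particles:
        x0, x1, y0, y1 = clamp_region(x, y, radius, size)
        # "Readback": average the light stored under the particle's footprint.
        texels = [buf[j][i] for j in range(y0, y1) for i in range(x0, x1)]
        received.append(sum(texels) / len(texels))
        # "Render": darken the footprint so later particles receive less light.
        for j in range(y0, y1):
            for i in range(x0, x1):
                buf[j][i] *= 1.0 - opacity
    return received
```

Each iteration needs the buffer state left by the previous one, which is why the real version stalls on a readback once per particle.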

Henry Jones:
It seems glReadPixels is faster on a GeForce 5900 XT than on a GeForce 6800 Ultra … We made this observation during our PBO tests, too.
Interesting. Thanks! Did you happen to work out any stats (e.g. % diff for some transfer size)?

Could be that’s the 5% time increase I see with this algorithm reading from the system framebuffer (5950U vs. 6800U). Though the 80% increase with FBOs suggests something else is at work here…
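For reference, the 5% and 80% figures follow directly from the four totals in the first post. The snippet below is plain arithmetic on those posted numbers; the dictionary keys and variable names are labels invented for this sketch.

```python
# Total run times in milliseconds, copied from the first post.
timings_ms = {
    ("5950 Ultra", "FBO"): 12960,
    ("6800 Ultra", "FBO"): 23274,
    ("5950 Ultra", "system FB"): 5268,
    ("6800 Ultra", "system FB"): 5547,
}


def ratio(slow, fast):
    """How many times longer the `slow` configuration takes than `fast`."""
    return timings_ms[slow] / timings_ms[fast]


# Card-to-card gap on each path (6800 Ultra relative to 5950 Ultra).
card_gap_fbo = ratio(("6800 Ultra", "FBO"), ("5950 Ultra", "FBO"))
card_gap_fb = ratio(("6800 Ultra", "system FB"), ("5950 Ultra", "system FB"))

# FBO penalty on each card (FBO path relative to system framebuffer path).
fbo_penalty_5950 = ratio(("5950 Ultra", "FBO"), ("5950 Ultra", "system FB"))
fbo_penalty_6800 = ratio(("6800 Ultra", "FBO"), ("6800 Ultra", "system FB"))

print(f"6800U vs 5950U, FBO:       {card_gap_fbo:.2f}x")
print(f"6800U vs 5950U, system FB: {card_gap_fb:.2f}x")
print(f"FBO vs system FB, 5950U:   {fbo_penalty_5950:.2f}x")
print(f"FBO vs system FB, 6800U:   {fbo_penalty_6800:.2f}x")
```

The card-to-card gap comes out at about 1.80x on the FBO path and 1.05x on the system framebuffer path, while the FBO penalty is about 2.46x on the 5950 Ultra and 4.20x on the 6800 Ultra.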

Originally posted by Henry Jones:
[b]hi

It seems glReadPixels is faster on a GeForce 5900 XT than on a GeForce 6800 Ultra.

We made this observation during our PBO tests, too.[/b]
That’s the opposite of the results I’ve seen.

Readpixels is about 4 times faster on a gf6 than a gf5.

Measuring just the time for glReadPixels won’t give you the transfer time, because GPUs usually batch render commands and execute them only when the buffer is full or if you call glFinish. glReadPixels does a glFinish internally.

So you are probably also measuring the actual rendering time. To prevent this, add a glFinish before measuring glReadPixels.
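A minimal sketch of that measurement pattern follows. The helper name `time_readback` and its parameters are made up for illustration. The two GL calls are passed in as callables so the snippet runs without a GL context; in a real program `finish` would be glFinish and `read_pixels` would wrap the glReadPixels call being measured.

```python
import time


def time_readback(finish, read_pixels, repeats=1):
    """Return seconds spent in read_pixels alone, excluding queued rendering.

    finish      -- callable that blocks until the GPU is idle (glFinish)
    read_pixels -- callable that performs the blocking readback (glReadPixels)
    """
    total = 0.0
    for _ in range(repeats):
        finish()  # drain queued rendering before the clock starts
        start = time.perf_counter()
        read_pixels()  # only this call is inside the timed window
        total += time.perf_counter() - start
    return total
```

Without the `finish()` call, the timed window would also absorb whatever rendering was still queued when the readback was issued, which is the effect described above.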

ScottManDeath:
Measuring just the time for glReadPixels won’t give you the transfer time…add a glFinish before measuring glReadPixels.
Right. I probably should have listed those stats as well, but didn’t as glReadPixels is blocking and issue-to-completion seemed most relevant. Anyway, here are the latencies exclusively attributed to glReadPixels (processing and transfer) on the 6800 Ultra:

[ul]
[li]5025 ms - 6800 Ultra (framebuffer) - 90% of entire technique: 5547 ms[/li]
[li]16764 ms - 6800 Ultra (FBO) - 72% of entire technique: 23274 ms[/li]
[/ul]
The rub of this algorithm is that (unless I rework it and gut the readback, which it looks like I’ll be doing) I really need fast readback from FBOs, not the framebuffer, as the window-system FB will be multisampled, and doing “scratch math” and readbacks to/from a multisampled buffer is just too expensive…
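Dividing those totals by the particle count gives a rough per-call cost. This sketch assumes exactly one glReadPixels per particle, as in the algorithm outline in the first post; `readback_stats` and the `runs` table are names invented here.

```python
PARTICLES = 183188  # particle count from the first post

# (time inside glReadPixels, total run time), both in ms, on the 6800 Ultra
runs = {
    "framebuffer": (5025, 5547),
    "FBO": (16764, 23274),
}


def readback_stats(readback_ms, total_ms, count=PARTICLES):
    """Return (share of the whole run, microseconds per readback call)."""
    share = readback_ms / total_ms
    per_call_us = readback_ms * 1000 / count
    return share, per_call_us


for name, (readback_ms, total_ms) in runs.items():
    share, per_call_us = readback_stats(readback_ms, total_ms)
    print(f"{name}: {share:.1%} of run, {per_call_us:.1f} us per glReadPixels")
```

Under that assumption each readback costs about 27 microseconds from the system framebuffer and about 92 microseconds from the FBO. For a region of at most 8x8 pixels, that looks more like a fixed per-call cost than a transfer-bandwidth cost.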

What is the performance when using renderbuffers?

Also, don’t forget that the FBO implementation isn’t exactly complete. It’s functional, but it isn’t tweaked for performance or anything.

Hi, I have a slightly different question, although it’s still about FBO performance, so I thought I would stick it in here: Do you guys have any estimates on the cost of switching textures in the FBO color attachment? I’ve only read about costs relative to PBuffers, full FBO switches and such, but what I’m really interested in is this: if I were to render the same amount of geometry/pixels into one texture, how much faster would that be than rendering it into 12 separate textures (a little bit into each)? Let’s say I bind four textures simultaneously and use DrawBuffers to switch between them, so it would only require three texture switches, which doesn’t sound too bad to me. Or is it?

andras:
Do you guys have any estimates on the cost of switching textures in the FBO color attachment?
On NVidia, see this thread for a partial answer.