3D Texture Optimizations

I am working on a volume renderer using view-aligned slices, 3D textures, and a fragment program. It’s working quite nicely and I am trying to optimize it, but it isn’t clear to me how much various operations cost.

Some ideas I’ve had:
I could decrease the number of slices and then use a fragment program to take multiple texture samples along the viewing vector. This would reduce the number of fragments and polygons and not change the number of texture fetches, but would increase the fragment program complexity.
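The reason this trick can work at all is that the "over" operator is associative when colors are premultiplied by alpha, so pre-compositing k samples inside the fragment program and blending the merged result gives the same pixel as blending each slice separately. A CPU-side sketch of that property (not shader code; all names here are made up for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Premultiplied-alpha "over": front over back. With premultiplied colors the
// operator is associative, which is what lets a fragment program pre-composite
// k samples and emit a single RGBA for one framebuffer blend.
struct RGBA { double r, g, b, a; };

RGBA over(const RGBA& front, const RGBA& back) {
    double t = 1.0 - front.a;
    return { front.r + back.r * t,
             front.g + back.g * t,
             front.b + back.b * t,
             front.a + back.a * t };
}

// One blend per sample: what per-slice framebuffer blending does.
// Samples are ordered back to front.
RGBA composite_all(const std::vector<RGBA>& samples) {
    RGBA acc{0, 0, 0, 0};
    for (const RGBA& s : samples) acc = over(s, acc);
    return acc;
}

// Same samples, pre-merged k at a time "in the fragment program", then
// blended: fewer slices and blends, same number of texture fetches.
RGBA composite_grouped(const std::vector<RGBA>& samples, std::size_t k) {
    RGBA acc{0, 0, 0, 0};
    for (std::size_t i = 0; i < samples.size(); i += k) {
        RGBA group{0, 0, 0, 0};
        for (std::size_t j = i; j < i + k && j < samples.size(); ++j)
            group = over(samples[j], group);
        acc = over(group, acc);
    }
    return acc;
}
```

Both paths produce the same pixel (up to floating-point noise), which is why the change only trades blend operations for fragment program length.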

I could break up the volume into smaller cubes and vary the slice frequency per sub-volume, thereby only working hard at the edges where slice aliasing becomes obvious.

I could render once with low slice frequency in near-to-far order, drawing only nearly-opaque fragments, then draw in far-to-near order with full alpha blending. The idea being that the depth buffer would allow me to avoid running the fragment program on fragments that will never be seen.
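The payoff of that scheme comes from early ray termination: once a pixel's accumulated opacity is close to 1, no later fragment can change it. A small CPU simulation of one ray (illustrative values, not the GPU mechanism itself):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Front-to-back compositing with early ray termination: once accumulated
// opacity is nearly 1, later samples cannot change the pixel, so we stop.
// The depth-buffer trick above approximates this per-fragment on the GPU.
struct Result { double color; double alpha; std::size_t samples_used; };

Result march(const std::vector<double>& opacity,
             const std::vector<double>& color,
             double cutoff) {
    double C = 0.0, A = 0.0;
    std::size_t used = 0;
    for (std::size_t i = 0; i < opacity.size(); ++i) {
        C += (1.0 - A) * opacity[i] * color[i];
        A += (1.0 - A) * opacity[i];
        ++used;
        if (A >= cutoff) break;  // early ray termination
    }
    return {C, A, used};
}
```

With a moderately dense ray, the cutoff version touches a fraction of the samples yet lands within a hair of the full result - that difference is exactly the work the depth test would let you skip.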

I think there may be other ways of getting more out of the parallelism of the texture hardware, but I am unclear about what’s possible.

I could decrease the number of slices and then use a fragment program to take multiple texture samples along the viewing vector.
Small speedup, if any. You are probably memory bandwidth bound - texture fetches are expensive. You would only save some blend operations with the frame buffer. However, on old graphics hardware this is a good idea (the GeForce 1 could do 2 fetches in one cycle). Look up the Rezk-Salama paper from Graphics Hardware 2000. Newer hardware will optimize automatically.

I could break up the volume into smaller cubes and vary the slice frequency per sub-volume
There are papers about multi-resolution volume rendering by LaMar et al. and Weiler et al. You will probably lose all the speedup to fixing up the boundaries. However, finding empty blocks might be a good idea. This is known as empty-space leaping.
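Empty-space leaping boils down to a preprocessing pass that records, per block, whether anything inside can contribute. A toy 1D version (a real implementation would use 3D bricks and test the opacity after the transfer function, not the raw density; names are made up):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// For each block of the volume, record whether its maximum value is below a
// threshold; such blocks need no slices/fragments at all. 1D for brevity.
std::vector<bool> skippable(const std::vector<float>& volume,
                            std::size_t block, float eps) {
    std::vector<bool> skip;
    for (std::size_t i = 0; i < volume.size(); i += block) {
        float mx = 0.0f;
        for (std::size_t j = i; j < std::min(i + block, volume.size()); ++j)
            mx = std::max(mx, volume[j]);
        skip.push_back(mx < eps);  // nothing visible in this block
    }
    return skip;
}
```

The per-block maxima only need recomputing when the transfer function changes, so the pass is cheap relative to the fill-rate it saves.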

I could render once with low slice frequency in near-to-far order, drawing only nearly-opaque fragments, then draw in far-to-near order with full alpha blending. The idea being that the depth buffer would allow me to avoid running the fragment program on fragments that will never be seen.
Good idea. Look up Krueger and Westermann’s paper from Vis 2003. They do ray casting with early ray termination. It will only work on the ATI R3xx with early z/stencil - don’t waste your time with the FX.

You can get more ideas at our SIGGRAPH 2004 course in Los Angeles (SIGGRAPH Course 28: Real-Time Volume Graphics).

  • Klaus

Originally posted by Klaus:
[quote]I could decrease the number of slices and then use a fragment program to take multiple texture samples along the viewing vector.
Small speedup, if any. You are probably memory bandwidth bound - texture fetches are expensive. You would only save some blend operations with the frame buffer. However, on old graphics hardware this is a good idea (the GeForce 1 could do 2 fetches in one cycle). Look up the Rezk-Salama paper from Graphics Hardware 2000. Newer hardware will optimize automatically.

  • Klaus[/quote]
Using a fragment program to take multiple texture samples along the viewing vector is a good idea not only for performance, but also for blending precision.
    If you use the frame buffer blending operator, your data is converted to 8 bits, which is not enough for good volumetric rendering.
    If you perform this blending (at least partially) in a fragment program, it is done in floating point.
    In any case, if you need high quality (a high number of slices), avoid the frame buffer blending operator as long as it is not a 16-bit floating-point operator.
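The precision point is easy to demonstrate numerically: composite many low-alpha slices through an accumulator that is rounded to 8 bits after every blend, and it stalls long before the floating-point result does. A small sketch with made-up illustrative values:

```cpp
#include <cassert>
#include <cmath>

// Round a value in [0,1] to the nearest 8-bit level, as an 8-bit
// framebuffer does after every blend.
double quantize8(double x) { return std::floor(x * 255.0 + 0.5) / 255.0; }

// Back-to-front "over" of `slices` identical samples, with the accumulator
// optionally quantized to 8 bits after each blend.
double composite(int slices, double alpha, double color, bool eight_bit) {
    double acc = 0.0;
    for (int i = 0; i < slices; ++i) {
        acc = color * alpha + acc * (1.0 - alpha);
        if (eight_bit) acc = quantize8(acc);
    }
    return acc;
}
```

With 500 slices at alpha = 0.004, the float accumulator reaches about 0.865, while the 8-bit one gets stuck near 0.51: once the per-slice increment rounds to less than half a level, further slices contribute nothing. That stall is exactly the banding you see with high slice counts and 8-bit blending.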

Wishes,

Luis

Thanks for the help, Klaus. I expect to be at your SIGGRAPH talk.

I think the general thing I’m wondering is whether I am limited by raw texture memory bandwidth or by the number of fetches. I guess I don’t have a sense of what’s going on in hardware, exactly. Turning linear interpolation in the texture on and off doesn’t seem to change the speed much.

Does anyone know of a document that describes what goes on in the graphics card when textures are used, such as when things are cached, where texels are blended, and where the possibilities are for parallelism?

Try using a texture internal format with fewer bits, or a luminance texture instead of colour, if you aren’t already. This should reduce bandwidth strain. If performance increases, you were probably texture-bandwidth limited.

However, you probably use lots of framebuffer bandwidth with all the blending you do, so this might be more significant than texture bandwidth. Try turning off blending and using fewer bits in the pixel format; if performance increases, you’re framebuffer-bandwidth limited.

The last thing to try is a shorter fragment program that doesn’t affect bandwidth (tricky but possible: remove all ALU instructions and keep the texture accesses). If performance increases, you’re fragment-shader limited.

That was a short but incomplete guide; check out the performance guides on ATI’s and NVIDIA’s developer sites for more info.
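To put rough numbers on the texture-vs-framebuffer question, here is a back-of-the-envelope traffic estimate. The window size, slice count, and formats below are made-up example values, and the 8-texel trilinear footprint is a worst case, since the texture cache reuses neighbouring texels:

```cpp
#include <cassert>

double megabytes(double bytes) { return bytes / (1024.0 * 1024.0); }

// Worst-case texture traffic: every fragment fetches `texels_per_fetch`
// texels (8 for trilinear filtering) of `bytes_per_texel` each.
double texture_traffic(double fragments, double texels_per_fetch,
                       double bytes_per_texel) {
    return fragments * texels_per_fetch * bytes_per_texel;
}

// Framebuffer traffic: blending reads and writes the destination pixel.
double framebuffer_traffic(double fragments, double bytes_per_pixel) {
    return fragments * 2.0 * bytes_per_pixel;
}
```

For a 512x512 window with 256 full-screen slices, a 1-byte luminance volume with trilinear filtering generates at most 512 MB of texture reads per frame, and blending into an RGBA8 framebuffer another 512 MB of framebuffer traffic - the same order of magnitude, and since the texture cache cuts the first number down a lot in practice, framebuffer bandwidth can easily dominate.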

If you’re on a card that supports it (i.e., NVIDIA), you might try using the accumulation buffer and SUN_slice_accum. That should give higher quality than blending to the framebuffer.

One thing I’d like to know is, what does this mean: “Four parallel rendering pipelines process up to 1.6 billion pixels per second”

What does and does not get all of these pipelines going?

The German website 3dcenter.de usually has very detailed articles. Translations are available for most of their articles:
CineFX (NV30) Inside
Inside nVidia NV40
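As for the "1.6 billion pixels per second" figure: that kind of number is usually just the peak fill rate, pipelines times core clock, assuming every pipe retires one pixel every cycle. Long fragment programs, texture stalls, and blending all eat into it, so real throughput is well below the peak. The arithmetic (assuming a 400 MHz core clock for that particular quote):

```cpp
#include <cassert>

// Peak fill rate in pixels per second: each pipeline is assumed to retire
// one pixel per clock cycle, which marketing numbers take for granted.
double peak_fill_rate(double pipelines, double core_clock_hz) {
    return pipelines * core_clock_hz;
}
```

So "four parallel pipelines, 1.6 billion pixels per second" is consistent with 4 pipes at 400 MHz; anything that makes a pipe take more than one cycle per pixel keeps you from reaching it.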

  • Klaus