fragment program performance question

Hi all,
I'm presently writing a realtime 3D visualisation application that uses a fragment program to do a transfer function lookup and to shade the object based on the gradients of the dataset. The data is represented as two 3D textures (the density and the gradient) and one 2D texture (the transfer function).
Presently I'm having quite severe performance problems that result from fill rate limitations. The optimisation I have implemented up to now is one z-only pass, without blending and with the alpha test enabled, drawn front to back to determine which parts of the model will be visible (this pass alone runs at about 60 fps at 400x400 with 100 slices drawn on a Radeon 9700 Pro). Then I'm doing a z-equals pass with the shading fragment program enabled. Still the performance is quite poor (around 30 fps at 400x400 after this pass, and that is even before a pass that displays the transparent portions of the model).
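In rough outline, the state setup for the two passes looks something like this (just a sketch with placeholder names, not the actual code):

```cpp
#define GL_GLEXT_PROTOTYPES   // assumes the ARB_fragment_program entry points are available
#include <GL/gl.h>
#include <GL/glext.h>

// Hypothetical program handles and slice-drawing routine standing in for the real ones.
extern GLuint densityLookupProgram, shadingProgram;
void drawSlicesFrontToBack();

void renderVisibleSurface()
{
    glEnable(GL_FRAGMENT_PROGRAM_ARB);
    glEnable(GL_DEPTH_TEST);

    // Pass 1: z-only, alpha-tested, front to back with the cheap
    // density/transfer-function program; only "visible enough" fragments
    // make it into the depth buffer.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    glDisable(GL_BLEND);
    glEnable(GL_ALPHA_TEST);
    glAlphaFunc(GL_GREATER, 0.05f);   // assumed visibility threshold
    glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, densityLookupProgram);
    drawSlicesFrontToBack();

    // Pass 2: the expensive shading program runs only where the depth matches
    // what pass 1 wrote, i.e. on the fragments that survived the alpha test.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_EQUAL);
    glDisable(GL_ALPHA_TEST);
    glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, shadingProgram);
    drawSlicesFrontToBack();
}
```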

My question is now whether you guys might know some other techniques to speed up situations in which huge amounts of fill rate are spent on drawing fullscreen multitextured quads with fragment programs enabled…

How many instructions is your fragment program? Are you doing all linear calculations in the vertex program and interpolating them? What kind of texture filtering are you using? Is there any way you could avoid using 3D textures (I would guess not)?

Presently I have to use two fragment programs.
The first does one lookup into the density 3D texture and uses this data to do a lookup into the transfer function. The result is used for the front-to-back z-pass only.

The second program isn't really finished (it will have to do the lighting calculations). Presently it's only doing a lookup into the gradient 3D texture and doing incorrect lighting (I still have to get the view and light vectors into texture space…). Those results are used for the z-equals pass.

The filtering I'm using is linear interpolation for the 3D textures and nearest neighbour for the 2D transfer texture.
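Roughly, the filtering setup is this (texture names are placeholders; the textures are created and filled elsewhere):

```cpp
#include <GL/gl.h>
#include <GL/glext.h>   // for GL_TEXTURE_3D on older headers

void setVolumeFiltering(GLuint densityTexture, GLuint gradientTexture, GLuint transferTexture)
{
    // Trilinear sampling of both 3D volumes.
    glBindTexture(GL_TEXTURE_3D, densityTexture);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    glBindTexture(GL_TEXTURE_3D, gradientTexture);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    // Nearest-neighbour lookups into the 2D transfer function table.
    glBindTexture(GL_TEXTURE_2D, transferTexture);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
}
```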

By the way, I think the biggest performance gain can be made by testing fragments early and discarding them… My problem is that I am already doing this, and thus I'm out of ideas what to do next in order to save bandwidth…

By the way, here is a little example pic (with more slices it gets much smoother, but also much slower): http://stud3.tuwien.ac.at/~e0125958/img/kopf.JPG

Use GL_LEQUAL instead of GL_EQUAL (if possible). That won't gain you much, but nVidia said that would be a good thing to do, you never know.

Sorry, can't think of anything else you could do.

Jan.

What performance numbers do you need? 30 fps isn’t too shabby for volume rendering.

Still, maybe you could give some of these things a try:

  1. Render to 2D texture at 1/2 res when user is interacting with volume (or whenever you need it to be > 30 fps) and then paste this texture on a fullscreen quad. When volume is sitting still, swap in a full-res version. (A rough sketch follows after this list.)

  2. Project every other slice onto the one in front of it and pre-compute their blending/masking. You’ll have a slightly longer fragment program but only have to draw 1/2 the slices.

  3. Try pre-integrated volume rendering to reduce the number of slices needed. Klaus Engel has a paper on the technique.

  4. Try making a ray-casting fragment program. You’ll have a much longer fp but you might only have to touch each pixel once.
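A rough sketch of point 1, copying the framebuffer into a texture with glCopyTexSubImage2D (a pbuffer render-to-texture path would work the same way); the texture, sizes and drawVolume() are placeholders:

```cpp
#include <GL/gl.h>

extern GLuint interactionTexture;   // assumed RGBA texture of exactly halfW x halfH texels
                                    // (e.g. for a power-of-two window), GL_LINEAR magnification
void drawVolume();                  // stands in for the usual slice-rendering passes

void renderInteractive(int screenW, int screenH)
{
    const int halfW = screenW / 2, halfH = screenH / 2;

    // Render the volume into the lower-left quarter of the framebuffer.
    glViewport(0, 0, halfW, halfH);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    drawVolume();

    // Grab the half-res result.
    glBindTexture(GL_TEXTURE_2D, interactionTexture);
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, halfW, halfH);

    // Paste it back, upscaled, onto a fullscreen quad (fragment programs and
    // depth writes assumed disabled again at this point).
    glViewport(0, 0, screenW, screenH);
    glMatrixMode(GL_PROJECTION); glLoadIdentity();
    glMatrixMode(GL_MODELVIEW);  glLoadIdentity();
    glDisable(GL_DEPTH_TEST);
    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
        glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f, -1.0f);
        glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f,  1.0f);
        glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f,  1.0f);
    glEnd();
}
```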

Originally posted by Zeno:
What performance numbers do you need? 30 fps isn’t too shabby for volume rendering.

Yes, 30 fps is quite fine, but I'm trying to get that at higher resolutions as well.

Still, maybe you could give some of these things a try:

  1. Render to 2d texture at 1/2 res when user is interacting with volume (or whenever you need it to be > 30 fps) and then paste this texture on a fullscreen quad. When volume is sitting still, swap in a full-res version.

This whole thing will be a self-running demo, so it will be moving all the time… The only thing I might do is render everything to a pbuffer at a low resolution and put that on the screen if I have to supply a fullscreen version (I'm doing this for a realtime course at university).

  2. Project every other slice onto the one in front of it and pre-compute their blending/masking. You’ll have a slightly longer fragment program but only have to draw 1/2 the slices.
  3. Try pre-integrated volume rendering to reduce the number of slices needed. Klaus Engel has a paper on the technique.

Those two ideas sound really interesting, and I was already thinking about something a bit like this… even though I feared that the longer fragment program would slow things down too much, since it will have far more texture accesses. But I might really try this out (fewer fragments with longer programs might really turn out faster… By the way, has anybody got positive experience with the KIL instruction? I tried it in an early version and got zero performance gain out of discarding fragments in the fp.)

  4. Try making a ray-casting fragment program. You’ll have a much longer fp but you might only have to touch each pixel once.

As before, this sounds like something I should try (even though I guess I'm limited a bit by the R300 texture indirection limitations).

Thanks for the long answer.


Originally posted by Jan2000:
Use GL_LEQUAL instead of GL_EQUAL (if possible). That won't gain you much, but nVidia said that would be a good thing to do, you never know.

Sorry, can't think of anything else you could do.

Jan.

I'm afraid GL_LEQUAL won't work for my situation, since I'm dealing with largely transparent fullscreen quads. I have to use GL_EQUAL in order to find the parts of the polygons that are visible (meaning they are not obstructed by other parts AND their alpha is above a certain level; this latter criterion can't be tested with GL_LEQUAL).
Still, thanks for the answer.

I think the “kill” instruction is only there for future hardware. I think that today’s hardware still runs each fragment through the whole program, even if it’s been killed. I could be wrong on this.

Since you don’t seem to be doing much (if any) blending on your volume, only alpha test, it would be really great for you if the kill instruction worked. Especially if you were doing ray-cast volume rendering.

Also, because of that, I think that option 2 would be a better way for you to reduce the number of slices than option 3, since the pre-integration may assume some sort of smoothness (in the mathematical sense) in the opacity of the data. Klaus would probably be able to answer that better if he sees this thread.

As before, this sounds like something I should try (even though I guess I'm limited a bit by the R300 texture indirection limitations)

I tried this in the past and ran into that texture indirection problem. I realized later that it was because I used the same register as the destination for each lookup, which ATI counted as dependent (though they’re really not). Anyway, I haven’t had time to try to revisit the problem so it may be workable now with newer drivers or a slightly adjusted fragment program. It would probably work great on an nvidia card. I’m guessing that this would be the fastest possible method and as a side effect it would naturally work with a perspective projection without having to use spherical shells.
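Just to illustrate what I mean (a sketch with assumed texture bindings, not my actual program): give every lookup its own destination TEMP, so only the transfer-function fetch is a genuinely dependent read. Units here: 0 = density volume, 1 = gradient volume, 2 = transfer function (1D for simplicity).

```cpp
// Illustration only; whether this changes the driver's indirection counting is driver-dependent.
static const char *independentLookupsFP =
    "!!ARBfp1.0\n"
    "TEMP density, gradient, color;\n"
    "TEX density,  fragment.texcoord[0], texture[0], 3D;\n"   // independent fetch
    "TEX gradient, fragment.texcoord[0], texture[1], 3D;\n"   // independent fetch
    "TEX color,    density,              texture[2], 1D;\n"   // depends on density only
    "MOV result.color, color;\n"
    "END\n";
```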

OK, now that I've really browsed through the paper, the idea of using a lookup table for the start and end density sounds really interesting, since it could increase display quality considerably. I guess it will also work with my plan to interpolate between two transfer functions (which will have to be pre-integrated, and those lookup tables will then have to be interpolated instead of the real transfer functions… I guess this will not be the same as interpolating them directly, but I hope it won't create too disturbing artefacts).

Another great thing in this paper is the way the gradient for isosurfaces is generated (it will constrain my way of handling those a little, but I guess that's no real problem). A huge part of the artifacts I'm presently generating could be diminished by employing this technique.

On the whole it was really a good thing to point me to this paper (I guess I'll try to experiment a little; maybe I'll only implement the gradient interpolation stuff mentioned in there, since it won't mess up the rest of my code…).

I’m around

Some things to look at:

  1. Pre-integration will improve quality with fewer slices. However, you have to do two fetches (front and back of the ray segment) into the volume. This will reduce performance, unless you do raycasting with a long fragment program so that you can reuse previous fetches.
  2. Try to skip empty space: Wei Li, Klaus Mueller, and Arie Kaufman, Empty Space Skipping and Occlusion Clipping for Texture-based Volume Rendering, IEEE Visualization 2003, pp.317-324
  3. Try to terminate rays that have reached maximum opacity: Jens Krueger and Rüdiger Westermann (Technical University Munich), Acceleration Techniques for GPU-based Volume Rendering, IEEE Visualization 2003
  4. Subsample the volume
  5. Subsample the image as previously pointed out by Zeno
  6. Kill won’t help you on current hardware.

Klaus

Another thing:

2D textures are almost twice as fast as 3D textures:
C. Rezk-Salama, K. Engel, M. Bauer, G. Greiner, T. Ertl, Interactive Volume Rendering on Standard PC Graphics Hardware Using Multi-Textures and Multi-Stage Rasterization, in Proc. Eurographics/SIGGRAPH Workshop on Graphics Hardware 2000 (HWWS00):
ftp://ftp9.informatik.uni-erlangen.de/pub/Publications/2000/Publ.2000.5.pdf

Klaus

And

You could put both the gradient and the volume scalar value into an RGBA 2D/3D texture (gradient = RGB, scalar = A). So with only one texture read you will have the gradient and the scalar.
Fewer memory reads = more performance.
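As a sketch (array layout and names are assumptions, not from any particular renderer), the packing could look like this:

```cpp
#define GL_GLEXT_PROTOTYPES   // for glTexImage3D on older headers
#include <GL/gl.h>
#include <GL/glext.h>
#include <cstddef>
#include <vector>

// Pack gradient (RGB) and scalar (A) into one RGBA8 3D texture, assuming both
// inputs are already quantised to 8 bits per component.
void buildPackedVolume(const unsigned char *gradientXYZ,  // 3 bytes per voxel
                       const unsigned char *density,      // 1 byte per voxel
                       int w, int h, int d, GLuint texture)
{
    const std::size_t voxels = static_cast<std::size_t>(w) * h * d;
    std::vector<unsigned char> packed(voxels * 4);
    for (std::size_t v = 0; v < voxels; ++v) {
        packed[v * 4 + 0] = gradientXYZ[v * 3 + 0];   // gradient x
        packed[v * 4 + 1] = gradientXYZ[v * 3 + 1];   // gradient y
        packed[v * 4 + 2] = gradientXYZ[v * 3 + 2];   // gradient z
        packed[v * 4 + 3] = density[v];               // scalar value
    }
    glBindTexture(GL_TEXTURE_3D, texture);
    glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA8, w, h, d, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, &packed[0]);
}
```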

Really, that’s it for today
Klaus

I think the “kill” instruction is only there for future hardware. I think that today’s hardware still runs each fragment through the whole program, even if it’s been killed. I could be wrong on this.

Well, it may still run the fp through, but the kill instruction does function as it should. Kill is not required to provide any performance benefit; all it needs to do is prevent the rest of the pipeline from working.

Originally posted by Korval:
Well, it may still run the fp through, but the kill instruction does function as it should. Kill is not required to provide any performance benefit; all it needs to do is prevent the rest of the pipeline from working.

Right, sorry. I didn’t mean to imply that it was a noop or something, only that it doesn’t speed things up to use it.

Thanks for all those tips and references.
As for using 2D textures… could the cause of the speed improvement be that the texture cache works better for them? (When using 3D textures it seems that the rendering speed depends on the view direction, which could indicate this.)

I guess I'll really use just one texture for gradient and density (up until now I didn't do this since I didn't want to sacrifice the density precision of my datasets, which is 12 bit. So I wrote the upper 4 bits into one color channel and the lower 8 bits into the second channel of my density texture; the transfer function texture was 2D, with the x coordinate representing the lower 8 bits and y the upper 4 bits of the density).
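For reference, the split was basically this (a sketch with placeholder names, not my exact code); a sample s is later reconstructed as s = hi*256 + lo, and the 2D transfer function is indexed with x = lo and y = hi:

```cpp
#include <cstddef>

// Split 12-bit density samples into two 8-bit channels.
void splitDensity12(const unsigned short *density12,  // source values in [0, 4095]
                    unsigned char *hiChannel,         // upper 4 bits  (0..15)
                    unsigned char *loChannel,         // lower 8 bits  (0..255)
                    std::size_t voxelCount)
{
    for (std::size_t v = 0; v < voxelCount; ++v) {
        hiChannel[v] = static_cast<unsigned char>(density12[v] >> 8);
        loChannel[v] = static_cast<unsigned char>(density12[v] & 0xFF);
    }
}
```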

You split the bits of the volume scalars into different color channels? Wouldn't this break the texture filtering?

Klaus

Originally posted by Klaus:
You split the bits of the volume scalars into different color channels? Wouldn't this break the texture filtering?

Klaus

Hmm, I guess because of rounding and some other numerical influences the resulting interpolated scalar values won't be exactly the same as interpolating the scalar as a whole, but since the following equation should hold, I think it works (the results I'm getting seem to be fine in my implementation):

OK, let s1 be one scalar sample, which is split into

s1 = r1*256 + g1

and let

s2 = r2*256 + g2

be another one.

An interpolated scalar value would then be

s3 = x*s1 + (1-x)*s2

If s1 and s2 are substituted by their split-up versions,

s3 = x*(r1*256 + g1) + (1-x)*(r2*256 + g2)

which after some rearranging becomes

s3 = (x*r1 + (1-x)*r2)*256 + x*g1 + (1-x)*g2
This should be approximately the same result I get when interpolating the parts of the scalar separately in the two color channels. The important thing to minimize artefacts here was to set the transfer function filtering to nearest…

But I guess I'll really abandon this technique in order to gain speed.


I’ve been working on a similar renderer, but I’ve had problems getting KIL to actually work.

For example,

  MOV temp, 1.0;
  KIL temp;

appears to cause the fragment to be killed. The only thing that seems to actually work is KIL 1.0; (or KIL x for any non-negative x).

Is there any trick to this?
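For comparison, my understanding from the spec is that KIL should discard the fragment only when some component of its operand is less than zero, so a positive constant should never trigger it. The pattern I'm aiming for looks roughly like this (texture bindings, the 1D transfer function and the env-parameter threshold are assumptions, embedded as a C string):

```cpp
// Minimal sketch: discard a fragment when its transfer-function alpha is below
// the threshold passed in program.env[0].x.
static const char *killTransparentFP =
    "!!ARBfp1.0\n"
    "TEMP density, color, diff;\n"
    "TEX density, fragment.texcoord[0], texture[0], 3D;\n"   // scalar volume
    "TEX color, density, texture[1], 1D;\n"                  // transfer function lookup
    "SUB diff, color.a, program.env[0].x;\n"                 // alpha minus threshold
    "KIL diff;\n"                                            // negative => fragment discarded
    "MOV result.color, color;\n"
    "END\n";
```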