S3TC texture compression performance

Hi,

I understand that there are additional factors that play into this, but is it expected that a shader-based volume rendering algorithm performs faster on compressed textures than on uncompressed ones?

I played around with a shader-based volume rendering algorithm from VTK and changed it to compress the 3D texture for the volume during the upload to the graphics board. I wanted to see what the tradeoff is between the additional data I can hold in texture memory and the performance drop during rendering. I was surprised to find out that a maximum intensity projection of my sample data set (512 x 512 x 700 voxels) was about 20% faster with compressed textures. My graphics card is a NVidia G210M.
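
For reference, here is roughly what that change amounts to, stripped of the VTK scaffolding. This is a minimal sketch with hypothetical names, assuming the GL_EXT_texture_compression_s3tc extension and driver support for compressed 3D internal formats (which my experiment suggests the NVIDIA driver has); the only difference from the uncompressed path is the internal format passed to glTexImage3D:

```cpp
#include <GL/glew.h>

// Upload an 8-bit volume, optionally letting the driver compress it to DXT5.
// `voxels` is a hypothetical pointer to w*h*d bytes of scalar data.
void uploadVolume(const unsigned char* voxels, int w, int h, int d, bool compress)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_3D, tex);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    // The only change versus the uncompressed path: a compressed internal format.
    GLenum internalFormat = compress ? GL_COMPRESSED_RGBA_S3TC_DXT5_EXT : GL_RGBA8;
    glTexImage3D(GL_TEXTURE_3D, 0, internalFormat, w, h, d, 0,
                 GL_LUMINANCE, GL_UNSIGNED_BYTE, voxels);

    // Verify that the driver actually compressed the texture.
    GLint isCompressed = GL_FALSE;
    glGetTexLevelParameteriv(GL_TEXTURE_3D, 0, GL_TEXTURE_COMPRESSED, &isCompressed);
}
```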

I assumed that the shader that does the actual volume ray casting does not care whether a texture is compressed or not, and that the only difference lies in what happens when the tracer asks for the value of a voxel: in one case the result comes directly from the texture buffer, and in the other case the value of the requested texel is decompressed on the fly (I understand that the algorithm is optimized for random access and on-the-fly decompression speed). Still, I expected at least some performance drop.

Where I could understand the performance increase would be if the textures (compressed or uncompressed) internally got copied around while performing ray casting, so that the reduced bandwidth requirements of the compressed textures made the difference, but I don’t see where - and why - this should happen.

Or maybe, assuming that texture sampling is what takes most of the ray casting shader’s time, the GPU caches some data (neighboring blocks of texels?) in on-chip memory during decompression, so that texture memory is accessed less often?

Any ideas?

Thanks,

Mark

Indeed, S3TC is heavily tuned to be extremely hardware-efficient.
Graphics cards most probably have special circuitry, so the actual decompression takes no more time than an uncompressed access. However, the data bandwidth is 4 times smaller with compressed textures (DXT5 stores 1 byte per texel versus 4 bytes for RGBA8), and that makes a difference for texture-heavy shading.

I don’t have my “Real-Time Rendering, 3rd edition” book with me right now, but I vaguely remember the chapter about hardware architecture.

I think each texture unit has a really fast cache where it can store a block of the texture. This amortizes the cost of reading neighboring texels from VRAM, and it is true for compressed and uncompressed textures alike.

In addition, with compressed textures, loading from VRAM into the texture cache is even faster.

Again, this is just off the top of my head right now. You’d better check the book yourself.

Regarding the use of compressed textures in volume rendering, be aware that S3TC is a lossy compression, so you may degrade rendering quality.

It’s probably more important to keep accuracy over speed when you’re dealing with scientific data.

However, you might find lossy compression good enough as an LOD (level of detail) for quick interaction, and switch back to the uncompressed texture for a more accurate rendering when you stop interacting.
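
A minimal sketch of what I mean, with hypothetical names (it assumes you keep both versions of the volume resident, which of course costs extra texture memory):

```cpp
// Bind the lossy-but-fast compressed volume while the user interacts,
// and the uncompressed volume for the accurate frame once interaction stops.
void renderVolume(bool interacting, GLuint compressedTex, GLuint uncompressedTex)
{
    glBindTexture(GL_TEXTURE_3D, interacting ? compressedTex : uncompressedTex);
    drawRayCastPass();  // hypothetical: issues the ray-casting draw call
}
```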

Could DXT perhaps be enhanced to handle the red, green and blue channels as separate planes/partitions, to be much less lossy? (cf. two-bit indexing for each of the R, G and B planes, versus only two-bit indexing for an RGB triple on a single RGB plane in the current DXT compression)

For example, with a 4:2:2 partition, the intensity of each component (cf. green:blue:red) can be interpolated with 2 bits each => this gives only 12 bits for 4 pixels, with a very big gain in quality …
(but OK, on the other side the compression ratio isn’t as good as the first DXT version’s)

Somehow, I don’t think that’s going to solve his performance problem. Particularly since this is not supported in hardware and would therefore not be faster.

A lot of things have changed since the first 3dfx card …
(fill rates are far more impressive now …)

For example, we now have vertex/fragment[/geometry] shaders, which aren’t that old but are already supported in hardware :)

And as Zbuffer and Overlay say, DXT compression has far less memory to handle, so fewer memory accesses to make (and far fewer VESA/AGP/PCI/PCI-E transfers for accessing external memory …).

For example, we now have vertex/fragment[/geometry] shaders, which aren’t that old but are already supported in hardware

If a simple texture fetch is too slow for this person’s needs, manual texture decompression is not going to be faster.

Uh … and what if you could get more than just one texel per fetch???

A cuboid of 8 “3D texels” with only one texture fetch, for example …

Note that an RGBA texel can already easily be treated as 4 intensity-only texels :)
(cf. R/G/B/A 32 bits => 4x 8-bit intensities)
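
Something like this, as a sketch of the packing (hypothetical names; the shader then selects the channel with int(z) % 4):

```cpp
#include <cstdint>
#include <vector>

// Pack 4 consecutive Z slices of an 8-bit scalar volume into the R/G/B/A
// channels of one RGBA8 volume that is 4x shallower. One fetch then returns
// 4 voxels along Z. Assumes d is a multiple of 4.
std::vector<uint8_t> packSlices(const uint8_t* voxels, int w, int h, int d)
{
    const int packedDepth = d / 4;
    std::vector<uint8_t> packed(size_t(w) * h * packedDepth * 4);
    for (int z = 0; z < packedDepth; ++z)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                for (int c = 0; c < 4; ++c)  // channel c holds sub-slice z*4+c
                {
                    size_t src = (size_t(z) * 4 + c) * h * w + size_t(y) * w + x;
                    size_t dst = (size_t(z) * h * w + size_t(y) * w + x) * 4 + c;
                    packed[dst] = voxels[src];
                }
    return packed;
}
```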

Uh … and what if you could get more than just one texel per fetch???

You haven’t proposed a way for that to happen.

Note that an RGBA texel can already easily be treated as 4 intensity-only texels

And how does this deal with the problem at hand? He doesn’t need 4 intensities; he needs RGB(A) color data.

You haven’t proposed a way for that to happen.

The green component is the first Z plane, the red component the second Z plane, the blue component the third Z plane, and the alpha component the fourth Z plane (and this mod-4 scheme continues for the next Z planes, using another texture unit, for example)
=> this gives 4 planes per texel/unit, so 4 texels per texture unit …
(OK, we have to use a 256-entry palette to map the texel index to a true RGB(A) color)

And we can also work in a YCbCr 4:2:2 colorspace for better compression/quality than the standard 4:4:4 RGB colorspace, of course :)

This seems like the good old “planar vs. packed RGBA pixel format” story …

The green component is the first Z plane, the red component the second Z plane, the blue component the third Z plane, and the alpha component the fourth Z plane (and this mod-4 scheme continues for the next Z planes, using another texture unit, for example)
=> this gives 4 planes per texel/unit, so 4 texels per texture unit …

What do you mean by “Z plane?” Because I’m not seeing how you’re getting 4 planes of anything per texture access.

(OK, we have to use a 256-entry palette to map the texel index to a true RGB(A) color)

So now you’re talking about palettes. That means at least two texture accesses: one for the main texture and one for the palette. How exactly do you expect this to be faster to access than a single S3TC access?

And we can also work in a YCbCr 4:2:2 colorspace for better compression/quality than the standard 4:4:4 RGB colorspace, of course

But, since hardware doesn’t support YCbCr, you now need shader logic to transform it back into linear RGB. This isn’t making anything faster.
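
To make that cost concrete, this is roughly the per-sample math such a shader would have to run (full-range BT.601 coefficients; a sketch, not the shaders mentioned in this thread), i.e. several extra multiply-adds on every single texture fetch:

```cpp
struct RGB { float r, g, b; };

// Convert one full-range BT.601 YCbCr sample (components in [0, 255]) to RGB.
RGB ycbcrToRgb(float y, float cb, float cr)
{
    RGB out;
    out.r = y + 1.402f    * (cr - 128.0f);
    out.g = y - 0.344136f * (cb - 128.0f) - 0.714136f * (cr - 128.0f);
    out.b = y + 1.772f    * (cb - 128.0f);
    return out;
}
```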

A Z plane is a color component (cf. R/G/B/A => 4 components)

A DXT texel decompression needs access to two RGB16 colors (4 bytes) and has to extract two bits from the 4x4x2 = 32 bits = 4 bytes of indices
=> so 8 bytes per texel access
(+ of course the color interpolation along the diagonal from C0 to C1, with the two bits selecting one of 4 levels …)
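
Spelled out as code, decoding one DXT1 block looks like this (a sketch of the scheme described above, not any particular hardware implementation):

```cpp
#include <cstdint>

struct Color { uint8_t r, g, b; };

// Expand a 5:6:5-packed color to 8 bits per channel.
static Color fromRGB565(uint16_t c)
{
    return { uint8_t(((c >> 11) & 31) * 255 / 31),
             uint8_t(((c >>  5) & 63) * 255 / 63),
             uint8_t(( c        & 31) * 255 / 31) };
}

// One 8-byte DXT1 block encodes a 4x4 texel tile: two RGB565 endpoint colors
// followed by 16 two-bit indices into the 4-entry palette derived from them.
void decodeDXT1Block(const uint8_t block[8], Color out[16])
{
    uint16_t c0 = uint16_t(block[0] | (block[1] << 8));
    uint16_t c1 = uint16_t(block[2] | (block[3] << 8));
    Color p[4] = { fromRGB565(c0), fromRGB565(c1) };
    if (c0 > c1) {  // opaque mode: two colors interpolated on the c0-c1 line
        p[2] = { uint8_t((2 * p[0].r + p[1].r) / 3),
                 uint8_t((2 * p[0].g + p[1].g) / 3),
                 uint8_t((2 * p[0].b + p[1].b) / 3) };
        p[3] = { uint8_t((p[0].r + 2 * p[1].r) / 3),
                 uint8_t((p[0].g + 2 * p[1].g) / 3),
                 uint8_t((p[0].b + 2 * p[1].b) / 3) };
    } else {        // punch-through mode: midpoint color plus black/transparent
        p[2] = { uint8_t((p[0].r + p[1].r) / 2),
                 uint8_t((p[0].g + p[1].g) / 2),
                 uint8_t((p[0].b + p[1].b) / 2) };
        p[3] = { 0, 0, 0 };
    }
    uint32_t indices = uint32_t(block[4]) | (uint32_t(block[5]) << 8)
                     | (uint32_t(block[6]) << 16) | (uint32_t(block[7]) << 24);
    for (int i = 0; i < 16; ++i)  // 2 bits per texel, least significant first
        out[i] = p[(indices >> (2 * i)) & 3];
}
```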

And the best for last: YCbCr has been used in the PAL/SECAM standards since the 1960s :) :)
(before that, TVs were black and white only, cf. only the Y channel is used on a B/W TV; Cb/Cr came later to add the color)

And I already posted the YCbCr -> RGB shaders last year :)


A Z plane is a color component (cf. R/G/B/A => 4 components)

So why don’t you call it that, instead of making up a new term like “Z plane”?

A DXT texel decompression needs access to two RGB16 colors (4 bytes) and has to extract two bits from the 4x4x2 = 32 bits = 4 bytes of indices
=> so 8 bytes per texel access
(+ of course the color interpolation along the diagonal from C0 to C1, with the two bits selecting one of 4 levels …)

I know how S3TC works. I wrote the OpenGL Wiki article explaining how it works. That doesn’t explain anything about what you’ve said in this thread, nor does it explain how your method is supposed to be faster than regular S3TC.

And the best for last: YCbCr has been used in the PAL/SECAM standards since the 1960s

That’s a non sequitur. It doesn’t matter where it’s been used for years; it isn’t in graphics hardware, which is what this discussion is about. Asking or wishing for it to be so will not make it appear.

And I already posted the YCbCr -> RGB shaders last year

And is that faster at texture accessing than RGB? No.

This thread is about the performance of S3TC on 3D textures. It is not about your personal pet project of your own style of texture compression. The merits or lack thereof of that method can be discussed elsewhere. If you’re not going to offer suggestions that help the user of this post with his actual problem (making compressed 3D texture accesses faster), then please stop commenting in this thread.

I was surprised to find out that a maximum intensity projection of my sample data set (512 x 512 x 700 voxels) was about 20% faster with compressed textures

Indeed, he has a really huge speed problem with texture compression :)

But of course that’s no reason why future texel access schemes couldn’t be even faster …

What is the name of the depth plane on a 3D texture???
In 2D we have width*height dimensions (cf. x*y); that’s why I use the name Z plane for the depth dimension of a 3D texture

And the YCbCr -> RGB conversion has been done in video card hardware for a long time, for MPEG hardware decoding for example …
(YCbCr and YUV are almost the same thing)

And is that faster at texture accessing than RGB? No.

The transfer time between CPU memory and GPU memory is 2x better (because there is 2x less data to transfer)
=> for a big 3D picture this can make a big difference for texture loading …
(it already has an impact in 2D, so …)

Indeed, he has a really huge speed problem with texture compression

A problem you’re not helping with. None of your ideas solve his problem; they will only make it worse.

Just because you have a solution that works for your needs doesn’t mean that it is appropriate for everyone whenever the words “texture compression” are used.

What is the name of the depth plane on a 3D texture???
In 2D we have width*height dimensions (cf. x*y); that’s why I use the name Z plane for the depth dimension of a 3D texture

It’s called “depth.” Like “width” and “height”. X axis = width. Y axis = height. Z axis = depth.

There is no “depth plane”, just as there is no “width plane” or “height plane”.

The transfer time between CPU memory and GPU memory

Was he talking about the transfer time from CPU to GPU? No. He’s talking about runtime performance. The act of accessing the texture.

Transfer time matters to you because you’re sending video. Is he sending video? No; he’s using a static texture.

Endlosschleife,

Have you tested with paletted textures?

Do you work slice by slice, or are your 3D texel accesses random?

Endlosschleife,

As said by Zbuffer, DXT textures are directly (and efficiently) handled by the hardware, and there is far less memory to handle.

So it’s logical that DXT textures are faster than uncompressed RGB(A) textures (because there is less memory to access, and texture memory access time is a big bottleneck for random texture accesses). For example, a 512 x 512 x 700 volume is about 734 MB as RGBA8 but only about 183 MB as DXT5.

Overlay’s method seems really good :)
(cf. switching between a DXT texture and an uncompressed texture according to the LOD).

Overlay, do you think that the “uncompressed texture vs. DXT texture compression” choice could be handled automatically by an extended version of gluBuild2DMipmaps/gluBuild3DMipmaps, for example?

This thread was very educational for me. Thank you all!

“So it’s logical that DXT textures are faster than uncompressed RGB(A) textures (because there is less memory to access, and texture memory access time is a big bottleneck for random texture accesses).”

Well, the “less memory to access” part was what didn’t feel plausible to me, at least at first sight: even if the compressed texture is smaller, the shader program that does the ray tracing still queries the values of just as many texture positions. If each of those translated to an actual random texture access, texture compression wouldn’t change anything about that bottleneck. Without knowing too much about the S3TC algorithm, it does seem plausible to me, though, that it is optimized in a way that avoids/minimizes redundant texture memory accesses (like Overlay suggested), and that this reduction in random texture memory accesses is indeed what caused the performance gain.

Also thanks for the book recommendation (Real-Time Rendering). I will have a look at it.

As for the idea of using texture compression only for reduced LOD during interaction with the volume: that makes sense to me, but for my purposes I looked into texture compression primarily to minimize the texture memory needed for the volumes. Switching between uncompressed and compressed volumes and keeping both in memory at the same time would mean that I would actually need more texture memory (or I would have to constantly push volume data from CPU to GPU). If I went with reduced LOD during interaction, it would make more sense to me to keep just the uncompressed 3D texture and sample it differently in the shader program for high and low quality respectively. So if it turns out that the loss of detail from texture compression is too significant, that would be a possible reason not to use it at all.
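
What I have in mind is essentially just exposing the sampling rate, something like this (hypothetical names; u_stepSize would be the ray-march step used by the shader):

```cpp
// Trade quality for speed on the single uncompressed volume by taking
// larger steps along each ray while the user is interacting.
void setRayCastQuality(GLuint program, bool interacting)
{
    glUseProgram(program);
    float stepSize = interacting ? 4.0f : 1.0f;  // in voxels along the ray
    glUniform1f(glGetUniformLocation(program, "u_stepSize"), stepSize);
}
```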

But for now I just wanted to understand what my options are.

Thanks,
Mark