imageload/store() very slow, any tricks?



robotech_er
06-30-2012, 11:25 PM
The functions imageLoad() and imageStore() seem very slow. In my tests, imageLoad() is about 60% slower than texelFetch(). Are there any tricks that can accelerate these two functions, in particular imageStore(), since imageLoad() can be substituted by texelFetch()?
I have read the spec (http://www.opengl.org/registry/specs/ARB/shader_image_load_store.txt) many times but have not found any clue.

Thanks in advance.

Alfonse Reinheart
07-01-2012, 12:13 PM
robotech_er wrote: "since imageload() can be substituted by texelfetch()."

No it can't, depending on exactly what you're doing. `imageLoad` loads from bound images, which can have things like `coherent` attached to them. You can't make a sampler `coherent`.
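
A minimal GLSL sketch of that distinction, for illustration only. The uniform names `img_data` and `tex_data`, the output `v_sum`, and the fixed width of 512 texels are hypothetical and do not come from the poster's code.

```glsl
#version 420

// Image access: memory qualifiers such as coherent, restrict, readonly
// and writeonly may be attached to the image variable.
layout(r32ui) coherent uniform uimage2D img_data;   // hypothetical name

// Sampler access: memory qualifiers are not allowed on samplers,
// so there is no way to request coherent reads here.
uniform usampler2D tex_data;                        // hypothetical name

flat out uint v_sum;                                // hypothetical output

void main()
{
    // Assumption: the data is 512 texels wide and one vertex maps to one texel.
    ivec2 p = ivec2(gl_VertexID % 512, gl_VertexID / 512);

    uint from_image   = imageLoad(img_data, p).r;      // image load
    uint from_sampler = texelFetch(tex_data, p, 0).r;  // texture fetch, mip level 0

    v_sum = from_image + from_sampler;
    gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
}
```

Both calls return the same texel when nothing else writes to the texture; they only differ once another invocation may be storing to the same image during the draw.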

Also, you explain nothing about what you're trying to do, so there's no way to know where the problem is.

robotech_er
07-01-2012, 02:05 PM
Thanks for the reply.

I am trying to design a volume renderer. The problem is that the dataset is huge, over 80 GB, so I implemented a paging system to handle data access. The paging system runs in a separate I/O thread. At render time it tries to predict the camera's movement and pre-load the data needed for the next frames. When the camera moves slowly, this works well. When the camera moves really fast, the I/O thread does not have enough time to load the data, because HDD transfer speed is relatively slow. So I render using level of detail (LOD): when the data needed for a frame cannot all be loaded in time, the data at a coarser LOD level is loaded and rendered first. The LOD scheme works well too. But after reading many docs and papers, I noted that it could be faster with compressed data. Since HDD speed is the main performance bottleneck, compression can save a large amount of loading time.
The volume data consists of layered textures stored in texture arrays. I compress the data offline on the CPU. At render time, the compressed data is transferred to video memory, then decompressed and rendered.
The decompression is performed in a vertex shader. imageLoad() is used to read the compressed data in the shader, and imageStore() is used to write the decompressed results to the texture arrays, from which they are rendered. Render-to-texture (RTT) cannot be used to write the results to the texture arrays because it has too many limitations. Here, imageLoad() can be substituted by texelFetch().
But the decompression is still very slow. After many tests, I believe that imageLoad() slows down the whole decompression shader.
Any suggestions? Thanks.

Alfonse Reinheart
07-01-2012, 06:12 PM
OK, that's my fault. I should have been more clear when I asked what you were doing.

Could you post your shader? Particularly the definition and use of the image uniforms? Are you declaring your images as `coherent`, or not declaring them `readonly` or `writeonly` as appropriate? Are you properly using `restrict`?

robotech_er
07-04-2012, 09:26 AM
Sorry for failing to describe the question exactly.
The shader is an image decompression shader. Volume data is stored as integer textures.
An image is divided into many small patches, and these patches are compressed/decompressed in parallel. For example, a 512*512 image can be divided into 64*64 patches with a patch size of 8*8, or into 16*16 patches with a patch size of 32*32. All the patches of an image are then compressed into bit-streams and packed into a compression texture.
The shader is basically like this:

#version 420 compatibility

layout(r32ui) restrict readonly uniform uimage2D tex_comp; // the compression texture
layout(r32ui) restrict readonly uniform uimage2D tex_offset; // the location of each compressed patch in the compression texture
layout(r16ui) restrict writeonly uniform uimage2D tex_decomp; // holds the decompression results. Are these the right qualifiers?

// Pseudocode below; the basic idea is very simple. I will post the original code later (I am on a trip :)).
void main()
{
    ... // find the beginning of the compressed patch

    for (int i = 0; i < pixels_num_in_a_patch; i++) // pixels_num_in_a_patch == patch size
    {
        // (1) read a compressed pixel from tex_comp with imageLoad()
        // (2) decompress it to get the real value of this pixel
        // (3) write the result to tex_decomp with imageStore()
    }
}

The complete code is here: http://pastebin.com/hCStfCvb
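
For readers without access to the pastebin, here is a rough sketch of how one invocation could locate and walk its patch in the layout described above. It is not the code from the link. It assumes one vertex per patch, that `tex_offset` stores a linear texel index into `tex_comp`, and that each compressed pixel occupies one texel; the uniforms `patch_size`, `patches_per_row` and `comp_width` are hypothetical, and the "decoder" is a placeholder mask rather than the real decompression.

```glsl
#version 420

layout(r32ui) restrict readonly  uniform uimage2D tex_comp;
layout(r32ui) restrict readonly  uniform uimage2D tex_offset;
layout(r16ui) restrict writeonly uniform uimage2D tex_decomp;

uniform int patch_size;       // hypothetical: 8 for 8*8 patches
uniform int patches_per_row;  // hypothetical: 64 for a 512*512 image with 8*8 patches
uniform int comp_width;       // hypothetical: width of tex_comp in texels

void main()
{
    // Assumption: one vertex per patch, e.g. drawn as GL_POINTS.
    ivec2 patch_id = ivec2(gl_VertexID % patches_per_row,
                           gl_VertexID / patches_per_row);
    ivec2 origin = patch_id * patch_size;   // top-left pixel of this patch

    // Assumption: tex_offset holds the linear start index of the patch's stream.
    int cursor = int(imageLoad(tex_offset, patch_id).r);

    for (int i = 0; i < patch_size * patch_size; ++i)
    {
        // (1) read one compressed texel
        ivec2 src = ivec2(cursor % comp_width, cursor / comp_width);
        uint word = imageLoad(tex_comp, src).r;

        // (2) placeholder for the real decoder: keep the low 16 bits
        uint value = word & 0xFFFFu;

        // (3) write the result to its pixel inside the patch
        ivec2 dst = origin + ivec2(i % patch_size, i / patch_size);
        imageStore(tex_decomp, dst, uvec4(value, 0u, 0u, 0u));

        ++cursor;
    }

    gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
}
```

With 8*8 patches on a 512*512 image this would be launched for 64*64 = 4096 vertices, each issuing 64 loads and 64 stores.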

Thanks.

robotech_er
07-07-2012, 09:27 AM
The problem is partially resolved by reducing the number of write operations as much as possible. The relationship between performance and the number of write operations is not linear, at least not on my GTX 570. But generally, the fewer writes the better.
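
One possible way to cut the number of writes, sketched below, is to store four decompressed 16-bit values with a single imageStore() into a four-channel image. This is only an illustration and not necessarily what was done here. It assumes the destination texture is allocated as RGBA16UI with a quarter of the original width, that the patch size is a multiple of 4, and that whatever reads the result later unpacks four values per texel. The names `tex_decomp4`, `patch_size`, `patches_per_row` and `decode_pixel` are hypothetical, and `decode_pixel` is a stand-in for the real decoder.

```glsl
#version 420

layout(r32ui)    restrict readonly  uniform uimage2D tex_comp;
layout(rgba16ui) restrict writeonly uniform uimage2D tex_decomp4; // hypothetical 4-channel target

uniform int patch_size;       // hypothetical, assumed to be a multiple of 4
uniform int patches_per_row;  // hypothetical

// Stand-in for the real decoder: returns the low 16 bits of one source texel.
uint decode_pixel(ivec2 pixel)
{
    return imageLoad(tex_comp, pixel).r & 0xFFFFu;
}

void main()
{
    // Assumption: one vertex per patch.
    ivec2 patch_id = ivec2(gl_VertexID % patches_per_row,
                           gl_VertexID / patches_per_row);
    ivec2 origin = patch_id * patch_size;
    int texels_per_row = patch_size / 4;    // four pixels share one RGBA texel

    for (int row = 0; row < patch_size; ++row)
    {
        for (int t = 0; t < texels_per_row; ++t)
        {
            // Gather four horizontally adjacent results in registers...
            uvec4 values;
            for (int k = 0; k < 4; ++k)
                values[k] = decode_pixel(origin + ivec2(4 * t + k, row));

            // ...and issue one store instead of four.
            ivec2 dst = ivec2(origin.x / 4 + t, origin.y + row);
            imageStore(tex_decomp4, dst, values);
        }
    }

    gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
}
```

For an 8*8 patch this issues 16 stores instead of 64. Whether that pays off would have to be measured, given that the cost per write was observed to be non-linear.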