Nvidia: bindless textures - first experiences and bugs

Hi,
I am working on virtualization and visualization methods for large image and volume data sets. My current approach is to use large texture atlases to store the page data on the GPU. For volume rendering, one problem with large 3D textures is their cache inefficiency under most viewing angles (caused by their GPU-internal layout, which is optimized more for 2D textures). So I was stoked by the bindless extension, because it allows me to store the pages of the virtual textures in individual smaller textures and access them in the shaders through a uniform block or, better, a larger texture buffer holding the 64-bit resident texture handles. So my experiments began…

First experiment: Simple volume ray caster using a single 3D volume texture. You can find the complete source code of the experiment here.
1. Store the texture handle in a uniform block



uniform volume_uniform_data
{
    vec4 volume_extends;     // w unused
    vec4 scale_obj_to_tex;   // w unused
    vec4 sampling_distance;  // x - os sampling distance, y opacity correction factor, zw unused
    vec4 os_camera_position;
    vec4 value_range;


    mat4 m_matrix;
    mat4 m_matrix_inverse;
    mat4 m_matrix_inverse_transpose;


    mat4 mv_matrix;
    mat4 mv_matrix_inverse;
    mat4 mv_matrix_inverse_transpose;


    mat4 mvp_matrix;
    mat4 mvp_matrix_inverse;


    sampler3D volume_texture;
    sampler1D color_map;
} volume_data;

[...]
// sample volume
float v = texture(volume_data.volume_texture, spos).r;

Results: Works perfectly as expected.

2. Uniform buffer storage gets pretty limited when trying to feed a large number of page textures into a single shader (64 KiB overall storage at most, shared with other uniform data…), so I tried to store the texture handles in a texture buffer with an RG32UI format. In the shader, each texel is converted back to a uint64_t, which can then be interpreted as a sampler:


layout (binding = 4) uniform usamplerBuffer texture_handles;

[...]
// sample the texture
uvec2     vtex_hndl_enc = texelFetch(texture_handles, 0).xy;
uint64_t  vtex_hndl     = packUint2x32(vtex_hndl_enc);
sampler3D vtex_smpl     = sampler3D(vtex_hndl);
float v = texture(vtex_smpl, spos).r;

Results: Quite unexpectedly, this works perfectly! (This use case is not described in the bindless texture spec.)
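
A minimal host-side sketch of how the handles could be laid out for such an RG32UI texture buffer. This is an assumption for illustration, not code from the experiment: the helper names `encode_handles_rg32ui` and `decode_handle` are hypothetical. The only fact it relies on is that `packUint2x32` takes the 32 least significant bits from `.x` and the 32 most significant bits from `.y`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split each 64-bit handle into two 32-bit words,
// low word first, so that one RG32UI texel holds one handle.
// This matches GLSL packUint2x32 (.x = low 32 bits, .y = high 32 bits).
std::vector<std::uint32_t> encode_handles_rg32ui(const std::vector<std::uint64_t>& handles)
{
    std::vector<std::uint32_t> texels;
    texels.reserve(handles.size() * 2);
    for (std::uint64_t h : handles) {
        texels.push_back(static_cast<std::uint32_t>(h & 0xFFFFFFFFull)); // -> .x
        texels.push_back(static_cast<std::uint32_t>(h >> 32));           // -> .y
    }
    return texels;
}

// Hypothetical CPU mirror of
//   packUint2x32(texelFetch(texture_handles, index).xy)
// useful for checking the encoding without a GPU.
std::uint64_t decode_handle(const std::vector<std::uint32_t>& texels, std::size_t index)
{
    assert(index * 2 + 1 < texels.size());
    const std::uint64_t lo = texels[index * 2];
    const std::uint64_t hi = texels[index * 2 + 1];
    return (hi << 32) | lo;
}
```

The resulting array would presumably be uploaded into a buffer object and attached with `glTexBuffer` using `GL_RG32UI`, with the handles coming from `glGetTextureHandleNV` after the textures have been made resident.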

3. Next I tried to fetch and translate the sampler only once at the beginning of the shader and store it in a global variable:

Problem: The GLSL compiler does not allow assigning values to global sampler variables (the spec implied this should work).
So I just stored the uint64_t handle in a global variable and converted this value to a sampler right before taking a texture sample:



// globals
uint64_t vtex_smpl = 0;
uint64_t ctex_smpl = 0;

[...]
// one time initialization on shader start
uvec2     vtex_hndl_enc = texelFetch(texture_handles, 0).xy;
uvec2     ctex_hndl_enc = texelFetch(texture_handles, 1).xy;
vtex_smpl     = packUint2x32(vtex_hndl_enc);
ctex_smpl     = packUint2x32(ctex_hndl_enc);

[...]
// sample texture
float v = texture(sampler3D(vtex_smpl), spos).r;

Results: This works, but it is noticeably slower than the previous version, which fetches the handle for every sample (3.8 ms vs. 3.2 ms per frame)!

These tests were made using a smaller volume with dimensions 501x401x576 and 8-bit scalar values, rendered with a simple transfer function on a 1600x1024 viewport. The plain non-bindless version ran at 2.5 ms per frame and the uniform block bindless version at 2.8 ms per frame. The first texture buffer version ran at 3.2 ms per frame, and the texture buffer version that tries to fetch the sampler just once ran at 3.8 ms per frame.
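
To put these frame times in relation, here is a small sketch that computes the slowdown of each variant against the non-bindless baseline. The function name `overhead_percent` is made up for this example; the input numbers are the ones quoted above.

```cpp
#include <cassert>

// Relative slowdown of a variant against a baseline frame time, in percent.
// Both arguments are frame times in milliseconds.
double overhead_percent(double baseline_ms, double variant_ms)
{
    assert(baseline_ms > 0.0);
    return (variant_ms / baseline_ms - 1.0) * 100.0;
}
```

With the 2.5 ms baseline, this gives roughly 12 % for the uniform block version (2.8 ms), 28 % for the per-sample texture buffer fetch (3.2 ms) and 52 % for the fetch-once version (3.8 ms).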

Second experiment: Modify the virtualization renderer to use a texture buffer containing individual page texture handles instead of a large volume texture containing the page textures (not openly available at this point in time).


#if SCM_LDATA_VTEX_BINDLESS_TEXTURE == 1

struct page_atlas_3d_info {
    usamplerBuffer   atlas_textures;
};


#else

struct page_atlas_3d_info {
    vec3        size_pages_rec;
    sampler3D   atlas_texture;
};


#endif

I changed my data structure containing the indirection information into the texture atlas to contain a simple index into the texture buffer. Additional changes were made to the methods retrieving the actual texture samples from the atlas. Importantly, the octree traverser I use to retrieve the volume brick for the ray traversal stores its temporary data in the following struct, which is filled out by the traversal function when querying the octree for a texture coordinate:


struct ray_cast_trav_info {
    sampler3D vpage;
    uint64_t  vpage_hndl;


    uvec2   vpage_index_data;
    int     vpage_level;
    vec3    vpage_coord;


    vec3    octree_node_pos;
    vec3    octree_nodes_per_level;
}; // struct ray_cast_trav_info

void ray_cast_octree_traverse(in vtexture3D          vtex,       // virtual texture descriptor struct
                              in vec3                vtex_coord, // virtual texture coordinate
                              in float               target_lod,
                              out ray_cast_trav_info trav_info)
{
[...]
}

Now, when sampling the current volume brick, I expected I could use the decoded vpage sampler I got from the texture buffer in the following way:


vec4
texture_page(in ray_cast_trav_info rc_tinfo,
             in vec3               page_tc)
{
    return texture(rc_tinfo.vpage, page_tc);
}

Problem: Simply put, the temporarily stored sampler does not work; I always get vec4(0.0) back from this lookup.
So I checked where the problems start and found that the uint64 handle was OK, so I additionally stored this handle and used it during the lookup:


vec4
texture_page(in ray_cast_trav_info rc_tinfo,
             in vec3               page_tc)
{
    return texture(sampler3D(rc_tinfo.vpage_hndl), page_tc);
}

Results: This works, but it is much slower than my current texture atlas approach (3x to 6x slower in my experiments).

I used 512 MiB worth of small 64³ volume page textures for my experiments (resulting in 2048 resident 3D textures). I ran all tests under Windows 7 x64 using a GeForce GTX 680 with the 301.32 driver.
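
The page count follows from simple arithmetic, sketched below. The assumption here is one byte per voxel (8-bit scalar data, as in the first experiment), which is what makes 512 MiB come out at exactly 2048 pages; the function names are made up for illustration.

```cpp
#include <cstdint>

// Number of cubic pages that fit into a memory budget.
// bytes_per_voxel = 1 is an assumption (8-bit scalar volume data).
constexpr std::uint64_t page_count(std::uint64_t budget_bytes,
                                   std::uint64_t page_edge,
                                   std::uint64_t bytes_per_voxel)
{
    return budget_bytes / (page_edge * page_edge * page_edge * bytes_per_voxel);
}

// Size of a table holding one 64-bit texture handle per page.
constexpr std::uint64_t handle_table_bytes(std::uint64_t pages)
{
    return pages * sizeof(std::uint64_t);
}

// 512 MiB of 64^3 pages at 1 byte per voxel -> 2048 pages.
static_assert(page_count(512ull << 20, 64, 1) == 2048, "page count mismatch");
// 2048 handles at 8 bytes each -> 16 KiB of handle data.
static_assert(handle_table_bytes(2048) == 16 * 1024, "table size mismatch");
```

So the handle table itself is tiny compared to the page data: 16 KiB of handles for 512 MiB of pages under these assumptions.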

I know that this functionality is pretty new and the drivers surely need to mature. I was surprised how far I got with my experiments, and I hope that the performance situation can be improved drastically, because I see this way of handling virtual volume textures overcoming a lot of our problems with larger 3D texture atlases.

Regards
-chris

Hi,
cool stuff! We’ll look into the compiler creating non-optimal code.
Btw, you might want to use NV_shader_buffer_load to pack your textures into buffers instead of a samplerBuffer; that saves the unpack work.

e.g.
struct entry {
    sampler3D tex;
};

uniform entry* entries;

texture(entries[idx].tex, …)

(Just beware of the memory layout rules, which differ slightly from the UBO’s std140: NV allows dense packing of scalars instead of the vec4 expansion.)
It also makes it possible to use pointers for hopping around in more complex data structures.
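
To illustrate the layout remark, here is a rough sketch comparing array strides. It assumes that a texture handle occupies 8 bytes, that std140 rounds the stride of an array of scalars up to the size of a vec4 (16 bytes), and that the dense layout uses the element size as the stride. The 64 KiB budget is the uniform block limit mentioned in the first post; all function names are made up for this example.

```cpp
#include <cstddef>

// Round value up to the next multiple of alignment.
constexpr std::size_t round_up(std::size_t value, std::size_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}

// std140: the stride of an array of scalars is rounded up to 16 bytes (one vec4).
constexpr std::size_t std140_array_stride(std::size_t element_bytes)
{
    return round_up(element_bytes, 16);
}

// Dense packing (assumed here): the stride equals the element size.
constexpr std::size_t dense_array_stride(std::size_t element_bytes)
{
    return element_bytes;
}

// How many array elements fit into a given byte budget.
constexpr std::size_t max_elements(std::size_t budget_bytes, std::size_t stride)
{
    return budget_bytes / stride;
}
```

Under these assumptions, a 64 KiB block holds at most 4096 handles with the std140 stride and 8192 with dense packing, so host-side structs have to be filled with the matching stride.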

[QUOTE=Christoph Kubisch;1237508]Hi,
cool stuff! we’ll look into the compiler creating non-optimal code.
[/QUOTE]
thanks ;).

I will get some reproducers to you and Jeff.

[QUOTE=Christoph Kubisch;1237508]Btw you might want to use NV_shader_buffer_load to pack your textures into buffers instead of samplerBuffer, saves the unpack work.
[/QUOTE]

That is a great idea, thanks for the hint!

Regards
-chris
