reasons for lack of programmable depth/stencil?

I guess there’s a big reason for this that I’m missing, which is why this topic isn’t in the suggestions forum. But googling “opengl programmable depth/stencil” only gave me threads where people wanted to sample and write the depth/stencil buffer for some reason. I don’t expect it to be accessible from vertex/geometry/fragment programs; I think depth+stencil should be a separate shader stage. Here’s an approximation of what I’m talking about:


void main()
{
    if( gl_fragDepth < gl_inDepthValue ) {     // gl_inDepthValue  - current depth-buffer value
        if( gl_inStencilValue == 3 ) {         // gl_inStencilValue - current stencil-buffer value
            discard;                           // do not execute the fragment program if possible, or prevent it from writing to the color buffer if gl_fragDepth was modified
        } else {
            gl_outStencilValue = gl_inStencilValue + 1;
        }
    } else {
        discard;
    }
}

I’m not proposing this as actual syntax; it’s just to give an idea, and I’m not the person capable of writing an actual spec. But even if there’s currently a reason for not doing this, I’m sure we will eventually get properly programmable depth+stencil, because the state-based approach is awful: it’s very error prone (it’s hell to keep track of all the states affecting stencil and depth in a big project with multipass rendering) and ugly. I guess it will be hard to squeeze the same performance out of a programmable section of the pipeline, but it’s the most straightforward, obvious way to do this, and in many cases user-defined code should be simpler (faster?) than the fixed pipeline. Sooner or later, hardware should be optimized for that functionality. The sooner the better; the current depth/stencil functionality looks archaic and out of place.

Look at the GL_ARB_shader_stencil_export extension. It is not in core because NVIDIA hardware can’t do it; currently only AMD supports it.

Your example can actually be done via the fixed-function stencil tests.


glDepthFunc(GL_LESS);                        // pass where the incoming depth is less than the stored depth
glStencilFunc(GL_NOTEQUAL, 3, 0xFFFFFFFFU);  // stencil test passes while the stored stencil value is not 3
glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);      // increment only when both the stencil and depth tests pass

This all occurs before fragment processing, so it seems to fit with what you’re suggesting.

However, in the event that your example is just a simple one, you can do more advanced depth/stencil work in the fragment shader. You can query the incoming fragment’s own depth in the fragment shader via gl_FragCoord, specifically gl_FragCoord.z. The GL_ARB_shader_stencil_export extension allows you to write the stencil reference value (http://www.opengl.org/registry/specs/ARB/shader_stencil_export.txt). You can’t read the stencil value however, so you’d need to set up the stencil tests to accomplish your goal. However, I haven’t seen any Nvidia hardware/drivers support this extension (I have a 670 with latest drivers, and several other earlier cards & Quadros), so it may not be much use to you if your target audience includes Nvidia hardware.
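For illustration, a minimal sketch of that combination (assuming a driver that exposes GL_ARB_shader_stencil_export; the threshold and reference values below are made up for the example):


#version 330
#extension GL_ARB_shader_stencil_export : require

out vec4 fragColor;

void main()
{
    // gl_FragCoord.z is this fragment's own window-space depth (0..1),
    // not the value currently stored in the depth buffer.
    float fragmentDepth = gl_FragCoord.z;

    // Pick a per-fragment stencil reference value; the fixed-function
    // stencil test/ops configured with glStencilFunc/glStencilOp then use it.
    gl_FragStencilRefARB = (fragmentDepth < 0.5) ? 1 : 2;

    fragColor = vec4(fragmentDepth);
}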

You can also use image loads and stores instead of a stencil buffer via GL 4.2 or GL_ARB_shader_image_load_store (or GL_EXT_shader_image_load_store), which would give you much more flexibility than the stencil buffer, though likely with a performance hit.
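As a rough sketch of that approach (assuming a GL 4.2+ context and an r32ui image bound to unit 0 via glBindImageTexture; the image name and the threshold are made up), a fragment shader could keep its own per-pixel counter instead of a stencil buffer:


#version 420

// Hypothetical r32ui image used as a hand-rolled "stencil" buffer.
layout(binding = 0, r32ui) coherent uniform uimage2D customStencil;

out vec4 fragColor;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);

    // Atomically increment the per-pixel counter and get its previous value.
    uint previous = imageAtomicAdd(customStencil, coord, 1u);

    // Emulate "reject this fragment once the counter has reached 3".
    if (previous >= 3u)
        discard;

    fragColor = vec4(1.0);
}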

The stencil buffer isn’t programmable because it’s optimized for what it does, and likely has dedicated hardware. It’s also fallen by the wayside a bit now that there are other ways to accomplish similar tasks.

I just want to state that I’m not looking for this functionality for my own project or anything (though I’d gladly use it if it were implemented in a way similar to what I expect). My sample isn’t real code, and I know I can do that with the fixed pipeline; it’s just to show how I imagine the feature. I know there is no such thing, but there may be some obscure extensions implementing bits of it that aren’t practically useful.

And I’m also not looking for any major extra functionality from programmable stencil/depth (except, naturally, more advanced math/comparison operations). I’m pleased with what the stencil/depth buffer can do; the problem is how you achieve it. So I expect a shader stage that uses the hardware dedicated to these operations and strictly follows its limitations. Am I the only person who finds it a natural thing to do? I mean moving this functionality to a programmable stage (which may lead to its further development) and cleaning up the API’s excessive number of states.

So GL_ARB_shader_stencil_export shows it’s potentially possible and that there was a desire to tie stencil to the programmable pipeline, but it’s really backwards, and not useful in most applications anyway.

The purpose of this topic is to find out why depth/stencil is not a separate programmable stage yet and why the associated states are not deprecated, and, if it’s reasonable, to push it as a suggestion.

There are reasons why the depth/stencil test is not programmable:

  1. Special-purpose hardware does it way faster.
  2. In most cases the depth and stencil tests are performed before fragment shader execution, so if a fragment is rejected by the depth or stencil test there is no need to run the fragment shader for that fragment at all, which saves HW resources.
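
For what it’s worth, since GL 4.2 / GLSL 4.20 a fragment shader can explicitly request that those fixed-function tests run before it executes, which makes the ordering in point 2 visible at the shader level. A minimal sketch:


#version 420

// Require the depth/stencil tests to run before this shader; fragments
// rejected by those tests never reach main() at all.
layout(early_fragment_tests) in;

out vec4 fragColor;

void main()
{
    fragColor = vec4(1.0);
}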

[QUOTE=aqnuep;1247954]There are reasons why the depth/stencil test is not programmable:

  1. Special-purpose hardware does it way faster.

  2. In most cases the depth and stencil tests are performed before fragment shader execution, so if a fragment is rejected by the depth or stencil test there is no need to run the fragment shader for that fragment at all, which saves HW resources.[/QUOTE]

  1. So you think there’s no reason/possibility to make a programmable stage that uses only the functionality available in this hardware? I can understand that; it would be too limited and wouldn’t resemble normal shaders much. On the other hand, I’d expect this to potentially lead to further development of that hardware. Right now, working with the depth and stencil buffers feels like legacy functionality, yet it is heavily used in modern applications, often as an optimization tool.

  2. Well, I was aware of that, and I expected the custom depth/stencil code to be executed before the fragment shader. Because, again, I didn’t propose any extra interaction with other pipeline stages (like accessing it from the fragment program) or fancy features that would kill it as an optimization tool. I was only for replacing the state changes with a simple shader.

I also can’t find any actual info about the hardware responsible for this. I even considered that it’s done on the normal ALUs of modern GPUs (and that it’s fast because those operations are always really simple and extra optimizations kick in for those stages). Could you supply a search query or a link, if you know one? I’d like to read it.

Z-culling and stencil ops are part of the functionality of the raster engine (NVIDIA) or rasterizer (AMD). One of the reasons they’re so fast is that they employ hierarchical Z-culling and only operate in a very specific region of FP32 space (0…1). Here’s a link for the GeForce GTX 480 (newer articles don’t break it down to this level anymore).

The GF100 Recap - NVIDIA’s GeForce GTX 480 and GTX 470: 6 Months Late, Was It Worth the Wait?

While it’s hard to know without actually testing hardware, I would think adding another shader stage would be slower overall than doing the test as part of the fragment shader (assuming you could get access to the stencil/Z buffer values in both). Fragment shaders are pretty good at ending early if all fragments discard in their block (usually 2x2). While I’ve run into cases where querying the Z-buffer value directly would be nice, I’ve been able to work around this by using FBOs and a multi-pass approach with depth-texture sampling.
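As a sketch of that workaround (assuming the first pass rendered into a depth texture attached to an FBO, which is then bound as an ordinary texture for the second pass; the sampler name is made up):


#version 330

// Depth texture written by the previous pass.
uniform sampler2D previousDepth;

out vec4 fragColor;

void main()
{
    vec2 uv = gl_FragCoord.xy / vec2(textureSize(previousDepth, 0));

    // Stored depth from the earlier pass, in the same 0..1 window space
    // as gl_FragCoord.z (compare mode must be GL_NONE to sample it directly).
    float storedDepth = texture(previousDepth, uv).r;

    // Manual "depth test" against the previous pass.
    if (gl_FragCoord.z >= storedDepth)
        discard;

    fragColor = vec4(1.0);
}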

The flexible programming you are after can be done with the combination of the ARB_shader_stencil_export and ARB_stencil_texturing extensions.
The latter extension (if I understand correctly) allows you to bind the depth/stencil texture to multiple texture units and assign a mode to each (and thus read either the depth component or the stencil component).
This, in combination with stencil export, means you can now read the stencil buffer value, set the stencil reference value from the shader, and (with the GL_REPLACE stencil op) write a new stencil value to the depth/stencil buffer.
The only issue is getting shader_stencil_export promoted into core so that NVIDIA adopts it.
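
For reference, a rough sketch of the read side (assuming GL 4.3 / ARB_stencil_texturing, a depth/stencil texture from an earlier pass with GL_DEPTH_STENCIL_TEXTURE_MODE set to GL_STENCIL_INDEX, and a made-up sampler name):


#version 430

// Depth/stencil texture whose texture mode is set to GL_STENCIL_INDEX,
// so fetches return the stencil index through an unsigned integer sampler.
uniform usampler2D stencilTex;

out vec4 fragColor;

void main()
{
    uint stencilValue = texelFetch(stencilTex, ivec2(gl_FragCoord.xy), 0).r;

    // Example use: reject fragments where the stored stencil value is 3.
    if (stencilValue == 3u)
        discard;

    fragColor = vec4(1.0);
}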

Not exactly. While you can bind a texture for texturing that you are currently also rendering to, the writes will not be immediately visible to the reads. This applies to color, depth and stencil textures alike. That’s why extensions like NV_texture_barrier exist.

[QUOTE=malexander;1247957]Z-culling and stencil ops are part of the functionality of the raster engine (NVIDIA) or rasterizer (AMD). One of the reasons they’re so fast is that they employ hierarchical Z-culling and only operate in a very specific region of FP32 space (0…1). Here’s a link for the GeForce GTX 480 (newer articles don’t break it down to this level anymore).

The GF100 Recap - NVIDIA’s GeForce GTX 480 and GTX 470: 6 Months Late, Was It Worth the Wait?

While it’s hard to know without actually testing hardware, I would think adding another shader stage would be slower overall than doing the test as part of the fragment shader (assuming you could get access to the stencil/Z buffer values in both). Fragment shaders are pretty good at ending early if all fragments discard in their block (usually 2x2). While I’ve run into cases where querying the Z-buffer value directly would be nice, I’ve been able to work around this by using FBOs and a multi-pass approach with depth-texture sampling.[/QUOTE]

Thanks for the article; I took a glance at it and I’m going to read it today. It looks like it will fill some blank spots in my understanding of how the GPU works and what I can expect from it.

As for a programmable depth/stencil stage… yes, I see it’s not viable now; it would require a lot of hardware modifications or it would be slow. I doubt the hardware dedicated to this functionality will develop in a way that makes it programmable without a performance loss, and I don’t think it will ever be profitable to move to a fully unified architecture and get rid of the dedicated hardware.

[QUOTE=BionicBytes;1247962]The flexible programming you are after can be done with the combination of the ARB_shader_stencil_export and ARB_stencil_texturing extensions.
The latter extension (if I understand correctly) allows you to bind the depth/stencil texture to multiple texture units and assign a mode to each (and thus read either the depth component or the stencil component).
This, in combination with stencil export, means you can now read the stencil buffer value, set the stencil reference value from the shader, and (with the GL_REPLACE stencil op) write a new stencil value to the depth/stencil buffer.
The only issue is getting shader_stencil_export promoted into core so that NVIDIA adopts it.[/QUOTE]

I think that solution is for cases where you need to do something the fixed pipeline can’t. I’ve stated several times that I’m against compromising performance and I’m not after fancy fragment-program functionality/flexibility. Replacing the fixed depth/stencil stage with a programmable one (which would do exactly the same stuff) while keeping its performance is what I’d like to see in OpenGL, but I can see that it’s not possible or reasonable right now.