Which extensions affect early-z behavior?

As many of us have discovered, early-z is an extremely finicky thing, often disabling itself at the slightest provocation, even when we don't want it to.

Through my investigations, I’ve found that the EXT_depth_bounds_test extension modifies early-z behavior to be slightly more amenable. In particular, using the “discard” statement in a fragment program no longer disables it.
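
For reference, the hook itself is tiny; something like this is all it takes to switch it on (a minimal sketch, assuming the extension string has been checked and glDepthBoundsEXT has been fetched through your usual extension loader):

    /* Assumes a current GL context and that GL_EXT_depth_bounds_test is
     * advertised; glDepthBoundsEXT must be obtained via wglGetProcAddress /
     * glXGetProcAddress or an extension-loading library. */
    #include <GL/gl.h>
    #include <GL/glext.h>

    void enable_depth_bounds(GLclampd zmin, GLclampd zmax)
    {
        /* A fragment is rejected if the depth value already stored at its
         * pixel lies outside [zmin, zmax]; in practice the hardware seems
         * to apply that rejection before the fragment program runs. */
        glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
        glDepthBoundsEXT(zmin, zmax);
    }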

This lets one efficiently whittle pixels out of consideration over several successive passes; something that isn't possible without depth bounds.

However, the downside is that when you “discard”, the current calculation at that pixel isn’t finished. So, for example, if you’re doing a calculation which has to be done everywhere, but the results of which will give you enough information to significantly streamline the next pass…there’s no way to take advantage of that without a depth-only pass in between.

As far as I can tell, explicitly writing depth still disables early-z even with depth_bounds. However, I haven’t devised a “pure” test of this yet, so it’s possible that isn’t the case.

My question is, which other extensions, if any, affect early-z behavior? I’m hoping to stumble onto something useful.

Ideally, of course, I’d love for there to be a way to tell the hardware “reset early-z behavior using current depth values, no matter what I just did a moment ago.”

I personally think it's important to understand what early Z is and how it works. At its simplest, it just means running the fragment shader after the depth test. That means it's safe to assume that anything that modifies depth will instantaneously disable early Z (unless some information is supplied, like the depth bounds extension you mention above).

However, this behavior is highly implementation (video card) dependent. I’m pretty sure that some hardware uses the hierarchical Z test as the pre-fragment shader test. That means anything that disables hi-Z will also disable early Z because they’re the same thing in that case. It also means that in the case where hi-Z information isn’t available, the pixel shader will run regardless of whether or not it will be discarded by the subsequent depth test. Also, some cards put the stencil test in front of the fragment shader, and how they do that is anybody’s guess. It could be related to how they do hi-Z, or it could be a full stencil test.

In other words, there is no set standard for what you're looking for. Things will affect different cards in different ways, and the best thing you can do is grab one chip from each vendor product line (e.g. R300, R400, NV40, G70, etc.) and go to town running performance tests to see what the cards handle and what they don't. Also, you can simply build your renderer, and then performance-tune on each product line if it's a big issue.

Kevin B

I understand that’s how it works now, yes. But the inability to treat depth output as something unrelated to the depth test is a rather glaring oversight in OpenGL, and I was hoping some extension might have been created to do it.

In a way I'm looking to do what it seems like the stencil buffer should do, but since stencil buffers aren't very well supported with FBOs (even packed_depth_stencil fails with FBOs on Linux; not via a GL error, but in that you get NaNs written out to texture), depth is the thing to use for now.

It seems natural that OpenGL should be able to test depth only prior to running a fragment program, and not test it afterwards. This would allow you to chain several fragment programs together efficiently for various interesting effects, with intermediate programs affecting whether or not later ones are run at all on a given pixel, or choosing one of many possible programs for that pixel by setting depth within one of several ranges.

Alternatively, unless you're trying to usurp early-z for some purpose other than performance enhancement, you can just follow the general guidelines that IHVs have set down for it.

If you want to expect early-Z, you can’t discard, alpha-test, or modify the depth value.
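
As a rough checklist (my own summary, nothing normative), the state for a pass you hope will keep early-Z looks something like this:

    /* Sketch of early-Z friendly state, per the usual IHV guidelines; none
     * of this is guaranteed by the spec. Assumes a current context and a
     * bound fragment program that neither writes depth nor uses discard. */
    glDisable(GL_ALPHA_TEST);   /* alpha test commonly disables early-Z      */
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);       /* keep one compare direction for the frame  */
    glDepthMask(GL_TRUE);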

It really is an implementation-dependent factor, which is why no GL specification, either primary or extensions, ever mentions it. Unless you can control your hardware, or wish to write multiple potentially significantly different rendering paths for each graphics card type, it’s better to just stick to the guidelines. Particularly since a driver revision can change this kind of low-level stuff easily.

I’m pretty sure that some hardware uses the hierarchical Z test as the pre-fragment shader test.
Generally, hi-Z and things like it are for culling large swaths of pixels/samples. 4x4 or 8x8 blocks are culled in one test. Early-z is a per-sample culling mechanism which just does the z-test before the fragment processing.

Hi-Z and features like it are good for culling terrain fragment groups when you’re rendering a city on top of it, or for culling large portions of distant buildings when there are closer ones nearby. Or for culling large portions of the skybox when there’s a city or terrain in the way. That is their purpose.

The purpose of early-z is to keep you from having to run potentially expensive fragment programs for fragments that will be eliminated by the depth test. Early-Z alone is slower than Hi-Z, but it operates on a per-pixel basis.

It seems natural that OpenGL should be able to test depth only prior to running a fragment program, and not test it afterwards.
I’m not really sure why that would seem natural. From a rendering perspective, depth is depth. It is not an arbitrary value that is used to turn on or off fragment processing. It has a semantic meaning (how far away a pixel is from the eye), and the hardware (and API) is designed around that semantic meaning.

The hardware might be able to do it; the driver is probably the one responsible for verifying that the various bits of state (which fragment program is set, is alpha test on, etc) will allow early-z without breaking the semantics of the depth test. But since depth is depth and not an on/off switch for fragment processing, it makes perfect sense to turn on early-z only when it would not affect the rendering output.

In short, early-z is a performance enhancement, not a GL feature to be used for rendering effects.

Originally posted by Korval:

In short, early-z is a performance enhancement, not a GL feature to be used for rendering effects.

Precisely. It's a performance enhancement I'd like to use, but can't, because these restrictions make it disable itself too easily. When you've got an image with 1.3 million pixels, it's nice if you can process all of them as rarely as possible. I'd much prefer to process only the 10,000 of them which previous data has told me will be the only relevant ones.

Depth is depth, sure. But we both know lots of processing is done in multiple passes these days; that’s why render-to-texture exists. Textures aren’t the screen, yet we can render to them!

Similarly, the hardware can efficiently cull fragments based on a depth value. Since this is possible, why not allow it to be used as an on/off switch? This wouldn't be default behavior, of course, but failing to consider the possibility entirely when the hardware is clearly capable just seems silly. GPUs are bad at figuring out conditional branching on their own; what would be the downside of allowing the programmer to restate the problem in terms they do understand well, e.g. depth?

The absence of finer-grained control over this feature causes me to need to do unnecessarily duplicated work. If it’s a performance enhancement, it clearly could be a better one.

Now, I do understand the argument that OpenGL must divorce itself somewhat from hardware, and that early-z isn’t a feature implemented identically in all cards. That makes sense.

But see, there are these things called glHints. Something like that would be pretty much perfect for this functionality. No guarantees would be required, just a mechanism to suggest the manner in which you’d like early-z to function if possible.

The point, though, is to see if any other extensions like depth_bounds happen to exist; not to argue what should or should not be.

But we both know lots of processing is done in multiple passes these days; that’s why render-to-texture exists.
No, we have render-to-texture so that we can render to a texture without having to do a copy. How you choose to later use that texture is up to you. There are plenty of non-multipass algorithms that use RTT.

But see, there are these things called glHints. Something like that would be pretty much perfect for this functionality. No guarantees would be required, just a mechanism to suggest the manner in which you’d like early-z to function if possible.
Two things are wrong with this.

One, it’s a hint. If you’re trying to base an algorithm on a hint, you’re doomed. You can’t even test to see if the implementation is following it; all you can do is hope it is. What good is that when an implementation is ignoring you silently? Particularly when the algorithm you choose is based on it and will fail if the hardware doesn’t follow the hint.

Two, it can cause breakage of other algorithms. Like using the depth buffer as a depth buffer. After all, when early-z is turned off, it is either for hardware reasons (the early-z logic is sometimes coupled with the alpha test logic) or because leaving it on would obviously break the intent of the depth buffer (fragment programs that change the depth). The particular case you cite with “discard” turning off early-z but depth_bounds undoing that seems to be a specific driver or hardware thing.

In your initial post, you mentioned, “explicitly writing depth still disables early-z even with depth_bounds”. To paraphrase Babbage, I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a comment. If a driver were to have early-z with a z-writing fragment program, that’s clearly an error. It violates the GL specification, along with the semantics of what it means to have a depth buffer and an active depth test.

And of course three, GL 3.0 almost certainly won’t have hints anymore. The design of GL 3.0 is such that either the feature exists and is real and you can rely on it, or it fails. No more of these “kinda working” things where you have to guess if it will work out OK. So even if you got your wish, it’d be short-lived.

BTW, for your case of wanting to save the knowledge of whether some pixels are “good” and some are “bad”, you should use a second buffer and use multiple render targets. That second target is where you write your good/bad flag, not the depth buffer. Yes, you don’t get “early arbitrary buffer” testing as a performance enhancement, but your algorithm will be able to function.
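
Roughly like so (a sketch only; it assumes an EXT_framebuffer_object FBO with two color textures already attached, and a shader that writes its normal result to gl_FragData[0] and the good/bad flag to gl_FragData[1]):

    /* Route the flag to a second color attachment instead of the depth
     * buffer. glDrawBuffers is core in GL 2.0 (ARB_draw_buffers before that). */
    const GLenum bufs[2] = { GL_COLOR_ATTACHMENT0_EXT,
                             GL_COLOR_ATTACHMENT1_EXT };
    glDrawBuffers(2, bufs);
    /* ...render the heavy pass; later passes bind the flag texture and decide
     * per pixel whether the expensive work is worth doing. */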

The point, though, is to see if any other extensions like depth_bounds happen to exist; not to argue what should or should not be.
But the purpose of depth bounds isn’t to modify early-z behavior. What you have cataloged is a side-effect at best, a driver quirk at worst. It is nothing guaranteed by OpenGL or an IHV, so reliance on it is unwise.

Yes, I'm sure it's possible to take every piece of modern hardware and every recent driver revision and figure out exactly which states turn early-z on or off. But since that will change from revision to revision (and bugs will probably change it too), you can't really rely on anything beyond the early-z usage recommended by the IHV.

Even further, it may be a bug. For example, let’s say that the hardware implements “discard” by writing a failing depth value out of the shader, and preventing later shader ops from overwriting it (which is a valid though slightly silly way of implementing it). If that’s how it works, then this depth_bounds thing is literally a driver bug, one that may get patched later. So not only might it change, the change may be for the better overall.

The point is this. Unless and until specifications start requiring early-z and making it a first-class feature (I wouldn’t hold my breath on this), even cataloging each state combination and extension that happens to allow early-z is not a wise way of using it. It’s just too unreliable in the long-term.

Originally posted by Korval:
One, it’s a hint. If you’re trying to base an algorithm on a hint, you’re doomed. You can’t even test to see if the implementation is following it; all you can do is hope it is. What good is that when an implementation is ignoring you silently? Particularly when the algorithm you choose is based on it and will fail if the hardware doesn’t follow the hint.

Oh no, the algorithm will always work, because something akin to glDepthBoundsEXT (or a simpler depth test) will still cull the fragment properly. That much works fine even now.

It’s simply a question of whether it works 10x slower than it otherwise would, because early-z isn’t killing fragments which would eventually be culled.

It’s like this. Some passes do lots of work, and should be optimized with early-z. Some passes do very little work, and only set up the depth buffer for early-z on one of the heavy loads. Sometimes we alternate back and forth between the two types.

I’m simply wanting to avoid an extra texture-read on the depth-only passes by doing that depth-write while I’ve already got the relevant information in a register on the previous heavy-duty pass. Especially when each subsequent heavy-duty pass will never need to operate on pixels not seen in the last.
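
For concreteness, the depth-only pass I'd like to eliminate is roughly this (a sketch; the quad-drawing helper and the texture name are just illustrative):

    /* The 'extra' pass: it re-reads the previous heavy pass's output purely
     * to rebuild the depth buffer. The bound fragment program samples that
     * texture and discards where the threshold test fails, so depth is only
     * written at the pixels that should stay live. */
    void depth_only_pass(GLuint result_tex_from_prev_pass)
    {
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE); /* no color writes */
        glEnable(GL_DEPTH_TEST);
        glDepthMask(GL_TRUE);
        glBindTexture(GL_TEXTURE_2D, result_tex_from_prev_pass);
        draw_fullscreen_quad_at_depth(0.5f);   /* assumed helper, not a GL call */
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    }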

Two, it can cause breakage of other algorithms. Like using the depth buffer as a depth buffer. After all, when early-z is turned off, it is either for hardware reasons (the early-z logic is sometimes coupled with the alpha test logic) or because leaving it on would obviously break the intent of the depth buffer (fragment programs that change the depth). The particular case you cite with “discard” turning off early-z but depth_bounds undoing that seems to be a specific driver or hardware thing.

No, nothing would be broken that didn’t explicitly ask for early-z to re-enable itself after something turned it off. Default behavior would remain the same, obviously.

In your initial post, you mentioned, “explicitly writing depth still disables early-z even with depth_bounds”. To paraphrase Babbage, I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a comment. If a driver were to have early-z with a z-writing fragment program, that’s clearly an error. It violates the GL specification, along with the semantics of what it means to have a depth buffer and an active depth test.

It’s breathtakingly simple. You simply devise a test which is only executed at the start of a pass. You then allow fragment programs to write depth, understanding that what they’re writing is only relevant to what happens in the next pass.

I don’t know what the GL spec says about this, but it’s a completely obvious usage of early-z functionality, so the fact that it isn’t possible right now is, I say again, silly.

You don’t have to call it a depth test if you’d prefer…it’s just doing effectively the same thing using the same real buffer, but you can logically consider it something else easily enough.

And of course three, GL 3.0 almost certainly won’t have hints anymore. The design of GL 3.0 is such that either the feature exists and is real and you can rely on it, or it fails. No more of these “kinda working” things where you have to guess if it will work out OK. So even if you got your wish, it’d be short-lived.

Early-z has always been a kinda-working thing, as you’ve said yourself. Does this mean we can’t expect it to work at all anymore?

BTW, for your case of wanting to save the knowledge of whether some pixels are “good” and some are “bad”, you should use a second buffer and use multiple render targets. That second target is where you write your good/bad flag, not the depth buffer. Yes, you don’t get “early arbitrary buffer” testing as a performance enhancement, but your algorithm will be able to function.

Which would be great if we could attach a special one-bit-per-pixel render target that wouldn't strain write or read bandwidth at all. FBOs don't allow that, unfortunately; in fact, I think no such internalFormat exists.

Still, if it did, it would be almost as good. Using such a thing for depth-only passes would effectively get you the desired behavior for free, since massive amounts of the thing could fit in the texture cache at once. It’d be a matter of microseconds even testing over a million fragments; right now, such a depth-only pass using RGBA float 32 textures takes about 2 milliseconds.

Even further, it may be a bug. For example, let’s say that the hardware implements “discard” by writing a failing depth value out of the shader, and preventing later shader ops from overwriting it (which is a valid though slightly silly way of implementing it). If that’s how it works, then this depth_bounds thing is literally a driver bug, one that may get patched later. So not only might it change, the change may be for the better overall.

The thing has been documented in several published papers as a useful exploit, so I think it’ll stick around. However, if not, everyone will just use CUDA or somesuch for more complex behavior anyway. This is all interim thinking.

Besides, discard can only ever make a depth value greater (assuming glDepthFunc is GL_LESS). So long as that’s all you do, there’s no reason early-z shouldn’t work. Some fragments may increase their depth, but none will become visible that would have been occluded.

Even a simple scheme that allows you to write depth without disabling early-z, so long as it’s greater than the interpolated depth, would be useful. And that’s not breaking the depth-buffer concept too badly, even.

You simply devise a test which is only executed at the start of a pass. You then allow fragment programs to write depth, understanding that what they’re writing is only relevant to what happens in the next pass.
The depth test is defined to test the depth computed by the fragment stage. Doing what you ask violates that definition.

To do what you suggest would require the following changes to the OpenGL spec:

#1: Acknowledgment of Early-Z testing as a first-class feature, thus allowing Early-Z to be controlled by the user for a primitive.

#2: Restructuring the meaning of the depth test. That is, the depth test must no longer mean a test between the fragment’s depth value and the depth buffer value. Which, as a consequence, means that the depth value after such processing may have no relation to the depth value in the buffer previously.

Of course, this creates any number of problems, virtually all of them implementation dependent.

Some hardware, as mentioned earlier, couples early-z to other features (alpha test, etc), thus making it unusable for arbitrary reasons. Other hardware simply has no early-z (for good reasons), thus making it impossible to use at all.

Basically, this is a problem because you’re trying to use a piece of functionality that is designed to be invisible to produce visible results. See below.

Early-z has always been a kinda-working thing, as you’ve said yourself. Does this mean we can’t expect it to work at all anymore?
It’s an optimization. Like pipelining or Hi-Z. They’re optimizations that are designed to speed up processing, but not to affect the results of said processing.

Modern GPUs process multiple vertices and fragments simultaneously, even though the spec says that they should be processed in order. So long as their output is entirely consistent with what the spec demands, they can actually implement it however they want.

They can use a deferred tile-based renderer, which is about as out-of-order as you can get, so long as the output is consistent with the specification (in-order processing). It is important to give IHVs freedom of implementation, which is why the specification specifies behavior and not implementation.

Early-z can thus only happen when doing so will not violate the spec (which says that z-tests happen after the fragment processing). Thus, it can only happen when the z after the fragment is identical to the z before the fragment.

Much like you cannot rely on pipelining to, say, render fragments from triangles out of order, you cannot rely on early-z to provide some visible side-effect. That is not the purpose of either.

You can rely upon early-z and pipelining to speed up various calculations, but only where they are applicable. You can break vertex pipelining by not stripping your triangles properly or by using glDrawArrays inappropriately, just as you can break early-z by writing to the z or turning on alpha test (implementation dependent). But neither will be visible in the output; they will only show up as a performance hit compared to the alternatives.

Basically, you should treat early-z as a performance bonus for rendering the “right” way, just as you treat the post-T&L cache as a performance bonus for proper stripping of triangles. This is as opposed to, say, alpha blending, which is not a bonus. It is a first-class feature that has a direct effect on the output of the renderer.

GL 3.0 will specify that spec-defined behavior must be either on or off. Early-z is not spec-defined behavior; it’s an optimization. And optimizations are implementation-defined.

Besides, discard can only ever make a depth value greater (assuming glDepthFunc is GL_LESS). So long as that’s all you do, there’s no reason early-z shouldn’t work.
You're assuming that the hardware will do two depth tests, one early and one late. I can almost guarantee you that the same bit that tells the early-z test to turn on is the bit that the late-z test checks to see if it should run.

Remember: the purpose of early-z is to be felt in performance, not to be seen in the results. Given that, there’d be no point in spending the extra transistors on making early-z and late-z independent of one another. By the meaning of the concept of depth testing as defined by the OpenGL spec, when one is on for a primitive, the other one is off. If you could do early-z, that means that you didn’t change z from the initial one, so there’s no point in doing it later. And if you couldn’t do early-z, then you clearly need to do it late.

Unless you change what it means to do a depth test as outlined above, there is no incentive for hardware makers to decouple the two.

I know what you’re saying. You just aren’t quite understanding the full context of what I’m saying.

It’s an optimization. Like pipelining or Hi-Z. They’re optimizations that are designed to speed up processing, but not to affect the results of said processing.
I know this; I'm not expecting otherwise. Keep that in mind, because you keep talking as if I don't see it.

One of the goals of the program I’m working on is maximum throughput, so I’m trying to design my algorithm to do as little duplicate work as possible.

Let’s say I need to do lots of work at a given pixel P if

  1. The value in texture A at P is over a certain threshold,
  2. The values at P in textures B and C are both less than the value at P in A.

Clearly, looking up all three textures in a single fragment program everywhere is a waste of time. It would be better to look up only texture A, check whether we're over the threshold, and then look up textures B and C only if necessary.

This could be done with a conditional…if we expect some degree of coherence in the results of the A(P) > thresh test. If not, it’s faster to do Fragment Program 1 for checking A, discard at the valid Ps, and then do Fragment Program 2 with early-Z to look up B and C. You could even do one for each, with another discard in the first one.

This works if you use depth bounds. It's extremely fast, too: from 2 milliseconds for the check of A against the uniform thresh, it drops to something like 100 microseconds for the check against B and C.
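
To make the pass structure concrete, it's wired up more or less like this (a sketch with illustrative names and constants; the program/quad helpers are assumed):

    /* Depth has been cleared to 1.0 beforehand. */

    /* Pass 1: depth-only. Program 1 samples A and discards at the pixels
     * that pass the threshold, so those keep the cleared depth of 1.0 while
     * everything else gets the quad's depth of 0.5. */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    use_program(prog1_check_A);                 /* assumed helper */
    draw_fullscreen_quad_at_depth(0.5f);        /* assumed helper */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

    /* Pass 2: the B/C lookups. The depth bounds test rejects, before
     * shading, every pixel whose stored depth is not still 1.0, and with
     * the bounds test on, the discard inside Program 2 doesn't seem to
     * switch that early rejection off. */
    glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
    glDepthBoundsEXT(0.9, 1.0);
    use_program(prog2_check_BC);
    draw_fullscreen_quad_at_depth(0.5f);
    glDisable(GL_DEPTH_BOUNDS_TEST_EXT);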

But now let’s consider that I calculated the values in A during Fragment Program 0. So really, it seems like reading them back for the first depth-only pass shouldn’t be necessary; it should be possible to check against the threshold before I output from Program 0, and set depth accordingly. Except I can’t use discard in Program 0, because I need all of A available later on.

Thus, I’m using an entire extra 2 milliseconds for Program 1, just because writing depth from Program 0 disables early-Z in the execution of Program 2. If that didn’t happen, I wouldn’t need Program 1 at all.

I don’t care that writing depth from Program 0 will disable early-z for Program 0 in this case. It executes everywhere anyway. (I might in others, but that is definitely a more difficult case.) I merely want it enabled for the execution of Program 2, using the depth values I wrote from Program 0!

This should not violate anything…

So other than glClear(GL_DEPTH_BUFFER_BIT), what other way is there of re-enabling early-z between passes? That is largely what I’m looking for here.

#2: Restructuring the meaning of the depth test. That is, the depth test must no longer mean a test between the fragment’s depth value and the depth buffer value. Which, as a consequence, means that the depth value after such processing may have no relation to the depth value in the buffer previously.
That’s not required. I can work around that easily enough. However…

You're assuming that the hardware will do two depth tests, one early and one late. I can almost guarantee you that the same bit that tells the early-z test to turn on is the bit that the late-z test checks to see if it should run.
There have to be at least two bits:
Early-z vs Late-Z?
Depth test enabled at all?

That gives four states total, which opens the possibility of a state in which early-z is run regardless of what happens in the fragment program. Unfortunately, disabling depth test also disables depth writing, so that’s still a roadblock.

do Fragment Program 2 with early-Z to look up B and C. You could even do one for each, with another discard in the first one.
And therein lies your conceptual flaw. Fragment Program 2 does not have early-z because early-z is not a property that is available to fragment programs.

It just so happens that the fragment program and other state in question are compatible with a performance optimization that some graphics cards provide in certain cases. This is not a fundamental property of Fragment Program 2; it is a matter of your expectations of what the implementation will do with a given set of state.

More importantly, if the hardware did not actually have early-z (which, as I keep pointing out, is not universal), this version of your program would still function. It may not have the performance characteristics you would like, but the acts you describe and the output you require would still happen.

Thus, I’m using an entire extra 2 milliseconds for Program 1, just because writing depth from Program 0 disables early-Z in the execution of Program 2. If that didn’t happen, I wouldn’t need Program 1 at all.
And here is where you’re turning optimizations into features.

Let’s assume that you could tell an implementation to force early-z regardless. If hardware did not have early-z, this algorithm would break. In early-z hardware, the depth written by Program 2 would not be tested because the early-z test has already happened. Early-z is either/or. Either you depth test early or depth test late; you don’t do both.

In non-early-z hardware, Program 2 would have its written depth tested, where the early-z case would not.

Now, maybe you can swear that you will not write an invalid depth to the depth buffer, thus preserving the nature of it (that is, you always write a passing depth). But the hardware certainly cannot guarantee that anymore; the meaning of the depth test is fundamentally transformed.

The disconnect is this: your algorithm wants to use early-z to make the entire algorithm faster (which is why you think of it as optimization), but in so doing, it changes the output of a particular stage of the algorithm, such that non-early-z hardware cannot reproduce the results. It is this latter part that makes it not an optimization but a feature.

Well, I guess it can be put like this: there are some general guidelines for what one should do to have the best chance of getting early-z (like don't write to depth, etc.), but your case is a bit more complicated. You can try to experiment if your application will only run on one particular card with particular drivers installed, but the probability is very high that there is no universal solution.

Or another way: if you need to do something more complex, don't count on early-Z :)

Originally posted by Korval:

Thus, I'm using an entire extra 2 milliseconds for Program 1, just because writing depth from Program 0 disables early-Z in the execution of Program 2. If that didn't happen, I wouldn't need Program 1 at all.
And here is where you're turning optimizations into features.

Let's assume that you could tell an implementation to force early-z regardless. If hardware did not have early-z, this algorithm would break.

No, pretty sure it wouldn't.

In early-z hardware, the depth written by Program 2 would not be tested because the early-z test has already happened. Early-z is either/or. Either you depth test early or depth test late; you don’t do both.

I'm not trying to do both. Which fragments end up getting processed, at every stage, is determined entirely by where I use "discard" and/or where I write depth, and at what depth I choose to render each quad. This is well-defined whether or not early-z is in effect.

The only difference is whether a million fragments which could not possibly fail to be culled, given the state of the depth buffer upon starting a quad-render and its depth, are still processed anyway.

If the depth buffer contains only 1s and 0s, and I render a quad at depth 0.5, and the only thing the fragment program does is discard or not…it shouldn’t matter how those 1s and 0s were written in the past.

In non-early-z hardware, Program 2 would have its written depth tested, where the early-z case would not.

All right, you’re misunderstanding. Let me lay it out. What I’d like:

Program 0: Execute everywhere, write depth as either 0 or 1 everywhere.

Program 2: Fragments are discarded if they pass. Otherwise, the quad’s depth (say, 0.5) is written. But since some pixels had depth 0 going in, they’ll be culled either way. Better to let early-z take them out quickly.

Program 3: Lots and lots of work, but only at the pixels which still have depth 1 at this point. Counting on early-z to speed this up again.

The depth buffer (every buffer, in fact) gets precisely the same information written to it at every stage either way. The only difference is whether or not an extra pass and texture lookup are needed.

I understand that early-z would have to be disabled for the run of Program 0 if it wrote depth. I just don’t see why it should stay disabled after that. There should be a way to turn it back on somehow.
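
In shader terms, all Program 0 would need at its end is something like this (GLSL shown here as a C string; "result" and "thresh" are illustrative names). This single write is exactly what currently costs me early-z in the passes that follow:

    /* The tail of Program 0: output the full result everywhere, and encode
     * the "still interesting?" flag as a 0/1 depth value for later passes. */
    static const char *prog0_tail =
        "    gl_FragColor = result;                           \n"
        "    gl_FragDepth = (result.r > thresh) ? 1.0 : 0.0;  \n";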

In any case, I’ve got an idea which might minimize the drawbacks of the current situation, but it seems suited to a new thread.

Some reading on how Hi-Z and Early-Z operates on ATI hardware:
http://ati.amd.com/developer/SDK/AMD_SDK_Samples_May2007/Documentations/Depth_in-depth.pdf

Some time ago I found that my algorithms didn't receive any speedup on a GF 6600 if I tried to use a stencil mask to prevent execution of a very expensive fragment shader. Finally, after much trouble, I implemented a custom OpenGL-based test that checks early depth, stencil, and scissor rejection.

I found some interesting results:

NV4x and G70 hardware can use early stencil rejection only in the window framebuffer; in a P-buffer or FBO it doesn't work. Early depth and scissor tests work well in all buffers.

G80 (GF 8500) does not use early stencil rejection at all (like the old NV30 cards)! Maybe that's because it's not a full version of G80? Early depth and scissor tests work well.

R500 and R600 can execute early stencil rejection in the window and in a p-buffer with very good performance! The same goes for early depth and scissor tests. I can't use an FBO for the early stencil test, because ATI does not support the GL_EXT_packed_depth_stencil extension properly.
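
For reference, the pattern my test exercises is essentially the classic stencil-mask one (a sketch; the drawing helpers are assumed, and whether the rejection actually happens before the shader is exactly what differs between the chips above):

    /* Pass 1: tag the interesting pixels with stencil value 1. */
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_REPLACE);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    draw_cheap_tagging_pass();                 /* assumed helper */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

    /* Pass 2: run the very expensive fragment shader only where stencil == 1,
     * hoping for early stencil rejection everywhere else. */
    glStencilFunc(GL_EQUAL, 1, 0xFF);
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
    draw_expensive_fullscreen_pass();          /* assumed helper */
    glDisable(GL_STENCIL_TEST);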

If I remember correctly, early Z wasn't working on NVIDIA with floating-point render targets. But that was a century ago :rolleyes:

I haven't had much time to research that since, but any insight is welcome, as I'm going to return to rendering in the near future.

Possibly it still isn't on Linux, but within certain bounds it appears to work fine on Windows these days.