Clearing DEPTH_COMPONENT of type UNSIGNED_SHORT in gles2 using OES_depth_texture

Hi,

I’m trying to implement shadow mapping in OpenGL ES 2.0 and ran into a problem.

I’m rendering the depth to a GL_DEPTH_COMPONENT texture with type GL_UNSIGNED_SHORT. I have made sure OES_depth_texture is available. Rendering works fine but clearing isn’t working (properly). To be on the safe side I’m enabling and clearing anything I can find. This should work, right?


glClearDepthf(1.0f);
glDepthMask(GL_TRUE); // depth writes must be enabled or the depth clear is ignored
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glClearColor(1.0f, 1.0f, 1.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); // glClear also respects the scissor test
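One sanity check worth adding before the clear (it only verifies completeness, it doesn’t fix anything by itself): in ES 2.0, which attachment combinations make a complete framebuffer is partly implementation-defined, and some drivers report unusual setups as GL_FRAMEBUFFER_UNSUPPORTED.

```c
#include <GLES2/gl2.h>

/* Verify the currently bound FBO is complete before clearing/drawing;
 * clears and draws to an incomplete FBO are silently undefined. */
int fbo_is_complete(void)
{
    return glCheckFramebufferStatus(GL_FRAMEBUFFER)
           == GL_FRAMEBUFFER_COMPLETE;
}
```

This needs a live GL context, so it is only a sketch of the call to make right after attaching the depth texture.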

For some reason this doesn’t clear the texture on my Android device with a Mali-400 MP GPU. The same implementation works on OS X and even in WebGL. I haven’t tried other Android devices.

My guess is that this has something to do with the fact that the depth texture has an integer type. GLES3 introduced glClearBufferiv for clearing integer buffers, but that isn’t available in GLES2 (probably because integer types aren’t a valid format without extensions). Could this be the case?

How are you guys clearing depth attachment FBOs? Does it work for you on all devices?

Any suggestions for going around this issue?

An update on this:
It’s not only the depth texture that isn’t cleared. I get similar problems with a normal depth buffer when rendering to texture. The issue almost disappears when I reduce the size of the texture, but not quite; there is still some flickering. Might it be a memory issue? I’ll try to optimize my memory usage and see what happens.

Is the FBO bound when you call to these functions ?
Which draw buffers are set before your clear attempt ?

Have you scissoring enabled (and a scissor rectangle set)?

Is the FBO bound when you call to these functions ?
Which draw buffers are set before your clear attempt ?

Yes, the correct FBO is bound.
glDrawBuffers isn’t available in OpenGL ES 2.0, so I guess no draw buffers are set? Or what do you mean by this?

Have you scissoring enabled (and a scissor rectangle set)?

I haven’t enabled scissoring at any point. Trying to disable it anyways yields no result.

I really don’t think this is a programming issue, unless I’ve forgotten some critical step in setting up the buffers; the behaviour is too random. Sometimes the buffer is cleared every few frames, sometimes almost never, and sometimes it is cleared but nothing is rendered to it (or it’s cleared for some reason after rendering?). It feels like the driver is buggy, or like I’m doing some out-of-bounds memory operation that causes this random behaviour. I don’t think it’s the latter, because the same solution works on OS X and even in WebGL. What could cause such weird, buggy behaviour on this GPU? My FPS is pretty low; is it normal for GPUs to start acting buggy when overloaded?

More input: I’m using SDL for windowing. If I add a 50 ms delay after SDL_GL_SwapWindow, everything looks good (except for the frame rate). A 10 ms delay doesn’t affect the result much. Maybe there’s an issue with double buffering? I tried to disable it, but it didn’t do anything. I’m not sure it’s even possible to disable it.

You might need a call to glFinish at some points in your code.
You almost certainly aren’t managing any double buffering of your FBOs yourself. Or are you? (And even if you were, that wouldn’t be handled by any SwapBuffers function or the like.)

Plus, answering prior questions would definitely help to solve your problem.

I tried to answer the questions but the reply went to moderator review. I guess it will appear at some point. But no, I don’t use scissoring, and GLES2 doesn’t have glDrawBuffers, so that question is irrelevant, right?

No, I’m not double buffering FBOs. Just wondering if SDL’s double buffering messes up something somewhere. Just a wild guess.

I tried glFinish. And it actually does something!
It drops the framerate to what I had with the 50 ms delay. It also produces the same result as the 15 ms delay. I’m not sure if it works by accident (because of the slowdown) or if it actually forces a finish that SDL_GL_SwapWindow for some reason doesn’t do.

[QUOTE=tikotus;1287287]I tried glFinish. And it actually does something!
It drops the framerate to what I had with 50ms delay. It also produces the same result as the 15ms delay. [/quote]

I would expect that since you’re using a mobile GPU. A full pipeline flush is very expensive on mobile as it defeats the DRAM-bandwidth-saving features of tile-based GPUs.

To get more ideas from folks, you might post some code. Best case, make this a standalone GLUT or SDL test program folks can compile and try locally. That would at least give you more input on results with other drivers.

[QUOTE=Dark Photon;1287288]I would expect that since you’re using a mobile GPU. A full pipeline flush is very expensive on mobile as it defeats the DRAM-bandwidth-saving features of tile-based GPUs.

To get more ideas from folks, you might post some code. Best case, make this a standalone GLUT or SDL test program folks can compile and try locally. That would at least give you more input on results with other drivers.[/QUOTE]

So yeah, glFinish is slow. But it fixes the issue. Does this mean the GPU is messing up its pipeline? Any faster ways to help the GPU understand what it should do than glFinish? glFlush doesn’t seem to help.

The code is part of a bigger game engine. Would take some time to rip out the rendering part and make it readable.

Can you show your context initialization code? It might be a useful place to start, and at the very least it would give us a minimal subset of code to rule out potential issues in.

SDL_SwapWindow has nothing to do with FBOs at all.

Here is my context setup: SDL OpenGL context creation · GitHub

SDL_SwapWindow has nothing to do with FBOs at all.

Not directly, but it has something to do with flushing, and the issue seems to be with flushing.

[QUOTE=tikotus;1287292]Here is my context setup: SDL OpenGL context creation · GitHub

Not directly, but it has something to do with flushing, and the issue seems to be with flushing.[/QUOTE]

True. But what I meant is that since there is an implicit glFlush/glFinish when calling SwapBuffers, one might believe they can use SwapBuffers for such synchronization, which might not work, as you experienced.
My guess is that the driver is clever enough to see that the synchronization only has to cover commands issued to the window framebuffer, not any potential FBOs. I might be wrong…

For testing purposes, try detaching the depth texture from the FBO before you use it for rendering in the shadow application pass.

Oh, and after rendering to your depth texture, when setting up for the shadow application pass, verify that you are binding your depth texture to the texture unit. IIRC that’s how the driver knows there’s an implicit flush needed for the FBO render (particularly in the absence of detaching the depth texture from the FBO).

You could also try binding 0 to the texture unit (i.e. unbinding whatever texture was bound) and then binding your depth texture before the shadow application pass to make sure the driver gets the picture.

Could be a driver bug you’re chasing, but it could also be a usage problem in your program.
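Put together, the detach/rebind suggestions above might look like the sketch below. `shadowFBO`, `depthTex` and `shadowUnit` are hypothetical application handles, not anything from the thread’s actual code:

```c
#include <GLES2/gl2.h>

/* After the shadow-map pass: detach the depth texture from the FBO so
 * the driver can resolve the render-to-texture dependency. */
void end_shadow_pass(GLuint shadowFBO)
{
    glBindFramebuffer(GL_FRAMEBUFFER, shadowFBO);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                           GL_TEXTURE_2D, 0, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}

/* Before the shadow application pass: unbind whatever texture was on
 * the unit, then bind the depth texture, so the driver sees that the
 * FBO's contents are about to be sampled. */
void begin_shadow_application(GLuint depthTex, GLuint shadowUnit)
{
    glActiveTexture(GL_TEXTURE0 + shadowUnit);
    glBindTexture(GL_TEXTURE_2D, 0);
    glBindTexture(GL_TEXTURE_2D, depthTex);
}
```

This needs a live GL context; it only illustrates the call sequence being proposed.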

No.

The first thing of course is to get your shadow rendering working properly without any expensive waits or explicit flushes. However, after you’ve solved that…

One thought on the performance: drivers often use an FBO as a placeholder for all of the rendering that’s targeted at that FBO. If you need renders to multiple off-screen render targets in flight at the same time (which you’re more than likely to need on mobile), then you shouldn’t render everything through one FBO but rather use a small pool of FBOs – enough to get you through 2-3 frames without re-use. That should let the driver efficiently parallelize rendering to different render targets.

However, there is a per-FBO memory cost, so don’t create a pool of dozens of them or you could blow your GPU memory budget.

What you really should do when you get to optimization is to pull out an ARM Mali GPU profiler and see how your workload is mapping to the GPU’s functional units. If you see big timing gaps on the vertex or fragment pipe, or you don’t see those pipes executing in parallel, then you’ve got something to fix on your side.

It’s not that glFinish() is slow per se. It’s that glFinish() waits for rendering to complete, and the rendering may be slow. If omitting glFinish() means that it runs faster but produces garbage, that suggests that you’re skipping much of the rendering, i.e. displaying what has been rendered by that point and discarding whatever is queued up. If that is what’s happening, there isn’t any solution that will be both fast and correct.

Provided that the CPU is mostly idle, glFinish() by itself shouldn’t have much of an effect upon performance. Although the glFinish() will take a while, the subsequent GL commands can be executed immediately; whereas without glFinish(), subsequent commands will just be queued up for the future. The main situation where synchronisation has a major performance penalty is if the CPU needs to do a lot of work to generate the data to pass to the GL. In that case, synchronisation means that execution alternates between the CPU and GPU, rather than the two working concurrently.
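As an aside, on EGL platforms there can be a middle ground between no synchronization and a full glFinish(): if the EGL_KHR_fence_sync extension is available, waiting on a fence only drains commands queued before the fence was inserted. A sketch, assuming the extension is present (in real code the *KHR entry points must be loaded via eglGetProcAddress and checked for NULL):

```c
#define EGL_EGLEXT_PROTOTYPES
#include <EGL/egl.h>
#include <EGL/eglext.h>

/* Wait for all GL commands submitted so far, without glFinish(). */
void fence_wait(EGLDisplay dpy)
{
    EGLSyncKHR sync = eglCreateSyncKHR(dpy, EGL_SYNC_FENCE_KHR, NULL);
    eglClientWaitSyncKHR(dpy, sync,
                         EGL_SYNC_FLUSH_COMMANDS_BIT_KHR,
                         EGL_FOREVER_KHR);
    eglDestroySyncKHR(dpy, sync);
}
```

Whether this is actually cheaper than glFinish() in this situation depends on the driver; it is only an option to try.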

I did this now for testing purposes, kind of. Now I attach the texture, depth texture and depth buffer (whichever are available) to the FBO before each draw call and detach them right after. Same with clears. I also set all texture units to 0 after the draw call and made sure I’m binding them before the draw call. It changed the result a bit. Depending on my setup and FPS it glitches in different ways. With very low FPS I get this kind of result (with and without glFinish; the white quad is the depth texture. Weird that it’s white when the quad is rendered but partially works during the shadowing?):

urls because I’m having issues with image uploads

EDIT: It looks like the depth texture is cleared halfway through the shadowing pass. The ground is rendered in more than one pass because of the vertex count. It seems like one of the passes succeeds while the other leaves polygons unshadowed. Only once in a few seconds is there a frame where all polygons are shadowed correctly, and once in a few seconds the depth texture shows correctly. Whenever the depth texture shows correctly, the shadows are also correct, but not vice versa. To me it seems that the depth texture is cleared before all meshes are rendered, except sometimes.

If omitting glFinish() means that it runs faster but produces garbage, that suggests that you’re skipping much of the rendering, i.e. displaying what has been rendered by that point and discarding whatever is queued up.

This makes sense to me. Still I would like to figure out why I need the explicit glFinish. Why does the driver discard the work?

This is somewhat true for a discrete desktop GPU (if you disable read-ahead), but less true for a mobile/embedded GPU where by-design the GPU is still rendering the work submitted last frame “this” frame. That is, it requires parallel CPU/GPU operation to avoid overrunning the low bandwidth of ordinary DRAM.

On mobile / GLES, glFinish() (if even honored correctly by the driver) may force the CPU to wait for several VSync clocks for the GPU to catch up, even though the CPU itself may have been totally idle. At that point, the app has totally missed the opportunity to submit a frame, which clearly reduces performance.

[QUOTE=tikotus;1287300]EDIT: It looks like the depth texture is cleared halfway through the shadowing pass. The ground is rendered in more than one pass because of the vertex count. Seems like one of the passes succeeds while the other one leaves polygons unshadowed.

Only once in a few seconds there is a frame when all polygons are shadowed correctly, and once in a few seconds the depth texture shows correctly. When ever the depth texture shows correctly, the shadows are also correct, but not vice versa. To me it seems that the depth texture is cleared before all meshes are rendered, except sometimes.[/QUOTE]

Good find! That’s definitely something you can work with.

When you say that the ground is rendered in more than one pass because of the vertex count, how do you know that? Are you talking about you are rendering it in multiple passes, or under-the-hood the GPU is rendering it in multiple passes?

If the latter, then I think you’re saying that when rendering into your shadow map, the geometry you’re submitting for your ground pass is overrunning the size of the tiled primitive buffer passed between the vertex and the fragment pipes, causing multiple read/rasterize/write passes to be performed when rendering to your shadow map (aka a pipeline flush).

If so, then this is going to reduce your performance of course, but unless you’re rendering with MSAA when the pipeline flush occurs, this shouldn’t generate any rendering artifacts unless there’s a bug involved (either in your graphics driver or in your app).

Here are a few things you might check into:

[ol]
[li]check the logs for your graphics driver to see if they indicate the size of that driver primitive buffer (associated with the framebuffer),[/li]
[li]look for ways to tune up the size of that primitive buffer in your graphics driver configuration, and[/li]
[li]try reducing the amount of geometry you’re sending to your shadow map to see if you can get it to work consistently. That would at least help you nail down the root cause.[/li][/ol]

What I’m wondering is: if your driver experiences a primitive buffer overflow, is it properly breaking up your rendering into multiple passes, or is it possible it’s just dumping the whole primitive buffer in the bit bucket when it overflows? The logs emitted by your graphics driver may help answer this question.

This one. And actually the ground isn’t even rendered to the shadowmap. The shadowmap is just used to shadow the ground.

try reducing the amount of geometry you’re sending to your shadow map to see if you can get it to consistently work. That would at least help you nail down the root cause.

I started with this. Now down to 3254 triangles.

The logs emitted by your graphics driver may help answer this question.

Can’t see anything interesting in the logs.

I further simplified the case (should have done this earlier). Getting really weird. Now I’m only rendering the simplified scene (around 3200 triangles) to a texture with a depth texture. I render the depth texture on a quad. FPS is over 90 and the quad is flickering. It looks correct maybe 70% of the frames, otherwise it’s empty. With glFinish it’s correct again.

Doesn’t seem like this is related to GPU being too busy. It’s something else in the pipeline.