async glReadPixels on HD3000 stalls the pipeline



AnimalCoder
11-22-2011, 05:35 AM
Hi,

I have a problem getting exposure data back from my HDR rendering pipeline quickly and asynchronously on Mac OS X Lion, on an Intel Sandy Bridge HD 3000 MacBook Air mid-2011 (MBA4,2). I tried on both 10.7.1 and 10.7.2.

It's really annoying, because I really only need to read back like a single float :) (or ideally a couple of pixels, but the problem has nothing to do with bandwidth).

The theory, which works fine on my Nvidia Linux box for example, is to create and bind a PBO, then issue glReadPixels asynchronously to trigger the readback of data. I'm reading from a bound FBO of GL_RGBA16F. Several frames later I call glMapBuffer on the PBO and read out the data (one or more pixels), thus hiding any latency due to the driver inserting the readback in the normal GPU command queue.
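A minimal sketch of the frame bookkeeping this scheme implies, in C. The names (`ReadbackRing`, `READBACK_SLOTS`, and both helpers) are hypothetical and not from my actual code, and the GL calls appear only in comments to show where they would go. It assumes one PBO per slot and a fixed ring of four slots.

```c
#include <stddef.h>

/* Hypothetical ring of PBOs: one slot is written per frame and the
   oldest slot is mapped, so the map never waits on a fresh readback. */
#define READBACK_SLOTS 4

typedef struct {
    unsigned frame; /* number of frames submitted so far */
} ReadbackRing;

/* Slot to fill this frame. The caller would do roughly:
     glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[slot]);
     glReadPixels(x, y, w, h, format, type, (void *)0);  // offset into the PBO
*/
static unsigned readback_write_slot(const ReadbackRing *r)
{
    return r->frame % READBACK_SLOTS;
}

/* Oldest slot, written READBACK_SLOTS - 1 frames ago. The caller would do:
     glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[slot]);
     ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
   Returns -1 while the ring is still filling up. */
static int readback_read_slot(const ReadbackRing *r)
{
    if (r->frame < READBACK_SLOTS - 1)
        return -1;
    return (int)((r->frame + 1) % READBACK_SLOTS);
}
```

Each frame would query both slots, issue the readback and the map, and then increment `frame`. With four slots the mapped data is three frames old, which is fine for something slow-moving like exposure.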

On Lion with HD3000 graphics, even the asynchronous glReadPixels stalls the pipeline for about 40 ms. I guess it simply flushes the entire GPU pipe and then uses the CPU to copy the data to the PBO.

On Nvidia, it does this too, unless you have exactly the right data type parameters (GL_BGRA and the same format as your FBO, for example GL_HALF_FLOAT in my case). So naturally I tried every combination of datatypes known to man on the Intel, but all of them stall.
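For reference, data read back as GL_HALF_FLOAT arrives as 16-bit IEEE 754 half-precision values (1 sign bit, 5 exponent bits, 10 mantissa bits) and has to be widened on the CPU. A sketch of that conversion in C follows; `half_to_float` is a hypothetical helper name, not part of OpenGL.

```c
#include <stdint.h>
#include <string.h>

/* Convert one IEEE 754 half-precision value to a 32-bit float. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    float f;

    if (exp == 0) {
        if (mant == 0) {
            bits = sign; /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit leading 1 appears. */
            int shift = 0;
            while ((mant & 0x400u) == 0) {
                mant <<= 1;
                shift++;
            }
            mant &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 + 1 - shift) << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (mant << 13); /* infinity or NaN */
    } else {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 127 - 15) << 23) | (mant << 13);
    }

    memcpy(&f, &bits, sizeof f); /* reinterpret the bit pattern as a float */
    return f;
}
```

For a single exposure value, the mapped PBO pointer would be treated as an array of `uint16_t` and the channel of interest passed through this function.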

I also tried with a normal 8-bit FBO (GL_BGRA and GL_RGBA with GL_UNSIGNED_BYTE), I tried reading back from the default window-system framebuffer, and I also tried glGetTexImage() to no avail.

I downloaded a small PBO demo from one of the OpenGL forums which is supposed to demonstrate asynchronous use, and that too fails on the MacBook (no performance gain). So I'm fairly certain I'm not making any trivial mistake here.

Is there no game with an HDR rendering pipeline that does exposure and runs on a Mac with an HD3000, which I could snoop on to find the One Almighty Allowed Combination of Parameters? :)

Or might it be so bad that asynchronous data-readout on this driver is simply not supported?

FWIW, on 10.7.1 the OpenGL version was 2.1, and on 10.7.2 I still run 2.1 even though I could switch to core 3.2. Is there any reason to suspect that switching to core 3.2 on Lion would make this work? For example, does it actually switch to a physically different driver?

Any help at all would be most welcome :)

ZbuffeR
11-22-2011, 01:18 PM
I would instantly blame Intel for being as lazy as usual on the graphics driver/hardware.
I am not familiar with Mac drivers, but indeed it could switch to a different codepath. I guess it is worth a try, as it looks like the best solution.

Bruce Wheaton
11-22-2011, 09:28 PM
Are you doing a glFlush after the read pixels call?

Otherwise your GL commands could be just sitting in the queue, and then when you map the PBO, all the work is being done at once.

Bruce

AnimalCoder
11-23-2011, 02:02 AM
Hi Bruce,

I'm not doing any explicit glFlush, nor should I have to, as I understand it. I'm not aware of any GPU which ignores commands until a glFlush... but stranger things have happened of course :) Anyway, any remaining commands are flushed implicitly when a framebuffer has to be swapped to the screen. Up until then, all commands should be pipelined on the GPU rather than going through a CPU codepath, including asynchronous glReadPixels calls.

It is true that glMapBuffer() will flush and stall until the associated buffer has been filled. But I overlap these with up to 4 frames, so I don't try to map the buffer until LONG afterwards. Also, it is not glMapBuffer which stalls, it is glReadPixels, as if it is going into a software fallback (probably the case).

AnimalCoder
11-23-2011, 02:07 AM
ZbuffeR wrote:
"I would instantly blame Intel for being as lazy as usual on the graphics driver/hardware.
I am not familiar with Mac drivers, but indeed it could switch to a different codepath. I guess it is worth a try, as it looks like the best solution."

I was going to try it, but I'd have to overhaul the rest of my code to fit Core 3.2 first, so it's not done in 2 minutes. In the meantime, I'll go work on some part other than the exposure algorithm.

I've heard a lot of bad things about Intel's former drivers, but Apple's coders seem to have done the Right Thing with the latest Mac OS X update, which added core 3.2 support across the board, even for the Intel MacBooks which many people didn't believe they would care about. So obviously Apple's coders have been involved a lot in this driver, and it's not as certain who's to blame now :)

mhagain
12-15-2011, 08:40 AM
My experience with Intels is that they need GL_UNSIGNED_INT_8_8_8_8_REV instead of GL_UNSIGNED_BYTE (and you absolutely must use GL_BGRA) otherwise they'll put you through a software stage, PBO or no PBO.
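To make the layout concrete: with GL_BGRA and GL_UNSIGNED_INT_8_8_8_8_REV each pixel is one 32-bit word with blue in the lowest byte and alpha in the highest, i.e. 0xAARRGGBB. A small C sketch of unpacking such a word follows; `Rgba8` and `unpack_bgra_8888_rev` are hypothetical names used only for illustration.

```c
#include <stdint.h>

typedef struct {
    uint8_t r, g, b, a;
} Rgba8;

/* Unpack one pixel read with format GL_BGRA and type
   GL_UNSIGNED_INT_8_8_8_8_REV. The first component (blue) sits in
   bits 0-7 and the last (alpha) in bits 24-31. */
static Rgba8 unpack_bgra_8888_rev(uint32_t p)
{
    Rgba8 c;
    c.b = (uint8_t)(p & 0xFFu);
    c.g = (uint8_t)((p >> 8) & 0xFFu);
    c.r = (uint8_t)((p >> 16) & 0xFFu);
    c.a = (uint8_t)((p >> 24) & 0xFFu);
    return c;
}
```

On a little-endian CPU this is the same byte order in memory as GL_BGRA with GL_UNSIGNED_BYTE, so only the type argument passed to glReadPixels changes.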

AnimalCoder
01-09-2012, 01:55 PM
@mhagain: you are a HERO!

I switched from GL_UNSIGNED_BYTE in glReadPixels to GL_UNSIGNED_INT_8_8_8_8_REV in the demo I have, and it worked! Haven't tested it in my production code yet but since it definitely changed the demo, which stalled before, it feels good :)

Now, I really wanted to read back GL_HALF_FLOATs instead, but I'll take whatever works for now... you don't happen to be sitting on some inside intel on reading floats on Intel too? ;)

It's too bad these (apparently known) shortcomings are not documented anywhere..

Thanks

mhagain
01-12-2012, 10:10 AM
I'm not aware of any float performance issues (or lack of issues) with Intel. I have confirmed, though, that uploads to texture resources using GL_RGB10_A2 are also stall-free, so the same may apply to reads from an FBO. GL_RGB10_A2 is a format supported by D3D too, so it seems a reasonable assumption that it's a "safe" format to use on Intel.

Not an ideal solution I know, but it does give you a couple more bits of precision in the colour channels, which may be good enough.
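If the 10-bit route is tried, the readback would presumably use GL_RGBA with GL_UNSIGNED_INT_2_10_10_10_REV. That pairing is an assumption here, and nothing in this thread confirms it is stall-free on the HD 3000. A C sketch of unpacking one such pixel, with hypothetical names `Rgb10a2` and `unpack_rgba_2101010_rev`:

```c
#include <stdint.h>

typedef struct {
    uint16_t r, g, b; /* 10 bits each, range 0..1023 */
    uint8_t  a;       /* 2 bits, range 0..3 */
} Rgb10a2;

/* Unpack one pixel read with format GL_RGBA and type
   GL_UNSIGNED_INT_2_10_10_10_REV: red in bits 0-9, green in 10-19,
   blue in 20-29, alpha in 30-31. */
static Rgb10a2 unpack_rgba_2101010_rev(uint32_t p)
{
    Rgb10a2 c;
    c.r = (uint16_t)(p & 0x3FFu);
    c.g = (uint16_t)((p >> 10) & 0x3FFu);
    c.b = (uint16_t)((p >> 20) & 0x3FFu);
    c.a = (uint8_t)((p >> 30) & 0x3u);
    return c;
}

/* Map a 10-bit channel to the 0..1 range. */
static float unorm10_to_float(uint16_t v)
{
    return (float)v / 1023.0f;
}
```

Being a normalized format, it only covers 0..1, so an HDR exposure value would have to be scaled or encoded into that range in the shader before being written out.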