The fastest way to get pixels from the display card

My app needs to read pixels back from the display card. I use glReadPixels() to do it, but the performance of that function doesn't satisfy me. Is there any OpenGL extension that serves the same purpose but is faster than glReadPixels()?

NV_pixel_data_range significantly helped us with glReadPixels performance… not necessarily that the call was faster, but that it could execute asynchronously.
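Roughly, the read path with that extension looks like the following. This is a sketch only: it assumes the NV_pixel_data_range and NV_fence entry points have already been fetched with wglGetProcAddress, that width and height describe the region being read, and the wglAllocateMemoryNV frequency/priority values are just example numbers.

// Allocate driver-optimized memory for the readback (parameter values are only an example).
GLubyte *pdrBuffer = (GLubyte *)wglAllocateMemoryNV(width * height * 4,
                                                    1.0f,   /* read frequency */
                                                    0.0f,   /* write frequency */
                                                    1.0f);  /* priority */

// Declare the range that readbacks may target and enable the read PDR path.
glPixelDataRangeNV(GL_READ_PIXEL_DATA_RANGE_NV, width * height * 4, pdrBuffer);
glEnableClientState(GL_READ_PIXEL_DATA_RANGE_NV);

GLuint fence;
glGenFencesNV(1, &fence);

// With PDR enabled, this call can return immediately and copy asynchronously.
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, pdrBuffer);
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);

// ... do other CPU work here ...

// Block until the readback has actually finished before touching pdrBuffer.
glFinishFenceNV(fence);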

Aren’t there like 3 identical questions on this board right now? Curious.

Anyway, to read back with reasonable speed, you should make sure that you ask for the same format of pixels that your card uses internally, and ideally with the same alignment.

This typically (for 32-bit contexts on x86 machines) translates into the GL_BGRA, GL_UNSIGNED_BYTE external pixel format. If you try to read back anything else, or if you read back into an unaligned buffer, or if your internal format is not 32-bit RGBA, you'll get a slow software path on many cards.
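For example, a readback that matches a typical 32-bit framebuffer might look like this (a minimal sketch; width and height are assumed to be the dimensions of the area you want, and the buffer must be large enough):

// 4-byte pixels with 4-byte-aligned rows match the common 32-bit framebuffer layout.
glPixelStorei(GL_PACK_ALIGNMENT, 4);

GLubyte *pixels = (GLubyte *)malloc(width * height * 4);

// BGRA / unsigned byte lets the driver do a straight copy instead of a format conversion.
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, pixels);

(GL_BGRA needs OpenGL 1.2 or the EXT_bgra extension, which any reasonably recent card exposes.)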

Thanks, OneSadCookie and jwatte, but what I want to know is the read-back bandwidth. The result I get is only 2-3 MB per second, which I can't believe! I think my method may have some problem, so could you please tell me what bandwidth I should expect for an AGP read-back?

JWatte, do you think the destination buffer should be aligned to any specific amount?

You might want to try what is mentioned in the green book on page 403. It certainly makes a difference for me. Just in case you don't have the green book, it suggests minimizing the per-fragment operations during read/draw and copy pixel operations.

glDisable(GL_ALPHA_TEST);
glDisable(GL_LIGHTING);
glDisable(GL_LOGIC_OP);
glDisable(GL_TEXTURE_1D);
glDisable(GL_TEXTURE_2D);
glDisable(GL_DITHER);
glDisable(GL_STENCIL_TEST);
glDisable(GL_DEPTH_TEST); // seems to have a tremendous effect on performance
glDisable(GL_BLEND);
glDisable(GL_FOG);
glBindTexture(GL_TEXTURE_2D,0); // seriously affects the CopyPixels performance
glBlendFunc(GL_ONE,GL_ZERO);

glPixelZoom(1.0,1.0);
/* Disable all unnecessary pixel transfer modes */
glPixelTransferi(GL_MAP_COLOR, GL_FALSE);
glPixelTransferi(GL_MAP_STENCIL, GL_FALSE);
glPixelTransferi(GL_INDEX_SHIFT, 0);
glPixelTransferi(GL_INDEX_OFFSET, 0);

glPixelTransferf(GL_RED_SCALE,1.0);
glPixelTransferf(GL_GREEN_SCALE,1.0);
glPixelTransferf(GL_BLUE_SCALE,1.0);
glPixelTransferf(GL_ALPHA_SCALE,1.0);
glPixelTransferf(GL_DEPTH_SCALE,1.0);

glPixelTransferf(GL_RED_BIAS,0.0);
glPixelTransferf(GL_GREEN_BIAS,0.0);
glPixelTransferf(GL_BLUE_BIAS,0.0);
glPixelTransferf(GL_ALPHA_BIAS,0.0);
glPixelTransferf(GL_DEPTH_BIAS,0.0);
/* Pixel store alignment */
glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);
glPixelStorei(GL_PACK_ALIGNMENT, 1);

Hope this helps.

Heath.

There was some article at nvidia about using GDI functions for getting back the front buffer. Look it up.

It’s better than a plain glReadPixels

From the faq on the NVidia site:

"BGRA is and always has been the fastest format to use. (There are some cases where RGBA is OK, and usually BGR is better than RGB, but in general, BGRA is the safest mode.)

The fastest performance you'll get on a readback is approximately 160-180 MB/s (~45 MPix/s) for RGBA/BGRA, which is the GPU hardware limit (due to PCI reads on the memory interface). This is with a P4 1.5GHz and above class system. The readback rate doesn't change significantly with the GeForce FX family. Note that you'll get the highest performance when you read back large areas as opposed to small ones."
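If you want to compare your own numbers against that figure, a rough timing loop is enough. This is just a sketch: it assumes a current GL context and a pre-allocated width*height*4 byte buffer, and clock() is coarse (it measures CPU time on some platforms), so substitute a high-resolution timer for serious measurements.

#include <stdio.h>
#include <time.h>

/* Read the same region many times and report an approximate readback rate in MB/s. */
void measure_readback(int width, int height, void *buffer)
{
    const int iterations = 100;
    int i;
    double seconds, megabytes;
    clock_t start = clock();

    for (i = 0; i < iterations; ++i)
        glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, buffer);
    glFinish();  /* make sure the last read has really completed */

    seconds   = (double)(clock() - start) / CLOCKS_PER_SEC;
    megabytes = (double)iterations * width * height * 4.0 / (1024.0 * 1024.0);
    printf("readback: %.1f MB/s\n", megabytes / seconds);
}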

Originally posted by V-man:
[b]There was some article at nvidia about using GDI functions for getting back the front buffer. Look it up.

It’s better than a plain glReadPixels[/b]

I can’t find the article, how much faster is it?

Originally posted by Adrian:
I can’t find the article, how much faster is it?

For source code, look at the NVidia SDK under Demos\OpenGL\src\shared\MovieMaker.cpp

Avi

If you do a glDisable(GL_BLEND); then the glBlendFunc() call is irrelevant. To be consistent, why doesn't the code also change the AlphaFunc and DepthFunc to something easy?

Regarding alignment, I’d assume each row needs to be aligned on at least 4 bytes. The next bigger alignment size that might make sense is 8 bytes; the next up is cacheline size; the next up is page size. I don’t think anything > 8 bytes alignment is likely to matter.
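If you want to experiment with alignment, one simple trick is to over-allocate with malloc and round the pointer up yourself. This is just a sketch; aligned_alloc_simple and aligned_free_simple are made-up helper names, and 16 is only an example alignment.

#include <stdlib.h>

/* Allocate size bytes aligned to "alignment" (a power of two), keeping the
   original malloc pointer just before the aligned block so it can be freed. */
void *aligned_alloc_simple(size_t size, size_t alignment)
{
    unsigned char *raw = (unsigned char *)malloc(size + alignment + sizeof(void *));
    unsigned char *aligned;

    if (!raw)
        return NULL;

    aligned = (unsigned char *)
        (((size_t)(raw + sizeof(void *)) + alignment - 1) & ~(alignment - 1));
    ((void **)aligned)[-1] = raw;   /* stash the original pointer for freeing */
    return aligned;
}

void aligned_free_simple(void *p)
{
    if (p)
        free(((void **)p)[-1]);
}

Pass the returned pointer straight to glReadPixels, e.g. aligned_alloc_simple(width * height * 4, 16), and release it with aligned_free_simple().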

Yes that’s it!
http://cvs1.nvidia.com/DEMOS/OpenGL/inc/shared/MovieMaker.h
http://cvs1.nvidia.com/DEMOS/OpenGL/src/shared/MovieMaker.cpp
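The core of that capture path (not the actual MovieMaker code, just a rough sketch of a typical GDI BitBlt capture; hwnd, width and height are assumed to describe your GL window) looks something like this:

#include <windows.h>

// Grab the visible contents of the GL window through GDI instead of glReadPixels.
HDC        windowDC = GetDC(hwnd);
HDC        memDC    = CreateCompatibleDC(windowDC);
void      *pixels   = NULL;
BITMAPINFO bmi;
HBITMAP    dib;
HGDIOBJ    old;

ZeroMemory(&bmi, sizeof(bmi));
bmi.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
bmi.bmiHeader.biWidth       = width;
bmi.bmiHeader.biHeight      = -height;   /* negative height = top-down rows */
bmi.bmiHeader.biPlanes      = 1;
bmi.bmiHeader.biBitCount    = 32;        /* 32-bit BGRA, same layout as the framebuffer */
bmi.bmiHeader.biCompression = BI_RGB;

dib = CreateDIBSection(windowDC, &bmi, DIB_RGB_COLORS, &pixels, NULL, 0);
old = SelectObject(memDC, dib);

// Copy the window contents into the DIB; "pixels" then holds the BGRA data.
BitBlt(memDC, 0, 0, width, height, windowDC, 0, 0, SRCCOPY);

SelectObject(memDC, old);
/* ... use pixels ... */
DeleteObject(dib);
DeleteDC(memDC);
ReleaseDC(hwnd, windowDC);

Note that this reads whatever is visible in the window, which matches the "front buffer" use described above.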

Thanks, yes I have seen that before. I was hoping there would be some information as to how and why it is supposedly faster. I'm a little sceptical. If it were faster I would have expected to find information and benchmarks via Google. I would also expect NVidia's ReadPixels FAQ to recommend it as an alternative method, but there is no mention of it. It doesn't add up.

Originally posted by jwatte:
I don’t think anything > 8 bytes alignment is likely to matter.

I don’t know about Windows/Linux, but on the Mac, it helps a great deal to have pointers 16-byte aligned (many system routines use AltiVec, and AltiVec requires 16-byte-aligned pointers), and it helps a great deal to align large buffers to the size of a cacheline (32 bytes for G3 & G4; 128 bytes for G5). Page alignment seems mostly to be overkill, but at least you’re guaranteed that it’s aligned the best way possible.

OneSadCookie,

On x86, it also helps to align buffers on 16 bytes if you want to use parallel instructions. Unfortunately, the data bus of the CPU is only 64 bits wide, so the wider alignment won’t give you any speed in copy operations.

Similarly, we’re copying large chunks from uncacheable memory to cacheable system memory (assuming it goes through the CPU) so the only benefit of aligning on cache lines would be avoiding the partial cache line eviction at the beginning/end of the large block – but the cost of the block would totally dwarf that.

If you manage to hit a fully-DMA path on the hardware, then the hardware doesn’t even see the cache, so anything more than 4-byte alignment would probably not be necessary – make that 8 for good measure :)

To V-man:
You say the method of using GDI is better than glReadPixels(). Why? I have browsed the code, and I know the core of it is a screen capture, but do you really think a screen capture is faster than an API that operates on the hardware directly? I have always thought the performance of GDI functions was not good, and MS has since released GDI+, so I think the NVSDK method is not better than glReadPixels(). Do you think that's right?

Originally posted by pango:
To V-man:
You say the method of using GDI is better than glReadPixels(). Why? I have browsed the code, and I know the core of it is a screen capture, but do you really think a screen capture is faster than an API that operates on the hardware directly? I have always thought the performance of GDI functions was not good, and MS has since released GDI+, so I think the NVSDK method is not better than glReadPixels(). Do you think that's right?

You can always benchmark and see for yourself. I haven't benchmarked it, but I think (and others have said so) that it is faster.

#1 GDI is hardware accelerated (some functions may not be available)
#2 GDI+ is GDI with a few extras, plus it is OO. The primary reason for its existence is OO design, not performance or hw accel.

Originally posted by V-man:
[b] You can always benchmark and see for yourself. I haven't benchmarked it, but I think (and others have said so) that it is faster.

#1 GDI is hardware accelerated (some functions may not be available)
#2 GDI+ is GDI with a few extras, plus it is OO. The primary reason for its existence is OO design, not performance or hw accel.

[/b]

I benchmarked it a while back (before I knew of NV_pixel_data_range) and GDI was much faster. I'm not sure how it compares with the extension, but it's important to test. I don't remember the actual numbers.

The reason the GDI func was fast, I recall, was that it’s used for MS Video for Windows, which was a high priority for MS (both fast reads and writes to the framebuffer for obvious reasons).

Avi

I haven’t had the chance to look at the GDI specs, but will it allow you to specify the type of pixels you want to read back? (GL_READ, GL_DEPTH etc?)

~Main
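For what it's worth, a GDI BitBlt only sees the window's visible colour contents, while glReadPixels itself can already read depth or stencil. A minimal depth readback (just a sketch; width and height are assumed) would be:

// Read the depth buffer back as floats.
GLfloat *depth = (GLfloat *)malloc(width * height * sizeof(GLfloat));
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, depth);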