Taking screenshot rapidly with Pixel Buffer Objects

Hello there,

I have been reading lots about using PBOs to capture screenshots rapidly and have got it working. Here is my scenario.

Hardware platform : Android with OpenGL 3.0.
I have images streaming to an Android tablet at about 40-50 FPS. I then need to take screenshots as quickly as I can (once per frame if possible) and transmit them to a remote server. Unfortunately, the remote server takes an RGBA bitmap byte array, i.e. TransmitToServer(byte[] bitmap);

Here is my question.

I have been successful in using PBOs to read the data back at the full 40-50 FPS, but where I run into trouble is packing that data into the byte array for transmission.

The code below has been truncated for simplicity.

… onDrawFrame() loop
{
index = index % 2;
nextIndex = (index + 1) % 2;

glBindBuffer (GL_PIXEL_PACK_BUFFER, pboIndex[index]);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL); // trigger glReadPixels

glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIndex[nextIndex]); // I am positive this is not waiting for the previous draw to complete, as I have experimented with waiting 3-4 frames to be absolutely certain.
ByteBuffer byteBuffer = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, datasize, GL_MAP_READ_BIT); // read the buffer from the previous frame. The glReadPixels should be done, so this should return immediately.

// Package byteBuffer into byte array.
// This is where I am having a major slowdown. I have used memcpy in JNI and just straight ByteBuffer.clone. They all work correctly but are just too slow.
// When I clone the data from a predefined array of the same datasize, it completes in 5ms but with the buffer pointer coming back from glMapBufferRange, it takes almost 20-30ms. What gives?

index++;
}

Any suggestions would be greatly appreciated. My application is working correctly but it is just too slow. Thanks again.

Cheers.

OpenGL 3.0 or OpenGL ES 3.0? They aren’t the same thing.

OpenGL ES has a section on the Khronos message boards, which may be worth a try.

If the device has dedicated video memory (uncommon for a mobile device, but possible), you would expect copying from video memory to system memory to be slower than copies between regions of system memory. Even if it uses the same physical RAM for video memory and system memory, the region used for video memory may be normally unmapped while the GPU is active, resulting in a task switch when you try to access it.

If you’re using OpenGL 3, you could try using glGetBufferSubData() instead, but that isn’t available in OpenGL ES 3.

Ultimately, copies from video memory to system memory are considered less important than the other cases and are often slower as a consequence (i.e. if the designer can make something else faster at the expense of read-back performance, they probably will).

[QUOTE=hujanais;1272133]Hello there,

I have been reading lots about using PBOs to capture screenshots rapidly and have got it working. Here is my scenario.

Hardware platform : Android with OpenGL 3.0.[/QUOTE]

What embedded platform is this?
What GPU does it have in it?
What speed of DRAM is in the system?
Does it have dedicated GPU memory?

As GClements said, there is an OpenGL ES Forum on Khronos.org, and you should certainly try posting on it. That said, in my experience, the Khronos GL-ES forum is not very active. By contrast, we get all kinds of OpenGL ES questions on the OpenGL.org forums and have lots of folks reading here. So feel free to post here if you don’t get what you need.

glBindBuffer (GL_PIXEL_PACK_BUFFER, pboIndex[index]);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, NULL); // trigger glReadPixels

What values are assigned to width and height?
What is the format of the color buffer you’re reading from (e.g. RGB565, RGBA4, RGB8, RGBA8, etc.)?
Does the format you’re reading from match the format (bit depth, component order, and packing) you’re asking for?
Is the buffer you’re reading from an EGL surface or a color attachment in an FBO?

glBindBuffer(GL_PIXEL_PACK_BUFFER, pboIndex[nextIndex]); // I am positive this is not waiting for the previous draw to complete, as I have experimented with waiting 3-4 frames to be absolutely certain.
ByteBuffer byteBuffer = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, datasize, GL_MAP_READ_BIT); // read the buffer from the previous frame. The glReadPixels should be done, so this should return immediately.

Have you put timing calipers around all of the above section of code to verify that this is all taking near-zero time? How much time do you measure?
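One way to get those numbers is to wrap each stage in its own timer and log the average every so often. The sketch below is only an illustration: StageTimer is a hypothetical helper, not part of Android or GLES, and how the loop is split into stages is an assumption.

```java
/** Hypothetical helper: accumulates CPU-side timings for one stage of the frame loop. */
final class StageTimer {
    private final String name;
    private long startNanos;
    private long totalNanos;
    private long maxNanos;
    private int samples;

    StageTimer(String name) {
        this.name = name;
    }

    /** Call immediately before the stage being measured. */
    void begin() {
        startNanos = System.nanoTime();
    }

    /** Call immediately after the stage; records one sample. */
    void end() {
        long elapsed = System.nanoTime() - startNanos;
        totalNanos += elapsed;
        maxNanos = Math.max(maxNanos, elapsed);
        samples++;
    }

    int samples() {
        return samples;
    }

    double averageMillis() {
        return samples == 0 ? 0.0 : totalNanos / 1e6 / samples;
    }

    double maxMillis() {
        return maxNanos / 1e6;
    }

    @Override
    public String toString() {
        return String.format("%s: avg %.2f ms, max %.2f ms over %d samples",
                name, averageMillis(), maxMillis(), samples);
    }
}
```

In onDrawFrame() you would keep one StageTimer per stage and call begin()/end() around the glReadPixels call, around glMapBufferRange, and around the copy. Keep in mind that GL calls are queued, so a small number around glReadPixels into a PBO only shows that the call was submitted quickly; if the GPU has not finished, the wait would usually show up in the map call.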

// Package byteBuffer into byte array.
// This is where I am having a major slowdown. I have used memcpy in JNI and just straight ByteBuffer.clone. They all work correctly but are just too slow.
// When I clone the data from a predefined array of the same datasize, it completes in 5ms but with the buffer pointer coming back from glMapBufferRange, it takes almost 20-30ms. What gives?

You say “package” but then you also say “memcpy” so it’s not clear. Is there any processing involved (e.g. repacking), or is this literally just a memcpy (possibly prepending a header)?

As GClements said, GPUs and GPU drivers typically aren’t optimized for the readback case. Some vendors even cripple the readback performance to serve some marketing goal. That said, with knowledge of your GPU and what formats and methods work best with it for readback, you can often increase your readback performance.

If you would, please make sure that the code prior to and including the MapBufferRange() call is “really” coming back to you in almost zero time. If the previous render and readback hasn’t completed, it is here that you would expect to see a stall. Readbacks are especially bad on mobile GPUs because most of them have low memory bandwidth and run with an added frame of draw latency to try and cover for the very slow CPU/system memory they’re typically forced to use, and a readback will cause a full pipeline flush and sync which is particularly time consuming. Also keep in mind that many GPUs don’t store framebuffer pixel data in the order that you want it to be read back in, so often the driver and possibly the GPU have to do extra work to at least reorder the pixel data if not also convert the pixel format (if what it has and what you want don’t match).

If it truly is your memcpy out of the mapped PBO that is slow, then short of optimizing the readback format, resolution, and method based on your knowledge of the GPU and GPU driver, you’re somewhat at the mercy of the speed of memory your GL driver is putting that buffer in and the speed of your system memory.

Also check out:

Thanks to GClements and DarkPhoton for your replies.

This app needs to run on various Android tablets. Here is one of the weakest tablet targets I have.

Hardware and OpenGL question?
Galaxy Tab Active. GPU : Adreno 305. CPU : Quad-core 1.2 GHz Cortex-A7
Using OpenGL ES 3. Pretty sure there is no dedicated GPU memory.

Have you put timing calipers around all of the above section of code to verify that this is all taking near-zero time? How much time do you measure?
Yes, the glReadPixels and glMapBufferRange calls are fast. I don’t have actual numbers, but if I read the EGL surface that way and redisplay it on another EGLSurfaceView, I can easily do it at 60FPS.

As for the memcpy, I am just doing a straight copy and absolutely no re-processing of the data. I am just moving the data out from the PBO into a byte[] and done.

Thanks.

Teik

[QUOTE=hujanais;1272153]This app needs to run various Android tablets. Here is one of the weakest tablet targets I have. … GPU : Adreno 305. … Using OpenGL ES 3. …

Yes, the glReadPixels and glMapBufferRange calls are fast. I don’t have actual numbers, but if I read the EGL surface that way and redisplay it on another EGLSurfaceView, I can easily do it at 60FPS.

As for the memcpy, I am just doing a straight copy and absolutely no re-processing of the data. I am just moving the data out from the PBO into a byte[] and done. [/quote]

That’s interesting.

It sure does sound like, assuming your app is getting the CPU and memory cycles it needs to do the memcpy, that that memory is just very slow to read from. A couple thoughts:

  1. Consider changing the pixel format (and readback format) to reduce the total amount of data you need to copy,
  2. You might try the EGLImage method of fetching the pixels (see the link I posted above). It might give you a pointer back that’s faster to memcpy from.
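To put rough numbers on the first suggestion, here is a small sketch of how the payload scales with pixel size and resolution. The 1920x1080 surface and the 50 FPS rate are assumptions for illustration (the thread only says the frame is about 7MB), and ReadbackPayload is a hypothetical name.

```java
/** Hypothetical helper: estimates how much data a readback has to move. */
final class ReadbackPayload {
    private ReadbackPayload() {
    }

    /** Bytes in one tightly packed frame. */
    static long bytesPerFrame(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }

    /** Sustained copy rate, in MB/s (1 MB = 1,000,000 bytes), needed at a given frame rate. */
    static double megabytesPerSecond(long bytesPerFrame, double framesPerSecond) {
        return bytesPerFrame * framesPerSecond / 1_000_000.0;
    }
}
```

With those assumed dimensions, an RGBA8 readback (4 bytes per pixel) is 8,294,400 bytes per frame, while a 2-byte format such as RGB565 would be 4,147,200 bytes, and halving both width and height cuts either figure to a quarter. Note that in OpenGL ES 3.0 glReadPixels only guarantees GL_RGBA with GL_UNSIGNED_BYTE for normalized color buffers, plus one implementation-chosen pair that you can query with GL_IMPLEMENTATION_COLOR_READ_FORMAT and GL_IMPLEMENTATION_COLOR_READ_TYPE, so a smaller format is not guaranteed to be available on every device. Since the server wants RGBA, a narrower readback would presumably also need a conversion step before transmission.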

Your code might be slow because you create lots of Java objects which need to be garbage collected.
Could you show us the actual Java code you use with your buffers? In general, when performance is important, you should try to keep objects around and re-use them instead of creating new ones.
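As a rough illustration of the keep-and-reuse idea, the sketch below allocates the destination byte[] once and refills it on every frame with a bulk get(). FrameCopier is a hypothetical name, and any direct ByteBuffer can stand in for whatever glMapBufferRange returns; this is not the original poster's code.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: copies each mapped frame into one preallocated array. */
final class FrameCopier {
    private final byte[] frame; // allocated once, reused for every frame

    FrameCopier(int dataSize) {
        frame = new byte[dataSize];
    }

    /**
     * Copies the first frame.length bytes of the mapped buffer into the
     * reusable array. The source buffer's own position and limit are untouched.
     */
    byte[] copyFrom(ByteBuffer mapped) {
        if (mapped.capacity() < frame.length) {
            throw new IllegalArgumentException("mapped buffer is smaller than one frame");
        }
        ByteBuffer view = mapped.duplicate(); // shares the memory, has its own position/limit
        view.clear();                         // position = 0, limit = capacity
        view.get(frame, 0, frame.length);     // single bulk copy, no per-frame allocation
        return frame;
    }
}
```

The trade-off is that the returned array is overwritten by the next call, so whatever consumes it (TransmitToServer in the original post) has to be done with it before the next frame, or you keep two such arrays and alternate between them.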

Hello there. Here is the bare minimum code that I just ran on a Nexus 9 (which is much faster than a Samsung Tab Active), and I am getting copy times of 23-25ms. Just incredibly slow. I wonder if the rest of the application is competing for CPU time. I should have thought of this earlier, but I think I will run this same code in a separate application with absolutely nothing else running.

I like DarkPhoton’s suggestion to look at the image format type and perhaps reduce the size of the payload. Thanks all.


	ByteBuffer destination = null;   // ByteBuffer that is preallocated and re-used.
	
	private void simpleTest() {
		if (frameNum == 0) {
			if(destination == null)
			{
				destination = ByteBuffer.allocate(datasize);
			}
			PBOEx pboObj = pboBufferFactory.peek(frameNum);
			GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, pboObj.PboId);
			glReadPixelsNull(imageWidth, imageHeight);
		} else {
			index = index % 2;
			PBOEx pboObj = pboBufferFactory.peek(index);
			GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, pboObj.PboId);

			ByteBuffer mappedBuffer = ((ByteBuffer) GLES30.glMapBufferRange(
					GLES30.GL_PIXEL_PACK_BUFFER, 0, datasize,
					GLES30.GL_MAP_READ_BIT)).order(ByteOrder.nativeOrder());
			
			// Cloning code.
			long st = System.nanoTime();
			mappedBuffer.rewind();
			destination.put(mappedBuffer);
			mappedBuffer.rewind();
			destination.flip();
			float et = (System.nanoTime() - st) / 1000000f;
			Log.v("FrameCapture", String.format("%.2f ms", et));
			
			GLES30.glUnmapBuffer(GLES30.GL_PIXEL_PACK_BUFFER);

			// Trigger the next read.
			glReadPixelsNull(imageWidth, imageHeight);  // This method is a call into JNI to trigger a zero read.
		
			index++;
		}

		// This is just here to setup the buffering queue backlog.
		if (frameNum < Integer.MAX_VALUE) {
			frameNum++;
		}
		
		// Reset
		glBindBufferNull();
	}

Just ran the same code in a bare minimum Android app and I am getting the same thing. The copy of 7MB of data takes about 20ms or so. Therefore the bottleneck definitely has something to do with reading the data back from the GPU to system memory. I will also look at DarkPhoton’s link on the EGLImage method. Thought I would update my findings as I go along.

Thanks again.

There seem to be three possibilities here:

A) Java is getting in the way.

B) Reading from GPU memory is slow.

C) Memory performance on your device is just that slow (~700MB/s. 7MB copy in 20ms. Since it’s a copy, you’re reading 7MB and writing 7MB, so it’s a total of 14MB of memory work.)
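The arithmetic behind that ~700MB/s figure can be written out as a small sketch. CopyBandwidth is a hypothetical name, and the only inputs are the 7MB and 20ms figures reported in the thread.

```java
/** Hypothetical helper: converts a measured copy time into bandwidth figures. */
final class CopyBandwidth {
    private CopyBandwidth() {
    }

    /** Payload rate: megabytes copied per second. */
    static double copyRateMBps(double megabytes, double millis) {
        return megabytes / (millis / 1000.0);
    }

    /** Memory traffic: every copied byte is read once and written once. */
    static double memoryTrafficMBps(double megabytes, double millis) {
        return 2.0 * copyRateMBps(megabytes, millis);
    }
}
```

For 7MB in 20ms this gives a copy rate of about 350MB/s and about 700MB/s of total memory traffic. The same formula applied to the 5ms reported earlier for a copy from a predefined array of the same size gives about 1400MB/s of copy rate, which is the gap the tests below are trying to explain.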

To figure out what the problem is, try the following tests:

  1. Measure the performance of copying a ByteBuffer that doesn’t come from mapping data. Just allocate a standard ByteBuffer of the proper size, then copy it into another ByteBuffer of the proper size.

  2. Write a C or C++ application that does all of the OpenGL work, maps the buffer, and measure how long it takes to copy from the mapped pointer into CPU-allocated memory.

If test 2 improves performance (regardless of the other test), then Java clearly seems to be the issue. If test 1 gets better performance and test 2 does not, then the problem is reading from GPU memory (which would be odd, since most Android devices use UMA). If neither test improves performance, then the problem has to be with hardware memory performance.
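A minimal sketch of test 1, assuming a direct ByteBuffer as the source (a mapped buffer is also a direct buffer) and a heap ByteBuffer as the destination, as in the code posted above. PlainCopyBenchmark, the warm-up count, and the 7MB size in main are assumptions for illustration.

```java
import java.nio.ByteBuffer;

/** Hypothetical benchmark: times ByteBuffer-to-ByteBuffer copies with no GL involved. */
final class PlainCopyBenchmark {
    private PlainCopyBenchmark() {
    }

    /** Returns the average time in milliseconds for one copy of dataSize bytes. */
    static double averageCopyMillis(int dataSize, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        ByteBuffer source = ByteBuffer.allocateDirect(dataSize); // ordinary native memory
        ByteBuffer destination = ByteBuffer.allocate(dataSize);  // heap buffer, reused

        // A few untimed copies so JIT compilation and first-touch page faults
        // do not end up in the measurement.
        for (int i = 0; i < 5; i++) {
            source.clear();
            destination.clear();
            destination.put(source);
        }

        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            source.clear();
            destination.clear();
            long start = System.nanoTime();
            destination.put(source);
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e6 / iterations;
    }

    public static void main(String[] args) {
        double ms = averageCopyMillis(7 * 1024 * 1024, 50);
        System.out.printf("plain copy: %.2f ms per frame%n", ms);
    }
}
```

Running this on the target device next to the timing of the mapped-buffer copy gives the comparison that test 1 asks for. Test 2 would be the same measurement around a memcpy from the mapped pointer in native code.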

[QUOTE=Alfonse Reinheart;1272176]There seem to be three possibilities here:

A) Java is getting in the way.

B) Reading from GPU memory is slow.

C) Memory performance on your device is just that slow (~700MB/s. 7MB copy in 20ms. Since it’s a copy, you’re reading 7MB and writing 7MB, so it’s a total of 14MB of memory work.)

To figure out what the problem is, try the following tests:

  1. Measure the performance of copying a ByteBuffer that doesn’t come from mapping data. Just allocate a standard ByteBuffer of the proper size, then copy it into another ByteBuffer of the proper size.
    Yes I did that test and the copy is only 3ms. 7-8x faster.

  2. Write a C or C++ application that does all of the OpenGL work, maps the buffer, and measure how long it takes to copy from the mapped pointer into CPU-allocated memory.
    Did the same test and no noticeable difference in the frame rate but I didn’t explicitly measure the memcpy itself.

If test 2 improves performance (regardless of the other test), then Java clearly seems to be the issue. If test 1 gets better performance and test 2 does not, then the problem is reading from GPU memory (which would be odd, since most Android devices use UMA). If neither test improves performance, then the problem has to be with hardware memory performance.[/QUOTE]

Thanks.

Teik

Hi there,

I put my comments inside the quote, so people probably didn’t see them. Sending again.

  1. Measure the performance of copying a ByteBuffer that doesn’t come from mapping data. Just allocate a standard ByteBuffer of the proper size, then copy it into another ByteBuffer of the proper size. Yes I did that test and the copy is only 3ms. 7-8x faster.

  2. Write a C or C++ application that does all of the OpenGL work, maps the buffer, and measure how long it takes to copy from the mapped pointer into CPU-allocated memory… No noticeable difference in the frame rate but I didn’t explicitly measure the memcpy itself.

Thanks.

Huj