glClientWaitSync always times out, glMapBufferRange always stalls

Background

I completely fail to understand what’s wrong with my attempts at asynchronous frame downloads from an FBO. I’m trying to get a high FPS in fullscreen rendering with OpenGL.

To be specific, this is OpenGL ES 3.0 and I use the GL functions from QOpenGLExtraFunctions (Qt framework), but I don’t think that context matters here.

I have a background OpenGL rendering thread which, for now, draws nothing and just reads frames from the FBO without pause.

My screen resolution is 1920×1080 pixels, so the FBO has the same size.

I realised that a plain glReadPixels is too slow to transfer frames this big over the PCIe bus from the NVIDIA video card to RAM: I get about 55 FPS, but I want 60 FPS.

Then I learned about PBOs and got the idea that I can copy frames from the FBO into PBOs: create a buffer with GL_PIXEL_PACK_BUFFER, bind it and call glReadPixels, which in this case copies the pixels into the PBO in video memory rather than into storage in RAM, and returns immediately because a GL_PIXEL_PACK_BUFFER is bound. Then I transfer them asynchronously to my storage in RAM after calling glMapBuffer. And while the latest frame is still being written to RAM, I map and draw the previous frame, which (we hope) has already been completely transferred.
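For reference, my PBO setup looks roughly like this (a simplified sketch; pbo, width and height are the same names used in the snippets below):

// Two PBOs, each sized for one 1920x1080 BGRA frame (4 bytes per pixel).
GLuint pbo[2];
glGenBuffers(2, pbo);
for (int i = 0; i < 2; ++i)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
    // Allocate storage only (no initial data); GL_STREAM_READ hints that
    // GL writes the buffer and the application reads it back.
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);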

I also read about shared contexts for multithreading, but as I understand it, the best solution for performance is one thread with one context doing asynchronous downloads/uploads, so I’ve set shared contexts aside.

glMapBufferRange stalling issue

So, as a simple example, I have two buffers. Here is what I observed next:


glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[0]);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); // ~50 microsecs

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1]);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); // ~50 microsecs
    
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[0]);
glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, width * height * 4, GL_MAP_READ_BIT); // ~20000 microsecs
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); // ~50 microsecs
    
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1]);
glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, width * height * 4, GL_MAP_READ_BIT); // ~15 microsecs
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); // ~50 microsecs

// continuing to map and glReadPixels into pbo[0] and pbo[1] gives the same call durations

As you can see, mapping the first PBO stalls the CPU for ~20 ms, but mapping the second PBO is practically a no-op.

But I need the time between PBO mappings to stay roughly the same.

The way I understand it, mapping the first buffer causes a synchronization in which OpenGL finishes the glReadPixels into both the 1st and the 2nd PBO before returning: I’m trying to map a PBO that is still in use by another GL command (glReadPixels), but instead of waiting only for the 1st glReadPixels to finish, GL just flushes all the already-queued commands, including the 2nd glReadPixels.

But! When I place std::this_thread::sleep_for(10ms) before every glMapBufferRange, I get the same durations: even when my CPU thread has waited long enough before the call, glMapBufferRange for the 1st PBO still takes 20 ms! That’s why the title says “glMapBufferRange always stalls”.

Beyond that, I have no idea what’s happening. Did I understand this right?

glClientWaitSync timeout issues

Then I learned about OpenGL synchronization objects: a fence is inserted into the GL command queue, and when GL processes it and it becomes signalled, that means all commands queued before it have been processed.

So I wanted to insert glFenceSync just after glMapBufferRange and glClientWaitSync just before glMapBufferRange (or after/before glReadPixels), to make my frames update evenly. But I haven’t even tried that yet, because my sync objects just don’t work properly.

Now i try to execute just this:


GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
while (true)
{
    // The third argument is the timeout in nanoseconds (1000 ns = 1 µs).
    GLenum syncRes = glClientWaitSync(fence, 0, 1000);
    switch (syncRes)
    {
        case GL_ALREADY_SIGNALED: qDebug() << "ALREADY"; break;
        case GL_CONDITION_SATISFIED: qDebug() << "EXECUTED"; break;
        case GL_TIMEOUT_EXPIRED: qDebug() << "TIMEOUT"; break;
        case GL_WAIT_FAILED: qDebug() << "FAIL"; break;
    }
    if (syncRes == GL_CONDITION_SATISFIED || syncRes == GL_ALREADY_SIGNALED) break;
}
glDeleteSync(fence);

This loop runs forever and always prints “TIMEOUT”, so as I understand it, GL just never processes this sync fence, even though I’ve inserted it into the command queue.

So what’s wrong with my use of sync fences?

Any function which makes data visible to the client (CPU-side code) must wait until any GL commands which affect that data have completed. And in order for that to happen, it must perform an implicit glFlush() (those commands won’t finish if they haven’t even started).

Try putting an explicit glFlush() before the sleep_for(). As it stands, GL probably isn’t even sending any of your commands to the GPU until you call glMapBufferRange().

I suggest adding [var]GL_SYNC_FLUSH_COMMANDS_BIT[/var] to the flags parameter.
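For example (a sketch, using the timeout from your snippet), the wait would become something like:

GLenum syncRes = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000);
// With the flush bit set, the wait first submits any queued commands,
// so the fence can actually be reached and signalled.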

[QUOTE=GClements;1290197]Try putting an explicit glFlush() before the sleep_for(). … I suggest adding [var]GL_SYNC_FLUSH_COMMANDS_BIT[/var] to the flags parameter.[/QUOTE]

Oh, big thanks to you! I really didn’t realise that I have to force commands to be pushed to the GPU with glFlush().

And I totally missed GL_SYNC_FLUSH_COMMANDS_BIT. Now everything is working as I expected.
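For anyone who finds this thread later, here is roughly what my working loop looks like now (a sketch only; pbo, width and height are the names from my snippets above, while rendering and processFrame() are hypothetical placeholders for the loop condition and the frame consumer):

// Double-buffered readback: fence each glReadPixels, wait before mapping.
GLsync fences[2] = { nullptr, nullptr };
int index = 0;

while (rendering) // 'rendering' is a placeholder loop condition
{
    // Queue an asynchronous readback into the current PBO and fence it.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[index]);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0);
    fences[index] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    // Map the other PBO, whose readback was queued on the previous iteration.
    int prev = 1 - index;
    if (fences[prev])
    {
        // The flush bit submits the queued commands so the fence can signal;
        // the timeout is 1 second, given in nanoseconds.
        glClientWaitSync(fences[prev], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000);
        glDeleteSync(fences[prev]);
        fences[prev] = nullptr;

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[prev]);
        void* data = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, width * height * 4, GL_MAP_READ_BIT);
        processFrame(data); // hypothetical consumer of the mapped frame
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }

    index = prev;
}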