Strange delay when executing glTexImage2D with vsync on



mellow3
11-20-2013, 04:17 AM
I'm writing video software that uses Qt's QGLWidget to show the frames with the following code:



glTexImage2D(GL_TEXTURE_2D, 0, (color ? GL_RGB8 : GL_LUMINANCE8), VIDEO_WIDTH, VIDEO_HEIGHT, 0, (color ? GL_RGB : GL_LUMINANCE), GL_UNSIGNED_BYTE, (GLubyte*)imBuf);

glBegin(GL_QUADS);
glTexCoord2d(0.0,0.0); glVertex2d(-1.0,+1.0);
glTexCoord2d(1.0,0.0); glVertex2d(+1.0,+1.0);
glTexCoord2d(1.0,1.0); glVertex2d(+1.0,-1.0);
glTexCoord2d(0.0,1.0); glVertex2d(-1.0,-1.0);
glEnd();

updateGL();


I want to use vsync, so I set a swapInterval of 1 and measured the time it takes to execute the code above. As expected, updateGL takes about 16 ms. What puzzles me is that glTexImage2D also takes 1..16 ms to execute, as if it were waiting for the VBLANK signal as well. When I turn vsync off with a swapInterval of 0, glTexImage2D only takes about 1 ms. So instead of the 16 ms of delay that vsync is supposed to add to the whole program, I get 32 ms at worst. I don't understand why updateGL and glTexImage2D both wait for VBLANK. I want as little delay as possible, so could someone explain what is going on here?

Dark Photon
11-21-2013, 06:46 AM
What puzzles me is that glTexImage2D also takes 1..16 ms to execute, as if it were waiting for the VBLANK signal as well.

How are you timing? You can't just snap CPU timers around the GL calls (which often execute on the GPU asynchronously) and say that it takes that long. Think of GL calls as "bookmarkToDoListItem()", not "executeToDoListItem()". glFinish() can sometimes help, but not in this case. When you give GL texture data, it just saves it off (often in CPU memory). It doesn't necessarily upload the data to the GPU until you first render with it (or do something that demands it exist on the GPU).

mhagain
11-21-2013, 09:59 AM
If your values of color, VIDEO_WIDTH and VIDEO_HEIGHT don't change you could get more mileage out of specifying the texture just once during startup, then using glTexSubImage2D instead. If they do need to change, you can use a similar pattern but only make a glTexImage2D call when they change, otherwise use glTexSubImage2D.
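A minimal sketch of that allocate-once, update-afterwards decision, kept free of GL calls so it compiles on its own. The names `TextureSpec`, `UploadKind` and `chooseUpload` are hypothetical and only illustrate the pattern; the comments mark where glTexImage2D and glTexSubImage2D would go.

```cpp
#include <cassert>

// Hypothetical record of the texture storage that is currently allocated.
struct TextureSpec {
    int width = 0;
    int height = 0;
    bool color = false;
    bool allocated = false;
};

enum class UploadKind { Allocate, Update };

// Decide which upload path a new frame needs and remember the new spec.
//   Allocate: call glTexImage2D, which (re)specifies the texture storage.
//   Update:   call glTexSubImage2D, which reuses the existing storage.
inline UploadKind chooseUpload(TextureSpec& current, int width, int height, bool color) {
    assert(width > 0 && height > 0);
    const bool unchanged = current.allocated &&
                           current.width == width &&
                           current.height == height &&
                           current.color == color;
    if (unchanged) {
        return UploadKind::Update;
    }
    current.width = width;
    current.height = height;
    current.color = color;
    current.allocated = true;
    return UploadKind::Allocate;
}
```

With fixed VIDEO_WIDTH, VIDEO_HEIGHT and color, only the first frame would take the Allocate path; every later frame would take Update.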

Additionally, your format and type parameters to glTexImage2D are putting you into what is possibly the worst-case scenario for updating textures dynamically. GL_RGB and GL_UNSIGNED_BYTE might seem preferable on the surface, as they require you to allocate less memory for your video image, but (and this holds true even if using GL_RGB8 for your internalFormat) there's normally no such thing as a 24-bit texture on the GPU. An internalFormat of GL_RGB8 is not a 24-bit texture, it's a 32-bit texture with the extra 8 bits unused.

So what your driver needs to do when you upload a GL_RGB/GL_UNSIGNED_BYTE texture is copy off the data, expand it out to 32-bit, probably swap the components to BGRA, then do the upload. As you can imagine that's going to be considerably slower than just being able to do a direct upload from the data you supply.
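To give a feel for that extra work, here is a rough CPU-side sketch of the expand-and-swizzle step. It is not what any particular driver actually runs; `rgbToBgra` is a hypothetical helper, and it assumes tightly packed rows with no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand tightly packed 24-bit RGB pixels to 32-bit BGRA with opaque alpha.
// The output byte order in memory is B, G, R, A, which matches
// GL_BGRA with GL_UNSIGNED_BYTE (and with GL_UNSIGNED_INT_8_8_8_8_REV
// on a little-endian machine).
std::vector<std::uint8_t> rgbToBgra(const std::vector<std::uint8_t>& rgb) {
    assert(rgb.size() % 3 == 0);  // whole pixels only
    const std::size_t pixels = rgb.size() / 3;
    std::vector<std::uint8_t> bgra(pixels * 4);
    for (std::size_t i = 0; i < pixels; ++i) {
        bgra[4 * i + 0] = rgb[3 * i + 2];  // blue
        bgra[4 * i + 1] = rgb[3 * i + 1];  // green
        bgra[4 * i + 2] = rgb[3 * i + 0];  // red
        bgra[4 * i + 3] = 0xFF;            // alpha, unused by the video image
    }
    return bgra;
}
```

Every pixel is touched and copied once, on every frame, before any upload can start. Producing the frame as 32-bit data at the source avoids this pass entirely.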

Far better to have your source data in 32-bit format to begin with. I've found that GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV is consistently the fastest combination on the widest range of hardware. GL_UNSIGNED_BYTE instead of GL_UNSIGNED_INT_8_8_8_8_REV may give equivalent performance on some drivers, but (last time I benchmarked) not on Intel.

This is considerably faster, not just a little bit. I've personally benchmarked it as up to 40 times faster. So if you've a texture upload that's currently taking 16ms, you could get it down to 0.4ms just by making this change.

On the OpenGL wiki, you really should read the "Common mistakes (http://www.opengl.org/wiki/Common_Mistakes)" page, where you'll find your current usage listed as a common mistake. Relevant sections are "Texture uploads and pixel reads (http://www.opengl.org/wiki/Common_Mistakes#Texture_upload_and_pixel_reads)", "Image precision (http://www.opengl.org/wiki/Common_Mistakes#Image_precision)" and "Slow pixel transfer performance (http://www.opengl.org/wiki/Common_Mistakes#Slow_pixel_transfer_performance)".

mellow3
11-22-2013, 05:08 AM
I'm using a timer. What you say is true, but with a timer I can see that glTexImage2D blocks the rendering thread for a varying time of 1ms to 16ms when using vsync. When not using vsync this does not happen.


I have tried all the tricks above, but I don't think the problem here is how expensive the glTexImage2D/glTexSubImage2D call is. The problem is that when I'm using vsync, glTexImage2D/glTexSubImage2D blocks the thread until it gets a VBLANK signal. The same thing happens with updateGL. If I have understood correctly, this should only happen with updateGL. I don't understand why glTexImage2D is affected by vsync.

Dark Photon
11-22-2013, 12:33 PM
I have tried all the tricks above, but I don't think the problem here is how expensive the glTexImage2D/glTexSubImage2D call is. The problem is that when I'm using vsync, glTexImage2D/glTexSubImage2D blocks the thread until it gets a VBLANK signal.

Oh, I think I know.

Put a glFinish after your SwapBuffers call. Now time your glTexImage2D call execution time on the CPU. Do you get a different result?

The issue is that SwapBuffers is not a blocking call. Like other GL calls, the driver queues it up and then happily starts queuing up your GL commands for the "next" frame. It only blocks at some point when the driver decides the FIFO has gotten full or that it's reading too far ahead, so the block lands in some arbitrary GL call while you are queuing up the render calls for your frame.

This is often not what you want when you're running sync-to-vblank. You want the CPU to wait on the end-of-frame time before it goes processing the next frame. glFinish after SwapBuffers does that. It says to the driver, do not return until everything I've given you to do thus far is done.

mellow3
11-23-2013, 06:38 AM
Yeah, you must be right, because after adding glFinish, glTexImage2D doesn't block the thread anymore. But now glFinish takes 16 ms and the updateGL call right before it also takes 16 ms, making the total delay 32 ms. In my original code there was at least a varying delay, with jitter of 16..32 ms. Also, for some reason the total lag from video camera to computer display has now increased to about a second, when it was only about 80 ms before. Well, I guess there is no way of getting a lower delay than 16..32 ms with vsync on. It is strange, though, because I thought a swapInterval of 1 should only increase the delay by about 0..16 ms, a swapInterval of 2 by 16..32 ms, and so on.

Dark Photon
11-24-2013, 05:57 PM
But now glFinish takes 16 ms and the updateGL call right before it also takes 16 ms making the total delay 32 ms.

What this sounds like is you have a glFinish inside of updateGL(). Let's see the contents of updateGL().

And to ensure it's really updateGL() that's expensive, do this for testing only:



glFinish()
start timer
updateGL()
stop timer
print elapsed
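The same measurement written as a small reusable helper, as a sketch only. `averageMsAfterDrain` is a hypothetical name, and the two callables stand in for the real calls: in this thread `drain` would wrap glFinish() and `work` would wrap updateGL().

```cpp
#include <cassert>
#include <chrono>

// Average wall-clock time of `work`, in milliseconds, over `runs` iterations.
// `drain` is called before each measurement so that commands queued earlier
// are finished first and are not charged to `work`.
template <class Drain, class Work>
double averageMsAfterDrain(Drain drain, Work work, int runs) {
    assert(runs > 0);
    using Clock = std::chrono::steady_clock;
    std::chrono::duration<double, std::milli> total(0.0);
    for (int i = 0; i < runs; ++i) {
        drain();                                 // e.g. glFinish()
        const Clock::time_point start = Clock::now();
        work();                                  // e.g. updateGL()
        total += Clock::now() - start;
    }
    return total.count() / runs;
}
```

Averaging over several frames smooths out the one-off cost of the first frame. Like the pseudocode above, this is for testing only, since draining the queue every frame stalls the pipeline.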



It is strange though because I thought a swapInterval of 1 should only increase the delay by about 0..16 ms,

Correct. Actually 60Hz (what most LCD monitors nowadays want) yields a frame time of 16.666ms (1/60 s). So yes, depending on how much time the rest of your frame processing costs, it'll consume 0..16.666ms more time: whatever it takes to get to the next 60Hz boundary.

To time the rest of your frame, do a swapInterval(0), and see what the total frame time is. If it's less than 16.666ms, then swapInterval(1) should bring you right up to 16.666ms for the total frame time.
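That arithmetic can be written down as an idealized model, assuming the swap happens exactly on a vblank boundary and ignoring driver queuing. `vblankPeriodMs` and `totalFrameMs` are hypothetical helper names, not part of any GL or Qt API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Time between two vblanks, in milliseconds, for a given refresh rate.
double vblankPeriodMs(double refreshHz) {
    assert(refreshHz > 0.0);
    return 1000.0 / refreshHz;
}

// Idealized total frame time: the frame's own work, rounded up to the next
// vblank boundary, but never shorter than `swapInterval` vblank periods.
double totalFrameMs(double workMs, double refreshHz, int swapInterval) {
    assert(workMs >= 0.0 && swapInterval >= 1);
    const double period = vblankPeriodMs(refreshHz);
    const double periods = std::max(static_cast<double>(swapInterval),
                                    std::ceil(workMs / period));
    return periods * period;
}
```

Under this model, 2 ms of work at 60Hz with a swap interval of 1 gives a total of about 16.7 ms. A total of about 33 ms for the same work is what the model predicts for a swap interval of 2, for a 30Hz refresh, or for two swaps per frame.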

mhagain
11-25-2013, 12:11 AM
It just occurred to me - are you using a Sleep call anywhere inside your main loop?

mellow3
11-26-2013, 07:45 AM
What this sounds like is you have a glFinish inside of updateGL(). Let's see the contents of updateGL().


updateGL() is a Qt function. I'm using a QGLWidget in my program. I don't think there's a glFinish() call in there. The Qt documentation just says "Updates the widget by calling glDraw().". glDraw in turn calls paintGL. updateGL also seems to call the swapBuffers function. Isn't it supposed to block here? The buffers cannot be swapped until it gets a vsync signal.



And to ensure it's really updateGL() that's expensive, do this for testing only:



glFinish()
start timer
updateGL()
stop timer
print elapsed



This gives me 16 ms.



Correct. Actually 60Hz (what most monitor LCDs nowadays want) yields a frame time of 16.666ms (== 1/60). So yes, depending on how much time the rest of your frame processing costs, it'll consume 0..16.666ms more time -- whatever it takes to get to the next 60Hz boundary.

To time the rest of your frame, do a swapInterval(0), and see what the total frame time is. If it's milliseconds less than 16.666ms, then swapInterval(1) should bring you right up to 16.666ms for the total frame time.




start timer;

glTexImage2D(GL_TEXTURE_2D, 0, (color ? GL_RGB8 : GL_LUMINANCE8), VIDEO_WIDTH, VIDEO_HEIGHT, 0, (color ? GL_RGB : GL_LUMINANCE), GL_UNSIGNED_BYTE, (GLubyte*)imBuf);

glBegin(GL_QUADS);
glTexCoord2d(0.0,0.0); glVertex2d(-1.0,+1.0);
glTexCoord2d(1.0,0.0); glVertex2d(+1.0,+1.0);
glTexCoord2d(1.0,1.0); glVertex2d(+1.0,-1.0);
glTexCoord2d(0.0,1.0); glVertex2d(-1.0,-1.0);
glEnd();

glFinish();
updateGL();

print elapsed;


With swapInterval(0) the above code gives me 1..3ms. When I set swapInterval(1) it gives me 33 ms. If I remove glFinish and keep swapInterval(1), I get about 14..31ms.


It just occurred to me - are you using a Sleep call anywhere inside your main loop?

I'm not using any sleep calls. In the loop requesting a frame from the video camera, the function blocks the thread until it gets a frame from the camera. This happens in a different thread though.

Dark Photon
11-27-2013, 09:30 AM
updateGL() is a Qt function. I'm using a QGLWidget in my program. I don't think there's a glFinish() call in there. The Qt documentation just says "Updates the widget by calling glDraw().". glDraw in turn calls paintGL. updateGL also seems to call the swapBuffers function. Isn't it supposed to block here? The buffers cannot be swapped until it gets a vsync signal.

...

This gives me 16 ms.

Sounds like updateGL() is doing a swap and finish.

Random websearch hit:

* http://qt-project.org/doc/qt-4.8/QGLWidget.html

Read the "buffer swap" references in here. Apparently you can use setAutoBufferSwap() to control whether an internal swap is done, and one is done by default. In any case, you only need one swap per double-buffered drawable.


With swapInterval(0) the above code gives me 1..3ms. When I set swapInterval(1) it gives me 33 ms. If I remove glFinish and keep swapInterval(1), I get about 14..31ms.

Yep, that's consistent with two swaps being in there.

mellow3
11-29-2013, 06:17 AM
Sounds like updateGL() is doing a swap and finish.

Random websearch hit:

* http://qt-project.org/doc/qt-4.8/QGLWidget.html

Read the "buffer swap" references in here. Apparently you can use setAutoBufferSwap() to control whether an internal swap is done, and one is done by default. In any case, you only need one swap per double-buffered drawable.



Yep, that's consistent with two swaps being in there.

But if I remove the buffer swap from updateGL, then where is the buffer supposed to be swapped? If I call setAutoBufferSwap(false), the timed code takes only 1 ms, but then I don't get any picture on the widget because the buffers aren't swapped anywhere. You think that there are two swaps in my code? Then there is no other explanation than updateGL itself calling swapBuffers twice, because I don't have any buffer swap calls in my code other than updateGL.

Dark Photon
11-29-2013, 07:18 PM
You think that there are two swaps in my code? Then there is no other explanation than updateGL itself calling swapBuffers twice...

There are other possibilities, but they are less likely: 1) you could have a call to wgl/glXSwapInterval(2) in your code, or 2) your monitor/GPU could be negotiating a 30Hz refresh rather than a 60Hz refresh.