
View Full Version : glTexSubImage2d taking a lot of time



advorak
10-20-2015, 04:27 PM
I have this code segment in a separate thread than the main:


if( DEBUG )
{
    /* Grab system time */
    clock_gettime(CLOCK_REALTIME, &hack_time);
    fprintf( stderr, "poll returned %f\n", ((float)hack_time.tv_nsec/1000000000.0) );
}

/* Check all requested fds for returned data */
for( request = 0; request < Inputs; request++ )
{
    /* Check this requested fd for data */
    if( fds[request].revents & (POLLIN | POLLPRI) )
    {
        /* Default to not found */
        Found = 0;

        /* Determine which device gave us data */
        for( idx = 0; idx < Total_Inputs; idx++ )
        {
            if( fds[request].fd == Capture_State.Inputs[Vid_Used_Index[idx]].fd )
            {
                Found = 1;
                break;
            }
        }

        if( !Found )
        {
            fprintf( stderr, "Could not determine which input yielded data. Request index = %d\n", request );
            continue;
        }

        /* de-queue the buffer */
        if((ret_val = ioctl( fds[request].fd, VIDIOC_DQBUF, &buf[Vid_Used_Index[idx]] )) < 0 )
        {
            switch( errno )
            {
                case EAGAIN:
                {
                    continue;
                }
                case EIO:
                default:
                {
                    fprintf( stderr, "Failed to de-queue buffer[%d]: %d\n", buf[Vid_Used_Index[idx]].index, ret_val);
                    Return_Value = EXIT_FAILURE;
                    goto realtime_failure;
                }
            }
        }

        /* Use buffer to update texture on all outputs that use it */
        for( Output = Tracker; Output < Output_Count; Output = (Outputs)((int)Output + 1) )
        {
            /* Check if this output used this input */
            if( Input_State[Vid_Used_Index[idx]].Used & (1 << Output) )
            {
                /* Make output context current */
                glXMakeCurrent( X.display, X.window[Output], X.context[Output][CAPTURE_CONTEXT] );

                /* Update the texture with new image */
                glBindTexture( GL_TEXTURE_2D, Capture_Texture[Output][Vid_Used_Index[idx]] );
                glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, SOURCE_WIDTH, SOURCE_HEIGHT, GL_RGB, GL_UNSIGNED_BYTE,
                                 Capture_State.Frame_Buffers[Vid_Used_Index[idx]][buf[Vid_Used_Index[idx]].index].pData );

                /* Insert fence into stream */
                Texture_Sync[Output][Vid_Used_Index[idx]] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );

                if( DEBUG )
                {
                    /* Grab system time */
                    clock_gettime(CLOCK_REALTIME, &hack_time);
                    fprintf( stderr, "Fence[%s][%s] Created %f\n", Output_Strings[Output], VID_Strings[Vid_Used_Index[idx]], ((float)hack_time.tv_nsec/1000000000.0) );
                }

                /* clear the input request bit for this output */
                Input_State[Vid_Used_Index[idx]].Used &= ~(1 << Output);
            }
        } /* for( Output ... ) */
    } /* if( fds[request].revents & (POLLIN | POLLPRI) ) */
} /* for( request = 0; request < Inputs; request++ ) */


Here is the output log:


poll returned 0.003716
Fence[Tracker][EOW] Created 0.006632
Fence[Operator][EOW] Created 0.009841

Frame Top Draw 0.010289
Fence[DGU][EOW] Created 0.013028


Why are these gl calls taking 3+ milliseconds each? Each of the texture buffers is 6220800 bytes long. I'm creating the fence objects so I know when the data is done moving, but the other end always gets GL_ALREADY_SIGNALED.

I thought the glTexSubImage would return before the data was copied.

mhagain
10-21-2015, 12:41 AM
Because the pointer passed to glTexSubImage belongs to your program, and because your program can free it immediately after the call, glTexSubImage must therefore copy the data somewhere before it can return.

That copy, however, is not the real reason things are so slow. The real reason is that you're using GL_RGB/GL_UNSIGNED_BYTE, which almost certainly guarantees that the driver must do a software format conversion before it can load the new data into your texture. Please see https://www.opengl.org/wiki/Common_Mistakes#Texture_upload_and_pixel_reads and https://www.opengl.org/wiki/Common_Mistakes#Image_precision for more information.
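To illustrate, a conversion-free upload path generally means allocating the texture with a sized internal format, then updating it with a format/type pair that matches the driver's internal layout. The exact fast path is hardware-dependent, so treat this as a sketch (width, height and pixels are placeholders):

```c
/* Allocate once with a sized internal format... */
glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
              GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, NULL );

/* ...then per-frame updates use the same matching format/type,
   so the driver can memcpy/DMA instead of converting per pixel */
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, width, height,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels );
```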

advorak
10-21-2015, 07:08 AM
I will change the input and texture type to RGBA forthwith. Wish me luck.

advorak
10-21-2015, 08:39 AM
I think I am getting an error: glTexSubImage2D is returning almost immediately. Since it has a void return, how do I get the error it may have generated?





Never mind the error. I found it.
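For anyone searching later: since the gl entry points return void, errors are picked up by draining the error queue after the call (more than one error flag can be pending):

```c
GLenum err;
while( (err = glGetError()) != GL_NO_ERROR )
{
    fprintf( stderr, "GL error after glTexSubImage2D: 0x%x\n", err );
}
```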




Unfortunately, the time for glTexSubImage2D is now even greater: over 4 milliseconds per image.

On the bus this works out to just over 2 GB/s, which is only about 15% of the theoretical limit of 15 GB/s.

mhagain
10-21-2015, 09:31 AM
GL_RGBA can be just as slow as GL_RGB, because the driver likely still needs to do a software format conversion. All you've achieved is to increase the amount of data you're sending over the bus. Try GL_BGRA for the format and see what performance is like; on some hardware you may also need GL_UNSIGNED_INT_8_8_8_8_REV instead of GL_UNSIGNED_BYTE for the type to get maximum performance.

If your data is natively coming in as 24-bit (3 component) RGB order, it will likely be faster to expand it to 32-bit BGRA order in code yourself than it would be to rely on the driver to do it.
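A scalar version of that expansion is trivial; something along these lines (a sketch only — a tuned version would process whole rows and could use SIMD shuffles):

```c
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed 24-bit RGB into 32-bit BGRA, forcing alpha to 0xFF. */
static void rgb_to_bgra(const uint8_t *rgb, uint8_t *bgra, size_t pixels)
{
    for( size_t i = 0; i < pixels; i++ )
    {
        bgra[4*i + 0] = rgb[3*i + 2]; /* B */
        bgra[4*i + 1] = rgb[3*i + 1]; /* G */
        bgra[4*i + 2] = rgb[3*i + 0]; /* R */
        bgra[4*i + 3] = 0xFF;         /* A */
    }
}
```

The resulting buffer can then be uploaded with GL_BGRA/GL_UNSIGNED_BYTE without the driver having to repack anything.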

advorak
10-21-2015, 09:49 AM
I modified the init code to:



/* Bind the texture */
glBindTexture( GL_TEXTURE_2D, Capture_Texture[Output][Input] );

/* set the texture parameters */
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_BORDER );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_BORDER );

/* Set the texture filtering */
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );

/* create and initialize texture */
glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, SOURCE_WIDTH, SOURCE_HEIGHT, 0, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, Cap_Blank );

/* Un-bind the texture */
glBindTexture( GL_TEXTURE_2D, 0 );


and the capture thread looks like this:


if( DEBUG )
{
    /* Grab system time */
    clock_gettime(CLOCK_REALTIME, &hack_time);
    fprintf( stderr, "poll returned %f\n", ((float)hack_time.tv_nsec/1000000000.0) );
}

/* Check all requested fds for returned data */
for( request = 0; request < Inputs; request++ )
{
    /* Check this requested fd for data */
    if( fds[request].revents & (POLLIN | POLLPRI) )
    {
        /* Default to not found */
        Found = 0;

        /* Determine which device gave us data */
        for( idx = 0; idx < Total_Inputs; idx++ )
        {
            Device = Vid_Used_Index[idx];
            if( fds[request].fd == Capture_State.Inputs[Device].fd )
            {
                Found = 1;
                break;
            }
        }

        if( !Found )
        {
            fprintf( stderr, "Could not determine which input yielded data. Request index = %d\n", request );
            continue;
        }

        /* de-queue the buffer */
        if((ret_val = ioctl( fds[request].fd, VIDIOC_DQBUF, &buf[Device] )) < 0 )
        {
            switch( errno )
            {
                case EAGAIN:
                {
                    continue;
                }
                case EIO:
                default:
                {
                    fprintf( stderr, "Failed to de-queue buffer[%d]: %d\n", buf[Device].index, ret_val);
                    Return_Value = EXIT_FAILURE;
                    goto realtime_failure;
                }
            }
        }

        /* Use buffer to update texture on all outputs that use it */
        for( Output = Tracker; Output < Output_Count; Output = (Outputs)((int)Output + 1) )
        {
            /* Check if this output used this input */
            if( Input_State[Device].Used & (1 << Output) )
            {
                /* Make output context current */
                glXMakeCurrent( X.display, X.window[Output], X.context[Output][CAPTURE_CONTEXT] );

                /* Update the texture with new image */
                glBindTexture( GL_TEXTURE_2D, Capture_Texture[Output][Device] );
                glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, SOURCE_WIDTH, SOURCE_HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE,
                                 Capture_State.Frame_Buffers[Device][buf[Device].index].pData );

                /* Check for errors, since glTexSubImage2D has no return value */
                glerror = GL_NO_ERROR;
                if((glerror = glGetError()) != GL_NO_ERROR)
                {
                    fprintf( stderr, "glGetError returned 0x%x\n", glerror );
                }

                /* Insert fence into stream */
                Texture_Sync[Output][Device] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );

                if( DEBUG )
                {
                    /* Grab system time */
                    clock_gettime(CLOCK_REALTIME, &hack_time);
                    fprintf( stderr, "Fence[%s][%s] Created %f\n", Output_Strings[Output], VID_Strings[Device], ((float)hack_time.tv_nsec/1000000000.0) );
                }

                /* clear the input request bit for this output */
                Input_State[Device].Used &= ~(1 << Output);
            }
        }
    } /* if( fds[request].revents & (POLLIN | POLLPRI) ) */
    else
    {
        fprintf( stderr, "fds[%d] had no data.\n", request );
    }
} /* for( request = 0; request < Inputs; request++ ) */


The output looks like this:



poll returned 0.501397
Fence[Tracker][EOW] Created 0.507845
Fence[Operator][EOW] Created 0.512167
Fence[DGU][EOW] Created 0.516110


I will attempt the swizzle myself and only read RGB from the capture card.

It may not be obvious, but I have the buffers mmap'ed from the capture card. I was trying to avoid the copy from capture card to RAM, then the copy to driver space, and finally the copy to the GPU.

advorak
10-21-2015, 10:32 AM
I was just looking at share lists. While I do not create any display lists (my displays are mostly text and dynamic lines that differ across displays), could I share the texture across the three display contexts? That way I would only load the EOW texture once.



Never mind. The displays do not share a screen, so I cannot share contexts.

advorak
10-21-2015, 11:53 AM
I decided to try reading the buffer from the capture card myself with a memcpy. The memcpy completes in less than 1 millisecond.

I don't know what else to try.

advorak
10-21-2015, 02:40 PM
I don't understand the difference, but implementing a PBO for each of the output channels allows me to copy data from the capture card to the PBO via a memcpy in ~1.5 ms and the subsequent glTexSubImage2D returns almost immediately. The textures draw correctly, so I don't care.

My system is not like the example code given in the document http://www.nvidia.com/docs/IO/40049/Dual_copy_engines.pdf

I do have a Quadro card, but my textures come from a capture card, so I cannot get new data any faster than one frame every 16.666 ms. Because of this, I did not implement the ping/pong mode of the Pixel Buffer Objects (PBOs): there is never a point where one buffer is being filled while the other is still in use. Instead, I just implemented fence sync objects to be sure the fill and use threads stayed separate.

The fill thread is not woken up until at least 5 ms after the main render function runs.
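For completeness, the ping/pong variant I decided against would look roughly like this (hypothetical names: PBO_Pair, FRAME_BYTES and src are not in my code; it assumes two PBOs per output were created at init):

```c
/* While the GPU still reads from the PBO filled last frame,
   the CPU fills the other one; the indices swap every frame. */
static unsigned front = 0;

/* Fill the front PBO with the new frame */
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, PBO_Pair[Output][front] );
GLubyte *dst = (GLubyte *)glMapBuffer( GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY );
memcpy( dst, src, FRAME_BYTES );
glUnmapBuffer( GL_PIXEL_UNPACK_BUFFER );

/* Kick the texture upload from the PBO filled on the previous frame */
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, PBO_Pair[Output][1 - front] );
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, SOURCE_WIDTH, SOURCE_HEIGHT,
                 GL_BGRA, GL_UNSIGNED_BYTE, 0 );

front = 1 - front;
```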

My initialization code is:



/* Generate the textures */
glGenTextures( VID_Count, Capture_Texture[Output] );

/* create the PBO once per output (generating it inside the input loop
   below would re-create and leak a buffer object for every input) */
glGenBuffers( 1, &PBO[Output] );

/* bind the pbo */
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, PBO[Output] );

/* allocate the pbo storage */
glBufferData( GL_PIXEL_UNPACK_BUFFER, SOURCE_WIDTH*SOURCE_HEIGHT*sizeof(GLubyte)*4, 0, GL_STREAM_DRAW );

/* un-bind the pbo */
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, 0 );

/* Loop across all the inputs */
for( Input = 0; Input < VID_Count; Input++ )
{
    /* specify texture1 */
    glActiveTexture( GL_TEXTURE1 );

    /* Bind the texture */
    glBindTexture( GL_TEXTURE_2D, Capture_Texture[Output][Input] );

    /* set the texture parameters */
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_BORDER );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_BORDER );

    /* Set the texture filtering */
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );

    /* create and initialize texture with a sized internal format */
    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, SOURCE_WIDTH, SOURCE_HEIGHT, 0, GL_BGRA, GL_UNSIGNED_BYTE, Cap_Blank );

    /* Un-bind the texture */
    glBindTexture( GL_TEXTURE_2D, 0 );
}


The capture thread code looks like this:



/* Loop until all inputs retrieved */
while( Inputs )
{
    /* check if the system returned data to any of the fds */
    if((ret_cnt = poll( fds, Inputs, 20 )) > 0 )
    {
        if( DEBUG )
        {
            /* Grab system time */
            clock_gettime(CLOCK_REALTIME, &hack_time);
            fprintf( stderr, "poll returned %d inputs at %f\n", ret_cnt, ((float)hack_time.tv_nsec/1000000000.0) );
        }

        /* Check all requested fds for returned data */
        for( request = 0; request < Inputs; request++ )
        {
            /* Check this requested fd for data */
            if( fds[request].revents & (POLLIN | POLLPRI) )
            {
                /* Default to not found */
                Found = 0;

                /* Determine which device gave us data */
                for( idx = 0; idx < Total_Inputs; idx++ )
                {
                    Device = Vid_Used_Index[idx];
                    if( fds[request].fd == Capture_State.Inputs[Device].fd )
                    {
                        Found = 1;
                        break;
                    }
                }

                if( !Found )
                {
                    fprintf( stderr, "Could not determine which input yielded data. Request index = %d\n", request );
                    continue;
                }

                /* de-queue the buffer */
                if((ret_val = ioctl( fds[request].fd, VIDIOC_DQBUF, &buf[Device] )) < 0 )
                {
                    switch( errno )
                    {
                        case EAGAIN:
                        {
                            continue;
                        }
                        case EIO:
                        default:
                        {
                            fprintf( stderr, "Failed to de-queue buffer[%d]: %d\n", buf[Device].index, ret_val);
                            Return_Value = EXIT_FAILURE;
                            goto realtime_failure;
                        }
                    }
                }

                /* Use buffer to update texture on all outputs that use it */
                for( Output = Tracker; Output < Output_Count; Output = (Outputs)((int)Output + 1) )
                {
                    /* Check if this output used this input */
                    if( Input_State[Device].Used & (1 << Output) )
                    {
                        /* Make output context current */
                        glXMakeCurrent( X.display, X.window[Output], X.context[Output][CAPTURE_CONTEXT] );

                        /* set the texture unit */
                        glActiveTexture( GL_TEXTURE1 );

                        /* Update the texture with new image */
                        glBindTexture( GL_TEXTURE_2D, Capture_Texture[Output][Device] );

                        /* bind the PBO */
                        glBindBuffer( GL_PIXEL_UNPACK_BUFFER, PBO[Output] );

                        /* re-specify the pbo storage to orphan the old data and avoid a sync stall */
                        glBufferData( GL_PIXEL_UNPACK_BUFFER, SOURCE_WIDTH*SOURCE_HEIGHT*sizeof(GLubyte)*4, 0, GL_STREAM_DRAW );

                        /* Map the buffer data */
                        PBO_Memory = (GLubyte *)glMapBuffer( GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY );

                        assert( PBO_Memory );

                        if( DEBUG )
                        {
                            /* Grab system time */
                            clock_gettime(CLOCK_REALTIME, &hack_time);
                            fprintf( stderr, "before memcpy %f\n", ((float)hack_time.tv_nsec/1000000000.0) );
                        }

                        /* Copy one full frame (SOURCE_WIDTH * SOURCE_HEIGHT * 4 bytes) to the PBO */
                        memcpy( PBO_Memory, Capture_State.Frame_Buffers[Device][buf[Device].index].pData, 8294400 );

                        if( DEBUG )
                        {
                            /* Grab system time */
                            clock_gettime(CLOCK_REALTIME, &hack_time);
                            fprintf( stderr, "after memcpy %f\n", ((float)hack_time.tv_nsec/1000000000.0) );
                        }

                        /* Un-map the buffer data */
                        glUnmapBuffer( GL_PIXEL_UNPACK_BUFFER );

                        /* transfer PBO data to texture (last argument is an offset into the bound PBO) */
                        glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, SOURCE_WIDTH, SOURCE_HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, 0);

                        /* Check for errors, since glTexSubImage2D has no return value */
                        glerror = GL_NO_ERROR;
                        if((glerror = glGetError()) != GL_NO_ERROR)
                        {
                            fprintf( stderr, "glGetError returned 0x%x\n", glerror );
                        }

                        /* Insert fence into stream */
                        Texture_Sync[Output][Device] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );

                        if( DEBUG )
                        {
                            /* Grab system time */
                            clock_gettime(CLOCK_REALTIME, &hack_time);
                            fprintf( stderr, "Fence[%s][%s] Created %f\n", Output_Strings[Output], VID_Strings[Device], ((float)hack_time.tv_nsec/1000000000.0) );
                        }

                        /* clear the input request bit for this output */
                        Input_State[Device].Used &= ~(1 << Output);
                    }
                }
            } /* if( fds[request].revents & (POLLIN | POLLPRI) ) */
            else
            {
                fprintf( stderr, "fds[%d] had no data.\n", request );
            }
        } /* for( request = 0; request < Inputs; request++ ) */
    } /* if((ret_cnt = poll( fds, Inputs, 20 )) > 0 ) */
    else if( ret_cnt < 0 )
    {
        /* Handle the error */
        fprintf( stderr, "Poll produced errno %d.\n", errno );
        Return_Value = EXIT_FAILURE;
        goto realtime_failure;
    }
    else
    {
        /* Post message about the timeout */
        fprintf( stderr, "Poll of capture card timed out.\n" );
    }

    /* Reset input count to 0 */
    Inputs = 0;

    /* rebuild fds array with remaining inputs */
    for( Device = 0; Device < MAX_VIDEO_INPUTS; Device++ )
    {
        if( Input_State[Device].Used && Capture_State.Inputs[Device].Valid )
        {
            /* fill pollfd struct */
            fds[Inputs].fd = Capture_State.Inputs[Device].fd;
            fds[Inputs].events = POLLIN | POLLPRI;
            fds[Inputs].revents = 0;

            /* Increment input count */
            Inputs++;
        }
    }

    if( DEBUG )
    {
        /* Grab system time */
        clock_gettime(CLOCK_REALTIME, &hack_time);
        fprintf( stderr, "Recreated fds with %d inputs %f\n", Inputs, ((float)hack_time.tv_nsec/1000000000.0) );
    }
} /* while( Inputs ) */

/* unlock the done semaphore */
if( sem_post( &Capture_State.Done ) )
{
    fprintf( stderr, "Failed to unlock the Done semaphore: %d.\n", errno );
    Exit = 1;
    continue;
}


I apologize if I am posting too much code, but I have suffered in the past when others "weeded" their code for clarity and left out the important parts.