PBO: mapped memory and OpenMP

Hi All,

I ran into an interesting issue when working with mapped buffers and multithreaded writes to them, at least with the latest ATI drivers on Windows (other drivers still need to be verified). We have a multithreaded data producer (an unpacker) that writes to mapped memory. When the buffer is mapped with both read AND write access, the map/copy/unmap scope is 30x faster than when it is mapped write-only.


// alloc buffer
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, buffer_id );
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW);

This scope takes 20ms to complete

{
void* b = glMapBufferRange( GL_PIXEL_UNPACK_BUFFER, 0, size, GL_MAP_WRITE_BIT|GL_MAP_READ_BIT );
copyTo( b, getRGBASize() ); // <- ca. 8 threads writing ca. 64MB data
glUnmapBuffer( GL_PIXEL_UNPACK_BUFFER );
}

This scope takes 600ms to complete

{
void* b = glMapBufferRange( GL_PIXEL_UNPACK_BUFFER, 0, size, GL_MAP_WRITE_BIT );
copyTo( b, getRGBASize() ); // <- 8 threads writing ca. 64MB data
glUnmapBuffer( GL_PIXEL_UNPACK_BUFFER );
}

Is it possible that we get video memory in the latter case and that the threaded writes thrash the memory access?

Replying to myself with more detail:

Indeed, the problem is ‘non-cacheable memory’. I found a 2008 forum post from yooyo saying that mapped buffers should only be accessed sequentially. So I serialized the last producer step, and the copy performance was back:


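#pragma omp parallel
{
     // Thread-private scratch buffer in ordinary, cacheable memory.
     // CHUNK_SIZE and chunk_offset() are placeholders for the real chunk layout.
     char some_thread_private_memory_on_stack[CHUNK_SIZE];
     dataproducer_into_local_memory(some_thread_private_memory_on_stack);

     // Only one thread at a time writes into the mapped buffer,
     // so each chunk lands there as one sequential memcpy.
     #pragma omp critical
     {
           // mapped_buffer_ptr is assumed to be a char* so the offset is in bytes
           memcpy( mapped_buffer_ptr + chunk_offset(), some_thread_private_memory_on_stack,
                   sizeof(some_thread_private_memory_on_stack) );
     }
}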
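
For anyone who wants to try the pattern without a GL context, here is a minimal sketch of it. It assumes a plain byte array standing in for the pointer returned by glMapBufferRange, and produce_chunk() and fill_mapped_serialized() are hypothetical names made up for this example, not part of our unpacker. Threads produce into private scratch memory in parallel, and only the final memcpy into the (pretend) mapped buffer is serialized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the real unpacker: fills one chunk with a value
// derived from the chunk index so the result can be checked afterwards.
static void produce_chunk(std::uint8_t* dst, std::size_t bytes, int chunk_index)
{
    std::memset(dst, chunk_index & 0xFF, bytes);
}

// Produces chunk_count chunks in parallel into thread-private memory, then
// copies each chunk into `mapped` one thread at a time.
// `mapped` stands in for the pointer returned by glMapBufferRange (assumption).
void fill_mapped_serialized(std::uint8_t* mapped, std::size_t chunk_bytes, int chunk_count)
{
    assert(mapped != nullptr);

    #pragma omp parallel
    {
        // Thread-private scratch buffer in ordinary, cacheable memory.
        std::vector<std::uint8_t> scratch(chunk_bytes);

        #pragma omp for
        for (int i = 0; i < chunk_count; ++i) {
            produce_chunk(scratch.data(), chunk_bytes, i);

            // Serialize the write: each chunk goes into the destination
            // as a single sequential memcpy, one thread at a time.
            #pragma omp critical
            {
                std::memcpy(mapped + static_cast<std::size_t>(i) * chunk_bytes,
                            scratch.data(), chunk_bytes);
            }
        }
    }
}
```

Built without OpenMP support, the pragmas are ignored and the function simply runs single-threaded, which is handy for checking the result. With a real mapping you would pass the mapped pointer (cast to std::uint8_t*) instead of a test array.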

If anyone can shed some light on ‘non-cacheable memory’ (what is it? where is it?), that would be great!

This article by Fabian Giesen about write-combined memory explains the issue and gives some guidelines for getting the best performance. Write-combined and non-cacheable memory are explained in section 3.3.3 of this article (What Every Programmer Should Know About Memory).
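
As a rough illustration of the usual write-combining guidelines (write sequentially, never read the destination back), here is a small sketch. It assumes `dst` may point at a write-combined mapping; in this standalone form it is just ordinary memory, and the function names are made up for the example. Both functions produce the same bytes, but only the second avoids reading from the destination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Pattern to avoid on a write-combined mapping: the loop reads dst[i] back
// before writing it, and reads from such memory are uncached and slow.
void sum_in_place(std::uint8_t* dst, const std::uint8_t* a,
                  const std::uint8_t* b, std::size_t n)
{
    std::memcpy(dst, a, n);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = static_cast<std::uint8_t>(dst[i] + b[i]);  // read-modify-write on dst
    }
}

// Friendlier pattern: do all the work in a cacheable staging buffer,
// then touch dst exactly once with a sequential, write-only copy.
void sum_then_store(std::uint8_t* dst, const std::uint8_t* a,
                    const std::uint8_t* b, std::size_t n)
{
    std::vector<std::uint8_t> staging(n);
    for (std::size_t i = 0; i < n; ++i) {
        staging[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
    std::memcpy(dst, staging.data(), n);  // dst is only ever written, front to back
}
```

On normal heap memory the two versions behave the same; whatever difference shows up on a real mapped buffer will depend on the driver and on what kind of memory it hands back.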