Unsynchronized VBO Reads?

Hey - I wasn’t sure if this should go under the Windows section or maybe drivers, but I am a bit of a beginner, so here goes.

In my company’s application, we have a pretty simple GL system for VBOs and geometry that was written about 8 years ago. We dynamically generate a lot of geometry and store it directly in VBOs, since it’s often transient and we want to save system memory. Occasionally we want to read the dynamically generated geometry back from the VBO, and sometimes we only do partial updates. If a partial update is small enough, we use glBufferSubData; otherwise we orphan the old VBO using glBufferData with a NULL pointer and upload an entirely new set of data. When we do want to read from the VBO, we use glMapBuffer (as GL_READ_ONLY).
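For concreteness, here’s a rough sketch of the scheme I described (names and sizes are placeholders):

[code]
/* Small partial update (assumes 'vbo' already has storage): */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferSubData(GL_ARRAY_BUFFER, offset, updateSize, updateData);

/* Large update: orphan the old storage, then upload the new geometry. */
glBufferData(GL_ARRAY_BUFFER, newSize, NULL, GL_DYNAMIC_DRAW);
glBufferData(GL_ARRAY_BUFFER, newSize, newData, GL_DYNAMIC_DRAW);

/* Occasional readback: */
const void* ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_ONLY);
if (ptr) {
    /* ... copy out whatever we need ... */
    glUnmapBuffer(GL_ARRAY_BUFFER);
}
[/code]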

On Windows + ATI systems, the glMapBuffer call to read data was extremely slow - so slow that even the occasional reads can make the app unusable. I believe this is due to the driver unnecessarily synchronizing on a read-only map. I tried switching to glMapBufferRange with the flags GL_MAP_READ_BIT | GL_MAP_UNSYNCHRONIZED_BIT, which is technically illegal, and it massively sped up the app on Windows + ATI systems. But because it’s illegal, it broke on nVidia systems, which returned NULL as they should.

So, is there an established way to declare an asynchronous read?

Here are a couple of things that seem to work in general, though I don’t know how robust they are. First: use glMapBufferRangeARB, which only seems to be declared on ATI systems (and is not part of the GL spec). I can look it up with wglGetProcAddress, and if that returns NULL, fall back to glMapBufferRange without the unsynchronized flag, as sketched below. I don’t actually know whether it exists only on ATI systems, but anecdotally it seems that way.
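The probe would look something like this (a sketch only; as the replies below point out, glMapBufferRangeARB isn’t in any spec, so the typedef is just my guess that it mirrors glMapBufferRange’s signature):

[code]
/* Hypothetical: glMapBufferRangeARB is not specified anywhere, so this
 * typedef simply mirrors glMapBufferRange. */
typedef void* (APIENTRY *PFNMAPBUFFERRANGEPROC)(GLenum target, GLintptr offset,
                                                GLsizeiptr length, GLbitfield access);

PFNMAPBUFFERRANGEPROC mapRangeARB =
    (PFNMAPBUFFERRANGEPROC)wglGetProcAddress("glMapBufferRangeARB");

void* ptr = mapRangeARB
    ? mapRangeARB(GL_ARRAY_BUFFER, 0, size,
                  GL_MAP_READ_BIT | GL_MAP_UNSYNCHRONIZED_BIT)
    : glMapBufferRange(GL_ARRAY_BUFFER, 0, size, GL_MAP_READ_BIT);
[/code]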

Alternatively, I can attempt an unsynchronized read, and if it fails (returns NULL, as it does on the nVidia systems), retry without the unsynchronized bit - see the sketch below. I think this approach is technically safe (it should be in line with the GL spec, while coaxing the ATI systems into reading asynchronously), but I’m not 100% certain.
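Sketched out, the fallback looks like this (the error-clearing loop assumes the failed map raised GL_INVALID_OPERATION, which is what the spec says should happen for READ | UNSYNCHRONIZED):

[code]
/* Try the (spec-illegal) unsynchronized read-only map first. */
void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_READ_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
if (!ptr) {
    /* Conforming drivers (e.g. nVidia) return NULL and raise
     * GL_INVALID_OPERATION; clear the error, then map legally. */
    while (glGetError() != GL_NO_ERROR) {}
    ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size, GL_MAP_READ_BIT);
}
[/code]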

You might look at trying a different option for the last parameter of glBufferData (the usage hint) when the VBO is created.
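For example (GL_DYNAMIC_READ is just one hint to try; the *_READ variants are nominally for data the application reads back):

[code]
glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_READ);
/* The full set of hints: GL_STATIC_DRAW/READ/COPY,
 * GL_DYNAMIC_DRAW/READ/COPY, GL_STREAM_DRAW/READ/COPY. */
[/code]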

ARB functions are usually functions that the OpenGL committee (i.e. both nVidia and ATI) agree are useful but are not part of the standard. They are usually available in both companies’ drivers, but may be removed once the call is moved (sometimes with modifications) into the core spec or is considered no longer useful. So glMapBufferRangeARB will have been replaced by glMapBufferRange; this means ATI and nVidia are free to stop supporting glMapBufferRangeARB. Production code should only use ARB functions if you also have fallback logic for when the function is not found.

So, is there an established way to declare an asynchronous read?

No.

There is no glMapBufferRangeARB. That function doesn’t exist. That function never existed. There is no specification governing the behavior of that function, so relying on its presence for the working of your program in any capacity is… not wise. Its behavior can change at a moment’s notice.

I would suggest solving your synchronization problem elsewhere. Use fence sync objects to figure out why you’re getting stalls when you map the buffer. You should always wait (ideally by doing other work) until the read operation has finished (as signaled by a fence), and only then map the buffer.

ARB functions are usually functions that the OpenGL committee (i.e. both nVidia and ATI) agree are useful but are not part of the standard. They are usually available in both companies’ drivers, but may be removed once the call is moved (sometimes with modifications) into the core spec or is considered no longer useful. So glMapBufferRangeARB will have been replaced by glMapBufferRange; this means ATI and nVidia are free to stop supporting glMapBufferRangeARB. Production code should only use ARB functions if you also have fallback logic for when the function is not found.

This is misleading. Both NVIDIA and ATI have shown a willingness to continue to implement pretty much every extension in perpetuity. If what you said were true, you wouldn’t see things like ARB_vertex_buffer_object in extension strings, yet there it is.

In any case, the problem is that glMapBufferRangeARB does not exist. There is no specification that defines this function. ARB_map_buffer_range is a core extension; its functions don’t have the ARB suffix.

[QUOTE=tonyo_au;1241679]You might look at trying a different option for the last parameter of glBufferData (the usage hint) when the VBO is created.
[/QUOTE]

Yup, I tried that - we’re currently using GL_DYNAMIC_DRAW, but I also tried GL_DYNAMIC_COPY, GL_STREAM_DRAW, and GL_STREAM_COPY, and none of them made a difference.

[QUOTE=Alfonse Reinheart;1241680]No.

There is no glMapBufferRangeARB. That function doesn’t exist. That function never existed. There is no specification governing the behavior of that function, so relying on its presence for the working of your program in any capacity is… not wise. Its behavior can change at a moment’s notice.

I would suggest solving your synchronization problem elsewhere. Use fence sync objects to figure out why you’re getting stalls when you map the buffer. You should always wait (ideally by doing other work) until the read operation has finished (as signaled by a fence), and only then map the buffer.
[/QUOTE]

Yeah, that’s what I meant by it not being part of the GL spec. I thought that was mostly an interesting observation (that ATI drivers seem to have it anyway), but you’re right, we definitely should not be using it.

How would I use a fence sync object to figure out why I’m getting stalls? Should I move it around in the code and figure out where the fence is actually taking time (i.e. as a method of tracking down the gl calls that cause the synchronization)?

How would I use a fence sync object to figure out why I’m getting stalls? Should I move it around in the code and figure out where the fence is actually taking time (i.e. as a method of tracking down the gl calls that cause the synchronization)?

You put a fence after you issue your read command. Right before you map the buffer, check the fence to see if it has completed. If so, then you shouldn’t see a stall. Otherwise, you’re not doing enough stuff between the time you issue your read and the time you want to actually read it.
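Something along these lines (a sketch; requires GL 3.2 or ARB_sync, and the target/names are placeholders for whatever your buffer actually is):

[code]
/* ... issue whatever command fills the buffer ... */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

/* ... do other work ... */

/* Right before mapping, poll the fence (timeout of 0 = don't block): */
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
glDeleteSync(fence);
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
    void* ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_ONLY); /* should not stall */
} else {
    /* GPU hasn't caught up yet; do more work before mapping. */
}
[/code]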

Both NVIDIA and ATI have shown a willingness to continue to implement pretty much every extension in perpetuity

This is true in general, but some have definitely disappeared, like GL_ARB_vertex_blend (or were those just never implemented?).

Sorry, I’m sure this is a stupid question, but what exactly do you mean by “issue your read command”?

Call glReadPixels. Or glGetTexImage. Or whatever command you’re doing that copies pixels from an image into the buffer object.
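For instance, a readback into a pixel pack buffer might look like this (buffer names are placeholders):

[code]
/* With a PBO bound to GL_PIXEL_PACK_BUFFER, glReadPixels writes into the
 * buffer object (at the given offset) instead of client memory. */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0); /* offset 0 */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
/* ... later, check the fence as above before mapping the PBO ... */
[/code]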

In the scenario described above, the buffer is not modified by the GPU at all. In that case, would you sync on the last glBuffer(Sub)Data, or rather on the last draw command using (i.e. reading from) that buffer?

Sorry - I think you may have misread my post? I’m talking about using VBOs for drawing geometry, so there isn’t a read command.

That’s correct: the GPU should not be modifying the buffer at all in this case, so I would expect the sync to be on the last glBuffer(Sub)Data. But based on the experimentation and the performance differences, I think ATI/AMD on Windows may be synchronizing with drawing (essentially treating the buffer as having a basic lock rather than a read-write lock).
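One way to test that theory (a sketch, following the fence idea above) is to fence the last draw that sources the VBO and see whether it has already signaled by the time the map stalls:

[code]
/* Fence the last draw that reads from the VBO (the GPU never writes it). */
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

/* ... later, just before mapping ... */
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
glDeleteSync(fence);
/* If status is GL_ALREADY_SIGNALED but glMapBuffer still stalls, the
 * driver is synchronizing on something beyond that draw. */
void* ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_ONLY);
[/code]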