[quote]That’s pretty time consuming for moving 4 bytes over the PCI bus.[/quote]
You’ve clearly never used the PCI bus before.
Initiating a memory transfer operation has a fixed cost, in addition to the time spent actually moving the bytes. Thus, it’s best to avoid small transfers where possible.
[quote]duh - atomic counter doesn’t live on the GPU anymore[/quote]
That’s not what mapping does.
OK, you haven’t told us everything you’re doing with the buffer. So I’ll give you some advice based purely on what you’ve stated.
Assuming:
[ul]
[li]The only thing this buffer object is used for is atomic counters.
[/li][li]The only CPU-side operation you perform on the buffer is clearing it to zero (i.e., the GPU increments the value, and GPU operations are also the only ones that read it).
[/li][/ul]
Given the above assumptions, the solution is quite simple. The access flags you pass to [var]glBufferStorage[/var] ought to be zero. Yes, really. This means you cannot use [var]glBufferSubData[/var] (that would require [var]GL_DYNAMIC_STORAGE_BIT[/var]), nor should you want to.
To clear the buffer, you should invalidate it with [var]glInvalidateBufferData[/var]. Then [i]clear[/i] the buffer to a value with [var]glClearBufferData[/var].
That should get you the fastest theoretical performance you’re going to get.
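Here is a minimal sketch of that setup and reset. It assumes an OpenGL 4.4+ context is current, the entry points are already loaded (GLEW is used here only as an example loader), and the buffer holds a single 32-bit counter. The names [var]create_counter_buffer[/var] and [var]reset_counter_buffer[/var] are made up for illustration.

```c
#include <stddef.h>
#include <GL/glew.h>   /* assumption: any loader exposing GL 4.4 would do */

/* Create an immutable buffer holding one 32-bit atomic counter.
   flags == 0 means: no mapping and no glBufferSubData.
   The contents are undefined until the first reset. */
GLuint create_counter_buffer(void)
{
    GLuint buf = 0;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, buf);
    glBufferStorage(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), NULL, 0);
    return buf;
}

/* Reset the counter to zero without uploading anything from the CPU. */
void reset_counter_buffer(GLuint buf)
{
    const GLuint zero = 0;

    /* Tell the driver the old contents no longer matter. */
    glInvalidateBufferData(buf);

    /* Fill the whole buffer with the single 32-bit value. */
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, buf);
    glClearBufferData(GL_ATOMIC_COUNTER_BUFFER, GL_R32UI,
                      GL_RED_INTEGER, GL_UNSIGNED_INT, &zero);
}
```

You would call [var]reset_counter_buffer[/var] once after creation and then again each time the counter needs to go back to zero.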
Technically, the above would also work if you use [var]glGetBufferSubData[/var] to read the value on the CPU before clearing it. But that’s a terrible function for performance (it forces a CPU/GPU synchronization).
Instead, if you need to read the value on the CPU, you should set the access flags to [var]GL_MAP_READ_BIT[/var] and [var]GL_MAP_PERSISTENT_BIT[/var]. Then you use persistent mapping to map the buffer once. When you need to read from it, issue the appropriate memory barrier ([var]GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT[/var], since the mapping is not coherent) and use synchronization objects to delay access to it for as long as possible.
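A rough sketch of the persistent-mapping read path might look like this. It assumes a current OpenGL 4.4+ context with loaded entry points (GLEW again only as an example) and a single 32-bit counter. The struct and function names are hypothetical, and error handling (a failed map, [var]GL_WAIT_FAILED[/var]) is left out.

```c
#include <stddef.h>
#include <GL/glew.h>   /* assumption: any loader exposing GL 4.4 would do */

typedef struct {
    GLuint        buffer;
    const GLuint *mapped;   /* stays valid for the buffer's lifetime */
} ReadableCounter;

/* Create the buffer and map it once, for reading only. */
ReadableCounter create_readable_counter(void)
{
    const GLbitfield flags = GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT;
    ReadableCounter c = { 0, NULL };

    glGenBuffers(1, &c.buffer);
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, c.buffer);
    glBufferStorage(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), NULL, flags);
    c.mapped = (const GLuint *)glMapBufferRange(
        GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), flags);
    return c;
}

/* Call right after issuing the GPU work that increments the counter. */
GLsync end_counter_writes(void)
{
    /* Make the shader writes visible through the non-coherent mapping. */
    glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT);
    return glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

/* Call as late as possible, so the fence has usually signaled already. */
GLuint read_counter(const ReadableCounter *c, GLsync fence)
{
    GLenum status;
    do {
        /* Wait in 1 ms slices; the flush bit keeps the fence from stalling. */
        status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000);
    } while (status == GL_TIMEOUT_EXPIRED);
    glDeleteSync(fence);
    return *c->mapped;
}
```

The buffer is never unmapped here. Clearing still goes through [var]glInvalidateBufferData[/var] and [var]glClearBufferData[/var], since the mapping is read-only.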
Obviously, you can’t clear the value until you read from it. Indeed, you may want to use multiple buffers in this case. Frame 1 uses buffer 1, frame 2 uses buffer 2, etc. You can invalidate/clear immediately after reading from each.
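The round-robin bookkeeping for that can be sketched in plain C. This assumes three buffers in flight (an arbitrary choice) and zero-based frame numbers. The function names are made up, and the actual GL calls are only indicated in comments.

```c
#include <stddef.h>

enum { COUNTER_RING_SIZE = 3 };  /* assumption: three buffers in flight */

/* Slot whose buffer the GPU increments during `frame`. */
size_t counter_write_slot(size_t frame)
{
    return frame % COUNTER_RING_SIZE;
}

/* Slot the CPU reads during `frame`: the oldest one, written
   COUNTER_RING_SIZE - 1 frames ago. After reading it, invalidate and
   clear that buffer; it becomes the write slot of the next frame. */
size_t counter_read_slot(size_t frame)
{
    return (frame + 1) % COUNTER_RING_SIZE;
}

/* During the first COUNTER_RING_SIZE - 1 frames nothing is old enough
   to read yet. */
int counter_has_result(size_t frame)
{
    return frame >= (size_t)(COUNTER_RING_SIZE - 1);
}
```

With a ring of three, frame 5 writes slot 2 and reads slot 0, which was written during frame 3 and will be written again in frame 6. A larger ring gives the GPU more time to finish before the CPU reads, at the cost of older results.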