View Full Version : Fastest way to reset an atomic counter?



kzuiderveld
12-08-2014, 06:51 AM
I am using an atomic counter that needs to be reset frequently. I'm currently doing this by writing 4 zero bytes to the counter using glBufferSubData; the atomic counter is allocated via glBufferStorage with GL_DYNAMIC_STORAGE_BIT mapping.
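For reference, that setup boils down to something like the sketch below (a minimal illustration; the buffer name `acbo` and binding index are made up, not from the original post):

```c
#include <GL/glcorearb.h>   /* or your loader's header, e.g. glad/glad.h */

GLuint acbo;
glGenBuffers(1, &acbo);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, acbo);

/* Immutable storage, CPU-writable via glBufferSubData. */
glBufferStorage(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), NULL,
                GL_DYNAMIC_STORAGE_BIT);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, acbo);

/* Per-frame reset: upload 4 zero bytes over the bus. */
const GLuint zero = 0;
glBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint), &zero);
```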

My measurements suggest that resetting the atomic counter takes about 100 us. Hmm. That's pretty time consuming for moving 4 bytes over the PCI bus.

I looked into faster ways to do this. I tried to map the atomic buffer using glMapBufferRange, but that absolutely destroys overall performance (duh - the atomic counter doesn't live in GPU memory anymore).

Any suggestions to make resetting (and reading) atomic counters faster or is glBufferSubData the way to go?

Alfonse Reinheart
12-08-2014, 07:57 AM
That's pretty time consuming for moving 4 bytes over the PCI bus.

You've clearly never used the PCI bus before ;)

Initiating a memory transfer operation has a fixed cost, in addition to the actual per-byte transfer speed. Thus, it's best to avoid small transfers where possible.


duh - atomic counter doesn't live on the GPU anymore

That's not what mapping does.

OK, you haven't told us everything you're doing with the buffer. So I'll give you some advice based purely on what you've stated.

Assuming:


The only thing this buffer object is used for are atomic counters.
The only CPU-side operation you do on the buffer is clearing it to zero (ie: the GPU increments the value, and only GPU operations read it).


Given the above assumptions, the solution is quite simple. Your access flags (https://www.opengl.org/wiki/Buffer_Object#Immutable_access_methods) ought to be zero. Yes, really. This means you cannot use glBufferSubData, nor should you want to.

To clear the buffer, you should invalidate it (https://www.opengl.org/wiki/Buffer_Object#Invalidation) with glInvalidateBufferData (https://www.opengl.org/wiki/GLAPI/glInvalidateBufferData). Then clear the buffer to a value (https://www.opengl.org/wiki/Buffer_Object#Clearing) with glClearBufferData (https://www.opengl.org/wiki/GLAPI/glClearBufferData).
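That advice might look something like this sketch (the buffer name `acbo` is illustrative; the GL_R32UI/GL_RED_INTEGER/GL_UNSIGNED_INT combination clears the buffer as 32-bit unsigned integers):

```c
/* Reset the counter without a CPU->GPU upload. The buffer was created
 * with glBufferStorage and access flags = 0, as suggested above. */
glInvalidateBufferData(acbo);   /* orphan the old contents */

glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, acbo);
const GLuint zero = 0;
glClearBufferData(GL_ATOMIC_COUNTER_BUFFER, GL_R32UI,
                  GL_RED_INTEGER, GL_UNSIGNED_INT, &zero);
```

Both calls stay entirely on the GPU side of the driver, so there is no small PCI transfer to pay for each frame.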

That should get you the fastest theoretical performance you're going to get.

Technically, the above would also work if you use glGetBufferSubData (https://www.opengl.org/wiki/GLAPI/glGetBufferSubData) to read from the value on the CPU before clearing it. But that's a terrible function for performance (it forces a CPU/GPU synchronization).

Instead, if you need to read the value on the CPU, you should set the access flags to GL_MAP_BUFFER_READ_BIT and GL_MAP_PERSISTENT_BIT. Then you use persistent mapping (https://www.opengl.org/wiki/Buffer_Object#Persistent_mapping) to map the buffer once. When you need to read from it, issue the appropriate memory barrier and use synchronization objects (https://www.opengl.org/wiki/Sync_Object) to delay access to it for as long as possible.

Obviously, you can't clear the value until you read from it. Indeed, you may want to use multiple buffers in this case. Frame 1 uses buffer 1, frame 2 uses buffer 2, etc. You can invalidate/clear immediately after reading from each.
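A sketch of what that persistent-mapping read path might look like (names are illustrative and error handling is omitted; a real implementation would interleave other work between the fence and the wait):

```c
/* Create the counter buffer readable and persistently mappable. */
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, acbo);
glBufferStorage(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), NULL,
                GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT);

/* Map once, keep the pointer for the lifetime of the buffer. */
GLuint *counter = (GLuint *)glMapBufferRange(
    GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint),
    GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT);

/* After the draw/dispatch that increments the counter: */
glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

/* ... do as much other work as possible, then read as late as you can: */
while (glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                        1000000) == GL_TIMEOUT_EXPIRED)
    ;   /* spin (or do useful work) until the GPU is done */
GLuint count = *counter;
glDeleteSync(fence);
```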

kzuiderveld
12-08-2014, 10:04 AM
You've clearly never used the PCI bus before ;)

I actually have, it was more a tongue-in-cheek comment.


you haven't told us everything you're doing with the buffer.

For each frame, I'm building a linked list (hence the atomic counter needs to be reset to zero). However, I don't know the length of the linked list in advance, so I need to render a frame, READ THE ATOMIC COUNTER to determine the number of entries generated, check it against the size of the SSBO that contains the linked list, and make the SSBO bigger if needed.

So I need to both reset the counter and read it. Your suggestion of invalidating & clearing the buffer data works like a charm; it completely removes the overhead of writing 0 to the counter.

No cigar on reading the atomic counter, though. If I use GL_MAP_READ_BIT and GL_MAP_PERSISTENT_BIT, my compute shader becomes 100x slower, likely because the atomic counter no longer lives exclusively in GPU memory. And I can't wait for the next frame; I need the result right away.

It would be great if there were a way for the GPU to control how sparse buffers/textures are allocated so I could keep everything running on the GPU. But for now, I need to use the CPU to resize the SSBO, so I need to read that atomic counter value.

mbentrup
12-11-2014, 03:22 AM
So I need to reset the counter and read the counter. Your suggestion of invalidate & clear the buffer data works like a charm, it completely removes the overhead of writing 0 to the counter.


As you have to map the buffer anyway to read the counter, can't you just reset it in the same step?


No cigar on reading the atomic counter, though. If I use GL_MAP_READ_BIT and GL_MAP_PERSISTENT_BIT, my compute shader becomes 100x slower, likely because the atomic counter no longer lives exclusively in GPU memory. And I can't wait for the next frame; I need the result right away.

Well, if the atomic counter is allocated in CPU-visible memory, every atomicCounterIncrement in the shader has to access it over the PCI bus....

For a use case like this a query object would be ideal, because queries are designed to get a result back to the CPU as quickly as possible. However, there are no user-defined queries in OpenGL, so you'd have to hijack one of the built-in queries, e.g. render to a 1x1 FBO and count GL_SAMPLES_PASSED.
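A rough sketch of that hijack, assuming a 1x1 FBO is bound, depth/stencil tests are off, and the counting pass emits one fragment per linked-list entry (the query name is illustrative):

```c
GLuint query;
glGenQueries(1, &query);

/* Count fragments instead of bumping an atomic counter. */
glBeginQuery(GL_SAMPLES_PASSED, query);
/* ... render the counting pass into the 1x1 FBO ... */
glEndQuery(GL_SAMPLES_PASSED);

/* Fetch the result as late as possible (ideally a frame later) to
 * avoid stalling the pipeline. */
GLuint64 entry_count = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &entry_count);
```

With a 1x1 render target each passing fragment contributes one sample, so the query result is the number of entries, and the driver's query path handles the GPU-to-CPU readback efficiently.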