Unsynchronize SSBO

Hi, I have a huge CPU performance issue that I'm 99% certain has to do with synchronization. With 4 triangles it takes 30% CPU on a quad core, and with around 50 triangles it takes 60% and the FPS drops from 60 to about 40.

I am retrieving a single unsigned int from the GPU via an SSBO. I use it to detect what the mouse is hovering over, and it works perfectly, except for the performance issue. I'm pretty sure I could make it work without any synchronization; by that I mean without waiting for the GPU to finish all draw calls before retrieving the uint.

I have the following code:


	glDrawArrays(GL_TRIANGLES, 0, m_vertex_buffer.size());
	
	GLuint vertexIndex = _engine->retrieveHoweredVertexIndex(); //this takes time.
	if (vertexIndex != 0xffffffff) 
		_engine->setHoweredVertexObserver(m_triangles[vertexIndex/3]);
	

Here is the function where I suppose it loops and waits for the GPU to finish:


	inline GLuint retrieveHoweredVertexIndex()
	{
		glBindBuffer(GL_SHADER_STORAGE_BUFFER, m_vertex_index_SSBO_ID);
	GLvoid* p = glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_WRITE); //This is probably where something like glFinish() happens, which I don't want.
		GLuint vertexIndex = *(GLuint*)p;
		*(GLuint*)p = 0xffffffff;
		glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
		return vertexIndex;
	}

I have tried glMapBufferRange without any improvement. (I may very well have used it incorrectly, though.)

So my question is: how do I make glMapBuffer not wait for the draw call to finish? Or is that even the problem?

By the way, I read on the wiki that "the smallest required SSBO size is 16MB". Does that mean this single uint will take 16MB of graphics memory?

To the latter, don’t read back from the GPU. Unless done very carefully, that will kill your performance, even on a desktop GPU (sort-last), but especially on a mobile GPU (most are sort-middle).

To accelerate readbacks, you can sometimes use buffer object intermediates with delayed fetching of the data by the CPU from the buffer object to give the GPU/driver time to finish the data and copy the data across to the client side in the background. However, best case is you don’t read back from the GPU. I think there’s an article in OpenGL Insights on doing fast transfers using PBOs. Alternatively search for fast readbacks using PBOs on the net. See also mentions of PBOs and Download in the wiki:
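The delayed-fetch pattern described above can be sketched roughly like this (a sketch only, not a definitive implementation; `pbo` and `consume` are illustrative names, and the PBOs are assumed to have been created elsewhere):

```c
/* Sketch: asynchronous pixel readback with two PBOs in ping-pong.
 * Assumed setup (not shown): glGenBuffers(2, pbo) and, for each,
 * glBufferData(GL_PIXEL_PACK_BUFFER, w*h*4, NULL, GL_STREAM_READ). */
GLuint pbo[2];
static int frame = 0;

void read_pixels_async(int w, int h, void (*consume)(const void *px))
{
    int cur  = frame % 2;        /* PBO receiving this frame's pixels */
    int prev = (frame + 1) % 2;  /* PBO filled one frame ago          */

    /* With a pack buffer bound, glReadPixels returns immediately and
     * the driver copies the data in the background. */
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[cur]);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0);

    /* Map the PBO written last frame; the copy has usually finished
     * by now, so this map should not stall the CPU. */
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[prev]);
    const void *p = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (p) {
        consume(p);              /* note: data is one frame old */
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
}
```

The key design point is that you never map the buffer the GPU is currently filling; you pay one frame of latency instead of a stall.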

To the former, I suggest you read this:

and pay attention to any mention of synchronization. However, this is mainly written for the desired case where your data is all moving in the CPU->GPU direction.

[QUOTE=Dark Photon;1279936]To the latter, don’t read back from the GPU. Unless done very carefully, that will kill your performance, even on a desktop GPU (sort-last), but especially on a mobile GPU (sort-middle).
[/QUOTE]

Yes, but what I want is simply for glMapBuffer not to wait for the drawing. I tried GL_MAP_UNSYNCHRONIZED_BIT but the program crashes when accessing the pointer (GLuint vertexIndex = *(GLuint*)p;). So maybe that is my real question: why does this crash? From what I can guess, the map reads from a place in graphics memory, puts it where p points, and then when you unmap it, it uploads that value to graphics memory again?

With that said, is this valid code? Because it crashes.


	inline GLuint retrieveHoweredVertexIndex()
	{
		glBindBuffer(GL_SHADER_STORAGE_BUFFER, m_vertex_index_SSBO_ID);
		//GLvoid* p = glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_WRITE);
		GLvoid* p = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, sizeof(GLuint), GL_MAP_UNSYNCHRONIZED_BIT);
		GLuint vertexIndex = *(GLuint*)p;
		*(GLuint*)p = 0xffffffff;
		glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
		return vertexIndex;
	}

Are you checking for a NULL pointer?

Check for GL errors (for development purposes only) after the glMapBufferRange() call (link). Your call should be throwing one of those. See Errors under glMapBufferRange for details. Also, you cannot do unsynchronized reads, only writes.
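A development-time check after the map call might look like this (a sketch; `m_vertex_index_SSBO_ID` is the buffer from the earlier code, and the early return value is just an example):

```c
#include <stdio.h>

glBindBuffer(GL_SHADER_STORAGE_BUFFER, m_vertex_index_SSBO_ID);
GLvoid *p = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, sizeof(GLuint),
                             GL_MAP_READ_BIT);
GLenum err = glGetError();  /* development-only: error checks cost time */
if (p == NULL || err != GL_NO_ERROR) {
    fprintf(stderr, "glMapBufferRange failed, GL error 0x%04x\n", err);
    return 0xffffffffu;     /* treat as "nothing hovered" */
}
```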

It might do that, or it might physically map the video RAM into the process address space, so that writing to it modifies video RAM directly.

No, it’s not valid. The glMapBufferRange() call should generate a GL_INVALID_OPERATION error and return a null pointer, because neither GL_MAP_READ_BIT nor GL_MAP_WRITE_BIT is set in the access parameter.

Also, you’re reading from a mapped region which was mapped without GL_MAP_READ_BIT. Even if the call did actually map the region (i.e. didn’t return a null pointer), there’s no reason to believe that the region can be read from.

[QUOTE=Dark Photon;1279942]Are you checking for a NULL pointer?

Check for GL errors (for development purposes only) after the glMapBufferRange() call (link). Your call should be throwing one of those. See Errors under glMapBufferRange for details. Also, you cannot do unsynchronized reads, only writes.[/QUOTE]

Ooh, if I can't do unsynchronized reads then I guess this problem of mine can't be fixed :frowning:

Just a little curious: why isn't it possible to read unsynchronized? For my problem the uint only changes in very few cases, and I think I could handle any undefined behavior, even if I read the uint while the GPU is still processing that value.

No, it means that an OpenGL implementation advertising SSBO support must return at least 16MB when you query GL_MAX_SHADER_STORAGE_BLOCK_SIZE.
You can assume that the memory actually allocated for your buffer will be much less, though it could still be a few kB.

Also note that when reading unsynchronized from a buffer object, you have no guarantee that the shader has written to the SSBO yet, so the value in the SSBO may be garbage.

Okay, so I had an idea. I'm using GLFW and have a game loop updating 60 times per second. I won't use my uint value until the frame/update after the one in which it was rendered. Is it guaranteed that every draw call I have made so far is done when glfwSwapBuffers returns and the next frame/update begins? Because then glMapBuffer would not have to wait for the GPU to finish if I call it directly after glfwSwapBuffers, and everything would run smoothly?

Or does glMapBuffer wait for all of the program's GPU commands to be done?
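The idea of consuming the value one frame late can be made explicit with two SSBOs in round-robin (a sketch under assumptions, not tested code; `ssbo` is a hypothetical pair of buffers created up front, and the draw calls are elided):

```c
/* Sketch: the shader writes the hover index into ssbo[cur] (bound at
 * binding point 0), while the CPU maps ssbo[prev], which the GPU
 * finished writing a whole frame ago, so the map rarely stalls.
 * Assumed setup (not shown): glGenBuffers(2, ssbo) and glBufferData
 * for each, initialized to 0xffffffff. */
GLuint ssbo[2];
static int frame = 0;

GLuint retrieve_hovered_index_one_frame_late(void)
{
    int cur  = frame % 2;
    int prev = (frame + 1) % 2;

    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo[cur]);
    /* ... issue this frame's draw calls here ... */

    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo[prev]);
    GLuint *p = (GLuint *)glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_WRITE);
    GLuint index = 0xffffffffu;
    if (p) {
        index = *p;           /* result from the previous frame */
        *p = 0xffffffffu;     /* reset the slot for reuse */
        glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    }
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
    ++frame;
    return index;
}
```

This trades one frame of latency (usually invisible for hover detection) for not stalling on the buffer the GPU is currently writing.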

[QUOTE=Trionet;1279959]Okay, so I had an idea. I'm using GLFW and have a game loop updating 60 times per second. I won't use my uint value until the frame/update after the one in which it was rendered. Is it guaranteed that every draw call I have made so far is done when glfwSwapBuffers returns and the next frame/update begins?
[/QUOTE]
No. Buffer swaps can be pipelined.

If you don’t specify an unsynchronised mapping, it will have to wait for any commands which might modify the buffer, and there will be limitations as to how “smart” this is (e.g. if the buffer is bound as an SSBO, the implementation isn’t necessarily going to analyse whether a particular shader will write to it).

You could try using a query or a sync object to detect when specific commands have completed, and poll that to determine whether it’s safe to map the buffer.
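A sketch of the sync-object approach (names are illustrative; the fence would be created right after the draw that writes the SSBO):

```c
/* Right after the draw call that writes the SSBO, insert a fence: */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

/* ... later (e.g. at the start of the next frame), poll it with a
 * zero timeout so the CPU never blocks waiting for the GPU: */
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
    glDeleteSync(fence);
    /* safe to map the buffer here without stalling */
} else {
    /* GPU not done yet: skip the readback this frame, poll again next */
}
```

GL_SYNC_FLUSH_COMMANDS_BIT ensures the fence actually reaches the GPU even if the command queue hasn't been flushed yet.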

[QUOTE=GClements;1279962]

No. Buffer swaps can be pipelined.[/QUOTE]

For a desktop GPU, the conventional way to deal with that is to put a glFinish after your SwapBuffers call (and only after your SwapBuffers call). After that glFinish, you know that the GPU has processed all of the calls for the previous frame, and performed any frame post-processing required to present the frame you just submitted to the user.

This also has the benefit of synchronizing your draw thread with the frame clock, which has some nice benefits in terms of providing consistent end-to-end latency through the system.

Don’t do this on most mobile GPUs though.

Alternatively, use the sync object method GClements suggested.

[QUOTE=Dark Photon;1279966]For a desktop GPU, the conventional way to deal with that is to put a glFinish after your SwapBuffers call (and only after your SwapBuffers call). After that glFinish, you know that the GPU has processed all of the calls for the previous frame, and performed any frame post-processing required to present the frame you just submitted to the user.

This also has the benefit of synchronizing your draw thread with the frame clock, which has some nice benefits in terms of providing consistent end-to-end latency through the system.

Don’t do this on most mobile GPUs though.

Alternatively, use the sync object method GClements suggested.[/QUOTE]

glFinish didn't make much difference. I guess that's because glMapBuffer effectively does that already. I will try sync objects.

I looked into sync objects. It seems to me that what they do is put a fence after a certain command; you can then wait on the fence and be sure that the command issued just before it is done. However, I don't see how that would make glMapBuffer wait a shorter time, since I guess something like glFinish happens inside that command anyway.

I also looked into PBOs. They seem very promising since, from what I've understood, once the GPU is done they write the output texture to a given place in memory. However, the problem with PBOs is that what you receive is based on the output of the fragment shader, and I would not want to render the scene twice.

I guess I could use a 1x1-pixel texture as the output for the PBO, but then I would need to discard fragments so that not every fragment writes to the texture; otherwise I would just keep overwriting the value I want with something like 0. But if I discard fragments all the time, the scene will not be visible in my main framebuffer.

So what would be needed is either to read what is already in the PBO texture from the fragment shader and write the 1x1 pixel back with its existing value (though that would probably make data races very common and give bad results, since many fragments are processed in parallel), or to discard only the part of the fragment output that writes to the PBO, which I guess would be best.

Something that would be really cool would be if, after a buffer command has completed on the GPU, you could have a specified function run in a new thread. Then I could run my function right after the rendering is complete and everything would be awesomely synced and working. I guess that could come in handy in many situations where you don't want the CPU to wait for the GPU to finish something.

I would love to have my theories confirmed, since I'm not sure I'm right about what I said :slight_smile: Thanks!

[QUOTE=Trionet;1279990]I looked into sync objects. It seems to me that what they do is put a fence after a certain command; you can then wait on the fence and be sure that the command issued just before it is done. However, I don't see how that would make glMapBuffer wait a shorter time, since I guess something like glFinish happens inside that command anyway.
[/QUOTE]
glFinish() waits until the command queue is empty. glMapBuffer() only needs to wait until there are no pending commands which could modify the buffer (for some value of “could”; make sure you unbind it after the call which writes to it).

By the way, I tried GL_MAP_UNSYNCHRONIZED_BIT, with and without GL_MAP_READ_BIT, on glMapBufferRange. It didn't work, in the sense that p was always nullptr and the CPU still ran at 100%. How come?

I didn't unbind before. However, the gain was about 2% CPU, I think; maybe no gain at all.

According to the glMapBufferRange reference page:

GL_INVALID_OPERATION is generated for any of the following conditions:

  • length is zero.

  • The buffer object is already in a mapped state.

  • Neither GL_MAP_READ_BIT nor GL_MAP_WRITE_BIT is set.

  • GL_MAP_READ_BIT is set and any of GL_MAP_INVALIDATE_RANGE_BIT, GL_MAP_INVALIDATE_BUFFER_BIT or GL_MAP_UNSYNCHRONIZED_BIT is set.

  • GL_MAP_FLUSH_EXPLICIT_BIT is set and GL_MAP_WRITE_BIT is not set.

  • Any of GL_MAP_READ_BIT, GL_MAP_WRITE_BIT, GL_MAP_PERSISTENT_BIT, or GL_MAP_COHERENT_BIT are set, but the same bit is not included in the buffer’s storage flags.

If you set GL_MAP_UNSYNCHRONIZED_BIT but neither GL_MAP_READ_BIT nor GL_MAP_WRITE_BIT, it will fail for the third case above (neither read nor write bit set).

If you set GL_MAP_UNSYNCHRONIZED_BIT together with GL_MAP_READ_BIT, it will fail for the fourth case above (GL_MAP_READ_BIT combined with GL_MAP_UNSYNCHRONIZED_BIT).

If you set GL_MAP_UNSYNCHRONIZED_BIT together with GL_MAP_WRITE_BIT, it should return a valid pointer, but you can't write to that region (doing so will result in undefined behaviour, e.g. an access violation).

[QUOTE=GClements;1280003]
If you set GL_MAP_UNSYNCHRONIZED_BIT but neither GL_MAP_READ_BIT nor GL_MAP_WRITE_BIT, it will fail for the third case above (neither read nor write bit set).

If you set GL_MAP_UNSYNCHRONIZED_BIT together with GL_MAP_READ_BIT, it will fail for the fourth case above (GL_MAP_READ_BIT combined with GL_MAP_UNSYNCHRONIZED_BIT).

If you set GL_MAP_UNSYNCHRONIZED_BIT together with GL_MAP_WRITE_BIT, it should return a valid pointer, but you can't write to that region (doing so will result in undefined behaviour, e.g. an access violation).[/QUOTE]

Ooh, I thought it was only reading that was possible.

However, when I use GL_MAP_UNSYNCHRONIZED_BIT and GL_MAP_WRITE_BIT it behaves just as before, with the same CPU usage and all. Isn't that weird?

Sorry, I meant to say that you can’t read from the region, only write to it.