Trying to understand an issue with indirect drawing and the data struct of the indirect buffer

Hi, Y’all!
A few days ago I went back to my little project, at the point where I was stuck on the performance of a glDrawElementsIndirect call whose instanceCount parameter comes from a query object that receives its value from a transform feedback step. I write the result with glGetQueryObjectuiv, using the GL_ARB_query_buffer_object extension to avoid server <-> client synchronization:

[CODE=cpp]
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, my_gIndirectBuffer);
glBindBuffer(GL_QUERY_BUFFER, my_gIndirectBuffer);
glGetQueryObjectuiv(_query, GL_QUERY_RESULT, BUFFER_OFFSET(offsetof(struct DrawElementsIndirectCommand, instanceCount)));
// bind the VAO and VBO and set the attributes here (the VBO comes from the TF pass)
glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (GLvoid*)0);
[/CODE]

This works: I get the [i]instanceCount[/i] that I need into the indirect buffer. However, I have found that it affects performance in the same way as if client-server synchronization were occurring. While testing, I got a result that I can't understand. In the data struct of the indirect buffer, I changed the type of instanceCount from (GLuint instanceCount) to (GLuint* instanceCount):
[CODE=cpp]
	struct DrawElementsIndirectCommand {
		GLuint  count;
		GLuint*  instanceCount;
		GLuint  firstIndex;
		GLuint  baseVertex;
		GLuint  baseInstance;
	};
[/CODE]

I know, I know! This is probably far from the solution I'm looking for... but it gives me half a solution! For some reason, performance now behaves as I would expect if there were no synchronization (GPU usage goes from 80% to 97-98%), but only around half of the instances that should be drawn actually are. That probably means the glGetQueryObjectuiv command is being ignored.

I know it’s wrong and it’s not a valid solution, but I’d like to know why this happens.

[QUOTE=martel;1293575]
This works: I get the instanceCount that I need into the indirect buffer. However, I have found that it affects performance in the same way as if client-server synchronization were occurring. While testing, I got a result that I can't understand. In the data struct of the indirect buffer, I changed the type of instanceCount from (GLuint instanceCount) to (GLuint* instanceCount):

[CODE=cpp]
struct DrawElementsIndirectCommand {
GLuint count;
GLuint* instanceCount;
GLuint firstIndex;
GLuint baseVertex;
GLuint baseInstance;
};
[/CODE]

I know, I know! This is probably far from the solution I'm looking for... but it gives me half a solution! For some reason, performance now behaves as I would expect if there were no synchronization (GPU usage goes from 80% to 97-98%), but only around half of the instances that should be drawn actually are. That probably means the [i][b]glGetQueryObjectuiv[/b][/i] command is being ignored.

I know it's wrong and it's not a valid solution, but I'd like to know why this happens.[/QUOTE]
By changing the type of the second member from a 32-bit type to a 64-bit type, you're moving the other 3 members. So what glDrawElementsIndirect() actually sees is either:
[code=cpp]
count = count
instanceCount = <uninitialised>
firstIndex = instanceCount
baseVertex = <uninitialised>
baseInstance = firstIndex
[/code]

or:

[code=cpp]
count = count
instanceCount = instanceCount
firstIndex = <uninitialised>
baseVertex = firstIndex
baseInstance = baseVertex
[/code]

The former is the case for the default alignment rules on most 64-bit platforms, the latter if pointers are 32-bit aligned (e.g. for a "packed" structure). On a 32-bit architecture (where GLuint and GLuint* are the same size), it wouldn't make any difference.

The structure that glDrawElementsIndirect actually uses is the one given in the specification (and reference pages). Using a different structure in application code is just going to result in the application code writing values to the wrong offsets.
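
To make the offset shift concrete, here is a minimal sketch that compares the two layouts at compile time. The names SpecCommand, PointerCommand and kPointerLayoutMatches are made up for illustration, and std::uint32_t stands in for GLuint so the snippet builds without any GL headers.

```cpp
#include <cstddef>
#include <cstdint>

// Layout with five consecutive 32-bit fields, as in the original struct.
// std::uint32_t stands in for GLuint (assumption: GLuint is 32 bits wide).
struct SpecCommand {
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::uint32_t baseVertex;
    std::uint32_t baseInstance;
};

// Hypothetical name for the modified struct from the question:
// the second member is now a pointer instead of a 32-bit integer.
struct PointerCommand {
    std::uint32_t  count;
    std::uint32_t* instanceCount;
    std::uint32_t  firstIndex;
    std::uint32_t  baseVertex;
    std::uint32_t  baseInstance;
};

// The 32-bit layout is 20 bytes with fields at offsets 0, 4, 8, 12, 16.
static_assert(sizeof(SpecCommand) == 20, "five tightly packed 32-bit fields");
static_assert(offsetof(SpecCommand, instanceCount) == 4, "second field at byte 4");

// True only if every field of the pointer variant sits at the same
// byte offset as in the 32-bit layout.
constexpr bool kPointerLayoutMatches =
    offsetof(PointerCommand, count)         == offsetof(SpecCommand, count) &&
    offsetof(PointerCommand, instanceCount) == offsetof(SpecCommand, instanceCount) &&
    offsetof(PointerCommand, firstIndex)    == offsetof(SpecCommand, firstIndex) &&
    offsetof(PointerCommand, baseVertex)    == offsetof(SpecCommand, baseVertex) &&
    offsetof(PointerCommand, baseInstance)  == offsetof(SpecCommand, baseInstance);
```

On a build with 8-byte pointers kPointerLayoutMatches should come out false, because the pointer pushes firstIndex and the members after it to later offsets. With 4-byte pointers the two layouts coincide.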

I suspect that your initial problem is that you're still stalling the GPU pipeline, but synchronising on the GPU rather than the CPU. If you retrieve the value of a query object immediately following the operation which generates that value, the GPU has to wait for the first operation to complete before it can start the second operation. If it didn't have to do that, it could probably initiate (or at least prepare) the memory transfers for the draw command while the previous command was still processing.

As a general guide, you should try to interleave dependencies. If A must complete before B can start and C must complete before D can start (but with no other dependencies), execute A,C,B,D, so that the transition between A and B can be overlapped with the execution of C, and the transition between C and D can be overlapped with the execution of B.
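
The effect of that ordering can be illustrated with a toy model. This is only a sketch under made-up assumptions, not a description of how any real driver or GPU schedules work: operations run one after another in issue order, each result becomes visible a fixed latency after its operation finishes, and a dependent operation cannot start before the result it needs is visible. Op, total_time and all the numbers are hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// One operation in the toy model.
struct Op {
    int exec;         // execution time in arbitrary units
    std::string dep;  // name of the operation whose result is needed, empty if none
};

// Returns the time at which the last operation finishes when the
// operations are issued in the given order. Every dependency must
// appear earlier in the order than the operation that needs it.
int total_time(const std::vector<std::string>& order,
               const std::map<std::string, Op>& ops,
               int latency)
{
    std::map<std::string, int> visible;  // time at which each result becomes visible
    int now = 0;
    for (const std::string& name : order) {
        const Op& op = ops.at(name);
        int start = now;
        if (!op.dep.empty()) {
            // Stall until the dependency's result is visible.
            start = std::max(start, visible.at(op.dep));
        }
        now = start + op.exec;
        visible[name] = now + latency;
    }
    return now;
}
```

With A, B, C and D each taking 4 units, B depending on A, D depending on C, and a latency of 2, this model gives 20 units for the order A,B,C,D and 16 units for A,C,B,D, because the wait for each result is hidden behind the execution of an independent operation.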