
View Full Version : Timing transform feedback



mobeen
06-29-2011, 12:05 AM
Hi all,
I want to know the amount of time taken by transform feedback mechanism. Currently, I am doing it like this (in pseudocode),


start timer;
glBindVertexArray(...);
glBindBufferBase(...);
glEnable(GL_RASTERIZER_DISCARD); // disable rasterization
glBeginTransformFeedback(GL_POINTS);
glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, query);
glDrawArrays(GL_POINTS, 0, MAX_MASSES);
glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);
stop timer;

This, however, returns 0 seconds. Is this the correct way?

Alfonse Reinheart
06-29-2011, 12:40 AM
OpenGL operations are typically asynchronous. They will be executed sometime after you actually call those commands.

If you want to know how long something takes on the GPU, use ARB_timer_query (http://www.opengl.org/registry/specs/ARB/timer_query.txt), which was made core in GL 3.3.
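For reference, a minimal sketch of what Alfonse is suggesting, assuming a GL 3.3+ context is current and the function pointers are already loaded (error handling omitted; the blocking nature of the final read is discussed later in this thread):

```c
#include <stdio.h>

/* Assumes a GL 3.3+ header/loader (e.g. glcorearb.h) is included
   and a context is current. */
void time_gpu_work(void)
{
    GLuint query;
    GLuint64 elapsed_ns = 0;

    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    /* ... the GL commands to be timed, e.g. the transform feedback pass ... */
    glEndQuery(GL_TIME_ELAPSED);

    /* Blocks until the GPU has produced the result. */
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_ns);
    printf("GPU time: %.3f ms\n", elapsed_ns / 1000000.0);

    glDeleteQueries(1, &query);
}
```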

mobeen
06-29-2011, 02:09 AM
Thanks for the prompt reply Alfonse.

mobeen
06-29-2011, 04:00 AM
OK, I have got my times. One question though: I think that for a correct time measurement I should call glFinish to make sure every GL call has finished before the call to glEndQuery(GL_TIME_ELAPSED).
If I don't call glFinish, the times are significantly different. Should my timing code call glFinish?

Alfonse Reinheart
06-29-2011, 04:30 AM
You need to call glFlush to make sure that the time-elapsed token isn't submitted late, but you don't need glFinish.

mobeen
06-29-2011, 04:38 AM
ok thanks for that.

Aleksandar
06-29-2011, 06:02 AM
No, you don't need to call glFinish()/glFlush() at all!

The whole purpose of ARB_timer_query is to enable asynchronous measurement. Both of those commands are very heavyweight synchronization points.

glGetQueryObject*() is a blocking function, so it is not very useful to call it immediately after glEndQuery(GL_TIME_ELAPSED). Call glGetQueryObject*() as late as possible, or, even better, in the next frame.
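One way to read the result "as late as possible" is to double-buffer the query objects and read last frame's result while issuing this frame's query. This is only a sketch of that idea; the names (NUM_QUERIES, timed_frame, draw_scene) are illustrative:

```c
#include <stdio.h>

enum { NUM_QUERIES = 2 };
static GLuint queries[NUM_QUERIES];   /* created once with glGenQueries */
static unsigned frame = 0;

void timed_frame(void)
{
    unsigned cur  = frame % NUM_QUERIES;
    unsigned prev = (frame + 1) % NUM_QUERIES;

    glBeginQuery(GL_TIME_ELAPSED, queries[cur]);
    draw_scene();                     /* hypothetical: the work to time */
    glEndQuery(GL_TIME_ELAPSED);

    /* Read the result issued one frame ago; by now it has almost
       certainly completed, so this read should not stall. */
    if (frame > 0) {
        GLuint64 ns;
        glGetQueryObjectui64v(queries[prev], GL_QUERY_RESULT, &ns);
        printf("previous frame: %.3f ms\n", ns / 1000000.0);
    }
    frame++;
}
```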

mobeen
06-29-2011, 08:47 AM
No, you don't need to call glFinish()/glFlush() at all!

The whole purpose of ARB_timer_query is to enable asynchronous measurement. Both of those commands are very heavyweight synchronization points.

Hi Aleksandar,
Thanks for the info. Now I'm confused, since removing the glFlush call gives a significantly different time.

Alfonse Reinheart
06-29-2011, 10:03 AM
That's because he doesn't understand the issue.

When you execute an OpenGL command, it may not yet be put into the GPU command stream. It may wait in an internal buffer somewhere until the GPU's command stream is empty or nearly so.

If the command stream completely empties, then some part of the GPU is idle. If the timer query token was not placed into the command stream yet, then you will be measuring the time it takes for the token to be placed into the stream in addition to the time it takes for the commands to execute.

Hence the need to flush. And do note that the flush should happen *after* the glEndQuery call, not before it.

Aleksandar
06-29-2011, 11:10 AM
Maybe I didn't understand the problem, but how can you be sure that you did? :)

I assumed that there are other commands executed after glEndQuery(), and also that there is at least a SwapBuffers() call, which internally calls glFlush().


The timer is started or stopped when the effects from all previous commands on the GL client and server state and the framebuffer have been fully realized.
Having a glFlush() call before glEndQuery() wouldn't give the correct time, since the glFlush() would also be included in the measured range.


When you execute an OpenGL command, it may not yet be put into the GPU command stream. It may wait in an internal buffer somewhere until the GPU's command stream is empty or nearly so.
The previous two sentences are contradictory. Or I have misunderstood something again. Well, there is a command buffer, and it is flushed when it is full or when glFlush()/glFinish() is called. That is a consequence of the old and well-known client/server organization of OpenGL. Can you post a link to that "stream-based" solution that pulls commands from the client's internal buffers? It is quite a new concept to me.

aqnuep
06-29-2011, 11:37 AM
I can also say that there is no need to call glFlush. ARB_timer_query provides a transparent way to measure server-side time for any particular set of commands without affecting the rest of the code (at least, that is how I understand it, as the extension spec doesn't say anything about requiring a glFlush to get accurate measurements).

Alfonse Reinheart
06-29-2011, 11:37 AM
Having glFlush() call before glEndQuery() wouldn't give correct time since glFlush() would also be included in the total range.

I didn't say anything about calling glFlush before glEndQuery. Indeed, I said the opposite: "And do note that the flush should happen *after* the glEndQuery call, not before it."


The previous two sentences are contradictory. Or I have misunderstood something again.

This Wiki article (http://www.opengl.org/wiki/Synchronization) explains how synchronization works.

Just because you call a function does not mean that the corresponding GPU command has been issued to the GPU. That's the point I'm getting at. If the GPU's command buffer empties while there are commands waiting to be processed, then that stall will be part of the timing.

Now granted, one might expect glEndQuery to perform a flush internally.

Aleksandar
06-29-2011, 01:38 PM
I didn't say anything about calling glFlush before glEndQuery.
I didn't say that you had said that. I just wanted to emphasize why someone shouldn't do that.


This Wiki article (http://www.opengl.org/wiki/Synchronization) explains how synchronization works.
This Wiki article is written by you. Are you working for some hardware vendor? If not, what is the source of those claims?


Just because you call a function does not mean that the corresponding GPU command has been issued to the GPU. That's the point I'm getting at.
I agree with that. That's quite clear.


If the GPU's command buffer empties while there are commands waiting to be processed, then that stall will be part of the timing.
That's a little bit odd, because it is not quite clear why those waiting commands are not pulled when the command queue is empty. For me, the story about the client/server organization is more digestible.


Now granted, one might expect glEndQuery to perform a flush internally.
glEndQuery() doesn't perform a flush internally. The purpose of the timer query extension is to avoid synchronization stalls. Unless one calls glGetQueryObject*() while the counter is not yet ready, there are no stalls at all.

Alfonse Reinheart
06-29-2011, 04:03 PM
This Wiki article is written by you. Are you working for some hardware vendor? if not, what is the source of those claims?

The OpenGL specification.


That's a little bit odd, because it is not quite clear why those awaiting commands are not pulled if the command queue is empty.

Because the CPU has to put them there. If the command queue is full when you call an OpenGL function, then it can't put it there. Therefore, the driver must wait until sometime later. Even if the driver is threaded, that doesn't ensure that it will have a timeslice available when the queue starts to empty.


The purpose of timer query extension is to avoid any synchronization stalls.

It most certainly isn't. The point is to get accurate timings for OpenGL operations. GPU timings. A flush is a CPU stall, not a GPU stall.

Also, either it performs a flush or you do; there's no other way to get accurate GPU timings.

The glBeginQuery call will attempt to put some kind of token into the command stream that tells the GPU to start the clock. The glEndQuery call must therefore put a token into the command stream that causes it to stop the clock. The only way to get an accurate timing from one to the other is to ensure that there are no GPU stalls between the begin and the end (other than those caused by the regular processing of the GPU commands, of course).

And there's only one tool for doing that: a flush. Halt execution of the user's code while constantly polling the GPU command queue, putting tokens in as fast as possible until all have been issued to the queue.

Is it possible that the driver has some mechanism for doing so that doesn't stall the CPU? Possibly. But something's going to have to ensure that there are no GPU issuing delays.

This is ultimately the same reason you have to use glFlush when you create fence objects with ARB_sync: to ensure that the token is added to the command stream in reasonable time.
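The fence-plus-flush pattern described here might be sketched like this (illustrative only, assuming a GL 3.2+/ARB_sync context):

```c
/* Insert a fence after some GL work, flush so the fence token actually
   reaches the GPU, then poll it later without blocking. */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();   /* without this, the fence may sit unsubmitted indefinitely */

/* ... later, e.g. next frame ... */
GLenum status = glClientWaitSync(fence, 0, 0);  /* timeout 0: just poll */
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
    /* everything issued before the fence has completed on the GPU */
}
glDeleteSync(fence);
```

Alternatively, passing GL_SYNC_FLUSH_COMMANDS_BIT as the flags argument to glClientWaitSync performs the flush for you on the first wait.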

mobeen
06-29-2011, 07:20 PM
Thanks for a healthy discussion Alfonse, aqnuep and Aleksandar.
Ok so this is how I am doing it now. Is this correct?


glBeginQuery(GL_TIME_ELAPSED,t_query);
glBindVertexArray( vaoUpdateID[writeID]);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, vboID_Pos[readID]);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 1, vboID_Vel[readID]);
glEnable(GL_RASTERIZER_DISCARD);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, MAX_MASSES);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);
glEndQuery(GL_TIME_ELAPSED);
glFlush();
// get the query result
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);
printf("Time Elapsed: %f ms\n", elapsed_time / 1000000.0);

aqnuep
06-30-2011, 03:25 AM
It should work as you would expect. The only question is the necessity of the glFlush after glEndQuery; however, let's not argue about that.
Maybe you can try it with and without a few times and tell us the results. I know it wouldn't prove anything if the query results were the same, but it could prove Alfonse's theory if there are big differences between the two (more precisely, if the value without glFlush is noticeably higher).

Aleksandar
06-30-2011, 04:07 AM
This Wiki article is written by you. Are you working for some hardware vendor? if not, what is the source of those claims?

The OpenGL specification.


I beg you to point me to the file/chapter/page where that is defined in the spec. You have used a description of how drivers might implement command execution. That is not prescribed by the spec, so it shouldn't be part of it. The word "driver" is rarely used throughout the spec, and it is easy to find all occurrences. The states you use in the Wiki articles are not part of the spec either, although your classification of states is pretty reasonable. I'm just curious whether that is really implemented somewhere.


It most certainly isn't. The point is to get accurate timings for OpenGL operations. GPU timings.
Depends on what you want to measure. It is used for measuring GPU time, but, as you said, there is no guarantee that the same set of instructions will have the same execution time over multiple invocations. There are a lot of driver optimizations, as well as other circumstances, that can have an impact on execution time.


A flush is a CPU stall, not a GPU stall.


The command
void Flush( void );
indicates that all commands that have previously been sent to the GL must complete in finite time.

Why should this stall the CPU? Unlike glFinish(), which is a blocking function, glFlush() just flushes the command buffer. Of course, if we are strictly speaking about execution time, there is the time the driver needs to execute the command and flush the command buffer, but the CPU penalty is probably not great. It would have to be measured. :confused:


This is ultimately the same reason you have to use glFlush when you create fence objects with ARB_sync: to ensure that the token is added to the command stream in reasonable time.
The reason for using glFlush() in synchronization across multiple contexts is the fact that we have to deal with totally independent GL servers. Each rendering context is a GL server per se, with its own command buffer. It is even possible to wait forever if you are blocked on another thread whose associated context has just a few commands issued (and waiting to be flushed).


Ok so this is how I am doing it now. Is this correct?
It's OK if you don't care about overall performance. Try measuring the time with and without glFlush() and report the differences; I'm wondering whether there are any. You can improve performance if you remove glFlush() and defer glGetQueryObjectui64v() as much as possible. The best solution is to display the results from the previous frame.

mobeen
06-30-2011, 04:14 AM
Hi,
As far as the results are concerned, there isn't much difference if I remove glFlush call.

Aleksandar
06-30-2011, 04:26 AM
The results might differ if you are waiting for something in your code and don't issue GL commands after the feedback code you want to measure. That's what Alfonse wanted to say. But it is a rather pathological case. In the spirit of good graphics programming, you shouldn't block drawing while waiting for some other calculation to finish (or, even worse, for user input). So our assumption about your code was right. I'm glad for that. :)

Dan Bartlett
06-30-2011, 06:31 AM
Thanks for a healthy discussion Alfonse, aqnuep and Aleksandar.
Ok so this is how I am doing it now. Is this correct?


glBeginQuery(GL_TIME_ELAPSED,t_query);
glBindVertexArray( vaoUpdateID[writeID]);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, vboID_Pos[readID]);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 1, vboID_Vel[readID]);
glEnable(GL_RASTERIZER_DISCARD);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, MAX_MASSES);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);
glEndQuery(GL_TIME_ELAPSED);
glFlush();
// get the query result
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);
printf("Time Elapsed: %f ms\n", elapsed_time / 1000000.0);

It's okay if you don't care about overall performance (e.g. the code won't be included in the finished app), but querying the result with

glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);

directly after ending the query will cause a stall until all the commands issued before the end of the query are fully complete. Including your own glFlush() or glFinish() call should have no effect on performance, since querying the result with glGetQueryObjectui64v + GL_QUERY_RESULT is almost like issuing a glFinish() call: it can't return until all the previous commands have finished.

You either want to query the result at a later stage:

glBeginQuery(GL_TIME_ELAPSED,t_query);
...
glEndQuery(GL_TIME_ELAPSED);
// ... some time much later, perhaps next frame ...
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);

Or you can check, without stalling, whether the result is available before actually asking for it:

glGetQueryObjectiv(t_query, GL_QUERY_RESULT_AVAILABLE, &available);

where you could do some CPU work in a loop while waiting for the result to become available.
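That polling pattern might look like this, reusing the variable names from mobeen's snippet (do_some_cpu_work is a hypothetical placeholder for useful client-side work):

```c
GLint available = 0;
GLuint64 elapsed_ns = 0;

/* Poll for availability instead of blocking inside GL_QUERY_RESULT. */
while (!available) {
    glGetQueryObjectiv(t_query, GL_QUERY_RESULT_AVAILABLE, &available);
    if (!available)
        do_some_cpu_work();  /* hypothetical: CPU-side work done while waiting */
}
/* The result is now ready; this read will not stall. */
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_ns);
printf("Time Elapsed: %f ms\n", elapsed_ns / 1000000.0);
```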

Alfonse Reinheart
06-30-2011, 04:29 PM
Why this should stall CPU?

The only way to ensure that all previously given OpenGL commands complete in "finite time" is for the client to send those commands to the server. If the server command queue is full, then the client (CPU) must stall until that command queue empties out enough to accept all of the waiting commands.

If an OpenGL implementation is threaded, then it is possible for a GL client thread to exist that will ensure that all OpenGL commands are completed in "finite time". In which case, a glFlush is effectively a no-op. But this is implementation-dependent; it's better to assume that glFlush will induce a CPU stall than to think that it won't.


The reason for using glFlush() in synchronization across multiple contexts is the fact we have to deal with totally independent GL servers.

ARB_sync has nothing to do with synchronization across multiple contexts specifically. It introduces sync objects and fence syncs; you can use those just fine in a single context application.

The purpose of fences is to be able to test whether the GPU has completed a particular task (defined by the point where you sent the sync command) or to simply be able to do a glFinish, but only to a certain point in the command stream.

Look at how the spec states it:



If the sync object being blocked upon will not be signaled in finite time (for example, by an associated fence command issued previously, but not yet flushed to the graphics pipeline), then ClientWaitSync may hang forever.

It says nothing about only flushing with multiple contexts. It simply says that you need to ensure that a fence is flushed manually at some point before you can start listening for it.

It does mention multiple contexts a bit later, but only to say that it requires more work than the single context case. You still need that flush regardless of how many contexts you have.

aqnuep
07-01-2011, 01:05 AM
Why this should stall CPU?

The only way to ensure that all previously given OpenGL commands complete in "finite time" is for the client to send those commands to the server. If the server command queue is full, then the client (CPU) must stall until that command queue empties out enough to accept all of the waiting commands.

If an OpenGL implementation is threaded, then it is possible for a GL client thread to exist that will ensure that all OpenGL commands are completed in "finite time". In which case, a glFlush is effectively a no-op. But this is implementation-dependent; it's better to assume that glFlush will induce a CPU stall than to think that it won't.

Very good point. Even though drivers nowadays are multithreaded and I'm pretty sure they try to avoid application-side CPU stalls at all costs, it is better not to build on such assumptions.


ARB_sync has nothing to do with synchronization across multiple contexts specifically. It introduces sync objects and fence syncs; you can use those just fine in a single context application.

While that is true, I would point out that ARB_sync's main use cases are still synchronization across multiple GL contexts and synchronization between OpenCL and OpenGL operations. I cannot really think of any particular use case where ARB_sync provides more flexibility than the regular explicit and implicit synchronization mechanisms, but please prove me wrong, as I'm interested.

Aleksandar
07-01-2011, 12:27 PM
ARB_sync has nothing to do with synchronization across multiple contexts specifically. It introduces sync objects and fence syncs; you can use those just fine in a single context application.
Although it is possible to use sync objects in a single context, there is probably only a very limited number of use cases where that is useful. Frankly, I cannot find even one. Potentially, they could be used to signal the client that some operation has finished on the server side by unblocking ClientWaitSync(), but I have never had a need for that.

Waiting on the server side for something to finish inside a single context is meaningless, since GL guarantees in-order execution of issued commands. Please correct me if I'm wrong.

So, as Daniel already said, main use cases are synchronization across multiple GL contexts and synchronization between OpenCL and OpenGL.

Alfonse Reinheart
07-01-2011, 12:59 PM
Frankly, I cannot find even one

Really? I can think of a few:

1: Finding out when a read into a PBO has finished. That way, you're not stalling the CPU waiting for one.

2: Finding out when the GPU is finished with a buffer, so that you're free to write to it without stalling the CPU, and without allocating another block of memory (ie: being able to use unsynchronized mapping without flushing the buffer).

3: Asking if a timer is finished without blocking the CPU to wait for it.

4: Finding out if an occlusion query is finished without blocking the CPU for it (though that one isn't so important these days thanks to conditional rendering).

And that's just what I thought up off the top of my head.
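Use case 2 might be sketched as follows; the names buf and size are illustrative, and the fence is assumed to have been created right after the last draw call that reads the buffer:

```c
/* After the last command that reads 'buf': */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();

/* ... later, before overwriting the buffer ... */
GLenum status = glClientWaitSync(fence, 0, 0);  /* timeout 0: just poll */
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED) {
    /* GPU is done with the buffer: an unsynchronized map is now safe. */
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT |
                                 GL_MAP_UNSYNCHRONIZED_BIT);
    /* ... write new vertex data through ptr ... */
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glDeleteSync(fence);
}
```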

aqnuep
07-01-2011, 02:03 PM
Alfonse, you have a point; however, I think there are much better ways to deal with such issues.


3: Asking if a timer is finished without blocking the CPU to wait for it.

What kind of timer? You mean a timer query? You can ask for the completeness of a timer query without sync objects.


4: Finding out if an occlusion query is finished without blocking the CPU for it (though that one isn't so important these days thanks to conditional rendering).

Same here. You could always query whether the results of the occlusion query are available; the same goes for all asynchronous queries. Also, ARB_occlusion_query2 goes even further, as it can theoretically make the result available as soon as the first fragment has passed the depth test, so, again, theoretically it is unnecessary to wait for the query to finish.

Aleksandar
07-01-2011, 02:45 PM
2: Finding out when the GPU is finished with a buffer, so that you're free to write to it without stalling the CPU, and without allocating another block of memory (ie: being able to use unsynchronized mapping without flushing the buffer).
This use case sounds interesting (unlike the rest). I was not aware of that, since I'm not using VBO mapping.

Nevertheless, thank you for hints on single-context usage of sync objects.

I also apologize for steering the discussion away from the OP's main question. But I hope it was useful to clear up some facts about timing and synchronization in OpenGL.