Response and explanation of my logic:
Exactly. Let’s say you capture the GPU time at program start and call it A. You then capture the CPU time as soon as A is available and call it B; there will always be a small delay between the moment A was captured and the moment its value is read back into the C++ code. A and B come from unrelated counters, but we know they were taken at almost the same absolute real time, except that B was taken X nanoseconds after A. Now, the GPU and CPU timestamps are calculated by subtracting A and B respectively, so Tgpu = gpu_now - A and Tcpu = cpu_now - B. If the GPU and CPU are synchronized (let’s ignore the delay I’m trying to find for now), Tgpu will be Tcpu + X, am I right?
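In code, the capture at program start might look like this. A minimal sketch, not the actual code: the variable names, the use of std::chrono::steady_clock, and the assumption of a current OpenGL 3.3+ context are all mine.

#include <chrono>
#include <GL/glew.h>

GLuint64 A = 0;                           // GPU timestamp at start, in ns
std::chrono::steady_clock::time_point B;  // CPU timestamp at start

void capture_start_times()
{
    GLuint q;
    glGenQueries(1, &q);
    glQueryCounter(q, GL_TIMESTAMP);                // GPU latches A
    // Blocking read: returns once the GPU has written the result,
    // so B below is captured X ns after A was latched.
    glGetQueryObjectui64v(q, GL_QUERY_RESULT, &A);
    B = std::chrono::steady_clock::now();
    glDeleteQueries(1, &q);
}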
[QUOTE=Aleksandar;1261431]
I don’t get this point. Tgpu2-Tgpu1 if it is perfectly synchronized with the CPU, should be equal to Tcpu2-Tcpu1. If they are not, you’ll actually measure latency of the results reading, not waiting in commands queues or something that precedes a GPU execution.[/QUOTE]
According to what you say, Tgpu2 - Tgpu1 = Tcpu2 - Tcpu1 every time, if both counters have the same nanosecond precision, right? That said, I can calculate the delay I want as Tgpu1 - Tcpu1, because Tgpu = gpu_time - A and Tcpu = cpu_time - B, where A and B are counter values from different clocks that refer to (almost) the same absolute time point. Tgpu1 and Tcpu1 are both durations measured from the start of the program, not durations from two counters initiated at two different, unrelated, undefined time points. Am I making sense?
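To make the arithmetic concrete, here is a made-up numeric example (every value invented for illustration). Suppose B was captured X = 200 ns after A, a GL call is issued 1,000,000 ns after A, and the GPU actually executes it 500 ns later:

Tcpu1 = 1,000,000 - 200 = 999,800 ns   (measured from B)
Tgpu1 = 1,000,500 ns                   (measured from A)
Tgpu1 - Tcpu1 = 700 ns = 500 ns of pipeline delay + the 200 ns offset X

So the measured delay carries the constant offset X inside it, consistent with the Tgpu = Tcpu + X relation above.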
[QUOTE=Aleksandar;1261431]
Or maybe I missed something… Also, the precision of the counters is not the same. I don’t know how clock::now works. [/QUOTE]
My CPU timer has 1 ns precision, and so does the GPU timer, I hope.
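Rather than hoping, both resolutions can be checked. A sketch, assuming C++11 <chrono> and an OpenGL 3.3+ context; note that GL timestamps are specified in nanoseconds, but GL_QUERY_COUNTER_BITS only reports the counter width, and the effective resolution is implementation-dependent:

#include <chrono>
#include <cstdio>
#include <GL/glew.h>

void report_timer_precision()
{
    // CPU: tick period of the clock, in nanoseconds
    using period = std::chrono::high_resolution_clock::period;
    std::printf("CPU tick: %g ns\n",
                1e9 * double(period::num) / double(period::den));

    // GPU: width of the timestamp counter; 0 means timestamps are unsupported
    GLint bits = 0;
    glGetQueryiv(GL_TIMESTAMP, GL_QUERY_COUNTER_BITS, &bits);
    std::printf("GPU timestamp counter: %d bits\n", bits);
}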
[QUOTE=Aleksandar;1261431]
Also, I don’t get why you are using glFenceSync/glClientWaitSync. It can be used glGetQueryObjectui64v instead with probably less overhead.[/QUOTE]
You are right, but that was just to make the synchronization point explicit. Basically:
DELAY CALCULATION:
Program initialization:
glQueryCounter(id,GL_TIMESTAMP);
glGetQueryObjectui64v(id,GL_QUERY_RESULT,&gpu_start_time); // blocks until the GPU result (A) is ready
cpu_start_time = clock::now(); // B, taken X ns after A
During execution:
cpu_begin_time = clock::now();
glQueryCounter(id2,GL_TIMESTAMP);
{
// gl calls being timed
}
glQueryCounter(id3,GL_TIMESTAMP);
cpu_end_time = clock::now();
When results are available:
// get query results
glGetQueryObjectui64v(id2,GL_QUERY_RESULT,&gpu_begin_time);
glGetQueryObjectui64v(id3,GL_QUERY_RESULT,&gpu_end_time);
// calculate relative timestamps from start of program
// (with std::chrono, store these as durations: time_point - time_point)
cpu_begin_time -= cpu_start_time;
cpu_end_time -= cpu_start_time;
gpu_begin_time -= gpu_start_time;
gpu_end_time -= gpu_start_time;
// calculate durations
gpu_duration = gpu_end_time - gpu_begin_time;
cpu_duration = cpu_end_time - cpu_begin_time;
// calculate OpenGL pipeline delay (from call to execution)
async_delay = gpu_begin_time - cpu_begin_time;
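And since you suggested dropping the fence: the "when results are available" step can poll the query object itself, which never stalls. A sketch using the same ids as above:

GLint available = 0;
glGetQueryObjectiv(id3, GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    // id2 was issued before id3, so its result is available too
    glGetQueryObjectui64v(id2, GL_QUERY_RESULT, &gpu_begin_time);
    glGetQueryObjectui64v(id3, GL_QUERY_RESULT, &gpu_end_time);
    // ...then do the subtractions above
}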
Makes sense?