glDrawElements() blocking in CPU - takes 5ms to return

I am having a strange issue where calling glDrawElements appears to block, stalling the CPU while it waits for the function to return. I placed a system timer immediately before and after my call to glDrawElements, and the function consistently takes between 5 and 7 milliseconds to return.
First, this is very slow. I’m only rendering about 4000 vertices. Second, I thought the whole point of using VBOs was to eliminate immediate mode, so that glDrawElements would just issue commands to the GPU and the CPU can continue working while the GPU works at its own pace. This does not appear to be happening at all.

As I mentioned, my data is entirely stored in VBOs which are initialized and filled once. I am using very simple shaders. The vertex shader is handed a 20 element array of mat4s as uniforms before each render, and the pre-filled VBOs are merely activated, and vertex attribute pointers set.
The only thing I actually time is the call to glDrawElements, which, as I mentioned, takes about 5 to 7 ms.
Here are the general specs for my setup. Pretty modern. Should be fine.

System info:

Laptop: Dell XPS L511Z
Processor: Intel® Core™ i5-2410M CPU @ 2.30 GHz 2.30 GHz
Installed Memory (RAM) 6.00 GB.
System type: 64 bit operating system.
Operating system: Windows 7.

Graphics card info:

Card type: Nvidia GeForce GT 525M
Driver Version: 285.77
DirectX support: 11
CUDA Cores: 96
Graphics clock: 600 Mhz
Processor clock: 1200 Mhz
Memory clock: 900 Mhz (1800 Mhz data rate)
Memory interface: 128 bit
Total available graphics memory: 3797 MB
Dedicated video memory: 1024 MB DDR3
System video memory: 0 MB
Shared System Memory: 2773 MB
Video BIOS version: 70.08.53.00.07
IRQ: 16
Bus: PCI Express x 16 Gen 2

My application info:
Win32 project written in C++ using Visual Studio 2010.

Doesn’t glDrawElements pass indices? Still seems very slow. You should be using an element buffer I think. I am not sure, but graphics hardware seems to be only able to manage one task at a time. So there could be interaction with other applications running on your computer.

EDITED: You might also make sure that your indices are aligned. If you just allocated them with new (assuming C++) then you should be fine. Anyway, passing unaligned memory can be incredibly slow. Just a hunch.

glDrawElements doesn’t just draw. Most modern GPUs will operate in a “lazy mode” so any state changes, shader changes, texture changes, etc are cached locally by the driver as they happen, then evaluated/validated/etc when a draw call occurs. What you’re getting in your timing is the result of all of this as well as the cost of the draw call itself - you can confirm this by issuing a second draw call immediately after with the exact same params and time that too - you’re most likely going to find that it returns almost immediately.

So based on that, the excessive time is going to be on account of something that happened before the glDrawElements call (but which the driver just stored up at the time it happened, and is only doing for real when the draw call is made), and the most likely looking suspect is that 20-element array of mat4s. Some info on how you’re sending that to the driver will help in diagnosing further.

It’s also possible that the timing functions you’re using are not accurate (e.g. you might be using something like GetTickCount which has very poor resolution) in which case wrong times are to be expected. That’s what you should double-check first.
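To make the two checks above concrete (time a second identical draw call, and use a timer with enough resolution), here is a minimal sketch. It assumes a C++11 compiler with <chrono>, which is newer than Visual Studio 2010; on an older toolchain QueryPerformanceCounter can play the same role. The helper name timeRepeated is made up for this example, and the callable passed in stands in for the real draw call.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: runs `call` `runs` times back to back and returns the
// wall-clock duration of each run in milliseconds, using a monotonic clock.
template <typename Fn>
std::vector<double> timeRepeated(Fn call, int runs) {
    assert(runs > 0);
    std::vector<double> ms;
    ms.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        call();  // stand-in for the glDrawElements call being measured
        auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}
```

In the real application the callable would wrap the existing glDrawElements call with unchanged parameters and runs would be 2. If the first entry is several milliseconds and the second is close to zero, that points at deferred state validation rather than the draw itself.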

^Yeah that makes sense. I was just reading an article on Wikipedia yesterday (comparing D3D and OpenGL) that claimed D3D’s weakness was an inability to buffer user mode calls (before switching to kernel mode) presumably because the “IHV” layer Microsoft engineered does not allow for it. But the article says this only plagues D3D9 and was corrected for 10 I think. Still ~5ms is a long time. A whole frame at 60fps is only about 16.7ms.

EDITED: 20-element array sounds like 5 matrices which seems like nothing. I’ve always assumed uploading all of the program registers at once would not be a big deal (unless your hardware supports a whole lot more than are normally required)

D3D9 and below actually do buffer user-mode calls; the claim that they don’t is 100% a myth. The cost was in validation.

I’m reading the 20-element array as being 20 matrices, which equates to 80 vec4s. Worst case is 80 glUniform4f calls here, but even 20 glUniformMatrix4fv calls can be quite heavy (especially if each one is also accompanied by a run-time glGetUniformLocation). Of course it can also be done with a buffer object (which - if careless - may involve a CPU/GPU sync but the time for that wouldn’t be expected to be measured with glDrawElements) or even a single glUniformMatrix4fv call, so we need to know how the OP is setting these.

As a complement to measuring CPU time, use a query to measure the time as reported by the GPU. But be careful about asking for the result, as it may stall the pipeline. One way around that is to read the result at the next iteration, as follows.

GLuint result = 0;
if (!firsttime) glGetQueryObjectuiv(fQuery, GL_QUERY_RESULT, &result);
glBeginQuery(GL_TIME_ELAPSED, fQuery);
glDraw...
glEndQuery(GL_TIME_ELAPSED);

You will have to do a glGenQueries(1, &fQuery); somewhere also.
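A GL_TIME_ELAPSED query reports its result in nanoseconds, so the value read back above needs converting before it can be compared with the 5 to 7 ms seen on the CPU side. A small sketch of that arithmetic follows; the function names are made up for this example, and std::uint32_t stands in for GLuint so that the snippet compiles without the GL headers.

```cpp
#include <cassert>
#include <cstdint>

// Converts a GL_TIME_ELAPSED result (nanoseconds) to milliseconds.
// std::uint32_t is a stand-in for GLuint; a 32-bit result wraps after
// roughly 4.29 seconds, which is ample for a single draw call.
double elapsedQueryToMs(std::uint32_t nanoseconds) {
    return nanoseconds / 1.0e6;
}

// Fraction of one frame's time budget that `ms` takes up at `fps` frames per second.
double frameBudgetFraction(double ms, double fps) {
    assert(fps > 0.0);
    return ms / (1000.0 / fps);
}
```

For instance, a result of 5000000 is 5 ms, which is about 30% of a frame at 60 fps. If the GPU figure comes out far below the CPU figure, the time is being spent in the driver rather than in rendering.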

^mhagain, Direct3D9 lets you upload as many consecutive registers (of a class) as you need to. Which seems reasonable, as I would imagine the best approach would be to stream them all up in a block if possible. But I can also imagine the driver building a buffer and tagging each register in the upload for random access update. I recently programmed a lot with OpenGL ES (WebGL) and I don’t remember there being an API for updating a block of registers (which would probably be very helpful for Javascript; as would bringing back display lists I think) but I did not really look. I guess OpenGL cannot do that then? Either way I don’t think it would matter much beyond the unnecessary (presumably user mode) function calls.

If they’re declared in the GLSL as:

uniform mat4 matrixArray[20];

They can be loaded in the C(++) code as:

matrixType matrixArray[20]; // matrixType is assumed to hold 16 contiguous floats

// fill in data here

glUniformMatrix4fv (uniformLocation, 20, GL_FALSE, (const GLfloat *) matrixArray);

That should be the fastest way of loading them when using traditional uniforms.
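The single call only works if all the matrices sit back to back in memory as count * 16 floats. Here is one way to guarantee that on the C++ side, as a sketch only: the names makeIdentityBlock and matrixAt are made up for this example, and a plain std::vector<float> takes the place of whatever matrixType the application really uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t kFloatsPerMat4 = 16;

// Allocates `count` column-major 4x4 identity matrices in one contiguous block.
std::vector<float> makeIdentityBlock(std::size_t count) {
    std::vector<float> block(count * kFloatsPerMat4, 0.0f);
    for (std::size_t i = 0; i < count; ++i)
        for (std::size_t d = 0; d < 4; ++d)
            block[i * kFloatsPerMat4 + d * 5] = 1.0f;  // diagonal: 0, 5, 10, 15
    return block;
}

// Returns a pointer to the 16 floats of matrix `index` inside the block,
// so individual matrices can be updated in place before the upload.
float *matrixAt(std::vector<float> &block, std::size_t index) {
    assert((index + 1) * kFloatsPerMat4 <= block.size());
    return &block[index * kFloatsPerMat4];
}
```

With a block of 20 matrices built this way, the upload is one call: glUniformMatrix4fv(uniformLocation, 20, GL_FALSE, &block[0]), where uniformLocation was looked up once at startup rather than every frame.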

That’s good to know. The WebGL API for glUniformMatrix4fv does not take a count argument, but it takes a typed array, which I assumed had to be 4x4 (16 elements) since that is the name of the API procedure.

https://www.khronos.org/registry/webgl/specs/1.0/
void uniformMatrix4fv(WebGLUniformLocation location, GLboolean transpose, Float32Array value);

For the record, I am not sure the spec (above) even explains it, but from a little searching about it sounds like you can pass an array whose size is a multiple of 16. I am not 100% positive that you can select out the individual matrices for use in your script with the Float32Array spec, though. That may be a fundamental limitation of Javascript.

Sounds like I have some homework to do.

Forgive me. I’d forgotten how glBindBuffer interacts with glDrawElements. Look into it if you’ve not heard of it. Otherwise disregard these comments. I cannot edit the original post for correctness at this point.