ARB_instanced_arrays = slow?

hello all,
i did a simple bit of billboarding where i stuffed positions and texcoords for a million billboards into a single VBO (4 vertices per billboard) and hit glDrawArrays (with a simple vertex shader which transforms the position into eye space then offsets it along the texcoord). I got around 200 million triangles per second, so I was reasonably happy with the performance.
But I wanted to save memory…so I gave ARB_instanced_arrays a go - stuffed 4 sets of texcoords into one VBO, then a million positions into another VBO and hit glDrawArraysInstanced.
Now I could maybe understand a slight drop in throughput (well…after all…I’m saving memory and there’s no such thing as a free lunch), but not a drop to well under half the throughput!
It went from 200mtps to 75mtps. Is this expected behaviour from the IHVs? If it is, then fair enough, I’ll drop it and go back to not using it - i just want to be clear that this is the expected performance.
thanks for any advice.

card:Quadro 4000 driver:270.71 os:XP64

here’s some key bits of source code.

c++


void init() {
	m_billboardCount = 1000000;
	const float area = 200.0f;

	glGenBuffers(1, &m_uvBuffer);
	glBindBuffer(GL_ARRAY_BUFFER, m_uvBuffer);
	glBufferData(GL_ARRAY_BUFFER, 4*sizeof(vec2f), NULL, GL_STATIC_DRAW);
	vec2f* uvPtr = (vec2f*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
	*uvPtr++ = vec2f(0.0f, 0.0f);
	*uvPtr++ = vec2f(1.0f, 0.0f);
	*uvPtr++ = vec2f(1.0f, 1.0f);
	*uvPtr++ = vec2f(0.0f, 1.0f);
	glUnmapBuffer(GL_ARRAY_BUFFER);
	glBindBuffer(GL_ARRAY_BUFFER, 0);

	glGenBuffers(1, &m_posBuffer);
	glBindBuffer(GL_ARRAY_BUFFER, m_posBuffer);
	glBufferData(GL_ARRAY_BUFFER, m_billboardCount*sizeof(vec3f), NULL, GL_STATIC_DRAW);
	vec3f* posPtr = (vec3f*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
	for (uint32 i=0; i<m_billboardCount; ++i)
		*posPtr++ = vec3f(area*-0.5f+RND(area), 0.0f, area*-0.5f+RND(area));
	glUnmapBuffer(GL_ARRAY_BUFFER);
	glBindBuffer(GL_ARRAY_BUFFER, 0);
}

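void draw() {
	// per-instance attribute: one position per billboard (divisor 1)
	glBindBuffer(GL_ARRAY_BUFFER, m_posBuffer);
	glVertexAttribPointer(g_billboardShader->attrib_pos, 3, GL_FLOAT, GL_FALSE, sizeof(vec3f), 0);
	glEnableVertexAttribArray(g_billboardShader->attrib_pos);
	glVertexAttribDivisorARB(g_billboardShader->attrib_pos, 1);
	// per-vertex attribute: the 4 shared corner texcoords (default divisor 0)
	glBindBuffer(GL_ARRAY_BUFFER, m_uvBuffer);
	glVertexAttribPointer(g_billboardShader->attrib_uv, 2, GL_FLOAT, GL_FALSE, sizeof(vec2f), 0);
	glEnableVertexAttribArray(g_billboardShader->attrib_uv);
	glBindBuffer(GL_ARRAY_BUFFER, 0);
	// 4 vertices per instance, one instance per billboard
	glDrawArraysInstanced(GL_QUADS, 0, 4, m_billboardCount);
	// reset the divisor so later non-instanced draws using this attrib index are unaffected
	glVertexAttribDivisorARB(g_billboardShader->attrib_pos, 0);
	glDisableVertexAttribArray(g_billboardShader->attrib_pos);
	glDisableVertexAttribArray(g_billboardShader->attrib_uv);
}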

GLSL shader:


attribute vec3 attrib_pos;
attribute vec2 attrib_uv;

void main() {
	vec4 eyePos = gl_ModelViewMatrix * vec4(attrib_pos,1.0);
	vec2 uv = attrib_uv;
	eyePos += vec4(uv.x-0.5, uv.y-0.5, 0.0, 0.0);
	gl_Position = gl_ProjectionMatrix * eyePos;
}
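For scale, here is a rough estimate of the memory the instanced layout saves. It is only a sketch: it assumes vec3f and vec2f are tightly packed floats (12 and 8 bytes; the thread does not show these types) and that the original non-instanced VBO stored one position and one texcoord for each of the 4 vertices of every billboard. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "estimate assumes 32-bit floats");

// Assumed tightly packed layouts for vec3f and vec2f.
constexpr std::size_t kPosBytes = 3 * sizeof(float);
constexpr std::size_t kUvBytes  = 2 * sizeof(float);

// Non-instanced ("unrolled") layout: each of the 4 vertices of every
// billboard carries its own position and texcoord.
std::uint64_t unrolledBytes(std::uint64_t billboards) {
    assert(billboards > 0);
    return billboards * 4 * (kPosBytes + kUvBytes);
}

// Instanced layout from init(): one position per billboard plus a single
// shared set of 4 corner texcoords.
std::uint64_t instancedBytes(std::uint64_t billboards) {
    assert(billboards > 0);
    return billboards * kPosBytes + 4 * kUvBytes;
}
```

Under those assumptions, a million billboards need 80,000,000 bytes unrolled and 12,000,032 bytes instanced.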

just updated drivers to latest 275.36. Exactly the same poor throughput.

I guess the driver allocates at least one warp of 32 threads per instance. This would mean that the instanced vertex shaders (4 vertexes per warp) utilize only 1/8th of the threads compared to the big VBO (32 vertexes per warp).
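The arithmetic behind that guess can be written out as below. This is only a sketch of the hypothesis (each instance gets whole warps to itself); it is not documented driver behaviour, and `warpUtilization` is a made-up helper name.

```cpp
#include <cassert>

// Fraction of warp lanes doing useful vertex work if every instance is
// given whole warps to itself (the guess above, not a documented fact).
double warpUtilization(int verticesPerInstance, int warpSize) {
    assert(verticesPerInstance > 0 && warpSize > 0);
    // Warps needed per instance, rounded up.
    const int warps = (verticesPerInstance + warpSize - 1) / warpSize;
    return static_cast<double>(verticesPerInstance) / (warps * warpSize);
}
```

`warpUtilization(4, 32)` gives 0.125, versus 1.0 for a fully packed warp. The measured ratio in the thread is 75/200 = 0.375, so if the guess is right, vertex shading is probably not the only limiter.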

I haven’t had time yet to check your code thoroughly, but in general ARB_instanced_arrays should actually be faster in practice because of the decreased memory bandwidth requirements. Of course, this is a bit more relevant to AMD hardware than NVIDIA, as memory bandwidth is more of an issue there, at least based on my experience.
I’ll tell more if I figure out something about your particular problem.

ARB_instanced_arrays is faster here on NVidia, and faster than ARB_draw_instanced with TBO as well, but I haven’t tested with such trivial instances. You can try glDrawElementsInstanced, but I wouldn’t guess that makes a relevant difference. It may also be something in your code: are you sure you’re only timing repeated draw time and not any VBO upload or bind time? Also, this is on a GeForce GTX480, so same GPU core as yours, but very different clocks and number of cores. Run some tests and see what you’re bottlenecked on – may be something else.

thanks for the replies guys.

ARB_instanced_arrays faster here on NVidia

faster than what? unrolled instances?? are you sure?
yes i would imagine this extension is intended for instancing geometry of more than 4 vertices - but losing more than half the throughput is not really understandable.
you have my word that my timings are correct, i’ve been in the business for many years. Besides which, the framerate goes up from 75fps to 200fps with no other change other than using instance-unrolling instead of this extension. This is an isolated test program with nothing else going on. The code I’ve posted is pretty much it.

Dark Photon, I think you’ve misunderstood me. I also wanted to say that instanced arrays are faster, no matter if NVIDIA or AMD, just my experience is that AMD GPUs are more sensitive to bandwidth demanding stuff.

I also can confirm that ARB_draw_instanced with TBO is slower, at least that was the case on my Radeon HD2600 (don’t see any difference now on the HD5770).

Anyway, there must definitely be some issue with your usage.

First, I would check whether the slowdown is caused by using the QUADS primitive type. You should know that current GPUs natively support only triangles, so the quads will be split up anyway. Maybe ARB_instanced_arrays is not natively supported by the hardware if you use the QUADS primitive type.

Another thing I don’t understand is why you don’t use the geometry shader to generate the billboard. Then you would just pass the position (and maybe some other data if you need to) and generate the triangle strip representing the quad and the appropriate texture coordinates in the geometry shader.

Another thing I don’t understand is why you don’t use the geometry shader to generate the billboard.

Geometry shaders aren’t exactly known for high performance.

Yep. But using VBOs for GPU streaming and reuse, not binding discrete per-batch VBOs like you’re doing.

you have my word that my timings are correct … This is an isolated test program with nothing else going on. The code I’ve posted is pretty much it.

Ok. Only you know for sure, but looking at the above draw code, I suspect you may be timing the binds and enables too. For kicks, try putting both the binds and the enables in init, and then don’t do them every time. Binds can be very expensive.

Obviously this isn’t a general solution, but just a probing technique. Lazy enables and binds are what you would typically do if using discrete VBOs.
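One way to run that probe is a small A/B timing harness: average the frame time with the binds and enables inside the draw call, then again with them moved to init, and compare. The sketch below is not the poster's code; `averageFrameMs` is a made-up helper, and in a real GL test the timed callable is assumed to end with glFinish() (or a swap with vsync off) so GPU work is included in the measurement.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock milliseconds per call of `frame` over `frames` calls.
// Assumption: in a real GL test, `frame` blocks until the GPU is done
// (e.g. by calling glFinish()), otherwise only CPU submit time is measured.
double averageFrameMs(const std::function<void()>& frame, int frames) {
    assert(frames > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < frames; ++i) {
        frame();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / frames;
}
```

Calling it once with the posted draw() and once with a variant that only issues glDrawArraysInstanced would show how much of the frame the binds and enables account for.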

With the number of primitives you’re ripping this may not be it, but you haven’t told us what CPU you have. I will say that it’s “completely” amazing (almost nonsensically so) the difference between various latest-gen CPUs with batch throughput. You’ve also got a clocked-down and scaled-back Fermi, which might be related to throughput.

But yeah, as far as discrete “classic” VBOs (bind handles), they can be pretty darn finicky things for performance. Like in your case, here client arrays often won (beaten only by display lists), so why use VBOs? Perf especially stunk on the slower CPUs. It wasn’t until I flipped to using bindless VBOs (no binds; provide GPU address) and/or using VBOs for streaming to (and reuse on) the GPU that VBO perf really shined.

If you want some thread pointers on these, just say the word.

On NVidia, bench all these against the perf with batches in display lists. That’s really the best you can do. And I will say that VBOs with bindless gets you pretty darn close to display list perf.

Yes, that’s right, however there are usually fast paths in hardware for 1:1 and 1:4 input:output ratios, the latter being exactly for billboard rendering (I wouldn’t wonder if point sprite implementations would internally use geometry shaders nowadays). Check the ATI HD2000 GPU programming guide here: http://developer.amd.com/media/gpu_assets/ATI_Radeon_HD_2000_programming_guide.pdf.

Please try to replace quads with triangle lists or tri strips just to be sure that the performance drop is not because of a slow hardware path or software emulation that results from using quads with instancing.

dark photon, i’ve read your previous posts so I’m aware of your (justified) obsession with buffer binds, but this is a single batch per frame. This is not a general usage case, this is a test program specifically for this. CPU is completely irrelevant in this case, as you really should know at a glance of the code. You are perhaps overstating the cost of a CPU cache-miss looking up 2 buffer handles.
aqnuep, the fact that you suggest geometry shaders for this specific case indicates you have very little idea on performance optimisation. GS’s are pretty much only for transform feedback cases - they are very sub-optimal.
replacing quads with triangles makes it slower, which is understandable because there’s now 6 vertices to process instead of 4 per billboard (non-indexed). If you’re going to suggest tristrips then stop, because i’d either have to use primitive-restart or write a GS which I know will make it slower still.

peterfilm, the fact that you think that geometry shaders are always sub-optimal and GS is only for transform feedback is wrong and shows that you have very little idea on how you should use geometry shaders in order to get good performance.

As I see you output only a single quad in case of each instance thus you don’t need primitive-restart or anything like that because you have a single tri-strip with 4 vertices. The fact that you think you need it indicates you have very little idea on how instancing works. Actually that also shows why you don’t understand that outputting a 4 vertex tri-strip for billboards in the GS can be really efficient.
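One detail when switching the posted code from GL_QUADS to a 4-vertex GL_TRIANGLE_STRIP: the corner order in the uv buffer has to change, because init() writes the corners around the perimeter. A minimal sketch of the reordering, with a winding check (the `Vec2` type and helper names are made up for illustration):

```cpp
#include <array>
#include <cassert>

struct Vec2 { float x, y; };

// GL_QUADS corner order as written in init(): around the perimeter.
const std::array<Vec2, 4> kQuadOrder = {{
    {0.0f, 0.0f}, {1.0f, 0.0f}, {1.0f, 1.0f}, {0.0f, 1.0f}
}};

// Perimeter order -> triangle strip order: swap the last two corners.
std::array<Vec2, 4> toStripOrder(const std::array<Vec2, 4>& q) {
    return {{ q[0], q[1], q[3], q[2] }};
}

// Vertices of triangle i (0 or 1) of a 4-vertex strip. The odd triangle's
// first two vertices are swapped, which is how OpenGL keeps winding consistent.
std::array<Vec2, 3> stripTriangle(const std::array<Vec2, 4>& s, int i) {
    assert(i == 0 || i == 1);
    if (i == 0) return {{ s[0], s[1], s[2] }};
    return {{ s[2], s[1], s[3] }};
}

// Signed area of a triangle; positive means counter-clockwise.
float signedArea(Vec2 a, Vec2 b, Vec2 c) {
    return 0.5f * ((b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x));
}
```

With the reordered corners both strip triangles come out counter-clockwise with area 0.5, so together they cover the same unit quad as the GL_QUADS version.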

doh you’re right of course, it was too early in the morning for me, tristripping in this case doesn’t require restart/degenerates/gs. Tried it, same result.
give me some examples where GS’s are faster for static geometry then…

I don’t say GS are faster or slower for billboards, but if you are more limited by bandwidth than compute power, then GS is good. Also, don’t forget that GS is slow because of the complex logic in the output buffer of the GS that is needed to ensure that multiple GS instances can output their variable number of vertices in the proper order. In the case of 1 input 1 output, GS works in exactly the same synchronous fashion as the other shader stages do. Also, as I mentioned, at least AMD has a fast path for 1 input 4 output, especially made for billboard rendering.

Once you make the compiler know that you always output the same amount of vertices for each incoming vertex, the driver can choose a different hardware path. Btw, I’ll test GS generated billboard vs replicated vertex data just to see how it performs.

You should also take into consideration that GL3 class hardware (GF8 and HD2000) did not have the same hardware implementation for GS as GL4 class hardware does (GF400 and HD5000). Many use cases that were slow on GL3 GS can run much faster on GL4 GS.

i’d like to restate my aim - i’m after the fastest way of rendering static billboards (i.e. billboards whose attributes are not changed by the CPU or the GPU). I got 200mtps with just plain VBO and glDrawArrays, which is what I’d expect on a 300mtps card, but I wanted to save a little memory without sacrificing much performance (i’d take a 5%-10% drop).
Dark Photon mentions I’m using “classic” VBOs and that this might be the issue. Well it’s a single bind/draw operation per frame, so bindless graphics ain’t gonna help, and the geometry is static, so VBO/stream/orphan ain’t gonna help either.

If you are timing, you are free-running (not running sync-to-vblank). If you are free-running, it’s slightly irrelevant that this is the only thing you are doing in this frame because it’s the same thing over and over as fast as possible. The question is, what percentage of what you’re doing is it, which removing it temporarily will reveal.

But as I said: “With the number of primitives you’re ripping this may not be it…”. I didn’t guarantee you that was it. I just gave you one thing to try.

You’re looking for ideas for things to try which may help reveal why you are getting the results you are. Some folks are helping you out. If you don’t like those suggestions, that’s fine. No point in admonishing though.

On NVidia, display lists - no question. Simple, easy. Also easy and fast, client arrays. A little more work but not much: bindless VBOs.

Well it’s a single bind/draw operation per frame, so bindless graphics ain’t gonna help, and the geometry is static, so VBO/stream/orphan ain’t gonna help either.

This doesn’t follow.

“VBO” does not define where the data lives precisely. It could be on the GPU. It could be on the CPU. All you know is it’s “server side” (i.e. on “the other side” of the GL API). The performance of each is different. Yes, we’re getting into implementation specific details, but the only issue is to what degree you want to optimize to get “fastest”. With bindless and an early MakeResident, you can effectively lock the VBO on the GPU, meaning batch reissues from there without reupload will be as close to the GPU as possible, and thus likely as fast as possible.

That’s one of the cool things about “VBO/stream/orphan” as you put it. If you’ve already uploaded in a prior frame, there is no “stream”. You (in most cases) just reissue from the VBO that’s there. Since nearly every batch is dispatched from the same VBO every time, there’s little to no VBO “binding” so effectively (in my experience) you get better perf in most cases. And if you locked your VBO onto the GPU, all the better.

That said, all that is nicely abstracted and cross vendor if you just wrap your batch in a display list. So if you want a baseline of “how good it can get” (with your existing batch contents), then (on NVidia), try display lists. Aim for that performance with anything you do with VBOs or client arrays.

ok guys thanks for taking the time to answer. I guess it’s just a mystery.