ARB_instanced_arrays = slow?



peterfilm
06-22-2011, 09:31 AM
hello all,
i did a simple bit of billboarding where i stuffed positions and texcoords for a million billboards into a single VBO (4 vertices per billboard) and hit glDrawArrays (with a simple vertex shader which transforms the position into eye space then offsets it along the texcoord). I got around 200 million triangles per second, so I was reasonably happy with the performance.
But I wanted to save memory...so I gave ARB_instanced_arrays a go - stuffed 4 sets of texcoords into one VBO, then a million positions into another VBO and hit glDrawArraysInstanced.
Now I could maybe understand a slight drop in throughput (well...after all...I'm saving memory and there's no such thing as a free lunch), but not more than a halving in throughput!
It went from 200mtps to 75mtps. Is this expected behaviour from the IHVs? If it is, then fair enough, I'll drop it and go back to the unrolled VBO. I just want to be clear that this is the expected performance.
thanks for any advice.

card: Quadro 4000, driver: 270.71, OS: XP64
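
To put a number on the memory saving being discussed, here is a minimal sketch of the two buffer layouts. The `vec2f` and `vec3f` structs are stand-ins for the poster's math types and are assumed to be tightly packed 4-byte floats; the helper names are hypothetical.

```cpp
#include <cstddef>

// Assumed layouts: tightly packed floats, as in most small math libraries.
struct vec2f { float x, y; };
struct vec3f { float x, y, z; };

// Unrolled layout: each of the 4 corners stores its own position and texcoord.
constexpr std::size_t unrolledBytes(std::size_t billboards) {
    return billboards * 4 * (sizeof(vec3f) + sizeof(vec2f));
}

// Instanced layout: one position per billboard plus one shared set of 4 texcoords.
constexpr std::size_t instancedBytes(std::size_t billboards) {
    return billboards * sizeof(vec3f) + 4 * sizeof(vec2f);
}
```

Under those assumptions, a million billboards take 80,000,000 bytes unrolled and 12,000,032 bytes instanced, which is the saving the instanced version is trying to buy.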

peterfilm
06-22-2011, 09:57 AM
here's some key bits of source code.

c++


void init() {
    m_billboardCount = 1000000;
    const float area = 200.0f;

    // Per-vertex data: one shared set of 4 corner texcoords.
    glGenBuffers(1, &m_uvBuffer);
    glBindBuffer(GL_ARRAY_BUFFER, m_uvBuffer);
    glBufferData(GL_ARRAY_BUFFER, 4*sizeof(vec2f), NULL, GL_STATIC_DRAW);
    vec2f* uvPtr = (vec2f*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    *uvPtr++ = vec2f(0.0f, 0.0f);
    *uvPtr++ = vec2f(1.0f, 0.0f);
    *uvPtr++ = vec2f(1.0f, 1.0f);
    *uvPtr++ = vec2f(0.0f, 1.0f);
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // Per-instance data: one random position per billboard on the XZ plane.
    glGenBuffers(1, &m_posBuffer);
    glBindBuffer(GL_ARRAY_BUFFER, m_posBuffer);
    glBufferData(GL_ARRAY_BUFFER, m_billboardCount*sizeof(vec3f), NULL, GL_STATIC_DRAW);
    vec3f* posPtr = (vec3f*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    for (uint32 i=0; i<m_billboardCount; ++i)
        *posPtr++ = vec3f(area*-0.5f+RND(area), 0.0f, area*-0.5f+RND(area));
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

void draw() {
    // Position advances once per instance (divisor 1).
    glBindBuffer(GL_ARRAY_BUFFER, m_posBuffer);
    glVertexAttribPointer(g_billboardShader->attrib_pos, 3, GL_FLOAT, GL_FALSE, sizeof(vec3f), 0);
    glEnableVertexAttribArray(g_billboardShader->attrib_pos);
    glVertexAttribDivisorARB(g_billboardShader->attrib_pos, 1);

    // Texcoord advances once per vertex (divisor left at its default of 0).
    glBindBuffer(GL_ARRAY_BUFFER, m_uvBuffer);
    glVertexAttribPointer(g_billboardShader->attrib_uv, 2, GL_FLOAT, GL_FALSE, sizeof(vec2f), 0);
    glEnableVertexAttribArray(g_billboardShader->attrib_uv);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // 4 vertices per instance, one instance per billboard.
    glDrawArraysInstanced(GL_QUADS, 0, 4, m_billboardCount);

    glDisableVertexAttribArray(g_billboardShader->attrib_pos);
    glDisableVertexAttribArray(g_billboardShader->attrib_uv);
}


GLSL shader:-



attribute vec3 attrib_pos;
attribute vec2 attrib_uv;

void main() {
    vec4 eyePos = gl_ModelViewMatrix * vec4(attrib_pos,1.0);
    vec2 uv = attrib_uv;
    eyePos += vec4(uv.x-0.5, uv.y-0.5, 0.0, 0.0);
    gl_Position = gl_ProjectionMatrix * eyePos;
}
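
For readers unfamiliar with the divisor call in draw(), the attribute fetch it sets up can be modelled on the CPU as below. This is only a sketch of the indexing rule for a draw starting at vertex 0; `fetchIndex` is a hypothetical helper, not a GL function.

```cpp
#include <cstdint>

// Index of the array element fetched for one attribute.
// divisor == 0: the attribute advances per vertex.
// divisor == N > 0: the attribute advances once every N instances.
constexpr std::uint32_t fetchIndex(std::uint32_t vertex,
                                   std::uint32_t instance,
                                   std::uint32_t divisor) {
    return divisor == 0 ? vertex : instance / divisor;
}
```

In the posted code, attrib_uv (divisor 0) reads elements 0..3 for every billboard, while attrib_pos (divisor 1) reads element i for all four vertices of billboard i.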

peterfilm
06-22-2011, 10:16 AM
just updated drivers to latest 275.36. Exactly the same poor throughput.

mbentrup
06-22-2011, 01:19 PM
I guess the driver allocates at least one warp of 32 threads per instance. This would mean that the instanced vertex shaders (4 vertexes per warp) utilize only 1/8th of the threads compared to the big VBO (32 vertexes per warp).
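
The arithmetic behind that guess can be written out as a small sketch. The one-instance-per-warp behaviour is the poster's hypothesis, not documented driver behaviour, and `laneUtilization` is a hypothetical helper.

```cpp
// Fraction of warp lanes doing useful work if every instance is given
// its own warp(s) and no lanes are shared between instances.
inline double laneUtilization(unsigned verticesPerInstance, unsigned warpSize = 32) {
    // Number of whole warps needed to cover one instance.
    const unsigned warps = (verticesPerInstance + warpSize - 1) / warpSize;
    return static_cast<double>(verticesPerInstance) / (warps * warpSize);
}
```

For a 4-vertex quad this gives 4/32 = 0.125. The measured drop in the thread (200 to 75 mtps, roughly 2.7x) is smaller than 8x, so if the hypothesis holds, vertex work was presumably not the only limit in the unrolled case.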

aqnuep
06-23-2011, 12:21 AM
I haven't had time to check your code thoroughly yet, but in general ARB_instanced_arrays should actually be faster in practice because of the reduced memory bandwidth requirements. Of course, this is a bit more relevant to AMD hardware than NVIDIA, as memory bandwidth is more of an issue on AMD, at least in my experience.
I'll tell more if I figure out something about your particular problem.

Dark Photon
06-23-2011, 05:23 AM
I haven't had time to check your code thoroughly yet, but in general ARB_instanced_arrays should actually be faster in practice because of the reduced memory bandwidth requirements. Of course, this is a bit more relevant to AMD hardware than NVIDIA, as memory bandwidth is more of an issue on AMD, at least in my experience.
ARB_instanced_arrays faster here on NVidia, and faster than ARB_draw_instanced with TBO as well, but I haven't tested with such trivial instances. You can try glDrawElementsInstanced, but I wouldn't expect that to make a relevant difference. It may also be something in your code: are you sure you're only timing repeated draw time and not any VBO upload or bind time? Also, this is on a GeForce GTX480, so same GPU core as yours, but very different clocks and number of cores. Run some tests and see what you're bottlenecked on; it may be something else.

peterfilm
06-23-2011, 07:37 AM
thanks for the replies guys.

ARB_instanced_arrays faster here on NVidia
faster than what? unrolled instances?? are you sure?
yes i would imagine this extension is intended for instancing geometry of more than 4 vertices - but losing more than half the throughput is hard to understand really.
you have my word that my timings are correct, i've been in the business for many years. Besides which, the framerate goes up from 75fps to 200fps with no other change other than using instance-unrolling instead of this extension. This is an isolated test program with nothing else going on. The code I've posted is pretty much it.

aqnuep
06-23-2011, 07:42 AM
Dark Photon, I think you've misunderstood me. I also wanted to say that instanced arrays are faster, no matter if NVIDIA or AMD; it's just my experience that AMD GPUs are more sensitive to bandwidth-demanding work.

I also can confirm that ARB_draw_instanced with TBO is slower, at least that was the case on my Radeon HD2600 (don't see any difference now on the HD5770).

Anyway, there should definitely be some issue with your usage.

First, I would check whether the slowdown is caused by using the QUADS primitive type. You should know that current GPUs natively support only triangles, so the quads will be split up anyway. Maybe ARB_instanced_arrays is not natively supported by the hardware if you use the QUADS primitive type.

Another thing I don't understand is why you don't use the geometry shader to generate the billboard. Then you would just pass the position (and maybe some other data if you need to) and generate the triangle strip representing the quad and the appropriate texture coordinates in the geometry shader.
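
As a rough illustration of what such a geometry shader would compute, here is a CPU-side model in C++ (not GLSL) of the 1-in/4-out expansion. It reuses the +/-0.5 eye-space offsets from the vertex shader posted earlier; `Vec4` and `expandBillboard` are hypothetical names.

```cpp
#include <array>

struct Vec4 { float x, y, z, w; };

// CPU model of a 1-in/4-out geometry shader: takes one eye-space centre
// and returns the four corners in triangle-strip order.
inline std::array<Vec4, 4> expandBillboard(const Vec4& eyeCentre) {
    // Strip order: bottom-left, bottom-right, top-left, top-right.
    const float uv[4][2] = {{0.f, 0.f}, {1.f, 0.f}, {0.f, 1.f}, {1.f, 1.f}};
    std::array<Vec4, 4> out{};
    for (int i = 0; i < 4; ++i) {
        out[i] = {eyeCentre.x + uv[i][0] - 0.5f,
                  eyeCentre.y + uv[i][1] - 0.5f,
                  eyeCentre.z, eyeCentre.w};
    }
    return out;
}
```

In this scheme only the centre position is stored per billboard; whether the GPU executes the expansion efficiently is exactly what the rest of the thread argues about.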

Alfonse Reinheart
06-23-2011, 11:50 AM
Another thing I don't understand is why you don't use the geometry shader to generate the billboard.

Geometry shaders aren't exactly known for high performance.

Dark Photon
06-23-2011, 05:07 PM
ARB_instanced_arrays faster here on NVidia
faster than what? unrolled instances?? are you sure?
Yep. But using VBOs for GPU streaming and reuse, not binding discrete per-batch VBOs like you're doing.


you have my word that my timings are correct ... This is an isolated test program with nothing else going on. The code I've posted is pretty much it.
Ok. Only you know for sure, but looking at the above draw code, I suspect you may be timing the binds and enables too. For kicks, try putting both the binds and the enables in init, and then don't do them every time. Binds can be very expensive.

Obviously this isn't a general solution, but just a probing technique. Lazy enables and binds are what you would typically do if using discrete VBOs.
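
A minimal sketch of that probing idea: do the setup once outside the timed region and time only the repeated draw. The harness is generic C++; `secondsPerDraw` is a hypothetical helper, and with OpenGL the `sync` callable would typically wrap glFinish() so that queued GPU work is counted.

```cpp
#include <chrono>

// Runs `setup` once outside the timed region, then times `iterations`
// calls of `draw`. `sync` is called before each clock read so that
// pending work is finished before the time is taken.
template <class Setup, class Draw, class Sync>
double secondsPerDraw(Setup setup, Draw draw, Sync sync, int iterations) {
    setup();
    sync();  // make sure setup work is not billed to the first draw
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) draw();
    sync();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}
```

Running it once with the binds and enables inside `draw` and once with them moved into `setup` shows how much of the frame they account for.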

With the number of primitives you're ripping this may not be it, but you haven't told us what CPU you have. I will say that it's "completely" amazing (almost nonsensically so) the difference between various latest-gen CPUs with batch throughput. You've also got a clocked-down and scaled-back Fermi, which might be related to throughput.

But yeah, as far as discrete "classic" VBOs (bind handles), they can be pretty darn finicky things for performance. Like in your case, here client arrays often won (beaten only by display lists), so why use VBOs? Perf especially stunk on the slower CPUs. It wasn't until I flipped to using bindless VBOs (no binds; provide GPU address) and/or using VBOs for streaming to (and reuse on) the GPU that VBO perf really shined.

If you want some thread pointers on these, just say the word.

On NVidia, bench all these against the perf with batches in display lists. That's really the best you can do. And I will say that VBOs with bindless gets you pretty darn close to display list perf.

aqnuep
06-24-2011, 12:23 AM
Another thing I don't understand is why you don't use the geometry shader to generate the billboard.

Geometry shaders aren't exactly known for high performance.

Yes, that's right, however there are usually fast paths in hardware for 1:1 and 1:4 input:output ratios, the latter being exactly for billboard rendering (I wouldn't be surprised if point sprite implementations internally used geometry shaders nowadays). Check the ATI HD2000 GPU programming guide here: http://developer.amd.com/media/gpu_assets/ATI_Radeon_HD_2000_programming_guide.pdf.

aqnuep
06-24-2011, 12:25 AM
faster than what? unrolled instances?? are you sure?

Please try to replace quads with triangle lists or tri strips just to be sure that the performance drop is not because of a slow hardware path or software emulation that results from using quads with instancing.

peterfilm
06-24-2011, 01:54 AM
dark photon, i've read your previous posts so I'm aware of your (justified) obsession with buffer binds, but this is a single batch per frame. This is not a general usage case, this is a test program specifically for this. CPU is completely irrelevant in this case, as you really should know at a glance of the code. You are perhaps overstating the cost of a CPU cache-miss looking up 2 buffer handles.
aqnuep, the fact that you suggest geometry shaders for this specific case indicates you have very little idea on performance optimisation. GS's are pretty much only for transform feedback cases - they are very sub-optimal.
replacing quads with triangles makes it slower, which is understandable because there's now 6 vertices to process instead of 4 per billboard (non-indexed). If you're going to suggest tristrips then stop, because i'd either have to use primitive-restart or write a GS which I know will make it slower still.

aqnuep
06-24-2011, 03:33 AM
aqnuep, the fact that you suggest geometry shaders for this specific case indicates you have very little idea on performance optimisation. GS's are pretty much only for transform feedback cases - they are very sub-optimal.
peterfilm, the fact that you think that geometry shaders are always sub-optimal and GS is only for transform feedback is wrong and shows that you have very little idea on how you should use geometry shaders in order to get good performance.

replacing quads with triangles makes it slower, which is understandable because there's now 6 vertices to process instead of 4 per billboard (non-indexed). If you're going to suggest tristrips then stop, because i'd either have to use primitive-restart or write a GS which I know will make it slower still.
As I see it, you output only a single quad per instance, so you don't need primitive-restart or anything like that, because each instance is a single tri-strip with 4 vertices. The fact that you think you need it indicates you have very little idea on how instancing works. Actually that also shows why you don't understand that outputting a 4 vertex tri-strip for billboards in the GS can be really efficient.
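
The vertex counts being argued over, and the reordering needed to feed the existing texcoord buffer to a 4-vertex strip, can be sketched as follows. The enum and helper names are hypothetical.

```cpp
#include <array>
#include <cstdint>

enum class Prim { Quads, Triangles, TriangleStrip };

// Non-indexed vertices needed to draw one billboard.
constexpr std::uint32_t verticesPerBillboard(Prim p) {
    return p == Prim::Triangles ? 6u : 4u;
}

struct UV { float u, v; };

// GL_QUADS walks the perimeter; GL_TRIANGLE_STRIP zig-zags,
// so the last two corners swap places.
inline std::array<UV, 4> quadToStrip(const std::array<UV, 4>& q) {
    return {{q[0], q[1], q[3], q[2]}};
}
```

Because each instance starts a fresh strip, the 4-vertex strip costs the same number of vertex shader invocations as the quad version, while independent triangles cost 6 per billboard.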

peterfilm
06-24-2011, 04:23 AM
doh you're right of course, it was too early in the morning for me, tristripping in this case doesn't require restart/degenerates/gs. Tried it, same result.
give me some examples where GS's are faster for static geometry then...

aqnuep
06-24-2011, 05:11 AM
I'm not saying GS is faster or slower for billboards, but if you are limited more by bandwidth than by compute power, then GS is a good fit. Also, don't forget that GS is slow because of the complex logic in the output buffer of the GS that is needed to ensure that multiple GS instances can output their variable number of vertices in the proper order. With 1 input and 1 output, GS works in exactly the same synchronous fashion as the other shader stages. Also, as I mentioned, at least AMD has a fast path for 1 input 4 output, especially made for billboard rendering.

Once the compiler knows that you always output the same number of vertices for each incoming vertex, the driver can choose a different hardware path. Btw, I'll test GS-generated billboards vs replicated vertex data just to see how it performs.

You should also take into consideration that GL3 class hardware (GF8 and HD2000) did not have the same hardware implementation of GS as GL4 class hardware does (GF200 and HD5000). Many use cases that were slow on GL3 GS can run much faster on GL4 GS.

peterfilm
06-24-2011, 05:27 AM
i'd like to restate my aim - i'm after the fastest way of rendering static billboards (i.e. billboards whose attributes are not changed by the CPU or the GPU). I got 200mtps with just plain VBO and glDrawArrays, which is what I'd expect on a 300mtps card, but I wanted to save a little memory without sacrificing much performance (i'd take a 5%-10% drop).
Dark Photon mentions I'm using "classic" VBO's and that this might be the issue. Well it's a single bind/draw operation per frame, so bindless graphics ain't gonna help, and the geometry is static, so VBO/stream/orphan ain't gonna help either.

Dark Photon
06-26-2011, 03:01 PM
dark photon, i've read your previous posts so I'm aware of your (justified) obsession with buffer binds, but this is a single batch per frame. This is not a general usage case, this is a test program specifically for this. CPU is completely irrelevant in this case, as you really should know at a glance of the code. You are perhaps overstating the cost of a CPU cache-miss looking up 2 buffer handles.

If you are timing, you are free-running (not running sync-to-vblank). If you are free-running, it's slightly irrelevant that this is the only thing you are doing in this frame because it's the same thing over and over as fast as possible. The question is, what percentage of what you're doing is it, which removing it temporarily will reveal.

But as I said: "With the number of primitives you're ripping this may not be it...". I didn't guarantee you that was it. I just gave you one thing to try.

You're looking for ideas for things to try which may help reveal why you are getting the results you are. Some folks are helping you out. If you don't like those suggestions, that's fine. No point in admonishing though.

Dark Photon
06-26-2011, 04:36 PM
i'd like to restate my aim - i'm after the fastest way of rendering static billboards (i.e. billboards whose attributes are not changed by the CPU or the GPU).
On NVidia, display lists - no question. Simple, easy. Also easy and fast, client arrays. A little more work but not much: bindless VBOs.


Well it's a single bind/draw operation per frame, so bindless graphics ain't gonna help, and the geometry is static, so VBO/stream/orphan ain't gonna help either.
This doesn't follow.

"VBO" does not define where the data lives precisely. It could be on the GPU. It could be on the CPU. All you know is it's "server side" (i.e. on "the other side" of the GL API). The performance of each is different. Yes, we're getting into implementation specific details, but the only issue is to what degree you want to optimize to get "fastest". With bindless and an early MakeResident, you can effectively lock the VBO on the GPU, meaning batch reissues from there without reupload will be as close to the GPU as possible, and thus likely as fast as possible.

That's one of the cool things about "VBO/stream/orphan" as you put it. If you've already uploaded in a prior frame, there is no "stream". You (in most cases) just reissue from the VBO that's there. Since nearly every batch is dispatched from the same VBO every time, there's little to no VBO "binding" so effectively (in my experience) you get better perf in most cases. And if you locked your VBO onto the GPU, all the better.

That said, all that is nicely abstracted and cross vendor if you just wrap your batch in a display list. So if you want a baseline of "how good it can get" (with your existing batch contents), then (on NVidia), try display lists. Aim for that performance with anything you do with VBOs or client arrays.

peterfilm
06-27-2011, 04:00 AM
ok guys thanks for taking the time to answer. I guess it's just a mystery.

peterfilm
06-27-2011, 04:23 AM
on the very different topic of streaming geometry to the GPU using VBO orphaning - dark photon could you describe this process as implemented by your good self please?

Dark Photon
06-28-2011, 06:54 AM
on the very different topic of streaming geometry to the GPU using VBO orphaning - dark photon could you describe this process as implemented by your good self please?
Nearly all of the credit goes to Rob Barris here, as he described the original technique. I just proposed adding batch reuse to it, and using bindless for extra-fast batch dispatch. I also use it for canned batches -- not just generated/decompressed batches.

Good links to read on this:

* VBOs strangely slow? (OpenGL.org thread, 2/23/10) (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=273141#Post273141) (focus on Rob Barris' posts)
* Buffer Object Streaming (OpenGL.org Wiki) (http://www.opengl.org/wiki/Buffer_Object_Streaming)
* mega vbo, any reservations? (OpenGL.org thread, 9/8/10) (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=283552&page=1)
* The A to Z of DX10 Performance (pg 9) (http://developer.amd.com/gpu_assets/The%20A%20to%20Z%20of%20DX10%20Performance.pps)
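
The bookkeeping side of streaming with orphaning and batch reuse might look roughly like the sketch below. It tracks offsets only; the GL calls that would accompany each step are named in comments, and the class is a hypothetical illustration, not the code from the linked threads.

```cpp
#include <cassert>
#include <cstddef>

// Offset bookkeeping for a streaming VBO of fixed capacity.
class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacity) : capacity_(capacity) {}

    // Reserves `bytes` and returns the offset to write at.
    // If the block does not fit, the buffer is orphaned first.
    std::size_t reserve(std::size_t bytes) {
        assert(bytes <= capacity_);
        if (head_ + bytes > capacity_) {
            // GL: glBufferData(target, capacity, NULL, usage) orphans the
            // old storage so the driver can hand back fresh memory.
            head_ = 0;
            ++generation_;
        }
        // GL: map [offset, offset + bytes) with glMapBufferRange and
        // GL_MAP_UNSYNCHRONIZED_BIT; the region is unused since the last orphan.
        const std::size_t offset = head_;
        head_ += bytes;
        return offset;
    }

    // Batches uploaded under an older generation must be uploaded again.
    unsigned generation() const { return generation_; }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;
    unsigned generation_ = 0;
};
```

Batch reuse then amounts to remembering the (offset, generation) pair for each uploaded batch: if the generation still matches, the batch can be drawn again from its offset without another upload.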

peterfilm
06-29-2011, 05:49 AM
brilliant! thanks for the links, i've read the barris stuff before, but it was the batch reuse strategy i was more interested in, otherwise buffer streaming is all just for CPU-computed dynamic stuff.
one gem of information came out of those links though - the DX10 performance pps. Slide 14
Instance data:
ATI: Ideally should come from additional streams (up to 32 with DX10.1)
NVIDIA: Ideally should come from CB indexing.
So it seems nvidia have optimised instancing for uniform buffers rather than instanced arrays. This would explain my experience.
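
If the per-instance positions were moved into a uniform block indexed by gl_InstanceID, the block size limit would force the draw to be split into batches. The sketch below estimates how many. It assumes std140 layout, where each element of a vec3 array is padded to 16 bytes, and takes the limit as a parameter since the real value must be queried with GL_MAX_UNIFORM_BLOCK_SIZE. The helper names are hypothetical, and the thread does not test whether this path is actually faster.

```cpp
#include <cstddef>

// std140 pads each element of a vec3 array to 16 bytes.
constexpr std::size_t kStd140Vec3Stride = 16;

// Number of instance positions that fit in one uniform block.
constexpr std::size_t instancesPerBlock(std::size_t maxUniformBlockSize) {
    return maxUniformBlockSize / kStd140Vec3Stride;
}

// Number of instanced draw calls needed to cover all instances.
constexpr std::size_t drawCallsNeeded(std::size_t instances,
                                      std::size_t maxUniformBlockSize) {
    return (instances + instancesPerBlock(maxUniformBlockSize) - 1) /
           instancesPerBlock(maxUniformBlockSize);
}
```

With the spec minimum of 16384 bytes, a million billboards need 977 draws of up to 1024 instances each; with an assumed 65536-byte limit, 245 draws of up to 4096.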