
View Full Version : Geometry shader - a threat for high performance



Aleksandar
03-06-2013, 02:24 AM
Hi All,

I would like to start a short discussion about GS and its impact on the performance.

Well, yesterday I tried to implement an effect using a GS. To begin with, I wrote a simple, just pass-through GS and was unpleasantly surprised by a significant drop in performance.
Since I was working on a laptop with an NV 8600M GT, I attributed the 2.3x slowdown to the first generation of hardware that supports GS. But the surprise was even greater when I tried it on a GTX470.
The results were 3.1 times worse than without the pass-through GS.

I knew the GS was something "rather to avoid", but such a performance loss is very disturbing for any application (except one with very few vertices/primitives).

Does anyone have any positive experience with GS, or were you just satisfied with the effect without profiling its impact on performance? :)

Does Kepler improve GS performance compared to the Fermi architecture?
A previous experiment showed that the GS has less impact on the slower G80 than on the much faster GF100 GPUs. Very strange, indeed.

tonyo_au
03-06-2013, 05:00 AM
Does anyone have any positive experience with GS, or were you just satisfied with the effect without profiling its impact on performance?

I use it mainly to save space for very large point data sets, and for some low-use options where keeping the data would be wasteful; there I just wear the cost in performance. I limit myself to a maximum of 4 generated vertices because I read somewhere that nVidia GeForce cards can
generate 4 vertices with minimal stalling, but I don't know how true that is.

aqnuep
03-06-2013, 06:17 AM
Well, when I did layered rendering with GS on an HD5770 it showed a significant performance gain compared to rendering to each layer separately without a GS.

The GS does have its overhead. First, it requires shader execution resources; second, it is a bit more heavyweight than other shader stages because it has to support 1:N input-output cases.
Anyway, in my experience there are cases where using a GS can improve performance a lot, but generally speaking you shouldn't use one unless you need it.

I think the most important cases are layered and multi-viewport rendering, but you can also save bandwidth with a GS if, for example, you have per-triangle attributes that you don't want to replicate into a VBO, but would rather fetch per-primitive in the GS. Not to mention that there are algorithms that you simply cannot implement on the GPU without a GS.
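For illustration, a minimal sketch of that per-primitive fetch idea (the TBO sampler and varying names here are made up for the example, not from any real codebase):

```glsl
#version 330

layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;

// Hypothetical per-triangle attribute stored in a buffer texture (TBO),
// one texel per primitive, instead of being replicated per vertex in a VBO
uniform samplerBuffer perTriangleColor;

out vec4 gs_Color;

void main(void)
{
    // One fetch per primitive, indexed by the input primitive ID
    vec4 triColor = texelFetch(perTriangleColor, gl_PrimitiveIDIn);

    for (int i = 0; i < 3; i++)
    {
        gl_Position = gl_in[i].gl_Position;
        gs_Color = triColor;
        EmitVertex();
    }
    EndPrimitive();
}
```

The bandwidth saving comes from the vertex stream no longer carrying a copy of the attribute for each of the three vertices of every triangle.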

Btw, have you tried your test on any AMD GPUs?

Aleksandar
03-06-2013, 08:10 AM
I know that the GS is necessary in some cases, but executing it for every primitive (for example, for a wireframe-over-solid-model effect) is not acceptable.
This experiment was carried out with just a pass-through GS, without any calculation in it.


Btw, have you tried your test on any AMD GPUs?
Nice idea! I have just tried it on an AMD Radeon HD 6850 with Catalyst 13.1. The results are even worse. The ratio is 5.5!

There are some errors though (Msg. ID: 3200 - Using glTexParameteri in a Core context with parameter <param> and enum '0x2900' which was removed from Core OpenGL (GL_INVALID_ENUM), and Msg. ID: 1001 - glGetIntegerv parameter <pname> has an invalid enum '0x87fb' (GL_INVALID_ENUM)), but I'm sure they didn't cause such a ratio, since they appear in both cases.

aqnuep
03-06-2013, 08:56 AM
Nice idea! I have just tried it on an AMD Radeon HD 6850 with Catalyst 13.1. The results are even worse. The ratio is 5.5!
Strange. When I tried a simple pass-through VS + GS on my HD5770, I remember seeing a ratio of less than 2. Actually, less than 1.5, IIRC. I'll have to check that again.

Aleksandar
03-06-2013, 03:07 PM
When I tried a simple pass-through VS + GS on my HD5770, I remember seeing a ratio of less than 2. Actually, less than 1.5, IIRC. I'll have to check that again.
My VS is not a simple pass-through. On the contrary, it is very complicated. But GS is trivial. Here's the code:


#version 330

layout( triangles ) in;
layout( triangle_strip, max_vertices = 3 ) out;

in vec4 ex_Color[];
in vec2 ex_TexCoord[];
in vec4 out_vDebugVec4[];

out vec4 ex_gColor;
out vec2 ex_gTexCoord;
out vec4 out_DebugVec4;

void main(void)
{
    for (int i = 0; i < 3; i++)
    {
        gl_Position = gl_in[i].gl_Position;
        gl_ClipDistance[0] = gl_in[i].gl_ClipDistance[0];
        ex_gColor = ex_Color[i];
        ex_gTexCoord = ex_TexCoord[i];
        out_DebugVec4 = out_vDebugVec4[i];
        EmitVertex();
    }
    EndPrimitive();
}


Maybe there is something unusual in my GS.

By the way, it is not trivial to estimate the effect of the GS, since the bottleneck can be anywhere in the pipeline. In my case it is the VS. The slowdown factor depends slightly on the size of the problem. I agree it is about 2x or less with ~100K triangles, but it is more than 2.5x with more than 2M triangles. I cannot guarantee the accuracy of the results on the HD 6850, since it is not my computer and I had just a few minutes to test it, but they are certainly worse than on the Fermi counterparts (by the way, the GTX470 renders the test scene without GS in about 2ms, while the HD 6850 takes ~8ms and the 8600M ~33ms).

Alfonse Reinheart
03-06-2013, 03:50 PM
And how does the slowdown vary with the number of interpolated components?

tonyo_au
03-06-2013, 04:02 PM
but executing it for every primitive (for example, for a wireframe-over-solid-model effect) is not acceptable.

A couple of points -

I only use a GS in specific shaders, not in all of them.

I think it is important to remember the difference between actual performance and perceived performance. I do wireframe-over-solid and similar effects using a GS on large DTMs (4-6 million triangles),
and while I would not say the performance matches game-industry standards, my users are more than happy with the frame rates.

malexander
03-06-2013, 06:27 PM
Does Kepler improve performance regarding GS compared to Fermi architecture?

It doesn't seem to. I have a 1.5M polygon model with ~6M vertices that draws in ~5ms with a vertex/fragment shader. The same vertex/fragment pair with a geometry shader in between takes 20ms to draw (tri > tri, polygon selection lookup via TBO). This is using a GeForce 670 (311.09). Using a wire-over-shaded geometry shader takes much the same time (20ms), so it appears to have little to do with actual geometry or fragment shader computation. I suspect either memory bandwidth or the number of threads able to execute simultaneously might be the bottleneck.

A 20M polygon model draws in 30ms with the same vertex/fragment shader, and 225ms with the geometry TBO lookup, so it does get worse with the amount of geometry processed.

Alfonse Reinheart
03-06-2013, 07:33 PM
I'm curious: what's the hit if you toss in tessellation, without the geometry shader? Just go with an inner and outer level of 1.

mbentrup
03-07-2013, 07:01 AM
Another question: does setting max_vertices = 4 (while still emitting only 3 vertices as before) have an impact on performance?

Aleksandar
03-07-2013, 07:12 AM
Another question: does setting max_vertices = 4 (while still emitting only 3 vertices as before) have an impact on performance?
That doesn't change anything.

Aleksandar
03-07-2013, 07:15 AM
I'm curious: what's the hit if you toss in tessellation, without the geometry shader? Just go with an inner and outer level of 1.
It is not easy to reproduce, since the input to the TS is a patch. The test would not be the same if the VS were not trivial.

Aleksandar
03-10-2013, 04:14 AM
GS résumé

I have carried out some experiments and have come to the following conclusion:

1. I have made no mistakes in the implementation, since neither debug_output nor GLExpert reported any GS performance penalty.

2. The GS imposes a significant performance penalty on a regular basis because of the following:
- Due to the requirement that the output of a GS thread must preserve the order in which it is generated, the GPU must support buffering the output data in this correct order for a number of threads running in parallel.
- GeForce 8, 9 and GTX2xx series have limited output buffer sizes, which are at least sufficient to support the maximum output size allowed (I'm not sure whether this is a significant reason, in my case certainly not, but it is one of the reasons for sure).
- The performance of a GS is inversely proportional to the output size (in scalars) declared in the GS, which is the product of the vertex size and the number of vertices. This performance degradation, however, occurs at particular output sizes and is not smooth (GREATEST IMPACT!!!).
- In addition, because a GS runs on primitives, per-vertex operations will be duplicated for all primitives that share a vertex. This is potentially a waste of processing power.

Let's look at how the number of "output scalars" influences GS performance. In my example (a simple pass-through GS) there are 15 "output scalars" per vertex:
gl_Position - 4
gl_ClipDistance[0] - 1
ex_gColor - 4
ex_gTexCoord - 2
out_DebugVec4 - 4

Since the GS outputs 3 vertices, there is a total of 45 "output scalars". Rendering time on the 8600M is ~77ms.

If I remove DebugVec4, the total number of "output scalars" is 33, and the rendering time is ~68ms.

If ex_gColor is also removed, there are 21 "output scalars" and the rendering time is ~68ms. As you can see, the rendering time stays the same. Remember this for the later discussion.

By removing gl_ClipDistance[0], the number of "output scalars" drops to 18 (just 3 floats fewer), and the rendering time to ~58ms. We hit the next step!

With only gl_Position (we cannot go any further), the number of "output scalars" is 12 and the rendering time is ~55ms.

As a reminder, the rendering time without a GS is 34.5ms. So GS performance significantly depends on the number of "output scalars". The dependence is not a continuous but a discrete (step) function. According to the NV specs, the steps come at roughly every 20 scalars added for the GeForce 8, 9 and GTX2xx series. Below 20 "output scalars" they declare maximum performance, which in our case is ~1.59x slower than without a GS.

Conclusion:
A geometry shader is most useful when doing operations on small vertices or primitive data that require outputting only small amounts of data. But in general, the potential for wasted work and performance penalties makes the GS an often unused feature (except for culling, point sprites and layered/multi-viewport rendering (only in GL 4.1+)).
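For reference, the fastest variant measured above (gl_Position only, i.e. 12 "output scalars") reduces the pass-through GS to this:

```glsl
#version 330

layout( triangles ) in;
layout( triangle_strip, max_vertices = 3 ) out;

// Minimal pass-through: 4 scalars per vertex * 3 vertices = 12 output
// scalars, staying below the ~20-scalar step on GeForce 8/9/GTX2xx
void main(void)
{
    for (int i = 0; i < 3; i++)
    {
        gl_Position = gl_in[i].gl_Position;
        EmitVertex();
    }
    EndPrimitive();
}
```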

aqnuep
03-10-2013, 10:32 AM
Your statements are not all accurate:


- Due to the requirement that the output of a GS thread must preserve the order in which it is generated, the GPU must support buffering the output data in this correct order for a number of threads running in parallel.
That's unfortunately true. Don't forget that there is a theoretical upper bound on how fast outputting ordered data from multiple parallel threads can be.


- GeForce 8, 9 and GTX2xx series have limited output buffer sizes, which is at least sufficient to support the maximum output size allowed (I'm not sure if this is significant reason, in my case certainly not, but one of the reasons is for sure).
While it might be true for some GPUs, there is no technical reason why you have to have a limited output buffer size. It really depends on the implementation.


- The performance of a GS is inversely proportional to the output size (in scalars) declared in the GS, which is the product of the vertex size and the number of vertices. This performance degradation however occurs at particular output sizes, and is not smooth (GREATEST IMPACT!!!)
Again, this is implementation dependent. It can in fact be directly affected in some cases by the declared maximum output vertices; however, it usually really is inversely proportional to the actual amount of data output by the GS. Also, the performance degradation is smooth on some hardware, while it's not smooth on others. You should try a wider variety of GPUs and you'll observe this. It's definitely not an inherent limitation.


- In addition, because a GS runs on primitives, per-vertex operations will be duplicated for all primitives that share a vertex. This is potentially a waste of processing power.
There is, once again, no requirement that hardware has to run per-vertex operations for all primitives that share a vertex. I can tell you this for sure, as I measured it on AMD GPUs at the time I wrote my chapter for the OpenGL Insights book. In fact, adding more work to the vertex shader decreased the relative cost of the geometry shader.


Conclusion:
A geometry shader is most useful when doing operations on small vertices or primitive data that require outputting only small amounts of data. But in general, the potential for wasted work and performance penalties makes the GS an often unused feature (except for culling, point sprites and layered/multi-viewport rendering (only in GL 4.1+)).
It's an incorrect conclusion. I've tested it (at least on AMD hardware) and the number of vertices doesn't matter. Also, the more work you put in the vertex shader, the less impact using a geometry shader has. The only key point is that you have to minimize the amount of data output by your geometry shader, e.g. by performing back-face, frustum, or other culling techniques to further limit the actual amount of data output by the GS.

Regarding the use cases you've mentioned:
- Point sprites: you should rather use the fixed-function point sprite feature exposed by the hardware instead of using a GS for this purpose.
- Layered/multi-viewport rendering (not only GL 4.1; GL 3.x hardware actually does support these): usually way faster than doing per-layer/per-viewport rendering, especially when you have a time-consuming vertex shader; when you don't, you might consider using AMD_vertex_shader_layer or AMD_vertex_shader_viewport_index if available.
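As an example, with AMD_vertex_shader_layer a layer-selecting vertex shader could look roughly like this (a sketch assuming the driver exposes the extension; the uniform names are made up):

```glsl
#version 330
#extension GL_AMD_vertex_shader_layer : require

in vec4 in_Position;

uniform mat4 mvp;         // hypothetical model-view-projection matrix
uniform int targetLayer;  // hypothetical layer index

void main(void)
{
    gl_Position = mvp * in_Position;
    // Select the layered-framebuffer layer directly from the VS,
    // so no geometry shader stage is needed at all
    gl_Layer = targetLayer;
}
```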

Alfonse Reinheart
03-10-2013, 02:42 PM
it usually really is inversely proportional to the actual amount of data output by the GS

That's what he said. The number of scalars is the amount of data. The only thing you're disagreeing with is whether it's smooth or not.


There is, once again, no requirement that a hardware has to run per-vertex operations for all primitives that share a vertex.

He's talking about the per-vertex operations within the GS. Since it operates on (for example) triangle primitives, rather than triangle strips, it has to happen after the post-T&L cache. So any per-vertex operations you do in the GS will be repeated on duplicate vertices, unlike if you had done them in the VS.


- Point sprites: you should rather use the fixed function point sprite feature exposed by the hardware instead of using GS for the purpose

Assuming you like having your point sprites clipped by the center of the point, so they disappear when they're halfway off screen.

tonyo_au
03-10-2013, 06:07 PM
Below 20 "output scalars" they declare max. performance,

I find a similar result on AMD 5870.


rendering time without GS is 34.5ms
Is this producing the same output, i.e. do you add extra vertices and do the GS work in the VS?

aqnuep
03-10-2013, 08:49 PM
That's what he said. The number of scalars is the amount of data. The only thing you're disagreeing with is whether it's smooth or not.
No, that's not exactly what he said. It's not inversely proportional to the declared amount of data, but to the actual output data. There is a big difference. The former is max_vertices * num_scalars, the latter is EmitVertex calls * num_scalars.


He's talking about the per-vertex operations within the GS. Since it operates on (for example) triangle primitives, rather than triangle strips, it has to happen after the post-T&L cache. So any per-vertex operations you do in the GS will be repeated on duplicate vertices, unlike if you had done them in the VS.
If he really did talk about that, then he's right, but he didn't mention that he meant per-vertex operations within the GS. Also, there are usually few to no such operations, as they are normally done in the vertex shader; that's a conceptual restriction of the GS, because it works on a primitive. That's nothing out of the ordinary.


Assuming you like having your point sprites clipped by the center of the point, so they disappear when they're halfway off screen.
Well, if that's a problem, one can use instancing instead, in the following way:
1. Set VertexAttribDivisor to 1 for the point sprite's vertex arrays
2. Generate the fixed coordinates of the point sprite corners in the VS using gl_VertexID
3. Use the instanced attributes as if they were regular attributes
This will work faster than the GS approach and provides the exact same flexibility; you just render your point sprites as instances instead of vertices.
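A rough sketch of that instancing approach (the attribute and uniform names are hypothetical, and the clip-space sizing is just one possible choice):

```glsl
#version 330

// Per-instance attributes: glVertexAttribDivisor(loc, 1) is set on these,
// so each instance is one sprite
in vec3 spriteCenter;
in float spriteSize;

uniform mat4 viewProj;  // hypothetical view-projection matrix

out vec2 cornerUV;

void main(void)
{
    // Derive the four corners of a triangle-strip quad from gl_VertexID:
    // 0 -> (-1,-1), 1 -> (1,-1), 2 -> (-1,1), 3 -> (1,1)
    vec2 corner = vec2(gl_VertexID & 1, (gl_VertexID >> 1) & 1) * 2.0 - 1.0;

    vec4 pos = viewProj * vec4(spriteCenter, 1.0);
    // Offset in clip space so the quad survives near-plane clipping intact
    pos.xy += corner * spriteSize * pos.w;
    cornerUV = corner * 0.5 + 0.5;
    gl_Position = pos;
}
```

This would be drawn with glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, numSprites): 4 vertices per instance, one instance per sprite, and no GS stage involved.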

tonyo_au
03-11-2013, 04:52 AM
Well, if that's a problem, one can use instancing instead, in the following way:
1. Set VertexAttribDivisor to 1 for the point sprite's vertex arrays
2. Generate the fixed coordinates of the point sprite corners in the VS using gl_VertexID
3. Use the instanced attributes as if they were regular attributes
This will work faster than the GS approach and provides the exact same flexibility; you just render your point sprites as instances instead of vertices.


Thanks for that

Aleksandar
03-11-2013, 07:41 AM
Your statements are not all accurate
If you read carefully, you can see that everything written relates to NV pre-Fermi cards. Most of the facts are "borrowed" from NV's documentation.


Again, this is implementation dependent. It can in fact be directly affected in some cases by the declared maximum output vertices; however, it usually really is inversely proportional to the actual amount of data output by the GS. Also, the performance degradation is smooth on some hardware, while it's not smooth on others. You should try a wider variety of GPUs and you'll observe this. It's definitely not an inherent limitation.
As Alfonse said, you have agreed with me on this statement. The degradation is not smooth on any card I have tried. The test on the Radeon HD 6850 follows later in this post, as does the one for Fermi.



There is, once again, no requirement that hardware has to run per-vertex operations for all primitives that share a vertex. I can tell you this for sure, as I measured it on AMD GPUs at the time I wrote my chapter for the OpenGL Insights book. In fact, adding more work to the vertex shader decreased the relative cost of the geometry shader.
This is correct if the workload of the VS is increased while the GS workload stays the same. In my case the GS workload is directly proportional to the number of vertices emitted by the VS. I have just tested it on the AMD 6850, and the frame rendering time with a GS is directly proportional to the number of rendered primitives! This is true for every GPU I have tested. Another contributor to this thread also confirms these results.



It's an incorrect conclusion. I've tested it (at least on AMD hardware) and the number of vertices doesn't matter. Also, the more work you put in the vertex shader, the less impact using a geometry shader has. The only key point is that you have to minimize the amount of data output by your geometry shader, e.g. by performing back-face, frustum, or other culling techniques to further limit the actual amount of data output by the GS.
Not true! See above! Or ask other members of the forum.


- Layered/multi-viewport rendering (not only GL 4.1, but actually GL 3.x hardware do support these): usually is way faster than doing per-layer/per-viewport rendering, especially when you have a time consuming vertex shader; when you don't, you might consider using AMD_vertex_shader_layer or AMD_vertex_shader_viewport_index if it's available.
Layered/multi-viewport rendering was introduced in GL 4.1. That's what I meant. I haven't tried it on GL 3.3-class hardware, but I believe it works.


No, that's not exactly what he said. It's not inversely proportional to the declared amount of data, but to the actual output data. There is a big difference. The former is max_vertices * num_scalars, the latter is EmitVertex calls * num_scalars.
On pre-Fermi cards and the AMD HD 6850, the performance of the GS is inversely proportional to the amount of output data. After all, two of us reported that (tonyo_au and I). On Fermi it is not relevant!

Let's get back to the experiment...

I have tested the behavior of GTX470 and HD6850, and the results are the following:

- On the GTX470: w/o GS ~1.9ms; full GS ~5.9ms; GS w/o out_DebugVec4 ~5.8ms; GS w/o out_DebugVec4 and ex_gColor ~5.8ms; GS w/o out_DebugVec4, ex_gColor and ex_gTexCoord ~5.8ms; GS w/o out_DebugVec4, ex_gColor, ex_gTexCoord and gl_ClipDistance[0] ~6.6ms => Fermi does not depend on the number of "output scalars"! Clipping eliminates a certain number of vertices, hence the increase in frame rendering time once it is removed.

- On the HD 6850: full GS ~40ms; GS w/o out_DebugVec4 ~28.5ms; GS w/o out_DebugVec4 and ex_gColor ~20.9ms; GS w/o out_DebugVec4, ex_gColor and ex_gTexCoord ~20.6ms; GS w/o out_DebugVec4, ex_gColor, ex_gTexCoord and gl_ClipDistance[0] ~46ms => the HD 6850 with Catalyst 13.1 behaves pretty much the same way as the NV pre-Fermi cards!

Both the GTX470 and HD6850 increase frame rendering time linearly with the number of input vertices. I could post concrete numbers, but they are irrelevant to this discussion.

malexander
03-11-2013, 09:17 AM
My suspicion is that the small local memory size in the original 8000 and GT200 series cards (16K) is why geometry shader performance is so poor on these earlier Nvidia architectures. The Fermi generation increased this to 64K. If the geometry shader stores its output vertices in local shared memory (which I suspect it does as every Nvidia CUDA tutorial hammers home the point that you need to use shared memory for optimal performance), then the smaller 16K size is likely limiting the number of threads that can be run at a time.

For example, if a single vertex requires 6 vec4s of storage and you output 3 vertices, each thread would require 6x4x4x3 = 288 bytes of vertex output storage. That means the maximum number of threads that can run on a GT200 simultaneously is 56, while on Fermi it would be 227 (assuming nothing else uses shared memory, which is probably a bit ideal). This leaves the GT200 more sensitive to memory latency. After a certain point, running more threads won't improve performance, so perhaps that's why you're not seeing any variation with Fermi and the vertex output size. There could be other architectural enhancements as well, such as memory speed and newer instructions that improve geometry shader throughput, but it seems like local memory is a big constraint in this context.

As a test I compared a 1.5M point/6M vertex model, drawn with glDrawElements() on 1.5M point vertex arrays and a GS with a 1.6M TBO lookup per primitive for a per-primitive attribute, against a 6M vertex VS-only glDrawArrays() implementation. The latter promoted points and per-triangle attributes to triangle-vertex frequency (all VBOs of 6M length). Both implementations took ~26ms to draw on a GeForce 670. As a reference, drawing 1.5M point VBOs with glDrawElements() and no GS or per-triangle attribute took ~6ms. So while the GS case is definitely slower than the non-GS case, it is no worse than promoting all attributes to per-triangle-vertex frequency (and takes 1/4 the memory).

Now, with a 20M polygon model, the GS case beats the promoted-vertex case, 225ms to 357ms (VS draw-elements only: 49ms). Of course at this point both are terribly sluggish, so it's hard to tell the difference when tumbling. For a game context you wouldn't want either of these models, but for CAD purposes the geometry shader seems pretty useful.