Geometry shader massive slowdown / ATI Radeon 5770

Hi everyone!

I’m not used to working with geometry shaders, so I set up a small synthetic benchmark to check peak triangle throughput using VBOs and geometry shaders.

I do this by drawing triangles that each cover half a pixel, and measuring the time with GL_TIME_ELAPSED queries.
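
For reference, a minimal sketch of this kind of GL_TIME_ELAPSED measurement (this isn’t my exact code; it assumes timer queries are available and the variable names are illustrative):


GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
// ... issue the draw call(s) being measured ...
glEndQuery(GL_TIME_ELAPSED);

// GL_QUERY_RESULT waits for the GPU to finish; the result is in nanoseconds.
GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);

double seconds = elapsedNs * 1e-9;
double trisPerSec = double(NUM_TRIANGLES) / seconds; // NUM_TRIANGLES: triangles issued per draw

glDeleteQueries(1, &query);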

On my GF9800, driver version 257.21, I get:

  • without GS, ~274 M triangles/sec
  • with GS, ~92 M triangles/sec (slowdown ~3x)

On my Radeon HD 5570 with Catalyst 10.9, I get:

  • without GS, ~428 M triangles/sec
  • with GS, ~21 M triangles/sec (slowdown ~20x!)

My geometry shader is a simple passthrough:


#version 120
#extension GL_EXT_geometry_shader4 : enable
 
void main()
{
  for(int i = 0; i < gl_VerticesIn; ++i)
  {
    gl_Position = gl_PositionIn[i];
    EmitVertex();
  }
  EndPrimitive();
}

The only additional code I have for GS support is:


#if USE_GEOMETRY_SHADER
  mgd::GLShaderObject geomShader(GL_GEOMETRY_SHADER);
  geomShader.compile(test_GL_vbo_speed_geom);
  program.attachObject(geomShader);

  glProgramParameteriEXT(program.handle(), GL_GEOMETRY_INPUT_TYPE_EXT, GL_TRIANGLES);
  glProgramParameteriEXT(program.handle(), GL_GEOMETRY_OUTPUT_TYPE_EXT, GL_TRIANGLE_STRIP);
  glProgramParameteriEXT(program.handle(), GL_GEOMETRY_VERTICES_OUT_EXT, 3);
#endif

Drawing is done using:


::glDrawRangeElements(GL_TRIANGLES, 0, 3*NUM_VERTEX_SET, 3*NUM_TRIANGLES, GL_UNSIGNED_SHORT, 0);

Is a 20x slowdown normal on ATI? Am I missing something?

Cheers,
Nicolas.

I haven’t attempted to duplicate your setup, but I don’t think you can draw any useful conclusion from your synthetic test.

The reason is that the GL is a parallelized pipeline. Basically, I think all your test is measuring is shader setup overhead. The way you have it, your fragment shader does almost nothing; beyond shading half a fragment per triangle, are you doing any math such as per-pixel lighting calculations or texture lookups? I assume your vertex shader also does almost nothing, right? Are you multiplying each vertex by a 4x4 matrix, or just passing it through to the next stage?

If you have a realistic shader program, significant execution time will be spent doing useful work inside at least one shader stage, and probably more than one. When that is the case, the setup overhead becomes a smaller percentage of the overall execution time per shader. Moreover, each shader runs as part of a pipeline, so as the vertex and/or fragment shaders (and/or tessellation control and evaluation shaders) get longer, they mask the time spent in the geometry shader, since all the stages run in parallel.

What’s really important in real GL programs is speeding up the bottleneck. The bottleneck depends on the actual conditions, and it will often shift as conditions change. In a parallel pipeline, speeding up anything that is not the bottleneck has no impact on performance; it just means the non-bottleneck stages spend a greater percentage of their time idle.

To get the biggest bang for your buck performance-wise, you will ideally balance all the stages in your pipeline so they each take the same amount of time. Any stages sitting in an idle state aren’t doing anything to enhance the quality of your rendered frame.

David, thanks for your reply.

I’m aware of GL pipelining and of balancing the pipeline around performance bottlenecks: my aim with this synthetic test was to put some numbers on potential bottlenecks, not much more.

The test shaders I’m using are indeed overly simple (ftransform() in the vertex shader, a flat color assignment in the fragment shader), as my aim was to investigate raw triangle throughput.
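
For concreteness, the shader pair boils down to something like this (a sketch of the ‘ftransform() plus flat color’ combination, not the exact source):


// Vertex shader: fixed-function transform, nothing else.
#version 120

void main()
{
  gl_Position = ftransform();
}

// Fragment shader (separate source): flat color, no lighting or texture lookups.
#version 120

void main()
{
  gl_FragColor = vec4(1.0);
}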

What surprises me is getting such a slow ‘raw’ performance number with a GS on ATI (compared to NVIDIA), and I’m wondering whether anyone with ‘real world’ geometry shader experience could shed some light on this.

For now, this does not encourage me to spend more time investigating GS.

For now, this does not encourage me to spend more time investigating GS.

The point he was trying to make is that it’s a synthetic benchmark. “Raw” performance doesn’t mean anything if it is unattainable in the real world.

There have been many pieces of hardware boasting raw performance numbers that, when used with actual software rather than synthetic, theoretical tests, turn out to be inferior to hardware with lower theoretical stats but higher real-world performance. Hardware design is often a balancing act. And if that “raw” triangle setup performance would never actually be useful in real software, why bother putting it in the hardware?

That’s not to say that there isn’t something they could be doing better, or that using geometry shaders is a great idea. Just that drawing such conclusions from such artificial benchmarks is perhaps not wise.