Hello,
I'm using a vertex and geometry shader to do LOD selection and frustum culling, writing the results into a buffer bound for transform feedback (GL_EXT_transform_feedback). That buffer is then used as the instance source for a glDrawElementsInstanced() call, but I currently have that draw disabled so I can profile the LOD selection and frustum culling stage on its own.
The input attributes to the vertex shader are a 4x3 matrix (4*vec4) and a bounding sphere (1*vec4).
The input uniforms to the vertex shader are camera position (vec3) and frustum planes (6*vec4).
The vertex shader does the cull/LOD tests, and outputs the 4x3 matrix and a 'visible' flag to the geometry shader.
The geometry shader only emits 'points' if the vertex shader 'visible' output is 1.
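For reference, a minimal CPU-side C++ sketch of the logic described above, not the actual GLSL. It assumes planes stored as (normal.xyz, d) with normals pointing into the frustum, spheres stored as (center.xyz, radius), and a simple distance-based LOD rule; all names (sphereVisible, selectLod, compactVisible, lodStep, maxLod) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };  // plane: (normal.xyz, d); sphere: (center.xyz, radius)

// Signed distance from point p to a plane (assumed: normal points into the frustum).
inline float planeDistance(const Vec4& plane, const Vec3& p) {
    return plane.x * p.x + plane.y * p.y + plane.z * p.z + plane.w;
}

// Vertex-shader part: a sphere is rejected only when it lies entirely
// behind at least one of the six frustum planes.
inline bool sphereVisible(const std::array<Vec4, 6>& planes, const Vec4& sphere) {
    const Vec3 center{sphere.x, sphere.y, sphere.z};
    for (const Vec4& plane : planes) {
        if (planeDistance(plane, center) < -sphere.w) {
            return false;
        }
    }
    return true;
}

// Hypothetical LOD rule: one level per lodStep units of camera distance,
// clamped to maxLod. The real shader may use a different metric.
inline int selectLod(const Vec3& camera, const Vec4& sphere, float lodStep, int maxLod) {
    assert(lodStep > 0.0f && maxLod >= 0);
    const float dx = sphere.x - camera.x;
    const float dy = sphere.y - camera.y;
    const float dz = sphere.z - camera.z;
    const float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    const int lod = static_cast<int>(dist / lodStep);
    return lod > maxLod ? maxLod : lod;
}

// Geometry-shader part: emit an output only for visible inputs, which is
// what compacts the stream captured by transform feedback.
inline std::vector<std::size_t> compactVisible(const std::array<Vec4, 6>& planes,
                                               const std::vector<Vec4>& spheres) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < spheres.size(); ++i) {
        if (sphereVisible(planes, spheres[i])) {
            visible.push_back(i);
        }
    }
    return visible;
}
```

With a unit-cube "frustum" (six axis-aligned planes at distance 1), a sphere of radius 0.5 centred at x = 1.4 still passes because it straddles the plane, while one centred at x = 3 is dropped.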
The problem is that I'm only getting approximately 57 million points processed per second on a Quadro 4000.
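To put that rate in context, here is a rough C++ estimate of the attribute read traffic it implies. It is only a sketch: it assumes each input is exactly the 4 + 1 vec4 attributes listed above (32-bit floats, no padding or caching effects) and it ignores the transform feedback writes, whose volume depends on how many points pass the test. The names kInputBytesPerPoint and inputReadGBps are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Per-point input: 4 vec4 (matrix) + 1 vec4 (bounding sphere), 4 floats each.
constexpr std::size_t kInputBytesPerPoint = (4 + 1) * 4 * sizeof(float);
static_assert(kInputBytesPerPoint == 80, "assumes 32-bit floats");

// Implied attribute read rate in GB/s (1 GB = 1e9 bytes) at a given point rate.
inline double inputReadGBps(double pointsPerSecond) {
    assert(pointsPerSecond >= 0.0);
    return pointsPerSecond * static_cast<double>(kInputBytesPerPoint) / 1e9;
}
```

At 57 million points per second this works out to about 4.56 GB/s of input reads, a figure that can be compared against the card's memory bandwidth to judge whether the stage is likely to be bandwidth-bound.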
Is there some performance trick or caveat I should be aware of when doing this sort of thing?
Note that the code is *not* reading back GL_QUERY_RESULT (so this is not a stall problem), and it is *not* issuing the glDrawElementsInstanced() call (so it is not related to the instancing API).
Thanks for any advice offered.




