ARB meeting notes from Jun/Sept/Dec

Yeah, they’re finally online! :wink:
June
September
December

Pixel Buffer Object Status … so vote PASSES. Spec will be posted to the registry shortly.

:slight_smile:

EXT_framebuffer_object is done!
http://www.opengl.org/about/arb/notes/SBupdatedec.ppt

:smiley:

Originally posted by namespace:
Pixel Buffer Object Status … so vote PASSES. Spec will be posted to the registry shortly.

:slight_smile:
This one? Doesn’t look too fresh, though. It’s dated March 2004, and it also says “NVIDIA Release 55 (early 2004) drivers support this extension.” :confused:

Originally posted by Corrail:
EXT_framebuffer_object is done!
http://www.opengl.org/about/arb/notes/SBupdatedec.ppt

:smiley:
:smiley: So where is the f$%&#$@ spec?! :mad: :smiley:

Yeah! RTT doesn’t have a crappy API!

Now the question is, who will have the first implementation: ATi or nVidia? (I’m guessing nVidia, as they tend to be a bit more on the ball than ATi.)

Instancing geometry - DX has an explicit API with per-primitive and per-vertex attributes, allowing instancing primitives with different per-vertex attributes. Pat notes that setting attributes in immediate mode outside of glBegin/glEnd can accomplish this already.
Interesting. This technique is perhaps as old as OpenGL itself; today, in the context of GLSL, it amounts to faking a uniform variable with an attribute. This would only make sense if changing an attribute value (in immediate mode) were a significantly cheaper state change than changing a uniform value. I can’t recall any GPU-optimization document mentioning such a thing…
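For concreteness, a minimal sketch of the trick Pat describes (the attribute index, names and values here are placeholders, not from the notes):

// faked-uniform sketch: the current value of a generic attribute, set
// outside glBegin/glEnd, is constant for the whole batch, so a vertex
// shader can read it as if it were a uniform (index 5 is arbitrary)
glVertexAttrib4fARB(5, x, y, z, angle);  // per-instance "uniform"
glDrawArrays(GL_TRIANGLES, 0, numVerts); // every vertex sees the same value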

But this doesn’t explain why DX instancing is tied to Microsoft DirectX Shader Model 3.0™. That suggests dedicated HW is needed, currently available only in nv40. (Yeah, I’ve heard of ATI’s “extension” with FOURCC ‘INST’.)

nvidia updated its extensions page but it doesn’t mention the framebuffer one.

http://developer.nvidia.com/object/nvidia_opengl_specs.html

Also note: “Extensions marked with an asterisk (*) will be supported in Forceware Release 75 drivers.”

This would make sense only if changing attribute value (in immediate mode) is significiantly cheaper state change than changing uniform value. I kind of can’t recall any GPU-optimization document mentioning such thing…
It is. This is a known fact.

The confusion is the notion that instancing is in any way similar to changing attributes outside of glBegin/End. It is not.

The reason is that instancing, very specifically, uses vertex arrays, while glBegin/End does not. Instancing is a way of drawing lots of things with one glBegin-equivalent (i.e., one glDraw*) call. The mere fact that you’re changing the attribute outside of glBegin/End means that you have multiple glBegin/End calls. Add to this the relative inefficiency of rendering with glBegin/End compared to vertex arrays, and you have all kinds of issues.

Huzzah! A nice late Xmas gift :wink:

Let’s hope for some framebuffer implementations and specs soon :smiley:

@Korval
Let’s not confuse things; I’m not talking about pure immediate mode and drawing with glBegin/End. Something like:

glColor3f(1, 0, 0);                      // "fake uniform" for the whole batch
glDrawArrays(GL_TRIANGLES, 0, numVerts);

has been used for ages, but real Uniforms/Constants came to GL quite recently, with the introduction of Shaders/Programs. So we can only talk about a performance comparison between the new, real Uniform and the old one (now considered “faked”) since that time.

Example:

vec4 foo[2005]; // positions and rotations of 2005 beer cans,
                // each one magically encoded in a single vec4.

// program A:
for ( int i = 0; i < 2005; i ++ )
{
  glUniform4fv(..., foo[i]);
  glDrawArrays(); // draws single beer can
}

// program B:
for ( int i = 0; i < 2005; i ++ )
{
  glVertexAttrib4fv(..., foo[i]);
  glDrawArrays(); // draws single beer can
}

So now we are told in the meeting minutes that B is the equivalent of DX instancing (and thus much faster than A). You, Korval, say that B being faster than A is a “known fact”. To me this is surprising, as I find it quite counterintuitive:

  • In both A & B you transfer exactly the same amount of data to the GPU, in identical batches. The purpose of instancing was to fight the penalty incurred by a large number of small batches, so why would anyone expect a significant difference here?

  • In B you are using an attribute disguised as a uniform. Why would anyone expect the emulated solution to be faster than the one designed solely for the purpose in question (per-primitive data)?

  • On nv30 HW, accessing an attribute in a fragment program with any instruction other than MOV costs you an additional cycle (unlike accessing a constant), and you pay that penalty for every rendered pixel. In some cases it is even faster to read the attribute out into a temp register and use that in computations instead of the original. But too many temps cost you a penalty too - this must be real fun for the compiler. Anyway, the alleged benefits of B would have to outweigh these losses. Of course nv30 != nv40, but as ATI has shown, the question of what HW instancing requires is rather fuzzy.

I’m sure such a counterintuitive optimization would have drawn my attention. If you still claim this is a well-known fact, I must have truly missed something.

Instancing geometry - no need, GL immediate mode rendering calls support this.
This has got to be the worst motivation I’ve ever heard. You could just as well say vertex arrays, texture objects, display lists and a whole range of other features are useless, because other parts of the API can accomplish the same thing. The whole point of instancing is to improve performance by drawing all batches in one call, dodging the overhead of DrawIndexedPrimitive()/glDrawElements(). GL doesn’t have any functionality to do this. Now, a real motivation would be that it’s not really needed in GL, since the overhead of glDrawElements() isn’t very high in the first place, unlike DrawIndexedPrimitive() in DX.

There’s quite an overhead in calling glDraw* for 5000 beer cans, irrespective of API. It would certainly push an unnecessarily large stream of commands into the pipeline, when a single command would do.

It is. This is a known fact.
Known by whom, Korval? You and your pet stick insect?

but the real Uniforms/Constants has come to GL quite recently - with introduction of Shaders/Programs. So we can talk about performance comparison between the new, real Uniform and the old one (now considered as “faked”), only since that time.
No, all Uniforms are is a substitution for per-vertex and per-fragment numerical state. They’re the programmatic equivalent of changing lighting parameters or TexEnv parameters or other such state. As such, the changes themselves don’t take any more or less time than their fixed-function counterparts; they are state changes and should be treated as such.

On nv30 HW, access to an attribute in fragment program in any instruction other than MOV costs you additional cycle (unlike access to a constant), and you pay that penalty every rendered pixel.
Fragment programs don’t have access to attributes. They have access to varyings.

So, now we are told in the meeting minutes that B is equivalent of DX instancing (thus much faster than A). You, Korval, say that B being faster than A is a “known fact”.
The thing you don’t get is that B isn’t the same thing as instancing. Instancing uses one draw call for all primitives. It does not loop over primitives. B’s per-instance data, meanwhile, doesn’t come through vertex arrays at all, but through a legacy OpenGL concept (the current attribute) that gives the illusion of performance without actually providing it.

Now a real motivation would be that it’s not really needed in GL since the overhead of glDrawElements() isn’t very high in the first place, unlike DrawIndexedPrimitive() in DX.
I’ve still seen no evidence that a GL program can achieve the same speed as a D3D instancing program using the same data and hardware. Until such evidence is brought forth, and holds in all cases, I firmly reside in the camp that says that we need this functionality.

I made a little test app.

GeForce 5200, Athlon XP 1700 (1400 MHz)
icosahedron model stored in display list (20 triangles, 12 verts)
20000 instances drawn each frame
screen size of each: ~15 pixels
dumb vertex shader (MVP matrix multiply preceded by single vec4 add)
dumb fragment shader (constant color)
no blending, etc.

The two paths of the innermost drawing loop resemble my A & B pseudocode above: no state changes between instances other than the single uniform/attribute update.

assembly output shows B needs one VP instruction more than A (6 vs 5)
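For reference, the two inner loops look roughly like this (a sketch; the names, locations and display-list handle are placeholders):

// A, "uniform" mode:
for (int i = 0; i < 20000; i++)
{
  glUniform4fvARB(posLoc, 1, pos[i]);  // per-instance data as a uniform
  glCallList(icosahedron);             // draw one instance
}

// B, "attribute" mode:
for (int i = 0; i < 20000; i++)
{
  glVertexAttrib4fvARB(posAttr, pos[i]); // per-instance data as an attribute
  glCallList(icosahedron);               // draw one instance
}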

results:
A, “uniform” mode: 5.25 fps
B, “attribute” mode: 33 fps

20 triangles * 3 verts (no vertex sharing) * 20000 instances * 6 VP instructions * 33 fps = 237M instructions/s. Pretty close to the theoretical maximum of a 250 MHz GPU.

With larger meshes the speedup effect gets diminished.

…and with d3d9 instancing?

The thing with OpenGL is that it should have paths that can be optimised by each vendor for their hardware. The absence of a specific mechanism for specifying instances means that, in OpenGL, a vendor cannot take advantage of specific instancing tricks they’ve developed for d3d9, possibly in hardware. The fact that there’s a kind of work-around in OpenGL that is almost as fast as d3d instancing isn’t really the point - the work-around may be as fast now, but what about the future? It puts an additional burden on the driver to detect that the app is doing instancing and switch it to the optimised path… if that’s even possible to detect.
Couple this with the fact that big missing features like this make a cross-API generic renderer interface just that bit trickier to get done elegantly… and managing your app’s vertex data becomes awkward.
A frequency parameter for every vertex attribute would be simple to add, wouldn’t it?
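For comparison, this is roughly what that frequency parameter looks like on the d3d9 side (a sketch; device, buffer and index setup are omitted and the variable names are assumed):

// stream 0: mesh geometry, reused numInstances times by one draw call
device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | numInstances);
device->SetStreamSource(0, meshVB, 0, sizeof(MeshVertex));
// stream 1: per-instance data, advanced once per instance
device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);
device->SetStreamSource(1, instanceVB, 0, sizeof(InstanceData));
// a single call then draws every instance
device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, numMeshVerts, 0, numTris);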

There’s quite an overhead in calling glDraw* for 5000 beer cans, irrespective of API. It would certainly push an unnecessarily large stream of commands into the pipeline, when a single command would do.

That is very true. I realized this a long time ago with a program whose scene was organized in a BSP tree, with a function that constructed batches based on common textures. I had done something really dumb in that code which caused every triangle to end up in its own batch. :smiley: So when rendering, glDrawElements got called for every triangle in the scene - around 5,000 triangles, give or take a few. My framerate was less than 10. Once I realized what was going on and fixed my batching code, it jumped up into the 200-300s.

The overhead may be low for a single call or a small number of calls to glDraw*, but it piles up quickly once you start doing hundreds or thousands of calls, which is quite common in a scene with an a$$ load of objects - asteroids in space, for example.

-SirKnight

The overhead may be low for a single call or a small number of calls to glDraw*, but it piles up quickly once you start doing hundreds or thousands of calls, which is quite common in a scene with an a$$ load of objects - asteroids in space, for example
Just concatenate your vertices/texcoords etc. together; this drops your draw calls to one or a few (even a slow CPU can handle this APOP),
and it doesn’t really take that much more memory.
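A rough sketch of that idea (the array names, counts and the transform() helper are placeholders):

// concatenate all instances into one big vertex array,
// so a single glDrawArrays call draws everything
GLfloat *big = malloc(numCans * canVerts * 3 * sizeof(GLfloat));
for (int i = 0; i < numCans; i++)
    for (int v = 0; v < canVerts; v++)
        // bake each instance's transform into its vertex positions
        transform(&big[(i * canVerts + v) * 3], canMesh[v], canPos[i]);

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, big);
glDrawArrays(GL_TRIANGLES, 0, numCans * canVerts); // one call for everything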

>>>results:
A, “uniform” mode: 5.25 fps
B, “attribute” mode: 33 fps
<<<

Is this a common case on all hardware, including from other vendors?

Is this a common case on all hardware, including from other vendors?
Uniforms cause state changes, which create stalls in the rendering pipeline. Even if attribute setting isn’t a function of the hardware, all that needs to be done is the construction of a vertex array/buffer containing that value replicated, and all is good. This is driver-side stuff, so it isn’t so bad. It slows the CPU down, but the GPU pipeline is still smooth.
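In other words, the driver (or the app) could fall back to something like this (a sketch; the attribute index and names are placeholders):

// replication fallback: turn the "constant" attribute into a per-vertex
// array holding the same value at every vertex (index 5 is arbitrary)
GLfloat replicated[MAX_VERTS][4];
for (int v = 0; v < numVerts; v++)
    memcpy(replicated[v], instanceValue, 4 * sizeof(GLfloat));

glEnableVertexAttribArrayARB(5);
glVertexAttribPointerARB(5, 4, GL_FLOAT, GL_FALSE, 0, replicated);
glDrawArrays(GL_TRIANGLES, 0, numVerts);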