View Full Version : ARB meeting notes from Jun/Sept/Dec



Corrail
12-27-2004, 01:16 PM
Yeah, they're finally online! ;)
June (http://www.opengl.org/about/arb/notes/meeting_note_2004-06-08.html)
September (http://www.opengl.org/about/arb/notes/meeting_note_2004-09-21.html)
December (http://www.opengl.org/about/arb/notes/meeting_note_2004-12-07.html)

namespace
12-27-2004, 01:44 PM
Pixel Buffer Object Status ... so vote PASSES. Spec will be posted to the registry shortly.

:)

Corrail
12-27-2004, 01:45 PM
EXT_framebuffer_object is done! http://www.opengl.org/about/arb/notes/SBupdatedec.ppt

:D

zeckensack
12-27-2004, 02:17 PM
Originally posted by namespace:
Pixel Buffer Object Status ... so vote PASSES. Spec will be posted to the registry shortly.

:) This one (http://oss.sgi.com/projects/ogl-sample/registry/EXT/pixel_buffer_object.txt) ? Doesn't look too fresh though. It's dated March 2004 and it also says "NVIDIA Release 55 (early 2004) drivers support this extension." :confused:

KRONOS
12-27-2004, 03:35 PM
Originally posted by Corrail:

EXT_framebuffer_object is done! http://www.opengl.org/about/arb/notes/SBupdatedec.ppt

:D :D So where is the f$%&#$@ spec?! :mad: :D

Korval
12-27-2004, 04:57 PM
Yeah! RTT doesn't have a crappy API!

Now the question is, who will have the first implementation: ATi or nVidia? (I'm guessing nVidia, as they tend to be a bit more on the ball than ATi.)

MZ
12-27-2004, 09:35 PM
Instancing geometry - DX has an explicit API with per-primitive and per-vertex attributes, allowing instancing primitives with different per-vertex attributes. Pat notes that setting attributes in immediate mode outside of glBegin/glEnd can accomplish this already.
Interesting. This technique is perhaps as old as OGL itself; today, in the context of GLSL, it equates to faking a uniform variable with an attribute. This would make sense only if changing an attribute value (in immediate mode) is a significantly cheaper state change than changing a uniform value. I can't recall any GPU-optimization document mentioning such a thing...

But this doesn't explain why the DX instancing is tied to Microsoft DirectX Shader Model 3.0 (tm). This suggests dedicated HW is needed, currently available only in nv40. (Yeah, I've heard of ATI's "extension" with FOURCC 'INST'.)

Zak McKrakem
12-28-2004, 12:21 AM
nvidia updated its extensions page but it doesn't mention the framebuffer one.

http://developer.nvidia.com/object/nvidia_opengl_specs.html

Also note: "Extensions marked with an asterisk (*) will be supported in Forceware Release 75 drivers."

Korval
12-28-2004, 02:39 AM
This would make sense only if changing an attribute value (in immediate mode) is a significantly cheaper state change than changing a uniform value. I can't recall any GPU-optimization document mentioning such a thing...
It is. This is a known fact.

The confusion is the notion that this concept is in any way similar to changing attributes outside of glBegin/End. In fact, instancing is not the same.

The reason is that instancing, very specifically, uses vertex arrays, while glBegin/End does not. Instancing is a way of drawing lots of things with one glBegin-equivalent call (i.e. one glDraw* call). The mere fact that you're changing the attribute outside of glBegin/End means that you have multiple glBegin/End calls. Add to this the relative inefficiency of rendering with glBegin/End compared to vertex arrays, and you have all kinds of issues.

bobvodka
12-28-2004, 06:00 AM
Huzzah! A nice late Xmas gift ;)

Let's hope for some framebuffer implementations and the spec soon :D

MZ
12-28-2004, 01:50 PM
@Korval
Let's not confuse things, I'm not even talking about pure immediate mode and drawing with glBegin/End. Things like:

glColor3f(1,0,0);
glDrawArrays();

have been used for ages, but real Uniforms/Constants came to GL quite recently - with the introduction of Shaders/Programs. So we can talk about a performance comparison between the new, real uniform and the old one (now considered "faked") only since that time.

Example:

vec4 foo[2005]; // positions and rotations of 2005 beer cans,
// each one magically encoded in a single vec4.

// program A:
for ( int i = 0; i < 2005; i ++ )
{
glUniform4fv(..., foo[i]);
glDrawArrays(); // draws a single beer can
}

// program B:
for ( int i = 0; i < 2005; i ++ )
{
glVertexAttrib4fv(..., foo[i]);
glDrawArrays(); // draws a single beer can
}

So, now we are told in the meeting minutes that B is the equivalent of DX instancing (thus much faster than A). You, Korval, say that B being faster than A is a "known fact". For me it's surprising, as I find it very much against common sense:

- In both A & B you transfer to the GPU exactly the same amount of data, in identical batches. The purpose of instancing was to fight the penalty incurred by a large number of small batches, so why would anyone expect a significant difference here?

- In B you are using an attribute to disguise itself as a uniform. Why would anyone expect the emulated solution to be faster than the one which was designed solely for the purpose in question (per-primitive data)?

- On nv30 HW, access to an attribute in a fragment program in any instruction other than MOV costs you an additional cycle (unlike access to a constant), and you pay that penalty for every rendered pixel. In some cases it is even faster to read the attribute out to a temp register and use that in computations instead of the original. But too many temps will cost you a penalty too - this must be real fun for the compiler. Anyway, the alleged benefits of B would have to outweigh these losses. Of course nv30 != nv40, but as ATI has shown, the problem of the HW required for instancing is rather fuzzy.

I'm sure such an against-common-sense optimization would have drawn my attention. If you still claim this is a well-known fact, I must have truly missed something.

Humus
12-28-2004, 02:25 PM
Instancing geometry - no need, GL immediate mode rendering calls support this.
This has got to be the worst motivation I've ever heard. You could say vertex arrays, texture objects, display lists and a whole range of other features were useless as well, because other parts of the API can accomplish the same thing. The whole point of instancing is to improve performance by drawing all batches in one call to dodge the overhead of DrawIndexedPrimitive()/glDrawElements(). GL doesn't have any functionality to do this. Now a real motivation would be that it's not really needed in GL, since the overhead of glDrawElements() isn't very high in the first place, unlike DrawIndexedPrimitive() in DX.

knackered
12-28-2004, 02:46 PM
There's quite an overhead in calling glDraw* for 5000 beer cans, irrespective of API. It would certainly push an unnecessarily large stream of commands into the pipeline, when a single command would do.


It is. This is a known fact.
Known by whom, Korval? You and your pet stick insect?

Korval
12-28-2004, 02:55 PM
but real Uniforms/Constants came to GL quite recently - with the introduction of Shaders/Programs. So we can talk about a performance comparison between the new, real uniform and the old one (now considered "faked") only since that time.
No, all uniforms are is a substitution for per-vertex and per-fragment numerical state. They're the programmatic equivalent of changing lighting parameters or TexEnv parameters or other such state. As such, the changes themselves don't take any more or less time than their fixed-function counterparts; they are state changes and should be treated as such.


On nv30 HW, access to an attribute in a fragment program in any instruction other than MOV costs you an additional cycle (unlike access to a constant), and you pay that penalty for every rendered pixel.
Fragment programs don't have access to attributes. They have access to varyings.


So, now we are told in the meeting minutes that B is the equivalent of DX instancing (thus much faster than A). You, Korval, say that B being faster than A is a "known fact".
The thing you don't get is that B isn't the same thing as instancing. Instancing uses one draw call for all primitives. It does not loop over primitives. B, on the other hand, does not rely on vertex arrays at all, but on a legacy concept of OpenGL that gives the illusion of performance without actually providing it.


Now a real motivation would be that it's not really needed in GL since the overhead of glDrawElements() isn't very high in the first place, unlike DrawIndexedPrimitive() in DX.
I've still seen no evidence that a GL program can achieve the same speed as a D3D instancing program using the same data and hardware. Until such evidence is brought forth, and holds in all cases, I firmly reside in the camp that says we need this functionality.

MZ
12-29-2004, 12:34 AM
I made a little test app.

GeForce 5200, Athlon XP 1700 (1400 MHz)
icosahedron model stored in display list (20 triangles, 12 verts)
20000 instances drawn each frame
screen size of each: ~15 pixels
dumb vertex shader (MVP matrix multiply preceded by single vec4 add)
dumb fragment shader (constant color)
no blending, etc.

two paths of the innermost drawing loop resemble my A & B pseudocode above: no state changes between drawing instances other than single uniform/attribute.

assembly output shows B needs one VP instruction more than A (6 vs 5)

results:
A, "uniform" mode: 5.25 fps
B, "attribute" mode: 33 fps

20 triangles * 3 verts (no vertex sharing) * 20000 instances * 6 VP instructions * 33 fps = 237M. Pretty close to theoretical maximum of 250MHz GPU.

With larger meshes the speedup effect gets diminished.

knackered
12-29-2004, 04:31 AM
...and with d3d9 instancing?

The thing with OpenGL is that it should have paths that can be optimised by each vendor for their hardware. The absence of a *specific* mechanism for specifying instances means that, in OpenGL, a vendor cannot take advantage of specific instancing tricks they've developed for d3d9, possibly in hardware. The fact that there's a kind of work-around in OpenGL that is almost as fast as d3d instancing isn't really the point - the work-around may be as fast now, but what about the future? It puts an additional burden on the driver to detect that the app is doing instancing and switch it to the optimised path....if it's possible to detect.
Couple this with the fact that big missing features like this make a cross-api generic renderer interface just that bit more tricky to get done elegantly....and managing your app vertex data becomes awkward.
A frequency parameter for every vertex attribute would be simple to add wouldn't it?

SirKnight
12-29-2004, 01:28 PM
There's quite an overhead in calling glDraw* for 5000 beer cans, irrespective of API. It would certainly push an unnecessarily large stream of commands into the pipeline, when a single command would do.
That is very true. I realized this a long time ago when I had a program whose scene was organized in a BSP tree, and I had a function that would construct batches based on common textures. Well, I had done something really dumb in that code which caused every triangle to be in its own batch. :D So when I was rendering, what happened was I called glDrawElements for every triangle in the scene. The scene had around 5,000 triangles in it, give or take a few. My framerate from this was less than 10. Once I realized what was going on and fixed my batching code, it jumped up into the 200-300s.

The overhead may be low for a single call or a small number of calls of glDraw* but it will pile up quickly once you start doing hundreds or thousands of calls which is quite common in a scene with an a$$ load of objects, like asteroids in space for example.

-SirKnight

zed
12-29-2004, 09:47 PM
The overhead may be low for a single call or a small number of calls of glDraw* but it will pile up quickly once you start doing hundreds or thousands of calls which is quite common in a scene with an a$$ load of objects, like asteroids in space for example
Just concatenate your vertices/texcoords etc. together; this drops your draw calls to one/a few (even a slow CPU can handle this APOP), and not really that much more memory is required.

V-man
12-30-2004, 03:29 PM
>>>results:
A, "uniform" mode: 5.25 fps
B, "attribute" mode: 33 fps
<<<

Is this a common case on all hardware, including from other vendors?

Korval
12-30-2004, 04:55 PM
Is this a common case on all hardware, including from other vendors?
Uniforms cause state changes, which create stalls in the rendering pipeline. Even if attribute setting isn't a function of the hardware, all that needs to be done is the construction of a vertex array/buffer containing that value replicated, and all is good. This is driver-side stuff, so it isn't so bad. It slows the CPU down, but the GPU pipeline stays smooth.

knackered
12-31-2004, 03:12 AM
Why are you saying that state changes cause a stall in the pipeline?

zeckensack
12-31-2004, 07:23 AM
Originally posted by knackered:
Why are you saying that state changes cause a stall in the pipeline?
State changes trigger reconfiguration. "Work" OTOH doesn't.
If you reconfigure a pipelined hardware device, it's very hard to maintain coherency without flushing.

I agree with Korval insofar as this is likely the cause of the performance difference between attributes (work) and uniforms (state).

____
Korval,
I think I've seen evidence that both ATI and NVIDIA don't need to alter the geometry. Immediate mode attributes (that aren't updated from array data) can just stick at the vertex fetch stage. No need for replicating them.

Humus
12-31-2004, 03:05 PM
Originally posted by knackered:
There's quite an overhead in calling glDraw* for 5000 beer cans, irrespective of API. It would certainly push an unnecessarily large stream of commands into the pipeline, when a single command would do.
It's not nearly as dramatic in OpenGL as in D3D, since you don't have the context switch to ring0 and back to ring3 again for each draw call. I'm doubtful instancing will ever be particularly useful on the OpenGL side. It's hard enough to motivate on the D3D side. If your object has > 100 triangles, the bottleneck has already moved to the vertex shader, even with a simple shader. Perhaps future hardware will change the balance and an instancing feature can be considered then, but today it's not really needed. 5000 draw calls, btw, isn't going to stop you from running at smooth framerates. Now if you really need to draw 20000+ objects of the same kind, you can still use the shader constant instancing method, where you pack the instance data into uniforms and look it up in the vertex shader. It is pretty much equally good, sometimes even faster.

Korval
12-31-2004, 05:19 PM
It's not nearly as dramatic in OpenGL as in D3D since you don't have the context switch to ring0 and back to ring3 again for each draw call. I'm doubtful instancing will ever be particularly useful on the OpenGL side.
What is the particular basis for your doubt of the usefulness of this technique? Have you seen a program that actually compares instancing in D3D to non-instanced draw calls in GL? Under both nVidia and ATi hardware?

20,000 draw calls may not be 20,000 switches to Ring0 and back, but it isn't cheap.


It's hard enough to motivate on the D3D side. If your object has > 100 triangles, the bottleneck has already moved to the vertex shader even with a simple shader. Perhaps future hardware will change the balance and an instancing feature can be considered then, but today it's not really needed.
There are plenty of objects that we would like to draw 10,000 of that are less than 100 triangles in size. Tufts of grass, for example. A massive forest (of very simple trees). Or a field of fragment-based imposters.

Plus, if it is going to be needed in the future, what is the harm of adding it today? Indeed, considering the general slowness of the ARB, starting the process today would be a good idea. Worst case, it becomes a feature of GL that nobody uses; GL has plenty of those, so nobody will really notice. We're talking about 1 or 2 entrypoints here, not a massive change to how vertex arrays work.


Now if you really need to draw 20000+ objects of the same kind you can still use the shader constant instancing method where you pack the instance data into uniforms and look it up in the vertex shader. It is pretty much equally good, sometimes even faster.
There aren't anywhere near 20,000 uniforms, so you're going to need a lot of state changes. And, as we've demonstrated here, the state change penalty for switching out uniforms is hardly trivial.

<edit: missed this from before>


I think I've seen evidence that both ATI and NVIDIA don't need to alter the geometry. Immediate mode attributes (that aren't updated from array data) can just stick at the vertex fetch stage. No need for replicating them.
Really? That's pretty nifty if it does work that way. I think the best way to verify it is to render with the set attribute and then render the same scene with that attribute bound to a vertex buffer (replicated with the same value). If both methods are equally fast, or the vertex buffer method is slightly slower, then it means that their hardware does allow updating attributes directly through immediate mode. While it doesn't eliminate the draw call overhead, it does mean that the only thing direct instancing would buy you is losing that overhead.

Humus
12-31-2004, 09:13 PM
Originally posted by Korval:
What is the particular basis for your doubt of the usefulness of this technique? Have you seen a program that actually compares instancing in D3D to non-instanced draw calls in GL? Under both nVidia and ATi hardware?
I have not compared to non-instanced calls in GL, but I've done quite a bit of work on instanced vs. non-instanced on the DX side, both at home and at work. The first time, it took me quite a lot of effort, and even some help from one of our driver guys, to get it to run faster than the non-instanced path at all. And our hardware still sees larger benefits from using instancing than nVidia's, at least the last time I checked.


Plus, if it is going to be needed in the future, what is the harm of adding it today?
We don't really need more garbage hanging around in the API. If it's going to be added, it should be proved useful first.


There aren't anywhere near 20,000 uniforms, so you're going to need a lot of state changes. And, as we've demonstrated here, the state change penalty for switching out uniforms is hardly trivial.
If you have 20,000 instances, already by packing two instances per batch you're down to 10,000 calls. With 4 you're at 5,000 calls, etc. It doesn't take many instances per batch to cut down the number of calls so that the bottleneck ends up elsewhere. Depending on how much instance data you have, you'll probably be able to pack 30-60 instances in a batch, getting the number of draw calls down to 300-600 for 20,000 instances. In that case, the bottleneck has long since shifted over to the vertex shader, and the cost of draw calls and uploading uniforms is totally hidden.

AndrewM
01-01-2005, 12:04 AM
Originally posted by Humus:
We don't really need more garbage hanging around in the API. If it's going to be added, it should be proved to be useful first.
It needs to be added before it can be tested and proved.

:)

nystep
01-01-2005, 05:43 AM
I think there is a feature that was previously discussed here that should be in WGF 1 or 2: the multiple index streams thing (I don't really remember the exact name). Having an extension to do this would perhaps cover the needs of geometry instancing. Furthermore, it is backward compatible, since the driver would do pretty much the same job as Humus describes above to render everything. On newer hardware it would even save memory bandwidth. What do you think about it?

knackered
01-01-2005, 09:33 AM
Originally posted by Humus:

There aren't anywhere near 20,000 uniforms, so you're going to need a lot of state changes. And, as we've demonstrated here, the state change penalty for switching out uniforms is hardly trivial.
If you have 20,000 instances, already by packing two instances per batch you're down to 10,000. With 4 you're at 5,000 calls etc. It doesn't take many instances per batch to cut down the number of calls so that the bottleneck ends up elsewhere. Depending on how much instance data you have you'll probably be able to pack 30-60 instances in a batch, getting the number of draw calls down to 300-600 for 20,000 instances. In that case, the bottleneck has long shifted over to the vertex shader, and the cost of draw calls and uploading uniforms is totally hidden.
I don't get this - you seem to be using the word 'instancing' in the wrong context. If you're talking about packing 30-60 'instances' into a batch, then you're talking about replicating vertices to work around the absence of an instancing mechanism....which uses loads more memory, and memory bandwidth, which is the exact problem instancing attempts to address. Scale your example up and you're using a significant resource. So you're arguing against the very idea of instancing? Seems odd: while everyone else in the industry seems to be pushing for things like procedural textures and geometry in order to address the increasing detail-versus-memory-constraints problem, you seem to be against instancing, which would help greatly in this area.

Humus
01-01-2005, 10:11 AM
I'm not against the idea. If it proves useful in GL I'm all for it. I'm just saying I'm not so sure this is the case.

As for the instancing method I described: the idea of instancing is to be able to draw many instances with one draw call, as the main bottleneck is considered to be the actual draw calls (in DX anyway). This method solves that problem just as well as "real instancing". Yes, your VBO needs to contain several copies of the model, but since we're talking about < 100 triangle models, it will still be very small. Say vertex + normal + texcoord, 80 vertices and 50 copies - that's only 125 KB. Hardly problematic resource usage. The data that needs to cross the AGP bus every frame is also the same as in real instancing.

Korval
01-01-2005, 12:51 PM
In that case, the bottleneck has long shifted over to the vertex shader, and the cost of draw calls and uploading uniforms are totally hidden.
But it still isn't as fast as the truly instanced case.


The data that needs to pass the AGP bus every frame is also the same as in real instancing.
Not really. The reason that models of greater than 100 triangles (or thereabouts) are not faster with instancing is that these models blow the pre-T&L cache. If an instance fits into the pre-T&L cache entirely, then there's no problem. Every call after the first will not provoke a hit on memory, save for the index load.

If you do your instancing mechanism, the likelihood is that you'll blow the pre-T&L cache, and every vertex (depending on intra-instance sharing) will provoke a memory access.

nystep
01-01-2005, 01:57 PM
Korval,

You're talking about a pre-T&L cache - would you have any hints/links explaining what it is further? I'm interested... How do you know its size? Is it available on all 3D accelerators?

regards,

Jan
01-02-2005, 03:11 AM
Hi

Just because OpenGL is already faster (on that point) doesn't justify not making it any faster.

This is really a lame excuse and such an attitude will hurt OpenGL in the long run. I dare to say, that it already did for several years.

I am pretty sure, that there are lots of apps, that might benefit from instancing, even if the bottleneck it tackles, might not be the most important.

And it seems not to be THAT hard to implement it, so why do we have to fight for it that hard??

Jan.

zed
01-02-2005, 11:20 AM
Originally posted by Jan:
Hi

Just because OpenGL is already faster (on that point) doesn't justify not making it any faster.

This is really a lame excuse and such an attitude will hurt OpenGL in the long run. I dare to say, that it already did for several years.

I am pretty sure, that there are lots of apps, that might benefit from instancing, even if the bottleneck it tackles, might not be the most important.

And it seems not to be THAT hard to implement it, so why do we have to fight for it that hard??

Jan.
Speaking as someone who knows nothing about the internal workings of drivers, but:
A/ I'm not too sure a LOT of apps are gonna benefit from it, thus it's only gonna benefit some ppl. E.g. how will it benefit doom3/maya?
B/ Nothing is free; adding this will make the driver more complicated (hence more prone to bugs, fewer opportunities to be optimized). I know if I had a choice between making everything go slightly faster or making a specialised path go a lot faster, I know which I'd choose.

Korval
01-02-2005, 12:25 PM
You're talking about a pre-T&L cache - would you have any hints/links explaining what it is further? I'm interested... How do you know its size? Is it available on all 3D accelerators?
It's a memory cache, just like any other regular memory cache. It works like the one in your CPU. If a vertex index (when converted into one or more actual memory addresses) would provoke a fetch from a memory location that is already in the cache, then it doesn't fetch from memory. Just like the one in your CPU. It's just an L1 (or maybe an L2, depending on hardware) cache strapped to the vertex-reading apparatus of the hardware.


And it seems not to be THAT hard to implement it, so why do we have to fight for it that hard??
Look how long it took to get RTT (and, despite any notes to the contrary, we don't have it yet).


A/ I'm not too sure a LOT of apps are gonna benefit from it, thus it's only gonna benefit some ppl. E.g. how will it benefit doom3/maya?
Bad excuse: no new extension benefits already-existing applications. Glslang doesn't benefit Doom3 or the current version of Maya either; that doesn't mean we shouldn't have it.


B/ Nothing is free; adding this will make the driver more complicated (hence more prone to bugs, fewer opportunities to be optimized). I know if I had a choice between making everything go slightly faster or making a specialised path go a lot faster, I know which I'd choose.
Considering that ATi is perfectly capable of optimizing it in D3D for hardware that doesn't even support instancing directly (R420), I don't think it puts an undue burden on driver writers. Plus, it's an extension; as such, it's not required. We're not asking for it to be included in the core, or even to be an ARB extension. Just get both ATi and nVidia to agree on it, so that those two can implement it.

knackered
01-02-2005, 01:26 PM
B/ Nothing is free; adding this will make the driver more complicated (hence more prone to bugs, fewer opportunities to be optimized). I know if I had a choice between making everything go slightly faster or making a specialised path go a lot faster, I know which I'd choose.
Its existence would provide an opportunity for optimisations; that's the whole point...not having an explicit mechanism makes it virtually impossible to optimise for instancing. You need to be able to say to the driver "I'm instancing". If the driver doesn't want to optimise instancing, it can just drop to a slow path. That's the nice thing about OpenGL....the matrix stack has always been part of the core, display lists have always been part of the core....if a driver had no way of optimising display lists, they would just behave like immediate mode, but at least you had the opportunity to say "this stuff is static", just in case a future driver could do something with that valuable information. glDrawRangeElements is another one....some drivers may ignore the extra information, but most drivers are grateful for it.

zed
01-02-2005, 04:10 PM
Bad excuse: no new extension benefits already-existing applications. Glslang doesn't benefit Doom3 or the current version of Maya either; that doesn't mean we shouldn't have it.
You misunderstand me. I'm using a couple of examples of apps that (if instancing had existed before they were even dreamt about) would most likely not use it. E.g., where would you use instancing in doom3?

B/ Display lists + range elements are different - they are more generic. Instancing is beneficial only for limited data sets (e.g. blades of grass). I liken instancing to point sprites: a waste of time (OK, there are some limited cases where they are beneficial), but personally it's just added clutter to the API. I would prefer a lean, mean API instead of a slow, bulky API that does everything.

harsman
01-02-2005, 04:22 PM
Of course it would provide an opportunity for optimisation, and in the long run adding an abstraction for instancing to OpenGL might be useful, but in the short run I know tons of other things I'd rather see worked on over this.

The draw calls in OpenGL are really lightweight as it is, FAR more so than in D3D, and I find being batch-limited quite rare in OpenGL with sane engine design.

If one really wanted to, instancing might be achieved by a modification to MultiDrawArrays.

Humus
01-02-2005, 07:06 PM
Originally posted by Korval:
But it still isn't as fast as the truly instanced case.
But the difference will be negligible.


Not really. The reason that models of greater than 100 triangles (or thereabouts) are not faster with instancing is because these models blow the pre-T&L cache. If an instance fits into the pre-T&L cache entirely, then there's no problem. Every call after the first will not provoke a hit on memory, save for the index load.

If you do your instancing mechanism, the likelihood is that you'll blow the pre-T&L cache, and every vertex (depending on intra-instance sharing) will provoke a memory access.
Not sure what you're saying "not really" about, as the data passing the AGP is still the same regardless of pre-T&L cache utilization, but anyway, I understand your argument, though I'm not sure I agree, for three reasons. The first is that in general memory access is seldom the bottleneck anyway, so it usually doesn't matter that much. The second is that instancing already screws up the pre-T&L cache, as it requires two vertex streams. The third reason, I'm afraid, involves some non-public information that I can't disclose, which I believe is a good argument for why what you say isn't the case in practice, but of course it's hard to make this convincing without going into details.

Humus
01-02-2005, 07:17 PM
Originally posted by Jan:
Just because OpenGL is already faster (on that point) doesn't justify not making it any faster.
If it doesn't make any difference in practice, then that's certainly a good reason not to include it. Everything you do has some overhead. The question is: is it significant enough to motivate additional functionality to dodge it, particularly given the likelihood that it will actually be used by common applications? glDrawElements() has its overhead, but so do glEnable(), glBindTexture(), etc. There are probably some usage scenarios where putting texture changes directly into the index stream to glDrawElements(), so that several meshes could be drawn with a single draw call given some hardware support, would help - but is it worth the effort and the API pollution? That's the question.

Humus
01-02-2005, 07:29 PM
Originally posted by knackered:
Its existence would provide an opportunity for optimisations; that's the whole point...not having an explicit mechanism makes it virtually impossible to optimise for instancing. You need to be able to say to the driver "I'm instancing". If the driver doesn't want to optimise instancing, it can just drop to a slow path. That's the nice thing about OpenGL....the matrix stack has always been part of the core, display lists have always been part of the core....if a driver had no way of optimising display lists, they would just behave like immediate mode, but at least you had the opportunity to say "this stuff is static", just in case a future driver could do something with that valuable information. glDrawRangeElements is another one....some drivers may ignore the extra information, but most drivers are grateful for it.
Well, you mention display lists - instancing in GL runs the risk of becoming another of those "seemed like a good idea when it was added, but turned out to be a mess in the long run" kinds of features. Display lists take up a whole lot of code, and actually slow down the whole API, since pretty much each and every call has to check whether we're currently collecting a display list or rendering as usual, and there are better ways to deal with what they're mainly used for - namely storing geometry - like VBOs. Same with immediate mode. It has clogged up the API, since all extensions typically are made to be orthogonal with immediate mode calls. There's a good reason why both display lists and immediate mode were ditched in OpenGL ES.
As much as I love new features, it's hard to get excited about instancing, especially in OpenGL where glDrawElements() is so lightweight already. I think there are other priorities that are more important right now.

Korval
01-03-2005, 12:37 AM
But the difference will be negligible.Have any specific evidence of that? Once again, have you seen a demo that compares the best OpenGL can do in instancing situations to the best D3D can do?

The onus isn't on us to provide a reason for this extension; we already have one (worst case, it does nothing. Best case, non-trivial performance gains. Ergo worthwhile). The onus is on you to provide some specific evidence that shows how it would not be useful.


Not sure what you're saying "not really" about, as the data passing over the AGP is still the same regardless of pre-T&L cache utilizationI don't think I had realized that you could replicate vertex data via indices, thus avoiding replicating the actual vertex data and potentially blowing your cache.


The other is that instancing already screws up the pre-T&L cache, as it requires two vertex streams.It isn't that bad (or bad at all). It simply requires a different kind of per-vertex fetch operation. It doesn't screw up the cache unless the hardware has some horrible limitation.


The third reason, I'm afraid, involves some non-public information that I can't disclose, which I believe is a good argument for why what you say isn't the case in practice, but of course it's hard to make that convincing without going into details.Considering that you work for ATi, this means that it's ATi's problem, not a problem with the concept as a whole or with hardware in general. They should have made a real graphics card this go around, with real features, rather than a simple knockoff of the R300.


It has clogged up the API since all extensions typically are made to be orthogonal with immediate mode calls. There's a good reason why both display lists and immediate mode were ditched in OpenGL ES.And yet, it is immediate mode which gives OpenGL some semblance of instanced rendering.


I think there are other priorities that are more important right now.Such as? Performance should always be priority #1. Just because ATi doesn't see it as a priority doesn't mean that it isn't a priority. And, looking at what the ARB has cooking, it ain't much. This is not a highly complex spec that requires 2 years (including a failed year) to make progress on. It is a spec of already known behavior that can, in good hardware, potentially improve performance.

Christian Schüler
01-03-2005, 05:47 AM
It is some time ago that I used OpenGL for game engine work, it was the time of the GeForce2 then.

To test the batch performance, I implemented a "batch test mode" where, at the level of doing the glDrawElements() call, the primitive count would be reduced to 1 when batch test mode was enabled. So in batch test mode each batch of a scene was rendered with 1 triangle.

When run at a sufficiently small resolution (say, 320x240), so that pixel work is negligible, the batch test mode could tell whether the rendering speed was dependent on the size of batches or not. If the rendering speed is not dependent on batch test mode, this is evidence that batch overhead is the bottleneck.

What I can tell is that batch overhead was not the bottleneck: with one particular scene having some 5500 batches per frame and ~1,000,000 rendered triangles (that's 200 triangles per batch, on average), FPS did improve significantly in batch test mode.
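The batch-test trick can be sketched in a few lines of C. This is a hypothetical helper (the flag and function names are invented for illustration); in the real renderer the returned value would be passed as the count argument of glDrawRangeElements().

```c
#include <assert.h>

/* Hypothetical batch-test switch: when enabled, every batch is cut down
 * to a single triangle (3 indices), so per-batch CPU/driver overhead is
 * unchanged while vertex and pixel work all but disappear. */
static int g_batch_test_mode = 0;

/* Returns the index count to actually submit for a batch. */
static int effective_index_count(int index_count)
{
    return g_batch_test_mode ? 3 : index_count;
}
```

If the frame rate barely moves when the flag is flipped, the per-call overhead, not the geometry, is the bottleneck.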

I think the instancing API is not strictly needed in OpenGL for simple object instancing, like, placing batteries of trees or lanterns into the scene. glDrawElements() interleaved with glLoadMatrix() is already fast enough.

However, some points:

The added semantics of The Instancing API (tm) may offer some new programming techniques previously difficult to communicate to the driver.

If you go extreme and view a single quad as an instanced object, like, in a particle system, clearly glDrawElements() isn't going to cut it.

The CPU work consumed by each render call could be used for different purposes. In a game, there's always too little CPU left. So, it may be true that batch overhead is not the bottleneck at 200 tris/batch (the CPU can submit batches faster than the card can process them), but we would still like to get rid of the CPU consumption altogether :-)

V-man
01-03-2005, 03:23 PM
Originally posted by Christian Schüler:
To test the batch performance, I implemented a "batch test mode" where, at the level of doing the glDrawElements() call, the primitive count would be reduced to 1 when batch test mode was enabled. So in batch test mode each batch of a scene was rendered with 1 triangle.

When run at a sufficiently small resolution (say, 320x240), so that pixel work is negligible, the batch test mode could tell whether the rendering speed was dependent on the size of batches or not. If the rendering speed is not dependent on batch test mode, this is evidence that batch overhead is the bottleneck.
Sorry, but where is the conclusion to this? Was glDrawElements() a bottleneck or not?
And since you say Gf2, why not use glDrawRangeElements()?

Christian Schüler
01-03-2005, 03:46 PM
Sorry for the convoluted wording.

Actually, it was glDrawRangeElements().

No, glDrawRangeElements() wasn't a bottleneck: drawing 5500 batches with 1 triangle each (on a 600 MHz PC, with Detonator 44.something) resulted in frame rates of 30-40 Hz, and with full geometry on it was much lower, which means OpenGL was pushing at least 150k batches per second.

EDIT:
Looking back at the first page, the example posted 20000 instances @ 5 Hz, which is just 100k batches per second. Maybe the drivers have become more batch-unfriendly over time? (More stuff to do...)

zed
01-03-2005, 04:22 PM
"(worst case, it does nothing. Best case, non-trivial performacne gains. Ergo worthwhile)"

but the thing is it doesnt do nothing (as in affect performance/stability),
(your answer to this 'but they can always choose a different path, ie ignore the instancing path ')
true (though having the driver make a choice is gonna have a slight impact)
the major problem though is the added complexity in the driver which leads to less stabilty(oportunities to optimize)
(your answer to this 'but implementing it is trivial , they do it once + forget it')
how often have u seen with new releases of drivers, old stuff that once worked becomes broken, adding instancing will create extra burden on the driver writers, leading to worse drivers. nobodies perfect.

where would u use it(instancing) in doom3?
where would u use it in 3dmax?

are u doing over 100million tris/sec now with your app?
ie youre not even realising the potential of what u have to play with, yet u want more!

comeon korval give it up ;)

Korval
01-03-2005, 04:55 PM
the major problem though is the added complexity in the driver which leads to less stabiltyThen they shouldn't implement the extension. Just like if a D3D implementation couldn't handle instancing, then they don't have to.

Equally importantly, it isn't a difficult thing to implement. If the hardware supports it directly, then it is trivial. And, if it doesn't, then don't implement the extension or write a fairly short bit of code to convert one glDraw* call into many.


where would u use it(instancing) in doom3?
where would u use it in 3dmax?Doom 3 is an indoor game. And, since this is not the actual Doom 3 (which is a finished game and therefore can't use instancing), but a hypothetical Doom 3, I could still see uses for instancing. For example, imagine a cave. Now imagine that the cave floor/walls are littered with rocks. Granted, it'd be the same rock, but with different positions/orientations it would be difficult to notice this in a game environment. It's a step beyond mere bump mapping and into what, in movie terms, would be set decoration.

In theory, you could have the walls themselves made of nothing but instances of a repeated material: bricks, planks of metal, etc. No need for bump mapping or the much slower displacement mapping; this is real, live geometry on the walls, created via step-and-repeat. You could imagine a wall made up of much more interesting geometry this way.

In an outdoors game, there are even more uses for instancing.

3D Max isn't even a performance program. This is a performance extension, so you shouldn't expect them to use it. They probably don't use VBO's either (since their vertex data changes rather frequently and in ways that most game applications would consider unusual).


are u doing over 100million tris/sec now with your app?I certainly won't be able to with my CPU taken up by sending a bunch of batches, rather than running the application.

MZ
01-03-2005, 07:07 PM
Update to the test results:

I suspect the driver optimized the display list by detecting shared vertices and turning the submitted vertex data into an indexed mesh. I think so because I noticed performance dropped (by 20-40%, in both paths) when I prevented welding of vertices by modifying positions by a small random value in individual triangles.

Instead of "20 triangles * 3 verts * (...)" there should be "12 verts * (...)"

So the "yield" is actually 5 times lower than it seemed. Of course, the fps numbers (and hence the relative speedup value) are unaffected by this.

zed
01-03-2005, 11:28 PM
Originally posted by Korval:
Then they shouldn't implement the extension. Just like if a D3D implementation couldn't handle instancing, then they don't have to.dammed if u do dammed if u dont


Equally importantly, it isn't a difficult thing to implement. If the hardware supports it directly, then it is trivial. And, if it doesn't, then don't implement the extension or write a fairly short bit of code to convert one glDraw* call into many.ive made an appointment for u to go and see nvidia/ati this saturday between 2-2.30pm, should be plenty of time for u to wip up a instancing implementation (they also mentioned if u finish early about perhaps u could throw together a render to texture implementation for them as well)



In theory, you could have the walls themselves made of nothing but instances of a repeatd material: bricks, planks of metal, etc. No need for bump mapping or the much slower displacment mapping; this is real live geometry on the walls, created via step-and-repeat. You could imagine a wall made up of more interesting geometry this way.how many vertices do u have to use to emulate a bumpmapped brick, perhaps if instancing was 1000x quicker than an non instanced method u could do it, but mate, its not 1000x quicker not even close.


I certainly won't be able to with my CPU taken up by sending a bunch of batches, rather than running the application.im working with scenes which are in the 100,000s of vertices, the number of verts aint the bottleneck

Korval
01-04-2005, 12:59 AM
BTW, Zed, feel free to use appropriate punctuation and sentence capitalization in your posts.


ive made an appointment for u to go and see nvidia/ati this saturday between 2-2.30pm, should be plenty of time for u to wip up a instancing implementation (they also mentioned if u finish early about perhaps u could throw together a render to texture implementation for them as well)I presume you have secured their current driver codebase, as well as their hardware documentation and engineers, so that I may have various questions answered. After all, without these resources (among others), no one would be capable of writing any kind of functioning OpenGL driver for their cards.

Oh, and FYI: both nVidia and ATi have implemented instanced rendering into their D3D drivers. If it wasn't oppressively hard to put them there, it can't be that hard to put them in their GL codebase with an appropriate API. It is the same hardware, after all.


how many vertices do u have to use to emulate a bumpmapped brick, perhaps if instancing was 1000x quicker than an non instanced method u could do it, but mate, its not 1000x quicker not even close. The reason it isn't done (and I'm not talking about a detailed brick; I'm talking about a relatively simplified brick pattern) is not due to hardware issues. It is simply because of the brutal memory costs associated with such hyper-detailed terrain. A wall that could have been 2 polys can quickly become 3,000 with such detail. That's a massive increase in the size of the vertex data, and it can easily get out of control. However, if you build it out of instanced pieces, you save memory by only storing the location/orientation of the instances.
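A rough back-of-the-envelope calculation of that memory saving. All sizes here are illustrative assumptions (a 100-vertex brick at 32 bytes per vertex, 1000 instances, one 4x4 float matrix per instance), not measured figures:

```c
#include <assert.h>

enum {
    VERTS_PER_BRICK  = 100,  /* assumed mesh size        */
    BYTES_PER_VERT   = 32,   /* assumed vertex footprint */
    INSTANCES        = 1000,
    BYTES_PER_MATRIX = 64    /* 4x4 floats               */
};

/* Memory if the brick's vertex data is physically replicated. */
static long replicated_bytes(void)
{
    return (long)INSTANCES * VERTS_PER_BRICK * BYTES_PER_VERT;
}

/* Memory if the brick is stored once plus per-instance transforms. */
static long instanced_bytes(void)
{
    return (long)VERTS_PER_BRICK * BYTES_PER_VERT
         + (long)INSTANCES * BYTES_PER_MATRIX;
}
```

With these numbers the replicated copy needs about 3.2 MB against roughly 67 KB for the instanced version — the "brutal memory cost" being avoided.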


im working with scenes which are in the 100,000s of vertices, the number of verts aint the bottleneckVertex processing, of course, isn't the bottleneck in question: vertex upload and CPU processing is the bottleneck that instancing is designed to mitigate.

KRONOS
01-04-2005, 04:35 AM
(not wanting to interrupt the instancing debate) Why don't they post the EXT_framebuffer_object in the registry? If it is ready, is there any reason not to post it?

bobvodka
01-04-2005, 06:39 AM
interesting point, probably because of Xmas, and maybe to give the IHVs time to get a driver sorted with it in before letting us see it...

knackered
01-04-2005, 07:12 AM
Originally posted by Korval:
BTW, Zed, feel free to use appropriate punctuation and sentence capitalization in your posts.:)

People seem to be paralysed with fear of anyone adding anything to OpenGL at the moment, for fear of the nvidia/ati driver writers being unable to cope with the extra complexity. Is this a justified fear? Maybe with ATI it is, but NVidia are pretty savvy.
I don't see the problem - like Korval says, it's in d3d now, and I doubt very much that vertex streams were introduced *just* because of d3d's drawprimitive call penalty... it's a pretty dramatic change to the mechanism in d3d, a lot of work put into something that is obviously going to be useful. It doesn't need to be a major change in OpenGL because of the nice way vertex arrays are handled already.
Also, I keep hearing this talk of the driver having yet another state to consider when issuing draws, but surely this is outweighed by the fact that when instancing it has far fewer states to consider, because thousands of draw calls are condensed into one.
Let me have this feature please.

marco_dup1
01-04-2005, 07:34 AM
Maybe there will be a feature in the near future that makes this kind of instancing obsolete: a much more general mechanism, for example vertex generation and killing in the vertex shader.

knackered
01-04-2005, 07:51 AM
I seriously doubt it. In any case, how would that be useful for instancing?

cyclone
01-04-2005, 08:42 AM
>I seriously doubt it. In any case, how would that be useful for instancing?

This can be useful for a lot of things, such as fractal terrain generation (cf. one big quad can become a lot of small quads/triangles to "simulate" bump mapping).

Ok, in this case we don't destroy vertices but add a lot of other vertices, but the idea is the same (the number of input and output vertices aren't the same...).

It seems to me that ATI has already made something like this, named TruForm or something like that.


@+
Cyclone

Korval
01-04-2005, 09:27 AM
a much more general mechanism, for example vertex generation, killing in the vertex shader.It would be better to have an entirely new programmable mode (i.e. a primitive processor), rather than overloading vertex shaders. A decent primitive processor needs to do lots of memory accesses in order to do truly useful stuff. Plus, it helps with pipelining, as primitive processing can happen completely in parallel with vertex shading.

As to your point, yes, a primitive processor can do instancing. However, it will likely be slower than a hardware-based solution.


It seems to me that ATI has already made something like this, named TruForm or something like that.TruForm was just a tessellation and mesh-smoothing mechanism ATi created, much like GeForce 3/4 hardware had some form of polynomial surface generation. Both of these were very hard-wired and non-trivially restrictive. I'm pretty sure that later hardware (R300+ and NV30+) doesn't even have these features, though I may be mistaken. At the very least, nobody seems terribly interested in using them.

V-man
01-04-2005, 09:30 AM
Originally posted by cyclone:

This can be usefull for a lot of things such as fractal terrain generation for example (cf. one big quad can become a lot of smalls quads/triangles for "simulate" bump mapping).
That's a tessellation engine, and TruForm was one. TruForm was also a big failure and cuts performance in half. NVidia had evaluators, which were also dropped.

There is talk that DX10 will support a programmable tesselator, but it's just rumours.

Instancing is about reducing API calls. You render the same thing except you use another stream of data to replace all the color or all the normals or all the texcoords in another VBO. Of course, in a shader, use them however you want.

The ARB meeting notes give an example of what is instancing-like, but there are other ways, like having separate arrays for normals and texcoords.

I think that a lot of the ARB members are reluctant to add features, because it causes an explosion of complexity. Instancing was refused with one short sentence :)

At least GL ES is cleaned up. They want to eliminate FF altogether in GL ES 2.0

marco_dup1
01-04-2005, 09:57 AM
Originally posted by Korval:

As to your point, yes, a primitive processor can do instancing. However, it will likely be slower than a hardware-based solution.

Yes, a hard-wired solution will mostly be faster, but I doubt it's economical to have hard-wired instancing because it's so specialised. How often do you need instancing? Maybe for some games (the boring kind without a new idea but very good graphics). Maybe somewhere there are people who have some new ideas of what to do with these new possibilities. But I'm pessimistic, because life is so much more interesting than games; you always have a real risk :-)

knackered
01-04-2005, 10:38 AM
Trees, grass, clouds, rocks.
Not teapots, granted - but most serious users of OpenGL are interested in more than buggering around with bumpmapped teapots and rabbits.
I refer you once again to the arguments for procedural textures and geometry. It's the same argument for instanced geometry.

MikeC
01-04-2005, 12:51 PM
Originally posted by V-man:
At least GL ES is cleaned up. They want to eliminate FF altogether in GL ES 2.0Interesting. I hadn't heard that.

I haven't been paying much attention to OpenGL-ES, but if backward-compatibility cruft is becoming the hindrance that several posts in this thread seem to indicate, I wonder whether it might one day drop the "Embedded" and become what was once mooted as OpenGL 2.0 Pure.

marco_dup1
01-04-2005, 02:57 PM
Originally posted by knackered:
Trees, grass, clouds, rocks.
Not teapots, granted - but most serious users of OpenGL are interested in more than buggering around with bumpmapped teapots and rabbits.
I refer you once again to the arguments for procedural textures and geometry. It's the same argument for instanced geometry.No, my argument is GLSL; yours is Phong fragment shading. And please show me that it's so much faster. Show me that it's not possible now.

Humus
01-04-2005, 04:39 PM
Originally posted by Korval:
Have any specific evidence of that?How about common sense? If you have a bottleneck somewhere else, speeding up this particular piece of the pipeline isn't going to help. You don't exactly expect performance increases from optimizing your vertex shader when you're fillrate limited either.


Originally posted by Korval:
The onus isn't on us to provide a reason for this extension; we already have one (worst case, it does nothing. Best case, non-trivial performance gains. Ergo worthwhile). The onus is on you to provide some specific evidence that shows how it would not be useful.That's an incredibly backwards way of working. You don't start by expecting everything put forth to be implemented unless it is explicitly proven to be useless by those you expect to implement it. It's up to those suggesting a particular feature to convince the implementors that the feature is useful.


I don't think I realized that you could replicate vertex data from indices, thus avoiding replicating the actual vertex data and thus potentially blowing your cache.???


It isn't that bad (or bad at all). It simply requires a different kind of per-vertex fetch operation. It doesn't screw up the cache unless the hardware has some horrible limitation.I think you're expecting too much from current hardware. Most hardware, if not all, sees a slowdown when using multiple streams because of this. That's also why you should use interleaved arrays rather than separate streams unless you have a good reason otherwise. The difference is of course small unless you're limited by vertex fetch performance.


Considering that you work for ATi, this means that it's ATi's problem, not a problem with the concept as a whole or hardware in general. They should have made a real graphics card this go around with real features, rather than a simple knockoff of the R300.To begin with a clarification, what I'm saying here should be interpreted entirely as my own personal opinion and not a official ATI opinion. It's entirely possible that the driver team disagrees.
What I'm referring to is not just ATI's problems. I have very good reasons to believe the same applies to nVidia. Their instancing performance is worse than ours. Both in absolute and relative terms.


And yet, it is immediate mode which gives OpenGL some semblance of instanced rendering.Immediate mode as in glBegin()/glEnd() etc., not as in the ability to provide uniform values across a bunch of primitives. There are glColor calls, for instance, in OpenGL ES IIRC, but there's no glBegin()/glEnd(). Good riddance, I say.


Performance should always be priority #1. Stability is #1. Bug fixes go before performance. But even if you're working on performance, where do you spend your resources on optimizations? A niche feature that's hardly useful on current hardware, or, say, boosting shader performance, resource allocation, etc.?

Humus
01-04-2005, 04:44 PM
Originally posted by Korval:
Equally importantly, it isn't a difficult thing to implement. If the hardware supports it directly, then it is trivial.And you know this by experience I presume?

Korval
01-04-2005, 06:13 PM
You don't exactly expect performance increases by optimizing your vertex shader when you're fillrate limited either.Vertex transfer is limited by the bus between the CPU's memory and the graphics card. This bus rarely increases in speed, and when it does, it isn't by terribly much. By contrast, fillrate doubles quite frequently, and vertex processing isn't that far behind.

We may be bound on these things now (and, it isn't that tough to be bound on vertex transfer; just have simple shaders), but as time goes on, we will not be nearly so bound on them.

Oh, and don't forget that any attempt to alleviate a CPU burden is good; we need to find ways to increase parallelism and improve on what the CPU can do. This functionality provides that, and it is a help in those situations.

Lastly, it doesn't hurt to have it there in terms of performance either.


That's an incredibly backwards way of working. You don't start by expecting everything put forth to be implemented unless it is explicitly proven to be useless by those you expect to implement it. It's up to those suggesting a particular feature to convince the implementors that the feature is useful.I did. It will, worst case, not impair performance. Best case, it will improve performance. Ergo, it is worthwhile.

QED.

This isn't something at the level of complexity of glslang or even RTT (with the infinitely many ways of combining textures). It's one entrypoint with very well-defined behavior.


What I'm referring to is not just ATI's problems. I have very good reasons to believe the same applies to nVidia. Their instancing performance is worse than ours. Both in absolute and relative terms.As a guess, their instancing performance is likely due to having to copy the indices directly into the command stream. NV20 and previous hardware were unable to directly use indices from AGP memory (or, doing so caused some slowdown or something); that's why VAR didn't let you do it. It's, also, likely why the VBO spec suggest putting your index VBO's in a different object from your mesh VBO's, so that they can optimize that circumstance. Since your indices have to be repeated pretty massively for instancing (number of instances * indices in an instance), this is likely going to make things slower than the theoretical maximum.

By contrast, ATi's never had a problem with pulling directly from a buffer in AGP memory.

And it's still faster than batching.


Immediate mode as in glBegin()/glEnd() etc., not as in the ability to provide uniform values across a bunch of primitives. There are glColor calls for instance in OpenGL ES IIRC, but there's no glBegin()/glEnd(). Good riddance I say.I can't say I disagree. I wonder if, in the relatively near future (2-3 years), PC IHVs will just start implementing OpenGL ES instead of regular OpenGL. Except for the heavy paletting options, it seems to be a much nicer, cleaner OpenGL.


A niche feature that's hardly useful on current hardware, or say boosting shader performance, resource allocation etc?You haven't provided any real evidence that the performance gain will be minimal. Saying it doesn't make it true.


And you know this by experience I presume?Experience is not required. If the hardware actively supports instancing, then the specific calls simply translate into writing the proper token into the command stream. The only way the extension becomes more complicated is if you have to manually "uninstance" it, because the hardware isn't capable of handling it directly.

idr
01-05-2005, 11:22 AM
Korval,

I actually looked into implementing an instancing API in the open-source R200 driver on Linux. I worked through what it would take to implement it, what it would take for developers to use, and what the potential performance gains would be.

My understanding of instancing is that it allows you to draw the same mesh (with identical state) multiple times with a different transformation matrix. The simple API that I used added two functions. I think the intended usage of both functions is pretty obvious.

void MatrixPointer(int size, enum type, sizei stride, void *pointer);

void DrawInstancedRangeElements(enum mode, uint start, uint end, sizei count, enum type, const void *indices, uint instances);

When used with all data in on-card VBOs, I found the following:
The theoretical performance gain would be roughly equal to the API overhead of calling PushMatrix, MultMatrix, DrawRangeElements, PopMatrix. That overhead, with properly set up VBOs and moderately sized meshes, was very small.
The theoretical performance gain over compiling a display list with the sequence of PushMatrix, MultMatrix, DrawRangeElements, PopMatrix on a card like the R200 (probably not the best example) was nil (see way, way below).
Requiring developers to calculate their own matrices and store them in an array instead of using GL functions to create the matrix sucks (see below).
I did also consider an API that added a parameter to DrawInstancedRangeElements to specify whether to load or multiply the matrix and whether the matrices were "standard" or transposed. That still takes away some of the flexibility of GL matrix operations, but I think it covers the 90% case.

There are three cases where I was able to convince myself that there would be significant performance gains. However, I wasn't able to convince myself that I cared enough about those cases to continue with the exercise.
Very small meshes (the hypothetical 1,000,000 cubes case). These cases would see similar performance gains as with using NV_primitive_restart (http://oss.sgi.com/projects/ogl-sample/registry/NV/primitive_restart.txt).
Non-VBO vertex arrays.
With some modification to the proposed API, immediate mode.
Basically, with OpenGL, instancing would only help in cases that are API-call bound. That leaves two questions that I haven't seen answered (either "yes" or "no"). Are there important cases of that today? Are there important cases of that moving forward?

My other finding was that modifying the display list rules WRT vertex arrays provided more flexibility with the same performance-improvement potential. Basically, display lists copy data out of the vertex array when the display list is compiled. If display lists sourced the vertex data when the display list was executed, you could trivially do instancing without adding extra API entry-points (other than perhaps a BeginListArrays or something) or reducing GL flexibility. There are some other potential problems with it, but it's something I've been mulling over for a while now.
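For what it's worth, here is how a driver without hardware support might "uninstance" idr's hypothetical DrawInstancedRangeElements into a plain loop. Everything below is a sketch: the two helpers stand in for the real matrix-load and draw calls so the snippet is self-contained.

```c
#include <assert.h>

static int g_draw_calls = 0;

/* Stand-ins for the real GL matrix-load and draw calls. */
static void load_instance_matrix(const float *m) { (void)m; }
static void draw_one_instance(void)              { ++g_draw_calls; }

/* Software fallback: expand the instanced call into one matrix load
 * plus one draw per instance, consuming a packed array of 4x4 matrices. */
static void draw_instanced_fallback(const float *matrices, unsigned instances)
{
    unsigned i;
    for (i = 0; i < instances; ++i) {
        load_instance_matrix(matrices + 16 * i);
        draw_one_instance();
    }
}
```

This is exactly the "short bit of code to convert one glDraw* call into many" mentioned earlier in the thread; the only win over the application doing the loop itself is saving the per-call API crossing.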

Korval
01-05-2005, 12:44 PM
My understanding of instancing is that it allows you to draw the same mesh (with identical state) multiple times with a different transformation matrix.It is more general purpose than that.

The way instancing works is that you state that one or more attributes are not indexed the way other attributes are. Instead, they are indexed by a frequency parameter. Their index starts at 0, and whenever the current position in the index array hits a multiple of the frequency, the index for the modified attribute(s) increases.

This creates instancing, depending on your vertex shader. If each instance contains, say, 45 indices, then you set the frequency parameter to 45, and you have an attribute array that just contains a list of positions. Now, you create a large index array, with 45 * the number of instances entries in it. It repeats every 45 indices. You make a single draw call with that vertex array.
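The fetch rule just described boils down to integer division of the element position by the frequency. The helper name below is invented; this is only that indexing rule restated in code:

```c
#include <assert.h>

/* For an "instanced" attribute with the given frequency, the attribute
 * index advances once every `frequency` elements of the index stream,
 * while ordinary attributes keep being fetched through the index array. */
static int instanced_attrib_index(int element_pos, int frequency)
{
    return element_pos / frequency;
}
```

With a 45-index instance, elements 0..44 fetch instance 0's position, elements 45..89 fetch instance 1's, and so on.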


Very small meshes (the hypothetical 1,000,000 cubes case). These cases would see similar performance gains as with using NV_primitive_restart.Well, how small do they need to be? The number being thrown about in this thread is 100 vertices. You can do quite a bit in 100 vertices: rocks, grass tufts, bushes, simple windows on buildings, mesh-tiled walls, etc. Anything you want to create by step-and-repeat can be built with this.


Are there important cases of that today? Are there important cases of that moving forward?There's no way to answer that. Programs avoid being vertex transfer bound because they do not use large numbers of small meshes. They would like to, but they know this will cause performance problems. Until they have an efficient way to do it, it simply won't be done by developers.

We're talking about window-dressing here (unless you're making an asteroids-like game). It isn't large geometry. It isn't huge stuff. It would certainly help in terms of creating better graphics, but it isn't preventing anyone from making a game or other application. It is just preventing them from fully utilizing their hardware's power to achieve a greater level of depth in terms of graphics.

Jan
01-05-2005, 12:55 PM
Well, at least someone, who has experience with this, is clearing things up a bit.
Thanks, idr!

If OpenGL really is already so fast that instancing couldn't improve much of anything, then we really don't need it.

However, the problem is that we, as "end-users", really cannot know such stuff, and that this feature has been denied to us with the lame excuse "this is already possible in immediate mode".

It would be nice if we could simply get a bit more detailed information on why such a feature is not necessary / would not be useful for us.

Jan.

idr
01-06-2005, 08:49 AM
The way instancing works is that you state that one or more attributes are not indexed the way other attributes are. Instead, they are indexed by a frequency parameter. Their index starts at 0, and, whenever the current index's index hits a multiple of the frequency, then the index for the modified attribute(s) increases.Interesting. You basically specify a modulus value in addition to the usual type, count, stride, etc. parameters to the *Pointer functions. The arrays are then indexed with (i % M) instead of i. Can you only use instancing with DrawArrays-like commands? I guess it could work with DrawElements, but you'd have to explicitly replicate and bias the index data. Hmm...

I don't see how this would apply to the transformation matrix (especially since it isn't indexed per-vertex). You'd have to use DrawElements and ARB_matrix_palette (http://oss.sgi.com/projects/ogl-sample/registry/ARB/matrix_palette.txt) , which sounds unpleasant.


Well, how small do they need to be? The number being thrown about in this thread is 100 vertices. The actual numbers for my system aren't terribly valid. The speed of my CPU is way out of balance with the speed of my GPU. Using some profile data to guide back-of-the-envelope calculations, I saw that "reasonable" performance improvement would drop off around 40 vertices. Again, different CPU / GPU combinations would put that mark at different places.


There's no way to answer that. Programs avoid being vertex transfer bound because they do not use large numbers of small meshes. They would like to, but they know this will cause performance problems. Until they have an efficient way to do it, it simply won't be done by developers. That's the thing, though. In the optimal VBO case, there should be no difference in the vertex transfer bandwidth required with or without instancing. In fact, if index data has to be replicated and biased (I asked about this above), instancing requires more index transfer bandwidth. What it does save is API call overhead and, on cards with "optimal" support for instancing, command transfer overhead to the card (hence my comment about NV_primitive_restart in my previous post).

When I get a chance, I'll have to experiment with this some more...

Korval
01-06-2005, 09:45 AM
The arrays are then indexed with (i % M) instead of i.It isn't quite (i % M). It's actually the position of i within the index array (not the fetched index value itself) that gets modded by M and used as the index into the instanced vertex attribute(s).


I don't see how this would apply to the transformation matrix (especially since it isn't indexed per-vertex).It doesn't. Instancing is a way to work around stalls created by frequent vertex state changes like transform matrices or other such things.


I guess it could work with DrawElements, but you'd have to explicitly replicate and bias the index data.You do replicate the index data (one per rendered instance), but no biasing is needed. Regular attributes follow the standard rules; it is only the instance attributes that follow the new rules. It does need more overall memory, and it sucks for the pre-T&L cache.

zeckensack
01-06-2005, 10:22 AM
Originally posted by idr:
Interesting. You basically specify a modulous value in addition to the usual type, count, stride, etc. parameters to the *Pointer functions. The arrays are then indexed with (i % M) instead of i. Can you only use instancing with DrawArrays-like commands?More.
Modulo addressing or div addressing, per attribute pointer.
You put instance mesh data into attribute streams that are modulo addressed (index_n' = index % stream_frequency_n).
You put per-instance data into other attribute streams that are div addressed (index_n' = index / stream_frequency_n; integer division).

"Ultimate" indexing would allow different frequencies per attribute pointer, and also index biases, but that's probably not necessary. 99% of cases could be covered with a single frequency for all attributes and only a div/modulo/flat choice per attribute.

This allows you to use vertex positions as "instance mesh data" (the actual model of e.g. a rock), and other attributes as "per-instance data" (per-instance position, rotation, scale, color modifiers, texcoord scale factors and offsets, blend factors between multiple textures, specular and gloss modifiers, whatever).
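A CPU-side sketch of this addressing scheme, under the assumption of a single frequency for all streams (all names hypothetical):

```cpp
#include <cassert>
#include <vector>
#include <cstddef>

enum class Addressing { Flat, Modulo, Div };

// Effective element index for one attribute stream: mesh data is
// modulo-addressed so it repeats per instance, per-instance data is
// div-addressed so it advances once per instance.
// (Hypothetical names; real hardware/driver interfaces differ.)
std::size_t effectiveIndex(std::size_t i, Addressing mode, std::size_t freq)
{
    switch (mode) {
        case Addressing::Modulo: return i % freq;
        case Addressing::Div:    return i / freq;
        default:                 return i;   // ordinary, non-instanced stream
    }
}

// Expand the element indices one hypothetical instanced draw of
// `instances` copies of a `meshVerts`-vertex mesh would generate.
std::vector<std::size_t> expand(std::size_t meshVerts, std::size_t instances,
                                Addressing mode)
{
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < meshVerts * instances; ++i)
        out.push_back(effectiveIndex(i, mode, meshVerts));
    return out;
}
```

For a 3-vertex mesh drawn twice, the modulo stream walks 0,1,2,0,1,2 while the div stream walks 0,0,0,1,1,1 — mesh reuse and per-instance fetch from one linear vertex counter.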

Also note that you don't really need direct matrix support. Just support attributes. A vertex program is required to make proper use of it w/o direct matrix support, but this way the whole instancing mechanism becomes far more versatile (not just position, scale, rotation). And vertex program support is widespread enough to pull it off.

You can then render a bright, small, flat rock, a darker, big, grainy rock, another mossy rock, etc, in a single API call -- and a single command buffer transaction if you have hardware support.

And if you don't really need full matrices, but can do with position and scale (say, particle systems), you just use one attribute, not four.
Or use a quaternion and another vec4 for rotation, position and uniform scale. You know the drill.
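A sketch of that two-attribute transform — rotate by a unit quaternion, then apply uniform scale and translation — as plain CPU code (the real thing would live in a vertex program; the names here are made up):

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { float w, x, y, z; };  // assumed to be a unit quaternion

// Rotate v by unit quaternion q (q * v * conj(q), expanded form), then
// apply uniform scale and translation -- the quaternion + vec4 instance
// transform suggested above in place of a full 4x4 matrix.
Vec3 transformInstance(Vec3 v, Quat q, Vec3 pos, float scale)
{
    Vec3 u{q.x, q.y, q.z};
    float s = q.w;
    // v' = v + 2*(s*(u x v) + u x (u x v))
    Vec3 uv { u.y*v.z - u.z*v.y,  u.z*v.x - u.x*v.z,  u.x*v.y - u.y*v.x };
    Vec3 uuv{ u.y*uv.z - u.z*uv.y, u.z*uv.x - u.x*uv.z, u.x*uv.y - u.y*uv.x };
    Vec3 r { v.x + 2.f*(s*uv.x + uuv.x),
             v.y + 2.f*(s*uv.y + uuv.y),
             v.z + 2.f*(s*uv.z + uuv.z) };
    return { r.x*scale + pos.x, r.y*scale + pos.y, r.z*scale + pos.z };
}
```

Two vec4 attributes (quaternion; position + scale) replace the four a full matrix would eat, which matters when the vertex program is already near the attribute limit.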

Originally posted by idr:
I guess it could work with DrawElements, but you'd have to explicitly replicate and bias the index data. Hmm...You'd have to if you implement this as a pure in-ICD software feature. If you implement this to expose a feature of the vertex fetch hardware, you don't have to do any replication and biasing of indices. That's the point. If the hardware you're working with doesn't support the required addressing modes, I wouldn't bother.

Originally posted by idr:
I don't see how this would apply to the transformation matrix (especially since it isn't indexed per-vertex).That's why it shouldn't be part of any instancing mechanism IMO. Matrices (or perhaps vertex program constants) are state, and while they can be implemented on top of attributes, that's very ugly and non-orthogonal. You'd get huge problems if the current vertex program already uses up all attributes the vertex processing hardware allows, but you need four more to implement your matrix. No go.
Programmers shouldn't be encouraged to believe that they can use the same number of attributes along with an instanced matrix. That's why I strongly suggest this should be limited to vertex attributes.

Humus
01-06-2005, 03:11 PM
Originally posted by Korval:
Vertex transfer is limited by the bus between the CPU's memory and the graphics card. If you have your model in a VBO, which I expect people to have, then the transfer over AGP is exactly the same for real and shader constant based instancing (except maybe some tiny difference in command stream).


I did. It will, worst case, not impair performance. Best case, it will improve performance. Ergo, it is worthwhile.

QED.No, you should prove that it DOES improve performance, not that it doesn't reduce it. If you end up with zero gain it's not worthwhile. Besides, at worst it does impair performance by adding overhead to glDrawElements() for the non-instanced case. That's what it does in D3D.


You haven't provided any real evidence that the performance gain will be minimal. Saying it doesn't make it true.Well, you haven't exactly proven that the performance gain will be huge, so we're pretty even then. I've been meaning to bring home an instancing app I have at work and quickly port it to OpenGL to compare, but so far I keep forgetting. But my experience with instancing on the D3D side, and knowing that the overhead of DrawIndexedPrimitive() is much higher than that of glDrawElements(), is convincing enough for me to consider it not worthwhile on the GL side on current hardware.

Humus
01-06-2005, 03:19 PM
Originally posted by Korval:
You do replicate the index data (one per rendered instance), but no biasing is needed. Regular attributes follow the standard rules; it is only the instance attributes that follow the new rules. It does need more overall memory, and it sucks for the pre-T&L cache.No you don't. You only need one copy of the index data. If you had to duplicate it instancing would definitely be useless, and I assume shader constant based instancing would beat it soundly in all situations.

Korval
01-06-2005, 08:30 PM
Modulo addressing or div addressing, per attribute pointer.Really? I'd apparently gotten my information confused.

This is, in fact, better than I thought. With a mechanism as flexible as this, you can, in theory, change the actual meshes for each instance, by simply using a different set of indices into the same vertex array. Trees with physically different branches that all pull from the same set of vertices. This is good stuff. Very good stuff.


If you have your model in a VBO, which I expect people to have, then the transfer over AGP is exactly the same for real and shader constant based instancing (except maybe some tiny difference in command stream).Once you compare instance-based methods to state-change based methods, you're comparing two different things. The state-change method will induce stalls in the pipe, while the instance one will not.


No, you should prove that it DOES improve performance. Not that it doesn't reduce it.Why does VBO include usage hints? There's no proof that they will improve performance. By your logic, they shouldn't exist.

They exist for the purpose of giving the driver vital information that will allow it to improve performance where possible. The same goes here.


Besides, at worst is does impair performance by adding overhead to glDrawPrimitive for the non-instanced case. That's what it does in D3D.A simple conditional is not overhead. And, by all rights, you can negate this (and any other 'if' overhead) by using a simple v-table/function pointer.

Plus, if that overhead wasn't in the driver, it'd be in the user's application. Whether the underlying driver is the one adding the constants to the command stream or not, someone has to.
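The v-table idea can be sketched in C++ (hypothetical driver-internal names): the rare state change swaps a function pointer, so the hot draw path carries no conditional at all.

```cpp
#include <cassert>

// Sketch: instead of testing "is instancing enabled?" inside every draw
// call, the driver swaps a function pointer when the state changes, so
// the hot path is branch-free. (Hypothetical driver-internal names.)
static int g_drawsPlain = 0;
static int g_drawsInstanced = 0;

void drawPlain(int batches)     { g_drawsPlain += batches; }
void drawInstanced(int batches) { g_drawsInstanced += batches; }

// A one-entry "v-table"; a real driver would dispatch many entry points
// this way and swap whole tables on state changes.
void (*g_drawFunc)(int) = drawPlain;

void setInstancingEnabled(bool on)  // the (rare) state change pays the cost
{
    g_drawFunc = on ? drawInstanced : drawPlain;
}
```

The trade-off debated above: an indirect call costs something too, so whether this beats a well-predicted branch depends on the CPU and the number of paths.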


Well, you haven't exactly proven that the performance gain will be huge so we're pretty even then.Except that it doesn't need to be huge; it just needs to be non-trivial. It doesn't need to be the difference between immediate mode and VBO's; it just needs to be able to consistently beat the uninstanced case by a non-trivial margin. 5-10% would be sufficient. Getting rid of the CPU overhead alone would be quite helpful.


If you had to duplicate it instancing would definitely be useless, and I assume shader constant based instancing would beat it soundly in all situations.How can you expect me to believe that a mechanism that, by its very nature, induces stalls in the pipeline (ie, state changes) is going to be slower than one that works exactly like the hardware wants to?

Oh, and BTW, this thread already contains substantial evidence (not proof because we have no nVidia results) that the "shader constant" instancing method is weaker under OpenGL in both tests (by a 6:1 margin) than using attributes. MZ's uniform-vs-attribute test (on the first page) attests to this fact. So, clearly, using uniforms for instances is a bad idea, compared to attributes.

zeckensack
01-07-2005, 07:37 AM
Originally posted by Korval:

Modulo addressing or div addressing, per attribute pointer.Really? I'd apparently gotten my information confused.

This is, in fact, better than I thought.Actually, your information may have been more correct than mine. After looking at what Microsoft put in DirectX Graphics (http://msdn.microsoft.com/archive/en-us/directx9_c_Summer_04/directx/graphics/programmingguide/advancedtopics/DrawingMultipleInstances.asp) , it looks like they didn't bother with the modulo mode, which is IMO both vital to get the full bang out of the technique, and quite easy to add if you already have the div mode.
If you have already spent the resources to compute the result of an integer division (which is quite hefty btw), you can certainly also deliver the remainder of the corresponding integer division without adding much extra complexity. x86 be my witness: it's not possible to divide without computing the remainder; you get it for free. I don't understand how this could have been left out. This crippled form of instancing is really much more limited than it needs to be.

Anyway, instancing in DXG is not as flexible as I described. My apologies.

Humus
01-09-2005, 12:18 PM
Originally posted by Korval:
Once you compare instance-based methods to state-change based methods, you're comparing two different things. The state-change method will induce stalls in the pipe, while the instance one will not.Yes, but this has nothing to do with what I wrote. You implied that there would be a difference in the amount of AGP traffic. There's not.


Why does VBO include usage hints? There's no proof that they will improve performance. By your logic, they shouldn't exist.

They exist for the purpose of giving the driver vital information that will allow it to improve performance where possible. The same goes here.Uhm, the difference is that we know for a fact that these flags improve performance if used correctly. Until we know for a fact that an instancing API will improve performance there's no reason to add such an API. If we knew that, I would of course support the inclusion, but until we know that I remain sceptical. There are literally hundreds of things you could add to the API that might improve things in certain situations if hardware and drivers were built for them, etc. But we don't just add stuff without knowing it's useful.


A simple conditional is not overhead. And, by all rights, you can negate this (and any other 'if' overhead) by using a simple v-table/function pointer.A simple condition is overhead unless you expect 100% cache hits. Actually, even with 100% hits you still have to execute the actual comparison.
Overhead might not be large, but it's not for free.


Plus, if that overhead wasn't in the driver, it'd be in the user's application.??
There's no overhead of something that's not even there.


How can you expect me to believe that a mechanism that, by its very nature, induces stalls in the pipeline (ie, state changes) is going to be slower than one that works exactly like the hardware wants to?You mean faster, right? Well, because that's what it already does in practice in many situations. Even though real instancing doesn't need multiple index copies as you thought, while constant based instancing needs to duplicate both vertices and indices, the latter still beats real instancing in many situations, in particular on small models, like < 20 triangles. If you did need multiple copies of the indices for real instancing, that would make it slower, and thus constant based instancing would beat it in more situations. Plus, a few stalls aren't really hurting much. It's already reducing the number of state changes enough to hide that cost.
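A sketch of the buffer setup constant based instancing implies, as described above: the mesh and its indices are replicated per batch, each copy tagged with an instance number used to look up that instance's constants in the vertex shader (hypothetical helper, CPU-side only):

```cpp
#include <cassert>
#include <vector>
#include <cstddef>

// Minimal vertex with an instance tag; a real layout would carry normals,
// texcoords, etc. (Hypothetical structure for illustration.)
struct Vertex { float x, y, z; std::size_t instance; };

// Replicate a mesh N times into one batch. Each copy's vertices are
// tagged with their instance number, and each copy's indices are biased
// by that copy's base vertex, so the whole batch draws in one call.
void replicateForConstantInstancing(const std::vector<Vertex>& mesh,
                                    const std::vector<unsigned>& meshIdx,
                                    std::size_t instances,
                                    std::vector<Vertex>& outVerts,
                                    std::vector<unsigned>& outIdx)
{
    for (std::size_t n = 0; n < instances; ++n) {
        for (Vertex v : mesh) { v.instance = n; outVerts.push_back(v); }
        unsigned base = static_cast<unsigned>(n * mesh.size());
        for (unsigned i : meshIdx) outIdx.push_back(base + i);
    }
}
```

How many instances fit per batch is bounded by the constant-register budget, which is why this pays off mainly for very small models.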


Oh, and BTW, this thread already contains substantial evidence (not proof because we have no nVidia results) that the "shader constant" instancing method is weaker under OpenGL in both tests (by a 6:1 margin) than using attributes. MZ's uniform-vs-attribute test (on the first page) attests to this fact. So, clearly, using uniforms for instances is a bad idea, compared to attributes.Uhm, MZ's test was not what I mean by shader constant based instancing. That's what I call a naive one-instance-per-call implementation. It did prove one thing though: even with such a naive implementation he reached performance close to the theoretical maximum, which would suggest that instancing would hardly improve performance at all.

bobvodka
01-09-2005, 01:15 PM
as interesting as this discussion is, I'd like to jump in here and ask if anyone wants to give an update on the kind of timetable we're likely looking at for the spec and implementations of the framebuffer_object extension?

ffish
01-09-2005, 03:08 PM
I wouldn't expect much until March ;) .

Korval
01-09-2005, 10:50 PM
Uhm, the difference is that we know for a fact that these flags improve performance if used correctly.We only know that because the implementers decided to. They could have decided to ignore the flags and simply stick everything into video memory. Or, they could have moved the memory around based on how you use it.

There is nothing intrinsic in the spec that makes the flags advantageous, performance wise. Only implementations make them useful.


Actually, even with 100% hits you still have to execute the actual comparison.Not if they use a v-table. And, considering the current number of conditionals that have to live inside glDraw* calls (among others), I wouldn't be surprised if they did invoke v-table calls. Various state changes would simply swap v-tables in and out.


There's no overhead of something that's not even there.As in, the user's application will have to do the instancing stuff. It will have to make decisions about how to render stuff, and how to convert a sequence of instances into a sequence of draw calls.


You mean faster, right?You're right. I meant faster.


Even though you don't need multiple index copies as you thought for real instancing, while constant based instancing needs to duplicate both vertices and indices, it still beats real instancing in many situations, in particular on small models, like < 20 triangles.How does it beat something that invokes O(n) stalls (where n is the number of instances)?


Plus, a few stalls aren't really hurting much. It's already reducing the number of state changes enough to hide the cost of that.Having no state changes is faster than having state changes. The only potential performance negative is the need to walk a large index buffer (which means indices won't be pre-T&L cached), but that's not nearly as bad as stopping the entire pipeline to wait until the post-T&L cache is flushed (if not more).

Stalls waste performance. Not having stalls does not.


Uhm, MZ's test was not a what I mean with shader constant based instancing.His slower method may have had more state changes than yours, but it does demonstrate that state changes are bad.


That's what I call naive one-instance per call implementation. It did prove one thing though. Even with such a naive implementation he reached performance close to theortechical maximum, which would support that instancing hardly would improve performance at all.But that method (the one that did well) used attributes (specifically, constant attributes per draw call), not state changes.

More importantly, it only demonstrated this on ATi cards (you know, the ones that don't have hardware instancing). It proves nothing about nVidia cards, some of which actually do provide hardware instancing support.

V-man
01-10-2005, 01:00 PM
Originally posted by ffish:
I wouldn't expect much until March ;) .It might also mean that the spec will be released at the same time. ****tttttttttttt!

Humus
01-10-2005, 07:51 PM
Originally posted by Korval:
Only implementations make them useful.Exactly what I'm saying, while you're arguing that stuff should be implemented only because it could improve performance.


Not if they use a v-table.v-tables are even slower in most cases unless the number of paths is large and the conditions can be linearized.


How does it beat something that invokes O(n) stalls (where n is the number of instances)?Stalls impair performance, but there are hundreds of other factors that could be equally or more important.


His slower method may have had more state changes than yours, but it does demonstrate that state changes are bad.But is it the worst thing in the equation? Apparently not.


But that method (the one that did well) used attributes (specifically, constant attributes per draw call), not state changes.So? He still reached near theoretical maximum performance with currently available APIs. What more do you need? Instancing can't make it go faster than the hardware.

davepermen
01-10-2005, 09:55 PM
everyone learns premature optimisation is the root of all evil.

korval: in opengl, instancing has been shown to be unneeded because it is not a main bottleneck, the reasoning being the much lighter function calls than in dx, where each call is quite hefty.

there are bigger bottlenecks in opengl, much more worth to invest time and money to build workarounds than instancing.

and this was stated by both ati and nvidia. instancing could help, but only a little. opengl is different; the performance bottlenecks are in different places.

and currently, opengl lacks much bigger things than instancing. we can talk about it again once we finally have roughly the feature set dx9 has had for a long time (a.k.a. easy rendertargets/rendertotexture, still the main api-design bottleneck).

it's fun to discuss about it, yes, but you should learn your facts, korval. and use common sense.

knackered
01-10-2005, 10:17 PM
It's like the other 2 pages of discussion just didn't happen.
Dave, did you just read the topic title and decide to throw in a random comment on the subject?
And just what other features should take precedence over this? You mention rendertargets, but that has already been finalised and is awaiting implementations...so what other features is GL missing in comparison to dx9? Oh, it's instancing, isn't it? Instancing is now the big difference between the two APIs, Dave. Hence this discussion.

KRONOS
01-10-2005, 11:30 PM
Originally posted by ffish:
I wouldn't expect much until March ;) .ARB_pixel_buffer_object was just posted... Maybe EXT_framebuffer_object can soon follow... :)

ffish
01-10-2005, 11:53 PM
I just noticed pbo was given full ARB status. I check the registry every day :p . I guess I need to get a life :D . Anyway, it'd be nice if fbo gets posted soon. Leaked driver implementations supporting fbo may appear before March too, but I'm not gunna hold my breath.

davepermen
01-11-2005, 01:18 AM
Originally posted by knackered:
It's like the other 2 pages of discussion just didn't happen.
Dave, did you just read the topic title and decide to throw in a random comment on the subject?
And just what other features should take precedence over this? You mention rendertargets, but that has already been finalised and awaiting implementations...so what other features is GL missing in comparison to dx9? Oh, it's instancing isn't it? Instancing is now the big difference between the two API's, Dave. Hence this discussion.i normally don't talk with spammers, but yes, i've read the whole discussion and followed it closely.

instancing is just an api difference, not a performance difference.. you could cry about not having immediate mode in dx, too. if you want the api, you can code it yourself and have a simple lib for it. if you want the performance differences, there aren't any worth the effort. that's different from dx, where there is a big possible gain.

knackered
01-11-2005, 02:33 AM
I believe you're mixing up the term 'spammer' with the term 'troll', dave. I have nothing to sell.
You seem to be missing the point in this discussion - it's not about whether the feature improves performance in the common case, it's about whether the feature makes sense for exceptional cases and for possible future scenarios. It's about whether you agree that an API should provide the ability for applications to give more information about their scene as a whole at draw time, or is it a case of 'the more atomic the better'. Care to answer, dave?
If you argue against an instancing mechanism, you must also have argued against glDrawRangeElements at the time of its introduction. With small batches, glDrawRangeElements gives very little performance increase...but for big batches it's dramatic. What's your opinion on this specific point, dave?
Korval had a good point when he said that because of the expense in submitting many small batches, application programmers avoid doing so, which in turn makes API review boards believe there's no need for a feature to make small batches more efficient because nobody submits large numbers of small batches. What's your view on this specific point, dave?

zeckensack
01-11-2005, 04:37 AM
Originally posted by Humus:
v-tables are even slower [than conditional branches] in most cases unless the number of paths are large and the conditions can be linearized.OT:
AMD recommends padding out tables of function pointers to 8 bytes per entry. This avoids contention in the branch prediction (max three entries per 16 byte window).
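That padding recommendation can be expressed directly in C++ (a sketch; whether it actually helps depends on the CPU generation):

```cpp
#include <cassert>

// Pad each function-pointer table entry out to 8 bytes so that at most
// two entries land in any 16-byte window, easing branch-predictor
// contention as described above. On 64-bit targets a raw pointer is
// already 8 bytes; on 32-bit x86 it is 4, so the alignas actually
// changes the layout there.
struct alignas(8) PaddedEntry {
    void (*fn)();
};

static_assert(alignof(PaddedEntry) == 8, "entries start on 8-byte boundaries");
```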


So? He still reached near theoretical maximum performance with currently available APIs. What more do you need? Instancing can't make it go faster than the hardware.True. But it just might reduce the CPU burden. Even if rendering itself cannot be made faster, there's still some potential benefit to overall system performance.

I think this was one of Korval's points, if I read your discussion correctly.

This is a moot point if the driver does the application's work of duplicating vertices, indices or (to a lesser degree) batches. Not so if there's full hardware support.

V-man
01-11-2005, 05:24 AM
It's best if driver developers invest more time on render targets (in whatever form) and GLSL.
GLSL drivers have too many issues and since it's going to be core in GL 2.0, it should be very much bug free.

Everyone is talking about instancing as being a performance solver. Maybe you should make your point in other ways, because most people won't see a performance boost from this.

zed
01-11-2005, 09:52 AM
It's best if driver developers invest more time on render targets (in whatever form) and GLSL.
GLSL drivers have too many issues and since it's going to be core in GL 2.0, it should be very much bug free.
Everyone is talking about instancing as being a performance solver. Maybe you should make your point in other ways, because most people won't see a performance boost with this.agreed, i wanna see time being spent on important things. instancing is only gonna benefit a limited number of situations, BUT it will disbenefit(?) everything else (by adding to driver complexity).
what are the maximum benefits possible with instancing, 10-20%? yet ppl are saying that with this extra 20%, all of a sudden, instead of drawing a wall with plain ol' boring bricks, we can now draw a wall of amazing (adjective goes here) bricks! get real! whatever the results, it's still gonna look worse than the raycasting shader that was presented here recently,

Korval
01-11-2005, 10:42 AM
there are bigger bottlenecks in opengl, much more worth to invest time and money to build workarounds than instancing.Such as? What performance problems are you talking about, that could be fixed with API changes, rather than simply faster hardware or better drivers?


and this was stated both from ati and nvidia.The exact quote from the notes was, "Instancing geometry - no need, GL immediate mode rendering calls support this. Khronos may want to engage on this, though, since they don't have finegrained immediate mode calls."

Who does Khronos work for? His name isn't mentioned attached to any IHV, so I'd like to know what hardware doesn't support "finegrained immediate mode calls".


easy rendertargets/rendertotexture still as main api-design-bottleneckAlready "done" (ie, the API and spec discussion is finally complete, and all we are waiting on is the publication of the spec and implementations). The ARB is now free to entertain other significant topics of debate.

Outside of other render-to issues (render-to-vertex-array, etc), what issues would you suggest they take up?

Zengar
01-11-2005, 10:45 AM
I would like to clarify the "instancing issue".

1. OpenGL's API calls are very lightweight and therefore don't need any further optimisations

but

2. If we try rendering 100000 grass halms (dunno how it's written), something like instancing may be very useful, as it's a simple and elegant trick. We can use other tricks with the same result, of course - Humus described an instancing method using shaders, but I think it is not as straightforward. Instancing is a way of reusing some input streams, and it may be very useful for drawing lots of similar objects. Period.

So: instancing is only useful for some very rare cases, but it is useful.

zeckensack
01-11-2005, 11:07 AM
Originally posted by Korval:
Who does Khronos work for? His name isn't mentioned attached to any IHV, so I'd like to know what hardware doesn't support "finegrained immediate mode calls". Khronos (http://www.khronos.org/) is the body behind OpenGL ES. OpenGL ES got rid of glBegin/glEnd, and only supports array based draw calls.

You could have known that :p

Korval
01-11-2005, 12:08 PM
Khronos is the body behind OpenGL ES.Really? Why is Khronos working on something like precompiled shaders, rather than the ARB proper? It seems that precompiled shaders are something the ARB should be addressing, not Khronos, since they will affect OpenGL more than ES (which, as I understand it, has no real shading language support yet).

MikeC
01-11-2005, 12:47 PM
Originally posted by Korval:
Really? Why is Khronos working on something like precompiled shaders, rather than the ARB proper?Maybe because embedded systems tend to be less-than-ideal platforms to run compilers on? Don't know, just guessing.

harsman
01-11-2005, 12:51 PM
And there are embedded platforms with vertex programmability available today (MBX).

Korval
01-11-2005, 01:31 PM
Maybe because embedded systems tend to be less-than-ideal platforms to run compilers on?A fair point. Though I'm not entirely convinced of the idea of binding our precompiled extension to theirs (though, really, OpenGL ES is a cleaner graphics system than OpenGL proper). Then again, it's probably better that ES remains close to GL proper.


And there are embedded platforms with vertex programmability available today (MBX).Wow, embedded stuff has come a long way.

MZ
01-11-2005, 02:12 PM
Originally posted by Humus:
So? He still reached near theoretical maximum performance with currently available APIs. This statement is no longer valid. It is now "close to 20%", not "close to 100%". See my update above.

Korval
01-11-2005, 03:32 PM
This statement is no longer valid. It is now "close to 20%", not "close to 100%". See my update above.Actually, looking back on it, I'm not so sure that your computation is even accurate. R300 cards have between 2 and 4 (I forget offhand how many) vertex pipes, so, in theory, they can run a 6 opcode program in 2-3 cycles. In actual practice, I could believe 4-5, though. 6 would be possible, but unlikely. We could directly test how long the vertex program takes to run simply by making a perfectly vertex-program-limited application (one huge batch of the same vertex, effectively) and comparing its performance to the theoretical maximum.

In any case, there's clearly some significant performance loss over the case of maximal performance. Whether it is in vertex T&L or vertex throughput/driver CPU overhead is unknown.

harsman
01-11-2005, 04:43 PM
The card he used has the equivalent of one vertex shader unit, so his calculation of max transform rate is correct. If you need info on chipset specs, Beyond3D's 3D tables (http://www.beyond3d.com/misc/chipcomp/) are an excellent resource. You can look up capabilities based both on chipset and board model.

Humus
01-11-2005, 06:49 PM
Originally posted by MZ:
This statement is no longer valid. It is now "close to 20%", not "close to 100%". See my update above.Ok, that changes things. But my point still holds true that constant based instancing can reach performance close to or better than real instancing on current hardware. In a sample I wrote for work I get 108fps with real instancing, 103fps with shader constant based and 22fps with a naive one-instance-per-call implementation. That's with an asteroid field of 16384 instances of a 48-triangle model (26 vertices). If I boost the model's subdivision one step I get a 192-triangle model (98 vertices), and instancing performance is 34fps, constant based instancing 42fps and naive at 22fps.

zed
01-11-2005, 08:47 PM
If we try rendering 100000 grass halms (dunno how it's written), something like instancing may be very useful, as it's a simple and elegant trick.u mean like
http://uk.geocities.com/sloppyturds/nelson/2005_01_12.jpg
unfortunately, as ppl are demanding more from apps, leaving blades standing stiff in the air ain't really as acceptable as it once was, so unless u do some non-basic physics this will rule out the use of instancing for your grass

Korval
01-11-2005, 09:19 PM
In a sample I wrote for work I get 108fps with real instancing, 103fps with shader constant based and 22fps with a naive one instance per call implementation.Define "real instancing". Did you write a D3D app that used D3D instancing, or did you use the GL hack? Unless you're actually testing D3D instancing (and on hardware that supports it), we don't know how fast this app truly can be.


unfortunatly as ppl are demanding more from apps, leaves blades standing up in the air, aint really as acceptable as they once were, so unless u do some non basic physics this will rule out the use of instancing for your grassIf you need moving grass (why, I don't know, but if you do), it's simple. Use 2 instance attributes: one on the ground and one for the stalk. Then generate the vertices in the vertex shader via a simple spline between the two points. This may make things vertex bound on modern cards, but on more advanced ones (with greater processing power), it will again be transfer bound.

Alternatively, instead of a spline, and doing vertex generation, you could also send a single-float parameter (coupled with the 3-vector blade position) that pushes the grass in a pre-defined direction (the direction of the wind). It just does a bias to the vertex coordinate based on the value of this parameter. It'd look very good, but not perfect. But, it'd be a much simpler shader as well as taking up fewer attributes.

Or, if you want to keep the shader simpler, just pass a direction for the up of the stalk, thus allowing you to create moving, but non-bending, blades of grass.
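The single-float wind bias described above might look like this, written out on the CPU for clarity (the quadratic bend profile and all the parameter names are my own choices; in a real implementation this arithmetic would sit in the vertex shader, with the base position and up vector as per-instance attributes and the wind as uniforms):

```c
typedef struct { float x, y, z; } vec3;

/* Bend a blade vertex toward a global wind direction, weighted by
   normalized height t (t = 0 at the root, t = 1 at the tip) so the
   root stays planted.  The quadratic weight t*t is one arbitrary
   choice of bend profile. */
vec3 wind_bias(vec3 base, vec3 up, vec3 wind_dir, float wind_strength, float t)
{
    float w = wind_strength * t * t;   /* tip moves, root does not */
    vec3 out;
    out.x = base.x + up.x * t + wind_dir.x * w;
    out.y = base.y + up.y * t + wind_dir.y * w;
    out.z = base.z + up.z * t + wind_dir.z * w;
    return out;
}
```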

Oh, and we don't have games nowadays that show decent quantities of grass. It always ends somewhere (even in your picture). Imagine having grass all the way out to the horizon. I would prefer static alpha-tested flatcards of grass that went on forever than highly-detailed physically based grass that stopped after 20 feet. Until we get grass going out to the horizon, physics should be far from our thoughts.

Zengar
01-12-2005, 05:42 AM
My point, Korval ;-)

zed
01-12-2005, 10:04 AM
If you need moving grass (why, I don't know, but if you do), it's simple.look at shrek etc the grass is typically animated, see how much better it looks than static grass, even grass demos eg speedtree (which looks better? with the grass moving or static) and speedtree does a really crappy job of its animation, imagine how much better it would look using proper bones!
the problem with lots of bones in the vertexshader (thats if it compiles under the limits) is u need to redo it for every rendering pass when u do it on the cpu u do it once and thats it.
the reason im insterested in instancing is, ive been working a lot with vegetation over the last couple of years and would have more than most to 'gain' from instancing, yet i dont want it!

knackered
01-12-2005, 10:27 AM
Still don't get the objections. I've always assumed it was a matter of 'when' the instancing mechanism is proposed rather than 'if'.
Its uses seem obvious. Saving memory, saving bandwidth, saving CPU overhead. Scale the performance benefits up to the demands in a few years and you see that it's obviously going to be needed in the future, so why wait till it's absolutely needed? Introduce it now, and let's play with it. The hardware's there. Can't one of the vendors expose it in an extension for now?
Stick shader attributes in the low-frequency stream and interesting things become possible. Combine it with render-to-vertex array and bizarre ideas may emerge. But don't introduce it now, and risk the same frenzy of demand that the ARB suffered when developers absolutely 'needed' render targets.

idr
01-12-2005, 10:52 AM
But don't introduce it now, and risk the same frenzy of demand that the ARB suffered when developers absolutely 'needed' render targets. I'm not going to argue for or against instancing. However, I do want to point this out as a crappy analogy. Render targets enable new functionality whereas instancing enables new optimization. There's a big, big difference.

Korval
01-12-2005, 11:57 AM
look at shrek etc the grass is typically animated, see how much better it looks than static grass, even grass demos eg speedtree (which looks better? with the grass moving or static) and speedtree does a really crappy job of its animation, imagine how much better it would look using proper bones!It's window dressing; it is, by definition, unimportant. The reason we want to put it into instancing is so that we can spend as little time on it as we can get away with while still providing some visual quality improvement.

Unless, of course, you'd rather we just had a flat texture on the ground?

If you can spare the extra performance for moving grass in your application, sure. If not, don't; it won't kill anyone, and hardware will get to the point where it can handle it.

BTW, even with infinite performance, I would never apply skinning to blades of grass. At the most, I would use some kind of spline curve and generate the vertices in the shader.


the reason im insterested in instancing is, ive been working a lot with vegetation over the last couple of years and would have more than most to 'gain' from instancing, yet i dont want it!That's an interesting contradiction. You're interested in instancing, you acknowledge that it could help you, but you specifically choose not to want it. Well, that's your prerogative, as long as you understand the inherent irrationality of your position.


Render targets enable new functionality whereas instancing enables new optimization. There's a big, big difference.The difference between functionality and performance is not as big as you might think. We would consider, say, rendering lots of grass to be "functionality". That is, some kind of visual effect that couldn't be done on earlier hardware. However, it is only made possible by performance; in this case, the performance of drawing massive quantities of instances.

What this means is that, eventually, only D3D games will see the kinds of visual quality that instancing will provide. Just like only D3D games can provide the kinds of visual quality that RTT provides.

knackered
01-12-2005, 02:32 PM
Originally posted by idr:
Render targets enable new functionality whereas instancing enables new optimization. There's a big, big difference.That seems quite narrow-minded, no offense idr.
It's late, I'm running out of steam...if you can render to a vertex buffer, then it seems odd to me that a mechanism for controlling the frequency at which that vertex buffer is sampled is not seen as providing additional functionality.
In addition - render-targets offer no new functionality, everything that can be done with render targets can already be done with framebuffer copies...but obviously at a much greater cost. So it could be regarded as an optimisation too.

zed
01-12-2005, 03:48 PM
BTW, even with infinite performance, I would never apply skinning to blades of grass. At the most, I would use some kind of spline curve and generate the vertices in the shader.perhaps ive explained it badly

the evolution of a grass field with computer hardware.
level0 - plain texture with grass on it
level1 - grass blades
level2 - basic animated blades
level3 - slightly more basic animated blades (due to instancing)
level4 - grass blades with a pretty realistic physics simulation

today im at level4 running on my computer in realtime with many thousands of blades, why on earth should should i step back to level3, ok the fps might go up but the visual quality will go down, hell if thats an excuss i might as well go back to level0 it might look like crap but it runs at 1000fps :)
youre thinking to shortterm, remember 3-5years ago (i forget exactly when) stencil shadows were 'the thing' now look at them dead, im 99% suredoom3 would never of used them if it hadnt of been delayed until 2004.

Korval
01-12-2005, 05:54 PM
today im at level4 running on my computer in realtime with many thousands of blades, why on earth should should i step back to level3Because all you're rendering is grass. When you're running something that might like to have the lion's share of the CPU, you'll find that accurate physics simulations for things that, ultimately, don't matter are irrelevant and inappropriate.

If you can spare your entire CPU and GPU to grass, great; you don't need instancing. For those of us who would like to make things go faster so that we can do other things with our processor, we need instancing.

More importantly, you're ignoring the fact that having the grass go out to the horizon is more important to the illusion than having 15 feet of moving grass.


youre thinking to shortterm, remember 3-5years ago (i forget exactly when) stencil shadows were 'the thing' now look at them dead, im 99% suredoom3 would never of used them if it hadnt of been delayed until 2004.Your general lack of proper grammar, punctuation, and capitalization make this statement somewhat difficult to understand. Trying to find where sentences begin and end is difficult at best. Are you saying that stencil shadows are "dead" for some reason, and that there was a heyday of them 3-5 years ago? And that, if Doom3 had been released earlier, that it wouldn't have used them?

I don't understand how any of that makes sense.

In any case, long-term, instancing becomes more important, not less. Your "physics simulation" will eventually fully reside in the vertex shader (thus alleviating the CPU completely, which now makes it worthwhile from a game standpoint). And vertex shader performance increases much faster than bandwidth for transferring vertices and CPU performance. As such, the vertex shader will eventually not be the bottleneck, and it will go back to being the bandwidth and CPU cost of vertex transfer.

Nowadays, the limits on useful instancing tend to be around 100 vertices and maybe 6-opcode shaders. In the future, it could be 300 vertices and 12-opcode shaders. After that, maybe 500 verts and 24-opcodes. And so forth.

Humus
01-12-2005, 08:00 PM
Originally posted by Korval:
Define "real instancing". Did you write a D3D app that used D3D instancing, or did you use the GL hack? Unless you're actually testing D3D instancing (and on hardware that supports it), we don't know how fast this app truly can be.
Of course I'm talking about instancing in D3D using SetStreamSourceFreq() when I say "real instancing". That compared to the alternative method I've described, "shader constant instancing".

zed
01-12-2005, 10:07 PM
Originally posted by Korval:
Because all you're rendering is grass. When you're running something that might like to have the lion's share of the CPU, you'll find that accurate physics simulations for things that, ultimately, don't matter are irrelevant and inappropriate.

If you can spare your entire CPU and GPU to grass, great; you don't need instancing. For those of us who would like to make things go faster so that we can do other things with our processor, we need instancing. http://steampowered.com/status/survey.html
the most compreshensive survey of user hardware on the net, ppl often have powerful cpu's compared to their graphics cards



More importantly, you're ignoring the fact that having the grass go out to the horizon is more important to the illusion than having 15 feet of moving grass.there u go again making th same FALSE anology u made with the bricks,
say im drawing the grass to the range of 100m, now to reach the hoirzon 1km away, i need to draw 100X as much grass, are u saying instancing will help me do this, no it wont at most itll let me draw an extra 20m, ie bugger all.


Your general lack of proper grammar, punctuation, and capitalization make this statement somewhat difficult to understand. Trying to find where sentences begin and end is difficult at best. Are you saying that stencil shadows are "dead" for some reason, and that there was a heyday of them 3-5 years ago? And that, if Doom3 had been released earlier, that it wouldn't have used them?sorry unfortunatly(fortunatly) my mind is always ruuning as a thousand miles an hour, i find it very difficult to write easy to understand sentences, what i mean to say is basically stencil shadows were big a few years ago, lots of demos, new extension(s) for gl etc, now look at it, gone! never to be seen again, a passing fad.


In any case, long-term, instancing becomes more important, not less. Your "physics simulation" will eventually fully reside in the vertex shader (thus alieviating the CPU completely, which now makes it worthwhile from a game standpoint). And vertex shader performance increases much faster than bandwidth for transfering vertices and CPU performance. As such, the vertex shader will eventually not be the bottleneck, and it will go back to being the bandwidth and CPU cost of vertex transfer.i recently removed all my vertex shader calculations to the cpu, why? it ran a lot quicker, u just need to calculate it once on the cpu, whereas with the vertexshader each time u render the geometry it needs to be recalculated again

Toni
01-13-2005, 03:02 AM
Sorry for the off topic of this, but...


Originally posted by zed:
what i mean to say is basically stencil shadows were big a few years ago, lots of demos, new extension(s) for gl etc, now look at it, gone! never to be seen again, a passing fad.
Are they? Why? I understand they were fillrate eaters and such, but... they had a big advantage: in theory the method was robust. I haven't seen a paper saying "Robust shadowmaps with _put_your_favourite_projection_here and paraboloid maps and whatever"
What I mean is, I still haven't seen a method that guarantees no jaggies in shadowmapping shadows, for example (that is, a general and fast solution). But of course I can be totally wrong (wouldn't be the first time hehe).
So if I am wrong, could somebody point me to those papers that are killing the stencil shadows? I was going to implement stencil shadows due to their robustness, as I can't be tweaking with light positions and such to avoid aliasing, but if that aliasing can be avoided (that is, a general and robust solution for shadowmaps without aliasing), well, I would go with the shadowmap technique.

Thanx in advance

zed
01-13-2005, 09:09 AM
*note i never implemented stencil shadows (cause of there flaws) but this is my understanding

*stencil shadows require a lot more work per vertex, look at doom3, the models aint really more tesselated than quake3, now if shadowmaps were used then the number of polygons is not so important, ie if doom3 used shadowmaps then it wouldnt be known as the attack of the pointy heads, thus shadowmaps scale better
*stencil shadows require closed meshes thus u cant throw any mesh at it and expect it to work, with shadowmaps everything works
*stencil shadows dont handle allpha tested polygon eg a grill texture, shadowmaps do.

shadowmap cons,
* hardware support for shadowmaps is less than stencilshadows
* they require multiple renders to do pointlights (this scares a lot of ppl off i believe)
- these renders though are typically very quick cause practically everything is disabled, ie u only want to render the depth information. cards today handle many millions of triangles
* jaggies, can be lessened by upping the shadowmap resolution, also there are various methods proposed to get a better ratio of screen to shadowmap space, eg psm, tsm, lispsm. ive only tried to do the first one psm (gotta say i failed to implement it though)
an another method is to create soft shadow edges one way by combining offset shadowmaps
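The "combining offset shadowmaps" trick mentioned above amounts to something like percentage-closer filtering: average several offset depth comparisons into a fractional shadow term instead of a hard 0/1 result. A minimal C sketch of the averaging step, with a made-up tap count and depths:

```c
/* Average several offset shadow-map depth comparisons into a
   fractional shadow factor; the sample offsets themselves are
   assumed to have been applied when fetching stored_depths. */

/* one depth comparison: 1.0 = lit, 0.0 = in shadow */
static float shadow_test(float stored_depth, float fragment_depth)
{
    return fragment_depth <= stored_depth ? 1.0f : 0.0f;
}

/* average `n` offset taps into a soft shadow factor in [0,1] */
float soft_shadow(const float *stored_depths, float fragment_depth, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += shadow_test(stored_depths[i], fragment_depth);
    return sum / (float)n;
}
```

A fragment straddling a shadow boundary then gets an intermediate value (e.g. 0.5 when half the taps pass), which is what softens the edge.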

Korval
01-13-2005, 10:06 AM
Of course I'm talking about instancing in D3D using SetStreamSourceFreq() when I say "real instancing". That compared to the alternative method I've described, "shader constant instancing".So you have, in fact, written an app that tests D3D instancing vs. the OpenGL alternatives. I'd like to see this app.

Oh, and how does nVidia hardware perform on it?


say im drawing the grass to the range of 100m, now to reach the hoirzon 1km away, i need to draw 100X as much grass, are u saying instancing will help me do this, no it wont at most itll let me draw an extra 20m, ie bugger all.It's better than not having that extra 20m.


i recently removed all my vertex shader calculations to the cpu, why? it ran a lot quicker, u just need to calculate it once on the cpu, whereas with the vertexshader each time u render the geometry it needs to be recalculated againYour application draws grass. In a game situation, grass is irrelevant; it's a visual effect. It doesn't deserve spending time on it, so nobody multipasses on their grass. It isn't included in shadow volumes or shadow maps. As such, there's no problem with putting it into the vertex shader.

[ OT ]


stencil shadows dont handle allpha tested polygon eg a grill texture, shadowmaps do.I have yet to see an implementation of shadow mapping that would handle that either. Oh, it would in theory, with an infinitely large shadow map. But, in the real world, even a 2048x2048 or 4096x4096 shadow map doesn't pick up details like grill textures and so forth.


this scares a lot of ppl off i believeNo, what scares people away from shadow maps is that nobody's found a solution that makes virtually all cases look good. There are ways to make some cases look better, but there's no known generic solution that you can plug in and have work.

I like the idea of shadowmaps. It scales better with the number of lights than stencil shadows do. But, when it comes down to shipping a polished, professional product, you just can't put up with the artifacts. So I don't blame people for using stencil shadows.

zed
01-13-2005, 02:38 PM
Your application draws grass. In a game situation, grass is irrelevant; it's a visual effect. It doesn't deserve spending time on it, so nobody multipasses on their grass. It isn't included in shadow volumes or shadow maps. As such, there's no problem with putting it into the vertex shader.:)
following your schizophrenic reasoning, since grass is an 'effect' u dont need all the (*)grass blades in fact youre better off doing a grass textured quad, so why do u need instancing? youve dug yourself into a hole there korval. i dont expect u to admit youre wrong though bay, after 1991 posts if it hasnt yet happened it most likely wont happen ;)

http://uk.geocities.com/sloppyturds/nelson/richmond.html

though its difficult to see from a static screenshot ( u can notice it a bit in the second screenshot) having moving grass etc adds so much more to a scene's realism compared to static grass


I have yet to see an implementation of shadow mapping that would handle that either. Oh, it would in theory, with an infinitely large shadow map. But, in the real world, even a 2048x2048 or 4096x4096 shadow map doesn't pick up details like grill textures and so forth.check the screenshot 2005_01_05 on my page (256x256 sized depth texture), the dome is solid with a alpha texture on it

Korval
01-13-2005, 03:17 PM
following your schizophrenic reasoning, since grass is an 'effect' u dont need all the (*)grass blades in fact youre better off doing a grass textured quad, so why do u need instancing?If it takes only 5% of your overall performance (and as little precious CPU time as possible) to render large quantities of grass and other "window dressing" effects, then it is worthwhile to try. However, the problem is that, without instancing, it never will be (or it will be much later than it could). In the most optimal case, the hardware probably can render lots of grass in 5% or less. However, if we can't access that hardware, if we're forced to use brute-force methods, then we will take longer to get there.

No, you can still use just a flat texture. But if you can make more realistic representations of grass take up trivial quantities of performance, then it is no longer unreasonable to use those methods. After all, if real grass only took 2x the performance of flat textured grass, why not just do the real thing? Instancing is a mechanism for lowering the performance cost of such window dressing to the point where it becomes more reasonable to use.

The point my statement was trying to make was that, because it is an optional "effect", because it is window dressing, you don't apply things to it like shadowing (or, at least, rendering volumes or maps. They could still use the shadow texture to determine if they were in shadow) or anything that makes it take any longer than it absolutely needs to.


having moving grass etc adds so much more to a scene's realism compared to static grassI never said that having moving grass was a bad idea. Clearly, moving grass is preferable to static grass. It is simply a question of what one can afford. Instancing is a method for being able to make what was once unaffordable (but theoretically doable on the hardware) now affordable. Much like VBO's allow you to now use bigger meshes, instancing allows you to use lots of small ones.

V-man
01-13-2005, 04:43 PM
Can't one of the vendors expose it in an extension for now?
I don't know but maybe there are legal issues.

As far as what idr said, it's better to argue that render targets are likely to be more needed than instancing.

PS: I'm not arguing against instancing. It would be nice to see it in action in GL.

cyclone
01-13-2005, 04:59 PM
>So? He still reached near theoretical maximum performance with currently available APIs. What more do you need? Instancing can't make it go faster than the hardware ...

But it would clean up all this glBegin / glEnd / gl*Pointer / gl*Array* / glCallList / glVertex / glNormal / glTexCoord / gl*Matrix*, glPush*, glPop*, glDraw*[Indexed]*, gl*Shader, .... (very complex and old) API stuff!!!!!

For the driver, the hardware, the developer and the end user, it is a good hack / optimisation / advance for the next generation ...

And on my 400 MHz PocketPC, this gives me more time to raytrace the scene with more triangles :)

I haven't implemented the great div/modulo hack yet, but it seems to be a cool thing for my gl2999 API :)

But it's true, OpenGL is a HARDWIRED library ...
Very good hardware, but badly used ...

@+
Cyclone

KRONOS
01-21-2005, 01:52 AM
Concerning instancing, NVIDIA's SDK8.5 has an example and PDF discussing it. Here is the link:
http://download.developer.nvidia.com/developer/SDK/Individual_Samples/samples.html
(scroll down to Pseudo Instancing)

It seems that GL doesn't need instancing after all.

gdewan
01-21-2005, 04:40 AM
I really like the bibliography in that document.

cyclone
01-21-2005, 05:36 AM
No, no, no, you haven't understood what this paper says!!!

It only says that NVIDIA hardware has a very bad vertex cache for more than 50 vertices ...

And it also says that with fewer than 50 vertices per instance, it can be more than 35x faster on some cards :)

This isn't the same thing as saying that instancing cannot be a good thing for OpenGL ....