View Full Version : glDrawElements with independent vertex, normal and texCoord indices



cyclone
07-10-2001, 10:41 PM
Is there an OpenGL command that can display geometry when colors, normals, texCoords and vertices use separate indices, but with only one call per vertex?

Something like glDrawIndexedEXT(vertexIndex, normalIndex, texCoordIndex, colorIndex), for example...

OK, I can always decompose it into four OpenGL calls, for example:

glColor4fv(&colors[colorIndex]);
glNormal3fv(&normals[normalIndex]);
glTexCoord2fv(&texCoords[texCoordIndex]);
glVertex3fv(&vertices[vertexIndex]);

but I think using four function calls per vertex is a real waste of CPU time...

@+
Cyclone

john
07-11-2001, 03:54 PM
You're saying, in other words, that



push param1
push param2
push param3
call func
push param1
push param2
push param3
call func
push param1
push param2
push param3
call func

is significantly slower than



push param1
push param2
push param3
push param4
push param5
push param6
push param7
push param8
push param9
call func

...

cheers,
John

cyclone
07-11-2001, 09:20 PM
Jumps and function calls are among the slowest instructions on the 80x86/Pentium architecture...

For example, can you explain why the glVertexPointer and glDrawElements family of commands exist in the OpenGL API, if the CPU overhead of a function call is so small? ;-)

And more importantly, this form of "indexed 3D data" is commonly used because the indirection minimizes the space needed for 3D data storage, and it maps directly to the Wavefront .obj format and a lot of other 3D file formats...

@+
Cyclone

john
07-11-2001, 10:51 PM
Hello,

Branches and calls have *historically* always been painful because they mess with the pipeline. But that's why CPUs have branch prediction (and keep feeding instructions through calls) these days, to keep the pipeline stoked.

They might be particularly bad on Intel chips, but since when has OpenGL been an Intel-chip-based API? (yes, yes, I know, majority-of-consumer-cards malarkey, yada yada etc.)

And, finally, 'they' use vertex pointers *primarily* for DMA, not necessarily to avoid function calls.

BTW, I am putting forth these arguments not *necessarily* because I disagree, but because being a sheep is just too sucky. =) Also, it isn't true to say that every procedure call will really REALLY soak performance.

cheers,
John

cyclone
07-12-2001, 01:55 AM
OK, function calls aren't really the biggest bottleneck in the OpenGL API :)

But my real problem is that I always have to create a LOT of new vertices for nothing...

For example, a "good textured cube" is, for me, only 8 SHARED vertex coordinates, 8 SHARED colors, 4 SHARED texture coordinates, 6 SHARED normals, and 6 faces where each property of each vertex can be independently indexed. It is certainly not 24 vertices of the form {x,y,z, r,g,b, i,j,k} that are NOT SHARED between the 6 faces of the cube...

For me, on this cube example, the OpenGL vertex array mechanism forces you to allocate 3x more memory than necessary...


@+
Cyclone

mcraighead
07-12-2001, 08:39 AM
Immediate mode really _is_ a slow API. Don't use it in production code. Replicate vertices if you need to, that's not a big deal.

- Matt

cyclone
07-12-2001, 10:01 PM
Why do you say that the OpenGL API is slow???

Personally, I find this immediate-mode API very fast: my current "Cuboule" demo/test runs at more than 50 fps at 1600x1200, 32 bits, with my GeForce2 Ultra on a 1 GHz Athlon with XFree86 4.0.3 and the latest NVIDIA drivers.
=> for me, it's really a fast 3D API :)

OK, this demo doesn't yet use multi-texture techniques such as bump mapping, or multi-pass algorithms such as reflections, but I'm certain that the OpenGL 1.2 API, coupled with GeForce2 extensions and a little experience, really has the potential to implement these effects in my demo without sacrificing the real-time framerate.

No, my real problem is more theoretical, and it is about the size of the 3D data storage: I'm beginning to work with some .obj files that can have more than 30K vertices, and the de-indexation of the 3D data, necessary for the OpenGL vertex array mechanism, can frequently generate more than 65K vertices, which the current NVIDIA OpenGL/GLX implementation for Linux cannot handle directly (indices limited to unsigned short, i.e. 65K vertices, if my memory is good).


@+
Cyclone

V--man
07-14-2001, 03:08 AM
If you are not seeing a difference between immediate mode and indexed mode, then go ahead and use immediate.

Typically, immediate is slower.

Besides, I think this has been suggested before.

V-man

Dirk
07-20-2001, 02:29 PM
> Immediate mode really _is_ a slow API. Don't use it in production code.
> Replicate vertices if you need to, that's not a big deal.

It might not be for static objects, but for dynamic stuff (e.g. progressive meshes) it can hurt having to know which vertices are really identical and need to be updated, too. Furthermore the convenience of being able to manipulate a color array without having to worry about which faces/vertices are actually influenced is useful, too.

I'd like to propose something like interleaved indices, where you have a bunch of indices for every vertex. Then you either select one of a number of predefined formats (a la InterleavedArrays) or give a mapping from index to attribute (i.e. index 0: Vertex|Color, 1: Normal, etc.). The former is simpler to implement, the latter more flexible and extensible.

Memory access will be worse than InterleavedArrays, but should be very comparable to separate arrays. I can't really judge how much additional work the driver has to do to support this, or if current chips can do this at all, but I do like the idea and think it shouldn't be too bad.

Comments?

kieranatwork
08-16-2001, 04:45 AM
Please can we have glDrawElementsEXT? It's a much nicer solution than duplicating vertices... which flies in the face of the original idea behind indexed arrays.
When I'm deforming my geometry using dynamics, I have to apply the change I make to one vertex to a whole bunch of others, because they really should be the same vertex... it's a pain in the arse, and it slows things down.
Why can't we have this extension?

cyclone
08-17-2001, 05:17 AM
typedef struct{

	GLint size;
	GLenum type;
	GLsizei stride;	/* byte step between elements; must be explicit in this sketch */
	GLubyte *pointer;
	void (*func)(const GLvoid *);
	GLuint *index;

}GLArrayNEW;


GLArrayNEW colorArray;
GLArrayNEW normalArray;
GLArrayNEW vertexArray;
GLArrayNEW texcoordArray;


void glVertexPointerNEW(GLint size, GLenum type, GLsizei stride, const GLvoid *pointer){

	vertexArray.size = size;
	vertexArray.type = type;
	vertexArray.stride = stride;
	vertexArray.pointer = (GLubyte *)pointer;

	switch(type){

	case GL_FLOAT :
		switch(size){
		case 2 : vertexArray.func = glVertex2fv; break;
		case 3 : vertexArray.func = glVertex3fv; break;
		case 4 : vertexArray.func = glVertex4fv; break;
		}
		break;

	case GL_INT :
		switch(size){
		case 2 : vertexArray.func = glVertex2iv; break;
		case 3 : vertexArray.func = glVertex3iv; break;
		case 4 : vertexArray.func = glVertex4iv; break;
		}
		break;

	case GL_SHORT :
		switch(size){
		case 2 : vertexArray.func = glVertex2sv; break;
		case 3 : vertexArray.func = glVertex3sv; break;
		case 4 : vertexArray.func = glVertex4sv; break;
		}
		break;

	default : vertexArray.func = NULL;
		break;
	}

	glVertexPointer(size, type, stride, pointer);
}


void glNormalPointerNEW(GLenum type, GLsizei stride, const GLvoid *pointer){

	normalArray.size = 3;	/* normals are always 3 components */
	normalArray.type = type;
	normalArray.stride = stride;
	normalArray.pointer = (GLubyte *)pointer;

	switch(type){

	case GL_FLOAT : normalArray.func = glNormal3fv; break;
	case GL_INT :   normalArray.func = glNormal3iv; break;
	case GL_SHORT : normalArray.func = glNormal3sv; break;
	case GL_BYTE :  normalArray.func = glNormal3bv; break;

	default : normalArray.func = NULL;
	}

	glNormalPointer(type, stride, pointer);
}


void glColorPointerNEW(GLint size, GLenum type, GLsizei stride, const GLvoid *pointer){

	colorArray.size = size;
	colorArray.type = type;
	colorArray.stride = stride;
	colorArray.pointer = (GLubyte *)pointer;

	switch(type){

	case GL_FLOAT :
		switch(size){
		case 3 : colorArray.func = glColor3fv; break;
		case 4 : colorArray.func = glColor4fv; break;
		}
		break;

	case GL_INT :
		switch(size){
		case 3 : colorArray.func = glColor3iv; break;
		case 4 : colorArray.func = glColor4iv; break;
		}
		break;

	case GL_SHORT :
		switch(size){
		case 3 : colorArray.func = glColor3sv; break;
		case 4 : colorArray.func = glColor4sv; break;
		}
		break;

	case GL_UNSIGNED_BYTE :
		switch(size){
		case 3 : colorArray.func = glColor3ubv; break;
		case 4 : colorArray.func = glColor4ubv; break;
		}
		break;

	case GL_BYTE :
		switch(size){
		case 3 : colorArray.func = glColor3bv; break;
		case 4 : colorArray.func = glColor4bv; break;
		}
		break;

	default : colorArray.func = NULL;
		break;
	}

	glColorPointer(size, type, stride, pointer);
}


void glTexCoordPointerNEW(GLint size, GLenum type, GLsizei stride, const GLvoid *pointer){

	texcoordArray.size = size;
	texcoordArray.type = type;
	texcoordArray.stride = stride;
	texcoordArray.pointer = (GLubyte *)pointer;

	switch(type){

	case GL_FLOAT :
		switch(size){
		case 1 : texcoordArray.func = glTexCoord1fv; break;
		case 2 : texcoordArray.func = glTexCoord2fv; break;
		case 3 : texcoordArray.func = glTexCoord3fv; break;
		case 4 : texcoordArray.func = glTexCoord4fv; break;
		}
		break;

	case GL_INT :
		switch(size){
		case 1 : texcoordArray.func = glTexCoord1iv; break;
		case 2 : texcoordArray.func = glTexCoord2iv; break;
		case 3 : texcoordArray.func = glTexCoord3iv; break;
		case 4 : texcoordArray.func = glTexCoord4iv; break;
		}
		break;

	case GL_SHORT :
		switch(size){
		case 1 : texcoordArray.func = glTexCoord1sv; break;
		case 2 : texcoordArray.func = glTexCoord2sv; break;
		case 3 : texcoordArray.func = glTexCoord3sv; break;
		case 4 : texcoordArray.func = glTexCoord4sv; break;
		}
		break;

	default : texcoordArray.func = NULL;
		break;
	}

	glTexCoordPointer(size, type, stride, pointer);
}


void glVertexIndexNEW(GLuint *indices){

	vertexArray.index = indices;
}


void glNormalIndexNEW(GLuint *indices){

	normalArray.index = indices;
}


void glColorIndexNEW(GLuint *indices){

	colorArray.index = indices;
}


void glTexCoordIndexNEW(GLuint *indices){

	texcoordArray.index = indices;
}


void glArrayElementNEW(GLuint i){

	if( texcoordArray.func && texcoordArray.index && texcoordArray.stride )
		texcoordArray.func(texcoordArray.pointer + (texcoordArray.index[i] * texcoordArray.stride));

	if( colorArray.func && colorArray.index && colorArray.stride )
		colorArray.func(colorArray.pointer + (colorArray.index[i] * colorArray.stride));

	if( normalArray.func && normalArray.index && normalArray.stride )
		normalArray.func(normalArray.pointer + (normalArray.index[i] * normalArray.stride));

	if( vertexArray.func && vertexArray.index && vertexArray.stride )
		vertexArray.func(vertexArray.pointer + (vertexArray.index[i] * vertexArray.stride));
	else
		glArrayElement(i);	/* fall back to the classic single-index path */
}


void glDrawArraysNEW(GLenum mode, GLint first, GLsizei count){

	glBegin(mode);
	while(count){
		glArrayElementNEW(first);
		first++;
		count--;
	}
	glEnd();
}


void glDrawElementsEXT(GLenum mode, GLsizei count, GLenum type, const GLvoid *indices){

	int i;
	const GLubyte *pubyte;
	const GLushort *pushort;
	const GLuint *puint;

	glBegin(mode);
	switch(type){
	case GL_UNSIGNED_BYTE : pubyte = (const GLubyte *)indices;
		for(i=0;i<count;i++)
			glArrayElementNEW(*pubyte++);
		break;

	case GL_UNSIGNED_SHORT : pushort = (const GLushort *)indices;
		for(i=0;i<count;i++)
			glArrayElementNEW(*pushort++);
		break;

	case GL_UNSIGNED_INT : puint = (const GLuint *)indices;
		for(i=0;i<count;i++)
			glArrayElementNEW(*puint++);
		break;
	}
	glEnd();
}


Perhaps a beginning for this extension ???

cyclone
08-17-2001, 06:07 AM
I made errors in the glVertexPointerNEW() function:

you have to read

"vertexArray.func = glVertex...; break;"

and not

"vertexArray.func = glTexcoord...; break;"


But I think you have already corrected this yourself :)


@+
Cyclone

santyhamer
08-19-2001, 12:45 AM
Great idea!!! I have the same problem:
I use pure triangle lists instead of indexed/shared triangles. This is because 3ds max has "smoothing groups". Each triangle face has 3 vertex normals (for making sharp, accentuated edges), 3 texture coordinates (for using UVW Unwrap, face-map...) and 3 indices to vertex positions. The problem is that using pure triangle lists (I think the ONLY valid method for preserving smoothing groups and accentuated sharp edges), I am replicating the vertex positions 3 times... and then I CAN'T use the vertex cache of the GeForce... So we REALLY NEED this special form of indexing you propose.

zander76
08-13-2003, 06:47 AM
Hello All

Has this been suggested for OpenGL? Does anybody know where this stands? I am currently trying to find a solution to the same problem.

Ben

zander76
08-13-2003, 06:52 AM
Originally posted by cyclone:
Why do you say that the OpenGL API is slow??? Personally, I find this immediate-mode API very fast: my current "Cuboule" demo/test runs at more than 50 fps at 1600x1200 with a GeForce2 Ultra [...]

Hello. Actually, immediate mode will never reach the speed of a vertex buffer or a display list, for a few reasons.

For display lists, it gives the driver writers a chance to optimize: remove duplicate state, change the format to better match the card, you name it.

As for vertex buffers, current video cards run most efficiently when reaching around 200 executions per call (can't remember the name for this term offhand). With immediate mode it's a one-to-one relationship: one call, one vertex being processed.

The reason for this: for one, it gives the driver writer control over the loop and where it's running; more notably, the CPU can be used to process other things at the same time. Also, video cards have multiple pipelines to process vertices.

The last thing to consider: a box has 8 vertices, but to render it in immediate mode you will have to make 24 calls to glVertex3f.

Later

zander76
08-13-2003, 08:03 AM
Hello Everybody

Has anybody figured out how this new extension should work? Here is my guess.

// Set the vertex pointer
glVertexPointerNEW(3, GL_FLOAT, 0, vertexList);

// Set the texture coordinate list
glTexCoordPointerNEW(2, GL_FLOAT, 0, textureCoord);

// Now here come the trick part
glDrawElementsEXT(GL_TRIANGLES, Count, GL_UNSIGNED_SHORT, pVertexIndices);


Do I have to call glDrawElementsEXT() for texture coordinates as well? If so, what is the proper order? Should the texture coords go first, or the vertices?

OR

Do I set the indices first? Something like this:

// Set the vertex pointer
glVertexPointerNEW(3, GL_FLOAT, 0, vertexList);

// Set the texture coordinate list
glTexCoordPointerNEW(2, GL_FLOAT, 0, textureCoord);

// Set the index lists
glVertexIndexNEW(pVertexIndices);
glTexCoordIndexNEW(pTexIndices);

// So what goes in the last parameter here, then?
glDrawElementsEXT(GL_TRIANGLES, Count, GL_UNSIGNED_SHORT, pVertexIndices);

As you can see, I have set the vertex indices twice. That does not quite seem right to me.

Thanks, Ben

[This message has been edited by zander76 (edited 08-13-2003).]

vincoof
08-13-2003, 09:58 AM
One of the goals of the vertex array extension (yes, it is an extension) is to provide grouped vertex data to the pipeline. If the vertex array, color array, normal array, whatever array has to look up a different index table, it will significantly reduce performance, and may be no better than immediate-mode performance.

Moreover, interleaved arrays become really pointless with independent index array lookups.

So I tend to agree with what Matt said: replicate vertices. It will be significantly faster and does not waste too much memory (especially since the new index arrays also use memory). It's also more efficient to cache vertices this way, as it is ensured that vertices, normals, texcoords, etc. are always grouped.

zeckensack
08-13-2003, 12:47 PM
Don't want to spoil the party, but the classic box example is about the only thing that significantly benefits. It's the pathological corner case.

Then we have index traffic. An int index per color is *twice* the fetch bandwidth compared to just a color per vertex. That may be reduced by caches, approaching equal bandwidth. Ditto for fog coords.

Normals, tex coords and positions will benefit more, but I'm tempted to say that there's nowhere near enough benefit to be had in *real* meshes to justify this kind of (hardware!) complexity.

Who primarily renders cubes anyway?

zander76
08-13-2003, 07:46 PM
Hello everybody.

I have been trying to implement this extension. It would be really nice to submit it to the review board. Anyway, I ran into a problem.

// This does not work
vertexArray.func = glVertex3fv;

The declaration uses GLvoid* as its parameter, while glVertex3fv uses const GLfloat*. My compiler will not compile this. Does anybody know how to cast the parameter of glVertex3fv, or change the declaration of the original function?

Ben

Cyranose
08-14-2003, 09:31 AM
Originally posted by cyclone:
Exist one OpenGL command that can display geometries when colors, normals, texCoords and vertices use separates indexes, but with only one call per vertex ?


Ignoring the tangential issues here, the original question was whether it's useful to have separate indices for the various vertex components when using vertex arrays.

Unfortunately, it is (and not just for the cube example), though I don't think it would be easy to even simulate unless such an extension converts multi-indexed data to single-indexed data by duplicating on the fly (yuck, no go).

The desire for such a feature (for me) is not to avoid duplicating colors or texcoords in memory. It's mainly to avoid duplicating vertices where two vertices at the same point in space have different attributes, such as different normals, different texcoords, or different colors.

The problem comes in when trying to optimize vertex caching. Two nearly identical vertices can't be cached as one. This shows up wherever you need discontinuities in normals (e.g., hard edges), colors, or texcoords.

But under the hood, it's likely that the vertex cache stores not only the post-transformed vertex (and normal) but the other post-transformed post-lit parameters too, all keyed by the original index value. So there's a real question as to whether such an extension could reasonably be supported in HW without multiple caches and lots of extra complexity. And it doesn't make much sense for vertex programs, where you'd want to cache the computed VP outputs, not the separate inputs.

Anyway, back to the original question. If you have a small number of colors, normals, and/or tex coords and want to re-use those with separate indices, you can simulate this "extension" with a simple vertex program. Load your color, texcoord, and normal arrays into VP constants (up to the limits, say 96 total vec4s for 1st-gen HW) and send down vertices with XYZ and color only. In the VP, use those color R,G,B values to do one or more table lookups for the real colors, texcoords, and normals stored in the registers, and emit the combined result.

This, of course, does not solve the cache re-use issue either. But I assume (or at least hope) vertices are cached post VP execution.

Avi

[This message has been edited by Cyranose (edited 08-15-2003).]

zander76
08-16-2003, 09:33 AM
Originally posted by Cyranose:
[...] If you have a small number of colors, normals, and/or tex coords and want to re-use those with separate indices, you can simulate this "extension" with a simple vertex program. [...] This, of course, does not solve the cache re-use issue either. But I assume (or at least hope) vertices are cached post VP execution.

Hello. As far as complexity goes, I think it is very possible. In fact, the biggest problem is bus bandwidth, not the processor on the card. In fact, this problem could be very simple to solve on the card.

Currently there are vertex, texture coord, normal and color buffers on the card. Then, of course, there is a list that goes through and says something along the lines of:

i = index
vertexArray[i], normalArray[i] and so on.

Do you really think it would become a lot more complicated if they had something like:

v = vertexIndex
n = normalIndex

vertexArray[v], normalArray[n]

if the index list were something along the lines of a packed array,

indexList[] = vertex, normal, color, texture;

instead of just indexList[] = vertex?

I really find it hard to believe that this would add large amounts of complexity to a card.

Later, Ben

zander76
08-16-2003, 09:37 AM
Hello again.

While thinking about this, I realized that it would also have to address another problem: more than one set of texture coordinates. Anybody have any ideas about how to do this? Perhaps telling GL what format you will use.

Perhaps defining a type:
set( GL_VERTEX_4F | GL_NORMALS | GL_TEXTURE0 | GL_TEXTURE1 )

That way you can define the way this list is packed into memory.

What does everybody think of that?

Korval
08-16-2003, 11:46 AM
Infact the biggest problem is bus bandwidth and not the processor on the card.

No, the biggest problem is with the post-T&L cache.

This cache, currently, operates by storing the post-T&L values and the single index that was used to create this set of post-T&L data. What you are asking for, however, cannot be post-T&L cached at all.

Here's why. If vertex data has only one index, then, if the vertex program is deterministic (same input yields same output), you know that if that index shows up again, you will get the same post-T&L data. So you may as well fetch it from a cache.

But if the vertex data has multiple indices, you have to cache the data on the entire set of indexed data. You can only fetch from the cache if all of the indices match. The reason for this is that the caching mechanism, especially with vertex programs, cannot guarantee that only the data from the input TexCoord0 value was used to compute the output TexCoord0 interpolant. Maybe it used TexCoord0 in addition to the position. Maybe it's using generic attributes, which could be anything.

Therefore, you get poor caching behavior.

Cyranose
08-16-2003, 12:01 PM
Here's some simple index-keyed T&L pseudo-code for an idea of why the multi-index is a problem. I'm not a HW designer or driver engineer, so I'm just guessing this organization -- use it for discussion purposes only:




struct vertex
{ position, normal, color, tcoord };

for each index[i]
    if (!is_in_cache(index[i]))
    {
        vertex = fetch_agp(index[i]);
        // pulls from separate or interleaved AGP areas into register memory
        vertex = transform(vertex);
        // pos+norm, color, tcoord can have a matrix
        vertex.color  = light(vertex);
        vertex.tcoord = texgen(vertex);
        store_cache(index[i], vertex);
        emit(vertex);
    }
    else
    {
        vertex = fetch_cache(index[i]);
        emit(vertex);
    }

The idea is to make the fetch, transform, texgen, and light calls as infrequent as possible. In your scenario, I don't believe there's a way to have multiple sets of indices and still have the cache be as useful -- the solution would most likely be to have multiple caches, one for each kind of index. However, there's a combinatorial problem, in that you may find your position, normal, and color cached, but you haven't cached that combination. Therefore, the light() function at least would have to happen for every (or every unique) combination of V,N and C.

If you're interested, try rewriting the loop above to have multiple sets of indices and see how often you have to call the expensive functions. It's not pretty. You can't just rewrite the fetch() function to take multiple indices, since the is_in_cache() function won't give the correct answer in that case.

The key to making it work, I imagine, is to make the cache keyed by either the concatenated indices (I1,I2,I3,...) or the complete untransformed vertex (P,N,C,T). That way you could expand your multiple indices first and still get some caching. But that cache key is 8-16 times bigger, and who knows if the lookup cost is linear or exponential with the number of key bits. As I said, I'm not a HW designer.

Avi


[This message has been edited by Cyranose (edited 08-16-2003).]

zander76
08-16-2003, 12:49 PM
Originally posted by Korval:
No, the biggest problem is with the post-T&L cache. [...] Therefore, you get poor caching behavior.

So why not transfer it as indexed lists and have the card convert it to a format that it likes, perhaps a compiled mesh? It packs images into a format it likes; I don't see why this is not possible for mesh data.

Ben

zander76
08-16-2003, 01:00 PM
Originally posted by Cyranose:
Here's some simple index-keyed T&L pseudo-code for an idea of why the multi-index is a problem. [...] The key to making it work, I imagine, is to make the cache keyed by either the concatenated indices (I1,I2,I3,...) or the complete untransformed vertex (P,N,C,T). [...]

I see what you guys are getting at. The only thing that still catches my attention (and, to be honest, I don't know how often this would help) is that caching the normals would give you the ability to calculate the lighting for a normal only once, and have it cached for each subsequent use. But how often are there matching normals in a model, and is it worth the extra cache to run something like this?

The only thing I can really think of is that each triangle would have 3 matching normals at each vertex (assuming the point was to use one normal per face attached to a vertex). Calculating them once would reduce some overhead.

Also, one thing you may want to consider: with multi-texture there are a lot of coordinates to transfer. Assuming for a minute that you want to use 4 textures (the current max, or the last time I checked anyway), you have to pass a lot of info, but how much of that is going to be duplicated? If you combined the 4 lists into one, removed duplicates, and then indexed that list, it would have to improve some of this.

Thanks everybody for the responses.
Ben

al_bob
08-16-2003, 01:02 PM
But, if the vertex data has multiple indices, you have to cache the data on the entire set of indexed data. You can only fetch from the cache if all of the indices match.
Or you can have multiple smaller caches, one per vertex attribute. Granted, this would be far less efficient / much larger than a cache that holds more data per cache line, as you'd need to store more indices and more control logic for less data.

Of course, this is all IMHO.

You could also do as zander76 suggests, but then you don't get much in return for the added complexity; the hardware still has to do the added processing, it needs to use more memory for the vertex data, there's additional logic for the unpacking, etc.

harsman
08-17-2003, 07:42 AM
On the kind of complex models where saving bus traffic is important, you usually have lots of shared vertices. For all those shared vertices you still need to send all the extra indices for tex coords and whatnot. This means that, on typical complex models, you probably get *more* bus traffic with separate indices than without. If you're flat-shading all your models, then separate indices might be interesting.
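A back-of-the-envelope version of that bandwidth argument, with made-up but plausible numbers (a closed mesh has roughly twice as many triangles as vertices; float attributes, 16-bit indices):

```cpp
// Compare total bytes sent for a shared-vertex mesh: one shared index
// stream versus separate index streams per attribute. Numbers and layout
// (12-byte position, 12-byte normal, 8-byte texcoord) are assumptions.
struct Sizes { long singleIndexed; long multiIndexed; };

Sizes meshBytes(long V, long uniqueNormals, long uniqueTexcoords) {
    const long tris = 2 * V, idx = 2;  // ~2V triangles, 2 bytes per index
    // One shared index: every attribute array has V entries.
    long single = V * (12 + 12 + 8) + tris * 3 * idx;
    // Separate indices: smaller attribute pools, but 3x the index traffic.
    long multi = V * 12 + uniqueNormals * 12 + uniqueTexcoords * 8
               + 3 * (tris * 3 * idx);
    return Sizes{single, multi};
}
```

With V = 1000 vertices and even generous sharing (only 200 unique normals), the tripled index traffic makes the multi-indexed total come out larger, which matches the claim above.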

Csiki
08-17-2003, 08:30 AM
Originally posted by harsman:
On the kind of complex models where saving bus traffic is important, you usually have lots of shared vertices. For all those shared vertices you still need to send all the extra indices for tex coords and whatnot. This means that, on typical complex models, you probably get *more* bus traffic with separate indices than without. If you're flat-shading all your models, then separate indices might be interesting.

But if future cards are able to upload the indices too, then it will be faster and use less bandwidth.
You would just have to give the first index of the indices ( :) ), and the length...

Korval
08-17-2003, 05:41 PM
Or you can have multiple smaller caches, one per vertex attribute. Granted, this would be far less efficient / much larger than a cache that holds more data per cache line, as you'd need to store more indices and more control logic for less data.

That, and it doesn't work.

As I pointed out, vertex programs can generate an interpolant from any of the data, not just the incoming interpolant value. TexCoord0 could be generated from TexCoord1, Color0, and the position. You can't be sure, so any time one of them changes, you have to run the entire vertex program over again.

al_bob
08-17-2003, 06:47 PM
You can't be sure, so any time one of them changes, you have to run the entire vertex program over again.
Yep. It'll work as long as the data doesn't change, which is a reasonable assumption considering the small window that the T&L cache covers.

At worst, the cache is just not in effect, which wouldn't cost you much anyway, as the indices are already optimized to eliminate redundancies.

That is - the cache is there in the first place *because* you can't have multiple indices per vertex, and so data must be replicated.

Edit: I don't know what I was smoking when I wrote that last paragraph. Just ignore it.

[This message has been edited by al_bob (edited 08-18-2003).]

Mazy
08-18-2003, 02:35 AM
Is this really a problem? Most models are smooth in most places, or where they aren't, you can use huge polygons (like for walls). For spheres or other smooth shapes (faces mostly have smooth corners) you end up with most of the vertices having all their data shared, and then you would only be sending a couple of index lists too many each frame.

For a cube you can still use shared vertices and specify GL_FLAT as the shade model. If you lay out the indices carefully, you can make sure a given vertex is used as the provoking (last) vertex of only one triangle, and with GL_FLAT that vertex's lighting result is used for the whole face.

zeckensack
08-18-2003, 10:56 AM
Originally posted by zander76:
So why not transfer it as indexed lists and have the card convert it to a format that it likes.

Implementations "like" single-indexed meshes because they are easy to support, easy to build caches for, and offer ample opportunity for burst memory transfers. That's why most people convert multiple-index meshes created by those modelling packages at load time, before passing them to the renderer.
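That load-time conversion might look roughly like this. An .obj-style corner is a (position, normal, texcoord) index triple; the names are illustrative, not from any particular loader:

```cpp
#include <array>
#include <map>
#include <vector>

// Collapse multi-index face corners into a single index stream,
// duplicating a vertex only when its (p,n,t) triple is genuinely new.
struct Mesh {
    std::vector<std::array<int,3>> vertices;  // expanded (p,n,t) records
    std::vector<int> indices;                 // single index stream for GL
};

Mesh collapse(const std::vector<std::array<int,3>>& faceCorners) {
    Mesh m;
    std::map<std::array<int,3>, int> seen;    // triple -> single index
    for (const auto& c : faceCorners) {
        auto it = seen.find(c);
        if (it == seen.end()) {               // new combination: emit a vertex
            it = seen.emplace(c, (int)m.vertices.size()).first;
            m.vertices.push_back(c);
        }
        m.indices.push_back(it->second);      // repeated combination: reuse it
    }
    return m;
}
```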

Can you show us any non-trivial mesh where the calculation would swing in favor of multiple indices per vertex?

[This message has been edited by zeckensack (edited 08-18-2003).]

Cyranose
08-18-2003, 05:51 PM
Originally posted by zeckensack:
Can you show us any non-trivial mesh where the calculation would swing in favor of multiple indices per vertex?

I'm not entirely in favor of having multiple indices, as I think it complicates things, but here are some examples:

If post-transformed positions can be cached by a p-index and reused with new tcoords, the age-old texture-wrap seam problem goes away without duplicating verts. Not a huge win for most models, but worth considering.

If post-transformed normals can be cached and reused by a n-index, the savings could be large for many models. Consider a set of stairs, a ladder, or a sculpted brick wall (non-bump-mapped) -- things with lots of repeated normals, whether or not the model is smooth or faceted.

In general, the re-use of normals, colors, and texcoords goes way up when considering the case of drawing many objects in a combined buffer with multi-index support, especially many instances of the same model, regardless of how smooth or faceted it is.

The big problem in the 2nd case is the need to relight re-used normals in local lighting situations, so the post-lit-color result can't be cached with the normal as you might want.

Just some examples. Food for thought.

Avi


[This message has been edited by Cyranose (edited 08-18-2003).]

j
08-19-2003, 11:02 AM
The problem is that transforming a normal is not very computationally expensive compared to actually doing the lighting computations for each vertex. You're going to have to relight each vertex even if vertices share normals but not positions (that's the case with point and spot lights; you can only get away with lighting based on the normal alone for non-attenuated directional lights), so any savings on the transformation probably won't even be noticeable next to the lighting cost.

j

Korval
08-19-2003, 01:07 PM
If post-transformed positions can be cached by a p-index and reused with new tcoords, the age-old texture-wrap seam problem goes away without duplicating verts. Not a huge win for most models, but worth considering.

How do you know that the post-transformed position isn't dependent on which texture coordinate is being sent? It's possible; strange, but possible. And what happens when the user is using generic attributes rather than glVertex/Normal/TexCoordPointer? You have no idea which generic attributes correspond to position, normal, color, etc.

Cyranose
08-19-2003, 02:43 PM
Originally posted by Korval:
How do you know that the post-transformed position isn't dependent on which texture coordinate is being sent? It's possible; strange, but possible. And what happens when the user is using generic attributes rather than glVertex/Normal/TexCoordPointer? You have no idea which generic attributes correspond to position, normal, color, etc.

Hi Korval. Most of those examples are for traditional T&L. Also, J, granted the relighting for local lighting is a problem with caching just the normal transformation (I think I mentioned that directly).

Okay. If you want to talk about vertex programs, then having multiple indices becomes even more useful.

Let's assume the VP outputs are what's cached, even for multiple indices. We know it can be done, albeit for unknown cost. And given invariance, I hope we agree that for a given combination of indices, cached or not, the VP outputs should be the same.

With the current approach, we may define extra generic attributes as a way of getting pseudo-per-vertex data into vertex programs. Alternatively, we could store commonly used data in VP constants, but we often choose not to because of space constraints, or the API cost of loading them, or the reality of having to start and stop big buffer draws to load new parameters mid-stream. Stuffing the data into per-vertex attribs, even if it's not truly unique, gets around this, albeit by wasting a lot of space and bandwidth.

But using multi-indices gets around both limitations nicely.

Case in point: skinning. Currently, I might load a set of matrices into VP constants to do matrix blending based on a matrix index or indices that masquerade as vertex attribs. If I had a huge number of bones that didn't fit in constants, I could chop up my skin, or alternatively use 4 generic attribs per vertex per matrix (not pretty if they're changing a lot).

With multi-index, the matrices could be stored once in AGP, we can have many more of them, and we can blast larger groups of objects in single draw calls.

Even for the non-skinned case, multi-index here is a huge win, since now I can draw 10,000 objects with 10,000 different matrices (but otherwise shared state) all in one big batch instead of using 10,000 push/pop matrix calls around 10,000 draws.
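For what it's worth, a CPU-side sketch of the matrix-palette blending being discussed, shrunk to 2x2 matrices and 2 bones to keep the arithmetic readable (real skinning uses 4x4 or 3x4 matrices and up to 4 bones; the names are made up):

```cpp
#include <array>

// Each vertex carries bone indices and weights; the matrices live once
// in a shared palette instead of being replicated per vertex.
using Mat2 = std::array<float,4>;  // row-major 2x2
using Vec2 = std::array<float,2>;

Vec2 apply(const Mat2& m, const Vec2& v) {
    return {m[0]*v[0] + m[1]*v[1], m[2]*v[0] + m[3]*v[1]};
}

// Blended position = w0 * M[bone0] * p + w1 * M[bone1] * p
Vec2 skin(const Mat2* palette, const int bone[2], const float w[2], Vec2 p) {
    Vec2 a = apply(palette[bone[0]], p);
    Vec2 b = apply(palette[bone[1]], p);
    return {w[0]*a[0] + w[1]*b[0], w[0]*a[1] + w[1]*b[1]};
}
```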

Another case might be a VP that does selective lighting based on per-vertex params. Those light constants could be stored per-vertex, or they could be indexed and shared. Similar issues.

Finally, on your point about using texcoords as data: I've found that when I use things like vertex texcoords or color as generic data inputs to VP algorithms, I'm often using them as indices into table data. So having that explicitly supported would obviate some or even many of those cases.

Avi


[This message has been edited by Cyranose (edited 08-19-2003).]

Korval
08-19-2003, 09:06 PM
Case in point: skinning. Currently, I might load a set of matrices into VP constants to do matrix blending based on an matrix-index or indices that masquerade as vertex attribs. If I had a huge number of bones that didn't fit in constants, I could chop up my skin, or alternatively use 4 generic attribs per vertex per matrix (not pretty if they're changing a lot).

With multi-index, the matrices could be stored once in AGP, we can have many more of them, and we can blast larger groups of objects in single draw calls.


You can get the same effect (plus others) just by having a bindable buffer of memory attached to vertex programs, which a vertex program can access like memory (almost certainly read-only). That's much easier than having multiple vertex attribute indexing.

Also, with skinning done via passing indices, a set of 4 indices and weights only takes up 2 vertex attributes. A single matrix requires 3 + 1 for the weight. To get 4-bone skinning, you have to give up 10 attributes.

Cyranose
08-20-2003, 09:45 AM
Originally posted by Korval:
You can get the same effect (plus others) just by having a bindable buffer of memory attached to vertex programs, which a vertex program can access like memory (almost certainly read-only). That's much easier than having multiple vertex attribute indexing.

Also, with skinning done via passing indices, a set of 4 indices and weights only takes up 2 vertex attributes. A single matrix requires 3 + 1 for the weight. To get 4-bone skinning, you have to give up 10 attributes.

I think it's worse than 10, actually. But you could just barely fit 4-bone skinning along with base position and normal attribs and some UV.

Yes, it uses up your attribs indirectly (i.e., doesn't waste memory per vertex other than those indices and weights), but it allows you to batch-render one or more characters, which is the whole point.

I certainly like the idea of bindable VP memory, but how much slower would it be? If there's an early piece of pipeline that does nothing but de-index vertices (read various AGP arrays and output a complete vertex) for streaming to a VP block that need only look at internal registers, that seems faster than having the VP fetch external memory at random points. If anything, I'd expect such bindable memory to get block-read into VP constants first, meaning they're still space-limited. But IANAHWD.

But the real point is that with bindable AGP memory, you'd wind up using vertex attribs as indices into that memory, no? Well, then you're doing exactly the multi-index thing, but without direct support at the caching level. In essence, this configuration forces the developer to turn their natural multi-index data into a single index for the sake of caching and the current API. That could be made automatic to improve the semantic sense of things.

Avi www.realityprime.com (http://www.realityprime.com)


[This message has been edited by Cyranose (edited 08-21-2003).]

Korval
08-22-2003, 10:18 AM
But you could just barely fit 4-bone skinning along with base position and normal attribs and some UV.

Um, 4-bone skinning the normal way only requires 2 attributes: 1 for the weights and 1 for the indices. It requires a lot of uniforms, but not many attributes.


I certainly like the idea of bindable VP memory, but how much slower would it be?

Why would it necessarily be slower? Presumably, the memory is cached in the same cache as the vertex attributes. The only difference is when the memory is being accessed.


If there's an early piece of pipeline that does nothing but de-index vertices (read various AGP arrays and output a complete vertex) for streaming to a VP block that needs only look at internal registers, that seems faster than having the VP fetch external memory at random points.

Why does it have to be faster? As I mentioned above, as long as the memory is cached into the same cache as the vertex data, there's no difference in performance.


If anything, I'd expect such bindable memory to get block-read into VP constants first, meaning they're still space-limited.

Why would you expect that? The entire point of having the bindable memory is that you remove the space limitations that are inherent in current vertex programs.


But IANAHWD.

DHEUDJKE to you too.

Cyranose
08-22-2003, 01:14 PM
Originally posted by Korval:
DHEUDJKE to you too.

As I said, I Am Not A HardWare Designer (IANAHWD)...

Sorry for being terse earlier (ahem), my point before was that 4-bone skinning would barely fit using the aforementioned hypothesized multi-index approach (i.e., where the matrices would get sucked into the VP, masquerading as attribs).

I think we agree that using up attribs this way is not ideal, but I'm not convinced we're going to see arbitrary bindable AGP memory that's as fast as constant registers or as vertex attribs bound via multi-index.

Maybe an actual HW designer can speak to this, but fast VP execution would seem to require not waiting on any AGP memory fetches. I don't know. Does the VP itself do the job of pulling the AGP vertex data? I doubt it. I'd bet it's an earlier instruction block, and the VP silicon is tuned to work entirely in register space. (The actual answer is probably proprietary, so I don't really expect much, but here's to asking.)

Anyway, if there's a reasonably large cache to ease the pain of AGP reads, fine, but that's still a fixed window; matrices might be very spread out in AGP, meaning a potential for multiple cache fills per vertex program cycle. Sounds painful.

However, if the memory gets pulled into the VP via multi-index, that pre-fetch can happen before the VP even starts. The pre-fetcher block can do nothing but look at indices, read AGP, and assemble vertices into a transform queue for the VP or standard T&L. Seems much more deterministic and much simpler to my naive eye. And it sounds like what would be happening anyway with a single index reading from arbitrarily placed (non-interleaved) AGP arrays.

Avi www.realityprime.com (http://www.realityprime.com)

zander76
08-25-2003, 06:25 AM
Hello Everybody

I personally think that we are missing the point. This should be about the feature and not the current design of the card.

I personally think that multiple indices add a lot of flexibility to my model loading, and it could possibly save bandwidth going to the card.

After that, it's the manufacturers' responsibility to make it fast. NVidia and ATI have some smart engineers working there. Leave it to them to figure out the best way to optimize.

Let's discuss the idea and the possible good points and bad points of multi-indexing. Once it goes to the review board, you can be damn sure that the hardware companies are going to reject this if they don't have a way to optimize it.

Ben

Cyranose
08-26-2003, 08:28 AM
Originally posted by zander76:
Hello Everybody

I personally think that we are missing the point. This should be about the feature and not the current design of the card.

I personally think that multiple indices add a lot of flexibility to my model loading, and it could possibly save bandwidth going to the card.

After that, it's the manufacturers' responsibility to make it fast. NVidia and ATI have some smart engineers working there. Leave it to them to figure out the best way to optimize.

Let's discuss the idea and the possible good points and bad points of multi-indexing. Once it goes to the review board, you can be damn sure that the hardware companies are going to reject this if they don't have a way to optimize it.

Ben

I don't know. Given there are alternate ways to do the same thing, it may come down to what gets the best results for the least effort, and that involves weighing API changes and the cost to implement in HW. I don't see a lot of good in winning a new extension that then isn't supported because it's too expensive. So yes, NVidia and ATI can undoubtedly do better than us, but it's also important to convince people this is feasible _and_ worthwhile.

One thing that did occur to me the other day (possibly to other people too) was that the existing single-index cache structure might be sufficient for caching if there's an easy/fast way to hash the multiple indices into one unique key. Seems like that would do the trick; however, uniqueness is a non-trivial problem. Just an idea.
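One possible way to fold the indices into a single key is a boost-style hash combine. This is just a sketch; since collisions are possible, a real post-T&L cache would still have to compare the full index tuple on a hit, which is exactly the uniqueness problem mentioned above:

```cpp
#include <cstdint>

// Fold one value into a running 32-bit seed (boost-style hash_combine).
uint32_t combine(uint32_t seed, uint32_t v) {
    return seed ^ (v + 0x9e3779b9u + (seed << 6) + (seed >> 2));
}

// Collapse a (position, normal, texcoord) index triple into one key.
uint32_t hashIndices(uint32_t p, uint32_t n, uint32_t t) {
    uint32_t h = combine(0u, p);
    h = combine(h, n);
    return combine(h, t);
}
```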

Anyway, I think we've given some good arguments as to why multiple indices don't always save bandwidth, so you may not want to keep falling back on that point without your own examples. In fact, without AGP'd indices, it would probably be more expensive to separately index texcoords (8 bytes) and colors (4 bytes), for example.

But it could be a big win, IMO, for bigger shared data, like matrices, perhaps even normals. The biggest win, IMO, is in being able to pull vertex-indexed data into vertex programs without wasting massive per-vertex storage or stopping/starting draw batches to reload VP constants.

So unless you want to make some new arguments or someone else wants to argue the previous ones, I'm not sure what else we can add at this point.

Avi


[This message has been edited by Cyranose (edited 08-26-2003).]

vincoof
08-26-2003, 11:22 PM
The way I see the multi-index feature is that we will end up in using an index table to lookup in all the index tables, thus index of index. That's not likely to be an optimization at all, in my humble opinion.

cyclone
09-05-2003, 05:13 AM
If this can't be in the GL core for performance reasons, can it perhaps be included in a standard OpenGL add-on library like GLU or GLUT?

Hmm, and why does NVIDIA prefer to use a cubemap texture for vector normalisation if the table-lookup path is as bad as this on a modern GPU?

And don't some SSE/3DNow! functions work internally with table look-ups?

And what about the use of a palette on CGA/MCGA/VGA cards, if index lookup is as bad as this on old/current/next-generation hardware?

And for me, 4 indices of one byte each (i.e. a 32-bit value) can certainly be transferred faster than (4 floats for the position + 3 floats for the normal + 4 bytes for the color + a minimum of 2 bytes for the texture coords), with or without AGP ...

@+
Cyclone

cyclone
09-05-2003, 06:06 AM
"In fact the biggest problem is bus bandwidth and not the processor on the card."

"No, the biggest problem is with the post-T&L cache"


And why not drop this [in]famous post-T&L cache if the GPU is now faster than the memory? :)

In fact, the problem is **REALLY** at the input of the T&L stage (i.e. the load from memory/disk to GPU memory); it is not at the output (i.e. the final fragment values computed on the GPU and used in the rasterisation stage).

@+
Cyclone

cyclone
09-05-2003, 07:19 AM
"The way I see the multi-index feature is that we will end up in using an index table to lookup in all the index tables, thus index of index. That's not likely to be an optimization at all, in my humble opinion"


In C/C++ this type of instruction is really frequently used:

**a = something

or

something = **a

And if it is so frequently used, it isn't for nothing ...
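In runnable form, the two flavors of that double indirection look like this (names made up for illustration):

```cpp
// Double indirection, whether through pointers (**a) or through two
// index tables, is everyday C/C++.
int deref(int** a) { return **a; }

int lookup(const int* data, const int* idx, int i) {
    return data[idx[i]];  // index of index: two dependent fetches
}
```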

But doesn't the texture stage already have something equivalent, i.e. a dependent texture fetch or something like that?


@+
Cyclone

Cyranose
09-05-2003, 08:20 AM
Originally posted by vincoof:
The way I see the multi-index feature is that we will end up in using an index table to lookup in all the index tables, thus index of index. That's not likely to be an optimization at all, in my humble opinion.


That seems like a good idea to me. I can see the initial lookup being more expensive to resolve because of the extra indirection, but if you implement it as you said, the post T&L cache cost should be the same as now and the cache-hit ratio should be at least as good.

The bigger question I have is the API. Ideally, the extra indices (I'll call them indirections so as not to confuse them with the master index) would need to be flexible. For example, I might want three attribs to be tied to one indirection but a fourth to have its own indirection. Plus, we'd want the indirections to live in fast memory. So maybe something like an enable/disable pair, plus normal glDraw* calls, plus the equivalent of

glBindAttribIndirect(attribute_number, indirection_table)

...where the programmer could use the same table for two or more attributes if she so chose. I also suppose it would be possible (though perhaps not useful) to default to indirection = master-index for any attribute where a table isn't bound.
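The semantics of that hypothetical glBindAttribIndirect() call could be emulated on the CPU like this. Everything here is made up to illustrate the proposal, not a real GL interface:

```cpp
#include <vector>

// Each attribute may have its own indirection table; if none is bound,
// the master index passes through unchanged (the default suggested above).
struct Attribute {
    std::vector<float> data;                       // one value per entry
    const std::vector<int>* indirection = nullptr; // optional per-attrib table

    float fetch(int masterIndex) const {
        int i = indirection ? (*indirection)[masterIndex] : masterIndex;
        return data[i];
    }
};
```

Two attributes could share one indirection table simply by pointing at the same vector, which matches the flexibility described above.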

Avi
www.realityprime.com (http://www.realityprime.com)

[This message has been edited by Cyranose (edited 09-05-2003).]

cyclone
09-05-2003, 11:18 AM
glBindAttribIndirect(attribute_number, indirection_table)

where attribute_number can be something such as GL_POSITION, GL_NORMAL, GL_COLOR or GL_PARAM0_EXT, GL_PARAM1_EXT, ...

And when indirection_table is set to NULL,
a default table of incremented indices from 0 to 65536 is used

Yes, it's something like this :) :)

zeckensack
09-05-2003, 11:52 AM
I still don't see immediate use for this, but

Originally posted by cyclone:
And when indirection_table is set to NULL,
a default table of incremented indices from 0 to 65536 is used

Nope. You'd use GL_NONE (which is zero, too) and make that behave like pass-through. Some people use more than ushorts for indices ;)