fast path vertex tweening

I’m trying to do what I believe is called ‘vertex tweening’ on the GPU. Basically, I want to calculate a final vertex position based on a base position, target position, and an interpolation factor, e.g.
vec3 v_f = v_b + (v_t - v_b)*f;
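In GLSL this is exactly what the built-in mix() does, so the shader side is a one-liner:

vec3 v_f = mix(v_b, v_t, f);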

Not sure if this is relevant, but I'm using an NV 8800 GTX. I'm not sure which shader model I will need, but I'd like to keep that requirement as relaxed as possible for now, unless some huge performance benefit is possible. My code is built around the functions of a GL2 context, so using any new GL3 features might not be worth the effort for this (simple) animation feature.

I’ve been brainstorming to figure out the ‘fast path’ for vertex tweening. Here is the process my brain went through:
I am using VBOs to store my vertex data, but I'm not sure how to pack the target position into the VBO because I don't know what semantic to declare it as. Pretending the target position is a (3D) texture coordinate might work, but it seems less than ideal because both my C code and my shader code could confuse another developer. Also, I'd like to have several keyframes, so the base and target positions will need to come from different buffers (which means different VBOs) as the keyframe index changes. This makes me realize there is no benefit to packing the vertex data into an interleaved VBO at all. I've thought about sending the target position as a generic attribute, which seems like a good fit because the code would make sense to other developers, but that seems slower because the attribute values wouldn't be stored on the hardware.

So, the results of brainstorming tell me that multiple VBOs seem like the way to go. I can have a separate VBO to hold each keyframe's positions (and normals?). The current (base) keyframe will be assigned with the glVertexPointer function (plus glEnableClientState(GL_VERTEX_ARRAY)), allowing my GLSL code to reference it through the built-in attribute name gl_Vertex. My question is: when I bind the VBO holding the target positions, which gl*Pointer function should I use to assign them, and what attribute name will my GLSL code use to reference them?

Thanks!

I finally realized that I can avoid the reserved gl_Vertex name by writing a shader with custom attribute names, querying their locations after linking the program, and binding the vertex data to those attributes with glVertexAttribPointer, instead of the other gl*Pointer functions, which force us to use the reserved attribute names. So, now I'm back to figuring out the code to render VBOs again, this time using glVertexAttribPointer. When I get this figured out, I can have proper names for the variables needed to do the tweening interpolation.

Do I need to call glEnableClientState in order to render when using glVertexAttribPointer? As far as I can tell, I should replace the call to glEnableClientState with a call to glEnableVertexAttribArray. Is this correct?
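Something like this is what I have in mind (just a sketch; the attribute names v_base/v_target and the per-keyframe VBOs are placeholder names of mine):

// after glLinkProgram(program)
GLint loc_base   = glGetAttribLocation(program, "v_base");
GLint loc_target = glGetAttribLocation(program, "v_target");

glBindBuffer(GL_ARRAY_BUFFER, vbo_keyframe0);   // base positions
glVertexAttribPointer(loc_base, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
glEnableVertexAttribArray(loc_base);

glBindBuffer(GL_ARRAY_BUFFER, vbo_keyframe1);   // target positions
glVertexAttribPointer(loc_target, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);
glEnableVertexAttribArray(loc_target);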

My render code now looks like this:

for_each_vbo
{
    glBindBuffer( GL_ARRAY_BUFFER, vbo )
    for_each_elem_in_vbo
    {
        glVertexAttribPointer( attrib… )
        glEnableVertexAttribArray( attrib->location )
    }
}

glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo )
glDrawElements( indices… )

for_each_vbo
{
    for_each_elem_in_vbo
    {
        glDisableVertexAttribArray( attrib->location )
    }
}

glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, 0 )
glBindBuffer( GL_ARRAY_BUFFER, 0 )

Look right???

Here is something more appropriate.

// The logic of this loop is only valid if all the attributes of a given mesh are in a single VBO.
for_each_vbo
{
    glBindBuffer( GL_ARRAY_BUFFER, vbo )
    for_each_elem_in_vbo
    {
        glVertexAttribPointer( attrib... )
        glEnableVertexAttribArray( attrib->location )
    }

    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo )

    // Probably also need to set up matrices, uniforms, textures, etc. before glDrawElements

    // This will issue a draw call for each VBO.
    glDrawElements( indices... )
}

// Optional, only if you need to draw stuff without VBOs afterwards
glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, 0 )
glBindBuffer( GL_ARRAY_BUFFER, 0 )

I've been moving forward with this implementation. I agree that I needed to move the disable calls and the glBindBuffer(..., 0) calls outside the VBO for-loop to minimize state changes. I'm setting up all the other state outside the for-loop because it isn't specific to each VBO. Now I'm struggling through the use of glEnableVertexAttribArray and glVertexAttribPointer, but I think I'll have it working soon. Thanks for looking at the algorithm.

I have a working implementation of the C code and vertex shader discussed above. Now I'm trying to make a general-purpose vertex shader that can use any number of keyframes; my current implementation is hard-coded for 2 keyframes. The difficult part is figuring out what C code to call to supply an array of vertex attributes. I am able to supply individual (non-array) variables using glVertexAttribPointerARB, but I don't see how to supply an array. For example, this is the vertex shader code that would be ideal because it is reusable:


#define MAX_KEYFRAMES 50
in vec3 morph_positions[MAX_KEYFRAMES]; // how to supply this
//in vec3 v_base; // old way, single variable works well
//in vec3 v_target; // old way, single variable works well
uniform float t; // interpolation factor
uniform uint current_keyframe; // index into morph array
void main()
{
  vec3 v_base = morph_positions[current_keyframe];
  vec3 v_target = morph_positions[current_keyframe+1];
  vec3 v_pos = v_base + (v_target-v_base)*t;
  // transform v_pos
  // etc etc
}

Will someone point me to the function that helps me set an array of vertex attributes? Many thanks!

You don't need the whole array each frame. Given the current time, you only need v_base and v_target, passed in properly.

Actually, I believe I do need it, to test the performance of using a single VBO containing all the morph targets. For a small number of keyframes, a single packed VBO seems ideal, as it avoids extra glBindBuffer calls. In my example, it would only make sense if MAX_KEYFRAMES is less than GL_MAX_VERTEX_ATTRIBS (16 for my hardware) minus any extra attributes (texture coords, etc).
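For reference, the limit can be queried at runtime:

GLint max_attribs = 0;
glGetIntegerv(GL_MAX_VERTEX_ATTRIBS, &max_attribs);   // 16 on my 8800 GTX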

I am able to compile a vertex shader with the morph_positions[MAX_KEYFRAMES] array, and GL reports MAX_KEYFRAMES user-defined attributes for it. I compiled with MAX_KEYFRAMES set to 3 for my mesh data. I finally figured out that glVertexAttribPointer can be called repeatedly to set the attribute values, because GL reports a unique attribute for each index in the array. My vertex shader now contains an int uniform to select the keyframe. I'm pretty sure I needed GLSL version 140 to use the integer types, so I had to upgrade my drivers, and now the integer uniform compiles.
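In code, the setup looks roughly like this (my sketch; it assumes the keyframes are packed back-to-back in the VBO as tightly packed vec3 positions, and relies on array elements occupying consecutive attribute locations):

GLint loc0 = glGetAttribLocation(program, "morph_positions[0]");
for (int i = 0; i < MAX_KEYFRAMES; ++i)
{
    glVertexAttribPointer(loc0 + i, 3, GL_FLOAT, GL_FALSE, 0,
                          (void*)(i * NUM_VERTS * 3 * sizeof(float)));
    glEnableVertexAttribArray(loc0 + i);
}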

However, I am not seeing the animation play like it should. It seems to be stuck at one of the keyframes, even though I'm updating the keyframe uniform, and my interpolation factor doesn't seem to have an effect either, so I think I've got a bug somewhere; I still see no reason why this won't work. Due to time constraints, I'm moving to plan B: avoiding the GLSL array of morph data and instead binding selected VBOs to the v_base and v_target attributes, which seems like what most people think of when doing this.

Each call (glBindBuffer, glVertexAttribPointer, glEnableVertexAttribArray) is about 2000 cycles, iirc. Relying on indexing of huge uniforms/attribute arrays in shaders is a slippery slope. So, make the GPU fetch the attribs for you by abusing the "offset" parameter when specifying a vertex attrib array:


const int NUM_VERTS = 778;
const int NUM_FRAMES= 34;


struct MorphTuple{
	vec3 pos;
	vec3 norm;
};
MorphTuple Morphs[NUM_VERTS*NUM_FRAMES];



GLuint vbo1;
...

void init(){
	glGenBuffers(1,&vbo1);
	glBindBuffer(GL_ARRAY_BUFFER_ARB, vbo1);	
	// 12*2 = sizeof(MorphTuple): two vec3s of 12 bytes each
	glBufferData(GL_ARRAY_BUFFER_ARB, 12*2*NUM_VERTS*NUM_FRAMES,Morphs,GL_STATIC_DRAW);
	
	
	...
	[here init the other VBOs and the IBO]
}


void draw(int frame1,int frame2,float lerp_value){
	glEnableVertexAttribArray(0); // pos1
	glEnableVertexAttribArray(1); // norm1
	glEnableVertexAttribArray(2); // pos2
	glEnableVertexAttribArray(3); // norm2
	
	
	
	int offset; // byte offset into the VBO
	glBindBuffer(GL_ARRAY_BUFFER_ARB,vbo1);
	
	offset = NUM_VERTS*frame1*24; // start of frame1's MorphTuples
	glVertexAttribPointer(0,  3,GL_FLOAT,GL_FALSE,24,(void*)offset);
	offset+=12; // norm follows pos inside the 24-byte tuple
	glVertexAttribPointer(1,  3,GL_FLOAT,GL_FALSE,24,(void*)offset);
	
	offset = NUM_VERTS*frame2*24;
	glVertexAttribPointer(2,  3,GL_FLOAT,GL_FALSE,24,(void*)offset);
	offset+=12;
	glVertexAttribPointer(3,  3,GL_FLOAT,GL_FALSE,24,(void*)offset);
	
	
	[here bind other attribs, in the regular way]
	
	
	glBindBuffer(GL_ELEMENT_ARRAY_BUFFER_ARB,ibo1);

	[here send uniform values, especially the lerp_value]
	
	glDrawElements(...);
	
	
	[here disable vtx-attrib-arrays on demand]
}



attribute vec4 pos1; // attr0
attribute vec3 norm1; // attr1
attribute vec4 pos2; // attr2
attribute vec3 norm2; // attr3

uniform float lerp_value;

void main(){
	vec4 pos = mix(pos1,pos2,lerp_value);
	vec3 norm = normalize(mix(norm1,norm2,lerp_value));
	
	...
	
}

Each call (glBindBuffer, glVertexAttribPointer, glEnableVertexAttribArray) is about 2000 cycles iirc.

How much is glBindVertexArray? You really should be using VAOs where possible.
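For reference, the idea (GL 3.0 / ARB_vertex_array_object) is to record the attribute setup once and replay it with a single bind; a minimal sketch:

GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 24, (void*)0);
glEnableVertexAttribArray(0);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);   // the element binding is captured too
glBindVertexArray(0);

// at draw time, the whole setup is one call:
glBindVertexArray(vao);
glDrawElements( indices... );
glBindVertexArray(0);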

I'm actually trying to write a reusable vertex shader that does not use a hard-coded number of attributes. Anyway, I found my mistake, which relates to your post: I wasn't calculating the offset properly, and all values were being bound to the 0th attribute. I was lucky to see anything, but it explains why I saw the last keyframe's position when rendering. This method is working well and rendering quite fast. I'll need to start considering blends with other animations, and compare the performance to using multiple VBOs. My hunch is that the reusability gained by using an array in the GLSL code is outweighed by the expense of calling glVertexAttribPointer for each of the keyframes, even when eliminating extra calls to glBindBuffer. However, it is definitely possible to store keyframe data in the vertex buffer and bind the data to a GLSL array which can be indexed per keyframe.

Thanks for all the tips.

Hmm, I did the benchmarks again, with quite different results. Either the 190.38 drivers have improved GL speed, or the 2k-cycle results were from my previous PC (C2D E4600, DDR2-800). From the disassembly view while debugging the old drivers, I remember seeing lots of clutter that isn't there now.
These bench results are consistent between GL 2.x, 3.0 and 3.1:


Bench results:
	glBindTexture:			avg=53				min=47	last=57		max=170192, 	count=1809330
	glBindFramebuffer:		avg=152				min=47	last=142	max=24814, 	count=14710
	glBindBuffer:			avg=59				min=47	last=295	max=842222, 	count=7193190
	glVertexAttribPointer:		avg=94				min=67	last=76		max=1171626, 	count=7193190
	glEnableVertexAttribArray:	avg=109				min=47	last=67		max=93432, 	count=14710
	glDisableVertexAttribArray:	avg=128				min=47	last=171	max=940, 	count=14710
	null cycle (count=numFrames):	avg=53				min=38	last=57		max=24007, 	count=7355

It’s important to note my current PC is: c2d E8500 @3.8GHz (overclocked), DDR3@1.6GHz (timing 7-7-7-20), GeForce 8600GT, Forceware 190.38

The avg/min/last/max values are in CPU cycles; "count" is how many times the call was made. The "null cycle" row shows the RDTSC latency (you can see the effects of out-of-order execution). The rendering thread is affinity-locked to one core, and benchmark start/end is triggered manually after the caches are warm. The benchmark ran for 7355 frames at ~700 fps.
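The per-call numbers come from wrapping each GL entry point roughly like this (a sketch; accumulate_stats stands in for my avg/min/max/count bookkeeping):

#include <intrin.h>   // __rdtsc (MSVC)

unsigned __int64 t0 = __rdtsc();
glBindBuffer(GL_ARRAY_BUFFER, vbo);
unsigned __int64 t1 = __rdtsc();
accumulate_stats(&stats_glBindBuffer, t1 - t0);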

I’ll try different new drivers for differences (maybe those 2k cycles were in a beta-driver), and then start using VAO to test their speed.

Under 190.57, some nice and funny results:


without creating or using any VAO:
	glBindTexture:			avg=53				min=47	last=48	max=60040, count=1656565
	glBindFramebuffer:		avg=155				min=47	last=180	max=40014, count=13467
	glBindBuffer:			avg=59				min=47	last=238	max=1238420, count=6585854
	glVertexAttribPointer:		avg=95				min=76	last=142	max=1004710, count=6584874
	glEnableVertexAttribArray:	avg=89				min=47	last=48	max=608, count=13467
	glDisableVertexAttribArray:	avg=122				min=47	last=180	max=11048, count=13465
	null cycle (count=numFrames):	avg=54				min=38	last=85	max=16330, count=6733

creating and using VAO:
	glBindTexture:			avg=314				min=199	last=627	max=115653, count=1326432
	glBindFramebuffer:		avg=40982			min=32414	last=34124	max=82299, count=10784
	glBindBuffer:			avg=509				min=190	last=218	max=1207, count=10784
	glVertexAttribPointer:		avg=514				min=418	last=456	max=817, count=10784
	glBindVertexArray:		avg=412				min=256	last=437	max=1258256, count=2631296
	glBindVertexArray(null):	avg=453				min=285	last=637	max=220523, count=2631296
	null cycle (count=numFrames):	avg=91				min=76	last=86	max=123, count=5392

creating (completely!) but NOT using VAO while drawing:
	glBindTexture:			avg=52				min=47	last=47	max=12074, count=1211058
	glBindFramebuffer:		avg=147				min=47	last=114	max=1045, count=9846
	glBindBuffer:			avg=58				min=47	last=104	max=119614, count=4814694
	glDisableVertexAttribArray:	avg=120				min=47	last=114	max=228, count=9846
	glVertexAttribPointer:		avg=99				min=66	last=76	max=130150, count=4814694
	glEnableVertexAttribArray:	avg=100				min=47	last=57	max=219, count=9846
	null cycle (count=numFrames):	avg=46				min=38	last=95	max=105, count=4923

The VAOs in the test each encapsulate 1-5 vertex attrib arrays, 1 VBO and 1 IBO. The rendering is done to an FBO, and then the FBO texture is splashed onto a fullscreen quad via a streaming VBO.
It's interesting how binding a texture or an FBO becomes slower when VAOs are in use.

@Ilian Dinev:

Could you please explain what method you use to measure those timings? I’d be interested to try it out for some of my work.

I guess these timings are only CPU time, right? We would need to know what those calls cost on the GPU as well to get a complete picture (forcing a sync between buffers, flushing writes to memory, etc.), and I have no idea how to get that information…

While I enjoy the company of the readers of the advanced forum, I'm going to keep talking through the issues I'm having with my original goals. I hope you'll comment on my thoughts; even though I think I'm beyond the 'beginner' forum, maybe I don't yet have the experience to appreciate all the results from the benchmarks above.

Goal: achieve the fastest possible vertex animation.
Method: use server-side memory to store baked keyframe data and interpolate on the GPU, because that should be faster than interpolating in software and writing client-side data each frame.
Optimization: avoid extra calls to glBindBuffer by packing the keyframes into a single server-side vertex buffer.

Implementation: currently limited to a small number of keyframes, because I'm binding each element in the vertex buffer to a vertex shader attribute.
Symptom: I was really happy when my shader compiled and my C code correctly bound the elements of the VBO to the shader's vertex attributes. To my surprise, supplying the 2 uniforms that control the animation state causes a big problem. I supply (a) the keyframe index [let's call it 'kf_index'] required for indexing the array of morph_states, and (b) the interpolation factor [let's call it 't'] for the current keyframe. The problem shows up when switching the keyframe index: the interpolation factor is not always changed at the same time as the keyframe index, even though I update the two uniforms immediately after each other in my C code (2 sequential lines). When the keyframe index uniform changes, I can see a render that appears to use the interpolation factor from previous application frames. As you all know, you must switch kf_index when the interpolation factor reaches its final value (t goes from 0 to 1). For example, when the keyframe index is 0 and 't' reaches its end value, I command this change to the shader state:
[kf_index=0, t=1] -> [kf_index=1, t=0]
But I actually see a frame or two rendered with this shader state:
[kf_index=0, t=1] -> [kf_index=1, t=1]

So, this implementation has a big problem. I think there must be a synchronization issue, since the GPU runs at its own frequency, or else there is just a weird bug in the driver. Whatever the problem is, I know I can't rely on 2 uniforms being updated in the exact same frame.
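For completeness, this is essentially the per-frame update I'm attempting (a sketch; the uniform-location variables are placeholders of mine):

t += dt * anim_speed;
if (t >= 1.0f) { t -= 1.0f; ++kf_index; }   // roll over to the next keyframe

glUseProgram(program);
glUniform1i(loc_kf_index, kf_index);        // both uniforms set back-to-back
glUniform1f(loc_t, t);
glDrawElements( indices... );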

I'm just now realizing that my vertex shader can have fewer attributes than the number of elements in the VBO, if I simply don't bind the elements of the VBO that the shader doesn't use. I won't need the array of morph_states in my shader code, so I won't need the keyframe index uniform anymore. You guys have been telling me this the whole time, but I thought I had to bind every element in the VBO with glVertexAttribPointer. Since I don't, my code will need to be reworked to do the book-keeping necessary to know the offsets for only the necessary shader attribute bindings. Like I said above, I'm probably not quite ready for the 'advanced' forum. :slight_smile: I'll give this one last shot, as it should fix my timing symptom AND avoid the maximum-number-of-keyframes limitation.

Did you try the code I suggested? I think it's the best solution in the given environment (GL 2.0). Vertex texture fetch is slow and not present on many of the target cards, and indexing attributes inside a shader can only be done easily on DX10-class cards.
The benchmarks show that most of those GL calls take 47-100 cycles; on a 2GHz CPU, that would be 23-50 nanoseconds (damn fast). The subsequent glDrawElements() call takes 1065-3000 cycles (also quite fast).

The benchmark-app draws the cs_assault level from Counter-Strike:Source. 250 textures, 1000 VBOs; no spatial-culling enabled. On my underpowered gpu, the whole level is drawn:

  • @2500FPS (0.4ms) at 640x480 and 128x128 with VAO, and
  • @3000FPS (0.33ms) without VAO at 128x128. (but @2700FPS at 640x480)
    (the level being visible or outside view-frustum doesn’t change results)

Ilian,
Yes, I have what you described working. Although I've had some success, I don't completely understand the meaning of the 'stride' parameter (edit: and the offset needed) for glVertexAttribPointer. The manual's definition is "Specifies the byte offset between consecutive generic vertex attributes", but I'm not sure how to apply that to the following examples.


// EDIT: this is actually the case I've been dealing with,
// and is what I think could be the fastest,
// so I'm most interested in it.
struct Vertex
{
    vec2 tex_coord;
    vec3 pos[NUM_KEYFRAMES];
    vec3 norm[NUM_KEYFRAMES];
};

struct BufferV
{
    Vertex verts[NUM_VERTS];
};

struct MeshV
{
    BufferV buff_v;
    GLhandle shader_program;
};


const int NUM_VERTS = 100;
const int NUM_KEYFRAMES = 20;

struct MorphData
{
    vec3 pos;
    vec3 norm;
};

struct BufferA
{
    vec2 texcoord[NUM_VERTS];
    MorphData morph_states[NUM_VERTS*NUM_KEYFRAMES];
};

struct MeshA
{
    BufferA buff_a;
    GLhandle shader_program;
};

What if I start using a mesh that binds 2 buffers? E.g.


struct MorphData
{
    vec3 pos;
    vec3 norm;
};

struct BufferA
{
    MorphData morph_states[NUM_VERTS*NUM_KEYFRAMES];
};

struct BufferB
{
    vec2 texcoord[NUM_VERTS];
};

struct MeshAB
{
    BufferA buff_a;
    BufferB buff_b;
    GLhandle shader_program;
};


// EDIT: forgot to show the vertex shader
in vec2 tex_coord;
in vec3 pos1;
in vec3 norm1;
in vec3 pos2;
in vec3 norm2;
void main()
{
  // do interpolation
  // transform position, normal, tex coord
}
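For the MeshAB case, I assume the key point is that each glVertexAttribPointer call captures whichever buffer is currently bound to GL_ARRAY_BUFFER, so binding two buffers would go roughly like this (a sketch; the names and byte offsets are placeholders of mine):

glBindBuffer(GL_ARRAY_BUFFER, vbo_a);   // BufferA: morph_states, stride 24
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 24, (void*)(frame1_byte_offset));
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 24, (void*)(frame1_byte_offset + 12));
glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, 24, (void*)(frame2_byte_offset));
glVertexAttribPointer(3, 3, GL_FLOAT, GL_FALSE, 24, (void*)(frame2_byte_offset + 12));

glBindBuffer(GL_ARRAY_BUFFER, vbo_b);   // BufferB: texcoords, stride 8 (or 0)
glVertexAttribPointer(4, 2, GL_FLOAT, GL_FALSE, 8, (void*)0);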

In your example, the vertex buffer consists only of an array of MorphData. I've had success binding that kind of VBO using stride=2*sizeof(vec3) (edit: with offsets of 0 and sizeof(vec3), respectively). Answers for the examples above would surely clear up my confusion about what the 'stride' parameter (edit: and the offset needed) really means.

Finally, are you also advocating that I use VAOs for shader-based interpolation? It seems to me that having the data on the GPU would be faster than having it on the client. Yes, your benchmarks are fast, but how do they compare when using VBOs?

Many thanks again!

Now that I've written out what I've been confused about, I think I can see the answer: 'stride' and 'offset' describe the layout of the elements in the vertex buffer so they can be bound to shader attributes. Therefore, the number of attributes the shader uses doesn't factor into the stride and offset values. Is that correct?

With that, my understanding gives these values for glVertexAttribPointer with BufferV:
stride = sizeof(vec2) + NUM_KEYFRAMES*(sizeof(vec3)+sizeof(vec3))
offsets (respectively) = 0, sizeof(vec2), sizeof(vec2)+NUM_KEYFRAMES*sizeof(vec3)

That is probably the hardest one, and I think I've got it figured out. If it's correct, then the others will be straightforward to calculate as well.
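Plugging in concrete numbers (assuming tightly packed structs, so sizeof(vec2)=8 and sizeof(vec3)=12, with NUM_KEYFRAMES=20):

stride           = 8 + 20*(12+12) = 488 bytes
offset tex_coord = 0
offset pos[k]    = 8 + k*12
offset norm[k]   = 8 + 20*12 + k*12 = 248 + k*12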

The "struct Vertex" approach is bad; the "struct MorphData + BufferB" layout is the one I've been recommending as good.
You can still put everything in one VBO:


struct MorphData{
    vec3 pos;
    vec3 norm;
};
struct MorphFrame{
	MorphData verts[NUM_VERTS];
};


struct BufferA{
	vec2 texcoord[NUM_VERTS];
	
	MorphFrame frames[NUM_KEYFRAMES];
	
};





void init(){
	glGenBuffers(1,&vboA);
	glBindBuffer(GL_ARRAY_BUFFER_ARB, vboA);	
	glBufferData(GL_ARRAY_BUFFER_ARB, sizeof(BufferA),pData,GL_STATIC_DRAW);
	
	
	[here init the IBO]
}


void draw(int frame1,int frame2,float lerp_value){
	glEnableVertexAttribArray(0); // pos1
	glEnableVertexAttribArray(1); // norm1
	glEnableVertexAttribArray(2); // pos2
	glEnableVertexAttribArray(3); // norm2
	glEnableVertexAttribArray(4); // texcoord
	
	
	
	glBindBuffer(GL_ARRAY_BUFFER_ARB,vboA);
	
	// NULL-based pointer trick: member addresses relative to a NULL base
	// yield the byte offsets into the VBO.
	BufferA* buf=NULL;
	
	glVertexAttribPointer(0,  3,GL_FLOAT,GL_FALSE,24,&buf->frames[frame1].verts[0].pos);
	glVertexAttribPointer(1,  3,GL_FLOAT,GL_FALSE,24,&buf->frames[frame1].verts[0].norm);
	
	glVertexAttribPointer(2,  3,GL_FLOAT,GL_FALSE,24,&buf->frames[frame2].verts[0].pos);
	glVertexAttribPointer(3,  3,GL_FLOAT,GL_FALSE,24,&buf->frames[frame2].verts[0].norm);
	
	glVertexAttribPointer(4,  2,GL_FLOAT,GL_FALSE,8,&buf->texcoord);
		
	
	glBindBuffer(GL_ELEMENT_ARRAY_BUFFER_ARB,iboA);

	[here send uniform values, especially the lerp_value]
	
	glDrawElements(...);
	
	glDisableVertexAttribArray(0);
	glDisableVertexAttribArray(1);
	glDisableVertexAttribArray(2);
	glDisableVertexAttribArray(3);
	glDisableVertexAttribArray(4);
}


The benchmarks I did use only VBOs and VAOs (that point to VBOs), so all of the vertex data was residing in VRAM.
Your vertex tweening cannot and must not use VAOs (the attribute offsets change every frame, while a VAO captures them once), so forget about them for now.