VAR performance

Hi!

I’ve finished my renderer, which uses VAR. Texturing and one light are enabled, and I get 1.3 million triangles/sec on a 633 MHz Celeron + GeForce2 MX200. Is this good performance? Can someone tell me the performance of their renderer so I can compare with mine?

I copy the data from system memory into AGP memory with memcpy every time. This slows things a bit. Is there a way to copy data into AGP faster?

Something I noticed the other day was a memcpy function on AMD’s page. It’s AMD-optimized whereas you’re on Intel, but it might be a good place to start.

I have an AMD Athlon 600 and a GeForce 256 DDR. I’m using VAR with one directional light, (software) Bézier patches for the terrain, and progressive meshes for the objects, and I get 1.4-1.6 million triangles/sec.

http://www.rapso.de/newst.htm

so we’ve nearly the same performance.

Thanks! So this performance isn’t so bad. BTW, I’ve downloaded a few assembly memcpy routines, I’ll try them…

What’s interesting in this case is more the number of vertices you copy, since this will be the bottleneck. Using triangle strips, the GeForce can probably transform up to 8-10 million triangles/s with one texture and one light (non-local viewer).

What’s important is to minimize the amount of geometry you have to transfer. In the case of PMs, you might be able to get away with sending the highest LOD once, if the lower LODs use subsets of the vertex positions of the higher LODs…
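For illustration, a minimal sketch of that idea (the names varVertices, lodIndices, lodIndexCount, currentLod and the vertex format are placeholders, not Michael’s code): the highest-LOD vertex data is uploaded to the VAR buffer once, and each lower LOD is just a smaller index list that reuses a subset of those vertices.

glInterleavedArrays(GL_T2F_N3F_V3F, 0, varVertices); /* vertex data bound once */

/* per frame: pick an LOD and draw only its (system-memory) index list */
glDrawElements(GL_TRIANGLES, lodIndexCount[currentLod],
               GL_UNSIGNED_SHORT, lodIndices[currentLod]);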

Michael

On my (GeForce3) setup, there were some gains from using clever memcpy variants (block writes, prefetching, SIMD streaming, etc.). I have example code, but it’s Intel-specific.

If you can restrict the percentage of vertices you update each frame to under 20%, you may see a gain over a full copy. Quite nice for LOD schemes.
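A small sketch of that idea, with placeholder names (agpVerts, sysVerts, dirtyStart, dirtyCount, vertexCount are not from the post): copy only the vertex range that actually changed this frame rather than the whole buffer.

if (dirtyCount > 0 && dirtyCount * 5 < vertexCount) /* under 20% changed */
{
    memcpy(agpVerts + dirtyStart, sysVerts + dirtyStart,
           dirtyCount * sizeof(*agpVerts));
}
else
{
    memcpy(agpVerts, sysVerts, vertexCount * sizeof(*agpVerts)); /* full copy */
}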

You have to take care not to update your buffer while it’s being rendered, so I assume you’re using fences/double VAR buffers?

Rob

1.3 million tris/sec is a low number. Without texturing and lighting, my GF1 DDR pushes 10 million tris/sec. What results do you get if you disable VAR and use regular system memory for the vertex arrays?

-Lev

I tried some memcpy routines but they didn’t improve performance.

I think I’m doing something wrong, because normal vertex array performance is better than VAR. When I don’t copy the data with memcpy (I generate the data directly in AGP memory) I get slightly better performance than with normal vertex arrays.
I’m using a large VAR buffer and filling it sequentially (starting from the beginning when it’s full), and I use fences, but I can’t get a better transfer rate… What am I doing wrong? I transfer 4096 vertices with one glDrawElements call; can this be the problem? (I can show some code tomorrow…)

I keep saying this because it seems non-intuitive: make sure you don’t store your index array in AGP memory. It needs to be in cached system memory, allocated with a normal malloc() or new.
If you put the indices in AGP/video memory, you get crap performance. Only the attribute arrays (vertex, normal, texcoords, etc.) should be in AGP/video memory.
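A small sketch of that split, assuming a tsVertex-style interleaved struct and placeholder counts (in real code the NV entry points are fetched with wglGetProcAddress, as the VAR setup later in this thread shows): attributes go into AGP memory, indices stay in plain malloc’d system memory.

tsVertex *verts = (tsVertex *)wglAllocateMemoryNV(vertCount * sizeof(tsVertex),
                                                  0.2f, 0.2f, 0.7f); /* AGP */
GLushort *idx = (GLushort *)malloc(idxCount * sizeof(GLushort)); /* cached system memory */

glVertexArrayRangeNV(vertCount * sizeof(tsVertex), verts);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
glInterleavedArrays(GL_T2F_N3F_V3F, 0, verts);                   /* attributes read from AGP */
glDrawElements(GL_TRIANGLES, idxCount, GL_UNSIGNED_SHORT, idx);  /* indices read from system RAM */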

memcpy() is pretty decent for copying to AGP memory from cached memory. It’s actually one of the few cases where you CAN’T get much better with hand-coding. Cached to cached memory is a MUCH worse case for memcpy().

I’d look at the vertex formats you’re sending. For example, consider making your positions and normals shorts, and make up the difference in the modelview matrix. Texture coordinates, same thing, if present.
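A rough sketch of that suggestion (jwatte’s post has no code; scale, VERT_COUNT, floatPos, indexCount and indices are placeholders): quantize positions to GLshort and undo the quantization with a scale in the modelview matrix, halving the per-vertex position size.

const GLfloat scale = 100.0f;      /* assumed world-units-to-short factor */
GLshort shortPos[3 * VERT_COUNT];  /* 6 bytes per position instead of 12  */
int i;

for (i = 0; i < VERT_COUNT; i++)
{
    shortPos[3 * i + 0] = (GLshort)(floatPos[3 * i + 0] * scale);
    shortPos[3 * i + 1] = (GLshort)(floatPos[3 * i + 1] * scale);
    shortPos[3 * i + 2] = (GLshort)(floatPos[3 * i + 2] * scale);
}

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_SHORT, 0, shortPos);

glMatrixMode(GL_MODELVIEW);
glPushMatrix();
glScalef(1.0f / scale, 1.0f / scale, 1.0f / scale); /* make up the difference */
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
glPopMatrix();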

Now, an MX200 is about as slow as you get, so you should make sure you’re not fill rate bound. Make the window really small, or turn on glCullFace(GL_FRONT_AND_BACK), and then measure vertex throughput.

I have only been able to achieve 6 million tris a sec using glDrawElements on a GeForce. The trick is to remove any elements that are duplicated. In other words, if you have a cube, after removing duplicate elements you’ll only have 8 vertices in the best case. If your normals are perpendicular to their surfaces, you’ll have 36 vertices, but only if you use tris.

If you use tristrips, you can get an even higher benchmark.

[This message has been edited by WhatEver (edited 05-09-2002).]

To get any kind of repeatable and comparable benchmark numbers for your geometry performance, you have to

  1. turn off rasterization (e.g., glCullFace(GL_FRONT_AND_BACK), as jwatte said)

  2. make sure you are not measuring swapbuffers time in windowed mode (this can be considerable!), so keep the window very small if your timing extends beyond swapbuffers

  3. count TRANSFORMED VERTICES, not triangles! Ultimately, this means you would have to set up a vertex cache simulation to find out how many shared vertices can be taken from the cache. If you don’t share any vertices, then counting the number of indices sent via glDrawElements suffices…

Any optimizations dealing with triangle strips and making your mesh cache-friendly should be treated separately.
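A rough sketch of such a measurement, with placeholder timer and counts (GetTimeSeconds, FRAMES, INDEX_COUNT and indices are not from the post): cull everything so nothing is rasterized, then count vertices sent (equal to the indices sent if no vertices are shared) instead of triangles.

double start;
long vertsSent = 0;
int frame;

glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT_AND_BACK); /* nothing reaches the rasterizer */

start = GetTimeSeconds();      /* hypothetical timer */
for (frame = 0; frame < FRAMES; frame++)
{
    glDrawElements(GL_TRIANGLE_STRIP, INDEX_COUNT, GL_UNSIGNED_SHORT, indices);
    vertsSent += INDEX_COUNT;  /* transformed vertices, assuming no sharing */
}
glFinish();                    /* wait until the GPU is really done */
printf("%f vertices/sec\n", vertsSent / (GetTimeSeconds() - start));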

Michael

Thanks for your help, guys, but I don’t think those are the problem in my code. So here it is:

These are the variables used in my code:

typedef struct tsVertex
{
    GLfloat u;
    GLfloat v;
    GLfloat nx;
    GLfloat ny;
    GLfloat nz;
    GLfloat x;
    GLfloat y;
    GLfloat z;
} tsVertex;

tsVertex *VertexBuffer;
tsVertex *Buffers[TS_RENDERER_BUFF_NUM];
GLuint BufferFence[TS_RENDERER_BUFF_NUM];
GLuint BufferLevel;
GLuint CurrentBuffer;

tsVertex VA[4096];
GLushort indices[8064];

This is where I initialize OpenGL:

glViewport(0, 0, CurrentMode->sWidth, CurrentMode->sHeight);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(45.0, (GLfloat)CurrentMode->sWidth/(GLfloat)CurrentMode->sHeight, TS_NEAR_CLIPPING_PLANE, TS_FAR_CLIPPING_PLANE);

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glClearDepth(1.0);
glDrawBuffer(GL_BACK);
glShadeModel(GL_SMOOTH);
glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);
glPolygonMode(GL_FRONT, GL_FILL);
glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_NICEST);
glHint(GL_FOG_HINT, GL_NICEST);
glEnable(GL_LIGHTING);
glEnable(GL_LIGHT0);
glEnable(GL_TEXTURE_2D);

glClearColor(0.0, 0.0, 0.0, 1.0);

glMatrixMode(GL_MODELVIEW);
glLoadIdentity();

This is the VAR initialization code:

FlushVertexArrayRangeNV = (PFNGLFLUSHVERTEXARRAYRANGENVPROC)wglGetProcAddress("glFlushVertexArrayRangeNV");
VertexArrayRangeNV = (PFNGLVERTEXARRAYRANGENVPROC)wglGetProcAddress("glVertexArrayRangeNV");
AllocateMemoryNV = (PFNWGLALLOCATEMEMORYNVPROC)wglGetProcAddress("wglAllocateMemoryNV");
FreeMemoryNV = (PFNWGLFREEMEMORYNVPROC)wglGetProcAddress("wglFreeMemoryNV");
if (!FlushVertexArrayRangeNV || !VertexArrayRangeNV || !AllocateMemoryNV || !FreeMemoryNV)
{
    return -1;
}

GenFencesNV = (PFNGLGENFENCESNVPROC)wglGetProcAddress("glGenFencesNV");
DeleteFencesNV = (PFNGLDELETEFENCESNVPROC)wglGetProcAddress("glDeleteFencesNV");
SetFenceNV = (PFNGLSETFENCENVPROC)wglGetProcAddress("glSetFenceNV");
TestFenceNV = (PFNGLTESTFENCENVPROC)wglGetProcAddress("glTestFenceNV");
FinishFenceNV = (PFNGLFINISHFENCENVPROC)wglGetProcAddress("glFinishFenceNV");
if (!GenFencesNV || !DeleteFencesNV || !SetFenceNV || !TestFenceNV || !FinishFenceNV)
{
    return -1;
}

VertexBuffer = (tsVertex *)AllocateMemoryNV(sizeof(tsVertex) * 65536, 0.2f, 0.2f, 0.7f);
if (VertexBuffer == NULL)
{
return -1;
}

VertexArrayRangeNV(sizeof(tsVertex) * 65536, VertexBuffer);
Buffers[0] = VertexBuffer;
BufferLevel = 0;
CurrentBuffer = 0;
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
glInterleavedArrays(GL_T2F_N3F_V3F, 0, Buffers[0]);
for (i = 0; i < TS_RENDERER_BUFF_NUM; i++)
GenFencesNV(1, &(BufferFence[i]));

This is the rendering function:

void RenderArray(tsVertex *VArray, GLsizei VNumber, GLushort *Indices, GLsizei INumber)
{
    int i, j;

    if (!TestFenceNV(BufferFence[CurrentBuffer]))
        FinishFenceNV(BufferFence[CurrentBuffer]);
    if (BufferLevel + VNumber > 65536)
        BufferLevel = 0;
    Buffers[CurrentBuffer] = &(VertexBuffer[BufferLevel]);
    memcpy(Buffers[CurrentBuffer], VArray, sizeof(tsVertex) * VNumber);

    glInterleavedArrays(GL_T2F_N3F_V3F, 0, Buffers[CurrentBuffer]);
    glDrawElements(GL_TRIANGLE_STRIP, INumber, GL_UNSIGNED_SHORT, Indices);
    SetFenceNV(BufferFence[CurrentBuffer], GL_ALL_COMPLETED_NV);
    BufferLevel += VNumber;
    CurrentBuffer++;
    CurrentBuffer %= TS_RENDERER_BUFF_NUM;
}

Here is the data generation code:

for (i = 0; i < 64; i++)
{
    for (j = 0; j < 64; j++)
    {
        VA[i * 64 + j].x = j * 0.1f;
        VA[i * 64 + j].y = i * 0.1f;
        VA[i * 64 + j].z = 0.0f;
        VA[i * 64 + j].nx = 0.0f;
        VA[i * 64 + j].ny = 0.0f;
        VA[i * 64 + j].nz = 1.0f;
        VA[i * 64 + j].u = (GLfloat)j / (GLfloat)63;
        VA[i * 64 + j].v = (GLfloat)i / (GLfloat)63;
    }
}

for (i = 0; i < 63; i++)
{
    for (j = 0; j < 64; j++)
    {
        indices[i * 128 + j * 2] = i * 64 + j;
        indices[i * 128 + j * 2 + 1] = (i + 1) * 64 + j;
    }
}

And finally the rendering loop:

for (i = 0; i < 7; i++)
{
    glLoadIdentity();
    Camera->Apply();
    glTranslatef(0.0, 0.0, i * (-5.0f));
    RenderArray(VA, 4096, indices, 8064);
}

The main problem is that I get better performance in normal vertex array mode than in VAR mode. The performance only gets higher than normal vertex array mode if I replace the memcpy in RenderArray() with the data generation code. Is my code correct and this is simply the maximum transfer rate of my MX200, or am I doing something wrong?

[This message has been edited by Catman (edited 05-10-2002).]

I forgot to mention that 1.3M was with GL_TRIANGLES; after I switched to GL_TRIANGLE_STRIP I got 1.5-1.6M…

So, did anyone check my code? Is it correct? Please answer my question…

Maybe the bottleneck is in your data generation code? You have 2 divides in there (would be smart to replace them with reciprocal multiplies).
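Applied to the data generation loop posted above, that just means computing the reciprocal once outside the loops and multiplying per vertex:

const GLfloat inv63 = 1.0f / 63.0f; /* computed once, before the loops */

/* ... inside the inner loop ... */
VA[i * 64 + j].u = (GLfloat)j * inv63;
VA[i * 64 + j].v = (GLfloat)i * inv63;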

Also, for static data you of course don’t want to memcpy it all the time. You just want to put it in AGP/video memory and leave it there.

In my experience (mainly tested on Xbox, but also a bit on regular PCs), large gains CAN be made over memcpy when copying into AGP: first prefetch blocks of, say, 4 KB into the L1 cache, then store them out. This avoids reads from uncached memory while you are writing to AGP.

I would expect your data generation code to already sort of have this benefit, as it doesn’t do any reads.
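For what it’s worth, a minimal sketch of such a prefetch-then-stream copy using SSE intrinsics (my assumptions, not code from this thread: it requires SSE, 16-byte-aligned pointers, a size that is a multiple of 16 bytes, and the 4 KB block size is just a guess at an L1-friendly value):

#include <stddef.h>
#include <xmmintrin.h>

static void CopyToAGP(void *dst, const void *src, size_t bytes)
{
    const size_t block = 4096;          /* work in roughly L1-sized blocks */
    const char *s = (const char *)src;
    char *d = (char *)dst;
    size_t done, i, n;

    for (done = 0; done < bytes; done += n, s += n, d += n)
    {
        n = (bytes - done < block) ? (bytes - done) : block;

        /* 1) pull the source block into L1 first, so the store loop below
              doesn't stall on reads (cache-missing reads tend to flush the
              CPU's write-combine buffers) */
        for (i = 0; i < n; i += 32)
            _mm_prefetch(s + i, _MM_HINT_T0);

        /* 2) write the block out with non-temporal (streaming) stores,
              which bypass the cache and combine into full bus bursts */
        for (i = 0; i < n; i += 16)
            _mm_stream_ps((float *)(d + i), _mm_load_ps((const float *)(s + i)));
    }
    _mm_sfence();                       /* make the streaming stores visible */
}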

Yes, you have to be careful about write combining. Some CPUs (P4, ahem) have rather surprisingly picky write combiners…

P3 and Athlon have fairly friendly write combiners, but you still need to be careful.

In general, the bad things are unaligned/nonsequential writes; any read that results in a cache miss (exact rules depend on chip); and partially-full write combine buffers being flushed.

  • Matt

Write combining is the lowest priority functionality of the LFBs. If a fetch is necessary for anything (including promoting data from L2 to L1) that has higher priority than write combining.

I’ve heard someone say that he’d seen a CPU in simulation choose to evict a partially full LFB (write combiner) rather than choose one that’s available and empty. I almost believe him.

Anyway, even an L1 miss (L2 hit) is likely to evict your write combiners. Make sure you only operate on an L1-sized working set at a time: batch your operations/updates and use cache pre-warming so that fetches into L1 don’t blow away your write combining while you’re processing.

I don’t really understand all these hardware things, but…

Yesterday I tested my program on another system with a GeForce4 MX and a motherboard with an Intel chipset (mine has a VIA chipset). I got 5.5M with normal vertex arrays and 7.2M with VAR. So I think the problem is my motherboard. BTW, the GeForce4 system didn’t have AGP 4x, so these numbers would be higher on an AGP 4x system.

What the…

How the hell do you guys get millions of triangles?
What’s the fps? 0.00000000001?
Even with a simple cube I drop to something like 100 fps with a GeForce2 MX.
Is it the VAR that makes the whole difference?
I use display lists for now, but I don’t see how I can get millions of triangles running at a good fps!

How do you do that?


Evil-Dog
Let’s have a funny day

These are millions of triangles per second, not per frame. 1.6M triangles/sec is about 53,000 triangles/frame at 30 fps, or 80,000 at 20 fps (on my system).