
View Full Version : VAR performance



Catman
05-09-2002, 01:02 AM
Hi!

I've finished my renderer which is using VAR. Texturing and 1 light is on, and I get 1.3 million triangles/sec on a 633 celeron + gef2MX200. Is this performance good? Can someone tell me the performance of his renderer (to compare with mine)?

I copy the data from system memory into AGP memory with memcpy every time. This slows things a bit. Is there a way to copy data into AGP faster?

JelloFish
05-09-2002, 01:10 AM
Something I noticed the other day was a memcpy function on AMD's page. It's AMD-optimized whereas you're on Intel, but it might be a good place to start.

rapso
05-09-2002, 01:37 AM
I have an AMD Athlon 600 and a GF256 DDR, and I'm using VAR with one directional light, (software) bezier patches for the terrain, and progressive meshes for the objects, getting 1.4-1.6 million triangles/sec

http://www.rapso.de/newst.htm

so we've nearly the same performance.

Catman
05-09-2002, 02:05 AM
Thanks! So this performance isn't so bad. BTW, I've downloaded a few assembly memcpy routines, I'll try them...

wimmer
05-09-2002, 02:28 AM
What's interesting in this case is more the number of vertices you copy, since this will be the bottleneck. Using triangle strips, the Geforce can probably transform up to 8-10 Million triangles/s with 1 texture and one light (non-local viewer).

What's important is to minimize the amount of geometry you have to transfer. In the case of PMs, you might be able to get away with sending the highest LOD once, if the lower LODs use subsets of the vertex positions of the higher LODs...

Michael
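A minimal sketch of that subset trick (illustrative C; the names are mine, not from the thread): if the highest LOD's vertex array is ordered so that each coarser LOD indexes only a prefix of it, the vertex data goes to AGP once and only the system-memory index lists change when the LOD changes.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every index of this LOD stays inside the first 'prefix_len'
 * vertices, i.e. the LOD can share the single uploaded vertex buffer. */
int lod_fits_prefix(const unsigned short *indices, size_t index_count,
                    size_t prefix_len)
{
    for (size_t i = 0; i < index_count; i++)
        if (indices[i] >= prefix_len)
            return 0;
    return 1;
}
```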

Lev
05-09-2002, 03:57 AM
1.3 million tris/sec is a low number. Without texturing and lighting my GF1 DDR pushes 10 million tris/sec. What results do you get if you disable VAR and use regular system memory for the vertex arrays?

-Lev

pocketmoon
05-09-2002, 03:57 AM
On my (GeForce3) setup there were some gains from clever memcpy's (block writes, prefetching, SIMD streaming etc.). I have example code, but it's Intel-specific.

If you can restrict the % of vertices you update each frame to <20%, you may see a gain over a full copy. Quite nice for LOD schemes.

You have to take care not to be updating your buffer while it's being rendered, so I assume you're using fences/double VAR buffers?

Rob
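A sketch of that partial-update idea, assuming a dirty vertex range is tracked per frame (the struct mirrors the GL_T2F_N3F_V3F layout used later in the thread; the function name is hypothetical):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

typedef struct {
    float u, v;          /* texture coordinates */
    float nx, ny, nz;    /* normal              */
    float x, y, z;       /* position            */
} tsVertex;

/* Copy only vertices [first, first + count) into the AGP-resident buffer
 * instead of memcpy'ing the whole array every frame. */
void update_dirty_range(tsVertex *agp_dst, const tsVertex *sys_src,
                        size_t first, size_t count)
{
    memcpy(agp_dst + first, sys_src + first, count * sizeof(tsVertex));
}
```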

Catman
05-09-2002, 07:19 AM
I tried some memcpy routines but they didn't improve performance.

I think I'm doing something wrong, because normal vertex array performance is better than VAR. When I don't copy data with memcpy (I create the data directly in AGP memory) I get a bit higher performance than normal vertex arrays.
I'm using a large VAR buffer, filling it sequentially (starting from the beginning when it's full), and I use fences, but I can't get a better transfer rate... What am I doing wrong? I transfer 4096 vertices with one glDrawElements call, can this be the problem? (I can show some code tomorrow...)

knackered
05-09-2002, 01:40 PM
I keep saying this, because it seems non-intuitive, but make sure you don't store your indices array in agp memory, it needs to be in cached system memory - using a normal malloc() or new.
If you put the indices in agp/video memory, you get crap performance. Only the attribute arrays (vertex,normal,texcoords etc.) should be in agp/video mem.

jwatte
05-09-2002, 02:58 PM
memcpy() is pretty decent for copying to AGP memory from cached memory. It's actually one of the few cases where you CAN'T get much better with hand-coding. Cached to cached memory is a MUCH worse case for memcpy().

I'd look at the vertex formats you're sending. For example, consider making your positions and normals shorts, and make up the difference in the modelview matrix. Texture coordinates, same thing, if present.

Now, an MX200 is about as slow as you get, so you should make sure you're not fill rate bound. Make the window really small, or turn on glCullFace(GL_FRONT_AND_BACK), and then measure vertex throughput.
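A sketch of the short-quantization idea (helper names are mine; the 32767 scale and the clamping are assumptions, and the inverse scale would be folded into the modelview matrix with something like glScalef(extent / 32767.0f, ...) before drawing):

```c
#include <assert.h>
#include <math.h>

/* Quantize one coordinate against the model's bounding extent so it fits
 * in a GLshort; halves the size of each position component. */
short quantize_coord(float x, float extent)
{
    float q = x / extent * 32767.0f;
    if (q >  32767.0f) q =  32767.0f;   /* clamp to short range */
    if (q < -32767.0f) q = -32767.0f;
    return (short)(q >= 0.0f ? q + 0.5f : q - 0.5f);
}

/* What the pipeline effectively computes once the scale is in the matrix. */
float dequantize_coord(short s, float extent)
{
    return (float)s * extent / 32767.0f;
}
```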

WhatEver
05-09-2002, 04:30 PM
I have only been able to achieve 6 million tris a sec using glDrawElements on a GeForce. The trick is to remove any elements that are duplicated. In other words, if you have a cube, after removing duplicate elements you'll only have 8 vertices in the best case. If your normals are perpendicular to their faces you'll need 24 unique vertices (36 indices), but only if you use tris.

If you use tristrips then you can get an even higher benchmark.

[This message has been edited by WhatEver (edited 05-09-2002).]
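A sketch of that deduplication step (hypothetical helper, O(n^2) for clarity; a real tool would hash the vertices): identical vertices are collapsed and one index is emitted per original vertex, so a smooth-shaded cube ends up with 8 unique entries.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Deduplicate 'count' xyz vertices in place; writes one remapped index per
 * input vertex and returns the number of unique vertices kept. */
size_t dedup_vertices(float (*verts)[3], size_t count, unsigned short *indices)
{
    size_t unique = 0;
    for (size_t i = 0; i < count; i++) {
        size_t j;
        for (j = 0; j < unique; j++)
            if (memcmp(verts[j], verts[i], sizeof(float) * 3) == 0)
                break;                   /* already have this vertex */
        if (j == unique)
            memcpy(verts[unique++], verts[i], sizeof(float) * 3);
        indices[i] = (unsigned short)j;  /* remap original slot to survivor */
    }
    return unique;
}
```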

wimmer
05-09-2002, 09:25 PM
To get any kind of repeatable and comparable benchmark numbers for your geometry performance, you have to

1) turn off rasterization (e.g., glCullFace(GL_FRONT_AND_BACK), as jwatte said)

2) make sure you are not measuring swapbuffers time in a windowed mode (this can be considerable!), so make sure the window is very small if your timing extends beyond swapbuffers

3) count TRANSFORMED VERTICES, not triangles! Ultimately, this means you would have to set up a vertex cache simulation to find out how many shared vertices can be taken from the cache. If you don't share any vertices, then counting the number of indices sent via glDrawElements suffices...

Any optimizations dealing with triangle strips and making your mesh cache-friendly should be treated separately.

Michael
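Point 3 can be sketched with a small cache simulation (the 16-entry size and FIFO policy are my assumptions about GeForce-era post-transform caches, not something stated in the thread):

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 16

/* Returns how many indices miss a FIFO cache of CACHE_SIZE entries, i.e.
 * how many vertices the GPU would actually have to transform. */
size_t count_transformed(const unsigned short *indices, size_t count)
{
    int cache[CACHE_SIZE];
    size_t filled = 0, head = 0, misses = 0;
    for (size_t n = 0; n < count; n++) {
        size_t k;
        for (k = 0; k < filled; k++)
            if (cache[k] == indices[n])
                break;                       /* cache hit: no transform */
        if (k == filled) {
            misses++;                        /* miss: vertex is transformed */
            if (filled < CACHE_SIZE)
                cache[filled++] = indices[n];
            else {
                cache[head] = indices[n];    /* FIFO replacement */
                head = (head + 1) % CACHE_SIZE;
            }
        }
    }
    return misses;
}
```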

Catman
05-10-2002, 02:11 AM
Thanks for your help guys, but I think those are not the problem in my code. So here it is:





These are the variables used in my code:

typedef struct tsVertex
{
    GLfloat u;
    GLfloat v;
    GLfloat nx;
    GLfloat ny;
    GLfloat nz;
    GLfloat x;
    GLfloat y;
    GLfloat z;
} tsVertex;

tsVertex *VertexBuffer;
tsVertex *Buffers[TS_RENDERER_BUFF_NUM];
GLuint BufferFence[TS_RENDERER_BUFF_NUM];
GLuint BufferLevel;
GLuint CurrentBuffer;

tsVertex VA[4096];
GLushort indices[8064];


This is where I initialize OpenGL:

glViewport(0, 0, CurrentMode->sWidth, CurrentMode->sHeight);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(45.0, (GLfloat)CurrentMode->sWidth/(GLfloat)CurrentMode->sHeight, TS_NEAR_CLIPPING_PLANE, TS_FAR_CLIPPING_PLANE);

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glClearDepth(1.0);
glDrawBuffer(GL_BACK);
glShadeModel(GL_SMOOTH);
glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);
glPolygonMode(GL_FRONT, GL_FILL);
glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_NICEST);
glHint(GL_FOG_HINT, GL_NICEST);
glEnable(GL_LIGHTING);
glEnable(GL_LIGHT0);
glEnable(GL_TEXTURE_2D);

glClearColor(0.0, 0.0, 0.0, 1.0);

glMatrixMode(GL_MODELVIEW);
glLoadIdentity();


This is the VAR initialization code:

FlushVertexArrayRangeNV = (PFNGLFLUSHVERTEXARRAYRANGENVPROC)wglGetProcAddress("glFlushVertexArrayRangeNV");
VertexArrayRangeNV = (PFNGLVERTEXARRAYRANGENVPROC)wglGetProcAddress("glVertexArrayRangeNV");
AllocateMemoryNV = (PFNWGLALLOCATEMEMORYNVPROC)wglGetProcAddress("wglAllocateMemoryNV");
FreeMemoryNV = (PFNWGLFREEMEMORYNVPROC)wglGetProcAddress("wglFreeMemoryNV");
if (!FlushVertexArrayRangeNV || !VertexArrayRangeNV || !AllocateMemoryNV || !FreeMemoryNV)
{
    return -1;
}

GenFencesNV = (PFNGLGENFENCESNVPROC)wglGetProcAddress("glGenFencesNV");
DeleteFencesNV = (PFNGLDELETEFENCESNVPROC)wglGetProcAddress("glDeleteFencesNV");
SetFenceNV = (PFNGLSETFENCENVPROC)wglGetProcAddress("glSetFenceNV");
TestFenceNV = (PFNGLTESTFENCENVPROC)wglGetProcAddress("glTestFenceNV");
FinishFenceNV = (PFNGLFINISHFENCENVPROC)wglGetProcAddress("glFinishFenceNV");
if (!GenFencesNV || !DeleteFencesNV || !SetFenceNV || !TestFenceNV || !FinishFenceNV)
{
    return -1;
}

VertexBuffer = (tsVertex *)AllocateMemoryNV(sizeof(tsVertex) * 65536, 0.2f, 0.2f, 0.7f);
if (VertexBuffer == NULL)
{
    return -1;
}

VertexArrayRangeNV(sizeof(tsVertex) * 65536, VertexBuffer);
Buffers[0] = VertexBuffer;
BufferLevel = 0;
CurrentBuffer = 0;
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
glInterleavedArrays(GL_T2F_N3F_V3F, 0, Buffers[0]);
for (i = 0; i < TS_RENDERER_BUFF_NUM; i++)
    GenFencesNV(1, &(BufferFence[i]));


This is the rendering function:

void RenderArray(tsVertex *VArray, GLsizei VNumber, GLushort *Indices, GLsizei INumber)
{
    int i, j;

    if (!TestFenceNV(BufferFence[CurrentBuffer]))
        FinishFenceNV(BufferFence[CurrentBuffer]);
    if (BufferLevel + VNumber > 65536)
        BufferLevel = 0;
    Buffers[CurrentBuffer] = &(VertexBuffer[BufferLevel]);
    memcpy(Buffers[CurrentBuffer], VArray, sizeof(tsVertex) * VNumber);

    glInterleavedArrays(GL_T2F_N3F_V3F, 0, Buffers[CurrentBuffer]);
    glDrawElements(GL_TRIANGLE_STRIP, INumber, GL_UNSIGNED_SHORT, Indices);
    SetFenceNV(BufferFence[CurrentBuffer], GL_ALL_COMPLETED_NV);
    BufferLevel += VNumber;
    CurrentBuffer++;
    CurrentBuffer %= TS_RENDERER_BUFF_NUM;
}


Here is the data generation code:

for (i = 0; i < 64; i++)
{
    for (j = 0; j < 64; j++)
    {
        VA[i * 64 + j].x = j * 0.1f;
        VA[i * 64 + j].y = i * 0.1f;
        VA[i * 64 + j].z = 0.0f;
        VA[i * 64 + j].nx = 0.0f;
        VA[i * 64 + j].ny = 0.0f;
        VA[i * 64 + j].nz = 1.0f;
        VA[i * 64 + j].u = (GLfloat)j / (GLfloat)63;
        VA[i * 64 + j].v = (GLfloat)i / (GLfloat)63;
    }
}

for (i = 0; i < 63; i++)
{
    for (j = 0; j < 64; j++)
    {
        indices[i * 128 + j * 2] = i * 64 + j;
        indices[i * 128 + j * 2 + 1] = (i + 1) * 64 + j;
    }
}


And finally the rendering loop:

for (i = 0; i < 7; i++)
{
    glLoadIdentity();
    Camera->Apply();
    glTranslatef(0.0, 0.0, i * (-5.0f));
    RenderArray(VA, 4096, indices, 8064);
}


The main problem is that I get better performance in normal vertex array mode than in VAR mode. Performance only gets higher than normal vertex array mode if I replace the memcpy in RenderArray() with the data generation code. Is my code correct and this is simply the maximum transfer rate of my MX200, or am I doing something wrong?

[This message has been edited by Catman (edited 05-10-2002).]

Catman
05-14-2002, 12:11 AM
I forgot to mention that 1.3M was with GL_TRIANGLES; after I started to use GL_TRIANGLE_STRIP I got 1.5-1.6M...

So did someone check my code? Is it correct? Please answer my question...

Jurjen Katsman
05-14-2002, 11:06 AM
Maybe the bottleneck is in your data generation code? You have two divides in there (it would be smart to replace them with reciprocal multiplies).

Also, for static data you of course don't want to memcpy it all the time. You just want to put it in AGP/video memory and leave it there.

In my experience (mainly tested on Xbox, but also a bit on regular PCs) large gains CAN be made over memcpy when copying into AGP: first prefetching blocks of, say, 4k into L1 cache, then storing them out. This avoids reads to uncached memory while you are writing to AGP.

I would expect your data generation code to already sort of have this benefit, as it doesn't do any reads.
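That block-prefetch copy might look like the sketch below (my assumptions: 4k blocks, 64-byte cache lines, and a plain read pass to warm L1; a production version would use prefetch instructions and streaming stores instead):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

#define BLOCK 4096   /* bytes warmed into L1 per pass */
#define LINE  64     /* assumed cache-line size       */

/* Warm each source block into L1 with a read pass, then store it out with
 * memcpy, so the write-combined stores to AGP aren't interrupted by read
 * misses. */
void block_prefetch_copy(void *dst, const void *src, size_t bytes)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;
    volatile char sink;                 /* keeps the read pass alive */
    while (bytes > 0) {
        size_t chunk = bytes < BLOCK ? bytes : BLOCK;
        for (size_t i = 0; i < chunk; i += LINE)
            sink = s[i];                /* one touch per cache line */
        (void)sink;
        memcpy(d, s, chunk);            /* writes now read from L1 */
        s += chunk;
        d += chunk;
        bytes -= chunk;
    }
}
```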

mcraighead
05-14-2002, 11:35 AM
Yes, you have to be careful about write combining. Some CPUs (P4, ahem) have rather surprisingly picky write combiners...

P3 and Athlon have fairly friendly write combiners, but you still need to be careful.

In general, the bad things are unaligned/nonsequential writes; any read that results in a cache miss (exact rules depend on chip); and partially-full write combine buffers being flushed.

- Matt

jwatte
05-14-2002, 02:15 PM
Write combining is the lowest priority functionality of the LFBs. If a fetch is necessary for anything (including promoting data from L2 to L1) that has higher priority than write combining.

I've heard someone say that he'd seen a CPU in simulation choose to evict a partially full LFB (write combiner) rather than choose one that's available and empty. I almost believe him.

Anyway, even an L1 miss (L2 hit) is likely to evict your write combiners. Make sure you only operate on an L1-sized working set at a time; thus batch your operations/updates and use cache pre-warming to make sure you don't blow away your write combiners by fetching into L1 while processing.

Catman
05-15-2002, 02:04 AM
I don't really understand all these hardware things, but...

Yesterday I tested my program on another system with a GeForce4 MX and a motherboard with an Intel chipset (mine has a VIA chipset). I got 5.5M with normal vertex arrays and 7.2M with VAR. So I think the problem is my motherboard. BTW, the GeForce4 system didn't have AGP 4x, so these numbers would be higher on an AGP 4x system.

Evil-Dog
05-15-2002, 06:43 AM
What the....

How the hell do you guys get millions of triangles?
What's the fps? 0.00000000001?
Even with a simple cube I drop to something like 100 fps with a GeForce2 MX.
Is it the VAR that makes the whole difference?
I use display lists for now, but I don't see how I can get millions of triangles running at a good fps!

How do you do that?

--------------------------------------
Evil-Dog
*Let's have a funny day*

Catman
05-15-2002, 10:39 PM
These are million triangles/sec, not /frame. 1.6M triangles/sec = 80000 triangles/frame at 20 fps (on my system).

Evil-Dog
05-16-2002, 07:47 AM
You're right, it's not per frame but per second... my mistake.
But still, 80000 tris/frame...
What's boosting the performance like that?
Vertex arrays?

-----------------------------------------
Evil-Dog
*Let's have a funny day*

Jurjen Katsman
05-16-2002, 08:31 AM
Display lists can do that as well; it shouldn't be a major problem.

Just don't assume that because the cube spins at 100 fps, you can't spin the same cube subdivided into 80000 triangles at 30 fps.

V-man
05-16-2002, 03:26 PM
Originally posted by jwatte:
Write combining is the lowest priority functionality of the LFBs. If a fetch is necessary for anything (including promoting data from L2 to L1) that has higher priority than write combining.

I've heard someone say that he'd seen a CPU in simulation choose to evict a partially full LFB (write combiner) rather than choose one that's available and empty. I almost believe him.

Anyway, even an L1 miss (L2 hit) is likely to evict your write combiners. Make sure you only operate on an L1-sized working set at a time; thus batch your operations/updates and use cache pre-warming to make sure you don't blow away your write combiners by fetching into L1 while processing.

What precisely is write combining, and how does it relate to cache warming and memcpy?

Can you explain the eviction process when a cache miss occurs? A miss means what? While memcpy is executing, or on a task switch?

PS: I don't know much about this stuff and it appears to be important to VAR, so that's why I'm asking.

V-man

tfpsly
05-16-2002, 05:19 PM
Using VAR, I'm able to get this performance on a Duron 800 + GeForce2 GTS:
22.272 fps * 907698 triangles ~ 20.22 MT
The VAR is allocated/copied only once; all my geometry fits into the allocated memory.

But I have never been able to get such performance using display lists. Display lists give me <5 fps instead.

I suppose the difference comes from the fact that I am using many vertex arrays / display lists, whereas using a single display list for the whole scene would improve performance. But that would be too restrictive for my apps.

Correction: sorry, fog was on =)
Without fog, VAR gives me:
25.316 fps * 907698 triangles ~ 22.98 MT

[This message has been edited by tfpsly (edited 05-16-2002).]

Catman
05-16-2002, 09:49 PM
tfpsly, are you using only vertex array, or normal, texcoord, etc. arrays too? And what fps do you get with normal VA?

tfpsly
05-16-2002, 10:26 PM
This was done using one light on a 3ds model that has no texture. I repeat the model (the Capitol) 22 times to get a big number of faces.
With one texture it might be a bit slower, but not by much (I'm not fill-rate limited).

So: vertex + normal arrays (stored in AGP memory; only the indices are sent to the card).

Using normal VA (no VAR), I get only:
5.115 fps * 907698 triangles ~ 4.64 MT

I do not use tri-striping (I could, but stripping is just too slow to compute on this mesh).

stefan
05-17-2002, 12:05 AM
V-man, Intel has some articles covering caches, write combining & AGP memory in a series called "Maximum FPS". Check out
http://cedar.intel.com/cgi-bin/ids.dll/topic.jsp?catCode=CLM

jwatte
05-18-2002, 10:27 AM
The trick to render detailed environments is to not aim for Quake-style frame rates. 30 fps is quite playable in most games, and you can push 100,000 tris/frame at 30 fps using even low-end cards (like an MX 200) if you're careful about fill rate and vertex formats. We use a combination of VAR and display lists.

If you want the insane benchmark style numbers of tris/second, you need to drop your frame rate fairly low, though (although that situation IS getting better!)