
CVA's and VAR in a huge terrain engine



Adrian
04-20-2001, 03:23 AM
I have a terrain engine with a total poly count of 18 million (3072x3072 grid) split into 576 tiles of 128x128. I am using frustum culling and LOD so at any time no more than 1 million polys are rendered. I get 6-8 Million tris/second. PIII 700 + GF2.

I am using glDrawElements and have vertex arrays totalling 207 MB. I want to speed it up with either CVA or VAR. I tried implementing CVA but got no speed-up. I read somewhere that 1) I have to use a particular vertex array format and 2) for more than 10,000 vertices CVA is disabled by the NVIDIA drivers because the buffer would need to be too large. Is this still the case post Detonator 10?

I then looked into VAR, but I'm not sure I will be allowed to allocate 207 MB of AGP memory. Or will I, since it is just system memory? The other potential problem is that I read that reading the arrays back from AGP memory is very slow. I need to read the vertex arrays to calculate the joins between different LODs and possibly also for collision detection. It was recommended that I keep a copy of the arrays in normal memory, but this would mean using 414 MB, which is too much.

So what is people's opinion on what I should do? Will I be able to use VAR at all with such big arrays? If I use VAR, am I going to see worse performance because I need to read the arrays (assuming I don't keep a local copy)?

Any advice appreciated.

I could of course try this out, but from my experience of trying to implement VAR before, it's not a small job.

Lars
04-20-2001, 03:41 AM
I would say that you should try to reduce the amount of data your grid takes.
Because you have a regular grid, you can easily compute the x and y values of your vertices when needed and store only the height information. This would reduce your data to about 36 MB for the whole grid.

Once you have determined the viewable area, you create only the necessary vertices and copy them into the VAR (which can be of constant size, because you say you have a limit of 1 million tris at all times).

This should be much more efficient than storing all of the vertices all the time: not everyone has more than 256 MB of RAM, and I think you want some textures too :)

hope this helps
Lars
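
A rough sketch of the heights-only idea from this post, in C. It assumes one 4-byte float per grid point (which is what gives roughly 36 MB for a 3072x3072 grid) and uses y as the height axis with x and z as the grid axes, as the code posted later in the thread does. The type and function names (TerrainVertex, heightmap_bytes, build_tile_vertices) are made up for illustration and are not from the original engine.

```c
#include <assert.h>
#include <stddef.h>

/* One interleaved vertex: position (x, y, z) plus texture coords (u, v). */
typedef struct {
    float x, y, z;
    float u, v;
} TerrainVertex;

/* Bytes needed when only one float height per grid point is stored. */
size_t heightmap_bytes(size_t grid_size)
{
    return grid_size * grid_size * sizeof(float);
}

/*
 * Rebuild the vertices of one tile from the height map.
 * x and z follow from the grid indices, so only the height is looked up.
 * `step` is the LOD stride (1 = full detail). Returns the vertex count.
 */
size_t build_tile_vertices(const float *heights, size_t grid_size,
                           size_t tile_x, size_t tile_z,
                           size_t tile_size, size_t step,
                           float height_scale, float tex_scale,
                           TerrainVertex *out)
{
    size_t n = 0;
    assert(step > 0);
    for (size_t i = 0; i < tile_size; i += step) {
        size_t gx = tile_x * tile_size + i;
        for (size_t j = 0; j < tile_size; j += step) {
            size_t gz = tile_z * tile_size + j;
            TerrainVertex *v = &out[n++];
            v->x = (float)gx;
            v->y = heights[gx * grid_size + gz] * height_scale;
            v->z = (float)gz;
            v->u = (float)gx * tex_scale;
            v->v = (float)gz * tex_scale;
        }
    }
    return n;
}
```

Only the tiles that survive culling would be rebuilt each frame, so the output buffer only needs room for the visible vertices rather than the whole grid.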

WhatEver
04-20-2001, 11:35 AM
> The other potential problem is that I read that reading the arrays from AGP is very very slow.

I heard that AGP memory should be about the same speed as video memory, depending on the AGP bus speed. I got that info from nvidia.com. Look in the developer section where they have all the samples. It's the one with this picture (http://www.nvidia.com/Marketing/Developer/DevRel.nsf/Lookup/VAR_75/$file/VAR_75.gif).

Adrian
04-21-2001, 03:07 PM
Thanks Lars, I changed my engine to set up the arrays for each tile just as it is being rendered. This saved a lot of memory and there was only a little slowdown. I then tried implementing VAR, and performance went from 7M tris/sec to 4M. It took me ages to figure out that I had to call the flush function or use fences, otherwise I got garbage on screen, and also that I couldn't leave my colour array outside the VAR. Anyway, I played around with adding more fence buffers and got no speed-up, so I've given up. I am not writing sequentially to the AGP memory (because of LOD), so that could be the reason.

WhatEver, when they say AGP and video memory are the same speed, I think they mean they render at the same speed. I meant that if you need to read the vertex data back from AGP memory into the program, it is slow.

HFAFiend
04-21-2001, 06:25 PM
AGP memory might be the same speed if you use DDR RAM and have an old video card, but just looking at bandwidth, a normal video card (GeForce DDR or above) has much more bandwidth, I believe (133 MHz vs 200 MHz DDR for a normal video card).

Korval
04-21-2001, 07:06 PM
I would suggest 2 speed improvements for VARs:

1: Do everything you can do to ensure that you are writing sequentially. You're clever enough to figure out a way to do that. It is vital to getting any kind of performance out of dynamic VARs.

2: I'm not sure what you mean when you say that you couldn't have your color array as a non-VAR. All of your arrays should be in VAR memory. If they aren't, I wouldn't expect to see much in the way of speed increase.

As for AGP being as fast as Video memory, I can't say. But, certainly AGP is faster if your arrays are dynamic (writing to video memory multiple times per frame will kill performance).
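
One possible way to get the sequential writes from point 1, sketched in C: instead of leaving gaps in the buffer for the vertices that a coarser LOD skips, pack the surviving vertices back to back and remap the indices to the packed layout. This is only an illustration under assumptions. The function names are hypothetical, the height array is taken to be tile-local, and the 5-float layout (x, y, z, u, v) mirrors the code posted later in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define FLOATS_PER_VERTEX 5   /* x, y, z, u, v */

/* Number of vertices along one tile edge at a given LOD step. */
size_t verts_per_edge(size_t tile_size, size_t step)
{
    assert(step > 0 && tile_size % step == 0);
    return tile_size / step;
}

/* Index of grid point (i, j) in the packed buffer.
   i and j must be multiples of step. */
unsigned short packed_index(size_t i, size_t j, size_t tile_size, size_t step)
{
    return (unsigned short)((i / step) * verts_per_edge(tile_size, step)
                            + (j / step));
}

/* Write the vertices of one tile with strictly increasing addresses
   and no gaps. Returns the number of floats written. */
size_t write_tile_packed(float *dst, const float *heights,
                         size_t tile_size, size_t step, float tex_scale)
{
    float *p = dst;
    for (size_t i = 0; i < tile_size; i += step) {
        for (size_t j = 0; j < tile_size; j += step) {
            *p++ = (float)i;                    /* x */
            *p++ = heights[i * tile_size + j];  /* y (height) */
            *p++ = (float)j;                    /* z */
            *p++ = (float)j * tex_scale;        /* u */
            *p++ = (float)i * tex_scale;        /* v */
        }
    }
    return (size_t)(p - dst);
}
```

The cost is that each LOD level needs its own index list built with packed_index, rather than one list shared by all levels.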

jwatte
04-21-2001, 07:08 PM
VAR and AGP memory are somewhat orthogonal.

You can allocate memory with malloc() and still see some speed-up with VAR. VAR allows the driver to lock your virtual address range into physical memory and build a scatter/gather table (among other things) even if it lives in regular memory. This is, of course, assuming that you have enough physical memory free to actually lock your entire vertex array range...

Allocating memory out of the AGP pool and making that the VAR adds additional benefits if you can write to the AGP memory correctly. However, it's not necessary to use AllocateMemory just to use VAR.

Some popular VIA chip sets or BIOSes don't allow you to set the AGP aperture to larger than 64 megabytes, and typically come with a default of less than that, so trying to get 207 MB of AGP memory would be... aggressive. Even assuming you have 384 MB or more of physical RAM to fit it all in.

Now, 6 million triangles per second is about all you can get on current GeForces when you're using things like fogging, lighting, texturing, Z-test and update, and possibly vertex colors. Thus, you're possibly already close to the maximum performance for your case (depending on which features you're using). The 15/20/30 MTPS numbers quoted in advertising are still achievable, if you turn off pretty much every nice feature that you would want to use in real life :-) (at least that's been my experience, and it makes sense)

I have not timed a GF3, so the effective throughput there might be better.

[This message has been edited by jwatte (edited 04-22-2001).]

Adrian
04-23-2001, 12:25 PM
I have lighting switched off, so I would expect close to 18M tris per second (the same as Benmark). I would certainly expect more than 10.

With VAR implemented the same way as in the learning VAR project (as far as I can see), I get a drop of 1 million tris per second, to about 4.5M tris/sec.

Here's my code, can you spot what is wrong?


#define BUFFER_SIZE 800000
int NUM_BUFFERS = 8;

....

initialisation

big_array = (GLfloat *)wglAllocateMemoryNV(BUFFER_SIZE*sizeof(GLfloat), 0, 0, 0.5f);

printf("\nAllocated VAR\n");

for (i=0; i < NUM_BUFFERS; i++)
{
    NVbuffer[i] = big_array + (i * BUFFER_SIZE / NUM_BUFFERS);
}

glVertexArrayRangeNV(BUFFER_SIZE*sizeof(GLfloat), big_array);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
glGenFencesNV(1, fence[0]);
glGenFencesNV(1, fence[1]);
glGenFencesNV(1, fence[2]);
glGenFencesNV(1, fence[3]);
glGenFencesNV(1, fence[4]);
glGenFencesNV(1, fence[5]);
glGenFencesNV(1, fence[6]);
glGenFencesNV(1, fence[7]);
......

main rendering loop

// Loop through each tile in terrain
for (l=0;l<NO_TILES;l++)
{
    for (k=0;k<NO_TILES;k++)
    {
        // Render this tile
        if (DrawTile[l][k]>0)
        {
            curr = k%NUM_BUFFERS;
            if(!glTestFenceNV(fence[curr]))
            {
                glFinishFenceNV(fence[curr]);
            }

            pMeshV2 = NVbuffer[curr];
            glVertexPointer(3, GL_FLOAT, 5 * sizeof(GLfloat), pMeshV2);
            glTexCoordPointer(2,GL_FLOAT,5 * sizeof(GLfloat), pMeshV2+3);

            i7=Tile_Step[l][k],i9=TILE_SIZE-i7,i5=k<<7,i3=l<<7,ftemp2=(float)i3*fTexScaleX,i10=(5*(i7-1));

            for (j=0;j<TILE_SIZE;j+=i7)
            {
                i6=i5+j;

                *pMeshV2++=i3;
                *pMeshV2++=(float)mesh_height[i3][i6]*HeightScaling;
                *pMeshV2++=i6;
                *pMeshV2++=(float)i6*fTexScaleY;
                *pMeshV2++=ftemp2;
                pMeshV2+=i10;
            }
            pMeshV2+=((i7-1)*640);

            // Number of triangles in each column
            i4=256/(float)i7;
            // Draw each column of Tile
            for (i = 0; i <i9; i += i7)
            {
                i2=i+i7,i3=(l<<7)+i2,ftemp2=(float)i3*fTexScaleX;

                for (j=0;j<TILE_SIZE;j+=i7)
                {
                    i6=i5+j;

                    *pMeshV2++=i3;
                    *pMeshV2++=(float)mesh_height[i3][i6]*HeightScaling;
                    *pMeshV2++=i6;
                    *pMeshV2++=(float)i6*fTexScaleY;
                    *pMeshV2++=ftemp2;
                    pMeshV2+=i10;

                }
                pMeshV2+=((i7-1)*640);

                glDrawElements(GLMode,i4,GL_UNSIGNED_SHORT,&mesh_indices[l][k][i]);
            }
            glSetFenceNV(fence[curr], GL_ALL_COMPLETED_NV);
        }
    }
}


[This message has been edited by Adrian (edited 04-23-2001).]

MarcusL
04-23-2001, 02:07 PM
About speeds: I've found it very useful to time my code, measuring all my main code and then how much time the SwapBuffers call takes.

If most of your time is in SwapBuffers, you're using VAR correctly and you can add some more funky physics to your game (calculated between rendering and SwapBuffers). If almost no time is in SwapBuffers, then you're killing performance somewhere in your rendering code.

Our engine currently spends 2 ms/frame calling GL and 7 ms/frame waiting for the card to finish (plus 1-2 ms of physics) :)

But we haven't scaled things up yet, with a lot of objects/terrain/effects/ai etc.
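
The measurement described in this post could be sketched roughly as follows in C. This is a minimal illustration, not the poster's code: now_ms uses the C11 timespec_get call as a portable clock (on Windows a high-resolution timer such as QueryPerformanceCounter would be the usual choice), and the two callbacks are hypothetical stand-ins for "issue all GL calls" and "call SwapBuffers".

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in milliseconds (C11 timespec_get). */
double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

typedef struct {
    double submit_ms;  /* time spent issuing rendering calls */
    double swap_ms;    /* time spent inside the buffer swap */
} FrameTiming;

/* Time the two phases of one frame. Both callbacks are placeholders
   supplied by the caller. */
FrameTiming time_frame(void (*submit)(void), void (*swap)(void))
{
    FrameTiming t;
    assert(submit != 0 && swap != 0);
    double t0 = now_ms();
    submit();
    double t1 = now_ms();
    swap();
    double t2 = now_ms();
    t.submit_ms = t1 - t0;
    t.swap_ms = t2 - t1;
    return t;
}

/* Fraction of the measured frame time spent in the swap.
   A high value suggests the CPU is waiting on the card. */
double swap_fraction(FrameTiming t)
{
    double total = t.submit_ms + t.swap_ms;
    return total > 0.0 ? t.swap_ms / total : 0.0;
}
```

With the figures quoted above (2 ms submitting, 7 ms waiting), swap_fraction comes out at about 0.78, which is the "most of your time is in SwapBuffers" case.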

jwatte
04-23-2001, 07:12 PM
> I have lighting switched off so I would
> expect close to 18 M/Tris per second (The
> same as Benmark) I would certainly expect
> more than 10.

Really? Did you switch off Z testing and writing, too? That takes a big cut out of the triangle throughput. Also, I don't see what version of the GF2 you're using; the GF2MX is only rated at 15 Mtri/s (or was that the GF2Go? both?)

Also, if your triangles are big, and/or you draw with blending, you'll probably be fill rate bound rather than transform bound. The GF2 doesn't fill all that much faster than a TNT2 Ultra; it just lets you throw many more triangles at the card in any given time assuming you don't hit the fill rate limits, which is easier the more triangles you're using...

Last, try timing your program looking exactly the same, just removing the call to DrawElements. (QueryPerformanceCounter()/QueryPerformanceFrequency() is a good place to start) Perhaps your heightfield-generation code is thrashing the cache or causing other non-optimal behaviour.

Something which I can see right now is that you're writing 20 bytes at a time, which is a pretty poor number, as that'll cause many partial evictions. Try adding 3 useless floats to each "vertex", making the stride 32, and make sure you write some data to each of these "useless" floats right where you write the useful data. Once you reach this point, "VTune" starts to look like a good investment.

PS: If it's an Athlon rather than a P-3, cache lines are 64 bytes, so more trickery is needed.
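
The padding suggestion could look something like this in C. It is a sketch only: PaddedVertex and write_vertex are invented names, and the 32-byte target assumes 4-byte floats and the P-3 line size mentioned above (an Athlon would need 64 bytes).

```c
#include <assert.h>
#include <stddef.h>

/* 5 useful floats plus 3 padding floats = 32 bytes per vertex. */
typedef struct {
    float x, y, z;   /* position */
    float u, v;      /* texture coordinates */
    float pad[3];    /* unused by GL, written anyway to fill the line */
} PaddedVertex;

/* C11 compile-time check that the layout really is 32 bytes. */
_Static_assert(sizeof(PaddedVertex) == 32, "vertex must be 32 bytes");

/* Write every float of the vertex, padding included, in address order.
   Returns a pointer to the next vertex slot. */
PaddedVertex *write_vertex(PaddedVertex *dst,
                           float x, float y, float z, float u, float v)
{
    dst->x = x;
    dst->y = y;
    dst->z = z;
    dst->u = u;
    dst->v = v;
    dst->pad[0] = 0.0f;
    dst->pad[1] = 0.0f;
    dst->pad[2] = 0.0f;
    return dst + 1;
}
```

The stride passed to glVertexPointer and glTexCoordPointer would then be sizeof(PaddedVertex) instead of 5 * sizeof(GLfloat), with the texture coordinates still starting 3 floats into each vertex.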

harsman
04-24-2001, 12:01 AM
FYI, the GF2MX is rated at 20 million triangles/sec; it's the GF2 Go that's rated at 15, like you said.

mcraighead
04-24-2001, 08:32 AM
Z has no impact on triangle throughput...

AGP memory is uncached, so it does not do any good to mess around with cache behavior.

- Matt

Lars
04-24-2001, 12:21 PM
You should use a profiler to find out where the performance goes. If it points to nvopengl.dll, then it is something in the rendering code, but maybe something else in the program is eating performance.

To test the achievable performance, you can draw one giant static grid, once with VAR and once without.

Lars

Adrian
04-24-2001, 12:49 PM
I renamed my main file from .c to .cpp and the compiler told me I needed to pass a pointer into glGenFencesNV, doh! I now get 12M tris/sec with VAR and 5.5M without. That's more like it! :)
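
For readers following along, the fix amounts to passing the address of the fence storage rather than its current value. glGenFencesNV takes a count and a pointer (GLsizei n, GLuint *fences), so the posted calls `glGenFencesNV(1, fence[0])` hand it an uninitialised integer where an address is expected. The sketch below shows the corrected call pattern using a stub, since real GL calls need a context. gen_fences_stub and init_fences are hypothetical stand-ins, not part of any driver API.

```c
#include <assert.h>

#define NUM_BUFFERS 8

/* Hypothetical stand-in for glGenFencesNV(GLsizei n, GLuint *fences):
   writes n fresh, non-zero names through the pointer it is given. */
void gen_fences_stub(int n, unsigned int *fences)
{
    static unsigned int next_name = 1;
    assert(fences != 0);
    for (int i = 0; i < n; i++) {
        fences[i] = next_name++;
    }
}

/* Corrected pattern: pass the array itself (or &fence[i] for a single
   element), never the element's value. One call fills all names. */
void init_fences(unsigned int fence[NUM_BUFFERS])
{
    gen_fences_stub(NUM_BUFFERS, fence);
}
```

With the real extension the equivalent would be a single `glGenFencesNV(NUM_BUFFERS, fence);`, or `glGenFencesNV(1, &fence[i]);` once per element.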

Thanks for everyone's suggestions.

btw fog seems to make a significant difference to T&L performance, about 10% for me.

jwatte
04-24-2001, 02:23 PM
> Z has no impact on triangle throughput...

If you're only rendering back-facing triangles (that get culled) this is true. If you're actually rendering triangles that render to the frame buffer, it certainly cuts down the maximum triangle rate, because you hit your fill rate limit that much faster. I agree it doesn't affect the "rated triangle throughput" of the card; it does however affect most real-life uses of the card (probably including this poster).

> AGP memory is uncached, so it does not do
> any good to mess around with cache behavior.

That would be true, except the size of a cache line is a very good hint for the size of "whatever it is" that buffers and write-combines outgoing write transactions (LFBs in the P-3). Thus, you want to write a cache line's worth of data, aligned on that same size boundary, in one write transaction (i.e. as fast as you can without hitting any other external memory).
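
The alignment half of this advice can be checked with a couple of small helpers. This is a sketch in C with invented names (align_up, is_aligned); the line size is a parameter so that 32 (P-3) or 64 (Athlon) can be passed, and it must be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a pointer up to the next multiple of `line`.
   `line` must be a non-zero power of two. */
void *align_up(void *p, size_t line)
{
    uintptr_t a = (uintptr_t)p;
    assert(line != 0 && (line & (line - 1)) == 0);
    return (void *)((a + (line - 1)) & ~(uintptr_t)(line - 1));
}

/* Non-zero if p sits on a `line`-byte boundary (line a power of two). */
int is_aligned(const void *p, size_t line)
{
    return ((uintptr_t)p & (line - 1)) == 0;
}
```

A write pointer into the vertex buffer could be passed through align_up once at the start of each tile, so that every 32-byte vertex written afterwards begins on a line boundary.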

[This message has been edited by jwatte (edited 04-24-2001).]

MarcusL
04-24-2001, 02:31 PM
Originally posted by Adrian:
btw fog seems to make a significant difference to T&L performance, about 10% for me.

Yup, does for me too (GeForce DDR). That's odd, I thought those calcs had been 'for free' since the Voodoo1.

mcraighead
04-24-2001, 04:07 PM
Per-pixel fog factor computation and fog blending is free, but fog also involves per-vertex calculations (the computation of the fog coordinate at each vertex).

- Matt
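
To make the per-vertex part concrete, here is a rough C sketch of the two steps. It assumes the common approximation where the fog coordinate is the absolute eye-space depth, a column-major 4x4 modelview matrix as OpenGL stores it, and the GL_LINEAR fog equation (end - c) / (end - start) clamped to [0, 1]. The function names are made up, and real hardware may compute the coordinate differently.

```c
#include <assert.h>

/* Per-vertex step: eye-space depth of the vertex, used as the fog
   coordinate. m is a column-major 4x4 modelview matrix. */
float fog_coord(const float m[16], float x, float y, float z)
{
    float ze = m[2] * x + m[6] * y + m[10] * z + m[14];
    return ze < 0.0f ? -ze : ze;
}

/* Clamp a value to the range [0, 1]. */
float clamp01(float f)
{
    if (f < 0.0f) return 0.0f;
    if (f > 1.0f) return 1.0f;
    return f;
}

/* Fog factor for linear fog: 1 = no fog, 0 = fully fogged. */
float fog_linear(float c, float start, float end)
{
    assert(end != start);
    return clamp01((end - c) / (end - start));
}
```

The matrix multiply in fog_coord is the extra work done for every vertex; the factor itself is the part described above as free.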

MarcusL
04-25-2001, 02:46 AM
Originally posted by mcraighead:
Per-pixel fog factor computation and fog blending is free, but fog also involves per-vertex calculations (the computation of the fog coordinate at each vertex).


D'oh! .. of course. Nevertheless, the GeForce with all its T&L pipelined glory should be able to figure this out quite speedily, but that's probably fixed in the GeForce 3. (gimme gimme gimme! :) )