Compiled Vertex Arrays, GeForce2, Crash !?

Hi !

I’ve been messing around with this for quite some time… I’m working on a terrain engine that uses CVAs. I switched from using normals to specifying vertex colors and doing my own lighting. I read in the nVidia OpenGL Performance FAQ that this special vertex/color/texcoord combination is optimized on the GeForce. That’s right, it’s really much faster when you meet those special conditions…

Everything worked fine on my GeForce1. But then I noticed that the demo crashes on my GeForce2; it simply hangs inside the driver code. I found out that I can avoid this by a.) not locking the arrays or b.) using another vertex format. Strange, eh? The only method that produces this crash is the only method that nVidia suggests as fast!?!

After some time of playing around I found out that the crashes go away when I lock/unlock the arrays every frame. At least I thought this fixed the problem. But then I tried another landscape and the crashes appeared again…

I played around a bit and changed the way my code locks the arrays; here is how it looks now:

void CVertexArray::Lock()
{
////////////////////////////////////////////////////////////////////////
// Lock the whole vertex array. This allows the driver to place the
// buffer in AGP memory and perform optimizations on shared vertices
////////////////////////////////////////////////////////////////////////

#ifdef _DEBUG
assert(m_pfnglLockArraysEXT);
assert(m_iArraySize > 0);
#endif

// Guard in release builds too, since the asserts compile away there
if (!m_pfnglLockArraysEXT || m_iArraySize < 1)
	return;

// I'm a bit unsure about these parameters. The function seems to expect a
// one-based start index, and the vertex count seems to be m_iArraySize
// minus one. The expected 0 / m_iArraySize combination works fine on
// GeForce DDR cards, but crashes on GeForce 2 cards. Maybe a driver bug,
// maybe it's just right?

(*m_pfnglLockArraysEXT)(1, m_iArraySize - 1);
m_bLocked = true;
}

Looks strange, doesn’t it? But it seems to work on both GeForces. I read the specification and couldn’t find any info on zero-based vs. one-based indices and so on…

Is this the right way to do it? Has anyone experienced the same problem? The code above seems to work, but I want to be sure that I have a reliable solution, you know this “bad” feeling :wink:

If you need some more code or vertex-array-related information about my engine, let me know! I’m using the 6.34 drivers, but it also happened on other driver releases and on systems other than my two W2K ones… It always worked on GF1, never on GF2.

Please share your ideas / experiences…

Tim

So, to be perfectly clear…

You’re using CVAs. You’re using the vertex format V3F/C4UB/T2F/T2F. Right?

Works on one system with one card, fails on another system with another card, right? Or is it the same system and two different cards? If it’s two systems, can you isolate whether it’s the system or the card that the problem is specific to? (i.e., swap cards and see which system has the hang, if any)

Goes away when you give a start index of 1 rather than zero, right?

Goes away if you don’t lock, right?

Using the 1 for start and n-1 for count is a bad idea. What this is effectively saying is that your indices will be in the range [1,n-1] rather than [0,n-1]; in fact, if you use an index of 0, the results are undefined.

What OpenGL call, exactly, does the hang occur at? Does it occur on LockArrays, on DrawElements, on UnlockArrays, or on something else?

Now, I can understand that the only case that you’d get the hang would be the fast case – and that would be because you’re going through a different code path. But that code path shouldn’t hang.

GF1 and GF2 should behave almost exactly identically in that code path.

Approximately how many vertices are you using? That is, what is the typical value of m_iArraySize?

Are you getting any multipass reuse, i.e., >1 DrawElements per LockArrays? What are the exact GL calls you use inside the lock?

Can you post more source code or a binary?

Any more system specs? What is the GL_RENDERER string in each case?

By the way, the CVA extension is an extremely poorly written OpenGL extension. It is extremely vague as to some of the semantics of locking – which arrays are locked, which arrays can be modified, etc. So, in general, we are a bit scared of trying to optimize CVAs much because the extension is so easily misinterpreted. In fact, by my preferred interpretation of the extension, the way that Quake 3 uses it is actually illegal! But that’s just my interpretation, and by reading the spec, you’d never know.

  • Matt

“You’re using CVAs. You’re using the vertex format V3F/C4UB/T2F/T2F. Right?”

Yes, this was suggested by the nVidia OpenGL FAQ as fast, and it is significantly faster than ANY other format.

“Works on one system with one card, fails on another system with another card, right?”

GF1 always works, GF2 never. Maybe I’ll try exchanging the cards between my two systems… After all, some friends tested it and they got the same results with their GF1/2 cards, so we can be sure this problem is reproducible and not related to my config.

“Goes away when you give a start index of 1 rather than zero, right?”

Start = 1 / Count = RealCount - 1 works on both cards.
Start = 0 / Count = RealCount ONLY works on GF1 cards.

“Goes away if you don’t lock, right?”

Yes, it also goes away if I use ANY other vertex format

“Using the 1 for start and n-1 for count is a bad idea. What this is effectively saying is that your indices will be in the range [1,n-1] rather than [0,n-1]; in fact, if you use an index of 0, the results are undefined.”

Yes, I know that. But this is the only combination of parameters that worked on all system configs…

“What OpenGL call, exactly, does the hang occur at? Does it occur on LockArrays, on DrawElements, on UnlockArrays, or on something else?”

Random, but it never seems to hang inside my code. This sounds logical, since locking the array doesn’t affect the rest of my code, only what the driver does.

“Now, I can understand that the only case that you’d get the hang would be the fast case – and that would be because you’re going through a different code path. But that code path shouldn’t hang.”

Yeah, it only hangs in the optimized case: CVAs AND the optimized format. So it’s this special optimization.

“GF1 and GF2 should behave almost exactly identically in that code path.”

Yes, at least they should or I hope they do :wink:

“Approximately how many vertices are you using? That is, what is the typical value of m_iArraySize?”

Something between 10000 and 60000.

“Are you getting any multipass reuse, i.e., >1 DrawElements per LockArrays? What are the exact GL calls you use inside the lock?”

Locking/unlocking the arrays before each DrawElements() reduces the crashes a bit, but not entirely…

“Can you post more source code or a binary?”

On my homepage there’s an early version of the engine. Go to http://glvelocity.gamedev.net and download my terrain engine source. This version should not crash since it does not use the special vertex format. Just ignore all the normal stuff and think of the additional color values in the vertex array…

“Any more system specs? What is the GL_RENDERER string in each case?”

I guess it doesn’t really matter. I had GeForce SDR/DDR/2/MX/GTS 64 to test, on 98/ME and 2000.

“By the way, the CVA extension is an extremely poorly written OpenGL extension.”

I guess I’ll disable it anyway since it doesn’t seem to give much speedup. I thought it would help me because I have a hell of a lot of shared vertices and don’t use strips. But the vertex cache seems to handle everything perfectly… I’m just still interested, since I share my code and want it to be correct :wink:

“It is extremely vague as to some of the semantics of locking – which arrays are locked, which arrays can be modified, etc. So, in general, we are a bit scared of trying to optimize CVAs much because the extension is so easily misinterpreted. In fact, by my preferred interpretation of the extension, the way that Quake 3 uses it is actually illegal! But that’s just my interpretation, and by reading the spec, you’d never know.”

I think that optimizing towards Quake is a bad thing. I mean, your app becomes faster just when you do it the way Quake does, even if this way is slower and introduces useless data…

Tim

It’s strange that you’re using 10K-60K vertices and see a hang – we disable these optimizations with numbers of vertices that high, because the buffer needs to be much too large. Are you also using lower numbers some of the time?

I agree that optimizing specifically for one app, be it Quake 3 or anything, is bad, but if you look at the benchmarks people judge us on, you can probably understand the tendency. CVA optimizations will be massively generalized to handle all vertex formats in a future driver version. As such, it’s even more important that it work properly.

  • Matt

Yes, I was already pretty sure it doesn’t give me any performance advantage. And storing 60K vertices on the board isn’t a good idea :wink: The optimized format is pretty OK; most games don’t use OpenGL lighting anyway. The 4ub is because of the 32-bit alignment, isn’t it?

I’ll try to lock smaller portions of the array and draw it in smaller groups of triangles.

I think that OpenGL really needs a better vertex array implementation. It can’t be that a developer has to use all that CVA and nvAllocateMemory (fence and range, too) stuff just to get the same performance as a D3D8 developer. I really think all this should become part of the API…

At least I don’t judge cards based on their Q3 performance. I always bought the best card (=most features, best drivers). And this was always nVidia in the past years.

Tim

Alignment is a function of stride, not of data type. 3ub with a stride of 4 is aligned. Note that in theory, though, it is not safe to do a DWORD read of 3ub data. You could fall off the end of a page and crash.

Our CVA implementation will be a lot better in a future driver release, fixing a lot of these issues, generalizing the formats and such. I can provide more details at a later date, but not now.

A decent VAR/fence implementation will easily outperform both D3D7 and D3D8 on dynamic geometry.

You may not judge us on Q3 performance, but a LOT of people do.

I still want to figure out why this is causing an app hang.

  • Matt

I think that OpenGL really needs a better vertex array implementation. It can’t be that a developer has to use all that CVA and nvAllocateMemory (fence and range, too) stuff just to get the same performance as a D3D8 developer. I really think all this should become part of the API…

But Direct3D 8 doesn’t let you specify different strides and different locations for different arrays; you have to use an FVF and put it all in one buffer. glXxxPointer() is so much easier to work with.

Allocating your buffer with the NVidia extension is really no more work than allocating a Direct3D vertex buffer (actually it’s less work :-); the problem is that no-one else seems to spend much effort on OpenGL.

Anyway, unless you really are geometry upload limited (and I doubt you are – do you have the profile logs to prove it?), using glVertexPointer() and friends on plain malloc()-ed memory will give you very close to optimal results anyway. My guess is the driver copies the data into an AGP buffer for sending to the card, but it may be that the card can do PCI DMA (just less efficiently than AGP). Of course, PCI DMA has concurrency issues that a copy step does not, so for estimation purposes, think “one optimized copy from cached to uncached memory”.

Where was I?

Use glVertexPointer() and friends. It will be fast enough for 96% of uses. If you’re part of the other 4%, use the NV allocator functions if they’re available – but chances are that you will be doing the per-frame copy instead of the driver, and there won’t be a difference after all. Profile!

Note that in theory, though, it is not safe to do a DWORD read of 3ub data. You could fall off the end of a page and crash.

That, in turn, depends on your DWORD read alignment. If you always read aligned DWORDs (a very good idea in general), then it IS safe to read a DWORD even if you only “want” three bytes; any naturally aligned type is guaranteed not to straddle page boundaries (or cache lines, for that matter).

Mmm… natural alignment… <slurp>

Yes, you are right. The only thing that worries me is that you need all those extensions. It’s not that they are so horrible to deal with; it’s the fact that they are not standardized core parts of the API, and they are so important!

I’ll post a new version of my engine in the next hours/days; maybe you could download it and see the crash yourself. I’m not experienced in assembly, so this stuff doesn’t say much to me.

I’m still looking for bugs in my code, but thanks for all your explanations and help, have a happy new year :wink:

Tim

Well, philosophically, the size of the OpenGL spec plus extensions is no bigger than the size of the Direct3D 8.0 API documentation. It’s probably smaller :slight_smile:

In OpenGL, you have to check for extensions and use a fallback (say, malloc() instead of the nVidia allocator). In Direct3D, you have to check for capability bits, and the fallback behaviour is much murkier.

The biggest difference I see (philosophically) is that OpenGL uses more global state, which makes the code more readable but makes threading harder; Direct3D puts most of the state in structures and opaque cookies (handles), which makes threading easier (at least for the API implementor) but your code more voluminous.

Note I’m using “threading” in an abstract sense.

OpenGL is a nice API and has been a good choice for countless years, but I’m a bit worried about its future on the PC platform, since you rely so much more on a good driver, and OpenGL drivers tend to be worse than DirectX ones. A good OpenGL driver is hard to write, and when I see the crap that some other companies release :wink:

Ok, here’s the new source release of my engine, which still has this CVA crash. It seems to be stable without them. There’s a setup dialog so you can enable/disable CVAs:
http://glvelocity.gamedev.net/files/terrainsource.zip

Tim

I found out that the crash is not only related to CVAs; it also occurs without them on both systems, but it takes a long time. Usually it can happen after 10 minutes, but in most cases it takes much longer until the access violation inside NVOGLNT.dll (or similar, don’t remember the exact name :wink: ) occurs. CVAs speed it up on GF2 cards, but it always happens after some time, even without them.

Tim

Originally posted by mcraighead:
[b]CVA optimizations will be massively generalized to handle all vertex formats in a future driver version. As such, it’s even more important that it work properly.

  • Matt[/b]

Is it true now?
I mean, can I use formats other than v3f/t2f/t2f/c4ub and get maximum performance with CVAs?

Thanks

Well,

I had a lot of problems with CVAs too, so I decided to throw them away. The problem is that you have to use CVAs the way Quake does. Otherwise you risk problems with them, like crashes, wrong rendering, etc.