glVertexArrayRangeNV performance

Hi. I have two questions.

  1. I want to know whether I should expect improved performance from glVertexArrayRangeNV when used as described below. A chunk of memory is allocated once with wglAllocateMemoryNV. Every frame, glVertexArrayRangeNV is called once. Each frame, the following sequence occurs one or more times:

glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
memcpy(nvArray, systemMemoryArray, numSystemMemoryBytes);
glVertexPointer(3, GL_FLOAT, 0, nvArray);                              // float vertices
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);  // short indices
glFlushVertexArrayRangeNV();
glDisableClientState(GL_VERTEX_ARRAY_RANGE_NV);
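
For reference, the one-time setup I'm describing looks roughly like this (the parameter values are just examples, and the extension entry points are assumed to have already been fetched with wglGetProcAddress):

// One-time setup sketch: allocate the chunk and hand it to the driver.
// A priority below 1.0 typically requests AGP rather than video memory.
nvArray = wglAllocateMemoryNV(numSystemMemoryBytes,
                              0.0f,   // read frequency
                              0.0f,   // write frequency
                              0.5f);  // priority
// ...and then, once per frame, before the sequence above:
glVertexArrayRangeNV(numSystemMemoryBytes, nvArray);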

1b) If there is improved performance with the above, then why would a programmer be required to write the extra code? Why wouldn’t the copy from system memory to AGP memory be done automatically for all code that calls glVertexPointer and glDrawElements (assuming that the hardware supports glVertexArrayRangeNV)?

  2. I read that arrays allocated using wglAllocateMemoryNV have slower access times than system memory arrays, and that a program should keep a duplicate array in system memory if the array is accessed frequently. Is this still true?

Thanks.

Mike

That

memcpy(nvArray, systemMemoryArray, numSystemMemoryBytes);

is not done every frame! Only dynamic data has to be updated. All the static stuff is uploaded once and never again. And dynamic stuff that didn't change since the last frame doesn't have to be uploaded either.

Anyway, I never really experienced any speedup, although I have only static data. However, I think there is an improvement if you have a really large number of vertices (meaning some tens or hundreds of thousands). If you don't have enough vertices, you are not bus-limited :wink:

But from some posts by other users, I have the impression that ARB_vertex_buffer_object (or whatever its exact name is) is faster than NV's range extension anyway. So maybe you should try that out. And it's an ARB extension!
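
If you want to try it, the basic usage is roughly this (assuming the ARB entry points are loaded via wglGetProcAddress; the variable names are just placeholders):

// Rough ARB_vertex_buffer_object sketch - setup, done once:
GLuint vbo;
glGenBuffersARB(1, &vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, numVertexBytes, vertexData,
                GL_STATIC_DRAW_ARB);                 // the driver decides where to keep it

// Per frame:
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)0);   // offset into the bound buffer
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
glDisableClientState(GL_VERTEX_ARRAY);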

So to answer your questions:

  1. By copying it yourself, you control when something is updated and which part is updated (so you don't always re-upload everything, even if 90% didn't change). Also, if you use fences, there is no way the driver could know what you really want to do next. (See the sketch after this list.)

  2. Yes, it's still true. And it will certainly (unfortunately) still be true in 20 years, although I hope it won't.
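
Roughly, here is what I mean by controlling the update yourself, with NV_fence (dynamicOffset, dynamicBytes and dynamicDataChanged are made-up names for whatever tracks your dirty region):

// Sketch: update only the dynamic sub-range of the AGP array, and use an
// NV_fence fence so we don't overwrite data the GPU is still reading.
static GLuint fence = 0;
if (fence == 0)
    glGenFencesNV(1, &fence);
else
    glFinishFenceNV(fence);            // wait until last frame's draw has finished

if (dynamicDataChanged)                // skip the copy entirely if nothing moved
    memcpy((char*)nvArray + dynamicOffset,
           (char*)systemMemoryArray + dynamicOffset,
           dynamicBytes);

glVertexPointer(3, GL_FLOAT, 0, nvArray);
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);  // mark the point we must wait for next time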

Jan.

Hi!

Never use that:

glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
glDisableClientState(GL_VERTEX_ARRAY_RANGE_NV);

Instead use:

glEnableClientState(GL_VERTEX_ARRAY_RANGE_WITHOUT_FLUSH_NV);
glDisableClientState(GL_VERTEX_ARRAY_RANGE_WITHOUT_FLUSH_NV);

Read the NVIDIA NV_vertex_array_range2 extension spec for more info.

See you.

And one more thing: you don't have to enable/disable the vertex array range every frame, except when you mix in glBegin/glEnd and other standard OpenGL immediate-mode calls.

On my computer (800 MHz), not enabling/disabling the array range every frame saves me about 10% of my processing time, so you should try it.
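
A rough sketch of what I mean (reusing the names from the first post):

// Enable the range once at startup and just leave it on.
glVertexArrayRangeNV(numSystemMemoryBytes, nvArray);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

// Only if you also draw something with glBegin/glEnd, bracket that code:
glDisableClientState(GL_VERTEX_ARRAY_RANGE_NV);
// ... glBegin/glEnd drawing ...
glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);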

See you

Originally posted by list67:

memcpy(nvArray, systemMemoryArray, numSystemMemoryBytes);

Do not copy data each frame. If you have to, you’d better use plain vertex arrays - the performance will be the same, but no meddling with extensions.


1b) If there is improved performance with the above, then why would a programmer be required to write the extra code? Why wouldn’t the copy from system memory to AGP memory be done automatically for all code that calls glVertexPointer and glDrawElements (assuming that the hardware supports glVertexArrayRangeNV)?

The driver does exactly this - copies the data to GART memory. This is the only memory an AGP card can access (besides its own video memory). That’s why the above copying does not buy you anything over plain vertex arrays.


2) I read that arrays allocated using wglAllocateMemoryNV have slower access times than system memory arrays, and that a program should have a duplicate array in system memory if the array is accessed frequently. Is this still true?

Yes. GART memory is not cached (because AGP does not provide cache coherence). Thus reads are very slow, and writes should be sequential so that CPU write-combining amortizes the memory access overhead.
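
As a sketch of the access pattern this implies (systemCopy and delta are just hypothetical names): do all read-modify-write work on a normal, cached system-memory copy, then stream the result into the GART memory in one sequential, write-only pass.

// All computation happens on the cached system-memory copy.
for (int i = 0; i < numVertices * 3; ++i)
    systemCopy[i] += delta[i];               // read-modify-write: cheap, it's cached

// One sequential, write-only pass into the uncached GART/AGP memory.
memcpy(nvArray, systemCopy, numVertices * 3 * sizeof(float));

// Writing "nvArray[i] += delta[i]" directly would read from uncached memory
// on every iteration and be painfully slow.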

~velco

Do not copy data each frame. If you have to, you’d better use plain vertex arrays - the performance will be the same, but no meddling with extensions.

Not always true. In a 100% dynamic but multi-pass configuration, you are better off streaming to AGP/video memory and reusing the vertices for the next pass. It saves a lot of bandwidth.
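
Something like this, as a sketch (the texture binds just stand in for whatever per-pass state you set):

// Upload the dynamic vertices once per frame...
memcpy(nvArray, systemMemoryArray, numSystemMemoryBytes);
glVertexPointer(3, GL_FLOAT, 0, nvArray);

// ...then draw them as many times as you have passes. The GPU pulls the
// vertices from AGP/video memory again for each pass; the CPU does not
// have to copy them over the bus a second time.
glBindTexture(GL_TEXTURE_2D, baseTexture);
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);

glBindTexture(GL_TEXTURE_2D, detailTexture);
glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);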

Y.

>>>>Yes. GART memory is not cached (because AGP does not provide cache coherence). Thus reads are very slow, and writes should be sequential so that CPU write-combining amortizes the memory access overhead.

~velco<<<<

This I don't understand. Why the hell is this not cached, and can't it be made cacheable?

Someone said that it could, but I'm not sure if he knew what he was talking about.

Originally posted by V-man:
>>>>Yes. GART memory is not cached (because AGP does not provide cache coherence). Thus reads are very slow, and writes should be sequential so that CPU write-combining amortizes the memory access overhead.

~velco<<<<

This I don't understand. Why the hell is this not cached, and can't it be made cacheable?

Someone said that it could, but I'm not sure if he knew what he was talking about.

It could be made cacheable, of course (it's simply a bit in the PTE), but with disastrous results. GART memory is read and written by both the CPU and the GPU. If they both cache accesses to GART memory, there MUST be a mechanism to determine cache-line ownership - like MESI and its variants on SMP systems. The AGP bus provides no such mechanism.

Further info is in the AGP 2.0 spec, section 2.4, “Platform Dependencies”.

~velco

So if it were cacheable, the GPU and CPU would have to behave like an SMP system. I see.

Well, couldn’t the GPU send a command to the CPU telling it to spit out the cache back to RAM just before it begins reading from AGP memory?

Or perhaps this could be done inside our program.

Originally posted by V-man:
So if it were cacheable, the GPU and CPU would have to behave like an SMP system. I see.

Well, couldn’t the GPU send a command to the CPU telling it to spit out the cache back to RAM just before it begins reading from AGP memory?

Or perhaps this could be done inside our program.

Manually maintained cache coherence? Well, it could be done, in principle. Many common PCI devices work this way, i.e. the CPU explicitly flushes the cache to memory before initiating a bus-master read (a read from the device’s point of view, i.e. the device reading from memory), so the device sees current data. Likewise, when the device writes to memory, the CPU invalidates its own cache so it reads what the device has written, instead of the stale data in its cache.

Dunno why drivers/cards are not implemented that way.

~velco