VAR memory manager

Hi, I was reading this article http://www.codeproject.com/tips/optimizationenemy.asp
where Joseph M. Newcomer says that

Back in the early days of C, the C storage allocator was one of the worst-performing storage allocators in existence. It was “first fit”, which meant that the way it worked was the allocator went traipsing down the free list, looking for a block at least as big as the one requested, and if it found one, it split it and returned the residue to the free list. This had the advantages of being as slow as possible and fragmenting memory as badly as possible. In fact, it was worse than you can imagine. It actually walked the list of all blocks of storage, free and allocated, and had to ignore the allocated blocks. So as you got more and more blocks, its performance degraded, and as the blocks got too small to be usable, they simply added to the overhead without adding to the value.
and
if you use the brain-dead Unix allocator, you’re bound to have performance problems. A decent storage allocator makes this a non-issue

I use this kind of memory allocator for VAR. Is there a good open source memory allocator available? What are you using to manage VAR memory?
Has anyone ported one of these memory managers to VAR? http://www.cs.colorado.edu/~zorn/Malloc.html

A first-fit allocator is usually enough for graphics (I wrote a first-fit allocator with adjacent-block merging). It all depends on how you use it (and what assumptions you can make about your data). Most of my allocations are done at scene/level load time (at that point the free list contains one block: the memory returned by wglAllocateMemoryNV or malloc).
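For reference, a minimal sketch of what such a first-fit allocator with adjacent-block merging can look like (all names and the arena size are made up; real VAR code would run this over the block returned by wglAllocateMemoryNV instead of a static array):

```c
#include <stddef.h>

/* Minimal first-fit allocator over a fixed arena, with adjacent-block
 * merging on free. Illustrative sketch only: headers, alignment and
 * error handling are kept as simple as possible. */

typedef struct Block {
    size_t size;          /* payload size in bytes             */
    struct Block *next;   /* next free block, address-ordered  */
} Block;

#define ARENA_SIZE (64 * 1024)
static _Alignas(16) unsigned char arena[ARENA_SIZE];
static Block *free_list;

static void pool_init(void)
{
    free_list = (Block *)arena;
    free_list->size = ARENA_SIZE - sizeof(Block);
    free_list->next = NULL;
}

static void *pool_alloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;       /* keep blocks 16-byte aligned */
    Block **link = &free_list;
    for (Block *b = free_list; b; link = &b->next, b = b->next) {
        if (b->size < size)
            continue;                       /* first fit: keep walking   */
        if (b->size >= size + sizeof(Block) + 16) {
            /* Split: carve the request off the front, return the
             * residue to the free list in the same position.           */
            Block *rest = (Block *)((unsigned char *)(b + 1) + size);
            rest->size = b->size - size - sizeof(Block);
            rest->next = b->next;
            b->size = size;
            *link = rest;
        } else {
            *link = b->next;                /* hand out the whole block  */
        }
        return b + 1;
    }
    return NULL;                            /* out of arena memory       */
}

static void pool_free(void *p)
{
    Block *b = (Block *)p - 1;
    Block **link = &free_list;
    while (*link && *link < b)              /* keep list address-ordered */
        link = &(*link)->next;
    b->next = *link;
    *link = b;
    /* Merge with the following block if adjacent. */
    if (b->next && (unsigned char *)(b + 1) + b->size == (unsigned char *)b->next) {
        b->size += sizeof(Block) + b->next->size;
        b->next = b->next->next;
    }
    /* Merge with the preceding block if adjacent. */
    if (link != &free_list) {
        Block *prev = (Block *)((unsigned char *)link - offsetof(Block, next));
        if ((unsigned char *)(prev + 1) + prev->size == (unsigned char *)b) {
            prev->size += sizeof(Block) + b->size;
            prev->next = b->next;
        }
    }
}
```

Keeping the free list address-ordered is what makes the coalescing cheap: both merge candidates are the list neighbours of the block being freed.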

If you do a lot of allocs/frees during a walkthrough, then perhaps quick-fit will perform better.

I don’t know of any free allocators, though. A simple one shouldn’t take too long to write and debug.

I’m wondering what strategies you use when you simply run out of memory. The memory being AGP or even video prevents a typical LRU/MRU scheme because writing is so slow – is it better to just forget about the whole VAR thing when you’ve run out of optimal memory to contain the whole scene?

Well, you could either reuse VAR memory using fences, or simply disable VAR and fall back to system memory.

-Lev

Writing to AGP memory is not “so slow”. It’s as fast as a regular memcpy() as long as you copy entire blocks with no gaps. Sometimes, it’s faster to write to AGP than to regular memory with memcpy(), because there’s no cache to get polluted and get in the way.

I manage my AGP as one part static data, and one part dynamic data. The dynamic data is just a simple cyclic FIFO with double-buffering. The static data is managed on a fairly granular level (on the order of 4 kB per block) and a separate “block belongs to allocation ID” vector in system memory (hey, it works!).
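A sketch of what such a cyclic FIFO can look like (illustrative only: the backing store here is a plain array, where the real thing would be AGP memory from wglAllocateMemoryNV, and a wrap would first wait on a fence to be sure the GPU is done with that region):

```c
#include <stddef.h>

/* Sketch of a cyclic FIFO sub-allocator for dynamic vertex data,
 * double-buffered per frame. All names are made up for illustration. */
typedef struct {
    unsigned char *base;
    size_t         size;            /* total bytes in the ring          */
    size_t         head;            /* next write offset                */
    size_t         frame_start[2];  /* where each frame's data began    */
    int            frame;           /* which half of the double buffer  */
} VarFifo;

static void fifo_begin_frame(VarFifo *f)
{
    f->frame ^= 1;
    f->frame_start[f->frame] = f->head;
}

/* Grab `bytes` of transient memory; wraps to the start of the ring when
 * the end is reached. Returns NULL only if a single request is bigger
 * than the whole ring. */
static void *fifo_alloc(VarFifo *f, size_t bytes)
{
    if (bytes > f->size)
        return NULL;
    if (f->head + bytes > f->size)
        f->head = 0;    /* wrap: real code would fence-sync here */
    void *p = f->base + f->head;
    f->head += bytes;
    return p;
}
```

The double-buffering means the previous frame's data is still intact while the current frame is being filled, so the GPU can keep pulling from it.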

When I run out of static pool memory, I will deny the allocation, which means that the client of the memory system will have to start purging less important stuff to fit more important stuff into memory. A slightly fancier version would allocate system memory instead, and then when time comes to draw the geometry, copy system memory into transient FIFO memory. You can then start doing modified LRU or something if you want to go bonkers, but I haven’t had the need yet.

Yipes! That’s interesting – your comment led me to actually do some tests, and it turns out you’re right.

Copying 2048000 bytes 1000 times (after 100 “warming up” copies) takes 10.245 seconds between system memory (32-byte aligned) and 10.865 seconds from system memory to video memory (allocated via 0, 0, 1; and it doesn’t show up in the task manager, so I believe it really is vidmem) – i.e. 190.64 MB/sec vs 179.76 MB/sec. GF1, Win2K. So that would be about 5% slower, not nearly as bad as I feared.
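(Those MB/sec figures are just total bytes moved over wall-clock time, with 1 MB = 1048576 bytes; a quick check of the arithmetic:)

```c
/* MB/sec as used above: total bytes moved divided by wall-clock time,
 * with 1 MB = 1048576 bytes. */
static double throughput_mb_per_s(double bytes_per_copy, double copies,
                                  double seconds)
{
    return bytes_per_copy * copies / (seconds * 1024.0 * 1024.0);
}

/* throughput_mb_per_s(2048000, 1000, 10.245) -> ~190.64  (system->system)
 * throughput_mb_per_s(2048000, 1000, 10.865) -> ~179.76  (system->video) */
```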

I was wondering about the out-of-mem case because I’m curious how the D3D VBs are implemented – would they also be allocating as much memory as possible at startup and then just handing it out until it fails ?

Anyway, thanks for the tips, gonna give memory management a shot now. Haven’t done this since Glide 2 =)

Originally posted by bpeers:

Copying 2048000 bytes 1000 times (after 100 “warming up” copies) takes 10.245 seconds between system memory (32-byte aligned) and 10.865 seconds from system memory to video memory (allocated via 0, 0, 1; and it doesn’t show up in the task manager, so I believe it really is vidmem) – i.e. 190.64 MB/sec vs 179.76 MB/sec. GF1, Win2K. So that would be about 5% slower, not nearly as bad as I feared.

Actually, I found copying to video memory significantly faster in some cases, because you get more or less the full bandwidth of both memories.

An example: with AGP 4x, fastwrites enabled, I can write with >780MB/s to video memory, sustained, even if the GPU is currently drawing stuff (i.e., pulling vertices out of its own memory). For AGP memory, on the other hand, I can achieve >900MB/s if the GPU is idle, but as soon as I start rendering, this drops to about 500MB/s, because the GPU is concurrently pulling vertices from AGP memory, and the total theoretical bandwidth of AGP 4x is 1GB/s… But even on my BX home machine with AGP 2x, I found writing to video memory comparable if not faster than AGP, even though this system does not have fastwrites.

On a system with lower memory performance, a good memory manager becomes really important. On the BX machine with a Celeron 433, I can get >22MTriangles/s with texturing and one infinite light source on a Geforce 2 and storing the geometry in video memory. This is even faster than the Geforce3 Ti500 at work…

Another issue is that interleaving data can make a difference. In some cases, I had a significant speedup when going from vertices/normals/texcoords stored consecutively to an interleaved format.
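To make the interleaving point concrete, here is a sketch of such a layout (field names are illustrative): all the attributes of one vertex sit in one contiguous 32-byte record, so the GPU fetches each vertex from a single place instead of three distant arrays.

```c
#include <stddef.h>

/* Interleaved vertex layout: position, normal and texcoord of a vertex
 * are adjacent in memory. With separate arrays, the same data would
 * live in three widely separated blocks. */
typedef struct {
    float pos[3];     /* byte offset  0 */
    float normal[3];  /* byte offset 12 */
    float uv[2];      /* byte offset 24 */
} Vertex;             /* stride = sizeof(Vertex) = 32 bytes */

/* The gl*Pointer calls then all share the same stride, e.g.:
 *   glVertexPointer  (3, GL_FLOAT, sizeof(Vertex), &v[0].pos);
 *   glNormalPointer  (   GL_FLOAT, sizeof(Vertex), &v[0].normal);
 *   glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex), &v[0].uv);
 */
```

Note that a 32-byte stride also stays comfortably under the "less than 256" stride guideline quoted from the NVIDIA FAQ later in this thread.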

BTW, it seems wglAllocateMemoryNV will drop you back to AGP memory if you request too much video memory. On the GeForce 2 with 32MB, I only get video memory if I request less than 16MB, although I can request up to 32MB. I can verify this easily because on this system, going from video memory to AGP drops me from 22MTri/s to 11MTri/s, because AGP is too slow…

I have been playing around with VAR for a while now, and I must say you learn something new each day, it’s almost a science by itself…

Michael

For AGP memory, on the other hand, I can achieve >900MB/s if the GPU is idle,

Holy crap! What kind of performance do you get for regular system-to-system-memory memcpys? 900 MB/s compared to my 180-190 is too suspicious. I’m using an AMD 1.4 GHz with 256 MB DDR RAM – do I need to tweak the BIOS or otherwise enable something special here to get those kinds of figures?

On the Geforce 2 with 32MB, I only get video memory if I request less than 16MB, although I can request up to 32MB. I can verify this easily because on this system, going from video memory to agp drops me from 22MTri/s to 11MTri/s because agp is too slow…

Are there official numbers (read/write/priority values) for requesting AGP memory? I couldn’t get anything out of wglAlloc if not using (0, 0, 1), so maybe there is something broken with my AGP/memory (I was asking for about 2 MB).
Also, on your GF2, what resolution and bpp did you use to get those 22MTris?
I’m very interested in these reference numbers because comparing them with others is the only way to tell whether you’re getting the optimal path :\

Originally posted by bpeers:
Are there official numbers (read/write/priority numbers) for requesting AGP memory ?

Read more FAQs:

Quote from the NVIDIA OGL Performance FAQ:
Currently, you should only use the vertex_array_range with memory allocated by wglAllocateMemoryNV (or glXAllocateMemoryNV) given the following settings:

Memory Allocated   ReadFrequency   WriteFrequency   Priority
AGP Memory         [0, .25)        [0, .25)         (.25, .75]
Video Memory       [0, .25)        [0, .25)         (.75, 1]

All other settings will yield relatively poor performance.

Use video memory sparingly, and only for static geometry. You may use AGP memory for dynamic geometry, but write your data to these buffers sequentially to maximize memory bandwidth (it is uncached memory, and sequential writing is essential to take advantage of the write combiners within the CPU that batch up multiple writes into a single, efficient block write).

And being uncached, read access will be very, very slow - it may be best to keep two buffers, one allocated by standard malloc for general R/W access and the other allocated by wglAllocateMemoryNV that is only written to - synchronization would copy data from the R/W buffer sequentially into the AGP memory.

Keep the vertex array strides to a reasonable length (less than 256), and mind the necessary alignment restrictions in the extension specification.
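The half-open intervals in that table are easy to misread, so here is a small illustrative helper (not part of any NVIDIA API) that checks a (readFrequency, writeFrequency, priority) triple against it:

```c
/* Illustrative helper only: classify the (readFrequency, writeFrequency,
 * priority) triple passed to wglAllocateMemoryNV against the table from
 * the NVIDIA performance FAQ. Interval notation: [0,.25) includes 0 and
 * excludes .25; (.75,1] excludes .75 and includes 1. */
typedef enum { MEM_BAD, MEM_AGP, MEM_VIDEO } MemClass;

static MemClass classify(float readf, float writef, float priority)
{
    int freq_ok = readf  >= 0.0f && readf  < 0.25f &&
                  writef >= 0.0f && writef < 0.25f;
    if (!freq_ok)
        return MEM_BAD;            /* outside both rows of the table */
    if (priority > 0.75f && priority <= 1.0f)
        return MEM_VIDEO;
    if (priority > 0.25f && priority <= 0.75f)
        return MEM_AGP;
    return MEM_BAD;
}
```

By this reading, the `0, 0, 1` triple used earlier in the thread falls in the video-memory row, while something like `0, 0, 0.5` would request AGP memory.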

[This message has been edited by richardve (edited 03-01-2002).]

Nice, thanks.

Originally posted by bpeers:
holy crap ! What kind of performance do you get for regular system-to-system-memory memcpys ?

About 300MB/s on a KT266A. I think the 900MB/s were with an MMX memcpy routine, not the libc one… Be sure to have fastwrites and sideband addressing enabled.
The GF2 was in a very small window at 32bpp to test geometry throughput.

[This message has been edited by wimmer (edited 03-01-2002).]

Ok, thanks, looks like it’s tweaking time

On a related note, is there an NVidia equivalent of ATI_element_array? I checked the FAQ and this forum but didn’t find anything. Simply asking for more mem from wglAlloc and using glDrawElements with an index pointer from within that pool tanked the framerate, so I guess that won’t work. Just wondering if the index transfer is the last thing keeping me from 15M :\ (I’m at 14.8, unlit v3f c3ub, so it might be)

[This message has been edited by bpeers (edited 03-01-2002).]

I don’t think the GeForce 2 supports index buffers (indices in AGP or video mem) anyway, so the driver probably has to read the indices itself, and since these types of memory aren’t cached, that will be very slow. It might work on a GeForce 3 or 4 though.

Try glCullFace( GL_FRONT_AND_BACK ) to test whether you are set-up/raster bound or not.

Also, try v3s instead and perhaps c4ub in a separate array.