AGP memory question

I read on this board that you have to send 64-byte aligned data to AGP memory to get optimal performance (for example, when you write directly to AGP memory through a mapped VBO). Can anyone explain why this is the case?

It depends on the implementation. I think it was mentioned in another thread that for NVidia cards it’s 32 bit aligned data. Not sure about ATI, though.

i read from this board that

Where exactly? Link please…?

Y.

Do you use memcpy/structs to send 32- or 64-byte aligned data?

Could someone please quickly explain exactly what is meant by ‘aligned’ memory? Does it mean that the source memory address must be a multiple of 64 bytes? I doubt this, but it is one meaning of aligned memory. Maybe that you have to write it in 64-byte multiples? Or maybe, if you’re writing say vertex data, that the data for each vertex must be a 64-byte multiple? I have only heard the term aligned memory referring to the first or the third explanation on the list above, but I have a hunch it’s the second (or something completely different).

zen:
An aligned memory pointer means that the address is a multiple of the alignment size.
So 16-byte aligned pointers would be:
0, 16, 32, … n*16

As far as I can see, there is some confusion about alignment issues in general (they do not only apply to vertex data!). Here are some quotes from an Intel paper (“Intel Architecture Optimization Reference Manual”):

“[…]On Pentium II and Pentium III processors, a misaligned access that crosses a cache line boundary does incur a penalty. A Data Cache Unit (DCU) split is a memory access that crosses a 32-byte line boundary. Unaligned accesses may cause a DCU split and stall Pentium II and Pentium III processors. For best performance, make sure that in data structures and arrays greater than 32 bytes, the structure or array elements are 32-byte-aligned and that access patterns to data structure and array elements do not break the alignment rules.[…]”

“[…]A misaligned data access that causes an access request for data already in the L1 cache can cost six to nine cycles. A misaligned access that causes an access request from L2 cache or from memory, however, incurs a penalty that is processor-dependent. Align the data as follows:
• Align 8-bit data at any address.
• Align 16-bit data to be contained within an aligned four byte word.
• Align 32-bit data so that its base address is a multiple of four.
• Align 64-bit data so that its base address is a multiple of eight.
• Align 80-bit data so that its base address is a multiple of sixteen.
A 32-byte or greater data structure or array should be aligned so that the beginning of each structure or array element is aligned in a way that its base address is a multiple of thirty-two.[…]”

(the cache line size of the P4 is 64 bytes)

I see, thanks for the info Adrian. So all we have to do to get n-byte alignment is get an n-byte aligned block and make sure that we pack our data according to those five rules. The only question is how to allocate an aligned block of memory to begin with. I suppose I can’t just rely on malloc and would have to use some platform-specific call like posix_memalign or mmap (I believe anonymous pages obtained by mmap are aligned on VM page boundaries, so they should be suitable for any reasonable alignment, right?).

Regarding the initial post though:

i read from this board that you have to send 64byte aligned data to AGP memory to get optimal performance. (for example, when you directly write to AGP memory using VBO mapper) Can anyone explain why this is the case?

Exactly what is the ‘unit’ that you have to align to 64-byte boundaries? All the data needed for one vertex, I believe, so that only one memory access (one cache line read, that is) is needed per vertex. Is that correct?

Thanks again for the info. I suppose I’ll have to read the entire manuals at some point but I still have a long way to go to reach the optimization stage.

This only matters when using the NV_vertex_array_range extension. In most other cases, the driver “does stuff” that makes alignment issues MOSTLY irrelevant (although 16-byte alignment is still a good idea for good copy performance using SSE, and 64 bytes may give a small incremental gain from cache line alignment).

Note that the alignment rules for start-of-buffer and vertex-within-buffer may be different. For VAR, the rules are typically 4-byte alignment for vertex-within-buffer, but 64-byte alignment for start-of-buffer; this is because the write combiners (line fetch buffers) in the CPU work fastest if you overwrite entire fetch lines, rather than leaving some padding before/after/in between. If you don’t write completely, contiguously, and aligned, then the LFB will have to back-fill FROM AGP memory, which is slow.

To work with aligned buffers, you have to keep track of the start of the buffer, as returned by your allocator, and, separately, the start of the buffer that you use (which is aligned).

struct buffer {
    char * gotFromAlloc; // pointer as returned by the allocator, for freeing later
    char * start;        // aligned pointer you actually use
    size_t size;
};

void allocate( struct buffer * buf, size_t size, size_t align ) {
    assert( !(align & (align-1)) ); // align must be a power of 2
    buf->gotFromAlloc = (char *)malloc( (size + 2*align - 1) & -align );
    buf->start = (char *)(((ptrdiff_t)buf->gotFromAlloc + align - 1) & -align);
    buf->size = (size + align - 1) & -align;
}

You hang on to the value returned from alloc so that you can free/release it later. Of course, for VAR, you need to write your own allocator anyway; you could make this allocator always return chunks that are properly aligned.

I’m assuming that “-size_t” makes sense on your compiler, that “ptrdiff_t” is as big as a pointer, and that the result of “-size_t” is sign-extended to be as wide as a pointer. If this is not true, your C library is broken, and you need to substitute appropriate types.

Thanks for the reply. However, I’m not entirely clear on this aligned memory issue. Does anyone know a link to some source code that actually uses this kind of aligned memory allocation? (Also, a link to a paper on optimization would be helpful, for both Intel and AMD CPUs.)

thanks