Use of C++ structures with OpenGL?



LostInTheWoods
11-08-2002, 08:49 AM
Ok, this is my setup.

//Structures
struct PointStruct
{
float X, Y, Z;
};

struct VectorStruct
{
float X, Y, Z;
};

//Then I declare an array for my data.
//They are pointers to arrays dynamically
//allocated later on

int *Index;
PointStruct *VertexArray;
VectorStruct *NormalArray;

//Then I use my variables like so.
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, &VertexArray);

glEnableClientState(GL_NORMAL_ARRAY);
glNormalPointer(GL_FLOAT, 0, &NormalArray);

glDrawElements(GL_TRIANGLES, 0, GL_UNSIGNED_INT, &Index);

Will that work? I mean, if I create a structure for my points and vectors like that, is the information stored contiguously, as if I had simply created an array of floats?

Bob
11-08-2002, 09:07 AM
It will work if you remove the &-sign in gl*Pointer. With the sign, you're passing the address of the pointer, but OpenGL expects the pointer itself.
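The difference between the two addresses can be sketched without OpenGL. In the snippet below, `recordPointer` is a hypothetical stand-in for glVertexPointer that only remembers the address it receives; `pointsAtVertexData` is likewise an illustrative helper, not part of any API.

```cpp
#include <cassert>
#include <cstddef>

struct PointStruct { float X, Y, Z; };

// Hypothetical stand-in for glVertexPointer: it only remembers the address it is given.
static const void* g_recorded = nullptr;

void recordPointer(const void* p) {
    assert(p != nullptr);
    g_recorded = p;
}

// True when the recorded address is the first vertex of the array.
bool pointsAtVertexData(const PointStruct* vertexArray) {
    return g_recorded == static_cast<const void*>(vertexArray);
}
```

Calling `recordPointer(VertexArray)` records the address of the vertex data, while `recordPointer(&VertexArray)` records the address of the pointer variable itself, which is not where the floats live.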

LostInTheWoods
11-08-2002, 09:16 AM
Alright, thanks. I have been working on an engine for a while, and just realized how many times I'm doing the same thing over and over, so I am simplifying my life. lol. I will do that, thanks.

jwatte
11-08-2002, 07:50 PM
glVertexPointer() copies the POINTER VALUE that you pass into it.

When you later call DrawElements() (or similar friends), GL will read memory from the address that this pointer value specified.

If you change the values of the pointer that you pass into glVertexPointer() after you've passed it into glVertexPointer(), there's no way that glVertexPointer() can know about that change, as it copies the VALUE of the pointer, just like with any function argument in the C calling convention.
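A minimal sketch of that copy-by-value behaviour, assuming a hypothetical `FakeClientState` type that plays the role of the GL client state (it is not a real OpenGL structure):

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the GL client state: it keeps a copy of the pointer VALUE.
struct FakeClientState {
    const float* vertexPointer = nullptr;

    void setVertexPointer(const float* p) {
        assert(p != nullptr);
        vertexPointer = p;  // copied by value, like any function argument
    }
};

// True if the stored copy still refers to the vector's current storage.
bool stillValid(const FakeClientState& s, const std::vector<float>& v) {
    return s.vertexPointer == v.data();
}
```

If the application later points its own variable at different storage (for example after swapping in a bigger buffer), `stillValid` returns false: the stored copy still holds the old address until the pointer is set again.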

GPSnoopy
11-09-2002, 05:04 AM
What you're doing is dangerous!!!

You can't make any assumptions about how the structures you defined are laid out in memory. What you're doing might work with one compiler but not with another. (There might be gaps between the floats in your structures.)

You should simply make an array of floats.
If you wanna use C++, you should use std::vector<GLfloat> and use &(MyVector[0]) to get the address of the first element for glVertexPointer() and other functions.

Nutty
11-09-2002, 09:01 AM
Quite true GPSnoopy. You should use sizeof(VectorStruct) for the stride, not 0.

If you specify 0, it assumes a stride of (size * sizeof(type)), but the compiler might've put extra padding in there.
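The stride rule can be written down as a small sketch. `effectiveStride` and `zeroStrideMatchesStruct` are hypothetical helper names, and the 3-component GL_FLOAT case is assumed:

```cpp
#include <cassert>
#include <cstddef>

struct VectorStruct { float X, Y, Z; };

// Mirrors the rule for 3 GL_FLOAT components: a stride of 0 means
// "tightly packed", i.e. 3 * sizeof(float) bytes between elements.
std::size_t effectiveStride(std::size_t stride) {
    const std::size_t tight = 3 * sizeof(float);
    assert(stride == 0 || stride >= tight);  // this sketch ignores overlapping elements
    return stride == 0 ? tight : stride;
}

// True when passing 0 and passing sizeof(VectorStruct) describe the same layout.
bool zeroStrideMatchesStruct() {
    return effectiveStride(0) == effectiveStride(sizeof(VectorStruct));
}
```

Passing sizeof(VectorStruct) is correct whether or not the compiler added padding, which is why it is the safer argument.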

Nutty

knackered
11-09-2002, 09:05 AM
Mmm, if the stride didn't equate to 3 floats, then GL would have to unpad the data, which would make the driver copy a lot slower, no?
I'm sure there are compiler pragmas for not padding specific structures?

Bob
11-09-2002, 09:23 AM
Just curious, how many compilers today will pad those structures? Not many, I guess.

knackered, why would it be a lot slower? Given the start address of the vertex array, it's pretty easy to calculate the address of any vertex. Just multiply the index of the vertex by the size of a vertex and you have the offset. If there's padding between vertices you just multiply the index by another value (size of the vertex + padding). It doesn't involve any extra work. There may be some problems with the cache though since the size of a vertex is larger (including the padding), but certainly not a lot slower.
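As a rough illustration of that address calculation (the function name `vertexAt` is made up for this sketch):

```cpp
#include <cassert>
#include <cstddef>

// Address of vertex `index` in an array starting at `base`, where consecutive
// vertices are `strideBytes` apart (size of the vertex plus any padding).
const float* vertexAt(const void* base, std::size_t strideBytes, std::size_t index) {
    assert(base != nullptr);
    const unsigned char* bytes = static_cast<const unsigned char*>(base);
    return reinterpret_cast<const float*>(bytes + index * strideBytes);
}
```

The work is one multiply and one add regardless of whether `strideBytes` includes padding; only the constant changes.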

jwatte
11-09-2002, 12:02 PM
C is a systems programming language. As such, each compiler defines the alignment of structures well, and/or allows you to specify how to do alignment. Not doing so would make C useless as a systems programming language.

Thus, as long as you know what you're doing, using structs for vertices is absolutely safe and absolutely the right thing to do.

If you don't know what you're doing, well, that's what Visual Basic's for.

knackered
11-09-2002, 03:14 PM
Originally posted by Bob:
knackered, why would it be a lot slower? Given the start address of the vertex array, it's pretty easy to calculate the address of any vertex. Just multiply the index of the vertex by the size of a vertex and you have the offset. If there's padding between vertices you just multiply the index by another value (size of the vertex + padding). It doesn't involve any extra work. There may be some problems with the cache though since the size of a vertex is larger (including the padding), but certainly not a lot slower.

I'm assuming the driver won't want to send redundant bytes over the bus, in which case it would have to strip out the padding - but I've just realized that it wouldn't do this because the stride could be being used to skip other attribute data addressed by another gl***pointer() instruction.
There's a point - how does the driver deal with interleaved vertex arrays?
glVertexPointer(pointer=base+0, stride=6)
glColourPointer(pointer=base+3, stride=6)
glDrawElements()...
At drawelements time, does it:
a) run down each array pointed to copying the elements to video ram, skipping over 6 elements as it goes,
or does it
b) do 2 block copies, therefore copying redundant colour bytes when doing the verts, and redundant vertex bytes when doing the colours?
Surely it does a). In which case it doesn't matter what the stride is, it can't do a straight block copy...
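For reference, the offsets and stride of such an interleaved layout can be derived from a struct rather than counted by hand. This is only a sketch: `InterleavedVertex` and `AttributeLayout` are hypothetical names, and the values are in bytes, which is what the real gl*Pointer calls expect.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical interleaved vertex: position followed by colour.
struct InterleavedVertex {
    float position[3];
    float colour[3];
};

// Byte values one would hand to the gl*Pointer calls for this layout.
struct AttributeLayout {
    std::size_t offsetBytes;  // added to the base address of the array
    std::size_t strideBytes;  // distance between consecutive vertices
};

AttributeLayout positionLayout() {
    return { offsetof(InterleavedVertex, position), sizeof(InterleavedVertex) };
}

AttributeLayout colourLayout() {
    return { offsetof(InterleavedVertex, colour), sizeof(InterleavedVertex) };
}
```

Both attributes share the same stride; only the starting offset differs.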

zeckensack
11-09-2002, 03:58 PM
I've always assumed that the sum of attribute sizes is checked against the vertex stride in interleaved vertex array setups. If both are equal the driver can shove it down the bus back to back (hello, Mr DMA transfer).

If you have gaps (real gaps, ie padding), I don't think the driver can initiate a clean transfer, no matter what. It will have to reshuffle data and that's bound to be slow.

Of course you won't notice any difference if you only use glArrayElement but then your performance will suck anyway.

OldMan
11-10-2002, 02:09 AM
As long as your classes or structs do not have polymorphic members, they are kept in memory packed exactly as their members are declared. I use it in my work (writing operating systems and drivers) all the time.

[This message has been edited by OldMan (edited 11-10-2002).]

GPSnoopy
11-10-2002, 08:32 AM
Originally posted by OldMan:
As long as your classes or structs do not have polymorphic members, they are aligned in memory. I use it in my work (writing operating systems and drivers) all the time.

By polymorphic members you mean something like struct { bool a; int b; float c; };? ('cause in that case, b isn't just next to a in memory)

If what you're saying is right, then I didn't know that the variables in a structure were contiguous in memory when they're all of the same type.

I still think that, even if you're right, it's dangerous to rely on such low level mechanisms when you can do it in another way. (If you're not sure about what you're doing, don't do it.)


[This message has been edited by GPSnoopy (edited 11-10-2002).]

Nutty
11-10-2002, 09:18 AM
What's polymorphic about bool, int, float?

Although ideally you shouldn't use "bool" in cross platform (or cross library), as I've seen it different sizes in different versions of Visual C++.

I thought by polymorphic, he meant virtual items, which require the use of a V-table in the class/struct, which causes the sum of the members to not equal the size of the struct/class.

AFAIK member variables/class fields are _always_ in the same order that they are declared in, or it would be impossible to reference the same struct from 2 different compiled pieces of code.

The issue at hand is whether 3 floats in a row (a Vector3 struct/class) get padded to a larger byte boundary because of "compiler optimizations" and the target hardware. They shouldn't at this moment, but it's not illegal for the compiler to do so.

Nutty

OldMan
11-10-2002, 09:37 AM
By polymorphic I mean virtual functions.

If you have a struct{float a;byte b;float c;}
you will have it using exactly the same space as 2 floats and a byte (all together in memory). Only by forcing alignment will you get everything on 32 bit boundaries (each compiler has its own strange flags to do that).

It is illegal for the compiler to do it without an explicit command, since doing so would violate the standard. If a compiler decided to do it by itself, all cross platform capability of C++ would be gone.

And that is not dangerous... or how do you think drivers are made (for modern hardware)?

Just something important: a struct {} will not get size 0, it will get size 1, since no 2 objects can be held in exactly the same space in memory.

GPSnoopy
11-10-2002, 10:07 AM
directly from the C++ book (3rd Edition), p.102

"The size of an object of a structure type is not necessarily the sum of the sizes of its members. This is because many machines require objects of certain types to be allocated on architecture dependent boundaries or handle such objects much more efficiently if they are."


As I understand it, the standard doesn't make any guarantee on the size of a structure. And struct { char a; float b; }; might have a size of 5 or 8 depending on the compiler/platform.

I knew about the "empty" structure size. ;)

Anyway, I still maintain it's a bad idea.
I mean, is it really that hard to use a simple array of floats in that case?

t0y
11-10-2002, 10:29 AM
Every member of a struct will be padded to match the platform the compiler is targeting. This way, every member's address will be aligned properly.

Example in a 32 bits compiler:

struct{
float a; // 32 bits
byte b; // 32 bits, although it only uses 8
float c; // 32 bits
}

Example in a 64 bits compiler:

struct{
float a; // 64 bits
byte b; // 64 bits, although it only uses 8
float c; // 64 bits
}

So if you're programming in x86 and using floats or doubles you won't be padded ;).

At least this is what I've been assuming, and this may not be the standard way of doing things.

If you're using msvc just use the #pragma pack stuff and you can be sure...
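A small sketch of the #pragma pack approach, using made-up struct names. The push/pop form shown here is accepted by MSVC, GCC and Clang; other compilers may need a different mechanism.

```cpp
#include <cassert>
#include <cstddef>

// Default layout: the compiler may insert padding so that `c` is aligned.
struct DefaultLayout {
    float a;
    unsigned char b;
    float c;
};

// Packed layout: members are placed back to back with 1-byte alignment.
#pragma pack(push, 1)
struct PackedLayout {
    float a;
    unsigned char b;
    float c;
};
#pragma pack(pop)

// Bytes of padding the default layout carries compared with the packed one.
std::size_t paddingSaved() {
    assert(sizeof(DefaultLayout) >= sizeof(PackedLayout));
    return sizeof(DefaultLayout) - sizeof(PackedLayout);
}
```

Printing sizeof for both structs on a given compiler shows directly whether padding was inserted. Note that packed members can end up misaligned, which some architectures handle slowly or not at all.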

jwatte
11-10-2002, 10:34 AM
If you have gaps (real gaps, ie padding), I don't think the driver can initiate a clean transfer, no matter what. It will have to reshuffle data and that's bound to be slow.


Even if there's data "not being used" between your vertex elements, why would this mean the driver has to re-pack the data? The card can just DMA a large chunk, and pick out the parts it needs. As long as there's data on a physical page of memory, you know that it's safe to read that entire physical page.

I wonder if the GL standard actually requires you to have padding in the LAST vertex, if you specify padding. Else you could set up a contrived case where it wasn't safe to read the padding "after" the last vertex in an array, because that might be on a different page boundary than the actual vertex data. Gack!

I guess this is why driver writers either make us specify exactly what memory range we're dealing with (vertex_array_range, vertex_array_object) OR just build their hardware so that it fetches power-of-two-aligned cache lines (and only fetches those cache lines that actually contain data, not just padding).

Regarding struct padding, most architectures will pad "as you expect" when you sort your members from largest to smallest. Or, if your members are all of the same type, most architectures will pack the members fully. The only one I know about that doesn't is a DSP which puts everything on a 32-bit boundary (even if it's a char) because that's the only type it actually implements...

GlynnC
11-10-2002, 11:59 AM
LostInTheWoods: On a 32-bit system, you're probably safe; on a 64-bit system, you may well get an extra 32 bits of padding at the end of the structure. To be safe, use sizeof(PointStruct) etc as the stride.

t0y: The situation is a bit more complex than your example suggests. If you have fields of differing sizes, then a smaller field may be followed by sufficient padding to ensure that the following field is aligned accordingly.

However, consecutive fields of the same type typically won't have any padding between them. E.g. assuming a 32-bit system, for:

struct {
char c;
int i;
}

there would typically be 3 bytes of padding between c and i, but for:

struct {
char c1, c2, c3, c4;
int i;
}

there wouldn't be any padding.
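These two cases can be checked on any compiler with offsetof. The struct and function names below are illustrative only:

```cpp
#include <cassert>
#include <cstddef>

struct CharThenInt { char c; int i; };
struct FourCharsThenInt { char c1, c2, c3, c4; int i; };

// Padding bytes between the last char and the int in each struct.
std::size_t gapInCharThenInt() {
    assert(offsetof(CharThenInt, i) >= sizeof(char));
    return offsetof(CharThenInt, i) - sizeof(char);
}

std::size_t gapInFourCharsThenInt() {
    assert(offsetof(FourCharsThenInt, i) >= 4 * sizeof(char));
    return offsetof(FourCharsThenInt, i) - 4 * sizeof(char);
}
```

On a typical 32-bit target the first gap would be 3 and the second 0, but the exact values are up to the compiler and platform, so it is worth printing them rather than assuming.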

OldMan
11-10-2002, 12:32 PM
struct{
float a; // 32 bits
byte b; // 32 bits, although it only uses 8
float c; // 32 bits
}

for sure that is not correct with Visual C++, GCC or Intel C++ compiler.. or all drivers I made until today would not work.

Architecture dependent boundaries in x86 is 8bit.. not 32...

32 bits packages are faster than 24 bit ones to move.. but not faster than 16 bits or 8 bits in a x86 processor.

[This message has been edited by OldMan (edited 11-10-2002).]

Bob
11-10-2002, 12:50 PM
struct{
float a; // 32 bits
byte b; // 32 bits, although it only uses 8
float c; // 32 bits
}

for sure that is not correct with Visual C++, GCC or Intel C++ compiler.. or all drivers I made until today would not work.

MSVC 6 will add three bytes between b and c. Unless you alter the packing manually of course.

OldMan
11-10-2002, 01:00 PM
It may be.. the only driver I made with MSVC 6 was made to map 16 bit registers. But GCC and Intel C++ won't do that (just checked by looking at the generated ASM).

LostInTheWoods
11-10-2002, 01:10 PM
WOW. Well, I either just spent A LOT of time fixing something to the point that it's broken, or making it better. I'm not quite sure which now. But from what I have grasped, if I simply use a struct{float, float, float}, then I should be able to do this. Especially if I use sizeof(PointStruct) as my stride. Is this correct?

I'm also looking at the fact that I create a pointer to begin with. Then using that pointer I dynamically allocate an ARRAY of PointStructs. Thus those memory addresses are all laid out in line. Thus if I simply use sizeof() to know my stride I can still do this. My only concern is the padding imposed by the compiler. But you have said that on 32 bit systems, using floats, this should not be a problem. But on a 64 bit system, I will be a full 32 bit pad off. Currently most platforms are 32 bit, so I should be fine. Is this correct?

zeckensack
11-10-2002, 01:33 PM
Originally posted by LostInTheWoods:
WOW. Well, I either just spent A LOT of time fixing something to the point that it's broken, or making it better. I'm not quite sure which now. But from what I have grasped, if I simply use a struct{float, float, float}, then I should be able to do this. Especially if I use sizeof(PointStruct) as my stride. Is this correct?

Yep.


I'm also looking at the fact that I create a pointer to begin with. Then using that pointer I dynamically allocate an ARRAY of PointStructs. Thus those memory addresses are all laid out in line. Thus if I simply use sizeof() to know my stride I can still do this. My only concern is the padding imposed by the compiler. But you have said that on 32 bit systems, using floats, this should not be a problem. But on a 64 bit system, I will be a full 32 bit pad off. Currently most platforms are 32 bit, so I should be fine. Is this correct?

Power-of-two boundaries are probably safe.

I.e. even on a 64 bit system (like the upcoming x86-64) a struct with two float members won't contain any padding.

AMD like to call that 'natural alignment'. Your data members only have to be aligned to a sizeof(type) boundary.

eg
struct {
ubyte a,b; //offsets zero, one
ushort c; //offset two
float d; //offset four
double e; //offset eight
};

This case is mostly guaranteed.

The problem is this:

struct {
short a;
float b;
short c;
};

a will be at offset 0, b will be at offset four (!!), but c can go either to offset two or to offset 8, depending on compiler 'cleverness'. If the compiler ignores 'natural alignment' they may even be in order.

So to make life a little easier,
1)Explicitly arrange for natural alignment
2)Sort members largest to smallest wherever possible

Strategy one would yield
struct
{
short a;
short c;
float b;
};
Works fine.


Strategy two (compiler paranoia) would yield
struct
{
float b;
short a;
short c;
};

Thanks, and good night :D
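The effect of the two strategies can be measured rather than guessed. A sketch with illustrative struct names:

```cpp
#include <cassert>
#include <cstddef>

struct Unsorted         { short a; float b; short c; };  // the problem case
struct NaturallyAligned { short a; short c; float b; };  // strategy one
struct LargestFirst     { float b; short a; short c; };  // strategy two

// Bytes saved by reordering, relative to the unsorted declaration.
std::size_t savedByNaturalAlignment() {
    assert(sizeof(Unsorted) >= sizeof(NaturallyAligned));
    return sizeof(Unsorted) - sizeof(NaturallyAligned);
}

std::size_t savedByLargestFirst() {
    assert(sizeof(Unsorted) >= sizeof(LargestFirst));
    return sizeof(Unsorted) - sizeof(LargestFirst);
}
```

With 2-byte shorts and 4-byte floats one would expect the unsorted struct to take 12 bytes and the other two 8, but that figure is an assumption about the target, so checking sizeof and offsetof on the actual compiler is the reliable route.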

Bob
11-10-2002, 02:16 PM
but c can go either to offset two or to offset 8, depending on compiler 'cleverness'

I believe the specification requires that members appear in memory in the same order as they are specified in the structure. Therefore the compiler is not allowed to place c between a and b. Not sure, but almost.

t0y
11-11-2002, 07:31 AM
Originally posted by OldMan:

Architecture dependent boundaries in x86 is 8bit.. not 32...

32 bits packages are faster than 24 bit ones to move.. but not faster than 16 bits or 8 bits in a x86 processor.


I'm not questioning your programming skills ;) (I've never programmed drivers in win32, only DOS and very simple ones). (OT) Is there any source of information on the net that can help me start playing with that? I miss programming my SB Pro ;)

IIRC from the Intel docs, any memory access will read 32 bits from memory and is 32 bit aligned. If, by any chance, any variable (16 bits or more) crosses a 32 bits boundary 2 memory reads are necessary. This will make 16bit operations slower than the equivalent 32 bit when aligned.

I'm not sure if 8 bits vars should be aligned this way (since they always only need one mem access), but if the next member crosses the boundary padding will be used like in the example I've shown.

About the order of struct members: I've never come across a case where the compiler rearranged the order for me. Maybe this only happens when you optimize for size instead of speed.

zeckensack
11-11-2002, 07:51 AM
Originally posted by t0y:
IIRC from the Intel docs, any memory access will read 32 bits from memory and is 32 bit aligned. If, by any chance, any variable (16 bits or more) crosses a 32 bits boundary 2 memory reads are necessary. This will make 16bit operations slower than the equivalent 32 bit when aligned.

What did you mean there?

If all types are naturally aligned, the shorter types are better. They potentially conserve cache space and require fewer memory reads. It might work out equal, but they are never worse. It's also very nice if you just have to fit some data structure into a given alignment (say, you want each object in a large array to occupy 64 bytes and be perfectly cache line aligned).

I also guess that Intel doc maybe referenced 32 bytes, not bits? That would be the P3 data cache line size.

Or did you mean

0x00 .
0x01 .
0x02 .
0x03 X
0x04 X
0x05 X
^^ that sounds like that first sentence (a 16 bit value crossing a 32 bit boundary), but the next sentence puzzled me :confused:

V-man
11-11-2002, 01:11 PM
>>>This will make 16bit operations slower than the equivalent 32 bit when aligned.<<<

OK, then 32 bit accesses that are *not* aligned will also cause slowdowns.

Also, I think there is something seriously wrong with this statement. When you allocate memory, very often you don't get 32 bit aligned memory. Pretty much 99% of apps out there are not making sure their memory accesses are aligned.

With SSE and MMX, not having 16-byte alignment will lead to slowdowns. At least I think it was 16 bytes. I think that's what you were thinking of.

V-man

OneSadCookie
11-11-2002, 06:26 PM
I've never seen a compiler re-order fields in a struct. I don't know whether that's even legal according to the C spec.

Typically, each machine architecture has an alignment requirement to be able to load from memory into a register. If your data doesn't meet this alignment requirement, you'll either trigger an alignment exception (very slow), or just fail, depending on hardware and/or OS.

For example, on 32-bit PowerPC systems, all loads to integer registers must be on 32-bit boundaries, all single-precision floating point loads must be on 32-bit boundaries, all double-precision floating point loads must be on 64-bit boundaries, and all vector loads must be on 128-bit boundaries. Integer and FP loads will generate alignment exceptions handled by the OS if their address is misaligned; the vector unit chooses the closest 128-bit aligned address and silently loads from that.

The compiler will pad the fields of your struct so that they meet the alignment requirements of your architecture. For example, in the case already mentioned with an int, a char, and another int in a struct in that order, there will be three bytes of padding around the char on almost all current architectures (including Intel, PowerPC, 32-bit MIPS, ...). Which side of that char the padding goes on depends on the endianness of the machine in question, so you shouldn't rely on it in any way.

I can't speak to other compilers in this regard, but GCC aligns the entire struct based on the alignment of the first field. That means that the length of the struct will be padded to be a multiple of the alignment requirement of the first field. For example, a struct containing a double followed by a float will have 32 bits of padding at the end.

In summary, the specific case of { float, float, float } should be 12 bytes on most current common architectures, but you can't really rely on that. To be safe, you should use arrays of GLfloats (since there's no guarantee float maps to GLfloat even).

V-man
11-11-2002, 08:35 PM
I think that structs get padded, but whether they are aligned properly, I don't know. It's pretty compiler dependent, I guess.

When I did some SSE coding, I had exceptions being raised on movaps instructions, and when I checked the addresses, they weren't 16-byte aligned. I had to search for the nearest aligned location in the large array that I allocated (using new). The array type was float (32 bit float), but I still had to make sure I was getting proper alignment.

I use VC++ 6. What's that easy to use compiler flag again?

V-man

Devulon
11-12-2002, 08:54 AM
Anyone who has done SIMD/SSE programming knows how #pragma align works. It's a requirement of the fast versions of the SSE read/write instructions. Padding in a struct is not guaranteed but predictable. You are only compiling for one platform anyway. Usually ( :

Devulon

GPSnoopy
11-12-2002, 09:38 AM
This is getting a bit OT.

Anyway, the variables in a structure will always stay in the same order you defined them in. But the compiler is free to align them as it wants.

I think #pragma align is specific to MSVC, so you should use __m128 and so on when programming in SSE, 'cause it automatically takes care of the alignment.
When trying to align an array there is _mm_malloc() and _aligned_malloc() depending on your compiler, it's not portable of course.
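As a portable alternative that postdates this thread, standard C++11 offers alignas. The sketch below assumes a C++11 compiler; `Vec4` and `isAligned16` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// alignas is standard C++11, so it does not depend on compiler-specific pragmas.
struct alignas(16) Vec4 {
    float v[4];
};

// True when `p` sits on a 16-byte boundary, as aligned SSE loads such as movaps require.
bool isAligned16(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Objects of type `Vec4` with static or automatic storage are placed on 16-byte boundaries by the compiler. For heap allocations before C++17, a dedicated aligned allocator such as the ones named above is still the safer choice.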

But anyway, to come back to the topic, I don't really see the point of using an array of "Vector3f-like" structures instead of simply using an array of floats; really, I mean it brings more trouble than help.
Like OneSadCookie said, an array of GLfloats should be used.


[This message has been edited by GPSnoopy (edited 11-12-2002).]

OneSadCookie
11-12-2002, 12:38 PM
In any case, padding your vector3 out with a fourth element will usually make things more efficient for CPU computations, rather than less (better cache-line alignment).

The GL drivers will be optimized for this case, too, because that's how Q3A submits its vertices.

LostInTheWoods
11-12-2002, 12:46 PM
So you're saying I should add another component to my Vector structure to make it faster? How does that work?

Bob
11-12-2002, 01:23 PM
The cache works in blocks. I'm not sure about the size on the common architectures, but I believe 32 and 64 bytes per block are usual. If a block is not in the cache (I'm talking about the processor cache by the way), you have a cache miss and the processor has to access the main memory to fetch this block. This can be expensive. If you keep your data structures of a size such that N structures can fit exactly in a cache block, you can with proper alignment store your structures so that they never cross two blocks. If they cross two blocks, you will have two cache misses if that single structure is not in cache (note, if one of the blocks is in the cache because of a previous access, there's only one miss, but that's one too many anyway). Adding an extra element to the structure can save expensive cache misses.
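The straddling argument is plain arithmetic and can be sketched as follows. `countStraddling` is a made-up helper; it assumes the array starts on a cache-line boundary and says nothing about the extra memory a padded element costs.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements, out of `count` stored back to back from a line-aligned base,
// whose bytes span two cache lines.
std::size_t countStraddling(std::size_t elemSize, std::size_t lineSize, std::size_t count) {
    assert(elemSize > 0 && lineSize > 0);
    std::size_t straddling = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t first = (i * elemSize) / lineSize;                 // line of first byte
        const std::size_t last  = (i * elemSize + elemSize - 1) / lineSize;  // line of last byte
        if (first != last) ++straddling;
    }
    return straddling;
}
```

For 8 elements and 32-byte lines, 12-byte elements give 2 straddling elements while 16-byte elements give none.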

zeckensack
11-12-2002, 05:58 PM
Originally posted by Bob:
The cache works in blocks. I'm not sure about the size on the common architectures, but I believe 32 and 64 bytes per block are usual. If a block is not in the cache (I'm talking about the processor cache by the way), you have a cache miss and the processor has to access the main memory to fetch this block. This can be expensive. If you keep your data structures of a size such that N structures can fit exactly in a cache block, you can with proper alignment store your structures so that they never cross two blocks. If they cross two blocks, you will have two cache misses if that single structure is not in cache (note, if one of the blocks is in the cache because of a previous access, there's only one miss, but that's one too many anyway). Adding an extra element to the structure can save expensive cache misses.

Sorry, but that's completely bogus.

First of all you get 33% more storage eaten up by the data. That alone causes 33% more cache misses. Not that cache misses would matter in streaming type stuff anyway ...

Proper SSE or 3DNow optimized code can handle arrays of 3 element vectors without penalty. If you're not running software T&L this doesn't even matter either ...

Bottom line: you're wasting memory and bandwidth for nothing.

MZ
11-12-2002, 06:44 PM
Originally posted by OneSadCookie:
I can't speak to other compilers in this regard, but GCC aligns the entire struct based on the alignment of the first field. That means that the length of the struct will be padded to be a multiple of the alignment requirement of the first field.

What C spec says here is that compiler must ensure sizeof(struct) is padded so that the struct could be safely used as a element of an array.
So, when you declare 'my_struct arr[10];' then at any index 'i' all members of 'arr[i]' must be properly aligned, according to rules you described.
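A sketch of that array guarantee, using an illustrative struct and a hypothetical helper `allElementsAligned`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct DoubleThenFloat { double d; float f; };

// True when every element of arr[0..count) starts on an alignof(T) boundary.
template <typename T>
bool allElementsAligned(const T* arr, std::size_t count) {
    assert(arr != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        if (reinterpret_cast<std::uintptr_t>(&arr[i]) % alignof(T) != 0) return false;
    }
    return true;
}
```

Because consecutive elements are exactly sizeof(T) apart, this only holds if sizeof(T) is a multiple of alignof(T), which is where the tail padding comes from.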

jwatte
11-12-2002, 07:30 PM
First: MSVC by default will align on natural alignment up to 32 bits, and thus will pad float-char-float with 3 bytes between the char and the float. This can be changed with compiler options and/or #pragmas. GCC has something similar with its type attribute syntax. Just running any type through the compiler will show you this.

Program:

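#include <stddef.h>
#include <stdio.h>

struct foo {
    float a;
    char b;
    float c;
};

int
main()
{
    /* offsetof reports each member's byte offset without touching a null pointer */
    printf( "%lx, %lx, %lx, %lx\n",
        (unsigned long)offsetof( struct foo, a ),
        (unsigned long)offsetof( struct foo, b ),
        (unsigned long)offsetof( struct foo, c ),
        (unsigned long)sizeof( struct foo ) );
    return 0;
}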

Output (both using GCC 3.0.3 for i686 and MSVC 6.0 sp5):

0, 4, 8, c

Whoever said it didn't and he'd just checked, obviously hadn't, or checked something totally different (it was kind of vague, that comment).


Second: the native alignment size on any modern x86 is 32 bits, and smaller alignment may cost you performance. malloc(), as implemented in the MSVC runtime library, will do its darndest to return to you 32-bit aligned data. If you "randomly" access data (rather than streaming through it), it makes sense to pad out to powers of 2, and make sure your array starts on the same alignment, to be cache optimal about your accesses.


Third: 16 bits accesses are typically slower than 32 bit on a modern x86, because, if nothing else, each instruction using 16 bit registers needs a size prefix byte. Also, on not-so-modern x86 processors, you get a partial register stall if you mix 32 and 16 bit mode code without properly indicating that you don't care about the upper bits by clearing the register with xor reg,reg.


Fourth: It is not possible to get optimal SSE throughput using non-padded 3-element vertices. The shuffle instruction will tie up the SSE execution unit for three (3!) cycles, and is thus more expensive than add or multiply. (Btw: P-III can only decode a single SSE instruction per clock, and Athlon XPs aren't any faster at SSE than at regular FP :-( )

There may be cases where you can code your loop to "just work (tm)" with 3-interleaved vertex arrays, but that's not the norm. If you can save memory and still be efficient, by all means, do so, but there's many cases where padding actually does matter; all depending on your data access pattern.


Fifth: writing 3-aligned float triplets to AGP memory is a recipe for disaster, as the Pentium III has only 6 line fetch buffers (doubling as write combiners) and will evict a partially filled one at first hint of running low; thus, you REALLY want to be writing full, aligned 32-BYTE quantities at a time when going to AGP memory. No can do if your input or data format is only 12 byte aligned. Well, unless you write 96 bytes at a time, but at that point, you're all out of LFBs to get data in from L2 or RAM in the first place...


Now, let's return to our regularly scheduled hand-wringing over the total absence of released OpenGL drivers supporting ARB_fragment_program in hardware.

zeckensack
11-12-2002, 11:44 PM
Sorry for being such a pain ;)

but

2)Yup, I was talking about streaming

3)MOVZX - reads 8 or 16 bits, writes a full 32 bit register, thus no PRS. Every non-half dumb compiler should use it nowadays.

4)I basically concur. It depends, heavily. Something worth trying is instead of creating permutations of the stream you can prepare the permutations of the static data (matrices typically). This will be too much for the register file and will take reg,mem instructions instead of reg,reg. No worries though, it'll easily stay resident in the L1 cache.

5)See #4. Also try and do computations to temp memory (read: small L1 cache buffers) and shovel them out via MMX.