16-Byte Aligned Textures

Hi all,

I am implementing my own CPU image type, and I'd like to use SSE for some of the operations. However, SSE requires 16-byte aligned memory, which is why I'd like my image type to have a 16-byte aligned row stride.

However, when loading such an image into an OpenGL texture with glTexImage2D, I cannot specify this row alignment requirement.

GL_UNPACK_ALIGNMENT only allows 1, 2, 4, and 8, not 16. Why does this limitation exist?
Don't other people have the same problem in combination with SSE requirements?

I tried to hack something together with GL_UNPACK_ROW_LENGTH. This works in many cases, but there are still combinations where it does not:


if ((img.rowSize() % 16) == 0) {
  const int bytesPerPixel = img.channelCount() * img.pixelSize();
  glPixelStorei(GL_UNPACK_ALIGNMENT, 8);
  glPixelStorei(GL_UNPACK_ROW_LENGTH, img.rowSize() / bytesPerPixel);
  int test = (img.rowSize() / bytesPerPixel) * bytesPerPixel;
  if (test % 8 == 0 && test != img.rowSize()) {
    // Problem case: img.rowSize() is a multiple of 16 but not a
    // multiple of bytesPerPixel, so no integral GL_UNPACK_ROW_LENGTH
    // reproduces the stride, and the 8-byte alignment padding cannot
    // make up the difference either.
  }
}

Does anyone know a solution for that?

GL_UNPACK_ALIGNMENT doesn't deal with the memory alignment of the pointer you pass in. It is about the alignment of each row within the buffer. 4 is probably the best choice.

Anyway, what does that have to do with SSE?

That is actually what I meant. My CPU image class uses SSE and therefore requires 16-byte row alignment on the CPU side, so that SSE loads work the same way on every row of the image.

If I now want to create a texture from that CPU image, for example for display or for some GPU operations, I have to rearrange the image rows first, because GL_UNPACK_ALIGNMENT does not support 16-byte row alignment padding. This is a bit slow.
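For reference, the row rearrangement I mean is just a per-row copy into a tightly packed staging buffer (a minimal sketch; `repackTight` is a made-up helper name):

```cpp
#include <cstring>
#include <vector>

// Copy each source row (stored with a 16-byte-aligned stride) into a
// tightly packed buffer that GL_UNPACK_ALIGNMENT = 1 can consume.
std::vector<unsigned char> repackTight(const unsigned char* src,
                                       int width, int height,
                                       int bytesPerPixel, int srcStride) {
    const int tightRow = width * bytesPerPixel;
    std::vector<unsigned char> dst(static_cast<std::size_t>(tightRow) * height);
    for (int y = 0; y < height; ++y)
        std::memcpy(dst.data() + static_cast<std::size_t>(y) * tightRow,
                    src + static_cast<std::size_t>(y) * srcStride,
                    tightRow);
    return dst;
}
```

Then glPixelStorei(GL_UNPACK_ALIGNMENT, 1) and pass dst.data() to glTexImage2D. It is an extra copy per upload, which is exactly the overhead I would like to avoid.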

I was just wondering why this restriction exists, and whether there is maybe another way around it that doesn't require rearranging my CPU image before handing it over to OpenGL.

Also, I saw that the new OpenGL 4.2 spec adds ARB_map_buffer_alignment, probably for a similar reason. So would it maybe make sense to also add a 16-byte option for GL_UNPACK_ALIGNMENT and GL_PACK_ALIGNMENT in the future?

Your image row size should be a multiple of 16 bytes anyway if you want good performance. OpenGL likes textures whose dimensions are powers of 2 and can perform better with them. If you can't choose the dimensions of the images yourself, then I don't know of any efficient way.

If you are using SSE, then I know that it can process up to 4 floats at the same time. So, if you are using a format such as GL_RGBA32F, then you are all set.
That’s how you solve your problem.

If you do need GL_UNPACK_ALIGNMENT with 16, I don't know when that would become available. Probably never.
Even if it were available, would it be supported by the GPU, or would the driver just convert the data to something the GPU wants?

Heck, I don't even know what GL_UNPACK_ALIGNMENT with 8 is for. Just because something is present in GL doesn't mean that it is fast.

I should probably just repeat this one: 4 is probably the best choice.

I'm not sure what you would even need 16-byte row alignment for. If you're using SSE when reading from an RGBA32F texture, all you need to do is ensure that the pointer you give to OpenGL is aligned to 16 bytes. Since every pixel is 16 bytes, it doesn't matter how wide each line is; it will be a multiple of 16 bytes in size.

And if you're using PBOs, then we just got a handy extension that will let us know what the minimum mapping alignment is. If it's not what you need, then you have to use glBufferSubData to copy your data into the buffer instead.