best performance for texture upload?

i am losing a lot of performance in my application through texture uploads, so i would like to ask what is currently the best way to upload textures.

i am using opengl together with nvidia cg v1.4

in the initialization stage, i create 2 big textures with one glTexImage2D call for each:

    glTexImage2D(
        GL_TEXTURE_2D,
        0,
        GL_RGBA_FLOAT32_ATI,
        2048, // width: 2 images of width 1024 side by side
        3072, // height: 4 images of height 768 stacked
        0,
        GL_BGRA_EXT,
        GL_FLOAT,
        NULL);

the textures are this big because each one has to store 2 images of 1024x768 across and 4 down (8 images).
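To make the layout concrete, here is a minimal C sketch of where each 1024x768 image could sit inside the 2048x3072 texture. The row-major ordering (image 0 at offset 0,0, image 1 to its right, image 2 on the next row) and the names `TileOffset` and `tile_offset` are assumptions for illustration; they do not come from the original code.

```c
#include <assert.h>

/* Atlas layout described above: 2 images across, 4 down. */
enum { TILE_W = 1024, TILE_H = 768, TILES_X = 2, TILES_Y = 4 };

typedef struct {
    int x; /* texel offset along the width of the big texture */
    int y; /* texel offset along the height of the big texture */
} TileOffset;

/* Offset of image `index` (0..7), assuming row-major placement. */
static TileOffset tile_offset(int index)
{
    TileOffset o;
    assert(index >= 0 && index < TILES_X * TILES_Y);
    o.x = (index % TILES_X) * TILE_W;
    o.y = (index / TILES_X) * TILE_H;
    return o;
}
```

The resulting x and y are the values that would go into the xoffset and yoffset arguments of glTexSubImage2D for that image.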

for rendering every frame, i bind the texture and make 16 calls to glTexSubImage2D (like the following)

    glTexSubImage2D(
        GL_TEXTURE_2D,
        0,
        0,
        0,
        width,
        height,
        GL_BGRA_EXT,
        GL_FLOAT,
        cam_data_video[i*4].fgImgLabA[cnt]->imageData); // data of first image

but done this way, it takes too much time.
what possibilities do you know to increase the performance?

thanks!
chris

Using RGBA float 32 data means a lot of space per texel. If I understood correctly, you want to upload:
4 × 4 × 1024 × 768 × 16 = 192 Mbytes per frame!
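The arithmetic behind that figure, as a small C sketch: 4 bytes per float, 4 channels per texel, 1024x768 texels per image, 16 images per frame. The function name `upload_bytes` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for `images` images of width x height texels, where each
   texel has `channels` components of `bytes_per_channel` bytes. */
static size_t upload_bytes(size_t width, size_t height,
                           size_t channels, size_t bytes_per_channel,
                           size_t images)
{
    assert(width > 0 && height > 0);
    return width * height * channels * bytes_per_channel * images;
}
```

Calling `upload_bytes(1024, 768, 4, 4, 16)` gives 201326592 bytes, which is 192 when divided by 1024 * 1024.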

What do you mean by “takes too much time”? Can you post some numbers?
What is your hardware?

you are right, i need to upload this amount every frame.

“takes too much time” means that when i only do the texture upload without rendering anything, i get 1 fps.

my hardware is:
nvidia geforce 7800gtx, 256mb
amd x2 3800+ processor
4gb ram

Using the pixel_buffer_object extension could save you a copy and so, “theoretically”, increase your framerate. It remains a big transfer, though.

I’m not sure how your code is even working. Your texture height is not a power of 2, and your code doesn’t indicate you’re using texture rectangles. Is there something else going on?

Originally posted by jtipton:
I’m not sure how your code is even working. Your texture height is not a power of 2, and your code doesn’t indicate you’re using texture rectangles. Is there something else going on?
probably NPOT

I figured as much. Could NPOT be a possible source of the slowdown? It is my understanding that NPOT transfers are not optimized to the same extent as those for traditional power-of-2 textures.

The ideal solution would be to use dynamic textures. See ARB_render_texture.

No, the slowdown comes partly from the fact that what he does requires a lot of memory: more than his graphics card supports, and certainly more than his AGP aperture size. So a lot of transfer is needed each time (several times per frame), and the enormous size of his textures slows things down as well.

As far as I know, textures must be stored in graphics memory to be used by GL, so having thousands of GB of RAM won’t help. A larger AGP aperture size might help a bit.

You might try sending the data in RGBA order instead of BGRA.
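One way to try this without touching the image source is to swap the channels on the CPU just before the upload. The sketch below is only an illustration: it assumes tightly packed 4-channel float texels, and the name `bgra_to_rgba_f32` is made up. The swap itself costs CPU time, so only a measurement can tell whether it pays off.

```c
#include <assert.h>
#include <stddef.h>

/* Swap the first and third component of every 4-float texel in place,
   turning BGRA into RGBA (or back again). `count` is the number of texels. */
static void bgra_to_rgba_f32(float *texels, size_t count)
{
    size_t i;
    assert(texels != NULL || count == 0);
    for (i = 0; i < count; ++i) {
        float *t = texels + 4 * i;
        float blue = t[0];
        t[0] = t[2]; /* red moves to the front */
        t[2] = blue; /* blue moves to the third slot */
    }
}
```

After the swap, the data would be passed to glTexSubImage2D with GL_RGBA instead of GL_BGRA_EXT as the format.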

ok guys,
thanks for the big help,

i will have a look at the
pixel_buffer_object extension
and
dynamic textures with ARB_render_texture.

but one question remains,
it's true that i don't use 2^n texture sizes, but it works in my case.
should i use

  • the next higher power of 2 size with GL_TEXTURE_2D and just update the data i need every frame with glTexSubImage2D(), or
  • GL_TEXTURE_RECTANGLE_ARB with non power of 2.

what is faster?

chris

… and there is something left.

i use float textures because that is what the nvidia document on vertex shader texture fetches specifies.

however, my data could be unsigned byte as well. so my question is: how do i use unsigned byte with vertex textures, and which internal format should i use? for float i currently use GL_RGBA_FLOAT32_ATI.

if i separate the rgb and the alpha data, i can even say that the data i encode in the alpha channel is just 1 bit of information. so i could further reduce the memory that has to be transferred if i could use a texture with 1-bit texels in the vertex shader.

does someone know something about this?
i mean, which formats can a vertex shader cope with? any help would be much appreciated.

thanks a lot.
chris

I would recommend sticking with standard GL_RGBA format. This is typically the most optimized format on consumer graphics cards. I would make the texture a power of 2 to ensure you aren’t hitting a software path with the NPOT textures.
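For a feel of what padding to a power of 2 costs, here is a small C sketch. It assumes 16 bytes per texel (4 channels of 32-bit float, as in the original glTexImage2D call); the helper names `next_pow2` and `texture_bytes` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is >= n. */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    /* The upper bound keeps p *= 2 from overflowing. */
    assert(n >= 1 && n <= SIZE_MAX / 2 + 1);
    while (p < n)
        p *= 2;
    return p;
}

/* Storage for one mip level of w x h texels. */
static size_t texture_bytes(size_t w, size_t h, size_t bytes_per_texel)
{
    return w * h * bytes_per_texel;
}
```

With these, `next_pow2(3072)` is 4096, so 2048x3072 would become 2048x4096. At 16 bytes per texel that is 128 MiB instead of 96 MiB per texture, and two padded textures would add up to 256 MiB, the full memory of a 256 MB card.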

thanks,
that sounds good!
i will try…

it seems, i am getting best results with internal format set to GL_RGBA_FLOAT32_ATI.
GL_RGBA gives me much lower framerate.

for the external format i have to use GL_UNSIGNED_BYTE as the type and GL_RGB or GL_BGR_EXT as the format, because of how my data is laid out.

any other hints?

Originally posted by jtipton:
I would recommend sticking with standard GL_RGBA format. This is typically the most optimized format on consumer graphics cards. I would make the texture a power of 2 to ensure you aren’t hitting a software path with the NPOT textures.
NPOT vertex textures do work; I use them. AFAIK you should use RGBA for float textures and BGRA for other textures. Do you really need such big textures in the vertex shader? Is no compression possible?

  1. As suggested before - render directly to texture instead of using glTexSubImage. Unfortunately - if you need z-buffer during rendering, then memory usage will be even bigger since you will need 2048x3072 depth buffer.
  2. When using Vertex Texture Fetch FLOAT32 format is required - other formats will fall back to software mode.

Originally posted by k_szczech:
1. As suggested before - render directly to texture instead of using glTexSubImage. Unfortunately - if you need z-buffer during rendering, then memory usage will be even bigger since you will need 2048x3072 depth buffer.
how can i render directly to texture?
i get the data at a maximum of 30fps as OpenCV IPLimages, then i copy them with glTexSubImage into the related part of the texture.

Originally posted by k_szczech:
2. When using Vertex Texture Fetch FLOAT32 format is required - other formats will fall back to software mode.
… yes, i noticed this the hard way through a dropping framerate, but thanks for confirming it.

so the best way is,

  1. to use rgba floating textures? they are fast, but use a lot of memory.

  2. upload data of type unsigned byte and not float to save upload bandwidth?

but here the question is,

  • does it take much time to convert from external unsigned byte rgb to the internal float rgba format? and

  • what effect does it have if i use bgr/bgra instead, since my data is originally bgr?

how can i render directly to texture?
i get the data at a maximum of 30fps as OpenCV IPLimages

Oops! I should read more carefully. I was thinking about glCopyTexSubImage, which copies a part of the renderbuffer to a texture. It can be replaced by rendering directly to a texture instead.
But your case is different: you get the images on the CPU and need to transfer them to the GPU. My mistake, sorry again.

Perhaps we’re looking in the wrong place? Perhaps you do not need to update every texture in every frame. Maybe updating only those parts that you really need would suffice? I’m just guessing, but I recently optimized my application this way: instead of transferring an entire 128x128 texture from GPU to CPU, I transfer a 64x64 texture which contains four 32x32 areas of the original texture.
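The packing idea can be sketched on the CPU side as a plain rectangular copy: take only the needed blocks out of the large image and place them next to each other in a smaller buffer, then transfer just that buffer. This is an illustration under assumptions, not the code from my application: it assumes one byte per texel and tightly packed rows, and `copy_region` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a w x h block of 1-byte texels from position (sx, sy) in `src`
   (rows of src_w texels) to position (dx, dy) in `dst` (rows of dst_w). */
static void copy_region(const unsigned char *src, size_t src_w,
                        size_t sx, size_t sy,
                        unsigned char *dst, size_t dst_w,
                        size_t dx, size_t dy,
                        size_t w, size_t h)
{
    size_t row;
    /* The block must fit inside one row of both images. */
    assert(sx + w <= src_w && dx + w <= dst_w);
    for (row = 0; row < h; ++row)
        memcpy(dst + (dy + row) * dst_w + dx,
               src + (sy + row) * src_w + sx,
               w);
}
```

Four calls with w = h = 32 and destination offsets (0,0), (32,0), (0,32) and (32,32) would fill a 64x64 buffer from four areas of a 128x128 image.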

Another tip:
You can transfer an RGBA8 texture to the GPU (which is 4 times less data than RGBA_FLOAT32), and then render into the RGBA_FLOAT32 texture using this RGBA8 texture. It will take 25% more GPU memory, but I guess it will be faster and leave much more CPU time.

Hey Chris,

k_szczech is right. You don’t have to upload float textures from your CPU to the GPU. Those take w × h × 4 × 32 bits, compared to the w × h × 3 × 8 bits you would have if you upload an RGB texture to the GPU and convert it there. This would reduce the amount of data that you have to transfer from CPU to GPU to ~19%.
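Both percentages mentioned in this part of the thread follow from the bits per texel, as this small C sketch shows. The helper names `bits_per_texel` and `percent_of` are made up for the example.

```c
#include <assert.h>

/* Bits per texel for a format with `channels` components of `bits` bits. */
static int bits_per_texel(int channels, int bits)
{
    assert(channels > 0 && bits > 0);
    return channels * bits;
}

/* Size of format A expressed as a percentage of format B. */
static double percent_of(int a_bits, int b_bits)
{
    assert(b_bits > 0);
    return 100.0 * (double)a_bits / (double)b_bits;
}
```

RGB8 is 24 bits and RGBA float32 is 128 bits per texel, so uploading RGB8 moves 18.75% of the data. An extra RGBA8 staging texture (32 bits per texel) adds 25% on top of the float texture's GPU memory.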

You can use a Framebuffer Object to render a textured Quad into a 32 Bit FBO and bind this in a second pass to your vertex program.

Try this FBO class, it implements most of the FBO features, also some you might not need [stencil attachment, 32 Bit, multiple render targets etc]:

http://gonzo.uni-weimar.de/~wetzste1/download/TestFramebufferObject-1.0.rar

You might have to change the _internalColorFormat in the FBO class to the ATI format!

Good luck and greetings from Weimar to Osaka :]

Cheers Gordon

thanks a lot guys,

especially many greetings to gordon in weimar!
i hope you can do well with your work!


im using GL_RGB with GL_UNSIGNED_BYTE to upload from cpu, since i need just color image data in that format. however, on the gpu its represented as GL_RGBA_FLOAT32_ATI.

  1. im wondering, if it takes the same time to upload GL_BGR_EXT data, compared to GL_RGB?

  2. the other issue is the NPOT thing:
    right now i am using a non-power-of-2 texture size and it works fine. why?
    does it have an impact on speed?
    what is happening internally?
    should i use the next higher power-of-2 size, or use GL_TEXTURE_RECTANGLE_ARB instead?

what gives the best performance?

cheers,
chris