PDA

View Full Version : best performance for texture upload?



nitschke
04-03-2006, 05:16 AM
i am losing a lot of performance in my application through texture uploads, so i'd like to ask: what's currently the best way to upload textures?

i am using opengl together with nvidia cg v1.4

in the initialization stage, i create 2 big textures with a glTexImage2D call for each


glTexImage2D(
    GL_TEXTURE_2D,
    0,
    GL_RGBA_FLOAT32_ATI,
    2048, // have to store 4*2 textures in one
    3072, // have to store 4*2 textures in one
    0,
    GL_BGRA_EXT,
    GL_FLOAT,
    NULL);

the textures are so big because each should store 2 1024x768 images in width and 4 in height (8 images).

for rendering every frame, i bind the texture and make 16 calls to glTexSubImage2D (like the following)


glTexSubImage2D(
    GL_TEXTURE_2D,
    0,
    0,      // xoffset of this sub-image
    0,      // yoffset of this sub-image
    width,
    height,
    GL_BGRA_EXT,
    GL_FLOAT,
    cam_data_video[i*4].fgImgLabA[cnt]->imageData); // data of first image

but in this case it takes too much time.
what options do you know of to increase the performance?

thanks!
chris

ZbuffeR
04-03-2006, 05:44 AM
using RGBA float 32 data means a lot of space per texel. If I understood correctly, you want to upload:
4*4*1024*768*16 = 192 MB per frame!

What do you mean by "takes too much time" ? can you post some numbers ?
What is your hardware ?

nitschke
04-03-2006, 06:00 AM
you are right, i need to use this amount for every frame.

"takes too much time" means: when i just do the texture uploading, without rendering anything, i get 1fps.

my hardware is:
nvidia geforce 7800gtx, 256mb
amd x2 3800+ processor
4gb ram

loicm
04-03-2006, 06:25 AM
The pixel_buffer_object extension could help you save a copy, and thereby "theoretically" increase your framerate. It remains a big transfer, though.

jtipton
04-03-2006, 08:53 AM
I'm not sure how your code is even working. Your texture height is not a power of 2, and your code doesn't indicate you're using texture rectangles. Is there something else going on?

ZbuffeR
04-03-2006, 08:58 AM
Originally posted by jtipton:
I'm not sure how your code is even working. Your texture height is not a power of 2, and your code doesn't indicate you're using texture rectangles. Is there something else going on? probably NPOT (http://oss.sgi.com/projects/ogl-sample/registry/ARB/texture_non_power_of_two.txt)

jtipton
04-03-2006, 11:47 AM
I figured as much. Could NPOT be a possible source of slow down? It is my understanding that NPOT transfer is not optimized to the same extent as traditional textures.

The ideal solution would be to use dynamic textures. See ARB_render_texture.

jide
04-03-2006, 02:26 PM
No, the slowdown comes partly from the fact that what he does requires a lot of memory, more than his graphics card supports, and (almost certainly) more than his AGP aperture. So a lot of transfer is needed each time (several times per frame), and the enormous size of his textures also slows things down.

As far as I know textures must be stored in graphics memory to be used by GL, so having huge amounts of RAM won't help; a larger AGP aperture size might help a bit.

Michael Gold
04-03-2006, 04:13 PM
You might try sending the data in RGBA order instead of BGRA.

nitschke
04-04-2006, 01:18 AM
ok guys,
thanks for the big help,

i will have a look at the
pixel_buffer_object extension
and
dynamic textures with ARB_render_texture.

but one question remains,
it's right that i don't use 2^n texture sizes, but it works in my case.
should i use
- the next higher power of 2 size with GL_TEXTURE_2D and just update the data i need every frame with glTexSubImage2D(), or
- GL_TEXTURE_RECTANGLE_ARB with non power of 2.

what is faster?

chris

nitschke
04-04-2006, 03:45 AM
... and there is something left.

i use float textures because that's what is stated in the nvidia document on vertex shader texture fetches.

however, i can use unsigned byte as well. so my question is: how do i use unsigned byte with vertex textures, and which internal format should i use? for float i currently use GL_RGBA_FLOAT32_ATI.

if i separate the rgb and the alpha data, i can even say that the data i encode in the alpha channel is just 1 bit of information. so i could further reduce the memory that needs to be transferred if i could use a 1-bit-per-texel texture in the vertex shader.

does someone know something about this?
i mean, what formats can a vertex shader cope with? any info would be very much appreciated.

thanks a lot.
chris

jtipton
04-04-2006, 09:08 AM
I would recommend sticking with standard GL_RGBA format. This is typically the most optimized format on consumer graphics cards. I would make the texture a power of 2 to ensure you aren't hitting a software path with the NPOT textures.

nitschke
04-04-2006, 10:32 AM
thanks,
that sounds good!
i will try...

nitschke
04-04-2006, 10:38 AM
it seems, i am getting best results with internal format set to GL_RGBA_FLOAT32_ATI.
GL_RGBA gives me much lower framerate.

for the external format i have to use GL_UNSIGNED_BYTE as the type and GL_RGB or GL_BGR_EXT, because of my data.

any other hints?

marco_dup1
04-04-2006, 10:45 AM
Originally posted by jtipton:
I would recommend sticking with standard GL_RGBA format. This is typically the most optimized format on consumer graphics cards. I would make the texture a power of 2 to ensure you aren't hitting a software path with the NPOT textures. NPOT vertex textures do work, I use them. AFAIK you should use RGBA for float textures and BGRA for regular textures. Do you really need such big textures in the vertex shader? Is no compression possible?

k_szczech
04-04-2006, 06:15 PM
1. As suggested before - render directly to texture instead of using glTexSubImage. Unfortunately - if you need z-buffer during rendering, then memory usage will be even bigger since you will need 2048x3072 depth buffer.
2. When using Vertex Texture Fetch FLOAT32 format is required - other formats will fall back to software mode.

nitschke
04-04-2006, 11:09 PM
Originally posted by k_szczech:
1. As suggested before - render directly to texture instead of using glTexSubImage. Unfortunately - if you need z-buffer during rendering, then memory usage will be even bigger since you will need 2048x3072 depth buffer. how can i render directly to texture?
i get the data at a maximum of 30fps as OpenCV IPLimages, then i copy them with glTexSubImage into the related part of the texture.


Originally posted by k_szczech:
2. When using Vertex Texture Fetch FLOAT32 format is required - other formats will fall back to software mode. ... yes, i learned this the hard way through a dropping framerate, but thanks for verifying it.


so the best way is,

1) to use rgba floating textures? they are fast, but use a lot of memory.

2) upload data of type unsigned byte and not float to save upload bandwidth?

but here the question is,

- does it take much time to convert from external unsigned byte rgb to internal float rgba format?
and

- what effect does it have if i use bgr/bgra instead, because my data is original bgr?

k_szczech
04-05-2006, 02:04 AM
how can i render directly to texture?
i get the data at a maximum of 30fps as OpenCV IPLimages
Oops! I should read more carefully. I was thinking about glCopyTexSubImage, which copies a part of the framebuffer to a texture. That can be replaced by rendering directly to a texture instead.
But your case is different - you get images on the CPU and need to transfer them to the GPU. My mistake, sorry again.

Perhaps we're looking in the wrong place? Perhaps you do not need to update every texture in every frame. Maybe updating only those regions that you really need would suffice? I'm just guessing, but recently I optimized my application this way - instead of transferring an entire 128x128 texture from GPU to CPU, I transfer a 64x64 texture which contains 4 32x32 areas of the original texture.

Another tip:
You can transfer an RGBA8 texture to the GPU (which gives 4 times less data than RGBA_FLOAT32), and then render to an RGBA_FLOAT32 texture using this RGBA8 texture. It will take 25% more GPU memory, but I guess it will be faster and leave much more CPU time.

Gordon Wetzstein
04-05-2006, 06:36 AM
Hey Chris,

k_szczech is right. You don't have to upload float textures from your CPU to the GPU. Float textures are w*h*4*32 bits, compared to the w*h*3*8 bits you would have if you uploaded an RGB texture to the GPU and converted it there. This would reduce the amount of data you have to transfer from CPU to GPU to ~18%.

You can use a Framebuffer Object to render a textured Quad into a 32 Bit FBO and bind this in a second pass to your vertex program.

Try this FBO class, it implements most of the FBO features, also some you might not need [stencil attachment, 32 Bit, multiple render targets etc]:

http://gonzo.uni-weimar.de/~wetzste1/download/TestFramebufferObject-1.0.rar

You might have to change the _internalColorFormat in the FBO class to the ATI format!

Good luck and greetings from Weimar to Osaka :]

Cheers Gordon

nitschke
04-06-2006, 01:00 AM
thanks a lot guys,

especially many greetings to gordon in weimar!
i hope you can do well with your work!

----

im using GL_RGB with GL_UNSIGNED_BYTE to upload from cpu, since i need just color image data in that format. however, on the gpu its represented as GL_RGBA_FLOAT32_ATI.

1) im wondering, if it takes the same time to upload GL_BGR_EXT data, compared to GL_RGB?

2) the other issue is the NPOT thing,
now im using non power of 2 texture size and it works fine. why?
do i have impacts on the speed?
whats happening internally?
should i use the next higher power of 2 size or use GL_TEXTURE_RECTANGLE_ARB instead?

what gives the best performance?

cheers,
chris

ZbuffeR
04-06-2006, 09:10 AM
forget about GL_TEXTURE_RECTANGLE_ARB, its use is very specific.
If your card supports it, go for NPOT.

nitschke
04-10-2006, 04:14 AM
ok, so im using GL_TEXTURE_2D with a size of 2048*3072, that is NPOT, but works.

1) but does anyone know whether it takes the same time to upload GL_BGR_EXT data compared to GL_RGB?

2) for the vertex shader im restricted to use floating point internal representation (GL_RGBA_FLOAT32_ATI).
but for the fragment shader i need just 8-bit RGB data, so GL_UNSIGNED_BYTE with RGB is fine as the internal representation. can i use that without a performance penalty, and which format is best for it?

thanks a lot,
chris

jide
04-10-2006, 04:48 AM
1) Why would it take less time to upload the same amount of data? Whether it is BGR or RGB, it's the same, from my point of view.
I didn't read all the topic, but maybe some compression would help.

nitschke
04-10-2006, 05:26 AM
thanks jide,

i ask that because somewhere someone stated that bgr, unlike rgb, is not natively supported, so the data has to be uploaded (same amount as rgb) and rearranged into rgb. the rearranging could be a performance penalty.

here, i just want to verify, if its really like this and what happens internally.

jide
04-10-2006, 05:52 AM
I might be wrong, unfortunately. I just guessed. But I also guess that the rearrangement is not such an expensive task, so the difference might be unnoticeable, especially with current high-speed data transfer rates.

Try S3 texture compression, it should help.

nitschke
04-10-2006, 05:59 AM
hey jide,
thanks for your help!

what are s3 compressions?

yooyo
04-10-2006, 12:02 PM
I suppose you have following scenario:
input device -> system memory -> ogl texture -> render.

In this case you have two memcpy operations: from the input device to system memory, and from system memory to the GF7800. You may try to avoid this double copy by using a PBO.

1. Create 2-4 PBO's, each sized for one 1024x768 texture.
2. In a loop, obtain a pointer to the PBO memory buffer (via glMapBufferARB), copy data from the input device into the mapped buffer, then unmap the buffer and call glTexSubImage2D(...). This starts an async transfer, so you may immediately use another PBO for the next step in the loop.

Why use several PBO's? Well, while one PBO is busy with an image data transfer, you may use another PBO to prepare, and even start, the next transfer.

Theoretically, you can get up to 2 GB/sec in a very special case (no memcpy, just glTexSubImage2D from a PBO), but in real usage you may expect ~600MB/sec.

Use BGRA textures. If you don't need 32-bit precision (floats), you may use the half type (16-bit float - faster) or regular 8-bit (fastest).

I'm wondering how you manage to "feed" 192MB/sec into system memory. What input device (or HDD) do you have?