AGP texture transfer time

I currently have a 2048x2048 texture that changes every frame. It is only monochrome, so it is 4MB in size. However, when the texture is transferred across the AGP bus (running at 8x on a GeForce 5900TD), it takes 10ms to do the transfer.

The TexImage call I am using is as follows…

glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE8, 2048,2048, 0, GL_LUMINANCE, GL_UNSIGNED_BYTE, Data);

… where Data is an array of unsigned chars held in system memory.

What I cannot explain is why this transfer takes 10ms. With AGP running at 8x, it should surely be able to do the transfer about 4 or 5 times quicker than that.
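For reference, I am timing the upload roughly like this (a sketch; getTimeMs() just stands in for whatever high-resolution timer is being used):

glFinish();                         /* make sure nothing else is still pending */
double t0 = getTimeMs();            /* getTimeMs() is a placeholder timer */
glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE8, 2048, 2048, 0,
             GL_LUMINANCE, GL_UNSIGNED_BYTE, Data);
glFinish();                         /* wait for the upload itself to complete */
double uploadMs = getTimeMs() - t0; /* comes out at roughly 10ms here */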

My only clue so far is in the Blue Book, which says that with GL_LUMINANCE each element is converted to floating point and then assembled into an RGBA element by replicating the luminance value three times for red, green and blue and attaching a 1 for alpha. Does that mean that the texture now takes up 16MB on the graphics card (which seems excessive)? Also, does that mean that the transfer across the AGP bus cannot use DMA, as there is work to do on the destination side?

This all seems to be quite inefficient at the moment, so does anyone have any ideas on how to transfer a luminance texture across to the graphics card without incurring all the extra processing overhead (if I have understood things correctly)?

thanks.

For updating the texture use

glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 2048, 2048, GL_LUMINANCE, GL_UNSIGNED_BYTE, Data);
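In other words, create the texture once and then only update its contents each frame (a rough sketch; the texture name here is just a placeholder):

/* once, at init: create the texture and allocate storage without uploading any data */
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE8, 2048, 2048, 0,
             GL_LUMINANCE, GL_UNSIGNED_BYTE, NULL);

/* every frame: replace the contents of the existing texture */
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 2048, 2048,
                GL_LUMINANCE, GL_UNSIGNED_BYTE, Data);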

kon

You might want to check out the NV_PixelDataRange extension. It has accelerated support for the GL_LUMINANCE_ALPHA format for glTexSubImage2D. http://oss.sgi.com/projects/ogl-sample/registry/NV/pixel_data_range.txt
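Roughly, the write path would look something like this (an untested sketch pieced together from the spec; the wglAllocateMemoryNV parameters are just example values, and the buffer holds 2048x2048 LUMINANCE_ALPHA texels, i.e. 8MB):

/* allocate memory the driver can DMA out of (parameters are example values) */
GLsizei pdrSize = 2048 * 2048 * 2;   /* LUMINANCE_ALPHA = 2 bytes per texel */
GLubyte *pdrBuf = (GLubyte *) wglAllocateMemoryNV(pdrSize, 0.0f, 0.0f, 0.5f);

/* declare the range as a write (upload) pixel data range and enable it */
glPixelDataRangeNV(GL_WRITE_PIXEL_DATA_RANGE_NV, pdrSize, pdrBuf);
glEnableClientState(GL_WRITE_PIXEL_DATA_RANGE_NV);

/* per frame: expand the 4MB of luminance into luminance/alpha pairs in the range,
   then upload straight out of it */
for (int i = 0; i < 2048 * 2048; ++i) {
    pdrBuf[2 * i]     = Data[i];   /* luminance */
    pdrBuf[2 * i + 1] = 0xFF;      /* alpha */
}
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 2048, 2048,
                GL_LUMINANCE_ALPHA, GL_UNSIGNED_BYTE, pdrBuf);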

You answered yourself! The reason it takes that long is that the API enters an enormous loop to convert the data from grayscale to RGBA! Note that it will be much safer and faster if you convert the data yourself to RGB and pass it like that. You will save 1/4 of the space and gain speed.

Originally posted by kon:
For updating the texture use

glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 2048, 2048, GL_LUMINANCE, GL_UNSIGNED_BYTE, Data);

kon

Doing it this way (assigning a blank texture at initialisation and then updating with the call described above) made no difference to the timing. Sorry, thanks for the input though.

Originally posted by Mihail121:
You answered yourself! The reason it takes that long is that the API enters an enormous loop to convert the data from grayscale to RGBA! Note that it will be much safer and faster if you convert the data yourself to RGB and pass it like that. You will save 1/4 of the space and gain speed.

Unfortunately, I need the bandwidth on the AGP though. I wanted it to only transfer 4MB per texture across. Yours is a potential solution though, as long as that is accelerated properly.

Originally posted by roffe:
You might want to check out the NV_PixelDataRange extension. It has accelerated support for the GL_LUMINANCE_ALPHA format for glTexSubImage2D. http://oss.sgi.com/projects/ogl-sample/registry/NV/pixel_data_range.txt

After reading the above referenced material, it becomes clear that no current drivers use DMA to transfer texture data across the AGP bus, because of OpenGL's compliance with the client-server architecture. Gutted.

If I understood NVidia correctly, the LUMINANCE8 texture format is supported natively in graphics hardware. That means the texture data is not(!) converted to RGBA data. Thus, it is more efficient to send grayscale data in the LUMINANCE8 format to the GPU than to blow it up to RGBA (more efficient in both the amount of data to transfer and the amount to store).
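In numbers: 2048 x 2048 x 1 byte = 4MB as LUMINANCE8, 2048 x 2048 x 2 bytes = 8MB if it has to go up as LUMINANCE_ALPHA (the format the pixel data range spec accelerates), and 2048 x 2048 x 4 bytes = 16MB if it gets expanded to RGBA8.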

Unfortunately, the pixel data range extension does not support this texture format. As a consequence, this data is not sent to the GPU using DMA transfers. This possibly explains why the transfer rate is nowhere near the theoretical AGP limit.

Does anybody know what the reasons are for not supporting DMA transfers for grayscale (LUMINANCE8/16, HILO) texture formats?

Cass?

Klaus

[This message has been edited by Klaus (edited 07-10-2003).]

lazrhog, I haven’t read that part of the Blue Book, but luminance textures are certainly NOT converted to floating-point RGB during download and then back to component RGB. That passage may be explaining what conceptually happens to the luminance value in the texture unit during fragment generation. Luminance is generally stored as luminance; the only issue may be pixel packing/padding internally in some implementations, where an 8-bit single component takes up 16 bits internally, but these things are always hidden and have nothing to do with the spec. Luminance is stored as luminance internally (although if a vendor chose not to do that, they could, and you’d never know).
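If you want to see what the driver claims it resolved the internal format to, you can query the texture level (a quick sketch; note this only tells you what the implementation reports, not how it actually packs the storage):

/* query what the implementation reports for level 0 of the bound texture */
GLint internalFormat = 0, lumBits = 0, redBits = 0;
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_INTERNAL_FORMAT, &internalFormat);
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_LUMINANCE_SIZE, &lumBits);
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_RED_SIZE, &redBits);
/* a luminance-native store would typically report lumBits == 8 and redBits == 0 */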

w.r.t. DMA, you can still wait on a DMA and benefit from the transfer speed if not the parallelism :-/ So there’s still no real excuse (AFAIK), and for big transfers it should be a win.

Reading Matt’s extension, I can’t help wondering if some ‘dangerous’ mode with NV_fence wouldn’t be more appropriate, i.e. you’d better damned well not access these pages until this fence is complete. Also, couldn’t you simply block on memory pages being read or written by the application and have your way with the DMA? That way you DMA and don’t block unless there’s a conflicting access. This would make the fence merely optional but good practice.
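Something along these lines is what I have in mind (a sketch using NV_fence; pdrBuf is the hypothetical pixel data range buffer from the earlier sketch):

/* create a fence to guard re-use of the upload buffer */
GLuint fence;
glGenFencesNV(1, &fence);

/* kick off the upload out of the pixel data range, then drop a fence behind it */
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 2048, 2048,
                GL_LUMINANCE_ALPHA, GL_UNSIGNED_BYTE, pdrBuf);
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);

/* ... do other CPU work while the DMA (hopefully) proceeds ... */

/* only block when the app is about to write into pdrBuf again */
glFinishFenceNV(fence);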

Originally posted by lazrhog:
I currently have a 2048x2048 texture that changes every frame. It is only monochrome, so it is 4MB in size. However, when the texture is transferred across the AGP bus (running at 8x on a GeForce 5900TD), it takes 10ms to do the transfer.

I am curious… what number did you expect?

(2048 x 2048 texels)(1 B/texel)(1 MB/2^20 B)(8 Mb/MB)/(10^-2 s) = 3200 Mb/s, which is slightly under half peak. What’s the speed of your memory interface?

> The reason it takes that long is that
> the API enters an enormous loop to
> convert the data from grayscale to RGBA

When the spec says “converts to floating point and expands”, that means “gives the appearance of converting …”. The actual implementation is very likely more efficient than that.

Note that data that lives in regular RAM can NOT be transferred over AGP without first copying it into AGP memory. Thus, you’re looking at:

reading 4 MB from cacheable memory
writing 4 MB to AGP memory
reading 4 MB out of AGP memory (from the card)

This means 12 MB per transfer. Multiply by 100 (because you say it takes 10 milliseconds) and you’re already getting 1200 MB/s of throughput. PC133 RAM has a peak throughput of about 1066 MB/s, so I guess you either have DDR RAM in your machine, or the driver actually does DMA out of regular memory, turning off the AGP-ness of it.

Also, if you touch those 4 MB every frame, that’s another 8 MB of memory traffic (read + write) to account for, outside GL but still in your program and competing with GL system resources.

I get 2048 x 2048 x 1 B = 4 MB, 10 ms = .01 s

4MB/.01s = 400 MB/sec

I couldn’t follow your expression; it’s a respectable number, although AGP 8X should peak at 2.1GB/sec, so it’s actually less than 1/5th of peak theoretical throughput. Other factors like memory performance may affect this of course, but many systems have memory performance in this ballpark (in theory).

All peak numbers from specs which don’t mean much, but since we’re throwing peak numbers around :-).

Originally posted by dorbie:
I get 2048 x 2048 x 1 B = 4 MB, 10 ms = .01 s

4MB/.01s = 400 MB/sec

Same as me :) I wrote “Mb”, not “MB”.

I couldn’t follow your expression; it’s a respectable number, although AGP 8X should peak at 2.1GB/sec, so it’s actually less than 1/5th of peak theoretical throughput.

Yes, my bad, I forgot a 2 somewhere.

Edit: Forgot a /

[This message has been edited by m2 (edited 07-10-2003).]

I posted before seeing jwatte’s response. Ahh, the copy; and then there are the app’s data modifications, which are definitely always there in a real app, if not in a benchmark attempt.

I see the pixel data range extension uses the allocation of AGP-mapped memory and avoids the copy, but you’d still have your app updating that memory in the real world, so it may not work out any faster…

If your intent was a flipbook animation it could work out faster for you though, so long as you could fit them all in your aperture.

[This message has been edited by dorbie (edited 07-10-2003).]

Thanks for the replies guys.

I will have to modify my own drivers to incorporate DMA. No probs. Thanks.