
texSubImage.. 50 MB /sec...



Stephen Webb
02-12-2004, 11:43 PM
I'm using an FX5950 ultra, and I'm doing a bunch of large glTexSubImage2D calls.. (1024x1024)

Source data is in RGB format, target texture is in RGBA (or RGB, doesn't seem to matter)..

I'm getting about 20 frames per second doing this copy and nothing else. Does this sound right, or am I getting screwed?

I was really hoping I could get more than that :)

Supposedly AGP 4X is enabled...system is P4 2.5 GHz..

I can read from my HD faster than 50 MB per second...something's got to be wrong here..

Thanks

-Steve

dorbie
02-12-2004, 11:50 PM
Are you loading the image into GART mapped system memory? That's the only way you'll get a DMA to the graphics card and achieve best performance.

Adrian
02-13-2004, 12:04 AM
That sounds very slow.

I posted a question about performance of subimage2d a while ago http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/004092.html

It's probably better if you use BGRA_EXT.

Also if you don't mind using NVidia only extensions PDR should give you slightly better performance.

But neither of those account for the low throughput you are seeing.

What speed do you get with a smaller texture?

Edit: This benchmark might help you http://www.adrian.lark.btinternet.co.uk/GLBench.htm

I get 270Mb/sec on my GF5900u, AGP 4x.

From the results it'll be obvious whether it's your code that's causing the slowness or your system.

[This message has been edited by Adrian (edited 02-13-2004).]

Bozfr
02-13-2004, 03:32 AM
Originally posted by dorbie:
Are you loading the image into GART mapped system memory? That's the only way you'll get a DMA to the graphics card and achieve best performance.

Dorbie or somebody else, please, could you tell me more about this ?
How do you load the image into GART mapped system memory ?
How do you do that under Windows and/or Linux ?



[This message has been edited by Bozfr (edited 02-13-2004).]

KuriousOrange
02-13-2004, 04:25 AM
Originally posted by Stephen Webb:
I can read from my HD faster than 50 MB per second...something's got to be wrong here..
-Steve

*Cough* scuse me? What type of hard drive is giving you such good read transfer rates?
The average I've found is around the 15mb/s, except on a decent RAID setup where you can get mad speeds.
Are you sure your timing code is accurate?

dorbie
02-13-2004, 09:10 AM
Well PCI i/o should be able to deliver much better performance than this anyway.

AGP memory pages are all over the place, so you need something that's aware of the GART map and can allocate contiguously from that table; I think it has to be in the kernel. On Linux there's a kernel module, agpgart, for this, along with sample code. There's also glXAllocateMemoryNV, and for AGP allocation on Windows there's wglAllocateMemoryNV.

I have not tested the relative performance for image transfers, it's generally used for vertex arrays but it should be an ideal candidate for image subloads.

I expect superbuffers will ultimately replace this for image xfers. You already have VBOs for vertex data so this kind of direct allocation seems like it will ultimately be deprecated.

I'd also try the full image transfer vs the subload (and certainly full width) and watch your glPixelTransfer and glPixelStore settings. Any kind of stride you've set up (or anything off the beaten path) could have a disastrous impact on your performance purely for implementation/optimization reasons.


[This message has been edited by dorbie (edited 02-13-2004).]

dorbie
02-13-2004, 09:19 AM
15MB/sec? Maybe it's time for you to upgrade :-). That seems far from what's achievable from a single drive with large contiguous or sequential reads.
http://www.tomshardware.com/storage/20040209/seagate-03.html#benchmark_results

rgpc
02-13-2004, 04:08 PM
Originally posted by KuriousOrange:
*Cough* scuse me? What type of hard drive is giving you such good read transfer rates?


My Seagate SATA drive is capable of 50Mbytes per second but achieves around 40Mbytes on average.

Stephen Webb
02-15-2004, 02:24 PM
Originally posted by KuriousOrange:
*Cough* scuse me? What type of hard drive is giving you such good read transfer rates?


Well, I don't currently have access to the system I was referring to...But I'm pretty sure I was getting 52+ MB/sec.

Current system I'm getting around 42 megs/second:

red:/home/swebb# hdparm -t /dev/hda

/dev/hda:
Timing buffered disk reads: 64 MB in 1.53 seconds = 41.83 MB/sec

I don't think either hardware is particularly special...UDMA, 7200 rpm IDE drives..?

-Steve

Stephen Webb
02-15-2004, 02:31 PM
Originally posted by dorbie:
... GART map and can contiguously allocate from that table, I think it has to be in the kernel. On Linux there's a kernel module agpgart ... glXAllocateMemoryNV.... should be an ideal candidate for image subloads.


Dorbie...others. Thanks for the pointers. I will be working with this and hopefully I will be able to crank up the performance.

I'll report back with my progress..

Thanks again

-Steve

(BTW, I wasn't able to run the benchmark because I am running on a Linux system. Is the source code for that executable available?)

Adrian
02-15-2004, 02:40 PM
I'm adding some new features, I'll have the source and new exe up tomorrow.

Stephen Webb
02-15-2004, 03:09 PM
Originally posted by Adrian:

I posted a question about performance of subimage2d a while ago http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/004092.html


I checked out that discussion and ran the test program that was provided. I get much better results (and quite acceptable) if I run the program as is...about 600-900 MB/sec (depending on texture size...)

But, if I add the line:

glBindTexture(GL_TEXTURE_2D, texID);

in the init function, AFTER the glTexImage2D call...My performance drops to about 72 MB / sec.

Doh!

I guess I don't really understand what is going on under the covers, here...

-Steve

jwatte
02-15-2004, 03:11 PM
WRT the original question:

I suggest giving the data to GL in GL_BGRA/UNSIGNED_BYTE format (if you're on an x86 CPU). This should result in optimal transfer rates.

Perhaps the driver needs to do a copy to AGP memory before uploading, but good ol' PC-133 can copy at 512 MB/s, which is 10x what you're seeing -- and you probably have DDR-something in that box.

Also, make sure you calculate 4 bytes per pixel, not just the number of pixels, when doing your math :-)

KuriousOrange
04-15-2004, 11:05 PM
Originally posted by Stephen Webb:

Well, I don't currently have access to the system I was referring to...But I'm pretty sure I was getting 52+ MB/sec.

Current system I'm getting around 42 megs/second:

red:/home/swebb# hdparm -t /dev/hda

/dev/hda:
Timing buffered disk reads: 64 MB in 1.53 seconds = 41.83 MB/sec

I don't think either hardware is particularly special...UDMA, 7200 rpm IDE drives..?

-Steve

I had to search for this thread, as I got distracted way back in Feb.
I'm glad to see you say "buffered" in your answer - of course you're getting 40-odd MB/s buffered, it's coming from the cache! Whether that's the cache on the interface or the OS cache, no matter, it's still a cache - and therefore involves just memory copies.
This is no indicator of the true transfer rate of your hard drive - generally the transfer rate in the spec of a hard drive is the speed of the interface rather than its physical ability to read from the disk. If you were streaming data off the disk, you'd soon run into the true transfer rate, for you wouldn't be able to render faster than it! You'll probably find your real transfer rate is more like 15 MB/s.
Use the Windows performance analyser(?) to benchmark your hard drive.
EDIT:
I've looked at your 1st post again. 1024*1024*3 bytes = 3 MB per frame. In order to read 20 frames per second you would require a transfer rate of 60 MB/s, which is in excess of even the buffered transfer rate you report, let alone the transfer rate you will actually be getting.

yooyo
04-16-2004, 01:30 AM
Originally posted by Stephen Webb:
I'm using an FX5950 ultra, and I'm doing a bunch of large glTexSubImage2D calls.. (1024x1024)

Source data is in RGB format, target texture is in RGBA (or RGB, doesn't seem to matter)..

I'm getting about 20 frames per second doing this copy and nothing else. Does this sound right, or am I getting screwed?

I was really hoping I could get more than that :)

Supposedly AGP 4X is enabled...system is P4 2.5 GHz..

I can read from my HD faster than 50 MB per second...something's got to be wrong here..

Thanks

-Steve

You have to use PBOs or PDR+fences. On my system (P4 2.6, FX5900, AGP8x) I got ~1.8GB/sec upload speed. Note that these transfers are async, so the texture data will be "available" later in the frame, or maybe a few frames later!
When you use PBO or PDR, the driver starts an async DMA data transfer and the CPU can continue its work. But if you try to change the source buffer while the data transfer is not yet finished, you'll get corrupted texture data.

I think the best approach is using multiple PDR buffers and fences w/o the glFlushPixelDataRange call. This will free your CPU.

Use glTexSubImage2D call: glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, XRES, YRES, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, ptr);

yooyo

KuriousOrange
04-16-2004, 02:13 AM
I think I may have misunderstood what is actually happening in his program here.
When he said:
"I can read from my HD faster than 50 MB per second...something's got to be wrong here.."
He was merely comparing his texture upload rate to hard drive transfer rate, wasn't he? He wasn't actually saying he was reading the data he's uploading to texture, was he?
I feel quite embarrassed, someone should have stopped me sooner - I'm still right about transfer rates though....... he says looking sternly at dorbie.

Bozfr
04-16-2004, 11:40 AM
Originally posted by yooyo:
But if you try to change source buffer while data transfer are not yet finished you'll get corrupted texture data.

I think, the best approach is using multiple PDR and fences w/o glFlushPixelDataRange call. This will free your CPU.

yooyo, could you explain more precisely how it works ?
How can you be sure that your texture has been uploaded w/o glFlushPixelDataRange ?
I am not familiar with fences...

Is there any chance to see some of your code ?

yooyo
04-17-2004, 06:32 AM
It is easy.. Allocate one big chunk using a wglMem = wglAllocateMemoryNV(NumBuffers*ImageSize, 0, 1, 1) call. You have to split this big memory chunk into several smaller "transfer" buffers.


typedef struct tagBuffer
{
    GLubyte *ptr;    // start of this buffer inside the wglMem chunk
    GLuint fence;    // NV_fence object guarding the upload
    int status;      // FREE or TRANSFER
    GLuint texture;  // destination texture
} Buffer;

Set up all these structures as follows:


// init code
glEnableClientState(GL_WRITE_PIXEL_DATA_RANGE_NV);
glPixelDataRangeNV(GL_WRITE_PIXEL_DATA_RANGE_NV, NumBuffers*ImageSize, wglMem);

Buffer buf[NumBuffers];
for (i = 0; i < NumBuffers; i++)
{
    buf[i].ptr = (GLubyte *)wglMem + i*ImageSize;
    glGenFencesNV(1, &(buf[i].fence));
    glGenTextures(1, &(buf[i].texture));
    glBindTexture(GL_TEXTURE_2D, buf[i].texture);
    // setup texenv, filtering...
    // allocate the texture storage (no data yet); this texture format is accelerated
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, XRES, YRES, 0, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, NULL);
    buf[i].status = FREE;
}

Somewhere in your code:


// Find FREE buffer
index = FindFreeBuffer();
// Copy data to buffer
memcpy(buf[index].ptr, srcbuff, ImageSize);
glBindTexture(GL_TEXTURE_2D, buf[index].texture);
// Start texture transfer. This is an async call: it returns immediately, before the upload has finished
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, XRES, YRES, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, buf[index].ptr);
// Set fence
glSetFenceNV(buf[index].fence, GL_ALL_COMPLETED_NV);
buf[index].status = TRANSFER;

When you want to draw the texture from some buffer you have to test whether the transfer has finished.


if (glTestFenceNV(buf[index].fence))
{
    // upload is finished... you can render the texture
    buf[index].status = FREE;
}
else
{
    // do something else
}

Depending on your ImageSize, uploading can take 2-50ms, but your CPU is free to do something else.
For example, if you do video playback you will get a delay of 2-3 frames, but your CPU can deal with the decoder.

If your app really needs to render the image that is currently uploading, the CPU must wait until the transfer is finished, so you have to use a glFinishFenceNV() call. If you really have to call this function then you don't need PDR (it is the same as the classic synchronous glTexSubImage2D codepath).

Note that if a transfer is still pending and you change the data in its buffer, you can expect corrupted texture data. The CPU can copy to the wgl memory buffer MUCH faster than the GPU can copy it from wgl memory to the texture.

In my player application I use 2-3 720x576x32 buffers and I have a 2-3 frame delay, but playback CPU usage is the same in my app and in MediaPlayer (less than 20% for MPEG2). Without PDR my player spent more than 60% CPU time.

The code was written online so it may have some errors, but the idea is there... :)

You can find a PBO example in its spec:
PBO spec (http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_pixel_buffer_object.txt)

yooyo

Bozfr
04-18-2004, 06:18 AM
Thanks a lot for the code.

As I try to do something similar to you (a video player) and you have far more experience than me, I still have a couple of questions :

- Fences are reported (in their spec) to be expensive. So it seems that you have found that using fences is cheaper than using glFlushPixelDataRange, right ?

- Regarding your code, what do you do when glTestFenceNV returns false ? You still draw something ? with a previous texture ?
And what happens when FindFreeBuffer does not find a free buffer ?

- If I understand the PDR example of the PBO spec correctly, the glFlushPixelDataRangeNV is done before the glTexSubImage2D and not after, to take advantage of this asynchronicity. Then the first frame is not necessarily drawn correctly, right ?

- How to be sure a transfer is finished with a PBO ?

I must say that I am really impressed by your figures : 1.8GB/s, wooah !

Do you get similar results with PBO ?

yooyo
04-18-2004, 10:17 AM
- Fences are reported (in their spec) to be expensive. So it seems that you have found that using fences is cheaper than using glFlushPixelDataRange, right ?


Yep... glFlushPixelDataRange forces a flush of all pending operations in the PDR and forces the CPU to wait for the flush.


- Regarding your code, what do you do when glTestFenceNV returns false ? You still draw something ? with a previous texture ?
And what happens when FindFreeBuffer does not find a free buffer ?


Usually, I draw the previous frame. To avoid running out of free buffers, allocate enough of them. But when it does happen... just do a glFinishFenceNV on the oldest buffer (try adding a timestamp to the buffers).


- If I understand the PDR example of the PBO spec correctly, the glFlushPixelDataRangeNV is done before the glTexSubImage2D and not after, to take advantage of this asynchronicity. Then the first frame is not necessarily drawn correctly, right ?


Yes. But it still depends on image size... Maybe it can finish on time for small textures...
If your app uses vsync you can start transfers before the SwapBuffers call. It can help you a lot.


- How to be sure a transfer is finished with a PBO ?


I don't know. Maybe some glGetInteger(...) can tell you.


I must say that I am really impressed by your figures : 1.8GB/s, wooah !
Do you get similar results with PBO?


AGP 4x system ~920MB/sec
AGP 8x system ~1.8GB/sec

All this using PDR or PBO.

yooyo

Bozfr
04-21-2004, 04:35 AM
One more question:
Do you use 1 or several PBOs ?

Thanks again.

yooyo
04-21-2004, 02:00 PM
In the PBO codepath, you have to use several PBOs.

yooyo

Stephen Webb
10-28-2004, 11:00 PM
Originally posted by KuriousOrange:
I'm glad to see you say "buffered" in your answer - of course you're getting 40 odd mb/s buffered, it's coming from the cache! Whether that's the cache on the interface or the OS cache, no matter, ...
... If you were streaming data off the disk, you'd soon run into the true transfer rate, for you wouldn't be able to render faster than it! You'll probably find your real transfer rate is more like 15mb/s.

I just happened upon this thread again. I figured I'd respond to this even though it's probably long forgotten...

The buffered disk reads are not cached, just buffered. I ran it on a much larger scale to overwhelm any cache that might be involved.

Timing buffered disk reads: 1040 MB in 20.00 seconds = 52.00 MB/sec

The program does use some tricks to get this speed, and in practice I get about 2/3 of this rate. Drives are fast these days.

l_belev
10-30-2004, 05:14 AM
In my own experience, problems with texsubimage/etc. pixel transfer performance are most frequently caused by a format mismatch. When this happens, the driver does a software conversion and you suffer a big fps drop.
You have to use one of the native hw formats. For example on x86 (which is little endian) blue occupies the least significant bits, followed by green, red and then alpha (if present). To be safe, you can try all meaningful combinations of the "format" and "type" parameters of the function and see which one yields the best performance. You may also try different internal formats of the texture. If you use a compressed texture, forget about decent speeds with glTexSubImage - the driver must compress it in software. For compressed textures only glCompressedTexSubImage is fast, but then you have to provide already compressed data.

nystep
10-30-2004, 10:27 AM
Hi,

Actually, i already experienced the same issue with glTexSubImage2D some time ago when i was coding a procedural fire effect and uploaded the resulting texture calculated on the cpu at every frame.

My conclusions about it are that the bigger the texture you upload, the more the pixel transfer rate seems to slow down... Here are some benchmarks I had done:

512*128 => 190 fps : 12.4 MTexels/s
1024*256 => 30 fps : 7.8 MTexels/s
2048*512 => 4 fps : 4.1 MTexels/s

The decreasing texel rate is certainly not due to the application since i'm just doing a matrix convolution per texel...

The external format for the pixels was GL_RGBA, the internal format was the same. My system is an athlonxp 2000+, agp 4x, radeon 9600 pro.

The program was just a loop doing the texture update, upload it, and draw a quad on the screen with it.

I'm interested in knowing how much of a performance boost you could get with NV extensions... But on the other hand, maybe uploading small parts of your texture will be faster than uploading the whole thing in one single call. In any case you should seriously reconsider the need for textures of such size. Why not use rectangular textures?

regards,

knackered
11-02-2004, 01:16 PM
KuriousOrange is right, you don't get anything near those streaming read speeds using buffered async reading on mortal hard drives, even now...you mustn't be streaming. Also, if you were streaming, you would definitely not be complaining about sysmem->videomem transfer speeds, you'd have more pressing concerns.