texSubImage… 50 MB/sec…

I’m using an FX5950 Ultra, and I’m doing a bunch of large glTexSubImage2D calls… (1024x1024)

Source data is in RGB format, target texture is in RGBA (or RGB, doesn’t seem to matter)…

I’m getting about 20 frames per second doing this copy and nothing else. Does this sound right, or am I getting screwed?

I was really hoping I could get more than that

Supposedly AGP 4X is enabled…system is P4 2.5 GHz…

I can read from my HD faster than 50 MB per second…something’s got to be wrong here…

Thanks

-Steve

Are you loading the image into GART-mapped system memory? That’s the only way you’ll get a DMA to the graphics card and achieve the best performance.

That sounds very slow.

I posted a question about performance of subimage2d a while ago http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/004092.html

It’s probably better if you use BGRA_EXT.

Also, if you don’t mind using NVIDIA-only extensions, PDR (NV_pixel_data_range) should give you slightly better performance.

But neither of those accounts for the low throughput you are seeing.

What speed do you get with a smaller texture?

Edit: This benchmark might help you http://www.adrian.lark.btinternet.co.uk/GLBench.htm

I get 270 MB/sec on my GF5900u, AGP 4x.

From the results it’ll be obvious whether it’s your code that’s causing the slowness or your system.

[This message has been edited by Adrian (edited 02-13-2004).]

Originally posted by dorbie:
Are you loading the image into GART-mapped system memory? That’s the only way you’ll get a DMA to the graphics card and achieve the best performance.

Dorbie or somebody else, please, could you tell me more about this?
How do you load the image into GART-mapped system memory?
How do you do that under Windows and/or Linux?

[This message has been edited by Bozfr (edited 02-13-2004).]

Originally posted by Stephen Webb:
I can read from my HD faster than 50 MB per second…something’s got to be wrong here…
-Steve

Cough scuse me? What type of hard drive is giving you such good read transfer rates?
The average I’ve found is around 15 MB/s, except on a decent RAID setup where you can get mad speeds.
Are you sure your timing code is accurate?

Well PCI i/o should be able to deliver much better performance than this anyway.

AGP memory pages are all over the place, so you need something that’s aware of the GART map and can contiguously allocate from that table; I think it has to be in the kernel. On Linux there’s a kernel module, agpgart, for this plus sample code; there’s also glXAllocateMemoryNV, and for Windows AGP allocation there’s wglAllocateMemoryNV.

I have not tested the relative performance for image transfers, it’s generally used for vertex arrays but it should be an ideal candidate for image subloads.
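To sketch what that might look like on Windows (untested, off the top of my head; assume the wglAllocateMemoryNV entry point has been fetched with wglGetProcAddress, and treat the frequency/priority arguments as hints you may need to tune — you’d normally pair this with NV_pixel_data_range so the driver can really DMA out of the buffer):

// Sketch: grab a GART/AGP-mapped buffer and subload from it.
PFNWGLALLOCATEMEMORYNVPROC wglAllocateMemoryNV =
    (PFNWGLALLOCATEMEMORYNVPROC)wglGetProcAddress("wglAllocateMemoryNV");

// readFreq/writeFreq/priority are hints in [0,1]; ~0.5 priority is the
// usual suggestion for AGP rather than video memory.
GLubyte *agpMem = (GLubyte *)wglAllocateMemoryNV(1024*1024*4, 0.0f, 0.0f, 0.5f);

if (agpMem)
{
    // fill agpMem with the image, then let the driver pull straight from it
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1024, 1024,
                    GL_BGRA, GL_UNSIGNED_BYTE, agpMem);
}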

I expect superbuffers will ultimately replace this for image transfers. You already have VBOs for vertex data, so this kind of direct allocation seems like it will eventually be deprecated.

I’d also try the full image transfer vs. the subload (and certainly full width) and watch your glPixelTransfer/glPixelStore settings. Any kind of stride you’ve set up (or anything off the beaten path) could have a disastrous impact on your performance purely for implementation/optimization reasons.
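For reference, the “nothing fancy” unpack state that keeps a subload on the straightforward path is just the GL defaults, something like:

// GL defaults: tightly packed rows, no skips, no scale/bias.
glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);
glPixelStorei(GL_UNPACK_SKIP_ROWS, 0);
glPixelStorei(GL_UNPACK_SKIP_PIXELS, 0);
glPixelStorei(GL_UNPACK_ALIGNMENT, 4);   // or 1 if your rows aren't 4-byte padded
glPixelTransferf(GL_RED_SCALE, 1.0f);    // scale 1 / bias 0, likewise for the other channels
glPixelTransferf(GL_RED_BIAS, 0.0f);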

[This message has been edited by dorbie (edited 02-13-2004).]

15 MB/sec? Maybe it’s time for you to upgrade :-). That seems far from what’s achievable from a single drive with large contiguous or sequential reads.
http://www.tomshardware.com/storage/20040209/seagate-03.html#benchmark_results

Originally posted by KuriousOrange:
Cough scuse me? What type of hard drive is giving you such good read transfer rates?

My Seagate SATA drive is capable of 50 MB per second but achieves around 40 MB/s on average.

Originally posted by KuriousOrange:
Cough scuse me? What type of hard drive is giving you such good read transfer rates?

Well, I don’t currently have access to the system I was referring to…But I’m pretty sure I was getting 52+ MB/sec.

Current system I’m getting around 42 megs/second:

red:/home/swebb# hdparm -t /dev/hda

/dev/hda:
Timing buffered disk reads: 64 MB in 1.53 seconds = 41.83 MB/sec

I don’t think either hardware is particularly special…UDMA, 7200 rpm IDE drives…?

-Steve

Originally posted by dorbie:
… GART map and can contiguously allocate from that table, I think it has to be in the kernel. On Linux there’s a kernel module agpgart … glXAllocateMemoryNV… should be an ideal candidate for image subloads.

Dorbie…others. Thanks for the pointers. I will be working with this and hopefully I will be able to crank up the performance.

I’ll report back with my progress…

Thanks again

-Steve

(BTW, I wasn’t able to run the benchmark because I am running on a Linux system. Is the source code for that executable available?)

I’m adding some new features, I’ll have the source and new exe up tomorrow.

Originally posted by Adrian:

I posted a question about performance of subimage2d a while ago http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/004092.html

I checked out that discussion and ran the test program that was provided. I get much better results (and quite acceptable) if I run the program as is…about 600-900 MB/sec (depending on texture size…)

But, if I add the line:

glBindTexture(GL_TEXTURE_2D, texID);

in the init function, AFTER the glTexImage2D call…my performance drops to about 72 MB/sec.

Doh!

I guess I don’t really understand what is going on under the covers, here…
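For the record, the ordering I’d expect to be the intended fast path is to bind the texture object before defining its storage, so the glTexSubImage2D calls being timed update the object that was actually primed — something like:

// bind first, then allocate storage once
GLuint texID;
glGenTextures(1, &texID);
glBindTexture(GL_TEXTURE_2D, texID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1024, 1024, 0,
             GL_BGRA, GL_UNSIGNED_BYTE, NULL);

// per upload: update the same object
glBindTexture(GL_TEXTURE_2D, texID);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1024, 1024,
                GL_BGRA, GL_UNSIGNED_BYTE, pixels);  // pixels = source image buffer (placeholder)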

-Steve

WRT the original question:

I suggest giving the data to GL in GL_BGRA/UNSIGNED_BYTE format (if you’re on an x86 CPU). This should result in optimal transfer rates.

Perhaps the driver needs to do a copy to AGP memory before uploading, but good ol’ PC-133 can copy at 512 MB/s, which is 10x what you’re seeing – and you probably have DDR-something in that box.

Also, make sure you calculate 4 bytes per pixel, not just the number of pixels, when doing your math :slight_smile:
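For example: 1024 x 1024 is about a million pixels, but at 4 bytes each that’s ~4 MB per frame, so 20 frames per second is ~80 MB/s of data (~60 MB/s if you count the 3-byte RGB source) — a factor of 3-4 more than a pixels-only count would suggest.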

Originally posted by Stephen Webb:
Well, I don’t currently have access to the system I was referring to…But I’m pretty sure I was getting 52+ MB/sec.

Current system I’m getting around 42 megs/second:

red:/home/swebb# hdparm -t /dev/hda

/dev/hda:
Timing buffered disk reads: 64 MB in 1.53 seconds = 41.83 MB/sec

I don’t think either hardware is particuarly special…UDMA, 7200 rpm IDE drives…?

-Steve
I had to search for this thread, as I got distracted way back in feb.
I’m glad to see you say “buffered” in your answer - of course you’re getting 40-odd MB/s buffered, it’s coming from the cache! Whether that’s the cache on the interface or the OS cache, no matter, it’s still a cache - and therefore involves just memory copies.
This is no indicator of the true transfer rate of your hard drive - generally the transfer rate in the spec of a hard drive is the speed of the interface rather than its physical ability to read from the disk. If you were streaming data off the disk, you’d soon run into the true transfer rate, because you wouldn’t be able to render faster than it! You’ll probably find your real transfer rate is more like 15 MB/s.
Use the Windows performance analyser(?) to benchmark your hard drive.
EDIT:
I’ve looked at your first post again. 1024 × 1024 × 3 bytes ≈ 3 MB. In order to read 20 frames per second you would require a transfer rate of 60 MB/s, which is in excess of even the buffered transfer rate you report, let alone the rate you’re actually getting.

Originally posted by Stephen Webb:
I’m using an FX5950 Ultra, and I’m doing a bunch of large glTexSubImage2D calls… (1024x1024)

Source data is in RGB format, target texture is in RGBA (or RGB, doesn’t seem to matter)…

I’m getting about 20 frames per second doing this copy and nothing else. Does this sound right, or am I getting screwed?

I was really hoping I could get more than that

Supposedly AGP 4X is enabled…system is P4 2.5 GHz…

I can read from my HD faster than 50 MB per second…something’s got to be wrong here…

Thanks

-Steve
You have to use PBOs or PDR + fences. On my system (P4 2.6, FX5900, AGP 8x) I got ~1.8 GB/sec upload speed. Note that these transfers are async, so the texture data will be “available” later in the frame, or maybe a few frames later!
When you use PBO or PDR the driver starts an async DMA transfer, and the CPU can continue its work. But if you try to change the source buffer while the transfer is not yet finished you’ll get corrupted texture data.

I think the best approach is using multiple PDR buffers and fences, without the glFlushPixelDataRange call. This will free your CPU.

Use the glTexSubImage2D call: glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, XRES, YRES, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, ptr);

yooyo

I think I may have misunderstood what is actually happening in his program here.
When he said:
“I can read from my HD faster than 50 MB per second…something’s got to be wrong here…”
He was merely comparing his texture upload rate to hard drive transfer rate, wasn’t he? He wasn’t actually saying he was reading the data he’s uploading to texture, was he?
I feel quite embarrassed, someone should have stopped me sooner - I’m still right about transfer rates though… he says looking sternly at dorbie.

But if you try to change the source buffer while the transfer is not yet finished you’ll get corrupted texture data.

I think the best approach is using multiple PDR buffers and fences, without the glFlushPixelDataRange call. This will free your CPU.
yooyo, could you explain more precisely how it works?
How can you be sure that your texture has been uploaded without glFlushPixelDataRange?
I am not familiar with fences…

Is there any chance to see some of your code?

It is easy… Allocate one big chunk with a wglMem = wglAllocateMemoryNV(NumBuffers*ImageSize, 0, 1, 1) call. Then split this big memory chunk into several smaller “transfer” buffers.

  
typedef struct tagBuffer
{
    GLubyte *ptr;      // pointer into the wglAllocateMemoryNV chunk
    GLuint   fence;    // NV_fence object guarding the last upload from ptr
    int      status;   // FREE or TRANSFER
    GLuint   texture;  // texture object this buffer uploads into
} Buffer;

Set up all these structures as follows:

 
// init code
glEnableClientState(GL_WRITE_PIXEL_DATA_RANGE_NV);
glPixelDataRangeNV(GL_WRITE_PIXEL_DATA_RANGE_NV, NumBuffers*ImageSize, wglMem);

Buffer buf[NumBuffers];
for (i = 0; i < NumBuffers; i++)
{
    buf[i].ptr = (GLubyte *)wglMem + i*ImageSize;
    glGenFencesNV(1, &(buf[i].fence));
    glGenTextures(1, &(buf[i].texture));
    glBindTexture(GL_TEXTURE_2D, buf[i].texture);
    // setup texenv, filtering...
    // allocate storage once; this format/type combination is hardware accelerated
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, XRES, YRES, 0, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, NULL);
    buf[i].status = FREE;
}

Somewhere in your code:

  
// Find a FREE buffer
index = FindFreeBuffer();
// Copy the data into that AGP buffer
memcpy(buf[index].ptr, srcbuff, ImageSize);
glBindTexture(GL_TEXTURE_2D, buf[index].texture);
// Start the texture transfer. This call is async: it returns immediately,
// and the upload is NOT finished yet.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, XRES, YRES, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, buf[index].ptr);
// Set a fence so we can later test whether the DMA has completed
glSetFenceNV(buf[index].fence, GL_ALL_COMPLETED_NV);
buf[index].status = TRANSFER;

When you want to draw the texture from some buffer you have to test whether the transfer is finished.

  
if (glTestFenceNV(buf[index].fence))
{
    // uploading is finished... you can render the texture
    buf[index].status = FREE;
}
else
{
    // do something else
}

Depending on your ImageSize, uploading can take 2-50 ms, but your CPU is free to do something else.
For example, if you do video playback you will get a 2-3 frame delay, but your CPU can deal with the decoder.

If your app really needs to render the image that is currently uploading, the CPU must wait until the transfer is finished, so you have to use the glFinishFenceNV() call. If you really have to call this function then you don’t need PDR (it is the same as the classic synchronous glTexSubImage2D codepath).
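For completeness (this isn’t in the code above), that synchronous fallback is just:

// Block until the DMA out of buf[index] has completed, then reuse the buffer.
glFinishFenceNV(buf[index].fence);
buf[index].status = FREE;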

Note that if a transfer is still pending and you change the data in that buffer you can expect corrupted texture data. The CPU can copy into the wgl memory buffer MUCH faster than the GPU can copy it from wgl memory to the texture.

In my player application I use 2-3 720x576x32 buffers and I have a 2-3 frame delay, but playback CPU usage is the same in my app as in Media Player (less than 20% for MPEG2). Without PDR my player spent more than 60% CPU time.

The code was written online so it may have some errors, but the clue is there… :slight_smile:

You can find a PBO example in its spec:
PBO spec

yooyo

Thanks a lot for the code.

As I’m trying to do something similar to you (a video player) and you have far more experience than me, I still have a couple of questions:

  • Fences are reported (in their spec) to be expensive. So it seems that you have found that using fences is cheaper than using glFlushPixelDataRange, right?

  • Regarding your code, what do you do when glTestFenceNV returns false? Do you still draw something? With a previous texture?
    And what happens when FindFreeBuffer does not find a free buffer?

  • If I understand the PDR example in the PBO spec correctly, the glFlushPixelDataRangeNV is done before the glTexSubImage2D and not after, to take advantage of the asynchronicity. Then the first frame is not necessarily drawn correctly, right?

  • How can you be sure a transfer is finished with a PBO?

I must say that I am really impressed by your figures: 1.8 GB/s, wooah!

Do you get similar results with PBO ?

  • Fences are reported (in their spec) to be expensive. So it seems that you have found that using fences is cheaper than using glFlushPixelDataRange, right?

Yep… glFlushPixelDataRange forces a flush of all pending operations in the PDR and forces the CPU to wait for that flush.

  • Regarding your code, what do you do when glTestFenceNV returns false? Do you still draw something? With a previous texture?
    And what happens when FindFreeBuffer does not find a free buffer?

Usually I’m drawing the previous frame. To prevent running out of free buffers you have to create enough buffers to avoid this case. But when it happens… just do a glFinishFenceNV on the oldest buffer (try to add a timestamp to the buffers).
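To sketch what that fallback could look like (my illustration only; it assumes a lastUsed timestamp field that isn’t in the struct above):

// Hypothetical helper: return a FREE buffer, recycling finished transfers
// first and, as a last resort, blocking on the oldest in-flight one.
int FindFreeBuffer(void)
{
    int i, oldest = 0;

    for (i = 0; i < NumBuffers; i++)
    {
        // An in-flight transfer that has completed can be reused.
        if (buf[i].status == TRANSFER && glTestFenceNV(buf[i].fence))
            buf[i].status = FREE;
        if (buf[i].status == FREE)
            return i;
        if (buf[i].lastUsed < buf[oldest].lastUsed)  // lastUsed: hypothetical timestamp field
            oldest = i;
    }

    // No free buffer: wait for the oldest upload to finish, then reuse it.
    glFinishFenceNV(buf[oldest].fence);
    buf[oldest].status = FREE;
    return oldest;
}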

  • If I understand the PDR example in the PBO spec correctly, the glFlushPixelDataRangeNV is done before the glTexSubImage2D and not after, to take advantage of the asynchronicity. Then the first frame is not necessarily drawn correctly, right?

Yes. But it still depends on the image size… Maybe it can finish in time for small textures…
If your app does vsync you can start the transfers before the SwapBuffers call. It can help you a lot.

  • How can you be sure a transfer is finished with a PBO?

I don’t know. Maybe some glGetInteger(…) call can tell you.


I must say that I am really impressed by your figures: 1.8 GB/s, wooah!
Do you get similar results with PBO?

AGP 4x system: ~920 MB/sec
AGP 8x system: ~1.8 GB/sec

All this using PDR or PBO.

yooyo