PDR, fast memcpy

I have searched and read the archives.
I have read “GL_NV_pixel_data_range.txt” (Matt Craighead / NVIDIA).
I have made a lot of progress, but I am still stuck in the ~100 MB / sec range.

What am I trying to do? Pull image data from { system memory | hard disk | network } and stuff it into textures as fast as possible.

I am running an FX-5950 Ultra card in a 2.53 GHz machine w/ Intel 845G/GL chipset.

Standard glTexSubImage2D with malloc’d RAM yields ~100 MB/sec. No surprise here, I don’t think.

My understanding is that the “proper” way to use the PDR extensions for pushing textures on to the card is:

  1. Allocate memory with glXAllocateMemoryNV, using read/write/priority parameters that will hopefully place the allocation in the AGP aperture.
  2. Enable the extension, and tell the driver where the memory is and how you plan to use it.
  3. Do large block copies into this memory.
  4. Use glTexSubImage2D (or whatever) to get it into a texture. (No borders, pixel transfer operations, or funny formats. GL_BGRA_EXT is what I used.)

Does memcpy() qualify for step 3?

SO…when I do all of this, I am able to EITHER get fast glTexSubImage2D performance, OR fast memcpy performance, but not both.

I wrote a program which iterates through all possible parameters for the glXAllocateMemoryNV call (in 0.1 increments) and times the various operations (100 cycles for each). I used glFlushPixelDataRangeNV() to block after the glTexSubImage2D call. I am using two textures and two memory ranges (allocated with glXAllocateMemoryNV as one large chunk), alternating between the two.

Here are two sample timing sequences:
(All tests are for 1024x1024x4 = 4 MB chunks)

glXAllocateMemoryNV parameters: 0.9, 0, 0
Timer stats after 100 iterations:
glTexSubImage : Elapsed time sec: 4 usec:220798
memset : Elapsed time sec: 0 usec:633433
memcpy : Elapsed time sec: 0 usec:811151
glutSwap : Elapsed time sec: 0 usec:5310

or

glXAllocateMemoryNV parameters: 0, 0.9, 0.5
Timer stats after 100 iterations:
glTexSubImage : Elapsed time sec: 0 usec:609059
memset : Elapsed time sec: 0 usec:605422
memcpy : Elapsed time sec: 4 usec:453805
glutSwap : Elapsed time sec: 0 usec:4046

Every combination of parameters that I tried gave (more or less) the same results as one of the above two tests (or returned a NULL pointer).
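To put the two timing sets above on a common scale, the elapsed times can be converted to throughput. This is only a rough sketch: it assumes each of the 100 iterations moved exactly one 4 MB chunk, and the helper name `throughput_mb_per_sec` is made up for illustration.

```python
CHUNK_MB = 4       # 1024 x 1024 pixels x 4 bytes, as stated above
ITERATIONS = 100   # cycles timed per operation

def throughput_mb_per_sec(sec, usec, chunk_mb=CHUNK_MB, iterations=ITERATIONS):
    """Convert a 'sec / usec' elapsed-time pair into MB/sec (hypothetical helper)."""
    elapsed = sec + usec / 1_000_000
    return chunk_mb * iterations / elapsed

# Timings copied from the two runs above, keyed by glXAllocateMemoryNV parameters.
timings = {
    "0.9, 0, 0":   {"glTexSubImage": (4, 220798), "memcpy": (0, 811151)},
    "0, 0.9, 0.5": {"glTexSubImage": (0, 609059), "memcpy": (4, 453805)},
}

for params, ops in timings.items():
    for op, (sec, usec) in ops.items():
        rate = throughput_mb_per_sec(sec, usec)
        print(f"{params:12s} {op:14s} {rate:7.1f} MB/sec")
```

With these numbers, the first run works out to roughly 95 MB/sec for glTexSubImage and 493 MB/sec for memcpy, and the second to roughly 657 MB/sec and 90 MB/sec, which is the "one fast, one slow" pattern described above.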

FAST WRITES ARE NOT CURRENTLY ENABLED. I’m not sure if this matters, and I’m fairly confident that I should be able to achieve better performance without them anyway.

It’s pretty clear to me that memcpy CAN be fast. It’s also fairly clear that in some cases my glTexSubImage2D is fast. My 90ish MB/sec bottleneck has me guessing that I’m going over the PCI bus at some point…

This would make sense if:
When glTexSubImage2D is slow, the memory returned by glXAllocateMemoryNV() is just “normal” system memory (hence no fast path to the card).

When the memcpy is slow, the memory returned by glXAllocateMemoryNV() is video memory, hence no fast path for the memcpy()…

Which leaves me wondering why I can’t get AGP memory to be allocated, as I’ve tried all (more or less) combinations of parameters to glXAllocateMemoryNV…

Are fast writes necessary? I guess I’m fairly convinced that it’s time to start upgrading my kernel. When that doesn’t work I will be looking for a motherboard which is supported by the NVidia AGP drivers…and if necessary I’ll run this project on Windows.

Arg!

Thanks, guys…
-Steve

I know glFlushPixelDataRangeNV is supposed to block, but from your figures it looks as though it isn’t, and that the blocking is occurring in the memcpy instead. Try adding a glFinish before the flush and see if that makes any difference.

I don’t think PDR would make the huge difference in speed your figures suggest; its main advantage is the asynchronous behaviour. At least that was the case with ReadPixels.

I ran the tests again with the glFinish().

I didn’t notice any appreciable difference.

My next step is to try to figure out what actual physical addresses are being allocated when I call glXAllocateMemoryNV() so I can determine (hopefully) where the ram is being allocated (video mem, system ram, or AGP ram…)

My theory had been that I was getting video RAM in some cases and system RAM in others, explaining why I was getting “PCI speeds” in one of the two calls (memcpy or glTexSubImage2D), but I think the glTexSubImage2D would actually be much faster if the memory were on the card, no?

Now I have to wonder (assuming the timings for the TexSubimage2D are correct) if I am getting AGP transfer rates (400 megs in 0.6 seconds = 660 MB/sec)…and somehow my memcpy to the agp memory is slow…?

I’m getting outside of my area here, so I’ll stop speculating.

-Steve

Stephen,

You should be able to get around 450 MB/sec texture downloads with PDR on a decent machine.

I have a couple of questions:

  1. What driver version are you running?

  2. Have you tried running Matt Craighead’s PDR benchmark? What results does it give?
    http://www.adrian.lark.btinternet.co.uk/PixPerf.zip

Note that we hope to soon replace PDR with the pixel buffer object extension, which is the pixel equivalent of vertex buffer objects (VAR->VBO, PDR->PBO).

Thanks,

Simon.

Simon,

Thanks for your help…

To answer your questions, I am running the 53.36 drivers. I neglected to mention that I am running Linux. Perhaps this is the problem? I have tried running the PixPerf program, but so far I have not been able to get it to work under Linux (the program crashes in glutInit(), and I haven’t tracked down the problem yet).

Hopefully I will be able to figure it out soon…

Thanks again,

Steve

EDIT:

I’m still getting some weird problems with glutInit() seg faulting…but I was able to get the program to run, sort of…

Results are:
-read -type ubyte -format bgra -size 128
39.718643 Mpixels/sec

-read -type ubyte -format bgra -size 256
44.482670 Mpixels/sec

-read -type ubyte -format bgra -size 512
45.567848 Mpixels/sec

-read -type ubyte -format bgra -size 1024
46.067711 Mpixels/sec

With the -readpdr option, I get curious results…

All tests with size 512 or less give results very similar to the ones above. For size 1024, my performance numbers jump to:
-read -type ubyte -format bgra -size 1024 -readpdr
171.364716 Mpixels/sec
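Since the rest of the thread talks in MB/sec, it may help to convert these Mpixels/sec figures. The sketch below assumes 4 bytes per pixel (BGRA with unsigned-byte components, per the `-type ubyte -format bgra` options) and treats 1 MB as 10^6 bytes; the function name is hypothetical.

```python
BYTES_PER_PIXEL = 4  # assumption: BGRA, one unsigned byte per component

def mpixels_to_mb_per_sec(mpixels_per_sec, bytes_per_pixel=BYTES_PER_PIXEL):
    """Convert Mpixels/sec to MB/sec, with 1 MB taken as 10^6 bytes (hypothetical helper)."""
    return mpixels_per_sec * bytes_per_pixel

# Results reported above: plain ReadPixels at size 1024, and the -readpdr run.
plain_1024 = mpixels_to_mb_per_sec(46.067711)
readpdr = mpixels_to_mb_per_sec(171.364716)

print(f"plain read, size 1024: {plain_1024:.1f} MB/sec")
print(f"-readpdr run:          {readpdr:.1f} MB/sec")
```

Under those assumptions the plain read is about 184 MB/sec and the -readpdr figure about 685 MB/sec.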

I’ll hammer on this some more, and see what I can come up with…

Thanks again,

-Steve

[This message has been edited by Stephen Webb (edited 02-25-2004).]

Do you have working AGP?

Else, you’ll be getting PCI transfers, which theoretically top out at 133 MB/s, but you won’t quite get there in reality on any existing system.
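For reference, the theoretical peaks follow from clock rate times bus width times transfers per clock. This is a back-of-the-envelope sketch using the nominal figures for 32-bit/33 MHz PCI and 32-bit/66 MHz AGP; the function name is made up, and real-world throughput will be lower than these peaks.

```python
def peak_mb_per_sec(clock_mhz, bus_width_bits, transfers_per_clock=1):
    """Theoretical peak bandwidth in MB/sec (hypothetical helper, 1 MB = 10^6 bytes)."""
    return clock_mhz * (bus_width_bits / 8) * transfers_per_clock

# Nominal bus parameters: PCI is 32 bits at ~33.33 MHz, AGP is 32 bits at ~66.66 MHz.
pci = peak_mb_per_sec(33.33, 32)
agp = {mult: peak_mb_per_sec(66.66, 32, mult) for mult in (1, 2, 4, 8)}

print(f"PCI:    {pci:7.1f} MB/sec")
for mult, rate in agp.items():
    print(f"AGP {mult}x: {rate:7.1f} MB/sec")
```

This gives about 133 MB/sec for PCI, and about 266, 533, 1066 and 2133 MB/sec for AGP 1x, 2x, 4x and 8x respectively.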

Those numbers look normal. You should be seeing slightly higher numbers with PDR, about 50 Mpixels/sec.

I also get a large number (171 Mpixels/sec) for size 1024; I’m pretty sure this is a driver bug. With previous drivers it used to crash on that size.

I presume you know that the PixPerf benchmark is for ReadPixels, DrawPixels and CopyPixels, not glTexSubImage2D.

Here’s an interesting question for the 1024x1024 test…does the performance of ReadPixels depend on the contents of the read buffer? If the contents are undefined (after a swap, for example), then perhaps it’s being optimized away.

I found that you get “normal” values when you put a bitmap in there.

-Won

Won, What sort of values do you get with a bitmap in there?

[This message has been edited by Adrian (edited 02-26-2004).]

I have made a test app that uploads a 720x576x32bpp buffer to a texture_rectangle texture.
On my machine (P4 2.4 / GF4800SE) I got 266 MB/sec.

This is exactly the theoretical peak of AGP 1x (AGP 2x peaks at about 533 MB/sec). I’m not using PDR. This is a simple copy of data from system memory to a texture.

I have tried to use PDR, but I got the same speed…

btw… I’m using a GL_BGRA texture.

yooyo

Adrian – I get the 50MPixels with the bitmap.

Yooyo – we’re talking about download, not upload.

-Won