gpgpu image processing

Image processing on GPUs could be considered the most interesting part of the gpgpu subject.

I would like to discuss the current state of the art. Algorithms differ in their ratio of arithmetic complexity to memory bandwidth. In my experience, only algorithms that profit from the high bandwidth of GPU memory are really well suited to GPUs. Many of the others incur a lot of overhead when running on GPUs.

To start the discussion: One of the most important aspects for gpgpu would be to be able to use the GPU as a real coprocessor that delivers additional computational power for algorithms running on CPUs. That would include a lot of requirements that are not fulfilled at the moment.

Upload AND readback AND rendering should be done at the same time. This would, for example, enable streaming, pipelined image processing: while image n is processed, image n+1 is uploaded and image n-1 is read back.
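To make the overlap concrete, here is a minimal CPU-only sketch of the schedule described above. It does not touch the GL at all; the function name pipeline_schedule and the assumption that each stage takes exactly one time step are illustrative only.

```python
def pipeline_schedule(num_images):
    """Return, per time step, which image is uploaded, processed and read back.

    Assumes (for illustration only) that every stage takes one time step.
    """
    steps = []
    # Two extra steps drain the pipeline after the last upload.
    for t in range(num_images + 2):
        upload = t if t < num_images else None
        process = t - 1 if 0 <= t - 1 < num_images else None
        readback = t - 2 if 0 <= t - 2 < num_images else None
        steps.append((upload, process, readback))
    return steps


for upload, process, readback in pipeline_schedule(4):
    print(f"upload={upload} process={process} readback={readback}")
```

Under that one-step-per-stage assumption, n images need n + 2 steps when the three stages overlap, instead of 3n steps when they run strictly one after another.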

So, to use the full power of GPUs we would need some kind of interface (probably a new extension to the GL) that lets programmers explicitly control asynchronous data transfer to and from the GPU. I know that ext_pixel_buffer_object delivers functionality like that, but it seems a bit too restricted. Why don’t we define an extension that allows explicit control of memory transfers? Unfortunately, I don’t know enough about this topic (DMA transfers etc.) to suggest what this could look like or whether it would be easy for the leading hardware manufacturers to support it.

One example to clarify: the data we want to process are located in general-purpose main memory. To process them I would call glTexSubImage, which triggers a copy of the data but gives no control over when the data are transported to the GPU. The data are probably transferred only when they are accessed by the GPU. As an alternative I could memcpy the data directly into the mapped memory of a PBO. Then the GL would have to make another copy when transferring the data to graphics memory, and then the GPU would have to copy the data from the PBO into a texture object. After processing, I would do either a (blocking) glReadPixels or a (hopefully) asynchronous glReadPixels into a PBO (which triggers a copy of the data within graphics memory), then map the PBO via glMapBuffer and memcpy the data out of the mapped memory, since this is recommended in the PBO spec. Many, many copies…

To conclude, there seems to be a lot of overhead in the data transfers that should be avoided in order to use GPUs efficiently as coprocessors.

Finally, another question: has anybody ever tried to do rendering, upload and readback at the same time? How exactly could I do that? I’m ready to do some experiments… Which hardware is theoretically able to do that? Do you guys from 3dlabs, ati, nvidia have any hints?

First of all, this is one of my pet peeves, but I don’t think image processing really falls under the category of “GPGPU”. GPUs are designed for processing images.

Yes, asynchronous image download and readback is an important issue for this kind of application. The PBO extension provides this functionality at a relatively high level of abstraction. A lower level interface would be possible, but this isn’t really the OpenGL philosophy, and would be hard to do in a hardware-independent way.

There is some scope for improving current driver implementations of PBO to improve performance. But some of the limitations are due to the hardware - in most current GPUs any kind of readback will stall the pipeline.

Despite these issues, many software vendors (Apple, Adobe, Avid) are starting to use the GPU for image and video processing, with a lot of success.

The latest version of our SDK includes a new PBO example:
http://download.developer.nvidia.com/dev…ePerformancePBO

First, if you’re really sure about what you are thinking, then go and study ‘digital sciences’, I mean hardware. Then make your own new gpu :wink:

Second, I can’t say I have sufficient knowledge to really discuss these issues further. But what I do know is that hardware is the only limit on software. And even if hardware is not evolving that quickly (though I think it is), software engineers are the first to notice possible new directions for hardware functionality or technology. Indeed, hardware engineers live inside their ‘hard world’ and can fail to see other approaches. [I don’t mean software engineers are the best at discovering new technologies]

Third, about the parallelism you mentioned last. Generally, you’ll need multithreading, and on top of that two CPUs, maybe even two graphics cards, because many OpenGL commands do not return before they have finished. Personally, I don’t know how the whole system would behave then.

> Upload AND readback AND rendering

The design of current RAM prevents reading and writing at the same time. Only a single device, such as a DMA controller, can access it at any one time, and only for a maximum period before it has to let go.

Uploading while GPU is processing is what video cards are best at. Not the other one.

There is a thread about PCIExpress and readback with PBO that should be listed here.

Here are some readback related threads…

Topic: framebuffer readback performance
http://www.opengl.org/discussion_boards/cgi_directory/ultimatebb.cgi?ubb=get_topic;f=3;t=012788

This thread says that nv40 does 3x to 5x faster readback than older hardware, approximately 600 MB/s. This is true even for AGP boards. The cards seem to have new hardware support for faster readback, even on non-Quadro boards.

Topic: PCI express readpixels performance
http://www.opengl.org/discussion_boards/cgi_directory/ultimatebb.cgi?ubb=get_topic;f=3;t=012129

This one says that up to 1GB/s is achievable, depending on the motherboard. (“I have seen it”)

Most other threads are very old. They mainly contain unconfirmed speculations about nv40 capabilities…

Topic: NV40 glReadPixels…
http://www.opengl.org/discussion_boards/cgi_directory/ultimatebb.cgi?ubb=get_topic;f=3;t=011846

Topic: PCI Express - Is anything going to change?
http://www.opengl.org/discussion_boards/cgi_directory/ultimatebb.cgi?ubb=get_topic;f=3;t=011677

Michael

Simon,

you are definitely right, image processing shouldn’t be called gpgpu :wink:

> A lower level interface would be possible, but this isn’t really the OpenGL philosophy, and would be hard to do in a hardware-independent way.

I agree, but as long as we have to deal with special hardware, it would be adequate to expose a special API for it. Concerning image processing, what about a new subset of GL specialized for that purpose? It should be abstract enough to be hardware independent and special enough to be optimized.

<begin brainstorm>

We could define a new GL object called “image stream object”. This is an abstract object that reads input images (from a defined source), executes some fragment shaders for these images and writes the result (to a defined location). Source and destination could be located in graphics or main memory.

It would consist of the following:

  • one (or more) stream varying images that define the input stream(s)
  • a set of fragment shaders processing these images
  • some texture objects and constants acting as additional inputs for shaders
  • some output buffers that store the result images
  • some additional functions that would be responsible for triggering data transfers e.g. “processImages(numImages, w, h, internalFormat, void* source, void *destination)”

The data are streamed through all shaders one after another. The output of one shader defines the input of the next one. The output of the last shader is written back to main memory (or remains in graphics memory). At any time one could possibly query the number of completed images of the stream.
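As a rough CPU-side sketch of the idea above: every name here (ImageStream, process_images, brighten, invert) is hypothetical and only mimics the proposed object, with plain Python functions standing in for fragment shaders and lists standing in for images.

```python
class ImageStream:
    """Hypothetical stand-in for the proposed "image stream object"."""

    def __init__(self, stages):
        self.stages = list(stages)  # stand-ins for fragment shaders
        self.completed = 0          # number of images fully processed so far

    def process_images(self, source):
        """Stream every image in `source` through all stages, in order."""
        destination = []
        for image in source:
            for stage in self.stages:
                image = stage(image)  # output of one stage feeds the next
            destination.append(image)
            self.completed += 1
        return destination


def brighten(image):
    return [min(255, p + 10) for p in image]


def invert(image):
    return [255 - p for p in image]


stream = ImageStream([brighten, invert])
results = stream.process_images([[0, 100, 250], [5, 5, 5]])
print(results, stream.completed)
```

A real implementation would of course run the stages on the GPU and overlap transfers with processing; the sketch only shows the shape of the interface: a source, a chain of stages, a destination, and a queryable completion count.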

By defining this narrowly scoped extension we would encourage driver developers to support AND optimize for it. By avoiding explicit memory allocations etc. we get a universal mechanism.

Internally, the driver could even decide that half of that image stream pipeline could run on the first GPU and the other half on the second GPU in a multi-GPU setting.

<end brainstorm>

> in most current GPUs readback will stall the pipeline

Are there any examples not stalling? Is this a hardware or driver issue? We could read back from one FBO and render into another…

>The latest version of our SDK includes a new PBO example

I tried it, but the results show that “asynchronous” is not yet working very well (with release 65 drivers). The kind of asynchrony in the SDK means triggering an upload or readback while the CPU is doing something useful. Is it possible to get GPU-asynchronous operations? That is, the GPU itself would do some operations asynchronously, e.g. upload+download via PCIExpress or render+readback…

Michael

V-man,

> The design of current RAM prevents read and write at the same time.

I see. If upload and rendering is good, why is rendering and readback (from another buffer) evil? I’m not into the hardware architecture but isn’t it the same: a data transfer while the pipeline is working.

> Uploading while GPU is processing is what video cards are best at.

What’s the standard way to do that, could you please give me an example? I’m ready to do some experiments.

Michael

For hardware image processing I’d like to see programmable blend modes instead of just having a few configurable types that most of the time aren’t even useful if you’re using the movie industry standard of premultiplied alpha.
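For readers not familiar with premultiplied alpha, here is a minimal sketch of the standard “over” operator on premultiplied RGBA values, written as an ordinary function the way a programmable blend stage might express it. The function name over_premultiplied and the tuple layout (r, g, b, a) with components in [0, 1] are assumptions for illustration.

```python
def over_premultiplied(src, dst):
    """Composite src over dst; both are premultiplied (r, g, b, a) tuples."""
    src_alpha = src[3]
    # With premultiplied colors every channel, alpha included, uses the
    # same formula: out = src + (1 - src_alpha) * dst
    return tuple(s + (1.0 - src_alpha) * d for s, d in zip(src, dst))


half_red = (0.5, 0.0, 0.0, 0.5)   # 50% opaque red, premultiplied
solid_blue = (0.0, 0.0, 1.0, 1.0)
print(over_premultiplied(half_red, solid_blue))
```

With a programmable blend stage, any such per-pixel function of source and destination could be used, not only the ones that the fixed set of blend factors can express.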

Also, I’d like to see GPU functionality extended in some way that would make recursive filtering and ping/pong between render targets faster… e.g., some way to read and write from the same render target, or in pixel shaders some memory space to write temporary results that could later be retrieved and used between different invocations of the shader program. Btw, for those who don’t know, recursive filters are ways of doing filter operations with arbitrary kernel size in constant time per sample.
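One simple instance of the recursive idea, as a CPU sketch: a 1D box sum where each output is derived from the previous one by adding the sample that enters the window and subtracting the one that leaves it. The function name box_sum_recursive and the choice to clip the window at the borders are assumptions for illustration; IIR filters follow the same pattern of reusing earlier results.

```python
def box_sum_recursive(signal, radius):
    """Sum of signal[i-radius .. i+radius] for every i, clipped at the borders.

    Each output reuses the previous window sum, so the work per sample does
    not grow with the radius.
    """
    n = len(signal)
    out = []
    # Initial window for sample 0 covers indices 0..radius (clipped to n).
    window = sum(signal[0:min(n, radius + 1)])
    for i in range(n):
        out.append(window)
        entering = i + radius + 1  # index that joins the window for sample i+1
        leaving = i - radius       # index that drops out for sample i+1
        if entering < n:
            window += signal[entering]
        if leaving >= 0:
            window -= signal[leaving]
    return out


print(box_sum_recursive([1, 2, 3, 4, 5], 1))
```

The dependence of each output on the previous one is exactly what makes this awkward on current GPUs: a fragment cannot read the result its neighbour has just written, hence the wish for readable render targets or scratch memory shared between invocations.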

The problem is that this keeps pushing the GPU closer and closer to a general purpose CPU + vector processing unit.

> Source and destination could be located in graphics or main memory.

That’s the restriction that creates problems, and is why specs won’t be accepted for it. DMA’ing to arbitrary memory is not a reasonable thing; DMA-able memory needs to be allocated “correctly”. Malloc/new isn’t what you need; you need some things that most users just don’t have access to. You need to be a driver to create the kind of memory that you can just DMA to/from. There are alignment restrictions and cache coherency issues that need to be dealt with, and the user simply doesn’t have an API for that.

At best, you might see something like VAR, which gives you a memory allocator. But I wouldn’t suspect that the ARB would approve such an extension. They’d rather you used standardized buffer objects or renderbuffers or things of that nature.

PBO provides a happy medium between the needs of the driver in terms of where memory goes and the needs of the user in terms of async behavior.

I also don’t think the ARB is going to be happy with creating an entirely new “mode” of rendering when they already have one that works just fine. With ARB_FBO, you can already create the API you described. And, depending on how good the implementation/hardware is, it could be done asynchronously.

stephen_h,

I like the idea of programmable blending. This will hopefully be implemented in the next generation of hardware. Unfortunately, we won’t see it until the end of 2005 or later.

Ping/pong rendering will hopefully be faster as soon as an FBO implementation is available.
(btw. are there any drivers out there supporting it?)
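For reference, ping/pong rendering boils down to the following pattern, shown here as a CPU sketch with two plain lists in place of render targets. The names ping_pong and neighbour_sum are hypothetical, and the “shader” is just a 3-tap sum with clamped borders.

```python
def ping_pong(initial, kernel_pass, num_passes):
    """Run kernel_pass repeatedly, swapping read and write buffers each time."""
    read_buf = list(initial)
    write_buf = [0] * len(initial)
    for _ in range(num_passes):
        kernel_pass(read_buf, write_buf)         # read one buffer, write the other
        read_buf, write_buf = write_buf, read_buf  # swap roles for the next pass
    return read_buf


def neighbour_sum(src, dst):
    """Stand-in for a shader pass: sum of each sample and its two neighbours."""
    n = len(src)
    for i in range(n):
        left = src[max(i - 1, 0)]
        right = src[min(i + 1, n - 1)]
        dst[i] = left + src[i] + right


print(ping_pong([0, 1, 0], neighbour_sum, 2))
```

On the GPU the swap corresponds to changing which texture is bound for reading and which one is the render target, which is the step that a render-target switch makes expensive.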

For details about FBO see http://www.opengl.org/documentation/extensions/EXT_framebuffer_object.txt

I wouldn’t expect that we will soon see read/write access to render targets, i.e. framebuffer readback within a fragment shader. That would be a severe change of the design of current hardware (speculation).

> The problem is that this keeps pushing the GPU closer and closer to a general purpose CPU + vector processing unit.

Yes, that would be a consequence. The gpu will get more general but probably less efficient. Btw. gpus ARE vector processors, they are working on streams of vertices or fragments. Unfortunately the stream cannot always read or write in main memory. E.g. vertex arrays are a stream source located in main memory (so called client state).

>> Source and destination could be located in graphics or main memory.

> That’s the restriction that creates problems, and is why specs won’t be accepted for it. DMA’ing to arbitrary memory is not a reasonable thing; DMA-able memory needs to be allocated “correctly”.

I see. But the user doesn’t need (or want) to allocate. User data are usually located within vanilla main memory. The GL should be able to stream data out of a user-given block of memory. Or the user should get a safe way to allocate DMA’able memory and get some kind of handle to it. Btw., why is it suggested/requested to keep a PBO mapped for as short a time as possible (see ext_pixel_buffer_object.txt)? Are these mapped data somehow in danger?

> But I wouldn’t suspect that the ARB would approve such an extension. They’d rather you used standardized buffer objects or renderbuffers or things of that nature.

I understand. I will use standard methods if possible, but I have the impression that the GL lacks streaming support. The best way to do that should be discussed further. Are there ways to combine existing standardized buffer objects with streaming?

> PBO provides a happy medium between the needs of the driver in terms of where memory goes and the needs of the user in terms of async behavior.

Could we add PBOs as a new kind of input for fragment shaders? Similar to a sampler. Since the PBO is simply an array of bytes, we would have to specify the kind of access we would like to have.

E.g. float4 = pbo2D(pbohandle, RGBA32, i, j)

That would save a glTexSubImage copying from PBO to a texture object.
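What such an access would have to do can be sketched on the CPU. Everything here is an assumption for illustration: the function name pbo2d, the reading of “RGBA32” as four 32-bit floats per texel, the tightly packed row-major layout, and little-endian byte order.

```python
import struct

BYTES_PER_TEXEL = 16  # four 32-bit floats, the assumed meaning of "RGBA32"


def pbo2d(buffer, width, i, j):
    """Fetch texel (column i, row j) as four floats from a raw byte buffer."""
    offset = (j * width + i) * BYTES_PER_TEXEL
    return struct.unpack_from("<4f", buffer, offset)


# Build a 2x2 "PBO" whose texel (x, y) stores (x, y, 0.5, 1.0).
width, height = 2, 2
texels = [(float(x), float(y), 0.5, 1.0) for y in range(height) for x in range(width)]
raw = b"".join(struct.pack("<4f", *t) for t in texels)

print(pbo2d(raw, width, 1, 0))
```

Since the buffer itself carries no format, the width and the texel format have to be supplied with every access, which is the “kind of access” that would need to be specified.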

> I, also, don’t think the ARB is going to be happy with creating an entirely new “mode” of rendering when they already have one that works just fine.

It wasn’t my aim to modify the degree of happiness of ARB members :wink:

> With ARB_FBO, you can already create the API you described.

Yes, but probably with a significant amount of overhead. Defining real stream objects gives driver developers room for special optimizations.
The whole process from reading main memory to writing main memory could be abstracted away.
We would simply specify the source and destination of a stream and the whole magic would happen internally. The advantage of a special streaming mode would be that it could (and would have to) be handled specially within the driver.

cass,

when will we be able to sample a pbo from within a shader? I think that kind of access would be useful for lots of streaming applications. It would make a copy action redundant.

cass,

I am interested in asynchronous transfers and I’m not able to get a real speedup by using pbo compared to traditional TexSubImage or ReadPixels with release 60 drivers. Will streaming be better supported in release 70?

Will we see asynchronous transfers for the FBO extension? It would be really great!