Image processing on GPUs is arguably one of the most interesting parts of the GPGPU field.
I would like to discuss the current state of the art. Algorithms differ in the ratio of arithmetic complexity to memory bandwidth, and in my experience only algorithms that profit from the high bandwidth of GPU memory are really well suited to GPUs; many of the others incur considerable overhead when running on a GPU.
To start the discussion: one of the most important goals for GPGPU is to be able to use the GPU as a real coprocessor that delivers additional computational power to algorithms running on the CPU. That implies a number of requirements that are not fulfilled at the moment.
Upload AND readback AND rendering should be possible at the same time. This would, for example, enable streaming, pipelined image processing: while image n is being processed, image n+1 is uploaded and image n-1 is read back.
So, to use the full power of GPUs we would need some kind of interface (probably a new GL extension) that lets programmers explicitly control asynchronous data transfers to and from the GPU. I know that EXT_pixel_buffer_object delivers functionality like that, but it seems a bit too restricted. Why don’t we define an extension that allows explicit control over memory transfers? Unfortunately, I don’t know enough about this topic (DMA transfers etc.) to suggest what such an extension could look like, or whether it would be easy for the leading hardware manufacturers to support it.
One example to clarify: the data we want to process reside in general-purpose main memory. To process them I would call glTexSubImage, which triggers a copy of the data but gives no control over when they are actually transported to the GPU; they are probably transferred only when the GPU first accesses them. Alternatively, I could memcpy the data directly into the mapped memory of a PBO; the GL then has to perform another copy when transferring them to graphics memory, and the GPU in turn copies them from the PBO into a texture object. After processing, I would either do a (blocking) glReadPixels, or a (hopefully) asynchronous glReadPixels into a PBO (which triggers a copy of the data within graphics memory), and subsequently map the PBO via glMapBuffer and memcpy the data out of the mapped memory, since this is what the PBO spec recommends. Many, many copies…
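For concreteness, here is a rough sketch of the PBO upload and readback paths described above. This is illustrative only: it assumes a current GL context, EXT_pixel_buffer_object-style tokens, and a bound texture and framebuffer to work with; buffer names and sizes are placeholders.

```c
/* Illustrative sketch -- not runnable without a GL context and PBO support. */
GLuint pbo;
glGenBuffers(1, &pbo);

/* --- Upload path: memcpy into a mapped PBO, then TexSubImage from it --- */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER_EXT, size, NULL, GL_STREAM_DRAW);
void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, GL_WRITE_ONLY);
memcpy(dst, src_pixels, size);                  /* copy #1: app memory -> PBO */
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_EXT);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,   /* copy #2: PBO -> texture    */
                GL_BGRA, GL_UNSIGNED_BYTE, (void *)0); /* offset into the PBO */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, 0);

/* --- Readback path: ReadPixels into a PBO can return immediately; ---
   --- the later map blocks only until the transfer has completed   --- */
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER_EXT, size, NULL, GL_STREAM_READ);
glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, (void *)0);
/* ... do other work here to hide the transfer latency ... */
void *result = glMapBuffer(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY);
memcpy(dst_pixels, result, size);               /* final copy: PBO -> app memory */
glUnmapBuffer(GL_PIXEL_PACK_BUFFER_EXT);
glBindBuffer(GL_PIXEL_PACK_BUFFER_EXT, 0);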
To conclude, the data transfers involve a lot of overhead that would have to be avoided if GPUs are to be used efficiently as coprocessors.
Finally, another question: has anybody ever tried to do rendering, upload and readback at the same time? How exactly could I do that? I’m ready to do some experiments… Which hardware is theoretically able to do that? Do you guys from 3Dlabs, ATI and NVIDIA have any hints?