Normal path:

1. Render scene into uncompressed framebuffer.
2. Sample from that for post fx, etc.

Suggestion:

1. Render scene into uncompressed framebuffer.
2. Compress framebuffer into BCn texture format.
3. Sample from compressed texture.

The tradeoff is compression cost and artifacts vs. the memory bandwidth saved when sampling the texture. So this only makes sense in situations where the texture is sampled a lot.
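To put rough numbers on the bandwidth side of that tradeoff, here is a small sketch that compares the storage size of an uncompressed image with its BCn equivalent. The function names are made up for illustration. It only counts bytes and does not model texture caches or the cost of the compression pass itself.

```c
#include <stddef.h>

/* Size in bytes of an uncompressed image.
 * bytes_per_pixel is 4 for RGBA8. */
size_t uncompressed_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Size in bytes of a BCn image.
 * BCn formats store 4x4 texel blocks, so both dimensions are rounded up
 * to a multiple of 4. block_bytes is 8 for BC1/BC4 and 16 for
 * BC2/BC3/BC5/BC6H/BC7. */
size_t bcn_bytes(size_t width, size_t height, size_t block_bytes)
{
    size_t blocks_x = (width + 3) / 4;
    size_t blocks_y = (height + 3) / 4;
    return blocks_x * blocks_y * block_bytes;
}
```

For a 1920x1080 RGBA8 framebuffer this gives 8,294,400 bytes uncompressed against 1,036,800 bytes as BC1, i.e. one eighth of the data to fetch for a full read of the texture. With a 16-byte block format the ratio is one quarter.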

Areas where this "BC pass" could be applied:

  • Impostors
  • Deferred Shading
  • Blurs, SSAO, etc.


Implementation

The ideal solution would be for drivers to run gl(Copy)Tex(Sub)Image2D calls targeting a BCn internal format on the GPU. That way, it could even accelerate existing code without any changes.

In a test case, compressing a 1080p RGBA8 framebuffer to BC1 with glCopyTexImage2D cost 18 ms on a GeForce 680. I have no clue how to determine whether that compression runs on the CPU or the GPU, but I'm fairly certain it's the CPU, because C libraries were at some point mentioned in conjunction with BCn compression.
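For a sense of the per-block work involved, here is a deliberately naive BC1 block encoder in plain C. It is a sketch, not what any driver or library actually does. The names are hypothetical, it ignores alpha, and it picks endpoints from the per-channel min/max of the block, which is fast but far from the best quality. The point it illustrates is that every 4x4 block is encoded independently of all others, which is why the job should map well onto a GPU.

```c
#include <stdint.h>
#include <limits.h>

/* Pack 8-bit RGB into a 16-bit RGB565 value. */
static uint16_t pack565(int r, int g, int b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Expand an RGB565 value back to 8 bits per channel. */
static void unpack565(uint16_t c, int rgb[3])
{
    int r = (c >> 11) & 31, g = (c >> 5) & 63, b = c & 31;
    rgb[0] = (r << 3) | (r >> 2);
    rgb[1] = (g << 2) | (g >> 4);
    rgb[2] = (b << 3) | (b >> 2);
}

/* Encode one 4x4 block of RGBA8 texels (row-major, 64 bytes, alpha
 * ignored) into 8 bytes of BC1: two little-endian RGB565 endpoints
 * followed by sixteen 2-bit palette indices. */
void bc1_encode_block(const uint8_t rgba[64], uint8_t out[8])
{
    int lo[3] = {255, 255, 255}, hi[3] = {0, 0, 0};
    for (int i = 0; i < 16; ++i) {
        for (int c = 0; c < 3; ++c) {
            int v = rgba[i * 4 + c];
            if (v < lo[c]) lo[c] = v;
            if (v > hi[c]) hi[c] = v;
        }
    }

    /* hi >= lo in every channel, so c0 >= c1 as packed integers. */
    uint16_t c0 = pack565(hi[0], hi[1], hi[2]);
    uint16_t c1 = pack565(lo[0], lo[1], lo[2]);

    uint32_t indices = 0; /* all zero selects c0 for every texel */
    if (c0 != c1) {
        /* c0 > c1 selects the four-color mode of BC1. */
        int pal[4][3];
        unpack565(c0, pal[0]);
        unpack565(c1, pal[1]);
        for (int c = 0; c < 3; ++c) {
            pal[2][c] = (2 * pal[0][c] + pal[1][c]) / 3;
            pal[3][c] = (pal[0][c] + 2 * pal[1][c]) / 3;
        }
        /* Pick the nearest palette entry for each texel. */
        for (int i = 0; i < 16; ++i) {
            int best = 0, best_err = INT_MAX;
            for (int p = 0; p < 4; ++p) {
                int err = 0;
                for (int c = 0; c < 3; ++c) {
                    int d = rgba[i * 4 + c] - pal[p][c];
                    err += d * d;
                }
                if (err < best_err) { best_err = err; best = p; }
            }
            indices |= (uint32_t)best << (2 * i);
        }
    }

    out[0] = (uint8_t)(c0 & 0xFF);
    out[1] = (uint8_t)(c0 >> 8);
    out[2] = (uint8_t)(c1 & 0xFF);
    out[3] = (uint8_t)(c1 >> 8);
    out[4] = (uint8_t)(indices & 0xFF);
    out[5] = (uint8_t)((indices >> 8) & 0xFF);
    out[6] = (uint8_t)((indices >> 16) & 0xFF);
    out[7] = (uint8_t)(indices >> 24);
}
```

A 1080p frame is 480 x 270 = 129,600 such blocks. On a CPU they are processed in a loop, on a GPU each one could be handled by its own thread.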

I wrote to both AMD and NV but got no reply, so I'm posting this here so that it isn't forgotten and doesn't die in a spam filter.