Using the GPU to accelerate high quality 2D AA

We have a web-based product that currently uses Anti-Grain Geometry (AGG) to render Google Map tiles. It takes some complex input shapes, renders them, compresses the result and sends it back to the client as a PNG.

Currently we render most tiles offline, but we want to move to fully live/interactive on-demand rendering. As you can imagine, latency is important with AJAX web services, so we want to optimise the tile render as much as possible.

Current tests have shown that the AGG render takes more time than everything else (SQL data lookup, sorting, clipping, PNG compression and the HTTP server), so we want to speed it up. AGG is the (second) fastest CPU 2D anti-aliased vector renderer I know of. (The fastest is only 1.5x faster.) We have thrown everything at AGG, including the Intel compilers, SSE, multithreading etc., but it is still not fast enough. Put simply, the latest generation Intel Nehalem server CPUs are too slow at rendering.

We recently got access to an Amazon EC2 machine for CUDA work, which got me wondering: can we use its two Tesla C2050 cards to render faster than the CPU? Our first attempt was to port some scanline anti-aliasing algorithms to CUDA, but it was not fast, as CUDA is not well suited to that kind of task.

So I am now turning to OpenGL so I can use the Teslas' fixed-function vertex units etc. to accelerate the render. (Yes, Teslas work fine with OpenGL; you just have to render to an offscreen buffer and read it back to the CPU, otherwise you can't see your output, as there is no VGA output.)

I have been looking at what GPU AA techniques already exist, and so far none of them are terribly good.

First up, MSAA and/or CSAA is just horrible. The quality is the worst I have seen as far as antialiasing goes (compared to the CPU, not to other GPUs). The problem is that with very thin lines (<1 px) and small detail geometry, which are important when rendering maps (e.g. road outlines), the sample mask 'misses' some of the pixels and you get a stipple effect. At very shallow edge angles (relative to the vertical/horizontal pixel edges) the AA pattern is low quality and blurry.

I next looked at Microsoft’s Direct2D renderer. I have used it before in a desktop app and the quality of the AA is good and it seems fast. Unfortunately it is not available on Linux (which all our servers run) so I cannot use it directly. I decided to reverse engineer it with the Microsoft PIX Debugger to see what tricks it uses and if it is applicable to my project using OpenGL. I was rather disappointed at what I found. Direct2D precomputes all alpha coverage values on the CPU!

It generates the geometry with thin 1 px wide border strips in the areas of AA blend. It then uses the vertex shader to output a varying that runs from 0.0f to 1.0f across these border strips, so the hardware interpolates the strips' coverage values. The pixel shader simply takes the varying and multiplies it with the shape's colour (or a texture lookup if it is a gradient) to get the final AA result (with the correct alpha value, of course). It then uses fixed-function alpha blending for the final blend to the buffer.
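Roughly, the shader side of that trick looks something like this (my own untested GLSL sketch, not Direct2D's actual code; the attribute and uniform names are just illustrative):

```cpp
// The CPU-built geometry carries a per-vertex "coverage" attribute:
// 1.0 on the shape interior, 0.0 on the outer edge of the 1 px AA strip.
const char* kStripVS = R"(
    #version 150
    uniform mat4 mvp;
    in vec2  position;
    in float coverage;          // 1.0 inside, 0.0 at the outer strip edge
    out float vCoverage;
    void main() {
        vCoverage   = coverage; // rasterizer interpolates this across the strip
        gl_Position = mvp * vec4(position, 0.0, 1.0);
    }
)";

const char* kStripFS = R"(
    #version 150
    uniform vec4 shapeColor;    // or a gradient texture lookup instead
    in float vCoverage;
    out vec4 fragColor;
    void main() {
        // Fixed-function blending (SRC_ALPHA, ONE_MINUS_SRC_ALPHA) does the rest.
        fragColor = vec4(shapeColor.rgb, shapeColor.a * vCoverage);
    }
)";
```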

While this is fast and simple (and works on D3D9-class hardware coughIntelcough, which is very attractive to Microsoft), it uses way too much CPU processing to generate the alpha strip geometry. (Even DirectWrite's sub-pixel accurate text is rendered to a buffer on the CPU and uploaded as a texture to the GPU for blitting and ClearType blending.)

Microsoft assumes that you won't change geometry very often, so the vertex buffer only needs to be computed once and cached. (Simple rotate/scale/translate is handled by vertex shader matrix maths.)

But in my case I generate totally new geometry every single frame (one frame per tile), and the CPU overhead/stall would make it slower than AGG.

Thus what I want to ask is: what ideas do you guys have on how to do fast OpenGL 2D anti-aliased vector rendering with decent AA? As I have a Direct3D 11 class card, I would like to use any new/old OpenGL 4.1 feature that you think will make it faster.

The geometry side is rather easy to do in OpenGL 4, using a geometry shader to expand lines into thin/thick quads. I believe that using the tessellation engine for Bézier line/polygon curves would also be fast.
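For example, the line expansion could look roughly like this (an untested sketch assuming a 2D orthographic projection with w == 1; the viewport/halfWidthPx uniforms are my own names):

```cpp
// Geometry shader: expand each GL_LINES segment into a screen-aligned quad
// "halfWidthPx" pixels thick on either side of the line.
const char* kExpandGS = R"(
    #version 150
    layout(lines) in;
    layout(triangle_strip, max_vertices = 4) out;

    uniform vec2  viewport;     // tile size in pixels, e.g. vec2(256.0)
    uniform float halfWidthPx;  // half the desired line width, in pixels

    void main() {
        vec2 p0 = gl_in[0].gl_Position.xy;
        vec2 p1 = gl_in[1].gl_Position.xy;

        // Direction and normal in pixel space, then back to NDC
        // (one pixel spans 2.0 / viewport units in NDC).
        vec2 dirPx  = normalize((p1 - p0) * viewport);
        vec2 nrmPx  = vec2(-dirPx.y, dirPx.x);
        vec2 offset = nrmPx * halfWidthPx * 2.0 / viewport;

        gl_Position = vec4(p0 + offset, 0.0, 1.0); EmitVertex();
        gl_Position = vec4(p0 - offset, 0.0, 1.0); EmitVertex();
        gl_Position = vec4(p1 + offset, 0.0, 1.0); EmitVertex();
        gl_Position = vec4(p1 - offset, 0.0, 1.0); EmitVertex();
        EndPrimitive();
    }
)";
```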

But I am stuck on the anti-aliasing bit. I figure I will need to use a pixel shader for this. I notice that OpenGL 4 has a lot of stuff to let you mess with the fixed-function MSAA sampling. Which makes me wonder if I can get it to produce better AA by doing a custom pixel shader sample coverage output.

Has anyone done anything like this before? And where can I find good information on how the MSAA pipeline works under OpenGL 4, with the coverage sample output stuff?

Thanks,
Leith Bade

Which makes me wonder if I can get it to produce better AA by doing a custom pixel shader sample coverage output.

Well, not quite the way you mean. Because you can’t output the fragment’s coverage mask. Yet. GL 4.x class hardware is capable of it, but there’s no OpenGL interface for that at present.

However, if all you’re looking for is to set how many samples a fragment covers (which is less complex than setting the coverage mask), you can do that with GL 2.0 class hardware.

You use GL_SAMPLE_ALPHA_TO_COVERAGE, and just output the sample coverage as the alpha value. This value will be AND’d with the fragment’s actual coverage mask. Granted, this makes blending a bit tricky if you want to do alpha blending. But it may do what you need.
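In rough code, that path is something like the following (untested sketch; vCoverage stands for whatever coverage estimate your shader computes):

```cpp
// Render into a multisampled target with alpha-to-coverage enabled, and write
// the coverage estimate into alpha.  The hardware turns that alpha into a
// sample mask and ANDs it with the fragment's real coverage.
glEnable(GL_MULTISAMPLE);
glEnable(GL_SAMPLE_ALPHA_TO_COVERAGE);

const char* kCoverageFS = R"(
    #version 150
    uniform vec4 shapeColor;
    in float vCoverage;              // whatever coverage estimate you computed
    out vec4 fragColor;
    void main() {
        // Alpha doubles as the coverage input here, which is exactly why
        // mixing this with ordinary alpha blending gets tricky.
        fragColor = vec4(shapeColor.rgb, vCoverage);
    }
)";
```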

Thanks for the speedy reply!

That sucks… I guess you are talking about stuff that Direct3D 11 has? How much other stuff is still missing?

I was reading pg. 256 of the 4.1 spec and it talks about gl_SampleMask.

Would this avoid the problem of GL_SAMPLE_ALPHA_TO_COVERAGE messing with alpha blending?

Hello,

Did you look at plain-old fixed pipeline AA? (e.g. glEnable(GL_LINE_SMOOTH); + glLineWidth)
This is likely to be the closest thing to what your software renderer is doing.
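For reference, the basic setup is just something like this (exact widths and hints are up to you):

```cpp
// Minimal fixed-pipeline setup; GL_LINE_SMOOTH only does something useful
// with blending enabled.
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
glEnable(GL_LINE_SMOOTH);
glHint(GL_LINE_SMOOTH_HINT, GL_NICEST);
glLineWidth(0.75f);   // whether sub-pixel widths are honoured is driver-dependent
```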

Every other AA scheme in OpenGL (MSAA, CSAA, you name it) is based on supersampling and shares the same bottleneck: the samples might not be representative of what happens overall in one pixel.

Quadro cards accelerate that stuff, so my guess would be that Teslas do too.

Are you able to do thin lines <1px with GL_LINES?

Also I found the section on Sample Shading… does that work at vertex edges or would you need to add a small border (in the pixel shader)?

That’s implementation-dependent. Use glGet* with GL_LINE_WIDTH_RANGE to query for minimum and maximum widths on your particular hardware, and GL_LINE_WIDTH_GRANULARITY to get the granularity (not every continuous width is supported).
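Something along these lines (a sketch; assumes your GL headers/loader are already set up):

```cpp
// Query the supported smooth line widths; not every width in between is
// available, hence the granularity query.
GLfloat range[2]    = { 0.0f, 0.0f };
GLfloat granularity = 0.0f;
glGetFloatv(GL_LINE_WIDTH_RANGE, range);
glGetFloatv(GL_LINE_WIDTH_GRANULARITY, &granularity);
printf("smooth line widths: %.2f .. %.2f, step %.2f\n",
       range[0], range[1], granularity);
```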

Also I found the section on Sample Shading… does that work at vertex edges or would you need to add a small border (in the pixel shader)?

Sample shading is about executing fragment shaders more than once per fragment (and up to once per sample). It has to do with the MSAA scheme and would really not help you here. It was introduced mostly to solve some issues with HDR imaging combined with MSAA.

I was reading pg. 256 of the 4.1 spec and it talks about gl_SampleMask.

That’s odd. I could have sworn that didn’t exist.

What’s even stranger is that it is mentioned in the 4.10 GLSL spec, but that spec says nothing about gl_SampleMaskIn, which is detailed in the 4.1 OpenGL spec.
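For what it's worth, using it from a 4.x fragment shader would look roughly like this (untested; the 8-sample count is just an assumption):

```cpp
// Hypothetical fragment shader that writes the coverage mask directly,
// leaving alpha free for ordinary blending.  Assumes an 8-sample framebuffer.
const char* kMaskFS = R"(
    #version 400
    uniform vec4 shapeColor;
    in float vCoverage;              // your coverage estimate, 0..1
    out vec4 fragColor;
    void main() {
        int covered = int(round(clamp(vCoverage, 0.0, 1.0) * 8.0));
        gl_SampleMask[0] = (covered == 0) ? 0 : (1 << covered) - 1;
        fragColor = shapeColor;      // alpha untouched, so blending still works
    }
)";
```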

Would this avoid the problem of GL_SAMPLE_ALPHA_TO_COVERAGE messing with alpha blending?

Yes. But then again, this is 4.x stuff. If you want to support 3.x, you could use alpha to coverage combined with dual-source blending. That’s the ability to output two colors which will be used in a single blending operation. The first color’s alpha will be used for the coverage alpha, and the second color’s alpha can be used for alpha blending.
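A rough sketch of that combination (names are mine; needs GL 3.3 / ARB_blend_func_extended):

```cpp
// Alpha-to-coverage is driven by output 0's alpha, while blending is driven
// by output 1's alpha via the SRC1 blend factors.
const char* kDualSrcFS = R"(
    #version 330
    uniform vec4 shapeColor;
    in float vCoverage;
    layout(location = 0, index = 0) out vec4 covColor;   // alpha -> coverage
    layout(location = 0, index = 1) out vec4 blendColor; // alpha -> blend factor
    void main() {
        covColor   = vec4(shapeColor.rgb, vCoverage);
        blendColor = vec4(shapeColor.rgb, shapeColor.a);
    }
)";

// C++ side:
glEnable(GL_SAMPLE_ALPHA_TO_COVERAGE);
glEnable(GL_BLEND);
glBlendFunc(GL_SRC1_ALPHA, GL_ONE_MINUS_SRC1_ALPHA);  // factors come from output 1
```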

Naive, but still: won't it work (image quality / render time wise) for you to just render at max resolution (possibly with SSAA) and downsample yourself?
8k × 8k × 3 ≈ 200 MB; I'm not sure if you will be able to create a multisampled texture of this size with 2 samples, or what the performance will be (if any).
Not saying it's wise, though I'd like to see such a setup :wink:

Draw 3 px wide lines (or 4 px), compute coverage in the fragment shader and output it as alpha (with blending enabled). Render to a single-sampled RG11B10F texture in linear colour space, then convert to sRGB.
The fastest hack for computing coverage is "distance from fragment to line", but it won't work nicely with thin lines. Another way to compute it is to use a mipmapped gradient texture, similar to
http://people.csail.mit.edu/ericchan/articles/prefilter/ . You could use a geometry shader to create the wide lines, which gives you more control over (and an easier time generating) the texture coordinates you need to sample such a texture with.
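As a very rough sketch of the distance hack (untested; assumes the geometry shader passes a distPx varying measuring distance to the line centre in pixels, across a quad about 2 px wider than the line):

```cpp
const char* kLineCovFS = R"(
    #version 150
    uniform vec4  lineColor;     // assumed to be in linear colour space
    uniform float halfWidthPx;   // half the intended visible line width
    in float distPx;             // distance to line centre, in pixels
    out vec4 fragColor;
    void main() {
        // Treat the pixel as a 1 px box filter: full coverage well inside the
        // line, a linear ramp over the last pixel at each edge.
        float coverage = clamp(halfWidthPx + 0.5 - abs(distPx), 0.0, 1.0);
        fragColor = vec4(lineColor.rgb, lineColor.a * coverage);
    }
)";
```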

The cards I know of that accelerate 3 px wide lines also accelerate GL_LINE_SMOOTH, so there's really no need for a home-made shader here.

I mean, reinventing the wheel with shaders is fun and all (I know, having made my own anti-aliased filled disks in GLSL using the technique you're suggesting), but the chance of getting the coverage wrong just doesn't seem worth it in the present case.

If it's just quality lines you want, take a look at the "Fast Prefiltered Lines" chapter in NVIDIA's GPU Gems 2. This will probably beat the quality of built-in line smoothing, at the cost of speed.

On the other hand, with recent hardware having dropped support for polygon edge smoothing, getting quality sub-pixel polygons is probably a matter of applying as much MSAA as the hardware supports, possibly combined with crude supersampling. Or one could perhaps extend the ideas behind the prefiltered lines to polygons.

As you have observed, MSAA is not so good at the subpixel coverage. Even 8x MSAA only nets you 8 steps of coverage, where an ideal quality algorithm would provide 256 steps (for 8-bit alpha). This is mostly an issue for sub-pixel sized artifacts that should nevertheless accumulate to become significantly visible rather than just dropping out.

Combining 8xMSAA with rendering the image at 4x width and 4x height, and down-sampling should get you to 128 levels of alpha…
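A back-of-the-envelope sketch of that setup for a 256×256 tile (variable names are mine; the resolve and tile FBOs are assumed to be created the same way, just single-sampled):

```cpp
// Render the tile into a 1024x1024, 8x multisampled FBO, resolve it, then
// downsample 4:1 (here crudely, via a linear-filtered blit).
GLuint msFbo = 0, msColor = 0;
glGenRenderbuffers(1, &msColor);
glBindRenderbuffer(GL_RENDERBUFFER, msColor);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 8, GL_RGBA8, 1024, 1024);
glGenFramebuffers(1, &msFbo);
glBindFramebuffer(GL_FRAMEBUFFER, msFbo);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                          GL_RENDERBUFFER, msColor);

// ... render the tile geometry here ...

// MSAA resolve requires matching rectangle sizes, so resolve first, then scale.
// resolveFbo (single-sampled, 1024x1024) and tileFbo (256x256) are assumed to
// be set up elsewhere.
glBindFramebuffer(GL_READ_FRAMEBUFFER, msFbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFbo);
glBlitFramebuffer(0, 0, 1024, 1024, 0, 0, 1024, 1024,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);

glBindFramebuffer(GL_READ_FRAMEBUFFER, resolveFbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, tileFbo);
glBlitFramebuffer(0, 0, 1024, 1024, 0, 0, 256, 256,
                  GL_COLOR_BUFFER_BIT, GL_LINEAR);
```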