
Fast rendering of 2D scanlines



Fractalus
02-12-2010, 03:46 PM
Hello all,

I need to dynamically draw horizontal 2D lines, mixed with single points, in solid color and/or with alpha transparency. The number of lines per frame may vary from a few hundred up to tens of thousands.
So I created interleaved arrays of vertices and colors and used vertex arrays (one for GL_LINES and one for GL_POINTS) for drawing.
But I'm not very satisfied with the performance compared to direct (immediate) mode or even software rendering.
Since I'm an OpenGL beginner, I'd like to ask the experts on this forum whether anyone knows a better solution than vertex arrays.
I should also note that the solution should be compatible with the vast majority of GFX cards (even older ones). But I'm open to a solution that combines several techniques, with a fallback if the HW is older.

Thanks in advance for your replies.

Rosario Leonardi
02-13-2010, 07:00 AM
What kind of technique are you using to draw the lines and points?
Immediate mode? Vertex arrays? VBOs?
VBOs are part of OpenGL 1.5 and should be supported by every card nowadays (today an OpenGL 3.2 card sells for $50).
If you want to be safe and support old dinosaurs, you can use vertex arrays; they work the same way as VBOs, but the data stays on the client side.

pjcozzi
02-14-2010, 07:52 AM
Take a look at NVIDIA's Using VBOs (http://developer.nvidia.com/object/using_VBOs.html). If you are rewriting your vertex data each frame, make sure that the usage you pass to glBufferData is GL_STREAM_DRAW.

I'm not sure what kind of video card you have, but tens of thousands of lines should be fine even on cards that are 5-7 years old, unless you are creating a ton of VBOs.

Do you have any screen shots? There may be alternative approaches such as using a single texture for everything but creating the texture will have CPU overhead and lots of bus traffic going to the video card.

Regards,
Patrick

Fractalus
02-15-2010, 05:27 AM
Rosario,
I'm using Vertex Arrays.

The whole process is:
1. get the x, y and color+alpha values
2. put those values into a float array
3. once the float array is full, call glDrawArrays(GL_LINES,...)
4. start filling the array again from the beginning
5. repeat 1. - 4. until all scanlines are drawn
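
The batching loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's actual code: the names `span_batch`, `batch_add_hline` and `count_vertices_flush`, the 6-float vertex layout and the `MAX_LINES` capacity are all made up for the example, and the GL draw call is replaced by a callback so the sketch compiles without a GL context.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: one vertex = x, y, r, g, b, a as interleaved floats. */
enum { FLOATS_PER_VERTEX = 6, VERTS_PER_LINE = 2, MAX_LINES = 1024 };

/* Hypothetical flush hook. In a GL build it would set up
   glVertexPointer(2, GL_FLOAT, stride, data) and
   glColorPointer(4, GL_FLOAT, stride, data + 2) with
   stride = FLOATS_PER_VERTEX * sizeof(float), then call
   glDrawArrays(GL_LINES, 0, vertex_count). */
typedef void (*flush_fn)(const float *data, size_t vertex_count, void *user);

typedef struct {
    float    data[MAX_LINES * VERTS_PER_LINE * FLOATS_PER_VERTEX];
    size_t   vertex_count;   /* vertices currently stored in data */
    flush_fn flush;
    void    *user;
} span_batch;

static void batch_init(span_batch *b, flush_fn flush, void *user) {
    assert(b != NULL && flush != NULL);
    b->vertex_count = 0;
    b->flush = flush;
    b->user = user;
}

/* Step 3 and 4: hand the filled part of the array to the draw hook,
   then start filling from the beginning again. */
static void batch_flush(span_batch *b) {
    if (b->vertex_count > 0) {
        b->flush(b->data, b->vertex_count, b->user);
        b->vertex_count = 0;
    }
}

static void put_vertex(span_batch *b, float x, float y, const float rgba[4]) {
    float *v = b->data + b->vertex_count * FLOATS_PER_VERTEX;
    v[0] = x;       v[1] = y;
    v[2] = rgba[0]; v[3] = rgba[1]; v[4] = rgba[2]; v[5] = rgba[3];
    b->vertex_count++;
}

/* Step 1 and 2: append one horizontal line; flush first if it would not fit. */
static void batch_add_hline(span_batch *b, float x0, float x1, float y,
                            const float rgba[4]) {
    if (b->vertex_count + VERTS_PER_LINE > (size_t)MAX_LINES * VERTS_PER_LINE)
        batch_flush(b);
    put_vertex(b, x0, y, rgba);
    put_vertex(b, x1, y, rgba);
}

/* Stand-in flush for running without GL: only counts the vertices "drawn".
   user must point to a size_t accumulator. */
static void count_vertices_flush(const float *data, size_t vertex_count,
                                 void *user) {
    (void)data;
    *(size_t *)user += vertex_count;
}
```

A caller would run `batch_add_hline` once per span and finish the frame with one extra `batch_flush`, so the last partially filled batch is also drawn.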

I haven't tried VBOs yet, but I'll give them a try later this week and post the results.

Fractalus
02-15-2010, 05:46 AM
pjcozzi,

thanks for the link. I'll try to use VBOs instead of Vertex Arrays.

I'm currently testing on an ATI Radeon 8500 card.
Here is a screenshot of the demo app: http://cyphre.mysteria.cz/tests/lion.png

Basically I'm trying to HW-accelerate a high-quality software-based anti-aliasing algorithm which uses scanlines to render.
You can have a look at the picture on this page to get an idea of how it works: http://antigrain.com/doc/scanlines/scanlines.agdoc.html#toc0001

ZbuffeR
02-15-2010, 10:23 AM
Fractalus, did you try just filling in a texture, and draw it as a fullscreen quad ?

Fractalus
02-16-2010, 03:57 AM
ZbuffeR,
I haven't experimented with this technique yet. But IMHO if I fill the texture using the CPU, then the only performance gain I could get is during the 'blitting' phase.
And it is possible that transferring the whole texture on every refresh would make it almost as slow as rendering with the CPU and blitting using GDI on Windows.
Another possible approach I can think of is to write a pixel shader and do all the hline fills on the GPU. This sounds like the fastest solution to me, but:
1. I have no experience with writing shaders... how can the shader code access data stored in main memory? Or do I need to send the data (vertex coords, color, alpha) to the GFX card in some way first? Or does it mean I'd need to rewrite the whole CPU-based blending algorithm in shader code?
2. using shaders violates my HW compatibility needs.

Or do you know about any other solution that I'm missing?

ZbuffeR
02-16-2010, 05:22 AM
Not sure how shaders can help you here.
To me the only 2 choices are:
1) use a texture, for ultra-fast scrolling, rotation, and crude zooming in/out.
2) draw the lines using VBO stream draw, as explained above.

You can combine both: draw with 2), copy the result to a texture with glCopyTexSubImage2D, then use 1) for pan/rot/zoom user interaction. If the rot/zoom has changed, rebuild the drawing using 2) after 500 ms of idling.

Fractalus
02-19-2010, 02:14 PM
Well, I just tried to play with VBOs and unfortunately the result is slower than using good old VAs.
I tried two techniques:
1. 'simple' = VBO using a glBufferSubDataARB(...) call every frame
2. 'mapped' = use glMapBufferARB(...) to get a pointer to the VBO memory and set the vertices directly through the pointer

The 1. method was not so slow, but still slower than using VAs.
The 2. method was way slower :-/ It looks like direct access takes more time (at least on my test configs).

I might try to play with the remaining 'texture transfer to quad' method, but I don't think I can get better results than with Vertex Arrays.

I'd appreciate it if anyone here has some more ideas, thanks!

Also I can post links to demo executables where it is possible to try different methods if anyone is interested to try on own PC/GFX card setup.

ZbuffeR
02-20-2010, 01:16 AM
"Also I can post links to demo executables where it is possible to try different methods if anyone is interested to try on own PC/GFX card setup."

Please do so, it looks interesting :-)

Fractalus
02-21-2010, 03:29 AM
Ok, here you can download all the versions I made so far:

http://cyphre.mysteria.cz/tests/demos.zip

Some notes about the files:

01-lion-SW.exe - this is the original, completely SW-based version using the www.antigrain.com (http://www.antigrain.com) high-quality antialiasing
02-lion-OGL-direct.exe - basic OpenGL version using the 'direct calls mode'
03-lion-VA.exe - OpenGL version using Vertex Arrays
04-lion-VBO-simple.exe - OpenGL version using VBOs; glBufferSubDataARB() is called every frame
05-lion-VBO-direct.exe - OpenGL version using VBOs; VBO memory is mapped using glMapBufferARB() and filled with vertex data directly every frame
06-lion-GLU-tesselator.exe - this is an attempt to use the GLU tessellator and render the polygon data directly (skipping the SW-based alpha coverage algorithm). As you can see, the tessellation is slow, and I don't know how to eliminate the antialiased inner edges of the triangulated polygon shapes

Phew, so that's all I've made so far. I'd appreciate any comments regarding performance on your HW setups. Or if you know of another possible way to speed things up, that would be great.

Also, I tried polygon tessellation using the 'stencil buffer', as described in the Red Book here - http://www.glprogramming.com/red/chapter14.html#name13
It looks like cool stuff, but it produces aliased results :-/ The only way I know to antialias it is to use FSAA extensions, but that will AA the whole scene. I need to be able to turn the AA off.
So please speak up if anyone knows how to do fast AA of the stencil buffer mask. Thanks!

ZbuffeR
02-21-2010, 02:25 PM
These apps don't make benchmarking easy:
1) vsync is on: I had to force vsync off in the driver, otherwise I always got a 1600 ms benchmark time (exactly 96 frames at 60 Hz) for all the GL benchmarks. Try wglSwapIntervalEXT(0).
2) the first "Test Performance" run differs from all the others (from the second run on, it is always faster), and the benchmark duration is very short (any tiny activity on the computer can change the result a lot), so results are not repeatable
3) the window size is very small, the lion does not fill an enlarged window, and the image complexity is quite low, so not much is actually benchmarked
4) not automated, i.e. it needs manual action to run the benchmark and manual action to note the result

Things to fix when you ask people on a forum for performance feedback ;)
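
One way to address points 2) and 4) is to discard warm-up runs and report the median of several timed runs. The following is a minimal sketch, not taken from the demo apps: `bench_median_ms`, `median_ms`, `render_96_frames_stub` and `MAX_RUNS` are hypothetical names, and `clock()` is used only for portability. It measures CPU time, so a real GL benchmark would more likely use a wall-clock timer and call `glFinish()` before stopping it.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

enum { MAX_RUNS = 64 };   /* arbitrary upper bound on timed runs */

typedef void (*bench_fn)(void *user);

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
static double median_ms(double *samples, size_t n) {
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    if (n % 2 == 1)
        return samples[n / 2];
    return 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

/* Runs fn `warmup` times untimed (so a slow first run does not count),
   then `runs` times timed, and returns the median time in milliseconds. */
static double bench_median_ms(bench_fn fn, void *user, int warmup, int runs) {
    double samples[MAX_RUNS];
    assert(fn != NULL && warmup >= 0 && runs > 0 && runs <= MAX_RUNS);
    for (int i = 0; i < warmup; ++i)
        fn(user);
    for (int i = 0; i < runs; ++i) {
        clock_t t0 = clock();
        fn(user);
        samples[i] = 1000.0 * (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    return median_ms(samples, (size_t)runs);
}

/* Stand-in workload: pretends to render 96 frames by bumping a counter.
   user must point to an int frame counter. */
static void render_96_frames_stub(void *user) {
    *(int *)user += 96;
}
```

The median is less sensitive than a single measurement to a short burst of background activity, and the same loop can print its result without any manual step.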

Enough (constructive) criticizing, now the results !
Nvidia GTX275 + FW 191.07 + Core2duo E8500 + Vista 64

Methods 2-3-4 were pretty much the same, having lowest times for both default window size and maximized near 1920*1200.

Complete data :


Times in ms for 96 frames; "first" and "second" are consecutive "Test Performance" runs.

Default window size:
Method  First  Second
01      412    404
02      392    388
03      402    377
04      409    391
05      711    688
06      623    555   (ugly, without AA)

Maximized, 1920*1200:
Method  First  Second
01      1335   1238
02      462    454
03      499    440
04      441    419
05      882    848
06      616    552   (ugly, without AA)


05 being twice as slow is a bit surprising. You should use 2 VBOs, one for uploading your data and one for rendering, then swap them. That way the upload can run in parallel with the card rendering the previous buffer, which avoids overly tight GPU/CPU sync.
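
The swap itself is only a matter of alternating between two buffer handles. Here is a small sketch of that bookkeeping, with the GL calls shown in comments so it compiles without a GL context. The type `vbo_pair` and its functions are hypothetical names for illustration, and whether this actually removes the stall depends on the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Two buffer handles used in alternation. In a GL build these would be the
   names returned by glGenBuffersARB(2, ids); here they are plain integers. */
typedef struct {
    unsigned int ids[2];
    int next;   /* index of the buffer to fill on the coming frame */
} vbo_pair;

static void vbo_pair_init(vbo_pair *p, unsigned int id0, unsigned int id1) {
    assert(p != NULL && id0 != id1);
    p->ids[0] = id0;
    p->ids[1] = id1;
    p->next = 0;
}

/* Returns the buffer to fill and draw this frame, then flips the index so
   the following frame writes into the other buffer while the GPU may still
   be reading this one. */
static unsigned int vbo_pair_acquire(vbo_pair *p) {
    unsigned int id;
    assert(p != NULL);
    id = p->ids[p->next];
    p->next ^= 1;
    return id;
}

/* Per-frame usage in a GL build (comment only, not compiled here):

     GLuint vbo = vbo_pair_acquire(&pair);
     glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
     glBufferDataARB(GL_ARRAY_BUFFER_ARB, size, NULL, GL_STREAM_DRAW_ARB);
     float *dst = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
     ... write vertices through dst ...
     glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
     glDrawArrays(GL_LINES, 0, vertex_count);
*/
```

Passing NULL to glBufferDataARB before mapping asks the driver for fresh storage, which is a common companion to the two-buffer scheme for data that is rewritten every frame.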

It would be interesting to isolate the mandatory CPU part (span generation) from the line rendering itself, to make benchmarking easier.

Anyway, I am impressed by how the GL-accelerated version looks exactly the same as the high-quality CPU antialiasing.

On a more general note, are you after a generic rendering system, or something tailored more for editing, manipulation, visualization, ...?

Fractalus
02-22-2010, 04:21 AM
ZbuffeR,

thanks for all your feedback! Yes, I know the quality of the benchmarking demos is not good, sorry. I'll try to improve it a bit during this week so the benchmarks can show more precise info.
I'll also try to implement (hopefully :)) your idea with double VBO swapping.

Yes, the accelerated AA quality is the same, which is great. The alpha blending function used by OpenGL is pretty much the same as in the SW version, and the OpenGL-rendered lines use the same CPU-computed coverage alpha values, so the output should be identical.
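
To make the equivalence concrete, here is a small sketch of the per-channel arithmetic. It rests on assumptions the post does not spell out: that the GL side uses glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA), that coverage is an 8-bit value (0..255), and that it is folded into the color's alpha before the vertex is sent. The function names are made up for the example.

```c
#include <assert.h>

/* Effective alpha of a span pixel: the color's own alpha scaled by the
   CPU-computed anti-aliasing coverage (assumed 8-bit, 0..255). */
static float span_alpha(float color_alpha, unsigned char coverage) {
    assert(color_alpha >= 0.0f && color_alpha <= 1.0f);
    return color_alpha * ((float)coverage / 255.0f);
}

/* "Source over" blend of one channel, all values in [0, 1]. This is what
   the GL_SRC_ALPHA / GL_ONE_MINUS_SRC_ALPHA blend factors compute. */
static float blend_channel(float src, float dst, float alpha) {
    assert(alpha >= 0.0f && alpha <= 1.0f);
    return src * alpha + dst * (1.0f - alpha);
}
```

With full coverage and an opaque color the source replaces the destination, and with zero coverage the destination is left unchanged, which is the behaviour expected at the inside and outside of an antialiased edge.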

I think the biggest bottleneck is most probably the transfer of vertices/colors. I hope the improved benchmarking will shed more light on it.

And yes, I'm after a generic rendering system. I've already implemented an AGG-based gfx system for the REBOL v2 language interpreter. Now I'm trying to bring HW acceleration to the upcoming REBOL v3...

ZbuffeR
02-22-2010, 06:50 AM
"I think the biggest bottleneck is most probably with the vertices/colors transfer."
Why is that?
My personal interpretation, very imprecise with only 2 window sizes to compare, is:
. 300ms constant CPU time needed to generate the colors+coverage values of the spans, common to all methods
. 100ms overhead for actually sending vertices+colors to GPU for each of the GL modes
. GPU rendering time is less than 400ms (whatever the complexity), and can be done in parallel in methods 2-3-4, but not for 5.
. CPU rendering time for method 1 is ((1335-300)/2.25megapix) = 460ms for each megapixel to be rendered.


... for my machine anyway.
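
The last estimate is simple arithmetic on the numbers posted above. A minimal sketch, with a hypothetical function name, that reproduces it:

```c
#include <assert.h>

/* Software fill cost per megapixel: total benchmark time minus the fixed
   span-generation time, divided by the rendered area in megapixels. */
static double cpu_fill_ms_per_megapixel(double total_ms, double fixed_ms,
                                        double megapixels) {
    assert(megapixels > 0.0 && total_ms >= fixed_ms);
    return (total_ms - fixed_ms) / megapixels;
}
```

With the figures from the post, `cpu_fill_ms_per_megapixel(1335.0, 300.0, 2.25)` gives 460, matching the estimate for method 1. The 300 ms fixed cost and the 2.25 megapixel area are themselves rough guesses from that post, not measurements.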