Drawing millions of objects

Hey all,

I’ve just started coding in OpenGL, since I found that software rendering is just too slow for what I need.

Basically I’d like to render a scene with millions of objects; specifically, tiny quads. There could potentially be tens or hundreds of millions. With my initial software-only implementation, I ran into performance issues. I implemented a simple OpenGL-based renderer, but I noticed that

  1. Loading is even slower than the software implementation
  2. There is little performance improvement when panning and zooming

I’m wondering if there are some basic improvements I can try to speed things up. All I’m doing now is reading the data into an array at startup, and in my paint routine, just looping through each point and plotting it in the scene. Here’s my pseudocode:

Init function:
Load data into memory

Render function:
Clear screen
Load identity matrix
Perform transformations
foreach point in collection
  set appropriate colour
  plot point using Vertex2
Swap buffers

The quads are usually quite sparsely distributed, although there may be cases when they are close together.
I’m using C# with OpenTK.

Does anyone have any suggestions on how I can speed things up?

First: are these quads all the same (i.e. each quad is similar in shape to the others)? If so, consider creating a display list for a single quad and then calling it, instead of using immediate mode. ( http://nehe.gamedev.net/data/lessons/lesson.asp?lesson=12 )

Next, calculate the shape and position of the view frustum and clip out unnecessary quads. ( http://www.lighthouse3d.com/opengl/viewfrustum/index.php?intro )
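For a flat scene like this one, frustum culling reduces to a rectangle-overlap test. A minimal sketch in C (the `Point` type, the `half_size` parameter, and the view-bound arguments are assumptions for illustration): keep a quad only if it overlaps the visible rectangle, and skip it entirely otherwise.

```c
#include <stdbool.h>

typedef struct { float x, y; } Point;  /* hypothetical quad-centre record */

/* Returns true if a quad centred at `centre`, extending `half_size` in each
   direction, overlaps the view rectangle. Quads failing this test never
   need to be sent to OpenGL at all. */
bool quad_visible(Point centre, float half_size,
                  float view_left, float view_right,
                  float view_bottom, float view_top)
{
    return centre.x + half_size >= view_left   &&
           centre.x - half_size <= view_right  &&
           centre.y + half_size >= view_bottom &&
           centre.y - half_size <= view_top;
}
```

With millions of quads and a zoomed-in view, a cheap test like this can eliminate the vast majority of them before any GL call is made.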

You can also clip out quads that are too far away from the camera and would look too small / insignificant on screen.

I would recommend not using immediate mode (i.e. glVertex2 and friends). Rather, put your mesh data in a VBO and possibly use geometry instancing. These could help a lot.

Yes, all quads are identical in shape, but not colour. Looks like display lists are exactly what I need. So the basic premise is, I create the quad once, it gets stored in (the graphics card’s?) memory, and whenever I wanna draw it, the card already knows how?

Next, calculate the shape and position of the view frustum and clip out unnecessary quads. ( http://www.lighthouse3d.com/opengl/viewfrustum/index.php?intro )

I’m drawing on a 2-dimensional surface, and they’re all on the same surface, so there’s never any chance that some will be clipped by the frustum unless the user pans/zooms.

You can also clip out quads that are too far away from the camera and would look too small / insignificant on screen.

All quads will be at a constant size regardless of zoom level.

How is a VBO different from a display list? It sounds like the same thing: objects are created in the graphics card’s memory and called up when necessary.

Yes, all quads are identical in shape, but not colour. Looks like display lists are exactly what I need. So the basic premise is, I create the quad once, it gets stored in (the graphics card’s?) memory, and whenever I wanna draw it, the card already knows how?

… Not the way you mean to do it.

If you’re going to use a display list for this, then you put all of your quads in it. Not one. You do not want to have millions of glCallList calls with calls to glColor between them.

If your field of quads is static (neither the position nor the color of any of the quads changes), and no other material parameters (shader state, etc) changes between different quads, then you simply put them all in a single display list and call glCallList once.

Oh, damn.
The quads are all indeed static, but not the same colour (shade). I’m using an 8-bit colour palette for a maximum of 256 colours, so I could make 256 display lists, one for each colour, right?

[edit]
Actually after re-reading that tutorial, it doesn’t seem like I need to do that. I can just create a display list which contains the instructions on how to draw one quad, and then before calling the list just specify a colour.

The quads are all indeed static, but not the same colour (shade).

What do you mean by that? The color of the quad is part of its data. The 4 vertex positions and 4 vertex colors make up its data. Does this data change? Does the color of a quad at a given position change?

If not, then just put all of the quads in a single display list.

Actually after re-reading that tutorial, it doesn’t seem like I need to do that. I can just create a display list which contains the instructions on how to draw one quad, and then before calling the list just specify a colour.

This is specifically what you don’t want to do.

Ok, I put all the points in the display list as suggested.

Panning performance is now improved, but now zooming suffers. This is because every time I zoom, I have to rebuild the display list and space the quads farther apart or closer together, depending on how the user has zoomed.

In a test case with 3.7 million quads, it takes nearly 4 seconds to rebuild the list.

Memory usage is also much higher than direct mode (as expected) since in addition to locally storing all the points from the file, I also need to store the display list.

With one set of test data, in direct mode the application uses ~93MB. With a display list, it uses ~425MB, although usage peaked at ~700MB.

This seems awfully high; I wonder how point-cloud viewing software works. It can display tens of millions of points or more with relative ease. Here’s an example: http://www.youtube.com/watch?v=vwrJxBrni8M

If this is how my meager 3.7-million-point (quad) dataset behaves, then at hundreds of millions of points, like in that video, even the most powerful computer will choke.

No wonder it does. Why do you need to rebuild the list at all? Just change the GL_MODELVIEW matrix; there is no need to rebuild.
Anyway, if you prefer having more control over the data optimization, and less compile time, go for a VBO.
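To make the matrix point concrete: zooming can be expressed as a transform applied to each quad’s centre at draw time, so the per-quad data never changes and nothing has to be rebuilt. A minimal CPU-side sketch in C (the function name and the zoom/pan convention are assumptions for illustration):

```c
typedef struct { float x, y; } Vec2;

/* Map a quad centre from world space to screen space. Only the centre is
   transformed; the quad's on-screen half-size stays constant, matching the
   requirement that quads keep the same size at every zoom level. */
Vec2 world_to_screen(Vec2 world, float zoom, Vec2 pan)
{
    Vec2 s = { world.x * zoom + pan.x, world.y * zoom + pan.y };
    return s;
}
```

One caveat: a plain glScalef on GL_MODELVIEW scales the quads themselves along with their spacing, which is exactly the snag that forces the rebuild when quad size must stay constant. Transforming only the centres (or switching to points, as the thread gets to later) is what keeps the on-screen size fixed.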

I’m using an 8-bit colour palette for a maximum of 256 colours
Does this mean that you’re running in color index mode? If so, you should be aware that it’s very unlikely to be hardware accelerated, and you should run in RGBA mode instead.

Alfonse Reinheart, I don’t agree. Maybe I’m wrong, but making a single display list with millions of quads is a big mess. Making a single display list that stores one quad, then rendering it billions of times, setting its colour beforehand (inside one or more for loops), looks far better to me.

Yes, you are wrong. A DL is interesting when you make few GL calls, keeping as much as possible near the GPU, or at least in the driver.
Making a GL call for each quad is a complete waste.

Alfonse is very right here. Every OpenGL call incurs a CPU overhead, which may be small or may be large, depending on factors such as what the call is, how full the command buffer currently is, how good (or bad) your driver is, and so on. The fewer of these calls you make, the better your application will run. One call will always be better than even thousands of calls, never mind millions. Otherwise your program will be badly CPU-bound, your GPU will not be performing efficiently, and you will never get good performance out of your program.

Because I want to keep the same quad size regardless of zoom level, and all the quads are stored in a single display list, their positions have to be rebaked whenever the zoom changes. If I instead have a single quad in the display list and call it that many times, it’s just as slow as rendering directly.

Anyway, if you prefer having more control over the data optimization, and less compile time, go for a VBO.

I’ve read a bit about VBOs but I don’t quite understand them. How are they different from display lists?
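Roughly: a display list is a compiled, opaque recording of GL commands, while a VBO is a buffer of raw vertex data whose memory layout you control, uploaded once with glBufferData and drawn with a single glDrawArrays call; it can also be updated without recompiling anything. A C sketch of the CPU-side step, building one interleaved position+colour array for all quads (the `Quad` record, the five-floats-per-vertex layout, and the `half` size parameter are assumptions):

```c
#include <stdlib.h>

typedef struct { float x, y, r, g, b; } Quad;  /* assumed input record */

enum { FLOATS_PER_VERTEX = 5, VERTS_PER_QUAD = 4 };

/* Build one interleaved array (x, y, r, g, b per vertex, four vertices per
   quad). This buffer would be uploaded once with glBufferData and the whole
   field drawn with a single glDrawArrays(GL_QUADS, 0, 4 * count). Caller
   frees the returned buffer. */
float *build_vertex_buffer(const Quad *quads, size_t count, float half)
{
    float *buf = malloc(count * VERTS_PER_QUAD * FLOATS_PER_VERTEX * sizeof *buf);
    if (!buf) return NULL;
    float *p = buf;
    for (size_t i = 0; i < count; ++i) {
        const float dx[4] = { -half,  half,  half, -half };
        const float dy[4] = { -half, -half,  half,  half };
        for (int v = 0; v < VERTS_PER_QUAD; ++v) {
            *p++ = quads[i].x + dx[v];
            *p++ = quads[i].y + dy[v];
            *p++ = quads[i].r;
            *p++ = quads[i].g;
            *p++ = quads[i].b;
        }
    }
    return buf;
}
```

The key property is the same as the single-display-list advice above: one big batch, one draw call, no per-quad GL calls.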

I’m unfamiliar with how to change the palette mode, so I’m running with whatever is the default (I’m guessing RGBA.)

I found that putting all points into a single display list is MUCH faster than making a display list with just one quad and calling it millions of times.

Ok some more info:
I’m basically making a point cloud viewing tool, albeit a very primitive one. So I’m going to need some kind of optimizations in order to be able to display tens of millions of points at once. The reason I’ve been using quads is because point size can change, so unless I can tell OpenGL that a point is 5x5 pixels, I need to use a quad instead.

[edit]
Jeez I feel like an idiot. Apparently there IS a way to change the point size: glPointSize. Why didn’t I think to check this before? Anyways, I’m gonna try experimenting with this to see what kind of results I get. If anyone has any optimization techniques, I’d love to hear them.

ZbuffeR and mhagain, thank you for letting me know about this.

You can do that in the vertex shader: send 4 vertices at the same position and ‘inflate’ them in real time according to your needs.
It will still work with a display list or a static VBO.

glPointSize may do the job, just be aware it has limitations.

Right. For instance, if you want to batch points with different sizes together in the same draw call, you can do that using gl_PointSize (GLSL) with GL_VERTEX_PROGRAM_POINT_SIZE. You can make the points bigger that way too (IIRC).

This can also be easily implemented with the geometry shader; all you need then is a single vertex for each quad.

What are the limitations of glPointSize? I just tried a quick test and it seems to do exactly what I need.

Also, could someone explain (or point me to a place that explains) vertex shaders, pixel shaders, and geometry shaders? These concepts are completely foreign to me.

This gives it to you in context:

http://www.opengl.org/wiki/Rendering_Pipeline_Overview

But basically think of it this way: when you give the GPU, say, a triangle to draw, it has to do some work to get that triangle on-screen. The work that needs to be done per vertex (a triangle having 3 vertices) gets run in a “vertex shader”. And the work that needs to be done per pixel (or in general, sample) filled by the triangle runs in a “fragment shader” (D3D calls this a “pixel shader”). Those are the two main staple shaders. You can specify what happens in each of these stages by writing your own shaders.

The geometry shader is quite a bit more recent, and isn’t nearly as commonly used, or as useful. It essentially lets you do “basic” dynamic generation of primitives on the GPU. For instance, you issue a draw call on the CPU with some POINTS primitives, and it turns those into QUADS primitives on the GPU. If you have a geometry shader in your program, it runs between the vertex shader and the fragment shader.
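The expansion a geometry shader performs here (one point in, one quad out) can be sketched on the CPU to make the idea concrete. This is plain C illustrating the math, not actual shader code; the `Vec2` type and the `half` size parameter are assumptions:

```c
typedef struct { float x, y; } Vec2;

/* Given one quad-centre point, compute the four corner vertices a geometry
   shader would emit for it. `out` must have room for 4 vertices. */
void expand_point_to_quad(Vec2 centre, float half, Vec2 out[4])
{
    out[0] = (Vec2){ centre.x - half, centre.y - half };
    out[1] = (Vec2){ centre.x + half, centre.y - half };
    out[2] = (Vec2){ centre.x + half, centre.y + half };
    out[3] = (Vec2){ centre.x - half, centre.y + half };
}
```

On the GPU this runs per input point, so the application only ever stores and uploads one vertex per quad.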