Drawing & culling BSP geometry

BSP geometry, like the brush objects in Unreal and Half-Life, tends to be the slowest thing to render in my engine, because it is all unique, non-instanced geometry.

I’m attempting to speed the rendering up, but I’m not sure what I can do. The engine collapses faces with like textures into single arrays, so each texture has one large mesh that all faces with that texture are collapsed into. Additionally, the engine splits the faces up by portalzones, so each portalzone has its own list of collapsed faces.
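Roughly, the layout looks like the following sketch (the struct names are just for illustration, not the engine’s actual types):

```cpp
// Hypothetical sketch of the layout described above: one collapsed batch per
// (portalzone, texture) pair, so each zone needs one draw call per texture.
#include <map>
#include <vector>
#include <GL/gl.h>

struct Vertex { float x, y, z; float s, t; float nx, ny, nz; };

struct CollapsedBatch {
    GLuint texture;               // texture shared by every face in this batch
    std::vector<Vertex> vertices; // all faces using that texture, concatenated
};

struct PortalZone {
    std::map<GLuint, CollapsedBatch> batches; // one collapsed batch per texture
};
```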

It seems like the number of texture switches is what really slows it down. In other words, the polycount doesn’t matter so much, but the number of collapsed arrays I render, each with a different texture, makes a huge difference.

Are there any more tricks I can use to speed this up? I’m already using texture compression. Is it slow to redundantly bind a texture? I mean, if I re-bind the same texture, will OpenGL detect that I’m not really switching textures, or does it reload the texture anyways?

Would it possibly be faster if I batched the faces without collapsing them, so I would set up a texture, draw all the faces with that texture, and use fast BSP culling to skip some faces? Is starting and stopping your drawing a big issue in OpenGL?

  1. texture compression is slower than uncompressed.
  2. Yes, texture binds are always expensive, so keep them to an absolute minimum and render objects sorted by which texture they use, not on a per-object basis.
  3. Use large texture atlases on some objects instead of many small textures; that could probably remove a few unnecessary texture binds (see the sketch below).
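For point 3, the coordinate remapping is roughly this (a minimal sketch; the AtlasEntry layout is hypothetical, and it only works cleanly for faces that don’t rely on GL_REPEAT tiling):

```cpp
// Sketch of tip 3: remap a face's texture coordinates into a sub-rectangle of
// an atlas so several small textures can share one bind.
struct AtlasEntry { float u0, v0, u1, v1; }; // sub-rectangle inside the atlas, in [0,1]

inline void remapToAtlas(const AtlasEntry& e, float& s, float& t)
{
    // map (s,t) in [0,1] of the original texture into the atlas sub-rectangle
    s = e.u0 + s * (e.u1 - e.u0);
    t = e.v0 + t * (e.v1 - e.v0);
}
```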

texture compression is slower than uncompressed.
No it isn’t.

Compressing a texture is slow, but if it’s precompressed, it’s often faster than using an uncompressed one.

I imagine texture compression would make things faster, since less data would have to be transferred.

There are three ways I can render BSP geometry:

  1. Collapsed vertex arrays and a single index array.

  2. Collapsed vertex arrays with a start index and count for each face, for use with glDrawArrays().

  3. Individual vertex and index arrays.

The first method is fastest, but it can’t really be culled; you just throw everything at the graphics card in one chunk.

The other two methods can be optimized with culling, but are slower to draw. #3 is very slow, while I was surprised to see that #2 is about 90% as fast as #1.

At this point, I am thinking that a fast BSP-based culling routine combined with #2 might be the fastest. Of course, for major culling I am using portals, but I need additional culling in the large outside world, which isn’t split into portalzones. I don’t think VBOs or CVAs would do any good here, but do they do anything for unique geometry?

Where’s the make_vertex_array_twice_as_fast extension?
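Roughly, methods #1 and #2 boil down to something like this sketch (the face-range bookkeeping is just for illustration, and the vertex arrays are assumed to already be set up with glVertexPointer and friends):

```cpp
#include <vector>
#include <GL/gl.h>

struct FaceRange { GLint first; GLsizei count; bool visible; };

// #1: one collapsed vertex array, one index array, one call. No per-face culling.
void drawCollapsed(const std::vector<GLushort>& indices)
{
    glDrawElements(GL_TRIANGLES, (GLsizei)indices.size(),
                   GL_UNSIGNED_SHORT, indices.data());
}

// #2: same collapsed vertex array, but a start/count per face so the BSP or
// portal pass can mark invisible faces and skip them.
void drawCulledFaces(const std::vector<FaceRange>& faces)
{
    for (const FaceRange& f : faces)
        if (f.visible)
            glDrawArrays(GL_TRIANGLES, f.first, f.count);
}
```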

I may be wrong but…
Assuming the geometry isn’t very, very large, and it’s static (which seems likely since BSP is mentioned), VBOs could indeed help. Upload the geometry to one or more VBOs, keeping in mind the 64K-vertex (unsigned short index) limit on many (most? all?) implementations to stay on the fast path. Partition as required to allow index batching. Then just stream the indices.

As VBOs are really just vectors/arrays of vertex attributes stored in memory on the server (the gfx card), instead of sending up to 64K vertex attributes every frame (assuming just xyz, s/t and normal, that’s 32 bytes/vertex = 2 MB per VBO for 64K vertices), you just send one or a few commands to switch VBOs. While 2 MB may not seem like much out of context, put it in the context of frame rate or latency (especially as such uploads are likely not written in a single batch) and the delays can add up.

To optimize this further, use two or more VBOs for indices and alternate between them, to allow the card to do its thing with one index array while you are filling another with new indices.

Anyway, I believe the most performance-efficient way to do this is to only upload new VBO data when actually needed, and for BSP I’d guess that keeping the vertex data “cached” in VBOs and only uploading new indices is optimal for overall performance.

Also keep in mind that rebasing indices on the CPU to allow batching is often way faster than submitting the geometry as separate batches.
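A minimal sketch of that scheme: static vertex data uploaded once, indices streamed every frame. The entry points are the GL 1.5 / ARB_vertex_buffer_object ones (loaded via GLEW here, which is just an assumption of the sketch), and the texcoord/normal pointers are omitted for brevity:

```cpp
#include <GL/glew.h>
#include <vector>

struct Vertex { float x, y, z, s, t, nx, ny, nz; }; // 32 bytes, as above

GLuint vertexVBO = 0, indexVBO = 0;

void uploadStaticVertices(const std::vector<Vertex>& verts) // at most 65536
{
    glGenBuffers(1, &vertexVBO);
    glBindBuffer(GL_ARRAY_BUFFER, vertexVBO);
    glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(Vertex),
                 verts.data(), GL_STATIC_DRAW);   // lives on the card

    glGenBuffers(1, &indexVBO);
}

void drawVisible(const std::vector<GLushort>& indices) // rebuilt per frame
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexVBO);
    glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (void*)0);
    glEnableClientState(GL_VERTEX_ARRAY);

    // only the indices cross the bus each frame
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexVBO);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(GLushort),
                 indices.data(), GL_STREAM_DRAW);

    glDrawElements(GL_TRIANGLES, (GLsizei)indices.size(),
                   GL_UNSIGNED_SHORT, (void*)0);
}
```

Alternating between two index buffers, as suggested above, avoids stalling on an index array the card is still reading from.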

Originally posted by Korval:
texture compression is slower than uncompressed.
No it isn’t.

Compressing a texture is slow, but if it’s precompressed, it’s often faster than using an uncompressed one.

No, compression is always slower, since it always means some extra processing of the data.
The difference might not be that great, but compressed textures are never faster unless you’re memory-bandwidth limited and you somehow only sample within the compressed blocks and not across them.

On to the subject.

Unless you have a really old graphics card you will be fill-rate limited, not vertex-transform limited, which means that polygons are basically free and polygon culling doesn’t help that much, especially if you have early z culling.

Now, what you need is simple zone culling: use #1 (VBO), but split it up into smaller parts (maybe into cubes of 50m x 50m x 50m, or individual buildings on their own) and do some macro culling instead of the fine culling you are proposing.
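Something like this hypothetical partition would do for the coarse split (the cell size and face bookkeeping are just placeholders):

```cpp
// Assign each face to a grid cell by its centre; each cell later becomes its
// own collapsed VBO and is culled as a whole.
#include <cmath>
#include <map>
#include <vector>

struct Face { float cx, cy, cz; int firstVertex, vertexCount; }; // cx/cy/cz = face centre

struct CellKey {
    int x, y, z;
    bool operator<(const CellKey& o) const
    { return x != o.x ? x < o.x : y != o.y ? y < o.y : z < o.z; }
};

std::map<CellKey, std::vector<Face>> partition(const std::vector<Face>& faces,
                                               float cellSize = 50.0f)
{
    std::map<CellKey, std::vector<Face>> cells;
    for (const Face& f : faces) {
        CellKey k = { (int)std::floor(f.cx / cellSize),
                      (int)std::floor(f.cy / cellSize),
                      (int)std::floor(f.cz / cellSize) };
        cells[k].push_back(f);
    }
    return cells;
}
```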

Bollocks. More of a compressed texture fits in the texture cache, reducing the number of cache fills and therefore making it faster.

Regarding VBOs: for batched/instanced geometry, I find VBOs to be an order of magnitude faster. I just set up my arrays, then draw every instance of the mesh.

For static geometry that only gets drawn once, because it is all unique, are you sure VBOs would make it faster? Can anyone else comment on this?

I’m already doing macro culling, I’m just getting obsessive and seeing if there is anything more I can do to make it crazy-fast.

Originally posted by halo:
For static geometry that only gets drawn once, because it is all unique, are you sure VBOs would make it faster? Can anyone else comment on this?

I’m already doing macro culling, I’m just getting obsessive and seeing if there is anything more I can do to make it crazy-fast.

VBOs are THE!!! (I just don’t know how much clearer this could be) fastest method currently, especially for static objects.
The only case where a VBO does not have such a clear advantage is when you have to build a new object every frame, but even then there is an advantage.
I, for one, upload new stencil shadow volumes to a VBO several times per frame per object, and I still manage to squeeze out a few more FPS compared to immediate mode; I could probably get a few more once I check whether either the object or the light has moved.

zeoverlord, compressed textures are faster than uncompressed ones. The DXT decompression is very cheap.

Well I’ll be damned, using VBOs on the BSP geometry makes it about 20 FPS faster.

Here’s what I have found:

The slowest rendering step, by far, is using glBindTexture(). When you sort your renderable objects by texture, you get something like a 200% speed increase. I believe I even got a speed boost by first checking to see whether the currently bound texture was the same as the one I was about to bind, and avoiding redundant binds. It’s also a good idea to use as large a texture size as possible for lightmaps, to reduce the number of textures you have to switch.
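The redundant-bind check is trivial; here’s a rough sketch (currentTexture is a hypothetical per-context cache, and it goes stale if anything binds a texture outside this wrapper):

```cpp
#include <GL/gl.h>

static GLuint currentTexture = 0;

inline void bindTextureCached(GLuint tex)
{
    if (tex != currentTexture) {
        glBindTexture(GL_TEXTURE_2D, tex);  // only hit the driver when it changes
        currentTexture = tex;
    }
}
```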

VBOs will allow you to render many instances of the same mesh at maybe ten times the speed it would take to render them each on their own with conventional vertex arrays. VBOs also make unique geometry slightly faster.
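By drawing every instance I just mean the pre-shader kind of instancing: bind the mesh’s VBO and set the pointers once, then issue one draw per transform. A rough sketch with illustrative Mesh/Instance structs:

```cpp
#include <GL/glew.h>
#include <vector>

struct Mesh     { GLuint vbo; GLsizei vertexCount; };
struct Instance { GLfloat matrix[16]; };   // object-to-world transform, column-major

void drawInstances(const Mesh& mesh, const std::vector<Instance>& instances)
{
    glBindBuffer(GL_ARRAY_BUFFER, mesh.vbo);
    glVertexPointer(3, GL_FLOAT, 0, (void*)0); // pointers set up once
    glEnableClientState(GL_VERTEX_ARRAY);

    for (const Instance& inst : instances) {
        glPushMatrix();
        glMultMatrixf(inst.matrix);
        glDrawArrays(GL_TRIANGLES, 0, mesh.vertexCount);
        glPopMatrix();
    }

    glDisableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}
```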

Also, most frustum culling tests take longer to perform than the time they save in rendering. However, a simple bounding box test between the bounding box of the camera frustum and the bounding box of the renderable object, inaccurate as it may be, will cull large amounts of geometry and is fast enough that it will actually save more time than it wastes.
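The box test I mean is just the usual axis-aligned overlap check; a minimal sketch:

```cpp
// Conservative coarse test: an axis-aligned box around the frustum against the
// object's axis-aligned box. It never culls something visible, it just lets
// some extra objects through.
struct AABB { float minX, minY, minZ, maxX, maxY, maxZ; };

inline bool aabbOverlap(const AABB& a, const AABB& b)
{
    return a.minX <= b.maxX && a.maxX >= b.minX &&
           a.minY <= b.maxY && a.maxY >= b.minY &&
           a.minZ <= b.maxZ && a.maxZ >= b.minZ;
}
```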

most frustum culling tests take longer to perform than the time they save in rendering.
Depends what you’re drawing. Millions of 6-triangle objects? Forget culling. Thousands of 500-triangle objects? Cull, preferably in a hierarchical way.

a simple bounding box test between the bounding box of the camera frustum and the bounding box of the renderable object
Nope, it’s faster to check squared bounding spheres against the frustum planes with early rejects. FPUs are good at dot products.

Testing the bounding sphere against the frustum is a good idea, too. That should be pretty fast, and it will cull more objects.

Can you define an “early reject”?

Test against the near plane first; if the sphere is behind it, you can early-out of the frustum test, and the same with the next plane, and so on.
The advantage of testing the near plane first is that you’ll then have the distance from the viewer for depth sorts etc.
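In code that’s roughly the sketch below, assuming the six planes are normalized and their normals point into the frustum, with the near plane first:

```cpp
struct Plane  { float nx, ny, nz, d; };   // nx*x + ny*y + nz*z + d = signed distance
struct Sphere { float x, y, z, radius; };

// planes[0] must be the near plane; nearDist comes out usable for depth sorting.
bool sphereInFrustum(const Sphere& s, const Plane planes[6], float& nearDist)
{
    for (int i = 0; i < 6; ++i) {
        float dist = planes[i].nx * s.x + planes[i].ny * s.y +
                     planes[i].nz * s.z + planes[i].d;
        if (dist < -s.radius)
            return false;                 // entirely behind this plane: early out
        if (i == 0)
            nearDist = dist;              // distance from the near plane
    }
    return true;                          // possibly visible (test is conservative)
}
```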

No, compression is always slower, since it always means some extra processing of the data.
The difference might not be that great, but compressed textures are never faster unless you’re memory-bandwidth limited and you somehow only sample within the compressed blocks and not across them.

The decompression algorithm is quite simple and easy to implement, so it comes at essentially no cost. On the other hand, you save a lot by having bigger chunks of texture fit in the texture memory cache.

Just speak to any 3D driver programmer; he’ll tell you that even if you do not compress the texture, he’ll do it for you based on some heuristic to determine whether compressing the texture will create visible changes or not.

I can’t imagine that - you mean every time I upload texture data, the driver is walking it, pondering its suitability for compression?? It has no business doing such a thing.

Just speak to any 3D driver programmer; he’ll tell you that even if you do not compress the texture, he’ll do it for you based on some heuristic to determine whether compressing the texture will create visible changes or not.
I have a really hard time believing that. What about the case where I call glTexImage2D and give it a blank image, and then call glTexSubImage2D to stream data? That would greatly reduce the speed of glTexSubImage2D… a function that’s been “relatively quick” (you can usually do one or two large textures per frame without a perf hit that’s too noticeable). In this case, the driver would have to examine the image supplied to glTexSubImage2D and potentially RE-ALLOCATE a texture! That would absolutely kill the performance of uploading texture data to GL in this case. I would bet the “heuristic” used is the internal format that you specify when you call glTexImage2D ;P.
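For clarity, the allocate-then-stream pattern I’m describing is roughly this sketch (the format and sizes are illustrative):

```cpp
#include <GL/gl.h>

GLuint makeStreamingTexture(int w, int h)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    // allocate storage once; the null pointer uploads no pixel data, and the
    // internal format chosen here is the obvious "heuristic" for compression
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, 0);
    return tex;
}

void streamFrame(GLuint tex, int w, int h, const unsigned char* pixels)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    // only the pixels move; storage, format and size stay fixed
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);
}
```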

Kevin B

Originally posted by knackered:
I can’t imagine that - you mean every time I upload texture data, the driver is walking it, pondering its suitability for compression?? It has no business doing such a thing.
That’s what their heuristic has to determine: whether it will change the game visually, or whether it will slow it down. The case of a texture uploaded each frame is easy: for example, compress it the first time, then don’t if it’s altered in the following frames. The driver could even start compressing it on the CPU in another thread, and stop that thread’s work when required.

I don’t know how the heuristic works, but I can definitely tell you that at least the ATI drivers do compress textures whenever they can. And nVidia’s drivers probably do the same.

I’ve heard from the horses’ mouths (not that you’re horses, cass and mcraighead) that the nvidia drivers don’t even check enabled bits against some kind of cache before sending them across the bus - in other words, they do NO redundancy checks, because that belongs in the realm of the application. So it’s not much of an extension of this methodology to say that the decision about whether a texture should be compressed or not belongs in the application too.
Besides, glTexImage2D is as quick as glTexSubImage2D these days, so it’s doubtful there’s any heuristic analysis going on.
I can well believe it of ATI, however.

Ok, I’ll ask the obvious question. What compression technique is it?

I could believe that they created special paths in the driver for certain games.