
View Full Version : Drawing & culling BSP geometry



halo
02-06-2007, 12:58 PM
BSP geometry, like the brush objects in Unreal and Half-Life, tends to be the slowest thing to render in my engine, because it is all unique, non-instanced geometry.

I'm attempting to speed the rendering up, but I'm not sure what I can do. The engine collapses faces with like textures into single arrays, so each texture has one large mesh that all faces with that texture get collapsed into. Additionally, the engine splits the faces up by portalzones, so each portalzone has its own list of collapsed faces.

It seems like the number of texture switches is what really slows it down. In other words, the polycount doesn't matter so much, but the number of collapsed arrays I render, each with a different texture, makes a huge difference.

Are there any more tricks I can use to speed this up? I'm already using texture compression. Is it slow to redundantly bind a texture? I mean, if I re-bind the same texture, will OpenGL detect that I'm not really switching textures, or does it reload the texture anyway?

Would it possibly be faster if I batched the faces without collapsing them, so I would set up a texture, draw all the faces with that texture, and use a fast BSP culling to skip some faces? Is starting and stopping your drawing a big issue in OpenGL?

http://www.leadwerks.com/post/tower.jpg

zeoverlord
02-06-2007, 03:48 PM
1. Texture compression is slower than uncompressed.
2. Yes, texture binds are always bad, so keep them to an absolute minimum and render objects depending on which texture they have, not on a per-object basis.
3. Use large texture atlases on some objects instead of many small textures; that could probably remove a few unnecessary texture binds.

Korval
02-06-2007, 03:56 PM
texture compression is slower than uncompressed.

No it isn't.

Compressing a texture is slow, but if it's precompressed, it's often faster than using an uncompressed one.

halo
02-06-2007, 06:20 PM
I imagine texture compression would make things faster, since it would have to transfer less data.

There are three ways I can render BSP geometry:

1. Collapsed vertex arrays and a single index array.

2. Collapsed vertex arrays and a start and stop index for each face, for use with glDrawArrays().

3. Individual vertex and index arrays.


The first method is fastest, but it can't really be culled; you just throw everything at the graphics card in one chunk.

The last two methods can be optimized with culling, but are slower to draw. #3 is very slow, while I was surprised to see that #2 is about 90% as fast as #1.

At this point, I am thinking that a fast BSP-based culling routine combined with #2 might be the fastest. Of course, for major culling I am using portals, but I need additional culling in the large outside world, which isn't split into portalzones. I don't think VBOs or CVAs would do any good here, but do they do anything for unique geometry?

Where's the make_vertex_array_twice_as_fast extension?

tamlin
02-07-2007, 12:36 AM
I may be wrong but...
Assuming the geometry isn't very, very large, and it's static (seems likely as BSP is mentioned), VBOs could indeed help. Upload the geometry to one or more VBOs, keeping in mind the 64K-vertices (unsigned short index) limit on many(/most/all?) implementations to stay on the fast path. Partition as required to allow index batching. Then just stream the indices.

As VBOs are really just vectors/arrays of vertex attributes stored in memory on the server (the gfx card), instead of sending up to 64K vertex attributes (assuming just xyz, s/t and normal, that's 32 bytes/vertex = 2MB per VBO for 64K vertices) you just send one or a few instructions to swap VBOs. While 2MB may not seem much out of context, put into the context of fps or just latency (especially as such uploads likely are not written in a single batch) the delays can add up.

To optimize this further, use two or more VBOs for indices and alternate between them, to allow the card to do its thing using one index array while you are filling another with new indices.

Anyway, I believe the most performance-efficient way to do this is to only upload new VBO data when actually needed, and for BSP I'd guess keeping the vertex data "cached" in VBOs and only uploading new indices is optimal for overall performance.

Also keep in mind that rebasing indices on the CPU to allow batching is often way faster than submitting them as separate batches.
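tamlin's last point, rebasing indices on the CPU so many small draws collapse into one batch, might look roughly like this. A minimal sketch, with the function name invented for illustration:

```c
/* Append a sub-mesh's index list to one shared batch, offsetting
 * (rebasing) each index by the position the sub-mesh's vertices
 * occupy in the shared vertex array. Unsigned short indices keep us
 * under the 64K-vertex fast path mentioned above. */
static void append_rebased_indices(unsigned short *batch, int *batch_count,
                                   const unsigned short *src, int src_count,
                                   unsigned short base_vertex)
{
    for (int i = 0; i < src_count; ++i)
        batch[(*batch_count)++] = (unsigned short)(src[i] + base_vertex);
}
```

After merging, a single glDrawElements call on the combined batch replaces one call per face.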

zeoverlord
02-07-2007, 03:38 AM
Originally posted by Korval:
texture compression is slower than uncompressed.
No it isn't.

Compressing a texture is slow, but if it's precompressed, it's often faster than using an uncompressed one.

No, compression is always slower, since it always means you have to process the data somewhat extra.
The difference might not be that great, but compressed textures are never faster unless you're memory-bandwidth limited and you somehow only sample within the compressed blocks and not across them.

On to the subject.

Unless you have a really old graphics card you will be fill-rate limited and not vertex-transform limited; this means that polygons are basically free and polygon culling doesn't help that much, especially if you have early z culling.

Now, what you need is simple zone culling: use #1 (VBO), but split it up into smaller parts (maybe into cubes of 50m x 50m x 50m, or individual buildings by themselves) and do some macro culling instead of the fine culling you are proposing.

knackered
02-07-2007, 06:11 AM
Bollocks: more of a compressed texture fits in the texture cache, reducing the number of cache fills and therefore being faster.

halo
02-07-2007, 07:56 AM
Regarding VBOs: for batched/instanced geometry, I find VBOs to be an order of magnitude faster. I just set up my arrays, then draw every instance of the mesh.

For static geometry that only gets drawn once, because it is all unique, are you sure VBOs would make it faster? Can anyone else comment on this?

I'm already doing macro culling, I'm just getting obsessive and seeing if there is anything more I can do to make it crazy-fast.

zeoverlord
02-07-2007, 10:48 AM
Originally posted by halo:

For static geometry that only gets drawn once, because it is all unique, are you sure VBOs would make it faster? Can anyone else comment on this?

I'm already doing macro culling, I'm just getting obsessive and seeing if there is anything more I can do to make it crazy-fast.

VBOs are THE (I just don't know how much clearer this could be) fastest method currently, especially for static objects.
The only case where VBO does not have such a clear advantage is when you have to build a new object every frame, but even then there is an advantage.
I for one upload new stencil shadow volumes to a VBO several times per frame and per object, and I still manage to squeeze out a few more FPS compared to immediate mode, and I could probably get a few more once I check whether either the object or the light has moved.

Zengar
02-07-2007, 11:10 AM
zeoverlord, compressed textures are faster than uncompressed ones. The DXT decompression is very cheap.

halo
02-07-2007, 12:49 PM
Well I'll be damned, using VBOs on the BSP geometry makes it about 20 FPS faster.

Here's what I have found:

The slowest rendering step, by far, is using glBindTexture(). When you sort your renderable objects by texture, you get something like a 200% speed increase. I believe I even got a speed boost by first checking to see whether the currently bound texture was the same as the one I was about to bind, and avoiding redundant binds. It's also a good idea to use as large a texture size as possible for lightmaps, to reduce the number of textures you have to switch.
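The redundant-bind check described above can be a thin wrapper that remembers the last texture id it bound. A minimal sketch; the counter is a stand-in so this runs without a GL context, where engine code would call glBindTexture instead:

```c
static unsigned int current_texture = 0; /* last id actually bound */
static int driver_binds = 0;             /* binds that reached the "driver" */

/* Skip the bind entirely when the requested texture is already
 * bound; only a real change goes through. In engine code the line
 * bumping driver_binds would be glBindTexture(GL_TEXTURE_2D, id). */
static void bind_texture_cached(unsigned int id)
{
    if (id == current_texture)
        return;            /* redundant bind avoided */
    current_texture = id;
    ++driver_binds;        /* stand-in for the actual glBindTexture call */
}
```

Sorting renderables by texture first makes this cache hit as often as possible.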

VBOs will allow you to render many instances of the same mesh at maybe ten times the speed it would take to render them each on their own with conventional vertex arrays. VBOs also make unique geometry slightly faster.

Also, most frustum culling tests take longer to perform than the time they save in rendering. However, a simple bounding box test between the bounding box of the camera frustum and the bounding box of the renderable object, inaccurate as it may be, will cull large amounts of geometry and is fast enough that it will actually save more time than it wastes.

knackered
02-07-2007, 01:52 PM
most frustum culling tests take longer to perform than the time they save in rendering.
Depends what you're drawing. Millions of 6-triangle objects? Forget culling. Thousands of 500-triangle objects? Cull, preferably in a hierarchical way.

a simple bounding box test between the bounding box of the camera frustum and the bounding box of the renderable object
Nope, faster to check the squared bounding spheres against the frustum planes with early rejects. FPUs are good at dot products.

halo
02-07-2007, 03:25 PM
Testing the bounding sphere against the frustum is a good idea, too. That should be pretty fast, and it will cull more objects.

Can you define an "early reject"?

knackered
02-08-2007, 12:01 AM
Test against the near plane first; if the sphere is behind it, you can early-out of the frustum test. Same with the next plane, etc.
The advantage of testing the near plane first is that you'll then have the distance-from-viewer for depth sorts etc.
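The early-reject loop described above might be sketched like this, assuming six planes stored with unit normals pointing into the frustum and the near plane at index 0 (the struct and function names are invented for illustration):

```c
typedef struct { float nx, ny, nz, d; } plane;  /* n.p + d >= 0 means inside */

/* Test a bounding sphere against the frustum planes, near plane
 * first. Returns 0 on the first plane the sphere is fully behind
 * (early out), 1 otherwise; the near-plane distance is written out
 * so it can double as the depth-sort key mentioned above. */
static int sphere_in_frustum(const plane p[6],
                             float cx, float cy, float cz, float r,
                             float *near_dist)
{
    for (int i = 0; i < 6; ++i) {
        float dist = p[i].nx * cx + p[i].ny * cy + p[i].nz * cz + p[i].d;
        if (i == 0)
            *near_dist = dist;    /* distance past the near plane */
        if (dist < -r)
            return 0;             /* sphere entirely outside: early reject */
    }
    return 1;                     /* inside or intersecting */
}
```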

tfpsly
02-11-2007, 04:07 PM
Originally posted by zeoverlord:
No, compression is always slower, since it always means you have to process the data somewhat extra.
The difference might not be that great, but compressed textures are never faster unless you're memory-bandwidth limited and you somehow only sample within the compressed blocks and not across them.

The decompression algorithm is quite simple and easy to implement. It's definitely no cost. On the other hand you save much by having bigger chunks of texture in the texture memory cache.

Just speak to any 3D driver programmer; he'll tell you that even if you do not compress the texture, he'll do it for you, based on some heuristic to determine whether compressing the texture will create visible changes or not.

knackered
02-12-2007, 06:57 AM
I can't imagine that - you mean every time I upload texture data the driver is walking it, pondering its suitability for compression?? It has no business doing such a thing.

ebray99
02-12-2007, 07:18 AM
Just speak to any 3D driver programmer, he'll tell you that even if you do not compress the texture, he'll do it for you based on some heuristic to determine whether compressing texture will create visible changes or not.
I have a really hard time believing that. What about the case where I call glTexImage2D and give it a blank image, and then call glTexSubImage2D to stream data? That would greatly reduce the speed of glTexSubImage2D... a function that's been "relatively quick" (you can usually do one or two large textures per frame without a perf hit that's too noticeable). In this case, the driver would have to examine the image supplied to glTexSubImage2D and potentially RE-ALLOCATE a texture! That would absolutely kill the performance of uploading texture data to GL in this case. I would bet the "heuristic" used is the internal format that you specify when you call glTexImage2D ;P.

Kevin B

tfpsly
02-12-2007, 08:15 PM
Originally posted by knackered:
I can't imagine that - you mean every time I upload texture data the driver is walking it pondering its suitability for compression?? It's no business doing such a thing.

That's what their heuristic has to determine: whether it will change the game visually, or whether it will slow it down. The case of a texture uploaded each frame is easy: for example, compress it the first time, then don't if it's altered in the next frames. The driver could even start compressing it on the CPU in another thread, and stop that thread's work when required.

I don't know how the heuristic works, but I can definitely tell you that at least the Ati drivers do compress textures whenever they can. nVidia's drivers probably do the same.

knackered
02-13-2007, 12:29 AM
I've heard from the horses' mouths (not that you're horses, cass and mcraighead) that the nvidia drivers don't even check enabled bits against some kind of cache before sending them across the bus; in other words, they do NO redundancy checks, because that belongs in the realm of the application. So it's not much of an extension of this methodology to say that the decision about whether a texture should be compressed belongs in the application too.
Besides, glTexImage2D is as quick as glTexSubImage2D these days, so it's doubtful there's any heuristic analysis going on.
I can well believe it of ATI, however.

V-man
02-13-2007, 02:34 AM
Ok, I'll ask the obvious question. What compression technique is it?

I could believe that they created special paths in the driver for certain games.

tamlin
02-14-2007, 08:54 AM
tfpsly: I also find it unlikely (read: unbelievable) that the driver would, either itself or by telling the hardware to, compress every texture uploaded, then compare for artifacts, to finally decide whether it should be stored compressed or uncompressed.

V-Man: Wouldn't the compression quite obviously be one of the DXT formats (if this magic existed)?

tfpsly
02-16-2007, 05:40 AM
Well, in my previous game company I worked with an engineer who had previously worked on the OpenGL part of the Ati drivers. When we worked on the 3D engine, he told us that explicitly using texture compression in the engine on our own was useless, as the driver would do it for us.
Looking on LinkedIn for Ati and F4 gives the name quite easily.

That's my source. I had no reason to question him. You have no more reason to believe me, I guess :D

knackered
02-16-2007, 03:09 PM
Sounds to me like he misunderstood something his peers were talking about.... I've worked with a lot of people who have picked up quite a bit of misinformation over the years by overhearing stuff people in other areas were working on.
I've no doubt he worked in the ATI driver department, but probably more on the display-properties GUI side of things.
What he's suggesting sounds like complete nonsense. If there's any truth in it, it's more likely he's talking about some rudimentary run-length encoding just to get it across the bus, but certainly not DX/S3TC.

remdul
02-17-2007, 08:32 AM
Ontopic:

1) I don't think anyone explicitly stated this above: Sort all faces by texture/material. Avoid all redundant texture binds and state changes. Texture binds, along with shader binds, are among the most 'expensive' calls in OpenGL.

2) Yes, use texture compression. It takes longer to load, but renders faster at runtime (for the reasons stated above by knackered). I noticed a solid 5-10% performance increase in my own engine.

If load-time is an issue, read back the compressed textures after the first time they are loaded. Dump them into a cache file. Next time the textures are loaded, read the cache *. This is often faster than reading and uploading the original uncompressed textures (especially if they are in PNG/JPEG formats that need to be decompressed)!

IMO the only drawback of texture compression is the slightly lower visual quality.
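The cache idea above needs only a tiny on-disk format. The sketch below shows just the save/load round trip; the header layout and function names are my own invention, and in the engine the blob would come from glGetCompressedTexImage after the first upload and go back in via glCompressedTexImage2D:

```c
#include <stdio.h>

/* One cached mip level on disk: the GL internal-format enum, the
 * dimensions, and the compressed blob size, followed by the blob. */
typedef struct {
    unsigned internal_format;   /* e.g. a GL_COMPRESSED_* enum value */
    unsigned width, height;
    unsigned size;              /* bytes of compressed data */
} tex_cache_header;

static int save_cached_level(FILE *f, const tex_cache_header *h, const void *data)
{
    if (fwrite(h, sizeof *h, 1, f) != 1) return 0;
    return fwrite(data, 1, h->size, f) == h->size;
}

static int load_cached_level(FILE *f, tex_cache_header *h, void *data, unsigned max_size)
{
    if (fread(h, sizeof *h, 1, f) != 1) return 0;
    if (h->size > max_size) return 0;   /* refuse blobs bigger than the buffer */
    return fread(data, 1, h->size, f) == h->size;
}
```

Whether such cached blobs transfer across vendors is exactly the open question marked * below; at minimum the internal format stored in the header must match what the loading card supports.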

3) Use VBOs/vertex arrays/display lists. It is always hard to say which is faster on any particular implementation; try them all and pick the fastest.

4) Take advantage of early-z culling. Sort and draw opaque geometry from front to back. If possible, draw simplified occlusion geometry to the z-buffer before rendering visible geometry.

* Does someone here know whether read back compressed textures (ARB_texture_compression) are compatible across different cards/vendors? i.e, can I read back the texture on a nVidia card, save it to file, and load it again on an ATI card and expect it to work?

ebray99
02-17-2007, 10:19 AM
IMO the only drawback of texture compression is the slightly lower visual quality.

This is somewhat true... a 1024x1024 compressed texture will look worse than a 1024x1024 uncompressed texture. However, because even the best texture compression cuts memory down by 4:1, you actually get better quality when you look at it from the perspective of quality-to-memory. In other words, you could use a 1024x1024 compressed texture for about the same amount of memory as a 512x512 uncompressed one. However, the quality of the 1024x1024 compressed texture would be much better than the 512x512 uncompressed one.

Kevin B

tamlin
02-18-2007, 12:26 PM
remdul: To not hijack the thread (more), simple answer: Yes.

tfpsly
02-20-2007, 06:26 PM
Originally posted by knackered:
Sounds like he misunderstood something his peers were talking about to me....I've worked with a lot of people who have previously picked up quite a bit of misinformation over the years by overhearing stuff people in other areas were working on.

Could be, I cannot tell.


Originally posted by knackered:
I've no doubt he worked in the ATI driver department, but probably more on the display properties GUI side of things.

He was working on the implementation of the GL fixed pipeline on the recent cards (R580 and up), generating VS and PS on the fly, and maybe on some more.