View Full Version : 100.000 polys w/lighting @ 30 fps - how?

05-27-2008, 04:23 AM
How in all the earth does one render 100,000 polys w/ lighting @ 30 fps on a system with a decent CPU and let's say Radeon X1900 XT gfx card?

Dark Photon
05-27-2008, 04:57 AM
How in all the earth does one render 100.000 polys w/ lighting @ 30 fps on a system with a decent CPU and let's say Radeon X1900 XT gfx card?
I assume you live someplace where the 10^3 separator is . instead of , and mean "100,000 polys".

You'll generate a lot more response if you just post a short snippet of what your batch submission code looks like, and let folks rip it to shreds ;-)

But even before that, have you run a CPU or GPU profiler to determine where you are bound? Are you doing basic things such as culling? What do your timings suggest is the main bottleneck?

For starters, eliminate all state changes and get your batch submission code as streamlined as possible. Then add them back and see how you need to regroup/reorder to keep performance up.

05-27-2008, 05:08 AM
Oops, yes, meant one hundred thousand polys.

I only have Visual C++ Express 2008 which comes w/o built-in profiler, and my AQtime profiler doesn't connect to it. I will add some profiling code myself for a start.

I have eliminated state changes as much as I could.

I am doing face culling ahead of processing faces (omitting that decreases frame rates).

My face rendering code has over 2000 lines of code.

I could give you the entire project plus game data plus setup guide and you could take a look ... j/k.

Currently, with a test case, I am getting about 200 fps for 909 faces and 8 state changes on a system w/ Athlon 64 3500+ and Radeon X850 XT, w/o lighting (everything just bright). That's pretty poor I think, but I have no idea what to improve here.

I am using vertex arrays, but no VBOs because I can have a lot of dynamic face color changes and would need to update the color buffer every frame anyway.

05-27-2008, 07:11 AM
How in all the earth does one render 100,000 polys w/ lighting @ 30 fps on a system with a decent CPU and let's say Radeon X1900 XT gfx card?
That would only require a card capable of 3 million triangles per second. That is frankly nothing. I imagine that even with immediate mode you could well exceed that.
To put it in perspective, my card has a throughput of 300 million triangles per second, and it's a year old. I regularly get more than 300 million out of it.
You should go back to basics, read some basic documentation, and post on the beginners forum.

05-27-2008, 07:46 AM
You should go back to basics, read some basic documentation, and post on the beginners forum.
Sorry, you won't like that, but it is of no help to me at all.

Read some basics, then. Hm, calling glColor and glDrawArrays? Stuff like that?

I'd need at least some key words, some ideas. I know about VBOs and am using them for object rendering, I know about client arrays, etc. I just don't know how to put all that together so that I get decent frame rates with that decade-old game I am trying to hack up.

But maybe you want to teach John Carmack something, too. As far as I know, none of his engines pushes 300M triangles/sec with full lighting and fx. Obviously that has nothing to do with raw (theoretical) triangle throughput of some gfx hardware.

05-27-2008, 08:28 AM
Did you try to render it without per-frame dynamic updates? Also you could try using VBOs set to streaming mode?

Do post important parts of your batch submission code, as Dark Photon suggested.

05-27-2008, 08:47 AM
The problem with VBOs is that there are a lot of faces with variable color/alpha values, and I'd need to update the color buffer for these each time I render them.

If you have links to some documents with good outlines about efficient OpenGL rendering, I'd love to read them.

I really see no way to post "relevant" parts of my code here. It's just too much.

There's code involved doing some (software) culling before faces are further processed. Turning it off slows the renderer down.

There's code changing the alpha values of certain special faces.

There's code (efficiently) sorting the faces by texture to reduce texture state changes later on (I have profiled that code; it is very fast and negligible compared to overall frame rendering time).

There's code detecting required state (texture) changes and performing these.

There's code buffering faces until a state change is required and that renders the entire batch of buffered faces at once using glDrawElements before the state change happens.

That's just the basic stuff.

There's also code dispatching transparent faces to another buffer for subsequent transparency rendering.

There's code activating or deactivating (simple) shader programs handling things like color key transparency or monochrome rendering.

I could post all this here, but I think even snippets would be too much.

05-27-2008, 08:51 AM
karx, are you saying you're hacking around with the Quake 3 source code? As far as I remember, that engine was designed around poor/absent hardware T&L and fill rate. He used BSPs for rendering, which meant lots of CPU work (half-space tests), lots of batches being fully submitted every frame, desperately trying to eliminate overdraw. This is the antithesis of today's approaches. Carmack would not be writing a renderer like that on today's hardware - and he most certainly would be getting 300 Mtps on today's hardware with most advanced features turned off.
Your problem is fundamental - you're using yesterday's algorithms on today's hardware. Just go to the NVIDIA or ATI developer sites and RTFM. This information is literally pushed in your face with the most simple searches - saying you need keywords is frankly bollocks.

05-27-2008, 08:55 AM
Thanks for the kind words. I tried to avoid these, but your replies are just bollox for me as well.

If you feel offended by my stupid questions, why the heck don't you just stay away and have a good day somewhere else instead of trying to ruin mine? Nobody forces you to deal with stuff you don't like here, and if you have a generally negative attitude towards noobs asking the same old questions every day of every year, I have bad news and good news for you: the bad news is that this will never change, and the good news is that you can avoid them.

I am not playing around with Quake 3. I am coding around in Descent 2 (D2X-XL -> http://www.descent2.de ). FYI: Compared to that engine, Q3 is brand spankin' new. I would love to see someone more skilled in OpenGL coding than me do it, but there is no one willing to.

05-27-2008, 09:21 AM
Concerning VBOs: make sure your code follows these (http://developer.nvidia.com/attach/6427) guidelines (at least).


05-27-2008, 09:29 AM
karx11erx, you seem to do a lot on a per-face basis. Try to minimize that. Today, it doesn't matter if you render 1000 or 900 faces. Don't do culling, state sorting or anything else at the per-face level.

Also, try GLIntercept. Recently it helped me find a stupid bug that in some cases almost halved my render performance. Maybe you just do a stupidly high number of glLoadMatrix() calls, or, as in my case, a stupidly high number of glPushAttrib()/glPopAttrib() calls. You can make the log of one frame public, so we can take a look at it and maybe pinpoint what's going wrong.

Also, for static geometry, use static VBOs! Even for dynamic stuff, VBOs should be your choice. Stay away from immediate mode or "old" vertex arrays.

Our engine renders a CAD scene per-frame:
2.6 Mtris
5000 glDrawElements calls
at 40 fps (that's 104 Mtris/s)

on a simple GF8800GTS.

05-27-2008, 10:38 AM
Thanks for the kind words. I tried to avoid these, but your replies are just bollox for me as well.
typical and predictable - you picked the one negative thing I said and made it the focus of your reply. Totally ignored the other stuff. Totally failed to give any more information on how you're submitting your vertices.
It's the vagueness of your question coupled with the inappropriateness of the forum (advanced GL) that annoys me. I could ignore you, but I thought I'd try and kick-start some kind of thought process in you. I have failed miserably. You just want to be spoon fed the basics.
Pray continue....

05-27-2008, 10:38 AM
I know for sure that I am not doing excessive amounts of glPush/glPop calls, and the modelview and projection matrices are only set once per frame. It's true though that I am doing a lot of stuff per face; the reason is that Descent has an excessive number of light sources, often 16 or more per face, and they vary very frequently.

I have also found that simply pushing all faces to the OpenGL driver and not doing any dynamic lighting, culling or stuff does not speed up rendering for me, and I am clueless why.

I can try VBOs, but I am using them for rendering 3D objects in my scene already (they get loaded once during level load, so there's no frequent changing them or so, they just stay untouched in the gfx card's memory after that), and they only about doubled rendering speed.

I should probably use deferred lighting, but I am not that far yet.

05-27-2008, 10:57 AM
What format are your vertices (float3 pos, float3 norm, uint32 rgba)?
When you say you've tried pushing all faces to GL, what method do you use?
Still not enough information to give a sensible reply - you're just playing a guessing game with us.

05-27-2008, 11:29 AM
Ok, I will try to post enough information to be useful here.

Vertices and normals are float[3]; color values are float[4].

The face buffer renderer looks like this:

#define FACE_BUFFER_SIZE 1000

typedef struct tFaceBuffer {
grsBitmap *bmBot;
grsBitmap *bmTop;
short nFaces;
short nElements;
int bTextured;
} tFaceBuffer;

void G3EnableClientStates (int bTexCoord, int bColor, int bNormals, int nTMU)
{
glActiveTexture (nTMU);
glClientActiveTexture (nTMU);
glEnableClientState (GL_VERTEX_ARRAY);
if (bNormals)
   glEnableClientState (GL_NORMAL_ARRAY);
else
   glDisableClientState (GL_NORMAL_ARRAY);
if (bTexCoord)
   glEnableClientState (GL_TEXTURE_COORD_ARRAY);
else
   glDisableClientState (GL_TEXTURE_COORD_ARRAY);
if (bColor)
   glEnableClientState (GL_COLOR_ARRAY);
else
   glDisableClientState (GL_COLOR_ARRAY);
}

void BeginRenderFaces (void)
{
G3EnableClientStates (1, 1, 1, GL_TEXTURE0);
glNormalPointer (GL_FLOAT, 0, gameData.segs.faces.normals);
glTexCoordPointer (2, GL_FLOAT, 0, gameData.segs.faces.texCoord);
glColorPointer (4, GL_FLOAT, 0, gameData.segs.faces.color);
glVertexPointer (3, GL_FLOAT, 0, gameData.segs.faces.vertices);
G3EnableClientStates (1, 1, 0, GL_TEXTURE1);
glTexCoordPointer (2, GL_FLOAT, 0, gameData.segs.faces.decalTexCoord);
glColorPointer (4, GL_FLOAT, 0, gameData.segs.faces.color);
glVertexPointer (3, GL_FLOAT, 0, gameData.segs.faces.vertices);
G3EnableClientStates (1, 0, 0, GL_TEXTURE2);
glTexCoordPointer (2, GL_FLOAT, 0, gameData.segs.faces.texCoord);
glVertexPointer (3, GL_FLOAT, 0, gameData.segs.faces.vertices);
}

void G3FlushFaceBuffer (void)
{
//basic vertex ordering is quads, but the program can turn that into tris
if (gameStates.render.bTriangleMesh)
   glDrawElements (GL_TRIANGLES, faceBuffer.nElements, GL_UNSIGNED_INT, faceBuffer.index);
else
   glDrawElements (GL_QUADS, faceBuffer.nElements, GL_UNSIGNED_INT, faceBuffer.index);
}
Descent 1+2 have a segment based engine. A segment is a cuboid, and levels consist of such cuboids attached to each other by their faces.

D2X-XL builds a face list from a level's segment list. Each face has properties like base texture, decal texture, etc.

BeginRenderFaces() is called before faces get rendered.
G3EnableClientStates() accepts the desired client states and the TMU to use as parameters.
The renderer then walks through a list of all faces, culls them (doing a software vertex transformation for that) and calls the face render function for each visible face.
The face render function checks whether a state change would occur, and if so flushes the face buffer.
After that check and eventual flush, the new face is pushed into the face buffer.

For the rendering, hardware transformation is used.
If the face culling is omitted, the renderer gets slower.

knackered, I'd rather have no help from you than in that tone of yours.

skynet was more helpful when telling me I should just throw my polys at the gfx driver no matter what. I simply have no clue which tasks to leave to modern gfx hardware. I know the basic OpenGL stuff (and then some), and I have understood the Descent renderer (which is 15 years old and is a software renderer). That's about it.

05-27-2008, 02:15 PM
knackered, I'd rather have no help from you than in that tone of yours.
Your wish is my command. Good luck, with that tone of yours.

05-27-2008, 03:16 PM
knackered, I'd rather have no help from you than in that tone of yours.
Your wish is my command. Good luck, with that tone of yours.
The only one here who was constantly impolite and generally behaving like an arrogant prick who believes he is smarter than everybody else, and that this gives him the right to behave like a jerk, was you. If you were as smart as you're trying to make us believe, you'd understand that to ask good questions you already need to know half of the answer, and that apparently that is not the case for me.

Good bye.

05-27-2008, 03:35 PM

I have found out that software visibility culling and lighting cost 75% of the time spent in the renderer, so fiddling around with the actual rendering code won't help much.

Unfortunately I cannot do w/o the software culling because simply lighting and rendering all faces makes the program even slower.

I would first have to look into different lighting methods (like deferred lighting).

05-27-2008, 03:41 PM
wow.. this is wrong! Why do you set the vertex pointer 3 times? You have to read some OpenGL manuals before you start coding. Your question is not for the advanced forum.

Regarding your piece of code... Set up the vertex, color & normal pointers once, and then set up the texture pointers for each TMU. Without using VBOs, your vertices are copied every time your app calls glDrawElements.

Anyway, it will not gain you any performance boost. I believe that you have more inappropriate usage of the OpenGL API in your code.

You have to do the following:
1. Use a VBO for vertices
2. Use a VBO for faces (indices)
3. If you have some vertex attributes that change every frame, split your vertex into a static and a dynamic part and store them in two separate VBOs. Update only the VBO that contains the dynamic vertex data.
4. Optimize your software culling. You don't have to check every face.
5. Try to minimize the number of draw calls (glDrawElements, glDrawArrays, ...). Sort faces based on material (textures).
6. Do not call glGetXXXXXX. It causes a pipeline stall.
7. A data layout suitable for the CPU and the programmer is not good for the GPU. Here is a suggested data storage layout.

typedef struct tagVertex {
   float pos[3];
   float norm[3];
   float color[4];
   float tex0[2];
   float tex1[2];
   float tex2[2];
} tVertex;

typedef struct tagFace {
   unsigned int indices[3]; // or unsigned short, depending on vertex count
   // indices refer to the array of tVertex
} tFace;

typedef struct tagCuboid {
   unsigned int faces[12]; // index into the tFace array
   unsigned int face_material_ids[12]; // or 6?
} tCuboid;

typedef struct tagMaterial {
   GLuint textures[3]; // up to 3 textures per face
   unsigned int* pRenderingQueue;
   unsigned int max_queue_size;
   unsigned int queue_pos;
   unsigned int additional_flags; // transparency, texture stages or shader...
} tMaterial;

Store all your vertices in a VBO; tVertex fits the hardware perfectly.
Store all your faces in another VBO; tFace fits the hardware perfectly.
Do software occlusion culling. Select only visible cuboids. Each cuboid has faces and their materials. Add cuboid face indices to the queue of the material the face belongs to (don't forget to reset queue_pos at the beginning of each frame). At the end, iterate through the materials, set up the textures, and render the faces using a glMultiDrawElementsEXT call. This eliminates frequent texture switches and reduces draw calls, but it will not handle multiple lights. I suggest doing the lighting in a separate pass without any textures, then turning on additive blending and rendering the textures.

knackered is an OpenGL guru... sometimes his comments might offend some people. He gave you a few good suggestions, so it's up to you to read again what he said, or wait for somebody else to say the same.

05-27-2008, 04:45 PM

I am unsure what to tell you. Yooyo already stated the most important parts about getting best performance from OpenGL.

Concluding from this little piece of code to the rest of it, there might be some other pitfalls you tapped into. A GLintercept log would have told us ;-)

For instance, you are using glDrawElements plus old vertex arrays. This forces the driver into a kind of immediate mode, since it cannot know in advance how many vertices get touched by your draw call. Thus, the driver just submits one vertex after another until the last triangle has been submitted. This is bad, and gets worse the more passes you render. glDrawRangeElementsEXT + VBO is the way to go.

Also, I cannot imagine (from the screenshots I've seen) how you come close to rendering 100k triangles at all. One map having 10k faces would already be a lot, I guess. In that case I suggest you put the whole map geometry into one static VBO and either render it all every frame or only the visible sub-ranges of it. Do not try to build face lists from visible cuboids at runtime. Convert all your cuboid level geometry at loading time.

Another dubious statement of yours is that you spend most of the time on culling _and_ lighting. Lighting? Shouldn't that be done by the GPU? Is this the reason why you have to change the faces' vertex colors so often?

05-27-2008, 05:01 PM

thank you. I had thought of using VBOs, but I didn't know I could use two simultaneously. AFAIK I can't, or am I wrong? I am already using them for 3D objects (robots, player ships, powerups etc.), so I know the basics about these.

Faces already get sorted by textures.

I wouldn't know how to further optimize the occlusion culling (which costs about 40% of the entire rendering time). Currently the cuboids are walked, beginning at the viewer's segment; cuboid faces are transformed and projected to determine what is occluded by them, until it can be safely said that all further segments' faces are occluded.

I am not using glGet...

Will look at glMultiDrawElementsEXT.

The vertex pointers are set per TMU. Don't the additional TMUs need to know the vertices, too?


I never said I'd come even remotely close to 100K polys @ 30 fps. I know other engines do, that's why I am asking.

A typical Descent 2 mine has dozens and dozens of dynamic (i.e. moving, destructible, flashing) lights. There are often 16 or more lights affecting a single face (particularly during fire fights, which can spam the area with lights), and using fewer leads to lighting flaws. Blame it on the stone age engine. That's why lighting takes so long: I have to determine the closest lights to each face (currently doing this per segment). That means: the fewer faces, the less work in this area, so I need software occlusion culling. I am already using precomputed lightmaps for static lights, but unless I have a stroke of genius (or use deferred lighting), I will not get around that type of light handling.

So while VBOs might help the draw calls, they wouldn't really help much overall, given the draw calls only make up 25% of the entire rendering process.

As I said, even using VBOs for the 3D models only doubled their rendering speed. The only thing I know of that I could apply here is face reordering to optimize the gfx hardware's vertex cache usage.

Edit: I was wrong. It's almost 8 times faster.

05-27-2008, 06:18 PM
Man... you have to read the OpenGL spec before you start coding!

Yes, you can use more than one VBO. Read the VBO spec and the NVIDIA document listed above.

A TMU doesn't care about vertices, colors and normals... again, take a look at this: http://www.opengl.org/documentation/specs/version1.1/state.pdf
Or better, download one of the PDFs from http://www.opengl.org/documentation/specs/ and READ!

A question for you... do you want per-vertex or per-pixel lighting?

unsigned void
05-28-2008, 12:08 AM
If you're doing per-pixel lighting via shaders, then you are not limited to the standard 8 lights. If you manage to push 16 (or however many are needed) light positions (and colors) into uniforms, you can calculate their contribution to the pixel color.
Of course, too many uniforms can be a performance hit of their own, but only experiment will tell.
Another idea - light parameters encoded into a texture.

Nicolai de Haan
05-28-2008, 01:22 AM
Ouch, if occlusion culling eats 40% of the CPU time spent on rendering each frame, then you definitely need to do something (are you sure about that number?). You can optimize the geometry layout all day long, but if the CPU doesn't have time to submit your buffers because it's doing OC, then nothing is gained (only time is lost). Maybe you should look into "bounding volume hierarchies" and "spatial data structures"?

05-28-2008, 01:44 AM

definitely. :D I am pretty sure about that number. For lack of a working profiler I have added some simple time measuring code, and the result is plausible, as the OC'ing does a lot of vertex transformation and projection in software.


AFAIK you must specify the TMU to which a color or tex coord buffer is bound - or how else would you properly do multitexturing with vertex arrays or buffers? I know for sure that I can assign different tex coords to TMU0 and TMU1 this way, because I am already doing it. Color probably not; that wouldn't make sense anyway. Thanks for the hint, I'll try to leave out the color calls for any TMU other than TMU0.


I am doing per-pixel lighting via a shader, but there are pretty hard limits on what I can pass. I have been getting a lot of GLSL linker errors or even just rendering flaws when exceeding them. So I am passing the light sources via the built-in HW lights. I had thought about putting light source data in a texture, but I failed to implement it. There's too much unclear about it. You don't want that texture to be interpolated, for example, so you'd need an orthogonal projection for it, which would collide with rendering the other textures (I had once been playing around with GPGPU stuff).

unsigned void
05-28-2008, 02:07 AM
You don't want that texture to be interpolated, for example, so you'd need an orthogonal projection for it
Just specify no filtering/no mipmaps for it. Orthogonal projection has nothing to do with textures here...
Then you can sample it using texcoords tied to the light number.

05-28-2008, 02:22 AM
You don't want that texture to be interpolated, for example, so you'd need an orthogonal projection for it
Just specify no filtering/no mipmaps for it. Orthogonal projection has nothing to do with textures here...
Then you can sample it using texcoords tied to the light number.
I will definitely try that. It would be awesome to be able to get around that 8 HW lights limitation. Thanks.

I still have a question though: if I address elements of a light data texture using texture coords, then those tex coords are floats. I need to compute them properly, don't I? I.e. light #9 (counting from one) would be at

uniform sampler2D lightTex;
vec4 lightPos = texture2D (lightTex, vec2 (1.0 / 8.0, 1.0 / 8.0));
for an 8x8 texture. Would that work?

05-28-2008, 02:52 AM

In the case of the fixed function pipeline, you have to specify vertex colors, normals, and positions once, and texcoords for each TMU that you want to use.

In the case of shaders (programmable pipeline), you can fetch texels from a texture using any known coordinates in the shader. This means you can sample using texcoords, or use the result of some calculation in the shader as coordinates for the texture fetch.

Because you are doing lighting, split your rendering into several passes:
1. Enable depth write & depth test. Render visible geometry using static lightmaps.
2. Disable depth write, enable additive blending, render the light contributions. Sort your faces by lights and render only the faces that are affected by some light. You can optimize this pass using shaders with multiple lights.
3. Switch blending to multiply. Render visible geometry with textures.

After the first pass you should see lightmaps only. After the second pass you should see lightmaps + lighting results. After the third pass you should see an almost complete frame.

unsigned void
05-28-2008, 04:09 AM
karx11erx: light #8 (counting from 0) in an 8x8 texture would be at (0, 1/8) (the first texel of the second row). Also, there are NPOT (non-power-of-two) textures; with rectangle textures you can even use unnormalized texcoords in the range [0; n-1] (see the specs for details).
And the texture should have more than 8 bits per component if you are going to interpret its components as coordinates in space (or derive a coordinate from multiple components).

05-28-2008, 04:26 AM

I would use a texture with float components. Regarding addressing the light texture elements: of course you are right, I was just typing too fast. I remember that NPOT stuff, but does it work on older hardware? Many Descent fans do not have the latest and greatest (or even second latest) hardware.


Which blend mode is multiplicative (I know that's an absolute noob question...)? Or are you talking about handling that in a shader (where I of course would know how to multiply texture and light values)?

unsigned void
05-28-2008, 04:38 AM
karx11erx: see http://delphi3d.net/hardware/allexts.php for hardware info. Using NPOT is not strictly necessary, but GL_ARB_texture_float imposes some requirements too.

05-28-2008, 06:20 AM
Additive blending:
glBlendFunc(GL_ONE, GL_ONE);

Color multiply (modulate):