Speed problems with OpenGL.

Hi,
As most of you probably know, I have been writing an OpenGL driver for a 3D engine that we have been working on for a while now, and I have managed to get most of it working so far.

We also have a DirectX7 driver that works. I have just tried this driver in place of my OpenGL one and have benchmarked them both.

Driver | FullScreen | Window
DX | 75 FPS | 260 FPS
OpenGL | 14 FPS | 9 FPS

Now, is there any reason why the OpenGL driver is sooooooooo slow?

Here is the mesh data that we have been using:

We have the mesh loaded in and then it is optimised for speed.

We have a hierarchy of meshes in a whole mesh.

Each part of a frame is then broken down into segments. Each of these is then broken down into faces - currently triangles.

The triangles are being rendered as triangle lists.

Model Filename : Wf4 Warmup Jiggle.BM
Mesh: pelvis Total Segments : 1
Total Faces : 12
Mesh: torso Total Segments : 1
Total Faces : 39
Mesh: Rupperarm Total Segments : 5
Total Faces : 8
Total Faces : 10
Total Faces : 4
Total Faces : 3
Total Faces : 4
Mesh: Rlowerarm Total Segments : 3
Total Faces : 1
Total Faces : 2
Total Faces : 7
Mesh: Rhand Total Segments : 5
Total Faces : 4
Total Faces : 10
Total Faces : 4
Total Faces : 3
Total Faces : 9
Mesh: Lupperarm Total Segments : 3
Total Faces : 6
Total Faces : 17
Total Faces : 4
Mesh: Llowerarm Total Segments : 3
Total Faces : 1
Total Faces : 2
Total Faces : 7
Mesh: Lhand Total Segments : 5
Total Faces : 4
Total Faces : 10
Total Faces : 4
Total Faces : 3
Total Faces : 9
Mesh: head Total Segments : 6
Total Faces : 2
Total Faces : 8
Total Faces : 6
Total Faces : 8
Total Faces : 18
Total Faces : 17
Mesh: Rthigh Total Segments : 2
Total Faces : 19
Total Faces : 10
Mesh: Rshin Total Segments : 2
Total Faces : 18
Total Faces : 12
Mesh: Rfoot Total Segments : 4
Total Faces : 8
Total Faces : 10
Total Faces : 2
Total Faces : 8
Mesh: Lthigh Total Segments : 2
Total Faces : 17
Total Faces : 10
Mesh: Lshin Total Segments : 2
Total Faces : 12
Total Faces : 18
Mesh: Lfoot Total Segments : 4
Total Faces : 9
Total Faces : 9
Total Faces : 2
Total Faces : 8

Thanks for the help,
Luke A. Guest.

Whoah, whoah, slow down, you’re going to need to be a LOT more specific about your rendering before you can get any performance help.

Card? Driver? OS?

Immediate mode? Vertex arrays? Display lists? If vertex arrays, what command? (DrawArrays, DrawElements? LockArrays?)

If those are the sizes of the batches, they look a little small. Can you join up all the “segments” in each part of the mesh? Drawing just 1 or 2 triangles at a time is not good.

Can you use triangle strips or fans to reduce the amount of data and the duplication of vertices?
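
As a rough illustration of joining segments, here is a minimal C++ sketch that concatenates several indexed triangle lists into one batch by offsetting each segment's indices. The `Segment` and `Batch` types and the `mergeSegments` function are hypothetical names, not the engine's actual code, and merging like this only makes sense for segments that share the same texture and render state.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical segment: an indexed triangle list whose indices
// are relative to the segment's own vertices.
struct Segment {
    std::vector<float> positions;        // x, y, z per vertex
    std::vector<std::uint32_t> indices;  // 3 per triangle
};

// Hypothetical merged batch with one shared vertex and index array.
struct Batch {
    std::vector<float> positions;
    std::vector<std::uint32_t> indices;
};

// Concatenate all segments, shifting each segment's indices by the
// number of vertices already in the batch.
Batch mergeSegments(const std::vector<Segment>& segments) {
    Batch batch;
    for (const Segment& segment : segments) {
        const std::uint32_t base =
            static_cast<std::uint32_t>(batch.positions.size() / 3);
        batch.positions.insert(batch.positions.end(),
                               segment.positions.begin(),
                               segment.positions.end());
        for (std::uint32_t index : segment.indices) {
            batch.indices.push_back(base + index);
        }
    }
    return batch;
}
```

The merged arrays could then be submitted with a single glDrawElements call using GL_TRIANGLES, rather than one call per 1-10 triangle segment.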

What kinds of features do you have enabled? Lighting? If so, how many and what type of lights? Texgen or texture matrix? Anything else?

Where inside your model are you doing state changes? State changes hurt performance, especially when you draw only 1-10 triangles per batch, since the first batch after any state change will have some overhead to handle the state changes.

- Matt

Get ahold of a good profiler and profile your code. VTune is great, I would recommend that. Profiling will reveal all.

Originally posted by mcraighead:
Whoah, whoah, slow down, you’re going to need to be a LOT more specific about your rendering before you can get any performance help.

Okay…sorry :wink:


Card? Driver? OS?

My machine is…Win98 with Graphics Blaster RIVA TNT. The latest FastTrax drivers - thanks Matt :wink:


Immediate mode? Vertex arrays? Display lists? If vertex arrays, what command? (DrawArrays, DrawElements? LockArrays?)

I’m using vertex arrays. We have 2 types of render command, one will use DrawElements() and the other will use DrawArrays() depending on whether there are indices available.

If those are the sizes of the batches, they look a little small. Can you join up all the “segments” in each part of the mesh? Drawing just 1 or 2 triangles at a time is not good.

Can you use triangle strips or fans to reduce the amount of data and the duplication of vertices?

I didn’t actually write the optimization code. I suppose it could be changed if we can find a better way that works just as fast for both DX and OpenGL, but they won’t change the code just for OpenGL :frowning:


What kinds of features do you have enabled? Lighting? If so, how many and what type of lights? Texgen or texture matrix? Anything else?

Ok. We have ambient lighting on plus one directional light.

No texture coordinate generation and no texture matrix. We use the multitexture extension if it’s available, though.


Where inside your model are you doing state changes? State changes hurt performance, especially when you draw only 1-10 triangles per batch, since the first batch after any state change will have some overhead to handle the state changes.

Unfortunately, this is the difficult part, in that the state changes really reflect the way the 3D engine has been designed: to pretty much match DX.

The segments are put into texture order to limit the number of texture changes required. I have stored the texture params in each texture so that they don’t need to be reapplied when the texture is rebound - unless the param changes.
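
To make the texture-order idea concrete, here is a minimal C++ sketch that sorts draw items by texture and counts how many binds a given order would need. `DrawItem`, `countTextureBinds` and `sortByTexture` are hypothetical names for illustration and are not taken from the engine.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical draw record: which texture a segment needs.
struct DrawItem {
    unsigned textureId;
    int segmentIndex;
};

// Count the texture binds a draw order needs if a bind is issued only
// when the texture differs from the previous item's texture.
int countTextureBinds(const std::vector<DrawItem>& items) {
    int binds = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].textureId != items[i - 1].textureId) {
            ++binds;
        }
    }
    return binds;
}

// Group items by texture; stable_sort keeps the original order
// of segments that use the same texture.
void sortByTexture(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.textureId < b.textureId;
                     });
}
```

With five items using textures 2, 1, 2, 1, 3 in that order, the unsorted list needs five binds and the sorted list needs three.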

Thanks,
Luke.

I’d advise checking the initialization, especially the pixel format. Are you sure the format you are requesting is available in hardware on your card? Your framerate looks as if you’re falling back to software mode…
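
One common way to check this on Windows is to look at the dwFlags field that DescribePixelFormat fills in for the chosen format. Below is a minimal sketch of that test in C++. The two constants mirror the values of PFD_GENERIC_FORMAT and PFD_GENERIC_ACCELERATED from the Windows headers and are redefined here only so the snippet builds without them; `PixelFormatKind` and `classifyPixelFormat` are hypothetical names.

```cpp
#include <cstdint>

// Mirrors of the Windows PFD_* flag values, redefined here as an
// assumption so that the sketch compiles without <windows.h>.
constexpr std::uint32_t kPfdGenericFormat      = 0x00000040;  // PFD_GENERIC_FORMAT
constexpr std::uint32_t kPfdGenericAccelerated = 0x00001000;  // PFD_GENERIC_ACCELERATED

enum class PixelFormatKind {
    Software,                // generic implementation, no acceleration
    MiniClientDriver,        // generic implementation accelerated by an MCD
    InstallableClientDriver  // full hardware driver (ICD)
};

// Classify a pixel format from the dwFlags of its PIXELFORMATDESCRIPTOR.
PixelFormatKind classifyPixelFormat(std::uint32_t dwFlags) {
    const bool generic     = (dwFlags & kPfdGenericFormat) != 0;
    const bool accelerated = (dwFlags & kPfdGenericAccelerated) != 0;
    if (generic && !accelerated) {
        return PixelFormatKind::Software;
    }
    if (generic && accelerated) {
        return PixelFormatKind::MiniClientDriver;
    }
    return PixelFormatKind::InstallableClientDriver;
}
```

If the format actually selected comes back as `Software`, the application is rendering through the generic software implementation rather than the card's driver.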

Y.

So SW rendering is a possibility, but it seems unlikely. I wouldn’t count it out quite yet.

Vertex arrays… which arrays do you enable, and what data formats?

One directional light is fine. Make sure local viewer is off unless you need it.

You’re not just happening to be using vertex weighting, as you mentioned in the previous thread, are you? It just so happens that vertex weighting isn’t… uhhh, particularly optimized on TNT/TNT2. If you’re using the DirectX T&L routines, they probably have optimized SSE/3DNow routines for vertex blending, whereas our implementation may just happen to be extremely unoptimized. [don’t say I didn’t warn you – I remember saying that performance might be “suboptimal” ]

If you are using it, try turning it off [and to be safe, also disable the vertex weight array] and see what that does.

- Matt

Originally posted by mcraighead:
So SW rendering is a possibility, but it seems unlikely. I wouldn’t count it out quite yet.

Well, I’m using the same pixel format descriptor as the NeHe tutorials. I have got a routine to go through each of the pixel formats, but I need to get more info on them before I start to use that routine.


Vertex arrays… which arrays do you enable, and what data formats?

In this order…

Vertex, texture coord, normal, colour, weight.


One directional light is fine. Make sure local viewer is off unless you need it.

Well, it gets turned on as a render state - like the DX one. AFAIK it’s the specular enable render state, but it doesn’t get turned on at the moment.


If you are using it, try turning it off [and to be safe, also disable the vertex weight array] and see what that does.

Ok, I turned it off and got a marginal increase in speed.

FullScreen = 12-14 FPS
Window = 12-15 FPS

Thanks,
Luke.

I forgot to mention that I’m keeping an array of active lights so that they can be transformed - we want the directional light to be stationary.

I just knocked that off and it made no difference either.

I’ll have a look at the pixel format thing.

Thanks again,
Luke.

I would have expected a much larger gain from turning off vertex weighting. It sounds like you are doing something else that doesn’t make us very happy. (Note, as I said, that you should still also disable the vertex weight array.)

You mentioned the D3D specular enable – are you turning on OGL separate specular? If you are, try without, since that feature can also result in a performance hit. Local viewer is definitely also a performance hit. (directional lights are really easy to do with infinite viewer – the math simplifies pretty nicely)

What are the specific data types of your arrays? For example, given the arrays you mentioned, the format most likely to be optimal would be V3F/T2F/N3F/C4UB. (W1F omitted, for reasons stated above)
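
For reference, the V3F/T2F/N3F/C4UB layout can be pictured as the struct below. This is only a sketch: `PackedVertex`, `toByte` and `packColour` are hypothetical names, and the conversion assumes colours that start out as floats in the range 0 to 1.

```cpp
#include <cstddef>
#include <cstdint>

// One vertex in V3F/T2F/N3F/C4UB order: 12 + 8 + 12 + 4 = 36 bytes.
struct PackedVertex {
    float position[3];       // V3F
    float texCoord[2];       // T2F
    float normal[3];         // N3F
    std::uint8_t colour[4];  // C4UB
};

static_assert(sizeof(PackedVertex) == 36,
              "expected a tightly packed 36-byte vertex");

// Convert one float colour channel in [0, 1] to an unsigned byte,
// clamping out-of-range input and rounding to nearest.
inline std::uint8_t toByte(float channel) {
    if (channel <= 0.0f) return 0;
    if (channel >= 1.0f) return 255;
    return static_cast<std::uint8_t>(channel * 255.0f + 0.5f);
}

// Convert an RGBA colour of four floats into four unsigned bytes.
inline void packColour(const float* rgba, std::uint8_t* out) {
    for (int i = 0; i < 4; ++i) {
        out[i] = toByte(rgba[i]);
    }
}
```

Storing the colour as four bytes instead of four floats takes 4 bytes per vertex rather than 16, so the whole vertex is 36 bytes instead of 48.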

Basically, turn stuff off until it becomes fast. Hopefully you will discover that there is a single thing that you can disable to get a massive performance gain, which will in turn clue me in.

- Matt

Originally posted by mcraighead:
I would have expected a much larger gain from turning off vertex weighting. It sounds like you are doing something else that doesn’t make us very happy. (Note, as I said, that you should still also disable the vertex weight array.)

Yeah, I did comment out all of the vertex weight code.


You mentioned the D3D specular enable – are you turning on OGL separate specular? If you are, try without, since that feature can also result in a performance hit.

Yup, turned off the separate specular and, still, no change.


What are the specific data types of your arrays? For example, given the arrays you mentioned, the format most likely to be optimal would be V3F/T2F/N3F/C4UB. (W1F omitted, for reasons stated above)

As for the data storage, we have an array template that stores different types that we can cast to get an array in memory of the data we need.

Vertices are either vectors or points (3 floats or 2 longs).

Normals are also vectors.

Texture coords are a structure containing 2 floats.

Colours are 4 floats.

Weights are just floats.

Now, you mention the array data types here; I’m not using interleaved arrays. I’m assuming that these would be faster? I’ll try that.

Basically, turn stuff off until it becomes fast. Hopefully you will discover that there is a single thing that you can disable to get a massive performance gain, which will in turn clue me in.

Thanks again.
Luke.

Hi,

I have changed the code that sets up the vertex arrays so that it minimizes the number of times it calls glEnableClientState/glDisableClientState, and that didn’t make a difference.

I am still wondering why it can be soooooo much faster under DX with the same mesh?

BTW, can I use weights in an interleaved array?

Thanks,
Luke.

Interleaved arrays do not make a significant difference. That’s just a concise way of representing the format.

“2 longs” for vertices – does that mean 2, GL_INT? If so, that’s not exactly the greatest format, though by itself it shouldn’t lead to an exceptional amount of slowness.

I’m starting to wonder whether you’re really hitting SW rasterization for some reason.

- Matt

Matt,

If Lucretia is an nVidia registered developer, perhaps he could try those special drivers that profile your OpenGL code ???

I have not tried them coz’ I am running NT so I do not know in which way they could be useful… But as far as I know they tell you which gl calls you made. Do they tell you if you fall back to SW mode ?

Regards.

Eric

If it’s using SW rasterization, then wouldn’t he be able to test that by using a very small window and seeing whether that gives a large performance gain?

Originally posted by mcraighead:
Interleaved arrays do not make a significant difference. That’s just a concise way of representing the format.

Okay, fair enough. I wasn’t really sure because we have to have weighting on at some point and it cannot be used with interleaved arrays AFAIK.


“2 longs” for vertices – does that mean 2, GL_INT? If so, that’s not exactly the greatest format, though by itself it shouldn’t lead to an exceptional amount of slowness.

Not for normal 3D vertices; 2D vertices. It seemed better to create these as two longs (GL_UNSIGNED_INT) so these could be transformed into an orthographic projection without any hassle.

If I used 3 or 4 dimensional vertices for this, my 2D graphics disappeared. Probably a Z buffer problem.


I’m starting to wonder whether you’re really hitting SW rasterization for some reason.

Well, I thought about this and wrote a routine to go through each pixel format looking for a hardware accelerated (ICD) format with 16-bit colour and Z depth, and I still have no increase.

Can I just ask: normally, when a 3D engine is written with loadable drivers, is there anything special that needs to be done in order for the performance of OpenGL and DX to be the same or of similar speed?

Because, I cannot see how DX can be that much faster than OpenGL using the same 3D engine. I would assume that there would be minor problems between the implementations but nothing that disparate.

The only thing that I can think of that could cause this is the vertex buffer class. DX allows you to write directly to the hardware memory, whereas OpenGL doesn’t. This means that you have to reload the data each time. But even then you have to do this with DX - but the vertex buffer is still on the card.

Thanks,
Luke A. Guest.

But you’re on a TNT, so all your vertex buffers will be kept in system memory – they need to be processed by the CPU.

I’m pretty confused.

Is there any way I can try to run your app?

- Matt

Hi all,

I have found the problem, I feel a bit of a dick now, but anyway.

The problem was our logging macro - it was being built into the final release.
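
The macro itself is not shown in this thread, so the following is only a minimal sketch of the usual fix, assuming a hypothetical `ENGINE_ENABLE_LOGGING` switch that is defined for debug builds only. When the switch is absent, `ENGINE_LOG` expands to nothing, so neither the output call nor the evaluation of its arguments ends up in the release build. `describeSegment` and `drawSegments` are made-up helpers for illustration.

```cpp
#include <cstdio>

// Hypothetical build switch: define ENGINE_ENABLE_LOGGING in debug builds only.
#ifdef ENGINE_ENABLE_LOGGING
#define ENGINE_LOG(...) std::fprintf(stderr, __VA_ARGS__)
#else
#define ENGINE_LOG(...) ((void)0)  // compiles to nothing; arguments are never evaluated
#endif

// Counts how often the (potentially expensive) formatting helper runs.
int g_describeCalls = 0;

// Made-up helper that builds a log string for a segment.
const char* describeSegment(int index) {
    static char buffer[32];
    ++g_describeCalls;
    std::snprintf(buffer, sizeof(buffer), "segment %d", index);
    return buffer;
}

// Made-up render loop that logs once per segment.
void drawSegments(int count) {
    for (int i = 0; i < count; ++i) {
        ENGINE_LOG("drawing %s\n", describeSegment(i));
        // ... issue the draw call for segment i here ...
    }
}
```

With logging compiled out, a loop like this does no formatting or I/O per segment, which matters when every frame issues dozens of small batches.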

Thanks to everybody on this board for the help though :wink:

It’s nice to see it running as fast as DX :smiley:

Thanks,
Luke A. Guest.

Luke,

Can you give us the modified benchmarks ?
Is OpenGL exactly as fast as D3D ?
Is there a difference ?

That could be very interesting !

Best regards.

Eric

Ok…

Driver | Window | Fullscreen
OpenGL | 74-75 FPS | 74-75 FPS
DX | 260 FPS | 75 FPS

Now if I use the WGL_EXT_swap_control extension, I get 237 FPS in both windowed and fullscreen modes.

DX fluctuates severely between 300(ish) FPS and 200(ish) FPS in windowed mode and stays at 74-75 FPS in fullscreen.

Hope that helps,
Luke A. Guest.

Sounds like you need a lot more geometry and scene complexity. 237 fps on a TNT1 even while using something fairly slow (vertex weighting), just imagine how fast it could go with something more recent…

- Matt