glLoadMatrix performance extension ?



ToolTech
03-17-2003, 06:44 AM
Hi !

I have a small problem. I have a large number of various geometry sets. Each set is an indexed list of tristrips etc.

Each set is drawn with a large number of different modelview matrixes to replicate it at different positions, scales, rotations etc.

I have two bottlenecks left. One is the actual geometry transfer, but this is quite simple to improve with VAR or VAO etc.

The last one is glLoadMatrix. Is there any extension or way to send a huge chunk of matrixes to the GFX mem and then use indices to select the transform ?

I cannot pre-transform the geometry through each transform because that makes the memory consumption too large.

Ysaneya
03-17-2003, 07:43 AM
You can probably trick it with a vertex shader: load N matrices into the constants, then issue one draw call in which each vertex selects which matrix to use and is transformed accordingly.

glLoadMatrix is generally not too costly, how many calls per frame are we talking about here ?

Y.
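
Editor's sketch of the trick described above, emulated on the CPU so it runs without a GL context. It assumes a hypothetical vertex layout (`PaletteVertex`) that carries a matrix index next to the position; in a real vertex program the palette would live in the program constants and the lookup would happen per vertex on the GPU. All names here are illustrative, not from any poster's code.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major 4x4 matrix, the same layout glLoadMatrixf expects. */
typedef struct { float m[16]; } Mat4;

/* Hypothetical vertex layout: position plus an index into the matrix palette. */
typedef struct { float x, y, z; unsigned matrix_index; } PaletteVertex;

/* Build a translation matrix (column-major: translation sits in m[12..14]). */
static Mat4 mat4_translation(float tx, float ty, float tz)
{
    Mat4 r = {{ 1,0,0,0,  0,1,0,0,  0,0,1,0,  tx,ty,tz,1 }};
    return r;
}

/* Transform a point (implicit w = 1) by a column-major matrix. */
static void transform_point(const Mat4 *mat, const float in[3], float out[3])
{
    for (int row = 0; row < 3; ++row) {
        out[row] = mat->m[0 + row] * in[0]
                 + mat->m[4 + row] * in[1]
                 + mat->m[8 + row] * in[2]
                 + mat->m[12 + row];
    }
}

/* What the vertex program would do for each vertex:
 * pick a matrix from the palette by index, then transform the position. */
static void palette_transform(const Mat4 *palette, size_t palette_size,
                              const PaletteVertex *v, float out[3])
{
    assert(v->matrix_index < palette_size);
    const float in[3] = { v->x, v->y, v->z };
    transform_point(&palette[v->matrix_index], in, out);
}
```

The per-vertex `matrix_index` is the extra data jwatte objects to in the next post: it has to be streamed with every vertex of every instance.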

jwatte
03-17-2003, 08:08 AM
I would recommend against trying to put N modelview matrices into a vertex program, and then sending more per-vertex data to select one. You'd have to re-base your index lists for this to work -- even assuming everything renders with the same material/shader/texture.

It's common to call LoadMatrix once per mesh instance in most scene graphs. Typically, the cost of this is much less than the cost of setting up textures, blending, lighting etc per mesh instance anyway (where one mesh conceptually consists of one or more calls to DrawElements).

Note: it's cheaper to draw 10 meshes of 1000 triangles each than to draw 100 meshes of 100 triangles each. That'll probably never change :-/

V-man
03-17-2003, 08:28 AM
Isn't it more likely that the person will be using glMultMatrix?

If you use glLoadMatrix, that means all your objects are root, but if you have a system of parents and children, then I think glMultMatrix is a nice way to do this.

For me, matrix calls are not what brings down the performance, so I don't worry.

Ysaneya
03-17-2003, 09:32 AM
Yeah, I'd definitely agree with you jwatte, generally it's not worth using that trick. However, think about a case where you have a very large number of objects made of very few triangles (say, cubes, so 12 triangles each): then the vertex shader trick might improve your performance a bit.

V-Man: the problem with glMultMatrix is that generally, you have to use a push and pop to save the matrix stack. So it's actually one glLoadMatrix (basically a copy) compared to three calls (i'm guessing 2 copies with one heavy matrix multiplication).

Y.

ToolTech
03-17-2003, 11:41 AM
The scene graph compiles the graph into a linear bunch of geometry sorted after state and geometry. The traversal stage generates the hierarchical mults into a loadable matrix for each geom.

Would I benefit from putting the matrixes into a nvAllocMem area ? Any driver developer who can comment on this ?

Would it be feasible for future OpenGL apps to have a matrix array extension ? I am rendering "compiled" sets of trees, 100-500 different types with 100 various combinations of branch sets times number of trees

I can see that initially I gain from multiplying the shared geom with the matrix, but when the number of transformed geosets increases, the VAO and VAR get into trouble. Hard to find a general equilibrium.
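
Editor's sketch of the traversal step ToolTech describes, where the hierarchical multiplies are collapsed into one loadable matrix per geometry. The `Node` layout (local transform plus parent index, parents stored before children) is an assumption made for illustration, not the actual scene graph.

```c
#include <assert.h>

/* Column-major 4x4 matrix, matching what glLoadMatrixf consumes. */
typedef struct { float m[16]; } Mat4;

/* Returns a * b for column-major matrices. */
static Mat4 mat4_mul(const Mat4 *a, const Mat4 *b)
{
    Mat4 r;
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a->m[k * 4 + row] * b->m[col * 4 + k];
            r.m[col * 4 + row] = sum;
        }
    }
    return r;
}

/* Hypothetical flattened node: local transform and parent index (-1 = root).
 * Parents are assumed to appear before their children in the array. */
typedef struct { Mat4 local; int parent; } Node;

/* Computes one ready-to-load matrix per node: world = parent_world * local. */
static void flatten_transforms(const Node *nodes, int count, Mat4 *world)
{
    for (int i = 0; i < count; ++i) {
        assert(nodes[i].parent < i);  /* enforce parent-before-child order */
        if (nodes[i].parent < 0)
            world[i] = nodes[i].local;
        else
            world[i] = mat4_mul(&world[nodes[i].parent], &nodes[i].local);
    }
}
```

At draw time each `world[i]` (combined with the view matrix) would go to a single `glLoadMatrixf(world[i].m)` call, with no push, multiply, or pop on the GL matrix stack.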

V-man
03-17-2003, 09:04 PM
Originally posted by ToolTech:
Would I benefit from putting the matrixes into a nvAllocMem area ? Any driver developer who can comment on this ?
..........
I can see that initially I gain from multiplying the shared geom with the matrix, but when the number of transformed geosets increases, the VAO and VAR get into trouble. Hard to find a general equilibrium.

I can't answer your question, but what do you mean by multiplying the shared geom with the matrix? In software, or still using GL?

My understanding is that VAR and VAO are pretty much the best in terms of performance as long as you respect the card's limits.
I think 1 or 2 people have said that using immediate mode was actually faster for them instead of VAO. They probably did something wrong.

Can you give some numbers?

ToolTech
03-18-2003, 01:40 AM
I have the geometry G1,G2,G3... etc but for each geometry I render M1G1,M2G1,M3G1 etc.

I could of course eliminate the matrixes M1...Mx by creating geometries G11=M1G1, G21=M2G1 etc., but that would lead to too much memory allocation and slower performance in ABO etc.
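
Editor's sketch of the memory trade-off behind this choice. The two helpers compare keeping one shared copy of G1 plus one 4x4 float matrix per instance against storing a pre-transformed copy per instance. The vertex size and counts used with them are made-up illustration values, not figures from the thread.

```c
#include <stddef.h>

/* Shared geometry: one copy of the vertices, plus a 4x4 float matrix per instance. */
static size_t shared_bytes(size_t vertex_count, size_t bytes_per_vertex,
                           size_t instances)
{
    return vertex_count * bytes_per_vertex + instances * 16 * sizeof(float);
}

/* Pre-multiplied geometry: every instance owns a transformed copy of the vertices. */
static size_t premultiplied_bytes(size_t vertex_count, size_t bytes_per_vertex,
                                  size_t instances)
{
    return vertex_count * bytes_per_vertex * instances;
}
```

For example, with an assumed 1000 vertices of 32 bytes each and 200 instances, the shared form needs 32 000 bytes of vertices plus 200 matrices, while the pre-multiplied form needs 200 full vertex copies.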

ToolTech
03-18-2003, 08:11 AM
I think I can see that the gzLoadMatrix is not the performance staller. I think it is related to previous vertex array transfers. Am I right to assume that a glLoadMatrix will wait for a previous call to glDrawBuffers to finish ??

Ysaneya
03-18-2003, 09:34 AM
Originally posted by ToolTech:
Am I right to assume that a glLoadMatrix will wait for a previous call to glDrawBuffers to finish ??


I don't think so. Unless you call glFinish, or read back some result from the video card (through a glGetXXX function), i wouldn't expect anything to block. Your glDrawBuffer, then glLoadMatrix calls should be queued up in OpenGL's rendering command queue and return immediately. At least i don't see any reason for them to wait.

Y.

Won
03-19-2003, 02:01 PM
Well, it doesn't sound like it would help anymore, but the basic solution to the original problem would be to encapsulate calls to LoadMatrix (or whatever state changes are necessary) in a display list. It's a problem if these get updated frequently.

-Won

ToolTech
03-20-2003, 08:15 AM
oops !

Sorry ! Wrong again. When I loop 10 times over the loadMatrix I get an increased load in the loadMatrix. Not 10 times, but at least 3 times more load there compared to the geometry.

It looks like I shall try to use display lists for the matrix anyway. Would that be faster ??

My question about storing the matrix in nvAllocMem memory. Would that be faster ??

kieranatwork
03-20-2003, 08:39 AM
I've encountered similar results to you, tooltech. glLoadMatrix has become a bottleneck for me too. VTune reports it as a hotspot.
You've just gotta sigh and accept that the hit is the lesser of a massive list of evils (texture upload, state changes etc.), I suppose.
Of course, I occasionally get arrangements in the render list where several meshes under the same transform node just happen to have the same material, and so I check the current matrix pointer against the mesh's matrix pointer to save myself a glLoadMatrix call - quite a rare occurrence, as you can imagine. :)
You've got to accept that most uses of your scenegraph are going to be moving relatively few geometries around the place, so you should make use of a low priority worker thread that constantly traverses your scenegraph looking for transforms that haven't changed in X number of frames, and then manually transform all the (non-instanced) mesh vertices under that transform into world space - but then when that transform changes, you'll have to do an inverse transform on the verts to get them back into local space :)

[This message has been edited by kieranatwork (edited 03-20-2003).]
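
Editor's sketch of the pointer check described in the post above: remember which matrix object was last sent to GL and skip the upload when the next mesh refers to the same one. `MatrixCache` and the `load` callback are hypothetical names; in a real renderer the callback would be a thin wrapper around `glLoadMatrixf`.

```c
#include <stddef.h>

typedef struct { float m[16]; } Mat4;

/* Hypothetical stand-in for the real upload, e.g. a wrapper around glLoadMatrixf.
 * May be NULL for a dry run that only counts uploads. */
typedef void (*LoadMatrixFn)(const float *m);

typedef struct {
    const Mat4  *current;       /* matrix most recently uploaded, NULL if unknown */
    LoadMatrixFn load;          /* upload callback */
    unsigned     loads_issued;  /* how many uploads were actually made */
} MatrixCache;

/* Upload only when the mesh points at a different matrix than the last one.
 * This compares pointers, not contents, so a matrix that is edited in place
 * must be followed by matrix_cache_invalidate(). */
static void matrix_cache_bind(MatrixCache *cache, const Mat4 *m)
{
    if (cache->current == m)
        return;                 /* same matrix object: skip the redundant load */
    if (cache->load)
        cache->load(m->m);
    cache->current = m;
    cache->loads_issued++;
}

/* Forget the cached pointer, e.g. at frame start or after an in-place edit. */
static void matrix_cache_invalidate(MatrixCache *cache)
{
    cache->current = NULL;
}
```

The saving only appears when consecutive entries in the sorted render list share a transform, which, as the post says, is rare.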

ToolTech
03-20-2003, 08:44 AM
So the whole idea I am working on now is bad.. ?

What I am doing is to make a scene graph compiler as part of my system. It detects and simplifies the graph when there is no change and optimizes it into just some shared geometry uploads and a whole bunch of glLoadMatrix...

jwatte
03-20-2003, 10:48 AM
Originally posted by kieranatwork:
You've got to accept that most uses of your scenegraph are going to be moving relatively few geometries around the place, so you should make use of a low priority worker thread that constantly traverses your scenegraph looking for transforms that haven't changed in X number of frames, and then manually transform all the (non-instanced) mesh vertices under that transform into world space

That would require blowing up all meshes into unique vertex buffers. You'd run out of memory very quickly that way if you have, say, 50 tree models, each instanced 200 times. (You should check out ToolTech's company's demo, btw; it's quite decent)

Also, I wouldn't want to traverse the graph in any kind of background manner. If you're not drawing something, you don't need to worry about it. Thus, if there's something you need to do after N frames of rendering equality, I'd probably do that inside my Render function; possibly with logic to re-set the counter if something renders after not having been rendered for a few frames. All of this state can be kept local to each renderable instance, and just requires that the current frame number is passed in as an argument to ::render().

Efficiency Rule #1: only touch data that you need to touch.
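
Editor's sketch of the per-instance bookkeeping suggested above, written as plain C rather than a ::render() method. The struct fields, the `max_gap` and `threshold` parameters, and the boolean return value are assumptions chosen for illustration.

```c
#include <stdbool.h>

/* State kept locally in each renderable instance. */
typedef struct {
    unsigned last_rendered_frame; /* frame number of the previous render call */
    unsigned stable_frames;       /* consecutive rendered frames, transform unchanged */
    bool     ever_rendered;
} InstanceState;

/* Called only when the instance is actually drawn, with the current frame number.
 * Returns true once the transform has stayed unchanged for `threshold` rendered
 * frames in a row, i.e. when the instance becomes a candidate for optimization.
 * The counter resets if the transform changed, or if the instance went
 * unrendered for more than `max_gap` frames. */
static bool instance_render(InstanceState *s, unsigned frame,
                            bool transform_changed,
                            unsigned max_gap, unsigned threshold)
{
    bool gap_too_long = !s->ever_rendered ||
                        (frame - s->last_rendered_frame) > max_gap;

    if (transform_changed || gap_too_long)
        s->stable_frames = 0;
    else
        s->stable_frames++;

    s->last_rendered_frame = frame;
    s->ever_rendered = true;

    /* ... the actual draw calls for this instance would go here ... */

    return s->stable_frames >= threshold;
}
```

Because the check runs only inside the render path, instances that are never drawn are never touched, which is the point of Rule #1.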

kieranatwork
03-20-2003, 10:57 AM
Huh? I said "non-instanced" geometry, not all geometry. Obviously I wasn't talking about instanced geometry (I even stated it), otherwise instancing would be pointless.
Only touch data that you need to touch?
A low priority thread should be scheduled by the OS not to interfere with CPU caching, otherwise what's the point in having a multi-thread, multi-process operating system? Do you somehow freeze Windows Explorer when your program runs so it doesn't mess up your CPU cache?
It won't interfere with your core loops.
Or have I misunderstood something about the way the scheduler works?

ToolTech
03-20-2003, 12:45 PM
Ok. I have made some more checks now.

I am able to "compile" the scene graph into a sorted bunch of data (states,matrixes and geometry)

In my scenario I now have about 10-50 shared geometries per state and about 50-100 000 matrixes per geometry.

The idea is to build a lot of varying trees etc with the same shared geometries.

I can now see that it is a lot more efficient to keep them shared and not premultiply the matrixes. It doesn't take long before the load to upload various premultiplied geometries gets heavier than uploading matrixes.

What I need is a faster way to upload matrixes. What happens on the CPU when a glLoadMatrix is performed ?

Is there any possibility to use VBO buffer for e.g. matrix data to a VP ?

Is there any extension that allows you to specify an array buffer of modelview matrixes ? like vertex arrays.

Ysaneya
03-21-2003, 12:34 AM
I don't think there's any way to use video or AGP memory to store matrices with VAO/VBO. I'd expect the same for VAR, but it depends on what the driver is doing with the matrix. It's possible that it copies it to a system-memory matrix stack (in case you want to glGet it), then sends it to the video card. In that case, using VAR to store the matrix would require the driver to do a video readback, which is extremely slow. I think your best bet is to decrease your number of matrix uploads by an order of magnitude.

Y.

ToolTech
03-21-2003, 01:21 AM
Agree. I have been able to make an optimizing utility that measures the load and finds the point when an optimal set of premultiplied geometries can be transferred where the sum of times for glLoadMatrix + the sum of VAR times are minimal.

However, I would like an extension where I could upload an array of matrixes instead. This would be very efficient ! Matt ?? Cass ?? Evan ?

knackered
03-21-2003, 02:31 AM
Aren't you supposed to post requests like this to: http://www.opengl.org/discussion_boards/cgi_directory/forumdisplay.cgi?action=topics&forum=Suggestions+for+the+next+release+of+OpenGL&number=7&DaysPrune=20&LastLogin=

??

pocketmoon
03-21-2003, 06:18 AM
Originally posted by jwatte:
I would recommend against trying to put N modelview matrices into a vertex program, and then sending more per-vertex data to select one.

Nothing wrong with batching in a vertex shader this way if you are instancing identical objects in many locations, especially if your model-world transforms aren't constant.

jwatte
03-21-2003, 06:37 PM
Kieran,



A low priority thread should be scheduled by the OS not to interfere with CPU caching, otherwise what's the point in having a multi-thread, multi-process operating system?


Just because you have more threads available doesn't mean you'll be more efficient.

If you touch "background" data in a "low priority" thread, that's still work for the CPU. It still brings that data into cache, and TLB, and from disk, at some point, assuming all threads get time now and then.

The point is that you don't NEED to bring those things in, because they're not being used, so they're not candidates for the optimization. Thus, you're better off only checking for the optimization inside the render function of the object itself, because then you automatically only worry about data that you actually need to worry about.

V-man
03-21-2003, 08:55 PM
Originally posted by pocketmoon:
Nothing wrong with batching in a vertex shader this way if you are instancing identical objects in many locations, especially if your model-world transforms aren't constant.

How exactly is the selection supposed to happen?
And are we talking about ARB or NV (1.0, 1.1, 2.0)?

jwatte
03-22-2003, 09:51 AM
pocketmoon,

If you batch a bunch of instances like that, you have the following problems:

1) you need to duplicate the triangle list N times for a batch of size N
2) you need to additionally stream down a matrix index per vertex
3) oh, wait, that means that you can't actually re-use your vertices!
4) suddenly this idea isn't so great

Even if you could re-use your vertices, you'd have the extra bandwidth of the matrix index. Meanwhile, uploading all those matrices in a big batch would probably consume as much bandwidth as uploading them one at a time.

I believe that if you're limited inside LoadMatrix, you're actually limited on per-batch setup overhead, and the only way forward would be to do software transform/aggregation, and lose the instancing.
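
Editor's sketch of what points 1) to 3) above cost in practice: to draw N instances in one batch, the index list is copied N times with each copy re-based onto its own block of vertices, and every vertex gets a matrix index. Function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the index list once per instance, re-basing each copy so it points
 * at that instance's own block of vertices.
 * out_indices must hold index_count * instances entries. */
static void build_batched_indices(const unsigned *indices, size_t index_count,
                                  size_t vertex_count, size_t instances,
                                  unsigned *out_indices)
{
    for (size_t i = 0; i < instances; ++i) {
        for (size_t k = 0; k < index_count; ++k) {
            assert(indices[k] < vertex_count);
            out_indices[i * index_count + k] =
                indices[k] + (unsigned)(i * vertex_count);
        }
    }
}

/* Fills the extra per-vertex attribute: which matrix each vertex should use.
 * out_ids must hold vertex_count * instances entries, which is why the
 * vertices themselves can no longer be shared between instances. */
static void build_matrix_ids(size_t vertex_count, size_t instances,
                             unsigned *out_ids)
{
    for (size_t i = 0; i < instances; ++i)
        for (size_t v = 0; v < vertex_count; ++v)
            out_ids[i * vertex_count + v] = (unsigned)i;
}
```

Both output arrays grow linearly with the batch size, so the batch trades N matrix loads for N copies of the index and vertex streams.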

pocketmoon
03-22-2003, 10:49 AM
If you say so.