
View Full Version : glGet Alternative?



Rhawk187
05-20-2008, 02:56 PM
I am doing some work that involves using the graphics card's 4x4 matrix multiplier. I did some experimenting and found that it took me about 200ms to do 1,000,000 4x4 matrix multiplications in software and only about 80ms to do 1,000,000 4x4 matrix multiplications in hardware using glMultMatrixf, which as I understand it also involves sending the information to the graphics card.

My problem comes in that I've found that doing a glGetFloatv(GL_MODELVIEW_MATRIX, x) 1,000,000 times in a row takes around 12,400ms on my machine. I did some reading and I saw that glGet* is notoriously slow, because it has to wait for the command queue to finish or something. I would have figured that doing all of them in a row it wouldn't have been this bad, but it still is.

I've been reading up on V/PBOs in the hope that they might yield some advantage in bulk data transfer, because one of the articles I read said that "any" function that took a pointer now takes an offset into the bound buffer. So I thought that maybe glGet-ing into on-card memory would be a faster operation and I could send it all back at the end. Firstly, after more reading I'm thinking that the "any" was an exaggeration, and secondly, even if it works, glGet-ing into on-card memory may not actually be any faster than retrieving it into main memory on my machine.

So does anyone know of any alternative that I can use to inspect the current modelview/projection/texture/color or whatever matrix that doesn't use glGet, since I doubt there is any way to speed that up? Doing the multiplications in the graphics hardware isn't going to do me a lot of good if I can't get the information back and inspect the results.

If it helps, I really only have to inspect the 14th element (0-indexing column major storage) of the matrix to get what I need.

Additionally, I know all the matrices I am going to need to multiply ahead of time, so if there is a way (which I would like to imagine there should be) all I really want to do is:

1. Send a buffer of n matrices to the graphics card (16 * n floats)

2. Multiply the ith matrix times what is currently on the card and inspect the result, either returning the 14th element or storing it in a buffer to be sent back at the end, for all i from 0 to n - 1.

I figure I may have to end up using shaders or something, but I was hoping there was a way to do it without that.

Any help would be appreciated

-NiCo-
05-20-2008, 03:15 PM
Why are most people afraid of shaders? :)

Anyway, first of all, if you only need one component of a matrix that is the result of a matrix-matrix multiplication, you only need to perform one dot product instead of the full matrix multiplication. For example, if you need the value in the 2nd row and 3rd column of the matrix C = A*B, you only need the dot product of the 2nd row of A and the 3rd column of B. This takes 1/16th of the computation you are doing now.
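
A minimal sketch of that single dot product in C, assuming the OpenGL column-major layout in which element (row, col) is stored at index col * 4 + row. The function name product_element is made up for illustration. With that layout, index 14 is row 2, column 3 (0-based), since 14 = 3 * 4 + 2.

```c
/* Element (row, col) of C = A * B for 4x4 matrices stored column-major,
   i.e. element (r, c) lives at index c * 4 + r (the layout glMultMatrixf expects).
   product_element is a hypothetical helper name, not an OpenGL call. */
static float product_element(const float a[16], const float b[16], int row, int col)
{
    float sum = 0.0f;
    for (int k = 0; k < 4; k++)
        sum += a[k * 4 + row] * b[col * 4 + k];  /* A(row, k) * B(k, col) */
    return sum;
}
```

For the 14th element the call would be product_element(a, b, 14 % 4, 14 / 4), which costs 4 multiplications instead of the 64 needed for the full product.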

Secondly, you can send the matrix (or vec4 data if you only perform the dot product) in the RGBA components of a texture and use a shader :) to calculate that 14th element, write it out to a single-component 32-bit float texture attached to an FBO, and read back all the values with glReadPixels.

PS If you don't want to use glGet, try tracking the matrix values/states yourself in your app.

Zengar
05-20-2008, 03:28 PM
> I did some experimenting and found that it took me about 200ms to do 1,000,000 4x4 matrix multiplications in software and only about 80ms to do 1,000,000 4x4 matrix multiplications in hardware using glMultMatrixf, which as I understand it also involves sending the information to the graphics card.


How did you test it? I am more than sure that glMultMatrix is performed on the CPU; it would be a waste to do it on the GPU (for various reasons). If you want GPU-accelerated matrix multiplication, look into GPGPU and CUDA.

Rhawk187
05-20-2008, 03:30 PM
I'm not really afraid of shaders, I just don't know how to use them yet, and therefore if I can get away with not having to until a later date, that's a plus.

Secondly, I see what you are saying about the dot product, but this may turn into more like a D = A*B*C as opposed to just a single multiplication in the future and I'm not sure if you can turn that into a series of dot products instead.

So, it is comforting to see that an implementation should be easy in a shader, but if anyone else does know of a non-shader way to do it, that would be great too, so keep responding :)

Also, keeping track of those values myself defeats the purpose, because as I stated earlier, the whole point of this exercise is to exploit the graphics card's on-board 4x4 matrix multiplier and not have to do it myself. Nothing is actually being rendered. I'm just trying to multiply 4x4 matrices.

So, if this ends up having to be done in a shader, which I am mostly unfamiliar with, are there still performance advantages over doing a dot product to a 4x4 matrix multiplication? I assume there are hardware implementations for both on the card, therefore standard computational predictions may not apply. There most certainly is a 4x4 matrix multiplier on board.

Seth Hoffert
05-20-2008, 03:31 PM
glMultMatrix is performed on the CPU. You won't gain anything in terms of speed by using it. The speed-up you saw was most likely due to CPU vector instruction optimizations.

Rhawk187
05-20-2008, 03:35 PM
> > I did some experimenting and found that it took me about 200ms to do 1,000,000 4x4 matrix multiplications in software and only about 80ms to do 1,000,000 4x4 matrix multiplications in hardware using glMultMatrixf, which as I understand it also involves sending the information to the graphics card.
>
> How did you test it? I am more than sure that glMultMatrix is performed on the CPU; it would be a waste to do it on the GPU (for various reasons). If you want GPU-accelerated matrix multiplication, look into GPGPU and CUDA.

The test for multMatrix was basically:

float x[16] = { some values here }

getTimeHere()

for(int i = 0; i < 1000000; i++)
glMultMatrixf(x);

glTimeHereToo()

and take the difference of the times

I did not know that glMultMatrix is done on the CPU. Perhaps my 4x4 matrix multiplication implementation is just inefficient in some way. If this is indeed the case, that sort of blows the whole methodology I was using.

Thank you for the suggestions of gpgpu and cuda, I will look into them further.

Rhawk187
05-20-2008, 03:37 PM
> glMultMatrix is performed on the CPU. You won't gain anything in terms of speed by using it. The speed-up you saw was most likely due to CPU vector instruction optimizations.

I was unaware. I do not take advantage of SIMD optimizations in my implementation so this could in fact be the case. So, I guess it's back to step 1 and actually figuring out a good way to do the multiplications on the graphics card.

-NiCo-
05-20-2008, 03:43 PM
Ow, you're really using glMultMatrix. Like Zengar and HexCat already pointed out, this is done on the CPU.

For the case of D = A*B*C, you can still split this up into dot products. If you need the element in the 2nd row and 3rd column of D, you first take the 3rd column of C (V1), then perform 4 dot products of the rows of B with V1 to get V2 (= 3rd column of B*C), then perform a dot product of the 2nd row of A with V2 to get the result. So you'll still get some speedup compared to the full matrix multiplication method.
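
The steps above can be sketched in C as follows, again assuming column-major storage (element (row, col) at index col * 4 + row). The name triple_product_element is hypothetical. It uses five dot products (20 multiplications) where two full 4x4 products would need 128.

```c
/* Element (row, col) of D = A * B * C for 4x4 column-major matrices,
   computed with five dot products instead of two full matrix products.
   triple_product_element is a hypothetical helper name. */
static float triple_product_element(const float a[16], const float b[16],
                                    const float c[16], int row, int col)
{
    float v2[4];  /* column `col` of B * C */
    for (int r = 0; r < 4; r++) {
        v2[r] = 0.0f;
        for (int k = 0; k < 4; k++)
            v2[r] += b[k * 4 + r] * c[col * 4 + k];  /* row r of B . V1 (column col of C) */
    }

    float sum = 0.0f;
    for (int k = 0; k < 4; k++)
        sum += a[k * 4 + row] * v2[k];               /* row `row` of A . V2 */
    return sum;
}
```

The same pattern extends to longer chains: carry one column vector from right to left and finish with a single dot product against the required row of the leftmost matrix.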

Rhawk187
05-20-2008, 04:05 PM
This leads me to another interesting question then. If a normal call to glMultMatrixf(x) is done on the CPU, what if the call is done in a display list?

For instance what if I were to call:

glLoadMatrixf(m);

outside of the display list

and then call

glPushMatrix();
glMultMatrixf(a);
glPopMatrix();
glPushMatrix();
glMultMatrixf(b);
glPopMatrix();
glPushMatrix();
glMultMatrixf(c);
glPopMatrix();

from inside a display list, would those multMatrix calls be done using the graphics card's 4x4 matrix multiplier?

It still doesn't solve my glGet inspection problem, but it would be nice to know.

-NiCo-
05-20-2008, 04:09 PM
Display lists are static and need to be compiled, so the compilation process may be more expensive than immediate mode. Furthermore, you can't change the matrix data without recompiling the display list, so that doesn't seem like a good option. If you want to make use of hardware matrix multiplication, you really need to do it within a shader.

Rhawk187
05-20-2008, 04:14 PM
That's sort of what my example was supposed to represent. The matrix outside the list is the only one that ever changes, the ones inside will be the same every time, and therefore could be compiled well in advance, and won't need to be changed.

I'm reading up on shaders now, but it seemed like an easier way to accomplish the same goal (minus the inspection problem).

Zengar
05-20-2008, 04:49 PM
> The test for multMatrix was basically:
>
> float x[16] = { some values here }
>
> getTimeHere()
>
> for(int i = 0; i < 1000000; i++)
> glMultMatrixf(x);
>
> glTimeHereToo()
>
> and take the difference of the times
>
> I did not know that glMultMatrix is done on the CPU. Perhaps my 4x4 matrix multiplication implementation is just inefficient in some way. If this is indeed the case, that sort of blows the whole methodology I was using.
>
> Thank you for the suggestions of gpgpu and cuda, I will look into them further.

Made my day :)

What you are measuring is the time to call these functions. There is absolutely no evidence that the commands are actually being executed. OpenGL stores the commands in a queue and executes them at some unknown later point (in your case probably never, as the result matrices are unused). To measure it correctly, you will have to place a Get call at the end, thus making sure that the commands will be executed.

To make it short: there is absolutely NO WAY that glMultMatrix will, in any scenario, be faster than your homebrew function, as the overhead is too high. If you need high performance, look into CUDA and/or GPGPU, or implement an efficient algorithm on the CPU.

Brolingstanz
05-20-2008, 05:46 PM
Though it should probably be pointed out that a well-crafted CPU matrix routine could well be an order of magnitude faster than one left to the mercy of the compiler. This is especially so when dealing with the large batches you see in CPU skinning (not that you see a heck of a lot of that anymore), where there's loads of room to optimize given large chunks of data to chew on, and the overhead of individual calls is eliminated (this is where glMultMatrix is hog-tied).
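
As a rough illustration of the batch idea applied to the original question (plain C with no SIMD, so a sketch rather than an optimized routine): one call processes all n matrices from a contiguous buffer and only computes element 14 of each product. The function batch_element14 and its parameters are hypothetical names, and column-major storage is assumed.

```c
#include <stddef.h>

/* For each of the n matrices in `batch` (16 floats each, column-major, stored
   back to back), write element 14 of current * batch[i] to out[i].
   batch_element14 is a hypothetical helper name. */
static void batch_element14(const float current[16], const float *batch,
                            size_t n, float *out)
{
    /* Element 14 is row 2, column 3. Row 2 of `current` is the same for
       every product, so read it once outside the loop. */
    const float r0 = current[2], r1 = current[6], r2 = current[10], r3 = current[14];

    for (size_t i = 0; i < n; i++) {
        const float *col3 = batch + i * 16 + 12;  /* column 3 of matrix i */
        out[i] = r0 * col3[0] + r1 * col3[1] + r2 * col3[2] + r3 * col3[3];
    }
}
```

Because the loop touches only four floats per matrix and makes no per-matrix function call, it is also the kind of loop a compiler or hand-written SSE code can vectorize.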

Intel's developer site has some words and some code on CPU optimization strategies for batch matrix transforms, on the off chance you're interested in going that route (the same basic ideas apply to AMD too, just with a different set of intrinsics).

V-man
05-20-2008, 06:41 PM
The shader method would be interesting to compare against the SSE instructions.
You would need to make 3 GL_RGBA32F textures. One of them is used to hold the result.
The shader would be something like this

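// Assumed declarations: tex0 holds the matrices (one column per texel),
// tex1 holds the vectors, and the vertex shader supplies the texcoords.
uniform sampler2D tex0;
uniform sampler2D tex1;
varying vec2 texcoord0, texcoord1, texcoord2, texcoord3;

void main()
{
    mat4 matrix;
    //Read the matrix (matrix[i] is column i)
    matrix[0] = texture2D(tex0, texcoord0);
    matrix[1] = texture2D(tex0, texcoord1);
    matrix[2] = texture2D(tex0, texcoord2);
    matrix[3] = texture2D(tex0, texcoord3);

    //Read the vector
    vec4 vector = texture2D(tex1, texcoord0);

    //Compute matrix * vector
    gl_FragColor = matrix * vector;
}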
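
For checking the values read back from such a shader, a CPU reference of the same matrix * vector product might look like the following sketch. It assumes the matrix is stored column-major, matching GLSL where matrix[i] is column i. The name mat4_mul_vec4 is hypothetical.

```c
/* out = m * v for a 4x4 column-major matrix (element (r, c) at index c * 4 + r),
   matching the GLSL expression `matrix * vector`.
   mat4_mul_vec4 is a hypothetical helper name; `out` must not alias `v`. */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    for (int r = 0; r < 4; r++)
        out[r] = m[0 * 4 + r] * v[0] + m[1 * 4 + r] * v[1]
               + m[2 * 4 + r] * v[2] + m[3 * 4 + r] * v[3];
}
```

GPU and CPU results may differ in the last bits because of different floating-point rounding, so a comparison against the shader output should use a small tolerance.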

yooyo
05-21-2008, 03:06 AM
@Rhawk187:
Did you create OpenGL context before those glMultMatrix tests?

Use shaders or CUDA.

Rhawk187
05-21-2008, 02:33 PM
I had created an OpenGL context.

I am learning how shaders work now.

My new laptop is going to be CUDA capable but my current one isn't, so I will play around with that when I get it in a week.

I also had never heard of the glFeedbackBuffer until this morning, and it may have some usefulness if I were to rework some of my experiments to transform a single vector through a series of matrices, but probably not.

Thank you everyone for your input.

Seth Hoffert
05-21-2008, 02:45 PM
I'm afraid feedback mode is also run in software in most implementations. :(

Your best bet is to use shaders or CUDA. :)

Eosie
05-21-2008, 03:22 PM
I've just made a naive implementation of matrix multiplications, doing 8 of them in the pixel shader (hitting the max. number of temporaries in ps3.0) and using 4 float32 render targets to be able to output the entire matrix at once. As you can see this is not well optimized. The throughput of this algorithm is 27M matrix multiplications per second on my mobile R520 with 12 pixel pipelines.