Imaging subset (color matrix) performance

(This follows on from my original question about YUV -> RGB conversion).
I have been doing some very simple tests with the color matrix, part of the imaging subset, under Linux using an nVidia GeForce Ti4400. This is certainly a feasible way to do the YUV->RGB conversion, but I’ve discovered that there is a big performance hit.

In my test program I call glTexImage2D() and map the texture for each frame. (I understand that the color matrix is applied when the texture is generated.) If I do not touch the color matrix this code gives 410 fps. If at the init stage (i.e. only once) I load the identity matrix into the color matrix, the framerate drops to 75 fps. In other words, even a color matrix that does nothing costs about 10 ms per render. (Any other matrix has the same performance.)

I may be able to live with this, but I’d rather not if it is avoidable. Perhaps I’m doing something wrong, or not doing something I should be doing. If this is just the way it is, it suggests that the color matrix transformation is being carried out in software, which raises a more general question: how can I determine which features of the driver are hardware accelerated?

thanks
Gib

The color matrix is usually a software process, applied at pixel transfer (i.e. upload).

I suggest whipping up a color matrix work-alike using dot3 functions in the texture environment and/or register combiners. That, you can make fly pretty fast. You need three constants as inputs to the set-up, though, which have to be fetched from some small texture if you’re using the plain texture environment with DOT3_ARB.

You cannot, in general, determine which features are supported in hardware. This is a very common question; just do an archive search for HOURS of back-and-forth on the subject…

Thanks for clarifying the color matrix behaviour. It makes sense. Now I’m trying to figure out how to use register combiners, and getting hold of a clear explanation is not trivial, it seems. NVidia have a doc on the subject, but the .pdf is munged and it’s not easy for me to view a .ppt.

Would you mind explaining what you mean by “… which have to be fetched from some small texture if you’re using plain texture environment with DOT3_ARB”. Is the use of register combiners an alternative to, or in conjunction with, dot3? (I’m guessing the former.)

Gib

The regular texture environment only supports two constant colors, period. If what you want is:

out.red = tex0 dot const0
out.green = tex0 dot const1
out.blue = tex0 dot const2

then you’re out of constants. The way to get around this is to feed a texture of known color into the environment instead of const2.

If you use register combiners on a GF3 and up, I believe there’s another nVIDIA extension which allows you to specify two constants per combiner, which removes this problem.

I’m gradually getting this into focus. The way the register combiners work looks pretty strange to me as a neophyte, but I’m sure there are good reasons for everything.

I do have NV_register_combiners2, which as you say allows two constant colors per combiner. The YUV->RGB transformation is more than a matrix multiply. I’m informed that
B = 1.164(Y-16) + 2.018(U-128)
G = 1.164(Y-16) - 0.813(V-128) - 0.391(U-128)
R = 1.164(Y-16) + 1.596(V-128)

i.e. (normalizing so that 16 -> 0.0625 and 128 -> 0.5)
B = 1.164Y + 2.018U - 1.08175
G = 1.164Y - 0.813V - 0.391U + 0.52925
R = 1.164Y + 1.596V - 0.87075
(if I’ve done the arithmetic correctly).

Now I just have to figure out how to do this in as few combiners as possible.

Gib

I’ve done a lot of head scratching while poring over the register combiners docs, and I’ve convinced myself that the YUV-RGB conversion, as defined above, cannot be done by this method.

I hope someone will prove me wrong.

Gib

Gib,

You will need at least three general combiners to do this. I believe some earlier nVIDIA cards only have two general combiners. On a GeForce3 and up (but not a GeForce4 MX) I believe you’ll do fine.

It’s not as easy as 3 dot products, because you also have to use some tricks to pack results back into one rgb color. It costs additional dot products.

OK, I decided to refresh my skills in register combiners programming, and here’s the code that does the job. It requires the YUV texture in reversed order (R=V, G=U, B=Y). This can easily be done by changing the format from GL_RGB to GL_BGR_EXT in the glTexImage call.

// texture in format: VUY (R=V, G=U, B=Y)

// spare0.rgb = x = V - 0.5
// spare1.rgb = y = U - 0.5
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_A_NV, GL_TEXTURE0_ARB,       GL_HALF_BIAS_NORMAL_NV,  GL_RGB);
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_B_NV, GL_CONSTANT_COLOR0_NV, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_C_NV, GL_TEXTURE0_ARB,       GL_HALF_BIAS_NORMAL_NV,  GL_RGB);
glCombinerInputNV(GL_COMBINER0_NV, GL_RGB, GL_VARIABLE_D_NV, GL_CONSTANT_COLOR1_NV, GL_UNSIGNED_IDENTITY_NV, GL_RGB);
glCombinerOutputNV(GL_COMBINER0_NV, GL_RGB, GL_SPARE0_NV, GL_SPARE1_NV, GL_DISCARD_NV, GL_NONE, GL_NONE, GL_TRUE, GL_TRUE, GL_FALSE);

// spare0.a = z = 0.5 * 1.164*(Y - 0.0625) = 0.582*Y - 0.036375
glCombinerInputNV(GL_COMBINER0_NV, GL_ALPHA, GL_VARIABLE_A_NV, GL_TEXTURE0_ARB,       GL_UNSIGNED_IDENTITY_NV, GL_BLUE);
glCombinerInputNV(GL_COMBINER0_NV, GL_ALPHA, GL_VARIABLE_B_NV, GL_CONSTANT_COLOR0_NV, GL_UNSIGNED_IDENTITY_NV, GL_ALPHA);
glCombinerInputNV(GL_COMBINER0_NV, GL_ALPHA, GL_VARIABLE_C_NV, GL_CONSTANT_COLOR1_NV, GL_SIGNED_NEGATE_NV,     GL_ALPHA);
glCombinerInputNV(GL_COMBINER0_NV, GL_ALPHA, GL_VARIABLE_D_NV, GL_ZERO,               GL_UNSIGNED_INVERT_NV,   GL_ALPHA);
glCombinerOutputNV(GL_COMBINER0_NV, GL_ALPHA, GL_DISCARD_NV, GL_DISCARD_NV, GL_SPARE0_NV, GL_NONE, GL_NONE, GL_FALSE, GL_FALSE, GL_FALSE);

float stage0_color0[4] = { 1, 0, 0, 0.582f };
float stage0_color1[4] = { 0, 1, 0, 0.036375f };
glCombinerStageParameterfvNV(GL_COMBINER0_NV, GL_CONSTANT_COLOR0_NV, stage0_color0);
glCombinerStageParameterfvNV(GL_COMBINER0_NV, GL_CONSTANT_COLOR1_NV, stage0_color1);
// spare0.r = z + 0.7980*x = 0.5*R
// spare0.g = z - 0.4065*x
// spare0.b = z + 0.0000*x
glCombinerInputNV(GL_COMBINER1_NV, GL_RGB, GL_VARIABLE_A_NV, GL_SPARE0_NV,          GL_SIGNED_IDENTITY_NV,   GL_RGB);
glCombinerInputNV(GL_COMBINER1_NV, GL_RGB, GL_VARIABLE_B_NV, GL_CONSTANT_COLOR0_NV, GL_EXPAND_NORMAL_NV,     GL_RGB);
glCombinerInputNV(GL_COMBINER1_NV, GL_RGB, GL_VARIABLE_C_NV, GL_SPARE0_NV,          GL_SIGNED_IDENTITY_NV,   GL_ALPHA);
glCombinerInputNV(GL_COMBINER1_NV, GL_RGB, GL_VARIABLE_D_NV, GL_ZERO,               GL_UNSIGNED_INVERT_NV,   GL_RGB);
glCombinerOutputNV(GL_COMBINER1_NV, GL_RGB, GL_DISCARD_NV, GL_DISCARD_NV, GL_SPARE0_NV, GL_NONE, GL_NONE, GL_FALSE, GL_FALSE, GL_FALSE);

float stage1_color0[4] = { 0.5f*(1 + 0.798f), 0.5f*(1 - 0.4065f), 0.5f*(1 + 0), 0 };
glCombinerStageParameterfvNV(GL_COMBINER1_NV, GL_CONSTANT_COLOR0_NV, stage1_color0);

// spare0.r = 2*(z + 0.7980*x + 0.0000*y) = R
// spare0.g = 2*(z - 0.4065*x - 0.1955*y) = G
// spare0.b = 2*(z + 0.0000*x + 1.0000*y) = B
glCombinerInputNV(GL_COMBINER2_NV, GL_RGB, GL_VARIABLE_A_NV, GL_SPARE1_NV,          GL_SIGNED_IDENTITY_NV,   GL_RGB);
glCombinerInputNV(GL_COMBINER2_NV, GL_RGB, GL_VARIABLE_B_NV, GL_CONSTANT_COLOR0_NV, GL_EXPAND_NORMAL_NV,     GL_RGB);
glCombinerInputNV(GL_COMBINER2_NV, GL_RGB, GL_VARIABLE_C_NV, GL_SPARE0_NV,          GL_SIGNED_IDENTITY_NV,   GL_RGB);
glCombinerInputNV(GL_COMBINER2_NV, GL_RGB, GL_VARIABLE_D_NV, GL_ZERO,               GL_UNSIGNED_INVERT_NV,   GL_RGB);
glCombinerOutputNV(GL_COMBINER2_NV, GL_RGB, GL_DISCARD_NV, GL_DISCARD_NV, GL_SPARE0_NV, GL_SCALE_BY_TWO_NV, GL_NONE, GL_FALSE, GL_FALSE, GL_FALSE);

float stage2_color0[4] = { 0.5f*(1 + 0), 0.5f*(1 - 0.1955f), 0.5f*(1 + 1), 0 };
glCombinerStageParameterfvNV(GL_COMBINER2_NV, GL_CONSTANT_COLOR0_NV, stage2_color0);

// leave the final combiner in the default state

glCombinerParameteriNV(GL_NUM_GENERAL_COMBINERS_NV, 3);
glEnable(GL_PER_STAGE_CONSTANTS_NV);
glEnable(GL_REGISTER_COMBINERS_NV);

It uses 3 general combiners and 4 constant colors, so at least a GeForce3 is needed. But…
If the primary and secondary colors are not used, two of the constants can be placed there, eliminating the need for the GL_NV_register_combiners2 extension. Also, the calculations from the third combiner can be moved into the final combiner, so only two general combiners are needed. That way everything can work even on a GeForce1.
I leave the implementation details as an exercise for the reader.

Hope this helps

Kuba

PS. Oops, 1280x1024 recommended to view this monster

[This message has been edited by coop (edited 11-03-2002).]

Wow Kuba, that was beyond the call of duty!

I’ll need to study your code closely to grasp the full extent of your cunning. This will be very instructive.

I had the mistaken impression that all the combiners had to be set up as a linear chain, i.e. #0 -> #1 -> #2 …, but your example shows that, for example, #0 can be used repeatedly to load #1’s input registers, etc. That changes things a bit.

I am very grateful to you, and will be even more so when I’ve established that it works :slight_smile:

Meanwhile, Herr Draxinger on c.g.a.o has suggested a completely different method of doing the conversion, using a lookup into a 3D texture. Am I right in guessing that this would be faster?

Gib

gib,

It probably would not be faster, and probably would be of lesser quality.

Well, it could be of the same quality if you spent 256x256x256x4 == 64 megabytes on the look-up texture. Addressing your way around that texture is going to make you go fill limited on look-ups quite quickly.

Originally posted by gib:
I had the mistaken impression that all the combiners had to be set up as a linear chain, i.e. #0 -> #1 -> #2 …, but your example shows that, for example, #0 can be used repeatedly to load #1’s input registers, etc. That changes things a bit.

I don’t quite understand you here. The combiners do indeed work in a linear chain. That is, the results from #0 can be used in #1, #2 has access to the results from both #0 and #1, and so on. The reverse is not true: for example, #0 has no access to anything produced by the other combiners. All combiners have access to all textures, the primary and secondary colors, and the constants (global, or per-combiner as in my code).
I hope I made this a little clearer. Maybe the confusion is because I used both the rgb and alpha parts of the first combiner? Combiners 1 and 2 use only the rgb part.

Kuba

Kuba, as you surmise, I was misled by the use of both RGB and ALPHA portions of combiner #0. I had also failed to grasp the fact that each combiner gets a full set of registers. Having studied your code I now think I understand it. There are several features of the RCs that are not obvious on first reading. For example, I didn’t see how you accessed z (in alpha) in the rgb portion, then discovered that with portion = GL_RGB, componentUsage = GL_ALPHA puts the alpha value into r,g and b. I also admire your fancy footwork in getting around the issue of negative coefficients.

The code runs very fast, reducing framerate from 410 without the RCs to 370 with them, i.e. the cost is only about 0.25 ms per frame on my test case. This is excellent.

I still haven’t done a real test, because I haven’t yet sorted out the code to convert an RGB texture to hold the corresponding V, U and Y bytes (just to use for testing). I am a complete beginner in the field of video, and the whole subject of YUV turns out to be rather complex. Oh well, as my Mum used to say “You shouldn’t have joined if you can’t take a joke.”

regards
Gib

Well Kuba, it works! You knew that, of course.

After getting my test set up correctly (I have created a VUY texture that I display repeatedly) I find that I get 480 fps (displaying to about 512x512). This is pretty good. But here is the surprise: I have also set up the option of using SDL_DisplayYUVOverlay(), which does the conversion to RGB in software, and this gets 547 fps, i.e. about 0.25 ms/frame less.

I am using a P4 1.8, and I guess it just has the edge over the GF4 for this kind of code.

I’m happy with the outcome, and I’ve learned a lot, though it took me a while.

thanks
Gib

Gib,
You may get better performance by updating your texture via a call to glTexSubImage2D rather than glTexImage2D every frame. Just a thought.

Are you sure SDL converts in software? Indeed, a fast P4 IS going to out-math an older graphics card on “general math”. But SDL may be able to give you a native YUV overlay surface, or a YUV assisted blit.

The SDL function is definitely not using the video hardware (it may be using MMX). I’m converting the YUV data into an RGB texture, which I then map to a quad. That is, the surface that I display the YUV overlay on is the texture surface. Is there support for YUV overlay in OpenGL?

Gib

Gib, I’m glad my code works for you and you find it useful.

Speaking of performance, it could be improved even further by squeezing everything down to 2 + the final combiner, because 2 combiners run twice as fast as 3. But at 480 fps I don’t think it’s necessary.

I don’t know anything about SDL, so I won’t speculate about whether it’s hardware accelerated or not. But I think that, given a sufficiently fast YUV->RGB conversion routine (MMX, probably), a software converter can run as fast as a hardware-accelerated one, because the biggest bottleneck is probably still uploading the data to the hardware, not the conversion itself.

Kuba