Performance Tips?

Now that we’ve moved into an era of very powerful and very fast hardware, I have some questions about the older performance tips that have been tossed about in the past.

In particular, I am aware that state changes can cause performance problems in OpenGL implementations. However, it is not entirely clear exactly which state changes cause these problems, or how much performance is lost. Also, drivers have improved since the last nVidia performance FAQ, so some of these restrictions may have been lifted.

BTW, I am aware that this is all very implementation-specific. I would like to know the appropriate answers for both nVidia and ATi hardware.

I would like to know whether each of these operations incurs a significant performance hit (best case, worst case, and average, if possible); a sketch of the kind of timing loop I have in mind follows the list:

  • Binding a texture object.

  • Uploading ARB_vertex_program attributes.

  • Binding an ARB_vertex_program.

  • Binding an ARB_fragment_program.

  • Uploading a texture that uses a card-native texture format (compressed, etc). The new texture is the same size as the old one (glTexSubImage is used).

  • Changing a vertex pointer.

  • Changing several vertex pointers (is this much worse than just one?).
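For concreteness, here is the sort of timing loop I have in mind, using the texture bind as an example (just a sketch; getTimeMs stands in for whatever high-resolution timer is available, and tex[] is assumed to hold two already-created texture objects):

double t0, t1;
int i;

t0 = getTimeMs();
for (i = 0; i < 1000; i++)
{
    glBindTexture(GL_TEXTURE_2D, tex[i & 1]); /* alternate between two textures
                                                 so no bind is redundant */
}
glFinish();                                   /* make sure the driver/GPU have
                                                 processed every call */
t1 = getTimeMs();
printf("1000 binds: %f ms\n", t1 - t0);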

Write your own simple test app and measure it yourself; failing that, ask Carmack.

Anyway, I would say ‘Uploading a texture’ would be the slowest on all hardware.

>>* Changing a vertex pointer.<<

I never change vertex pointers.

I think we could do with a public benchmark program for state changes and somewhere for people to post results. Anyone want to write one?

I want to know the relative load on the CPU/GPU, what performance depends on, and what the fixed and variable costs are. State changes can return control to the app quickly but still be processing on the GPU. All of these things affect how we should write software, yet we don’t have any collective information. We are just told they are slow and should be kept to a minimum.

Maybe I’ll write a quick benchmark program and post the source, then others can add to it.

You might be able to leverage the old isfast tool as a starting point, rather than writing the whole thing from scratch. AFAIK it hasn’t been updated for a few years though.

Thanks Mike, I’d already written most of it by the time I saw your message though.

It only has one of the state changes you are interested in, Korval. I haven’t used fragment or vertex programs yet, so that’s why they aren’t included.
http://www.adrian.lark.btinternet.co.uk/GLBench.htm

Here’s the exe, no source yet. I will release that once I’ve added some comments: http://www.adrian.lark.btinternet.co.uk/GLBench.zip

Run the exe and when it’s finished it will create a text file with the results.

This is just an alpha version, so I’m open to suggestions. I don’t want to add T&L and fillrate benchmarks just yet (if ever). This benchmark is more for functions that aren’t covered by existing benchmark apps.

Don’t post the results here but send them to me and I’ll add them to the page. My email address is in my profile.

[This message has been edited by Adrian (edited 02-08-2003).]

Nice initiative, Adrian!

These figures can be useful, but what is really interesting is to see what happens when you do combinations. Simply changing a single state over and over may not affect the behaviour of the pipeline as much as if you do combinations (since it may force various flushes etc). At least that’s what I believe - there are probably a few interesting scenarios that you can identify (e.g. glReadPixels in combination with vertex arrays vs. vertex arrays and glReadPixels by themselves etc).
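For example, a combined test might look something like this (just a sketch; drawScene is assumed to be a typical vertex array draw, pixels a large enough buffer, getTimeMs a timer helper):

t0 = getTimeMs();
for (i = 0; i < 100; i++)
{
    drawScene();                            /* assumed vertex array rendering */
    glReadPixels(0, 0, 64, 64, GL_RGBA,
                 GL_UNSIGNED_BYTE, pixels); /* forces a flush in mid-frame */
}
glFinish();
t1 = getTimeMs();
/* compare (t1 - t0) with drawScene alone and glReadPixels alone */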

I’m not sure what should be included in the test, since there are infinite combinations in which order to do things etc, but one could probably identify some common tasks that most programs do / want to do.

Some details:

GL_SCISSOR_TEST 915 ms? Or is it 915/1000 ms? Not very clear from the tables.

Megabytes per second is usually written MB/s (Mb is, AFAIK, megabits).

I got a weird result for glReadPixels. Only 1MB/sec when I had 2x FSAA turned on. Everything else ran at a decent speed.

Thanks for the feedback, Marcus. I think adding combinations would make the results complicated to interpret, though. I call glFinish at the end of each test, so the results you see are the total CPU/GPU time, not just the time between calling the function and it returning control to the app. Maybe I should show both times.
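Something like this, I suppose, recording the time both before and after the glFinish (getTimeMs standing in for the QueryPerformanceCounter code):

t0 = getTimeMs();
for (i = 0; i < 1000; i++)
{
    glScissor(0, 0, 256, 256);      /* or whatever call is being tested */
}
tSubmit = getTimeMs() - t0;         /* CPU time: calls issued, maybe still queued */
glFinish();
tTotal = getTimeMs() - t0;          /* total time: everything actually processed */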

I think the relative performance is the most interesting part. From those tables it’s clear which functions should be kept to a minimum. I use glScissor quite a lot, so those figures were slightly disappointing.

I will change the Mb to MB. The times represent the total time in milliseconds; I will make this clearer in the next version.

Nutty, did you mean CopyPixels? I get 1 MB/s with FSAA on too. In fact, with FSAA on I get a black then white screen when I run the app (as opposed to just the DOS box). Weird. Do you get that?

Marcus is right.

If you are doing something like this

start = gettime();
for (i = 0; i < 1000; i++)
{
    glMultMatrixf(...);   /* any buffered GL call */
}
diff = gettime() - start;

then you’re measuring the time it takes for your CPU to process the call, since the driver will batch up calls, and then send it to the card when time comes to render something.

Same thing with many other GL calls.
You might want to render some spinning cube or something.

glReadPixels and a few others can be tested that way, so those should be OK.
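That’s because glReadPixels (into client memory) can’t return until the data is actually in your buffer, so it synchronizes by itself. A sketch (pixels and getTimeMs assumed as before):

t0 = getTimeMs();
glReadPixels(0, 0, 256, 256, GL_RGBA,
             GL_UNSIGNED_BYTE, pixels); /* blocks until all prior rendering
                                           is finished and the data is copied */
t1 = getTimeMs();
/* (t1 - t0) already includes the GPU work; no glFinish needed */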

PS: When I ran your program, my desktop became unresponsive until I closed it.

Adrian states that he uses glFinish to ensure the calls are processed (and not just queued). His code should look like this:

start = gettime();
for (i = 0; i < 1000; i++)
{
    glMultMatrixf(...);
}
glFinish();   /* wait for the queued calls to actually be processed */
diff = gettime() - start;

Originally posted by V-man:
you’re measuring the time it takes for your CPU to process the call, since the driver will batch up calls, and then send it to the card when time comes to render something.

So you’re saying glFinish isn’t sufficient then, ok.


PS: When I ran your program, my desktop became unresponsive until I closed it.

That might be the copypixels problem with FSAA on.

That’s right, kehziah, that’s how I’m doing it.

Interesting that there appears to be no difference between the TexImage2D and TexSubImage2D calls, at least on my GF4.
What are you doing at these points? Are you loading up different data each time? If not, then the driver may filter out redundant stuff…

Originally posted by kieranatwork:
Are you loading up different data each time? If not, then the driver may filter out redundant stuff…

Yes, I thought it was weird too. Here’s the code:

glFinish();

QueryPerformanceCounter(&start);

for (i = 0; i < 10; i++)
{
	if (Type == 0)
	{
		glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256, GL_BGRA_EXT, GL_UNSIGNED_BYTE, buf);
	}
	if (Type == 1)
	{
		glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 256, 256, 0, GL_BGRA_EXT, GL_UNSIGNED_BYTE, buf);
	}
}

glFinish();

QueryPerformanceCounter(&end2);

As you can see, I wasn’t changing the data, but I just added
*buf = i; // Change one pixel
in the loop, and the performance is still the same…
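For reference, the counter values get converted to milliseconds via QueryPerformanceFrequency, roughly like this:

LARGE_INTEGER freq;
double ms;

QueryPerformanceFrequency(&freq);   /* counter ticks per second */
ms = (double)(end2.QuadPart - start.QuadPart) * 1000.0 / (double)freq.QuadPart;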

[This message has been edited by Adrian (edited 02-10-2003).]

Originally posted by kieranatwork:
Are you loading up different data each time? If not, then the driver may filter out redundant stuff…

I don’t see how the driver could detect changes in the pixel data. It has to assume that there has been a change, and upload it all from scratch.

Could it be that the driver is optimized to detect that the pixel format, texture size, etc. are all the same as the texture currently in memory, and effectively converts the call to a SubImage call? (That should be possible.)

You could try another test where you use a base image size, say 256x256, and then loop through 256x256, 128x128, 64x64, 32x32, 256x256, … Then you should see the difference between Image and SubImage (?). This could be a separate test, in addition to the present fixed-size one.
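Roughly like this (a sketch, reusing your Type and buf variables; buf is assumed to be large enough for the biggest size):

int sizes[5] = { 256, 128, 64, 32, 256 };

for (i = 0; i < 5; i++)
{
    if (Type == 0)
    {
        /* SubImage can only ever update within the existing 256x256 image */
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, sizes[i], sizes[i],
                        GL_BGRA_EXT, GL_UNSIGNED_BYTE, buf);
    }
    if (Type == 1)
    {
        /* TexImage redefines the texture, so the driver may have to
           reallocate storage every time the size changes */
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, sizes[i], sizes[i], 0,
                     GL_BGRA_EXT, GL_UNSIGNED_BYTE, buf);
    }
}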

You could also add separate tests for different fixed sizes. I have noticed “optimal” texture sizes, which probably depend strongly on cache sizes.

[This message has been edited by marcus256 (edited 02-11-2003).]

>I don’t see how the driver could detect changes in the pixel data.

In that particular case, it would not need to. All it would need to do is detect that the specified texture data is never actually used, in which case there is no need to upload it to the graphics board at all; it only has to keep a copy of the raw data in main memory (to honour the GL spec, and in case the data is actually used later).
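If that is what is happening, the benchmark could defeat it by actually sampling the texture after each upload, e.g. (a sketch):

glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256,
                GL_BGRA_EXT, GL_UNSIGNED_BYTE, buf);

/* drawing one small textured triangle makes the texture "used",
   so a lazy driver has to perform the real transfer */
glBegin(GL_TRIANGLES);
    glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
    glTexCoord2f(1.0f, 0.0f); glVertex2f(-0.9f, -1.0f);
    glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f, -0.9f);
glEnd();
glFinish();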