"frame rate" slowing down with time

Hello:

I am using two fragment programs to do a physical simulation on the GPU rather than in software on the CPU. I don’t need to see the frame until it is finished. I am using the GPU to try to go much faster than the algorithm runs on the CPU. I am on Windows XP using a GeForce FX 5900 Ultra with the 45.23 driver. My 2D simulation uses three 512x512 textures and a viewport of the same size. My problem is that if I do 500 passes, it is very fast and my CPU usage does not max out. However, if I do more passes, say 1000, the execution time does not scale linearly even though the code is just doing the same thing over and over again. At a higher number of passes, my CPU usage shoots up to 100% and stays there until it is done. Here is my rendering code:

void Process_CTSI()
{
int i;

if (CurrentStep == 0) start_time = clock();

for (i = 0; i < NumSteps; i++)
{
	glBindProgramNV(GL_FRAGMENT_PROGRAM_NV, fp_A_ID);
	glEnable(GL_FRAGMENT_PROGRAM_NV);
	glActiveTextureARB(GL_TEXTURE0_ARB);  
	glBindTexture(GL_TEXTURE_2D, TextureID_A);
	glActiveTextureARB(GL_TEXTURE1_ARB);  
	glBindTexture(GL_TEXTURE_2D, TextureID_B);
	glBegin(GL_TRIANGLE_STRIP);
		glTexCoord2f(0,1); glVertex2f(-1,1);
		glTexCoord2f(0,0); glVertex2f(-1,-1);
		glTexCoord2f(1,1); glVertex2f(1,1);
		glTexCoord2f(1,0); glVertex2f(1,-1);
	glEnd();
	glBindTexture(GL_TEXTURE_2D, TextureID_A);
	glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, iWidth, iHeight);


	glBindProgramNV(GL_FRAGMENT_PROGRAM_NV, fp_B_ID);
	glActiveTextureARB(GL_TEXTURE0_ARB);  
	glBindTexture(GL_TEXTURE_2D, TextureID_B);
	glActiveTextureARB(GL_TEXTURE1_ARB);  
	glBindTexture(GL_TEXTURE_2D, TextureID_A);
	glActiveTextureARB(GL_TEXTURE2_ARB);  
	glBindTexture(GL_TEXTURE_2D, TextureID_C);
	glBegin(GL_TRIANGLE_STRIP);
		glTexCoord2f(0,1); glVertex2f(-1,1);
		glTexCoord2f(0,0); glVertex2f(-1,-1);
		glTexCoord2f(1,1); glVertex2f(1,1);
		glTexCoord2f(1,0); glVertex2f(1,-1);
	glEnd();
	glDisable(GL_FRAGMENT_PROGRAM_NV);

	
	glColor4f(Data[CurrentStep], 0.0, 0.0, 0.0);


	glEnable(GL_BLEND);
	glBlendFunc(GL_ONE, GL_ONE);
	glBegin(GL_LINES);
		glVertex2f(-1.0, -0.75);
		glVertex2f(1.0, -0.75);
	glEnd();
	glBindTexture(GL_TEXTURE_2D, TextureID_B);
	glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, iWidth, iHeight);
	glDisable(GL_BLEND);

	CurrentStep++;
}

if (CurrentStep == NumSteps)
{
	glutSwapBuffers();
	end_time = clock();
	simulation_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;

	cout << "Start Time = " << start_time << endl;
	cout << "End Time = " << end_time << endl;
	cout << "RTT simulation_time = " << simulation_time << endl;
	cout << "NumSteps = " << NumSteps << endl;
	cout.flush();
}

}

Can anyone tell me why it should run so much slower when NumSteps = 1000 as compared to when NumSteps = 500, and why the CPU usage goes so high with the longer run time? Also, when I put the above rendering code into two separate display lists and call them in the loop, with the glColor4f(Data[],…) call in between, it actually runs slower than without the display lists. Shouldn’t it run faster? Confused.

Thanks.

[This message has been edited by sek (edited 09-30-2003).]


Just a guess, but the hardware and software queues (in the driver) may have been filled by your computations, so the CPU is forced to stall until those queues drain sufficiently.

Also, when I put the above rendering code into two separate display lists and call them in the loop, with the glColor4f(Data,…) call in between, it actually runs slower than without the display lists. Shouldn’t it run faster?

Display lists don’t guarantee additional speed. In fact, you usually don’t get a speedup unless they contain only geometry. Adding non-geometry (state-changing) commands can easily cause display lists to slow down. Given how few verts you’re sending anyway (even 2000 iterations is just 16,000 verts, which is nothing to a 5900 Ultra), you can’t possibly be vertex-transfer bound, so that isn’t a problem. I wouldn’t worry about trying to use display lists or even VBOs.

Thanks Korval:

My CPU doesn’t stall though (with display lists or not). The problem is that the CPU seems to hold the GPU back - at least, that is what I think is happening, because when the CPU doesn’t max out (with a smaller number of passes such as 500), the frame/iteration rate is much faster. Am I missing something?

Originally posted by Korval:
Just a guess, but the hardware and software queues (in the driver) may have been filled by your computations, so the CPU is forced to stall until those queues drain sufficiently.

[This message has been edited by sek (edited 09-30-2003).]

Maybe in one of your fragment programs you are allocating memory with operator new() and don’t free it with delete()? The heap grows and grows, and the CPU must spend more and more time allocating the next chunk of memory?

Originally posted by MichaelK:
Maybe in one of your fragment programs you are allocating memory with operator new() and don’t free it with delete()? The heap grows and grows, and the CPU must spend more and more time allocating the next chunk of memory?

???
You can’t allocate/deallocate memory in a fragment program.

-SirKnight

Here is one of my NV_fragment_programs (the other is very similar):

!!FP1.0

DEFINE my_offset = {-0.001953125, 0.0};

ADD R10.xyzw, f[TEX0].xyxy, my_offset.xyyx;

TEX R0, f[TEX0], TEX1, 2D;
TEX R1.z, R10.xywz, TEX1, 2D;
TEX R1.y, R10.zwxy, TEX1, 2D;
TEX R3, f[TEX0], TEX2, 2D;
TEX R4.x, f[TEX0], TEX0, 2D;

ADD R1.xyzw, R0.xzyw, R1.xyzw;
ADD R0.x, -R1.z, R1.y;

MAD o[COLR].xyzw, R3.xyzw, R0.xyzw, R4.xyzw;

END

Originally posted by SirKnight:
[b]
???
You can’t allocate/deallocate memory in a fragment program.

-SirKnight[/b]

Like the man said, you don’t (can’t) allocate memory in an FP or even a VP.

The kind of behavior you are seeing happens quite often. There is probably a specific number of loops that can cause the CPU usage to shoot from 5% to 100%.

Also, I think that the CPU usage monitor in Windows is imperfect. Sometimes it shows 100% usage but everything runs smoothly, and sometimes everything crawls.

V-MAN:

Do you have a feeling for why my application runs at so many fewer iterations per second when run 1000 times relative to 500 times - when the same code is being repeated?

Thanks.

Originally posted by V-man:
[b]Like the man said, you don’t (can’t) allocate memory in an FP or even a VP.

The kind of behavior you are seeing happens quite often. There is probably a specific number of loops that can cause the CPU usage to shoot from 5% to 100%.

Also, I think that the CPU usage monitor in Windows is imperfect. Sometimes it shows 100% usage but everything runs smoothly, and sometimes everything crawls.

[/b]

[This message has been edited by sek (edited 10-01-2003).]


Originally posted by V-man:
[b]… Also, I think that the CPU usage monitor in Windows is imperfect. Sometimes it shows 100% usage but everything runs smoothly, and sometimes everything crawls.

[/b]

Sometimes mine says 0% even when I have a few programs running, even though they are minimized. Maybe it’s actually between 0 and 1, so it’s a fraction and it only displays integers. Dunno.

-SirKnight

>>>>Do you have a feeling for why my application runs at so many fewer iterations per second when run 1000 times relative to 500 times - when the same code is being repeated?<<<<

I’m guessing the driver is at fault. I can’t detail what is happening so I would go back to what Korval said.

But you are running the code only once, right?
How long does it take to finish, and WHEN exactly does CPU usage go to 100% (as the loop runs, or when you swap buffers)?

You can experiment by putting a flush after the 500th iteration.

V-MAN:

But you are running the code only once, right?
How long does it take to finish, and WHEN exactly does CPU usage go to 100% (as the loop runs, or when you swap buffers)?

Yes - I am running the code only once. If NumSteps = 500, it takes only 0.06 seconds and the CPU spikes up only to about 15% and back down. If NumSteps = 1000, it takes 2.7 seconds and if NumSteps = 2000, it takes 8.4 seconds. In the latter two cases, the CPU spikes up to 100% immediately and stays there until it is done.

You can experiment by putting a flush after the 500th iteration.

I thought that with glutSwapBuffers() and double buffering, a glFlush() was not necessary - anyway, I put in a flush and it had no effect.

Thanks V-MAN,

Sean.

[This message has been edited by sek (edited 10-01-2003).]

Just a quick stab in the dark…have you tried glFinish() after the 500th iteration?

-SirKnight

SirKnight:

Yes. I just tried it again. The effect is that even at NumSteps = 500, the CPU goes to 100%, and execution time is slowed down dramatically for all values of NumSteps. However, glFinish() does seem to make the execution time increase linearly with NumSteps - just a lot slower, even at the higher NumSteps.

Originally posted by SirKnight:
[b]Just a quick stab in the dark…have you tried glFinish() after the 500th iteration?

-SirKnight[/b]

[This message has been edited by sek (edited 10-01-2003).]