
View Full Version : GPU CPU parallelism



tarantula
06-09-2003, 07:34 AM
How should the rendering be done so that the parallelism can be maximised?
Is this possible on old cards (TNT)?

I tried to find out the time taken by the swapBuffers call when I was drawing a lot of stuff. I got 0 or 1ms :s

drawstuff();
glFlush();
// should the code that we want to run in parallel be inserted here?
swapBuffers();

Korval
06-09-2003, 08:15 AM
Well, a TNT doesn't have a T&L unit. The only kind of parallelism you can hope to get is based on fillrate. So, if you're drawing lots of small polys, you can't get much parallelism.

In any case, use VBOs. That should provide you with as much parallelism as you can get.
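For reference, a minimal sketch of the kind of VBO usage being suggested, using the ARB_vertex_buffer_object entry points of the time. The vertex array 'verts', the count 'N', and having already fetched the extension function pointers are assumptions, not part of the original post:

GLuint vbo;
glGenBuffersARB(1, &vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, N * 3 * sizeof(GLfloat), verts, GL_STATIC_DRAW_ARB);

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (void *)0);   /* an offset into the bound VBO, not a client pointer */
glDrawArrays(GL_TRIANGLES, 0, N);

glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);      /* unbind so later array calls use client memory again */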

paulc
06-09-2003, 08:28 AM
The best way to get parallelism is to flush the pipeline just before you draw each frame. This means that all the drawing is being done by the card while you're doing update calculations (physics, AI etc). The order is very simple:

while( !quit )
{
update( AI physics ... )
glFlush()
swapbuffers
draw( everything )
}

dorbie
06-09-2003, 09:24 AM
That is a really atrocious main loop structure.

Why would you flush AFTER you do your compute?

On top of this you call swapbuffers immediately after, which has an implicit flush anyway?

Bad.

MacReiter
06-09-2003, 10:30 AM
Not my post, but thought I'd toss in my understanding/suggestion anyway:

You HAVE to do glFlush AFTER the calculations, or else you get no parallelism at all. The glFlush forces resynchronization between the GPU and CPU. So if you draw out your scene and then immediately glFlush before your calculations, you just stall the CPU until the GPU can catch up. Then the GPU sits there doing nothing while the CPU updates the AI. That's why the AI calculations occurred BEFORE the glFlush.

Actually, while I know that the GPU can do a lot of work without needing further CPU intervention, there must be some limit to it. If so, the other glXxxxx commands will be forced to stall along the way. I wonder if it wouldn't be better to interleave your processing in between large portions of your rendering code to distribute the load. Of course, the better way to do that would be to have your rendering code and your parallel code in separate threads so that if the rendering code had to stall waiting for the GPU, the parallel code could get work done. If you want frame synchronization, you can still use a mutex/semaphore/condition variable/event or whatever else is handy to keep the two threads synchronized on a per-frame basis.

An extremely vague version would look like this:

/* Assumes (not in the original post): <windows.h> and <GL/gl.h> are included,
   'hdc' is the window's HDC, 'done' is a shared flag, and AiReadyEvent /
   RenderReadyEvent are auto-reset events created with CreateEvent. */

void RenderThread(void)
{
    while (!done)
    {
        RenderWorld();
        glFlush();
        SwapBuffers(hdc);
        WaitForSingleObject(AiReadyEvent, INFINITE);  /* wait for the AI thread's frame */
        SetEvent(RenderReadyEvent);                   /* then let it start the next one */
    }
}

void AIThread(void)
{
    while (!done)
    {
        UpdateAiSystems();
        SetEvent(AiReadyEvent);                       /* tell the renderer we're done */
        WaitForSingleObject(RenderReadyEvent, INFINITE);
    }
}


OK, there's probably all kinds of hideousness there. It would probably be better to have a third thread that received notifications as Render and AI became ready, and then released an event when both were ready. I was just trying to get the shortest possible version out there.

PLEASE NOTE: if you try to use the above code, note that the two threads perform their wait/signal in opposite orders. This is necessary to avoid deadlock. I do not know if the order I chose is "optimal", or if it even matters. This approach does not scale up nicely to more than 2 threads. There really should be some kind of "manager" thread, like I said...

Also, thread programming is not for the faint of heart. Debugging gets entertaining. I wouldn't recommend adding threading to an otherwise single-threaded application just for this parallelism. But if you're multithreaded anyway, what the heck.

Lastly, it is possible that thread context switching will be so slow that you won't be able to get anything useful done during the short stalls anyway. You'd just have to test and find out. Compare the render/calculate/flush single threaded performance to the multithreaded performance and see which is better.

As for SwapBuffers doing an implicit glFlush, I don't know. I can imagine a way in which it wouldn't need to, but using such a system would make time synchronization fairly difficult. I chose to ignore the issue in the preceding code.

Of course, I'm just a hobbyist OpenGL programmer, so any of you who do this "for real" can feel free to point out all of the things I've overlooked or misunderstood :)

Mac

tarantula
06-09-2003, 04:56 PM
Korval, I'll try to make a fillrate intensive app and see how much of parallelism I can get.

paulc, I don't understand why a glFlush() has to be issued before swapBuffers. Shouldn't it be issued before update(AI physics ... ) so that the card will start executing?

Also, the way I understand it, glFlush returns immediately so it wouldn't stall the cpu; swapBuffers and glFinish will stall the cpu. I was assuming flushing the pipeline would start the rendering, and when the swapBuffers call is made the stall will be very small if most of the stuff has already been rendered.

Is "parallel time" the time taken by the swapBuffers call issued immediately after the drawing code?

Thanks for your replies, but I'm more confused now.



tfpsly
06-09-2003, 11:05 PM
Originally posted by tarantula:
paulc, I don't understand why a glFlush() has to be issued before swapBuffers. Shouldn't it be issued before update(AI physics ... ) so that the card will start executing?

glFlush does not tell the gpu to start doing its job, but waits until it finishes it. So it's to be called after you do your game computations, never before.

BTW, there is no point in calling glFlush just before swapbuffers, as the latter will call glFlush! Only call swapbuffers; get rid of the glFlush in your loop:
while ( !finished )
{
Render
UpdateGame
Swap
}

bashbaug
06-09-2003, 11:20 PM
There is understandably some confusion on this as the GL spec is pretty vague in this area. Basically, all it guarantees is that glFlush will cause the commands you just queued to complete "sometime" and that glFinish won't return until all commands are complete.

So... what does this really mean? In my experience, calling glFlush sends any queued commands to the hardware but doesn't wait until the commands are finished before returning. glFinish also sends any queued commands to the hardware and DOES wait until the commands are finished before returning. So calling glFlush won't stall the CPU but glFinish probably will.

As others have mentioned, SwapBuffers does an implicit glFlush, NOT an implicit glFinish, so in most cases it won't stall the CPU. The few cases where it will stall the CPU are where the CPU has so many frames queued up that it makes sense to throttle back for interactivity reasons.
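Put together, that suggests a loop along these lines. This is just a sketch based on the above; DrawScene/UpdateGame and the Win32 device context 'hdc' are placeholders:

while (!quit)
{
    DrawScene();        /* queue this frame's GL commands */
    glFlush();          /* push them to the hardware; returns without waiting */
    UpdateGame();       /* CPU work (AI, physics) overlaps the GPU rendering here */
    SwapBuffers(hdc);   /* implicit flush; only blocks if too many swaps are already queued */
}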

Hope this makes sense.

-- Ben

roffe
06-09-2003, 11:28 PM
Originally posted by tfpsly:
glFlush does not tell the gpu to start doing its job, but waits until it finishes it.
No, glFlush does not block.
Personally I like facts. From http://wwws.sun.com/software/graphics/OpenGL/manpages/glFlush.html

DESCRIPTION
Different GL implementations buffer commands in several different locations, including network buffers and the graphics accelerator itself. glFlush empties all of these buffers, causing all issued commands to be executed as quickly as they are accepted by the actual rendering engine. Though this execution may not be completed in any particular time period, it does complete in finite time.

Because any GL program might be executed over a network, or on an accelerator that buffers commands, all programs should call glFlush whenever they count on having all of their previously issued commands completed. For example, call glFlush before waiting for user input that depends on the generated image.

NOTES
glFlush can return at any time. It does not wait until the execution of all previously issued GL commands is complete.

Ysaneya
06-09-2003, 11:58 PM
*cry*

Why are there so many people believing that glFlush blocks? Seems to be in the top list of OpenGL myths..

Y.

tarantula
06-10-2003, 02:48 AM
Hmm.. now swapBuffers is not a blocking call? Doesn't swapBuffers return after the buffers are swapped?? Then, the drawing must be completed before swapping and hence swapBuffers must stall.

Btw, *I* do understand what glFlush does. But I am not sure where I can get the parallelism from. Somebody please tell me how the parallelism can be achieved. jwatte? dorbie?

tfpsly
06-10-2003, 03:00 AM
Originally posted by roffe
No, glFlush does not block...

Interesting. Thanks!


Originally posted by tarantula:
Hmm.. now swapBuffers is not a blocking call? Doesn't swapBuffers return after the buffers are swapped?? Then, the drawing must be completed before swapping and hence swapBuffers must stall.
Of course it does!


Btw, *I* do understand what glFlush does. But I am not sure where I can get the parallelism from. Somebody please tell me how the parallelism can be achieved. jwatte? dorbie?

It comes from the fact that while the gpu is finishing its job, the cpu is free and can be used for whatever you want it to do. Then when you're finished, swap the buffers (stalling the cpu until the gpu is done) and both PUs will get synchronized.

roffe
06-10-2003, 03:24 AM
Originally posted by tarantula:
But I am not sure where I can get the parallelism from.

Achieving good parallelism is hard. You must benchmark your app extensively to find bottlenecks and move operations around, so you find a good equilibrium between cpu and gpu workload. If using only glFlush/glFinish for synchronization you are left with:
i) send work to gpu
ii) do variable amount of cpu work
iii) block, more cpu or more gpu work

By using extensions such as NV_fence you can poll the gpu for partial completion (rough sketch below the list), which lets you do:
i) send work to gpu
ii) some cpu work
iii) poll gpu
iv) more cpu work
v) poll,block,whatever
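A rough sketch of that pattern using the NV_fence entry points; the work functions here are placeholders, not anything from the original post:

GLuint fence;
glGenFencesNV(1, &fence);

SendWorkToGpu();                           /* i) issue this frame's GL commands */
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);  /* fence completes when the commands before it finish */
glFlush();

DoSomeCpuWork();                           /* ii) */

if (!glTestFenceNV(fence))                 /* iii) non-blocking poll */
{
    DoMoreCpuWork();                       /* iv) */
    glFinishFenceNV(fence);                /* v) block only when nothing is left to overlap */
}

glDeleteFencesNV(1, &fence);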

nystep
06-10-2003, 03:24 AM
NAME
glFinish - block until all GL execution is complete


That must be what we're looking for. Before reading the thread, I must say I believed glFlush blocked too.

bashbaug
06-10-2003, 05:34 AM
Originally posted by tarantula:
Hmm.. now swapBuffers is not a blocking call? Doesn't swapBuffers return after the buffers are swapped?? Then, the drawing must be completed before swapping and hence swapBuffers must stall.

You can queue up the swap just like you can queue up any other rendering call...

-- Ben

V-man
06-10-2003, 07:45 AM
"No, glFlush does not block..."

It blocks until commands are sent to the server, which is pretty quick on a PC.

But I'm not sure this is a needed function. I think that on PCs, as soon as a command is called it is executed, or it is executed as soon as possible.

Take this dumb example:

glBegin(GL_TRIANGLES);
glVertex3f(0.0, 0.0, 0.0);
glVertex3f(1.0, 0.0, 0.0);
glVertex3f(1.0, 1.0, 0.0);
glVertex3f(2.0, 2.0, 0.0);
glVertex3f(5.0, 5.0, 0.0);
glVertex3f(3.0, 3.0, 0.0);
glEnd();

When the third glVertex is called, the first triangle is rendered. When the sixth glVertex is called, the second triangle is rendered.

This is something I observed in software mode, but it may not be true for hw.

For the case of glDrawRangeElements, I imagine the whole arrays must be uploaded before execution begins.

evanGLizr
06-10-2003, 09:28 AM
Originally posted by V-man:
[B]"No, glFlush does not block..."

It blocks until commands are sent to the server, which is pretty quick on a PC.

But Im not sure this is a needed function. I think that on PC's as soon as a command is called it is executed or it is executed as soon as possible.


Even on PC's that's not really the case.

Normally commands are put into a buffer, be it the DMA buffer itself (uncached memory, normally AGP write-combined) or a temporary buffer (cached memory) which will need to be copied to the real DMA buffer sometime.

In both cases you must have some granularity to initiate DMA transfers (initiating them on each command you put into the buffer is a no-no).

So you DO need glFlush to tell the driver that now is the right time to initiate a DMA transfer; otherwise the DMA transfer won't begin until you've run out of space in the buffer or until you've reached some granularity hardcoded in the driver (or not so hard-coded if the driver has some load-balancing heuristics).

Regarding whether wglSwapBuffers blocks or not (i.e. whether it internally calls glFinish or not), that depends on the OS (yes, win9x behaves differently to Win2k and both differently to WinNT), registry settings and some wacky things the driver can do to avoid glFinish being called from wglSwapBuffers' internal code (tracing back the call stack).


jwatte
06-10-2003, 11:40 AM
I wouldn't count on a lot of parallelism on a TNT2.

I WOULD count on a lot of parallelism on a GeForce2.

Note that the cards are likely to start rendering even before you call glFlush()/SwapBuffers(), unless the cards are tile renderers like the i845. If you issue some geometry, and the card is idle, it might as well go ahead and kick it off.

To get parallelism, you don't really need to do anything special, as long as you issue vertex array geometry with "common case" geometry states. Doing any kind of read/get, or excessively uploading textures and stuff, will probably cause blocking/stalls/less parallelism.

If you were to try to get parallelism out of a TNT2, you'd have to do something like rendering all your small geometry first (where small triangle fill might overlap with the transform of the next thing) and then draw your large geometry (walls, sky box, whatever) and call glFlush() before starting the calculation for the next frame.

Of course, if all you do is:

forever() {
calculate
draw
swapbuffers
}

Then that's pretty much what you're doing, anyway.

dorbie
06-10-2003, 04:44 PM
It's already been said, but flush is non-blocking and swapbuffers performs an implicit flush. When you know this, a lot of it is self-evident, although big FIFOs etc. make it somewhat moot. It does still depend on where your bottlenecks are, and that depends too on your hardware.

I seem to value latency more than most, and would do some things in my loop that you may not, which make these issues more critical.

dorbie
06-10-2003, 05:07 PM
FYI on windows (and Linux) swapbuffers will block IF there's another swapbuffers in the queue. FIFOs are large and can in fact store several frames in some instances, hosing your latency, so the policy is to limit the FIFO to one frame. This varies with implementations. Issue so much as a glNormal after swap on IRIX and you block. There's a good reason for this, but it's lost on people who don't even sync to vertical retrace.

Parallelism comes from things like FIFOs on the card that can store data and commands, DMAs by the card, and display list memory on the card and in your mapped agp memory. One of the advantages of a GPU is that your CPU is only concerned with dispatch, not T&L, and even then it's not busy with it because you're smart about dispatch. Even without a GPU you could benefit from graphics parallelism while the card is busy with 'setup' and fragment processing. Even with the best card you have to be careful not to do anything that would block the CPU, of course, even the smallest glReadPixels for example, but other things might do it, and implementations and extensions can add their own quirks.

nystep
06-10-2003, 11:24 PM
Hmm,

I'm not a very advanced OpenGL coder and my knowledge is limited, but what's the use of blocking the CPU waiting for the GPU to finish rendering, when a simple swapbuffers call would be queued in a FIFO buffer and would let the CPU calculate the next frame?

The question can be read as: isn't calling glFinish the same as wasting CPU time to wait for GPU?

regards,

zeckensack
06-11-2003, 02:54 AM
Originally posted by Korval:
Well, a TNT doesn't have a T&L unit. The only kind of parallelism you can hope to get is based on fillrate. So, if you're drawing lots of small polys, you can't get much parallelism.

In any case, use VBOs. That should provide you with as much parallelism as you can get.
Huh?
As you said, a TNT doesn't have T&L. VBO offers an abstract way to store geometry in card memory, which is only useful for T&L cards. For non-T&L, if geometry data resides on the card, it would have to travel back to system memory for transformation.
So?

V-man
06-11-2003, 03:01 AM
"The question can be read as: isn't calling glFinish the same as wasting CPU time to wait for GPU?"

I don't think anyone said you should use glFinish. The talk is about using glFlush to initiate the DMA transfer and get the GPU to render WHEN you want, and you can continue using the CPU as the GPU does its job.

And I think it's better to use SwapBuffers instead of wglSwapBuffers. I had some slowness on some old hardware with wglSwapBuffers. But I don't want to debate which to use here.

MacReiter
06-11-2003, 03:52 AM
Originally posted by nystep:
The question can be read as: isn't calling glFinish the same as wasting CPU time to wait for GPU?

OK, having already stuck my foot in my mouth as one of the original "glFlush blocks" people (see above) (and I did look it up in the MSDN and realize my mistake afterwards, which I should have done before posting, but anyway...), I'm kinda nervous saying anything, but for what it's worth:

Yep, glFinish would "waste" CPU time waiting on the GPU. You would only use it if:
1. You had nothing better to do on the CPU anyway, and:
2. You needed to synchronize the CPU and GPU for some simulation-correctness reason (and apparently NV_fence is much nicer, although I know nothing about it), or:
3. You wanted to know how long it took the GPU to do something. If you just profiled the rendering code without putting a glFinish at the end, you would only be profiling how long it took the CPU to issue (or possibly just to buffer) the commands. (See the sketch below.)

All of which sounds like "normal" programs wouldn't need glFinish.
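For point 3, a sketch of what that profiling might look like on Windows; QueryPerformanceCounter is just one timer choice, and RenderWorld is a placeholder for whatever GL work is being measured:

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

glFinish();                       /* make sure nothing queued earlier is still pending */
QueryPerformanceCounter(&t0);

RenderWorld();                    /* the GL work we want to time */
glFinish();                       /* wait until the GPU has actually finished it */

QueryPerformanceCounter(&t1);
double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;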

dorbie
06-11-2003, 07:57 AM
Yep, glFinish is not everyone's cup of tea but I have advocated it in the past to cut latency and/or sync input for improved consistency. It led to a big discussion/disagreement with a driver guy here, but people who write drivers and try to go for best fps above all else don't share my priorities :-).

Your main loop will definitely change depending on your priorities; this becomes even more critical if you do anything that blocks on graphics, because a nonblocking swap and a big FIFO hide a lot of these issues.


Korval
06-11-2003, 08:22 AM
As you said, a TNT doesn't have T&L. VBO offers an abstract way to store geometry in card memory, which is only useful for T&L cards. For non-T&L, if geometry data resides on the card, it would have to travel back to system memory for transformation.

Which is why appropriate TNT drivers that implement VBOs will not store them in video or AGP memory, regardless of what switches you use on them. Which is, of course, the whole point of having hints.

As I said, if you use VBOs, you will get as much parallelism as you can get on the hardware you're using. That it happens to be very little on some hardware doesn't change that you're getting what you can get.

roffe
06-11-2003, 08:36 AM
Originally posted by MacReiter:
All of which sounds like "normal" programs wouldn't need glFinish.

Any program that needs to read something back from gpu buffers, based on previous work, has to use glFinish, unless you benchmark so precisely that you know for sure the information will be available after n frames/cycles/other on all relevant hw. I'll let others decide how "normal" this behaviour is.

jwatte
06-11-2003, 09:54 AM
Finish() is not necessary before a ReadPixels(), because ReadPixels() implicitly finishes up to the point where you read.

If you're trying to read "the screen" rather than the GL framebuffer using some out-of-band API, you're probably in for trouble, as different implementations may use different mechanisms to show you the framebuffer (including video overlay!)

roffe
06-11-2003, 10:22 AM
Originally posted by jwatte:
Finish() is not necessary before a ReadPixels(), because ReadPixels() implicitly finishes up to the point where you read.
For normal (hmm, what is normal?) read-back functionality I think this is true, but extensions such as NVIDIA's PDR bend these rules somewhat, or relax them as the spec puts it. Maybe there are other extensions that act similarly.

dorbie
06-11-2003, 10:25 AM
Yep, OpenGL is consistent on readback: there's no need to glFinish, since blocking is already implied by something like a readback call, because the readback must wait for rendering to complete before fetching pixels. It doesn't just implicitly flush, it *guarantees* that all relevant processing, including *fragment* processing, is complete. I doubt there are any implementations that do anything smarter than waiting on all pending fragment processing here. glFinish would be a bad thing to do immediately before readback, since it would introduce the additional delay of transporting the readpixels command to graphics, which would otherwise happen during rendering. (Not talking about extensions.)


cass
06-11-2003, 07:10 PM
Originally posted by zeckensack:
Hu?
As you said, a TNT doesn't have T&L. VBO offers an abstract way to store geometry in card memory, which is only useful for T&L cards. For non-T&L, if geometry data resides on the card, it would have to travel back to system memory for transformation.
So?

VBO can still do all of the software-T&L optimizations that the ill-defined Compiled Vertex Arrays were used for in the past.

cass
06-11-2003, 07:20 PM
Originally posted by roffe:
For normal (hmm, what is normal?) read-back functionality I think this is true, but extensions such as NVIDIA's PDR bend these rules somewhat, or relax them as the spec puts it. Maybe there are other extensions that act similarly.

Yes, for PDR, you have to use fences to determine when a ReadPixels operation has completed.

The PBO (Pixel Buffer Object) extension - whose specification has unfortunately not yet commenced - will allow safe, asynchronous pixel transfers in a high performance, portable way.

Bother your local ARB representative if you'd like to see action on this extension. :)

Thanks -
Cass