Problem about parallelism in OpenGL

Which of the following operations can be done in parallel in OpenGL? The operations include:
input (glTexSubImage2D(), …),
rendering (glBegin(), …, glEnd()),
output (glReadPixels(), …),
internal pixel copy (glCopyTexSubImage(), …).
What I mean is: can two or more of these operations run simultaneously? For example, while rendering is going on, can input or output also be going on?

Have a look at the documentation of wglMakeCurrent.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/opengl/ntopnglr_2u0k.asp

Yes, you can render with multiple threads simultaneously. You can update textures while rendering something else…
You need multiple contexts, and you need to call wglShareLists for the asynchronous texture loading.
You need to be careful with multithreading in general. Don’t expect multithreaded rendering to actually be faster, because there is only one piece of hardware executing the drawing commands, and the OS timeslicing will add overhead.
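For example, here is a minimal sketch of that setup (assuming hdc is a valid device context with a pixel format already set; the actual texture uploads are only indicated by comments):

  #include <windows.h>
  #include <GL/gl.h>

  HGLRC g_renderRC, g_loaderRC;

  DWORD WINAPI LoaderThread(LPVOID param);

  void SetupSharedContexts(HDC hdc)
  {
      g_renderRC = wglCreateContext(hdc);
      g_loaderRC = wglCreateContext(hdc);
      /* Share the object space (textures, display lists) between the
         two contexts BEFORE either one is bound in a thread. */
      wglShareLists(g_renderRC, g_loaderRC);
      wglMakeCurrent(hdc, g_renderRC);   /* main thread renders with this */
      CreateThread(NULL, 0, LoaderThread, (LPVOID)hdc, 0, NULL);
  }

  DWORD WINAPI LoaderThread(LPVOID param)
  {
      HDC hdc = (HDC)param;
      wglMakeCurrent(hdc, g_loaderRC);   /* this thread gets its own RC */
      /* glBindTexture()/glTexSubImage2D() calls go here; the uploaded
         textures are visible to the render context because the object
         space is shared. */
      wglMakeCurrent(NULL, NULL);
      return 0;
  }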

To Relic:
Can RCs created in different threads be shared? I create 2 rendering contexts in different threads (one in the main thread, one in a new thread created by CreateThread()), but the call to wglShareLists() fails. Why?

I think the question isn’t really about multithreading, and multithreading doesn’t imply asynchronous behavior in GL.
Example: a texture must be fully uploaded before it is usable by secondary threads.

Anyway, standard GL doesn’t have any asynchronous behavior except in the case of NV_pdr (NV_pixel_data_range) and occlusion queries (part of 1.5).
Another one called conditional_render will appear soon.
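For what it’s worth, the occlusion query mechanism looks like this (a sketch assuming the GL 1.5 entry points are available; DrawBoundingBox() and DrawRealObject() are placeholders):

  GLuint query, samples;
  glGenQueries(1, &query);
  glBeginQuery(GL_SAMPLES_PASSED, query);
  DrawBoundingBox();                   /* cheap proxy geometry          */
  glEndQuery(GL_SAMPLES_PASSED);
  /* ... issue other work here; the query completes asynchronously ... */
  glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);  /* may block  */
  if (samples > 0)
      DrawRealObject();                /* draw only if maybe visible    */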

If you call glReadPixels, glDrawPixels, glTexImage and other such functions, they won’t return until the job is done.

For rendering it’s different (glDrawPixels included?); there you will most likely find parallelism. The driver buffers GL commands and sends batches to the GPU.

In the case of software drivers, it’s totally different. Some don’t buffer much and try to render as soon as possible.

To pango:
Not sure how you’re doing this, but normally you call wglShareLists once, before you really start to use the RCs. I have seen problems when the RCs were already bound. Calling wglShareLists from two different threads sounds dangerous.

The driver must serialize rendering commands from different threads anyway. Using the same rendering context from two different threads is a completely pointless endeavour, even if you can get it to work (which isn’t easy, mind you). A context can only be current to one thread at a time.

wglShareLists doesn’t share state AFAIK, it only shares objects (display lists, textures, buffer objects).

That being said, there’s sufficient parallelism you can get automatically, without doing anything special. If you don’t call glReadPixels or the like frequently, the graphics chip can do a lot of work in parallel to the CPU. Most, if not all OpenGL drivers just create enormous job queues for the chip to process. Only the creation of the queue is done on the host processor (which is comparatively light-weight), while the processing is done “later”, with little CPU involvement.
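For example, the usual frame loop already overlaps the two processors without any threads (DrawScene(), DoCpuWork() and hdc are placeholders here):

  glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
  DrawScene();        /* GL calls return as soon as they are queued     */
  DoCpuWork();        /* runs while the GPU drains the job queue        */
  SwapBuffers(hdc);   /* one of the few points that may wait on the GPU */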

Originally posted by V-man:
If you call glReadPixels, glDrawPixels, glTexImage and other such functions, they won’t return until the job is done.
Just a note for accuracy’s sake: glDrawPixels and glTexImage do not need to be synchronous and can return before the work is done.
The driver just puts the data in the DMA buffer, fires the transaction and goes back to the application before the data has been transferred (the data cannot simply be transferred directly from the user pointer, because the app could modify the buffer right after the function returns).
The serialization imposed by the sequential nature of the command buffer will ensure proper rendering.

As you mention, this is not possible for glReadPixels without changing the current API, because for this function the app must have the data ready when the function returns.
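In other words (a sketch; w, h, pixels and out stand for the application’s sizes and buffers):

  /* Legal: the driver must snapshot 'pixels' before returning, even
     if the actual upload is still in flight on the hardware. */
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                  GL_RGBA, GL_UNSIGNED_BYTE, pixels);
  free(pixels);

  /* No such freedom here: 'out' must contain the final data when the
     call returns, so the driver has to wait for rendering to finish. */
  glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, out);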

Thanks for everyone’s replies. Now let me assume the 3 threads below can be created:
-a thread only for writing image data to a texture;
-a thread only for rendering using a texture;
-a thread only for reading the image back using glReadPixels().

All 3 threads work with different objects (I mean: while thread 1 is writing data to tex1, thread 2 is rendering the scene using tex2 in rc1, and thread 3 is reading data from rc2), so can they run in parallel?

I’m writing a video processing engine using OpenGL. The engine writes a real video stream to a texture (using glTexSubImage2D()), does some rendering, and then reads the image back. The engine must finish all these jobs within 0.04 s (one frame at 25 fps), so improving the throughput of OpenGL I/O is very important to me. If I could put the writing, rendering and reading in different threads, how much of a performance improvement would I get?

Just a note for accuracy’s sake: glDrawPixels and glTexImage do not need to be synchronous and can return before the work is done.
The driver just puts the data in the DMA buffer, fires the transaction and goes back to the application before the data has been transferred (the data cannot simply be transferred directly from the user pointer, because the app could modify the buffer right after the function returns).
DrawPixels, because it pulls from cached system memory, cannot be DMA’d. It has to be copied into DMA memory, at the very least, for any async action to work. Also, DrawPixels requires the generation of fragments, so fragment programs still operate; it doesn’t just set pixels in the framebuffer. As such, a simple DMA isn’t even the right answer; it has to pass through the pixel pipe.

TexImage is even worse because it reallocates an object; this is a driver-side thing, not hardware-side. It also doesn’t operate on the framebuffer, so it is mostly a code-side operation. Of course, it has the same DMA problems as DrawPixels.

Thanks for everyone’s replies. Now let me assume the 3 threads below can be created:
-a thread only for writing image data to a texture;
-a thread only for rendering using a texture;
-a thread only for reading the image back using glReadPixels().

These are 3 sequential operations. You can’t render with a texture until it has been written to. And you can’t read back a meaningful framebuffer until you have rendered something. Multithreading this is nonsense.

The engine must finish all the jobs within 0.04 s
You can’t guarantee something like this. At the very least, your OS may task-switch your process out. Also, you can’t know how long a glTexSubImage operation will take.

Originally posted by Korval:
Just a note for accuracy’s sake: glDrawPixels and glTexImage do not need to be synchronous and can return before the work is done.
The driver just puts the data in the DMA buffer, fires the transaction and goes back to the application before the data has been transferred (the data cannot simply be transferred directly from the user pointer, because the app could modify the buffer right after the function returns).
DrawPixels, because it pulls from cached system memory, cannot be DMA’d. It has to be copied into DMA memory, at the very least, for any async action to work.

In case you want to learn something for a change: all I’ve said is true. In this case caching has nothing to do with the ability to DMA; you can lock down the cached memory and DMA from there. You just need a physical address, and to flush the cache (and in the case of PCI transactions you don’t even need to flush the cache, as they snoop the cache). Don’t confuse AGP with being able to DMA.

The reason the data has to be copied to the DMA buffer rather than being DMA’d from the user buffer (which is exactly what I said) has nothing to do with the buffer being in cacheable memory; it’s because as soon as the function returns to the application, the app is free to modify the data, which would result in incorrect (modified) data being sent to the card.
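A toy illustration of that rule (AcquireDmaSpace() and KickDmaTransfer() are made-up names, not real driver entry points):

  #include <string.h>   /* memcpy */

  /* Not real driver code: shows why the copy happens before return. */
  void driver_TexSubImage(const void *userData, size_t n)
  {
      void *dmaBuf = AcquireDmaSpace(n);   /* hypothetical ring allocator */
      memcpy(dmaBuf, userData, n);         /* snapshot the user's data    */
      KickDmaTransfer(dmaBuf, n);          /* hardware fetches it async   */
      /* return now; the app may immediately overwrite userData */
  }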

Also, DrawPixels requires the generation of fragments, so fragment programs still operate; it doesn’t just set pixels in the framebuffer. As such, a simple DMA isn’t even the right answer; it has to pass through the pixel pipe.

Huh? And who says that DMA’d data doesn’t go through the pixel pipe? Or do you think that DMA’d graphics commands are just written to some piece of memory “without ever going through the pixel pipe”?

TexImage is even worse because it reallocates an object; this is a driver-side thing, not hardware-side. It also doesn’t operate on the framebuffer, so it is mostly a code-side operation. Of course, it has the same DMA problems as DrawPixels.

Nonsense. As always, believe what you want, but keep your beliefs to yourself and stop arguing with people who know the facts. Don’t mislead people with your “ideas” or “imaginations”.

Huh? And who says that DMA’d data doesn’t go through the pixel pipe? Or do you think that DMA’d graphics commands are just written to some piece of memory “without ever going through the pixel pipe”?
Actually, yes. If I were designing hardware, I would not tie my DMA engine to my fragment pipeline. There are any number of reasons for that, the most important of which is concurrency: I would like the driver to be able to DMA up one texture while I’m rendering with another.

And what about vertex DMAs (DMA’ing a VBO into video memory, for example), which are clearly not pixels? Should they go through the pixel pipe too?

How it is you claim to “know the facts” is beyond me when you say something nonsensical like that.

Now, it may be possible, in limited circumstances, to bind the DMA to the fragment pipeline. But I wouldn’t want to build my hardware that way. And the fact that glDrawPixels is a slow operation on all kinds of hardware, even granting the multiple copies, probably tells you something about how hardware is designed.

Nonsense. As always, believe what you want
Like the undeniable fact that glTexImage causes allocation of textures, or the undeniable fact that this allocation is a driver-side process, with only a mild connection to any hardware or DMA process?

These are 3 sequential operations. You can’t render with a texture until it has been written to. And you can’t read back a meaningful framebuffer until you have rendered something. Multithreading this is nonsense.

Korval, you mistook what I said. The 3 threads all work on different objects: while thread 1 is writing data to tex1, thread 2 is rendering using tex2 in rc1, and thread 3 is reading data from rc2. Since the 3 threads don’t work on the same object, can they work simultaneously?

Since the 3 threads don’t work on the same object, can they work simultaneously?
Maybe. There’s still only one piece of hardware. If that hardware can handle DMA’ing a texture, while rendering, while DMA’ing from a different rendering context (with its own framebuffer), then yes. But there’s no way to tell whether this will work or not.

Give it a shot and see what you get. Worst case, it goes slightly slower (due to thread overhead) than the single-threaded case. Actually, the absolute worst case is that the driver developers hadn’t considered this as a possibility and it fails spectacularly.

Why has no one answered my original question? I want to know which of the operations below can be done at the same time:
-input (glTexSubImage2D(), …),
-rendering (glBegin(), …, glEnd()),
-output (glReadPixels(), …),
-internal pixel copy (glCopyTexSubImage(), …)

Originally posted by Korval:
Since the 3 threads don’t work on the same object, can they work simultaneously?
Maybe. There’s still only one piece of hardware. If that hardware can handle DMA’ing a texture, while rendering, while DMA’ing from a different rendering context (with its own framebuffer), then yes. But there’s no way to tell whether this will work or not.

Give it a shot and see what you get. Worst case, it goes slightly slower (due to thread overhead) than the single-threaded case. Actually, the absolute worst case is that the driver developers hadn’t considered this as a possibility and it fails spectacularly.

Korval, do you mean that whether my idea can work depends on the driver?

…depends on the driver?
What do you think we have been talking about here?

If you read the GL spec, you will notice that it describes how those functions should behave.

It does not say what your driver should do or how it should utilize your hardware.

The driver just puts the data in the DMA buffer, fires the transaction and goes back to the application before the data has been transferred
What’s a DMA buffer? Any memory area?

Yes, I do believe that the driver should “take” your pixels,
but I have heard that this could crash some systems:

  GLubyte *pix = (GLubyte *) malloc(width * height * 4);  /* RGBA readback buffer */
  glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pix);
  free(pix);   /* should be legal: the data must be complete on return */

and yes, this has happened to me, and putting a glFinish in fixed it (a long time ago)
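i.e. the workaround looked something like this:

  glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pix);
  glFinish();   /* should not be needed per the spec, but forcing
                   completion before the free dodged the driver bug */
  free(pix);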

Originally posted by V-man:
The driver just puts the data in the DMA buffer, fires the transaction and goes back to the application before the data has been transferred
What’s a DMA buffer? Any memory area?

In this case I was referring to the DMA buffer where the driver puts the graphics commands that drive the graphics card. “Uploading” a texture doesn’t need to be done by accessing video memory directly; the graphics card may have a special blit command which takes the data from the command stream and blits it into texture memory. The same protocol can be used for DrawPixels, but in that case the data is processed through the pipeline up to the unit that deals with fragment shading.

In general, a DMA buffer is any piece of memory from which a secondary controller can fetch data asynchronously from the main controller (CPU).
The buffer can be mapped on regular cached memory, AGP memory, uncached memory or write-combined memory, you name it. The only difference is that on some of those you may need to take care of memory coherence issues (flushing caches), depending on the kind of DMA transaction.

Yes, I do believe that the driver should “take” your pixels, but I have heard that this could crash some systems … and yes, this has happened to me, and putting a glFinish in fixed it (a long time ago)
I’ve seen that kind of problem, and I’ve recommended the same approach: use a glFinish to test whether it’s a driver bug (or, more likely, the app trashing memory). But if that happens, it is an OpenGL implementation bug, and it has to be fixed. Under the current spec, the application should be able to do whatever it sees fit with the data buffer after the function call returns.

Everyone, below is a mail I received from an NVIDIA engineer:

I’m assuming your processing looks like this:

  1. CPU sends Frame1 to Texture1
  2. GPU renders using Texture1 to Buffer1
  3. CPU reads Buffer1
    continue for Frame2…

Notice that at 1 and 3, GPU is waiting (not doing anything) and at 2 CPU is waiting.
Ideally, we should do this:

  1. CPU sends Frame1 to Texture1
  2. GPU renders using Texture1 to Buffer1 / CPU sends Frame2 to Texture1
  3. CPU reads Buffer1 / GPU renders using Texture1 to Buffer1

However at 2, CPU and GPU use the same Texture1, and at 3 same Buffer1.
This will corrupt the buffers. Using multiple textures / buffers, we can do:

  1. CPU sends Frame1 to Texture1
  2. GPU renders using Texture1 to Buffer1 / CPU sends Frame2 to Texture2
  3. CPU reads Buffer1 / GPU renders using Texture2 to Buffer2

Do you see we are saving time, by starting to work on frame 2 before
frame 1 finishes? Of course your application might not be able to make use
of this if you have to process frames one at a time.

So from what he said, I believe we can write to and read from the GPU while it is doing some rendering. But how do I implement such parallel operation, and how much performance can I gain?
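To make that schedule concrete, here is a single-threaded sketch of the ping-pong scheme (upload(), render() and readback() are placeholders for the glTexSubImage2D, drawing and glReadPixels stages; tex[2] and buf[2] are the two textures and two render targets from the mail):

  for (int frame = 0; frame < nFrames; ++frame) {
      int cur  = frame & 1;           /* resources for this frame      */
      int prev = cur ^ 1;             /* resources for the last frame  */

      upload(tex[cur], frame);        /* CPU: glTexSubImage2D          */
      render(tex[cur], buf[cur]);     /* GPU: queued, returns quickly  */

      if (frame > 0)
          readback(buf[prev]);        /* CPU: glReadPixels of the
                                         previous frame while the GPU
                                         is still rendering buf[cur]   */
  }
  readback(buf[(nFrames - 1) & 1]);   /* drain the final frame         */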