Creating objects in another thread

I’m trying to use another thread for creating (and loading) OpenGL objects.
I already have a successful D3D implementation of this, since D3D supports threaded object creation.

As for OpenGL, I have followed the guideline described here, and I’m also using GLFW3 for managing contexts:

[ul]
[li]At init, I create the main context, and make it current only after the other contexts and threads are created[/li]
[li]At init, still in the main thread, I create a context for the loader thread (a new 1x1 hidden window in GLFW), passing the main thread’s context as the “share” parameter[/li]
[li]At the start of the loader thread’s execution, I make that context current[/li]
[li]I create my objects in the loader thread[/li]
[li]After loading is done, I pass the object IDs back and render them in the main thread (rough sketch below)[/li]
[/ul]
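Roughly, the setup looks like this (simplified sketch with a current GLFW3, not my actual code; window sizes and names are just illustrative):

[code]
#include <GLFW/glfw3.h>

GLFWwindow* g_mainWindow   = nullptr;
GLFWwindow* g_loaderWindow = nullptr; // hidden 1x1 window that only carries the shared context

void initContexts()
{
    glfwInit();
    g_mainWindow = glfwCreateWindow(1280, 720, "main", nullptr, nullptr);

    // Loader context is created in the main thread, sharing with the main context.
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);
    g_loaderWindow = glfwCreateWindow(1, 1, "loader", nullptr, g_mainWindow); // 'share' parameter

    // Main context becomes current only after the other context/thread are set up.
    glfwMakeContextCurrent(g_mainWindow);
}

void loaderThreadFunc()
{
    glfwMakeContextCurrent(g_loaderWindow); // make the shared context current in this thread
    // ... glGenBuffers / glBufferData / glTexImage2D for the loaded resources ...
    glFlush(); // submit the commands before handing the object IDs back to the main thread
}
[/code]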

But I’m currently having a couple of problems that I need to ask about…

[ul]
[li]Besides buffer and texture objects, I’m creating VAOs. I’ve read somewhere that VAOs can’t be shared, and indeed I don’t get valid geometry. Is this true?[/li]
[li]Performance is terrible. When I create another context shared with the main one, frame times go from 3-4 ms to 20 ms per frame (using the latest AMD GPU drivers). Any tips, or is there anything wrong with this method of loading?[/li]
[/ul]

Before going any further, let’s clear up a few facts:

  1. The web page you are referring to is not very clear and a bit outdated. I’m surprised Alfonse didn’t correct it. :slight_smile:

  2. You should define the sharing group during GL context creation.

  3. If you have multiple contexts, make each one current in a separate thread (and don’t “pass” them to other threads, as you mention in the post).

  4. You don’t need a VAO to fill a VBO with data (the loading thread should be confined to just that). And, yes, VAOs are not shared among contexts.

  5. After all, having resource loading in a separate GL context (not thread), you probably won’t get any performance boost, since GL contexts are not really “parallel”. It is easier to implement loading monolithic, gigantic VBOs at once in a separate thread than splitting them into slices loaded in the drawing thread, but thus far my experience is that the second approach is actually faster because it doesn’t require synchronization (see the sketch below). So, multi-threading - yes, but multi-context - maybe…
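To illustrate the second approach, per-frame loading in the drawing thread can be as simple as something like this (just a sketch; the PendingUpload struct, the queue and the byte budget are only my illustration, and a GL function loader such as GLEW or glad is assumed to be initialized already):

[code]
#include <GL/glew.h>   // or any other GL function loader, assumed to be initialized
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

struct PendingUpload {            // illustrative helper type
    GLuint            vbo;        // buffer created up-front in the drawing thread
    std::size_t       offset;     // destination offset inside that buffer
    std::vector<char> data;       // one slice of the file, read by the loader thread
};

std::mutex                g_uploadMutex;
std::queue<PendingUpload> g_uploadQueue; // filled by the loader thread (file I/O only, no GL)

// Called once per frame from the drawing thread.
void pumpUploads(std::size_t byteBudgetPerFrame)
{
    std::size_t uploaded = 0;
    std::lock_guard<std::mutex> lock(g_uploadMutex);
    while (!g_uploadQueue.empty() && uploaded < byteBudgetPerFrame) {
        const PendingUpload& u = g_uploadQueue.front();
        glBindBuffer(GL_ARRAY_BUFFER, u.vbo);
        glBufferSubData(GL_ARRAY_BUFFER, (GLintptr)u.offset,
                        (GLsizeiptr)u.data.size(), u.data.data());
        uploaded += u.data.size();
        g_uploadQueue.pop();
    }
}
[/code]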

Thanks for the quick reply.
2. What do you mean by a sharing group? I just pass the main context as the “shared” parameter for the subsequent contexts, and wglShareLists is called on Windows (after all, it’s GLFW that does all the work).

  1. Maybe I didn’t explain properly: I only create the contexts in the main thread, but I make them current in their own threads.

  2. Yes, I know, but I’m creating the whole high-level object (a model, for example) and its GL objects in another thread; I thought I could get away with VAOs too, like I do in the D3D implementation.

  3. I’m implementing background loading of game resources like models and textures. The performance problem I was talking about occurs only when I create an additional context for another thread, even if I do nothing with it, and the slowdown is HUGE. Could it be because I’m creating the 2nd context with a dummy hidden window? (That’s the only way GLFW lets me create contexts.)

That’s exactly what I meant by the sharing group (passing the first context as the second parameter of wglCreateContextAttribsARB for the others). But do not use wglShareLists! That’s the implicit meaning of my first statement; it is a remnant of “ancient” times. Using wglCreateContextAttribsARB is sufficient.

That’s also correct.

This part I don’t understand. What do you call a “gl object”?

Didn’t you say you’re creating (all) contexts at the start? Or did I just assume that? :slight_smile:
You shouldn’t create contexts during application execution.
Second, you don’t need new windows to create other contexts!
A window is required only for setting the pixel format, and that should be done just once, for the drawing window. All contexts in the sharing group should have the same pixel format, so there is nothing more to do with windows. I’ve never used GLFW. Maybe you are right about that constraint, but check it again, since it is not a constraint of OpenGL itself.
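In WGL terms, what I am suggesting looks roughly like this (sketch only; it assumes the pixel format has already been set on hMainWindow, attribs is whatever attribute list you already use, and wglCreateContextAttribsARB has been fetched with wglGetProcAddress):

[code]
#include <windows.h>
#include <GL/gl.h>
#include <GL/wglext.h>   // for PFNWGLCREATECONTEXTATTRIBSARBPROC

// Assumed to be loaded elsewhere via wglGetProcAddress("wglCreateContextAttribsARB").
extern PFNWGLCREATECONTEXTATTRIBSARBPROC wglCreateContextAttribsARB;

void createContexts(HWND hMainWindow, const int* attribs, HGLRC& mainRC, HGLRC& loaderRC)
{
    HDC hDC  = GetDC(hMainWindow);                                 // pixel format was set once on this window
    mainRC   = wglCreateContextAttribsARB(hDC, nullptr, attribs);  // drawing context
    loaderRC = wglCreateContextAttribsARB(hDC, mainRC,  attribs);  // sharing group defined here
    // No wglShareLists call, no extra window.

    // Main thread:   wglMakeCurrent(hDC, mainRC);
    // Loader thread: wglMakeCurrent(GetDC(hMainWindow), loaderRC);
}
[/code]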

This part I don’t understand. What do you call a “gl object”?

GL objects are like GL textures, VBOs, etc…

Didn’t you say you’re creating (all) contexts at the start? Or did I just assume that?
You shouldn’t create contexts during application execution.

Yes, I don’t create contexts during execution; I’m not that crazy.

Second, you don’t need new windows to create other contexts!

Unfortunately, I was limited to GLFW, and GLFW doesn’t let me create contexts individually; they have to be created together with a window.
But today I forked it and followed the guideline described here to create my own custom GL contexts (no new window; I just get the DC of the main window).

I also managed to resolve the VAO issue, by creating the VAOs in the main thread.
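In case anyone else hits this, the workaround looks roughly like this (sketch; GL headers/loader assumed, and vertexData, vertexByteSize and the single attribute are just placeholders):

[code]
// Loader thread (shared context): create and fill only the buffer objects.
GLuint vbo = 0;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexByteSize, vertexData, GL_STATIC_DRAW);
glFlush(); // submit before handing the buffer ID to the main thread

// Main thread (rendering context): VAOs aren't shared, so the VAO is built here
// from the buffer ID received from the loader thread.
GLuint vao = 0;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
glBindVertexArray(0);
[/code]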
But the major problem still persists, which is the horrible performance when I’m trying to create VBOs using another context.
Here’s a shot where I disabled the whole multi-threaded loading feature:

And here’s a second shot, with multi-threaded loading (and thus another context) turned on:

Check out the ft (frame time) graph. As you can see, multi-threaded object creation doesn’t just cause spikes; it keeps slowing performance down drastically across all frames. But with the D3D implementation, and also with no background loading, it’s all good.

So I currently don’t know whether it’s the driver’s fault (has anyone experienced this with AMD drivers/GPUs?) or something I did wrong with the GL API.

There is no resource leaking and no resource recreation; since I have abstracted the API layers, all the higher-level logic is the same for D3D and GL.

  • Worker (loader) context is created in the main thread
  • Worker context is made current in the loader thread
  • glFlush() is called after loading (VBO/texture creation) in the loader thread
  • No window is created for the worker context; I just get the DC from the main window and create it with that
  • Worker context does not use any draw functions, just the usual glGen*, glBind*, glBufferData, …
  • Worker context uses the main context as the “share” parameter upon creation

It is hard to say what the problem is. Does the frame rendering time stay high even after the loading process, or just during loading?
How do you measure frame rendering time? Is it GPU or CPU time? How do you communicate between threads? You should use sync objects for context synchronization; they have been part of core OpenGL since 3.2. That will not change performance, but it should ensure the object is finished before it is accessed from the other thread/context.
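For example, something along these lines (just a sketch; bufferId and the hand-off of the (ID, fence) pair between threads are up to your own queue):

[code]
// Loader thread/context: after the buffer is filled, insert a fence.
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush(); // make sure the fence command itself reaches the GL server
// ... hand (bufferId, fence) to the rendering thread ...

// Rendering thread/context: before first use of the shared object, check the fence.
GLenum state = glClientWaitSync(fence, 0, 0); // zero timeout = just poll
if (state == GL_ALREADY_SIGNALED || state == GL_CONDITION_SATISFIED) {
    glDeleteSync(fence);
    // the buffer is complete; safe to bind/draw with bufferId
} else {
    // not finished yet: try again next frame, or use
    // glWaitSync(fence, 0, GL_TIMEOUT_IGNORED) to make the GPU wait instead of the CPU
}
[/code]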

Does the frame rendering time stay high even after the loading process, or just during loading?

Yes. I also noticed that even if I load resources in the main thread (using the main context), I get the same slowdown, just because I’ve created another context that is shared with the current one!

Is it GPU or CPU time?

CPU time

I guess the problem is with the AMD drivers!
I upgraded from Catalyst 13.9 to 13.12 today and saw a major improvement in frame rates. Currently, with 13.12, the CPU frame-time graph looks like this:
[ATTACH=CONFIG]575[/ATTACH]

It’s still not as stable as with no secondary context, but it’s far better than with the 13.9 drivers.

For comparison, this is the Direct3D result with multi-threaded loading; look how stable the FPS is:
[ATTACH=CONFIG]576[/ATTACH]

Then I hope the case is closed. :slight_smile:

I posted a similar question on the AMD developers’ forum, to see if they have an answer.
Overall I’ve had many problems with AMD’s OpenGL drivers so far; they don’t seem to be polished at all.

Personally, I’ve had bad experiences with shared contexts and came to the conclusion that they are to be avoided when possible.
They are generally not used very often by developers, so driver writers tend to overlook them, and when using shared contexts you are more likely to stumble on driver bugs.
They also carry an inherent performance penalty, because the driver is forced to use some kind of mutexes to avoid the threads stepping on each other, which is not the case in non-shared mode.
It may be better to reorganize your software so as to avoid shared contexts.

Yeah, the AMD guy said the same thing too.
It’s a bummer, but thanks

Now I’m creating the resources in the main thread and filling them in the loader thread. Here is the procedure:

[ul]
[li]Loader thread: load the file, read the buffers, and pass their data/params to a queue for the main thread[/li]
[li]Main thread: read the queue and create the pending objects; for example, a VBO is created by: glGenBuffers(1, &buff); glBindBuffer(GL_ARRAY_BUFFER, buff); glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STATIC_DRAW);[/li]
[li]Main thread: map the created buffers with glMapBufferRange(GL_ARRAY_BUFFER, 0, size, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT); and pass the mapped pointer to the loader thread[/li]
[li]Loader thread: memcpy the data into the mapped GL buffers[/li]
[li]Main thread: unmap all buffers with glUnmapBuffer(GL_ARRAY_BUFFER); and the data is ready for rendering (see the condensed sketch below)[/li]
[/ul]
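Condensed, the whole thing looks like this (sketch; the queueing/signalling between the threads is left out, GL headers/loader are assumed, and size/fileData are placeholders):

[code]
// Main thread: create the buffer and map it.
GLuint vbo = 0;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, size, nullptr, GL_STATIC_DRAW);
void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
// ... hand 'dst' and 'size' to the loader thread ...

// Loader thread: plain CPU copy, no GL calls here.
memcpy(dst, fileData, size);
// ... signal the main thread that the copy is finished ...

// Main thread: unmap before the buffer is used for rendering.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
GLboolean ok = glUnmapBuffer(GL_ARRAY_BUFFER); // may return GL_FALSE if the data store was lost
[/code]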

Now, without the extra contexts, the frame rate is stable.

But during loads I still get huge spikes in frame time, just like with the previous method.
I use mutexes and condition variables to sync things, but I don’t do any blocking in the main thread; that all happens in the loader thread, where performance doesn’t really matter to me.
The OpenGL docs recommend never creating objects in the update/render loop, which is what I’m doing now. Has anyone experimented with this?
Is this a problem with creating objects? Do I have to create all GL objects at init time and manage them in some way so there are no glGen*/glBufferData… calls at runtime?!

I’ve already been using your method for months, but I’m not sure it is REALLY thread-safe (even if it seems to work fine).
Follow my reasoning:

[ul]
[li]Main thread: glMapBuffer/glMapBufferRange…[/li]
[li]Uploading/loading thread: memcpy(…)[/li]
[li]Main thread: glUnmapBuffer()[/li]
[/ul]
While the CPU is executing memcpy(…) in another thread, the main thread continues to draw using a buffer WHILE the other thread may be changing its data… I’m not really sure that is safe.

PS: Your “blocking situation” may depend on the driver: it may block the main thread (during rendering) because glMapBuffer refers to an address range that is IN USE by the other thread during the memcpy upload… If this behaviour is confirmed, using this method is nearly the same as executing memcpy() in the main thread… :frowning:

PS2: (ref: http://www.opengl.org/sdk/docs/man/xhtml/glMapBuffer.xml) “A mapped data store must be unmapped with glUnmapBuffer before its buffer object is used”, so, in theory, binding the VBO/VAO without having unmapped the buffer could generate an error, and this could happen on every new frame until the other thread has finished executing memcpy()…

I’ve already been using your method for months, but I’m not sure it is REALLY thread-safe (even if it seems to work fine).
Follow my reasoning:

I’m using the method you described; check the previous post.

But I still get severe spikes; I suspect they come from the glGenBuffers/glBufferData calls.
Also, I’m mapping with the GL_MAP_UNSYNCHRONIZED_BIT flag, which isn’t supposed to do any blocking on the main thread.
And I think it’s thread-safe to memcpy in another thread; the AMD driver developer recommended the same thing too.

Also, on the CPU side the mentioned GL calls are fast (I checked the delta time across the whole create/map operation), because obviously those commands are queued by the driver for later processing. It seems that the stall occurs on SwapBuffers or something; I don’t have the proper tools/knowledge to detect exactly where it stalls.

I was just saying that I’ve been using the same method as you, so I know something about it and I’m not answering you with theory only :wink:

GL_MAP_UNSYNCHRONIZED_BIT is not safe without using multiple buffers (upload into buffer no. 1 while no. 2 is used for rendering, then swap them for the next operation) or a double-sized buffer (upload into the first half while the second half is used for rendering, then swap the halves), for the same reason I wrote before: you risk changing a buffer/memory region while it is in use by the GPU.
With those two methods the upload-to-GPU procedure is truly async, without problems (see the sketch below); otherwise the CPU, GPU and/or the driver has to block and execute a sync until you release the mapped buffer with glUnmapBuffer(). It is irrelevant whether the block occurs during SwapBuffers or MapBufferRange/memcpy, because in the end it is still a “synced” method on the GPU side.
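The ping-pong variant is basically this (just a sketch; srcData and size are placeholders, GL headers/loader are assumed, and the two buffers are assumed to be created and sized already):

[code]
#include <cstring>   // memcpy

GLuint buffers[2];   // two buffers per resource, already created with glBufferData
int    writeIndex = 0;

void uploadAndSwap(const void* srcData, GLsizeiptr size)
{
    // Write into buffers[writeIndex] while buffers[1 - writeIndex] is the one being rendered.
    glBindBuffer(GL_ARRAY_BUFFER, buffers[writeIndex]);
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    if (dst) {
        memcpy(dst, srcData, size);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
    // Render this frame from buffers[1 - writeIndex], then swap the roles for the next frame.
    writeIndex = 1 - writeIndex;
}
[/code]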

(sorry for my English but it isn’t my first language)

I was looking for the most asynchronous and efficient way to upload data to a GPU buffer and tried all sorts of variants I could think of. I played with the various usage and map flags.
In the end the most efficient turned out to be plainly using glBufferSubData. Any attempt to use glMapBufferRange produced worse results.
GL_MAP_UNSYNCHRONIZED_BIT, at least on NVIDIA, doesn’t seem to do what it is supposed to do. That is, their driver apparently still does nasty synchronizations that cripple the frame rate.

When I did the above “research” I was on NVIDIA, so the results may not be applicable to ATI. Now I have an ATI card too, but I’m too lazy to repeat the experiments.

I’m not sure about your last sentence, because GL_MAP_UNSYNCHRONIZED_BIT is designed to work in an environment that unmaps buffers after they’re used, while the case discussed here supposes leaving the buffer mapped “indefinitely” while the main thread (the rendering one) needs to use that buffer.

GL_MAP_UNSYNCHRONIZED_BIT is a good flag when combined with multiple PBOs (the “ping pong” method) or a double-sized PBO, as I wrote before; in that case you can execute glMapBufferRange() + memcpy() + glUnmapBuffer() in the same function without ruining the GPU’s rendering (which uses the other PBO, or just the other half of the PBO).

Are you sure a buffer can stay permanently mapped with GL_MAP_UNSYNCHRONIZED_BIT even while the GPU is using it?
I remember I was wondering about that too, but then I read the specification and it sounded like that is impossible, so I gave up on the idea.
Here is a citation of the relevant text in the 4.4 spec:
[b][i]
6.3.1 Unmapping Buffers

After the client has specified the contents of a mapped buffer range, and before the
data in that range are dereferenced by any GL commands, the mapping must be
relinquished by calling

boolean UnmapBuffer( enum target );

[/i][/b]It unconditionally says that the buffer must be unmapped; there is no exception for the GL_MAP_UNSYNCHRONIZED_BIT flag.
Have you really tried drawing from a mapped buffer and got it working?

Even if it happens to work on your driver, it’s probably not a good idea to rely on it, because it appears to go against the specification, and other drivers may not support it.

I think allowing the GPU to use a buffer mapped with GL_MAP_UNSYNCHRONIZED_BIT would be a very useful feature, well worth posting a suggestion about. :slight_smile:

I just read in the spec that there is a new flag in 4.4 to allow permanent mapping: GL_MAP_PERSISTENT_BIT.
The spec is not clear, but it sounds like there may be performance implications with this flag. For example, there is also a function, FlushMappedBufferRange, which is mentioned together with GL_MAP_PERSISTENT_BIT in one place.
I wonder why that function would be necessary, if not to work around some possible performance problems.
Anyhow, I don’t know about you, but I am currently limited to OpenGL 3.3 because I need to support older hardware, and GL_MAP_PERSISTENT_BIT is only available starting with 4.4.
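For completeness, the 4.4 path would look roughly like this, as far as I can tell from the spec (just a sketch; I can’t use it myself because of the 3.3 limit, a GL 4.4 context and function loader are assumed, and size/offset/data are placeholders):

[code]
// Sketch of ARB_buffer_storage persistent mapping (GL 4.4).
const GLsizeiptr size = 4 * 1024 * 1024; // placeholder size

GLuint vbo = 0;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);

const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_ARRAY_BUFFER, size, nullptr, flags);                  // immutable storage
void* persistentPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size, flags); // stays mapped

// Loader thread: memcpy((char*)persistentPtr + offset, data, bytes);  -- no GL calls needed.
// Render thread: wait on a glFenceSync()/glClientWaitSync() pair covering that range before drawing.
// Without GL_MAP_COHERENT_BIT you would map with GL_MAP_FLUSH_EXPLICIT_BIT instead and call
// glFlushMappedBufferRange() after each write -- presumably what FlushMappedBufferRange is for.
[/code]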

Another consideration: if it were really possible (for the NVIDIA driver) to let the CPU access the same memory from which the GPU is currently drawing, then there shouldn’t be any synchronization when you map a buffer with GL_MAP_UNSYNCHRONIZED_BIT and then unmap it before drawing from it. But judging by the frame rate, there are synchronizations for sure (as I mentioned earlier).

If it is not possible for the driver to do that, then GL_MAP_PERSISTENT_BIT will be emulated by memcpy-ing on the CPU from the mapped buffer any time you issue a drawing operation that reads data from that buffer, which cannot be any better than plain glBufferSubData (but can be worse). The mapped buffer itself will then be just plain system memory.