To Map() or not to Map()?



andras
07-02-2005, 02:52 PM
I know that this question has come up many times (mostly under different topics, though), and I know that most people say to use BufferData() instead of Map(), but I haven't been convinced yet, and still cannot see why that approach would be better/faster.

Obviously, the main reason for Map() is to write dynamic data straight into host memory. If it weren't dynamic, I would upload it only once as a static buffer, and then it wouldn't really matter whether I used Map() or BufferData().

So, dynamic data means it's created on the fly. Using BufferData(), I first have to create my data (which might be huge) in local memory, then call BufferData(), which will block until it has copied everything over to host memory. On the other hand, Map() lets me write straight into host memory, skipping the expensive copy altogether. Sure, if I try to Map() a buffer that is currently in use by the GPU, then it blocks, whereas BufferData() could just find some free space somewhere else and start copying right away! But that's why one should double-buffer the VBs (or even keep n buffers in a ring); this way, waiting for the GPU can be eliminated.

Creating this ring of buffers brings up another question. I've been told not to pre-allocate buffers, but to use BufferData() with NULL instead, so the driver can allocate space without copying anything. This way I could have the best of both worlds: Map() gives me direct access, and BufferData() eliminates blocking without the need to manage a ring of buffers.

While this is all beautiful in theory, I've found that manually switching between preallocated buffers is much faster than asking BufferData() for a new buffer every time. I guess this is similar to calling "new" in C++ every time instead of using preallocated buffers.
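
To be concrete, here's a minimal sketch of the ring idea (the ring size, usage hint and function names are placeholders I made up, and it assumes the GL 1.5 VBO entry points are already loaded):

#include <GL/gl.h>
#include <GL/glext.h>   /* GL_ARRAY_BUFFER, GLsizeiptr, etc. */

#define RING_SIZE 3     /* placeholder; 2 already hides most stalls */

static GLuint ring[RING_SIZE];
static int current = 0;

void initRing(GLsizeiptr bytes)
{
    int i;
    glGenBuffers(RING_SIZE, ring);
    for (i = 0; i < RING_SIZE; ++i) {
        glBindBuffer(GL_ARRAY_BUFFER, ring[i]);
        /* allocate storage once, up front; NULL means "no initial data" */
        glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);
    }
}

/* Each frame: rotate to the least recently used buffer and map it.
   With enough buffers in the ring, the GPU is already done with it,
   so the Map() shouldn't stall. */
void* beginWrite(void)
{
    current = (current + 1) % RING_SIZE;
    glBindBuffer(GL_ARRAY_BUFFER, ring[current]);
    return glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
}

void endWrite(void)
{
    glUnmapBuffer(GL_ARRAY_BUFFER);
}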

Please share your thoughts on this subject!

Roderic (Ingenu)
07-03-2005, 12:18 AM
Not sure what you mean by "pre-allocate"; calling BufferData with NULL does reserve memory.

AFAIR, the procedure to follow to Map() without too many problems is to first call BufferData( NULL ), then call Map( WRITE_ONLY ), so that the driver knows you don't care about what's in the buffer and will overwrite everything.
It should then behave like a BufferData call, i.e. it doesn't block, and it uses another memory area should the buffer be 'busy'.

Just checked my docs, here's something you might want to read (nVidia):
http://www3.uji.es/~jromero/documentos/Using-VBOs.pdf

Pages 10 and onward cover the different functions and their use.

glBufferData seems to be the only one that avoids syncing, the only alternative being BufferData( NULL ) + Map( WRITE_ONLY ), as I said earlier.
[That behaviour should be vendor-agnostic, not sure though.]
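
In code, the idiom would look something like this (just a sketch; the buffer name, size and usage hint are placeholders):

glBindBuffer(GL_ARRAY_BUFFER, vbo);
/* Re-specify the store with a NULL pointer: the driver can hand back
   fresh memory instead of waiting for the GPU to release the old
   contents. */
glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);
dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
/* ... write the new vertex data through dst ... */
glUnmapBuffer(GL_ARRAY_BUFFER);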

-BufferData() is the vendor-independent preferred way of updating a VBO.
-BufferSubData() syncs on nVidia GPUs, but not on ATI GPUs.
-Map() syncs regardless of vendor.

knackered
07-03-2005, 05:04 AM
For static data on nvidia & 3dlabs cards use display lists - VBO gives nowhere near the performance of display lists on static data.
Also, it is *still* the case that you get a significant increase in draw speed if you compile the display lists using immediate mode.
I've given up asking why, but it's probably due to the fact that most people benchmark on old apps which use display lists, and vendors are only interested in looking good in benchmarks.

andras
07-03-2005, 06:27 AM
Not sure what you mean by "pre-allocate"; calling BufferData with NULL does reserve memory.
What I mean by pre-allocating is actually creating multiple VBOs during initialization and then just switching between them. It has the same effect as calling BufferData(NULL), but it's much faster.


AFAIR, the procedure to follow to Map() without too many problems is to first call BufferData( NULL ), then call Map( WRITE_ONLY ), so that the driver knows you don't care about what's in the buffer and will overwrite everything.
It should then behave like a BufferData call, i.e. it doesn't block, and it uses another memory area should the buffer be 'busy'.
This is exactly what I said, but last time I checked it was a lot slower than the switching I've just described. I'm guessing it's because of the complex memory management the driver has to do every time I ask for some space.

Jan
07-04-2005, 06:28 AM
Hm, interesting that preallocated buffers are faster. I just tried adding glBufferData(NULL) to my particle system and it doubled the framerate. I can't imagine that I can gain much by preallocating the buffers, since the memory management shouldn't be processing-intensive; the real slowdown comes when CPU and GPU have to synchronize because you want to write to mapped memory.

Also, it should be quite a complex task to manage preallocated buffers in a way that doesn't waste memory. If I have two buffers and simply switch between them, I use twice the memory. With glBufferData(NULL) I only use twice the memory as long as one buffer is still being rendered from. So if everything dynamic I do uses twice the memory it actually needs, then I waste a lot of my precious VRAM.

Therefore I wouldn't do that; I'd simply rely on the driver to do that management for me, which is in fact the whole purpose of VBO. If we do our own memory management anyway, we could have stuck with VAR.

Jan.

jide
07-04-2005, 07:45 AM
Originally posted by knackered:
For static data on nvidia & 3dlabs cards use display lists - VBO gives nowhere near the performance of display lists on static data.
What do you mean here? Should we put all static geometry into display lists on nVidia and 3Dlabs cards, but keep it in VBOs on other cards? Is that really a good thing to do?

andras
07-05-2005, 06:06 AM
Hm, interesting that preallocated buffers are faster. I just tried adding glBufferData(NULL) to my particle system and it doubled the framerate. I can't imagine that I can gain much by preallocating the buffers, since the memory management shouldn't be processing-intensive; the real slowdown comes when CPU and GPU have to synchronize because you want to write to mapped memory.
Yes, you are correct (in theory). But I've just redone the test with the latest 77.72 drivers on my GF6600, and while I get 200+ FPS with VB double buffering, I get ~120 FPS when requesting new memory every time by calling BufferData(NULL).


Also, it should be quite a complex task to manage preallocated buffers in a way that doesn't waste memory. If I have two buffers and simply switch between them, I use twice the memory. With glBufferData(NULL) I only use twice the memory as long as one buffer is still being rendered from. So if everything dynamic I do uses twice the memory it actually needs, then I waste a lot of my precious VRAM.
Again, you are correct about the wasted memory, although even when you request fresh memory every time, you can't tell the exact amount used, only an upper bound.


Therefore I wouldn't do that; I'd simply rely on the driver to do that management for me, which is in fact the whole purpose of VBO. If we do our own memory management anyway, we could have stuck with VAR.
Well, I would love to have a VB that works fast the way we are supposed to use it, but the hard fact remains that you need to manage things yourself, and VAR is way more powerful in this sense.

Trust me: I really wish I didn't have to deal with managing the buffers myself!! I would love to do it the right way, and trust the drivers to do a good job!!

Andras

Roderic (Ingenu)
07-05-2005, 07:02 AM
Have you tried BufferSubData/BufferData with a non-NULL pointer?

andras
07-05-2005, 07:36 AM
Have you tried BufferSubData/BufferData with a non-NULL pointer?
Nope. BufferSubData stalls by definition, and both need a separate buffer in client memory, which could be pretty big (if that's not a waste of memory, then what is??), and there's also a full stalling copy involved. I cannot see how this could be faster/more efficient.

Roderic (Ingenu)
07-05-2005, 11:17 PM
Originally posted by andras:

Have you tried BufferSubData/BufferData with a non-NULL pointer?
Nope. BufferSubData stalls by definition, and both need a separate buffer in client memory, which could be pretty big (if that's not a waste of memory, then what is??), and there's also a full stalling copy involved. I cannot see how this could be faster/more efficient.
Well, you may use a 'cache' in RAM to avoid creating/deleting memory. You would then fill it up when needed and send it to the VBO using BufferData.
That way you wouldn't be "wasting" much RAM, and you would end up using the preferred IHV method of updating VBOs...
(That means that you replace the whole buffer of course, you can't just update a part of it.)
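
Something like this (a sketch only; the cache size and names are placeholders):

#include <GL/gl.h>
#include <GL/glext.h>

#define CACHE_FLOATS (65536 * 3)      /* placeholder size */
static float cache[CACHE_FLOATS];     /* reused every frame, never freed */

void submit(GLuint vbo, GLsizeiptr usedBytes)
{
    /* ... fill 'cache' with this frame's vertex data ... */
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* Replace the whole buffer in one call; the driver copies from a
       client buffer we never reallocate. */
    glBufferData(GL_ARRAY_BUFFER, usedBytes, cache, GL_STREAM_DRAW);
}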

knackered
07-05-2005, 11:38 PM
Originally posted by jide:

Originally posted by knackered:
For static data on nvidia & 3dlabs cards use display lists - VBO gives nowhere near the performance of display lists on static data.
What do you mean here? Should we put all static geometry into display lists on nVidia and 3Dlabs cards, but keep it in VBOs on other cards? Is that really a good thing to do?
Hey, don't shoot the messenger.

macarter
07-07-2005, 11:35 AM
Originally posted by andras:
Please share your thoughts on this subject!
Map()-and-copy is 15% faster than BufferSubData() in my benchmarks, using Catalyst 5.1 on an X800 XT.

Map() provides a pointer that can be used to fill a VBO in a thread lacking a GL context. I use a high priority thread to draw and a low priority thread to fill the VBO. I see this as a great advantage that becomes even greater on a multicore CPU.
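
In outline, the pattern looks something like this (a simplified sketch using pthreads; the names are placeholders, and unlike my real setup it joins the fill thread immediately instead of keeping a persistent low-priority thread and rendering from another buffer in the meantime):

#include <pthread.h>
#include <GL/gl.h>
#include <GL/glext.h>

static void* mappedPtr = NULL;

/* The fill thread needs no GL context; it only writes through the pointer. */
static void* fillProc(void* arg)
{
    float* v = (float*)mappedPtr;
    /* ... generate vertex data into v ... */
    (void)v;
    return NULL;
}

/* All GL calls stay on the GL thread: map, let the worker fill, unmap. */
void updateVBO(GLuint vbo, GLsizeiptr bytes)
{
    pthread_t filler;

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);
    mappedPtr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);

    pthread_create(&filler, NULL, fillProc, NULL);
    /* A real app would render from another buffer here; the immediate
       join only keeps the sketch short. */
    pthread_join(filler, NULL);

    glUnmapBuffer(GL_ARRAY_BUFFER);
}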

andras
07-07-2005, 02:34 PM
Map() provides a pointer that can be used to fill a VBO in a thread lacking a GL context. I use a high priority thread to draw and a low priority thread to fill the VBO. I see this as a great advantage that becomes even greater on a multicore CPU.
That's _exactly_ what I do! I believe Map() is fundamentally more powerful than BufferData & friends; the only problem is that I have to double-buffer manually to get the best performance out of it.

Again, could some driver guy look into why BufferData(NULL) seems to be much slower than just switching between preallocated buffer objects?? It would make life so much easier if we could just use that!! Thanks!

tamlin
07-07-2005, 05:33 PM
Just a stab in the dark here, but could it make a performance difference to disable the client state before calling BufferData(NULL)?

andras
07-08-2005, 07:07 AM
I don't know, I'll have to try that! But this brings me to another interesting question. I was re-reading this nVidia document (http://developer.nvidia.com/object/using_VBOs.html) on VBO usage, and there is a section called "Avoid calling glVertexPointer() more than once per VBO", where they say that all the actual setup happens in the glVertexPointer call! Now how exactly does this work in the shader era, when we have to bind attribute arrays to locations? For example, I have lots of different shaders, each shader has multiple attributes, and different attributes are stored in different VBOs. So for each attribute, I have to bind the corresponding VBO and then call VertexAttribPointer(location, ...) to attach the buffer to a location. And I'll have to do this every time I change shaders, right? And of course every time I request new memory with glBufferData(NULL)! Or am I missing something? I have to admit I feel a bit lost here. If someone could shed some light on how this works, it would be really appreciated! Thanks!

zeckensack
07-08-2005, 08:08 AM
Originally posted by macarter:
I use a high priority thread to draw and a low priority thread to fill the VBO.
Just some friendly advice, learned the hard way: you shouldn't use thread priorities at all on Win32.
Quick summary: a high priority thread cannot sleep. It will only yield the CPU when it waits on a waitable object, or when another thread of equally high priority is ready to run.

The result is:
1) disastrous performance on single-core machines, if you rely on Sleep, SwitchToThread, Yield, whatever;
2) nothing on multi-core or "HyperThreading" CPUs. You neither lose nor gain anything by fiddling with priorities.

Just don't do it. Please.

andras
07-08-2005, 11:06 AM
Just some friendly advice, learned the hard way: you shouldn't use thread priorities at all on Win32.
Quick summary: a high priority thread cannot sleep. It will only yield the CPU when it waits on a waitable object, or when another thread of equally high priority is ready to run.
Hmm, I dunno, it works like a charm here.. Our main thread doesn't use a lot of CPU (we make the GPU sweat instead ;P), but it has to be super responsive! So actually, our idle thread uses 90% of the CPU, it's kinda funny.. :)

Dirk
07-08-2005, 01:24 PM
Originally posted by macarter:
Map() provides a pointer that can be used to fill a VBO in a thread lacking a GL context. I use a high priority thread to draw and a low priority thread to fill the VBO. I see this as a great advantage that becomes even greater on a multicore CPU.
Is that well-defined behaviour or just blind luck? Can we depend on that being the case? I would have imagined that AGP-mapped memory would be thread-specific, which would break this behaviour. It doesn't seem to be the case, but is that true on other OSes, too?

I've learned to be very careful about OpenGL and multiple threads, so before I make my design depend on it, I'd really like to have a serious answer, as I couldn't find anything about it in the spec.

Thanks

Dirk

macarter
07-11-2005, 05:34 AM
Originally posted by dirk:

Originally posted by macarter:
Map() provides a pointer that can be used to fill a VBO in a thread lacking a GL context. I use a high priority thread to draw and a low priority thread to fill the VBO. I see this as a great advantage that becomes even greater on a multicore CPU.
Is that well-defined behaviour or just blind luck? Can we depend on that being the case? ...
Threads by definition share memory mappings.

zeckensack
07-11-2005, 06:38 PM
Originally posted by andras:
Hmm, I dunno, it works like a charm here..
Okay, I'm going out on a limb here, but...
a) you're working on a HyperThreaded P4, and
b) you'll get the exact same performance with default priorities anyway.

Our main thread doesn't use a lot of CPU (we make the GPU sweat instead ;P), but it has to be super responsive!
And you're assuming that assigning it a high priority will make your thread "super responsive"?
Well, yes, in some twisted way it will do that. A high priority thread will starve all other threads, it will basically run all the time unless it Waits on some object or its message queue.
But then again, if a thread waits on an object, it will be resumed immediately anyway, as long as it has the same priority as all other currently ready threads.

So there's your responsiveness. Giving a thread high priority will not increase its responsiveness. It will instead make all lower-priority threads unresponsive.

Please try your software at least once with HT disabled. I'm sure you'll see what I mean.

So actually, our idle thread uses 90% of the CPU, it's kinda funny.. :)
Please don't tell me you've written your own idle thread :eek:

yooyo
07-12-2005, 04:45 AM
All GL stuff should stay in one thread. Mapping buffers and passing pointers to another thread may not be safe because of driver changes.

It is much better to use asynchronous data transfer via PBO or VBO. When you design an app that has to do streaming work on the GPU, you must sync the GPU and CPU to achieve the best performance.

It is OK to use threads, but one (render) thread should be dedicated to OpenGL, and all the others should serve data to that render thread.

On recent hardware (P4 3.06GHz/HT + PCI-E nVidia 6600GT) I can play up to 6 full-PAL (720x576) MPEG2 videos with ~90% CPU usage. That is ~9.4MB per frame. One rendering thread and 6 DirectShow filter graphs with a bunch of filters inside (and each filter runs in its own thread).
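
For example, streaming one decoded frame through a PBO looks roughly like this (just a sketch; the texture format, sizes and names are placeholders, and it assumes the ARB_pixel_buffer_object entry points are loaded):

#include <string.h>
#include <GL/gl.h>
#include <GL/glext.h>

/* Upload one CPU-decoded video frame through a pixel buffer object. */
void uploadFrame(GLuint pbo, GLuint tex, const void* decodedFrame, GLsizeiptr frameBytes)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
    /* Orphan the old store, then copy the decoded frame in. */
    glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, frameBytes, NULL, GL_STREAM_DRAW);
    memcpy(glMapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY),
           decodedFrame, frameBytes);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);

    /* With a PBO bound, the pointer argument is an offset into the
       buffer, and the transfer can run asynchronously. */
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 720, 576,
                    GL_BGRA, GL_UNSIGNED_BYTE, (const GLvoid*)0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
}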

yooyo

andras
07-12-2005, 05:13 AM
a) you're working on a HyperThreaded P4, and
Nope, I'm developing on an Athlon. There is no HT...


So actually, our idle thread uses 90% of the CPU, it's kinda funny.. :)
Please don't tell me you've written your own idle thread :eek:
I meant our idle-priority worker thread. It's the thread that does all the computing, but it's really much more important to run the UI thread instead. The worker thread can wait until there's nothing else to do.
Anyway, I'll try without changing priorities and see what happens (as soon as I get my stuff compiling again :-P)

andras
07-12-2005, 05:22 AM
All GL stuff should stay in one thread. Mapping buffers and passing pointers to another thread may not be safe because of driver changes.
This is not true. Read the spec (http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_buffer_object.txt):
"The expectation is that an application might map a buffer and start filling it in a different thread, but continue to render in its main thread (using a different buffer or no buffer at all)."

tamlin
07-12-2005, 08:06 AM
yooyo,
That's just horrible. That's horribly slow performance given your h/w. Is that for real, or did you miss an order of magnitude somewhere?

yooyo
07-12-2005, 09:05 AM
That's just horrible. That's horribly slow performance given your h/w. Is that for real, or did you miss an order of magnitude somewhere?
Horrible? Why? I said 6 MPEG2 streams (full PAL resolution), each one decoded on the CPU side. Each stream takes ~10-15% CPU time just for decoding.

Maybe you didn't understand me. Using PBO it is possible to stream ~1.8GB/sec. In my case the 6 streams together are ~9.4MB per frame.

yooyo

tamlin
07-12-2005, 09:35 AM
yooyo,
I'm sorry. I didn't realize you saturated the CPU. My bad.

andras
07-13-2005, 05:40 AM
Hey, I thought I'd re-post this, as it seems like it got lost in the noise (either that, or nobody knows the answer :) ). Anyway, here goes again; hope I'll have more luck with it this time.

Originally posted by andras:
So, I was re-reading this nVidia document (http://developer.nvidia.com/object/using_VBOs.html) on VBO usage, and there is a section called "Avoid calling glVertexPointer() more than once per VBO", where they say that all the actual setup happens in the glVertexPointer call! Now how exactly does this work in the shader era, when we have to bind attribute arrays to locations? For example, I have lots of different shaders, each shader has multiple attributes, and different attributes are stored in different VBOs. So for each attribute, I have to bind the corresponding VBO and then call VertexAttribPointer(location, ...) to attach the buffer to a location. And I'll have to do this every time I change shaders, right? And of course every time I request new memory with glBufferData(NULL)! Or am I missing something? I have to admit I feel a bit lost here. If someone could shed some light on how this works, it would be really appreciated! Thanks!

yooyo
07-13-2005, 08:43 AM
@andras:

In one of your VBOs you have vertex positions stored. When you bind that VBO, just call glVertexPointer once. The same goes for all the other glXXXPointer functions. So...
1. activate the VBO
2. just set up the pointers
3. set up the vertex pointer last
4. use glEnableClientState / glDisableClientState / glEnableVertexAttribArrayARB / glDisableVertexAttribArrayARB calls to enable or disable attributes (this is cheap)
5. draw your geometry using glDrawElements(Arrays) / glDrawRangeElements / glMultiDrawElements(Arrays)
6. in GLSL, use glBindAttribLocationARB before linking and unify your attribute locations according to your VBO config

If I'm wrong, let somebody correct me. A sketch of what I mean is below.
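
With generic attributes it might look like this (a sketch only; the attribute names, locations and the one-VBO-per-attribute layout are placeholders, and it assumes the ARB shader/VBO entry points are loaded):

#include <GL/gl.h>
#include <GL/glext.h>

#define ATTR_POSITION 0
#define ATTR_NORMAL   1

void drawObject(GLuint posVBO, GLuint nrmVBO, GLuint indexVBO, GLsizei indexCount)
{
    /* one VBO per attribute: bind, then set the pointer once */
    glBindBuffer(GL_ARRAY_BUFFER, nrmVBO);
    glVertexAttribPointerARB(ATTR_NORMAL, 3, GL_FLOAT, GL_FALSE, 0, (const GLvoid*)0);

    glBindBuffer(GL_ARRAY_BUFFER, posVBO);
    glVertexAttribPointerARB(ATTR_POSITION, 3, GL_FLOAT, GL_FALSE, 0, (const GLvoid*)0);

    glEnableVertexAttribArrayARB(ATTR_POSITION);
    glEnableVertexAttribArrayARB(ATTR_NORMAL);

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexVBO);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, (const GLvoid*)0);

    glDisableVertexAttribArrayARB(ATTR_POSITION);
    glDisableVertexAttribArrayARB(ATTR_NORMAL);
}

/* At program setup, before linking, fix the locations so every shader
   agrees with the same VBO config:
   glBindAttribLocationARB(prog, ATTR_POSITION, "position");
   glBindAttribLocationARB(prog, ATTR_NORMAL,   "normal");   */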

yooyo

andras
07-13-2005, 11:22 AM
Well, this is what I'm doing, except that I never call glVertexPointer, as I only use custom vertex attributes (yes, even for position).. But I guess this should be OK.

Dirk
07-14-2005, 04:25 PM
Originally posted by andras:
This is not true. Read the spec (http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_buffer_object.txt):
"The expectation is that an application might map a buffer and start filling it in a different thread, but continue to render in its main thread (using a different buffer or no buffer at all)."
Hey, that's a good find! Now we just need the driver writers to confirm that this is actually supported, and I can start redesigning my stuff!

nVidia, ATI: Do your drivers support this behavior? On Linux, too? ;)

T101
07-14-2005, 10:54 PM
Maybe I misunderstand here (I haven't used VBOs yet), but a mapped VBO is just filled by ordinary assignments, right? Not by gl functions.

So if a buffer is just CPU-accessible, and provided that you don't use the buffer for rendering while you're still filling it (something that is up to your own program, typically prevented by means of a mutex), then there is absolutely no reason why this shouldn't work on all operating systems capable of multithreading.

After all, that's the difference between a thread and a process: a process has its own memory, open files etc. All threads of the same process share those resources. (And a GL context can only be active in one process)

Dirk
07-17-2005, 05:12 PM
Originally posted by T101:
Maybe I misunderstand here (I haven't used VBOs yet), but a VBO is just filled by using ordinary assignments, right? Not by gl functions.
Yup.


So if a buffer is just CPU-accessible, and provided that you don't use the buffer for rendering while you're still filling it (something that is up to your own program, typically prevented by means of a mutex), then there is absolutely no reason why this shouldn't work on all operating systems capable of multithreading.
That's not something I would bet the house on (not that I own one). Given that VBOs can live in AGP or graphics card memory, I would really like to get some confirmation that they are consistent and accessible across all threads.



After all, that's the difference between a thread and a process: a process has its own memory, open files, etc. All threads of the same process share those resources. (And a GL context can only be active in one process.)
I'm pretty sure a GL context can only be active in one thread. I've written programs that use multiple threads, where each thread has a different OpenGL context, so it's certainly not one context per process.

Any of the driver guys want to comment? Even saying "We don't know yet, probably not" would be helpful ("We do know, and yes" would be more helpful, but hey, I take what I can get ;) ).

andras
07-17-2005, 05:26 PM
Life is dangerous, you have to take some risks! ;) Live on the edge!! Go ahead and use it! :D

Won
07-18-2005, 06:15 AM
But a mapped VBO is just a pointer, and hence can be shared across threads. You would need some way for the memory thread (which doesn't need its own rendering context) to communicate to the rendering thread that it is done (CPU-CPU synchronization is your own problem) so the render thread can call unmap.

This could potentially speed things up (or make things more interactive, at least) if you need to process/decompress data, assuming that switching thread contexts isn't too slow. I have not confirmed this, but I heard that NV contexts implicitly flush on thread switches.

-W

Dirk
07-18-2005, 06:44 PM
Originally posted by andras:
Life is dangerous, you have to take some risks! ;) Live on the edge!! Go ahead and use it! :D
With the current state of drivers I'm always on the edge and take all the risks I can handle. ;)

I'm currently redesigning the Geometry part of my Open Source scenegraph ( OpenSG (http://www.opensg.org) ), and if that doesn't work reliably for all my users my life will be pretty miserable. I'd really like to be sure before committing to it... :)

Dirk
07-18-2005, 06:49 PM
Originally posted by Won:
But a mapped VBO is just a pointer, and hence can be shared across threads.
Given that the pointer can point to AGP or graphics card memory, I'm not so sure about this.


You would need some way for the memory thread (which doesn't need its own rendering context) to communicate to the rendering thread that it is done (CPU-CPU synchronization is your own problem) so the render thread can call unmap.
Yeah, we have the CPU part pretty well and flexibly covered; it's the graphics side that I'm working on.


This could potentially speed things up (or make things more interactive, at least) if you need to process/decompress data, assuming that switching thread contexts isn't too slow. I have not confirmed this, but I heard that NV contexts implicitly flush on thread switches.
Hm, what about true multi-processor (or multi-core) systems? There is no thread switch there; multiple threads are physically running at the same time, so there is no way to flush anything.

andras
07-25-2005, 06:00 AM
Our main thread doesn't use a lot of CPU (we make the GPU sweat instead ;P), but it has to be super responsive!
And you're assuming that assigning it a high priority will make your thread "super responsive"?
Well, yes, in some twisted way it will do that. A high priority thread will starve all other threads, it will basically run all the time unless it Waits on some object or its message queue.
But then again, if a thread waits on an object, it will be resumed immediately anyway, as long as it has the same priority as all other currently ready threads.

So there's your responsiveness. Giving a thread high priority will not increase its responsiveness. It will instead make all lower-priority threads unresponsive.
Sorry, it took me a long time to test, but I finally tried setting the priority back to normal, and it makes the framerate much more uneven; there are sudden spikes, which makes the overall feeling pretty bad. If I set this thread to idle, there is no noticeable performance loss, but everything becomes a lot smoother. So do you have any idea why that is, or how I could make it smooth with normal priority?? :confused: