Lock API



Alfonse Reinheart
09-27-2009, 03:36 PM
I went over some of this suggestion in another thread, but I wanted to clarify some of the points and create a new discussion rather than derail that thread.

Current Performance Problems in OpenGL

It can be inferred from NVIDIA's work on the bindless graphics API that OpenGL has a number of basic inefficiencies in its vertex specification pipeline, which create a large number of client-side memory accesses for each draw call.

The purpose of this proposal is to solve these problems without resorting to low-level hackery as in the bindless graphics extensions.

Origin of the Problem

Not being an NVIDIA driver developer, I can only speculate as to the ultimate source of the client memory accesses. This analysis may well be wrong, and thus lead to a wrong conclusion.

The absolute most optimal case for any rendering command is this: add one or more command tokens to the graphics FIFO (whether the actual GPU FIFO or an internal marshalling FIFO). This is the bare minimum of work necessary to actually provoke rendering.

The first question is this: what is in these tokens?

The implementation must communicate the state information of the currently bound VAO. Which vertex attributes are enabled/disabled, what buffer objects+offsets+stride they each use, etc. Basically, the VAO state block.

However, from the GPU's perspective, some of that state block has to take a different form, specifically the parts that refer to buffer objects. All the GPU cares about is getting a pointer, whether into video memory, "AGP" memory, or whatever else it can access.

The VAO stores a buffer object name, not a GPU address. This is important for two reasons. One, buffer object storage can be moved around by the implementation at will. Two, buffer object storage can be reallocated by the application. If you have a VAO that uses buffer object name "40", and you call glBufferData on it, the VAO must use the new storage from that moment onward.

Reason #2 is a really annoying problem. Because buffer objects can be reallocated by the user, a VAO could not contain GPU pointers even if the implementation weren't free to move them around.
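
To make the reallocation problem concrete, here is a minimal fragment (assuming vao, buf, newData, newSize and vertexCount already exist; this is just an illustration, not part of the proposal):

//vao captures attribute state that sources from buffer object name buf.
glBindBuffer(GL_ARRAY_BUFFER, buf);
glBufferData(GL_ARRAY_BUFFER, newSize, newData, GL_STATIC_DRAW); //storage is reallocated

//The VAO refers to the *name* buf, so this draw must use the new storage,
//wherever the implementation happened to put it.
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);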

This means that, in order to generate the previously-mentioned tokens, the implementation must perform the following:

1: Convert the buffer object name into a pointer to an internal object.

2: Query that object for the GPU address.

3: If there is no GPU address yet... Here be dragons!

The unknown portion of step 3 is also a big issue. Obviously implementations must deal with this eventuality, but exactly how they go about it is beyond informed speculation. Whatever the process is, one thing is certain: it will involve more client-side memory access.

Here is the thing: if an implementation could know, with absolute certainty, that the GPU addresses of all of a VAO's buffer objects would not change, then the implementation could optimize things. The VAO's state block could be boiled down into a small block of prebuilt tokens that would be copied directly into the FIFO. Even in this case, you still need to:

1: Convert the VAO name into a pointer (generally expected to be done when the VAO is bound).

2: Copy the FIFO data into the command stream.

The second part requires some client-memory access. But it's the absolute bare minimum (without going to full "let the shader read from arbitrary memory" stuff).

How to do This

The bottlenecks of client-side memory access have been identified. So how do we solve this?

We provide the ability to lock VAOs.

When a VAO is locked, this relieves the OpenGL implementation from certain responsibilities. First, a locked VAO is immutable; the implementation no longer has to concern itself with changing things at the user's whim. A locked VAO that is deleted will continue to exist until it is unlocked.

Second, all buffer objects attached to that VAO at the time of locking are themselves locked. Any attempt to call glBufferData or any other function that gives the implementation the right to change the buffer object's storage will fail so long as that buffer object is attached to a locked VAO. Multiple VAOs can lock multiple buffer objects.

Implicitly locking buffer objects also has the effect of providing a strong hint to the implementation. Unlike the bindless graphics ability to make buffer objects resident, it does not force the implementation to fix the object in memory. But it does strongly suggest to the implementation that this buffer object will be in frequent use, and that it should take whatever measures it needs to in order to keep rendering with this data as fast as possible.

To help separate locked VAOs from unlocked ones, the locking function should return a "pointer" (64-bit integer). It is illegal to bind a locked VAO at all; instead, you must bind the pointer with a special bind call (that automatically disables the standard bind point).
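
As a rough sketch in code (every entry point and type here is hypothetical, since this is only a proposal):

//Create the VAO and attach its VBOs as usual, then lock it.
GLlockedobj pVAO = glLockObject(GL_VERTEX_ARRAY, vaoObjName); //hypothetical: the VAO and its buffers become immutable

//The plain name can no longer be bound; bind the locked handle instead.
glBindLockedObject(GL_VERTEX_ARRAY, pVAO); //hypothetical special bind call
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);

//Later, when the object no longer needs to be render-ready every frame:
glUnlockObject(GL_VERTEX_ARRAY, pVAO); //hypothetical: the buffers become respecifiable again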

Comparison to Bindless

This suggestion cannot achieve 100% of the performance advantage of the full bindless API (that is, just giving vertex shaders a few pointers and having them work). However, it should be able to remove enough of the overhead to achieve performance parity with GL_NV_vertex_buffer_unified_memory.

Speaking of which, GL_NV_vertex_buffer_unified_memory tackles this issue in a different way. It uses the bindless shader_load API to allow you to bind bare pointers rather than buffer objects. This in turn relies on making buffer objects resident, which gives them a guaranteed GPU address.

This is an interesting idea, but it relies on a lot of manual management. You have to make specific buffer objects resident, and you have to keep track yourself of why each of them is resident. It also requires exposing the concept of a "GPU address" and so forth.
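
For reference, the manual-management pattern being described looks roughly like this (reconstructed from the NV extension specs, so treat the details as approximate; buf, bufSize and vertexCount are placeholders):

//Pin the buffer and query its GPU address (NV_shader_buffer_load).
glBindBuffer(GL_ARRAY_BUFFER, buf);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
GLuint64EXT addr;
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);

//Render from the bare address (NV_vertex_buffer_unified_memory).
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableVertexAttribArray(0);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat));
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, bufSize);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);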

This proposal is much more OpenGL-like. It keeps the low-level details hidden while allowing the implementation to make optimizations where appropriate. It is much safer as well; there are a number of pitfalls with GL_NV_vertex_buffer_unified_memory (like rendering after you made a buffer non-resident, etc.) that this API can easily catch.

It is a targeted solution to a specific problem.

mfort
09-28-2009, 02:05 AM
Good, we are moving somewhere.
Let's discuss what would happen if the locks are implicit.

Fictional(?) driver:
When a new VAO is created, it locks all the buffers and bakes all the information it needs, including GPU addresses. So there is no complicated verification/resolution during rendering. By locking I mean flagging the buffer as being used inside a VAO and storing a back-pointer to that VAO.

Questions:
Q: What if someone deletes buffer used in VAO?
A: No worries, the buffers use reference counters. A buffer ceases to exist when its last user is destroyed.

Q: What if someone modifies buffer used in VAO?
A1: Not allowed. Error is reported. (too strict IMHO)
A2: Driver makes copy-on-write. The value inside VAO is not changed. New VAO must be created to use the new data. (the opposite to the current OpenGL) (too strict IMHO)
A3: The VAO is updated, including all the baked GPU addresses. This requires having back-pointers from buffer objects to all the VAOs using them.

Q: What if the buffer is not filled with data at the time the VAO is created?
A1: An error is reported.
A2: The buffer is not used until data are available; see A3 above.

Q: Do we need to change OpenGL API/spec?
A: No, it is all done under the hood inside the driver.

Q: What is the benefit?
A: It allows the "create once, use many times" usage pattern.

This way the driver moves the CPU load from usage time to creation time. This would work when one buffer is bound to a limited number of VAOs; otherwise the buffer update would be too costly.
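
A minimal sketch of the bookkeeping this implies, using entirely made-up driver-internal structures (nothing here is real OpenGL API):

enum { MAX_ATTRIBS = 16, MAX_USERS = 8 };

typedef struct VAO {
    unsigned long long bakedAddress[MAX_ATTRIBS]; //prebuilt GPU addresses, one per attribute
    struct BufferObject *source[MAX_ATTRIBS];     //which buffer each attribute reads from
} VAO;

typedef struct BufferObject {
    unsigned long long gpuAddress;
    VAO *users[MAX_USERS]; //back-pointers: every VAO that baked this buffer's address
    int userCount;
} BufferObject;

//Called whenever the driver moves or reallocates the buffer's storage.
static void buffer_moved(BufferObject *buf, unsigned long long newAddress)
{
    buf->gpuAddress = newAddress;
    for (int i = 0; i < buf->userCount; ++i) {
        VAO *vao = buf->users[i];
        for (int a = 0; a < MAX_ATTRIBS; ++a)
            if (vao->source[a] == buf)
                vao->bakedAddress[a] = newAddress; //re-bake the stale address
    }
}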

Alfonse Reinheart
09-28-2009, 02:42 AM
Let's discuss what would happen if the locks are implicit.

You can't have implicit locks. Not without fundamentally (and non-backwards-compatibly) changing how the expected behavior works.

Remember: the driver is free to move buffer object data around as it sees fit. The driver has no way of knowing whether a particular VAO is "currently" in use or will be used in the future. The best it can do is take a few guesses based on usage patterns, but that is very complicated compared to the user providing real information.

One of the purposes of locking a VAO is to tell the driver when it is not a good idea to do that with this object. That communication is important and more direct than any usage pattern guessing.

Further, if implementations could do this implicitly, then they already would and bindless graphics wouldn't be much of a speed increase. Clearly there are things in the OpenGL specification that make this optimization difficult if not impossible without spec changes.


Q: Do we need to change OpenGL API/spec?
A: No, it is all done under the hood inside the driver.

The spec would definitely have to change. Most of the answers to the questions you ask require spec changes.

mfort
09-28-2009, 03:20 AM
Remember: the driver is free to move buffer object data around as it sees fit.

Yes, let it be this way. But once the buffer moves, the driver updates all the VAOs that use this buffer (a list of back-pointers helps here).



The driver has no way of knowing whether a particular VAO is "currently" in use or will be used in the future.

C'mon. The driver knows if a VAO is in use. What can happen if it is in use? Let it be the same as with PBOs. If in use, then:
a) wait (e.g. when BufferSubData is called), or
b) make a shadow copy (copy-on-write)



bindless graphics wouldn't be much of a speed increase

There cannot be a much faster path than the NVIDIA bindless API. I am not surprised it is fast. I would be interested in the speedup compared to VAOs.
BTW, do not forget that display lists are the fastest way to render static geometry on NV hardware.






The spec would definitely have to change. Most of the answers to the questions you ask require spec changes.

Name it.

I really think there are only two ways:
a) Making very small changes to the current API and optimizing drivers. Look at NV 180.x; they optimized the drivers a lot, which clearly shows there is still some room.
b) Making more low-level stuff. Maybe not as low-level as what NVIDIA presented. The biggest problem with their approach is the need to change the shaders.

mfort
09-28-2009, 04:48 AM
To help separate locked VAOs from unlocked ones, the locking function should return a "pointer" (64-bit integer). It is illegal to bind a locked VAO at all; instead, you must bind the pointer with a special bind call (that automatically disables the standard bind point).


Then it could be called glMapMemoryGPU(...) and glUnmapMemoryGPU(...) + glBindMemoryGPU(...)

Alfonse Reinheart
09-28-2009, 11:30 AM
But once the buffer moves, the driver updates all the VAOs that use this buffer (a list of back-pointers helps here).

That's insane. Programs should be reasonably free to have tens of thousands of these. To have buffer object modification code iterate through a list of objects to update is ridiculous.

And it doesn't actually work.

Here is the pseudo-code for what a driver currently has to do when you render with a VAO:



foreach attrib in attributeList
{
    //name -> object lookup: touches client memory for every attribute
    bufferObject = GetBufferObject(attrib.bufferObjectName);
    if(!bufferObject.IsBufferInGPU())
        //HERE BE DRAGONS! (page the buffer's storage back into GPU-accessible memory)
}

//Copy tokens into FIFO.


That loop is the source of the problems. Removing that loop is essential to gaining performance equivalent to bindless.

Don't forget that the driver can remove a buffer object from video memory entirely if it needs to. At that point it doesn't have a GPU address, so the backpointers don't help. It still has to get the buffer object and check to see if it is uploaded. And if not, it has to upload it.


C'mon. The driver knows if a VAO is in use.

I don't mean being rendered with; I mean something that you intend to use in this current frame.

Because if you don't plan to use that VAO in this frame, the driver needs to be able to page out the buffer objects that the VAO uses. That gives it more freedom to move unimportant things around. Locking the VAO is a strong indication that you're going to use it, so the API should make it current.


There cannot be a much faster path than the NVIDIA bindless API. I am not surprised it is fast. I would be interested in the speedup compared to VAOs.

My goal is to give the driver the information it needs to get the most optimal performance with a reasonable abstraction.


Name it.


A1: Not allowed. Error is reported. (too strict IMHO)
A2: Driver makes copy-on-write. The value inside VAO is not changed. New VAO must be created to use the new data. (the opposite to the current OpenGL) (too strict IMHO)

Both of these are against the current behavior.


a) Making very small changes to the current API and optimizing drivers. Look at NV 180.x; they optimized the drivers a lot, which clearly shows there is still some room.
b) Making more low-level stuff. Maybe not as low-level as what NVIDIA presented. The biggest problem with their approach is the need to change the shaders.

Or do what I suggested. It isn't low-level at all. It maintains the abstraction while providing drivers the opportunity to optimize things.


Then it could be called glMapMemoryGPU(...) and glUnmapMemoryGPU(...) + glBindMemoryGPU(...)

No. "Mapping" is an operation you do to cause some GPU-local memory to become CPU-accessible. This is nothing like mapping.

mfort
09-28-2009, 11:49 AM
step by step



That's insane. Programs should be reasonably free to have tens of thousands of these. To have buffer object modification code iterate through a list of objects to update is ridiculous.

Yes, why not (have thousands)? It is not a free operation to modify a buffer that is inside a VAO.
How many VAOs are using one particular buffer?
How many times (per frame) do you update such a buffer?




Both of these are against the current behavior.

Yes, that's why I put A3, which does exactly that.




No. "Mapping" is an operation you do to cause some GPU-local memory to become CPU-accessible. This is nothing like mapping.

You are mapping it into GPU space (see the GPU suffix). That "mapped" memory would not be accessible by the CPU at all.

What the NV bindless API is doing IS actually "mapping". First they force it to be in GPU memory (flag it not to move) with MakeBufferResident, and then they get the GPU address.

Alfonse Reinheart
09-28-2009, 12:00 PM
How many VAOs are using one particular buffer?

I imagine quite a few. It is often the case that large pieces of scenery are all stored in the same buffer object.


You are mapping it into GPU space (see the GPU suffix).

No, you're not. The driver is allowed to fix the buffer in GPU space, but that isn't required. All that is required is that the API prevent you, the user, from doing things that might cause the buffer to be moved (reallocating storage, etc.). This gives the driver the freedom to fix the buffer in GPU space. But that behavior is not required.

See, the crucial difference between MakeBufferResident and locking is that MakeBufferResident is something that forces particular behavior on the driver. Locking doesn't force this behavior; it simply strongly suggests it. That's what makes locking higher level than making a buffer "resident".

mfort
09-28-2009, 12:08 PM
BTW, I have found an explanation of why they did not use MapMemoryGPU:



6) What does MakeBufferResidentNV do? Why not just have a
MapBufferGPUNV?

RESOLVED: Reserving virtual address space only requires knowing the
size of the data store, so an explicit MapBufferGPU call isn't
necessary. If all GPUs supported demand paging, a GPU address might
be sufficient, but without that assumption MakeBufferResidentNV serves
as a hint to the driver that it needs to page lock memory, download
the buffer contents into GPU-accessible memory, or other similar
preparation. MapBufferGPU would also imply that a different address
may be returned each time it is mapped, which could be cumbersome
for the application to handle.

So things are a bit more complicated than I thought.

kRogue
09-30-2009, 03:42 PM
Giggles. Why not just put the cards on the table and make everyone use nVidia's bindless graphics? Just kidding. The main part that I find scary in this suggestion is that it is quite complicated to explain and to use, much more complicated than nVidia's bindless graphics. That, and it appears that nVidia's bindless graphics does it a little better too... it's just too bad it makes assumptions about how the GPU does its buffer management magic. What will happen if AMD makes their own bindless graphics extension? Will we all die from extension-itis? Probably not; I found it quite easy to make a little abstraction that lets me use bindless graphics when it is available, the only sticky part being that it has to track the current format of each vertex attribute, not exactly rocket science. If AMD makes their own, chances are that since nVidia already made one, AMD will make theirs easy to port to as well, which usually implies it is easy to make an abstraction layer that maps to any of the three paths (nVidia bindless, traditional non-bindless, and a hoped-for AMD bindless).

Alfonse Reinheart
09-30-2009, 04:19 PM
The main part that I find scary in this suggestion is that it is quite complicated to explain and to use, much more complicated than nVidia's bindless graphics.

Is it?



//Create VAO with attached VBOs.

GLlockedobj pVAO = glLockObject(GL_VERTEX_ARRAY, vaoObjName);

glBindLockedObject(GL_VERTEX_ARRAY, pVAO);

//Do rendering.


Really now, was that hard?

kRogue
10-01-2009, 12:35 AM
This is why I have an issue with it:

1. The VAO interface is kind of awkward in my eyes; worse, the locking does not handle the following very common usage pattern:

Animated keyframe vertex data with non-animated texture co-ordinates.

If the texture co-ordinates were animated as well, one could use the base vertex index added in GL 3.2 (or the appropriate extension). But since that data is not animated, the offset into the buffer for the texture data is the same across frames, while the offset for the animated data is not; thus, to use VAOs, one must have a VAO for each keyframe interpolation pair that one uses.

Secondly, there is the side effect of your proposal: locking a VAO implicitly locks the associated buffer objects. This is going to cause bugs because then one needs to be absolutely 100% sure the underlying buffer objects don't change. Worse, what about when one needs to use transform feedback too? A reasonable usage pattern for transform feedback is to do a very expensive skinning, feed the values into a buffer which in turn is fed into a simpler shader where some skinned object is drawn many times.

However, it is worth noting that a common usage pattern is for the buffer data size to stay the same while the values change. Here nVidia's bindless graphics API deals with this well: use glBufferSubData to change values (for the hardcore, multi-threaded situation, stream to a different buffer and use the copy buffer API in EXT_direct_state_access). In fact, glMapBuffer is legal, and so is modifying the values in the buffer object; the only requirement is that where the buffer object is located does not change, i.e. don't call glBufferData.
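
For example, both of these update the contents without giving the driver any reason to move the storage (buffer names and sizes are placeholders):

//In-place update: the contents change, the storage (and thus any GPU address) does not.
glBindBuffer(GL_ARRAY_BUFFER, animBuf);
glBufferSubData(GL_ARRAY_BUFFER, 0, frameBytes, newFrameData);

//Streaming variant: write into a scratch buffer, then copy on the GL side.
//glCopyBufferSubData is the core GL 3.1 form of the copy-buffer API mentioned above.
glBindBuffer(GL_COPY_READ_BUFFER, scratchBuf);
glBindBuffer(GL_COPY_WRITE_BUFFER, animBuf);
glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, frameBytes);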

Perhaps a better middle ground would be that when buffer objects are "locked", it is not their actual data that is locked, but rather just "where" they are, which is essentially what glMakeBufferResidentNV does; that, and informing the driver that one will be using them.

Additionally, the bindless graphics API allows one to completely skip the integer-to-object conversion, which is a big deal with respect to cache misses. Your proposal of a new kind of object, GLlockedobj, which you imagine as basically being some kind of pointer, handles this, but through an extra layer, whereas the bindless API does not add extra layers. With that in mind, something that is easy to understand and use would be for GL objects not to be indexed by GLuint (indexing by GLuint is absolutely insane in my eyes); much better to just do something like:

typedef struct
{
    void *opaque;
} GLbufferObject;

typedef struct
{
    void *opaque;
} GLTextureObject;

typedef struct
{
    void *opaque;
} GLWhateverObject;

and have the associated calls take the above rather than that GLuint. It also gives the developer a little help in that the GL objects are now strongly typed. With the above, the cache miss mostly goes away (since the driver can have whatever it wants on the other side of the pointer).

Alfonse Reinheart
10-01-2009, 01:00 AM
Animated keyframe vertex data with non-animated texture co-ordinates.

People still do vertex animation in this day and age?


thus, to use VAOs, one must have a VAO for each keyframe interpolation pair that one uses.

I don't see what the problem with that is. Is there something wrong with having lots of locked VAOs that I am not aware of? VAOs are not large objects, and they have no GPU state. There is no reason you couldn't have hundreds of thousands of them if you needed.


This is going to cause bugs because then one needs to be absolutely 100% sure the underlying buffer objects don't change.

I don't understand what you mean here.


A reasonable usage pattern for transform feedback is to do a very expensive skinning, feed the values into a buffer which in turn is fed into a simpler shader where some skinned object is drawn many times.

You seem to misunderstand; maybe I didn't explain it well enough.


Second, all buffer objects attached to that VAO at the time of locking are themselves locked. Any attempt to call glBufferData or any other function that gives the implementation the right to change the buffer object's storage will fail so long as that buffer object is attached to a locked VAO. Multiple VAOs can lock multiple buffer objects.

This paragraph only forbids the use of functions that change the buffer object's type or storage. Basically, glBufferData and glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT (or else the invalidate part is ignored). All other functions work as advertised. So glBufferSubData, as well as doing a glReadPixels into such a buffer, or doing transform feedback into such a buffer, all of those work just fine.

It is the inability to use glBufferData and mapping with INVALIDATE that gives implementations the freedom to make buffers "resident".
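
Restating those rules as code (a sketch of intent, assuming buf is attached to a locked VAO and bound to the relevant targets; sizes and data are placeholders):

//Fine while locked: the contents change, the storage does not.
glBufferSubData(GL_ARRAY_BUFFER, 0, dataSize, data);
glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, 0); //into a bound PIXEL_PACK buffer

//Forbidden (or the invalidate part is ignored) while locked: these may respecify storage.
glBufferData(GL_ARRAY_BUFFER, newSize, NULL, GL_STATIC_DRAW); //would generate an error
glMapBufferRange(GL_ARRAY_BUFFER, 0, dataSize, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);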


Additionally, the bindless graphics API allows one to completely skip the integer-to-object conversion, which is a big deal with respect to cache misses. Your proposal of a new kind of object, GLlockedobj, which you imagine as basically being some kind of pointer, handles this, but through an extra layer, whereas the bindless API does not add extra layers.

Bindless graphics and VAO usage are orthogonal. Because the bindless state is VAO state, you can capture this state in VAOs. This is the expected common usage, as it allows the driver to store optimal data.

And what "extra layer" does the locked object have?


With that in mind, something that is easy to understand and use would be for GL objects not to be indexed by GLuint (indexing by GLuint is absolutely insane in my eyes); much better to just do something like:

That requires all of the work of EXT_direct_state_access plus changing a lot of functions. That's a lot of new functions, as well as deprecating a lot of old ones.

Longs Peak tried to do this. That effort failed. The ARB is clearly not capable of making changes that are that far-reaching. The Lock-API is a good compromise between the nothing we have now and having pointer-based objects.

kRogue
10-01-2009, 04:20 AM
This paragraph only forbids the use of functions that change the buffer object's type or storage. Basically, glBufferData and glMapBufferRange with GL_MAP_INVALIDATE_BUFFER_BIT (or else the invalidate part is ignored). All other functions work as advertised. So glBufferSubData, as well as doing a glReadPixels into such a buffer, or doing transform feedback into such a buffer, all of those work just fine.


My bad, I definitely did not read it carefully; I think I zeroed in on the immutable part. Worse, I read it as "data of buffer object cannot change".

With that in mind, all my objections are pants... though I have to admit that having so many VAOs floating around just to vary which frame one is using seems odd...

For what it is worth, keyframe-animated thingies are perfectly fine for small objects that are drawn lots of times, but not so many times as instancing demands.

zeoverlord
10-01-2009, 01:26 PM
We provide the ability to lock VAOs.
I think you're making the problem out to be way worse than it really is. Making VBOs lockable, yeah, I can go for that, but it's not really a huge problem as I don't allow my app to go around messing with my buffers. There is also no real performance problem as both VAO and bindless rendering are approximately equally fast (compared to doing it manually); in fact I have had no problems whatsoever with any of my apps that use VAOs.
The only way of improving on this would be to add something like:
glDrawVertexArray(GL_TRIANGLES, 0, 143, vao);
which would solve some issues regarding leaving VAOs open so the app can mess with them.


BTW, bindless graphics is currently an experimental extension, and I like the concept of it, but I would rather first have unified buffers than a way to micromanage them.

Alfonse Reinheart
10-01-2009, 01:51 PM
there is also no real performance problem as both VAO and bindless rendering are approximately equally fast

Do you actually have stressful benchmarks showing that bindless graphics provides no performance benefit over VAOs? I'd love to see those metrics.

Jan
10-01-2009, 03:02 PM
I have no experience with bindless rendering, but as far as I have read, some applications have gained significant speed-ups from it, due to fewer cache misses. Now, I don't get the whole discussion here, but I am a bit surprised by the statement that

"VAO and bindless rendering are approximately equally fast"

I did use VAOs a few months ago and did not gain ANYTHING from them. And from some other threads, that seemed to be the general consensus. Did that change recently?

Jan.

Ilian Dinev
10-01-2009, 03:33 PM
I only remember benching VAOs on 190.57 on a GF8600GT, and VAOs were 5% slower than the plain VBO way. Granted, .57 was a hotfix for VAOs IIRC, so it's understandable. 190.62 reintroduced a bug from previous versions (gl_ClipDistance[]), which makes me wonder which one includes the more up-to-date GL code.
Anyway, I'll make a note to bench again one of these days on several different drivers and on Radeon HDs.

zeoverlord
10-01-2009, 06:58 PM
Do you actually have stressful benchmarks showing that bindless graphics provides no performance benefit over VAOs? I'd love to see those metrics.
Nope, only anecdotal evidence in this case, plus some logic.

But what I do know is that in normal cases bindless rendering could not gain any significant speedup (as buffer switching is already pretty fast), so even if it takes half the time of a VAO it's still not a big difference over the whole frame.
Thus it's only in those cases where you're doing an unusually high number of buffer switches that such a speedup stacks up to be significant.

And Jan, well, I don't know how drivers treat VAOs today, but I can't see why it wouldn't be faster; in fact, logically, if it just replays commands it would be as fast as normal, so I don't see why this can't be optimized.
And if nothing else, it sure helped to clean up my rendering thread.

Edit: I ran a mini benchmark and it seems that VAO is as fast as going without it, though I'm using 190.15 so I have to upgrade and see what happens when I add bindless into the mix.
But I think my theory still stands.

Alfonse Reinheart
10-01-2009, 07:13 PM
But what I do know is that in normal cases bindless rendering could not gain any significant speedup (as buffer switching is already pretty fast), so even if it takes half the time of a VAO it's still not a big difference over the whole frame.

NVIDIA seems to think there is something to be gained out of it. And since they made the hardware, wrote the drivers, designed the extensions, and provided benchmarks of them, I feel it's more reasonable to accept what they say about it, rather than what one might think.

It seems clear from the design of the bindless graphics API that they feel that VAO's alone are not sufficient to provide these kinds of optimizations. I posted as part of the original post my ideas on what it is that NVIDIA was trying to avoid.

elFarto
10-02-2009, 02:47 AM
Seems you could just add a usage hint to glBufferData, GL_IMMUTABLE (replacing GL_STATIC_DRAW and friends). When specifying this, the GL is not required to support calling glBufferData again, or mapping with GL_MAP_INVALIDATE_BUFFER_BIT, on that buffer.
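
For instance, with GL_IMMUTABLE as the hypothetical new usage value:

glBufferData(GL_ARRAY_BUFFER, size, data, GL_IMMUTABLE); //hypothetical: storage can never be respecified afterwards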

It's not as flexible as your idea, but much simpler.

Come to think of it, 'not required to' may not be strong enough; 'will not' might be better.

Regards
elFarto

Groovounet
10-02-2009, 05:18 AM
Nice talk, guys! I didn't have time to read this thread before.

It follows on well from this thread:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=256729
and other older topics!

I still think that VAOs were a mistake: API sugar that, as everyone has noticed, doesn't provide anything...

I noticed something in the thread I would like to clarify: drivers and GPUs change the contents of buffers and images, twiddling or even compressing the data, for hundreds of good reasons that all come down to one: memory bandwidth is golden!

1: Any kind of VAO lock API should allow the drivers and the GPUs to use these fancy features that 'optimize' the data.
2: The API needs to fit at least some programmer use cases (unlike VAOs!!!), like your example with skinning; instancing is another one.

Groovounet
10-02-2009, 05:32 AM
OK, OK, my underlying idea: standard lossless image and buffer compression formats.

Alfonse Reinheart
10-02-2009, 11:27 AM
Seems you could just add a usage hint to glBufferData, GL_IMMUTABLE (replacing GL_STATIC_DRAW and friends). When specifying this, the GL is not required to support calling glBufferData again, or mapping with GL_MAP_INVALIDATE_BUFFER_BIT, on that buffer.

That's not a good idea. For several reasons.

You want to be able to turn it on and off as you need to. If you know a particular object isn't going to be used for a while, but you want to keep it around (rather than having to rebuild its buffers), you can unlock its VAO. This gives the GL the freedom to move the buffer object out of video memory if it needs to.

There is also another problem. The usage hints are still important when a buffer object is locked. You can still map the buffer, and you should still reasonably be able to stream data into locked buffers.


I still think that VAOs were a mistake: API sugar that, as everyone has noticed, doesn't provide anything...

Why? And in what way does it "not provide anything"? The only problem with VAOs giving performance improvements is a problem with buffer objects: that they can be created/destroyed/respecified, so when you render with them, the code must fetch the buffer object and get its GPU address. That's not a problem of VAOs specifically; that would still happen if you were doing all the binding yourself.


I noticed something in the thread I would like to clarify: drivers and GPUs change the contents of buffers and images, twiddling or even compressing the data, for hundreds of good reasons that all come down to one: memory bandwidth is golden!

Drivers are allowed to do this with images, but not buffer objects. They cannot compress buffer object attribute data in any way.


Any kind of VAO lock API should allow the drivers and the GPUs to use these fancy features that 'optimize' the data.

That should be left to some other API that allows the driver to accept a number of rendering commands to create a special drawable object. That API might look something like this:



glBeginRender(VAO, GL_OWN_BUFFER_OBJECTS);
glDrawElements(...); //any number of rendering commands
glDrawElements(...);
GLobject theObj = glEndRender();

glRender(theObj);


The GL_OWN_BUFFER_OBJECTS means that the GL driver is free to modify the buffer objects attached to the VAO. However, it also means that these buffer objects can no longer be accessed by the user; they behave as if the user deleted them. The VAO is also deleted.

There might also be a GL_LOCK_BUFFER_OBJECTS flag that forces the buffer objects to become locked while this render object exists. This would then behave much like a locked VAO, but with implicit rendering commands.

The fact that it collects rendering calls rather than just a VAO is important. That allows the driver the freedom to cull out buffer object data that happens to be not used in those rendering commands.

Groovounet
10-02-2009, 05:34 PM
I noticed something in the thread I would like to clarify: drivers and GPUs change the contents of buffers and images, twiddling or even compressing the data, for hundreds of good reasons that all come down to one: memory bandwidth is golden!

Drivers are allowed to do this with images, but not buffer objects. They cannot compress buffer object attribute data in any way.


Really? Do you have any reference for this?
I would say that even if it's not allowed, it has been done before and it's going to become more and more common in future GPUs.
A well-known and basic example was the int-to-short conversion of index buffers on nVidia chips/drivers when it was possible. It has been effectively tested in the past!

Typically, this glMakeBufferResident function could be a place where the untwiddling/decompression could be requested. It would hide some memory latency and present a buffer that can actually be read. This is where glMapBuffer is an issue: the whole buffer must be ready to use (untwiddled/uncompressed) before the function returns; glMapBufferRange is better here.

Buffer compression and twiddling are great for GPUs; it's easy to reach 50% memory bandwidth savings, so even if it's not allowed yet by OpenGL, this API should relax that constraint when the developer asks for it... and there are many situations where this is possible! (Apply a wavelet to a mesh and display the histogram; the results are stunning, some data could be compressed down to 1 bit per memory burst (1/64).)

nVidia announced 7x with bindless graphics... I don't really believe it is just the result of a new API. There is something going on behind this.

Groovounet
10-02-2009, 05:57 PM
I still think that VAOs were a mistake: API sugar that, as everyone has noticed, doesn't provide anything...

Why? And in what way does it "not provide anything"? The only problem with VAOs giving performance improvements is a problem with buffer objects: that they can be created/destroyed/respecified, so when you render with them, the code must fetch the buffer object and get its GPU address. That's not a problem of VAOs specifically; that would still happen if you were doing all the binding yourself.


Because most of the time, all it does for each mesh is make you write more code for a 0% efficiency gain. Sometimes (quite often, actually) it results in an explosion of object count.

It just behaves like a function call wrapper. It doesn't allow locking of the buffer access, or even of the memory address, because of reallocations or just the way the memory controller works to keep memory access efficient. This is just an API wrapper. Some sugar.

The second case where it should do something: the attribute descriptions (offset, type, stride, etc.). Too bad you can't change the buffer in the VAO and assume that you are going to keep the same buffer format. (Typically, if you have 10 different meshes for 10 different animated characters, each buffer may be different, and for each buffer there is a single VAO.) You can't assume that when you change your VAO the buffer attributes are the same, so you could check each attribute, but it is just easier and faster to bind everything again. That's where the bindless graphics API is so good: you bind your format once and display your 10 characters!

Jan
10-02-2009, 07:47 PM
"nVidia announced 7x with bindless graphics"

Yep, the algorithm they use is called "marketing".

Usually it fails big time.

Alfonse Reinheart
10-02-2009, 07:52 PM
A well-known and basic example was the int-to-short conversion of index buffers on nVidia chips/drivers when it was possible. It has been effectively tested in the past!

When you upload an image, the driver has the right to corrupt your data. It does not guarantee in all cases that the exact colors you specify are what you get out.

With buffer objects, they do make that guarantee. Which is why nVidia only does this conversion when it can. That is, when it will not affect the absolute value of the data.

It is considered bad form for drivers to do this. That is because the driver must do special processing on the first render with this element buffer. This helps contribute to NVIDIA's love for "first render" hitches.


Because most of the time, all it does for each mesh is make you write more code for a 0% efficiency gain.

The purpose of the beginning section of my post was to propose an explanation for the "0% efficiency gain". Do you have an alternative explanation? Do you have reason to believe that the lock API would not solve the problem?


Sometimes (quite often, actually) it results in an explosion of object count.

I don't understand how object count is an issue. These are small structs; they don't take up much room. And they're client-side data, so it's not like you're using up precious GPU memory.


"nVidia announced 7x with bindless graphics"

Yep, the algorithm they use is called "marketing".

Usually it fails big time.

I get the point you're making. But NVIDIA would be foolish to put numbers out that are so easily proven wrong. They can be exaggerated. But they wouldn't bother with bindless graphics unless there was some significant speedup.

A reasonable question to ask is this: does bindless graphics mean that NVIDIA will make no effort to use VAOs to improve performance?

Eosie
10-02-2009, 08:33 PM
It seems to me that this locking API is awkward and won't yield any performance benefits. Locking can already be done in a driver automatically, assuming everything is locked by default. When a buffer is reallocated or a vertex array state is changed, the appropriate VAO can be marked "dirty" (unlocked) and can be locked again once it is used to render something. That means VAOs which aren't changed stay locked forever. The proposed explicit locking seems to be just another hint with some sugar here and there.
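
A sketch of that dirty-flag scheme, again with made-up driver-internal types (nothing here is real OpenGL):

typedef struct DriverVAO {
    int dirty; //set when an attached buffer is respecified or attribute state changes
    //... baked command tokens, GPU addresses, etc. ...
} DriverVAO;

static void on_buffer_respecified(DriverVAO **users, int userCount)
{
    for (int i = 0; i < userCount; ++i)
        users[i]->dirty = 1; //"unlock": the baked data is now stale
}

static void on_draw(DriverVAO *vao)
{
    if (vao->dirty) {
        //rebuild the baked tokens/addresses, then "relock"
        vao->dirty = 0;
    }
    //submit the prebuilt tokens
}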

First, we need to resolve some design issues in VAOs, namely decoupling the vertex format from the vertex data. It's one of the things bindless graphics came up with, and it certainly had an impact on its success, among other things.

Alfonse Reinheart
10-02-2009, 11:14 PM
Locking can already be done in a driver automatically, assuming everything is locked by default. When a buffer is reallocated or a vertex array state is changed, the appropriate VAO can be marked "dirty" (unlocked) and can be locked again once it is used to render something.

Yes, a driver could do this. But they don't. It takes too much effort and requires a lot of back-pointers from buffer objects to VAOs. There may even be internal reasons why it can't be done.

The purpose of the lock API is to give the implementation the freedom it needs to do this easily, by taking freedom away from the user.


First, we need to resolve some design issues in VAOs, namely decoupling the vertex format from the vertex data. It's one of the things bindless graphics came up with, and it certainly had an impact on its success, among other things.

Bindless graphics doesn't uncouple vertex formats from vertex data. All it does is allow you to use pointer values rather than buffer object names.

Groovounet
10-03-2009, 11:03 AM
When you upload an image, the driver has the right to corrupt your data. It does not guarantee in all cases that the exact colors you specify are what you get out.

With buffer objects, they do make that guarantee. Which is why nVidia only does this conversion when it can. That is, when it will not affect the absolute value of the data.


What?! When I was speaking of image and buffer twiddling and compression, I didn't say anything about lossy: I said lossless! I would be really surprised if there is anything lossy going on with buffers and images on GPUs these days.



Bindless graphics doesn't uncouple vertex formats from vertex data. All it does is allow you to use pointer values rather than buffer object names.

Come on! Have a look at both extensions and you will see it does!

I think bindless graphics got it right on both issues I was talking about before, and I do believe in this 7x thing. It is probably a very specific case with a lot of draw calls with the same format and static buffers!

A VAO for buffer and vertex format is like a texture for image and filter. In the current state of the OpenGL spec, I want this feature deprecated.

Alfonse Reinheart
10-03-2009, 12:47 PM
Come on! Have a look at both extensions and you will see it does!

No, it doesn't. In the standard case, VAOs store buffer object names and an offset. In the bindless case, VAOs store buffer object addresses and an offset.

All bindless does is remove the buffer object's name itself from the equation. Which means that, when rendering with bindless, the rendering system no longer has to test the buffer objects (to see if they exist, get GPU addresses, etc).

You still have to either store the bindings in a VAO or keep calling "glBufferAddressRangeNV" and "glVertexAttribFormatNV" to build your attribute data. This is directly equivalent to "glBindBuffer" and "glVertexAttribPointer".

You really should read the original post. I spend a great deal of time explaining where I think NVIDIA gets their speedup from with bindless, and how locking mimics this almost exactly. If you have a problem with my reasoning there, please explain what it is.


A VAO for buffer and vertex format is like a texture for image and filter.

That analogy does not work. Buffer objects already store the vertex data. So there is already separation between the raw data and how that data gets used. Because of that, you can have many, many VAOs that all use the same buffer objects. Textures can't do that.

The part of VAOs that you seem to be having a problem with is the storage of buffer object name+offset, or in bindless, buffer object GPU address+offset. But the only "problem" this creates is making lots of VAOs. And I don't understand why this constitutes a problem.

Eosie
10-03-2009, 02:39 PM
Locking can already be done in a driver automatically, assuming everything is locked by default. When a buffer is reallocated or a vertex array state is changed, the appropriate VAO can be marked "dirty" (unlocked) and can be locked again once it is used to render something.

Yes, a driver could do this. But they don't. It takes too much effort and requires a lot of back-pointers from buffer objects to VAOs. There may even be internal reasons why it can't be done.

If they are unable to optimize it now, they never will, no matter what API you throw at them. I have already given you an idea of how to make it efficient, and I disagree that some locking "hint" will make drivers faster than ever.


The purpose of the lock API is to give the implementation the freedom it needs to do this easily, by taking freedom away from the user.


First, we need to resolve some design issues in VAOs, namely decoupling the vertex format from the vertex data. It's one of the things bindless graphics came up with, and it certainly had an impact on its success, among other things.

Bindless graphics doesn't uncouple vertex formats from vertex data. All it does is allow you to use pointer values rather than buffer object names.

The reason bindless graphics exists in its given form is that NVIDIA may have realized that even though GPU pointers provide some improvement, changing the vertex format is still costly, which is why it's separate from the rest. To make the best of bindless graphics, using just GPU pointers will not make your application dance. If you had taken a look at the vertex_buffer_unified_memory spec, you would know that the only example there sets the vertex format once and renders many times. EDIT: Storing the address+offset in VAOs is done in the same way regardless of the availability of bindless graphics. For the aforementioned reasons, VAOs in conjunction with bindless graphics might actually hold you back.

Decoupling buffer bindings and vertex formats might not just improve performance the same way bindless graphics does; more importantly, it would improve usability. Notice that D3D has it too.

And about this GPU pointers stuff, I'd like to see a more direct way of handling buffers in OpenGL: explicit allocation of device memory and page-locked system memory, with memcpy between RAM and VRAM performed by the user, not the driver, which implies having GPU pointers by design. This is quite common in CUDA and would come in handy in OpenGL too. Just dreaming. ;)

Alfonse Reinheart
10-03-2009, 05:18 PM
Apparently, I didn't read the extension thoroughly enough. I was under the impression that it basically replaced the buffer object binding with a pointer.

I'm not sure presently what this would mean for a cross-platform API for improving vertex specification performance.

Alfonse Reinheart
10-03-2009, 07:45 PM
After thinking about it for a while, this opens up possibilities. And serious problems.

Now, everything I said in my original post may still be valid. That is, when I explained what I thought was the reasoning behind bindless graphics for rendering, that may still be correct. Indeed, I imagine it's a significant cache issue one way or another. The lock API as it currently stands may get, say, 80% of the performance of bindless.

However, there is also this potential problem. That, for whatever reason, vertex format changes in hardware take more performance than changing the buffers used by that rendering.

The examples in the bindless graphics spec suggest this is the case. But consider this.

The justification for bindless graphics was as a cache issue, not an issue with vertex formats being attached to the GPU addresses for them. Specifically, this was the CPU's cache. How exactly does the vertex format affect the CPU's cache?

It may be the case that there's simply more data. That FIFO chunk I mentioned, if you're using the same vertex format, would be smaller than if you changed vertex formats. Vertex format information takes up room that's clearly larger than the GPU addresses that are the source of those attributes.

Cache lines these days are 64 bytes. That's big enough for 16 32-bit values (the buffer addresses, if every one of the 16 attributes comes from a different buffer). So in the worst possible case, you're guaranteed that the format+address data will be larger than one cache line.

Really, I think the only way to know is to test it. To write an application that completely flushes the CPU's cache. Then have it do some rendering stuff. One way with the "common" form of bindless (one vertex format, lots of pointer changes). Then with constant format changes, once per render operation. And see what is fastest. The mesh data itself isn't at issue; indeed, it's better to just render a single triangle from 200,000 buffer objects. And of course, cull all fragments.

Unfortunately, my knowledge of cache architecture on x86 chips is insufficient to do something that actually flushes the cache fully.
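
For what it's worth, a crude way to get most of the way there is simply to walk a scratch buffer much larger than the last-level cache between timed runs (x86 also has _mm_clflush for per-line flushing, but the brute-force version is enough for a benchmark like this):

#include <stddef.h>

//Crude cache eviction: touch a working set much larger than the last-level
//cache so previously cached lines are (very likely) evicted. Pick `bytes`
//to comfortably exceed the CPU's LLC size.
static void thrash_cache(volatile unsigned char *junk, size_t bytes)
{
    for (size_t i = 0; i < bytes; i += 64) //one touch per 64-byte cache line
        junk[i]++;
}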

Also, this won't answer the other important question: is this an NVIDIA-only issue, or is this something that ATI implementations could use some help on too?

elFarto
10-04-2009, 04:36 AM
If vertex specification is a performance problem, couldn't you fix it by moving the specification into the vertex shader? I.e. have a shader like:


in struct {
    vec3 pos;
    vec3 normal;
    vec2 tex;
} vertex;

Then in the draw call you only have to pass in a buffer/pointer. In fact, the current bindless extensions would allow you to do this, by making vertex a pointer, binding a buffer to it, and using gl_VertexID to do the lookup.
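
As an illustration of the same idea using only core features (a buffer texture and texelFetch instead of a bindless pointer; the packing of two vec4s per vertex is just an assumption for the example):

#version 150
uniform samplerBuffer vertexData; //buffer texture holding, per vertex: pos.xyz + u, normal.xyz + v

out vec3 normal;
out vec2 tex;

void main()
{
    vec4 a = texelFetch(vertexData, gl_VertexID * 2 + 0);
    vec4 b = texelFetch(vertexData, gl_VertexID * 2 + 1);
    gl_Position = vec4(a.xyz, 1.0);
    normal = b.xyz;
    tex = vec2(a.w, b.w);
}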

Regards
elFarto

Alfonse Reinheart
10-04-2009, 11:04 AM
If vertex specification is a performance problem, couldn't you fix it by moving the specification into the vertex shader?

Well, that already exists. The vertex shader inputs.

Vertex specification is about providing appropriate buffer objects and interpretations of those buffer objects (ie: the format), so that the pre-vertex shader code can know where "vec3 pos" actually comes from.

elFarto
10-04-2009, 11:41 AM
Vertex specification is about providing appropriate buffer objects and interpretations of those buffer objects (ie: the format), so that the pre-vertex shader code can know where "vec3 pos" actually comes from.
What do you mean by 'pre-vertex shader code'?

Regards
elFarto

Alfonse Reinheart
10-04-2009, 03:39 PM
What do you mean by 'pre-vertex shader code'?

There is hardware in GPUs that fetches attributes from memory and converts the particular format (normalized byte, unnormalized signed short, float, etc.) into the expected attribute data for the vertex shader. This hardware understands the format of each attribute and knows which GLSL variable to store it in.

It isn't necessarily programmable, so "code" was the wrong word.

elFarto
10-05-2009, 02:12 AM
It isn't necessarily programmable, so "code" was the wrong word.
Ah, Ok, I understand now. Then I suggest we steal DirectX 10's CreateInputLayout, IASetInputLayout and IASetVertexBuffers API. This might actually help the driver, since calling SetVertexBuffers hints to the driver you're about to use them (why else would you be calling it?).

Thinking about it a bit more, we're actually working around a design problem with buffer objects. If the problem is that we can supply completely new data for a buffer, effectively changing its address, then we should stop that from happening.
Modify the API so that you can only size the buffer once (essentially giving you a malloc/free API).

This removes the need to lock the buffer (you can't change it, therefore you don't need to lock it).
If you make the new API return an int64, it can directly return the GPU address, giving you all the benefits of the bindless extensions.
You'll still need MakeBuffer[Non]Resident. The driver can't guess when you'll need what buffers, you need to tell it.

The only issue I can't figure out is how to make it easy/fast for the driver to swap buffers in/out of GPU memory by just using the GPU address. Any ideas?

Regards
elFarto

Alfonse Reinheart
10-05-2009, 11:30 AM
If you make the new API return an int64, it can directly return the GPU address, giving you all the benefits of the bindless extensions.

Except that we don't want that low-level access to things. We want to maintain the abstraction, not break it.


You'll still need MakeBuffer[Non]Resident. The driver can't guess when you'll need what buffers, you need to tell it.

But if you're going to do that, you're just locking the buffer with a different API call. You may as well just use the current behavior until you lock/resident the buffer.

The design of my API is to make the transition from what we have currently smooth, since the ARB isn't interested in large rewrites of behavior. Changing the default behavior of buffer objects would likely break a lot of code.

elFarto
10-06-2009, 05:06 AM
Your Lock API looks fine, it just feels like we're working around another issue.

It certainly seems that VAOs need to be modified to not contain the reference to the buffer, but instead an index id. Then a set of buffers can be bound separately (this mirrors the DirectX 10 API).

OpenCL's buffer management looks better. You can't force it to reallocate the buffer; it's one size for the entirety of its existence. It also seems like OpenCL has a better image API than OpenGL. If we could get an extension to access the fixed functionality of the GPU (rasteriser, interpolators, etc.) it would make a much better API (since it matches the hardware pretty closely).

Again, I really don't think there's anything wrong with your idea, I just think we're making the hole we're in even larger.

Regards
elFarto

Alfonse Reinheart
10-06-2009, 11:25 AM
It certainly seems that VAOs need to be modified to not contain the reference to the buffer, but instead an index id.

What is an index id?

Also, in order for this division to matter, we would need some evidence that part of the bottleneck is the setting of the vertex format. NV_vertex_buffer_unified_memory includes the GPU address as part of the VAO state. So it would require profiling the extension to see how much longer rendering takes when using VAOs vs. changing the vertex format as needed.

elFarto
10-06-2009, 11:52 AM
What is an index id?
Go read this (http://msdn.microsoft.com/en-us/library/ee418762%28VS.85%29.aspx), and specifically the D3D10_INPUT_ELEMENT_DESC page linked from it. The index id I'm referring to is called an "InputSlot".

When you've set up the format of your slots, you call IASetVertexBuffers to attach a set of buffers to them. Kind of like the way uniform blocks work in OpenGL.

Regards
elFarto

Alfonse Reinheart
10-06-2009, 12:17 PM
I don't see the need for such a remapping layer. Just because D3D10 does it that way doesn't mean OpenGL should.

Groovounet
10-07-2009, 03:57 AM
I think you are being a bit close-minded about your ideas on this topic, Alfonse. The thread is losing interest from my point of view.

elFarto didn't link Direct3D to say "D3D10 does it that way so OpenGL should too". He refers to it to give an example of some kind of 'slot' for buffers, so that we could bind a VAO and freely change the buffers.

Bind one VAO, bind N*M array buffers, and draw at least N times.

Alfonse Reinheart
10-07-2009, 11:20 AM
Bind one VAO, bind N*M array buffers, and draw at least N times.

To suggest such a thing, one needs evidence that it is beneficial. The lock API has clear benefits, based on what I pointed out in the first post. As I suggested, there needs to be some benchmarking done to see if that would be worthwhile.

It should also be noted that NVIDIA had the chance to create this separation with bindless graphics. But they did not. They specifically made the GPU address part of VAO state.