Official Bindless Graphics feedback thread



barthold
04-27-2009, 09:48 AM
NVIDIA just released a new and faster way of rendering with OpenGL, called Bindless Graphics. Bindless Graphics are changes to OpenGL that can enable close to an order of magnitude improvement in the CPU-limitedness of graphics applications. Recent improvements in programmability have focused on additional flexibility in shaders (expanding formats to include more float and integer types, better branching support, etc.) and enabling new features (geometry programs, transform feedback, etc), and while some of these allow offloading parts of certain workloads to the GPU, they don't directly attack the issues that dominate CPU time.

Bindless Graphics is enabled through two OpenGL extensions: NV_shader_buffer_load (http://developer.download.nvidia.com/opengl/specs/GL_NV_shader_buffer_load.txt) and NV_vertex_buffer_unified_memory (http://developer.download.nvidia.com/opengl/specs/GL_NV_vertex_buffer_unified_memory.txt).

You can read the Bindless graphics tutorial (http://developer.nvidia.com/object/bindless_graphics.html) for more detail.

Let us know how it works for you!

Barthold
(with my NVIDIA hat on)

Ido_Ilan
04-27-2009, 11:25 AM
Just a comment: there have been a lot of updates from NVIDIA recently, and it's nice to see them all and to browse the developer page (like a few years back).

Reading the spec right now; it seems nice. The presentation is well made. Currently our application is bound more by CPU and state changes than by raw GPU power, but I aim (hopefully) to test these new extensions next week and see if they make a difference.

Ido

MalcolmB
04-27-2009, 11:35 AM
How does this extension interact with Map/Unmap usage patterns? I use map/unmap so I can update the VBO data in another thread. Will it still work?

Brolingstanz
04-27-2009, 12:13 PM
Is 8800 support planned? (All beta drivers I can find are for 9200+ only.)

bertgp
04-27-2009, 02:20 PM
Are these extensions GL 3 only? In the spec, there are examples using GLSL version 1.20, but the header states that the spec is written against OpenGL 3.0.

I was hoping that NV_shader_buffer_load would have been an extension which allows saving compiled shaders to disk and reloading them later!

I haven't tried it yet, but it's nice to see that binding blocks of uniforms will now be as fast as changing a pointer.

CatAtWork
04-27-2009, 03:42 PM
> Is 8800 support planned? (All beta drivers I can find are for 9200+ only.)

From the NVIDIA release: "Bindless Graphics is available starting with NVIDIA Release 185 drivers for hardware G80 and up."

Brolingstanz
04-27-2009, 03:55 PM
That's what happens when you don't eat your veggies. :o
Sorry bout that.

bobvodka
04-27-2009, 04:10 PM
Bindless Graphics aka (like) one of the good bits from LP.

Nice one NV, good to see that some of LP has survived in some form.

Just a shame that AMD won't get in on the act (I say this as an AMD GPU user, so no bias against them); because of that it would be interesting to see how much use this would get in the 'real' world.

Dark Photon
04-27-2009, 05:54 PM
Very nice! Great job, NVidia!

Cyril
04-28-2009, 12:56 AM
That's really nice !
But one thing I wonder is, since we are now able to get direct GPU global memory addresses, how does this interact with OpenGL SLI mode ? Is automatic scaling with multiple GPU still working ?

Groovounet
04-28-2009, 01:29 AM
What about VAO with this? Oo

I spent post after post debating with Korval a while ago about how VAOs don't allow fast buffer switching, so I'm glad to see such a feature, but I'm a little confused about how VAO and this fit together now. It even deprecates the new uniform buffer ... Oo

I definitely need more information about all this, I don't actually know where to go now.

knackered
04-28-2009, 02:00 PM
well it's only G80 cards, so you'll have two very different code paths.
This stuff has actually got me excited about GL again. Thanks nvidia.

ector
04-28-2009, 02:58 PM
Wow, that's daring stuff. Basically cuts open big holes through the thick mud that the current OpenGL API is :)

I like it. If only it was universal.

Mars_999
04-28-2009, 04:19 PM
All I can say is about time! As usual Nvidia leading the pack for OpenGL. This is why I buy Nvidia only due to their support for OpenGL vs. ATI. Just wish ATI would pick up the slack and get in gear... Here's to wishing!!!

jeffb
04-28-2009, 09:43 PM
> How does this extension interact with Map/Unmap usage patterns? I use map/unmap so I can update the VBO data in another thread. Will it still work?

Yes, MapBuffer (and MapBufferRange) will still work, but please read issues 7-9 of NV_shader_buffer_load for more info.

> Is automatic scaling with multiple GPU still working ?

Yes, SLI can still work.

> What about VAO with this?

The new vertex attrib state is contained in the VAO (see the "New State" section), so these can be used in conjunction with VAO. However, it's not clear that VAO will provide additional benefit since switching vertex attrib addresses should be cheap (Groovounet, some of your posts about VAO have been very insightful).

> Are these extensions GL 3 only?

They do not require a GL3 context, and should work with all the ARB_compatibility features.

Cyril
04-29-2009, 12:38 AM
> Is automatic scaling with multiple GPU still working ?
> Yes, SLI can still work.


Thanks for your answer.
Do you know if it's currently working with NVIDIA R185 drivers ? If so, does it work in heterogeneous SLI configurations, with GPU not having the same amount of memory for instance ? Any idea of how it is implemented ? Does each GPU clone the same address space ?

skynet
04-29-2009, 03:23 PM
Ok, I had a bit of time to read and think through the specs, and I think I might have grasped it :-)

First of all, I think it's a step in the right direction. It offers new features that even D3D10 does not have. Issuing draw commands will get faster, leaving more CPU time for the application itself. Although the GPU will not render faster per se, there are now more possibilities for complex data structures used in the shaders.

I have a few points to criticise, though. First of all the naming of the extensions is... just awkward.
Why not use "NV_shader_memory_access" (it's about shaders directly accessing memory, right?) and "NV_buffer_direct_memory" (I don't get the "unified" part...)?

I think it might be good to not provide the non-Named Functions. We all know that bind-to-edit will die out some day, so why provide these obsolete semantics for brand new functionality?! Just provide the Named-Function semantics (but please, without the "Named" prefix).
Also, the specs refer to some functions of EXT_direct_state_access without mentioning dependencies on it.

Now we get pointers and pointer arithmetic in shaders. That makes it possible to create really complex data structures, like lists, trees, hashmaps - the specs even talk about whole scenegraphs. But how are we supposed to debug such monsters?

Is the driver using the "sizeiptr length" parameter of BufferAddressRangeNV to determine that a certain region of memory is in use by drawing commands? If not, do we have to use fences for that (welcome back, NV_vertex_array_range)?

It seems that once a VBO is made resident, it can never become non-resident (unless the application says so). How can GL make that guarantee? What is the use of a non-resident buffer then? Does a resident buffer have any performance penalties?

I like the separation of vertex attribute format, enable and pointer. VAO did not offer this and therefore was useless (for me). You could take it one step further and provide "vertex format" objects which encapsulate the format and enables. I don't know if that would be a great benefit, though (switching a stream setup with one command).

I'd like to suggest to move all the buffer-object related functions (MakeBufferResidentNV, IsBufferResidentNV, MakeBufferNonResidentNV) from NV_shader_buffer_load into NV_vertex_buffer_unified_memory. They feel more natural there. Additionally, IsBufferResidentNV may be replaced or accompanied by a new parameter to glGetBufferParameteriv().

Last but not least, IMHO the use of these new extensions probably has a fair impact on the design of a renderer. It cannot easily be made optional. Therefore, I will probably hesitate to actually make use of these features if they are not available on ATI as well.


my 2c :-)

jeffb
04-29-2009, 07:57 PM
> the "sizeiptr length" parameter

D3D10 has some guarantees that accessing beyond the end of a buffer will not crash, this is providing something similar. If that's not useful to you, you can use INT_MAX and ignore it.

> ... once a VBO is made resident...

Issue 6 of shader_buffer_load discusses the purpose of residency. It shouldn't adversely affect performance most of the time.

> probably has a fair impact on the design of a renderer.

The presentation on the developer site has some examples of how to port. With the vertex buffer extension it should be easy to maintain both codepaths, although I can appreciate that maintaining multiple versions of shaders has some cost to developers.

Jan
04-30-2009, 02:19 AM
So, will there be new texture formats to render pointers into FBOs ? Could be useful for deferred renderers to "render materials" for later lookup.

I guess i could also store all materials in a big array and use an integer from an FBO for lookup, should be flexible enough.

I find the extension very intriguing, probably the most interesting and powerful idea since the introduction of shaders. However, debugging is indeed a problem. With standard GL/GLSL a broken shader usually crashes a program. With this i fear the blue-screen might become much more common. And debugging standard shaders can be a nightmare today already.

Jan.
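
A minimal CPU-side sketch of the second idea above (one big material array, indexed by an integer read back from the render target). `Material`, its fields, and `lookupMaterial` are hypothetical names, and falling back to entry 0 for an out-of-range id is an assumption made for the sketch, not something either extension prescribes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical material record; the fields are placeholders.
struct Material {
    float diffuse[3];
    float shininess;
};

// Resolve the material for one texel from the integer id stored in the
// render target. Out-of-range ids fall back to entry 0 (an assumption),
// so a bad id gives a wrong material rather than a bad read.
const Material& lookupMaterial(const std::vector<Material>& table,
                               std::uint32_t id) {
    assert(!table.empty());
    return id < table.size() ? table[id] : table[0];
}
```

The point of the indirection is that a bad id can be clamped, whereas a bad pointer written into a render target cannot be checked as cheaply.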

rexguo
04-30-2009, 04:10 AM
So it's basically adding 'GPU address pointers'.

1. Handles now become direct GPU pointers on client side.
2. Shading language now has C-like pointer syntax.

It's a very nice feature but I really would have
preferred that the wording be more direct (i.e.
new feature: pointers to GPU memory!) so it is
easier to read, understand and use! :)

bertgp
04-30-2009, 07:35 AM
I have some concerns about the driver's memory management when using MakeBufferResidentNV. My understanding is that this function locks a buffer at its present VRAM location in order to provide a constant GPU address. If that is the case, this would seem to exacerbate the memory fragmentation of VRAM. If some randomly sized buffers are locked at some random positions in memory, the driver won't be able to defrag the memory and later buffer allocations will eventually fail. How will the GL cope with this?

Regarding structures made of pointers, maybe it would be a good idea to add refcounts to resident GPU addresses. As it stands now, the application has to do this on its own in order to use a resident GPU address in multiple structures. It is really only safe to delete a resident buffer shared across multiple structures via its GPU address when its usage count drops to 0. Otherwise, it generates dangling pointers to deleted memory. Essentially, it seems to me that using GPU addresses shared across structures is an easy way to introduce hard to debug crashes.
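
A minimal sketch of the application-side bookkeeping described above, assuming the GPU address is simply a 64-bit integer key. `GpuAddressRefs` and its methods are hypothetical names and not part of either extension; the sketch only decides when it is safe to delete, it does not call GL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical application-side registry of how many structures store a
// given resident GPU address.
class GpuAddressRefs {
public:
    // Record one more structure that stores this GPU address.
    void acquire(std::uint64_t gpuAddress) { ++counts_[gpuAddress]; }

    // Drop one reference. Returns true when no structure refers to the
    // address any more, i.e. the caller may now make the buffer
    // non-resident and delete it.
    bool release(std::uint64_t gpuAddress) {
        auto it = counts_.find(gpuAddress);
        assert(it != counts_.end() && "release without matching acquire");
        if (--it->second > 0) return false;
        counts_.erase(it);
        return true;
    }

    // Number of structures currently holding the address.
    std::size_t useCount(std::uint64_t gpuAddress) const {
        auto it = counts_.find(gpuAddress);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::unordered_map<std::uint64_t, std::size_t> counts_;
};
```

Only the `release` call that returns true should be followed by the actual buffer deletion; every earlier release leaves the buffer resident.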

bertgp
04-30-2009, 07:37 AM
By the way, I think it's a great idea to ask for developer feedback like this. Now it remains to be seen if anything coming out of this thread will actually make its way into the spec/driver :)

barthold
04-30-2009, 12:09 PM
We released 185.81 beta drivers today, including for notebook. Please use those drivers if you are going to experiment with bindless graphics.

Barthold

cass
04-30-2009, 02:26 PM
bertgp, the gpu address is virtual. The virtual address can be constant without making the physical address constant.

This looks like a really good set of extensions.

I particularly like that the vertex buffer code should be easily hidden behind an existing application VBO abstraction.

Brolingstanz
04-30-2009, 06:43 PM
Yeah, this is freaky cool.

Big thanks for the beta drivers!

Keith Z. Leonard
05-01-2009, 12:01 AM
This is definitely promising. It seems that it would be pretty easy to change my VBO code to do this. I assume the big wins here are a direct connection to GPU address, thus eliminating a bind id lookup, and the notion that you cannot edit, so you can rely on these functions being lightweight. I was wondering why there were no GL extensions to do something like this for textures, as drawing from a texture should take less work than setting up to update a texture on the driver side, no?

bertgp
05-01-2009, 07:41 AM
> bertgp, the gpu address is virtual. The virtual address can be constant without making the physical address constant.

With virtual addresses, there can still be problems. Let's say there are N memory pages and each one has one resident VBO locked in it. In that case, any allocation greater than the size of two memory pages would fail since it would not be possible to provide a contiguous range of virtual memory which satisfies it.

Before resident buffers, the GL could shuffle data around as much as it wanted. With this extension, there are more restrictions to what it can do.

cass
05-01-2009, 08:54 AM
Being virtual doesn't mean completely unconstrained, I agree, but the virtual address space of modern GPUs is in the terabyte range.

Agreed that giving the virtual address does add restrictions that didn't exist with handles alone, but the handle-to-address cache misses were not an insignificant cost.

This seems like a perfectly reasonable way to go about eliminating that cost.

bertgp
05-01-2009, 10:08 AM
> Being virtual doesn't mean completely unconstrained, I agree, but the virtual address space of modern GPUs is in the terabyte range.

Yeah you're right. I had completely forgotten that the addresses are 64 bits. We have some leeway before each page of the 64 bit range is taken!
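
To put rough numbers on that leeway, here is a small arithmetic sketch. The 4 KiB page size is an assumption for illustration only (real GPU page sizes vary), and a 64-bit pointer does not imply that all 64 bits are usable address space; the 40-bit case stands in for the "terabyte range" mentioned above.

```cpp
#include <cassert>
#include <cstdint>

// Assumed page size for illustration only: 4 KiB = 2^12 bytes.
constexpr unsigned kPageShift = 12;

// Number of pages in an address space that is addressBits wide, computed
// as a shift so that the 64-bit case does not overflow.
constexpr std::uint64_t pagesInAddressSpace(unsigned addressBits) {
    return std::uint64_t{1} << (addressBits - kPageShift);
}

static_assert(pagesInAddressSpace(40) == (std::uint64_t{1} << 28),
              "a 1 TiB space holds 2^28 pages of 4 KiB");
static_assert(pagesInAddressSpace(64) == (std::uint64_t{1} << 52),
              "a full 64-bit space holds 2^52 pages of 4 KiB");
```

Even in the smaller 1 TiB case, an application would need hundreds of millions of pinned pages before the worst-case scenario above exhausts the range.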

And by the way, I wasn't saying that this extension is useless, just pointing out a possible problem and wondering how Nvidia intended to handle the issue.

skynet
05-01-2009, 10:29 AM
> We released 185.81 beta drivers today, including for notebook. Please use those drivers if you are going to experiment with bindless graphics.

Is GL3.1 enabled in these?

Brolingstanz
05-01-2009, 01:27 PM
Nope, but don't let that deter you ;-)

Fitz
05-02-2009, 08:33 PM
I've been playing around with this extension, but one thing I was somewhat curious about...

In the spec example for "GL_NV_vertex_buffer_unified_memory", it lists this function:

// get the address of this buffer and make it resident.
GetBufferParameteri64vNV(target,BUFFER_GPU_ADDRESS ,&gpuAddrs[i]);

But I cannot find any such function as having been added in any of these extensions, although there is a similar function called GetBufferParameterui64vNV from the "GL_NV_shader_buffer_load" extension.


I have been using this second one since it seems to more or less work... well for a few minutes, and then the driver randomly crashes. I'm creating lots and lots of these buffers, probably at least 20k, perhaps there is a limit or something? For some reason when I use occlusion culling I crash much faster, with OC in use the program lasts about 10 seconds, with it off, more like a minute or so.

jeffb
05-02-2009, 08:49 PM
Yes, it's supposed to be GetBufferParameterui64vNV (with a "u"), that's a typo.

Can you provide a repro app for the crash?

Fitz
05-02-2009, 09:01 PM
jeffb, I sure can. Do you need source or just a .exe? Where should I send it? Hopefully it is just something stupid I did...

jeffb
05-02-2009, 09:26 PM
Fitz, I sent you a private message with more info

Dark Photon
05-04-2009, 05:44 AM
> We released 185.81 beta drivers today, including for notebook. Please use those drivers if you are going to experiment with bindless graphics.

I'm probably missing something obvious, but where?

developer.nvidia.com: ver 180.51 (Apr 1) Linux
nvdeveloper.nvidia.com: ver 182.65 (Apr 29) Win

EDIT:
Never mind. Found it:
http://www.nvidia.com/Download/betadrivers.aspx aka
http://www.nvidia.com/Download/Find.aspx?lang=en-us

Just a heads-up: developer.nvidia.com->Drivers->Download Latest Drivers doesn't get you the latest drivers. Google NVidia beta drivers to get the above links.

David Weller
05-04-2009, 08:27 AM
> I'm probably missing something obvious, but where?
>
> developer.nvidia.com: ver 180.51 (Apr 1) Linux
> nvdeveloper.nvidia.com: ver 182.65 (Apr 29) Win
>
> EDIT:
> Never mind. Found it:
> http://www.nvidia.com/Download/betadrivers.aspx aka
> http://www.nvidia.com/Download/Find.aspx?lang=en-us
>
> Just a heads-up: developer.nvidia.com->Drivers->Download Latest Drivers doesn't get you the latest drivers. Google NVidia beta drivers to get the above links.

Wow. I own the NVIDIA DevZone site and didn't know about this annoyance. I'm going to change the "Drivers" link on the left side to just go to http://www.nvidia.com/Download/Find.aspx from now on. Hopefully that will help!

wizard
05-05-2009, 01:28 AM
Now we're seeing some real changes ;)

Anyway, this extension seems really cool. Might be some time until I get the time to start working on it for the GL renderer but I almost can't wait for the moment :)

We now have to push ATI/AMD to implement it also. I actually have real hopes that it will, since this is a major step forward in GPU programmability.

Dark Photon
05-05-2009, 04:47 AM
> I'm going to change the "Drivers" link on the left side to just go to http://www.nvidia.com/Download/Find.aspx from now on. Hopefully that will help!
Thanks, David

bobvodka
05-08-2009, 06:23 AM
> We now have to push ATI/AMD to implement it also. I actually have real hopes that it will, since this is a major step forward in GPU programmability.

I don't hold out much hope; they don't have as many resources as NV when it comes to developer stuff, and certainly not as many when it comes to driver dev.

The state of their dev. pages alone should be a good clue to this...

Auto
05-11-2009, 02:58 AM
Hey this looks really very interesting - hats off to NVidia.

A couple of questions:

With Shader Buffer Load the addresses are virtual, and mapped to the GPU address space at Init() - presumably it doesn't matter from a functionality perspective whether the data is in Host or V-Ram here, right? Is the idea to use the standard GL buffer API to upload the data to V-Ram prior to the draw call for speed (presumably multiple buffers of dependent fetches that might criss-cross Host<->V-Ram may not have the fastest access patterns there)?

Secondly when are you likely to be adding support for this within Cg?

Finally - I'm thinking of this firstly in terms of iterating through a light parameter buffer via a buffer of indirected pointers to affecting lights here - so a general but related question: are there any profiles that support dynamic (from constant rather than compile-time) loop iteration counters yet?

Thanks.

Jackis
05-12-2009, 09:47 AM
Does the 185.85 WHQL driver support these extensions, or should we use the beta for now?

[EDIT] Sorry, stupid question. Just installed these new drivers and saw that these extensions are supported. Thanks for such on-the-fly WHQL drivers!

wizard
05-12-2009, 02:21 PM
> Secondly when are you likely to be adding support for this within Cg?
>
> Finally - I'm thinking of this firstly in terms of iterating through a light parameter buffer via a buffer of indirected pointers to affecting lights here - so a general but related question: are there any profiles that support dynamic (from constant rather than compile-time) loop iteration counters yet?


About the loop iteration counters: You mean in Cg or in GLSL? GLSL works fine if you mean through uniforms.

Auto
05-13-2009, 06:01 AM
Yep I was referring to Cg there, but thanks - I didn't know that. I'm a bit locked in to CgFX/GL with my code so I don't use GLSL.

So am I correct in thinking it's totally uniform based and not a recompilation job prior to shader upload in the driver?

If so that's pretty interesting, though surely only some hardware (presumably post G80-class) supports that right?

wizard
05-13-2009, 06:54 AM
> So am I correct in thinking it's totally uniform based and not a recompilation job prior to shader upload in the driver?
>
> If so that's pretty interesting, though surely only some hardware (presumably post G80-class) supports that right?

G80 inclusive.

Auto
05-13-2009, 09:38 AM
Right - got you, that's what I meant, should've been clearer.

Thanks.

tamlin
05-14-2009, 08:37 AM
Kickass!

This is bringing back the old nvidia I liked - taking the lead in experimentation and actual innovation. If this is a sign of a "new" (or simply reborn) spirit - for all that is dear, don't let it be a single drive-by bullseye!

Some input on the vertex buffer spec:

If I'm to use all (remaining) space in the buffer anyway (as in the example), could perhaps -1 as "size" (last) argument to BufferAddressRangeNV work (using the size from the previous BufferData call)? I just found having to adjust the buffer size manually ("vboSizes[i]-4") to be... inelegant. Possibly also error-prone. Comments?

Would it perhaps make more sense to rename GetBufferParameterui64vNV to simply GetBufferParameterAddr, and have it expect a holder of size void*? That way it could satisfy requirements for both 32- and 64-bit platforms, without wasting the upper 32 bits for 32-bit platforms.

Are the *FormatNV functions just working names (not wanting to interleave the working code path too much with experimental stuff)? I'm not thinking of the "NV" moniker, I'm thinking "Hmmm, haven't I already seen this, even if in another incarnation, in plain VBO?".

Are there any scenarios where one could actually want to modify stride for an interleaved attribute in a single array? In the example "20" (4*sizeof(float)+4*sizeof(UBYTE)) is used in both the ColorFormatNV call, and the immediately following VertexFormatNV. Could this be simplified in some way? F.ex. a few client-side-only functions with a stride-setting call first, then a few calls to set the different buffer's offset? (just brainstorming here)

Anyway, while I just noticed this announcement and haven't had time to play with it; bloody good work!
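
The "20" discussed above can be derived from a struct instead of being typed twice. This is a minimal sketch assuming the layout implied by the "vboSizes[i]-4" adjustment, i.e. 4 color bytes followed by 4 floats; `Vertex` and the constants are hypothetical names, and no GL calls are made.

```cpp
#include <cassert>
#include <cstddef>

// Assumed interleaved layout: 4 unsigned bytes of color, then 4 floats
// of position. 4*sizeof(unsigned char) + 4*sizeof(float) = 20 bytes.
struct Vertex {
    unsigned char color[4];
    float position[4];
};

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");
static_assert(sizeof(Vertex) == 20, "layout must not contain padding");

// One stride shared by every attribute in the interleaved array.
constexpr std::size_t kStride = sizeof(Vertex);
constexpr std::size_t kColorOffset = offsetof(Vertex, color);
constexpr std::size_t kPositionOffset = offsetof(Vertex, position);

// Byte offset of one attribute of one vertex from the start of the buffer.
constexpr std::size_t attributeOffset(std::size_t vertexIndex,
                                      std::size_t attribOffset) {
    return vertexIndex * kStride + attribOffset;
}
```

With this, `kStride` would be the value passed to both format calls, and `kPositionOffset` is the 4 that gets added to the base address and subtracted from the length.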

jeffb
05-15-2009, 04:22 AM
tamlin,

> could perhaps -1 as "size" (last) argument to BufferAddressRangeNV work

A GLsizeiptr is signed, but you could use INT_MAX.

> GetBufferParameterAddr, and have it expect a holder of size void*

This is a GPU address, not a CPU address, and may be 64bit even if the CPU address space is only 32bits.

Sorry, I didn't follow the question about the *Format functions.
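
The "-1 means the rest of the buffer" convention asked about above could also be emulated in an application's own VBO wrapper rather than in the driver. A minimal sketch, assuming GLsizeiptr behaves like a signed pointer-sized integer (modelled here as `std::ptrdiff_t`); `SizeiPtr` and `resolveRangeLength` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for GLsizeiptr: a signed, pointer-sized integer.
using SizeiPtr = std::ptrdiff_t;

// Hypothetical wrapper-side helper. A negative requested length means
// "everything from offset to the end of the buffer"; any other value is
// passed through unchanged.
SizeiPtr resolveRangeLength(SizeiPtr requested, SizeiPtr bufferSize,
                            SizeiPtr offset) {
    assert(offset >= 0 && offset <= bufferSize);
    return requested < 0 ? bufferSize - offset : requested;
}
```

The result is what the wrapper would then hand to BufferAddressRangeNV as the length, so the manual "size minus offset" arithmetic lives in one place.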

skynet
05-15-2009, 05:01 AM
> Are there any scenarios where one could actually want to modify stride for an interleaved attribute in a single array? In the example "20" (4*sizeof(float)+4*sizeof(UBYTE)) is used in both the ColorFormatNV call, and the immediately following VertexFormatNV. Could this be simplified in some way? F.ex. a few client-side-only functions with a stride-setting call first, then a few calls to set the different buffer's offset? (just brainstorming here)


There are useful scenarios where you can mix interleaved and non-interleaved attributes in one VBO or have a mixture of formats spread across separate VBOs. I would not restrict this to either interleaved or linear, EXCEPT for real performance advantages (can one of the HW guys comment on this?).
I'd suggest vertex format objects instead (or put the format specification into a display list). This would allow the driver to detect "all attributes have same stride" or "all attributes are linear" etc. and then do special handling of these cases. But this only pays off _if_ the driver could take any advantage of this knowledge. But as I'm not a HW guy, I don't know...

Overmind
05-15-2009, 05:42 AM
I've finally had time to go through these specs, and it really looks cool.

I just have a question:

From the examples of NV_shader_buffer_load:
in vec4 **ptr;

glVertexAttribI2iEXT(8, (unsigned int)pointerBufferAddr, (unsigned int)(pointerBufferAddr>>32));

It seems like there is some implicit packing/unpacking going on.

Why not introduce a function like this:
glVertexAttribui64NV(8, pointerBufferAddr); (plus corresponding type specifiers for glVertexAttribFormat)
You did it with glUniformui64NV, so why not for attributes/varyings? It would make the code a lot more explicit and less confusing.
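
The implicit packing in the glVertexAttribI2iEXT call above amounts to splitting the 64-bit address into a low and a high 32-bit word. A minimal sketch of that split and its inverse; `AddressHalves`, `splitGpuAddress`, and `joinGpuAddress` are hypothetical helper names, and nothing here calls GL.

```cpp
#include <cassert>
#include <cstdint>

// The two 32-bit words passed as the two integer attribute components.
struct AddressHalves {
    std::uint32_t low;   // bits 0..31, first component
    std::uint32_t high;  // bits 32..63, second component
};

// Split a 64-bit GPU address the same way as the casts in the example.
AddressHalves splitGpuAddress(std::uint64_t address) {
    return { static_cast<std::uint32_t>(address & 0xFFFFFFFFu),
             static_cast<std::uint32_t>(address >> 32) };
}

// Inverse operation: rebuild the 64-bit address from its two words.
std::uint64_t joinGpuAddress(AddressHalves halves) {
    return (static_cast<std::uint64_t>(halves.high) << 32) | halves.low;
}
```

Naming the two halves at least makes the word order visible at the call site, which the bare casts do not.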

tamlin
05-19-2009, 10:08 AM
jeff,

I don't get these:

- "A GLsizeiptr is signed". Uh, yeah, that's why I suggested -1 instead of ~0 (a negative size makes no sense, which is why I thought -1 would fit perfectly).

- "This is a GPU address, not a CPU address". While true, the driver would still have to verify it on every use by the application (else it'd open a whole factory of worms and system crashes). Right? Would it then be too large an overhead to not only verify, but also internally perform the "32-bit user address space -> 64-bit PCI address space" translation (on 32-bit processes/operating systems)? Also, as you can't (normally) address anything outside "your" address space, how would a 64-bit space help a 32-bit app?

tamlin
05-19-2009, 10:20 AM
> There are useful scenarios where you can mix interleaved and non-interleaved attributes in one VBO
Point taken. Format objects would likely be a better long-term solution for what I had in mind.

Jan
05-19-2009, 01:35 PM
> "This is a GPU address, not a CPU address". While true, the driver would still have to verify it on every use by the application (else it'd open a whole factory of worms and system crashes).

As far as I can see, this extension DOES open up a huge can of worms. I think crashing your favorite OS will become easy again.

Jan.

Ilian Dinev
05-19-2009, 01:47 PM
The worst that can ever happen is a GPU reset, IMHO. A GPU reset is as 'damaging' and slow as changing the screen resolution.

mikeyba
06-02-2009, 12:24 PM
I've started to implement bindless graphics for my app, but I cannot find the appropriate header files or other support for the new function calls. I'm using the latest OpenGL SDK (10.52), and the latest drivers (185.85).

Do I need to go back to a beta driver (that presumably includes headers, etc)?

Thanks!
-mike

Brolingstanz
06-02-2009, 01:30 PM
You'll need to code up and grab your own extension procs for the time being.

Quick and dirty loader for tinkering...


//
// Declare stuff (just paste and comma delimit from spec)
// Needs <windows.h>, <GL/gl.h> and a glext.h that defines APIENTRYP and GLuint64EXT.
//

#define GLDECL(ret, name, ...) \
typedef ret (APIENTRYP PFN##name##PROC)(__VA_ARGS__); \
PFN##name##PROC name = NULL;

// ------------------------------------------------------------------------------------------------
// NV_shader_buffer_load
// -----------------------------------------------------------------------------------------------

// Buffer operations
GLDECL(void, glMakeBufferResidentNV, GLenum target, GLenum access);
GLDECL(void, glNamedMakeBufferResidentNV, GLuint buffer, GLenum access); // Not in beta
GLDECL(void, glMakeBufferNonResidentNV, GLenum target);
GLDECL(void, glNamedMakeBufferNonResidentNV, GLuint buffer); // Not in beta
GLDECL(GLboolean, glIsBufferResidentNV, GLenum target);
GLDECL(GLboolean, glIsNamedBufferResidentNV, GLuint buffer);
GLDECL(void, glGetBufferParameterui64vNV, GLenum target, GLenum pname, GLuint64EXT *params);
GLDECL(void, glGetNamedBufferParameterui64vNV, GLuint buffer, GLenum pname, GLuint64EXT *params);
// New Get flavor
GLDECL(void, glGetIntegerui64vNV, GLenum value, GLuint64EXT *result);
// (Named) program uniform get/set
GLDECL(void, glUniformui64NV, GLint location, GLuint64EXT value);
GLDECL(void, glUniformui64vNV, GLint location, GLsizei count, GLuint64EXT *value);
GLDECL(void, glGetUniformui64vNV, GLuint program, GLint location, GLuint64EXT *params);
GLDECL(void, glProgramUniformui64NV, GLuint program, GLint location, GLuint64EXT value);
GLDECL(void, glProgramUniformui64vNV, GLuint program, GLint location, GLsizei count, GLuint64EXT *value);

enum NV_shader_buffer_load
{
// Accepted by the <pname> parameter of GetBufferParameterui64vNV, GetNamedBufferParameterui64vNV:
GL_BUFFER_GPU_ADDRESS_NV = 0x8F1D,

// Returned by the <type> parameter of GetActiveUniform:
GL_GPU_ADDRESS_NV = 0x8F34,

// Accepted by the <value> parameter of GetIntegerui64vNV:
GL_MAX_SHADER_BUFFER_ADDRESS_NV = 0x8F35,
};

// ------------------------------------------------------------------------------------------------
// NV_vertex_buffer_unified_memory
// ------------------------------------------------------------------------------------------------

GLDECL(void, glBufferAddressRangeNV, GLenum pname, GLuint index, GLuint64EXT address, GLsizeiptr length);
GLDECL(void, glVertexFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glNormalFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glColorFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glIndexFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glTexCoordFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glEdgeFlagFormatNV, GLsizei stride);
GLDECL(void, glSecondaryColorFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glFogCoordFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glVertexAttribFormatNV, GLuint index, GLint size, GLenum type, GLboolean normalized, GLsizei stride);
GLDECL(void, glVertexAttribIFormatNV, GLuint index, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glGetIntegerui64i_vNV, GLenum value, GLuint index, GLuint64EXT result[]);

enum NV_vertex_buffer_unified_memory
{
// Accepted by the <cap> parameter of DisableClientState,
// EnableClientState, IsEnabled:
GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV = 0x8F1E,
GL_ELEMENT_ARRAY_UNIFIED_NV = 0x8F1F,
// Accepted by the <pname> parameter of BufferAddressRangeNV
// and the <value> parameter of GetIntegerui64i_vNV:
GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV = 0x8F20,
GL_TEXTURE_COORD_ARRAY_ADDRESS_NV = 0x8F25,
// Accepted by the <pname> parameter of BufferAddressRangeNV
// and the <value> parameter of GetIntegerui64vNV:
GL_VERTEX_ARRAY_ADDRESS_NV = 0x8F21,
GL_NORMAL_ARRAY_ADDRESS_NV = 0x8F22,
GL_COLOR_ARRAY_ADDRESS_NV = 0x8F23,
GL_INDEX_ARRAY_ADDRESS_NV = 0x8F24,
GL_EDGE_FLAG_ARRAY_ADDRESS_NV = 0x8F26,
GL_SECONDARY_COLOR_ARRAY_ADDRESS_NV = 0x8F27,
GL_FOG_COORD_ARRAY_ADDRESS_NV = 0x8F28,
GL_ELEMENT_ARRAY_ADDRESS_NV = 0x8F29,
// Accepted by the <target> parameter of GetIntegeri_vNV:
GL_VERTEX_ATTRIB_ARRAY_LENGTH_NV = 0x8F2A,
GL_TEXTURE_COORD_ARRAY_LENGTH_NV = 0x8F2F,
// Accepted by the <value> parameter of GetIntegerv:
GL_VERTEX_ARRAY_LENGTH_NV = 0x8F2B,
GL_NORMAL_ARRAY_LENGTH_NV = 0x8F2C,
GL_COLOR_ARRAY_LENGTH_NV = 0x8F2D,
GL_INDEX_ARRAY_LENGTH_NV = 0x8F2E,
GL_EDGE_FLAG_ARRAY_LENGTH_NV = 0x8F30,
GL_SECONDARY_COLOR_ARRAY_LENGTH_NV = 0x8F31,
GL_FOG_COORD_ARRAY_LENGTH_NV = 0x8F32,
GL_ELEMENT_ARRAY_LENGTH_NV = 0x8F33,
};

//
// Grab procs...
//

#undef GLDECL
#define GLDECL(ret, name, ...) \
name = (PFN##name##PROC)wglGetProcAddress(#name); \
if (name == NULL) std::cerr << "Missing extension: " << #name << std::endl;



// Add these to your init function (needs <iostream>; call it with a GL context current)
GLDECL(void, glMakeBufferResidentNV, GLenum target, GLenum access);
GLDECL(void, glMakeBufferNonResidentNV, GLenum target);
GLDECL(GLboolean, glIsBufferResidentNV, GLenum target);
GLDECL(void, glNamedMakeBufferResidentNV, GLuint buffer, GLenum access);
GLDECL(void, glNamedMakeBufferNonResidentNV, GLuint buffer);
GLDECL(GLboolean, glIsNamedBufferResidentNV, GLuint buffer);
GLDECL(void, glGetBufferParameterui64vNV, GLenum target, GLenum pname, GLuint64EXT *params);
GLDECL(void, glGetNamedBufferParameterui64vNV, GLuint buffer, GLenum pname, GLuint64EXT *params);
GLDECL(void, glGetIntegerui64vNV, GLenum value, GLuint64EXT *result);
GLDECL(void, glUniformui64NV, GLint location, GLuint64EXT value);
GLDECL(void, glUniformui64vNV, GLint location, GLsizei count, GLuint64EXT *value);
GLDECL(void, glGetUniformui64vNV, GLuint program, GLint location, GLuint64EXT *params);
GLDECL(void, glProgramUniformui64NV, GLuint program, GLint location, GLuint64EXT value);
GLDECL(void, glProgramUniformui64vNV, GLuint program, GLint location, GLsizei count, GLuint64EXT *value);
GLDECL(void, glBufferAddressRangeNV, GLenum pname, GLuint index, GLuint64EXT address, GLsizeiptr length);
GLDECL(void, glVertexFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glNormalFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glColorFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glIndexFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glTexCoordFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glEdgeFlagFormatNV, GLsizei stride);
GLDECL(void, glSecondaryColorFormatNV, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glFogCoordFormatNV, GLenum type, GLsizei stride);
GLDECL(void, glVertexAttribFormatNV, GLuint index, GLint size, GLenum type, GLboolean normalized, GLsizei stride);
GLDECL(void, glVertexAttribIFormatNV, GLuint index, GLint size, GLenum type, GLsizei stride);
GLDECL(void, glGetIntegerui64i_vNV, GLenum value, GLuint index, GLuint64EXT result[]);

mikeyba
06-03-2009, 06:59 AM
Thanks a lot! I'm having some compile issues with it right now, but I need to put more time into it.

-mike

mikeyba
06-03-2009, 02:42 PM
Thanks again for the code snips, Brolingstanz.

I have it working now. Yes, you can crash your card pretty easily (when you're doing things wrong, of course), and yes, the card resets pretty much instantly (at least my GTX 280 does).

I thought that I had gotten a 50% speed increase, but when I updated my old VBO code to match the test case's simplifications, I got the same performance in the end. My bottlenecks may be elsewhere.

Thanks!
-Mike

Jackis
06-08-2009, 08:31 AM
Just a question.
Does making a buffer resident mean it is now effectively located on the GPU? In other words, does it follow that resident VBOs cannot exceed VRAM?
And the second reason I'm asking: is there now any way to create a VBO that lives entirely in VRAM? Yes, at my own risk, and so on... But can we avoid having a driver-side copy of the VBO contents?

makaroni
06-16-2009, 01:44 AM
Congratulations to the NVIDIA team. Bindless rendering really improved our rendering speed, especially on systems with a slow CPU and a fast 3d graphics card. However so far we only use the glBufferAddressRangeNV functionality to speed up the VBO submission to the graphics card.

What we do not understand yet is the new way of submitting uniform variables to a shader program. Can bindless rendering be used in the following scenario:

We have one shader program and we submit transform matrices and other float or vec uniforms to this shader each time we render a triangle strip. Will bindless rendering be able to speed up this case, e.g. can we replace glUniform calls with bindless calls?

The bindless tutorial from NVIDIA is confusing in this respect. I am aware of the sample code using

loc = GetAttribLocation(pgm, "mat");
VertexAttribI2iEXT(loc, buf1Addr, buf1Addr>>32);

but I have no clue on how to use this in our case.

Can somebody help us with this issue?

JoeSSU
06-16-2009, 12:58 PM
So what about Cg support? Is there any? I saw a post on this earlier but it didn't make sense.

LangFox
06-20-2009, 01:05 AM
// enable vertex address use
EnableClientState(VERTEX_ATTRIB_ARRAY_UNIFIED_NV);


EnableClientState has been deprecated by OpenGL 3.0 and removed from OpenGL 3.1.

If I use VertexAttribFormatNV and VertexAttribIFormatNV instead of VertexFormatNV, I think this code isn't necessary. Right?

Overmind
06-20-2009, 05:50 AM
From the examples of NV_shader_buffer_load:
in vec4 **ptr;

glVertexAttribI2iEXT(8, (unsigned int)pointerBufferAddr, (unsigned int)(pointerBufferAddr>>32));

It seems like there is some implicit packing/unpacking going on.

Why not introduce a function like this:
glVertexAttribui64NV(8, pointerBufferAddr); (plus corresponding type specifiers for glVertexAttribFormat)
You did it with glUniformui64NV, so why not for attributes/varyings? It would make the code a lot more explicit and less confusing.

Is anyone from nVidia still reading this topic?

I would really like to know the answer to my question...
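
The "implicit packing" in the spec example amounts to splitting one 64-bit GPU address into two 32-bit words. A minimal sketch of that arithmetic follows; `AddressWords`, `splitAddress` and `joinAddress` are hypothetical names used only for illustration and are not part of either extension.

```cpp
#include <cstdint>

// Low and high 32-bit words of a 64-bit GPU address, in the order the spec
// example passes them to glVertexAttribI2iEXT(index, lo, hi).
struct AddressWords {
    std::uint32_t lo;
    std::uint32_t hi;
};

// Hypothetical helper: split a 64-bit address into two 32-bit words.
inline AddressWords splitAddress(std::uint64_t address) {
    return { static_cast<std::uint32_t>(address & 0xFFFFFFFFu),
             static_cast<std::uint32_t>(address >> 32) };
}

// Hypothetical helper: inverse of splitAddress, i.e. what the shader side
// has to do conceptually to recover the pointer.
inline std::uint64_t joinAddress(AddressWords w) {
    return (static_cast<std::uint64_t>(w.hi) << 32) | w.lo;
}
```

With these, the call from the example would read roughly `glVertexAttribI2iEXT(8, w.lo, w.hi)` after `AddressWords w = splitAddress(pointerBufferAddr);`, which at least makes the word order explicit at the call site.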

LangFox
06-21-2009, 02:14 AM
And I'm wondering if we could put all vertex attributes in a uniform buffer, instead of assigning them by VertexAttrib*?



uniform mat4 g_mat4_modelViewProjection;

struct VERTEX
{
    vec2 texcoord;
    vec4 position;
};
uniform VERTEX *vertex;

out vec2 s_vec2_texcoord;

void main()
{
    s_vec2_texcoord = vertex[gl_VertexID].texcoord;
    gl_Position = g_mat4_modelViewProjection * vertex[gl_VertexID].position;
}
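
If the shader reads vertices through a pointer like this, the buffer on the application side has to match the struct layout the GPU expects. Below is a rough host-side sketch. It ASSUMES that a vec2 is 8-byte aligned and a vec4 is 16-byte aligned in buffer memory; that should be checked against the layout rules in the NV_shader_buffer_load spec before relying on it. The `Vertex` type is hypothetical.

```cpp
#include <cstddef>

// Host-side mirror of the GLSL struct VERTEX from the post above.
// ASSUMPTION: vec2 is 8-byte aligned and vec4 is 16-byte aligned in buffer
// memory; verify against the NV_shader_buffer_load layout rules.
struct Vertex {
    alignas(8)  float texcoord[2];  // GLSL: vec2 texcoord
    alignas(16) float position[4];  // GLSL: vec4 position
};

// Under the assumption above, 8 bytes of padding follow texcoord.
static_assert(offsetof(Vertex, texcoord) == 0,  "texcoord starts the struct");
static_assert(offsetof(Vertex, position) == 16, "position is padded to 16 bytes");
static_assert(sizeof(Vertex) == 32,             "stride of the interleaved buffer");
```

Putting `position` first would not remove the padding under these assumptions, since the struct size is still rounded up to a multiple of 16 bytes.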

Eosie
06-21-2009, 09:39 AM
1) What kind of memory may a pointer to UBO point to? Is it constant memory, or global memory, or both?

2) If it's global memory, how can I be one-hundred percent sure that memory reads are coalesced? (this term is from CUDA)

laanwj
06-25-2009, 06:20 AM
About bindless rendering: I am unable to get any of the speed-ups mentioned.

I've made a simple program that does thousands of draw calls per frame, using normal VBO and bindless extension, and I am unable to get a speed-up using this extension. I've tried:


- render loop
  - change material A
  - render submesh A 1000 times
  - change material B
  - render submesh B 1000 times
  ...

and


- render loop
  - render 1000 times:
    - change material A
    - render submesh A
    - change material B
    - render submesh B
    ...

I've tried to render a few different meshes or only one a zillion times. I've tried with large meshes and small, simple meshes...

In all of the cases there was no performance difference at all... seems the performance gain stated by NVidia is overrated enormously. Or I'm doing something wrong. Or something is wrong with my hardware (8600GT). Does anyone have a simple GL example program that shows an actual, impressive speed-up?

Of course, this API allows doing things (somewhat) more conveniently, and using complex data structures in shaders, but the promised speed-up with simple draw calls is nowhere to be found :(

Ilian Dinev
06-25-2009, 06:48 AM
With "a few different meshes or only one" you effectively stay in L1. And the whole thing is about so many VBOs, that lookups went out of L2.

zed
06-26-2009, 02:44 AM
> but the promised speed-up with simple draw calls is nowhere to be found

Identify your current bottleneck. I assume this doesn't help fillrate at all (perhaps slightly), so try rendering to a small window, e.g. 320x240.

laanwj
06-26-2009, 02:49 AM
Thanks; I'll try with many different meshes, just fill up GPU memory a bit :) I have some other theories though:

- My graphics card is too slow. CPU has no problem keeping it occupied, even with slow draw calls. Maybe I should try on another card.

- I only looked at the rendering time (FPS). I did not look at CPU usage % or profile the draw calls. Maybe I should do that instead of looking at rendering performance.

The thing is, I'd like to use this extension in an existing rendering engine (Ogre3D), but I first want to see it actually gain something before I bother with the details.
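
One way to look at CPU cost rather than FPS is to time only the submission loop on the CPU. The sketch below is one possible shape for that. `averageSubmitMicroseconds` is a hypothetical helper, and `submit` stands in for whatever issues a single draw call in the application (VBO path or bindless path).

```cpp
#include <chrono>

// Hypothetical helper: average CPU time, in microseconds, that `submit`
// takes per call. In a real test `submit` would issue one draw call; this
// measures only the time spent inside the calls, not GPU completion.
template <typename Submit>
double averageSubmitMicroseconds(Submit submit, int iterations) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        submit();
    }
    const std::chrono::duration<double, std::micro> elapsed = Clock::now() - start;
    return iterations > 0 ? elapsed.count() / iterations : 0.0;
}
```

Running it once with the classic bind-and-draw path and once with the bindless path, with the same meshes and a small window, would show whether the per-call CPU cost differs even when the frame rate does not.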

laanwj
06-26-2009, 02:57 AM
On another note, I really like this new interface. It finally makes it possible to work with plain pointers on the GPU, a la CUDA.

Speaking about CUDA, it would be great if you could share a memory space with a CUDA program and just swap pointers between GL and back :) It would make interoperability super cheap. Or is this already possible?

Ferdi Smit
06-27-2009, 05:07 AM
> Speaking about CUDA, it would be great if you could share a memory space with a CUDA program and just swap pointers between GL and back :) It would make interoperability super cheap. Or is this already possible?

I doubt it. GL_NV_shader_buffer_load says:

18) Is the address space per-context, per-share-group, or global?

RESOLVED: It is per-share-group. Using addresses from one share group in another share group will cause undefined results.

So the pointer could be shared across shared contexts (but in that case you could already share OpenGL buffer IDs with similar effect); however, as far as I know, CUDA does not actually share a context with OpenGL (meaning they can use a different virtual address space, resulting in pointers not pointing to the same things (?) ).

I also hope(d) this was possible. For some reason the OpenGL interoperability in CUDA is very slow (Has this been fixed/addressed? I haven't checked since CUDA 2.0, I believe). Judging from the performance, it appeared these buffers are actually copied to a different address space; it would be nice to be able to directly share global device memory between CUDA/OpenCL and OpenGL through pointers, without any copying.

jeffb
06-27-2009, 02:25 PM
>EnableClientState(VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
>EnableClientState has been deprecated by OpenGL 3.0 and removed from OpenGL 3.1.
>If I use VertexAttribFormatNV and VertexAttribIFormatNV instead of VertexFormatNV, I think this code isn't necessary. Right?

I think there's some confusion here. VERTEX_ATTRIB_ARRAY_UNIFIED_NV is the "on switch" for using gpu addresses for any vertex attributes, so it's still necessary. VertexAttribFormatNV sets the same format state as VertexAttribPointer, it doesn't do anything else implicitly.

Regarding deprecation, this sounds like a missing interaction. It would make sense for Enable to accept the new tokens, similar to how primitive restart was incorporated into GL3.1. But I don't think it works that way today.

> And I'm wondering if we could put all vertex attributes in a uniform buffer, instead of assigning them by VertexAttrib*?

Based on your code snippet, I think you're asking if you can put the pointer to an interleaved vertex buffer in a uniform. Yes, that should work.

elFarto
07-30-2009, 09:31 AM
> And I'm wondering if we could put all vertex attributes in a uniform buffer, instead of assigning them by VertexAttrib*?

> Based on your code snippet, I think you're asking if you can put the pointer to an interleaved vertex buffer in a uniform. Yes, that should work.

The bindless API does look quite nice, but it seems you've missed a trick here. There's no need for the client to call all these methods to set up the layout of the buffer. LangFox touches on this point.

The simplest process is:

1. make a buffer and upload your vertex information
2. get a pointer to this buffer and update the vertex program (see LangFox's program above)
3. Draw n vertices and let the vertex program deal with retrieving the data.

All I can see that is missing is a DrawVertices(type, count) method and a DrawInstancedVertices(type, count, instances) method.

It seems you've only considered how direct access to GPU memory would help vertex arrays, but when you have direct access, you don't need vertex arrays.

Regards
elFarto

Dark Photon
02-18-2010, 05:54 AM
To help ensure that bindless never dies and hopefully gets merged into the core in some form... ;)

Just wanted to follow up here and mention that merely by using NV_vertex_buffer_unified_memory (http://developer.nvidia.com/object/bindless_graphics.html), I've seen as much as 40% reductions in draw time (about half of that is through wiring the vertex attribs and index arrays directly to GPU memory rather than fetching through bound VBO handles, and the other half is getting rid of the now-needless VBO buffer binds). This is on slow CPUs and fast ones with twice the cache, and this is on real-world database content, not contrived.

Others have seen ~50% draw time reductions (2X speedup) (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=271436#Post271436).

So in case it's not already "in the cards", please do add bindless->ARB/core to the list and/or bump it up in priority. It's free performance for the vendors, more complex content we can put in front of users (good for us and the GPU vendors), and increased demand for the latest graphics cards. Not to mention it increases the appeal of developing in OpenGL.