
View Full Version : Bindless Stuff



Jan
03-21-2010, 06:36 AM
So I tried this magical NV_vertex_buffer_unified_memory because I want my app to run 7.5x faster.

All I can produce is hard crashes, though.

So I reduced it to the case of binding only the index array via the bindless stuff and leaving the vertex-data array as-is, for now. Still, I can't get it to run.

This is the code, to setup the data:

int iIndexDataSize = iIndices * sizeof (short);
glGenBuffers (1, &IndexBufferID);
glBindBuffer (GL_ELEMENT_ARRAY_BUFFER, IndexBufferID);
glBufferData (GL_ELEMENT_ARRAY_BUFFER, iIndexDataSize, (void*) &pIndices[0], GL_STATIC_DRAW);
glGetBufferParameterui64vNV (GL_ELEMENT_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &ptrIndexArray);
glMakeBufferResidentNV (GL_ELEMENT_ARRAY_BUFFER, GL_READ_ONLY);

And this is the code to bind the buffer at runtime:

if (UseBindless) {
glEnableClientState (GL_ELEMENT_ARRAY_UNIFIED_NV);
glBufferAddressRangeNV (GL_ELEMENT_ARRAY_ADDRESS_NV, 0, ptrIndexArray, iIndexDataSize);
}
else
glBindBuffer (GL_ELEMENT_ARRAY_BUFFER, IndexBufferID);

If I understand the "documentation" (which is very shallow, IMO) correctly, that should be it, and nothing else is necessary.

However, it just crashes all the time. Any ideas?

Thanks,
Jan.

Alfonse Reinheart
03-21-2010, 11:38 AM
If I understand the "documentation" (which is very shallow, IMO) correctly, that should be it, and nothing else is necessary.

You might want to use these functions:



void VertexFormatNV(int size, enum type, sizei stride);
void NormalFormatNV(enum type, sizei stride);
void ColorFormatNV(int size, enum type, sizei stride);
void IndexFormatNV(enum type, sizei stride);
void TexCoordFormatNV(int size, enum type, sizei stride);
void EdgeFlagFormatNV(sizei stride);
void SecondaryColorFormatNV(int size, enum type, sizei stride);
void FogCoordFormatNV(enum type, sizei stride);
void VertexAttribFormatNV(uint index, int size, enum type,
boolean normalized, sizei stride);
void VertexAttribIFormatNV(uint index, int size, enum type,
sizei stride);

Jan
03-21-2010, 12:55 PM
Yes, for vertex data! But I am currently only trying it out with index data. And since even that does not work, I'd like some more information.

Jan.

Alfonse Reinheart
03-21-2010, 01:25 PM
From the spec:


It is not possible to mix vertex or attrib arrays sourced from GPU
addresses with either vertex or attrib arrays in client memory or
specified with classic VBO bindings.


If they prohibit mixing and matching VBOs with bindless for attributes, there's a good chance that they're not too keen on mixing and matching VBO attributes with bindless elements. Though it'd be nice if they actually said something about it.

Jan
03-21-2010, 02:10 PM
Hmmm, but it talks about old-style vertex arrays (GL_COLOR_ARRAY, etc.) and new-style ones (generic vertex attribs), and about data in client memory (RAM) versus data in GPU memory. As I interpret it, it does not say anything about mixing index arrays and vertex arrays.

Both my index and vertex data are in VBOs. And since there are two separate states to enable (glEnableClientState (GL_ELEMENT_ARRAY_UNIFIED_NV) and glEnableClientState (GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV)), I would guess you can mix & match at least that much.

I will try using bindless for both tomorrow, but as I understand the spec, binding only the index array or only the vertex arrays through it should be possible.

Jan.

Dark Photon
03-21-2010, 04:06 PM
So I tried this magical NV_vertex_buffer_unified_memory because I want my app to run 7.5x faster.
Yeah :p, on contrived cases.

2X is all I've gotten on real use cases, which ain't shabby at all!


So I reduced it to the case of binding only the index array via the bindless stuff and leaving the vertex-data array as-is, for now.
I personally have never tried mixing bindless and "classic" VBOs in the same batch (to use language from the extension), though I agree the presence of separate enables would otherwise imply that you should be able to do this. At the same time, the extension basically states that they don't want to support mixing "bindless" with "classic" VBOs (see Question #4) -- the goal of the extension being reducing batch overhead, and that would just increase it. But this permutation isn't spelled out exactly.

(terminology: "Bindless VBOs" being VBOs set up by GPU address, and "classic VBOs" being VBOs set up by buffer handle).


Still, I can't get it to run... Any ideas?
Nothing immediately obvious. It may be your mixing of "bindless" and "classic" VBOs within a batch, which the extension spec speaks against.

Other ideas: I can tell you when I was first playing with this the #1 cause of crashes by far was leaving the bindless enables enabled:


glEnableClientState ( GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV );
glEnableClientState ( GL_ELEMENT_ARRAY_UNIFIED_NV );
accidentally when I needed to render something with "classic" VBOs (I didn't convert everything over at once).

If you've got this in a short test program, feel free to post it and we'll figure it out. Also, if I have time this evening, I'll whip up a short working bindless batch test program and post it -- something you can morph into your usage to figure out what's up with the crash.

Also, as with all "crash" issues, have you run a CPU memory debugger (e.g. valgrind, Purify, etc.) to see if it's something obvious?

Dark Photon
03-21-2010, 06:07 PM
Ok, here's a simple little GLUT test program to illustrate how to draw a "bindless batch" (that is, draw it with NV_vertex_buffer_unified_memory (http://developer.download.nvidia.com/opengl/specs/GL_NV_vertex_buffer_unified_memory.txt) from here (http://developer.nvidia.com/object/bindless_graphics.html)). To build without bindless, comment out #define WITH_BINDLESS.

To keep it simple, I used the fixed-function pipe and legacy vertex attributes.

Anyway, here's the code.



//------------------------------------------------------------------------------
// Bindless batches example:
//
// NOTE: To keep this as stupid-simple as possible, this demo program
// doesn't use shaders, and therefore doesn't use generic vertex
// attributes (since to do so would require assuming NVidia vertex
// attribute aliasing). Of course, bindless trivially supports both.
//
//------------------------------------------------------------------------------

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glut.h>

//------------------------------------------------------------------------
// Uncomment the following to build with NV bindless batches support
// (i.e. NV_vertex_buffer_unified_memory)
//------------------------------------------------------------------------
#define WITH_BINDLESS


GLuint Vbo_handle[3]; // Positions, Colors, Index list

#ifdef WITH_BINDLESS
GLuint64EXT Vbo_addr[3]; // Positions, Colors, Index list
GLuint Vbo_size[3]; // Positions, Colors, Index list
#endif

//-----------------------------------------------------------------------

void checkGLError( const char hdr[] )
{
int err = glGetError();
if( err )
{
fprintf(stderr, "ERROR %s: %s\n", hdr, gluErrorString(err));
exit(1);
}
}

//-----------------------------------------------------------------------

void init()
{
static const GLfloat pos [] = { -1, -1,
-1, 1,
1, 1,
1, -1 };
static const GLfloat color[] = { 1,0,0,1,
1,0,0,1,
1,0,0,1,
1,0,0,1 };
static const GLushort index[] = { 0, 1, 2, 3 };

// Create and fill VBOs
glGenBuffers( 3, Vbo_handle );

GLenum gl_target = GL_ARRAY_BUFFER;

// Positions...
glBindBuffer( gl_target, Vbo_handle[0] );
glBufferData( gl_target, sizeof(pos) , pos , GL_STATIC_DRAW );

// Colors...
glBindBuffer( gl_target, Vbo_handle[1] );
glBufferData( gl_target, sizeof(color), color, GL_STATIC_DRAW );

// Index array...
gl_target = GL_ELEMENT_ARRAY_BUFFER;

glBindBuffer( gl_target, Vbo_handle[2] );
glBufferData( gl_target, sizeof(index), index, GL_STATIC_DRAW );

#ifdef WITH_BINDLESS
// Make them resident and query GPU addresses
for ( int i = 0; i < 3; i++ )
{
glBindBuffer( gl_target, Vbo_handle[i] );
glGetBufferParameterui64vNV( gl_target, GL_BUFFER_GPU_ADDRESS_NV,
                             &Vbo_addr[i] );
glMakeBufferResidentNV ( gl_target, GL_READ_ONLY );
}

Vbo_size[0] = sizeof(pos);
Vbo_size[1] = sizeof(color);
Vbo_size[2] = sizeof(index);

// Make draw calls use the GPU address VAO state, not the handle VAO state
glEnableClientState( GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV );
glEnableClientState( GL_ELEMENT_ARRAY_UNIFIED_NV );
#endif
}

//-----------------------------------------------------------------------

void reshape(int width, int height)
{
glViewport(0, 0, width, height);
}

//-----------------------------------------------------------------------

void display()
{
static float angle = 0.0;

// Clear screen
glClear( GL_COLOR_BUFFER_BIT );

// Load up PROJECTION and MODELVIEW
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(-2,2,-2,2,-2,2);

glMatrixMode(GL_MODELVIEW);
glLoadIdentity();

glRotatef(angle, 0,0,1);
angle += 0.01;

// Draw a quad
glEnableClientState( GL_VERTEX_ARRAY );
glEnableClientState( GL_COLOR_ARRAY );

#ifdef WITH_BINDLESS

// "Bindless" VBOs case
glVertexFormatNV ( 2, GL_FLOAT, 2*sizeof(GLfloat) );
glBufferAddressRangeNV( GL_VERTEX_ARRAY_ADDRESS_NV, 0,
Vbo_addr[0], Vbo_size[0]);

glColorFormatNV ( 4, GL_FLOAT, 4*sizeof(GLfloat) );
glBufferAddressRangeNV( GL_COLOR_ARRAY_ADDRESS_NV , 0,
Vbo_addr[1], Vbo_size[1]);

glBufferAddressRangeNV( GL_ELEMENT_ARRAY_ADDRESS_NV, 0,
Vbo_addr[2], Vbo_size[2] );
#else

// "Classic" VBOs case
glBindBuffer ( GL_ARRAY_BUFFER, Vbo_handle[0] );
glVertexPointer ( 2, GL_FLOAT, 2*sizeof(GLfloat), 0 );

glBindBuffer ( GL_ARRAY_BUFFER, Vbo_handle[1] );
glColorPointer ( 4, GL_FLOAT, 4*sizeof(GLfloat), 0 );

glBindBuffer ( GL_ELEMENT_ARRAY_BUFFER, Vbo_handle[2] );
#endif

glDrawElements ( GL_QUADS, 4, GL_UNSIGNED_SHORT, 0 );

// Swap
glutSwapBuffers();
glutPostRedisplay();
checkGLError( "End of display()" );
}

//-----------------------------------------------------------------------

void keyboard(unsigned char key, int x, int y)
{
switch (key) {
case 27:
exit(0);
break;
}
}

int main (int argc, char** argv)
{
glutInit(&argc, argv);
glutInitDisplayMode( GLUT_RGBA | GLUT_DOUBLE );
glutCreateWindow(argv[0]);

glutKeyboardFunc(keyboard);
glutDisplayFunc(display);
glutReshapeFunc(reshape);

glutReshapeWindow(400,400);

printf("GL_RENDERER = %s\n",glGetString(GL_RENDERER));

glClearColor(0,0,0,0);

init();

glutMainLoop();
return 0;
}


If instead you wanted to use generic vertex attributes, in display() you'd just change the attribute setup to this:


glVertexAttribFormatNV( attrib_id, width, type, norm_fixed_pt, stride );
glBufferAddressRangeNV( GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, attrib_id, gpu_addr, gpu_size );

And of course use the generic enable (e.g. glEnableVertexAttribArray) instead of the legacy attribute enables.

Dark Photon
03-21-2010, 06:21 PM
So, i reduced it to the case of binding only the index-array via bindless-stuff and leave the vertex-data array as is, for now.
I personally have never tried mixing bindless and "classic" VBOs in the same batch...
Out of curiosity, I just hacked the above little test program to do just that (bindless index VBO and "classic" attribute array VBOs), and it worked just fine.

Tested on NVidia 195.36.07.03 (Linux), their new GL 3.3 drivers (equiv: 197.15 Windows):
* NVidia 3.3 drivers (http://developer.nvidia.com/object/opengl_driver.html)

Aleksandar
03-22-2010, 04:31 AM
The specification is pretty straightforward; there are no ambiguities there. Before complaining about the extension, try to find the error in your code. Maybe the tutorial (http://developer.download.nvidia.com/opengl/tutorials/bindless_graphics.pdf) would help you.

I have implemented bindless with both fixed functionality and shaders, and it works perfectly.

I'll try to summarize the most important things:

1. Set Bindless vertex formats
(glVertexFormatNV, glNormalFormatNV, glTexCoordFormatNV, etc for fixed functionality or glVertexAttribFormatNV for generic attributes)

2. Enable the appropriate client states (GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV for vertex arrays, and
GL_ELEMENT_ARRAY_UNIFIED_NV for index arrays)

3. Create VBOs, grab the pointers and make them resident
- glBufferData
- glGetBufferParameterui64vNV
- glMakeBufferResidentNV

4. Enable vertex attributes and set the address range for the pointers
- glEnableVertexAttribArray
- glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, ...) - for attributes
- glBufferAddressRangeNV(GL_ELEMENT_ARRAY_ADDRESS_NV, ...) - for the index array

5. Finally, draw your VBOs!!!

It isn't so complicated, is it? ;)

Jan
03-22-2010, 05:32 AM
@Dark Photon: Wow, you put a lot of work into that!

And you were absolutely right: I had forgotten that another small piece of code also creates its own VBO and renders from it, and of course I did not think about disabling bindless there.

So now I've got it working. Since I am CPU bound, I was expecting a speed-up of at least 7.5x and nothing less. But so far I can't tell whether I got any speed-up at all; it doesn't look like it.

Anyway, yes, it's not that complicated to integrate, but I think it would be great if GL reset bindless to disabled whenever a glBindBuffer call is made. I mean, I had to extend libraries that don't even use bindless themselves, just to disable it, because otherwise their perfectly fine VBO code suddenly crashes. I know it's not in the spirit of OpenGL to reset states behind your back, but if I call glBindBuffer (GL_ELEMENT_...., bla blub), the driver KNOWS that I want THAT and NOT bindless. Also, why have an enable/disable at all? If I call glBufferAddressRangeNV, then I give the driver the address directly, and if I call glBindBuffer/glVertexAttribPointer, then I do it indirectly; where is the need for an enable/disable state for that?

Maybe before making something like that EXT/ARB/core, it could be streamlined a bit more, so that code that is bindless-aware and code that is unaware of it work together without problems.

Thanks for all your help,
Jan.

Dark Photon
03-22-2010, 06:21 AM
So now I've got it working. ... Anyway, yes, it's not that complicated to integrate
Glad to hear it.


...i had to extend libraries that don't even use bindless themselves, to disable it, otherwise their perfectly fine VBO code suddenly crashes.
I too had to force the bindless enables off before calling some "old, ugly VBO code" (which was mixing VBOs and client arrays -- don't ask...) that I didn't want to modify (no need to extend it, though; I just reset the bindless enables off before calling it). But this is just like any other GL state -- called code has its assumptions about the existing GL state: modelview, projection, enabled vertex attribs, bound shader, etc.


I know it's not in the spirit of OpenGL to reset states behind your back, but if I call glBindBuffer (GL_ELEMENT_...., bla blub), the driver KNOWS that I want THAT and NOT bindless.
While I don't like the idea of glBindBuffer auto-disabling flags (maybe you bound it to map/subload it, not render with it?), I definitely see the root of your point: we have draw behavior selected by background GL state, not explicit function arguments. glDraw* are all overloaded to do client arrays, classic VBOs, and bindless VBOs, and you don't tell it which it's doing. That's dictated by the presence of bound buffers or not (and now a bindless enable).

It's the same issue as for client arrays vs. classic VBOs selection, where which is done depends on whether a "global" buffer is bound to a bind point or not.

The DSA approach would suggest an additional draw-call argument to select which "mode" the draw call operates in (client arrays, classic VBOs, bindless VBOs), but this is more obtrusive and less palatable for an extension to do (particularly given the client arrays/VBO precedent).


Also, why have an enable/disable at all? If I call glBufferAddressRangeNV, then I give the driver the address directly, and if I call glBindBuffer/glVertexAttribPointer, then I do it indirectly; where is the need for an enable/disable state for that?
I see your point -- have the "GPUaddr vs. buffer handle" selector set automatically based on the last "set" call you made for that vertex attribute: gl*Pointer implies buffer handle, glBufferAddressRangeNV implies GPU address.

Only three issues I can think of:

1. This interface allows mixing classic and bindless VBOs, which they wanted to avoid.
2. For the index array, the only thing you have to go on (minus the bindless enables) is whether you last called glBufferAddressRangeNV( GL_ELEMENT_ARRAY_ADDRESS_NV, ... ) or glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ... ). Maybe you bound the buffer for reasons other than drawing with it...
3. If you're mixing bindless with classic VBOs (in separate batches, of course) and doing lazy state setting/tracking, you couldn't lazy-state across a switch between bindless and non-bindless (e.g. a lazy bind to the element array buffer). You'd have to invalidate to force the implicit selector inside GL your way. This is a very minor con, though, and only relevant for transitional code.

Considering the above concerns, I think the extension did a reasonably good job of choosing a simple, intuitive interface that can coexist with existing draw behavior.
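To make the "implicit selector" concrete: which memory a draw call reads its indices from is decided entirely by background state (the unified enable plus the last bind), not by the draw call itself. Here's a toy C model of that selection rule -- purely illustrative, these are not real GL types or entry points:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SRC_CLIENT_MEMORY, SRC_CLASSIC_VBO, SRC_BINDLESS } Source;

/* A tiny model of the GL state that the selector looks at. */
typedef struct {
    bool element_array_unified;         /* GL_ELEMENT_ARRAY_UNIFIED_NV enable */
    unsigned bound_element_buffer;      /* 0 = no element buffer bound */
    unsigned long long element_gpu_addr;
} GLStateModel;

/* Which memory does a glDrawElements-style call read indices from? */
static Source index_source(const GLStateModel *s)
{
    if (s->element_array_unified)
        return SRC_BINDLESS;            /* the enable wins, regardless of binding */
    if (s->bound_element_buffer != 0)
        return SRC_CLASSIC_VBO;
    return SRC_CLIENT_MEMORY;
}
```

Note that once the unified enable is set, a later glBindBuffer-style call changes nothing in the outcome -- which is exactly why leaving the enable on crashes otherwise-correct classic VBO code.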


Maybe before making something like that EXT/ARB/core, it could be a bit more streamlined to make code that is bindless aware and such that is unaware of it work together without problems.
Definitely worth another think to see if there's an even easier way.

Dan Bartlett
03-23-2010, 11:03 AM
I thought I'd have a quick go at testing the speed of bindless compared to other methods.
I set up the benchmark so it looped through N objects, binding the needed resources for each object and then drawing.
I used various numbers of randomly generated vertices per object (3/21/210/2100/21000/210000) and tested bindless, VAO, VBO with vertex array, plain vertex array, and immediate mode.
Each object used vertex position (4 floats) + color (4 unsigned bytes) interleaved in one buffer object (except for the vertex-array and immediate-mode cases, which don't use a buffer object), with no indexing.

It was on an NVidia GeForce 9500 card, and the results show frames/second for each technique.

Here are my results when drawing fairly large triangles:

Objects : Vertices : Bindless / VAO / VBO+VertexArray / VertexArray / Immediate
-
1      :      3 : 3100 / 3100 / 3100 / 3100 / 3100
10     :      3 : 3050 / 3050 / 3050 / 3050 / 3050
100    :      3 : 2830 / 2790 / 2820 / 2830 / 2830
1000   :      3 : 2268 /  740 / 1030 / 1575 / 2120
10000  :      3 :  492 /  102 /  154 /  244 /  392
100000 :      3 :   56 /   10 /   16 /   26 /   46
-
1      :     21 : 3070 / 3070 / 3070 / 3070 / 3070
10     :     21 : 2900 / 2900 / 2900 / 2900 / 2900
100    :     21 : 2300 / 2275 / 2290 / 2300 / 2340
1000   :     21 : 1144 /  735 / 1008 /  925 /  673
10000  :     21 :  237 /  102 /  153 /  140 /   89
100000 :     21 : 34.6 / 10.6 / 16.4 / 14.5 /  9.1
-
1      :    210 : 2890 / 2900 / 2900 / 2900 / 2900
10     :    210 : 2275 / 2275 / 2275 / 2280 / 2330
100    :    210 : 1062 / 1056 / 1060 / 1111 /  684
1000   :    210 :  248 /  248 /  248 /  210 /   92
10000  :    210 :   35 /   35 /   35 /   21 /  9.5
-
1      :   2100 : 2296 / 2300 / 2300 / 2300 / 2350
10     :   2100 : 1120 / 1120 / 1120 / 1175 /  708
100    :   2100 :  246 /  246 /  246 /  228 /   95
1000   :   2100 :   35 /   35 /   35 /   24 /  9.5
-
1      :  21000 : 1120 / 1120 / 1120 / 1175 /  708
10     :  21000 :  246 /  246 /  246 /  229 /   95
100    :  21000 :   35 /   35 /   35 /   24 /  9.7
-
1      : 210000 :  247 /  247 /  247 /  229 /   95
10     : 210000 :   35 /   35 /   35 /   24 /  9.4

Here are my results when drawing tiny triangles:

Objects : Vertices : Bindless / VAO / VBO+VertexArray / VertexArray / Immediate
-
1      :      3 : 3095 / 3095 / 3095 / 3095 / 3095
10     :      3 : 3090 / 3090 / 3090 / 3090 / 3090
100    :      3 : 3080 / 3030 / 3070 / 3080 / 3080
1000   :      3 : 2460 /  730 / 1003 / 1535 / 2131
10000  :      3 :  486 /  102 /  153 /  230 /  388
100000 :      3 :   56 / 10.6 / 16.4 /   25 /   46
-
1      :     21 : 3095 / 3095 / 3095 / 3095 / 3095
10     :     21 : 3080 / 3080 / 3080 / 3080 / 3080
100    :     21 : 3030 / 2990 / 3020 / 3000 / 2935
1000   :     21 : 2556 /  734 / 1015 /  894 /  665
10000  :     21 :  490 /  102 /  153 /  135 /   89
100000 :     21 :   56 / 10.6 / 16.4 / 14.3 /  9.1
-
1      :    210 : 3090 / 3090 / 3090 / 3090 / 3090
10     :    210 : 3040 / 3040 / 3040 / 3040 / 2970
100    :    210 : 2740 / 2710 / 2735 / 1162 /  685
1000   :    210 : 1397 /  728 / 1007 /  207 /   92
10000  :    210 :  224 /  101 /  152 /   22 /  8.6
100000 :    210 : 23.9 / 10.6 / 16.5 /  2.3 /  1.0
-
1      :   2100 : 3035 / 3035 / 3035 / 3035 / 2885
10     :   2100 : 2750 / 2745 / 2745 / 1204 /  693
100    :   2100 : 1408 / 1398 / 1404 /  221 /   94
1000   :   2100 :  240 /  240 /  240 /   24 /  9.5
10000  :   2100 : 25.7 / 25.7 / 25.7 /  2.4 /  1.0
-
1      :  21000 : 2160 / 2160 / 2160 / 1211 /  694
10     :  21000 :  554 /  554 /  554 /  222 /   94
1000   :  21000 :   65 /   65 /   65 / 24.4 /  9.8
-
1      : 210000 :  558 /  558 /  558 /  222 /   95
10     : 210000 :   65 /   65 /   65 / 24.5 /  9.8


The highest speedup I got in any of the tests was about 3.5x vs. VBOs, for lots of little objects (1 or 7 triangles per object).
I found it interesting that bindless outperforms immediate mode at 1 triangle per object, whereas buffer objects struggle in that case.
The performance of VAOs is a bit disappointing, falling off faster than plain VBOs with lots of objects, although maybe the tests didn't really exercise VAOs in situations where they would perform better.

Chris Lux
03-23-2010, 12:11 PM
I did not expect VAOs to perform that badly in some cases. Which driver version did you use, and could you run the same test on a newer card?

DmitryM
03-23-2010, 12:17 PM
I don't get it. Who cares about objects with fewer than 100 vertices? Drawing many of them without any instancing method involved is inefficient from the start.

As far as I can see, VAO performs well in a real-world scenario.

Alfonse Reinheart
03-23-2010, 12:44 PM
It was on a Nvidia GeForce 9500 card, and results show Frames/Second using each technique.

I have to question your testing techniques (and your unwillingness to properly format your results). FPS is not a useful measure; milliseconds is.

Furthermore, the problems with vertex transfer center around cache issues. Your tests don't seem to be doing anything except rendering the same object over and over. In order to have valid tests, you must:

1: Simulate running a game loop. So you need to flush the CPU cache.

2: Render multiple objects per "frame". Different buffers, different vertex formats, etc.

Dan Bartlett
03-23-2010, 02:53 PM
It was using the 197.15 drivers from http://developer.nvidia.com/object/opengl_driver.html, I don't have access to a newer card at the moment.

VAOs were in many cases the same speed as just using VBOs, possibly when limited by something else, but I'd have thought there would be situations where VAOs could provide a speed-up; I will need to try to find them at some point (more attributes, different vertex formats, etc.).

It was rendering very similar objects (all the same size and vertex format -- perhaps a bias against VAOs and towards bindless), but not the same object (every object had different buffer objects / vertex-array pointers -- almost the worst case, although they were always accessed in the order created). It was mostly to see if I could get anywhere near the 7x speedup achieved in NVidia's test case, and where/when bindless started to have an effect.

The test results are only really meaningful relative to the other results. Since it was a quick test, I just placed the code into an "OnRender" event of a "Direct OpenGL" object in a GLScene scene-graph, and rendered each "object" (erm, set of N colored, randomly generated triangles) in a for-loop, one straight after another, with nothing else happening -- no transformations/materials or anything else applied between objects. With none of the tested code running, the framerate was ~3100. Using the time taken instead of FPS would be correct for calculating speedup, but it only makes a small percentage difference to the result in the 10s-100s of frames/second range.

Dan Bartlett
03-23-2010, 04:27 PM
Bindless + VAO combined seem to be faster than VBOs, but not as fast as bindless by itself, will need to include this in tests too.

Alfonse Reinheart
03-23-2010, 04:54 PM
It was mostly to see if I could reach anywhere near the 7x speedup that was achieved in NVidia's test-case, and where/when bindless started to have an effect.

And what about the cache issues? Did you flush the cache between renders?

Also, how many objects do you render? Do you cull the triangles, so that you aren't accidentally testing rasterization time?

My point is that your benchmark isn't as specific about testing vertex batch overhead as it needs to be.

Jan
03-23-2010, 05:10 PM
I agree with Alfonse: the 7x claim is meant for applications that are heavily CPU bound due to cache misses. Unless you construct a test that generates high overhead through cache misses, bindless will not be able to speed it up that dramatically.

But still, nice to see someone put a little bit more effort into investigating the differences.

Jan.

Dan Bartlett
03-23-2010, 08:13 PM
My (Pascal) code for rendering using VAOs looks like this:


for I := 0 to Iterations - 1 do
begin
for j := 0 to NumberOfObjects - 1 do
begin
glBindVertexArray(vaoID[j]);
glDrawArrays(GL_TRIANGLES, 0, IndexDataSize[j]);
end;
end;
glBindVertexArray(0);


If I render the same "object" 10000 times per frame instead of 10000 objects once per frame, also culling the front and back faces (which gives results fairly close to the previous ones, since the triangles rendered were tiny), my results look like this:

Objects |Vertices |Iterations |Bindless |Bindless VAO |VAO |VBO |Vertex Array |Immediate
-
10000 |3 |1 |482 |182 |101 |152 |230 |385
10000 |21 |1 |482 |182 |101 |152 |134 |89
10000 |210 |1 |224 |185 |101 |152 |22 |9.6
-
1 |3 |10000 |512 |419 |234 |229 |246 |443
1 |21 |10000 |515 |420 |235 |227 |158 |98
1 |210 |10000 |224 |224 |224 |224 |33 |10.6

The "bindless VAO" code is exactly the same as VAO code, but at the build stage, bindless instructions are used instead of normal ones.



glGenVertexArrays(NumberOfObjects, @vaoBindlessID[0]);
for I := 0 to NumberOfObjects - 1 do
begin
glBindVertexArray(vaoBindlessID[i]);
glEnableClientState(GL_COLOR_ARRAY);
glEnableClientState(GL_VERTEX_ARRAY);

glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
//glEnableClientState(GL_ELEMENT_ARRAY_UNIFIED_NV);
glColorFormatNV(4, GL_UNSIGNED_BYTE, sizeof(TVertex));
glVertexFormatNV(4, GL_FLOAT, sizeof(TVertex));
glBufferAddressRangeNV(GL_VERTEX_ARRAY_ADDRESS_NV, 0, ptrVertices[i], verticesDataSize[i]);
glBufferAddressRangeNV(GL_COLOR_ARRAY_ADDRESS_NV, 0, ptrVertices[i]+16, verticesDataSize[i]-16);

CheckOpenGLError;
end;
glBindVertexArray(0);

vs


glGenVertexArrays(NumberOfObjects, @vaoID[0]);
for I := 0 to NumberOfObjects - 1 do
begin
glBindVertexArray(vaoID[i]);
glEnableClientState(GL_COLOR_ARRAY);
glEnableClientState(GL_VERTEX_ARRAY);

glBindBuffer(GL_ARRAY_BUFFER, vertexBufferID[i]);
glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(TVertex), Pointer(16));
glVertexPointer(4, GL_FLOAT, sizeof(TVertex), Pointer(0));

CheckOpenGLError;
end;
glBindVertexArray(0);


When you say flushing the (CPU) cache, do you mean some code like this, or is there a better way?



var
CacheClearer: Array[0..1024*1024-1] of byte;
i: integer;
begin
for i := 0 to Length(CacheClearer) - 1 do
CacheClearer[i] := CacheClearer[i]+1;
end;


With this after every frame, I get results like this:

Objects |Vertices |Iterations |Bindless |Bindless VAO |VAO |VBO |Vertex Array |Immediate
-
10000 |3 |1 |279 |144 |88 |124 |180 |244
10000 |21 |1 |279 |146 |88 |124 |115 |77
10000 |210 |1 |224 |146 |89 |125 |21.8 |9.4

These are not comparable to the earlier results, since now the base framerate is 550 FPS when none of the tested code is run.
This makes the rest of the frame no longer negligible in the speedup calculation (e.g. 124 vs. 279 FPS = (1/124-1/550)/(1/279-1/550) = 3.5x speedup in that part of the app, but only a 2.25x speedup overall).
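That correction can be written as a small helper: subtract the fixed per-frame overhead (measured as the base framerate with none of the tested code running) from each technique's frame time before taking the ratio. A sketch -- the function name is mine, not from the posts above:

```c
#include <assert.h>
#include <math.h>

/* Speedup of technique B over technique A in the measured part of the
 * frame only.  All arguments are framerates in FPS; base_fps is the
 * framerate with none of the tested code running, so 1/base_fps is the
 * fixed overhead subtracted from both frame times. */
static double corrected_speedup(double fps_a, double fps_b, double base_fps)
{
    double work_a = 1.0 / fps_a - 1.0 / base_fps; /* seconds spent in A */
    double work_b = 1.0 / fps_b - 1.0 / base_fps; /* seconds spent in B */
    return work_a / work_b;
}
```

With the numbers above, corrected_speedup(124, 279, 550) comes out to roughly 3.5, versus the raw 279/124 = 2.25 overall.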

Alfonse Reinheart
03-23-2010, 09:04 PM
I get results like this:

Please use code tags to make tables like that more legible.

Dark Photon
03-24-2010, 05:31 AM
It was on a Nvidia GeForce 9500 card...
Here are my results when drawing fairly large triangles
Thanks for the tests, but this test hardware and method make me suspect that you are very likely GPU limited much of the time (large triangles = lots of fill, and this is a slow card).

Where you are going to see the most benefit from bindless is where you're "not" waiting on your GPU to get the work done; you're waiting on your CPU to pump the batches. That is, cases where your GPU is fairly fast and your CPU/CPU memory is relatively slow, such that you just can't keep the GPU fed.

Also as Alfonse pointed out, for the maximum benefit, you need to be rendering a lot of "different" batches from different buffers. This maximizes your chance for cache misses, which is where bindless shines. Also, don't render super large triangles. To maximize the bindless benefit, the goal is not to be GPU limited here.


VAOs were in many cases the same speed as just using VBOs, possibly when limited by something else...
Which strongly suggests your test program is not CPU/batch submit limited for those cases, which is where you're going to get the max speed-up from bindless.


It was mostly to see if I could reach anywhere near the 7x speedup that was achieved in NVidia's test-case, and where/when bindless started to have an effect.
To maximize bindless benefit, you want a fast GPU and a relatively slow CPU/CPU mem (e.g. slow memory clock, smaller memory caches, etc.) and batches that aren't super-huge (more CPU batch setup overhead). The benefit is going to be different for different hardware, but it shouldn't ever net you a loss.

To make that test setup even uglier, run other threads on other cores sharing the same CPU caches, pushing data out of the cache and causing more cache misses. But just running enough different batches through one thread should do that too.


Bindless + VAO combined seem to be faster than VBOs, but not as fast as bindless by itself
That is my experience too. Don't stack VAOs on top of bindless -- you lose perf. Bindless gives you everything VAOs give you and more.

This may be due to the expense of having a bazillion little VAOs floating around in the driver that are otherwise each causing cache misses when accessed. Dunno. But bindless apparently avoids this overhead by letting you store nearly all of the VAO state on your side in the data structures you store your batches in, which are already in the cache at that point anyway while you're submitting draw calls.


FPS is not a useful measure; milliseconds is.
Right (emphasis mine):

* Performance (Humus) (http://www.humus.name/index.php?ID=279)
* The evils of fps (http://www.realtimerendering.com/blog/the-evils-of-fps/)

Besides including irrelevant "cruft", FPS is the inverse of time, and thus varies non-linearly with time (which is one reason it's fairly useless). For instance, the performance difference between 80 and 90 fps is actually greater (i.e. more impressive) than the performance difference between 125 and 150 fps. Why? Well, invert to seconds/frame and see. And if you have to invert to make sense of this nonsense anyway, why use FPS at all? Just use milliseconds (ms).
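The 80-vs-90 / 125-vs-150 comparison is easy to verify with a one-liner that converts the framerates back to frame times (an illustrative helper, mine, not from the posts):

```c
#include <assert.h>
#include <math.h>

/* Milliseconds of frame time gained by going from fps_lo to fps_hi. */
static double ms_gained(double fps_lo, double fps_hi)
{
    return 1000.0 / fps_lo - 1000.0 / fps_hi;
}
```

Going from 80 to 90 fps saves about 1.39 ms per frame, while going from 125 to 150 fps saves about 1.33 ms -- so the first jump really is the bigger improvement despite the smaller FPS delta.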

Dark Photon
03-24-2010, 05:56 AM
I don't get it. Who cares about objects with less than 100 vertices? Drawing many of them without any instancing method involved is not efficient from the start.
Thing is, sometimes you want to draw 1000 little boxes, or 5000 little balls, all photocopies of each other (or slight munges). In those cases, instancing shines. (...if you don't care about culling efficiency.)

But sometimes you really do want lots of varied content, and instancing is like hammering in a screw. It's not the right solution.

You want cheaper batches. And that's what bindless gives you.

It also avoids some of the contortions you end up doing to efficiently cull instances. Instances can really kill your perf through loss of frustum-culling granularity if you're not careful. Faster batches mean you can tolerate smaller instance groups, which means better frustum culling from the get-go.

Alfonse Reinheart
03-24-2010, 10:35 AM
Instances can really kill your perf through loss of frustum-culling granularity if you're not careful.

If by "careful" you mean "I frustum cull my instances", then yes.

Instancing has nothing to do with frustum culling, unless you're only thinking static instancing. In which case, you should say that.

Aleksandar
03-24-2010, 12:45 PM
I'm glad that bindless has finally achieved such attention (after one year of existence). :)

Well, I don't like generic tests because they show nothing. If someone reports a 2x speed boost in a real application, that deserves respect. Bindless can achieve that if there are thousands of VBOs, even on fast CPUs with plenty of cache.

Before going deeper into analysis, it would be useful to clarify some facts about the test.

First, is there a glFinish() call at the end of the drawing method? If there is no such call, then the results are not valid. I have a lot of experience with NVIDIA drivers on Windows, and my early tests (a few years ago) were not valid because of that.

Second, what method (function) is used to measure the time? On Windows I suggest using QueryPerformanceCounter. (Don't laugh at me; I know about the bugs on some motherboards, but those are a thing of the past, and even if you still have such a board, measuring small intervals excludes the error.)

Third, it can be useful to find the bottleneck of the application to explain the frame-rate. Currently I'm investigating debuggers/profilers for OpenGL in order to clean up my code. Those tools can really be useful. (By the way, I'm a little bit disappointed by Nexus. :( Or maybe I expected too much...)

And as for the units of the measured values, my opinion is that a pseudo-frame-rate is much better for most readers than ms. The pseudo-frame-rate is just the inverse of the number of seconds something lasts, but the measuring interval is terminated before SwapBuffers (or a similar frame terminator), and any screen synchronization routine should be eliminated.

knackered
03-24-2010, 01:14 PM
Alfonse, what is your problem? You appear to be attacking someone who is trying to help and, unless I missed a paypal debit, he's doing it for free. Measure your language or bugger off. I'm finding this useful. To everyone else, thank you for your efforts. I shall continue reading until I have something to contribute.

Dan Bartlett
03-24-2010, 05:31 PM
First, is there a glFinish() call at the end of the drawing method? If there is no such call, then the results are not valid. I have a lot of experience with NVIDIA drivers on Windows, and my early tests (a few years ago) were not valid because of that.
No, but since performance is measured over many frames, it shouldn't be needed, should it? (Well, maybe one glFinish() after the final frame. But it keeps a running total of the framerate, so calling glFinish() after every frame would hurt performance.)


Second, what method (function) is used to measure the time? On Windows I suggest using QueryPerformanceCounter. (Don't laugh at me; I know about the bugs on some motherboards, but those are a thing of the past, and even if you still have such a board, measuring small intervals excludes the error.)
It's using the built in GLScene performance monitoring code, the relevant parts look like this:



// stripped down render loop
if FFrameCount = 0 then
QueryPerformanceCounter(FFirstPerfCounter);
# render scene #
if not (roNoSwapBuffers in ContextOptions) then
RenderingContext.SwapBuffers;
Inc(FFrameCount);
QueryPerformanceCounter(perfCounter);
Dec(perfCounter, FFirstPerfCounter);

if perfCounter > 0 then
FFramesPerSecond := (FFrameCount * vCounterFrequency) / perfCounter;



TGLSceneBuffer = class(TGLUpdateAbleObject)
...
public
{: Current FramesPerSecond rendering speed.<p>
You must keep the renderer busy to get accurate figures from this
property.<br>
This is an average value, to reset the counter, call
ResetPerfomanceMonitor. }
property FramesPerSecond: Single read FFramesPerSecond;
{: Resets the perfomance monitor and begin a new statistics set.<p>
See FramesPerSecond. }
procedure ResetPerformanceMonitor;
end;



procedure TGLSceneBuffer.ResetPerformanceMonitor;
begin
FFramesPerSecond := 0;
FFrameCount := 0;
FFirstPerfCounter := 0;
end;


It's basically just counting the number of frames rendered since the performance monitor was reset, and dividing that by the total time taken from just before the first frame after the reset to just after the last frame.

Typically you'd just query "FramesPerSecond" every couple of seconds, then reset the performance monitor straight after with "ResetPerformanceMonitor()", to get a framerate display that is responsive.



FPS := GLSceneViewer1.FramesPerSecond;
GLSceneViewer1.ResetPerformanceMonitor();

Groovounet
03-24-2010, 07:41 PM
Shouldn't you use timer query?
http://www.opengl.org/registry/specs/ARB/timer_query.txt

Great job anyway, Dan; it's complicated to get this sort of thing sorted out, but it's good to have some numbers.

(I am still wondering how VAOs ended up in OpenGL 3. Who could ever see any good in this feature, designed this way? Oo)

Alfonse Reinheart
03-24-2010, 08:23 PM
I am still wondering how VAOs ended up in OpenGL 3. Who could ever see any good in this feature, designed this way?

Apple has had VAOs around for years. It seemed to work for them.

Groovounet
03-25-2010, 05:20 AM
Define "work"? :p

Alfonse Reinheart
03-25-2010, 10:32 AM
Define "work"?

It does what they wanted it to.

Aleksandar
03-25-2010, 12:15 PM
It's basically just counting the number of frames rendered since the performance monitor was reset, and dividing that by the total time taken from just before the first frame after the reset to just after the last frame.

Typically you'd just query "FramesPerSecond" every couple of seconds, then reset the performance monitor straight after with "ResetPerformanceMonitor()", to get a framerate display that is responsive.
Well, measuring the time and dividing it by the number of frames drawn in that period is not a very accurate way to discover how much time is actually spent in the drawing itself. At the very least, I don't want screen synchronization to take its share, and in real apps there can be a piece of code executing between two consecutive draws. That's why many on this forum insist on ms and not on fps. So the idea is to measure the time taken for each frame, not a time-span interval across many frames.


// E.g.
QueryPerformanceCounter(&q1);
DrawScene();
glFinish();
QueryPerformanceCounter(&q2);
time = CalcTime(q1, q2);
SwapBuffers();
// Etc...

ScottManDeath
03-25-2010, 01:19 PM
You might want to supplement the CPU time by GPU ticks, as retrieved by the EXT/ARB_time_query extension.

Dan Bartlett
03-25-2010, 03:46 PM
So you reckon this would provide more useful results?


glFinish();
glBeginQuery(GL_TIME_ELAPSED, timerQuery);
DrawScene();
glFinish();
glEndQuery(GL_TIME_ELAPSED);
glGetQueryObjectiv(timerQuery, GL_QUERY_RESULT, @timeElapsed);
// Calc average time elapsed

I assume it also requires an initial glFinish() before starting the timer query?
I'm not convinced running in a completely clean pipeline would give the most realistic results, but it will help eliminate cases where the fastest stats are limited by something else, and the lower stats are limited by what you are measuring.

Aleksandar
03-26-2010, 03:05 AM
The first glFinish is not necessary, especially if there is only one drawing thread (and the previous glFinish committed the drawing). But if you like, you can include even that. :)

Groovounet
03-26-2010, 03:37 AM
The purpose of timer query is to AVOID glFinish.

But I'm not sure it's a key point for your test.

Aleksandar
03-26-2010, 04:49 AM
It is crucial! Because we want to know the exact moment when the driver finishes the drawing, not when it accepts the command.

Groovounet
03-26-2010, 04:54 AM
That's what the timer query does...

The timer query sits in the command queue. It takes the start and end times when it is processed by the thread that processes the command queue, so you never have to stall to get accurate timing. Well, unless you call glGetQuery too soon; but it can run with a frame+1 latency with no trouble.

Aleksandar
03-26-2010, 05:07 AM
Oh, you are talking about the new GL 3.3 extension, GL_ARB_timer_query. Sorry, I didn't understand!
Well, I haven't tried it yet. And I cannot rely on it, because most cards/drivers do not implement GL 3.3.

Although it would be interesting to compare results of GL_ARB_timer_query with "the old method". ;)
Thanks for the suggestion!

Groovounet
03-26-2010, 08:03 AM
I should have quoted the extension. It didn't occur to me that Windows calls this a timer query too! Fun :p

Timer query is actually pretty old, and supported through the
GL_EXT_timer_query extension on NVIDIA hardware as far back as the GeForce 6.

Aleksandar
03-28-2010, 06:03 AM
It works perfectly! :)

But I have a minor objection to glGetQueryObjectiv(timeQuery, GL_QUERY_RESULT_AVAILABLE, &available);

It shouldn't raise an error (glGetError() -> GL_INVALID_OPERATION) if it is called before the first time-query. In order to simplify the code and avoid additional checking (as you have already said, we query the time for the previous frame if we want to skip waiting), it would be more convenient if glGetQueryObjectiv returned 0 or even -1 in the case of error.

Groovounet
03-28-2010, 06:48 AM
Do you like ping pong? :D

What you can do is actually create two timer queries:
one for the current frame's timing, and one for the previous frame, whose result you fetch at the end of the current frame. You still need an availability check this way.

An error is probably required because of:
- the precedent set by query objects (a great concept, but a mess);
- the query object probably being allocated at glBeginQuery, so you glGetQuery on nothing but a reserved name.

Aleksandar
03-28-2010, 07:53 AM
Yes, it is nice. ;)

But I don't need two timers. One is sufficient. Even one requires a separate critical section to avoid read/write conflicts (because the app can query the frame rate at will).

It is also very nice that errors are silently ignored by OpenGL. :D
So I don't have to worry about it (unless it causes some delay in the pipeline).

Dan Bartlett
03-28-2010, 07:57 AM
One thing about timer query is that (from spec):

The timer is started or stopped when the effects from all previous commands on the GL client and server state and the framebuffer have been fully realized.
and from issue (8) "How predictable/repeatable are the results returned by the timer query?" of http://www.opengl.org/registry/specs/ARB/timer_query.txt

Note that modern GPUs are generally highly pipelined, and may be
processing different primitives in different pipeline stages
simultaneously. In this extension, the timers start and stop when the
BeginQuery/EndQuery commands reach the bottom of the rendering pipeline.
What that means is that by the time the timer starts, the GL driver on
the CPU may have started work on GL commands issued after BeginQuery,
and the higher pipeline stages (e.g., vertex transformation) may have
started as well.


The most noticeable effect of this I could see was that, in certain situations when drawing a lot of stuff, the timer query results would appear much faster on a large window than on a small one, since quite a lot of the commands issued between BeginQuery + EndQuery on the current frame complete before the previous frame (+ clear) finishes and the timer query starts.
I guess that timing the same code at different places in the scene could give quite different results depending on the previously issued commands.
Putting a glFinish before starting the query gave me accurate results for both large + small windows (at the expense of FPS; for benchmarking only). No glFinish was needed at the end.

Aleksandar
03-28-2010, 08:26 AM
I have to admit that I don't understand the large/small window stuff. Of course the rendering is affected by previous commands. Furthermore, on laptops the driver lowers the clock speed if the load doesn't warrant full power (for example, the 8600M has three levels of power consumption and clock rates).

For the purpose of testing, you don't need to use the timer query. It is aimed at real-time LOD management; glFinish is enough for testing. But as I've already mentioned, you should call glFinish at the end of your drawing, or you can get up to a 3x higher frame-rate than the real one (I tried that on my application last night, and from about 115 fps (measured by timer query) I got more than 300 fps when there were no glFinish/glFlush commands at the end).

Piers Daniell
03-29-2010, 12:04 PM
For the purposes of measuring how long the DrawScene() takes in the GPU, neither glFinish() calls should be necessary. The elapsed time query is run on the GL server (in the GPU) inline with all other GL commands. So everything before the glBeginQuery() is complete when the timer starts and everything before the glEndQuery() is completed before the timer stops. Of course, you'll need some kind of synchronization with the CPU (app thread) to pick up the result.

Aleksandar
03-29-2010, 12:22 PM
I completely agree with you. glFinish is necessary only if the ARB/EXT_timer_query extension is not used (and that was the context in which I mentioned glFinish).

But, generally, GPU power management makes it very hard to determine at which core frequency the measurement is taken. Furthermore, it is not guaranteed that the frequency does not change during a particular measurement. It would be useful if GPU frequencies could be locked for the purpose of fair measuring, or for boosting the speed in some particular cases. ;)

Dan Bartlett
03-29-2010, 12:29 PM
Everything before the glBeginQuery() will be complete, but if you don't have a glFinish() before glBeginQuery(), then some of the commands issued after the glBeginQuery() may also complete before the timer starts, if the driver can finish them before the previously queued commands have completed.

eg. If you had:

task1
task2
glBeginQuery
task3
task4
task5
glEndQuery


Then if task1 + task2 take a long time, and task3 + task4 can be completed while task1 + task2 are still running, the timer query will only measure the time taken for task5 to complete, since task3 + task4 finished before task1 + task2, and the timer only starts once task1 + task2 are complete.

Alfonse Reinheart
03-29-2010, 12:42 PM
Then if task1 + task2 take a long time and task3 + task4 can be completed while task1 + task2 are still running

GPUs are required to do things in order. Thus, task3 is guaranteed to complete after task2.

Dan Bartlett
03-29-2010, 12:59 PM
GPUs are required to do things in order. Thus, task3 is guaranteed to complete after task2.
Ok, it might not complete task3 before task2 is complete, but it might have completed a good chunk of the processing needed for task3 + task4 before task2 is complete.

This is mentioned in issue(8) of: http://www.opengl.org/registry/specs/ARB/timer_query.txt

In certain situations, this can have a large impact on measured time by timer query. In one case, I was getting 50% shorter time intervals measured with a large window compared to a small window.

charliejay
03-30-2010, 05:24 AM
Then if task1 + task2 take a long time and task3 + task4 can be completed while task1 + task2 are still running

GPUs are required to do things in order. Thus, task3 is guaranteed to complete after task2.

Is it more accurate to say that GPUs are required to do things in order only when necessary? So task 3 could complete before task 2 and task 1, provided it didn't depend on the state of any of the resources they use?

Alfonse Reinheart
03-30-2010, 10:42 AM
Is it more accurate to say that GPUs are required to do things in order if necessary?

Not in any terms that the user is allowed to see. The GPU cannot report completion of task 3 before tasks 1 and 2 complete. So even if it does reorder things, you are forcibly insulated from the effects of that.

Dark Photon
07-20-2010, 02:02 PM
In the spirit of Rob's sharing, I'll throw this out there.

It's interesting to apply Rob's "streaming VBO" technique to static geometry too, and then take advantage of temporal coherence. That is, if you've uploaded a batch before and you haven't orphaned its buffer yet, then... yep, you guessed it. You don't need to upload it again. Just launch the batch from the old location, again.

In the ideal case (static/near-static scene, static/near-static viewpoint), you end up with perf that's pretty darn close to NVIDIA display lists (or bindless preloaded VBOs). Now that is sweet! :p Worst case, it's about client-arrays perf, which isn't shabby. After all, you've got to get the data to the GPU at least once (though with GL4 you could allegedly use ARB_copy_buffer and a background thread to accelerate that).

You can think of all kinds of ways to improve upon this to maximize perf (maximize cache "hits").