Bindless Stuff

So I tried this magical NV_vertex_buffer_unified_memory extension, because I want my app to run 7.5x faster.

All I can produce is hard crashes, though.

So I reduced it to the case of binding only the index array via bindless and leaving the vertex-data array as is, for now. Still, I can't get it to run.

This is the code to set up the data:

int iIndexDataSize = iIndices * sizeof (short);
glGenBuffers (1, &IndexBufferID);
glBindBuffer (GL_ELEMENT_ARRAY_BUFFER, IndexBufferID);
glBufferData (GL_ELEMENT_ARRAY_BUFFER, iIndexDataSize, (void*) &pIndices[0], GL_STATIC_DRAW);
glGetBufferParameterui64vNV (GL_ELEMENT_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &ptrIndexArray);
glMakeBufferResidentNV (GL_ELEMENT_ARRAY_BUFFER, GL_READ_ONLY);

And this is the code to bind the buffer at runtime:

if (UseBindless) {
glEnableClientState (GL_ELEMENT_ARRAY_UNIFIED_NV);
glBufferAddressRangeNV (GL_ELEMENT_ARRAY_ADDRESS_NV, 0, ptrIndexArray, iIndexDataSize);
}
else
glBindBuffer (GL_ELEMENT_ARRAY_BUFFER, IndexBufferID);

If I understand the “documentation” (which is very shallow IMO) correctly, that should be it, and nothing else is necessary.

However, it just crashes all the time. Any ideas?

Thanks,
Jan.

If I understand the “documentation” (which is very shallow IMO) correctly, that should be it, and nothing else is necessary.

You might want to use these functions:


    void VertexFormatNV(int size, enum type, sizei stride);
    void NormalFormatNV(enum type, sizei stride);
    void ColorFormatNV(int size, enum type, sizei stride);
    void IndexFormatNV(enum type, sizei stride);
    void TexCoordFormatNV(int size, enum type, sizei stride);
    void EdgeFlagFormatNV(sizei stride);
    void SecondaryColorFormatNV(int size, enum type, sizei stride);
    void FogCoordFormatNV(enum type, sizei stride);
    void VertexAttribFormatNV(uint index, int size, enum type,
                              boolean normalized, sizei stride);
    void VertexAttribIFormatNV(uint index, int size, enum type,
                               sizei stride);

Yes, for vertex data! But I am currently only trying it out with index data. And since even that does not work, I'd like some more information.

Jan.

From the spec:

If they prohibit mixing and matching VBOs with bindless for attributes, there’s a good chance that they’re not too keen on mixing and matching VBO attributes with bindless elements. Though it’d be nice if they actually said something about it.

Hmmm, but it talks about old-style vertex arrays (GL_COLOR_ARRAY, etc.) and new-style ones (generic vertex attribs), and stuff in client memory (RAM) and other stuff in GPU memory. As I interpret it, it does not say anything about mixing index arrays and vertex arrays.

Both my index and vertex data are in VBOs. And since there are two states to enable (glEnableClientState (GL_ELEMENT_ARRAY_UNIFIED_NV) and glEnableClientState (GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV)), I would guess you can mix & match at least that.

I will try using bindless for both tomorrow, but from how I understand the spec, binding only the index array or only the vertex arrays with it should be possible.

Jan.

Yeah :stuck_out_tongue: , on contrived cases.

2X is all I’ve gotten on real use cases, which ain’t shabby at all!

So I reduced it to the case of binding only the index array via bindless and leaving the vertex-data array as is, for now.

I personally have never tried mixing bindless and “classic” VBOs in the same batch (to use language from the extension), though I agree the presence of separate enables would otherwise imply that you should be able to do this. At the same time, the extension basically states that they don’t want to support mixing “bindless” with “classic” VBOs (see Question #4) – the goal of the extension being reducing batch overhead, and that would just increase it. But this permutation isn’t spelled out exactly.

(terminology: “Bindless VBOs” being VBOs set up by GPU address, and “classic VBOs” being VBOs set up by buffer handle).

Still, I can't get it to run… Any ideas?

Nothing immediately obvious. It may be your mixing of “bindless” and “classic” VBOs within a batch, which the extension spec speaks against.

Other ideas: I can tell you when I was first playing with this the #1 cause of crashes by far was leaving the bindless enables enabled:

    glEnableClientState ( GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV );
    glEnableClientState ( GL_ELEMENT_ARRAY_UNIFIED_NV       );

accidentally when I needed to render something with “classic” VBOs (I didn’t convert everything over at once).
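In code, that failure mode and the fix look roughly like this (an illustrative sketch; the two draw helpers are hypothetical stand-ins, not from any post here):

```c
/* Sketch: keep the bindless enables scoped to the bindless batches.
   Leaving them on while drawing with "classic" handle-based VBOs is
   the crash cause described above.  Both draw helpers are hypothetical. */
void renderFrame(void)
{
    /* Bindless batches: enables on, draw by GPU address */
    glEnableClientState( GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV );
    glEnableClientState( GL_ELEMENT_ARRAY_UNIFIED_NV );
    drawBindlessBatches();

    /* Back to "classic" VBOs: enables OFF first! */
    glDisableClientState( GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV );
    glDisableClientState( GL_ELEMENT_ARRAY_UNIFIED_NV );
    drawClassicVboBatches();
}
```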

If you’ve got this in a short test program, feel free to post and we’ll figure it out. Also, if I have time this evening, I’ll whip up a short working batch bindless test pgm and post it – something you can morph into your usage and figure out what’s up with the crash.

Also, as with all “crash” issues, have you run a CPU memory debugger (e.g. valgrind, Purify, etc.) to see if it’s something obvious?

Ok, here’s a simple little GLUT test program to illustrate how to draw a “bindless batch” (that is, draw it with NV_vertex_buffer_unified_memory from here). To build without bindless, comment out #define WITH_BINDLESS.

To keep it simple, I used the fixed-function pipe and legacy vertex attributes.

Anyway, here’s the code.


//------------------------------------------------------------------------------
//  Bindless batches example:
//
//    NOTE: To keep this as stupid-simple as possible, this demo pgm
//    doesn't use shaders, and therefore doesn't use generic vertex
//    attributes (since to do so would require assuming NVidia vertex
//    attribute aliasing).  Of course, bindless trivially supports both.
//
//------------------------------------------------------------------------------

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glut.h>

//------------------------------------------------------------------------
// Uncomment the following to build with NV bindless batches support
//   (i.e. NV_vertex_buffer_unified_memory)
//------------------------------------------------------------------------
#define WITH_BINDLESS


GLuint Vbo_handle[3];          // Positions, Colors, Index list

#ifdef WITH_BINDLESS
GLuint64EXT Vbo_addr[3];       // Positions, Colors, Index list
GLuint      Vbo_size[3];       // Positions, Colors, Index list
#endif

//-----------------------------------------------------------------------

void checkGLError( const char hdr[] )
{
  int err = glGetError();
  if( err )
  {
    fprintf(stderr, "ERROR %s: %s\n", hdr, gluErrorString(err));
    exit(1);
  }
}

//-----------------------------------------------------------------------

void init()
{
  static const GLfloat pos  [] = { -1, -1,
                                   -1,  1,
                                    1,  1,
                                    1, -1 };
  static const GLfloat color[]  = { 1,0,0,1,
                                    1,0,0,1,
                                    1,0,0,1,
                                    1,0,0,1 };
  static const GLushort index[] = { 0, 1, 2, 3 };

  // Create and fill VBOs
  glGenBuffers( 3, Vbo_handle );

  GLenum gl_target = GL_ARRAY_BUFFER;

  // Positions...
  glBindBuffer( gl_target, Vbo_handle[0] );
  glBufferData( gl_target, sizeof(pos)  , pos  , GL_STATIC_DRAW );

  // Colors...
  glBindBuffer( gl_target, Vbo_handle[1] );
  glBufferData( gl_target, sizeof(color), color, GL_STATIC_DRAW );

  // Index array...
  gl_target = GL_ELEMENT_ARRAY_BUFFER;

  glBindBuffer( gl_target, Vbo_handle[2] );
  glBufferData( gl_target, sizeof(index), index, GL_STATIC_DRAW );

#ifdef WITH_BINDLESS
  // Make them resident and query GPU addresses
  for ( int i = 0; i < 3; i++ )
  {
    glBindBuffer( gl_target, Vbo_handle[i] );
    glGetBufferParameterui64vNV( gl_target, GL_BUFFER_GPU_ADDRESS_NV, 
                                 &Vbo_addr[i] );
    glMakeBufferResidentNV     ( gl_target, GL_READ_ONLY );
  }

  Vbo_size[0] = sizeof(pos);
  Vbo_size[1] = sizeof(color);
  Vbo_size[2] = sizeof(index);

  // Make draw calls use the GPU address VAO state, not the handle VAO state
  glEnableClientState( GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV );
  glEnableClientState( GL_ELEMENT_ARRAY_UNIFIED_NV );
#endif
}

//-----------------------------------------------------------------------

void reshape(int width, int height)
{
  glViewport(0, 0, width, height);
}

//-----------------------------------------------------------------------

void display()
{
  static float angle = 0.0;

  // Clear screen
  glClear( GL_COLOR_BUFFER_BIT );

  // Load up PROJECTION and MODELVIEW
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity();
  glOrtho(-2,2,-2,2,-2,2);

  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();

  glRotatef(angle, 0,0,1);
  angle += 0.01;

  // Draw a quad
  glEnableClientState( GL_VERTEX_ARRAY );
  glEnableClientState( GL_COLOR_ARRAY );

#ifdef WITH_BINDLESS

  // "Bindless" VBOs case
  glVertexFormatNV      ( 2, GL_FLOAT, 2*sizeof(GLfloat) );
  glBufferAddressRangeNV( GL_VERTEX_ARRAY_ADDRESS_NV, 0, 
                          Vbo_addr[0], Vbo_size[0]);

  glColorFormatNV       ( 4, GL_FLOAT, 4*sizeof(GLfloat) );
  glBufferAddressRangeNV( GL_COLOR_ARRAY_ADDRESS_NV , 0, 
                          Vbo_addr[1], Vbo_size[1]);

  glBufferAddressRangeNV( GL_ELEMENT_ARRAY_ADDRESS_NV, 0, 
                          Vbo_addr[2], Vbo_size[2] );
#else

  // "Classic" VBOs case
  glBindBuffer       ( GL_ARRAY_BUFFER, Vbo_handle[0] );
  glVertexPointer    ( 2, GL_FLOAT, 2*sizeof(GLfloat), 0 );
  
  glBindBuffer       ( GL_ARRAY_BUFFER, Vbo_handle[1] );
  glColorPointer     ( 4, GL_FLOAT, 4*sizeof(GLfloat), 0 );

  glBindBuffer       ( GL_ELEMENT_ARRAY_BUFFER, Vbo_handle[2] );
#endif

  glDrawElements     ( GL_QUADS, 4, GL_UNSIGNED_SHORT, 0 );

  // Swap
  glutSwapBuffers();
  glutPostRedisplay();
  checkGLError( "End of display()" );
}

//-----------------------------------------------------------------------

void keyboard(unsigned char key, int x, int y)
{
   switch (key) {
      case 27:
         exit(0);
         break;
   }
}

int main (int argc, char** argv)
{
  glutInit(&argc, argv);
  glutInitDisplayMode( GLUT_RGBA | GLUT_DOUBLE );
  glutCreateWindow(argv[0]);

  glutKeyboardFunc(keyboard);
  glutDisplayFunc(display);
  glutReshapeFunc(reshape);

  glutReshapeWindow(400,400);

  printf("GL_RENDERER = %s\n", glGetString(GL_RENDERER));

  glClearColor(0,0,0,0);

  init();

  glutMainLoop();
  return 0;
}

If instead you wanted to use generic vertex attributes, in display() you’d just change the attribute setup to this:

glVertexAttribFormatNV( attrib_id, width, type, norm_fixed_pt, stride );
glBufferAddressRangeNV( GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, attrib_id, gpu_addr, gpu_size );

And of course use the generic enable (e.g. glEnableVertexAttribArray) instead of the legacy attribute enables.
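Put together, the generic-attribute path might look like this (a sketch; attribute index 0 and the 2-float format are assumptions, reusing the Vbo_addr/Vbo_size arrays from the program above):

```c
/* Sketch: generic-attribute version of the bindless attribute setup.
   Attribute index 0 and the vec2-float format are assumed examples. */
glEnableVertexAttribArray( 0 );
glVertexAttribFormatNV    ( 0, 2, GL_FLOAT, GL_FALSE, 2*sizeof(GLfloat) );
glBufferAddressRangeNV    ( GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0,
                            Vbo_addr[0], Vbo_size[0] );
```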

Out of curiosity, I just hacked the above little test program to do just that (bindless index VBO and “classic” attribute array VBOs), and it worked just fine.

Tested on NVidia 195.36.07.03 (Linux), their new GL 3.3 drivers (equivalent: 197.15 Windows).

The specification is pretty straightforward. There are no ambiguities there. Before you start complaining about the extension, try to find the error in your code. Maybe the tutorial (http://developer.download.nvidia.com/opengl/tutorials/bindless_graphics.pdf) would help you.

I have implemented bindless with both fixed functionality and shaders, and it works perfectly.

I’ll try to summarize the most important things:

1. Set bindless vertex formats
(glVertexFormatNV, glNormalFormatNV, glTexCoordFormatNV, etc. for fixed functionality, or glVertexAttribFormatNV for generic attributes)

2. Enable the adequate client states (GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV for vertex arrays, and GL_ELEMENT_ARRAY_UNIFIED_NV for index arrays)

3. Create VBOs, grab the pointers and make them resident
  • glBufferData
  • glGetBufferParameterui64vNV
  • glMakeBufferResidentNV

4. Enable vertex attributes and set the address ranges for the pointers
  • glEnableVertexAttribArray
  • glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, …) for attributes, and
  • glBufferAddressRangeNV(GL_ELEMENT_ARRAY_ADDRESS_NV, …) for the index array

5. Finally, draw your VBOs!!!

It isn’t so complicated, is it? :wink:
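The five steps above, collapsed into one minimal C sequence for a single fixed-function position array (a sketch only: verts/nbytes/nverts are placeholders, and it assumes a current GL context with both NV extensions):

```c
/* Steps 1-5 from the summary, for one position-only vertex array.
   'verts', 'nbytes' and 'nverts' are placeholders. */
GLuint      buf;
GLuint64EXT addr;

/* 1 + 2: vertex format and client-state enables */
glVertexFormatNV( 3, GL_FLOAT, 3 * sizeof(GLfloat) );
glEnableClientState( GL_VERTEX_ARRAY );
glEnableClientState( GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV );

/* 3: create the VBO, grab its GPU address, make it resident */
glGenBuffers( 1, &buf );
glBindBuffer( GL_ARRAY_BUFFER, buf );
glBufferData( GL_ARRAY_BUFFER, nbytes, verts, GL_STATIC_DRAW );
glGetBufferParameterui64vNV( GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr );
glMakeBufferResidentNV( GL_ARRAY_BUFFER, GL_READ_ONLY );

/* 4: point the vertex array at the GPU address */
glBufferAddressRangeNV( GL_VERTEX_ARRAY_ADDRESS_NV, 0, addr, nbytes );

/* 5: draw */
glDrawArrays( GL_TRIANGLES, 0, nverts );
```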

@Dark Photon: Wow, you put a lot of work into that!

And you were absolutely right, I did forget that another small piece of code also creates its own VBO and renders from it, and of course I did not think about disabling bindless there.

So now I got it working. Since I am CPU-bound, I was expecting a speed-up of at least 7.5x and nothing less. But so far I can't tell whether I got any speed-up at all; it doesn't look like it.

Anyway, yes, it's not that complicated to integrate, but I think it would be great if the GL would reset bindless to disabled when a glBindBuffer call is made. I mean, I had to extend libraries that don't even use bindless themselves to disable it; otherwise their perfectly fine VBO code suddenly crashes. I know it's not in the spirit of OpenGL to reset state behind your back, but if I call glBindBuffer (GL_ELEMENT_…, bla blub), the driver KNOWS that I want THAT and NOT bindless. Also, why have an enable/disable at all? If I call glBufferAddressRangeNV, I give the driver the address directly, and if I call glBindBuffer/glVertexAttribPointer, I do it indirectly; where is the need for an enable/disable state for that?

Maybe before making something like that EXT/ARB/core, it could be streamlined a bit more, so that code that is bindless-aware and code that is unaware of it work together without problems.

Thanks for all your help,
Jan.

Glad to hear it.

…I had to extend libraries that don't even use bindless themselves to disable it; otherwise their perfectly fine VBO code suddenly crashes.

I too had to force bindless enables off before calling some “old, ugly VBO code” (which was mixing VBOs and client arrays – don’t ask…) that I didn’t want to modify (no need to extend it though; just reset bindless enables off before I called it). But this is just like any other GL state – certain called code has its assumptions on the existing GL state. Modelview, projection, enabled vertex attribs, bound shader, etc.

I know it's not in the spirit of OpenGL to reset state behind your back, but if I call glBindBuffer (GL_ELEMENT_…, bla blub), the driver KNOWS that I want THAT and NOT bindless.

While I don’t like the idea of glBindBuffer auto-disabling flags (maybe you bound it to map/subload it, not render with it?), I definitely see the root of your point: we have draw behavior selected by background GL state, not explicit function arguments. glDraw* are all overloaded to do client arrays, classic VBOs, and bindless VBOs, and you don’t tell it which it’s doing. That’s dictated by the presence of bound buffers or not (and now a bindless enable).

It’s the same issue as for client arrays vs. classic VBOs selection, where which is done depends on whether a “global” buffer is bound to a bind point or not.

The DSA approach would suggest an additional draw-call argument to select which “mode” the draw call will operate in (client arrays, classic VBOs, bindless VBOs), but this is more obtrusive and less palatable for an extension to do (particularly given the client arrays/VBO precedent).

Also, why have an enable/disable at all? If I call glBufferAddressRangeNV, I give the driver the address directly, and if I call glBindBuffer/glVertexAttribPointer, I do it indirectly; where is the need for an enable/disable state for that?

I see your point – have the “GPU address vs. buffer handle” selector set automatically based on the last “set” call you made for that vertex attribute: gl*Pointer implies buffer handle, glBufferAddressRangeNV implies GPU address.

Only three issues I can think of:

1. this interface allows mixing classic and bindless VBOs, which they wanted to avoid,
2. for the index array, the only thing you have to go on (minus the bindless enables) is whether you last called glBufferAddressRangeNV( GL_ELEMENT_ARRAY_ADDRESS_NV, … ) or glBindBuffer ( GL_ELEMENT_ARRAY_BUFFER, … ). Maybe you bound the buffer for reasons other than drawing with it…
3. if you're mixing bindless with classic VBOs (in separate batches of course) and doing lazy state/state tracking, you couldn't lazy-state across a switch between bindless and non-bindless (e.g. lazy bind to element array buffer). You'd have to invalidate to force the implicit selector inside GL your way. This is a very minor con though, and only relevant for transitional code.
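For what it's worth, the “last set call wins” selection could be emulated app-side today with a tiny wrapper (hypothetical illustration of the idea, not actual driver behavior):

```c
/* Hypothetical wrappers: whichever "set" call was made last decides
   whether the element array is sourced by handle or by GPU address. */
static void setElementArrayByHandle( GLuint handle )
{
    glDisableClientState( GL_ELEMENT_ARRAY_UNIFIED_NV );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, handle );
}

static void setElementArrayByAddress( GLuint64EXT addr, GLsizeiptr size )
{
    glEnableClientState( GL_ELEMENT_ARRAY_UNIFIED_NV );
    glBufferAddressRangeNV( GL_ELEMENT_ARRAY_ADDRESS_NV, 0, addr, size );
}
```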

Maybe before making something like that EXT/ARB/core, it could be streamlined a bit more, so that code that is bindless-aware and code that is unaware of it work together without problems.

Definitely worth another think to see if there’s an even easier way.

I thought I'd have a quick go at testing the speed of bindless compared to other methods.
I set up the benchmark so it looped through N objects, binding the needed resources for each object and then drawing.
I used various numbers of randomly generated vertices per object (3/21/210/2100/21000/210000) and tested Bindless, VAO, VBO with Vertex Array, Vertex Array, and Immediate mode.
It used vertex position (4 floats) + color (4 unsigned bytes) interleaved in one buffer object (except for Vertex Array and Immediate mode, which don't use buffer objects), with no indexing.

It was on a Nvidia GeForce 9500 card, and results show Frames/Second using each technique.

Here are my results when drawing fairly large triangles:

Objects |Vertices |Bindless |VAO |VBO+VertexArray |VertexArray |Immediate

1 |3 |3100 |3100 |3100 |3100 |3100
10 |3 |3050 |3050 |3050 |3050 |3050
100 |3 |2830 |2790 |2820 |2830 |2830
1000 |3 |2268 |740 |1030 |1575 |2120
10000 |3 |492 |102 |154 |244 |392
100000 |3 |56 |10 |16 |26 |46

1 |21 |3070 |3070 |3070 |3070 |3070
10 |21 |2900 |2900 |2900 |2900 |2900
100 |21 |2300 |2275 |2290 |2300 |2340
1000 |21 |1144 |735 |1008 |925 |673
10000 |21 |237 |102 |153 |140 |89
100000 |21 |34.6 |10.6 |16.4 |14.5 |9.1

1 |210 |2890 |2900 |2900 |2900 |2900
10 |210 |2275 |2275 |2275 |2280 |2330
100 |210 |1062 |1056 |1060 |1111 |684
1000 |210 |248 |248 |248 |210 |92
10000 |210 |35 |35 |35 |21 |9.5

1 |2100 |2296 |2300 |2300 |2300 |2350
10 |2100 |1120 |1120 |1120 |1175 |708
100 |2100 |246 |246 |246 |228 |95
1000 |2100 |35 |35 |35 |24 |9.5

1 |21000 |1120 |1120 |1120 |1175 |708
10 |21000 |246 |246 |246 |229 |95
100 |21000 |35 |35 |35 |24 |9.7

1 |210000 |247 |247 |247 |229 |95
10 |210000 |35 |35 |35 |24 |9.4

Here are my results when drawing tiny triangles:

Objects |Vertices |Bindless |VAO |VBO+VertexArray |VertexArray |Immediate

1 |3 |3095 |3095 |3095 |3095 |3095
10 |3 |3090 |3090 |3090 |3090 |3090
100 |3 |3080 |3030 |3070 |3080 |3080
1000 |3 |2460 |730 |1003 |1535 |2131
10000 |3 |486 |102 |153 |230 |388
100000 |3 |56 |10.6 |16.4 |25 |46

1 |21 |3095 |3095 |3095 |3095 |3095
10 |21 |3080 |3080 |3080 |3080 |3080
100 |21 |3030 |2990 |3020 |3000 |2935
1000 |21 |2556 |734 |1015 |894 |665
10000 |21 |490 |102 |153 |135 |89
100000 |21 |56 |10.6 |16.4 |14.3 |9.1

1 |210 |3090 |3090 |3090 |3090 |3090
10 |210 |3040 |3040 |3040 |3040 |2970
100 |210 |2740 |2710 |2735 |1162 |685
1000 |210 |1397 |728 |1007 |207 |92
10000 |210 |224 |101 |152 |22 |8.6
100000 |210 |23.9 |10.6 |16.5 |2.3 |1.0

1 |2100 |3035 |3035 |3035 |3035 |2885
10 |2100 |2750 |2745 |2745 |1204 |693
100 |2100 |1408 |1398 |1404 |221 |94
1000 |2100 |240 |240 |240 |24 |9.5
10000 |2100 |25.7 |25.7 |25.7 |2.4 |1.0

1 |21000 |2160 |2160 |2160 |1211 |694
10 |21000 |554 |554 |554 |222 |94
1000 |21000 |65 |65 |65 |24.4 |9.8

1 |210000 |558 |558 |558 |222 |95
10 |210000 |65 |65 |65 |24.5 |9.8

The highest speedup I got in any of the tests was about 3.5x vs. VBOs, for lots of little objects (1 or 7 triangles per object).
I found it interesting that bindless outperforms immediate mode at 1 triangle per object, whereas buffer objects struggle in that case.
The performance of VAOs is a bit disappointing, with performance falling off faster than just using VBOs with lots of objects, although maybe the tests didn't really exercise VAOs in situations where they would perform better.

I did not expect VAO to perform that badly in some cases. Which driver version did you use, and could you do the same test on a newer card?

I don't get it. Who cares about objects with fewer than 100 vertices? Drawing many of them without any instancing method involved is not efficient from the start.

As far as I can see, VAO performs well in a real-case scenario.

It was on a Nvidia GeForce 9500 card, and results show Frames/Second using each technique.

I have to question your testing techniques (and your unwillingness to properly format your results). FPS is not a useful measure; milliseconds is.

Furthermore, the problems with vertex transfer center around cache issues. Your tests don’t seem to be doing anything except rendering the same object over and over. In order to have valid tests, you must:

1: Simulate running a game loop. So you need to flush the CPU cache.

2: Render multiple objects per “frame”. Different buffers, different vertex formats, etc.

It was using the 197.15 drivers from http://developer.nvidia.com/object/opengl_driver.html; I don't have access to a newer card at the moment.

VAOs were in many cases the same speed as just using VBOs, possibly when limited by something else, but I'd have thought there would be situations where VAOs could provide a speed-up. I will need to try to find them at some point (try more attributes, different vertex formats, etc.).

It was rendering very similar objects (all the same size and vertex formats – perhaps a bias against VAOs and towards bindless), but not the same object (every object had different buffer objects / vertex array pointers – almost the worst case, although they were always accessed in the order created). It was mostly to see if I could reach anywhere near the 7x speedup that was achieved in NVidia's test case, and where/when bindless started to have an effect.

The test results are only really a comparison relative to the other results. Since it was a quick test, I just placed the code into an “OnRender” event of a “Direct OpenGL” object in a GLScene scene graph, and rendered each “object” (erm, set of N colored, randomly generated triangles) in a for-loop straight after each other with nothing else happening – no transformations/materials or anything else applied between objects. With none of the tested code running, the framerate was ~3100. Using the time taken instead of FPS would be correct for calculating speedup, but it will only make a small percentage difference to the result in the 10s-100s of frames/second range.

Bindless + VAO combined seems to be faster than VBOs, but not as fast as bindless by itself; I will need to include this in the tests too.

It was mostly to see if I could reach anywhere near the 7x speedup that was achieved in NVidia’s test-case, and where/when bindless started to have an effect.

And what about the cache issues? Did you flush the cache between renders?

Also, how many objects do you render? Do you cull the triangles, so that you aren’t accidentally testing rasterization time?

My point is that your benchmark isn’t as specific about testing vertex batch overhead as it needs to be.

I agree with Alfonse; the 7x claim is meant for applications that are heavily CPU-bound due to cache misses. As long as you do not construct a test that generates high overhead through cache misses, bindless will not be able to speed things up that dramatically.

But still, nice to see someone put a little bit more effort into investigating the differences.

Jan.

My (Pascal) code for rendering using VAOs looks like this:


  for I := 0 to Iterations - 1 do
  begin
    for j := 0 to NumberOfObjects - 1 do
    begin
      glBindVertexArray(vaoID[j]);
      glDrawArrays(GL_TRIANGLES, 0, IndexDataSize[j]);
    end;
  end;
  glBindVertexArray(0);

If I render the same “object” 10000 times per frame, instead of 10000 objects once per frame, also culling the front and back faces (which gives results fairly close to the previous ones, since the triangles rendered were tiny), my results are like this:

Objects |Vertices |Iterations |Bindless |Bindless VAO |VAO |VBO |Vertex Array |Immediate

10000 |3 |1 |482 |182 |101 |152 |230 |385
10000 |21 |1 |482 |182 |101 |152 |134 |89
10000 |210 |1 |224 |185 |101 |152 |22 |9.6

1 |3 |10000 |512 |419 |234 |229 |246 |443
1 |21 |10000 |515 |420 |235 |227 |158 |98
1 |210 |10000 |224 |224 |224 |224 |33 |10.6

The “bindless VAO” code is exactly the same as the VAO code, but at the build stage, bindless instructions are used instead of the normal ones.


glGenVertexArrays(NumberOfObjects, @vaoBindlessID[0]);
for I := 0 to NumberOfObjects - 1 do
begin
  glBindVertexArray(vaoBindlessID[i]);
  glEnableClientState(GL_COLOR_ARRAY);
  glEnableClientState(GL_VERTEX_ARRAY);

  glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
  //glEnableClientState(GL_ELEMENT_ARRAY_UNIFIED_NV);
  glColorFormatNV(4, GL_UNSIGNED_BYTE, sizeof(TVertex));
  glVertexFormatNV(4, GL_FLOAT, sizeof(TVertex));
  glBufferAddressRangeNV(GL_VERTEX_ARRAY_ADDRESS_NV, 0, ptrVertices[i], verticesDataSize[i]);
  glBufferAddressRangeNV(GL_COLOR_ARRAY_ADDRESS_NV, 0, ptrVertices[i]+16, verticesDataSize[i]-16);

  CheckOpenGLError;
end;
glBindVertexArray(0);

vs


glGenVertexArrays(NumberOfObjects, @vaoID[0]);
for I := 0 to NumberOfObjects - 1 do
begin
  glBindVertexArray(vaoID[i]);
  glEnableClientState(GL_COLOR_ARRAY);
  glEnableClientState(GL_VERTEX_ARRAY);

  glBindBuffer(GL_ARRAY_BUFFER, vertexBufferID[i]);
  glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(TVertex), Pointer(16));
  glVertexPointer(4, GL_FLOAT, sizeof(TVertex), Pointer(0));

  CheckOpenGLError;
end;
glBindVertexArray(0);

When you say flushing the (CPU) cache, do you mean some code like this? Or is there a better way?


var
  CacheClearer: Array[0..1024*1024-1] of byte;
  i: integer;
begin
  for i := 0 to Length(CacheClearer) - 1 do
    CacheClearer[i] := CacheClearer[i]+1;
end;

With this after every frame, I get results like this:

Objects |Vertices |Iterations |Bindless |Bindless VAO |VAO |VBO |Vertex Array |Immediate

10000 |3 |1 |279 |144 |88 |124 |180 |244
10000 |21 |1 |279 |146 |88 |124 |115 |77
10000 |210 |1 |224 |146 |89 |125 |21.8 |9.4

These are not comparable to the earlier results, since now the base framerate is 550 FPS when none of the tested code is run.
This makes the rest of the frame no longer negligible in the speedup calculation (e.g. 124 vs. 279 FPS = (1/124-1/550)/(1/279-1/550) = a 3.5x speedup in that part of the app, but only a 2.25x speedup overall).