streaming VBO and VAO

Suppose there are two or more VAOs that all refer to the same VBO, and one switches from one of these VAOs to another. Is the performance hit smaller than that of binding a different VBO (that is, does one still benefit from the streaming approach)?

background:

http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=273141#Post273141

Say one places a number of batches into one “caching”/streaming VBO (a VBO that can be orphaned). AFAIK VAOs would make almost no sense in this case, as the VBO referenced by the vertex attribute pointers stays the same and the vertex attribute pointers usually change after orphaning. But perhaps one could still use VAOs to store the enabled vertex array state (glEnableVertexAttribArray()/glDisableVertexAttribArray()), if binding a VAO costs less than calling those GL functions.

EDIT:
What if the vertex attribute pointers were updated only if the streaming VBO was orphaned (that is, only when they are potentially invalidated)?

EDIT2:
The approach works, but I’m not so sure about perf. Any hints?

You can update the pointers (e.g. set up the VAO) only when the buffer is orphaned. To read the data back from the buffer, derive the ‘first vertex’ argument of your draw calls from the current offset into the buffer you’re streaming into.

Writing:


static GLuint streamOffset = 0;

// stream content to buffer
glBindBuffer(GL_ARRAY_BUFFER, buffers[BUFFER_VERTEX_STREAM]);

// orphan if buffer full
if(streamOffset + next_power_of_two(STREAM_DATA_SIZE) > STREAM_BUFFER_SIZE)
{
	glBufferData(GL_ARRAY_BUFFER, STREAM_BUFFER_SIZE, NULL, GL_STREAM_DRAW);   // orphan buffer
	streamOffset = 0;  // reset offset
	glBindVertexArray(vertexArrays[VERTEX_ARRAY_STREAM]); // reset vao pointers
		glBindBuffer(GL_ARRAY_BUFFER, buffers[BUFFER_VERTEX_STREAM]);
		glVertexAttribPointer(0, 3, GL_FLOAT, 0, sizeof(Vertex), BUFFER_OFFSET(0));
		glVertexAttribPointer(1, 3, GL_FLOAT, 0, sizeof(Vertex), BUFFER_OFFSET(3*sizeof(float)));
		glVertexAttribPointer(2, 2, GL_FLOAT, 0, sizeof(Vertex), BUFFER_OFFSET(6*sizeof(float)));
	glBindVertexArray(0);
}
	
// get data ptr
Vertex* vertices = (Vertex*)(glMapBufferRange( GL_ARRAY_BUFFER, 
                                               streamOffset, 
                                               next_power_of_two(STREAM_DATA_SIZE), 
                                               GL_MAP_WRITE_BIT|GL_MAP_UNSYNCHRONIZED_BIT ));

// stream data
stream_data(vertices);

// unmap buffer
glUnmapBuffer(GL_ARRAY_BUFFER);
glBindBuffer(GL_ARRAY_BUFFER, 0);

Reading:

// draw
glBindVertexArray(vertexArrays[VERTEX_ARRAY_STREAM]);
glDrawArrays(GL_TRIANGLES, streamOffset/sizeof(Vertex), STREAM_DATA_SIZE/sizeof(Vertex));
streamOffset += next_power_of_two(STREAM_DATA_SIZE); // increment offset
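
The snippets above call next_power_of_two() without showing it, and they turn a byte offset into the ‘first vertex’ argument by dividing by sizeof(Vertex). A minimal sketch of both steps follows. The helper bodies and the name first_vertex are assumptions for illustration, not code from the post. The division is only exact when the byte offset is a multiple of the vertex size, which holds here if sizeof(Vertex) is itself a power of two (the attribute layout above suggests 32 bytes, but that is also an assumption).

```cpp
#include <cstddef>

// Hypothetical helper: smallest power of two >= n (returns 1 for n == 0).
// Assumes n does not exceed the largest power of two representable in size_t.
constexpr std::size_t next_power_of_two_impl(std::size_t n, std::size_t p)
{
    return p < n ? next_power_of_two_impl(n, p << 1) : p;
}

constexpr std::size_t next_power_of_two(std::size_t n)
{
    return next_power_of_two_impl(n, 1);
}

// Hypothetical helper: converts a byte offset into the 'first' argument of
// glDrawArrays. Only exact when byte_offset is a multiple of vertex_size.
constexpr std::size_t first_vertex(std::size_t byte_offset, std::size_t vertex_size)
{
    return byte_offset / vertex_size;
}
```

With a 32-byte vertex, a batch of 100 bytes would be padded to 128 bytes, so the next batch starts at vertex 4.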

Do you have any bench results comparing the VAO approach with the non-VAO (manual state-setting) approach? I didn’t see any bench results when googling, only assertions that using VAOs “should” improve perf and I don’t see any measurable differences with the project I’m currently working on.

I don’t think using VAOs has brought any performance gain in the past, but I haven’t tested it recently, so the “fast path” might have changed.
If you’re changing quite a few attributes, then using VAOs should theoretically be better than manually doing it, since drivers could cache whether the VAO is valid, whereas every time you call glVertexAttribPointer it will need to do a number of checks.

AFAIK orphaning the buffer object doesn’t invalidate the attribute pointers stored in the VAO; they still point to the same offset in the buffer object as they did before.

If all your models share a common vertex structure, then I would use one VAO with all the attribute pointers pointing to the start of the streaming buffer object, and use the same VAO the whole time, just adjusting the enabled/disabled attributes without switching the VAO.

If you have models with different vertex layouts, then switching between different VAOs, each pointing to the start of the buffer object, may be a good idea. If the vertex sizes differ, make sure you advance the current vertex offset so that it is a multiple of the new vertex size.

Something like this (written in notepad, untested etc.):


// -------------------- set up -------------------
// set up VBO + IBO
glBindBuffer(GL_ARRAY_BUFFER, streamingVertices);
glBufferData(GL_ARRAY_BUFFER, streamingVerticesSize, NULL, GL_STREAM_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);

glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, streamingIndices);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, streamingIndicesSize, NULL, GL_STREAM_DRAW);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);

// set up VAO
glBindVertexArray(streamingVAO);
glBindBuffer(GL_ARRAY_BUFFER, streamingVertices);
// attribute pointers (vao.attribute[i].data is assumed to hold the GL component type, e.g. GL_FLOAT)
for (int i=0; i<vao.num_attributes; i++)
  glVertexAttribPointer(vao.attribute[i].index, vao.attribute[i].num_elements, vao.attribute[i].data, GL_FALSE, vao.vertexsize, BUFFER_OFFSET(vao.attribute[i].offset));
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, streamingIndices);
glBindVertexArray(0);


// -------------------- draw time -------------------
// on begin streaming draw
glBindVertexArray(streamingVAO);
glBindBuffer(GL_ARRAY_BUFFER, streamingVertices);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, streamingIndices);

int current_vertex = 0;
int current_index = 0;
int last_vao = -1;
int vertex_size;
int index_size;
GLenum index_type;

for (int i=0; i< num_dynamic_meshes; i++)
{
  if (last_vao != dynamic_mesh[i].vao)
  {
    current_vertex = increment_vertex_to_valid_position(current_vertex, last_vao, dynamic_mesh[i].vao);
    current_index = increment_index_to_valid_position(current_index, last_vao, dynamic_mesh[i].vao);
    glBindVertexArray(dynamic_mesh[i].vao);
    last_vao = dynamic_mesh[i].vao;
    vao = Get_VAO(dynamic_mesh[i].vao);
    vertex_size = vao->vertex_size; // assuming nice size, eg. 64
    index_size = vao->index_size;
    index_type = convert_to_gl_index_type(vao->index_type);
  }

  int num_vertices = dynamic_mesh[i].num_vertices_to_be_generated();
  int num_indices = dynamic_mesh[i].num_indices_to_be_generated();

  if ((current_vertex + num_vertices)*vertex_size > streamingVerticesSize)
  {
    // orphan vertex buffer
    glBufferData(GL_ARRAY_BUFFER, streamingVerticesSize, NULL, GL_STREAM_DRAW);
    current_vertex = 0;
  }

  if ((current_index + num_indices)*index_size > streamingIndicesSize)
  {
    // orphan index buffer
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, streamingIndicesSize, NULL, GL_STREAM_DRAW);
    current_index = 0;
  }

  int vertex_offset = current_vertex * vertex_size; 
  int index_offset = current_index * index_size;

  int vertex_data_length = num_vertices * vertex_size;
  int index_data_length = num_indices * index_size;

  if (dynamic_mesh[i].wants_mapped_pointers)
  {
    vertex_t *vertices = (vertex_t*)glMapBufferRange(GL_ARRAY_BUFFER, vertex_offset, vertex_data_length, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    dynamic_mesh[i].fill_vertices(vertices);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    if (dynamic_mesh[i].has_indices)
    {
      index_t *indices = (index_t*)glMapBufferRange(GL_ELEMENT_ARRAY_BUFFER, index_offset, index_data_length, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
      dynamic_mesh[i].fill_indices(indices);
      glUnmapBuffer(GL_ELEMENT_ARRAY_BUFFER);
    }
  }
  else
  {
    dynamic_mesh[i].generate_vertices();
    glBufferSubData(GL_ARRAY_BUFFER, vertex_offset, vertex_data_length, dynamic_mesh[i].vertex_data());
    if (dynamic_mesh[i].has_indices)
    {
      dynamic_mesh[i].generate_indices();
      glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, index_offset, index_data_length, dynamic_mesh[i].index_data());
    }
  }

  dynamic_mesh[i].set_enabled_attributes();
  if (dynamic_mesh[i].has_indices)
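    // indices are assumed to be mesh-local (0 .. num_vertices-1); the index argument is a byte offset into the index buffer
    glDrawRangeElementsBaseVertex(dynamic_mesh[i].mode, 0, num_vertices - 1, num_indices, index_type, BUFFER_OFFSET(current_index * index_size), current_vertex);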
  else
    glDrawArrays(dynamic_mesh[i].mode, current_vertex, num_vertices);
  
  current_vertex += num_vertices;
  current_index += num_indices;
}
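
The loop above relies on increment_vertex_to_valid_position() to keep the write position valid when the vertex layout changes. One possible sketch of that step is shown here. It takes vertex sizes in bytes rather than VAO handles, and the name align_vertex_index is hypothetical. It rounds the bytes already used up to the next multiple of the new vertex size, so that the new ‘first vertex’ addresses memory that lies behind everything written so far.

```cpp
#include <cstddef>

// Hypothetical variant of increment_vertex_to_valid_position that works on
// vertex sizes in bytes. Given how many vertices of the old layout have been
// written, returns the first vertex index of the new layout whose byte offset
// does not overlap the data written so far.
inline std::size_t align_vertex_index(std::size_t current_vertex,
                                      std::size_t old_vertex_size,
                                      std::size_t new_vertex_size)
{
    const std::size_t bytes_used = current_vertex * old_vertex_size;
    // Round up to the next multiple of the new vertex size.
    return (bytes_used + new_vertex_size - 1) / new_vertex_size;
}
```

For example, after 5 vertices of 24 bytes (120 bytes used), a 32-byte layout would continue at vertex 4, that is at byte 128. The first loop iteration, where there is no previous VAO yet, would need to be handled separately (for instance by treating the old size as equal to the new one).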

I’ve never observed drastic performance gaps between the two approaches (I assume that by non-VAO you mean the “bind a single VAO and ‘forget’ about it” approach?). I just stick to some guidelines/information I find here and there (such as this one).

_blitz: yes, I mean the “mutable VAO”.

It seems to me both of you are using your streaming VBO(s) for “true” streaming, i.e. content generated/changed on the CPU and then uploaded to the GPU every frame. That is, your VBO data is GL_STREAM_DRAW.

But Dark Photon suggested using a single large GL_DYNAMIC_DRAW VBO (for all vertex attribute/… dynamic and static data of the app) with data that may or may not be changed every frame, but orphaned once nothing more can be stuffed at the tail of the VBO (for example, because the view is moved/teleported abruptly to some odd corner of the world whose data is not yet in the VBO). If dynamic data changes, but stays the same size, one does not need to append at the tail of the VBO, but simply update a portion of the VBO.

The advantage of the single VBO approach is that no VBO binds are needed. With VAOs, however, no explicit VBO binds are needed in the rendering loop either, so I was wondering whether the VAO approach is somehow better than the single VBO approach, and whether VAOs are perf-friendly when combined with the single VBO approach. Also, as noted, one can still change some data in the single VBO and stream it to the GPU every frame, while leaving the rest unchanged.

I think using different VAOs and updating vertex attribute pointers when orphaning or manual state switching are the only options in the single, large VBO approach.

I’m pretty sure you can’t safely stream this kind of data asynchronously: how do you guarantee that you’re not writing to a region which is being read by the GPU?

Sure you can. I even wrote a class that does it. You just have to know what you’re doing.

The key thing is to know how far you’ve written into the buffer, so that if you run out of room, you can invalidate it. That way, you can never be writing to a place that’s being read.
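
The bookkeeping described here can be sketched without any GL calls. The StreamCursor type below is hypothetical, not the class mentioned above. It only tracks how far the buffer has been written and tells the caller when to orphan before the next write.

```cpp
#include <cstddef>

// CPU-side bookkeeping only; no GL calls are made here.
struct StreamCursor
{
    struct Range
    {
        std::size_t offset;   // byte offset at which to write
        bool orphan_first;    // caller should orphan the buffer before writing
    };

    explicit StreamCursor(std::size_t cap) : capacity(cap), offset(0) {}

    // Reserves 'size' bytes behind the data written so far. If they do not
    // fit, the caller is told to orphan and writing restarts at offset 0.
    // Assumes size <= capacity.
    Range reserve(std::size_t size)
    {
        const bool wrap = offset + size > capacity;
        if (wrap)
            offset = 0;
        const Range r{offset, wrap};
        offset += size;
        return r;
    }

    std::size_t capacity;  // total buffer size in bytes
    std::size_t offset;    // first byte not yet written since the last orphan
};
```

When orphan_first is set, the caller would call glBufferData() with a NULL pointer before mapping the returned range with GL_MAP_UNSYNCHRONIZED_BIT. Every range handed out between two orphans is distinct, so a write never touches a region that a pending draw call may still read.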

Your class implements the algorithm I described above; I agree, it runs smoothly asynchronously.
I just realized I misunderstood ugluk’s last post, I thought the DYNAMIC_DRAW buffer approach was about updating sub regions of a large, unique buffer holding the scene’s data.

It is about this, in addition to holding static data. In a way the VBO is a caching store. It is Dark Photon’s refinement of an earlier idea of Rob Barris’s.

I guess an example is in order:


template <std::size_t max_num_resources>
inline bool GLVBOHandler<max_num_resources>::update_resource(
  vboresourceid_type const resource_id)
{
  if (resources_[resource_id].orphan_id == orphan_id)
  {
    glBufferSubData(GL_ARRAY_BUFFER,
      GLintptr(resources_[resource_id].vbo_ptr),
      resources_[resource_id].client_size,
      resources_[resource_id].client_ptr);

    return false;
  }
  else
  {
    return load_resource(resource_id);
  }
}

Here I check whether the buffer has been orphaned since the resource was loaded. If it has, I append the resource again (load_resource()), possibly orphaning all data already present in the VBO if I run out of space; otherwise I just make the glBufferSubData() call.
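
The orphan_id check can be illustrated with a small CPU-only model. Everything in it (CacheBuffer, Resource, the meaning of the return value) is a hypothetical reconstruction for illustration and not the actual GLVBOHandler code. Each resource remembers the buffer generation in which it was loaded, and a resource from an older generation is treated as gone and appended again.

```cpp
#include <cstddef>

struct Resource
{
    std::size_t size;        // bytes of client data
    std::size_t vbo_offset;  // where the data lives in the buffer
    unsigned    generation;  // buffer generation when it was loaded (0 = never)
};

class CacheBuffer
{
public:
    explicit CacheBuffer(std::size_t capacity)
        : capacity_(capacity), tail_(0), generation_(1) {}

    // Returns false if the resource is still resident and can be overwritten
    // in place (the glBufferSubData case), true if it had to be appended at
    // the tail. Assumes r.size <= capacity.
    bool update(Resource& r)
    {
        if (r.generation == generation_)
            return false;

        if (tail_ + r.size > capacity_)
        {
            ++generation_;   // orphan: every older resource becomes stale
            tail_ = 0;
        }
        r.vbo_offset = tail_;
        r.generation = generation_;
        tail_ += r.size;
        return true;
    }

    unsigned generation() const { return generation_; }

private:
    std::size_t capacity_;
    std::size_t tail_;
    unsigned    generation_;
};
```

In this model an in-place update never moves the resource, so any vertex attribute offsets derived from vbo_offset stay valid until the next orphan.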

This works very well for me for dynamic data (and I mean it). How would orphaning work otherwise if the orphaned data was already referenced in a draw call? Or maybe this is not legitimate to do and the approach works only if the entire buffer is orphaned?

But I tested this approach on Windows, Linux, iOS, FreeBSD, … on NVIDIA and Radeon, and it works everywhere.

Alfonse: I don’t append at all, if the buffer’s subregion does not change size, only content.