Enhanced glBufferSubData

I’d like to have an enhanced version of BufferSubData(). Something like this:

void glBufferSubDataCATDOG(
  GLenum    target,  
  GLintptr  offset, 
  GLint     rcount,
  GLint     rsize,
  GLsizei   stride, 
  const GLvoid *data );

target
Specifies the target buffer object.

offset
Specifies the offset into the buffer object’s data store where data replacement will begin, measured in bytes.

rcount
Specifies the number of records to be replaced.

rsize
Specifies the size in bytes of one data record.

stride
Specifies the byte offset between consecutive records.

data
Specifies a pointer to the new data that will be copied into the data store.

CatDog

Do you have some idea of a typical workload that you would use with this?

i.e. which ranges of bytes might you be wanting to replace?

Sure. Most of the time, if not always, VBOs are sequences of data records. E.g. an interleaved vertex array:

Position0 - Normal0 - Color0 - Position1 - Normal1 - Color1 - …

What if you want to change the colors (and only the colors) dynamically?

  1. Don’t use interleaved arrays. But non-interleaved arrays are almost always slower when rendering, and they require a completely different VBO layout.

  2. Use BufferSubData. This means the driver has to upload the dirtied region as a block, including all unchanged data (Positions and Normals in this case).

My proposal lets the driver optimize this task. Obviously, this becomes more important as the number of static attributes increases.
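
For illustration, a sketch of how I would call it. The Vertex struct (3 floats position, 3 floats normal, 4 unsigned bytes color) is just an example, the entry point is of course hypothetical, and I assume the single stride applies to both the source data and the VBO:

#include <stddef.h>   /* offsetof */

typedef struct {
  GLfloat pos[3];
  GLfloat normal[3];
  GLubyte color[4];
} Vertex;

/* Replace only the colors of vertices [first, first+count) in an interleaved VBO.
   The source data uses the same record stride as the VBO. */
void UpdateColors(GLuint vbo, GLintptr first, GLint count, const Vertex *src)
{
  glBindBuffer(GL_ARRAY_BUFFER, vbo);
  glBufferSubDataCATDOG(
    GL_ARRAY_BUFFER,
    first * sizeof(Vertex) + offsetof(Vertex, color),  /* offset: first color in the VBO */
    count,                                             /* rcount: number of records      */
    4 * sizeof(GLubyte),                               /* rsize:  bytes per record       */
    sizeof(Vertex),                                    /* stride: record to record       */
    (const GLubyte *)src + offsetof(Vertex, color));   /* data:   first source color     */
}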

Well, the ultra-enhancement would be as follows:

void glBufferSubDataCATDOG_ULTRA(
  GLenum    target,  
  GLintptr  offset, 
  GLint     rcount,
  GLint     rsize,
  GLsizei   srcstride, 
  GLsizei   dststride, 
  const GLvoid *data );

srcstride
Specifies the byte offset between consecutive records within the source buffer.

dststride
Specifies the byte offset between consecutive records within the VBO.

With this, the application could maintain a tightly packed color array in RAM and upload dirtied portions of it to the interleaved vertex buffer with a single call. I would find this very, very useful.
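
Again just a sketch, using the hypothetical Vertex struct from above, this time with the colors tightly packed on the CPU side:

/* colors[i] is the new color of vertex (first + i), tightly packed in RAM. */
void UpdateColorsPacked(GLuint vbo, GLintptr first, GLint count,
                        const GLubyte (*colors)[4])
{
  glBindBuffer(GL_ARRAY_BUFFER, vbo);
  glBufferSubDataCATDOG_ULTRA(
    GL_ARRAY_BUFFER,
    first * sizeof(Vertex) + offsetof(Vertex, color),  /* offset into the VBO        */
    count,                                             /* rcount                     */
    4 * sizeof(GLubyte),                               /* rsize                      */
    4 * sizeof(GLubyte),                               /* srcstride: packed source   */
    sizeof(Vertex),                                    /* dststride: interleaved VBO */
    colors);                                           /* data                       */
}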

CatDog

“Don’t use interleaved arrays.”

This does have certain advantages.

I don’t know what happens with GL3 and beyond, but DX10 allows any sequential subset of the vertex layout (object) to be used with a vertex shader. It requires fewer objects this way, really just one if you prefer to orchestrate things that way. Plus it makes the individual streams optional, which can be more flexible for varying inputs.

I’ve always been a fan of interleaved layouts too, but the latest advances are changing things for me a bit.

… or use buffer mappings and upload only relevant parts yourself.

Your suggestions are welcome!

Zengar, I don’t understand that. Of course I want to upload only the relevant parts; that was the reason for my request. What do you mean by buffer mappings?

Modus, as long as I measure a performance drop of nearly 20% for my kind of data, non-interleaved arrays are not an option for me. I’m currently using BufferSubData for the dirtied regions of my arrays, and this turned out to be the fastest method. I think my proposal could add some flexibility and give the driver a chance to optimize.

After all, I don’t want to argue against the non-interleaved method in general. It’s just that dynamic data transport with interleaved arrays could be made better.

CatDog

OK, you want a scatter / strided bufferSubData call. I do see one possible problem though, which is that the basic unit of memory storage tends to be a cache line (the size of that line may vary depending on CPU or GPU).

So if you have this contiguous array of new color values that you want to deliver into the VBO, say you are using vertices that are somewhere between 32 and 48 bytes in size, something has to happen to take each four-byte color and deliver it to the right spot. That work is either going to be done by the CPU or by the GPU.

If the CPU is doing it… I doubt it would be any faster than just mapping the buffer and writing those fields yourself with a simple loop. In fact the loop that the API or driver might use would probably look the same. You’re going to wind up touching all those cache lines and paying that memory bandwidth price even if you only want to change 4 bytes on each one.
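
Roughly like this, assuming the interleaved Vertex struct sketched earlier in the thread (glMapBufferRange, where available, would let you map just the dirty range instead of the whole buffer):

#include <string.h>   /* memcpy */

/* Rewrite only the color field of vertices [first, first+count) via a mapping.
   Every cache line holding those vertices still gets touched. */
void UpdateColorsMapped(GLuint vbo, size_t first, size_t count,
                        const GLubyte (*colors)[4])
{
  glBindBuffer(GL_ARRAY_BUFFER, vbo);
  Vertex *v = (Vertex *)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
  if (!v)
    return;                                   /* mapping failed */
  for (size_t i = 0; i < count; ++i)
    memcpy(v[first + i].color, colors[i], 4); /* touch only the 4 color bytes of each vertex */
  glUnmapBuffer(GL_ARRAY_BUFFER);
}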

If the GPU is doing it… well, the GPU is not doing it unless the CPU told it how to do it. If the data is coming from your memory, then somehow it has to be made visible to the GPU. The CPU could copy it somewhere for GPU pickup, but after that I don’t know what the GPU can do with it, if there is any such thing as a DMA engine that can do a scatter like that.

OpenGL can allow you to have everything in two streams, say all of the vertex attributes except for one in one area (interleaved) and the more dynamic attribute in another contiguous area or even another VBO where it’s easier to modify en masse. Splitting const data from varying data seems to me, to likely be the path of least resistance based on hunches about hardware and memory organization.
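
Something like this, roughly (the generic attribute indices 0..2 and the position/normal/color layout are just placeholders):

typedef struct {
  GLfloat pos[3];
  GLfloat normal[3];
} StaticVertex;

/* Static attributes interleaved in one VBO, dynamic colors packed in a second one. */
void BindSplitStreams(GLuint staticVBO, GLuint colorVBO)
{
  glBindBuffer(GL_ARRAY_BUFFER, staticVBO);
  glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(StaticVertex),
                        (const GLvoid *)offsetof(StaticVertex, pos));
  glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(StaticVertex),
                        (const GLvoid *)offsetof(StaticVertex, normal));

  /* Dynamic part: tightly packed colors, refreshed with BufferData or
     BufferSubData whenever they change. */
  glBindBuffer(GL_ARRAY_BUFFER, colorVBO);
  glVertexAttribPointer(2, 4, GL_UNSIGNED_BYTE, GL_TRUE, 0, (const GLvoid *)0);

  glEnableVertexAttribArray(0);
  glEnableVertexAttribArray(1);
  glEnableVertexAttribArray(2);
}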

Rob, I’m sure you are right with these technical concerns. But I don’t know what the CPU or GPU are doing; I can just speculate about it. As long as the API doesn’t provide the possibility of “scattered DMA” (I like that term), nobody will use it. All I can do is say that I would use it if I could.

I invested (far too much) time in empirical studies, asking what is the best way of handling my data. And as I said: mixing streams from different VBOs, one static, one dynamic, works fine, but it is slower. Using BufferSubData on the arrays resulted in absolutely no CPU load. Obviously this is a straight operation, as you said. But mixing VBOs required the driver to do… whatever. One of my cores jumped to 100% when doing this! So the driver is reorganizing stuff in the background, maybe right before the DMA. I did not find a way to avoid this CPU load, except by using one interleaved array.

My conclusion: current hardware* likes interleaved arrays. So I’m using them, but working with interleaved arrays could be made more flexible. Hence, I wrote that suggestion. :wink:

If scattered DMA doesn’t make sense from a hardware point of view, and never will, then forget about it. The old BufferSubData works out fine then.

*) I must admit that my current hardware is not so current anymore. GeForce 7…

CatDog

On the path with two VBOs, I wouldn’t mix the types; I might suggest making both static or both dynamic, and using BufferData to fully replace the contents of one of them.

Why don’t you split your vertex data into two separate arrays, static and dynamic? Use two VBOs, one for static and one for dynamic. Upload the static one once and change the dynamic one as you need. I used that in my old code when doing character skinning on the CPU. Pos, norm, tangent and binormal are dynamic… all other attributes are static.

Never mind, I misread the post! :stuck_out_tongue:

I did that. But rendering performance dropped by 20% (see above). Or, to put it the other way round: one interleaved array was faster, and I always prefer the fastest method.

I did not try what Rob suggested: tagging both static or dynamic. Honestly, what kind of internal knowledge do you need to come up with that idea? :wink: Maybe I will try this again, but for the moment it’s ok. Anyway, “scattered DMA” would be a nice feature!

CatDog