Uniform Buffer Objects, dynamically sized arrays and lights

Hello, I’m new to this forum as a poster. I tried making the title as descriptive as possible so here goes:

I’m new to UBOs and I’ve seen many different implementations, but I don’t think any of them serves my purpose, and/or I don’t completely understand UBOs. In my application, I have a number of lights that isn’t fixed. I’d like to feed them to the GPU so I can use them in the shader code. Bonus: I’d prefer to have the lights stored on the GPU much like VBOs, where I just need to bind the buffer’s ID (GLuint) to use it. The trick is that even if this is possible, you’d have to bind several UBOs, one for each light, and assign each a certain index so that it maps to the array in the shader code. Is there any way to do this?

Right now, I’m looking at having my fragment shader like this:


struct Light {
   vec3 position;
   float padding;
};
layout (std140) uniform Lights {
   Light light[];
};

and then iterating through the lights. Dynamically sized arrays like this are possible in 4.3 as far as I know. However, I have problems in the C++ code:

Initialization:


glGenBuffers(1,&m_lightsUBO); // this generates UBO, OK
glBindBuffer(GL_UNIFORM_BUFFER, m_lightsUBO); // this binds it, OK
glBufferData(GL_UNIFORM_BUFFER, /*size*/, /*data*/, GL_DYNAMIC_DRAW); // this allocates space for the UBO. 
glBindBufferRange(GL_UNIFORM_BUFFER, /*buffer index*/, m_lightsUBO, 0, /*size*/); // this binds UBO to Buffer Index

Though I don’t need to specify the data at this point, I do need to specify a size… but I don’t know how many lights I will need. This is a run-time value, but if I move this code to a per-frame function I’ll be creating a new UBO every frame, thus defeating the purpose. Also, I’m not sure what this Buffer Index should be.
After the initialization, I can pretty much do this:

Per Frame:


glBindBuffer(GL_UNIFORM_BUFFER,m_lightsUBO);
glBufferSubData(GL_UNIFORM_BUFFER,0,data.size(),data.data()); // note: the size argument is in bytes
glBindBuffer(GL_UNIFORM_BUFFER,0);

Which means I’m actually sending data to the GPU every frame, instead of just referencing a Buffer Object.

I’d like to know if what I’m looking for is possible and if so, how to do it. If not, then I’d like to know the closest possible alternatives to what I want.

Thank you in advance!

Uniform blocks must have an explicit size specified in the shader. Shader storage blocks, however, do not; they can be unbounded in size, with the size determined dynamically from the range of the buffer object bound to the SSBO.

SSBOs are only available in GL 4.3 hardware.

Uniform blocks are probably sufficient for your needs. You can set a hard maximum on the number of lights, then simply pass the number actually in use as a uniform variable:


#define MAX_NUM_TOTAL_LIGHTS 100
struct Light {
  vec3 position;
  float padding;
};
layout (std140) uniform Lights {
  Light light[MAX_NUM_TOTAL_LIGHTS];
  int numLights;
};

So your uniform buffer would always have space for 100 lights, but you would also pass a variable that says how many to use.

In any case, you will always have some kind of maximum size in your buffer object, no matter whether you use UBOs, SSBOs, buffer textures, or some other mechanism. You don’t want to be constantly allocating buffers of arbitrary size. You allocate the maximum size, then fill it with whatever you need.
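
To make that concrete, here is a rough sketch of the one-time setup plus the per-frame update, assuming the block above, a matching C++ constant MAX_NUM_TOTAL_LIGHTS, a linked program object "program", and placeholder names like m_lightsUBO, lightData and numLights (a GLint). Under std140 each Light above occupies 16 bytes, so numLights sits right after the array:

// One-time setup: allocate the maximum size, attach the buffer to a binding
// point, and tell the program which binding point the "Lights" block uses.
const GLuint LIGHTS_BINDING_POINT = 0;                  // placeholder binding index
const GLsizeiptr lightSize  = 4 * sizeof(GLfloat);      // vec3 position + float padding (std140)
const GLsizeiptr bufferSize = MAX_NUM_TOTAL_LIGHTS * lightSize + sizeof(GLint);

glGenBuffers(1, &m_lightsUBO);
glBindBuffer(GL_UNIFORM_BUFFER, m_lightsUBO);
glBufferData(GL_UNIFORM_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, LIGHTS_BINDING_POINT, m_lightsUBO);
glUniformBlockBinding(program, glGetUniformBlockIndex(program, "Lights"), LIGHTS_BINDING_POINT);

// Per frame (or whenever the light list changes): fill only the part in use.
glBindBuffer(GL_UNIFORM_BUFFER, m_lightsUBO);
glBufferSubData(GL_UNIFORM_BUFFER, 0, numLights * lightSize, lightData);
glBufferSubData(GL_UNIFORM_BUFFER, MAX_NUM_TOTAL_LIGHTS * lightSize, sizeof(GLint), &numLights);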

I have 4.3 hardware, I’m even considering using beta 4.4 drivers.

My example was simple, but my lights will actually be about 20N in size, and I will also need an array of Materials which can also be 20N in size. The reason for having several materials is that I’ll have the material index per pixel in a second shader pass that does the light evaluation on a quad with a render-target texture. I will also be stress testing, so I didn’t really want a maximum size for the array. But even if I use your Uniform Block suggestion, do I keep a reference (GLuint) to the Buffer Objects so I can just bind them before drawing, without passing the data to the shaders unless a light is modified? What I want is an array of structs in the shader which I can bind dynamically per frame. For example, binding the C++ struct lights[i].ubo to the shader variable lights[i] on a per-frame basis, after having generated every lights[i].ubo in an initialisation stage by buffering the data in lights[i].position, lights[i].cutoff_angle and so on.

EDIT: Oh, I think I’m starting to understand what you can actually do with it. Could I, for example, allocate space for 100 lights, buffer the data into the shader, save the UBO id, and then later, when one of the lights gets updated, buffer only that light’s data into the shader without re-buffering all of the lights? I don’t want to keep pushing things from the CPU to the GPU if the GPU could already have the data, like the lights… I would prefer to change, for example, light 33 and light 66 without sending the other lights over to the shader. So what I basically want is an array of light structs in the shader, where specifically indexed lights can be updated on demand.

Should this be moved to OpenGL coding: Advanced or Beginners since it isn’t just GLSL and there seem to be more people actively helping in that section?

Maybe I’m reading you wrong, but: glBufferSubData updates a part of the buffer object. So all you have to do is declare a uniform block in the shader (with a sufficient-once-and-for-all number of lights), create an equally-sized buffer in GL, bind it to the uniform block and, whenever you change a light setting, update the part that corresponds to it.
I don’t exactly know what you mean by 20N lights though. If the N is meant to be Alfonse’s 100, then maybe a uniform buffer is too large, as the overall size of uniforms is limited. Then you’d have to use textures or images to store your light settings and get them into the shader.
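
A minimal sketch of such a partial update, assuming each light occupies 20 floats in the buffer (the 20N mentioned above) and that lightIndex and lightData are placeholders for the changed light and its new data:

// Update only the light that changed; the rest of the buffer stays untouched.
const GLsizeiptr lightStride = 20 * sizeof(GLfloat);   // 20N per light, as described above
glBindBuffer(GL_UNIFORM_BUFFER, m_lightsUBO);
glBufferSubData(GL_UNIFORM_BUFFER,
                lightIndex * lightStride,   // byte offset of the changed light
                lightStride,                // just its 20 floats
                lightData);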

N is the size of a float. In the OpenGL description of layout std140 it is said that data is stored in blocks of 4N if there is a vec3 or vec4, 2N if there is a vec2 and N if it’s just floats.

Each of my lights has 20N, that’s 5 * vec4. So if there are 100 lights, that’s 500 vec4s or 2000 floats. Same for materials, so I could be passing 4000 floats per frame to the GPU, and I really wanted to use the bus as little as possible if I already have the data there.

For materials, I really just need to collect all of my materials and send them once, unless a new material appears at run-time, which is highly unlikely. Then when collecting all the objects, I pair them with a material ID, so even if the objects have to be updated a lot (due to changes in one of the scenegraph matrices) the materials won’t change, and all I really need is the material ID passed to the shader on a per-primitive basis. For the lights, it might be as you are saying: I allocate 2000 floats in the GPU’s memory and fill them the first time, then whenever a light is updated I send the part of it that needs updating. So if light 33 gets updated I do glBufferSubData(GL_UNIFORM_BUFFER, 33*20*sizeof(GLfloat), 20*sizeof(GLfloat), data.data()); and that only sends 20 floats over the bus to the GPU, while the other lights are already stored in the GPU’s memory, right?

That’s the theory. I have to say I’m unsure about a limit on uniforms that reside in a buffer. 2000 floats would, if I remember right, have been too much for my old laptop lying around (but that one is quite a few years old). There is a limit on the number of uniform floats etc., but, as I said, I’m unsure whether those limits apply to uniform blocks at all. But I’m really no expert on the newer features of GL, as I’m mainly working with features present in 2.1. I’ve never used nor concerned myself with shader storage blocks etc. Maybe they’re a more adequate way for your needs: if shader storage blocks are, as Alfonse stated, not explicitly sized in the shader, it is highly probable that no such limits on the number of uniform floats apply to them.

For materials, I really just need to collect all of my materials and send them once, unless a new material appears at run-time which is highly unlikely.

I wouldn’t worry so much about how much data gets sent. If a buffer or texture needs to be grown by a few bytes once every 20 frames, that’s nothing. Just do partial updates. If a buffer needs to be grown, that is one allocation, one buffer copy + the update with new data. Sounds like 2 million clock-cycles at most…

2000 floats would, if I remember right, have been too much for my old laptop lying around (but that one is quite a few years old).

If it supported UBOs at all, then it is required to allow individual uniform blocks to contain at least 16KB of data. So 2000 floats is merely half of the minimum capacity; AMD supports 64KB buffers even on my old HD 3300.
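
If in doubt, the limit can simply be queried at runtime; a quick sketch:

GLint maxBlockSize = 0;
glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, &maxBlockSize);  // at least 16384 bytes per the spec
// 100 lights * 20 floats * 4 bytes = 8000 bytes, comfortably under that minimum.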

I wouldn’t worry so much about how much data gets sent. If a buffer or texture needs to be grown by a few bytes once every 20 frames, that’s nothing. Just do partial updates. If a buffer needs to be grown, that is one allocation, one buffer copy + the update with new data. Sounds like 2 million clock-cycles at most…

Which explains why the ARB just released an extension whose primary purpose is to make growing the size of a buffer object after initial creation impossible.

No, the ARB have made it abundantly clear that expanding the size of a buffer (or texture) in-situ is not a good idea.

Simply trying to grow it was my first idea before I refined my formulation :wink:
Creating a new buffer, copying the old one over and filling it up should be no problem if done once or twice a second - but I can only speak from my not-all-too hardware-demanding perspective.

And: you caught me. If I remember right, UBOs were one thing my old laptop could not handle, so I decided against using them - at least as long as I don’t feel the need to test my stuff on other hardware regularly. That thing gave up compiling shaders containing for-loops with a hostile "Shader not supported by HW:" <nothing>.

[QUOTE=Alfonse Reinheart;1253108]If it supported UBOs at all, then it is required to allow individual uniform blocks to contain at least 16KB of data. […][/QUOTE]

So what do you think of what I said? Allocating a UBO of 100 * 5 * 4 floats, binding the UBO ID to a binding point and block index so that you just need to bind it to use it. Then, when an individual light is updated, is it possible to update that specific location in the UBO, using BufferSubData like I said?

That is something you can take for granted.

What can I take for granted?

UPDATE ON THE TOPIC: I have tried SSBOs, which basically let me have a pointer to a memory location that is mapped to the SSBO data, so I can easily change the things I want individually if I must. But now I have a dilemma:

Should I use SSBOs or UBOs? From what I’ve seen, I can allocate space for the SSBO per frame depending on whether the number of active lights has changed. But isn’t this worse than just allocating a MAX value using a UBO and only filling part of it? Also, I’ve read that writing to an SSBO is slower than writing to a UBO, so shouldn’t I just allocate a really big UBO and fill it with the active lights on a per-frame basis?

From what I can tell, only sending data for the lights that have been updated might prove to be harder than I initially thought since I’d have to find a way to deal with deleted lights. So I think I’ll just stick with sending all the active lights.

What do you guys think?

What can I take for granted?

That what you said will work.

From what I’ve seen, I can allocate space for the SSBO per frame depending on whether the number of active lights has changed. But isn’t this worse than just allocating a MAX value using a UBO and only filling a part of it?

Yes, but you can do that with SSBOs too. Did you not read the part where I pointed out that, not 4 days ago, the ARB released an OpenGL feature whose primary purpose is to make it impossible to reallocate space for a buffer object?

Also, I’ve read that writing to an SSBO is slower than writing to a UBO

There is no such thing as an SSBO. Or a UBO. Or a VBO.

They are just [i]buffer objects[/i]: unformatted linear arrays of memory stored and managed by OpenGL. You can use a buffer for shader storage purposes, then turn around and use it for UBO. You can do transform feedback into a buffer, then upload that data to a texture via PBO. You can use a buffer with a buffer texture, write with image load/store to it, then use it as vertex data with glVertexAttribPointer.

All buffer objects provide the same functionality.
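
For example (a trivial sketch, assuming a GL 4.3 context), the very same buffer object can be bound for uniform and shader-storage use:

GLuint buf;
glGenBuffers(1, &buf);
glBindBuffer(GL_UNIFORM_BUFFER, buf);                         // treat it as a "UBO"...
glBufferData(GL_UNIFORM_BUFFER, 16 * 1024, NULL, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, buf);                  // ...bound to a uniform block
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buf);           // ...and to a shader storage block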

What may be slower is reading from it in your shader. UBOs will (in all likelihood) be copied into the constant local storage of your shaders, so reading will be quite fast. SSBOs are basically just a nice form of Image Load/Store via buffer textures, so they’re treated like global memory accesses.

[QUOTE=Alfonse Reinheart;1253150]That what you said will work. […] What may be slower is reading from it in your shader. […][/QUOTE]

So UBOs should be faster to read from in my shader? Since, as you said, it is impossible to reallocate space for a Buffer Object, I will have to allocate space for, say, 100 lights whether I use UBOs or SSBOs. If this is true, then I should use UBOs, shouldn’t I?

Also, in my code, I’m using glBufferData with NULL data, per frame, just before I use glMapBufferRange. My SSBO initialization is basically generating it and binding it as a GL_SHADER_STORAGE_BUFFER to a predefined binding point. My questions are:

  1. If I have defined MAX_LIGHTS = 100 and I only need to upload 30 lights, what’s the difference between using glBufferData with size = current_number_of_lights and size = MAX_LIGHTS? You said it doesn’t reallocate space, so what does it really do? If I use glBufferData with size = 30 and the next frame use it with size = 31, what happens?

  2. Should I really be using glBufferData each frame? Or should I use it once in initialization and then just make sure glBufferSubData doesn’t upload something bigger than the space I’ve allocated?

  3. Should I really be using glMapBufferRange, or should I instead be using glBufferSubData? I figured glMapBufferRange either copies the data to host memory and then copies changes back to the GPU, or it returns a pointer directly to GPU memory, which would be risky. So it seems glBufferSubData is just better, since it just copies the specific data you want to the GPU at that point.

EDIT: (EDIT2 responds to this) This is strange; I tried using the .length method to check the light array length and it always returns 1. I decided to see what would happen if I took out glBufferData (with NULL pointer), and nothing happened; it didn’t seem to be doing anything at all. I am basically generating a buffer, binding it and using glBufferSubData to upload the data with 3 lights… this results in 3 lights being correctly evaluated in the shaders accessing light[0], light[1] and light[2], but light.length returns 1, so I’m basically accessing a memory position out of the array. I don’t really understand what’s going on…

EDIT2: After some testing, it seems that light.length compiles but isn’t supposed to be used at all, as it has nothing to do with array length. The correct way would be light.length(), but there is a known bug where you need to wrap it with uint(uintBitsToFloat(light.length())) to get the unsigned int out of it. So now that I get the correct length of the array, I managed to test some things:

If I use glBufferData(100, NULL) during initialization and then use glBufferSubData(3, data), the array length is 100. If I only use glBufferSubData(3, data), it doesn’t allocate anything, as expected. If I use glBufferData(3, data) per frame and then change to glBufferData(4, data), the array’s length also goes from 3 to 4, meaning it allocated a bigger space. But then how have they made it impossible to reallocate? And is it better to use glBufferData(3, data) per frame, or to use glBufferData(100, NULL) when initializing and glBufferSubData(3, data) per frame?

OK, let’s just cut to the chase. Go read this and implement one of those streaming strategies.

Since, as you said, it is impossible to reallocate space for a Buffer Object

I didn’t say it was impossible. I said that recent functionality allows you to make it impossible. And since that functionality exists to make using them faster, that’s a strong hint that you shouldn’t be doing it in the first place.

I will have to allocate for example 100 lights whether I use UBOs or SSBOs.

The ability to resize the storage for a buffer object has nothing to do with how you use it.

Uniform blocks must be of a specific size. Therefore, whatever buffer object you use for them must be at least that size. It could be bigger, but it can’t be smaller.
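
For instance, a sketch of binding only part of a larger buffer to a uniform block (the 64 KB allocation and binding point 0 are arbitrary placeholders):

// One oversized buffer; the "Lights" block only ever sees the first blockSize bytes.
const GLsizeiptr blockSize = 100 * 20 * sizeof(GLfloat);              // size the block actually needs
glBindBuffer(GL_UNIFORM_BUFFER, m_lightsUBO);
glBufferData(GL_UNIFORM_BUFFER, 64 * 1024, NULL, GL_DYNAMIC_DRAW);    // allocated bigger than needed
glBindBufferRange(GL_UNIFORM_BUFFER, 0, m_lightsUBO, 0, blockSize);   // expose only this range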

You said it doesn’t reallocate space

Where? I said that ARB_buffer_storage/GL 4.4 allows you to allocate buffers that cannot be reallocated. And that means that it was a mistake for OpenGL to let you reallocate them to begin with. So you should never do it.

I figured glMapBufferRange either copies the data to host memory and then copies changes back to the GPU, or it returns a pointer directly to GPU memory, which would be risky. So it seems glBufferSubData is just better, since it just copies the specific data you want to the GPU at that point.

No it doesn’t. It copies the specific data to the GPU eventually.

Consider this. If you map the buffer, generate your light data every frame into that pointer, and unmap it, the worst-case scenario is that the driver will have to DMA-copy the data from the mapped pointer into the buffer object. It will do that at a time of its choosing, but sometime before you do anything that reads from that data. The best-case scenario is that you’re writing directly to the buffer object’s storage. This is much more likely if you use GL_MAP_INVALIDATE_BUFFER_BIT to invalidate the buffer (since you’re overwriting all of its contents).

If you use BufferSubData, you must generate your data into an array of your own, and you give that to BufferSubData. Worst-case, BufferSubData must then copy that array into temporary memory, and later DMA-copy that into the buffer. The reason why is quite simple. If the buffer is currently in use (is going to be read by GL commands that you have already issued that haven’t executed yet), then it can’t simply overwrite that data. The OpenGL memory model doesn’t allow later commands to affect earlier ones. So the implementation must delay the actual DMA-copy into the buffer storage until that storage is no longer in use. And since BufferSubData cannot assume that the pointer it was given will still be around after BufferSubData returns, it must copy that data into temporary memory and DMA from that into the buffer later.

So worst-case with BufferSubData is that there are two temporary buffers. You had to generate your lighting data into one temporary buffer, and OpenGL had to copy it into another temporary buffer.

Best case with BufferSubData is that it is able to do the DMA immediately. But that almost never happens. Why? Because DMAs aren’t instantaneous. They’re an asynchronous operation. Also, DMAs typically can’t happen directly from client memory. So most implementations of BufferSubData are still going to have to copy the buffer into some temporary, DMA-able memory, and then DMA it up to the GPU.

With mapped pointers, odds are very good that, if the pointer you get isn’t actually the buffer, it’s at least memory that’s DMA-ready. So the worst-case scenario for mapping is equal to the best case scenario for BufferSubData.

So yes, if performance is a concern (and at this point, it shouldn’t be. Stop prematurely optimizing stuff), mapping will only ever be equally as bad as BufferSubData, and can be a good deal faster.
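
A bare-bones sketch of that mapping path, assuming the buffer was already allocated to its full size at init time and that bufferSize, lightData, numLights and lightStride are placeholders from earlier in the thread:

// Per frame: map, overwrite, unmap. The invalidate bit tells the driver the old
// contents may be discarded, so it can hand back fresh memory instead of stalling.
glBindBuffer(GL_UNIFORM_BUFFER, m_lightsUBO);
void* ptr = glMapBufferRange(GL_UNIFORM_BUFFER, 0, bufferSize,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
if (ptr)
{
    memcpy(ptr, lightData, numLights * lightStride);   // write only the lights in use
    glUnmapBuffer(GL_UNIFORM_BUFFER);
}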

[QUOTE=Alfonse Reinheart;1253153]OK, let’s just cut to the chase. Go read this and implement one of those streaming strategies. […][/QUOTE]

From what you’re telling me, mapped pointers are better if I’m rewriting the storage. So if I only allocate once, I should allocate 100 lights during the initialization phase using glBufferData with a null pointer. Then, every frame, I should use a mapped pointer to overwrite the data from lights 0 to current_number_of_lights.

What about using glBufferData with a null pointer every frame, just before using the mapped pointer, like it says in the streaming-techniques link you posted? Will that be reallocating? (I’m under the impression that glBufferData always reallocates.) Or will that be more efficient, since it is used to tell the driver that you don’t really care about the previous piece of memory? I might be confusing buffer allocation with uniform block allocation, am I? After reading the link you gave me, it seems that using glBufferData with the same size as the initial allocation and with a null pointer will basically be faster, since I will be filling a new buffer or the old buffer (if it’s not being used).

Also, should I use glMapBufferRange with GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT? From the link you gave me, using GL_MAP_INVALIDATE_RANGE_BIT would be an optimization since I’m only writing and not reading. Also, using GL_MAP_UNSYNCHRONIZED_BIT would work since I’m only generating data into it before I actually render. Am I right?

And why should I stop prematurely optimizing? I must admit I’m a perfectionist but isn’t optimization good?

Sorry, I know it’s a lot of questions, but this isn’t just about optimization; optimization is just my own way of understanding things thoroughly, and I don’t want to be someone who just comes here and asks people to fix stuff. I want to understand, so I can teach others as well. In any case, you’ve already helped A LOT with my understanding of this and I thank you for that.

What part of “OK, let’s just cut to the chase. Go read this and implement one of those streaming strategies,” did you not understand?

And why should I stop prematurely optimizing? I must admit I’m a perfectionist but isn’t optimization good?

No, it isn’t. Optimization is a waste of time unless what you’re optimizing is actually responsible for the poor performance of your application. There’s the general 80/20 rule: 80% of your application’s performance is governed by 20% of your code. Until your application is at least somewhat remotely working, you can’t know which 20% is making it slow. And if you don’t know what’s making it slow, you can’t know what to spend time optimizing. So oftentimes, you’ll waste time optimizing something completely irrelevant.

Like say, how to efficiently stream a whole 16KB of data per frame to the GPU.

In the time it has taken us to have this discussion, you could have implemented any one of the general strategies you’ve suggested and moved on to something else. You can come back to this when a profiler tells you that it’s making your application slower.

[QUOTE=Alfonse Reinheart;1253155]What part of “OK, let’s just cut to the chase. Go read this and implement one of those streaming strategies,” did you not understand? […][/QUOTE]

Yes, but I’m not delivering a product; I’m doing research, so my purpose is to understand everything I can. It may take me a while to implement this due to all these discussions about optimization, but once I’ve done this once, I will understand it well, so the next time I have to implement something I will know exactly how to do it properly. Experienced developers are always doing premature optimization without even noticing: when you are developing something, you probably make design choices that are optimized, or as optimized as you can come up with off the top of your head. That’s what I’m trying to achieve here. Sure, I am developing something, but my whole purpose in asking you guys is to make sure I fully understand it, so that the next time I do it I will already know how to optimize off the top of my head as I implement it. I don’t think there is anything wrong with prematurely optimizing something if you already have the experience to do it right there, at that moment, without wasting time. Personally, I can’t do something that is new to me without trying to fully comprehend it. It feels like incomplete learning to me if I just copy code from a website to use in my application. Don’t get me wrong, I’m not saying any other way of doing it is wrong, but that’s just how I am.

I read the link you sent me. I only tried to confirm my interpretation of it, applied to what I’m doing here, since you seem to be very experienced and educated on this. I wasn’t trying to be lazy and make you explain what’s in that link for me; I simply read it and still had some questions about how to apply it to what I’m doing. Most of those questions are simply YES or NO questions. Sorry if I should have known the answers just by reading the link you gave me, but as it may not seem to you, English is not my first nor my second language. I am not a native speaker, and sometimes I have trouble understanding things clearly through text and need some human confirmation of my own interpretation.

I understand if you don’t want to help me fully understand rather than just use what’s on the link. So if you’re not helping any more, know that you were already helpful, not just with the other posts but also with that link, since it made me understand things better even if not fully.

Here are some hints:

  • Initially use BufferData with a NULL pointer once to specify the data storage size; do this only once.
  • In your case you probably should use STREAM_DRAW usage.
  • Use fixed-size array(s) sized for the maximum number of lights.
  • Pass the number of lights actually used (since length would only tell you the max size) to the shaders using a uniform.
  • When you update the buffer, you can map it all with explicit flushing, and manually flush only the first N lights which are in use.
  • Use the invalidate bit. Invalidating the whole buffer is probably best.
  • There is no need for BufferData(NULL) - that is just an older way to say invalidate.
  • Using the unsynchronized bit may be unsafe. When you update the data with the CPU, the GPU may still be using the older data for the previous frame. However, it is still worth experimenting with; I found that it gives more performance, and rendering errors were not an issue in my case.

In general, I would only use BufferData with NULL data and always use MapBufferRange to specify buffer contents, and never use BufferSubData. However, older OpenGL and unextended OpenGL ES versions before 3.0 do not have MapBufferRange. To support those, you could create an abstraction for a buffer with MapBufferRange and flush operations; these can be implemented using Buffer(Sub)Data calls if they are not available in GL. A rough sketch of this update path is below.
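
Roughly, the path described above could look like this (STREAM_DRAW usage, explicit flushing of only the first numLights lights; bufferSize, lightData, numLights and lightStride are placeholders from earlier in the thread):

// Init (once): full-size allocation, no data yet.
glBindBuffer(GL_UNIFORM_BUFFER, m_lightsUBO);
glBufferData(GL_UNIFORM_BUFFER, bufferSize, NULL, GL_STREAM_DRAW);

// Per frame: map the whole buffer, write only the lights in use, flush just that range.
void* ptr = glMapBufferRange(GL_UNIFORM_BUFFER, 0, bufferSize,
                             GL_MAP_WRITE_BIT |
                             GL_MAP_INVALIDATE_BUFFER_BIT |
                             GL_MAP_FLUSH_EXPLICIT_BIT);
if (ptr)
{
    memcpy(ptr, lightData, numLights * lightStride);
    glFlushMappedBufferRange(GL_UNIFORM_BUFFER, 0, numLights * lightStride);
    glUnmapBuffer(GL_UNIFORM_BUFFER);
}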

Then understand that “premature optimization is the root of all evil”. The 80/20 rule, or the sometimes even more strictly constrained 90/10 rule, is an established fact whose applicability has been observed by generations of developers on a multitude of systems. If you don’t have substantial profiling data and know exactly which parts of the application are a limiting factor, you cannot really optimize anything.

Experienced developers are always doing premature optimization without even noticing.

For instance?

you probably make design choices that are optimized

You cannot make optimized design choices. A design is an abstract perspective on how components of your application work and interact. You can implement that design in any way possible in the language of your choice. How optimal the resulting code will be depends not only on the patterns and idioms you follow that apply to your programming language, but in large part on the compiler, the platform (OS) and the CPU architecture. Obviously, for some designs you know that there is no way of naively implementing them and getting good performance in the end. I argue, however, that the code in itself can still be fairly optimal while performance is constrained by other factors that have nothing to do with the quality of your code - like I/O performance (hard disks, network and so on), or a crappy operating system, etc. etc. In general, the design alone cannot speak to the optimality of the resulting code or the performance of your application.

As I said, there are rules and idioms one should obey, like avoiding unnecessary copies of large sets of data and so on (and in this instance, deciding that a function takes its arguments by reference is actually a design choice that will probably lead to faster code), but all in all that is not optimization: not doing so is actually premature pessimization - unless you have a good reason not to follow the general rule. So is the above-mentioned choice of an obviously poor design.

as optimized as you can come up with from the top of your head

What you come up with from the top of your head is seldom optimal. If you want it really fast, you’re gonna have to profile and check the data - always.

It comes as incomplete learning for me if I just copy code from a website to use on my application.

Where did you get a recommendation to do so?

I don’t know your background, but I assume you’re either a student or a rather fresh post-grad, and I don’t know if everyone here will agree with me, but please don’t let senseless perfectionism take over. You’re not gonna get anywhere if you try to tweak every single function and every expression in your code. Ivory-tower thinking isn’t well applicable in the real world.

I understand if you don’t want to help me fully understand rather than just use what’s on the link. So if you’re not helping any more, know that you were already helpful, not just with the other posts but also with that link, since it made me understand things better even if not fully.

That is so not the point. The point is: You were already given sufficient help to tackle your problem at hand, at least on a basic level. If you have specific questions, no one on this forum will deny you their help until you understand what to do. In regards to performance, however, specific means providing actual data and pieces of code responsible for that data. If it’s crappy, we’ll tell you. If it’s OK and you just can’t do better on your current hardware, we’ll tell you. If you don’t seem to get what you’re doing at all, we’ll tell you. Personally, I think we got a very nice and helpful community here - you could do much, much worse.

Also, how can anyone actually say that they fully understand everything they do? How, pray tell? Do you know exactly how your hardware works? Do you know exactly what code your GLSL compiler generates for your current GPU? I could go on… but I suspect the answer is “no!”. Being able to fully understand everything you do when developing software is an illusion. Period.

As a general rule: First make it correct (which implies that it works in general) - then make it fast. This is exactly what Alfonse already told you above:

First make it correct, then make it fast.