Avoiding state switch when changing blend function

Here's the scenario: I am rendering a lot of particles, some blended additively, some via alpha. Right now I have to change the blend function every time the particle type changes. My idea was to handle all of that in a shader and use gl_Color.a to decide whether to blend additively (a == 0.0) or via alpha (a != 0.0).

So I created a shader that gets the current scene, a particle and the scene's depth buffer as textures and renders back into the current scene. The somewhat weird discovery I made is that I can use an FBO's color buffer both as the render target and as a color source in a shader. I wonder whether this is allowed by the standard, but my current OpenGL driver doesn't seem to have anything against it.

The idea is that the shader would accumulate the blended particles in the frame buffer as it works through the particle list, applying the desired blend function to each particle.

Here’s my fragment shader (the shader also does soft blending, but that’s secondary):


uniform sampler2D particleTex, sceneTex, depthTex;
uniform float dMax;
uniform vec2 windowScale;

// ZNEAR 1.0
// ZFAR  5000.0
// A = ZNEAR + ZFAR       = 5001.0
// B = ZFAR - ZNEAR       = 4999.0
// C = 2.0 * ZNEAR * ZFAR = 10000.0
// eye-space depth = C / (A - NDC (z) * B)
#define NDC(z) (2.0 * (z) - 1.0)
#define ZEYE(z) (10000.0 / (5001.0 - NDC (z) * 4999.0)) //(C / (A - NDC (z) * B))

void main (void) {
   vec2 sceneCoord = gl_FragCoord.xy * windowScale;
   float dz = clamp (ZEYE (gl_FragCoord.z) - ZEYE (texture2D (depthTex, sceneCoord).r), 0.0, dMax);
   dz = (dMax - dz) / dMax;
   vec4 sceneColor = texture2D (sceneTex, sceneCoord);
   vec4 particleColor = texture2D (particleTex, gl_TexCoord [0].xy) * gl_Color;
   if (gl_Color.a == 0.0) //additive
      gl_FragColor = vec4 (sceneColor.rgb + particleColor.rgb * dz, 1.0);
   else //alpha
      gl_FragColor = vec4 (mix (sceneColor.rgb, particleColor.rgb, particleColor.a * dz), 1.0);
}
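
(For reference, ZEYE just inverts the perspective projection to recover eye-space depth from window-space depth: z_eye = 2 * ZNEAR * ZFAR / (ZNEAR + ZFAR - NDC (z) * (ZFAR - ZNEAR)); with ZNEAR = 1 and ZFAR = 5000 that gives the hard-coded constants 10000, 5001 and 4999.)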

The particle data (vertices, texture coordinates, color) are passed in client arrays.

The shader itself blends the scene and particle fragment colors and writes the result to the fragment color. The OpenGL blend function is set to plain replacement (GL_ONE, GL_ZERO), so the resulting color completely replaces the frame buffer contents at the current fragment position.

I had hoped that assigning a color to gl_FragColor would update the render target, and when rendering the next particle in line, I’d be reading from the updated scene. Unfortunately this doesn’t seem to be the case.

This works, but only partially: apparently the frame buffer is not updated immediately after one particle has been rendered, so reading from it for the next particle does not yet return the scene as modified by the previous particle. Instead, the current scene + particle completely replace the previous state. So I can indeed blend a single particle with the frame buffer and write that back to the frame buffer, but the results don't accumulate.

Is there a cure for this, or another method to improve batching for particles using the same texture, but different blend functions?

Reading from and writing to the same buffer is deep in undefined-behavior land :)
You should avoid it, especially as it does not work for your case.

Maybe I am missing something, but going from 2 state changes (add, then alpha) to 1 is not a huge gain, right? You don't change state for each particle; you render all the additive ones, then all the alpha-blended ones?

I figured it was undefined, but since it works I tried to exploit it just for the heck of it. I am not insisting on this method to reduce state changes though - if somebody has more insight into this subject and can give me some vital hints, I'd be just as happy.

I cannot just render all additively and then all alpha-blended particles, because usually they overlap and need to be drawn in the right Z order. So I have to switch states every time the blend function changes, and for 30,000-60,000 particles that really hurts, because there is some buffer management overhead involved (usually it's a few dozen additively blended and then a few hundred alpha-blended particles, e.g. for missile smoke trails).

I already have code that checks whether particles with different render state requirements overlap or not, and only flushes buffers and changes states when necessary, but here it's exactly the case that they do overlap, because they belong to the same smoke trails.

When I render all of those particles with alpha blending only, renderer speed at least doubles for huge numbers of particles.

I think you can solve this another way.

Set the blending function factors to GL_ONE and GL_SRC_ALPHA.

Now the factor applied to the previous framebuffer contents comes from your fragment's alpha, while the incoming color is always added at full weight.

Thus to emit a ‘non-blended’ particle, you would do this:

// zap out alpha of dst
gl_FragColor.a = 0.0;

And to ‘blend’ one you do this:

gl_FragColor.rgb *= gl_FragColor.a;
gl_FragColor.a = 1.0 - gl_FragColor.a;

In other words, you use the fragment's alpha to control the dst factor and pre-multiplication makes the src factor obsolete. No need to read from the bound FBO. :)

Thus to emit a ‘non-blended’ particle, you would do this:

He’s not talking about the difference between non-blended particles and blended ones; he’s talking about the difference between alpha-blended ones and additive ones. That is: src + dst = final.

That being said, what you have done here actually would work. For additive blending, you leave the RGB alone and set the alpha to 1.0. For alpha blending, you do the pre-multiplication as stated. The problem is that you still need some kind of state change or something that tells the shader which ones are additive and which are alpha blended.

Ah ok, that makes sense. Thanks!

Blend function is controlled via gl_Color.a (as already stated above).
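
Putting it together, the tail of the shader would then look roughly like this (untested sketch; glBlendFunc (GL_ONE, GL_SRC_ALPHA) is set once on the application side, and the soft-particle factor dz stays as in the original code):

// blend state on the app side gives: final = src + dst * src.a
// no sceneTex read needed anymore
vec4 particleColor = texture2D (particleTex, gl_TexCoord [0].xy) * gl_Color;
if (gl_Color.a == 0.0) // additive: leave dst untouched, add the particle
   gl_FragColor = vec4 (particleColor.rgb * dz, 1.0);
else // alpha: pre-multiply src, scale dst by (1 - a * dz)
   gl_FragColor = vec4 (particleColor.rgb * particleColor.a * dz, 1.0 - particleColor.a * dz);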

Reading from and writing to the same buffer is a REALLY BAD idea.

You can't read and write the same buffer with guaranteed results over multiple particles. There are no assurances as to the pipelined order of the read vs. the write across fragments, but further, even if hardware supported this it could defeat pipelining and hose your performance, because it would stall on a fragment prefetch while your shader completed the write. Worse than this, you also have different caches at work for each purpose, and the way they operate is profoundly different and task-specific.

So making this work is actually undesirable from a hardware performance point of view. That said it may not matter most of the time.

You say it works, but you may not even be touching the corner cases where it matters, and those vary by platform. So a lot of small, densely packed particles that are pipelined together are going to show implementation-specific problems.

The way to do this right is ultimately to have a programmable blend shader or a reserved blend section in the fragment shader code, or compilers smart enough to recognize it as such and move it to a programmable blend pipeline stage.

I know all that. I was just playing around with it a bit because it seemed to work.

Is there a way to avoid a state change when needing to change a texture (besides putting several textures in a bigger one and using texture coordinates to address them)?

I wish the client array stuff would allow passing textures and blend modes as well …

besides putting several textures in a bigger one and using texture coordinates to address them

Is there something wrong with that? That’s typically how particle systems work. It’s been done for well on a decade now.

Texture arrays. Same space.

I wish the client array stuff would allow passing textures and blend modes as well …

texcoord.z becomes your “texture index”.
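
For example, a rough fragment shader sketch (sampler2DArray and the array lookup come from EXT_gpu_shader4 / EXT_texture_array; in GLSL 1.30+ they are core and you'd use texture () instead):

#extension GL_EXT_gpu_shader4 : require

// one array texture holding all particle images as layers
uniform sampler2DArray particleTex;

void main (void) {
   // .st addresses the image as usual, .p (the third texcoord) selects the layer
   gl_FragColor = texture2DArray (particleTex, gl_TexCoord [0].stp) * gl_Color;
}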

You can render the scene to a texture, duplicate it, and use the copy in the shader.

Dark Photon,

I looked texture arrays up, and the spec says they are not supported by the fixed-function pipeline. NVidia proposes extensions to use them with glEnable and glDisable, but is that standardized in any way, or (still) vendor specific?


I am not offering just one type of particle, but 5 or 6 (smoke, snow, rain, fire, air bubbles). The image sizes can differ, the number of animation frames can differ, and people can mod these.

Of course all frames of a given particle image are placed in one texture, but mixing different particles would be another story.

So it's not straightforward to put them all into one texture. Of course I could do it programmatically, but I'd prefer not having to.

Some of the images are already rather big, and the game runs on pretty old hardware too, so I should probably avoid creating 4 MB textures containing all particle frames of the entire game.


Are you referring to the question about avoiding state changes due to blend function changes? That has been solved already.

NVidia proposes extensions to use them with glEnable and glDisable, but is that standardized in any way, or (still) vendor specific?

I don't think it exists even as an NVIDIA-only extension. The EXT_texture_array extension suggests that it could be possible in a future extension, but they don't say that they will support it. Only that they could provide an extension if there were a need for it.

The deprecation list in the 3.0-and-above specs suggests that TEXTURE_*D_ARRAY was allowed for glEnable/glDisable, but that only applies to GL 3.0 compatibility and above (when array textures were brought into core OpenGL).

Of course all frames of a given particle image are placed in one texture, but mixing different particles would be another story.

No it isn't. Packing multiple different particle types into a single image is no different from packing each particle type into its own image. Games do this sort of thing all the time.

If you absolutely must have them be on different 2D textures, you could still use texture arrays. Just put each particle’s images in a different array page. You can even load them independently into each page of the array texture.

Lastly, there's one simple fact: you're not getting around this. If you want to have fewer texture state changes, these are your options. No more, no less. So you could have a performance vs. moddability tradeoff, but that's up to you.

I know that NVidia proposing something doesn't necessarily mean it's NVidia specific, but it's reason enough for me to ask. The question for me is how to bind a texture array, particularly if I don't want shaders to be involved in rendering. If I were to use a shader, I could reduce texture state switches by simply binding, let's say, two textures to GL_TEXTURE0 and GL_TEXTURE1 and having the shader decide which one to use via the z coordinate in a texture coordinate array passed to OpenGL - I wouldn't even need to use texture arrays, right?
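
For illustration, the kind of shader I mean would be roughly this (untested sketch; texA/texB are just placeholder names for the two bound textures):

uniform sampler2D texA, texB; // bound to texture units 0 and 1

void main (void) {
   vec4 texColor;
   // the third texcoord component selects which of the two bound textures to sample
   if (gl_TexCoord [0].p < 0.5)
      texColor = texture2D (texA, gl_TexCoord [0].st);
   else
      texColor = texture2D (texB, gl_TexCoord [0].st);
   gl_FragColor = texColor * gl_Color;
}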

I said I could do it programmatically but don’t want to. It is a different story because I’d have to code it and maintain it for future extensions of the particle system. Adding another texture to a texture array seems much simpler and more elegant.

The question for me is how to bind a texture array, particularly if I don’t want shaders to be involved in rendering.

Only the specs can answer that. The EXT_texture_array spec says you can’t. The GL 3.0 spec suggests that you might be able to, but you’ll have to look it up yourself. And even then, it would only apply to OpenGL versions 3.0 or better; there’s no ARB_texture_array extension.

If I were to use a shader, I could reduce texture state switches by simply binding, let's say, two textures to GL_TEXTURE0 and GL_TEXTURE1 and having the shader decide which one to use via the z coordinate in a texture coordinate array passed to OpenGL - I wouldn't even need to use texture arrays, right?

Having an if-statement of the kind you suggest (especially considering that you’ve talked about 6 different kinds of textures, not just two) in your fragment shader is simply not conducive to performance. Many compilers will implement it by accessing all of the textures and picking the right result to write to the framebuffer. Others will actually do some conditional branching logic in your fragment shader, which is not exactly known for being performance-friendly.

Remember: the whole point of your doing this is performance. So any solution must necessarily be faster. And I don’t think this will be. Furthermore, if you had access to shaders, you could just do it the right way with texture arrays. So there’d be no need for this hackery.

I said I could do it programmatically but don’t want to. It is a different story because I’d have to code it and maintain it for future extensions of the particle system.

Then you need to decide what’s more important to you. Your options are:

1: Use shaders and texture arrays.

2: Use pure texture atlases.

3: Live with your current performance.

Performance isn’t free; there’s always a coding and maintenance cost. So how much effort do you want to put forth to get the performance?

I don't consider my proposition hackery. It may be less performant on certain hardware, but that doesn't make it hackery. It will pretty likely still be more performant than flushing the current particle buffer and setting up a new one.

As I have stated before, I will not start to build particle “mega” textures programmatically. A performance increase must be achieved in an elegant fashion or not at all.

The most interesting path here is to use texture arrays, but I will try both approaches for the heck of it.

Btw, after looking into it a bit I found that there are two particle types that compete the most, so doing something about those should prove most beneficial for performance. Then there are some that need special blending (multiply) and cannot be handled by some "universal" particle render code anyway.

Thanks for your input so far.

On my current development hardware (Core i7 920 @ 3.6 GHz, 6 GB RAM 1800 MHz, Geforce GTX 470) I couldn’t find a speed difference between binding each texture separately to a TMU and using a texture array.

The issue is not how cheap/expensive the individual bind is, but how many more tris you can push without an intervening state change.

On my current development hardware (Core i7 920 @ 3.6 GHz, 6 GB RAM 1800 MHz, Geforce GTX 470) I couldn’t find a speed difference between binding each texture separately to a TMU and using a texture array.

You could do it that way. You could implement it as multiple textures in conditionals. And then you could test it on every GPU you’re interested in. Making sure that the driver gives reasonable performance and so forth. Every time you add more to your renderer, you need to test again to make sure that this is giving you faster performance. Again, on every GPU. Every driver revision, you need to retest to make sure the compiler isn’t doing something unpleasant.

Or you could simply do it the right way. The way that will either match or exceed the multiple-textures-with-a-condition method in all cases. The way that everyone else does it, and therefore the way that IHVs will be optimizing.

The choice is yours.