
View Full Version : Texture Switching



glfreak
01-10-2011, 09:07 AM
This could be very useful in different situations: the ability to switch textures (i.e., sampler state) while rendering primitives.

For instance, you are rendering a triangle strip or a list of, say, 10 triangles, and in the middle, at triangle 5, you want to start using a different texture.

I see this only happening if we are able to specify fragment shader uniform parameters during primitive rendering.

glEnable(GL_PER_PRIMITIVE_FRAGMENT_CHANGE)

glPrimitiveFragmentUniform(primitiveIndex, uniformVar, value)

glDrawArrays(....)

The primitive index is a zero-based index denoting which primitive (triangle) in the draw call is affected.

So at the specified primitive index, the fragment shader will be fed a different uniform value, as specified by glPrimitiveFragmentUniform.

Groovounet
01-10-2011, 09:17 AM
Welcome back, indirect mode! ...

gl_PrimitiveID + a 2D texture array would do it, and efficiently.

It's not possible to dynamically index a sampler array in GLSL according to the OpenGL 4.1 spec. It does work on nVidia, but renders garbage on AMD. Being able to index sampler arrays with uniform values is really required, but it wouldn't make sense to do it on a per-primitive basis.

glfreak
01-10-2011, 09:58 AM
Yeah, but gl_PrimitiveID is not available without an active geometry shader.

Groovounet
01-10-2011, 10:20 AM
Then use a geometry shader; that's what it's meant for.

Alfonse Reinheart
01-10-2011, 10:39 AM
So at the specified primitive index, the fragment shader will be fed a different uniform value, as specified by glPrimitiveFragmentUniform.

Uniforms are so named because they are uniform; they do not change over the course of a primitive. Uniforms that do change would not be "uniform" and would therefore need to be something else.

Furthermore, if such a thing could be implemented with few if any performance issues, uniforms wouldn't need to be uniform. And if the performance impact is no different from simply issuing multiple draw commands with glUniform/etc calls in between, what's the point?

mhagain
01-10-2011, 11:21 AM
I don't think this feature is a good idea. One - IMO - serious weakness of OpenGL in the past is that it has abstracted the hardware a little too much, with the end result being a tendency to see suboptimal formats used in a lot of code, and the dreaded fallback to software emulation being something you need to watch out for. (Of course an advantage of this approach is that you don't need to sweat over details of the hardware, but I think the downside outweighs the upside here.)

A feature like this goes in the wrong direction, abstracting the hardware even more, whereas what we really need is less abstraction.

kRogue
01-10-2011, 11:29 AM
If you need to deal with textures that are not the same size, build a texture atlas of the image data. For textures of the same size, GL_TEXTURE_2D_ARRAY is exactly that: a third texture coordinate selects the layer.
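To make the atlas idea concrete, here is a minimal host-side sketch (in C, with made-up names) of how local per-image UVs might be remapped into a uniform-grid atlas. Real atlases usually also need padding between tiles to avoid filtering bleed, which this sketch ignores:

```c
#include <assert.h>

/* Hypothetical sketch: remap a sub-image's local [0,1] UV into the
 * corresponding region of a texture atlas. The atlas is assumed to be
 * divided into a uniform grid of equally sized tiles. */
typedef struct { float u, v; } UV;

UV atlas_uv(int tile_index, int tiles_per_row, int tiles_per_col, UV local)
{
    int col = tile_index % tiles_per_row;           /* tile column in the grid */
    int row = tile_index / tiles_per_row;           /* tile row in the grid */
    float scale_u = 1.0f / (float)tiles_per_row;    /* width of one tile in UV space */
    float scale_v = 1.0f / (float)tiles_per_col;    /* height of one tile in UV space */
    UV out = { ((float)col + local.u) * scale_u,
               ((float)row + local.v) * scale_v };
    return out;
}
```

With a 4x4 grid, tile 5 occupies column 1, row 1, so a local center coordinate of (0.5, 0.5) maps to (0.375, 0.375) in atlas space.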

aqnuep
01-10-2011, 11:29 AM
I don't see the point in this suggestion either. We have texture arrays, and that on its own solves the problem (in most situations) of breaking batches because of texture switches.

The simplest way is simply passing the texture layer as part of the texture coordinates in the vertex array. Of course, this wastes some memory, but you can improve on it by using gl_PrimitiveID to source the texture layer numbers from a texture buffer in the geometry shader. This is much more flexible than your proposal.
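As a rough illustration of the buffer-driven variant (names here are hypothetical), the host side could build a per-triangle layer table like this; the geometry shader would then fetch layers[gl_PrimitiveID] with texelFetch() from a texture buffer object:

```c
#include <assert.h>

/* Hypothetical host-side sketch: fill 'layers' so that triangles
 * [first, first + count) sample texture array layer 'layer'. The
 * resulting array would be uploaded to a texture buffer object and
 * indexed by gl_PrimitiveID in the geometry shader. */
void assign_layer_range(int *layers, int first, int count, int layer)
{
    for (int i = 0; i < count; ++i)
        layers[first + i] = layer;
}
```

For the OP's example of 10 triangles switching textures at triangle 5, you would assign layer 0 to triangles 0-4 and layer 1 to triangles 5-9, then draw the whole list in one call.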

glfreak
01-10-2011, 11:37 AM
I see your point. A texture array seems sufficient.

Eosie
01-10-2011, 05:29 PM
It's not possible to dynamically index a sampler array in GLSL according to the OpenGL 4.1 spec. However it does work on nVidia
Why the hell does it work on nVidia? Isn't it a spec violation?

Groovounet
01-10-2011, 05:48 PM
It's not possible to dynamically index a sampler array in GLSL according to the OpenGL 4.1 spec. However it does work on nVidia
Why the hell does it work on nVidia? Isn't it a spec violation?

I don't think it is a violation.
If I remember correctly, the spec says, as it often does, that the result is undefined, so whatever happens is valid. A warning from the GLSL compiler would be great, but it just works...

aqnuep
01-11-2011, 02:02 AM
Yes, it is. NVIDIA tends to violate the spec in order to provide additional functionality. I don't quite agree with the way they expose these additional functionalities.

They should rather create an extension for this kind of stuff (e.g. NV_shader_indexed_sampler_array or whatever), but they should not just allow it like that. This makes developers' lives much more difficult, as they cannot be sure their shaders are really cross-platform/cross-vendor. However, we cannot do much about it.

Groovounet
01-11-2011, 04:30 AM
Yes, it is. NVIDIA tends to violate the spec in order to provide additional functionality. I don't quite agree with the way they expose these additional functionalities.

They should rather create an extension for this kind of stuff (e.g. NV_shader_indexed_sampler_array or whatever), but they should not just allow it like that. This makes developers' lives much more difficult, as they cannot be sure their shaders are really cross-platform/cross-vendor. However, we cannot do much about it.

Ermmm... the OpenGL community doesn't deserve unfounded allegations, so if you want to assert something, you should check it first!

"Samplers aggregated into arrays within a shader (using square brackets [ ]) can only be indexed with a dynamically uniform integral expression, otherwise results are undefined"

nVidia is free to make it work and still conform to the spec. I'm not saying that's always true in other cases, but the specification has a lot of loose ends like that.

aqnuep
01-11-2011, 06:19 AM
Yes, you are right. I didn't remember exactly what the spec says about it.

Of course, if OpenGL says that results are undefined but does not state that it is an error, then NVIDIA is free to give defined results.

Sorry, I was prematurely judging the driver behavior. Anyway, I still stand by my statement that NVIDIA tends to allow things in GLSL that are not supported (I remember that it used to allow datatypes such as float3, which come from Cg but are not valid in GLSL).

Groovounet
01-11-2011, 07:18 AM
Sorry, I was prematurely judging the driver behavior. Anyway, I still stand by my statement that NVIDIA tends to allow things in GLSL that are not supported (I remember that it used to allow datatypes such as float3, which come from Cg but are not valid in GLSL).

I am not saying the contrary, and there are valid examples of this in nVidia's drivers, so there is no need to blame them when it's not true, especially when the specification is actually to blame for implementation variations.

mhagain
01-11-2011, 11:11 AM
...when the specification is actually to blame for implementation variations.
That's a good point, the spec does need to nail things down more solidly in a lot of cases. There are already too many implementation-dependent or undefined behaviours in there and adding room for more doesn't help things at all.

Eosie
01-11-2011, 09:55 PM
the specification is actually to blame for implementation variations.
Except that the specifications and the drivers are made by more or less the same people. You may even think of it as a kind of specification backdoor that allows implementers to silently add their own behavior, wait until people start using it (people don't usually read specifications, so they have no way of knowing it's undefined behavior), and then point at competitors' implementations when it doesn't work.

Instead, a responsible ARB standardization process would be:
- if possible, forbid any undefined behavior in specification version X (e.g. by throwing a compiler error)
- consider making some of the forbidden behaviors allowed and well-defined in version X+1

I guess this process is already used, but apparently only sometimes.

Alfonse Reinheart
01-11-2011, 11:00 PM
Undefined behavior is usually undefined because it is too difficult or too expensive to catch at runtime.

Take the prohibition on reading from and writing to the same image at the same time. There's no real way to test for this, because a shader could read from any mipmap level of the texture, not just the one(s) bound to the FBO. The simple answer of checking texture object names doesn't work, because it's possible to bind different images of the same texture for writing and reading. As long as you ensure that you don't read and write the same image, you're fine.

Also:


It's not possible to dynamically index a sampler array in GLSL according to the OpenGL 4.1 spec. However it does work on nVidia

Not true. Entirely.

In the GLSL 3.3 spec, section 4.1.7 states that, "Samplers aggregated into arrays within a shader (using square brackets [ ]) can only be indexed with integral constant expressions." However, in the GLSL 4.0 spec, this corresponding section states, "Samplers aggregated into arrays within a shader (using square brackets [ ]) can only be indexed with a dynamically uniform integral expression, otherwise results are undefined."

In case you're wondering, a "dynamically uniform" expression is an expression such that all invocations of the shader, with the same uniform values, will result in the same value. This means that the index can now depend on uniforms and constants, rather than just constants.

So rather than a compile-time constant, it is a glDraw*-time constant. Which is better.

Groovounet
01-12-2011, 04:33 AM
I guess my vocabulary wasn't accurate enough on that one; it slipped my mind for a second that they use "dynamically" for sampler array indexing with uniforms... hmmm.

On nVidia you can index using any integer value from any source, and it works, as far as my experiments went. At least a warning would be nice, since indexing with a uniform or constant could be checked at compile time.

Alfonse Reinheart
01-12-2011, 09:49 AM
At least a warning would be nice, since indexing with a uniform or constant could be checked at compile time.

Could it? Consider this:



uniform int iLoopLen;
uniform sampler2D texArray[5];

in vec2 someTextureCoord;
out vec4 fragColor;

void main()
{
    vec4 iAccum = vec4(0.0);

    for(int iLoop = 0; iLoop < iLoopLen; iLoop++)
    {
        iAccum += texture(texArray[iLoop], someTextureCoord);
    }

    fragColor = iAccum;
}


Each of these accesses to texArray perfectly qualifies as "dynamically uniform." But the index doesn't directly involve a uniform. That's why the spec defines "dynamically uniform" instead of saying "uses a uniform or constant."

And that's why the compiler can't just do a simple test to see if it works.

bootstrap
01-12-2011, 01:05 PM
I see the your point. Texture array seems sufficient.

Warning: I haven't tried this yet. I already reserve a 4-bit field of an integer in each vertex to select which texture in the texture array. This lets me render many objects within each batch with indexed vertices (element arrays).

However, this typically doesn't work within any contiguous smooth surface of a single object, because a given vertex is shared between multiple triangles on a smooth surface.

However, I have not yet tried the new "restart" capability. With that, it should be possible to duplicate vertices within smooth surfaces where you want the texture to switch. Of course the "duplicate" vertex has a different value in that 4-bit "texture array index" field, but that's not so terrible.

Therefore, with "texture arrays" and "restart" I believe you can achieve the result you want... and quite efficiently too.

Alfonse Reinheart
01-12-2011, 02:26 PM
However, this typically doesn't work within any contiguous smooth surface of a single object, because a given vertex is shared between multiple triangles on a smooth surface.

The 4-bit field is a vertex attribute, yes (Though how you create a 4-bit attribute is beyond me)? It's effectively part of the texture coordinate. So you simply do what you do if the same position uses different texture coordinates, or different normals: you duplicate the position in a new vertex.

There's no need for "restart" (I assume you're talking about primitive restart); this has been done since vertex arrays of any kind were first introduced.

bootstrap
01-18-2011, 10:11 PM
However, this typically doesn't work within any contiguous smooth surface of a single object, because a given vertex is shared between multiple triangles on a smooth surface.

The 4-bit field is a vertex attribute, yes (Though how you create a 4-bit attribute is beyond me)? It's effectively part of the texture coordinate. So you simply do what you do if the same position uses different texture coordinates, or different normals: you duplicate the position in a new vertex.

There's no need for "restart" (I assume you're talking about primitive restart); this has been done since vertex arrays of any kind were first introduced.

One of my vertex attributes is a 32-bit integer. It is a simple matter to execute shift (>> and <<) and mask (&) operators to extract 1-, 2-, 3-, 4-, 5-, 6-bit and larger fields to specify:

- one of many transformation matrices
- one of many textures in a texture array
- whether to apply the texture or not
- whether to normal-map or not
- whether to emit light, compute lighting, or otherwise
- and so forth

This is what I do, so I can submit mammoth batches in a single call of glDrawElements().

PS: Just curious. What's so strange about a "4-bit field in an integer attribute"? There's no need to define any "4-bit attributes"! :-) In fact, that would be terribly inefficient, which is why I pack many individual bits and 4-bit fields into a single 32-bit integer attribute.
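A minimal sketch of that kind of packing (the field layout below is made up for illustration, not bootstrap's actual layout):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of packing several small fields into one 32-bit
 * integer vertex attribute. Assumed layout for illustration:
 *   bits 0-3: texture array index
 *   bits 4-7: transformation matrix index
 *   bit  8  : whether to apply the texture */
uint32_t pack_attrib(uint32_t tex, uint32_t mat, uint32_t use_tex)
{
    return (tex & 0xFu) | ((mat & 0xFu) << 4) | ((use_tex & 0x1u) << 8);
}

/* The shader-side extraction is the same shift-and-mask, shown here in C. */
uint32_t tex_index(uint32_t a)   { return a & 0xFu; }
uint32_t mat_index(uint32_t a)   { return (a >> 4) & 0xFu; }
uint32_t use_texture(uint32_t a) { return (a >> 8) & 0x1u; }
```

In GLSL the unpacking would use the same >> and & operators on an integer attribute.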

-----

You are entirely correct about adding duplicate vertices to accomplish these kinds of things.

However, this doesn't work if the desired process needs to be done dynamically AKA "on demand" (in response to something that happened in the application, like adding a gunshot hole or other damage).

Also, I tend not to think about adding duplicate vertices because the focus of my engine is "procedurally generated content". Which means, in practice, that objects tend to be composed of standard "fundamental shapes" and derivative shapes based upon the fundamental shapes. Since these shapes are standardized (created by standard routines), I usually don't even think in terms of specialized "jiggered" approaches like extra [duplicate] vertices to achieve results like this. Of course the other reason not to think that way is that it is not a general approach, in that it doesn't work for dynamic, "on demand" situations like I mentioned above.

However, if the nature of the application is such that he'll never run into these dynamic/on-demand situations, then making specialized extra/duplicate vertices is perfectly good.

Alfonse Reinheart
01-19-2011, 12:42 AM
This is what I do, so I can submit mammoth batches in a single call of glDrawElements().

While simultaneously making your shader exceedingly large and branchy. I'm not a fan of monolithic shaders myself, and I'm not convinced of the performance of this technique.


However, this doesn't work if the desired process needs to be done dynamically AKA "on demand" (in response to something that happened in the application, like adding a gunshot hole or other damage).

Then you need to decide whether constantly updating a buffer object will give you the same performance as just rendering "normally", without the uber-shader, mega-batch approach. I don't know what the penalty for state changes is, but I doubt it's more severe than PCIe bus transfers.

Melekor
02-21-2011, 12:25 AM
This feature would be extremely useful to me; in fact, the increased number of batches due to texture switching is the last real bottleneck in my renderer.

It was suggested that this is not needed because we have texture arrays, or could just use an atlas. That's true for a lot of applications but not mine.

Texture Atlases won't work since I don't know until draw time which textures are needed together in the same atlas. I could generate an atlas every frame, but that would be slower than just doing the switching.

Texture Array does not work for the same reason, and additionally my textures are all different sizes.

The OP's suggestion, though, would let me collapse almost everything into one draw call - that's perfection!

Alfonse Reinheart
02-21-2011, 02:17 AM
This feature would be extremely useful to me; in fact, the increased number of batches due to texture switching is the last real bottleneck in my renderer.

So how do you know that this is a bottleneck? Namely, how do you know that you aren't simply bumping into the fastest your hardware will go?


The OP's suggestion, though, would let me collapse almost everything into one draw call - that's perfection!

I think people have taken this "minimize batch count" thing a bit too far, to be honest. Actual high-end games, products that cost millions of dollars to produce and whose success partially depends on getting as much performance as possible out of the hardware, do not take some of the steps that people often talk about.

Taking one draw call to render is not "perfection". It's simply taking one draw call to render. Everything has costs, and nothing is free.

Texture atlases have costs. You have to make textures bigger, mipmaps are more difficult to make workable, you may waste texture space, etc.

Texture arrays have costs. The entire array must be small enough to fit into GPU memory all at once, so you can't keep just a working set of the textures resident; it's all or nothing for any particular array.

I don't know what you have done that removes all state changes except texture changes, but I highly doubt that it was "free". Your particular application might be able to live within its limitations, but that doesn't make it free.

The OP's suggestion might allow you to have only one draw call. But that doesn't mean any of the state change overhead has vanished. What the OP suggests is not possible on current hardware, and unless there is a pretty fundamental change in how textures are implemented, it will not be available on hardware in the near future.

Changing textures requires state changes. Either you are going to ask OpenGL to change that state, or OpenGL is going to change that state internally. But the state change, and all of the associated performance issues therein, will still be there.

kRogue
02-21-2011, 06:40 AM
On our horribly overpowered desktops, you can easily do over 1000 draw calls per frame and the CPU won't even break a sweat. Seriously, putting everything into just one draw call is just plain silly. The place where you need to look is "how many draw calls are you doing?" If you are under 1000 on a desktop, that is likely not your bottleneck (unless every draw call is accompanied by a texture and/or GLSL program change). Lastly, not putting everything in one draw call allows you to cull large chunks of non-visible geometry without forcing them down the GPU (ahh... but how finely to cull, such joys!).

Melekor
02-21-2011, 10:55 AM
That's just it, actually: in the worst-case scenes we are easily doing 5000+ draw calls per frame, each one being just one quad, and each requiring a texture switch.

Imagine a particle system with 1000s of particles, where every particle has a different texture, that's basically the use case.

We also have a software rasterizer and when running at a small resolution like 640x480, it's actually faster than OpenGL in this worst case... that's kind of sad imo :(

Alfonse Reinheart
02-21-2011, 11:20 AM
Imagine a particle system with 1000s of particles, where every particle has a different texture, that's basically the use case.

And there's no way you can build appropriate texture atlases for these particular cases? This is a common technique used by many high-performance rendering applications. You don't need to know precisely which textures will be used. You know the set of images that could possibly be used, and that is generally enough. Bundle all the particle-system textures together, and you're fine.

Which brings to mind another question: how exactly would you render particle systems in the same draw call as, for example, terrain that might have diffuse maps, a bump map, and possibly one or two other textures? Not to mention using a much more complicated shader.


We also have a software rasterizer and when running at a small resolution like 640x480, it's actually faster than OpenGL in this worst case... that's kind of sad imo

That's not sad at all; it's expected. GPUs have, and always will have, some form of overhead for their use. That's why it is important to draw something suitably significant that allows the basic rendering performance gain to exceed the overhead.

These days, if everything you're drawing is just single-textured, multiply texture by color, Quake-1-era stuff, you're really wasting your GPU.

Melekor
02-21-2011, 11:41 AM
And there's no way you can build appropriate texture atlases for these particular cases? This is a common technique used by many high-performance rendering applications. You don't need to know precisely which textures will be used. You know the set of images that could possibly be used, and that is generally enough. Bundle all the particle-system textures together, and you're fine.

This doesn't work in our case because the set of "all possible" textures can be far too large to fit into one texture, or potentially, into VRAM. (We use a LRU cache system instead of loading everything at startup.)


Which brings to mind another question: how exactly would you render particle systems in the same draw call as, for example, terrain that might have diffuse maps, a bump map, and possibly one or two other textures? Not to mention using a much more complicated shader.

Well, you wouldn't. When I said one draw call I was just talking about the particle-like systems. The rest of the stuff we need to draw fits very well into the OpenGL paradigm and there are no performance issues.

Alfonse Reinheart
02-21-2011, 12:08 PM
This doesn't work in our case because the set of "all possible" textures can be far too large to fit into one texture, or potentially, into VRAM. (We use a LRU cache system instead of loading everything at startup.)

Then break it up into several smaller atlases. Each texture can hold the particles for related effects. Impose limits on your artists if you have to. You'll still have texture changes, but not as many.

Melekor
02-21-2011, 01:31 PM
Yeah, I think that's probably the best that can be done.

The point, though, is: is this a limitation of the hardware, or of the API? I don't see why it shouldn't be possible to (efficiently) use a different texture for each polygon.

Alfonse Reinheart
02-21-2011, 01:40 PM
I don't see why it shouldn't be possible to (efficiently) use a different texture for each polygon.

Because texture accessing is built into the hardware. It's not just passing a pointer to the shader and having it fetch values from memory. There is dedicated texturing hardware associated with each cluster of shading processors. This texturing hardware needs to know specific information, not just about the texture (pointers to memory, etc), but about how to access it (sampler state, format, etc).

For any texture you use, this information must be passed to the texture unit hardware before it can access that texture. That's part of what happens when you bind a new texture and render with it. In a texture object (and sampler object), there is a block of GPU setup commands that gets put into the GPU's command buffer when you render with new textures.

Plus there are API issues. GLuint texture names have no relationship to the actual texture data on the GPU side. That translation is done on the CPU when you call glBindTexture. So having the GPU read, for example, "5" from a buffer object would be meaningless; it wouldn't know what to do with that value.

Coupled with that is whether or not texture "5" is in GPU memory currently or not. This is something that the CPU normally takes care of when you render with new textures. Again, it is part of the setup for new textures.

In short: not gonna happen.

kRogue
02-21-2011, 01:40 PM
Imagine a particle system with 1000s of particles, where every particle has a different texture, that's basically the use case.


Here are my thoughts: does each particle have a unique texture? What are the texture resolutions? How many pixels do most of the particles take up on screen? Do these textures have a fast-to-compute procedural nature?


Another thought: if there are, say, 1024 particles on screen at a resolution of 1024x1024, then each particle takes up 1024 pixels on average, which means each particle is roughly 32x32 pixels in size (on average).
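That back-of-the-envelope estimate is just total screen pixels divided by particle count (the numbers here are the made-up ones from the example above):

```c
#include <assert.h>

/* Average screen pixels covered by each particle, assuming the particles
 * tile the screen with no overlap (an idealization for estimation only). */
int avg_pixels_per_particle(int width, int height, int particles)
{
    return (width * height) / particles;
}
```

1024x1024 pixels over 1024 particles gives 1024 pixels each, i.e. a 32x32 square on average.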

So: putting the textures used by the particles into a texture atlas is definitely the way to go. Moreover, "calculating what textures to use each frame" does not sound likely. The more likely case is that from one frame to the next, most of the particles use the same texture as the last frame. A texture atlas gives you flexibility: you build the atlas as you need texture "room", and once the atlas gets full, you take a gander at which images are needed and which are not. I freely confess that having the image applied to each particle change from frame to frame does not make sense to me. I can imagine a system where you "allocate particles each frame", with the allocation done frame by frame, per particle or per group. In that case, you've got some refactoring to do in order to take advantage of the fact that for most particles, from one frame to the next, the image data is the same.

Since you are talking about particles, the images are likely pretty small, so the slack wasted by an atlas is not such a big deal. The most obvious thing to do is make all the images exactly the same size, and a power of 2 at that; then the texture atlas business is heck-a-easy.


Keep in mind that at the end of the day, you (only, snickers) typically have 1GB of VRAM into which to fit geometry, textures and framebuffers.

Melekor
02-21-2011, 05:14 PM
@Alfonse:

I defer to your expertise on the hardware part. But it still seems like this scenario could be a lot more efficient if it were handled in the driver instead of making thousands of bind and drawArrays calls, especially if it could be guaranteed that all the textures are the same format (but not the same size).

@kRogue:

Interesting. That actually gives me an idea. I could allocate a large texture array, say 1024x1024x64, and put the LRU cache inside it. Under certain (realistic, I think) assumptions about the rate at which things enter and leave the LRU, this may be more efficient in both the typical and worst-case scenes.

Just one problem: how would it work with mipmaps? I need mipmaps on all the textures, and if everything is in an atlas I can no longer use glGenerateMipmap (not directly, anyway).

Dark Photon
02-21-2011, 05:51 PM
Interesting. That actually gives me an idea. I could allocate a large texture array, say 1024x1024x64, and put the LRU inside that. ... Just one problem, how would it work with mipmaps? I need mipmaps on all the textures, and if everything is in an atlas I can no longer use glGenerateMipmap. (not directly anyways)
No prob. Slices of texture arrays can have MIPmaps.


Texture Atlases won't work since I don't know until draw time which textures are needed together in the same atlas. I could generate an atlas every frame, but that would be slower than just doing the switching.
You can put a lot of slices in your atlases, and then either dynamically rewrite the slice index (texcoord.z) for your batch verts, or use a helper texture to translate your virtual texcoord.z into the actual texture array slice index (think virtual memory).

Depending on your program's constraints, applying this may be trivial or hard. For instance: which texture formats do you need to support? Which resolutions? That yields the number of texture array permutations. Then you need to figure out the max number of slices in each array (i.e. the largest working set for each format+resolution). Trivial, hard, or in between... it depends on your app.
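As a rough sketch of what such a slice-level LRU might look like on the host side (all names invented; the slice count and eviction policy are assumptions, and a real version would also issue the glTexSubImage3D upload on eviction):

```c
#include <assert.h>

#define NUM_SLICES 4  /* tiny for illustration; a real array might have 64+ */

static int slice_owner[NUM_SLICES]; /* application texture ID per slice, -1 = free */
static int slice_stamp[NUM_SLICES]; /* last-use time, for LRU eviction */
static int now;

void lru_init(void)
{
    for (int i = 0; i < NUM_SLICES; ++i) { slice_owner[i] = -1; slice_stamp[i] = 0; }
    now = 0;
}

/* Returns the array slice holding 'tex_id'. Sets *evicted to 1 when a
 * resident texture had to be thrown out, meaning the caller must
 * re-upload 'tex_id' image data into the returned slice. */
int lru_slice(int tex_id, int *evicted)
{
    ++now;
    *evicted = 0;
    int lru = 0;
    for (int i = 0; i < NUM_SLICES; ++i) {
        if (slice_owner[i] == tex_id) {         /* hit: refresh and reuse */
            slice_stamp[i] = now;
            return i;
        }
        if (slice_stamp[i] < slice_stamp[lru])  /* track least recently used */
            lru = i;
    }
    if (slice_owner[lru] != -1)                 /* miss: evict if occupied */
        *evicted = 1;
    slice_owner[lru] = tex_id;
    slice_stamp[lru] = now;
    return lru;
}
```

The slice index returned here is exactly what you would write as texcoord.z (or into the helper translation texture) for the batch.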

kRogue
02-22-2011, 12:53 AM
You could also calculate the mipmaps yourself rather than using glGenerateMipmap (which is usually just a simple box filter).

Assuming the image data in the atlas has power-of-2 dimensions, you can set the mipmap data of the images yourself. This makes a great deal of sense if the images are loaded from disk (i.e. the loaded image data includes the mipmaps). If you are generating the image data procedurally at run time, then generating the mipmap data yourself is quite likely to be much faster anyway. Take a look at nvidia.developer.com for a texture atlas white paper; essentially it says "make the image data powers of 2, or make sure all image data is at power-of-2 boundaries."

If the power-of-2 restriction is too great, you can use GL_TEXTURE_MAX_LOD to specify the highest mipmap level that GL may use. For example, if the image data is aligned to a multiple of 2^k, setting GL_TEXTURE_MAX_LOD to k will work too.
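For reference, the 2x2 box filter that glGenerateMipmap typically applies is easy to write yourself. This sketch (RGBA8, square power-of-two input, made-up function name) produces the next mipmap level down; you would call it once per level, feeding each output back in as the next input:

```c
#include <stdint.h>

/* Hypothetical sketch: generate the next mipmap level of a square
 * power-of-two RGBA8 image with a 2x2 box filter. 'src' is
 * src_size x src_size pixels; 'dst' must hold (src_size/2)^2 pixels. */
void box_filter_level(const uint8_t *src, int src_size, uint8_t *dst)
{
    int dst_size = src_size / 2;
    for (int y = 0; y < dst_size; ++y)
        for (int x = 0; x < dst_size; ++x)
            for (int c = 0; c < 4; ++c) {
                /* average the 2x2 block of source texels, per channel */
                int a = src[((2*y    ) * src_size + 2*x    ) * 4 + c];
                int b = src[((2*y    ) * src_size + 2*x + 1) * 4 + c];
                int d = src[((2*y + 1) * src_size + 2*x    ) * 4 + c];
                int e = src[((2*y + 1) * src_size + 2*x + 1) * 4 + c];
                dst[(y * dst_size + x) * 4 + c] = (uint8_t)((a + b + d + e) / 4);
            }
}
```

The generated levels can then be uploaded per atlas (or per array slice) with glTexSubImage2D/glTexSubImage3D at the appropriate level argument.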