Are low level shaders better?

Do you find that low level shaders perform better than GLSL? In some cases, they can give 2 times higher FPS but I haven’t done much testing. I would like to hear from you.

Also, comparing fixed function with a shader that does the same thing, I suspect the fixed function path might give better performance since it’s tuned up in the drivers.

About the first question - that truly depends on many, many factors.

As for me, I’m not used to dealing with GLSL. I’m using Cg for off-line compilation and ARB_fragment_program/ARB_vertex_program with the nVidia OPTION string to load exactly the shader that is most appropriate for the currently running platform. Any optimizations can then be done separately, if it is really the bottleneck.

But I don’t think GLSL can compile better than I can when writing the same low-level shader, just because of my experience with it. Sometimes you know there is an instruction that the compiler doesn’t take advantage of (XPD, DPH, SSG and so on).

One of the things that I’ve found, which is specific to NVidia, is that I can’t noticeably beat the GLSL compiler. I can tie with it, but that’s all I’ve managed to do. Here’s what I’m doing so you can get a better picture of how the GPU is being used.

Usually, I have 8 textures bound for normal light rendering. This includes 4 textures from the material (bump, normal, diffuse, specular), 1 shadow buffer, 1 light mask, 1 projected texture, and 1 normalization cubemap. Most lighting models are fairly simple, but my most complex one compiles to about 80 instructions. Most of my tests were with the more complex ones, and my test was simply an assembly fragment program using ARB_precision_hint_fastest against a GLSL shader using the half types exposed by the NVidia GLSL compiler. After some pretty exhaustive tests, I found that I couldn’t beat the GLSL compiler. The best I did was tie it. I will say, though, that if I dumped out the assembly, I would sometimes get more instructions from the GLSL shader. However, more instructions don’t translate directly into worse performance. My guess is that the output was simply tailored to NVidia’s best performing instruction usage.
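
For readers unfamiliar with the comparison described above, here is a rough, purely hypothetical sketch of a reduced-precision GLSL fragment shader of that style. The half/half3 types are not standard GLSL; they were an extension accepted only by NVidia’s compiler of that era (roughly the GLSL counterpart of ARB_precision_hint_fastest on the assembly side), and the sampler and varying names here are made up.

// Hypothetical sketch only: `half` types are an NVidia GLSL extension
// and will not compile on other implementations.
uniform sampler2D normalMap;   // made-up name
varying vec2 texCoord;
varying vec3 lightVecTS;       // tangent-space light vector

void main()
{
    // do the per-fragment lighting math in 16-bit precision
    half3 N = half3(texture2D(normalMap, texCoord).rgb * 2.0 - 1.0);
    half3 L = half3(normalize(lightVecTS));
    half  NdotL = max(dot(N, L), half(0.0));
    gl_FragColor = vec4(vec3(float(NdotL)), 1.0);
}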

Also, I have yet to do any performance tests on ATI hardware, but they seem very confident in their GLSL compiler. That said, I’ll probably see similar results on ATI cards.

All this said, I now believe GLSL is the better option. The reason is that the assembly shaders I’ve written were hand optimized for the card that was in my machine at the time (NVidia). I would bet that an ATI card would run those shaders slower than if they were hand tuned for ATI. However, GLSL takes care of this and compiles down to the best shader for the given architecture (at least in theory, and in practice from what I’ve seen thus far). So I’m a firm believer in GLSL now.

One thing to note: I will generally prototype shaders in assembly. This lets me see where my clock cycles are going and lets me define how a shader should work in a way that is friendly for the target hardware. Then I do a straightforward port to GLSL, dump out the assembly, and compare it with my prototype. This lets me catch performance issues that wouldn’t have been obvious otherwise. For instance, a matrix-vector multiply either translates into 4 DP4s, or a transpose (several MOVs) and then 4 DP4s. Seeing these things in assembly is a great way to catch such problems (I’ve solved things like this by transposing the matrix on the CPU, then reversing the order of the operands in the multiply in the GLSL shader).
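
As a minimal sketch of the workaround described above (the uniform name is made up, and whether the compiler actually needs the hint depends on the driver): upload the matrix pre-transposed on the CPU and flip the operand order in the shader. The math stays identical, but the compiler can consume the constants row by row as plain DP4s instead of transposing them with MOVs first.

// Vertex shader sketch. `modelViewProjT` is a made-up uniform that the
// application uploads already transposed.
uniform mat4 modelViewProjT;

void main()
{
    // v * transpose(M) == M * v, so this is the same transform as
    // gl_Position = M * gl_Vertex, just written as row-vector * matrix
    // against the pre-transposed upload.
    gl_Position = gl_Vertex * modelViewProjT;
}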

Kevin B

Originally posted by V-man:
Also, comparing fixed function with a shader that does the same thing, I suspect the fixed function path might give better performance since it’s tuned up in the drivers.
Fixed function is often the same thing as a shader under the hood, only hand tuned for the specific card, much like what you get when you compile with GLSL, just slightly better.

So today there is no real difference between fixed function, assembly shaders and GLSL if they do the same thing.

Generally speaking, fixed function, assembly, and GLSL all use the same underlying hardware on most modern GPU implementations today.

What’s different is the software path that generates the underlying machine-specific microcode.

Typically fixed function is very well optimized. Assembly and GLSL tend to be in the same ball park today as their levels of abstraction are not all that different. Especially if you’re writing “equivalent” shaders in both (that is, relatively simple).

Over time, as shaders and shading language features get more complex, I expect the cost of compilation and linking to become the bigger factor.

The language you choose should probably be more about portability and content creation tool integration. Rely on offline compilation to boil it down to efficient code.

Cass,

do you mean that even an assembly fragment program is “tuned up” after loading? I’m not speaking about precision_hint_fastest; I mean “tuning” in a more complex way, like instruction replacement and so on.

Yes, even the assembly profiles do register allocation and various other optimizations to avoid hazards, maximize parallelism, and generally improve perf.

GPU shader microarchitectures are too different (today) and perf matters too much to assume that any portable shader description can be translated into executable code as-is with no optimizations.

I think offline tools and an opaque binary shader loading interface like OpenGL ES has is the closest you’ll get to knowing exactly what the compiled shader microcode really looks like.

If GLSL and assembly shaders gave identical and ideal performance, I would be OK with it, but losing even a bit due to GLSL sucks because there is no room to spare.
I like portability, but I don’t like the idea of some extra MOVs.

About the first question - that truly depends on many, many factors.
I know. I think it could be a loss anywhere from 0% to PLENTY.

With NV you might be able to just use Cg and make use of their NV extensions but ATI is another story.

Originally posted by cass:
Assembly and GLSL tend to be in the same ball park today as their levels of abstraction are not all that different.
This is true, and frequently ignored. ARBfp doesn’t match the underlying hardware any better than GLSL. The difference is that ARBfp is a simpler interface, and thus had a shorter time to market.

Plus, there’s just no beating high level shaders for prototyping. I wouldn’t go back to the ARB*p stuff if you paid me. But if in the end you determine that there’s a significant difference in performance, you could always revert to another form before you ship. I imagine higher-level optimizations are going to make a far bigger difference in the long run.

Originally posted by Humus:
Originally posted by cass:
Assembly and GLSL tend to be in the same ball park today as their levels of abstraction are not all that different.
This is true, and frequently ignored. ARBfp doesn’t match the underlying hardware any better than GLSL. The difference is that ARBfp is a simpler interface, and thus had a shorter time to market.
Note, this is why I advocate simpler interfaces for programmable hardware. It improves time to market, and lets the software layer address the language aesthetics and tools integration issues.

Investing in a software layer above the driver isn’t necessarily an easy transition for OpenGL to make, but it’s a worthwhile one, I think.

Ok men,

Here is one example
http://ee.1asphost.com/vmelkon/files/glass_delphi3d_3.zi

you may need glew32.dll, glut32.dll

R9700, Cat 6.8, assembly, 400FPS
R9700, Cat 6.8, GLSL, 200FPS

and I recently changed the GLSL part to use glVertexAttrib, and now it’s even worse
R9700, Cat 6.8, GLSL, 40FPS

Can it get any worse?

Glass:
ARBfp: 11 ALU, 4 TEX
GLSL: 10 ALU, 4 TEX

Depthmap:
ARBfp: 3 ALU
GLSL: 3 ALU

GLSL comes out as the winner. If there’s a performance issue here it lies elsewhere. Also worth noting is that writing good code will always be more important than language. You can cut 3 instructions from both the ARBfp and GLSL code of Glass by using an interpolator instead of gl_FragCoord.xy * InvTex0Dimensions.
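
For illustration, here is one hedged sketch of the interpolator trick suggested above (names like sceneTex are made up, and it assumes the texture being read covers the whole viewport): compute the screen-space lookup coordinate in the vertex shader and let the hardware interpolate it, instead of rebuilding it per fragment from gl_FragCoord.

// --- vertex shader ---
varying vec4 screenTC;

void main()
{
    gl_Position = ftransform();
    // Remap clip-space xy from [-w, w] to [0, w]; the divide by w happens
    // inside texture2DProj in the fragment shader, so no per-fragment ALU.
    screenTC = vec4((gl_Position.xy + gl_Position.ww) * 0.5, 0.0, gl_Position.w);
}

// --- fragment shader ---
varying vec4 screenTC;
uniform sampler2D sceneTex;   // made-up: the screen-sized texture being sampled

void main()
{
    // replaces texture2D(sceneTex, gl_FragCoord.xy * InvTex0Dimensions)
    gl_FragColor = texture2DProj(sceneTex, screenTC);
}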

For the ASM version
glass:
Program native instructions = 15
11 ALU and 4 TEX

depthmap:
Program native instructions = 3

So we are in agreement. How did you get the instruction counts for GLSL?

I know the code sucks because it uses immediate mode and I have glGetError everywhere, but that shouldn’t be a problem.
I even changed from using a mat3 to mat4 in the glass VS

I know the code sucks because it uses immediate mode and I have glGetError everywhere, but that shouldn’t be a problem.
And exactly why not?

I’m not an IHV, so I can’t be sure, but I wouldn’t be surprised if little-to-no effort was expended to make immediate mode and glslang shaders work well together. Take the 10 minutes to change over to VBOs just to make sure.

Originally posted by V-man:
How did you get the instruction counts for GLSL?
Using an internal tool.

Originally posted by V-man:
I even changed from using a mat3 to mat4 in the glass VS
I see no point in doing that. In the worst case that could mean another instruction for the last line. In your case though this seems to be optimized away since you’re just adding on a zero at the end. I tried changing it back to mat3 and changing some vectors that really are scalars back to scalars. It didn’t make any difference in this shader though, 34 instructions in both cases.

I see no point in doing that. In the worst case that could mean another instruction for the last line. In your case though this seems to be optimized away since you’re just adding on a zero at the end. I tried changing it back to mat3 and changing some vectors that really are scalars back to scalars. It didn’t make any difference in this shader though, 34 instructions in both cases.
When I changed from mat3 to mat4, I was thinking in terms of what happens in the driver.
If the driver has to upload vec4, then it has to expand the matrix to a 4x4 anyway.
If it can optimize that out, then it’s good because it has to also look at the FS.
I would like to change to VBOs, but who knows when I’ll get my complicated renderer working in my real app. I suspect that generic vertex attribs (GVA) suck in some way.

Tracing Doom 3 shows that they use GVA and non-GVA attribs in parallel.
- Not sure why they use glBlendFunc(GL_ONE, GL_ZERO);
  Isn’t that like disabling blending?
- glColor4fv
- glDrawElements but not glDrawRangeElements
  glDrawElements(…, GL_UNSIGNED_INT, …); only
- glColor3f(1.000000, 1.000000, 1.000000); ?
- Consecutive calls to
  glActiveTextureARB(GL_TEXTURE1);
  glActiveTextureARB(GL_TEXTURE2);
  glActiveTextureARB(GL_TEXTURE3);
  glActiveTextureARB(GL_TEXTURE4);
  glActiveTextureARB(GL_TEXTURE5);

It won’t expand to 4x4, but possibly to 4x3, depending on how you see it. It will of course use three different constants, but the last component may be used for some other scalar. I don’t really see any benefit to using 4x4 when you can use 3x3 on the driver side either.
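
A minimal vertex-shader sketch of the two variants being compared (the uniform and varying names are made up); once the compiler drops the zero terms both forms cost the same, the mat3 simply leaves the fourth component of each constant register free for other scalars.

uniform mat3 normalMat3;      // made-up name
uniform mat4 normalMat4;      // same rotation, padded with zeros
varying vec3 normalEye;

void main()
{
    gl_Position = ftransform();
    normalEye = normalMat3 * gl_Normal;                      // mat3 form
    // normalEye = (normalMat4 * vec4(gl_Normal, 0.0)).xyz;  // mat4 form
}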

As for generic vertex attribs, that should not be a problem. For immediate mode I’m not sure how optimized that is, but for vertex arrays it should be fast, unless you’re using a format that’s not natively supported (like 3 * GL_UNSIGNED_BYTE).

Well, more than a year has passed by since this conversation took place. What are the general opinions nowadays about this topic?

ASM vs GLSL performance.

Try it in the shader analyser:
http://ati.amd.com/developer/gpusa/index.html
(Nvidia has a similar tool, I think)

Normally, GLSL will be compiled to the same microcode as an equivalent ARB_fp. If not, you can usually optimize your GLSL code until it does.

I decided to switch to GLSL…