Carmack .plan: OpenGL 2.0

Hyena #1: “Carmack.”
Hyena #2 (shivers): “brr… Ooh, do it again!”
Hyena #1: “Carmack.”
Hyena #2 (shivers): “brr…”

Since someone will eventually post it anyway, I’ll mention that Carmack has updated his .plan with impressions of the P10 and a word about OpenGL 2.0.

You can read it here: http://www.shacknews.com/finger/?fid=johnc@idsoftware.com

Here’s a snippet regarding OpenGL 2.0:

I am now committed to supporting an OpenGL 2.0 renderer for Doom through all the spec evolutions. If anything, I have been somewhat remiss in not pushing the issues as hard as I could with all the vendors. Now really is the critical time to start nailing things down, and the decisions may stay with us for ten years.

Enjoy,
– Jeff


I like this quote:

“but drivers must have the right and responsibility to multipass arbitrarily complex inputs to hardware with smaller limits. Get over it.”

This would, of course, be great for developers. However, I know the hardware/driver people will fight it.

– Zeno

I honestly see it as infeasible, or at least much worse in performance.

The only way I can imagine implementing it in the general case is to use a SW renderer. (And that’s not invariant with HW rendering!)

And eventually, for a big enough program, the API has to refuse to load it. The real issue is whether this will happen in a defined way or an undefined way. I’ve always been of the opinion that program loading should be fully deterministic: you can decide in advance, on paper, given the extension spec and the implementation limits, whether a given program will load or fail to load.

NV_vertex_program has a fully deterministic program loading model. ARB_vertex_program has a 99.999% deterministic model (the one thing that got an exemption would not have made sense to handle any other way, and it is still fully deterministic in practice).
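As a concrete illustration of what deterministic loading looks like from the app side, here is a minimal C sketch, assuming a context exposing ARB_vertex_program with its entry points already resolved through the platform’s extension-loading mechanism. The app can tell at load time whether a program string was accepted, and can also query whether the program fits the implementation’s native limits:

```c
#include <stdio.h>
#include <GL/gl.h>
#include <GL/glext.h>   /* ARB_vertex_program enums */

/* Assumes glBindProgramARB, glProgramStringARB and glGetProgramivARB
   have already been obtained via the extension-loading mechanism. */
static int load_vertex_program(GLuint id, const char *src, GLsizei len)
{
    GLint errpos, under_native;

    glBindProgramARB(GL_VERTEX_PROGRAM_ARB, id);
    glProgramStringARB(GL_VERTEX_PROGRAM_ARB,
                       GL_PROGRAM_FORMAT_ASCII_ARB, len, src);

    /* An error position of -1 means the program string was accepted. */
    glGetIntegerv(GL_PROGRAM_ERROR_POSITION_ARB, &errpos);
    if (errpos != -1) {
        fprintf(stderr, "program rejected at %d: %s\n", errpos,
                (const char *)glGetString(GL_PROGRAM_ERROR_STRING_ARB));
        return 0;
    }

    /* Also queryable: whether the program fits the native HW limits. */
    glGetProgramivARB(GL_VERTEX_PROGRAM_ARB,
                      GL_PROGRAM_UNDER_NATIVE_LIMITS_ARB, &under_native);
    return under_native != 0;
}
```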

If programs’ failure to load is undefined, and there are no resource limits, you can end up with unintended scenarios such as the following. What if we decided that a GF3 implementation should only load the programs that fit in one GF3 rendering pass? It’d make our lives easy. The spec wouldn’t explicitly prohibit what we were doing. Okay, so you might complain. But replace GF3 with GFx. Say we make our limits and capabilities high enough that we think you won’t ever notice in practice, but we still refuse to do multipass. Is that OK? We could be all dishonest and return a GL_OUT_OF_MEMORY. (But is that really dishonest? Isn’t a register file on chip that only has N register slots a “memory” that you can be “out of” too?)

The invariance issue is especially interesting. It can be brushed off for vertex programs, but not on the fragment level. Wouldn’t you really want to have a guarantee that if you change only the currently bound fragment program, the X/Y/Z of all rendered fragments must not change? Consider a GF3 implementation of “driver multipass”. Not only is multipass impossible there in a lot of cases, but you can easily write shaders that can’t be run in hardware no matter what. But then even very simple apps could get Z-fighting artifacts…

Consider any app that draws its Z first (which can definitely use hardware), and then does shading in subsequent passes (this is a typical model for stencil shadow apps). If that shading ever requires a software fallback, your scene will be covered with Z-fighting artifacts. And since you could write lots of programs that GF3 could never even hope to emulate, even in ridiculous numbers of passes, many nontrivial shaders would appear to render incorrectly.

But then what is the whole point of this no-limitations transparent-multipass stuff? It’s that old hardware should be able to run new shaders – perhaps slowly, but at least correctly. Unfortunately, invariance can break that promise.

  • Matt

If that shading ever requires a software fallback, your scene will be covered with Z-fighting artifacts

Why?

Matt, wouldn’t doing multiple draws to a texture solve the multipass problem?

This is a technique that works well on current hardware for complex effects (and is demonstrated by NVIDIA in a PowerPoint presentation).

The driver could break the compiled code into sections that the card can do in a single pass, draw each to a texture, and collapse the results.

This is probably a simplistic view, but couldn’t a scheme like that work?
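A rough C sketch of that idea, with the app standing in for the driver: split the shader into hardware-sized pieces and capture each intermediate result with glCopyTexSubImage2D so the next piece can read it as a texture. draw_partial_shader_pass() is a hypothetical stand-in for running one such piece.

```c
#include <GL/gl.h>

/* Hypothetical: renders the i-th hardware-sized piece of the shader. */
extern void draw_partial_shader_pass(int i);

void run_split_shader(GLuint intermediate_tex, int width, int height)
{
    /* Pass 1: the portion of the shader that fits in one hardware pass. */
    draw_partial_shader_pass(0);

    /* Capture the result so the next pass can read it as a texture.
       Assumes intermediate_tex was already allocated at width x height. */
    glBindTexture(GL_TEXTURE_2D, intermediate_tex);
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);

    /* Pass 2: the rest of the shader, with the intermediate as an input. */
    draw_partial_shader_pass(1);
}
```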

Originally posted by knackered:
Why?

I’m assuming it’s because the graphics card does math differently than a CPU would, especially at the fragment level, where the math is only 8-bit and clamped, or in the depth buffer, where values are 16 or 24 bits, so rounding may be an issue. If you had to emulate all that on a CPU, I’d think it’d be prohibitively slow.

– Zeno

Software rasterizers use different-precision math and different Z interpolation techniques than hardware rasterizers. The two are nearly impossible to reconcile, especially since every hardware generation does it slightly differently… and what goes for Z interpolators also goes for clippers, triangle setup, backface culling, subpixel snapping, and the rasterization rules for edges.

The easiest example of a shader that cannot be broken down into multiple passes on a GF3 is a specular exponent done in real floating-point math. Often, bumpmapping uses the combiners to take exponents; here I’m talking about the equivalent of C’s pow(). It can be as simple as:

out = pow(in, 20.37126f);

Or you can do specular exponents where the shininess actually comes from a texture:

out = pow(N dot H, tex0) [N and H are probably textures too, from normalization cubemaps; or you can do normalize math in FP for exact results]

etc. If you don’t have this math operation per-pixel (at a bare minimum, you need per-pixel exp() and log() to emulate pow()), it’s essentially impossible to emulate. Sure, you can use a dependent texture lookup to do a power table, but that’s not what the user is asking for here. They’re asking for this math operation done in floating point.
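For reference, here is the identity behind that requirement, as a minimal C sketch; exp2f/log2f are the C99 library calls, standing in for the per-pixel instructions a chip would need:

```c
#include <math.h>

/* pow(x, y) = 2^(y * log2(x)) for x > 0, which is why per-pixel exp
   and log are the bare minimum needed to emulate a per-pixel pow(). */
float pow_via_exp_log(float x, float y)
{
    return exp2f(y * log2f(x));
}
```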

This is also a good example of the tradeoffs between speed and quality. A 1D/2D/3D texture is simply a function of 1/2/3 inputs. The driver can’t decide when a function can be approximated by a texture; the app can. In practice, you don’t necessarily want to run the same shader on every piece of hardware. In this case, you really want a graceful fallback to something a little more suited to the HW, rather than a complete collapse into SW rendering.

  • Matt

Originally posted by mcraighead:

The driver can’t make decisions about when a function can be approximated by a texture; the app can.

Why not give the driver this power? I know I wouldn’t mind.

I understand that there might be a better way to do it on lower-end hardware, but I’d much rather have my code working on a large range of cards without too many special cases. The results will look a bit different on different cards; big deal!

Well, I guess this will be a hot topic of discussion on the future of shaders.


Errr, no, this would work pretty badly. Textures are lookup tables and can therefore be built with widely varying degrees of accuracy. The last thing you want is for us to be able to cut that 256x256 lookup table down to 64x64 so we can win a benchmark. The looked-up value is also a source of precision woes: how many bits are needed?

The driver is not in the position to know how the app wants to make its image quality vs. performance tradeoffs.

For that matter, if the driver can replace a function with a texture, why not just go the next step and say that the driver can replace a function with a simpler function? Who’s to say that “x + 0.001f” can’t just be simplified to “x”? Or why not move a fragment computation up to the vertex level for efficiency against the app’s will?

The driver’s job is to do what the app says, no more and no less.

  • Matt

You have good points, but I for one believe it is not the role of the app to decide on quality vs. speed. That’s the user’s role.

So if a card uses lookup tables to support some functions, provide a driver option to set their size.

And as for moving fragment operations to the vertex level: if the user wants it, why not?

Making things look crappier without any external demand doesn’t change anything for the app; it will just annoy the user.


If the user is deciding, then it still needs to be the app’s responsibility to decide how to cut down on scene/shader complexity. The driver simply has far, far too little information to make any policy decisions for the app. In general, policy decisions for rendering must be made using higher-level information, such as scene-graph information or knowledge about the particular demands of the application.

The driver’s role is to do precisely what it is told to do, but as quickly as possible.

The app is the only one who can decide what to do. This is simply the nature of immediate-mode APIs.

Of course, building textures to replace functions is a grossly special-case optimization (the functions usually must have domain [0,1] and be continuous), it can be very expensive (building and downloading textures!), and it may not accelerate final rendering anyhow (what if the shader is texture-fetch-limited rather than math-limited?).
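To make “building and downloading textures” concrete, here is a minimal C sketch (an illustration, not anything a driver actually does) of baking a power function over [0,1] into a 256-entry 8-bit lookup texture; the resolution and bit depth are exactly the quality/performance knobs being argued about:

```c
#include <math.h>
#include <GL/gl.h>

/* Bake pow(x, exponent), x in [0,1], into a 1D luminance texture. */
void build_power_lut(GLuint tex, float exponent)
{
    GLubyte table[256];
    int i;

    for (i = 0; i < 256; i++)
        table[i] = (GLubyte)(255.0f * powf(i / 255.0f, exponent) + 0.5f);

    glBindTexture(GL_TEXTURE_1D, tex);
    glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage1D(GL_TEXTURE_1D, 0, GL_LUMINANCE8, 256, 0,
                 GL_LUMINANCE, GL_UNSIGNED_BYTE, table);
}
```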

If you want an API where someone else makes rendering policy decisions, and you simply say things like “I want bumpmapping”, you need a higher-level rendering API than OpenGL. OpenGL cannot solve that problem by itself.

  • Matt

Does John Carmack understand all this, Matt?
Can you think of his reasoning for trying to shift the responsibility to the driver?

I am not sure I am fully understanding you Matt.

The app still specifies the shader. What happens to make it work is not really its business.

I understand that all the texture work needed is expensive, but people will still code for the lowest common denominator and make sure the performance is fine on it.

Can you think of his reasoning for trying to shift the responsibility to the driver?

Easy enough. He wants to make his life easier. If you give the driver the responsibility of multipassing, then he doesn’t have to write specialized code for different types of hardware. It’s not a question of whether he thinks the responsibility belongs there; he just wants to stop writing hardware-specific code.

Just a couple random comments here.

First, the whole multipass issue is getting dragged down into implementation details of HW that really isn’t powerful enough to support it. Every time the API makes a significant step forward, it is going to leave some HW either not supporting it well or not supporting it at all. GL2 is obviously designed with future HW as its primary target. There will be a transition period, somewhat like what Doom is already doing, where an app will need to code old-style in addition to GL2-style.

Next, the SW fallback thing really isn’t anything new, or any different from where we are today. Most of the changes you would want to make between passes do NOT guarantee invariance anyway. Any reliance on invariance generally amounts to relying on the target HW having more support than the spec requires.

Next, SW fallbacks are nothing new. One of the goals stated in the GL2 white papers was to specify an API that HW can grow into, much as OpenGL originally was (hardly any HW could do OpenGL fully in HW back in 1992). Just a couple of years ago, consumer HW was still working out how to avoid SW fallbacks such as certain blending modes not being HW accelerated. Right now we have the API chasing the HW; this makes it hard to come up with a consistent, clean path of evolution, because each time the API is extended, the focus ends up on current limitations rather than on the correct/robust/orthogonal thing to do.

Next, when it comes to guaranteeing multipass invariance, nothing will be better at it than the driver. When API-level multipass is invoked, the driver knows that either all passes run in HW or all fall back to SW. If the app is doing the multipass itself, it can get this wrong (for example, by using one fixed-function pass and others with ARB_vertex_program without the position-invariant option).

Finally, virtualization of resources is a very reasonable direction. It has been the subject of several recent research papers (one, from Stanford, is called the F-buffer, IIRC). Virtualization lets the underlying HW choose the most efficient method, rather than the app compiling in a fixed set of methods for each piece/class of HW known at development time. (Coding a normalize is better than building a shader around a normalization cube map, since the HW may have an rsq instruction that doesn’t eat bandwidth and doesn’t suffer the cube map’s edge-filtering issues.)

Just MHO.

-Evan

BTW, for anybody going to SIGGRAPH who wants to learn more about the GL2 shading language etc., there is a course scheduled. It is not an official ARB thing, but rather a “where the proposals stand as of this date” overview to help people get ready.

I’ve never spoken to John; I’ve sent him a few short emails at times, and never gotten replies.

Gorg, you are taking a very limiting view of what a “shader” is. It is absolutely essential for the app to know and specify how a given series of computations will be implemented: whether they happen at the vertex or fragment level, whether they occur at IEEE float precision or some sort of fixed-point precision, whether they are done with a texture lookup or with real math operations, whether specular exponents are implemented with a lookup table, repeated squaring, or a power function. Apps most definitely need to specify which technique they want.

A “shader” is not just a series of desired math operations, a “specular bumpmapping with constant specular exponent” or something like that; at the API level, a shader needs to consist of a sequence of operations that compute a result. The driver does not have the semantic information about why a particular computation is being done. Remember that graphics hardware can be used for non-graphics computation, for one; someone would be rather angry if a driver dropped a texture in place of a power function in a scientific app.

In the example of a “normalize”, sure, several approaches could exist. You could implement it using DP3/RSQ/MUL, or you could implement it as a cubemap texture; but that cubemap texture could exist in widely varying resolutions and precisions, with nearest, linear, or fancier filtering. I think the onus falls clearly on the application not to just say “normalize” and expect the vector to come out normalized, but to specify clearly whether it needs a full (e.g.) IEEE float normalize or can live with a texture lookup.

I expect that – in practice – artists will be writing different shaders for different generations of hardware. That’s not going away, no matter what. On each particular piece of hardware, not only might you pick a different shader, but you might compile that shader to different low-level operations and a different low-level API.

Evan, I think you have it backwards on who can do multipass more easily. The vast majority of multipass scenarios in the real world are handled by doing the first pass with (say) DepthFunc(LESS), DepthMask(TRUE), and later passes with DepthFunc(EQUAL), DepthMask(FALSE). This is a very nice way to implement multipass, but it is completely out of bounds for a driver implementing multipass, because splitting a DepthFunc(LESS), DepthMask(TRUE) pass into multiple passes, with the later ones using DepthFunc(EQUAL), DepthMask(FALSE), does not produce the same results: in particular, if several fragments share the same X,Y,Z triplet, you get double blending. Again, the app is the one that has the semantic information about whether this sort of thing will work or not.
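For readers who haven’t written this kind of code, here is a minimal C sketch of the app-side pattern described above; draw_geometry() is a hypothetical stand-in for submitting the scene:

```c
#include <GL/gl.h>

/* Hypothetical: submits the scene geometry. */
extern void draw_geometry(void);

void draw_scene_multipass(void)
{
    /* Pass 1: resolve visibility, laying down the depth buffer. */
    glDepthFunc(GL_LESS);
    glDepthMask(GL_TRUE);
    glDisable(GL_BLEND);
    draw_geometry();

    /* Later passes: shade only fragments whose depth matches exactly,
       accumulating additively.  The app can rely on this because it knows
       whether its geometry produces coincident equal-Z fragments; a driver
       splitting a single LESS pass this way does not have that knowledge. */
    glDepthFunc(GL_EQUAL);
    glDepthMask(GL_FALSE);
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);
    draw_geometry();
}
```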

I am very skeptical of the practicality of F-buffer approaches.

If GL2 were merely “forward-looking”, I’d be all for that. But I think in practice it is (as currently proposed) a very worrisome direction for OpenGL to be heading. The proposals generally do not reflect my vision of where graphics should be 5 years from now. In fact, they fail to reflect it to such an extent that I think I would seriously consider finding a different line of employment [i.e. other than working on OpenGL drivers].

Again, I think it is completely wrong to be talking about how people are going to stop writing for hardware X or Y. You may be able to write a single piece of software that works on any given piece of hardware, but that completely fails to solve the problem of graceful degradation of image quality across hardware platforms. It does you no good to write a single shader that runs on all platforms but runs at 30 fps on the newest hardware and 0.1 fps on all the old hardware! Instead, I see future applications and engines playing out a careful balancing act between image quality and framerate. Ideally, your app runs at the same framerate on all hardware, but the complexity of the shading or of the scene in general changes. Indeed, the hope would be to maintain a constant framerate, insofar as that is possible.

Some people seem to think that these problems are going to get solved by the API. I disagree. I think they will be solved via a very careful interplay between the driver and a higher-level game engine/scene graph module, and also with some effort on the part of app developers and artists. Scalability takes work on everyone’s part. The important thing is to draw the right lines of division of labor.

In thinking about GL2, I’m reminded of Fred Brooks warning about the second-system effect in The Mythical Man-Month. Everyone wants the second system to be everything for everyone…

  • Matt

It does you no good to write a single shader that runs on all platforms, but runs at 30 fps on the newest hardware and 0.1 fps on all the old hardware!

Matt, this depends somewhat on the application you have in mind. That statement is true for most game companies (at least the majority without very deep pockets), but not true for the vis-sim or medical markets, where hardware costs are not an issue.

John also posted a comment on slashdot that is worth reading: http://slashdot.org/comments.pl?sid=34863&cid=3784210

In it, he cites the paper by Peercy et al. from SIGGRAPH 2000 ( http://www.cs.unc.edu/~olano/papers/ips/ips.pdf ), in which they argue that all that is needed to implement a RenderMan-style shader on graphics hardware is a floating-point framebuffer and dependent texture reads. He also mentions that such hardware may be available by the end of the year.

Given this, would it be possible to create a generalized shading language with driver-level multipass support that would work for all hardware from that point on? Is it just supporting anything less than this that is difficult (or impossible)?

– Zeno

Yes, there are techniques along these lines that work in certain scenarios, although they have some issues of their own.

The most obvious one is running out of memory; each intermediate value needs a texture. Another annoying one is that it uses rectangular regions, but for certain cases a rectangular region requires a lot more pixels than are truly needed.

There are also still issues with making all depth and stencil test and blend modes work without disturbing the current framebuffer contents inappropriately; especially if the shader computes a Z per fragment. The paper does not discuss making all depth/stencil/blend modes work, so far as I can tell.

The depth/stencil/blend thing is the real problem here. Everything else is small beans. (Throw in alpha test too.)

Z per fragment and alpha test are both really hard because you don’t know which fragment is the visible one on a given pixel until after you’ve computed the shader result for every fragment. But since this technique only stores one intermediate result per pixel (not per fragment, unlike the F-buffer), you are at a loss as to which intermediate result is the relevant one.

Blending is tough because multiple fragments may affect the final result, not just the “topmost” one.

I have a hard time convincing myself that this algorithm is capable of handling all the hard cases.

  • Matt

Where can I find the ‘prototype OpenGL 2.0 extensions’ that Carmack is talking about? :wink:

On the 3Dlabs P10-based cards?