
Carmack .plan: OpenGL2.0



Thaellin
06-28-2002, 07:44 AM
Hyena #1: "Carmack."
Hyena #2 (shivers): "*brr* Ooh, do it again!"
Hyena #1: "Carmack."
Hyena #2 (shivers): "*brr*"

Since someone will eventually post it, I'll mention that Carmack updated his .plan with impressions of the P10 and a word about OpenGL 2.0.

You can read it here: http://www.shacknews.com/finger/?fid=johnc@idsoftware.com

Here's a snippet regarding OpenGL 2.0:


I am now committed to supporting an OpenGL 2.0 renderer for Doom through all the spec evolutions. If anything, I have been somewhat remiss in not pushing the issues as hard as I could with all the vendors. Now really is the critical time to start nailing things down, and the decisions may stay with us for ten years.


Enjoy,
-- Jeff


Zeno
06-28-2002, 08:12 AM
I like this quote:

"but drivers must have the right and responsibility to multipass
arbitrarily complex inputs to hardware with smaller limits. Get over it."

This would, of course, be great for developers. However, I know the hardware/driver people will fight it :)

-- Zeno

mcraighead
06-28-2002, 09:02 AM
I honestly see it as infeasible, or at least much worse in performance.

The only way I can imagine implementing it in the general case is to use a SW renderer. (And that's not invariant with HW rendering!)

And eventually, for a big enough program, the API has to refuse to load it. The real issue is whether this will happen in a _defined_ way or an _undefined_ way. I've always been of the opinion that program loading should be fully deterministic: you can decide _in advance, on paper_, given the extension spec and the implementation limits, whether a given program will load or fail to load.

NV_vertex_program has a fully deterministic program loading model. ARB_vertex_program has a 99.999% deterministic model (and the one thing that got an exemption from that wouldn't have made sense to do any other way, and it's still fully deterministic in practice).

If programs' failure to load is undefined, and there are no resource limits, you can end up with unintended scenarios such as the following. What if we decided that a GF3 implementation should only load the programs that fit in one GF3 rendering pass? It'd make our lives easy. The spec wouldn't _explicitly_ prohibit what we were doing. Okay, so you might complain. But replace GF3 with GFx. Say we make our limits and capabilities high enough that we think you won't ever notice in practice, but we still refuse to do multipass. Is that OK? We could be all dishonest and return a GL_OUT_OF_MEMORY. (But is that really dishonest? Isn't a register file on chip that only has N register slots a "memory" that you can be "out of" too?)
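For concreteness, this is roughly what a defined, app-visible loading model looks like with ARB_vertex_program today (just a sketch: prog and src are assumed to exist, and the extension entry points are assumed to have been fetched via wglGetProcAddress/glXGetProcAddressARB):

/* prog: a program object from glGenProgramsARB; src: the program text. */
glBindProgramARB(GL_VERTEX_PROGRAM_ARB, prog);
glProgramStringARB(GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                   (GLsizei)strlen(src), src);
if (glGetError() == GL_INVALID_OPERATION) {
    GLint pos;
    glGetIntegerv(GL_PROGRAM_ERROR_POSITION_ARB, &pos);
    /* the program was rejected, and the spec tells you exactly why/where */
}
GLint native = 0;
glGetProgramivARB(GL_VERTEX_PROGRAM_ARB,
                  GL_PROGRAM_UNDER_NATIVE_LIMITS_ARB, &native);
/* native == GL_TRUE: fits the hardware's native limits; otherwise it still
   loads, but you know -- in a defined way -- that it may not run natively. */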


The invariance issue is especially interesting. It can be brushed off for vertex programs, but not on the fragment level. Wouldn't you really _want_ to have a guarantee that if you change _only_ the currently bound fragment program, the X/Y/Z of all rendered fragments must not change? Consider a GF3 implementation of "driver multipass". Not only is multipass impossible there in a lot of cases, but you can easily write shaders that _can't_ be run in hardware no matter what. But then even very simple apps could get Z-fighting artifacts...

Consider any app that draws its Z first (which can definitely use hardware), and then does shading in subsequent passes (this is a typical model for stencil shadow apps). If that shading _ever_ requires a software fallback, your scene will be covered with Z-fighting artifacts. And since you could write lots of programs that GF3 could never even hope to emulate, even in ridiculous numbers of passes, many nontrivial shaders would appear to render incorrectly.

But then what is the whole _point_ of this no-limitations transparent-multipass stuff? It's that old hardware _should_ be able to run new shaders -- perhaps slowly, but at least correctly. Unfortunately, invariance can break that promise.

- Matt

knackered
06-28-2002, 09:08 AM
If that shading _ever_ requires a software fallback, your scene will be covered with Z-fighting artifacts

Why?

Gorg
06-28-2002, 09:17 AM
Matt, wouldn't doing multiple draws to a texture solve the multipass problem?

This is a technique that works well on current hardware for complex effects (and is demonstrated by NVIDIA in a PowerPoint presentation).

The driver could break down the compiled code into sections that the card can do in a single pass, draw them to a texture, and collapse the results.

This is probably a simplistic view, but couldn't a scheme like that work?

Zeno
06-28-2002, 09:21 AM
Originally posted by knackered:
Why?

I'm assuming it's because the graphics card does math differently than a CPU would, especially at the fragment level, where math is only 8-bit and clamped. Or in the depth buffer, where things are 16 or 24 bits... rounding may be an issue. If you had to emulate all that on a CPU, I'd think it'd be prohibitively slow.

-- Zeno

mcraighead
06-28-2002, 09:37 AM
Software rasterizers use different precision math and different Z interpolation techniques than hardware rasterizers. The two are nearly impossible to reconcile, especially if every single hardware generation does it just _slightly_ differently... and just as for Z interpolators, so also for clippers, triangle setup, backface culling, subpixel snapping, and rasterization rules for edges.

The easiest example of a shader that _cannot_ be broken down into multiple passes on a GF3 is a specular exponent done in real floating-point math. Often, bumpmapping uses the combiners to take exponents; here I'm talking about the equivalent of C's pow(). It can be as simple as:

out = pow(in, 20.37126f);

Or you can do specular exponents where the shininess actually comes from a texture:

out = pow(N dot H, tex0) [N and H are probably textures too, from normalization cubemaps; or you can do normalize math in FP for exact results]

etc. If you don't have this math operation per-pixel (at bare minimum, you need per-pixel exp() and log() to emulate pow()), it's essentially impossible to emulate it. Sure, you can use a dependent texture lookup to do a power table. But that's not what the user is asking for here. They're asking for this math operation done in floating-point.
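For the record, the identity behind the "you need at least exp() and log()" claim is just the following, written out in plain C:

#include <math.h>

/* x^e via exp/log, the bare-minimum per-pixel operations mentioned above.
   Only valid for x > 0, which is fine for a clamped N dot H. */
float pow_via_exp_log(float x, float e)
{
    return expf(e * logf(x));
}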

This is also a good example of tradeoffs between speed and quality. A 1/2/3D texture is simply a function of 1/2/3 inputs. The driver can't make decisions about when a function can be approximated by a texture; the app can. In practice, you don't necessarily want to run the same shader on every piece of hardware. In this case, you _really_ want to have a graceful fallback to something a little more suited to the HW, rather than a complete collapse into SW rendering.

- Matt

Gorg
06-28-2002, 09:54 AM
Originally posted by mcraighead:

The driver can't make decisions about when a function can be approximated by a texture; the app can.

Why not give the driver this power? I know I wouldn't mind.

I understand that there might be a better way to do it on lower-end hardware, but I would much rather have my code working on a large range of cards without too many special cases. The results will look a bit different on different cards; big deal!

Well, I guess this will be a hot topic of discussion on the future of shaders.



mcraighead
06-28-2002, 10:22 AM
Errr, no, this would work pretty badly. Textures are lookup tables and can therefore be made with widely varying degrees of accuracy. The last thing you want is for us to be able to cut that 256x256 lookup table down to 64x64 so we can win a benchmark. :) The looked-up value is also a source of precision woes -- the question of how many bits are needed.

The driver is not in the position to know how the app wants to make its image quality vs. performance tradeoffs.

For that matter, if the driver can replace a function with a texture, why not just go the next step and say that the driver can replace a function with a simpler function? Who's to say that "x + 0.001f" can't just be simplified to "x"? :) Or why not move a fragment computation up to the vertex level for efficiency, against the app's will?

The driver's job is to do what the app says, no more and no less.
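To make the tradeoff concrete, here is roughly what the app-side version of such a lookup table might look like (a sketch only; build_pow_table and all the constants are illustrative). The resolution and the 8-bit internal format are explicit app decisions, which is exactly the point:

#include <math.h>
#include <stdlib.h>
#include <GL/gl.h>

/* Bake pow(x, exponent) over [0,1] into a 1D lookup texture. The app picks
   the resolution (size) and the precision (GL_LUMINANCE8 = 8 bits). */
static void build_pow_table(int size, float exponent)
{
    GLubyte *table = malloc(size);
    int i;
    for (i = 0; i < size; ++i) {
        float x = (float)i / (float)(size - 1);
        table[i] = (GLubyte)(255.0f * powf(x, exponent) + 0.5f);
    }
    glTexImage1D(GL_TEXTURE_1D, 0, GL_LUMINANCE8, size, 0,
                 GL_LUMINANCE, GL_UNSIGNED_BYTE, table);
    glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    free(table);
}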

- Matt

Gorg
06-28-2002, 11:38 AM
You have good points, but I for one believe it is not the role of the app to decide on quality vs. speed. It's the user's role.

So if the card uses lookup tables to support some functions, have a driver option to set the size.

And about moving the fragment operations to the vertex level: if the user wants it, why not! :)

Making things look crappier without any external demand does not change anything for the app; it will just annoy the user.




mcraighead
06-28-2002, 12:11 PM
If the user is deciding, then it needs to be the app's responsibility still to decide *how* to cut down on the scene/shader complexity. The driver simply has far, far too little information to make *any* policy decisions for the app. In general, policy decisions for rendering *must* be made using higher-level information, such as scene graph information or knowledge about the particular demands of the application.

The driver's role is to do precisely what it is told to do, but as quickly as possible.

The app is the only one who can decide what to do. This is simply the nature of immediate-mode APIs.

Of course, building textures to replace functions is a grossly special-case optimization (the functions usually must have domain [0,1] and be continuous), it can be very expensive (building and downloading textures!), and it may not accelerate final rendering (what if the shader is texture-fetch-limited rather than math-limited?) anyhow.

If you want an API where someone else makes rendering policy decisions, and you simply say things like "I want bumpmapping", you need a higher-level rendering API than OpenGL. OpenGL cannot solve that problem by itself.

- Matt

knackered
06-28-2002, 12:41 PM
Does John Carmack understand all this, Matt?
Can you think of his reasoning for trying to shift the responsibility to the driver?

Gorg
06-28-2002, 01:40 PM
I am not sure I fully understand you, Matt.

The app still specifies the shader. What happens to make it work is not really its business.

I understand that all the texture work needed is expensive, but people will still code for the lowest common denominator and make sure the performance is fine on it.

Korval
06-28-2002, 01:53 PM
Can you think of his reasoning for trying to shift the responsibility to the driver?

Easy enough. He wants to make his life easier. If you give the driver the responsibility of multipassing, then he doesn't have to write specialized code for different types of hardware. It's not a question of whether he thinks it belongs there. He just wants to stop writing hardware-specific code.

ehart
06-28-2002, 02:30 PM
Just a couple random comments here.

First, the whole multipass issue is getting brought down into implementation details on HW that really isn't powerful enough to support it. Every time the API makes a significant step forward, it is going to leave some HW either not supporting it well or not supporting it at all. GL2 is obviously designed with its primary target being future HW. There will be a transition period, somewhat like what Doom is already doing, where an app will need to code old-style in addition to GL2-style.

Next, the SW fallback thing really isn't anything new or any different than where we are today. Most of the changes you would want to make on different passes do NOT guarantee invariance anyway. Any reliance on invariance is generally relying on knowing that the target HW has more support than required.

Next, SW fallbacks are nothing new. One of the goals stated in the GL2 white papers was to specify an API that HW can grow into, much as OpenGL originally was. (Hardly any HW could fully do OpenGL in HW back in 1992.) Just a couple of years ago, consumer HW was working out how to avoid SW fallbacks, such as some blending modes not being HW accelerated. Right now we have the API chasing the HW; this makes it hard to come up with a consistent, clean path of evolution, as each time the API is extended, the focus ends up on current limitations rather than the correct/robust/orthogonal thing to do.

Next, when it comes to guaranteeing multipass invariance, nothing will be better at it than the driver. When API multipass is invoked, the driver knows that all passes must run in HW or all must fall to SW. If the app is doing this, it could screw it up. (Like using one fixed-function pass, and others with ARB_vertex_program without the position-invariant option.)

Finally, virtualization of resources is a very reasonable direction. It has been the subject of many recent research papers. (One, from Stanford, is called the F-buffer, IIRC.) The virtualization allows the underlying HW to choose the most efficient method, rather than the app compiling in a fixed set of methods for each known piece/class of HW at development time. (Coding in a normalize is better than creating a shader with a normalizer cube map, as the HW may have an rsq instruction that doesn't eat bandwidth, and it wouldn't suffer the cube-map edge filtering issues.)

Just MHO.

-Evan

BTW, for anybody going to SIGGRAPH and wanting to learn more about the GL2 shader languages etc., there is a course scheduled. It is not an official ARB thing, but it is a "where the proposals are as of this date" overview for people to get ready.

mcraighead
06-28-2002, 04:53 PM
I've never spoken to John; I've sent him a few short emails at times, and never gotten replies.

Gorg, you are taking a very limiting view of what a "shader" is. It is absolutely essential for the app to know and specify _how_ a given series of computations will be implemented. For example, whether they happen at vertex or fragment level, or whether they occur at IEEE float precision or some sort of fixed-point precision. Whether they are done with a texture lookup or with real math operations. Whether you implement specular exponents using a lookup table, repeated squaring, or a power function. Apps most definitely need to specify which technique they desire. A "shader" is not just a desired result -- "specular bumpmapping with constant specular exponent" or something. At the API level, a shader needs to consist of a sequence of _operations_ that compute a result. The driver does not have the semantic information of _why_ this particular computation is being done. Remember that graphics hardware can be used for non-graphics computation, for one; someone would be rather angry if a driver dropped in a texture for a power function in a scientific app.


In the example of a "normalize", sure, several approaches should exist. You could implement it using DP3/RSQ/MUL, or you could implement it as a cubemap texture. But that cubemap texture could exist in widely varying resolutions and precisions, with nearest, linear, or fancier filtering. I think the onus falls clearly on the application not to just say "normalize" and expect the vector to come out normalized, but to clearly specify whether it needs a full (e.g.) IEEE float normalize, or whether it can live with a texture lookup.
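As a concrete illustration of the app making that call, this is roughly what building one face of a normalization cube map looks like (a sketch; the face size and the 8-bit packing are the app's explicit quality decision, and the other five faces are analogous):

#include <math.h>
#include <stdlib.h>
#include <GL/gl.h>
#include <GL/glext.h>   /* GL_TEXTURE_CUBE_MAP_POSITIVE_X_ARB */

/* Fill the +X face of a normalization cube map. For +X, texel (s,t) maps to
   direction (1, -t, -s); components are packed into 8 bits as 0.5*n + 0.5. */
static void build_normalize_face_pos_x(int size)
{
    GLubyte *data = malloc(size * size * 3);
    int x, y;
    for (y = 0; y < size; ++y) {
        for (x = 0; x < size; ++x) {
            float s = 2.0f * (x + 0.5f) / size - 1.0f;
            float t = 2.0f * (y + 0.5f) / size - 1.0f;
            float v[3] = { 1.0f, -t, -s };
            float len = sqrtf(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
            int i;
            for (i = 0; i < 3; ++i)
                data[(y * size + x) * 3 + i] =
                    (GLubyte)(255.0f * (0.5f * v[i] / len + 0.5f));
        }
    }
    glTexImage2D(GL_TEXTURE_CUBE_MAP_POSITIVE_X_ARB, 0, GL_RGB8,
                 size, size, 0, GL_RGB, GL_UNSIGNED_BYTE, data);
    free(data);
}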

I expect that -- in practice -- artists will be writing different shaders for different generations of hardware. That's not going away, no matter what. On each particular piece of hardware, not only might you pick a different shader, but you might compile that shader to different low-level operations and a different low-level API.


Evan, I think you have it backwards on who can do multipass more easily. The vast majority of multipass scenarios are handled in the real world by doing the first pass with (say) DepthFunc(LESS), DepthMask(TRUE), and later passes as DepthFunc(EQUAL), DepthMask(FALSE). This is a very nice way to be able to implement multipass. But this method is completely out of bounds for a driver implementing multipass, because splitting a DepthFunc(LESS), DepthMask(TRUE) pass into multiple passes, the later ones with DepthFunc(EQUAL), DepthMask(FALSE) does *not* produce the same results! In particular, if several fragments have the same X,Y,Z triplet, you get double blending. Again, the app is the one that has the semantic information about whether this sort of thing will work or not.
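For reference, the app-level pattern being described looks roughly like this (draw_scene, bind_light_shader, and num_lights are placeholders for app code; the app knows this split is safe because it laid down all the Z itself first):

/* Pass 1: depth only. */
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
draw_scene();                        /* placeholder for the app's geometry */

/* Later passes: add each light on top of the already-resolved depth. */
int i;
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthFunc(GL_EQUAL);
glDepthMask(GL_FALSE);
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);         /* accumulate light contributions */
for (i = 0; i < num_lights; ++i) {
    bind_light_shader(i);            /* placeholder */
    draw_scene();
}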

I am very skeptical of the practicality of F-buffer approaches.


If GL2 were merely "forward-looking", I'd be all for that. But I think in practice it is (as currently proposed) a very worrisome direction for OpenGL to be heading. The proposals generally do not reflect my vision of where graphics should be 5 years from now. In fact, they fail to reflect it to such an extent that I think I would seriously consider finding a different line of employment [i.e. other than working on OpenGL drivers].


Again, I think it is completely wrong to be talking about how people are going to stop writing for hardware X or Y. You may be able to write a piece of software that _works_ on any given piece of hardware, but this completely fails to solve the problem of graceful degradation of image quality between different hardware platforms. It does you no good to write a single shader that runs on all platforms, but runs at 30 fps on the newest hardware and 0.1 fps on all the old hardware! Instead, I see future applications and engines playing out a careful balancing act between image quality and framerates. Ideally, your app runs at the same framerate on all hardware, but the complexity of the shading or the scene in general changes. Indeed, the hope would be to maintain a constant framerate, insofar as that is possible.

Some people seem to think that these problems are going to get solved by the API. I disagree. I think they will be solved via a very careful interplay between the driver and a higher-level game engine/scene graph module, and also with some effort on the part of app developers and artists. Scalability takes work on everyone's part. The important thing is to draw the right lines of division of labor.


In thinking about GL2, I'm reminded of Fred Brooks in The Mythical Man-Month warning about the second-system effect. Everyone wants the second system to be everything for everyone...

- Matt

Zeno
06-28-2002, 06:25 PM
It does you no good to write a single shader that runs on all platforms, but runs at 30 fps on the newest hardware and 0.1 fps on all the old hardware!

Matt, this depends somewhat on the application you have in mind. That statement is true for most game companies (at least the majority without very deep pockets), but not true for the vis-sim or medical markets, where hardware costs are not an issue.

John also posted a comment on slashdot that is worth reading: http://slashdot.org/comments.pl?sid=34863&cid=3784210

In it, he cites the paper by Peercy et al. from SIGGRAPH 2000 ( http://www.cs.unc.edu/~olano/papers/ips/ips.pdf ), where they say that all that is needed to implement a RenderMan-type shader on graphics hardware is a floating-point framebuffer and dependent texture reads. He also mentions that this hardware may be available by the end of the year ;)

Given this, would it be possible to create a generalized shading language with driver-level multipass support that would work for all hardware from that point on? Is it just supporting anything less than this that is difficult (or impossible?)?

-- Zeno

mcraighead
06-28-2002, 07:04 PM
Yes, there are techniques along these lines that work in certain scenarios, although they have some issues of their own.

The most obvious one is running out of memory; each intermediate value needs a texture. Another annoying one is that it uses rectangular regions, but for certain cases a rectangular region requires a lot more pixels than are truly needed.

There are also still issues with making all depth and stencil test and blend modes work _without_ disturbing the current framebuffer contents inappropriately; especially if the shader computes a Z per fragment. The paper does not discuss making all depth/stencil/blend modes work, so far as I can tell.

The depth/stencil/blend thing is the *real* problem here. Everything else is small beans. (Throw in alpha test too.)

Z per fragment and alpha test are both really hard because you don't know which fragment is the visible one on a given pixel until *after* you've computed the shader result for every fragment. But since this technique only stores one intermediate result per _pixel_ (not per fragment, unlike the F-buffer), you are at a loss as to which intermediate result is the relevant one.

Blending is tough because _multiple_ fragments may affect the final result, not just the "topmost" one.

I have a hard time convincing myself that this algorithm is capable of handling all the hard cases.

- Matt

Ozzy
06-28-2002, 07:46 PM
Where can I find the 'prototype OpenGL 2.0 extensions' that Carmack is talking about? ;)

Nutty
06-29-2002, 01:33 AM
On the 3dlabs P10 based cards?

Julien Cayzac
06-29-2002, 01:44 AM
Originally posted by mcraighead:
It does you no good to write a single shader that runs on all platforms, but runs at 30 fps on the newest hardware and 0.1 fps on all the old hardware!
(...)
Ideally, your app runs at the same framerate on all hardware, but the complexity of the shading or the scene in general changes.

Why not introduce a LOD concept into shading? As we now have mipmaps for lookups, we could get "mip shaders" in future GL releases. Then one shader could be used at the same framerate on all platforms, given the proper LOD bias...
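A rough sketch of how an app could fake that today, purely hypothetical (shader_lod[] and bind_program() are made-up names; coverage is the fraction of the screen the object covers):

#include <math.h>

#define NUM_SHADER_LODS 4

/* Pick a cheaper shader variant as the object covers less of the screen,
   biased just like texture LOD. Everything here is illustrative only. */
void bind_shader_lod(float coverage, float lod_bias)
{
    int lod = (int)(log2f(1.0f / (coverage + 1e-6f)) + lod_bias);
    if (lod < 0) lod = 0;
    if (lod > NUM_SHADER_LODS - 1) lod = NUM_SHADER_LODS - 1;
    bind_program(shader_lod[lod]);   /* hypothetical helper and table */
}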

Julien.

Ozzy
06-29-2002, 01:55 AM
Originally posted by Nutty:
On the 3dlabs P10 based cards?

I thought Carmack was talking about NV prototype extensions. Maybe you're right and I've misunderstood what he's written.

knackered
06-29-2002, 02:00 AM
Can someone clearly define the problem of passing multipass rendering over to the driver? I'm having trouble deciphering what is being said (not having much of an insight into low-level hardware gymnastics)...

As far as I can see:
The output of pass 1 must be available as the input to pass 2 in exactly the same form as the outputs between existing texture stages in a single pass - i.e. float texture coordinates for all units, float fragment registers... etc.
Is this correct?

Rml4o
06-29-2002, 02:02 AM
Theoretically, the GL2.0 prototypes have not been reviewed by the ARB yet. But 3Dlabs must have some beta versions for developers, even though I didn't find them on their developer site. Does anyone know where one could find them, together with the specs?

LordKronos
06-29-2002, 02:23 AM
I have to partially agree with Matt here. While getting functions to automatically multi-pass on old hardware is a pretty thought, it is way less than practical today. I see no way to make a shading setup reliably take advantage of a GeForce 4 while still scaling back to a TNT. It's just not reasonable because of the completely different sets of functionality.

However, I think we need to (at some point) stop writing a dozen different shader paths for everything. As I see it, the hardware coming up (GL2/DX9 compliant) should be flexible enough to do most anything we need to throw at it for quite a while. That is where we need to begin to simplify things, from that point forward. We should be able to just throw a shader at it, have it calculate the result to a p-buffer if required, and render the final result to the frame buffer as if it were a magical single pass.

As far as dealing with supported features, we could still use some sort of caps bit and select how each feature is to be implemented. Something like:
I need to take the log7 of each fragment (I'm just making this up, OK). I see the hardware supports logn (any base). Well, I'll just use that. No wait, it only supports log10. Then I want the hardware to calculate log10(fragment)/constant_value(log10(7)). Oh wait, you mean the hardware has no log support? OK, then I want it to look it up from this texture table using a dependent read. Etc.

The point is, we would get the ability to make simple, confined decisions. I can do a simple:
if (supported(LOG_BASE_N)) apply(LOG_BASE_N)
else if (supported(LOG_BASE_10)) apply ....

Each decision would be confined to a single feature. We would no longer have to worry about the combinatorial explosion that we do today. Today we say: oh, that's simple, I'll do this calculation instead... but now I need to break it into 2+ passes. So here is an optimized 4-texture version, a 6-texture version, an 8-texture version, a 4-texture version in case this other feature isn't supported and we have to break into 3 passes, etc.
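(For what it's worth, the fallback math in that made-up log7 example is just the change-of-base identity; in plain C:)

#include <math.h>

/* log base 7 from log base 10: log7(x) = log10(x) / log10(7) */
float log7(float x)
{
    return log10f(x) / log10f(7.0f);
}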

So again, it's still on us to decide how to do each little part of the equation, but how to get that equation into 1 or more passes is the driver's responsibility.

I don't understand the invariance issues you are talking about, Matt. If I pass the hardware a single x/y/z with 25 sets of texture coordinates, the hardware should be able to keep coming up with the same fragment Z for every pass. I understand some hardware isn't invariant between 2 different render setups, but I think maybe we need to progress to the point where we say it does have to do that, and the IHVs make it happen. I personally think it's kind of ridiculous how, for some cards, even making some stupid little render setup change means you lose invariance. Maybe there is a very real reason why hardware does it that way, but "because that's the way it has to be done" isn't a good enough excuse for me. Make the hardware so that it doesn't have to be done that way.

So maybe GL2/DX9 cards won't have all the features necessary to do this, but we need to make it a top priority to say that in the next couple of revisions, we need to get these features in so we CAN do this for the long run.

PH
06-29-2002, 02:38 AM
The invariance issues that Matt is talking about make a lot of sense, but he's talking about implementing this stuff on hardware that isn't flexible enough.

If I'm not mistaken, the _transition_ from OpenGL 1.3-1.4 to OpenGL 2.0 is done via new extensions. I think that's what John Carmack is doing with DOOM 3. It's just like what we're doing today - seriously, are you using extensions or do you require full GL 1.3 support? I still bind extensions and require only GL 1.1. That's the sort of transition I'm thinking about.

With the right hardware, I think Matt would agree that driver multipass is a good thing (right, Matt? :) ).

IT
06-29-2002, 06:20 AM
A few posts ago, when Cg first came out, I made a comment similar to the one John C made (of course, I'm no John C) regarding having the API/driver/whatever break up complex shaders into multipass automatically.

I think the gist of all of this is that the current state of affairs regarding shaders is a mess. It's possible to make cool shaders, but there is no guarantee that they'll work across a wide range of hardware, and one has to program for DX shaders, NV OGL extensions, and ATI OGL extensions.

Perhaps hardware vendors need to go back to the drawing board and rethink their designs so that future hardware handles arbitrarily complex shaders and become more of a general purpose processor with some specializations for fast fill rate, texture lookups, AA and other things.

ehart
06-29-2002, 06:54 AM
Matt,

First, on the normalize thing, that was an example I used because I felt it would be one just about everyone here was familiar with. Now, with the GL2 shaders as defined, you still could make the decision to bind a normalizer cubemap if you wanted. Also, your arguments about the precision are somewhat moot. All the whitepapers/specification suggest setting minimal ranges/precisions for operations that a conformant implementation would be required to comply with. As for the scientific usage issue, I would argue that they would require additional info from implementors, just as they would with today's requirements. GL only requires floats with 17-bit accurate mantissas for vertex transformations today.

On the multipass issue, you vastly oversimplified things. The requirements stated in the proposals would require that the results be as if the data went into the buffer in a single pass. This means that in the case of blending an F-buffer style implementation or writing to an intermediate buffer may be required. This would be a burden on the driver in the short term, and additionally, it would occasionally have a characteristic where ISV's would know that going beyond some certain limit is not a good idea on generation x of HW. This will always exist, as it is going to be quite a while before we can run million instruction programs per-fragment at 60 Hz.

You also missed my point on shaders working across multiple pieces of HW. I agree, less aggressive shaders will be necessary. On the other hand, as someone else mentioned, it then becomes a problem of writing a few less aggressive shaders rather than one for every single permutation of HW to run as effectively as possible. Also, it allows the heaviest shaders at ship time to run as best they can on HW released after ship. I can bet you have come across a fair number of apps that were released before the GeForce3 that only use 2 textures per pass for what is only a 3- or 4-texture effect. I would wager a fair sum you would like an easy mechanism to collapse those into a single pass.

Now, I'll stand up and say that I don't necessarily agree with everything in the GL2 white papers or all the things that get said about it. It is not a cure-all, and it is sometimes made to seem that way. Yes, there is still going to be plenty of work for app developers and artists. I do think that some of what has been said here are misconceptions (such as the blending thing and applicability to old HW), and I do want to make sure that this stuff gets represented in the correct light.

As for the discussion I have seen about developers needing to control exactly what is going on, I would argue that is a real stretch. Few people today program in ASM for full applications, and even those that do on x86 machines aren't really programming to the HW anymore anyway. (The instructions are decoded to micro-ops and ...) Then there are systems like Java and MSIL (.NET VM). (Now this sort of thing gets into a holy war I don't want, so just take this at face value.) These are ways of distributing a program that runs (reasonably) well across multiple processor architectures. I would argue that the challenge of shipping shader code that runs across NVIDIA, ATI, 3DLabs, Matrox, Intel, etc. is very similar to the challenge of shipping code that runs well across x86, MIPS, DragonBall (Palm processors?), Sparc, etc.

As for your disagreement, that is fine; as I stated, I don't necessarily agree with everything either. This is still a proposal, and up for debate. I think the overall concept is quite sound. As for your retirement from graphics, Matt, I would hate to see you go so soon. ;)

-Evan

ehart
06-29-2002, 06:59 AM
IT,

As for the shame of the multitude of shaders, I think all the IHVs agree on this. The ARB recently adopted a standard vertex program extension. There is also a push for a standard fragment program extension.

These serve a near-term need to get stuff out the door. The GL2 stuff is a longer-term look at virtualizing the resources used by the shaders.

John Pollard
06-29-2002, 10:42 AM
I don't know exactly what Carmack is trying to say, but I can tell you what I *hope* he's trying to say, and what I would like to see.

I would like to see an infinite number of simultaneous texture stages supported. When I need something done in 5 passes, and only have 4 texture stages, I have to draw the primitive a second time to take care of the 5th pass. But sometimes this isn't possible, because the math is impossible to work out, because the last pass is dependent on data in the framebuffer, which is screwed by passes of previously rendered objects (I know there are ways around this in some cases, but it's just annoying).

The solution? The HW should easily be able to support an infinite number of texture units. Let's say I need to do 7 passes, but have 4 texture units. The card would do the first 4 passes (simultaneously), store this somewhere, then do the last 3 passes (simultaneously), then combine these results, and write to the frame buffer. This avoids a lot of re-transforming of the objects, and is just plain easier to code for.

There is no excuse that I can think of for not supporting this. You still supply all the textures, math, etc. The driver only simulates more than its actual texture units. That's it.
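For comparison, the by-hand version of that on 4-unit hardware looks something like this today for a plain 8-texture modulate (setup_texture_unit and draw_geometry stand in for app code):

/* Pass 1: modulate textures 0-3 on the 4 real units. */
int u;
glDisable(GL_BLEND);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
for (u = 0; u < 4; ++u)
    setup_texture_unit(u, tex[u]);     /* placeholder: bind + GL_MODULATE */
draw_geometry();                       /* placeholder */

/* Pass 2: multiply textures 4-7 into what's already in the framebuffer. */
glEnable(GL_BLEND);
glBlendFunc(GL_ZERO, GL_SRC_COLOR);    /* framebuffer = framebuffer * fragment */
glDepthFunc(GL_EQUAL);                 /* reuse pass 1's Z */
glDepthMask(GL_FALSE);
for (u = 0; u < 4; ++u)
    setup_texture_unit(u, tex[4 + u]);
draw_geometry();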

Korval
06-29-2002, 12:28 PM
There is no excuse that I can think of for not supporting this. You still supply all the textures, math, etc. The driver only simulates more than its actual texture units. That's it.

Why make the driver do that work? It's nothing the end-user can't do easily enough. Not only that, I don't like the idea of the driver spending the time/resources that it takes to perform this operation. I would much rather know that the hardware only supports 4 multitextures and simply opt for a different shader than to waste precious time doing some multipass algorithm on older hardware.

John Pollard
06-29-2002, 12:58 PM
Why make the driver do that work? It's nothing the end-user can't do easily enough. Not only that, I don't like the idea of the driver spending the time/resources that it takes to perform this operation. I would much rather know that the hardware only supports 4 multitextures and simply opt for a different shader than to waste precious time doing some multipass algorithm on older hardware

Well, I don't want the hardware to have to transform, and clip the geometry all over again, each time I have to do extra passes. We are dealing with 50-80k+ tri scenes in DNF, and it gets expensive REALLY fast.

Second, sometimes it's just not possible to get the math right unless you use a render target to store temp results. You can also use the alpha channel to store temp results, but this is crazy; the driver can do a much better job in this case. It has the transformed geometry; it just needs to make several passes over this data at a low level.

Think of it like memcpy. It subdivides the bytes into ints, then words, then takes care of the left over bytes. But in this case, the HW would subdivide the work by how many texture units it has.

Don't forget, though, you could still decide to do it yourself if you need to, but I doubt you would need to.

Can you give me an example where you'd rather do it on your own, because the hardware couldn't do a better/similar job? Because maybe I'm just overlooking something, so I'd like to hear more opinions.

PH
06-29-2002, 01:23 PM
Originally posted by John Pollard:
...When I need something done in 5 passes, and only have 4 texture stages, I have to draw the primitive a second time, to take care of the 5th pass.

What do you mean? If you do 5 passes you need to transform the geometry 5 times. Is there a different interpretation of the term 'pass' in Direct3D?

John Pollard
06-29-2002, 01:47 PM
Well, it would be 5 passes with 1 texture unit. 1 pass with 5 texture units.

A pass in this case, is how many DrawPrimitive calls you had to make, to pull off the effect (sorry to use Direct3D terminology).

John Pollard
06-29-2002, 02:08 PM
When I need something done in 5 passes, and only have 4 texture stages, I have to draw the primitive a second time, to take care of the 5th pass

Heh, I can see where this was confusing. Sorry, trying to BBQ and do this at the same time :)

A better way to say this, would be:

When I have an effect, that requires 5 TMU's, but I only have 4 TMU's to use, I will have to do it in 2 passes.

Korval
06-29-2002, 04:49 PM
They have the transformed geometry, they just need to make several passes on this data at a low level.

I don't know where you got the idea that the driver has the transformed data, but that is incorrect. In every card since GeForce1, when rendering with the hardware T&L/vertex programs, the driver has no access to the post-T&L results. Therefore, in order to multi-pass, it will have to retransform the verts again.

Not only that, more complex shader algorithms need special vertex shader code to mesh with various textures in the pixel shader. The driver would have to take your shader and break it into two pieces based on which data is necessary for which textures/math ops in the pixel shader for that pass. In all likelihood, on hardware that causes a shader to fall to multipass, you're going to have to send different vertex data to each pass (different sets of texture coordinates, vertex colors, etc.).

Also, you're making a fundamental assumption. You're assuming that I want the more powerful shader program run on any hardware, regardless of the cost. It may require a render-to-texture op coupled with a blitting operation. Not only that, a 5-texture algorithm with hardware that only has 4 texture units may actually require 3 (or potentially more) passes, depending on what I do with those textures and how I combine them in the shader. Each pass will need its own vertex and pixel shader code, which has to be generated on the fly from the shader code passed in.

Given that I might be writing a high-performance application (like a game), I may not want to pay the cost of a 3-pass algorithm when I could use something that doesn't look as good, but is cheaper. Not only that, I have no idea how long a particular shader is going to take; therefore, I don't get to have fallback options for particular hardware. In short, I still have to code for the lowest common denominator, since coding for the high end guarantees that the low-end users will get horrific performance.

LordKronos
06-29-2002, 05:54 PM
Originally posted by John Pollard:
We are dealing with 50-80k+ tri scenes in DNF, and it gets expensive REALLY fast.
Yeah, but by that time..... ;)

John Pollard
06-29-2002, 06:47 PM
I don't know where you got the idea that the driver has the transformed data, but that is incorrect. In every card since GeForce1, when rendering with the hardware T&L/vertex programs, the driver has no access to the post-T&L results. Therefore, in order to multi-pass, it will have to retransform the verts again.

Someone has the data, I really don't care who. I just know I pass it in. I'm not saying things wouldn't need to be re-arranged to support this.


you're going to have to send different vertex data to each pass

Yes, exactly. I can send UV vertex data for 8 TMU's, even though the HW only has 4 TMU's. This means, on the first pass, the HW would interpolate the first 4 UV's, then on the last pass, interpolate the last 4 UV's. Not all that complicated.


You're assuming that I want the more powerful shader program run on any hardware, regardless of the cost. It may require a render-to-texture op coupled with a blitting operation.

All the HW has to do is subdivide the shader into parts. Each part would go through the shader engine. Then, each part would get combined. I can't think of any scenario where there wouldn't be a solution. It just might be a little slow, worst case.

But it means less work for me, and probably isn't going to be any slower than any fallback case I would need to write to support that effect anyhow. But I don't see how it would be slower, since a lot of the work isn't being duplicated anymore.


Given that I might be writing a high-performance appliation (like a game), I may not want to pay the cost of a 3-pass algorithm, when I could use something that doesn't look as good, but is cheaper.

You can still do this. Nothing is stopping you from coding the way you always have.

Though, there is the drawback of not being able to calculate the number of cycles a particular shader is going to take (in the case where the HW had to subdivide the workload). But this is a luxury, and I can handle that. When coding, I would calculate cycles based on the assumption that the HW didn't need to switch to fallback mode. Of course, I would also allow the user to turn a feature off if his machine didn't have the goods.

I just think it would be really cool to write a shader targeted for the GF3, but still have it work on a GF1 (at a price of course).

I guess I'm in dreamland though, and I'll just have to keep writing 10 different code paths to support the different combos of cards :( Such is life...

barthold
06-29-2002, 08:12 PM
Originally posted by Ozzy:
I thought Carmack was talking about NV prototype extensions. Maybe you're right and I've misunderstood what he's written.

John Carmack was kind enough to write some OpenGL2 shaders last week and give them a try on a Wildcat VP (the official name for a P10 board), as he stated in his .plan file.

The standard drivers that ship with a Wildcat VP do not have OpenGL2 support. The reason being that our OpenGL2 implementation is still in the early phases, and we do not want to mix that with production quality drivers. But I'll be happy to provide an OpenGL2 capable driver to anyone with a Wildcat VP board, who wants to experiment a bit. Just drop me an email.

At the ARB meeting a week and a half ago we presented a plan for the ARB to work on getting parts of the OpenGL2 white paper functionality into OpenGL in incremental steps. We presented 3 extensions and the OpenGL2 language specification. The extensions handle vertex programmability, fragment programmability, and a framework to handle OpenGL2-style objects. As a result, a working group was formed by the ARB. This working group is headed by ATI. Jon Leech should post the ARB minutes on www.opengl.org shortly, if he hasn't already.

Barthold,
3Dlabs

zeckensack
06-29-2002, 08:15 PM
John,

I think you're on the right track. While both the NV20/25 and R200 support some kind of loopback to extend their texture stages, one could still argue that it's just 'pipe combining' and you can't go over your total limit of physical TMUs (of all pipes combined).
That's the easy way to do it, I believe, but what do I know about these chips, really ...

Well, proof of the existence of true loopback is the Kyro II. OK, no promoting or bashing here, we know it all. But this little thing does 8x loopback in D3D and 4x in OpenGL. And I could well imagine that it's an arbitrary driver limitation that it doesn't support *unlimited* loopback (or perhaps "close to unlimited" ;) ).

So I wonder, why can't the big dogs do that? :)

Korval
06-30-2002, 01:35 AM
Someone has the data, I really don't care who.

That data is gone. Note that the rendering pipeline goes one way: vertex shader/fixed-function feeds the rasterizer, which feeds the fragment pipeline, which feeds the pixel blender.

In order to do what you're suggesting (having post-T&L data lying around), the vertex shader would have to be done in software and the data stored on the CPU. This is, quite simply, completely unacceptable from a performance standpoint.

It is currently impossible to simply read back data from a vertex shader and store it to be multi-passed over again. Not only that, as I explained before, you can't do that, since you need to run the other portions of the shader on each individual set of data.


This means, on the first pass, the HW would interpolate the first 4 UV's, then on the last pass, interpolate the last 4 UV's. Not all that complicated.

Then let me complicate it for you.

Let's assume your equation is the following:

T1 * T2 + T1 * T3 + T1 * T4 + T1 * T5 + T2 * T3 + T2 * T4 + T2 * T5 + T3 * T4 + T3 * T5 + T4 * T5 = output color.

Oh, and the output color will be alpha blended with the framebuffer.

Given a Radeon 8500 (with more blend ops, perhaps), this is trivial; no need to multipass. Given a GeForce3, this is most certainly a non-trivial task to break into passes.

Note that each texture coordinate came from a vertex shader program that may have performed similar operations.


It just might be a little slow, worst case.

But it means less work for me, and probably isn't going to be any slower than any fallback case I would need to write to support that effect anyhow. But I don't see how it would be slower, since a lot of the work isn't being duplicated anymore.

Multipass == slow. It is far slower than a single-pass hack. I, for one, refuse to use any multipass algorithm unless it produces a particularly good effect (and even then, it had better only require 2 passes).

Not only that, if you're building your shader relatively dynamically (say, based on slider-bar values or a configuration screen), then the shader 'compiler' has to compile dynamically. Splitting a vertex shader into two passes is a non-trivial algorithm. It can, even worse, make the vertex shader even slower.

Not only that, verifying that a shader fits within the resource limitations of the hardware isn't a trivial task.

Saying that this kind of thing is a relatively easy task that will not impair the performance of the hardware is simply erroneous. Besides, I'm more inclined to believe Matt than John Carmack about the potential nightmares of implementing such a system in drivers. Carmack's job is to get people like Matt to do his work for him.


I think you're on the right track. While both the NV20/25 and R200 support some kind of loopback to extend their texture stages, one could still argue that it's just 'pipe combining' and you can't go over your total limit of physical TMUs (of all pipes combined).
That's the easy way to do it, I believe, but what do I know about these chips, really ...

The reason there is a limit to what can be done in a single pass is that there is a limit to how much texture-state information can be stored on-chip. Take the original TNT, for example. It has only 1 texture unit. But it has register space for 2 active textures, which were accessed via loop-back. Most of the time, it is more efficient to store additional register state for active texture objects than to actually have more texture units.

As to why there isn't more loopback? Simple: register space isn't cheap. Because the Kyro was a tile-based renderer, it could probably get away with having lots of register space for texture objects per polygon (for some reason; I don't know enough about the specifics to say why, but given the unorthodox nature of tile-based renderers, I would be willing to believe it).

Sundy
06-30-2002, 09:56 AM
Hi all,

I don't really understand a thread with John Carmack's name getting sooo hot. I have a few points to make.

1) John Carmack is what he is now because he's got amazing people with him doing those graphics.

2) Though there were a lot of games using the Quake3 engine, there was nothing that could make use of it as much as id Software did. Again, the credit goes to those graphic designers who brought the game alive.

3) Stop worshipping him as an idol; he's just like you and me.

4) If John Carmack is reading this, I am sure he understands this.

5) I am not jealous of him or anything; I still think he is as good as you and me.

6) I can go on....


-Sundar


John Pollard
06-30-2002, 10:04 AM
In order to do what you're suggesting (having post-T&L data lying around), the vertex shader would have to be done in software and the data stored on the CPU. This is, quite simply, completely unacceptable from a performance standpoint.

But doesn't the HW have this data though? This is where it should happen. The HW simply re-rasterizes the triangle.


T1 * T2 + T1 * T3 + T1 * T4 + T1 * T5 + T2 * T3 + T2 * T4 + T2 * T5 + T3 * T4 + T3 * T5 + T4 * T5 = output color.

Yeah, unless you did some crazy work, this shader would fail. You would have to place one restriction on a shader (and you might think this is a huge restriction). Let's say you had 4 TMUs (TMU0-3), but you needed 8 TMUs in your shader (TMU0-7). The restriction would be that TMU0-2 could not interact with TMU5-7, and vice versa. However, TMU3/TMU4 *could* interact. You can think of these 2 TMUs as the bridge across the virtual TMU gap.

In this case, the in-between result of TMU3/TMU4 would require a temp buffer to "carry over" the results, so the shaders can be combined. I know you're thinking that I can pull this off now, using a render target and current HW, but this is not very scalable, and my shader would never take advantage of future HW unless I planned ahead.
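Done by hand today, that "carry over" is basically a copy-to-texture between the two halves (scratch_tex, the viewport size, and the draw_* helpers are placeholders for app code):

draw_first_half();                        /* placeholder: result of TMU0-3 */
glBindTexture(GL_TEXTURE_2D, scratch_tex);
glCopyTexSubImage2D(GL_TEXTURE_2D, 0,     /* copy framebuffer into scratch_tex */
                    0, 0, 0, 0, viewport_w, viewport_h);
/* Now bind scratch_tex on one unit and draw again to combine with TMU4-7. */
draw_second_half();                       /* placeholder */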

Instead, as more TMU's are introduced, my shaders just get faster, without me having to do anymore work.


Oh, and the output color will be alpha blended with the framebuffer.

This is a post operation that would be handled no differently than it is now. Would it not?

davepermen
06-30-2002, 10:04 PM
Ahhh, easy, you found the final solution: doing the multipass per triangle. I mean, state changes are cheap, we can do that all the time...

It does not sound like you know that much about how the HW works, do you?

I know it CAN be done to set up the shaders automatically, but as the different parts of the GPU have very different power and programmability, it's a hell of a complicated thing to get working. If you say it's that easy, why not provide your own interface? You could make an extension from it if you want, and NVIDIA and the others could then implement it directly to gain some speed. I mean, you should be able to get it working as well..

There isn't even a really simple register combiner language possible; they are too specialized.. you're faster coding that stuff directly.. Setting up a multipass system is in fact quite easy, but the different passes I want to code myself. No one can beat me at coding fast shaders. No computer, at least..

And if you can, you prefer staying single-pass by dropping some little features/accuracy..

Ysaneya
06-30-2002, 11:27 PM
Sundy, I'm 100% with you. There are some people here who need to face reality and stop worshipping Carmack. I'm tired of hearing Carmack here, Carmack there, Carmack is the best programmer in the world, etc. He's a good game programmer, period. There are many people better than him. He did not invent BSPs. He did not invent lightmaps. And he certainly did not invent per-pixel lighting. His code design/quality seems to be very average (efficient, OK, but not nice). And what about all these people that are working with him? Artists, true, but also other programmers, musicians, designers, etc. Don't they deserve as much credit as him? Would you be so impressed by Doom3 with crappy graphics? Sorry for the rant. I know it won't change anything, but I feel better... :)

Y.

knackered
07-01-2002, 12:45 AM
Nope, just 2 core engine programmers at id - Carmack and another guy.

ash
07-01-2002, 12:47 AM
Originally posted by mcraighead:
> I honestly see it as infeasible, or
> at least much worse in performance.
>
> The only way I can imagine implementing
> it in the general case is to use a SW
> renderer. (And that's not invariant
> with HW rendering!)
>
> And eventually, for a big enough program,
> the API has to refuse to load it.

Why? For any assembler-level program, no matter how large, I can imagine a trivial (totally inefficient) multipass implementation in which every instruction is implemented as a single pass, with intermediate results being written to memory. Work backwards from there, collapsing passes as far as you can, to imagine a more optimal implementation. Loops and branches might need to be done in software, but that doesn't introduce invariance problems in itself.

ash
07-01-2002, 01:25 AM
> The easiest example of a shader that
> _cannot_ be broken down into multiple
> passes on a GF3 is a specular exponent
> done in real floating-point math.

[snip]

> If you don't have this math operation
> per-pixel (at bare minimum, you need
> per-pixel exp() and log() to emulate
> pow()), it's essentially impossible to
> emulate it.

Yes, but the problem then isn't really a problem to do with multipassing per se; you can't even do that single operation in a single pass, if I understand you correctly. So it's not a very interesting case, as a counterexample for multi-passing, is it?

If your point is that multi-passing alone isn't enough to let you emulate any given program on any given piece of hardware, then sure you're right. You hardly need a counterexample to prove that. The hardware must support the basic operations of the language at some level, if you want to run any programs at all. So I don't think your example is very relevant.

LordKronos
07-01-2002, 02:57 AM
Originally posted by knackered:
Nope, just 2 core engine programmers at id - carmack and another guy.

Sure about that? There are 5 programmers on Doom 3:
John Carmack
Graeme Devine
Jim Dose
Robert Duffy
Jan Paul van Waveren

Robbo
07-01-2002, 04:55 AM
I think he means `core programmers'. I wouldn't expect all those people to be working on the engine. I expect some to be working on scripting and tools etc.

LordKronos
07-01-2002, 05:28 AM
Originally posted by Robbo:

I think he means `core programmers'. I wouldn't expect all those people to be working on the engine. I expect some to be working on scripting and tools etc.

Primary contributions:
John Carmack - graphics
Graeme Devine - according to JC, "Graeme's primary task is going to be a completely new sound engine"
Jan Paul van Waveren - physics engine
Robert Duffy - tools
Jim Dose - game logic, animation system, scripting

So that's at least 3 "core programmers". And that's assuming you don't consider tools or scripting systems to be a core component of the engine (and as complex, customized, and integrated as these are, I personally would consider them to be "core").

I think some people need to get away from the notion that graphics are the game, graphics make the game, graphics are the toughest part of the game, graphics programmers deserve all the real credit. I personally consider a good physics engine to be much more complex and difficult to implement than a good graphics engine.


PH
07-01-2002, 05:36 AM
Usually everything is built around a scripting system, so I would consider this one of the main components of the engine too.
I agree, a physics engine is a lot more complex to implement than a renderer (it requires some theory too). The actual rendering is not what's impressive; it's the preprocessing that needs to be done on the geometry (and this part can get complex).

Pentagram
07-01-2002, 06:38 AM
JC only does the fun part; physics & collision detection are the real nasty parts of an engine. How many games are there where you have good graphics but you get stuck in walls, things float in mid-air, ...

knackered
07-01-2002, 07:06 AM
Core graphics engine programmers is what I was talking about - seeing as the discussion is about graphics. I used the word 'core' to avoid replies like yours, LordKronos... obviously naively hoping that, just for once, the discussion wouldn't shoot off up another semantics avenue.
Yes, collision detection, physics, and sound are important too - yawn. ;)
Ysaneya - all id games (except perhaps the original Doom) have contained distinctly average graphics and sound assets, the gameplay is renowned for being an empty experience with little imagination, and the tools are bad (compare QRadiant to Worldcraft to see what I mean... QRadiant was written by an id person, while Worldcraft was written by someone at Valve)... so I wouldn't give any of the other contributors to their various titles any credit at all... The graphics engines have always been first class, very efficient and capable (if you like wandering around indoors). It's left to the mod makers, and people who license the engines, to make the 'real' games.
There...counter-rant over.

Mezz
07-01-2002, 07:34 AM
The reason Carmack gets lots of credit is because of what he did in the earlier days, Doom and Quake were great leaps and the graphics engines for those probably were pretty hard to program (this includes having them run at acceptable speeds).

Yes, now there are more coders, but then much more goes into a game: it's no longer acceptable to have mediocre sound or buggy physics or stupid AI, when every other developer spouts off about how 'revolutionary' their approach is to a particular area mentioned above (they never say revolutionary graphics engine though, because they know all id's will wipe the floor with them...). But anyway, let's just see how D3 turns out.

-Mezz

Robbo
07-01-2002, 07:44 AM
I have to differ with all of you on what is the most important aspect of a game. At 31 I can no longer stomach rushing around being `fragged' every 30 seconds with some spotty kid trying to stick a rocket up my backside.

For me the critical aspect of a game is in the design (not of the engine, but the storyboard). I cannot over-emphasise the importance of narrative. Even totally linear games like Half-Life and Homeworld are a real joy to play because of the story that's unfolding before you. Quake III (for example) didn't have this narrative, so for me it was eye-candy coupled with a Motor/Somatosensory Cortex and Cerebellum test http://www.opengl.org/discussion_boards/ubb/wink.gif [fun for a while but not much `I wonder what happens next' in there].

I read that ID have a fiction writer storyboarding the entire game. This is good news for all fans of narrative. Couple that with JC's obvious graphical skills and I can see it being a winner.

Personally, I think the really amazing things Carmack did were the original Doom and the step up to 6DOF in Quake. Since then things have been progressing steadily, but nothing groundbreaking has arrived. Perhaps Doom III will be the first game to put it all together (graphically) - and show what the new gfx technology can really do. I've seen lots of stencil shadow demos, lots of bumpmapping demos, lots of lighting demos etc. I can't wait to see a game that does the lot.

Just a few thoughts above.

barthold
07-01-2002, 09:13 AM
Originally posted by ehart:

Just a couple random comments here.

[snip]


I agree fully with Evan's post.

OpenGL2 is setting a direction for hardware to grow into. Supporting current-generation hardware is not one of its primary goals. Having the API chase the hardware (as has happened for the last several years) has gotten OpenGL into the state it is in currently: fragmented, and lagging behind the other major API in feature set. Only by setting a clear path forward for the next 5 years will OpenGL survive. Anyone claiming it is 'too hard' to implement parts of OpenGL2 on yesterday's hardware does not understand the real purpose of OpenGL2.

The new API ideas that have been proposed come from conversations with real ISVs facing real problems. In addition, we have tried to harmonize features that are readily available today, but through a different extension on each vendor's platform. Hardware vendors need to recognize this, and adapt their future hardware design to facilitate the new API ideas. (We can discuss what these new ideas are exactly, but if the fundamental principle of a "forward-looking API" is not agreed on, feature discussion is a moot point).

Note that this approach is nothing new. This is how SGI, back in the old days, set a vision for OpenGL 1.0. Most ARB members at that time were not able to do in hardware what OpenGL 1.0 was proposing, but they went with it anyway, and adjusted their next-generation hardware plans accordingly. That approach made OpenGL wildly successful.

John Carmack has an attractive point. Why should he, or any ISV, have to worry about all kinds of hardware resource limits while writing shaders in a high-level language? Why should an ISV care how many instructions a fragment shader happens to be able to use? How many temporary registers there are? How many uniforms it can use? How many texture stages it can use? If you have to worry about those limits, you're still writing different back-end renderers for different hardware. The whole point is to enable the ISV to write fewer (preferably one) back-ends. Now, if they want to spend the effort writing more back-ends, because of performance, invariance issues, fear of driver multi-passing etc., they can do so. But they are not required to do so. Of course this goes hand in hand with some kind of query mechanism where you can find out how well a given shader maps onto a certain piece of hardware (how many passes it would take to run this shader, for example).
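A minimal sketch of how such a query might look from the application side (every call and token here is hypothetical, invented purely for illustration; nothing like this is in any current spec):

/* Compile the shader, then ask the (hypothetical) driver how it maps to this hardware. */
GLuint shader = gl2CompileShader(GL2_FRAGMENT_SHADER, shaderSourceText);  /* hypothetical call  */

GLint passes, accelerated;
gl2GetShaderParameter(shader, GL2_ESTIMATED_PASSES, &passes);             /* hypothetical token */
gl2GetShaderParameter(shader, GL2_HARDWARE_ACCELERATED, &accelerated);    /* hypothetical token */

if (!accelerated || passes > 2)
    shader = loadSimplerShader();   /* the application chooses its own fallback */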

OpenGL2 is the direction of the ARB (see the last ARB meeting minutes). An OpenGL2 working group has been formed, and the majority of the ARB members have volunteered people to participate in this working group. Initial drafts of specifications for the OpenGL Shading Language and the OpenGL 1.3 extensions to support it were circulated to the ARB three weeks ago. Progress is being made in converging on what exactly it should look like. 3Dlabs has played an important role in nurturing the OpenGL 2.0 initiative, but our goal here is to provide a forward-looking API that exposes next-generation programmable hardware at the highest level possible, and beyond any vendor's specific "hardware gadgets".

Barthold
3Dlabs

John Pollard
07-01-2002, 10:15 AM
ahhh eassssyyyyyyyyy you found the final solution. doing the multipass per triangle. i mean, state changes are cheap, we can do that all the time...
does not sound that you know that much about how the hw works, do you?

Sarcastic much? http://www.opengl.org/discussion_boards/ubb/smile.gif

I realize that this would be hell on the internal texture cache, etc, of most current HW. I'm just thinking out loud. Thinking a little "outside of the box" if you will.

Ok, worst-case scenario: the entire emulation is done in the driver. As a matter of fact, the driver may very well do the same thing I would have done, or something close. But this is for the driver to figure out. There would still be limitations, and not all emulations might be possible. But it would make life a lot easier for a lot of people. I'd rather see 1 driver save the work for hundreds of coders, rather than each and every one of those coders redoing the work, and possibly screwing something up.

As far as worrying that your effect would require too many passes to emulate, there is still nothing stopping you from writing those fallback cases. In this case, I think it is a good idea for you to do this work, as the driver cannot guess how much you would be willing to "give up" to achieve similar results. The driver would only emulate the effect if the results would be the same.

[This message has been edited by John Pollard (edited 07-01-2002).]

zed
07-01-2002, 10:31 AM
>>John Carmack has an attractive point. Why should he, or any ISV, have to worry about all kinds of hardware resource limits, while writing shaders in a high level language? Why should an ISV care how many instructions a fragment shader happens to be able to use? How many temporary registers there are? How many uniforms it can use? How many texture stages it can use? If you have to worry about those limits<<

I agree completely.
This has been discussed lots of times on these forums under different guises, and the common consensus is:
write the shader, do a small test, time it, and ask: is it quick enough? If yes, use it; else use a simpler shader.
In the past another word was inserted instead of 'shader', e.g. vertex blending.
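Something like this rough sketch, assuming placeholder helpers (getTimeSeconds, drawTestFrame, and the shader handles are all made up for illustration):

/* Time a handful of frames with the fancy shader; fall back if it is too slow. */
double t0 = getTimeSeconds();                  /* placeholder timer             */
for (int i = 0; i < 30; ++i)
    drawTestFrame(fancyShader);                /* placeholder: draw a test scene */
glFinish();                                    /* wait until the GPU is done     */
double msPerFrame = (getTimeSeconds() - t0) * 1000.0 / 30.0;

activeShader = (msPerFrame < 20.0) ? fancyShader : simpleShader;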

davepermen
07-01-2002, 10:36 AM
I just want to see the driver developer who can code such a complex thing. If it were that easy, it would be that easy for us all as well; we would already have the shader converter and there would be a GL_ARB_general_shader by now. It's just not feasible to code this - it's much too much work. CPUs are much easier targets for compilers to generate optimized code for, but the GPU has some VERY different parts which simply don't work well together. Say you want to do some dot products and they don't fit into one pass: blending can NEVER do dot products (except with 12 or 24 passes, I don't remember which, and with clamping errors), so you have to split the dot products between the passes and hope they aren't needed immediately, and all that.

You can draw into RGBA independently to store up to 4 results, bind them as a texture afterwards, or blend to them in some fancy way (but then you can't extract the components individually, btw). So what? It's just a big no-no.

SHOW ME THAT IT CAN BE DONE. You essentially have the same interface driver developers have; at least on nvidia hardware you get full access to the vertex callback (my name for VS or VP), the texture lookup settings (texture shaders), the fragment combiners (register combiners), and the framebuffer blending (blending, depth test, alpha test, stencil test). The hardware can't do more. Possibly the driver devs could implement it a little more efficiently, but they can't add more features.

So DO IT YOURSELF. SHOW THAT IT'S FEASIBLE. Then you'll soon get a job at nvidia or ATI; you would solve the thing that hasn't been solved for some years now (since the first multitexturing came in, we have had this problem with fallback and multipass/singlepass paths).

Show me you're a god. Carmack would love you.. http://www.opengl.org/discussion_boards/ubb/smile.gif

Julien Cayzac
07-01-2002, 12:57 PM
Originally posted by barthold:

OpenGL2 is the direction of the ARB (see the last ARB meeting minutes).


When will those damn minutes be online ??? http://www.opengl.org/discussion_boards/ubb/smile.gif

Julien
human monitor of the ARB page :'(

John Pollard
07-01-2002, 01:01 PM
blending can NEVER do dot products (except with 12 or 24 passes, I don't remember which, and with clamping errors), so you have to split the dot products between the passes and hope they aren't needed immediately, and all that.

You are thinking about details too much. Think of it like a CPU. You can do whatever you want. You can pretty much dream up the math, and it works. Your only enemy is serialization.

This is where the (future) gfx HW comes in. You write the code, and the HW figures the rest out. You don't even need to know how many TMUs there are. There could be only 4 TMUs, but there might be 16 parallel math processors. Or there may be 32 TMUs. Those are details that we don't need to know about. We just write code, assuming it will be serialized, and parallelism is taken care of by the HW. That's not to say, though, that you can't write the code to be friendly with the HW, to get maximum parallelism.

In the future, I don't think we will even be thinking of it as "passes". There will be no more passes. We will think of it in terms of how much parallelism a piece of code will get on a particular piece of HW.

I know this sounds crazy, but I'm pretty sure this is the future. It might be 3-4 years before it takes hold, though. Maybe I'm just crazy http://www.opengl.org/discussion_boards/ubb/wink.gif

In the meantime, I'm just dreaming up ways the current drivers could possibly implement this functionality on existing HW. Kind of easing the transition.

knackered
07-01-2002, 01:33 PM
davepermen - you should lay off the caffeine, you're sounding a little manic...and you know about as much about the internal details of a 3d card as I do (ie. about 10%).

It's refreshing to hear this kind of vision from 3dlabs. I agree entirely. The whole point of OpenGL 1.0 was its transparency - the programmer didn't need to know how the triangle was textured, it just was textured...just as I don't need to know how many colours a context is rendering in, I just give values from 0 to 1. The hints mechanism was designed to give hints, not explicitly tell the driver what to do.

Now, tell me why there isn't a dot product framebuffer blending mode. It would solve some problems in the short term, so give us it!

jwatte
07-01-2002, 01:53 PM
> I expect that -- in practice -- artists
> will be writing different shaders for
> different generations of hardware.

I expect that -- in practice -- artists won't be writing shaders at all. Artists will be configuring 3dsmax or Maya, or whatever, using the knobs that these tools provide. Then it's up to programmers to turn those knob settings into shaders.
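A toy illustration of that split - the knob names and shader table below are made up, not taken from any real exporter:

/* Artist-facing knobs as they might be exported from the DCC tool (hypothetical). */
typedef struct {
    int   hasBumpMap;
    int   hasGloss;
    float opacity;
} MaterialKnobs;

/* Programmer-side mapping from knob settings to a prebuilt shader variant. */
const char *pickShader(const MaterialKnobs *m)
{
    if (m->opacity < 1.0f)            return "translucent.frag";
    if (m->hasBumpMap && m->hasGloss) return "bump_gloss.frag";
    if (m->hasBumpMap)                return "bump.frag";
    return "diffuse.frag";
}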

Korval
07-01-2002, 08:12 PM
You are thinking about details too much.

Why is that so terrible? Because it tears your vision of the future apart? Well, somebody's got to look at the details. If it's not us, then it's somebody else.


Think of it like a CPU. You can do whatever you want. You can pretty much dream up the math, and it works.

A graphics card is not a CPU, nor should it ever become one. At best, it will become 2 CPUs: a per-vertex processor and a per-fragment processor. Precisely what they can do will be limited for performance reasons.

You see, the closer graphics chips get to full-fledged CPU's, the faster they lose their one advantage over CPU's: performance. The only reason we aren't all writing software rasterisers is because graphics chips do them faster. Adding all this "programmability" will simply slow them down (or drive the prices up).

There are reasons why programming the texture unit's filtering is not something that hardware developers are even considering. There are reasons why every "displacement mapping" technique being proposed is done with vertices rather than fragments. Those reasons are performance.

If you look at the Stanford shader, which is the closest thing currently available to what you're asking for, even it has limitations. It will reject some shaders on some hardware as being too complex. You'll also find that its performance leaves much to be desired, both on the compiling end and on the running end.

The only way to make something like this even remotely feasible (for shaders that will actually compile on the hardware rather than simply being rejected) is to have some mechanism that tells the user exactly what resources will be used. And I mean exactly: from the number and sizes of extra buffers that will be allocated, to the overall cycle count per pixel/vertex, to the number of texture accesses/filters that will be used.

This doesn't remind me of arguments about assembly vs C/C++. It reminds me of arguing with people over the feasibility of a CPU/computer that natively understands C/C++. I don't mind having the compiler layer between my C and the assembly layer.

What I would not be averse to seeing is an off-line shader compiler that generates a "shader object" file (a .c file). The application can link the shader in and tell it to run given a set of data (via a reasonable OpenGL-esque interface). It makes it easy to see exactly what the shader will need. Not only that, it makes it easy to go in and optimize a shader by doing certain operations a different way.

I could see each OpenGL implementer creating a module to the compiler that generates optimized C code for their particular hardware.
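A sketch of what such a generated "shader object" might expose (entirely hypothetical; no such compiler exists, and all the names are invented):

/* bump_shader.c - imagined output of an offline shader compiler for one target GPU. */

typedef struct {
    int passes;          /* passes this shader needs on the target hardware */
    int temporaries;     /* register slots used                             */
    int textureFetches;  /* per-fragment texture accesses                   */
} ShaderResources;

ShaderResources bumpShader_resources(void)
{
    ShaderResources r = { 2, 4, 3 };  /* numbers baked in by the compiler for this target */
    return r;
}

void bumpShader_bindPass(int pass)
{
    /* generated code would set up the combiners / texture stages for this pass */
}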

[This message has been edited by Korval (edited 07-01-2002).]

davepermen
07-01-2002, 09:44 PM
Hum, you got me wrong (and no coffee here). The guy wants a general multipass shader compiler for today's hardware, and today's hardware has parts that are too restricted and too different from one another. That future hardware will drop these "faults" is pretty logical; pixel shaders and framebuffer access may well merge at some point, so that we always have access to the framebuffer "texture" if we enable blending, for example. But on today's hardware, getting a shader working over several passes automatically is just impossible to do fast. The Stanford shading language demos work at home as well as they do here, but they don't run smoothly at all on a GeForce2 MX, and that's for simple one-mesh demos. Doing the right optimizations to make shaders fast on specific hardware means knowing the hardware's possibilities down to the last bit. And on today's hardware (okay, I only know the GF2 MX by heart, as I've never coded for a GF3 or such yet; sooner or later a GF4 Ti4200, we'll see), setting up register combiners and blend funcs and everything is simply too complex for a compiler. I've helped optimize quite a lot of register combiner setups and I've seen tons of restrictions here and there, and well... that thing is _NOT_ like a CPU at all.
For the vertex shader callbacks, yes, they can be cut into multipass, I think: you set up tex0 to tex8, for example, and the shader gets compiled into two callbacks, one for pass 0 and one for pass 1, each of which only sets up the texcoords that pass needs.
But for today's pixel-shading capabilities (mostly multipass pixel shading; see the huge Doom 3 topics about how to implement a general lighting equation and you'll see where the real problems lie in today's hardware and in "generic programming", which is sort of what your shader would need to be), I don't think there is a general AND FAST way. (Well, on the Radeon it _can_ already be generalized quite a bit with multipass, to look up textures in dependent ways wherever you want, I _think_; but on a GF3 or GF4, the dependent texture accesses with texture shaders are quite restricted anyway.)

PROVE ME WRONG.
I haven't seen any working implementation from you yet... http://www.opengl.org/discussion_boards/ubb/smile.gif
http://tyrannen.starcaft3d.net/PerPixelLighting

Your compiler has to generate such a thing with an --optimize:speed flag: one pass for the equation I plugged in, two passes for --optimize:no_approx (that's for the GF2 MX; for GF3 and up it would set up texture shaders and do the math in one pass with --optimize:speed, so that normalizations come out wrong, and with --optimize:accurate it would set up multipass with normalizations and render-to-texture for dependent texture accesses and all that; --optimize:accurate on a GF2 MX would set up a huge multipass system to do the normalizations for all the vectors as well).

I think you get the idea of what _could_ be a problem on TODAY'S hardware.
On future hardware I don't see many problems; a generalized GPU like the P10 has no problem implementing multipass and singlepass in the same way.

knackered
07-01-2002, 11:02 PM
Originally posted by davepermen:
hum you got me wrong. (and no coffee here..).

Well in that case, DRINK SOME!
Seriously, you've made your point (something about 3D cards being hugely limited), but there are people who design the hardware discussing the 'future' of graphics co-processors here...not ranting on about the limitations of today's hardware....if people had your attitude then we'd still be using single-textured, flat polys with software T&L.

folker
07-01-2002, 11:23 PM
Originally posted by mcraighead:
I honestly see it as infeasible, or at least much worse in performance.

The only way I can imagine implementing it in the general case is to use a SW renderer. (And that's not invariant with HW rendering!)

- Matt

No, it is no problem to support it in hardware very efficiently.

In the OpenGL 2.0 spec it is a central aspect that the driver can split a complex shader program into multiple passes automatically.

This is based on auxiliary buffers for storing intermediate results. Current hardware does not support it, but it is no real problem to support it in hardware. OpenGL 2.0 is designed explicitly for it.

The OpenGL 2.0 designers know very well what they are doing. And Carmack is absolutely right: Splitting complex programs automatically is the real quantum leap to get real hardware independence, so that you don't have to write different shader programs for different hardware.

For details see the OpenGL 2.0 specs.

dorbie
07-02-2002, 12:48 AM
A key concern, I think, is handling multipass transparently and efficiently. It's one thing to say you're going to use auxiliary buffers, quite another to thrash state behind the scenes between packets of data. But if you think about this, it might be a powerful argument FOR OpenGL 2.0. Implementations with big FIFOs (they all have them) can package this up at a granularity of their choosing after compiling to target their pass requirements and persistent register count between passes (aux buffers). Other implementations with recirculation or more units can compile to a single pass.

Applications wrestle with this now: we have Carmack using destination alpha to store terms between multiple passes and losing color information because destination color is accumulating final results on one implementation, and doing it all in a single pass on another with a completely different API. There have been discussions you've already had where we've talked about using pbuffers and reprojecting fragment results back onto the database to get around these issues. This is not a pretty picture. There is no good solution today, not even a hand-coded one. In other words, applications don't have significantly better options today even with hand-crafted hacks like destination alpha as a persistent register (the magic sauce to pass more than a single value between one pass and another). What is possible today is not the issue; we know it sucks wind.

So the real debate was framed by Carmack quite well, although he pretty much dismissed the counter-arguments with a glib "get over it". His main desire (I think) is to write a single path without having to ask the hardware how he needs to split up his code to implement an effect; it's a drive for transparency. The question is how much that costs. Opinions differ, but I think the correct answer depends on your timescale. Neither option is particularly great TODAY (Cg has the slight edge because it's intentionally dysfunctional with the promise of better in future), but long term, where should we be headed?

Just as an interesting aside, at what point do Cg and OpenGL 2.0 converge?

[This message has been edited by dorbie (edited 07-02-2002).]

davepermen
07-02-2002, 02:32 AM
Originally posted by knackered:
Well in that case, DRINK SOME!
Seriously, you've made your point (something about 3D cards being hugely limited), but there are people who design the hardware discussing the 'future' of graphics co-processors here...not ranting on about the limitations of today's hardware....if people had your attitude then we'd still be using single-textured, flat polys with software T&L.


If you know me, you know I'm the last person who wants single-textured flat polys with software T&L, and it's a shame to be accused of thinking that.

He asked for driver developers to do it on today's hardware, and I just want to show how nearly impossible that is.

For future hardware and future hardware design it's no problem at all, that's logical. And I can't wait for the future hardware, because, well, with my GF2 MX I don't have many of the new fancy features. But anyway, much of it is possible even on this hardware; to code for it you just have to do it manually.

For future hardware the framebuffer should optionally be bindable in the fragment shader; that would solve many of today's issues (blending with register combiner power), along with multiple buffers to draw into. But anyway, we'll see what's coming. As the hardware gets more generally programmable, multipass won't really be something we touch anymore, possibly - but not handled by the API itself; instead some higher-level interface (Cg? http://www.opengl.org/discussion_boards/ubb/wink.gif) which does this. Direct hardware access should still be there; the rest is part of some D3DX equivalent (GLU, GLUT, etc. - a rebirth of them http://www.opengl.org/discussion_boards/ubb/wink.gif)


And no coffee for me, as long as I can stay awake without it.

dorbie
07-02-2002, 03:02 AM
Framebuffer as texture register solves part of the problem, but you need more than one framebuffer, and I don't just mean a pbuffer or something; you need to simultaneously OUT multiple values from texture units in a single pass to multiple destination colors. Depth should be handled similarly.

As a personal preference:
at this point your combiner (or whatever you call it) replaces blend equation / blend func hardware completely and you use destination color as texture register to implement what we now call blendfunc + other goodies.

That would seem quite clean programmatically and avoid the inevitable glMultiBlendFuncEXT crap someone is bound to ask for; it also makes framebuffer fragment processing another part of a texture unit, basically eliminating a big chunk of orthogonal and increasingly redundant functionality while bringing texture functionality like DOT3 to framebuffer blending.

The framebuffer just becomes one of several optional 'persistent registers'.

Even if blending is just a special 'final combiner' initially I'd like to see this be the direction things move in.

davepermen
07-02-2002, 03:26 AM
Yeah.. finally dropping the stencil buffers, alpha tests, depth tests and so on. Instead we'd have two RGBA buffers (both with 32 bits per component) to which we can render and from which we can read as we want (meaning buffer2.a == depth value, buffer2.b == shadow volume count, and such). Just generalize it: say we have 32 textures max, and say 8 independent greyscale 32-bit buffers (one buffer with 8 values in it, but that's a hardware implementation detail), and the first 3 of them are the screen RGB; use the rest for whatever you want.

That way we could do fancy depth-buffer shadowing with the second depth value, and more or less order-independent transparency (independence per mesh is quite important http://www.opengl.org/discussion_boards/ubb/wink.gif)

Am I only dreaming?...

And then.. drop the rasterizer in the middle and let the vertex shader output straight to the screen.. http://www.opengl.org/discussion_boards/ubb/smile.gif

davepermen
07-02-2002, 03:31 AM
And finally, support for rendering tetrahedral objects.. (glBegin(GL_TETRAEDER); glEnd() http://www.opengl.org/discussion_boards/ubb/wink.gif
That means rendering 4 triangles at the same time from 4 defined points, getting both the min and max depth values in the combiners, and "clamping" them into the framebuffer depth range (i.e. if max > frame.depth, then max = frame.depth). That way we could finally render real volumetric objects; fog and such would be no problem anymore.

But I think that is still FAAAAAAAAAAAR away. As filtering and sampling become programmable, though: with programmable anisotropic filtering you could already sample along a line, meaning you could sample through a volumetric texture along a line and get the result. Rendering real volumetric textures... that would be cool.

dorbie
07-02-2002, 03:49 AM
I was thinking of more than two mere color buffers. Color would just be one use of a whole bunch of general persistent registers.

As for drawing fog, I assume you mean "tetrahedra". You should look at the SGI Volumizer API, you might find it interesting. If you can output the depth value to a persistent register you can do what you want without multiple simultaneous source fragment generation. In any case, the fog volume stuff has already been done for arbitrarily shaped volumes (or the equivalent) on current hardware; there are several tricks that make this possible. Storing intermediate results to auxiliary buffers which could then use dependent reads to apply an arbitrary function would make it even simpler to implement. You could even wrap it in a tetrahedra interface, but it would be inefficient. Polyhedra for homogeneous or textured fog (and other gaseous phenomena) would do.

This link may help: http://www.acm.org/jgt/papers/Mech01/


[This message has been edited by dorbie (edited 07-02-2002).]

folker
07-02-2002, 04:51 AM
Framebuffer as texture register solves part of the problem, but you need more than one framebuffer, and I don't just mean a pbuffer or something; you need to simultaneously OUT multiple values from texture units in a single pass to multiple destination colors.

This is exactly what OpenGL 2.0 does
-> see OpenGL 2.0 specs.



Depth should be handled similarly.

As a personal preference:
at this point your combiner (or whatever you call it) replaces blend equation / blend func hardware completely and you use destination color as texture register to implement what we now call blendfunc + other goodies.


This is also exactly what OpenGL 2.0 does
-> see OpenGL 2.0 specs.

But note that in OpenGL 2.0 there are also still the standard depth test and blending units, which can be combined with fragment shaders for performance reasons, since these fixed-function units are much faster. For example, the fixed depth test can get a major speed-up from a hierarchical z-buffer.



That would seem quite clean programmatically and avoid the inevitable glMultiBlendFuncEXT crap someone is bound to ask for; it also makes framebuffer fragment processing another part of a texture unit.


No, accessing textures differs fundamentally from accessing the framebuffer or aux buffers. For example, think of the gradient and filtering aspects.

davepermen
07-02-2002, 05:53 AM
Why, folker? Both are just arrays of pixels/texels; you can filter bilinearly from the framebuffer as well.. http://www.opengl.org/discussion_boards/ubb/smile.gif
What I want is just to drop the framebuffers/pbuffers and textures and instead have one thing (roughly what DX does):
gl2BindDrawBuffer(GL2_RGBA,  texID);   /* color output goes to this texture   */
gl2BindDrawBuffer(GL2_DEPTH, tex2ID);  /* depth output goes to this texture   */
drawOntoThem();                        /* render the pass                     */
gl2Finalize(texID);                    /* make the results usable as textures */
gl2Finalize(tex2ID);

...or so, and then use them: draw onto them, bind them, read from them, etc.

And actually dorbie means that the framebuffer values get fed into the register combiners as per-pixel constants, i.e. simply the same data you get in the blend equation, but already available in the register combiners (that's what I thought, at least; I think dorbie did too).

John Pollard
07-02-2002, 07:48 AM
He asked for driver developers to do it on today's hardware, and I just want to show how nearly impossible that is.

I know it's probably hard to emulate a lot of shaders on today's HW. But by using a render target as an intermediate result, putting restrictions on which TMUs can depend on each other, etc., you can almost pull it off in a lot of cases.


Yeah.. finally dropping the stencil buffers, alpha tests, depth tests and so on. Instead we'd have two RGBA buffers (both with 32 bits per component) to which we can render and from which we can read as we want..

Ohh. Something I've been wanting for a while. For some things I need to do, I need two z-buffers, and I need to test against each one using a different comparison.

Rather than special-casing all these buffers (stencil, alpha, z-buffer like you say), we will eventually need access to the framebuffer in the pixel shader pipe. Matter of fact, just get rid of the term "framebuffer": just set a texture as the "active" texture. This replaces the active framebuffer. I can write my own stencil shader code in this case.

We can sort of do this now using render targets, but it's not quite there yet.

[This message has been edited by John Pollard (edited 07-02-2002).]

folker
07-02-2002, 08:11 AM
Originally posted by davepermen:
Why, folker? Both are just arrays of pixels/texels; you can filter bilinearly from the framebuffer as well.. http://www.opengl.org/discussion_boards/ubb/smile.gif
What I want is just to drop the framebuffers/pbuffers and textures and instead have one thing (roughly what DX does):
gl2BindDrawBuffer(GL2_RGBA,  texID);   /* color output goes to this texture   */
gl2BindDrawBuffer(GL2_DEPTH, tex2ID);  /* depth output goes to this texture   */
drawOntoThem();                        /* render the pass                     */
gl2Finalize(texID);                    /* make the results usable as textures */
gl2Finalize(tex2ID);

...or so, and then use them: draw onto them, bind them, read from them, etc.

And actually dorbie means that the framebuffer values get fed into the register combiners as per-pixel constants, i.e. simply the same data you get in the blend equation, but already available in the register combiners (that's what I thought, at least; I think dorbie did too).

Such a universal design would be possible of course, but you would basically give up the performance advantages of GPUs compared to CPUs. The reason GPUs are much faster at 3D rendering than CPUs is that they do NOT have such a universal design.

Both texture access and frame buffer writing are optimized by taking advantage of the fact that there are no side effects, which is the key to massive pipelining and massive parallelization (16, 32, 64 texture access units / fragment units in parallel in hardware, etc.): Textures are accessed randomly (optimized by texture swizzling), but only read, so there is no side effect. The frame buffer is read-modify-write, but (basically) linear, which is the reason that fragment programs can access only one pixel and not neighboring pixels. If you had random read-modify-write access to textures, the framebuffer, or whatever memory, you would basically give up the performance advantage of GPUs and be back to CPU software rendering. And there are no advantages compensating for this disadvantage.

The clever trick of 3D in hardware is to use techniques which can be implemented much faster than on a general CPU (massive parallelization) while still being as flexible as possible (-> vertex and fragment shaders).

Coop
07-02-2002, 10:00 AM
Originally posted by dorbie:
Framebuffer as texture register solves part of the problem, but you need more than one framebuffer, and I don't just mean a pbuffer or something; you need to simultaneously OUT multiple values from texture units in a single pass to multiple destination colors. Depth should be handled similarly.

From what is already known about DX9 (see the link below), DX9-class hardware is going to support simultaneous writing to up to 4 render targets from one pixel shader (one z/stencil for all). The z/stencil test is the only thing that will work at the pixel level - there will be no blending, alpha test, etc. In my opinion that makes the whole multi-render-target idea pretty useless.
http://download.microsoft.com/download/whistler/WHP/1.0/WXP/EN-US/WH02_Graphics02.exe


[This message has been edited by coop (edited 07-02-2002).]

Nakoruru
07-02-2002, 10:38 AM
Why would you want to bilinearly filter from the framebuffer? This is not a good idea at all. First, how are you supposed to know what is in the framebuffer except what is directly below the pixel you are rendering? Rasterizers do multiple pixels at a time in parallel, so sampling anything but the pixel directly underneath the one you are rendering would make it impossible to even know what you are going to get.

Explain to me how you sample an image that is in the process of being rendered by 4 or more parallel pixel pipelines?

It makes so much more sense to render to texture and then sample from that to do interesting filters. At that point you have an exact idea of what you are sampling. Just look at the ATI demos of things like edge detection. They say not to be afraid of render to texture anymore.

dorbie
07-02-2002, 11:47 AM
folker, there are many different ways a register fragment can be generated even on today's hardware (the differences dwarf those between texture and framebuffer), but this is not used as an excuse to have completely separate blending APIs for the various types of fragments.

Maybe I wasn't clear enough: with OpenGL 2.0 I'd always understood that this implicit bunch of operations is still there as part of the output to certain registers, and these are still seen as 'special', with all sorts of funky legacy stuff going on - and I'm not just talking about glBlendFunc. I still think I'm right on this; I don't think OpenGL 2.0 does all I'm suggesting. Sure, it must support multiple stored values (at a low level, transparently behind the language), but gl_FragColor is not simply written to the framebuffer. I'd be delighted if I'm wrong. It makes no sense to say 'OpenGL 2.0 does all this just as you want' and then close by saying 'what you're suggesting is impractical' (to paraphrase).

My objective may be misguided, but just to restate what I mean for clarity: I suppose it boils down to wanting to take the hardware that's dedicated to the other fixed-function legacy stuff, throw it away, and implement it in the shader language itself, making that hardware available for general-purpose programmability. I think some interesting things come of it. At the very least it should be possible to turn off depth testing and blending and STILL implement the legacy fixed-function pipeline in the shader; when you see you can do that efficiently, then you know it's time to throw that hardware away. Perhaps it's an inevitability that this will ultimately happen.


[This message has been edited by dorbie (edited 07-02-2002).]

folker
07-02-2002, 12:25 PM
Originally posted by dorbie:
folker, there are many different ways a register fragment can be generated even on today's hardware (the differences dwarf those between texture and framebuffer), but this is not used as an excuse to have completely separate blending APIs for the various types of fragments.

Maybe I wasn't clear enough: with OpenGL 2.0 I'd always understood that this implicit bunch of operations is still there as part of the output to certain registers, and these are still seen as 'special', with all sorts of funky legacy stuff going on - and I'm not just talking about glBlendFunc. I still think I'm right on this; I don't think OpenGL 2.0 does all I'm suggesting. Sure, it must support multiple stored values (at a low level, transparently behind the language), but gl_FragColor is not simply written to the framebuffer. I'd be delighted if I'm wrong. It makes no sense to say 'OpenGL 2.0 does all this just as you want' and then close by saying 'what you're suggesting is impractical' (to paraphrase).

My objective may be misguided, but just to restate what I mean for clarity: I suppose it boils down to wanting to take the hardware that's dedicated to the other fixed-function legacy stuff, throw it away, and implement it in the shader language itself, making that hardware available for general-purpose programmability. I think some interesting things come of it. At the very least it should be possible to turn off depth testing and blending and STILL implement the legacy fixed-function pipeline in the shader; when you see you can do that efficiently, then you know it's time to throw that hardware away. Perhaps it's an inevitability that this will ultimately happen.


[This message has been edited by dorbie (edited 07-02-2002).]

I am not sure what exactly you mean. For example, what is a "register fragment" and what are "types of fragments"?

I indeed want to say that "OpenGL 2.0 does all this just as you want". But of course, depending on the feature, it may be fast or slow. In detail:

a) Blending / depth buffer test: You can read the old frame buffer value of the pixel, so you can do any custom blending and depth-testing operation you want. You can also re-implement the classical blend / depth function tests as a shader program. You can do all you want. But of course, using the fixed-function glBlendFunc is likely to be much faster on most hardware, especially on future hardware, simply because this fixed-function unit allows optimizations which are not possible in the general case. So it is a good idea to keep the fixed-function units, even if the same functionality can be implemented using the universal shader language.

b) Framebuffer as texture: You can always render into a texture and use it in a separate pass, so you can do everything. However, having random access to the framebuffer itself within one rendering pass would make it impossible to create performant hardware (as described in a previous post), without offering any advantage over render-to-texture.
So a unified framebuffer / texture architecture would be much slower without giving you more flexibility.

Nakoruru
07-02-2002, 12:39 PM
It's not just that you would get horrible performance from random access to the framebuffer, it's that there is not one single compelling reason to do so.

folker
07-02-2002, 12:50 PM
Originally posted by Nakoruru:
It's not just that you would get horrible performance from random access to the framebuffer, it's that there is not one single compelling reason to do so.

Exactly.

dorbie
07-02-2002, 01:13 PM
I call these things fragment registers because that's my preferred term and it's used in some implementations I like. It's just a fragment variable, I suppose you'd call it; in this context I'm referring in one respect to the multiple register sources that are available to current implementations which call these variables registers, whether the value comes from a texture or from a color fragment interpolated from vertex values. I suppose its use has a lower-level connotation.

As for the render-to-texture stuff, no; I used to hear similar arguments over multipass vs multitexture. I just don't see the sense in arguing the difference between texture and framebuffer access and then saying you can do it WITH a texture and it'll be faster. It's self-evident nonsense, and not just because of the overhead of generating the right fragments or the loss of easy prefetch cues.

That is a moot point though, implement it as you like. Any persistent fragment register (or fragment variable sent between passes if you prefer) is effectively a framebuffer of sorts, but it all gets swept under the covers if you're using a higher-level API. It becomes a hidden implementation detail.

None of this addresses the issue of the legacy fixed function pipeline getting eliminated and rolled into the "fragment shader" implementation.

dorbie
07-02-2002, 01:17 PM
Not a single compelling reason? What do you think Carmack is doing with destination alpha or destination color? With compiled shaders this all becomes a hidden issue, but any per-fragment shader variable sent between passes is exactly what I'm talking about. Of course there's a compelling need. It's the same thing by different names. You go pat each other on the back; I'm off to thump my head on the nearest wall.

folker
07-02-2002, 01:33 PM
Originally posted by dorbie:
I just don't see the sense in arguing the difference between texture and framebuffer access then saying you can do it WITH a texture and it'll be faster. It's self evident nonsense, and not just because of the overhead of generating the right fragments or the loss of easy prefetch cues.

To summarize my points: Designs not allowing massive pipelining / massive parallelization of fragment evaluation have fundamental performance disadvantages. Allowing random access to the framebuffer during the execution of a fragment program prohibits such pipelining / parallelization and so will be very slow. Because of this, the classical, existing architecture of rendering into a texture is faster.


None of this addresses the issue of the legacy fixed function pipeline getting eliminated and rolled into the "fragment shader" implementation.

Here too, the point is performance. For example, see the OpenGL 2.0 specs regarding accessing the previous frame buffer color and depth value.

dorbie
07-02-2002, 01:36 PM
Just to expand on this a bit: Carmack uses source_color * destination_alpha (probably *2) + destination_color on a 4-texture system. He does this because it's almost his only option; he has very few ways to make his lighting work, and if there were more variables available a lot of this would be more powerful. If the framebuffer were available as a texture register (shader variable) he could, for example, write his normal vector to an auxiliary color buffer, which would then be available in subsequent lighting passes to texture units for DOT3 operations. Sure, it can be done now (at the cost of a texture unit), but don't tell me it's faster; that's just what current hardware does. You potentially have perfect prefetch for this, and some architectures (tiled) would benefit from the coherent read (not random access).
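For reference, the destination-alpha trick described above maps onto standard blend state roughly like this (the optional *2 cannot be expressed in glBlendFunc itself and would have to come from a scale earlier in the pipeline, e.g. in the register combiners):

/* Accumulate: framebuffer = src_color * dst_alpha + dst_color,               */
/* where dst_alpha holds a term stored by an earlier pass (e.g. attenuation). */
glEnable(GL_BLEND);
glBlendFunc(GL_DST_ALPHA, GL_ONE);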

The rest is just asking why we keep the legacy fixed-function pipeline when you should be able to do this in the shader architecture. When you start asking for multiple destination colors (or, better named, register stores between passes) for use in texture operations (or, better, inputs to programmable function units), it's a natural progression to ask why not perform stuff like the blend function in the same programmable function units.


[This message has been edited by dorbie (edited 07-02-2002).]

Nakoruru
07-02-2002, 01:39 PM
Dorbie, why does Carmack need RANDOM access to the framebuffer? Certainly he only needs to access the pixel right under the one he is currently working on. I have absolutely no problem with that, although I think that the floating point aux buffers are a better fit to the problem.

folker
07-02-2002, 01:40 PM
Originally posted by dorbie:
Not a single compelling reason? What do you think Carmack is doing with destination alpha or destination color? With compiled shaders this all becomes a hidden issue, but any per-fragment shader variable sent between passes is exactly what I'm talking about. Of course there's a compelling need. It's the same thing by different names. You go pat each other on the back; I'm off to thump my head on the nearest wall.

There is a difference between random access to the framebuffer (caused by treating the framebuffer as texture / problematic and no reason for it) and non-random access to framebuffer (as used by Carmack and as included in OpenGL 2.0 / very useful).

dorbie
07-02-2002, 01:43 PM
I didn't say random access. It's been called that because of folker's assumptions about hardware design and fragment generation from rasterized primitives: he assumes the access will be random and unpredictable. Don't put words in my mouth. I'm talking about coherent, predictable access to results from earlier passes on the same fragment. The real issue is this design vs a fixed-function one where the fragment blends are done at the back end rather than read by the texture units. As for the need, maybe you missed one of my later posts.

The irony is that this is a prerequisite for any functional shader. It IS essential by one means or another; the only point of real debate is whether you clean house and the legacy blend stuff gets rolled into that capability. If it's efficiently implemented, that should be possible.


[This message has been edited by dorbie (edited 07-02-2002).]

Nakoruru
07-02-2002, 02:01 PM
Sorry Dorbie, I really don't mean to put words in your mouth. By random I mean treating it like a texture and doing dependent reads - not temporally random, but spatially.

Of course we need buffers that store temporary values. The aux buffers in GL2 do just that, and with more precision than the framebuffer. With the kill keyword in a fragment shader you can implement the entire fixed-function fragment pipe after coverage calculation. You should be able to Disable() it all and do stencil, depth, and blend yourself. 3Dlabs says they are keeping the fixed pipe because it's easy to implement, faster than programmability, and, of course, they want backwards compatibility.

I see no reason to even output a framebuffer color in OGL2 until the last pass, all the other passes can be to the aux data buffers.

folker
07-02-2002, 02:08 PM
Originally posted by dorbie:
I didn't say random access. It's been called that because of folker's assumptions about hardware design & fragment generation from rasterized primitives. He assumes the access will be random and unpredictable. Don't put words in my mouth. I'm talking about coherent predictable access of results from earlier passes on the same fragment.


Using the results of a previous pass of course makes sense and is very useful.

But isn't that exactly what the frame buffer access in OpenGL 2.0 and the usual render-to-texture make possible?
And with render-to-texture, you use the (previous) frame buffer content as a texture for the next pass in the same way as normal textures.

dorbie
07-02-2002, 02:14 PM
Nakoruru, yes I think random access would incur a penalty just as it does now (and probably require filtering, right). The real issue with dependent reads AFAIK is pipelining them and avoiding stalls, but if you do costly stuff you incur a cost :-)

However it's implemented, shaders of arbitrary complexity will need functionality like this even if it is hidden; the real issue is the legacy stuff. You can call it what you like, framebuffer or not - we know it needs to hold variables to send between passes, so I think we're agreeing.

I know I'm on much shakier ground with the elimination of the fixed-function pipeline stuff like the blend function and depth testing and stencil testing. It just seems appropriate to ask the question. Once you start reading information from the framebuffer into texture units, the first question you ask is whether these units can implement the fixed-function stuff I have now and therefore supersede it. There's something compelling about the idea, but I'm much more ready to admit that this is impractical than the rest.

[This message has been edited by dorbie (edited 07-02-2002).]

dorbie
07-02-2002, 02:24 PM
folker, render-to-texture I think has clear overheads, but yes - and ultimately, under a language, you can hide the implementation details. The interesting question arises from taking these ideas to their extreme conclusion in an efficient implementation: is it a wonderful piece of design rationalization or an unworkable performance pit? Ultimately you MUST be able to implement that fixed-function stuff in a shader; then you turn off the fixed-function pipeline; then you throw away the hardware and implement the legacy calls on your shiny new hardware. As a general design goal, making the shader implementation functional enough and nearly as efficient as the fixed-function stuff should be high on the list.

folker
07-02-2002, 02:34 PM
Originally posted by dorbie:
Nakoruru, yes I think random access would incur a penalty just as it does now (and probably require filtering, right). The real issue with dependent reads AFAIK is pipelining them and avoiding stalls, but if you do costly stuff you incur a cost :-)

However it's implemented, shaders of arbitrary complexity will need functionality like this even if it is hidden; the real issue is the legacy stuff. You can call it what you like, framebuffer or not - we know it needs to hold variables to send between passes, so I think we're agreeing.

Also agreed completely.
(And OpenGL 2.0 provides exactly this functionality.)


I know I'm on much shakier ground with the elimination of the fixed-function pipeline stuff like the blend function and depth testing and stencil testing. It just seems appropriate to ask the question. Once you start reading information from the framebuffer into texture units, the first question you ask is whether these units can implement the fixed-function stuff I have now and therefore supersede it. There's something compelling about the idea, but I'm much more ready to admit that this is impractical than the rest.

[This message has been edited by dorbie (edited 07-02-2002).]

This is definitely a valid question, agreed completely.

And assuming a clever compiler, you don't need these fixed function operations any more, agreed completely.

The argument for keeping these fixed-function units is simply a practical one (performance even without a very clever compiler, etc.).

So I think OpenGL 2.0 is exactly the right approach. For example, the SL specs state:


7) Is alpha blending programmable?
Fragment shaders can read the contents of the frame buffer at the current location using the built-in
variables gl_FBColor, gl_FBDepth, gl_FBStencil, and gl_FBDatan. Using these facilities,
applications can implement custom algorithms for blending, stencil testing, and the like. However,
these frame buffer read operations may result in a significant reduction in performance, so
applications are strongly encouraged to use the fixed functionality of OpenGL for these operations if
at all possible. The hardware to implement fragment shaders (and vertex shaders) is made a lot
simpler and faster if each fragment can be processed independently both in space and in time. By allowing read-modify-write operations such as is needed with alpha blending to be done as part of the
fragment processing we have introduced both spatial and temporal relationships. These complicate
the design because of the extremely deep pipelining, caching and memory arbitration necessary for
performance. Methods such as render to texture, copy frame buffer to texture, aux data buffers and
accumulation buffers can do most, if not all, what programmable alpha blending can do. Also the
need for multiple passes has been reduced (or at least abstracted) by the high-level shading language
and the automatic resource management.
RESOLVED on October 12, 2001: Yes, applications can do alpha blending, albeit with possible
performance penalties over using the fixed functionality blending operations.
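As a heavily hedged illustration, classic src-alpha blending written against the built-ins named in that quote might look roughly like the string below; only gl_FragColor and gl_FBColor come from the quoted draft, the surrounding syntax is guessed, and the helper function is invented:

/* Shader source held as a C string; syntax approximated from the draft spec. */
static const char *customBlendFrag =
    "void main (void)                                              \n"
    "{                                                             \n"
    "    vec4 src = computeSurfaceColor();   /* hypothetical */    \n"
    "    /* by hand, the equivalent of                             \n"
    "       glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA): */  \n"
    "    gl_FragColor = src * src.a + gl_FBColor * (1.0 - src.a);  \n"
    "}                                                             \n";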

zeckensack
07-02-2002, 02:35 PM
Originally posted by dorbie:
I know I'm on much shakier ground with the elimination of the fixed-function pipeline stuff like the blend function and depth testing and stencil testing. It just seems appropriate to ask the question. Once you start reading information from the framebuffer into texture units, the first question you ask is whether these units can implement the fixed-function stuff I have now and therefore supersede it. There's something compelling about the idea, but I'm much more ready to admit that this is impractical than the rest.

While the fixed-function stuff is still in the API, no IHV is forced to build dedicated hardware for it. It may make sense for performance reasons, but the API hides the exact implementation details - as usual.
If you can implement it in a fragment shader, they can do that too, and you'll never notice the difference http://www.opengl.org/discussion_boards/ubb/wink.gif

folker
07-02-2002, 02:41 PM
Originally posted by dorbie:
folker, render to texture I think has clear overheads, but yes, and ultimately under a language you can hide the implementation details.
Copy-to-texture has an unnecessary overhead.
But render-to-texture? It seems to provide exactly the functionality needed to access the previous frame buffer value (which has to be stored somewhere else so it is not overwritten by the current render pass).


The interesting question arises from taking these ideas to their extreme conclusion in an efficient implementation: is it a wonderful piece of design rationalization or an unworkable performance pit? Ultimately you MUST be able to implement that fixed-function stuff in a shader; then you turn off the fixed-function pipeline; then you throw away the hardware and implement the legacy calls on your shiny new hardware. As a general design goal, making the shader implementation functional enough and nearly as efficient as the fixed-function stuff should be high on the list.

Agreed. But as far as I understand hardware architectures, this is very hard for blending / depth tests, etc. For example, implementing hierarchical z-buffer tests for arbitrary shader programs that implement any kind of z-test seems nearly impossible to me. And most applications simply need the classical less-than z-test.

mrbill
07-02-2002, 03:21 PM
Originally posted by folker:
So I think OpenGL 2.0 is exactly the right approach. For example, the SL specs state:

7) Is alpha blending programmable?
Fragment shaders can read the contents of the frame buffer at the current location using the built-in
variables gl_FBColor, gl_FBDepth, gl_FBStencil, and gl_FBDatan.

See also the related issue 23: Should a fragment shader be allowed to read the current location from the frame buffer?
...
RESOLUTION: None

-mr. bill

[This message has been edited by mrbill (edited 07-02-2002).]

folker
07-02-2002, 03:30 PM
Originally posted by mcraighead:
The depth/stencil/blend thing is the *real* problem here. Everything else is small beans. (Throw in alpha test too.)

Z per fragment and alpha test are both really hard because you don't know which fragment is the visible one on a given pixel until *after* you've computed the shader result for every fragment. But since this technique only stores one intermediate result per _pixel_ (not per fragment, unlike the F-buffer), you are at a loss as to which intermediate result is the relevant one.

Blending is tough because _multiple_ fragments may affect the final result, not just the "topmost" one.
- Matt

Agreed, these are hard problems which are not solved yet. But they are not an argument against OpenGL 2.0, because they are not an argument against the aim of making the shading language as hardware-independent as possible.

In the worst case, the hardware independence (implemented by transparent multipass rendering) only works for non-blending operations. Maybe this should even be part of the OpenGL 2.0 spec. But even then, OpenGL 2.0 would be a major step forward, having hardware independence for opaque rendering.

Or, in the even worse case, if everyone insists that all blending shaders must run on every piece of OpenGL 2.0 hardware, then this may force future hardware to support arbitrarily complex pixel programs (looping results back). But what would be bad about that?

Maybe not all dreams will come true, but then we should not give up on everything; we should still implement everything that can be implemented. And I think that is a lot.

folker
07-02-2002, 03:32 PM
Originally posted by mcraighead:
Again, I think it is completely wrong to be talking about how people are going to stop writing for piece of hardware X or Y. You may be able to write a piece of software that _works_ on any given piece of hardware, but this completely fails to solve the problem of graceful degradation of image quality between different hardware platforms. It does you no good to write a single shader that runs on all platforms, but runs at 30 fps on the newest hardware and 0.1 fps on all the old hardware! Instead, I see future applications and engines playing out a careful balancing act between image quality and framerates. Ideally, your app runs at the same framerate on all hardware, but the complexity of the shading or the scene in general changes. Indeed, the hope would be to maintain a constant framerate, insofar as that is possible.
- Matt

Your post is an older one, but I think it is important to contradict it.

Both for games and for other software, it is very important to re-use software. Now suppose new, better hardware is available which can do the same rendering effect with fewer passes (current example: doom3 two-pass on a gf3, single-pass on a radeon).

a) Current solution: The developer has to rewrite parts of the software (shader code).

b) Philosophy of OpenGL 2.0: The driver can automatically reduce the rendering effect from two passes to one, because the pass count is an implementation detail.

Important aspect: The advantage of supporting new hardware as easily as possible is not only to improve performance of existing applications (e.g. games) from 60 fps to 120 fps. Instead, you can use the same software (the same game engine) to create more detailed content at 60 fps.
You don't have to rewrite your engine, you only have to feed it with new content. Isn't that the aim of all software development? (Especially since we as software developers have enough to do anyway... ;-)

Of course, we can discuss problems of OpenGL and difficulties of implementing OpenGL 2.0. But that discussion is useless if we don't agree on the aim of reducing the need to write different shaders for different hardware.

folker
07-02-2002, 04:43 PM
The depth/stencil/blend thing is the *real* problem here. Everything else is small beans. (Throw in alpha test too.)

Z per fragment and alpha test are both really hard because you don't know which fragment is the visible one on a given pixel until *after* you've computed the shader result for every fragment. But since this technique only stores one intermediate result per _pixel_ (not per fragment, unlike the F-buffer), you are at a loss as to which intermediate result is the relevant one.

Blending is tough because _multiple_ fragments may affect the final result, not just the "topmost" one.

I have a hard time convincing myself that this algorithm is capable of handling all the hard cases.

- Matt

At first my impression also was that this is a problem. But thinking about it again, it seems to me to be no problem at all:

What about simply falling back to per-primitive multipass (instead of multipass for a group of primitives) in case of such critical blending / depth test / stencil test situations?

Agreed, you need a fragment program state change for each pass. But on the other hand, are there reasons why it would be hard to implement very efficient fragment program state changes between a fixed set of per-primitive passes, so that there is no performance penalty for executing the passes per primitive?

So I think there should be no problem. Or am I missing something?

dorbie
07-02-2002, 05:49 PM
folker, the issue is not just outputting the fragments to texture in render to texture but in applying those fragments on subsequent passes. There's a crapload of stuff that has to happen to get that texture fragment in the right place. You're ignoring all that and saying it's faster. The memory access pattern should be identical for an FB read with no fancy filtering, and it's much more direct and predictable (that's important for prefetch). Render to texture also costs you a texture unit; there is complex hardware associated with the texture fragment evaluation and filtering that isn't needed for a simple framebuffer fetch. You might need that hardware resource for a real texture instead of wasting it.

davepermen
07-02-2002, 10:24 PM
i don't see many problems if you have enough parallel pipelines. while the pipeline execution time gets longer because you stream in the framebuffer instead of just streaming out the new data over the framebuffer(s), it does _NOT_ affect parallelism at all. with enough parallel pipelines you have the very same pixel throughput. dependent texture reads like in "true reflective bump mapping" effects are MUCH MUCH MUCH worse in fact. at the setup of the triangle you already know which data from the framebuffer you need, and reading out scanlines of the buffer is the fastest way of reading data.

and a p10 and other future hardware will have enough parallel pipelines that one pipeline can execute quite slowly. as long as they don't need to interfere (which they don't), it's not a huge problem.

and providing stuff that is slow is cool anyway. say 3dsmax can simply render with gl2. while the pixel pipeline can be terribly slow, it will be faster than software rendering anyway. not everyone needs 100fps

folker
07-02-2002, 11:05 PM
Originally posted by dorbie:
folker, the issue is not just outputting the fragments to texture in render to texture but in applying those fragments on subsequent passes. There's a crapload of stuff that has to happen to get that texture fragment in the right place. You're ignoring all that and saying it's faster. The memory access pattern should be identical for an FB read with no fancy filtering, and it's much more direct and predictable (that's important for prefetch). Render to texture also costs you a texture unit; there is complex hardware associated with the texture fragment evaluation and filtering that isn't needed for a simple framebuffer fetch. You might need that hardware resource for a real texture instead of wasting it.

You should only use render-to-texture if you need random access to your (previous) frame buffer content. If you need sequential access without filtering etc. (like the Carmack example), you can use the aux buffers as defined in OpenGL 2.0.
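
For comparison, the copy-to-texture fallback called an unnecessary overhead above looks roughly like this in plain GL 1.x today (a sketch; it assumes a window-sized texture fbCopyTex was created beforehand with glTexImage2D or glCopyTexImage2D):

#include <GL/gl.h>

/* Sketch of the GL 1.x copy-to-texture fallback being discussed.
   fbCopyTex, winW and winH are assumed to be set up elsewhere. */
extern GLuint fbCopyTex;
extern int winW, winH;

void grab_framebuffer(void)
{
    glBindTexture(GL_TEXTURE_2D, fbCopyTex);
    /* copy the current color buffer into the texture: this copy, plus the
       texture unit it occupies on the next pass, is the overhead that a
       direct frame buffer read or an aux buffer would avoid */
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, winW, winH);
}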

MikeC
07-02-2002, 11:14 PM
Originally posted by davepermen:
i dont see much problems if you have enough parallel pipelines. while the pipeline execution time gets longer because you stream in the framebuffer instead of just stream out the new data over the framebuffer(s), it does _NOT_ affect parallelism at all.

Is that really true? I'm curious. Average triangle sizes have been creeping down since forever - what happens if the hardware has 32 parallel fragment pipelines and you throw a 5-pixel triangle at it?

If 27 of the pipelines sit idle while you finish the triangle, that doesn't sound very efficient. If 5 pipelines start work on the fragments for that triangle, while the rest are immediately available for work on the next triangle, and the fragment shader can read the framebuffer, then you've got problems when the two triangles cover some of the same pixels. Essentially a race condition, with the result depending on whether triangle A's pipelines write to the FB before triangle B's pipelines read from it. Not nice.

If the fragment shader can't read the FB, then stuffing multiple triangles into the fragment pipelines simultaneously is fine as long as they all have the same latency and will finish in the same order they started. Adding conditionals to the shading language sounds as if it could mess that up, though.

Am I missing something obvious here?

folker
07-02-2002, 11:14 PM
Originally posted by mcraighead:
If GL2 were merely "forward-looking", I'd be all for that. But I think in practice it is (as currently proposed) a very worrisome direction for OpenGL to be heading. The proposals generally do not reflect my vision of where graphics should be 5 years from now. In fact, they fail to reflect it to such an extent that I think I would seriously consider finding a different line of employment [i.e. other than working on OpenGL drivers].

A final comment to this discussion ;-)

I am surprised by your radical rejection of OpenGL 2.0, because to me OpenGL 2.0 seems to be the natural next step, completely in line with the main philosophy of OpenGL:

Providing a hardware-independent interface for 3d applications.

This is exactly the aim of OpenGL 2.0: make shader programs hardware-independent (instead of all the hardware-dependent solutions that don't run on every piece of hardware).

I think that your vision of needing different shader programs for each piece of hardware to achieve the same framerate on every piece of hardware fundamentally contradicts the philosophy of OpenGL. This was rather the philosophy of proprietary interfaces like Glide: implement the optimal solution for every piece of hardware. But this is not the philosophy of OpenGL.

Robbo
07-03-2002, 12:36 AM
This is kinda funny. Every time Carmack farts, 200 posts go up on opengl.org trying to work out how.

I remember listening to a presentation by M. Abrash when he told the old anecdote about his friend at work who made some piece of hardware run at speed x. His friend was then told that another company had a new piece of hardware out that ran at speed x + 5. He sat around thinking about this for days and days until eventually he worked out how to get it to run at speed x + 7 (probably how the other company did it). The funny thing was, the other company didn't actually have anything that quick in the first place. It was all just gossip.

I'll leave you to work out the moral of the story http://www.opengl.org/discussion_boards/ubb/wink.gif

Mezz
07-03-2002, 01:05 AM
Yeah, that little story was in his Black Book Of Graphics Programming as well...
It probably won't curb curiosity though http://www.opengl.org/discussion_boards/ubb/smile.gif

-Mezz

ash
07-03-2002, 02:10 AM
Originally posted by mcraighead:
It does you no good to write a single shader that runs on all platforms, but runs at 30 fps on the newest hardware and 0.1 fps on all the old hardware! Instead, I see future applications and engines playing out a careful balancing act between image quality and framerates. Ideally, your app runs at the same framerate on all hardware, but the complexity of the shading or the scene in general changes. Indeed, the hope would be to maintain a constant framerate, insofar as that is possible.



Originally posted by folker:
Your post is an older one, but I think it is important to contradict it.

Both for games and for other software, it is very important to re-use software. Now suppose new, better hardware is available which can do the same rendering effect with fewer passes (current example: doom3 two-pass on a gf3, single-pass on a radeon).

a) Current solution: The developer has to rewrite parts of the software (shader code).

b) Philosophy of OpenGL 2.0: The driver can automatically reduce the rendering effect from two passes to one, because the pass count is an implementation detail.

Important aspect: The advantage of supporting new hardware as easily as possible is not only to improve performance of existing applications (e.g. games) from 60 fps to 120 fps. Instead, you can use the same software (the same game engine) to create more detailed content at 60 fps.
You don't have to rewrite your engine, you only have to feed it with new content. Isn't that the aim of all software development? (Especially since we as software developers have enough to do anyway... ;-)


I can agree with both of you here; I think they are different but worthy goals.

On the one hand, it's a very good thing not to have to write different code for different hardware if you don't want to -- and OGL2 is an important step in that direction.

On the other hand, it's also a good thing to be able to code your engine such that it degrades usefully on different hardware, by intelligently degrading scene detail and rendering complexity (for example by means of alternate shaders of varying complexity) according to the capabilities of the hardware at hand.

I don't see these goals as necessarily contradictory: we must design solutions that solve both, at the end of the day. The question is whether the ability to write one shader and have it run on any (capable) hardware at some speed prevents you from also regulating rendering complexity in your app. I don't believe it does. Surely providing multiple shaders at various levels of detail and choosing between them at runtime is no more difficult in OpenGL2 than providing multiple levels of geometric detail, and selecting between them. They're both still the app's problem.

Ash

ash
07-03-2002, 03:05 AM
Originally posted by folker:
I think that your vision of needing different shader programs for each piece of hardware to achieve the same framerate on every piece of hardware fundamentally contradicts the philosophy of OpenGL. This was rather the philosophy of proprietary interfaces like Glide: implement the optimal solution for every piece of hardware. But this is not the philosophy of OpenGL.

Level of detail is traditionally done by the app, and is not expressed explicitly within the API; typically the app would supply multiple shaders and switch between them at runtime (using standard SetShader-type API calls) depending on information it gathers about the rendering cost of different shaders, just as apps have traditionally switched between geometric representations at the scene hierarchy level without explicit API support for this in OpenGL.

Like you I see making a single shader run correctly at *some* speed without modification on all capable hardware as the important goal at the API level. And that doesn't prevent the app from doing level of detail with multiple shaders, at all.
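
A minimal sketch of that app-side selection, with a hypothetical ShaderHandle type and a hypothetical SetShader() standing in for whatever the real API call ends up being (none of this is actual GL2 API; the point is only that the choice lives in the app, not the driver):

/* Hypothetical sketch of app-controlled shader level of detail. */
typedef unsigned int ShaderHandle;
extern void SetShader(ShaderHandle s);      /* hypothetical SetShader-type call */

typedef struct {
    ShaderHandle shader;
    float        measured_ms;   /* per-frame cost measured by the app */
} ShaderLOD;

/* lods[0] is the most detailed version, lods[2] the cheapest */
static ShaderLOD lods[3];

void select_shader_for_budget(float budget_ms)
{
    int i;
    for (i = 0; i < 2; ++i)
        if (lods[i].measured_ms <= budget_ms)
            break;                          /* first version that fits */
    SetShader(lods[i].shader);              /* otherwise fall back to cheapest */
}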

Ash

dorbie
07-03-2002, 03:16 AM
Ash, I'd say it is even easier in OpenGL2: geometry requires that you fundamentally change the content, the other does not (unless for example you needed to go from a vector bumpmap to a heightmap bumpmap for legacy reasons). It should be relatively simple to perform some sort of isfast test with a complex opengl2 path and fall back to a simpler shader by no more than commenting out a few lines of shader code. But you DO need that test. With Cg it is just as easy, however the test is based on support, and some legacy hardware needs a completely different code path. If you're writing Cg, even the latest high-end hardware WILL need an alternative code path; the best option will probably be an opengl2 codepath :-) I could say the same thing about opengl2 requiring an additional code path on NVIDIA. Options like 3rd party opengl2 wrappers that provide support on nvidia hardware make the ballgame even more interesting.

There's no way NVIDIA's going to win this debate now, hmm... what are we debating? Cg vs opengl2 is not the issue, it's really Cg or not? Forget about "vs opengl2", it's done and dusted, the Carmack has spoken and delivered a code path in his engine and multiple vendors (not just 3DLabs and Matrox) are pursuing opengl2 implementations. Carmack DID address the central argument against opengl2 in his .plan so he's not missing some information that will change his mind. The difference is a philosophical one and his philosophy is "get over it".

We're going to have a situation where developers will be writing opengl2 code paths and opengl1.x + extensions code paths (it just happened), the question is how big a place will a Cg code path have beside that.

The answer would be easy if it weren't for that pesky market share and an unseen round of hardware innovation that's going to bring a radical leap forward from more than one vendor. Clouded the future is.


[This message has been edited by dorbie (edited 07-03-2002).]

Robbo
07-03-2002, 03:18 AM
Originally posted by ash:
Level of detail is traditionally done by the app, and is not expressed explicitly within the API; typically the app would supply multiple shaders and switch between them at runtime (using standard SetShader-type API calls) depending on information it gathers about the rendering cost of different shaders, just as apps have traditionally switched between geometric representations at the scene hierarchy level without explicit API support for this in OpenGL.

Like you I see making a single shader run correctly at *some* speed without modification on all capable hardware as the important goal at the API level. And that doesn't prevent the app from doing level of detail with multiple shaders, at all.

Ash

Finally a response I can agree with. It is not the responsibility of the drivers to make your app degrade gracefully. If you find your card has only 4 texture units, then you need to use a less complex shader instead of the 8-texture shader you are using to max your detail.
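
That check is cheap to do today; something like this (a sketch - only the ARB_multitexture query is real GL, the two path functions are placeholders for the app's own shaders):

#include <GL/gl.h>
#include <GL/glext.h>

/* Sketch: pick the 8-texture or the simpler 4-texture shader path based on
   what the card exposes. */
extern void use_eight_texture_shader(void);    /* placeholder */
extern void use_four_texture_shader(void);     /* placeholder */

void pick_texture_path(void)
{
    GLint units = 1;
    glGetIntegerv(GL_MAX_TEXTURE_UNITS_ARB, &units);
    if (units >= 8)
        use_eight_texture_shader();
    else
        use_four_texture_shader();             /* the less complex shader */
}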

What you are essentially requesting is that the same detail results from a card which is less capable than some other card and that the driver should sort out how to do it, even if it goes down to 0.1fps. I cannot think of a useful situation for this scenario. A more useful situation would be for you to stay at 30fps but render a pixel which is the result of a less complex interaction.

knackered
07-03-2002, 03:26 AM
The silence from the NVidia contributors is deafening.

dorbie
07-03-2002, 03:29 AM
Hmm, this sounds like many situations with OpenGL in the past. Pick a feature: hardware texture, blend, stencil or 3D texture. They have all been implemented on systems where performance degraded radically because the hardware was not capable. Hence isfast tests were required for some of this stuff. I don't think the objection is against having to optimize for one platform vs another, but currently things are a mess: you need radically different rendering paradigms in an application on different platforms even with the 'same' API.

[This message has been edited by dorbie (edited 07-03-2002).]

Robbo
07-03-2002, 04:12 AM
Yes and I was aware that Cg, GL2 and DX9 were going to sort out some of the mess - at least a common subset of features all could agree on.

If you want to do something state-of-the-art on `current' hardware, you will always have this problem. Only when sota becomes mainstream `standard' will you be able to use it with more or less impunity.

knackered
07-03-2002, 04:23 AM
Originally posted by dorbie:
I don't think the objection is against having to optimize for one platform vs another, but currently things are a mess: you need radically different rendering paradigms in an application on different platforms even with the 'same' API

I can't remember getting any support from you when I pointed this out a while ago, Dorbie - just flippant remarks like "go haunt the d3d forums then".
Whenever I've questioned the current state of OpenGL, I've been told that its current state is a good thing - "you can be sure your code is highly optimised for particular cards"....mm, so why not read/write opcodes directly from/to the AGP slot?

davepermen
07-03-2002, 04:40 AM
can't remember if i supported you then, but it's been my thought since per-pixel lighting was a sweet new topic, since gf1 more or less http://www.opengl.org/discussion_boards/ubb/wink.gif

ash
07-03-2002, 04:48 AM
Originally posted by Robbo:
Finally a response I can agree with. It is not the responsibility of the drivers to make your app degrade gracefully. If you find your card has only 4 texture units, then you need to use a less complex shader instead of the 8-texture shader you are using to max your detail.

What you are essentially requesting is that the same detail results from a card which is less capable than some other card and that the driver should sort out how to do it, even if it goes down to 0.1fps. I cannot think of a useful situation for this scenario. A more useful situation would be for you to stay at 30fps but render a pixel which is the result of a less complex interaction.



I'm confused -- first you seem to agree, then you disagree http://www.opengl.org/discussion_boards/ubb/smile.gif

In response, I agree with your first paragraph, but disagree with the second. You say you can't think of a use for having the same visual output on different cards, even if they run at different speeds? Isn't that the normal practice? Certainly I wouldn't expect some card to suddenly start flatshading my triangles for example just because it can't gouraud them as fast as some other card. I don't see why shaders should be different.

And, if the driver is required to degrade the visual effect of a shader, then it's decided once and the app loses all control over the quality/performance tradeoff. That's why it's better to leave level of detail control to the app.

I'm not sure I understand you correctly though?

Ash

Robbo
07-03-2002, 05:09 AM
We don't disagree. I was giving an example of how you cannot possibly expect a driver to do your dirty work for you. The only exception I can think of would be to have virtual resources, which could just be ignored if they couldn't be mapped to a physical piece of hardware. Your end result would of course be completely undefined, and so this is totally useless.

If cards can agree on a shader language then that's one issue. But if there are cards with 2, 4, 7 or 8 units then I will code different shaders for 2, 4, 7 or 8 units. Coding for 8 units and expecting it to scale down to 2 or 4 is just silly - multi-passing isn't the same thing, and even if it were, remember your fillrate? The likelihood is that your 2-unit card will have a substantially lower fill-rate than your 8-unit card (for obvious technological reasons) - so multi-passing will be useless and not at all scalable in this scenario.

The only reason people are complaining at present is because they have to code different shader languages - not, I suspect, because one card is better than another. This will be sorted to an extent with the new APIs but like I say, the application should scale itself according to the caps of the hardware.

ash
07-03-2002, 05:43 AM
Ok got you now

dorbie
07-03-2002, 05:52 AM
I disagree somewhat. Currently, if you use, say, 6 textures, you have to code different paths even though your software is perfectly viable on all target platforms even IF the API were the same. With more color buffers (as are on the way) the overheads and compromises of going multipass diminish, but you still must implement multiple paths, and even restructure your application to handle it. That problem doesn't go away with better hardware; it just moves.

I agree a common API is a good thing; at the same time, the shader compilers should be made available on a broad range of platforms.

[This message has been edited by dorbie (edited 07-03-2002).]

tfpsly
07-03-2002, 07:05 AM
Originally posted by dorbie:
I disagree somewhat. Currently, if you use, say, 6 textures, you have to code different paths even though your software is perfectly viable on all target platforms even IF the API were the same.

Why? You can write some code to feed the texture units, whatever their number is and whatever your shader is. OK, sometimes you won't be able to put a pass on a texture unit 'cos the proper combine-ext settings aren't available. Maybe you were speaking about per-pixel lighting (which I haven't tested yet)?

Back on the subject (ie: common pixel/vertex programs for every card (Cg/ogl2), or a specific path for each card): I'll go for the first method. And even if old cards are too slow to render my progs "fast enough", i'd write another "light" version for the set of slow cards (which would be used when an old card is detected, namely old geforces, radeons and the like) and a "full effect" version for new cards. Still only 2 progs would be written, instead of x shaders for x cards.

[This message has been edited by tfpsly (edited 07-03-2002).]

Robbo
07-03-2002, 07:58 AM
Yea. That's kinda what I was suggesting. You need a HI path and a LO path (perhaps even a MED path) - it isn't a problem if you have a common shader language. If you MUST take advantage of cutting-edge-feature-x on card y, then go and code a new path for that specific feature for that specific card.

dorbie
07-03-2002, 08:19 AM
You're asking why? This is pretty basic stuff. Maybe you're assuming I'm taking a position that I'm not. If you disagree with what I wrote then explain how, without a shading language like opengl2, you can avoid different code paths on different hardware that's perfectly capable of implementing your algorithm at full speed.

There's a difference in complexity between a high and low path today implemented in hand coded multipass and something that's transparent even with different shaders, but even that wasn't my point.

To implement the SAME algorithm on hardware that HAS the legs to handle it, you have to explicitly query the hardware and code different paths in one approach vs a single path in the other. The key here is that it's not JUST about code paths for performance reasons; there's an additional degree of abstraction with one approach which lets you ignore hardware features like the number of texture units and simply worry about performance as a single variable.

I think that makes the business of offering a couple of shader states (not full code paths IMHO because multipass nastiness is hidden) much simpler for a developer.

[This message has been edited by dorbie (edited 07-03-2002).]

nevermore
07-03-2002, 09:12 AM
Now I'm not completely up to date with shaders (namely due to a lack of hardware), and it probably doesn't help that I have only skimmed over some of the more recent posts in this topic, but it seems you guys are looking at this from a "write a shader for a single pass and let the driver expand it to a multi-pass solution for lower end hardware" perspective. Wouldn't it be much simpler to write the shader as a multi-pass solution for a lowest (choosable, of course) common denominator, and then allow the driver to collapse it to fewer passes (or even a single pass) on the appropriate hardware? That way you have control over the speed for the slowest hardware you are willing to support, and when newer hardware comes out you get the benefit of having your complex shader collapsed to fewer passes. Of course, you would still have the option of writing a specialized shader for the higher end hardware so you can take advantage of some prettier stuff.

Or am I just completely off? Feel free to rip me apart...

dorbie
07-03-2002, 09:47 AM
An interesting observation; however, this is the central difference between the approaches. The problem is not so much the details of the algorithm and how it breaks out into passes, rather the fact that the OpenGL implementation vs the application must implement the multipass.

With your approach you are basically saying the OpenGL implementation implements the multipass because that's the only way it could collapse multiple-pass state into a single pass. Therefore implicit in the approach is the assumption that the pass state is sent to OpenGL and then a single application pass is performed, with OpenGL left to determine the details. With a good compiler, fundamentally it is no different from the multipass-in-the-compiler approach, except that you lose opportunities for optimization. I assume each pass has some minimum degree of functionality, so it does introduce additional compromises. Iteration would also require some additional awkwardness.

knackered
07-03-2002, 09:48 AM
How would the driver know that your next drawing commands are for a second pass? Simply because you don't change the vertex pointer?
The modular nature of the shading language should make it easier to break up into passes... write the shader with lots of small functions that do very little, to give the driver opportunities to break it up, maybe.
In the same way a C compiler can optimize code better when there are lots of simple functions (inline'd http://www.opengl.org/discussion_boards/ubb/smile.gif ).

Korval
07-03-2002, 09:53 AM
I think we've gotten off track of the argument in question.

I doubt anyone is really arguing against a common shader language. There's little argument against it, and the slight speed sacrifice I might get from a poorly-optimized shader compiler is acceptable compared to having hardware-specific shader languages.

What is in question is whether shaders should be allowed to implicitly multipass?

First of all, nothing is ever going to change the fact that multipass is always slower than single pass. No matter how many aux-buffers a card has or any other feature it may possess. Nothing is going to change the fact that you are going to have to re-transform geometry (certainly with a new vertex shader, thus causing a state change stall).

Secondly, as someone pointed out, having dozens of pixel pipelines is horribly inefficient. As polygon counts go up, the size of polygons drops. At 20 million polys per second, a 2x2 pixel footprint will be just as efficient as an 8x8 or 4x4 footprint. Therefore, the added cost of the extra pipelines is utterly useless in the common case.

As has already been pointed out, for performance reasons, you're already going to have multiple shaders for different levels of hardware. If you want optimal performance, you want to avoid multipass. Therefore, you will need to be able to look at the hardware capabilities and decide how best to utilize the features of the hardware in your shaders. No driver will ever be able to discern this information.

cass
07-03-2002, 11:08 AM
Korval,

Multipass is not necessarily slower than single pass - to implement the same shading equation. It is a very architecture-dependent question, and future architectures are going to be tuned for the "common usage" case. Using fewer resources in multipass may well be faster than using the maximum resources available in a single pass.

I agree with your assertion that the time is not yet right for transparent multipass inside the driver. However, vendors are welcome to explore these ideas through extensions.

This is actually one of the more troubling aspects of the current OpenGL 2.0 approach. OpenGL has always built on proven ideas. OpenGL 1.0 came from SGI's experience with IrisGL, and 1.1 thru 1.4 came from proven extensions. The GL_GL2_* naming implies that OpenGL R&D has moved into the ARB, and that's a bit scary. The ARB is a *governance* body - there to make existing functionality uniform and standard in the OpenGL API. Do we really want the ARB debating standardization of interfaces and functionality that don't even exist yet?

I applaud the desire to move OpenGL further faster. I am all for that.

Thanks -
Cass

[This message has been edited by cass (edited 07-03-2002).]

folker
07-03-2002, 12:04 PM
Originally posted by ash:
I can agree with both of you here; I think they are different but worthy goals.

On the one hand, it's a very good thing not to have to write different code for different hardware if you don't want to -- and OGL2 is an important step in that direction.

On the other hand, it's also a good thing to be able to code your engine such that it degrades usefully on different hardware, by intelligently degrading scene detail and rendering complexity (for example by means of alternate shaders of varying complexity) according to the capabilities of the hardware at hand.

I don't see these goals as necessarily contradictory: we must design solutions that solve both, at the end of the day. The question is whether the ability to write one shader and have it run on any (capable) hardware at some speed prevents you from also regulating rendering complexity in your app. I don't believe it does. Surely providing multiple shaders at various levels of detail and choosing between them at runtime is no more difficult in OpenGL2 than providing multiple levels of geometric detail, and selecting between them. They're both still the app's problem.

Ash

Agreed completely with your smart analysis.

The reason why I wanted to contradict Matt is that he used the wish to adapt rendering complexity to the hardware performance as an argument against OpenGL 2.0, which is wrong and does not in any way interfere with the goal of OpenGL 2.0 of making shaders hardware-independent.

folker
07-03-2002, 12:21 PM
Originally posted by Robbo:
What you are essentially requesting is that the same detail results from a card which is less capable than some other card and that the driver should sort out how to do it, even if it goes down to 0.1fps. I cannot think of a useful situation for this scenario. A more useful situation would be for you to stay at 30fps but render a pixel which is the result of a less complex interaction.


Running the same shader code at different speeds on different hardware (as aimed for by OpenGL 2.0) is very useful not only in many situations, but also for reasons of principle:

Hardware is getting faster and faster. So if after X months / years you have hardware which is 6 times faster (e.g. because it can automatically reduce 6 passes to a single pass as in OpenGL 2.0), then you can use all the same shader code without modification to run new content(!) which has 6 times more complexity / detail.
-> You can re-use software to render more detailed content.

If you manually implemented your shader with 6 passes on your old hardware, you have to re-write your effect as single-pass to take advantage of the new hardware -> you have to rewrite your software, in the end only because the hardware is faster (by allowing the same effect in a single pass instead of 6 passes in this example).

Re-using software is really essential in the fast-developing software industry. It is already complex enough to develop new features, so we should avoid a situation where merely faster hardware forces you to rewrite your software all the time.

knackered
07-03-2002, 12:36 PM
Originally posted by cass:
Do we really want the ARB debating standardization of interfaces and functionality that don't even exist yet?

Do we really want NVidia doing this? (ie. Cg)

cass
07-03-2002, 12:54 PM
Knackered,

Yes. I think developers want *all* IHVs working on a shading language solution (hopefully in cooperation with each other), but let's not pre-ordain that one will *be* the OpenGL 2.0 shading language until the developer community has had an opportunity to evaluate different proposals (through implementations that consist of extensions - and possibly external libraries like Cg) and work the kinks out of them. This model of proving, then standardizing functionality and interface is one of the best things about OpenGL, and I'm loathe to abandon it.

Cass

folker
07-03-2002, 01:21 PM
Originally posted by cass:

Knackered,

Yes. I think developers want *all* IHVs working on a shading language solution (hopefully in cooperation with each other), but let's not pre-ordain that one will *be* the OpenGL 2.0 shading language until the developer community has had an opportunity to evaluate different proposals (through implementations that consist of extensions - and possibly external libraries like Cg) and work the kinks out of them. This model of proving, then standardizing functionality and interface is one of the best things about OpenGL, and I'm loathe to abandon it.

Cass

This is exactly what is happening with OpenGL 2.0:
The shading language extensions proposed for ARB approval are successfully implemented by 3dlabs.

(This does not include the full OpenGL 2.0 drafts. But note that 3dlabs only proposes ARB approval for those parts of OpenGL 2.0 that they have implemented successfully. So there is no standardization of non-existent features.)

nevermore
07-03-2002, 01:47 PM
Originally posted by knackered:
How would the driver know that your next drawing commands are for a second pass? Simply because you don't change the vertex pointer?
The modular nature of the shading language should make it easier to break up into passes... write the shader with lots of small functions that do very little, to give the driver opportunities to break it up, maybe.
In the same way a C compiler can optimize code better when there are lots of simple functions (inline'd http://www.opengl.org/discussion_boards/ubb/smile.gif ).

Perhaps something like a glShaderPointer(...) would be in order. The shader could explicitly state what happens in each pass for your lowest denominator (or you could branch the code to write something else for different hardware if you so choose), and a run-time compiler could check to see if anything could be collapsed into fewer passes with the current hardware in use.
That's just one way of doing it, and I'm sure there are better ways, but I'm just throwing some ideas around.

folker
07-03-2002, 01:53 PM
Originally posted by cass:

I think developers want *all* IHVs working on a shading language solution (hopefully in cooperation with each other)...

Right. But all IHVs are already working on a shading language solution in cooperation with each other, and this solution is OpenGL 2.0:

According to the ARB Meeting Notes of December 12, 2001, the "Overall response" of all ARB members to OpenGL 2.0 "is positive".

And as far as I know, all official statements from NVidia about OpenGL 2.0 say that they want to support OpenGL 2.0 as soon as it is approved by the ARB. As far as I know, the OpenGL 2.0 shader language extensions are scheduled for ARB approval at the next ARB meeting.

I hope that this all will work out fine, because then we will have a cool future-oriented shading language supported by all IHVs! :-)

cass
07-03-2002, 03:23 PM
Originally posted by folker:
This is exactly what is happening with OpenGL 2.0:
The shading language extensions proposed for ARB approval are successfully implemented by 3dlabs.


No, it's different.

ARB extensions - particularly ones of great consequence - are never "new" functionality.

Having a complete, robust implementation (which I cannot attest to, because I haven't seen any implementation) is only the first step on the way to standardization.

We're getting ahead of ourselves if we think we're ready to standardize fully general high level hardware shading having never really done it before. The whole "profile" concept in Cg is about using a fully general programming language with an evolving (relaxing) set of restrictions.

Thanks -
Cass

PS These are my personal opinions -- not necessarily those of my employer.

davepermen
07-03-2002, 09:49 PM
sweet cass..
just want to note that the gl2 features are not here to force you to support them now, but to show you what your hw should look like. the design is actually done by the vendors, working together in the arb (the arb is built from vendors and some other companies, if i look at the list). they would be stupid if they did not design it in a way that is
a) revolutionary and bug-cleaning
b) good to implement in the vendors' hw
but this time gl evolves before the stuff is there yet, and that's a good thing. the whole function callbacks (aka shaders) are a very new topic, and they should not die out in a standards war like the one over html web pages, which even today doesn't work completely. that's what made gl2 come: defining the standard before some pseudo-standards are there.
we have a lot of pseudo-standards even today.. see vertex shaders and vertex programs, see now cg etc. soon we'll have gl_arb_vertex_program, so one part of it is standard. but the pixel shader standards still have to wait, and pixel shading is a very complex topic in terms of what_should_be_in_the_standard. so starting the discussion today is very good (even today is too late, we already have some crappy shaders, as nvidia and ati have shown us (sorry, they can do fancy stuff, but the usage is crap))

oh, and about multipass being faster than singlepass.. actually that hurts deep in my heart http://www.opengl.org/discussion_boards/ubb/smile.gif but i know on a gf4 it's faster to draw two times with 2 textures than one time with 4.. nutty tested it and it was that way..
that hurts, but well, it's the way it is.. http://www.opengl.org/discussion_boards/ubb/smile.gif
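
roughly, the two setups being compared look like this (a sketch - combiner setup, texcoords and geometry submission are left out, and the math has to be arranged so the two additive passes really add up to the single-pass result; tex[], bind_two_units() and draw_geometry() are placeholders):

#include <GL/gl.h>
#include <GL/glext.h>

extern GLuint tex[4];
extern void bind_two_units(GLuint a, GLuint b);   /* placeholder */
extern void draw_geometry(void);                  /* placeholder */

/* one pass over the geometry, 4 texture units */
void single_pass_4_textures(void)
{
    int u;
    for (u = 0; u < 4; ++u) {
        glActiveTextureARB(GL_TEXTURE0_ARB + u);
        glEnable(GL_TEXTURE_2D);
        glBindTexture(GL_TEXTURE_2D, tex[u]);
    }
    draw_geometry();
}

/* two passes, 2 units each, second pass added on top */
void two_passes_2_textures(void)
{
    bind_two_units(tex[0], tex[1]);
    draw_geometry();

    glDepthFunc(GL_EQUAL);               /* only touch pixels already drawn */
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);         /* additive second pass */
    bind_two_units(tex[2], tex[3]);
    draw_geometry();
    glDisable(GL_BLEND);
    glDepthFunc(GL_LESS);
}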

anyways, don't take this in the wrong moment, take a coffee before http://www.opengl.org/discussion_boards/ubb/smile.gif

folker
07-03-2002, 11:07 PM
Originally posted by cass:
No, it's different.

ARB extensions - particularly ones of great consequence - are never "new" functionality.

Having a complete, robust implementation (which I cannot attest to, because I haven't seen any implementation) is only the first step on the way to standardization.


Our software is successfully running with the OpenGL 2.0 SL extensions. These extensions which are running successfully on real existing hardware are proposed for ARB approval.

(BTW, our software is also running successfully with Cg already - we are pragmatic ;-)


Originally posted by cass:


We're getting ahead of ourselves if we think we're ready to standardize fully general high level hardware shading having never really done it before. The whole "profile" concept in Cg is about using a fully general programming language with an evolving (relaxing) set of restrictions.

Thanks -
Cass

PS These are my personal opinions -- not necessarily those of my employer.



The functionality of the OpenGL 2.0 SL is basically the same as Cg and the DX9 HLSL (the OpenGL 2.0 SL is more powerful regarding control statements etc.). In other words, there is already agreement about a high-level C-like shader language; the debate about the language itself is finished already. Even the differences from the good old Stanford SL are only formal ones. All these languages agree on the fundamental concepts.

Why do you want to wait to approve the OpenGL 2.0 SL?
Because people need another three years to find out which casting syntax they prefer?

There are no open questions left in defining a high-level shader language. But we have the urgent problem of having no shader language standard(!) in OpenGL. So why not simply solve this problem by standardizing the shader language?

The next step towards full OpenGL 2.0 is a different task. For example, all this transparent-multipass discussion. But no one is suggesting this functionality for ARB approval yet. And indeed, it still needs time. (Claiming that the ARB should approve something which does not exist yet is simply wrong.)

But it is important to define the future direction. For example, the direction towards a shader language which is really hardware-independent, so that every shader runs on every piece of hardware, "only" with different performance. It was always the philosophy of OpenGL that every app runs on every piece of hardware. OpenGL 2.0 is the first design to solve this problem, which needs a solution as soon as possible.

Dave Baldwin
07-03-2002, 11:26 PM
Originally posted by cass:
This is actually one of the more troubling aspects of the current OpenGL 2.0 approach. OpenGL has always built on proven ideas. OpenGL 1.0 came from SGI's experience with IrisGL, and 1.1 thru 1.4 came from proven extensions. The GL_GL2_* naming implies that OpenGL R&D has moved into the ARB, and that's a bit scary. The ARB is a *governance* body - there to make existing functionality uniform and standard in the OpenGL API. Do we really want the ARB debating standardization of interfaces and functionality that don't even exist yet?

I applaud the desire to move OpenGL further faster. I am all for that.


Actually all the ideas in the proposed OGL2 shading language derive from RenderMan and C - both of which predate OpenGL 1.0 by some years - so painting the picture that all this is new and unproven, and therefore risky to adopt as a standard, is clearly suspect.

Dave.
3Dlabs.

Dave Baldwin
07-03-2002, 11:47 PM
Originally posted by cass:

Knackered,

Yes. I think developers want *all* IHVs working on a shading language solution (hopefully in cooperation with each other), but let's not pre-ordain that one will *be* the OpenGL 2.0 shading language until the developer community has had an opportunity to evaluate different proposals (through implementations that consist of extensions - and possibly external libraries like Cg) and work the kinks out of them. This model of proving, then standardizing functionality and interface is one of the best things about OpenGL, and I'm loathe to abandon it.

Cass

All the IHVs except nvidia have been working together on this, together with an enormous amount of input from the ISV community and other ARB members. Despite all our attempts to include nvidia in this process they have road-blocked at every point. Instead they introduced Cg - similar enough to the OGL2 shading language on the one hand, but different enough to bifurcate the industry on the other. Having an effective marketing organization helps!

Dave.
3Dlabs

Mezz
07-04-2002, 12:50 AM
Ouch.
Perspective, such a wonderful thing.

folker
07-04-2002, 01:08 AM
Originally posted by barthold:
At the ARB meeting a week and a half ago we presented a plan for the ARB to work on getting parts of the OpenGL2 white paper functionality into OpenGL in incremental steps. we presented 3 extensions and the OpenGL2 language specification. The extensions handle vertex programmability, fragment programmability and a frame work to handle OpenGL2 style objects. As a result a working group was formed by the ARB. This working group is headed by ATI. Jon Leech should post the ARB minutes on www.opengl.org (http://www.opengl.org) shortly, if he didn't already.

Barthold,
3Dlabs

"shortly" means how many month? ;-)

ash
07-04-2002, 01:21 AM
Originally posted by nevermore:
Now I'm not completely up to date with shaders (namely due to a lack of hardware), and it probably doesn't help that I have only skimmed over some of the more recent posts in this topic, but it seems you guys are looking at this from a "write a shader for a single pass and let the driver expand it to a multi-pass solution for lower end hardware" perspective. Wouldn't it be much simpler to write the shader as a multi-pass solution for a lowest (choosable, of course) common denominator, and then allow the driver to collapse it to fewer passes (or even a single pass) on the appropriate hardware? That way you have control over the speed for the slowest hardware you are willing to support, and when newer hardware comes out you get the benefit of having your complex shader collapsed to fewer passes. Of course, you would still have the option of writing a specialized shader for the higher end hardware so you can take advantage of some prettier stuff.

Or am I just completely off? Feel free to rip me apart...

I can see your thinking on this, but it seems better to me to make the program written by the developer a "pure" hardware-independent one and leave the job of coping with hardware details to the compiler. That's the approach taken with CPU high-level languages, and I think it's the right one.

There's also the question of dating: if the shader is written with the details of particular hardware in mind, then it dates quickly as that lowest common denominator (is there even such a thing?) gradually disappears. A shader written assuming nothing about how many passes will be involved is always correct even when faster hardware comes along.

Trust the compiler http://www.opengl.org/discussion_boards/ubb/smile.gif

Ash

ash
07-04-2002, 02:11 AM
Originally posted by folker in response to Robbo:
Running the same shader code at different speeds on different hardware (as aimed for by OpenGL 2.0) is very useful not only in many situations, but also for reasons of principle:

Hardware is getting faster and faster. So if after X months / years you have hardware which is 6 times faster (e.g. because it can automatically reduce 6 passes to a single pass as in OpenGL 2.0), then you can use all the same shader code without modification to run new content(!) which has 6 times more complexity / detail.
-> You can re-use software to render more detailed content.

If you manually implemented your shader with 6 passes on your old hardware, you have to re-write your effect as single-pass to take advantage of the new hardware -> you have to rewrite your software, in the end only because the hardware is faster (by allowing the same effect in a single pass instead of 6 passes in this example).

Re-using software is really essential in the fast-developing software industry. It is already complex enough to develop new features, so we should avoid a situation where merely faster hardware forces you to rewrite your software all the time.



Having already been through this with Robbo I'll say (at the risk of putting words into Robbo's mouth) that I think Robbo agrees with this, and when he said he "couldn't see any use" for having the same shader working identically at different speeds on different hardware, what he meant was simply that in practice one might want to use multiple shaders at different complexities, to keep the framerates constant.

Correct me if I'm wrong, Robbo.

Ash

Robbo
07-04-2002, 02:54 AM
I think that's basically correct. Add in the requirement that any given shader should be decomposable into some multi-pass solution, and what you are in effect doing is [probably] limiting the kinds of things you can do in the shader in the first place.

Firstly, why use fragment/vertex shaders in the first place if you want to support `low' end cards like a GF2MX? Surely having shader-capable hardware is a basic prerequisite. I've heard people complaining about their Voodoo 3 not being able to do this or that, and people have responded with `get yourself a better card'. Touche!

Secondly, you can safely make assumptions about your bottom line in terms of the number of combiner stages etc. Nobody said Carmack had to write a shader to do it in one pass on an ATI and two or three on a GeForce. Why not have a single code path that does it in two or three for both the ATI and NVIDIA cards? If he wants to make it quicker for the ATI, then he has to write another code path to do it in one. That's his problem, not the driver's.

At present I don't even assume multi-texture or CVA locking even though most cards have this capability.

Julien Cayzac
07-04-2002, 02:55 AM
Originally posted by Dave Baldwin:
All the IHVs except nvidia have been working together on this.

That partly validates my fears. See my post in the "future versions" forum http://www.opengl.org/discussion_boards/ubb/frown.gif

Julien.

folker
07-04-2002, 04:05 AM
Originally posted by Robbo:

I think that's basically correct. Add in the requirement that any given shader should be decomposable into some multi-pass solution, and what you are in effect doing is [probably] limiting the kinds of things you can do in the shader in the first place.


Automatic splitting of shader programs into multiple passes only makes sense if it works for arbitrary shader programs, not only for special restricted shader programs. This is the reason why OpenGL 2.0 suggests auxiliary buffers.



Firstly, why use fragment/vertex shaders in the first place if you want to support `low' end cards like a GF2MX? Surely having shader-capable hardware is a basic prerequisite. I've heard people complaining about their Voodoo 3 not being able to do this or that, and people have responded with `get yourself a better card'. Touche!

Secondly, you can safely make assumptions about your bottom line in terms of the number of combiner stages etc. Nobody said Carmack had to write a shader to do it in one pass on an ATI and two or three on a GeForce. Why not have a single code path that does it in two or three for both the ATI and NVIDIA cards? If he wants to make it quicker for the ATI, then he has to write another code path to do it in one. That's his problem, not the driver's.


Hm, this seems to be the opposite of my opinion again ;-)

Carmack has to write a separate implementation of the same functionality for faster hardware only in order to take advantage of this faster hardware. In my opinion this fundamentally contradicts the idea that a driver interface like OpenGL should hide hardware-specific implementation details of the same functionality.

But I agree that of course it also makes sense to provide different paths with fewer or more effect features for slower and faster cards. But this is independent of the above aspect.


[This message has been edited by folker (edited 07-04-2002).]

EG
07-05-2002, 05:13 AM
IMO the main reason why drivers should take care of multipassing when needed is that most OpenGL developers don't want to take care of that, don't have the time to take care of that, don't have the expertise to take care of that, or just don't have the money to buy all the hardware required to test... http://www.opengl.org/discussion_boards/ubb/wink.gif

For the remaining "power users", having and using low-level, hardware-specific fallbacks, hints, callbacks or details isn't that big of an issue (though even them may not want to always get down to minute details).

Never forget that applications outlive hardware, not the other way around.

davepermen
07-05-2002, 05:27 AM
there's just one problem with the automultipass technique.
you code a modern shader, with advanced features used in it. it runs on a new card which has these features (for example dot3 http://www.opengl.org/discussion_boards/ubb/wink.gif) built in. it can do it in one pass and runs fast anyway (cause it's new).
the automatic fallback generates lots (for dot3 really LOTS http://www.opengl.org/discussion_boards/ubb/smile.gif) of passes for the hw that does not have dot3. result: 12 passes (i think). the problem is: it's not 12 times slower now, but even more, as the hw that does NOT support dot3 is just an old, slower card.. so the gap between old and new hw grows even more rapidly, and the power of the auto-multipass-in-driver actually results in bigger problems than you expected. it's like enabling the nv20 emulator on a gf2mx: a program that has a fallback for gf2 but doesn't check for emulators simply runs with gf3 features. and.. for pixel shaders.. well.. *ouch*. it's even more of a problem if they start using 3d textures without really checking for support (querying for the correct extensions, there are quite a few around now.. EXT, ARB, gl1.3 etc). an engine that for example does distance attenuation on gf3+ with a 3d texture and on gf2- with one 1d and one 2d tex will fail to use the gf2 technique just because the gf2 always emulates tex3d. and 3d textures on a gf2 are SLOOOOOOOOWWwwwwwwwwwww
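
for what it's worth, the support check itself is cheap - roughly like this (a sketch; the GL calls are real, the version-string comparison is crude, and note that "supported" does not mean "fast", which is exactly the emulation trap described above):

#include <string.h>
#include <GL/gl.h>

/* Sketch: decide whether 3D textures are available at all.  GL 1.2+ has them
   in core, older drivers may expose GL_EXT_texture3D.  A card can still only
   *emulate* them slowly (the gf2 case above), so this check alone is not
   enough to pick the fast path. */
int has_3d_textures(void)
{
    const char *ver = (const char *)glGetString(GL_VERSION);
    const char *ext = (const char *)glGetString(GL_EXTENSIONS);
    int in_core = (ver != NULL && strncmp(ver, "1.2", 3) >= 0);  /* crude, OK for 1.x */
    int in_ext  = (ext != NULL && strstr(ext, "GL_EXT_texture3D") != NULL);
    return in_core || in_ext;
}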

it would be fun to have it, but i prefer to get a standard low-level interface. as i only use one pass most of the time anyway and just plug everything in there http://www.opengl.org/discussion_boards/ubb/wink.gif (i have huge problems with multipass with transparency, and like carmack, i only want one solution for a problem, not if(that) setupthis else setupthat, and i'm talking about different objects in code, not different gpu's.)

harsman
07-05-2002, 05:40 AM
So what's the difference then between the way OpenGL works now and having transparent driver multipass? Either everything works but you have to use shader LOD and benchmarking to get good framerates, or capabilities change depending on hardware and you switch based on that. I don't see any difference between this issue and D3D caps and OpenGL software rendering as they exist today. I don't think the caps approach is better, but in practice the difference is probably small. The "OpenGL way" has a little more abstraction, but in practice the difference to ISVs is small. Either you get no visual error but bad performance, or you get nothing but a clear indication of which capabilities you can't use.

EG
07-05-2002, 05:45 AM
You can always decide to turn your shaders off, just like you turn off cube maps in your existing code.

Who's still coding in 8-bit ASM today? Nobody. IMO, in the not so distant future, today's GPUs will be in the same position as 8-bit CPUs are now (and GPUs evolve *much* faster).
Like I relied on Intel and AMD to make my stuff run orders of magnitude faster (optimisation can only go so far), I would prefer to rely on 3DLabs, nVidia & ATI to make my graphics run orders of magnitude faster, not just home-grown per-hardware hand-tailored tricks.
Remember the "C is slow" debates in those early days? with people arguing for lower-level but available-sooner alternatives like Fortran? or the fixed-point vs floating-point debates?

Having a higher-level API does not prevent you from going low-level, but it sets the spirit, the goal... and cuts development time drastically whenever it's "Good Enough".

JelloFish
07-08-2002, 06:36 AM
Originally posted by Dave Baldwin:
All the IHVs except nvidia have been working together on this, together with an enormous amount of input from the ISV community and other ARB members. Despite all our attempts to include nvidia in this process they have road-blocked at every point. Instead they introduced Cg - similar enough to the OGL2 shading language on the one hand, but different enough to bifurcate the industry on the other. Having an effective marketing organization helps!

Dave.
3Dlabs


You would think the marketing people would get it: if ISVs have to break something down into 2 passes, there is a good chance that a 1-pass setup for hardware that won't exist until next year won't get written. If the driver broke it down instead, then when that new piece of hardware came out, games would actually use it right away and people would have a reason to upgrade.

There is a fundamental problem with supporting new hardware right now, and it needs to be addressed with the utmost urgency.

The IHVs need to chase the ISVs: the more code reuse there is, the more innovation there will be. I think people can live with new games running at 20fps for a year or so until the IHVs ramp up; that's the way it works on the CPU side, and GPUs should follow suit. The optimization for a GPU has to come from the fact that a driver layer exists between it and the software, not from the fact that the software is coded perfectly for it (even though that can still be done in any case).

What I want to know is which marketing guru came up with "z correct bumpmapping" as a new thing for the GF4 - doesn't that feature exist on the GF3?

[This message has been edited by JelloFish (edited 07-08-2002).]

henryj
07-08-2002, 12:25 PM
...until the developer community has had an opportunity to evaluate different proposals (through implementations that consist of extensions - and possibly external libraries like Cg) and work the kinks out of them.

In the post-Microsoft age this doesn't happen any more. He who releases early and saturates the market becomes the standard, even if it sucks. Surely I don't have to give you examples http://www.opengl.org/discussion_boards/ubb/smile.gif
This is exactly why nVidia (with Microsoft's backing) has done this.
Has anyone checked whether nVidia has lodged any Cg-related patents lately? Any truth to the rumour that Microsoft pulled the IP card out at an ARB meeting recently?
The sooner we get OGL2 the better.