DirectX Next



pkaler
12-07-2003, 09:48 AM
Anyone have any comments on this article? http://www.beyond3d.com/articles/directxnext/

Looks like the super buffers extension will cover a fair chunk of that functionality. The topology processor sounds interesting. I'm not so sure about the surface tessellator.

Cyranose
12-07-2003, 10:41 AM
I was going to ask the same questions, almost word for word. http://www.opengl.org/discussion_boards/ubb/smile.gif

I've been wanting to use a topology processor for a long time as I'm a big fan of procedural geometry. Emulating this in a VP was just too much of a pain so far.

The virtual video memory confuses me a bit, since that's what we normally do anyway, at least one-way. Maybe it's just more "managed" for us simpletons. The big question is whether page faults can be absorbed or will they cause stalls. That's the tricky bit, IMO.

I do like the bit about arraying all scene matrices and letting geometry seamlessly index into those (my interpretation of the bit about instancing). We've talked about a couple of ways to do that with current hardware, but I haven't heard of any GL extension proposal that'll allow that explicitly. I could use that yesterday.

Avi

3k0j
12-07-2003, 10:56 AM
final page of meltdown'03 "Future Features" presentation:
Please send in your feature requests
- What features are still missing?
- What scenarios does this feature set still not enable?
- What syntax would you like to see used to expose these features?
- Mail to directx@microsoft.com


translation: VaporwareX Next (R).

MICROS~1.OFT clearly doesn't know yet what DireX Next will be.

Korval
12-07-2003, 11:25 AM
Since DX Next will ship with Longhorn, you're looking at a 2006 launch time. So Microsoft doesn't really need to have a firm, definitive knowledge of what it will be yet.

sebh
12-07-2003, 12:06 PM
Impressive, but OpenGL will be able to do the same thing with extensions...

Ostsol
12-07-2003, 01:11 PM
Well, MS had been planning for DirectX 9 to last quite a while, a fact made quite obvious by the existence of the very powerful shader version 3.0 specs. During the wait, perhaps video card design will mature further: FP32 on all new cards, plus good FP render target and texture support. In truth there's still a heck of a lot that can be improved on video cards with the current DirectX specs and OpenGL [ARB] extensions.

pkaler
12-07-2003, 02:34 PM
A related question, anyone know the status of the super buffers working group? The ARB meeting notes aren't up yet from the September meeting. And I think there is supposed to be another meeting coming up this week.

Zeno
12-07-2003, 06:17 PM
I have a comment on the frame buffer access page: Hardware vendors, please please please give us access to the frame buffer from within a fragment program...even if it's just the fragment we're about to write to. It's so important to be able to do customized blending, particularly to higher-precision buffers.

If the statement that hardware vendors want to drop this is accurate, could someone please explain why?

Korval
12-07-2003, 07:30 PM
If the statement that hardware vendors want to drop this is accurate, could someone please explain why?


Fundamentally, it's the same reason why uploading a texture is faster than downloading it from hardware. The pipeline works best/fastest when it goes one way. Which is why the blend stage is at the very end of the pipe, and it is also why turning on alpha blending causes a performance drop.

Also, it prevents them from making parallelizing optimizations, where a program might be running such that it is overwriting the same fragment on two different pipelines. The "alpha blending" unit would be responsible for sorting out which data goes where. If this is possible in advanced hardware, a framebuffer read operation would have to cause a full-pipeline stall.

In general, from a performance standpoint, it is an all-around bad idea.

Besides, what do you need custom blend operations for if you have infinite-length fragment programs?

Adrian
12-07-2003, 08:13 PM
From an article on the inquirer about a leaked ATI document.
"ATI's PCI Express...will offer bi-directional simultaneous 4GB/s bandwidth" http://www.theinquirer.net/?article=12991

If they are going to give us lots of bidirectional bandwidth then presumably they are going to make use of that bandwidth. Could it mean we are getting fast frame buffer reads soon?

[This message has been edited by Adrian (edited 12-07-2003).]

Zeno
12-07-2003, 09:17 PM
Originally posted by Korval:
Fundamentally, it's the same reason why uploading a texture is faster than downloading it from hardware. The pipeline works best/fastest when it goes one way.

I disagree with your analogy. Texture downloading is slow partially because any remaining graphics instructions must be flushed first, and partially for some mysterious reason I have yet to figure out. Texture downloading is broken somehow....it is certainly NOT AGP speed. I have heard that whatever this mysterious problem is, it will be fixed with PCI express, so it's not fundamental....in fact it probably has something to do with current PC architecture, not some inherent "you're going the wrong way" problem.



Which is why the blend stage is at the very end of the pipe, and it is also why turning on alpha blending causes a performance drop.


Blending is at the end of the pipe because it has to be. It requires the final about-to-be-written fragment color, which is only available at the end of the pipe. It causes a performance drop simply because there is additional calculation (and, yes, a read instruction) per pixel. The performance drop is not very significant, though.



Also, it prevents them from making parallelizing optimizations, where a program might be running such that it is overwriting the same fragment on two different pipelines.

The "alpha blending" unit would be responsible for sorting out which data goes where. If this is possible in advanced hardware, a framebuffer read operation would have to cause a full-pipeline stall.


This argument is wrong if the fragment program is only allowed to read the pixel it is about to write. I think that's a fair and understandable limitation. I just want a register that has the color of the pixel at the current location.



In general, from a performance standpoint, it is an all-around bad idea.


I don't think you've supported this point.



Besides, what do you need custom blend operations for if you have infinite-length fragment programs?

Sure, I guess if the fragment program could be infinite length and have infinite storage, I could upload my whole scene to it, sort out the transparent items, and software rasterize the triangles. Nothing's infinite, though. Blending is necessary ANY time you have two different translucent materials on top of each other in the frame buffer. A (normal, non-infinite) fragment program only deals with 1 fragment from 1 triangle at a time. You can't get info about other triangles that rasterize to the same point but different depths in screen space, so you can't use it to do what you would normally use blending to do. Think about seeing through water or glass or rendering a particle system.
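For reference, a minimal sketch of the fixed-function blending setup being discussed, in plain C with standard OpenGL 1.x calls; drawParticlesBackToFront() is a hypothetical helper, not part of GL:

#include <GL/gl.h>

/* Hypothetical helper: issues the translucent/particle geometry, sorted back to front. */
extern void drawParticlesBackToFront(void);

void drawTranslucentPass(void)
{
    glEnable(GL_BLEND);
    /* dst = src.a * src.rgb + (1 - src.a) * dst.rgb; the read of dst happens
       in the fixed-function blend stage, not in the fragment program. */
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glDepthMask(GL_FALSE);   /* test depth, but don't write it */

    drawParticlesBackToFront();

    glDepthMask(GL_TRUE);
    glDisable(GL_BLEND);
}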

Korval
12-07-2003, 09:46 PM
I disagree with your analogy. Texture downloading is slow partially because any remaining graphics instructions must be flushed first, and partially for some mysterious reason I have yet to figure out. Texture downloading is broken somehow....it is certainly NOT AGP speed. I have heard that whatever this mysterious problem is, it will be fixed with PCI express, so it's not fundamental....in fact it probably has something to do with current PC architecture, not some inherent "you're going the wrong way" problem.

AGP only works one way: to the graphics card. Access from the CPU uses the PCI bus, which is very slow. Apparently, PCI express will solve this.


This argument is wrong if the fragment program is only allowed to read the pixel it is about to write.

My point was that, in an advanced architecture (one that is very different from the current one), another pipe might be running on a fragment that is going to write the same pixel. If it issues a read request, it would have to stall until all other dependencies are finished. If it can't issue that request, then it is up to later parts of the pipeline to stall, ones that might be designed to do so.


A (normal, non-infinite) fragment program only deals with 1 fragment from 1 triangle at a time. You can't get info about other triangles that rasterize to the same point but different depths in screen space, so you can't use it to do what you would normally use blending to do.

But what do you need that functionality for? If you're doing physically correct blending, then you don't need a fragment program to do it; the current blending technology is sufficient. If you aren't doing physically correct blending, then tough.


Think about seeing through water or glass or rendering a particle system.

I don't see how either of those needs any more blending operations than what they already have.

Zeno
12-07-2003, 10:21 PM
Originally posted by Korval:
My point was that, in an advanced architecture (one that is very different from the current one), another pipe might be running on a fragment that is going to write the same pixel. If it issues a read request, it would have to stall until all other dependencies are finished. If it can't issue that request, then it is up to later parts of the pipeline to stall, ones that might be designed to do so.

Of course, in that case, a stall would be caused. What would be wrong with making a system with the understanding that if you read from somewhere other than where you are about to write, either order is not guaranteed or there will be a stall?


But what do you need that functionality for?

I'm not sure why you always take this angle when people talk about feature requests. It shouldn't matter to you, but I'll answer it anyway. There is not a single thing that one has to have this feature for. It is simply nice. It would be useful in 1000 different cases and essential in none. You can always draw to a buffer then bind it back in as a texture and shuttle it through the entire pipeline again. Nevermind that doing so takes an additional pass and obfuscates code. Nevermind that this may make some scenes almost impossible to get right. Nevermind that it seems horribly inefficient. Nevermind that you could use this to argue for the removal of any blending unit whatsoever, or whole classes of OpenGL functionality. Why would anyone need it? Because it would be damn convenient. Same reason there are polygon modes other than GL_TRIANGLE and the same reason there is an XPD instruction in fragment program.


I don't see how either of those needs any more blending operations than what they already have.

Because what we have doesn't support high-precision blending. What we have doesn't support exponential light decay based on thickness (obtained by having the ability to read depth or stencil buffer of current pixel in fp). What we have doesn't support using a different blend equation for color and alpha, or for each color channel. An analogy for you: Ability to do blending by reading frame/depth/stencil buffer from fragment program is to the current blending model as fragment programs are to register combiners. The former is completely general whereas the latter is nothing more than flipping a few hard-wired switches.

Korval
12-07-2003, 11:30 PM
I'm not sure why you always take this angle when people talk about feature requests.

Because that is the angle that ought to be taken. Rather than inundating hardware developers with random requests for functionality, the functionality in question should have some justification to it.

I could just as easily say that hardware should do shadows for you. However, the complicated nature of shadows in a scan converter, coupled with the means of accessing it in a fragment program, makes this too difficult to implement in hardware. As such, even making the request is unreasonable.

The existence of fragment and vertex programs is justified. The existence of "primitive" programs is justified. The existence of floating-point render targets and textures is justified. However, there is an argument for each of these features that justifies them. If you can't justify a feature, it shouldn't be added.


There is not a single thing that one has to have this feature for. It is simply nice. It would be useful in 1000 different cases and essential in none.

Then name some cases. Justify the necessity of having this functionality in the same way that other features are justified. Or do you simply want to have a feature just to have it? That kind of thinking leads to a hardware nightmare, where you just add an opcode because it sounded like a good idea at the time, rather than evaluating the need for a feature.

If you tell me that an entire class of advanced rendering techniques would use this functionality, and without it they would run 20x slower, and that they are crucial towards the ultimate goal of photorealism, then there is sufficient justification for adding the feature. If you can't do that for this feature, then there is no point in having it.


Because what we have doesn't support high-precision blending.

Which hardware developers have promised to provide in the future (NV40/R420). So that point is moot.


What we have doesn't support exponential light decay based on thickness (obtained by having the ability to read depth or stencil buffer of current pixel in fp).

Which, of course, could be passed in as the "alpha" given the above operations. So, once again, hardly a necessity.


What we have doesn't support using a different blend equation for color and alpha, or for each color channel.

How useful is this, compared to what you already have? And how often will this functionality be required?

Also, EXT_blend_func_separate exists, so at least RGB and ALPHA can be blended separately. Said functionality could be extended to offer independent RGB blend functions.
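For what it's worth, a one-call sketch of that extension in use (C; the particular factor choice here is just an example):

/* EXT_blend_func_separate: independent blend factors for RGB and for alpha.
   Here: ordinary alpha blending for the color channels, additive for alpha. */
glBlendFuncSeparateEXT(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA,   /* RGB factors   */
                       GL_ONE, GL_ONE);                        /* alpha factors */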


An analogy for you: Ability to do blending by reading frame/depth/stencil buffer from fragment program is to the current blending model as fragment programs are to register combiners. The former is completely general whereas the latter is nothing more than flipping a few hard-wired switches.

That's not justification; that's explaining the current situation. Also, it presupposes a certain state of mind: that fixed-function is always bad, and that programmability is always good. This is not the case for all fixed-functionality. Should we start ripping out bilinear/trilinear/anisotropic filtering operations and just let the fragment shader do it? It can, so why waste the hardware? Except for the fact that the texture unit will always be much faster at it than a fragment program.

The justification for fragment programs is pretty obvious; a programmable model is needed in order to support the flexibility of modern and advanced graphics needs. Virtually any advanced graphics application will need fragment programs.

Most of these applications will be just fine with the regular alpha blending ops.

I'm just asking questions that hardware developers ask. No more, no less. It is precisely these questions that lead to hardware vendors telling Microsoft to remove the feature from DX Next.

[This message has been edited by Korval (edited 12-08-2003).]

Humus
12-08-2003, 12:34 AM
Distance through fog with FBColor:




// gl_FBColor is the hypothetical "read the destination pixel" input being discussed;
// it does not exist in any shipping shading language.
float depth = ... ;   // this fragment's depth through the fog volume

if (gl_FBColor.a > 0.0) {
    // second hit: the buffer already holds the depth of the first hit,
    // so write the distance between the two
    gl_FragColor = vec4(abs(depth - gl_FBColor.a));
} else {
    // first hit: store this fragment's depth
    gl_FragColor = vec4(depth);
}


Trying to do the same with standard blending:

* Create 2 additional render targets
* Draw two passes with max and min blending
* Pass both RTs to the fragment program and subtract there.
* Take care of the case when you're inside the fog volume.
* If you want to use your distance through fog for something more complex than what the standard blending offers (likely) you will need another render target.
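A rough C sketch of the two max/min passes above, assuming EXT_blend_minmax; bindRenderTarget() and drawFogVolume() are hypothetical helpers standing in for whatever render-target and geometry code you already have:

#include <GL/gl.h>
#include <GL/glext.h>

extern void bindRenderTarget(int which);   /* hypothetical: selects one of the extra RTs */
extern void drawFogVolume(void);           /* hypothetical: draws the fog volume, writing its depth */

void renderFogDistancePasses(void)
{
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);           /* factors are ignored for MIN/MAX anyway */
    glEnable(GL_CULL_FACE);

    bindRenderTarget(0);                   /* farthest depth: back faces with MAX blending */
    glBlendEquationEXT(GL_MAX_EXT);
    glCullFace(GL_FRONT);
    drawFogVolume();

    bindRenderTarget(1);                   /* nearest depth: front faces with MIN blending */
    glBlendEquationEXT(GL_MIN_EXT);
    glCullFace(GL_BACK);
    drawFogVolume();

    glDisable(GL_BLEND);
    /* Final pass (not shown): bind both targets as textures and subtract in the fragment program. */
}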

[This message has been edited by Humus (edited 12-08-2003).]

Zeno
12-08-2003, 08:34 AM
Originally posted by Humus:
Distance through fog with FBColor:




// gl_FBColor is the hypothetical "read the destination pixel" input being discussed;
// it does not exist in any shipping shading language.
float depth = ... ;   // this fragment's depth through the fog volume

if (gl_FBColor.a > 0.0) {
    // second hit: the buffer already holds the depth of the first hit,
    // so write the distance between the two
    gl_FragColor = vec4(abs(depth - gl_FBColor.a));
} else {
    // first hit: store this fragment's depth
    gl_FragColor = vec4(depth);
}


Humus - doing volume fog is almost like trying to do a per-pixel water shading based on line of sight depth. Korval explained how to do this above:


Which, of course, could be passed in as the "alpha" given the above operations. So, once again, hardly a necessity.

See, instead of reading from the color or depth buffers, you should make a texture map for your volume fog/water geometry, transform this texture to window space, cast a ray from the viewer through each texel and generate an rgb depth/alpha map for the volume (per frame). Then your blending parameters can be passed in as alpha values and you won't have to bother the hardware vendors for new redundant features.

In fact, I have come to a new enlightenment. All that I really need is a CPU and frame buffer http://www.opengl.org/discussion_boards/ubb/smile.gif. I can't believe I've been wasting my money on fancy graphics cards all these years!

Jan
12-08-2003, 09:38 AM
Well, I agree with Zeno.
In my app I came to a point where I needed to blend and then modify the value even more. That's impossible at the moment - at least in one pass.
So I had to do two passes.
So in general it is not necessary, but if it works fast enough, then it will speed up a lot of programs that use fragment programs. Plus, it makes life a lot easier.

So, we don't need it, but it could make some stuff "realtime" which is today simply too slow because of too many required rendering passes. If that's not a good reason, then hardware vendors can drop it.

But I am quite sure that at least one of them will try to make it possible on their hardware - simply because of competition - which would make developers use their hardware in the first place and therefore force other vendors to add the feature too.

Jan.

Korval
12-08-2003, 09:50 AM
Distance through fog with FBColor:

I'm not quite sure what it is that this method is trying to accomplish. The alpha of the destination color seems to be a depth value, but it is also a color (since you're setting the fragment color to it)?


See, instead of reading from the color or depth buffers, you should make a texture map for your volume fog/water geometry, transform this texture to window space, cast a ray from the viewer through each texel and generate an rgb depth/alpha map for the volume (per frame). Then your blending parameters can be passed in as alpha values and you won't have to bother the hardware vendors for new redundant features.

I presume the feature you're interested in is the ability to apply blending based on the eye-radial distance (ie, z-depth) between the object you're rendering and the objects that have been previously rendered? So, what you really want isn't color reads; it's depth reads. So what you should do is just bind the depth buffer as a render source when you do your fog pass. You're not going to be using depth buffer writes when you're doing this fogging, so it makes sense.

See? You can find features and power that you didn't even know you had simply by looking for them. This method doesn't induce much slowdown, and it certainly doesn't require hardware developers to rebuild the lower-end of the rendering pipeline to make it operate in reverse.
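If the depth buffer can't be bound directly like that, a fallback sketch (a different technique than the bind-as-render-source idea above, relying on ARB_depth_texture): copy the depth buffer into a depth texture once per frame and sample that during the fog pass. depthTex is assumed to be an already-created GL_DEPTH_COMPONENT texture of the viewport's size:

#include <GL/gl.h>

void grabDepthBuffer(GLuint depthTex, int width, int height)
{
    /* Copy the current depth buffer into the depth texture so the fog pass
       can read per-pixel depth without touching the framebuffer from the shader. */
    glBindTexture(GL_TEXTURE_2D, depthTex);
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);
}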


In fact, I have come to a new enlightenment. All that I really need is a CPU and frame buffer . I can't believe I've been wasting my money on fancy graphics cards all these years!

If you don't want to justify your requests, then don't. Just don't complain when hardware vendors don't bend over for unjustified requests for features.

deshfrudu
12-09-2003, 10:44 AM
Having access to the current frame buffer pixel from a pixel shader would just be damned handy. I can't count on both hands how many times I've wished I had this feature. The lack of it has cost me extra render targets/passes/complexity many times. If it's really going to cost us more in performance than extra render targets/passes/complexity, then I don't want it, but I doubt it would (maybe I'm wrong). It would be a nice thing to have, and the HW peeps are in the business of making our lives easier, so why not. I mean, with enough passes and off-screen buffers, you can achieve the effects of any modern shaders on 3-year-old hardware, but it's such a major pain in the ass (and slow) that it's just not practical. Giving us access to the current pixel would likewise make a lot of things that are "technically possible" now actually practical. My 2 pesos...

Zeno
12-09-2003, 12:52 PM
deshfrudu - Thanks for clarifying what I was trying to say above. There is no example I can give that can satisfy Korval, because he will always be able to come up with some other multi-pass method of doing the same thing. It's a convenience feature, I admit it.

After all of this arguing, my question is still not answered, so let me rephrase it:

What is it about the design of a modern graphics card that would make reading from the destination buffer at the position of the current fragment impractical? Also, how would access to that pixel be much different from the read that must already take place in fixed-function blending?

zeckensack
12-09-2003, 01:31 PM
Current framebuffer pixels could easily be prefetched early, whenever they're required inside a fragment program. They have to be fetched for fixed function blending, too, after all. You'd just need to do it a tad earlier to hide latencies.

Multiple instances of the same (x/y) fragment shouldn't be in the pipe at the same time anyway. That's an extreme ordering hazard and would probably just blow up in your face (again: same thing with fixed function blending).

I can't find a reason why that should be hard to implement at reasonable performance levels. Fixed function blending costs bandwidth, programmable blending costs bandwidth and ... what?

edit:
Int to float conversion. Is that a good reason?

[This message has been edited by zeckensack (edited 12-09-2003).]

OpenGL guy
12-09-2003, 04:13 PM
Originally posted by zeckensack:
Current framebuffer pixels could easily be prefetched early, whenever they're required inside a fragment program. They have to be fetched for fixed function blending, too, after all. You'd just need to do it a tad earlier to hide latencies.
Doesn't work. Say you are rendering to the same pixel twice on two consecutive triangles (think overlapping, blended particle effects). If your shader execution is multithreaded, then you're in for a heap of trouble if you want to change to a pixel on the second triangle before the first is complete. That's as much as I'll say.

Multiple instances of the same (x/y) fragment shouldn't be in the pipe at the same time anyway. That's an extreme ordering hazard and would probably just blow up in your face (again: same thing with fixed function blending).
No, it's not a problem at all if you don't allow frame buffer access within the shader.

I can't find a reason why that should be hard to implement at reasonable performance levels. Fixed function blending costs bandwidth, programmable blending costs bandwidth and ... what?
It's not just programmable blending. You're moving a whole chunk of the pipeline (i.e. the blending unit) into the shading unit. Blending can normally be done independently of shading. That won't be the case if you allow framebuffer access in the shader.

Korval
12-09-2003, 04:48 PM
There is no example I can give that can satisfy Korval, because he will always be able to come up with some other multi-pass method of doing the same thing.

That's not true.

If you were to argue for the inclusion of arbitrary texture access in a vertex program, as opposed to GeForce 1 technology (or even NV20-level stuff), you could make a compelling case for it. You could say that:

1: Without hardware support, implementing this is, at best, prohibitively expensive.

2: With hardware support, a vast number of possibilities present themselves. From good shadow mapping to EMBM to a wide variety of other, genuinely useful, visual effects.

The reasoning for the feature is both clear and convincing. Each effect might be doable in another way, but the sheer quantity of effects that this allows, coupled with the painful nature of the alternatives, makes this feature almost self-justifying.

Can you say the same for the arguments you have postulated here? So far, we have some fog (whether it is water or atmosphere, it is the same effect), and some nebulous "I can't count on both hands how many times I've wished I had this feature" kinds of things, which can't really be evaluated on a case-by-case basis.

Would I mind it if my R400 or R500 had this feature? Probably not, unless that made it slower overall than its nVidia counterpart. Would I care if it never saw the light of day? Probably not, assuming that programmable blending became a reality at some point (register combiner-level functionality would be sufficient). In the grand scheme of things, it just isn't that important.


What is it about the design of a modern graphics card that would make reading from the destination buffer at the position of the current fragment impractical?

It isn't the design of modern graphics cards that is a concern. It is the design of future graphics cards that would be limited by this decision. Effectively, it means that it is impossible to allow hardware to have multiple fragment programs "in flight" over the same pixel/sample, even though, from a performance standpoint, this might be a worthwhile idea. They could have the "blending" unit sort out which fragment gets written and blended as an asynchronous process to running a fragment program.

zeckensack
12-09-2003, 05:50 PM
Originally posted by OpenGL guy:
Doesn't work. Say you are rendering to the same pixel twice on two consecutive triangles (think overlapping, blended particle effects). If your shader execution is multithreaded, then you're in for a heap of trouble if you want to change to a pixel on the second triangle before the first is complete. That's as much as I'll say.

This is what happens with fixed function blending, too.
If you're of course talking about something akin to OOOE CPU designs where you "retire" in order at the end of the pipeline only, all I'll say for now is *cough* ... and I wasn't aware of that.

No, it's not a problem at all if you don't allow frame buffer access within the shader.

Ditto, sort of.

It's not just programmable blending. You're moving a whole chunk of the pipeline (i.e. the blending unit) into the shading unit. Blending can normally be done independently of shading. That won't be the case if you allow framebuffer access in the shader.

I see. It doesn't quite sound like what I had in mind, which was:
Move the color buffer read to an earlier stage, but not "the blending unit" as a whole.
Whenever a 2x2 pixel quad, or whatever you happen to use enters a fragment processor *twinkle*, and the current fragment program wants read access to target.color (or so), fetch that block from the target and pass it down the fragment processor along with the interpolator outputs. And spec it as read only.

Assuming the quads are generated roughly in order, and multiple quads generated at the same time never overlap (?), this might just work. You then don't even need to "retire" in order, because you took snapshots of the target contents at the right time. It doesn't look like it could come for free, of course. The stuff obviously needs to be stored somewhere.

I am by no means a hardware designer. I'm just thinking out loud.

Korval,



!!whatever   # hypothetical profile: "target.color" (a framebuffer read) exists in no real fragment program spec

# standard luminance weights
PARAM rgb_to_luminance = { 0.3, 0.59, 0.11, 0.0 };
# replace the framebuffer color with its own luminance...
DP3 result.color.rgb, target.color, rgb_to_luminance;
# ...and pass the incoming alpha through for fixed-function blending
MOV result.color.a, fragment.color.a;

Combine that with fixed function blending to yield the dreadful discoloration cloud, a weapon so evil, it must be wielded by a madman only Batman can hope to stop http://www.opengl.org/discussion_boards/ubb/biggrin.gif

Korval
12-09-2003, 07:59 PM
Move the color buffer read to an earlier stage, but not "the blending unit" as a whole.

Fundamentally, that's the same thing. If blending is off, you can still do the blending operation in the shader. And, since you're taking it as an input, it must be assumed that the output will vary depending on this value. As such, it's really no different than blending.


Whenever a 2x2 pixel quad, or whatever you happen to use enters a fragment processor *twinkle*, and the current fragment program wants read access to target.color (or so), fetch that block from the target and pass it down the fragment processor along with the interpolator outputs.

Not good enough. A currently-in-execution "quad" could be about to write to this value. You don't want it read until that quad has written to it. Which means that a synchronization event must occur in the middle of the pipeline.


Combine that with fixed function blending to yield the dreadful discoloration cloud, a weapon so evil, it must be wielded by a madman only Batman can hope to stop

Huh? I'm not sure what this is even in reference to.

davepermen
12-09-2003, 09:05 PM
this synchronisation issue should only cause a slowdown if you have to multipass individual triangles that are at the size of about .. 8 pixels. else, it can schedule those pixels and continue with other, independent ones.

oh, and, that scheduling can happen automatically by simply drawing the stuff in order.

i know there _are_ issues. but they are NOT a problem in any normal situation. as long as you draw just one triangle, all your pixels are actually processed independent from the others. only from one to the next triangle, there can be overlap. this, combined with backface culling, is a not-often happening event.


and there will be no need for the blending unit at all if we can access the "dst_color". a MUL MAD can do anything then.

this colourkilling example is just a funny idea to show what tons of features you could do with it..

btw, there are tons of algos that are not possible with the flipflop multipass method.. except with individually scheduling and flipflopping for each triangle. not practical. you look rather restricted, korval (not to say braindump.. you're not..), if you don't see the uses of this.

Humus
12-10-2003, 03:33 AM
I'm willing to take a performance hit for it if that's necessary. It's like with depth-writes from a shader, which cause a large performance reduction and aren't nearly as useful.

Won
12-10-2003, 05:20 AM
First off, I'm glad that people here are mostly taking the approach "DirectNext hints at future hardware capabilities" rather than the tiresome "Microsoft sucks" rants. I, too, am curious about the whole SuperBuffer thing. It has been suggested here that they are probably adapting the SuperBuffer concept to a more VertexBufferObject style. They would need some more BufferObject targets (PIXEL_PACK and PIXEL_UNPACK have already been hinted at in the PBO section of NVIDIA's VBO whitepaper), and some uniform way of swapping and copying buffers. That might be nice: you'd get the ability to double/triple buffer not only your frame buffer, but your vertex buffers, index buffers, textures, etc.

But while we're all playing "OpenGL Hardware Designer" here:

Yes, it would be possible to schedule around potential fragment->pixel hazards, but do we really expect that to be in the next-generation implementations? Personally, I don't (maybe next-next...). I want my GPU to be a lean, mean stream processing machine. I want the transistors to be there to do computation, not scheduling, so such hazards should be prevented by the API (now, and in the near future). Personally, I'd rather see floating-point blends and programmable texture fetching/filtering. Both are useful and neither would break parallelism.

When it comes down to it, DirectNext has some great things, like the unified shader model (with integer instructions), programmable tessellation and the topology processor. Covers about 95% of the things on my wish list.

-Won

zeckensack
12-10-2003, 06:28 AM
Originally posted by Korval:
Fundamentally, that's the same thing. If blending is off, you can still do the blending operation in the shader.

No, it isn't because no, you can't. That's what we're talking about.

And, since you're taking it as an input, it must be assumed that the output will vary depending on this value. As such, it's really no different than blending.

It's different from fixed function blending nonetheless (ie it's more flexible, in fact an entirely different beast, as seen by the little example I gave).


Not good enough. A currently-in-execution "quad" could be about to write to this value. You don't want it read until that quad has written to it. Which means that a synchronization event must occur in the middle of the pipeline.

Then so be it. Let the quads going down be scheduled so that there are no ordering hazards. I've outlined how I'd imagine that could be done:
1) take a snapshot of the target contents at the time the quad starts into fragment processing
2) block whenever two (or more) overlapping quads would be in flight at the same time

Read access to fragment.position.z isn't free either. I never complained about that, it's just to be expected.
(and as an aside, read access to target contents is a lot more interesting than fragment.position.z IMO)

I really wonder how often issue #2 crops up in reality. How bad is it, really? I honestly don't know but I'd like to.

Huh? I'm not sure what this is even in reference to.

I've made it up. In addition to the shader, you'd set glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA); and draw arbitrary geometry over the finished scene (say, a particle system). Only input alpha matters.
Where alpha==1.0, you'll turn the framebuffer to pure intensity.
Where alpha==0, the color buffer is unchanged. For everything else, you get a linear blend between full color and intensity.

Stupid, fancy, unheard-of special effects, so to speak http://www.opengl.org/discussion_boards/ubb/smile.gif

(and you simply can't do it with fixed function blending alone, unless, of course, you copy generous portions of your render target to a texture)

Korval
12-10-2003, 10:11 AM
I'm willing to take a performance hit for it if that's necessary. It's like with depth-writes from a shader, which cause a large performance reduction and aren't nearly as useful.

My concern is not that there will be a performance drop from using it. My concern is that the feature would require a restructuring of the entire back-end of the renderer, and that such restructuring would either prevent the use of performance-enhancing features (like having multiple quads in the pipe) or dramatically complicate the back-end logic, thus increasing the cost of the chip or costing us other, potentially useful, features.


do we really expect that to be in the next-generation implementations?

DX Next will come out with Longhorn in 2006. As such, the API is going to be something of an indicator of the expected functionality of cards of that era. Not of the cards of next year.


Personally, I'd rather see floating-point blends and programmable texture fetching/filtering. Both are useful and neither would break parallelism.

Programmable texture fetching sounds like it'd be really slow, but floating-point blending is clearly something that would be of great value in the (near) future.


No, it isn't because no, you can't. That's what we're talking about.

The point I was making is that if you "Move the color buffer read to an earlier stage, but not "the blending unit" as a whole.", then it is the same as what we are discussing. It isn't an alternative to moving the blending into the fragment shader; it's the exact same thing, because if you could read the framebuffer from the shader, you'd never use fixed-function blending again.


Read access to fragment.position.z isn't free either.

Read access is free (or, at least, pretty cheap). Write access isn't, since it screws up all the fast z-culling hardware.


and as an aside, read access to target contents is a lot more interesting than fragment.position.z

True. But read access to the framebuffer is much more difficult than simply giving the fragment program the computed z-depth.


I really wonder how often issue #2 crops up in reality. How bad is it, really? I honestly don't know but I'd like to.

Well, it never happens on an ATi chip because the hardware isn't designed to have multiple "quads" in-flight simultaneously. Apparently, this is not true for FX chips. I don't imagine that it would come up too much, as you would have to have a pretty deep pipeline for it to happen, but the hardware designers would have to devote resources to preventing the problem in any case.


and you simply can't do it with fixed function blending alone, unless, of course, you copy generous portions of your render target to a texture

Well, technically, you don't have to "copy" it. With ATI_draw_buffers, you can write the color to the frame buffer and write the luminance to an AUX buffer. From there, assuming ARB_superbuffers, you just bind that buffer as a texture, and you can do regular blending as a post-process.
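A sketch of the ATI_draw_buffers half of that in C (GL_AUX0 as the second target is just an example; the later bind-the-aux-buffer-as-a-texture step depends on the then-unfinished superbuffers API and isn't shown):

#include <GL/gl.h>
#include <GL/glext.h>

/* Route two fragment program outputs: result.color[0] to the back buffer,
   result.color[1] (the luminance) to AUX0. Requires ATI_draw_buffers and a
   fragment program that uses the ATI_draw_buffers option. */
void setupColorPlusLuminanceTargets(void)
{
    static const GLenum bufs[2] = { GL_BACK_LEFT, GL_AUX0 };
    glDrawBuffersATI(2, bufs);
}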

Mazy
12-10-2003, 10:30 AM
I'm not that sure that it's that costly.. former ATI drivers showed signs of gl_FBColor in their beta of GL2 shaders (before the final spec was approved), and I know that nVidia allows binding a pbuffer as a texture at the same time you render to it.. this seems to be pretty much the same requirements. Especially since superbuffers seem to allow pretty much anything as a render target, the 'real' framebuffer and other various render targets (pbuffers for now) seem to be handled much alike.

ZbuffeR
12-10-2003, 10:51 AM
Originally posted by Mazy:
and I know that nVidia allows binding a pbuffer as a texture at the same time you render to it..
If I may add my grain of salt: I tried that, and it does not work. It is possible, though, to bind the pbuffer as a texture, render with it in a different context, and omit wglReleaseTexImageARB before rendering again to the pbuffer. That is not the same; here there is a context switch, so the card can release the texture binding on its own.
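For context, the usual WGL_ARB_render_texture sequence looks roughly like this (Windows-specific C; hPbuffer and the function pointers are assumed to have been set up already via wglGetProcAddress). The release call at the end is the step being discussed:

#include <windows.h>
#include <GL/gl.h>
#include "wglext.h"

extern PFNWGLBINDTEXIMAGEARBPROC    wglBindTexImageARB;     /* obtained via wglGetProcAddress */
extern PFNWGLRELEASETEXIMAGEARBPROC wglReleaseTexImageARB;
extern HPBUFFERARB hPbuffer;                                 /* placeholder pbuffer handle */

void drawWithPbufferAsTexture(void)
{
    /* 1. Render into the pbuffer on its own context (not shown). */

    /* 2. Bind the pbuffer's color buffer to the currently bound texture object. */
    wglBindTexImageARB(hPbuffer, WGL_FRONT_LEFT_ARB);

    /* 3. Draw with that texture in the window's context (not shown). */

    /* 4. Release it again before the next render-to-pbuffer pass. */
    wglReleaseTexImageARB(hPbuffer, WGL_FRONT_LEFT_ARB);
}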

zeckensack
12-10-2003, 11:12 AM
Originally posted by Korval:
The point I was making is that if you "Move the color buffer read to an earlier stage, but not "the blending unit" as a whole.", then it is the same as what we are discussing. It isn't an alternative to moving the blending into the fragment shader; it's the exact same thing, because if you could read the framebuffer from the shader, you'd never use fixed-function blending again.

I wouldn't say so, not with fixed function blending implemented in fast integer hardware. That would be a LRP and a MAD for fully general 'emulation', and I'd rather do that with only the required precision, which is not necessarily floating point.

And indeed, my made-up example didn't go there.


Originally posted by Korval:
Read access is free (or, at least, pretty cheap). Write access isn't, since it screws up all the fast z-culling hardware.

In theory, yes. Last time I checked (which was with Cat 3.4 IIRC), reading from fragment.z alone caused a heavy performance drop in an otherwise simple shader.

I'll repeat the test once I'm finished poking at my brand new 9200 http://www.opengl.org/discussion_boards/ubb/smile.gif
(just to make sure I'm not talking nonsense)

Originally posted by Korval:
Well, technically, you don't have to "copy" it. With ATI_draw_buffers, you can write the color to the frame buffer and write the luminance to an AUX buffer. From there, assuming ARB_superbuffers, you just bind that buffer as a texture, and you can do regular blending as a post-process.

MRTs?
Well, yes, not technically a copy, but at even higher bandwidth cost.
Yours:
a)write color buffer, write luminance buffer (whole viewport)
b)read color buffer, read luminance buffer, blend (region of effect)
c)write color buffer (region of effect)
_____
two reads, three writes

Mine:
a)write color buffer (whole viewport)
b)read color buffer (forward this to fixed function blending, if possible), compute luminance (region of effect)
c)blend, write color buffer (region of effect)
______
one read, two writes

If the region covered by the effect is small(er than the viewport), it gets a lot worse quickly, because I have to pay the cost of writing the whole viewport to the luminance target.

Mazy
12-10-2003, 11:50 AM
ZbuffeR: you may be right, I haven't tested that myself, but the technique is described in http://developer.nvidia.com/docs/IO/8230/GDC2003_SummedAreaTables.pdf with the warning "results may be undefined", just as the spec says about this, but they have shown a demo of it, so at some point it had to work on their cards.

Won
12-10-2003, 11:56 AM
Korval -- you stuck a quote from me in your reply to Zeck...

Programmable texture fetching/filtering doesn't need to be slow. You'll have to deal with some extra latency in the case of particularly complex fetches, but the implementation would really only need to make sure that the standard modes, when implemented in programmable form, are fine. Probably harder than I make it sound, but there are no obvious reasons why it might be slow.

Assuming you still have access coherency, you then only need to deal with the occasional stall, but then you just need to make your texture cache line big enough so that you can mask that latency by having multiple fragment execution threads.

Aside from being able to define your own texture filter kernels, you can define your own texture formats, wrap modes etc. And maybe there are funky things you can do when you use it to address geometry image textures or something.

-Won

zeckensack
12-10-2003, 12:13 PM
Sorry, can't edit the post without it blowing up ...

Originally posted by zeckensack:
If the region covered by the effect is small(er than the viewport), it gets a lot worse quickly, because I have to pay the cost of writing the whole viewport to the luminance target.
What I should add is that in pure theory, "my" approach uses a lot less bandwidth even with the whole viewport covered. If only a small region is covered, the difference gets even bigger.

It is to be expected that allowing this sort of access at all comes at a cost (stalls to avoid race conditions; more on-chip storage is used). I'm pretty confident that the bandwidth savings can outweigh the initial performance costs. Now I'd need to figure out the cost in transistors, which I quite frankly just can't.

Won
12-10-2003, 01:26 PM
There are probably many other issues (besides raw transistor count/area), like design validation times etc. I think it's probably a good guess to say that for the next few years, GPUs are going to approach multi-threaded in-order stream processors because it is the most easily scalable approach. Then again we're all basically talking out of our ass unless you've designed a GPU before.

-Won

titan
12-11-2003, 06:49 AM
My favourite is Mesh Instances on the general IO page. It looks like display lists with variables.

V-man
12-11-2003, 11:41 AM
About the ability to access the FB within a fp...

Personally, I was expecting this feature to be thrown in soon. I don't see what the big deal is. In fixed pipe mode, if you enable blending, then obviously the blending unit has to access the FB to do the blending and eventually will have to write back values.

So there are two ways to offer the solution: a programmable blending unit, where developers have to write a separate program for it, or just extending the current fp language.

Geez! The name of the game is to offer programmability here. Why argue against it?

If you are talking about random FB readbacks, then it can get complicated, and you can get undefined results as fragments may have interdependencies.

dreld
12-11-2003, 11:43 PM
Korval, reading your statement about NV40/R420 having higher blending precision....do you have a reference for this proposition? I enquired several times at ATI about this without success.

BTW, processing the same fragment for different triangles cannot in general be performed concurrently, since order matters in the fixed function pipeline as well (unless you use min/max blending). So I don't see how providing the fb color in the fragment shader adds new dependencies here (you have to take care of what you render first anyway, and further parallelisation could be achieved by shading fragments of objects which do not overlap).

Cheers!