NV30 Extensions



Nakoruru
08-29-2002, 07:51 AM
nVidia has just released their Detonator 40 drivers with emulation for NV30 along with OpenGL extension specifications.
http://developer.nvidia.com/view.asp?IO=nv30_emulation

Korval
08-29-2002, 09:29 AM
After a cursory look at their slides, I'd have to say that they're doing some interesting per-fragment things. However, one thing caught my eye: 16 texture maps, but only 8 texture coordinates? What kind of crap is that? Sure, if you really need more than 8 texture lookups, you're probably going to be running some dependent accesses anyway, but why not err on the side of caution? The 9700 allows for 16 independent texture coordinates.

Somehow, I still prefer the 9700, even though the NV30 allows for new features. They both have full 32-bit float registers, but the 9700 doesn't have a 16-bit option. This is a benefit, as it keeps the 9700's language cleaner. The derivative instructions sound like a good idea, but I'd prefer to do that on the CPU (as a pre-processing step, of course), especially considering that the height map may not be at its original full resolution. Not only that, those derivatives are probably a bit expensive, as they require 2 texture fetches.

knackered
08-29-2002, 09:35 AM
Wow, the fragment program extension looks too good to be true....

Nakoruru
08-29-2002, 09:49 AM
Being able to output 8 texture coordinates from a vertex program seems like plenty to me.

Why diss the derivative functions? You cannot do per-pixel derivative calculations on the CPU (quickly); at least, not for procedural textures. The derivatives are for any arbitrary expression you calculate per pixel. I think you misunderstand how it works if you think you can emulate that.

NitroGL
08-29-2002, 10:14 AM
Originally posted by Korval:
After a cursory look at their slides, I'd have to say that they're doing some interesting per-fragment things. However, one thing caught my eye: 16 texture maps, but only 8 texture coordinates? What kind of crap is that? Sure, if you really need more than 8 texture lookups, you're probably going to be running some dependent accesses anyway, but why not err on the side of caution? The 9700 allows for 16 independent texture coordinates.

Somehow, I still prefer the 9700, even though the NV30 allows for new features. They both have full 32-bit float registers, but the 9700 doesn't have a 16-bit option. This is a benefit, as it keeps the 9700's language cleaner. The derivative instructions sound like a good idea, but I'd prefer to do that on the CPU (as a pre-processing step, of course), especially considering that the height map may not be at its original full resolution. Not only that, those derivatives are probably a bit expensive, as they require 2 texture fetches.

Actually, it's the same on the 9700.

Lars
08-29-2002, 10:27 AM
Ahhhh

I was just happy that there is an ARB_vertex_program that everyone would implement, and now they've gone and done their own NV_vertex_program2.
But aside from that, the new features look damn good.
One thing: all along it was mentioned that the NV30 can do >65,000 instructions in vertex programs, and everybody thought, cool, more VP code than C++ code.
But now it turns out there is a limit of 256 static instructions, and only with loops and branches is it possible to reach that large number of executed instructions.
Also, it was never mentioned how long an instruction takes, or how long it takes to execute 65,000 instructions.
Would be interesting to know.

Lars

Humus
08-29-2002, 11:07 AM
I think the floating point implementation is kinda half-hearted. No filtering, only texture_rectangle (no 1D, 2D, 3D, or cube map), no mipmapping, no blending, not supported for texenvs.
Rendering to a high precision cubemap is certainly one of the features I want most in the new generation of hardware, so this is kind of a disappointment :p
Does anyone know if any of these restrictions hold for the R9700 too?

Nakoruru
08-29-2002, 11:27 AM
Humus, here I completely agree with you. I really wanted floating point cube maps too. It really feels like a first-generation effort. I guess we can't expect floating point to be feature-complete on the first go (well, we CAN expect it, and most of us were, but that doesn't mean we will get it ^_^).

mproso
08-29-2002, 11:35 AM
Well, on my old GeForce2 MX there are a couple of new extensions:
GL_ARB_vertex_program,
GL_ARB_window_pos,
GL_NV_point_sprite,
GL_NV_pixel_data_range.
What does this last extension do?
It is not in the specs.

Zeno
08-29-2002, 11:42 AM
Also, it was never mentioned how long an instruction takes, or how long it takes to execute 65,000 instructions.

NVIDIA has always stated that each VP instruction takes exactly one clock cycle. If the GPU were 400 MHz, then 65,000 instructions would take 0.0001625 seconds per vertex. This translates to a little over 6000 triangles per second if you use strips and are not limited by anything else.
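
(The back-of-envelope arithmetic, written out, under those same assumptions -- one instruction per clock, a single VP unit:)

#include <stdio.h>

int main(void)
{
    /* Back-of-envelope only: assumes a 400 MHz clock and one vertex-program
       instruction retired per clock, as discussed above. */
    double clock_hz     = 400e6;
    double instructions = 65000.0;

    double sec_per_vertex = instructions / clock_hz;   /* 1.625e-4 s */
    double verts_per_sec  = 1.0 / sec_per_vertex;      /* ~6154      */

    /* With long strips, roughly one new vertex per triangle. */
    printf("%g s/vertex, ~%.0f tris/s\n", sec_per_vertex, verts_per_sec);
    return 0;
}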

I just realized that my calculation assumes only one VP unit and that vertices cannot be streamed through.

-- Zeno

[This message has been edited by Zeno (edited 08-29-2002).]

pkaler
08-29-2002, 11:42 AM
I guess we Linux guys are at the mercy of the driver writers.

Korval
08-29-2002, 02:13 PM
You cannot do per pixel derivative calculations on the cpu (quickly). At least, not for procedural textures.

If, by "procedural texture", you mean that it is something generated on the CPU and uploaded to the card each frame, I don't see the point of not doing the derivative on the CPU.

If, by "procedural texture", you mean a set of fragment program state that generates a color based on a "texture coordinate input" and various parameters, then yes, it has some use. I still submit that this instruction can't be particularly fast, considering that it is using information in fragments that may not be currently rasterizing yet.


Actually, it's the same on the 9700.

What is the same on the 9700?

mcraighead
08-29-2002, 02:22 PM
NV_pixel_data_range is somewhat analogous to NV_vertex_array_range. It lets you do faster streaming of textures and asynchronous ReadPixels.

- Matt

mcraighead
08-29-2002, 02:34 PM
DDX/DDY are essential to implementing "analytic antialiasing" of shaders. This is a standard idiom in the Renderman world. (The derivative instructions aren't really that expensive.)

Some of the standard OpenGL pipeline features make very little sense to implement for floating point. Blending is a key example. All the blending operations are predicated on "1-x" being really cheap to implement -- it's just an XOR for fixed-point. But in floating-point, 1-x requires essentially a full FP math unit, with a variable-width shifter and all! It's unreasonable to expect that some of these pipeline stages will _ever_ be implemented in their classic OpenGL form for floating-point framebuffers.
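
(To see why the fixed-point case is so cheap -- a tiny C check of the claim above, assuming 8-bit components where 1.0 maps to 255:)

#include <assert.h>

int main(void)
{
    /* For 8-bit fixed point where 1.0 maps to 255, the blend term (1 - x)
       is 255 - x, which is just a bitwise complement (XOR with all ones) --
       no adder needed, unlike the floating-point case. */
    for (int x = 0; x <= 255; ++x)
        assert((255 - x) == (x ^ 0xFF));
    return 0;
}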

The same can be said for float texture filtering. It's unclear that it will ever be a high-performance feature. Too many adds and multiplies, even for the simplest filter modes.

Remember that "8 texture coordinates" really means 8 _texture coordinate outputs_ from the vertex program, i.e., 8 general-purpose interpolants. You have 16 vertex attributes going in, which is more than enough for almost anything. And you can always compute texture coordinates analytically in any way you want inside your fragment program. The key word is "interpolants" here. You get 2 color interpolants (fixed-point [0,1] range), and you get 8 generic ("texture coordinate") interpolants.

Also, it's not true that each instruction in a vertex program takes exactly one clock cycle. This is a rough estimate for the NV20 architecture, but you can do better or worse, depending on your exact program.

- Matt

MZ
08-29-2002, 05:47 PM
OK, but support for 1D/2D/3D/cube texturing with the simplest mipmapping (just GL_NEAREST_MIPMAP_NEAREST) wouldn't hurt.
All the necessary math is already there for standard fixed-point texels, and computing the memory location of a fetched texel is the same for any texel format. I don't understand this limitation.

Correct me if I'm wrong, but it seems the fragment shader is powerful enough to emulate any texture dimensionality (3D, 4D, 5D, ...?), maybe even mipmapping, at the cost of extra instructions and some tricky packing of texels into a texture_rectangle. So a cube depth map is not such a hopeless task?

davepermen
08-29-2002, 09:30 PM
Originally posted by mcraighead:
Some of the standard OpenGL pipeline features make very little sense to implement for floating point. Blending is a key example. All the blending operations are predicated on "1-x" being really cheap to implement -- it's just an XOR for fixed-point. But in floating-point, 1-x requires essentially a full FP math unit, with a variable-width shifter and all! It's unreasonable to expect that some of these pipeline stages will _ever_ be implemented in their classic OpenGL form for floating-point framebuffers.

So you're just lazy... There's no point in treating floating-point buffers as something special, and it will by no means stay this way. At least supporting a fixed-point blending factor (it has to stay in the 0 to 1 range anyway for the standard modes to work with floats) would not be _that_ hard... but okay, it's work, and you have enough problems getting this out by Christmas. Good luck, btw...


The same can be said for float texture filtering. It's unclear that it will ever be a high-performance feature. Too many adds and multiplies, even for the simplest filter modes.

There, for sure, the weights are just in the 0 to 1 range. It's an essential feature, and the result will be that everyone just does it by hand. It doesn't matter that much if it's slower, but it should be supported because it makes life _easier_. It will surely be one of the first functions I have to code manually, just because it's not standard. At least bilinear would not have been that terrible...



Remember that "8 texture coordinates" really means 8 _texture coordinate outputs_ from the vertex program, i.e., 8 general-purpose interpolants. You have 16 vertex attributes going in, which is more than enough for almost anything. And you can always compute texture coordinates analytically in any way you want inside your fragment program. The key word is "interpolants" here. You get 2 color interpolants (fixed-point [0,1] range), and you get 8 generic ("texture coordinate") interpolants.
And you get only 4 textures if you use the standard pipeline (just noting that; it's neither good nor bad :D). Btw, what I thought was really funny is that, in the end, you still get the standard register combiners. When I read that, I ROFLed... not that it's bad, it's just ridiculous somehow :D (and I thought we'd finally get rid of those... :D)



Also, it's not true that each instruction in a vertex program takes exactly one clock cycle. This is a rough estimate for the NV20 architecture, but you can do better or worse, depending on your exact program.

How can _we_ do better or worse by using the instructions? :D But that they are not all the same speed is quite logical, somehow...

Btw, you still don't provide per-component negation, do you? That would be especially useful for quaternion multiplications...

Oh, and is executing an instruction with one of those condition-code (LE.x) attachments slower than without?

davepermen
08-29-2002, 09:34 PM
Originally posted by Nakoruru:
Humus, here I completely agree with you. I really wanted floating point cube maps too. It really feels like a first-generation effort. I guess we can't expect floating point to be feature-complete on the first go (well, we CAN expect it, and most of us were, but that doesn't mean we will get it ^_^).

I remember the time of the GF1 and GF2... I expected full per-pixel lighting (as the marketing suggested)...
Again, we CAN expect it, but we won't get it...

I think this is really stupid... I wanted to base my code fully on floating point: setting up a 128-bit floating-point buffer, using all floating-point textures, and so on. I don't care if it's at 1/4 the speed of the 32-bit versions... it will be much faster than my GF2 MX anyway. And the image quality is awesome (doing a lot of software rendering, I know what real floating-point math looks like... wow).

But no, once again a generation of GPUs filled with hacks... I want my GL2... :(

Nakoruru
08-29-2002, 10:26 PM
Korval,

You still cannot calculate the derivative on the CPU because the fragment program derivative is relative to window space x and y. If you calculate a derivative for a texture it will be relative to texture space s and t.

Nakoruru
08-29-2002, 10:34 PM
Hmmm, emulating floating point cube maps with a texture rectangle... That fits right into my 'no more new fixed functionality' philosophy.

From now on I really would prefer new program instructions over new fixed functionality, as long as it makes sense to do so. However, a few standard filters (bilinear!) implemented for floating point would not be too bad. At least in a high level shading language it would be implemented as a function call, and it doesn't really matter if the hardware truly supports it.

Is it just me, or should I call this nVidiaGL? With all these extensions it seems I could write a program that uses almost no standard OpenGL, only nVidia's extensions. The only standard thing left seems to be texture objects!

gking
08-29-2002, 10:48 PM
it doesn't matter that much if it's slower, but it should be supported because it makes life _easier_.

If people were willing to buy a $1000 GPU, putting enough FP MACs and dividers on the chip to perform floating point texture filtering, blending, and mipmapping wouldn't be an issue.

In the meantime, here's a Cg routine that will bilinearly filter a float texture for you (I haven't tested this, but it should work):

float4 FilterFloatBuffer(samplerRECT tex, float2 uv) {
    // xy holds the fractional weights, zw the integer texel base.
    float4 deltabase;
    deltabase.xy = frac(uv);
    deltabase.zw = floor(uv);
    // Blend the bottom row of the 2x2 neighborhood along x...
    float4 smp = f4texRECT(tex, deltabase.zw);
    float4 accum = (1.0.xxxx - deltabase.xxxx) * smp;
    smp = f4texRECT(tex, deltabase.zw + float2(1, 0));
    accum = accum + smp * deltabase.xxxx;
    // ...weight it by (1 - dy)...
    accum = accum * (1.0.xxxx - deltabase.yyyy);
    // ...then blend the top row along x and add it in, weighted by dy.
    smp = f4texRECT(tex, deltabase.zw + float2(0, 1));
    float4 tmp = smp * (1.0.xxxx - deltabase.xxxx);
    smp = f4texRECT(tex, deltabase.zw + float2(1, 1));
    tmp = tmp + smp * deltabase.xxxx;
    accum = accum + tmp * deltabase.yyyy;
    return accum;
}


i want my gl2

The first gl2 part doesn't support any floating point arithmetic at all. Floating point support in GPUs is definitely a case where Moore's law is a limitation.

[This message has been edited by gking (edited 08-30-2002).]

gking
08-29-2002, 10:59 PM
As a corollary to my above post --

If you have a list of float textures that represent mip maps of a base float texture, you can do trilinear filtering in the fragment shader.

It'll be pretty nasty (and I'll leave coding it as an exercise for the reader), but everything you need to select MIP level and blend is available (i.e., screen-space partial derivatives).

For floating point cube maps, all you need to do is save the cube map as a rectangular cross texture and then convert [x y z] into the correct coordinates on the cross.
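
(Very roughly, that lookup math looks like the following -- a CPU-side C sketch using a hypothetical cross layout; the per-face sign conventions here are schematic, not necessarily GL's:)

#include <math.h>

/* Map a direction (x,y,z) onto (u,v) texel coordinates in an unfolded
   cube-map "cross" of face size N.  The layout assumed here (middle row
   -X,+Z,+X,-Z with +Y above and -Y below the +Z face) and the per-face
   sign conventions are schematic, not GL's exact ones. */
static void cube_to_cross(float x, float y, float z, int N, float *u, float *v)
{
    float ax = fabsf(x), ay = fabsf(y), az = fabsf(z);
    float ma, s, t;   /* major-axis magnitude and face-local coordinates */
    int   fu, fv;     /* which face of the cross (in face units)         */

    if (ax >= ay && ax >= az) {            /* +X / -X */
        ma = ax;  s = (x > 0.0f) ? -z : z;  t = -y;
        fu = (x > 0.0f) ? 2 : 0;  fv = 1;
    } else if (ay >= az) {                 /* +Y / -Y */
        ma = ay;  s = x;  t = (y > 0.0f) ? z : -z;
        fu = 1;  fv = (y > 0.0f) ? 0 : 2;
    } else {                               /* +Z / -Z */
        ma = az;  s = (z > 0.0f) ? x : -x;  t = -y;
        fu = (z > 0.0f) ? 1 : 3;  fv = 1;
    }
    /* s,t in [-ma,ma] -> [0,1] within the face, then offset into the cross. */
    *u = ((float)fu + 0.5f * (s / ma + 1.0f)) * (float)N;
    *v = ((float)fv + 0.5f * (t / ma + 1.0f)) * (float)N;
}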

mcraighead
08-29-2002, 11:27 PM
davepermen,

gking is on the right track here. We don't have 10 billion transistors to throw at this problem. This has absolutely nothing to do with being "lazy" or anything of the sort. In fact, if we had an unlimited transistor budget, our lives would probably be a lot easier.

Floating-point multipliers are big. Floating-point adders are BIG. And we're talking about full S1E8M23 precision here.

When you think about float buffer mode, a good analogy is to imagine it as a step as big as the step from color index mode to RGBA mode. Some of the stuff that the previous mode did doesn't make sense in the new mode.

It is entirely plausible that it might *never* make sense to support old-style blending in combination with float buffers. And it is virtually guaranteed that filtering of float textures, even if eventually supported, will lead to large slowdowns.


Condition code modifiers do not make instructions any slower. This can lead to nice speedups over the older "SGE/MUL/MAD" approach. (It also gets rid of the whole "0 * NaN = 0, 0 * Inf = 0" annoyance.)

- Matt

davepermen
08-30-2002, 12:00 AM
Originally posted by mcraighead:
davepermen,

gking is on the right track here. We don't have 10 billion transistors to throw at this problem. This has absolutely nothing to do with being "lazy" or anything of the sort. In fact, if we had an unlimited transistor budget, our lives would probably be a lot easier.
I do understand the _HARDWARE_ limits. But you're saying it would be useless anyway, and that statement is just plain stupid.

Floating-point multipliers are big. Floating-point adders are BIG. And we're talking about full S1E8M23 precision here.
Well, it depends. Sure, they are big, but for bilinear filtering, for example, you don't need the full precision, and you only need to filter in the 0..1 range, so the multiplication is very different. Take half floats, for example: you could convert them to... 64-bit integers? Without any precision loss, I think (this is from memory, I don't have a calculator here to check whether it would be enough). So sample the 4 values, convert to 64-bit integers, do the same bilinear you've done for years, which you know is fast, and convert them back.
I _KNOW_ it's not as easy as fixed point, and I _KNOW_ it would be slower than point sampling. But do you think using the Cg function from above will be faster?! Bilinear filtering is a common task...

About the cubemaps: so... code them yourself. Why the heck did you implement cubemaps in the first place, then? Take them out again... no need for them. We'll render onto a width*(6*height) texture, 6 times, with glScissor, then bind it and sample manually. Cubemaps are useless; you can drop them as well...

Same for 3D textures and 1D textures. Why the hell do we still have 2D textures? We could do it all with 1D textures...

Really... there is no point in not supporting the stuff you have supported for a long time. It's very handy to have it done automatically, and now we have to do it by hand again. There you _are_ lazy... it doesn't mean more transistors, it just means less driver work for you...


When you think about float buffer mode, a good analogy is to imagine it as a step as big as the step from color index mode to RGBA mode. Some of the stuff that the previous mode did doesn't make sense in the new mode.

It is entirely plausible that it might *never* make sense to support old-style blending in combination with float buffers.

Hmm... okay, old-style blending is not really needed, at least not all of it. But simple modulation with the framebuffer, or addition, is quite useful. But then I remember we don't have a real framebuffer anyway... sort of funny. How do we actually draw onto a floating-point buffer? We have 4 outputs...

Oh, and it's not as big a step as the one from 8 to 32 bit. The math on the software side stays the same for most of it; one thing that changes is the clamping, so we can now have full dynamic range _on_parts_of_the_rendering_pipeline_. I thought you would support a full floating-point version of OpenGL... instead you provide some float render targets, and that's it. No real float textures, no real float screen mode, actually...


And it is virtually guaranteed that filtering of float textures, even if eventually supported, will lead to large slowdowns.

Well... filtering... isn't this actually a*b + c*d + e*f + g*h, with a,c,e,g as the filtering kernel and b,d,f,h as the four samples? Isn't that just a DP4 instruction? I don't see the point... you can generate the filtering kernel about the same way you did before...



Condition code modifiers do not make instructions any slower. This can lead to nice speedups over the older "SGE/MUL/MAD" approach. (It also gets rid of the whole "0 * NaN = 0, 0 * Inf = 0" annoyance.)

That's fine... I know it speeds things up by removing some instructions, but if in exchange the individual instructions executed more slowly, that would be... well... not that nice.


Originally posted by Nakoruru:
Is it just me, or should I call this nVidiaGL? With all these extensions it seems I could write a program that uses almost no standard OpenGL, only nVidia's extensions. The only standard thing left seems to be texture objects!

Is it just me, or are the fragment programs, for example, quite difficult to code with, since each register can hold a) a float, b) two half floats, or c) two fixed-point values (or whatever), plus all the branching and such? Is it just me, or do we now need to use Cg to get our code readable again?

And why is NV_vertex_program2 not based upon ARB_vertex_program, with the additional instructions added? That would be cleaner, imho...

Well, that's about all I have to rant about currently. I just want to state that the NV30 will, sometime in the future, be quite good hardware, from what I can see so far. I just think we're still quite far from perfect... farther than I actually thought before reading those specs...

nutball
08-30-2002, 12:07 AM
Originally posted by mcraighead:

Floating-point multipliers are big. Floating-point adders are BIG. And we're talking about full S1E8M23 precision here.

When you think about float buffer mode, a good analogy is to imagine it as a step as big as the step from color index mode to RGBA mode. Some of the stuff that the previous mode did doesn't make sense in the new mode.


Hmm. Right, well, I must agree with the sentiments expressed by others that the lack of any blending whatsoever in the FP buffers is a real disappointment. OK, so 1-x doesn't make sense if x<0 or x>1; fine, don't allow multiplicative blending, but I can't even do additive blending???

Sigh. Skulks off to wait for NV40.

ToolTech
08-30-2002, 12:38 AM
I am wondering when I can use displacement maps for my shadow volumes in Gizmo3D. When will we see displacement maps in NV archs?

Humus
08-30-2002, 03:00 AM
Oh come on people, don't be too hard on Matt. My deepest sympathy for everyone involved in trying to get floating point math into 3D. :)
When floating point in the fragment pipeline was first talked about, my first thought was actually something along the lines of "how are they going to be able to do that?", knowing the cost of implementing it in hardware. Somehow I believed in the magic of ATi and nVidia engineers anyway. But as it seems now, with all these kinds of restrictions, things are yet again going to be massively painful to code for. Somehow I think it would have been better to wait another generation for full floating point fragment shading, and for now only take the step of adding full 16-bit/channel fixed point 1D/2D/3D/cube textures, possibly even 32-bit/channel fixed point. Then in the next generation, maybe with a smaller manufacturing process, they might be able to take the real step into floating point fragment math.

davepermen
08-30-2002, 03:14 AM
In half a year you'll be able to get nice cheap AMD CPUs of which you can plug up to 16 onto a motherboard... hehe, that's a full floating-point gpu/vpu/spu :D with no restrictions...

Can't wait to trace my rays there... :D

Till then I'll get the ATI, as it's here already and provides about the same 'features'...


In about one year, pixel shaders and vertex shaders will finally be real shaders, done in a very useful way, generalized and all. I think by then floating-point math will be fully supported as well... at least, I hope so...

Nakoruru
08-30-2002, 05:52 AM
I have to admit that I was a little surprised when I heard about floating-point frame buffers being in the next generation of cards. So it's kind of odd that I should be so disappointed that things are not completely and thoroughly floating point everywhere, with every feature we are used to. I should have been more sceptical. I'm not sure whether to blame myself or the hype machine.

I do not think I should be comparing the R300 or NV30 to some perfect card I can only dream about. The new features are an outstanding improvement. No card can ever compete with the perfect one you imagine in your head (except that maybe my current dream card will be obsolete in 3 years ^_^).

Good Job nVIDIA!

If I really wanted to complain to nVIDIA, it would be about the fact that their NV_* extension specifications altogether probably make a larger document than the OpenGL 1.4 spec.

jwatte
08-30-2002, 07:27 AM
> And we're talking about full S1E8M23
> precision here.

If I can't have that, I'd be perfectly happy with 16-bit FP precision. That'd be enough for me for a long time. I've mostly been planning on staying in 16 bit per component anyway, as that doubles your available register space.

If I can't have 16 bit floats, I'd like 16-bit signed fixed (say, 4.12 or even 2.14) in as many places as possible.

Btw: You absolutely need multiplicative and additive blending in any reasonable graphics set-up, so it makes sense to say "we can't do mult and add, so you don't get anything". Whether you need 1-x, or 1-clamp(x,0,1), or something like that is less clear. After all, you COULD pre-multiply when you generate the initial data, instead, assuming you have enough input data to go around.

"Plastic transparency" where you typically use A,1-A blending currently (as opposed to regular transparency, which is just multiplicative) should be done as multiplicative + diffuse/specular anyway.

MZ
08-30-2002, 07:41 AM
(ToolTech) I am wondering when I can use displacement maps for my shadow volumes in Gizmo3D. When will we see displacement maps in NV archs?
I can't recall the source, but I've read somewhere that NV30 will allow "render to vertex array".
Now, with the NV_pixel_data_range thing, I imagine it this way: you render the depth map to a texture, do glReadPixels into VAR memory, and then you have your shadow volume grid (am I correct?) ready to render.


(davepermen) And why is NV_vertex_program2 not based upon ARB_vertex_program, with the additional instructions added? That would be cleaner, imho...
I fully agree. And the same applies to fragment_program. The fact that ARB_fragment_program doesn't exist yet is no excuse. Program-object management, parameter loading, etc., have been defined and should be reused, even if the instruction syntax rules were much extended.
I like the NV30 HW features (despite the limits), but the whole new NV30 extension pack is proof that OpenGL 2.0 is the only hope.

ToolTech
08-30-2002, 07:43 AM
I would not do any readpixels. Just keep it as a texture and render the displacement map to generate the volume.

ToolTech
08-30-2002, 07:45 AM
OK, I might have misunderstood you at first glance. Good idea!

Nakoruru
08-30-2002, 07:54 AM
MZ, do you mean that nVIDIA should reuse their own program loading API (you cannot really mean that because they do) or that they should use OpenGL 2.0's?

gking
08-30-2002, 08:15 AM
The only painful limitation is the inability to blend floating point framebuffers. However, if all you are interested in is one or two components, you can use your shader to pack higher precision data into the frame buffer (2x12-bit or 1x24-bit).
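
(For illustration, the arithmetic of that packing idea, sketched in C for the 1x24-bit case; a fragment program would do the equivalent splits with scales and fracs:)

#include <assert.h>

/* Pack a 24-bit value into three 8-bit channels and back -- the CPU-side
   arithmetic of the "1x24-bit" idea above.  A fragment program would do the
   equivalent with multiplies and fracs instead of shifts. */
static void pack24(unsigned v, unsigned char rgb[3])
{
    rgb[0] = (v >> 16) & 0xFF;
    rgb[1] = (v >>  8) & 0xFF;
    rgb[2] =  v        & 0xFF;
}

static unsigned unpack24(const unsigned char rgb[3])
{
    return ((unsigned)rgb[0] << 16) | ((unsigned)rgb[1] << 8) | (unsigned)rgb[2];
}

int main(void)
{
    unsigned char rgb[3];
    pack24(0xABCDEF, rgb);
    assert(unpack24(rgb) == 0xABCDEF);
    return 0;
}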

In most cases, floating point textures will be used primarily as intermediate buffers in screen space (so filtering them doesn't make too much sense). And, if you're focused on image quality enough to use floating point textures, you probably don't want to use just linear interpolation.

And I wasn't thinking clearly last night -- mipmapping/trilinear filtering a floating point texture doesn't require multiple textures. In fact, it becomes much less ugly when all MIP levels are stored in one.

Korval
08-30-2002, 08:35 AM
Not having floating-point blending isn't that much of a concern. After all, by the next rev, we won't have any blending at all: it'll just be a per-fragment parameter that we can do with as we please.

BTW, what happens when I want to have 4 per-pixel bump-mapped lights, where the tangent-space transform is computed in the fragment shader? NV30 doesn't really provide for that, since it can only pass 10 parameters (only 8 of which are full-precision). I'd need at least 13 parameters for 4 lights.

Of course, 8 is larger than the 4 I have now... but not that much larger than the 6 that the 8500 provided last year.

[This message has been edited by Korval (edited 08-30-2002).]

MZ
08-30-2002, 09:01 AM
Nakoruru,
I meant they should "freeze" NV VP, and use ARB VP interface as basis for *any* new gl1.x, asm style VP or FP.

gking
08-30-2002, 09:30 AM
Korval --

4 eye space light positions passed as parameters into the fragment shader (updated every frame)

an interpolated 3x3 tangent->eye matrix (or eye->tangent)
1 interpolated eye space object position

an eye space eye position constant at (0, 0, 0)

So, with 4 interpolants (3, if you recompute B=NxT every fragment) you can have quite a few more than 4 lights, and the resulting quality will be better than interpolating H and L per-vertex.

Of course, the performance won't be as good, but you should be able to do some load balancing between the vertex and fragment programs with the remaining 5 interpolants.

Coop
08-30-2002, 09:51 AM
Hi!
Just a couple of my thoughts about float textures...
I think the lack of filtering for float textures shouldn't be considered a missing feature. Instead we should simply assume that NV_fragment_program bypasses not only the standard texture application, color sum, etc., but also the texture fetching mechanism. It's just another step towards "replace fixed functionality with programmability". We can now code any filtering scheme we want. With fixed function there's only nearest, linear, and anisotropic filtering, with mipmaps (nearest or linear). What if I want cubic filtering? Or a summed-area table for minification? Or linear filtering on the s coordinate and nearest on t? All of this is possible with a fragment program (I think it is :) ). The standard filtering types are just a couple of functions in Cg. The same goes for 1D, 2D, 3D, and cube textures. What if I want a cube map access with correct filtering at texture boundaries? Or a 4D texture?
Another story is the speed of dedicated filtering hardware vs. filtering "emulated" in a fragment program... But it's like the dedicated lighting hardware being twice as fast as lighting calculated in a vertex program on NV20. Nobody is crying that we cannot use it when we use vertex programs, especially since vertex programs get faster and faster.

But I think there's NO EXCUSE for not supporting any blending :D

Coop

Adrian
08-30-2002, 10:28 AM
So if an app is suffering from banding due to blending many textures, it will not benefit from the NV30's implementation of FP color?

davepermen
08-30-2002, 12:04 PM
Originally posted by Korval:
Not having floating-point blending isn't that much of a concern. After all, by the next rev, we won't have any blending at all: it'll just be a per-fragment parameter that we can do with as we please.

BTW, what happens when I want to have 4 per-pixel bump-mapped lights, where the tangent-space transform is computed in the fragment shader? NV30 doesn't really provide for that, since it can only pass 10 parameters (only 8 of which are full-precision). I'd need at least 13 parameters for 4 lights.

Of course, 8 is larger than the 4 I have now... but not that much larger than the 6 that the 8500 provided last year.

[This message has been edited by Korval (edited 08-30-2002).]

Think about it...

You have full floating-point values IN the fragment program. That also means full floating-point constants. So why not store the lights in the constants, and simply send over the tangent space? You don't need to send over the screen-space position either, btw; you get it for free.
Store the tangent space as a quaternion and you only need to send one 4D texture coordinate :D
No need for anything else...
Vertex programs aren't needed for any shading anymore, only for animating/skinning/tweening, whatever... not even for precalculating lighting data...

Coop
08-30-2002, 12:05 PM
Adrian,
No, but you'll have to use render-to-texture with two textures (one used as a texture holding the previous results and the second as the destination, swapped every pass) and do your blending in the fragment program.

[This message has been edited by coop (edited 08-30-2002).]

Nakoruru
08-30-2002, 12:10 PM
coop, I'm with you. I want transistors to be dedicated to more and faster pixel and vertex pipelines, not fixed function operations.

If filtering or blending requires a whole extra set of floating point units, then I would rather have those units' transistors go towards building more general floating point units for the pixel pipelines.

In a high level language, the typical filtering techniques will become standard library functions anyway, so why do we care how they are implemented?

Why would you need blending when you can render to texture? That is a much more general multipass solution.

Coop
08-30-2002, 12:26 PM
Originally posted by Nakoruru:
Why would you need blending when you can render to texture? That is a much more general multipass solution.

I agree it's more general than standard blending, but we often don't need any sophisticated blending function. I think the most commonly used is just addition (src + dest), e.g. accumulating lights in per-pixel lighting. Using render-to-texture is a little too complicated for me in this case, and it doubles the memory required (unless we can render to the texture we are texturing from). That's why I think at least simplified blending should be supported (addition, maybe multiplication).

Coop

mcraighead
08-30-2002, 01:59 PM
We do have plans to make NV_vertex_program2 work well with ARB_vertex_program. However, we also want it to be backwards compatible with NV_vertex_program.

In all honesty, there isn't a _huge_ difference between NV_vertex_program and ARB_vertex_program. You can even use the ARB APIs to load your NV program -- just call ProgramStringARB rather than LoadProgramNV.

Any approach would have required a new NV extension, because NV_vertex_program2 has a large number of new instructions and features that are not present in ARB_vertex_program.

I don't know if it's already in the spec, but what I believe you can do is write an ARB program and put in a special "OPTION NV_vertex_program2;" statement (I don't know if that's the precise syntax, but it's the right idea). This will let you write VP2 programs using more-ARB_v_p-like syntax.

I suspect this is already in both the spec and emulation driver, but I'm not 100% sure.

I think it's pretty clear from our inclusion of this option that our intent *is* to make all this stuff work quite smoothly with the ARB framework. I'm just not 100% clear on what the status of everything is -- I haven't been working on this myself.

- Matt

mcraighead
08-30-2002, 02:02 PM
By the way, we even made some last-minute changes to NV_fragment_program to make it work better with the ARB_v_p framework -- for example, we added numbered local parameters in addition to named parameters. We've put quite a bit of effort into making sure that all this stuff works together nicely.

- Matt

mcraighead
08-30-2002, 02:13 PM
By the way, I'm not saying that there will never be anything that lets you do anything along the lines of blending with float buffers. There are a few different approaches, each with advantages and disadvantages.

It's just that designing hardware is all about tradeoffs.

Let me put it another way. When was the last time anyone ever told us that they do NOT want certain features in their graphics card? :) If you take all the features that everyone wants and put them _all_ in, you _will_ find yourself up at that 10 billion transistor count.

- Matt

Nakoruru
08-30-2002, 03:25 PM
Good point Matt, but I have started to get into the habit of saying 'please do not give me any new fixed-functionality and throw away all the legacy stuff you can'

On another point, having only 8 texture coordinates with 16 possible textures really makes sense if you think of it as 8 'input' textures and 8 'output' textures (although, of course, they are more general than that).

I really do not think the added complexity of doing what would be a simple glBlendFunc in old OpenGL is a big deal. It seems like lazy programming to me. glBlendFunc was invented because that's all the hardware could do; now that you can mix colors from anywhere, in any way you want, why would you want an old crutch like glBlendFunc? Use the tools you are given to implement your own purpose-built blend funcs.

I would rather not have it hardwired, because as an external piece of state it obscures what your shader program does. Your shader program could not stand alone. To be understood, you would have to say 'oh, by the way, the result is blended with the frame buffer after this'; why not just do something equivalent in the shader program?

I would like the shader to be the final word on where colors come from and where they go and the sooner that stencils and alpha tests, depth tests, register combiners, and blending go away the better.

I know that this is an extreme position, but I think it's where we are going. However, I am a practical person, so I understand why things are the way they are now, and that some things will always be more efficient if they are hardwired.

Things like the huge optimizations that ATI and nVidia have brought to the depth test probably mean that it is here to stay.

SirKnight
08-30-2002, 04:56 PM
I agree. I think the ability for us to program the blending functions ourselves in a shader program is The Right Thing. I mean, the blending possibilities would be endless. The way it is now, blending is of course fixed, and sometimes we have to juggle things around a bit to use the fixed-function blending modes the way we want. With blending being programmable, we could use some funky math formula to do a blending effect not possible otherwise. Being able to do this may be a little ways off, but I'm quite convinced it's what will need to happen eventually. In the meantime, I can't wait for my CineFX NV30 card. :) I'm so glad I can now emulate it; now I can start on some of the things I want to do that require this card (the ATI 9700 will work just as well, though). That is, if I can figure out how to do the displacement mapping stuff. I hope that feature can be emulated in the drivers right now anyway. :)

-SirKnight

jwatte
08-30-2002, 05:47 PM
Okay, so I have to pay two screen-size frame buffers to do blending. But then I get as much blending as I want, in floating point. Fill rate is no concern, right? :-)

As far as per-vertex attributes go, and passing them to the vertex program, you can store all the data you want in a look-up table, previously known as "texture". Then you can set up the interpolators to spit out weight values for each of the three verts, rather than some post-interpolated value. I believe with this set-up, you're theoretically limited more on the number of fragment instructions and addressable textures, than the amount of data that you can pass from the vertex shader to the fragment shader.

It probably puts a fair bit of load on that fragment processor, though. I'd better go back and start working on reducing overdraw :-) (although spiffy-Z ought to save my bacon a little bit already)

pbrown
08-30-2002, 06:17 PM
Originally posted by mcraighead:
I don't know if it's already in the spec, but what I believe you can do is write an ARB program and put in a special "OPTION NV_vertex_program2;" statement (I don't know if that's the precise syntax, but it's the right idea). This will let you write VP2 programs using more-ARB_v_p-like syntax.

I suspect this is already in both the spec and emulation driver, but I'm not 100% sure.

I think it's pretty clear from our inclusion of this option that our intent *is* to make all this stuff work quite smoothly with the ARB framework. I'm just not 100% clear on what the status of everything is -- I haven't been working on this myself.

- Matt

I have been working on this myself. :-)

If you stick an "OPTION NV_vertex_program2;" at the beginning of your ARB vertex program, the compiler should automatically accept any "!!VP2.0" constructs (condition codes, branch labels and instructions, jump tables, new instructions, and so on).
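
(Untested sketch of what that looks like from the API side; the program body below is just a trivial pass-through, and you'd check the error state in real code:)

#include <string.h>
/* Assumes <GL/gl.h>/<GL/glext.h> and ARB_vertex_program entry points loaded
   via wglGetProcAddress (glBindProgramARB, glProgramStringARB). */

static const char vp2_src[] =
    "!!ARBvp1.0\n"
    "OPTION NV_vertex_program2;\n"   /* enables the VP2 constructs, per the post above */
    "MOV result.position, vertex.position;\n"
    "MOV result.color, vertex.color;\n"
    "END\n";

void load_vp2_program(void)
{
    glBindProgramARB(GL_VERTEX_PROGRAM_ARB, 1);
    glProgramStringARB(GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                       (GLsizei)strlen(vp2_src), vp2_src);
    /* In real code, check glGetError() and GL_PROGRAM_ERROR_POSITION_ARB here. */
}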

The only things from NV_vertex_program and NV_vertex_program2 that you won't get automatically are the pre-defined register names (R0-R15, c[], v[], o[]). For those who care, I wonder if something like "OPTION NV_program_registers" might give you something like that? :-)

This path needs to be tested more rigorously, so it is not yet documented in the NV_vertex_program2 or ARB_vertex_program specs. If anyone tries these options and gets something funky, shoot me an email.

Zeno
08-30-2002, 07:06 PM
Originally posted by Nakoruru:
Good point Matt, but I have started to get into the habit of saying 'please do not give me any new fixed-functionality and throw away all the legacy stuff you can'


Nice post Nakoruru. I've been thinking this way myself, starting with the ideas they put forth in the Geforce3. I would rather they got rid of the fixed-function pipeline altogether and have the drivers build custom VP's on the fly to emulate it, if it would free up transistors that could be used for more programmability.

I'd trade some speed for the additional flexibility, but I can also understand that end users usually only see the speed side of the equation.

-- Zeno

mcraighead
08-30-2002, 07:06 PM
I do believe that some sort of "blending-like thing" will show up in the future for float buffers. I just don't think that it will necessarily work the same way as glBlendFunc.

At the bare minimum, I'd predict that you would not get the ONE_MINUS modifiers the way you do so cheaply with glBlendFunc.


Let me make a suggestion. If what your app wants to do (which is a fairly common sort of thing) is to composite N light sources on top of one another with high dynamic range, then there's a good way to do this. (The obvious case where you need this is shadow volumes, where you only really get to do one light source at a time.)

Create a double-buffered float (probably 64-bit, since you probably don't need full IEEE for lighting computations) pbuffer, with a depth buffer. First, render your whole scene into depth, with color disabled. Then, do all subsequent passes with depth writes off.

On the first pass, render into your "front buffer" (scare quotes to indicate that it's not visible, because it's a pbuffer) for the first light. Use your normal shader.

On the second pass, bind the front buffer as a texture using RTT, and use WPOS or the like as your texture coordinate, and render into the back buffer. Use a slightly modified shader; you will texture out of the front buffer and add your lighting computation into that result.

From there on, just alternate between the two buffers. When you're done, do some sort of fancy HDR processing into your real window.

This is _almost_ as good as real additive blending. It costs some extra memory, but it avoids some of the ugly synchronization problems that could show up if you were to texture out of the same surface you were rendering to, i.e., effectively "blend". (Hint: there's a data hazard when you have a deep graphics pipeline.) Although, in this case, I think you might actually get lucky, because you did the Z pass first, and so you would only hit each pixel once.

Note that the cleverness here has to do with the fact that front and back of a given pbuffer share the same depth buffer.
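
(Sketched in rough C against WGL_ARB_render_texture -- the wgl calls are real, but pbuffer/context setup, the shaders, and the helper names are placeholders, and this is untested:)

/* Requires <windows.h>, <GL/gl.h>, <wglext.h>; WGL_ARB_pbuffer and
   WGL_ARB_render_texture entry points loaded via wglGetProcAddress. */
void draw_scene(void);                      /* placeholders */
void draw_scene_depth_only(void);
void bind_light_shader(int light);
void bind_accumulating_light_shader(int light);   /* samples previous result at WPOS */

void accumulate_lights(HPBUFFERARB pbuf, int num_lights)
{
    int src = WGL_BACK_LEFT_ARB, dst = WGL_FRONT_LEFT_ARB;
    int i;

    /* Depth-only pre-pass, then lock depth for the lighting passes. */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    draw_scene_depth_only();
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);

    for (i = 0; i < num_lights; ++i) {
        glDrawBuffer(dst == WGL_FRONT_LEFT_ARB ? GL_FRONT : GL_BACK);
        if (i == 0) {
            bind_light_shader(i);                 /* plain shader for the first light */
        } else {
            /* Texture out of the other color buffer via render-to-texture and
               have the shader add this light on top of what's already there. */
            wglBindTexImageARB(pbuf, src);
            bind_accumulating_light_shader(i);
        }
        draw_scene();
        if (i > 0)
            wglReleaseTexImageARB(pbuf, src);

        /* Swap the roles of the two color buffers. */
        { int tmp = src; src = dst; dst = tmp; }
    }
    glDepthMask(GL_TRUE);
    /* ...then the HDR resolve into the visible window. */
}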

- Matt

mcraighead
08-30-2002, 07:13 PM
Oh yeah. On the topic of displacement mapping, there is at least one way to accomplish this. Probably more ways exist that I haven't thought of.

It will sound slow at first; bear with me.

Render into a float buffer surface, using whatever sort of cool fragment program computation you want to displace your vertices. Your "color" output in RGB is just your vertices' XYZ position.

Use ReadPixels. Then, point your vertex array pointers at your "pixel" data you just read back, and blast it back into the HW as vertices.

Slow because of ReadPixels? Not really, at least if you use the (new) NV_pixel_data_range extension. Use wglAllocateMemoryNV to get some video memory. ReadPixels into that video memory using what is known as a "read PDR", and then use VAR to draw the vertices. No bus traffic required.

Your indices can just be constants that represent your surface topology.
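
(A rough, untested sketch of that loop -- the PDR/VAR entry points are the real ones from the extensions, but the sizes, formats, and the displacement pass itself are placeholders:)

/* Requires GL headers plus NV_pixel_data_range / NV_vertex_array_range
   entry points (wglAllocateMemoryNV, glPixelDataRangeNV, glVertexArrayRangeNV). */
void run_displacement_pass(int w, int h);   /* placeholder: fragment program writes XYZ into RGB */

void draw_displaced_grid(int w, int h, const GLuint *indices, GLsizei num_indices)
{
    static GLfloat *vidmem = 0;
    GLsizei bytes = (GLsizei)(w * h * 3 * sizeof(GLfloat));

    if (!vidmem) {
        /* Video memory usable both as a ReadPixels target and as a VAR. */
        vidmem = (GLfloat *)wglAllocateMemoryNV(bytes, 0.0f, 0.0f, 1.0f);
        glPixelDataRangeNV(GL_READ_PIXEL_DATA_RANGE_NV, bytes, vidmem);
        glEnableClientState(GL_READ_PIXEL_DATA_RANGE_NV);
        glVertexArrayRangeNV(bytes, vidmem);
        glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
    }

    run_displacement_pass(w, h);   /* 1. render displaced positions as "colors" */

    /* 2. The "read PDR": ReadPixels straight into video memory, no bus traffic. */
    glReadPixels(0, 0, w, h, GL_RGB, GL_FLOAT, vidmem);

    /* 3. Reinterpret that data as vertices, drawn with a constant index topology. */
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, vidmem);
    glDrawElements(GL_TRIANGLES, num_indices, GL_UNSIGNED_INT, indices);
}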

- Matt

MZ
08-31-2002, 07:04 AM
Originally posted by mcraighead:
Create a double-buffered float (probably 64-bit, since you probably don't need full IEEE for lighting computations) pbuffer, with a depth buffer. First, render your whole scene into depth, with color disabled. Then, do all subsequent passes with depth writes off.

On the first pass, render into your "front buffer" (scare quotes to indicate that it's not visible, because it's a pbuffer) for the first light. Use your normal shader.

On the second pass, bind the front buffer as a texture using RTT, and use WPOS or the like as your texture coordinate, and render into the back buffer. Use a slightly modified shader; you will texture out of the front buffer and add your lighting computation into that result.

From there on, just alternate between the two buffers. When you're done, do some sort of fancy HDR processing into your real window.

Matt, I'm glad you touched on this topic. The scenario you described is not allowed (or, at least, not assured) by the WGL_ARB_render_texture specification. There is an unfortunate limitation:


(Issues section)
14. What happens when the application binds one color buffer of a pbuffer
to a texture and then tries to render to another color buffer of the
pbuffer?

If any of the pbuffer's color buffers are bound to a texture, then
rendering results are undefined for all color buffers of the pbuffer.

I must say I can't imagine any technical reason that would justify this. And I wouldn't be surprised if it just worked as "expected" in existing drivers, despite violating the spec. Well, the "undefined" result could happen to be exactly the same as the "expected" one, right?

I need to have several screen-size color buffers for multipass purposes. I investigated 3 options:

1. Create n pbuffers, each with own color + depth data, and use WGL_ARB_render_texture.
This option obviously sucks, because it would ruin all early Z culling benefits.

2. Use single standard color + depth frame buffer, and do CopyTexImage a lot.
This is what I currently do.

3. Create 1 pbuffer with multiple color buffers within (FRONT/BACK/LEFT/RIGHT/AUXi...) + one shared depth buffer, and use WGL_ARB_render_texture.
This would be ideal, but then I learned about the limitation quoted above. I didn't perform any tests, because I didn't want my app to rely on undefined results anyway. So for now I'm staying with option 2.

My questions:
a) Are there chances for this limitation to be removed from specs? Or patched with something like WGL_ARB_render_texture2 ?
b) Is it well supported (read: accelerated) to create many color buffers within a single pbuffer (FRONT/BACK/LEFT/RIGHT/AUXi...)?

Cab
08-31-2002, 08:19 AM
Originally posted by mcraighead:
Slow because of ReadPixels? Not really, at least if you use the (new) NV_pixel_data_range extension. Use wglAllocateMemoryNV to get some video memory. ReadPixels into that video memory using what is known as a "read PDR", and then use VAR to draw the vertices. No bus traffic required.
- Matt

Does this mean VAR will allow us to allocate more than one array of memory (one for AGP memory and one for video memory, for example)? And will it be possible to switch between those arrays quickly?

Thanks.

davepermen
08-31-2002, 09:12 AM
Matt, you made it! I could KISS you if you were a) female, hehe, and b) somewhere around here...

Finally we will be able to "render into vertex buffers"... it's actually the most awesome feature of the NV30, imho...

Btw, I hope this becomes possible in DX as well, as I can't use OpenGL everywhere...

Anyway, this will rock. It's the most advanced step forward, imho; it gives you the power to do extremely complex calculations fully on the GPU... hehe, can't wait for it... hehe :D

Moshe Nissim
08-31-2002, 10:19 AM
I would like to discuss another side of the fragment program -- performance. With the register combiner model, it was (and is) straightforward to predict performance implications, since it is a fixed-operator pipeline with a programmable routing model. If you know how many stages exist in hardware, you know where you stand (although you can get into re-iteration, where it slows to just one pass-through per two clocks). With the texture shader, the situation was pretty much the same. Again, you have to keep in mind that there are only 2 TMUs (not 4), but it is again a pipeline model.
Now, with the fragment program, how can I predict performance? For example, how many TMUs are there? I guess not 16... So does each program instruction take one clock? What about stalls? What happens if I do 3 consecutive TEX (or TXD or TXP) instructions and there are only 2 TMUs? Nobody is even saying how many TMUs exist in there... What about data hazards? One instruction using as input the output of the previous instruction? Should I worry about instruction ordering to avoid data stalls (like you do when programming assembler for a CPU)? I'm afraid the answer will be "use Cg, we in the backend know better than you how things really work deep down, and will optimize for it". But does Intel keep CPU details secret and tell people to just use C/C++/VB?

Korval
08-31-2002, 11:45 AM
it's the most advanced step forward, imho

Well, it isn't that advanced (except for the floating-point part). After all, it's not NV30-specific. I'll bet even a GeForce1 implements NV_PDR.

davepermen
08-31-2002, 01:09 PM
It's quite advanced in the features it can provide. You can render into geometry (only useful if you can render floats anyway). The possibilities are awesome: from auto-updated, GPU-animated meshes, to particles running fully on the GPU (even sort of interacting with geometry, hehe), to actually quite helpful tools for implementing raytracing. It provides tons of new possibilities...
Updating the whole water animation in hardware, letting the water surface move... etc.

Korval
08-31-2002, 02:05 PM
only useful if you can render floats anyways

Not true, especially with vertex shaders. Granted, an 8-bit per component position value doesn't offer much precision, but it's there, it works on older hardware, and it's faster than rendering to a floating-point buffer, even on newer hardware.

Besides, I've never been particularly impressed with doing things like particle systems or other such things on the GPU. It's a waste of resources, using something to perform a task that it is not optimized to do rather than performing the task on the CPU while rendering other stuff on the GPU. Rather than wasting precious GPU time on animating a mesh, I'd rather give that mesh more vertices/effects and do the animation concurrently on the CPU. The overall graphics quality of the rendering will be better, as will overall performance.

davepermen
08-31-2002, 02:12 PM
Originally posted by Korval:
Not true, especially with vertex shaders. Granted, an 8-bit per component position value doesn't offer much precision, but it's there, it works on older hardware, and it's faster than rendering to a floating-point buffer, even on newer hardware.

Besides, I've never been particularly impressed with doing things like particle systems or other such things on the GPU. It's a waste of resources, using something to perform a task that it is not optimized to do rather than performing the task on the CPU while rendering other stuff on the GPU. Rather than wasting precious GPU time on animating a mesh, I'd rather give that mesh more vertices/effects and do the animation concurrently on the CPU. The overall graphics quality of the rendering will be better, as will overall performance.

Yeah, depending on the quality of the data you need, you can use it even today...

It's your problem if you're not interested in getting physics and such off the CPU as well (as it _is_ faster on the GPU too :D)...

At least it's fun to have... we'll find useful features for sure...

mcraighead
08-31-2002, 04:54 PM
Ugh, ARB_render_texture says that? Kinda lame.

You can certainly cause bad things to happen if you aren't careful with ARB_render_texture. If you render into one of the levels that is being textured from, then you have created a nasty data hazard. The results will differ between various hardware.

But this is probably a case where the spec should have been more careful about leaving things defined whenever possible.

One thing I'm pretty sure is left undefined by ARB_render_texture is what happens if you render into one mipmap level while texturing from another. This can be useful to implement funky mipmap generation algorithms -- texture from level n and render into n+1. After all, plain old averaging of everything is not always correct.

I would be 90% certain that what I proposed would _work_. If you can't get it to work, let me know (via email).

On the topic of displacement mapping: the PDR/VAR algorithm I'm proposing is not without its flaws. I'm really not sold on it myself; I'm merely proposing it as something you might play with.

Hugues Hoppe has a Siggraph paper this year on geometry images. What I'm proposing is really just another geometry image algorithm.

- Matt

Korval
09-01-2002, 11:14 AM
It's your problem if you're not interested in getting physics and such off the CPU as well (as it _is_ faster on the GPU too)...

It may take less real time on the GPU, but, ultimately, the program loop (especially for games) needs the results of the physics. Unless you're going to put your entire program loop up there too, in which case you're wasting your CPU. Remember, CPUs are getting more powerful (and are certainly cheaper) than GPUs. Having the CPU compute physics/animation/AI/etc. while the GPU works on rendering will give you the same result in half the time. Not only that, you can send more stuff to the GPU, so that your graphics look much better.

Maybe for a demo, putting physics in the GPU is a good idea. But it has no real practical applications in the real world.

davepermen
09-01-2002, 11:42 AM
Originally posted by Korval:
But it has no real practical applications in the real world.

Never say never...

I see quite a good use: I can simulate rain fully on the hardware, all the rain particles. So rain is no longer a problem, except that it needs some fill rate. What do I gain from this? I can let it rain while my CPU does heavy calculations for other useful stuff. Rain doesn't hurt, except possibly the fill rate due to the blending of the raindrops...

There _are_ ways to use it, and there _are_ ways to use it usefully as well. I just brought up physics because it shows how much the features of this extension can give: for the first time you can actually process geometry on the GPU and store the result... and that is powerful...

DFrey
09-01-2002, 12:32 PM
Putting the physics of rain on the gpu seems a little out of place. That, or you have some very simplistic rain physics. Does your rain interact with the world? Does it splatter on rooftops? Can it form puddles and streams?

Zeno
09-01-2002, 01:22 PM
Originally posted by Korval:
Remember, CPUs are getting more powerful (and are certainly cheaper) than GPUs.


I could hardly disagree more with both halves of this sentence.

CPUs are not getting more powerful than GPUs. Quite the contrary: GPUs are increasing their computational lead, doubling in speed at three times the rate of CPUs. Currently, the best Pentium can perform about, what, 6 gigaflops? While the best graphics card that I can find data for (GeForce4) can perform about 120 GFLOPS.

CPUs are also not getting cheaper than GPUs. Let's compare the latest:

Pentium 4, 2.8 GHz - $537 (Pricewatch)
Radeon 9700 - $399 (ebgames.com)

Not only is the CPU not cheaper in the absolute sense, but it also doesn't come with 128 MB of DDR RAM and has far fewer transistors on the chip itself. This makes the price disparity even greater, IMHO.

Given all this, I'd say that it's probably wise to offload anything from the CPU that doesn't require global interactions or complex condition testing.

-- Zeno



[This message has been edited by Zeno (edited 09-01-2002).]

zeckensack
09-01-2002, 03:04 PM
Originally posted by Zeno:
Currently, the best Pentium can perform, what, about 6 gigaflops, while the best graphics card that I can find data for (Geforce4) can perform about 120 GigaFLOPS.

Unfair comparison IMO. First of all it's GOPS without an F in between ;)
Most of it is blending operations (register combiners and stuff) in integer space. It's not readily at your service.

There is certainly a lot of horsepower in graphics chips, but it's not as freely available as it is in CPUs.

That's not saying that you shouldn't use it when it makes sense. But the numbers can't be compared. Benchmark it, then decide what's better.

Zeno
09-01-2002, 03:49 PM
Originally posted by zeckensack:
Unfair comparison IMO. First of all it's GOPS without an F in between ;)
Most of it is blending operations (register combiners and stuff) in integer space. It's not readily at your service.

First, what you are saying, Zeckensack, is in direct opposition to what nvidia says. Here is a snippet from their Geforce3 press release:


The GeForce3 is the world's most advanced GPU with more than 57 million transistors and the ability to perform more than 800 billion operations per second and 76 billion floating point operations per second (FLOPS).

Now, I know I have seen the number 120 GFLOPS pertaining to Geforce4; I just can't find that video right now on their web site. It's not unreasonable that the Geforce4's T&L is 1.6 times as fast as Geforce3's.

Second, it's not an unfair comparison in the context of this discussion, since what we're arguing about here is whether the GPU is faster than the CPU at anything it CAN do. Moving particles of rain according to a vector equation like x' = x + v*t is certainly something the GPU would be good at.
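
To make that concrete, here is roughly what such an update could look like as an ARB_vertex_program driven from C. This is only a toy sketch: packing the velocity into texcoord 0 and the elapsed time into environment parameter 0 are arbitrary choices, program object creation and error checking are left out, and the extension entry points are assumed to have been fetched already.

#include <GL/gl.h>
#include <GL/glext.h>
#include <string.h>

/* Advance rain particles by x' = x + v*t entirely in the vertex unit. */
static const char rain_vp[] =
    "!!ARBvp1.0\n"
    "TEMP pos;\n"
    "MOV pos, vertex.position;\n"
    "MAD pos.xyz, vertex.texcoord[0], program.env[0].x, vertex.position;\n"
    "DP4 result.position.x, state.matrix.mvp.row[0], pos;\n"
    "DP4 result.position.y, state.matrix.mvp.row[1], pos;\n"
    "DP4 result.position.z, state.matrix.mvp.row[2], pos;\n"
    "DP4 result.position.w, state.matrix.mvp.row[3], pos;\n"
    "MOV result.color, vertex.color;\n"
    "END\n";

void load_rain_program(GLuint prog)
{
    glBindProgramARB(GL_VERTEX_PROGRAM_ARB, prog);
    glProgramStringARB(GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                       (GLsizei) strlen(rain_vp), rain_vp);
    glEnable(GL_VERTEX_PROGRAM_ARB);
}

/* Each frame: glProgramEnvParameter4fARB(GL_VERTEX_PROGRAM_ARB, 0, t, 0, 0, 0);
   then draw the particle vertices as usual.  Nothing ever comes back to the CPU. */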

-- Zeno

zeckensack
09-01-2002, 05:07 PM
Originally posted by Zeno:
Now, I know I have seen the number 120 GFLOPS pertaining to Geforce4; I just can't find that video right now on their web site. It's not unreasonable that the Geforce4's T&L is 1.6 times as fast as Geforce3's.

Then it must be my fault. I was under the impression that this number was about ops, not flops. Looks like I was wrong. Point taken :)

Second, it's not an unfair comparison in the context of this discussion, since what we're arguing about here is whether the GPU is faster than the CPU at anything it CAN do. Moving particles of rain according to a vector equation like x' = x + v*t is certainly something the GPU would be good at.

-- Zeno

I'll try again.
I'd be perfectly comfortable comparing vertex shader ops with CPU ops. That's an area where you can do tradeoffs. I'd also be comfortable with pixel shader ops, as long as they feed back into geometry. And that's not an option on the current generation, which is where the numbers came from.

You can't trade CPU ops for pixel shader ops because software pixel processing just isn't an option.

It will be comparable on the next gen, but that next gen will come with a new set of numbers.

And even then I'm quite sure that a big portion of these flops will be in fixed-function hardware (z iterators, 1/w iterators, triangle setup and clipping) which you cannot use for anything else.

Or to offer a different perspective, 136M verts/sec seems to be the transform maximum of the Geforce4Ti4600 ( proof (http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/005940.html) ). That's 3.8 GFlops in my book*. Where's the rest? It doesn't get any faster when you start using more complex vertex operations. And there aren't any other fp units under your control on a Gf4.
I'm not saying the remaining flops aren't there, I'm just saying that they're not at your disposal as they already have specific work to do.

*assuming bare 3-float vertices and a single matrix mult per vertex (modelview/projection combined, no normal processing)
Vertex*Matrix mult is 28 flops, 16 mults and 12 adds
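
Spelled out with the same assumptions as the footnote, the arithmetic is just:

/* 4x4 matrix * 4-component vertex: 16 multiplies + 12 adds = 28 flops */
const double flops_per_vertex = 16.0 + 12.0;
const double verts_per_second = 136e6;                    /* measured peak above */
const double gflops = verts_per_second * flops_per_vertex / 1e9;   /* ~3.81 */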

Zeno
09-01-2002, 06:38 PM
Good point....I can't argue with your calculation. Most of the flops seem to be tied up in clipping/triangle setup areas which can't be forced to do anything else.

Given this, I have to concede that the fastest CPU's would be better at most geometry-level stuff than the fastest GPUs, assuming that's all they had to do. If not, benchmark, like you said :)

-- Zeno

Korval
09-01-2002, 08:30 PM
Given all this, I'd say that it's probably wise to offload anything from the CPU that doesn't require global interactions or complex condition testing.

Why? That just takes valuable time away from actual rendering. As long as my per-frame CPU time fits within the framerate I want to hit, offloading CPU work to the GPU is rather useless.

Not only that, things that require global interactions are pretty much everything that isn't rendering. Using a game as an example, the AI feeds into the animations, which feed into physics, which feeds back into AI. Unless you can throw the entire thing up there, you won't get any real substantive benefit out of it. And most of the really CPU-heavy stuff (outside of frustum culling algorithms like BSP's and so forth) is in AI, animation, or physics. And, remember, these cards still live across the PCI bus. Transferring data back from them for use by the CPU is going to be a very slow proposition.

Now, if it were possible, one thing that would be good to do would be to somehow put visibility culling onto the GPU (a significant source of CPU bottlenecking in many games). The problem with that is that vertex programs run per-vertex, not per-object. This has the inherent problem of causing them to run much slower than necessary.

The thing about offloading stuff to the GPU is that the GPU is the only thing that can render. By offloading this processing there, you are guaranteed to lose rendering time.

As for the comparative expense of CPU's vs. GPU's, I was only considering Athlons. P4's are priced excessively high, even taking into account the performance gains over Athlons.

As for the power argument, FLOPS, SHMOPS ;). You still have to implement it on hardware that was, fundamentally, not designed to handle this kind of processing. You don't have a lot of data to work with, and the read-back buffer strategy requires waiting until the read-back is finished (you can do other things asynchronously, but the read itself will still take time). It may be able to perform a great many operations per second, but understand that many of these ops (like scan conversion and texture-coordinate interpolation) probably aren't going to be of great use to a physics system.
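
To put a face on that read-back cost, here is a minimal sketch of pulling results back off the card with plain glReadPixels. The sizes and formats are placeholders, and nothing clever like asynchronous read-back extensions is used.

#include <GL/gl.h>

#define W 256
#define H 256

static GLubyte results[W * H * 4];

/* glReadPixels does not return until the pixels are in 'results', so the CPU
 * sits idle while the pipeline drains and the data crosses the bus.  Work can
 * be queued before this call, but the call itself is a synchronization point. */
void fetch_results(void)
{
    /* ... the GPU passes that produced the data go here ... */
    glReadPixels(0, 0, W, H, GL_RGBA, GL_UNSIGNED_BYTE, results);
    /* only now can CPU-side simulation code look at 'results' */
}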

nutball
09-02-2002, 04:11 AM
Originally posted by mcraighead:
Ugh, ARB_render_texture says that? Kinda lame.

Yeah, I thought ARB_render_texture was the answer to my prayers, until I read that.

Lifting this restriction would be really, really, really great!

The other thing I don't like about this extension is the WGL_ bit. How might one go about getting such functionality running under Linux?

davepermen
09-02-2002, 04:15 AM
i dislike the whole rendertexture stuff.. i dislike in fact every part of that wgl-stuff.. it's not portable. i want a simple rendertotexture which is portable.. is that so difficult?

mcraighead
09-02-2002, 10:50 PM
The answer to that is GLX_ARB_render_texture.

- Matt

nutball
09-03-2002, 04:02 AM
Originally posted by mcraighead:
The answer to that is GLX_ARB_render_texture.

- Matt

Has that been ratified by the ARB yet?

The only references Google can find to it are some ARB minutes from June 2001 saying basically that the spec. hadn't been finished, for no apparent reason.

mcraighead
09-03-2002, 12:02 PM
I really don't know.

- Matt

pbrown
09-05-2002, 02:48 AM
Originally posted by nutball:
Has that been ratified by the ARB yet?

The only references Google can find to it are some ARB minutes from June 2001 saying basically that the spec. hadn't been finished, for no apparent reason.

There is no GLX render texture spec approved by the ARB as far as I know.

evanGLizr
09-05-2002, 12:26 PM
Originally posted by davepermen:
finally we will be able to "render into vertexbuffers".. its actually the most awesome feature of the nv30 imho..

btw, i hope this gets possible in dx as well, as i can't use opengl everywhere..

anyways, this will rock.. its the most advanced step forward imho somehow, it will give you the power to do extremely complex calculations fully on the gpu.. hehe, can't wait for it.. hehe http://www.opengl.org/discussion_boards/ubb/biggrin.gif

Out of interest, that is the same method P10 uses to do displacement mapping:


The displacement lookup (and optionally the tessellation) is done by the texture subsystem and the results left in memory where they can be read just like a regular vertex buffer. On the second pass the vertex shader will pick up the displaced vertices, light them and then they get processed as normal. This is a good example of using the flexibility of the SIMD arrays for not just their default purpose.


http://www.beyond3d.com/articles/p10tech/index.php?page=page6.inc

[This message has been edited by evanGLizr (edited 09-05-2002).]