Programmability... the double-edged sword!!

After reading the article (check the opengl.org home page) on the P10 and OpenGL 2.0 programs I am getting really concerned. I am a fan of low level programming, primarily because of the control. No compiler or processor out there will ever be able to optimize better than you can. The higher the level of the programming language (and the larger the total instruction count), the harder it becomes for a compiler to generate really fast code. With x86, simple branch prediction has become a problem: while most of the time the prediction is okay, you can force cases where it will always come up wrong. With OpenGL 2.0 having something like 200 instructions I see some real issues. The P10 is supposed to be all about parallelism, but with so many instructions I promise you can write a program (especially with loops and conditionals) that will severely stall that thing. Dependency between registers and operations is a serious problem. In lower level code you can be very careful about these issues, but the higher you get the more you are at the mercy of the compiler.
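
To make the dependency point concrete, here is a minimal C++ sketch (purely illustrative, not shader code, and the function names are made up): the first loop is a serial chain where every add has to wait on the previous result, while the second keeps four independent accumulators that the hardware can overlap.

    // Serial dependency chain: each add has to wait for the previous one.
    float sum_chained(const float* v, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i)
            acc = acc + v[i];            // acc depends on the last iteration
        return acc;
    }

    // Four independent accumulators: the adds can overlap in the pipeline.
    // (Remainder elements when n is not a multiple of 4 are omitted here.)
    float sum_unchained(const float* v, int n) {
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
        for (int i = 0; i + 3 < n; i += 4) {
            a0 += v[i];
            a1 += v[i + 1];
            a2 += v[i + 2];
            a3 += v[i + 3];
        }
        return (a0 + a1) + (a2 + a3);
    }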

The second thing that I don’t like is all the damn transistors. All the different instructions have eaten a lot of real estate, although I know the 256 bit memory interface definitely ate most of it. I think I would have preferred the current level of vertex programs with, say, a dozen more instructions and the ability to pass arbitrary amounts of data in and out, over the enlarged instruction set. For example, I would like to be able to pass in NURBS control points and send out tris, which is currently not possible with NV programs or ATI’s. It’s now possible with 2.0, but at the same time I think I would have simply enjoyed having a couple of extra texture units and a few fewer instructions.

Full programmability sounds great on paper, but inside the silicon it gets messy, and the potential for failure is as great as the potential for success. Programming is a balancing act best left to the programmer, not the video card. I want to see the assembly that thing generates. I want the opcodes. And I want to write programs in hex. Then and only then will I feel like I am in control.

All I am saying is this: watch it, guys. OpenGL 2.0 may come back to bite you. You had better start the super advanced forum right now. You’re gonna need it.

Devulon

black vision guy…

there is no standard x86-style instruction set for gpus yet, but there is the intermediate language… good gpus will support this directly, and that way you will get direct assembler programmability…

and parallel programming was never easy (see ps2)

take it easy…

oh, and compilers can be DAMN GOOD!

There are good compilers, I will give you that. But I would like to say that the number of good ones is a lot less than the number of bad ones. And of course, define good. Visual Studio (Microsoft’s C++ compiler) is quite good. Very good, in fact. But then again the Intel compiler rips the crap out of it. And then again, who can afford Intel’s compiler?

As to parallelism, it’s never easy; that’s my whole point. This is something that becomes implementation dependent. I imagine that the same code will run at different speeds on different video cards with different drivers, the same way the exact same C code generates different assembly with different compilers on different platforms.

OpenGL 2.0 is a big leap. I would have preferred a big step. With a leap you really gotta watch your landing.

Devulon

Originally posted by Devulon:
the number of good ones is a lot less than the number of bad ones

A statistic that’s generally true of everything in life. The number of programmers that can out-optimize even a modestly decent compiler is a lot less than the number of programmers that would be put to shame by it.

You are used to low level programming where you have a single target instruction set; with shader graphics that will not be true. The advantage of the language which you are missing here is the hardware abstraction it provides. The hardware manufacturers have different design philosophies, and if you get what you wish for, you will have to hand tune your code for a plethora of graphics ‘instruction set’ platforms (at least the 2 or 3 most popular). I don’t think there is a good option here. I like competition that stymies NVIDIA’s plans to charge gamers $700+ for a video card, but I don’t like that NVIDIA and ATI can’t agree on extensions. Maybe the price of freedom is a bit of opacity beneath a shader compiler.

[This message has been edited by dorbie (edited 05-06-2002).]

Originally posted by dorbie:
You are used to low level programming where you have a single target instruction set; with shader graphics that will not be true. The advantage of the language which you are missing here is the hardware abstraction it provides…

Anyone ever hear of DirectX? Its abstraction is so nice that the number of tris you can push with it sucks compared to OpenGL. There is a price to pay for hardware abstraction. I would like them to agree on an instruction set, not on a language. Add is Add; I really don’t care what you call it. But c = a + b;, where a, b, c are vectors, is not just addition. It’s logic: the obvious addition and a store, as well as the potential reads from memory/registers. It’s a lot more complicated and hence has a lot of variations on what actually occurs. The end result must always be the same, but the actual work the processor does can be quite different. I want to know what it does. That’s what allows you to make stuff fast. add eax, eax is very explicit about what I want the processor to do, but eax = eax + eax; clearly isn’t. That abstraction will cost you.
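
As a rough sketch of what I mean (a made-up three-component vector type, purely for illustration), the single line c = a + b; hides a pile of loads, adds and stores, and the compiler decides the ordering and the registers:

    // Hypothetical vector type, purely for illustration.
    struct Vec3 { float x, y, z; };

    // What "c = a + b;" expands to for a 3-component vector:
    // six loads, three adds, three stores.
    Vec3 add(const Vec3& a, const Vec3& b) {
        Vec3 c;
        c.x = a.x + b.x;   // load a.x, load b.x, add, store c.x
        c.y = a.y + b.y;   // load a.y, load b.y, add, store c.y
        c.z = a.z + b.z;   // load a.z, load b.z, add, store c.z
        return c;
    }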

Devulon

well actually it depends on hardware. hardware designed for the dx abstraction works well and fast… nvidia’s hardware is not that well designed for dx (but nvidia designed about 90% of dx, so i think this is rather stupid) so they are not fast on dx…

but hey?

you say that on every other piece of hardware we’ll get different speeds?
well… if that is not already the case, why is the whole world comparing the fps of quake3?

i don’t see your point… it’s not more or less abstract than before… it’s just different because now you get programmability you didn’t have before…

and if you want to use extensions you have to abstract them in some way to support other hardware…

if you don’t use extensions in gl you will by no means get the fastest out of your gpu

and the intel compiler made my code slower than vc6! and vc7 is even faster than vc6… so who needs that crappy compiler?

(i would love to try vectorC one time… but oh well… money… )

I will gladly accept a performance loss for hardware abstraction, and I think many developers will too. Low level programming is terrible in terms of productivity, and productivity is what business is about. Sure, low level code can be fast, but a wisely chosen algorithm hidden behind a high-level interface can be equally fast.

Low level programming at “standard” application level is a disease that should be fought.

Just my 0.02€

-Lev

Where is the comment about DX pushing fewer tris on NVidia hardware than GL coming from?

I would say this is entirely untrue. Actually using VAR in GL or using DX is pretty much exactly the same thing.

There are a tiny number of things that can be done directly in register combiners that can’t be expressed very well in pixel shaders (because of abstraction/having to work on other hardware), but your comment makes it sound like that’s not what you are talking about.

I agree; locking vertex buffers leaves it open to the driver as to how it treats that vertex data. In the case of nvidia hardware, the driver probably copies the vertex data to AGP memory anyway until the buffer is unlocked.
All in all (and this probably isn’t relevant to this topic), I think OpenGL is becoming as messy as really early versions of Direct3D… extensions are MESSY, and are getting messier.

>>Where is the comment about DX pushing fewer tris on NVidia hardware than GL coming from?<<

i thought this was due to the overhead of COM for d3d

>>I would say this is entirely untrue. Actually using VAR in GL or using DX is pretty much exactly the same thing.<<

funny how the nvidia demo to show off their nvidia cards (benmark5) is about 10-15% faster when you convert the d3d code to opengl code.
why?
i’m no hardware freak (i find it all very boring) but what actually causes the speed difference between opengl and d3d? is it, like i wrote above, the overhead of COM, or is it something else?

i thought this was due to the overhead of COM for d3d

D3D COM is not particularly slow. The function call overhead is likely to be an additional bit of function-pointer indirection, which happens to be the same as any OpenGL call (since the ICD is done as a dll, there’s always a function-pointer indirection involved). D3D is done in-proc as a dll, so the calls themselves aren’t particularly expensive. Certainly, I can’t imagine the call overhead matters much once you are up against the actual hardware bottlenecks when the hardware is pushed.
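
For what it’s worth, here is a minimal C++ sketch of what one level of function-pointer indirection looks like (a made-up dispatch table standing in for the ICD/COM mechanism, not the real thing):

    // A made-up dispatch table standing in for the kind of indirection an
    // ICD or a COM vtable adds to each call: one extra pointer load, then
    // an indirect call. Cheap next to what the driver and hardware do.
    struct Dispatch {
        void (*DrawTriangles)(const float* verts, int count);
    };

    void draw_frame(const Dispatch* api, const float* verts, int count) {
        api->DrawTriangles(verts, count);   // indirect call through the table
    }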

Yes, VAR is probably a bit faster than D3D, even with locked buffers (since you are managing the memory yourself).

Programming is a balancing act best left to the programmer, not the video card.

That’s a pretty arrogant statement. Who’s to say that the card doesn’t know more about itself than you do? Who’s to say that you will be allowed to know enough about the hardware to even come close to beating their optimized code? Remember, the actual functionality of their hardware is proprietary, as it should be. The general public, even developers, will not be allowed access to the meat of the hardware. That’s why they have good abstractions.

I want to see the assembly that thing generates. I want the opcodes. And I want to write programs in hex. Then and only then will I feel like I am in control.

Why do you want opcodes? You don’t need them to do what you want. And your claim that the abstraction will cause an inevitable slow-down from the peak performance you could squeeze out if only you knew the internals of the hardware is born of a lack of trust. You aren’t willing to trust programmers who have direct access to the hardware designers to write decent compilers.

It’s not like vertex programs on nVidia cards aren’t compiled into some form of microcode or something. You didn’t think that it went through a simple assembler, did you?

Don’t fear abstractions simply because you no longer have access to the bare hardware (which you shouldn’t have to touch anyway). Instead, embrace them, for they give you access to a wide variety of hardware.

Originally posted by Korval:
D3D COM is not particularly slow. The function call overhead is likely to be an additional bit of function-pointer indirection, which happens to be the same as any OpenGL call (since the ICD is done as a dll, there’s always a function-pointer indirection involved).

Nah. The cost of one layer of pointer indirection is pretty insignificant in the context of a function call. Matt Craighead has stated (in response to an earlier question along similar lines) that D3D’s higher function call overhead is because DX runs in kernel mode. GL runs in userland, same as your app, so there’s no expensive mode switch involved.

Devulon, there is nothing equivalent to shader compilation going on in D3D. This is a means to a single codebase which exploits hardware features. Don’t confuse the two just because I used the phrase hardware abstraction. You should be aware that OpenGL 1.x offers hardware abstraction to the developer while exploiting the hardware acceleration of the platform. The fact that OpenGL offers hardware abstraction is a “GOOD THING”, not a bad thing. It is inherently desirable, and when well designed it need not be a performance killer. To say that OpenGL 2.0 is bad because it offers hardware abstraction, simply because D3D does too, is totally misguided. The PRIMARY OBJECTIVE of an API like OpenGL 1.x is hardware abstraction.

As for them agreeing, see the ARB notes. There are several problems beyond the competition and ideological differences; there is little understanding that for many developers, if it’s not core it is irrelevant. Interesting features are too easily shunted into a marginalized extension for short-term gain by most ARB members.

The subtext is that a graphics architecture takes 2-3 years to implement, during which the parties try to jockey their extensions into a favourable position (either with M$ or through the ARB), while at the same time they are still finalizing how to expose unreleased functionality through an API. They look for common ground, but they have divergent hardware and don’t want to tell the competition what they are up to, or even let on that they know what the competition is up to.

I think OpenGL 2.0 in the future, and some watered-down lowest-common-denominator OpenGL 1.4 shader in the meantime, is the best we can expect.

The D3D way is worse of course leaving all but one hardware developer to bang a square peg into a round hole or simply drop the feature.

[This message has been edited by dorbie (edited 05-06-2002).]

P.S. This is a best case scenario, the whole higher level shader thing could turn into a messy competition for developer hearts and minds. I’m not sure all the IHVs have bought into 3DLabs’ spec. There are even worse things which could happen but I don’t want to give you nightmares so I won’t reveal what a certain large predatory monopolist may have to wield as a cudgel.

I would much rather deal with graphics at a higher level of abstraction than vertex program instructions, thank you very much. The cost of abstraction is extremely low, and it can buy you quite a bit. And there are situations where the compiler can do as good a job as a human (and sometimes better, depending on the human). Deal with it, because you have plenty of better things to worry about in the grand scheme of things (graphics and otherwise).

Extracting parallelism is difficult in the general case, yes. However, these shading languages encourage data-parallel constructs – each vertex and fragment and pixel is operated upon pretty uniformly and independently. Extracting parallelism in this case is quite simple because you pretty much know the structure of the problem even before you compile. Consider the architecture of the new 3Dlabs chip.
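
A tiny C++ sketch of that structure (shade() here is just a stand-in for whatever per-fragment program the user wrote, nothing more):

    // Interpolated per-fragment inputs and an output color, illustrative only.
    struct Fragment { float x, y, u, v; };
    struct Color    { float r, g, b, a; };

    // Each fragment is shaded independently of every other fragment, so the
    // iterations can run in any order, or all at once on parallel hardware.
    void shade_all(const Fragment* in, Color* out, int count,
                   Color (*shade)(const Fragment&)) {
        for (int i = 0; i < count; ++i)
            out[i] = shade(in[i]);          // no dependence between iterations
    }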

-Won

Originally posted by MikeC:
Nah. The cost of one layer of pointer indirection is pretty insignificant in the context of a function call. Matt Craighead has stated (in response to an earlier question along similar lines) that D3D’s higher function call overhead is because DX runs in kernel mode. GL runs in userland, same as your app, so there’s no expensive mode switch involved.

I heard that DX runs in kernel mode on NT and that’s one of the reasons MS has hesitated to update DX for NT, the other probably being that NT is not popular among everyday users. I assumed that DX was in user mode on all the other versions of Windows.

Anyway, won’t a switch occur between your app and the driver anyway? So DX and GL must be on equal footing.

V-man

Won, you’re right about the parallelism. It’s inherently SIMD. You don’t program a scanline algorithm or anything so corny; you effectively program an individual fragment, with data inputs which vary across the primitive. There is no issue with parallelism. The real killer is support for branch instructions: whether a branch instruction has some kind of combinatorial or multipass effect, or whether the hardware has native branch support.
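
One common fallback, sketched in C++ purely as an illustration (the two path functions are stand-ins): hardware without native branching can evaluate both sides of the conditional and select, which means you pay for both paths on every fragment.

    // Stand-ins for two arbitrary shading computations.
    float path_a(float x) { return x * x; }
    float path_b(float x) { return x + 1.0f; }

    // A per-fragment "if" flattened into evaluate-both-and-select:
    // both paths are executed for every fragment, then one result is kept.
    float shade_flattened(float condition, float a, float b) {
        float if_true  = path_a(a);
        float if_false = path_b(b);
        float mask = (condition > 0.0f) ? 1.0f : 0.0f;
        return mask * if_true + (1.0f - mask) * if_false;
    }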

All of you have made some interesting points. And I would like to comment.

I still don’t understand why we need hardware abstraction of Addition and Dot Product. All I am really asking for is a unified instruction set: everyone uses the same instructions, same register names, etc., etc. The individual instructions in the current level of shader programs really don’t need to be abstracted. I can’t imagine being able to make “function calls” and the like in my program that generates a color fragment.

As for my comment that the code may run at different speeds: I was thinking back through the Intel processor line. It’s not just C/C++ and other high level languages that suffer from different compilers and different platforms. Look at plain 32-bit x86 assembly from the time the Pentium came out, through the PPro, to the current P4. Explicit assembly code actually runs at different speeds on different Intel processors. This is largely because of changes to the core. Yes, the add instruction still works the same and usually doesn’t change the speed at which it executes, but branch prediction and memory fetching/writing changed drastically between processors. The fact that the entire memory interface and the way branching is dealt with will be different on each video card makes me believe that there will be identical code that runs at different speeds.
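
A small illustration of that in C++ (nothing processor specific in the source, yet the loop’s speed is dominated by how well a given core predicts the branch on a given data set):

    // Identical source everywhere, but the speed of this loop depends almost
    // entirely on the branch predictor: sorted data makes the branch
    // predictable, random data makes it mispredict, and different cores pay
    // very different penalties for a misprediction.
    int sum_positives(const int* data, int n) {
        int sum = 0;
        for (int i = 0; i < n; ++i) {
            if (data[i] >= 0)               // this branch is the whole story
                sum += data[i];
        }
        return sum;
    }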

For example, pick a piece of code that intentionally makes random memory accesses that you know are going to break cache lines. It sounds like the P10 will do a reasonably good job at preventing a major stall. I would think that any chipset with only a 128 bit memory bus will probably have a little more trouble. Just from the size of the bus I want to guess that if the data is nowhere near the cache there is going to be a trip out to memory, and I bet the 256 bit bus will be roughly twice as fast (assuming it runs at the same speed, same latency, etc., etc.).
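
Something like this C++ sketch is what I have in mind (illustrative only): the same source, but the access pattern decides how often you go out over the bus.

    // The sequential walk streams through cache lines; the strided walk
    // (assumes stride > 0) touches a new line on almost every access.
    // Bus width, latency and caching decide how far apart these two end
    // up on a given card.
    float sum_sequential(const float* data, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i)
            s += data[i];
        return s;
    }

    float sum_strided(const float* data, int n, int stride) {
        float s = 0.0f;
        for (int i = 0; i < n; i += stride)
            s += data[i];
        return s;
    }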

So we have this language that is abstracted, except you have already got me worrying about the memory issues between different types/sizes/speeds. I want to make sure that if I need to use, say, 129 bits, I am on a 256 bit bus so that I don’t end up with a two-read stall instead of one.

All I am trying to say is that it’s nice not to have to worry about these things. But unless every maker of 3D graphics chips (basically nVidia, ATI and 3DLabs) uses an almost identical core/memory bus/memory interface, I am really going to be concerned about the little timing nuances and the like.

As long as everything isn’t the same I will be concerned about the differences.

Devulon