History of Programmability

From OpenGL.org

The history of programmability in OpenGL spans a large number of hardware versions and OpenGL extensions. These have all coalesced into the modern OpenGL Shading Language. This article gives a history of how programmability came into existence in OpenGL.

Basic Configurable Hardware

While the idea of shaders has been around for some time, applying that kind of programmability to hardware rendering is a more recent development. Earlier shader systems, such as Pixar's RenderMan standard, were software-based.

The earliest consumer-grade hardware was little more than a basic rasterizer with the ability to map a single texture to the triangle. The CPU had to do the initial transform of the vertex data. Multitexturing was implemented as a multipass technique: the polygons were simply rendered again with blend functions applied. The only per-fragment operation such graphics chips could perform was the multiplication (or addition) of a single interpolated color with the texture color from the single interpolated texture coordinate.

The beginning of true programmability in consumer graphics chips was multitexture: the ability to map two (or potentially more) textures to the same triangle. ARB_multitexture was the first ARB extension, and it was designed to allow access to this hardware functionality.

With the complexity of two, or even more, textures, the obvious question arose: how do you combine them into a single fragment output? OpenGL 1.0 had something called the texture environment. This was a setting that controlled how, in single-texture situations, the per-vertex color would be combined with the texture color. ARB_multitexture extended this basic concept. Each texture also had its own independent environment function. At each stage, the previous stage's color would be applied to the texture color based on that stage's environment operator. The first stage's "previous" color was simply the per-vertex color, so it was backwards compatible.
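The stage chaining described above can be sketched in a few lines of Python. This is only an illustrative model, not OpenGL code: the function names (`modulate`, `add`, `run_texture_stages`) and the representation of colors as RGB tuples in the range [0, 1] are assumptions made for the example.

```python
# Illustrative model of a multitexture environment chain (not real OpenGL calls).
# Colors are RGB tuples with components in [0, 1].

def modulate(prev, tex):
    """Multiply the incoming color by the texture color."""
    return tuple(p * t for p, t in zip(prev, tex))

def add(prev, tex):
    """Add the texture color to the incoming color, clamping to 1."""
    return tuple(min(1.0, p + t) for p, t in zip(prev, tex))

def run_texture_stages(vertex_color, stages):
    """Run each (env_function, texture_color) pair in texture unit order.

    The first stage's "previous" color is the per-vertex color, so a
    one-stage list behaves like the single-texture case.
    """
    color = vertex_color
    for env_function, texture_color in stages:
        color = env_function(color, texture_color)
    return color

# Two stages: modulate by the first texture, then add the second.
result = run_texture_stages(
    (1.0, 0.5, 0.25),
    [(modulate, (0.5, 0.5, 0.5)), (add, (0.25, 0.25, 0.25))],
)
print(result)  # (0.75, 0.5, 0.375)
```

Note that each stage only ever sees its own texture color and the previous stage's output, which is the restriction later extensions set out to relax.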

This defined a pipelined architecture that represented basic configuration of per-fragment operations. Another extension, EXT_texture_env_combine, was later introduced that added new functions and operators for the various texture environment stages.

Enter Register Combiners

This was not enough for NVIDIA, however; it did not expose the power of the TNT hardware. So they added NV_texture_env_combine4. This added even more arguments and operators, but it added something special: reordering.

In the standard combiner model, the only texture color a particular stage can access is the one for that particular stage. And once the processor has moved on past the first stage, the per-vertex color is no longer available; you can only use the color from the previous stage. The NVIDIA extension exposed the ability to access the texture value from any stage, as well as get the per-vertex color at any stage.

There may have been only 2 stages, but this was the real beginning of programmability in graphics cards. Each stage was effectively an opcode in an assembly language that only had 2 opcodes in a program. It had a register file consisting of the per-vertex color and both of the texture colors. And it had one temporary register where the output of the first opcode was stored.
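To make the "two-opcode assembly language" view concrete, here is a rough Python sketch. It is a toy model under stated assumptions, not the actual NV_texture_env_combine4 semantics: the register names (`primary`, `tex0`, `tex1`, `previous`), the two operations, and the `run_program` helper are all hypothetical.

```python
# Toy model of a two-stage combiner viewed as a tiny assembly program.
# Register names and operations are hypothetical, chosen for illustration.

MAX_STAGES = 2  # the hardware described in the text had only two stages

def mul(a, b):
    return tuple(x * y for x, y in zip(a, b))

def add(a, b):
    return tuple(min(1.0, x + y) for x, y in zip(a, b))

def run_program(vertex_color, tex0, tex1, program):
    """Execute up to two (op, source_a, source_b) instructions."""
    if len(program) > MAX_STAGES:
        raise ValueError("only %d stages are available" % MAX_STAGES)
    # Register file: per-vertex color, both texture colors, one temporary.
    registers = {
        "primary": vertex_color,
        "tex0": tex0,
        "tex1": tex1,
        "previous": vertex_color,  # temporary, overwritten by each stage
    }
    for op, source_a, source_b in program:
        registers["previous"] = op(registers[source_a], registers[source_b])
    return registers["previous"]

# Stage 0 reads both textures; stage 1 reads the per-vertex color again.
# Neither access pattern is possible in the standard combiner model.
program = [(mul, "tex0", "tex1"), (add, "previous", "primary")]
print(run_program((0.25, 0.25, 0.25), (0.5, 0.5, 0.5), (0.5, 1.0, 0.0), program))
# (0.5, 0.75, 0.25)
```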

The GeForce 256, for which NVIDIA coined the term "GPU" (Graphics Processing Unit), was the first consumer-grade hardware to offer hardware-based vertex transformation and lighting (T&L). This T&L was very fixed-function, essentially exposing OpenGL's fixed-function T&L pipeline in full (or near full). The real programmable power, however, came in the fragment processing.

Gone were the simple environment combiners. In their place was hardware that would remain, essentially, unchanged for 2 hardware generations: register combiners.

The texture environment model was working against NVIDIA's hardware evolution direction, so they simply made a new paradigm in the NV_register_combiners extension. These register combiners were NV_texture_env_combine4 on steroids. They were very explicit about being like an assembly language; the register file was an explicit construct, as were temporary registers and so forth.

NV_register_combiners was also designed with extensibility in mind. The system hard-coded a limit of 8 register combiner stages, but the number actually available was queryable. The NV10-based GPUs (GeForce 1 and 2) only provided 2 stages.

Each register combiner stage could perform 4 independent operations: 2 vector-wise operations (multiply or dot-product) and 2 scalar operations on the alpha or blue component of a color. The outputs had to be combined into 2 RGBA colors, to be stored into two readable registers for the next stage.

There was also a more limited final combiner stage, designed mainly to do fixed-function operations like fog blending.

At this point in the history of graphics processors, NVIDIA was the de facto standard. NVIDIA GPUs were the best selling, while 3DFX suffered its biggest failures, eventually collapsing and being purchased by NVIDIA.

It was in the wake of all of this that the next phase in programmability came to pass.

Real Programmability

The GeForce 3, the first NV20 part, contained the first example of true programmability. Despite NVIDIA being a pioneer of highly configurable fragment processing, its programmability was in its vertex processing. The GeForce 3 was the first GPU that brought programmability to consumer hardware.

The vertex processor, exposed in OpenGL with NV_vertex_program, was capable of taking a program of up to 128 opcodes and executing it. It had an impressively large register file to work with (256 4-vector uniforms), as well as quite a few temporaries. It may not have been capable of looping or texturing, but it was very powerful compared to the meager fixed-functionality that came before it.

Even though vertex processing took a massive leap, the fragment stage stagnated. NVIDIA, as promised in the register combiner spec, increased the number of combiners to 8 and added two additional texture accesses. The texturing pipeline, however, was not as configurable.

Rather than bring real programming to fragment processing, NVIDIA opted for something they called "texture shaders". Exposed in the NV_texture_shader extension, texture shaders were a way to use a texture unit as a computation unit. The computation unit took parameters, one of them being that texture stage's texture coordinates. Each texture shader feeds results into the next one, so you could have an effective program of 4 instructions, as NV20 hardware only had 4 texture units.

The limitations of this functionality were substantial. The 4 unit limit was much smaller than the register_combiner limit of 8 operations. Plus, the flexibility of the opcodes was never particularly great. Lastly, each one burns a texture; you're removing the ability to access a texture for each texture coordinate math operation you use. This forces more multipass rendering.

The holy grail of fragment processing in those days was generality of texturing. This meant the ability to take arbitrary inputs from the vertex pipeline, access textures, perform some arbitrary computations on them, feed those values as texture coordinates, and access more textures with the results.

Given NVIDIA's dominance in hardware, it was something of a surprise to see that ATI was the first one to actually get this one right. The Radeon 8500 came with a number of OpenGL extensions. Among them was ATI_fragment_shader. It exposed a more flexible fragment hardware design.

These fragment shaders could perform up to 6 texture accesses, perform up to 8 math operations on those accesses, then use the results of those operations to perform up to 6 additional texture accesses, followed by another 8 math operations. It wasn't perfectly generic, as you could only get one dependent texture access. However, it was much better than NVIDIA's hardware, and certainly generic enough to be called truly programmable.
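The key idea here is the dependent texture access: the result of one texture lookup, after some math, becomes the coordinate for a second lookup. The Python sketch below illustrates only that idea. It is not ATI_fragment_shader code; the `sample` and `shade_fragment` helpers, the nearest-neighbour lookup, and the tiny list-based "textures" are all assumptions made for the example.

```python
# Illustration of a single dependent texture access (hypothetical helpers).
# A "texture" is a list of rows; coordinates u and v are in [0, 1].

def sample(texture, u, v):
    """Nearest-neighbour lookup, clamping coordinates to the texture edges."""
    height = len(texture)
    width = len(texture[0])
    x = min(width - 1, max(0, int(u * width)))
    y = min(height - 1, max(0, int(v * height)))
    return texture[y][x]

def shade_fragment(offset_map, color_map, u, v):
    # First pass: ordinary access using the interpolated coordinates.
    du, dv = sample(offset_map, u, v)
    # Math on the fetched value: perturb the original coordinates.
    u2, v2 = u + du, v + dv
    # Second pass: dependent access using the computed coordinates.
    return sample(color_map, u2, v2)

offset_map = [[(0.5, 0.0), (0.0, 0.0)]]  # 2x1 texture of coordinate offsets
color_map = [["red", "green"]]           # 2x1 texture of "colors"

print(sample(color_map, 0.25, 0.5))                          # red
print(shade_fragment(offset_map, color_map, 0.25, 0.5))      # green
```

In this model the second lookup cannot feed a third, which mirrors the "only one dependent texture access" limit mentioned above.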

Fixing the Mess

By this point, OpenGL's programmable pipeline was a big mess. That's because it didn't have one. Core OpenGL had adopted ARB_multitexture and EXT_texture_env_combine in the GL 1.3 core. But the real programmable work was being done in an ever-increasing suite of vendor-specific extensions, with each set being incompatible with the other.

When NVIDIA owned the graphics industry, this might have been a functional state of being. But with ATI now really coming onto the scene competitively, this was not acceptable.

The first attempt to fix this was by building on what was core: the texture environment. There had already been a variety of extensions that added new operations to the basic texture environment. But the purpose of the EXT_texture_crossbar extension was to generalize the texture environment. It effectively exposed fragment processing as being register-combiner like, with a register file and so forth. However, the specification itself was so poor that NVIDIA could not implement it, and without their support, it withered.

Plus, the new generation of NVIDIA and ATI fragment hardware at the time (texture shaders and ATI's fragment shaders) could do texture coordinate computation, which the crossbar extension didn't provide access to. The entire texture environment model was simply not going to be viable for the future. It could at best codify some of the hardware of its day, but it could do nothing for what came next.

The second attempt was made with ARB_vertex_program and ARB_fragment_program. These are assembly-level languages for programmable pipeline stages. They were a slightly more generalized combination of NV_vertex_program and ATI_fragment_shader. They were a reasonable pair of extensions that did the job. The design of the languages was such that they could be extended with new functionality as the need arose, much like any other facet of OpenGL.

Direct3D Note: Direct3D was far from immune to this issue itself. The "pixel" stage of Shader Model 1.0 could more accurately be called "NVIDIA's register combiners and texture shaders", because that's exactly what it was. It was not generic, and it was certainly not cross platform. The "pixel" stage of Shader Model 1.1 similarly could be called "ATI's fragment processor".
Due to this, no NVIDIA hardware before SM2 capable hardware supported SM1.1, simply because it couldn't. D3D had a single, unified way to specify a shader, but the fragment portion of those shaders had to be platform-specific. And due to the hardware differences, basic algorithms had to change based on whether you used SM1.0 or SM1.1. With SM2.0, Microsoft put their foot down and, much like the ARB_*_program extensions, forced NVIDIA and ATI to accept the same standard.

OpenGL Shading Language

With the Radeon 9700 and the GeForce FX, graphics developers had far greater power available to them than ever before. Fragment shaders could be longer and more complicated; ATI offered up to 4 dependent texture fetches, while NVIDIA leapt ahead with perfectly generic fragment shaders.

However, around this time, one of the members of the OpenGL ARB had begun an effort to remake the OpenGL API into a more future-proof API. The initial design of OpenGL did guess right for various things; OpenGL needed no special extensions to make T&L work in the GeForce 256, for example. However, OpenGL missed programmability badly, and the various vendor-specific fixes only muddied the waters.

3D Labs, at the time a vendor of professional graphics hardware, started an effort they called OpenGL 2.0. They were attempting to rebuild the entire API, and they had a plan. Part of that plan included a C-style shading language simply called the OpenGL Shading Language.

This language was the only thing to survive 3D Labs's attempt at a revamped OpenGL.

The language was workshopped and improved into the form it was released as: GLSL 1.0. Four extensions governed its use in its initial release. ARB_shader_objects defined the way to create shaders and programs, as well as how they interrelated. Yet it did not define what the actual stages were. ARB_vertex_shader defined how to use the objects to override the fixed-function vertex pipeline. ARB_fragment_shader defined how to use the objects to override the fixed-function fragment pipeline. And ARB_shading_language_100 defined the actual language used.

Eventually, OpenGL did reach version 2.0, but there was no API rewrite. GL 2.0 did officially adopt GLSL into the core, the only one of the shading languages that has been adopted.


Throughout the GLSL era, NVIDIA has been relentless in keeping ARB_vertex_program and ARB_fragment_program up-to-date for their hardware. They have released a series of extensions that add to the language, bringing it up to modern levels of functionality.

NVIDIA has also championed the use of a language called Cg. Unlike GLSL, the language is part of a separate runtime. It compiles into other shading languages; it can even be compiled into GLSL. NVIDIA's biggest selling point is that Cg is virtually identical to Direct3D's HLSL shading language.