New OpenGL extension

I thought I’d post here and get some opinions on the idea before
posting on the nVidia dev forums…

Well, as work on my engine progressed over the last three years or so, I
was confronted (as usual) with the problem of real-time shadowing. It’s
pointless to explain the evolution my shadow algos went through (my first
idea was to use something I called “vis maps”), so here’s where I ended up: I
eventually came to the conclusion that “shadow volumes are The Right
Thing” (as usual :slight_smile: ). Now, after some time of work, they work almost
perfectly - but with some considerable drawbacks. The most obvious is that
what I call “light volumes” (basically an inversion of SVs that fits better
into the architecture of my engine) have to be extruded/clipped completely
on the CPU. Curved-surface SVs aren’t really perfect yet either (normal-based
extrusion on the GPU), and solid objects are still a pain in the ass - the
so-called “semi-automatic GPU extrusion” is something I strongly detest
(I’m still looking for something better).

Now the time has come for shadow maps. Not only do they allow various
soft-shadow implementations, but - and this is the most important
thing - they are completely geometry-independent. They can be safely and easily
implemented on top of almost every existing engine architecture without
breaking anything else. In the worst case, they can simply be an
optional feature.

That said, the time isn’t far off when it will be possible to
have 10-20 moving omnidirectional lights in a complex scene in real time.
If I may quote Carmack :slight_smile: , in the end things always come down to raw power.
The GeForce 7800 may not manage the aforementioned 20 lights, but it
is quite conceivable that the 8800 (for example) won’t have a problem
processing that much geometry.

So, finally, I’m coming to the point :slight_smile: . Omnidirectional lights require
rendering the world six times with a 90 degree frustum into a cubemap.
IMO, it would be much more efficient to have the card automatically
render the world into the cubemap through an extension, instead of having
to set up the view 6 separate times. Consider all the world
objects visible from the light’s point of view: you have to perform frustum
culling on the CPU before sending the final geometry to the GPU, where
it is culled one more time, processed, assembled and then finally
rendered - and that for each of the six faces. Having all this work done on
the card would save a lot of CPU time, and also improve cache coherency and copy times.
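
For reference, here is a rough sketch of what the six-pass setup looks like today with EXT_framebuffer_object; `shadowFbo`, `distanceCubeTexture`, `faceViewMatrix`, `cullAgainstFaceFrustum` and `drawSceneDistanceOnly` are illustrative placeholders, not real API:

```cpp
// Rough sketch of the conventional six-pass approach with EXT_framebuffer_object:
// the light-space distance is written into a colour cubemap, one face at a time.
// A depth renderbuffer is assumed to be attached to shadowFbo already.
void renderShadowCubemap(const float lightPos[3])
{
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, shadowFbo);

    for (int face = 0; face < 6; ++face) {
        // attach the current cube face as the render target
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_CUBE_MAP_POSITIVE_X + face,
                                  distanceCubeTexture, 0);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        glMatrixMode(GL_PROJECTION);
        glLoadIdentity();
        gluPerspective(90.0, 1.0, zNear, zFar);   // 90 degree FOV, square faces

        glMatrixMode(GL_MODELVIEW);
        glLoadMatrixf(faceViewMatrix(lightPos, face));

        // CPU-side frustum culling happens once per face, then one full GPU pass
        drawSceneDistanceOnly(cullAgainstFaceFrustum(lightPos, face));
    }
}
```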

NV_render_to_cubemap should be the next thing to come :slight_smile:

Ideas / comments?

I do not see any need for this.

It still comes down to rendering the geometry six times and I also don’t believe you can optimise a lot in this case.

Compare :

  1. current method:
  • for 6 views do {
    • (optional) do basic culling of geometry with CPU
    • send geometry to GPU
    • GPU draws scene (transform, precise culling, rasterisation)
      }
  2. HellKnight’s proposition:
  • (optional) do basic culling of geometry with CPU
  • send geometry to GPU
  • for 6 views do {
    • GPU draws scene (transform, precise culling, rasterisation)
      }

So as I understand it, the GPU part would stay almost the same, but the improvement is on CPU usage and bus transfer.

Originally posted by ZbuffeR:
Compare :

So as I understand it, the GPU part would stay almost the same, but the improvement is on CPU usage and bus transfer.
If you store your data in a VBO, it already sits in video memory - you still need only one copy. This extension is targeted at limiting the functionality; that’s why I am against the idea.

But apart from the 6 additional depth buffers, isn’t it a bit of a hassle to implement without a separate camera matrix?

It doesn’t sound like it will save any clock cycles.
I’m against it :slight_smile:

Originally posted by Zengar:
This extension is targeted at limiting the functionality, that’s why I am against that idea.
Agreed. But this could be easily extended.

Imagine something like MRT, but instead of multiple color outputs in the fragment shader, multiple position outputs in the vertex shader.

Originally posted by V-man:
It doesn’t sound like it will save any clock cycles.
Agreed, too. But take geometry shaders into account, and suddenly it makes sense: the geometry shader only has to be executed once instead of six times :wink:
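
(As a purely speculative sketch of what that might look like once layered rendering and geometry shaders exist: a geometry shader could replicate each triangle to all six cube faces by writing gl_Layer. The `faceMatrices` uniform and the GLSL syntax below are assumptions, not an existing extension.)

```cpp
// Purely speculative sketch: a layered-rendering geometry shader that sends a
// copy of each incoming triangle to every cube face by writing gl_Layer.
// "faceMatrices" is an assumed uniform (one view-projection matrix per face).
const char* cubeFaceGeometryShader =
    "#version 150\n"
    "layout(triangles) in;\n"
    "layout(triangle_strip, max_vertices = 18) out;\n"
    "uniform mat4 faceMatrices[6];\n"
    "void main() {\n"
    "    for (int face = 0; face < 6; ++face) {\n"
    "        for (int i = 0; i < 3; ++i) {\n"
    "            gl_Layer    = face;  // route this vertex to cube face 'face'\n"
    "            gl_Position = faceMatrices[face] * gl_in[i].gl_Position;\n"
    "            EmitVertex();\n"
    "        }\n"
    "        EndPrimitive();\n"
    "    }\n"
    "}\n";
```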

This is a quote of a reply on the same topic on comp.graphics.api.opengl:

=================================================

Subject: Re: Proposal for a new OpenGL extension
From: sauron <sauron@localhost.localdomain>
Newsgroups: comp.graphics.api.opengl
Date: Mon, 11 Jul 2005 19:35:55 +0200

On Sun, 10 Jul 2005 16:56:02 +0200, fungus wrote:

> To do that you’d need six different view matrices
> and six different rendering pipelines on the card
> so it’s not likely to happen.

Well, why? You could split the geometry across several vertex pipelines,
whose output is linked to one or more fragment pipelines, whose output
in turn is linked to one of the six cube faces (or FBOs, or whatever). The 5
additional view matrices could easily be computed from the modelview
matrix, so each vertex pipeline would have its own modelview matrix.

=================================================
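
(For what it’s worth, deriving the six per-face view matrices really is trivial. A minimal sketch using gluLookAt, assuming a `lightPos` and the usual cubemap face/up-vector convention; the function name is illustrative:)

```cpp
// Sketch: deriving the six cube-face view matrices from a single light position.
// Look directions and up vectors follow the usual cubemap face convention, so
// each matrix differs from the front view only by a 90-degree rotation.
static const float kFaceDir[6][3] = {
    { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1}
};
static const float kFaceUp[6][3] = {
    { 0,-1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1}, { 0,-1, 0}, { 0,-1, 0}
};

void loadFaceViewMatrix(const float lightPos[3], int face)
{
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    gluLookAt(lightPos[0], lightPos[1], lightPos[2],
              lightPos[0] + kFaceDir[face][0],
              lightPos[1] + kFaceDir[face][1],
              lightPos[2] + kFaceDir[face][2],
              kFaceUp[face][0], kFaceUp[face][1], kFaceUp[face][2]);
}
```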

:eek:

Can you imagine the horror of a driver dev when you say this to him?

:smiley:

On the one hand, I can understand how this might speed some things up - I can imagine rendering the shadow cubemap in one pass without textures and without fragment shaders, then rendering the scene with textures, shadows and fragment shaders in the next pass.

On the other hand:
Either there’s going to be roughly 6 times as much hardware involved (my fan already drives me crazy, and to think of the monster you’d need to cool that rig…), or you’re going to end up with a shared pipeline.

And if you want to use it for environment mapping, you still need the fragment shader functionality in all 6 slices.

Ultimately, all you save is the cost of telling the hardware to do it. And probably not even that.

You still need to T&L the verts 6 times, once for each set of matrices. You still need to rasterize (or attempt to rasterize) each triangle 6 times. The only savings you get is making one call to render each mesh, as you still need to provide the 6 matrices in question.

Well, shadow maps will definitely be the main shadow technique used in future engines (since the current trend is to move as much of the graphics computation as possible from the CPU onto the GPU), so I think the step of rendering directly into cubemaps will be taken sooner or later…

You posted while I was typing.

Thinking a little bit more about it, you’d need to transform all of the vertices only ONCE (say, for the front view), assuming you’re using an axis-aligned cubemap (which is mostly the case). To obtain the transformed coordinates for the rest of the five views is simply a matter of swizzling and negating (since only rotations of 90 degrees are involved), which are already efficiently implemented in hardware. Culling against a 90 degree FOV doesn’t need any complex computations either. So looking at the whole thing from this point of view, many things could surely be optimized (maybe using only one vertex pipeline and leaving the swizzling to the rasterizer?) -> an improvement over rendering the whole view six times…
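
(To illustrate what the swizzling would look like: in light space, before any projection, the view-space coordinates for each axis-aligned face are just a permutation and sign flip of the front-face coordinates. A sketch, with up vectors chosen for convenience rather than the exact cubemap convention:)

```cpp
// Sketch: with the light at the origin and no projection applied yet, the
// view-space position for each axis-aligned cube face is a permutation and
// sign flip of the front-face coordinates.
struct Vec3 { float x, y, z; };

Vec3 frontFace (Vec3 p) { return { p.x,  p.y,  p.z }; }  // looking down -Z
Vec3 backFace  (Vec3 p) { return {-p.x,  p.y, -p.z }; }  // looking down +Z
Vec3 rightFace (Vec3 p) { return { p.z,  p.y, -p.x }; }  // looking down +X
Vec3 leftFace  (Vec3 p) { return {-p.z,  p.y,  p.x }; }  // looking down -X
Vec3 topFace   (Vec3 p) { return { p.x,  p.z, -p.y }; }  // looking down +Y
Vec3 bottomFace(Vec3 p) { return { p.x, -p.z,  p.y }; }  // looking down -Y
```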

To obtain the transformed coordinates for the rest of the five views is simply a matter of swizzling and negating (since only rotations of 90 degrees are involved), which are already efficiently implemented in hardware.
Um, perspective transforms don’t work that way. You can’t just swizzle the XYZs of a perspective-space vertex and get the 6 sets of coordinates out of it. You might be able to simply invert the z to get the rear facing, but I shudder to think what the result would be of taking a perspective-space z value and pretending that it’s, for example, a y value. So you’re still going to need to process the geometry 3 times.

Plus, since you’re not rasterizing all of these vertices at once, you’d have to store the results somewhere and reuse them. Now that’s a place where a clever programmer could come in. Because you could render the world-space vertices and the other attributes that don’t depend on the camera to a vertex array (once they figure out how to provide this ability), and then use that buffer to do a quick projection transform for the 6 facings. This could be a win, especially if you have skinned geometry that needs pretty hefty vertex shaders.

Hm, you’re right. However, to achieve the best possible result, it’d be desirable to have each rasterization pipeline (bundle?) render only 1/6th of the whole geometry. That could be done with a simple extension to the clipping process, assigning each triangle to one, two or three of the 6 views (since a triangle can lie on the boundary of one, two or three cubemap faces); in that case this would have to be implemented after the primitive assembly step. However, if projection is performed right before rasterization (and not right after the world->view transformation), even the intermediate vertex buffer wouldn’t be needed anymore.

I have to admit that I don’t get that last part.

However, isn’t there some automatic nVidia functionality for 3D glasses? Doesn’t that mean that nVidia has dealt with it in a limited way already?

T101: the stereoscopy is a driver thing. Basically, the driver renders two consecutive frames, each one with a slightly different view matrix (the second one describing a view shifted a little bit to the left or the right), and synchronizes them with the shutter glasses, thus creating the illusion of “real 3D”.

What I am proposing is to render the world only once, but with each 1/6th of it rendered into a different texture buffer (= cubemap face).

This could be achieved, as already stated above, by inserting a clipping / texture-buffer-assigning / rotating & projecting step (the latter removed from vertex processing) right after the primitive assembly stage. Each rasterization pipeline bundle would then need to render only roughly 1/6th of the triangles submitted to the vertex pipeline (in a uniformly tessellated world, that is).

I agree that in a perfect world such an extension would not be needed. With indexed vertex buffers (the common case) one could just tell the card, “render indices 1-50 to the first cubemap face, then render indices 51-100 to the second, and so on”, ending up with roughly the same performance (I guess it’d be a little bit slower). However, we don’t live in a perfect world: most of the time the polygons aren’t ordered (clockwise around the light, for example) such that one contiguous set of indices covers ALL the triangles contained in a given cubemap-face frustum. So the CPU first has to cull all geometry not contained in the current cubemap-face frustum, then build up an index buffer and send it to video memory. Doing this for every one of the six faces is surely wasted computation time.
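
(Roughly, the per-face CPU work being described looks like the sketch below; `triangleIntersectsFaceFrustum`, `bindCubeFaceAndViewMatrix`, `indices`, `triangleCount`, `lightPos` and `faceIndexVBO` are all made up for illustration.)

```cpp
// Sketch of the per-face CPU work: cull every triangle against each cube-face
// frustum, rebuild an index list, re-upload it, draw - six times per frame.
#include <vector>

void renderShadowCubemapWithCpuCulling()
{
    std::vector<unsigned int> faceIndices;

    for (int face = 0; face < 6; ++face) {
        faceIndices.clear();
        for (size_t t = 0; t < triangleCount; ++t) {
            if (triangleIntersectsFaceFrustum(lightPos, face, t)) {
                faceIndices.push_back(indices[3*t + 0]);
                faceIndices.push_back(indices[3*t + 1]);
                faceIndices.push_back(indices[3*t + 2]);
            }
        }

        // a fresh index buffer upload for every face, every frame
        glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, faceIndexVBO);
        glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB,
                        faceIndices.size() * sizeof(unsigned int),
                        &faceIndices[0], GL_STREAM_DRAW_ARB);

        bindCubeFaceAndViewMatrix(face);   // as in the earlier sketches
        glDrawElements(GL_TRIANGLES, (GLsizei)faceIndices.size(),
                       GL_UNSIGNED_INT, 0);
    }
}
```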

I hope you got it now :slight_smile:

Well, I got the language the first time, it’s just that I don’t see how you can send one triangle to only one of the cubemap faces.

Don’t the orientation and the frustum determine in which cubemap faces a triangle ends up? What if your triangle is not coplanar with exactly one of the cubemap faces? It would end up in 2 or often 3 cubemap faces if you ask me. So where does the 1/6th come from?

Shame about the stereoscopy thing: I thought it would have been the same frame from two different camera viewpoints.

Originally posted by HellKnight:
That could be done with a simple extension to the clipping process, assigning each triangle to one, two or three of the 6 views (since a triangle can lie on the boundary of one, two or three cubemap faces); in such a case…
The “extended-assembler” would do just that: depending on the 6 frustums, for each triangle it sets a “triangle flag” describing which of the 6 rasterization pipeline bundles should process it further (=render it)…

The 1/6th is a rough approximation. In reality, a rasterization pipeline would need to process more or fewer triangles. In a uniformly tessellated world (say, the inside of a sphere, for example), each bundle would have to render a little more than 1/6th, since some of the polygons, as stated above, will lie on a boundary between cubemap faces and thus would need to be rasterized by two or even three pipeline bundles.

Thinking a little bit more about it, you’d need to transform all of the vertices only ONCE (say, for the front view), assuming you’re using an axis-aligned cubemap (which is mostly the case).
Typically, the vertex transform stage isn’t the bottleneck in most programs, so even if this were possible, it wouldn’t help much.

If GPU makers decide to add additional circuitry that has the ability to render to multiple faces in parallel, then an extension that could cooperate with this feature would be useful.