Future of OpenGL/GPUs

Jose_Goruka · December 19, 2009, 12:32pm

I’m dissapointed towards where graphics cards are going. Specially now that larrabee is cancelled. To put it simple, why can’t i do everything my rendering thread does entirely on the GPU? OpenCL and OpenGL are both heading towards being simple and flexible, but i still can’t seem to do that.

So let’s see… how does the usual basic render thread looks like?

Frustum Culling
Occlusion optimizations (portals/visibility grid/etc.)
Batching (by wathever you have to bind)
Rendering

there’s more stuff like shadows, postprocessing, but that’s not relevant to my point.

Basically, what i mean is, doing all this entirely on the GPU (NO CPU!)

Frustum culling can be easily done on the GPU. The GPU is so parallel that even a brute force frustum culling of a million objects would do.
Occlusion optimizations would easily be solved pretty well on the GPU by just doing a near->far depth sort and querying/rendering a zpass (query visible/discard visible). GPUs can already do this pretty quickly. However, i’m not sure if this would be doable. However, grid-based visibility and portals would definitely be doable. Well, don’t mind this point so much anyway.
Batching is pretty much just sorting, and GPUs can also do this extremely efficiently
I think command buffers nowadays mean you can have a pool of data that the GPU will interpret and follow. So i guess it should be able to render too. But i don’t really think that works as they exist.

So nowadays, i guess the main issue would be between points 3) and 4)… the only way to render what i came up in 3) would be going to the CPU and back, which is just too slow.

So my humble question is, i think right now there’s stuff i don’t have in both OpenGL and OpenCL to be able to do all the rendering logic in the GPU… so my question is more like… why are GPUs not heading this way? Is there anything i’m missing?

Alfonse_Reinheart · December 19, 2009, 12:56pm

To put it simple, why can’t i do everything my rendering thread does entirely on the GPU?

Because that isn’t OpenGL’s problem. OpenGL is a low level rendering system. If you want a system that takes care of all of this for you, OpenSceneGraph is more to your tastes.

there’s more stuff like shadows, postprocessing, but that’s not relevant to my point.

No. This is entirely relevant to your point. The stuff you outline is stuff that you do. Not every application does its stuff that way, and not every application needs to. Some applications need to do more stuff than you outline. Should this hypothetical OpenGL also do that stuff, even though you don’t need it to?

A one-size-fits-all approach to this is inherently limiting in some way. That’s why OpenSceneGraph is not used for games. This is why it is useful to have a low-level rendering system.

Jose_Goruka · December 19, 2009, 1:06pm

I expected my post would be misread or misinterpreted, so let me clarify. This has nothing to do with OpenSceneGraph. I’m actually talking about low level rendering and operations that can greatly benefit from being on the GPU… as in, at the same level of stuff like computing a particle systems, skinning, or even rigid body solving (which is also done on the GPU nowadays). Frustum cull and batching are, undeniably, much faster on the GPU than on the CPU and both are already low level aspects of rendering. (even though they are not very usable nowadays). Also, you seem to focus on OpenGL, i’m also talking about OpenCL which IS general purpose.

Brolingstanz · December 19, 2009, 1:15pm

Fwiw I’d like to see things generalize more and it just can’t happen fast enough as far as I’m concerned.

Interop seems the slippery slope of contention but the prevailing north vesterlies seem to be taking us in the general direction of a Larrabee. I imagine the bean counters have to first fit the right business model to it.

Alfonse_Reinheart · December 19, 2009, 3:19pm

I’m actually talking about low level rendering and operations that can greatly benefit from being on the GPU… as in, at the same level of stuff like computing a particle systems, skinning, or even rigid body solving (which is also done on the GPU nowadays).

None of these are low-level. They have interactions with low-level components, but they are not as a whole low-level.

There are times when you don’t want to do skinning on the GPU. There are times when you want to do GPU skinning differently (dual-quaternion, compared to matrix skinning). There are different ways to do frustum culling, depending on restrictions you can put on your camera (an orthographic game, for example, can much more easily frustum cull). There are so many ways to implement a particle system that you couldn’t begin to cover them with a single system.

Even if you were to build some system that uses OpenGL and OpenCL to do these things, a one-size-fits all solution simply will not work for these components. You could make one that fits your needs, but it would not fit my needs.

And who exactly do you want to implement and maintain this code? IHVs? ATi is barely able to make GLSL work; there’s no way they’d be able to take substantial parts of a videogame’s rendering engine into their drivers.

Also, you seem to focus on OpenGL, i’m also talking about OpenCL which IS general purpose.

OpenCL is general purpose. Which means you can use it to do whatever you want. That specifically means that it should not start taking on special-purpose features like frustum culling and such.

If someone wants to write an optimized OpenCL frustum culling library, more power to them. However, OpenCL should not come with such a library, nor should OpenCL providers be required to make and maintain such a library.

Jose_Goruka · December 19, 2009, 7:37pm

I was going to reply more concisely to your post, but after reading this, i realized you didn’t even understand my original post at all (or maybe i wasn’t clear enough).

To make it short, i’m ranting that there is no way to generate render instructions from within the card (from either CL or GL), which is what makes implementing the features i described impossible. I’m not sure but i think the 360 can do it as the command buffer format is documented. Take a time and think about it since i believe i’m not asking something stupid.

Simon_Arbon · December 19, 2009, 9:02pm

Do you mean some sort of “Command Shader” that replaces display lists and lets you write OpenGL and OpenCL commands in a shader that is compiled & stored on the GPU?
Currently a series of commands are issued from a CPU program into a command buffer that is periodically flushed to the GPU for execution.
Display lists allow a command buffer to be compiled and stored on the GPU where it can be called like a subroutine, but this is still a linear sequence of instructions with no loops or branches (except NV_conditional_render extension which executes a block of commands depending on the result of an occlusion query).
If display lists are replaced with a command shader then it could include a lot more conditional logic and would remove the lag currently involved in waiting for the GPU to finish something, reading the result, deciding what to do, then sending new commands to the GPU.
It would also be a lot easier to coordinate OpenGL with physics and animation using OpenCL, where the vertex data remains in a GPU buffer and is not altered by the CPU.

Ilian_Dinev · December 19, 2009, 9:12pm

Making the gpu omnipotent any soon can be hard. That’s why I proposed a first step:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=257713#Post257713

The cpu changes global settings: blending/render-modes, tex-units, targets, etc. Parallelizing access to these triggers can’t be done, but still can be put in one core of one cluster (of 8 or 16 cores) to work while fragments and vertices are computed in other clusters, and to append commands (via an interrupt-like instruction) to an internal FIFO cmd buffer.
In any case, per-primitive occlusion test results are needed first, instead of the currently available per-batch results; and a way to fetch them from within a geom-shader.

Anyway, with SM5 at the door, we can only hope and give detailed ideas/suggestions here - to be included in SM6 or as a feature of GL, that IHVs manage to implement+expose in SM4 or SM5 cards. Vague ideas like “do everything on the gpu” don’t help anyone.
With the current model, generally there are 3 bottlenecks:

getting results from the gpu (occlusion tests). Currently countered rather easily.
too many render-calls. Countered by instancing, can be further aided by per-instance VBO/IBO offsets and per-instance primitive-count.
each glXXX() call is 300-3000 cpu cycles. Currently countered by multithreaded drivers, texture-arrays, UBOs/etc and VAOs.

Brolingstanz · December 19, 2009, 10:41pm

Can’t we just pretend everything happens on the GPU?

In the short term if you can make everything look as though it operates on the GPU you’ve gone a long ways towards abstracting and simplifying the programming model and increasing productivity, at least in theory (cf. Radpid Mind’s Sh, etc).

But then that’s just a development environment thing. Anyone (read: almost no one) can create a special purpose language and the development tools necessary to present the illusions of homogeneity to the user, even if under the hood the implementation is a 2 headed monster with festering pustules.

I could be wrong but I think a great many coders would be tickled pink with something short of the real GPU McCoy. Though the questions of what form it takes and what it targets remain, either of which could paradoxically deepen the original quagmire; and neither of which have much to do with GL (or so it would seem).

Alfonse_Reinheart · December 20, 2009, 12:02am

To make it short, i’m ranting that there is no way to generate render instructions from within the card (from either CL or GL), which is what makes implementing the features i described impossible.

There is a good reason why this is impossible, at least as far as OpenGL is concerned.

OpenGL, as a specification, must maintain consistency about the order in which things are rendered. Conditional render is a way to make the rendering of one thing depend on the rendering of another. But where and when the dependent rendering will happen (if it does) is very well defined.

This is why threading OpenGL rendering to the same context is not possible. If I have started some process that will spawn rendering commands, then I issue some commands myself, which ones happen first? Is there some control over the order? How do you define an order for something like that? Can I tell when those commands have been completed?

The kind of thing you’re talking about is very sketchy. Even the 360’s GPU can’t issue commands to itself. A shader may be able to build a command buffer sequence, but it can’t tell the GPU to execute it. Only the application can do that.

too many render-calls. Countered by instancing, can be further aided by per-instance VBO/IBO offsets and per-instance primitive-count.

Correct me if I’m wrong, but isn’t a per-instance primitive count called “glMultiDraw*”?

Also, I’d be interested to know what kind of scene it is that you’re rendering that can withstand the limitations on instancing/multidraw (namely, no shader parameter/texture changes).

To me, it is these kinds of changes that are the most limiting.

Ilian_Dinev · December 20, 2009, 1:15am

Except that the 3 extensions explicitly state that glMultiDraw* will NOT update gl_InstanceID. Thus, what I asked for is not yet supported.

P.S:

Also, I’d be interested to know what kind of scene it is that you’re rendering that can withstand the limitations on instancing/multidraw (namely, no shader parameter/texture changes).
To me, it is these kinds of changes that are the most limiting.

Via gl_instanceID and the per-instance primitive-count, we could draw many different objects that use a specific shader and some common uniform values; while each “instance” (defacto can be completely different geometry and textures via tex-arrays) can fetch with gl_InstanceID its “instance”-uniforms and texarray.z . A tiny change in specs, and the cpu can be relieved of lots of calls.
Though, we currently definitely aren’t hurting for such extra functionality yet.

Jose_Goruka · December 20, 2009, 8:28am

Yeah, this is exactly what i mean, right now this is a huge bottleneck. Since in GL also (unlike D3D), everything works with numerical IDs, i dont see why you couldn’t rebind shaders, arrays or textures without having to go through the GPU if you had such kind of command buffers…

Jose_Goruka · December 20, 2009, 8:41am

This is fine, pretty much exactly how it should work.

well, rendering a scene all with the same material sounds pretty limiting to me. Also you still have to change arrays and textures, so for that instancing and multidraw are not very useful except for batching small groups.

In other words, I should be the one asking you what kind of scenes are you talking about, because as far as i know, every technology/engine/game I’ve seen or worked with pretty much does

Frustum culling
Batching to minimize state changes
some sort of visibility optimization

when rendering large scenes. That’s like, a very common case which should benefit enormously from some sort of hardware support for it. If you don’t care about doing stuff that way, i can understand, but i’m talking about what most users would benefit from.

Jose_Goruka · December 20, 2009, 8:46am

Except that the 3 extensions explicitly state that glMultiDraw* will NOT update gl_InstanceID. Thus, what I asked for is not yet supported.

P.S:

Also, I’d be interested to know what kind of scene it is that you’re rendering that can withstand the limitations on instancing/multidraw (namely, no shader parameter/texture changes).
To me, it is these kinds of changes that are the most limiting.

Ah, yeah this is something similar to what i mentioned, since you could do pretty much everything on the GPU… i guess the main limitation is not being able to switch arrays and textures per instance

Ilian_Dinev · December 20, 2009, 9:03am

You can switch vertex-attribs, textures and uniforms per instance with the current GL3.2. The gl_InstanceID is the key-factor, which lets your vtx/frag shaders choose where to pull data from; said data being: vtx-attribs, z-slice of a texture-array, whole chunks of uniforms for the current object/instance. With double/indirect-referencing you can also compress those chunks. Anyway, effectively currently you can render in one call 1000 unique objects, that have completely different geometry, textures, parameters (and subsequently shading) . It’s just that those 1000 objects currently must share a common num_primitives count. And occlusion-queries cannot be updated or affect on each instance; the granularity is just all_instances-or-nothing right now.

Jose_Goruka · December 20, 2009, 10:54am

Ah, i think i understand… you mean indexing all textures in a texture array, and bind most buffers in a single one?

That actually sounds like it could work (would need a little client side memory management but doesn’t sound so bad). Guess the only problem is that all textures have to be the same size in the texarray (well, not sure if it’s still the case)…

About occlusion queries… i guess you are right, but something like portals and rooms should still work fine if done from the CPU side, as it’s pretty cheap to compute. Still though, i’m not sure how one would specify the amount of primitives.

Alfonse_Reinheart · December 20, 2009, 11:48am

Via gl_instanceID and the per-instance primitive-count, we could draw many different objects that use a specific shader and some common uniform values; while each “instance” (defacto can be completely different geometry and textures via tex-arrays) can fetch with gl_InstanceID its “instance”-uniforms and texarray.z . A tiny change in specs, and the cpu can be relieved of lots of calls.

I’m not sure that this will buy you any real performance gains. Rendering calls in OpenGL are already pretty lightweight. What it buys you is mostly in state changes (not binding new textures/uniforms/buffers), not in rendering calls.

Also, I’m pretty sure that this would not be a “tiny” change as far as implementations go. Hardware makes regular instanced rendering possible and efficient. That’s why ARB_draw_instanced is only supported on 3.x hardware, even though they could emulate support internally with uniform updates between individual draw calls. If hardware doesn’t support MultiDraw with a gl_Instance bump between draws directly, then the driver is basically going to have to set a uniform whenever time it draws. Which I’m guessing is probably not the kind of performance profile you’re looking for

You can switch vertex-attribs, textures and uniforms per instance with the current GL3.2.

Not vertex arrays. Not unless you are manually reading your vertex data from a buffer with something like a buffer texture or a UBO or bindless (which is NV specific). And if you are, you’ve already lost quite a bit of per-vertex-shader performance.

Or are you sending the attributes for multiple objects and you’re just picking between them in the vertex shader? That’s still quite a bit of performance loss, since it has to pull memory for each of those attribute lists, regardless of whether or not they are used.

Since in GL also (unlike D3D), everything works with numerical IDs, i dont see why you couldn’t rebind shaders, arrays or textures without having to go through the GPU if you had such kind of command buffers.

Because hardware doesn’t work with numerical IDs. Hardware uses pointers, which currently shaders have no access to. Even the internal driver converts these to pointers ASAP.

And of course, having the GPU write such command buffers is fraught with peril with regard to object lifetimes. What if the application orphans some texture? The user has said that it can die, but the card is still using it for rendering something at the moment, so it still exists. Well, what happens if a shader, previously dispatched, decides to reference it? By the time that command buffer gets dispatched, the driver may have deleted the object. The driver would somehow have to be notified that the orphaned object is still in use.

This is why it is important that OpenGL be a pure input-driven system. You give it commands, and it executes them.

Now, if you want OpenCL to generate such a list, that is entirely possible. As long as you don’t expect to feed this list directly into OpenGL (ie: you have to load the list client-side and process it into rendering commands), that’s fine.

Ilian_Dinev · December 20, 2009, 12:32pm

Yes, stating again: this per-instance-count is not useful enough for performance just yet. The specs note this, too.
UBO/tex-buffer, yes. On GF it skips the dedicated attrib-fetch units, but threading hides the latency. On RHD the attrib-fetch is done this way anyway.

yooyo · December 20, 2009, 6:50pm

@Jose:
Your requests are related to forward and deferred rendering. But there is a some new and different approach to rendering, like megatexture in upcoming game Rage. JC in this title created new technology which utilize GPU memory way better than current games. Drawback is he cant use existing texture filters. Everything was calculated in a shader. He more needs access to whole GPU memory than sorting, batching, occlusion culling, etc.

http://en.wikipedia.org/wiki/Id_Tech_5
http://en.wikipedia.org/wiki/MegaTexture

Alfonse_Reinheart · December 20, 2009, 7:34pm

Your requests are related to forward and deferred rendering. But there is a some new and different approach to rendering, like megatexture in upcoming game Rage.

Megatexture and similar technologies are primarily concerned with keeping the right amounts of the right things in memory. Rendering those things can be done either forward or deferred, using the same techniques as regular rendering systems do. So one doesn’t really affect the other very much.