Future of OpenGL/GPUs



Jose Goruka
12-19-2009, 01:32 PM
I'm disappointed with where graphics cards are going, especially now that Larrabee has been cancelled. To put it simply, why can't I do everything my rendering thread does entirely on the GPU? OpenCL and OpenGL are both heading towards being simple and flexible, but I still can't seem to do that.

So let's see... what does the usual basic render thread look like?

1) Frustum Culling
2) Occlusion optimizations (portals/visibility grid/etc.)
3) Batching (by whatever you have to bind)
4) Rendering

There's more stuff like shadows and postprocessing, but that's not relevant to my point.

Basically, what I mean is doing all of this entirely on the GPU (NO CPU!):

1) Frustum culling can easily be done on the GPU. The GPU is so parallel that even brute-force frustum culling of a million objects would be fast enough (see the sketch after this list).

2) Occlusion optimizations could be handled pretty well on the GPU by doing a near-to-far depth sort and then querying/rendering a z-pass (query visibility, discard what's hidden). GPUs can already do this pretty quickly, though I'm not sure the whole approach is doable yet. Grid-based visibility and portals would definitely be doable, however. Well, don't mind this point so much anyway.

3) Batching is pretty much just sorting, and GPUs can also sort extremely efficiently.

4) I think command buffers nowadays mean you can have a pool of data that the GPU will interpret and execute, so I guess it should be able to drive its own rendering too. But I don't really think that works with command buffers as they exist today.
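
Just to illustrate point 1: here's a minimal sketch of the brute-force test, written as an OpenCL C kernel. It assumes object bounds are stored as spheres (xyz = center, w = radius) and that the six frustum planes are uploaded as plane equations; all the names are made up for the example.

__kernel void frustum_cull(__global const float4 *bounds,   // xyz = center, w = radius
                           __constant float4 *planes,       // 6 planes: xyz = normal, w = distance
                           __global uint *visible,
                           const uint object_count)
{
    size_t i = get_global_id(0);
    if (i >= object_count)
        return;

    float4 b = bounds[i];
    uint inside = 1;
    for (int p = 0; p < 6; ++p) {
        // signed distance from the sphere center to the plane
        float d = dot(planes[p].xyz, b.xyz) + planes[p].w;
        if (d < -b.w) {          // sphere is completely behind this plane
            inside = 0;
            break;
        }
    }
    visible[i] = inside;         // 1 = at least partially inside the frustum
}

One work-item per object, and the visibility flags stay in GPU memory — which is exactly why the hand-off to the actual draw calls (point 4) is the sticking point.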

So nowadays, I guess the main issue lies between points 3) and 4): the only way to render what I came up with in 3) is to go back to the CPU and then to the GPU again, which is just too slow.

So, my humble question... I think right now both OpenGL and OpenCL lack what I'd need to do all the rendering logic on the GPU. So my question is more like: why are GPUs not heading this way? Is there anything I'm missing?

Alfonse Reinheart
12-19-2009, 01:56 PM
To put it simply, why can't I do everything my rendering thread does entirely on the GPU?

Because that isn't OpenGL's problem. OpenGL is a low level rendering system. If you want a system that takes care of all of this for you, OpenSceneGraph is more to your tastes.


There's more stuff like shadows and postprocessing, but that's not relevant to my point.

No. This is entirely relevant to your point. The stuff you outline is stuff that you do. Not every application does its stuff that way, and not every application needs to. Some applications need to do more stuff than you outline. Should this hypothetical OpenGL also do that stuff, even though you don't need it to?

A one-size-fits-all approach to this is inherently limiting in some way. That's why OpenSceneGraph is not used for games. This is why it is useful to have a low-level rendering system.

Jose Goruka
12-19-2009, 02:06 PM
Because that isn't OpenGL's problem. OpenGL is a low level rendering system. If you want a system that takes care of all of this for you, OpenSceneGraph is more to your tastes.


I expected my post would be misread or misinterpreted, so let me clarify: this has nothing to do with OpenSceneGraph. I'm actually talking about low-level rendering operations that can greatly benefit from running on the GPU, at the same level as computing particle systems, skinning, or even rigid-body solving (which is also done on the GPU nowadays). Frustum culling and batching are, undeniably, much faster on the GPU than on the CPU, and both are already low-level aspects of rendering (even though they are not very usable that way nowadays). Also, you seem to focus on OpenGL; I'm also talking about OpenCL, which _IS_ general purpose.

Brolingstanz
12-19-2009, 02:15 PM
FWIW, I'd like to see things generalize more, and it just can't happen fast enough as far as I'm concerned.

Interop seems to be the slippery slope of contention, but the prevailing north westerlies seem to be taking us in the general direction of a Larrabee. I imagine the bean counters have to fit the right business model to it first.

Alfonse Reinheart
12-19-2009, 04:19 PM
I'm actually talking about low-level rendering operations that can greatly benefit from running on the GPU, at the same level as computing particle systems, skinning, or even rigid-body solving (which is also done on the GPU nowadays).

None of these are low-level. They have interactions with low-level components, but they are not as a whole low-level.

There are times when you don't want to do skinning on the GPU. There are times when you want to do GPU skinning differently (dual-quaternion, compared to matrix skinning). There are different ways to do frustum culling, depending on restrictions you can put on your camera (an orthographic game, for example, can much more easily frustum cull). There are so many ways to implement a particle system that you couldn't begin to cover them with a single system.

Even if you were to build some system that uses OpenGL and OpenCL to do these things, a one-size-fits-all solution simply will not work for these components. You could make one that fits your needs, but it would not fit my needs.

And who exactly do you want to implement and maintain this code? IHVs? ATi is barely able to make GLSL work; there's no way they'd be able to take substantial parts of a videogame's rendering engine into their drivers.


Also, you seem to focus on OpenGL; I'm also talking about OpenCL, which _IS_ general purpose.

OpenCL is general purpose. Which means you can use it to do whatever you want. That specifically means that it should not start taking on special-purpose features like frustum culling and such.

If someone wants to write an optimized OpenCL frustum culling library, more power to them. However, OpenCL should not come with such a library, nor should OpenCL providers be required to make and maintain such a library.

Jose Goruka
12-19-2009, 08:37 PM
OpenCL is general purpose. Which means you can use it to do whatever you want. That specifically means that it should not start taking on special-purpose features like frustum culling and such.

If someone wants to write an optimized OpenCL frustum culling library, more power to them. However, OpenCL should not come with such a library, nor should OpenCL providers be required to make and maintain such a library.

I was going to reply more concisely to your post, but after reading this, I realized you didn't even understand my original post (or maybe I wasn't clear enough).

To make it short, I'm ranting that there is no way to generate render instructions from within the card (from either CL or GL), which is what makes the features I described impossible to implement. I'm not sure, but I think the 360 can do it, since its command-buffer format is documented. Take a moment and think about it; I believe I'm not asking for something stupid.

Simon Arbon
12-19-2009, 10:02 PM
Do you mean some sort of "Command Shader" that replaces display lists and lets you write OpenGL and OpenCL commands in a shader that is compiled and stored on the GPU?
Currently, a series of commands is issued from a CPU program into a command buffer that is periodically flushed to the GPU for execution.
Display lists allow a command buffer to be compiled and stored on the GPU, where it can be called like a subroutine, but this is still a linear sequence of instructions with no loops or branches (except for the NV_conditional_render extension, which executes a block of commands depending on the result of an occlusion query).
If display lists were replaced with a command shader, it could include a lot more conditional logic and would remove the lag currently involved in waiting for the GPU to finish something, reading the result, deciding what to do, and then sending new commands to the GPU.
It would also make it a lot easier to coordinate OpenGL with physics and animation using OpenCL, where the vertex data remains in a GPU buffer and is not altered by the CPU.
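
For reference, the conditional-render path mentioned above already looks roughly like this from the CPU side (GL 3.0 core, or NV_conditional_render). This is just a minimal C sketch, assuming a current GL 3.0 context, with DrawOccluderBox() and DrawExpensiveObject() standing in for application code:

GLuint query;
glGenQueries(1, &query);

/* Pass 1: cheap occlusion test, e.g. a bounding box with color/depth writes off. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);
glBeginQuery(GL_SAMPLES_PASSED, query);
DrawOccluderBox();
glEndQuery(GL_SAMPLES_PASSED);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

/* Pass 2: the expensive draw only executes if any samples passed;
   the result never has to travel back to the CPU. */
glBeginConditionalRender(query, GL_QUERY_WAIT);
DrawExpensiveObject();
glEndConditionalRender();

The point of a command shader would be to generalize this single hard-wired condition into arbitrary GPU-side logic.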

Ilian Dinev
12-19-2009, 10:12 PM
Making the GPU omnipotent any time soon will be hard. That's why I proposed a first step:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=257713#Post257713

The CPU changes global settings: blending/render modes, texture units, targets, etc. Access to these toggles can't be parallelized, but it could still be put on one core of one cluster (of 8 or 16 cores), working while fragments and vertices are computed on other clusters, and appending commands (via an interrupt-like instruction) to an internal FIFO command buffer.
In any case, per-primitive occlusion-test results are needed first, instead of the currently available per-batch results, along with a way to fetch them from within a geometry shader.

Anyway, with SM5 at the door, we can only hope and give detailed ideas/suggestions here, to be included in SM6 or as a GL feature that IHVs manage to implement and expose on SM4 or SM5 cards. Vague ideas like "do everything on the GPU" don't help anyone.
With the current model, there are generally three bottlenecks:
- Getting results back from the GPU (occlusion tests). Currently countered rather easily.
- Too many render calls. Countered by instancing; could be further helped by per-instance VBO/IBO offsets and a per-instance primitive count.
- Each glXXX() call costs 300-3000 CPU cycles. Currently countered by multithreaded drivers, texture arrays, UBOs, VAOs, etc.

Brolingstanz
12-19-2009, 11:41 PM
Can't we just pretend everything happens on the GPU?

In the short term, if you can make everything look as though it operates on the GPU, you've gone a long way towards abstracting and simplifying the programming model and increasing productivity, at least in theory (cf. RapidMind's Sh, etc.).

But then that's just a development-environment thing. Anyone (read: almost no one) can create a special-purpose language and the development tools necessary to present the illusion of homogeneity to the user, even if under the hood the implementation is a two-headed monster with festering pustules.

I could be wrong, but I think a great many coders would be tickled pink with something short of the real GPU McCoy. Though the questions of what form it takes and what it targets remain, either of which could paradoxically deepen the original quagmire, and neither of which has much to do with GL (or so it would seem).

Alfonse Reinheart
12-20-2009, 01:02 AM
To make it short, I'm ranting that there is no way to generate render instructions from within the card (from either CL or GL), which is what makes the features I described impossible to implement.

There is a good reason why this is impossible, at least as far as OpenGL is concerned.

OpenGL, as a specification, must maintain consistency about the order in which things are rendered. Conditional render is a way to make the rendering of one thing depend on the rendering of another. But where and when the dependent rendering will happen (if it does) is very well defined.

This is why threading OpenGL rendering to the same context is not possible. If I have started some process that will spawn rendering commands, then I issue some commands myself, which ones happen first? Is there some control over the order? How do you define an order for something like that? Can I tell when those commands have been completed?

The kind of thing you're talking about is very sketchy. Even the 360's GPU can't issue commands to itself. A shader may be able to build a command buffer sequence, but it can't tell the GPU to execute it. Only the application can do that.


- too many render-calls. Countered by instancing, can be further aided by per-instance VBO/IBO offsets and per-instance primitive-count.

Correct me if I'm wrong, but isn't a per-instance primitive count called "glMultiDraw*"?

Also, I'd be interested to know what kind of scene it is that you're rendering that can withstand the limitations on instancing/multidraw (namely, no shader parameter/texture changes).

To me, it is these kinds of changes that are the most limiting.

Ilian Dinev
12-20-2009, 02:15 AM
Correct me if I'm wrong, but isn't a per-instance primitive count called "glMultiDraw*"?
Except that the 3 extensions explicitly state that glMultiDraw* will NOT update gl_InstanceID. Thus, what I asked for is not yet supported.

P.S:


Also, I'd be interested to know what kind of scene it is that you're rendering that can withstand the limitations on instancing/multidraw (namely, no shader parameter/texture changes).
To me, it is these kinds of changes that are the most limiting.


Via gl_InstanceID and the per-instance primitive count, we could draw many different objects that use a specific shader and some common uniform values, while each "instance" (which can de facto be completely different geometry and textures via tex-arrays) fetches its "instance" uniforms and texarray.z with gl_InstanceID. A tiny change in the specs, and the CPU could be relieved of lots of calls.
Though, we definitely aren't hurting for such extra functionality just yet.

Jose Goruka
12-20-2009, 09:28 AM
If display lists were replaced with a command shader, it could include a lot more conditional logic and would remove the lag currently involved in waiting for the GPU to finish something, reading the result, deciding what to do, and then sending new commands to the GPU.
It would also make it a lot easier to coordinate OpenGL with physics and animation using OpenCL, where the vertex data remains in a GPU buffer and is not altered by the CPU.

Yeah, this is exactly what I mean; right now this is a huge bottleneck. And since in GL (unlike D3D) everything works with numerical IDs, I don't see why you couldn't rebind shaders, arrays or textures without having to go through the CPU, if you had that kind of command buffer.

Jose Goruka
12-20-2009, 09:41 AM
The kind of thing you're talking about is very sketchy. Even the 360's GPU can't issue commands to itself. A shader may be able to build a command buffer sequence, but it can't tell the GPU to execute it. Only the application can do that.


This is fine, pretty much exactly how it should work.




Also, I'd be interested to know what kind of scene it is that you're rendering that can withstand the limitations on instancing/multidraw (namely, no shader parameter/texture changes).



Well, rendering a whole scene with the same material sounds pretty limiting to me. Also, you still have to change arrays and textures, so instancing and multidraw are not very useful for that, except for batching small groups.

In other words, _I_ should be the one asking you what kind of scenes you are talking about, because as far as I know, every technology/engine/game I've seen or worked with pretty much does

1) Frustum culling
2) Batching to minimize state changes
3) some sort of visibility optimization

when rendering large scenes. That's a very common case which should benefit enormously from some sort of hardware support. If you don't care about doing things that way, I can understand, but I'm talking about what most users would benefit from.

Jose Goruka
12-20-2009, 09:46 AM
Correct me if I'm wrong, but isn't a per-instance primitive count called "glMultiDraw*"?
Except that the 3 extensions explicitly state that glMultiDraw* will NOT update gl_InstanceID. Thus, what I asked for is not yet supported.

P.S:


Also, I'd be interested to know what kind of scene it is that you're rendering that can withstand the limitations on instancing/multidraw (namely, no shader parameter/texture changes).
To me, it is these kinds of changes that are the most limiting.




Via gl_InstanceID and the per-instance primitive count, we could draw many different objects that use a specific shader and some common uniform values, while each "instance" (which can de facto be completely different geometry and textures via tex-arrays) fetches its "instance" uniforms and texarray.z with gl_InstanceID.


Ah, yeah, this is similar to what I mentioned, since you could do pretty much everything on the GPU. I guess the main limitation is not being able to switch arrays and textures per instance :(

Ilian Dinev
12-20-2009, 10:03 AM
I guess the main limitation is not being able to switch arrays and textures per instance :(
You can switch vertex attribs, textures and uniforms per instance with the current GL 3.2. gl_InstanceID is the key factor: it lets your vtx/frag shaders choose where to pull data from, that data being vtx-attribs, the z-slice of a texture array, or whole chunks of uniforms for the current object/instance. With double/indirect referencing you can also compress those chunks. Anyway, effectively you can currently render 1000 unique objects in one call, with completely different geometry, textures and parameters (and subsequently shading). It's just that those 1000 objects currently must share a common num_primitives count. And occlusion queries cannot be updated for, or act on, each instance; the granularity is just all-instances-or-nothing right now.
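
To make that concrete, here's a minimal C/GLSL sketch of the idea, assuming the per-instance data is packed into one uniform block and the diffuse maps are stacked in a 2D texture array. The InstanceData layout, the 128-instance cap and all the names are illustrative, and geometry is shared per call, as noted above.

/* GLSL 1.50 vertex shader, kept as a C string next to the draw call.
   Each instance fetches its own transform and texture-array layer via
   gl_InstanceID; the fragment shader samples a sampler2DArray. */
static const char *instanced_vs =
    "#version 150\n"
    "struct InstanceData { mat4 modelView; vec4 misc; }; // misc.x = texture-array layer\n"
    "layout(std140) uniform Instances { InstanceData inst[128]; };\n"
    "uniform mat4 projection;\n"
    "in vec3 position;\n"
    "in vec2 uv;\n"
    "out vec3 texcoord;\n"
    "void main() {\n"
    "    InstanceData d = inst[gl_InstanceID];\n"
    "    texcoord = vec3(uv, d.misc.x);\n"
    "    gl_Position = projection * d.modelView * vec4(position, 1.0);\n"
    "}\n";

/* One call draws 'instance_count' differently textured/placed objects. */
glDrawElementsInstanced(GL_TRIANGLES, index_count, GL_UNSIGNED_INT, 0, instance_count);

The per-instance primitive count asked for above is the missing piece: today every one of those instances has to share the same index_count.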

Jose Goruka
12-20-2009, 11:54 AM
gl_InstanceID is the key factor: it lets your vtx/frag shaders choose where to pull data from, that data being vtx-attribs, the z-slice of a texture array, or whole chunks of uniforms for the current object/instance.


Ah, I think I understand: you mean indexing all the textures in a texture array, and binding most of the buffers as a single one?

That actually sounds like it could work (it would need a little client-side memory management, but that doesn't sound so bad). I guess the only problem is that all the textures in the texture array have to be the same size (well, I'm not sure if that's still the case).

About occlusion queries: I guess you are right, but something like portals and rooms should still work fine if done from the CPU side, as it's pretty cheap to compute. Still, I'm not sure how one would specify the number of primitives.

Alfonse Reinheart
12-20-2009, 12:48 PM
Via gl_InstanceID and the per-instance primitive count, we could draw many different objects that use a specific shader and some common uniform values, while each "instance" (which can de facto be completely different geometry and textures via tex-arrays) fetches its "instance" uniforms and texarray.z with gl_InstanceID. A tiny change in the specs, and the CPU could be relieved of lots of calls.

I'm not sure that this will buy you any real performance gains. Rendering calls in OpenGL are already pretty lightweight; what this buys you is mostly savings in state changes (not binding new textures/uniforms/buffers), not in rendering calls.

Also, I'm pretty sure that this would not be a "tiny" change as far as implementations go. Hardware is what makes regular instanced rendering possible and efficient; that's why ARB_draw_instanced is only supported on 3.x hardware, even though it could be emulated internally with uniform updates between individual draw calls. If hardware doesn't directly support MultiDraw with a gl_InstanceID bump between draws, then the driver is basically going to have to set a uniform every time it draws, which I'm guessing is probably not the kind of performance profile you're looking for ;)


You can switch vertex attribs, textures and uniforms per instance with the current GL 3.2.

Not vertex arrays. Not unless you are manually reading your vertex data from a buffer with something like a buffer texture, a UBO, or bindless (which is NV-specific). And if you are, you've already lost quite a bit of vertex-shader performance.

Or are you sending the attributes for multiple objects and just picking between them in the vertex shader? That's still quite a bit of performance loss, since it has to pull memory for each of those attribute lists, regardless of whether or not they are used.


And since in GL (unlike D3D) everything works with numerical IDs, I don't see why you couldn't rebind shaders, arrays or textures without having to go through the CPU, if you had that kind of command buffer.

Because hardware doesn't work with numerical IDs. Hardware uses pointers, which shaders currently have no access to. Even internally, the driver converts these IDs to pointers as soon as possible.

And of course, having the GPU write such command buffers is fraught with peril with regard to object lifetimes. What if the application orphans a texture? The user has said that it can die, but the card is still using it to render something at the moment, so it still exists. Well, what happens if a previously dispatched shader decides to reference it? By the time that command buffer gets dispatched, the driver may have deleted the object. The driver would somehow have to be notified that the orphaned object is still in use.

This is why it is important that OpenGL be a pure input-driven system. You give it commands, and it executes them.

Now, if you want OpenCL to generate such a list, that is entirely possible. As long as you don't expect to feed this list directly into OpenGL (ie: you have to load the list client-side and process it into rendering commands), that's fine.

Ilian Dinev
12-20-2009, 01:32 PM
Yes, stating it again: this per-instance count is not useful enough for performance just yet. The specs note this, too.
UBO/tex-buffer fetching: yes. On GF it skips the dedicated attrib-fetch units, but threading hides the latency; on RHD the attrib fetch is done this way anyway.

yooyo
12-20-2009, 07:50 PM
@Jose:
Your requests are related to forward and deferred rendering. But there are some new and different approaches to rendering, like the MegaTexture in the upcoming game Rage. In this title JC created new technology which utilizes GPU memory far better than current games do. The drawback is that he can't use the existing texture filters; everything is calculated in a shader. He needs access to the whole GPU memory more than he needs sorting, batching, occlusion culling, etc.

http://en.wikipedia.org/wiki/Id_Tech_5
http://en.wikipedia.org/wiki/MegaTexture

Alfonse Reinheart
12-20-2009, 08:34 PM
Your requests are related to forward and deferred rendering. But there are some new and different approaches to rendering, like the MegaTexture in the upcoming game Rage.

Megatexture and similar technologies are primarily concerned with keeping the right amounts of the right things in memory. Rendering those things can be done either forward or deferred, using the same techniques as regular rendering systems do. So one doesn't really affect the other very much.

Jose Goruka
12-21-2009, 08:00 AM
@Jose:
Your requests are related to forward and deferred rendering. But there are some new and different approaches to rendering.

MegaTexture still has a long way to go. It may work for Rage, but it suffers from:

1) Being very static in nature (changes to geometry are very costly).
2) Materials are also pretty static.
3) Lack of support in tools.
4) Needing a huge number of artists.

Also, I don't see why it wouldn't benefit from occlusion culling; I'm sure Rage uses occlusion techniques.

ZbuffeR
12-21-2009, 03:07 PM
Getting off topic here but whatever...
1) It's no more static than texture coordinates are. Ideally, even for classic texturing, whenever something deforms, both the wrapping and the texture map should change to provide more texel definition in stretched areas. In practice, within reasonable constraints, this is not much of a problem.
2) On the contrary, I think it is even less constraining with megatexture than when one uses deferred shading. Megatexture only defines how textures are sampled, not how shading is performed. And as said above, the two methods are orthogonal.
3) Well, yes. But AFAIK, baking to the megatexture is only needed at the end of the asset pipeline, plus the texture access in the shader. So it is done "under the carpet".
4) ? The point of megatexture is to allow (almost) unconstrained texture resolution on constrained hardware memory. Creating big textures can be done even with programmer art :) Sure, it is better with good artists, but the same goes for classic textures, models, animations, sounds... There are just fewer skills needed to balance texture size and resolution across levels: throw high-res digital photography here, a rasterized vector signpost there, a low-res hand-drawn pic here... it will all run at the same speed.

Maybe I am wrong ?

LogicalError
12-29-2009, 03:43 AM
Maybe I am wrong ?

No, you're absolutely correct.
In fact virtual texturing is basically nothing more than an advanced form of texture management.

You could put in all your textures and re-use them over geometry just like you would have without virtual texturing.
(Well, you'd store them like that on disk; the addressing would probably have to point to the same pages several times when a texture is repeated, but the disk cache would only store each page once.)
If you keep the pages unique in the on-GPU page cache you can even render into it, turning it into a shaded cache (I still need to try that).
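
For anyone curious what that addressing boils down to at sample time, here's a rough GLSL-in-C sketch of the indirection step. The page-table packing (xy = bias into the physical cache, z = scale) is just an illustrative assumption, not id's actual format, and the page table is assumed to be sampled with NEAREST filtering.

/* Rough fragment-shader excerpt: one extra texture fetch translates a
   virtual UV into a physical-cache UV via a per-page scale and bias. */
static const char *vt_lookup_fs =
    "#version 150\n"
    "uniform sampler2D pageTable;      // one texel per virtual page, NEAREST filtered\n"
    "uniform sampler2D physicalCache;  // atlas of currently resident pages\n"
    "in vec2 virtualUV;\n"
    "out vec4 color;\n"
    "void main() {\n"
    "    vec4 entry = texture(pageTable, virtualUV); // xy = bias, z = scale (illustrative packing)\n"
    "    vec2 cacheUV = virtualUV * entry.z + entry.xy;\n"
    "    color = texture(physicalCache, cacheUV);\n"
    "}\n";

Filtering across page borders (and mipmapping) is where this naive version falls apart, which is the tricky part of any real implementation.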

Dark Photon
12-29-2009, 10:30 AM
virtual texturing is basically nothing more than an advanced form of texture management
A big dynamic 2D texture atlas from what I've seen (128x128 textures, 128K^2 texels, so 1024x1024-res textures). Anyone got a pointer to a detailed write-up of it?

Seems that one disadvantage of it is the lack of support for mipmap-based hardware texture filtering (aniso, etc.). I'd like to see more on how they did filtering than was revealed at SIGGRAPH last year. Also, it appears that it might impose a close tie between vertex density and texel density: since all the virtual textures were the same res, it seems you needed to subdivide 2x2 every time you needed to CPU-fade into the next texture LOD, unless there's some texture/texcoord magic going on here.

However, if you can take the disadvantages, it definitely simplifies preallocation and updates of GPU memory for texture paging. One other cool thing was how they used DCT on disk and then transcoded that to DXT dynamically at run-time when paging.

skynet
12-29-2009, 01:06 PM
You'll find a lot here:

http://silverspaceship.com/src/svt/

Also, see LogicalError's blog (link above) and

here:

http://www.linedef.com/personal/demos/?p=virtual-texturing

Dark Photon
12-30-2009, 10:54 AM
Thanks.