Official feedback on OpenGL 4.0 thread



Khronos_webmaster
03-11-2010, 06:56 AM
Khronos Unleashes Cutting-Edge, Cross-Platform Graphics Acceleration with OpenGL 4.0

Open standard 3D API specification available immediately; Provides performance, quality and flexibility enhancements including tessellation and double precision shaders; Tight integration with OpenCL for seamless visual computing

March 11, 2010 – San Francisco, GDC 2010 – The Khronos™ Group today announced the release of the OpenGL® 4.0 specification, a significant update to the most widely adopted 2D and 3D graphics API (application programming interface), which is deployed on all major desktop operating systems. OpenGL 4.0 brings the very latest in cross-platform graphics acceleration and functionality to personal computers and workstations, and the OpenGL standard serves as the basis for OpenGL® ES, the graphics standard on virtually every shipping smart phone.

The OpenGL 4.0 specification has been defined by the OpenGL ARB (Architecture Review Board) working group at Khronos, and includes the GLSL 4.00 update to the OpenGL Shading Language, enabling developers to access the latest generation of GPU acceleration with significantly enhanced graphics quality, acceleration performance and programming flexibility. This new release continues the rapid evolution of the royalty-free OpenGL standard to enable graphics developers to portably access cutting-edge GPU functionality across diverse operating systems and platforms. The full specification is available for immediate download at http://www.opengl.org/registry.

OpenGL 4.0 further improves the close interoperability with OpenCL™ for accelerating computationally intensive visual applications. OpenGL 4.0 also continues support for both the Core and Compatibility profiles first introduced with OpenGL 3.2, enabling developers to use a streamlined API or retain backwards compatibility for existing OpenGL code, depending on their market needs.

OpenGL 4.0 has been specifically designed to bring significant benefits to application developers, including:
- two new shader stages that enable the GPU to offload geometry tessellation from the CPU;
- per-sample fragment shaders and programmable fragment shader input positions for increased rendering quality and anti-aliasing flexibility;
- drawing of data generated by OpenGL, or by external APIs such as OpenCL, without CPU intervention;
- shader subroutines for significantly increased programming flexibility;
- separation of texture state and texture data through the addition of a new object type called sampler objects;
- 64-bit double precision floating point shader operations and inputs/outputs for increased rendering accuracy and quality;
- performance improvements, including instanced geometry shaders, instanced arrays, and a new timer query.
Lastly, Khronos has simultaneously released an OpenGL 3.3 specification, together with a set of ARB extensions, to enable as much OpenGL 4.0 functionality as possible on previous-generation GPU hardware, providing maximum flexibility and platform coverage for application developers. The full OpenGL 3.3 specification is also available for immediate download at http://www.opengl.org/registry.

Groovounet
03-11-2010, 07:12 AM
Amazing... you keep getting more and more astonishing!

EDIT: I just had a look at the extensions, and it's so much more than I could have expected! I think a lot of developers' little dreams just came true here.

Ok, no GL_EXT_direct_state_access but still.

Heiko
03-11-2010, 07:15 AM
Wow! I was checking this forum often the last few days (knowing that GDC2010 was going on). But as nothing was announced until now I almost began to believe there would be no new OpenGL spec.

Very nice! Will be reading the specs right now. Good to see that OpenGL is still actively being developed and that the core specs are moving along with the newest hardware available.

Can't wait until we get the first OpenGL 4.0 drivers (hurry AMD! ;)) so I can play with this new baby (actually I would have to buy myself some new hardware as well, but that was already planned).

randall
03-11-2010, 07:18 AM
Great! Congratulations to Khronos and all people involved.

Groovounet
03-11-2010, 07:29 AM
A sampler binding is effected by calling

void BindSampler( uint unit, uint sampler );

with <unit> set to the texture unit to which to bind the sampler and <sampler> set to the name of a sampler object returned from a previous call to GenSampler.


I'm really not sure about this... I was dreaming about it, but keeping the "texture unit" idea was an issue with texture objects... Does it require it? I don't think so. Instead of BindSampler, I think we were expecting a UniformSampler.

It would have removed the texture unit limitation...
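For reference, the usage as specified looks roughly like this (a sketch; "tex" is an assumed existing texture) - the sampler is still addressed per texture unit rather than per uniform:

GLuint smp;
glGenSamplers(1, &smp);
glSamplerParameteri(smp, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glSamplerParameteri(smp, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);

glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, tex); /* texture data still goes through the unit */
glBindSampler(0, smp);             /* sampling state now lives in its own object */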

Chris Lux
03-11-2010, 07:30 AM
congratulations!

reading through the 3.3 and 4.0 specs atm. sadly no direct state access, but i am very impressed by the pace at which OpenGL is currently evolving.

keep up the excellent work!

-chris

Stephen A
03-11-2010, 07:35 AM
Congratulations!

Any word on Intel support?

Groovounet
03-11-2010, 08:03 AM
Stephen: You just made the joke of the day! ;)

It seems more honest of Intel to keep quiet.

I'm not even waiting for an OpenGL 2 implementation anymore.

overlay
03-11-2010, 08:06 AM
Hello,

The new glext.h (rev 59) has a corrupted line:

#ifndef GL_ARB_draw_buffers_blend
#define GL_@@@ 0x9110
#endif

This part in rev 58 used to be:
#ifndef GL_ARB_draw_buffers_blend
#endif

Stephen A
03-11-2010, 08:22 AM
Stephen: You just made the joke of the day! ;)

It seems more honest of Intel to keep quiet.

I'm not even waiting for an OpenGL 2 implementation anymore.



Yeah, it's wishful thinking on my part. But hey, imagine an Intel spokesperson announcing that, "we are planning to ship OpenGL 3.3 support by the end of May!"

Of course, reality kicks in soon, when someone asks for the 100th time how to perform offscreen rendering:

- User trying to create an invisible window or other broken hacks: "Why do I keep getting garbage back?"
- Linking to the OpenGL FAQ: "You are not passing the pixel ownership test. Use FBOs."
- "But FBOs don't work on Intel."
- "Try pbuffers."
- "Nope, no pbuffers either."
- "How about a sacrifice to Kthulu?"
- "Wha..?"
- "Just kidding. Software rendering for you."

So much for "high performance graphics" on Intel...

Jason Borden
03-11-2010, 08:23 AM
Any idea if a new quick reference card is coming along?

Groovounet
03-11-2010, 08:32 AM
I can't find any reference to the GL_NV_texture_barrier feature in the new spec... has anyone had more luck?

That would be a big miss I think. :p

Heiko
03-11-2010, 08:43 AM
Any idea if a new quick reference card is coming along?

Or an updated version of the online OpenGL reference pages. That would be great (currently the top menu of the OpenGL page still lists the OpenGL 3.2 reference pages as `under construction').

edit: so far it looks good to me. I've mainly been looking into the OpenGL and GLSL 3.3(0) specs for now. Include directives and explicit attribute locations are nice additions to GLSL. Extra blending functionality, separate sampler states and timer queries are nice as well. All in all, it looks like a solid release to me!
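(For example, explicit attribute locations in GLSL 3.30 look roughly like this - a tiny sketch with made-up attribute names - so no glBindAttribLocation calls are needed before linking:)

#version 330
layout(location = 0) in vec3 inPosition;
layout(location = 1) in vec2 inTexCoord;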

Will be chewing through the OpenGL 4.0 and GLSL 4.00 specs now :D.

Aleksandar
03-11-2010, 09:31 AM
Extraordinary work!!!
Two versions in the same day. I'm amazed!
The revolution has started. :)

When will the new (GL 4.0) beta drivers be available? I cannot read a specification without the ability to try anything. ;) Last time they were released together with the new spec (the same day), but now we are not so lucky.

P.S. Although a new GLSL spec is not released, according to the GL 4.0 spec the new revision number will be 4.00. It seems inconsistent to jump to ver. 4.00 after 1.50. :(

Tomasz Dąbrowski
03-11-2010, 10:06 AM
It's hard to call it "revolution", I was hoping to see all glBinds deprecated and use only DSA (at least in core profiles).

Groovounet
03-11-2010, 10:25 AM
Extraordinary work!!!
Two versions in the same day. I'm amazed!
The revolution has started. :)

When will the new (GL 4.0) beta drivers be available? I cannot read a specification without the ability to try anything. ;) Last time they were released together with the new spec (the same day), but now we are not so lucky.

P.S. Although a new GLSL spec is not released, according to the GL 4.0 spec the new revision number will be 4.00. It seems inconsistent to jump to ver. 4.00 after 1.50. :(

It is released. It's called GLSL 4.0 to match OpenGL 4.0, because there is a GLSL 3.3 for OpenGL 3.3.

What if OpenGL 3.4 got released and the GLSL versions were called 1.6 for OpenGL 3.3 and 1.7 for OpenGL 4.0?

Ok, true, it sounds inconsistent, but oh well. At least with all those versions of GLSL and OpenGL you know which one comes with which one.


It's hard to call it "revolution", I was hoping to see all glBinds deprecated and use only DSA (at least in core profiles).

DSA still looks "in progress", as I found a couple of references to it in the extension files.

Fingers crossed for (OpenGL 3.4 and) OpenGL 4.1 at SIGGRAPH 2010!

bobvodka
03-11-2010, 10:37 AM
It's hard to call it "revolution", I was hoping to see all glBinds deprecated and use only DSA (at least in core profiles).

Agreed; while it's nice to see OpenGL has finally caught up with DX11's feature set, the API is still the same old API that drove me away, and DSA, the best thing to come out of the whole GL3.0 farce, is still missing.

I'll swing back around again when GL5 is out; maybe that'll have something to tempt me back from D3D11...

randall
03-11-2010, 11:04 AM
DSA would be nice, I agree, BUT catching up with the hardware has much greater priority in my opinion.

DSA is not something that can be cleanly integrated into OpenGL without much work. It requires an API redesign. The bind methodology has been with OpenGL from the beginning.

I think the OpenGL 4.0 feature set is great.
Try to notice the beauty of its initial design and how cleverly the authors of this API have integrated new features into such an old architecture.

It's much easier to throw away an old API (D3D9) and create a nicely designed new one (D3D10+). But the OpenGL authors cannot do that. And I think they did a great job with OpenGL 4.0.

PkK
03-11-2010, 11:08 AM
Stephen: You just made the joke of the day! ;)

It seems more honest of Intel to keep quiet.

I'm not even waiting for an OpenGL 2 implementation anymore.



Yeah, it's wishful thinking on my part. But hey, imagine an Intel spokesperson announcing that, "we are planning to ship OpenGL 3.3 support by the end of May!"

Of course, reality kicks in soon, when someone asks for the 100th time how to perform offscreen rendering:

- User trying to create an invisible window or other broken hacks: "Why do I keep getting garbage back?"
- Linking to the OpenGL FAQ: "You are not passing the pixel ownership test. Use FBOs."
- "But FBOs don't work on Intel."
- "Try pbuffers."
- "Nope, no pbuffers either."
- "How about a sacrifice to Kthulu?"
- "Wha..?"
- "Just kidding. Software rendering for you."

So much for "high performance graphics" on Intel...

Well, I stopped waiting for OpenGL 2 drivers for Intel hardware when they arrived.
I'm writing this on my old laptop with its Intel integrated graphics, which supports OpenGL 2.1 and lots of extensions, including GL_ARB_framebuffer_object and GLX_SGIX_pbuffer.
The only missing feature from your list is GL_MESAX_elder_gods_sacrifice.

Philipp

Rob Barris
03-11-2010, 11:08 AM
The bad news is that not every release will have every feature that every developer wants. The good news is that the releases are now spaced by months rather than years.. ( http://en.wikipedia.org/wiki/OpenGL )

3.0 - July 2008
3.1 - May 2009
3.2 - Aug 2009
3.3 / 4.0 - Mar 2010.

So, as has been the custom for those releases, make your own assessment of what you like or don't like, and keep the feedback coming.

If a dozen separate developers all shout loudly for DSA for example, this could effectively raise its priority for an upcoming release (assuming the objections of some of the implementors can be reconciled).

Jan
03-11-2010, 11:35 AM
Well, i was going to complain, that although it's called "4.0" it does not contain DSA or multi-threaded rendering.

But after more carefully sifting through all the material, i am actually pretty happy with all the progress. And since tessellation does change so many things, i think it is OK to call it "4.0".

Good job guys!

Now that OpenGL has caught up with the most important features of DX11, i really hope 5.0 gets a more thorough API clean-up.

Jan.

Stephen A
03-11-2010, 11:46 AM
I'm skimming through the new specs and it is evident that much of the new functionality is based on community feedback. I cannot begin to describe how awesome that is! It's saying a lot about the new direction of the Khronos board - glad to see the API sailing at full speed again.

And bonus points for validating the design of OpenGL 3.2 (no deprecated features.) :)


I'm writing this on my old laptop with its Intel integrated graphics, which supports OpenGL 2.1 and lots of extensions, including GL_ARB_framebuffer_object and GLX_SGIX_pbuffer.
The only missing feature from your list is GL_MESAX_elder_gods_sacrifice.
What do you mean by old? Most complaints have to do with the GMA950/XP combo found in most netbooks, followed by the 915 and 865 chips. Newer chips, like the 4500HD, seem to support FBOs under Vista or newer (but still no FBO_blit. What, were you thinking of implementing bloom on Intel IGPs? Naughty boy, this is Intel we are talking about!)

That said, I think someone issued a pull request for GL_MESAX_elder_gods_sacrifice, targeting Mesa 7.9 / Gallium. Should be interesting.

BarnacleJunior
03-11-2010, 11:46 AM
I'm another dev that wants DSA and a sensible thread model (contexts that create command lists are fine) before I leave DX11. But as important as DSA, and something no one is talking about, is a binary standard for shader IL and an offline compiler. Some shaders, especially GPGPU shaders, are getting long and taking a very long time to compile (like 30 seconds or more). This is an enormous annoyance with OpenCL especially. DX and CUDA both support portable ILs with offline compilation. The other thing with ILs, besides startup speed, is that they give the dev a way to visually inspect a shader for branching, register usage, etc. The DX IL (which is almost 100% compatible with ATI's IL) is very nicely formatted and makes diagnosing shader performance issues much easier.

Stephen A
03-11-2010, 11:56 AM
All I can say is +1 to DSA and +1 to binary shaders (compile into common, high-level IL that is then read and optimized by the IHV drivers. Significantly faster than parsing from source every single time.)

Alfonse Reinheart
03-11-2010, 12:01 PM
To be honest, DSA is syntactic sugar; it would be very nice to have, but I can live without it.

But they just introduced 2 new shader stages. The ability to separate program objects is only going to become increasingly more relevant. Also, binary shaders. I don't care much about an intermediate language or anything; I just want to be able to get a binary shader and save it, then load it back later and use it.

Oh, and BTW: someone (I forget who) will be very happy with ARB_blend_func_extended. Especially since it's core 3.3, which means that most if not all 3.x hardware will be able to use it.
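For the curious, the API side of dual-source blending roughly looks like this (a sketch; "prog" and the output names are just placeholders) - both fragment outputs target color buffer 0, and the second one feeds the blend equation:

glBindFragDataLocationIndexed(prog, 0, 0, "outColor");   /* source 0 of buffer 0 */
glBindFragDataLocationIndexed(prog, 0, 1, "outWeight");  /* source 1 of buffer 0 */
glLinkProgram(prog);

glEnable(GL_BLEND);
/* e.g. per-channel coverage: dst = src0 + dst * (1 - src1) */
glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC1_COLOR);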

remdul
03-11-2010, 12:09 PM
Good stuff, but more DSA please!

Jean-Francois Roy
03-11-2010, 12:30 PM
I'm another dev that wants DSA and a sensible thread model (contexts that create command lists are fine) before I leave DX11. But as important as DSA, and something no one is talking about, is a binary standard for shader IL and an offline compiler. Some shaders, especially GPGPU shaders, are getting long and taking a very long time to compile (like 30 seconds or more). This is an enormous annoyance with OpenCL especially. DX and CUDA both support portable ILs with offline compilation. The other thing with ILs, besides startup speed, is that they give the dev a way to visually inspect a shader for branching, register usage, etc. The DX IL (which is almost 100% compatible with ATI's IL) is very nicely formatted and makes diagnosing shader performance issues much easier.

OpenCL supports binary kernels. See section 5.4.1 of the 1.0 specification and the clCreateProgramWithBinary command. It is true that the IL is not portable and that it may be invalidated for any reason, but this does not negate its primary benefit of speeding up launch times.
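A rough sketch of that caching flow with clCreateProgramWithBinary (assuming a single device; ctx, device and src are placeholders, and error checking is omitted):

/* First run: build from source, then pull the device-specific binary out. */
cl_int err;
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
clBuildProgram(prog, 1, &device, "", NULL, NULL);

size_t binSize = 0;
clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);
unsigned char *bin = (unsigned char *)malloc(binSize);
clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
/* ...save bin/binSize to disk, keyed by device and driver version... */

/* Later runs: recreate the program from the saved binary instead of recompiling. */
cl_int status;
cl_program cached = clCreateProgramWithBinary(ctx, 1, &device, &binSize,
                                              (const unsigned char **)&bin, &status, &err);
clBuildProgram(cached, 1, &device, "", NULL, NULL);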

elFarto
03-11-2010, 12:38 PM
Yay for GL_ARB_shader_bit_encoding!

Epic fail for GL_ARB_sampler_objects :(

Yay for GL_ARB_explicit_attrib_location, but where's the updated separate_shader_object spec...?

Regards
elFarto

pjcozzi
03-11-2010, 12:42 PM
I'm very happy with the new features in GL 4. I am glad to see many of them in GL 3.3 to support old, lol, current hardware.

I have a question regarding 64-bit support: this is a GL 4 feature that requires next generation hardware, correct? Page 14 of the GL 3.3 spec includes ui64 as a type descriptor but section I, which lists new features, does not mention 64-bit.

Also, does anyone know how the 64-bit pipeline is expected to perform?

Regards,
Patrick

Rob Barris
03-11-2010, 12:51 PM
Epic fail for GL_ARB_sampler_objects :(


If you have any more ... constructive ? detailed ? ... thoughts to add on that topic, could you type them in here?

CrazyButcher
03-11-2010, 01:18 PM
very nice work and keep up the great pacing (thanks for #include and transform_feedback3)!

sampler objects look good to me (same as dx10 I think).

I would have hoped that GL_ARB_explicit_attrib_location would also work for vertex shader outputs, and therefore bring along separate_shader_objects as well... that and some form of binary representation for load times, would also be my picks for very useful additions.

The buffer centric workflow GL offers is really nice, but Dx still has an edge when it comes to shader convenience. (and tools, can't khronos not just sponsor gdebugger for all)

Jan
03-11-2010, 01:27 PM
Oh, and BTW: someone (I forget who) will be very happy with ARB_blend_func_extended. Especially since it's core 3.3, which means that most if not all 3.x hardware will be able to use it.

That would be me! (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Main=48979&Number=251694#Post251694)

And yes, i am VERY happy about that, i did not think that anybody would take it seriously.

Jan.

ScottManDeath
03-11-2010, 01:49 PM
ARB_blend_func_extended is called dual source blending in DX10, but got dropped in DX11 afaik.

Heiko
03-11-2010, 02:43 PM
And I see there is a link on the front page for an OpenGL 4.0 quick reference card! Also the link announcing online OpenGL 3.2 reference pages is changed so that it now announces OpenGL 4.0 reference pages. Still under construction though, I hope these will be available soon... I still peek at the 2.1 pages sometimes to check quickly how a certain function worked again.

Rob Barris
03-11-2010, 02:52 PM
ARB_blend_func_extended is called dual source blending in DX10, but got dropped in DX11 afaik.

Can you elaborate on "dropped" ?

DX11 cards still need to be able to run DX9/DX10 software, so I don't see how this feature could be cut from silicon unless it has simply become another programmable behavior masquerading as FF behavior.. or do you mean that it's just not in the DX11 API any more.

xahir
03-11-2010, 02:55 PM
First of all, great job on the releases. It is really good to see DX10 features covered (indirect draw commands, separate sampler states...) and to have DX11 support before the market goes mad about it...

I want to comment on sampler objects: we bind samplers to texture units, but when we want to bind a texture to the same unit, we need to set the active texture unit and then bind the texture to the related target on that unit. We need actual DSA here. The newborn features seem to expose DSA behaviour, but the old habits (binding points) make my skin crawl... Actually, what I want is management of such state values in shader code (the material idiom). CgFX has this, HLSL has this, but when we want a material approach in OpenGL we need to coordinate the GL code with the GLSL code all the time and they need to "trust" each other.

Alfonse Reinheart
03-11-2010, 03:07 PM
ARB_blend_func_extended is called dual source blending in DX10, but got dropped in DX11 afaik.

... why? What, did they think a more powerful blending function was a bad idea?


The newborn features seem to expose DSA behaviour

That's actually a good thing. While it makes the API a bit inconsistent, it means that whenever they do get around to a DSA-style overhaul, there will be fewer legacy-style functions lying around.

Khronos_webmaster
03-11-2010, 03:09 PM
Someone posted a comment on the OpenGL homepage news story. Hopefully they won't mind if I move the comment over here so it can get some clarification and worthy feedback:

---------------------
First, congratulations on the milestone. Second, as a cross-platform developer I would like to use OpenGL exclusively but it's commercially inviable to use it on Windows, due to the fact that OpenGL just doesn't work on most machines by default, which forces me to target my game to both DirectX and OpenGL. The OpenGL shortcomings on Windows aren't a big deal for hardcore games where the users are gonna have good drivers (although they are a cause of too many support calls which makes it inviable anyway) but it's a show-stopper for casual games.

If it was me in charge of OpenGL I wouldn't even bother coming up with new versions of the standard until this situation was rectified. I really don't understand why the effort isn't put in working with Nvidia, Ati and Microsoft to rectify the situation. If Microsoft won't help just bring back glsetup! Having to run an installer on installation of your product would be fine. But currently the user has to find good drivers by himself and then reboot, this is unacceptable. DirectX on the other hand can be streamlined as part of the installation. Until OpenGL can be expected to just work on any Windows box it is dead on the Windows platform. Do something about this please.

Posted by: Margott
---------------------

elFarto
03-11-2010, 03:21 PM
Epic fail for GL_ARB_sampler_objects :(


If you have any more ... constructive ? detailed ? ... thoughts to add on that topic, could you type them in here?
The idea of having textures and samplers separate is so that you can bind the texture object directly to the shader uniform/uniform buffer, so you don't have to play music texture units when you change the shader (rebinding the right texture object to the right texture unit).

Regards
elFarto

Alfonse Reinheart
03-11-2010, 03:30 PM
I really don't understand why the effort isn't put in working with Nvidia, Ati and Microsoft to rectify the situation.

Because none of them are responsible for it. ATI's drivers are OK, and NVIDIA's are pretty good. The problem is Intel. Their drivers are terrible. And if you're making casual games, you want them to be playable on Intel integrated chipsets.


The idea of having textures and samplers separate is so that you can bind the texture object directly to the shader uniform/uniform buffer, so you don't have to play music texture units when you change the shader (rebinding the right texture object to the right texture unit).

No it isn't. The purpose of separation is exactly what is said in the overview of the spec: to be able to use the same texture data with different filtering parameters without having to do a bunch of texture parameter state changes. This is very valuable.

And personally, I've come to rather like the game of "music texture units": it allows me to change exactly what state I want. I can use the same shader with different textures very, very easily.

Under the system you're suggesting, I have to change the program itself to swap texture sets. This is more expensive, both in terms of the number of function calls and how the hardware handles it.

Personally, what I would like to see is full separation of program objects (compiled/linked programs) from program object state (uniforms, etc). UBOs are about as close to that as it gets, so I'm fairly content with that compromise.

RoderickC
03-11-2010, 03:34 PM
Someone posted a comment on the OpenGL homepage news story. Hopefully they won't mind if I move the comment over here so it can get some clarification and worthy feedback:

---------------------
First, congratulations on the milestone. Second, as a cross-platform developer I would like to use OpenGL exclusively but it's commercially inviable to use it on Windows, due to the fact that OpenGL just doesn't work on most machines by default, which forces me to target my game to both DirectX and OpenGL. The OpenGL shortcomings on Windows aren't a big deal for hardcore games where the users are gonna have good drivers (although they are a cause of too many support calls which makes it inviable anyway) but it's a show-stopper for casual games.

If it was me in charge of OpenGL I wouldn't even bother coming up with new versions of the standard until this situation was rectified. I really don't understand why the effort isn't put in working with Nvidia, Ati and Microsoft to rectify the situation. If Microsoft won't help just bring back glsetup! Having to run an installer on installation of your product would be fine. But currently the user has to find good drivers by himself and then reboot, this is unacceptable. DirectX on the other hand can be streamlined as part of the installation. Until OpenGL can be expected to just work on any Windows box it is dead on the Windows platform. Do something about this please.

Posted by: Margott
---------------------


Something like 3dsetup would indeed be a good short-term solution to provide people with modern OpenGL drivers. Too bad that Intel isn't shipping any GL3.x, let alone 4.x, drivers at this point :(

Another thing which is VERY related to this. A lot of developers encounter OpenGL driver bugs. Some implementations are buggier than others and this frustrates a lot of developers and it is also one of the reasons that some companies don't use OpenGL at this point.

In order to improve OpenGL driver quality, I would urge developers, when they encounter problems, to submit test cases to 'piglit', an OpenGL testing framework hosted at http://people.freedesktop.org/~nh/piglit/. At least the open source X.org OpenGL drivers are using it as a test bed, but nothing prevents OSX/Windows developers from using it as well.

Roderick

Ilian Dinev
03-11-2010, 04:20 PM
The idea of having textures and samplers separate is so that you can bind the texture object directly to the shader uniform/uniform buffer so...
That's too counter-productive, imho. Scenes have 5k+ objects visible, different textures each; fewer programs. It makes more sense to group by program, imho.
In case you meant to bind all N textures for the given mesh-instance at once, I don't think it's viable either: those GLuint names are not optimally mappable during shader execution (are not pointers, and should not be).

Rob Barris
03-11-2010, 04:38 PM
I really don't understand why the effort isn't put in working with Nvidia, Ati and Microsoft to rectify the situation.

Because none of them are responsible for it. ATI's drivers are OK, and NVIDIA's are pretty good. The problem is Intel. Their drivers are terrible. And if you're making casual games, you want them to be playable on Intel integrated chipsets.


The idea of having textures and samplers separate is so that you can bind the texture object directly to the shader uniform/uniform buffer, so you don't have to play music texture units when you change the shader (rebinding the right texture object to the right texture unit).

No it isn't. The purpose of separation is exactly what is said in the overview of the spec: to be able to use the same texture data with different filtering parameters without having to do a bunch of texture parameter state changes. This is very valuable.

And personally, I've come to rather like the game of "music texture units": it allows me to change exactly what state I want. I can use the same shader with different textures very, very easily.

Under the system you're suggesting, I have to change the program itself to swap texture sets. This is more expensive, both in terms of the number of function calls and how the hardware handles it.

Personally, what I would like to see is full separation of program objects (compiled/linked programs) from program object state (uniforms, etc). UBOs are about as close to that as it gets, so I'm fairly content with that compromise.

Progress!

What kind of GL app are you working on ?

mhagain
03-11-2010, 05:18 PM
Stephen: You just made the joke of the day! ;)

It seems more honest of Intel to keep quiet.

I'm not even waiting for an OpenGL 2 implementation anymore.



Yeah, it's wishful thinking on my part. But hey, imagine an Intel spokesperson announcing that, "we are planning to ship OpenGL 3.3 support by the end of May!"

Of course, reality kicks in soon, when someone asks for the 100th time how to perform offscreen rendering:

- User trying to create an invisible window or other broken hacks: "Why do I keep getting garbage back?"
- Linking to the OpenGL FAQ: "You are not passing the pixel ownership test. Use FBOs."
- "But FBOs don't work on Intel."
- "Try pbuffers."
- "Nope, no pbuffers either."
- "How about a sacrifice to Kthulu?"
- "Wha..?"
- "Just kidding. Software rendering for you."

So much for "high performance graphics" on Intel...
Meanwhile I've been quite happily using SetRenderTarget with D3D9 on Intel chips going back to the 915 without a problem.

The annoying thing is that the hardware actually does support hardware accelerated offscreen rendering perfectly well.

Foobarbazqux
03-11-2010, 07:38 PM
In section 1.2.1 of the GLSL 3.3 spec (Summary of Changes from Version 1.50) it says
"Added Appendix A to describe include tree and path semantics/syntax for both the language and the API specifications."
Neither this appendix nor any other related information appears to be in the GLSL spec or the GL 3.3 spec. The related extension (ARB_shading_language_include) says
"We decided not to put #include into OpenGL 3.3 / 4.0 yet"

Mars_999
03-11-2010, 11:06 PM
In a word, "SWEET!" I love the new direction the ARB has taken with OpenGL! Keep it coming.

BTW, as for setup of GL4.0, I haven't read the spec, but I am assuming it's no different to set up than GL3.2, with similar usage to GL3.2?

Thanks

bertgp
03-11-2010, 11:51 PM
Great step forward! I just hope drivers will implement all these features reliably. A spec conformance test suite (à la the ACID tests for browsers) would be extremely useful for this.

I know at least 6 developers at my company who want the ability to separate shader objects and have a binary shader format. Maybe the shader subroutines will help, depending on their performance.

DSA would be nice to have, but not imperative, since we wrapped all the object binding logic in classes.

Command lists as BarnacleJunior suggested would also be very useful. They would allow getting maximum efficiency out of the OpenGL draw thread, since it would only execute a compiled list of OpenGL commands; kind of like a display list for each frame or each part of a frame.

CrazyButcher
03-12-2010, 04:35 AM
Progress!


I agree on looking to the future now. OpenGL vs dx9 on dx9-class hardware (most intel integrated stuff) was clearly lost: FBO came too late, and GLSL was also a bit dodgy compared to dx9 sm3 (and even compared to the arb program extensions).
So one should not try to fix up the past, that's just too much legacy and not worth the effort. But for the sm4+ hardware things look different now, with both apis very close feature-wise, and one being able to expose that functionality on all platforms, including win xp.

I am not sure what the mobile guys are working on, but given the lean nature of the "core" profiles, I would think that GL ES might not be needed anymore for the next-gen mobile stuff.

out of curiosity, is there a clear benefit for the IHVs of the "link" mechanism GLSL has (vs the dx-like individual shaders)? In theory additional optimization could be done, but is this really being made use of?

randall
03-12-2010, 07:03 AM
To all those requesting DSA: write wrapper classes for OpenGL resources and you have DSA. Works great when done well.

Tomasz Dąbrowski
03-12-2010, 07:22 AM
To all those requesting DSA: write wrapper classes for OpenGL resources and you have DSA. Works great when done well.

To all those requesting anything from OpenGL: write your own software renderer and you have it. Works great when done well.

The binding/state system has no benefits. It is a minor problem for IHVs and a major problem for game programmers.
It was especially awful in the FF days, when every routine had to look like:

glBindALotOfThings();
glSetALotOfStates();
glDoSomethingUseful();
glSetEverythingBack();

to ensure that 2 different modules wouldn't overwrite each other's settings. Now with shaders (no state machine there!), VAOs and other fancy stuff there aren't as many "binding places", but, for example, binding textures (now with additional sampler objects) and UBOs is still cumbersome.

Less API calls == better performance == profit.

Aleksandar
03-12-2010, 07:22 AM
To all those requesting DSA: write wrapper classes for OpenGL resources and you have DSA. Works great when done well.
Randall, do you know what DSA means? The point is that...

The intent of this extension is to make it more efficient for libraries to avoid disturbing selector and latched state. ...and you suggest making a wrapper...

glfreak
03-12-2010, 08:00 AM
Glad to hear GL spec 4.0 is out.

"Functional" Drivers?

Intel graphics?

Dan Bartlett
03-12-2010, 09:19 AM
Are extensions like:

http://www.opengl.org/registry/specs/ARB/draw_buffers_blend.txt
http://www.opengl.org/registry/specs/ARB/sample_shading.txt

going to be modified to remove ARB suffix from tokens + entry points? (Otherwise the headers will need to include all these again, without the ARB suffix)

ScottManDeath
03-12-2010, 10:24 AM
ARB_blend_func_extended is called dual source blending in DX10, but got dropped in DX11 afaik.

Can you elaborate on "dropped" ?

DX11 cards still need to be able to run DX9/DX10 software, so I don't see how this feature could be cut from silicon unless it has simply become another programmable behavior masquerading as FF behavior.. or do you mean that it's just not in the DX11 API any more.

I blame my bad memory for making me think that some additional restrictions introduced in DX10.1 meant that dual source blending is getting the shaft. :o

Jan
03-12-2010, 10:26 AM
Well said, Aleksandar. DSA is about making an API more streamlined and EFFICIENT. Sure, if you use glGet* and push/pop before EVERY state change, you can make it work the same way, even today.

But then don't complain about slow rendering. Multi-threading is then completely impossible for the driver to accomplish.

Jan.

Alfonse Reinheart
03-12-2010, 12:10 PM
Binding/state system have no benefits.

That's not why it's still around.


Less API calls == better performance == profit.

That's not necessarily true. It can be true, but it certainly doesn't have to be.


going to be modified to remove ARB suffix from tokens + entry points?

They didn't do it when ARB_geometry_shader4 was promoted to core, so I doubt they'll start now.

Core extensions (ARB extensions without the suffix) are something of a nicety. They aren't 100% necessary, but they're nice to have when possible. It certainly isn't worth rewriting an extension specification just to have them, though.


Multi-threading is then completely impossible for the driver to accomplish.

This is probably the best argument for DSA. You can't have multithreaded rendering without it.

However, the problem is that, even if you use DSA, backwards compatibility means that you don't have to. What then happens to multithreaded rendering in that case? Does the spec just say, "attempting to call functions X, Y, Z will cause undefined behavior when threading?"

randall
03-12-2010, 01:13 PM
Yes, I know what DSA means. I was not talking about using glGet* and push/pop before EVERY state change. I know that this would kill performance. I was talking about caching the most important state in the app on the CPU side (tracking binding points etc.).

I agree that DSA would be nice and more efficient. But the reality is that we don't have it in OpenGL 4.0.

I suggested creating a thin layer (a wrapper for OpenGL resources) which would "emulate DSA" for non-NVIDIA hardware and use the fast path (DSA) for NV hardware. I have written such an abstraction and it works well. So, do not complain. Be happy with OpenGL 4.0. It's getting better and better.
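Something in the spirit of that layer might look like this (just a sketch; hasDSA and shadowedTexture2D are assumed to be maintained elsewhere):

void SetTexParameter(GLuint tex, GLenum pname, GLint value)
{
    if (hasDSA) {
        /* Fast path: EXT_direct_state_access, drawing state untouched. */
        glTextureParameteriEXT(tex, GL_TEXTURE_2D, pname, value);
    } else {
        /* Emulation: bind, modify, then restore the shadowed binding (no glGet). */
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, pname, value);
        glBindTexture(GL_TEXTURE_2D, shadowedTexture2D);
    }
}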

Gedolo
03-12-2010, 01:31 PM
Very pleasantly surprised by this OpenGL release.
Love the new stuff.

The drawing without CPU intervention is fantastic!
This saves a lot of valuable CPU cycles.
It makes OpenGL very efficient :)
Good to see instancing going further.
The timer query stuff is going to be really handy.
It makes it possible for programs to do a mini benchmark.
Add to this the new shader subroutine flexibility.
It's going to be possible to write programs that might optimize themselves dynamically at runtime. :D :)

The only thing that's missing is DSA.
+1 for the DSA.

This is a very good release with a lot of nice goodies.
Khronos is really improving OpenGL very well. Kudos for that.

Aleksandar
03-12-2010, 01:43 PM
...I suggested creating a thin layer (a wrapper for OpenGL resources) which would "emulate DSA" for non-NVIDIA hardware and use the fast path (DSA) for NV hardware. I have written such an abstraction and it works well. So, do not complain. Be happy with OpenGL 4.0...
I'm sorry, randall, for the misunderstanding.
And, of course, I'm happy with both OpenGL and NV. ;)

Alfonse Reinheart
03-12-2010, 01:52 PM
The drawing without CPU intervention is fantastic!
This saves a lot of valuable CPU cycles.

It's not there to save performance. What it does do is allow a shader that does transform feedback to decide how to do the rendering with the feedback data by itself.


Good to see instancing going further.

I was actually rather surprised to see them put that form of instancing back in the rendering pipeline. Especially since D3D took it out in version 10 (as I understand it).


The only thing that's missing is DSA.

*cough*shader separation*cough*.

I'm not using tessellation until I can mix and match shaders at runtime as I see fit, without having to re-link and everything.

th3flyboy
03-12-2010, 02:20 PM
Question regarding OGL 4 and 3.3: I have a GTX 280; would that support OGL 4 features, or would I just use 3.3? I have heard reports of DX11 features running on a DX10 card, so that's why I ask.

Heiko
03-12-2010, 03:04 PM
Question regarding OGL 4 and 3.3: I have a GTX 280; would that support OGL 4 features, or would I just use 3.3? I have heard reports of DX11 features running on a DX10 card, so that's why I ask.

No OpenGL 4.0 for DirectX 10(.1) class hardware. All functions from OpenGL 4.0 that can run on DirectX 10(.1) hardware are available in OpenGL 3.3. Examples of functions that you cannot use with your hardware are the new tessellation control shaders; these are therefore only found in OpenGL 4.0 (and not in OpenGL 3.3).

Alfonse Reinheart
03-13-2010, 01:54 AM
And now I'd like to present the most awesome thing in the OpenGL 3.3 specification:



The command



void DrawArraysOneInstance( enum mode, int first, sizei count, int instance );


does not exist in the GL, but is used to describe functionality in the rest of this section. This command constructs a sequence of geometric primitives by transferring...


Oh, and FYI: glDrawElementsOneInstance also does not exist ;) Are there any other functions in OpenGL that do not exist that the ARB would like to tell us about? :D

aqnuep
03-13-2010, 06:15 AM
It's funny, but it was my dream that the next release of the standard would actually include two versions: 3.3 for DX10(.1) hardware and 4.0 for DX11. It actually became real! Thanks, Khronos!

Also thanks for GL_ARB_draw_indirect that will enable the writing of fully GPU accelerated game engines and for GL_ARB_shader_subroutine to enable modular shader development. Great work!
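For readers who haven't looked at it yet, GL_ARB_draw_indirect roughly works like this (a sketch; cmdBuffer is an assumed buffer object, which the GPU itself, e.g. via OpenCL or transform feedback, could have filled):

typedef struct {
    GLuint count;
    GLuint primCount;           /* instance count */
    GLuint first;
    GLuint reservedMustBeZero;
} DrawArraysIndirectCommand;

glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmdBuffer);
/* The draw parameters are read from the buffer at the given offset, not from the CPU. */
glDrawArraysIndirect(GL_TRIANGLES, (const GLvoid *)0);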

Amanieu
03-13-2010, 12:47 PM
How are subroutines different from just having a switch statement calling different functions depending on an integer uniform?

mfort
03-13-2010, 01:06 PM
How are subroutines different from just having a switch statement calling different functions depending on an integer uniform?

modularity
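To make the difference concrete, here's a rough GLSL 4.00 sketch (names are made up): the active routine is selected from the API with glUniformSubroutinesuiv, so the shader body itself stays free of a uniform-driven switch.

#version 400

subroutine vec4 ShadeModel(vec3 n, vec3 l);

subroutine(ShadeModel) vec4 diffuseOnly(vec3 n, vec3 l) { return vec4(max(dot(n, l), 0.0)); }
subroutine(ShadeModel) vec4 unlit(vec3 n, vec3 l)       { return vec4(1.0); }

subroutine uniform ShadeModel shade;   // which routine runs is picked per draw, from the API

in vec3 normal;
in vec3 lightDir;
out vec4 fragColor;

void main() { fragColor = shade(normalize(normal), normalize(lightDir)); }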

pjcozzi
03-13-2010, 03:35 PM
Are there likely to be 3.4, 3.5, ... releases? For example, when 4.1 comes out, will 4.1 features not requiring GL 4 hardware be put in a core 3.4 spec? I'm assuming (or at least hoping) the answer is yes.

Long term, wouldn't this get messy after several major GL releases?

Regards,
Patrick

Alfonse Reinheart
03-13-2010, 03:41 PM
They could just do it with core extensions. For example, if they have shader separation, they could just make an ARB_program_separate core extension rather than a point release.

They didn't make GL 2.2 just so that 2.1 implementations could use VAOs; they just made a VAO extension.

Rob Barris
03-13-2010, 03:55 PM
I've advocated for having more core releases because it clarifies the work that each implementor must sign on for in order to keep pace. IMO in the past there were too few core releases and way too many vendor extensions, leading to a lot of developer issues. Looking at it in the present tense, it's really no big deal if a modest number of extensions appear on top of 3.3, or if a 3.4 were to appear - neither one would result in a change of supported hardware, assuming your set of relevant vendors were to implement the completeness of either path.

Looked at another way, say if there are still some couple dozen features on DX10 level hardware that GL3.x has not yet exposed - (I don't think there are, but just for discussion) - there would really be no harm done to have a 3.4 / 3.5 / 3.6 to address those issues over time, as long as you didn't have to wait a couple of years to get there.

It's been about two releases a year for the last two years; IMO this is a sensible cadence that should continue, in turn reinforcing developer confidence.

I guess I'm saying that timeliness and cross-vendor coherency exhibit more value to me than the distinction between core and extension. An example of this would be anisotropic filtering. It's not in core due to some long-standing IP conflict, the details of which escape me. Doesn't matter though, because most implementations have it.

Mars_999
03-13-2010, 07:28 PM
I agree with Rob that the two core releases they have now are great: GL3 for DX10 hardware and GL4 for DX11 hardware. And in 3.3-3.x anything that isn't covered yet can still be added. Same goes for GL4.0-GL4.x when DX12 hardware comes out in a few years! :)

Executor
03-13-2010, 11:00 PM
IMHO, instead of making multiple API versions (GL2, GL3, GL4, GLES, GLES2 and others), they could use profiles - one API, many profiles. Like this:
GL_CONTEXT_GL2_HARDWARE_PROFILE_BIT_ARB - for DX9 hardware
GL_CONTEXT_GL3_HARDWARE_PROFILE_BIT_ARB - for DX10(.1) hardware
GL_CONTEXT_GL4_HARDWARE_PROFILE_BIT_ARB - for DX11 hardware
GL_CONTEXT_GLES_HARDWARE_PROFILE_BIT_ARB - for embedded systems
GL_CONTEXT_GLES2_HARDWARE_PROFILE_BIT_ARB - for embedded systems

For the next release of the OpenGL API, some profiles may be deprecated or supplemented with other new features.
For example, OpenGL 4.1:
ARB_texture_barrier functionality added to core for the GL_CONTEXT_GL3_HARDWARE_PROFILE_BIT_ARB and GL_CONTEXT_GL4_HARDWARE_PROFILE_BIT_ARB profiles.

Or something like this. :)

Alfonse Reinheart
03-14-2010, 12:55 AM
What does that matter? You can already specify a version number to CreateContextAttribsARB. Why would you need a special bit to say you want GL3 when you just asked for a GL 3.3 context?

Jan
03-14-2010, 05:28 AM
One important point is that with version numbers (instead of profiles) vendors will be able to simply write "supports OpenGL 4.0" on their products. That is very important for competition, because when some other vendor is still stuck at version 3.2, some people won't buy their hardware, even though they might not know anything about OpenGL. Bigger number, more interest from customers (just like with DX, too).

That forces vendors to adopt OpenGL more quickly, and that in turn is a good thing for developers.

Jan.

Emanuele
03-14-2010, 11:11 AM
Hi, I don't understand how with OpenGL 4.0 it is possible to achieve MT rendering (or better, to submit n parallel command streams) to the OpenGL driver without a DSA API.
Could you please explain it to me?
Does the OpenGL specification not only force a single thread to issue commands but even the same thread (meaning that the only thread allowed to issue OpenGL commands within a process is the thread that created the context, an even more restrictive condition than simply single-threaded)?

If it's possible, in which cases?
Because some years ago one of my OpenGL programs was loading a lot of images as textures, and I was able to decode them from jpg files into memory with separate threads, but then, to load them as OpenGL textures, I had to use the thread that created the context, otherwise I'd experience unexpected behaviour.

Could you please help me better understand?

Cheers,

Ps. I use the term iso (thread) because it's latin to mean "the same and only".

bsupnik
03-14-2010, 12:36 PM
Hi Y'all,

I agree with Rob mostly - it's certainly very nice to have DX11 type functionality in ARB specs slated for core now, so early on. With DX10 we had NV vendor specs for a long time, and it wasn't clear how many would make it to core. For an apps developer this introduces uncertainty in the road map.

Re: DSA and MT rendering, it is not at all clear to me what MT rendering case would be solved. This is my understanding of the situation:

- As of now, you can render to different render targets with different threads using multiple shared contexts. If you use "state shadowing" to avoid unnecessary binds/state selector changes, you'll need to use thread local storage for the shadowing. One context, one thread, one FBO, off we go.

- As of now, this will not actually cause parallel rasterization on the card because modern GPUs (1) have only one command interpreter to receive and dispatch instructions and (2) allocate all of their shaders in one big swarm to work through that rasterization. If anyone knows of a GPU that has gone past this idiom, please shout at me.

- If you do an MT render from multiple threads/shared contexts, what will really happen is the driver will build up multiple command buffers (one for each context) and the GPU will round-robin execute them, with a context switch in between... pretty much the same way a single-core machine executes multiple apps. I do not know what the context switch cost or granularity is.

- Because drivers spend at least some time building up command buffers, a CPU bound app that spends a lot of time in the driver filling command buffers might in theory see some benefit from being able to fill command buffers faster from multiple threads. I have not tried this yet, and I do not know if driver sync overhead or the cost of on-GPU context switches will destroy this "advantage." It is my understanding that a GPU context switch isn't quite the performance nightmare it was 10 years ago.

If the goal is MT rendering to a single render target/FBO, I don't know how that will ever work, as serialized in order writes to the frame buffer is pretty close to the heart of OpenGL. I don't think DSA would address that either.

As a final side note, I'm not as excited about DSA as everyone else seems to be. (Of course, X-Plane doesn't use any GL middle-ware so we can shadow state selectors and go home happy. :-)

In particular, every time I've taken a careful look at what state change actually does, I see the same thing: state change hurts like hell. Touching state just makes the driver do a ton of work. So I'm not that interested in making state change any easier...rather I am interested in finding ways to avoid changing state at all (or changing the ratio of content emitted to state change).

In other words, what's annoying about selecting a VBO, then setting the vertex pointer is not the multiple calls to set a VBO/vertex pointer pair...it's the fact that the driver goes through hell every time you change the vertex pointer. Extensions like VBO index base offsets (which allow for better windowing on a single VBO) address the heart of the problem.

Cheers
Ben

pjcozzi
03-14-2010, 12:49 PM
I agree with the mentioned benefits of .x versions over just extensions. Besides guaranteeing vendor support and marketing reasons, it is also nice for developers to be able to create an x.y context and know what features to expect.


GL3 for DX10 hardware and GL4 for DX11 hardware.
Perhaps by the time GL5 comes out, people will be saying DX12 for GL5 hardware. :)

Regards,
Patrick

ZbuffeR
03-14-2010, 02:17 PM
to load them as OpenGL textures, I had to use the thread that created the context otherwise I'd experience an unexpected behaviour.
You want to use the PBO extension :
http://www.opengl.org/registry/specs/ARB/pixel_buffer_object.txt
See also :
http://www.songho.ca/opengl/gl_pbo.html
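A loose sketch of the PBO path (sizes, formats and names are assumptions): the worker thread only fills memory, and the GL thread's glTexImage2D sources its pixels from the bound unpack buffer at offset 0 instead of a client pointer.

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, NULL, GL_STREAM_DRAW);

void *ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(ptr, decodedPixels, width * height * 4);   /* e.g. written by the decode thread */
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, (const GLvoid *)0);   /* offset into the PBO */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);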

davej
03-14-2010, 02:18 PM
In gl.spec, Indexub and Indexubv both have a category of VERSION_1_1_DEPRECATED but neither have a deprecated property as the others with that category do.

Rob Barris
03-14-2010, 02:54 PM
I think the value of DSA when used on an MT driver has to do with these issues:

a) MT drivers don't like it when you query and may inflict big delays on your thread.

b) Currently, you have to perturb the drawing state (via binds) to do things to objects, like TexImage. Background texture loaders may run into things like this.

c) So if you want to do things to objects without perturbing the drawing state, you need to save and restore bind points.

d) if you are not in a position where you have done your own shadowing, so you know what the bind point was "supposed to be set back to" for drawing, then you have to query. This can be particularly difficult for things like in-game overlays that are implemented as a separate library without knowing anything about the code base they are co-resident with.

DSA allows for directly communicating changes to objects without altering the drawing state, and not having to query/reset bind points that affect the drawing state.
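Points (c)/(d) in code form, roughly (EXT_direct_state_access names; streamedTex, w, h and pixels are placeholders):

/* Without DSA: save (query or shadow), bind, modify, restore. The query can stall an MT driver. */
GLint saved;
glGetIntegerv(GL_TEXTURE_BINDING_2D, &saved);
glBindTexture(GL_TEXTURE_2D, streamedTex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
glBindTexture(GL_TEXTURE_2D, (GLuint)saved);

/* With DSA: the object is named directly and the drawing state is never perturbed. */
glTextureSubImage2DEXT(streamedTex, GL_TEXTURE_2D, 0, 0, 0, w, h,
                       GL_RGBA, GL_UNSIGNED_BYTE, pixels);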

Chris Lux
03-15-2010, 12:36 AM
will the slides from the OpenGL session from GDC be available on the Khronos site?

Khronos_webmaster
03-15-2010, 06:23 AM
The slides from the GDC session OpenGL 4.0 are now available on the Khronos site (http://www.khronos.org/library/detail/gdc-2010/).

Groovounet
03-15-2010, 06:53 AM
Be optimistic, sampler object is the first DSA API!
That's a good sign for OpenGL 4.1
:o)

I had to design a renderer for a 10-year-old piece of software full of glBegin(*), glPushAttrib(ALL), reset(everything), selectors, OpenGL context switches when lost, etc...

Most of this was the result of programmers working with OpenGL without much of an understanding of it, as OpenGL was not critical or a limitation until then. I'm not saying they were bad programmers; the WORST OpenGL code I have seen was written by the BEST programmer I have worked with: a serious C++ god. The code was an absolute piece of art as C++ code but an absolute piece of trash as OpenGL code. He didn't use OpenGL as a main tool and didn't think it could be much of a cost: "After all, OpenGL is a standard, it should be good if people agreed on it?"

That's why I disagree with nVidia's statement that the deprecation mechanism is useless. I'm not saying I don't want to see the deprecated functions anymore; I'm saying that I feel more confident going to see developers and saying: "Here is OpenGL core; basically, you can rely on that to be efficient. On the other hand, there are more features with OpenGL compatibility, but..." A simple guideline that anyone can understand.

With DSA, it's the same kind of idea. DSA is a concept that any good programmer will understand and be able to build a "good" software design on.

Also, I think that tying the spec's major version to the hardware level simplifies supporting a large number of platforms a lot. It's quite easy to understand for anyone, not just heavy OpenGL fans hunting for extensions all the time.


"A great solution is a simple solution. DSA for OpenGL 4.1." (My campaign slogan for DSA! :p)

bsupnik
03-15-2010, 07:13 AM
Hi Rob,

Ah...MT driver + library that must reset state = terrible performance, I can see that.

For item (b), background loaders, is this a problem if the background loader is on a shared context on its own thread? I would [hope] that the background loader isn't inducing a reconfiguration of the pipeline within the rendering context.

Of course, I am assuming that my limited set of GL calls to do background loading isn't using the GPU at all, otherwise there is a hidden cost to async loading. That is, I am assuming that a glTexImage2D in a "loader context" without a real drawing of triangles won't cause the driver to go through and actually update the pipeline configuration. This is a completely unvalidated assumption. :-)

cheers
Ben

Groovounet
03-15-2010, 07:36 AM
OpenGL calls are collected into a command list and can theoretically stack up to one frame of latency. That's a usual strategy for conditional rendering with occlusion queries, so that the CPU never stalls waiting for the GPU to process something.

Queries and glGet are likely to imply a stall, not glTexImage2D.
That's another reason why DSA saves us! We don't have to use the save (which implies a glGet) / restore paradigm anymore.

There are several discussions on this forum about why we love DSA; I'd like to hear why some people don't.

A code sample that shows the problem:


GLuint Texture = 0;
// Some code
glBindTexture(GL_TEXTURE_2D, Texture);
// Some code
GLint ActiveTexture = 0;
glGetIntegerv(GL_TEXTURE_BINDING_2D, &ActiveTexture);


The full command list needs to be processed to be sure that Texture == ActiveTexture. Maybe in "some code" another texture has been bound.

pjcozzi
03-15-2010, 07:54 AM
The GLSL 3.3 spec (http://www.opengl.org/registry/doc/GLSLangSpec.3.30.6.clean.pdf) (and GLSL 4.0) say:


Added Appendix A to describe include tree and path semantics/syntax for both the language and the API specifications.
But the document does not include appendix A.

Thanks,
Patrick

skynet
03-15-2010, 08:07 AM
If DSA is all about capturing/restoring state, why not provide "state block objects"?

Something like:


GLint stateblock=0;
glGenStateBlocks(1, &stateblock);
...
...
glCaptureState(GL_COLOR_BUFFER_BIT|GL_FRAMEBUFFER_BIT, stateblock);
...
glBindFramebuffer(GL_FRAMEBUFFER, myFBO);
glColorMask(1,1,1,1);
glClearColor(0,0,0,0);
glClear();
...

glRestoreState(stateblock);


This lets you interact nicely with "external" code without ever needing to call glGetXXX(). Additionally, stateblocks may be a very quick way to set complex state really fast, just like Display Lists once did:



GLint savestate=0, mystate=0;
glGenStateBlocks(1, &savestate);
glGenStateBlocks(1, &mystate);

glBindFramebuffer(GL_FRAMEBUFFER, myFBO);
glColorMask(1,1,1,1);
glClearColor(0,0,0,0);
glCaptureState(GL_COLOR_BUFFER_BIT|GL_FRAMEBUFFER_BIT, mystate);
...

//then the actual execution might look this way
glCaptureState(savestate);
glRestoreState(mystate);
glClear();
glRestoreState(savestate);

Groovounet
03-15-2010, 09:02 AM
@skynet: It looks like glPushAttrib... I'm not sure.

I have submitted a state object idea a few times, but more as a replacement for display lists. I sometimes use display lists as state objects, which is more efficient than not using them. For that I designed some "macro objects"; when I "bind" those, I call the list.

It's possible that my design is efficient only because I work at the macro-object level, which probably implies more calls than I could otherwise get away with, but still: nice and handy!

The thing is just that display lists are deprecated, and it would be nice to replace them!

Actually, sampler objects are kind of a step forward in that direction.
(Like VAO? ouchhh)

Groovounet
03-15-2010, 09:18 AM
Something is bothering me: GL_ARB_gpu_shader_fp64 is part of OpenGL 4.0 specification... That's mean that my Radeon 5770 is not an OpenGL 4.0 card...

I went through the spec to check whether there is a rule to relax double support... but it doesn't seem to be the case.

Considering that doubles are not that useful yet, I'm quite sceptical about this choice. It might slow down OpenGL 4.0 adoption and push quite a few developers toward OpenGL 3.x + a large bunch of extensions.

The good thing with one OpenGL major version per hardware generation is:
- 1 code path for OpenGL 2.1 for GeForce 6 >, Radeon X >
- 1 code path for OpenGL 3.x for GeForce 8 >, Radeon HD >
- 1 code path for OpenGL 4.x for GeForce GT* 4** >, Radeon 58** >

Because of this we end up with an extra code path. OpenGL 4.WEAK which will be OpenGL 3.x + load of extensions for GeForce GT* 470 < and Radeon 57** <=.

Not cool.

If the idea was to make OpenGL 4.0 for high-end graphics only, maybe a high-end profile for Quadro, FireGL and high-end GeForce and Radeon cards would have been great. OpenCL has such an option in its spec, which I quite like.

Ilian Dinev
03-15-2010, 09:38 AM
Maybe the RHD57xx do support doubles, but in DX's nature it's not a mentionable/marketable feature?
(disclaimer: I haven't checked any in-depth docs on 57xx)

Edit: http://www.geeks3d.com/20091014/radeon-hd-5770-has-no-double-precision-floating-point-support/

Maybe there's a glGetIntegerv() call to check precision, just like gl_texture_multisample num-depth-samples in GL3.2 ?

Groovounet
03-15-2010, 10:26 AM
The only query the extension defines is through glGetActiveUniform... (and transform feedback, but...). I don't think it would be of any use for that.

Alfonse Reinheart
03-15-2010, 10:43 AM
Be optimistic, sampler object is the first DSA API!

The second. Sync objects are the first.


Because of this we end up with an extra code path. OpenGL 4.WEAK which will be OpenGL 3.x + load of extensions for GeForce GT* 470 < and Radeon 57** <=.

Oh well. It could be worse. At least OpenGL allows you to get at that 3.x + extensions.

Also, please note: most of the 4.0 features are core extensions, so you don't need a true codepath. Your code will call the same function pointers with the same enum values. All you need to do is check for version 4.0 or the extension.
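
A minimal sketch of that check (GL 3.x style, using glGetStringi; tessellation is just the example feature here, and the snippet assumes a current context plus <string.h>):

GLint major = 0, numExt = 0;
GLboolean hasTess = GL_FALSE;
glGetIntegerv(GL_MAJOR_VERSION, &major);
if (major >= 4)
    hasTess = GL_TRUE;                          // 4.0+ context: the feature is core
glGetIntegerv(GL_NUM_EXTENSIONS, &numExt);
for (GLint i = 0; i < numExt && !hasTess; ++i)  // otherwise look for the ARB extension
    if (strcmp((const char*)glGetStringi(GL_EXTENSIONS, i),
               "GL_ARB_tessellation_shader") == 0)
        hasTess = GL_TRUE;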

Groovounet
03-15-2010, 11:03 AM
Be optimistic, sampler object is the first DSA API!

The second. Sync objects are the first.


I would not put it that way. Sync objects are something... "different", as they are not names but structures, let's say, compared to the DSA extension.




Because of this we end up with an extra code path. OpenGL 4.WEAK which will be OpenGL 3.x + load of extensions for GeForce GT* 470 < and Radeon 57** <=.

Oh well. It could be worse. At least OpenGL allows you to get at that 3.x + extensions.

Also, please note: most of the 4.0 features are core extensions, so you don't need a true codepath. Your code will call the same function pointers with the same enum values. All you need to do is check for version 4.0 or the extension.

True, I guess (especially because I don't see myself writing an OpenGL 4.0 code path anytime soon; OpenGL 4.WEAK rules!).

I would say that, just like Rob said, this lot of OpenGL spec releases is clearer for everyone and a better deal than the "lot of extension specs". A "lot of specs and a lot of extension specs", I'm not so sure.

bsupnik
03-15-2010, 11:53 AM
Groovounet - to be clear, I have nothing against DSA as a feature; since my app is monolithic it simply won't benefit in the major way that library-based apps might. (By being monolithic, we can simply shadow everything - the win might be in total function calls but we wouldn't be removing pipeline stalls.)

My first post is confusing because I completely misinterpreted the community's meaning in "MT" rendering, which I guess is understood to mean a multi-threaded GL driver, split between the user thread and a back-end that collects and executes calls later, providing parallelism between app code (iterating the scene graph) and driver code (validating hardware state change, preparing batches, etc.).

The multi-threaded rendering I would like is different and not a function of DSA: I would like to be able to render all six sides of an environment cube map in parallel from six threads. :-)

cheers
ben

Alfonse Reinheart
03-15-2010, 11:59 AM
I would say that, just like Rob said, this lot of OpenGL spec releases is clearer for everyone and a better deal than the "lot of extension specs".

I wouldn't. It's much easier to look at an extension specification to figure out what it is doing than to see exactly how a specific feature like sampler_objects is defined in GL 3.3.


I would like to be able to render all six sides of an environment cube map in parallel from six threads.

You can already do that with geometry shaders and layered framebuffer objects (http://www.opengl.org/wiki/Framebuffer_Objects).

Before asking for something, you might want to make sure you don't already have it ;)
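
Roughly, the layered-FBO route is (GL 3.2, sketch only; the FBO, cube-map textures and program are assumed to exist, and the program's geometry shader writes gl_Layer to pick the face per primitive):

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, cubeColorTex, 0); // layered attachment: all six faces
glFramebufferTexture(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, cubeDepthTex, 0);
glUseProgram(cubeProgram); // its geometry shader emits each triangle six times, setting gl_Layer = 0..5
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);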

Jan
03-15-2010, 12:27 PM
Well, to be pedantic, that's not exactly what he said.

Using Geometry shaders and layered fbo's lets you render several things at once from ONE thread. He said he would like to render from several threads, like using the GPU in a true multi-tasking way. Actually that would be like having 6 contexts (though the GPU would still process everything in sequence, i guess).

Of course i assume that Alfonse's suggestion actually solves the problem at hand. I don't see how truly multi-threading the GPU should be of value to anyone.

@bsupnik: Correct, when people talk about "multi-threaded rendering" they are concerned with offloading CPU computations to different cores. Drivers do that by queuing commands and executing them in their own thread. Applications would like to do things like object-culling and preparing the commands for rendering them in a separate thread, such that the main-thread (the one that is most likely CPU-bound) is freed up.

I have an application where entity-culling can (depending on the view-direction) become a serious bottleneck. But since it is intertwined with rendering, it is not easy to offload it to another thread. If i could create a command-list, i could take the whole complex piece of code, put it into another thread and only synchronize at one spot, such that the main-thread only takes the command-list and executes it, with no further computations to be done.

I would really like to see something like D3D11's command buffers in OpenGL 4.1. Though i assume for that to work well, the API state management needs to be thoroughly cleaned up. And that is the one thing that the ARB (or IHVs that don't want to rewrite their drivers) has been running from so far. But well, maybe they come up with a good idea that does not require a major API change. With the latest changes they have proven to be quite clever about it.

Jan.

kRogue
03-15-2010, 02:07 PM
Not a request for a GL feature, just a note on the spec itself. GL4 added tessellation, but it would have made the spec's pipeline much easier to read and understand if a diagram of all the shader stages had been included, preferably something nicer than the ASCII-art diagram found in ARB_tessellation_shader (http://www.opengl.org/registry/specs/ARB/tessellation_shader.txt), to make it clear when and whether each shader stage is run.

Same story for transform feedback and vertex streams.

Additionally, a little more spelling out of the instancing divisor functionality would help. I liked how the spec explained instancing using the non-existent functions "DrawArraysOneInstance" and "DrawElementsOneInstance"; maybe go a step further and explicitly and clearly state the role of the divisor in VertexAttribDivisor.
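
For what it's worth, a minimal sketch of what the divisor does in practice (attribute location 3 and the buffer/count names are made up; divisor 0 = advance per vertex, divisor N = advance once every N instances):

glBindBuffer(GL_ARRAY_BUFFER, instanceColorVBO); // one vec4 per instance, not per vertex
glEnableVertexAttribArray(3);
glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE, 0, 0);
glVertexAttribDivisor(3, 1); // advance attribute 3 once per instance
glDrawArraysInstanced(GL_TRIANGLES, 0, vertexCount, instanceCount);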

Giggles, people still pining for DSA. For those saying it is just syntactic sugar, look at slide 65 of http://www.slideshare.net/Mark_Kilgard/opengl-32-and-more and see that it is more than just a convenience when you are not tracking bound GL state yourself.

The new blending stuff was quite unexpected; does anyone know if AMD/ATI's 5xxx cards can actually do that? The new funky blending is also in GL 3.3; can the GeForce 8 series do that too?
[note that in GL 3.x the blending parameters cannot be set per render target, though; only enabling and disabling can be set per render target]

All in all, I am very, very happy with and very appreciative of the spec writers and the IHVs that support GL, and of the fact that the GL API is being so actively maintained and updated (especially when compared to the dark days between GL 2.x and GL 3.0).

Heiko
03-15-2010, 02:22 PM
Something is bothering me: GL_ARB_gpu_shader_fp64 is part of the OpenGL 4.0 specification... That means my Radeon 5770 is not an OpenGL 4.0 card...

You made me look at the specs on AMD's website, because I was convinced the entire Evergreen line could do double-precision calculations... but double-precision performance in Gflops is not mentioned for these cards, so perhaps they do not support it!

Alfonse Reinheart
03-15-2010, 03:30 PM
It's just that only some of the HD 5xxx line supports double-precision, not all of them. So only some can offer GL 4.0 support.

Groovounet
03-15-2010, 05:40 PM
I have been assured that AMD is going to support OpenGL 4.0 on the entire Radeon 5*** line... I asked how... but I didn't get any answer...

Smells bad, I don't like that!

I would not be surprised to see AMD drivers advertising OpenGL 4.0 support even when it's not actually the case! We have had quite a few stories like that from AMD, nVidia and Intel (Intel is king on that matter!). We will see when the drivers are released... within 6 months?

The other possibility is that AMD claims to support doubles but all the double uniforms and variables are actually computed at single precision... With the relaxed implicit conversions it would basically be as simple to "implement" doubles as doing something like this:

#define double float

I'd rather target OpenGL 3.x + a lot of extensions than that. However, from a marketing point of view, 4.0 is bigger than 3.3 + almost all the extensions.

Rob Barris
03-15-2010, 07:42 PM
It's just that only some of the HD 5xxx line supports double-precision, not all of them. So only some can offer GL 4.0 support.

So, say they all support GL 3.3, and they all support the GL4-class extensions that can work on those chips. Is that a usable combination for the apps that don't need double precision, but do want to use tessellation for example?

Alfonse Reinheart
03-15-2010, 08:26 PM
Actually, the concern is more what Groovounet was suggesting: that ATI might pretend that the entire HD 5xxx line supports 4.0 and silently convert doubles to floats.

Without conformance tests, what would there be to stop such a thing? Indeed, I seem to recall something similar happening at the 2.0 transition, when unrestricted NPOT support was required by the specification. Some hardware that couldn't actually support unrestricted NPOTs would still advertise GL 2.0, but silently break if you used NPOTs.

Eric Lengyel
03-15-2010, 08:41 PM
Here's one more vote for DSA in the core.

ARB_explicit_attrib_location is nice. Now if we only had ARB_explicit_uniform_location (including samplers) as well for all shader types, we would finally get all the capabilities back that we enjoyed with ASM shaders and Cg. Right now, it's silly to keep rebinding uniforms over many shaders in cases when they have the same value globally.

Rob Barris
03-15-2010, 08:46 PM
Here's one more vote for DSA in the core.

ARB_explicit_attrib_location is nice. Now if we only had ARB_explicit_uniform_location (including samplers) as well for all shader types, we would finally get all the capabilities back that we enjoyed with ASM shaders and Cg. Right now, it's silly to keep rebinding uniforms over many shaders in cases when they have the same value globally.


UBO can solve some of those problems if you have a common group of uniforms that can be stored together?
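
For reference, a sketch of that kind of sharing with a UBO (the block name "GlobalParams" and the globals struct are made up, and the struct is assumed to match the block's std140 layout): one buffer holds the shared values, and every program points its block at the same binding.

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(GlobalParams), &globals, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo); // binding point 0, set once

// per program, at setup time (not per draw):
GLuint idx = glGetUniformBlockIndex(prog, "GlobalParams");
glUniformBlockBinding(prog, idx, 0);

// when a global changes, update the one buffer; no per-program glUniform calls:
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(GlobalParams), &globals);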

Eric Lengyel
03-15-2010, 10:11 PM
UBO can solve some of those problems if you have a common group of uniforms that can be stored together?

Maybe. But UBOs are not available across all hardware I need to support (SM3+), and using UBOs is not guaranteed to be the fastest path. Something with the same effect as glProgramEnvParameter*() from ASM shaders is what I'd really like to see for GLSL.

Alfonse Reinheart
03-15-2010, 11:30 PM
But UBOs are not available across all hardware I need to support (SM3+)

Quite a bit of GL 2.1 hardware is coming to the end of support from IHVs. So even if they add an extension tomorrow, there is still a lot of hardware out there that won't have access to it.


Something with the same effect as glProgramEnvParameter*() from ASM shaders is what I'd really like to see for GLSL.

Is that guaranteed to be the "fastest path?"

Jan
03-16-2010, 04:17 AM
I agree with Eric. Something to make uniforms like the environment variables from ARB_fragment_program would be extremely helpful.

In my apps i actually treat all uniforms like env variables. If two shaders use a uniform with the same name, any update in the app will change both uniform values, no matter which of the two shaders (if any) is bound, atm. Honestly i don't see a reason why shaders should have their own local variables, at all.

Right now, everyone who treats uniforms as global program state, that the shader can "query" by creating a uniform with the given name, has to do lots of complicated management to get those values into the shaders and try to prevent unnecessary updates.
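
A bare-bones sketch of that kind of app-side management, without any of the redundancy filtering (all names are made up; assumes GL types and loaded entry points):

#include <map>
#include <string>

std::map<std::string, float> gGlobals; // e.g. "uTime", "uExposure"

void bindProgram(GLuint prog)
{
    glUseProgram(prog);
    // re-apply every global value this program actually declares
    for (std::map<std::string, float>::const_iterator it = gGlobals.begin();
         it != gGlobals.end(); ++it)
    {
        GLint loc = glGetUniformLocation(prog, it->first.c_str());
        if (loc >= 0)
            glUniform1f(loc, it->second);
    }
}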

Jan.

MikeJGrantz
03-16-2010, 05:04 AM
Maybe the RHD57xx do support doubles, but in DX's nature it's not a mentionable/marketable feature?
(disclaimer: I haven't checked any in-depth docs on 57xx)

Edit: http://www.geeks3d.com/20091014/radeon-hd-5770-has-no-double-precision-floating-point-support/

Maybe there's a glGetIntegerv() call to check precision, just like gl_texture_multisample num-depth-samples in GL3.2 ?

Can I add my voice to those looking for clarification on whether single-precision DX11 cards will be able to run OpenGL 4, and if not, whether headline features of OpenGL 4.0 and DX11 like tessellation will be unavailable on those cards? :)

As far as I know it is only the 5800 cards that can do double-precision floats.

If 57xx cards won't be OpenGL 4.0 compliant then this is a shame, because even if there is an easy vendor extension that can be called to get features like tessellation, it is a balls-up from a consumer-perception POV, and it would be a shame to return to the vendor-extension market fragmentation whose supposed obsolescence was, to my mind, one of the best features of OpenGL 3/4 development.

p.s. congrats and thanks to the OpenGL development team. :)

bsupnik
03-16-2010, 09:37 AM
You can already do that with geometry shaders and layered framebuffer objects.

Before asking for something, you might want to make sure you don't already have it ;)


Alfonse, I am aware of geometry shaders + FBOs, but they don't solve my problem (nor are they the same as what I am suggesting, namely true parallel rasterization, which I do realize is pie in the sky). Consider the metrics of the situation.

I have an algorithm that pre-renders low-quality parts of my scene from several view angles. Two examples would be cube-map-based environment mapping for fake reflectivity, or cascading shadow maps. In both cases, depending on the view angle, it is quite possible that the batches and meshes for each "view" (six orthogonal frusta for the cube map, or perhaps four or more non-overlapping frusta for CSM) will be mostly unique and different. In other words, the scene-graph subset submitted to the far shadow map and the near one won't be the same.

In this case:
- I am submitting a lot of geometry. My tests with gshaders indicate that they slow down my throughput.
- I am submitting the geometry and batch state changes in series when the work is actually independent per render target. That is, there is a 4x or 6x parallelization win I don't get.

Furthermore, if the complexity of the shader is very low (CSM = depth only, probably not fill-rate limited! :-) and the GPU is quite powerful, then I don't expect to really be able to drive the shading resources to their fullest potential by submitting lots of these simple batches. And there are a lot of them.

Since my app is already often batch limited, it only gets worse having to hit the scene graph over and over.

In the meantime, some of our users pick up SLI hardware, and then ATI comes out and demos a wall of video with the HD5000.

Hence my pie in the sky dream: if the GPU companies want to find a way to sell more hardware, sell me more command processors and let me prepare my gajillions of shadow maps in parallel.



Actually that would be like having 6 contexts (though the GPU would still process everything in sequence, i guess).


Right - if you were to code a CSM with threaded GL now, this is what would happen, and you might have _some_ benefit from preparing command buffers in parallel, if the driver does this well, and some loss, due to context switches.



Of course i assume that Alfonse's suggestion actually solves the problem at hand. I don't see how truly multi-threading the GPU should be of value to anyone.


It sort of does but sort of does not. gshaders + layered FBOs still requires a serialized submit (and serialized rasterization) of multiple independent pre-rendered textures (dynamic environment maps, shadow maps, what-have-you). If the driver is threaded you've gone from using 1/8th of your high-end machine to 2/8ths.



@bsupnik: Correct, when people talk about "multi-threaded rendering" they are concerned with offloading CPU computations to different cores.


Right...the problem is that with truly serialized rendering, we still have one producer and one consumer...that's not going to scale up to let me use the 8 cores (16 soon..joy) that some of our users have in their machines.

To put it simply, if I have to prepare 8 shadow maps to render my scene, and the 8 shadow maps are CPU/batch bound*, and I have 8 cores and I'm only using one...well, that's the math.



If i could create a command-list, i could take the whole complex piece of code, put it into another thread and only synchronize at one spot, such that the main-thread only takes the command-list and executes it, with no further computations to be done.


Why can't you do that now? I tried this with X-Plane, although the benefit in parallel execution wasn't particularly large vs. the overhead costs. You can roll a poor man's display list/command queue by building a "command buffer" out of some abstracted form of the GL output of your cull sequence... of course, this assumes that the actual work done by the cull loop in your app comes out in some reasonably uniform form that can be buffered easily as commands, and it also assumes that the actual "draw" is cheap except for driver overhead. :-)
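
A minimal sketch of such a poor man's command buffer (the two-opcode command set and all names are hypothetical; the worker thread records, the GL thread replays):

#include <vector>

enum CmdOp { CMD_BIND_TEXTURE, CMD_DRAW_ARRAYS };

struct Cmd {
    CmdOp   op;
    GLuint  tex;    // for CMD_BIND_TEXTURE
    GLint   first;  // for CMD_DRAW_ARRAYS
    GLsizei count;  // for CMD_DRAW_ARRAYS
};

// worker thread: no GL calls, just recording
void recordDraw(std::vector<Cmd>& buf, GLuint tex, GLint first, GLsizei count)
{
    Cmd bind = { CMD_BIND_TEXTURE, tex, 0, 0 };
    Cmd draw = { CMD_DRAW_ARRAYS, 0, first, count };
    buf.push_back(bind);
    buf.push_back(draw);
}

// GL thread: replay after synchronizing once with the worker
void replay(const std::vector<Cmd>& buf)
{
    for (size_t i = 0; i < buf.size(); ++i) {
        const Cmd& c = buf[i];
        if (c.op == CMD_BIND_TEXTURE) glBindTexture(GL_TEXTURE_2D, c.tex);
        else                          glDrawArrays(GL_TRIANGLES, c.first, c.count);
    }
}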

cheers
Ben

Ilian Dinev
03-16-2010, 10:35 AM
You have glMapBuffer, ARB_draw_indirect, PBOs, texture buffers and UBOs. Make several degenerate triangles to group objects by num_primitives and vertex shader (static, skinned, wind-bent, etc.), and do lots of instancing with double-referencing through indices of data. This will skip the vertex cache (it's getting emulated on ATI cards anyway), but you can draw any current-style shadow-map scene in 16*3 draw calls [16 = log2(65535 primitives/mesh), 3 = static, skinned, wind-bent]. Each group of 48 draw calls is computed on whichever CPU core, written into the pre-mapped buffer objects. The main thread only unmaps the buffers and executes the draw calls.

There is only one GPU command stream. Wanting to magically push more and more through the old methods indefinitely won't work. So look at the HW caps and think outside the box :).

Jan
03-16-2010, 12:01 PM
@bsupnik: I think i cannot use DLs to do that, because many of the commands i use are not compiled into display lists. It is really a huge piece of rendering code that sets many different shaders, uniforms, binds textures, executes (instanced) drawcalls, etc. AFAIK at least some of that would not be included into DLs. Additionally i would need a second GL context and i would need to share all the data between them. I have no experience with that, but i think there are resources that are not easily shared.
It would really be nice to create a "command buffer context" like in D3D11, that can cache ALL commands in a buffer and does not need to share resources, because those resources are not really accessed in that context, and then execute that command buffer in the main context.

Of course i could create such a command buffer manually, through an own abstraction layer, but that is very much work and i don't have time for that. Actually, if i had time to invest into that piece of code, i would rewrite it entirely and make it much more efficient from the ground up. But atm that's out of the question. For the future a command buffer would still be nice to have, some things are simply hard to parallelize. Also the driver could already do some work when building the command buffer, and make later less work in the main thread.

The simple conclusion is: We need to use multi-threading to improve our results, but OpenGL makes it very hard to use multi-threading for the rendering-setup. That needs to change.

Jan.

Eosie
03-16-2010, 12:24 PM
... and glGet are likely to imply a stall
This is a myth and I wonder why people are following it so blindly. glGet should never cause a stall except for hardware queries. Drivers may (and should) store any OpenGL state in the context thread and pass only real rendering commands to the internal worker thread of a driver. Where are the stalls now? Yeah, there are none.


ATI might pretend that the entire HD 5xxx line may support 4.0 and silently convert doubles to floats
Welcome to reality. Yes, features are being faked, have been, and will be. Live with it - it's a tradition. Otherwise all ATI GL 2.1 GPUs would be advertising GL 1.1 (hw support for texture_lod is missing, and the same holds for at least 2 other GL2 features) and some early ATI GL3 parts would be advertising GL 2.1 (hw support for draw_buffers2 is missing). They can't afford to market their GPUs as being on the same level as the "old crap". It doesn't surprise me that they're going to fake doubles; I am expecting it.

Alfonse Reinheart
03-16-2010, 12:37 PM
Drivers may (and should)

But they don't, so it doesn't matter what they "may" or "should" be doing. They don't, so glGet is a bad idea.

And why "should" they? If I don't need to glGet anything, why should the driver be taking up precious memory to mirror something I don't care about? That's why it is left up to the application.


Live with it - it's a tradition.

This is different. textureLod is not a feature that is frequently used. It also doesn't affect the output much, so its absence is not a huge thing. And drivers that don't yet implement everything are quite a bit different from drivers that will never implement everything.

If you ask for double-precision, it is only because you need it. You will not get an acceptable answer from single-precision computations.

On the plus side, double-precision matters a lot more for OpenCL than for OpenGL. So not many GL applications will actually care, and OpenCL has a way to effectively test this.
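
In OpenCL that check is just a device-extensions query, roughly like this (a sketch; assumes the CL headers and a valid device id):

#include <cstring>
#include <vector>
#include <CL/cl.h>

bool deviceHasFp64(cl_device_id dev)
{
    size_t size = 0;
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, NULL, &size);   // ask for the string length first
    std::vector<char> ext(size + 1, 0);
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, size, &ext[0], NULL);
    return strstr(&ext[0], "cl_khr_fp64") != NULL;                // doubles are an optional extension in CL
}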

Eosie
03-16-2010, 01:49 PM
Drivers may (and should)

But they don't, so it doesn't matter what they "may" or "should" be doing. They don't, so glGet is a bad idea.
Do you have proof of this? To my knowledge, an implementation-dependent issue which can be resolved so easily isn't a good justification for getting "some other feature" into the spec. I want DSA as much as you do, but because it's more efficient as a whole, not because some implementation has trouble getting multithreading right in the first place. A few more kilobytes of variables in memory? Yeah, that's really too much wasted space, isn't it?


If you ask for double-precision, it is only because you need it.
You don't ask for anything - it's simply advertised to you. It's the same case as with NPOT textures. You either get it in software fallback or never.

Alfonse Reinheart
03-16-2010, 02:03 PM
Do you have a proof for this?

Name an IHV who has ever, ever recommended that glGet functions are perfectly fine performance-wise and that user mirroring would be a waste of time.

Since the earliest days of OpenGL implementations, driver writers have drilled into our heads that glGet is bad. Maybe that's changed in 10 years, but they have not yet said otherwise.


You don't ask for anything - it's simply advertised to you.

Of course you ask for it. If you use a "vec3" in any GL 4.0 shader, you're getting floating-point values. Only by explicitly using a "dvec3" do you get double-precision values. Just like with NPOTs, you must ask for them to use them.

And just like with NPOTs, if you ask for them, it is because you need them.

Alfonse Reinheart
03-16-2010, 02:11 PM
I want DSA as much as you do but for the reason it's more efficient as a whole rather than some implementation has issues to do multithreading right in the first place.

DSA is not about efficiency. DSA does not mean that drivers can happily get rid of the name-to-pointer mapping. DSA does not make your code more efficient.

Ketracel White
03-16-2010, 02:27 PM
DSA does not make your code more efficient.


I disagree. DSA would mean that I could get rid of an entire headache-inducing crap layer in my code that's only there to track and manage current bindings.

Alfonse Reinheart
03-16-2010, 02:31 PM
I was talking to Eosie, who seems to think that what you are referring to is unimportant, since you should just be doing glGets.

Groovounet
03-16-2010, 03:23 PM
Ok, I actually agree with Eosie if we stick to the theory.

Not all glGets stall; some do, some don't. Which ones? I don't know. It's more guesswork than proven fact, and probably implementation dependent, so I prefer to treat "assume glGet stalls" as good practice.

I did some tests (a while ago, back in OpenGL 2.1 times!) comparing glGet against macro state objects built with display lists: an obvious win for the macro state objects, even though I changed a lot more state than required this way.

Jan
03-16-2010, 06:08 PM
Don't forget that a glGet* call will call into a driver DLL, which is ALWAYS slower than calling your own functions, no matter how fast that function actually returns.

bsupnik
03-16-2010, 08:37 PM
@All: regarding stalls and glGet, I have observed directly via Shark that the OS X OpenGL implementation will stall on some glGets when the "threaded" driver is in use via a window-system enable. I have heard from my users that X-Plane performance is dreadful on NV drivers if the threaded driver is used, so I can only speculate that it's the same problem of a threaded driver stalling on a "get".

It's not clear to me why the driver must stall, as these gets are for state set only from the client side, but OpenGL is a leaky abstraction...the spec never promises fast gets and it never promises slow ones either. :-)


The gpu cmdstream is one. Wanting to magically push more and more via the old methods indefinitely won't work. So, look at the HW caps, and think out of the box.

@Ilian: Fair enough...I can't argue with the hardware.


It is really a huge piece of rendering code that sets many different shaders, uniforms, binds textures, executes (instanced) drawcalls, etc.


Ah. My "emulated display lists" took advantage of the fact that all of the OpenGL calls were going through only six calls into the lower levels of the rendering engine, so the culling code could simply accumulate these six calls as opcodes in a buffer and then run them later. It sounds like your app has code organization/architectural reasons why this won't work.

kRogue
03-17-2010, 01:49 AM
On DSA and why it is more than just a convenience, see slide 67 of http://www.slideshare.net/Mark_Kilgard/opengl-32-and-more

But in all honesty, that the state is not shadowed is kind of screwy... we are not talking about a lot of data, really... if we think about it, most objects have just a few bytes of state (for textures, for example, I am not talking about the texture data itself but about all the glTexParameter values), so even if one had a million GL objects running around, it is not a lot of data... the only object that may have a fair amount of state is a GLSL program (as it saves the values of uniforms), but one does not query those values anyway...

Groovounet
03-17-2010, 04:07 AM
New topic, precision qualifiers: highp, mediump, lowp.

How do they work with "double"? I expect
lowp vec4 Color; to actually be a vec4 of half-floats.
What about:
highp vec4 Color; would it be a vec4 of doubles?
And finally:
lowp dvec4 Color; ???
mediump dvec4 Color; ???
highp dvec4 Color; ???

I haven't found any detail about that...

Pierre Boudier
03-17-2010, 04:49 AM
lowp/mediump/highp are currently ignored on desktop implementations (at least on amd). they are supported to ease porting of ES apps.

Groovounet
03-17-2010, 07:23 AM
My own little OpenGL 3.3 review:
http://www.g-truc.net/post-0267.html

Dark Photon
03-17-2010, 08:51 AM
On DSA and why it is more than just a convenience, see slide 67 of http://www.slideshare.net/Mark_Kilgard/opengl-32-and-more
Wow. Amazed at all the DSA dittos with nobody asking for bindless.

What I'd really like to see in 4.1 is the huge perf benefit of NV bindless batches (http://developer.nvidia.com/object/bindless_graphics.html) in the core (namely just NV_vertex_buffer_unified_memory (http://www.opengl.org/registry/specs/NV/vertex_buffer_unified_memory.txt)), or at least EXT/ARB extension. Convenience (DSA) is good, but performance sells.

But to add my voice, great job to the ARB on the 3.2/4.0 specs! An amazing amount of excellent work! Been said before, but this is like unwrapping Christmas presents.

Ilian Dinev
03-17-2010, 08:53 AM
I second bindless, and really hope for shader_buffer_load .

Stephen A
03-17-2010, 04:06 PM
On DSA and why it is more than just a convenience, see slide 67 of http://www.slideshare.net/Mark_Kilgard/opengl-32-and-more
Wow. Amazed at all the DSA dittos with nobody asking for bindless.
I guess this has something to do with how fragile bindless graphics are (or seem to be).

Groovounet
03-17-2010, 04:58 PM
This post gives a clue about how AMD will emulate doubles on the Radeon 57xx and below:

http://oscarbg.blogspot.com/2009/10/double-precision-support-in-gpu.html

I quite expect the GeForce GTS 450 or so to do the same... as everyone else, actually.

Has anyone ever had a need for doubles for something on the GPU?

The only need I ever had for doubles was back when I was a student and did some kind of physics simulation based on dynamics, probably not the most clever implementation. Doubles helped to maintain computational stability.

Dark Photon
03-17-2010, 04:59 PM
I guess this has something to do with how fragile bindless graphics are (or seem to be).
Could you clarify that? I don't find it "fragile" at all.

And the API mod for mere bindless batch support is so very, very simple, non-intrusive, and "intuitive". I'd take that alone! (i.e. vertex_buffer_unified_memory (http://developer.download.nvidia.com/opengl/specs/GL_NV_vertex_buffer_unified_memory.txt)). shader_buffer_load (http://developer.download.nvidia.com/opengl/specs/GL_NV_shader_buffer_load.txt) is cool but could follow later IMO.

Alfonse Reinheart
03-17-2010, 05:01 PM
Has anyone ever had a need for doubles for something on the GPU?

Scientists do. I imagine the guys working on Folding@home are frothing at the mouth to get more bits of precision and IEEE-754-2008 support into more people's houses.

But scientists care more about this getting into OpenCL than OpenGL.

Groovounet
03-17-2010, 05:04 PM
But scientists care more about this getting into OpenCL than OpenGL.

Exactly what I thought, and why I think GL_ARB_gpu_shader_fp64 should not be core. Plus, OpenCL specifies doubles as an option.

Rob Barris
03-17-2010, 05:56 PM
Has anyone ever had a need for doubles for something on the GPU?

Scientists do. I imagine the guys working on Folding@home are frothing at the mouth...

Folding@Mouth!

Stephen A
03-18-2010, 02:51 AM
I guess this has something to do with how fragile bindless graphics are (or seem to be).
Could you clarify that? I don't find it "fragile" at all.

Pointers in shaders introduce a whole new class of errors that were impossible before. Without proper error reporting and debugging tools GLSL shaders will quickly become unwieldy.

For example, the current spec contains stuff like this:

What happens if an invalid pointer is fetched?

UNRESOLVED: Unpredictable results, including program termination?
Make the driver trap the error and report it (still unpredictable
results, but no program termination)? My preference would be to
at least report the faulting address (roughly), whether it was
a read or a write, and which shader stage faulted. I'd like to not
terminate the program, but the app has to assume all their data
stored in the GL is lost.

Ouch.

I'm not saying that the extension isn't useful or that it shouldn't become core, I'm just saying that it's better to err on the side of caution in this case. A revised EXT version should probably be agreed on first, implemented and reviewed, before it makes its way into the core.

Dark Photon
03-18-2010, 05:02 AM
Pointers in shaders introduce a whole new class of errors that were impossible before...
Perhaps. This might be one reason for only adding batch bindless to core/EXT initially (i.e. vertex_buffer_unified_memory (http://www.opengl.org/registry/specs/NV/vertex_buffer_unified_memory.txt)). This has absolutely nothing to do with supporting pointers in shaders. It merely provides GPU VBO addresses for vertex attribute arrays and index lists ("batch data") to get rid of most of that ugly "batch setup" overhead, and achieve the performance of NVidia's legendary display lists with ordinary garden-variety VBOs.

The more batches you can pump via simple means, the less dev-time-wasting "contortions" you have to go through to pump those polys via more obtuse means.
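
For concreteness, here is roughly what that "bindless batch" setup looks like, written from memory of the NV_shader_buffer_load / NV_vertex_buffer_unified_memory specs (so treat the details as approximate; the buffer names, sizes and Vertex type are assumptions):

GLuint64EXT vboAddr = 0, iboAddr = 0;

// once, at load time: make the buffers resident and grab their GPU addresses
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vboAddr);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glMakeBufferResidentNV(GL_ELEMENT_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ELEMENT_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &iboAddr);

// per batch: no glBindBuffer / gl*Pointer, just formats plus raw GPU addresses
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableClientState(GL_ELEMENT_ARRAY_UNIFIED_NV);
glEnableVertexAttribArray(0);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex));
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, vboAddr, vboSize);
glBufferAddressRangeNV(GL_ELEMENT_ARRAY_ADDRESS_NV, 0, iboAddr, iboSize);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);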


For example, the current spec contains stuff like this:

What happens if an invalid pointer is fetched?

UNRESOLVED: Unpredictable results, including program termination? Make the driver trap the error and report it (still unpredictable results, but no program termination)? ...the app has to assume all their data stored in the GL is lost.
Yeah, but as C/C++ programmers, we're very used to the "bad pointers can corrupt data and result in program termination" thing (e.g. passing one to glVertexPointer or glBufferSubData, or leaving a "rogue" vertex attribute enabled with a bogus pointer bound to it).

Nothing really new here at all, except these are GPU addresses not CPU. You gotta manage state properly.

Jan
03-18-2010, 05:28 AM
The difference is, that bad pointers in shaders are much more likely to blue-screen your entire PC. That's something that is pretty rare these days in ordinary programming.

Ilian Dinev
03-18-2010, 06:07 AM
Let's not forget the data at those gpu addresses are read-only :) There are no ops to write to vram from within a shader. The worst that can happen is a too-long loop. Which, anyone can test for himself - leads simply to a gpu-reset (which also happens during alt+tab to another screen resolution). Nothing scary.

Pierre Boudier
03-18-2010, 07:10 AM
Let's not forget the data at those gpu addresses are read-only :) There are no ops to write to vram from within a shader

this won't be true for much longer, since the HW can now read/write arbitrary locations from the shader.

Stephen A
03-18-2010, 09:35 AM
Yeah, but as C/C++ programmers, we're very used to the bad pointers can corrupt data and result in program termination thing (e.g. to glVertexPointer, glBufferSubData, leaving a "rogue" vertex attribute enabled with a bogus pointer bound to it).


The difference is, that bad pointers in shaders are much more likely to blue-screen your entire PC. That's something that is pretty rare these days in ordinary programming.

And the other difference is that we have 20-30 years worth of debuggers, tools and experience to hunt down those bad pointers in C/C++. For GLSL we have... nothing. Good luck debugging a complex GLSL shader that fails randomly without any debugging aides at your disposal! :)

To my mind, if OpenGL can be improved (speed or features) without becoming more fragile then it's a no-brainer. However, if a given feature introduces a large number of new failure points, then I believe it should be thoroughly examined before becoming core: is there a less dangerous way to do this? Is there a way to limit potential damage in failure cases? Is it possible to limit the extent of undefined behavior? Is there enough real usage information to judge the potential advantages/pitfalls? (If not, maybe this should become EXT first.)

If bindless graphics were a clear-cut win, the extension would have been promoted to core already. I have a feeling things are not so simple in this specific case.

Edit: just wanted to mention that several people have reported no performance increase in the official bindless graphics thread (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&amp;Number=256729). A quick google search doesn't uncover any other performance-related information.

Groovounet
03-18-2010, 10:08 AM
Edit: just wanted to mention that several people have reported no performance increase in the official bindless graphics thread (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&amp;Number=256729). A quick google search doesn't uncover any other performance-related information.

Some members of this forum did their own tests and reported a 2X performance gain, along with a detailed explanation of how they ran the tests. It's not 7X, but still great.

My personal opinion is that it is great, but maybe for OpenGL 3.5 / 4.2, so that experience with this extension gets stronger. I wonder what nVidia has learned from it so far; with some communication about that, I'd be more excited about it.

So far, I think the todo list for OpenGL 3.4 / 4.1 is already very long, with higher-ranked features to work on first. With 2 specs a year... I can wait one more year!

(Beware, ARB: we are getting used to great work!)

Rob Barris
03-18-2010, 02:01 PM
Edit: just wanted to mention that several people have reported no performance increase in the official bindless graphics thread (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&amp;Number=256729). A quick google search doesn't uncover any other performance-related information.

Some members of this forum did their own tests and reported a 2X performance gain, along with a detailed explanation of how they ran the tests. It's not 7X, but still great.

My personal opinion is that it is great, but maybe for OpenGL 3.5 / 4.2, so that experience with this extension gets stronger. I wonder what nVidia has learned from it so far; with some communication about that, I'd be more excited about it.

So far, I think the todo list for OpenGL 3.4 / 4.1 is already very long, with higher-ranked features to work on first. With 2 specs a year... I can wait one more year!

(Beware, ARB: we are getting used to great work!)


There's some drivers out there that can kill the system if you mix up your shader ins/outs (in ARBvp/fp). You can wipe yourself out with vertex array range. You can hang yourself with MapBufferRange if you don't understand all the options. You can go wrong in a million other ways and get a blank screen, hang, crash, or blue screen / panic.

We fought some lengthy battles during GL 3.0 having to do with this classic tension between safety and performance.

IMO one should be prepared to test, debug and improve their code to avoid these outcomes, and/or avoid usage of the latest extensions if they suspect the risks are too great. That's a different POV from saying "we should not provide these class of features because there is risk here."

Foobarbazqux
03-18-2010, 02:24 PM
Has anyone ever had a need for doubles for something on the GPU?

It would be useful for exponential shadow maps.

Groovounet
03-18-2010, 02:38 PM
Has anyone ever had a need for doubles for something on the GPU?

It would be useful for exponential shadow maps.

What is this, and/or how are doubles related to it?



There's some drivers out there that can kill the system if you mix up your shader ins/outs (in ARBvp/fp). You can wipe yourself out with vertex array range. You can hang yourself with MapBufferRange if you don't understand all the options. You can go wrong in a million other ways and get a blank screen, hang, crash, or blue screen / panic.

We fought some lengthy battles during GL 3.0 having to do with this classic tension between safety and performance.

IMO one should be prepared to test, debug and improve their code to avoid these outcomes, and/or avoid usage of the latest extensions if they suspect the risks are too great. That's a different POV from saying "we should not provide these class of features because there is risk here."


Ooookkkk. With this said, I guess bindless graphics is off the feature list for the next OpenGL specification. Well, it makes sense.

kRogue
03-18-2010, 04:35 PM
Just to join the chorus: bindless in core would be sooooooooo sweet.

Dark Photon
03-18-2010, 05:15 PM
...You can wipe yourself out with vertex array range. You can hang yourself with MapBufferRange if you don't understand all the options. You can go wrong in a million other ways and get a blank screen, hang, crash, or blue screen / panic. ... IMO one should be prepared to test, debug and improve their code to avoid these outcomes, and/or avoid usage of the latest extensions if they suspect the risks are too great. That's a different POV from saying "we should not provide these class of features because there is risk here."
Ooookkkk. With this said, I guess bindless graphics is off the feature list for the next OpenGL specification. Well, it makes sense.
No, he said exactly the opposite -- it's not necessarily off the table. OpenGL in C/C++ isn't a padded room within a plastic bubble with Pascal/Python-like safety checks everywhere and the perf hit to prove it. It's about performance and the developer knowing what they're doing and testing their code properly. If one corner is too scary for someone, they shouldn't use it. It should still be available for others to use.

Alfonse Reinheart
03-18-2010, 06:25 PM
It isn't really about scary corners and such. It's more about breaking an abstraction in a way that could have repercussions later. Sooner or later, some hardware maker is going to have truly virtualized video memory, such that an application can use as much video memory as they want. It may even be the OS that forces this on them.

What becomes of bindless graphics when GPU pointers are not actually GPU pointers? Would such hardware not be able to implement a prospective GL 3.4 or 4.1 that included NVIDIA-style bindless as a core feature?

We all want the benefit of faster performance. But we need to be careful that we aren't backing ourselves into a corner along the way. Buffer objects are a good, strong abstraction that gives the implementation a lot of freedom. Removing that freedom has potential side effects for the future that need to be thought through.

Groovounet
03-18-2010, 07:33 PM
...You can wipe yourself out with vertex array range. You can hang yourself with MapBufferRange if you don't understand all the options. You can go wrong in a million other ways and get a blank screen, hang, crash, or blue screen / panic. ... IMO one should be prepared to test, debug and improve their code to avoid these outcomes, and/or avoid usage of the latest extensions if they suspect the risks are too great. That's a different POV from saying "we should not provide these class of features because there is risk here."
Ooookkkk. With this said, I guess bindless graphics is off the feature list for the next OpenGL specification. Well, it makes sense.
No, he said exactly the opposite -- it's not necessarily off the table. OpenGL in C/C++ isn't a padded room within a plastic bubble with Pascal/Python-like safety checks everywhere and the perf hit to prove it. It's about performance and the developer knowing what they're doing and testing their code properly. If one corner is too scary for someone, they shouldn't use it. It should still be available for others to use.

I see the point; it's just that I don't believe the ARB is going to reach an agreement on that... OpenGL 3.0... huummm ok, I see the size of the conflict! I think this is an even older topic, one that used to be discussed back in the OpenGL 2.0 days.

bsupnik
03-19-2010, 02:35 PM
What becomes of bindless graphics when GPU pointers are not actually GPU pointers? Would such hardware not be able to implement a prospective GL 3.4 or 4.1 that included NVIDIA-style bindless as a core feature?

Right - that was my initial reaction to the bindless NV extensions: "you really want to admit to me that you've implemented the GL as buffers in virtual memory? Will that always be true?"

On one hand, it seems like this binds (no pun intended) GPU developers to unoptimized plain-old-buffers in VRAM...e.g. without abstraction, the implementation can't optimize formats for performance.

On the other hand, has the GLSL spec gone so far in providing a generalized computing model that this is a moot loss of abstraction?

- The format of VBOs, PBOs, and UBOs is completely app-visible.
- Buffers are interchangeable (memory is memory).
- The format of a buffer is not attached to the buffer.

Apps know more about memory than the GL does at this point - a design that is flexible but makes it difficult for drivers to alter implementation.

Alfonse Reinheart
03-19-2010, 02:52 PM
Buffer objects have never been intended to store data according to specific formats controlled by the implementation. The intent was to have driver-allocated memory.

What the current abstraction gives the implementation is the freedom to move this memory around as needed. Your application knows what is in the memory, but exactly where it is at any particular time is unknown.

This is part of the freedom that bindless's lock function takes away.

On a different topic:

As a measure of the current state of OpenGL progress:

Over half of all of the currently existing ARB extensions are from 2008 or later. That is, in the last two years, we have seen more ARB extensions than in the over ten years before GL 3.0.

Virtually all of these extensions have been folded into OpenGL's core at one version or another. Unlike the debacle of ARB_vertex_blend, and the de facto uncanonized state of the ARB_vertex/fragment_program extensions, all of this work has borne fruit.

Stephen A
03-19-2010, 07:00 PM
There's some drivers out there that can kill the system if you mix up your shader ins/outs (in ARBvp/fp). You can wipe yourself out with vertex array range. You can hang yourself with MapBufferRange if you don't understand all the options. You can go wrong in a million other ways and get a blank screen, hang, crash, or blue screen / panic.

We fought some lengthy battles during GL 3.0 having to do with this classic tension between safety and performance.

IMO one should be prepared to test, debug and improve their code to avoid these outcomes, and/or avoid usage of the latest extensions if they suspect the risks are too great. That's a different POV from saying "we should not provide these class of features because there is risk here."


We all know that OpenGL implementations are extremely unstable (much more so than Direct3d implementations, for whatever reason). Which is exactly why I suggested that presumably "dangerous" extensions should become EXT/ARB before being folded into core. This is a completely different POV than "we should not provide these class of features because there is risk here" - which I do not subscribe to.

As Alfonse said, will bindless graphics remain relevant in future OpenGL specifications? How large is the mean performance gain in typical OpenGL 3.2 applications right now? (7x sounds great, but it is only valid for solely batch-limited applications. What are the real numbers like?) Is there a way to achieve a similar performance gain without thoroughly breaking the current buffer/attribute/uniform model? Why are bindless graphics superior to geometry instancing? If bindless graphics become core, how will instancing be affected as a feature?

Dan Bartlett
03-20-2010, 08:07 AM
No, he said exactly the opposite -- it's not necessarily off the table. OpenGL in C/C++ isn't a padded room within a plastic bubble with Pascal/Python-like safety checks everywhere and the perf hit to prove it. It's about performance and the developer knowing what they're doing and testing their code properly. If one corner is too scary for someone, they shouldn't use it. It should still be available for others to use.

Pascal (Delphi + FreePascal at least) does all its type-safety checks at compile time; you can happily cast anything to anything else or use raw pointers/assembly if you want, with no impact on performance. The only time I can think of where extra run-time checks occur is if you use the "as" operator rather than a simple type-cast:


procedure TForm1.Button1Click(Sender: TObject);
var
MyButton: TButton;
begin
MyButton:= Sender as TButton; // slower but safer
MyButton := TButton(Sender); // no checks at run-time
end;

One thing that could affect the speed of compiled code is that Delphi has a single-pass compiler, so although it can compile faster than C/C++ compilers, it may not perform some of the more advanced optimisations.


I hope they extend the bindless idea to textures/samplers etc. and remove the need for texture units at all.

If samplers + textures were completely separate, it could be done something like this:


glGetTextureParameter(tex1, [GL_TEXTURE_2D,] GL_ADDRESS, tex1Address);
glGetSamplerParameter(sampler1, GL_ADDRESS, sampler1Address);

texloc := glGetAttribLocation(program, "tex1");
samplerloc := glGetAttribLocation(program, "sampler1");
glVertexAttrib(texloc, tex1Address);
glVertexAttrib(samplerloc, sampler1Address);

in sampler2D *sampler1;
in texture2D *tex1;

vec4 color = Texture2D(tex1, texcoord, sampler1);

But since they don't seem to be split this way in 3.3/4.0, then perhaps something more like this would work:


glGenTextureObject(1, texobj1);
glTextureObjectParameter(texobj1, GL_TEXTURE, texture1);
glTextureObjectParameter(texobj1, GL_SAMPLER, sampler1);

glGetTextureObjectParameter(texobj1, GL_ADDRESS, texobj1Address);

texobj1Loc := glGetAttribLocation(program, "ground");

in sampler2D *ground;
vec4 color = Texture2D(ground, texcoord);

Or, if re-using the current texture object: when it has a sampler bound, it uses the sampler, otherwise it uses its internal sampling state (the internal state could be deprecated at some later point).


glGenTexture(1, tex1);
glTextureParameter(tex1, [GL_TEXTURE_2D], GL_SAMPLER, sampler1);

glGetTextureParameter(tex1, GL_ADDRESS, texobj1Address);

tex1Loc := glGetAttribLocation(program, "ground");

in sampler2D *ground;
vec4 color = Texture2D(ground, texcoord);

Alfonse Reinheart
03-20-2010, 11:52 AM
I hope they extend the bindless idea to textures/samplers etc. and remove the need for texture units at all.

I think you've misunderstood what the point of bindless is, and how it creates optimizations.

Bindless (as far as vertex attributes go) works as a performance optimization because of cache issues. When you render a scene in a game, you have to go through every object in an arbitrary order, bind everything to the context, and render. When you bind a buffer object name for the purpose of rendering, the driver has to convert this name into an actual object pointer, then fetch the actual GPU address from this object (because buffer objects have state other than just an object pointer). Each memory access is almost guaranteed to be a cache miss, since the last time this memory was accessed was on the last frame (and the entire game logic loop has likely run since then, thus emptying the cache).

The key that makes bindless work for vertex attributes is that the only state you need for rendering is the GPU address. So if the application stores the GPU address, the driver doesn't need to do any memory accesses at all. Outside of actually putting that GPU address in the graphics FIFO, of course.

This is not the case for sampler objects or texture objects. Sampler objects contain only state; a pointer to the object would only save you one cache miss at best. Texture objects have a GPU address of the textures, but even then, they have crucial state associated with it that the texture accessing unit needs to know (the range of available LODs). So again, you need to read from the actual texture object.

As for "removing the need for texture units", why would you want this? All that means is that you have to bind textures and samplers to programs instead of the context. And binding to the context is faster.

There are effectively 3 kinds of textures: global textures (textures like shadow maps that are used for all objects in a scene), instance textures (the textures for a particular mesh that uses a particular program), and program textures (textures that must be used with a particular program. Lookup tables that don't change between users of a program). Trying to make everything into a program-local texture is not helpful.

If I want to change what the global texture is currently, I know what texture unit index I assigned it. So I can just bind a different texture there and every program I use from there on will use it. Under what you're wanting, I must go through every program and change what texture they use.

With per-instance textures, under the current scheme, I can pre-assign the diffuse texture to texture unit 0, the specular map to texture unit 1, etc. And as long as I set all of the program sampler uniforms correctly, I never have to change the sampler uniform state when rendering different models with the same program. Under your system, with GLSL assigning uniform locations, for each program I use, I have to store what uniform location the diffuse, specular, etc texture is and bind them to that location.

So no: textures and samplers are fine as is.
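
For instance, the fixed-convention approach described above looks roughly like this (unit numbers and uniform names are just an example choice):

// once per program, at setup time
glUseProgram(prog);
glUniform1i(glGetUniformLocation(prog, "uDiffuseMap"), 0);  // diffuse always on unit 0
glUniform1i(glGetUniformLocation(prog, "uSpecularMap"), 1); // specular always on unit 1
glUniform1i(glGetUniformLocation(prog, "uShadowMap"), 7);   // "global" shadow map on unit 7

// per object, only the bindings change; the sampler uniforms are never touched again
glActiveTexture(GL_TEXTURE0); glBindTexture(GL_TEXTURE_2D, diffuseTex);
glActiveTexture(GL_TEXTURE1); glBindTexture(GL_TEXTURE_2D, specularTex);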

Dark Photon
03-20-2010, 07:08 PM
We all know that OpenGL implementations are extremely unstable (much more so than Direct3d implementations, for whatever reason).
Where NVidia is concerned, sorry, but this is blatently false.

But if you factor in "other vendors"..., I'll have to defer to someone else's experience there.

We're in a unique position with embedded systems where we pick the hardware, so we can just take the best hardware+software combo out there at the time (features, performance, stability, pricing, volume availability, etc.) and run with it.


How large is the mean performance gain in typical OpenGL 3.2 applications right now? ... What are the real numbers like?
Several of us have reported 2X speed-ups without changing anything else. And at least my tests were on a real shipping app and database written intelligently (VBOs, frustum culling, state sorting, batch combining, etc.) but where decent frustum culling is still critical to regulate useless GPU vertex/fill hit. The mod for bindless batches is a trivial source change, and yielded a whopping 2X perf boost with real use cases!

That's much less expensive than spending the dev time to trying to contort your rendering pipe through some other means to push more polys (in an effort to compensate for the CPU-side waste you have without bindless). Often times this comes to the detriment of frustum culling efficiency, meaning more wasted "junk" thrown down the pipe due to irrelevant batches which take valuable GPU cycles to discard.


(7x sounds great, but it is only valid for solely batch-limited applications.
Right. 2X is awesome though. You've just doubled the throughput of your GPU, for $0 cost in hardware and nearly no cost in software changes. Merely by eliminating some needless "pointer chasing" in the driver which otherwise kills your CPU cache and radically cuts your CPU and GPU utilization.


Is there a way to achieve a similar performance gain without thoroughly breaking the current buffer/attribute/uniform model?
Display lists. :D ...Oh, wait. Those were obsoleted in GL 3.0. And they're very time consuming to build, rendering them useless for run-time built/loaded geometry.

And there's "instancing" but this assumes you're rendering a bunch of copies of the same thing -- makes for much more boring scenes, AND reduces your CPU culling efficiency AND complicates LOD.


Why are bindless graphics superior to geometry instancing?

See previous paragraph, but more on that below.

As to the former point (instancing being a boring bunch of copies), yeah, you can use texture lookups (or more "instance" vtx attribs with ARB_instanced_arrays) and more shader logic/expense to try and vary the appearance a little per instance in the shader. Works, but what a pain. Recall, why are we barking up that tree anyway? Because of batch submission overhead, which is what bindless greatly reduces.

As to the latter (reduced culling efficiency), with instancing we just have to suck it up and deal with it. With instancing, you're shoving larger groups of "stuff" down the pipe to try and reduce batch submission overhead, so the irrelevant portion of that is larger and just gonna soak up cycles. You can try and reduce some of that by trying to do on-GPU frustum culling and dynamic rewriting of batches. But what a pain and even more polys to render to try and avoid rendering other polys! Instancing is useful but not a "silver bullet". It also complicates LOD.

If we could just do more culling/LOD/batch submission work faster on the CPU, then we could avoid a lot of this on-GPU inefficiency or contortion. That's exactly what batch bindless gives you.
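For reference, here's a rough sketch of the batch-bindless setup being discussed, using GL_NV_shader_buffer_load + GL_NV_vertex_buffer_unified_memory (the Vertex struct, buffer handles and sizes are placeholders; error checking omitted):

/* Once, after filling the VBO: make it resident and fetch its GPU address. */
GLuint64EXT vboAddr;
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &vboAddr);

/* Enable bindless vertex attribs and describe the format once. */
glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
glEnableVertexAttribArray(0);
glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex));  /* position */

/* Per batch: just point attrib 0 at a GPU address -- no buffer re-binding. */
glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, vboAddr, vboSizeInBytes);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);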

Alfonse Reinheart
03-20-2010, 09:23 PM
Display lists. grin ...Oh, wait. Those were obsoleted in GL 3.0.

He was talking about a new way. That is, a new extension that would provide the beneficial effects of bindless without the negative effects of it.

Both VAR and buffer objects are a way to deal with user-allocated graphics-owned memory. But buffer objects have a good abstraction, while VAR doesn't.


Because of batch submission overhead, which is what bindless greatly reduces.

I don't see how bindless takes away from instancing or is in any way fundamentally superior to it.

Instancing, regardless of method (draw_instanced or instanced_arrays), is a way of drawing multiple copies of a mesh with different material properties. In essence, it has nothing to do with batch submission overhead (ie: calling gl*Pointer) and has everything to do with state change overhead. Setting parameters on a program takes time, and instancing is a way of removing this time-consuming step.

Bindless doesn't help with changing program parameters. Indeed, if you're trying to render 10,000 copies of a mesh with bindless, you'll find it no faster than without bindless. It simply doesn't help with the fact that you need to, after every draw call, make at least one glUniform call.

Bindless reduces the cache overhead of making gl*Pointer calls. That's all it does.


As to the latter (reduced culling efficiency), with instancing we just have to suck it up and deal with it.

I don't see how culling is more or less efficient with instancing. Unless you're doing instancing by just rendering all instances all the time, you're presumably using a streaming buffer object to upload an instance list to the GPU every frame, or at least every other frame. Can't you do your culling when building this list?


If we could just do more culling/LOD/batch submission work faster on the CPU, then we could avoid alot of this on-GPU inefficiency or contortion. That's exactly what batch bindless gives you.

No it doesn't. Indeed, bindless has essentially nothing to do with GPU and everything to do with CPU inefficiency. The CPU cache issues are what drives the performance increases for bindless graphics, not anything that has to do with the GPU.

Dan Bartlett
03-21-2010, 09:56 AM
I think you've misunderstood what the point of bindless is, and how it creates optimizations.
Removing texture units wouldn't really be for direct speed optimizations, at least not in the same way bindless graphics is; rather, it would be to simplify scene graphs etc. I guess I just saw a similarity between bindless + my wish to no longer need texture units.


As for "removing the need for texture units", why would you want this? All that means is that you have to bind textures and samplers to programs instead of the context. And binding to the context is faster.

It may currently be faster binding textures to the context for rendering, but this can potentially be avoided completely in the rendering stage + instead done once per program/kernel at setup time.
If you don't know what order materials are to be applied in, then currently the simplest way to render each scene object may be to have code something like:


for each scene object
{
// Apply material
// Apply textures
for each texture/sampler used by material
{
glActiveTexture(GL_TEXTURE0 + texture.location);
glBindTexture(convertToGLEnum(texture.type), texture.handle);
glBindSampler(texture.location, texture.sampler);
}
glUseProgram(object.material.prog);
#set UBOs etc#
#set any subroutines?#
// Apply geometry
glBindVertexArray(object.geometry.vao);
// Draw
glDraw...
}

If you want to restore state after unapplying the material, then you'd also need to cache the bound textures + save/restore them.
You could also have a pre-render pass that figures out what materials each scene object uses, then sorts by material/depth/etc to minimize state changes/overdraw + decides what texture units textures will be bound to (taking into account the max combined texture image units, with minimum values of 2/16/32/48/48/80 in OpenGL 2.1/3.0/3.1/3.2/3.3/4.0).
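As a rough illustration of that pre-render pass, one way to build a sortable key per draw might look like this (the field widths and the normalized-depth assumption are purely illustrative):

#include <stdint.h>

/* Pack the most expensive state changes into the most significant bits. */
uint64_t makeSortKey(uint32_t programId, uint32_t textureSetId, float viewDepth01)
{
    uint64_t key = 0;
    key |= (uint64_t)(programId    & 0xFFFFu) << 48;      /* program switches dominate     */
    key |= (uint64_t)(textureSetId & 0xFFFFu) << 32;      /* then texture-set switches     */
    key |= (uint64_t)(viewDepth01 * 65535.0f) & 0xFFFFu;  /* front-to-back within a state  */
    return key;
}
/* Sort the per-object keys each frame, then issue the draws in key order. */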


There are effectively 3 kinds of textures: global textures (textures like shadow maps that are used for all objects in a scene), instance textures (the textures for a particular mesh that uses a particular program), and program textures (textures tied to a particular program, such as lookup tables that don't change between users of that program). Trying to make everything into a program-local texture is not helpful.

If I want to change what the global texture is currently, I know what texture unit index I assigned it. So I can just bind a different texture there and every program I use from there on will use it. Under what you're wanting, I must go through every program and change what texture they use.

With per-instance textures, under the current scheme, I can pre-assign the diffuse texture to texture unit 0, the specular map to texture unit 1, etc. And as long as I set all of the program sampler uniforms correctly, I never have to change the sampler uniform state when rendering different models with the same program. Under your system, with GLSL assigning uniform locations, for each program I use, I have to store what uniform location the diffuse, specular, etc texture is and bind them to that location.

The same could perhaps be done if texture units were removed, texture handles or addresses were passed directly to programs, and textures to be shared across multiple programs use uniform blocks like other globally shared data.

uniform sampler2D tex1;
uniform Shadow
{
sampler2D shadowTex;
};
uniform MyMaterial
{
sampler2D diffuse;
sampler2D ambient;
sampler2D specular;
...
};

The code for rendering all scene objects would then be:


for each scene object
{
// Apply material
glUseProgram(object.material.prog);
#set UBOs etc#
// Apply geometry
glBindVertexArray(object.geometry.vao);
// Draw
glDraw...
}

And for further reducing what needs to be done at render time, at the cost of more resources required, you could have 1 program per material instance with all attached uniforms/UBOs, but if programs are too heavyweight in their current form to have 1 material instance per scene object, then maybe there should be a more lightweight kernel object, something like:

glBuildProgram(prog);
glGenKernelFromProgram(prog, ["main",] 1, kernel);
Then each scene object could have its own kernel object with attached uniforms/UBOs, set up once per scene, rather than each time a program is used.


for each scene object
{
// Apply kernel
glUseKernel(object.material.kernel);
// Apply geometry
glBindVertexArray(object.geometry.vao);
// Draw
glDraw...
}


So no: textures and samplers are fine as is.

GL maintained texture units may be better for global textures than other alternatives, but since UBOs now seem to be the way to handle shared program parameters, then perhaps samplers should also go into them (not sure if an elegant solution to shared global state really exists unless CPU+GPU get closer). Some of the reasons I don't like the current system of texture unit => sampler2D:
Harder to learn/understand (non-intuitive for beginners, although a bit easier now without also needing to Enable/Disable texture binding points per texture unit)
Extra code complexity
Extra limitations for OpenGL programs
Can't be used in uniform blocks

Alfonse Reinheart
03-21-2010, 11:48 AM
You could also have a pre-render pass that figures out what materials each scene object uses, then sorts by material/depth/etc to minimize state changes/overdraw + decides what texture units textures will be bound to (taking into account the max combined texture image units, with minimum values of 2/16/32/48/48/80 in OpenGL 2.1/3.0/3.1/3.2/3.3/4.0).

I don't understand this fixation you have on the max combined texture image units. Do you think that this is some fictitious API-created limitation rather than a real hardware limitation? If you compile and link a shader that uses more samplers than the hardware allows, you won't even get to the rendering; you will get a linker error. So if you have a compiled and linked program, you necessarily have enough texture units for it.

And state sorting has been a part of all scene graphs that are serious about performance for quite some time. The sorting algorithms have changed, but the need for sorting itself hasn't.


The same could perhaps be done if texture units were removed, texture handles or addresses were passed directly to programs, and textures to be shared across multiple programs use uniform blocks like other globally shared data.

Except that per-instance textures require:

1: Storing what uniform location the instance textures go in.

2: Changing program uniforms for every instance.

Neither of these is necessary under the current system.


GL maintained texture units may be better for global textures than other alternatives, but since UBOs now seem to be the way to handle shared program parameters, then perhaps samplers should also go into them

Uniform buffers are also not set directly into programs; they are set the exact same way as texture objects. You bind uniform buffers to the context, and you tell the program which buffer index in the context to use.

So I don't really see any difference here.
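Concretely, the parallel looks something like this (the block, uniform and buffer names are invented for the example):

/* A uniform block is bound through a context binding point... */
GLuint blockIndex = glGetUniformBlockIndex(prog, "MyMaterial");
glUniformBlockBinding(prog, blockIndex, 3);           /* program refers to binding point 3       */
glBindBufferBase(GL_UNIFORM_BUFFER, 3, materialUbo);  /* the buffer is bound to the context there */

/* ...exactly as a sampler uniform refers to a texture unit on the context. */
glUseProgram(prog);
glUniform1i(glGetUniformLocation(prog, "diffuseTex"), 0);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, diffuseTex);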

Dark Photon
03-21-2010, 01:21 PM
Because of batch submission overhead, which is what bindless greatly reduces.
I don't see how bindless takes away from instancing or is in any way fundamentally superior to it.
This isn't "instancing vs. bindless -- a fight to the death". They're both useful. And they tend to be most useful in different circumstances.

When you want a bunch of slightly-mutated copies of the same thing and can eat the "cull waste", you want instancing. When you don't (you really want different objects), and you just want to reduce batch overhead (to pump more batches), then you want batch bindless (http://developer.download.nvidia.com/opengl/specs/GL_NV_vertex_buffer_unified_memory.txt) (instancing is like hammering in a screw here).

Of course, you can use batch bindless (http://developer.download.nvidia.com/opengl/specs/GL_NV_vertex_buffer_unified_memory.txt) in both cases. It's just that it's the most benefit in the latter case.

I didn't say bindless was superior to instancing, or vice versa. I said instancing is a useful tool, but it is not a silver bullet. Bindless does allow you more flexibility in some circumstances (particularly non-instanced ones) because the CPU time wasted launching batches is greatly reduced. That means the cases in which you are forced to use instancing to "hammer in the screw" are fewer, and you can get finer-grained CPU culling (and more state changes) with some of that reclaimed CPU time (read: more complex scenes, not just a bunch of copies of stuff, and more efficient use of the GPU).

Net, batch bindless (http://developer.download.nvidia.com/opengl/specs/GL_NV_vertex_buffer_unified_memory.txt) means less wasted time on the CPU (just submitting batches) and less wasted time on the GPU (from the resulting pipeline "bubbles", and from sending more irrelevant instanced "cruft" down the pipe than you need to, which the GPU just throws away after wasting cycles on it).


Indeed, bindless has essentially nothing to do with GPU and everything to do with CPU inefficiency.
If you discount the time the GPU is just sitting there twiddling its thumbs waiting on the CPU (which I don't/can't), then you are correct (...when talking about batch bindless (http://developer.download.nvidia.com/opengl/specs/GL_NV_vertex_buffer_unified_memory.txt) specifically).


Instancing...in essence, ... has nothing to do with batch submission overhead (ie: calling gl*Pointer) and has everything to do with state change overhead.
If that were true, we would always just call glDrawElements or similar in a loop. But we don't. Why? Batch overhead. The more we can give the GPU to chew on at once, the less chance it'll be waiting on the CPU. And the more CPU time we'll have left over to do other things.

(...well, assuming you're not eating bunches of CPU memory bandwidth transferring huge amounts of data to the GPU every frame.)



As to the latter (reduced culling efficiency)...
I don't see how culling is more or less efficient with instancing. Unless you're doing instancing by just rendering all instances all the time...
In the simplest case for static instances, that's exactly what you'd like to do: here's a list of things -- just draw them. That way you can put the batch data on the GPU and just leave it there. Cheap, fast draw submission. Though if you let instance groups get too big, this flies in the face of LOD and culling efficiency which can net you a loss without careful balancing. Which brings us to...


...you're presumably using a streaming buffer object to upload an instance list to the GPU every frame, or at least every other frame. Can't you do your culling when building this list?
Similar to what Rob was recently describing with his "streaming VBO" write-up. No, not yet. But given the LOD/cull shortcomings of static instancing, in cases where instancing ("copies of something") is what we want, we may end up doing something like this. (Though it's not without its problems. The cull and LOD state of the instance group can totally change in one frame in our world.)
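A minimal sketch of that per-frame streaming approach (the cull test, structs, and buffer-orphaning strategy are just one way to do it; it assumes a VAO with the per-instance attribute is already set up):

/* CPU-cull, then stream only the visible instances' per-instance data. */
int visible = 0;
for (int i = 0; i < instanceCount; ++i)
    if (sphereInFrustum(&instances[i].bounds, &frustum))     /* hypothetical cull test */
        instanceData[visible++] = instances[i].perInstance;  /* e.g. offset + orientation */

glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
glBufferData(GL_ARRAY_BUFFER, maxInstances * sizeof(InstanceData), NULL, GL_STREAM_DRAW);  /* orphan */
glBufferSubData(GL_ARRAY_BUFFER, 0, visible * sizeof(InstanceData), instanceData);

glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, visible);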

However, there are many other circumstances (complex scenes) where instancing is just hammering in a screw. This is where batch bindless (http://developer.download.nvidia.com/opengl/specs/GL_NV_vertex_buffer_unified_memory.txt) is the most useful. No pipeline/engine contortions. Just get rid of the pointless CPU overhead so we can push more varied content.

Alfonse Reinheart
03-21-2010, 05:09 PM
I said instancing is a useful tool, but it is not a silver bullet.

That's why I don't understand why you brought it up; nobody ever claimed instancing was a silver bullet or that bindless wasn't more general-purpose.


If you discount the time the GPU is just sitting there twiddling its thumbs waiting on the CPU (which I don't/can't), then you are correct (...when talking about batch bindless specifically).

Even so, it is important to identify the actual source of the problem, not the apparent source. And as you point out, the actual problem is the CPU, not the GPU.

And why are you unable to test CPU time?


If that were true, we would always just call glDrawElements or similar in a loop. But we don't. Why? Batch overhead.

Except that "calling glDrawElements or similar in a loop" is not instancing. Without some kind of state change between glDrawElements calls, you will be drawing the exact same thing each time, which is not particularly useful. The purpose of instancing is to remove the state change overhead for the state changes necessary to do instanced rendering.

It has nothing to do with the overhead of glDrawElements itself and everything to do with the overhead of glBindTexture, glUniform, and other state changes.
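In other words, the comparison is roughly the following (the per-instance uniform and names are illustrative):

/* Non-instanced: one state change + one draw per copy. */
for (int i = 0; i < copies; ++i) {
    glUniform3fv(offsetLoc, 1, offsets[i]);    /* the per-copy state change instancing removes */
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
}

/* Instanced: one draw call; the shader picks its data via gl_InstanceID. */
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, copies);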

Dark Photon
03-21-2010, 05:54 PM
I said instancing is a useful tool, but it is not a silver bullet.
That's why I don't understand why you brought it up; nobody ever claimed instancing was a silver bullet or that bindless wasn't more general-purpose.
"Stephen A" said:

Why are bindless graphics superior to geometry instancing?
which precipitated the whole instancing vs. bindless discussion. Hopefully this has clarified it for him.



If that were true, we would always just call glDrawElements or similar in a loop. But we don't. Why? Batch overhead.
Except that "calling glDrawElements or similar in a loop" is not instancing. Without some kind of state change between glDrawElements calls, you will be drawing the exact same thing each time
Right, if you didn't change a vertex attribute and/or index data ptr. But that would be useless, so for instancing you would, and arguably that hits batch setup. This approach is similar to, but less efficient than, ARB_instanced_arrays, which does the same thing more efficiently under the covers with vertex stream frequency dividers. You could also do something similar with state changes (uniforms/etc.)

But anyway, we're picking nits and I think on the same page here.

Alfonse Reinheart
03-21-2010, 06:08 PM
Right, if you didn't change a vertex attribute and/or index data ptr.

Is that how you render an object in different places? By changing a vertex attribute?

No; usually, if you want to render the same object in two places, you set some uniforms, render it, change some uniforms and render it again. You don't bind or unbind vertex attributes, so bindless saves you nothing here.

A bit more on-topic: after looking at the ARB_instanced_arrays spec, I'm not even sure what it means anymore. The divisor used to mean something else, but they seem to have changed the spec. The divisor used to divide the index of the attribute, so as to match some old D3D 9 functionality. Now it has interactions with the instance value, which is entirely different from how it used to work.

I'm not sure it was a good idea for the ARB to repurpose an extension like this.

Groovounet
03-22-2010, 02:50 AM
Right, if you didn't change a vertex attribute and/or index data ptr.

Is that how you render an object in different places? By changing a vertex attribute?

No; usually, if you want to render the same object in two places, you set some uniforms, render it, change some uniforms and render it again. You don't bind or unbind vertex attributes, so bindless saves you nothing here.

A bit more on-topic: after looking at the ARB_instanced_arrays spec, I'm not even sure what it means anymore. The divisor used to mean something else, but they seem to have changed the spec. The divisor used to divide the index of the attribute, so as to match some old D3D 9 functionality. Now it has interactions with the instance value, which is entirely different from how it used to work.

I'm not sure it was a good idea for the ARB to repurpose an extension like this.

I don't think it actually changed. For each instance, the divisor-affected attributes start back at the beginning just like any attribute. At least, that's the way I understand it, but you can point me to further interactions with instanced draw calls.

With all the 'instancing' techniques around, we might have to be more precise when we use the word 'instancing'.

Dark Photon
03-22-2010, 05:01 AM
Is that how you render an object in different places? By changing a vertex attribute?
Aaaahhhh... the things we've done before for speed, particularly pre-DrawInstanced. ;)


(Re ARB_instanced_arrays (http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt)) I don't think it actually changed. For each instance, the divisor-affected attributes start back at the beginning just like any attribute.
I agree with your first sentence, but your second confused me. This is garden variety vertex stream frequency dividers back from 2006 (see this PDF (http://developer.nvidia.com/object/opengl-nvidia-extensions-gdc-2006.html) starting at pg 31), forged from the D3D "Oh crap! Our batch calls are 'so' expensive!" realization.

I believe the typical ARB_instanced_arrays (http://www.opengl.org/registry/specs/ARB/instanced_arrays.txt) use case works like this: vertex attributes can represent one of two things:
those that "repeat" per instance (e.g. position, normal, texcoord0, etc.) those that are "constant" per instance (e.g. an offset vector in texcoord1, a rotation quaternion in texcoord2, etc.)On the latter set, you set glVertexAttribDivisor to 1.

Think of the latter set as those values you'd store in a texture buffer and "look up" using gl_InstanceID with glDrawInstanced now. Let's call this "instance data".

Seems to me, the nice thing about this approach is it streams the instance data to the GPU (a push) as needed along with the instance definition, rather than having every single vertex in every single instance of your object bang on 1-N texture fetches from some potentially random (from the GPU's perspective) piece of a texture buffer (a bunch of pulls, albeit cached). Also, this gets rid of the need for texture buffer subload and bind "state changes" between instancing batches using the same material. Not only that, with the instance data now in VBOs, you ideally can bypass the setup overhead using bindless (had to tie that back in somehow :D ) That said, I haven't actually done a performance face-off between ARB_draw_instanced and ARB_instanced_arrays yet.
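A small sketch of that typical use, with one per-vertex attribute and one per-instance attribute (the attribute locations and data layout are arbitrary; glVertexAttribDivisor is the GL 3.3 core name, glVertexAttribDivisorARB with the extension):

/* Per-vertex attribute: advances for every vertex of every instance. */
glBindBuffer(GL_ARRAY_BUFFER, meshVbo);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, 0);    /* position */

/* Per-instance attribute: advances once per instance (divisor = 1). */
glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
glEnableVertexAttribArray(4);
glVertexAttribPointer(4, 4, GL_FLOAT, GL_FALSE, 0, 0);    /* e.g. a per-instance offset */
glVertexAttribDivisor(4, 1);

glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, instanceCount);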

Groovounet
03-22-2010, 07:24 AM
I think instancing based on texture buffers and instancing based on attribute data (ARB_instanced_arrays) are complementary; they will perform best in different situations.

The positive side of instancing based on attribute data, just like you said, is the removal of texture fetch latencies. However, you can read a massive amount of data from a texture buffer at no further cost, whereas the number of attributes is quite limited. A single 4x4 matrix => 4 attributes already taken!

Moreover, if you want to instantiate a group of geometry (not just one mesh), they should have the same number of vertices, or you need to add more per-instance data instead of reusing it... well, I am not sure it would be a lot of data, so maybe not such a problem.

I think ARB_instanced_arrays will perform better for scenarios with many instances and small per-instance data, and ARB_draw_instanced + texture buffer will perform better for scenarios with fewer instances but large per-instance data. Finally, somewhere in the middle, ARB_draw_instanced + uniform buffer might be the best performer.
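For what it's worth, the texture buffer variant would look roughly like this on the C side; the GLSL would rebuild a mat4 with four texelFetch calls at gl_InstanceID * 4 + column (the names are invented):

/* Put all per-instance matrices in one buffer and expose it as a buffer texture. */
glBindBuffer(GL_TEXTURE_BUFFER, instanceMatrixBuf);
glBufferData(GL_TEXTURE_BUFFER, instanceCount * 16 * sizeof(float), matrices, GL_DYNAMIC_DRAW);

glActiveTexture(GL_TEXTURE0 + 5);                  /* whichever unit the samplerBuffer uses */
glBindTexture(GL_TEXTURE_BUFFER, instanceTbo);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, instanceMatrixBuf);

glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, instanceCount);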

A complete test would be nice, but it's a complicated test to do and the results could be platform dependent...

Alfonse Reinheart
03-22-2010, 11:16 AM
I agree with your first sentence, but your second confused me. This is garden variety vertex stream frequency dividers back from 2006 (see this PDF starting at pg 31), forged from the D3D "Oh crap! Our batch calls are 'so' expensive!" realization.

... I don't know why I thought it was different before. I seem to recall that this form of instancing was based on building a large index array, with multiple copies of the same indexes, one for each instance. And then you set some state such that certain attributes divide the index by a number to produce the index they use, while others take the mod of the index. Or something.

I don't know where I got this idea from.

Groovounet
03-22-2010, 03:11 PM
I agree with your first sentence, but your second confused me. This is garden variety vertex stream frequency dividers back from 2006 (see this PDF starting at pg 31), forged from the D3D "Oh crap! Our batch calls are 'so' expensive!" realization.

... I don't know why I thought it was different before. I seem to recall that this form of instancing was based on building a large index array, with multiple copies of the same indexes, one for each instance. And then you set some state such that certain attributes divide the index by a number to produce the index they use, while others take the mod of the index. Or something.

I don't know where I got this idea from.

Probably from the billions of topics dealing with this on the OpenGL forums! ;)

Chris Lux
03-23-2010, 12:41 AM
I think the glext.h and gl3.h header files are missing some tokens and entry points. I noticed that for OpenGL 3.3 the GL_ARB_instanced_arrays definitions are missing (the ARB extension entries are there using the ARB suffix).

Under the GL_VERSION_4_0 function pointer definitions references to GL_ARB_draw_indirect, ARB_draw_buffers_blend, GL_ARB_sample_shading, GL_ARB_texture_cube_map_array and GL_ARB_texture_gather are missing.

Also, the gl3.h header file contains immediate mode entry points for the ARB_vertex_type_2_10_10_10_rev extension. This file should only contain entry points for core profile OpenGL.

Alfonse Reinheart
03-23-2010, 01:07 AM
That's because these are also missing from the .spec files.

I'm happy that the ARB is getting much more proactive about advancing OpenGL, but you guys really need to do something about the errors in the .spec files in the initial releases.

Groovounet
03-23-2010, 06:24 PM
My OpenGL 4.0 review (http://www.g-truc.net/post-0269.html)

Rob Barris
03-23-2010, 11:34 PM
That's because these are also missing from the .spec files.

I'm happy that the ARB is getting much more proactive about advancing OpenGL, but you guys really need to do something about the errors in the .spec files in the initial releases.

It's a trade-off, isn't it...

Groovounet
03-24-2010, 09:46 AM
Agreed with you, Rob, and I think it's a good trade-off.

Anyway, let's be realistic: OpenGL 4 and even OpenGL 3 aren't widely adopted yet. OpenGL 4 remains "for us", i.e. the fairly small group of people who can handle these issues.

To have OpenGL 3.X widely adopted, we would first need "reference pages".

Obviously a perfect world would be great.

Stephen A
03-24-2010, 10:32 AM
Nice review/summary, Groovounet.

If I may suggest, filing a bug report (http://www.khronos.org/bugzilla/) is the quickest way to get the specs fixed. The specs are large and complex beasts and bugs are bound to pass through. It only takes 5 minutes to report such bugs, but those 5 minutes could save someone a large amount of time down the road.

Groovounet
03-27-2010, 04:30 PM
I must say it: I was excited by GL_ARB_explicit_attrib_location when reading the specs, but now I have tested it. It is absolutely AWESOME! Please generalize it everywhere, it's a must.
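For anyone who hasn't tried it yet: with layout(location = N) on the shader inputs, the C side can just use the agreed numbers and skip the query/bind step entirely. A trivial sketch (the locations are assumed to match the shader):

/* With "layout(location = 0) in vec4 position;" and "layout(location = 3) in vec2 uv;"
   in the GLSL, no glGetAttribLocation / glBindAttribLocation calls are needed. */
glBindBuffer(GL_ARRAY_BUFFER, positionVbo);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, uvVbo);
glEnableVertexAttribArray(3);
glVertexAttribPointer(3, 2, GL_FLOAT, GL_FALSE, 0, 0);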

CrazyButcher
04-02-2010, 10:22 AM
Thank you very very much for the new drivers sub-forum! Finally one good place to dump bugs, feedback and praise!

oscarbg
04-06-2010, 10:17 AM
Any hope of having "precise" qualifier outside of GPU_EXT_shader5 extension i.e. for not fermi and cypress gpus..
it's not good since double precision emulation on d3d10 gpus using
float-float approaches gets optimized by Nvidia compiler!

oscarbg
04-06-2010, 10:25 AM
Correction: float-float approaches get badly optimized by the Nvidia compiler!
Example code that gets badly optimized (so it does not work as expected) follows; changing "vec2 z;" to "precise vec2 z;" would fix it:
vec2 dblsgl_add (vec2 x, vec2 y)
{
vec2 z;
float t1, t2, e;

t1 = x.y + y.y;
e = t1 - x.y;
t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;
z.y = e = t1 + t2;
z.x = t2 - (e - t1);
return z;
}

vec2 dblsgl_mul (vec2 x, vec2 y)
{
vec2 z;
float up, vp, u1, u2, v1, v2, mh, ml;

up = x.y * 4097.0;
u1 = (x.y - up) + up;
u2 = x.y - u1;
vp = y.y * 4097.0;
v1 = (y.y - vp) + vp;
v2 = y.y - v1;
//mh = __fmul_rn(x.y,y.y);
mh = x.y*y.y;
ml = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
//ml = (fmul_rn(x.y,y.x) + __fmul_rn(x.x,y.y)) + ml;

ml = (x.y*y.x + x.x*y.y) + ml;

mh=mh;
z.y = up = mh + ml;
z.x = (mh - up) + ml;
return z;
}

Rosario Leonardi
07-07-2010, 04:56 AM
Oh.. hi!
I think there is a little mistake in the specification, as pointed out here (http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=279944#Post279944).

On page 104 of the OpenGL 4.0 spec (core, without compatibility), it talks about how tessellated triangles are created.


Otherwise, for each corner of the outer triangle, an inner triangle corner is produced at the intersection of two lines extended perpendicular to the corner's two adjacent edges running through the vertex of the subdivided outer edge nearest that corner.

It speaks about perpendicular lines, but they should be parallel. Only parallel lines (same angles) ensure that the triangles are similar.

kyle_
07-10-2010, 04:08 AM
I must say it: I was excited by GL_ARB_explicit_attrib_location when reading the specs, but now I have tested it. It is absolutely AWESOME! Please generalize it everywhere, it's a must.
Pretty much this. HW stuff is always nice, but tweaking the API in such ways is really cool - and what GL needs, imo. I like how this obsoletes a few functions and removes string manipulation from GL code.
Also having something like GL_ARB_explicit_uniform_location seems like a natural progression.

Groovounet
07-12-2010, 09:24 AM
And something like GL_ARB_varying_uniform_location! (which nVidia drivers support even though they shouldn't)

That's why I said just "generalized".

I think we have had enough of "it's just a trick blabla" in the past. The way the API is designed has a lot of consequences for the design of the software that uses OpenGL. By itself GL_ARB_explicit_attrib_location is, well, "nice", but generalized... it's awesome!

No more querying of locations; they are already "known". The C++ program can change the GLSL programs but still use the same way to communicate with them... thanks to something like a guarantee, actually! In a way this is a more flexible approach than the "program environment object" of Longs Peak.

At the design-strategy level, the environment object follows the same approach as the vertex array object, and generalized explicit locations follow the same approach as the long-dreamt-of vertex layout object (which showed up as the evil VAO in OpenGL 3.0! :/).

Huge wish for OpenGL 3.4 and 4.1!

Dark Photon
07-13-2010, 04:07 PM
And something like GL_ARB_varying_uniform_location! (which nVidia drivers support even though they shouldn't)
Guess I missed the memo on that one. You're saying you can set a specific uniform index for the GLSL compiler to use for a normal uniform (not UBO) given the uniform name? How?

I remember this for Cg and the assembly profiles (program env), but I never saw it for GLSL.

Groovounet
07-14-2010, 02:53 PM
I meant something like:

In vertex shader:
#define COLOR 0
layout(location = COLOR) in VertexColor;

In Fragment shader:
#define COLOR 0
layout(location = COLOR) out FragColor;

Variables are connected through a number, no need of linking?

For variables between shader stages!
It would make separate shaders more sensible (no need to use deprecated built-in variables... (*crap*)).

I actually had a look yesterday and I am not sure the nVidia drivers allow it. :p

EDIT: Sorry, I meant something like GL_ARB_explicit_varying_location (also GL_ARB_explicit_uniform_location and GL_ARB_explicit_block_index in a way)

Dark Photon
07-14-2010, 03:32 PM
Oh, OK. Thanks for clarifying.

kRogue
07-16-2010, 12:47 AM
I just wanted to give some comments on the explicit uniform business and the layout for outs of a vertex shader

the following (dumb and bad names) shaders fed to NV driver:

vertex shader:


struct giggles
{
float f1[4];
vec4 ff;
mat4 tt;
};

uniform float g;
uniform mat4 matrix;
uniform float f;
uniform float ggs[4];
uniform struct giggles more_bad_names;

layout(location=0) in vec4 vert;
layout(location=3) in vec2 tex;

layout(location=1) out vec2 ftex;
layout(location=2) out vec2 gtex;

void
main(void)
{
gtex=g*tex;
ftex=tex;

for(int i=0;i<4;++i)
{
ftex.y+=ggs[i]+more_bad_names.f1[i];
}
gl_Position=matrix*(f*vert +more_bad_names.tt*more_bad_names.ff);
}



fragment shader:


layout(location=1) in vec2 ftex;
layout(location=2) in vec2 gtex;
out vec4 color;

uniform sampler2D sl;
uniform float ss;

void
main(void)
{
color=ss*texture(sl, ftex);
}


work fine under nVidia; notice the assigning of locations to the outs of the vertex shader and the same values to the ins of the fragment shader.

Using the same location twice on a varying in the vertex shader (in this case using 2 twice) gives the error:



0(23) : error C5121: multiple bindings to output semantic "ATTR2"


whereas in the fragment shader no error is reported (I guess that means the two ins are sourced from the same place). Also, no error is reported if the layouts don't match.

Also, the output for the locations of the uniforms is instructive:




Uniforms:
0: f
1: g
2: ggs
6: matrix
7: more_bad_names.f1
11: more_bad_names.ff
12: more_bad_names.tt
13: sl
14: ss



notice that a mat4 takes only one uniform slot, although it is 16 floats. People with ATI hardware, can you try the above shader pair out? (You will most likely need to remove the layouts on the vertex outs and fragment ins, though.)

Later I will make a better shader pair to see if the location for vert outs actually has an effect... that, and see if EXT_separate_shader_objects is OK if one specifies the locations of (outputs of vertex shaders)/(inputs of fragment shaders) under NV.

Hmm... this is kind of off-topic now... maybe a new thread somewhere else for this? Though this is now walking into undocumented-behavior land.

kyle_
07-16-2010, 01:03 AM
Later I will make a better shader pair to see if the location for vert outs actually has an effect...

It does. I did a similar test, and you can 'mismatch' varying names if the same location is used in the vert and frag shaders.

Ilian Dinev
07-16-2010, 05:42 AM
notice that a mat4 takes only one uniform slot, although it is 16 floats.
The uniform locations are not offsets. They are indices, IDs. These are the uniform IDs not present in your list:



3 ggs[1]
4 ggs[2]
5 ggs[3]

8 more_bad_names.f1[1]
9 more_bad_names.f1[2]
10 more_bad_names.f1[3]
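Which also explains why the whole array can still be set in one call from the location of element [0], or individual elements queried by name (a small illustration; "prog" is assumed to be the linked program):

/* On this driver the array elements happen to follow [0]; per the spec you should
   still query each element's location by name if you need it individually. */
GLint ggsLoc  = glGetUniformLocation(prog, "ggs");      /* same location as "ggs[0]" */
GLint ggs2Loc = glGetUniformLocation(prog, "ggs[2]");

float values[4] = { 0.0f, 1.0f, 2.0f, 3.0f };
glUseProgram(prog);
glUniform1fv(ggsLoc, 4, values);   /* sets the whole array starting at element 0 */
glUniform1f(ggs2Loc, 42.0f);       /* sets just ggs[2] */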

kRogue
07-16-2010, 06:28 AM
That is what I first figured: that uniform locations are IDs and not offsets. One thing that implies is that when setting uniforms there is one more layer of indirection, i.e. translating the ID to an offset (or something).

That the mat4 was only one ID kind of irked me: in GLSL one can set a column of a mat4, but there is no GL call to set just one column of a mat4. If a mat4 took 4 IDs then one could; along the same lines, one could set a single value of a vecN if a vecN took N slots (and, going further, a mat4 would then take 16 IDs). That still doesn't address the need to differentiate between setting the entire element or just its first sub-element... but it's pointless for me to talk about this anyway, since uniforms are handled via an ID, not an offset, and for that matter the GL spec has this:



A valid name cannot be a structure, an array of structures, or any portion of
a single vector or a matrix.



Though the naming convention in the GL spec kind of suggests "they are offsets" (glGetUniformLocation)... the GL spec language for getting uniforms is kind of icky too, as one passes an "index":

void GetActiveUniform( uint program, uint index,
sizei bufSize, sizei *length, int *size, enum *type,
char *name );

where 0<=index<ACTIVE_UNIFORMS. Here the value of index has really nothing in common with the ID of a uniform.

Though, since there is this layer of indirection, there is no reason, except for the ickiness of handling structs, for the GL API not to have something like "GL_explicit_uniform_location", as Groovounet has been saying.