Render to buffer / BufferTexture / Buffered Texture / read multisampled



hlewin
08-11-2013, 07:05 PM
Why can only textures be attached to Framebuffers and textures not be buffered transparently? Or why can renderbuffers not be read from directly via GetBufferSubData, and why is there no GetTexSubImage?
It's obvious that an attachment to the framebuffer has to provide data about its color/pixel format, so it cannot be a plain buffer. But that does not mean that anything attachable to a framebuffer cannot be read directly in parts.
Clumsy ways of composing these supposedly simple operations always have to be used. The (partial) read of a multisampled render target is a prime example. To get to the "evaluated" pixels of a multisample framebuffer, one has to create a second framebuffer and blit to it before being able to read from it, which even adds a layer of abstraction to the operation required. What is so special about framebuffer-to-framebuffer transfer that cannot be done by ReadPixels? No - please do not tell me...
The workaround - a very good idea IMHO - would be transparently bufferable textures. One could render to them and use their data for almost anything, but would eventually trade their all-around usability for some speed when sampling from them with linear filters - let alone mipmaps, which would have to be generated when auto-creation is toggled on for them.
This brings us to buffered textures as they are: Those thingies do not seem to be usable for anything. They are one-dimensional (as if the y*w+x couldn't be computed by the GPU), cannot be attached to framebuffers and cannot even be sampled from - which doesn't really hurt, as no one really samples from one-dimensional textures anyway.

The generic buffer abstraction should be extended to every aspect of the API that requires "data" in some form IMHO - be it at the cost of some rendering performance if used naively.

malexander
08-11-2013, 07:35 PM
You can bind any buffer to the PIXEL_UNPACK_BUFFER binding (to upload data from the buffer to a texture) or to the PIXEL_PACK_BUFFER binding (to read back data from the texture). With a buffer bound to PIXEL_UNPACK_BUFFER, a glTexImage*() call with a NULL data pointer (which is then treated as offset 0 into the buffer) will transfer the data from the buffer to the texture. A glGetTexImage() with a NULL data pointer will do the reverse with a buffer bound to PIXEL_PACK_BUFFER. The GL calls are necessary to define the data format of the data going in and out of the pixel buffer. It's not transparent, but then again neither is the image structure of textures on the GPU (they're all proprietary). Essentially, the glTexImage*() call optimizes the data for you.

The buffer binding names should give you a hint as to why a texture can't source directly from a buffer - packing (except in the Texture Buffer Object case, but even that has restrictions). For efficient texture sampling, a texel is normally padded to 2^n bytes. Often there are other restrictions that cater directly to the texture sampling fixed-function hardware. Violating a GPU's alignment rules can have pretty dire performance penalties. For example, I was sending an fp16 vec3 attribute to a vertex shader for a 20M polygon object using an AMD FirePro W8000, and it rendered the object in 275ms. When padded to a vec4, it rendered in 62ms (~4.5x faster). It seems that the hardware did not like the 6-byte alignment of the buffer data, but 8 bytes worked just fine (Nvidia cards didn't have a preference). So really, you don't want to send arbitrarily formatted data to the GPU to sample as a texture, as you'd very likely run into performance pitfalls that will change depending on the driver and hardware.
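
To make the 6-byte versus 8-byte point concrete, here is a small CPU-side sketch in Python (not GL code). It packs half-float attributes with the standard `struct` module and counts how many elements start on an 8-byte boundary. The helper names `pack_vec3_f16`, `pack_vec4_f16` and `aligned_fraction` are made up for illustration, and the padding value 0.0 is an assumption.

```python
import struct

def pack_vec3_f16(vertices):
    """Tightly pack (x, y, z) tuples as fp16: 6 bytes per vertex."""
    return b"".join(struct.pack("<3e", *v) for v in vertices)

def pack_vec4_f16(vertices, pad=0.0):
    """Pack (x, y, z) tuples as fp16 plus one padding component: 8 bytes per vertex."""
    return b"".join(struct.pack("<4e", v[0], v[1], v[2], pad) for v in vertices)

def aligned_fraction(stride, count, boundary=8):
    """Fraction of elements whose byte offset is a multiple of `boundary`."""
    return sum(1 for i in range(count) if (i * stride) % boundary == 0) / count

verts = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0), (9.0, 10.0, 11.0)]
tight = pack_vec3_f16(verts)    # stride 6: most vertices straddle 8-byte boundaries
padded = pack_vec4_f16(verts)   # stride 8: every vertex starts on a boundary
```

The padded layout costs a third more memory, which is the trade the post describes: more bytes, but every element lands where the fetch hardware expects it.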

hlewin
08-11-2013, 08:19 PM
That's why glGetSubTexImage is missing. One has to copy the whole thing but only wants a few bits.
That leads to: ignore the sampling case and make the texture data transparent - the texture is rendered to, and then those few bits are read via GetBufferSubData.
Something is missing, or I'm not cold-blooded enough to query the whole texture to get a few colors. Of course a copy is as near to the fastest operation possible - but what about DMA transfer etc.? That needs to be done by copying the portion to a new texture first, which means creating a new texture, copying, and getting: three calls with complicated semantics (may allocate memory, may do conversions, has to do whatever) for such a simple thing. One wants the bits as they are...
As for your example: the performance hits may be conceptually unavoidable. Take matrix sampling from a transparently buffered texture as an example: the cache-line exploitation simply cannot be done then, and it is unlikely that optimizations of that kind could be found for it. The locality of sampling operations would require keeping a few (N*2)x3 bytes in cache for a 3x3 matrix: (N*1)x3 in general and (N*2)x3 because the matrix may straddle cache-line offsets. This may or may not be bad. If N is something like 16 and the GPU has a few kB of cache, I do not see that as a big problem when looking at that operation in isolation.
As for the general penalty for wrong alignment: Don't you think that such a thing should be fixed? OK - one cannot do offset calculations by bit-shifting anymore but has to do a multiplication. Does that really justify a 400% performance hit?

EDIT: Those GPUs aren't 8086...

Alfonse Reinheart
08-11-2013, 08:33 PM
Why can only textures be attached to Framebuffers and textures not be buffered transparently?

Why should they be? Textures aren't buffer objects.


This brings us to buffered textures as they are: Those thingies do not seem to be usable for anything.

No, they're not useful for what you want to do with them. Just because they're not useful the way you want doesn't mean that they're not useful.

They're very useful for accessing large tables from shaders. Buffer textures can be much bigger than UBOs, and on pre-GL 4.x hardware, you don't have SSBOs and Image Load/Store. And even on 4.x hardware, buffer textures can't be written to, so they don't have any of the coherency issues that image load/store and SSBOs do. There's no need for barriers and so forth.


cannot even be sampled from

... OK, if you're going to suggest something for OpenGL, you should first know how OpenGL works. And therefore not make laughably inaccurate claims like this, which can be disproven by simply looking at the extension spec or the OpenGL Wiki page on the subject (https://www.opengl.org/wiki/Buffer_Texture#Access_in_shaders).


That's why glGetSubTexImage is missing.

Look, don't turn the entire "Suggestions" forum into you whining for this feature.

hlewin
08-11-2013, 09:21 PM
Look, don't turn the entire "Suggestions" forum into you whining for this feature.
You're right. I'll write a Spec-Feedback on this... ;)


And even on 4.x hardware, buffer textures can't be written to, so they don't have any of the coherency issues that image load/store and SSBOs do.
You're right - I was jumping over the 3.x targets, but: Is this actually a hardware issue?
I have to wonder, as I don't see any difference from other ordinary textures except their transparent data storage. A copy from a framebuffer to a buffer-object followed by a buffer-read would have to deal with such issues many times. The pixels to be copied must have been computed (I guess this is the most parallelized part), the data to be read from the buffer must have been copied to it (given the simple nature of the copy I doubt this is even done in parallel), and then they can be written to the pointer supplied.
This seems to imply that the whole concurrency-issues are dealt by the render-buffer and texture-objects in a - let's say it frankly - not too reusable way. Even if spawning a few hundred threads to compute the fragments of a target texture, the whole operation must be simple to check for completion. If exploiting partial completion, there has to be data about the completion status of individual pixels, so that one can copy pixel 0 when pixel X-thousand is just being rendered. Somehow the restrictions put upon the object usage by the spec seem quirky to me.
In general I'd say that one first has to know what has to be done and then decide how it has to be done. Sometimes I get the feeling that certain restrictions are due to a false consensus. But then again I remember: This is OpenGL and the ARB - not Direct3D and Microsoft. One has to web-crawl for the extensions of the PC hardware vendors to get a picture of what can really be done, as - when looking at my old Sun systems - I do not know about their restrictions and do not want to, nor do I want to know about SGI workstations. That's the other side of the openness of a standard, I guess. That said: on seeing a vendor-specific-tagged extension I would not count on it being available on other manufacturers' hardware for the same architecture - which may or may not be a fallacy. Waiting for the last OpenGL implementor to support an extension seems to be what makes it ARB.

Alfonse Reinheart
08-12-2013, 03:09 AM
A copy from a framebuffer to a buffer-object followed by a buffer-read would have to deal with such issues many times.

... what? Copies from a framebuffer to a buffer object via glReadPixels don't have to deal with any of the issues buffer textures do. That's because it's explicitly copying the data.

Buffer textures do not copy data. The texture is explicitly getting its storage from the contents of the buffer object. There is no storage apart from the buffer object. When you sample from a `samplerBuffer`, you're reading from a buffer object. There's no copying outside of the actual sampling of the data from the buffer's storage. In order for this to work, there must be some specification for what the data in the buffer object means, so that the user can make his data conform to the needs of OpenGL.

None of that happens when you copy data. In copying, you can play all kinds of games. You can do partial copies, you can do format changes. You can even downsample a multisampled buffer in a copy operation (though not a glReadPixels copy).
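
In GL the downsampling copy mentioned here is the framebuffer blit (glBlitFramebuffer) from a multisampled framebuffer to a single-sampled one. As a rough CPU-side illustration of what such a resolve computes, the sketch below averages the samples of each pixel. The box filter is an assumption (an implementation may use a different resolve filter), and `resolve_box` together with the flat sample layout is hypothetical.

```python
def resolve_box(samples, width, height, sample_count):
    """Average the samples of each pixel.

    `samples` is a flat list laid out as [pixel0_s0, pixel0_s1, ..., pixel1_s0, ...],
    each entry an (r, g, b, a) tuple of floats. This layout is an assumption made
    for the sketch, not how any GPU stores multisample data.
    """
    assert len(samples) == width * height * sample_count
    resolved = []
    for p in range(width * height):
        block = samples[p * sample_count:(p + 1) * sample_count]
        resolved.append(tuple(sum(s[c] for s in block) / sample_count for c in range(4)))
    return resolved

# A 2x1 image with 4 samples per pixel: the first pixel is half covered by white.
white, black = (1.0, 1.0, 1.0, 1.0), (0.0, 0.0, 0.0, 1.0)
msaa = [white, white, black, black,
        black, black, black, black]
pixels = resolve_box(msaa, width=2, height=1, sample_count=4)
```

The point for this thread is that the resolve is a computation over several stored values per pixel, so it cannot be expressed as a plain byte copy out of the multisampled storage.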

In order for your notion of being able to use buffer objects as the direct storage for textures to work, you will need to do the following:

1: Define, for every sized internal format, a specific byte-for-byte arrangement of pixel data that can be applied to every texture type. So, the specification now needs to set down a very specific set of rules for where pixel data is in an arbitrary array of bytes and how to interpret that. Good luck getting that through the committee, since every IHV is going to be pushing for the pixel arrangement that their hardware can do with the least effort.

2: Create a new set of `sampler` and `image` types. This is important, because, as you pointed out in the other thread, shaders do a lot of the heavy lifting for sampling data from textures. That means that shaders will need to know up front, at compile-time, whether a particular texture access will be using the optimized format or the "OpenGL-specified lowest-common-denominator" format. So you have just doubled the number of `sampler` and `image` types. Considering that there are already 30 of them (ignoring the "buffer" versions), you're talking about adding a lot of types to the system.


This seems to imply that the whole concurrency-issues are dealt by the render-buffer and texture-objects in a - let's say it frankly - not too reusable way.

What "concurrency-issues"? Nobody in this thread said anything about concurrency. Did you post this in the right thread? And what does concurrency have to do with your idea of using buffer objects as storage for arbitrary textures?

hlewin
08-12-2013, 04:36 AM
Good luck getting that through the committee
I would just say "I'll try" but there is no committee as such for the hardware-vendors (they make extensions and then comes the committee).
Nonetheless I just cannot miss the opportunity to say that those things have been defined for input and output to/from textures. Why should it be so difficult to do it for their storage? And why would someone think that a few new sampler-types were a problem? I doubt the compiler has problems dealing with a few hundred types...

kRogue
08-12-2013, 07:17 AM
This idea seems like an awful lot of pain to essentially avoid glReadPixels() with a buffer object bound to GL_PIXEL_PACK_BUFFER. The only benefit that something like this would give is the theoretical gain of avoiding the copy, but the cost of essentially asking hardware to do things far less efficiently, repeatedly, is likely much, much larger than the cost of that glReadPixels(). Reading and writing on nice boundaries is rampant everywhere: x86 just runs slowly on unaligned accesses, and ARM does not allow unaligned access. And those are CPUs, which are generally more flexible than GPUs. Then we can talk about cache-line stuff, which goes beyond alignment and into the realm of reordering data for the most likely access pattern (read: the non-linear order of texture data).

Because the API requires a glReadPixels() to grab the rendered data, a developer knows that a conversion is going on, and that conversion is a one-time cost. If the storage format is chosen so that no conversion is needed, but the hardware runs faster in a different format, then the cost is paid on every access, which is much higher than a one-time cost per rendered image.

On the subject of writing to buffer objects directly we already have that: transform feedback, SSBO's and image/texture buffer objects.

If we really want the ability to "render to a buffer efficiently" instead of a texture, then there is a way, but it would be pain beyond imagining. One would need to be able to query GL for the format of the texture data. Just trying to specify how that would work is hard. Really freaking hard.

One possible compromise would be to follow Intel's lead of GL_INTEL_map_texture and set the texture as linearly ordered and have the ability to have a buffer object be an alias for the store of a texture, kind of like texture view on LSD; most GPU's need to have the ability to sample from a linearly ordered texture (at significant cost) because the cross-process image jazz (for example X's Pixmap) is linearly ordered.

Alfonse Reinheart
08-12-2013, 07:54 AM
there is no committee as such for the hardware-vendors (they make extensions and then comes the committee).

Oh yes. Virtually all OpenGL core functionality comes from vendor-specific extensions first. Like ARB_program_binary. And ARB_vertex_attrib_binding. And ARB_transform_feedback2 & 3. And ARB_tessellation_shader. And ARB_separate_shader_objects. And ARB_shader_image_load_store. And ARB_sampler_objects. And ARB_texture_storage. And ARB_buffer_storage. And ARB_OK_you_get_the_point.

Oh wait, none of them had vendor-specific versions come out first (EXT_SSO and EXT_shader_image_load_store are EXT extensions, not vendor-specific ones). Does some functionality work that way? Certainly. Is it everything or even most core features these days? No.


Nonetheless I just cannot miss the opportunity to say that those things have been defined for input and output to/from textures. Why should it be so difficult to do it for their storage?

I'd answer that, but since you dismissed the answer given the last time you asked this, there's not much point.


And why would someone think that a few new sampler-types were a problem? I doubt the compiler has problems dealing with a few hundred types...

Because it makes the documentation and API bloated, all to support a feature that will rarely be used (due to the likely significant performance degradation it would cause).


One possible compromise would be to follow Intel's lead of GL_INTEL_map_texture and set the texture as linearly ordered and have the ability to have a buffer object be an alias for the store of a texture, kind of like texture view on LSD; most GPU's need to have the ability to sample from a linearly ordered texture (at significant cost) because the cross-process image jazz (for example X's Pixmap) is linearly ordered.

The problem there is that merely saying the texture is linearly ordered isn't enough. You need to offer guarantees about pixel sizes and ordering (is GL_RGBA8 stored in RGBA order?) that OpenGL doesn't currently require, as well as specify things like row alignment, strides for mipmaps, whether there is spacing between the images of array layers/cubemap faces, etc.

You basically have to enforce a particular arrangement of data, which may not be something that can be supported across a range of hardware. Note that the INTEL_map_texture extension has a linear format, but it also is implementation-specific about the other characteristics of the pixel arrangements and so forth.
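
To illustrate how much has to be pinned down beyond "linear", here is a minimal sketch that computes byte offsets for a linearly stored mip chain under one assumed rule set: rows padded to a fixed alignment and levels stored back to back. The function `linear_mip_layout` and both rules are hypothetical; nothing in OpenGL prescribes them, which is exactly the problem.

```python
def align_up(value, alignment):
    """Round `value` up to the next multiple of `alignment`."""
    return (value + alignment - 1) // alignment * alignment

def linear_mip_layout(width, height, bytes_per_pixel, row_alignment=4):
    """Return (levels, total_bytes) for a back-to-back mip chain with padded rows."""
    levels = []
    offset = 0
    level = 0
    while True:
        row_stride = align_up(width * bytes_per_pixel, row_alignment)
        levels.append({"level": level, "width": width, "height": height,
                       "row_stride": row_stride, "offset": offset})
        offset += row_stride * height
        if width == 1 and height == 1:
            break
        # Each mip level halves the dimensions, rounding down, never below 1.
        width, height = max(1, width // 2), max(1, height // 2)
        level += 1
    return levels, offset

# The same 5x3 RGB8 texture under two hypothetical row-alignment rules.
levels4, total4 = linear_mip_layout(5, 3, bytes_per_pixel=3, row_alignment=4)
levels8, total8 = linear_mip_layout(5, 3, bytes_per_pixel=3, row_alignment=8)
```

With these inputs the two rule sets already disagree on the row stride of the smallest level and on the total size, so a buffer written for one layout would be misread under the other.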

mhagain
08-12-2013, 09:45 AM
Theoretically, mapping a texture can be done in a hardware-neutral manner, and the key is to realize that the pointer the driver gives you needn't be to the underlying texture storage on the GPU. The way it would work is like this:

Map for reading

Driver allocates a block of memory big enough to hold the mapped size.
Driver copies the texture to this block, converting as it goes.
Driver is free to recycle this block after the unmap.

Map for writing

Driver allocates a block of memory big enough to hold the mapped size. The block's contents are unspecified.
Program writes into this block.
On unmap, driver copies from this block to the underlying texture, converting as it goes.
Driver is then free to recycle this block.

None of this is technically difficult because it's how the current APIs work. For writing it's how glTexSubImage works, but with one less memory copy (i.e. the same rationale behind the original specification of glMapBuffer). For reading it's how glGetTexImage works but with the ability to specify a sub-region. Drivers already know how to convert from linear to swizzled and back again, so a hypothetical mapping of a swizzled texture can just use the same code paths. There's no need to specify anything extra here.
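
The two mapping paths above can be sketched on the CPU as follows. This is only an illustration: `SwizzledTexture` is a hypothetical class, and Morton (Z-order) indexing stands in for whatever proprietary layout a real GPU uses. The mapped block is always a separate linear allocation, and the conversion happens on map (read) or unmap (write).

```python
def morton_index(x, y):
    """Interleave the bits of x and y (Z-order); a stand-in for a proprietary layout."""
    index = 0
    for bit in range(16):
        index |= ((x >> bit) & 1) << (2 * bit)
        index |= ((y >> bit) & 1) << (2 * bit + 1)
    return index

class SwizzledTexture:
    """Hypothetical single-channel 8-bit texture whose storage is not linear."""

    def __init__(self, size):
        # A square power-of-two size keeps the Morton indices dense.
        assert size > 0 and size & (size - 1) == 0
        self.size = size
        self._storage = bytearray(size * size)

    def map_read(self, x, y, w, h):
        """Allocate a linear block and copy the sub-region into it, converting as it goes."""
        block = bytearray(w * h)
        for row in range(h):
            for col in range(w):
                block[row * w + col] = self._storage[morton_index(x + col, y + row)]
        return block

    def map_write(self, w, h):
        """Allocate a linear block for the caller; contents are unspecified (zeros here)."""
        return bytearray(w * h)

    def unmap_write(self, block, x, y, w, h):
        """Copy the linear block back into the swizzled storage, converting as it goes."""
        for row in range(h):
            for col in range(w):
                self._storage[morton_index(x + col, y + row)] = block[row * w + col]

tex = SwizzledTexture(4)
staging = tex.map_write(2, 2)
staging[:] = bytes([10, 20, 30, 40])          # program writes into the block
tex.unmap_write(staging, x=1, y=2, w=2, h=2)  # driver converts on unmap
region = tex.map_read(1, 2, 2, 2)             # sub-region read-back, linear again
```

Note that every map or unmap here pays for a full copy of the region, which is the cost the replies below argue about.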

aqnuep
08-12-2013, 10:19 AM
Theoretically, mapping a texture can be done in a hardware-neutral manner, and the key is to realize that the pointer the driver gives you needn't be to the underlying texture storage on the GPU.

That's true, but it is also one of the main sources of inefficiency when handling buffers, as the developer is simply unable to predict how expensive a mapping operation will be: allocating a temporary buffer and mapping that is potentially orders of magnitude more expensive than mapping the actual buffer.

Seriously, do people actually want unpredictable performance characteristics just for the sake of convenience? I never understood this, or those tricks like orphaning and such. All those depend on the fact that the GL implementation can always be smart enough to do exactly what the application is willing to do. Why not do things explicitly? Why not have map being a map, not some magic that looks like a map from the outside?

I, as an app developer, would prefer explicit things, no magic in the background, with reliable performance characteristics. "Orphaning" and all that funky stuff can be done by the application, if it wants to, and won't penalize applications that don't need it.

Alfonse Reinheart
08-12-2013, 10:51 AM
The way it would work is like this:

We already have that. It's called PBOs. They're superior to this in pretty much every respect, the most important being that the copy is asynchronous. You do the copy, and you can set a fence to check to see when the copy is finished. You can then go on and do something else. After you've finished with that other CPU task, you can map the buffer.

The way you describe shoves them all into one command; this makes the full mapping operation synchronous. And if you split it back into two commands (an async preparation stage and a synchronous mapping stage)... you've just reinvented PBOs.

And thanks to ARB_internalformat_query, we can also query the correct pixel transfer formats for a particular internal format. So it can be done with minimal overhead during the copy.

Also, making the memory management explicit is good, since textures don't tend to be small objects. That way, the application gets to keep track of its memory usage. You don't have these multi-megabyte memory objects implicitly allocated by the driver.
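
The two-stage pattern described here (start an asynchronous copy, fence it, do other work, then map) can be mimicked on the CPU as an analogy. This is not GL code: in GL the copy would be a glReadPixels or glGetTexImage into a buffer bound to GL_PIXEL_PACK_BUFFER, the fence a glFenceSync/glClientWaitSync pair, and the map a glMapBufferRange. Below, a worker thread plays the GPU and a `Future` plays the fence; `convert_to_linear` is a made-up stand-in for the driver's conversion.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_to_linear(texture_rows):
    """Stand-in for the copy-with-conversion into the application's buffer."""
    return b"".join(bytes(row) for row in texture_rows)

texture_rows = [[1, 2, 3], [4, 5, 6]]

with ThreadPoolExecutor(max_workers=1) as executor:
    # Stage 1 (asynchronous): start the copy; the Future plays the role of the fence.
    fence = executor.submit(convert_to_linear, texture_rows)

    # Unrelated CPU work proceeds while the copy is in flight.
    other_work = sum(range(1000))

    # Stage 2 (synchronous): wait on the fence, then "map" the finished buffer.
    mapped = fence.result()
```

Collapsing both stages into a single map call would force the wait to happen immediately, which is the objection raised against the all-in-one texture mapping proposal.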

kRogue
08-12-2013, 01:29 PM
The problem there is that merely saying the texture is linearly ordered isn't enough. You need to offer guarantees about pixel sizes and ordering (is GL_RGBA8 stored in RGBA order?) that OpenGL doesn't currently require, as well as specify things like row alignment, strides for mipmaps, whether there is spacing between the images of array layers/cubemap faces, etc.

You basically have to enforce a particular arrangement of data, which may not be something that can be supported across a range of hardware. Note that the INTEL_map_texture extension has a linear format, but it also is implementation-specific about the other characteristics of the pixel arrangements and so forth.

The current cross-process image jazz already enforces a memory layout coming from the system (X11 Pixmap for example). Given that the vast majority of consumer GPUs support X11, that implies they support that linear layout for at least the color formats that X pixmaps support. On mobile I know for a fact that ARM Mali and PowerVR SGX do this, and I strongly suspect Qualcomm and Vivante do as well.

Once the order is linear, the queries needed about the color format storage are not that nasty. Indeed, likely all that is needed are the following:

stride between rows of pixels
stride between pixels on a row
packing order of channels


What it smells like is that the color format is what the color format of a texture buffer object allows. The main issues for an implementation to handle are the following:

rasterizing to a linear format. For traditional formats like GL_RGBA8, GL_RGB565, etc this is already there; the questionable ones are for formats that are not usually used for color buffers of a display window
sampling from a linear format for similar color formats; again "traditional" color formats already have the ability usually to make linear ordered textures (because the underlying OS display system requires them), the question is for other formats


My bet is that outside of compressed formats, the possibility is there for most color formats to be linearly ordered, but possibly with a gap between neighboring pixels and likewise a stride between rows. Packing order may come up for some formats as well, i.e. storage may or may not be red-green-blue-alpha but some permutation. Each of these values is quite queryable though.
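
Given those three queried values (row stride, pixel stride, channel packing order), addressing a texel in a linearly ordered image is simple arithmetic. The sketch below assumes 8-bit channels; `texel_channel_offset` and `read_texel` are hypothetical helpers, and no such query exists in core OpenGL.

```python
def texel_channel_offset(x, y, channel, row_stride, pixel_stride, channel_order,
                         bytes_per_channel=1):
    """Byte offset of one channel of texel (x, y) in a linearly ordered image."""
    return (y * row_stride + x * pixel_stride
            + channel_order.index(channel) * bytes_per_channel)

def read_texel(data, x, y, row_stride, pixel_stride, channel_order):
    """Return the texel as a dict keyed by canonical channel name (8-bit channels)."""
    return {c: data[texel_channel_offset(x, y, c, row_stride, pixel_stride, channel_order)]
            for c in "RGBA"}

# A 2x2 image stored as BGRA, 4 bytes per pixel, with 4 bytes of padding per row.
data = bytes([1, 2, 3, 4,      5, 6, 7, 8,      0, 0, 0, 0,
              9, 10, 11, 12,   13, 14, 15, 16,  0, 0, 0, 0])
texel = read_texel(data, x=1, y=1, row_stride=12, pixel_stride=4, channel_order="BGRA")
```

The application code stays the same for any permutation of channels or any padding, as long as the implementation reports the three values.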

hlewin
08-12-2013, 05:35 PM
Each of these values is quite queryable though.
My guess would be - at least in the long run - they are definable. It is not only already existing hardware which may or may not have difficulties handling different component and byte orders. Maybe some hardware cannot even handle some number format at all when texture storage is transparent - this just means they couldn't provide the extension.
But my guess would be that it is old hardware that - I have no example texture format at hand - would fake doubles by using IEEE floats for them, as the spec allows.

I'm sorry I have to take this as an example, but: When I look at ARB_internalformat_query I cannot help but think: Wow, this is a one-function-call extension... A decent programmer would have added this in a few minutes on demand.... I guess I do not understand...

Alfonse Reinheart
08-12-2013, 06:51 PM
My guess would be - at least in the long run - they are definable.

What is this guess based on? Also, why would you want to exclude hardware? The whole point of an abstraction is to be as inclusive as possible.


I'm sorry I have to take this as an example, but: When I look at ARB_internalformat_query I cannot help but think: Wow, this is a one-function-call extension... A decent programmer would have added this in a few minutes on demand.... I guess I do not understand...

I don't understand why you would think that the time it would take to implement something has any bearing on how long it takes to standardize something. The OpenGL specification is not software; it's a specification. You don't just bash out a specification in C++ code; that's the easy part. A specification needs to define behavior. So first you have to decide what that behavior actually is.

Taking your example, you can query the pixel transfer formats for an image format. However, you can query them separately for glGetTexImage, glTex(Sub)Image, and glReadPixels. Someone had to make that decision, and it's not an obvious decision to make. People had to talk about which queries they felt that they needed and which ones they did not. These things do not happen in "a few minutes".

hlewin
08-12-2013, 07:08 PM
I'm excluding approaches that seem too limited for currently produced / planned hardware in my eyes. This thread is about future suggestions for OpenGL.