render to vertex buffer



babis
05-19-2008, 04:12 AM
Hello,

I'm currently using ReadPixels to copy some data from an FBO attachment texture into a VBO bound as PIXEL_PACK_BUFFER (nullifying the previous contents each time). Is there a faster, copy-free way to do this with the latest extensions / HW?

I've searched around the web for any recent references, but with no luck, so I'm wondering if the above is still the way to do it.

Thanks,
babis

Zengar
05-19-2008, 04:42 AM
Nvidia has the GL_EXT_transform_feedback extension

-NiCo-
05-19-2008, 04:51 AM
AFAIK copying the FBO data to a VBO using a PBO like you said is still the best option for now.

babis
05-19-2008, 04:54 AM
Thanks Zengar, I completely forgot that!

@Nico :
Shouldn't transform feedback with GL_RASTERIZER_DISCARD_NV be way faster? Without considering compatibility issues, that is.

-NiCo-
05-19-2008, 05:16 AM
Doesn't the transform feedback record vertex attributes only? My guess was that you wanted to reinterpret the rgba color data in the FBO as xyzw vertex components (or other attributes) in a VBO.
I have limited experience with transform feedback so correct me if I'm wrong :)

babis
05-19-2008, 07:54 AM
The spec says that it can also record varyings from vertex/geometry shaders, which seems almost as good, though I haven't tried it out yet.

Seth Hoffert
05-19-2008, 07:56 AM
If I understand correctly, you're currently rendering computational data into a framebuffer for use as vertex data? If this is the case, then you'll definitely want to check out transform feedback - it's awesome. :D

babis
05-19-2008, 08:04 AM
You are correct. The only problem with this approach is that I cannot use a quad for computing and fetching from 2D samplers, since any data to be returned must be output from the vertex/geometry shader.
Rendering the quad as points and fetching from a TBO could be a workaround, but I guess it would be slow for a big quad.
But I guess the good point is that there are sooo many options :)

-NiCo-
05-19-2008, 08:12 AM
Indeed, that was my point exactly. I'm also rendering my data to the FBO using a single full screen quad and calculating the output in the fragment shader...

Seth Hoffert
05-19-2008, 08:14 AM
I just ran a test: I render a VBO containing 100*100*100 vertices (each vertex being a vec2). I output one float per vertex, into another VBO (2.0 * gl_Vertex.x). It runs at 273 fps - not bad!

It would be interesting to compare this against the render-to-framebuffer method.

EDIT: I use transform feedback for quad-like data and it works out well (and avoids the framebuffer mess). I do this for a vector field application - I first send the points through and compute a user-defined function on each point, and capture this back into a VBO. I then instance render arrow geometry, and use gl_InstanceID as a lookup into the texture buffer object to which the VBO was bound. However, now I'm curious as to what the speed would be if I rendered to a texture and performed texel fetches on this...

Seth Hoffert
05-19-2008, 08:37 AM
Hmmm, are you wanting to compute data in a per-pixel correspondence kind of way (like for use in deferred shading), or are you computing a smaller-than-the-window quad for use with discrete objects?

I think the transform feedback method makes more sense for the discrete objects, but the FBO approach makes more sense for per-pixel data.

babis
05-19-2008, 04:42 PM
Actually my texels are particles, so logically the FBO is the way to go, but I'll do some further tests and see.



Quote (Seth Hoffert): "However, now I'm curious as to what the speed would be if I rendered to a texture and performed texel fetches on this..."


If you have any results of the comparison, feel free to post! :)

babis
05-19-2008, 06:55 PM
I did some tests of my own, and it seems that fetching from a TBO and passing the result on as a varying is a bit slower than fetching from a 2D texture (same formats, 1K by 1K, rendering a few hundred points). But my timer seems to actually suck.

Does anybody know of a *precise* timer, for measuring a specific pass for example? I'm now using the NVIDIA one (timer queries), and although I call glFinish before beginning and ending the query, the results vary wildly.
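
Editor's sketch: this does not make the timer itself any more precise, but when individual readings vary wildly, one common approach is to time the same pass many times and report the median together with the spread, so that a few outliers do not dominate. The function name and the sample values below are hypothetical, not measurements from this thread.

```python
import statistics


def summarize_timings(samples_ms):
    """Return (median, fastest, slowest) for a list of per-pass timings in ms."""
    if not samples_ms:
        raise ValueError("need at least one timing sample")
    return (statistics.median(samples_ms), min(samples_ms), max(samples_ms))


# Hypothetical readings of one pass, in milliseconds (NOT data from this thread).
samples = [1.2, 1.1, 4.8, 1.1, 1.3, 1.1, 9.5, 1.2]

median_ms, best_ms, worst_ms = summarize_timings(samples)
print(f"median {median_ms} ms, best {best_ms} ms, worst {worst_ms} ms")
```

The median is far less sensitive to the occasional slow reading than the mean, and the fastest reading gives a rough lower bound on the cost of the pass.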

Seth Hoffert
05-24-2008, 07:59 AM
This is good to know, thanks for performing those tests. :)

Seth Hoffert
06-11-2008, 05:13 AM
I just got around to testing transform feedback & render-to-texture... unfortunately, render-to-texture was much faster. I don't understand why this has to be the case, though. With transform feedback, the vertex shader runs n times. With render-to-texture, the fragment shader runs n times. Both use the same stream processors on my hardware, so what's the deal?

I tried a 2000x2000 grid with 3 128-bit outputs per point. With transform feedback, I get ~35 fps, but with render-to-3-textures, I see about 232 fps. Hopefully a future HW revision fixes this... I ought to test CUDA, but I assume CUDA would hit around 200 fps too, just like render-to-texture.

Seth Hoffert
06-11-2008, 05:34 AM
I take that back. The render-to-texture version actually runs at *527fps*. Ouch.

Seth Hoffert
06-11-2008, 01:39 PM
...wrong again. I checked to make sure everything was being written properly. Here are my correct results:

My input is 1000x1000 evaluations, and I'm writing 4 component output, 32 bits per component, 3 outputs.

Transform feedback, with both interleaved and separate VBO modes: 7ms (~142.9 times/second).
Render to texture (using GL_RGBA32F_ARB): 1.09ms (~917.4 times/second).

For 2000x2000 evaluations:

Transform feedback, with both interleaved and separate VBO modes: 28ms (~35.7 times/second).
Render to texture (using GL_RGBA32F_ARB): 4.4ms (~227.3 times/second).
CUDA (without mucking with assembly): 41.6ms (~24 times/second)

Conclusion: Transform feedback has a bit of room for improvement on my particular implementation (G80) :) I'm not sure what's up with CUDA. I also tried stripping out all VBO code and using cudaMalloc, but this helped none.
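
Editor's sketch: the timings above can be turned into passes per second and a rough output write rate. The calculation assumes every evaluation writes 3 outputs of 4 components at 32 bits each (48 bytes), as stated in the post, and that the CUDA version writes the same amount of data, which the post does not state explicitly. Reads and any other traffic are ignored, so these are only effective output rates, not memory bandwidth figures.

```python
BYTES_PER_OUTPUT = 4 * 4  # 4 components x 32 bits each = 16 bytes
OUTPUTS = 3               # three vec4 outputs per evaluation, as in the post


def rate_per_second(ms):
    """Convert a per-pass time in milliseconds to passes per second."""
    return 1000.0 / ms


def write_rate_gb_s(points, ms, outputs=OUTPUTS):
    """Effective output write rate in GB/s (1 GB = 1e9 bytes)."""
    bytes_written = points * outputs * BYTES_PER_OUTPUT
    return bytes_written / (ms / 1000.0) / 1e9


# (label, number of evaluations, time in ms), copied from the post above.
results = [
    ("transform feedback, 1000x1000", 1000 * 1000, 7.0),
    ("render to texture,  1000x1000", 1000 * 1000, 1.09),
    ("transform feedback, 2000x2000", 2000 * 2000, 28.0),
    ("render to texture,  2000x2000", 2000 * 2000, 4.4),
    ("CUDA,               2000x2000", 2000 * 2000, 41.6),
]

for label, points, ms in results:
    print(f"{label}: {rate_per_second(ms):6.1f} passes/s, "
          f"{write_rate_gb_s(points, ms):5.2f} GB/s written")
```

Under these assumptions, transform feedback writes at a little under 7 GB/s at both grid sizes, while render to texture is around 44 GB/s, so the gap stays roughly constant as the problem grows.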

Timothy Farrar
06-11-2008, 05:45 PM
Render to texture uses a dedicated ROP/OM path in the hardware, whereas I would guess transform feedback just writes directly to global memory in the shader itself.

Just guessing here, in the CUDA/transform feedback case, the shaders are stalled on write to global memory, while on the ROP write to texture case the outputs get hardware queued at that point, and the shaders continue with new work.

Would be an interesting test to try the same calculations on a pure bandwidth bound algorithm to see if the ROP path is still the fastest.

Awesome find, kind of makes me rethink my usage of transform feedback!!

BTW, is there an extra GPU<->GPU memory cost with render to vertex array on G80? Did you factor that kind of thing in your calculations?

-NiCo-
06-12-2008, 12:28 AM
Quote (Timothy Farrar): "Just guessing here, in the CUDA/transform feedback case, the shaders are stalled on write to global memory, while on the ROP write to texture case the outputs get hardware queued at that point, and the shaders continue with new work."

I seriously doubt that. I specifically remember someone from Nvidia saying that writing to global memory is fire-and-forget.

Seth Hoffert
06-12-2008, 05:06 AM
Hopefully this is one of those things that improves in a future hardware revision. :)

Timothy Farrar
06-12-2008, 08:08 AM
Quote (Seth Hoffert): "...I'm writing 4 component output, 32 bits per component, 3 outputs..."

Yeah, what was I thinking, global write should be fire and forget.

2nd guess, you are writing out 3 vec4 outputs! Somehow I missed that. So each vert writes {XYZW XYZW XYZW} for transform feedback / CUDA. I think you are looking at some bank conflicts on a global memory write which will slow this down (something the ROP/output merger would do for "free").

Should be easy to test by comparing speed of 1 vec4 vs 3 vec4 outputs.

Seth Hoffert
06-12-2008, 08:22 PM
Writing out to one vec4 output only with transform feedback (2000x2000 points): 14 ms (~71.4 times/second)

Render-to-texture: ~1.88 ms (~531.9 times/second)
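
Editor's sketch: the 1-output numbers can be set against the earlier 3-output numbers for the same 2000x2000 grid (28 ms and 4.4 ms). This is plain arithmetic on the figures posted in this thread; with only two data points per method, any conclusion about the cause is tentative.

```python
# Per-pass times in ms for 2000x2000 points, copied from the posts in this thread.
one_output = {"transform_feedback": 14.0, "render_to_texture": 1.88}
three_outputs = {"transform_feedback": 28.0, "render_to_texture": 4.4}


def scaling(method):
    """How much slower the 3-output pass is than the 1-output pass."""
    return three_outputs[method] / one_output[method]


def gap(timings):
    """How many times slower transform feedback is than render to texture."""
    return timings["transform_feedback"] / timings["render_to_texture"]


for method in one_output:
    print(f"{method}: 3 outputs cost {scaling(method):.2f}x the time of 1 output")

print(f"gap with 1 output:  {gap(one_output):.2f}x")
print(f"gap with 3 outputs: {gap(three_outputs):.2f}x")
```

Tripling the outputs only doubles the transform feedback time, and the gap to render to texture is about 7.4x with one output versus about 6.4x with three. On these numbers, the slowdown does not appear to come only from writing three vec4 outputs.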

Jackis
06-17-2008, 09:36 AM
Sorry if this is a little off-topic.
On G80 I've found that the common way (FBO->PBO->VBO) is about 10-20 percent slower than simply rendering to a texture, skipping the PBO copy, and then fetching from that texture in the vertex shader. Of course, you have to create a prefilled VBO with texcoords for the fetching.
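
Editor's sketch: one way the prefilled texcoord data could be generated on the CPU, with one (u, v) pair per texel at the texel centre, in row-major order. The function name, the layout, and the packing into 32-bit floats are assumptions for illustration. The post does not say how the VBO was filled, and uploading the bytes to GL is left out.

```python
from array import array


def texel_center_texcoords(width, height):
    """Return one (u, v) pair per texel, at texel centres, in row-major order."""
    return [((x + 0.5) / width, (y + 0.5) / height)
            for y in range(height)
            for x in range(width)]


def pack_as_floats(coords):
    """Flatten (u, v) pairs into a tightly packed array of 32-bit floats."""
    return array("f", (component for uv in coords for component in uv))


coords = texel_center_texcoords(4, 2)   # tiny grid, just for illustration
packed = pack_as_floats(coords)
print(coords[0], coords[-1], len(packed.tobytes()), "bytes")
```

Sampling at texel centres (the +0.5 offset) avoids landing on the boundary between two texels, which matters when the texture holds one data record per texel.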