transform feedback + glDrawElementsInstanced



peterfilm
07-11-2012, 07:10 AM
In order to avoid the query object stall when combining EXT_transform_feedback with glDrawElementsInstanced it seems to be recommended to use the ARB_draw_indirect extension - but for the life of me I can't find any information on how I get transform feedback to populate the GL_DRAW_INDIRECT_BUFFER needed for the new set of functions this extension introduces.
I've seen people talk about OpenCL, but how do I get OpenGL's transform feedback mechanism to do it?
thanks.

(I've deliberately littered this post with the keyword breadcrumbs I've been searching with for people with the same question!)

aqnuep
07-11-2012, 07:37 AM
What do you mean by query object stall with transform feedback and DrawElementsInstanced exactly? What's your use case? Do you feed back vertex array data or instance data using transform feedback?

If you feed back vertex array data then you should use DrawTransformFeedback to do a non-indexed rendering of the fed back vertex array data.
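Roughly (a sketch; "fedbackVAO" and "xfb" are assumed names for the VAO over the fed-back buffer and the transform feedback object):

glBindVertexArray(fedbackVAO);
/* the GL tracks the captured vertex count itself - no query, no stall
   (GL 4.0 / ARB_transform_feedback2) */
glDrawTransformFeedback(GL_POINTS, xfb);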

If you feed back instance data then you would need atomic counters in the vertex shader or geometry shader, though I'm not aware of any driver supporting non-fragment shader atomic counters currently.
However, on AMD hardware you can use the new GL_AMD_query_buffer_object (http://www.opengl.org/registry/specs/AMD/query_buffer_object.txt) extension to feed back the result of a primitive query to a draw indirect buffer in a non-blocking manner. Example #4 in the spec might be just what you are looking for.
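Going by the spec, the gist of that example is something like this (a sketch; "indirectBuf" and "lodQuery" are assumed names, with primCount sitting at byte offset 4 of a DrawElementsIndirectCommand):

/* with a buffer bound to QUERY_BUFFER_AMD, the last argument of
   GetQueryObject* is an offset into that buffer, not a client pointer */
glBindBuffer(GL_QUERY_BUFFER_AMD, indirectBuf);
glGetQueryObjectuiv(lodQuery, GL_QUERY_RESULT_NO_WAIT_AMD, (GLuint *)4);
glBindBuffer(GL_QUERY_BUFFER_AMD, 0);

glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, 0);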

peterfilm
07-11-2012, 07:46 AM
yes, i'd been reading the AMD_query_buffer_object extension just now! spooky. Frustratingly this extension is not supported on the nvidia quadro 4000 even though it's exactly what i need (example #4 could have been written with me in mind).
yes i'm trying to do frustum culling and lod selection on the gpu, just as you have done in your demos and just as I talk about in my other forum thread (where the question was performance).
now I've got everything writing to multiple streams, one stream for each lod, and the culling/lod selection is very fast indeed (still approx 50 million tests per second, but with multiple streams i don't have to do multiple passes over the same instance data!) - but i've now identified the GL_PRIMITIVES_GENERATED query as a pretty significant bottleneck. This is why I'm looking for ways of getting the primitives-generated count to the draw command without the CPU readback.

peterfilm
07-11-2012, 08:17 AM
btw, when i say a significant bottleneck i mean it takes the overall framerate down below that of doing the culling/lod on the CPU and using glMapBufferRange() to upload the results. So unless I can sort this out, I'll be abandoning the GPU approach.

aqnuep
07-11-2012, 08:56 AM
Well, you have at least two options:

1. Use AMD_query_buffer_object if you can limit your target audience to AMD hardware (however, I hope that NVIDIA will implement it soon too).
2. Use the visibility results of the previous frame to avoid the stall (you can even have a 2 frame delay). Obviously, this might result in popping artifacts; however, if your camera is not moving super fast and you have decent frame rates, that one or two frame delay should not have any visible effect on your rendering. A minimal sketch of this double-buffered query idea follows.
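This is just a sketch under assumed names (indexCount etc.), for a single stream - with multiple streams you would use glBeginQueryIndexed per stream:

static GLuint query[2]; /* created once with glGenQueries */
static unsigned frame = 0;

glBeginQuery(GL_PRIMITIVES_GENERATED, query[frame % 2]);
/* ... transform feedback culling pass ... */
glEndQuery(GL_PRIMITIVES_GENERATED);

/* read the PREVIOUS frame's result - it has almost certainly completed
   by now, so this GetQueryObject call does not stall the pipeline */
GLuint instCount = 0;
if (frame > 0)
    glGetQueryObjectuiv(query[(frame + 1) % 2], GL_QUERY_RESULT, &instCount);
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, instCount);
++frame;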

peterfilm
07-11-2012, 09:10 AM
well that's where it gets complicated (option 2 i mean). You see the instance renderer is used in a number of cull/renders - multiple viewports, quad buffered stereo, cascaded shadow maps.... it's just not practical to have a vbo for each lod for each cull/render phase. Apart from the memory wastage, there's also the code complexity.
Ah well, life eh.

peterfilm
07-11-2012, 11:08 AM
i really love the simplicity of that AMD extension. The idea of the GL writing the query result into a buffer so we can then bind that buffer to the GL_DRAW_INDIRECT_BUFFER target is just gorgeous.

It's bizarre that it seems to be so difficult to do frustum culling (and waaay more importantly, lod selection) on the GPU - I mean, OpenGL is supposed to be primarily for graphics and this is one of the oldest requirements for any graphics application. I don't see the reason why I should have to use CUDA/OpenCL combined with some fudge buffer-sharing mechanism between the two APIs to do such a simple thing.

NVidia, just implement the extension already, for the love of god.

Alfonse Reinheart
07-11-2012, 12:30 PM
It's bizarre that it seems to be so difficult to do frustum culling (and waaay more importantly, lod selection) on the GPU

Um, why? Frustum culling is, at its core, a very different operation. GPUs are for drawing triangles. Culling is about doing arbitrary computations to determine a binary value.

Also, I'm curious as to exactly how writing the query result (which is either the number of fragments that pass or a true/false value) allows you to do LOD selection. Frustum culling I can kind of understand, sort of. You can write a 0 value when the query is not visible. But how exactly does LOD selection work?


I don't see the reason why I should have to use CUDA/OpenCL combined with some fudge buffer-sharing mechanism between the two APIs to do such a simple thing.

Because OpenGL is for rendering and GPGPU APIs are for generic computations. Frustum culling and LOD selection are generic computations that are used to feed rendering.

I'm not saying it's a bad extension. But personally, I'd say that LOD selection is something that the CPU should be doing, considering how dirt simple it is (distance fed into a table).


NVidia, just implement the extension already, for the love of god.

Personally, if NVIDIA's going to implement any of AMD's recent extensions, I'd rather see multi_draw_indirect (http://www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt), sample_positions (http://www.opengl.org/registry/specs/AMD/sample_positions.txt), or depth_clamp_separate (http://www.opengl.org/registry/specs/AMD/depth_clamp_separate.txt).

aqnuep
07-11-2012, 02:02 PM
Also, I'm curious as to exactly how writing the query result (which is either the number of fragments that pass or a true/false value) allows you to do LOD selection. Frustum culling I can kind of understand, sort of. You can write a 0 value when the query is not visible. But how exactly does LOD selection work?
You don't use an occlusion query, but a primitive query. You perform view frustum culling in the geometry shader and perform LOD selection and output the instance data (if the object is visible) to the transform feedback stream corresponding to the selected LOD.
By using a primitive query for each transform feedback stream and by writing the result of the queries to the primCount fields of an indirect draw buffer you can perform the whole rendering without any CPU-GPU roundtrip.
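For reference, this is the command layout from ARB_draw_indirect whose primCount field you'd target (the field comments are a sketch of how it maps to this use case):

typedef struct {
    GLuint count;              /* index count of the LOD's mesh */
    GLuint primCount;          /* instance count - what the query writes */
    GLuint firstIndex;
    GLuint baseVertex;
    GLuint reservedMustBeZero; /* becomes baseInstance in later GL */
} DrawElementsIndirectCommand;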


Personally, if NVIDIA's going to implement any of AMD's recent extensions, I'd rather see multi_draw_indirect (http://www.opengl.org/registry/specs/AMD/multi_draw_indirect.txt), sample_positions (http://www.opengl.org/registry/specs/AMD/sample_positions.txt), or depth_clamp_separate (http://www.opengl.org/registry/specs/AMD/depth_clamp_separate.txt).
NVIDIA already implemented AMD_multi_draw_indirect a while ago. Btw, the query buffer and the multi draw indirect extension can be used together to further limit the number of draw calls necessary for what peterfilm wants to implement.

Alfonse Reinheart
07-11-2012, 04:10 PM
You perform view frustum culling in the geometry shader and perform LOD selection and output the instance data (if the object is visible) to the transform feedback stream corresponding to the selected LOD.
By using a primitive query for each transform feedback stream and by writing the result of the queries to the primCount fields of an indirect draw buffer you can perform the whole rendering without any CPU-GPU roundtrip.

And... this is supposed to be fast? Using a geometry shader and performing per-triangle frustum culling/LOD selection, while using transform feedback? How is this faster than just rendering the models using traditional CPU-based methods of whole-object culling and LOD? You have this whole read/write/read loop going on in the shader. That requires an additional buffer just to write this intermediate data that you then render.

Also in general, when I think performance, I don't think geometry shaders.

Also, why not just use glDrawTransformFeedback (http://www.opengl.org/wiki/GLAPI/glDrawTransformFeedback) or its stream version (http://www.opengl.org/wiki/GLAPI/glDrawTransformFeedbackStream) to render it?

aqnuep
07-11-2012, 05:43 PM
And... this is supposed to be fast? Using a geometry shader and performing per-triangle frustum culling/LOD selection, while using transform feedback? How is this faster than just rendering the models using traditional CPU-based methods of whole-object culling and LOD? You have this whole read/write/read loop going on in the shader. That requires an additional buffer just to write this intermediate data that you then render.
No, nobody said that. You perform per-instance or per-object frustum culling/LOD selection using a geometry shader. That's orders of magnitude less work than the actual rendering.


Also in general, when I think performance, I don't think geometry shaders.
While using a geometry shader does have its cost, it's not the evil itself :)

Alfonse Reinheart
07-11-2012, 07:55 PM
No, nobody said that. You perform per-instance or per-object frustum culling/LOD selection using a geometry shader. That's orders of magnitude less work than the actual rendering.

How exactly? It is actually rendering. In order for the output primitive count to match the input primitive count, you have to be outputting the primitives you want to render. Which means that this pass is drawing all of the triangles for every LOD for every object that exists in the scene.

It may not be scan converting and rasterizing them. But it is passing them through the vertex and geometry shaders. Which means the GPU reads them from the buffers and has to do transformation at least. You have to do vertex processing for each visible object twice (though the second time is just pass-through). That's a lot of redundant reading of memory. You read each object, write it to another location, then read it from there to render it.

Again: how is this faster than just regular rendering via a deferred renderer?

peterfilm
07-12-2012, 03:37 AM
the thing you're missing alfonse is that the transform feedback pass is just drawing a long list of GL_POINTS (with rasterization disabled). Each point's vertex attributes are the entire object's transform and bounding volume (so in my case that's a mat4x3 for the transform and a vec4 for the sphere). The output of this transform feedback pass is a list of vertex attributes for each lod (I just output the mat4x3, the sphere has done its job) intended to be used in a glDrawElementsInstanced as the per-instance data, not the mesh data.
You might think this is a CPU job, but when you're talking about tens of thousands of instanced objects being passed over the bus each frame (more if you take into account the shadow passes), you can start to see the saving of doing this simple bounds/lod test on the GPU itself and then telling it to draw from the list it's just generated. To be honest I'm not that bothered about the frustum culling, I have a quad tree to cull the majority on the CPU anyway; it's the lod selection that's the real gain - that realistically has to be done per-instance, whereas frustum culling can be batched like I do in my quad tree.
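in code the culling pass looks roughly like this (names are placeholders; "cullXFB" is a transform feedback object with the per-lod VBOs bound to its stream binding points):

glEnable(GL_RASTERIZER_DISCARD);      /* nothing gets rasterized */
glUseProgram(cullProgram);            /* the vertex+geometry culling/lod shader */
glBindVertexArray(instanceVAO);       /* one GL_POINT per instance */
glBindTransformFeedback(GL_TRANSFORM_FEEDBACK, cullXFB);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, instanceCount);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);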

peterfilm
07-12-2012, 03:53 AM
here's some numbers:-

instances:-
26781

CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
590fps

GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
1995fps

NOTE: this is just the culling/lod selection. I've commented out the drawing code.

So as you can see, it's definitely worth doing the culling on the GPU!
Just that pesky readback that spoils the party and drags the fps down significantly (by readback I mean that in the drawing code it has to get the value of the GL_PRIMITIVES_GENERATED in order to feed that value into the primCount parameter of glDrawElementsInstanced to actually draw the mesh instances themselves).
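for clarity, that readback is essentially this (a sketch, names assumed; with multiple streams it's an indexed query per stream):

GLuint primCount = 0;
/* this call blocks until the feedback pass has finished on the GPU */
glGetQueryObjectuiv(lodQuery[lod], GL_QUERY_RESULT, &primCount);
glDrawElementsInstanced(GL_TRIANGLES, lodIndexCount[lod], GL_UNSIGNED_INT, 0, primCount);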

thokra
07-12-2012, 04:43 AM
Looking at the numbers I find the discrepancy quite astonishing, but I don't quite follow the data flow. Do you mind lining up your GPU approach as a list of subsequent operations for dumb people like me?

Edit: If possible, add the CPU path as well to enable people to compare the approaches.

Edit 2: By no means do I intend to be judgmental here! It simply looks quite intriguing and I'd like to see how it works.

peterfilm
07-12-2012, 05:49 AM
I'd gladly do that, but aqnuep has already done a splendid job of writing this stuff up on his blog.
it's got diagrams and everything! ignore the hi-z business for now.
http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/

disclosure: i'd already got this stuff working before i found his blog (looking for optimisations), so please don't think i'm a copy cat (not that there'd be anything wrong with that, I just want to retain some kudos for the idea...god knows i get little enough of them).

thokra
07-12-2012, 05:53 AM
Thank you (and aqnuep of course)! I thought I read that but it was actually the earlier instance culling post.


ignore the hi-z business for now.

No I will not! ;)

Dark Photon
07-12-2012, 06:31 AM
CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
590fps

GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
1995fps
So 1.69ms/frame for CPU, and .501ms/frame for GPU. Net savings: 1.19ms across 26781 instances (aka 0.44ms/10,000 instances).

(FPS really is a horrible way to bench. Non-linear. Interesting thread though!)
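Spelling the conversion out: t = 1000 / fps, so 1000 / 590 ≈ 1.695 ms and 1000 / 1995 ≈ 0.501 ms. And to see the non-linearity: the same 1.19 ms saved would only lift a 60 fps app (16.67 ms) to about 64.6 fps - an identical gain that looks far less dramatic in fps.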

thokra
07-12-2012, 06:56 AM
Dark Photon: What do you make of that ~1.2 ms gain? If you're tight on budget it seems reasonable. Otherwise ... I don't know.

BTW, shame on me for being blinded by those sneaky FPS.

Dark Photon
07-12-2012, 07:15 AM
What do you make of that ~1.2 ms gain?

Well, if you've got really loose framerate requirements it might not be so important. But for those that have 16.66ms to do everything or they're dead, 1.2ms is a lot of time and worth reclaiming.

It'd be good to have data on which specific GPU and CPU this test was done on to ground these benchmarks. Peter?

I like the spirit of AMD_query_buffer_object (http://www.opengl.org/registry/specs/AMD/query_buffer_object.txt). I'm all for nuking GPU pipeline bubbles and keeping the work blasting as fast as possible on the GPU. The author list on that extension is interesting too :-)

Maybe AMD and NVidia can work out a deal here: AMD implements NV_vertex_buffer_unified_memory (http://www.opengl.org/registry/specs/NV/vertex_buffer_unified_memory.txt) (batch buffers bindless only; no shader pointers) in exchange for NVidia implementing AMD_query_buffer_object (http://www.opengl.org/registry/specs/AMD/query_buffer_object.txt). Result: Everybody gets improved perf from their GPUs. :-)

peterfilm
07-12-2012, 09:36 AM
Intel Xeon Quad Core 2.66GHZ, 8GB ram, windows 7 64 bit. Quadro 4000 2GB ram driver 296.88.

Forgive me for the fps metric, was in a hurry.

1.2ms is for a relatively small number of instances versus the number I'm actually going to be required to render. Also, consider that this is just for a single pass, whereas I need to also render into the second eye of a stereo pair, and into 4 csm splits. There's also a picture-in-picture second view, albeit without shadow maps.

Alfonse Reinheart
07-12-2012, 10:32 AM
the thing you're missing alfonse is that the transform feedback pass is just drawing a long list of GL_POINTS (with rasterization disabled). Each point's vertex attributes are the entire object's transform and bounding volume (so in my case that's a mat4x3 for the transform and a vec4 for the sphere). The output of this transform feedback pass is a list of vertex attributes for each lod (I just output the mat4x3, the sphere has done its job) intended to be used in a glDrawElementsInstanced as the per-instance data, not the mesh data.

OK, but that doesn't explain how it does LOD selection. LOD selection would have to mean changing the model being rendered, yes? Which would require writing values to an indirect buffer, which would then be used with an indirect rendering command.

I don't see what you need query_buffer_object for in this case. Because the number of objects that pass (ie: the number of indirect rendering commands written) needs to come back to the CPU to be used with multi-draw-indirect. Or to loop over the indirect rendering commands.

Also, I don't see how this constitutes instanced rendering, since each instance has its own indirect drawing command.

Or, to put it simply, can you fully describe the algorithm, top to bottom? Because there seem to be some inconsistencies between the descriptions you've given thus far.


here's some numbers:-

instances:-
26781

CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
590fps

GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
1995fps

NOTE: this is just the culling/lod selection. I've commented out the drawing code.

Since you're using instancing, what's the performance of not doing frustum culling at all and simply drawing all of the instances?

peterfilm
07-12-2012, 11:03 AM
no, i still issue a glDrawElementsInstanced() call for each lod once the queries return me the primCount for each lod.
I'm not using the indirect extension, which is what the original question in this thread was about - i see no way of writing to the indirect buffer from transform feedback.
I gave a link to rastergrid's blog, which explains the algorithm more clearly than I obviously have so far.

The problem I'm trying to solve is not specifically the frustum culling, as I said in an earlier post (keep up man!), it's the lod selection. I'm attempting to mask the simplification of the vegetation geometry by sticking to the lod distances carefully set by the artists - batching them together makes too sudden a pop. I'm trying to stop the pop without an explosion in triangle count.

Alfonse Reinheart
07-12-2012, 12:55 PM
i see no way of writing to the indirect buffer from transform feedback.

Sure you can. You just need to employ atomic increments.

Each LOD's per-instance data is being written to a separate stream. Every time you write an instance to one of the LOD streams, you atomically increment that LOD's atomic counter.

Now, atomic counters are backed by buffer object storage. But you can use glBindBufferRange, as well as the `offset` field of the atomic counter's layout specifier, to put them anywhere in a buffer object's storage. Like, say, the primCount value of an indirect rendering command.

Each counter can be set to write to the `primCount` field of a different indirect rendering command, one for each LOD. Thus, when you're finished, you have three indirect rendering commands, all ready to go.

The only thing you need to do is issue a `glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT)` after building the LOD instance data, but before trying to render them. And of course, reset these values to zero each frame before specifying the LODs.

I have no idea if this will be faster than what you're doing. But there won't be any GPU->CPU->GPU antics.
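To make that concrete, here's a sketch (buffer names, four LODs, and the per-LOD mesh offsets are all assumptions):

/* one 5-uint DrawElementsIndirectCommand per LOD; the atomic counters
   alias the primCount fields at byte offsets 4, 24, 44 and 64 */
GLuint cmd[4 * 5] = { 0 };
for (int lod = 0; lod < 4; ++lod) {
    cmd[lod * 5 + 0] = lodIndexCount[lod]; /* count */
    cmd[lod * 5 + 2] = lodFirstIndex[lod]; /* firstIndex */
    cmd[lod * 5 + 3] = lodBaseVertex[lod]; /* baseVertex */
    /* primCount (index 1) stays zero - the shader's counters fill it in */
}
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
/* re-uploading each frame also resets the counters to zero */
glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof(cmd), cmd, GL_DYNAMIC_DRAW);

/* expose the same storage to the shader's atomic counters */
glBindBufferRange(GL_ATOMIC_COUNTER_BUFFER, 0, indirectBuf, 0, sizeof(cmd));

/* ... run the culling/lod-selection pass ... */

glMemoryBarrier(GL_COMMAND_BARRIER_BIT); /* but see aqnuep's correction below */
for (int lod = 0; lod < 4; ++lod)
    glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                           (const GLvoid *)(lod * 5 * sizeof(GLuint)));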

peterfilm
07-12-2012, 01:04 PM
Yes that's what I was afraid of. The whole atomic counter stuff scared me, possible sync issues etc. And then aqnuep mentioned that you can only use atomic counters at fragment level...
But thanks for the clear explanation of how I'd use them if it came to it. I can but try I suppose, with a heavy heart.

Alfonse Reinheart
07-12-2012, 02:54 PM
The whole atomic counter stuff scared me, possible sync issues etc.

So, you're frightened by atomic counters, even though the use in this case is fairly obvious and requires exactly one sync point. But you're perfectly fine with rendering something that's not rendering anything, using multiple output streams and geometry shaders that aren't shading any geometry, all to write stuff to a buffer object that you'll use to render instances of geometry.

If you're going to yoke the GPU to do cool stuff, then yoke it. You're already forced to use GL 4.x hardware by your use of multiple streams. Best to use all of it.


aqnuep mentioned that you can only use atomic counters at fragment level

Then he's wrong. There is nothing in GLSL or OpenGL about where atomic counters can be used.

Dark Photon
07-12-2012, 06:53 PM
Intel Xeon Quad Core 2.66GHZ, 8GB ram, windows 7 64 bit. Quadro 4000 2GB ram driver 296.88.
Thanks for that!

aqnuep
07-12-2012, 07:59 PM
Sure you can. You just need to employ atomic increments.

Each LOD's per-instance data is being written to a separate stream. Every time you write an instance to one of the LOD streams, you atomically increment that LOD's atomic counter.

Now, atomic counters are backed by buffer object storage. But you can use glBindBufferRange, as well as the `offset` field of the atomic counter's layout specifier, to put them anywhere in a buffer object's storage. Like, say, the primCount value of an indirect rendering command.

Each counter can be set to write to the `primCount` field of a different indirect rendering command, one for each LOD. Thus, when you're finished, you have three indirect rendering commands, all ready to go.
Yes, actually that should work, and if you think about it, by using a load/store image and multi draw indirect you can even do non-instanced object culling in the same way. If I have time to implement something like that, I'll post about it on my blog :)


The only thing you need to do is issue a `glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT)` after building the LOD instance data, but before trying to render them. And of course, reset these values to zero each frame before specifying the LODs.
No, you're wrong. You need glMemoryBarrier(GL_COMMAND_BARRIER_BIT). Everybody seems to misunderstand how glMemoryBarrier works. It does not specify "what source" you are trying to sync but rather "what destination". In all cases glMemoryBarrier is meant to ensure that all shaders that performed image load/stores or used atomic counters finish before the commands after the barrier start. What the barrier bits specify is how you plan to use the written data. This ensures that all the appropriate input caches get flushed before commencing the next draw command.

Quote from spec:

COMMAND_BARRIER_BIT: Command data sourced from buffer objects by Draw*Indirect commands after the barrier will reflect data written by shaders prior to the barrier. The buffer objects affected by this bit are derived from the DRAW_INDIRECT_BUFFER binding.


Then he's wrong. There is nothing in GLSL or OpenGL about where atomic counters can be used.
There is nothing, that's true. But if you check the extension specs (or the core spec) you can see that the extensions require a minimum of 8 load/store images and atomic counters only for fragment shaders (MAX_FRAGMENT_IMAGE_UNIFORMS and MAX_FRAGMENT_ATOMIC_COUNTERS), but the required number is 0 for all other stages. It's not a coincidence that there are some GL 4.2 capable GPUs not supporting them in all shader stages (at least currently).
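You can check what a given driver exposes by querying the corresponding limits:

GLint maxVS = 0, maxGS = 0, maxFS = 0;
glGetIntegerv(GL_MAX_VERTEX_ATOMIC_COUNTERS, &maxVS);
glGetIntegerv(GL_MAX_GEOMETRY_ATOMIC_COUNTERS, &maxGS);
glGetIntegerv(GL_MAX_FRAGMENT_ATOMIC_COUNTERS, &maxFS);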

Dark Photon
07-12-2012, 08:46 PM
Yes that's what I was afraid of. The whole atomic counter stuff scared me, possible sync issues etc.

So, you're frightened by atomic counters, even though the use in this case is fairly obvious

No, you're wrong. You need glMemoryBarrier(GL_COMMAND_BARRIER_BIT). Everybody seems to misunderstand how glMemoryBarrier works.
Not to derail the thread, but this is a perfect example of why many folks (not just peterfilm), including me, are hesitant to wade into the GLSL "side-effect" waters. For folks that have cooked OpenCL or CUDA kernels, this opens up the same issues you have to deal with there ... definitely not a pool to dive into lightly (watch out for the sharks!).

I need to see more complete GLSL side-effect example code before I go hacking down that road.

(Maybe some year there'll be an Expert OpenGL Techniques class at SIGGRAPH that'll cover this in detail... (hint hint). Anyway, we now return you to your current program, already in progress...)

thokra
07-13-2012, 03:09 AM
I need to see more complete GLSL side-effect example code before I go hacking down that road.

It was once suggested to me that down-sampling a texture is best done with image load/store instead of the convenient glGenerateMipmap() - I didn't try it yet, but it was suggested by an AMD driver developer (not aqnuep however :) ). Also, you can apply filters without doing ping-pong rendering, as in the case of applying multiple iterations of a blur filter, if incorporating already-altered pixels when determining the value of the next one is acceptable. To cope with instruction limits one could tile the full-screen quad and have the GPU perform filtering on the tiled regions - though I'm not sure that's mathematically permissible, since the kernels would then be applied in a nondeterministic order across multiple tiles.

peterfilm
07-13-2012, 04:35 AM
well i asked for the limits on the quadro 4000, and got:-
GL_MAX_VERTEX_ATOMIC_COUNTERS: 16384
GL_MAX_GEOMETRY_ATOMIC_COUNTERS: 16384
GL_MAX_FRAGMENT_ATOMIC_COUNTERS: 16384

so i tried it, using atomic counters i mean, backed by a buffer.

results (sorry, fps again):-

instances: 25798
triangles: 186696
GPU: 705fps
CPU: 410fps

pretty damn good!
I know this isn't a real stress test, but i'm having trouble with the tool that generates the instances...can't get enough of em to produce a realistic load.

peterfilm
07-13-2012, 04:44 AM
#version 420 core

#ifdef GL_VERTEX_SHADER

in vec4 attrib_row1;    // xyz=axisX, w=translationX
in vec4 attrib_row2;    // xyz=axisY, w=translationY
in vec4 attrib_row3;    // xyz=axisZ, w=translationZ
in vec4 attrib_bsphere; // bounding sphere xyz=centre, w=radius

out vec4 vsRow1;
out vec4 vsRow2;
out vec4 vsRow3;
flat out int vsVisible;

uniform vec4 uni_frustum[6]; // the 6 world space frustum planes

void main() {
    vsRow1 = attrib_row1;
    vsRow2 = attrib_row2;
    vsRow3 = attrib_row3;

    vsVisible = 1;

    // is instance in frustum?
    for (int i = 0; i < 6; ++i) {
        float d = dot(uni_frustum[i], vec4(attrib_bsphere.xyz, 1.0));
        if (d <= -attrib_bsphere.w) {
            vsVisible = 0;
            break;
        }
    }
}

#endif

#ifdef GL_GEOMETRY_SHADER

layout(points) in;
layout(points, max_vertices = 1) out;

uniform vec3 uni_camPos;  // xyz=world space camera position
uniform vec4 uni_lodDist; // lod distances for x=lod0, y=lod1, z=lod2, w=lod3

in vec4 vsRow1[1];
in vec4 vsRow2[1];
in vec4 vsRow3[1];
flat in int vsVisible[1];

layout(stream=0) out vec4 gsOut0Row1;
layout(stream=0) out vec4 gsOut0Row2;
layout(stream=0) out vec4 gsOut0Row3;
layout(stream=1) out vec4 gsOut1Row1;
layout(stream=1) out vec4 gsOut1Row2;
layout(stream=1) out vec4 gsOut1Row3;
layout(stream=2) out vec4 gsOut2Row1;
layout(stream=2) out vec4 gsOut2Row2;
layout(stream=2) out vec4 gsOut2Row3;
layout(stream=3) out vec4 gsOut3Row1;
layout(stream=3) out vec4 gsOut3Row2;
layout(stream=3) out vec4 gsOut3Row3;

layout(binding = 0, offset = 4)  uniform atomic_uint LodCount0;
layout(binding = 0, offset = 24) uniform atomic_uint LodCount1;
layout(binding = 0, offset = 44) uniform atomic_uint LodCount2;
layout(binding = 0, offset = 64) uniform atomic_uint LodCount3;

void main() {
    if (vsVisible[0] == 1) {
        float dist = distance(vec3(vsRow1[0].w, vsRow2[0].w, vsRow3[0].w), uni_camPos);
        if (dist < uni_lodDist.x) {
            gsOut0Row1 = vsRow1[0];
            gsOut0Row2 = vsRow2[0];
            gsOut0Row3 = vsRow3[0];
            atomicCounterIncrement(LodCount0);
            EmitStreamVertex(0);
        }
        else if (dist < uni_lodDist.y) {
            gsOut1Row1 = vsRow1[0];
            gsOut1Row2 = vsRow2[0];
            gsOut1Row3 = vsRow3[0];
            atomicCounterIncrement(LodCount1);
            EmitStreamVertex(1);
        }
        else if (dist < uni_lodDist.z) {
            gsOut2Row1 = vsRow1[0];
            gsOut2Row2 = vsRow2[0];
            gsOut2Row3 = vsRow3[0];
            atomicCounterIncrement(LodCount2);
            EmitStreamVertex(2);
        }
        else if (dist < uni_lodDist.w) {
            gsOut3Row1 = vsRow1[0];
            gsOut3Row2 = vsRow2[0];
            gsOut3Row3 = vsRow3[0];
            atomicCounterIncrement(LodCount3);
            EmitStreamVertex(3);
        }
    }
}

#endif

thokra
07-13-2012, 05:45 AM
Just a minor observation:


float dist = distance(vec3(vsRow1[0].w, vsRow2[0].w, vsRow3[0].w), uni_camPos);


I can't tell if it will have a significant impact in your case, but if the range of values permits you could use the squared distance to get rid of the sqrt here:


vec3 distVec = vec3(vsRow1[0].w, vsRow2[0].w, vsRow3[0].w) - uni_camPos;
float sqrDist = dot(distVec, distVec);

Of course you'll have to account for that during LOD selection as well, i.e. store squared distances in uni_lodDist.

peterfilm
07-13-2012, 06:51 AM
yup, i know, this is a simple test - i found early on that it made no real difference to performance on the GPU but did on the CPU so I decided to leave it with true length on both implementations to make it fair. ;)