NVIDIA drivers doing frustum culling?

I’m running a GeForce4 4600 with the 44.03 drivers.
I’ve noticed I don’t get a significant performance increase when I do frustum culling manually.
Could it be that the drivers are doing it automatically?

Yes, the hardware will cull away what is outside the view frustum. What’s outside of it won’t be visible anyway, so it is culled early to save the time of processing it later when it isn’t needed.

And it’s not only NVIDIA that does it; it’s something every hardware manufacturer would have done.

It’s called clipping, and it has existed ever since graphics came to computers (that includes 2D).

The real trick to getting a speed improvement from manual frustum culling is to cull large amounts of polygons in a single test. If you can eliminate hundreds or thousands of polygons with one quick test on the CPU, there is far less for the GPU to process and eliminate one by one. This is where the speed increase from manual frustum culling comes from.
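A rough sketch of that idea (not taken from anyone’s code in this thread; the Plane, Sphere, and batch names are made up): test one bounding sphere per batch against the six frustum planes, and skip the whole draw call if the sphere lies entirely outside any plane.

[code]
// Hypothetical sketch: cull a whole batch of polygons with one bounding-sphere test.
struct Plane  { float a, b, c, d; };       // a*x + b*y + c*z + d >= 0 means "inside"
struct Sphere { float x, y, z, radius; };  // bounding sphere of a batch of polygons

// Returns false if the sphere is completely on the outside of any frustum plane.
bool sphereInFrustum(const Plane frustum[6], const Sphere& s)
{
    for (int i = 0; i < 6; ++i)
    {
        float dist = frustum[i].a * s.x + frustum[i].b * s.y +
                     frustum[i].c * s.z + frustum[i].d;
        if (dist < -s.radius)
            return false;                  // the whole batch is outside this plane
    }
    return true;                           // inside or intersecting: submit it
}

// Usage: one cheap CPU test eliminates thousands of triangles at once, e.g.
//   if (sphereInFrustum(frustum, batch.bounds))
//       glCallList(batch.displayList);
[/code]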

Is this enabled even if I don’t turn on scissor tests? I noticed a small performance gain by turning them on.
I actually think they work together (the driver culling polygons and the scissor test culling fragments).

I’m wondering whether that culling isn’t actually done before T&L.

I noticed on a GF3 that if you display millions of triangles outside your view, your framerate will not drop. It’s strange, because I always thought culling was done after transformation, in screen space, and hence would be limited by the transform rate of the graphics card, but it doesn’t seem to work like that.

Y.

I doubt it would take place before T&L is performed, especially when vertex programs are used (which even the “fixed function” pipeline is emulated with on the GF3 and up), since a program’s transform could very well place vertices inside the view frustum when the fixed-function transform would have left them outside it.

Frustum culling can’t be done before T&L, because the vertices are not transformed yet. Or is there any trick on that point?

regards,

Originally posted by nystep:
Frustum culling can’t be done before T&L, because the vertices are not transformed yet. Or is there any trick on that point?

The card may either:

  • transform your vertices and cull them in camera space
  • back-transform your frustum and cull vertices in object space

But in order to use homogeneous coordinates, the card still has to apply the projection matrix, so I’d guess the first method is used. So yes, the vertices are probably culled after T&L.
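As an aside on the second option, a hedged sketch of how the object-space frustum planes can be obtained without touching any vertex (this is the well-known Gribb/Hartmann extraction, shown for illustration only, with no claim that this is what the card actually does): the planes fall straight out of the rows of the combined projection * modelview matrix.

[code]
// Hypothetical sketch: derive the six frustum planes in object space from the
// combined matrix M = projection * modelview (Gribb/Hartmann extraction).
// OpenGL matrices are column-major, so element (row r, column c) is m[c*4 + r].
struct Plane { float a, b, c, d; };        // a*x + b*y + c*z + d >= 0 means "inside"

void extractFrustumPlanes(const float m[16], Plane out[6])
{
    for (int i = 0; i < 3; ++i)
    {
        // row 3 + row i  ->  left, bottom, near
        out[2 * i + 0] = { m[3] + m[i], m[7] + m[i + 4], m[11] + m[i + 8], m[15] + m[i + 12] };
        // row 3 - row i  ->  right, top, far
        out[2 * i + 1] = { m[3] - m[i], m[7] - m[i + 4], m[11] - m[i + 8], m[15] - m[i + 12] };
    }
    // Normalizing each plane is only needed if distances are compared against a radius.
}
[/code]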

If you don’t get a performance increase with frustum culling, you probably weren’t geometry limited in the first place.

Frustum culling can’t be done before T&L, because the vertices are not transformed yet.

This point is moot IMO. When you’re doing frustum culling on your CPU, you’re doing it before T&L, aren’t you?

I was under the impression that it was something done on the CPU by the driver, at a per-primitive level.

How hard would it be for a driver to do frustum culling? The driver knows the projection and modelview matrices; it can build the six frustum planes easily.

The driver can build a bounding box/sphere at run-time each time you issue a glDrawElements or similar call.

Technically, nothing would prevent a driver from doing its own culling before T&L. Yeah, it wouldn’t work with vertex shaders, but it might be a specially optimized path for the fixed-function pipeline.
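Purely as speculation about what such a fixed-function path could look like (a sketch, not an actual NVIDIA code path; all names are made up): build a bounding box over the vertices referenced by a glDrawElements-style call, then reject the whole call against the six planes.

[code]
// Hypothetical driver-side sketch: bound the vertices of one draw call, then test.
#include <cfloat>

struct Plane { float a, b, c, d; };        // a*x + b*y + c*z + d >= 0 means "inside"
struct AABB  { float min[3], max[3]; };

AABB computeBounds(const float* positions, const unsigned short* indices, int indexCount)
{
    AABB box = { { FLT_MAX, FLT_MAX, FLT_MAX }, { -FLT_MAX, -FLT_MAX, -FLT_MAX } };
    for (int i = 0; i < indexCount; ++i)
    {
        const float* p = positions + 3 * indices[i];
        for (int axis = 0; axis < 3; ++axis)
        {
            if (p[axis] < box.min[axis]) box.min[axis] = p[axis];
            if (p[axis] > box.max[axis]) box.max[axis] = p[axis];
        }
    }
    return box;
}

// The box is outside if even its "most inside" corner is behind some plane.
bool boxOutsideFrustum(const Plane planes[6], const AABB& box)
{
    for (int i = 0; i < 6; ++i)
    {
        float x = planes[i].a >= 0.0f ? box.max[0] : box.min[0];
        float y = planes[i].b >= 0.0f ? box.max[1] : box.min[1];
        float z = planes[i].c >= 0.0f ? box.max[2] : box.min[2];
        if (planes[i].a * x + planes[i].b * y + planes[i].c * z + planes[i].d < 0.0f)
            return true;                   // the whole draw call could be skipped
    }
    return false;
}
[/code]

Scanning every index on each call is exactly the overhead objected to below, which is why such a path would really only pay off for static geometry like display lists, where the box could be computed once when the list is compiled.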

I’d love to have that confirmed, as it was long ago and I no longer have the program to test it.

Y.

I’ve never seen any evidence of this, and quite frankly I find the whole idea rather ludicrous to begin with.

To do frustum culling, the driver needs a bounding box for your geometry (unless you’re suggesting that it does it per-triangle, which is even more absurd). Since your geometry may be dynamic, that would imply that the driver has to determine the bounding box of your geometry each and every time you do a glDrawElements() call.

Not only that, but any sensible app will already be doing its own frustum culling, and will be doing it hierarchically which is much more efficient than doing it on a per-glDrawElements() basis. Frustum culling belongs in the app, not in the driver.
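For illustration, a minimal sketch of what hierarchical culling in the app looks like (hypothetical names, reusing the Plane/Sphere types and the sphereInFrustum() test from the sketch near the top of the thread): one test rejects an entire subtree, and children are visited only when the parent’s bounds touch the frustum.

[code]
// Hypothetical sketch of hierarchical app-side culling.
struct Plane  { float a, b, c, d; };
struct Sphere { float x, y, z, radius; };
bool sphereInFrustum(const Plane frustum[6], const Sphere& s);  // as defined further up

struct Node
{
    Sphere bounds;         // encloses this node's geometry and all of its children
    Node*  children[8];    // e.g. an octree; unused slots are null
    int    childCount;
    void   draw() const;   // issues this node's own glDrawElements()/glCallList() calls
};

void cullAndDraw(const Plane frustum[6], const Node& node)
{
    if (!sphereInFrustum(frustum, node.bounds))
        return;            // one test rejects the whole subtree
    node.draw();
    for (int i = 0; i < node.childCount; ++i)
        cullAndDraw(frustum, *node.children[i]);
}
[/code]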

If you see any kind of speedup after moving all your geometry outside of the frustum, the only reasonable conclusion (IMHO) is that you were fillrate-limited.

– Tom

that would imply that the driver has to determine the bounding box of your geometry each and every time you do a glDrawElements() call.

That’s also my conclusion.

Frustum culling belongs in the app, not in the driver.

Again, agreed.

If you see any kind of speedup after moving all your geometry outside of the frustum, the only reasonable conclusion (IMHO) is that you were fillrate-limited.

But in that case I’d be T&L-limited: rendering a one-million-triangle scene (outside the frustum) would not take one millisecond.

I only saw that behavior on a GF3 many months ago. ATI cards performed “as expected”.

Y.

I’m under the impression NVIDIA drivers perform visibility culling for display lists (probably deactivated when vertex programs are used).

I have a small demo in which I render a bunch of high-resolution spheres, each sphere being in its own display list. Performing visibility culling at the sphere level results in a minimal framerate increase on NVIDIA hardware, while on ATI hardware a fair increase in framerate was reported.

Just re-tried the demo and I confirm: with no visibility culling in the app, I can add plenty of high-res spheres behind the camera (i.e. not visible) and the framerate does not change on my GF3 with the 44.03 drivers.
If they are added in front, the framerate drops (each triangle fills one or two pixels at worst, the viewport is very small, and overdraw is close to zero).

Ah, well display lists are a special case because they’re guaranteed to be static (again, provided no vertex programs are used). In this case, it’d be pretty easy for the drivers to cull them.

Have you measured triangle rates? If the drivers really are doing culling on static geometry, then moving the geometry outside the frustum should make your triangle rate readout exceed the theoretical maximum of the card.
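A minimal sketch of such a triangle-rate readout (it assumes an existing GL context and display list; the function and parameter names are made up):

[code]
// Hypothetical measurement: time N frames around glFinish() and divide the
// triangles submitted by the elapsed wall-clock time.
#include <GL/gl.h>
#include <chrono>

double measureTriangleRate(GLuint displayList, long trisPerList, int listsPerFrame, int frames)
{
    glFinish();                                         // drain any previous work
    auto start = std::chrono::steady_clock::now();
    for (int f = 0; f < frames; ++f)
    {
        for (int i = 0; i < listsPerFrame; ++i)
            glCallList(displayList);                    // same static list reused
        glFinish();                                     // wait for the GPU each frame
    }
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    double tris = double(trisPerList) * listsPerFrame * frames;
    return tris / elapsed.count() / 1.0e6;              // MTris/s
}
[/code]

If the driver really skips off-screen lists, this readout will report a rate far above anything the card could actually transform.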

– Tom

thanks for your replies,
i guess EG caught the point, anyway i’m quite sure nv drivers even do some sort of clipping on a per vertex base, just try and draw a very well tessellated quad (say 128*128 quads) in immediate mode and try displaying only part of it
it will run much faster on a geforce4 than on radeon9700
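A hedged reconstruction of the test being described (not the original poster’s code): a 128x128 grid of quads drawn in immediate mode; place the camera so that only part of the grid is in view and compare framerates.

[code]
// Hypothetical sketch: a heavily tessellated quad (128x128 sub-quads) in immediate mode.
#include <GL/gl.h>

void drawTessellatedQuad(float size)
{
    const int   N    = 128;
    const float step = size / N;
    glBegin(GL_QUADS);
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
        {
            float x = i * step, y = j * step;
            glVertex3f(x,        y,        0.0f);
            glVertex3f(x + step, y,        0.0f);
            glVertex3f(x + step, y + step, 0.0f);
            glVertex3f(x,        y + step, 0.0f);
        }
    glEnd();
}
[/code]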

NVIDIA hardware does infinite guardband clipping, which is (more or less) infinitely fast. Instead of clipping the geometry to the edges of the screen, the out-of-bounds fragments just aren’t produced.

ATI does ‘true’ clipping and, quite frankly, it performs horribly: they manage to clip only one triangle every 40 cycles. That hasn’t changed since the R100 (limited guardband caps have been exposed in DX since the R300, but apparently this doesn’t mean anything for OpenGL performance; I don’t know about the R350).

[b]ArchMark 0.10.09alpha[/b]
Driver              Radeon 9500 Pro x86/MMX/3DNow!/SSE v1.3.3803 Win9x Release
Resolution          1024x768 @ unknown refresh rate
Method              Flush
Timer               1.499 GHz

[b]Geometry[/b]
Mode                RGBA5650 Z16 S0
--[b][i]Plain vertices[/i][/b]--------------------------------
  Fan               85.909 MTris/s
  List              39.159 MTris/s
  Clip              6.884 MTris/s

--[b][i]Vertex shading speed[/i][/b]--------------------------
  LightD1           78.884 MVerts/s
  LightP1           33.791 MVerts/s
  LightP8           12.574 MVerts/s


Yep, I played with it some more after posting. I began at 40 MTris/s and pumped it up to 140 MTris/s, at which point I ran into long startup times and file swapping (each sphere had its own display list).
By reusing a single display list, I easily reached 500 MTris/s on this lowly GF3 Ti200.

I’m not sure what culling approach the drivers use; with the display list repeated 2100 times, most of them invisible, there is already a significant hit (much higher than with brute-force, non-hierarchical culling in the app).

For instance, in the 500 MTris/s case, culling by the driver gives me 12 FPS, brute-force culling in the app 151 FPS, and rendering only what is visible (altered scene, no culling) 153 FPS.
There are 2100 spheres in the scene, each containing 20k tris, of which two are visible. Texcoords are defined but not used. At 150 FPS I’m at 6 MTris/s; with lighting off, the framerate slightly more than doubles to 325 FPS (13 MTris/s), which hints that T&L is really the limiting factor here (with lighting off and “driver frustum culling”, I get 14 FPS).

Maybe someone from NV could shed some light on what kind of culling is used, or whether this “culling” is just the byproduct of some other optimization?

I got confirmation from NVIDIA that they are doing culling optimizations for static geometry. Just so you know (no, I was not dreaming!).

Y.