glScissor & performance

Has anyone done any performance testing on glScissor? I am considering using it for GUI drawing (each window/control having its own scissor rect, so that they can’t draw outside the screen). Given, say, 200 windows (absolute worst case), is that too many glScissor state changes?
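Roughly what I have in mind, as a sketch (the Window struct and draw_window() are made up, just to illustrate the call pattern):

```c
#include <GL/gl.h>

/* Hypothetical window rect, in pixels, origin at the bottom-left. */
typedef struct {
    int x, y, width, height;
} Window;

void draw_window(const Window *w);   /* hypothetical per-window drawing */

void draw_gui(const Window *windows, int count)
{
    glEnable(GL_SCISSOR_TEST);
    for (int i = 0; i < count; ++i) {
        /* one scissor change per window - up to ~200 in the worst case */
        glScissor(windows[i].x, windows[i].y,
                  windows[i].width, windows[i].height);
        draw_window(&windows[i]);
    }
    glDisable(GL_SCISSOR_TEST);
}
```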

I’m using scissor for my GUI, but I’ve never used more than 30 windows at once, so I’ve never thought about its performance.
Here is my (unverified) opinion: the scissor test should not introduce any significant performance cost. It should be comparable to changing the blend mode, actually.
Note that most lighting and shadow implementations use the scissor and depth bounds tests. I would find it weird if vendors made the scissor test work in a way that introduced a stall or any other performance cost.
I don’t think you have to worry about that.

I know one thing that costs a lot on NVIDIA chipsets: glGetIntegerv(GL_VIEWPORT, vp);
I found this out recently while trying to write a completely separate GUI library with no dependency on my renderer library. Avoid it like the plague; it causes a flush.

I have used it for GUIs as well and never had problems with it (though I never stress-tested it above 30-40x per frame).

I agree with k_szczech, I’d be surprised if it has any impact at all.

@knackered: I’ve noticed that too; a few other viewport-related glGets do the same.

Scissor should be quite cheap.

cass, why would getting the viewport cause a flush?
Is it because the nvidia driver now runs in a separate thread?

Yes, having the driver in a separate thread makes most synchronization painful. It may be better for you to shadow it yourself, though I realize that’s not particularly attractive either.
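Something along these lines, as an untested sketch (the wrapper names are made up):

```c
#include <GL/gl.h>

/* Shadow copy kept on the application side so GUI code never has to
 * call glGetIntegerv(GL_VIEWPORT, ...). */
static GLint shadow_viewport[4];

void my_viewport(GLint x, GLint y, GLsizei w, GLsizei h)
{
    shadow_viewport[0] = x;
    shadow_viewport[1] = y;
    shadow_viewport[2] = w;
    shadow_viewport[3] = h;
    glViewport(x, y, w, h);      /* still goes to the driver as usual */
}

const GLint *my_get_viewport(void)
{
    return shadow_viewport;      /* no round trip to the driver thread */
}
```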

Yeah, it’s no real problem for me to shadow it (or have the library user pass in the current dimensions), but couldn’t the NV driver cache certain things before it puts them in the driver-thread FIFO? Like states that can’t be changed by stream ops?
I realise there’s a certain validation overhead you’re trying to avoid in the application thread, but I can’t imagine much validation is required for setting the current viewport, or for other things like enable bits.
Thinking about it, I guess glGetError will now be extraordinarily expensive? (I haven’t benchmarked it, and it’s only used in my debug build.)

cass,

Perhaps it’s none of my business, but…

Speaking as a Win32 user: as you already have a TLS slot (who are we kidding, you have a whole FS:offset to play with :slight_smile: ), can’t you store (some of) such data per-context there?

OK, I realize you’d have to secure it with a critical section (assuming a process-wide page or more) in case multiple threads have access to the same context, but as you already have the FS-relative playground you can fix that per-thread too. Easy, even.

If that for some reason isn’t an option, even plain TLS with critical sections could work. I mean, entering and leaving a critsect without contention costs a few hundred clock cycles. Forcing a thread synchronization, perhaps even querying the hardware, is, as you know, a very, very expensive operation (context in the sense of this subject, not an OpenGL context).

It could even be optimized for the case where the viewport values are integers that together fit in 64 bits, where you could actually use “lock cmpxchg8b” (or however that’s supposed to be pronounced :slight_smile: ) to cover all 8 bytes at once.

Heck, you could even add a single 32-byte-aligned (for efficiency) DWORD lock, and only fall back to a full critsect if that shows contention.
(32 bytes * no_of_threads in user mode seems like a small tradeoff, especially when 50+ MB of kernel-mode memory gets allocated just to get a system to boot :-< )
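Roughly what I mean, as a purely hypothetical sketch (names made up, assuming x, y, w and h each fit in 16 bits, and not claiming any driver actually does this):

```c
#include <windows.h>

static volatile LONGLONG g_viewport;   /* 4 x 16-bit fields packed into 64 bits */

static LONGLONG pack(int x, int y, int w, int h)
{
    return  (LONGLONG)(USHORT)x        |
           ((LONGLONG)(USHORT)y << 16) |
           ((LONGLONG)(USHORT)w << 32) |
           ((LONGLONG)(USHORT)h << 48);
}

void shadow_set_viewport(int x, int y, int w, int h)
{
    LONGLONG desired = pack(x, y, w, h);
    /* atomic 64-bit read: CAS with equal exchange and comparand */
    LONGLONG old = InterlockedCompareExchange64(&g_viewport, 0, 0);
    LONGLONG prev;
    /* uncontended, this loop is a single lock cmpxchg8b */
    while ((prev = InterlockedCompareExchange64(&g_viewport, desired, old)) != old)
        old = prev;
}

void shadow_get_viewport(int *x, int *y, int *w, int *h)
{
    LONGLONG v = InterlockedCompareExchange64(&g_viewport, 0, 0);
    *x = (SHORT)(v & 0xFFFF);
    *y = (SHORT)((v >> 16) & 0xFFFF);
    *w = (SHORT)((v >> 32) & 0xFFFF);
    *h = (SHORT)((v >> 48) & 0xFFFF);
}
```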

Just some ideas. If you patent any of them, I want a cut. :wink:

Knackered, yes, when we run the driver in a separate thread and use the application thread just to pass command tokens across, even glGetError() becomes expensive.

Tamlin, it could be that I don’t know what I’m talking about, but I think the big issue is that the multithreaded driver (think pipelined, not side-by-side) makes the “current” state the driver knows about different from the current state at the very front of the API. The same kind of problem exists if you have a “smart” GPU where almost all the driver logic lives on the GPU side (which could even be across a network connection in the GLX indirect rendering world).

Makes me wonder about things like VBOs/PBOs. Do you copy the application data (from, say, a SubData call) into the FIFO, or does it cause a flush so it can be passed directly to the driver? And what about mapping buffer objects? Makes my brain hurt.

Good question. I’ll ask.

Our driver folks tell me that almost all stuff that can be pipelined between the threads is. Much like it would be for indirect rendering.

So it must copy the data used to populate a buffer object into the FIFO?

makes the “current” state the driver knows about different from the current state at the very front of the API.
Shouldn’t it be easy and cheap to return this “API-front state” to the user? I mean, if I call glViewport() and immediately afterwards call glGet() on it, shouldn’t it be easy for OpenGL to return the values I just set, regardless of the fact that the actual “set viewport” command might not have been completed on the GPU?

You’ll be paying the price for a lookup in the app thread. The whole point of moving the driver into another thread is to offload from the app thread the validation work etc. that a driver has to do before dispatching to the card. OpenGL is now so complex that this work is a significant chunk of CPU time.
It’s fair enough to make a glGet expensive; it’s always been good working practice never to call the glGet functions unless absolutely necessary. Like cass said, indirect rendering mode on X makes a glGet ridiculously expensive.

For stuff that it pipelines, yes, it makes a copy.

any idea what heuristics it uses when the “thread optimization” flag is set to “auto” in display properties?

I’m not really sure. I could check, but it also might be something that varies.

Any command that starts with “Get” should be avoided. This is not specific to NVIDIA. The API is designed for a unidirectional flow from your application to the GPU. Gets disrupt the pipeline and should be used primarily for debugging.
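For the debugging case, the usual trick is to wrap glGetError in a macro that compiles away in release builds; a rough sketch (the macro name is made up):

```c
#include <stdio.h>
#include <GL/gl.h>

/* Debug-only GL error check; compiles to nothing in release builds. */
#ifdef _DEBUG
#define CHECK_GL(where)                                                  \
    do {                                                                 \
        GLenum err;                                                      \
        while ((err = glGetError()) != GL_NO_ERROR)                      \
            fprintf(stderr, "GL error 0x%04X at %s\n", err, (where));    \
    } while (0)
#else
#define CHECK_GL(where) ((void)0)
#endif

/* usage:
 *   glScissor(x, y, w, h);
 *   CHECK_GL("glScissor");
 */
```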