OT: Threads - why not?

General consensus about using separate threads for rendering/physics/collision etc. seems to be that it’s a bad thing because of the expense of ‘context switching’.
From what I understand, ‘context switching’ is the process of saving/loading a few register values and changing the program counter (how does this affect the CPU cache?).
Fair enough, I suppose they’re operations that technically could be avoided by simply doing our own ‘context switching’ and doing conditional branching in a single thread based on timer values.
But then I take a look at windows task manager. I see that there are many, many processes, some of which use many, many threads.
Now, these are all getting serviced by the CPU, same as our own app, so I suppose my question is - what difference is adding another couple of threads to one of the many hungry processes going to make to my application’s performance in real terms?

Threads are light-weight processes. A process creates a thread that runs in a shared memory space, so switching threads doesn’t affect the CPU cache, unlike heavy-weight processes (what you see in the task manager), which require the CPU caches to be flushed every time the process changes.

I don’t know what the aversion to threads is, so I can’t help you on that. They do have issues, though: you have to worry about race conditions and mutual exclusion and atomic access and stuff. None of that is insurmountable.

Good use of threads, though, has a benefit on modern processors because they can hide memory latency. I believe the Pentium 4 does this (h/w support for hyperthreading), but I learnt about the trick with another processor, whose name I forget. The trick is that the CPU knows that a process has threads: if a particular thread would stall the CPU on a memory fetch, the CPU schedules another of the process’s threads to execute instead. If the process has enough threads so that at least one of them can always execute, then you effectively hide memory latency because the CPU never stalls. Very, very nice indeed.

Multithreading is usually a bad idea because it increases program complexity (and hence development time) by more than the 2x for two threads you might naïvely expect.

You also don’t tend to get a good performance improvement unless you’ve either got multiple processors available (which is still not common on desktop PCs, though it is on Macs), or unless you’re working around blocking functions.

OpenGL does have a good few blocking functions that can benefit from a second thread continuing processing whilst they’re waited on – SwapBuffers, glReadPixels, etc.
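A minimal sketch of that overlap, assuming a hypothetical runSimulationStep() for the CPU-side work and using modern C++ threads (not anything from the posts above): the context-owning thread issues the blocking glReadPixels while a worker thread keeps the CPU busy.

[code]
#include <thread>
#include <vector>
#include <GL/gl.h>

// Hypothetical CPU-side work (physics/animation for the next frame).
void runSimulationStep() { /* ... */ }

void readBackFrame(int width, int height, std::vector<unsigned char>& pixels)
{
    pixels.resize(static_cast<size_t>(width) * height * 4);

    // Start the CPU work on a second thread before issuing the blocking call.
    std::thread worker(runSimulationStep);

    // glReadPixels blocks this (context-owning) thread until the pixels
    // arrive, but the worker thread keeps running in the meantime.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());

    worker.join();   // both the read-back and the simulation step are done here
}
[/code]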

Use threads wherever it makes sense. Abusing them is not a good idea, but careful use of them in places that can be heavily parallelized (i.e. any algorithm where some results are not dependent on others) can give a good performance boost. It’s not that hard/complex to implement them as long as you don’t go mad and use them everywhere. For example, I implemented frustum culling in parallel in one hour, and got a 5-10% speed improvement on a hyperthreaded Pentium 4, and a 30-40% improvement on a dual AMD.

Y.
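A rough sketch of the kind of parallel culling described above; the Plane/Sphere layouts and the 50/50 split are assumptions for illustration, not the poster’s actual code. The point is simply that each object’s visibility result is independent of the others, so the array splits cleanly across threads.

[code]
#include <thread>
#include <vector>
#include <cstddef>

struct Plane  { float nx, ny, nz, d; };    // plane equation: n.p + d = 0, normal pointing inwards
struct Sphere { float x, y, z, radius; };  // bounding sphere per object

static bool isVisible(const Sphere& s, const Plane frustum[6])
{
    for (int i = 0; i < 6; ++i) {
        float dist = frustum[i].nx * s.x + frustum[i].ny * s.y +
                     frustum[i].nz * s.z + frustum[i].d;
        if (dist < -s.radius)              // completely outside one plane
            return false;
    }
    return true;
}

void cullParallel(const std::vector<Sphere>& objects,
                  const Plane frustum[6],
                  std::vector<char>& visible)
{
    visible.resize(objects.size());
    const std::size_t half = objects.size() / 2;

    auto cullRange = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            visible[i] = isVisible(objects[i], frustum) ? 1 : 0;
    };

    std::thread worker(cullRange, std::size_t(0), half); // first half on a worker thread
    cullRange(half, objects.size());                     // second half on this thread
    worker.join();
}
[/code]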

Using more than one thread is not a bad idea at all.
The only bad thing is to render from more than one thread, because context switching on the GPU is very, very expensive.

I always add threads to ERASE complexity. The app that I am currently working on has 15 threads and I plan to add another 5 soon. Each thread does its own job separately, which is very simple.
Another benefit is that the GUI is more responsive to the user, because the messages don’t have to wait in the queue until the scene drawing is completed.
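A sketch of that arrangement, assuming the rendering loop lives on its own thread (drawScene() and the message pump are placeholders): the main thread only pumps messages, so the GUI stays responsive even when a frame takes a long time to draw.

[code]
#include <atomic>
#include <thread>

std::atomic<bool> running{true};

void drawScene() { /* render one frame; the GL context lives on this thread */ }

void renderThread()
{
    while (running.load(std::memory_order_relaxed))
        drawScene();                        // may take a long time per frame
}

int main()
{
    std::thread renderer(renderThread);

    // Main thread: pump window messages here, e.g.
    // while (GetMessage(&msg, nullptr, 0, 0)) { TranslateMessage(&msg); DispatchMessage(&msg); }
    // The GUI keeps responding while the renderer is busy.

    running.store(false);
    renderer.join();
    return 0;
}
[/code]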

Good use of threads, though, has a benefit on modern processors because they can hide memory latency. I believe the Pentium 4 does this (h/w support for hyperthreading), but I learnt about the trick with another processor, whose name I forget. The trick is that the CPU knows that a process has threads: if a particular thread would stall the CPU on a memory fetch, the CPU schedules another of the process’s threads to execute instead. If the process has enough threads so that at least one of them can always execute, then you effectively hide memory latency because the CPU never stalls. Very, very nice indeed.
Huh? I thought it only took a few CPU cycles to read from and write to RAM. I would think that with the overhead of switching between threads, you wouldn’t save any cycles.

Abusing them is not a good idea, but careful use of them in places that can be heavily parallelized (i.e. any algorithm where some results are not dependent on others) can give a good performance boost.
Remember that parallel thread execution is an illusion on a single-CPU system without hyperthreading.

Originally posted by Aaron:
Huh? I thought it only took a few CPU cycles to read from and write to RAM. I would think that with the overhead of switching between threads, you wouldn’t save any cycles.

Yes, in the age of 80386 and 80486…
If you have a cache miss on a newer processor you have to wait “a few” CPU cycles…

So the rationale for threads is memory access stalls?
What if the second thread stalls, too, or if the fact that a second thread is running causes the first thread to stall even more often? I mean, two threads obviously consume more cache (data and instruction) than one.

How does this work out with nontrivial data sets?

I think you are a bit confused about the memory latency and thread switching issues in an SMT system (a.k.a. “HyperThreading” in Intel terms).

The whole idea with SMT is that you do actually execute several threads in parallel (i.e. you do NOT switch threads!). To accomplish that you have N times the number of register banks and program counters (for instance) in the CPU core, but (roughly) the same number of execution units as a non-SMT processor.

At first it may seem idiotic, but the point is that a single thread usually has too many inter-instruction dependencies and too monotonic a behaviour to utilize 100% of the execution units of the CPU all the time. By executing two (or more) threads in the same core, you statistically get better utilization of all the available CPU power.

Conclusion: not only are memory stalls “hidden” by running several threads on the CPU, but almost every single execution unit in the CPU gets higher utilization.

However, you can actually see performance drops when running more than one thread on SMT systems, since two (or more) threads fight for the same cache space! In other words, if you plan to make a multi-threaded program that is to run fast on an SMT system, make sure that the threads do not compete for the cache too much.

Another comment…

Thread switches are very inexpensive. In practice, you can ignore them. There ARE a few things that can cost a few microseconds (thread synchronization calls that go through the kernel), but these are usually quite rare (a few times per frame, in the case of a game).

If you start thinking in terms of multi threading, you can often find pretty simple ways to split work into several threads. It can even give you a few % by splitting “physics etc.” and “rendering etc.” into two threads, and having them share the same data using a mutex. In practice this usually gives you two threads that run pretty much serialized rather than in parallel, at least on a single-CPU system (since the “physics” thread has to compute the data before the “render” thread can draw it, and vice versa), but it almost never degrades performance, and on SMP systems you usually gain a few % in speed. The point is that it is very simple to do this “split”.
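A minimal sketch of that physics/render split, with a single mutex guarding one shared state object (the GameState contents and the commented-out draw() call are placeholders, not code from the post): each thread holds the lock only long enough to copy the state, so the long rendering work happens outside the lock.

[code]
#include <atomic>
#include <mutex>
#include <thread>

struct GameState { float positions[1024][3]; /* ... */ };

std::mutex        stateLock;
GameState         shared;
std::atomic<bool> quit{false};

void physicsThread()
{
    while (!quit.load()) {
        GameState next;
        // ... compute `next` for this frame (inputs, physics, animation) ...
        std::lock_guard<std::mutex> guard(stateLock);
        shared = next;                      // publish the new state
    }
}

void renderThread()
{
    while (!quit.load()) {
        GameState snapshot;
        {
            std::lock_guard<std::mutex> guard(stateLock);
            snapshot = shared;              // copy out under the lock
        }
        // draw(snapshot);                  // the long GL work runs outside the lock
    }
}
[/code]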

I think that you should avoid complex solutions (that could be “clever”), because of maintenance issues, and you may actually degrade performance. A rule of thumb that I like to use is that the multi threaded solution should trivially degenerate to a single threaded solution on a single processor system.

The cost in threads is mostly in the added complexity of locking. You need critical sections, semaphores, events, and other kinds of synchronization primitives to make a threaded program “safe” which is another word for “correct”.

All synchronization primitives except the spinlock enter the kernel; entering the kernel is more than just pushing a few registers.
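For reference, the spinlock is the one primitive that can stay entirely in user space; here is a minimal sketch using std::atomic_flag (modern C++ for brevity, well after this discussion). It only makes sense when the critical section is very short, otherwise a contended spinner just burns its time slice.

[code]
#include <atomic>

class Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // One atomic read-modify-write per attempt; no kernel transition.
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};
[/code]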

All synchronization primitives are built on atomic bus transactions. Atomic bus transactions go across ALL buses, including the PCI bus, which means they’re a microsecond at a minimum. This is regardless of CPU or FSB speed – PCI latency of 32 clocks, on a 33 MHz bus, gives you a microsecond.

Last, switching threads has different cost on different platforms. Many platforms have a bunch of thread-local data, which either get switched on context switch, or has to be looked-up based on current-thread-id when asked for.

The cheapest version of threads is cooperative threads, usually known as “fibers” these days – you can build those yourself on top of longjmp(). But then you don’t get pre-emption.
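For what it’s worth, Win32 also exposes cooperative fibers directly, so here is a tiny sketch of the idea using that API instead of a hand-rolled longjmp() version (purely illustrative): nothing runs until the current fiber explicitly switches to another one.

[code]
#include <windows.h>
#include <cstdio>

LPVOID mainFiber = nullptr;

VOID CALLBACK WorkerFiber(LPVOID)
{
    std::printf("worker: part 1\n");
    SwitchToFiber(mainFiber);                 // cooperatively yield back
    std::printf("worker: part 2\n");
    SwitchToFiber(mainFiber);
}

int main()
{
    mainFiber = ConvertThreadToFiber(nullptr);             // current thread becomes a fiber
    LPVOID worker = CreateFiber(0, WorkerFiber, nullptr);  // 0 = default stack size

    SwitchToFiber(worker);                    // runs "part 1", then control returns here
    std::printf("main: between yields\n");
    SwitchToFiber(worker);                    // runs "part 2"

    DeleteFiber(worker);
    return 0;
}
[/code]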

Anyway, if your rendering API is entered from more than one thread, be prepared to pay a heavy, heavy synchronization penalty. So don’t do it.

Originally posted by jwatte:
[b]The cost in threads is mostly in the added complexity of locking. You need critical sections, semaphores, events, and other kinds of synchronization primitives to make a threaded program “safe” which is another word for “correct”.

All synchronization primitives except the spinlock enter the kernel; entering the kernel is more than just pushing a few registers.
[/b]

I have very rarely seen people use more than critical sections (spinlock, cheap mutex for threads, etc.), and as you’ve written, it is cheap…

[b]
All synchronization primitives are built on atomic bus transactions. Atomic bus transactions go across ALL buses, including the PCI bus, which means they’re a microsecond at a minimum. This is regardless of CPU or FSB speed – PCI latency of 32 clocks, on a 33 MHz bus, gives you a microsecond.

Last, switching threads has different cost on different platforms. Many platforms have a bunch of thread-local data, which either get switched on context switch, or has to be looked-up based on current-thread-id when asked for.

The cheapest version of threads is cooperative threads, usually known as “fibers” these days – you can build those yourself on top of longjmp(). But then you don’t get pre-emption.

Anyway, if your rendering API is entered from more than one thread, be prepared to pay a heavy, heavy synchronization penalty. So don’t do it.[/b]

On all platforms thread switching is going to get cheaper and cheaper, because it’s quite an important factor for performance.
And fibers are a hack, don’t call them threads…

I’d second that
“Fibers”, if they are what I think they are, are only an illusion that falls apart when you notice you don’t get any parallelism out of SMP (or HTT for that matter).

Otherwise I’m still clueless. Please go on

Originally posted by jwatte:
All synchronization privitives are built on atomic bus transactions. Atomic bus transactions go across ALL buses, including the PCI bus, which means they’re a microsecond at a minimum. This is regardless of CPU or FSP speed – PCI latency of 32 clocks, on a 33 MHz bus, gives you a microsecond.

Does this apply to single-CPU systems?
On my 1.4 GHz Athlon, the cost of Windows’ Interlocked*() functions is ~10 cycles (140M/s)
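For illustration (an assumption about typical usage, not a quote from the post), this is the kind of call being measured: a single atomic instruction in user space, with no kernel transition.

[code]
#include <windows.h>

volatile LONG framesFinished = 0;

void workerDone()
{
    InterlockedIncrement(&framesFinished);   // one atomic read-modify-write, no kernel call
}
[/code]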

If you divide your application into

  1. app time
  2. cull time
  3. draw time

you can share/save some time in both single processor systems and in multi processor systems.

e.g. let your software do animations and physics in the app stage while draw is waiting for screen vsync or swap.

e.g. let cull work on the next frame while draw is waiting for vsync or swap or VBO data updates.

I think it is quite bad for an application to take 100% of the CPU to do the rendering (which most apps and demos do) without letting the app do calcs and culling.
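A sketch of the second overlap above, with hypothetical cull()/draw() stand-ins: culling for frame N+1 runs on a worker thread while the render thread draws frame N and waits for the swap, using two cull-result buffers so the threads never touch the same one.

[code]
#include <thread>
#include <vector>
#include <functional>

struct CullResult { std::vector<int> visibleObjects; };

// Hypothetical stand-ins for the real pipeline stages.
void cull(int frame, CullResult& out) { /* frustum-test the scene for this frame into `out` */ }
void draw(const CullResult& in)       { /* issue GL calls; ends in the blocking swap */ }

void runFrames(int frameCount)
{
    CullResult results[2];
    cull(0, results[0]);                        // prime the pipeline

    for (int frame = 0; frame < frameCount; ++frame) {
        CullResult& current = results[frame % 2];
        CullResult& next    = results[(frame + 1) % 2];

        // Cull frame N+1 on a worker thread while frame N is drawn and
        // the render thread waits for vsync/swap.
        std::thread culler(cull, frame + 1, std::ref(next));
        draw(current);
        culler.join();
    }
}
[/code]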

BTW, I wanted to mention how Delphi copes with thread synchronisation. It forces the method to be called by the main thread, passing the class pointer and method address via message parameters. I find it somehow cool.
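A hedged C++ analogue of that Delphi trick (all names illustrative): worker threads post calls into a queue, and the main thread drains it at a safe point, e.g. once per frame or per message-pump iteration, so the calls always execute on the main thread.

[code]
#include <functional>
#include <mutex>
#include <queue>

class MainThreadQueue {
    std::mutex                        lock;
    std::queue<std::function<void()>> pending;
public:
    // Called from any worker thread.
    void post(std::function<void()> call)
    {
        std::lock_guard<std::mutex> guard(lock);
        pending.push(std::move(call));
    }

    // Called only from the main thread, e.g. once per frame.
    void drain()
    {
        std::queue<std::function<void()>> local;
        {
            std::lock_guard<std::mutex> guard(lock);
            local.swap(pending);             // grab the whole batch, then drop the lock
        }
        while (!local.empty()) {
            local.front()();                 // run the call on the main thread
            local.pop();
        }
    }
};
[/code]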