Parallel OpenGL, MPI, GLUT, multithreading, offscreen rendering

Hello everyone,

I’m starting a new project using OpenGL 1.5 + GLSLANG.

The project is a parallel volume rendering application running on a Beowulf cluster.

The application uses MPI for communication.

The idea is simple: the volume data is partitioned into an octree, then distributed among the nodes to be rendered separately; each subimage is then sent to the server node for back-to-front compositing and display to the user.

As I see it for the moment, the parallel application is going to look like this:

  • the cluster runs Windows 2000 on each node;

  • an MPI master task is going to be running on the server node;

  • an MPI worker task is going to be instantiated on each cluster node;

  • each worker task will consist of two threads (see the sketch after this list):

      - a "network" thread that listens for commands from the master node and sends locally computed subimages back to it using MPI calls;
      - an "OpenGL rendering" thread that actually performs the volume rendering.

So the questions are:

-Until now I’ve always used GLUT for prototyping with OpenGL. Is it possible to use GLUT in a thread of a multithreaded application?

-Is it possible to do offscreen rendering using GLUT, without opening any window on the screen?

-If the answers to the two preceding questions are YES:

       - is it possible to share the OpenGL context between the "GLUT" thread and the "network" thread, so that the latter can receive the modelview matrix, zoom factor, and transfer function from the master node and issue OpenGL calls to update the context while the GL thread is computing the image? (I figure I could use several checkpoints in the rendering loop to preempt and restart the current rendering pass with the updated context whenever it has changed.)

-If GLUT can be used for all that, can the user input be handled solely on the master node with GLUT mouse and keyboard callbacks, and propagated to the slave nodes?

I know I’m asking a lot of newbie questions, but I’m new to cluster computing, and using GLUT all these years has kept me away from system-level issues like this.

Thank you very very much for your help.

Using MPI, each program running on a node will be independent. As far as I know there is no way you can share an OpenGL context/GLUT window across several nodes.

Under Unix, several nodes could send OGL commands to a single rendering node, but that would be pointless: the rendering node would do nearly all the work, while the other nodes would do peanuts.

Three ways you could do it:

  1. Each node creates its own GL (GLUT) window. It renders one frame out of N (i.e. frames i, N+i, 2N+i, 3N+i, …) and sends it to the main node. The main node just displays frame t, which comes from node t mod N.

  2. Each node creates its own GL (GLUT) window. It renders 1/N of the frame (that is, you split the frame into N subframes). Then each node sends its rendered result to the main node, which rebuilds the whole image.

  3. Each node creates its own GL (GLUT) window. It renders a full-screen view of its 1/N subpart of the octree. Then it sends both the frame buffer and the z-buffer to the main node. The main node rebuilds the whole image by drawing, in N passes, a full-screen quad with two textures (the i-th frame buffer and the i-th z-buffer) to do some kind of displacement mapping.

To sum up: in the end you’ll have to send images over the network and rebuild the result on your own. OGL won’t do it for you.
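The readback-and-send part each node ends up doing boils down to something like this (just a sketch; rank 0 as the main node and the two tags are assumptions of mine, not anything MPI or GL defines):

```cpp
#include <GL/gl.h>
#include <mpi.h>
#include <vector>

// Read back the colour and depth buffers of the current GL context and
// ship them to the main node (rank 0 and tags 10/11 assumed).
void sendSubimage(int w, int h) {
    std::vector<unsigned char> rgba(size_t(w) * h * 4);
    std::vector<float>         depth(size_t(w) * h);

    glReadBuffer(GL_BACK);
    glReadPixels(0, 0, w, h, GL_RGBA,            GL_UNSIGNED_BYTE, rgba.data());
    glReadPixels(0, 0, w, h, GL_DEPTH_COMPONENT, GL_FLOAT,         depth.data());

    MPI_Send(rgba.data(),  int(rgba.size()),  MPI_UNSIGNED_CHAR, 0, 10, MPI_COMM_WORLD);
    MPI_Send(depth.data(), int(depth.size()), MPI_FLOAT,         0, 11, MPI_COMM_WORLD);
}
```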

Actually each slave node will render 1/8th of the total volume data, and send the result to the master node for back-to-front alpha compositing and display.
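The compositing itself is just the "over" operator applied far-to-near; a minimal sketch, assuming premultiplied-alpha subimages that arrive already depth-sorted for the current view:

```cpp
#include <cstddef>
#include <vector>

// One pixel of a subimage; premultiplied alpha assumed.
struct RGBA { float r, g, b, a; };

// Blend the nearer subimage over the accumulated image (classic "over"
// operator). Call once per slave image, walking from far to near.
void compositeOver(std::vector<RGBA>& acc, const std::vector<RGBA>& nearer) {
    for (std::size_t i = 0; i < acc.size(); ++i) {
        const RGBA& f = nearer[i];
        const float k = 1.0f - f.a;    // transparency of the nearer layer
        acc[i].r = f.r + k * acc[i].r;
        acc[i].g = f.g + k * acc[i].g;
        acc[i].b = f.b + k * acc[i].b;
        acc[i].a = f.a + k * acc[i].a;
    }
}
```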

My multithreading question was about the slave part of the application (running on each of the 8 slave nodes):

Each of the 8 slave nodes will have its own GL rendering loop, each with its own GL context.

The slave contexts need to be synchronized according to user interaction (which is handled by the master node).

So I would like to split each slave rendering process into a GL rendering thread + network thread, so that the slave can listen to the network while the rendering thread is at work.

So I wanted to know whether an OpenGL rendering context could be shared among several threads, even if one of those threads uses GLUT.

Originally posted by relmas:
[b]My multithreading question was about the slave part of the application (running on each of the 8 slave nodes):

Each of the 8 slave nodes will have its own GL rendering loop, each with its own GL context.

The slave contexts need to be synchronized according to user interaction (which is handled by the master node).

So I would like to split each slave rendering process into a GL rendering thread + network thread, so that the slave can listen to the network while the rendering thread is at work.

So I wanted to know whether an OpenGL rendering context could be shared among several threads, even if one of those threads uses GLUT.[/b]
An OpenGL context can only be active in one thread at any given time, so if you want to do what I think you want to do (change the modelview in the network thread), it won’t work.

The results would be undefined anyway. Changes to viewing parameters have to be synchronized to drawing, otherwise you’ll get two half-frames instead of one complete one.

You should just update your viewing parameters after you’re done rendering the last frame. In this case the network thread doesn’t need OpenGL access, which will simplify things quite a bit. This will also allow you to use GLUT easily, which otherwise can be a bit of a pain, as GLUT sets the context to its window a lot.
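Concretely, the pattern could look like this (a rough sketch with invented names, not how any particular library does it): the network thread only writes a pending-parameters slot, and the GLUT display callback applies it between frames, so the context never leaves the rendering thread.

```cpp
#include <GL/glut.h>
#include <mutex>

static std::mutex gLock;
static float      gPendingModelview[16];
static bool       gDirty = false;

// Called from the network thread: it only touches plain memory, never GL.
void postViewUpdate(const float mv[16]) {
    std::lock_guard<std::mutex> g(gLock);
    for (int i = 0; i < 16; ++i) gPendingModelview[i] = mv[i];
    gDirty = true;
}

// GLUT display callback, running in the thread that owns the context.
void display() {
    {
        std::lock_guard<std::mutex> g(gLock);
        if (gDirty) {                    // apply updates only between frames
            glMatrixMode(GL_MODELVIEW);
            glLoadMatrixf(gPendingModelview);
            gDirty = false;
        }
    }
    // ... render the volume brick, read it back, hand it to the network thread
    glutSwapBuffers();
    glutPostRedisplay();                 // keep the render loop spinning
}
```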

We’re running a similar model for the OpenSG clustering, and it’s working quite nicely.

Hope it helps

Dirk

Before starting such a massive implementation I would suggest doing some initial tests:

  • rendering whole screen on one machine
  • rendering 1/8 scene on one machine
  • read out frame- and depthbuffer (surprise surprise)
  • combine images and get them rendered
  • send partial images over network (watch out for latency)

I am sure some things will pop up that you did not expect. After that you can try to fiddle with MPI, multithreading and stuff.

Matthias

  • read out frame- and depthbuffer (surprise surprise)
  • send partial images over network (watch out for latency)

That’s what I thought also. I supposed the goal was to render massive scenes (like hundreds of millions of polygons) as fast as possible, even if it is not real time.

Originally posted by tfpsly:
That’s what I thought also. I supposed the goal was to render massive scenes (like hundreds of millions of polygons) as fast as possible, even if it is not real time.
He mentioned volume rendering, and that is always fill-limited and therefore scales very nicely over multiple machines.

But the warnings are not unfounded. Readback performance is not very good on the current crop of cards (unless you’re running PCIe, which seems to work quite nicely), especially for the Z-buffer. But for volume rendering doing sort-first is probably OK, if you do some kind of load-balancing.

I’ll be doing something very similar (without MPI, but using OpenSG instead) later this year, so I’d be interested in your results once they become available.

Dirk

I have implemented almost exactly what you are talking about, with OpenGL and MPI and a volume distributed over multiple nodes. Most of the issues I ran into were buffer read/write bandwidth and latency, and overall synchronization. With the implementation of MPI I was using, I also ran into problems with its internal buffers overflowing, leading to strange crashes and lock-ups. This was no fun to debug (and is still not fully debugged), so take your time here.

I agree with everything everyone else has said, so I won’t bother repeating it. I just want to answer one question you raised which hasn’t been addressed yet:

AFAIK, you can’t open a “windowless” GL context on Windows; there is nothing in WGL to allow that. So, sadly, you’d have to have a GLUT or other window, make it as small as possible, and use a pbuffer to do your actual rendering. However, you CAN do this on Linux. My parallel renderer was designed for Linux and uses a pbuffer class I wrote that can create a GL context not attached to a visible window. This works well, but can get screwed up if someone logs out of the machine (regardless of whether they were logged in when the process started; it has to do with the way X frees resources and is unavoidable).
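The core of that pbuffer class is roughly the following (a sketch with GLX 1.3 calls; the function name and attribute values are illustrative, and you still need XOpenDisplay and proper error handling around it):

```cpp
#include <GL/glx.h>
#include <X11/Xlib.h>

// Create a GL context bound to a pbuffer instead of a window (GLX 1.3).
GLXContext makeOffscreenContext(Display* dpy, int w, int h, GLXPbuffer* pbufOut) {
    const int fbAttribs[] = {
        GLX_DRAWABLE_TYPE, GLX_PBUFFER_BIT,
        GLX_RENDER_TYPE,   GLX_RGBA_BIT,
        GLX_RED_SIZE, 8, GLX_GREEN_SIZE, 8, GLX_BLUE_SIZE, 8, GLX_ALPHA_SIZE, 8,
        GLX_DEPTH_SIZE, 24,
        None
    };
    int n = 0;
    GLXFBConfig* cfgs = glXChooseFBConfig(dpy, DefaultScreen(dpy), fbAttribs, &n);
    if (!cfgs || n == 0) return 0;

    const int pbAttribs[] = { GLX_PBUFFER_WIDTH, w, GLX_PBUFFER_HEIGHT, h, None };
    *pbufOut = glXCreatePbuffer(dpy, cfgs[0], pbAttribs);

    GLXContext ctx = glXCreateNewContext(dpy, cfgs[0], GLX_RGBA_TYPE, 0, True);
    glXMakeContextCurrent(dpy, *pbufOut, *pbufOut, ctx);
    XFree(cfgs);
    return ctx;
}
```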

FYI

Happy to get some hints from you guys.

The project is in fact a pre-project, a sort of feasibility study. I started coding a volume rendering app running on a Radeon 9600 back in March; I implemented the full state-of-the-art array of texture-based, back-to-front techniques you can find in the bibliography. Then I started to look at the front-to-back approaches with early termination and empty space skipping (a brilliant rendering algorithm, by the way).
It seems that a combination of space partitioning, levels of detail, and early termination would allow the slave nodes to work asynchronously, and that some of the idle time saved by early termination on the slave nodes could be used to start compositing early and break the final compositing bottleneck.

Since most of the time people are looking at isosurfaces, the majority of the rendered volume ends up transparent.
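For reference, the front-to-back accumulation with early termination I mentioned above is just this, per ray (a sketch; the 0.95 opacity threshold is an arbitrary example):

```cpp
// Per-ray front-to-back accumulation with early ray termination.
// Samples must come in front-to-back order.
struct Sample { float r, g, b, a; };

void integrateRay(const Sample* samples, int n, float out[4]) {
    float r = 0.0f, g = 0.0f, b = 0.0f, alpha = 0.0f;
    for (int i = 0; i < n && alpha < 0.95f; ++i) {  // stop once nearly opaque
        const float w = (1.0f - alpha) * samples[i].a;
        r += w * samples[i].r;
        g += w * samples[i].g;
        b += w * samples[i].b;
        alpha += w;
    }
    out[0] = r; out[1] = g; out[2] = b; out[3] = alpha;
}
```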

If the application looks promising on a bunch of 2.4 GHz P4s running Radeon 9600s, the boss might be interested in upgrading the cluster with 256 MB GeForce 6800 boards, or even some 3Dlabs cards with 512 MB of RAM.

But first I have to make the damn thing run!!! Parallel coding is really new to me; so far it’s more a conceptual way of thinking than actual practice.

I want the app to be ready for the future; that’s why I want to use GLslang.

I plan to set up a web page for the project once there are some results to show and enough interesting technical experience to share.

Thanks for the valuable hints, and if you’ve got some more, keep the thread alive!

If some of you have links on direct volume rendering of hexahedral cells, I’m interested.

Special question for Codemonkey76: what MPI implementation did you use? I’m using MPICH.
From my googling on MPI, I have the feeling that people use MPI for rapid prototyping, and eventually move down to sockets for performance tuning.