Bad multi-GPU performance scaling

I’m having trouble running my OpenGL renderer in a multi-GPU configuration. There are two Quadro graphics cards in my computer, with one monitor connected to each card. My renderer creates two windows, one on each monitor, with correct GPU affinity. After that, two rendering threads are created, each with its own rendering context and a local copy of the data to render. There is no data sharing or synchronization between the threads, and no data sharing between the render contexts.

The trouble is that there is almost no performance scaling and GPU utilization stays below 50%. The framerate is exactly the same as when rendering on a single GPU.

I can run this renderer in a one-thread/one-window configuration. In that case, the selected GPU's utilization is almost 100% and the frame time is exactly half of what it is in the situation above. Surprisingly, by running two instances of this renderer I can achieve perfect utilization of both GPUs.

I have observed similar behavior on Windows 8.1 64-bit and also on Linux, both running the latest NVIDIA drivers.

  • What do I have to do to achieve good scaling in a one-process/multiple-render-threads configuration? Do I need a special driver profile for my app? Are there other conditions to meet?

  • Under Windows 8.1, it seems that GPU affinity is set correctly by default from the initial window position. I can verify via the Nsight Performance Profiler that each thread is sending commands to a different GPU.

  • Under X11, there are two separate X server screens and Xinerama is disabled (see the sketch just after this list).
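
For completeness, here is a minimal sketch of that X11 setup: two separate X screens, each driven from its own thread with its own connection and GLX context. The display names ":0.0"/":0.1" are placeholders for my actual configuration.

#include <X11/Xlib.h>
#include <GL/glx.h>
#include <thread>

static void render_screen(const char * name)
{
	Display * dpy = XOpenDisplay(name);		// one connection per thread
	if (!dpy)
		return;

	int attribs [] = { GLX_RGBA, GLX_DOUBLEBUFFER, GLX_DEPTH_SIZE, 24, None };
	XVisualInfo * vi = glXChooseVisual(dpy, DefaultScreen(dpy), attribs);
	if (!vi)
		return;

	XSetWindowAttributes swa = {};
	swa.border_pixel = 0;
	swa.colormap = XCreateColormap(dpy, RootWindow(dpy, vi->screen), vi->visual, AllocNone);
	Window win = XCreateWindow(dpy, RootWindow(dpy, vi->screen), 0, 0, 1024, 768,
		0, vi->depth, InputOutput, vi->visual, CWBorderPixel | CWColormap, &swa);
	XMapWindow(dpy, win);

	GLXContext ctx = glXCreateContext(dpy, vi, nullptr, True);
	glXMakeCurrent(dpy, win, ctx);			// context becomes current in this thread

	for (int frame = 0; frame < 1000; ++frame)	// event/exit handling omitted
	{
		glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
		// ... render ...
		glXSwapBuffers(dpy, win);
	}

	glXMakeCurrent(dpy, None, nullptr);
	glXDestroyContext(dpy, ctx);
	XCloseDisplay(dpy);
}

int main()
{
	XInitThreads();		// required before using Xlib from multiple threads
	std::thread t(render_screen, ":0.0");
	render_screen(":0.1");
	t.join();
}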

I would be grateful for any tips or suggestions.

One thing that could help would be creating each context in the thread that uses it.

Unfortunately, the contexts are already created from within their rendering threads.

Here is the simple code I’m using to set up and test my rendering. Maybe there are some errors I’m unable to see…


#include <windows.h>
#include <GL/glew.h>
#include <GL/wglew.h>

#include <iostream>
#include <stdexcept>
#include <thread>

class window
{
public:

	window(int x, int y, int width, int height, const char * title, int affinity = -1)
	{
		HINSTANCE hInstance = GetModuleHandle(NULL);

// window class registration omitted here (the class is registered under `title`)

		wnd = CreateWindow(title, title,
			WS_CAPTION | WS_BORDER | WS_SIZEBOX | WS_SYSMENU | WS_MAXIMIZEBOX | WS_MINIMIZEBOX,
			x, y, width, height,
			NULL, NULL, hInstance, NULL);
		if (!wnd)
			throw std::runtime_error("CreateWindow failed!");

		SetWindowLongPtr(wnd, GWLP_USERDATA, (LONG_PTR)this);

		dc = GetDC(wnd);

		PIXELFORMATDESCRIPTOR pfd =    
		{
			sizeof(PIXELFORMATDESCRIPTOR),         // Size Of This Pixel Format Descriptor
			1,                                      // Version Number
			PFD_DRAW_TO_WINDOW |                    // Format Must Support Window
			PFD_SUPPORT_OPENGL |                    // Format Must Support OpenGL
			PFD_DOUBLEBUFFER,                       // Must Support Double Buffering
			PFD_TYPE_RGBA,                          // Request An RGBA Format
			24,                                     // Select Our Color Depth
			0, 0, 0, 0, 0, 0,                       // Color Bits Ignored
			1,                                      // Alpha Buffer
			0,                                      // Shift Bit Ignored
			0,                                      // No Accumulation Buffer
			0, 0, 0, 0,                             // Accumulation Bits Ignored
			24,                                     // 24 Bit Z-Buffer (Depth Buffer)  
			8,                                      // 8 Bit Stencil Buffer
			0,                                      // No Auxiliary Buffer
			PFD_MAIN_PLANE,                         // Main Drawing Layer
			0,                                      // Reserved
			0, 0, 0                                 // Layer Masks Ignored
		};

		int _pixelFormat = ChoosePixelFormat(dc, &pfd);
		if (_pixelFormat == 0)
			throw std::runtime_error("ChoosePixelFormat failed!");

		if (SetPixelFormat(dc, _pixelFormat, &pfd) == FALSE)
			throw std::runtime_error("SetPixelFormat failed!");

		rc = wglCreateContext(dc);
		if (!rc)
			throw std::runtime_error("wglCreateContext failed!");

		wglMakeCurrent(dc, rc);

		glewInit();	// temporary context; only needed so GLEW can load the WGL extensions

		if ((WGLEW_NV_gpu_affinity) && (affinity != -1))
		{
			HGPUNV gpu;
			if (!wglEnumGpusNV(affinity, &gpu))
				throw std::runtime_error("wglEnumGpusNV failed!");

			HGPUNV gpu_list [] = { gpu, nullptr };	// NULL-terminated GPU list
			affinity_dc = wglCreateAffinityDCNV(gpu_list);
			if (!affinity_dc)
				throw std::runtime_error("wglCreateAffinityDCNV failed!");

			int _pixelFormat = ChoosePixelFormat(affinity_dc, &pfd);
			if (_pixelFormat == 0)
				throw std::runtime_error("ChoosePixelFormat failed!");

			if (SetPixelFormat(affinity_dc, _pixelFormat, &pfd) == FALSE)
				throw std::runtime_error("SetPixelFormat failed!");

			affinity_rc = wglCreateContext(affinity_dc);
			if (!affinity_rc)
				throw std::runtime_error("wglCreateContext failed!");

			// The affinity context is made current on the *window* DC, so output
			// still reaches the window while commands run on the selected GPU.
			if (!wglMakeCurrent(dc, affinity_rc))
				throw std::runtime_error("wglMakeCurrent failed!");
		}

		ShowWindow(wnd, SW_SHOW);
		UpdateWindow(wnd);
	}


	template <typename F>
	void run(F fn)
	{
		MSG msg;

		bool done = false;
		while (!done)
		{
			while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))	// NULL, not wnd: WM_QUIT is a thread message
			{
				if (msg.message == WM_QUIT)
					done = true;
				
				TranslateMessage(&msg);
				DispatchMessage(&msg);
			}
			
			fn();

			SwapBuffers(dc);
		}
	}

	static LRESULT CALLBACK MainWndProc(HWND hWnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
	{
		window * w = (window*) GetWindowLongPtr(hWnd, GWLP_USERDATA);
		if ((!w) || (w->wnd != hWnd))
			return DefWindowProc(hWnd, uMsg, wParam, lParam);

		switch (uMsg)
		{
		case WM_CREATE:
			break;
		case WM_PAINT:
			break;
		case WM_SIZE:
			break;
		case WM_CLOSE:
			break;	// DefWindowProc destroys the window; WM_DESTROY posts the quit
		case WM_DESTROY:
			PostQuitMessage(0);
			break;
		}
		return DefWindowProc(hWnd, uMsg, wParam, lParam);
	}
...
};


void run(int x, int y, int w, int h, const char * title, int affinity = -1)
{
	try
	{
		window wnd(x, y, w, h, title, affinity);
// create gbuffer fbo
// create shaders/programs
// load textures and meshes
		wnd.run([&]()
		{
// render to gbuffer fbo
// display result from gbuffer
		});
	}
	catch (std::exception & e) { std::cout << e.what() << std::endl; }
	catch(...) { std::cout << "Unknown exception!" << std::endl; }
}

int main(int argc, char * argv [])
{
	try
	{
		std::thread t1(run, 50, 50, 1024, 768, "win1", 0);
		run(1920 + 50, 50, 1024, 768, "win2", 1);
		t1.join();
	}
	catch (std::exception & e) { std::cout << e.what() << std::endl; }
	catch (...) {  std::cout << "Unknown exception!" << std::endl; }
}

For the purpose of testing, there is no data upload during the render loop; there is just binding of textures, binding of vertex buffers, and glDrawArraysInstancedBaseInstance calls. Data for each draw call is sourced from a shader storage buffer indexed by the base instance (gl_BaseInstanceARB from ARB_shader_draw_parameters).
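
In outline, each frame callback does something like this (a sketch; the handle names and counts are placeholders, not names from the real renderer):

// Sketch of the per-frame test workload: binds only, no uploads. The handles
// prog/tex/vao/vbo/ssbo and the counts are placeholders for resources created
// at load time, not names from the real code.
#include <GL/glew.h>

extern GLuint prog, tex, vao, vbo, ssbo;
extern GLsizei drawCount, vertsPerMesh;

void draw_frame()
{
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
	glUseProgram(prog);

	glBindTexture(GL_TEXTURE_2D, tex);
	glBindVertexArray(vao);
	glBindBuffer(GL_ARRAY_BUFFER, vbo);
	glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);	// per-draw parameters

	// Each draw selects its slot in the SSBO through the baseinstance argument,
	// which the vertex shader reads via gl_BaseInstanceARB.
	for (GLuint i = 0; i < (GLuint) drawCount; ++i)
		glDrawArraysInstancedBaseInstance(GL_TRIANGLES, 0, vertsPerMesh, 1, i);
}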

[QUOTE=TomSka;1260248]The trouble is that there is almost no performance scaling and GPU utilization stays below 50%. The framerate is exactly the same as when rendering on a single GPU.

I can run this renderer in a one-thread/one-window configuration. In that case, the selected GPU's utilization is almost 100% and the frame time is exactly half of what it is in the situation above. Surprisingly, by running two instances of this renderer I can achieve perfect utilization of both GPUs.
[/QUOTE]

This is the same behavior I observed six years ago while trying to use three NVIDIA Quadro GPUs in a single system. I reported it to NVIDIA, who mentioned something about their OpenGL driver serializing all work within a process (but, as you noticed, not across different processes). They tracked the bug for a couple of years, didn’t fix it, and it sounds like it must still be a problem today. For what it’s worth, their Direct3D driver doesn’t have this limitation.

At the time, their GPU affinity extension was being promoted alongside QuadroPlex systems, with a paper on how well it scaled across multiple GPUs, so it was pretty surprising to find that the advertised scaling could not actually be obtained in practice. Spending $10k on GPUs just to try out a feature based on that advertising was a pricey mistake…

I have switched back to GLFW and, after creating two fullscreen windows, it’s finally working! I’m getting around 85% performance scaling (4.45 ms per frame for one window vs. 5.15 ms for two windows/threads on two GPUs). I was expecting slightly better results, but it’s better than nothing. Later I tested two older AMD 5870 cards and got almost 100% scaling. Roughly, the working setup looks like the sketch below (simplified; event handling and error checks trimmed).
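
// Simplified sketch of the working GLFW 3 setup: windows are created on the
// main thread (a GLFW requirement), then each render thread makes "its"
// context current and renders independently. No context sharing.
#include <GLFW/glfw3.h>
#include <thread>

static void render_loop(GLFWwindow * win)
{
	glfwMakeContextCurrent(win);		// bind this window's context to this thread
	while (!glfwWindowShouldClose(win))
	{
		glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
		// ... render ...
		glfwSwapBuffers(win);
	}
}

int main()
{
	if (!glfwInit())
		return 1;

	int count = 0;
	GLFWmonitor ** monitors = glfwGetMonitors(&count);
	if (count < 2)
		return 1;			// one monitor per GPU is assumed

	// One fullscreen window per monitor, at each monitor's native video mode.
	const GLFWvidmode * m0 = glfwGetVideoMode(monitors[0]);
	const GLFWvidmode * m1 = glfwGetVideoMode(monitors[1]);
	GLFWwindow * w0 = glfwCreateWindow(m0->width, m0->height, "win1", monitors[0], nullptr);
	GLFWwindow * w1 = glfwCreateWindow(m1->width, m1->height, "win2", monitors[1], nullptr);

	std::thread t0(render_loop, w0);
	std::thread t1(render_loop, w1);

	// Events must be pumped on the main thread.
	while (!glfwWindowShouldClose(w0) && !glfwWindowShouldClose(w1))
		glfwWaitEvents();

	t0.join();
	t1.join();
	glfwTerminate();
	return 0;
}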

Funny thing: at the beginning I was using GLFW with fullscreen windows without much luck. But there may have been a bug on my side, since both the contexts and the windows were created from within the main thread. Because of that I switched to custom window-creation code and never tried a fullscreen window again. :(

Thank you for your input! I was starting to think that it wasn’t possible to get this working… :)