Avoiding the default framebuffer blit overhead

l_belev · November 17, 2012, 9:53am

Hi,

First I will describe the problem.
As we know, the default framebuffer (0) is a remnant from the past which for some mysterious reason OpenGL is still dragging along like a bag of stones.
It is very inflexible and totally alien to many modern-day ways of doing things, e.g. deferred rendering.
One would often need to be able to combine freely various color/depth/stencil buffers, which is easy with the FBO infrastructure.
But when we need to display something there is a problem. The final image to be displayed is often not generated in the default framebuffer,
because we need the flexibility of FBOs. For example we may need the depth buffer used to render the scene available as a texture or something.

Then we need to blit to the default framebuffer. This adds overhead, which may be something like 1-2 milliseconds per frame.

In Direct3D the color buffer that can be displayed (the swapchain) is a pure color-buffer-only object from the POV of the renderer and can be combined with other buffers just like the non-displayable ones.
This is unlike the OpenGL default framebuffer, which drags its own depth buffer along (or has none) and cannot be changed.

I experimented a bit with the NVIDIA WGL_NV_DX_interop2 extension.
I created a D3D11 device with its swapchain, then used the extension to set up an OpenGL renderbuffer that corresponds to the swapchain back buffer.
Then I did some rendering in OpenGL while presenting the image to the window the D3D way.
After some tweaking I managed to get that running faster than OpenGL’s own way using a blit.

All the rendering was just a glClear(GL_COLOR_BUFFER_BIT) and then present the result.

I tested 3 cases:
a) opengl clear + opengl present (using blit to the default fb)
b) opengl clear + d3d present
c) d3d clear + d3d present.

b) and c) are equally fast and a) is noticeably slower than them.

The mentioned tweaking included removing the synchronization calls (wglDXLockObjectsNV and wglDXUnlockObjectsNV).
I only call wglDXLockObjectsNV once and the objects stay locked all the time (otherwise OpenGL generates GL_INVALID_FRAMEBUFFER_OPERATION).

The render loop is basically:
glClearColor(0, rand()%256*(1.0f/256), 0, 1);
glClear(GL_COLOR_BUFFER_BIT);
glFlush();
sc->Present(0, 0);
The back buffer of the swapchain is bound to the OpenGL draw framebuffer.

Also, when the swapchain is created, BufferUsage must include the DXGI_USAGE_RENDER_TARGET_OUTPUT flag, otherwise the performance is crippled.

It is a shame that this ugly hack actually outperforms OpenGL’s native way of outputting its graphics.
I think it is about time they get rid of the default framebuffer.
They can look at the ipad for an idea how to do it.

l_belev · November 17, 2012, 10:33am

Here is the test source if anyone is interested in trying it.
Change “mode” to select among the 3 cases I mentioned (see the comment).
Also, “start” is the program entry point (I set that in the linker options). You can rename it to WinMain or whatever.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define INITGUID
#include <windows.h>
#include <GL/gl.h>
#include <d3d11.h>

static LRESULT CALLBACK wnd_proc(HWND wnd, UINT msg, WPARAM wp, LPARAM lp)
{
    switch (msg) {
        case WM_PAINT: ValidateRect(wnd, NULL); return 0;
        case WM_CLOSE: ExitProcess(0); return 0;
        default: return DefWindowProcA(wnd, msg, wp, lp);
    }
}

#define WIDTH 1024
#define HEIGHT 768

// 0 = gl_clear/gl_present, 1 = gl_clear/d3d_present, 2 = d3d_clear/d3d_present
int mode = 1;

void start()
{
    // window
    WNDCLASSA wc;
    RECT rc;
    HWND wnd;
    // d3d
    ID3D11Device *d3ddev;
    ID3D11DeviceContext *d3dctx;
    IDXGISwapChain *sc;
    DXGI_SWAP_CHAIN_DESC scd;
    ID3D11Texture2D *d3dbb;
    ID3D11RenderTargetView *view;
    // opengl
    HDC dc;
    PIXELFORMATDESCRIPTOR pfd;
    int pf;
    HGLRC ctx;
    #define WGL_ACCESS_READ_WRITE_NV          0x0001
    HANDLE (WINAPI *wglDXOpenDeviceNV)(void *dxDevice);
    HANDLE (WINAPI *wglDXRegisterObjectNV)(HANDLE hDevice, void *dxObject, GLuint name, GLenum type, GLenum access);
    BOOL (WINAPI *wglDXLockObjectsNV)(HANDLE hDevice, GLint count, HANDLE *hObjects);
    BOOL (WINAPI *wglDXUnlockObjectsNV)(HANDLE hDevice, GLint count, HANDLE *hObjects);
    HANDLE idev;
    #define GL_READ_FRAMEBUFFER               0x8CA8
    #define GL_DRAW_FRAMEBUFFER               0x8CA9
    #define GL_RENDERBUFFER                   0x8D41
    #define GL_COLOR_ATTACHMENT0              0x8CE0
    void (APIENTRY *glGenFramebuffers) (GLsizei n, GLuint *framebuffers);
    void (APIENTRY *glBindFramebuffer) (GLenum target, GLuint framebuffer);
    void (APIENTRY *glFramebufferRenderbuffer) (GLenum target, GLenum attachment, GLenum renderbuffertarget, GLuint renderbuffer);
    void (APIENTRY *glGenRenderbuffers) (GLsizei n, GLuint *renderbuffers);
    void (APIENTRY *glBindRenderbuffer) (GLenum target, GLuint renderbuffer);
    void (APIENTRY *glRenderbufferStorage) (GLenum target, GLenum internalformat, GLsizei width, GLsizei height);
    GLenum (APIENTRY *glCheckFramebufferStatus) (GLenum target);
    void (APIENTRY *glBlitFramebuffer) (GLint srcX0, GLint srcY0, GLint srcX1, GLint srcY1, GLint dstX0, GLint dstY0, GLint dstX1, GLint dstY1, GLbitfield mask, GLenum filter);
    GLuint bb, fb;
    HANDLE ibb;

    // create window
    memset(&wc, 0, sizeof(wc));
    wc.lpfnWndProc = wnd_proc;
    wc.lpszClassName = "test_wc";
    wc.hCursor = LoadCursor(NULL, IDC_ARROW);
    RegisterClassA(&wc);
    rc.left = rc.top = 0;
    rc.right = WIDTH;
    rc.bottom = HEIGHT;
    AdjustWindowRect(&rc, WS_CAPTION|WS_SYSMENU, FALSE);
    wnd = CreateWindowExA(0, wc.lpszClassName, "window", WS_CAPTION|WS_SYSMENU, 0, 0, rc.right - rc.left, rc.bottom - rc.top, NULL, NULL, NULL, NULL);
    ShowWindow(wnd, SW_SHOW);

    if (mode) {
        IDXGIFactory *factory;
        IDXGIAdapter *adapter;
        IDXGIOutput *output;
        DXGI_OUTPUT_DESC od;
        CreateDXGIFactory(&IID_IDXGIFactory, (void **)&factory);
        factory->lpVtbl->EnumAdapters(factory, 0, &adapter);
        adapter->lpVtbl->EnumOutputs(adapter, 0, &output);
        output->lpVtbl->GetDesc(output, &od);
        output->lpVtbl->Release(output);
    
        // create d3d device
        memset(&scd, 0, sizeof(scd));
        scd.BufferDesc.Width = WIDTH;
        scd.BufferDesc.Height = HEIGHT;
        scd.BufferDesc.RefreshRate.Numerator = 60;
        scd.BufferDesc.RefreshRate.Denominator = 1;
        scd.BufferDesc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
        scd.SampleDesc.Count = 1;
        scd.BufferUsage = DXGI_USAGE_BACK_BUFFER|DXGI_USAGE_RENDER_TARGET_OUTPUT;
        scd.BufferCount = 1;
        scd.OutputWindow = wnd;
        scd.Windowed = TRUE;
        D3D11CreateDeviceAndSwapChain(adapter, D3D_DRIVER_TYPE_UNKNOWN, NULL, D3D11_CREATE_DEVICE_SINGLETHREADED,
            NULL, 0, D3D11_SDK_VERSION, &scd, &sc, &d3ddev, NULL, &d3dctx);
        sc->lpVtbl->GetBuffer(sc, 0, &IID_ID3D11Texture2D, (void **)&d3dbb);
        
        if (mode > 1) {
            D3D11_RENDER_TARGET_VIEW_DESC vd;
            D3D11_VIEWPORT vp;
            vd.Format = DXGI_FORMAT_UNKNOWN;
            vd.ViewDimension = D3D11_RTV_DIMENSION_TEXTURE2D;
            vd.Texture2D.MipSlice = 0;
            d3ddev->lpVtbl->CreateRenderTargetView(d3ddev, d3dbb, &vd, &view);
            d3dctx->lpVtbl->OMSetRenderTargets(d3dctx, 1, &view, NULL);
            vp.TopLeftX = vp.TopLeftY = 0;
            vp.Width = WIDTH;
            vp.Height = HEIGHT;
            vp.MinDepth = 0;
            vp.MaxDepth = 1;
            d3dctx->lpVtbl->RSSetViewports(d3dctx, 1, &vp);
        }
    }

    if (mode < 2) {    
        dc = GetDC(wnd);
        memset(&pfd, 0, sizeof(pfd));
        pfd.nSize = sizeof(pfd);
        pfd.nVersion = 1;
        pfd.dwFlags = PFD_DRAW_TO_WINDOW|PFD_SUPPORT_OPENGL|PFD_DEPTH_DONTCARE;
        pf = ChoosePixelFormat(dc, &pfd);
        SetPixelFormat(dc, pf, NULL);
        ctx = wglCreateContext(dc);
        wglMakeCurrent(dc, ctx);
        glGetString(GL_RENDERER);
        *(PROC *)&glGenRenderbuffers = wglGetProcAddress("glGenRenderbuffers");
        *(PROC *)&glGenFramebuffers = wglGetProcAddress("glGenFramebuffers");
        *(PROC *)&glBindFramebuffer = wglGetProcAddress("glBindFramebuffer");
        *(PROC *)&glFramebufferRenderbuffer = wglGetProcAddress("glFramebufferRenderbuffer");
        *(PROC *)&glCheckFramebufferStatus = wglGetProcAddress("glCheckFramebufferStatus");
        *(PROC *)&glBindRenderbuffer = wglGetProcAddress("glBindRenderbuffer");
        *(PROC *)&glRenderbufferStorage = wglGetProcAddress("glRenderbufferStorage");
        *(PROC *)&glBlitFramebuffer = wglGetProcAddress("glBlitFramebuffer");
        glGenFramebuffers(1, &fb);
        glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fb);
        glBindFramebuffer(GL_READ_FRAMEBUFFER, fb);
        glGenRenderbuffers(1, &bb);

        if (mode) {
            *(PROC *)&wglDXOpenDeviceNV = wglGetProcAddress("wglDXOpenDeviceNV");
            *(PROC *)&wglDXRegisterObjectNV = wglGetProcAddress("wglDXRegisterObjectNV");
            *(PROC *)&wglDXLockObjectsNV = wglGetProcAddress("wglDXLockObjectsNV");
            *(PROC *)&wglDXUnlockObjectsNV = wglGetProcAddress("wglDXUnlockObjectsNV");
            idev = wglDXOpenDeviceNV(d3ddev);
            ibb = wglDXRegisterObjectNV(idev, d3dbb, bb, GL_RENDERBUFFER, WGL_ACCESS_READ_WRITE_NV);
            GetLastError();
            wglDXLockObjectsNV(idev, 1, &ibb);
            glFramebufferRenderbuffer(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, bb);
            glCheckFramebufferStatus(GL_DRAW_FRAMEBUFFER);
            //wglDXUnlockObjectsNV(idev, 1, &ibb);
        } else {
            glBindRenderbuffer(GL_RENDERBUFFER, bb);
            glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, WIDTH, HEIGHT);
            glFramebufferRenderbuffer(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, bb);
            glCheckFramebufferStatus(GL_DRAW_FRAMEBUFFER);
        }
    }

    while (1) {
        MSG msg;
        while (PeekMessageA(&msg, NULL, 0, 0, PM_REMOVE))
            DispatchMessageA(&msg);

        if (mode > 1) {
            float col[4] = {rand()%256*(1.0f/256),0,0,1};
            d3dctx->lpVtbl->ClearRenderTargetView(d3dctx, view, col);
            sc->lpVtbl->Present(sc, 0, 0);
        } else {
            //if (mode) wglDXLockObjectsNV(idev, 1, &ibb);
            glClearColor(0,rand()%256*(1.0f/256),0,1);
            glClear(GL_COLOR_BUFFER_BIT);

            if (mode) {        
                glFlush();
                //wglDXUnlockObjectsNV(idev, 1, &ibb);
                sc->lpVtbl->Present(sc, 0, 0);
            } else {
                glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
                glBlitFramebuffer(0, 0, WIDTH, HEIGHT, 0, HEIGHT, WIDTH, 0, GL_COLOR_BUFFER_BIT, GL_LINEAR);
                glGetError();
                glFlush();
                glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fb);
            }
        }

        {
            // show fps in the window title bar. don't update it on every frame to avoid crippling the performance
            static DWORD fc, last;
            DWORD now = GetTickCount();
            fc += 1;
            if (!last) last = now;
            else if (now - last > 300) {
                char txt[64];
                sprintf(txt, "fps: %.4f", 1000.0f * fc / (float)(now - last));
                SetWindowTextA(wnd, txt);
                fc = 0;
                last = now;
            }
        }
    }
}

Alfonse_Reinheart · November 17, 2012, 2:11pm

DWORD now = GetTickCount();

I don’t think this makes for a good test, considering that the resolution of this function is poor (typically 10 to 16 milliseconds). Try using QueryPerformanceCounter, which has much higher resolution and is the common means of doing serious timings on Windows.
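
For anyone who wants to rerun the test with a finer clock, here is a minimal sketch of the arithmetic involved. It assumes the tick values come from QueryPerformanceCounter and the frequency from QueryPerformanceFrequency. The helper name fps_from_ticks is made up for illustration and is not part of the posted source; the counter values are passed in as plain integers so the snippet does not depend on windows.h.

```c
#include <stdint.h>

/* Hypothetical helper: average frames per second over an interval measured
   with a high-resolution counter. On Windows, start and end would be the
   QuadPart values filled in by QueryPerformanceCounter, and
   ticks_per_second the value filled in by QueryPerformanceFrequency. */
double fps_from_ticks(uint64_t start, uint64_t end,
                      uint64_t ticks_per_second, uint32_t frames)
{
    if (end <= start || ticks_per_second == 0)
        return 0.0; /* no measurable interval yet */

    double seconds = (double)(end - start) / (double)ticks_per_second;
    return (double)frames / seconds;
}
```

In the posted program this would replace the GetTickCount arithmetic inside the block that updates the window title.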

Also, you never said what your actual results are, only that one was “noticeably slower”. Oh, and I would be curious to see what you would get via query objects. That is, detecting the GPU time rather than the CPU time.

l_belev · November 17, 2012, 3:55pm

You have the source; feel free to test with QueryPerformanceCounter, queries and whatever you like. The results I got were telling enough for me.

Both d3d-present cases did about 1000 fps on my machine and the opengl-present case did somewhere between 500 and 600 fps.
The gl-clear/d3d-present case did a bit lower than pure d3d, but the difference was marginal.

To me it is clear that the gl-present case has one additional buffer copy than the d3d-present cases
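
To put those frame rates in terms of time per frame, here is a small sketch of the conversion. The function names are illustrative only and do not appear in the test program.

```c
/* Frame time in milliseconds for a given frame rate. */
double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Extra time per frame paid by the slower path, in milliseconds. */
double extra_ms_per_frame(double fast_fps, double slow_fps)
{
    return frame_time_ms(slow_fps) - frame_time_ms(fast_fps);
}
```

With the figures quoted above, extra_ms_per_frame(1000, 500) is 1.0 and extra_ms_per_frame(1000, 600) is about 0.67, so the gl-present path costs somewhere between roughly 0.67 and 1 millisecond more per frame at 1024x768 in this test.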

Alfonse_Reinheart · November 17, 2012, 5:09pm

The biggest question I have is this… what if you’re not doing it the way you describe?

Consider the case of actually rendering something for real. You’re doing deferred rendering; OK, fine. You have your g-buffers, where you have your actual data. Then you convert this into light reflectance as seen by the camera. But if you’re doing HDR (which, let’s be honest, is far more of a no-brainer than deferred rendering by this point), you’re doing all of this accumulation into a floating-point buffer. You can’t “present” that; you need to tone-map it first. Not only that, you probably have some transparent objects to render, so you need to do some blending. This should presumably be done in HDR space.

Now it’s time to tone-map down to SRGB8_ALPHA8. But where should the output go? Why not… the default framebuffer?
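
As a rough illustration of that tone-mapping step, here is a CPU-side sketch for a single channel. It assumes the simple Reinhard operator x / (1 + x), which is only one possible choice and is not named anywhere in this thread. The sRGB transfer curve is left out on purpose: with an SRGB8_ALPHA8 target and GL_FRAMEBUFFER_SRGB enabled, the GL applies it on write. In a real renderer this logic would live in a fragment shader, and both function names are made up.

```c
/* Reinhard operator (an assumed choice): maps linear HDR values in
   [0, inf) to [0, 1). */
float reinhard(float hdr)
{
    if (hdr < 0.0f)
        hdr = 0.0f; /* guard against negative input */
    return hdr / (1.0f + hdr);
}

/* Quantize a value in [0, 1] to 8 bits, rounding to nearest. */
unsigned char to_unorm8(float v)
{
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (unsigned char)(v * 255.0f + 0.5f);
}
```

The output of such a pass can be written straight into the default framebuffer, which is the point being made here: the last full-screen pass happens anyway, so no separate copy is needed.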

In short, I’m not seeing the problem here. Your problem seems to be that you don’t want to use the default framebuffer (as stated by your passive-aggressive introduction). That’s fine, but… it’s still there.

No matter how many threads on this forum you make, no matter how many alternative rendering systems you write, no matter how much you want it to be so, it’s still there. It was there in OpenGL 4.1. It was there in OpenGL 4.2. It was there in OpenGL 4.3. Next year, it will still be there in OpenGL 4.4/5.0/etc. Whether you want to use it or not, it is there and available for use. So if you can, use it. And in most real cases, you can. So use it, and you won’t have to worry about that copy being slow, since you won’t be doing a copy.

If you spend more time using the API you have, rather than the API you want, you’ll be a lot happier.

To me it is clear that the gl-present case has one additional buffer copy than the d3d-present cases

You’re not honestly showing this code off because you had the revelation that copying is slower than not copying, are you?

Brandon_J_Van_Every · November 19, 2012, 6:00am

It’s simply an aggressive introduction, which you don’t happen to agree with. Could you please skip the psychoanalysis in the future?

dukey · November 25, 2012, 11:15am

On Windows Vista and later, since the desktop is rendered through D3D, you do get an additional copy. But with Direct3D 9Ex there is a way to render more efficiently. Basically it passes a pointer to the surface for the DWM to render directly instead of requiring an extra copy. I guess this won’t be possible under OpenGL. But really the cost of 1 blit is nothing to worry about anyway.

l_belev · November 30, 2012, 3:17am

I myself thought that 1 blit more would be nothing, but it turned out it does have noticeable impact on lower-end hardware. It’s nothing spectacular and certainly not a show stopper, but still it could be bigger (depending on the GPU memory bandwidth) than many other things people make efforts to optimize.
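
A back-of-the-envelope sketch of why memory bandwidth matters here. It assumes the blit reads every source pixel once and writes every destination pixel once, and it ignores any fixed per-blit cost, so the result is only a lower bound. The function names and the bandwidth figure used below are hypothetical.

```c
#include <stdint.h>

/* Bytes moved by one full-screen blit, assuming each pixel is read once
   from the source and written once to the destination. */
uint64_t blit_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return 2ull * width * height * bytes_per_pixel;
}

/* Lower bound on the blit time in milliseconds for a given memory
   bandwidth in bytes per second. */
double blit_ms_lower_bound(uint64_t bytes, double bytes_per_second)
{
    return 1000.0 * (double)bytes / bytes_per_second;
}
```

For the 1024x768 RGBA8 buffer in the test program, blit_bytes(1024, 768, 4) is 6291456 bytes per frame. At a hypothetical 10 GB/s that is about 0.63 ms per blit, and the cost grows in proportion as the resolution goes up or the available bandwidth goes down.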
