WGL_ARB_buffer_region issue

I need to save my depth and color buffers at one point in my application and restore them later.
I have tried three techniques:

1. FBO. Rendering to an FBO seems to be about 25-40% slower than ordinary OpenGL rendering without one, so I abandoned this method.

2. Use glCopyTexSubImage2D to copy the color buffer to a rectangle texture and the depth buffer to a depth (GL_DEPTH_COMPONENT) rectangle texture. To restore, I render a screen-aligned quad with a fragment shader that outputs both depth and color. The depth output from this pass contains only three different values! So I abandoned this method as well.

3. Use WGL_ARB_buffer_region. This is also slow. Saving and then immediately restoring seems to affect the performance of ALL subsequent OpenGL render calls?!

I do the following before a loop:

bufferRegion = Util::wglCreateBufferRegionARB(hdc, 0, WGL_DEPTH_BUFFER_BIT_ARB | WGL_BACK_COLOR_BUFFER_BIT_ARB);

Inside the loop (once for each light source) I do:

Util::wglSaveBufferRegionARB(bufferRegion, 0, 0, Util::setup->resX, Util::setup->resY);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
Util::wglRestoreBufferRegionARB(bufferRegion, 0, 0, Util::setup->resX, Util::setup->resY, 0, 0);

// Render code goes here, but it runs slower when the WGL calls above are present.

Finally, after the loop, I do:

Util::wglDeleteBufferRegionARB(bufferRegion);

Furthermore, I have found that doing all of the above except the call to wglRestoreBufferRegionARB is fine: the render speed is the same as without any of the wgl calls. I am using an nVidia GeForce 6800 with ForceWare driver version 93.71.
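
In case anyone wants to reproduce the comparison, the per-pass cost can be timed along the following lines. This is only a minimal sketch in plain C++: averageMilliseconds and the pass callable are hypothetical names for illustration, not code from my application. In a real test the callable would contain the wgl calls and the render code, and would end with glFinish() so that queued GPU work is included in the measurement.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not from the application above): runs `pass`
// `iterations` times and returns the average wall-clock time per run
// in milliseconds. For OpenGL work, `pass` should end with glFinish()
// so the timing covers the GPU work and not just command submission.
double averageMilliseconds(const std::function<void()>& pass, int iterations)
{
    if (iterations <= 0)
        return 0.0;  // nothing to measure

    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i)
        pass();
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

Timing the same render code once with and once without the wglRestoreBufferRegionARB call should show whether the restore itself is expensive or whether it makes the render calls that follow it slower.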

I would be grateful for any help, even if it turns out that the slowdown cannot be avoided. In that case I would still like to know why it is slow.