How to increase the speed of glReadPixels? Again... (OpenGL ES 2.0)

I’m sorry to bother you guys with questions that have already been asked, but I still can’t solve this problem.

I have this simple function that takes ages to complete:

void procFrame(JNIEnv* env, jint texIn, jint texOut, jint w, jint h) {

    static cv::Mat m;
    m.create(h, w, CV_8UC4);

    // download the rendered frame from the GPU into CPU memory
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, m.data);

    // CPU-side OpenCV filter
    cv::Laplacian(m, m, CV_8U);
    m *= 10;

    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, texOut);

    // upload the processed frame back into the output texture
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, m.data);

}

Someone suggested that I use PBOs, but I can’t figure out how to make them work with my project.

This is what I came up with:

static cv::Mat m;
static bool initialized = false;
static int index = 0;
static int nextIndex = 0;
static int DATA_SIZE = 0;
static GLuint pbo[2];


void procFrame(JNIEnv* env, jint texIn, jint texOut, jint w, jint h) {
    if (!initialized) {
        m.create(h, w, CV_8UC4);
        DATA_SIZE = w * h * 4;
        glGenBuffers(2, pbo);
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[0]);
        glBufferData(GL_PIXEL_PACK_BUFFER, DATA_SIZE, 0, GL_STREAM_READ);

        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1]);
        glBufferData(GL_PIXEL_PACK_BUFFER, DATA_SIZE, 0, GL_STREAM_READ);

        glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
        initialized = true;
    }
    index = (index + 1) % 2;
    nextIndex = (index + 1) % 2;

    int64_t t = getTimeMs();

    // start an asynchronous readback into the current PBO
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[index]);
    glReadPixels(0, 0, w, h, GL_BGRA_EXT, GL_UNSIGNED_BYTE, 0);

    // map the other PBO (filled on the previous frame) and process its contents
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[nextIndex]);
    GLubyte* src = (GLubyte*)glMapBufferOES(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
    if (src)
    {
        cv::Mat frame(h, w, CV_8UC4, src);  // wrap the mapped pointer, no copy
        cv::Laplacian(frame, m, CV_8U);
        m *= 10;
        glUnmapBufferOES(GL_PIXEL_PACK_BUFFER);
    }

    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);


    glBindTexture(GL_TEXTURE_2D, texOut);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_BGRA_EXT, GL_UNSIGNED_BYTE, m.data);
}

Can anyone explain to me how to proceed, and why?
I would also love some OpenGL ES book suggestions.

Thanks for your patience, guys. :)

[QUOTE=Mikelle02;1292689]I still can’t solve this problem.
I have this simple function that takes ages to complete:


    ...
    [ Download WxH RGBA8 image from GPU/driver to app memory, with glReadPixels() ]
    ...
    [ Run OpenCV CPU-side filter on it ]
    ...
    [ Upload WxH RGBA8 images from app memory to GPU/driver, with glTexSubImage2D() ]
    ...
}

Someone suggested that I use PBOs, but I can’t figure out how to make them work with my project.

This is what I came up with:


    ...
    [ Same thing as above, but with:
      1) 2 ping-pong PBOs and 1-frame-late app PBO mapping for read,
      2) Downloading/uploading BGRA instead of RGBA texel data.]
    ...

Can anyone explain to me how to proceed and why?[/QUOTE]

Which GPU and driver are you targeting? How long is “ages” (in msec)? Also, what results did you get from your PBO implementation?

Reading back the rendered result to app memory has high latency, especially on mobile GPUs. What you have to realize is that when you submit commands to a mobile GPU, you may not see the pixels resulting from that rendering on-screen for 33-50 ms. This is a consequence of how these GPUs work: they keep a very deep pipeline so they can get by with slow CPU DRAM for their memory rather than the fast dedicated VRAM available on desktop GPUs, and it therefore takes a while for results to come out the other end. A synchronous glReadPixels() forces that pending work to complete sooner (to the detriment of your on-screen rendering performance), but there’s still a limit to how much that helps.
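If you want to put a number on that (per the “how long is ages” question), one rough way is to bracket the readback with a timer and a glFinish(). Here is a minimal sketch, assuming the same getTimeMs() millisecond helper your code already uses:

// Rough timing sketch (assumes the getTimeMs() helper from your own code).
// It separates "waiting for the GPU to catch up" from the cost of the copy itself.
int64_t t0 = getTimeMs();
glFinish();                    // drain everything queued so far
int64_t t1 = getTimeMs();
glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, m.data);
int64_t t2 = getTimeMs();
// (t1 - t0): roughly how deep the pipeline was when you asked for the pixels
// (t2 - t1): roughly the cost of the readback/copy itself

That split tells you whether you are mostly paying for pipeline latency or for the transfer, which changes what’s worth optimizing.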

Using a ring buffer of 2-3 PBOs, as you’ve tried to do, can help your app be more tolerant of this high latency. So if your processing absolutely has to be structured the way you’re doing it (download from GPU + CPU process + upload to GPU), then you’re probably going in the right direction with the ring-buffer-of-PBOs approach; see the sketch below, and check your GPU vendor’s OpenGL ES developer guides for details on how to pipeline readbacks efficiently with their driver.
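Here’s a minimal sketch of that pattern, assuming OpenGL ES 3.0 (where GL_PIXEL_PACK_BUFFER and glMapBufferRange() are core; on ES 2.0 you’d need vendor extensions for both). The NUM_PBOS constant and the processFrame() callback are placeholders, not anything from your code:

#include <GLES3/gl3.h>

#define NUM_PBOS 3   // ring depth; 2-3 is usually enough to hide the latency

extern void processFrame(const GLubyte* pixels, int w, int h);  // placeholder for your OpenCV work

static GLuint pbos[NUM_PBOS];
static int    frame = 0;

void readbackFrame(int w, int h)
{
    const GLsizeiptr size = (GLsizeiptr)w * h * 4;

    if (frame == 0) {
        glGenBuffers(NUM_PBOS, pbos);
        for (int i = 0; i < NUM_PBOS; ++i) {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
            glBufferData(GL_PIXEL_PACK_BUFFER, size, NULL, GL_STREAM_READ);
        }
    }

    // 1) Kick off an asynchronous readback into this frame's PBO. With a buffer
    //    bound to GL_PIXEL_PACK_BUFFER, the last argument is an offset into that
    //    buffer, so glReadPixels() can return without waiting for the pixels.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[frame % NUM_PBOS]);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, 0);

    // 2) Map and process the PBO that was filled NUM_PBOS-1 frames ago. By now
    //    the GPU should have finished writing it, so the map shouldn't stall.
    if (frame >= NUM_PBOS - 1) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[(frame + 1) % NUM_PBOS]);
        GLubyte* src = (GLubyte*)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, size,
                                                  GL_MAP_READ_BIT);
        if (src) {
            processFrame(src, w, h);
            glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        }
    }

    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
}

The key point is that the frame you process is always a frame or two stale, so the CPU never waits for the GPU to finish the work it just submitted. However…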

Possibly the better way to do this is not to download the data from the GPU in the first place, but to do your image processing on the GPU. This is typically the better way to go on desktop GPUs, where there are many options for doing on-GPU image processing (GL compute shaders, standard GL fragment shaders, CUDA, OpenCL, etc.) which you can task to perform your image processing operations. There are also “pre-built” on-GPU image processing filters ready to pick up and use (e.g. OpenCV’s CUDA module, CUDA NPP, etc.). Check the documentation for your GPU platform to see what libraries are available. Alternatively, if you know what you need to do, you can probably just write these yourself. Producing highly optimized processing kernels for the GPU takes some skill though, so definitely see if you can find an off-the-shelf filter that’ll do what you want first.
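On ES 2.0 specifically, your Laplacian-plus-scale step maps naturally onto a fragment shader run in a render-to-texture pass (draw a full-screen quad into an FBO whose color attachment is texOut). Here is a minimal GLSL ES sketch; the uniform and varying names (u_tex, u_texelSize, u_scale, v_texCoord) are just illustrative, not anything from your code:

// Minimal GLSL ES 2.0 fragment shader: 3x3 Laplacian followed by a scale,
// applied by rendering a full-screen quad into an FBO-attached texture.
precision mediump float;

uniform sampler2D u_tex;        // input image
uniform vec2      u_texelSize;  // (1.0/width, 1.0/height)
uniform float     u_scale;      // e.g. 10.0, like the "m *= 10" step

varying vec2 v_texCoord;

void main()
{
    vec4 c = texture2D(u_tex, v_texCoord);
    vec4 l = texture2D(u_tex, v_texCoord - vec2(u_texelSize.x, 0.0));
    vec4 r = texture2D(u_tex, v_texCoord + vec2(u_texelSize.x, 0.0));
    vec4 b = texture2D(u_tex, v_texCoord - vec2(0.0, u_texelSize.y));
    vec4 t = texture2D(u_tex, v_texCoord + vec2(0.0, u_texelSize.y));

    // 4-neighbour Laplacian: sum of neighbours minus 4x the centre sample
    vec4 lap = (l + r + t + b) - 4.0 * c;

    gl_FragColor = clamp(lap * u_scale, 0.0, 1.0);
}

With that, the whole download/process/upload round trip disappears, since the image never leaves GPU memory.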