The speed of uploading textures seems slow.

Hi,

I upload textures using the PBO technique and measured a transfer speed of ~3.5 GB/s on a GTX 570 and ~2.5 GB/s on a GTX 670. The timings were taken with ARB_timer_query.

These speeds seem slow, since I read in some posts that the right speed is ~5 GB/s. I coded following this doc (OpenGL Pixel Buffer Object (PBO)). Have I missed something?
It is said that pinned memory speeds up data transfer, but how can pinned memory be used in OpenGL?

Thanks in advance.

You are unlikely to get peak data transfer speed with any method due to CPU and synchronization overhead. I think 3.5 GB/s is a pretty reasonable result (and probably way better than most people achieve). Why do you think that it should be 5 GB/s?

Also, pinned memory is available in OpenGL through the GL_AMD_pinned_memory extension, but it works only on Radeons, as NVIDIA doesn’t support it yet.
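For reference, the extension works by letting you back a buffer object with client memory you allocated yourself. A rough, untested sketch of the usage pattern (the enum name and value are from the extension spec; the aligned-allocation call is an assumption and is platform-specific):

```
// Sketch (untested): pinned system memory via GL_AMD_pinned_memory.
// The pointer passed to glBufferData must stay valid and page-aligned
// for the lifetime of the buffer object.
#define GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD 0x9160

void *mem = aligned_alloc(4096, buffersize);  // page-aligned; use _aligned_malloc on Windows
GLuint pinnedBuf;
glGenBuffers(1, &pinnedBuf);
glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, pinnedBuf);
glBufferData(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, buffersize, mem, GL_STREAM_DRAW);
glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, 0);

// The buffer can then be bound as GL_PIXEL_UNPACK_BUFFER for uploads;
// the GPU reads straight from 'mem', skipping the driver-side staging copy.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pinnedBuf);
```

But again, this path is only available on AMD hardware.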

Thanks for the reply.

The timings above exclude the CPU time; they measure just the GPU time consumed by the upload. So I think it should be faster.

In this thread, Nvidia Dual Copy Engines - OpenGL - Khronos Forums, a user posted the results of some tests: “The OpenGL data upload (texture and buffers) works in full speed on GeForce family which means ~5GB/s on PCe 2.0 and 2.5GB/s on PCIe 1.1.”. I am impressed by this number (5 GB/s) because I once read the same number on another site, but I cannot find the original post.

Besides, why is the GTX 670 slower than the GTX 570? Strange.

I know that the timing excludes CPU time there, but that doesn’t mean that CPU overhead or synchronization time is not affecting the performance, especially if you perform more than a single upload. Don’t forget that OpenGL has implicit synchronization of resources, so you might run into some unintended race condition. Or the simplest case: the CPU is unable to send the commands to the GPU just in time, leaving small gaps between the uploads. Your application doesn’t even have to be CPU-bound for such things to happen.

I’m just saying, you have to make the perfect synthetic test to achieve peak performance.

Thank you, aqnuep. Looks like I have a lot to learn.

Could you point me to some resources/links about how to make this perfect synthetic test, or in other words, how to get peak performance? Upload speed is the bottleneck of my application. A few keywords for this problem would be good, too; I don’t even know what I should google.

To be honest, I’m not the right person to ask. Maybe you should try to contact the users who had the discussion on the dual-copy engine topic (hopefully they’ll come by and visit this topic too).

Thanks, aqnuep, I appreciate your help. I’ll continue to try.

In order to tell you whether you missed something, you have to tell us what you know about this “PBO technique”. :slight_smile:

First, why do you think using a PBO is faster than a “normal” glTexSubImage2D()? I can say it isn’t, and that is true in many use-cases.
We need pseudo-code of your texture upload/download to tell you whether you can get any benefit from PBO usage, or an answer to the previous question.

Second, the maximal throughput of a 16-lane PCI-E v2.x bus is 8 GB/s. You cannot reach even half of that throughput if you don’t overlap multiple texture transfers. The reason is obvious: data have to be transferred to the PBO (a part of the OpenGL-controlled main memory) first, and then asynchronously transferred to GPU memory.

Third, the time you have measured is the execution time of some code-block on the GPU, but that is not the actual transfer time. What exactly have you measured, and how?

Fourth, since data have to be uploaded to main memory first and then downloaded to the GPU, the speed of the main memory and the FSB is very important. You have mentioned a GTX 570 and a GTX 670. Are those cards in identical machines? If not, the results might differ a lot. Maybe there is also some problem in the driver for the GTX 670.

Fifth, “dual copy engine” does not work on GeForce cards.

Thank you for the detailed reply, Aleksandar. I should have posted a more complete question.

For the first question, I have tested the “normal” glTexImage2D() (not glTexSubImage2D(); do the two functions differ much?). Without a PBO, glTexImage2D() is performed on the CPU, and I got a speed of ~600 MB/s.

For the second item, could you tell me how to “overlap multiple texture transfers” for a better performance?

For the third, here is the code snippet:

void func1()
{
    //--------------codes for timing---------------------------------------
    GLuint query;
    glGenQueries(1, &query);
    glBeginQuery(GL_TIME_ELAPSED,query);
    //---------------------------------------------------------------------


    //uploading ---- the code-block tested
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo1);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    glBindTexture( GL_TEXTURE_2D, tex1 );
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo1);
    glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0,tex1_width,tex1_height,GL_RED_INTEGER,GL_UNSIGNED_INT,0); //where the data transfer occurs

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    //-----------codes for timing----------------------------------------------
    //yes, measuring like this stalls the whole pipeline. But I am just testing the time consumed for
    //uploading from host to video memory, rather than the performance of the whole app.
    //Does measuring like this break the GPU execution sequence, and thus hurt the upload speed?

    glEndQuery(GL_TIME_ELAPSED);
    GLuint done = 0;
    while (done == 0)
    {
        glGetQueryObjectuiv(query, GL_QUERY_RESULT_AVAILABLE, &done);
    }
    GLuint elapsed_time;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &elapsed_time);
    glDeleteQueries(1, &query);
    float time_ms = elapsed_time/1000000.0f;
    LogTime( time_ms ); //write time_ms to a log file.
    //-----------------------------------------------------------------------

    Render();

    //remap the buffer
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo1);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, buffersize, 0, GL_STREAM_DRAW);
    pBufferData = (byte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER,GL_WRITE_ONLY);
    //pBufferData is a global variable; it will be refilled in another thread that focuses on I/O.


    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

Fourth, yes, the GTX 570 and the GTX 670 are on the same machine. And I tend to believe it is a driver issue.

Yes, they differ a lot in performance. glTexSubImage2D() just updates a portion of an existing texture, while glTexImage2D() should be used for texture initialization only.
Everything depends on the drivers and the way they optimize work. I remember the excitement of a colleague of mine when he replaced glTexImage2D() with glTexSubImage2D(). The frame rate increased significantly on NVIDIA, but there was almost no boost on AMD. So, everything is down to the drivers. Anyway, don’t use glTexImage2D() for texture updates.
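The allocate-once, update-many pattern looks roughly like this (a sketch; the GL_R32UI internal format is an assumption matching the GL_RED_INTEGER/GL_UNSIGNED_INT transfer format used in the snippets above):

```
// Sketch: allocate storage once at startup...
glBindTexture(GL_TEXTURE_2D, tex1);
glTexImage2D(GL_TEXTURE_2D, 0, GL_R32UI, tex1_width, tex1_height, 0,
             GL_RED_INTEGER, GL_UNSIGNED_INT, NULL);   // allocation only, no data

// ...then, every frame, update in place without reallocating:
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, tex1_width, tex1_height,
                GL_RED_INTEGER, GL_UNSIGNED_INT, pixels);
```

Calling glTexImage2D() per frame forces the driver to reallocate (and possibly orphan) the texture storage each time, which is exactly the overhead glTexSubImage2D() avoids.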

I have to disappoint you. Transferring data is done in two phases: copying data to “driver’s memory” (system main memory), and downloading data from “driver’s memory” to “graphics memory”.
With the standard approach (without a PBO), both phases are done synchronously. Your CPU (or, to be precise, a core) is busy until everything is finished. A PBO enables the second phase to be done asynchronously.
But during the second phase the GPU is busy. You cannot do anything else on the GPU while the texture is being downloaded. That is what the NV dual copy engine solves, but it doesn’t work on your cards.
In short, there is no magic in using a PBO. It just releases the CPU to do some other work while the texture is downloaded from “driver’s” to “graphics” memory. If you can do something useful on the CPU in the meantime, a PBO saves some CPU time. The first phase stays on the CPU anyway. If you issue a texture download, then wait in order to draw something, then wait again for another download to complete, etc., you’ll probably get no benefit from the PBO. That’s why texture download is usually done with multiple PBOs: while you are filling the second, the first is (probably) being downloaded to the GPU. That’s what I meant by overlapping.
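The multiple-PBO overlapping described above can be sketched like this (untested pseudocode; FillNextImage is a hypothetical producer function standing in for whatever fills the buffer):

```
// Sketch (untested): double-buffered PBO streaming.
// While the GPU copies out of pbo[upload], the CPU fills pbo[fill].
GLuint pbo[2];
int fill = 0, upload = 1;

for each frame:
    // 1. Kick off the GPU-side transfer from the buffer filled last frame.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[upload]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RED_INTEGER, GL_UNSIGNED_INT, 0);

    // 2. Meanwhile, orphan and refill the other buffer on the CPU.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[fill]);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, buffersize, 0, GL_STREAM_DRAW);
    void *p = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    FillNextImage(p);                      // hypothetical producer
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    Render();                              // draws with the texture
    swap(fill, upload);                    // ping-pong for the next frame
```

The point is that step 2 runs on the CPU while step 1’s copy proceeds on the GPU, so the two phases overlap instead of serializing.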

Considering timer_query: TIME_ELAPSED is not the preferable query, since it doesn’t allow overlapping. glGetQueryObjectuiv() is a blocking function that significantly reduces the performance of your code. I have no time now to elaborate on the topic; please find some tutorial on the net.

P.S. Remove the while loop, since it does nothing, and call glEndQuery() after some trivial drawing that uses the uploaded texture. The time measured with timer_query does not represent the exact upload time anyway, since you are measuring something that is not GPU execution alone. Adding drawing code is necessary to force the driver to actually upload the texture.

Hi Aleksandar, I have removed the loop, added a trivial rendering pass, and moved the timing code so that it brackets exactly the glTexSubImage2D() call, like this:

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo1);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    glBindTexture( GL_TEXTURE_2D, tex1 );
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo1);

    //--------------codes for timing---------------------------------------start
    GLuint query;
    glGenQueries(1, &query);
    glBeginQuery(GL_TIME_ELAPSED,query);
    //---------------------------------------------------------------------end
    glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0,tex1_width,tex1_height,GL_RED_INTEGER,GL_UNSIGNED_INT,0); //where the data transfer occurs
    DrawSomePoints();
    //-----------codes for timing--------------------------------------------start
    glEndQuery(GL_TIME_ELAPSED);
    GLuint elapsed_time;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &elapsed_time);
    glDeleteQueries(1, &query);
    float time_ms = elapsed_time/1000000.0f;
    LogTime( time_ms ); //write time_ms to a log file.
    //-----------------------------------------------------------------------end

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    Render();

    //remap the buffer
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo1);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, buffersize, 0, GL_STREAM_DRAW);
    pBufferData = (byte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER,GL_WRITE_ONLY);
    //pBufferData is a global variable; it will be refilled in another thread that focuses on I/O.

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

But the result turns out to be the same as the original; it differs by only 3-4 ms when transferring 200 MB of textures. Maybe I should find a new computer… I will continue to try. Thanks for your kind help.

What I tried to achieve is to make you realize what’s really happening, not to boost the speed.
I’ve overlooked that you are filling the buffer in a separate thread. Do you have adequate synchronization?

In order to get better performance you need two PBOs. A separate thread should fill one while the other is used for transferring to graphics memory and for drawing. But you should fill the data between glTexSubImage2D() and drawing to have an effect. In the best case, the buffer refill should be done in the same procedure (not in a separate thread) and should last exactly the same time as the data transfer. In your case Render() waits for the texture to be downloaded before it executes, so you get no benefit from using the PBO (just disable the PBO and use glTexSubImage2D() with a real pointer to check this; I think you’ll get only marginally worse results).

Also, transferring a 200 [MB] texture in a single frame is not feasible: 200 [MB] * 60 [1/s] = 12 [GB/s].
Your problem is not the transfer speed, but the wrong way of updating the texture. The texture should be broken into smaller pieces, updating one at a time. 20-30 [MB/frame] is something you should stick to.