Part of the Khronos Group
OpenGL.org


Thread: GLSL function execution time estimation


  1. #1
    Senior Member OpenGL Pro Aleksandar's Avatar
    Join Date
    Jul 2009
    Posts
    1,144

    GLSL function execution time estimation

    At first glance the following question looks silly, but after trying to answer it I realized it is quite difficult (perhaps even impossible).

    How can we estimate execution time of some function inside a shader?

    I have made a sample vertex shader like this:
    Code :
    #version 330
    out float out_val;
    void main(void)
    {
        // someFun() is the function under test
        out_val = someFun(gl_VertexID * 1e-6);
    }
    allocated an 80 MB buffer for transform feedback, and bracketed glDrawArrays() with glQueryCounter():
    Code :
    glQueryCounter(m_nStartTimeID, GL_TIMESTAMP);
    glDrawArrays(GL_POINTS, first, count);
    glQueryCounter(m_nEndTimeID, GL_TIMESTAMP);
    and called it with count = 1e7.

    Can you guess what happens? The elapsed time does not depend on the complexity of the function. There is a fixed setup portion (about 14.7 µs on my laptop) and a portion that depends directly on the number of vertices (about 22.5 ms for 1e7 vertices).
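    For reference, reading the two timestamps back and converting to milliseconds looks roughly like this (a sketch; the glGetQueryObjectui64v calls are shown only in a comment because they need a live GL context, and the query IDs follow the naming in the post):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Pure helper: milliseconds between two GL_TIMESTAMP values (nanoseconds). */
    static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
    {
        return (double)(end_ns - start_ns) * 1e-6;
    }

    int main(void)
    {
        /* In the real program the values come from the driver:
         *   glGetQueryObjectui64v(m_nStartTimeID, GL_QUERY_RESULT, &start_ns);
         *   glGetQueryObjectui64v(m_nEndTimeID,   GL_QUERY_RESULT, &end_ns);
         * Here we simply plug in the ~22.5 ms figure reported above. */
        uint64_t start_ns = 0;
        uint64_t end_ns   = 22500000;
        printf("%.1f ms\n", elapsed_ms(start_ns, end_ns));
        return 0;
    }
    ```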

    Does anybody have any suggestion on measuring GLSL function execution time?

    In fact, I need to compare the efficiency of several implementations, so absolute values are not important. On the other hand, I don't want to measure the execution time of the whole application with each implementation applied, since that is quite specific and subject to optimizations tied to a particular implementation.

    Thank you in advance!

  2. #2
    Senior Member OpenGL Guru
    Join Date
    May 2009
    Posts
    4,948
    Measuring the performance of an independent function in a vacuum is pointless. GLSL is not C, where you could expect the performance of a particular function to be invariant with other changes. In shader compilation, functions will be inlined, instructions will be statically reordered to hide latencies for various operations, and so forth.

    You can never assume that a function X which is faster than function Y in your vacuum test will always be faster in your application. Once you put it in your real shader(s), it may be faster or it may be slower.

    For example, let's say you have some function that does purely math stuff. So it has some particular performance X. And let's say you have another function that does a texture fetch, then a small number of math computations. It has some particular performance Y.

    It is entirely possible that, when you call one after the other in your real shader, the overall performance is not X+Y. It could just be as small as max(X, Y).

    So the exercise you propose is simply not useful. If you want to optimize a shader, you're going to have to do so in the actual context of the overall code you're trying to make faster. The only thing you can test is how long it takes to execute it.

    Also, if you're measuring shader performance, why are you using transform feedback?

  3. #3
    Senior Member OpenGL Pro Aleksandar's Avatar
    Join Date
    Jul 2009
    Posts
    1,144
    Thank you, guys!

    My question was a consequence of desperate late-night thinking.
    In fact, the case is quite clear.

    If there are no dependencies or pipeline stalls, the only parameter I could measure is the single-step interval.
    Let's assume we have M processing units and want to execute N function calls.
    The whole processing time is equal to:

    setup_time + (ceil(N/M) - 1) * single_step_time + full_pipeline_execution_time

    setup_time - constant time that does not depend on the problem size
    single_step_time - a single clock interval (the rate at which successive batches of M calls are issued)
    full_pipeline_execution_time - the time for one call to travel through the whole pipeline, i.e. the function execution time

    Considering the above, I could calculate the function execution time, but it is a very short interval (far below 1 µs) that cannot be measured precisely.
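    Plugging numbers into the model makes this concrete. A sketch (the unit count M is an arbitrary assumption here, since the real value depends on the GPU), solving the model for single_step_time while neglecting the tiny full_pipeline_execution_time term:

    ```c
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double N = 1e7;               /* function calls                       */
        double M = 96.0;              /* processing units (assumed, not real) */
        double setup_time = 14.7e-6;  /* s, constant overhead from the post   */
        double total_time = 22.5e-3;  /* s, measured for N calls              */

        /* total = setup + (ceil(N/M) - 1) * step + pipeline; solve for step,
         * dropping the pipeline term, which is negligible at this scale. */
        double steps = ceil(N / M) - 1.0;
        double single_step_time = (total_time - setup_time) / steps;
        printf("single step ~ %.2f ns\n", single_step_time * 1e9);
        return 0;
    }
    ```

    Whatever value M takes, the inferred per-step interval stays in the hundreds of nanoseconds at most, which is why the function's own contribution is unmeasurable.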

    Quote Originally Posted by Alfonse Reinheart View Post
    Also, if you're measuring shader performance, why are you using transform feedback?
    Just to be sure that all 1e7 executions are done correctly.

    Quote Originally Posted by tonyo_au
    Can you use Nsight from nVidia?
    Yes, but the result will be the same.

    Thanks again!
    Conclusion: the only things we can measure are pipeline stalls and dependencies, not the execution time itself!

  4. #4
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,117
    If the difference is too small to measure, you could loop inside the shader (but I would make sure something changes each iteration, or the compiler might optimise the loop away).
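    As an illustration, that suggestion applied to the original test shader might look like this (a sketch; the iteration count is arbitrary, and the accumulator carries a dependency through every iteration so the compiler cannot hoist or discard the loop):

    ```glsl
    #version 330
    out float out_val;
    void main(void)
    {
        float acc = gl_VertexID * 1e-6;
        for (int i = 0; i < 1000; ++i)
            acc = someFun(acc);   // someFun as in the original test
        out_val = acc;            // written out, so the work stays live
    }
    ```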

  5. #5
    Senior Member OpenGL Pro Ilian Dinev's Avatar
    Join Date
    Jan 2008
    Location
    Watford, UK
    Posts
    1,290
    If you don't use nSight, I'd suggest using a fragment shader on a fullscreen triangle (to avoid high primitive-setup costs, the <= 4 primitives per cycle setup limit, and transform-feedback setup/memory writes), with manually-unrolled looping and care not to let the compiler optimize the work away.
    Things that can skew results are texture fetches (the longest stall), access to limited ALU units (trigonometry), and register bank clashes (fmad r0, r4, r8, r12). In perfect circumstances (if the other warps happen not to use those resources), these will appear to have a +1 cycle execution time.
    Still, the vast majority of instructions will be fmad-like, executing in a single cycle (effectively, even if they have a latency of 10-20 cycles), so you can infer how many simple instructions a GLSL function consists of.

    You can often accurately measure the minimum effective execution time of high-latency, limited-resource instructions by padding them with simple ALU ops in a fixed ratio, e.g. loop(10){ 1 fsin, 8 fmad } if the trigonometry units are 8x fewer than the ALUs. The same goes for texture fetches, except that you have to make the texture small enough to fit in cache, with nice access patterns.
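    A GLSL sketch of that padding idea (the 1:8 ratio, the constants and the input names are illustrative; a real test would match the ratio to the target GPU's ALU-to-SFU counts):

    ```glsl
    float a = in_val0, b = in_val1;   // live inputs, so nothing folds to a constant
    for (int i = 0; i < 10; ++i) {
        a = sin(a);                   // 1 transcendental (SFU) op...
        b = b * 1.0001 + a;           // ...padded with 8 fmad-like ALU ops
        b = b * 1.0001 + 0.5;
        b = b * 1.0001 + 0.5;
        b = b * 1.0001 + 0.5;
        b = b * 1.0001 + 0.5;
        b = b * 1.0001 + 0.5;
        b = b * 1.0001 + 0.5;
        b = b * 1.0001 + 0.5;
    }
    out_val = a + b;                  // keep both dependency chains live
    ```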
    Last edited by Ilian Dinev; 04-30-2013 at 01:26 AM.

  6. #6
    Senior Member OpenGL Pro Aleksandar's Avatar
    Join Date
    Jul 2009
    Posts
    1,144
    Quote Originally Posted by tonyo_au View Post
    If the difference is too small to measure you could loop inside the shader ( but I would make sure something changed or the compler might optimise the loop away).
    Interesting idea, but I'm going on a trip so I will try it next week.

    Quote Originally Posted by Ilian Dinev View Post
    If you don't use nSight, I'd suggest you use a fragment shader on a fullscreen triangle. (to avoid high primitive-setup costs, and <= 4 primitives per cycle setup, and transform-feedback setup/memwrite). With manually-unrolled looping and care to not let the compiler optimize stuff out.
    In the test shader I don't do any drawing; I even explicitly call glEnable(GL_RASTERIZER_DISCARD). The whole transformation is done in the vertex shader.

    I'm not sure I have understood the rest of your suggestion. First, I want to use GLSL, not assembly language, so I have no control over the instructions that are executed. Furthermore, GLSL compilers are very aggressive in their optimizations.

    What you call the ALU is actually the SFU (Special Function Unit), used for transcendental functions. Addition, multiplication and logical operations are done on the SPU/DPU (some GPUs, like Fermi, use the same logic for both single and double precision, while others, like Kepler, have separate DP units). The SFU count is pretty high on modern GPUs, so I don't think they can cause any trouble.

  7. #7
    Senior Member OpenGL Pro
    Join Date
    Jan 2012
    Location
    Australia
    Posts
    1,117
    Can you use Nsight from nVidia? It gives you the CPU and GPU time for every OpenGL call.
    It looks like this (but better formatted):

    event | Description                                                     | CPU Duration (ns) | GPU Duration (ns)
    1523  | glDrawArrays(GL_TRIANGLES, GLint first = 0, GLsizei count = 54) | 17685             | 4192
    1524  | glBindVertexArray(GLuint array = 0)                             | 1153              | 0
    Also, Alfonse is right; I cannot believe how much overlapping GPU instructions can do. I grew up when CPUs executed only one instruction at a time; that is not true any more.
    Last edited by tonyo_au; 04-28-2013 at 06:08 PM.
