
Thread: Preparing data for instanced rendering

  1. #1
    Join Date: Sep 2014

    Preparing data for instanced rendering

    I'm making a game for PC that involves a large terrain with forests, so there are many trees, rocks etc. to be rendered. Instanced rendering fits well for low-poly meshes that
    appear in large quantities, but maybe not so well for high-poly objects.

    I'm planning on making a renderer that supports both instanced and traditional rendering, but I'm not sure how to store/prepare the object data.

    I've implemented instancing before by using glUniform* to load an array of transformation matrices. The problem I'm trying to figure out is whether I should
    (1) first pack all object transformations into a single contiguous array and then call glUniform* once per instance group, or
    (2) call glUniform* for each object transformation matrix separately before the instanced draw call?

    If I always had a contiguous array of transformations, the obvious choice would be (1), but in practice the transformations are not contiguous, because not all objects are visible
    at the same time. So building a contiguous array for use in (1) would cost some CPU time each frame (and also some additional memory). Copying 1000 matrices
    into a single contiguous array every frame seems like a lot of work for the CPU, but is it still better than option (2)?
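    To make option (1) concrete, here is a minimal sketch of the pack-then-upload step. The `mat4`, `Object` and `visible` names are placeholders, and the actual GL upload is left as a comment since it needs a live context and shader:

    ```cpp
    #include <cstring>
    #include <iostream>
    #include <vector>

    struct mat4 { float data[16]; };

    struct Object { mat4 trans; /* other per-object data */ };

    int main()
    {
        std::vector<Object> objects(1000);
        // Indices of visible objects, e.g. from frustum culling (placeholder data).
        std::vector<int> visible = {3, 42, 17, 256};

        // Option (1): pack the visible transforms into one contiguous array...
        std::vector<mat4> packed(visible.size());
        for (size_t i = 0; i < visible.size(); ++i)
            std::memcpy(packed[i].data, objects[visible[i]].trans.data, sizeof(mat4));

        // ...so a single upload covers the whole instance group:
        // glUniformMatrix4fv(transformsLoc, packed.size(), GL_FALSE, packed[0].data);

        std::cout << packed.size() << std::endl;  // 4
    }
    ```

    Option (2) would instead be `visible.size()` separate glUniform* calls, one per array element, with no CPU-side copy.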

    I could also arrange my objects into batches according to, e.g., quadtree leaves and then maintain a contiguous array of transformations within each batch.
    The problem is that the granularity of those batches may not be optimal for all objects. This is also a big design decision.
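    A sketch of that batch idea, with made-up names (`Batch`, the leaf size of 250) purely for illustration: each quadtree leaf keeps its objects' transforms in its own contiguous array, so a visible leaf needs no per-frame repacking and could be uploaded with one call:

    ```cpp
    #include <iostream>
    #include <vector>

    struct mat4 { float data[16]; };

    // Hypothetical batch: one per quadtree leaf, transforms kept contiguous.
    struct Batch {
        std::vector<mat4> transforms;  // already contiguous; upload directly when visible
    };

    int main()
    {
        // Build a few batches (objects per leaf is the granularity knob).
        std::vector<Batch> batches(4);
        for (auto &b : batches)
            b.transforms.resize(250);  // 4 leaves x 250 objects = 1000 total

        // Per frame: draw only the visible batches; no per-frame copy,
        // since each batch's array is contiguous already.
        std::vector<int> visibleBatches = {0, 2};
        size_t instancesDrawn = 0;
        for (int b : visibleBatches) {
            // glUniformMatrix4fv(loc, batches[b].transforms.size(), GL_FALSE,
            //                    batches[b].transforms[0].data);  // one call per batch
            instancesDrawn += batches[b].transforms.size();
        }
        std::cout << instancesDrawn << std::endl;  // 500
    }
    ```

    The trade-off is exactly the granularity question: coarse leaves waste uploads on partially visible batches, fine leaves multiply the number of draw calls.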

  2. #2
    Join Date: Sep 2014
    I made a small test program to estimate the time of copying 4-by-4 matrices to a buffer:

    #include <time.h>
    #include <iostream>
    #include <cstdlib>
    #include <cstring>
    #include <vector>

    using namespace std;

    long int getTimeMilliSec()
    {
        timespec spec;
        clock_gettime(CLOCK_MONOTONIC, &spec);
        return ((long int)spec.tv_sec)*1000 + ((long int)spec.tv_nsec)/1000000;
    }

    // 4x4 matrix
    struct mat4
    {
        float data[16];
    };

    // Renderable object with dummy data and transformation matrix.
    struct CObject
    {
        int data[10];
        mat4 trans;
        float data2[20];
    };

    int main(int argc, char **argv)
    {
        // Number of objects in the world.
        const int nobjects = atoi(argv[1]);

        // Number of visible objects in a frame.
        const int ninstances = atoi(argv[2]);

        // A buffer to hold transformations.
        std::vector<mat4> buffer(ninstances);

        // Create objects.
        CObject *objects = new CObject[nobjects];

        // Indices to objects.
        std::vector<int> indices;

        // Make random indices. In reality these would be given by
        // a visibility culling algorithm.
        srand(time(NULL));
        for(int i=0; i<ninstances; i++)
            indices.push_back(rand() % nobjects);

        long int time = getTimeMilliSec();

        // Copy transformations into a buffer.
        for(int i=0; i<ninstances; i++)
            memcpy(buffer[i].data, objects[indices[i]].trans.data, sizeof(mat4));

        cout << getTimeMilliSec()-time << endl;

        delete [] objects;
    }

    Compile with (-lrt goes after the source file, since some linkers are order-sensitive; it provides clock_gettime on older glibc):

    g++ test.cpp -lrt

    For 50k objects and 4k instances

    ./a.out 50000 4000

    result is 2 ms. For 500k objects and 10k instances

    ./a.out 500000 10000

    result is 6 ms. The timing is probably very inaccurate as there is some
    fluctuation between different runs (also the randomness may affect cache hits).

    I replaced the last bit of code with

    // Copy transformations into a buffer, repeated many times.
    const int nrepeats = 10000;

    long int time = getTimeMilliSec();

    for(int j=0; j<nrepeats; j++)
    {
        for(int i=0; i<ninstances; i++)
            memcpy(buffer[i].data, objects[indices[i]].trans.data, sizeof(mat4));
    }

    cout << float(getTimeMilliSec()-time)/nrepeats << endl;

    This averages the timing over 10k repeats. Now 500k objects and 10k instances
    take 0.2 ms on average. I know that timing code execution is not trivial and both of my
    timing approaches might be invalid. 6 ms would be unacceptable, whereas 0.2 ms would
    be totally fine.
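    For what it's worth, a more portable way to do the same measurement is std::chrono::steady_clock, which needs no -lrt and gives sub-millisecond resolution. This is just a sketch of the measurement pattern with placeholder sizes, not a rigorous benchmark:

    ```cpp
    #include <chrono>
    #include <cstring>
    #include <iostream>
    #include <vector>

    struct mat4 { float data[16]; };

    int main()
    {
        const int ninstances = 10000, nrepeats = 100;  // placeholder sizes
        std::vector<mat4> src(ninstances), buffer(ninstances);

        auto t0 = std::chrono::steady_clock::now();
        for (int j = 0; j < nrepeats; ++j)
            for (int i = 0; i < ninstances; ++i)
                std::memcpy(buffer[i].data, src[i].data, sizeof(mat4));
        auto t1 = std::chrono::steady_clock::now();

        // Average milliseconds per copy pass over all repeats.
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / nrepeats;
        std::cout << "avg ms per copy pass: " << ms << std::endl;
    }
    ```

    Note this copies sequentially, so it will look faster than the random-index version above; the scattered reads through `indices` are a big part of the real cost.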
