VAO/glUniformMatrix4fv slow with PyOpenGL

I’m writing with Python and PyOpenGL, but I’m running into a serious bottleneck while rendering a large number of objects.

The code seems to become very slow when running my “LoadMatricesToShader” function, which uses glUniformMatrix4fv to send my MVP matrix to the loaded shader. To speed the program up, I elected not to run “pid = glGetIntegerv(GL_CURRENT_PROGRAM)” and “glGetUniformLocation(pid, “MVP”)” each frame, as the glGet* functions seem to have a serious impact on performance; I now store these when the shader is compiled.

My program will now draw up to around 2500 cubes without breaking a sweat, but beyond 6000-8000 it drops to around 6-7 fps. I’m not compiling the shader every frame, I’m not creating VBOs/VAOs every frame, and I’m not loading the model every frame. As far as I can tell, the slow line of code is glUniformMatrix4fv.

I’ve tried loading one object with 1m tris and it runs perfectly, so the issue is centred around the way I’m instancing the objects (big surprise).

The other possibility is the only other line in my “LoadMatricesToShader” function:

    # Construct my ModelViewProjection matrix
    self.MVP = self.P * self.V * self.m_mainCamera.m_transform.getMatrix() * self.M

and the line

    self.M = m_trans.getMatrix()

Both are called within my paintGL, each frame. Transform is a class I’ve written which holds a transformation matrix with a few handy functions for translate/scale/rotate etc., and also has a getMatrix() function for when I want to use it. I am using the same transform for all the objects and just translating them between draw calls; might this be slowing me down?

This is my paintGL function as it stands:


    def paintGL(self):                            
        gl.glClear(
            gl.GL_COLOR_BUFFER_BIT | gl.GL_DEPTH_BUFFER_BIT)
        
        #Temporary single transform, used to position each instance
        m_trans = Transform()

        gl.glBindVertexArray (self.m_ObjectList[len(self.m_ObjectList) - 1].m_vao)
        gl.glEnableVertexAttribArray(0)  
       
        for i in range(0, 10000):
            m_trans.setPos(numpy.array([i - 5000, 0, 0]))
            self.M = m_trans.getMatrix()
            self.loadMatricesToShader()   
            gl.glDrawArrays (gl.GL_TRIANGLES, 0, self.m_ObjectList[len(self.m_ObjectList) - 1].m_vertexCount)
        gl.glBindVertexArray(0)
        
        self.update()

I create my buffers and VAO as the object is loaded. Apologies for the ghastly coding practices I’m certain I’m adhering to!

    # Function to load the object from the supplied filepath, open file - read lines - string format and extract shared vertex/face construction information + populate m_vertices
    def loadObject(self):
        # Read file
        with open(self.m_filePath) as objFile:
            self.lines = [line.rstrip('\n') for line in objFile]
        
        # Local Lists
        sharedVerts = []
        faceConstructs = []    
                
        # Check for 'v' shared vertices, or 'f' faces (3 integers which reference entries in the shared vertex list)
        for lineNumber in range(0, len(self.lines)):
            # Skip empty lines so the indexing below is safe
            if len(self.lines[lineNumber]) > 0 :
                if self.lines[lineNumber][0] == 'v' and self.lines[lineNumber][1] == ' ':
                    # Discard 'v ', keep the following components: [[x1, y1, z1], [x2, y2, z2], ...], each stored as a list of strings
                    sharedVerts.append(self.lines[lineNumber][2:].split(' '))
                elif self.lines[lineNumber][0] == 'f' and self.lines[lineNumber][1] == ' ':
                    # For each vertex listed in shared vertex list, which makes up this face
                    for vertCount in range(0, 3):
                        # Add each index to the face construct in order [0, 1, 4, 6, 4 ,3] would be 2 triangles, tri1 verts 0,1,4  tri2 verts 6, 4, 3
                        faceConstructs.append(int(self.lines[lineNumber][2:].split(' ')[vertCount].split('/')[0])-1)
                    
                    
        
        # Now for every entry in this vert reference list (how to "construct faces"), get the appropriate vertex components and make a component-wise vbo
        # for the 2 tri example above, this would produce: [0x, 0y, 0z, 1x, 1y, 1z, 4x, 4y, 4z, 6x, 6y, 6z, 4x, 4y, 4z, 3x, 3y, 3z] where the integers are the "index" into the shared vertex array (vert1, vert6 or vert3 etc...)
        for vertIndex in faceConstructs:
            # x, y, z
            for vertComponent in range(0, 3):
                self.m_vertices.append(float(sharedVerts[vertIndex][vertComponent]))
        
        self.m_vertexCount = len(faceConstructs)
        
        self.CreateVAO()



    def CreateVAO (self):
        vaoID = gl.glGenVertexArrays(1)
        gl.glBindVertexArray(vaoID)
                
        self.GenBuffers()
        
        self.unbindVAO()
        
        gl.glBindBuffer (gl.GL_ARRAY_BUFFER, 0)
        
        self.m_vao = vaoID  
        
        
    # Function to generate a Vertex Buffer Object from the component-wise vertex information stored in my structure
    def GenBuffers(self):
        self.m_vbo = gl.glGenBuffers(1)
        
        # Bind the buffer to populate
        gl.glBindBuffer (gl.GL_ARRAY_BUFFER, self.m_vbo)
        
        # Describe the position data layout in the buffer
        gl.glVertexAttribPointer(0, 3, gl.GL_FLOAT, False, 0, c_void_p(0))
    
        # Populate the buffer with data
        gl.glBufferData (gl.GL_ARRAY_BUFFER, len(self.m_vertices)*4, (c_float*len(self.m_vertices))(*self.m_vertices), gl.GL_STATIC_DRAW)
        
        print("generating vbo " + self.m_objectName)
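As an aside on the buffer upload above, here is a minimal sketch of packing the vertex list with NumPy instead of building a ctypes array by hand. The helper name pack_vertices and the sample triangle are hypothetical, and the glBufferData call is shown only as a comment because it needs a live GL context. This is a one-time load cost, so it should not be expected to change the per-frame numbers discussed in this thread.

```python
import numpy as np

def pack_vertices(vertices):
    """Convert a flat list of floats to a float32 array and return it with its byte size."""
    data = np.asarray(vertices, dtype=np.float32)
    return data, data.nbytes

# Hypothetical sample data: one triangle, three components per vertex.
m_vertices = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
data, size = pack_vertices(m_vertices)

# With a VBO bound, PyOpenGL accepts the NumPy array directly:
# gl.glBufferData(gl.GL_ARRAY_BUFFER, size, data, gl.GL_STATIC_DRAW)
```

The byte size comes from the array itself, so the hard-coded `*4` for the float width is no longer needed.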

Am I using old functions or have I missed something obvious with my VAOs? Any help would be greatly appreciated!

thanks

  • bc

Why do you believe that the bottleneck is the loadMatricesToShader() function rather than the glDrawArrays() call? The latter is the bulk of the GPU’s workload.

If you’re comparing multiple draw calls to a single draw call, the former will be less efficient; there’s a significant overhead for each draw call, so “batching” geometry (rendering many primitives with few draw calls) is fundamental to performance.

If you’re rendering one cube per draw call, that’s going to be very inefficient. In that case, consider using instancing (e.g. glDrawArraysInstanced(), requires OpenGL 3.1) or fake instancing (storing the matrices in a uniform array or texture, indexed using gl_PrimitiveID or an integer vertex attribute).
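A minimal sketch of the data side of the instancing suggestion, assuming plain NumPy arrays rather than the Transform class from the question. The function name build_instance_translations and its spacing parameter are hypothetical. The GL calls appear only in comments, since they need a live context and the attribute locations depend on the shader.

```python
import numpy as np

def build_instance_translations(count, spacing=1.0):
    """Return a (count, 4, 4) float32 array of row-major translation matrices along x."""
    mats = np.tile(np.eye(4, dtype=np.float32), (count, 1, 1))
    # Centre the row on the origin, like the "i - 5000" loop in the question.
    mats[:, 0, 3] = (np.arange(count, dtype=np.float32) - count / 2) * spacing
    return mats

instance_mats = build_instance_translations(10000)

# A mat4 vertex attribute is read as four vec4 columns, so store each
# matrix column-major before uploading it to the per-instance buffer.
instance_data = np.ascontiguousarray(instance_mats.transpose(0, 2, 1))

# Upload once (not per cube), then draw everything with one call, e.g.:
#   gl.glBufferData(gl.GL_ARRAY_BUFFER, instance_data.nbytes, instance_data, gl.GL_STATIC_DRAW)
#   gl.glVertexAttribDivisor(location, 1)   # for each of the four vec4 columns
#   gl.glDrawArraysInstanced(gl.GL_TRIANGLES, 0, vertex_count, 10000)
```

All 10000 matrices are built by vectorised NumPy operations, so the Python loop, the per-cube uniform upload and the per-cube draw call all go away. Note that glVertexAttribDivisor needs OpenGL 3.3 or the ARB_instanced_arrays extension.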

I believed it to be the load matrices function, as running all the draw calls without loading the matrices for each of the cubes ran fine, i.e.:


        self.loadMatricesToShader()   

        for i in range(0, 20000):
            gl.glDrawArrays (gl.GL_TRIANGLES, 0, self.m_ObjectList[len(self.m_ObjectList) - 1].m_vertexCount)

Or does drawing them all to the same place not actually get processed like that; are the draw calls discarded? I can push it to 20000 before it starts slowing down like this. I suppose I need to restructure this all as you say and batch these together/instance them, but loading the matrices to the shader seems to be really heavy too?

Presumably the same response applies: batch them together and I don’t need to load many matrices to the shader either.

Edit: Interesting you should say the draw calls are the bulk of the workload; the above runs fine, whereas this runs just as poorly (putting just the matrix loader into the loop):


        for i in range(0, 10000):
            self.loadMatricesToShader()   
            
        gl.glDrawArrays (gl.GL_TRIANGLES, 0, self.m_ObjectList[len(self.m_ObjectList) - 1].m_vertexCount)

The draw calls won’t be discarded, but if you render exactly the same geometry repeatedly, all instances after the first will have every fragment fail the depth test (assuming depth testing is enabled and uses GL_LESS), which will drastically reduce the load.

[QUOTE=BuggyCode;1289918]
Edit: Interesting you should say the draw calls are the bulk of the workload; the above runs fine, whereas this runs just as poorly (putting just the matrix loader into the loop)[/QUOTE]
That suggests that loadMatricesToShader() is expensive; what exactly is in there?

+1 Internet for the knowledge about depth testing, interesting to know! The reason I’m so concerned about loadMatricesToShader() is that it only seems to contain these 2 lines. I’ve also tried setting the MVP once at the start so the camera remains stationary, and MVPid is the uniform location, saved only once when the shader is loaded/compiled. This is what eventually led me to the forums, because it feels like I’ve narrowed my 6 fps down to quite simply the glUniformMatrix4fv call.

    def loadMatricesToShader(self):     
        # Construct my ModelViewProjection matrix
        self.MVP = self.P * self.V * self.m_mainCamera.m_transform.getMatrix() * self.M

        # Set uniforms
        gl.glUniformMatrix4fv(self.MVPid, 1, gl.GL_TRUE, self.MVP)
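
Since P, V and the camera matrix do not change between cubes, their product can be computed once per frame instead of once per cube. The sketch below assumes plain NumPy ndarrays with the `@` operator rather than numpy.matrix; the helper names translation, mvps_per_iteration and mvps_hoisted are hypothetical and no GL calls are made.

```python
import numpy as np

def translation(x, y, z):
    """Row-major 4x4 translation matrix."""
    m = np.eye(4)
    m[:3, 3] = (x, y, z)
    return m

def mvps_per_iteration(P, V, C, xs):
    # Mirrors the question: three matrix products for every cube.
    return [P @ V @ C @ translation(x, 0, 0) for x in xs]

def mvps_hoisted(P, V, C, xs):
    # Combine the per-frame matrices once, leaving one product per cube.
    PVC = P @ V @ C
    return [PVC @ translation(x, 0, 0) for x in xs]
```

Both functions produce the same matrices up to floating-point rounding; the second does roughly a third of the multiplications inside the loop. The results would still need converting to float32 before being passed to glUniformMatrix4fv.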
        

Is there anything substantial in the getMatrix() method? Are all of the other members plain variables, not properties? The matrices are all numpy.matrix instances?

Try changing the glUniformMatrix4fv() call to a dummy function to see if the calculation of its arguments is an issue.
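
One way to run that experiment is sketched below, assuming plain NumPy ndarrays stand in for the poster's Transform and camera classes. The names time_loop and dummy_upload are hypothetical, and identity matrices are used as placeholders. It times the matrix construction with the GL call swapped for a no-op, so the figure reflects Python and NumPy overhead only.

```python
import time
import numpy as np

def dummy_upload(mvp):
    """Stand-in for gl.glUniformMatrix4fv; does nothing."""
    pass

def time_loop(upload, iterations=10000):
    """Build one MVP per iteration, hand it to `upload`, and return elapsed seconds."""
    P = np.eye(4)   # placeholder projection matrix
    V = np.eye(4)   # placeholder view matrix
    C = np.eye(4)   # placeholder camera transform
    M = np.eye(4)   # model matrix, translated in place each iteration
    start = time.perf_counter()
    for i in range(iterations):
        M[0, 3] = i - iterations / 2
        mvp = P @ V @ C @ M
        upload(mvp)
    return time.perf_counter() - start

elapsed = time_loop(dummy_upload)
print(f"10000 MVP builds without GL: {elapsed * 1000:.1f} ms")
```

If this loop alone takes a large share of the frame time, the cost lies in building the arguments rather than in glUniformMatrix4fv itself. Swapping in the real Transform.getMatrix() and numpy.matrix products would make the measurement match the original code more closely.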