What is the protocol here for questions about speed?

I have a particle drawing program, and I finally got it running! Unfortunately it runs poorly, and when I tried to make it run more efficiently (using best practices) it only ran worse.

This is OpenGL ES 2.0, with a target of OpenGL ES 2.0.

The drawing itself is fairly simple.

There are 4 shaders

  1. Draws the “lines” (really each is two triangles)
  2. Draws the dots
  3. Draws a full-screen texture
  4. Draws black on the screen with a specific alpha

There are 2 frame buffers

  1. The default front buffer automatically created by OpenGL
  2. A back-buffer with a texture attachment, made by me

There are 5 buffer objects

  1. Holds all of the point locations and data (4 GLfloats per dot, 2 dots per particle)
  2. Holds all of the rectangle locations and data (4 GLfloats per vertex, 4 vertices per particle)
  3. Holds indices for the rectangle data, so that vertices aren't repeated and it can be drawn with GL_TRIANGLES
  4. Holds texture coordinate data
  5. Holds full screen vertices
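To make the layout of VBO 3 concrete, the indices could be filled roughly like this (the function name and winding order here are illustrative, not necessarily what my code does):

```c
#include <stddef.h>

/* Illustrative helper: fills `indices` with 6 indices per quad (two
 * GL_TRIANGLES) referencing the 4 unique vertices per particle described
 * above for VBO 2, so no vertex data is repeated. */
static void fill_quad_indices(unsigned short *indices, size_t quad_count)
{
    for (size_t q = 0; q < quad_count; ++q) {
        unsigned short base = (unsigned short)(q * 4); /* 4 vertices per quad */
        size_t i = q * 6;                              /* 6 indices per quad  */
        indices[i + 0] = base + 0;
        indices[i + 1] = base + 1;
        indices[i + 2] = base + 2;
        indices[i + 3] = base + 2;
        indices[i + 4] = base + 3;
        indices[i + 5] = base + 0;
    }
}
```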

There are also 4 opengl drawing calls in their own functions

  1. Draws a huge number of lines on the screen. Uses the indices (VBO 3) and the data from VBO 2. Uses shader 1
  2. Draws a huge amount of dots on the screen. Uses the data from VBO 1 and shader 2
  3. Draws a texture on screen. Uses VBOs 4 and 5
  4. Draws black on the full screen; it is called with “glDrawArraysInstanced” so that it can run a variable number of times based on fps. Uses VBOs 4 and 5
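As an aside on function 4: the repeated fade passes compose multiplicatively, since n passes of a black quad with alpha a each keep (1 − a)^n of the previous frame. A sketch of collapsing them into a single pass (illustrative, not my actual code):

```c
/* Illustrative helper: n fade passes with a black quad of alpha `a` each
 * keep (1 - a) of the previous frame, so n passes keep (1 - a)^n.
 * One pass with alpha 1 - (1 - a)^n is therefore equivalent, avoiding
 * the instanced variable-count draw entirely. */
static float combined_fade_alpha(float per_pass_alpha, int passes)
{
    float keep = 1.0f;
    for (int i = 0; i < passes; ++i)
        keep *= 1.0f - per_pass_alpha;
    return 1.0f - keep;
}
```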

Each frame the following happens

  1. Depth and color buffers are cleared
  2. The back-buffer is bound
  3. Previous contents are faded (draw function 4)
  4. Circles are drawn (Draw function 2)
  5. Lines are drawn (function 1)
  6. The front buffer is put back into place
  7. Back buffer is drawn to the front buffer via texture (I plan to add effects later)

Extra notes

  • Each particle is made of 2 GL_POINTS, and 2 GL_TRIANGLES
  • All the shaders are very light, performing only basic subtraction. However, one performs a dot product and a step function to carve out the circle from the square dot.
  • All buffer objects and frame buffers are generated at the beginning and updated appropriately.
  • When analyzing the GPU frame, exactly one error appears: for some reason the GPU thinks my call to draw the dots reads out of the array buffer's bounds. It does not crash, and the VBO is far bigger than the draw call; even halving the end range (so only half the dots are drawn) still produces the error.
  • All the array data for the simulation is set in an update loop, which is supposed to run on logic updates. However, the buffer data is not uploaded until drawing.
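On that out-of-bounds warning, here is the bounds arithmetic I believe the analyzer applies (names and formula are my own sketch, not from my code): if the end of the byte range a draw call reads exceeds the VBO size, the tool flags it even though nothing crashes. Classic causes are a stride or offset given in floats rather than bytes, or a vertex count given in components rather than vertices.

```c
#include <stddef.h>

/* Illustrative bounds check: the byte range glDrawArrays(mode, first, count)
 * can read from one attribute ends at
 *   offset + (first + count - 1) * stride + attrib_size.
 * If this exceeds the buffer size, a frame analyzer reports an
 * out-of-bounds read, crash or not. */
static size_t bytes_read_end(size_t offset, size_t stride,
                             size_t attrib_size, size_t first, size_t count)
{
    return offset + (first + count - 1) * stride + attrib_size;
}
```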

Originally the bottleneck was firmly on the CPU, but with my changes the CPU load is now low and the GPU load is a bit higher (30.5 ms for 300 particles), and I have no clue why! It does not seem to be shader complexity: I even turned some of the shaders into simple white-pixel shaders, and that made very little performance difference.
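One thing I can at least quantify is fill cost, since the fade passes and the final blit each write every pixel regardless of particle count (the resolution below is an assumption for illustration, not my real target):

```c
/* Back-of-envelope fill estimate (assumed 1280x720 resolution in the test;
 * not my actual screen size): every full-screen pass writes
 * width * height pixels, independent of how many particles there are. */
static unsigned long fullscreen_pixels(unsigned long width,
                                       unsigned long height,
                                       unsigned long passes)
{
    return width * height * passes;
}
```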

I would love to post all of this code here, but I don't know if it is appropriate to post 432 lines of code (I added tons of comments, which probably bloat that number but also make it easier to read). Is it OK to post the code here to have it examined?

EDIT:
I did a study, and it gave no conclusive results… just more confusion! It appears to be a combination issue.
The numbers correspond to the draw functions I described in an earlier section, and black means the function was commented out.