Move calculations from CPU to GPU

Hi,

I’m trying to move some calculations from the CPU to the GPU. Here is some pseudocode:

CalcForce(index) {
  force = 0;
  forall i, i != index do {
    force += f(index, i);
  }
  return force;
}

UpdateForce() {
  forall i do {
    force[i] = CalcForce(i);
  }
}

and f(i,j) is some arithmetical function.
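For reference while porting, the pseudocode above maps to a straightforward O(n²) CPU loop. Below is a minimal C++ sketch; since the actual f(i, j) isn’t given at this point in the thread, a placeholder inverse-square repulsion is used, and the names `Vec3` and `UpdateForce` are just illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical pair kernel f(i, j): any arithmetic function of the two
// positions works. Here: a placeholder inverse-square repulsion.
static Vec3 f(const Vec3& a, const Vec3& b) {
    Vec3 d = { a.x - b.x, a.y - b.y, a.z - b.z };
    float dist2 = d.x*d.x + d.y*d.y + d.z*d.z;
    dist2 = std::max(dist2, 1e-6f);          // avoid division by zero
    float s = 1.0f / dist2;
    return { d.x * s, d.y * s, d.z * s };
}

// CPU version of UpdateForce(): O(n^2) over all ordered pairs i != j.
void UpdateForce(const std::vector<Vec3>& r, std::vector<Vec3>& force) {
    for (size_t i = 0; i < r.size(); ++i) {
        Vec3 sum = { 0.0f, 0.0f, 0.0f };
        for (size_t j = 0; j < r.size(); ++j) {
            if (j == i) continue;            // skip self-interaction
            Vec3 fij = f(r[i], r[j]);
            sum.x += fij.x; sum.y += fij.y; sum.z += fij.z;
        }
        force[i] = sum;
    }
}
```

This is the reference the GPU version has to reproduce texel by texel.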

Well, the idea is to put the current data into a floating-point texture, set up a shader to compute f(i,j), and render quads. The texture’s values are set using glTexSubImage1D; to read the values back, glGetTexImage is called.

So far I’ve tried it with a 1D floating-point texture, but without success.

What’s the way to solve this kind of algorithm on the GPU? Using a 1D texture, should the viewport be set to (0, 0, texwidth, 1) and lines be rendered instead of quads?

Any advice is appreciated.

Thanks
kon

What’s the way to solve this kind of algorithm on the GPU? Using a 1D texture, should the viewport be set to (0, 0, texwidth, 1) and lines be rendered instead of quads?
Yes. Set the projection and modelview matrices to identity and send the right vertex coordinates; {-1.0, 0.0, 0.0} and {1.0, 0.0, 0.0} should do it.

Write the same algorithm in a fragment shader. I’m not sure whether you have a dynamic or a static loop; it depends on your “forall i”, but I guess it is static.

Thanks for confirming this. Yes, the loop is static.

Now, I have the following code:
pRTTtex and pRTTForce are floating-point textures.

void UpdateForce() {
  TransferToTexture();
  RenderForce();
  TransferFromTexture();
}

void TransferToTexture() {
  pRTTtex->BindBuffer(WGL_FRONT_LEFT_ARB);
  glTexSubImage2D(pRTTtex->GetTextureTarget(), 0, 0, 0, nPARTICLES, 1, GL_RGB, GL_FLOAT, &(r[0].x));
}

void TransferFromTexture() {
  pRTTForce->BindBuffer(WGL_FRONT_LEFT_ARB);
  glGetTexImage(pRTTForce->GetTextureTarget(), 0, GL_RGB, GL_FLOAT, &(force[0].x));
}

void RenderForce() {
  int vp[4];
  glGetIntegerv(GL_VIEWPORT, vp);
  glViewport(0, 0, nPARTICLES, 1);

  glDisable(GL_DEPTH_TEST);
  glActiveTextureARB(GL_TEXTURE1);
  glEnable(pRTTtex->GetTextureTarget());
  pRTTtex->BindBuffer(WGL_FRONT_LEFT_ARB);

  glActiveTextureARB(GL_TEXTURE0);
  glEnable(pRTTtex->GetTextureTarget());
  pRTTForce->BindBuffer(WGL_FRONT_LEFT_ARB);
  
  glUseProgramObjectARB(pShader->GetProgram());
  pExt->glUniform1iARB(pShader->GetUniLoc(pShader->GetProgram(), "tex"), 1);
  pExt->glUniform1iARB(pShader->GetUniLoc(pShader->GetProgram(), "tforce"), 0);

  pRTTForce->BeginCapture();
  for(int i = 0; i < nPARTICLES; ++i) {
    pExt->glUniform1fARB(pShader->GetUniLoc(pShader->GetProgram(), "curTC"), ((float)i) / ((float)nPARTICLES - 1.0f));
    glBegin(GL_LINES);
      glTexCoord2f(0.0, 0.5);
      glMultiTexCoord2fARB(GL_TEXTURE1_ARB, 0, 0.5);
      glVertex3f(-1, .0, .0f);
      
      glTexCoord2f(pRTTForce->GetMaxS(), 0.5);
      glMultiTexCoord2fARB(GL_TEXTURE1_ARB, pRTTtex->GetMaxS(), 0.5);
      glVertex3f( 1, .0, .0f);
    glEnd();
  }
  pRTTForce->EndCapture();

  glUseProgramObjectARB(0);

  glViewport(vp[0], vp[1], vp[2], vp[3]);
  glActiveTextureARB(GL_TEXTURE1);
  glDisable(pRTTtex->GetTextureTarget());
  glActiveTextureARB(GL_TEXTURE0);
  glDisable(pRTTtex->GetTextureTarget());
  glEnable(GL_DEPTH_TEST);
}

// Vertex shader

void main(void) { 
  gl_TexCoord[0] = gl_MultiTexCoord0;
  gl_TexCoord[1] = gl_MultiTexCoord1;
  gl_Position = ftransform();
}

// Fragment shader

uniform sampler2DRect tex;

uniform float curTC;

void main(void) {
  vec4 source = texture2DRect(tex, gl_TexCoord[1].xy);
  vec4 remote = texture2DRect(tex, vec2(curTC, 0.5f));
  vec4 d = source - remote;
  float dist2 = d.x*d.x + d.y*d.y + d.z*d.z;

  float dist = max(dist2, 0.000001f);

  gl_FragColor = 83.0f * (0.8f - dist) * (0.8f - dist) * d / dist;
}

But I still don’t get the desired output in the force array (all 0.0)! What is going wrong here? How does this kind of algorithm have to be solved?

Thanks
kon

You can try gl_FragColor = vec4(83.0, 83.0, 83.0, 83.0);

or whatever value you want, to see whether it writes good values. Also check for GL errors.

For GLSL coding, you don’t need to write 83.0f.
83.0 is enough and is portable.

float dist2 = d.x*d.x + d.y*d.y + d.z*d.z;

should be written as

float dist2 = dot(vec3(d), vec3(d));

How does this kind of algorithm have to be solved?
The idea of your first post sounds fine.
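To see concretely that dot(v, v) is just the expanded sum of squares (i.e. the squared length of v), here is a minimal C++ check; `dot3` is a hand-rolled stand-in for the GLSL built-in:

```cpp
#include <cassert>

// Hand-rolled dot product, mirroring what GLSL's dot() computes for vec3.
float dot3(float ax, float ay, float az, float bx, float by, float bz) {
    return ax * bx + ay * by + az * bz;
}
```

So `dot(vec3(d), vec3(d))` and the hand-written `d.x*d.x + d.y*d.y + d.z*d.z` are the same value; the built-in is just shorter and may map to a single instruction.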

Thanks for the debugging suggestion to write some constant values! No GL errors are reported!
But whatever values I write to the fragment, the values read back are all very small (e.g. -8.35964e-034)!
Any idea what could be wrong here?

For texRect, the texture coordinates are supposed to go from 0 to the texture’s width. But to hit texel centers, should the texture coordinates go from 0.5 to texwidth - 0.5? That would give exactly texwidth texels, instead of texwidth + 1 when going from 0 to texwidth.
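The texel-center question can be checked with simple arithmetic: in a texture rectangle of width w, coordinates run from 0 to w and texel i covers the interval [i, i+1), so its center sits at i + 0.5; the centers therefore run from 0.5 to w - 0.5, giving exactly w samples. A minimal sketch (`texelCenter` is an illustrative helper name):

```cpp
#include <cassert>

// Center of texel i in a texture rectangle: texel i covers the coordinate
// interval [i, i+1), so its center is at i + 0.5. For a width-w texture the
// centers run from 0.5 to w - 0.5: exactly w texels, with no off-by-one.
float texelCenter(int i) {
    return static_cast<float>(i) + 0.5f;
}
```

Sampling at integer coordinates lands exactly on texel boundaries, which is where filtering and rounding surprises come from.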

Originally posted by kon:
Whatever values I write to the fragment the values read back are all very small (e.g. -8.35964e-034)!
Any idea what could be wrong here?

However obvious, have you tried casting the data to (unsigned) 32-bit ints and printing the hexadecimal representation? Perhaps that could reveal a pattern?

No, there isn’t any pattern. Well, the weird thing is that if the number of particles is 256 (so the particle count equals the texture size), all values returned are (1.0, 0.0, 0.0) (still incorrect!), while with other sizes I get these very small or very big numbers.
I’m confused!

So far, I’ve managed to read back the constant float values defined in the fragment shader! Enabling textures has to be done after the RTT context is made current. The texCoords were changed to range from 0.0 to texWidth. I’m still not sure about the shift by 0.5 (see one of my previous posts).
Well, at least that’s working now!

But the algorithm defined in my first post still doesn’t work. This is (or should be) the force calculation for a particle system, and switching the force calculation between the CPU and the GPU gives different output (wrong on the GPU).

So, now that the floating-point math on the GPU is working, I’m looking for advice on what’s going wrong in my shader.

thanks,
kon

I’m far from an authority on shaders, but I’m wondering if this is really what you want to do:

float dist2 = d.x*d.x + d.y*d.y + d.z*d.z;
float dist = max(dist2, 0.000001f);
gl_FragColor = 83.0f * (0.8f - dist) * (0.8f - dist) * d / dist;
  

What you are doing here is using the square of the distance, and I doubt that is what you want (I may be wrong).
Springs are linear; gravitational forces get smaller with increasing distance.

Your formula is centered on a fixed distance between “source” and “remote” of almost 0.9 (the square root of 0.8 is about 0.894); there the result is almost 0.

V-man suggested that you replace it with a dot product (I assume that’s what dot(…) does).
But that wouldn’t be very interesting: after all, the angle between a vector and itself is always 0.

I’m wondering about something else, though: basically you’re rendering to a texture.
Do float textures get clamped when you do this?
Since you’re getting 0s and 1s, I’d say that may be what’s happening.

The formula is correct! It is the f() from my first post.
The dot product dot(v, v) gives v’s length squared (dot(a, b) = |a||b|cos(alpha), and cos(0) = 1).
And the floating-point values are not clamped! I get values like -1234.56 or 1234.56.
The floating-point texture writing/reading is working (see my last post).

I guess you’re right about that dot product, since that is the only explanation for why you need to renormalize your normals after a scale.

Just goes to show how much use I’ve ever had for it: the only thing I remember using it for is for determining the angle between two vectors. Sorry about that.

W.r.t. casting to ints: I’m surprised that works in the way suggested. Normally (in C/C++) a cast from float to int just truncates to the whole number, rather than reinterpreting the bits.

My remark about using the square of the distance was precisely because the square grows so quickly, and you hadn’t posted the original function.
I think it’s an incorrect function, since the magnitude of the force becomes greater both at larger distances and near zero distance, without changing direction. It’s inherently unstable and would result in lots of large numbers and lots of near-zero numbers at distances around 0.89.

Sorry I can’t be of more use.
BTW: your post might be better off in the shading forum.

Well, after looking at the algorithm again last night, I think I found what’s missing: in the for loop I’m overwriting the result of pass i when rendering pass i+1.
So I need two textures: one to hold the current sum and the other for computing the current forces.
Creating a double-buffered pbuffer and swapping the front and back buffers inside the loop, using one as the current texture and the other as the render target, could be the solution; it would also save costly context switches.
Let’s see.
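The ping-pong idea can be sketched without any GL at all. In the C++ sketch below, the two array slots stand in for the two float textures: each “pass” adds onto the sum read from the other buffer instead of overwriting its own input, and the roles swap after every pass (`accumulatePasses` is an illustrative name):

```cpp
#include <cassert>
#include <utility>

// Ping-pong between two buffers: 'read' holds the running sum, 'write'
// receives the new sum. After each pass the roles swap, so pass i+1
// reads what pass i wrote instead of overwriting it.
int accumulatePasses(int nPasses) {
    int buf[2] = { 0, 0 };            // stand-ins for the two float textures
    int read = 0, write = 1;
    for (int i = 0; i < nPasses; ++i) {
        buf[write] = buf[read] + 1;   // "render": new sum = old sum + this pass's force
        std::swap(read, write);       // swap front/back roles for the next pass
    }
    return buf[read];                 // final accumulated value
}
```

On the GPU the swap is just rebinding which surface is the source texture and which is the render target, so no copies or context switches are needed per pass.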

Would it be a clever idea to move all the sqrt and pow calculations to the GPU?

Well, having found some time to work on this project, here’s a quick update and some questions:
The force computation can now be done on the CPU or on the GPU, and the simulation looks the same either way. Changing from 1D textures to 2D, and using glReadPixels instead of glGetTexImage to read back the data, gave a significant boost!

The fragment shader is:

// Fragment shader

uniform sampler2DRect texforce;
uniform sampler2DRect tex;

//uniform float curTC;

void main(void) {
  vec4 source = texture2DRect(tex, gl_TexCoord[1].st);        // this particle's position
  vec4 remote = texture2DRect(tex, gl_TexCoord[1].pq);        // the other particle's position
  vec4 current = texture2DRect(texforce, gl_TexCoord[0].st);  // force accumulated so far
  vec4 d = source - remote;
  float dist2 = dot(vec3(d), vec3(d));

  // Self-interaction: push the distance out of range so the cutoff zeroes it.
  if(dist2 < 0.0001) dist2 = 10000.0;

  dist2 = sqrt(dist2);              // dist2 now holds the actual distance

  float dist = max(dist2, 0.0001);
  float rmindist = 0.8 - dist;

  vec4 result = 83.0 * rmindist * rmindist * d / dist;
  result *= step(dist2, 0.64);      // zero the force beyond the cutoff distance

  gl_FragColor = current + result;  // accumulate onto the running sum
}
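For debugging the CPU/GPU comparison, the pair-force part of the shader above can be mirrored one-to-one on the CPU. The sketch below is a direct translation (GLSL's step(edge, x) is written out as a ternary); `Vec3` and `pairForce` are illustrative names, and the accumulation with `current` is left out since on the CPU it is just a running sum:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// CPU translation of the fragment shader's pair force, useful for
// cross-checking CPU and GPU results texel by texel.
Vec3 pairForce(const Vec3& source, const Vec3& remote) {
    Vec3 d = { source.x - remote.x, source.y - remote.y, source.z - remote.z };
    float dist2 = d.x*d.x + d.y*d.y + d.z*d.z;

    if (dist2 < 0.0001f) dist2 = 10000.0f;   // self-pair: push out of range

    dist2 = std::sqrt(dist2);                // dist2 now holds the distance
    float dist = std::max(dist2, 0.0001f);
    float rmindist = 0.8f - dist;

    float s = 83.0f * rmindist * rmindist / dist;
    // step(dist2, 0.64) in GLSL: 1.0 when 0.64 >= dist2, else 0.0 (cutoff)
    float cut = (0.64f >= dist2) ? 1.0f : 0.0f;
    return { s * d.x * cut, s * d.y * cut, s * d.z * cut };
}
```

Running this over all pairs on the CPU and diffing against the glReadPixels output pins down whether a mismatch is in the shader math or in the texture plumbing.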

Now some performance measurements:
System 1
Pentium M 1.7GHz, 1GB RAM
NVIDIA GeForce FX Go 5200

Computation on CPU
68 fps 100% cpu time, ~0% kernel time

Computation on GPU
34 fps 100% cpu time, ~25% kernel time

System 2
Pentium M 2.17GHz, 2GB RAM
NVIDIA Quadro FX Go 1400

Computation on CPU
215 fps 100% cpu time, ~0% kernel time

Computation on GPU
175 fps 100% cpu time, ~25% kernel time

What’s the reason for the increase in kernel time? The glReadPixels calls?
And does the lower performance in ‘GPU mode’ mean that the algorithm can’t be parallelized well?