john_connor

07-08-2016, 11:15 AM

hi, i'm trying to implement a simple particle system

last time i used transform feedback objects and double buffering, which delivered (in my judgement) very good results:

without collision detection, about 3 million particles can be simulated at 60 frames per second (only gravity and collision with the y = 0 plane enabled)

with a simple line-triangle intersection test to detect collisions between particles and a few triangles in the scene, i can render about 800,000 particles at 60 frames per second

(my graphics card: NVIDIA GT 640, about 3 years old)

this time i want to push the limits further by using compute shaders. as a starting point i managed to build the application from this handout:

web.engr.oregonstate.edu/~mjb/cs557/Handouts/compute.shader.1pp.pdf

i changed it to use only 1 particle buffer for position / velocity / color / etc, but double buffered

the rendering method looks like this:

void ParticleSystem::Render(const glm::mat4 & view, const glm::mat4 & projection, float timestep)
{
    // double buffered, switch vertex array every frame
    static unsigned int flipflop = 1;
    flipflop = !flipflop;

    // bind both particle buffers
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, m_particle_buffer[1 - flipflop].ID()); // source
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, m_particle_buffer[flipflop].ID()); // results

    // compute shader
    unsigned int program = m_program_update.ID();

    // simulate 1 frame
    glUseProgram(program);
    glDispatchCompute(m_particle_count / PARTICLES_WORK_GROUP_SIZE, 1, 1); // work group size = 128
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    // render 1 frame
    program = m_program_render.ID();
    glUseProgram(program);
    glUniformMatrix4fv(glGetUniformLocation(program, "Model"), 1, false, glm::value_ptr(glm::mat4(1)));
    glUniformMatrix4fv(glGetUniformLocation(program, "View"), 1, false, glm::value_ptr(view));
    glUniformMatrix4fv(glGetUniformLocation(program, "Projection"), 1, false, glm::value_ptr(projection));

    glBindVertexArray(m_vertexarray[flipflop].ID());
    glDrawArrays(GL_POINTS, 0, m_particle_count);
    glBindVertexArray(0);

    glUseProgram(0);
}
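
a side note on the dispatch in Render(): m_particle_count / PARTICLES_WORK_GROUP_SIZE is an integer division, so if the particle count is not a multiple of 128, the remaining particles would not be simulated. a minimal sketch of rounding the group count up, assuming a hypothetical helper called work_group_count (it is not part of the code above):

```cpp
#include <cstdint>

// hypothetical helper (an assumption, not from the original code):
// number of work groups needed to cover particle_count invocations,
// rounded up so that a partially filled last group is still dispatched.
// assumes group_size > 0.
constexpr std::uint32_t work_group_count(std::uint32_t particle_count,
                                         std::uint32_t group_size)
{
    return (particle_count + group_size - 1) / group_size;
}
```

with rounding up, the last group can contain invocations past the end of the buffer, so the shader would presumably also need a guard such as "if (index >= particle_count) return;", with the particle count passed in as a uniform (again an assumption).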

question 1:

i've read that glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); is used to synchronize and is relatively expensive (https://www.opengl.org/wiki/Memory_Model#External_visibility), so that if i want to read back data from that buffer, i can be sure that the compute shader has already finished processing the data

BUT: i use 2 buffers. the compute shader calculates the data for the next frame, while the current frame renders the "old" data, which the compute shader ONLY reads

do i actually need to synchronize?

or can i delete glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); without problems?
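
for reference, as far as i understand the memory model page linked above, the bits passed to glMemoryBarrier describe how the written data will be read afterwards, not how it was written. in Render() the draw call uses m_vertexarray[flipflop] right after the dispatch wrote into m_particle_buffer[flipflop]; assuming that vertex array sources its attributes from the same buffer, the draw would be a vertex-fetch read, and the next frame's dispatch would read the same data as a storage buffer. a small sketch of that reasoning, with hypothetical names and local stand-ins for the GL constants so that it compiles without OpenGL headers:

```cpp
#include <cstdint>

// local stand-ins (assumptions for this sketch) carrying the values of
// GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT and GL_SHADER_STORAGE_BARRIER_BIT
constexpr std::uint32_t kVertexAttribArrayBarrierBit = 0x00000001u;
constexpr std::uint32_t kShaderStorageBarrierBit     = 0x00002000u;

// hypothetical description of who reads the freshly written buffer next
struct NextReaders {
    bool vertex_fetch;   // a draw call pulls vertex attributes from it
    bool shader_storage; // a later dispatch reads it as a storage buffer
};

// collect one barrier bit per kind of upcoming read
constexpr std::uint32_t barrier_bits(NextReaders readers)
{
    return (readers.vertex_fetch   ? kVertexAttribArrayBarrierBit : 0u)
         | (readers.shader_storage ? kShaderStorageBarrierBit     : 0u);
}
```

under these assumptions the mask for Render() would contain both bits. whether the barrier can be dropped entirely depends on whether anything reads the written buffer before the next barrier.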

compute shader source:

#version 450

layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

layout (std140, binding = 0) buffer Source { vec4 DataSource[]; }; // particle buffer to read from
layout (std140, binding = 1) buffer Destination { vec4 DataDestination[]; }; // particle buffer to write into

const vec3 gravity = vec3(0, -9.81, 0);
const float timestep = 0.016;

void main()
{
    // read old data
    // this is a 1-dimensional calculation because the data is a 1D array (of particles)
    uint index = gl_GlobalInvocationID.x; // .y and .z == 0 for a 1D dispatch

    vec4 data0 = DataSource[3 * index + 0];
    vec4 data1 = DataSource[3 * index + 1];
    vec4 data2 = DataSource[3 * index + 2];

    vec3 position = data0.xyz;
    float lifetime = data0.w;
    vec3 velocity = data1.xyz;
    float unused = data1.w;
    vec4 color = data2;

    // calculate new data
    //vec3 accelleration = gravity;
    vec3 accelleration = vec3(0, 0, 0);

    vec3 position_new = position + velocity * timestep;
    float lifetime_new = lifetime - timestep;
    vec3 velocity_new = velocity + accelleration * timestep;
    vec4 color_new = color;

    // bounce off the walls of the [-1, +1] box, losing 10% of the speed
    if (position_new.x < -1) { position_new.x = -1; velocity_new.x *= -0.9; }
    if (position_new.y < -1) { position_new.y = -1; velocity_new.y *= -0.9; }
    if (position_new.z < -1) { position_new.z = -1; velocity_new.z *= -0.9; }
    if (position_new.x > +1) { position_new.x = +1; velocity_new.x *= -0.9; }
    if (position_new.y > +1) { position_new.y = +1; velocity_new.y *= -0.9; }
    if (position_new.z > +1) { position_new.z = +1; velocity_new.z *= -0.9; }

    // write new data
    DataDestination[3 * index + 0] = vec4(position_new, lifetime_new);
    DataDestination[3 * index + 1] = vec4(velocity_new, 0);
    DataDestination[3 * index + 2] = color_new;
}
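
to check the shader output, the same step can be mirrored on the cpu for a handful of particles. this is only a sketch: ParticleState and step_reference are hypothetical names, acceleration is taken as zero like in the shader above, and the time step is passed in instead of being hard coded:

```cpp
#include <cassert>

// cpu mirror of one particle, one field group per vec4 slot of the shader
struct ParticleState {
    float position[3];
    float lifetime;
    float velocity[3];
    float unused;
    float color[4];
};

// hypothetical reference step: same math as the compute shader with
// acceleration = 0 and a bounding box of [-1, +1] on every axis
inline ParticleState step_reference(ParticleState p, float dt)
{
    assert(dt > 0.0f);
    const float bounce = -0.9f; // velocity factor applied when a wall is hit
    for (int axis = 0; axis < 3; ++axis) {
        // integrate the position with the old velocity, as the shader does
        p.position[axis] += p.velocity[axis] * dt;
        if (p.position[axis] < -1.0f) { p.position[axis] = -1.0f; p.velocity[axis] *= bounce; }
        if (p.position[axis] > +1.0f) { p.position[axis] = +1.0f; p.velocity[axis] *= bounce; }
    }
    p.lifetime -= dt;
    p.unused = 0.0f; // the shader writes 0 into data1.w
    return p;
}
```

reading back a few entries of the destination buffer and comparing them against step_reference (with a small tolerance) would show whether the indexing and the bounce logic behave as intended.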

question 2:

what about the Model x View x Projection matrix calculation in the vertex shader (for rendering the particles)?

should i also move this calculation to the compute shader and store the results in a third buffer? what about synchronizing?

question 3:

what about using a struct Particle { ... }; in the compute shader for the data source / destination arrays? can i assume that the data is packed tightly together, or do i have to worry about offsets between struct members?

(i would like to avoid this ugliness)

vec4 data0 = DataSource[3 * index + 0];
vec4 data1 = DataSource[3 * index + 1];
vec4 data2 = DataSource[3 * index + 2];

vec3 position = data0.xyz;
float lifetime = data0.w;
vec3 velocity = data1.xyz;
float unused = data1.w;
vec4 color = data2;
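
for context on the offsets: as far as i know, under std140 (and std430) a vec3 is aligned to 16 bytes but only occupies 12, so a float that directly follows it lands in the 4th component. with the member order vec3 / float / vec3 / float / vec4, a GLSL struct should therefore take the same 48 bytes as the three vec4 slots above. a sketch of a cpu-side mirror with compile-time checks of the offsets i would expect (the struct and member names are assumptions, and the checks only cover the C++ side):

```cpp
#include <cstddef>

// hypothetical cpu-side mirror of a GLSL
//   struct Particle { vec3 position; float lifetime;
//                     vec3 velocity; float unused; vec4 color; };
struct Particle {
    float position[3]; // vec3,  expected offset  0
    float lifetime;    // float, expected offset 12 (fills the vec3 padding)
    float velocity[3]; // vec3,  expected offset 16
    float unused;      // float, expected offset 28
    float color[4];    // vec4,  expected offset 32
};

// one particle should span exactly three vec4 slots
static_assert(sizeof(Particle) == 3 * 4 * sizeof(float), "unexpected size");
static_assert(offsetof(Particle, lifetime) == 12, "lifetime not in position.w slot");
static_assert(offsetof(Particle, velocity) == 16, "velocity not on second vec4");
static_assert(offsetof(Particle, unused)   == 28, "unused not in velocity.w slot");
static_assert(offsetof(Particle, color)    == 32, "color not on third vec4");
```

a different member order (for example a float placed before a vec3) would introduce padding on the GLSL side, so the offsets would have to be rechecked in that case.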
