Difference between revisions of "Compute Shader"

m (categorization)
(Compute shader overview.)

Revision as of 15:58, 20 October 2012

A Compute Shader is a Shader Stage that is used entirely for computation.

Overview

Compute shaders operate differently from other shader stages. All of the other shader stages have a well-defined set of input values, some built-in and some user-defined. They have a well-defined set of output values, some built-in and some user-defined. The frequency at which a shader stage executes is specified by the nature of that stage; vertex shaders execute once per input vertex, for example (though some executions can be skipped via caching). Fragment shader execution is defined by the fragments generated from the rasterization process.

Compute shaders work very differently. The "space" that a compute shader operates on is largely abstract; it is up to each compute shader to decide what the space means. The number of compute shader executions is defined by the function used to execute the compute operation. Most important of all, compute shaders have no user-defined inputs and no outputs at all. The built-in inputs only define where in the "space" of execution a particular compute shader invocation is.

Therefore, if a compute shader wants to take some values as input, it is up to the shader itself to fetch that data, via texture access, arbitrary image load, shader storage blocks, or other forms of interface. Similarly, if a compute shader is to actually compute anything, it must explicitly write to an image or shader storage block.

Compute space

The space that compute shaders operate within is abstract. There is the concept of a work group; this is the smallest amount of compute operations that the user can execute. Or, to put it another way, the user can execute some number of work groups.

The number of work groups that a compute operation is executed with is defined by the user when they invoke the compute operation. The space of these groups is three dimensional, so it has a number of "X", "Y", and "Z" groups. Any of these can be 1, so you can perform a two-dimensional or one-dimensional compute operation instead of a 3D one. This is useful for processing image data or linear arrays, such as the particles of a particle system.

When the system executes work groups, it can do so in any order. So if it is given a work group set of (3, 1, 2), it could execute group (0, 0, 0) first, then skip to group (1, 0, 1), then jump to (2, 0, 0), etc. So your compute shader should not rely on the order in which groups are processed.

The work group that a particular compute shader invocation is executing within is passed as an input value.

Do not think that a work group is the same thing as a compute shader invocation; there's a reason why it is called a "group". Within a single work group, there may be many compute shader invocations. How many is defined by the compute shader itself, not by the call that executes it. This is known as the local size of the work group.

Every compute shader has a three-dimensional local size (again, sizes can be 1 to allow 2D or 1D local processing). This defines the number of invocations of the shader that will take place within each work group.
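The relationship between a work group's index and the local size can be pictured with a small CPU-side model. This is only a sketch: the `Index3` type and both helper functions are hypothetical names invented for illustration, not part of OpenGL. GLSL exposes comparable values to the shader through built-in inputs such as `gl_WorkGroupID`, `gl_LocalInvocationID`, and `gl_GlobalInvocationID`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 3-component index type; illustrative only,
   not part of the OpenGL API. */
typedef struct { uint32_t x, y, z; } Index3;

/* Position of one invocation within the whole dispatch:
   group index * local size + index within the group, per dimension. */
static Index3 global_index(Index3 group, Index3 local_size, Index3 local)
{
    /* The local index must lie inside the work group. */
    assert(local.x < local_size.x);
    assert(local.y < local_size.y);
    assert(local.z < local_size.z);

    Index3 g = {
        group.x * local_size.x + local.x,
        group.y * local_size.y + local.y,
        group.z * local_size.z + local.z
    };
    return g;
}

/* Number of invocations inside a single work group. */
static uint32_t invocations_per_group(Index3 local_size)
{
    return local_size.x * local_size.y * local_size.z;
}
```

With a local size of (8, 8, 1), the invocation at local index (3, 5, 0) in work group (2, 1, 0) sits at global index (19, 13, 0), and each group contains 64 invocations.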

Therefore, if the local size of a compute shader is (128, 1, 1), and you dispatch a work group count of (16, 8, 64), then you will get 128 × 16 × 8 × 64 = 1,048,576 separate shader invocations. Each invocation will have a set of inputs that uniquely identify that specific invocation. This is useful for doing various forms of image compression or decompression; the local size would be the size of a block of image data (8x8, for example), while the work group count would be the image size divided by the block size. Each block is processed as a single work group.
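The arithmetic above can be checked with a few lines of C. This is a sketch under stated assumptions: `total_invocations` and `groups_for` are hypothetical helpers, and rounding the group count up for images whose size is not a multiple of the block size is an assumption that the text does not cover.

```c
#include <assert.h>
#include <stdint.h>

/* Total invocations = product over X, Y, Z of (group count * local size).
   64-bit arithmetic is used because the product can exceed 32 bits. */
static uint64_t total_invocations(const uint32_t groups[3],
                                  const uint32_t local_size[3])
{
    uint64_t n = 1;
    for (int i = 0; i < 3; ++i)
        n *= (uint64_t)groups[i] * local_size[i];
    return n;
}

/* Work groups needed to cover `extent` texels with blocks of `block` texels.
   Assumption: partial blocks get a group of their own, so the shader would
   have to ignore invocations that fall outside the image. */
static uint32_t groups_for(uint32_t extent, uint32_t block)
{
    assert(block > 0);
    return extent / block + (extent % block != 0 ? 1u : 0u);
}
```

For example, a 1024-texel-wide image processed in 8-texel blocks needs `groups_for(1024, 8)`, which is 128 groups along that axis; a 1023-texel-wide image also needs 128.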

The invocations within a work group will be executed "in parallel". The main purpose of the distinction between the work group count and the local size is that the different compute shader invocations within a work group can communicate through a set of shared variables. Invocations in different work groups can theoretically communicate as well, but only through atomics, images, and other global memory. Attempting to do so is dangerous, since groups are executed in an arbitrary order and one group cannot assume that another has already run.

Dispatch

Compute shaders are not part of the regular rendering pipeline. So the usual Vertex Rendering functions do not work on them.

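A program object can have a compute shader in it. A program that contains a compute shader cannot also contain shaders for other stages; linking such a program fails. When a compute program is bound in a program pipeline alongside programs for the rendering stages, the compute shader is effectively inert during rendering: Vertex Rendering functions can be issued without affecting it.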

There are two functions to begin compute operations. They will use whichever compute shader is currently active (via glBindProgramPipeline or glUseProgram, following the usual rules for active programs).

 void glDispatchCompute(GLuint num_groups_x, GLuint num_groups_y, GLuint num_groups_z);

The num_groups_* parameters define the work group count, in three dimensions. If any of these numbers is zero, no work groups are dispatched. There are limitations on the number of work groups that can be dispatched.
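An application may want to check its requested group counts against the implementation's limits before dispatching. The following is a minimal sketch: `group_counts_ok` is a hypothetical helper, and it assumes the application has already queried the per-dimension limits (for instance with `glGetIntegeri_v` and `GL_MAX_COMPUTE_WORK_GROUP_COUNT`) and stored them in `max_groups`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if every requested group count is within the matching
   per-dimension limit. `max_groups` is assumed to hold values the
   application queried from the implementation beforehand; this helper
   makes no OpenGL calls itself. */
static bool group_counts_ok(const uint32_t num_groups[3],
                            const uint32_t max_groups[3])
{
    for (int i = 0; i < 3; ++i) {
        if (num_groups[i] > max_groups[i])
            return false;
    }
    return true;
}
```

With placeholder limits of 65535 per dimension, a request of (16, 8, 64) passes the check, while (70000, 1, 1) does not.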

It is possible to execute dispatch operations where the work group count comes from information stored in a Buffer Object. This is similar to indirect rendering for vertex data:

 void glDispatchComputeIndirect(GLintptr indirect);

The indirect parameter is the byte offset into the buffer currently bound to the GL_DISPATCH_INDIRECT_BUFFER target at which the work group counts are stored. Note that the same limitations on work group counts still apply; however, indirect dispatch bypasses OpenGL's usual error checking. As such, attempting to dispatch with out-of-bounds work group counts can cause a crash or even a GPU hard-lock.
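Because indirect dispatch skips the usual checks, it can help to validate the data on the CPU before it is uploaded. The sketch below assumes the indirect data consists of three consecutive unsigned 32-bit group counts in X, Y, Z order and that the offset must be a multiple of four bytes, which is my reading of the OpenGL specification. The struct name and `write_indirect_command` helper are hypothetical, and the code only fills a CPU-side staging array; uploading it to a buffer object is left to the application.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the indirect data: three unsigned 32-bit
   group counts, in X, Y, Z order. */
typedef struct {
    uint32_t num_groups_x;
    uint32_t num_groups_y;
    uint32_t num_groups_z;
} DispatchIndirectCommand;

/* Copies `cmd` into a CPU-side staging array at byte `offset`.
   Refuses offsets that are not 4-byte aligned or that would make
   the command run past the end of the array. */
static bool write_indirect_command(unsigned char *staging, size_t staging_size,
                                   size_t offset, DispatchIndirectCommand cmd)
{
    if (offset % sizeof(uint32_t) != 0)
        return false;
    if (offset > staging_size || staging_size - offset < sizeof cmd)
        return false;
    memcpy(staging + offset, &cmd, sizeof cmd);
    return true;
}
```

The same `offset` that was used to write the command would then be the value passed as the indirect parameter once the staging data has been copied into the bound buffer. The group counts themselves should still be checked against the implementation's limits before they are written.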

Inputs

Local size

Shared variables

Limitations