Table of contents

8.1.2010 The year I started blogging (blogware)
9.1.2010 Linux initramfs with iSCSI and bonding support for PXE booting
9.1.2010 Using manually tweaked PTX assembly in your CUDA 2 program
9.1.2010 OpenCL autoconf m4 macro
9.1.2010 Mandelbrot with MPI
10.1.2010 Using dynamic libraries for modular client threads
11.1.2010 Creating an OpenGL 3 context with GLX
11.1.2010 Creating a double buffered X window with the DBE X extension
12.1.2010 A simple random file read benchmark
14.12.2011 Change local passwords via RoundCube safer
5.1.2012 Multi-GPU CUDA stress test
6.1.2012 CUDA (Driver API) + nvcc autoconf macro
29.5.2012 CUDA (or OpenGL) video capture in Linux
31.7.2012 GPGPU abstraction framework (CUDA/OpenCL + OpenGL)
7.8.2012 OpenGL (4.3) compute shader example
10.12.2012 GPGPU face-off: K20 vs 7970 vs GTX680 vs M2050 vs GTX580
4.8.2013 DAViCal with Windows Phone 8 GDR2
5.5.2015 Sample pattern generator


OpenGL (4.3) compute shader example


OpenGL 4.3 was released yesterday, and among the larger updates were compute shaders. Since I couldn't find a tutorial or example on Google, today I'm going to show you how to use them.

Compute shaders in the pipeline

The important thing to note is that while the other shader stages have a fixed execution order, compute shaders can essentially alter any data anywhere. Shader objects within a program object are implicitly pipelined one after another, and a program object is "ready to go" as it is. Compute shaders cannot be linked into a program object alongside other shaders because their execution order is not fixed. Instead, compute shaders have to be placed into program objects by themselves, and the application has to tell OpenGL the execution order explicitly by switching the compute shader program object on and off and calling DispatchCompute*() to run the compute shaders.

OpenGL compute shaders are written in GLSL and are similar to other shaders: you can read textures, images, and buffers, and write images and buffers. Just like with other GPGPU implementations, threads are grouped into work groups, and one dispatch processes a number of work groups. The work group size is specified along with the kernel source code, and the number of work groups launched is given by the application as arguments to DispatchCompute*().
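
To make the grouping concrete, here is a minimal CPU-side sketch of how a dispatch breaks down into work groups and threads. It is plain C++ rather than OpenGL: UVec3, globalInvocationID() and countVisits() are hypothetical names made up for this illustration, and only the formula in the comment comes from GLSL. It assumes the image size is a multiple of the work group size.

```cpp
#include <cassert>
#include <vector>

// Stand-in for GLSL's uvec3; not an OpenGL type.
struct UVec3 { unsigned x, y, z; };

// Mirrors the GLSL relation
// gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
UVec3 globalInvocationID(UVec3 group, UVec3 size, UVec3 local) {
    return { group.x * size.x + local.x,
             group.y * size.y + local.y,
             group.z * size.z + local.z };
}

// Walks every thread of a 2D dispatch over a width x height image and counts
// how many times each pixel is visited.
std::vector<int> countVisits(unsigned width, unsigned height, UVec3 size) {
    assert(size.x > 0 && size.y > 0);
    assert(width % size.x == 0 && height % size.y == 0);
    std::vector<int> visits(width * height, 0);
    const unsigned groupsX = width / size.x;   // what DispatchCompute would get
    const unsigned groupsY = height / size.y;
    for (unsigned gy = 0; gy < groupsY; ++gy) {
        for (unsigned gx = 0; gx < groupsX; ++gx) {
            for (unsigned ly = 0; ly < size.y; ++ly) {
                for (unsigned lx = 0; lx < size.x; ++lx) {
                    UVec3 id = globalInvocationID({gx, gy, 0}, size, {lx, ly, 0});
                    ++visits[id.y * width + id.x];
                }
            }
        }
    }
    return visits;
}
```

Calling countVisits(512, 512, {16, 16, 1}) walks 32x32 work groups of 16x16 threads, and every one of the 512x512 pixels ends up visited exactly once.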


You should know when to choose a compute shader over the other shaders for your algorithm (this is not one such example). The reasons to use GPGPU are universal and have nothing to do with OpenGL compute shaders specifically.

You can grab the full example program here, but the important files are main.cpp and opengl_cs.cpp. In main.cpp we create an OpenGL 4.3 context (I'm being strict and using a forward-compatible core profile, but you don't have to), a texture for the compute shader to write and the fragment shader to read, and two program objects. One object is for the compute shader and the other is for rendering (vertex + fragment shaders). After that we go into a loop where we update a counter in the compute shader, fill in the texture (as image2D), and blit the texture onto the screen.

#include "opengl.h"

GLuint renderHandle, computeHandle;

void updateTex(int);
void draw();

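int main() {
    // An OpenGL 4.3 context must be current before any of the calls below;
    // the full example creates it in its helper code (not shown here).

    GLuint texHandle = genTexture();
    renderHandle = genRenderProg(texHandle);
    computeHandle = genComputeProg(texHandle);

    for (int i = 0; i < 1024; ++i) {
        updateTex(i);
        draw();
    }

    return 0;
}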

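void updateTex(int frame) {
    // Switch to the compute program before setting its uniform and dispatching.
    glUseProgram(computeHandle);
    glUniform1f(glGetUniformLocation(computeHandle, "roll"), (float)frame*0.01f);
    glDispatchCompute(512/16, 512/16, 1); // 512^2 threads in blocks of 16^2
    checkErrors("Dispatch compute shader");
}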

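void draw() {
    // Switch to the render program and draw the textured quad.
    glUseProgram(renderHandle);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    // The full example presents the frame (buffer swap) at this point.
    checkErrors("Draw screen");
}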
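
The dispatch above relies on 512 being a multiple of 16. For sizes that are not, a common approach is to round the group count up and let the shader skip the threads that land outside the image. Here is a minimal sketch of that arithmetic; numGroups() and overshoot() are hypothetical helpers for illustration, not OpenGL functions and not part of the example program.

```cpp
#include <cassert>

// Number of work groups needed to cover `extent` items with groups of
// `groupSize` threads, rounded up so a partial last group is included.
unsigned numGroups(unsigned extent, unsigned groupSize) {
    assert(groupSize > 0);
    return (extent + groupSize - 1) / groupSize;
}

// Number of threads along one axis that fall outside the image after
// rounding up; the shader would need a bounds check to skip these.
unsigned overshoot(unsigned extent, unsigned groupSize) {
    return numGroups(extent, groupSize) * groupSize - extent;
}
```

With these helpers the call in updateTex() could be written as glDispatchCompute(numGroups(512, 16), numGroups(512, 16), 1), which yields the same 32x32 groups as the integer division in the listing. For a 500 pixel wide image the helper still yields 32 groups, with 12 threads per row past the edge.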

The compute shader set-up should look familiar as it's just another shader. (There are some specifics which are documented in the GLSLang specification.)

#include "opengl.h"
#include <stdio.h>
#include <stdlib.h>

GLuint genComputeProg(GLuint texHandle) {
    // Creating the compute shader, and the program object containing the shader
    GLuint progHandle = glCreateProgram();
    GLuint cs = glCreateShader(GL_COMPUTE_SHADER);

    // In order to write to a texture, we have to introduce it as image2D.
    // local_size_x/y/z layout variables define the work group size.
    // gl_GlobalInvocationID is a uvec3 variable giving the global ID of the thread,
    // gl_LocalInvocationID is the local index within the work group, and
    // gl_WorkGroupID is the work group's index.
    // Note: GLSL requires an image uniform without a format layout qualifier
    // to be declared writeonly; some drivers reject the declaration below otherwise.
    const char *csSrc[] = {
        "#version 430\n",
        "uniform float roll;\
         uniform image2D destTex;\
         layout (local_size_x = 16, local_size_y = 16) in;\
         void main() {\
             ivec2 storePos = ivec2(gl_GlobalInvocationID.xy);\
             float localCoef = length(vec2(ivec2(gl_LocalInvocationID.xy)-8)/8.0);\
             float globalCoef = sin(float(gl_WorkGroupID.x+gl_WorkGroupID.y)*0.1 + roll)*0.5;\
             imageStore(destTex, storePos, vec4(1.0-globalCoef*localCoef, 0.0, 0.0, 0.0));\
         }"
    };

    glShaderSource(cs, 2, csSrc, NULL);
    glCompileShader(cs);
    int rvalue;
    glGetShaderiv(cs, GL_COMPILE_STATUS, &rvalue);
    if (!rvalue) {
        fprintf(stderr, "Error in compiling the compute shader\n");
        GLchar log[10240];
        GLsizei length;
        glGetShaderInfoLog(cs, 10239, &length, log);
        fprintf(stderr, "Compiler log:\n%s\n", log);
        exit(EXIT_FAILURE);
    }
    glAttachShader(progHandle, cs);

    glLinkProgram(progHandle);
    glGetProgramiv(progHandle, GL_LINK_STATUS, &rvalue);
    if (!rvalue) {
        fprintf(stderr, "Error in linking compute shader program\n");
        GLchar log[10240];
        GLsizei length;
        glGetProgramInfoLog(progHandle, 10239, &length, log);
        fprintf(stderr, "Linker log:\n%s\n", log);
        exit(EXIT_FAILURE);
    }

    // The program has to be in use for glUniform1i to target it.
    glUseProgram(progHandle);
    // Point destTex at image unit 0.
    glUniform1i(glGetUniformLocation(progHandle, "destTex"), 0);

    checkErrors("Compute shader");
    return progHandle;
}

[Image: compute shader demo]
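
If you want to check what the shader writes, its arithmetic can be mirrored on the CPU. Here is a minimal sketch, assuming the 16x16 work group size and the hard-coded offset of 8 from the shader source; referenceRed() is a hypothetical helper for verification, not part of the example program.

```cpp
#include <cassert>
#include <cmath>

// CPU reference for the red channel the compute shader stores at pixel (x, y)
// for a given value of the "roll" uniform.
float referenceRed(unsigned x, unsigned y, float roll) {
    const unsigned groupSize = 16; // local_size_x and local_size_y in the shader

    // Split the global ID back into work group index and local index.
    const unsigned groupX = x / groupSize, localX = x % groupSize;
    const unsigned groupY = y / groupSize, localY = y % groupSize;

    // length(vec2(ivec2(gl_LocalInvocationID.xy) - 8) / 8.0)
    const float dx = (static_cast<float>(localX) - 8.0f) / 8.0f;
    const float dy = (static_cast<float>(localY) - 8.0f) / 8.0f;
    const float localCoef = std::sqrt(dx * dx + dy * dy);

    // sin(float(gl_WorkGroupID.x + gl_WorkGroupID.y) * 0.1 + roll) * 0.5
    const float globalCoef =
        std::sin(static_cast<float>(groupX + groupY) * 0.1f + roll) * 0.5f;

    return 1.0f - globalCoef * localCoef;
}
```

The local term is zero at local index (8, 8) of every work group, so those pixels always get exactly 1.0 regardless of roll; the deviation from 1.0 grows towards the corners of each 16x16 tile and is scaled by a sine that varies from group to group.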


But why did Khronos introduce compute shaders in OpenGL when they already had OpenCL and its OpenGL interoperability API? Well, OpenCL (and CUDA) are aimed at heavyweight GPGPU projects and offer more features. Also, OpenCL can run on many different types of hardware (not just GPUs), which makes the API thick and complicated compared to lightweight compute shaders. Finally, the explicit synchronization between OpenGL and OpenCL/CUDA is troublesome to do without crudely blocking (some of the required extensions are not even supported yet). With compute shaders, however, OpenGL is aware of all the dependencies and can schedule things more intelligently. This reduction in overhead might, in the end, be the most significant benefit for graphics algorithms, which often execute for less than a millisecond.



Great article, thanks!!!
- Rich


Thank you very much!
- Aavci


Nice! Thank you!
- linsnos


Why do you pass texHandle as an argument to genRenderProg() and genComputeProg()? You haven't even used it internally. I don't know how it's supposed to work that way...
- Wonderer


Oh yeah you're right; I'm not using the parameter, so it's ignored.  There's no need to use it since it's bound to GL_TEXTURE0 during creation and kept bound throughout the program.
- wili


Thank you


Very helpful !!
- AB


to anyone having problems compiling/running this with a nvidia card, try -L/usr/lib/nvidia-xxx with g++ (xxx being your driver version) and change "uniform image2D destTex" in the shader code to "writeonly uniform image2D destTex"
- meepo


thank you @meepo
- nozam


- jimmi


Thank you!  This was easy to duplicate.  Well done.
- freeflyclone
