Run Pong on the GPU with Compute Shaders in Unity

Pong on the GPU

This article is about adding interactivity and game logic to Compute Shaders. For my introduction to Compute Shaders, check out this post. We’ll do this by writing a version of Pong that runs on the GPU through Compute Shaders. Pong is a simple game that doesn’t fully take advantage of the power of the GPU. Over time I’ll show how you can run more things on the GPU so that you’ll be able to apply Compute Shaders to solve your specific problems.

So let’s start by drawing the paddle.

Drawing the Paddle

I’ll begin with a basic compute shader runner and the compute shader template. The runner is based on the one from the introduction to compute shaders post. It looks like this:

using UnityEngine;

public class PongComputeShaderRunner : MonoBehaviour
{
    [SerializeField] ComputeShader _computeShader;
    [SerializeField] int _size;

    RenderTexture _renderTexture;

    void Start()
    {
        _renderTexture = new RenderTexture(_size, _size, 24);
        _renderTexture.filterMode = FilterMode.Point;
        _renderTexture.enableRandomWrite = true;
        _renderTexture.Create();

        var main = _computeShader.FindKernel("CSMain");
        _computeShader.SetTexture(main, "_Result", _renderTexture);
        _computeShader.GetKernelThreadGroupSizes(main, 
            out uint xGroupSize, 
            out uint yGroupSize, 
            out uint zGroupSize);
        _computeShader.Dispatch(main, 
            _renderTexture.width / (int) xGroupSize, 
            _renderTexture.height / (int) yGroupSize,
            1);
    }

    void OnRenderImage(RenderTexture src, RenderTexture dest)
    {
        Graphics.Blit(_renderTexture, dest);
    }
}

And the compute shader:

#pragma kernel CSMain

RWTexture2D<float4> _Result;

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    _Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
}

It’s nothing we haven’t covered in the previous tutorial. To draw the paddle, we need to know if a given pixel is within a rectangle’s boundaries. So add an IsInsideRect() function to the compute shader:

bool IsInsideRect(float2 min, float2 max, float2 p)
{
    return p.x > min.x && p.x < max.x && p.y > min.y && p.y < max.y;
}

This checks if point p is within the min and max extents of a rectangle. Next, we’ll define the paddle in the compute shader. I use float4 _Paddle to cleverly encode the paddle’s position in the x and y components and the paddle’s size in the z and w components (actually, I’m just lazy). We can modify the CSMain function to write the paddle’s colour if we’re inside its rect, and otherwise write our to-be-defined background colour like so:

float4 _Paddle;
float4 _PaddleColor;
float4 _BackgroundColor;

[numthreads(8,8,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    if (IsInsideRect(_Paddle.xy - _Paddle.zw, _Paddle.xy + _Paddle.zw, id.xy))
    {
        _Result[id.xy] = _PaddleColor;
    }
    else
    {
        _Result[id.xy] = _BackgroundColor;
    }
}

In this case, the paddle rect’s min and max extents are the paddle’s position minus its size and the paddle’s position plus its size, respectively.
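To make the extents concrete, here is a small sketch with hypothetical numbers: a paddle centred at (256, 10) with a stored size of (32, 8). The helper name IsInsidePaddle is mine, purely for illustration, and it relies on the IsInsideRect() function defined above.

```hlsl
// Hypothetical helper, for illustration only.
// Expects the paddle packed as xy = centre, zw = size.
// Relies on IsInsideRect(), which is defined earlier in the same file.
bool IsInsidePaddle(float4 paddle, float2 pixel)
{
    // For paddle = float4(256, 10, 32, 8):
    float2 rectMin = paddle.xy - paddle.zw; // (224, 2)
    float2 rectMax = paddle.xy + paddle.zw; // (288, 18)

    // The comparisons in IsInsideRect are strict, so the pixels that
    // pass are x = 225..287 and y = 3..17.
    return IsInsideRect(rectMin, rectMax, pixel);
}
```

In other words, the stored size acts as a half-size measured from the centre: a size of (32, 8) produces a paddle roughly 64 pixels wide and 16 pixels tall.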

And now the compute shader is ready to draw. By the way, if you’re familiar with writing shaders, you may cringe when you see the if statement. Branching on the GPU is different from what we’re used to on the CPU. GPU threads run in lockstep groups, and when threads in the same group take different sides of a branch, the group executes both sides and each thread ignores the result it doesn’t need. So unlike on the CPU, the cost of a divergent branch can be the sum of the cost of both sides. In our case, the if statement is easier to read and the cost is trivial, so it’s OK.
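If the branch bothers you, one common alternative is to evaluate the condition once and select the colour with a conditional expression. This is only a sketch of the same CSMain, not a claim that it is faster: for a shader this small, the compiler may well generate equivalent code for both versions.

```hlsl
// Sketch: same behaviour as the if/else version of CSMain above.
// Uses _Result, _Paddle, _PaddleColor, _BackgroundColor and IsInsideRect()
// exactly as they are declared earlier in the shader.
[numthreads(8,8,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Evaluate the rectangle test once per pixel.
    bool insidePaddle = IsInsideRect(_Paddle.xy - _Paddle.zw, _Paddle.xy + _Paddle.zw, id.xy);

    // Select one of the two colours instead of branching.
    _Result[id.xy] = insidePaddle ? _PaddleColor : _BackgroundColor;
}
```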

From the C# side, it’s primarily boilerplate code. Declare some inspector variables, pass them into the compute shader before you call Dispatch(), nothing surprising. I don’t want to waste much time going over the boilerplate, but I’ll share the finished project later.

[SerializeField] Color _backgroundColor;

[Header("Paddle")]
[SerializeField] Color _paddleColor;
[SerializeField] Vector2 _paddlePosition = new Vector2(256, 10);
[SerializeField] Vector2 _paddleSize = new Vector2(32, 8);
.
. // Then inside Start()...
.
// Send our inspector vars to the compute shader:
_computeShader.SetVector("_Paddle", new Vector4(
    _paddlePosition.x, _paddlePosition.y,
    _paddleSize.x, _paddleSize.y));
_computeShader.SetVector("_BackgroundColor", _backgroundColor);
_computeShader.SetVector("_PaddleColor", _paddleColor);
.
.
.

You should see something like this:

Ok great! Now let’s move on to the real purpose of this article: adding interactivity.

Moving the Paddle

We could move the paddle in the C# update loop and pass the new position into the compute shader, but where’s the fun in that? Instead, we pass the input and delta time into the compute shader and ask the GPU to do the work. However, doing so requires some restructuring. First, we shouldn’t move the paddle inside our current CSMain because that function runs once for every pixel of the render texture. Doing so would effectively move our paddle thousands of times every frame. Instead, let’s add a new Update kernel to the compute shader and use that to update our elements before drawing them. Here’s what Update looks like in the compute shader:

float _Input;
float _DeltaTime;

[numthreads(1,1,1)]
void Update(uint3 id : SV_DispatchThreadID)
{
    _Paddle.x += _Input * _DeltaTime;
}

I set numthreads to (1, 1, 1) because we’ll only be running Update once per frame in this simple example. In other words, there’s no reason to ask for a large thread group. At this point, you might be thinking this is a silly use of the GPU’s parallel architecture and that it might be slower than updating on the CPU. Well, you’re right, but that’s not the point of this sample. If you had multiple paddles, you could consider breaking the work into one thread per paddle.

Anyway, in the above example _Input is passed to the shader from Input.GetAxisRaw("Horizontal") multiplied by a speed value, and _DeltaTime comes from Time.deltaTime. However, if you were to set up the C# script and run this, you wouldn’t see the paddle move. The reason is that a plain float4 _Paddle is a shader constant set from C#: whatever the kernel writes to it is not kept, so it is back at its initial value the next time the shader runs. To get around this limitation, we need to store it in an RWStructuredBuffer.

ComputeBuffers / RWStructuredBuffers

By changing to an RWStructuredBuffer, we allow our state to persist between frames. To do this, we need to declare a ComputeBuffer in our C# script and assign it to this RWStructuredBuffer. This is similar to what we’re doing with the _Result texture already. After this change, our compute shader Update will look like this:

RWStructuredBuffer<float4> _Paddle;
[numthreads(1,1,1)]
void Update(uint3 id : SV_DispatchThreadID)
{
    _Paddle[0].x += _Input * _DeltaTime;
}

The difference is that _Paddle is now an array with a single element, because StructuredBuffers hold arrays of elements, so we index it with _Paddle[0]. By the way, the difference between RWStructuredBuffer and StructuredBuffer is that the latter is read-only. Over on the C# side, we create our ComputeBuffer and bind it. So we add this block to our initialization:

// Inside Start():
_paddleBuffer = new ComputeBuffer(1, 4 * sizeof(float));
_paddleBuffer.SetData(new[] {
    new Vector4(_paddlePosition.x, 
        _paddlePosition.y, 
        _paddleSize.x, 
        _paddleSize.y)
});

_computeShader.SetBuffer(_updateKernel, "_Paddle", _paddleBuffer);
_computeShader.SetBuffer(_drawKernel, "_Paddle", _paddleBuffer);

So, let’s break this down. First, we create a new ComputeBuffer. The first argument is the number of elements in our compute buffer, one paddle in our case. The second is the size of an element in bytes. A float takes 4 bytes, so a float4 takes 4 * 4 bytes. So, we use sizeof(float) to get the size and multiply that by 4. After that, we set the data in the buffer to an array with a single element. Like before, this element holds our paddle position and size. Then we set the buffer in both of our kernels, the new Update kernel and our CSMain kernel (which we’ll rename to Draw); _updateKernel and _drawKernel are the kernel indices returned by FindKernel(). Finally, you must dispose of your buffer to avoid leaking memory:

void OnDestroy()
{
    _paddleBuffer.Dispose();
}

So now we know how to create ComputeBuffers and link them to our compute shader.
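Because the buffer is an array, the earlier idea of one thread per paddle becomes straightforward. The following is a minimal sketch under assumptions of mine, not code from the project: the names UpdatePaddles, _Paddles, _Inputs and _PaddleCount are hypothetical, and each paddle is assumed to have its own input value.

```hlsl
#pragma kernel UpdatePaddles

// Hypothetical buffers, for illustration only.
RWStructuredBuffer<float4> _Paddles; // xy = position, zw = size, one entry per paddle
StructuredBuffer<float> _Inputs;     // one input value per paddle (read-only)
uint _PaddleCount;                   // number of valid entries in both buffers
float _DeltaTime;

[numthreads(64,1,1)]
void UpdatePaddles(uint3 id : SV_DispatchThreadID)
{
    // A dispatch launches whole groups of 64 threads, so there may be
    // more threads than paddles. Skip the extras.
    if (id.x >= _PaddleCount)
        return;

    // Each thread moves exactly one paddle.
    _Paddles[id.x].x += _Inputs[id.x] * _DeltaTime;
}
```

On the C# side, you would dispatch enough groups to cover every paddle, for example (paddleCount + 63) / 64 groups in x, and size both ComputeBuffers to paddleCount elements.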

Putting it together

Here are both scripts in their entirety up until now. If I’ve done my job right, nothing here should be surprising. I’ve taken all the concepts we covered and put them together.

//PongComputeShader.compute
#pragma kernel Update
#pragma kernel Draw

RWTexture2D<float4> _Result;

float _Input;
float _DeltaTime;

RWStructuredBuffer<float4> _Paddle;

float4 _PaddleColor;
float4 _BackgroundColor;

float _Resolution;

bool IsInsideRect(float2 min, float2 max, float2 p)
{
    return p.x > min.x && p.x < max.x && p.y > min.y && p.y < max.y;
}

[numthreads(1,1,1)]
void Update(uint3 id : SV_DispatchThreadID)
{
    _Paddle[0].x += _Input * _DeltaTime;
}

[numthreads(8,8,1)]
void Draw(uint3 id : SV_DispatchThreadID)
{
    if (IsInsideRect(_Paddle[0].xy - _Paddle[0].zw, _Paddle[0].xy + _Paddle[0].zw, id.xy))
    {
        _Result[id.xy] = _PaddleColor;
    }
    else
    {
        _Result[id.xy] = _BackgroundColor;
    }
}

//PongComputeShaderRunner.cs
using UnityEngine;

public class PongComputeShaderRunner : MonoBehaviour
{
    [Header("Setup")]
    [SerializeField] ComputeShader _computeShader;
    [SerializeField] int _size = 512;
    [SerializeField] Color _backgroundColor;
    
    [Header("Paddle")]
    [SerializeField] Color _paddleColor;
    [SerializeField] Vector2 _paddlePosition = new Vector2(256, 10);
    [SerializeField] Vector2 _paddleSize = new Vector2(32, 8);
    [SerializeField] float _paddleSpeed = 400f;

    RenderTexture _renderTexture;

    int _updateKernel;
    int _drawKernel;

    ComputeBuffer _paddleBuffer;

    void OnEnable()
    {
        _renderTexture = new RenderTexture(_size, _size, 24)
        {
            filterMode = FilterMode.Point, 
            enableRandomWrite = true
        };
        _renderTexture.Create();

        _drawKernel = _computeShader.FindKernel("Draw");
        _updateKernel = _computeShader.FindKernel("Update");

        _computeShader.SetFloat("_Resolution", _size);
        _computeShader.SetTexture(_drawKernel, "_Result", _renderTexture);
        _computeShader.SetVector("_BackgroundColor", _backgroundColor);

        _computeShader.SetVector("_PaddleColor", _paddleColor);

        // One element with a stride of one float4: 4 floats of 4 bytes each.
        _paddleBuffer = new ComputeBuffer(1, 4 * sizeof(float));
        _paddleBuffer.SetData(new[] {new Vector4(_paddlePosition.x, _paddlePosition.y, _paddleSize.x, _paddleSize.y)});

        _computeShader.SetBuffer(_updateKernel, "_Paddle", _paddleBuffer);
        _computeShader.SetBuffer(_drawKernel, "_Paddle", _paddleBuffer);

        _computeShader.SetFloat("_Input", 0f);

        _computeShader.GetKernelThreadGroupSizes(_drawKernel, out uint xGroupSize, out uint yGroupSize, out _);
        _computeShader.Dispatch(_drawKernel, _renderTexture.width / (int) xGroupSize,
            _renderTexture.height / (int) yGroupSize, 1);
    }

    void Update()
    {
        var input = Input.GetAxisRaw("Horizontal") * _paddleSpeed;

        _computeShader.SetFloat("_Input", input);
        _computeShader.SetFloat("_DeltaTime", Time.deltaTime);

        _computeShader.Dispatch(_updateKernel, 1, 1, 1);

        _computeShader.GetKernelThreadGroupSizes(_drawKernel, out uint xGroupSize, out uint yGroupSize, out _);
        _computeShader.Dispatch(_drawKernel, _renderTexture.width / (int) xGroupSize,
            _renderTexture.height / (int) yGroupSize, 1);
    }

    void OnRenderImage(RenderTexture src, RenderTexture dest)
    {
        Graphics.Blit(_renderTexture, dest);
    }

    void OnDisable()
    {
        // Release the GPU resources created in OnEnable().
        _paddleBuffer.Dispose();
        _renderTexture.Release();
    }
}

The next thing we’ll do is add the ball. I’ll use this opportunity to demonstrate how you can have a StructuredBuffer of an arbitrary struct.

Structured Buffers with custom Structs

In the compute shader, we can declare a struct and a Structured Buffer of that struct like so:

//In the compute shader
struct Ball
{
    float4 position;
    float2 velocity;
};

RWStructuredBuffer<Ball> _Ball;

There’s nothing special about it. Over on the C# side, we must declare a matching struct whose fields have the same types and order, because the buffer’s bytes are copied to the GPU as-is. While we’re at it, let’s create a new ComputeBuffer to hold the data.

//In C#
ComputeBuffer _ballBuffer;

struct Ball
{
    public Vector4 Position;
    public Vector2 Velocity;
}

When we create the compute buffer, we pass a stride of 6 * sizeof(float) in the ball’s case. That’s because a Vector4 holds 4 floats and a Vector2 holds 2 floats; add them up and you get 6.

//Inside the C# initialization...
.
.
.
_ballBuffer = new ComputeBuffer(1, 6 * sizeof(float));
Ball ball = new Ball
{
    Position = new Vector4(_ballPosition.x, _ballPosition.y, _ballSize, 0f),
    Velocity = new Vector2(0.5f, 0.5f).normalized * _ballInitialSpeed
};
_ballBuffer.SetData(new[] {ball});
.
.
.
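To show how the struct is used from the shader side, here is a minimal sketch of a kernel that reads and writes the ball. The kernel name UpdateBall and its movement logic are assumptions of mine for illustration, not the implementation from the finished project.

```hlsl
#pragma kernel UpdateBall

struct Ball
{
    float4 position; // bytes  0-15, mirrors Vector4 Position (xy = position, z = size)
    float2 velocity; // bytes 16-23, mirrors Vector2 Velocity
};                   // 24 bytes in total, i.e. the 6 * sizeof(float) stride

RWStructuredBuffer<Ball> _Ball;
float _DeltaTime;

// Hypothetical kernel: moves the single ball along its velocity.
[numthreads(1,1,1)]
void UpdateBall(uint3 id : SV_DispatchThreadID)
{
    // Copy the element, modify it, then write it back to the buffer.
    Ball ball = _Ball[0];
    ball.position.xy += ball.velocity * _DeltaTime;
    _Ball[0] = ball;
}
```

Like the paddle, the ball keeps its state in the buffer between dispatches, so the new position is still there on the next frame.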

With that in place, you can pass the buffer to the compute shader with SetBuffer(), just as we did for the paddle. Now at this point, I have a confession to make. I promised a version of Pong running in a compute shader, but the rest of the work has little to do with compute shaders and a lot to do with Pong. My point is that getting into those details would detract from the goal of this article, which is to show how to add interactivity to a compute shader. If you want to keep going, there’s a more complete version of the project on GitHub linked at the end.

So that’s all I’ll cover today. Now you can write interactive programs that run entirely on the GPU!

If you appreciate my work, join my mailing list to be notified whenever a new post comes out. The complete(ish) project is available here on GitHub.
