Michael Ennis

Games Programmer

Boids That Sprint

Boids That Sprint is my implementation of a runtime-optimised variant of the Boids flocking algorithm, showcasing the performance benefits of utilizing GPU resources to execute AI behaviours. I developed this over the course of my final year at Abertay for my honours project, delving into the DirectX 12 API and discovering numerous optimisation methods that I could use to accelerate the basic algorithm.

The Boids model is a simple flocking algorithm that simulates a school of fish or a herd of animals moving in tandem with one another. The algorithm achieves this fluid motion through three separate rules: cohesion, alignment and separation. These rules result in emergent behaviour, as each individual Boid reacts to every other Boid within the model, rather than being directly instructed to move as a group by a controller.

While an effective flocking algorithm, it has a critical performance drawback: its worst-case time complexity. Each frame, every Boid determines its position and velocity based on all other Boids in the model, giving a time complexity of O(n²). To mitigate this, my honours project utilizes compute shaders through general-purpose computation on the graphics processing unit (GPGPU) to perform the Boids calculations, resulting in a 96.59% increase in performance with 500 boids. Further optimisation of the GPGPU version using asynchronous compute and render pipelines resulted in an additional 9.85% performance increase.

The Boids application was built on top of a detailed DirectX 12 tutorial series, Learning-DirectX12, which provided a functional renderer, wrappers for command lists and queues, a heap allocator and other helper functionality. Using this foundation, I focused on implementing the Boids physics, GPU compute shader optimizations, and asynchronous pipelines, while also deepening my understanding of the DirectX 12 API and graphics programming fundamentals.

Technical Features

CPU Boids

The CPU implementation of the Boids algorithm was split across two key systems: the BoidsPhysicsSystem and the BoidsRenderSystem. The physics system was responsible for updating the Boids' positions and applying the three rules: cohesion, separation and alignment. The rendering system was responsible for creating buffers to store the Boids' vertices and indices, alongside passing commands to the render pipeline.

The BoidsPhysicsSystem functions by iterating through all boids within the model, caching the current boid. It then iterates through all other boids, calculating the distance of each to the cached boid. If this distance is within the maximum range of any of the three rules, a rule-specific target vector is calculated and added to an accumulated vector for that rule.

Once every boid in the model has been compared against the currently cached one, the accumulated vector for each rule is divided by the number of contributions, producing an average vector that influences the cached boid's velocity. The physics system then repeats this process until all boids in the model have been updated.
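A minimal sketch of this loop is shown below, using illustrative names and DirectXMath types rather than the exact BoidsPhysicsSystem code (which splits each rule into its own method, as in the activity diagrams below):

#include <DirectXMath.h>
#include <vector>
using namespace DirectX;

struct Boid { XMFLOAT3 position; XMFLOAT3 velocity; };

// Illustrative O(n^2) update: every boid samples every other boid each frame.
void UpdateBoids(std::vector<Boid>& boids, float deltaTime, float cohesionRange,
                 float alignmentRange, float separationRange)
{
    for (Boid& current : boids)
    {
        XMVECTOR pos = XMLoadFloat3(&current.position);
        XMVECTOR cohesion = XMVectorZero(), alignment = XMVectorZero(), separation = XMVectorZero();
        int cohesionCount = 0, alignmentCount = 0, separationCount = 0;

        for (const Boid& other : boids)
        {
            if (&other == &current) continue;

            XMVECTOR otherPos = XMLoadFloat3(&other.position);
            float distance = XMVectorGetX(XMVector3Length(XMVectorSubtract(otherPos, pos)));

            // Each rule accumulates a target vector whenever the neighbour is in range.
            if (distance < cohesionRange)   { cohesion   = XMVectorAdd(cohesion, otherPos); ++cohesionCount; }
            if (distance < alignmentRange)  { alignment  = XMVectorAdd(alignment, XMLoadFloat3(&other.velocity)); ++alignmentCount; }
            if (distance < separationRange) { separation = XMVectorAdd(separation, XMVectorSubtract(pos, otherPos)); ++separationCount; }
        }

        // Average each rule's accumulated vector to steer the cached boid's velocity.
        XMVECTOR vel = XMLoadFloat3(&current.velocity);
        if (cohesionCount)   vel = XMVectorAdd(vel, XMVectorSubtract(XMVectorScale(cohesion, 1.0f / cohesionCount), pos));
        if (alignmentCount)  vel = XMVectorAdd(vel, XMVectorScale(alignment, 1.0f / alignmentCount));
        if (separationCount) vel = XMVectorAdd(vel, XMVectorScale(separation, 1.0f / separationCount));

        XMStoreFloat3(&current.velocity, vel);
        XMStoreFloat3(&current.position, XMVectorAdd(pos, XMVectorScale(vel, deltaTime)));
    }
}

Because the inner loop visits every other boid, doubling the boid count quadruples the work, which is exactly the O(n²) cost described earlier.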

The BoidsRenderSystem initializes the buffers for the vertices and indices of all boids in the model. The render system also issues the GPU commands that draw all of the boids on screen via mesh instancing. This approach avoids issuing a separate draw call for each individual boid, as they all share the same vertex and index layout, while their location and rotation on screen are manipulated in the vertex shader stage of the render pipeline.
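The draw itself boils down to a single instanced call; a rough sketch, assuming the command list, buffer views and boid count are already set up (names are illustrative, not the project's exact code):

// One draw call submits the shared boid mesh numBoids times; per-instance
// position and rotation are read by the vertex shader from the boid data,
// rather than being baked into separate vertex buffers.
commandList->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
commandList->IASetVertexBuffers(0, 1, &boidVertexBufferView);
commandList->IASetIndexBuffer(&boidIndexBufferView);
commandList->DrawIndexedInstanced(
    boidIndexCount,   // indices in the shared boid mesh
    numBoids,         // one instance per boid
    0, 0, 0);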

The separation of responsibilities between the physics and render systems allowed me to reuse the render functionality for the GPU and asynchronous compute variants without recalculating the boids physics on the CPU. This is reinforced by the deliberate RegisterBoids method and overloaded constructors, which ensure that boids are assigned to each system manually, fully decoupling the systems from one another.
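A hypothetical outline of that decoupling, with illustrative signatures rather than the actual class definitions:

#include <vector>

struct Boid;  // shared boid data, defined elsewhere

// Neither system creates or queries the other; whichever variant owns the boid
// data registers it explicitly, so the render system works unchanged for the
// CPU, GPU and asynchronous compute versions.
class BoidsPhysicsSystem
{
public:
    void RegisterBoids(std::vector<Boid>* boids) { m_boids = boids; }
    void Update(float deltaTime);  // CPU-only rule evaluation
private:
    std::vector<Boid>* m_boids = nullptr;
};

class BoidsRenderSystem
{
public:
    void RegisterBoids(std::vector<Boid>* boids) { m_boids = boids; }
    void Render();                 // instanced draw of whatever boids were registered
private:
    std::vector<Boid>* m_boids = nullptr;
};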

Activity diagram of the Update method within BoidsPhysicsSystem

Activity diagram of the individual rule methods within BoidsPhysicsSystem

GPU Accelerated Boids

The GPGPU implementation of the boids algorithm utilizes the compute shader within DirectX 12 to allow the GPU to multithread the boids behaviour. The compute shader implementation reuses similar logic to the BoidsPhysicsSystem, except it is configured for the GPU's multithreading architecture. GPU threads follow a SIMT (single instruction, multiple threads) approach to parallel computation, meaning that each thread performs the same instructions, but upon different data elements.
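On the C++ side, the mapping is simply one thread per boid. A small sketch, assuming an illustrative thread-group size of 256 (the group size used in the project may differ):

// Assuming the compute shader declares [numthreads(256, 1, 1)], the group count
// is rounded up so that every boid gets a thread; the shader masks out any
// surplus threads whose id is >= the boid count.
const UINT threadsPerGroup = 256;
const UINT groupCount = (numBoids + threadsPerGroup - 1) / threadsPerGroup;

computeCommandList->SetPipelineState(boidsComputePSO.Get());
computeCommandList->SetComputeRootSignature(boidsComputeRootSignature.Get());
computeCommandList->Dispatch(groupCount, 1, 1);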

To accommodate this, I implemented an unordered access view (UAV) buffer to store the boids data, including both position and velocity. This buffer is written to by the compute shader after it calculates the boids data, and the same buffer is then immediately passed to the render pipeline to draw the boids on screen.
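A sketch of how such a buffer can be created in DirectX 12, using the d3dx12.h helpers that ship with the tutorial base; the element layout and names are illustrative:

#include <d3d12.h>
#include <wrl/client.h>
#include "d3dx12.h"   // CD3DX12 helper structs

using Microsoft::WRL::ComPtr;

// Mirrors the HLSL structured buffer element; the project's actual struct may
// carry additional fields.
struct BoidData
{
    float position[3]; float pad0;
    float velocity[3]; float pad1;
};

// Default-heap buffer created with the unordered-access flag so the compute
// shader can write it and the render pipeline can then read it, without any
// copy back to the CPU.
ComPtr<ID3D12Resource> CreateBoidBuffer(ID3D12Device* device, UINT numBoids)
{
    ComPtr<ID3D12Resource> buffer;
    CD3DX12_HEAP_PROPERTIES heapProps(D3D12_HEAP_TYPE_DEFAULT);
    CD3DX12_RESOURCE_DESC desc = CD3DX12_RESOURCE_DESC::Buffer(
        sizeof(BoidData) * numBoids,
        D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS);   // required for UAV access

    device->CreateCommittedResource(
        &heapProps, D3D12_HEAP_FLAG_NONE, &desc,
        D3D12_RESOURCE_STATE_COMMON,   // transitioned to UNORDERED_ACCESS before dispatch
        nullptr, IID_PPV_ARGS(&buffer));

    return buffer;
}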

Originally, the boids' world view projection matrices were calculated on the CPU; however, by utilizing this UAV buffer, I kept all boids data solely within the GPU, performing the matrix calculations in the vertex shader of the render pipeline. This avoided costly CPU-to-GPU transfers that could have affected the overall performance of the model.
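One way this hand-off can look on the render side (the root parameter index and resource names are illustrative):

// Transition the boid buffer from the compute shader's UAV writes to a shader
// resource the vertex shader can read, then bind it as a root SRV. Each boid's
// world view projection matrix is built in the vertex shader from this data,
// so nothing is copied back to the CPU.
CD3DX12_RESOURCE_BARRIER toSRV = CD3DX12_RESOURCE_BARRIER::Transition(
    boidBuffer.Get(),
    D3D12_RESOURCE_STATE_UNORDERED_ACCESS,
    D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE);
renderCommandList->ResourceBarrier(1, &toSRV);

renderCommandList->SetGraphicsRootShaderResourceView(
    1,                                    // illustrative root parameter index
    boidBuffer->GetGPUVirtualAddress());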

To ensure that the compute pipeline has finished writing to the UAV buffer before the render pipeline begins, a fence object is used. Fence objects are synchronization tools that halt execution of other GPU command queues or the CPU thread until a specific value is reached. To release the wait and continue execution, the appropriate value must be signalled on the fence by its command queue.

In the OnRender method within the Tutorial3 class, the render command queue waits upon the fence of the compute command queue. When all render commands have finished executing, the render command queue signals its internal fence, which allows the compute command queue to begin executing its commands. This synchronization ensures that the UAV buffer is never written to and read from simultaneously, avoiding any memory conflicts.
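A simplified sketch of that cross-queue handshake; in the real project the fence bookkeeping lives inside the tutorial base's CommandQueue wrapper, so the names here are illustrative:

// The render queue will not start the frame's draw commands until the compute
// queue has signalled that the boid UAV buffer is fully written.
renderQueue->Wait(computeFence.Get(), computeFenceValue);
renderQueue->ExecuteCommandLists(1, renderCommandLists);

// Once the render work is submitted, signal the render fence; the compute queue
// waits on it before overwriting the buffer for the next frame.
renderQueue->Signal(renderFence.Get(), ++renderFenceValue);
computeQueue->Wait(renderFence.Get(), renderFenceValue);
computeQueue->ExecuteCommandLists(1, computeCommandLists);
computeQueue->Signal(computeFence.Get(), ++computeFenceValue);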

Screenshot of the CalculateBoidWorldViewProjectionMatrix method within the BoidsVertexShader HLSL file

Screenshot of the render command queue (commandQueue) waiting on the compute command queue's fence object, within the OnRender method

Asynchronous Compute

The asynchronous compute implementation of the boids algorithm is built upon the previous GPU accelerated model, utilizing GPU multithreading in a similar way. The core principle of this approach is executing the render and compute pipelines concurrently, coordinated through fence objects and multiple boid UAV data buffers. One of these buffers is exclusively written to each frame by the compute pipeline, representing the next frame's boids data. The other buffer is exclusively read from each frame by the render and compute pipelines, representing the current frame's boids data.

After both the compute and render pipelines have finished executing their commands, the two buffers are swapped. The old "current" buffer becomes the new "next", ready to be updated by the compute pipeline, while the old "next" becomes the new "current", allowing the render pipeline to draw the new boids data. This double-buffering approach avoids memory conflicts by ensuring that reading and writing occur on separate UAV buffers, while fence objects enforce proper synchronization between the two pipelines.
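A condensed sketch of the per-frame rotation, with illustrative names:

// GPU-side waits: the compute queue must not overwrite a buffer the render
// queue is still reading, and the render queue must not read a buffer the
// compute queue is still writing. The fences enforce this across frames.
computeQueue->Wait(renderFence.Get(), renderFenceValue);
renderQueue->Wait(computeFence.Get(), computeFenceValue);

// CPU-side swap: last frame's "next" becomes this frame's "current" (read by
// the render pipeline) and vice versa when the new frame's command lists are
// recorded.
std::swap(boidBufferCurrent, boidBufferNext);  // std::swap from <utility>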

Asynchronous compute increases GPU utilization by increasing the occupancy of the streaming multiprocessors. This increase in the number of active thread warps helps to hide latency during warp stalls, resulting in performance gains over the standard GPU accelerated model. However, at larger boid counts, there are fewer GPU resources available due to the increased demand from the compute pipeline. As such, the performance benefits of the asynchronous compute technique become less prominent, as the streaming multiprocessors are oversaturated by the larger workload.

Screenshot of pipeline synchronization before swapping buffers, within the OnRender method

Conclusion

This project was an opportunity for me to push myself technically after returning to university from my programming internship. I wanted to focus on an honours topic related to optimization within gameplay AI, as I am particularly interested in how games are able to render hundreds, if not thousands, of entities on screen simultaneously while maintaining consistent framerates. Throughout development, I learned the fundamentals of DirectX 12, GPU compute shaders and pipeline synchronisation through fences and signals, strengthening my graphics programming and optimization skills.

Boids That Sprint demonstrated that by leveraging GPGPU to heavily parallelize AI behaviour, significant runtime performance gains could be achieved, even at relatively low entity counts. As the number of entities increased, the performance gap between the standard CPU implementation and the two GPU-based models widened even further. The asynchronous compute version initially resulted in a further 9.85% increase in performance over the standard GPGPU implementation at 500 entities, due to increased occupancy. However, at 10,000 entities this reduced to 5.89%, due to oversaturation of the streaming multiprocessors.

Overall, this project effectively demonstrates the benefits and limitations of GPU acceleration and asynchronous compute techniques for large-scale AI simulations. Future development could focus on improving the worst-case time complexity of O(n²) through spatial partitioning techniques, as this would increase performance by limiting the number of boids that influence one another.
