One of the biggest drawbacks of the way games currently render 3D scenes is that there’s still a surprising amount of back-and-forth communication required between the CPU and GPU. This overhead can slow down graphics card processing in a multitude of ways. However, a new technique demonstrated by AMD has managed to massively reduce this overhead, boosting performance by 1.64x in its demo without any extra processing power required.
The technique was demonstrated on an AMD Radeon RX 7900 XTX, the best graphics card you can currently buy for workloads without ray tracing, but it doesn’t require such a high-end GPU. As such, it could bring performance increases to many games across a multitude of GPUs.
This breakthrough concerns the fact that in many workloads, an initial calculation done on the GPU determines that some subsequent work also needs doing on the GPU. However, in the existing GPU workload setup, this subsequent work needs to be triggered by the CPU, so a little round trip is required from the GPU to the CPU and back again (often using the ExecuteIndirect command in DirectX’s D3D12). This is both inefficient and slow, relative to the GPU simply being able to handle the whole process itself.
An initial workaround for this was proposed a few years ago, with a setup called work graphs. Work graphs allow a developer to define a whole interrelated framework of possible functions and next steps such that the GPU knows which function to perform next without having to go to the CPU.
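To make the difference concrete, here is a minimal toy model in Python. It is not the D3D12 API, and every name in it (`classify`, `grow_ivy`, `draw`, `cpu_driven`, `work_graph`) is hypothetical. It only counts how many times the CPU has to submit work: once per stage in the CPU-driven version, and once in total when nodes are allowed to feed each other directly.

```python
from collections import deque

# Toy node functions: each takes one input record and returns a list of
# (next_node_name, record) pairs. All names here are hypothetical.
def classify(tile):
    # The first pass decides which tiles need follow-up work (even ones here).
    return [("grow_ivy", tile)] if tile % 2 == 0 else []

def grow_ivy(tile):
    # Each selected tile emits three pieces of geometry to draw.
    return [("draw", (tile, leaf)) for leaf in range(3)]

def draw(item):
    # Leaf node, standing in for a mesh node: produces no further work.
    return []

NODES = {"classify": classify, "grow_ivy": grow_ivy, "draw": draw}
STAGES = ["classify", "grow_ivy", "draw"]

def cpu_driven(tiles):
    """The CPU launches every stage itself: one submission per stage."""
    submissions, drawn = 0, 0
    pending = {"classify": list(tiles)}
    for stage in STAGES:
        records = pending.get(stage, [])
        if not records:
            continue
        submissions += 1  # the CPU has to kick off this pass
        for record in records:
            if stage == "draw":
                drawn += 1
            for next_node, output in NODES[stage](record):
                pending.setdefault(next_node, []).append(output)
    return submissions, drawn

def work_graph(tiles):
    """One submission; nodes feed each other until no work is left."""
    submissions, drawn = 1, 0
    queue = deque(("classify", tile) for tile in tiles)
    while queue:
        node, record = queue.popleft()
        if node == "draw":
            drawn += 1
        queue.extend(NODES[node](record))
    return submissions, drawn
```

Calling `cpu_driven(range(8))` and `work_graph(range(8))` draws the same items in both cases, but the first needs three submissions and the second only one. A real GPU runs these nodes in parallel and schedules them very differently, so treat this purely as an illustration of the control flow.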
Today’s demo, then, is an extension of work graphs called mesh nodes. As AMD’s Matthäus Chajdas puts it on the AMD GPUOpen blog, “Mesh nodes … allow a work graph to feed directly into a mesh shader, turning the work graph itself into an amplification shader on steroids.”
Didn’t understand all that? Well, in essence it allows for those clever work graph frameworks to directly trigger mesh shaders, which are the programs used to generate geometry, such as in-game terrain, on the fly. It’s quite a specific use case of the work graph setup, but AMD demonstrates its power with a demo that procedurally generates hundreds of elements (such as the ivy shown above – left shows less generated ivy, right shows more), all using a single initial dispatch call to the GPU. As a result, in this demo AMD could measure that the traditional ExecuteIndirect method was 1.64x slower than the mesh nodes system. You can see the video demo on AMD’s blog linked above.
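For a sense of scale, the short sketch below converts that 1.64x figure into time saved. It assumes "1.64x slower" means the ExecuteIndirect path takes 1.64 times as long as the mesh nodes path for the same work. The 16.4 ms baseline is a made-up number for illustration, not a measurement from AMD's demo, and both function names are hypothetical.

```python
def time_saved_fraction(slowdown):
    """Fraction of the slower path's time that the faster path saves."""
    return 1.0 - 1.0 / slowdown

def faster_time_ms(baseline_ms, slowdown):
    """Time taken by the faster path, given the slower path's time."""
    return baseline_ms / slowdown

# Assumption: ExecuteIndirect time = 1.64 x mesh nodes time.
saved = time_saved_fraction(1.64)        # roughly 0.39
example = faster_time_ms(16.4, 1.64)     # hypothetical 16.4 ms baseline
print(f"Time saved: {saved:.0%}, example workload: {example:.1f} ms")
```

Under that reading, the mesh nodes path finishes the measured workload in about 39% less time. This applies to the part of the frame AMD measured, not necessarily to a game's overall frame rate.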
What does all this mean for current and future games? Well, it’s just one more technique developers can call upon to try and eke out more performance from our games. It’s not really clear just how much a technique like this would affect outright frame rate, but by freeing up system resources in general – and CPU resources specifically – there’s potential for performance to improve thanks to other system bottlenecks being relieved.