We are used to the concepts of fault injection and chaos engineering in conventional clusters and web API services. Techniques like node shutdowns, CPU exhaustion, and memory leaks are all easy to automate in Kubernetes with open source or proprietary tools.
When we began to build large-scale GPU clusters, we discovered that our ability to test our systems (and our observability) was quite limited: there is not much tooling readily available for creating failures in GPUs and GPU nodes at scale.
In this talk we'll look at how to design a GPU "Chaos Monkey" and automate chaos engineering across large-scale GPU Kubernetes clusters, using tools such as gpu-burn for GPU exhaustion, DCGM fault injection, custom jobs, and more.
The talk's code, technical content, and demo apply to most generally available GPU nodes running on Kubernetes, aside from the absolute bleeding edge (e.g., GB200 probably won't work out of the box).
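One building block described above, running gpu-burn as a Kubernetes Job to exhaust the GPUs on a chosen node, can be sketched as a manifest generator. This is a minimal illustration, not the speaker's actual tooling: the image reference, namespace, and node name are placeholders you would replace with your own.

```python
def gpu_burn_job(node_name: str, duration_s: int = 120, gpus: int = 1) -> dict:
    """Build a Kubernetes Job manifest that pins a gpu-burn workload to a
    specific node, simulating GPU exhaustion there for a chaos experiment."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"chaos-gpu-burn-{node_name}"},
        "spec": {
            "backoffLimit": 0,  # a failed burn is itself a useful signal
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "nodeName": node_name,  # target a single node for the experiment
                    "containers": [{
                        "name": "gpu-burn",
                        # Hypothetical image reference; substitute your own gpu-burn build.
                        "image": "example.registry/gpu-burn:latest",
                        "args": [str(duration_s)],  # gpu-burn takes a runtime in seconds
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }

job = gpu_burn_job("gpu-node-01", duration_s=300, gpus=2)
print(job["metadata"]["name"])
```

A chaos controller could serialize this dict to YAML (or submit it via a Kubernetes client) against randomly selected GPU nodes, then watch whether alerts and dashboards actually catch the induced load.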
Speaker

Bryan Oliver
Principal @Thoughtworks, Global Speaker, Co-Author of "Effective Platform Engineering" and "Designing Cloud Native Delivery Systems"
Bryan is an engineer who designs and builds complex distributed systems. For the last three years, he has focused on platforms, GPU infrastructure, and cloud native at Thoughtworks. Through his work, he is invited to speak at conferences around the globe. He is also a published author with Manning (Effective Platform Engineering) and has an early-access book with O'Reilly (Designing Cloud Native Delivery Systems).