We are used to the concepts of fault injection and chaos engineering in conventional clusters and web API services. Techniques like node shutdowns, CPU exhaustion, and memory leaks are all easy to automate in Kubernetes with open source or proprietary tools.
When we began to build large-scale GPU clusters, we discovered that our ability to test our systems (and our observability) was quite limited: there is not much tooling readily available for creating failures in GPUs and GPU nodes at scale.
In this talk we'll look at how to design a GPU "Chaos Monkey" and automate chaos engineering across large-scale GPU Kubernetes clusters, using tools such as gpu-burn for GPU exhaustion, DCGM fault injection, custom jobs, and more.
The talk's code, technical content, and demo apply to most generally available GPU nodes running on Kubernetes, aside from the absolute bleeding edge (e.g., GB200 probably won't work out of the box).
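One building block described above, running gpu-burn as a Kubernetes Job to exhaust the GPUs on a chosen node, can be sketched as a manifest generator. This is a minimal illustration, not the speaker's actual tooling: the image reference, namespace, and node name are placeholders you would replace with your own.

```python
def gpu_burn_job(node_name: str, duration_s: int = 120, gpus: int = 1) -> dict:
    """Build a Kubernetes Job manifest that pins a gpu-burn workload to a
    specific node, simulating GPU exhaustion there for a chaos experiment."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"chaos-gpu-burn-{node_name}"},
        "spec": {
            "backoffLimit": 0,  # a failed burn is itself a useful signal
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "nodeName": node_name,  # target a single node for the experiment
                    "containers": [{
                        "name": "gpu-burn",
                        # Hypothetical image reference; substitute your own gpu-burn build.
                        "image": "example.registry/gpu-burn:latest",
                        "args": [str(duration_s)],  # gpu-burn takes a runtime in seconds
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }

job = gpu_burn_job("gpu-node-01", duration_s=300, gpus=2)
print(job["metadata"]["name"])
```

A chaos controller could serialize this dict to YAML (or submit it via a Kubernetes client) against randomly selected GPU nodes, then watch whether alerts and dashboards actually catch the induced load.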
Speaker

Bryan Oliver
Principal @Thoughtworks, Global Speaker, Co-Author of "Effective Platform Engineering" and "Designing Cloud Native Delivery Systems"
Bryan is an engineer who designs and builds complex distributed systems. For the last three years, he has focused on platforms, GPU infrastructure, and cloud native at Thoughtworks. Through his work, he is invited to speak at conferences around the globe. He is also a published author with Manning (Effective Platform Engineering) and has an early-access book with O'Reilly (Designing Cloud Native Delivery Systems).