Summary
Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qcon.ai with any comments or concerns.
Presented by Bryan Oliver, the talk explores chaos engineering within GPU clusters, highlighting the unique challenges and methodologies when dealing with GPUs compared to traditional CPU-centric systems.
Key Points
- Complexity of GPU Clusters: GPU clusters are highly complex, and traditional chaos engineering methods aren't sufficient. These clusters demand specific approaches to handle intricacies such as stateful workloads, high costs, and unique failure modes like ECC errors and thermal throttling.
- Chaos Engineering Techniques: The presentation outlines seven chaos engineering techniques specifically for GPU clusters, including the following (illustrative sketches for each appear after this list):
  - DCGM Fault Injection: Allows injecting specific GPU failures at the node level.
  - GPU Burn: Used to test scenarios like memory fragmentation and thermal management by exhausting GPU resources.
  - Network Fault Injection: Introduces faults in GPU networking using methods like NCCL environment variables and traffic control (tc) on IP over InfiniBand.
  - NUMA Awareness: Highlights its significance when deploying workloads to GPU clusters, avoiding the performance hits that come from placement that isn't aligned to the GPU's NUMA node.
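A minimal sketch of what DCGM fault injection can look like in practice, assuming the `dcgmi` CLI is available on the node (for example inside the DCGM or dcgm-exporter pod); the field ID and value below are illustrative and should be checked against your DCGM version:

```python
import subprocess

# Sketch: inject an ECC-style error counter into DCGM for GPU 0 so health checks
# and exporters see a "failing" GPU without any real hardware damage.
# The field ID follows NVIDIA's published injection example; list valid IDs for
# your DCGM version with `dcgmi dmon -l`.
GPU_ID = "0"
FIELD_ID = "319"   # illustrative ECC/retired-pages field ID -- verify locally
VALUE = "4"        # injected counter value

result = subprocess.run(
    ["dcgmi", "test", "--inject", "--gpuid", GPU_ID, "-f", FIELD_ID, "-v", VALUE],
    capture_output=True, text=True, check=False,
)
print(result.stdout or result.stderr)
```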
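GPU Burn itself is a small open-source stress tool; here is a sketch of driving it from Python, assuming the `gpu_burn` binary (or a container wrapping it) is present on the node:

```python
import subprocess

# Sketch: saturate every visible GPU for a fixed window with the open-source
# gpu_burn stress tool -- a controlled way to provoke thermal throttling and
# memory pressure.
DURATION_SECONDS = 120

# gpu_burn takes the run length in seconds as its argument.
subprocess.run(["gpu_burn", str(DURATION_SECONDS)], check=True)

# While it runs, watch temperature and throttle state from another shell, e.g.:
#   nvidia-smi --query-gpu=temperature.gpu,clocks_throttle_reasons.hw_thermal_slowdown \
#       --format=csv -l 5
# (query field names vary slightly between driver versions)
```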
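For network fault injection, a sketch combining two knobs: NCCL environment variables to push collective traffic off InfiniBand, and `tc netem` on an IP-over-IB interface. The interface name, delay, and loss values are assumptions:

```python
import os
import subprocess

# Knob 1: NCCL environment variables, set in the training job's environment.
# Disabling IB forces NCCL onto TCP, simulating a degraded fabric; NCCL_DEBUG
# shows which transport it actually chose.
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"

# Knob 2: traffic shaping on the IP-over-IB interface (run as root on the node).
# "ib0", 10ms, and 1% are placeholders -- check interface names with `ip link`.
IB_INTERFACE = "ib0"
subprocess.run(
    ["tc", "qdisc", "add", "dev", IB_INTERFACE, "root", "netem",
     "delay", "10ms", "loss", "1%"],
    check=True,
)
# Clean up afterwards with: tc qdisc del dev ib0 root netem
```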
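And for NUMA awareness, a sketch of inspecting GPU/CPU affinity and pinning a workload to the matching NUMA node; the node index and workload name are placeholders:

```python
import subprocess

# Show the GPU <-> CPU/NIC affinity matrix; each GPU's NUMA node is listed here.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)

# Pin the workload's CPU and memory to the NUMA node that owns its GPU.
# Node 0 is an assumption -- read the real value off the matrix above.
NUMA_NODE = "0"
subprocess.run(
    ["numactl", f"--cpunodebind={NUMA_NODE}", f"--membind={NUMA_NODE}",
     "python", "train.py"],   # "train.py" is a placeholder workload
    check=True,
)
```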
Observability and Scheduling:
- A significant focus is placed on observability gaps, where current tools may fail to detect issues effectively. Comprehensive monitoring and alerting that can detect injected faults, and drive recovery from them, is emphasized (a sketch of such a check follows this list).
- Scheduling strategies that are aware of hardware variability and infrastructure conditions are important for optimizing performance and resource allocation.
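One way to close that loop is to assert, after each injection, that the fault is actually visible in monitoring. A sketch using the Prometheus HTTP API and a dcgm-exporter metric; the Prometheus URL, metric name, and label names are assumptions to adjust for your stack:

```python
import requests

# Assumptions: dcgm-exporter is scraped by a Prometheus reachable at PROM_URL,
# and DCGM_FI_DEV_XID_ERRORS is enabled in the exporter's counter list.
PROM_URL = "http://prometheus.monitoring.svc:9090"
QUERY = "DCGM_FI_DEV_XID_ERRORS > 0"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
hits = resp.json()["data"]["result"]

if hits:
    for series in hits:
        labels = series["metric"]
        print("XID errors visible on", labels.get("Hostname"), "gpu", labels.get("gpu"))
else:
    print("Injected fault not visible in Prometheus -- possible observability gap.")
```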
Open Source Initiatives:
An open-source project named GPU Dragon is being started to implement the presented chaos experiments, making chaos engineering practices for GPUs more accessible and robust.
Historical Context:
The presentation briefly touches on the evolution of GPU clusters, tracing back to significant breakthroughs that have shaped modern computational practices.
Overall, the presentation calls for adaptation and innovation in chaos engineering methods to keep pace with rapidly advancing and expensive GPU cluster technologies.
This is the end of the AI-generated content.
We are used to the concepts of fault injection and chaos engineering in conventional clusters and web API services. Techniques like node shutdowns, CPU exhaustion, and memory leaks are all easy to automate in Kubernetes with open-source or proprietary tools.
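As a point of reference, here is a minimal sketch of that "easy" kind of chaos, cordoning a random worker node with the Kubernetes Python client; the label selector is a common convention rather than a guarantee for every cluster:

```python
import random
from kubernetes import client, config

# Pick a random non-control-plane node and cordon it, so the scheduler has to
# route around the "failure". Set unschedulable back to False to recover.
config.load_kube_config()
v1 = client.CoreV1Api()

workers = v1.list_node(label_selector="!node-role.kubernetes.io/control-plane").items
victim = random.choice(workers)

v1.patch_node(victim.metadata.name, {"spec": {"unschedulable": True}})
print(f"Cordoned {victim.metadata.name}")
```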
When we began to build large-scale GPU clusters, we discovered that our ability to test our systems (and test our o11y) was quite limited. There's not a lot of tooling readily available to create failures in GPUs and GPU nodes at scale.
In this talk we'll look at how to design a GPU "Chaos Monkey", using tools like GPU exhaustion/GPU Burn, DCGM fault injection, custom jobs, and more to automate chaos engineering across large-scale GPU Kubernetes clusters.
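To give a feel for the shape of such a tool, here is a hypothetical sketch of one "chaos monkey" tick: pick a random GPU node and launch a short gpu_burn Job pinned to it. The container image, node label, and namespace are placeholders, not the tooling shown in the talk:

```python
import random
from kubernetes import client, config

# One chaos-monkey tick: choose a random GPU node and run a short gpu_burn Job
# on it. Image, label selector, and namespace are placeholders.
config.load_kube_config()
core = client.CoreV1Api()
batch = client.BatchV1Api()

gpu_nodes = core.list_node(label_selector="nvidia.com/gpu.present=true").items
victim = random.choice(gpu_nodes).metadata.name

job = client.V1Job(
    metadata=client.V1ObjectMeta(generate_name="gpu-chaos-burn-"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                node_name=victim,  # pin the fault to the chosen node
                containers=[client.V1Container(
                    name="gpu-burn",
                    image="example.registry/gpu-burn:latest",  # placeholder image
                    args=["120"],  # burn for 120 seconds
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"},
                    ),
                )],
            )
        ),
    ),
)
batch.create_namespaced_job(namespace="chaos", body=job)
print(f"Launched gpu_burn chaos job on {victim}")
```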
The talk's code, technical details, and demo apply to most generally available GPU nodes running on Kubernetes, aside from the absolute bleeding edge (e.g., GB200 probably won't work out of the box).
Speaker
Bryan Oliver
Principal @Thoughtworks, Global Speaker, Co-Author of "Effective Platform Engineering" and "Designing Cloud Native Delivery Systems"
Bryan is an engineer who designs and builds complex distributed systems. For the last 3 years, he has been focused on Platforms, GPU Infrastructure, and cloud native at Thoughtworks. Through his work, he gets invited to speak at conferences all over the globe. He's also a multi-published author: Effective Platform Engineering with Manning, and Designing Cloud Native Delivery Systems, an early-access book with O'Reilly.