What Is a Chaos Drill?

As O&M shifts from traditional IT infrastructure to cloud services, traditional O&M methods face new challenges: complex inter-service calls, fast application iteration, a massive number of O&M objects, and complex nonlinear systems. Service downtime can cause huge economic losses and reputational damage to a company.

To address these challenges, chaos engineering is introduced into the O&M process. Performing chaos drills periodically helps identify system weaknesses (such as software bugs, solution design deficiencies, and gaps in fault recovery processes) before issues occur on the live network. System availability problems can then be detected and resolved in a timely manner, which improves application resilience and builds O&M confidence. For faults that cannot be avoided (such as hardware faults, abnormal server power-off, and network device board faults), formulate a contingency plan in advance so that services can be recovered quickly.

COC allows you to perform automated chaos drills that cover the entire process, from risk identification and emergency plan management to fault injection and post-drill review and improvement. Drawing on years of Huawei Cloud SRE best practices in chaos drills, COC helps customers proactively identify, mitigate, and verify risks in cloud applications, improving their resilience.
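The following is a minimal Python sketch of the four drill stages described above (risk identification, contingency plan, fault injection, and review). It is illustrative only and does not use or represent the COC API. All names in it (ToyService, inject_fault, contingency_plan, run_drill) are hypothetical, and the "service" is an in-memory stand-in with two replicas.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical in-memory stand-in for a cloud application."""
    # Replica name -> healthy flag. A fresh dict is created per instance.
    replicas: dict = field(default_factory=lambda: {"a": True, "b": True})

    def is_available(self) -> bool:
        # Steady-state hypothesis: at least one replica is healthy.
        return any(self.replicas.values())


def inject_fault(service: ToyService, replica: str) -> None:
    """Fault injection: mark one replica as failed."""
    service.replicas[replica] = False


def contingency_plan(service: ToyService) -> None:
    """Contingency plan prepared in advance: restart every replica."""
    for name in service.replicas:
        service.replicas[name] = True


def run_drill(service: ToyService, targets: list) -> dict:
    """Run one drill and return a report for the review stage."""
    # Risk identification: the replicas we suspect the service depends on.
    report = {"risks": list(targets), "findings": [], "recovered": False}
    for replica in targets:
        inject_fault(service, replica)
        # Check the steady-state hypothesis after each injected fault.
        if not service.is_available():
            report["findings"].append(f"outage after losing replica {replica}")
    # Execute the contingency plan and verify that the service recovers.
    contingency_plan(service)
    report["recovered"] = service.is_available()
    return report


if __name__ == "__main__":
    print(run_drill(ToyService(), ["a"]))
    print(run_drill(ToyService(), ["a", "b"]))
```

In this sketch, losing a single replica produces no findings because the steady-state check still passes, while losing both replicas records an outage that the contingency plan then recovers from. A real drill would target actual resources and rely on real monitoring data rather than an in-memory flag.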
