TCP #40: Mastering Chaos Engineering with AWS Fault Injection Service

Break things before they break you

Feb 08, 2025

Why You Need Chaos Engineering (Before Disaster Strikes)

Your system is running fine until it’s not.

Suddenly, a critical service crashes, latency spikes, and your customers are furious.

That’s when you realize you’ve never tested what happens when things go wrong.

That’s what Chaos Engineering is about.

It’s not about breaking things randomly; it’s about deliberately injecting failures to see how your system handles them.

AWS Fault Injection Service (FIS) lets you simulate real-world outages, network failures, and latency spikes before they happen in production.

Instead of hoping your architecture is resilient, you’ll know for sure.

To get started, you need a structured plan to test, observe, and improve your system’s resilience.

In today’s newsletter, I discuss how to do it.

Step 1: Define Your Steady State

Before introducing chaos, you must know what “normal” looks like.

Start by identifying the key metrics that define a healthy system.

This could be:

API response time under 200ms
99.99% availability
Database queries completed within 50ms

Use Amazon CloudWatch dashboards, distributed tracing (AWS X-Ray), and monitoring tools like Datadog or New Relic to establish these baselines.

If you don’t have clear performance benchmarks, pause.

Measure first. Then, you’re ready to inject failures.
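
Here’s a minimal sketch of what a steady-state check could look like once you have exported some samples from your monitoring tool. The thresholds come from the examples above. The class name, the function names, and the choice of the 99th percentile as the latency measure are my own assumptions, so adapt them to however you define "healthy".

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds, taken from the example metrics in this step."""
    max_api_p99_ms: float = 200.0
    min_availability_pct: float = 99.99
    max_db_p99_ms: float = 50.0


def p99(samples: list[float]) -> float:
    """Return the 99th percentile of the samples (needs at least 2 values)."""
    # quantiles(n=100) returns 99 cut points; the last one is the p99.
    return quantiles(samples, n=100, method="inclusive")[-1]


def check_steady_state(api_ms, db_ms, ok_requests, total_requests,
                       limits=SteadyState()):
    """Return a list of violations; an empty list means the system looks healthy."""
    violations = []
    availability = 100.0 * ok_requests / total_requests

    if p99(api_ms) > limits.max_api_p99_ms:
        violations.append(f"API p99 {p99(api_ms):.1f}ms exceeds {limits.max_api_p99_ms}ms")
    if p99(db_ms) > limits.max_db_p99_ms:
        violations.append(f"DB p99 {p99(db_ms):.1f}ms exceeds {limits.max_db_p99_ms}ms")
    if availability < limits.min_availability_pct:
        violations.append(f"Availability {availability:.3f}% is below {limits.min_availability_pct}%")
    return violations
```

Run the same check before, during, and after an experiment. If it already reports violations before you inject anything, you don’t have a steady state yet.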

Step 2: Design a Failure Scenario

Not all failures are equal. Some bring down your entire system, while others create minor annoyances.

Prioritize testing the most business-critical components first.

Common failure scenarios include:

EC2 instance termination: What happens when a key server dies?
Latency injection: What if a database query takes 5x longer?
Network blackhole: What if a specific Availability Zone goes offline?
CPU stress: What happens if an app server hits 100% CPU?

AWS FIS ships with prebuilt actions and a scenario library that cover these failures.

You can also create custom ones tailored to your architecture.
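
To make the prioritization concrete, here’s a small sketch that maps the four scenarios above to FIS actions and orders the experiments by business criticality. The scenario labels, component names, and criticality scores are hypothetical. The action IDs and SSM document names are the ones I understand FIS to publish, so confirm them against the FIS actions reference for your Region.

```python
# Scenario labels are hypothetical. Action IDs and SSM document names are
# assumptions based on the FIS actions reference; verify before use.
SCENARIOS = {
    "ec2_termination": {"action_id": "aws:ec2:terminate-instances"},
    "latency_injection": {
        "action_id": "aws:ssm:send-command",
        "ssm_document": "AWSFIS-Run-Network-Latency",
    },
    "network_blackhole": {"action_id": "aws:network:disrupt-connectivity"},
    "cpu_stress": {
        "action_id": "aws:ssm:send-command",
        "ssm_document": "AWSFIS-Run-CPU-Stress",
    },
}


def plan_experiments(components):
    """Return (component, scenario, action_id) tuples, most critical component first."""
    ranked = sorted(components, key=lambda c: c["criticality"], reverse=True)
    return [
        (c["name"], scenario, SCENARIOS[scenario]["action_id"])
        for c in ranked
        for scenario in c["scenarios"]
    ]


# Hypothetical inventory: a higher criticality score means more business impact.
components = [
    {"name": "image-resizer", "criticality": 2, "scenarios": ["cpu_stress"]},
    {"name": "checkout-api", "criticality": 9,
     "scenarios": ["ec2_termination", "latency_injection"]},
]

for name, scenario, action_id in plan_experiments(components):
    print(f"{name}: {scenario} -> {action_id}")
```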

Step 3: Create an Experiment in AWS Fault Injection Service

It's time to get hands-on. Here’s how to set up an experiment in AWS FIS:

Navigate to AWS FIS in the AWS console.
Create a new experiment template.
Choose an action, like terminating EC2 instances or adding network latency.
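Define targets, specifying which instances or resources to affect.
Assign an IAM role that gives FIS permission to act on those resources.
Set stop conditions to prevent uncontrolled damage. A stop condition is a CloudWatch alarm; for example, halt the test if an alarm on CPU utilization above 90% fires.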
Launch the experiment and monitor its impact.

AWS FIS integrates with CloudWatch Alarms to auto-stop experiments if things spiral out of control. Use this safety net.
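
If you’d rather script the console steps above, the same template can be created with boto3. This is a sketch under assumptions: the ARNs, tag values, and the target and action names (`oneInstance`, `terminate`) are placeholders, and the field names follow my reading of the boto3 `create_experiment_template` call. The builder is kept separate from the AWS call so you can inspect the template without credentials.

```python
import uuid


def build_template(role_arn, alarm_arn, tag_key, tag_value):
    """Build the request for an experiment that terminates one tagged EC2 instance."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Terminate one tagged EC2 instance",
        "roleArn": role_arn,  # role FIS assumes to act on your resources
        # Safety net: FIS stops the experiment if this CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn},
        ],
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "terminate": {
                "actionId": "aws:ec2:terminate-instances",
                "targets": {"Instances": "oneInstance"},
            }
        },
        "tags": {"Name": "ec2-termination-experiment"},
    }


def create_template(template):
    """Send the template to FIS and return the new template ID."""
    import boto3  # imported lazily so build_template works without AWS access

    fis = boto3.client("fis")
    response = fis.create_experiment_template(**template)
    return response["experimentTemplate"]["id"]
```

Target by tag rather than by instance ID where you can, and keep the selection mode at a single instance until you trust your stop conditions.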

Step 4: Observe, Analyze, and Improve

Once the experiment starts, watch what happens. Are requests failing? Are auto-scaling policies kicking in? Are customers impacted?

Use AWS tools like:

CloudWatch Logs to track errors
X-Ray to see request traces and bottlenecks
AWS Config to verify system compliance

Once the test is completed, review the data.

If your system degraded beyond acceptable limits, tweak configurations, optimize scaling policies, or introduce fallback mechanisms (like circuit breakers).
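
If you haven’t used a circuit breaker before, here’s a minimal sketch of the idea. The class and its parameters are hypothetical and not tied to any library: after a number of consecutive failures the breaker "opens" and fails fast (or serves a fallback), then lets one trial call through once a cool-down has passed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, fallback=None, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: skip the dependency entirely and fail fast.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("dependency is marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A latency or termination experiment is a good way to check that a breaker like this actually trips, and that the fallback path serves something useful when it does.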

Run the test again. And again.

Chaos Engineering is an iterative process.

Step 5: Automate Chaos Into Your CI/CD Pipeline

Running chaos experiments manually is good.

Automating them is game-changing.

Use AWS FIS alongside AWS Step Functions, Lambda, or even GitHub Actions to trigger chaos experiments on a schedule or after deployments.

Example: After deploying a new version of your app, automatically run a latency injection test to ensure the system handles slowdowns gracefully.
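
A post-deployment gate along those lines could look like the sketch below. It starts an experiment from an existing template, polls until the experiment reaches a final status, and returns a non-zero exit code unless it completed. The function names, polling interval, and timeout are my assumptions, and the set of final statuses is what I understand the FIS API to return, so check it against the current API reference.

```python
import time
import uuid

# Statuses I understand to be final; anything else is treated as in progress.
TERMINAL_STATUSES = {"completed", "stopped", "failed"}


def run_experiment(fis, template_id, poll_s=15.0, timeout_s=900.0, sleep=time.sleep):
    """Start an experiment and return its final status, or "timeout"."""
    experiment = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )["experiment"]

    waited = 0.0
    while True:
        status = fis.get_experiment(id=experiment["id"])["experiment"]["state"]["status"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_s:
            return "timeout"
        sleep(poll_s)
        waited += poll_s


def main(template_id):
    """Return 0 if the experiment completed, 1 otherwise (for use as an exit code)."""
    import boto3  # imported lazily so run_experiment can be tested with a fake client

    status = run_experiment(boto3.client("fis"), template_id)
    print(f"experiment finished with status: {status}")
    return 0 if status == "completed" else 1
```

In a pipeline step you would call `sys.exit(main(template_id))` after the deployment finishes. A "stopped" status usually means a stop condition fired, which is exactly the signal you want to fail the build on.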

The goal?

Make Chaos Engineering a regular practice, not a one-time event.

Final Thoughts

AWS Fault Injection Service allows you to test failures before they happen in production. It’s the difference between hoping your system is resilient and proving it is.

Start small. Run a simple EC2 termination test today.

Then scale up.

Simulate a regional outage, test for database connection failures, or inject latency spikes across services.

The more you test, the stronger your system becomes.

Your customers may never know you practice Chaos Engineering.

But they’ll know when you don’t.

That’s it for today!

Until next week — Amrut

Meng Li

Feb 8

Chaos Engineering transforms system resilience by deliberately injecting controlled failures—based on clearly defined normal operating conditions—to uncover vulnerabilities and validate performance before real-world issues occur. By automating iterative experiments and learning from each test, organizations can continuously improve reliability and safeguard customer trust even under stress.
