TCP #49: SLI vs. SLO vs. SLA: What Every Platform Engineer Needs to Know

A strategic guide to defining the right metrics for platform reliability, cost efficiency, and developer velocity.

Mar 23, 2025


Understanding SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement) is crucial for defining, monitoring, and improving service reliability.

These three terms are commonly used in Site Reliability Engineering (SRE) and Platform Engineering to measure service performance and set expectations with internal teams and customers.

In today’s newsletter issue, I discuss SLIs, SLOs, and SLAs with examples and share the strategic factors to consider while defining them.

SLI vs SLO vs SLA

1. SLI (Service Level Indicator)

SLI is a quantitative metric that measures a system's actual performance. It provides real-time or historical data on system behavior.

Example SLIs:
- Availability: Percentage of successful requests (e.g., 99.95% uptime).
- Latency: Response time of an API (e.g., 200ms P99 latency).
- Error Rate: Percentage of failed requests (e.g., 0.01% error rate).
- Throughput: Number of requests per second.

Consider SLI a speedometer. It tells you how fast you're going.
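As a rough illustration, the four example SLIs can be derived from a list of request records. This is a minimal sketch, not a monitoring implementation: the `Request` record, the `compute_slis` helper, and the sample data are all hypothetical, and the percentile uses the simple nearest-rank method.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(requests, window_seconds):
    """Return the four example SLIs for one observation window."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "availability_pct": 100 * (total - failed) / total,
        "error_rate_pct": 100 * failed / total,
        "p99_latency_ms": percentile([r.latency_ms for r in requests], 99),
        "throughput_rps": total / window_seconds,
    }


# Hypothetical sample: 1,000 requests in a 10-second window, 2 of them failed.
sample = [Request(latency_ms=50 + (i % 100), ok=(i not in (10, 20))) for i in range(1000)]
slis = compute_slis(sample, window_seconds=10)
print(slis)
```

In practice these numbers would come from your observability tooling rather than an in-memory list; the point is only that each SLI is a measurement, with no target attached yet.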

2. SLO (Service Level Objective)

SLO is a target threshold for SLIs used to measure service reliability. It defines the acceptable level of internal service performance.

Example SLOs:
- Availability: 99.95% uptime over the last 30 days.
- Latency: 95% of requests must complete in ≤200ms.
- Error Rate: Less than 0.01% of total requests should fail.

Consider SLO a speed limit. It defines how fast you should go.
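To make the SLI/SLO split concrete, here is a small sketch of checking the latency objective above against measured data. The function names and the sample latencies are assumptions made for illustration.

```python
def fraction_within(latencies_ms, threshold_ms):
    """SLI: share of requests that completed at or under the threshold."""
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli_value, target):
    """SLO check: the measured SLI must be at or above the target."""
    return sli_value >= target


# Hypothetical measurements: 96 fast requests and 4 slow ones.
latencies = [120.0] * 96 + [450.0] * 4
sli = fraction_within(latencies, threshold_ms=200)  # 96 of 100 requests are within 200 ms
latency_slo_met = meets_slo(sli, target=0.95)       # the objective is 95%
print(sli, latency_slo_met)
```

The SLI is the measured fraction; the SLO is only the comparison against the 95% target.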

3. SLA (Service Level Agreement)

SLA is a formal contract between a service provider and a customer that guarantees a specific level of service. It defines business consequences (e.g., refunds, penalties) if the service provider fails to meet the agreed service level.

Example SLA Clauses:
- Availability: "We guarantee 99.95% uptime. If uptime falls below this threshold, customers receive a 10% service credit."
- Response Time: "Critical issues will be acknowledged within 30 minutes and resolved within 4 hours."
- Support Levels: "24/7 support is available for enterprise customers."

Think of SLA as a legal contract. If you violate it, there are consequences.

How They Relate to Each Other

SLIs measure reality.
SLOs define what is acceptable.
SLAs define external commitments (usually looser than the internal SLOs, so the team has a buffer before a contractual breach).

Example in a Cloud Platform Engineering Context

Your team runs an internal Kubernetes platform for developers.
Your SLI is pod scheduling latency: the percentage of pods scheduled within 2 seconds (for example, a measured 99%).
Your SLO is 99.9% of pods scheduled within 2 seconds over 30 days.
Your SLA for external users (if applicable) guarantees 99.95% availability, and if breached, your company owes service credits.

Example: SLI vs. SLO vs. SLA in Managing an AWS Account

Let’s define SLI, SLO, and SLA with a practical AWS-related example.

Scenario: Ensuring EC2 Uptime & Performance

Your team manages EC2 instances running critical workloads in AWS.

Ensuring uptime, performance, and availability is a top priority.

1. SLI (Service Level Indicator) – What You Measure

SLI is a measured metric that reflects the system's performance. In this case, you decide to measure EC2 instance uptime.

Example SLI for EC2:

Availability SLI: Percentage of time EC2 instances remain running without unplanned downtime.
Latency SLI: Average response time of applications hosted on EC2 (e.g., API response time < 100ms).
Error Rate SLI: Number of failed health checks per hour.

Example Calculation of SLI (Availability):

If an EC2 instance was running for 86,391 seconds out of the 86,400 seconds in a day (24 hours), that is 9 seconds of downtime:
SLI (Availability) = (86391 / 86400) * 100 ≈ 99.99% uptime
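The availability arithmetic can be sketched as a small function. It is written in terms of downtime so that uptime can never exceed the length of the period; the 9 seconds of downtime is an assumed figure, and `availability_pct` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def availability_pct(downtime_seconds, period_seconds=SECONDS_PER_DAY):
    """Availability as the percentage of the period during which the instance was up."""
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime must be between 0 and the period length")
    uptime_seconds = period_seconds - downtime_seconds
    return 100 * uptime_seconds / period_seconds


# Assumed figure: 9 seconds of unplanned downtime in one day.
daily = availability_pct(downtime_seconds=9)
print(f"{daily:.2f}% uptime")
```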

2. SLO (Service Level Objective) – The Target You Set

SLO defines the acceptable performance level for your AWS infrastructure.

Example SLO for EC2:

Availability SLO: EC2 instances must have 99.95% uptime over a rolling 30-day period.
Latency SLO: API hosted on EC2 must respond in <200ms for 95% of requests.
Error Rate SLO: Less than 0.01% failed requests per month.

How SLO is Used:

If your actual SLI (availability) falls below 99.95%, your team investigates root causes and improves monitoring or auto-recovery mechanisms.
You configure CloudWatch alarms to notify the team if the availability drops below 99.95%.
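The logic behind a rolling 30-day availability check might look like the sketch below. It is plain Python that only illustrates the evaluation; it does not call CloudWatch, and the `RollingAvailability` class and the outage figures are hypothetical.

```python
from collections import deque

SECONDS_PER_DAY = 86_400


class RollingAvailability:
    """Tracks daily downtime and reports availability over the last N days."""

    def __init__(self, window_days=30, slo_pct=99.95):
        self.slo_pct = slo_pct
        self.downtime = deque(maxlen=window_days)  # the oldest day drops off automatically

    def record_day(self, downtime_seconds):
        self.downtime.append(downtime_seconds)

    def availability_pct(self):
        if not self.downtime:
            return 100.0  # no data yet, so nothing has been missed
        total = len(self.downtime) * SECONDS_PER_DAY
        return 100 * (total - sum(self.downtime)) / total

    def breached(self):
        return self.availability_pct() < self.slo_pct


tracker = RollingAvailability()
for _ in range(29):
    tracker.record_day(0)
tracker.record_day(1_800)  # hypothetical 30-minute outage on the latest day
print(f"{tracker.availability_pct():.3f}% breached={tracker.breached()}")
```

A 99.95% objective over 30 days leaves roughly 21.6 minutes of downtime, so a single 30-minute outage is enough to miss it until that day rolls out of the window.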

3. SLA (Service Level Agreement) – The Commitment to Customers

SLA is a formal agreement with internal or external customers regarding the performance of the service you run on AWS.

Example SLA for EC2-hosted Service:

Availability SLA: “We guarantee 99.9% uptime for our application. If uptime falls below this, customers receive a 10% refund on their monthly bill.”
Response SLA: “Our platform will acknowledge customer-reported issues within 30 minutes and resolve critical incidents within 4 hours.”

What Happens If SLA Is Breached?

Customers get service credits if EC2 uptime drops to 99.5%, violating the 99.9% SLA.
If your team fails to acknowledge a reported issue within 30 minutes, or to resolve a critical incident within 4 hours, it’s considered an SLA breach.
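A hedged sketch of how both commitments in the example SLA could be checked. The constants mirror the example clauses (99.9% uptime, 10% refund, 30-minute acknowledgement, 4-hour resolution); the function names, the bill amount, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

AVAILABILITY_SLA_PCT = 99.9          # uptime guaranteed in the example SLA
CREDIT_PCT = 10                      # refund on the monthly bill
ACK_LIMIT = timedelta(minutes=30)    # acknowledge within 30 minutes
RESOLVE_LIMIT = timedelta(hours=4)   # resolve critical incidents within 4 hours


def availability_credit(measured_pct, monthly_bill):
    """Credit owed for the month; zero when the availability SLA was met."""
    if measured_pct < AVAILABILITY_SLA_PCT:
        return monthly_bill * CREDIT_PCT / 100
    return 0.0


def response_breaches(reported, acknowledged, resolved):
    """List which response commitments were missed for one critical incident."""
    missed = []
    if acknowledged - reported > ACK_LIMIT:
        missed.append("acknowledge")
    if resolved - reported > RESOLVE_LIMIT:
        missed.append("resolve")
    return missed


# Hypothetical month: 99.5% uptime on a 2,000 bill, and one incident resolved late.
credit = availability_credit(measured_pct=99.5, monthly_bill=2_000.0)
reported = datetime(2025, 3, 1, 9, 0)
missed = response_breaches(
    reported,
    acknowledged=reported + timedelta(minutes=20),
    resolved=reported + timedelta(hours=5),
)
print(credit, missed)
```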

Strategic Considerations for Defining SLA, SLI, and SLO

Defining SLA, SLI, and SLO is more than just setting metrics.

It’s about aligning reliability with business objectives, optimizing cost, and improving developer productivity.

1. Factors to Consider While Defining SLI (Service Level Indicator)

SLI is about measuring the right thing, not everything.

Your SLIs should be tied to what impacts users, developer efficiency, and business outcomes.

Strategic Factors for SLI Definition:

Customer Impact – What does "good performance" mean for end users? Measure meaningful user-perceived latency, availability, and error rates.

Example: Measure API request success rate rather than only EC2 uptime, since instances can be running while requests fail.

Business Alignment – Ensure SLIs track KPIs relevant to the business.

Example: If downtime costs $50K/hour, uptime and failure recovery time should be key SLIs.

Engineering & Ops Relevance – Your team should be able to act on SLIs. Avoid vanity metrics that don't drive decisions.

Example: Track deployment failure rates and infrastructure drift detection.

Granularity & Aggregation – SLIs should be fine-grained for engineering teams but aggregated for leadership reporting.

Example: Measure P95 API response times instead of just averages.
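A quick sketch of why the percentile matters. The latency values are invented, and `p95` is a hypothetical helper using the nearest-rank method.

```python
import math
from statistics import mean


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: most requests are fast, a few are very slow.
latencies = [100] * 90 + [2_000] * 10

average = mean(latencies)  # pulled up by the slow requests, but matches no real request
tail = p95(latencies)      # what the slowest 5% of users actually experience
print(average, tail)
```

Here the average lands at 290 ms, a value no request actually had, while P95 surfaces the 2-second tail that one in ten users is hitting.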

Automated Observability – Ensure SLIs can be automatically monitored using tools like CloudWatch, Prometheus, Datadog, or OpenTelemetry.

Example: Set up alerts if error rates exceed 0.1% over a 5-minute window.
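The alert rule in that example can be sketched as a trailing-window counter. This is an illustration of the logic only; real alerting would normally be configured in the monitoring tool itself, and the `ErrorRateAlert` class and the traffic pattern below are hypothetical.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over a trailing time window exceeds a threshold."""

    def __init__(self, window_seconds=300, threshold=0.001):
        self.window_seconds = window_seconds  # 300 s == 5 minutes
        self.threshold = threshold            # 0.001 == 0.1%
        self.events = deque()                 # (timestamp, failed) pairs, oldest first
        self.failures = 0

    def record(self, timestamp, failed):
        self.events.append((timestamp, failed))
        self.failures += int(failed)
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            _, old_failed = self.events.popleft()
            self.failures -= int(old_failed)

    def firing(self):
        if not self.events:
            return False
        return self.failures / len(self.events) > self.threshold


alert = ErrorRateAlert()
for second in range(300):      # hypothetical traffic: 10 successful requests per second
    for _ in range(10):
        alert.record(second, failed=False)
for _ in range(4):             # then 4 failures in the final second
    alert.record(299, failed=True)
is_firing = alert.firing()     # 4 failures out of 3,004 requests is above 0.1%
print(is_firing)
```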

2. Factors to Consider While Defining SLO (Service Level Objective)

SLO is about setting reliability targets that balance user expectations and engineering efficiency.

SLOs must be set to optimize developer velocity, incident management, and cost.

Strategic Factors for SLO Definition:

User Expectations vs. Cost – Higher reliability = higher cost. Define SLOs that meet business needs without over-engineering.

Example: Instead of 99.999% uptime, a 99.95% SLO might be good enough while cutting infra costs significantly (for instance, by 40%).

Error Budgeting – Define an acceptable level of failure to avoid burnout and overreaction to minor incidents.

Example: If SLO is 99.9% uptime, allow 43.2 minutes of downtime per 30-day month (0.1% of 43,200 minutes).
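The error budget arithmetic is simple enough to sketch directly. The helper names are hypothetical, and a 30-day month is assumed.

```python
def error_budget_minutes(slo_pct, period_days=30):
    """Downtime allowed in the period before the SLO is missed."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - slo_pct) / 100


def budget_remaining_minutes(slo_pct, downtime_minutes, period_days=30):
    """Budget left after the downtime already incurred (negative means overspent)."""
    return error_budget_minutes(slo_pct, period_days) - downtime_minutes


budget = error_budget_minutes(99.9)                              # about 43.2 minutes
remaining = budget_remaining_minutes(99.9, downtime_minutes=30)  # about 13.2 minutes
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
```

Tracking the remaining budget, rather than reacting to each incident, is what lets a team decide calmly whether to keep shipping or to pause and work on reliability.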

Developer Experience – SLOs should not only focus on end users but also improve internal developer workflows.

Example: "90% of infrastructure deployments should be complete within 5 minutes."

Dynamically Tuned SLOs – Instead of static SLOs, adjust them based on seasonality, scaling needs, and user behavior.

Example: Set stricter SLOs during peak usage (Black Friday) vs. lower ones during off-peak hours.

Accountability & Ownership – Ensure teams own SLOs and they are part of OKRs and performance reviews.

Example: Each product team has its own SLO review process tied to reliability goals.

3. Factors to Consider While Defining SLA (Service Level Agreement)

SLA is about managing business risk, setting customer expectations, and defining financial consequences.

SLAs must align with business contracts and revenue models.

Strategic Factors for SLA Definition:

Revenue Protection – SLAs should balance service quality with financial liability.

Example: If downtime costs $100K/hour, the SLA should penalize breaches while allowing sustainable operations.

Competitive Positioning – Research industry benchmarks (AWS, GCP, Azure) to ensure your SLAs are competitive but feasible.

Example: AWS offers a 99.99% Region-level SLA for EC2 (its instance-level commitment is lower), so your SLA should be set with such benchmarks in mind.

Tiered SLAs for Different Customers – Enterprise customers may need a 99.99% SLA, while SMBs may be fine with 99.9%.

Example: Offer 24/7 premium support for enterprise clients vs. business-hours-only support for lower-tier customers.

Defining Financial Consequences – Set clear refund/penalty policies for SLA breaches.

Example: If uptime drops below 99.9%, customers get a 10% service credit.
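Tiered SLAs and their financial consequences can be captured in a small lookup. This is a sketch under stated assumptions: the tier names follow the enterprise/SMB split above, applying the same 10% credit to both tiers is an assumption, and the bill amounts are invented.

```python
# Hypothetical tiers based on the figures above; the 10% credit is assumed for both.
SLA_TIERS = {
    "enterprise": {"uptime_pct": 99.99, "credit_pct": 10},
    "smb": {"uptime_pct": 99.9, "credit_pct": 10},
}


def service_credit(tier, measured_uptime_pct, monthly_bill):
    """Credit owed to a customer on the given tier for one billing month."""
    terms = SLA_TIERS[tier]
    if measured_uptime_pct < terms["uptime_pct"]:
        return monthly_bill * terms["credit_pct"] / 100
    return 0.0


# The same month (99.95% measured uptime) breaches one tier but not the other.
enterprise_credit = service_credit("enterprise", 99.95, monthly_bill=5_000.0)
smb_credit = service_credit("smb", 99.95, monthly_bill=500.0)
print(enterprise_credit, smb_credit)
```

Running the numbers like this before signing is a cheap way to see how much liability a stricter tier adds for the same underlying reliability.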

Legal & Compliance Considerations – Ensure SLAs cover security, regulatory, and compliance expectations (SOC 2, HIPAA, FedRAMP).

Example: Define incident response times and data recovery SLAs for regulated industries.


Final Thoughts

Defining SLIs, SLOs, and SLAs isn’t just about setting reliability metrics.

It’s about aligning engineering efforts with business outcomes.

By implementing a structured framework, you can:

✅ Measure what matters: Focus on SLIs that reflect real user experiences.

✅ Set realistic objectives: SLOs should optimize uptime while maintaining operational efficiency.

✅ Define clear commitments: SLAs should protect the business and customers while remaining achievable.

✅ Automate monitoring & alerting: Proactive observability ensures your team can act before problems escalate.

✅ Continuously improve: Quarterly SLO reviews help adjust targets based on real-world usage and feedback.

That’s it for today!


Until next week — Amrut
