SLA design: how to build service level agreements that work

abemon | 11 min read

Why most SLAs are worthless

Open the SLA from any technology service provider and you will probably find something like: “99.9% availability.” Sounds solid. But what does it actually mean?

It means the service can be down 8 hours and 45 minutes per year. That is over 43 minutes per month. Seems manageable, until those 43 minutes fall in the middle of a sales peak, a financial close, or a critical logistics operation. Then you discover that the SLA does not specify how availability is measured, that scheduled maintenance windows are excluded, that “partial degradation” incidents do not count, and that the maximum penalty is two months of service credit that you will never claim because the process is deliberately cumbersome.

A well-designed SLA is not a legal document to file away. It is an operational contract that defines expectations, aligns incentives, and creates an accountability mechanism. We have designed (and operated under) dozens of SLAs for our managed services. Here is what we have learned.

The hierarchy: SLI, SLO, SLA

Before designing an SLA, you need to understand the three layers of the hierarchy. Google SRE popularized these concepts, and for good reason: they separate measurement from commitment.

SLI (Service Level Indicator). The concrete metric you measure. Example: “percentage of HTTP requests that respond in under 500ms with a 2xx status code.” An SLI is a number that comes from your monitoring system. It has no opinion; it is a fact.

SLO (Service Level Objective). The internal target you set for an SLI. Example: “99.5% of requests must meet the latency/success SLI.” An SLO is an engineering decision. Too high, and you overspend on redundancy. Too low, and users complain.

SLA (Service Level Agreement). The contractual commitment to the customer, based on one or more SLOs, with defined consequences for non-compliance. Example: “if monthly availability falls below 99.5%, the customer receives a 10% credit on the monthly invoice.”

The fundamental rule: your SLO must be stricter than your SLA. If your SLA guarantees 99.5%, your internal SLO should be 99.8% or 99.9%. The margin between the two is your error budget: the room you have for failures, maintenance, and experiments before breaching the contractual commitment.
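The error-budget arithmetic is worth making concrete. A minimal sketch, assuming a 30-day month and using the 99.5%/99.9% figures from the example above:

```python
def downtime_budget_minutes(availability_target: float,
                            period_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of downtime allowed per period at a given availability target
    (a 30-day month is assumed)."""
    return period_minutes * (1 - availability_target / 100)

sla_budget = downtime_budget_minutes(99.5)  # contractual ceiling: ~216 min/month
slo_budget = downtime_budget_minutes(99.9)  # internal target: ~43 min/month
error_margin = sla_budget - slo_budget      # room for failures before a contractual breach
```

The margin (~173 minutes per month here) is what you can spend on maintenance and experiments while still honoring the contract.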

Metric selection: what to measure

The most common mistake in SLA design is measuring what is easy to measure rather than what matters to the customer. Server uptime is not the same as service availability to the end user. A server can be “up” at 100% while the user sees a 503 error because the load balancer is misconfigured.

The metrics we recommend for web service SLAs:

Availability (success rate). Percentage of requests that respond correctly (HTTP 2xx or 3xx). Measured from the user’s perspective, not from the server. This requires synthetic monitoring (external checks) in addition to internal metrics. We use tools like Uptime Robot, Checkly, or our own check system to measure from multiple locations.

Latency. Response time measured in percentiles. The mean is useless because it hides outliers. We use p50 (median), p95, and p99. The SLA should define: “p95 latency will not exceed 500ms.” This means 95% of requests respond in under 500ms. The remaining 5% can be slower without breaching the agreement.
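Percentiles are cheap to compute directly from raw samples. A minimal nearest-rank sketch (the sample latencies are invented for illustration):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least p%
    of samples are at or below it."""
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[max(k, 0)]

latencies_ms = [120, 180, 210, 250, 300, 340, 420, 480, 510, 900]
p50 = percentile(latencies_ms, 50)  # median
p95 = percentile(latencies_ms, 95)  # the value the SLA clause would be checked against
```

Note how the single 900ms outlier barely moves the median but dominates the tail percentiles, which is exactly why the mean is useless here.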

Data durability. For services that store data, the probability of not losing data. AWS S3 offers 99.999999999% (eleven nines). Your service probably does not need that, but the expectation must be defined. Do not confuse availability with durability: data can be available at 99.9% (you can access it almost always) but have a durability of 99.999% (it is almost impossible to lose).

Incident response time. Not how long the system takes to recover, but how long the team takes to start working on the problem. Defined by severity: P1 (service down) — response within 15 minutes. P2 (degradation) — response within 1 hour. P3 (minor issue) — response within 4 business hours.

Measurement methodology

How you measure matters as much as what you measure. An SLA without a defined measurement methodology is an invitation to conflict.

Measurement window. Monthly is the standard. Annual smooths out problems too much (you can have a disastrous month and still meet the annual SLA). Weekly is too granular and creates noise.

Exclusions. Explicitly define what does not count: scheduled maintenance (with minimum 48-hour advance notice), incidents caused by the customer (configuration changes, unannounced traffic spikes), force majeure. Exclusions must be specific, not generic. “Circumstances beyond our control” is broad enough to invalidate the SLA entirely.

Data source. Who provides the measurement data. Ideally, a system independent of the provider. If the provider measures its own availability, there is an obvious conflict of interest. Third-party monitoring tools (Datadog, New Relic, or independent external synthetic checks) solve this.

Calculation. The exact formula. For example:

Monthly availability (%) = ((Total minutes in month - Downtime minutes) / Total minutes in month) * 100

Where “downtime” is defined as: any period exceeding 1 minute during which more than 5% of synthetic requests from at least 2 external locations fail.
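That formula and downtime definition translate almost directly into code. A sketch, where the `(location, success)` tuple representation of synthetic check results is an assumption about how your monitoring might store them:

```python
def monthly_availability(total_minutes: int, downtime_minutes: float) -> float:
    """Availability (%) = ((total minutes - downtime minutes) / total minutes) * 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100

def is_downtime(window: list[tuple[str, bool]]) -> bool:
    """window: (location, check_succeeded) synthetic results for a period over 1 minute.
    Downtime: more than 5% of requests fail, from at least 2 external locations."""
    failed_locations = {loc for loc, ok in window if not ok}
    failure_rate = sum(1 for _, ok in window if not ok) / len(window)
    return failure_rate > 0.05 and len(failed_locations) >= 2
```

Encoding the definition this literally is the point: the same function that bills the credit is the one both parties can read.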

The more specific, the less ambiguity. The less ambiguity, the fewer conflicts.

Penalty structure

Penalties must create real incentives. A 5% credit on a service that costs EUR 500 per month does not incentivize anyone to lose sleep on a Saturday. But they should not be so aggressive that the provider cannot operate sustainably.

A tiered model that works:

| Monthly availability | Credit |
| --- | --- |
| >= 99.5% | 0% |
| 99.0% - 99.49% | 10% |
| 98.0% - 98.99% | 25% |
| 95.0% - 97.99% | 50% |
| < 95.0% | 100% + right to terminate |
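A tier table like this can be encoded as a simple lookup. A sketch of the model above (the termination right below 95% lives in the contract, not in code):

```python
def credit_percent(monthly_availability: float) -> int:
    """Map measured monthly availability to the credit tier."""
    tiers = [(99.5, 0), (99.0, 10), (98.0, 25), (95.0, 50)]
    for floor, credit in tiers:
        if monthly_availability >= floor:
            return credit
    return 100  # below 95.0%: full credit (plus the contractual right to terminate)
```

Having the tiers in code is what makes the "credits must be automatic" rule below feasible in practice.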

For latency or response time incidents, the model is similar but based on the number of monthly breaches.

Two critical rules:

Credits must be automatic. If the customer has to file a claim to receive the credit, 90% will not bother and the SLA becomes theater. In our managed services, credits are applied automatically on the next invoice when the system detects a breach.

The SLA must include a termination mechanism. If the provider breaches repeatedly (for example, 3 of the last 6 months below the SLA), the customer should be able to terminate the contract without penalty. This protects the customer from a provider that prefers paying credits to investing in improvement.

The nines are misleading

A note on the obsession with availability “nines.” The difference between 99.9% and 99.99% is an order of magnitude in engineering and cost, but not always an order of magnitude in business value.

| Level | Annual downtime | Monthly downtime |
| --- | --- | --- |
| 99% | 3.65 days | 7.3 hours |
| 99.5% | 1.83 days | 3.65 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
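Every value in that table follows from one formula; this sketch reproduces it (a 365-day year is assumed):

```python
def downtime_allowance(availability_pct: float) -> tuple[float, float]:
    """Return (annual, monthly) downtime in minutes for an availability level,
    assuming a 365-day year."""
    annual_min = 365 * 24 * 60 * (1 - availability_pct / 100)
    return annual_min, annual_min / 12

for level in (99.0, 99.5, 99.9, 99.95, 99.99):
    annual, monthly = downtime_allowance(level)
    print(f"{level}%: {annual:.1f} min/year, {monthly:.1f} min/month")
```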

Going from 99.9% to 99.99% requires geographic redundancy, automatic failover, zero-downtime deployments, and an infrastructure investment that can triple the cost. Before demanding (or promising) a fourth nine, ask yourself: do 39 fewer minutes of monthly downtime justify doubling or tripling the bill?

For most business applications (ERPs, CRMs, corporate websites, internal tools), 99.5% to 99.9% is the correct range. Four nines are reserved for critical infrastructure: payment gateways, healthcare systems, trading platforms.

Internal SLAs

SLAs are not just for vendor relationships. Internal SLAs between teams (platform to product, data to analytics, IT to operations) create the same dynamics of accountability and alignment.

A platform team operating under an internal SLO of 99.9% for the authentication service behaves differently than one that simply “tries not to let it go down.” The SLO creates an error budget that enables explicit decisions: if we have consumed 30% of the budget by mid-month, perhaps this is not the right time for that aggressive refactoring.
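That mid-month decision can be made mechanical. A minimal sketch; the 30% threshold echoes the example above, and the 15 minutes of accumulated downtime is an invented figure:

```python
def budget_consumed(downtime_minutes: float, slo_pct: float,
                    period_minutes: int = 30 * 24 * 60) -> float:
    """Fraction of the period's error budget already spent (30-day month assumed)."""
    budget = period_minutes * (1 - slo_pct / 100)
    return downtime_minutes / budget

# Hypothetical mid-month gate before risky work (15 min of downtime is an invented figure):
if budget_consumed(15, slo_pct=99.9) > 0.30:
    print("error budget running low: postpone the aggressive refactoring")
```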

The format can be lighter than an external SLA (no financial penalties, for example), but the essential elements are the same: metric, objective, measurement, and periodic review.

Review and continuous improvement

An SLA is not a static document. It should be reviewed at least quarterly with real data. The key questions:

  • Has the SLA been met? If not, why? Was it a one-off incident or a pattern?
  • Is the internal SLO strict enough? If you never approach the limit, it may be too conservative and you are overinvesting.
  • Do the SLA metrics reflect the actual user experience? An availability SLA at 99.9% means nothing if the service is unusably slow without actually going down.
  • Are penalties proportional? Too low and they do not incentivize. Too high and they create an adversarial relationship.

To dive deeper into building the operational culture that makes SLA invocations rare, see our article on continuous improvement with digital Kaizen. And to understand the economic impact of choosing the right pricing model, our guide on cost reduction with managed services breaks down the numbers. The best SLA is the one you never need to invoke — not because it is poorly designed, but because the metrics, internal SLOs, and the provider’s operational culture make breaches rare and communication proactive when they occur.

About the author

abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.