
Zero Downtime Deployment: Strategies for Never Stopping Production

abemon | 12 min read | Written by practitioners

The cost of “we cannot deploy right now”

There is a phrase that kills delivery velocity for any engineering team: “let’s wait for the weekend to deploy.” What the phrase really says is that deploying is risky, that the team does not trust the process, and that the maintenance window has become a crutch.

The cost of not being able to deploy whenever you want is invisible but cumulative. Features that wait days. Urgent fixes that require approval ceremonies. Teams that batch changes into large releases because deploying is expensive, which makes each release more dangerous, which reinforces the belief that deploying is risky. A vicious cycle.

Zero downtime deployment breaks that cycle. If deploying is safe, you can deploy frequently. If you deploy frequently, each deployment is small. If each deployment is small, risk is low. And if risk is low, you do not need the maintenance window.

The three fundamental strategies

Blue-green

Blue-green is the simplest strategy to understand and the most expensive in infrastructure. You maintain two identical environments: blue (current production) and green (the new version). You deploy the new version to green, verify it, and when ready, redirect traffic from blue to green. If something fails, you redirect back to blue. Rollback is instantaneous.

The advantage is the simplicity of the mental model. There is always a clean version to return to. No intermediate states, no version mixing.

The cost is that you need double the infrastructure. In cloud, this is manageable: spin up the green environment, deploy, verify, redirect, destroy the old environment. With Kubernetes, blue-green is implemented with two Deployments and a Service that points to one or the other. With Railway or similar PaaS services, it is even simpler: deploy to an alternative service slot and switch at the DNS or load balancer level.
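The Kubernetes variant just described can be sketched as a single Service whose selector decides which Deployment receives traffic. The names, labels, and ports below are illustrative assumptions, not a prescribed layout:

```yaml
# Blue-green sketch: two Deployments (app-blue, app-green) carry the
# labels {app: myapp, slot: blue} and {app: myapp, slot: green}.
# This one Service routes all traffic to whichever slot it selects.
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  selector:
    app: myapp
    slot: blue   # change to "green" to cut over; back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Cutting over is then a one-field change to the selector, for example `kubectl patch service app -p '{"spec":{"selector":{"slot":"green"}}}'`, and rollback is the same patch in reverse.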

The real problem with blue-green is not infrastructure. It is the database. If the new version requires schema changes, both versions must work with the same schema. This imposes a migration discipline that we cover below.

Rolling

Rolling deployment updates instances incrementally. If you have 10 pods, update 2, verify, update 2 more, and so on. At all times, instances of both the old and new version serve traffic simultaneously.

In Kubernetes, a rolling update is the default strategy of a Deployment. The maxSurge and maxUnavailable parameters control how many instances are updated simultaneously. Setting maxSurge: 1 and maxUnavailable: 0 guarantees you always have at least full capacity: a new instance comes up before an old one is killed.
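As a minimal sketch, that configuration lives under `spec.strategy` of the Deployment; the name, labels, and image here are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  selector:
    matchLabels: {app: myapp}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra pod above the desired count
      maxUnavailable: 0  # never drop below full capacity during the rollout
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
        - name: app
          image: registry.example.com/app:v2  # hypothetical image
```

With these values, Kubernetes brings up one new pod, waits for it to pass its readiness probe, terminates one old pod, and repeats until all ten are replaced.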

The advantage is resource efficiency. You do not need double the infrastructure. The deployment is gradual and if there is a problem, it affects a fraction of traffic.

The complexity is that during deployment, two versions coexist. If the new version changes the format of a message in a queue, the old version will not know how to interpret it. If the new version changes an internal API, old instances will still be calling it with the previous contract. Backward compatibility between consecutive versions is mandatory.

Rollback is slower than blue-green. You have to redeploy the previous version incrementally. In Kubernetes, kubectl rollout undo automates this, but it takes as long as a full rolling update.

Canary

Canary is the most sophisticated strategy and offers the best balance between safety and speed. You deploy the new version to a small subset of traffic (the canary: typically 5-10%), monitor metrics, and if everything looks good, increase gradually to 100%.

The key difference from rolling is that in canary you control what percentage of traffic receives the new version and can hold that proportion as long as needed. In rolling, the percentage advances automatically.

Implementation requires an intelligent routing component. In Kubernetes, Istio or Linkerd provide weight-based traffic splitting. In AWS, ALB with weighted target groups. In Cloudflare, Workers with routing logic. The tool varies, but the concept is the same: a configurable percentage of traffic goes to the canary.
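With Istio, for example, the weight-based split can be expressed as a VirtualService like the following sketch. It assumes a companion DestinationRule defining the `stable` and `canary` subsets, and all names are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts:
    - app
  http:
    - route:
        - destination:
            host: app
            subset: stable   # defined in a DestinationRule (not shown)
          weight: 95
        - destination:
            host: app
            subset: canary
          weight: 5          # the canary share; raise gradually to 100
```

Promoting the canary is then just editing the two weights, which a controller or pipeline can do without redeploying anything.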

Monitoring during the canary phase is critical. You need metrics that compare the canary against the baseline in real time: error rates, latency, business metrics. If the canary shows an error rate 0.5% higher than the baseline, that is a warning signal. Tools like Prometheus with comparison queries or Datadog with canary dashboards facilitate this comparison.

The most important point: automate the decision to promote or revert the canary. If metrics stay within thresholds for a defined period, the canary promotes automatically. If any metric crosses a threshold, it reverts automatically. Argo Rollouts in Kubernetes implements exactly this pattern with analysis templates.
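As a hedged sketch of that pattern, an Argo Rollouts AnalysisTemplate can query Prometheus and fail the canary automatically; the metric names, Prometheus address, and threshold below are assumptions for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1                    # one breach aborts and rolls back
      successCondition: result[0] < 0.01 # promote only while errors stay under 1%
      provider:
        prometheus:
          address: http://prometheus:9090   # assumed in-cluster address
          query: |
            sum(rate(http_requests_total{app="myapp",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{app="myapp"}[5m]))
```

A Rollout resource references this template at each canary step, so the promote-or-revert decision is made by the controller, not by a human watching dashboards.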

Database migrations without downtime

This is where most teams stumble. You can have the best deployment pipeline in the world, but if your database migration requires an ALTER TABLE that locks the table for 45 seconds, you have 45 seconds of service degradation.

The fundamental rule is: never make a schema change that is incompatible with the current version of the code. This is achieved with the expand-contract pattern:

Phase 1: Expand. Add the new column, table, or index without removing anything. Current code keeps working because nothing it uses has changed. The new column can have a default value or be nullable.

Phase 2: Migrate. Deploy the new code version that writes to the new structure in addition to the old one. If it is a new column, the code writes to both. If it is a rename, it writes to the old column and the new one. Existing data is migrated in background (backfill).

Phase 3: Contract. Once the new version is deployed at 100% and historical data has been migrated, remove the old structure. This phase happens in a separate, subsequent deployment.

Three phases. Three deployments. Each backward compatible. No locks.
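To make the three phases concrete, here is a sketch in PostgreSQL for renaming a column, using a hypothetical `users` table with an `email` column being renamed to `contact_email`:

```sql
-- Phase 1: Expand (deploy 1). Add the new column; current code ignores it.
ALTER TABLE users ADD COLUMN contact_email text;  -- nullable, no table rewrite

-- Phase 2: Migrate (deploy 2). New code writes both columns; historical
-- rows are backfilled in small batches to avoid long-held locks.
UPDATE users SET contact_email = email
WHERE contact_email IS NULL AND id BETWEEN 1 AND 10000;  -- repeat per batch

-- Any index on a large table is built without blocking writes:
CREATE INDEX CONCURRENTLY idx_users_contact_email ON users (contact_email);

-- Phase 3: Contract (deploy 3). Only after the new version serves 100%
-- of traffic and the backfill is complete.
ALTER TABLE users DROP COLUMN email;
```

Each statement group ships in its own deployment, and at every point the schema works for both the running version and its predecessor.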

For ALTER TABLE in PostgreSQL, the situation has improved in recent versions. Since PostgreSQL 11, adding a column with a default value no longer requires rewriting the table under an exclusive lock. But creating an index on a multi-million-row table still blocks writes unless you use CREATE INDEX CONCURRENTLY. For MySQL, Percona's pt-online-schema-change performs online migrations by creating a shadow table, replicating changes, and doing an atomic swap.

Tools like Flyway or Liquibase manage migration versioning. What they do not manage is the expand-contract logic. That requires team discipline.

Session handling

If your application stores sessions in server memory, a rolling deployment kills the sessions of users whose server gets recycled. This is unacceptable for most applications.

The solution is to externalize sessions. Redis is the de facto standard. Sessions are stored in Redis with an appropriate TTL, and any application instance can read them. When a pod is destroyed during a rolling update, the user is redirected to another pod that reads their session from Redis without interruption.

For applications using JWT, the problem is smaller. Tokens are self-contained and do not depend on server state. But you must manage signing key rotation. If the new version uses a different signing key, tokens issued by the previous version will be invalid. The solution is keeping both keys active during a transition period and verifying against both.
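The dual-key verification can be sketched in a few lines. This uses a raw HMAC-SHA256 signature as a stand-in for a JWT's HS256 signature; the key names and payload are illustrative:

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> bytes:
    # HMAC-SHA256, standing in for a JWT HS256 signature
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(payload: bytes, signature: bytes, active_keys: list[bytes]) -> bool:
    # Accept tokens signed with any currently active key, so tokens issued
    # by the previous version stay valid during the key transition
    return any(hmac.compare_digest(sign(payload, k), signature)
               for k in active_keys)


OLD_KEY, NEW_KEY = b"old-secret", b"new-secret"
claims = b'{"user": 42}'
token_sig = sign(claims, OLD_KEY)  # token issued before the deployment

# The new version signs with NEW_KEY but verifies against both keys
assert verify(claims, token_sig, [NEW_KEY, OLD_KEY])
```

Once all tokens signed with the old key have expired (one TTL after the cutover), the old key can be dropped from the active list.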

Sticky sessions (server affinity) are a tempting but fragile solution. They work until the server the user is “stuck to” gets recycled. In a zero downtime deployment, that happens by definition. Sticky sessions and zero downtime are fundamentally incompatible.

Rollback: the plan you hope not to need

Every deployment needs a rollback plan. The plan is not “revert and pray.” It is a documented procedure that answers: what metrics trigger the rollback, who authorizes it, how it is executed, and how long it takes.

In blue-green, rollback is redirecting traffic to the previous environment. Seconds.

In rolling, rollback is a rolling update back to the previous version. Minutes.

In canary, rollback is redirecting 100% of traffic to the stable version. Seconds, if routing is configurable in real time.

The hard case is when rollback involves reverting a database migration. If you followed the expand-contract pattern, this is not a problem: the previous version works with the expanded schema. If you did not, you are in dangerous territory. Reverting an ALTER TABLE that dropped a column with data is, at best, a restore from backup. At worst, data loss.

The rule is simple: if the migration is not reversible, the deployment is not reversible, and you need more caution (extended canary, manual verification, feature flags).

Feature flags as a multiplier

Feature flags decouple code deployment from feature release. You can deploy code containing a new feature that is disabled by default, verify the deployment is stable, and activate the feature gradually.

This enables a powerful pattern: deploy first, activate later. Deployment is mechanical and automatable. Activation is a product decision that can be made without touching infrastructure.

Tools like LaunchDarkly, Unleash, or even a simple JSON in Redis provide the feature flag infrastructure. What matters is the discipline to use them for any user-visible behavior change and to clean them up afterward (eternal feature flags are technical debt).

Health checks and readiness probes

A detail that seems minor but has direct impact on deployment quality: health checks must differentiate between liveness and readiness.

Liveness answers “is the process alive?” If the liveness check fails, the orchestrator kills the pod and creates a new one. This check should be lightweight: an endpoint that returns 200 if the process responds. It should not depend on external services (database, cache, APIs) because a database problem is not solved by killing pods.

Readiness answers “can it receive traffic?” A pod can be alive but not ready: it is loading cache, waiting for database connections, warming models. While the readiness check fails, the Kubernetes Service does not send traffic to that pod. This is critical during a rolling deployment: new pods do not receive traffic until they are genuinely prepared.

The classic mistake is using the same endpoint for both checks, or worse, not configuring readiness probes at all. The result is that during a rolling update, traffic reaches pods that are not yet ready, generating 503 errors for a percentage of users. Technically it is not downtime. For the user who sees an error, it is.

In practice, we configure the liveness probe with an initial delay of 10-15 seconds and a period of 10 seconds. The readiness probe with an initial delay of 5 seconds and a period of 5 seconds. Exact values depend on the application, but the key is that readiness should be more frequent and more strict.
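Those values translate to a container spec like the following sketch; the endpoint paths and port are assumptions about the application:

```yaml
# Probe configuration for a container in a Deployment's pod template
livenessProbe:
  httpGet: {path: /healthz, port: 8080}  # process-level only: no DB or cache checks
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  httpGet: {path: /ready, port: 8080}    # fails until caches and connections are warm
  initialDelaySeconds: 5
  periodSeconds: 5                       # more frequent than liveness
  failureThreshold: 2                    # stricter: drop from the Service quickly
```

During a rolling update, a pod failing `/ready` simply receives no traffic; only a failing `/healthz` gets it restarted.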

Post-deployment smoke tests

After each deployment, before declaring success, a set of smoke tests verifies that critical functionality is operational. These are not exhaustive tests. They are the 5-10 tests that cover the highest-value flows: login, product search, checkout, order processing.

Smoke tests should run automatically as part of the pipeline. If they fail, the deployment reverts automatically. If they pass, the deployment is marked as successful and the team is notified.

A useful smoke test is fast (under 60 seconds total), reliable (no flaky tests generating false positives), and meaningful (tests real functionality, not health check endpoints). Cypress, Playwright, or even a curl script verifying HTTP responses fulfills this purpose.
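The pipeline glue around those tests can be as small as a runner that executes named checks and reports a single pass/fail verdict. This is a minimal sketch; in a real pipeline each lambda would issue an HTTP request against the freshly deployed environment:

```python
import sys
from typing import Callable


def run_smoke_tests(checks: dict[str, Callable[[], bool]]) -> bool:
    # Run every named check; any single failure marks the deployment as bad
    failed = [name for name, check in checks.items() if not check()]
    for name in failed:
        print(f"SMOKE FAIL: {name}", file=sys.stderr)
    return not failed


# Hypothetical checks standing in for real login/search/checkout probes
result = run_smoke_tests({
    "login": lambda: True,
    "search": lambda: True,
})
print("deployment verified" if result else "triggering rollback")
```

The pipeline then keys its promote-or-rollback step off the runner's exit status, keeping the decision automatic.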

Putting it all together

A mature deployment pipeline combines these elements:

  1. Automated tests as an entry gate. If tests fail, deployment does not proceed.
  2. Canary deployment at 5% of traffic with automated metric analysis.
  3. Gradual promotion (25%, 50%, 100%) with automated gates based on error rate and latency.
  4. Automatic rollback if any metric crosses defined thresholds.
  5. Feature flags for behavior changes that require controlled activation.
  6. Database migrations always backward compatible using expand-contract.

You do not need to implement everything from day one. Start with rolling deployment and compatible migrations. Add canary when traffic volume justifies the investment in observability. Add feature flags when deployment cadence is daily or higher.

The goal is not technical perfection. It is being able to deploy on a Tuesday at 11 AM without anyone losing sleep. When you reach that point, delivery velocity multiplies and quality improves because changes are small, incremental, and reversible. For a deeper dive into canary deployments and feature flags, see our article on testing in production. If you do not yet have an automated pipeline, start with our guide on CI/CD for teams without DevOps.

About the author

abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.