Kubernetes in Production: Lessons from 3 Years Operating Clusters
The question you should ask first
Before discussing how to operate Kubernetes, there is a question too many teams skip: should you use Kubernetes at all?
The answer is not automatically yes. K8s solves a specific problem: orchestrating containerized workloads at scale, with high availability, declarative deployments, and autoscaling. If you have 3 services with predictable traffic, Kubernetes is probably overengineering. A pair of VMs with Docker Compose and a load balancer gives you 80% of the benefit at a fraction of the complexity.
Our rule of thumb: Kubernetes starts justifying itself at 8-10 services, or when you need real autoscaling, or when your availability requirements demand systematic zero-downtime deployments. Below that, simpler alternatives (ECS, Cloud Run, even Railway or Render) cover most needs with less operational overhead.
That said, when K8s is the right answer, it is an extraordinary tool. Here is what three years of operating it taught us.
Cluster sizing: the day-one mistake
The first cluster we helped put into production was over-provisioned by 300%. Three m5.2xlarge nodes (8 vCPU, 32GB RAM each) for a workload that fit comfortably on a single m5.xlarge. The reason was the classic one: “just in case we need to scale.”
The problem with over-provisioning in the cloud is not just cost (though it is: those three nodes cost EUR 850/month and the actual workload needed EUR 280/month). The problem is that it masks configuration errors. With 96GB of RAM, you do not notice a pod leaking 500MB of memory. With 24 vCPUs, you do not notice a service consuming 4 CPUs from an inefficient loop. Excess resources hide problems that explode when you actually need to scale.
What works: start with right-sized nodes, enable the Cluster Autoscaler from day one, and use nodes of different sizes. A combination of small nodes for base load and medium/large nodes that the autoscaler adds and removes based on demand. In EKS, Karpenter has replaced the classic Cluster Autoscaler and does this much better: it automatically selects the optimal instance type for pending pods.
Requests and limits: the most important and most ignored configuration. Requests determine how much CPU and memory the scheduler guarantees to a pod. Limits determine the maximum a pod can consume. Without requests, the scheduler cannot make good placement decisions. Without limits, a pod with a leak can take down an entire node.
The rule we follow: requests at 70-80% of observed average consumption, limits at 150-200% of observed peak. And we review these figures quarterly, because consumption changes with every release.
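Applied to a hypothetical service named api, with illustrative numbers derived from that rule, the resources block looks like this:

```yaml
# Illustrative values only; derive yours from observed consumption.
# Assumes average ~330m CPU / ~300Mi memory, peak ~600m CPU / ~350Mi memory.
containers:
  - name: api
    image: registry.example.com/api:1.0.0   # hypothetical image
    resources:
      requests:
        cpu: "250m"       # ~75% of observed average
        memory: "256Mi"
      limits:
        cpu: "1"          # ~170% of observed peak
        memory: "512Mi"
```

The requests drive scheduling decisions; the limits are the safety net against leaks and runaway loops.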
Networking: where the dragons live
Kubernetes networking is the source of more incidents in our experience than any other area. Not because it is poorly designed, but because it is complex and most teams do not understand it deeply enough.
CNI (Container Network Interface). The CNI choice matters more than it appears. Calico is the most common and most versatile. Cilium is gaining ground fast thanks to eBPF, which gives it superior visibility and performance. AWS VPC CNI assigns real VPC IPs to pods, which simplifies integration with AWS services but limits the number of pods per node (the cap depends on the instance type and how many ENIs it supports).
We had a production incident caused by hitting the subnet IP limit on a cluster with VPC CNI. Pods sat in Pending indefinitely and the error message was cryptic. Since then, we always size subnets with a 3x margin over expected capacity.
Ingress. Nginx Ingress Controller remains the standard for most workloads. But watch the sizing: a single ingress controller with defaults is a single point of failure. We run two or three replicas with anti-affinity to distribute them across different nodes, and we monitor ingress latency as a primary metric.
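A sketch of the anti-affinity stanza in the controller's pod spec, assuming the pods carry the chart's standard app.kubernetes.io/name: ingress-nginx label:

```yaml
# Force ingress controller replicas onto distinct nodes
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: ingress-nginx
        topologyKey: kubernetes.io/hostname
```

With requiredDuringScheduling, a replica that cannot land on a separate node stays Pending rather than co-locating; use the preferred variant if your node count is tight.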
Service mesh. Istio remains powerful but heavy. For teams that need network observability and mTLS between services, Linkerd is lighter and easier to operate. For most mid-market cases we see, a full service mesh is not needed. Native Kubernetes Network Policies cover basic microsegmentation.
DNS. CoreDNS is the invisible component that causes visible problems. In clusters with hundreds of pods, DNS resolution can become a bottleneck. Two adjustments we always apply: ndots:2 in pod configuration (to reduce unnecessary queries) and NodeLocal DNSCache (a local DNS cache on each node).
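The ndots adjustment is a per-pod setting. A minimal sketch of the relevant fragment of a pod spec (NodeLocal DNSCache is installed cluster-wide separately):

```yaml
# Default ndots is 5, which triggers a search-domain lookup cascade
# for most external hostnames. Lowering it cuts wasted queries.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```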
Security: the non-negotiable minimum
Kubernetes security has layers. These are the ones we consider mandatory:
RBAC. No user or service account should have cluster-admin permissions except the platform team. Each team gets a namespace with limited roles: they can deploy to their namespace, read logs, but cannot modify resources in other namespaces or global cluster resources. This seems obvious, but we have found production clusters where every service account had admin permissions.
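A sketch of that per-team setup, using a hypothetical team-a namespace and team-a-developers group:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-deployer
  namespace: team-a            # hypothetical namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-deployer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers    # hypothetical IdP group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-deployer
  apiGroup: rbac.authorization.k8s.io
```

Because this is a Role rather than a ClusterRole, the grant cannot leak outside the namespace.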
Pod Security Standards. Since Kubernetes 1.25, Pod Security Standards (PSS) replace PodSecurityPolicies. The “restricted” level should be the default: no root, no host network, no privileged containers, no privilege escalation. Exceptions are documented and audited.
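Enforcement is a matter of labeling the namespace; a minimal example:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                                # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The warn and audit modes are useful for rolling the restricted level out gradually: set them first, fix the violations they surface, then enforce.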
Image scanning. Trivy or Grype in the CI/CD pipeline. Every image entering the cluster undergoes vulnerability scanning. Critical vulnerabilities block deployment. High ones generate a ticket that must be resolved within 7 days. This sounds restrictive until you discover that 40% of base images on Docker Hub have known critical vulnerabilities.
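A hypothetical pipeline step implementing that policy with Trivy (GitHub Actions syntax shown for illustration; adapt to your CI system):

```yaml
# --exit-code 1 fails the job when findings at the given severity exist
- name: Scan image for vulnerabilities
  run: |
    # Critical vulnerabilities block the deployment
    trivy image --severity CRITICAL --exit-code 1 "$IMAGE_TAG"
    # High vulnerabilities are reported but do not block (ticketed separately)
    trivy image --severity HIGH --exit-code 0 "$IMAGE_TAG"
```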
Secrets. Native Kubernetes Secrets are base64-encoded, not encrypted. For production we use External Secrets Operator, which syncs secrets from AWS Secrets Manager or HashiCorp Vault. This way the actual secrets never live in etcd.
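A sketch of an ExternalSecret, assuming the operator's v1beta1 API, a ClusterSecretStore named aws-secrets-manager, and a hypothetical path in Secrets Manager:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-credentials
spec:
  refreshInterval: 1h          # re-sync from the backing store hourly
  secretStoreRef:
    name: aws-secrets-manager  # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: api-credentials      # the Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/api/database-url   # hypothetical path in Secrets Manager
```

The manifest itself contains no secret material, so it can live in Git alongside the rest of the configuration.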
Network Policies. By default, all pods in a Kubernetes cluster can communicate with all others. This is terrible for security. We implement deny-all network policies by default in every namespace, then explicitly open only the necessary communication paths. Yes, it is extra work. But a compromised pod in one namespace should not be able to reach the database of another namespace.
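The starting point is a deny-all policy per namespace; allowed paths are then opened with additional policies selecting specific pods:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a       # hypothetical namespace
spec:
  podSelector: {}         # empty selector: applies to every pod here
  policyTypes:
    - Ingress
    - Egress
```

One caveat: deny-all egress also blocks DNS, so the first explicit allow you write is usually UDP/TCP 53 toward kube-dns.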
Deployments that do not wake anyone at 3am
The default Kubernetes deployment pattern (RollingUpdate) works well for most cases. But the details make the difference between a smooth deployment and an incident:
Well-configured readiness probes. The readiness probe must verify that the application is truly ready to receive traffic, not just that the process is running. If your application needs 30 seconds to warm its cache, the readiness probe must reflect that. Otherwise, Kubernetes sends traffic to pods that are not yet ready and users see 5xx errors.
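A sketch for a container whose hypothetical /ready endpoint only returns 200 once the cache is warm:

```yaml
readinessProbe:
  httpGet:
    path: /ready            # must check cache warm-up, not just liveness
    port: 8080
  initialDelaySeconds: 10   # skip probing during startup
  periodSeconds: 5
  failureThreshold: 3       # 3 consecutive failures remove the pod from rotation
```

The application side matters as much as the probe: /ready should verify dependencies and warm-up state, while a separate liveness endpoint checks only that the process is healthy.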
PodDisruptionBudgets. A PDB guarantees that a minimum number of pods remains available during voluntary disruptions such as node drains and upgrades. Without a PDB, a node update can take down all replicas of a service simultaneously. We always configure PDBs for critical services, typically with maxUnavailable: 1.
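A minimal PDB for a hypothetical service labeled app: api:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1   # evictions proceed one pod at a time
  selector:
    matchLabels:
      app: api        # hypothetical label
```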
Canary deployments. For high-traffic services or risky changes, we use Argo Rollouts for canary deployments: 5% of traffic goes to the new version, we monitor metrics for 10 minutes, and if everything looks good, we increase progressively. If metrics degrade, automatic rollback. This has prevented production incidents that we would have detected too late with a traditional rolling update.
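A sketch of that progression as an Argo Rollouts strategy (analysis templates, not shown, are what drive the automatic rollback on metric degradation):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5            # 5% of traffic to the new version
        - pause: {duration: 10m}  # observe metrics before proceeding
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}  # then full promotion
```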
Cost optimization: the continuous cycle
Kubernetes in the cloud is not cheap. An EKS cluster with 10 m5.xlarge nodes costs approximately EUR 1,400/month in compute alone (not counting the EKS fee, networking, or storage). Cost optimization is a continuous process, not a one-time project.
Spot instances for tolerant workloads. Batch workloads, CI/CD jobs, and development environments run on spot nodes, which cost 60-70% less than on-demand. Karpenter automatically manages the spot/on-demand mix. The critical point is that spot workloads must tolerate interruptions: graceful shutdown, checkpointing, and appropriate retry logic.
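A sketch of a spot-capable pool, assuming Karpenter's v1 NodePool API and an EC2NodeClass named default:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-tolerant
spec:
  template:
    spec:
      requirements:
        # Karpenter prefers spot when available, falls back to on-demand
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default    # hypothetical EC2NodeClass
```

Pair this with taints or node selectors so that only interruption-tolerant workloads actually land on the spot capacity.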
Vertical Pod Autoscaler (VPA). VPA analyzes actual pod consumption and recommends (or optionally applies) request and limit adjustments. It is the most underutilized tool in the ecosystem. On a cluster with 80 pods, VPA identified that 60% had CPU requests over-provisioned by 2x to 5x. Adjusting them allowed us to reduce the node count from 12 to 8.
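Running VPA in recommendation-only mode is the low-risk way to start; a sketch targeting a hypothetical api Deployment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommend only; review before applying changes
```

The recommendations then appear in the VPA object's status, where they can be reviewed against the requests actually configured.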
Development namespaces with TTL. Development environments in Kubernetes tend to accumulate. A namespace created to test a feature that merged three months ago is still consuming resources. We implement a TTL system that marks development namespaces for automatic deletion after 7 days of inactivity, with prior notification to the team.
Monthly cost review. Kubecost or OpenCost provide cost visibility per namespace, per deployment, per label. We review this data monthly and share it with development teams. When a team sees that their service costs EUR 200/month in compute, they make optimization decisions they never would have considered before.
Well-operated Kubernetes is a platform that multiplies development team productivity. Poorly operated, it is an inexhaustible source of incidents and costs. The difference is investing the necessary time in fundamentals: sizing, networking, security, and observability. There are no shortcuts.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
