Microservices observability: lessons from the field
The three pillars in practice
Everyone in the industry knows the three pillars of observability: logs, metrics, and traces. What most teams discover the hard way is that knowing the pillars and implementing them effectively are entirely different things. After years of helping engineering teams build observability into their microservices architectures, we have seen the same patterns repeat across logistics platforms, hospitality systems, and retail backends.
Logs are where everyone starts, and where most teams stay too long. Structured logging with JSON output, shipped to a centralized store like Loki or Elasticsearch, is table stakes. The problem is that logs alone can’t answer the questions that matter in a distributed system. “Why is this request slow?” requires correlating logs across six services, which is a trace. “Is the system healthy right now?” requires aggregated measurements over time, which is a metric. Logs are essential for debugging after the fact, but they’re the worst pillar for real-time operational awareness.
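A minimal sketch of structured JSON logging using only the Python standard library (the logger name and the extra fields are illustrative; in production this output would be shipped to Loki or Elasticsearch):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object, ready for a log store."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry structured fields attached via the `extra=` argument, so that
        # trace and business identifiers land in the log store as queryable keys.
        for key in ("trace_id", "order_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized", extra={"trace_id": "abc123", "order_id": "ORD-42"})
```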
Metrics are the most operationally valuable pillar and the most underinvested. Prometheus with Grafana is the de facto standard, and for good reason: it works, it scales, and the ecosystem is mature. The RED method (Rate, Errors, Duration) gives you the core metrics for every service: requests per second, error rate, and latency distribution. The USE method (Utilization, Saturation, Errors) covers infrastructure: CPU, memory, disk, network. Together they answer “is the system healthy?” and “where is the bottleneck?” without requiring you to read a single log line.
The mistake we see most often is teams collecting hundreds of custom metrics without a clear purpose. Start with RED for every service endpoint and USE for every infrastructure component. That covers 90% of operational questions. Add custom metrics only when you have a specific question that RED and USE can’t answer.
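To make the RED method concrete, here is a pure-stdlib sketch of per-endpoint instrumentation. In a real service these would be Prometheus `Counter` and `Histogram` objects exported on `/metrics`; the in-memory dict, the endpoint name, and the handler are hypothetical stand-ins:

```python
import time
from collections import defaultdict
from functools import wraps

# Per-endpoint RED data: Rate (requests), Errors, Duration.
red = defaultdict(lambda: {"requests": 0, "errors": 0, "durations": []})


def instrument(endpoint):
    """Record Rate, Errors, and Duration for every call to the wrapped handler."""

    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            except Exception:
                red[endpoint]["errors"] += 1
                raise
            finally:
                # The finally block runs on both success and failure,
                # so rate and duration cover every request.
                red[endpoint]["requests"] += 1
                red[endpoint]["durations"].append(time.monotonic() - start)

        return wrapper

    return decorator


@instrument("/orders")
def get_orders():
    return ["ORD-1", "ORD-2"]


get_orders()
```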
Traces are transformative but operationally expensive. Distributed tracing shows you the full journey of a request across services: which services it touched, how long each one took, where errors occurred. It’s the only pillar that gives you a complete picture of a single transaction. But full tracing generates enormous volumes of data. The practical answer is sampling. Head-based sampling decides up front, before a request’s outcome is known; tail-based sampling decides after the request completes, which is what lets you keep 100% of errors and slow requests while sampling 1-10% of successful ones. This gives you complete visibility into problems while keeping storage costs manageable.
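A tail-based sampling decision along these lines can be sketched in a few lines (the threshold and sample rate are illustrative defaults, not recommendations):

```python
import random


def keep_trace(status_code: int, duration_ms: float,
               slow_threshold_ms: float = 1000.0,
               success_sample_rate: float = 0.05) -> bool:
    """Tail-based sampling decision, made after the request has completed:
    keep every error and every slow request, sample the rest."""
    if status_code >= 500:
        return True  # 100% of errors
    if duration_ms >= slow_threshold_ms:
        return True  # 100% of slow requests
    return random.random() < success_sample_rate  # a small slice of the rest
```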
Distributed tracing implementation
OpenTelemetry has become the industry standard for instrumentation, and for good reason. It provides a single, vendor-neutral API for generating traces, metrics, and logs across virtually every language and framework. If you’re starting a new observability implementation, OpenTelemetry is the correct choice. If you’re migrating from Jaeger client libraries or Zipkin, the migration path is well-documented and worth the investment.
The critical implementation detail that separates effective tracing from decorative tracing is context propagation. Every request entering your system must receive a trace ID. That trace ID must propagate through every service-to-service call, every queue message, every database query. If context propagation breaks at any point, the trace fragments into disconnected pieces that tell you nothing.
For HTTP services, the W3C Trace Context headers (traceparent, tracestate) are the standard. For message queues, the trace context must be embedded in message headers or metadata. For database calls, the trace context should be attached as a comment or attribute. Every boundary between services is a potential break point for context propagation, and each one must be explicitly handled.
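As a sketch of what HTTP propagation looks like at the wire level (in practice the OpenTelemetry SDK handles this for you, but the header format itself is fixed by the W3C Trace Context specification):

```python
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Build a W3C `traceparent` header for a request entering the system:
    version (2 hex) - trace ID (32 hex) - span ID (16 hex) - flags (2 hex)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def propagate(incoming: str) -> str:
    """For each outgoing hop: keep the trace ID, mint a new span ID.
    Breaking this chain anywhere fragments the trace."""
    version, trace_id, _parent_span, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Attach to every outgoing HTTP call, queue message header, or DB comment:
headers = {"traceparent": new_traceparent()}
headers["traceparent"] = propagate(headers["traceparent"])
```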
Correlation IDs serve a complementary purpose. While the trace ID connects spans across services for a single request, a correlation ID (often a business-level identifier like an order ID or shipment ID) connects all requests related to a single business transaction. When a customer reports that their order is stuck, you search by order ID and find every trace, log entry, and metric associated with that order. This is where observability becomes directly useful to the business, not just to engineers.
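A minimal sketch of the pattern (the header name and the fallback behavior are common conventions, not a standard):

```python
import uuid


def ensure_correlation_id(headers: dict, business_id=None) -> str:
    """At the edge of the system: prefer a business identifier (order ID,
    shipment ID) as the correlation ID; otherwise reuse the caller's ID,
    or mint one. The ID is then forwarded on every downstream call and
    attached to every log line and span."""
    cid = business_id or headers.get("X-Correlation-ID") or uuid.uuid4().hex
    headers["X-Correlation-ID"] = cid
    return cid
```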
For the backend, we recommend Jaeger or Grafana Tempo as the trace store. Tempo integrates naturally with the Grafana-Prometheus-Loki stack and uses object storage (S3, GCS) for cost-effective retention. Jaeger is more mature and offers stronger query capabilities out of the box. Either is a solid choice.
Alerting that actually works
Most alerting systems we encounter in the field are broken. Not technically broken — they fire alerts. But operationally broken: they fire too many alerts, the wrong alerts, at the wrong severity, and the team has learned to ignore them. Alert fatigue is the silent killer of operational reliability.
The fundamental principle is symptom-based alerting, not cause-based alerting. Alert on what the user experiences, not on what you think causes it. “Error rate above 1% for 5 minutes” is a symptom-based alert. “CPU above 80%” is a cause-based alert. The first one tells you something is broken for users. The second tells you a resource is busy, which may or may not matter.
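In Prometheus alerting rules, the symptom-based alert above might look like this (the metric name, labels, and runbook URL are assumptions about your setup):

```yaml
groups:
  - name: symptom-alerts
    rules:
      - alert: HighErrorRate
        # Symptom-based: fires on what users experience, not on resource usage.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook_url: "https://runbooks.example.com/high-error-rate"  # hypothetical
```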
Cause-based alerts have their place, but they should be lower severity and used for capacity planning, not for waking people up at 3am. A database at 90% disk usage deserves a warning during business hours. A 5xx error rate spike deserves a page.
Reduce alert volume ruthlessly. Every alert should have three properties: it’s actionable (someone can do something about it right now), it’s urgent (it can’t wait until morning), and it’s rare (it fires less than once a week on average). If an alert doesn’t meet all three criteria, it should be downgraded to a dashboard panel or removed entirely.
Runbooks transform alerts from noise into action. Every alert should link to a runbook that answers: what does this alert mean, what’s the likely cause, what are the immediate mitigation steps, and who to escalate to if mitigation doesn’t work. A runbook doesn’t need to be perfect. It needs to exist. The team improves it after every incident. Over time, the runbooks become the most valuable operational documentation in the organization.
We typically recommend starting with no more than 10-15 alerts for a system of 20-30 microservices. If that sounds like too few, it probably means the alerts are too granular. Aggregate to the symptom level.
Dashboards for operations vs debugging
Dashboards serve two fundamentally different purposes, and mixing them is a common mistake.
Operational dashboards answer: “Is the system healthy right now?” They should be glanceable from across the room. Green means healthy, yellow means degraded, red means broken. They show the RED metrics for critical services, the current error rate, the latency percentiles, and the SLO burn rate. An operational dashboard should have no more than 10-15 panels. If someone needs to scroll or study it carefully, it has too much information.
Debugging dashboards answer: “Why is this specific thing broken?” They’re dense, detailed, and interactive. They show per-endpoint breakdowns, per-instance metrics, resource utilization, query latency distributions, queue depths, cache hit rates. Nobody looks at these unless they’re actively investigating a problem. They should be comprehensive rather than glanceable.
The Grafana ecosystem supports this well. Build a top-level operational dashboard that links to service-specific debugging dashboards. The operational dashboard is on the TV in the office. The debugging dashboards live in bookmarks and get opened during incidents.
SLO dashboards are a third category worth investing in. They show the error budget: how much unreliability the service can tolerate before breaching its SLO. A service with a 99.9% availability SLO has a monthly error budget of approximately 43 minutes. The SLO dashboard shows how much of that budget has been consumed. When the budget is burning faster than the month is progressing, that’s when to act. This approach replaces reactive firefighting with proactive reliability management.
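The arithmetic behind the 43-minute figure, and the burn-rate check that drives the dashboard, are simple enough to sketch (a 30-day month is assumed):

```python
def monthly_error_budget_minutes(slo: float, days_in_month: int = 30) -> float:
    """Minutes of downtime a service can absorb per month without breaching its SLO."""
    return days_in_month * 24 * 60 * (1 - slo)


def budget_burning_too_fast(consumed_minutes: float, day_of_month: int,
                            slo: float = 0.999, days_in_month: int = 30) -> bool:
    """True when the fraction of budget consumed outpaces the fraction of the
    month elapsed: the signal to act before the SLO is actually breached."""
    budget = monthly_error_budget_minutes(slo, days_in_month)
    return consumed_minutes / budget > day_of_month / days_in_month


# 99.9% availability over a 30-day month leaves roughly 43.2 minutes of budget:
monthly_error_budget_minutes(0.999)
```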
Building an observability culture
The hardest part of observability isn’t the tooling. It’s the culture. We’ve seen teams with world-class Grafana dashboards that nobody looks at, and teams with basic Prometheus setups that catch every problem before it reaches users. The difference is always cultural.
Observability starts in development, not in operations. If developers don’t instrument their code, no amount of infrastructure-level monitoring will provide application-level visibility. Make instrumentation part of the definition of done for every feature. A feature without metrics, logs, and trace context is not complete. Code review should include instrumentation review.
Blameless postmortems build trust. When an incident happens (and it will), the postmortem should focus on systemic improvements, not on who made the mistake. “The deploy pipeline didn’t catch the regression” is a useful finding. “Developer X deployed a bug” is not. Teams that fear blame hide information. Teams that trust the process share it.
On-call should be sustainable. If on-call means being woken up three times a week, the problem isn’t the engineers on call. It’s the system. Invest in reliability until on-call is boring. The target we set with our clients is fewer than two pages per on-call rotation (typically one week). If on-call is consistently noisy, that’s the strongest possible signal to invest in observability and reliability improvements.
Start with what you have. The perfect observability stack is the one your team actually uses. A Prometheus instance with 20 well-chosen alerts and 3 operational dashboards delivers more value than an elaborate multi-tool setup that nobody maintains. Build incrementally. Add complexity only when you have evidence that it’s needed.
Observability is not a project with a completion date. It’s a practice that improves continuously. The teams that treat it as an ongoing discipline, rather than a one-time infrastructure investment, are the ones that sleep well at night.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
