Microservices Observability: Metrics, Traces and Logs That Actually Matter


abemon | 11 min read | Written by practitioners

847 alerts in one week

That is what a client’s operations team was receiving when we started working with them. 847 alerts in 7 days. The predictable result: they ignored all of them. When a real alert surfaced (a payments service with degraded latency causing timeouts for 12% of customers), it took 4 hours to detect. Not because there was no alert. There were 23 related alerts. But they were buried in a sea of noise.

That scenario is the direct result of confusing monitoring with observability. Monitoring is collecting data. Observability is understanding the system. The difference is not semantic. It is operational. And it has a measurable cost in downtime, in wasted engineering hours, and in customers who leave without telling you why.

Which metrics to instrument (and which to ignore)

The natural instinct is to instrument everything. CPU, memory, disk, network, latency of every endpoint, error rate of every service, length of every queue. The result is a dashboard with 200 panels nobody looks at and an alerting system generating 847 weekly notifications.

The alternative: start with Google SRE’s four golden signals.

Latency. Time for a request to complete. But not the average. The average lies. A service with an average latency of 120ms can have a p99 of 3.2 seconds, meaning 1 in 100 users waits over 3 seconds. We instrument p50, p95, and p99. The p50 tells us how the typical user is doing. The p99 tells us how the worst-off users are doing. If p99 spikes but p50 stays flat, we have a problem that affects few users, but affects them severely.

A critical detail: measure latency separately for successful and failed requests. Requests returning a 500 error are often fast (the service fails quickly). If you mix them with successful ones, average latency drops when errors increase. Counterintuitive and dangerous.
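To make the two points above concrete, here is a minimal sketch in plain Python (no metrics library; in practice a Prometheus histogram does this for you) that computes nearest-rank percentiles over a window of requests, keeping successful and failed requests separate:

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile over an already-sorted list (0.0 <= p <= 1.0)."""
    if not sorted_values:
        return None
    k = max(0, min(len(sorted_values) - 1, round(p * (len(sorted_values) - 1))))
    return sorted_values[k]

def latency_summary(requests):
    """requests: list of (latency_ms, status_code) tuples.
    Successes and failures are summarized separately, because fast
    5xx responses would otherwise drag the combined numbers down."""
    ok = sorted(ms for ms, code in requests if code < 500)
    failed = sorted(ms for ms, code in requests if code >= 500)
    return {
        "ok_p50": percentile(ok, 0.50),
        "ok_p95": percentile(ok, 0.95),
        "ok_p99": percentile(ok, 0.99),
        "failed_p50": percentile(failed, 0.50),
    }
```

With 98 requests at 100ms, two at over 3 seconds, and ten fast 500s at 5ms, the p50 stays at 100ms while the p99 exposes the slow tail, and the failed requests never pollute the success percentiles.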

Traffic. Requests per second. But broken down by operation type, not aggregated. Read traffic and write traffic have completely different profiles. A spike in reads might be normal (marketing campaign, peak hour). A spike in writes might be anomalous (a runaway script, an attack, a loop in a microservice).

Errors. Error rate as a percentage of total traffic. Not absolute count. 500 errors per hour sounds terrible. 500 errors per hour with 2 million requests per hour is a rate of 0.025%, which is probably within your SLO. 500 errors per hour with 10,000 requests per hour is 5%, and that is a crisis.
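The arithmetic is trivial, but encoding it is exactly the difference between alerting on noise and alerting on signal. A minimal sketch (the 0.5% budget threshold is an illustrative assumption, not a number from this article):

```python
def error_rate(errors, total):
    """Error rate as a fraction of total traffic; 0.0 when there is no traffic."""
    return errors / total if total else 0.0

def breaches_budget(errors, total, allowed_fraction=0.005):
    """True when the observed rate exceeds the allowed error fraction.
    The 0.5% default is an assumption for illustration."""
    return error_rate(errors, total) > allowed_fraction
```

The article's two scenarios: the same 500 errors per hour is a 0.025% rate against 2 million requests (fine) and a 5% rate against 10,000 requests (a crisis).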

Saturation. How much of the service’s capacity is in use. CPU, memory, database connections, goroutines, threads, whatever is the limiting resource for your particular service. Saturation is the predictive metric: it tells you when you are going to have a problem, not when you already have one.

Those four metrics, well instrumented, cover 80% of observability needs. Everything else is additional context for diagnosis, not for alerting.

Distributed tracing: the piece that changes everything

Picture a user request that passes through the API gateway, then to the authentication service, then to the orders service, which calls the inventory service, which queries the database, which generates a Kafka event, which is consumed by the billing service. Seven hops. If the request takes 4 seconds, where is the bottleneck?

Without distributed tracing, the answer is: we have no idea. Each service has its own logs. Correlating logs from all the services involved for a single request is a manual exercise that takes 30 minutes minimum. And if you have thousands of requests per minute, finding the one that failed is searching for a needle in a haystack.

With distributed tracing (OpenTelemetry, the de facto standard), each request receives a trace ID that follows it through all services. Each step within a service generates a span with its duration, attributes, and relationship to the parent span. The result is a call tree showing exactly where time was spent.

We implement tracing with the OpenTelemetry SDK in each service. Traces are sent to a collector (OpenTelemetry Collector) that exports them to Jaeger or Tempo. In production, we do not trace 100% of requests. We trace 10% with probabilistic sampling, plus 100% of requests returning errors, plus 100% of requests with latency above p95. This gives us sufficient coverage to diagnose problems without the cost of storing traces for every request.
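The sampling policy can be sketched as a tail-based keep/drop decision. This is a simplified illustration, not the OpenTelemetry Collector's actual tail_sampling processor configuration; the 10% rate and the p95 threshold are the article's numbers, the function name is hypothetical:

```python
import random

def should_keep_trace(status_code, latency_ms, p95_latency_ms,
                      sample_rate=0.10, rng=random.random):
    """Keep a trace if it errored, was slower than p95,
    or wins the probabilistic 10% lottery."""
    if status_code >= 500:
        return True                 # 100% of error traces
    if latency_ms > p95_latency_ms:
        return True                 # 100% of slow traces
    return rng() < sample_rate      # 10% of everything else
```

Injecting the random source (rng) keeps the decision deterministic in tests; in a real collector this decision runs after the trace completes, which is why it is called tail sampling.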

Correlating with business metrics

This is where tracing stops being an infrastructure tool and becomes a business tool.

We add business attributes to spans. Not just the endpoint and HTTP method. The customer ID, the operation type (purchase, return, query), the monetary value of the transaction. With those attributes, we can answer questions like: do premium customers experience the same latency as standard ones? Do high-value transactions have more errors? Is the inventory service latency degradation affecting sales?
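In OpenTelemetry terms this means calling span.set_attribute with business keys. The sketch below models the idea with a plain dictionary so it stays dependency-free; the attribute keys (app.customer_tier, app.order_value_eur, and so on) are hypothetical conventions of ours, not OpenTelemetry semantic conventions:

```python
def enrich_span(span_attributes, customer_id, customer_tier, operation, value_eur):
    """Attach business context to a span's attribute map.
    With the real OpenTelemetry SDK each entry would be span.set_attribute(key, value)."""
    span_attributes.update({
        "app.customer_id": customer_id,
        "app.customer_tier": customer_tier,   # e.g. "premium" vs "standard"
        "app.operation": operation,           # purchase, return, query
        "app.order_value_eur": value_eur,
    })
    return span_attributes

def premium_only(traces):
    """Questions like 'do premium customers see the same latency?'
    become a simple filter over enriched traces."""
    return [t for t in traces if t["attrs"].get("app.customer_tier") == "premium"]
```

Once the attributes are on the span, the trace backend can slice latency and error data by customer tier or transaction value without any extra instrumentation.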

A real example: we detected that 8% of requests to the card payment service had latency above 5 seconds. Without business correlation, that was “a performance issue.” With correlation, we discovered that 100% of those slow requests were international card payments processed through a secondary payment gateway. The problem was not our service. It was the external provider’s timeout. The solution was increasing the timeout and showing a spinner to the user, not optimizing our code.

Without the business attributes in the trace, we would have spent days optimizing a service that was not the problem.

Logs: less is more

Logs are the oldest observability signal and the most abused. Most teams log too much. A service generating 500 MB of logs per hour is generating noise, not information.

Our rules for effective logging:

Structured logging always. JSON, not free text. A log like {"level":"error","service":"payments","trace_id":"abc123","error":"timeout","gateway":"secondary","latency_ms":5200} is searchable, filterable, and parseable. A log like ERROR: Payment failed after 5.2 seconds on secondary gateway requires regex to extract data. In production, with millions of lines, the difference is critical.

One event, one log. Not three logs for one operation: “Starting payment processing”, “Payment sent to gateway”, “Payment completed.” A single log at the end with the outcome and relevant data: success or failure, duration, gateway used, amount. Intermediate logs are useful in development. In production, they are noise.

Log levels with discipline. ERROR is something requiring immediate human action. WARN is something that could become an error if not addressed. INFO is significant business events (order created, payment processed, user registered). DEBUG is activated temporarily to diagnose a specific problem and deactivated afterward.
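The three rules combine into one terminal log per operation. A stdlib-only sketch (Python's json module; a real service would use a structured-logging library and pull the trace ID from context rather than passing it by hand):

```python
import json
import sys
import time

def log_event(level, service, trace_id, **fields):
    """Emit exactly one structured JSON log line for a completed operation."""
    record = {"level": level, "service": service,
              "trace_id": trace_id, "ts": time.time(), **fields}
    print(json.dumps(record, sort_keys=True), file=sys.stderr)
    return record  # returned to make the function testable

# One event, one log: a single line at the end of payment processing,
# not three lines narrating its progress.
log_event("error", "payments", "abc123",
          error="timeout", gateway="secondary", latency_ms=5200)
```

Every field is a key, so the line is filterable (level="error", gateway="secondary") without regex, and the trace_id links the log back to the distributed trace.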

We have reduced client log volume by 60-80% without losing relevant information. The result: faster searches in Loki/Elasticsearch, lower storage bills, and operators who can find the information they need in seconds instead of minutes.

Alerting: the part everyone gets wrong

Back to those 847 weekly alerts. How do you get there?

You get there by alerting on symptoms instead of user impact. A “CPU at 85%” alert says nothing about whether users are affected. The service might run perfectly at 85% CPU. It might fail at 60%. CPU is an input, not an outcome.

Alerts should be defined in terms of SLOs (Service Level Objectives). An SLO says: “99.5% of requests to the payments service will have latency below 2 seconds.” The alert fires when the error budget is being consumed faster than expected, not when an individual metric crosses a threshold.

We implement alerts based on burn rate. If your SLO is 99.5% monthly availability (equivalent to 3.6 hours of permitted downtime), and in the last 6 hours you have consumed 20% of your error budget, the alert fires. This gives you time to react before violating the SLO, but does not wake you at 3 AM because a service had 5 errors in one minute.
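The condition can be expressed numerically. Under the article's numbers (a 720-hour month, 20% of budget consumed in a 6-hour window), the budget is burning at 24 times the sustainable pace. A sketch; the 10x fast-burn threshold is a common choice but an assumption here, not a figure from the article:

```python
def burn_rate(budget_fraction_consumed, window_hours, period_hours=720):
    """How many times faster than sustainable the error budget is burning.
    A burn rate of 1.0 means the budget would last exactly the SLO period."""
    sustainable = window_hours / period_hours  # fraction a steady burn would use
    return budget_fraction_consumed / sustainable

def should_alert(budget_fraction_consumed, window_hours,
                 threshold=10.0, period_hours=720):
    """Fire only when the budget burns far faster than the steady pace,
    so a brief blip of 5 errors in one minute never pages anyone."""
    return burn_rate(budget_fraction_consumed, window_hours, period_hours) >= threshold
```

A steady trickle that would consume the budget exactly over the month scores 1.0 and stays silent; the 6-hour scenario scores 24 and fires with plenty of runway before the SLO is actually violated.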

Result for the 847-alert client: we reduced to 12-15 weekly alerts. Each one required attention. The false positive rate dropped from 94% to 8%. MTTR (Mean Time To Resolve) dropped from 4 hours to 22 minutes, because when an alert fired, engineers knew it was real and had the trace ID in the notification to start diagnosing immediately.

The stack we recommend

After implementing observability in a dozen microservices environments, this is the stack that converges for mid-market companies:

Metrics: Prometheus + Grafana. Open source, mature, massive exporter ecosystem. For those who do not want to operate Prometheus, Grafana Cloud offers Prometheus-as-a-service with a generous free tier.

Traces: OpenTelemetry + Tempo (or Jaeger). OpenTelemetry as the instrumentation SDK is non-negotiable: it is the standard, has support across all languages, and is vendor-neutral. Tempo as the traces backend if you already use Grafana. Jaeger if you prefer something independent.

Logs: Loki + Grafana. Loki is significantly cheaper than Elasticsearch for logs because it does not index the full content, only labels. For frequent full-text searches, Elasticsearch remains superior. But for most production log use cases, Loki is sufficient and costs a fraction.

Alerting: Grafana Alerting with SLOs defined in Prometheus. Alerts based on burn rate, not static thresholds.

The total cost of operating this stack for a 15-20 microservice environment is between 400 and 800 euros/month if self-managed, or between 600 and 1,200 euros/month on Grafana Cloud. Compared to Datadog (which for the same volume can cost 3,000-5,000 euros/month), the difference is significant.

Datadog is an excellent product. But for a mid-market company, the cost scales aggressively with data volume. And observability data volume grows faster than you expect.

Where to start

If your microservices system has no observability, or has the “basic monitoring with too many alerts” variant, the implementation plan is:

Weeks 1-2: Instrument the four golden signals with Prometheus on the 3-5 most critical services. Create a Grafana dashboard with p50/p95/p99 latency, error rate, and traffic per service.

Weeks 3-4: Implement tracing with OpenTelemetry on those same services. Configure 10% sampling with 100% for errors and high latency. Deploy Tempo or Jaeger.

Weeks 5-6: Define SLOs for critical services. Create burn-rate-based alerts. Eliminate all existing static threshold alerts.

Weeks 7-8: Consolidate logs in Loki. Migrate to structured logging. Apply the “one event, one log” rules and reduce volume.

It is a two-month effort for a team of 2-3 engineers working part-time. The return is immediate: the first incident you diagnose in 20 minutes instead of 4 hours pays for the investment.

For teams that need help with implementation, our cloud and DevOps services include observability as a standard component. Because a system you cannot observe is a system you cannot operate. Also see how observability connects with zero downtime deployment strategies and testing in production.

About the author


abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.