Real-time data pipelines: a practical guide for engineering teams
Do you actually need real-time?
Before designing a streaming architecture, the first question every engineering team should answer honestly is whether they actually need real-time data. In our experience building data infrastructure for logistics, hospitality, and retail clients, the answer is “no” more often than teams expect.
Real-time means sub-second latency from event to action. Near-real-time means seconds to low minutes. Batch means minutes to hours. The cost and complexity difference between these tiers is enormous, and most business requirements that get labeled “real-time” are actually near-real-time at best.
A logistics dashboard that updates shipment positions every 30 seconds is near-real-time. A hospitality system that syncs reservations every 5 minutes is near-real-time. A retail inventory system that prevents overselling on flash sales — that might genuinely need real-time. The distinction matters because a well-designed batch or micro-batch pipeline running every 60 seconds can serve 80% of use cases at a fraction of the operational cost.
The honest diagnostic is this: what is the business cost of a 60-second delay versus a 1-second delay? If nobody can articulate a concrete dollar figure, the team is probably over-engineering. Start with the simplest architecture that meets actual requirements and evolve toward streaming only when the data proves you need it.
That said, when you genuinely need real-time — fraud detection, live pricing engines, IoT sensor processing, real-time personalization — the investment is justified. The rest of this guide assumes you have confirmed that your use case warrants it.
Architecture patterns that work in production
The streaming ecosystem has consolidated around a few battle-tested patterns. Understanding when to apply each one saves months of rework.
Event streaming with Kafka or Redpanda. This is the backbone of most production streaming architectures. Kafka remains the industry standard for high-throughput, durable event streaming. Redpanda has emerged as a compelling alternative: it’s API-compatible with Kafka but written in C++ with no JVM dependency, which means simpler operations and lower tail latency. For teams starting fresh without existing Kafka expertise, Redpanda is worth serious evaluation.
The core pattern is simple: producers publish events to topics, and consumers read from those topics at their own pace. The durability guarantee — events are persisted to disk and replicated — is what makes this architecture resilient. If a consumer goes down, it picks up where it left off. No data loss.
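The durable-log model can be sketched in a few lines. This is a toy in-memory stand-in, not a real Kafka client: the point is that the log persists independently of consumers, and each consumer tracks its own offset, so a restarted consumer resumes exactly where it left off.

```python
# Toy sketch of the durable-log consumption model (not a real Kafka client):
# events are appended to a persistent log, and each consumer tracks its own
# offset, so a restarted consumer resumes where it left off.

class Topic:
    def __init__(self):
        self.log = []  # durable, append-only event log

    def produce(self, event):
        self.log.append(event)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0  # committed position, survives restarts

    def poll(self, max_events=10):
        events = self.topic.log[self.offset:self.offset + max_events]
        self.offset += len(events)  # commit after processing
        return events

topic = Topic()
for i in range(5):
    topic.produce({"order_id": i})

consumer = Consumer(topic)
first = consumer.poll(max_events=3)  # processes events 0, 1, 2
# ... consumer restarts with its committed offset intact ...
rest = consumer.poll()               # resumes at offset 3: no loss, no repeats
```

The same decoupling is why multiple consumer groups can read the same topic independently, each at its own pace.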
Change Data Capture (CDC) with Debezium. When the source of truth lives in a relational database and you need to stream changes without modifying application code, CDC is the answer. Debezium reads the database’s transaction log (the WAL in PostgreSQL, the binlog in MySQL) and publishes each row-level change as an event to Kafka. The application doesn’t know or care that its writes are being streamed. This is particularly valuable for legacy systems where modifying the application to produce events directly is impractical or risky.
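To make this concrete, here is a hedged sketch of consuming a Debezium-style change event. The envelope shape below follows Debezium's convention — "before" and "after" row images plus an "op" code — but the field contents are illustrative, and the in-memory dict stands in for a downstream replica.

```python
import json

# Sketch of applying a Debezium-style change event ("before"/"after" row
# images plus an "op" code) to a downstream replica. Payload values are
# illustrative, not taken from a real connector.

debezium_event = json.dumps({
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"},
    "op": "u",            # c=create, u=update, d=delete, r=snapshot read
    "ts_ms": 1700000000000,
})

def apply_change(table, raw_event):
    """Apply one row-level change event to an in-memory replica."""
    event = json.loads(raw_event)
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]
        table[row["id"]] = row                   # insert or update the row
    elif op == "d":
        table.pop(event["before"]["id"], None)   # delete uses the old image
    return table

replica = {42: {"id": 42, "status": "pending"}}
apply_change(replica, debezium_event)
```

In production the consumer would read these events from a Kafka topic per table, and the replica would be a real database or cache.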
Stream processing with Flink or Spark Structured Streaming. Once events are flowing through Kafka, you often need to transform, enrich, aggregate, or join them in flight. Apache Flink is the strongest option for true stream processing: it handles event-time semantics, windowing, and stateful processing with a maturity that other frameworks haven’t matched. Spark Structured Streaming is a reasonable choice if your team already has Spark expertise and your latency requirements are in the low-seconds range rather than sub-second.
For simpler transformations — filtering, mapping, lightweight enrichment — Kafka Streams or even a consumer application with in-process logic may be sufficient. Not every pipeline needs Flink.
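A transform of this kind fits in a plain consumer loop. The sketch below filters, maps, and enriches from a lookup table; `warehouse_names` is a hypothetical reference table, not part of any real API.

```python
# A stateless transform of the kind that doesn't need Flink: filter, map,
# and enrich from a lookup table. warehouse_names is a hypothetical
# reference table used only for illustration.

warehouse_names = {"W1": "Las Palmas", "W2": "Madrid"}

def transform(events):
    for event in events:
        if event.get("type") != "shipment":   # filter: drop non-shipments
            continue
        yield {
            "shipment_id": event["id"],
            "warehouse": warehouse_names.get(event["warehouse_id"], "unknown"),  # enrich
            "weight_kg": event["weight_g"] / 1000,  # map: normalize units
        }

raw = [
    {"type": "shipment", "id": "s1", "warehouse_id": "W1", "weight_g": 2500},
    {"type": "heartbeat"},
]
out = list(transform(raw))
```

The moment the transform needs windowed state or joins across streams, graduate to a proper processing framework.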
The reference architecture we deploy most often looks like this: source systems produce events to Kafka (or Debezium captures changes from databases into Kafka), Flink processes and enriches the streams, and the results land in both an operational data store (PostgreSQL, Redis) for serving and an analytical store (ClickHouse, BigQuery) for reporting.
Data quality in streaming
Data quality in batch pipelines is already hard. In streaming, it’s harder because you lose the luxury of reprocessing an entire dataset before serving it. Three challenges dominate.
Schema evolution. Events will change shape over time. Fields get added, types change, optional fields become required. Without schema management, downstream consumers break silently. Use a schema registry (Confluent Schema Registry or Apicurio) and enforce compatibility rules. Backward compatibility — new schemas can read old data — is the minimum. Full compatibility is better. Avro and Protobuf handle schema evolution far more gracefully than JSON. If you’re starting a new pipeline, choose one of them. Effective schema management is one pillar of a broader data governance strategy that every data team should establish early.
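The backward-compatibility rule can be illustrated with a deliberately simplified check: a new schema can read old data only if every field it adds has a default. Real registries like Confluent Schema Registry and Apicurio enforce far more than this (type promotions, removals, aliases); this sketch captures only the core idea.

```python
# Simplified sketch of the backward-compatibility rule a schema registry
# enforces: a new schema can read old data only if every field it adds has
# a default. Real registries check much more (types, removals, aliases).

def is_backward_compatible(old_schema, new_schema):
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False  # new required field: old records can't be read
    return True

old = {"fields": [{"name": "order_id", "type": "string"}]}
ok  = {"fields": [{"name": "order_id", "type": "string"},
                  {"name": "channel", "type": "string", "default": "web"}]}
bad = {"fields": [{"name": "order_id", "type": "string"},
                  {"name": "channel", "type": "string"}]}
```

Adding `channel` with a default is safe; adding it as a required field breaks every consumer reading records written before the change.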
Late arrivals and out-of-order events. In distributed systems, events don’t arrive in order. A mobile device might buffer events during a network outage and flush them minutes or hours later. Stream processing frameworks handle this through watermarks — a declaration of “I believe all events up to timestamp T have arrived.” Events arriving after the watermark are late. You need a strategy for them: drop them, process them into a separate correction stream, or extend your watermark tolerance at the cost of higher latency.
The pragmatic approach is to define a lateness tolerance based on your domain. For logistics tracking, 5 minutes covers most network delays. For IoT sensors, 30 seconds may suffice. For financial transactions, you may need hours of tolerance with a correction mechanism.
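The watermark mechanics can be sketched without a framework. Here the watermark is the maximum event time seen minus the lateness tolerance, and events behind it are routed to a correction stream rather than dropped. Timestamps are illustrative epoch seconds.

```python
# Sketch of watermark handling: events older than the watermark (max event
# time seen, minus the lateness tolerance) go to a correction stream instead
# of the main path. Timestamps are illustrative epoch seconds.

LATENESS_TOLERANCE = 300  # 5 minutes, as for logistics tracking

def route_events(events):
    max_event_time = 0
    on_time, corrections = [], []
    for event in events:
        max_event_time = max(max_event_time, event["ts"])
        watermark = max_event_time - LATENESS_TOLERANCE
        if event["ts"] < watermark:
            corrections.append(event)  # late: handle out of band
        else:
            on_time.append(event)
    return on_time, corrections

events = [
    {"id": "a", "ts": 1000},
    {"id": "b", "ts": 2000},
    {"id": "c", "ts": 1500},  # 500s behind the watermark at ts=2000: late
]
on_time, corrections = route_events(events)
```

Raising `LATENESS_TOLERANCE` admits more stragglers into the main path, at the cost of holding windows open longer — the latency trade-off described above.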
Exactly-once semantics. The holy grail of streaming is processing each event exactly once: no duplicates, no drops. Kafka supports exactly-once within its ecosystem through idempotent producers and transactions (with consumers reading in read-committed mode). Flink supports exactly-once with checkpointing. But exactly-once across system boundaries — from Kafka through Flink to an external database — requires idempotent writes on the sink side. Design your sink operations to be idempotent (upserts, not inserts) and you get effective exactly-once without the complexity of distributed transactions.
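The upsert pattern is small enough to sketch directly. The dict below stands in for a database table with a primary key: because writes are keyed by event id, a redelivered event overwrites its previous copy instead of creating a duplicate.

```python
# Sketch of effective exactly-once at the sink: writes are upserts keyed by
# event id, so redelivered events overwrite rather than duplicate. The dict
# stands in for a database table with a primary key.

sink = {}

def upsert(event):
    sink[event["event_id"]] = event  # UPSERT semantics: last write wins

batch = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 20},
    {"event_id": "e1", "amount": 10},  # duplicate delivery after a retry
]
for event in batch:
    upsert(event)
```

In PostgreSQL the equivalent is `INSERT ... ON CONFLICT (event_id) DO UPDATE`; the consumer can then be safely re-run over the same offsets after a failure.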
Observability for pipelines
A streaming pipeline without observability is a pipeline waiting to fail silently. The operational characteristics of streaming systems are fundamentally different from request-response services, and they require purpose-built monitoring.
Consumer lag is the single most important metric. It measures how far behind a consumer is from the latest event in the topic. Healthy lag is near zero and stable. Growing lag means the consumer can’t keep up with the production rate. Sudden lag spikes indicate processing failures or downstream bottlenecks. Monitor lag per consumer group, per partition, and alert on both absolute values and rate of change. Burrow or Kafka’s built-in metrics exposed through JMX are the standard tools.
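The computation itself is trivial — lag per partition is the distance between the log-end offset and the committed offset — which is exactly why it makes such a reliable alerting signal. Offset values below are illustrative.

```python
# Consumer lag per partition is the distance between the log-end offset and
# the consumer group's committed offset. Offset values are illustrative.

def consumer_lag(log_end_offsets, committed_offsets):
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

log_end   = {0: 1_000_000, 1: 1_000_000, 2: 1_000_000}
committed = {0:   999_990, 1:   999_995, 2:   950_000}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
worst = max(lag, key=lag.get)  # partition 2 is the one to investigate
```

Note that the per-partition view matters: a healthy total can hide one stuck partition, which is why alerting on the maximum and on the rate of change catches problems that an aggregate misses.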
Throughput metrics should be tracked at every stage: events produced per second, events processed per second, events written to sinks per second. Discrepancies between stages indicate data loss or accumulation. A processing stage that receives 10,000 events per second but only writes 9,500 to the sink has a 5% drop rate that needs investigation.
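The stage-to-stage reconciliation described above is a simple calculation worth automating. This sketch compares per-stage throughput counters and flags any stage boundary that loses events beyond a threshold; counter names and values are illustrative.

```python
# Sketch of stage-to-stage throughput reconciliation: compare per-stage
# event counters and flag any boundary that loses events. Names and
# thresholds are illustrative.

def drop_rates(stage_counts):
    stages = list(stage_counts.items())
    report = {}
    for (name_in, count_in), (name_out, count_out) in zip(stages, stages[1:]):
        report[f"{name_in}->{name_out}"] = 1 - count_out / count_in
    return report

counts = {"produced": 10_000, "processed": 10_000, "written": 9_500}
rates = drop_rates(counts)
alerts = {k: v for k, v in rates.items() if v > 0.001}  # 0.1% tolerance
```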
Dead letter queues (DLQs) are essential for production resilience. When an event can’t be processed — malformed data, schema mismatch, transient downstream failure — it should be routed to a DLQ rather than blocking the pipeline or being silently dropped. Monitor the DLQ growth rate. A healthy pipeline has a near-empty DLQ. A growing DLQ is an early warning of data quality issues upstream. Build tooling to inspect, replay, and resolve DLQ events. They will be needed.
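The routing logic is the easy part; the discipline is keeping the original payload and error context together so events can be inspected and replayed later. A minimal sketch:

```python
# Sketch of DLQ routing: a failing event is captured with error context
# instead of blocking the pipeline or being dropped silently.

dead_letter_queue = []

def process(event):
    if "order_id" not in event:
        raise ValueError("missing order_id")
    return {"order_id": event["order_id"], "status": "ok"}

def consume(events):
    results = []
    for event in events:
        try:
            results.append(process(event))
        except Exception as exc:
            dead_letter_queue.append({
                "event": event,     # original payload, kept intact for replay
                "error": str(exc),  # why it failed, for triage
            })
    return results

results = consume([{"order_id": 1}, {"malformed": True}])
```

In production the DLQ is itself a Kafka topic, which gives you retention, ordering, and the same consumer tooling for the replay path.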
End-to-end latency measures the time from event production to availability in the sink. Track percentiles (p50, p95, p99), not averages. A pipeline with 100ms average latency but 30-second p99 latency has a tail problem that averages hide.
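A small worked example shows why averages deceive. With 98 fast samples and two slow ones, the average stays under a second while the tail tells the real story; the percentile function below uses simple nearest-rank interpolation and the samples are illustrative.

```python
# Why percentiles, not averages: 98 fast samples and 2 slow ones keep the
# average low while the tail exposes the problem. Nearest-rank percentile;
# sample values are illustrative.

def percentile(samples, p):
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
    return ordered[index]

latencies = [0.1] * 98 + [15.0, 30.0]  # seconds, end to end
avg = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

Here the average is roughly half a second, but 1 in 100 events takes 15 seconds or more — a tail problem the average would have hidden entirely.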
Practical implementation guide
Based on our experience deploying streaming architectures for clients across logistics, hospitality, and retail, here is the implementation sequence we recommend.
Phase 1: Foundation (weeks 1-3). Deploy Kafka or Redpanda with a minimum of three brokers for fault tolerance. Set up a schema registry. Define your topic naming conventions and partitioning strategy early — changing them later is painful. Establish monitoring from day one: consumer lag, broker health, disk usage.
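One way to make a naming convention stick is to validate it in CI before topics are created. The `<domain>.<entity>.<version>` pattern below (e.g. `logistics.shipments.v1`) is a hypothetical convention for illustration, not a standard.

```python
import re

# Hypothetical topic naming convention <domain>.<entity>.<version>,
# e.g. logistics.shipments.v1 — an example of pinning the convention
# down early and enforcing it mechanically, not an industry standard.

TOPIC_PATTERN = re.compile(r"^[a-z]+\.[a-z_]+\.v\d+$")

def valid_topic(name):
    return bool(TOPIC_PATTERN.match(name))

ok  = valid_topic("logistics.shipments.v1")
bad = valid_topic("ShipmentsTopic")  # no domain, no version, wrong case
```

Running a check like this against every topic-creation request costs nothing and prevents the naming drift that is painful to unwind later.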
Phase 2: First pipeline (weeks 3-6). Pick one use case. The simplest, highest-value data flow. Implement a producer, a consumer, and a sink. No stream processing yet — just move data reliably from point A to point B. This forces you to solve the operational fundamentals: deployment, configuration management, secret handling, log aggregation.
Phase 3: Stream processing (weeks 6-10). Introduce Flink or your chosen processing framework for the first use case that requires transformation or enrichment. Start with stateless operations (filter, map, enrich from a lookup table). Graduate to stateful operations (windowed aggregations, joins) only when you have operational confidence.
Phase 4: Operationalize (weeks 10-14). Build the DLQ pipeline. Implement alerting on lag, throughput, and error rates. Create runbooks for common failure scenarios: broker failure, consumer rebalance, schema incompatibility. Run a failure injection exercise. Your pipeline will fail in production. The question is whether your team knows how to respond.
The most common mistake we see is teams trying to jump to Phase 3 or 4 before Phase 1 is solid. A streaming architecture built on a shaky operational foundation will cause more incidents than it solves business problems. Get the fundamentals right first.
For a deeper dive into Kafka and Flink applied to logistics operations, see our Kafka-Flink implementation guide. If your team needs hands-on support designing and operating these pipelines, our data engineering service covers everything from initial assessment through production operations.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
