How We Built a Real-Time Data Pipeline with Kafka and Flink

abemon | 12 min read

The problem: 47-minute delay

Our client operated a fleet of 340 delivery vehicles. The existing tracking system ran on a 15-minute batch cycle: GPS devices sent positions to a server, the server stored them in PostgreSQL, and every 15 minutes a cron job generated updated positions for the operations dashboard.

In practice, the delay was larger. The batch took between 3 and 8 minutes to process depending on volume. If a vehicle reported right after a batch cycle, the operator would not see that position until the next cycle plus processing time. Worst case: 23 minutes. Actual average: 12 minutes. But with load incidents and cron failures, we measured delays of up to 47 minutes.

47 minutes. A vehicle that deviates from its route, breaks down, or arrives at a delivery point. 47 minutes for the operations team to know. In last-mile logistics, that is unacceptable.

The goal was to reduce that delay to under 5 seconds. We achieved 4.2 seconds at p95. This article explains how.

The batch architecture had a fundamental problem: it coupled data ingestion with processing. The same process that received GPS data transformed it, enriched it with route data, and wrote it to the database. If any of those steps failed, the entire data window was lost.

We needed to decouple. And for streaming with the guarantees we required, the Kafka-Flink combination is the one we know best.

We evaluated alternatives. Kafka Streams would have been sufficient for the processing logic, but we needed complex temporal windows (tumbling and sliding) with late arrivals, and Flink handles that with much more flexibility. Pulsar was an option, but the connector ecosystem was less mature at the time. Google Pub/Sub + Dataflow was viable, but the client wanted to avoid lock-in with a single cloud provider.

Kafka as the message bus. Flink as the processing engine. PostgreSQL (with TimescaleDB) as the historical position store. Redis as the current-state cache for the operations dashboard.

Pipeline architecture

The complete flow has five stages:

Ingestion. GPS devices send positions via MQTT to a Mosquitto broker. A Kafka Connect connector (MQTT Source) publishes each position to a Kafka topic (gps.raw.positions). Volume: between 800 and 1,200 messages per second during peak hours. Each message is about 200 bytes (latitude, longitude, speed, heading, timestamp, vehicle_id).
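
For reference, this is roughly what a single position looks like once deserialized in the Flink jobs. A minimal sketch: the record name and field types are our illustration, not the vendor's actual wire format or Avro schema.

    // Hypothetical representation of one GPS position (~200 bytes on the wire).
    // Field names follow the article; the real Avro schema may differ.
    public record GpsPosition(
            String vehicleId,   // unique vehicle identifier
            double latitude,    // WGS84 degrees
            double longitude,   // WGS84 degrees
            double speedKmh,    // reported speed in km/h
            double heading,     // degrees clockwise from north
            long timestampMs    // device timestamp, epoch milliseconds
    ) {}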

Validation. A Flink job consumes the raw topic, validates the schema, discards positions with invalid coordinates (outside the Iberian Peninsula, speed above 200 km/h, future timestamp), and publishes to a clean topic (gps.validated.positions). Between 2% and 4% of messages are discarded, most of them GPS glitches produced when a vehicle starts up.
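
The validation logic itself is small. A minimal sketch of the filter, assuming the GpsPosition record sketched above; the bounding box limits here are illustrative, not our production values.

    import org.apache.flink.api.common.functions.FilterFunction;

    // Illustrative validation filter: drops positions with out-of-range coordinates,
    // impossible speeds, or timestamps in the future.
    public class PositionValidator implements FilterFunction<GpsPosition> {
        private static final double MIN_LAT = 35.0, MAX_LAT = 44.0;   // approximate bounding box
        private static final double MIN_LON = -10.0, MAX_LON = 5.0;

        @Override
        public boolean filter(GpsPosition p) {
            boolean insideBox = p.latitude() >= MIN_LAT && p.latitude() <= MAX_LAT
                    && p.longitude() >= MIN_LON && p.longitude() <= MAX_LON;
            boolean plausibleSpeed = p.speedKmh() >= 0 && p.speedKmh() <= 200;
            boolean notInFuture = p.timestampMs() <= System.currentTimeMillis();
            return insideBox && plausibleSpeed && notInFuture;
        }
    }

In the first version, positions that failed this filter were simply logged and dropped; as we admit at the end of the article, a dead letter queue from day one would have been wiser.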

Enrichment. A second Flink job takes validated positions and enriches them with route data. For each position, it looks up the route assigned to the vehicle (cached in Flink state, refreshed every 5 minutes from the planning API), calculates the distance to the next delivery point, and determines whether the vehicle is on route, deviating, or stopped.

Enrichment is the most computationally expensive stage. Reverse geocoding (converting coordinates to a readable address) was excluded from the real-time flow due to latency: it adds between 50 and 200 ms per position when calling an external service. Instead, we run it in a separate batch pipeline that feeds the historical views.
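
To make the structure concrete, this is roughly the shape of the enrichment operator: keyed by vehicle, with the assigned route held in Flink state and refreshed when it is older than 5 minutes. A sketch only; Route, EnrichedPosition and PlanningClient are hypothetical stand-ins for our internal types, not real library classes.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Illustrative enrichment step: looks up the cached route for the vehicle,
    // refreshing it from the planning API at most every 5 minutes, then emits
    // the position enriched with route id and distance to the next stop.
    public class RouteEnricher extends KeyedProcessFunction<String, GpsPosition, EnrichedPosition> {
        private static final long ROUTE_TTL_MS = 5 * 60 * 1000;
        private transient ValueState<Route> routeState;
        private transient ValueState<Long> fetchedAtState;
        private transient PlanningClient planning;   // hypothetical client for the planning API

        @Override
        public void open(Configuration parameters) {
            routeState = getRuntimeContext().getState(new ValueStateDescriptor<>("route", Route.class));
            fetchedAtState = getRuntimeContext().getState(new ValueStateDescriptor<>("fetchedAt", Long.class));
            planning = new PlanningClient();
        }

        @Override
        public void processElement(GpsPosition p, Context ctx, Collector<EnrichedPosition> out) throws Exception {
            long now = ctx.timerService().currentProcessingTime();
            Long fetchedAt = fetchedAtState.value();
            if (fetchedAt == null || now - fetchedAt > ROUTE_TTL_MS) {
                routeState.update(planning.routeForVehicle(ctx.getCurrentKey()));
                fetchedAtState.update(now);
            }
            Route route = routeState.value();
            double distanceToNextStop = route.distanceToNextStop(p.latitude(), p.longitude());
            out.collect(new EnrichedPosition(p, route.id(), distanceToNextStop));
        }
    }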

Current state. Enriched positions are written to Redis with one key per vehicle. Only the last known position is kept. The operations dashboard reads from Redis with polling every 2 seconds (we tried WebSockets, but the overhead of maintaining 50 persistent connections was not justified for a dashboard with a 2-second refresh rate).
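
A minimal sketch of that write, assuming the Jedis client; the key naming and the toJson() helper are our illustration. Because it is a plain SET keyed by vehicle, replaying the write after a failure leaves the same result, which matters for the exactly-once discussion below.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import redis.clients.jedis.JedisPooled;

    // Illustrative sink: one Redis key per vehicle, overwritten with the latest
    // enriched position. The SET is idempotent, so replays do not corrupt state.
    public class LatestPositionSink extends RichSinkFunction<EnrichedPosition> {
        private transient JedisPooled redis;

        @Override
        public void open(Configuration parameters) {
            redis = new JedisPooled("redis.internal", 6379);   // placeholder endpoint
        }

        @Override
        public void invoke(EnrichedPosition pos, Context ctx) {
            redis.set("vehicle:" + pos.vehicleId() + ":last", pos.toJson());
        }

        @Override
        public void close() {
            if (redis != null) {
                redis.close();
            }
        }
    }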

Historical storage. In parallel, enriched positions are written to TimescaleDB via Kafka Connect (JDBC Sink). Partitioned by day, 90-day online retention, archived to S3 afterward.

The problems we did not expect

Schema evolution

Three weeks into production, the GPS device vendor updated their firmware. Messages now included a new field: battery_level. Our Avro schema did not expect it. And the schema-aware serializer, correctly, rejected the messages because they did not conform to the registered schema.

Lesson learned: always use backward-compatible schema evolution. We configured the Confluent Schema Registry in BACKWARD mode. All new fields must be optional with default values. Existing fields cannot be removed or have their type changed. It is restrictive, but it saves you from a 3 AM session trying to work out why the pipeline stopped.

Now every schema change goes through a PR that includes the compatibility test against the Schema Registry. If it does not pass, it does not merge.
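
The check itself can be a small test that asks the Schema Registry whether the candidate schema is compatible with what is already registered for the subject. A rough sketch assuming the Confluent Java client and Avro; the exact client API differs between versions, and the URL and file path are placeholders.

    import io.confluent.kafka.schemaregistry.avro.AvroSchema;
    import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
    import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
    import org.apache.avro.Schema;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustrative compatibility gate: fails the build if the new schema breaks
    // the compatibility mode configured for the subject (BACKWARD in our case).
    public class SchemaCompatibilityCheck {
        public static void main(String[] args) throws Exception {
            SchemaRegistryClient registry =
                    new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
            Schema candidate = new Schema.Parser()
                    .parse(Files.readString(Path.of("schemas/gps_position.avsc")));
            if (!registry.testCompatibility("gps.validated.positions-value", new AvroSchema(candidate))) {
                throw new IllegalStateException("Schema change is not backward compatible");
            }
        }
    }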

Exactly-once semantics

The business requirement was clear: every position is processed exactly once. We cannot lose positions (a vehicle disappears from the map). We cannot duplicate positions (a vehicle appears in two places at once, or distance metrics are doubled).

Kafka offers exactly-once semantics with transactions. Flink's checkpointing with aligned barriers guarantees exactly-once processing within the job. But the most delicate segment is the write to Redis and TimescaleDB.

For Redis, we use idempotent operations: every write is a SET with the vehicle key. If executed twice, the result is the same. Exactly-once is trivial when the operation is idempotent.

For TimescaleDB, the Kafka Connect JDBC Sink connector supports idempotency with primary keys. We configured a UNIQUE constraint on (vehicle_id, timestamp). Duplicates are ignored via ON CONFLICT DO NOTHING. This makes the write idempotent and gives us exactly-once end-to-end.

The cost of this configuration: 15% more latency in Flink due to checkpointing with aligned barriers. We went from 2.8 seconds p95 to 4.2 seconds. We accepted the trade-off because the alternative (at-least-once with manual deduplication) required more operational complexity.
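
For reference, the Flink side of that trade-off is just the checkpointing configuration. A sketch with illustrative values, not our exact production settings:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Exactly-once checkpointing with aligned barriers (Flink's default behavior
    // for EXACTLY_ONCE). Interval, timeout and pause below are illustrative.
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);   // checkpoint every 10 s
    env.getCheckpointConfig().setCheckpointTimeout(60_000);            // abort a checkpoint after 60 s
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);    // breathing room between checkpoints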

Backpressure

At week six, the pipeline started accumulating lag during peak hours. The gps.validated.positions topic was receiving about 200 messages per second more than Flink could consume. The enrichment job was the bottleneck: the route cache lookup had degraded because Flink state had grown larger than expected.

Diagnosis took two days. Flink reports backpressure in its UI, but the cause is not always obvious. In our case, the state backend (RocksDB) was performing frequent compactions because we had misconfigured the compaction levels. The fix was setting state.backend.rocksdb.compaction.level.use-dynamic-size to true and increasing the write buffer.
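
Both settings can be applied through the standard Flink configuration. A sketch using the string config keys, with an illustrative write buffer size rather than our exact production value:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Illustrative RocksDB tuning: dynamic level sizing plus a larger write buffer.
    Configuration conf = new Configuration();
    conf.setString("state.backend.rocksdb.compaction.level.use-dynamic-size", "true");
    conf.setString("state.backend.rocksdb.writebuffer.size", "128mb");   // example value
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);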

After the fix, the enrichment job’s throughput went from 1,000 to 1,800 messages per second. Sufficient margin for current peak hours and for 50% fleet growth.

The lesson: monitoring the lag of each consumer group and the backpressure of each Flink operator is absolutely essential. We use Flink metrics exported to Prometheus with Grafana dashboards. Any sustained lag above 10 seconds for more than 5 minutes generates an alert.

Late arrivals

GPS devices do not always have connectivity. A vehicle entering an underground parking garage loses signal for 10 minutes. When it exits, it sends 600 accumulated positions at once. These positions arrive “late” relative to the real-time flow.

Flink handles this with watermarks and allowed lateness. We configured a watermark with 30 seconds of tolerance and an allowed lateness of 5 minutes. Positions arriving within those 5 minutes are processed normally. Those arriving later are sent to a side output that feeds a “late positions” topic processed by an independent batch pipeline.
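
In DataStream API terms, that whole policy fits in a few lines. A sketch assuming the validated GpsPosition stream from earlier; the window size, the aggregate function and the Kafka sink for late data are hypothetical placeholders.

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.OutputTag;

    // 30 s of out-of-orderness in the watermark, 5 min of allowed lateness,
    // and a side output for anything that arrives later than that.
    OutputTag<GpsPosition> lateTag = new OutputTag<>("late-positions") {};

    DataStream<GpsPosition> withTimestamps = validated.assignTimestampsAndWatermarks(
            WatermarkStrategy.<GpsPosition>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((pos, ts) -> pos.timestampMs()));

    SingleOutputStreamOperator<VehicleWindowStats> windowed = withTimestamps
            .keyBy(GpsPosition::vehicleId)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .allowedLateness(Time.minutes(5))
            .sideOutputLateData(lateTag)
            .aggregate(new VehicleStatsAggregate());   // hypothetical AggregateFunction

    // Everything later than 5 minutes goes to the "late positions" topic
    // that the independent batch pipeline consumes.
    windowed.getSideOutput(lateTag).sinkTo(latePositionsKafkaSink);   // hypothetical KafkaSink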

In practice, 99.2% of positions arrive within the 30-second watermark. 0.6% arrive between 30 seconds and 5 minutes. Only the remaining 0.2% arrive later and go to the batch pipeline. Those percentages determined our thresholds.

Production metrics

After four months in production, these are the stable metrics:

  • Throughput: 1,100 msg/s average, 1,600 msg/s peak
  • End-to-end latency (GPS to Redis): p50 = 1.8s, p95 = 4.2s, p99 = 6.1s
  • Pipeline availability: 99.94% (32 minutes of downtime in 4 months, caused by a Kafka upgrade requiring a rolling restart)
  • Lost messages: 0 (exactly-once works)
  • Duplicate messages in TimescaleDB: 0 (idempotency works)
  • Infrastructure cost: 1,200 euros/month (3 Kafka brokers on m5.large instances, 2 Flink TaskManagers on c5.xlarge, Redis cache.t3.medium, all on AWS)

The previous batch system cost 400 euros/month. We tripled infrastructure cost, but the business value of moving from 12-minute delay to 4 seconds is not measured in infrastructure euros. It is measured in on-time deliveries, instantly detected route deviations, and an operations team making decisions with data that is 4 seconds old instead of a quarter of an hour.

What we would do differently

If we started over, three things would change.

First, we would have used Apache Iceberg instead of TimescaleDB for historical storage. TimescaleDB works well, but Iceberg would give us native schema evolution, time travel, and the ability to use Spark or Trino for analytics without moving the data. The TimescaleDB decision was based on familiarity, not technical merit.

Second, we would have invested more time in testing before production. Our staging environment had 10% of real traffic, which was insufficient to discover the backpressure problem. A synthetic load generator that simulated real patterns (including late arrival bursts) would have saved us two weeks of debugging in production.

Third, we would have implemented dead letter queues from day one. Messages that failed validation were discarded with a log. There was no way to reprocess them. When we discovered that our invalid coordinate filter was too aggressive (it discarded valid positions from Ceuta and Melilla because our bounding box only covered mainland Spain), it took three days to recover the lost positions from Kafka logs.

For teams evaluating a real-time data pipeline, our advice is direct: start with the business question (what latency do you actually need), not with the technology. If 30 seconds is sufficient, Kafka Streams will probably do. If you need complex windows, stream joins, or heavy stateful processing, Flink justifies its additional complexity. And if you do not need real-time at all, a well-executed batch every 5 minutes is a fraction of the cost and complexity.

For a broader perspective before committing to a specific architecture, our practical guide to real-time pipelines covers the evaluation process. For real-time data engineering projects in logistics and beyond, the right architecture depends on the problem. Not the hype.

About the author


abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.