MLOps: From Notebook to Production Pipeline
“The model works on my laptop”
That sentence should set off the same alarm in an ML team that “it works on my machine” sets off in a software engineer. The problem is identical: a controlled environment, clean data, a single user, zero concurrency. The Jupyter notebook where a data scientist achieves an AUC of 0.94 is a research artifact, not a production system.
Yet most data teams attempt to put exactly that into production. A notebook converted to a script, executed in a cron job, with no data versioning, no drift monitoring, no rollback capability. It works until it stops working, and when it stops working nobody knows why.
This whitepaper documents what we have learned deploying ML models in production for clients in logistics, finance, and retail. It is not a theoretical MLOps guide. It is a map of the real traps we have encountered and the solutions that survived contact with reality.
The chasm between experimentation and production
There is a chasm between a model that works and a model in production. It is not a minor technical jump. It is a complete transformation of the workflow, the artifacts, the guarantees, and the responsibilities.
In experimentation, the data scientist controls everything: the data, the environment, the timing. They can re-run an entire notebook, tweak a hyperparameter, regenerate features manually. The output is a number (accuracy, F1, AUC) and a conclusion (“works” or “doesn’t work”).
In production, none of that applies.
Data arrives in real time, with inconsistent formats, null values that were not in the training dataset, and distributions that shift without warning. The model must respond in milliseconds, not minutes. And if it fails, there is no human watching a notebook to correct it: there is a downstream system that receives an erroneous prediction and acts on it.
We have identified four fundamental gaps between experimentation and production:
Reproducibility gap. The original notebook used pandas 1.5.3, scikit-learn 1.2.1, and a dataset downloaded “yesterday.” Three months later, nobody can reproduce the exact result. Dependencies have changed, the dataset no longer exists in the same form, and the random seed was never recorded. Without reproducibility there is no auditability, and without auditability there is no trust.
Data gap. The model was trained on a static snapshot. In production, data is a stream. Distributions shift. New categories appear. Values that were rare become common. The model does not know the world has changed, and keeps predicting as if it were January when it is August.
Monitoring gap. In experimentation, the metric is a number you calculate at the end. In production, you need to know in real time whether the model is still performing correctly. Not just whether it returns a response (that is basic health checking), but whether the quality of that response remains within acceptable parameters.
Cost gap. A notebook that takes 20 minutes to train a model does not generate alarming invoices. A pipeline that retrains daily on cloud GPUs, processes millions of inferences, and stores terabytes of features does. And the bill arrives at the end of the month, when it is too late to optimize.
Model registry: the missing version control
The first component of a mature MLOps system is a model registry. It is the equivalent of a Git repository for models, but with ML-specific metadata: hyperparameters, evaluation metrics, training dataset, data lineage.
We use MLflow as our central registry. Not because it is perfect (it is not), but because it is open source, integrates with practically everything, and has a manageable learning curve. Alternatives like Weights & Biases or Neptune offer better UX, but vendor lock-in concerns us for long-term projects.
What we register for each model:
- Training code version (Git commit hash)
- Training dataset hash (for reproducibility)
- Complete hyperparameters
- Evaluation metrics (not just the primary metric, all relevant ones)
- Dependencies and exact versions (frozen requirements.txt)
- Training date and duration
- Infrastructure used (instance type, GPU, RAM)
With this information, any registered model is reproducible. We can retrain version 23 exactly as it was originally trained. We can compare version 23 with 24 across any dimension. And we can roll back to version 22 if version 23 introduces a regression.
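In practice we log this metadata through MLflow's tracking API, but the discipline matters more than the tool. The sketch below shows the idea in plain Python: an append-only registry of immutable records carrying every field listed above. All names (`ModelRecord`, the example hyperparameters and metrics) are illustrative, not our actual schema.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a registered model is never modified in place
class ModelRecord:
    git_commit: str        # training code version
    dataset_sha256: str    # training dataset hash, for reproducibility
    hyperparameters: dict
    metrics: dict          # all relevant metrics, not just the primary one
    requirements: tuple    # frozen dependency pins
    trained_at: str
    instance_type: str

def register(registry: list, record: ModelRecord) -> int:
    """Append-only: changes always produce a new version, never an overwrite."""
    registry.append(record)
    return len(registry)  # the new version number

registry = []
v = register(registry, ModelRecord(
    git_commit="9f2c1ab",
    dataset_sha256=hashlib.sha256(b"training-snapshot").hexdigest(),
    hyperparameters={"max_depth": 8, "n_estimators": 400, "seed": 42},
    metrics={"auc": 0.94, "f1": 0.88, "precision": 0.91},
    requirements=("pandas==1.5.3", "scikit-learn==1.2.1"),
    trained_at="2025-01-15T08:30:00Z",
    instance_type="p4d.24xlarge",
))
```

The `frozen=True` dataclass enforces the immutability rule at the language level: a "small change" to a deployed model is a type error, not a quiet overwrite.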
The key discipline is treating models as immutable artifacts. A registered model is never modified. If you need to change something, you create a new version. This seems obvious, but we have seen teams overwrite models in production directly because “it’s a small change.” Those small changes generate the hardest incidents to diagnose.
Staged promotion
Not every registered model goes to production. We implement a promotion flow with three stages:
Staging. The model passes through a battery of automated tests: input/output schema validation, performance tests (p50, p95, p99 latency), regression tests against a golden dataset, and integration tests with the data pipeline. If any test fails, it does not advance.
Canary. The model receives a small percentage of production traffic (typically 5%). We compare its predictions against the current production model. If the business metrics (not just ML metrics) are equal or better for 48 hours, it advances. If there is degradation, automatic rollback.
Production. The model receives 100% of traffic. But the previous model remains deployed and ready to receive traffic for 7 days. Rollback is a configuration change, not a redeployment.
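The canary decision reduces to a small, testable rule: any degradation on business metrics means rollback; equal-or-better performance still has to hold for the full observation window before promotion. A minimal sketch, with hypothetical metric names:

```python
def canary_decision(canary_metrics: dict, prod_metrics: dict,
                    hours_observed: float, min_hours: float = 48.0) -> str:
    """Compare canary vs. production on business metrics (higher is better).

    Returns "promote", "rollback", or "wait".
    """
    for name, prod_value in prod_metrics.items():
        if canary_metrics[name] < prod_value:  # any degradation: roll back
            return "rollback"
    if hours_observed < min_hours:             # equal or better, but too early
        return "wait"
    return "promote"

# A degraded conversion rate triggers automatic rollback, regardless of
# how well the canary did on offline ML metrics.
decision = canary_decision(
    canary_metrics={"conversion_rate": 0.031, "avg_order_value": 54.2},
    prod_metrics={"conversion_rate": 0.034, "avg_order_value": 53.8},
    hours_observed=12,
)
# decision == "rollback"
```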
This flow has prevented at least four serious incidents in the past year. In one case, a document classification model passed staging correctly but showed a 12% degradation in accuracy during the canary phase. The problem was a shift in the distribution of document types that only manifested with real traffic. Without the canary phase, that model would have been in production generating silent errors.
Feature stores: the most undervalued piece
Ask a data scientist what matters most for a good model, and they will say “the data.” Ask them which MLOps tool is most important, and they will probably say “the registry” or “the training pipeline.” They rarely say “the feature store.”
Yet in our experience, the feature store is the piece that has the most impact on the daily operations of a production ML system.
A feature store solves three problems:
Train-serve consistency. The number one problem with models in production is that the features the model receives at inference time are different from those it received during training. Not because the data is different, but because the code that computes the features is different. The training script calculates the 30-day moving average one way. The inference service calculates it another way. Or worse: the inference service does not calculate it at all, receiving it from another system with its own logic.
With a feature store, the definition of each feature exists in a single place. The training pipeline reads from there. The inference service reads from there. Same feature, same computation, always.
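The single-definition principle can be shown without any feature store at all (this is deliberately not Feast's API, just the core idea): both the training pipeline and the inference service resolve a feature name through the same registry, so they can never compute it differently.

```python
# Single source of truth: each feature is defined exactly once.
# (Illustrative sketch; feature names and data shapes are hypothetical.)
FEATURE_DEFINITIONS = {
    "volume_30d_avg": lambda rows: sum(r["volume"] for r in rows) / len(rows),
}

def compute_features(rows: list, names: list) -> dict:
    """Used verbatim by BOTH the training pipeline and the inference service."""
    return {name: FEATURE_DEFINITIONS[name](rows) for name in names}

history = [{"volume": 10.0}, {"volume": 14.0}, {"volume": 12.0}]
train_row = compute_features(history, ["volume_30d_avg"])  # training path
serve_row = compute_features(history, ["volume_30d_avg"])  # serving path
# train_row == serve_row == {"volume_30d_avg": 12.0}
```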
Feature reuse. Without a feature store, each team computes their own features. The fraud team computes “number of transactions in the last 7 days.” The marketing team computes exactly the same thing with a different name. The risk team computes it with a 30-day window. Three pipelines, three compute costs, three opportunities for inconsistency. With a centralized feature store, a feature is computed once and used by everyone.
Real-time serving. Some features need to be available with millisecond latency for online inference. Computing them on-the-fly is too slow. A feature store keeps pre-computed features in a low-latency store (typically Redis or DynamoDB) and updates them continuously or in batch depending on the case.
We use Feast as our feature store on most projects. It is open source, supports both batch and online serving, and integrates well with the ecosystem we already have (Spark, Kafka, PostgreSQL). For simpler projects where we only need batch features, sometimes a materialized table in BigQuery or Snowflake is sufficient. Not everything needs the full complexity of a dedicated feature store.
The cost of not having a feature store
On a demand prediction project for a logistics client, we deployed the model without a feature store. Features were computed in a Python script that ran before each batch inference.
Two months in, we discovered that a key feature (average volume per route over the last 90 days) was computed differently in training and inference. The training script used post-hoc corrected data. The inference script used raw data. The difference was subtle (less than 3% on most routes), but on low-volume routes it generated predictions 40% higher than reality.
Reconstructing the correct features, validating train-serve consistency, and retraining the model cost us three weeks. Implementing Feast cost two weeks. The lesson was expensive but clear.
Monitoring: beyond the health check
A production model returning HTTP 200 does not mean it is working correctly. It means it has not crashed. ML model monitoring has three layers, and most teams only implement the first.
Layer 1: Infrastructure monitoring
Latency, throughput, errors, CPU/GPU/memory usage. This is the monitoring any web service needs. We use Prometheus + Grafana for this, with alerts in PagerDuty. Nothing ML-specific here.
Layer 2: Data monitoring (data drift)
The input data to the model changes over time. Sometimes gradually (seasonal drift), sometimes abruptly (a supplier policy change that alters data formats). Drift monitoring detects when the distribution of input data has deviated significantly from the training distribution.
We implement drift monitoring with two complementary metrics:
PSI (Population Stability Index) for categorical features. A PSI above 0.2 indicates significant drift. Above 0.25, we generate an automatic alert.
KS test (Kolmogorov-Smirnov) for numerical features. A p-value below 0.01 with relevant effect size triggers the alert.
We do not monitor all features. We monitor the 10-15 most important features according to model importance (SHAP values from the last training). Monitoring 200 features generates noise and false positives. Monitoring the 15 that actually matter gives you actionable signals.
When we detect significant drift, we do not retrain automatically. First, we investigate. Drift can be a real business change (and the model needs to adapt) or a data problem (and the data needs to be fixed). Automatically retraining on corrupted data is worse than not retraining at all.
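PSI itself is a few lines of arithmetic over category proportions; the KS side is handled analogously with `scipy.stats.ks_2samp` on the raw numerical samples. A minimal PSI sketch, with made-up distributions:

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index.

    `expected` and `actual` are per-category shares (each summing to 1)
    of the training and production distributions respectively.
    """
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

# Category shares at training time vs. today (illustrative numbers)
training_dist = [0.5, 0.3, 0.2]
current_dist = [0.2, 0.3, 0.5]
score = psi(training_dist, current_dist)  # ~0.55: well past "significant"
alert = score > 0.25                      # our automatic-alert threshold
```

Note the guard for zero-frequency categories: in production we smooth empty bins instead of dropping them, since a brand-new category is itself a drift signal.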
Layer 3: Model performance monitoring
This is the most valuable and hardest layer to implement. It requires ground truth: knowing what the correct answer was so you can compare it against the model’s prediction.
In some cases, ground truth arrives quickly. A ticket classification model can be evaluated when a human agent resolves the ticket and confirms or corrects the category. In other cases, it takes weeks or months. A churn prediction model cannot be evaluated until the 90-day prediction window has passed.
For slow-evaluation cases, we use proxy metrics: the model predicts churn, and if the customer reduces activity by 50% in the following two weeks, we count it as partial validation. It is not perfect ground truth, but it is better than flying blind for 90 days.
We log all predictions with their timestamp and input hash. When ground truth arrives, we match them and calculate real metrics. This feeds a dashboard that shows the evolution of real model accuracy over time.
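The log-then-match mechanism is simple: key each prediction by a hash of its inputs, attach ground truth whenever it arrives, and compute metrics only over matched pairs. A sketch under simplifying assumptions (in-memory store, exact classification match; all names hypothetical):

```python
import hashlib
import json

prediction_log = {}  # in production: a durable store keyed by hash + timestamp

def log_prediction(inputs: dict, prediction, timestamp: str) -> str:
    """Store a prediction keyed by a deterministic hash of its inputs."""
    key = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    prediction_log[key] = {"prediction": prediction, "timestamp": timestamp}
    return key

def record_ground_truth(key: str, truth) -> None:
    prediction_log[key]["truth"] = truth  # may arrive weeks after the prediction

def realized_accuracy() -> float:
    """Accuracy over predictions whose ground truth has already arrived."""
    matched = [p for p in prediction_log.values() if "truth" in p]
    correct = sum(p["prediction"] == p["truth"] for p in matched)
    return correct / len(matched)

k1 = log_prediction({"customer_id": 1}, "churn", "2025-01-10T00:00:00Z")
k2 = log_prediction({"customer_id": 2}, "stay", "2025-01-10T00:00:00Z")
record_ground_truth(k1, "churn")
record_ground_truth(k2, "churn")
# realized_accuracy() == 0.5
```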
Training pipeline: automation with guardrails
Automatic retraining is one of the most dangerous topics in MLOps. Done well, it keeps models updated without human intervention. Done poorly, it introduces silent regressions that degrade the service for weeks before anyone notices.
Our training pipeline has three modes:
Scheduled. Weekly or monthly retraining with fresh data. The new model goes through the full promotion flow (staging, canary, production). This is the default mode for stable models.
Drift-triggered. When layer 2 monitoring detects significant drift and a human confirms it is a real business change, a retraining is triggered. The new model is explicitly compared against the current one on the segment where drift was detected.
Manual. A data scientist decides to retrain because new features are available, because the base model has been updated, or because a bias has been identified. This mode bypasses the automatic trigger but does not bypass the promotion flow.
What we never do is closed-loop retraining: automatically detected drift producing automatic retraining producing automatic deployment. Without a human somewhere in the chain, the risk of silent degradation is too high.
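The no-closed-loop rule can be enforced as a single guard in the pipeline entry point. All three trigger modes feed the same promotion flow afterwards; only the drift-triggered path demands explicit human confirmation (a sketch, with hypothetical names):

```python
from enum import Enum

class Trigger(Enum):
    SCHEDULED = "scheduled"
    DRIFT = "drift"
    MANUAL = "manual"

def may_start_retraining(trigger: Trigger, human_confirmed: bool = False) -> bool:
    """Drift alone never starts a retraining: a human must first confirm the
    drift is a real business change, not a data-quality incident."""
    if trigger is Trigger.DRIFT:
        return human_confirmed
    return True
```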
Data versioning
Every training must answer the question: what exact data was this model trained on? Not “the January data” but “snapshot 2025-01-15T08:30:00Z of dataset customer_features_v3, SHA256 hash a1b2c3d4.”
We use DVC (Data Version Control) for this. DVC works like Git but for large datasets. The data is stored in S3, and DVC maintains the hashes and metadata in Git. Every training commit includes the DVC lockfile, ensuring we can reproduce exactly the dataset of any model version.
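DVC computes and tracks these hashes for you, but the underlying idea is just a streamed content hash plus an unambiguous reference string. A plain-Python sketch (file name and snapshot label are illustrative):

```python
import hashlib
import tempfile

def dataset_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB datasets hash without loading into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot_ref(name: str, timestamp: str, path: str) -> str:
    """Answers 'what exact data was this model trained on?' unambiguously."""
    return f"snapshot {timestamp} of dataset {name}, SHA256 {dataset_sha256(path)}"

# Illustrative dataset file
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".csv")
tmp.write(b"route,volume\nA,10\n")
tmp.close()
ref = snapshot_ref("customer_features_v3", "2025-01-15T08:30:00Z", tmp.name)
```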
For data that changes at high frequency (streaming features), we use point-in-time snapshots in Delta Lake. Each snapshot is an immutable table that can be referenced by timestamp. It is not as clean as DVC for static datasets, but it works for our case of features updated every hour.
Cost optimization: where the money goes
MLOps costs have four main components, and intuition about which is most expensive is usually wrong.
Training compute. This is the most visible cost but rarely the highest. A weekly training on an A100 GPU for 4 hours costs about $48 on AWS (p4d.24xlarge spot instance). That is $200 per month. Not trivial, but not the problem.
Inference compute. This is where cost concentrates for most models. A model processing 10 million inferences per day needs dedicated infrastructure running continuously. If the model requires GPU for inference, costs can easily exceed $2,000 monthly.
The key optimization here is quantization and distillation. A model quantized to INT8 uses half the memory and is 2-3x faster at inference, with typical degradation of less than 1% in accuracy. For models that do not require maximum precision, it is an obvious optimization that many teams fail to implement.
Another optimization: inference batching. Instead of processing each request individually, we group requests into microbatches of 32 or 64. This better leverages GPU parallelism and can reduce inference cost by 40-60%.
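The batching logic itself is small. The sketch below groups a request stream into fixed-size microbatches; a real inference server also flushes on a time window (e.g. 10 ms) so a lone request is never stuck waiting for a full batch.

```python
from typing import Iterable, Iterator, List

def microbatch(requests: Iterable, max_size: int = 32) -> Iterator[List]:
    """Group incoming requests so each GPU forward pass processes a batch
    instead of a single example."""
    batch = []
    for req in requests:
        batch.append(req)
        if len(batch) == max_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

batches = list(microbatch(range(10), max_size=4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```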
Feature storage. The online feature store (Redis, DynamoDB) can be surprisingly expensive if not actively managed. A feature store with 50 million entities and 200 features per entity, updated hourly, can cost over $1,500 monthly on DynamoDB on-demand. The optimization here is being selective: only features that need millisecond latency go in the online store. The rest are served from the offline store (S3 + Athena) with latencies in seconds.
Data and log storage. Monitoring logs, historical predictions, data snapshots, training artifacts. It all adds up. We implement aggressive retention policies: detailed logs 30 days, aggregated logs 1 year, predictions 6 months (unless regulatory requirements apply), training artifacts only for the last 10 versions.
Real case: from $4,200/month to $1,800/month
For a retail client with three production models (product recommendation, demand prediction, fraud detection), the monthly MLOps cost was $4,200.
We identified three optimizations:
- Quantization of the recommendation model from FP32 to INT8. 45% reduction in inference cost for that model.
- Migration of the feature store from DynamoDB on-demand to provisioned with auto-scaling. 35% reduction in storage cost.
- Consolidation of three training pipelines onto a single spot instance with scheduling. The three models trained on separate dedicated GPUs, even though they never trained simultaneously.
Result: $1,800/month. A 57% reduction. Implementation effort was two weeks of a senior engineer’s time.
Orchestration: the glue that holds it all together
The MLOps pipeline has many moving parts: data extraction, feature computation, training, evaluation, registration, deployment, monitoring. Orchestrating all of this requires a system that handles dependencies, retries, parallelism, and state.
We use Prefect for orchestration. We tried Airflow first (everyone uses it, so it must be the right answer) and abandoned it after three months. Airflow is excellent for classic batch ETL, but its programming model based on static DAGs fits poorly with ML workflows, which are more dynamic: decide whether to retrain based on monitoring results, choose the dataset based on the date, select hyperparameters based on historical performance.
Prefect lets us define flows as normal Python code with decorators, has configurable retries per task, supports native concurrent execution, and has a UI that data scientists actually use (unlike the Airflow UI, which only the infra team used).
A detail we underestimated initially: the importance of notifications. When a pipeline fails at 3 AM, someone needs to know. When a retraining produces a model with worse metrics than the current one, someone needs to decide whether to deploy it or not. We integrated Prefect with Slack and PagerDuty, with different severity levels: pipeline failure goes to Slack, deployment failure goes to PagerDuty.
Testing: the part nobody wants to do
ML models need tests. Not just validation on a hold-out set, but engineering tests that verify the complete system works correctly.
Schema tests. Verify that the model input has the expected format. Correct types, valid ranges, non-null values where required. These tests run on every inference and reject invalid inputs before they reach the model. Trivial to implement, but they prevent an entire category of silent errors.
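"Trivial to implement" really does mean a few lines. A minimal validator, with a made-up schema format (`{field: (type, min, max, nullable)}`) for illustration:

```python
def validate_input(row: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the row is valid."""
    errors = []
    for field, (ftype, lo, hi, nullable) in schema.items():
        value = row.get(field)
        if value is None:
            if not nullable:
                errors.append(f"{field}: null not allowed")
            continue
        if not isinstance(value, ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
        elif not (lo <= value <= hi):
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors

SCHEMA = {"age": (int, 0, 120, False), "balance": (float, -1e9, 1e9, True)}
ok = validate_input({"age": 34, "balance": 1500.0}, SCHEMA)       # []
bad = validate_input({"age": None, "balance": 2.0}, SCHEMA)       # one violation
```

Invalid rows are rejected before they reach the model, which turns a silent mis-prediction into a loud, attributable error.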
Regression tests. A golden dataset of 500-1000 examples with known correct predictions. Every new model must produce predictions within an acceptable margin for this dataset. If a new model correctly classifies 92% of the golden set when the current model classifies 95%, there is a regression and the model is not promoted.
Performance tests. P99 latency below a threshold. Minimum throughput in inferences per second. Memory usage within limits. These tests run in the staging environment with synthetic load before any promotion.
Bias tests. Verify that the model does not discriminate by protected variables. We calculate fairness metrics (demographic parity, equal opportunity) for each model version. If there is a degradation in fairness, the model is manually reviewed before deployment. This type of test is increasingly relevant with regulations like the EU AI Act.
Integration tests. The complete model, with its preprocessing, inference, and postprocessing, runs end-to-end with synthetic data covering known edge cases. These tests verify that the integration between components works, not just that each individual component passes its unit tests.
Organization: who owns what
The hardest aspect of MLOps is not technical. It is organizational. Who is responsible for the model in production? The data scientist who trained it? The ML engineer who deployed it? The platform team maintaining the infrastructure?
We have seen three organizational models and their outcomes:
Model 1: the data scientist does everything. The data scientist trains, deploys, and monitors. Result: the data scientist spends 70% of their time on operations and 30% on data science. High frustration, low operational quality.
Model 2: complete separation. Data scientists train, ML engineers deploy. Result: a 3-4 week queue between “the model is ready” and “the model is in production.” Communication via tickets. Context lost at every handoff.
Model 3: integrated team with shared platform. Data scientists work alongside ML engineers on the same team. A shared MLOps platform reduces operational friction. The data scientist can deploy to staging without help. Promotion to production requires ML engineer approval. Result: 2-3 day time-to-production, high operational quality, happy data scientists.
Model 3 is what we recommend, with one condition: the MLOps platform must be good enough that the data scientist can use it without being an expert in Kubernetes or Docker. If deploying a model requires writing a Dockerfile, configuring a Helm chart, and creating an Ingress, the data scientist will find shortcuts. And those shortcuts end up in production.
MLOps maturity checklist
To close with something actionable, here is the checklist we use to evaluate an organization’s MLOps maturity. Each item is binary: you have it or you do not.
Level 1 (Basic):
- Models are versioned in a registry
- Dependencies are frozen per model version
- There is a documented deployment process (even if manual)
- There is basic infrastructure monitoring (latency, errors, uptime)
Level 2 (Operational):
- Deployment is automated with staged promotion
- There is data drift monitoring on critical features
- Training data is versioned
- There is a retraining pipeline (even if manually triggered)
- There are automatic regression tests
Level 3 (Advanced):
- Shared feature store with train-serve consistency
- Model performance monitoring with ground truth
- Scheduled retraining with automatic guardrails
- Bias and fairness tests in the promotion pipeline
- MLOps costs monitored and actively optimized
- Automatic rollback based on production metrics
Most teams are at level 1. Those that reach level 2 already see a dramatic improvement in model reliability. Level 3 is where we want to take all our clients, but it is a journey, not a leap.
If your data team is struggling with the gap between experimentation and production, the first step is not buying an MLOps platform. It is sitting down with the data scientists and the engineers, mapping the current flow of a model from notebook to production, identifying the three highest-friction points, and solving them. The platform comes after.
At abemon, we help teams design and implement MLOps workflows adapted to their scale and maturity. We do not sell the ideal architecture. We build the architecture your team can operate today, with a clear path toward where you need to be tomorrow. Because an MLOps system that nobody uses is worse than a manually executed notebook: at least someone understands the notebook.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
