Data Quality: How to Improve It Without Stopping Operations
The 30% nobody wants to look at
An executive at a logistics company once told us: “Our data is fine, it just doesn’t always add up.” That sentence summarizes the state of data quality in most organizations. Nobody says the data is bad. They say there are “some inconsistencies” or that “sometimes manual adjustments are needed.”
Reality is less kind. IBM estimates that low-quality data costs US businesses $3.1 trillion per year. Gartner calculates that 30% of a typical organization’s critical business data is inaccurate, incomplete, or duplicated. We are not talking about historical data buried in a data warehouse that nobody queries. We are talking about data that feeds daily operational decisions: shipping addresses, product prices, order statuses, customer information.
The natural impulse when someone acknowledges the problem is to propose a “data cleansing project.” Assemble a team, clean the database for three months, declare victory, and return to normal. The problem is that data gets dirty continuously. Without mechanisms that prevent and detect degradation, three months after the project you will be at the same point. Or worse, because the organization will think the problem is solved.
Profiling: measure before you act
You cannot improve what you do not measure. Data profiling is the diagnosis that precedes treatment. It involves systematically analyzing existing data to understand its actual structure (not the documented one, the actual one), its patterns, its anomalies, and its gaps.
A basic profile covers four dimensions:
Completeness. For each field, what percentage of records is populated? If your customer table has 50,000 records but the “email” field is empty in 12,000, your email completeness is 76%. That may be acceptable or catastrophic depending on the use case.
Uniqueness. How many duplicates exist? Exact duplicates are easy to detect. Fuzzy duplicates (“John Smith” and “J. Smith Jr.” referring to the same person) require more sophisticated matching algorithms. Tools like Python’s dedupe library or SQL’s SOUNDEX function help, but fuzzy matching always requires human validation for ambiguous cases.
Consistency. The same data represented in different ways. “London” vs. “LONDON” vs. “London, UK” in a city field. “United Kingdom” vs. “UK” vs. “GB” in a country field. Dates in DD/MM/YYYY vs. MM/DD/YYYY (a classic nightmare). Inconsistency is not an error in itself; it is an error multiplier in any process that consumes that data.
Validity. Do the data comply with business rules? A UK postcode has a defined format. An email has a validatable structure. A price cannot be negative. A quantity cannot be a fraction for certain product types. These rules seem obvious, but without active validation, invalid data enters and propagates.
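The four dimensions above can each be reduced to a simple rate. A minimal profiling sketch in plain Python, using an illustrative in-memory customer table (field names and records are hypothetical, for demonstration only):

```python
import re
from collections import Counter

# Hypothetical sample records; field names and values are illustrative.
customers = [
    {"name": "John Smith", "email": "john@example.com", "city": "London"},
    {"name": "J. Smith",   "email": "john@example.com", "city": "LONDON"},
    {"name": "Ann Lee",    "email": "",                 "city": "London, UK"},
    {"name": "Bob Ray",    "email": "not-an-email",     "city": "Leeds"},
]

def completeness(records, field):
    """Share of records where the field is present and non-empty."""
    return sum(1 for r in records if r.get(field)) / len(records)

def duplicate_rate(records, field):
    """Share of populated values that are exact duplicates of another value."""
    counts = Counter(r[field] for r in records if r.get(field))
    return sum(c - 1 for c in counts.values()) / sum(counts.values())

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validity(records, field, pattern):
    """Share of populated values that match a format rule."""
    values = [r[field] for r in records if r.get(field)]
    return sum(1 for v in values if pattern.match(v)) / len(values)

print(f"email completeness:   {completeness(customers, 'email'):.0%}")
print(f"email duplicate rate: {duplicate_rate(customers, 'email'):.0%}")
print(f"email validity:       {validity(customers, 'email', EMAIL_RE):.0%}")
# Consistency signal: distinct spellings of what may be the same city.
print("city spellings:", Counter(r["city"] for r in customers))
```

Even this toy profile surfaces the pattern that matters in practice: three different quality problems (a missing email, a duplicated email, a malformed email) hide behind a single innocuous-looking field.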
Profiling tools range from basic (ydata-profiling in Python, formerly pandas-profiling, which generates a complete HTML report in one line of code) to dedicated platforms like Great Expectations, Soda, or Monte Carlo. The tool matters less than the discipline to run profiling periodically and act on the results.
Validation rules: the first line of defense
Validation rules act at data entry points. Before a record is written to the database, it passes through a set of validations. If it fails, it is rejected or flagged for review.
Rules are organized into three levels:
Schema rules. The data type is correct, the required field is present, the value is within an acceptable range. These rules are deterministic and can be implemented at the database layer (constraints, check clauses) or at the application layer (form validation, API validation). Implementing them at the database level is safer because they cannot be bypassed. Implementing them only in the application leaves the door open for direct imports or integrations that skip validation.
Business rules. The order has at least one line item. The discount does not exceed 40%. The delivery date is after the order date. These rules encode domain knowledge and change with the business. They should be configurable, not hardcoded, so the business team can adjust them without code changes.
Statistical rules. The value is within N standard deviations of the historical mean. An invoice for EUR 50,000 when the average is EUR 500 requires review, even though it is technically and legally valid. These rules detect anomalies that deterministic rules do not capture.
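The three levels can be sketched in a few lines each. This is a minimal illustration, not a production design: the table, column names, thresholds, and sample history are all hypothetical. It uses SQLite for the database-layer constraints precisely because, as noted above, those cannot be bypassed by application code:

```python
import sqlite3
import statistics

# --- Schema rules at the database layer: enforced for every write path.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id       INTEGER PRIMARY KEY,
        price    REAL NOT NULL CHECK (price >= 0),
        discount REAL NOT NULL DEFAULT 0 CHECK (discount BETWEEN 0 AND 1)
    )
""")
try:
    conn.execute("INSERT INTO orders (price) VALUES (-5)")
except sqlite3.IntegrityError as e:
    print("rejected by schema rule:", e)

# --- Business rules: thresholds live in configuration, not in code,
#     so the business team can adjust them without a deployment.
BUSINESS_RULES = {"max_discount": 0.40}

def check_business(order):
    return order["discount"] <= BUSINESS_RULES["max_discount"]

# --- Statistical rules: flag values far from the historical mean.
def is_anomalous(value, history, n_sigmas=3):
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(value - mean) > n_sigmas * sd

invoice_history = [480, 510, 495, 530, 470, 505]  # illustrative EUR amounts
print(check_business({"discount": 0.55}))         # exceeds 40%: fails
print(is_anomalous(50_000, invoice_history))      # far from ~EUR 500 mean: flagged
```

Note the division of labor: the schema rule rejects outright, the business rule fails validation, and the statistical rule only flags for review, since an anomalous value may still be legitimate.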
Great Expectations is the de facto standard for implementing these rules in data pipelines. It defines “expectations” (rules) about the data, runs them against each batch, and generates a validation report. Expectations are versioned in Git like any other code, providing auditability and collaboration.
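Because an expectation suite is just a versionable file, the rules above translate naturally into one. A minimal suite might look like the following (the expectation type names follow Great Expectations' naming convention; the suite name and column names are illustrative):

```json
{
  "expectation_suite_name": "orders.warning",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "email"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "price", "min_value": 0}
    }
  ]
}
```

Checking a file like this into Git alongside the pipeline code gives quality rules the same diff, review, and rollback workflow as any other change.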
Automated cleansing pipelines
Validation rules prevent bad data from entering. Cleansing pipelines fix bad data that already exists (or that enters through channels that bypass validations).
A typical cleansing pipeline has three phases:
Detection. Identify records that do not meet quality standards. This can be a scheduled job that runs the same validation rules on existing data, or a more sophisticated process that uses fuzzy matching to detect duplicates.
Correction. For deterministic corrections (normalizing capitalization, formatting phone numbers, fixing postal codes), automation is safe. For ambiguous corrections (merging duplicates, correcting names), automation should propose and a human should approve. Automating duplicate merges without human review is a recipe for data loss.
Verification. After each cleansing cycle, quality metrics must be recalculated to confirm the cleansing was effective and did not introduce new problems. A pipeline that fixes 500 records but corrupts 50 is a broken pipeline.
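The three phases can be sketched end to end. In this illustrative example (record shapes and the "canonical phone is digits only" rule are hypothetical), the deterministic fix is applied automatically, while suspected duplicates are only proposed for human review, never merged:

```python
import re

# Hypothetical records: two need a deterministic phone fix, and two
# names look like a fuzzy duplicate that must go to a review queue.
records = [
    {"id": 1, "name": "john smith", "phone": "07700-900-123"},
    {"id": 2, "name": "John Smith", "phone": "07700 900 123"},
    {"id": 3, "name": "Ann Lee",    "phone": "07700900456"},
]

def detect(recs):
    """Phase 1: flag records whose phone is not in canonical digit-only form."""
    return [r for r in recs if not re.fullmatch(r"\d+", r["phone"])]

def correct(recs):
    """Phase 2: apply deterministic fixes; propose (not apply) ambiguous merges."""
    for r in recs:
        r["phone"] = re.sub(r"\D", "", r["phone"])  # safe, rule-based fix
    review_queue, seen = [], {}
    for r in recs:
        key = r["name"].lower()                     # naive duplicate signal
        if key in seen:
            review_queue.append((seen[key], r["id"]))  # human decides the merge
        seen[key] = r["id"]
    return review_queue

def verify(recs):
    """Phase 3: recompute the quality metric to confirm the cycle helped."""
    return sum(1 for r in recs if re.fullmatch(r"\d+", r["phone"])) / len(recs)

flagged = detect(records)          # 2 records flagged
to_review = correct(records)       # phones fixed; [(1, 2)] proposed for review
print(len(flagged), to_review, verify(records))
```

The point of the structure is the asymmetry: `correct` touches data only where the fix is mechanical, and everything judgment-dependent leaves the pipeline as a proposal.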
Cleansing cadence depends on volume and degradation rate. For databases with high ingestion (thousands of records daily), cleansing should be continuous. For more stable databases, a weekly run may suffice.
The most common mistake is running the cleansing pipeline once, celebrating the improvement in metrics, and never running it again. Data gets dirty continuously. The pipeline must be a recurring operation, not an event.
Data stewards: the missing role
Technology solves half the problem. The other half is organizational. Who decides that 76% email completeness is unacceptable? Who defines the rule that a discount cannot exceed 40%? Who decides if two ambiguous records are duplicates or different people?
The data steward is the role that answers these questions. It is neither a purely technical profile nor a purely business one. It is someone who understands the data, understands the business, and has the authority to make quality decisions.
In small organizations (fewer than 50 people), the data steward is usually a partial role assumed by someone in operations or product. In mid-sized organizations, it is a dedicated role or small team. In large companies, it is a structure with stewards per data domain (customers, products, transactions) coordinated by a data governance officer.
What does not work is having nobody. Without a defined owner, quality rules do not get updated, ambiguous cases do not get resolved, and quality metrics become dashboards nobody watches.
Quality dashboards: continuous visibility
A data quality dashboard shows the state of quality metrics in real time (or near real time). The fundamental metrics are:
- Completeness by field and by table. Weekly trending.
- Duplicate rate detected and resolved.
- Invalid record rate entering the system (by source).
- Mean time to resolution for quality incidents.
- Overall quality score (a composite index weighting the dimensions relevant to the business).
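The composite score in the last bullet is simply a weighted mean of the dimension rates. A minimal sketch, where the weights are illustrative and should be set by the data steward according to business priorities:

```python
# Illustrative weights: this business cares most about completeness.
WEIGHTS = {"completeness": 0.4, "uniqueness": 0.2,
           "consistency": 0.2, "validity": 0.2}

def quality_score(metrics, weights=WEIGHTS):
    """Each metric is a 0-1 rate; the score is their weighted mean."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(metrics[dim] * w for dim, w in weights.items())

current = {"completeness": 0.76, "uniqueness": 0.97,
           "consistency": 0.88, "validity": 0.93}
print(f"overall quality score: {quality_score(current):.1%}")  # → 86.0%
```

A single number loses detail, which is exactly why the dashboard shows the per-field and per-table metrics alongside it; the composite exists so that trend and ownership are unambiguous.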
The dashboard is not for technology. It is for business. If the commercial director can see that 8% of shipping addresses are incomplete, and this correlates with a 12% return rate, data quality stops being a technical problem and becomes a business problem with a visible cost.
Grafana with Great Expectations data, or dedicated tools like Monte Carlo or Bigeye, provide the visualization. What matters is that the dashboard is visible, current, and actionable. A dashboard nobody reviews is decoration.
Data quality and external sources
An aspect teams often overlook is that data quality does not depend solely on your internal systems. A significant portion of data enters from external sources: vendor APIs, client imports, marketplace integrations, data scraped from public websites.
Each external source is a degradation vector. A supplier that changes their product code format without notice. A client that sends a CSV with Latin-1 encoding when your system expects UTF-8. A marketplace that starts including special characters in product names.
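The encoding case is worth a concrete illustration, because it is both common and cheap to defend against. A minimal sketch of a decode-with-fallback step (the function name is hypothetical): try UTF-8 first, and fall back to Latin-1, which maps every byte, so a wrongly encoded file degrades visibly instead of crashing the import:

```python
def decode_upload(raw: bytes) -> str:
    """Decode an external file: strict UTF-8 first, Latin-1 as fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # never fails: every byte is mapped

# b"caf\xe9" is "café" in Latin-1 but invalid as UTF-8.
print(decode_upload("café".encode("utf-8")))    # café
print(decode_upload("café".encode("latin-1")))  # café
```

In production you would also log which branch was taken, since a source that suddenly starts hitting the fallback is itself a quality signal.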
The defense is treating each integration point as a validation boundary. Data entering from outside passes through the same validation rules (or stricter ones) as internally generated data. An ingestion pipeline that accepts everything it receives without validation is an invitation to chaos.
In practice, we build a staging area for external data. Data arrives, gets validated, gets normalized, and only then is written to production tables. Records that fail validation go to a review queue where an operator (or an automated process) corrects or rejects them. This pattern adds latency to ingestion (seconds, not minutes), but prevents corrupt data from contaminating the main database.
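The staging pattern can be shown compactly with SQLite standing in for the real database (table names, fields, and the price rule are illustrative, not our actual schema): rows land in staging, valid rows are normalized and promoted, and failures carry a reason into the review queue:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_products (sku TEXT, price TEXT);
    CREATE TABLE products (sku TEXT PRIMARY KEY, price REAL CHECK (price >= 0));
    CREATE TABLE review_queue (sku TEXT, price TEXT, reason TEXT);
""")

# Simulated external feed: one good row, one invalid, one unparseable.
conn.executemany("INSERT INTO staging_products VALUES (?, ?)",
                 [("A-100", "19.99"), ("A-101", "-4"), ("A-102", "n/a")])

def promote(conn):
    """Validate and normalize staged rows; only clean rows reach production."""
    for sku, price in conn.execute("SELECT sku, price FROM staging_products"):
        try:
            value = float(price)                 # normalize text -> number
            if value < 0:
                raise ValueError("negative price")
            conn.execute("INSERT INTO products VALUES (?, ?)", (sku, value))
        except ValueError as e:
            conn.execute("INSERT INTO review_queue VALUES (?, ?, ?)",
                         (sku, price, str(e)))   # keep raw value + reason
    conn.execute("DELETE FROM staging_products")  # staging is transient

promote(conn)
print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])       # 1
print(conn.execute("SELECT sku, reason FROM review_queue").fetchall())
```

The production table keeps its own CHECK constraint as a second line of defense, so even a bug in `promote` cannot write an invalid price.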
The cost of not doing this is real and measurable. One of our retail clients discovered that 23% of errors in their product catalog came from a single supplier sending data in an inconsistent format. Implementing ingestion validation for that supplier reduced catalog errors by 18% in the first month.
Data governance vs. data quality
It is easy to confuse governance and quality, but they are complementary concepts. Quality ensures data is correct, complete, and consistent. Governance manages who can access what data, how it is classified, how it is retained, and how it is deleted.
Without governance, quality is unsustainable. If anyone can modify any table without access controls, validation rules get bypassed for convenience. If there are no retention policies, obsolete data accumulates and dilutes quality metrics. If there is no sensitivity classification, personal data gets copied into development environments without protection.
The practical recommendation is to start with quality (it has immediate ROI) and build data governance incrementally. But do not ignore governance indefinitely. A basic governance framework (data catalog, access control, retention policy) is necessary before data complexity exceeds the team’s capacity to manage it manually.
Incremental implementation
Do not try to fix all data quality at once. The approach that works is incremental.
Weeks 1-2: Run profiling on the most critical tables. Identify the three problems with the greatest business impact.
Weeks 3-4: Implement validation rules at entry points to prevent new bad data. This stops the bleeding.
Month 2: Build the cleansing pipeline for existing data, starting with the three identified problems. Measure before and after.
Month 3: Deploy the quality dashboard and assign a data steward (even as a partial role). Establish a weekly metrics review.
Ongoing: Expand validation rules, add new sources to profiling, refine the cleansing pipeline based on emerging patterns.
The goal is not perfection. It is continuous improvement with mechanisms that prevent regression. A 95% completeness rate maintained consistently is better than 100% that lasts a week and drops to 80%.
Data is the most undervalued asset in most companies. Not because they lack data, but because the data they have is not reliable. Improving data quality is not glamorous, does not generate headlines, and does not impress in a demo. But it is what allows everything else (analytics, AI, automation, business decisions) to operate on solid ground instead of quicksand.
About the author
abemon engineering
Engineering team
Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.
