9 Data Quality Checks to Solve (Almost) All Your Issues

Guided solutions and best practices for the most common data quality problems

How To
June 5, 2023

Can you put a number to how much bad data is costing your business?

“Every year, poor data quality costs organizations an average of $12.9 million” - Gartner

This includes everything from marketing losing 21 cents of every advertising dollar due to misinformed business decisions to data engineers wasting more than three hours daily resolving data quality issues.

Luckily, you can guarantee high-quality data with 9 simple checks. In this article, we’ll look at:

  1. The 9 most common data quality issues (with solutions to each).
  2. When and where to run data quality checks.
  3. How to run data quality tests to ease your work and increase data validity.
  4. Pro tips: the best data quality tools on the market.

Say goodbye to testing worries with a data cleaning checklist - your one-stop solution for never missing a step in data quality testing.

The 9 most common data quality issues (with solutions to each) 

Below we showcase the 9 most common problems companies face when trying to improve the quality of their data.

Data quality issue #1: Inaccessible data 

What is it?

Data-driven decision-making is only as effective as the data used to make business decisions. If the data is missing, business choices get detached from the competitive insights that data can provide.

How does the data quality problem arise?

There are three main drivers for inaccessible data during decision-making time:

  1. The volume of data is getting too big. As data reaches “big data” scale, companies struggle to extract insights from volumes they can no longer manage.
  2. Business priorities are unaligned with data priorities. Business stakeholders make decisions for one area of business, while the data team generates reports for another.
  3. Wrong choice of data-sharing tools. Stakeholders may lack access to essential datasets and reports needed for decision-making. This includes everything from analysts lacking API credentials to export a dataset to the Head of Marketing being unaware of a preexisting ARPU report.
How do you solve this data quality problem?

To make your data accessible, diagnose the cause of the problem and apply one of the following solutions: 

  1. Tame your scaling data volumes with the right architecture. If big data is getting too big to manage, it’s time to migrate to an ELT architecture or even a data mesh architecture.
  2. Align your business and data units with the right data strategy. Keep your business needs and operational priorities aligned by building an actionable data strategy.
  3. Pick data tools that give everyone a seat at the table. The right tools simplify data access - from the Visual Flow Builder, which lets everyone meet their data needs on their own, to the Data Catalog, which improves data sharing across teams and departments.
“Best-in-Class companies are 30% more likely to have a dedicated data catalog solution” - Aberdeen Strategy & Research, 2020

Data quality issue #2: Inaccurate data

What is it?

Data accuracy measures how well your data matches the real world. An example of inaccurate data: a customer changes their residence, but the shipping address in your CRM never gets updated. Or a sales rep enters the shipping address while talking to the customer on the phone and mistypes the street name.

How does the data quality problem arise?

Multiple factors create inaccurate data, including human entry or updating errors, data drift, and data decay. 

How do you solve this data quality problem?

There isn't a single solution to data accuracy. The best approach is to build business rules into your data cleaning procedure that verify the accuracy of the data - for example, verifying that every CRM shipping address corresponds to a real-world location via the Google Maps API.
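
As a minimal sketch of such a business-rule check, assuming a hypothetical crm_customers table with shipping_address and shipping_zip columns (the actual Google Maps lookup would happen outside the database):

```sql
-- Hypothetical accuracy check: flag CRM records whose shipping address
-- is empty or whose ZIP code is not in a 5-digit US format.
SELECT
  customer_id,
  shipping_address,
  shipping_zip
FROM crm_customers
WHERE shipping_address IS NULL
   OR TRIM(shipping_address) = ''
   OR shipping_zip !~ '^[0-9]{5}$';  -- PostgreSQL regex; adjust for your SQL dialect
```

Any rows this query returns are candidates for the external address verification step.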

Data quality issue #3: Incomplete data

What is it?

From incomplete insights to machine learning models learning only on a (biased) subset of data, missing data can lead your data products and decision-making astray.

How does the data quality problem arise?

Multiple drivers lead to missing values, including bugs in data integration pipelines, errors from data providers, human entry and update errors, and the nature of the data collection (e.g. survey data often have missing fields because people don’t answer every question). 

How do you solve this data quality problem?

Missing values are often represented as NULL values in SQL systems, or as NaNs or Nones in other systems. Run your data quality tests by checking for missing values in your dataset (column-wise and row-wise). If you’re certain a column should have no missing values (e.g. a primary key or unique identifier), add a NOT NULL constraint to your SQL table.
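
A minimal sketch of both steps, assuming a hypothetical orders table (COUNT(column) skips NULLs, which makes per-column counts easy):

```sql
-- Count missing values column by column (hypothetical orders table).
SELECT
  COUNT(*) - COUNT(order_id)    AS missing_order_ids,
  COUNT(*) - COUNT(customer_id) AS missing_customer_ids,
  COUNT(*) - COUNT(order_total) AS missing_order_totals
FROM orders;

-- If a column must never be empty, enforce it at the schema level.
ALTER TABLE orders ALTER COLUMN order_id SET NOT NULL;  -- PostgreSQL syntax
```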

Data quality issue #4: Inconsistent data

What is it?

When you look up a customer’s purchase history and the CRM says they only bought one item, while the transaction database shows they are a regular big spender, you start to question the trust you have in your reporting.

Data inconsistency refers to conflicting information about the same operational unit (such as a customer, product, business branch, etc.). This inconsistency erodes trust in data, makes data less usable for decision-making, and can even lead to regulatory compliance issues.

How does the data quality problem arise?

Inconsistencies arise as a company grows. They can come from multiple unsynchronized data integration pipelines, databases that were never fully migrated, ad hoc datasets prepared for a specific use case and later merged into the source-of-truth storage, and so on.

How do you solve this data quality problem?

You need to ensure that the same data in different locations matches. This can be achieved with multiple data quality tests:

  1. Data integrity checks. Verify that every foreign key record points to an existing primary key across your SQL tables. The combination of entity integrity (each table has a primary key) and referential integrity (every foreign key refers to an existing primary key) helps you maintain data consistency as your data records multiply (see the SQL sketch after this list).
  2. Data consistency checks. No primary or foreign keys? No problem! Simply link the data asset you’re investigating (e.g. a customer) with its associated information (e.g. number of purchases) and compare that information across different datasets. If it doesn’t match, you have a data inconsistency problem.
  3. Operational process automation. When creating or migrating data assets, make sure to keep them synchronized. Rely on advanced but easy-to-use technologies like CDC.
  4. Design a single source of truth. Design your data architecture around the concept of a single source of truth. Even if data inconsistency arises in some data assets (e.g. data marts used for a single analytic use case), they will not affect your main data assets. How? Check our guide to data consolidation.
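
A minimal SQL sketch of checks 1 and 2, assuming hypothetical orders, customers, crm_customers, and transactions tables:

```sql
-- 1) Referential integrity: orders whose customer_id has no matching
--    record in the customers table (orphaned foreign keys).
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;

-- 2) Consistency: compare the purchase count stored in the CRM with the
--    count derived from the transactions table.
SELECT crm.customer_id,
       crm.purchase_count      AS crm_purchases,
       COUNT(t.transaction_id) AS transaction_purchases
FROM crm_customers crm
LEFT JOIN transactions t ON t.customer_id = crm.customer_id
GROUP BY crm.customer_id, crm.purchase_count
HAVING crm.purchase_count <> COUNT(t.transaction_id);
```

Any rows returned point to records that are inconsistent between sources.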


Data quality issue #5: Irrelevant data

What is it?

Sometimes we have so much data that the majority of it is irrelevant. For example, the weather on the date our customer purchases our software product. Sure, maybe there is a relationship between rainy days and purchases, but how strong is it if we’re not an umbrella company?

How does the data quality problem arise?

We import every data source imaginable, irrespective of the need for it. Additionally, by not sanitizing the data before making it available for analysis, we risk introducing irrelevant data into the insights process.

How do you solve this data quality problem?

Improve your filtering processes. Consider either not importing the irrelevant data at all (limiting extraction in the ETL paradigm) or limiting which data moves from the data lake to the data warehouse (in the transformation layer of the ELT process).
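
A minimal sketch of the ELT-style filter, assuming hypothetical lake and warehouse schemas with a raw_purchases table:

```sql
-- Promote only the columns and rows the analysis actually needs from the
-- data lake to the data warehouse; irrelevant columns (e.g. the weather on
-- the purchase date) are simply not selected.
CREATE TABLE warehouse.purchases AS
SELECT
  purchase_id,
  customer_id,
  purchase_date,
  purchase_amount
FROM lake.raw_purchases
WHERE purchase_date >= DATE '2020-01-01';  -- drop history that is no longer relevant
```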

Data quality issue #6: Stale data

What is it?

Data freshness is important in decision-making. Some data ages like fine wine (transaction data), while other data becomes irrelevant and stale as time passes. This particularly affects customer data - for example, shipping addresses change, and purchases made 17 months ago should not influence recent upsell suggestions.

How does the data quality problem arise?

Data becomes stale naturally - you’ll always have to fight staleness. Additionally, concept drift and changing business needs affect which data is relevant right now.

How do you solve this data quality problem?

There is no need to look only at real-time data, but make sure to filter records based on their timeliness. This will be analysis-specific - for recent purchases, introduce a sensible business rule (e.g. 1 week for fast-moving goods, 1 year for long B2B sales cycles) that filters out data based on its freshness.
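
Two minimal sketches, assuming a hypothetical purchases table with purchase_date and updated_at columns (PostgreSQL interval syntax):

```sql
-- Freshness filter: only consider purchases from the last 7 days
-- (a hypothetical business rule for fast-moving goods).
SELECT *
FROM purchases
WHERE purchase_date >= CURRENT_DATE - INTERVAL '7 days';

-- Freshness test: return a row (i.e. fail the check) if the newest
-- record is more than a day old.
SELECT MAX(updated_at) AS last_update
FROM purchases
HAVING MAX(updated_at) < CURRENT_DATE - INTERVAL '1 day';
```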

Data quality issue #7: Invalid data

What is it?

Invalid data doesn’t make sense. The term 'data validity' is sometimes used to encompass all data quality testing, but here, we refer to it in terms of how "reasonable" data is. For example, a customer cannot be 160 years old (semantic invalidity), and the average number of meals per day can’t be over 50 (statistical outlier).

At best, invalid data misguides your business decision-making. At worst, statistically invalid data can severely skew your analytic and machine learning models, making them biased towards outliers and statistical anomalies. 

How does the data quality problem arise?

Invalid data is caused by multiple processes: human entry and update errors, and statistical anomalies in the underlying systems (some outliers truly exist, but can be misleading when analyzed).

How do you solve this data quality problem?

Use data quality tests that check for invalid data:

  1. Implement thresholds. Remove or cap data that falls outside sensible thresholds - for example, cap customer age at 100.
  2. Check for outliers and statistical anomalies (e.g. values several standard deviations from the mean) - both checks are sketched in SQL after this list.
  3. Introduce business logic. If a customer purchase record is older than your company, it is better to disregard it in the analysis.
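
A minimal sketch of the threshold and outlier checks, assuming hypothetical customers and orders tables:

```sql
-- Threshold check: customers with an implausible age.
SELECT customer_id, age
FROM customers
WHERE age < 0 OR age > 100;

-- Outlier check: order totals more than 3 standard deviations from the mean.
WITH stats AS (
  SELECT AVG(order_total) AS mean, STDDEV(order_total) AS sd
  FROM orders
)
SELECT o.order_id, o.order_total
FROM orders o
CROSS JOIN stats s
WHERE ABS(o.order_total - s.mean) > 3 * s.sd;
```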

Data quality issue #8: Data (non-)uniqueness

What is it?

Duplicated data values can be problematic in and of themselves: they bias analysis toward the duplicated entries. But sometimes duplicated records also count as invalid data - for example, non-unique social security numbers or identifiers make it hard to find a particular customer’s data.

How does the data quality problem arise?

Multiple causes: human entry errors, integration mistakes, database migration or replication errors, etc.

How do you solve this data quality problem?

Identify duplicate data values and either remove both (or all) copies, keep only the first or last entry, or deduplicate by introducing logic that disambiguates the duplicate entries.
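
A minimal sketch of both approaches, assuming a hypothetical customers table where customer_email should be unique:

```sql
-- Find duplicated identifiers.
SELECT customer_email, COUNT(*) AS copies
FROM customers
GROUP BY customer_email
HAVING COUNT(*) > 1;

-- Deduplicate by keeping only the most recently updated record per email.
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_email
           ORDER BY updated_at DESC
         ) AS rn
  FROM customers
)
SELECT *
FROM ranked
WHERE rn = 1;
```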

Data quality issue #9: Changing data

What is it?

Data can change in ways that don’t make sense. This can range from formatting issues (for example, a column representing the purchase value of orders suddenly displaying dates instead of dollar values) to removed columns, and everything in between.

How does the data quality problem arise?

Data formatting issues typically occur due to human error and migration errors. Meanwhile, schema changes (missing, altered, or added columns) usually take place when third-party providers modify the schema at the source (for instance, during an API change).

How do you solve this data quality problem?

Make sure to test for data formatting issues, and create checks and alerts for schema changes as part of your regular data quality assurance.
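
A minimal sketch of a schema check, assuming a hypothetical orders table and an expected column list (information_schema is available in PostgreSQL, Snowflake, and most warehouses; the VALUES list is PostgreSQL-style):

```sql
-- Flag expected columns that are missing or renamed in the live table.
SELECT expected.column_name
FROM (VALUES
        ('order_id'), ('customer_id'), ('order_total'), ('purchase_date')
     ) AS expected(column_name)
LEFT JOIN information_schema.columns c
       ON c.table_name  = 'orders'
      AND c.column_name = expected.column_name
WHERE c.column_name IS NULL;  -- any row returned signals a schema change
```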

When and where should you run data quality checks?

Where (and when) to run your tests depends on your general data architecture and use cases:

  1. For ETL data pipelines: Create tests at extraction (filter down to the columns you need and verify that no extra data is imported) and in the data transformation layer. When new data is extracted from the data sources, validate it before loading it as input to the next stage.
  2. For ELT workflows: All data is collected and loaded first. Introduce data quality checks before you promote data from the data lake to the data warehouse (or another single source of truth) - see the sketch after this list.
  3. For ad hoc analyses: Introduce customized data profiling. For example, a data scientist building a new machine learning algorithm will check the new dataset for outliers and statistical anomalies before building their AI algorithm.
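
As a minimal sketch of the ELT gate in point 2 (assuming hypothetical staging and warehouse schemas), run the checks against the staged data and only promote rows that pass:

```sql
-- Gate the load: only promote staging rows that pass basic quality checks.
INSERT INTO warehouse.orders (order_id, customer_id, order_total, purchase_date)
SELECT order_id, customer_id, order_total, purchase_date
FROM staging.orders
WHERE order_id      IS NOT NULL
  AND customer_id   IS NOT NULL
  AND order_total   >= 0
  AND purchase_date <= CURRENT_DATE;
```

Rows that fail the checks can be routed to a quarantine table for investigation instead of being loaded silently.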

Irrespective of when and where you run data quality checks, make observability a priority. Make data quality part of your data governance and introduce it in every data quality check.

How to run data quality checks?

Data quality management should not be the sole responsibility of your data team. Everyone who uses data for their business needs should understand the general quality of the data they work with - and should also be responsible for it.

How can everyone be involved when data quality checks are usually run either in third-party data quality tools (which require specialized knowledge) or with open-source programming libraries (and not everyone knows how to code)?

By using tools that offer you all possible testing solutions:

  1. Scripting tests. Write tests in SQL, Python, or another scripting language that lets you express the data quality check of your choice (a common convention is sketched below).
  2. No-code tests. Use no-code transformations, pre-built tests, and visual drag-and-drop builders to chain together multiple data quality checks.
  3. Post-test automations. What happens when a data quality check fails? You have multiple options: trigger an alert (observability), make the data pipeline fail, or even build custom “healing” scripts that run quality improvements when specific tests fail (for example, a deduplication step that removes non-unique entries).
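
For scripting tests, a common convention (an assumption here, used by many SQL-based testing setups rather than a feature of any one tool) is to write each check as a query that returns zero rows on success; any returned rows mean the test failed and can trigger the post-test automations above:

```sql
-- Test: no negative order totals (hypothetical orders table).
-- An empty result means the test passes; returned rows can be surfaced
-- in an alert or used to fail the pipeline.
SELECT order_id, order_total
FROM orders
WHERE order_total < 0;
```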

With Keboola, you can run all three methods of data quality checks.

Pro tip: High data quality is not just a result of running data quality checks automatically, consistently, and thoroughly. High data quality also comes from correctly designing your data pipelines. Check our how-to guide that walks you through the 7 steps for higher quality data operations.

Improve data quality with Keboola 

Keboola is a data stack as a service that automates all your data operations, including data quality checks.

With Keboola you can:

  1. Create data quality checks in Python, SQL, Julia, or R (check these 27 example SQL checks to get inspired).
  2. Create data quality checks with no-code transformations.
  3. Have control over what happens when a test fails: send alerts to Slack/email/another app, abort the data pipeline, or even run a healing data transformation.
  4. Easily share and reuse tests across many pipelines with a button click. 

Build your ETL and ELT data pipelines with peace of mind and automate data quality checks in Keboola today.
