Data Transformation in ETL Process Explained (2023 Guide)

How To

September 14, 2022

5 min read

No items found.

8 TYPES OF ETL DATA TRANSFORMATIONS AND HOW TO AUTOMATE THEM.

Thank you for your registration! We received your contact details. Expect an email with information about the event. We look forward to seeing you there!

Oops! Something went wrong while submitting the form. Try it again please.

Scroll to download

Working with raw or unprocessed data often leads to poor decision-making.

Accordingly, the ETL process - the foundation of all data pipelines - devotes an entire section to T, transformations: the act of cleaning, molding, and reshaping data into a valuable format.

In this blog, you’ll find:

What is data transformation in an ETL process
ETL vs ELT
8 types of data transformation
How to choose which data transformation to apply

Complete the form below to get your complimentary copy.

Oops! Something went wrong while submitting the form.

You can scale your data operations, save 30% on costs and make more money. But only if you use Keboola.

What is data transformation in an ETL process?

Data transformation is part of an ETL process and refers to preparing data for analysis. This involves cleaning (removing duplicates, fill-in missing values), reshaping (converting currencies, pivot tables), and computing new dimensions and metrics. This process requires some technical knowledge and is usually done by data engineers or data scientists.

In a typical ETL process, data transformation follows data extraction, where raw data is extracted to the staging area (an intermediate, often in-memory storage).

After data is transformed, it is then loaded to its data store: a target database (such as the relational databases MySQL or PostgreSQL), a data warehouse, a data lake, or even multiple destinations.

Usually, cleaned data is loaded to business intelligence (BI) tools, where it is ready for visualization and analytics done by the business users.

ETL vs ELT: The two ways to architect transformations

ETL is an idealized form of data architecture that portrays data pipelines as sequentially linear processes.

However, the reality is usually messier.

Usually, the steps of the ETL process overlap and are done in parallel wherever possible, to get the freshest data available ASAP.

That is why when it comes to data engineering architecture there are two distinct ways of incorporating transformations into data pipelines:

ETL: The traditional process of extracting data to the staging area, where it is transformed before being loaded into its final destination storage.

ELT (Notice the L before the T): The data loading happens before the transformations. Raw data is extracted from the source system and loaded into a target data warehouse (e.g. BigQuery, Amazon Redshift, Snowflake…). Only afterward the data is transformed.

Both paradigms have advantages and shortcomings and are better thought of as two strategies for different DataOps challenges.

ETL is usually better suited for smaller data pipelines, while ELT is the go-to design pattern for big data.

Irrespective of the architecture of your choice, all etl transformations can be categorized into a couple of prototypical types, which we are going to breakdown in this blog.

‍Recommended read: Complete ETL process overview.

8 Types of data transformations

Depending on the type of transformation you employ to clean your data, you could be using one of the prototypes of data transformation.

1. Data Filtering

Data filtering is one of the simplest transformations. Certain columns or rows are filtered out before saving or inserting the data into the data storage.

Example: For the table sales_2021, you filter out all data from orders that were placed before 2021-01-01.

2. Data Mapping

Data mapping is one of the most common types of data transformations across all operations and industries. Data mapping (also called translation or conversion), takes one data input and changes it to its equivalent in another format.

It takes on many forms such as unifying capitalizations, converting strings to the same encoding standard, synchronizing date and time values, aligning units across different granulations, etc.

Example 1: Take product names from your eCommerce shop across all language variants and translate them into English before inserting them into the product_details table.

Example 2: You are running an app and are interested in usage analytics. The data you collect is time-stamped with the local timezone of each user. Before extracting and saving events to your event_raw table, you convert local time zones to UTC timestamps. For example, Mario and Jessica are playing each other in your app’s game. But Mario plays it in Italy at 5 pm, while Jessica plays it in New York at 11 am (both local time). To avoid confusing timestamps, you convert events from both Mario and Jessica to UTC (3 pm).

Example 3: You are building a machine learning classifier that helps you determine which customers are more likely to repurchase. The data you extract from the CRM says “female” or “male”, which your algorithm would not understand. Therefore you recode the categorical variable into 0 for “male” and 1 for “female” before saving the data to the database.

3. Data Deduplication

Duplicated data repeats the same information two or more times. Data deduplication removes duplicated data before inserting it into the data warehouse.

Example: After collecting distribution information from your ERP, you find the same item being shipped twice:

*The data is unnecessarily duplicated, so you insert only one row into the final database*.

4. Derived Variables

Derived variables or metrics compute attributes from other data points and variables.

‍Example: You create a column “profit” in your orders table that is computed by subtracting taxes, shipping_costs, and COGS from your sales_total for each row (aka order).

5. Data Sorting or Ordering

Before saving the data, you sort it by date, so it is easier to find via search.

6. Joining data from multiple sources

Merge related data that was collected from multiple data sources.

Example: Join Facebook Ads spend data, Google Ads spend data and Linkedin Ads spend data into a single table called digital_advertising_spend.

In Keboola, you can instantly pull data from different sources using the 250+ pre-built connectors that include Google Ads, Facebook Ads, LinkedIn Ads, and Mailchimp.

You can create a data pipeline in X steps:

Create a forever free account
Go to Flows
Start adding all your components (different data sources)
Choose transformations
Select the final destination
Hit Run! (and “Schedule” if you want to automate this flow forever and ever)

flow building — Use Keboola's drag'n'drop flow builder to intuitively design your data pipeline.

‍7. Aggregating data

Data aggregation is a form of summarization. Aggregation is done by picking a relevant dimension and grouping a metric across that dimension.

It is often performed for metrics that are commonly used to speed up analytics.

Example 1: Compute the total number of new customers (grouped metric) on each date (dimension).

Example 2: Find total sales (grouped metric) by geographical region (dimension).

8. Splitting data

When working with unstructured and semistructured data (such as strings, JSON, or XML files), you split the data before saving it into a table.

Example 1: You get the customer address in the form of the string “Boulevard avenue 23, Portobello PO1234, UK”. Before saving it into the customer table, you split it into the columns:

street,
city,
zip_code, and
country.

How to choose which transformation to apply?

The choice of transformation type ultimately depends on your business needs.

Get yourself on the right track by following this process:

Identify the target tables schema. Prepare a list of tables and their schemas (columns, possible values, etc.) of how the data would ideally look once it has been cleaned and loaded into the target database or data warehouse.
Record the gap between target and incoming data. Contrast the ideal target state with the raw data you collected in the extraction step. Note down what transformation (types) would be needed to convert the raw data into the cleaned data.‍
Implement the changes needed to bridge the gap between extracted data and target data. Validate the transformations before setting them on autopilot with one of the ETL tools.

You can scale your data operations, save 30% on costs and make more money. But only if you use Keboola.

Run and automate your data transformations with Keboola

Keboola is a data platform as a service that automates the majority of data integration process to free up valuable engineering resources.

When it comes to data transformations, Keboola has many smart features data engineers love:

Use whatever language you know. Write your transformation in SQL or pick a language of your choice (such as Python, R, Julia, …).
Automate and save time. Transformations can be scheduled through orchestrations, so they can easily be repeated without anyone needing to manually run them.

Know exactly what is going on. You won't have to worry about setting regulatory compliance because every transformation is fully monitored and traceable.
Everything in one place. Access all transformations from the “Transformations” tab within the Keboola dashboard. The centralization gives your team visibility over the transformations set in place and liberates precious time that is usually spent with organizing and searching for existing data cleaning scripts.

‍Take Keboola for a spin and check for yourself what it can do for your ETL pipelines and transformations.

Download for Free

Oops! Something went wrong while submitting the form. Try it again please.

Data Transformation in ETL Process Explained (2023 Guide)

You can scale your data operations, save 30% on costs and make more money. But only if you use Keboola.

What is data transformation in an ETL process?

ETL vs ELT: The two ways to architect transformations