Discover the difference between a data lake and a data warehouse.
You will have to consider multiple solutions and their tradeoffs when setting up your enterprise data architecture.
In this article, we contrast a data lake and a data warehouse side-by-side to make your choice easier.
A data warehouse is a data storage technology that acts as a repository and single source of truth for disparate enterprise data.
A data warehouse collects data from different data sources (CRM, ERP, 3rd Party Apps, social media, …) and models the data for data analytics (predictive modeling, statistical analysis) and decision support via business intelligence (BI).
Notable examples of data warehouses include Amazon Redshift, Snowflake, and Google BigQuery.
(Curious about the Enterprise Data Warehouse (EDW)? Dig deeper and read more about data warehouses here.)
Similar to a data warehouse, a data lake acts as a repository for disparate enterprise data sources.
But unlike a data warehouse, the goal of a data lake is not to provide decision support and data analytics. Instead, the main goal of a data lake is to store all data in its raw native format within a single platform.
Notable examples of data lakes include Amazon S3, Apache Hadoop (HDFS), and Microsoft Azure Data Lake Storage (ADLS).
Let’s compare the two solutions more closely.
As a big picture comparison, a data warehouse (or its more streamlined version, the data mart) can be thought of as bottled water - filtered, packaged, and ready to consume.
The data lake, on the other hand, is best characterized as a raw, untamed body of water.
The metaphor makes instinctual sense, but let’s dig deeper into the comparison to better contrast and understand the two data solutions.
There are 9 main differences between a data lake and a data warehouse:
Data lakes store raw data in its native format. This can include transactional data from CRMs and ERPs, but also less-structured data such as IoT device logs (text), images (.png, .jpg, …), audio and video files (.mp3, .wav, .mp4, …), and other complex data types.
A data warehouse, on the other hand, stores the same types of data that relational databases store - text, numerical data types, and other types that SQL queries can handle.
The raw vs SQL-type distinction can also be characterized as a structured vs unstructured data comparison.
Data lakes store structured data, semi-structured data, and unstructured data with ease.
Data warehouses, by contrast, store structured data, and some (like Snowflake with its VARIANT and OBJECT data types) can also handle semi-structured data. A data warehouse can hold information from semi-structured or unstructured sources, but it needs to transform it first - for example, by computing a metric such as how many times our brand name appears in search results, instead of saving all the search results.
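To make that contrast concrete, here is a minimal Python sketch of the kind of pre-load transformation described above: the lake would keep the raw records untouched, while the warehouse only receives an aggregated metric. The sample data and the count_brand_mentions helper are hypothetical illustrations, not part of any specific product.

```python
# Hypothetical example: a data lake keeps the raw search results as-is,
# while a data warehouse only receives a pre-computed metric.

search_results = [  # raw, semi-structured records (stored untouched in a lake)
    {"query": "running shoes", "snippet": "Acme shoes are great for running"},
    {"query": "best sneakers", "snippet": "Top 10 sneakers reviewed"},
    {"query": "acme discount", "snippet": "Acme announces a summer sale"},
]

def count_brand_mentions(results, brand="acme"):
    """Transform raw results into a single metric suitable for a warehouse table."""
    return sum(brand in r["snippet"].lower() for r in results)

# Only this small, structured row would be loaded into the warehouse:
warehouse_row = {"date": "2024-01-01", "brand_mentions": count_brand_mentions(search_results)}
print(warehouse_row)  # {'date': '2024-01-01', 'brand_mentions': 2}
```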
Data lakes store all the information: the data an enterprise needs now, the data it might need in the future, and even data that analysts may never use.
In contrast, a data warehouse carefully selects what data to store before loading it.
The schema describes the formal organization of data.
Data lakes use schema-on-read. That is, the format and organization of the data are specified each time the data is read; there is no predetermined organizing principle before we query the data lake.
Data warehouses use schema-on-write - the format and organization of the data must be determined in detail before the data is written to the data warehouse.
Data engineers and architects spend a lot of time designing the data model (normalized tables, denormalized tables, star schemas, snowflake schemas) to make the data structure easy to use for data analysts and reporting.
This is why the schema-on-write is adopted: to keep the integrity of the data model.
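A minimal Python sketch of the difference, using only the standard library: with schema-on-read, a raw JSON record sits in the lake as-is and each reader picks out the fields it needs; with schema-on-write, the table must be defined before any row can be inserted. The record, field names, and the in-memory SQLite table are hypothetical stand-ins, not a specific warehouse's API.

```python
import json
import sqlite3

# --- Schema-on-read (data lake style) ---
# The raw record is stored untouched; structure is imposed only when we read it.
raw_record = '{"user_id": 42, "event": "click", "extra": {"page": "/pricing"}}'
parsed = json.loads(raw_record)                      # schema decided at read time
page_visited = parsed.get("extra", {}).get("page")   # each reader picks the fields it needs

# --- Schema-on-write (data warehouse style) ---
# The table schema must exist before any row is written.
conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse table
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT, page TEXT)")
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?)",
    (parsed["user_id"], parsed["event"], page_visited),  # data must fit the predefined columns
)
print(page_visited)  # /pricing
```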
Both data lakes and data warehouses store large amounts of data.
But the volumes they typically hold differ by orders of magnitude.
A data lake usually stores petabytes of data, while data warehouses operate in terabytes.
Data lakes are generally cheaper per GB of data stored, because they are typically built on scalable, low-cost commodity servers or leverage cloud-based object storage with dedicated low-cost tiers.
On the other hand, data warehouses are comparatively more expensive, because their storage costs are coupled with compute costs to run analytical queries.
Data warehousing is a tried-and-tested, highly mature technology, while data lakes have not yet fully matured.
Because of the rigorous modeling requirements that give data warehouses amazing analytic capabilities, they are less flexible with incoming data changes.
On the other hand, data lakes presuppose no organizational structure and are much more adaptable to change.
Data warehouses get incoming data through the ETL process: they extract data from the sources, transform it to fit the predefined warehouse schema, and only then load it into the warehouse.
In contrast, data lakes use the ELT process. Data is extracted from the sources and loaded into the data lake as-is; only when it is needed does a data scientist or data engineer transform it.
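Here is a short Python sketch of how the same steps are simply reordered in the two approaches. The extract, transform, and load_* functions are illustrative stand-ins under assumed sample data, not any particular tool's API.

```python
# Hypothetical sketch contrasting ETL (warehouse) and ELT (lake) orderings.

def extract():
    # Pull raw rows from a source system (hard-coded sample data here).
    return [{"amount": "10.5", "currency": "EUR"}, {"amount": "3.0", "currency": "USD"}]

def transform(rows):
    # Clean and conform the data to the analytical schema.
    return [
        {"amount": float(r["amount"]), "currency": r["currency"]}
        for r in rows
    ]

def load_to_warehouse(rows):
    print("loaded structured rows:", rows)

def load_to_lake(rows):
    print("stored raw rows:", rows)

# ETL (data warehouse): extract -> transform -> load
load_to_warehouse(transform(extract()))

# ELT (data lake): extract -> load; transform later, only when the data is read
raw = extract()
load_to_lake(raw)
analysis_ready = transform(raw)  # done on demand by a data scientist or engineer
```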
Data lakes have multiple advantages over data warehouses: they are cheaper, more flexible with incoming data, and able to store data of any type and structure.
However, data lakes also carry operational risks. Just dumping data without a clear vision of how it will be used can lead to a data swamp - a useless, undocumented, bloated data storage. As Sean Martin, CTO of Cambridge Semantics, said:
“We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there.
The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.”
Data warehouses beat data lakes on this front: they do not store redundant data, because they are deliberately designed to keep only what is useful.
Deciding between data lakes and data warehouses can be tough.
But the choice is not mutually exclusive.
You can keep a lightweight and cheap data lake for your machine learning and non-mission-critical engineering, while deploying a data warehouse for your data-driven decision-making.
Keboola can help you with both.
As an end-to-end operational platform, Keboola helps you build ETL and ELT pipelines with low-code automation.
With over 250 integrations connecting sources to databases, data warehouses, and data lakes, you can set up your data pipelines and design your dream architecture in a couple of clicks.
Try it out. Keboola offers a no-questions-asked, always-free tier, so you can play around and build pipelines to your data lake or data warehouse.