Learn about the best practices and principles of effective data architecture design.
The same way architects design blueprints to plan and support the building of houses, data architects design the data architecture to support building end-to-end data pipelines.
In this article, we dive into data architecture, its principles, and how data mesh fits into the story.
What is data architecture?
Data architecture is the catch-all term to describe the technology, software, approaches, policies, rules, standards, and models that support, drive, and govern data collection, data transformation, data storage, and data usage.
Because data architecture encompasses all enterprise data and takes the “big picture” or complete view of the company, it might come across as an abstract, soft approach to managing data.
However, data architecture as a practice is quite concrete. Effective data architecture will specify in detail (a minimal sketch of such a specification follows this list):
- Data sources used in data integration to collect data: third-party API endpoints, CRM, ERP, advertising software, financial software, and essentially every dataset the enterprise collects.
- Transformations. All the preprocessing, cleaning, outlier removal, aggregations, de-anonymization, and every other transformation applied to data to make it usable.
- Data storage. The concrete data stores used, as in which data lake, data warehouse (Amazon Redshift, Snowflake, Google BigQuery, ...), or database (Postgres, MySQL, …) will be deployed.
- Data movement. Which workloads will move data from raw capture to its final cleaned form, where it is ready for analytics.
- Data consumption and deployment. How the data will be used, consumed, analyzed, or deployed in machine learning or artificial intelligence products.
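To make this more tangible, here is a minimal, hypothetical sketch of how such a specification might be captured; every source, tool, and layer named below is an illustrative assumption rather than a recommendation.

```python
# A hypothetical, heavily simplified data architecture specification.
# All names (sources, tools, layers) are illustrative assumptions.
architecture_spec = {
    "sources": {
        "crm": {"type": "3rd-party API", "cadence": "hourly"},
        "advertising": {"type": "3rd-party API", "cadence": "daily"},
        "app_events": {"type": "event stream", "cadence": "real-time"},
    },
    "transformations": [
        "drop corrupted rows",
        "remove outliers",
        "aggregate events into sessions",
    ],
    "storage": {
        "raw": "data lake (object storage)",
        "curated": "data warehouse (e.g. Snowflake, BigQuery, or Redshift)",
    },
    "movement": "raw -> cleaned -> curated, orchestrated on a daily schedule",
    "consumption": {
        "analysts": "BI dashboards on curated tables",
        "data_scientists": "ML sandboxes on curated tables",
    },
}
```

In practice this lives in architecture documents, diagrams, and infrastructure-as-code rather than a single dictionary, but the point is that each of the five areas above is specified explicitly.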
All of these areas overlap heavily with what a data engineer or database architect does. So how does data architecture differ, and what sets a data architect apart from other roles?
What is the role of a data architect?
The data architect is responsible for envisioning the desired or target state of data operations and then guiding the execution of data operations to reach the vision.
In other words, the data architect defines the big picture level of how data will be collected, ingested, transformed, stored, moved, served, and consumed across the entire enterprise data system.
Alongside the data flow, the data architect specifies which concrete technologies will be deployed and how the entire system will synchronize internally to guarantee data freshness, consistency, and validity.
How does the role of the data architect differ from data engineers or database architects?
The data architect specifies the data flow across the entire system, including everything outside of storage, while the database architect focuses more on the internal mechanisms of the data storage layer.
Let’s look at a concrete example.
A data architect analyzes the business use cases and specifies that the system needs storage that can ingest up to 1M data entries per second (and therefore picks Cassandra), while the database architect designs the specific table partitions and distributed multi-node architecture to handle such ingestion.
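To make that division of labor concrete, here is a minimal sketch of what the database architect’s side of the Cassandra decision might look like, assuming the Python cassandra-driver package and hypothetical keyspace, table, and column names; the compound partition key is what spreads the high write volume across nodes.

```python
# A minimal sketch of a write-optimized Cassandra schema, created from Python.
# Keyspace, table, and column names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # contact point is a placeholder
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS streaming
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Partition on (user_id, day) so writes are spread across nodes and no single
# partition grows unbounded; cluster by event_time for recent-first reads.
session.execute("""
    CREATE TABLE IF NOT EXISTS streaming.events_by_user (
        user_id    uuid,
        day        date,
        event_time timestamp,
        payload    text,
        PRIMARY KEY ((user_id, day), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")
```

Partitioning on (user_id, day) keeps any single partition bounded while still letting a read fetch one user’s recent events in a single query.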
Similarly, a data engineer will be concerned with the flow of data through the engineering systems (ETL or ELT), but will mostly ignore the data once it is stored. The data architect, on the other hand, will also think about and design the processes around data consumption: for example, which BI tool analysts will use, or how to curate data to make it accessible for machine learning prototyping.
Obviously, in smaller organizations, the same person might wear multiple hats and switch between the data architect role and other data roles.
But irrespective of company size, it is crucial that an individual takes ownership of the data architecture.
Why is data architecture important?
Modern data architecture brings multiple benefits:
- Lower operational complexity. As enterprises grow, so does the complexity of their systems. New data is captured by adding new sources, additional transformations are introduced in the serving layer to reshape datasets according to data consumers’ needs, and additional software is added to handle operational edge cases. A clear data architecture analyzes the overall system, cuts out unnecessary components, and simplifies data flows. This allows enterprises to tap into big data analytics and real-time analytics without sacrificing operational speed.
- Faster insights. The operational clarity of a simple data architecture allows enterprises to (1) identify the data pipelines that lead to insights and (2) improve those pipelines further to shorten time to insight.
- Agility. When data architecture is designed effectively, enterprises are more agile. Adding a new source, transformation, storage layer, or consumption pattern is clearer and faster, since the overview of the entire system makes it possible to analyze how best to do it without unnecessary duplication, movement, or data degradation.
- Cost-saving. Every time data is unnecessarily moved across the network, it increases costs. Without a clear data architecture design that avoids unnecessary data movement and duplication, costs ramp up.
To tap into the advantages of modern data architecture, enterprises must follow the best practices and principles of effective data architecture design.
What are the principles of an effective data architecture?
A modern data architecture is effective when it is designed around the following seven principles:
- Design is consumer-centric. Modern data platforms serve multiple stakeholders with different needs: software engineers who validate data extraction via logs, data engineers who check hash functions to understand ingestion reliability, and data scientists and data analysts who analyze the data. Because of this multitude of consumers, modern data architecture picks solutions and designs interfaces that best serve the needs of the consumer, not those of the data architect or engineer. This means data architects are concerned with the reliability of data warehouses, but also with the best choice of BI tooling so that data can be accessed and used in analytics, and with the setup of virtual sandboxes where data scientists can run machine learning experiments without depending on data engineers.
- Data is a shared asset. In modern data architecture, departmental silos are broken down. Data is a shared asset consumed by all stakeholders, so data architects eliminate regional, business unit, and departmental data silos when designing the data architecture. This can be as simple as collecting all data into a centralized data lake, or as complicated as exposing the different data assets to all stakeholders via distinct endpoints.
- Common understanding is the goal. Unless different stakeholders share the same understanding of KPI definitions, key dimensions, fiscal calendar dimensions, and other aspects of data, departmental silos will not break down. A common vocabulary leads to a common understanding. Modern data architecture implements data catalogs with key definitions, and product catalogs that index all important dashboards and data assets in a centralized place, giving data consumers a common denominator for thinking about and interpreting data (a minimal sketch of such a catalog entry follows this list).
- Data curation is necessary. Serving consumers raw data causes more problems than it solves: clarifying back-and-forths, and wrong assumptions that turn into false business advice. Data needs curation before it is exposed to consumers. For this reason, data architecture sweats the details such as data models, master data, metadata, reference data, and all other information that empowers consumers to use the data without the need for an interpreter. Where this is not possible, data architecture appoints data stewards, that is, people who serve as interpreters of potentially ambiguous data assets.
- Scalability is a concern even when it is not yet a problem. Data architects collect information about current operational needs, but design systems for future needs as well. Even when scalability is not an issue today, modern data architecture anticipates potential growing pains down the line and designs systems with the possibility of scaling in mind. Moving the data platform to the cloud is the first step. Designing data flows and pipelines to be fault-tolerant, replicable, or distributed is a more complex design pattern that embodies the same principle.
- Data flows are agile. Data architecture adapts to multiple business needs, but at the same time cuts complexity with a fury. Modern architecture reduces unnecessary data duplication, process codependence, and movement across systems and networks. This cuts costs but also avoids data corruption since simple and agile designs respond better to change.
- Security is essential. Modern data architecture puts security at its foundation. Access is granted on an individual and role basis, and every touchpoint and access event is tracked with modern technologies so that data does not fall into the wrong hands.
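To ground the common-understanding, curation, and security principles, here is a minimal, hypothetical sketch of what a single data catalog entry might record; every field and name below is an illustrative assumption, not the schema of any particular catalog tool.

```python
# A hypothetical data catalog entry; every field and value is illustrative.
catalog_entry = {
    "dataset": "analytics.monthly_active_users",
    "owner": "growth-team",                         # who is accountable for the asset
    "steward": "data.steward@example.com",          # the human interpreter of ambiguities
    "kpi_definitions": {
        "active_user": "a user with at least one session of 30+ seconds in the month",
        "month": "fiscal month, starting on the 1st",
    },
    "upstream_sources": ["raw.app_events", "crm.users"],
    "refresh_schedule": "daily at 02:00 UTC",
    "allowed_roles": ["analyst", "data_scientist"],  # role-based access, per the security principle
}
```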
How does data mesh fit into data architecture principles?
Enterprises have different ways of implementing the principles of modern data architecture, depending on their specific business use cases and the maturity of their data operations.
Let’s say your company offers a music streaming service to users, and you want to better understand what music users listen to and which songs to recommend to them.
The same business problem can be tackled with three architectural designs:
- A simple ETL pipeline. You could build an ETL pipeline that uses Go to send song-listening events (extract) to a Postgres database (load), implements trigger-like Postgres transformations that clean the data on insert and update events (remove outliers, remove corrupted data, aggregate events into sessions, …) (transform), and saves the clean data into two Postgres tables, user_sessions and events_aggregated (load). Data analysts use the latter tables to build simple recommendation systems (a minimal sketch of this option follows the list).
- A data lake architecture. Streaming events are sent via a message broker (Kafka) to Snowflake in near-real-time. At the same time, you build Python pipelines that collect data from your in-house CRM and advertising APIs at regular 20-minute intervals (a batch processing design; see the second sketch after the list). All data is ingested into a Snowflake schema for raw data. Within Snowflake, your data engineers create SQL-like procedures that normalize and denormalize the raw data into a star schema of tables. Analysts plug into the data using Tableau or Looker, while data scientists access the same information via Databricks to build recommender systems in Jupyter Notebook-like platforms.
- A data mesh architecture. Instead of building a centralized data pipeline, you break the problem down into 3 business use cases: a real-time machine learning recommender system, financial analysis of user-level usage tied to subscriptions and customer churn, and product (album/song) analytics. Each business use case is assigned its own team with a data project manager, software engineers, analysts, and data scientists. Each team develops the extraction, ingestion, transformation, and serving pipelines according to its needs, on top of a cross-team shared infrastructure that takes care of governance and security.
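As a rough illustration of the first option, here is a minimal sketch of the in-database transform step, assuming the psycopg2 package and hypothetical table and column names (raw_events, events_aggregated, listened_at, listen_seconds); the extractor itself is written in Go, so only the trigger-based cleaning and aggregation, installed from a Python setup script, is shown here.

```python
# A minimal sketch of the "simple ETL pipeline" option: a Python setup script
# that installs a trigger-based transform in Postgres. Table and column names
# are hypothetical, and events_aggregated is assumed to have a UNIQUE
# constraint on (user_id, day) for ON CONFLICT to work.
import psycopg2

conn = psycopg2.connect("dbname=streaming user=etl")  # connection string is a placeholder
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE OR REPLACE FUNCTION clean_and_aggregate() RETURNS trigger AS $$
        BEGIN
            -- drop corrupted rows and obvious outliers (listens longer than a day)
            IF NEW.song_id IS NULL OR NEW.listen_seconds > 86400 THEN
                RETURN NULL;
            END IF;
            -- keep a per-user daily aggregate up to date on every insert
            INSERT INTO events_aggregated (user_id, day, listens, listen_seconds)
            VALUES (NEW.user_id, NEW.listened_at::date, 1, NEW.listen_seconds)
            ON CONFLICT (user_id, day) DO UPDATE
                SET listens = events_aggregated.listens + 1,
                    listen_seconds = events_aggregated.listen_seconds + EXCLUDED.listen_seconds;
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER aggregate_on_insert
            BEFORE INSERT ON raw_events
            FOR EACH ROW EXECUTE FUNCTION clean_and_aggregate();
    """)
```

The trigger silently drops corrupted or outlier rows and maintains the per-user daily aggregate on every insert, which is exactly the kind of transform-on-load logic this option relies on.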
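And as a sketch of the batch half of the second option, here is what one of the 20-minute Python pipelines might look like, assuming the requests and snowflake-connector-python packages; the CRM endpoint, credentials, and the RAW.CRM_USERS table are hypothetical.

```python
# A minimal sketch of one 20-minute batch pipeline in the data lake option.
# The CRM endpoint, credentials, and the RAW.CRM_USERS table are hypothetical.
import requests
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",  # placeholders
    warehouse="LOAD_WH", database="MUSIC", schema="RAW",
)

def load_crm_users() -> None:
    """Pull users from the in-house CRM API and append them to the raw schema."""
    users = requests.get("https://crm.internal.example.com/users", timeout=30).json()
    rows = [(u["id"], u["email"], u["plan"]) for u in users]
    cur = conn.cursor()
    try:
        cur.executemany(
            "INSERT INTO CRM_USERS (ID, EMAIL, PLAN) VALUES (%s, %s, %s)", rows
        )
    finally:
        cur.close()

if __name__ == "__main__":
    # In production this would run on a 20-minute schedule via an orchestrator.
    load_crm_users()
```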
The same business problem can be tackled with three very different architectural designs. Why is data mesh the superior option?
The simple ETL pipeline will work well until your business operations scale. The use of Postgres as your event repository will become a bottleneck due to its transaction write locks.
The data lake architecture will then serve you well until your operational complexity grows. As more disparate teams tap into the same data, the common understanding of the data will start to diverge, and data curation will become harder with every new transformation and extraction added. Of course, you will be able to course-correct by implementing data catalogs and dividing labor, but the architecture itself is designed for a centralized approach, not for divergent business needs, and thus violates the principle of consumer-centric design.
Finally, the data mesh architectural pattern will fulfill all the principles of modern architecture:
- Consumer-centricity is guaranteed by having teams organized around data use cases instead of technology.
- Data as a shared asset is realized via each data team sharing its datasets with others, while also taking care of data curation.
- Each data team is aligned around a common understanding: it is accountable for its data KPIs, definitions, and the upkeep of its product catalogs.
- Scalability, governance, and security are taken care of by infrastructure as a service: the domain-agnostic data platform that all data teams use to develop their respective products.
- Data flows are agile by definition. Each team builds only the minimally necessary pipelines to develop its product. If a new data need arises, a new team is added to the mix of distributed teams to build the new product.
This analysis only skims the surface of the data mesh architectural design. If you are curious about how data mesh is envisioned, check out the in-depth article we wrote on the topic here.
How can Keboola help you implement your data architecture?
Data architecture is an ongoing process that adapts to changing business requirements.
Keboola can help you cut the time it takes to implement your data architecture vision, without sacrificing technological sophistication.
As an end-to-end data operations platform, Keboola offers infrastructure as a service that can help you:
- Easily ingest multiple data sources with a couple of clicks. Keboola automates data collection and loading across over 250 sources and destinations, with no additional engineering or maintenance needed.
- Transform, clean, and aggregate your data with automated jobs written in the preferred language of your data engineers and scientists. Iterate faster by tapping into tried-and-tested SQL, R, Python, or other languages that help you wrangle your data into shape.
- Share data with others and enable common understanding with Keboola’s Data Catalog. Need to share your data with data scientists in a safe environment where they can develop machine learning algorithms without engineering overhead? Use Keboola’s sandbox, a virtual environment that comes equipped with Jupyter Notebooks and RStudio.
- Implement data governance and guarantee enterprise-level security standards for all operations within the platform.
- Curate data with extensive metadata that goes beyond the offerings of a single data warehouse.
- Quickly adjust existing or add new data flows by simply connecting new components.
Try it for free. Keboola has an always-free, no-questions-asked plan, so you can implement your envisioned architectural designs with a couple of clicks. Feel free to give it a go, or reach out to us if you have any questions.