Data architecture is a hot topic right now.
And rightfully so.
Technological advances have brought about a myriad of new solutions that go beyond traditional relational databases and data warehouses. These solutions enable companies to accelerate their entire data pipeline (or at least remove painful bottlenecks) and shorten their analytic cycles.
The portfolio of data assets managed by companies is also growing.
Nowadays, data collection has to cover multiple data sources, from advertising APIs and CRMs to event pipelines. Data lakes and other data platforms are being deployed to handle the growing volume of data.
Data movement from one store to another may have been simple in the past, but it can now require a whole department of data engineers just to deal with the overhead of deploying Kafka or Hadoop to parallelize the flow of data between infrastructures.
It’s no wonder that both the importance of and the demand for good data architecture design are booming. Modern data architecture requires the savvy data expert to master a multitude of complex data systems and processes.
In this introduction, we’ll take a look at data architecture and the best practices and methodologies involved in designing complex data pipelines.
What is data architecture?
Just as architecture lays out the blueprint for a small house or a complex apartment building, data architecture sets out the blueprint for all of the data flows within an organization - from small companies to huge enterprises.
You can think of data architecture as a unified view of every data flow, from raw data to insights and back. Along the way, it specifies all of the technology and processes needed to meet the information requirements.
“Data Architecture, in its broadest sense, asks, ‘What are we trying to do as a business?’ And then from all the diverse technologies, ‘What’s the best fit for that purpose and how do they work together?’” - Donna Burbank, Managing Director at Global Data Strategy
This blueprint bridges the divide between the business and technology silos within an organization. It starts with business objectives, specifies data requirements and data standards, then pins down the infrastructure and tools needed to get the data flows going.
How to design the data architecture
As a data architect, you’ll work through five steps when designing the data architecture:
- Start with stakeholder interviews to specify business requirements. Talk to business users from all of the different departments (and business units within departments) to understand what data they need to achieve their business strategy. Be sure to discuss both current and future needs, so you can anticipate which demands might become relevant down the line.
- Specify must-haves vs. nice-to-haves. Not all business needs are created equal. Prioritize the business needs and the data models required to meet them.
- Create an inventory of in-house data and skills. Take note of what datasets are already available, what’s missing, and what kind of expertise you have in-house that can help you to close that gap.
- Make a list of the technologies that you’ll need. Make sure that you include the whole stack, from cloud storage to a specific RDBMS (will you go with Postgres or MySQL?) to analytic tools such as Looker for your reporting needs. This is where your expertise has the chance to shine. It’s also where the hard questions should be asked:
“Do we need to look at real-time data, or are daily reports sufficient?”
“Is parallel data ingestion necessary, or can we use a queueing technology?”
“Will on-premises hardware handle the scaling needs, or is it better to opt for a cloud provider?”
There are bound to be trade-offs that you’ll need to make in order to meet all of your business needs, and knowing which technological stack handles them best is part of the data architect’s know-how.
- Create documents that specify the data architecture design. Different companies have different demands. In startups, you can get by with simple charts that explain how the ETL or ELT process will flow data through your infrastructure. In enterprises, you might need to prepare documentation specifying the different levels (conceptual, logical, physical) of your data model. (A minimal code sketch of such an ETL/ELT flow follows this list.)
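To make that flow concrete, here’s an illustrative ELT sketch in Python. It is a minimal sketch, not a prescription: the ads API endpoint, credentials, and table names are all hypothetical placeholders, and a real pipeline would add retries, incremental loading, and monitoring.

```python
# Minimal ELT sketch: extract raw ad spend from a hypothetical ads API,
# load it into a Postgres warehouse, then transform it with SQL.
# All endpoints, credentials, and table names are illustrative placeholders.
import requests
import psycopg2

def extract() -> list[dict]:
    # Pull yesterday's spend; the endpoint and token are placeholders.
    resp = requests.get(
        "https://api.example-ads.com/v1/spend",
        headers={"Authorization": "Bearer <token>"},
        params={"date": "yesterday"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["rows"]

def load_and_transform(rows: list[dict]) -> None:
    conn = psycopg2.connect("postgresql://user:pass@warehouse:5432/analytics")
    with conn, conn.cursor() as cur:
        # Land the raw data first (the "E" and "L" of ELT)...
        cur.executemany(
            "INSERT INTO raw.ad_spend (campaign, spend_usd) VALUES (%s, %s)",
            [(r["campaign"], r["spend_usd"]) for r in rows],
        )
        # ...then transform inside the warehouse (the "T").
        cur.execute(
            "INSERT INTO analytics.daily_spend (campaign, total_spend_usd) "
            "SELECT campaign, SUM(spend_usd) FROM raw.ad_spend GROUP BY campaign"
        )
    conn.close()

if __name__ == "__main__":
    load_and_transform(extract())
```

In a design document, this entire flow is often just one arrow on a chart. The architect’s job is deciding that the arrow should exist, where it points, and which technology carries it.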
But in the same way that designing a house is not the same as building it (it’s kind of hard to sleep under a blueprint), designing the data architecture is not the same as implementing it.
How to deploy data architecture
The data architect can be a one-woman band (also called a unicorn✨🦄✨) or the captain of the ship directing others. To implement the vision specified in the data architecture design documents, an organization needs a small army of experts:
- Data architect. This one is obvious. However, a data architect doesn’t just design the entire data system, they’re also responsible for its implementation. This means that they need to project-manage the deployment, check that the specs are being fulfilled, course-correct when the implementation deviates from the plan, and further specify data entities that were ambiguous or underspecified in the original design.
- Business decision-makers. Business stakeholders don’t just talk to the data architect before the system design - they have multiple touchpoints along the way, both to verify that the business requirements are being met and to clarify any data requirements that were left ambiguous. Talking to the business users of your data is also an educational opportunity to teach them about the data that they will be consuming.
- Data engineers and database administrators (DBAs). The engineers will implement the actual data models and ETL/ELT pipelines. They make sure that the right transactions are executed in the specified order at the correct location (see the orchestration sketch after this list).
- DataOps and DevOps. Operational and IT personnel will deploy and monitor the reliability of the data services. From the server’s CPUs to Kafka’s downtimes, DataOps are crucial to setting up the underlying infrastructure of the entire system.
- Data scientists and data analysts. Sometimes, the consumers of the system will not be business coworkers, but data experts. This is especially true for companies that build products and services with artificial intelligence. When this is the case, data scientists will help you to specify, and sometimes implement, the pipelines and data models needed for machine learning.
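As a taste of how engineers enforce that ordering in practice, here is a minimal orchestration sketch. It assumes Apache Airflow 2.x; the DAG name, schedule, and task callables are hypothetical placeholders for whatever your pipeline actually does.

```python
# Minimal Airflow 2.x DAG sketch: guarantees "transform" never runs
# before "extract_and_load" has succeeded. All names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load():
    # Placeholder: pull data from sources and land it in the warehouse.
    print("raw data landed")

def transform():
    # Placeholder: rebuild downstream models from the landed data.
    print("models rebuilt")

with DAG(
    dag_id="daily_ad_spend",        # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # run once per day
    catchup=False,
) as dag:
    land = PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
    model = PythonOperator(task_id="transform", python_callable=transform)

    land >> model  # ">>" encodes the required execution order
```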
Depending on your in-house skills, business needs, and your appetite for outsourcing, deploying the envisioned data architecture can fall onto the shoulders of a single super-talented individual, or a whole team of hard workers.
But a data architect is needed even after the initial deployment.
Why so?
Left to themselves, systems tend to become chaotic. It’s a simple law of entropy: complex systems grow more diverse rather than staying homogeneous.
Whether it’s integrations breaking when data sources change their attributes following new software updates, or changes in the market which turn the business needs upside down, the data architect is responsible for the entire ecosystem of data infrastructure, software, tools, and processes.
The data architect must constantly monitor the system and adjust both the design and the implementation to reflect internal and external changes and to keep pace with evolving business needs. It’s a lot of responsibility and a great deal of work, but it brings fantastic benefits for the business.
What are the advantages of well-designed data architecture?
Well-designed data architectures offer multiple benefits:
- Clarity of data operations. Having a clear data architecture is not just a bragging right for data managers. When you need to check metadata or data transformations in real-time for an analytic task, a well-designed architecture allows you to clearly identify where the data is at each point. Once the data operations are clearly spelled out, you can quickly locate bottlenecks and improve the entire system even further.
- Increased operational flexibility, speed, and resilience. Well-designed data architectures are prepared for scaling. This means that they can flexibly handle different volumes and speeds of data with high resilience (aka without failing).
- Technological savings. Choosing the right technology for the task means that you’ve also found a good economic balance between the value that technology provides for your business needs and the cost of that technology. In general, the trade-offs between the different solutions being considered also include the cost of those solutions, so a well-designed data architecture saves your company money.
What are the challenges of data architecture?
Despite the multiple benefits of a well-designed and implemented data architecture, setting it up can come with a lot of challenges:
- Design and setup require wide expertise. To properly architect the entire data ecosystem, a data architect needs to be an expert in a wide array of technological solutions. At the same time, they need to have in-depth knowledge of the solutions in order to fully understand the trade-offs of choosing one implementation over another. And if the data architects do not have this expertise themselves, this raises outsourcing costs.
- The trade-off between commitment and flexibility. It’s really hard to strike the right balance between committing to a technological stack and keeping yourself flexible to any unforeseen changes.
- Implementation cost and time. Whether it’s due to the lack of in-house technical expertise or simply the complexity of the system that you’re implementing, deploying the envisioned data architecture can be very expensive.
- POCs can be expensive. Sometimes, trying out a new technology in a proof of concept (POC) carries its own price tag. But failing to try the new tech stack carries the opportunity cost of opting for the already tried-and-tested but suboptimal solution.
- Scalability. For a simple POC, SQLite would be fine. But if you plan to scale, Postgres or MySQL is a better bet. And if you plan to scale even further, even they might not be sufficient. Netflix is a prime example: the company had to change its entire data architecture to handle the increased processing and analytic load of its customers. And again. And then one more time when that wasn’t enough. A lot of companies suffer from these growing pains. As you expand, your previous data architecture drags you down and demands time and money to change it so that it keeps up with your business growth.
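One common hedge against the database side of these migrations, sketched below under the assumption that your workload fits a SQL database at all, is to keep the engine choice behind a single abstraction such as SQLAlchemy, so that moving from SQLite to Postgres is a one-line change. The connection URLs and table name are illustrative placeholders.

```python
# Engine-agnostic access via SQLAlchemy: swap one URL to move from a
# SQLite POC to Postgres. URLs and table names are placeholders.
from sqlalchemy import create_engine, text

# Change this single line as you outgrow each engine:
# DB_URL = "sqlite:///poc.db"                               # POC stage
DB_URL = "postgresql://user:pass@warehouse:5432/analytics"  # growth stage

engine = create_engine(DB_URL)

with engine.connect() as conn:
    # Portable SQL runs unchanged on both engines.
    result = conn.execute(
        text("SELECT campaign, SUM(spend_usd) FROM ad_spend GROUP BY campaign")
    )
    for campaign, total in result:
        print(campaign, total)
```

This softens only the storage layer, of course: vendor-specific features, infrastructure, and sheer data volume will still force harder architectural decisions as you grow, which is exactly the Netflix story above.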
So, how do companies get past these growing pains? Either by hiring elite engineers to do the hard work or by relying on smart tools.
Tools to help you set up and manage your data architecture
Keboola - the all-in-one DataOps platform - was built to ease the work of data practitioners.
When designing your data architecture, Keboola can help you to speed up the process at a fraction of the cost:
- With its plug-and-play design, pick the technological stack that you want to test or implement and deploy it in a matter of clicks.
- Keep a bird’s-eye view over your entire data architecture within Keboola itself. No need to change platforms or write extended documentation. Keboola centralizes the entire know-how within the platform.
- Experiment with new stacks without adding to your overhead. It’s simply a matter of picking a different connector within the GUI.
- Scale with ease. Keboola natively scales to different speeds and volumes of data without breaking down and causing you infrastructural hiccups.
- Automate your processes. Automation is one of the leading principles behind Keboola. It eliminates the need to manually set and adjust configurations every time your pipelines run.
- Collaborate. Whether you work with data engineers, data scientists, or DataOps, Keboola centralizes the data pipeline and tooling so that everyone works within the same data environment.
Curious about what Keboola can do for you? Create a free account and start exploring its endless possibilities.