8 Ways You Can Reduce the Costs of Your Data Operations

How To

September 30, 2022

Updated on

10 min read

No items found.

Don’t sacrifice scalability for savings - have it both ways!

Thank you for your registration! We received your contact details. Expect an email with information about the event. We look forward to seeing you there!

Oops! Something went wrong while submitting the form. Try it again please.

Scroll to download

When left unchecked, the cumulative costs of your company data can ramp up fast.

From training CPU-intensive machine learning algorithms that aren’t used in production to supporting enormous databases storing every minute event “just in case”.

Letting your data operating costs run without checks and balances can quickly cause costs to bloat beyond your allocated budgets.

Luckily, improving data operations can help and in this blog, we are going to tell you how.

Complete the form below to get your complimentary copy.

Oops! Something went wrong while submitting the form.

Keboola takes care of all your data operations without breaking the bank. Only pay for what you use, no mental gymnastics for calculations required.

Try Keboola today

The 4 principles of cutting data costs

Four principles guide the philosophy of cost reduction.

These help you understand the big picture and context necessary to prioritize concrete data operation initiatives (explored later) that save operating expenses.

Principle #1 Balance scalability and savings

Unlike sales or marketing, the data teams are rarely directly responsible for revenue growth and cash flow.

With the rare exception of products with machine learning at their core, the data teams are usually a supportive role for other players in your company that help them make better business decisions that in turn indirectly drive growth.

It is often hard to quantify the direct impact data insights have on your company scaling and growth. But that doesn’t mean they are not impactful.

When cutting business costs, take into consideration the downstream effects of your decisions.

For example, unsubscribing from a business intelligence tool license might save costs today. But it can also cut your sales team's quick access to customer data needed to close cold calls next week.

Always keep scalability and savings in balance. A good rule of thumb is: to cut costs, it is better to optimize existing data operations than to remove entire data workflows and tools.

Principle #2 Simplify complex architecture

Complexity is the sister of growth. As your company grows, your data architecture tends to increase in complexity.

For example, let’s say you’re running an e-commerce shop. As you were growing, you wanted predictive analytics to better inform optimal delivery routes. Your data team decided to introduce a new database (MongoDB) that can scale geo data predictions better than your existing e-commerce transaction database (MySQL).

This is just one of the many examples of how entropy increases the complexity of your data operations - from additional tools (complex stack) to layered and codependent workflows (new ETL data pipelines, additional last-minute data quality scripts written for investor reports, …), your data operations become more chaotic as your company scales.

Simplify complexity to cut costs. We’ll look at concrete examples later on.

Principle #3 Loosen up rigid architecture

Mature and heavily regulated companies (banks, insurances, etc.) tend to have the opposite issue from chaos - they are too rigid.

From stacks that cannot change (“we need to keep the Oracle database for compliance reasons”) to rigid infrastructure (on-premise servers cannot be migrated to the cloud), a non-flexible DataOps can cause you opportunity costs - different tools and workflows could help you cut costs, but you do not implement them, because your company’s architecture is too rigid.

Loosen up fixed architectures to allow your companies to grow with lean methodologies.

Principle #4 Measure twice, cut once

You cannot fly a plane blindfolded. And you cannot cut costs unless you know what you’re cutting alongside the savings.

There are three ways to improve your cost measurements:

Start with the big picture. Data costs are often dispersed throughout your organization. IT operations take care of your infrastructure and servers, while the data analytics team manages paid data providers and visualization software programs. Bring all your costs together across departments and silos, to better understand how costs sum together.
Measure your existing stack with the total cost of ownership. To understand how much your tools are costing you, it is not sufficient to look at licenses, subscriptions, cloud consumption, and other direct costs. You need to include the total costs of owning the tools and workflows. If you need a full-time data engineer to manage your Redshift warehouse on AWS and all its associated processes, you need to include their paycheck in your AWS costs.
Measure your potential stack with opportunity costs. When considering migrating from an existing solution to a new one (e.g., from on-premise to cloud providers, from relational to NoSQL databases, from manually scripted Jupyter notebook dashboards and Python visualizations to business intelligence tools, …), calculate the cost of the current workflows and tools vs the alternative. Companies are often surprised how many hours are wasted tinkering with existing tools and maintaining processes. When an alternative product achieves the same results through automation at a fraction of the man-hour costs.

Now that we are equipped with the right principles to guide us, let’s look at concrete ideas on how to cut costs by improving data operations.

Where to cut costs?

You may not be able to reduce office supplies costs, travel expenses or downsize office spaces to optimize business expenses, but there are ways data teams can cut costs.

Here are most the most common areas where expenditure cutting can save your bottom line while helping your company grow.

#1 Identify stale and historical data and optimize its storage

Companies amass large quantities of data through the lifetime of their operations.

The majority of historical data is stale and is used only on rare occasions (transactional data kept for regulatory reasons, raw data dumps that were used in big data algorithm training but are seldom rechecked once the algorithm’s parameters have been calibrated, etc.).

The data cannot be deleted. But it can be re-architectured into cheaper storage.

For example, by combining a data lake and data warehouse architecture, the data lake can keep historical data in data dumps that are optimized for storage but not data processing (e.g. AWS’s Glacier), while data that is crucial for company growth is piped into the data warehouse for data analytics and data science initiatives.

#2 Model data assets according to business value

In the past, database administrators worried a lot about query optimization and data modeling at rest, so the database costs would not skyrocket.

Data engineers would have strong discussions about the merits of Kimball vs Inmon data warehouse designs or the superiority of star vs snowflake schemas.

But with the popularization of MPP warehouses (Snowflake, Redshift, BigQuery, …), storage became comparatively cheap and data modeling fell in popularity.

The cloud warehouses made 7- and 8-figure technological solutions available for 4- to 5-figures.

But that doesn't mean there is no room for improvement.

Modeling your data correctly can save you a lot of money. For example, if you identify that your data analysts perform the same join over two massive tables 20-times every day, you need to either create indices for those two tables to speed up processing or save (materialize) the joined table as an analytic table, to avoid computing all rows at each join.

#3 Remove un(der)used and costly processes

Do you have random EC2 instances running without any jobs on them? Is there an Airflow DAG updating a dashboard with a live data stream, despite no one looking at the dashboard in real time?

Every company has unused and underused data assets and data pipelines. Identify where random workflows and assets are being wasted and cut the unnecessary excess off.

#4 Deduplicate processes

When departmental silos exist, processes get duplicated.

From marketing and sales both running their customer data enrichment processes to engineering teams and data insights teams both collecting database logs for monitoring.

Analyze which workflows are duplicated and join them together to halve the costs of these processes.

#5 Filter and aggregate data early

Many companies make the mistake of not filtering the data early enough in the data lifecycle.

When you collect raw data from various data sources (data integration with your data lake), not all data needs to get to the data warehouse. Or at least not at the same granularity.

Let’s say you collect sensor data that is produced 50-times every second. But all your data operations, business intelligence, and SLAs to customers use sensor data at a granularity of 60-seconds (a 3k difference in magnitude).

You can aggregate the data (sum it, take averages, …) from the 50 Hz to the minute and make the aggregated data the input to your data warehouse, where application developers and data scientists will pick it up for their models.

#6 Manage data lineage (especially for bugs and onboarding)

Data management and data governance help you establish tools and processes that take control over data flow throughout its lifecycle - from collecting raw data to driving insights.

Having a clear understanding of where data is, what certain data means, how it is generated, how it is protected, and all the metadata associated with it is crucial for running streamlined data operations on three levels:

Problem resolution - When errors happen, data engineers, data scientists, and data analysts deploy root cause analysis to determine the source of the error. Without clear data lineage, it is hard and time-consuming to trace data flows and inefficiencies through your system.
Onboarding (human resources) - Whenever someone new joins the data team, they need to be onboarded. Undocumented data assets and data pipelines can quickly delay onboarding and postpone the onset of the new person bringing value to the business.
Changing complex systems - If there are no common standards or documentation, every new analytics use case can require weeks or months of data discovery, data ingestion, data cleaning, and ETL pipeline building.

One tool that makes data lineage process a breeze is Keboola. Not only can you automate your entire data pipeline: from collecting structured and unstructured data, to transforming and storing it for analysis. At each step Keboola automatically tracks all relevant metadata and constructs logs, which gives you a granular view of data lineage so you can identify root cause of errors immediatelly.

Create a free account and try it out!

#7 Prioritize processes and tools that self-serve business needs

How much does it cost for you to wait on a report to be produced and delivered?

This simple question hints at a common truth across companies: the data insights and data engineering teams are often bottlenecks in data-driven decision-making.

The story is familiar:

Stakeholders from multiple departments need a report so they can use data insights to drive their marketing/sales/finance/logistics/operations/people/hiring/firing decisions.
The stakeholders ask the data team for the report they need.
The data team puts the request on a backlog with more tasks than there are hours in a day.
The data team slowly and consecutively goes through the requests, fulfilling the tasks one by one, often ages after the request was made (or relevant).

This is not mismanagement of the data team, but a challenge to be solved through better data operations.

You could improve the operational throughput by increasing the headcount of your data team, or maybe outsourcing some operations. But labor costs a lot and you have so many other opportunities for optimisation.

Instead, invest in processes and tools that can help you automate reporting:

Model data in a format that is accessible to lay people. For example, a single table with customer information is preferred to 17 tables with (normalized) partial customer data.
Invest in self-service business intelligence tools. The majority of the modern BI datascape requires clicking, not programming, so non-technical people can set up their own dashboards.
Where business problems and data might be too complex to simplify, teach the stakeholders basic SQL or expose data automatically in a format they are familiar with (Excel).

Simple automation (data modeling, BI tool, upskilling, Excel) can cover the proverbial 80% of all requests and free up valuable resources.

#8 Avoid manual scripting as the plague

Development teams love manual scripting. It is fun to tinker with code to get something done. But the fun stops once the manually scripted systems start to fail.

A common example is writing extractors in Python/Java/Go/pick-your-language that collect raw data from data sources and ingest it into your data lake or data warehouse. This is fun until the data warehouse tables go through migration and the extractor script fails. Or the source data API changes endpoints or protocols and your development team spends a week figuring out how to collect the same data again.

Wherever possible, rely on tools to do the heavy lifting and avoid scripting. From maintenance costs to increased chances of making errors, manual scripting solutions seldom scale and carry long-term management costs.

Those were the 8 use cases of how to improve your data operations for cost saving without jeopardizing growth. But how do you implement them if you do not have a devoted data operations team?

You rely on the right tools to get the job done.

With Keboola you can reduce costs without slowing growth

Keboola is a data platform as a service designed to streamline and automate your in-house data operations end-to-end, so you can optimize business processes and save costs as a result.

How does it do it?

Every step of the data pipeline is automated. Raw data collection from multiple data sources is achieved with automated Extract components. Data cleaning can be done with SQL/Python/R and other languages. Data loading (and data modeling) can be done with automated Writers that send data to Excel files or data warehouses. Keboola’s 1-click components allow you to build ETL pipelines with minimum effort.
Entire data pipelines are orchestrated. Each job (extract, transform, load) can be scheduled and automated, so you set it once and don’t have to lift a finger. What’s more, with Keboola Templates, you can create workflows (entire ETL pipelines) and share them with others, so they can replicate your work and set up complex corporations with one click.
Full governance and lineage. Each process is tracked on the user, changelog, and event levels. The rich metadata and exhaustive tracing allow you to set up data lineage. Moreover, data sharing and access permissions can be developed on a granular level, to always keep safety in mind when setting up your data governance. All of this is done out of the box, without your data engineers having to code new solutions.
Clarity of product usage and speedy onboarding. Every user can understand your whole data operations from the intuitive UI of Keboola’s data platform. Moreover, Keboola allows you to document and share your data with its Data Catalog, a feature that helps you speed up onboarding new team members onto new data assets. For novices, there is Keboola’s Data Academy, a self-learning platform that equips you with all the platform knowledge needed to master data operations.
Fair pricing model. In Keboola, you only pay for what you use. The pricing is determined with the universal currency of “computational runtime minutes”, so all your data operations - whether we are talking extracting data, transforming them, loading them, building machine learning algorithms, or any other Ops activity - can easily be measured and compared without having to do mental gymnastics for your calculations. Did we say that every account gets 300 minutes for free every month? That’s right, Keboola is designed to help you optimize and streamline your data operations and helps you cut costs across the board with a competitive and fair pricing model (check the details here).

… and there are many more features, perks and use cases. Keboola allows you to speed up and automate your data operations for data science, security, data integration optimization.

As Brett Kokot, Director of Product at Roti and satisfied Keboola user, said:

“I don’t want to manage Airflow, I don’t have time for that! I can set up an orchestration in Keboola in 5 minutes, that would take 2+ hours of coding there.”

Try Keboola out for yourself.

Keboola offers a no-questions-asked, always-free tier (no credit card required), so you can play around and optimize business operating costs with a couple of clicks.

Download for Free

Oops! Something went wrong while submitting the form. Try it again please.

8 Ways You Can Reduce the Costs of Your Data Operations

Keboola takes care of all your data operations without breaking the bank. Only pay for what you use, no mental gymnastics for calculations required.

The 4 principles of cutting data costs

Principle #1 Balance scalability and savings

Principle #2 Simplify complex architecture