“Where there is data smoke, there is a business fire.” —Thomas Redman
Taming Data Chaos
A big part of data teams’ responsibilities is dealing with the unpredictable. Data pipelines don’t always run without incident: you need to rerun processes and fix data processing issues—in other words, put out data fires—to keep stakeholders happy.
For every significant roadblock, additional time and effort goes into investigation and postmortem reports to make sure the incident doesn’t recur. But naturally, incidents keep happening. To alleviate these issues and bring some order to the chaos, you need to look at the big picture. And counterintuitively, the way to stop playing catch-up and build a foundation for reliable, predictable pipelines is to start with on-call duty.
Understanding the Real Impact of Failing Data Pipelines
Before we get into the importance of site reliability engineers, let’s briefly explore what’s at stake. When data platforms fail, the effects ripple through the organization: stakeholders, leadership—and even customers—may be impacted.
Wasted time, unutilized insights, missed opportunities, misinformed decisions—all of these negatively impact revenue and erode trust. This leads to a domino effect of uncertainty: stakeholders lose confidence in data and delay projects to manually validate the results.
In this regard, the definition of “failure” is much broader for business teams than it is for data teams. Even if the pipeline didn’t fail from a purely technical perspective, missing data, duplicates, stale data, and inconsistencies are each enough to make reporting inaccurate or incomplete, making data-driven decisions impossible.
Ultimately, the consequences circle back to the data team. Repairing broken pipelines takes time and effort. When data teams need to step in, manual workload increases drastically, and so does the cost. And if this downtime happens to overlap with board meetings, QBRs, or investor reports (and it often does, as per Murphy’s law!), the bottleneck becomes even more costly, putting data engineers under immense pressure.
This creates a vicious circle. Every incident indirectly contributes to the next one because data teams focused on recovery can’t work on proactive improvements and value-adding projects. Instead of building, maintenance becomes the main focus.
On-Call Data Teams and Work-Life Balance
On-call duty is the practice of designating a person from the IT, infrastructure, or site reliability engineering (SRE) team to be available during specific times, even outside normal working hours, to respond to incidents. On-call duty can help remediate issues as they happen, rather than when it’s too late, and if the process is set up properly, it reduces the overall workload of the data engineering team and helps increase efficiency. The important part is not to overdo it and to prioritize, making sure the team’s personal lives are not encroached upon for no good reason.
“Paging a human is a quite expensive use of an employee’s time. If an employee is at work, a page interrupts their workflow. If the employee is at home, a page interrupts their personal time, and perhaps even their sleep. When pages occur too frequently, employees second-guess, skim, or even ignore incoming alerts, sometimes even ignoring a "real" page that’s masked by the noise. Outages can be prolonged because other noise interferes with a rapid diagnosis and fix. Effective alerting systems have good signal and very low noise.” —Rob Ewaschuk, Site Reliability Engineering: How Google Runs Production Systems
Here are some principles for setting up an effective on-call duty process:
Minimize the Scope Identify the most important processes and forget about the rest. If the team has to respond to a minor incident every other night, morale will drop quickly and burnout will set in. As a first step, label alerts with high, medium, and low priority to manage escalation and alert routing, and don’t forget to aggregate alerts so a single incident doesn’t result in 1,000 calls and messages (a minimal routing-and-aggregation sketch follows this list).
Establish an On-Call Rotation System Keep it fair by creating a clear on-call rotation schedule. Distribute the load so no one person carries the burden of being the “always-on” go-to. A properly configured secondary escalation helps manage the pressure: if the on-duty team member doesn’t acknowledge an alert in a timely manner, it escalates to the secondary, which ensures that every incident gets someone’s attention. It might be tempting to set up a follow-the-sun system if you have a colleague on the other side of the globe, but in practice it’s distracting and stressful. Follow-the-sun works well for covering a team member who needs a night off, but making it a regular shift is not recommended. That being said, if you have whole teams on different continents rather than individuals, this model works like a charm, as team members can still rotate shifts.
Create Incident Playbooks Also known as runbooks, these help the on-duty team member resolve specific issues. You might start with just a simple Confluence page, Notion doc, or Google Doc containing a step-by-step guide on how to rerun the failed pipeline. Ideally, a playbook should be created as soon as an incident type is identified and added to monitoring, with known steps to resolve it. In the beginning, a simple list with links to documentation explaining where to look, what to check, and how to debug can be very helpful.
Select Your On-Call Tools PagerDuty, Opsgenie, Splunk On-Call (VictorOps), etc. Don’t spend too much time choosing the right one; setting them up is generally pretty easy, and nothing stops you from migrating to a different tool if the need arises. Your focus should be on configuration and integration with your pipelines. For example, in Keboola we have notifications for failed and long-running pipelines, which can be easily integrated with PagerDuty via email (a minimal sketch of such an email-based alert follows this list). Building your own alerting and incident response tool is not recommended.
Organize Shift Handoffs Make sure there’s a process for handoff after every on-call shift. This could be a quick debrief or a handoff log so the next person on call knows about any unresolved issues. Having the whole team attend the shift handoff helps prevent many incidents, so it is highly recommended.
Conduct a Monthly Review If getting the whole team together for every handoff seems excessive, try holding a monthly review meeting. Discuss what worked and what didn’t, and go over incidents from the past month. Update the playbooks as needed. Some alerts might need to be removed from the monitoring process because they fire too frequently without having critical impact. Don’t focus on the numbers; always keep the scope in mind.
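To make the prioritization and aggregation step concrete, here is a minimal sketch in Python. The notification helpers, the priority labels, and the ten-minute aggregation window are assumptions for illustration, not part of any particular tool.

import time
from collections import defaultdict

# Hypothetical notification helpers -- replace with your own integrations.
def send_page(message: str) -> None:
    print(f"PAGE: {message}")          # e.g., your paging tool

def send_slack_message(message: str) -> None:
    print(f"SLACK: {message}")         # e.g., a team channel for non-urgent alerts

AGGREGATION_WINDOW_SECONDS = 600       # collapse repeats of the same alert for 10 minutes
_last_sent: dict[str, float] = defaultdict(lambda: 0.0)

def route_alert(alert_key: str, priority: str, message: str) -> None:
    """Route an alert by priority and drop duplicates within the aggregation window."""
    now = time.time()
    if now - _last_sent[alert_key] < AGGREGATION_WINDOW_SECONDS:
        return                         # the same incident was already reported recently
    _last_sent[alert_key] = now

    if priority == "high":
        send_page(message)             # wakes someone up
    elif priority == "medium":
        send_slack_message(message)    # visible, but no page
    # low-priority alerts only land in logs or a daily digest

Calling route_alert("daily_sales_failed", "high", "Daily sales pipeline failed at 02:10 UTC") would page once and silently drop identical alerts fired within the next ten minutes.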
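And here is a minimal sketch of the email-based integration mentioned in the tool-selection principle: when a pipeline fails, an email is sent to the integration address your on-call tool generates for a service. The SMTP host, sender address, and integration address below are placeholders.

import smtplib
from email.message import EmailMessage

# Placeholders -- substitute your own SMTP relay and the integration email
# address generated for your on-call tool's service.
SMTP_HOST = "smtp.example.com"
ALERT_ADDRESS = "your-service@your-subdomain.pagerduty.com"

def page_on_failure(pipeline_name: str, error: str) -> None:
    """Send a failure notification to the on-call tool's email integration."""
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] Pipeline failed: {pipeline_name}"
    msg["From"] = "data-platform@example.com"
    msg["To"] = ALERT_ADDRESS
    msg.set_content(f"Pipeline {pipeline_name} failed.\n\nError:\n{error}")

    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)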
Proactive Monitoring: Building a Solid Foundation for Data Health
Effective monitoring is the backbone of a healthy data pipeline, but implementing it without clutter is a long-term process that takes considerable effort. No matter what tools you use to monitor the data flow, you need to integrate them with your alerting system. There are plenty of data quality tools on the market that do a pretty good job of monitoring for generic anomalies. But every data stack has environment-specific issues that need to be analyzed and understood by your team for the best results.
Monitoring data health includes the following:
Data Completeness and Freshness Monitoring This one’s easy: ensure that data arrives on time and is complete, and set up alerts for any significant delay or unexpected gap (a minimal freshness check sketch follows this list).
Use of Rolling Windows to Filter Noise A rolling window approach helps avoid false alarms by comparing data over relevant time periods rather than against static thresholds.
Volume Anomaly Detection Sudden surges or drops in data volume can signal issues with upstream systems. Set up volume anomaly checks to detect unusual activity (a rolling-window sketch follows this list).
Schema Monitoring Even small schema changes can cause downstream issues. Monitor for schema changes and set up alerts to address them before they impact production (a snapshot-and-diff sketch follows this list).
Data Drift Alerts Detect patterns in the data that deviate from the norm (e.g., unexpected changes in data values). This is helpful for identifying issues with upstream data sources.
User Feedback Loops Set up a channel where users can give feedback on data health. This helps identify areas where your automated checks failed. Incorporating these into your data management is crucial for building trust.
Re-Runs Policies differ from team to team, but in the real world, plenty of issues are resolved by just “turning it off and on again.” There’s nothing wrong with building such retries into all your data pipelines, but it pays to keep an eye on them: if something never succeeds on the first run, it not only consumes unnecessary resources but could also be masking a bigger issue that may later spiral out of control.
Latency Track latency within each pipeline stage, from data ingestion to transformation and delivery. Latency spikes reveal emerging bottlenecks that need to be addressed.
Dependency Health Checks Whenever an upstream dependency breaks, your data suffers with it. To prevent this, try to understand the external systems you depend on as well as you can. If you have an important data source and can get its health status, you can easily explain what happened without debugging the entire pipeline all the way back to the source. Cooperating with the teams responsible for these systems is also vital for efficiency.
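Turning to the monitoring items above, here is a minimal sketch of a freshness and completeness check. The table name, column names, and thresholds are illustrative, conn is any DB-API connection to your warehouse, and the check assumes loaded_at is stored as a UTC timestamp.

from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=6)        # data older than this counts as stale
MIN_DAILY_ROWS = 10_000             # anything lower suggests an incomplete load

def check_freshness_and_completeness(conn) -> list[str]:
    """Return a list of problems; an empty list means the dataset looks healthy."""
    problems = []
    cur = conn.cursor()

    # Freshness: when did data last arrive? (assumes loaded_at is UTC)
    cur.execute("SELECT MAX(loaded_at) FROM analytics.daily_sales")
    latest = cur.fetchone()[0]
    if latest is None or datetime.now(timezone.utc) - latest > MAX_AGE:
        problems.append(f"daily_sales is stale (latest load: {latest})")

    # Completeness: did today's load bring roughly the expected volume?
    cur.execute(
        "SELECT COUNT(*) FROM analytics.daily_sales "
        "WHERE loaded_at >= CURRENT_DATE"
    )
    row_count = cur.fetchone()[0]
    if row_count < MIN_DAILY_ROWS:
        problems.append(f"daily_sales looks incomplete ({row_count} rows today)")

    return problems    # feed these into your alert routing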
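The rolling-window and volume-anomaly items can be combined: instead of a static threshold, compare today’s volume against the last two weeks. A minimal sketch, assuming you can already fetch an ordered list of daily row counts; the window size and z-score threshold are illustrative defaults.

import statistics

def volume_anomaly(daily_counts: list[int], window: int = 14, z_threshold: float = 3.0) -> bool:
    """Flag the most recent day if it deviates strongly from the rolling window.

    daily_counts is ordered oldest-to-newest; the last element is today.
    """
    if len(daily_counts) < window + 1:
        return False                            # not enough history yet
    history = daily_counts[-(window + 1):-1]    # the rolling window, excluding today
    today = daily_counts[-1]
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0   # avoid division by zero on flat history
    z_score = abs(today - mean) / stdev
    return z_score > z_threshold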
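For schema monitoring, one simple approach is to snapshot column names and types and diff them against the previous run. A minimal sketch, assuming the snapshot lives in a local JSON file and the warehouse exposes an ANSI-style information schema; the %s placeholder style is driver-specific, so adjust it for your database client.

import json
from pathlib import Path

SNAPSHOT_PATH = Path("schema_snapshots/daily_sales.json")   # illustrative location

def current_schema(conn, table: str) -> dict[str, str]:
    """Read column names and types from the information schema."""
    cur = conn.cursor()
    cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = %s",
        (table,),
    )
    return {name: dtype for name, dtype in cur.fetchall()}

def detect_schema_changes(conn, table: str) -> dict[str, list[str]]:
    """Diff the current schema against the stored snapshot, then update the snapshot."""
    new = current_schema(conn, table)
    old = json.loads(SNAPSHOT_PATH.read_text()) if SNAPSHOT_PATH.exists() else {}
    changes = {
        "added": [c for c in new if c not in old],
        "removed": [c for c in old if c not in new],
        "retyped": [c for c in new if c in old and new[c] != old[c]],
    }
    SNAPSHOT_PATH.parent.mkdir(parents=True, exist_ok=True)
    SNAPSHOT_PATH.write_text(json.dumps(new, indent=2))
    return changes     # alert if any of the lists is non-empty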
Incident Management for Data Teams: Real-Life Tips for On-Call Handling and Response
To minimize impact, it’s crucial to respond to data incidents quickly and effectively. A big part of this is effective communication and cooperation. Don't play the blame game. Always focus on WHAT and WHY instead of WHO.
Incident Documentation Incident documentation and up-to-date playbooks are building blocks for the future reliability of your systems.
Postmortem Reporting After every significant incident, document what went wrong and what corrective actions were taken. Consider keeping these records accessible for users by creating an incident log.
Status Page Give your users an easy way to view the current status and health of critical data. This avoids redundant requests and messy communication while incidents are being resolved, helps gather relevant feedback when everything is operating as it should, and builds users’ trust in data quality.
Root Cause Analysis (RCA) Conduct RCA after significant incidents to identify the underlying issue and prevent it from recurring. Depending on the incident, this could be part of your monthly incident review meeting. Solving root causes helps lower the maintenance overhead of your platform.
Go Beyond Technical We recommend reading about root cause analysis in a business context and trying techniques such as the 5 Whys and the Ishikawa fishbone diagram. Check out the article Root Cause Analysis Explained: Definition, Examples, and Methods by Tableau for more on RCA.
Continuous Improvement and Data SLOs: Building Reliability and Trust
SLO stands for service level objective, a term popularized by the book Site Reliability Engineering. A reliable data pipeline requires constant adjustments. Here’s how to create a culture that values reliability and keeps pipelines performing at their best: it’s all about understanding the business impact of incidents and aligning your efforts to avoid disruptions.
Data SLOs as Measurable Goals In essence, SLOs are clear objectives for things like data freshness, quality, and availability (a sketch of measuring SLO attainment follows below). Continuously educate users and stakeholders, for example via a data team status page. Reevaluate data SLOs periodically, adjusting as business needs change or pipeline capabilities improve.
Stakeholder Communication Regularly communicate pipeline performance and incident metrics to stakeholders, which helps build trust and confidence. This is often intertwined with the budget, so you need to align on priorities. Again, don’t focus on the numbers—focus on impact. For example, 40 resolved incidents in a quarter is impressive, but more important is how many were diagnosed with RCA and how much time was needed to fix them. And even more important is whether the incidents affected the business and what potential negative consequences were avoided.
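To illustrate how a data SLO becomes a measurable goal, here is a minimal sketch that computes attainment from logged freshness checks. The 99% target and the example results are assumptions for illustration, not a recommended value.

def slo_attainment(check_results: list[bool]) -> float:
    """Fraction of checks in the period where the freshness target was met."""
    if not check_results:
        return 1.0
    return sum(check_results) / len(check_results)

# Hypothetical objective: "daily sales data is fresh by 7:00 on 99% of days per quarter".
SLO_TARGET = 0.99
attainment = slo_attainment([True] * 88 + [False] * 2)   # e.g., 2 misses in 90 days
print(f"Attainment: {attainment:.2%} (target {SLO_TARGET:.0%}) -> "
      f"{'met' if attainment >= SLO_TARGET else 'missed'}")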
Actionable Tips
To wrap up, let’s briefly recap all the steps needed to set up an effective data pipeline reliability process.
Start with row count checks for a single high-impact pipeline (e.g., a daily sales overview) and set up an on-call rotation with two skilled data engineers. This will help you start building your first playbooks while reducing pipeline failures.
Start by writing a Python or SQL script that checks for the desired state at a specified time and sends a text message (or places a call) using Twilio or Plivo. Create a CSV file with the schedule to determine who gets contacted (a minimal sketch of such a script follows this list). You will get better tools later; for now, focus on quick outcomes.
To keep stakeholders informed, publish a report showing the status of your jobs. It can be a Streamlit app, a simple Notion page, or a Google Sheet (see the Streamlit sketch after this list). Just a few rows with critical pipelines colored red or green will greatly reduce confusion among your users.
Pick a critical dataset and start monitoring its health. Start simple (focus on freshness), gather user feedback to add more checks, and then publish the results on your status page to build trust.
If you’re dealing with data from databases like PostgreSQL, MySQL, Microsoft SQL Server, or Oracle, consider advanced techniques like change data capture to synchronize data to your data warehouse. This minimizes the data loading impact on your source systems and pipelines while shortening the time needed to resolve issues.
Pick a major incident from the recent past and try to document it. Look for relevant information across your organization: a short description of what happened and why, who was affected, how long it took to resolve, and a short description of the fix. This could be the seed of your first playbook.
Make sure you complete such an incident report for all future incidents.
Conduct a postmortem session with RCA without assigning blame. Pick a serious incident from the recent past and try to get to the bottom of it. Publish the results to inform your users. Follow this process for all major incidents.
Your first SLO should be very simple. Focus on making it understandable for the user. Let’s say you already have your status dashboard on a Notion site or in a Streamlit app. Conduct a session with your advanced users and prepare a short description of the SLOs so the rest of the company can understand them.
Organize a session with key stakeholders and pick one SLO as a focus for the next quarter. Moderate the discussion in order to pick something reasonably easy to measure and evaluate. Keep the stakeholders regularly updated on the metric and repeat the session in the next quarter.
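Here is a minimal sketch of the script described in the first two tips, assuming a CSV schedule with date and phone columns, a Twilio account, and a DB-API connection to the warehouse; the table name, threshold, and credentials are placeholders.

import csv
from datetime import date

from twilio.rest import Client   # pip install twilio

# Placeholders -- use your own credentials and numbers.
TWILIO_SID = "ACxxxxxxxx"
TWILIO_TOKEN = "your-auth-token"
FROM_NUMBER = "+15550001111"
MIN_ROWS = 10_000                # expected minimum row count for the daily load

def on_call_phone(schedule_csv: str = "oncall_schedule.csv") -> str:
    """Look up today's on-call phone number in a CSV with date and phone columns."""
    with open(schedule_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row["date"] == date.today().isoformat():
                return row["phone"]
    raise RuntimeError("No on-call entry for today")

def check_and_alert(conn) -> None:
    """Text the on-call engineer if today's row count is suspiciously low."""
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM analytics.daily_sales WHERE loaded_at >= CURRENT_DATE")
    row_count = cur.fetchone()[0]
    if row_count >= MIN_ROWS:
        return                    # all good, nobody gets woken up

    client = Client(TWILIO_SID, TWILIO_TOKEN)
    client.messages.create(
        body=f"daily_sales check failed: only {row_count} rows loaded today",
        from_=FROM_NUMBER,
        to=on_call_phone(),
    )

Schedule it with cron or your orchestrator at the time the data is expected to be ready.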
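And a minimal sketch of the status report from the third tip as a Streamlit app. The pipeline names and statuses are hard-coded here just to show the layout; in practice you would read them from your monitoring results.

import streamlit as st

# In practice, read this from your monitoring results (a table, a JSON file, ...).
PIPELINE_STATUS = {
    "Daily sales overview": "ok",
    "Marketing attribution": "failed",
    "Finance exports": "ok",
}

st.title("Data pipeline status")
for pipeline, status in PIPELINE_STATUS.items():
    if status == "ok":
        st.success(f"{pipeline}: OK")     # rendered with a green background
    else:
        st.error(f"{pipeline}: FAILED")   # rendered with a red background

Run it with streamlit run status_page.py and share the URL with your stakeholders.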
At the end of the day, it all boils down to this: understand the business impact of incidents, gather feedback from users, and focus on what matters most—taking steps to handle incidents quickly, efficiently, and with minimum stress, building up data pipeline reliability in the process.
"As a data team manager, understanding the business and aligning your team with its objectives is essential to drive impactful results. Building trust with stakeholders through clear communication, educating them on data’s role, and handling incidents transparently are key to sustaining long-term success." —David Kroupa, Head of Data, Seznam.cz