Gartner® Hype Cycle™ for Data Management 2024
Read The ReportGartner® Data Management 2024
Read The ReportTo achieve faster yet consistent replications.
In the age when data is the new oil, more than 80% of IT decision-makers delay their business decisions due to slow data processing.
Architecting your ETL pipeline with database replication can speed up your data processes.
Database replication creates an analytic database as a separate copy of your production database. This unburdens the transactional database from analytical queries while securing fresh data in the analytical database for faster time-to-insights.
Data replication can be achieved with many replication techniques. But change data capture (CDC) is one of the most popular architectural patterns.
In contrast to other techniques, such as full-table loads that copy entire tables at every replication cycle, change data capture is more ingenious. CDC copies only the rows that have changed since the last replication.
Unlike bulk load updates to sync data, CDC offers multiple benefits when replicating a data source:
Change data capture identifies the rows in source tables that have changed since the last replication.
It replicates transactional changes to data, such as new data being added to a table, existing data being altered, or past data being removed. These are the create-update-delete (CUD) operations also known in SQL by their commands: INSERT, UPDATE, DELETE.
Moreover, CDC also identifies changes in metadata, such as schema migrations (column name changes, the addition of attributes to tables, etc.), and flexibly adjusts the target database for the schema changes.
There are multiple ways of implementing CDC. The implementation details have spillover effects on the efficiency of the CDC replication.
In general, there are three ways to implement change data capture: by replicating logs, replicating transactions, or by writing a custom script to perform CDC.
Enterprise databases commit all CUD transactions to the database transaction log for recovery in case of a database failover. The transaction log is kept separate from the operations of the relational database. Log-based CDC parses the logs for instructions on how to replicate transactions from the source database to the analytical one.
Log-based CDC has multiple advantages:
Unfortunately, there are also drawbacks to log-based CDC:
Log-based CDC is the most performant implementation of change data capture. Its main challenge is the feasibility of implementation. But once the hurdle is solved, log-based change data capture is the most performant CDC for your database replication.
Trigger-based CDC uses SQL database triggers on source transactions to perform the same transactions on the destination database. For example, you can set up triggers on a table to send data to a destination table “AFTER INSERT” or “AFTER UPDATE” commands are run on the source table.
There are multiple advantages to trigger-based CDC:
However, trigger-based CDC also has drawbacks:
Trigger-based CDC can cause severe overheads. It is best deployed when only parts of the source database need to be replicated and there is an advantage exposing replicate data fast.
The third option for implementing CDC is writing your own CDC script. Custom CDC scripts use store procedures to trigger SQL queries that identify some predefined changes. Changes are usually tracked as separate columns. For example, having a timestamp column LAST_UPDATED that indicates the timestamp of the last change for a given row or a VERSION_NUMBER column that shows the consecutive version of the row update. The script then checks if the LAST_UPDATED/VERSION_NUMBER values differ between the replica and source table, and copies data from source to destination.
The advantages of custom CDC scripts are:
However, custom scripts also come with disadvantages:
Custom CDC scripts require a lot of maintenance and monitoring for little added benefits, especially in comparison to similar trigger-based CDC implementations. If specific queries are needed on a table, it is better to create a new view in the replica database after all the data has been migrated. Custom CDC is, therefore, most useful with low data volumes and for replication needs that are highly specific and limited in scope.
Keboola is the end-to-end data platform that streamlines and automates the heavy lifting behind data operations.
CDC is easy to perform with Keboola. Simply connect your source database with an extractor that collects data from your source database and saves it to a staging database. From there, deploy writers to load the data into the replica database or data warehouse of your choice.
Why is Keboola better than other CDC tools?
Try it for free. Keboola has an always-free, no-questions-asked plan. So, you can explore all the CDC power Keboola has to offer. Feel free to give it a go.