Learn what the CAP theorem is and why it matters in data engineering.
In the modern age, everything runs on the cloud. The majority of modern applications are built on cloud technologies - they rely on public cloud providers for DNS, distributed caching, and distributed data stores.
Cloud solutions are popular among engineers because of the many advantages they offer. But distributed systems are not immune to failure.
Foursquare’s 2010 outage, when an overloaded MongoDB shard took the service down for hours, is testimony that even the great and mighty experience failures in distributed systems.
The CAP theorem describes what happens to a distributed data store in the edge cases - when things go wrong.
The theorem informs us of the system design choices we have to make when choosing the right distributed database for our business problems.
Computer scientist Eric Brewer put forward the idea as a conjecture in 1998. In 2002, Seth Gilbert and Nancy Lynch published a formal proof in their paper "Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", turning the conjecture into the CAP theorem.
CAP stands for Consistency, Availability, and Partition tolerance.
The theorem states that a distributed system cannot guarantee all three properties (consistency, availability, and partition tolerance) at the same time. When things go wrong, we need to make a trade-off and choose at most two of the three characteristics to keep.
To better understand the implications behind the CAP theorem, let’s dive deeper and understand what each characteristic means.
A consistent system always returns the same information irrespective of which node we query.
Imagine the following scenario: there are multiple nodes around the globe. Clients (or users or applications) send write operations to those nodes.
For example, Anna sends money to her family account from the Philippines and her brother John withdraws money from the same family account in the Netherlands.
Both users send write operations through their financial app. Anna’s operation is a deposit write and John’s operation is a withdrawal write. The financial app uses a distributed database, so the writes are processed in different locations - once by the node in the Philippines and once by the node in the Netherlands.
When Anna and John both check their account balance to see whether the transfer succeeded, they want to see the latest state after the most recent write operations - they do not care that the nodes are spread across the globe and need to sync information between them.
A consistent system will show the same information (every read returns the result of the most recent write) irrespective of which node first processed that information.
A note of caution: do not mistake consistency in CAP for consistency in ACID. ACID consistency refers to transactions in relational databases. The ACID consistency guarantees that the data in a database system will be valid after a transaction, not universally consistent across all nodes as is the case in CAP.
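To make this concrete, here is a minimal Python sketch - a toy model, not any real database’s replication protocol - of a strongly consistent setup: a write is applied to every replica before it is acknowledged, so Anna’s deposit is visible no matter which node John reads from.

```python
class Node:
    """A single replica holding account balances."""
    def __init__(self, name):
        self.name = name
        self.balances = {}

    def read(self, account):
        return self.balances.get(account, 0)


class ConsistentCluster:
    """Toy strongly consistent cluster: a write is applied to every
    replica before it is acknowledged to the client."""
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, account, amount):
        for node in self.nodes:                  # synchronous replication
            node.balances[account] = node.balances.get(account, 0) + amount

    def read(self, account, node):
        return node.read(account)                # any node has the latest write


ph, nl = Node("Philippines"), Node("Netherlands")
cluster = ConsistentCluster([ph, nl])

cluster.write("family", 100)                     # Anna deposits via the PH node
print(cluster.read("family", nl))                # John reads via the NL node -> 100
```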
An available system gives every read or write request an appropriate (non-error) response.
High availability systems achieve responsiveness through replication. They duplicate data across multiple nodes so that more than one node is available to serve read or write requests.
Note that an available system makes no consistency guarantees - it guarantees the write and read requests will be processed, but not that the latest write will be observed.
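The toy sketch below (again an illustration, not real replication code) shows the flip side: every node answers every request immediately, but because replication happens in the background, a read can return a stale balance.

```python
# Two replicas, each modeled as a plain dict of account balances.
ph_node = {"family": 0}
nl_node = {"family": 0}
replication_backlog = []                 # writes waiting to be copied to peers


def write(node, account, amount):
    """Acknowledge the write immediately; replicate in the background later."""
    node[account] = node.get(account, 0) + amount
    replication_backlog.append((account, amount))


def read(node, account):
    """Every read gets a non-error response, even if it is stale."""
    return node.get(account, 0)


write(ph_node, "family", 100)            # Anna deposits via the PH node
print(read(nl_node, "family"))           # John reads via the NL node -> 0 (stale)
# Only once asynchronous replication catches up will the NL node return 100.
```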
Partition tolerance refers to the ability of a distributed system to keep operating normally in the face of network failures.
A network partition is the technical term for this kind of network failure - the network splits into subnetworks because some nodes lose communication with each other.
The partition tolerance guarantee promises that users sending read or write requests to the system will not be affected by network partitions. To an external observer, the system behaves as if no node had failed.
Realistically, distributed systems cannot exist without network partitions. Hence, partition tolerance is a given for any distributed system - without it, the system would not survive network faults.
What CAP states - therefore - is that system designers need to choose a trade-off between availability and partition tolerance (AP) on one hand and consistency and partition tolerance (CP) on the other in their distributed database designs.
In the case of AP systems, users (or applications or scripts) will always be able to query the database. But their read operations might return stale data, while their write operations might not be consistently recorded.
In the case of CP systems, users (or applications or scripts) will always get the latest read or write to a consistent database, but the data storage service might not be available all the time. So the programmer will need to decide what to do with read/write requests that do not get a response from a node.
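As a rough illustration - a toy model under simplifying assumptions, not a production algorithm - the sketch below simulates a partition on one node and contrasts the two choices: the AP-style read keeps answering with whatever data the node has, while the CP-style read refuses requests it cannot confirm with its peer.

```python
class Replica:
    """One node of a toy two-node cluster."""
    def __init__(self):
        self.balance = 100               # last value replicated before the fault
        self.partitioned = False         # can this node reach its peer?


def ap_read(replica):
    """AP choice: always answer, even though the value may be stale."""
    return replica.balance


def cp_read(replica):
    """CP choice: refuse to answer while the peer is unreachable,
    because the local copy might be behind."""
    if replica.partitioned:
        raise RuntimeError("node unavailable during network partition")
    return replica.balance


node = Replica()
node.partitioned = True                  # the network partition begins

print(ap_read(node))                     # -> 100, possibly stale
try:
    cp_read(node)
except RuntimeError as err:
    print(err)                           # -> node unavailable during network partition
```

Real CP systems use quorum or consensus protocols rather than a simple flag, but the externally visible behavior during a partition is the same: some requests are rejected in order to preserve consistency.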
Keep in mind that CAP is a theoretical computer science model; in practice, the trade-offs are rarely as painful as portrayed here.
In practice, the CAP theorem forces data architects and data engineers to decide on a tradeoff in distributed systems - either have an AP system that can be inconsistent or a CP system that can be unavailable.
In practice, the trade-off does not come up as often as people expect.
Distributed systems operate normally the vast majority of the time.
Errors are the exception. And even when they occur, they are usually resolved quickly.
But with the rise of big data and the increase in the velocity, volume, and variety of data in our distributed database systems, CAP makes us more conscious of the trade-offs we need to make when errors do happen. And the bigger our system is and the more replicas it has, the more likely it is that some error will occur.
This is what led to the popularization of NoSQL databases, such as Cassandra (offering AP guarantees) or MongoDB (offering CP guarantees).
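In practice the choice often shows up as configuration rather than algorithm design. Cassandra, for example, lets you tune the trade-off per query through consistency levels, and MongoDB exposes similar knobs through its write and read concerns. The sketch below uses the DataStax Python driver for Cassandra; the contact points, keyspace, and table names are made up for illustration.

```python
from cassandra import ConsistencyLevel          # pip install cassandra-driver
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact points and keyspace, for illustration only.
session = Cluster(["10.0.0.1", "10.0.0.2"]).connect("bank")

# Leaning AP: acknowledge the write as soon as a single replica has it.
session.execute(SimpleStatement(
    "INSERT INTO accounts (id, balance) VALUES (42, 500)",
    consistency_level=ConsistencyLevel.ONE,
))

# Leaning CP: require a majority of replicas to agree before answering.
row = session.execute(SimpleStatement(
    "SELECT balance FROM accounts WHERE id = 42",
    consistency_level=ConsistencyLevel.QUORUM,
)).one()
```

Requiring a QUORUM pushes the system toward consistency at the cost of availability during a partition; ConsistencyLevel.ONE does the opposite.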
Which one should you choose? That really depends on your ultimate business goals.
Would you lose more revenue if your system was unavailable to process requests or if it returned inaccurate responses from time to time?
The trade-off the CAP theorem introduces is ultimately a business one - between an always-accurate service and an always-available service.
The CAP theorem introduces just one of many data engineering trade-offs you will have to make when designing your data pipelines.
Start your journey towards becoming a (better) data engineer with Keboola’s Data Engineer Certificate, and develop a competitive engineering skillset that will help you stand out.