The CAP theorem is great, though it does omit latency. That leads to its logical extension, PACELC [1]:
If there is a network Partition you must choose Availability or Consistency, Else the tradeoff is Latency vs Consistency.
This offers practical intuition for certain design choices.
For example, Google Spanner [2] is a distributed DB that offers globally consistent reads and writes. To achieve high performance (low latency), nodes must synchronize with multiple extremely accurate reference clocks (GPS and atomic) and follow a complex two-phase commit protocol that ensures transactional linearizability using the lower bounds and uncertainty of timestamps.
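The commit-wait idea behind Spanner's external consistency can be sketched roughly like this (a toy model, not Spanner's actual implementation; EPSILON and the interval function are stand-ins for the TrueTime API's uncertainty bound):

```python
import time

# Hypothetical clock-uncertainty bound (Spanner's TrueTime keeps this
# to a few milliseconds using GPS and atomic reference clocks).
EPSILON = 0.005  # seconds

def now_interval():
    """Return (earliest, latest) bounds on the true time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_ts):
    """Block until the true time is definitely past commit_ts.

    After this returns, any node that takes a fresh timestamp will see
    a value greater than commit_ts, so timestamp order matches
    real-time order."""
    while now_interval()[0] <= commit_ts:
        time.sleep(0.001)

earliest, latest = now_interval()
commit_ts = latest          # assign the commit the latest possible time
commit_wait(commit_ts)      # wait out the uncertainty before replying
assert now_interval()[0] > commit_ts
```

The cost is visible in the sketch: every commit waits out the clock uncertainty, which is exactly the latency side of the tradeoff.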
As another example, your multicore CPU leverages a cache coherency protocol that faces remarkably similar tradeoffs. Perhaps others have made this connection before…it does feel like some sort of universal law of physics.
[1] https://en.m.wikipedia.org/wiki/PACELC_theorem
[2] https://static.googleusercontent.com/media/research.google.c...
Is a Partition meaningfully distinguishable from especially high Latency?
CAP doesn't omit latency. Availability essentially is latency.
PACELC just says that even during normal operation you still need to make a tradeoff between consistency and latency, whereas CAP only deals with the tradeoff during a partition event.
The A in CAP doesn't mean what people think it means; it has nothing to do with nodes being up, or crashes, or latency, or SLAs, or any abnormal dysfunction.
Availability in CAP is purely a software decision; it's an algorithmic property. Roughly speaking, it is the decision to continue accepting requests during a partition, possibly sacrificing Consistency. If you refuse requests during a partition, you preserve Consistency and lose Availability. If you keep accepting requests during a partition, it's the opposite.
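That algorithmic choice can be sketched in a few lines (a toy model; the mode names and the failure detector are assumptions for illustration):

```python
class KVNode:
    """During a partition, a CP node refuses requests while an AP node
    keeps serving possibly stale data. Purely a software decision."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.store = {}
        self.partitioned = False  # set by some failure detector (assumed)

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Sacrifice Availability: refuse rather than risk a stale read.
            raise RuntimeError("unavailable during partition")
        # AP: answer anyway, possibly sacrificing Consistency.
        return self.store.get(key)

node = KVNode("AP")
node.store["x"] = "v0"
node.partitioned = True
print(node.read("x"))  # the AP node replies even while partitioned
```

Nothing crashed and no hardware failed in either mode; the only difference is what the code decides to do while cut off.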
High latency is considered a partition in CAP. A hardware error, network trouble, a bug, a crash, an unreachable machine: none of these is ever an Availability issue.
Availability's definition:
> every request received by a non-failing node in the system must result in a response
I don't believe that encodes latency
Instead, here's consistency:
> any read operation that begins after a write operation completes must return that value, or the result of a later write operation
"after a write operation completes" feels like where latency kicks in? Because within that space you can play around with completing a write operation to get what you want.
It's impossible to distinguish between high latency and unavailability. You can model unavailability as infinite latency.
In a real system you can wait for a partition event to resolve itself before replying which preserves consistency at the cost of latency (availability).
> In a real system you can wait for a partition event to resolve itself before replying which preserves consistency at the cost of latency (availability).
This is true, but PACELC states that even if there is no partition you still have a consistency vs latency tradeoff (because the processes that guarantee consistency eat up latency in form of network roundtrips)
100%, I don't think what we're saying conflicts. There is _always_ a tradeoff between consistency and availability whether you're thinking of PAC or PACELC. PACELC just makes it explicit that this tradeoff exists whether you're partitioned or not.
One thing that I've always wondered about: Is the CAP theorem really making a non-trivial claim?
If a distributed system is consistent and available, it must return the last written value, and successfully do this all the time. How can it do this if the system is partitioned, and the node receiving the last write was unable to propagate this to other nodes?
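That impossibility argument can be replayed as a toy simulation (names and structure are illustrative, following the article's two-server setup):

```python
class Server:
    def __init__(self):
        self.value = "v0"
        self.peers = []
        self.reachable = True

    def write(self, value):
        self.value = value
        for p in self.peers:
            if p.reachable:      # replication fails across the partition
                p.value = value

g1, g2 = Server(), Server()
g1.peers, g2.peers = [g2], [g1]

g2.reachable = False   # network partition between G1 and G2
g1.write("v1")         # G1 accepts the write (staying available)
print(g2.value)        # G2 still serves "v0": a stale, inconsistent read
```

To stay consistent instead, G1 would have to refuse the write or block until the partition heals, which is exactly the loss of availability.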
The proof described in this website appears to confirm this. Am I missing something?
You're not missing something. In the original paper the proof is half a page long (and still worth a read, very low investment, decent payoff). I think the value of the CAP theorem is not so much in its statement (CP or AP, choose one), but rather in its proof. It helps you cook up similar arguments/reasonings for more interesting situations and constraints.
I prefer the walkthrough of the CAP theorem in “Designing Data Intensive Applications,” which says that the CAP “theorem” is an undefined and generally unhelpful concept in distributed systems. “Available” has no formal definition.
And it’s confusing that it’s three letters, because it’s not “choose two”. It’s not that a system is “partition tolerant,” it’s if there’s a network partition, you choose availability or consistency. And obviously you choose availability for the distributed systems you most commonly encounter.
I also like how he says that the theorem was good for its time but is not that relevant now.
Maybe PACELC is a better theorem, maybe not, but I guess a three-letter acronym will always beat a six-letter one.
The text left me wanting a more formal treatment of the theorem. I went ahead and found a paper which does just that:
https://dl.acm.org/doi/10.1145/564585.564601 PDF: https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.p...
I was always wondering the same. Thank you for the references. Above I asked for a mechanized proof.
The "proof" is kind of weird. We assume there exists a system that has all three of CAP, but how can we assume that system has the layout in the post with two servers and one client?
The proof is not really formal, but you could view the shown system as a minimal subset of an arbitrarily shaped large system.
For the system to be distributed it must have at least two nodes, and to be available all nodes must respond to requests. So however the rest of the system is shaped, the proof still holds.
The minimal subset is too small. It should have a second client, to avoid solutions like "the client stores a copy of all the messages it sent, so it can detect the stale value".
what if the graph remains connected when you remove that link, so the two nodes can communicate through other nodes?
Partitioning is defined so that all, or at least arbitrarily many, of the communication links may be down.
In practice you're absolutely right and one approach to distributed systems is to make partition tolerance less important by having lots of redundant links instead.
This is trivial, but the problem is that consistency especially is not a binary choice. Heck, even non-distributed systems can give you varying consistency guarantees; that's the whole point of isolation levels in most RDBMSs.
It's good to visualize that there is a trade-off to be made. However, that trade-off does not have to be binary. You can get a bit more consistency for a little less partition tolerance or availability. All of Designing Data-Intensive Applications is about those trade-offs.
Consistency in CAP and Consistency in ACID have entirely separate meanings.
The C in ACID has a different definition of Consistency, yet the combination of guarantees given by ACID should also imply Consistency in the distributed system sense, right? I.e. a distributed database that claims to be ACID cannot sacrifice consistency [edit: at least for some isolation levels].
Isolation (the I in ACID) is more closely related to the notion of consistency in the distributed system community.
Consistency in the case of CAP refers to linearizability.
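As a toy illustration of that definition, here is a rough check that flags a read as non-linearizable because it began after a newer write had already completed (a single-register sketch; a real linearizability checker must also handle concurrent operations, which this ignores):

```python
# A history is a list of (op, value, start, end) tuples with
# wall-clock start/end times.

def stale_reads(history):
    """Flag reads that began after a later write completed but still
    returned an older value (toy single-register check)."""
    violations = []
    writes = sorted((op for op in history if op[0] == "write"),
                    key=lambda w: w[3])        # by completion time
    for op in history:
        if op[0] != "read":
            continue
        _, returned, start, _ = op
        completed = [w for w in writes if w[3] < start]
        if completed and returned != completed[-1][1]:
            violations.append(op)             # missed a completed write
    return violations

history = [
    ("write", "v1", 0.0, 1.0),
    ("read",  "v0", 2.0, 3.0),   # starts after the write completed
]
print(stale_reads(history))      # flags the stale read
```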
How about this:
1. A read is returned with a map of the network the replying node has become consistent with.
2. A client that is not satisfied with the map can read again from the other partition.
These two steps get around the proof by changing one of its assumptions (that the operation to read a value can only involve a request to one node).
You’re certainly allowed to make the client a more active participant in your consensus protocol, but then it needs to play by the same rules if you want the system to have guarantees. For example, you need to handle network partitions between clients and some servers, and you need to be able to reconcile multiple reads from servers that might have seen different sets of writes. The CAP theorem still applies to the system as a whole.
Do you have a plan for doing writes in that scenario? It's not proper availability without writes.
If the partitions are configured for consistency, and they can't hear from each other, then at most one will accept writes. If the client relays metadata between the partitions to make the write happen in both, then you don't actually have a partition anymore.
Also in practical terms, the nodes almost always have better contact with each other than the client has to the nodes. The situation where the client can connect to both sides of a partition is unlikely enough that if you add code for it you're probably getting negative value. It wastes time and can add scary bugs.
Consider the following sequence* of events:
1. client A reads from partition X
2. client A is unsatisfied with the network map, and requests another read from partition Y
3. (meanwhile) client B writes a value to partition X
4. client A reads from partition Y, sees the value is the same as partition X's (stale) value, and accepts the value as consistent
This is the same kind of behavior you might get if your servers used e.g. a buggy version of Raft or something. You can't get around the proof by just relabeling some of your server nodes as client nodes.
* In the spirit of distributed systems, I use this term loosely :)
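That interleaving can be replayed as a toy script (all names illustrative; the concurrent steps are compressed into one sequential order in which B's write lands between A's two reads):

```python
partition_x = {"value": "v0"}
partition_y = {"value": "v0"}   # both sides start with the same stale value
accepted = None

# 1. client A reads from partition X
a_first_read = partition_x["value"]

# 3. (meanwhile) client B writes a value to partition X
partition_x["value"] = "v1"

# 2./4. client A's cross-check against partition Y
a_second_read = partition_y["value"]
if a_first_read == a_second_read:
    accepted = a_first_read     # A wrongly accepts "v0" as consistent

print(accepted)                 # even though X already holds "v1"
```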
The value is now stale, but it was correct at some point between the read starting and the read finishing, right?
That can happen with any read. Even without any partitions. Can't it?
In that case I don't see the problem.
> The value is now stale, but it was correct at some point between the read starting and the read finishing, right?
Not necessarily. Maybe both versions of it were from partial writes that were never committed, so your invariants are violated (if we're talking about e.g. a credit account A and debit account B scenario).
> That can happen with any read. Even without any partitions. Can't it?
Depends on your isolation level. If your system has serializable transactions then it's supposed to give you a history equivalent to one where all transactions were executed serially, for example.
> Not necessarily. Maybe both versions of it were from partial writes that were never committed, so your invariants are violated (if we're talking about e.g. a credit account A and debit account B scenario).
I'm pretty sure the scenario above is looking at committed writes.
If you're reading uncommitted writes, you're not really in the market for consistency to begin with. (Or you could be handling consistency by waiting to see if the transaction succeeds or fails, making sure it would fail if the data you read got backed out. But in that situation nothing goes wrong here.)
> Depends on your isolation level. If your system has serializable transactions then it's supposed to give you a history equivalent to one where all transactions were executed serially, for example.
Even then, a new write can happen before your "read finished" packet arrives at the client, making the read stale. Your entire transaction is now doomed to fail, but you won't know until you try to start committing it.
For pure read operations, I'm not convinced it's a proper stale read unless the value was stale before the read operation started.
> I'm pretty sure the scenario above is looking at committed writes.
> If you're reading uncommitted writes, you're not really in the market for consistency to being with.
But what does "committed" mean when you're only reading from one partition in a partitioned scenario? You literally can't tell whether what you're reading is committed or not (or rather, you have to build your own protocol for when a write is considered committed).
> For pure read operations, I'm not convinced it's a proper stale read unless the value was stale before the read operation started.
I think you can get a read that is half from a prior stale operation and half from a subsequent uncommitted operation, or something on those lines.
> But what does "committed" mean when you're only reading from one partition in a partitioned scenario?
I would say you can't make new commits in that situation? I don't know, I didn't make up the scenario, I think you need to figure out your own answer and/or get clarification from fallingsquirrel if you want to talk about that kind of problem.
> I think you can get a read that is half from a prior stale operation and half from a subsequent uncommitted operation, or something on those lines.
What's the full timeline for that?
If you're specifically talking about the ABA problem, that's trivial to fix with a generation counter.
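A minimal sketch of that generation-counter fix (illustrative, not any particular library's API):

```python
class Register:
    """Each write bumps a counter, so 'same value, different
    generation' is detectable even after v0 -> v1 -> v0 (ABA)."""

    def __init__(self, value):
        self.value = value
        self.generation = 0

    def write(self, value):
        self.value = value
        self.generation += 1

    def read(self):
        return self.value, self.generation

r = Register("v0")
snapshot = r.read()              # ("v0", 0)
r.write("v1")
r.write("v0")                    # value is back to "v0"...
assert r.read()[0] == snapshot[0]   # value alone looks unchanged
assert r.read() != snapshot         # ...but the generation exposes it
```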
How can it know what it is consistent with? Anything could have happened since the last update. All it knows is that it most recently got an update at a certain time. It may or may not be consistent.
So, it's impossible to transfer a value from one machine to another if there's no network connection between them? How did this extremely trivial observation become a well-known theorem?
So, when it was originally phrased, the primary thing that you would have learned about databases was that they enable ACID transactions. (And if you were a practitioner you would have learned about the various isolation levels and dirty reads and dirty writes.)
But if you wanted to go from this level to implementation, typically you could get a prototype working, but it would be slow. When things are slow, the WD-40 of the programming world is to add a cache. And this is where you get the quip that there are only two hard problems in computing, cache invalidation and naming things. (The “and off-by-one errors” is a later addition.) The problem is that cache invalidation shows up as consistency bugs in your database. Someone does a rare combination of commits and rollbacks and some cache doesn't get wiped quite as fast as it needs to, or is wiped overoptimistically causing a pull from uncommitted data, and your isolation level has dropped to READ UNCOMMITTED.
The CAP theorem was originally raised as a conjecture, something like, “once you shard the database, I don't think there's any way to solve these cache problems without one of the replicas just going silent for arbitrarily long pauses while it tries to at least partially synchronize its cache with the other shards.” Phrased that way, you can understand why it was a conjecture, it relies on nobody at MIT having a super clever way to deal with caches and cleverly routing the synchronization around the sharding.
BUT, you can make that statement for many different reasons, and this was not for Pedagogical Reasons, its point was rather Evangelism! The author was attempting to introduce the idea of Eventual Consistency, and gain adoption by ditching all of the wisdom about ACID transactions. This antagonism was deliberate, eventual consistency became the E in a rival acronym BASE. And so the argument was that we could explore a new corner of design-space.
It was later that someone decided they could prove it by coming up with a universal subgraph, “whatever connection you've got, it has to contain this: two nodes, fighting over one value, with a network connection possibly passing through other nodes but we can abstract all that away.” And then you have a proof, and then you have a bunch of people comparing the proof to the stated claims of various database vendors, and finding that over and over they claim to be both shardable with high availability among the shards, and to support ACID transactions that keep everything consistent. It turns out those statements are usually made assuming a happy path!
(You also get Paxos and Raft, “here is how to get consistency without arbitrary latency on two-phase commit via majority vote”, and the Jepsen blog “you said this had consistency level X, let’s fuzz it and see if we can generate a counterexample”, and some interesting exceptions like Datomic saying “this one part is not scalable and it's a single point of failure to sacrifice P for the CAP theorem’s sake, but in exchange we can simplify our C and A guarantees so that you can scale the reads of the system consistently.”)
Because, sadly, a lot of system designers want to believe they have a way around it.
Related:
An Illustrated Proof of the CAP Theorem - https://news.ycombinator.com/item?id=17528817 - July 2018 (71 comments)
Is availability sacrificed more often than the other two?
The only way to sacrifice P is to have a single-node (i.e. non-replicated) system. So in practice it's a choice between C and A.
It depends purely on the application you're writing. Many applications prefer availability over consistency, but the application is built to cope with inconsistent data.
Properties are only sacrificed during partitions. Due to Murphy's Law, if you assume partitions never happen, they will happen more often; but if you can tolerate partitions, they happen really rarely.
That is not true with PACELC
There are plenty of systems that sacrifice consistency even while the network is fully connected, in the name of performance—for example, DNS, or any system with a caching proxy server.
Yeah, CAP is about the best possible behavior a system can have but you can always do worse.
Not really. Both kinds of systems are all over, because they address vastly different (literally incompatible) needs.
There is a fair bit of industry hype in the past decade or so around eventually consistent systems, because a lot of components of a business can use them, and that opens up a lot of extremely desirable performance options that can be fine tuned to the moon. But they're a real nightmare to use (i.e. often literally impossible) for anything that needs to be correct, so they are often just used in addition to the more classical consistency-oriented databases like postgres.
The problem with the example is it never actually visualises a partition, because the client can always reach both servers.
An actual partition is when clients and servers are partitioned and can't talk to each other.
Solving for the situation presented is fairly trivial, and is pretty much exactly what blockchain systems solve.
Thank you! That makes a lot more sense than several other descriptions of the proof I have read (and promptly forgotten!)
I always think of capital asset pricing theorem first.
Is there a formalization of this proof? Lean, Coq, .... ?
>"G2 returns v0 to our client after the client had already written v1 to G1. This is inconsistent."
This is absolutely true IF AND ONLY IF client and servers (and data) have no notion of time...
If client and servers are, say, synchronized to a central clock, include a timestamp with any data update, and include timestamps with any response/"done" messages, and if both the client and the servers check those timestamps for accuracy, then the client could check any servers it later connects to, to see whether they had been updated appropriately or not.
If later-connected-to-servers that should have been updated were not appropriately updated, then the client (if client and server are exchanging data and timestamp information) could decide how to proceed at that point.
If a later-connected-to-server was determined to be inconsistent, then depending upon how the client code was written, it could do many things to mitigate the problem. It could notify other servers that it knows about of data inconsistency on the inconsistent server(s), for example...
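A rough sketch of that client-side timestamp check (all names are hypothetical, and clocks are assumed synchronized to a central source as described above):

```python
import time

class Client:
    def __init__(self):
        self.last_write = {}  # key -> timestamp of last confirmed write

    def write(self, server, key, value):
        ts = time.time()      # assumes clocks are centrally synchronized
        server[key] = (value, ts)
        self.last_write[key] = ts

    def read(self, server, key):
        value, ts = server.get(key, (None, 0.0))
        if ts < self.last_write.get(key, 0.0):
            # The server missed an update the client knows about.
            raise RuntimeError("stale replica detected")
        return value

primary, replica = {}, {}
c = Client()
c.write(primary, "k", "v1")   # the replica never receives this update
print(c.read(primary, "k"))   # fine: the primary is up to date
try:
    c.read(replica, "k")
except RuntimeError as e:
    print(e)                  # the client detects the stale replica
```

Note this only lets the client detect inconsistency after the fact; it doesn't make the system consistent and available at the same time.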
Hmm, now that I think about it, on any distributed database, for any given row of data, there should not be one, but TWO timestamp fields, one for the time that the data was updated on the first server it was updated on, i.e., the one the client connected to.
The second timestamp field would be for the time that a secondary given distributed server received the data and processed it...
Those two fields could be separated by mere nanoseconds of time (if distributed servers are tightly coupled), but it could be up to weeks if a secondary server was knocked offline for a long enough time period...
But see, if the client software is engineered properly, and contains the client's history, then the client (or main server, or server-multiplexing proxy) can check the veracity of any arbitrary distributed server it connects to by asking it to "read me back this data", getting its timestamps, and comparing them with the local copies...
All of that coupled with databases that do not delete/overwrite old records when data is updated, but rather keep a log of all updates with a timestamp, aka "append-only" aka "immutable" database, such as InfluxDB, Apache Cassandra, CouchDB, SQL Server Temporal Tables, (and I think that RethinkDB may have done something like that in the past) should equate to, at the very least, the client being able to know if a given server in a distributed system was updated properly (is consistent), or not...
If it were engineered properly, the client itself could determine what to do under such anomalous conditions...
It could even take it upon itself to update the server properly IF it contained a copy of the correct most up-to-date data AND had the authority to do so... which wouldn't be the case for some types of applications (i.e., banking, due to necessary security constraints), but might be the case for other types of applications (i.e., a distributed social media app, where the client is posting to the user's own account and has total permission to update any of the user's data as needed...)
Maybe a better question to ask when dealing with the CAP theorem is not so much "is it true", so much as "what kind of distributed database requires consistency such that the CAP theorem is required?" (i.e., banking), and "what kind of distributed database doesn't require consistency such that the CAP theorem isn't required?" (i.e., distributed social media over a plethora of distributed servers)...
If we see that CAP is required in only some specific types of database/application contexts, then perhaps those specific databases/applications/distributed systems should be individually separated from non-CAP-requiring ones...
This being said, I think Michael Stonebraker is a brilliant Computer Scientist, and I have tremendous respect for him and the CAP Theorem...
What if the client just reads from every server and compares the timestamps?
It is possible but difficult at scale.
https://dl.acm.org/doi/pdf/10.1145/112600.112601
https://cloud.google.com/spanner/docs/true-time-external-con...
Reading from every server is not necessarily possible, unless we assume no network partitions.
Unfortunately, timestamps are not trustworthy. The system clock can be miscalibrated or even deliberately manipulated. It is, however, something of an open question. CAP is still a bit lacking in terms of rigorous mathematical proof (albeit the arguments are pretty convincing).
It violates the consistency rule: the read must return the value that was written.