March 26th, 2020
In previous articles, we explained what consistency is, the difference between “strong” and “eventual” consistency, and why this distinction is more important than ever to modern application developers. We also introduced the notion of ‘consistency tax’: the extra time and effort that a development team needs to invest if they choose a system with only eventual consistency or limited consistency guarantees.
Several modern databases use state-of-the-art algorithms to eliminate the tradeoff between consistency and performance. Of course, we would not want you to take our word for it without a proper explanation. Therefore, in this final article, we dive into the technical details behind some of these databases. Typically, the only source of information for these technical details are research papers, so the point of this article is to explain these systems in simpler terms. Because these systems are far more complex in reality, we’ll provide the links in the text in case you want to know more and love to read research papers.
In parts 1 and 2 of this article series, we explained how distributed databases use different replicas to spread the load and/or serve users in different regions. To summarize here, for new readers, a replica is just a duplication of your data. And this duplication can live either in the same location for redundancy, or in another location to offer lower latencies to users in those locations. Having multiple replicas that can handle both reads and writes has a strong advantage, because the database becomes scalable and can offer lower latency to all your users, no matter where they are. However, you do not want each of the replicas to have their own interpretation of the data. Instead of small data differences between each replica, you want one unique interpretation of the data, which is often referred to as a single source of truth. In order to achieve that, you need to have some sort of agreement on data changes. We need a consensus.
Waiting for consensus
Every distributed database that aims to be consistent has multiple replicas that have to agree on the outcome of transactions. If conflicting data updates happen these replicas have to agree which update goes through and which doesn’t. This is called “consensus.”
Let’s go back to our game to exemplify why we need consensus. Imagine that the player of our game only has 3 gold pieces left, but tries to simultaneously buy two different items from two different shops for a total budget larger than the remaining 3 gold pieces. This involves two transactions, one for each item/shop, which we denote as t1 and t2. And let’s pretend that the owners of the shops are across the globe from each other, so the transactions take place on two different replicas. If both of the transactions are accepted the user would be able to buy more than he can afford. How do we prevent the user from overspending?
We know that these replicas need to communicate in order to agree on the final outcome of the two transactions. What we don’t know is how much communication they need. How many messages have to go back and forth between replica 1 and replica 2 in order to agree which transaction gets priority and which one gets cancelled?
As replicas in a distributed database are meant to serve users from different regions in the world with low latency, they are far apart by nature. By placing duplicates of the data closer to the end users, these users can read with lower latencies. However, when writes happen, the replicas need to send messages to each other to update all duplicated data uniformly–and these messages can take several 10s of milliseconds because they’re bridled by the speed of light as they travel across the globe. It’s clear that we need to keep the number of cross-data center messages as small as possible so that the end user isn’t left waiting around for these replicas across the globe to come to consensus.
For a long time, it had been thought to be impossible or impractical to do this. But today, several technologies exist to keep the number of round-trips low and bring latency within normal bounds.
The distance between New York and Paris is 5,839 km. For light to travel from New York to Paris and then back again would take 40 milliseconds.
If it takes a minimum of 40 milliseconds to travel between New York and Paris, a round-trip would take at least 80ms. The most important question that remains is: “How many round-trips do we need to execute transactions?” The answer to this question depends largely on the algorithms that are used.
How to reach agreement?
It appears that in order to achieve consensus about something, you need at least four hops (or two rounds of communication): one round to let each replica know that you are about to do something, then a second round to actually execute the action once everyone agrees that this action can be executed. This is something called distributed two-phase commit which is used by almost any distributed database. Let’s look at an analogy. Imagine you have to agree with a group of people on a good date for a party. It might go like this:
First, Polly asks everyone if they can make it to a party on Monday; she now knows that everyone can actually come to the party. Next, she needs to let everyone know that the party will indeed be on Monday, and people acknowledge that they will be there.
These are very similar to the two phases in two-phase commit. Of course, databases don’t party so the phases have different functions. In the case of a distributed system, the phases are called:
- Prepare or request to commit: make sure that everyone knows about the transaction. In this phase, replicas in a distributed database store the query in some kind of todo list (a transaction log) on the disk to make sure they still know what to do if the server goes down.
- Commit: actually calculate the results and store them
Of course, as always, it’s never that simple. There are many flavors of such algorithms. For example, there are improvements of two-phase commits called Paxos and Raft and even many variants of these (multi paxos/fast paxos/…). These alternatives aim to improve issues of availability or performance. To understand the availability issues, simply imagine that Polly falls sick or Amber’s phone dies. In the former case, she would be unable to continue her work as party coordinator and in the latter case, it would temporarily be impossible for Polly to know whether Amber agrees on the party date. Raft and Paxos improve on this by only requiring the majority to answer and/or selecting a new coordinator automatically when the leader or coordinator goes down. A good animation that shows how Raft works can be found here.
Agree about what?
Can we conclude that each distributed database then requires 2 round trips to write/read data? No, the reality is more complex than that. On one side, there are many possible optimizations and on the other side, there might be multiple things we need to agree on.
- Agree on the time of a transaction
- Agree whether reads can be executed
- Agree whether reads can be executed
The simplest example that has multiple two-phase commit rounds is probably Cassandra’s light-weight transactions. They first require consensus agreements on reads and then consensus on writes. If each message takes 40ms to travel, this means the entire transaction requires 320ms or longer–depending on the required “locks” as we’ll explain later.
This is fairly easy to understand, but there are some issues with the implementation since Cassandra was never designed to be strongly consistent. Does that mean that strongly consistent databases are even slower? Not at all! Modern distributed databases use a mix of interesting features to achieve better performance.
Waiting for locks
Not only do we need to wait for messages to come to an agreement, but almost every distributed database will also use “locks”. Locks guarantee that the data about to be altered by a transaction is not being simultaneously altered by another transaction. When data is locked, it can’t be altered by other transactions, which means that these transactions have to wait. The duration of such a lock, therefore, has a big impact on performance. Again, this performance impact depends on the algorithm and optimizations that were implemented by the database. Some databases hold locks longer than others and some databases do not use locks at all.
Now that we know enough basics, let’s dive into the algorithms.
Modern Algorithms for Consensus
We now know that consensus and locks are the main bottlenecks that we need to optimize. So let’s go back to the main question of this article: “How does new technology lower these latencies within acceptable bounds?” Let’s start off with the first of these modern algorithms, which sparked interesting ideas for the rest of the database world.
2010 – Percolator
Percolator is an internal system built upon BigTable (one of the early NoSQL databases built by Google) that Google used to make incremental updates to their search index’s page crawling speed. The first paper on Percolator was released in 2010, inspiring the first distributed database inspired by it: FoundationDB in 2013. FoundationDB then got acquired by Apple to finally release a stable version in 2019, together with the release of a FoundationDB paper.
Although Percolator allowed Google to speed up page crawling significantly, it was not originally built as a general-purpose database. It was rather intended to be a fast and scalable incremental processing engine to support Google’s search index. Since the search index had to be scalable, many calculations had to happen on many machines concurrently, which required a distributed database. As we learned in the previous articles, programming against distributed systems that store data can be very complex, and traditionally required that developers pay a ‘consistency tax’ to program around unpredictable database behavior. To avoid paying so high a consistency tax, Google adopted a strong consistency model when they built Percolator.
The consistency model of Percolator could not exist without two key ingredients: versioning, and the Timestamp Oracle
Ingredient 1: Versioning
As we mentioned in previous articles, strong consistency requires us to agree on a global order for our transactions. Versioning is one of the elements that will be crucial to many of these algorithms since it can be used for failure recovery, to help replicate data, and to support a consistency model called ‘snapshot isolation’.
Versioning helps in failure recovery when a node fails or gets disconnected. When the node comes back online, thanks to the versions, it can easily restore its state by starting at the last snapshot that it was able to save, and then replaying the transactions based on the versions in another node. All it has to do is ask another node: “Hey, what has changed since I was gone?” Without versioning, it would have to copy over all the data, which would have put a huge strain on the system.
Failure recovery is great, but the strongest advantage lies in the fact that such a versioning system can be used to implement a strong consistency model. If the versioning system keeps versions for each data change, we can actually go back in time and do queries against an earlier version of our data.
Some bright minds found out that this historical querying capability could be used to provide a consistency model called ‘snapshot consistency’. The idea of snapshot consistency is to pick a version of the data at the beginning of the query, work with that version of the data during the rest of the query, then write a new version at the end of the query.
There is one possible pitfall here: during the execution of such a query, another query could be writing data that conflicts with the first query. For example, if two write queries start with the same snapshot of a bank account with $1000 on it, they could both spend the money since they do not see the writes of the other query. To prevent that, an additional transaction will take place to see if the snapshot’s values changed before either query writes a result. If something conflicting did happen to change the snapshot’s value, the transaction is rolled back and has to be restarted.
However, there is still one problem Percolator needs to solve. Clocks on different machines can easily drift apart a few 100s of milliseconds. If data for a query is split over multiple machines such as in our initial example, you can’t simply ask both machines to give you data at a certain timestamp since they have a slightly different idea of what the current time is. It’s a matter of milliseconds, but when many transactions have to be processed, a few milliseconds are all it takes to go from correct data to faulty data.
Time synchronization brings us to the second Percolator ingredient.
Ingredient 2: The Timestamp Oracle
Percolator’s solution to the time synchronization problem is something called the Timestamp Oracle. Instead of letting each node dictate its own time (which was not accurate enough), Percolator uses a central system that exposes an API providing you with a timestamp. The node on which this system lives is the Timestamp Oracle. When we keep multiple versions of our data, we need at least two timestamps for each query. First, we need a timestamp to query a snapshot, which we will use to read data. Then, at the end of the transaction when we are ready to write, we need a second timestamp to tag the new data version. As a result, Percolator has the disadvantage that it needs at least two calls to the Timestamp Oracle, which introduces even more latency if the Oracle is in another region from the nodes where the calls originated. When Google came up with their Distributed Database Spanner, they solved this problem.
2012 – Spanner
Spanner was the first globally distributed database to offer strong consistency, which essentially means that you get low latency reads without having to worry about potential database errors anymore. Developers no longer need to invest extra work to circumvent potential bugs caused by eventual consistency. The paper was released in 2012 and it was released to the general public in 2017 as Spanner Cloud.
Ingredient 1: Versioning
Google built Spanner after their experience with Percolator. Since Percolator’s versioning system proved to work, they kept this in Spanner’s design. This versioning system provided the ability to do very fast reads (snapshot reads) if you were willing to give up consistency. In that case, you could run queries and give Spanner a maximum age of the results. For example: “Please return my current inventory as fast as possible, but the data can only be 15 seconds old”. Basically, instead of abandoning consistency, you could now choose for each query which consistency level suited your use-case.
Ingredient 2: TrueTime
To eliminate the extra overhead to synchronize time between machines, Spanner abandoned the Timestamp Oracle in favor of a new concept called TrueTime. Instead of having one central system that provides a unified view of time, TrueTime tries to reduce the clock drift between the machines themselves. Engineers at Google managed to limit local clock drift by implementing a time synchronization protocol based on GPS and atomic clocks. This synchronization algorithm allowed them to limit clock drift within a boundary of 7ms, but required specific hardware that consisted of a combination of GPS and Atomic clock technology.
Of course, there is still a potential clock drift of 7ms, which means that two servers could still interpret a timestamp to be two different snapshots. This is solved by the third ingredient for Spanner: commit-wait.
Ingredient 3: Commit-wait
In fact, the TrueTime API does not return one timestamp but returns and interval n which it is sure that the current timestamp should lie. Once it is ready to commit, it will just wait a few milliseconds to cope with the potential drift which is called ‘Commit-wait’. This makes sure that the timestamp that will be assigned to the write is a timestamp that has passed on all nodes. It’s also the reason that running Spanner on commodity hardware can not deliver the same guarantee since the wait period would need to be a few 100s of milliseconds.
2012 – Calvin
The first paper on the Calvin algorithm was released in 2012, from research at Yale. Just like the previous approaches, Calvin consists of several ingredients. Although versioning is also part of it, the rest of the approach is radically different which requires a few extra ingredients to work: deterministic calculations, and the separation of ordering from locking. These are ingredients that are typically not found in databases with traditional architecture. By changing the architecture and accepting that queries have to be deterministic, Calvin can reduce the worst-case number of cross- datacenter messages to two. This pushes down the worst-case latency of global transactions significantly and brings it below 200ms or theoretically even below 100ms. Of course, in order to believe that this is possible, you might want to know how it works first, so let’s take a look at the algorithm.
Ingredient 1: Versioning
Similar to Percolator and Spanner, Calvin relies on versioned data. These snapshots in Calvin are mainly used to ensure fault-tolerance. Each node stores different snapshots which can be considered as checkpoints. A disconnected node that comes back online only needs to grab the timestamp of the last checkpoint it has witnessed, and then ask another node to inform him of all the transactions that came after that checkpoint.
Ingredient 2: Deterministic calculations
Similar to Elm, Calvin has given up something to speed up the calculations. In the case of Calvin, we can basically say that the result of a transaction will be the same, whether it’s executed on machine A or Machine B. This might seem evident, but typically databases do not guarantee this. Remember that SQL allows you to use the current time or allows something called interactive transactions where user input can be inserted in the middle of a transaction, both of which could violate the guarantees provided by Calvin.
To achieve deterministic calculations, Calvin (1) needs to take out calculations such as current time and pre-calculate them, and (2) does not allow interactive transactions. Interactive transactions are transactions where a user starts a transaction, reads some data, provides some additional user input in the middle, and then finally does some extra calculations and possibly some writes. Since the user is not predictable, such a transaction is not deterministic. In essence, Calvin trades in a minor convenience (interactive transactions) for great performance.
Ingredient 3: Separate the problem of ordering.
Databases spend a lot of time negotiating locks in order to make it look like the system is executing in a specific order”. If an order is all you need, maybe we can separate the problem of locking from the problem of ordering. This means though that your transactions have to be pure.
— Kyle Kingsbury
Separating the concern of ordering transactions from the actual execution has been considered many times in the database world but without much success. However, when your transactions are deterministic, separating the ordering from the calculations actually becomes feasible. In fact, the combination of deterministic calculations and the separation of ordering from the rest of the algorithm is extremely powerful since it helps to reduce lock duration and greatly diminishes the slower communication between distant nodes (cross-datacenter communication).
Shorter lock duration
Whenever locks are held on a piece of data, it means that other queries that use that data have to wait. Therefore, shorter locking results in better performance. Below is an image that shows an overview of the locking procedure in Calvin compared to how a traditional distributed database might do it. Most databases would keep a lock on data until there is at least a consensus on what to write while Calvin would only keep the lock until all nodes agree on the order. Because the calculations are deterministic and they all agreed on the order, each node will calculate separately and come to the same end result.
Less communication between distant nodes
Besides the advantages in lock duration, separating ordering from the rest of the algorithm also requires less communication. As explained before with the Cassandra example, a distributed database typically requires cross-datacenter communication in many phases of their algorithm. In the case of Calvin, the only moment we need to agree on something is at the moment we determine the order. With the Raft protocol, this could be done in two hops which makes it possible to achieve sub 100ms latencies for read-write queries.
Together with the reduced lock time, this also brings superb throughput. The original Calvin paper has also done experiments that show that this approach significantly outperforms traditional distributed database designs under high contention workloads. Their results of half a million transactions per second on a cluster of commodity machines are competitive with the current world record results obtained on much higher-end hardware.
Run on any hardware
Besides that, Calvin has another advantage: it no longer requires specific hardware in order to obtain such results. Since Calvin can run on commodity machines, it can run on any cloud provider.
2014 – The FaunaDB flavor of Consensus
Ingredient 1: Versioning
FaunaDB has its own distributed transaction protocol with some similarities to Calvin. Just like the former approaches, FaunaDB’s data is also versioned. Since versioning is not only useful for the consistency model but can also have business value, FaunaDB has upgraded this mechanism to a first-class citizen that can be used by end-users. This feature essentially allows time-traveling queries. End-users can execute a query on historic data to answer questions such as: “What would the result of this query have been 20 days ago?”. This is useful to recover data that was accidentally overwritten, audit data changes, or simply incorporate time-travel in your application’s features.
Ingredient 2 and 3: Deterministic calculations and Separation
Like Calvin, FaunaDB also has deterministic calculations and separates the problem of ordering from the rest of the algorithm. Although there are similarities, calculating transactions in FaunaDB happens in a different phase than Calvin. Where Calvin takes advantage of the deterministic nature to execute the same transaction multiple times once the order is set, FaunaDB will calculate only once prior to consensus on the order of the transactions. Which brings us to the fourth ingredient.
Ingredient 4: Optimistic calculation
FaunaDB adds a fourth ingredient which we have seen already when we talked about Snapshot Isolation: Optimistic calculations instead of locking.
FaunaDB will not lock, but will instead optimistically calculate the result of the transaction once in the node where the transaction was received, and then add the result and the original input values to the log. Where Calvin would have saved the query that needs to be executed in the transaction log, FaunaDB will save both the result of the calculation and the original input values in the log. Once there is consensus on the order in which the results have to be applied, FaunaDB will verify whether the input data for that calculation has changed or not (thanks to versioning). If the input values have changed, the transaction is aborted and restarted, if they have remained the same, the results are applied on all nodes without any extra calculation.
FaunaDB’s algorithm has similar advantages as Calvin, but reduces the amount of required calculations in the cluster.
In this series, we have explained how strong consistency can help you build error-free applications more efficiently. In this last article, we have further explained how revolutionary ideas can power a new generation of distributed databases that are both consistent and performant. The takeaway in the previous articles was: “Consistency matters”. In this final article, the takeaway is encompassed in the following:
In the near future, if you read a phrase such as:
“Many NoSQL databases do not offer atomic writes for multiple documents, and in return give better performance. And while consistency is another great feature of SQL databases, it impedes the ability to scale out a database across multiple nodes, so many NoSQL databases give up consistency.” – the biggest challenges of moving to NoSQL
Realize that modern algorithms enable databases to deliver consistency without centralization. In this article, we have seen a few examples of algorithms and databases that do this. Databases that build upon these algorithms are the next generation of databases that no longer can be described by simple categories such as NoSQL, SQL, or even NewSQL.
With distributed cloud databases based on Percolator, Spanner, Calvin, and FaunaDB’s transaction protocol, you can have highly performant distributed databases that offer stronger consistency models. This means that you can build data-intensive applications that offer low-latency without having to worry about data errors, performance, or service provisioning. In such systems, consistency is transparent, and you do not have to think about it as a developer. The next time you choose a database, pick one that is consistent by default.