Replication in Distributed Systems

Vidush Vishwanath
Jan 7, 2021


Motivations for data replication include increased scalability through load balancing of requests, fault tolerance through redundancy, and decreased latency through geographic distribution of content. Almost all production systems replicate their data.

This article will discuss the three most common replication methodologies.

Single Leader

Replication is often performed in a leader-follower fashion. In this scheme, one node in the cluster is designated as the leader. It receives all write requests from clients and in turn propagates those writes to the other nodes, the followers.

Single-leader replication is praised for its simplicity and is a good option for read-heavy applications. While all write requests must be directed at the leader, read requests can be fulfilled by any node in the system. However, handling node failure, particularly failure of the leader, can be especially tricky in this methodology.

Single Leader Write and Read Requests, note how the read path can use a load balancer.
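To make the routing concrete, here is a minimal sketch in Python. The Replica and SingleLeaderCluster classes and their methods are invented for illustration and are not taken from any particular database; the point is simply that writes funnel through the leader while reads rotate across all replicas.

```python
import itertools


class Replica:
    """A node that stores key/value pairs locally (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply_write(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)


class SingleLeaderCluster:
    """Single-leader routing: one node accepts writes, any node serves reads."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.replicas = [leader] + list(followers)
        # A simple round-robin "load balancer" over all replicas for reads.
        self._read_cycle = itertools.cycle(self.replicas)

    def write(self, key, value):
        # All writes go through the leader, which then replicates the change
        # to every follower (done synchronously here for simplicity).
        self.leader.apply_write(key, value)
        for follower in self.replicas[1:]:
            follower.apply_write(key, value)

    def read(self, key):
        # Reads can be served by any replica.
        return next(self._read_cycle).get(key)


cluster = SingleLeaderCluster(Replica("leader"), [Replica("f1"), Replica("f2")])
cluster.write("meeting", "accepted")
print(cluster.read("meeting"))  # served by whichever replica is next in rotation
```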

In the event of leader failure, certain steps must be taken. First, we must confirm the failure. This can be done with a preset timeout: if we suspect the leader has failed, we can send it requests to check its health, and if those fail continuously for 2–5 seconds, we can assume the leader is unavailable.

From there, we must choose a new leader. The best candidate is the node with the latest updates, to ensure minimal data loss. It can be identified by consulting the write log and finding the follower that received the most recent writes from the old leader. It is important to note that this log must be stored separately from the leader and replicated itself to ensure its availability. Finally, after the new leader is chosen, the system must be reconfigured so that all new write requests are directed to it.
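The failover flow can also be sketched in Python. Everything here is hypothetical: the 3-second timeout, the last_successful_health_check attribute, the promote method, and the externally stored replication_log mapping follower names to the position of the last write each acknowledged.

```python
import time

FAILURE_TIMEOUT_SECONDS = 3  # within the 2-5 second window described above


def leader_has_failed(leader, now=None):
    """Suspect failure if the leader has not answered our checks recently."""
    now = time.monotonic() if now is None else now
    return now - leader.last_successful_health_check > FAILURE_TIMEOUT_SECONDS


def elect_new_leader(followers, replication_log):
    """Pick the follower that received the most recent write from the old leader.

    The replication log lives outside the leader and is itself replicated,
    so it survives the leader's failure.
    """
    return max(followers, key=lambda follower: replication_log[follower.name])


def fail_over(cluster, replication_log):
    if not leader_has_failed(cluster.leader):
        return
    new_leader = elect_new_leader(cluster.followers, replication_log)
    # Reconfigure the cluster so all new write requests are routed to the new leader.
    cluster.promote(new_leader)
```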

Multi Leader

Multi-leader replication presents some powerful advantages over its single-leader predecessor. Perhaps the most obvious is that it has no single point of failure for writes. Allowing multiple servers to accept writes also opens up the possibility of horizontally scaling write requests. Additionally, accepting writes at multiple nodes helps in the event of a network partition: if one leader temporarily goes offline, the others can continue servicing write requests, and when it comes back online it catches up on the writes it missed and resumes propagating writes as a leader of the cluster with no further configuration.

Multi Leader Architecture

However, with no single designated leader, data is always scattered across the nodes in the cluster, and the copies can temporarily disagree. This presents consistency challenges. Imagine you have a meeting invite on Outlook open on your phone and laptop. At the same instant, you accept the invite from your phone and reject it from your laptop. Which write will prevail?

A consistency issue — If you accept a meeting from one device and reject it from another device at the same instant, which write will prevail?

There are several methods to deal with such conflicts. One option is to attach a timestamp to each write request and accept only the latest one as the final value. This guarantees that replicas eventually converge, but it silently discards the losing writes, which amounts to data loss, especially under high load when several conflicting requests arrive each second. Another approach is to present the write conflict to the user and ask them to resolve it. This prevents unintended data loss at the expense of the user's experience. Ultimately, how to resolve write conflicts is a key design decision that engineers must make in service of the end product's goals.
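A minimal sketch of the timestamp approach (often called last write wins); the Write dataclass and timestamps below are made up for illustration. Note how the losing write is silently dropped, which is exactly the data-loss risk described above.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp: float  # wall-clock time attached by the node that accepted the write


def resolve_conflict(current: Write, incoming: Write) -> Write:
    """Last write wins: keep whichever version carries the later timestamp.

    Simple and eventually consistent, but the "losing" write is silently
    discarded, and clock skew between nodes can make the outcome surprising.
    """
    return incoming if incoming.timestamp > current.timestamp else current


# The Outlook example: the same invite answered from two devices at once.
phone = Write(value="accepted", timestamp=1610000000.120)
laptop = Write(value="declined", timestamp=1610000000.118)
print(resolve_conflict(laptop, phone).value)  # "accepted" wins; the decline is lost
```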

Leaderless

The leaderless topology provides the best mitigations to network partition scenarios. If any one node in the cluster is reachable, we can have a successful write. However, with multiple nodes processing writes simultaneously and spreading them unpredictably, performing a read that reflects non-stale data is especially challenging.

Unlike the previous topologies, reads in leaderless systems are usually performed against a quorum of nodes.

The number of nodes required for the write and read quorums is typically configurable and can be adjusted to serve different purposes. Consider a cluster with n nodes where a write must be acknowledged by w nodes and a read must consult r nodes.

As long as r + w > n, we are guaranteed that the latest written value will be present in our read quorum, since any set of r nodes must overlap with the w nodes that acknowledged the most recent write.

A Quorum Read where r = 3. Here Node 3 has the most up to date version and its data will be used.

With this guarantee in place, we can tune w and r to fit our needs. For example, in a read-heavy application we may want r << n while still maintaining r + w > n. That way reads will be performed quickly and will remain possible, and accurate, even if up to n-r nodes are unavailable.
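The quorum condition is easy to express in code. The sketch below is illustrative (the VersionedStore and LeaderlessCluster classes are invented): it asserts r + w > n and, on a read, returns the freshest of the r replies.

```python
class VersionedStore:
    """One replica that keeps a version number alongside each value."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def put(self, key, version, value):
        self.data[key] = (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))


class LeaderlessCluster:
    def __init__(self, replicas, w, r):
        # r + w > n guarantees every read quorum overlaps every write quorum,
        # so at least one replica in any read quorum holds the latest write.
        assert r + w > len(replicas), "quorums must overlap"
        self.replicas, self.w, self.r = replicas, w, r
        self._version = 0

    def write(self, key, value):
        self._version += 1
        # A real client would send to all replicas and wait for w acknowledgements;
        # this sketch simply writes to the first w replicas.
        for replica in self.replicas[: self.w]:
            replica.put(key, self._version, value)

    def read(self, key):
        # Ask r replicas and keep the value carrying the highest version.
        replies = [replica.get(key) for replica in self.replicas[: self.r]]
        return max(replies, key=lambda reply: reply[0])[1]


# Example: n = 5 with w = 3 and r = 3 satisfies r + w > n.
cluster = LeaderlessCluster([VersionedStore() for _ in range(5)], w=3, r=3)
cluster.write("user:42", "new profile photo")
print(cluster.read("user:42"))
```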

Correcting stale data

Performing reads from multiple nodes is likely to surface inconsistencies in the data; it is then our job to correct them. This can be done through two mechanisms. The first happens at read time: if a read from multiple nodes reveals older versions of the data, those replicas can be corrected on the spot.

A read repair performed on the above read.
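Read repair can be sketched on top of the LeaderlessCluster above: after collecting r versioned replies, push the freshest one back to any replica in the quorum that returned something older. The helper below is illustrative, not a real client API.

```python
def read_with_repair(cluster, key):
    """Quorum read that also fixes any stale replicas it happens to touch."""
    replies = [(replica, *replica.get(key)) for replica in cluster.replicas[: cluster.r]]
    latest_version = max(version for _, version, _ in replies)
    latest_value = next(value for _, version, value in replies if version == latest_version)

    # Read repair: any replica in the quorum holding an older version is
    # brought up to date while we still have the fresh value in hand.
    for replica, version, _ in replies:
        if version < latest_version:
            replica.put(key, latest_version, latest_value)

    return latest_value
```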

Repairing data in this way is cheap, since no extra service is required. However, values that are rarely read are likely to remain stale for extended periods of time. The second approach counters this by running a dedicated anti-entropy process that updates stale values. It runs independently of incoming read requests and compares all of the data rather than just the subset that happens to be read. Because anti-entropy requires more resources, it is best run when system load is low.

Anti-Entropy
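An anti-entropy pass, by contrast, sweeps over all keys in the background. The naive version below (reusing the VersionedStore sketch) compares versions across every replica and copies the freshest one to whoever is behind; real systems typically use Merkle trees or similar structures to avoid scanning everything.

```python
def anti_entropy_pass(replicas):
    """Background sweep: bring every replica up to the newest version of every key.

    Unlike read repair, this touches all of the data, not just keys that happen
    to be read, so it is best scheduled when the system is under light load.
    """
    all_keys = set()
    for replica in replicas:
        all_keys.update(replica.data.keys())

    for key in all_keys:
        # Find the freshest (version, value) pair across the whole cluster...
        latest_version, latest_value = max(
            (replica.get(key) for replica in replicas), key=lambda reply: reply[0]
        )
        # ...and copy it to any replica holding an older version.
        for replica in replicas:
            if replica.get(key)[0] < latest_version:
                replica.put(key, latest_version, latest_value)
```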

When designing a fault-tolerant system, the engineer must weigh the costs and benefits of the replication methodologies presented above. In addition, engineers must consider the tradeoff between synchronous and asynchronous replication.

In a nutshell, synchronous replication means the leader confirms that its followers have received and written the data before acknowledging the write.

The benefit of synchronous replication is obvious: a guarantee of consistency. However, waiting for every follower to write the data and report back inevitably results in high write latency, and a single slow or unavailable follower can stall all writes.

In practice, a hybrid is often adopted: a small set of followers is replicated to synchronously and is therefore guaranteed to have the latest data, while the rest respond asynchronously. This model balances write latency against consistency and ensures a known set of nodes holds the latest data. Nevertheless, many systems opt for a fully asynchronous approach to minimize write latency at the expense of potential data loss. That choice is particularly appealing when cross-node communication is slow, such as between geographically distributed data centers.
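One way to picture the hybrid model, reusing the Replica class from the single-leader sketch: the leader waits for its designated synchronous followers before acknowledging the write, and replicates to the rest in the background. The threading here is purely illustrative.

```python
import threading


def replicate_write(leader, sync_followers, async_followers, key, value):
    """Hybrid replication sketch: a few synchronous acks, the rest best-effort."""
    leader.apply_write(key, value)

    # Synchronous part: block until the designated followers confirm, so this
    # known, small set of nodes is guaranteed to hold the latest data.
    for follower in sync_followers:
        follower.apply_write(key, value)  # in reality, wait for a network acknowledgement

    # Asynchronous part: fire-and-forget, trading possible data loss for low latency.
    for follower in async_followers:
        threading.Thread(target=follower.apply_write, args=(key, value)).start()

    return "write acknowledged"  # the client hears back before async replication completes
```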

Ultimately, when designing a distributed system for handling data, engineers must weigh the tradeoffs between the various replication methodologies with the end user's experience in mind. Many of these decisions boil down, from the end user's perspective, to a tradeoff between availability and consistency.
