Raft Consensus Algorithm

 

Introduction

Q: what is membership change, and why do we need it?

Replicated State Machines

  • Consensus algorithms typically arise in the context of replicated state machines (RSMs).
  • RSMs are used for fault tolerance in single-leader systems such as GFS and HDFS.
  • Examples of RSMs include Chubby and ZooKeeper.
  • An RSM is implemented using a replicated log of ordered commands (see the sketch after this list).
  • Keeping the replicated log consistent is the job of the consensus algorithm: every log eventually contains the same requests in the same order.
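
A minimal sketch of the replicated state machine idea in Go: every replica applies the same ordered log of commands to a deterministic state machine, so all replicas end up in the same state. The key/value command shape here is an illustrative assumption, not something the paper prescribes.

    // A deterministic state machine: a key/value store driven by a log of commands.
    type Command struct {
        Key, Value string
    }

    type StateMachine struct {
        data map[string]string
    }

    func NewStateMachine() *StateMachine {
        return &StateMachine{data: make(map[string]string)}
    }

    // Apply executes one command. Because Apply is deterministic, replaying
    // the same log on any replica produces the same state.
    func (sm *StateMachine) Apply(cmd Command) {
        sm.data[cmd.Key] = cmd.Value
    }

    // Replay applies an ordered log of commands, in log order.
    func Replay(sm *StateMachine, log []Command) {
        for _, cmd := range log {
            sm.Apply(cmd)
        }
    }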

The Raft consensus algorithm

Basic

term
An arbitrary length of time, numbered with consecutive integers.
election
Each term begins with an election, in which one or more candidates attempt to become leader. If a candidate wins the election, it serves as leader for the rest of the term.
  • There is at most one leader in a given term.
  • Each server stores a current term number, which increases monotonically over time.

Two types of RPCs (argument structs are sketched after this list):

  • RequestVote RPCs are initiated by candidates during elections
  • AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat.
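
For reference, the arguments of these two RPCs written as Go structs; the field names follow Figure 2 of the Raft paper (the Entry type is my own shorthand).

    // Entry is one log entry: the command plus the term in which it was created.
    type Entry struct {
        Term    int
        Command interface{}
    }

    // RequestVoteArgs: sent by candidates during elections (Figure 2).
    type RequestVoteArgs struct {
        Term         int // candidate's term
        CandidateId  int
        LastLogIndex int // index of candidate's last log entry
        LastLogTerm  int // term of candidate's last log entry
    }

    // AppendEntriesArgs: sent by the leader to replicate entries and as heartbeat (Figure 2).
    type AppendEntriesArgs struct {
        Term         int     // leader's term
        LeaderId     int
        PrevLogIndex int     // index of the entry immediately preceding the new ones
        PrevLogTerm  int     // term of that entry
        Entries      []Entry // empty for a heartbeat
        LeaderCommit int     // leader's commitIndex
    }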

Q: difference between commit and apply?

  • I learnt the following from section 5.3
  • commit is a leader’s concept. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers. This also commits all preceding entries in the leader’s log, including entries created by previous leaders.
  • apply is a server’s concept (leader or follower):
    • once a follower learns that a log entry is committed, it applies the entry to its local state machine, in log order (see the apply-loop sketch after this list).
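
A sketch of the distinction in code, reusing the Command, Entry, and StateMachine types above and the commitIndex/lastApplied volatile state from Figure 2: commit is decided by the leader counting replicas, while apply just catches lastApplied up to commitIndex, in log order, on every server.

    // Server holds the per-server state relevant to committing and applying.
    type Server struct {
        log         []Entry // log indices are 1-based in the paper
        commitIndex int     // highest log index known to be committed
        lastApplied int     // highest log index applied to the state machine
        sm          *StateMachine
    }

    // applyCommitted is run by every server (leader or follower): it applies
    // newly committed entries to the local state machine, in log order.
    func (s *Server) applyCommitted() {
        for s.lastApplied < s.commitIndex {
            s.lastApplied++
            entry := s.log[s.lastApplied-1]     // slice is 0-based, log indices are 1-based
            s.sm.Apply(entry.Command.(Command)) // assumes commands are the Command type above
        }
    }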

Leader election

Raft uses a heartbeat mechanism to trigger leader election.

Servers start up as followers.
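
A sketch of the trigger, assuming a randomized election timeout as in the paper and two hypothetical hooks: lastHeartbeat reports when a valid AppendEntries last arrived, and startElection turns the server into a candidate.

    import (
        "math/rand"
        "time"
    )

    // runElectionTimer starts an election whenever no heartbeat has been
    // seen for a full randomized timeout (the paper suggests 150–300 ms).
    func runElectionTimer(lastHeartbeat func() time.Time, startElection func()) {
        for {
            timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
            time.Sleep(timeout)
            if time.Since(lastHeartbeat()) >= timeout {
                startElection() // increment currentTerm, vote for self, send RequestVote to all
            }
        }
    }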

Q: how do the servers know about each other? A: it seems there is a global configuration listing all the machines.

Log replication

Log Matching Property

  • if two entries in different logs have the same index and term, then they store the same command
    • this is because a leader creates at most one entry with a given log index in a given term,
    • and because log entries never change their position in the log.
  • if two entries in different logs have the same index and term, then the logs are identical in all preceding entries
    • this is maintained by the consistency check performed by AppendEntries:
    • when sending AppendEntries, the leader includes the index and term of the entry in its log that immediately precedes the new entries.
    • Q: “if the follower does not find an entry in its log with the same index and term, then it refuses the new entries” — does this mean that if a follower misses one entry, it will reject all following entries?
      • A: the leader forces the follower’s log to duplicate its own. Conflicting entries in the follower’s log are overwritten with entries from the leader’s log. After a rejection, the leader decrements nextIndex and retries the AppendEntries (see the sketch after this list).
    • Q: will the follower re-apply entries? A: no; all applied entries are already committed and therefore consistent with the leader’s log.
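
A sketch of the follower side of this consistency check, reusing the Entry and AppendEntriesArgs types above (term and vote handling omitted; log indices are 1-based as in the paper).

    // handleAppendEntries returns the (possibly updated) log and whether the
    // entries were accepted. On a reject, the leader decrements nextIndex and retries.
    func handleAppendEntries(log []Entry, args AppendEntriesArgs) ([]Entry, bool) {
        // Refuse if our log has no entry at PrevLogIndex with PrevLogTerm.
        if args.PrevLogIndex > 0 {
            if len(log) < args.PrevLogIndex || log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
                return log, false
            }
        }
        // Force our log to match the leader's: drop everything after the agreed
        // prefix and append the leader's entries. (A full implementation should
        // only truncate on an actual conflict, so a stale RPC cannot erase entries.)
        log = append(log[:args.PrevLogIndex], args.Entries...)
        return log, true
    }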

Q: the leader maintains nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. What about a new leader?

  • A: a new leader initializes all nextIndex values to the index just after the last one in its log (see the sketch below).
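
A sketch of the leader side: nextIndex is initialized optimistically and backed off on rejection until the follower accepts. Here sendAppendEntries is a hypothetical helper that builds the RPC starting at the given index and reports whether the follower accepted it.

    // initNextIndex: a new leader assumes every follower is already up to date.
    func initNextIndex(nextIndex []int, lastLogIndex int) {
        for i := range nextIndex {
            nextIndex[i] = lastLogIndex + 1
        }
    }

    // replicateTo keeps retrying one follower, backing nextIndex up by one
    // after each rejected consistency check.
    func replicateTo(follower int, nextIndex []int, sendAppendEntries func(follower, next int) bool) {
        for !sendAppendEntries(follower, nextIndex[follower]) {
            nextIndex[follower]--
        }
    }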

Safety

A follower might be unavailable while the leader commits several log entries; it could later be elected leader and overwrite those entries with new ones. As a result, different state machines might execute different command sequences.

Raft therefore adds a restriction on which servers may be elected leader.

Election restriction

Q: does the voter compare only committed log entries, or the whole log?

A: the whole log. The RequestVote RPC carries the index and term of the candidate’s last log entry, and the voter compares them against the last entry of its own entire log (Section 5.4.1); whether entries are committed plays no part in the check.
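
A sketch of the “at least as up-to-date” comparison from Section 5.4.1 that a voter applies before granting its vote: a later last-entry term wins; with equal terms, the longer log wins.

    // candidateUpToDate reports whether the candidate's log is at least as
    // up-to-date as the voter's, given the index and term of each log's last entry.
    func candidateUpToDate(candLastIndex, candLastTerm, myLastIndex, myLastTerm int) bool {
        if candLastTerm != myLastTerm {
            return candLastTerm > myLastTerm
        }
        return candLastIndex >= myLastIndex
    }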

Q: how is the issue in Figure 8 resolved by only committing log entries from the leader’s current term by counting replicas?

A: an entry from an earlier term can be replicated on a majority and still be overwritten by a later leader (that is exactly the Figure 8 scenario), so a leader never treats such an entry as committed merely because it is on a majority. Once an entry from the leader’s current term reaches a majority, the election restriction guarantees every future leader stores it, and by the Log Matching Property those leaders also store all preceding entries, so the older entries become committed indirectly.
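
A sketch of the leader’s commit rule from Figure 2, reusing the Entry type above; matchIndex holds, per follower, the highest log index known to be replicated on that follower (the leader counts itself separately).

    // advanceCommitIndex returns the new commitIndex: the largest N such that
    // log[N].Term == currentTerm and a majority of servers store entry N.
    // Entries from older terms are never committed by counting replicas; they
    // become committed only once such an N from the current term exists.
    func advanceCommitIndex(log []Entry, matchIndex []int, commitIndex, currentTerm int) int {
        total := len(matchIndex) + 1 // followers plus the leader itself
        for n := len(log); n > commitIndex; n-- {
            if log[n-1].Term != currentTerm {
                continue // skip entries from earlier terms
            }
            count := 1 // the leader always has its own entry
            for _, m := range matchIndex {
                if m >= n {
                    count++
                }
            }
            if count > total/2 {
                return n // commits entry n and, transitively, everything before it
            }
        }
        return commitIndex
    }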

Safety Argument

Leader Completeness Property
(in my own words) If a leader for term T ($leader_T$) commits a log entry from its term, then that entry will be stored by the leaders of all future terms.
State Machine Safety Property
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.

Q: how do you guarantee that a server receives all possible vote requests before making its decision? A: (maybe we can figure this out during implementation)

Q: Does Raft sacrifice anything for simplicity?

A: Raft gives up some performance in return for clarity; for example:

  • Every operation must be written to disk for persistence; performance probably requires batching many operations into each disk write.

  • There can only usefully be a single AppendEntries in flight from the leader to each follower: followers reject out-of-order AppendEntries, and the sender’s nextIndex[] mechanism requires one-at-a-time. A provision for pipelining many AppendEntries would be better.

  • The snapshotting design is only practical for relatively small states, since it writes the entire state to disk. If the state is big (e.g. if it’s a big database), you’d want a way to write just parts of the state that have changed recently.

  • Similarly, bringing recovering replicas up to date by sending them a complete snapshot will be slow, needlessly so if the replica already has a snapshot that’s only somewhat out of date.

  • Servers may not be able to take much advantage of multi-core because operations must be executed one at a time (in log order).

Q: Should the leader wait for replies to AppendEntries RPCs?

A: The leader should send the AppendEntries RPCs concurrently, without waiting. As replies come back, the leader should count them, and mark the log entry as committed only when it has replies from a majority of servers (including itself).

One way to do this in Go is for the leader to send each AppendEntries RPC in a separate goroutine, so that the leader sends the RPCs concurrently. Something like this:

for each server {
    go func() {
        send the AppendEntries RPC and wait for the reply
        if reply.success == true {
            increment count
            if count == nservers/2 + 1 {
                this entry is committed
            }
        }
    }()
}
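
A more concrete Go sketch of the same idea: a mutex protects the count, and sendOne is a hypothetical helper that sends the AppendEntries RPC to one follower and reports whether it succeeded.

    import "sync"

    // replicateEntry sends AppendEntries to every follower concurrently and
    // invokes onCommit once a majority (including the leader) has the entry.
    func replicateEntry(nservers int, followers []int, sendOne func(server int) bool, onCommit func()) {
        var mu sync.Mutex
        count := 1 // the leader already has the entry in its own log
        for _, f := range followers {
            go func(server int) {
                if sendOne(server) {
                    mu.Lock()
                    defer mu.Unlock()
                    count++
                    if count == nservers/2+1 {
                        onCommit() // majority reached: mark this entry committed
                    }
                }
            }(f)
        }
    }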