Design Questions
- The size of the key-value store
- Requirements for availability, scalability, and consistency
CAP Theorem
The CAP theorem states that it is impossible for a distributed system to simultaneously provide more than two of the following three guarantees: consistency, availability, and partition tolerance.
- Consistency: all clients see the same data at the same time no matter which nodes they connect to.
- Availability: any client can get a response, no matter which node it connects to.
- Partition tolerance: a network partition is a communication break between two sets of nodes. Partition tolerance means the system continues to operate despite network partitions.
Since network failures are unavoidable in the real world, a distributed system must tolerate network partitions. Therefore, a distributed system is usually CP (consistency and partition tolerance) or AP (availability and partition tolerance).
Ideal situation
If network partitions never occur, the system can have both consistency and availability: when there is a write, write to all replicas; when there is a read, read from any replica.
How to store large data – Data Partition
We can use consistent hashing to partition the data and store it on different servers.
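A minimal Python sketch of the idea, assuming an MD5-based hash and a handful of server names (the ConsistentHashRing class is made up for illustration; virtual nodes are omitted for brevity):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a key or server id to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers):
        # Sorted (position, server) pairs form the ring.
        self._ring = sorted((_hash(s), s) for s in servers)

    def get_server(self, key: str) -> str:
        """Walk clockwise from the key's position to the first server."""
        positions = [pos for pos, _ in self._ring]
        idx = bisect.bisect_right(positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["server-0", "server-1", "server-2", "server-3"])
print(ring.get_server("user:42"))  # the server that owns this key
```

The useful property is that adding or removing a server only remaps the keys between that server and its ring neighbor, rather than rehashing everything.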
How to keep data reliable – Data Replication
We can store the same data on the N consecutive servers that follow the key's position on the hash ring.
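Continuing the hypothetical ring above (reusing _hash, the bisect import, and the ring object), replica placement can be sketched as walking clockwise until N distinct servers are collected:

```python
def get_replica_servers(ring: ConsistentHashRing, key: str, n: int) -> list:
    """Walk clockwise from the key's position, collecting N distinct servers."""
    positions = [pos for pos, _ in ring._ring]
    start = bisect.bisect_right(positions, _hash(key))
    replicas = []
    for i in range(len(ring._ring)):
        server = ring._ring[(start + i) % len(ring._ring)][1]
        if server not in replicas:
            replicas.append(server)
        if len(replicas) == n:
            break
    return replicas

print(get_replica_servers(ring, "user:42", 3))  # three distinct servers, clockwise
```

With virtual nodes, the "distinct" check would also skip virtual nodes that map back to a physical server already chosen.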
How to keep data consistent – Different consistency models
Two dimensions for measuring consistency: 1. the staleness of read operations; 2. the ordering of read and/or write operations.
- Strong consistency: As if there is only one copy of data, all reads get the latest data.
- Explanation: No staleness for read operations; a total order exists on read and write operations.
- Implementation 1: Quorum consensus, setting R + W > N (see the quorum sketch after this list)
- Implementation2: Single leader model
- Cons: cannot guarantee availability at all times
- Example: Spanner Database
- Sequential consistency: There exists a total order on all operations (both reads and writes) that all nodes agree on, but the order does not need to be finalized while operations are still ongoing.
- Explanation: No staleness for read operations in logical time, though reads may be stale in real time; a total order exists on read and write operations.
- Implementation: Vector clocks, two-phase locking, logical timestamps
- Causal consistency: There is only a partial order on all operations. Each node can order read and write operations differently, as long as causality is preserved.
- Explanation: Reads can be stale, but each node sees a consistent write order.
- Implementation: Vector clocks, two-phase locking, logical timestamps
- Example: Collaborative Editing.
- Eventual consistency: All replicas eventually converge to the same copy of the data.
- Explanation: Reads can be stale, and there is no consistent write ordering across replicas.
- Implementation: requires a mechanism to resolve conflicting versions of the data.
- Example: Amazon DynamoDB
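A toy illustration of why R + W > N yields strong consistency under the quorum approach: every read quorum must overlap every write quorum, so at least one replica in the read set holds the latest version. (The coordinator logic, versioning, and failure handling below are simplified assumptions.)

```python
N, W, R = 3, 2, 2
assert R + W > N  # guarantees that read and write quorums overlap

replicas = [dict() for _ in range(N)]  # replica index -> {key: (version, value)}

def quorum_write(key, value, version):
    # The write succeeds once W replicas acknowledge; here the first W respond.
    for replica in replicas[:W]:
        replica[key] = (version, value)

def quorum_read(key):
    # Read from any R replicas (here the last R) and return the highest version.
    responses = [replica[key] for replica in replicas[-R:] if key in replica]
    return max(responses)[1] if responses else None

quorum_write("k", "v1", version=1)
print(quorum_read("k"))  # 'v1' -- the read quorum overlaps the write quorum
```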
How to detect server failure – Gossip protocol
Motivation
A single source of information is not reliable enough, but an all-to-all multicast protocol is inefficient.
Implementation
- Each node maintains a membership list, which records the heartbeat counter and last-update time for every node.
- Each node periodically sends its membership list to a random set of nodes.
- When a node receives an update, it merges the list into its own, keeping the higher heartbeat count for each member.
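A minimal in-process sketch of this protocol (the class name, fanout, and timeout are illustrative assumptions; a real implementation would exchange the membership list over the network):

```python
import random
import time

class GossipNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers  # other GossipNode objects, wired up in-process here
        self.heartbeat = 0
        # membership list: node_id -> (heartbeat count, last-update time)
        self.members = {node_id: (0, time.time())}

    def tick(self, fanout=2):
        """Bump own heartbeat and gossip the membership list to random peers."""
        self.heartbeat += 1
        self.members[self.node_id] = (self.heartbeat, time.time())
        for peer in random.sample(self.peers, min(fanout, len(self.peers))):
            peer.receive(dict(self.members))

    def receive(self, remote_members):
        """Merge a received list, keeping the higher heartbeat count per node."""
        now = time.time()
        for node_id, (heartbeat, _) in remote_members.items():
            if heartbeat > self.members.get(node_id, (-1, 0))[0]:
                self.members[node_id] = (heartbeat, now)

    def suspected_failed(self, timeout=10.0):
        """Nodes whose heartbeat has not advanced within the timeout."""
        now = time.time()
        return [n for n, (_, t) in self.members.items() if now - t > timeout]
```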
How to handle temporary node failure
If a node is not marked as failed in the membership list but cannot be contacted, we can let another node temporarily take over its responsibility and hand the responsibility (and any writes accumulated in the meantime) back once the node recovers.
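This technique is often called hinted handoff (used with a sloppy quorum) in Dynamo-style stores. A toy sketch of the idea, where the Node type, the liveness check, and the replay timing are all assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    store: dict = field(default_factory=dict)

# intended owner id -> [(key, value), ...] writes parked on another node
hinted_writes = {}

def write(key, value, owner: Node, fallback: Node, owner_alive: bool):
    """Send the write to its owner, or park it on a fallback node with a hint."""
    if owner_alive:
        owner.store[key] = value
    else:
        fallback.store[key] = value  # the fallback serves this key meanwhile
        hinted_writes.setdefault(owner.node_id, []).append((key, value))

def hand_back(owner: Node):
    """Once the owner is reachable again, replay the hinted writes to it."""
    for key, value in hinted_writes.pop(owner.node_id, []):
        owner.store[key] = value
```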
How to handle permanent node failure
If a node fails according to the membership list, we need to bring the node back up (or replace it) and then sync the latest data to it.
When syncing the data, we need to compare the different versions of the data and determine which one is final. To make this process efficient, we can use an anti-entropy protocol. A Merkle tree (hash tree) can be used to detect inconsistencies and minimize the amount of data transferred:
- Divide the key space into buckets
- Build a tree on top of these buckets; each non-leaf node is labeled with the hash of its child nodes' labels or values.
- By comparing the two trees from the root down, you can quickly find the minimal set of leaf buckets that need to be synced.
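A compact sketch of building and comparing such trees, assuming both replicas use the same number of buckets and a simple pairwise hashing scheme (the bucket contents here are dummy strings):

```python
import hashlib

def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def build_merkle(bucket_hashes):
    """Build the tree bottom-up; levels[0] are the leaves, levels[-1] is the root."""
    levels = [bucket_hashes]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h("".join(prev[i:i + 2])) for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(tree_a, tree_b, level=None, idx=0):
    """Walk both trees top-down; descend only into subtrees whose hashes differ."""
    if level is None:
        level = len(tree_a) - 1
    if tree_a[level][idx] == tree_b[level][idx]:
        return []                  # identical subtree: nothing to sync
    if level == 0:
        return [idx]               # a leaf bucket that must be synced
    out = []
    for child in (2 * idx, 2 * idx + 1):
        if child < len(tree_a[level - 1]):
            out += diff_leaves(tree_a, tree_b, level - 1, child)
    return out

# Four buckets of keys hashed on each replica; only bucket 2 differs.
replica_a = [h("b0-v1"), h("b1-v1"), h("b2-v1"), h("b3-v1")]
replica_b = [h("b0-v1"), h("b1-v1"), h("b2-v2"), h("b3-v1")]
print(diff_leaves(build_merkle(replica_a), build_merkle(replica_b)))  # [2]
```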
How to handle data center outage
Replicate data across multiple data centers.
How to write data to each node
- Write to the commit log (for durability)
- Write to the in-memory store (memtable)
- When the memtable is full, flush it to SSTables on disk
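A toy sketch of those three steps on a single node (the class name, flush threshold, and plain-text commit-log format are assumptions; real systems keep the memtable sorted, write binary SSTable files, and compact them in the background):

```python
class StorageNode:
    """Write path sketch: commit log -> memtable -> SSTable flush."""
    MEMTABLE_LIMIT = 4  # tiny threshold so the flush is easy to see

    def __init__(self, commit_log_path):
        self.commit_log_path = commit_log_path
        self.memtable = {}   # in-memory writes, sorted at flush time
        self.sstables = []   # immutable sorted tables, newest last

    def put(self, key, value):
        # 1. Append to the commit log for durability.
        with open(self.commit_log_path, "a") as log:
            log.write(f"{key}\t{value}\n")
        # 2. Write to the in-memory structure (memtable).
        self.memtable[key] = value
        # 3. When the memtable is full, flush it to an SSTable.
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}
```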
How to read data from a node
- Read from the in-memory cache (memtable) first
- If the key is not there, use Bloom filters to identify which SSTables may contain it
- Read the value from the SSTable and return it to the client
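A sketch of this read path, assuming each flushed SSTable carries a Bloom filter over its keys (the BloomFilter class, its sizes, and the (data, bloom) pairing are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """May report false positives, but never false negatives."""
    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def read(key, memtable, sstables):
    """sstables: list of (data, bloom) pairs, newest last."""
    # 1. Check the in-memory cache (memtable) first.
    if key in memtable:
        return memtable[key]
    # 2. Skip SSTables whose Bloom filter says the key cannot be there.
    for data, bloom in reversed(sstables):
        if bloom.might_contain(key) and key in data:
            # 3. Read the value from the SSTable.
            return data[key]
    return None
```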