Unique ID Generation in Distributed System (Twitter Snowflake Approach)
SonilKumar2
133 views
13 slides
Aug 14, 2024
Slide 1 of 13
1
2
3
4
5
6
7
8
9
10
11
12
13
About This Presentation
In traditional database systems, generating unique IDs is typically handled by using a primary key with the auto_increment attribute. However, this approach becomes problematic in distributed environments due to the limitations of a single database server.
Scaling a single database server to handle...
In traditional database systems, generating unique IDs is typically handled by using a primary key with the auto_increment attribute. However, this approach becomes problematic in distributed environments due to the limitations of a single database server.
Scaling a single database server to handle the demands of a distributed system can be challenging. Additionally, generating unique IDs across multiple databases with minimal delay presents significant challenges, as coordinating ID generation across different servers becomes complex.
Size: 2.54 MB
Language: en
Added: Aug 14, 2024
Slides: 13 pages
Slide Content
Unique ID Generation in Distributed
Systems
This document explores various approaches to generating unique IDs in distributed systems, highlighting the challenges of
traditional methods and presenting alternative solutions like UUIDs, Ticket Servers, and Twitter's Snowflake approach. We will
analyze the pros and cons of each method and ultimately focus on the Snowflake approach as a robust and scalable solution
for generating unique IDs in distributed environments.
by Sonil Kumar
The Challenge of Distributed ID Generation
In traditional database systems, generating unique IDs is typically handled by using a primary key with the auto_increment
attribute. However, this approach becomes problematic in distributed environments due to the limitations of a single
database server.
Scaling a single database server to handle the demands of a distributed system can be challenging. Additionally, generating
unique IDs across multiple databases with minimal delay presents significant challenges, as coordinating ID generation across
different servers becomes complex.
Multi-Master Replication
One approach to address the scalability issues of a single database server is to use multi-master replication. This method
leverages the database's auto_increment feature, but instead of incrementing the next ID by 1, it increments by k, where k
represents the number of database servers in use.
For example, if there are two database servers, the next ID generated on each server would be equal to the previous ID on that
server plus 2. This approach allows IDs to scale with the number of database servers, addressing some scalability issues.
Difficult to scale across multiple data centers.
IDs do not increase sequentially over time across different servers.
It does not handle server additions or removals efficiently.
Universally Unique Identifier (UUID)
UUIDs are 128-bit numbers used to identify information in computer systems. They are generated independently without
coordination between servers, making them suitable for distributed environments.
UUIDs have a very low probability of collision. According to Wikipedia, "after generating 1 billion UUIDs every second for
approximately 100 years would the probability of creating a single duplicate reach 50%".
Generating UUIDs is straightforward, requiring no coordination between servers, which eliminates synchronization issues.
The system is easy to scale since each web server is responsible for generating its own IDs. The ID generator can easily scale
along with the web servers.
UUIDs are 128 bits long, while our requirement is for 64-bit IDs.
UUIDs do not increase sequentially over time.
UUIDs can be non-numeric.
Ticket Server
Ticket servers offer a centralized approach to generating unique IDs. This method was developed by Flickr to create
distributed primary keys.
The idea is to use a centralized auto\_increment feature in a single database server, known as the Ticket Server. This server
handles the generation of unique IDs, which are then distributed to other servers in the system.
Generates numeric IDs.
Simple to implement and effective for small to medium-scale applications.
Single point of failure: If the Ticket Server goes down, all dependent systems will be affected. To mitigate this, multiple
ticket servers can be set up, but this introduces new challenges, such as data synchronization.
Twitter Snowflake Approach
The Snowflake approach, developed by Twitter, is a robust and scalable solution for generating unique IDs in distributed
systems. It leverages a "divide and conquer" strategy, dividing the ID into several sections.
A 64-bit ID is typically divided into the following sections:
Timestamp: Represents the time when the ID was gene rated.
Data Center ID: Identifies the data center where the ID was generated.
Worker ID: Identifies the machine within the data center that generated the ID.
Sequence Number: A counter that increments within the same millisecond to avoid collisions.
Snowflake ID Generation in C#
Here is an example of generating Snowflake type Id generation using C#.
using System;
public class SnowflakeIdGenerator
{
private const long Epoch = 1288834974657L; // Custom epoch (example: Twitter's Snowflake epoch)
private const int DataCenterIdBits = 5;
private const int WorkerIdBits = 5;
private const int SequenceBits = 12;
private const long MaxDataCenterId = -1L ^ (-1L << DataCenterIdBits);
private const long MaxWorkerId = -1L ^ (-1L << WorkerIdBits);
private const long MaxSequence = -1L ^ (-1L << SequenceBits);
private const int WorkerIdShift = SequenceBits;
private const int DataCenterIdShift = SequenceBits + WorkerIdBits;
private const int TimestampShift = SequenceBits + WorkerIdBits + DataCenterIdBits;
private readonly object _lock = new object();
private long _lastTimestamp = -1L;
private long _sequence = 0L;
public long DataCenterId { get; }
public long WorkerId { get; }
public SnowflakeIdGenerator(long dataCenterId, long workerId)
{
if (dataCenterId > MaxDataCenterId || dataCenterId < 0)
throw new ArgumentException($"DataCenterId must be between 0 and {MaxDataCenterId}");
if (workerId > MaxWorkerId || workerId < 0)
throw new ArgumentException($"WorkerId must be between 0 and {MaxWorkerId}");
DataCenterId = dataCenterId;
WorkerId = workerId;
}
public long NextId()
{
lock (_lock)
{
long timestamp = CurrentTimestamp();
if (timestamp < _lastTimestamp)
{
throw new Exception("Clock moved backwards. Refusing to generate id for " + (_lastTimestamp - timestamp) + "
milliseconds");
}
if (timestamp == _lastTimestamp)
{
_sequence = (_sequence + 1) & MaxSequence;
if (_sequence == 0)
{
// Sequence overflow, wait till next millisecond
timestamp = WaitForNextMillis(_lastTimestamp);
}
}
else
{
_sequence = 0L;
}
_lastTimestamp = timestamp;
// Combine the parts to generate the Snowflake ID
return ((timestamp - Epoch) << TimestampShift) |
(DataCenterId << DataCenterIdShift) |
(WorkerId << WorkerIdShift) |
_sequence;
}
}
private long WaitForNextMillis(long lastTimestamp)
{
long timestamp = CurrentTimestamp();
while (timestamp <= lastTimestamp)
{
timestamp = CurrentTimestamp();
}
return timestamp;
}
private long CurrentTimestamp()
{
return DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
}
}
public static void Main()
{
// Calling SnowFlake
var generator = new SnowflakeIdGenerator(dataCenterId: 1, workerId: 1);
long id = generator.NextId();
Console.WriteLine($"Generated ID: {id}");
}
This Snowflake ID generator in C# generates unique 64-bit identifiers suitable for use in distributed systems. The code is
broken down into sections for clarity and understanding.
Constants and Bit Allocation
private const long Epoch = 1288834974657L;
private const int DataCenterIdBits = 5;
private const int WorkerIdBits = 5;
private const int SequenceBits = 12;
Epoch: This is the custom epoch timestamp in milliseconds. It marks the starting point for your ID generation. Twitter's
Snowflake used a custom epoch, and this code uses a similar concept. The value here (1288834974657L) corresponds to
November 4, 2010. All timestamps in generated IDs are relative to this epoch.
DataCenterIdBits: The number of bits allocated for the data center identifier. Here, 5 bits are used, which allows for 2^5 = 32
different data centers.
WorkerIdBits: The number of bits allocated for the machine (or worker) identifier within a data center. Again, 5 bits are
used, allowing for 32 different workers per data center.
SequenceBits: The number of bits allocated for the sequence number, which increments within the same millisecond to
avoid collisions. 12 bits allow for 2^12 = 4096 unique IDs per millisecond per machine.
Maximum Values
private const long MaxDataCenterId = -1L ^ (-1L << DataCenterIdBits);
private const long MaxWorkerId = -1L ^ (-1L << WorkerIdBits);
private const long MaxSequence = -1L ^ (-1L << SequenceBits);
These constants calculate the maximum values that the data center ID, worker ID, and sequence number can have:
MaxDataCenterId: The maximum value for a data center ID is 31 (with 5 bits).
MaxWorkerId: The maximum value for a worker ID is 31.
MaxSequence: The maximum sequence number is 4095.
Bit Shifts
private const int WorkerIdShift = SequenceBits;
private const int DataCenterIdShift = SequenceBits + WorkerIdBits;
private const int TimestampShift = SequenceBits + WorkerIdBits + DataCenterIdBits;
These constants define how many bits each component should be shifted to combine them into a single 64-bit number:
WorkerIdShift: Shifts the worker ID by 12 bits to make space for the sequence number.
DataCenterIdShift: Shifts the data center ID by 17 bits (12 + 5) to make space for both the sequence number and the wor ker
ID.
TimestampShift: Shifts the timestamp by 22 bits (12 + 5 + 5) to make space for the sequence number, worker ID, and data
center ID.
Class Members
private readonly object _lock = new object();
private long _lastTimestamp = -1L;
private long _sequence = 0L;
public long DataCenterId { get; }
public long WorkerId { get; }
_lock: Used for thread synchronization to ensure that only one thread generates an ID at a time.
_lastTimestamp: Stores the timestamp of the last generated ID to detect when the timestamp has moved to the next
millisecond.
_sequence: The sequence number that increments within the same millisecond.
DataCenterId & WorkerId: Public properties representing the data center ID and worker ID for the current instance.
5. Constructor
public SnowflakeIdGenerator(long dataCenterId, long workerId)
{
if (dataCenterId > MaxDataCenterId || dataCenterId < 0)
throw new ArgumentException($"DataCenterId must be between 0 and {MaxDataCenterId}");
if (workerId > MaxWorkerId || workerId < 0)
throw new ArgumentException($"WorkerId must be betw een 0 and {MaxWorkerId}");
DataCenterId = dataCenterId;
WorkerId = workerId;
}
The constructor initializes the data center ID and worker ID. It also validates that the provided values are within the allowed
range.
ID Generation
public long NextId()
{
lock (_lock)
{
long timestamp = CurrentTimestamp();
if (timestamp < _lastTimestamp)
{
throw new Exception("Clock moved backwards. Refusing to generate id for " + (_lastTimestamp - timestamp) + "
milliseconds");
}
if (timestamp == _lastTimestamp)
{
sequence = (sequence + 1) & MaxSequence;
if (_sequence == 0)
{
timestamp = WaitForNextMillis(_lastTimestamp);
}
}
else
{
_sequence = 0L;
}
_lastTimestamp = timestamp;
return ((timestamp - Epoch) << TimestampShift) |
(DataCenterId << DataCenterIdShift) |
(WorkerId << WorkerIdShift) |
_sequence;
}
}
The NextId() method generates a new Snowflake ID. The method is thread-safe due to the lock statement.
CurrentTimestamp(): Gets the current timestamp in milliseconds.
Clock Moved Backwards : If the current timestamp is less than the last timestamp (indicating a clock rollback), an
exception is thrown.
Same Timestamp : If the current timestamp is the same as the last one, the sequence number is incremented. If the
sequence overflows (i.e., it reaches 4096), the method waits for the next millisecond.
New Millisecond: If the timestamp is different (i.e., we’ve moved to a new millisecond), the sequence is reset to 0.
ID Combination: The Snowflake ID is generated by combining the timestamp, data center ID, worker ID, and sequence
number using bitwise shifts and OR operations.
Wait for Next Millisecond
private long WaitForNextMillis(long lastTimestamp)
{
long timestamp = CurrentTimestamp();
while (timestamp <= lastTimestamp)
{
timestamp = CurrentTimestamp();
}
return timestamp;
}
This method waits until the system clock moves to the next millisecond to avoid sequence overflow.
8. Getting the Current Timestamp
private long CurrentTimestamp()
{
return DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
}
This method returns the current timestamp in milliseconds since the Unix epoch (January 1, 1970).
Final ID Format
The final 64-bit Snowflake ID is structured as follows:
41 bits for the timestamp (enough for 69 years from the custom epoch).
5 bits for the data center ID.
5 bits for the worker ID.
12 bits for the sequence number.
This structure ensures that IDs are unique across different machines and times while still being sortable by generation time.