Twitter Timeline and Search Distributed System.pptx

MdRakibTrofder 22 views 34 slides Nov 13, 2022

About This Presentation

Design the Twitter timeline and search
Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.

Design the Facebook feed and Design Facebook search are similar qu...


Slide Content

Distributed Systems Twitter Timeline and Search

Presented By: Md. Rakib Trofder (BSSE-1129), Nazmus Sakib Ahmed (BSSE-1108), Md. Siam (BSSE-1104)

Use cases 01

Tweets Use cases Followers Notifications Emails

Timeline Home Timeline Searching High Availability Use cases

Constraints and assumptions 02

State assumptions
01 100 million active users
02 500 million tweets per day
03 250 billion read requests per month
04 10 billion searches per month

State assumptions
10 fanout deliveries per tweet
5 billion tweets delivered on fanout per day
150 billion tweets delivered on fanout per month
500 million tweets per day
15 billion tweets per month

State assumptions Fast Optimized Heavy Viewing the timeline Reads of tweets Reads and Writes Timeline & Searching

Calculate usage 03

Calculate usage
250 billion read requests per month * (400 requests per second / 1 billion requests per month) = 100 thousand read requests per second
15 billion tweets per month * (400 requests per second / 1 billion requests per month) = 6,000 tweets per second

Calculate usage
150 billion tweets delivered on fanout per month * (400 requests per second / 1 billion requests per month) = 60 thousand tweets delivered on fanout per second
10 billion searches per month * (400 requests per second / 1 billion requests per month) = 4,000 search requests per second
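
The conversions above rely on the shortcut "400 requests per second per 1 billion requests per month". As a rough check, the sketch below compares that shortcut with the exact average rate, assuming a 30-day month (about 2.6 million seconds). The function names and the `monthly_volumes` table are illustrative and not part of the slides.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds, assuming a 30-day month


def per_second(requests_per_month: float) -> float:
    """Exact average rate for a monthly request volume."""
    return requests_per_month / SECONDS_PER_MONTH


def per_second_rule_of_thumb(requests_per_month: float) -> float:
    """Slide shortcut: 400 requests/second per 1 billion requests/month."""
    return requests_per_month * 400 / 1_000_000_000


# Monthly volumes taken from the assumptions slides.
monthly_volumes = {
    "timeline reads": 250e9,
    "tweets posted": 15e9,
    "fanout deliveries": 150e9,
    "searches": 10e9,
}

for name, volume in monthly_volumes.items():
    print(
        f"{name}: ~{per_second_rule_of_thumb(volume):,.0f}/s (rule of thumb), "
        f"~{per_second(volume):,.0f}/s (exact average)"
    )
```

The shortcut rounds roughly 386 requests per second up to 400, so the slide figures are slight overestimates of the average load, which is usually acceptable for capacity planning.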

Calculate usage
500 million tweets per day * 10 KB per tweet * 30 days per month = 150 TB of new tweet content per month
150 TB of new tweet content per month * 12 months per year * 3 years = 5.4 PB of new tweet content in 3 years
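
The storage estimate can be reproduced with integer arithmetic. This is a minimal sketch that assumes decimal units (1 KB = 1,000 bytes) and a 30-day month; the constant names are illustrative.

```python
KB = 1000          # decimal units are an assumption; the slides do not say
TB = 1000 ** 4
PB = 1000 ** 5

TWEETS_PER_DAY = 500_000_000
BYTES_PER_TWEET = 10 * KB  # 10 KB per tweet, as on the slide

# New tweet content per 30-day month and over 3 years.
monthly_bytes = TWEETS_PER_DAY * BYTES_PER_TWEET * 30
three_year_bytes = monthly_bytes * 12 * 3

print(f"per month: {monthly_bytes / TB:.0f} TB")
print(f"per 3 years: {three_year_bytes / PB:.1f} PB")
```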

Design core components 04

Database Selection (Maintaining Own Timeline) Indexing Super Fast Index Lookups Faster Search Inserting of Tweet to the User’s Profile and Viewing Own Timeline Strict Relation Relational Database

Database Selection (Delivering tweets and building the home timeline ) Overload Writing the Tweets to Everyone’s Profile who Follows Fast Write Is the Only Solution Caching Non Relational Database

Database Selection (Object Storing ) Object Database Storing the Media (images, videos) files

Services User Info Service Tweet Info Service Timeline Service User Graph Service Search Service Notification Service Fan Out Service

Use case: User posts a tweet
The Client posts a tweet to the Web Server, running as a reverse proxy
The Web Server forwards the request to the Write API server
The Write API stores the tweet in the user's timeline on a SQL database

Use case: User posts a tweet (Cont)
The Write API contacts the Fan Out Service, which does the following:
  Queries the User Graph Service to find the user's followers stored in the Memory Cache
  Stores the tweet in the home timeline of the user's followers in a Memory Cache
    O(n) operation: 1,000 followers = 1,000 lookups and inserts
  Stores the tweet in the Search Index Service to enable fast searching
  Stores media in the Object Store
  Uses the Notification Service to send out push notifications to followers:
    Uses a Queue (not pictured) to asynchronously send out notifications
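
The fan-out step can be sketched as a push to every follower's home timeline. The example below is a minimal in-process illustration, not the presenters' implementation: the dictionaries `followers_of` and `home_timelines` are hypothetical stand-ins for the User Graph Service and the Memory Cache, and the search index, object store, and notification queue are omitted.

```python
from collections import defaultdict, deque

# Hypothetical stand-ins for the User Graph Service and the Memory Cache.
followers_of = defaultdict(set)       # author_id -> set of follower ids
home_timelines = defaultdict(deque)   # user_id -> (tweet_id, author_id), newest first


def follow(follower_id: int, followee_id: int) -> None:
    """Record that follower_id follows followee_id."""
    followers_of[followee_id].add(follower_id)


def fan_out(tweet_id: int, author_id: int) -> int:
    """Push a tweet into each follower's home timeline.

    This is the O(n) step from the slide: n followers means n inserts.
    Returns the number of timelines written.
    """
    followers = followers_of.get(author_id, set())
    for follower_id in followers:
        home_timelines[follower_id].appendleft((tweet_id, author_id))
    return len(followers)


follow(2, 1)
follow(3, 1)
print(fan_out(100, 1), "timelines updated")
```

Because the cost grows with follower count, this write path is cheap for typical users and expensive for accounts with very many followers.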

Use case: User views the home timeline
The Client posts a home timeline request to the Web Server
The Web Server forwards the request to the Read API server
The Read API server contacts the Timeline Service, which does the following:
  Gets the timeline data stored in the Memory Cache, containing tweet ids and user ids - O(1)
  Queries the Tweet Info Service with a multiget to obtain additional info about the tweet ids - O(n)
  Queries the User Info Service with a multiget to obtain additional info about the user ids - O(n)
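
The read path can be sketched as one cache lookup followed by two batched lookups. In this illustration the three dictionaries are hypothetical stand-ins for the Memory Cache, the Tweet Info Service, and the User Info Service, and `multiget` is a local helper rather than a real client call.

```python
# Hypothetical stand-ins for the Memory Cache and the two info services.
timeline_cache = {42: [(3, 7), (2, 8), (1, 7)]}  # user_id -> [(tweet_id, author_id)], newest first
tweet_info = {1: "hello", 2: "design", 3: "fanout"}
user_info = {7: "alice", 8: "bob"}


def multiget(store: dict, keys) -> dict:
    """Fetch many keys in one call; keys missing from the store are skipped."""
    return {key: store[key] for key in keys if key in store}


def home_timeline(user_id: int) -> list:
    """Assemble a home timeline from cached ids plus two batched lookups."""
    entries = timeline_cache.get(user_id, [])                           # O(1) cache lookup
    tweets = multiget(tweet_info, {tweet for tweet, _ in entries})      # O(n) in tweet ids
    users = multiget(user_info, {author for _, author in entries})      # O(n) in user ids
    return [
        {"tweet_id": tweet, "text": tweets[tweet], "author": users[author]}
        for tweet, author in entries
        if tweet in tweets and author in users
    ]


for row in home_timeline(42):
    print(row)
```

Batching the lookups keeps the number of round trips constant per request even though the work inside each service still grows with the number of ids.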

Use case: User views the user timeline
The Client posts a user timeline request to the Web Server
The Web Server forwards the request to the Read API server
The Read API retrieves the user timeline from the SQL Database

Use case: User searches keywords
The Client sends a search request to the Web Server
The Web Server forwards the request to the Search API server
The Search API contacts the Search Service, which does the following:
  Parses/tokenizes the input query, determining what needs to be searched
    Removes markup
    Breaks up the text into terms
    Fixes typos
    Normalizes capitalization
    Converts the query to use boolean operations
  Queries the Search Cluster (i.e., Lucene) for the results:
    Scatter gathers each server in the cluster to determine if there are any results for the query
    Merges, ranks, sorts, and returns the results
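
The search steps can be sketched as query normalization followed by a scatter-gather over shards. This is a simplified illustration and does not use Lucene: each shard is a plain dictionary, ranking is a naive term count, the boolean query is a plain AND of all terms, and typo fixing is left out. All function names are hypothetical.

```python
import re


def normalize_query(raw: str) -> list:
    """Remove markup, normalize capitalization, and break text into terms."""
    text = re.sub(r"<[^>]+>", " ", raw)                 # remove markup
    return re.findall(r"[a-z0-9#@_]+", text.lower())    # lowercase, then tokenize


def search_shard(shard: dict, terms: list) -> list:
    """Return (score, tweet_id) for tweets on one shard containing every term (AND)."""
    hits = []
    for tweet_id, text in shard.items():
        words = normalize_query(text)
        if all(term in words for term in terms):
            score = sum(words.count(term) for term in terms)  # naive term-frequency score
            hits.append((score, tweet_id))
    return hits


def scatter_gather(shards: list, raw_query: str, limit: int = 10) -> list:
    """Query every shard, then merge, rank, sort, and return tweet ids."""
    terms = normalize_query(raw_query)
    if not terms:
        return []
    merged = [hit for shard in shards for hit in search_shard(shard, terms)]
    merged.sort(key=lambda hit: (-hit[0], -hit[1]))  # best score first, then highest id
    return [tweet_id for _, tweet_id in merged[:limit]]


shards = [
    {1: "Fanout service design", 2: "cats"},
    {3: "fanout fanout design notes"},
]
print(scatter_gather(shards, "<b>Fanout</b> DESIGN"))
```

In a real cluster the per-shard queries would run in parallel over the network; the sequential loop here only shows the merge logic.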

High Level Design

Scale the design 05

Testing Benchmark Find Bottlenecks

CDN Load Balancer Scale the design DNS

Replicas Master Slave Write Scale the design

Points of Bottlenecks
SQL Write Master-Slave: a single SQL write master-slave might be overwhelmed
Fan Out Service: in case of users with millions of followers

Points of Optimization (Fan Out Service)
Avoid fanning out tweets from users with millions of followers
Instead, search for tweets from highly-followed users at read time
Merge the search results with the user’s home timeline results
Re-order the tweets and serve them
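
The read-time merge can be sketched with the standard library's `heapq.merge`. The example assumes each input list is already sorted newest first and that entries are `(timestamp, tweet_id)` tuples; both the data shape and the function name are illustrative.

```python
import heapq
from itertools import islice


def merged_home_timeline(cached: list, celebrity_tweets: list, limit: int = 5) -> list:
    """Merge a precomputed home timeline with tweets fetched at read time.

    cached: (timestamp, tweet_id) tuples from the Memory Cache, newest first.
    celebrity_tweets: one newest-first list per highly-followed account.
    """
    # heapq.merge with reverse=True expects every input sorted largest to smallest.
    merged = heapq.merge(cached, *celebrity_tweets, reverse=True)
    return list(islice(merged, limit))


cached = [(50, "a"), (30, "b")]
celebrity = [[(40, "c"), (10, "d")]]
print(merged_home_timeline(cached, celebrity, limit=3))
```

This trades a slower read for followers of very popular accounts against avoiding millions of cache writes per tweet.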

Points of Optimization (SQL Write Master-Slave) Federation (Functional Partitioning) Sharding Denormalization SQL Tuning
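
As one illustration of sharding, the sketch below routes a user's rows to a shard by hashing the user id. The shard count and the choice of hash-based routing are assumptions for the example; the slides do not specify a shard key or strategy.

```python
import hashlib

NUM_SHARDS = 8  # assumed shard count, for illustration only


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for user_id in (1, 2, 3):
    print(f"user {user_id} -> shard {shard_for(user_id)}")
```

Plain modulo routing forces most keys to move when the shard count changes, so a production design would more likely use consistent hashing or a lookup table.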

Points of Optimization (Additional)
Keep only several hundred tweets for each home timeline in the Memory Cache
Keep only active users' home timeline info in the Memory Cache
  If a user was not previously active in the past 30 days, we could rebuild the timeline from the SQL Database
    Query the User Graph Service to determine who the user is following
    Get the tweets from the SQL Database and add them to the Memory Cache
Store only a month of tweets in the Tweet Info Service
Store only active users in the User Info Service
The Search Cluster would likely need to keep the tweets in memory to keep latency low
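
The first two optimizations can be sketched with a bounded deque and a rebuild function. The cap of 300 entries is an assumed value for "several hundred", and the `following` and `tweets_by_author` dictionaries are hypothetical stand-ins for the User Graph Service and the SQL Database.

```python
from collections import deque

MAX_TIMELINE_LEN = 300  # assumed value for "several hundred"


def push(timeline: deque, entry: tuple) -> None:
    """Add the newest entry; a full bounded deque drops its oldest entry."""
    timeline.appendleft(entry)


def rebuild_timeline(user_id: int, following: dict, tweets_by_author: dict,
                     limit: int = MAX_TIMELINE_LEN) -> deque:
    """Rebuild a home timeline for a user whose cache entry was evicted.

    following: user_id -> set of followed ids (User Graph Service stand-in).
    tweets_by_author: author_id -> list of (timestamp, tweet_id) (SQL stand-in).
    """
    candidates = [
        tweet
        for author in following.get(user_id, ())
        for tweet in tweets_by_author.get(author, [])
    ]
    candidates.sort(reverse=True)  # newest first
    return deque(candidates[:limit], maxlen=limit)


timeline = deque(maxlen=3)
for tweet_id in range(5):
    push(timeline, (tweet_id, "author"))
print(list(timeline))
```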

Scaled Design

Thanks! Any Questions?