Building a Replicated Logging System with Apache Kafka
GuozhangWang
3,710 views
78 slides
Sep 07, 2015
Slide 1 of 78
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
About This Presentation
Apache Kafka is a scalable publish-subscribe messaging system
with its core architecture as a distributed commit log.
It was originally built as its centralized event
pipelining platform for online data integration tasks. Over
the past years developing and operating Kafka, we extend
its log-structur...
Apache Kafka is a scalable publish-subscribe messaging system
with its core architecture as a distributed commit log.
It was originally built as its centralized event
pipelining platform for online data integration tasks. Over
the past years developing and operating Kafka, we extend
its log-structured architecture as a replicated logging backbone
for much wider application scopes in the distributed
environment. I am going to talk about our design
and engineering experience to replicate Kafka logs for various
distributed data-driven systems, including
source-of-truth data storage and stream processing.
Size: 13.48 MB
Language: en
Added: Sep 07, 2015
Slides: 78 pages
Slide Content
Building a Replicated Logging System with Apache Kafka Guozhang Wang, Joel Koshy, Sriram Subramanian, Kartik Paramasivam Mammad Zadeh, Neha Narkhede, Jun Rao, Jay Kreps, Joe Stein Body Level Two Body Level Three Body Level Four Body Level Five
We All Love Logs!
Apache Kafka A distributed messaging system ..that store messages as a log!
Example: LinkedIn back in 2010 Point-to-Point Pipelines What We Want: A Centralized Data Pipeline
Log-centric Data Flow Logical Ordering Persistent Buffering “ Source-of-Truth”
Store Messages as a Log 4 5 5 7 8 9 10 11 12 ... Producer Write Consumer1 Reads (offset 7) Consumer2 Reads (offset 10) Messages 3
Partition the Log across Machines Topic 1 Topic 2 Partitions Producers Producers Consumers Consumers Brokers
Configurable ISR Commits ACK mode Latency On Failures “no" no network delay some data loss “leader" 1 network roundtrip a few data loss “all" ~2 network roundtrips no data loss
Use an embedded controller Detect broker failure via ZooKeeper Leader failure => elect new leader from ISR Leader and ISR persisted in Zookeeper For Controller fail-over Membership Management
Overview: Logs and Kafka Log Replication in Kafka Kafka Usage at LinkedIn Conclusion Agenda
Change Log Replication
Apache Kafka Example: Kafka at LinkedIn
Example: Espresso A distributed document store Primary online data serving platform at LI Member profile, homepage, InMail, etc [SIGMOD 2013]
Old Espresso Replication Data Center-1 Storage Node Storage Node MySQL Replication MySQL MySQL Search Index Hadoop … … Databus Cross-DC Replicator Data Center-1 Storage Node Storage Node MySQL Replication MySQL MySQL Search Index Hadoop … Databus Cross-DC Replicator
New Espresso Replication Data Center-1 Storage Node Storage Node Storage Node Kafka Logs MySQL MySQL MySQL Data Center-n Storage Node Storage Node Storage Node Kafka Logs MySQL MySQL MySQL Kafka MirrorMaker Search Index Hadoop … … Search Index Hadoop … * In Progress
Stream Processing
Apache Kafka Example: Kafka at LinkedIn
Data flow streaming on Kafka and YARN Stateful processing Re-processing Failure Recovery Example: Samza [CIDR 2015]
Kafka Kafka Samza State Process Protocol State Process Protocol State Process Protocol Samza Processing
Kafka Kafka Samza State Process Protocol State Process Protocol State Process Protocol Samza Processing Kafka Changelog
Kafka Kafka Samza State Process Protocol State Process Protocol State Process Protocol Samza Processing Kafka Changlog
Kafka Kafka Samza State Process Protocol State Process Protocol State Process Protocol Samza Processing Kafka Changlog State Process Protocol
Take-aways Log-centric data flow helps scaling your systems Kafka: replicated log streams for real-time platforms
We are Hiring
Take-aways Log-centric data flow helps scaling your systems Kafka: replicated log streams for real-time platforms THANKS!