Zookeeper Architecture

PrasadWali · Oct 27, 2019

About This Presentation

Learn how the internals of ZooKeeper are architected.


Slide Content

What is Zookeeper?

Coordination service: A coordination service in a distributed world helps distributed applications offload common challenges such as:
- Synchronization between the nodes of the cluster.
- Distributing a common configuration between the nodes of the cluster.
- Grouping and naming services for each node of the cluster.
- Leader election between the nodes.

Znodes (namespace): ZooKeeper (ZK) helps the nodes of a distributed application coordinate with each other by providing a common namespace. Nodes can use this namespace to save and retrieve any shared information needed for coordination. The namespace is hierarchical, much like a tree data structure. Each element in this namespace is called a znode and is associated with a name separated by slashes (/) that indicates its hierarchical path from the root. The namespace is stored in memory and therefore provides fast access.

Ensemble: Like the distributed application clients that it serves, ZooKeeper itself is distributed, i.e. a set of ZooKeeper nodes works together to achieve its goal. This group of ZooKeeper nodes is called an ensemble. Clients can talk to any node within the ensemble (via the ZK client library); they periodically send heartbeats to the connected server and receive an ack back to reaffirm connectivity. Each node in the ensemble is aware of and talks to the other nodes to share information. A short sketch of working with the namespace follows below.
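The namespace is reached through the ZooKeeper client library. A minimal Java sketch, assuming an illustrative three-server ensemble at zk1/zk2/zk3:2181 and a path /app/config whose parent /app already exists:

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.data.Stat;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        // Connect to any server of the ensemble; the client lib picks one from the list.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});

        // Save shared info as a znode in the hierarchical namespace.
        zk.create("/app/config", "shared-info".getBytes(),
                  Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any other node of the distributed application can read it back.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app/config", false, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");

        zk.close();
    }
}
```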

Zookeeper Data Model

- Each znode stores a stat structure that contains the zxid (transaction ID), version #, timestamps, and ACL.
- The client receives this stat structure at read time; the stat structure helps validate subsequent updates/deletes from the client.
- With every create/update the stat structure is updated: the version # increments, the zxid increases, the timestamp is reset, etc.

Types of znodes:
- Ephemeral: Exists only as long as the session that created it exists. Cannot have children.
- Persistent: Unlike ephemeral znodes, these are persisted across sessions.
- Sequential: Contains a monotonically increasing number as part of its name, which keeps the name unique.

The sketch below shows the three znode types and a version-checked update.
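A minimal Java sketch of the data model, assuming a local server at localhost:2181 and illustrative paths under /demo:

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.data.Stat;

public class DataModelDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Persistent: survives the session that created it.
        zk.create("/demo", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral: removed automatically when this session ends; cannot have children.
        zk.create("/demo/ephemeral", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential: the server appends a monotonically increasing counter to the name.
        String seq = zk.create("/demo/seq-", new byte[0], Ids.OPEN_ACL_UNSAFE,
                               CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + seq);   // e.g. /demo/seq-0000000000

        // The stat structure validates updates: setData fails with a BadVersionException
        // if another client has changed the znode since this read.
        Stat stat = new Stat();
        zk.getData("/demo", false, stat);
        zk.setData("/demo", "v2".getBytes(), stat.getVersion());

        zk.close();
    }
}
```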

Zookeeper Sessions

- The client lib configuration contains the list of all ZooKeeper servers; the client establishes a connection to a random server from that list.
- The connected server sends an auth token upon successful connection.
- Both the client and the connected server periodically exchange heartbeats to confirm that each is alive.
- If the client loses connectivity, the client lib will, upon timeout, connect to some other server from the config list. This switch is transparent to the client application.
- During reconnection the auth token from the previous connection is used to validate the attempt to resume the lost session.

Zookeeper Watches

- Client ops like getData(), getChildren(), exists(), etc. have an optional parameter to enable a watch on the target znode.
- ZooKeeper servers notify a single change event to the watchers of a znode; successive changes to the znode will not notify the watchers unless the watch is set again.
- There are 2 kinds of watches:
  - Data watches: watch for a change in data on a znode. getData() and exists() set a watch for data changes; create() and delete() trigger it.
  - Children watches: watch for the addition/deletion of a child node under a parent znode. getChildren() sets a watch for added/deleted children of a parent znode; create() and delete() of a child trigger it.
- Example: servers A & B create ephemeral nodes 1 & 2 respectively. When A dies, B, which is watching node 1, is notified once 1 expires. B can now leverage that info to take corrective action. A sketch of setting watches follows below.
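A minimal Java sketch of watches, assuming illustrative znodes /config and /members that already exist; because watches are one-shot, the data-watch callback re-registers itself:

```java
import org.apache.zookeeper.*;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Data watch: fires once on the next create/delete/data change of /config.
        Watcher dataWatch = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println(event.getType() + " on " + event.getPath());
                try {
                    zk.exists("/config", this);   // re-register to see later changes
                } catch (Exception ignored) {}
            }
        };
        zk.exists("/config", dataWatch);

        // Children watch: fires once when a child is added or removed under /members.
        zk.getChildren("/members", event ->
                System.out.println("children of /members changed: " + event.getType()));

        Thread.sleep(60_000);   // keep the session alive to receive notifications
        zk.close();
    }
}
```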

Zookeeper Data Access

Read Path:
- In a leader/follower model, reads are eventually consistent.
- The client connects to one of the ZK servers and requests a znode along a path.
- The connected ZK server authenticates the client and serves the read from its locally stored namespace. Since it is a local copy, it can be stale.
- ZK servers choose availability over consistency for reads, hence each server stores its own copy of the namespace.

Write Path:
- The client connects to one of the ZK servers and requests a create/delete of a znode along a path in the namespace.
- Since all writes are handled by the leader, the connected ZK server forwards the write to the leader.
- The leader persists the data, broadcasts the write to all followers in the cluster, and awaits their responses.
- If a majority of them write it into their local namespace and respond back, we have a quorum and the write is a success.
- The initially connected ZK server then reports the write request as a success to the client.

A sketch of both paths from the client's point of view follows below.
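From the client side, both paths look like ordinary API calls. A minimal Java sketch, reusing the illustrative /app/config znode; the sync() call is an optional extra not covered by the slide, which asks the connected server to catch up with the leader before the next read:

```java
import org.apache.zookeeper.*;
import java.util.concurrent.CountDownLatch;

public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});

        // Write path: the connected server forwards this to the leader, which commits
        // once a quorum (majority) of servers has acknowledged the change.
        zk.setData("/app/config", "new-value".getBytes(), -1);   // -1 = skip the version check

        // Read path: served from the connected server's local copy, so it may be stale.
        byte[] maybeStale = zk.getData("/app/config", false, null);

        // Optional: sync() makes the connected server catch up with the leader first,
        // trading a little latency for a fresher read.
        CountDownLatch caughtUp = new CountDownLatch(1);
        zk.sync("/app/config", (rc, path, ctx) -> caughtUp.countDown(), null);
        caughtUp.await();
        byte[] fresher = zk.getData("/app/config", false, null);

        System.out.println(new String(maybeStale) + " / " + new String(fresher));
        zk.close();
    }
}
```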

Zk commands

Recipe - Barrier

Intent: Enforce a barrier while performing a crucial job (a code sketch follows below).
- The client calls exists(/b, true) to check whether the barrier exists and to set a watch.
- If barrier /b doesn't exist, create an ephemeral node and proceed with the client job: create(/b, EPHEMERAL).
- If barrier /b exists, the client waits for the watch to trigger. At this point there may be multiple clients waiting and watching the same barrier /b.
- Once the client job is done, the client deletes the barrier. The delete of the barrier node triggers a notification to all watchers.
- Other waiting clients can now retry with calls to exists(/b, true).

Usage: Critical updates/housekeeping tasks that force other processes to wait.

Note: create & delete are atomic ops performed by the leader upon agreement with the quorum. The leader guarantees ordering in the event of a race: multiple create requests from different clients are serialized, so only one wins.

[Diagram: each client calls exists(/b, true); on "No" it does create(/b, EPHEMERAL), runs the client job, then delete(/b, EPHEMERAL); on "Yes" it waits for the watch; every state change is notified to the watchers.]
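A minimal Java sketch of the barrier recipe, using the /b path from the slide (the runBehindBarrier helper name is illustrative):

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import java.util.concurrent.CountDownLatch;

public class BarrierDemo {
    private static final String BARRIER = "/b";

    // Runs the job only when the barrier /b is free, holding it while the job runs.
    static void runBehindBarrier(ZooKeeper zk, Runnable job) throws Exception {
        while (true) {
            CountDownLatch released = new CountDownLatch(1);
            // Check the barrier and leave a watch so we are told when it changes.
            if (zk.exists(BARRIER, event -> released.countDown()) == null) {
                try {
                    // Barrier is free: claim it with an ephemeral node and do the work.
                    zk.create(BARRIER, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                    job.run();
                    zk.delete(BARRIER, -1);   // releasing it notifies every watcher
                    return;
                } catch (KeeperException.NodeExistsException e) {
                    // Another client won the race (the leader serializes the creates); retry.
                }
            } else {
                released.await();             // wait until the holder deletes /b, then retry
            }
        }
    }
}
```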

Recipe - Cluster Management

Intent: Notify nodes about the arrival or departure of other nodes in the cluster (a code sketch follows below).
- Create a PERSISTENT parent node /member.
- Each client sets a watch on the children of the parent node /member: getChildren(/member, true).
- Each client creates an EPHEMERAL child node under /member: create(/member/host1, EPHEMERAL).
- Each client updates its status (CPU/memory/failure, etc.) on its own node in the hierarchy.
- Watches are triggered for all watchers when any child node changes.

Usage: Cluster monitoring or management for elastic scaling.

[Diagram: clients c1, c2, c3 watch the parent /member and create/update /member/c1, /member/c2, /member/c3. When client c3 creates /member/c3, ZK notifies the other watchers, viz. c1 and c2.]
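A minimal Java sketch of the membership recipe, using the /member parent from the slide (the join/watchMembers helper names and the status payload are illustrative):

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import java.util.List;

public class MembershipDemo {
    private static final String GROUP = "/member";

    // Joins the cluster and starts printing the live member list whenever it changes.
    static void join(ZooKeeper zk, String hostName, byte[] status) throws Exception {
        // The parent is persistent so it outlives any individual member.
        if (zk.exists(GROUP, false) == null) {
            try {
                zk.create(GROUP, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) {}   // someone else created it
        }
        // Our own entry is ephemeral: it disappears automatically if this host dies.
        zk.create(GROUP + "/" + hostName, status, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        watchMembers(zk);
    }

    static void watchMembers(ZooKeeper zk) throws Exception {
        // Children watch: fires once when a member joins or leaves, then is re-registered.
        List<String> members = zk.getChildren(GROUP, event -> {
            try { watchMembers(zk); } catch (Exception ignored) {}
        });
        System.out.println("current members: " + members);
    }
}
```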

Recipe - Queues

Intent: Create ordered (FIFO) data access (a code sketch follows below).
- Create a PERSISTENT parent node /queue.
- Each client creates an EPHEMERAL & SEQUENTIAL child node under /queue. Since it is sequential, a monotonically increasing number is appended to the name, e.g. /queue/X-0001, /queue/X-0002, ...: create(/queue/X-, EPHEMERAL, SEQUENTIAL).
- A client that wants to access the nodes in insertion order simply lists all the children: getChildren(/queue, true).
- By enabling the watch on the parent, the accessing client is notified when a child is created or removed externally.

Usage: FIFO work queues, where producers enqueue items and consumers process them in insertion order.

[Diagram: clients c1, c2, c3 create /queue/x-0001, /queue/x-0002, /queue/x-0003; client c4 calls getChildren and watches for changes to the children.]
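A minimal Java sketch of the queue recipe, using the /queue parent from the slide (the enqueue/dequeue helper names are illustrative; the slide uses ephemeral-sequential children, though persistent-sequential is also common so items survive a producer crash):

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import java.util.Collections;
import java.util.List;

public class QueueDemo {
    private static final String QUEUE = "/queue";

    // Producer: the server appends the sequence number that defines FIFO order.
    static String enqueue(ZooKeeper zk, byte[] item) throws Exception {
        return zk.create(QUEUE + "/X-", item, Ids.OPEN_ACL_UNSAFE,
                         CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    // Consumer: take the oldest item (smallest sequence number) off the queue.
    static byte[] dequeue(ZooKeeper zk) throws Exception {
        List<String> children = zk.getChildren(QUEUE, true);   // watch: told when items arrive
        if (children.isEmpty()) return null;
        String oldest = Collections.min(children);              // names sort by sequence number
        byte[] item = zk.getData(QUEUE + "/" + oldest, false, null);
        zk.delete(QUEUE + "/" + oldest, -1);
        return item;
    }
}
```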

Recipe - Locks

Intent: Avoid race conditions by enforcing a lock/key pattern (a code sketch follows below).
- Create a PERSISTENT parent node /lock.
- Each client creates an EPHEMERAL & SEQUENTIAL child node under /lock. Since it is sequential, a monotonically increasing number is appended to the name, e.g. /lock/X-0001, /lock/X-0002, ...: create(/lock/X-, EPHEMERAL, SEQUENTIAL).
- Locks are granted in insertion order, from smallest to largest.
- A client that wants to check whether it is the lowest invokes getChildren(/lock, false).
- If the 1st znode in the list of children is its very own, the lock is acquired. The client proceeds to do its job and, upon completion, releases the lock by deleting its znode: delete(/lock/X-0001).
- Else, it waits its turn by adding a watch on its predecessor znode (if the immediate predecessor doesn't exist, look for the one before, and so on until one is found): exists(/lock/X-000(n-1), true).
- When its predecessor znode is deleted/updated the client is notified; when a node receives this event it goes back to the getChildren step.

[Diagram: clients c1, c2, c3 create /lock/x-0001, /lock/x-0002, /lock/x-0003; each calls getChildren() and watches its predecessor. A client also needs to re-check with the parent whether it is now the first existing child, in case its predecessor dies.]
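A minimal Java sketch of the lock recipe, using the /lock parent from the slide (the acquire/release helper names are illustrative):

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LockDemo {
    private static final String LOCK = "/lock";

    // Blocks until this client holds the lock; returns the path of its lock znode.
    static String acquire(ZooKeeper zk) throws Exception {
        // Our place in line: ephemeral (released if we crash) and sequential (defines order).
        String mine = zk.create(LOCK + "/X-", new byte[0], Ids.OPEN_ACL_UNSAFE,
                                CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = mine.substring(LOCK.length() + 1);

        while (true) {
            List<String> children = zk.getChildren(LOCK, false);
            Collections.sort(children);
            int index = children.indexOf(myName);
            if (index == 0) {
                return mine;                       // smallest znode: lock acquired
            }
            // Otherwise watch our immediate predecessor and wait for it to go away.
            String predecessor = children.get(index - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(LOCK + "/" + predecessor, event -> gone.countDown()) != null) {
                gone.await();
            }
            // Predecessor is gone (or was already gone): re-check from the top.
        }
    }

    static void release(ZooKeeper zk, String lockPath) throws Exception {
        zk.delete(lockPath, -1);                   // deleting our znode hands the lock on
    }
}
```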

Recipe - Leader Election

Intent: Leader election (a code sketch follows below).
- Create a PERSISTENT parent node /election.
- Each ZK server creates an EPHEMERAL & SEQUENTIAL child node under /election. Since it is sequential, a monotonically increasing number is appended to the name, e.g. /election/X-0001, /election/X-0002, ...: create(/election/X-, EPHEMERAL, SEQUENTIAL).
- Each ZK server checks whether its znode is the smallest among all children: getChildren(/election, false).
- If yes, it becomes the leader.
- Else, it sets a watch on the znode just smaller than itself (its closest predecessor): exists(/election/X-000(n-1), true).
- If the leader dies, so does its ephemeral znode, triggering a watch event to only its successor (the next in line, which is watching it). When a node receives this event it goes back to the getChildren step.

[Diagram: Zk 1 (leader), Zk 2, Zk 3 create /election/x-0001, /election/x-0002, /election/x-0003; each calls getChildren() and watches its predecessor. A node also needs to re-check with the parent whether it is now the first existing child, in case its predecessor dies.]
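A minimal Java sketch of the election recipe, using the /election parent from the slide (the volunteer/checkLeadership helper names are illustrative); watching only the closest predecessor avoids waking every candidate when the leader dies:

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import java.util.Collections;
import java.util.List;

public class ElectionDemo {
    private static final String ELECTION = "/election";
    private final ZooKeeper zk;
    private String myNode;   // e.g. X-0000000002

    ElectionDemo(ZooKeeper zk) { this.zk = zk; }

    // Volunteer for leadership with an ephemeral-sequential znode, then evaluate.
    void volunteer() throws Exception {
        String path = zk.create(ELECTION + "/X-", new byte[0], Ids.OPEN_ACL_UNSAFE,
                                CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring(ELECTION.length() + 1);
        checkLeadership();
    }

    void checkLeadership() throws Exception {
        List<String> children = zk.getChildren(ELECTION, false);
        Collections.sort(children);
        int index = children.indexOf(myNode);
        if (index == 0) {
            System.out.println(myNode + " is the leader");
            return;
        }
        // Watch only the closest predecessor; when it disappears, only we are notified
        // and re-run the check, instead of every candidate stampeding at once.
        String predecessor = children.get(index - 1);
        if (zk.exists(ELECTION + "/" + predecessor, event -> {
                try { checkLeadership(); } catch (Exception ignored) {}
            }) == null) {
            checkLeadership();   // predecessor already vanished: re-evaluate immediately
        }
    }
}
```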

References
- Brief architecture: https://data-flair.training/blogs/zookeeper-architecture/
- Data model: https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#ch_zkDataModel
- ZK API: https://www.tutorialspoint.com/zookeeper/zookeeper_api.htm
- Overview: https://www.slideshare.net/scottleber/apache-zookeeper