distributed systems: lecture slides on logical time and global state (CS3551).

palaniappancse · 151 slides · Jul 09, 2024

Slide Content

CS3551 Distributed Systems

UNIT II: LOGICAL TIME AND GLOBAL STATE

Planned Hour | Description of Portion to be Covered | Relevant CO Nos | Highest Cognitive Level
1 | Logical Time: a framework for a system of logical clocks | CO3 | K2
1 | Scalar time – Vector time | CO3 | K2
1 | Message ordering paradigms | CO3 | K2
1 | Asynchronous execution with synchronous communication | CO3 | K2
1 | Synchronous program order on an asynchronous system | CO3 | K2
1 | Group communication – Causal order (CO) | CO3 | K2
1 | Total order | CO3 | K2
1 | Global state and snapshot recording algorithms: introduction | CO3 | K2
1 | System model and definitions | CO3 | K2
1 | Snapshot algorithms for FIFO channels | CO3 | K2

Definition: A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.

The space–time diagram of a distributed execution. A horizontal line represents the progress of a process; a dot indicates an event; a slant arrow indicates a message transfer.

Logical Time: Causality is usually tracked using physical time. However, in distributed systems it is not possible to have global physical time; it is possible to realize only an approximation of it.

Why Is Time Interesting? Ordering of events: what happened first? Storage of data in memory, file, database. Requests for exclusive access: who asked first? Interactive exchanges: who answered first? Debugging: what could have caused the fault? Causality is linked to temporal ordering: causality, i.e. the causal precedence relation among events in a distributed system, is a powerful concept for reasoning about, analyzing, and drawing inferences about a computation.

Computer Clocks and Timing Events Each computer has its own internal (physical) clock, which can be used by local processes to obtain a value of the current time Processes (on different computers) can associate timestamps with their events

This is because computer clocks drift from perfect time, and their drift rates differ from one another. Clock drift rate: the rate at which a computer clock deviates from a perfect reference clock. Consequence ==> if the physical clocks are not precisely synchronized, the causality relation between events may not be accurately captured.

No Accurate Clocks... but Event Ordering! We are interested in knowing whether an event (sending or receiving a message) at one process occurred before, after, or concurrently with another event at another process. The execution of a system can be described in terms of events and their ordering despite the lack of accurate clocks. Example [Real-Time Ordering of Events]: consider the following set of exchanges between a group of email users Bob, Alice, Peter, and Paul on a mailing list:

1. Bob sends a message with the subject Meeting 2. Alice and Peter reply by sending a message with the subject Re: Meeting

Due to the independent delays in message delivery, the messages may be delivered in the following order:


Idea... Logical Time! Since clocks cannot be synchronized perfectly across a distributed system, LOGICAL TIME can be used to provide an ordering among the events (at processes running on different computers in a distributed system) without recourse to clocks. Let us consider our email ordering problem: what do we know logically?

The Idea... in 1 Slide

Logical Clocks

Happened-Before Relation (➝)

Logical Clock

Scalar Time Proposed by Lamport in 1978 as an attempt to totally order events in a distributed system.

Scalar Time - Logical Clocks... in Practice!
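To make the scalar-clock update rules concrete, here is a minimal Python sketch (the class and method names are mine, not from the slides), using the standard rules, often called R1 and R2, with increment value d = 1:

```python
class LamportClock:
    """Minimal sketch of a Lamport scalar clock with increment d = 1."""

    def __init__(self):
        self.time = 0

    def event(self):
        # R1: before executing any event (internal or send), increment the clock.
        self.time += 1
        return self.time  # for a send, this value is piggybacked on the message

    def receive(self, piggybacked):
        # R2: on receive, move the clock past both the local time and the
        # timestamp piggybacked on the message, then apply the increment.
        self.time = max(self.time, piggybacked) + 1
        return self.time
```

For example, if P1 executes two events (its clock reaches 2) and then sends a message, a receiver whose clock is still 0 jumps to 3 on delivery, so the receive is timestamped after the send.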

[Lamport Clocks] Example 1

[Lamport Clocks] Example 2

[Lamport Clocks] Example 3

[Lamport Clocks] Example 3

[Lamport Clocks] Example 4

Basic Properties: consistency property, total ordering, event counting, no strong consistency.

Properties

Event counting: If the increment value d is always 1, scalar time has the following interesting property: if event e has a timestamp h, then h−1 represents the minimum logical duration, counted in units of events, required before producing the event e; we call it the height of the event e. In other words, h−1 events have been produced sequentially before the event e, regardless of the processes that produced these events. For example, in Figure 3.1, five events precede event b on the longest causal path ending at b.

Shortcoming of Lamport clocks

Mattern and Fidge Vector Clocks

Vector Clocks
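A short Python sketch of the vector-clock rules (names are mine): each process increments only its own component on local and send events, and takes a component-wise maximum on receive. Unlike scalar clocks, comparing two vectors decides causality exactly, which is the strong consistency property that Lamport clocks lack.

```python
class VectorClock:
    """Sketch of a Mattern/Fidge vector clock for a system of n processes.
    pid is the index of the process that owns this clock."""

    def __init__(self, n, pid):
        self.v = [0] * n
        self.pid = pid

    def tick(self):
        # Internal or send event: increment only the local component.
        self.v[self.pid] += 1
        return list(self.v)  # this copy is what a send piggybacks

    def receive(self, piggybacked):
        # Component-wise max with the received vector, then tick locally.
        self.v = [max(a, b) for a, b in zip(self.v, piggybacked)]
        self.v[self.pid] += 1
        return list(self.v)


def happened_before(u, v):
    # u -> v iff u <= v component-wise and u != v;
    # incomparable vectors correspond to concurrent events.
    return all(a <= b for a, b in zip(u, v)) and u != v
```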

Physical Clock Synchronization: NTP. In centralized systems, there is only a single clock; a process gets the time by simply issuing a system call to the kernel. In distributed systems, there is no global clock or common memory. Each processor has its own internal clock and its own notion of time. These clocks can easily drift seconds per day, accumulating significant errors over time. Also, because different clocks tick at different rates, they may not always remain synchronized even if they were synchronized when they started. This clearly poses serious problems for applications that depend on a synchronized notion of time.

For most applications and algorithms that run in a distributed system, we need to know time in one or more of the following contexts: ◮ the time of day at which an event happened on a specific machine in the network; ◮ the time interval between two events that happened on different machines in the network; ◮ the relative ordering of events that happened on different machines in the network. Unless the clocks in each machine have a common notion of time, time-based queries cannot be answered.

Clock synchronization is the process of ensuring that physically distributed processors have a common notion of time. Due to differing clock rates, the clocks at various sites may diverge with time, and clock synchronization must be performed periodically to correct this clock skew in distributed systems. Clocks are synchronized to an accurate real-time standard like UTC.

Definitions and Terminology

Clock Inaccuracies: Physical clocks are synchronized to an accurate real-time standard like UTC (Coordinated Universal Time). As discussed earlier, due to clock inaccuracy a timer (clock) is said to be working within its specification if 1 − ρ ≤ dC/dt ≤ 1 + ρ, where the constant ρ is the maximum skew rate specified by the manufacturer.

Figure 3.8: The behavior of fast, slow, and perfect clocks with respect to UTC.

Offset delay estimation method The Network Time Protocol (NTP) which is widely used for clock synchronization on the Internet uses the Offset Delay Estimation method. The design of NTP involves a hierarchical tree of time servers. The primary server at the root synchronizes with the UTC. The next level contains secondary servers, which act as a backup to the primary server. At the lowest level is the synchronization subnet which has the clients.
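The offset-delay estimation itself uses four timestamps per request-reply exchange: T1 (client sends the request), T2 (server receives it), T3 (server sends the reply), T4 (client receives it). A sketch of the standard NTP computation (the function and variable names are mine):

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Standard NTP offset/delay estimation from one request-reply exchange.
    t1: client send, t2: server receive, t3: server send, t4: client receive."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # estimated server clock offset relative to client
    delay = (t4 - t1) - (t3 - t2)          # round-trip network delay
    return offset, delay
```

For example, with t1 = 0, t2 = 60, t3 = 62, t4 = 4 (the server's clock runs about a minute ahead), the estimated offset is 59 and the round-trip delay is 2.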

Global state of a distributed system A distributed system consists of multiple processes Each process is located on a different computer Each process consists of “events”

An event is either sending a message, receiving a message, or changing the value of some variable Each process has a communication channel in and out Recording the global state of a distributed system on-the-fly is an important paradigm when one is interested in analyzing, testing, or verifying properties associated with distributed executions

Definition: In order to test whether a certain property of the system is true, we cannot just look at each process individually; a “snapshot” of the entire system must be taken. This “snapshot” is called a global state. The global state of a distributed system is the set of local states of each individual process involved in the system, plus the states of the communication channels.

Global State Detection: A naïve snapshot algorithm. Processes record their state at any arbitrary point; a designated process collects these states. (+) So simple!! (−) Correct??

Global State Detection: Example (figure: processes p and q exchanging a message m).

Global State Detection: Example, the producer-consumer problem. p records its state, then sends m to q.

Global State Detection: Example (continued). q records its state after receiving m.

Global State Detection: Where did we err? What did we do?

Global State Detection: Error!! The sender has no record of the sending; the receiver has a record of the receipt. Result: the global state has a record of the receive event but no send event, violating the happened-before concept!!

Global State Detection: The notion of consistency. A global state is consistent if it could have been observed by an external observer: if e → e′, then it is never the case that e′ is observed by the external observer and not e. All feasible states are consistent.

Global State Detection: An example (figure: processes p and q with local states Sp0 through Sp3 and Sq0 through Sq3, and messages m1, m2, m3).

Global State Detection: A consistent state? (figure: the cut {Sp1, Sq1}).

Global State Detection: A consistent state? (figure: the cut {Sp2, Sq3}, with m3 in transit).

Global State Detection: An inconsistent state (figure: the cut {Sp1, Sq3}).

The state of a process at any time is defined by the contents of processor registers, stacks, local memory, etc., and depends on the local context of the distributed application. The state of a channel is given by the set of messages in transit in the channel. The occurrence of events changes the states of the respective processes and channels. An internal event changes the state of the process at which it occurs. A send event changes the state of the process that sends the message and the state of the channel on which the message is sent. A receive event changes the state of the process that receives the message and the state of the channel on which the message is received.

Notations: The system consists of a collection of n processes p1, p2, ..., pn that are connected by channels. There is no globally shared memory and no physical global clock; processes communicate by passing messages through communication channels. Cij denotes the channel from process pi to process pj, and its state is denoted by SCij. The actions performed by a process are modeled as three types of events: internal events, message send events, and message receive events. For a message mij that is sent by process pi to process pj, let send(mij) and rec(mij) denote its send and receive events.

At any instant, the state of process pi, denoted by LSi, is the result of the sequence of all the events executed by pi up to that instant. For an event e and a process state LSi, e ∈ LSi iff e belongs to the sequence of events that have taken process pi to state LSi. For an event e and a process state LSi, e ∉ LSi iff e does not belong to that sequence. For a channel Cij, the following set of messages can be defined based on the local states of the processes pi and pj. Transit: transit(LSi, LSj) = { mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj }.

Notations

Global State

A Consistent Global State Even if the state of all the components is not recorded at the same instant, such a state will be meaningful provided every message that is recorded as received is also recorded as sent. Basic idea is that a state should not violate causality – an effect should not be present without its cause. A message cannot be received if it was not sent. Such states are called consistent global states and are meaningful global states
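That consistency condition (every message recorded as received is also recorded as sent) is easy to check mechanically. A sketch, under the assumption that each process's recorded local state has been reduced to sets of message ids it recorded as sent and as received (the function and dict shapes are mine):

```python
def is_consistent(sent, received):
    """sent / received: dicts mapping a process id to the set of message ids
    that its recorded local state shows as sent / received.
    A global state is consistent iff no message is recorded as received
    without also being recorded as sent: no effect without its cause."""
    all_sent = set().union(*sent.values())
    all_received = set().union(*received.values())
    return all_received <= all_sent   # subset test over message ids
```

In the naïve-snapshot example above, p records its state before sending m and q records after receiving it, so m appears received but never sent, and the cut is rejected.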

Inconsistent Global State Inconsistent global states are not meaningful in the sense that a distributed system can never be in an inconsistent state.

Cuts of a Distributed Computation

Issues in recording a global state. The following two issues need to be addressed: I1: How to distinguish the messages to be recorded in the snapshot from those not to be recorded. Any message that is sent by a process before recording its snapshot must be recorded in the global snapshot (from C1). Any message that is sent by a process after recording its snapshot must not be recorded in the global snapshot (from C2). I2: How to determine the instant when a process takes its snapshot. A process pj must record its snapshot before processing a message mij that was sent by process pi after recording its own snapshot.

Chandy – Lamport Algorithm

Prelims: A snapshot algorithm attempts to capture a coherent global state of a distributed system (for the purpose of debugging or checkpointing, for instance). Sending and receiving are events that take place on processes; in this example, there also happen to be internal events that are neither sends nor receives. Messages are sent and received along channels, which are FIFO queues between each pair of processes.

Processes in our system are named P1, P2, etc. The channel from process Pi to process Pj is named Cij; for instance, the channel from P1 to P2 is C12, while the channel from P2 to P1 is C21. A snapshot is a recording of the state of each process (i.e., what events have happened on it) and each channel (i.e., what messages have passed through it).

Chandy-Lamport Distributed Snapshot Algorithm Marker receiving rule for Process Pi Marker sending rule for Process Pi

Decentralized: any process (or multiple processes at once!) can begin taking a snapshot without coordinating with other processes.

Initiating a snapshot. To get a snapshot started, an initiating process has to do three things: first, it has to record its own state. Next, immediately after recording its own state, and before it does anything else, it has to send a marker message out on each of its outgoing channels. Finally, it needs to start keeping track of the messages it receives on all of its incoming channels.

We’ve recorded P1’s state, sent marker messages out along C12 and C13, and started recording on C21 and C31.

FIRST MARKER MESSAGE. When a process Pi receives a marker message on channel Cki, there are two possibilities: either this is the first marker message that Pi has ever seen, or it isn’t. If it’s the first marker message that Pi has ever seen, then Pi needs to do the following: record its own state; mark the channel that the marker message came in on, Cki, as empty; send marker messages out itself, on all its outgoing channels; and start recording incoming messages on all its incoming channels except Cki.

THIS AIN’T MY FIRST MESSAGE. P3 has sent out its marker messages. Let’s consider the one that went to P1 first. Is this marker message the first that P1 has seen? No, because P1 was the first process to send marker messages in the first place! If it is not the first marker message a process has ever seen: it should stop recording on Cki, and it should set Cki’s final state to be the sequence of all the messages that arrived on Cki while recording.
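Putting the two marker rules together, one process's share of the algorithm can be sketched as follows (the class and callback names are mine; FIFO channels are assumed, as the algorithm requires):

```python
class SnapshotProcess:
    """Sketch of one process's Chandy-Lamport marker handling.
    incoming: ids of this process's incoming channels.
    send_marker: callback that floods a marker on every outgoing channel."""

    def __init__(self, pid, incoming, send_marker):
        self.pid = pid
        self.incoming = set(incoming)
        self.send_marker = send_marker
        self.state = None          # recorded local state (None = not yet recorded)
        self.recording = {}        # channel id -> messages recorded so far
        self.channel_state = {}    # channel id -> final recorded sequence

    def on_marker(self, chan, local_state):
        if self.state is None:
            # First marker seen: record own state, mark chan as empty,
            # flood markers, start recording on every other incoming channel.
            self.state = local_state
            self.send_marker()
            self.channel_state[chan] = []
            for c in self.incoming - {chan}:
                self.recording[c] = []
        else:
            # Not the first marker: stop recording on chan; its final state
            # is exactly the messages that arrived on it while recording.
            self.channel_state[chan] = self.recording.pop(chan, [])

    def on_message(self, chan, msg):
        # Application message: record it only if chan is being recorded.
        if chan in self.recording:
            self.recording[chan].append(msg)
```

An initiating process runs the same record-and-flood steps spontaneously, except that it starts recording on all of its incoming channels rather than marking one as empty.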

The last step (figure sequence: the recorded channel states, showing messages I -> G and B -> F).


Message Ordering and Group Communication

Inter-process communication via message-passing is at the core of any distributed system. Message orders: non-FIFO, FIFO, causal order, synchronous order Group communication : causal order, total order

Message ordering paradigms

The order of delivery of messages is most important. Several orderings on messages have been defined: (i) non-FIFO, (ii) FIFO, (iii) causal order, and (iv) synchronous order.

Asynchronous Executions (A-execution): An asynchronous execution (or A-execution) is an execution (E, ≺) for which the causality relation is a partial order. No causality cycles. On any logical link, messages may be delivered in any order, not necessarily FIFO delivery; e.g., the network-layer IPv4 connectionless service. All physical links obey FIFO.

Figure 6.1: Illustrating FIFO and non-FIFO executions. (a) An A-execution that is not a FIFO execution. (b) An A-execution that is also a FIFO execution.

FIFO executions: A FIFO execution is an A-execution in which, for all (s, r) and (s′, r′), (s ∼ s′ and r ∼ r′ and s ≺ s′) ⟹ r ≺ r′. On any logical link in the system, messages are necessarily delivered in the order in which they are sent. The receiver uses a buffer to order the incoming messages as per the sender’s sequence numbers, and accepts only the “next” message in sequence.
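The receiver-side buffering just described can be sketched per (sender, receiver) pair (the class and field names are mine):

```python
class FifoReceiver:
    """Sketch of receiver-side buffering that turns a non-FIFO channel into
    FIFO delivery, using per-sender sequence numbers."""

    def __init__(self):
        self.next_seq = 0   # sequence number of the "next" expected message
        self.buffer = {}    # out-of-order arrivals, keyed by sequence number

    def on_arrival(self, seq, msg):
        # Buffer the arrival, then deliver the expected message plus any
        # consecutive buffered successors that are now in order.
        self.buffer[seq] = msg
        delivered = []
        while self.next_seq in self.buffer:
            delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

For example, if messages arrive in the order 1, 2, 0, nothing is delivered until sequence number 0 shows up, at which point all three are delivered in send order.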

Causally ordered (CO) executions: If send events s and s′ are related by the causality ordering (not physical-time ordering), then their corresponding receive events r and r′ occur in the same order at all common destinations. If s and s′ are not related by causality, then CO is vacuously satisfied.

Figure 6.2: Illustration of causally ordered executions. (a) Not a CO execution. (b), (c), and (d) CO executions.

Figure 6.2(a) shows an execution that violates CO because s1 ≺ s3 and at the common destination P1, we have r3 ≺ r1. Figure 6.2(b) shows an execution that satisfies CO. Only s1 and s2 are related by causality but the destinations of the corresponding messages are different. Figure 6.2(c) shows an execution that satisfies CO. No send events are related by causality. Figure 6.2(d) shows an execution that satisfies CO. s2 and s1 are related by causality but the destinations of the corresponding messages are different. Similarly for s2 and s3.

Causal Order: Definition from an Implementation Perspective. Message arrival vs. delivery: a message m that arrives in the OS buffer at Pi may have to be delayed until the messages that were sent to Pi causally before m was sent (the “overtaken” messages) have arrived. The event of the application processing an arrived message is referred to as a delivery event (instead of as a receive event). CO requires that no message be overtaken by a chain of messages between the same (sender, receiver) pair. In Fig. 6.2(a), m1 is overtaken by the chain <m2, m3>. CO degenerates to FIFO when m1, m2 are sent by the same process.

Synchronous execution (SYNC): Involves a handshake between sender and receiver. Instantaneous communication leads to a modified definition of causality, where s and r are atomic and simultaneous, neither preceding the other.

Asynchronous Execution with Synchronous Communication If a program is written for an asynchronous system, say a FIFO system, will it still execute correctly if the communication is done by synchronous primitives instead?

Executions realizable with synchronous communication (RSC): An execution can be modeled (using the interleaving model) as a feasible schedule of the events that gives a total order extending the partial order (E, ≺). In an A-execution, the messages can be made to appear instantaneous if there exists a linear extension of the execution in which each send event is immediately followed by its corresponding receive event. Such an A-execution can be realized under synchronous communication and is called a realizable-with-synchronous-communication (RSC) execution.

In the non-separated linear extension, if the adjacent send event and its corresponding receive event are viewed atomically, then that pair of events shares a common past and a common future with each other.

Synchronous program order on an asynchronous system: There do not exist real systems with instantaneous communication that allow synchronous communication to be naturally realized. This raises the question of whether distributed programs are deterministic, i.e., whether repeated runs of the same program produce the same partial order. In many cases, programs are non-deterministic in the following senses: a receive call can receive a message from any sender if the expected sender is not specified; multiple send and receive calls may be enabled at the same time.

Rendezvous (a meeting at an agreed time and place): One form of group communication is called multiway rendezvous, which is a synchronous communication among an arbitrary number of asynchronous processes. All the processes involved “meet with each other,” i.e., communicate “synchronously” with each other at one time. It is complex and not popular. A rendezvous between a pair of processes at a time is called binary rendezvous, as opposed to multiway rendezvous.

Algorithms for binary rendezvous typically share the following features: at each process, there is a set of tokens representing the current interactions that are enabled locally; if multiple interactions are enabled, a process chooses one of them and tries to “synchronize” with the partner process.

Constraints to be followed: schedule on-line, atomically, and in a distributed manner; schedule in a deadlock-free manner (i.e., crown-free); schedule to satisfy the progress property (i.e., find a schedule within a bounded number of steps). Additional features of a good algorithm are: (i) symmetry or some form of fairness, i.e., not favouring particular processes over others during scheduling; (ii) efficiency, i.e., using as few messages as possible.

Outline of a simple algorithm by Bagrodia: 1. Receive commands are forever enabled from all processes. 2. A send command, once enabled, remains enabled until it completes. 3. To prevent deadlock, process identifiers (PIDs) are used to introduce asymmetry. 4. Each process attempts to schedule only one send event at any time. The message types used are: (i) M, (ii) ack(M), (iii) request(M), and (iv) permission(M). When a process is blocked waiting for a particular message with which it is currently synchronizing, any other message that arrives is queued up.

The execution events in the synchronous execution are only the send of the message M and the receive of the message M. The send and receive events for the other message types, ack(M), request(M), and permission(M), which are control messages, are under the covers and are not included in the synchronous execution. The messages request(M), ack(M), and permission(M) use M’s unique tag; the message M itself is not included in these messages. We use uppercase SEND(M) and RECEIVE(M) to denote the primitives in the application execution; lowercase send and receive are used for the control messages.

The key rules to prevent cycles among the messages are summarized as follows: (a) to send to a lower-priority process, messages M and ack(M) are involved, in that order; the sender blocks waiting for the partner process to synchronize and send an acknowledgement; (b) to send to a higher-priority process, messages request(M), permission(M), and M are involved, in that order.
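The asymmetry between rules (a) and (b) can be made concrete with a tiny helper (a sketch; the convention that a lower PID means higher priority is mine and purely illustrative):

```python
def handshake(sender_pid, receiver_pid):
    """Message sequence Bagrodia's scheme uses for SEND(M), assuming a
    lower PID means higher priority (illustrative convention)."""
    if sender_pid < receiver_pid:
        # Higher-priority sender to lower-priority receiver:
        # send M directly, then block for the acknowledgement.
        return ["M", "ack(M)"]
    # Lower-priority sender to higher-priority receiver: ask first.
    return ["request(M)", "permission(M)", "M"]
```

Because one side always asks permission while the other sends directly, two processes trying to send to each other at once cannot form a cycle of mutual blocking.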

Messages used to implement synchronous order. Pi has higher priority than Pj. (a) Pi issues SEND(M). (b) Pj issues SEND(M).

Bagrodia’s Algorithm for Binary Rendezvous

Bagrodia’s Algorithm for Binary Rendezvous

Before sending M to Pi, Pj requests permission in a non-blocking manner. While waiting for this permission, there are two possibilities:

Group communication: Processes across a distributed system cooperate to solve a joint task. They need to communicate with each other as a group, and therefore there needs to be support for group communication. A message can be broadcast or multicast; both can be provided by the network protocol stack using variants of the spanning tree.

Network-layer-protocol-assisted multicast cannot efficiently provide features such as the following: application-specific ordering semantics on the order of delivery of messages; adapting groups to dynamically changing membership; sending multicasts to an arbitrary set of processes at each send event; providing various fault-tolerance semantics.

If a multicast algorithm requires the sender to be a part of the destination group, the multicast algorithm is said to be a closed-group algorithm. If the sender of the multicast can be outside the destination group, the multicast algorithm is said to be an open-group algorithm. Open-group algorithms are more general, but more difficult to design and more expensive to implement; closed-group algorithms cannot be used in several scenarios, such as in a large system. Two popular orders for the delivery of messages were proposed in the context of group communication: causal order (CO) and total order.

Causal order (CO): It has many applications, such as updating replicated data, allocating requests in a fair manner, and synchronizing multimedia streams.

TWO CRITERIA: Given a system with FIFO channels, causal order needs to be explicitly enforced by a protocol. Safety: in order to prevent causal order from being violated, a message M that arrives at a process may need to be buffered until all system-wide messages sent in the causal past of the send(M) event to that same destination have already arrived. Liveness: a message that arrives at a process must eventually be delivered to the process.

Raynal-Schiper-Toueg (RST) Algorithm (1991): It seems logical that each message M should carry a log of all other messages, or their identifiers, sent causally before M’s send event to the same destination dest(M). This log can then be examined to decide whether it is safe to deliver a message. All algorithms aim to reduce this log overhead, and the space and time overhead of maintaining the log information at each process, i.e., to reduce both the local-space and the message-space overhead.
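The safety test that the RST log enables can be sketched as a delivery predicate. Assume, as in the usual presentation of RST, that each process j keeps a DELIV vector counting messages it has delivered from each process, and that each message piggybacks the sender's SENT matrix, where SENT[k][j] is the number of messages k sent to j in the message's causal past (the variable names here are mine):

```python
def deliverable(deliv, piggybacked_sent, j):
    """RST safety test at process j: message m may be delivered iff j has
    already delivered every message that was sent to j in m's causal past.
    deliv[k]: count of messages from process k that j has delivered so far.
    piggybacked_sent[k][j]: messages k sent to j causally before m."""
    return all(deliv[k] >= row[j] for k, row in enumerate(piggybacked_sent))
```

If the test fails, m is buffered (safety) and retried as earlier messages arrive, so it is eventually delivered (liveness).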

Total Order: requires that all messages be received in the same order by the recipients of the messages.

Each process sends the message it wants to broadcast to a centralized process which simply relays all the messages it receives to every other process over FIFO channels. It is straightforward to see that total order is satisfied. Furthermore, this algorithm also satisfies causal message order

Complexity: each message transmission takes two message hops and exactly n messages in a system of n processes. Drawbacks: a centralized algorithm has a single point of failure and congestion, and is therefore not an elegant solution.
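A sketch of that centralized relay (often called a sequencer), using in-memory queues to stand in for the FIFO channels (the function name and the None shutdown sentinel are mine):

```python
import queue

def sequencer(inbox, member_channels):
    """Centralized total order: relay every message from inbox to all
    members over FIFO channels, so every member delivers in the same
    order. A None message is used here as a shutdown sentinel."""
    while True:
        msg = inbox.get()
        if msg is None:
            return
        for ch in member_channels:
            ch.put(msg)
```

Whatever interleaving of senders fills the inbox, all members see the single order the relay happened to read, which is why total order (and, over FIFO channels, causal order) holds; the cost is the single point of failure noted above.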

Three-phase distributed algorithm: The three phases of the algorithm are first described from the viewpoint of the sender, and then from the viewpoint of the receivers. Sender, Phase 1: In the first phase, a process multicasts (line 1b) the message M, with a locally unique tag and the local timestamp, to the group members.

Phase 2: In the second phase, the sender process awaits a reply from all the group members, who respond with a tentative proposal for a revised timestamp for that message M. The await call in line 1d is non-blocking, i.e., any other messages received in the meanwhile are processed. Once all expected replies are received, the process computes the maximum of the proposed timestamps for M and uses the maximum as the final timestamp.

Phase 3: In the third phase, the process multicasts the final timestamp to the group in line (1f).

Receivers, Phase 1: In the first phase, the receiver receives the message with a tentative/proposed timestamp. It updates the variable priority that tracks the highest proposed timestamp (line 2a), then revises the proposed timestamp to the priority, and places the message, with its tag and the revised timestamp, at the tail of the queue temp_Q (line 2b). In the queue, the entry is marked as undeliverable.

Phase 2: In the second phase, the receiver sends the revised timestamp (and the tag) back to the sender (line 2c). The receiver then waits in a non-blocking manner for the final timestamp (correlated by the message tag).

Phase 3 In the third phase, the final timestamp is received from the multicaster (line 3). The corresponding message entry in temp_Q is identified using the tag (line 3a), and is marked as deliverable (line 3b) after the revised timestamp is overwritten by the final timestamp (line 3c). The queue is then resorted using the timestamp field of the entries as the key (line 3c).

If the message entry is at the head of the temp_Q, that entry, and all consecutive subsequent entries that are also marked as deliverable, are dequeued from temp_Q and enqueued in deliver_Q in that order (the loop in lines 3d–3g).
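The receiver side of the three-phase algorithm can be sketched as follows. This is my compression of the logic above (tie-breaking among equal timestamps by process id is omitted for brevity); temp_q maps message tags to [timestamp, deliverable] pairs:

```python
class TotalOrderReceiver:
    """Sketch of a receiver in the three-phase total-order algorithm."""

    def __init__(self):
        self.priority = 0      # highest timestamp proposed so far
        self.temp_q = {}       # tag -> [timestamp, deliverable?]
        self.deliver_q = []    # tags, in final delivery order

    def on_revise_ts(self, tag, ts):
        # Phase 1: propose a timestamp above everything seen so far and
        # buffer the message as undeliverable; the return value is what
        # Phase 2 sends back to the sender as PROPOSED_TS.
        self.priority = max(self.priority + 1, ts)
        self.temp_q[tag] = [self.priority, False]
        return self.priority

    def on_final_ts(self, tag, ts):
        # Phase 3: overwrite with the final timestamp and mark deliverable;
        # then drain every deliverable entry at the head of the sorted queue.
        self.temp_q[tag] = [ts, True]
        while self.temp_q:
            head = min(self.temp_q, key=lambda t: self.temp_q[t][0])
            if not self.temp_q[head][1]:
                break
            self.deliver_q.append(head)
            del self.temp_q[head]
```

Replaying the example execution that follows: C proposes 7 for A's message and 9 for B's; after FINAL_TS(9) for B and FINAL_TS(10) for A arrive, C delivers B's message first and then A's, the same order every other receiver computes.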

Example execution to illustrate the algorithm: 1. A sends a REVISE_TS(7) message, having timestamp 7. B sends a REVISE_TS(9) message, having timestamp 9.

2. C receives A’s REVISE_TS(7), enters the corresponding message in temp_Q, and marks it as undeliverable; priority = 7. C then sends a PROPOSED_TS(7) message to A.

3. D receives B’s REVISE_TS(9), enters the corresponding message in temp_Q, and marks it as undeliverable; priority = 9. D then sends a PROPOSED_TS(9) message to B.

4. C receives B’s REVISE_TS(9), enters the corresponding message in temp_Q, and marks it as undeliverable; priority = 9. C then sends a PROPOSED_TS(9) message to B.

5. D receives A’s REVISE_TS(7), enters the corresponding message in temp_Q, and marks it as undeliverable; priority = 10. D assigns a tentative timestamp value of 10, which is greater than all of the timestamps on the REVISE_TSs seen so far, and then sends a PROPOSED_TS(10) message to A.

6. When A receives PROPOSED_TS(7) from C and PROPOSED_TS(10) from D, it computes the final timestamp as max(7, 10) = 10, and sends FINAL_TS(10) to C and D. The continuing sequence of main steps is as follows.

7. When B receives PROPOSED_TS(9) from C and PROPOSED_TS(9) from D, it computes the final timestamp as max(9, 9) = 9, and sends FINAL_TS(9) to C and D.

8. C receives FINAL_TS(10) from A, updates the corresponding entry in temp_Q with the timestamp, resorts the queue, and marks the message as deliverable.

9. D receives FINAL_TS(9) from B, updates the corresponding entry in temp_Q by marking the corresponding message as deliverable, and resorts the queue. As the message is at the head of the queue, it is moved to delivery_Q.
