Spark.pptx

PreethamMC · 58 slides · Jul 26, 2024


Slide Content

Spark

How to store the books of a 4th-year BE student?
1. BE 4th-year books – Cache
2. BE first-3-years books – Main memory
3. PUC and school books – Hard disk

What is Spark?
- A parallel processing framework
- Open source
- Developed at the University of California, Berkeley
- The team that created Spark went on to found the company Databricks in 2013
- Spark 1.0 was released on May 30, 2014
- Developed predominantly in Scala
- High-throughput applications deployed on Spark tend to perform best when developed in Scala

Overview of job execution in Hadoop
- A job submitted by the user is picked up by the NameNode and the ResourceManager.
- The job is submitted to the NameNode, and the ResourceManager (YARN) is responsible for scheduling the execution of the job on the DataNodes in the cluster.
- The DataNodes in the cluster hold the data blocks on which the user's program is executed in parallel.
(Flow: job client → NameNode → ResourceManager/YARN → DataNodes)

The Map and Reduce stages (diagram)

Disk I/O problem in Hadoop MapReduce
- The example above shows a MapReduce job involving 3 mappers on 3 input splits and 1 reducer.
- Each input split resides on the hard disk of its data node, so each mapper reading its split involves a disk read operation: 3 disk reads across the 3 mappers.
- Merging in the reduce stage involves 1 disk write operation.
- The reducer writes the final output file to HDFS, which is another disk write operation.
- In total there are a minimum of 5 disk I/O operations in this example (3 from the map stage and 2 from the reduce stage).
- The number of disk reads in the map stage equals the number of input splits.

Spark's approach to problem solving
- Spark allows the results of a computation to be saved in memory for future reuse.
- Reading data from memory is much faster than reading it from disk.
- Caching a result in memory is under the programmer's control (a sketch follows below).
- It is not always possible to keep such results entirely in memory, especially when the object is too large and memory is low; in such cases the objects need to spill to disk.
- Spark is therefore not a purely in-memory parallel processing platform.
- Spark is nonetheless typically 3x to 10x faster than Hadoop MapReduce for most jobs.
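
A minimal Scala sketch of programmer-controlled caching. The input path and SparkSession setup are illustrative assumptions; cache() and persist(StorageLevel.MEMORY_AND_DISK) are standard Spark RDD calls.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CachingSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path; replace with a real dataset.
    val lines = sc.textFile("hdfs:///data/events.log")

    // Keep the parsed result in memory for reuse across several jobs.
    val parsed = lines.map(_.split(",")).cache()

    // Or let Spark spill partitions to disk when they do not fit in memory.
    val parsedSafe = lines.map(_.split(",")).persist(StorageLevel.MEMORY_AND_DISK)

    println(parsed.count())   // first action materializes and caches the RDD
    println(parsed.count())   // second action reads from the in-memory cache
    spark.stop()
  }
}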

Understanding Spark architecture

Spark's background and history
- Hadoop batch processing is useful, but it is not the answer for every computational situation.
- Driven by the continued widespread adoption of Big Data and user hunger for additional use cases, researchers began looking at ways to improve on MapReduce's capabilities.
- This led to the development of Spark, whose designers created it with the expectation that it would work with petabytes of data distributed across clusters of thousands of servers.

Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of situations.

Spark was designed to exploit:
- Speed – by using in-memory information processing
- Scalability – it can easily be deployed over thousands of servers

Spark offers 4 primary advantages:
- Performance
- Simplicity
- Ease of administration
- Faster application development

Common use cases for Spark:
- Streaming data
- Artificial intelligence, machine learning, and deep learning
- Business intelligence
- Data integration

Streaming data
Enterprises are now guzzling a constant torrent of streaming data, fed by:
- Sensors
- Financial trades
- Mobile devices
- Medical monitors
- Website access logs
- Social media updates from Facebook, Twitter, and LinkedIn

Working directly on streaming data is different. Using the financial services example, a Spark application can analyze raw incoming credit card authorizations and completed transactions in essentially real time, rather than wading through this information after the fact. This time-saver is a game-changer for solutions like security, fraud prevention (instead of just detection), and so on.

Another use case: for example, the firm could make a fact-based business decision to reject a bogus credit card transaction at the point of sale in real time, rather than permitting the fraudster to acquire merchandise and thus trigger a loss for the bank and retailer.

Artificial intelligence, machine learning, and deep learning
- AI projects begin as Big Data problems, and the accuracy and quality of an AI model are directly affected by the quality and quantity of the data used.
- Experts estimate that emerging AI solutions will need eight to ten times the data volume used for current Big Data solutions.
- In-memory processing and highly efficient data pipelines make Spark an ideal analytics engine for data preparation, transformation, and manipulation in AI projects; a small sketch of such a preparation step follows.
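
A minimal Scala sketch of the kind of data preparation Spark is used for ahead of model training. The column names and paths are hypothetical; read.csv, na.drop, filter, withColumn, and log1p are standard Spark SQL DataFrame operations.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PrepSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PrepSketch").master("local[*]").getOrCreate()

    // Hypothetical raw training data.
    val raw = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///data/transactions.csv")

    val prepared = raw
      .na.drop()                                       // drop incomplete records
      .filter(col("amount") > 0)                       // keep only valid amounts
      .withColumn("log_amount", log1p(col("amount")))  // simple feature transformation

    prepared.write.mode("overwrite").parquet("hdfs:///data/transactions_prepared")
    spark.stop()
  }
}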

Business intelligence
- Business analysts continually seek more up-to-date, accurate visibility into the full spectrum of an organization's data.
- They also want the flexibility to change their inquiries, drill down on the data, and so on.
- Spark provides the performance necessary to allow this kind of on-the-fly analysis.

Data integration
- Most enterprises deploy an array of business systems. Some are on-premises, while others are cloud-based. Some are internally developed, while others are provided by vendors.
- Tying all this data together is complex and computationally intensive.
- Bringing Spark into the ETL loop means that you can quickly and easily combine data from multiple silos to get a better picture of the situation.
- Spark has impressive integration with numerous data storage technologies and platforms and excels at processing these types of workloads.

The Spark designers' basic philosophy
- In-memory processing is always faster than interacting with hard disks (a ten-times performance gain is a commonly accepted figure).
- This realization informed all of their architectural and product design decisions, which yielded another ten-fold improvement.
- Taken together, this made Spark solutions up to 100 times faster than earlier disk-based MapReduce applications.

Reflections
- Justify: a mobile phone is a general-purpose gadget, while a landline phone is not.
- Justify: Spark is a general-purpose distributed data processing engine.

Reflection – Compare (Figure 2 – In-memory processing)

Resilient Distributed Datasets (RDDs)
- A Resilient Distributed Dataset (RDD) is the fundamental in-memory data structure of Spark.
- An RDD is a distributed collection of objects, stored in memory or on the disks of different machines of a cluster.
- RDDs are immutable (read-only) in nature. You cannot change an original RDD, but you can create new RDDs by performing coarse-grained operations, such as transformations, on an existing RDD, as the sketch below illustrates.
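
A minimal Scala sketch of this immutability, assuming a local SparkContext and illustrative numbers: the transformation returns a new RDD while the original is left untouched.

import org.apache.spark.{SparkConf, SparkContext}

object ImmutableRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ImmutableRddSketch").setMaster("local[*]"))

    val original = sc.parallelize(Seq(1, 2, 3, 4, 5))  // distributed collection of objects
    val doubled  = original.map(_ * 2)                  // coarse-grained transformation -> new RDD

    println(original.collect().mkString(","))           // 1,2,3,4,5  (unchanged)
    println(doubled.collect().mkString(","))             // 2,4,6,8,10 (new RDD)
    sc.stop()
  }
}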

Important properties of an RDD:
- Immutable
- Resilient
- Partitioned
- Distributed and spread across multiple nodes of a cluster

Immutable
- Data stored in an RDD is read-only: you cannot edit the data that is present in the RDD.
- However, you can create new RDDs by performing transformations on the existing RDDs.

Transformation of a data object (diagram): Object 1 with partitions (1,2), (3,4), (5) is turned by a user-defined transformation into Object 2 with partitions (2,3), (4,5), (6). The data in the objects cannot be modified, because Spark objects are immutable by nature, and the data in these objects is partitioned and distributed across nodes.

Resilient – literal meaning: able to withstand or recover quickly from difficult conditions.

Execution starts only when an ACTION is invoked (see the sketch below):
- Base RDD: (1,2,3,4,5)
- map(increment) → Object 1: (2,3,4,5,6)
- filter(even numbers) → Object 2: (2,4,6)
- collect the output → display it or save it to a persistent file
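
A minimal Scala sketch of this pipeline, assuming a local SparkContext; map, filter, and collect are standard RDD operations, and nothing runs until collect is called.

import org.apache.spark.{SparkConf, SparkContext}

object LazyPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyPipeline").setMaster("local[*]"))

    val base  = sc.parallelize(Seq(1, 2, 3, 4, 5))  // Base RDD: (1,2,3,4,5)
    val incr  = base.map(_ + 1)                      // transformation only; nothing executes yet -> (2,3,4,5,6)
    val evens = incr.filter(_ % 2 == 0)              // still lazy -> (2,4,6)

    val result = evens.collect()                     // ACTION: the whole pipeline runs now
    println(result.mkString(","))                    // prints 2,4,6
    sc.stop()
  }
}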

RDDs are fault tolerant (resilient)
- Pipeline: Base RDD → RDD1 → (lost RDD) → RDD3 → RDD4 → final output
- Steps: 0. Create the base RDD; 1. Increment the data elements; 2. Filter the even numbers; 3. Pick only those divisible by 6; 4. Select only those greater than 78.
- An RDD lost or corrupted during the course of execution can be reconstructed from its lineage.

RDDs are fault tolerant (resilient), continued
- Pipeline: Base RDD → RDD1 → RDD2 → RDD3 → RDD4 → final output (steps as above: create the base RDD, increment the data elements, filter the even numbers, pick only those divisible by 6, select only those greater than 78).
- Lineage is a history of how an RDD was created from its parent RDD through a transformation.
- The steps in the transformation are re-executed to recreate a lost RDD, as sketched below.
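
A minimal Scala sketch of inspecting an RDD's lineage. toDebugString is a standard RDD method that prints the chain of parent RDDs Spark would replay to rebuild a lost partition; the pipeline mirrors the slide and is illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageSketch").setMaster("local[*]"))

    val base = sc.parallelize(1 to 100)     // 0. create the base RDD
    val rdd1 = base.map(_ + 1)              // 1. increment the data elements
    val rdd2 = rdd1.filter(_ % 2 == 0)      // 2. filter the even numbers
    val rdd3 = rdd2.filter(_ % 6 == 0)      // 3. pick only those divisible by 6
    val rdd4 = rdd3.filter(_ > 78)          // 4. select only those greater than 78

    // The lineage Spark keeps; a lost partition of rdd4 is rebuilt by replaying these steps.
    println(rdd4.toDebugString)
    println(rdd4.collect().mkString(","))   // 84,90,96
    sc.stop()
  }
}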

Scenario – Team 9 is preparing a project report
- Report template given by the department – report_template.doc
- Project report – our_proj_report.doc
- Storage – a shared folder
- Working – chapters are assigned to different members; when someone makes changes, they inform everyone else.

Spark supports 2 primary types of operation:
- Transformations
- Actions

Transformations
- These are functions that accept existing RDDs as input and output one or more new RDDs. The data in the existing RDD does not change, as it is immutable.
- Transformations are lazy: they execute only when an action is eventually called. Every time a transformation is applied, a new RDD is created.

Function – Description
map() – Returns a new RDD by applying the function to each data element
filter() – Returns a new RDD formed by selecting those elements of the source on which the function returns true

Actions
- Actions in Spark are functions that return the end result of RDD computations.
- An action uses the lineage graph to load the data into the RDD in the required order and, after all of the transformations are done, returns the final result to the Spark driver.
- Actions are operations that produce non-RDD values.

Actions – Example

Function – Description
count() – Gets the number of data elements in an RDD
first() – Retrieves the first data element of an RDD
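
A minimal Scala sketch of the two actions above on a small RDD, assuming a local SparkContext; count() and first() are standard RDD actions.

import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ActionsSketch").setMaster("local[*]"))

    val rdd = sc.parallelize(Seq(10, 20, 30, 40))

    println(rdd.count())   // action: 4
    println(rdd.first())   // action: 10
    sc.stop()
  }
}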

What brings power to Spark:
- In-memory processing
- RDDs
- Lazy evaluation

Lazy evaluation in Apache Spark
- Transformations are lazy in nature: when we call a transformation on an RDD, it does not execute immediately.
- Execution starts only when an action is triggered.

Lazy evaluation
In the example shown on the slide there is no point in evaluating 3 + 2, because it is not used. The style of evaluating only what is needed is called lazy evaluation, while the opposite (evaluating immediately) is called strict evaluation.
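
A minimal plain-Scala sketch of the same idea (not Spark-specific, purely illustrative): a lazy val is not computed until it is first used, so if it is never used it is never evaluated.

object LazyValSketch {
  def main(args: Array[String]): Unit = {
    lazy val unused = { println("evaluating 3 + 2"); 3 + 2 }  // never printed: the value is never needed
    lazy val used   = { println("evaluating 4 * 5"); 4 * 5 }  // evaluated on first use only

    println(used)   // triggers evaluation: prints "evaluating 4 * 5" then 20
    println(used)   // already evaluated: prints 20 only
  }
}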


Advantages of lazy evaluation in Spark
- Increases manageability
- Saves computation and increases speed, since only the necessary values get computed
- Enables optimization, by reducing the number of queries

Reflections – Why are RDDs made immutable?
- Immutability rules out a big set of potential problems caused by updates from multiple threads at once.
- Immutable data is definitely safe to share across processes.

Reflections – What is the advantage of lazy evaluation?

Core Spark technology components:
- Spark Core and RDDs
- Spark's libraries
- Storage systems

Spark's libraries (diagram)

Storage systems
Hadoop/Spark has two major storage deployment models:
- The "shared-nothing" storage model, in which compute and storage resources both come from commercial storage-rich servers.
- The shared-storage model, in which a shared storage system, such as IBM Elastic Storage Server (ESS), provides the storage service for the entire Hadoop/Spark cluster.

Data storage
- Shared storage is becoming popular primarily because it separates storage from compute, enabling compute and storage to grow independently.
- Spark's designers did a great job of opening it up to work with numerous file systems, commercial and open source alike.
- This is in complete concordance with Spark's mission of interacting with the full range of enterprise data.

Spark supports any Hadoop-compatible storage layer, for example (a short sketch follows below):
- The Hadoop Distributed File System (HDFS)
- IBM Spectrum Scale (upon which IBM ESS solutions are built)
- Red Hat's GlusterFS
- Amazon's S3
- Microsoft's Azure Blob Storage (WASB)
- Alluxio (formerly Tachyon)
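
A minimal Scala sketch of pointing the same Spark code at different storage layers simply by changing the URI scheme. The bucket, container, and path names are hypothetical, and reading from S3 or Azure additionally requires the matching Hadoop connector and credentials to be configured.

import org.apache.spark.sql.SparkSession

object StorageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StorageSketch").master("local[*]").getOrCreate()

    // Same API, different Hadoop-compatible storage layers (paths are illustrative).
    val fromHdfs  = spark.read.textFile("hdfs:///data/events.log")
    val fromS3    = spark.read.textFile("s3a://my-bucket/data/events.log")
    val fromAzure = spark.read.textFile("wasbs://container@account.blob.core.windows.net/data/events.log")

    println(fromHdfs.count())
    spark.stop()
  }
}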

Spark also supports other popular repositories:
- HBase
- Cassandra
- Solr
- Redshift
- MongoDB
- Elastic (Elasticsearch)
- MapR
- Couchbase


Spark's open-source challenges:
- Dynamically managing resources
- Living up to Service Level Agreements (SLAs)
- Incorporating many other data sources and frameworks
- Coping with the speed of open-source updates
- Performance demands
- Production support

Dynamically managing resources
- Dynamic allocation allows Spark to scale the cluster resources allocated to your application up and down based on the workload.
- When dynamic allocation is enabled and a Spark application has a backlog of pending tasks, it can request more executors. When the application becomes idle, its executors are released and can be acquired by other applications.
- If not precisely managed, enterprise assets can be poorly utilized, which wastes time and money and places undue burdens on administrators and operational staff. A configuration sketch follows below.
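
A minimal Scala sketch of enabling dynamic allocation when building the SparkSession. The spark.dynamicAllocation.* and shuffle-service settings are standard Spark configuration keys, while the executor bounds are illustrative and the exact setup depends on the resource manager in use.

import org.apache.spark.sql.SparkSession

object DynamicAllocationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DynamicAllocationSketch")
      // Let Spark grow and shrink the executor pool with the workload.
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")    // illustrative bounds
      .config("spark.dynamicAllocation.maxExecutors", "20")
      // Dynamic allocation normally requires an external shuffle service on the cluster.
      .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()

    // ... application logic ...
    spark.stop()
  }
}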

Service Level Agreements (SLAs)

Living up to Service Level Agreements (SLAs)
- SLAs are now a fact of life in most organizations.
- Open-source Spark can have some difficulty meeting contractual obligations related to performance, scalability, security, and resource utilization.

Incorporating many other data sources and frameworks
- Open-source Spark excels at interacting with data hosted in the Hadoop environment, but easily getting to other sources can be problematic.

Coping with the speed of open-source updates
- Spark has numerous components that require administration, including SQL and DataFrames, Spark Streaming, MLlib, and GraphX.
- Each of these components is continually being revised, and application code changes may be necessary when moving between Spark versions.
- Keeping pace with the torrent of updates is very challenging.

Performance demands
- The hardware that hosts Spark environments typically serves multiple tenants, each of which has different workloads that are subject to different SLAs.
- Because Spark applications are quite varied, it can be difficult for IT to prioritize appropriately.

Production support
- Open-source Big Data platforms, including Spark, require administrators and developers to use community resources such as blogs, Q&A forums, and meetups to get answers to their questions and issues.
- It is hard to achieve the SLAs required for mission-critical applications without dedicated and accountable support resources.