Spark architecture

GauravBiswas9, 21 slides, Apr 19, 2019

About This Presentation

Apache Spark architecture


Slide Content

APACHE SPARK ARCHITECTURE. Gaurav Biswas, BIT Mesra. 16-04-2019

OUTLINE: Spark & its features; Spark architecture; Resilient Distributed Datasets (RDDs); Directed Acyclic Graph (DAG); Advantages & drawbacks; Conclusion

INTRODUCTION Apache Spark: an open-source cluster-computing framework for real-time data processing. According to Spark-certified experts, Spark's performance is up to 100 times faster in memory and 10 times faster on disk compared to Hadoop. The main feature of Apache Spark is its in-memory cluster computing, which increases an application's processing speed.

FEATURES OF APACHE SPARK

FEATURES OF APACHE SPARK Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Powerful caching: a simple programming layer provides powerful caching and disk-persistence capabilities. Deployment: it can be deployed through Mesos, Hadoop via YARN, or Spark's own standalone cluster manager.

FEATURES OF APACHE SPARK Real-time: it offers real-time computation and low latency because of in-memory computation. Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R, so Spark code can be written in any of these four languages. It also provides shells in Scala and Python.
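To make the programming model concrete, here is a rough pure-Python analogue of the classic Spark word count. Real Spark code would use operators such as textFile, flatMap, and reduceByKey on a SparkContext; here plain builtins stand in for those operators so only the shape of the pipeline is shown, not the actual API.

```python
# Pure-Python sketch of a Spark-style word count (NOT the real pyspark API).
from collections import Counter

lines = ["spark is fast", "spark is polyglot"]

# flatMap analogue: split every line into words
words = [w for line in lines for w in line.split()]

# reduceByKey analogue: count occurrences per word
counts = Counter(words)

print(counts["spark"])  # 2
```

In actual Spark the same three steps run distributed across a cluster, with the per-partition counts merged by the framework.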

SPARK ARCHITECTURE Figure: Apache Spark architecture

CORE CONCEPTS

SPARK ARCHITECTURE SPARK DRIVER: a separate process that executes the user application. It creates the SparkContext to schedule job execution and negotiate with the cluster manager. EXECUTORS: run tasks scheduled by the driver, store computation results in memory, on disk, or off-heap, and interact with storage systems.
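The driver/executor split described above can be sketched in plain Python: a "driver" partitions the data and schedules one task per partition, a thread pool stands in for the executors, and the driver combines the partial results. This is a conceptual analogue only, not Spark's actual scheduler.

```python
# Conceptual sketch of driver/executor task scheduling (not Spark internals).
from concurrent.futures import ThreadPoolExecutor

data = list(range(10))
num_partitions = 2
# the "driver" splits the dataset into partitions
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def task(partition):
    # each "executor" computes a partial result on its own partition
    return sum(x * x for x in partition)

# the pool plays the role of the executors running tasks in parallel
with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    partials = list(pool.map(task, partitions))

# the "driver" merges the partial results into the final answer
result = sum(partials)
print(result)  # 285
```

In real Spark the partitions live on different machines and results flow back to the driver over the network, but the division of labour is the same.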

SPARK ARCHITECTURE CLUSTER MANAGER: the SparkContext works with the cluster manager to manage the various jobs. The driver program and SparkContext take care of job execution within the cluster.

SPARK ARCHITECTURE Apache Spark's architecture is based on two main abstractions: the Resilient Distributed Dataset (RDD) and the Directed Acyclic Graph (DAG).

Resilient Distributed Dataset (RDD)

OPERATIONS ON RDDs: RDDs support two types of operations. Transformations are operations applied to create a new RDD. Actions are applied on an RDD to instruct Apache Spark to perform the computation and pass the result back to the driver.
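The key point of the transformation/action split is lazy evaluation: transformations only record work, and an action triggers it. A minimal sketch, using a hypothetical LazyDataset class rather than the real RDD implementation:

```python
# Toy illustration of lazy transformations vs. eager actions
# (hypothetical LazyDataset class, NOT Spark's RDD implementation).
class LazyDataset:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops  # pipeline of recorded operations, not yet applied

    def map(self, f):     # transformation: just records the op, returns new dataset
        return LazyDataset(self.data, self.ops + (("map", f),))

    def filter(self, p):  # transformation: also lazy
        return LazyDataset(self.data, self.ops + (("filter", p),))

    def collect(self):    # action: runs the whole recorded pipeline now
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# nothing has executed yet; collect() is the action that triggers computation
print(rdd.collect())  # [20, 30, 40]
```

Real RDDs work the same way at a high level: chaining map or filter builds up lineage, and only an action such as collect, count, or save launches jobs on the cluster.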

DIRECTED ACYCLIC GRAPH (DAG)
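A DAG of stages determines execution order: a stage can run only after every stage it depends on has finished. A tiny sketch of that ordering using a topological sort, with hypothetical stage names (this is not Spark's actual DAGScheduler):

```python
# Toy DAG of stages, ordered so dependencies always run first
# (hypothetical stage names; not Spark's real DAGScheduler).
from graphlib import TopologicalSorter

# stage -> stages it depends on
dag = {
    "read": [],
    "map": ["read"],
    "filter": ["read"],
    "join": ["map", "filter"],
    "save": ["join"],
}

order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because the graph is acyclic, such an ordering always exists; independent stages like "map" and "filter" above could also run in parallel, which is exactly what the DAG model enables in Spark.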

ADVANTAGES & DRAWBACKS ADVANTAGES: integration with Hadoop; faster real-time stream processing. DRAWBACKS: no file-management system of its own; no support for true real-time processing (streams are handled as micro-batches); expensive (in-memory computation is costly); manual optimization required.

CONCLUSION Spark makes it easy to write and run complicated data-processing programs, and it enables computation of tasks at very large scale. Although Spark has many limitations, it is still trending in the big-data world. Because of these drawbacks, other technologies are catching up with Spark; Flink, for example, offers more complete real-time processing than Spark. In this way, other technologies are gradually overcoming the drawbacks of Spark.

THANK YOU