APACHE SPARK ARCHITECTURE
Gaurav Biswas, BIT Mesra
16-04-2019
OUTLINE
- SPARK & ITS FEATURES
- SPARK ARCHITECTURE
- RESILIENT DISTRIBUTED DATASETS (RDDs)
- DIRECTED ACYCLIC GRAPH (DAG)
- ADVANTAGES & DRAWBACKS
- CONCLUSION
INTRODUCTION
- Apache Spark is an open-source cluster computing framework for real-time data processing
- Spark is reported to run up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce
- Its main feature is in-memory cluster computing, which increases the processing speed of an application
FEATURES OF APACHE SPARK
- Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing
- Powerful caching: a simple programming layer provides powerful caching and disk persistence capabilities
- Deployment: Spark can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
- Real-time: Spark offers real-time computation and low latency because of in-memory computation
- Polyglot: Spark provides high-level APIs in Java, Scala, Python, and R, so Spark code can be written in any of these four languages. It also provides a shell in Scala and Python
SPARK ARCHITECTURE
SPARK DRIVER:
- Runs as a separate process that executes the user application
- Creates the SparkContext, which schedules job execution and negotiates with the cluster manager
EXECUTORS:
- Run the tasks scheduled by the driver
- Store computation results in memory, on disk, or off-heap
- Interact with storage systems
CLUSTER MANAGER:
- The SparkContext works with the cluster manager to manage the various jobs
- The driver program and the SparkContext take care of job execution within the cluster
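The division of work between driver and executors can be sketched in plain Python. This is only a toy model under stated assumptions, not Spark code: the function `driver`, its parameters, and the use of a thread pool as a stand-in for executors granted by a cluster manager are all hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def driver(data, num_partitions, task, num_executors=2):
    """Toy driver: split the data, schedule one task per partition, gather results."""
    # One partition per task; striding keeps the partitions roughly equal in size.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # The thread pool stands in for executors (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # map() preserves partition order, so results line up with partitions.
        return list(executors.map(task, partitions))


# Each "executor" sums its own partition; the "driver" combines the partial sums.
partial_sums = driver(list(range(1, 11)), num_partitions=3, task=sum)
total = sum(partial_sums)
print(partial_sums, total)
```

In a real cluster the tasks would run in separate executor processes on worker nodes, and the cluster manager would decide how many executors the application gets.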
Apache Spark architecture is based on two main abstractions:
- Resilient Distributed Dataset (RDD)
- Directed Acyclic Graph (DAG)
RESILIENT DISTRIBUTED DATASET (RDD)
OPERATIONS ON RDDs
RDDs support two types of operations:
- Transformations: operations applied to an existing RDD to create a new RDD; they are evaluated lazily
- Actions: operations applied to an RDD that instruct Spark to run the computation and pass the result back to the driver
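The difference between the two kinds of operation can be shown with a small stand-in class. This is a minimal sketch, not the Spark API: `ToyRDD` is a hypothetical name, and it only mimics the idea that transformations (`map`, `filter`) record work and return a new object, while an action (`collect`, `count`) actually runs that work.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; it only records and replays operations."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD(self._source, self._ops + (("filter", predicate),))

    def collect(self):
        # Action: replay every recorded operation and return the result.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        # Action: runs the computation and returns the number of elements.
        return len(self.collect())


calls = []  # records which inputs were actually processed


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
computed_before_action = len(calls)  # still 0: nothing has run
result = even_squares.collect()      # the action triggers the computation
print(computed_before_action, result)
```

Note that `numbers` itself is left unchanged by the transformations; each one yields a new object, which mirrors the statement that a transformation creates a new RDD.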
DIRECTED ACYCLIC GRAPH (DAG)
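As a rough illustration of what "directed" and "acyclic" mean for a chain of operations, the sketch below uses Python's standard `graphlib` module. It is not how Spark builds or schedules its DAG: the `lineage` dictionary, its node names, and the helper `is_acyclic` are hypothetical and only show that a DAG always admits an order in which every step runs after the steps it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical lineage: each key depends on the nodes in its value set.
lineage = {
    "lines": set(),
    "words": {"lines"},
    "pairs": {"words"},
    "counts": {"pairs"},
    "stopwords": set(),
    "filtered": {"counts", "stopwords"},
}


def is_acyclic(graph):
    """Return True if the dependency graph has no cycle."""
    try:
        # static_order() is a generator, so list() forces the cycle check.
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# A valid execution order: every node appears after all of its dependencies.
order = list(TopologicalSorter(lineage).static_order())
print(order)
print(is_acyclic(lineage), is_acyclic({"a": {"b"}, "b": {"a"}}))
```

The exact order printed may vary between independent nodes such as `lines` and `stopwords`; only the relative order of dependent nodes is guaranteed.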
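ADVANTAGES & DRAWBACKS
ADVANTAGES:
- Integration with Hadoop
- Faster processing than Hadoop MapReduce
- Near-real-time stream processing
DRAWBACKS:
- No file management system of its own
- No true real-time processing (streams are handled as micro-batches)
- Not cost-effective: in-memory processing needs a lot of RAM
- Manual optimization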
CONCLUSION
- Spark makes it easy to write and run complicated data processing
- It enables computation at a very large scale
- Although Spark has several limitations, it is still popular in the big data world
- Because of these drawbacks, other technologies are gaining ground on Spark; for example, Flink offers true real-time stream processing, which Spark does not
- In this way, other technologies are addressing the drawbacks of Spark