Spark


Slide Content

Spark Session 7

MapReduce problems 2 You must force your pipeline into map and reduce steps, which cannot accommodate every data analysis workflow. Each MapReduce job reads from disk, which is costly for iterative algorithms (e.g. machine learning). The only native programming interface is Java.

3

Spark 4 The solution is to write a framework with the same features as MapReduce and more: it is capable of reusing the Hadoop ecosystem, offers 20 highly efficient distributed operations that can be combined in any order rather than only two steps (map and reduce), and caches data in memory rather than writing it to disk. It was also designed to make it easier for new users to write their analyses by providing access from other languages such as Scala, Python and R.
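
A minimal sketch of the in-memory caching mentioned above; the file path and application name are hypothetical examples:

from pyspark import SparkContext

sc = SparkContext(appName="CachingSketch")

# load a text file into an RDD (placeholder path)
lines = sc.textFile("/user/input/data.txt")

# cache() keeps the RDD in executor memory after the first action,
# so later jobs reuse it instead of re-reading from disk
lines.cache()

print(lines.count())   # first action: reads from disk and populates the cache
print(lines.count())   # second action: served from memory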

MapReduce vs. Spark 5

6

7

Spark architecture 8

Driver program 9 The driver program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as an independent set of processes on a cluster. To run on a cluster, the SparkContext connects to one of several types of cluster manager and then performs the following tasks: it acquires executors on nodes in the cluster; it sends your application code (JAR or Python files passed to the SparkContext) to the executors; finally, the SparkContext sends tasks to the executors to run.
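
A minimal driver-program sketch, assuming a local master; the application name is a placeholder:

from pyspark import SparkConf, SparkContext

# the driver runs main() and creates the SparkContext,
# which acquires executors and sends them tasks
conf = SparkConf().setAppName("DriverSketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100), 4)   # work is split into tasks across executors
print(rdd.sum())                      # results come back to the driver

sc.stop()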

Cluster manager 10 The role of the cluster manager is to allocate resources across applications. Spark is capable of running on a large number of clusters and supports several types of cluster manager, such as Hadoop YARN, Apache Mesos and the Standalone Scheduler. The Standalone Scheduler is Spark's own cluster manager, which makes it possible to install Spark on an empty set of machines.
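
A sketch of how the cluster manager is selected through the master URL passed to SparkConf; the host names are placeholders:

from pyspark import SparkConf

# each master URL selects a different cluster manager
conf_standalone = SparkConf().setMaster("spark://master-host:7077")  # Standalone Scheduler
conf_yarn       = SparkConf().setMaster("yarn")                      # Hadoop YARN
conf_mesos      = SparkConf().setMaster("mesos://mesos-host:5050")   # Apache Mesos
conf_local      = SparkConf().setMaster("local[*]")                  # local mode, no cluster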

Worker Node 11 The worker node is a slave node; its role is to run the application code on the cluster.

Executor (JVM) 12 An executor is a process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk across them. It reads and writes data to external sources. Every application has its own executors.
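
A sketch of configuring executor resources when building the SparkContext; the values are illustrative, not recommendations:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ExecutorSketch")
        .set("spark.executor.memory", "2g")     # memory per executor JVM
        .set("spark.executor.cores", "2")       # concurrent tasks per executor
        .set("spark.executor.instances", "3"))  # number of executors (YARN/standalone)

sc = SparkContext(conf=conf)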

RDD 13 RDD stands for "Resilient Distributed Dataset", the way Spark stores data (its data container). Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark is able to recompute missing or damaged partitions caused by node failures. Distributed, since the data resides on multiple nodes. Dataset represents the records of the data you work with. The user can load the dataset from an external source, which can be a JSON file, CSV file, text file or a database via JDBC, with no specific data structure required.
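
A sketch of loading data from the external sources listed above; the file paths and JDBC settings are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDSourcesSketch").getOrCreate()
sc = spark.sparkContext

# plain text file -> RDD of lines
text_rdd = sc.textFile("/user/input/data.txt")

# CSV and JSON are read as DataFrames and can be converted to RDDs
csv_rdd  = spark.read.csv("/user/input/data.csv", header=True).rdd
json_rdd = spark.read.json("/user/input/data.json").rdd

# database table via JDBC (URL, table and credentials are placeholders)
jdbc_rdd = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://db-host:3306/mydb")
            .option("dbtable", "mytable")
            .option("user", "user")
            .option("password", "password")
            .load()
            .rdd)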

PySpark 14 Apache Spark is written in the Scala programming language. PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark's features, such as Spark SQL, DataFrames, Streaming, MLlib (machine learning) and Spark Core.
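
A small sketch of the DataFrame and Spark SQL APIs mentioned above; the column names and rows are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkFeaturesSketch").getOrCreate()

# DataFrame API
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])
df.filter(df.age > 30).show()

# Spark SQL on the same data
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()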

From the pyspark console 15
# create an RDD with 3 partitions
integer_RDD = sc.parallelize(range(10), 3)
# collect all data from the nodes
integer_RDD.collect()
# check how the data are partitioned across the nodes
integer_RDD.glom().collect()

Read text file 16
# read a local file
text_RDD = sc.textFile("file:///home/cloudera/data.txt")
# read from HDFS
text_RDD = sc.textFile("/user/input/data.txt")
# read the first line of the RDD
text_RDD.take(1)
# read all lines
text_RDD.collect()

Word count example: Map 17
def splitLine(line):
    return line.split()

def createPairs(word):
    return (word, 1)

txt_RDD = sc.textFile("/user/input/data.txt")
pairs_RDD = txt_RDD.flatMap(splitLine).map(createPairs)

Word count example: Reduce 18
def sumCounts(a, b):
    return a + b

word_counts_RDD = pairs_RDD.reduceByKey(sumCounts)
word_counts_RDD.collect()
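
Putting the map and reduce steps together, a self-contained sketch of the word count; the input path is the same hypothetical example used above:

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

def splitLine(line):
    return line.split()

def createPairs(word):
    return (word, 1)

def sumCounts(a, b):
    return a + b

txt_RDD = sc.textFile("/user/input/data.txt")            # one element per line
pairs_RDD = txt_RDD.flatMap(splitLine).map(createPairs)  # (word, 1) pairs
word_counts_RDD = pairs_RDD.reduceByKey(sumCounts)       # sum the counts per word

print(word_counts_RDD.collect())
sc.stop()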