Spark


Slide Content

Spark Session 7

MapReduce problems 2 You must force your pipeline into map and reduce steps, which cannot accommodate every data analysis workflow. Each MapReduce job reads from disk, which is costly for iterative algorithms (e.g. machine learning). The only native programming interface is Java.

3

Spark 4 The solution is to write a framework with the same features as MapReduce and more: it is capable of reusing the Hadoop ecosystem, offers 20 highly efficient distributed operations that can be combined in any order rather than only two steps (map and reduce), and caches data in memory rather than writing it to disk. It was also designed to make it easier for new users to write their analyses by providing access from other languages such as Scala, Python and R.
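
A minimal sketch of the in-memory caching mentioned above; the file path and application name are hypothetical examples:

from pyspark import SparkContext

sc = SparkContext(appName="CachingSketch")

# load a text file into an RDD (placeholder path)
lines = sc.textFile("/user/input/data.txt")

# cache() keeps the RDD in executor memory after the first action,
# so later jobs reuse it instead of re-reading from disk
lines.cache()

print(lines.count())   # first action: reads from disk and populates the cache
print(lines.count())   # second action: served from memory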

MapReduce vs. Spark 5

6

7

Spark architecture 8

Driver program 9 The driver program is the process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark application, which runs as an independent set of processes on a cluster. To run on a cluster, the SparkContext connects to one of several types of cluster manager and then performs the following tasks: it acquires executors on nodes in the cluster; it sends your application code (JAR or Python files passed to the SparkContext) to the executors; finally, the SparkContext sends tasks to the executors to run.
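
A minimal driver-program sketch, assuming a local master; the application name is a placeholder:

from pyspark import SparkConf, SparkContext

# the driver runs main() and creates the SparkContext,
# which acquires executors and sends them tasks
conf = SparkConf().setAppName("DriverSketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(100), 4)   # work is split into tasks across executors
print(rdd.sum())                      # results come back to the driver

sc.stop()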

Cluster manager 10 The role of the cluster manager is to allocate resources across applications. Spark is capable of running on a large number of clusters and supports several types of cluster manager, such as Hadoop YARN, Apache Mesos and the Standalone Scheduler. The Standalone Scheduler is Spark's own cluster manager, which makes it possible to install Spark on an empty set of machines.
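
A sketch of how the cluster manager is selected through the master URL passed to SparkConf; the host names are placeholders:

from pyspark import SparkConf

# each master URL selects a different cluster manager
conf_standalone = SparkConf().setMaster("spark://master-host:7077")  # Standalone Scheduler
conf_yarn       = SparkConf().setMaster("yarn")                      # Hadoop YARN
conf_mesos      = SparkConf().setMaster("mesos://mesos-host:5050")   # Apache Mesos
conf_local      = SparkConf().setMaster("local[*]")                  # local mode, no cluster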

Worker Node 11 The worker node is a slave node; its role is to run the application code on the cluster.

Executor (JVM) 12 An executor is a process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk across them. It reads and writes data to external sources. Every application has its own executors.
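
A sketch of configuring executor resources when building the SparkContext; the values are illustrative, not recommendations:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("ExecutorSketch")
        .set("spark.executor.memory", "2g")     # memory per executor JVM
        .set("spark.executor.cores", "2")       # concurrent tasks per executor
        .set("spark.executor.instances", "3"))  # number of executors (YARN/standalone)

sc = SparkContext(conf=conf)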

RDD 13 RDD stands for "Resilient Distributed Dataset", the way Spark stores data (its data container). Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark is able to recompute missing or damaged partitions caused by node failures. Distributed, since the data resides on multiple nodes. Dataset represents the records of the data you work with. The user can load the dataset from an external source, which can be a JSON file, CSV file, text file or a database via JDBC, with no specific data structure required.
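
A sketch of loading data from the external sources listed above; the file paths and JDBC settings are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDSourcesSketch").getOrCreate()
sc = spark.sparkContext

# plain text file -> RDD of lines
text_rdd = sc.textFile("/user/input/data.txt")

# CSV and JSON are read as DataFrames and can be converted to RDDs
csv_rdd  = spark.read.csv("/user/input/data.csv", header=True).rdd
json_rdd = spark.read.json("/user/input/data.json").rdd

# database table via JDBC (URL, table and credentials are placeholders)
jdbc_rdd = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://db-host:3306/mydb")
            .option("dbtable", "mytable")
            .option("user", "user")
            .option("password", "password")
            .load()
            .rdd)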

PySpark 14 Apache Spark is written in the Scala programming language. PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark's features, such as Spark SQL, DataFrames, Streaming, MLlib (machine learning) and Spark Core.
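
A small sketch of the DataFrame and Spark SQL APIs mentioned above; the column names and rows are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkFeaturesSketch").getOrCreate()

# DataFrame API
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])
df.filter(df.age > 30).show()

# Spark SQL on the same data
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()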

From the pyspark console 15
# create an RDD with 3 partitions
integer_RDD = sc.parallelize(range(10), 3)
# collect all data from the nodes
integer_RDD.collect()
# check how the data are partitioned across the nodes
integer_RDD.glom().collect()

Read text file 16
# read a local file
text_RDD = sc.textFile("file:///home/cloudera/data.txt")
# read from HDFS
text_RDD = sc.textFile("/user/input/data.txt")
# read the first line of the RDD
text_RDD.take(1)
# read all lines
text_RDD.collect()

Word count example: Map 17
def splitLine(line):
    return line.split()

def createPairs(word):
    return (word, 1)

txt_RDD = sc.textFile("/user/input/data.txt")
pairs_RDD = txt_RDD.flatMap(splitLine).map(createPairs)

Word count example: Reduce 18
def sumCounts(a, b):
    return a + b

word_counts_RDD = pairs_RDD.reduceByKey(sumCounts)
word_counts_RDD.collect()
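
Putting the map and reduce steps together, a self-contained sketch of the word count; the input path is the same hypothetical example used above:

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

def splitLine(line):
    return line.split()

def createPairs(word):
    return (word, 1)

def sumCounts(a, b):
    return a + b

txt_RDD = sc.textFile("/user/input/data.txt")            # one element per line
pairs_RDD = txt_RDD.flatMap(splitLine).map(createPairs)  # (word, 1) pairs
word_counts_RDD = pairs_RDD.reduceByKey(sumCounts)       # sum the counts per word

print(word_counts_RDD.collect())
sc.stop()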