Map-Reduce
●It is a two-step process: map, then reduce
●Once data has been processed through map and reduce, it must be written back to storage,
which is inefficient for iterative and interactive computing jobs
Spark was designed to be fast for interactive queries and iterative algorithms, bringing in
ideas such as in-memory storage and efficient fault recovery.
Apache Spark
●Speed
●Ease of Use
What is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general-purpose.
Its main feature is in-memory cluster computing, which increases the processing speed
of applications.
Who uses Spark, and for what?
●Data science tasks
○Analyze and model data
●Data processing applications
○Parallelize applications across a cluster
Spark Operations
●Transformation
○creates a new dataset from an existing one; transformations are evaluated lazily
●Action
○returns a value to the driver program after running a computation on the dataset (see the sketch below)
Spark Operations
●Transformations
○map / mapPartitions
○flatMap
○filter
○sortByKey
○groupByKey / reduceByKey
○union / join
○cartesian
○...
●Actions
○reduce
○count / countByKey
○foreach
○saveAs...
○first / take
○collect
○...
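A minimal PySpark sketch of the distinction (the SparkContext setup and dataset are illustrative): the transformations only describe new datasets, and nothing executes on the cluster until the action at the end.

from pyspark import SparkContext

sc = SparkContext("local[*]", "OpsDemo")
nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: lazily describe new datasets; nothing runs yet
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Action: triggers the computation and returns a value to the driver
print(evens.collect())  # [4, 16]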
Sample
Movie similarity:
# sc is an existing SparkContext; loadMovieNames() is a helper that
# builds a {movieID: movieName} dict from the MovieLens u.item file
nameDict = loadMovieNames()
data = sc.textFile("/SparkCourse/ml-100k/u.data")
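For context, a plausible sketch of the setup these snippets assume; the SparkContext creation and the body of loadMovieNames() are not shown in the slides, so treat them as assumptions consistent with the MovieLens 100k file layout.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("MovieSimilarities")
sc = SparkContext(conf=conf)

def loadMovieNames():
    # Assumed helper: u.item is pipe-delimited (movieID|title|...)
    movieNames = {}
    with open("/SparkCourse/ml-100k/u.item", encoding="ISO-8859-1") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames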
Sample
Movie similarity:
# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split()).map(lambda l: (int(l[0]), (int(l[1]), float(l[2]))))
# Emit every movie rated together by the same user.
# Self-join to find every combination.
joinedRatings = ratings.join(ratings)
Sample
Movie similarity:
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))
# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs)
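filterDuplicates and makePairs are not defined in these slides; plausible implementations, consistent with the RDD shape userID => ((movieID, rating), (movieID, rating)) described above:

def filterDuplicates(userRatings):
    # The self-join emits both (A, B) and (B, A), plus (A, A);
    # keeping only movie1 < movie2 leaves each pair exactly once
    userID, ((movie1, rating1), (movie2, rating2)) = userRatings
    return movie1 < movie2

def makePairs(userRatings):
    # Re-key by the movie pair so ratings can be grouped per pair
    userID, ((movie1, rating1), (movie2, rating2)) = userRatings
    return ((movie1, movie2), (rating1, rating2))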
Sample
Movie similarity:
# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()
# We now have (movie1, movie2) => (rating1, rating2), (rating1, rating2) ...
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).cache()
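computeCosineSimilarity is likewise assumed; a sketch that treats the two rating columns of a pair as vectors and returns the cosine similarity along with the number of co-ratings:

from math import sqrt

def computeCosineSimilarity(ratingPairs):
    numPairs = 0
    sum_xx = sum_yy = sum_xy = 0
    for ratingX, ratingY in ratingPairs:
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = sum_xy / denominator if denominator else 0
    return (score, numPairs)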
Architecture
Terms
●Driver Program: runs the application's main() function and creates the SparkContext
●Cluster Manager: allocates resources across the cluster (e.g., standalone, YARN, Mesos)
●Executor: a process on a worker node that runs tasks and keeps data in memory or on disk
●Job: a parallel computation of multiple tasks, spawned in response to an action
●Task: a unit of work sent to one executor
●Stage: a smaller set of tasks within a job, separated by shuffle boundaries
Spark Streaming
Streaming Flow:
Streaming Program Structure:
After a context is defined, you have to do the following (a minimal sketch follows the list):
1. Define the input sources by creating input DStreams.
2. Define the streaming computations by applying transformations and output operations to
DStreams.
3. Start receiving data and processing it using streamingContext.start().
4. Wait for the processing to be stopped (manually or due to any error) using
streamingContext.awaitTermination().
5. The processing can be stopped manually using streamingContext.stop().
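A minimal sketch of this structure, assuming the classic socket word count (the host, port, and 1-second batch interval are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # batch interval of 1 second

lines = ssc.socketTextStream("localhost", 9999)      # 1. input DStream
counts = (lines.flatMap(lambda line: line.split())   # 2. transformations...
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # 2. ...and output operation

ssc.start()             # 3. start receiving and processing
ssc.awaitTermination()  # 4. wait; 5. ssc.stop() would end it manually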