Apache Spark

About This Presentation

Big data processing engine


Slide Content

Apache Spark
Shima Jafari

Overview
●Introduction
●What is Apache Spark
●Spark stack
●RDD
●Operation
●Sample
●Architecture
●Spark Streaming
●Kafka Streaming

Map-Reduce
●It is a two-step process
●Once data has been processed through the map and reduce phases, it must be written back to storage, which makes it inefficient for iterative and interactive computing jobs
Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.

Apache Spark
●Speed
●Ease of Use

What is Apache Spark?

Apache Spark is a cluster computing platform designed to be fast and general-purpose.

The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Who uses Spark, and for what?
●Data science tasks
○Analyze and model data
●Data processing application
○Parallelize application across cluster

The Spark Stack

Resilient Distributed Dataset (RDD)
●In-memory computation
●Lazy Evaluation
●Fault Tolerance
●Immutability
●Persistence
●Partitioning
●Parallel
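
A minimal sketch of a few of these properties in PySpark; the data and names here are illustrative, not from the slides:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("RDDFeatures")
sc = SparkContext(conf=conf)

# Parallel and partitioned: the data is split across 4 partitions.
rdd = sc.parallelize(range(1, 101), 4)
print(rdd.getNumPartitions())  # 4

# Immutability: map() returns a new RDD; rdd itself is unchanged.
squares = rdd.map(lambda x: x * x)

# Persistence: keep the computed result in memory for reuse.
squares.cache()

# Lazy evaluation: nothing has run yet; this action triggers the job.
print(squares.sum())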

Spark Operation
●Transformation
○create a new dataset from an existing one
●Action
○return a value to the driver program after running a computation on the dataset.
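
For example, a hypothetical snippet, assuming an existing SparkContext sc and an input file data.txt:

# Transformation: builds a new dataset; nothing is computed yet.
lines = sc.textFile("data.txt")
lengths = lines.map(lambda line: len(line))

# Action: runs the computation and returns a value to the driver.
total = lengths.reduce(lambda a, b: a + b)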

Spark Operation
●Transformation
○Map / Map partition
○Flatmap
○Filter
○Sort by key
○Group / Reduce by key
○Union / Join
○Cartesian
○...
●Action
○Reduce
○Count / Count by key
○Foreach
○Save as...
○First / Take
○Collect
○...

Spark Transformation
●Narrow
○Map /Map Partition
○Flatmap
○Filter
○Sample
○Union
●Wide
○Join
○Intersection
○Distinct
○Reduce/GroupByKey
○Cartesian
○Repartition
○Coalesce
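
The distinction matters because wide transformations require a shuffle: records with the same key may sit in different partitions and must be moved across the network. A small illustrative sketch, assuming an existing SparkContext sc:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: each output partition depends on one input partition only.
doubled = pairs.mapValues(lambda v: v * 2)

# Wide: values with the same key must be shuffled together first.
sums = pairs.reduceByKey(lambda a, b: a + b)
print(sums.collect())  # [('a', 4), ('b', 2)] (order may vary)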

Lazy evaluation
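
Transformations only record a lineage of operations; no work happens until an action is called. A minimal sketch, with a hypothetical file name and an existing SparkContext sc assumed:

logs = sc.textFile("server.log")              # nothing is read yet
errors = logs.filter(lambda l: "ERROR" in l)  # still nothing
first_five = errors.take(5)                   # action: Spark now reads the file and filters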

Sample
Movies similarity:
nameDict = loadMovieNames()
data = sc.textFile("/SparkCourse/ml-100k/u.data")

Sample
Movies similarity:
# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split()).map(lambda l: (int(l[0]), (int(l[1]), float(l[2]))))
# Emit every movie rated together by the same user.
# Self-join to find every combination.
joinedRatings = ratings.join(ratings)

Sample
Movies similarity:
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))
# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs)

Sample
Movies similarity:
# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()
# We now have (movie1, movie2) => (rating1, rating2), (rating1, rating2) …
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).cache()
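
The setup and helper functions used in this sample (loadMovieNames, filterDuplicates, makePairs, computeCosineSimilarity) are not shown on the slides. A possible reconstruction, assuming the MovieLens ml-100k file layout; these definitions are illustrative, not the presenter's originals:

from math import sqrt
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("MovieSimilarities")
sc = SparkContext(conf=conf)

def loadMovieNames():
    # u.item is pipe-delimited: movieID|title|...
    movieNames = {}
    with open("/SparkCourse/ml-100k/u.item", encoding="latin-1") as f:
        for line in f:
            fields = line.split("|")
            movieNames[int(fields[0])] = fields[1]
    return movieNames

def filterDuplicates(userRatings):
    # Keep each pair once: drop (movie2, movie1) and (movie1, movie1).
    (movie1, rating1), (movie2, rating2) = userRatings[1]
    return movie1 < movie2

def makePairs(userRatings):
    # Re-key by the movie pair, keeping the two ratings as the value.
    (movie1, rating1), (movie2, rating2) = userRatings[1]
    return ((movie1, movie2), (rating1, rating2))

def computeCosineSimilarity(ratingPairs):
    # Cosine similarity between the two movies' rating vectors.
    numPairs = sumXX = sumYY = sumXY = 0
    for ratingX, ratingY in ratingPairs:
        sumXX += ratingX * ratingX
        sumYY += ratingY * ratingY
        sumXY += ratingX * ratingY
        numPairs += 1
    denominator = sqrt(sumXX) * sqrt(sumYY)
    score = sumXY / denominator if denominator else 0
    return (score, numPairs)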

Architecture

Terms
●Driver Program
○The process running the application's main() function and creating the SparkContext
●Cluster Manager
○An external service for acquiring resources on the cluster (e.g. standalone, YARN, Mesos)
●Executor
○A process launched on a worker node that runs tasks and keeps data in memory or on disk
●Job
○A parallel computation triggered by a Spark action
●Task
○A unit of work sent to one executor
●Stage
○A set of tasks; jobs are split into stages at shuffle boundaries

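A hypothetical driver-side configuration showing how these pieces relate; the master URL and resource values are illustrative:

from pyspark import SparkConf, SparkContext

# The driver program creates the SparkContext, which asks the
# cluster manager (here a standalone master) for executors.
conf = (SparkConf()
        .setMaster("spark://master-host:7077")  # cluster manager URL (illustrative)
        .setAppName("ArchitectureDemo")
        .set("spark.executor.memory", "2g")     # resources per executor
        .set("spark.executor.cores", "2"))
sc = SparkContext(conf=conf)

# An action submits a job; Spark splits the job into stages at
# shuffle boundaries, and each stage into tasks, one per partition.
result = (sc.parallelize(range(1000), 8)
            .map(lambda x: x % 10)
            .countByValue())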

Spark Streaming

Streaming Flow:

Streaming Program Structure:
After a context is defined, you have to do the following:
1.Define the input sources by creating input DStreams.
2.Define the streaming computations by applying transformation and output operations to DStreams.
3.Start receiving data and processing it using streamingContext.start().
4.Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
5.The processing can be manually stopped using streamingContext.stop().
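
A minimal sketch following these steps, based on the word-count example from the Spark streaming guide cited below; the host and port are illustrative:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # 1-second batches

# 1. Define the input source as a DStream.
lines = ssc.socketTextStream("localhost", 9999)

# 2. Define the computation and the output operation.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

# 3-4. Start receiving data and wait for termination.
ssc.start()
ssc.awaitTermination()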

Discretized Stream (DStream)
A DStream represents a continuous stream of data as a sequence of RDDs, one per batch interval.

Source:
●https://www.kdnuggets.com/2018/07/introduction-apache-spark.html
●https://stackoverflow.com/questions/32621990/what-are-workers-executors-cores-in-spark-standalone-cluster
●https://spark.apache.org/docs/latest/cluster-overview.html
●https://spark.apache.org/docs/latest/streaming-programming-guide.html
●https://dzone.com/articles/spark-streaming-vs-kafka-stream-1
●https://www.edureka.co/blog/spark-architecture/