Map-Reduce
●It is a two-step process: map, then reduce
●Once data has been processed through map and reduce, it must be written back to storage,
which is inefficient for iterative and interactive computing jobs
Spark was designed to be fast for interactive queries and iterative algorithms, bringing in
ideas such as in-memory storage and efficient fault recovery.
Apache Spark
●Speed
●Ease of Use
What is Apache Spark?
Apache Spark is a cluster computing platform designed to be fast and general-purpose.
Its main feature is in-memory cluster computing, which increases the processing speed
of applications.
Who uses Spark, and for what?
●Data science tasks
○Analyze and model data
●Data processing applications
○Parallelize applications across a cluster
Spark Operations
●Transformation
○creates a new dataset from an existing one; transformations are evaluated lazily
●Action
○returns a value to the driver program after running a computation on the dataset (see the sketch below)
Spark Operations
●Transformations
○map / mapPartitions
○flatMap
○filter
○sortByKey
○groupByKey / reduceByKey
○union / join
○cartesian
○...
●Actions
○reduce
○count / countByKey
○foreach
○saveAs...
○first / take
○collect
○...
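A minimal PySpark sketch of the distinction (the SparkContext setup and dataset are illustrative): the transformations only describe new datasets, and nothing executes on the cluster until the action at the end.

from pyspark import SparkContext

sc = SparkContext("local[*]", "OpsDemo")
nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: lazily describe new datasets; nothing runs yet
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Action: triggers the computation and returns a value to the driver
print(evens.collect())  # [4, 16]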
Sample
Movie similarity:
# sc is an existing SparkContext; loadMovieNames() is a helper that
# builds a {movieID: movieName} dict from the MovieLens u.item file
nameDict = loadMovieNames()
data = sc.textFile("/SparkCourse/ml-100k/u.data")
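For context, a plausible sketch of the setup these snippets assume; the SparkContext creation and the body of loadMovieNames() are not shown in the slides, so treat them as assumptions consistent with the MovieLens 100k file layout.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("MovieSimilarities")
sc = SparkContext(conf=conf)

def loadMovieNames():
    # Assumed helper: u.item is pipe-delimited (movieID|title|...)
    movieNames = {}
    with open("/SparkCourse/ml-100k/u.item", encoding="ISO-8859-1") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames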
Sample
Movie similarity:
# Map ratings to key / value pairs: user ID => movie ID, rating
ratings = data.map(lambda l: l.split()).map(lambda l: (int(l[0]), (int(l[1]), float(l[2]))))
# Emit every movie rated together by the same user.
# Self-join to find every combination.
joinedRatings = ratings.join(ratings)
Sample
Movie similarity:
# At this point our RDD consists of userID => ((movieID, rating), (movieID, rating))
# Filter out duplicate pairs
uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
# Now key by (movie1, movie2) pairs.
moviePairs = uniqueJoinedRatings.map(makePairs)
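filterDuplicates and makePairs are not defined in these slides; plausible implementations, consistent with the RDD shape userID => ((movieID, rating), (movieID, rating)) described above:

def filterDuplicates(userRatings):
    # The self-join emits both (A, B) and (B, A), plus (A, A);
    # keeping only movie1 < movie2 leaves each pair exactly once
    userID, ((movie1, rating1), (movie2, rating2)) = userRatings
    return movie1 < movie2

def makePairs(userRatings):
    # Re-key by the movie pair so ratings can be grouped per pair
    userID, ((movie1, rating1), (movie2, rating2)) = userRatings
    return ((movie1, movie2), (rating1, rating2))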
Sample
Movie similarity:
# We now have (movie1, movie2) => (rating1, rating2)
# Now collect all ratings for each movie pair and compute similarity
moviePairRatings = moviePairs.groupByKey()
# We now have (movie1, movie2) => (rating1, rating2), (rating1, rating2) ...
# Can now compute similarities.
moviePairSimilarities = moviePairRatings.mapValues(computeCosineSimilarity).cache()
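computeCosineSimilarity is likewise assumed; a sketch that treats the two rating columns of a pair as vectors and returns the cosine similarity along with the number of co-ratings:

from math import sqrt

def computeCosineSimilarity(ratingPairs):
    numPairs = 0
    sum_xx = sum_yy = sum_xy = 0
    for ratingX, ratingY in ratingPairs:
        sum_xx += ratingX * ratingX
        sum_yy += ratingY * ratingY
        sum_xy += ratingX * ratingY
        numPairs += 1
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    score = sum_xy / denominator if denominator else 0
    return (score, numPairs)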
Architecture
Terms
●Driver Program: runs the application's main() function and creates the SparkContext
●Cluster Manager: allocates resources across the cluster (e.g., standalone, YARN, Mesos)
●Executor: a process on a worker node that runs tasks and keeps data in memory or on disk
●Job: a parallel computation of multiple tasks, spawned in response to an action
●Task: a unit of work sent to one executor
●Stage: a smaller set of tasks within a job, separated by shuffle boundaries
Spark Streaming
Streaming Flow:
Streaming Program Structure:
After a context is defined, you have to do the following (a minimal sketch follows the list):
1. Define the input sources by creating input DStreams.
2. Define the streaming computations by applying transformations and output operations to
DStreams.
3. Start receiving data and processing it using streamingContext.start().
4. Wait for the processing to be stopped (manually or due to any error) using
streamingContext.awaitTermination().
5. The processing can be stopped manually using streamingContext.stop().
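A minimal sketch of this structure, assuming the classic socket word count (the host, port, and 1-second batch interval are illustrative):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # batch interval of 1 second

lines = ssc.socketTextStream("localhost", 9999)      # 1. input DStream
counts = (lines.flatMap(lambda line: line.split())   # 2. transformations...
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # 2. ...and output operation

ssc.start()             # 3. start receiving and processing
ssc.awaitTermination()  # 4. wait; 5. ssc.stop() would end it manually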