Introduction to Spark (Intern Event Presentation)

Published by Databricks · 2,975 views · 15 slides · Aug 19, 2015

About This Presentation

An introduction to Apache Spark from its creator, Matei Zaharia, for the intern event hosted by Databricks.


Slide Content

Introduction to Spark
Matei Zaharia
Databricks Intern Event, August 2015

What is Apache Spark?
A fast, general-purpose computing engine for clusters that makes it easy to process large datasets
•APIs in Java, Scala, Python, and R
•Libraries for SQL, streaming, machine learning, …
•Up to 100x faster than Hadoop MapReduce for some applications

About Databricks
Founded in 2013 by the creators of Spark
Offers a hosted cloud service built on Spark
•Interactive workspace with notebooks, dashboards, and jobs

Community Growth
Most active open source project in big data
[Chart: Contributors / Month to Spark, rising steadily from 2010 to 2015]

Spark Programming Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
•Collections of objects stored in memory or on disk across a cluster
•Built via parallel transformations (map, filter, …)
•Automatically rebuilt on failure
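The transformation idea above can be sketched without a cluster. The following is a minimal plain-Python sketch (not Spark's actual implementation; `MiniRDD` is a hypothetical stand-in) showing how lazy map/filter pipelines compose and only run when an action such as `count()` or `collect()` is called:

```python
# Minimal sketch of lazy, composable transformations (plain Python,
# NOT Spark's API): nothing executes until an "action" is called.
class MiniRDD:
    def __init__(self, gen_fn):
        self._gen_fn = gen_fn  # zero-arg function producing a fresh iterator

    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._gen_fn()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._gen_fn() if pred(x)))

    def count(self):    # action: forces evaluation
        return sum(1 for _ in self._gen_fn())

    def collect(self):  # action: materializes the results
        return list(self._gen_fn())

# Hypothetical log lines standing in for an HDFS file
lines = MiniRDD(lambda: iter([
    "ERROR\tdb\tMySQL down", "INFO\tok", "ERROR\tdb\tRedis slow"]))
errors = lines.filter(lambda s: s.startswith("ERROR"))  # lazy
fields = errors.map(lambda s: s.split("\t")[2])         # still lazy
print(fields.collect())  # evaluation happens only here
# → ['MySQL down', 'Redis slow']
```

In real Spark, each transformation also records its lineage, which is what allows lost partitions to be rebuilt automatically on failure.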

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()
messages.filter(lambda s: "Redis" in s).count()
. . .

[Diagram: the driver sends tasks to workers and collects results; each worker reads a block (Block 1–3) and caches messages (Cache 1–3). lines is the base RDD, messages a transformed RDD, and count() an action.]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)

Example: Logistic Regression
Iterative algorithm used in machine learning
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for each further iteration.]
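The speedup on this slide comes from keeping the training data in memory across iterations. Below is a minimal logistic-regression loop in plain Python (illustrative only, with made-up data; not Spark code) showing the shape of the computation: every iteration re-scans the same dataset, so caching it in memory makes each iteration after the first cheap.

```python
import math

# Toy logistic regression via gradient descent (plain Python, made-up
# data). The loop scans the full dataset once per iteration -- exactly
# the access pattern that Spark's in-memory caching accelerates.
data = [((1.0, 2.0), 1), ((2.0, 0.5), 1),
        ((-1.0, -1.5), 0), ((-2.0, -0.5), 0)]
w = [0.0, 0.0]
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(30):                # iterations, as on the slide's x-axis
    grad = [0.0, 0.0]
    for (x, y) in data:            # full pass over the (cached) dataset
        p = sigmoid(w[0] * x[0] + w[1] * x[1])
        for j in range(2):
            grad[j] += (p - y) * x[j]
    for j in range(2):
        w[j] -= lr * grad[j]

preds = [sigmoid(w[0] * x[0] + w[1] * x[1]) for (x, y) in data]
print([round(p, 3) for p in preds])  # positives above 0.5, negatives below
```

In Spark, the inner pass becomes a `map` plus a `reduce` over a cached RDD, so only the first iteration pays the disk-read cost.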

On-Disk Performance
Time to sort 100 TB
2013 record (Hadoop): 72 minutes on 2,100 machines
2014 record (Spark): 23 minutes on 207 machines
Source: Daytona GraySort benchmark, sortbenchmark.org

Higher-Level Libraries
Built on the Spark core engine:
•Spark Streaming (real-time)
•Spark SQL (structured data)
•MLlib (machine learning)
•GraphX (graph processing)

Higher-Level Libraries
# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")
# Train a machine learning model
model = KMeans.train(points, 10)
# Apply it to a stream
sc.twitterStream(...) \
  .map(lambda t: (model.predict(t.location), 1)) \
  .reduceByWindow("5s", lambda a, b: a + b)
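The reduceByWindow step above can be sketched in plain Python (a hypothetical `windowed_counts` helper, not Spark Streaming's API): keep only the most recent batches and aggregate counts per key over that sliding window.

```python
from collections import Counter, deque

# Minimal sliding-window count aggregation (plain Python, NOT Spark
# Streaming's API). Each "batch" is a list of (key, 1) pairs; we
# aggregate over the last `window` batches -- the pattern behind
# reduceByWindow("5s", lambda a, b: a + b).
def windowed_counts(batches, window):
    recent = deque(maxlen=window)  # only the last `window` batches survive
    results = []
    for batch in batches:
        recent.append(batch)
        totals = Counter()
        for b in recent:
            for key, n in b:
                totals[key] += n
        results.append(dict(totals))
    return results

# Hypothetical stream of (cluster-id, 1) pairs, one list per micro-batch
batches = [
    [("cluster-3", 1), ("cluster-7", 1)],
    [("cluster-3", 1)],
    [("cluster-7", 1), ("cluster-7", 1)],
]
print(windowed_counts(batches, window=2))
```

Spark Streaming does the same aggregation distributed across the cluster, with the window expressed in time rather than batch counts.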

Demo

Spark Community
Over 1,000 production users; clusters of up to 8,000 nodes
Many talks online at spark-summit.org

Ongoing Work
Speeding up Spark through code generation
and binary processing (Project Tungsten)
R interface to Spark (SparkR)
Real-time machine learning library
Frontend and backend work in Databricks
(visualization, collaboration, auto-scaling, …)

Thank you.
We’re hiring!