Introduction to Spark (Intern Event Presentation)

Published by Databricks · 2,975 views · 15 slides · Aug 19, 2015

About This Presentation

An introduction to Apache Spark from its creator, Matei Zaharia, for the intern event hosted by Databricks.


Slide Content

Introduction to Spark
Matei Zaharia
Databricks Intern Event, August 2015

What is Apache Spark?
A fast, general-purpose computing engine for clusters that makes it easy to process large datasets
•APIs in Java, Scala, Python, and R
•Libraries for SQL, streaming, machine learning, …
•Up to 100x faster than Hadoop MapReduce for some applications

About Databricks
Founded in 2013 by the creators of Spark
Offers a hosted cloud service built on Spark
•Interactive workspace with notebooks, dashboards, and jobs

Community Growth
Most active open source project in big data
[Chart: Contributors / Month to Spark, rising steadily from 2010 to 2015]

Spark Programming Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
•Collections of objects stored in memory or on disk across a cluster
•Built via parallel transformations (map, filter, …)
•Automatically rebuilt on failure
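The transformation idea above can be sketched without a cluster. The following is a minimal plain-Python sketch (not Spark's actual implementation; `MiniRDD` is a hypothetical stand-in) showing how lazy map/filter pipelines compose and only run when an action such as `count()` or `collect()` is called:

```python
# Minimal sketch of lazy, composable transformations (plain Python,
# NOT Spark's API): nothing executes until an "action" is called.
class MiniRDD:
    def __init__(self, gen_fn):
        self._gen_fn = gen_fn  # zero-arg function producing a fresh iterator

    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._gen_fn()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._gen_fn() if pred(x)))

    def count(self):    # action: forces evaluation
        return sum(1 for _ in self._gen_fn())

    def collect(self):  # action: materializes the results
        return list(self._gen_fn())

# Hypothetical log lines standing in for an HDFS file
lines = MiniRDD(lambda: iter([
    "ERROR\tdb\tMySQL down", "INFO\tok", "ERROR\tdb\tRedis slow"]))
errors = lines.filter(lambda s: s.startswith("ERROR"))  # lazy
fields = errors.map(lambda s: s.split("\t")[2])         # still lazy
print(fields.collect())  # evaluation happens only here
# → ['MySQL down', 'Redis slow']
```

In real Spark, each transformation also records its lineage, which is what allows lost partitions to be rebuilt automatically on failure.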

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()
messages.filter(lambda s: "Redis" in s).count()
. . .

[Diagram: the driver sends tasks to workers and collects results; each worker reads a block (Block 1–3) and caches messages (Cache 1–3). lines is the base RDD, messages a transformed RDD, and count() an action.]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)

Example: Logistic Regression
Iterative algorithm used in machine learning
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for each further iteration.]
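The speedup on this slide comes from keeping the training data in memory across iterations. Below is a minimal logistic-regression loop in plain Python (illustrative only, with made-up data; not Spark code) showing the shape of the computation: every iteration re-scans the same dataset, so caching it in memory makes each iteration after the first cheap.

```python
import math

# Toy logistic regression via gradient descent (plain Python, made-up
# data). The loop scans the full dataset once per iteration -- exactly
# the access pattern that Spark's in-memory caching accelerates.
data = [((1.0, 2.0), 1), ((2.0, 0.5), 1),
        ((-1.0, -1.5), 0), ((-2.0, -0.5), 0)]
w = [0.0, 0.0]
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(30):                # iterations, as on the slide's x-axis
    grad = [0.0, 0.0]
    for (x, y) in data:            # full pass over the (cached) dataset
        p = sigmoid(w[0] * x[0] + w[1] * x[1])
        for j in range(2):
            grad[j] += (p - y) * x[j]
    for j in range(2):
        w[j] -= lr * grad[j]

preds = [sigmoid(w[0] * x[0] + w[1] * x[1]) for (x, y) in data]
print([round(p, 3) for p in preds])  # positives above 0.5, negatives below
```

In Spark, the inner pass becomes a `map` plus a `reduce` over a cached RDD, so only the first iteration pays the disk-read cost.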

On-Disk Performance
Time to sort 100 TB
2013 record (Hadoop): 72 minutes on 2,100 machines
2014 record (Spark): 23 minutes on 207 machines
Source: Daytona GraySort benchmark, sortbenchmark.org

Higher-Level Libraries
Built on the Spark core engine:
•Spark Streaming (real-time)
•Spark SQL (structured data)
•MLlib (machine learning)
•GraphX (graph processing)

Higher-Level Libraries
# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")
# Train a machine learning model
model = KMeans.train(points, 10)
# Apply it to a stream
sc.twitterStream(...) \
  .map(lambda t: (model.predict(t.location), 1)) \
  .reduceByWindow("5s", lambda a, b: a + b)
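The reduceByWindow step above can be sketched in plain Python (a hypothetical `windowed_counts` helper, not Spark Streaming's API): keep only the most recent batches and aggregate counts per key over that sliding window.

```python
from collections import Counter, deque

# Minimal sliding-window count aggregation (plain Python, NOT Spark
# Streaming's API). Each "batch" is a list of (key, 1) pairs; we
# aggregate over the last `window` batches -- the pattern behind
# reduceByWindow("5s", lambda a, b: a + b).
def windowed_counts(batches, window):
    recent = deque(maxlen=window)  # only the last `window` batches survive
    results = []
    for batch in batches:
        recent.append(batch)
        totals = Counter()
        for b in recent:
            for key, n in b:
                totals[key] += n
        results.append(dict(totals))
    return results

# Hypothetical stream of (cluster-id, 1) pairs, one list per micro-batch
batches = [
    [("cluster-3", 1), ("cluster-7", 1)],
    [("cluster-3", 1)],
    [("cluster-7", 1), ("cluster-7", 1)],
]
print(windowed_counts(batches, window=2))
```

Spark Streaming does the same aggregation distributed across the cluster, with the window expressed in time rather than batch counts.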

Demo

Spark Community
Over 1,000 production users; clusters of up to 8,000 nodes
Many talks online at spark-summit.org

Ongoing Work
Speeding up Spark through code generation
and binary processing (Project Tungsten)
R interface to Spark (SparkR)
Real-time machine learning library
Frontend and backend work in Databricks
(visualization, collaboration, auto-scaling, …)

Thank you.
We’re hiring!