Building machine learning applications locally with Spark — Joel Pinho Lucas (Tailtarget) @PAPIs Connect — São Paulo 2017
About This Presentation
In an era of huge amounts of heterogeneous data, processing it and extracting knowledge from it require ever more effort in building complex software architectures. In this context, Apache Spark provides a powerful and efficient approach to large-scale data processing. This talk briefly introduces a powerful machine learning library (MLlib) along with a general overview of the Spark framework, describing how to launch applications within a cluster. A demo then shows how to simulate a Spark cluster on a local machine using images available in a public Docker Hub repository. Finally, another demo shows how to save time by using unit tests to validate jobs before running them in a cluster.
Size: 1.05 MB
Language: en
Added: Jun 26, 2017
Slides: 15 pages
Slide Content
Building Machine Learning applications locally with Spark
21/06/2017
Joel Pinho Lucas
Agenda
•Problems and Motivation
•Spark and MLlib overview
•Launching applications in a Spark cluster
•Simulating a Spark cluster using Docker
•Demo: deploying a Spark cluster in a local machine
•Unit tests for Spark jobs
•How to set up a Spark cluster (infra + configuration)?
•How to test and/or debug a Spark job?
•The whole team should have the same environment
Run Spark Locally with Docker
•Lightweight cluster
•One machine
•Same environment for the whole team
•Deployed easily on any platform
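As a sketch of what this looks like in practice, using any of the public images listed later in the deck (the image name, tag, and worker arguments below are placeholders, not a specific image's actual interface; check the chosen image's README for the exact invocation):

```shell
# Pull a Spark image (placeholder name/tag)
docker pull some-org/spark:2.1.0

# Start a standalone master container, exposing the master port (7077)
# and its web UI (8080)
docker run -d --name spark-master -h spark-master \
  -p 7077:7077 -p 8080:8080 \
  some-org/spark:2.1.0 master

# Start a worker container on the same machine, pointed at the master
docker run -d --name spark-worker-1 --link spark-master:spark-master \
  some-org/spark:2.1.0 worker spark://spark-master:7077
```

Because everything runs in containers, every team member gets an identical cluster from the same image, on any platform that runs Docker.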
Spark and MLlib Overview
•Easy to develop (APIs in Java, Scala, Python, R)
•High-quality algorithms
•Fast to run
•Lazy evaluation
•In-memory storage
http://spark.apache.org/mllib/
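Lazy evaluation means Spark builds up a plan of transformations and only executes it when an action requests a result. The same idea can be illustrated with plain Python generators, no Spark required (this is an analogy, not Spark's API):

```python
# Lazy evaluation illustrated with plain Python generators: like Spark
# transformations, a generator pipeline is only a plan -- work happens
# when a result (an "action") is requested.
evaluated = []

def double(x):
    evaluated.append(x)  # record that work actually happened
    return x * 2

# "Transformation": builds the pipeline, runs nothing yet
pipeline = (double(x) for x in range(1, 11))
print(evaluated)  # [] -- no work done so far

# "Action": forces evaluation, and only of the elements needed
result = [next(pipeline) for _ in range(3)]
print(result)     # [2, 4, 6]
print(evaluated)  # [1, 2, 3] -- the remaining elements were never computed
```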
Spark Execution Model
http://spark.apache.org/docs/2.1.0/cluster-overview.html
Starting a Cluster Manually
Manually Submitting an Application
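A sketch of what these two steps look like with the standalone scripts shipped with Spark 2.1 (the host name and the application class/jar are placeholders):

```shell
# On the master node: start the standalone master
# (it logs a spark://<host>:7077 URL for workers and spark-submit)
$SPARK_HOME/sbin/start-master.sh

# On each worker node: start a worker and register it with the master
$SPARK_HOME/sbin/start-slave.sh spark://master-host:7077

# From a client machine: submit an application to the cluster
$SPARK_HOME/bin/spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  path/to/my-app.jar
```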
Choose your Docker Image (or build your own and share)
Some available Spark Docker Images
•https://github.com/big-data-europe/docker-spark
•https://hub.docker.com/r/internavenue/centos-spark/
•https://github.com/sequenceiq/docker-spark
•https://github.com/epahomov/docker-spark
•https://www.anchormen.nl/spark-docker/
•https://github.com/gettyimages/docker-spark
•https://hub.docker.com/r/bigdatauniversity/spark/
http://github.com/joelplucas/docker-spark
Example to Run
•MLlib's FP-Growth algorithm
•Data from the digital publishing domain
•Problem: find frequent patterns in navigation profiles
•Write results to MongoDB
http://github.com/joelplucas/fpgrowth-spark-example
The Dataset
Unit Testing using Spark Testing Base
•Launched at Strata NYC 2015 by Holden Karau (and maintained by the community)
•Supports unit tests in Java, Scala and Python