Building machine learning applications locally with Spark — Joel Pinho Lucas (Tailtarget) @PAPIs Connect — São Paulo 2017

papisdotio 367 views 15 slides Jun 26, 2017

About This Presentation

With huge amounts of heterogeneous data now available, processing it and extracting knowledge takes ever more effort spent on building complex software architectures. In this context, Apache Spark provides a powerful and efficient approach to large-scale data processing. This talk will briefly in...


Slide Content

Building Machine Learning applications locally with Spark
21/06/2017
Joel Pinho Lucas

Agenda
•Problems and Motivation
•Spark and MLlib overview
•Launching applications in a Spark cluster
•Simulating a Spark cluster using Docker
•Demo: deploying a Spark cluster on a local machine
•Unit tests for Spark jobs

How to set up a Spark cluster (infra + configuration)?

Test and/or Debug a Spark job

The whole team should have the same environment

Run Spark locally with Docker
•Lightweight cluster
•One machine
•Same environment for the whole team
•Easily deployed on any platform


Easy to develop (API in Java, Scala, Python, R)

High-quality algorithms
http://spark.apache.org/mllib/

Fast to run

Lazy evaluation

In-memory storage
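
The lazy evaluation point can be illustrated without Spark at all. The sketch below is a plain-Python analogy, not Spark's API: `LazyPipeline` and `traced_double` are hypothetical names invented for this example. Transformations (`map`, `filter`) only record a step; nothing runs until an action (`collect`) is called.

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, source: Iterable[Any], steps: Tuple[Step, ...] = ()) -> None:
        self._source = source
        self._steps = steps

    def map(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        # Transformation: record the step and return a new pipeline; no work yet.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyPipeline":
        # Transformation: also only recorded.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: the recorded plan is executed only now.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls: List[int] = []


def traced_double(x: int) -> int:
    calls.append(x)  # lets us observe when the function actually runs
    return x * 2


pipeline = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has executed
result = pipeline.collect()        # [6, 8]
```

Building the plan first is what gives an engine room to optimize or skip work before anything is computed; the toy class above only shows the deferral, not any optimization.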

Spark Execution Model
http://spark.apache.org/docs/2.1.0/cluster-overview.html

Cluster Types
•Standalone
•Apache Mesos
•Hadoop YARN
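
Each cluster type is selected through the master URL passed to Spark. As a rough sketch, the helper below maps a cluster type to the corresponding URL form. The function name `master_url` is hypothetical, the ports are the usual defaults (7077 for standalone, 5050 for Mesos), and the extra `local` mode, which runs Spark in a single process, is not on the slide and is included only for comparison.

```python
def master_url(cluster_type: str, host: str = "localhost") -> str:
    """Return a master URL for the given cluster type (hypothetical helper)."""
    kind = cluster_type.lower()
    if kind == "local":
        return "local[*]"              # single process, one thread per core
    if kind == "standalone":
        return f"spark://{host}:7077"  # default standalone master port
    if kind == "mesos":
        return f"mesos://{host}:5050"  # default Mesos master port
    if kind == "yarn":
        return "yarn"                  # cluster is located via the Hadoop configuration
    raise ValueError(f"unknown cluster type: {cluster_type!r}")


standalone = master_url("standalone", host="spark-master")
```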

Starting a Cluster Manually
Manually Submitting an Application
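
The commands on this slide were not captured in the text. As a sketch under assumptions, the snippet below assembles the usual Spark 2.1 standalone commands: `sbin/start-master.sh`, `sbin/start-slave.sh <master-url>`, and `bin/spark-submit` with `--class` and `--master`. The install path `/opt/spark`, the class `com.example.FpGrowthJob`, and the file names are placeholders, not values from the talk. The code only builds the argument lists; it does not run them.

```python
import shlex
from typing import List, Optional, Sequence


def standalone_start_commands(master: str, spark_home: str = "/opt/spark") -> List[List[str]]:
    """Commands to start a standalone master and one worker (Spark 2.x script names)."""
    return [
        [f"{spark_home}/sbin/start-master.sh"],
        [f"{spark_home}/sbin/start-slave.sh", master],
    ]


def spark_submit_command(
    app_jar: str,
    main_class: str,
    master: str,
    app_args: Optional[Sequence[str]] = None,
    spark_home: str = "/opt/spark",
) -> List[str]:
    """Build the argument list for submitting a JVM application."""
    cmd = [
        f"{spark_home}/bin/spark-submit",
        "--class", main_class,
        "--master", master,
        app_jar,
    ]
    cmd.extend(app_args or [])  # application arguments come after the jar
    return cmd


master = "spark://localhost:7077"
start_cmds = standalone_start_commands(master)
submit_cmd = spark_submit_command("app.jar", "com.example.FpGrowthJob", master, ["input.txt"])

# Print shell-ready lines for copy and paste.
for argv in start_cmds + [submit_cmd]:
    print(" ".join(shlex.quote(a) for a in argv))
```

Keeping the command as a list rather than a single string makes it safe to hand to `subprocess.run` later without shell quoting problems.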

Choose your Docker Image
(or build your own and share)

Some available Spark Docker images
•https://github.com/big-data-europe/docker-spark
•https://hub.docker.com/r/internavenue/centos-spark/
•https://github.com/sequenceiq/docker-spark
•https://github.com/epahomov/docker-spark
•https://www.anchormen.nl/spark-docker/
•https://github.com/gettyimages/docker-spark
•https://hub.docker.com/r/bigdatauniversity/spark/

http://github.com/joelplucas/docker-spark

Example to Run
•MLlib's FP-Growth algorithm
•Data from the digital publishing domain
•Problem: find frequent patterns in navigation profiles
•Write results in MongoDB
http://github.com/joelplucas/fpgrowth-spark-example
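
To make the problem concrete: frequent pattern mining takes a set of transactions (here, navigation profiles) and a minimum support, and returns the itemsets that occur in at least that fraction of transactions. The sketch below is not FP-Growth and does not use MLlib; it swaps in brute-force counting in plain Python, which only works for tiny inputs, to show the shape of the input and output. The section names and the function `frequent_itemsets` are invented for illustration, and the MongoDB write is left out.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Sequence


def frequent_itemsets(
    transactions: Sequence[Iterable[str]],
    min_support: float,
    max_size: int = 3,
) -> Dict[FrozenSet[str], int]:
    """Brute-force frequent itemsets: count every subset up to max_size."""
    baskets = [frozenset(t) for t in transactions]
    min_count = min_support * len(baskets)
    counts: Counter = Counter()
    for basket in baskets:
        items = sorted(basket)
        for size in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    # Keep only itemsets that reach the minimum support.
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


# Invented navigation profiles: sections visited by four users.
profiles = [
    ["sports", "news", "tech"],
    ["sports", "news"],
    ["news", "tech"],
    ["sports", "news", "finance"],
]

patterns = frequent_itemsets(profiles, min_support=0.75)
for itemset, n in sorted(patterns.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), n)
```

FP-Growth reaches the same kind of result without enumerating every subset, which is what makes it practical on large datasets.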

The Dataset

Unit Testing using Spark Testing Base

Launched at Strata NYC 2015 by Holden Karau (and maintained by the community)

Supports unit tests in Java, Scala and Python
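
The slide's test code was not captured, so the following does not show Spark Testing Base itself. It is a minimal sketch of a complementary pattern, using only Python's standard `unittest`: keep the per-record job logic in a plain function so it can be tested without a cluster, and leave the Spark wiring to a library-backed test. The function `count_sections` and the sample data are hypothetical.

```python
import unittest
from collections import Counter
from typing import Dict, Iterable, Sequence


def count_sections(profiles: Sequence[Iterable[str]]) -> Dict[str, int]:
    """Job logic kept free of Spark: in how many profiles does each section appear?"""
    counts: Counter = Counter()
    for profile in profiles:
        counts.update(set(profile))  # count each section once per profile
    return dict(counts)


class CountSectionsTest(unittest.TestCase):
    def test_counts_each_section_once_per_profile(self) -> None:
        profiles = [["news", "news", "tech"], ["news"]]
        self.assertEqual(count_sections(profiles), {"news": 2, "tech": 1})

    def test_empty_input(self) -> None:
        self.assertEqual(count_sections([]), {})


# Run the suite programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountSectionsTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests that need a real Spark context (for example, to compare distributed collections) are where a library such as Spark Testing Base comes in.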

Q&A - Contact

LinkedIn: http://br.linkedin.com/in/joelplucas/

Email: [email protected]