Machine Learning with Spark MLlib

9 slides, May 23, 2016

About This Presentation

An introduction to Spark MLlib from the Apache Spark with Scala course available at https://www.supergloo.com/fieldnotes/portfolio/apache-spark-scala/. These slides present an overview of machine learning with Apache Spark MLlib.

For more background on machine learning see my other uploaded pres...


Slide Content

Spark MLlib

Overview
•MLlib is Spark’s library of machine learning (ML) functions, designed to run in parallel on clusters. It contains a variety of learning algorithms.
•MLlib invokes these algorithms on RDDs.
•Some classic ML algorithms are not included in Spark MLlib because they were not designed for parallel execution.
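
To make "designed to run in parallel" concrete, here is a minimal plain-Python sketch, not Spark code. It assumes the data is already split into partitions (a stand-in for an RDD), computes a partial gradient per partition, and then combines the partial results. The function names `partial_gradient`, `combine`, and `fit` are hypothetical and chosen only for illustration.

```python
from functools import reduce

def partial_gradient(partition, w):
    """Sum of squared-error gradients for the model y = w * x over one partition."""
    grad, count = 0.0, 0
    for x, y in partition:
        grad += (w * x - y) * x
        count += 1
    return grad, count

def combine(a, b):
    """Merge two partial results; associative, so the merge order does not matter."""
    return a[0] + b[0], a[1] + b[1]

def fit(partitions, steps=200, lr=0.05):
    """Gradient descent where each step is a per-partition 'map' plus a 'reduce'."""
    w = 0.0
    for _ in range(steps):
        # On a cluster each partition would be processed by a different worker.
        partials = [partial_gradient(p, w) for p in partitions]
        grad, n = reduce(combine, partials)
        w -= lr * grad / n
    return w

# Toy data following y = 2x, split into three "partitions".
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]
w = fit(partitions)
print(round(w, 4))
```

An algorithm fits this pattern when its per-record work can be summed or merged independently of order, which is one way to read why some classic algorithms were left out.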

Overview
•MLlib is divided into two packages:
•spark.mllib contains the original API, built on top of RDDs.
•spark.ml provides a higher-level API, built on top of DataFrames.
•Using spark.ml is recommended because the DataFrame-based API is more versatile and flexible. The plan is to keep supporting spark.mllib alongside the development of spark.ml.

Machine Learning Recap
•Machine learning algorithms try to make predictions or decisions based on training data.
•There are multiple types of learning problems, including classification, regression, and clustering, each with a different objective.

Spark MLlib Data Types
•MLlib contains a few specific data
types including Vector, LabeledPoint,
Rating, Matrix (local and distributed)
and various Model classes.
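
As a rough picture of what these types hold, the sketch below models three of them as plain Python dataclasses. These are illustrative stand-ins, not the Spark classes: the field layout (a sparse vector as size, indices, and values; a labeled point as label plus features; a rating as user, product, and rating) is meant to mirror the MLlib types, and the `to_dense` helper is an assumption added for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class SparseVector:
    """Stand-in for a sparse feature vector: only non-zero entries are stored."""
    size: int
    indices: List[int]
    values: List[float]

    def to_dense(self) -> List[float]:
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

@dataclass(frozen=True)
class LabeledPoint:
    """Stand-in for a supervised-learning example: a label plus a feature vector."""
    label: float
    features: SparseVector

@dataclass(frozen=True)
class Rating:
    """Stand-in for one observation used by recommender algorithms."""
    user: int
    product: int
    rating: float

point = LabeledPoint(1.0, SparseVector(4, [0, 3], [1.5, 2.0]))
dense = point.features.to_dense()
rating = Rating(user=7, product=42, rating=4.5)
print(dense, rating)
```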

MLlib Supported Supervised Algorithm Methods
•Binary classification problems
•linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes
•Multiclass classification problems
•logistic regression, decision trees, random forests, naive Bayes
•Regression problems
•linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression

MLlib Supported Unsupervised Models
•K-means
•Gaussian mixture
•Power iteration clustering (PIC)
•Latent Dirichlet allocation (LDA)
•Bisecting k-means
•Streaming k-means
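
For readers new to clustering, the first item in the list can be sketched in a few lines of plain Python. This is a single-machine illustration of the basic k-means iteration (assign each point to its nearest center, then move each center to the mean of its points), not MLlib's implementation. The function names and the fixed starting centers are assumptions made for the example.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )

def kmeans(points, centers, iterations=10):
    """Basic k-means: alternate assignment and center-update steps."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centers)
        ]
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
assignments = [closest(p, centers) for p in points]
print(centers, assignments)
```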

Recommender Systems
•Collaborative filtering is commonly used for
recommender systems.
•spark.mllib currently supports model-based
collaborative filtering, in which users and products
are described by a small set of latent factors that
can be used to predict missing entries.
•spark.mllib uses the alternating least squares
(ALS) algorithm to learn these latent factors.
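
The idea of latent factors and alternating updates can be illustrated with a deliberately small plain-Python sketch. It assumes a single latent factor per user and per product (rank 1), so each least-squares solve reduces to one division. This is not the Spark implementation, and the function name `als_rank1`, the regularization value, and the toy ratings are all assumptions for the example.

```python
def als_rank1(ratings, n_users, n_products, lam=0.001, iterations=100):
    """Alternating least squares with one latent factor per user and product.

    ratings maps (user, product) -> observed rating; missing pairs are absent.
    """
    user_f = [1.0] * n_users
    prod_f = [1.0] * n_products
    for _ in range(iterations):
        # Fix product factors, solve the regularized least-squares problem per user.
        for u in range(n_users):
            num = sum(r * prod_f[p] for (uu, p), r in ratings.items() if uu == u)
            den = lam + sum(prod_f[p] ** 2 for (uu, p) in ratings if uu == u)
            user_f[u] = num / den
        # Fix user factors, solve the same kind of problem per product.
        for p in range(n_products):
            num = sum(r * user_f[u] for (u, pp), r in ratings.items() if pp == p)
            den = lam + sum(user_f[u] ** 2 for (u, pp) in ratings if pp == p)
            prod_f[p] = num / den
    return user_f, prod_f

# Toy data: user 1 rates everything twice as high as user 0.
# The pair (user 1, product 2) is missing and is what we want to predict.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 2.5, (1, 0): 2.0, (1, 1): 4.0}
user_f, prod_f = als_rank1(ratings, n_users=2, n_products=3)
predicted = user_f[1] * prod_f[2]
print(round(predicted, 2))
```

The missing entry is predicted as the product of the two learned factors; with a higher rank each update would instead solve a small linear system per user or product.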

For more, visit https://supergloo.com