Machine Learning with Spark MLlib

1,911 views 9 slides May 23, 2016


About This Presentation

An introduction to Spark MLlib from the Apache Spark with Scala course available at https://www.supergloo.com/fieldnotes/portfolio/apache-spark-scala/. These slides present an overview of machine learning with Apache Spark MLlib.

For more background on machine learning see my other uploaded pres...


Slide Content

Spark MLlib

Overview
•MLlib is Spark’s library of machine learning (ML) functions
designed to run in parallel on clusters. MLlib contains a
variety of learning algorithms.
•MLlib invokes various algorithms on RDDs.
•Some classic ML algorithms are not included with Spark
MLlib because they were not designed for parallel
execution.
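To make the "designed to run in parallel" point concrete, here is a minimal sketch in plain Python (no Spark required). It assumes a dataset that is already split into partitions, which is a hypothetical stand-in for an RDD, and fits a one-feature least-squares line by computing partial sums per partition and then merging them. The names `partitions`, `partial_stats`, and `merge` are illustrative and are not MLlib APIs.

```python
from functools import reduce

# Hypothetical stand-in for an RDD: a dataset already split into partitions.
# Each record is an (x, y) pair generated from y = 2x + 1.
partitions = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]

def partial_stats(points):
    """Per-partition work: the sums needed for a one-feature least-squares fit."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Combine the statistics of two partitions by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

# "Map" each partition to its statistics, then "reduce" them to global totals.
n, sx, sy, sxx, sxy = reduce(merge, map(partial_stats, partitions))

# Closed-form least-squares solution from the merged totals.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
```

Because every partition is processed independently and the merge is a simple sum, this kind of computation spreads naturally across a cluster. Algorithms that cannot be decomposed this way are harder to parallelize.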

Overview
•MLlib is divided into two packages:
•spark.mllib contains the original API built on top of
RDDs.
•spark.ml provides a higher-level API built on top of
DataFrames.
•Using spark.ml is recommended because the
DataFrame-based API is more versatile and flexible. The plan is
to keep supporting spark.mllib alongside the
development of spark.ml.

Machine Learning Recap
•Machine learning algorithms try to make predictions or
decisions based on training data.
•There are multiple types of learning problems,
including classification, regression, and clustering,
each with a different objective.

Spark MLlib Data Types
•MLlib contains a few specific data
types including Vector, LabeledPoint,
Rating, Matrix (local and distributed)
and various Model classes.
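The shape of these data types can be sketched with plain Python dataclasses. The classes below are hypothetical stand-ins that only mirror the field layout of MLlib's sparse Vector, LabeledPoint, and Rating; they are not the MLlib classes themselves, and the `to_dense` helper is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class SparseVector:
    """Stand-in for a sparse vector: stores only the non-zero entries."""
    size: int
    indices: List[int]
    values: List[float]

    def to_dense(self) -> List[float]:
        # Expand to a full-length list, filling unspecified positions with 0.0.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

@dataclass(frozen=True)
class LabeledPoint:
    """Stand-in for a supervised training example: a label plus a feature vector."""
    label: float
    features: List[float]

@dataclass(frozen=True)
class Rating:
    """Stand-in for a recommender input record: one user's rating of one product."""
    user: int
    product: int
    rating: float

sv = SparseVector(size=5, indices=[1, 3], values=[2.0, 4.5])
point = LabeledPoint(label=1.0, features=sv.to_dense())
r = Rating(user=7, product=42, rating=4.0)
```

Supervised algorithms consume collections of labeled points, while the recommender consumes collections of ratings.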

MLlib Supported Supervised Algorithm Methods
•Binary Classification Problems
•linear SVMs, logistic regression, decision trees, random forests,
gradient-boosted trees, naive Bayes
•Multiclass Classification Problems
•logistic regression, decision trees, random forests, naive Bayes
•Regression Problems
•linear least squares, Lasso, ridge regression, decision trees,
random forests, gradient-boosted trees, isotonic regression
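As a rough illustration of how linear least squares, ridge regression, and Lasso differ, the sketch below fits a single weight w (no intercept) in plain Python. It assumes the textbook objectives ½·Σ(y − w·x)² for least squares, plus (λ/2)·w² for ridge or λ·|w| for Lasso. MLlib scales its objective differently, so the `lam` values here are not interchangeable with MLlib's regularization parameter, and `fit_one_feature` is a hypothetical helper.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_one_feature(xs, ys, lam=0.0, penalty="none"):
    """Closed-form weight for y ~ w * x under the chosen penalty (assumed objectives)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if penalty == "none":      # linear least squares
        return sxy / sxx
    if penalty == "ridge":     # L2 penalty shrinks the weight smoothly
        return sxy / (sxx + lam)
    if penalty == "lasso":     # L1 penalty can drive the weight exactly to zero
        return soft_threshold(sxy, lam) / sxx
    raise ValueError(f"unknown penalty: {penalty}")

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_ols = fit_one_feature(xs, ys)
w_ridge = fit_one_feature(xs, ys, lam=14.0, penalty="ridge")
w_lasso = fit_one_feature(xs, ys, lam=14.0, penalty="lasso")
w_lasso_zero = fit_one_feature(xs, ys, lam=28.0, penalty="lasso")
```

With a large enough penalty, Lasso sets the weight to zero outright, whereas ridge only shrinks it. That is the practical reason to pick one over the other.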

MLlib Supported Unsupervised Models
•K-means
•Gaussian mixture
•Power iteration clustering (PIC)
•Latent Dirichlet allocation (LDA)
•Bisecting k-means
•Streaming k-means
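For readers new to clustering, the core loop of k-means can be sketched in a few lines of plain Python. This is a one-dimensional version of the standard assign-then-update iteration with hand-picked starting centers; it is not MLlib's implementation, and `kmeans_1d` is a hypothetical name.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: group each point with its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers = kmeans_1d(points, centers=[0.0, 5.0])
```

Both steps work on each point independently, which is why the algorithm distributes well across partitions of a large dataset.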

Recommender Systems
•Collaborative filtering is commonly used for
recommender systems.
•spark.mllib currently supports model-based
collaborative filtering, in which users and products
are described by a small set of latent factors that
can be used to predict missing entries.
•spark.mllib uses the alternating least squares
(ALS) algorithm to learn these latent factors.
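The idea of latent factors and alternating least squares can be sketched without Spark. The toy example below assumes a single latent factor per user and per product (rank 1) and a handful of observed ratings, then alternates between solving for user factors with product factors fixed and the reverse. It is a simplified illustration rather than spark.mllib's ALS, and `als_rank1` and its parameters are hypothetical names.

```python
def als_rank1(ratings, n_users, n_items, iterations=50, reg=0.0):
    """Rank-1 alternating least squares over observed (user, item) -> rating entries."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors; each user's factor has a closed-form least-squares solution.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in seen)
            den = sum(item_f[i] ** 2 for i, _ in seen) + reg
            user_f[u] = num / den
        # Fix user factors; solve for each item's factor the same way.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in seen)
            den = sum(user_f[u] ** 2 for u, _ in seen) + reg
            item_f[i] = num / den
    return user_f, item_f

# Observed ratings; user 2 has not rated item 1, so that entry is missing.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0,
    (1, 0): 2.0, (1, 1): 4.0,
    (2, 0): 3.0,
}

user_f, item_f = als_rank1(ratings, n_users=3, n_items=2)

# Predict the missing entry as the product of the two learned factors.
predicted = user_f[2] * item_f[1]
```

In this toy data every user rates item 1 twice as high as item 0, so the learned factors predict a rating close to 6 for the missing entry. Real rating data is noisy, so a non-zero regularization term and more than one latent factor are typically used.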
