An introduction to Spark MLlib from the Apache Spark with Scala course available at https://www.supergloo.com/fieldnotes/portfolio/apache-spark-scala/. These slides present an overview of machine learning with Apache Spark MLlib.
For more background on machine learning see my other uploaded presentation "Machine Learning with Spark".
Added: May 23, 2016
Slides: 9 pages
Slide Content
Spark MLlib
Overview
•MLlib is Spark’s library of machine learning (ML) functions
designed to run in parallel on clusters. MLlib contains a
variety of learning algorithms.
•MLlib invokes various algorithms on RDDs.
•Some classic ML algorithms are not included with Spark
MLlib because they were not designed for parallel
execution.
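As a rough illustration of invoking an MLlib algorithm on an RDD, the sketch below clusters a handful of made-up 2-D points with k-means from the RDD-based spark.mllib package. It assumes Spark core and MLlib are on the classpath and runs in local mode; the app name, the sample points, and the variable names are invented for this example and do not come from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Local context for experimentation (assumption: no cluster is available).
val conf = new SparkConf().setAppName("mllib-rdd-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Made-up 2-D points forming two well-separated groups.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
)).cache()

// The algorithm is invoked directly on the RDD: k = 2 clusters, at most 20 iterations.
val kmeansModel = KMeans.train(points, 2, 20)

// The trained model assigns a cluster id to any new point.
val nearOrigin = kmeansModel.predict(Vectors.dense(0.05, 0.05))
val farAway = kmeansModel.predict(Vectors.dense(9.05, 9.05))
println(s"Cluster ids: $nearOrigin and $farAway")

sc.stop()
```

The work of assigning points to centers is distributed across the RDD's partitions, which is the kind of parallel design the bullets above refer to.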
Overview
•Divided into two packages:
•spark.mllib contains the original API built on top of
RDDs.
•spark.ml provides a higher-level API built on top of
DataFrames.
•Using spark.ml is recommended because with
DataFrames the API is more versatile and flexible. The plan is
to keep supporting spark.mllib along with the
development of spark.ml.
Machine Learning Recap
•Machine learning algorithms try to predict or make
decisions based on training data.
•There are multiple types of learning problems,
including classification, regression, and clustering, each
with a different objective.
Spark MLlib Data Types
•MLlib contains a few specific data
types including Vector, LabeledPoint,
Rating, Matrix (local and distributed)
and various Model classes.
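A minimal sketch of how the local data types listed above can be constructed, assuming the RDD-based spark.mllib package is on the classpath. All values and variable names are made up for illustration. Distributed matrices (for example RowMatrix) are backed by RDDs and are left out here.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.mllib.regression.LabeledPoint

// Dense vector: every value is stored.
val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector of the same size: only the non-zero indices and values are stored.
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// LabeledPoint pairs a label (here 1.0) with a feature vector.
val example = LabeledPoint(1.0, dense)

// Rating holds a (user, product, rating) triple for recommendation.
val rating = Rating(1, 42, 4.5)

// Local 3x2 matrix; the values are listed in column-major order.
val matrix: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
```

None of these constructors needs a SparkContext, so they can be tried on their own before any RDD is involved.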
Recommender Systems
•Collaborative filtering is commonly used for
recommender systems.
•spark.mllib currently supports model-based
collaborative filtering, in which users and products
are described by a small set of latent factors that
can be used to predict missing entries.
•spark.mllib uses the alternating least squares
(ALS) algorithm to learn these latent factors.
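The bullets above can be sketched with the RDD-based ALS API as follows. This is a hedged example rather than the course's own code: the ratings are invented, the hyperparameter values (rank, iteration count, regularization) are arbitrary choices, and it assumes Spark core and MLlib are on the classpath with a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Local context for experimentation (assumption: no cluster is available).
val conf = new SparkConf().setAppName("als-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Made-up (user, product, rating) triples; user 3 has not rated product 12.
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 5.0), Rating(1, 11, 1.0), Rating(1, 12, 4.0),
  Rating(2, 10, 4.0), Rating(2, 11, 1.0), Rating(2, 12, 5.0),
  Rating(3, 10, 5.0), Rating(3, 11, 2.0)
))

val rank = 2            // number of latent factors per user and per product
val numIterations = 10  // number of ALS iterations
val lambda = 0.1        // regularization strength

// Learn the latent factors with alternating least squares.
val model = ALS.train(ratings, rank, numIterations, lambda)

// Predict the missing entry from the learned user and product factors.
val predicted: Double = model.predict(3, 12)
println(f"Predicted rating for user 3, product 12: $predicted%.2f")

sc.stop()
```

With so few ratings the predicted value is not meaningful; the point is only the flow from an RDD of Rating objects to a model that fills in missing entries.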