Big Data Analytics is not something which was just invented yesterday!

bhargavi804095, 29 slides, Apr 29, 2024

About This Presentation

Big Data Analytics is not something which was just invented yesterday! We have had success in this domain with Hadoop and the Map-Reduce paradigm.


Slide Content

Lesson 17

Mahout
2019
“Big Data Analytics “, Ch.06 L17: Machine Learning ...for... analytics,
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
1

Mahout
•Fast and efficient processing of Big Data
•Processes very large datasets (above 10 M data points) with high efficiency on a cluster of machines in a shared-nothing environment
•Big Data analysis: datasets with over 1 million data points

Mahout
•Can run on multiple machines as well as on multi-core processing units
•Computes tasks fast when they are distributed over the cloud: the tasks run in parallel in a shared-nothing computational environment of multiple machines and multi-core units

Mahout in sequential shared environment
•Higher time efficiency for datasets of less than 1 M data points

Installing Eclipse and Maven for Mahout
•Download Maven and Eclipse (Eclipse: https://www.eclipse.org/downloads/)
•In Eclipse, add the m2eclipse plug-in site with Name: m2eclipse and Location: http://download.eclipse.org/technology/m2e/releases (search the web for “install m2eclipse”)
•Ubuntu:
sudo apt-get install git
sudo apt-get install maven

Installing Mahout
•Downloads page: http://mahout.apache.org/general/downloads.html
•Latest release: 0.13.0 (mahout-distribution-0.13.tar.gz)
•Apache was preparing Mahout version 0.14.0 (June 2018)

GitHub
•GitHub: Mahout mirror https://github.com/apache/mahout
•Clone with: git clone git://git.apache.org/mahout.git mahout-trunk

GitHub: Maven dependency under the mahout-trunk folder:
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>${mahout.version}</version>
</dependency>


Mahout Environment Setting
•export MAHOUT_HOME=/path/to/mahout
•export MAHOUT_LOCAL=true # for running standalone on your dev machine
•# unset MAHOUT_LOCAL for running on a cluster

Features in Mahout 0.13.0 (2017)
•Enables easier implementation of most modern machine learning and deep learning algorithms
•Integrates ViennaCL, an open-source scalable linear algebra library, through the Java wrapper library interface JavaCPP

Features in Mahout 0.13.0 (2017)
•Brings CUDA bindings, from the graphics processor manufacturer NVIDIA, directly into Mahout
•Makes it easier to run matrix mathematics on graphics cards (used in computers for fast graphics computations)


Java Implementations
•Java libraries for common mathematical and statistical operations (focused on linear algebra) and for primitive Java collections

Figure 6.21 Mahout architecture


Features of Mahout
•A scalable, generalized tensor and linear algebra solving engine, designed on top of Apache Hadoop
•Supports algebraic back-end platforms such as the fast-computing Apache Spark and MapReduce
•Distributed row matrix (DRM) and scalable libraries for matrices and vectors

Random Access Feature
•Random access means accessing the elements of a vector by key, index or hash, followed by their values, in any order
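The random-access idea above can be illustrated with a hash-backed sparse vector. This is a minimal Python sketch and not Mahout's API: the class name `SparseVector` and its methods `set`, `get` and `dot` are hypothetical names chosen for the illustration.

```python
class SparseVector:
    """Hypothetical hash-backed sparse vector (illustration only, not Mahout's API)."""

    def __init__(self, size):
        self.size = size
        self._values = {}  # index -> non-zero value

    def set(self, index, value):
        if not 0 <= index < self.size:
            raise IndexError(index)
        if value == 0.0:
            self._values.pop(index, None)  # store only non-zero entries
        else:
            self._values[index] = value

    def get(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return self._values.get(index, 0.0)

    def dot(self, other):
        # iterate over whichever vector has fewer non-zero entries
        small, large = sorted((self._values, other._values), key=len)
        return sum(v * large.get(i, 0.0) for i, v in small.items())


v = SparseVector(1_000_000)
v.set(999_999, 2.5)  # write the last element first ...
v.set(7, 4.0)        # ... then an early one: any order works
w = SparseVector(1_000_000)
w.set(7, 0.5)
```

Because the values live in a hash map, reading or writing any index costs about the same regardless of the order of access, and memory grows with the number of non-zero entries rather than with the vector length.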

Clustering Implementations
•Contains several Spark- and MapReduce-enabled clustering algorithms, such as k-means, fuzzy k-means, Canopy, Dirichlet, mean shift and Latent Dirichlet Allocation
•Spectral clustering, MinHash clustering
•Hierarchical clustering
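To make the first algorithm in the list concrete, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is not Mahout's implementation: the function name `kmeans`, the seeding with the first k points and the toy data are assumptions made for the illustration.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid by squared Euclidean distance
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


# hypothetical toy data: two well-separated groups of 2-D points
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = kmeans(points, k=2)
```

A distributed version would typically run the assignment step in parallel over partitions of the data and combine the per-cluster sums afterwards; this sketch keeps everything in one process.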


Classification Implementations
•Single-machine sequential: Stochastic Gradient Descent (SGD), Logistic Regression trained via SGD, Hidden Markov Models, Multi-layer Perceptron
•Parallel distributed (MapReduce): Naïve Bayes, Complementary Naïve Bayes and Random Forest

Collaborative Filtering
•Enables automatic predictions about the items and item sets that are likely to interest a user
•Mining of similar item sets
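One simple way to picture collaborative filtering is item co-occurrence: items that often appear together for the same users are treated as similar. The Python sketch below uses plain co-occurrence counts on hypothetical toy data; the names `baskets` and `recommend` are assumptions, and this is not Mahout's recommender API.

```python
from collections import Counter
from itertools import combinations

# hypothetical toy data: the set of items each user has interacted with
baskets = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B"},
    "cat": {"B", "C"},
    "dan": {"A"},
}

# count how often each ordered pair of items occurs for the same user
cooccurrence = Counter()
for items in baskets.values():
    for x, y in combinations(sorted(items), 2):
        cooccurrence[(x, y)] += 1
        cooccurrence[(y, x)] += 1


def recommend(user, top_n=2):
    """Score the user's unseen items by co-occurrence with the items already seen."""
    seen = baskets[user]
    scores = Counter()
    for (x, y), count in cooccurrence.items():
        if x in seen and y not in seen:
            scores[y] += count
    return [item for item, _ in scores.most_common(top_n)]
```

For example, `recommend("dan")` ranks the items that co-occur with item A, the only item that user has seen. Raw counts favour popular items, so a production system would normally replace them with a statistical similarity score.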

SVD++ and SGD
•Weighted matrix factorization, SVD++ and parallel SGD (run sequentially in a shared-data environment)
•SGD: an iterative learning algorithm in which each training example is used to pull the model M slightly closer to the correct answer
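The SGD description above can be sketched for the simplest possible model, a line y ≈ w·x + b. This is a minimal Python illustration, not Mahout code: the function name `sgd_fit`, the learning rate, the epoch count and the noise-free toy data are all assumptions.

```python
import random


def sgd_fit(examples, learning_rate=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    examples = list(examples)  # copy so the caller's list is not shuffled
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)  # visit the examples in random order
        for x, y in examples:
            error = (w * x + b) - y         # how far off the current model is
            w -= learning_rate * error * x  # pull w slightly towards the answer
            b -= learning_rate * error      # pull b slightly towards the answer
    return w, b


# hypothetical noise-free data generated from y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
w, b = sgd_fit(data)
```

Each update uses a single training example, which is what distinguishes SGD from batch gradient descent, where every update averages the gradient over the whole dataset.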


Recommender Implementation
•Collaborative-filtering-based recommender
•SVD recommender
•KNN item-based recommender (linear-interpolation item-based recommender)
•Cluster-based recommender

Implementation APIs
•APIs for distributed and in-core first- and second-moment routines
•Distributed and local stochastic principal component analysis (DSPCA and SPCA), stochastic singular value decomposition (DSSVD and SSVD) and singular value decomposition (SVD)
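The first and second moments mentioned above are the mean and the variance. The Python sketch below shows one common way to compute them in a single pass and to merge the partial results of two data partitions, which is the step a distributed routine needs. The function names `moments` and `merge` are hypothetical, and the sketch does not claim to mirror how Mahout implements these routines.

```python
def moments(values):
    """Single-pass count, mean and sum of squared deviations (Welford's update)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2


def merge(a, b):
    """Combine the partial results of two partitions into one result."""
    (na, mean_a, m2a), (nb, mean_b, m2b) = a, b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2


# hypothetical data split across two partitions
part1 = moments([2.0, 4.0, 4.0, 4.0])
part2 = moments([5.0, 5.0, 7.0, 9.0])
n, mean, m2 = merge(part1, part2)
variance = m2 / n  # population variance (second central moment)
```

An in-core routine corresponds to calling `moments` on the whole dataset; a distributed one computes `moments` per partition and then merges the small partial results.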

Implementation APIs
•Distributed Cholesky QR (thinQR)
•Distributed regularized Alternating
Least Squares (DALS)
•Matrix factorization with alternating
least squares
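Alternating least squares can be sketched in its smallest form, a rank-1 factorization in which every observed entry r[i][j] is approximated by u[i]·v[j]. The Python code below is a single-machine illustration under stated assumptions: the function name `als_rank1`, the regularization value and the toy matrix are hypothetical, and the code is not the DALS routine itself.

```python
def als_rank1(ratings, n_rows, n_cols, reg=0.001, iterations=50):
    """Rank-1 regularized ALS: approximate each observed r[i][j] by u[i] * v[j]."""
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # fix v and solve the regularized least-squares problem for each u[i]
        for i in range(n_rows):
            obs = [(j, r) for (row, j), r in ratings.items() if row == i]
            u[i] = sum(r * v[j] for j, r in obs) / (sum(v[j] ** 2 for j, _ in obs) + reg)
        # fix u and solve for each v[j]
        for j in range(n_cols):
            obs = [(i, r) for (i, col), r in ratings.items() if col == j]
            v[j] = sum(r * u[i] for i, r in obs) / (sum(u[i] ** 2 for i, _ in obs) + reg)
    return u, v


# hypothetical observed entries of the rank-1 matrix [[2, 4], [3, 6]]; entry (1, 1) is held out
observed = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 3.0}
u, v = als_rank1(observed, n_rows=2, n_cols=2)
prediction = u[1] * v[1]  # estimate of the held-out entry
```

With one factor held fixed, the problem for the other factor is an ordinary regularized least-squares problem with a closed-form solution, which is why the two steps can simply alternate. The regularization term shrinks the factors slightly, so the estimate of the held-out entry lands close to, rather than exactly on, the true value of 6.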

Principal Component Analysis (PCA)
•A linear transformation method
•Finds the directions of maximum variance in high-dimensional data
•Projects the data onto a lower-dimensional subspace while retaining most of the information
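The three steps above can be sketched in plain Python for the first principal component: centre the data, build the covariance matrix, find its dominant eigenvector and project onto it. The function name `pca_first_component`, the use of power iteration and the toy data are assumptions for this illustration; Mahout's SPCA and DSPCA routines use stochastic SVD instead.

```python
from math import sqrt


def pca_first_component(rows, iterations=100):
    """Return the mean, the unit direction of maximum variance and the 1-D projections."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    # sample covariance matrix of the centred data
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)] for a in range(d)]
    # power iteration converges to the eigenvector with the largest eigenvalue
    direction = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * direction[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        direction = [x / norm for x in nxt]
    # project each centred point onto the principal direction
    projected = [sum(c[k] * direction[k] for k in range(d)) for c in centered]
    return mean, direction, projected


# hypothetical 2-D points lying close to the line y = x
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
mean, direction, projected = pca_first_component(data)
```

Here two coordinates per point are reduced to one value per point in `projected`, and because the points lie almost on a line, very little information is lost.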

Principal Component Analysis (PCA)
•Identifies patterns in data sets
•Detects the correlation among the variables
•If the correlation is strong, try reducing the dimensionality
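Checking for correlation before reducing dimensionality can be sketched with the Pearson correlation coefficient. The Python code below is a minimal illustration; the function name `pearson` and the toy measurements are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# hypothetical measurements: a height in cm, the same height in inches, and a loosely related score
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]
score = [3.0, 1.0, 4.0, 1.0, 5.0]

r_redundant = pearson(height_cm, height_in)  # close to 1: one of the two columns is enough
r_weak = pearson(height_cm, score)           # much weaker relationship
```

A coefficient close to 1 or -1 between two variables signals redundancy, which is the situation in which projecting onto fewer dimensions loses little information.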

PCA Applications
•PCA is applied in a number of use cases, such as stock market prediction and the analysis of gene expression data

Summary
We learnt:
•Mahout for Fast and Efficient
Processing of Big Data
•A scalable generalized tensor and linear
algebra solving engine, designed on top
of Apache Hadoop.
•Mahout implementations of Machine
Learning Algorithms

Summary
We learnt:
•Mahout for Similar item sets, Clustering,
Classification, Recommender
•PCA, SVD++, SGD
•Collaborative Filtering



End of Lesson 17 on
Mahout