Big Data Analytics using Mahout


About This Presentation

This lab describes the use of Apache Mahout for Machine Learning on a Hadoop platform.


Slide Content

Big Data Analytics
Using Mahout
Assoc. Prof. Dr. Thanachart Numnonda
Executive Director
IMC Institute
April 2015

2
Mahout

3
What is Mahout?
Mahout is a Java library that implements
machine learning techniques for
clustering, classification and recommendation.

4
Mahout in Apache Software

5
Why Mahout?
Apache License
Good Community
Good Documentation
Scalable
Extensible
Command Line Interface
Java Library

6
List of Algorithms

7
List of Algorithms

8
List of Algorithms

9
Mahout Architecture

10
Use Cases

11
Installing Mahout

12

13
Select the EC2 service and click on Launch Instance

14
Choose My AMIs and select “Hadoop Lab Image”

15
Choose the m3.medium instance type

16
Leave configuration details as default

17
Add Storage: 20 GB

18
Name the instance

19
Select an existing security group > Select Security Group Name: default

20
Click Launch and choose imchadoop as a key pair

21
Review the instance / click Connect for
instructions on connecting to the instance

22
Connect to an instance from Mac/Linux

23
Connect to an instance from Windows using Putty

24
Connect to the instance

25
Install Maven
$ sudo apt-get install maven
$ mvn -v

26
Install Subversion
$ sudo apt-get install subversion
$ svn --version

27
Install Mahout
$ cd /usr/local/
$ sudo mkdir mahout
$ cd mahout
$ sudo svn co http://svn.apache.org/repos/asf/mahout/trunk
$ cd trunk
$ sudo mvn install -DskipTests

28
Install Mahout (cont.)

29
Edit the .bashrc file
$ sudo vi $HOME/.bashrc

$ exec bash
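
The slide image shows the exact lines added to .bashrc; a minimal sketch, assuming the checkout location from the previous slide (the variable names on the slide may differ):

# Hypothetical .bashrc additions for Mahout
export MAHOUT_HOME=/usr/local/mahout/trunk
export PATH=$PATH:$MAHOUT_HOME/bin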

30
Running
Recommendation Algorithms

31
MovieLens
http://grouplens.org/datasets/movielens/

32
Recommend Movies
$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip

33
Item-Based Recommendation
Step 1: Gather some test data
Step 2: Pick a similarity measure
Step 3: Configure the Mahout command
Step 4: Making use of the output and doing more
with Mahout

34
Preparing MovieLens data
$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip
$ hadoop fs -mkdir /input
$ hadoop fs -put u.data /input/u.data
$ hadoop fs -mkdir /results
$ unset MAHOUT_LOCAL

35
Running Recommend Command
$ mahout recommenditembased -i /input/u.data \
    -o /results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD \
    --numRecommendations 5
$ hadoop fs -ls /results/itemRecom.txt

36
View the result
$ hadoop fs -cat /results/itemRecom.txt/part-r-00000

37
Running Recommendation in
a single machine
$ export MAHOUT_LOCAL=true
$ mahout recommenditembased -i ml-100k/u.data \
    -o results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD \
    --numRecommendations 5
$ cat results/itemRecom.txt/part-r-00000

38
Running
Example Program
Using CBayes classifer

39
Running Example Program

40
Preparing data
$ export WORK_DIR=/tmp/mahout-work-${USER}
$ mkdir -p ${WORK_DIR}
$ mkdir -p ${WORK_DIR}/20news-bydate
$ cd ${WORK_DIR}/20news-bydate
$ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
$ tar -xzf 20news-bydate.tar.gz
$ mkdir ${WORK_DIR}/20news-all
$ cd
$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

41
Extract Features
Convert the full 20 newsgroups dataset into a <Text, Text>
SequenceFile.
Convert and preprocess the dataset into a <Text, VectorWritable>
SequenceFile containing term frequencies for each document.
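
The exact commands are on the slide image; a sketch following Mahout's bundled classify-20newsgroups example (the output directory names here are assumptions):
$ mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow
$ mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors \
    -lnorm -nv -wt tfidf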

42
Prepare Testing Dataset
Split the preprocessed dataset into training and testing sets.
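
A sketch of the split step, again following the classify-20newsgroups example (directory names continue the assumptions above; 40% of the vectors are held out for testing):
$ mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential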

43
Training process
Train the classifier.
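
A sketch of the training step (the -c flag selects Complementary Naive Bayes; paths are the assumptions above):
$ mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el \
    -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow -c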

44
Testing the result
Test the classifier.
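
A sketch of the testing step, which prints a confusion matrix and accuracy summary (paths are the assumptions above):
$ mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing -c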

45
Sample Output

46
Command line options

47
Command line options

48
Command line options

49
K-means clustering

50
Reuters Newswire

51
Preparing data
$ export WORK_DIR=/tmp/kmeans
$ mkdir $WORK_DIR
$ mkdir $WORK_DIR/reuters-out
$ cd $WORK_DIR
$ wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
$ mkdir $WORK_DIR/reuters-sgm
$ tar -xzf reuters21578.tar.gz -C $WORK_DIR/reuters-sgm

52
Convert input to a sequential file
$ mahout org.apache.lucene.benchmark.utils.ExtractReuters \
    $WORK_DIR/reuters-sgm $WORK_DIR/reuters-out

53
Convert input to a sequential file (cont)
$ mahout seqdirectory -i $WORK_DIR/reuters-out \
    -o $WORK_DIR/reuters-out-seqdir -c UTF-8 -chunk 5

54
Create the sparse vector files
$ mahout seq2sparse -i $WORK_DIR/reuters-out-seqdir/ \
    -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans \
    --maxDFPercent 85 --namedVector

55
Running K-Means
$ mahout kmeans -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
    -c $WORK_DIR/reuters-kmeans-clusters \
    -o $WORK_DIR/reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -k 20 -ow

56
K-Means command line options

57
Viewing Result
$ mkdir $WORK_DIR/reuters-kmeans/clusteredPoints

$ mahout clusterdump -i $WORK_DIR/reuters-kmeans/clusters-*-final \
    -o $WORK_DIR/reuters-kmeans/clusterdump \
    -d $WORK_DIR/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
    -dt sequencefile -b 100 -n 20 --evaluate \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -sp 0 --pointsDir $WORK_DIR/reuters-kmeans/clusteredPoints

58
Viewing Result

59
Exercise: Traffic Accidents Dataset
http://fimi.ua.ac.be/data/accidents.dat.gz

60
Import-Export RDBMS data

61
Sqoop Hands-On Labs
1. Loading Data into MySQL DB
2. Installing Sqoop
3. Configuring Sqoop
4. Installing DB driver for Sqoop
5. Importing data from MySQL to Hive Table
6. Reviewing data from Hive Table
7. Reviewing HDFS Database Table files

62
1. MySQL RDS Server on AWS
An RDS server is running on AWS with the following
configuration:
> database: imc_db
> username: admin
> password: imcinstitute
> addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com
[This address may change]

63
1. country_tbl data
Testing data query from MySQL DB
Table name > country_tbl
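
A minimal sketch of such a test query, assuming the mysql client is installed on the instance (the address is the RDS endpoint listed above):
$ mysql -h imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com -u admin -p imc_db
mysql> SELECT * FROM country_tbl LIMIT 10;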

64
2. Installing Sqoop
# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/
# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

65
Installing Sqoop
Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
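
The slide image shows the exact lines added; a minimal sketch, assuming the install path used above (the variable names on the slide may differ):

# Hypothetical .bashrc additions for Sqoop
export SQOOP_HOME=/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0
export PATH=$PATH:$SQOOP_HOME/bin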

66
3. Configuring Sqoop
ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/
ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh
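
The slide image shows the actual settings; a minimal sketch of typical sqoop-env.sh entries, assuming Hadoop and Hive live under /usr/local (these paths are assumptions; adjust to the real install locations):

# Hypothetical sqoop-env.sh entries
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive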

67
4. Installing DB driver for Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget \
    https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit

68
5. Importing data from MySQL to Hive Table
[hdadmin@localhost ~]$ sqoop import \
    --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db \
    --username admin -P --table country_tbl \
    --hive-import --hive-table country -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Enter password: <enter here>

69
6. Reviewing data from Hive Table
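
The slide image shows the Hive output; a minimal sketch of how the imported table could be inspected (the table name country comes from the import command above):
$ hive
hive> show tables;
hive> select * from country limit 10;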

70
7. Reviewing HDFS Database Table files
Start a web browser, go to http://54.68.149.232:50070, then navigate to /user/hive/warehouse

71
Sqoop commands

72
www.facebook.com/imcinstitute

73
Thank you
[email protected]
www.facebook.com/imcinstitute
www.slideshare.net/imcinstitute
www.thanachart.org