Big Data Analytics using Mahout


About This Presentation

This lab describes the use of Apache Mahout for Machine Learning on a Hadoop platform.


Slide Content

Big Data Analytics
Using Mahout
Assoc. Prof. Dr. Thanachart Numnonda
Executive Director
IMC Institute
April 2015

2
Mahout

3
What is Mahout?
Mahout is a Java library that implements
machine learning techniques for
clustering, classification and recommendation.

4
Mahout in Apache Software

5
Why Mahout?
Apache License
Good Community
Good Documentation
Scalable
Extensible
Command Line Interface
Java Library

6
List of Algorithms

7
List of Algorithms

8
List of Algorithms

9
Mahout Architecture

10
Use Cases

11
Installing Mahout

12

13
Select the EC2 service and click on Launch Instance

14
Choose My AMIs and select “Hadoop Lab Image”

15
Choose the m3.medium instance type

16
Leave configuration details as default

17
Add Storage: 20 GB

18
Name the instance

19
Select an existing security group > Select Security Group Name: default

20
Click Launch and choose imchadoop as a key pair

21
Review the instance / click Connect for
instructions on connecting to the instance

22
Connect to an instance from Mac/Linux

23
Connect to an instance from Windows using Putty

24
Connect to the instance

25
Install Maven
$ sudo apt-get install maven
$ mvn -v

26
Install Subversion
$ sudo apt-get install subversion
$ svn --version

27
Install Mahout
$ cd /usr/local/
$ sudo mkdir mahout
$ cd mahout
$ sudo svn co http://svn.apache.org/repos/asf/mahout/trunk
$ cd trunk
$ sudo mvn install -DskipTests

28
Install Mahout (cont.)

29
Edit the .bashrc file
$ sudo vi $HOME/.bashrc

$ exec bash
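
The slide image shows the exact lines added to .bashrc; a minimal sketch, assuming the checkout location from the previous slide (the variable names on the slide may differ):

# Hypothetical .bashrc additions for Mahout
export MAHOUT_HOME=/usr/local/mahout/trunk
export PATH=$PATH:$MAHOUT_HOME/bin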

30
Running
Recommendation Algorithms

31
MovieLens
http://grouplens.org/datasets/movielens/

32
Recommend Movies
$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip

33
Item-Based Recommendation
Step 1: Gather some test data
Step 2: Pick a similarity measure
Step 3: Configure the Mahout command
Step 4: Making use of the output and doing more
with Mahout

34
Preparing MovieLens data
$ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
$ unzip ml-100k.zip
$ hadoop fs -mkdir /input
$ hadoop fs -put u.data /input/u.data
$ hadoop fs -mkdir /results
$ unset MAHOUT_LOCAL

35
Running Recommend Command
$ mahout recommenditembased -i /input/u.data \
    -o /results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD \
    --numRecommendations 5
$ hadoop fs -ls /results/itemRecom.txt

36
View the result
$ hadoop fs -cat /results/itemRecom.txt/part-r-00000

37
Running Recommendation in
a single machine
$ export MAHOUT_LOCAL=true
$ mahout recommenditembased -i ml-100k/u.data \
    -o results/itemRecom.txt -s SIMILARITY_LOGLIKELIHOOD \
    --numRecommendations 5
$ cat results/itemRecom.txt/part-r-00000

38
Running
Example Program
Using CBayes classifer

39
Running Example Program

40
Preparing data
$ export WORK_DIR=/tmp/mahout-work-${USER}
$ mkdir -p ${WORK_DIR}
$ mkdir -p ${WORK_DIR}/20news-bydate
$ cd ${WORK_DIR}/20news-bydate
$ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
$ tar -xzf 20news-bydate.tar.gz
$ mkdir ${WORK_DIR}/20news-all
$ cd
$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

41
Extract Features
Convert the full 20 newsgroups dataset into a <Text, Text>
SequenceFile.
Convert and preprocess the dataset into a <Text, VectorWritable>
SequenceFile containing term frequencies for each document.
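
The exact commands are on the slide image; a sketch following Mahout's bundled classify-20newsgroups example (the output directory names here are assumptions):
$ mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow
$ mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors \
    -lnorm -nv -wt tfidf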

42
Prepare Testing Dataset
Split the preprocessed dataset into training and testing sets.
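
A sketch of the split step, again following the classify-20newsgroups example (directory names continue the assumptions above; 40% of the vectors are held out for testing):
$ mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential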

43
Training process
Train the classifier.
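
A sketch of the training step (the -c flag selects Complementary Naive Bayes; paths are the assumptions above):
$ mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el \
    -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow -c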

44
Testing the result
Test the classifier.
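
A sketch of the testing step, which prints a confusion matrix and accuracy summary (paths are the assumptions above):
$ mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing -c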

45
Sample Output

46
Command line options

47
Command line options

48
Command line options

49
K-means clustering

50
Reuters Newswire

51
Preparing data
$ export WORK_DIR=/tmp/kmeans
$ mkdir $WORK_DIR
$ mkdir $WORK_DIR/reuters-out
$ cd $WORK_DIR
$ wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
$ mkdir $WORK_DIR/reuters-sgm
$ tar -xzf reuters21578.tar.gz -C $WORK_DIR/reuters-sgm

52
Convert input to a sequential file
$ mahout org.apache.lucene.benchmark.utils.ExtractReuters \
    $WORK_DIR/reuters-sgm $WORK_DIR/reuters-out

53
Convert input to a sequential file (cont)
$ mahout seqdirectory -i $WORK_DIR/reuters-out \
    -o $WORK_DIR/reuters-out-seqdir -c UTF-8 -chunk 5

54
Create the sparse vector files
$ mahout seq2sparse -i $WORK_DIR/reuters-out-seqdir/ \
    -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans \
    --maxDFPercent 85 --namedVector

55
Running K-Means
$ mahout kmeans -i $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
    -c $WORK_DIR/reuters-kmeans-clusters \
    -o $WORK_DIR/reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -k 20 -ow

56
K-Means command line options

57
Viewing Result
$ mkdir $WORK_DIR/reuters-kmeans/clusteredPoints

$ mahout clusterdump -i $WORK_DIR/reuters-kmeans/clusters-*-final \
    -o $WORK_DIR/reuters-kmeans/clusterdump \
    -d $WORK_DIR/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
    -dt sequencefile -b 100 -n 20 --evaluate \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -sp 0 --pointsDir $WORK_DIR/reuters-kmeans/clusteredPoints

58
Viewing Result

59
Exercise: Traffic Accidents Dataset
http://fimi.ua.ac.be/data/accidents.dat.gz

60
Import-Export RDBMS data

61
Sqoop Hands-On Labs
1. Loading Data into MySQL DB
2. Installing Sqoop
3. Configuring Sqoop
4. Installing DB driver for Sqoop
5. Importing data from MySQL to Hive Table
6. Reviewing data from Hive Table
7. Reviewing HDFS Database Table files

62
1. MySQL RDS Server on AWS
An RDS server is running on AWS with the following
configuration:
> database: imc_db
> username: admin
> password: imcinstitute
> addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com
[This address may change]

63
1. country_tbl data
Testing data query from MySQL DB
Table name > country_tbl
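
A minimal sketch of such a test query, assuming the mysql client is installed on the instance (the address is the RDS endpoint listed above):
$ mysql -h imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com -u admin -p imc_db
mysql> SELECT * FROM country_tbl LIMIT 10;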

64
2. Installing Sqoop
# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz
# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/
# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

65
Installing Sqoop
Edit $HOME/.bashrc
# sudo vi $HOME/.bashrc
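
The slide image shows the exact lines added; a minimal sketch, assuming the install path used above (the variable names on the slide may differ):

# Hypothetical .bashrc additions for Sqoop
export SQOOP_HOME=/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0
export PATH=$PATH:$SQOOP_HOME/bin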

66
3. Configuring Sqoop
ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/
ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh
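
The slide image shows the actual settings; a minimal sketch of typical sqoop-env.sh entries, assuming Hadoop and Hive live under /usr/local (these paths are assumptions; adjust to the real install locations):

# Hypothetical sqoop-env.sh entries
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive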

67
4. Installing DB driver for Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget \
    https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar
ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit

68
5. Importing data from MySQL to Hive Table
[hdadmin@localhost ~]$ sqoop import \
    --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db \
    --username admin -P --table country_tbl \
    --hive-import --hive-table country -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Enter password: <enter here>

69
6. Reviewing data from Hive Table
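
The slide image shows the Hive output; a minimal sketch of how the imported table could be inspected (the table name country comes from the import command above):
$ hive
hive> show tables;
hive> select * from country limit 10;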

70
7. Reviewing HDFS Database Table files
Start a web browser, go to http://54.68.149.232:50070, then navigate to /user/hive/warehouse

71
Sqoop commands

72
www.facebook.com/imcinstitute

73
Thank you
[email protected]
www.facebook.com/imcinstitute
www.slideshare.net/imcinstitute
www.thanachart.org