RHadoop

Praveenkumar Donta, 70 slides, Nov 30, 2016

About This Presentation

Big Data Analytics with R and Hadoop


Slide Content

Big Data Analytics with R and Hadoop
D. Praveen Kumar
Research Scholar (Full-Time)
Department of Computer Science & Engineering
YSREC of Yogi Vemana University, Proddatur
Kadapa Dt., A. P, India
November 30, 2016
YSR Engineering College of YVU, Proddatur, Kadapa
Big Data Analytics with R and Hadoop
November 30, 2016 Slide: 1 / 70

Outline
1. Introduction
2. RHadoop
3. RHadoop Installation
4. rhdfs Methods
5. rmr2
6. Examples

Big Data - Introduction
Big Data deals with large and complex data sets that can be structured, semi-structured, or unstructured and will typically not fit into memory. They have to be processed in place, which means that computation has to be done where the data resides.

Big Data - 3V's
Velocity refers to the low-latency, real-time speed at which analytics need to be applied (for example, analytics on a continuous stream of data originating from a social networking site).
Volume refers to the size of the data set. It may be in KB, MB, GB, TB, or PB, based on the type of application that generates or receives the data.
Variety refers to the various types of data that can exist, for example text, audio, video, and photos.

Big Data - 3V's (Cont..)
(figure: the 3V's of Big Data)

Popular Organizations that hold Big Data
Some of the popular organizations that hold Big Data are as follows (up to 2014):
Facebook: 40 PB of data; captures 100 TB/day
Yahoo!: 60 PB of data
Twitter: captures 8 TB/day
eBay: 40 PB of data; captures 50 TB/day

Hadoop - Introduction
Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.
Hadoop is a top-level Apache project, initiated and led by Yahoo! and Doug Cutting.
Its impact can be boiled down to four salient characteristics: scalable, cost-effective, flexible, fault-tolerant solutions.
Apache Hadoop has two main features:
HDFS (Hadoop Distributed File System) - storage
MapReduce - processing

Requirements
Necessary:
Java >= 7
ssh
Linux OS (Ubuntu >= 14.04)
Hadoop framework
Optional:
Eclipse
Internet connection

Java 7 & Installation
Hadoop requires a working Java installation; Java 1.7 or later is recommended.
The following commands install Java on the Linux platform:
sudo apt-get install openjdk-7-jdk
(or)
sudo apt-get install default-jdk

Java PATH Setup
We need to set the JAVA_HOME path.
Open the .bashrc file located in the home directory:
gedit ~/.bashrc
Add the line below at the end:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Installation & Configuration of SSH
Hadoop requires SSH (Secure Shell) access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
Install SSH using the following command:
sudo apt-get install ssh
First, we have to generate a DSA SSH key for the user:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Download & Extract Hadoop
Download Hadoop from the Apache download mirrors:
http://mirror.bergrid.in/apache/hadoop/common/
Extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop.
$ sudo chmod 777 /usr/local
$ cd /usr/local
$ tar xzf hadoop-2.7.2.tar.gz
$ sudo mv hadoop-2.7.2 hadoop

Add Hadoop configuration in .bashrc
Add the Hadoop configuration to .bashrc in the home directory:
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

Create tmp directory, DataNode & NameNode
Execute the command below to create the NameNode directory:
mkdir -p /usr/local/hadoopdata/hdfs/namenode
Execute the command below to create the DataNode directory:
mkdir -p /usr/local/hadoopdata/hdfs/datanode
Execute the commands below to create the tmp directory for Hadoop:
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp

Files to Configure
The following are the files we need to configure:
core-site.xml
hadoop-env.sh
mapred-site.xml
hdfs-site.xml

Add properties in /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following snippets between the <configuration> ... </configuration> tags in the core-site.xml file.
Add the property below to specify the location of tmp:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
Add the property below to specify the default file system and its port number:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

Add properties in /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Un-comment JAVA_HOME and give the correct path for Java:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Add property in /usr/local/hadoop/etc/hadoop/mapred-site.xml
In this file we add the host name and port that the MapReduce job tracker runs at. Add the following in mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

Add properties in .../etc/hadoop/hdfs-site.xml
In the file hdfs-site.xml add the following.
Add the replication factor:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
Specify the NameNode directory:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>
Specify the DataNode directory:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>

Formatting the HDFS file system via the NameNode
The first step in starting up your Hadoop installation is formatting the Hadoop file system.
We need to do this the first time we set up a Hadoop installation.
Do not format a running Hadoop file system, as you will lose all the data currently in HDFS.
To format the file system, run the command:
hadoop namenode -format

Starting the single-node cluster
Run the command:
start-all.sh
This will start up a NameNode, SecondaryNameNode, DataNode, ResourceManager, and a NodeManager on your machine.
A nifty tool for checking whether the expected Hadoop processes are running is jps:
hadoop1@hadoop1:/usr/local/hadoop$ jps
2598 NameNode
3112 ResourceManager
3523 Jps
2917 SecondaryNameNode
2727 DataNode
3242 NodeManager

Stopping your single-node cluster
Run the command:
stop-all.sh
This stops all the daemons running on your machine; the output will look like this:
stopping NodeManager
localhost: stopping ResourceManager
stopping NameNode
localhost: stopping DataNode
localhost: stopping SecondaryNameNode

R - Introduction
R is an open source software package to perform statistical analysis on data.
R is a programming language developed from S (Statistics).
R provides a wide variety of statistical, machine learning, and graphical techniques, and is highly extensible.
R can connect with other data stores, such as MySQL, SQLite, MongoDB, and Hadoop.

R - Features
The following are some of R's features:
Effective statistical programming language
Relational database support
Data analytics
Data visualization
Extension through the vast library of R packages

R - Operations
R allows performing data analytics via various operations such as:
Regression
Classification
Clustering
Recommendation
Text mining

R - Installation (Windows)
For Windows, follow the given steps:
1. Navigate to www.r-project.org.
2. Click on the CRAN section, select a CRAN mirror, and select your Windows OS (stick to Linux; Hadoop is almost always used in a Linux environment).
3. Download the latest R version from the mirror.
4. Execute the downloaded .exe to install R.

R - Installation (Ubuntu)
For Linux-Ubuntu, follow the given steps:
1. Navigate to www.r-project.org.
2. Click on the CRAN section, select a CRAN mirror, and select your OS.
3. In the /etc/apt/sources.list file, add the CRAN <mirror> entry.
4. Download and update the package lists from the repositories using the sudo apt-get update command.
5. Install the R system using the sudo apt-get install r-base command.

R - Installation (RHEL/CentOS)
For Linux-RHEL/CentOS, follow the given steps:
1. Navigate to www.r-project.org.
2. Click on CRAN, select a CRAN mirror, and select the Red Hat OS.
3. Download the R-*core-*.rpm file.
4. Install the .rpm package using the rpm -ivh R-*core-*.rpm command.
5. Install the R system using sudo yum install R.

Hadoop MapReduce in R
We can perform Hadoop MapReduce in R in three ways:
1. R and Hadoop Integrated Programming Environment (RHIPE)
2. Hadoop Streaming
3. RHadoop
Among these three, RHadoop is the most efficient and the easiest.

RHadoop - Introduction
RHadoop was developed by Revolution Analytics.
RHadoop is available with three main R packages:
1. rhdfs - provides HDFS data operations
2. rmr - provides MapReduce execution operations
3. rhbase - provides input data source operations for HBase
It is not necessary to install all three RHadoop packages to run Hadoop MapReduce operations with R and Hadoop.

RHadoop - Architecture
(figure: RHadoop architecture)

rhdfs
rhdfs is an R interface providing HDFS usability from the R console.
The rhdfs package calls the HDFS API in the backend to operate on data sources stored in HDFS.
With rhdfs methods, an R programmer can easily perform read and write operations on distributed data files.

rmr
rmr is an R interface providing the Hadoop MapReduce facility inside the R environment.
The R programmer needs only to divide the application logic into the map and reduce phases and submit it with the rmr methods.
After that, rmr calls the Hadoop streaming MapReduce API with several job parameters (input directory, output directory, mapper, reducer, and so on) to perform the R MapReduce job over the Hadoop cluster.

rhbase
rhbase is an R interface for operating on Hadoop HBase data sources stored on the distributed network, via a Thrift server.
The rhbase package provides several methods for initialization, read/write, and table manipulation operations.

R and Hadoop installation
We have already installed R and Hadoop.

Installing the R packages
To connect R and Hadoop, we need to install the following packages:
httr
functional
devtools
plyr
reshape2
rJava
RJSONIO
itertools
digest
Rcpp
install.packages(c('httr','functional','devtools','plyr','reshape2'))
install.packages(c('rJava','RJSONIO','itertools','digest','Rcpp'))

Setting environment variables
We need to set the following environment variables through the R console:
## Setting HADOOP_CMD
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
## Setting HADOOP_STREAMING
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")
Alternatively, we can set them outside the R console via the command line as follows:
export HADOOP_CMD="/usr/local/hadoop/bin/hadoop"
export HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar"

Usage of Hadoop Streaming jar
(figure: usage of the Hadoop streaming jar)

Downloading RHadoop Packages
Download the RHadoop packages from the GitHub repository of Revolution Analytics:
https://github.com/RevolutionAnalytics/RHadoop
rmr2: [rmr2_3.3.1.tar.gz]
rhdfs: [rhdfs_1.0.8.tar.gz]
rhbase: [rhbase_1.2.1.tar.gz]
We can install these packages using the R command line or RStudio.

Installing the rmr2 package
Install through the R command line using the following command:
R CMD INSTALL rmr2_3.3.1.tar.gz
To install using RStudio, follow these steps:
Click on Tools -> Install Packages
Change the "Install from" option from Repository (CRAN) to Package Archive File (.tar.gz)
Choose the rmr2_3.3.1.tar.gz file from your local system
Click on the Install button (this also installs the supporting packages of rmr2)

Installing the rhdfs package
Install through the R command line using the following command:
R CMD INSTALL rhdfs_1.0.8.tar.gz
To install using RStudio, follow these steps:
Click on Tools -> Install Packages
Change the "Install from" option from Repository (CRAN) to Package Archive File (.tar.gz)
Choose the rhdfs_1.0.8.tar.gz file from your local system
Click on the Install button (this also installs the supporting packages of rhdfs)

Installing the rhbase package
Install through the R command line using the following command:
R CMD INSTALL rhbase_1.2.1.tar.gz
To install using RStudio, follow these steps:
Click on Tools -> Install Packages
Change the "Install from" option from Repository (CRAN) to Package Archive File (.tar.gz)
Choose the rhbase_1.2.1.tar.gz file from your local system
Click on the Install button (this also installs the supporting packages of rhbase)

Loading the RHadoop libraries
Just as we load a normal library in R, we can load the RHadoop libraries using the require() or library() methods:
library('rhdfs')   # Loading HDFS
library('rmr2')    # Loading MapReduce
library('rhbase')  # Loading HBase

Initializing RHadoop
Initialize the rhdfs package with a parameter specifying the location of the Hadoop configuration files.
Syntax:
hdfs.init(hadoop=PATH)
Here PATH specifies the location of the Hadoop configuration file.
If we do not pass any parameter, by default the configuration files are taken from the HADOOP_CMD environment variable.

hdfs.ls
This is useful to list files and directories of HDFS. It returns a data frame with columns corresponding to permissions, owner, group, size (in bytes), modification time, and file or directory name.
Syntax: hdfs.ls(path, recurse=FALSE)
If recurse is TRUE, it recursively shows the subdirectories.

hdfs.defaults
This method is used to set and get the default configurations of HDFS.
Syntax:
hdfs.defaults(arg)
arg indicates the name of the parameter, or NULL.
This function lists the following values:
local: rJava object corresponding to the local system
blocksize: default block size of the files stored in HDFS
fs: an rJava object corresponding to the HDFS
fu: helper object for rhdfs
classpath: the Java classpath
replication: default replication factor in HDFS
conf: name-value mappings for Hadoop configuration parameters

hdfs.defaults: Examples
(figure: example hdfs.defaults() output)

hdfs.cat
This method is useful for reading lines from a file on HDFS.
Syntax:
hdfs.cat(path, n, buffersize)
path: location of the source file
n: number of lines to read from the file
buffersize: size of the buffer (optional)
Example:
hdfs.cat('/RHadoop/1/example.txt')

hdfs.put
This method is useful to transfer data from the local system to HDFS.
Syntax:
hdfs.put(src, dest, dstFS=hdfs.defaults("fs"))
src: location of the source directory or file
dest: location of the destination directory or file
dstFS: the destination file system (optional)
Example:
hdfs.put('/home/dp/Desktop/example.txt','/RHadoop/1/')

hdfs.get
This method is useful to transfer data from HDFS to the local system.
Syntax:
hdfs.get(src, dest, srcFS=hdfs.defaults("fs"))
src: location of the source directory or file
dest: location of the destination directory or file
srcFS: the source file system (optional)
Example:
hdfs.get('/RHadoop/1/','/home/dp/Desktop/1/')

hdfs.copy | hdfs.cp
This method is useful to copy data from one location in HDFS to another location in HDFS.
Syntax:
hdfs.copy(src, dest, overwrite=FALSE)
src: location of the source directory or file
dest: location of the destination directory or file
overwrite: if the file exists, whether or not it should be overwritten
Example:
hdfs.copy('/RHadoop/1/','/RHadoop/2/')

hdfs.move
This method is useful to move data from one location in HDFS to another location in HDFS, removing the source directory or file.
Syntax:
hdfs.move(src, dest)
src: location of the source directory or file
dest: location of the destination directory or file
Example:
hdfs.move('/RHadoop/1/','/RHadoop/2/')

hdfs.rename
This method is useful to rename a file or directory in HDFS through R.
Syntax:
hdfs.rename(src, dest)
src: location of the source directory or file
dest: location of the destination directory or file
Example:
hdfs.rename('/RHadoop/1/example.txt','/RHadoop/1/sample.txt')

hdfs.rm | hdfs.rmr | hdfs.delete
These functions are used to delete files or directories of HDFS using R.
Syntax:
hdfs.delete(path)
hdfs.rm(path)
hdfs.rmr(path)
Example:
hdfs.delete("/RHadoop/1/")
hdfs.rm("/RHadoop/1/")
hdfs.rmr("/RHadoop/1/")

hdfs.chmod
This method is useful for changing the permissions of HDFS files or directories.
Syntax:
hdfs.chmod(path, permissions='777')
permissions is a character string that represents the permissions of a file or directory.
Example:
hdfs.chmod("/RHadoop", permissions='777')

hdfs.dircreate | hdfs.mkdir
Both of these functions are used for creating a directory on the HDFS filesystem.
Syntax:
hdfs.mkdir(dirname)
Example:
hdfs.mkdir("/RHadoop/3/")

hdfs.file
This is used to initialize a file to be used for read/write operations on the local system or HDFS.
Syntax:
hdfs.file(path, mode, buffersize, ...)
'r' for read mode, 'w' for write mode. Append mode is not allowed.
Example:
f = hdfs.file("/RHadoop/2/README.txt", "r", buffersize=104857600)

hdfs.write
This is used to write to a file stored on HDFS via streaming.
Syntax:
hdfs.write(object, con, hsync=FALSE)
object is any R object; con is an HDFS connection.
Example:
obj = c(1,2,3,4,5,6,7)
hdfs.write(obj, con)  # con: a write connection opened with hdfs.file(..., "w")

hdfs.read
This is used to read from binary files in an HDFS directory. It uses the stream for deserialization of the data.
Syntax:
hdfs.read(con, n, start)
n indicates the number of bytes; start indicates the starting block.
Example:
f = hdfs.file("/RHadoop/2/README.txt", "r", buffersize=104857600)
m = hdfs.read(f)
c = rawToChar(m)
print(c)

hdfs.close
This is used to close the stream when a file operation is complete. It closes the stream and will not allow further file operations.
Syntax:
hdfs.close(con)
con indicates an HDFS connection.
Example:
hdfs.close(f)
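The hdfs.file / hdfs.read / rawToChar / hdfs.close sequence is the same open, read raw bytes, decode, close pattern used for ordinary binary files. As a point of comparison, here is a plain-Python analogue on a local file (illustrative only, not rhdfs; the file name "README.txt" is just a local scratch file):

```python
# Local-file analogue of the hdfs.file / hdfs.read / rawToChar / hdfs.close
# sequence from the slides above.
with open("README.txt", "wb") as f:
    f.write(b"hello hdfs")          # stand-in for data already on HDFS

f = open("README.txt", "rb")        # like hdfs.file(..., "r")
raw = f.read()                      # like hdfs.read(f): raw bytes
text = raw.decode("utf-8")          # like rawToChar(m): bytes -> characters
f.close()                           # like hdfs.close(f)
print(text)                         # hello hdfs
```

The decode step is the part that trips people up: hdfs.read returns raw bytes, so a conversion such as rawToChar is always needed before the content is usable as text.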

hdfs.file.info
This is used to get meta information about a file stored in HDFS.
Syntax:
hdfs.file.info(PATH)

to.dfs
Writes R objects to the HDFS file system.
Syntax:
to.dfs(kv, output, format="native")
kv is any valid key-value pair, vector, matrix, etc.; output is any valid path, and format is a string naming the format.
Example:
small.ints <- to.dfs(1:10)

from.dfs
This is used to read R objects from the HDFS filesystem that are in the binary encoded format.
Syntax:
from.dfs(input, format)
input is any valid path, and format is a string naming the format.
Example:
from.dfs('/tmp/RtmpRMIXzb/file2bda3fa07850')
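to.dfs and from.dfs form a round trip: an in-memory object is serialized to a file in a binary format, and later deserialized back. A minimal Python analogue of that round trip (using pickle and a local temp file, not the rmr2 native format, purely to illustrate the serialize/deserialize contract):

```python
import os
import pickle
import tempfile

# Serialize an object to a file (analogue of: small.ints <- to.dfs(1:10))
small_ints = list(range(1, 11))
path = os.path.join(tempfile.mkdtemp(), "small_ints.bin")
with open(path, "wb") as f:
    pickle.dump(small_ints, f)

# Read it back (analogue of: from.dfs(path))
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored == small_ints)  # True
```

As with from.dfs, the file on disk is opaque bytes; only the matching deserializer recovers the original object.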

mapreduce
This is used for defining and executing a MapReduce job.
Syntax:
mapreduce(input, output, map, reduce, input.format, output.format)
input: path to the input folder on HDFS
output: path to the output folder on HDFS
map: an optional R function returning NULL or a value of keyval()
reduce: an optional R function of two arguments, a key and a data structure representing all the values associated with that key
input.format: type of input data
output.format: type of output data

keyval
The keyval function is used to create return values from the map or reduce functions, which are themselves parameters to mapreduce.
Syntax:
keyval(key, val)
where key is the desired key or keys, and val is the desired value or values.
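The map / keyval / reduce contract described above can be simulated outside Hadoop: map emits key-value pairs, the framework groups the values by key, and reduce is called once per key with all of its values. The following is a minimal, illustrative Python sketch of that contract (not RHadoop code; all names here are made up for the illustration):

```python
from collections import defaultdict

def map_fn(line):
    # map phase: emit a (word, 1) keyval for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # reduce phase: called once per key, with all values grouped for that key
    return (key, sum(values))

def mapreduce(lines):
    grouped = defaultdict(list)
    for line in lines:                  # map over every input record
        for key, val in map_fn(line):
            grouped[key].append(val)    # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(mapreduce(["big data", "big hadoop"]))  # {'big': 2, 'data': 1, 'hadoop': 1}
```

The real mapreduce() in rmr2 does the same grouping, but distributed: the shuffle happens across the Hadoop cluster, and input/output are HDFS paths rather than in-memory lists.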

WordCount MapReduce source code
# Set environment variables
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop/")
# Load libraries
library(rmr2)
library(rhdfs)
# Initialize the rhdfs package
hdfs.init()
(Cont..)

WordCount MapReduce source code - cont..
map <- function(k, lines) {
  words.list <- strsplit(lines, ' ')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output) {
  mapreduce(input=input, output=output, input.format="text",
            map=map, reduce=reduce)
}
(Cont..)

WordCount MapReduce source code - cont..
## Read text files from the folder /in1/wc/
hdfs.root <- '/in1'
hdfs.data <- file.path(hdfs.root, 'wc')
## Save the result in the folder /in1/out
hdfs.out <- file.path(hdfs.root, 'out')
## Submit the job
out <- wordcount(hdfs.data, hdfs.out)
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
head(results.df)

WordCount Output
(figure: word-count output)

Thank You