Basics of R programming for analytics [Autosaved] (1).pdf

suanshu15 31 views 76 slides Sep 16, 2024

About This Presentation

A basic understanding of the R language


Slide Content

Basics of R programming
for analytics
Course code –PGP 207
PGP MCB 2023-25
Term:II

What is R
R is a statistical programming environment
Statistical programming environment = a place where you can both write
code and do data analysis
It differs from SPSS, SAS and other statistical packages
You can use it for more than just data analysis
R stores everything in the form of objects
You can combine R with other writing environments such as LaTeX
and Markdown to write reports

Why Use R?
• It is a great resource for data analysis, data
visualization, data science and machine learning
• It provides many statistical techniques (such as
statistical tests, classification, clustering and data
reduction)
• It is easy to draw graphs in R, like pie charts,
histograms, box plots, scatter plots, etc.
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has large community support
• It has many packages (libraries of functions) that can be
used to solve different problems

Obtaining R
The best way to obtain R is to visit the CRAN website:
http://cran.r-project.org
You will need Internet access to download the files
Installation of R depends on the platform you have:
select the appropriate binary version
A binary version is the precompiled (machine-code) build that installs
R directly

Appearance of CRAN

Obtaining additional R Packages
To work with R you will need additional packages
These packages are combinations of data and functions
The packages are kept in package repositories
To use a package you have to install it and then load it
Installing: use install.packages("name of the package", dependencies = TRUE)
(optionally pass repos = "<repository URL>" to choose a repository)
Loading: use library(name of the package) or
require(name of the package) [use either]
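The install-then-load steps can be sketched as follows. This is a minimal illustration, not the course's own script: it uses the bundled `stats` package only so that it runs without Internet access, and the install line is left commented out as an assumption about what you would run for a non-bundled package.

```r
# 'stats' ships with every R installation, so no download is needed here.
pkg <- "stats"

# For a package that is not yet installed you would first run (once):
# install.packages(pkg, dependencies = TRUE)

# library() stops with an error if the package is missing.
library(pkg, character.only = TRUE)

# require() returns TRUE or FALSE instead of stopping,
# which is convenient inside scripts that need to check first.
loaded <- require(pkg, character.only = TRUE)
print(loaded)
```

In interactive work `library()` is usually the safer choice, because a missing package fails immediately rather than later.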

Using R with an IDE
It is always a good idea to use R with an integrated development
environment (IDE)
An IDE helps you to write code
and view the output at the same time
You can also browse the objects, data, and graphs in the IDE
The IDE used in this set of exercises is RStudio
RStudio is free and open, and you can download it from
http://rstudio.com
Download the RStudio Desktop version for your use in these
modules
Install R first and then RStudio

Download Page of RStudio

Your Set up to get Started
Source window: used to edit a script and run it.
Console window: used to run a particular package or a particular
command.
Workspace window: stores all the variables used during execution of
commands, under the Environment tab.
Plots and Files window: the Files tab is used to track the working
directories; the Plots tab shows all the graphical output.

What can we put in [>] and take out [<] from R?
From Spreadsheets [ > ]
Source Code Files [ > ]
From other Software [ > ]
Text Based Data [ > ] [ < ]
Tables of Data [ > ] [ < ]
Images [ < ]
Dump Files [ < ]

Assignment 1
Find the answers to log2(2^5) and log(exp(1)*exp(1)).

Data frame in RStudio
ID <- c(1, 2, 3, 4, 5)
Name <- c("Ramesh", "Kaushik", "Chaitali", "Hardik", "Komal")
English <- c(45, 65, 72, 80, 57)
Hindi <- c(65, 78, 56, 45, 48)
Science <- c(45, 55, 68, 74, 63)
So_Science <- c(58, 69, 63, 77, 52)
Math <- c(88, 63, 59, 70, 76)
Stu_marks <- data.frame(Name, English, Hindi, Science, So_Science, Math)
View(Stu_marks)
# extracting a single column from the data frame
Stu_marks$Math
Stu_marks$Hindi

Create a new data frame with the columns:
Name
Computer_app
EVS
Then merge it with the first data frame (the "by" column must exist in both):
New_df_name <- merge(df1, df2, by = "Name")
View(New_df_name)
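A small sketch of the merge step, under the assumption that both data frames share a `Name` column. The marks for `Computer_app` and `EVS` are invented for illustration, and `New_marks` and `All_marks` are hypothetical names.

```r
# Marks already recorded (a subset of the Stu_marks columns).
Stu_marks <- data.frame(
  Name    = c("Ramesh", "Kaushik", "Chaitali", "Hardik", "Komal"),
  English = c(45, 65, 72, 80, 57),
  Math    = c(88, 63, 59, 70, 76)
)

# Hypothetical marks for the two new subjects (values made up).
New_marks <- data.frame(
  Name         = c("Ramesh", "Kaushik", "Chaitali", "Hardik", "Komal"),
  Computer_app = c(70, 82, 64, 91, 55),
  EVS          = c(61, 74, 68, 59, 80)
)

# merge() matches rows on the shared key column "Name"
# and returns the result sorted by that key.
All_marks <- merge(Stu_marks, New_marks, by = "Name")
print(All_marks)
```

Rows whose key appears in only one of the two data frames are dropped by default; `all = TRUE` keeps them and fills the gaps with NA.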

Packages in R
1. A collection of R functions, compiled code and sample data.
2. Stored under a directory called library in the R environment.
By default, R installs a set of packages.
To see the packages installed in R, enter the command in the
console window:
> library()
> fraction (firstVar/secondVar)

Introduction to R script
An R script is a plain text file in which you can store your R code.
A script allows you to show your work to others and also to reproduce and modify the results
How to find the working directory?
In the console window write:
> getwd()
The current working directory is shown in the output
How to set the current working directory?
> setwd("path/to/your/folder")
How to read and store a "csv" file in R?
Type the following command in the console window:
file_name <- read.csv("file_name.csv")
To view the file enter the command:
View(file_name)
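The read-and-inspect workflow can be tried without any existing file. The sketch below first writes a tiny CSV into R's temporary folder; the file name `marks_demo.csv` and its contents are assumptions made for the example.

```r
# Write a tiny CSV into a temporary folder so the example is self-contained.
csv_path <- file.path(tempdir(), "marks_demo.csv")   # hypothetical file name
write.csv(data.frame(Name = c("Asha", "Ravi"), Score = c(81, 67)),
          csv_path, row.names = FALSE)

# Read it back; by default the first row supplies the column names.
marks_demo <- read.csv(csv_path)

# Inspect the result in the console (View() needs an interactive session).
print(marks_demo)
str(marks_demo)
```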

How to create a data frame in R?
> names <- c("Rohit", "Dhoni", "Virat", "Hardik", "KL Rahul", "Bumrah")
> played <- c(45, 49, 47, 47, 40, 25)
> won <- c(22, 21, 14, 9, 9, 8)
> lost <- c(12, 13, 14, 8, 19, 6)
> y <- c(2008, 2004, 2007, 2009, 2010, 2010)
> cricket_players <- data.frame(names, played, won, lost, y)
> View(cricket_players)
You can access the parts of the data frame with the following commands:
> cricket_players$names
> cricket_players$won

Suppose we want to find the ratio of games won to games
played:
> ratio <- cricket_players$won / cricket_players$played
The ratio is stored in a new column called "victory"
> cricket_players$victory <- ratio
To reduce the number of significant digits displayed (for example in the victory column):
> options(digits = 2)
> View(cricket_players)
> mean(cricket_players$played)
# names is a character column, so convert it to a factor before plotting
> plot(factor(cricket_players$names), cricket_players$played)
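As a hedged variation on the steps above: `options(digits = 2)` only changes how numbers are printed, whereas `round()` changes the stored values. The sketch below uses `round()` on the same figures; choosing `round()` here is an illustration, not the slide's method.

```r
cricket_players <- data.frame(
  names  = c("Rohit", "Dhoni", "Virat", "Hardik", "KL Rahul", "Bumrah"),
  played = c(45, 49, 47, 47, 40, 25),
  won    = c(22, 21, 14, 9, 9, 8)
)

# Ratio of games won to games played, rounded to two decimal places.
cricket_players$victory <- round(cricket_players$won / cricket_players$played, 2)
print(cricket_players)

# Average number of games played across the six players.
avg_played <- mean(cricket_players$played)
print(avg_played)
```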

Inputting a Source File
A source file contains all the code that you need to run your
analyses. It is used to input data and commands to R. You ask R to run
your code by typing:
source("file.R")
Remember to save the code with the extension ".R"

Code to read data from console to R
mylar <- scan("", what = numeric())
▪ Reads directly from the console
▪ Saves the numbers to a variable
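Because console input cannot be shown in a script, the sketch below feeds `scan()` a string through its `text` argument instead; the input string is an assumed stand-in for what a user might type.

```r
# scan() normally reads what you type at the console; the 'text' argument
# supplies a string instead, so this sketch runs non-interactively.
typed <- "12 7.5 30"   # hypothetical input
values <- scan(text = typed, what = numeric(), quiet = TRUE)

print(values)
print(sum(values))
```

Passing `what = numeric()` tells `scan()` to return numbers; a quoted word such as `"numeric"` would make it read character strings instead.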

Code to read data from text files
Example of read.csv() code
Comma separated value (csv) files
You need to indicate whether the file has a header
Here we set the variable names manually
mydata <- read.csv("DOB.csv", header = TRUE, sep = ",")
names(mydata) <- c("Id", "Time", "DOB")

SUGGESTED TEXTBOOKS
Grolemund, Garrett, Hands-On Programming with R: Write Your Own Functions and
Simulations, Mumbai, Shroff Publishers & Distributors
Chambers, John M., Software for Data Analysis: Programming with R, USA, Springer
E-Resources
•https://www.tutorialspoint.com/r/index.htm
•https://www.w3schools.com/r/r_intro.asp
•https://www.javatpoint.com/r-tutorial

Comments in R
Comments can be used to explain R code and to make it more readable.
They can also be used to prevent execution when testing alternative code.
Comments start with a #. When executing code, R ignores everything
from the # to the end of the line.
Example: This example uses a comment before a line of code:
# This is a comment
"Hello World"
Example: This example uses a comment at the end of a line of code:
"Hello World" # This is a comment
Comments do not have to be text that explains the code; they can also be
used to prevent R from executing code:
# "Good morning!"
"Good night!"
Reserved Words in R
Reserved words in R programming are a set of words that have
special meaning and cannot be used as an identifier (variable
name, function name etc.)
Reserved words in R
if else repeat while function
for in next break TRUE
FALSE NULL Inf NaN NA
NA_integer_ NA_real_ NA_complex_ NA_character_ ...

Identifiers in R
Variables in R
Variables are used to store data whose value can be changed
according to our need. The unique name given to a variable (or to a
function or object) is its identifier.
Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and
underscore (_).
2. An identifier must start with a letter or a period. If it starts with a period, it
cannot be followed by a digit.
3. Reserved words in R cannot be used as identifiers.

Valid identifiers in R
Total, sum, fine.with.dot, Number5, this_is_acceptable
Invalid identifiers in R
tot@l, 5um, _fine, TRUE, .0ne
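The rules can be checked from R itself. This is a rough sketch that relies on base R's `make.names()`, which rewrites any name that is not syntactically valid; a name that comes back unchanged follows the rules. The helper `is_valid_name` is a hypothetical name, and `.0ne` (a period followed by a digit) is used as the last invalid case.

```r
# A name is treated as valid if make.names() leaves it unchanged.
is_valid_name <- function(x) make.names(x) == x

valid   <- c("Total", "sum", "fine.with.dot", "Number5", "this_is_acceptable")
invalid <- c("tot@l", "5um", "_fine", "TRUE", ".0ne")

print(is_valid_name(valid))    # every entry should be TRUE
print(is_valid_name(invalid))  # every entry should be FALSE

# make.names() also shows how R would repair each invalid name.
print(make.names(invalid))
```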
Constants in R
Constants, as the name suggests, are entities whose value cannot
be altered. Basic types of constant are numeric constants and
character constants.

Data cleaning in R
Here we are using the Excel file "Data cleaning in R", assumed to be imported
as the data frame Data_cleaning_in_R (an object name cannot contain spaces)
To view the first 5 observations the command will be
head(Data_cleaning_in_R, 5)
Handling missing values in R
mean(Data_cleaning_in_R$Test1)
mean(Data_cleaning_in_R$Test2)
mean(Data_cleaning_in_R$Test3)
mean(Data_cleaning_in_R$Test1, na.rm = TRUE)
summary(Data_cleaning_in_R)
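Since the Excel file is not available here, the effect of missing values can be illustrated on a small made-up data frame. The name `marks` and all values are assumptions for the example.

```r
# Hypothetical test scores with some missing values (NA).
marks <- data.frame(
  Test1 = c(12, NA, 15, 18),
  Test2 = c(10, 14, NA, NA)
)

# A single NA makes the plain mean NA.
plain_mean <- mean(marks$Test1)
print(plain_mean)

# na.rm = TRUE drops missing values before averaging.
clean_mean <- mean(marks$Test1, na.rm = TRUE)
print(clean_mean)

# Count the missing values in each column.
na_counts <- colSums(is.na(marks))
print(na_counts)

# na.omit() keeps only the rows with no missing value at all.
complete_rows <- na.omit(marks)
print(complete_rows)
```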

Importing an Excel file
To install the "xlsx" package
install.packages("xlsx")
library("xlsx")
Reading an Excel file
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)

class(file_name)
typeof(file_name)
To access the top two rows of a data frame
head(dataframe, 2)
To access the bottom two rows
tail(dataframe, 2)
To display the structure of the data frame
str(dataframe)

Matrix in R
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
mat
mat[1, 2]   # element in row 1, column 2
mat[, 2]    # entire column 2
mat[1, ]    # entire row 1
mat[2, ]    # entire row 2
stringmatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple",
"pear", "melon", "fig"), nrow = 3, ncol = 3)
# Add a fourth column with cbind()
newmatrix <- cbind(stringmatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix
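One point worth seeing in isolation is that `matrix()` fills column by column unless told otherwise, which determines what `mat[1, 2]` returns. A short sketch, with `by_row` and `taller` as hypothetical names:

```r
# Default: values fill down the columns first.
mat <- matrix(1:6, nrow = 2, ncol = 3)
print(mat)
print(mat[1, 2])      # row 1, column 2

# byrow = TRUE fills across the rows instead.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)
print(by_row[1, 2])

# rbind() adds a row, just as cbind() adds a column.
taller <- rbind(mat, 7:9)
print(dim(taller))
```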

Data Visualization
A histogram is
a visual representation of the distribution of a dataset.
It is used to plot the frequency of score occurrences in a continuous dataset.
We work on the movies dataset, stored in the file moviesData.csv
The script used here is myPlot.R
movies <- read.csv("moviesData.csv")
To plot a histogram, type the following command:
hist(movies$runtime)
To add labels and colour to the histogram, we pass more arguments to
hist():
hist(movies$runtime, main = "Distribution of movies' length", xlab = "Runtime of movies", xlim = c(0, 300), col = "blue", breaks = 4)
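To see what a histogram computes without needing moviesData.csv, the sketch below uses the built-in mtcars data set as a stand-in and asks `hist()` for the numbers instead of a drawing. Note that `breaks = 4` is only a suggestion to R, which picks nearby "pretty" cut points.

```r
# plot = FALSE returns the bin edges and counts without drawing anything.
h <- hist(mtcars$mpg, breaks = 4, plot = FALSE)

print(h$breaks)   # bin edges chosen by R
print(h$counts)   # number of cars falling into each bin

# The same call with plot left at its default draws the chart:
# hist(mtcars$mpg, breaks = 4, main = "Distribution of mileage",
#      xlab = "Miles per gallon", col = "blue")
```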

Pie chart
It is a circular chart
divided into wedge-like sectors that illustrate proportions.
The total value of the pie chart is always 100 percent.
In the movies dataset, we make a pie chart of the column "genre";
for that we first make a frequency table of the column genre.
genreCount <- table(movies$genre)
View(genreCount)
pie(genreCount, main = "Proportion of movies' genre", border = "blue", col = "orange")

Bar Chart
A bar chart represents data as rectangular bars, with the length of each bar
proportional to the value of the variable.
R uses the function barplot to create bar charts
We plot a bar chart of the column imdb_rating from the movies dataset;
for the sake of simplicity we take only the first 20
observations.
moviesSub<-movies[1:20,]
barplot(moviesSub$imdb_rating,
ylab= "IMDB Rating",
xlab= "Movies",
col = "blue",
ylim= c(0,10),
main = "Movies', IMDB Rating")

Output of Bar Chart

Continuing from the previous slide, we will add the movie names on
the x-axis
barplot(moviesSub$imdb_rating,
ylab= "IMDB Rating",
xlab= "Movies",
col = "blue",
ylim= c(0,10),
main = "Movies', IMDB Rating",
names.arg= moviesSub$title)
In the output, not all names are visible; to fix that we draw the names
perpendicular to the x-axis (las = 2).

barplot(moviesSub$imdb_rating,
ylab= "IMDB Rating",
xlab= "Movies",
col = "blue",
ylim= c(0,10),
main = "Movies', IMDB Rating",
names.arg= moviesSub$title,
las = 2)

Let us analyse the relation between "imdb_rating" and
"audience_score"; for this we draw a scatter plot using the plot function
A scatter plot is a graph in which the values of two variables are
plotted along two axes.
The pattern of the resulting points reveals the correlation.
plot(x = movies$imdb_rating,
y = movies$audience_score,
main = "IMDB Ratings vs Audience Score",
xlab= "IMDB Rating",
ylab= "Audience Score",
xlim= c(0,10),
ylim= c(0,100),
col = "blue")

Now, we will compute the correlation between imdb_rating and
audience_score:
cor(movies$imdb_rating, movies$audience_score)
Output:
0.8651485
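For readers without the movies file, the same idea can be sketched on the built-in mtcars data set. The manual calculation below is the standard Pearson formula and is included only to show what `cor()` computes by default.

```r
x <- mtcars$wt    # car weight
y <- mtcars$mpg   # miles per gallon

# Built-in Pearson correlation.
r_builtin <- cor(x, y)

# The same quantity written out: co-variation divided by
# the product of the two spreads.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(r_builtin)
print(r_manual)
```

A value near +1 or -1 indicates a strong linear relationship; here heavier cars tend to have lower mileage, so the sign is negative.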

Box Plot
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set as TRUE to draw a notch.
• varwidth is a logical value. Set as TRUE to draw the width of the box proportionate to the
sample size.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.

boxplot(mtcars$mpg)
boxplot(mtcars$mpg, main = "Mileage Data Boxplot", ylab = "Miles Per Gallon (mpg)", xlab = "No. of Cylinders", col = "orange")
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles Per Gallon", main = "Mileage Data")
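The numbers behind the grouped boxplot can be inspected directly. In this sketch `plot = FALSE` asks `boxplot()` to return its statistics without drawing; the comparison with `tapply()` is added only as a cross-check.

```r
# One box per cylinder group; nothing is drawn with plot = FALSE.
bp <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

print(bp$names)   # group labels, one per box
print(bp$stats)   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker

# The third row holds the medians, which match a direct calculation.
box_medians   <- bp$stats[3, ]
group_medians <- as.numeric(tapply(mtcars$mpg, mtcars$cyl, median))
print(box_medians)
print(group_medians)
```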

Introduction to ggplot2
Visualization is an important tool for insight generation
It is used to understand the data structure, identify outliers and find
patterns
There are two methods of data visualization in R:
Basic graphics
Grammar of graphics (popularly known as ggplot2)
Basic Graphics
The following code plots a "sin" curve (the range of x is an example choice)
x <- seq(-pi, pi, by = 0.1)
y <- sin(x)
plot(x, y, main = "Plotting sin curve", ylab = "sin(x)")
Now, we will learn how to change the type of the curve
plot(x, y, main = "Plotting sin curve", ylab = "sin(x)", type = "l", col = "blue")

To plot the “cosine” and “sin” curve on the same plot
plot(x, sin(x),
main = "Two Graphs in one plot",
ylab= "",
type = "l",
col = "blue")
lines(x, cos(x),
col = "red")
Here, we will use "legend" to differentiate between the two graphs
plot(x, sin(x), main = "Two Graphs in one plot", ylab = "", type = "l", col = "blue")
lines(x, cos(x), col = "red")
legend("topleft", c("sin(x)", "cos(x)"), fill = c("blue", "red"))

ggplot2 graphics
The ggplot2 package was created by Hadley Wickham in 2005
It offers a powerful graphics language for creating elegant and complex
plots
We will use the "movies" dataset for exploring the "ggplot2" package
library(ggplot2)
View(movies)
Now, we want to draw a scatter plot of "critics_score" against
"audience_score":
A ggplot2 plot is built from three components:
1. Data
2. Aesthetics
3. Geometry

ggplot(data = movies, mapping = aes(x=critics_score,
y=audience_score))+ geom_point()

There is a positive correlation between critics_score and audience_score
How do we save the ggplot2 graph in our current working directory?
Use the ggsave function:
ggsave("scatter_plot.png")
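A reproducible version of the plot-and-save steps, sketched on the built-in mtcars data set because moviesData.csv is not available here. It assumes the ggplot2 package is installed and that a PNG graphics device is available; the output file name is hypothetical and the file is written to R's temporary folder rather than the working directory.

```r
library(ggplot2)

# Data: mtcars; aesthetics: wt on x and mpg on y; geometry: points.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Save the plot explicitly instead of relying on the last plot shown.
out_file <- file.path(tempdir(), "scatter_demo.png")   # hypothetical file name
ggsave(out_file, plot = p, width = 5, height = 4)

print(file.exists(out_file))
```

Passing `plot = p` makes the call independent of whatever was drawn last, which matters in longer scripts.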

Aesthetic mapping in ggplot2
We will learn:
1.What is aesthetic
2.How to create plots using aesthetic
3.Turning parameters in aesthetic

What is Aesthetic
Aesthetic is a visual property of the objects in a plot
It includes lines, points, symbols, colors and positions
It is used to add customization to our plots
# Load ggplot2
library(ggplot2)
# Clear R workspace
rm(list = ls() )
# Declare a variable to read and store moviesData
movies <-read.csv("moviesData.csv")
# View movies data frame
View(movies)
# Plot critics_score and audience_score
ggplot(data = movies, mapping = aes(x = critics_score, y = audience_score)) +
geom_point()

Now, we will assign a unique color to each genre in the "genre" column
ggplot(data = movies, mapping = aes(x = critics_score, y = audience_score, color = genre)) + geom_point()
How to draw a bar chart using the ggplot function
The following code shows the type of the column "mpaa_rating"
and its distinct values:
str(movies$mpaa_rating)
levels(factor(movies$mpaa_rating))
ggplot(data = movies, mapping = aes(x = mpaa_rating)) + geom_bar()
We will learn how to add labels to this bar chart:

ggplot(data = movies, mapping = aes(x = mpaa_rating, fill = genre)) + geom_bar() + labs(y = "Rating counts", title = "Count of mpaa_rating")
Now we will draw a histogram for the variable "runtime"
# Histogram for "runtime"
ggplot(data = movies, mapping = aes(x = runtime)) + geom_histogram() + labs(x = "Runtime of Movies", title = "Distribution of Runtime")

Data manipulation using the dplyr package
"dplyr" is a package for data manipulation, written and maintained by
Hadley Wickham
It comprises many functions that perform the most commonly used data
manipulation operations
# Clear R workspace
rm(list = ls())
# Declare a variable to list and store movies data
movies<-read.csv("moviesData.csv")
View(movies)

Now we will install the "dplyr" package
install.packages("dplyr")
library(dplyr)
Key functions in the "dplyr" package
filter: to select cases based on their values
arrange: to reorder the cases
select: to select variables based on their names
mutate: to add new variables that are functions of existing variables
summarise: to condense multiple values to a single value
All these functions can be combined with the group_by function. It allows
us to perform any operation by group.
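The five verbs listed above can be tried on the built-in mtcars data set, used here as a stand-in for the movies data. The sketch assumes dplyr is installed; the object names (`cars`, `four_cyl`, and so on) are hypothetical.

```r
library(dplyr)

# Keep the car names as an ordinary column.
cars <- mtcars
cars$model <- rownames(mtcars)

four_cyl   <- filter(cars, cyl == 4)                  # keep matching rows
by_mpg     <- arrange(four_cyl, desc(mpg))            # reorder rows
slim       <- select(by_mpg, model, mpg, wt)          # keep chosen columns
with_ratio <- mutate(slim, mpg_per_wt = mpg / wt)     # add a derived column
print(head(with_ratio, 3))

# summarise() combined with group_by(): one row per cylinder group.
by_cyl <- summarise(group_by(cars, cyl), mean_mpg = mean(mpg))
print(by_cyl)
```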

# Clear R workspace
rm(list = ls())
# Declare a variable to list and store movies data
movies<-read.csv("moviesData.csv")
View(movies)
# using the "filter" function we filter the column "genre" by "Comedy"
movies
moviesComedy <- filter(movies, genre == "Comedy")
View(moviesComedy)
# keep movies whose genre is "Comedy" or "Drama"
moviesComedyDr <- filter(movies, genre == "Comedy" | genre == "Drama")
View(moviesComedyDr)

# the iris data set stores species names in lower case
irisspecies <- filter(iris, Species == "setosa")
View(irisspecies)
irisspecies <- filter(iris, Species == "setosa" | Petal.Length >= 1.5)
View(irisspecies)

# filter the movies data by genre "Comedy" having "imdb_rating" greater than or equal to 7.5
moviesComedyIm <- filter(movies, genre == "Comedy" & imdb_rating >= 7.5)
View(moviesComedyIm)
# using the "arrange" function to sort by imdb_rating in ascending order
moviesImA <- arrange(movies, imdb_rating)
View(moviesImA)

install.packages("dplyr")
library(dplyr)
data(iris)
View(iris)
iris_pet_arr<-arrange(iris, Petal.Length)
View(iris_pet_arr)

# using the "arrange" function to sort by imdb_rating in descending order
moviesImD <- arrange(movies, desc(imdb_rating))
View(moviesImD)
# arrange by two columns: "genre" in alphabetical order, then "imdb_rating" in ascending order
moviesGeIm <- arrange(movies, genre, imdb_rating)
View(moviesGeIm)

More functions in “dplyr” package
1. select
2. rename
3. mutate
Here, we are using the script myVis.R, stored in the myVis folder that also contains
moviesData; set the myVis folder as the working directory.
Before using the above functions, install the package "dplyr"

# using the select function from the dplyr package
moviesTGI <- select(movies, title, genre, imdb_rating)
View(moviesTGI)
Let us select the three columns "thtr_rel_year", "thtr_rel_month" and
"thtr_rel_day" along with the "title" column
For that, enter the following command in the console window:
moviesTHT<-select(movies, title, starts_with("thtr"))
View(moviesTHT)

Let us change the name of the column "thtr_rel_year" using the "rename"
function
moviesR <- rename(movies, rel_year = "thtr_rel_year")
View(moviesR)
Suppose we want to add a new variable (column) to the movies dataset; for
that we will use the "mutate" function
moviesLess <- select(movies, title:audience_score)
View(moviesLess)
# use of the mutate function
moviesMu <- mutate(moviesLess, criAud = critics_score - audience_score)
View(moviesMu)

Pipe operator
We will learn about:
1. The summarise and group_by functions
2. Operations in the summarise function
3. The pipe operator
Make a folder named "pipeops" in the myproject folder and set "pipeops" as the
working directory

Summarise function
1. The summarise function reduces a data frame to a single row.
2. It gives summaries like mean, median etc. of the variables available
in the data frame
3. We use summarise along with the group_by function
# use of the summarise function
summarise(movies, mean(imdb_rating))
4. When we use the group_by function, the data frame is divided into
groups.
We group by the "genre" variable using the group_by function

# use of the group_by function
group_Movies <- group_by(movies, genre)
# using the summarise function on the grouped data
summarise(group_Movies, mean(imdb_rating))
Now, we use the filter, group_by and summarise functions to find
the mean imdb_rating of drama movies for each mpaa_rating.
dramaMov<-filter(movies, genre == "Drama")
gr_dramaMov<-group_by(dramaMov, mpaa_rating)
summarise(gr_dramaMov, mean(imdb_rating))

Pipe operator
The pipe operator is denoted as
%>%
It prevents us from making unnecessary data frames
We can read the pipe as a series of imperative statements
If we want to find the cosine of the sine of pi, we can write
pi %>% sin() %>% cos()
We will learn how to do the same analysis as above using the pipe operator
movies %>% filter(genre =="Drama") %>% group_by(mpaa_rating) %>%
summarise(mean(imdb_rating))
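To make the benefit concrete, the sketch below runs the same filter, group and summarise analysis twice on the built-in mtcars data set (a stand-in for the movies data): once with intermediate data frames and once as a single pipe. It assumes dplyr is installed, which also provides `%>%`.

```r
library(dplyr)

# Nested call versus pipe: both compute cos(sin(pi)).
nested <- cos(sin(pi))
piped  <- pi %>% sin() %>% cos()
print(c(nested, piped))

# Step by step, creating two intermediate data frames.
step_filtered <- filter(mtcars, am == 1)
step_grouped  <- group_by(step_filtered, cyl)
stepwise      <- summarise(step_grouped, mean_mpg = mean(mpg))

# The same analysis as one chain, with no intermediate objects.
chained <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

print(chained)
```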

Let us find the difference between "critics_score" and "audience_score"
in the movies data frame. We will use a box plot for this; using the pipe
operator we will combine functions from the "ggplot2" and "dplyr"
packages
movies %>% mutate(diff = audience_score - critics_score) %>% ggplot(mapping = aes(x = genre, y = diff)) + geom_boxplot()
Now, we are going to count the movies in each combination of genre and
mpaa_rating
movies %>% group_by(genre, mpaa_rating) %>% summarise(num = n())

Conditional statements
We will learn:
1.Conditional statements
2.If, else and else if statements
Conditional statements are used to execute code only when a logical condition
holds
if, else and else if are the basic conditional statements
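A minimal sketch of if, else if and else in R. The function name `grade_for` and the cut-offs of 75 and 40 are invented for illustration.

```r
# Classify a single score; thresholds are hypothetical.
grade_for <- function(score) {
  if (score >= 75) {
    "Distinction"
  } else if (score >= 40) {
    "Pass"
  } else {
    "Fail"
  }
}

print(grade_for(80))
print(grade_for(57))
print(grade_for(20))

# if () tests one condition at a time; ifelse() is the vectorised form
# and handles a whole column of scores in one call.
results <- ifelse(c(45, 30, 72) >= 40, "Pass", "Fail")
print(results)
```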

Statistical function for data analysis
Data Set
A data set is a collection of data, often presented in a table.
There is a popular built-in data set in R called "mtcars" (Motor Trend Car
Road Tests), which is taken from the 1974 Motor Trend US magazine.
In the examples below (and in the next chapters), we will use the
mtcars data set for statistical purposes:

To get in-built data set in R
data()
data(mtcars)
View(mtcars)
head(mtcars,6)
head(mtcars)
nrow(mtcars)
ncol(mtcars)
Example
# Print the mtcars data set
mtcars
Information About the Data Set
You can use the question mark (?) to get information about the
mtcars data set:
# Use the question mark to get information about the data set
?mtcars

Get Information
Use the dim() function to find the dimensions of the data set, and the
names() function to view the names of the variables:
Example
# Create a variable holding the mtcars data set for better organization
Data_Cars <- mtcars
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)

Sort Variable Values
To sort the values, use the sort() function:
Example
Data_Cars<-mtcars
sort(Data_Cars$cyl)
Analyzing the Data
Now that we have some information about the data set, we can start to
analyze it with some statistical numbers.
For example, we can use the summary() function to get a statistical
summary of the data:
Data_Cars<-mtcars
summary(Data_Cars)
sd(mtcars$cyl)

statistical function in R
Mean, Median, and Mode
In statistics, there are often three values that interest
us:
• Mean: the average value
• Median: the middle value
• Mode: the most common value
Data_Cars<-mtcars
mean(Data_Cars$wt)

Median
The median value is the value in the middle, after you have sorted all
the values.
If we take a look at the values of the wt variable (from the mtcars data
set), we will see that there are two numbers in the middle, so the median is their average:
Data_Cars<-mtcars
median(Data_Cars$wt)
mean(marks$Test1)
mean(marks$Test1, na.rm = TRUE)
d1 <-na.omit(old_filename)
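The "two numbers in the middle" case for the median can be checked by hand. This sketch sorts the wt column of mtcars, picks the two central values and compares their average with `median()`.

```r
wt_sorted <- sort(mtcars$wt)
n <- length(wt_sorted)          # mtcars has an even number of rows

# With an even count there is no single middle value,
# so take the two central ones and average them.
middle_two     <- wt_sorted[c(n / 2, n / 2 + 1)]
manual_median  <- mean(middle_two)
builtin_median <- median(mtcars$wt)

print(middle_two)
print(manual_median)
print(builtin_median)
```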

Mode
The mode is the value that appears the greatest number of times.
R does not have a built-in function to calculate the statistical mode. However, we can
create our own function to find it.
If we take a look at the values of the wt variable (from the mtcars data
set), we will see that the value 3.440 appears most often:
Data_Cars<-mtcars
names(sort(-table(Data_Cars$wt)))[1]
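One possible way to wrap this in a reusable function is sketched below. The name `get_mode` is hypothetical, and when several values tie for the highest count it returns the one that appears first in the data.

```r
# Return the most frequent value of a vector.
get_mode <- function(x) {
  distinct <- unique(x)
  counts <- tabulate(match(x, distinct))   # how often each distinct value occurs
  distinct[which.max(counts)]
}

print(get_mode(mtcars$wt))
print(get_mode(c("a", "b", "b", "c")))
```

Unlike the `names(sort(-table(...)))[1]` approach, this keeps the original type, so a numeric column gives a number rather than a character string.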

http://www.sthda.com/english/wiki/ggplot2-essentials
This website gives the details of the ggplot2 package.
https://bookdown.org/jeffreytmonroe/business_analytics_with_r7/basics.html
https://www.geeksforgeeks.org/packages-in-r-programming/?ref=lbp
https://www.modernstatisticswithr.com/datachapter.html
https://www.w3schools.com/r/r_stat_data_set.asp
https://www.geeksforgeeks.org/r-keywords/?ref=lbp