DATA MINING USING R (1).pptx

304 views 130 slides Nov 23, 2022

About This Presentation

Data mining and visualisation with R for CSE students


Slide Content

ADIKAVI NANNAYA UNIVERSITY, UNIVERSITY COLLEGE OF ENGINEERING, DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING. ONE DAY ORIENTATION PROGRAM ON DATA MINING USING R PROGRAMMING, 11th Dec 2017. Dr. M. Kamala Kumari, Assoc. Prof.

OUR WAY…
DATA: A BASE THING
DIFFERENCES BETWEEN RELATED TERMS
OBJECTIVES OF PROCESSING THINGS
STEPS IN DATA ANALYSIS
DIFFERENT ANGLES OF DATA SCIENCE
OBJECTIVES OF ALL STORIES
WHAT IS THE ROLE OF R
DEFINITIONS OF R
VARIATIONS OF R
COMPETITORS OF R
WHY R
CRAN R
RSTUDIO
BASIC COMMANDS
PROGRAM 1 TO PROGRAM 13

Base to anything: Data!!
Processing data = applying statistics on data
Data: 260, 85, 345(423)
Context: 260 = no. of UG colleges affiliated to AKNU; 85 = no. of PG colleges affiliated to AKNU; 345(423) = total no. of colleges affiliated to AKNU
Information: AKNU has more UG affiliations than PG
Analysis = understanding information
Decision making: decide whether to give affiliation for a UG college or not!!

THE ABOVE PROCESS CAN BE VIEWED WITH R, SHOWING DATA, PROCESSING AND RESULTS ALL IN ONE ENVIRONMENT… LET'S MAKE DECISIONS EASY WITH R!!!!

DATA, INFORMATION AND KNOWLEDGE
(Diagram labels: DATA, INFORMATION, KNOWLEDGE)
KNOWLEDGE IS USEFUL INFORMATION OBTAINED THROUGH LEARNING AND EXPERIENCE
KNOWLEDGE DOES NOT NEED DIRECT INTERACTION WITH DATA
PREDICTION IS POSSIBLE WITH THE REQUIRED KNOWLEDGE, BUT NOT WITH INFORMATION ALONE
INFORMATION IS NEEDED TO GET KNOWLEDGE
INFORMATION IS PROCESSED DATA
KNOWLEDGE IS PROCESSED PATTERNS OF INFORMATION ASSOCIATED WITH EXPERIENCE
KNOWLEDGE REQUIRES COGNITIVE (REASONING, PERCEPTION) ABILITY, WHEREAS INFORMATION DOES NOT

KNOWLEDGE == SCIENCE??
Data == Facts
Statistics == Data + Formulae
Information == Description of statistics (reduces errors)
Analysis == Understanding information, or insights into data and information
Analytics == Algorithms/techniques applied to data
Knowledge == Understanding information and technical results
Data Mining == Analytics == Querying…??? … YES

STEPS IN DATA ANALYSIS
ETL:
Collect: gather the DATA
Organize: arrange in a particular format
Clean: remove errors and fill gaps
DATA ANALYTICS:
Explore: apply statistics, techniques
Model: apply algorithms
Reports/Graphics: visualization techniques/tools
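
The steps above can be sketched end to end in a few lines of R. This is only an illustration, not part of the original slides: the built-in airquality data set and the choice of a simple linear model are assumptions made for the example.

```r
# Collect: use the built-in airquality data set (ships with base R)
raw <- airquality

# Organize: keep the columns of interest in a fixed order
organized <- raw[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Clean: drop the rows that contain missing values
clean <- na.omit(organized)

# Explore: summary statistics for every column
print(summary(clean))

# Model: fit a simple linear model of ozone on temperature
fit <- lm(Ozone ~ Temp, data = clean)
print(coef(fit))

# Report: a scatter plot with the fitted line drawn on top
plot(clean$Temp, clean$Ozone, xlab = "Temp", ylab = "Ozone")
abline(fit)
```

Each stage hands a plain data frame to the next one, which is why the whole flow fits in one R session.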

DATA ANALYSIS DATA ANALYTICS DATA MINING AND DATA SCIENCE --- WE ALL ARE RELATED !! Data Science DATA ANALYSIS DCD DATA MINING DATA ANALYTICS DATA WAREHOUSING

DAWN TO DUSK=DATA SCIENCE!! Domain Expert SELECT H/W STATISTICS ETL Data Modeling Computing data Visualization Prediction

DATA SCIENCE ASSOCIATIONS

THE OBJECTIVES OF ALL THE STORIES BEHIND!! … CONTD
DESCRIPTION
COMPARISON
CLASSIFICATION
COMBINE SIMILAR THINGS
GENERATE RULES
UNDERSTAND
ACQUIRE KNOWLEDGE
… AND … PREDICT/DECIDE

ROLE OF 'R' … IN WHICH STORY
The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. Instead of long programs, R gives visualization of statistical computations in an easy way (ready-made methods and less programming, with many packages included). R is one of the analytical tools.

WE CAN DEFINE R TO BE…. R IS A PROGRAMMING LANGUAGE R IS AN ANALYTICAL TOOL R IS A SCRIPTING LANGUAGE R STUDIO IS A SOFTWARE ENVIRONMENT

A B C D E … S … R!!?
R: a free and open source programming language for statistical computing and graphics. Founders of R: Ross Ihaka and Robert Gentleman.

R STUDIO
RStudio is an IDE for developing in R, founded by JJ Allaire.
R is an implementation of the S language, a statistical language.
Latest version of R as listed at the time of this presentation: R 3.4.2 for Windows 32/64 bit.

VARIATIONS OF R
R: free implementation of the S programming language
pbdR: Programming with Big Data in R
R Commander: GUI for R
Rattle GUI: GUI for R
Revolution Analytics: production-grade software for enterprise big data analytics
RStudio: GUI and development environment for R

COMPETITORS OF R
MS Excel: Microsoft Excel spreadsheet
SAS: Statistical Analysis System
SPSS: Statistical Package for Social Science
MATLAB: Matrix Laboratory
OCTAVE: helps in solving linear and nonlinear problems numerically
Python: another programming language, which expresses concepts in fewer lines of code
Spark: provides an interface for programming entire clusters with implicit data parallelism
Storm: distributed real-time computation system

THEN WHY R??
More powerful data manipulation capabilities
Easier automation
Faster computation
It reads any type of data
Easier project organization
It supports larger data sets
Reproducibility (important for detecting errors)
Easier to find and fix errors
It's free
It's open source
Advanced statistics capabilities
State-of-the-art graphics
It runs on many platforms
Anyone can contribute packages to improve its functionality

INVITE R AND RSTUDIO… Download and install the latest R:  http://www.r-project.org/ Download and install RStudio, the R IDE:  http://www.rstudio.com/

CRAN R
The "Comprehensive R Archive Network" (CRAN) is a collection of sites which carry identical material, consisting of the R distribution(s), the contributed extensions, documentation for R, and binaries. (Source: R FAQ, The R Project for Statistical Computing.)
CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.

Welcome to RStudio..!!

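Get and Set working directories
> getwd()
[1] "C:/Users/My Document/Documents"
> setwd("C:/Program Files/R/R-3.4.3/bin/i386")
> getwd()
[1] "C:/Program Files/R/R-3.4.3/bin/i386"
> dir()    # list the files in the working directory
> data()   # list the available data sets
> ls()     # list the objects in the current workspace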

SIMPLE COMMANDS
TO INSTALL ANY PACKAGE
> install.packages("package name")
We can install any package if we know its correct name and it is available for our version of R.
TO SEE THE LIST OF ALL DATASETS
> data()
TO LOAD AN INSTALLED PACKAGE IN R
> library(package name)
TO SEE THE LIST OF PACKAGES INSTALLED IN DIFFERENT LIBRARIES
> library()

PACKAGE AND LIBRARY…???
Recently, the official repository (CRAN) reached 10,000 published packages, and many more are publicly available through the internet.
"A package is like a book, a library is like a library; you use library() to check a package out of the library." (Hadley Wickham, Chief Scientist at RStudio)
Functions are like pages in a package book!!

COMPLETIONS
YELLOW COLOUR IS FOR VARIABLES
BLUE COLOUR IS FOR FUNCTIONS
VIOLET COLOUR WITH A P INSIDE AND TWO COLONS (::) BESIDE IT IS FOR PACKAGES
VIOLET IS ALSO FOR FUNCTION ARGUMENTS OR VECTORS
A GRID IS FOR DATA FRAMES

Program 1: BASIC COMMANDS - VECTORS
A vector is a sequence of data elements of the same basic type. The elements of a vector are officially called components or members.
> 8.5:4.5    # descending sequence of numbers: 8.5 7.5 6.5 5.5 4.5
> rnorm(10)  # ten random draws from the standard normal distribution
> c(1, 1:3, c(5, 8), 13)
AN EMPTY VECTOR OF A GIVEN TYPE CAN BE WRITTEN IN TWO WAYS
> vector("numeric", 5)    # same as numeric(5)
> vector("complex", 5)    # same as complex(5)
> vector("logical", 5)    # same as logical(5)
> vector("character", 5)  # same as character(5)
> vector("list", 5)       # a list of five NULLs; list(5) is different, a one-element list holding 5
> seq.int(3, 12)          # same as 3:12
> seq.int(3, 12, 2)
> seq.int(0.1, 0.01, -0.01)
> seq_len(5)

> n <- 5
> seq_len(n)
> pp <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers")
> for (i in seq_along(pp)) print(pp[i])
> length(1:5)
> length(c(TRUE, FALSE, NA))
> sn <- c("Varma", "Persis", "Kamala", "PVRao")
> length(sn)
> nchar(sn)
Each element of an R vector can be given a name. Labeling the elements can often make your code much more readable. You can specify names when you create a vector in the form name = value. If the name of an element is a valid variable name, it doesn't need to be enclosed in quotes.
> c(apple = 1, banana = 2, "kiwi fruit" = 3, 4)

> x <- (1:5) ^ 2
> x[c(1, 3, 5)]
> x[c(-2, -4)]
> x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
Mixing positive and negative values is not allowed, and will throw an error:
> x[c(1, -1)]   # This doesn't make sense!
> names(x) <- c("one", "four", "nine", "sixteen", "twenty five")
> x[c("one", "nine", "twenty five")]
> x[c(1, NA, 5)]
> x[c(TRUE, FALSE, NA, FALSE, TRUE)]
The digits option controls how many significant digits are printed:
> 10/3
[1] 3.333333
> options(digits=8)
> 10/3
[1] 3.3333333
> options(digits=10)
> 10/3
[1] 3.333333333

The which function returns the locations where a logical vector is TRUE. This can be useful for switching from logical indexing to integer indexing:
> x <- c(23,12,45,11,2,3,4)
> which(x>10)
[1] 1 2 3 4
> which.min(x)   # position of the smallest value
ADDING SCALARS TO VECTORS
> 1:5 + 1      # adds one to each element of the vector
> 1:5 + 1:15   # the shorter vector is recycled to match the longer one
REPEATING VALUES
> rep(1:5, 3)   # repeat function
> rep(1:5, each = 3)
> rep(1:5, times = 1:5)
> rep(1:5, length.out = 7)
> rep.int(1:5, 3)   # the same as rep(1:5, 3)
> rep_len(1:5, 13)

FEW MORE BASIC COMMANDS
To see any dataset in the code editor, type in the console:
> View(women)
To list the number of rows / columns respectively:
> nrow(women)
> ncol(women)
To output a summary of the dataset's columns:
> summary(women)
To output a summary of a dataset's structure:
> str(women)
To get the dimensions of a dataset (number of observations and columns):
> dim(women)
To access a column in a dataset:
> women$height
To check the type (or class) of a variable, the class function can be used:
> class(women)

COERCION
> myNum <- 5.983904798274987298
> class(myNum)
[1] "numeric"
You can coerce (change the type of) numeric string values into numeric types, like so:
> myString <- "5.60"
> class(myString)
[1] "character"
> myNumber <- as.numeric(myString)
> myNumber
[1] 5.6
> class(myNumber)
[1] "numeric"

> myInt <- 209173987
> class(myInt)
[1] "numeric"
To actually force whole numbers to be integers, we need to invoke a function that manually coerces them, called as.integer:
> myInt <- as.integer(myInt)
> class(myInt)
[1] "integer"

> myComparison <- 5 > 6
> myComparison
[1] FALSE
> class(myComparison)
[1] "logical"
> myComplex <- complex(1, 3292, 8974892)
> myComplex
[1] 3292+8974892i
> class(myComplex)
[1] "complex"
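
A few edge cases of coercion are worth trying alongside the slides. The snippet below is a small sketch using only base R; the sample values are made up for illustration.

```r
# Coercion succeeds when the string looks like a number
ok <- as.numeric("5.60")

# It yields NA (with a warning) when the string is not numeric
bad <- suppressWarnings(as.numeric("five"))
print(is.na(bad))

# as.integer() truncates toward zero rather than rounding
print(as.integer(5.98))
print(as.integer(-5.98))

# Mixing types inside c() coerces everything to the most general type
print(class(c(1L, 2.5)))        # integer + double
print(class(c(1, "a", TRUE)))   # anything + character

# An integer literal can be written directly with the L suffix
print(class(209173987L))
```

The L suffix avoids the extra as.integer() call when the value is known to be a whole number.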

PROGRAM NO: 2 IMPORT FROM AND EXPORT TO CSV FILES
CSV (Comma Separated Values) files are intentionally designed to be widely supported; any OS or application that imports or exports data usually has CSV support. They do nothing else but hold data, with no text formatting for example. Excel files hold the same data, but in a binary format. This allows the file to save specific Excel features such as charts and formatting.
> datacsv <- read.csv("D:/FDP/Stu Info.csv")
> datacsv
> s <- subset(datacsv, Sec.Lang == "Sanskrit")
> write.csv(s, "output.csv")
> View(read.csv("output.csv"))   # View() needs an R object, so read the file back first
> View(s)
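
The same round trip can be tried without the "Stu Info.csv" file. In this sketch the student rows, the Name column and the temporary file paths are all hypothetical stand-ins; only the Sec.Lang column name comes from the slide.

```r
# Hypothetical data standing in for the slide's "Stu Info.csv"
students <- data.frame(
  Name     = c("A", "B", "C"),
  Sec.Lang = c("Sanskrit", "Telugu", "Sanskrit"),
  stringsAsFactors = FALSE
)

# Export to a temporary CSV file, then import it again
path <- tempfile(fileext = ".csv")
write.csv(students, path, row.names = FALSE)
datacsv <- read.csv(path, stringsAsFactors = FALSE)

# Keep only the rows whose second language is Sanskrit
s <- subset(datacsv, Sec.Lang == "Sanskrit")
print(s)

# Write the filtered rows to another temporary file
out <- tempfile(fileext = ".csv")
write.csv(s, out, row.names = FALSE)
```

Passing row.names = FALSE keeps write.csv() from adding an extra unnamed column of row numbers to the output file.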

VECTORS AND LISTS
The most essential of all, the vector, is a collection of elements of the same type. Vectors are usually created with the shorthand c (concatenate) function:
> myVector <- c("Hello", "World", "Third Element")
> class(myVector)
[1] "character"
> myVector
[1] "Hello" "World" "Third Element"

> myVector <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen")
> myVector
 [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven"
 [8] "Eight" "Nine" "Ten" "Eleven" "Twelve" "Thirteen" "Fourteen"
[15] "Fifteen"

Note that vectors are strictly one-dimensional. You cannot add another vector as an element inside an existing vector; their elements get merged into one:
> v1 <- c("a", "b", "c")
> v2 <- c("d", "e", "f")
> v3 <- c(v1, v2)
> v3
[1] "a" "b" "c" "d" "e" "f"
You can generate entire numeric vectors by specifying a range:
> myRange <- c(1:10)
> myRange
 [1] 1 2 3 4 5 6 7 8 9 10

LISTS
Lists are just like vectors, except that they are not limited to holding elements of a single type. They are built with the list function, or with the c function if one of the elements you're adding is a list.
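
A short sketch of both ways of building a list. The element names and values here are invented for the example.

```r
# list() can hold elements of different types side by side
item <- list(label = "sample", count = 3, ok = TRUE)
str(item)

# c() returns a list when one of its arguments is a list
longer <- c(item, list(scores = c(81, 92)))
print(class(longer))
print(length(longer))

# c() on plain values of mixed type gives a vector, not a list:
# every element is coerced to a common type
mixed <- c("a", 1, TRUE)
print(class(mixed))
```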

LISTS VISUALIZATION

LISTS
The following variable x is a list containing copies of three vectors n, s, b, and a numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3)   # x contains copies of n, s, b

If the pepper shaker is the list x, then x[1] is a slice (a smaller list holding a single packet), x[[1]] is the packet itself, and x[[1]][[1]] is the first item taken out of that packet.
NOTE: THE MAJOR DIFFERENCE BETWEEN THE TWO IS THAT A SINGLE BRACKET RETURNS A LIST WITH AS MANY ELEMENTS AS YOU ASK FOR, WHILE A DOUBLE BRACKET RETURNS ONLY A SINGLE ELEMENT FROM THE LIST, NOT WRAPPED IN A NEW LIST.
A single bracket always returns another list, with the number of elements equal to the number of indices you pass into the single bracket. In contrast, a double bracket always returns exactly one element.
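
The difference between the two brackets is easy to check in the console. This is a minimal sketch with a made-up list x.

```r
x <- list(c(2, 3, 5), c("aa", "bb"), TRUE)

# Single bracket: the result is still a list
print(class(x[1]))
print(length(x[1:2]))

# Double bracket: the result is the element itself
print(class(x[[1]]))

# Chained double brackets reach inside the element
print(x[[1]][[1]])
```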

Member Reference
In order to reference a list member directly, we have to use the double square bracket "[[]]" operator. The following object x[[2]] is the second member of x. In other words, x[[2]] is a copy of s, but is not a slice containing s or its copy.
> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
We can modify its content directly.
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee"   # s is unaffected

MATRICES
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol).
> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3

> m <- matrix(nrow=3, ncol=2, c(1,2,3,4,5,6))
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m <- matrix(c(1,2,3,4))         # a 4 x 1 column matrix
> m <- matrix(c(1,2,3,4), 7, 8)   # the four values are recycled to fill 7 x 8
> m <- matrix(1:9, nrow=3, ncol=3, byrow=TRUE)
> matrix(1, nrow=10, ncol=10)
> A <- matrix(1:16, 4, 4)   # a 4 x 4 matrix, large enough for the subscripts used below
> z <- A[2,3]               # the element in the 2nd row and 3rd column of A, assigned to z
> A[2:4,4:2]                # rows 2, 3 and 4 with columns 4, 3 and 2, giving a sub-matrix
> A[2,2:3]                  # second row, 2nd and 3rd column elements
> second.column <- A[,2]    # the whole second column
> which(A>8)                # positions (counted down the columns) of the elements greater than 8

ARRAYS
An array is just a vector plus information on the dimensions of the array. We can create an array from a vector:
> X <- array(1:24, dim=c(3,4,2))   # 24 elements arranged as two matrices of 3 rows and 4 columns
> x <- seq(1,27)
> c(3,9)
[1] 3 9
> dim(x) = c(3,9)
> is.array(x)
[1] TRUE
> is.matrix(x)
[1] TRUE
> x
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    4    7   10   13   16   19   22   25
[2,]    2    5    8   11   14   17   20   23   26
[3,]    3    6    9   12   15   18   21   24   27

DATA FRAMES
Data frames are used to store tabular data. They are represented as a special type of list where every element of the list has to have the same length. Each element of the list can be thought of as a column, and the length of each element of the list is the number of rows. Unlike matrices, data frames can store different classes of objects in each column (just like lists).

A data frame is used for storing data tables. It is a list of vectors of equal length.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)   # df is a data frame

Cell value from the first row, second column of mtcars:
> mtcars[1, 2]
[1] 6
We can use the row and column names instead of the numeric coordinates.
> mtcars["Mazda RX4", "cyl"]
[1] 6
The number of data rows in the data frame is given by the nrow function.
> nrow(mtcars)   # number of data rows
[1] 32
And the number of columns of a data frame is given by the ncol function.
> ncol(mtcars)   # number of columns
[1] 11

We reference a data frame column with the double square bracket "[[]]" operator. For example, to retrieve the ninth column vector of the built-in data set mtcars, we write mtcars[[9]].
> mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can retrieve the same column vector by its name.
> mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can also retrieve it with the "$" operator in lieu of the double square bracket operator.
> mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
Yet another way to retrieve the same column vector is to use the single square bracket "[]" operator. We prepend the column name with a comma character, which signals a wildcard match for the row position.
> mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

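> x <- read.csv("data1.csv", header=T, sep=",")
> x2 <- read.csv("data2.csv", header=T, sep=",")
> x3 <- cbind(x, x2)
> x3
  Subtype Gender Expression Age     City
1       A      m      -0.54  32 New York
2       A      f      -0.80  21  Houston
3       B      f      -1.03  34  Seattle
4       C      m      -0.41  67  Houston
For a matrix such as A <- matrix(1:16, 4, 4), which() with arr.ind=TRUE returns the row and column positions of the matching elements:
> which(A>=15, arr.ind=TRUE)
     row col
[1,]   3   4
[2,]   4   4
Similarly, we can assign values the other way. A whole row is replaced by supplying one value per column:
> A[1,] <- c(2,4,5,7)   # illustrative values, one for each of the four columns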

EXP NO 3: GETTING AND CLEANING DATA WITH SWIRL Swirl is an interactive package which will teach us and at the same time make us practice with the exercises. It has three types of exercises, basic, intermediate and advanced. Getting and cleaning data is an intermediate exercise.

WHAT IS SWIRL() IN R
swirl is a software package for the R programming language that turns the R console into an interactive learning environment. Users receive immediate feedback as they are guided through self-paced lessons in data science and R programming.
> install.packages("swirl")
> library(swirl)
> install_from_swirl("Getting and Cleaning Data")

> install.packages("swirl")
> library(swirl)
> install_course("Getting and Cleaning Data")
> swirl()

SWIRL() Flow
| Please choose a course, or type 0 to exit swirl.
1: Getting and Cleaning Data
2: R Programming
3: Take me to the swirl course repository!
Selection: 1
| Please choose a lesson, or type 0 to return to course menu.
1: Manipulating Data with dplyr
2: Grouping and Chaining with dplyr
3: Tidying Data with tidyr
4: Dates and Times with lubridate

ABOUT PACKAGES COMING WITH GETTING AND CLEANING DATA For this we use three types of packages: dplyr, tidyr, lubridate. Dplyr is a package that provides a consistent and concise grammar for manipulating tabular data. It makes data manipulation easier.

About dplyr package from swirl() According to the "Introduction to dplyr" vignette written by the package authors, "The dplyr philosophy is to have small functions that each do one thing well." Specifically, dplyr supplies five 'verbs' that cover most fundamental data manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().

Data manipulation using dplyr
> install.packages("dplyr")   ## install
You might get asked to choose a CRAN mirror; this is basically asking you to choose a site to download the package from. The choice doesn't matter too much; we recommend the RStudio mirror.
> library("dplyr")   ## load
You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.

Selecting columns and filtering rows
To select columns of a data frame, use select(). The first argument to this function is the data frame (ToothGrowth), and the subsequent arguments are the columns to keep.
> select(ToothGrowth, len, supp, dose)
> aa <- select(ToothGrowth, len, supp, dose)

select(): to select columns of a data frame
> select(ToothGrowth, len, supp, dose)
> plot(aa)
filter(): to choose rows
> filter(ToothGrowth, len == 5)

filter(): to choose rows
> filter(ToothGrowth, len > 5)
Pipes (%>%) are an alternative to nesting functions (i.e. one function inside of another). Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set.
> ToothGrowth %>%
+   filter(len < 5) %>%
+   select(len, supp, dose)

To create a new object with this smaller version of the data, we assign it a new name.
> ToothGrowth_sml <- ToothGrowth %>%
+   filter(len < 5) %>%
+   select(len, supp, dose)
mutate(): create new columns based on the values in existing columns

> ToothGrowth %>%
+   mutate(len = len / 4)
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data:
> ToothGrowth %>%
+   mutate(len = len / 4) %>%
+   head

If a column contained NAs (ToothGrowth itself has none), we could remove those rows by inserting filter() in this chain:
> ToothGrowth %>%
+   mutate(len = len / 4) %>%
+   filter(!is.na(len)) %>%
+   head

group_by(): splits the data into groups upon which some operations can be run
> ToothGrowth %>% group_by(supp) %>% tally()
summarize(): group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group.
> ToothGrowth %>% group_by(supp) %>% summarize(len = mean(len, na.rm = TRUE))
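
The grouped counts and means above can be cross-checked without dplyr. The sketch below swaps in the base R functions table() and aggregate(); they are not dplyr functions, just a way to confirm the same per-group results.

```r
# Rows per group, comparable to group_by(supp) %>% tally()
counts <- table(ToothGrowth$supp)
print(counts)

# Mean of len per group, comparable to group_by(supp) %>% summarize(mean(len))
means <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean)
print(means)
```

Both approaches split the rows by supp, apply a function to each piece, and combine the results into one row per group.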

Data Frame Column Slice
We retrieve a data frame column slice with the single square bracket "[]" operator.
Numeric Indexing
The following is a slice containing the first column of the built-in data set mtcars.
> mtcars[1]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
............
Name Indexing
We can retrieve the same column slice by its name.
> mtcars["mpg"]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
............
To retrieve a data frame slice with the two columns mpg and hp, we pack the column names in an index vector inside the single square bracket operator.
> mtcars[c("mpg", "hp")]
               mpg  hp
Mazda RX4     21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710    22.8  93
............

Exp 5. Creating Data Frame
> emp.data <- data.frame(
+   emp_id = c(1:5),
+   emp_name = c("Ratna", "Kumar", "Kamala", "Prajwal", "Pravachan"),
+   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
+   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
+   stringsAsFactors = FALSE
+ )
> emp.data
# Add the "dept" column.
> emp.data$dept <- c("IT", "Operations", "IT", "HR", "Finance")
> v <- emp.data
> print(v)

Extracting rows and columns
> A = emp.data$emp_id
> B = emp.data$emp_name
a) > C = data.frame(A, B)
b) > emp.data[1:2,]
c) > emp.data[c(3,5), c(2,4)]

> emp.data[1:2,]
  emp_id emp_name salary start_date       dept
1      1    Ratna  623.3 2012-01-01         IT
2      2    Kumar  515.2 2013-09-23 Operations
> emp.data[c(3,5), c(2,4)]
   emp_name start_date
3    Kamala 2014-11-15
5 Pravachan 2015-03-27

PROGRAM 6: ‘apply’ group of functions

Function | Arguments | Objective | Input | Output
apply | apply(X, MARGIN, FUN) | Apply a function to the rows or columns or both | Data frame or matrix | vector, list, array
lapply | lapply(X, FUN) | Apply a function to all the elements of the input | List, vector or data frame | list
sapply | sapply(X, FUN) | Apply a function to all the elements of the input | List, vector or data frame | vector or matrix
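
The three functions in the table can be compared on small inputs. This is a sketch with made-up data; the variable names are only for illustration.

```r
m <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns, filled column by column

row_sums <- apply(m, 1, sum)    # MARGIN = 1 applies the function to each row
col_sums <- apply(m, 2, sum)    # MARGIN = 2 applies the function to each column

lst <- list(a = 1:3, b = 4:6)
l_out <- lapply(lst, mean)      # always returns a list
s_out <- sapply(lst, mean)      # simplifies the result to a named numeric vector

print(row_sums)
print(col_sums)
print(l_out)
print(s_out)
```

Here lapply() and sapply() compute the same means; only the shape of the returned object differs.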

PROGRAM 7: cbind-ing and rbind-ing
Matrices can be created by column-binding or row-binding with cbind() and rbind().
> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
> C <- cbind(1:3, 4:6, 5:7)
> D <- rbind(1:3, 4:6)

PROGRAM 7: rbind() and cbind() functions. Matrices can be created by column-binding or row-binding with cbind() and rbind(). Data frames can also be appended by these functions.

Factor Variables
Factor variables are nominal variables, also known as categorical variables. Levels are the unique values of the variable.
> gender <- c(rep("male",20), rep("female", 30))
> gender <- factor(gender)
> gender   # prints the 50 values, followed by the line: Levels: female male
> summary(gender)   # for a factor, summary() counts the values per level
female   male
    30     20
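
A factor stores integer codes plus a vector of level labels, which can be inspected directly. The following is a small base R sketch; gender2 is a name introduced only for this example.

```r
gender <- factor(c(rep("male", 20), rep("female", 30)))

# Levels are sorted alphabetically by default
print(levels(gender))

# Internally each value is an integer code that indexes into the levels
print(as.integer(gender)[1:3])

# table() gives the same counts as summary() for a factor
print(table(gender))

# The level order can be set explicitly when it matters
gender2 <- factor(gender, levels = c("male", "female"))
print(levels(gender2))
```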

PROGRAM 8: DISCRETE IRIS
> iris$Seplen <- cut(iris$Sepal.Length, breaks=c(4.3,5.6,6.8,7.9), labels=c("low","medium","high"))
> iris$Seplen
 [1] low low low low low low low low
 [9] low low low low low <NA> medium medium
[17] low low medium low low low low low
[25] low low low low low low low low
[33] low low low low low low low low
[41] low low low low low low low low
[49] low low high medium high low medium medium
[57] medium low medium low low medium medium medium
[65] …..
Levels: low medium high
The <NA> at position 14 is the value 4.3: cut() intervals are open on the left by default, so a value equal to the lowest break falls outside every interval.
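
One way to avoid the missing value is the include.lowest argument of cut(). The sketch below compares the two calls; the names seplen_default and seplen_fixed are introduced only for this example.

```r
breaks <- c(4.3, 5.6, 6.8, 7.9)
labels <- c("low", "medium", "high")

# Default: intervals are (4.3, 5.6], (5.6, 6.8], (6.8, 7.9]
seplen_default <- cut(iris$Sepal.Length, breaks = breaks, labels = labels)
print(which(is.na(seplen_default)))   # positions that fell outside every interval

# include.lowest = TRUE closes the first interval on the left: [4.3, 5.6]
seplen_fixed <- cut(iris$Sepal.Length, breaks = breaks, labels = labels,
                    include.lowest = TRUE)
print(sum(is.na(seplen_fixed)))
print(table(seplen_fixed))
```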

PROGRAM 9 - SCATTER PLOT USING ‘DPLYR’ ON GUINEA PIGS ‘TOOTHGROWTH’ DATA SET

> aa <- select(ToothGrowth, len, supp, dose)
# To choose rows we use filter()
> filter(ToothGrowth, len <= 14.5)
> ToothGrowth %>%
+   group_by(supp)
> ToothGrowth %>%
+   group_by(supp) %>%
+   summarise(meanoflen = mean(len))
> plot(aa)

gg: grammar of graphics
> library(dplyr)
> library(ggplot2)   # the package to load is ggplot2, not ggplot
> ggplot(aa, aes(x=factor(dose), y=len, fill=supp))
> ggplot(aa, aes(x=factor(dose), y=len, fill=supp)) + geom_boxplot()   # aes = aesthetic

PROGRAM 10: LINEAR AND MULTIPLE REGRESSION
Regression: a technique for determining the statistical relationship between two or more variables, where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables.
Linear regression: $Y = mX + c$, with a single predictor X (plotted with X on the horizontal axis and Y on the vertical axis).

Multiple Linear Regression
$Y = aX_3 + bX_2 + cX_1 + d$
3 predictors/explanatory variables: $X_3$, $X_2$, $X_1$
a, b, c are coefficients
d is the intercept (bias value)
Y is the response variable; it is estimated or predicted from the 3 X variables.

Mtcars variables
[, 1] mpg   Miles/(US) gallon
[, 2] cyl   Number of cylinders
[, 3] disp  Displacement (cu.in.)
[, 4] hp    Gross horsepower
[, 5] drat  Rear axle ratio
[, 6] wt    Weight (lb/1000)
[, 7] qsec  1/4 mile time
[, 8] vs    Engine cylinder configuration (V-shaped or straight)
[, 9] am    Transmission (0 = automatic, 1 = manual)
[,10] gear  Number of forward gears
[,11] carb  Number of carburetors

lm = linear model
> library(ggplot2)
> ggplot(mtcars, aes(wt, mpg))
> ggplot(mtcars, aes(wt, mpg)) + geom_point()
> ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method="lm")

mpg versus weight

For example, in the mtcars dataset you can build a linear model between the fuel efficiency (mpg) and the weight of the car (wt):
$mpg = \beta_0 + \beta_1 \cdot wt$
$\beta_1$ is the slope; mpg is the dependent variable
$\beta_0$ is the intercept; wt is the independent variable

Residuals. The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual.
Observed: y = 10*3 + 5 = 35
Model with slope m = 9, that is y = 9x + c: predicted ŷ = 9*3 + 5 = 32
Residual: e = y - ŷ = 35 - 32 = 3
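
The intercept, the slope and the residuals of the mpg versus wt model can be read off a fitted lm object. The sketch below uses only base R and the built-in mtcars data; the variable names are chosen for the example.

```r
# Fit mpg = b0 + b1 * wt by least squares
fit <- lm(mpg ~ wt, data = mtcars)
b <- coef(fit)   # b[1] is the intercept, b[2] is the slope for wt
print(b)

# Predicted values from the fitted line
predicted <- b[1] + b[2] * mtcars$wt

# Residual = observed - predicted, one per data point
e <- mtcars$mpg - predicted
print(head(e))

# The hand-computed residuals agree with residuals(fit)
print(all.equal(unname(e), unname(residuals(fit))))
```

With an intercept in the model, least-squares residuals sum to zero up to rounding error, which is a quick sanity check on a fit.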

> mfit = lm(mpg ~ wt + disp + cyl, data=mtcars) > plot(mfit)
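
Beyond the diagnostic plots, the fitted multiple regression can be inspected numerically. This is a sketch; the new_car values are hypothetical and serve only to show how predict() is called.

```r
mfit <- lm(mpg ~ wt + disp + cyl, data = mtcars)

# One intercept plus one coefficient per predictor
print(coef(mfit))

# Coefficient table: estimates, standard errors, t values and p-values
print(summary(mfit)$coefficients)

# Share of the variance in mpg explained by the model
r2_multi  <- summary(mfit)$r.squared
r2_simple <- summary(lm(mpg ~ wt, data = mtcars))$r.squared
print(c(simple = r2_simple, multiple = r2_multi))

# Predict mpg for a hypothetical car
new_car <- data.frame(wt = 3, disp = 200, cyl = 6)
pred <- predict(mfit, new_car)
print(pred)
```

The p-values in the coefficient table are the usual starting point for judging which predictors contribute significantly.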

PROGRAM NO: 11 Major Clustering Approaches (I)
Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors. Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: Diana, Agnes, BIRCH, CHAMELEON
Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE

IRIS TYPES

K-means clustering
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"
> x <- iris[,-5]
> y <- iris$Species
> kc <- kmeans(x,3)
> kc
K-means clustering with 3 clusters of sizes 38, 62, 50
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.850000    3.073684     5.742105    2.071053
2     5.901613    2.748387     4.393548    1.433871
3     5.006000    3.428000     1.462000    0.246000

Clustering vector:
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [29] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2
 [57] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
 [85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1
[113] 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1
[141] 1 1 2 1 1 1 2 1 1 2
Within cluster sum of squares by cluster:
[1] 23.87947 39.82097 15.15100
 (between_SS / total_SS = 88.4 %)

>plot(x[c("Sepal.Length","Sepal.Width")],col=kc$cluster)

K-means >points(kc$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=23, cex=3)
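
Because kmeans() starts from random centres, the cluster numbering and sometimes the clusters themselves change between runs. The sketch below fixes the seed and uses several random starts; the seed value and nstart = 20 are arbitrary choices for the example.

```r
set.seed(42)                      # make the random starts repeatable
x <- iris[, -5]                   # drop the Species column

# nstart = 20 runs k-means from 20 random starts and keeps the best result
kc <- kmeans(x, centers = 3, nstart = 20)

print(sort(kc$size))              # cluster sizes, independent of the labelling
print(kc$tot.withinss)            # total within-cluster sum of squares

# Cross-tabulate clusters against the known species
tab <- table(cluster = kc$cluster, species = iris$Species)
print(tab)
```

The cluster numbers 1 to 3 are arbitrary labels, so only the grouping in the cross-table is meaningful, not which number a species receives.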

> library(fpc)
> iris1 <- iris[,-5]   # numeric columns only
> pamresult <- pamk(iris1)
> pamresult$nc   # nc = number of clusters
[1] 2
> table(pamresult$pamobject$clustering, iris$Species)
    setosa versicolor virginica
  1     50          1         0
  2      0         49        50
> layout(matrix(c(1,2),1,2))   # two plots side by side
> plot(pamresult$pamobject)

The  ggplot () command creates a plot object. In it we assigned a data set.  aes () creates what Hadley Wickham calls an aesthetic: a mapping of variables to various parts of the plot. ... Another way to split up the way we look at data is with facets.

> ggplot(mtcars, aes(wt, mpg))
Error in ggplot(mtcars, aes(wt, mpg)) : could not find function "ggplot"
> library(ggplot2)   # load the package first
> ggplot(mtcars, aes(wt, mpg))
> ggplot(mtcars, aes(wt, mpg)) + geom_point()
> ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method="lm")
> ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_abline()

> library(ggplot2)
> ggplot(mtcars, aes(wt, mpg))
> ggplot(mtcars, aes(wt, mpg)) + geom_point()
> ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method="lm")

> ggplot(mtcars, aes(x=wt, y=mpg, col=cyl, size=disp)) + geom_point()

What combination of predictors will best predict fuel efficiency (slopes/coefficients and intercept)? Which predictors increase our accuracy by a statistically significant amount? We need to find out which predictors are significant and determine the ideal formula for prediction… WHICH IS WHAT WE CALL LINEAR REGRESSION.

Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition

Density-Based Clustering: Basic Concepts
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
$N_{Eps}(p) = \{ q \in D \mid dist(p,q) \le Eps \}$
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to $N_{Eps}(q)$, and
q satisfies the core point condition: $|N_{Eps}(q)| \ge MinPts$
(Figure: points p and q, with MinPts = 5 and Eps = 1 cm)

Density-Reachable and Density-Connected
Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$
Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border and outlier points, with Eps = 1 cm and MinPts = 5)

DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
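
The algorithm steps above can be sketched in base R. This is a minimal teaching version written for these notes, not the implementation of any package: dbscan_sketch is a hypothetical name, Euclidean distance is assumed, and the test points are made up.

```r
# Minimal DBSCAN sketch (hypothetical helper, Euclidean distance assumed)
dbscan_sketch <- function(pts, eps, min_pts) {
  d <- as.matrix(dist(pts))                 # pairwise distances
  n <- nrow(d)
  # Eps-neighbourhood of every point (a point counts as its own neighbour)
  nbrs <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- vapply(nbrs, length, integer(1)) >= min_pts
  cluster <- integer(n)                     # 0 = unassigned, and finally noise
  k <- 0L
  for (p in seq_len(n)) {
    # Only an unassigned core point can start a new cluster
    if (cluster[p] != 0L || !is_core[p]) next
    k <- k + 1L
    cluster[p] <- k
    queue <- nbrs[[p]]
    while (length(queue) > 0) {
      q <- queue[1]
      queue <- queue[-1]
      if (cluster[q] != 0L) next            # already assigned to a cluster
      cluster[q] <- k                       # q is density-reachable from p
      # Border points join the cluster but do not extend it
      if (is_core[q]) queue <- c(queue, nbrs[[q]])
    }
  }
  cluster
}

# Two tight groups of five points and one isolated point
pts <- rbind(
  cbind(c(0, 0.1, 0.2, 0.1, 0), c(0, 0.1, 0, -0.1, 0.2)),
  cbind(c(5, 5.1, 5.2, 5.1, 5), c(5, 5.1, 5, 4.9, 5.2)),
  c(10, 10)
)
cl <- dbscan_sketch(pts, eps = 0.5, min_pts = 3)
print(cl)
```

Points left with cluster 0 are the noise points; for real work a tested package implementation would be the safer choice.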

Before the package 'rpart' (title: Recursive Partitioning and Regression Trees)
A regression line is a straight line that attempts to describe the relationship between two variables, also known as a trend line or line of best fit. Simple linear regression is a prediction when a variable (y) is dependent on a second variable (x), based on the regression equation of a given set of data.

Decision trees are of two types:
Classification Trees
Regression Trees
CTs are used when the target or response variable is categorical in nature. RTs are used when the target variable is continuous or numeric. It is the target variable that determines the type of decision tree needed.
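
Both kinds of tree can be fitted with rpart by changing the method argument. The sketch below assumes the rpart package is installed (it normally ships with R as a recommended package); the choice of mtcars and of wt and hp as predictors for the regression tree is only for illustration.

```r
library(rpart)

# Classification tree: the target (Species) is categorical
class_tree <- rpart(Species ~ ., data = iris, method = "class")
pred_class <- predict(class_tree, iris, type = "class")
print(table(predicted = pred_class, actual = iris$Species))

# Regression tree: the target (mpg) is numeric
reg_tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")
pred_num <- predict(reg_tree, mtcars)
print(head(pred_num))
```

A classification tree predicts a class label at each leaf, while a regression tree predicts the mean of the target values that fall in that leaf.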

DECISION TREES USING PARTY - PROGRAM 12
> install.packages("readr")
> library(readr)
> install.packages("party")
Installing package into 'C:/Users/My Document/Documents/R/win-library/3.4'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/party_1.2-3.zip'
Content type 'application/zip' length 719826 bytes (702 KB)
downloaded 702 KB
package 'party' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\My Document\AppData\Local\Temp\RtmpOAuKaM\downloaded_packages
> library(party)

DECISION TREE USING RPART - PROGRAM 12
Usage: rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
> library(rpart)
> tree <- rpart(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris, method="class")

# iris$class <- as.factor(iris$class) applies only when the data was read from a file whose label column is named class; the built-in iris has no such column
> View(iris)
> iris$Species <- as.factor(iris$Species)   # already a factor in the built-in iris
> tree1 <- ctree(Species~Sepal.Length, data=iris)
> plot(tree1)

> tree <- rpart(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris, method="class")
> plot(tree)

> plot(tree, uniform=TRUE, main="Classification Tree for Iris dataset")
> text(tree, use.n=TRUE, all=TRUE, cex=.8)

SUPPORT VECTOR MACHINE

X1, X2 Attributes

ABOUT DIFFERENT TYPES OF VARIABLES

FEW GOOD WEB SITES ON R
www.kaggle.com
www.rdocumentation.org
www.statmethods.net
www.r-tutor.com
www.tutorialspoint.com
www.datacamp.com
www.github.com
https://drsimonj.svbtle.com/visualising-residuals