DATA MINING USING R (1).pptx

ADIKAVI NANNAYA UNIVERSITY
UNIVERSITY COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ONE DAY ORIENTATION PROGRAM ON DATA MINING USING R PROGRAMMING
11th Dec 2017
Dr. M. Kamala Kumari, Assoc. Prof.

OUR WAY…
DATA….A BASE THING
DIFFERENCES BETWEEN RELATED TERMS
OBJECTIVES OF PROCESSING THINGS
STEPS IN DATA ANALYSIS
DIFFERENT ANGLES OF DATA SCIENCE
OBJECTIVES OF ALL STORIES
WHAT IS THE ROLE OF R
DEFINITIONS OF R
VARIATIONS OF R
COMPETITORS OF R
WHY R
CRAN R
RSTUDIO
BASIC COMMANDS
PROGRAM 1 TO PROGRAM 13

Base to anything: Data!!
Processing Data = Applying Statistics on Data
Data: 345 (423), 260, 85
Context: No. of UG Affiliated Colleges to AKNU; No. of PG Affiliated Colleges to AKNU; Total No. of Affiliated Colleges to AKNU
Information: AKNU has more UG affiliations than PG
Analysis = Understanding Information
Decision Making: Decide whether to give affiliation for a UG College or not!!

THE ABOVE PROCESS CAN BE VIEWED WITH R –SHOWING DATA,PROCESSING AND RESULTS ALL IN ONE ENVIRONMENT…..LET’S MAKE DECISION EASY WITH R!!!!

DATA, INFORMATION AND KNOWLEDGE
KNOWLEDGE IS USEFUL INFORMATION OBTAINED THROUGH LEARNING AND EXPERIENCE
KNOWLEDGE DOES NOT NEED DIRECT INTERACTION WITH DATA
PREDICTION IS POSSIBLE WITH THE REQUIRED KNOWLEDGE, BUT NOT WITH INFORMATION ALONE
INFORMATION IS NEEDED TO GET KNOWLEDGE
INFORMATION IS PROCESSED DATA
KNOWLEDGE IS PROCESSING PATTERNS OF INFORMATION ASSOCIATED WITH EXPERIENCE
KNOWLEDGE REQUIRES COGNITIVE (REASONING, PERCEPTION) ABILITY, WHEREAS INFORMATION DOES NOT
(Diagram labels: DATA, INFORMATION, KNOWLEDGE)

KNOWLEDGE == SCIENCE??
Data == Facts
Statistics == Data + Formulae
Information == Description of Statistics (Reduce errors)
Analysis == Understanding Information or Insights of Data and info
Analytics == Algorithms/Techniques on Data
Knowledge == Understanding information and technical results
Data Mining == Analytics == Querying…???... YES

STEPS IN DATA ANALYSIS
ETL
Collect: gather the DATA
Organize: arrange in a particular format
Clean: remove errors and fill gaps
DATA ANALYTICS
Explore: apply statistics and techniques
Model: apply algorithms
Reports/Graphics: visualization techniques/tools

DATA ANALYSIS DATA ANALYTICS DATA MINING AND DATA SCIENCE --- WE ALL ARE RELATED !! Data Science DATA ANALYSIS DCD DATA MINING DATA ANALYTICS DATA WAREHOUSING

DAWN TO DUSK=DATA SCIENCE!! Domain Expert SELECT H/W STATISTICS ETL Data Modeling Computing data Visualization Prediction

DATA SCIENCE ASSOCIATIONS

THE OBJECTIVES OF ALL THE STORIES BEHIND!!.....CONTD
DESCRIPTION
COMPARISON
CLASSIFICATION
COMBINE SIMILAR THINGS
GENERATE RULES
UNDERSTAND
ACQUIRE KNOWLEDGE
….AND….. PREDICT/DECIDE

ROLE OF ‘R’…IN WHICH STORY
The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
Instead of long programs, R gives visualization of statistical computations in an easy way (ready-made methods and less programming, with many packages included).
R is one of the analytical tools.

WE CAN DEFINE R TO BE…. R IS A PROGRAMMING LANGUAGE R IS AN ANALYTICAL TOOL R IS A SCRIPTING LANGUAGE R STUDIO IS A SOFTWARE ENVIRONMENT

A B C D E … S .. R..!!..?
R: a free and open source software programming language for statistical computing and graphics.
Founders of R: Ross Ihaka & Robert Gentleman

R STUDIO
RStudio is an IDE for developing in R, founded by JJ Allaire.
R is an implementation of the S language, a statistical language.
Latest version of R = R 3.4.2 for Windows 32/64bit

VARIATIONS OF R
R: free implementation of the S programming language
pbdR: Programming with Big Data in R
R Commander: GUI for R
Rattle GUI: GUI for R
Revolution Analytics: production-grade software for enterprise big data analytics
RStudio: GUI and development environment for R

COMPETITORS OF R
MS Excel: Microsoft Excel Sheet
SAS: Statistical Analysis System
SPSS: Statistical Package for Social Science
MATLAB: Matrix Laboratory
OCTAVE: Helps in solving linear and nonlinear problems numerically
Python: Another programming language, which expresses concepts in fewer lines of code
Spark: Provides an interface for programming entire clusters with implicit data parallelism
Storm: Distributed real-time computation system

THEN WHY R??
More powerful data manipulation capabilities
Easier automation
Faster computation
It reads any type of data
Easier project organization
It supports larger data sets
Reproducibility (important for detecting errors)
Easier to find and fix errors
It's free
It's open source
Advanced statistics capabilities
State-of-the-art graphics
It runs on many platforms
Anyone can contribute packages to improve its functionality

INVITE R AND RSTUDIO… Download and install the latest R: http://www.r-project.org/ Download and install RStudio, the R IDE: http://www.rstudio.com/

CRAN R The “ Comprehensive R Archive Network ” ( CRAN ) is a collection of sites which carry identical material, consisting of the R distribution(s), the contributed extensions, documentation for R , and binaries. R FAQ - The R Project for Statistical Computing CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R . Please use the CRAN mirror nearest to you to minimize network load.

Welcome to RStudio..!!

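Get and Set working directories
> getwd()
[1] "C:/Users/My Document/Documents"
> setwd("C:/Program Files/R/R-3.4.3/bin/i386")
> getwd()
[1] "C:/Program Files/R/R-3.4.3/bin/i386"
> dir() # files in the working directory
> data() # available datasets
> ls() # objects in the current workspace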

SIMPLE COMMANDS
TO INSTALL ANY PACKAGE
> install.packages("package name")
We can install any package if we know its correct name and it is available for our version of R.
TO SEE THE LIST OF ALL DATASETS
> data()
TO LOAD AN INSTALLED PACKAGE IN R
> library(package name)
TO SEE THE LIST OF PACKAGES INSTALLED IN DIFFERENT LIBRARIES
> library()

PACKAGE AND LIBRARY…???
The official repository (CRAN) passed 10,000 published packages in 2017, and many more are publicly available through the internet.
"A package is like a book, a library is like a library; you use library() to check a package out of the library." (Hadley Wickham, Chief Scientist at RStudio)
Functions are like pages in a package book!!

COMPLETIONS
YELLOW: VARIABLES
BLUE: FUNCTIONS
VIOLET WITH A "P" INSIDE AND TWO COLONS (::) BESIDE IT: PACKAGES
VIOLET: FUNCTION ARGUMENTS OR VECTORS
GRID: DATA FRAMES

Program 1: BASIC COMMANDS - VECTORS
A vector is a sequence of data elements of the same basic type. The members of a vector are officially called components.
> 8.5:4.5 # descending sequence: 8.5 7.5 6.5 5.5 4.5
> rnorm(10) # ten random draws from the standard normal distribution
> c(1, 1:3, c(5, 8), 13)
An empty vector of a given type and length can be written in two equivalent ways:
> vector("numeric", 5) # same as numeric(5)
> vector("complex", 5) # same as complex(5)
> vector("logical", 5) # same as logical(5)
> vector("character", 5) # same as character(5)
> vector("list", 5) # a list of five NULLs; list(5) is different: a one-element list holding 5
> seq.int(3, 12) # same as 3:12
> seq.int(3, 12, 2)
> seq.int(0.1, 0.01, -0.01)
> seq_len(5)
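The equivalences on this slide can be checked directly in the console. The following is a small sketch; the variable names a and b are only illustrative.

```r
# Typed constructors are shorthand for vector(mode, length).
a <- vector("numeric", 5)      # five zeros
b <- numeric(5)
identical(a, b)                # TRUE

# vector("list", 5) has five empty (NULL) slots,
# whereas list(5) is a list with ONE slot that holds the number 5.
length(vector("list", 5))      # 5
length(list(5))                # 1

seq.int(3, 12, 2)              # 3 5 7 9 11
seq_len(5)                     # 1 2 3 4 5
8.5:4.5                        # 8.5 7.5 6.5 5.5 4.5
```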

> seq_len(n) # 1, 2, ..., n for a non-negative integer n
> pp <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers")
> for(i in seq_along(pp)) print(pp[i])
> length(1:5)
> length(c(TRUE, FALSE, NA))
> sn <- c("Varma", "Persis", "Kamala", "PVRao")
> length(sn)
> nchar(sn)
Each element of an R vector can be given a name. Labeling the elements can often make your code much more readable. You can specify names when you create a vector, in the form name = value. If the name of an element is a valid variable name, it doesn't need to be enclosed in quotes.
> c(apple = 1, banana = 2, "kiwi fruit" = 3, 4)

> x <- (1:5) ^ 2
> x[c(1, 3, 5)]
> x[c(-2, -4)]
> x[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
Mixing positive and negative values is not allowed, and will throw an error:
> x[c(1, -1)] # This doesn't make sense!
> names(x) <- c("one", "four", "nine", "sixteen", "twenty five")
> x[c("one", "nine", "twenty five")]
> x[c(1, NA, 5)]
> x[c(TRUE, FALSE, NA, FALSE, TRUE)]
> 10/3
[1] 3.333333
> options(digits=8)
> 10/3
[1] 3.3333333
> options(digits=10)
> 10/3
[1] 3.333333333

The which function returns the locations where a logical vector is TRUE. This can be useful for switching from logical indexing to integer indexing:
> x <- c(23, 12, 45, 11, 2, 3, 4)
> which(x > 10)
[1] 1 2 3 4
> which.min(x) # position of the smallest element
ADDING SCALARS TO VECTORS
> 1:5 + 1 # adds one to each element of the vector
> 1:5 + 1:15 # the shorter vector is recycled to the length of the longer one
> rep(1:5, 3) # repeat function
> rep(1:5, each = 3)
> rep(1:5, times = 1:5)
> rep(1:5, length.out = 7)
> rep.int(1:5, 3) # the same as rep(1:5, 3)
> rep_len(1:5, 13)
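To see what which(), recycling and rep() actually return, the commands from this slide can be run with the results stored in variables. This is only a sketch; the names idx and recycled are illustrative.

```r
x <- c(23, 12, 45, 11, 2, 3, 4)

idx <- which(x > 10)          # positions, not values: 1 2 3 4
x[idx]                        # the values at those positions: 23 12 45 11
which.min(x)                  # 5, the position of the smallest value (2)

1:5 + 1                       # 2 3 4 5 6
recycled <- 1:5 + 1:15        # 1:5 is reused three times to match length 15
recycled[1:7]                 # 2 4 6 8 10 7 9

rep(1:5, times = 1:5)         # 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5
rep(1:5, length.out = 7)      # 1 2 3 4 5 1 2
```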

FEW MORE BASIC COMMANDS
To see any dataset in the code editor, type in the console:
> View(women)
To get the number of rows and columns respectively:
> nrow(women)
> ncol(women)
To output a summary of the dataset's columns:
> summary(women)
To output a summary of the dataset's structure:
> str(women)
To get the dimensions of a dataset (number of observations and columns):
> dim(women)
To access a column in a dataset:
> women$height
To check the type (or class) of a variable, the class function can be used:
> class(women)

COERCION
> myNum <- 5.983904798274987298
> class(myNum)
[1] "numeric"
You can coerce (change the type of) numeric string values into numeric types, like so:
> myString <- "5.60"
> class(myString)
[1] "character"
> myNumber <- as.numeric(myString)
> myNumber
[1] 5.6
> class(myNumber)
[1] "numeric"

> myInt <- 209173987
> class(myInt)
[1] "numeric"
Whole numbers are stored as numeric by default. To actually force them to be integers, we need to invoke a function that manually coerces them, called as.integer:
> myInt <- as.integer(myInt)
> class(myInt)
[1] "integer"

> myComparison <- 5 > 6
> myComparison
[1] FALSE
> class(myComparison)
[1] "logical"
> myComplex <- complex(1, 3292, 8974892)
> myComplex
[1] 3292+8974892i
> class(myComplex)
[1] "complex"
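A few edge cases of coercion are worth trying alongside the examples above. The following sketch uses made-up values; the variable name bad is illustrative.

```r
as.numeric("5.60")                 # 5.6
class(as.integer(209173987))       # "integer"
as.integer(5.98)                   # 5: the fraction is dropped, not rounded

# A string that is not a number becomes NA (R normally also prints a warning).
bad <- suppressWarnings(as.numeric("five"))
is.na(bad)                         # TRUE

# Mixing types inside c() coerces everything to the most general type.
c(1, "a", TRUE)                    # "1" "a" "TRUE"
```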

PROGRAM NO: 2 IMPORT FROM AND EXPORT TO CSV FILES
CSV (Comma Separated Values) files are intentionally designed to be widely supported; any OS or application that imports or exports data usually has CSV support. They do nothing else but hold data, with no text formatting, for example. Excel files hold the same data, but in binary format. This allows the file to save specific Excel features such as charts and formatting.
> datacsv <- read.csv("D:/FDP/Stu Info.csv")
> datacsv
> s <- subset(datacsv, Sec.Lang == "Sanskrit")
> write.csv(s, "output.csv")
> View(read.csv("output.csv")) # View() needs a data frame, not a file name
> View(s)
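The import, subset and export steps can be tried without the original file. The sketch below builds a small made-up data frame (the student names are hypothetical; only the Sec.Lang column name comes from the slide) and writes it to a temporary folder.

```r
# Hypothetical stand-in for "Stu Info.csv".
students <- data.frame(
  Name     = c("Asha", "Bala", "Chitra"),
  Sec.Lang = c("Sanskrit", "Telugu", "Sanskrit"),
  stringsAsFactors = FALSE
)

# Keep only the Sanskrit rows and export them.
sanskrit <- subset(students, Sec.Lang == "Sanskrit")
path <- file.path(tempdir(), "output.csv")
write.csv(sanskrit, path, row.names = FALSE)   # row.names = FALSE avoids an extra index column

# Import the file again to confirm what was written.
back <- read.csv(path, stringsAsFactors = FALSE)
back
```

Writing with row.names = FALSE is a choice made here so that the file read back has the same columns as the data frame that was written.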

VECTORS AND LISTS The most essential of all, the vector, is a collection of elements of the same type. A vector can only have elements of the exact same type. Vectors are usually created with the shorthand c (concatenate) function: > myVector <- c("Hello", "World", "Third Element") > class(myVector) "character" > myVector "Hello" "World" "Third Element"

>myVector <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Eleven", "Twelve", "Thirteen", "Fourteen", "Fifteen") > myVector [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" [8] "Eight" "Nine" "Ten" "Eleven" "Twelve" "Thirteen" "Fourteen" [15] "Fifteen"

Note that vectors are strictly one-dimensional. You cannot add another vector as an element inside an existing vector; their elements get merged into one:
> v1 <- c("a", "b", "c")
> v2 <- c("d", "e", "f")
> v3 <- c(v1, v2)
> v3
[1] "a" "b" "c" "d" "e" "f"
You can generate entire numeric vectors by specifying a range:
> myRange <- c(1:10)
> myRange
[1] 1 2 3 4 5 6 7 8 9 10

LISTS Lists are just like vectors, only they don’t have the limitation of being able to hold elements of the same type exclusively. They are built with the list function or with the c function if one of the elements you’re adding is a list:
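As a small illustration of both ways of building a list, consider the sketch below. The element names and values (person, dept and so on) are made up for the example.

```r
# list() keeps each element in its own type.
person <- list(name = "Kamala", scores = c(85, 92), passed = TRUE)
sapply(person, class)        # name: character, scores: numeric, passed: logical

# c() returns a list when one of its arguments is a list.
longer <- c(person, list(dept = "CSE"))
class(longer)                # "list"
length(longer)               # 4

# In contrast, c() on plain values forces them into one common type.
class(c(1, "a", TRUE))       # "character"
```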

LISTS VISUALIZATION

LISTS
The following variable x is a list containing copies of three vectors n, s, b, and a numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b

If list x is a pepper shaker holding packets of pepper, then x[1] is a pepper shaker containing a single packet, x[[1]] is that single packet, and x[[1]][[1]] is the pepper taken out of the packet.
A single bracket always returns another list, with as many elements as the number of indices you pass into the bracket. In contrast, a double bracket always returns only one element.
NOTE: THE MAJOR DIFFERENCE BETWEEN THE TWO IS THAT A SINGLE BRACKET RETURNS A LIST WITH AS MANY ELEMENTS AS YOU WISH, WHILE A DOUBLE BRACKET RETURNS ONLY A SINGLE ELEMENT FROM THE LIST: THE ELEMENT ITSELF, NOT A LIST WRAPPED AROUND IT.

Member Reference
In order to reference a list member directly, we have to use the double square bracket "[[]]" operator. The following object x[[2]] is the second member of x. In other words, x[[2]] is a copy of s, but is not a slice containing s or its copy.
> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
We can modify its content directly.
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
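The difference between the single and double bracket can be confirmed by checking the class of each result. This sketch reuses the list x from the slides.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

class(x[2])           # "list": a slice, still wrapped in a list
length(x[2])          # 1
length(x[c(1, 3)])    # 2: one list element per index passed
class(x[[2]])         # "character": the member itself

# Changing the member inside x leaves the original vector s alone.
x[[2]][1] <- "ta"
x[[2]][1]             # "ta"
s[1]                  # "aa"
```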

MATRICES
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol).
> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3

> m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2) # filled column by column
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m <- matrix(c(1, 2, 3, 4)) # a single column with four rows
> m <- matrix(c(1, 2, 3, 4), 7, 8) # the four values are recycled to fill 7 rows x 8 columns
> m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE) # filled row by row
> matrix(1, nrow = 10, ncol = 10)
> A <- matrix(1:16, 4, 4)
> z <- A[2, 3] # element in the 2nd row and 3rd column of A, assigned to z
> A[2:4, 4:2] # 2nd, 3rd and 4th rows with the 4th, 3rd and 2nd columns, giving a sub-matrix
> A[2, 2:3] # second row, 2nd and 3rd column elements
> second.column <- A[, 2] # the whole second column
> which(A > 8) # positions of the elements greater than 8; A[A > 8] gives the values
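Matrix indexing is easier to follow on a matrix whose entries are all different. The sketch below assumes a 4 x 4 matrix holding 1 to 16; the names A and B are illustrative.

```r
A <- matrix(1:16, nrow = 4, ncol = 4)   # filled column by column

A[2, 3]                  # 10: row 2, column 3
A[2, 2:3]                # 6 10
dim(A[2:4, 4:2])         # 3 3: a sub-matrix of rows 2-4 and columns 4, 3, 2

which(A > 8)             # positions 9 to 16 (counted down the columns)
A[A > 8]                 # the values themselves: 9 to 16
which(A >= 15, arr.ind = TRUE)   # row/column pairs: (3, 4) and (4, 4)

B <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled row by row
B[1, ]                   # 1 2 3
```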

ARRAYS
An array is just a vector plus information on the dimensions of the array. We can create an array from a vector:
> X <- array(1:24, dim = c(3, 4, 2)) # 24 elements arranged as two matrices of 3 rows and 4 columns
> x <- seq(1, 27)
> c(3, 9)
[1] 3 9
> dim(x) = c(3, 9)
> is.array(x)
[1] TRUE
> is.matrix(x)
[1] TRUE
> x
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    4    7   10   13   16   19   22   25
[2,]    2    5    8   11   14   17   20   23   26
[3,]    3    6    9   12   15   18   21   24   27

DATA FRAMES Data frames are used to store tabular data. They are represented as a special type of list where every element of the list has to have the same length . Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. Unlike matrices, data frames can store different classes of objects in each column (just like lists);

A data frame is used for storing data tables. It is a list of vectors of equal length. > n = c(2, 3, 5) > s = c("aa", "bb", "cc") > b = c(TRUE, FALSE, TRUE) > df = data.frame(n, s, b) # df is a data frame

Cell value from the first row, second column of mtcars:
> mtcars[1, 2]
[1] 6
We can use the row and column names instead of the numeric coordinates.
> mtcars["Mazda RX4", "cyl"]
[1] 6
The number of data rows in the data frame is given by the nrow function.
> nrow(mtcars) # number of data rows
[1] 32
And the number of columns of a data frame is given by the ncol function.
> ncol(mtcars) # number of columns
[1] 11

We reference a data frame column with the double square bracket "[[]]" operator. For example, to retrieve the ninth column vector of the built-in data set mtcars, we write mtcars[[9]].
> mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can retrieve the same column vector by its name.
> mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We can also retrieve it with the "$" operator in lieu of the double square bracket operator.
> mtcars$am
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
Yet another way to retrieve the same column vector is to use the single square bracket "[]" operator. We prepend the column name with a comma character, which signals a wildcard match for the row position.
> mtcars[,"am"]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
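That the four forms above really return the same vector can be checked with identical(). This is a small sketch; the variable names are illustrative.

```r
by_pos     <- mtcars[[9]]
by_name    <- mtcars[["am"]]
by_dollar  <- mtcars$am
by_bracket <- mtcars[, "am"]

identical(by_pos, by_name)        # TRUE
identical(by_name, by_dollar)     # TRUE
identical(by_dollar, by_bracket)  # TRUE

# A single bracket WITHOUT the comma keeps the data frame wrapper.
class(mtcars["am"])               # "data.frame"
class(mtcars[["am"]])             # "numeric"
```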

> x <- read.csv("data1.csv", header = TRUE, sep = ",")
> x2 <- read.csv("data2.csv", header = TRUE, sep = ",")
> x3 <- cbind(x, x2)
> x3
  Subtype Gender Expression Age     City
1       A      m      -0.54  32 New York
2       A      f      -0.80  21  Houston
3       B      f      -1.03  34  Seattle
4       C      m      -0.41  67  Houston
For a matrix A, which() with arr.ind = TRUE reports the row and column positions:
> which(A >= 15, arr.ind = TRUE)
     row col
[1,]   3   4
[2,]   4   4
Similarly, we can assign values by position.
> A[1, 1:3] <- c(2, 4, 5)

EXP NO 3: GETTING AND CLEANING DATA WITH SWIRL Swirl is an interactive package which will teach us and at the same time make us practice with the exercises. It has three types of exercises, basic, intermediate and advanced. Getting and cleaning data is an intermediate exercise.

WHAT IS SWIRL() IN R
swirl is a software package for the R programming language that turns the R console into an interactive learning environment. Users receive immediate feedback as they are guided through self-paced lessons in data science and R programming.
> install.packages("swirl")
> library(swirl)
> install_from_swirl("Getting and Cleaning Data")

> install.packages("swirl")
> library(swirl)
> install_course("Getting and Cleaning Data")
> swirl()

SWIRL() Flow..
| Please choose a course, or type 0 to exit swirl.
1: Getting and Cleaning Data
2: R Programming
3: Take me to the swirl course repository!
Selection: 1
| Please choose a lesson, or type 0 to return to course menu.
1: Manipulating Data with dplyr
2: Grouping and Chaining with dplyr
3: Tidying Data with tidyr
4: Dates and Times with lubridate

ABOUT PACKAGES COMING WITH GETTING AND CLEANING DATA For this we use three types of packages: dplyr, tidyr, lubridate. Dplyr is a package that provides a consistent and concise grammar for manipulating tabular data. It makes data manipulation easier.

About dplyr package from swirl() According to the "Introduction to dplyr" vignette written by the package authors, "The dplyr philosophy is to have small functions that each do one thing well." Specifically, dplyr supplies five 'verbs' that cover most fundamental data manipulation tasks: select(), filter(), arrange(), mutate(), and summarize().
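The five verbs can be seen together in one short pipeline. This is a sketch that assumes the dplyr package is already installed; the column name len_scaled and the object names top and means are made up for the example.

```r
library(dplyr)

top <- ToothGrowth %>%
  filter(dose >= 1) %>%                     # filter(): keep rows
  select(len, supp, dose) %>%               # select(): keep columns
  mutate(len_scaled = len / max(len)) %>%   # mutate(): add a derived column
  arrange(desc(len))                        # arrange(): sort, longest first

head(top, 3)

means <- ToothGrowth %>%
  group_by(supp) %>%
  summarize(mean_len = mean(len))           # summarize(): one row per group
means
```

Each verb takes a data frame and returns a data frame, which is why they chain cleanly with the pipe.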

Data manipulation using dplyr install.packages ("dplyr") ## install You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; We recommend the RStudio mirror. library ("dplyr") ## load You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.

Selecting columns and filtering rows To select columns of a data frame, use select(). The first argument to this function is the data frame (ToothGrowth), and the subsequent arguments are the columns to keep. select (ToothGrowth, len, supp, dose) >aa<-select(ToothGrowth,len,supp,dose)

select(): To select columns of a data frame
> select(ToothGrowth, len, supp, dose)
> plot(aa)
filter(): To choose rows
> filter(ToothGrowth, len == 5)

filter(): To choose rows
> filter(ToothGrowth, len > 5)
Pipes (%>%) are an alternative to nesting functions (i.e. one function inside of another). Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set.
> ToothGrowth %>%
+   filter(len < 5) %>%
+   select(len, supp, dose)

To create a new object with this smaller version of the data we could do so by assigning it a new name. >ToothGrowth_sml <- ToothGrowth %>% + filter (len < 5) %>% + select (len,supp,dose) MUTATE (): create new columns based on the values in existing columns

> ToothGrowth %>%
+   mutate(len = len / 4)
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data:
> ToothGrowth %>%
+   mutate(len = len / 4) %>%
+   head

ToothGrowth has no missing values, but if a column did contain NAs and we wanted to remove those rows, we could insert filter() in this chain:
> ToothGrowth %>%
+   mutate(len = len / 4) %>%
+   filter(!is.na(len)) %>%
+   head

group_by(): splits the data into groups upon which some operations can be run
> ToothGrowth %>% group_by(supp) %>% tally()
summarize(): group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group.
> ToothGrowth %>% group_by(supp) %>% summarize(len = mean(len, na.rm = TRUE))

Data Frame Column Slice
We retrieve a data frame column slice with the single square bracket "[]" operator.
Numeric Indexing
The following is a slice containing the first column of the built-in data set mtcars.
> mtcars[1]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
............
Name Indexing
We can retrieve the same column slice by its name.
> mtcars["mpg"]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Datsun 710    22.8
............
To retrieve a data frame slice with the two columns mpg and hp, we pack the column names in an index vector inside the single square bracket operator.
> mtcars[c("mpg", "hp")]
               mpg  hp
Mazda RX4     21.0 110
Mazda RX4 Wag 21.0 110
Datsun 710    22.8  93
............

Exp 5. Creating Data Frame
> emp.data <- data.frame(
+   emp_id = c(1:5),
+   emp_name = c("Ratna", "Kumar", "Kamala", "Prajwal", "Pravachan"),
+   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
+   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
+   stringsAsFactors = FALSE
+ )
> emp.data
# Add the "dept" column.
> emp.data$dept <- c("IT", "Operations", "IT", "HR", "Finance")
> v <- emp.data
> print(v)

Extracting rows and columns
> A = emp.data$emp_id
> B = emp.data$emp_name
a) > C = data.frame(A, B)
b) > emp.data[1:2, ]
c) > emp.data[c(3, 5), c(2, 4)]

> emp.data[1:2, ]
  emp_id emp_name salary start_date       dept
1      1    Ratna  623.3 2012-01-01         IT
2      2    Kumar  515.2 2013-09-23 Operations
> emp.data[c(3, 5), c(2, 4)]
   emp_name start_date
3    Kamala 2014-11-15
5 Pravachan 2015-03-27

PROGRAM 6: ‘apply’ group of functions

Function   Arguments               Objective                                            Input                         Output
apply      apply(X, MARGIN, FUN)   Apply a function to the rows or columns or both      Data frame or matrix          vector, list, array
lapply     lapply(X, FUN)          Apply a function to all the elements of the input    List, vector or data frame    list
sapply     sapply(X, FUN)          Apply a function to all the elements of the input    List, vector or data frame    vector or matrix
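The three functions in the table can be compared on small inputs. The sketch below uses a made-up matrix m and list scores to show how the output type differs.

```r
m <- matrix(1:6, nrow = 2)        # rows are (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)      # MARGIN = 1 means rows: 9 12
col_sums <- apply(m, 2, sum)      # MARGIN = 2 means columns: 3 7 11

scores <- list(a = 1:3, b = 4:6)
as_list   <- lapply(scores, mean) # a list: $a is 2, $b is 5
as_vector <- sapply(scores, mean) # a named numeric vector: a = 2, b = 5

# sapply returns a matrix when every result has the same length > 1.
ranges <- sapply(mtcars[c("mpg", "wt")], range)
dim(ranges)                       # 2 2 (min and max for each column)
```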

PROGRAM 7: cbind-ing and rbind-ing
Matrices can be created by column-binding or row-binding with cbind() and rbind().
> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
> C <- cbind(1:3, 4:6, 5:7)
> D <- rbind(1:3, 4:6)

PROGRAM 7 (contd.): rbind() and cbind() functions.
Data frames can also be appended with these functions: rbind() adds rows to a data frame and cbind() adds columns.
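For data frames, a small sketch with made-up tables (df1, df2 and extra are hypothetical names) shows how the two functions behave.

```r
df1 <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(id = 3:4, name = c("c", "d"), stringsAsFactors = FALSE)

# rbind() stacks rows; both data frames must have the same column names.
rows <- rbind(df1, df2)
dim(rows)                 # 4 2

# cbind() places columns side by side; both must have the same number of rows.
extra <- data.frame(score = c(10, 20))
cols <- cbind(df1, extra)
dim(cols)                 # 2 3
names(cols)               # "id" "name" "score"
```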

Factor Variables
Factor variables are nominal variables, also known as categorical variables. Levels are the unique values that the variable takes.
> gender <- c(rep("male", 20), rep("female", 30))
> gender <- factor(gender)
> levels(gender)
[1] "female" "male"
# Factor variables summary
> summary(gender)
female   male
    30     20

PROGRAM 8: DISCRETE IRIS
The smallest Sepal.Length (4.3) equals the lowest break. By default cut() leaves the lowest break out of the first interval, which would turn that observation into <NA>, so include.lowest = TRUE is added.
> iris$Seplen <- cut(iris$Sepal.Length, breaks = c(4.3, 5.6, 6.8, 7.9), labels = c("low", "medium", "high"), include.lowest = TRUE)
> iris$Seplen
[1] low low low low low low low low
[9] low low low low low low medium medium
[17] low low medium low low low low low
[25] low low low low low low low low
[33] low low low low low low low low
[41] low low low low low low low low
[49] low low high medium high low medium medium
[57] medium low medium low low medium medium medium
[65] …..
Levels: low medium high
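One detail of cut() worth checking is how the lowest break is treated. The sketch below compares the default behaviour with include.lowest = TRUE on the same breaks; the names open_left and closed are illustrative.

```r
sl     <- iris$Sepal.Length
breaks <- c(4.3, 5.6, 6.8, 7.9)
labs   <- c("low", "medium", "high")

# Default: intervals are (4.3, 5.6], (5.6, 6.8], (6.8, 7.9],
# so a value exactly equal to 4.3 falls outside and becomes NA.
open_left <- cut(sl, breaks = breaks, labels = labs)
sum(is.na(open_left))      # 1

# include.lowest = TRUE closes the first interval: [4.3, 5.6].
closed <- cut(sl, breaks = breaks, labels = labs, include.lowest = TRUE)
sum(is.na(closed))         # 0

table(closed)              # how many flowers fall in each bin
```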

PROGRAM 9 - SCATTER PLOT USING ‘DPLYR’ ON GUINEA PIGS ‘TOOTHGROWTH’ DATA SET

> aa <- select(ToothGrowth, len, supp, dose)
# To choose rows we use filter()
> filter(ToothGrowth, len <= 14.5)
> ToothGrowth %>%
+   group_by(supp)
> ToothGrowth %>%
+   group_by(supp) %>%
+   summarise(meanoflen = mean(len))
> plot(aa)

gg = grammar of graphics
> library(dplyr)
> library(ggplot2) # the package is called ggplot2; library(ggplot) fails
> ggplot(aa, aes(x = factor(dose), y = len, fill = supp)) # axes only, no geom yet
> ggplot(aa, aes(x = factor(dose), y = len, fill = supp)) + geom_boxplot() # aes = aesthetic

PROGRAM-10…LINEAR AND MULTIPLE REGRESSION
Regression: A technique for determining the statistical relationship between two or more variables where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables.
Linear Regression: Y = mX + c, with a single predictor X (plot of Y against X).

Multiple Linear Regression
Y = aX3 + bX2 + cX1 + d
X3, X2 and X1 are the 3 predictors (explanatory variables).
a, b and c are coefficients.
d is the intercept (bias value).
Y is the response variable: it is estimated or predicted from the 3 X variables.

Mtcars variables
[, 1] mpg   Miles/(US) gallon
[, 2] cyl   Number of cylinders
[, 3] disp  Displacement (cu.in.)
[, 4] hp    Gross horsepower
[, 5] drat  Rear axle ratio
[, 6] wt    Weight (1000 lbs)
[, 7] qsec  1/4 mile time
[, 8] vs    Engine shape (0 = V-shaped, 1 = straight)
[, 9] am    Transmission (0 = automatic, 1 = manual)
[,10] gear  Number of forward gears
[,11] carb  Number of carburetors

lm = linear model
> library(ggplot2)
> ggplot(mtcars, aes(wt, mpg)) # axes only
> ggplot(mtcars, aes(wt, mpg)) + geom_point() # scatter plot
> ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method = "lm") # adds the fitted regression line

mpg versus weight

For example, in the mtcars dataset you can build a linear model between fuel economy (mpg) and the weight of the car (wt):
mpg = β0 + β1·wt
mpg is the dependent variable and wt is the independent variable; β0 is the intercept and β1 is the slope.
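The model above can be fitted with lm() from base R. A minimal sketch follows; the variable names fit and pred are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                 # intercept (β0) about 37.29, slope (β1) about -5.34

# Predicted mpg for a car weighing 3000 lbs (wt = 3): about 21.25
pred <- predict(fit, newdata = data.frame(wt = 3))
pred
```

The negative slope says that, in this data set, each extra 1000 lbs of weight goes with roughly 5.3 fewer miles per gallon.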

Residuals. The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e): e = y - ŷ. Each data point has one residual.
Example: the observed value at x = 3 is y = 10*3 + 5 = 35. A model with slope m = 9, y = 9x + c with c = 5, predicts ŷ = 9*3 + 5 = 32, so the residual is e = 35 - 32 = 3.
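The same definition can be applied to a fitted model in R. This sketch uses the mpg ~ wt model on mtcars as an assumed example and compares the hand-computed residuals with what residuals() returns.

```r
# The arithmetic from the slide: observed minus predicted.
observed  <- 10 * 3 + 5      # 35
predicted <- 9 * 3 + 5       # 32
observed - predicted         # 3

# The same idea for every point of a fitted model.
fit <- lm(mpg ~ wt, data = mtcars)
e <- mtcars$mpg - fitted(fit)            # residual = observed - predicted
all.equal(unname(e), unname(residuals(fit)))   # TRUE
length(e)                                # 32: one residual per data point
abs(sum(e)) < 1e-8                       # TRUE: least-squares residuals sum to ~0
```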

> mfit = lm(mpg ~ wt + disp + cyl, data=mtcars) > plot(mfit)

PROGRAM NO: 11 Major Clustering Approaches (I)
Partitioning approach: Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: Diana, Agnes, BIRCH, CHAMELEON
Density-based approach: Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach: Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE

IRIS TYPES

K-means clustering
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length"
[4] "Petal.Width" "Species"
> x <- iris[, -5]
> y <- iris$Species
> kc <- kmeans(x, 3)
> kc
K-means clustering with 3 clusters of sizes 38, 62, 50
Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.850000    3.073684     5.742105    2.071053
2     5.901613    2.748387     4.393548    1.433871
3     5.006000    3.428000     1.462000    0.246000

Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[29] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2
[57] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
[85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1
[113] 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1
[141] 1 1 2 1 1 1 2 1 1 2
Within cluster sum of squares by cluster:
[1] 23.87947 39.82097 15.15100
(between_SS / total_SS = 88.4 %)

>plot(x[c("Sepal.Length","Sepal.Width")],col=kc$cluster)

K-means >points(kc$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=23, cex=3)
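Because kmeans() starts from random centres, the cluster numbers (and occasionally the clusters themselves) change from run to run. The sketch below fixes the seed and uses several random starts; the seed value 42 and nstart = 20 are arbitrary choices made for this example, not part of the original program.

```r
set.seed(42)                       # arbitrary seed, only for repeatability
x  <- iris[, -5]                   # the four numeric columns
kc <- kmeans(x, centers = 3, nstart = 20)   # keep the best of 20 random starts

kc$size                            # cluster sizes (they add up to 150)
table(kc$cluster, iris$Species)    # clusters compared with the true species
kc$betweenss / kc$totss            # share of variance explained by the clustering
```

The table shows that setosa separates cleanly while versicolor and virginica overlap, which matches the clustering vector shown earlier.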

> library(fpc)
> pamresult <- pamk(iris[, -5]) # the four numeric columns of iris
> pamresult$nc # nc = number of clusters
[1] 2
> table(pamresult$pamobject$clustering, iris$Species)
    setosa versicolor virginica
  1     50          1         0
  2      0         49        50
> layout(matrix(c(1, 2), 1, 2)) # two plots side by side
> plot(pamresult$pamobject)

The ggplot () command creates a plot object. In it we assigned a data set. aes () creates what Hadley Wickham calls an aesthetic: a mapping of variables to various parts of the plot. ... Another way to split up the way we look at data is with facets.

> ggplot(mtcars, aes(wt, mpg))
Error in ggplot(mtcars, aes(wt, mpg)) : could not find function "ggplot"
> library(ggplot2) # load the package first
> ggplot(mtcars, aes(wt, mpg))
> ggplot(mtcars, aes(wt, mpg)) + geom_point()
> ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method = "lm")
> ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_abline() # with no arguments, a line with intercept 0 and slope 1

> library(ggplot2)
> ggplot(mtcars, aes(wt, mpg))
> ggplot(mtcars, aes(wt, mpg)) + geom_point()
> ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method = "lm")

> ggplot(mtcars, aes(x=wt, y=mpg, col=cyl, size=disp)) + geom_point()

What combination of predictors will best predict fuel efficiency (slopes/coefficients and intercept)? Which predictors increase our accuracy by a statistically significant amount? We need to work out which predictors are significant and determine the ideal formula for prediction….WHICH IS WHAT WE CALL LINEAR REGRESSION.

Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition

Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-neighbourhood of that point
N_Eps(p) = {q belongs to D | dist(p, q) ≤ Eps}
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to N_Eps(q), and
q satisfies the core point condition: |N_Eps(q)| ≥ MinPts
(Figure: points p and q, with MinPts = 5 and Eps = 1 cm)

Density-Reachable and Density-Connected
Density-reachable: A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, …, p_n, with p_1 = q and p_n = p, such that p_(i+1) is directly density-reachable from p_i
Density-connected: A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border and outlier points, with Eps = 1 cm and MinPts = 5)

DBSCAN: The Algorithm
1. Arbitrarily select a point p
2. Retrieve all points density-reachable from p w.r.t. Eps and MinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
5. Continue the process until all of the points have been processed
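The steps above can be sketched in a few lines of base R. This is a teaching sketch, not the implementation found in any package: the function name dbscan_sketch, its arguments and the toy points are all made up for the example, and it builds the full distance matrix, so it only suits small data sets.

```r
# Minimal DBSCAN sketch. Returns a cluster number per point; 0 means noise.
dbscan_sketch <- function(X, eps, min_pts) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- as.matrix(dist(X))                          # pairwise Euclidean distances
  # Eps-neighbourhood of every point (a point counts as its own neighbour).
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbours) >= min_pts        # core point condition
  cluster <- integer(n)                            # 0 = not assigned yet / noise
  current <- 0L
  for (p in seq_len(n)) {
    if (cluster[p] != 0L || !is_core[p]) next      # only unvisited core points start a cluster
    current <- current + 1L
    cluster[p] <- current
    queue <- neighbours[[p]]
    while (length(queue) > 0) {                    # collect everything density-reachable from p
      q <- queue[1]
      queue <- queue[-1]
      if (cluster[q] != 0L) next
      cluster[q] <- current
      if (is_core[q]) queue <- c(queue, neighbours[[q]])  # border points do not expand further
    }
  }
  list(cluster = cluster, is_core = is_core)
}

# Toy data: two tight groups of five points and one far-away point.
pts <- rbind(
  cbind(c(0, 0.1, 0.2, 0.1, 0.0), c(0, 0.1, 0.0, 0.2, 0.2)),
  cbind(c(5, 5.1, 5.2, 5.1, 5.0), c(5, 5.1, 5.0, 5.2, 5.2)),
  c(10, 0)
)
res <- dbscan_sketch(pts, eps = 0.5, min_pts = 3)
res$cluster    # 1 1 1 1 1 2 2 2 2 2 0
```

The last point has no neighbours within Eps, so it is never assigned to a cluster and stays at 0, which is how the sketch marks noise.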

Before Package ‘rpart’ (Title: Recursive Partitioning and Regression Trees)
A regression line is a straight line that attempts to describe the relationship between two variables; it is also known as a trend line or line of best fit. Simple linear regression is a prediction when a variable (y) is dependent on a second variable (x), based on the regression equation of a given set of data.

Decision trees are of two types: Classification Trees (CTs) and Regression Trees (RTs).
CTs are used when the target or response variable is categorical in nature.
RTs are used when the target variable is continuous or numeric.
It is the target variable that determines the type of decision tree needed.

DECISION TREES USING PARTY-PROGRAM 12
> install.packages("readr")
> library(readr)
> install.packages("party")
Installing package into ‘C:/Users/My Document/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/party_1.2-3.zip'
Content type 'application/zip' length 719826 bytes (702 KB)
downloaded 702 KB
package ‘party’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\My Document\AppData\Local\Temp\RtmpOAuKaM\downloaded_packages
> library(party)

DECISION TREE USING RPART..PROGRAM 12
Usage: rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
> library(rpart)
> tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris, method = "class")

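> iris$class <- as.factor(iris$class) # only if iris was read from a CSV file with a "class" column; the built-in iris has no such column
> View(iris)
> iris$Species <- as.factor(iris$Species) # Species is already a factor in the built-in iris
> tree1 <- ctree(Species ~ Sepal.Length, data = iris) # ctree() comes from the party package
> plot(tree1)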

> tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris, method = "class")
> plot(tree)

> plot(tree, uniform = TRUE, main = "Classification Tree for Iris dataset")
> text(tree, use.n = TRUE, all = TRUE, cex = .8)
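Besides plotting, the fitted tree can be used to classify flowers. The sketch below assumes the rpart package (normally shipped with R) and checks the tree against the same data it was trained on, so the accuracy it reports is optimistic; the names pred and acc are illustrative.

```r
library(rpart)

tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, method = "class")

# Predicted class for every flower in the training data.
pred <- predict(tree, newdata = iris, type = "class")

table(predicted = pred, actual = iris$Species)   # confusion matrix
acc <- mean(pred == iris$Species)                # share classified correctly
acc
```

For an honest estimate one would hold back part of the data for testing rather than reuse the training rows.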

SUPPORT VECTOR MACHINE

X1, X2 Attributes

ABOUT DIFFERENT TYPES OF VARIABLES

FEW GOOD WEB SITES ON R
www.kaggle.com
www.rdocumentation.org
www.statmethods.net
www.r-tutor.com
www.tutorialspoint.com
www.datacamp.com
www.github.com
https://drsimonj.svbtle.com/visualising-residuals
