Data Preprocessing in Machine Learning

pyingkodimaran1 · 31 slides · Aug 23, 2021

About This Presentation

Researchers can learn about the available techniques in data preprocessing.


Slide Content

Data Preprocessing. Dr. M. Pyingkodi, AP/MCA, Kongu Engineering College, Erode, Tamil Nadu.

Data Preprocessing: the process of preparing data for analysis; the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training machine learning models. Real-world data is typically incomplete, inconsistent and likely to contain many errors.

Preprocessing techniques: data cleaning (noise, outliers, missing values, duplicate data, dealing with categorical data), data integration, data transformation, data reduction, sampling, imputation, discretization, feature extraction, splitting the dataset into training and testing sets, and scaling the features.

Types of data. Numerical data: discrete (date, number of students in a class) and continuous (cost of a house, in decimals). Categorical data: nominal (gender), ordinal (grades of a student, split into groups), dichotomous (cancerous vs. non-cancerous). Time series data: a sequence of numbers collected at regular intervals over some period of time. Text data: words.

Quality of data: 1. Accuracy (human/computer errors, incorrect formats), 2. Completeness, 3. Consistency. Data preprocessing is divided into four stages: data cleaning, data integration, data reduction and data transformation.

Data cleaning: the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set; identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or deleting the dirty or coarse data within a dataset. It covers duplicate observations, irrelevant observations, fixing structural errors, noise, outliers, missing values, duplicate data, and dealing with categorical data.

Noisy data: data with a large amount of additional meaningless information in it. Noisy data can be handled by binning, regression or clustering.

Outliers: extreme values that fall a long way outside the other observations; for example, in a normal distribution, outliers may be values in the tails of the distribution. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

Finding outliers: box plot, scatter plot, Z-score, expectation-maximization, linear correlations (principal component analysis), cluster, density or nearest-neighbour analysis, and the interquartile range (IQR).
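As a minimal sketch of two of these checks (the sample values and thresholds below are made-up assumptions, not from the slides):

    import numpy as np

    # Hypothetical sample with one extreme value
    values = np.array([12, 14, 15, 13, 16, 14, 15, 95])

    # IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

    # Z-score rule: flag points more than 2 standard deviations from the mean
    # (a small threshold chosen because the sample here is tiny)
    z_scores = (values - values.mean()) / values.std()
    z_outliers = values[np.abs(z_scores) > 2]

    print("IQR outliers:", iqr_outliers)
    print("Z-score outliers:", z_outliers)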

Missing values: 1. Removing the training example. 2. Filling in the missing value manually. 3. Using a standard value to replace the missing value. 4. Using the central tendency (mean, median, mode) of the attribute to replace the missing value. 5. Using the central tendency (mean, median, mode) of the attribute within the same class to replace the missing value. 6. Using the most probable value to fill in the missing value.

Handling missing values

Techniques for dealing with missing data: drop missing values/columns/rows, or imputation. A slightly better approach to handling missing data is imputation, which means replacing or filling in the missing data with some value. There are many ways to impute data: a constant value that belongs to the set of possible values of that variable, such as 0, distinct from all other values; the mean, median or mode of the column; a value estimated by another predictive model; or multiple imputation.
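A minimal sketch of the dropping, constant-fill and mean-imputation options, assuming pandas and scikit-learn are available (the tiny DataFrame and its column names are made up for illustration):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Hypothetical data with missing entries
    df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                       "salary": [50000, 60000, np.nan, 58000]})

    # Option 1: drop rows (or columns) containing missing values
    dropped = df.dropna()

    # Option 2: fill with a constant value distinct from the real values
    constant_filled = df.fillna(0)

    # Option 3: fill with a central tendency (here the column mean)
    mean_imputer = SimpleImputer(strategy="mean")
    mean_filled = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

    print(mean_filled)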

Data integration: combining data from disparate sources (different technologies) into meaningful and valuable information. It may involve multiple databases, data cubes or flat files. Issues: schema integration, redundancy, and detection and resolution of data value conflicts.
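A small sketch of combining two hypothetical sources on a shared key with pandas (the tables, key and column names are assumptions for illustration, not from the slides):

    import pandas as pd

    # Two hypothetical sources describing the same customers
    crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ravi", "Mala"]})
    billing = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [120.0, 75.5, 40.0]})

    # Schema integration: join on the common key; unmatched rows surface as NaN
    merged = crm.merge(billing, on="customer_id", how="outer")

    # Redundancy: drop exact duplicate rows that can appear after integration
    merged = merged.drop_duplicates()

    print(merged)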

Data reduction (dimension reduction): compressing a large set of features onto a new feature subspace of lower dimension without losing the important information. Dimensionality reduction can be done in two different ways: by keeping only the most relevant variables from the original dataset (this technique is called feature selection), or by finding a smaller set of new variables, each a combination of the input variables, containing essentially the same information as the input variables.

Data reduction: a process that reduces the volume of the original data and represents it in a much smaller volume, while ensuring the integrity of the data. Techniques: missing values ratio, low variance filter, high correlation filter, principal component analysis.
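A brief sketch of two of these reduction steps with scikit-learn on made-up numeric data (the array shape, the near-constant fifth feature and the variance threshold are illustrative assumptions):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import VarianceThreshold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    X[:, 4] = 0.001 * rng.normal(size=100)  # a nearly constant, low-variance feature

    # Low variance filter: drop features whose variance falls below a threshold
    X_filtered = VarianceThreshold(threshold=0.01).fit_transform(X)

    # Principal component analysis: project onto the top 2 components
    X_reduced = PCA(n_components=2).fit_transform(X_filtered)

    print(X.shape, "->", X_filtered.shape, "->", X_reduced.shape)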

Data transformation: taking data stored in one format and converting it to another, e.g. datasets in which different columns have different units, such as one column in kilograms while another is in centimetres. Techniques: smoothing, attribute/feature construction, aggregation, normalization, discretization, and concept hierarchy generation for nominal data.

Data transformation: scaling. MinMaxScaler scales all the data into [0, 1]; the scaled value is x_scaled = (x - x_min) / (x_max - x_min). StandardScaler scales the values so that the mean is 0 and the standard deviation (and hence the variance) is 1. MaxAbsScaler takes the absolute maximum value of each column and divides each value in the column by that maximum, scaling the data into the range [-1, 1]. RobustScaler standardizes input variables in the presence of outliers by leaving the outliers out of the calculation of centre and spread (using the median and the interquartile range rather than the mean and standard deviation). QuantileTransformer converts the variable's distribution to a normal distribution and scales it accordingly; the quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution.
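A minimal sketch of a few of these scalers from scikit-learn applied to the same made-up column (the values, including the deliberate outlier, are assumptions for illustration):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

    # Hypothetical single feature with one outlier
    X = np.array([[10.0], [20.0], [30.0], [40.0], [500.0]])

    # MinMax: (x - x_min) / (x_max - x_min), squeezes everything into [0, 1]
    print(MinMaxScaler().fit_transform(X).ravel())

    # Standard: zero mean, unit standard deviation
    print(StandardScaler().fit_transform(X).ravel())

    # Robust: centres on the median and scales by the IQR, so the outlier matters less
    print(RobustScaler().fit_transform(X).ravel())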

Data transformation: log transform and normalization. Log transform: take the log of the values in a column and use those values in place of the column. It is primarily used to convert a skewed distribution to a normal (or less skewed) distribution; the log-transformed data follows a normal or near-normal distribution, reducing the impact of both too-low and too-high values. Unit vector scaler / Normalizer: normalization is the process of scaling individual samples to have unit norm, and the Normalizer works on rows. With the L1 norm, the values in each row are divided by the sum of their absolute values, so the absolute values along the row sum to 1. With the L2 norm, each value is divided by the square root of the sum of squares of the row, so the squares along the row sum to 1 (the slide's example: 50, 250, 400 → 0.05, 0.25 and 0.4).
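A short sketch of the log transform and row-wise normalization (the sample values are made up; np.log1p, i.e. log(1 + x), is used so zero values are handled safely):

    import numpy as np
    from sklearn.preprocessing import Normalizer

    # Log transform: compress a right-skewed feature
    skewed = np.array([1.0, 10.0, 100.0, 1000.0])
    log_transformed = np.log1p(skewed)

    # Unit-norm scaling works row by row
    rows = np.array([[50.0, 250.0, 400.0]])
    l1_scaled = Normalizer(norm="l1").fit_transform(rows)  # absolute values sum to 1 per row
    l2_scaled = Normalizer(norm="l2").fit_transform(rows)  # squares sum to 1 per row

    print(log_transformed, l1_scaled, l2_scaled)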

Handling categorical data: find and replace, label encoding, binary encoding, one-hot encoding, e.g. pd.get_dummies(obj_df, columns=["drive_wheels"]).head(), and OrdinalEncoder:

    from sklearn.preprocessing import OrdinalEncoder
    ord_enc = OrdinalEncoder()
    obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])
    obj_df[["make", "make_code"]].head(11)
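The snippets above assume an existing DataFrame named obj_df that is not defined on the slide; a self-contained sketch of the same idea on a made-up frame (the column names make and drive_wheels follow the slide, the values are assumed):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    obj_df = pd.DataFrame({
        "make": ["audi", "bmw", "audi", "volvo"],
        "drive_wheels": ["fwd", "rwd", "fwd", "4wd"],
    })

    # One-hot encoding: one 0/1 column per category of drive_wheels
    one_hot = pd.get_dummies(obj_df, columns=["drive_wheels"])

    # Ordinal (integer) encoding of the make column
    ord_enc = OrdinalEncoder()
    obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])

    print(one_hot.head())
    print(obj_df[["make", "make_code"]].head())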

Sampling: sampling is done to draw conclusions about populations from samples; it enables us to determine a population's characteristics by directly observing only a portion (or sample) of the population.

Types of sampling: simple random sampling, systematic sampling, stratified sampling, cluster sampling.
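A compact sketch of simple random, systematic and stratified sampling with pandas (the DataFrame, sample sizes and the stratification column are made-up assumptions; grouped sampling requires a reasonably recent pandas version):

    import pandas as pd

    df = pd.DataFrame({"id": range(100), "group": ["A"] * 80 + ["B"] * 20})

    # Simple random sampling: every row has an equal chance of selection
    simple = df.sample(n=10, random_state=42)

    # Systematic sampling: take every k-th row (here every 10th)
    systematic = df.iloc[::10]

    # Stratified sampling: sample the same fraction from each group
    stratified = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=42)

    print(len(simple), len(systematic), len(stratified))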

Resampling: re-sampling is a series of methods used to reconstruct sample data sets, including training sets and validation sets, e.g. cross-validation (CV). Imbalanced dataset example: in a utilities fraud detection data set you have the following data: total observations = 1,000; fraudulent observations = 20; non-fraudulent observations = 980; event rate = 2%.

Resampling techniques: random under-sampling, random over-sampling, cluster-based over-sampling, informed over-sampling.
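A sketch of the two simplest techniques using only pandas, on a made-up imbalanced frame whose class counts mirror the 980/20 example above (the column name is_fraud is an assumption):

    import pandas as pd

    df = pd.DataFrame({"is_fraud": [0] * 980 + [1] * 20})
    majority = df[df["is_fraud"] == 0]
    minority = df[df["is_fraud"] == 1]

    # Random under-sampling: shrink the majority class to the minority size
    under = pd.concat([majority.sample(len(minority), random_state=0), minority])

    # Random over-sampling: duplicate minority rows until they match the majority size
    over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])

    print(under["is_fraud"].value_counts().to_dict())
    print(over["is_fraud"].value_counts().to_dict())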

Discretization: dividing attributes of a continuous nature into data with intervals. Techniques: binning, histogram analysis. Equal frequency partitioning: partitioning the values based on their number of occurrences in the data set. Equal width partitioning: partitioning the values into fixed-width intervals based on the number of bins, e.g. over a set of values ranging from 0 to 20. Clustering: grouping similar data together.
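A minimal sketch of equal-width and equal-frequency binning with pandas (the values and the number of bins are illustrative assumptions):

    import pandas as pd

    ages = pd.Series([2, 5, 9, 12, 15, 18, 19, 20, 7, 11])

    # Equal width partitioning: 4 bins of the same width over the value range
    equal_width = pd.cut(ages, bins=4)

    # Equal frequency partitioning: 4 bins with (roughly) the same number of values
    equal_freq = pd.qcut(ages, q=4)

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())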

Python packages/tools for data mining: scikit-learn, Orange, Pandas, MLPy, MDP, PyBrain, and many more.

Some other basic packages: NumPy and SciPy, fundamental packages for scientific computing with Python, which provide powerful n-dimensional array objects, useful linear algebra, random number generation and other capabilities; Pandas, which contains useful data structures and algorithms; and Matplotlib, which contains functions for plotting/visualizing data.

https://www.ritchieng.com/pandas-randomly-sample-rows/ https://slideplayer.com/slide/15394883/