About This Presentation
An introduction for researchers to the available techniques in data preprocessing.
Slide Content
Data Preprocessing
Dr. M. Pyingkodi, AP/MCA, Kongu Engineering College, Erode, Tamilnadu
Data Preprocessing
The process of preparing data for analysis: cleaning and organizing raw data to make it suitable for building and training machine learning models.
Real-world data is often:
Incomplete
Inconsistent
Likely to contain many errors
Preprocessing Techniques
Data cleaning: noise, outliers, missing values, duplicate data
Dealing with categorical data
Data integration
Data transformation
Data reduction
Sampling
Imputation
Discretization
Feature extraction
Splitting the dataset into training and testing sets (a sketch follows this list)
Scaling the features
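As a minimal sketch of the splitting step, using scikit-learn's train_test_split (the toy arrays X and y below are placeholders, not from the slides):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)
y = np.array([0, 1] * 5)          # toy binary labels

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)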
Types of Data
Numerical data
Discrete: dates, number of students in a class
Continuous: cost of a house (a decimal value)
Categorical data
Nominal: gender
Ordinal: student grades (values split into ordered groups)
Dichotomous: cancerous vs. non-cancerous
Time series data: a sequence of numbers collected at regular intervals over some period of time
Text data: words
Quality of Data
1. Accuracy: free from human/computer errors and incorrect formats
2. Completeness
3. Consistency
Data preprocessing is divided into four stages: data cleaning, data integration, data reduction, and data transformation.
Data Cleaning
The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set: identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data, and then replacing, modifying, or deleting the dirty or coarse data within a dataset. This covers:
Duplicate observations (a pandas sketch follows)
Irrelevant observations
Fixing structural errors
Noise, outliers, missing values, and categorical data (each discussed below)
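For instance, duplicate and irrelevant observations can be dropped with pandas; this is only a sketch, and the DataFrame df and the filter condition are hypothetical:

import pandas as pd

df = pd.DataFrame({"age": [25, 25, 40, 130],
                   "city": ["Erode", "Erode", "Salem", "Chennai"]})

df = df.drop_duplicates()   # remove duplicate observations
df = df[df["age"] <= 100]   # drop rows that are invalid/irrelevant for the analysis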
Noisy Data
Data with a large amount of additional meaningless information in it. Noisy data can be handled by:
Binning (a sketch follows this list)
Regression
Clustering
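A sketch of smoothing by bin means with pandas (the nine sorted values are a standard textbook example; the bin count of 3 is illustrative):

import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

bins = pd.qcut(values, q=3, labels=False)          # 3 equal-frequency bins
smoothed = values.groupby(bins).transform("mean")  # replace each value by its bin mean
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]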
Outliers
Outliers are extreme values that fall far outside the other observations; in a normal distribution, they may be values on the tails. Formally, an outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
Finding Outliers
Box plot
Scatter plot
Z-score
Interquartile range (IQR)
Expectation-maximization
Linear correlations (principal component analysis)
Cluster, density, or nearest-neighbor analysis
A sketch of the Z-score and IQR rules follows this list.
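In the sketch below, the series is hypothetical and the thresholds (3 standard deviations, 1.5 x IQR) are conventional choices, not from the slides:

import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 100])  # 100 is the outlier

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]  # flags 100

# Z-score rule: flag values more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]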
Missing Values
1. Removing the training example
2. Filling in the missing value manually
3. Using a standard value to replace the missing value
4. Using a measure of central tendency (mean, median, mode) of the attribute to replace the missing value
5. Using the central tendency (mean, median, mode) of the attribute for examples belonging to the same class
6. Using the most probable value to fill in the missing value
Handling missing values
Techniques of Dealing with Missing Data
Drop missing values/columns/rows
Imputation: a slightly better approach is imputation, which means replacing or filling the missing data with some value. There are many ways to impute data (a sketch follows this list):
A constant value that belongs to the set of possible values of that variable, such as 0, distinct from all other values
The mean, median, or mode of the column
A value estimated by another predictive model
Multiple imputation
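A sketch of median and mean imputation; the DataFrame is hypothetical, and both the pandas fillna route and scikit-learn's SimpleImputer are shown:

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, None, 40, 35],
                   "salary": [50000, 60000, None, 52000]})

# Option 1: pandas fillna with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Option 2: scikit-learn SimpleImputer with mean imputation
imp = SimpleImputer(strategy="mean")
df[["salary"]] = imp.fit_transform(df[["salary"]])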
Data Integration
Combining data from disparate sources (different technologies) into meaningful and valuable information. Sources include multiple databases, data cubes, or flat files.
Issues:
Schema integration
Redundancy
Detection and resolution of data value conflicts
Data Reduction: Dimension Reduction
Dimension reduction compresses a large set of features onto a new feature subspace of lower dimension without losing the important information. It can be done in two ways:
By keeping only the most relevant variables from the original dataset (feature selection)
By finding a smaller set of new variables, each a combination of the input variables, containing essentially the same information as the input variables
Data Reduction
A process that reduces the volume of the original data and represents it in a much smaller volume, while ensuring the integrity of the data. Common techniques:
Missing values ratio
Low variance filter
High correlation filter
Principal component analysis (a sketch follows this list)
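A sketch of principal component analysis with scikit-learn; the random 100 x 10 matrix and the choice of 3 components are illustrative:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 10)  # 100 samples, 10 features (toy data)

pca = PCA(n_components=3)             # keep 3 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance retained by each component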
Data Transformation
Taking data stored in one format and converting it to another. Datasets often have columns with different units, e.g. one column in kilograms and another in centimeters. Techniques:
Smoothing
Attribute/feature construction
Aggregation
Normalization
Discretization
Concept hierarchy generation for nominal data
Data Transformation: Scalers
MinMax Scaler: scales all the data between 0 and 1. The scaled value is x_scaled = (x - x_min) / (x_max - x_min).
Standard Scaler: scales the values so that the mean is 0 and the standard deviation (and hence the variance) is 1.
MaxAbs Scaler: takes the absolute maximum value of each column and divides each value in the column by it, scaling the data to the range [-1, 1].
Robust Scaler: standardizes input variables in the presence of outliers by using statistics that are robust to outliers (the median and the interquartile range) instead of the mean and standard deviation.
Quantile Transformer Scaler: converts the variable distribution to a normal distribution and scales it accordingly. The quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution.
A scikit-learn sketch of these scalers follows.
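In the sketch below, the single toy column (with 100 as an outlier) is hypothetical:

import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler,
                                   MaxAbsScaler, RobustScaler)

X = np.array([[1.0], [5.0], [10.0], [100.0]])  # toy column; 100 is an outlier

print(MinMaxScaler().fit_transform(X).ravel())    # scaled to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1
print(MaxAbsScaler().fit_transform(X).ravel())    # each value divided by |max| = 100
print(RobustScaler().fit_transform(X).ravel())    # centered on the median, scaled by the IQR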
Data Transformation: Log Transform and Normalizer
Log Transform: take the log of the values in a column and use these values as the column instead. It is primarily used to convert a skewed distribution into a normal or less-skewed distribution; log-transformed data often follows a normal or near-normal distribution. It helps by:
Reducing the impact of too-low values
Reducing the impact of too-high values
Unit Vector Scaler / Normalizer: normalization is the process of scaling individual samples to have unit norm. The Normalizer works on rows:
With the L1 norm, each value in a row is divided by the sum of the absolute values of that row, so the absolute values along the row sum to 1.
With the L2 norm, each value in a row is divided by the row's Euclidean norm, so the squared values along the row sum to 1.
For example, the row (50, 250, 400) has L1 norm 700, so L1 normalization maps it to approximately (0.07, 0.36, 0.57). A sketch follows.
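The sketch uses numpy's log1p (log of 1 + x, which handles zeros safely) and scikit-learn's Normalizer, reusing the (50, 250, 400) row from the example above:

import numpy as np
from sklearn.preprocessing import Normalizer

row = np.array([[50.0, 250.0, 400.0]])

print(np.log1p(row))  # log transform: log(1 + x)

# L1: each row divided by the sum of its absolute values (700 here)
print(Normalizer(norm="l1").fit_transform(row))  # [[0.0714..., 0.3571..., 0.5714...]]

# L2: each row divided by its Euclidean norm
print(Normalizer(norm="l2").fit_transform(row))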
Handling Categorical Data
Find and replace
Label encoding
Binary encoding
One-hot encoding, e.g. with pandas (obj_df is the DataFrame of categorical columns from the original example):

import pandas as pd
pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

Ordinal encoding with scikit-learn:

from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])
obj_df[["make", "make_code"]].head(11)
Sampling
Sampling is done to draw conclusions about a population from a sample: it enables us to determine a population's characteristics by directly observing only a portion (sample) of the population.
Types of Sampling
Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling
A pandas sketch of simple random and stratified sampling follows.
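In the sketch below, the DataFrame and sampling fractions are hypothetical; GroupBy.sample needs pandas 1.1 or later:

import pandas as pd

df = pd.DataFrame({"group": ["A"] * 8 + ["B"] * 2, "value": range(10)})

# Simple random sampling: draw 30% of the rows at random
simple = df.sample(frac=0.3, random_state=42)

# Stratified sampling: draw 50% of the rows within each group,
# preserving the group proportions
stratified = df.groupby("group", group_keys=False).sample(frac=0.5, random_state=42)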
Resampling
Resampling is a family of methods used to reconstruct sample data sets, including training sets and validation sets, e.g. cross-validation (CV).
Imbalanced dataset example: in a utilities fraud detection data set you might have:
Total observations = 1,000
Fraudulent observations = 20
Non-fraudulent observations = 980
Event rate = 2%
Resampling Techniques
Random under-sampling
Random over-sampling
Cluster-based over-sampling
Informed over-sampling
A pandas sketch of random under- and over-sampling follows.
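The sketch below works on the 2% fraud example above; a dedicated library such as imbalanced-learn offers the same operations, but plain pandas is enough to show the idea:

import pandas as pd

df = pd.DataFrame({"fraud": [1] * 20 + [0] * 980})  # 2% event rate

minority = df[df["fraud"] == 1]
majority = df[df["fraud"] == 0]

# Random under-sampling: shrink the majority class to the minority size
under = pd.concat([minority, majority.sample(len(minority), random_state=42)])

# Random over-sampling: duplicate minority rows by sampling with replacement
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=42)])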
Discretization
Dividing attributes of a continuous nature into data with intervals. Approaches (a pandas sketch follows this list):
Binning
Histogram analysis
Equal-frequency partitioning: partitioning the values so that each interval holds roughly the same number of occurrences from the data set
Equal-width partitioning: partitioning the values into a fixed number of bins of equal range, e.g. splitting a set of values into the ranges 0-20, 20-40, and so on
Clustering: grouping similar data together
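In the sketch, the ages series and the bin count of 3 are illustrative:

import pandas as pd

ages = pd.Series([3, 10, 15, 22, 40, 41, 59, 63, 81])

width_bins = pd.cut(ages, bins=3)  # equal-width: 3 bins of equal range
freq_bins = pd.qcut(ages, q=3)     # equal-frequency: 3 bins with roughly equal counts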
Python Packages/Tools for Data Mining
Scikit-learn
Orange
Pandas
MLPy
MDP
PyBrain
... and many more
Some Other Basic Packages
NumPy and SciPy: fundamental packages for scientific computing with Python; contain powerful n-dimensional array objects, useful linear algebra, random number generation, and other capabilities
Pandas: contains useful data structures and algorithms
Matplotlib: contains functions for plotting/visualizing data