Data Preprocessing in Machine Learning

pyingkodimaran1 · 31 slides · Aug 23, 2021

About This Presentation

Researchers can learn about the available techniques in data preprocessing.


Slide Content

Data Preprocessing. Dr. M. Pyingkodi, AP/MCA, Kongu Engineering College, Erode, Tamil Nadu.

Data Preprocessing: the process of preparing data for analysis; the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training machine learning models. Real-world data is typically incomplete, inconsistent and likely to contain many errors.

Preprocessing techniques: data cleaning (noise, outliers, missing values, duplicate data, dealing with categorical data), data integration, data transformation, data reduction, sampling, imputation, discretization, feature extraction, splitting the dataset into training and testing sets, and scaling the features.

Types of data. Numerical data: discrete (date, number of students in a class) and continuous (cost of a house, in decimals). Categorical data: nominal (gender), ordinal (grades of a student, split into groups), dichotomous (cancerous vs. non-cancerous). Time series data: a sequence of numbers collected at regular intervals over some period of time. Text data: words.

Quality of data: 1. Accuracy (human/computer errors, incorrect formats), 2. Completeness, 3. Consistency. Data preprocessing is divided into four stages: data cleaning, data integration, data reduction and data transformation.

Data cleaning: the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set; identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or deleting the dirty or coarse data within a dataset. It covers duplicate observations, irrelevant observations, fixing structural errors, noise, outliers, missing values, duplicate data, and dealing with categorical data.

Noisy data: data with a large amount of additional meaningless information in it. Noisy data can be handled by binning, regression or clustering.

Outliers: extreme values that fall a long way outside the other observations; for example, in a normal distribution, outliers may be values in the tails of the distribution. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

Finding outliers: box plot, scatter plot, Z-score, expectation-maximization, linear correlations (principal component analysis), cluster, density or nearest-neighbour analysis, and the interquartile range (IQR).
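As a minimal sketch of two of these checks (the sample values and thresholds below are made-up assumptions, not from the slides):

    import numpy as np

    # Hypothetical sample with one extreme value
    values = np.array([12, 14, 15, 13, 16, 14, 15, 95])

    # IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

    # Z-score rule: flag points more than 2 standard deviations from the mean
    # (a small threshold chosen because the sample here is tiny)
    z_scores = (values - values.mean()) / values.std()
    z_outliers = values[np.abs(z_scores) > 2]

    print("IQR outliers:", iqr_outliers)
    print("Z-score outliers:", z_outliers)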

Missing values: 1. Removing the training example. 2. Filling in the missing value manually. 3. Using a standard value to replace the missing value. 4. Using the central tendency (mean, median, mode) of the attribute to replace the missing value. 5. Using the central tendency (mean, median, mode) of the attribute within the same class to replace the missing value. 6. Using the most probable value to fill in the missing value.

Handling missing values

Techniques for dealing with missing data: drop missing values/columns/rows, or imputation. A slightly better approach to handling missing data is imputation, which means replacing or filling in the missing data with some value. There are many ways to impute data: a constant value that belongs to the set of possible values of that variable, such as 0, distinct from all other values; the mean, median or mode of the column; a value estimated by another predictive model; or multiple imputation.
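A minimal sketch of the dropping, constant-fill and mean-imputation options, assuming pandas and scikit-learn are available (the tiny DataFrame and its column names are made up for illustration):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Hypothetical data with missing entries
    df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                       "salary": [50000, 60000, np.nan, 58000]})

    # Option 1: drop rows (or columns) containing missing values
    dropped = df.dropna()

    # Option 2: fill with a constant value distinct from the real values
    constant_filled = df.fillna(0)

    # Option 3: fill with a central tendency (here the column mean)
    mean_imputer = SimpleImputer(strategy="mean")
    mean_filled = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

    print(mean_filled)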

Data integration: combining data from disparate sources (different technologies) into meaningful and valuable information. It may involve multiple databases, data cubes or flat files. Issues: schema integration, redundancy, and detection and resolution of data value conflicts.
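A small sketch of combining two hypothetical sources on a shared key with pandas (the tables, key and column names are assumptions for illustration, not from the slides):

    import pandas as pd

    # Two hypothetical sources describing the same customers
    crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ravi", "Mala"]})
    billing = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [120.0, 75.5, 40.0]})

    # Schema integration: join on the common key; unmatched rows surface as NaN
    merged = crm.merge(billing, on="customer_id", how="outer")

    # Redundancy: drop exact duplicate rows that can appear after integration
    merged = merged.drop_duplicates()

    print(merged)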

Data reduction (dimension reduction): compressing a large set of features onto a new feature subspace of lower dimension without losing the important information. Dimensionality reduction can be done in two different ways: by keeping only the most relevant variables from the original dataset (this technique is called feature selection), or by finding a smaller set of new variables, each a combination of the input variables, containing essentially the same information as the input variables.

Data reduction: a process that reduces the volume of the original data and represents it in a much smaller volume, while ensuring the integrity of the data. Techniques: missing values ratio, low variance filter, high correlation filter, principal component analysis.
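A brief sketch of two of these reduction steps with scikit-learn on made-up numeric data (the array shape, the near-constant fifth feature and the variance threshold are illustrative assumptions):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import VarianceThreshold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    X[:, 4] = 0.001 * rng.normal(size=100)  # a nearly constant, low-variance feature

    # Low variance filter: drop features whose variance falls below a threshold
    X_filtered = VarianceThreshold(threshold=0.01).fit_transform(X)

    # Principal component analysis: project onto the top 2 components
    X_reduced = PCA(n_components=2).fit_transform(X_filtered)

    print(X.shape, "->", X_filtered.shape, "->", X_reduced.shape)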

Data transformation: taking data stored in one format and converting it to another, e.g. datasets in which different columns have different units, such as one column in kilograms while another is in centimetres. Techniques: smoothing, attribute/feature construction, aggregation, normalization, discretization, and concept hierarchy generation for nominal data.

Data transformation: scaling. MinMaxScaler scales all the data into [0, 1]; the scaled value is x_scaled = (x - x_min) / (x_max - x_min). StandardScaler scales the values so that the mean is 0 and the standard deviation (and hence the variance) is 1. MaxAbsScaler takes the absolute maximum value of each column and divides each value in the column by that maximum, scaling the data into the range [-1, 1]. RobustScaler standardizes input variables in the presence of outliers by leaving the outliers out of the calculation of centre and spread (using the median and the interquartile range rather than the mean and standard deviation). QuantileTransformer converts the variable's distribution to a normal distribution and scales it accordingly; the quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution.
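A minimal sketch of a few of these scalers from scikit-learn applied to the same made-up column (the values, including the deliberate outlier, are assumptions for illustration):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

    # Hypothetical single feature with one outlier
    X = np.array([[10.0], [20.0], [30.0], [40.0], [500.0]])

    # MinMax: (x - x_min) / (x_max - x_min), squeezes everything into [0, 1]
    print(MinMaxScaler().fit_transform(X).ravel())

    # Standard: zero mean, unit standard deviation
    print(StandardScaler().fit_transform(X).ravel())

    # Robust: centres on the median and scales by the IQR, so the outlier matters less
    print(RobustScaler().fit_transform(X).ravel())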

Data transformation: log transform and normalization. Log transform: take the log of the values in a column and use those values in place of the column. It is primarily used to convert a skewed distribution to a normal (or less skewed) distribution; the log-transformed data follows a normal or near-normal distribution, reducing the impact of both too-low and too-high values. Unit vector scaler / Normalizer: normalization is the process of scaling individual samples to have unit norm, and the Normalizer works on rows. With the L1 norm, the values in each row are divided by the sum of their absolute values, so the absolute values along the row sum to 1. With the L2 norm, each value is divided by the square root of the sum of squares of the row, so the squares along the row sum to 1 (the slide's example: 50, 250, 400 → 0.05, 0.25 and 0.4).
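A short sketch of the log transform and row-wise normalization (the sample values are made up; np.log1p, i.e. log(1 + x), is used so zero values are handled safely):

    import numpy as np
    from sklearn.preprocessing import Normalizer

    # Log transform: compress a right-skewed feature
    skewed = np.array([1.0, 10.0, 100.0, 1000.0])
    log_transformed = np.log1p(skewed)

    # Unit-norm scaling works row by row
    rows = np.array([[50.0, 250.0, 400.0]])
    l1_scaled = Normalizer(norm="l1").fit_transform(rows)  # absolute values sum to 1 per row
    l2_scaled = Normalizer(norm="l2").fit_transform(rows)  # squares sum to 1 per row

    print(log_transformed, l1_scaled, l2_scaled)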

Handling categorical data: find and replace, label encoding, binary encoding, one-hot encoding, e.g. pd.get_dummies(obj_df, columns=["drive_wheels"]).head(), and OrdinalEncoder:

    from sklearn.preprocessing import OrdinalEncoder
    ord_enc = OrdinalEncoder()
    obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])
    obj_df[["make", "make_code"]].head(11)
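The snippets above assume an existing DataFrame named obj_df that is not defined on the slide; a self-contained sketch of the same idea on a made-up frame (the column names make and drive_wheels follow the slide, the values are assumed):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    obj_df = pd.DataFrame({
        "make": ["audi", "bmw", "audi", "volvo"],
        "drive_wheels": ["fwd", "rwd", "fwd", "4wd"],
    })

    # One-hot encoding: one 0/1 column per category of drive_wheels
    one_hot = pd.get_dummies(obj_df, columns=["drive_wheels"])

    # Ordinal (integer) encoding of the make column
    ord_enc = OrdinalEncoder()
    obj_df["make_code"] = ord_enc.fit_transform(obj_df[["make"]])

    print(one_hot.head())
    print(obj_df[["make", "make_code"]].head())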

Sampling: sampling is done to draw conclusions about populations from samples; it enables us to determine a population's characteristics by directly observing only a portion (or sample) of the population.

Types of sampling: simple random sampling, systematic sampling, stratified sampling, cluster sampling.
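A compact sketch of simple random, systematic and stratified sampling with pandas (the DataFrame, sample sizes and the stratification column are made-up assumptions; grouped sampling requires a reasonably recent pandas version):

    import pandas as pd

    df = pd.DataFrame({"id": range(100), "group": ["A"] * 80 + ["B"] * 20})

    # Simple random sampling: every row has an equal chance of selection
    simple = df.sample(n=10, random_state=42)

    # Systematic sampling: take every k-th row (here every 10th)
    systematic = df.iloc[::10]

    # Stratified sampling: sample the same fraction from each group
    stratified = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=42)

    print(len(simple), len(systematic), len(stratified))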

Resampling: re-sampling is a series of methods used to reconstruct sample data sets, including training sets and validation sets, e.g. cross-validation (CV). Imbalanced dataset example: in a utilities fraud detection data set you have the following data: total observations = 1,000; fraudulent observations = 20; non-fraudulent observations = 980; event rate = 2%.

Resampling techniques: random under-sampling, random over-sampling, cluster-based over-sampling, informed over-sampling.
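A sketch of the two simplest techniques using only pandas, on a made-up imbalanced frame whose class counts mirror the 980/20 example above (the column name is_fraud is an assumption):

    import pandas as pd

    df = pd.DataFrame({"is_fraud": [0] * 980 + [1] * 20})
    majority = df[df["is_fraud"] == 0]
    minority = df[df["is_fraud"] == 1]

    # Random under-sampling: shrink the majority class to the minority size
    under = pd.concat([majority.sample(len(minority), random_state=0), minority])

    # Random over-sampling: duplicate minority rows until they match the majority size
    over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])

    print(under["is_fraud"].value_counts().to_dict())
    print(over["is_fraud"].value_counts().to_dict())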

Discretization: dividing attributes of a continuous nature into data with intervals. Techniques: binning, histogram analysis. Equal frequency partitioning: partitioning the values based on their number of occurrences in the data set. Equal width partitioning: partitioning the values into fixed-width intervals based on the number of bins, e.g. over a set of values ranging from 0 to 20. Clustering: grouping similar data together.
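A minimal sketch of equal-width and equal-frequency binning with pandas (the values and the number of bins are illustrative assumptions):

    import pandas as pd

    ages = pd.Series([2, 5, 9, 12, 15, 18, 19, 20, 7, 11])

    # Equal width partitioning: 4 bins of the same width over the value range
    equal_width = pd.cut(ages, bins=4)

    # Equal frequency partitioning: 4 bins with (roughly) the same number of values
    equal_freq = pd.qcut(ages, q=4)

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())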

Python packages/tools for data mining: scikit-learn, Orange, Pandas, MLPy, MDP, PyBrain, and many more.

Some other basic packages: NumPy and SciPy, fundamental packages for scientific computing with Python, which provide powerful n-dimensional array objects, useful linear algebra, random number generation and other capabilities; Pandas, which contains useful data structures and algorithms; and Matplotlib, which contains functions for plotting/visualizing data.

https://www.ritchieng.com/pandas-randomly-sample-rows/ https://slideplayer.com/slide/15394883/