Data Preparation
•Important Part of Data Science, Machine Learning and Deep Learning
•Data Cleaning
•Feature Engineering
Improves Performance and Accuracy of ML and DL Models
Data Preprocessing
–Data Refining
–Data Integration
–Data Transformation
–Data Reduction
Data Wrangling
•Executed at the time of making an interactive model
•Conversion of Raw Data to Consumable Set of Data
•Technique also known as Data Munging
Examples:
•Sorting using specific algorithm
•Storage to other Database Format
•Filtering, Grouping and Selecting Appropriate Data
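The wrangling examples above can be sketched with pandas. This is only an illustrative sketch: the table, its column names, and the threshold of 5 units are assumptions made up for the example, not part of the original material.

```python
import pandas as pd

# Hypothetical raw data; column names and values are invented for illustration.
raw = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
        "product": ["A", "B", "A", "C", "A"],
        "units": [10, 3, 7, 12, 5],
        "price": [2.5, 4.0, 2.5, 1.0, 2.5],
    }
)

# Filtering: keep only rows with at least 5 units sold.
filtered = raw[raw["units"] >= 5]

# Selecting: keep only the columns needed for the analysis.
selected = filtered[["city", "units"]]

# Grouping: total units per city.
grouped = selected.groupby("city", as_index=False)["units"].sum()

# Sorting with a specific algorithm: merge sort, largest total first.
result = grouped.sort_values(
    "units", ascending=False, kind="mergesort"
).reset_index(drop=True)

# Storage in another format: serialize the consumable table as JSON records.
json_text = result.to_json(orient="records")

print(result)
```

Each step produces a new DataFrame, so the raw data stays untouched and any step can be rerun while the model is being refined.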
Need for Data Preparation and Preprocessing
–Data Leakage (Causes invalid ML/DL predictions)
•Leakage in Training Data
–Missing Values (Inaccurate Data)
–Noisy (Erroneous Data)
–Inconsistent Data
–Usage of data outside the scope of Applied Algorithm
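One common way leakage enters training data is computing preprocessing statistics on the full dataset before splitting it. The sketch below contrasts that with computing them on the training rows only. The values, the split point, and the helper name standardize are assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical feature values; the 5/2 split is an assumption for illustration.
values = [12.0, 15.0, 11.0, 14.0, 13.0, 40.0, 42.0]
train, test = values[:5], values[5:]

# Leaky: the mean is computed on all rows, so test rows influence training.
leaky_mean = mean(values)

# Safer: statistics are computed on the training rows only.
train_mean = mean(train)
train_std = stdev(train)


def standardize(xs, mu, sigma):
    """Scale values using a given mean and standard deviation."""
    return [(x - mu) / sigma for x in xs]


train_scaled = standardize(train, train_mean, train_std)
# The test rows reuse the training statistics; nothing is refit on them.
test_scaled = standardize(test, train_mean, train_std)

print(leaky_mean, train_mean)
```

Here the leaky mean is pulled far above the training mean by the two held-out rows, which is exactly the information a model should not see before evaluation.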
Data Preprocessing on Missing Values
•Ignoring the Missing Values
•Filling the missing values manually
•Filling using Computed Values
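The three options above might look as follows in pandas. This is a minimal sketch: the table, the "Unknown" placeholder, and the choice of mean and mode as the computed values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table with one missing value in each column.
df = pd.DataFrame(
    {
        "age": [25.0, None, 31.0, 40.0],
        "city": ["Pune", "Delhi", None, "Pune"],
    }
)

# 1. Ignoring: drop every row that has at least one missing value.
dropped = df.dropna()

# 2. Filling manually: supply a fixed value chosen by the analyst.
manual = df.fillna({"city": "Unknown"})

# 3. Filling with computed values: mean for numeric, mode for categorical.
computed = df.copy()
computed["age"] = computed["age"].fillna(computed["age"].mean())
computed["city"] = computed["city"].fillna(computed["city"].mode()[0])

print(computed)
```

Dropping rows is the simplest option but discards data; filling keeps every row at the cost of introducing values that were never observed.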
Data Preprocessing on Noisy Data
•Data Binning
•Preprocessing in Clusters
•Machine Learning
•Manual Removal
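As an illustration of data binning, the sketch below applies smoothing by bin means: the values are sorted, split into equal-size bins, and each value is replaced by the mean of its bin. The function name smooth_by_bin_means, the sample values, and the bin size of 3 are assumptions for the example, and this is only one of several binning variants.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into bins of bin_size, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical noisy measurements.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
print(smoothed)
```

Note that the result follows the sorted order of the input, so this sketch suits cases where the original row order does not matter.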
Data Preprocessing Vs. Data Wrangling
Data Preprocessing is performed before Data Wrangling.
•In Data Preprocessing, data is prepared immediately after receiving the data from the data source.
•In these initial transformations, Data Cleaning or any aggregation of data is performed. It is executed once.
•It is the concept that is performed before applying any iterative model and will be executed once in the project.
Data Wrangling is performed during the iterative analysis and model building.
•This concept is applied at the time of feature engineering.
•The conceptual view of the dataset changes as different models are applied to achieve a good analytic model.
Free and Open Source Programming Platforms and Tools
•Tabula
•Python-Pandas
•OpenRefine
•R Language
•Weka
•RapidMiner
•ELKI
•KNIME
•CSVKit
•Orange
Data Extraction and Transformation
https://tabula.technology/
Tabula turns tables in PDF reports into Excel spreadsheets, CSVs, and JSON files for use in analysis and database applications