Advance Data_Preprocessing_and_Wrangling

Bhushan134837 13 views 8 slides May 31, 2024

About This Presentation

data science


Slide Content

Data Preprocessing: From Raw to Refined

Agenda
• Data Preprocessing
• Need of data preprocessing
• Type of Data
• Objectives of data preprocessing
• Hands on

Data Preprocessing
Data preprocessing refers to the steps and techniques involved in preparing raw data for analysis or further processing. It is a crucial phase in data analysis and machine learning workflows, aiming to improve data quality, consistency, and compatibility for effective data mining, modeling, and interpretation. The main objectives of data preprocessing include cleaning noisy or incomplete data, transforming data into a suitable format for analysis, integrating data from multiple sources, and enhancing data quality to facilitate accurate and meaningful insights. Overall, data preprocessing plays a fundamental role in ensuring that data is well-organized, standardized, and ready for exploration and modeling tasks. The quality of the data should be checked before applying machine learning or data mining algorithms.

Need of Data Preprocessing
Data is preprocessed mainly to check and improve its quality. Quality can be assessed along the following dimensions:
• Accuracy: whether the data entered is correct.
• Completeness: whether the required data is recorded or missing.
• Consistency: whether the same data matches in every place it is stored.
• Interpretability: how easily the data can be understood.
• Believability: whether the data can be trusted.
• Timeliness: whether the data is kept up to date.
Real-world data are generally:
• Incomplete: missing attribute values, missing certain attributes of importance, or containing only aggregate data
• Noisy: containing errors or outliers
• Inconsistent: containing discrepancies in codes or names
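Three of these checks (completeness, noise, and consistency) can be sketched in a few lines of Pandas. This is only an illustration: the toy table, the column names, and the 0 to 120 age range are assumptions made up for the example, not part of the presentation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one implausible age,
# and the same city written in several ways.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dia"],
    "age": [29, np.nan, 31, 410],
    "city": ["NY", "New York", "ny", "Boston"],
})

# Completeness: count missing values per column.
missing_per_column = df.isna().sum()

# Noise: flag recorded ages outside a plausible range
# (0 to 120 is an assumed rule for this example).
implausible_age = df["age"].notna() & ~df["age"].between(0, 120)

# Consistency: list the distinct spellings after trimming and lower-casing.
city_variants = sorted(df["city"].str.strip().str.lower().unique())

print(missing_per_column)
print(df[implausible_age])
print(city_variants)
```

Here "ny" and "new york" both survive normalisation, which shows that case folding alone does not resolve discrepancies in codes or names; a mapping table would still be needed.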

Objectives of Data Preprocessing
The main objectives of data preprocessing are:
• To transform the raw data into an understandable format
• To transform data into a usable format
• To eliminate inconsistencies in data
• To remove duplicates in data
• To provide more accurate data for further processing
• To handle incorrect or missing values in data
• To reduce dimensionality in data
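Several of these objectives map directly onto Pandas operations. The sketch below uses a made-up table; the column names, the choice of median imputation, and the "drop constant columns" rule are assumptions for illustration, and other strategies are equally valid.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with inconsistent codes, a duplicate row,
# missing values, and a column that never varies.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "city": [" NY", "ny", "ny", "Boston", "boston "],
    "income": [52000.0, np.nan, np.nan, 61000.0, 58000.0],
    "country": ["US", "US", "US", "US", "US"],
})

clean = raw.copy()

# Eliminate inconsistencies: normalise whitespace and case in the city codes.
clean["city"] = clean["city"].str.strip().str.upper()

# Remove duplicates: drop rows that are identical after normalisation.
clean = clean.drop_duplicates().reset_index(drop=True)

# Handle missing values: fill with the column median (one choice among several).
clean["income"] = clean["income"].fillna(clean["income"].median())

# Reduce dimensionality in its simplest form: drop columns with one distinct value.
constant_cols = [c for c in clean.columns if clean[c].nunique() == 1]
clean = clean.drop(columns=constant_cols)

print(clean)
```

The order matters in this sketch: normalising the codes first lets `drop_duplicates` recognise rows that differ only in spelling.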

Python Libraries Used for Data Preprocessing
• Pandas: Pandas is one of the most widely used libraries for data manipulation and analysis. It provides data structures like DataFrame and Series, which are highly efficient for handling structured data. Functions for data cleaning, filtering, merging, reshaping, and more are available in Pandas.
• NumPy: NumPy is fundamental for numerical computing in Python. It offers powerful array operations and mathematical functions that are crucial for data preprocessing tasks. NumPy arrays are used extensively in conjunction with Pandas DataFrames for data manipulation.
• Matplotlib and Seaborn: These libraries are used for data visualization, which is an important aspect of data preprocessing to understand the data's distribution and identify outliers. Matplotlib offers a wide range of plotting functions, while Seaborn provides high-level statistical visualization capabilities.
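As a small illustration of Pandas and NumPy working together, the sketch below flags outliers with the common 1.5 × IQR rule. The scores are invented for the example, and the IQR rule is one conventional choice rather than something the presentation prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one obviously extreme value.
scores = pd.Series([52, 55, 57, 58, 60, 61, 63, 120], name="score")

# Hand the underlying values to NumPy to compute the quartiles.
values = scores.to_numpy()
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]

print(outliers)
```

A box plot of the same Series (for example with Matplotlib's `plt.boxplot` or Seaborn's `sns.boxplot`) draws whiskers using this rule by default, so the extreme score would appear as an isolated point.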

Hands on Notebook