Introduction to ML_Data Preprocessing.pptx


About This Presentation

Machine learning basics with various types of data preprocessing


Slide Content

Introduction to ML

Machine Learning Definition, by Arthur Samuel (1959): Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed.

Data Preprocessing

Introduction

Data scientists process and analyse data using a number of methods and tools, such as statistical models, machine learning algorithms, and data visualisation software. Data science seeks to uncover patterns in data that can help with decision-making, process improvement, and the creation of new opportunities. It is an interdisciplinary field that spans business, engineering, and the social sciences.

Preprocessing refers to the transformations applied to our data before feeding it to the algorithm. Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources it is collected in a raw format that is not feasible for analysis.

Need for Data Preprocessing

To achieve better results from the applied model in a Machine Learning project, the data has to be in a proper format. Some Machine Learning models need information in a specified format; for example, the Random Forest algorithm does not support null values, so null values have to be handled in the original raw data set before the algorithm can be executed (see the sketch below). Another aspect is that the data set should be formatted so that more than one Machine Learning or Deep Learning algorithm can be executed on the same data set, with the best of them then chosen.
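As a minimal sketch of the null-handling point above (the column name 'Glucose' and the toy values are illustrative, not from the slides), two common options with pandas are dropping or filling the missing entries:

# Minimal sketch: handling null values before fitting a model such as
# Random Forest. The DataFrame and column names here are illustrative.
import pandas as pd

df_raw = pd.DataFrame({'Glucose': [148.0, None, 183.0], 'Outcome': [1, 0, 1]})

# Option 1: drop rows that contain any null value
df_dropped = df_raw.dropna()

# Option 2: fill nulls in a column with a statistic, e.g. the median
df_raw['Glucose'] = df_raw['Glucose'].fillna(df_raw['Glucose'].median())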

Steps in Data Preprocessing

Step 1: Import the necessary libraries
Step 2: Load the dataset
Step 3: Statistical Analysis
Step 4: Check the outliers
Step 5: Correlation
Step 6: Separate independent features and Target Variables
Step 7: Normalization or Standardization

Step 1: Import the necessary libraries

# importing libraries
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

- import pandas as pd: Pandas is a powerful library used for data manipulation and analysis, especially for handling structured data like spreadsheets or databases.
- import scipy: SciPy builds on top of NumPy and provides functions for optimization, signal processing, linear algebra, statistics, and more.
- import numpy as np: NumPy is a fundamental package for scientific computing in Python. It provides support for arrays, matrices, mathematical functions, and operations that are essential for numerical computations.
- from sklearn.preprocessing import MinMaxScaler: imports the MinMaxScaler class from the scikit-learn library (sklearn), a popular machine learning library in Python. MinMaxScaler is used for scaling features in a dataset to a specific range, usually between 0 and 1.
- import seaborn as sns: Seaborn is a data visualization library built on top of Matplotlib that provides a higher-level interface for creating aesthetically pleasing and informative statistical graphics.
- import matplotlib.pyplot as plt: imports the pyplot submodule from the Matplotlib library and assigns it the alias plt. Matplotlib is a widely used plotting library in Python; pyplot provides a simple and convenient interface for creating various types of plots and visualizations.

Step 2: Load the dataset

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Data/diabetes.csv')
print(df.head())

Check the data info

df.info()

As we can see from the info above, our dataset has 9 columns and each column has 768 values. There are no null values in the dataset. We can also check for null values using df.isnull().
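As a short sketch building on the slide's df.isnull() remark, chaining .sum() turns the Boolean frame it returns into a per-column count of missing values:

# df.isnull() returns a Boolean DataFrame; summing it collapses the
# Booleans into a per-column count of missing values.
print(df.isnull().sum())

# Total number of missing cells across the whole dataset
print(df.isnull().sum().sum())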

Step 3: Statistical Analysis

In statistical analysis, we first use df.describe(), which gives a descriptive overview of the dataset.

df.describe()

The resulting table shows the count, mean, standard deviation, min, 25%, 50%, 75%, and max values for each column. When we observe the table carefully, we find that the Insulin, Pregnancies, BMI, and BloodPressure columns have outliers.
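The slide reaches that conclusion by eye; as a hedged sketch, the conventional 1.5×IQR rule (the same rule the boxplot whiskers in Step 4 are based on) can count suspected outliers per column:

# Sketch: count suspected outliers per column with the 1.5*IQR rule,
# the rule that boxplot whiskers conventionally visualize.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
outliers = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
print(outliers.sum())  # number of flagged values in each column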

Step 4: Check the outliers

Let's plot a boxplot for each column for easy understanding.

# Box plots
fig, axs = plt.subplots(9, 1, dpi=95, figsize=(7, 17))
i = 0
for col in df.columns:
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
    i += 1
plt.show()

Step 5: Correlation

Correlation refers to the statistical relationship between two entities. It measures the extent to which two variables are linearly related. For example, the height and weight of a person are related: taller people tend to be heavier than shorter people.

# correlation
corr = df.corr()

plt.figure(dpi=130)
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()

We can also compare correlations with a single column, sorted in descending order:

corr['Outcome'].sort_values(ascending=False)

Step 6: Separate independent features and Target Variables

# separate array into input and output components
X = df.drop(columns=['Outcome'])
Y = df.Outcome

Step 7: Normalization or Standardization

Normalization: MinMaxScaler scales the data so that each feature is in the range [0, 1]. It works well when the features have different scales and the algorithm being used is sensitive to the scale of the features, such as k-nearest neighbors or neural networks. You can rescale your data with scikit-learn using the MinMaxScaler class.

Standardization: Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. We can standardize data using scikit-learn with the StandardScaler class. It works well when the features have a normal distribution or when the algorithm being used is not sensitive to the scale of the features.
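Neither scaler's code appears on the slides; as a minimal sketch applied to the X produced in Step 6:

# Sketch: apply both scalers to the feature matrix X from Step 6.
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization: rescale each feature into the range [0, 1]
X_normalized = MinMaxScaler().fit_transform(X)

# Standardization: rescale each feature to mean 0 and unit variance
X_standardized = StandardScaler().fit_transform(X)

print(X_normalized[:3])
print(X_standardized[:3])

Both scalers learn their parameters (min/max or mean/std) in fit_transform; in a real train/test split you would fit on the training set only and call transform on the test set to avoid leaking test statistics.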