linear regression application of machine learning.pptx
imrannazeer2957
30 slides
Jul 04, 2024
About This Presentation
linear regression application of Machine Learning
Size: 1.6 MB
Language: en
Slide Content
In the name of Allah, the Most Gracious, the Most Merciful
Project Title: The Linear Regression Applications to Machine Learning with Practical Implementation
Group ID: F22071ED87
Group Members: BC190201392, BC180407936, BC190404854, BC190205385, BC190405179, BC190410680
Supervisor: Irfan Ullah
Program: BS (Mathematics)
Table of Contents
01 Statistics
02 Types of statistics
03 Descriptive statistics and types
04 Inferential statistics and types
05 Parameters and statistics
06 Variable and types
07 Scales of measurement
08 Measure of central tendency
09 Measure of dispersion
10 Sample Space and Event
11 Random Variable and Types
12 Probability and types
13 Probability of dependent and independent events
14 Data Organizing and Frequency Distribution
15 Regression analysis and types
16 Linear regression and Multiple regression
17 What is Python
18 Python uses and Data types
19 Python operations and libraries
20 Machine Learning
21 Terminologies of ML
22 Steps and Types of ML
23 Data Analysis
24 Data Manipulation
25 Assumptions of Linear Regression
26 Practical implementation in Python
27 Linearity and normality
28 Outliers
29 Simple Linear regression
30 Train/test
31 Finding Slope as Coefficient and y Intercept as Intercept
Statistics: the science of collecting, analyzing, presenting, and interpreting data. Types of statistics: descriptive statistics and inferential statistics.
Population: all the elements or items that are under consideration in a statistical study.
Sample: a subset of the population, that is, a small part of all the possible data values in the specified field of study.
Sampling: the process of selecting the sample from the population. Types of sampling: probability sampling and non-probability sampling.
Probability Sampling: the sample units are chosen by a random mechanism and cannot be selected at the discretion of the researcher.
Non-Probability Sampling: the sample units can be selected at the discretion of the researcher.
Descriptive Statistics: describes and helps to understand the features of a specific data set by giving short summaries about the sample and measures of the data. Types of descriptive statistics: measures of central tendency and measures of dispersion.
Measures of Central Tendency: a single value that attempts to describe a set of data by identifying the central position within that set. They include the mean (arithmetic mean, geometric mean, harmonic mean, weighted mean), the median and the mode.
Measures of Dispersion: describe the variability of the data, i.e. how homogeneous or heterogeneous the data is. They include the mean deviation, variance, standard deviation, range and inter-quartile range.
Inferential Statistics: a branch of statistics that uses various analytical tools to draw inferences about the population from sample data. Types of inferential statistics: hypothesis testing and regression analysis.
Hypothesis Testing: used to test assumptions and draw conclusions about the population from the available sample data. It includes the Z-test, F-test, t-test, ANOVA test, Wilcoxon signed-rank test and Mann-Whitney U test.
Regression Analysis: quantifies how one variable changes with respect to another variable. It includes simple linear, multiple linear, logistic, ordinal and nominal regression. The most common is linear regression.
Parameter: a number describing a whole population.
Statistic: a number describing a sample.
Variable: a characteristic that can be measured and that can assume different values. Types of variables: qualitative variables and quantitative variables.
Qualitative variables: variables that express a qualitative attribute.
Quantitative variables: also called numeric variables, these are variables that are measured in terms of numbers.
Types of Quantitative Variable: discrete variable and continuous variable.
Discrete Variable: restricted to certain values, which usually (but not necessarily) are whole numbers.
Continuous Variable: may take on an infinite number of intermediate values along a specified interval.
Scales of Measurement: in statistics, variables or numbers are defined and categorized using different scales of measurement. Levels of measurement: nominal scale, ordinal scale, interval scale, ratio scale.
Nominal Scale (1st level of measurement): usually deals with non-numeric variables, or with numbers that serve only as labels and carry no numeric value.
Ordinal Scale (2nd level of measurement): ordinal represents the "order". Ordinal data is known as qualitative or categorical data. It can be grouped, named and also ranked.
Interval Scale (3rd level of measurement): variables are measured in an exact manner rather than a relative one, and the zero point is arbitrary.
Ratio Scale (4th level of measurement): allows researchers to compare differences or intervals. Its unique feature is that it has a true origin, or zero point.
Measure of Central Tendency: in statistics, a measure of the average value of a data set. The three most common measures of central tendency are the mean, the median and the mode.
Mean: the average of the given numbers. Types of mean: arithmetic mean, geometric mean, harmonic mean.
Arithmetic Mean: calculated by dividing the sum of the given numbers by the total number of numbers: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Arithmetic Mean for Grouped Data: for grouped data with class midpoints $x_i$ and frequencies $f_i$, the mean can be found with the direct formula $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
Geometric Mean: calculated by raising the product of a series of numbers to the inverse of the total length of the series.
Geometric mean of ungrouped data: $GM = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$
Geometric mean of grouped data: $GM = \left(\prod_{i} x_i^{f_i}\right)^{1/\sum f_i}$
Harmonic Mean: the reciprocal of the average of the reciprocals of the data values.
Harmonic mean of ungrouped data: $HM = \frac{n}{\sum_{i=1}^{n} (1/x_i)}$
Harmonic mean of grouped data: $HM = \frac{\sum f_i}{\sum (f_i / x_i)}$
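The three means defined above can be compared on a small data set. The following is a minimal sketch that assumes Python 3.8 or later (for `statistics.geometric_mean`); the numbers are made up for illustration and are not from the slides.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical ungrouped data, chosen so the results are easy to check by hand.
values = [2, 4, 8]

am = mean(values)            # (2 + 4 + 8) / 3
gm = geometric_mean(values)  # (2 * 4 * 8) ** (1 / 3)
hm = harmonic_mean(values)   # 3 / (1/2 + 1/4 + 1/8)

print(f"arithmetic mean = {am:.4f}")
print(f"geometric mean  = {gm:.4f}")
print(f"harmonic mean   = {hm:.4f}")
```

For positive data the arithmetic mean is never smaller than the geometric mean, which is never smaller than the harmonic mean, and the printed values follow that order.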
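Mode: the value that occurs most frequently in a given data set. It can be found for both ungrouped and grouped data.
Median: the middlemost observation, obtained after arranging the data in ascending or descending order.
Median of ungrouped data: Median = $\left(\frac{n+1}{2}\right)$th observation
Median of grouped data: Median = $l + \frac{h}{f}\left(\frac{n}{2} - c\right)$, where $l$ is the lower boundary of the median class, $h$ its width, $f$ its frequency, $n$ the total frequency and $c$ the cumulative frequency of the class before the median class.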
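As a quick illustration of the median and mode of ungrouped data, here is a small sketch using Python's standard `statistics` module. The data lists are hypothetical.

```python
from statistics import median, mode, multimode

odd_sample = [7, 3, 9, 3, 5]        # 5 observations
even_sample = [7, 3, 9, 3, 5, 11]   # 6 observations

# With an odd count the median is the single middle value of the sorted data.
median_odd = median(odd_sample)
# With an even count it is the average of the two middle values.
median_even = median(even_sample)

most_common = mode(odd_sample)              # the most frequent value
all_modes = multimode([1, 1, 2, 2, 3])      # every value tied for most frequent

print(median_odd, median_even, most_common, all_modes)
```

`multimode` is useful when a data set has more than one mode, which the single-value `mode` function does not report.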
Measures of dispersion: range, inter-quartile range, variance, standard deviation.
Range: the difference between the largest and smallest values in the sample data.
Inter-quartile Range: the difference between the 75th and 25th percentiles of the data.
Variance: the mean of the squared deviations of the values from their mean.
Standard Deviation: the positive square root of the variance.
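The four measures of dispersion can be computed with the standard library. This is a sketch on made-up data; it uses the population variance (dividing by n) and the "inclusive" quartile method, both of which are choices made here rather than something the slides specify.

```python
import math
from statistics import pvariance, pstdev, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, mean = 5

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

variance = pvariance(data)   # mean of squared deviations from the mean
std_dev = pstdev(data)       # positive square root of the variance

print(f"range = {data_range}, IQR = {iqr}")
print(f"variance = {variance}, standard deviation = {std_dev}")
assert math.isclose(std_dev, math.sqrt(variance))
```

For a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` (dividing by n - 1) would be the usual alternatives.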
Sample Space: the collection, or set, of all possible outcomes of a random experiment.
Events: an event is an outcome, or a set of outcomes, of an experiment. Types of events in probability: impossible and sure events, simple events, compound events, independent and dependent events, complementary events, mutually exclusive events, exhaustive events.
Random Variable: a variable whose possible values depend on the outcomes of a certain random experiment. Types of random variables: discrete random variable and continuous random variable.
Probability: a measure of the likelihood that an event will occur.
Types of Probability: theoretical probability, experimental probability, axiomatic probability.
Probability of Dependent Events: dependent events influence the probability of other events, or their own probability of occurring is affected by other events.
Probability of Independent Events: independent events do not affect one another and do not increase or decrease the probability of another event happening.
Data Organizing and Frequency Distribution. Types of data: qualitative data and quantitative data. Forms of data: discrete data and continuous data.
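The difference between independent and dependent events can be made concrete with a card-drawing example. The example is an illustration added here, not taken from the slides: drawing two aces from a standard 52-card deck, first with replacement (independent draws) and then without (dependent draws).

```python
from fractions import Fraction

aces, deck = 4, 52

# Independent events: the first card is put back, so the second draw
# has the same probability as the first.
p_with_replacement = Fraction(aces, deck) * Fraction(aces, deck)

# Dependent events: the first ace is kept, so one ace and one card
# are missing when the second card is drawn.
p_without_replacement = Fraction(aces, deck) * Fraction(aces - 1, deck - 1)

print("with replacement:   ", p_with_replacement)
print("without replacement:", p_without_replacement)
```

Using `Fraction` keeps the probabilities exact, which makes it easy to see that the second draw's probability changes only in the dependent case.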
Classification of Data: the process of arranging the collected data into classes and subclasses according to their common characteristics. Types of classification: geographical, chronological, qualitative and quantitative classification.
Tabulation: the process of placing classified data in tabular form. Types of tabulation: simple (one-way) tabulation, double (two-way) tabulation, complex tabulation.
Frequency Distribution: a representation, in either graphical or tabular format, that displays the number of observations within a given interval. Types of frequency distribution: ungrouped, grouped, relative and cumulative frequency distribution.
Frequency Distribution Graphs: bar graphs, histograms, pie charts, frequency polygons.
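An ungrouped frequency distribution, together with its relative and cumulative forms, can be built in a few lines. This is a minimal sketch on hypothetical scores, using only the standard library.

```python
from collections import Counter
from itertools import accumulate

scores = [1, 2, 2, 3, 3, 3, 4, 4]  # hypothetical observations

counts = Counter(scores)
values = sorted(counts)                        # distinct values in order
frequency = [counts[v] for v in values]        # ungrouped frequency
total = sum(frequency)
relative = [f / total for f in frequency]      # share of all observations
cumulative = list(accumulate(frequency))       # running total of frequency

print("value  freq  relative  cumulative")
for v, f, r, c in zip(values, frequency, relative, cumulative):
    print(f"{v:>5}  {f:>4}  {r:>8.3f}  {c:>10}")
```

The last cumulative frequency always equals the number of observations, and the relative frequencies sum to 1.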
Regression Analysis: a set of statistical methods that analyze the relation between a dependent variable and one or more independent variables. Types of regression analysis: linear regression, logistic regression, ridge regression, lasso regression, polynomial regression, Bayesian linear regression.
Correlation: the statistical relationship between two entities.
Linear Regression: attempts to model the relationship between two variables by fitting a linear equation to observed data.
General Linear Model: linear regression is a form of the General Linear Model in which the parameters are a, the slope of the line, and b, the intercept, and ε is the error term: $y = ax + b + \varepsilon$
Multiple Regression: the different x variables are combined in a linear way and each has its own regression coefficient: $y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b + \varepsilon$
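The slides do not state how a and b are estimated. The usual choice, assumed here, is ordinary least squares, which picks the line that minimizes the sum of squared errors over the n observed pairs $(x_i, y_i)$. In the notation of the model $y = ax + b + \varepsilon$, the resulting estimates are:

```latex
\hat{a} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{b} = \bar{y} - \hat{a}\,\bar{x}
```

Here $\bar{x}$ and $\bar{y}$ are the sample means of the x and y values, so the fitted line always passes through the point $(\bar{x}, \bar{y})$.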
What is Python? Python is a popular, object-oriented programming language used for general-purpose programming. Python is used to create web applications, create workflows, handle big data and perform complex mathematics.
Python syntax compared to other programming languages: it is designed for readability and has some similarities to the English language, with influence from mathematics, as opposed to other programming languages, which often use semicolons or parentheses.
Python data types: numeric types: int, float, complex; string type: str; sequence types: list, tuple, range; binary types: bytes, bytearray, memoryview; mapping type: dict; Boolean type: bool.
Operations in Python: there are seven arithmetic operations in Python: addition, subtraction, multiplication, division, floor division, modulus and power.
Python Libraries: a library is a reusable chunk of code, e.g. Matplotlib, Pandas and NumPy.
List: a dynamically sized array, comparable to the arrays declared in other languages.
Tuple: a collection of Python objects separated by commas.
Sets: an unordered collection of unique items.
Python coding: with online Python compilers we can edit code and see the results in the browser.
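The arithmetic operators and the three collection types listed above can be tried directly. The values below are arbitrary and chosen only for illustration.

```python
a, b = 7, 2

operations = {
    "addition": a + b,
    "subtraction": a - b,
    "multiplication": a * b,
    "division": a / b,          # true division always gives a float
    "floor division": a // b,   # rounds the quotient down
    "modulus": a % b,           # remainder of the division
    "power": a ** b,
}
for name, result in operations.items():
    print(f"{name}: {result}")

numbers_list = [3, 1, 2]        # list: ordered and can grow or shrink
numbers_list.append(4)
point = 10, 20                  # tuple: objects separated by commas, fixed once created
unique_items = {1, 1, 2, 3, 3}  # set: duplicates are dropped automatically

print(numbers_list, point, unique_items)
```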
Learning: "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience," and "modification of a behavioral tendency by experience."
Machine learning: usually refers to the changes in systems that perform tasks associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control, prediction, etc.
Terminologies Used in ML: Algorithm, Machine Learning, Machine Learning Model, Black Box Model, Interpretable Machine Learning, Dataset, Instance, Target, Training, Machine Learning Task, Overfitting, Underfitting.
Steps in Machine Learning: there are 7 steps in machine learning: data collection, data preparation, choosing a model, training the model, evaluating the model, parameter tuning, and making predictions.
Types of Machine Learning: machine learning is broadly categorized under the following headings: supervised learning, unsupervised learning, reinforcement learning, deep learning, deep reinforcement learning.
Data Analysis
Reading data in Python: reading data into pandas DataFrames is often the very first step when conducting data analysis in Python.
Data exploration: visually exploring data sets to look for similarities, patterns and outliers, and to identify the relationships between different variables.
Data cleaning: the process of correcting or removing corrupt, incorrect or unnecessary data from a data set before data analysis.
Removing null values: there are several ways to remove null values from a list in Python. For example, filter(), join() and remove() can be used to delete empty strings from a list.
Removing duplicates: iterate through the elements of the list and store the first occurrence of each element in a temporary list, ignoring any other occurrences of that element.
Removing outliers: outliers are the values in a data set that stand out from the rest of the data. They can be the result of a reading error, a fault in the system, a manual error or a misreading. Two robust methods for removing outliers from the data are the IQR (interquartile range) method and the Z-score method.
IQR (interquartile range): the IQR is part of descriptive statistics and is also called the midspread or middle 50%. It is the third quartile minus the first quartile (Q3 - Q1).
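The three cleaning steps described above can be sketched on a plain Python list. The raw values are hypothetical, and the 1.5 × IQR cut-off used for the outlier step is the common Tukey rule, which is an assumption here since the slides do not name a threshold.

```python
from statistics import quantiles

# Hypothetical raw readings with an empty string, a None and duplicates.
raw = ["12", "", "15", None, "12", "14", "13", "95", "15", "11"]

# 1. Remove null values: filter(None, ...) drops None and empty strings.
non_null = list(filter(None, raw))

# 2. Remove duplicates: keep the first occurrence of each element.
unique = []
for item in non_null:
    if item not in unique:
        unique.append(item)

values = [int(item) for item in unique]

# 3. Remove outliers with the IQR method (IQR = Q3 - Q1).
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # assumed 1.5 * IQR fences
cleaned = [v for v in values if lower <= v <= upper]

print("after cleaning:", cleaned)
```

On this data the reading 95 lies far above the upper fence and is dropped, while the remaining values are kept in their original order.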
Data Manipulation: organizing data so that it is better structured and designed, which makes it easier to read and to interpret the insights it contains.
Filtering: the filter() function filters the given sequence with the help of a function that tests whether each element in the sequence is true or not. Syntax: filter(function, sequence)
Sorting: the sort() method sorts the list in ascending order by default. You can also pass a function to decide the sorting criteria. Syntax: list.sort(reverse=True|False, key=myfunc)
Creating New Columns: we perform a vast array of operations on the data to get it into the desired form; for example, we may want to create new columns in a DataFrame based on the result of some operations on its existing columns. Example: we can use the DataFrame.apply() function to achieve this task.
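The `filter()` and `sort()` syntax above can be tried on small lists. The data and the helper name `is_adult` are made up for illustration.

```python
ages = [5, 12, 17, 18, 24, 32]

def is_adult(age):
    """Test function for filter(): keep only ages of 18 or more."""
    return age >= 18

# filter() returns a lazy iterator, so wrap it in list() to see the result.
adults = list(filter(is_adult, ages))

cars = ["Ford", "Mitsubishi", "BMW", "VW"]

cars.sort()                          # ascending (alphabetical) by default
alphabetical = cars.copy()

cars.sort(reverse=True, key=len)     # longest name first, using len as the key
by_length = cars.copy()

print(adults)
print(alphabetical)
print(by_length)
```

Note that `list.sort()` changes the list in place and returns `None`; the built-in `sorted()` is the alternative when the original order must be kept.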
Assumptions of Linear Regression
Linear Relationship: the relationship between each independent variable and the dependent variable should be linear. It can be checked by making a scatter plot of each independent variable against the dependent variable.
Normality: the X and Y variables should be approximately normally distributed (strictly, the assumption applies to the model's residuals). Histograms, KDE plots and Q-Q plots can be used to check the normality assumption.
Independence / No Multicollinearity: the independent variables should not be highly correlated with one another; a VIF score greater than 5 indicates that they are. In addition, the observations should be independent of each other.
Consequences of violating any of the assumptions: a violation decreases the accuracy of the model, so the predictions are less accurate and the error is higher.
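To show what a VIF score measures, here is a sketch for the simplest case of two independent variables, where VIF = 1 / (1 - r²) and r is the correlation between them. With more predictors, r² is replaced by the R² from regressing one predictor on all the others. The data and the helper name `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Two hypothetical independent variables.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]

r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)

print(f"r = {r:.3f}, VIF = {vif:.2f}")
print("highly correlated" if vif > 5 else "acceptable by the VIF > 5 rule")
```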
Practical Implementation in Python
Reading the data set.
Checking for missing and null values.
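The code shown on this slide is not part of the transcript. A minimal sketch of the two steps, assuming pandas is installed, might look like the following. The column names `hours` and `score` and the inline CSV text are invented so that the example runs without an external file; with a real file, its path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical data set; two cells are deliberately left empty.
csv_text = "hours,score\n1,52\n2,\n3,61\n,70\n5,75\n"

# Reading the data set into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Checking for missing and null values: count them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# One simple way to handle them: drop every row that has a missing value.
clean = df.dropna()
print(clean)
```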
Linearity.
Normality: we can check it by creating a histogram.
Independence / No Multicollinearity.
Outliers: we can use different methods to find outliers. For example, a box plot lets us identify them visually.
Simple Linear Regression: Python has methods for finding a relationship between data points and for drawing a line of linear regression.
Train/Test: to measure whether the model is good enough, we can use a method called Train/Test. It is called Train/Test because the data set is split into two sets: a training set and a testing set.
Example result: the training set looks like the original data set, so it seems to be a fair selection.
Creating and training the model.
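The Train/Test idea can be sketched with the standard library alone. The function name `split_train_test`, the 80/20 proportion and the fixed seed are assumptions made for this illustration; libraries such as scikit-learn provide a ready-made equivalent.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)     # fixed seed for repeatable splits
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of (x, y) pairs.
data = [(x, 2 * x + 1) for x in range(10)]

train, test = split_train_test(data)

print(len(train), "training rows,", len(test), "testing rows")
```

Shuffling before splitting matters when the rows are stored in some order, since otherwise the testing set would contain only the first or last part of the data.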
Finding the slope as the coefficient and the y-intercept as the intercept.
Predicting values.
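The code and output for this slide are not in the transcript. As a stand-in, the following sketch fits a line by ordinary least squares in plain Python and uses it to predict a new value. The data and the function name `fit_line` are hypothetical; a fitted library model would report the same two numbers as its coefficient and intercept.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(f"coefficient (slope) = {slope}, intercept = {intercept}")

# Predicting a value for an x that was not in the training data.
new_x = 6
predicted_y = slope * new_x + intercept
print(f"predicted y for x = {new_x}: {predicted_y}")
```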
Evaluating the Performance of the Model
Mean squared error (the closer it is to zero, the more accurate the model).
R squared (the closer it is to 1, the more accurate the model).
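Both metrics are short enough to compute by hand. This sketch uses made-up actual and predicted values and hypothetical function names; it follows the standard definitions, with R squared computed as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted values from a testing set.
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.5, 7.0, 9.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)

print(f"MSE = {mse}, R squared = {r2}")
```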
Plotting the predicted and actual y values: the more closely the graph resembles a straight line, the more accurate the model is.
Multiple Regression Analysis: since the simple model is not that accurate, we should try multiple regression analysis with LinearRegression().
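The slide names LinearRegression(), but its code is not in the transcript. As a stand-in that needs only NumPy, the sketch below fits a multiple regression with two independent variables by least squares. The data is synthetic and built from known coefficients (a1 = 3, a2 = 2, b = 1) so the result can be checked.

```python
import numpy as np

# Hypothetical data: two independent variables per row.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
# Target generated without noise from y = 3*x1 + 2*x2 + 1.
y = 3 * X[:, 0] + 2 * X[:, 1] + 1

# Add a column of ones so the intercept b is estimated with the coefficients.
design = np.column_stack([X, np.ones(len(X))])

# Least-squares solution of design @ [a1, a2, b] = y.
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
a1, a2, b = solution

print(f"a1 = {a1:.3f}, a2 = {a2:.3f}, b = {b:.3f}")

# Predicting y for a new observation (x1 = 6, x2 = 2).
prediction = a1 * 6 + a2 * 2 + b
print(f"prediction = {prediction:.3f}")
```

With real, noisy data the recovered coefficients would only approximate the underlying ones, and the evaluation metrics from the previous slide indicate how well the model fits.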