UNIT 3.pptx.......................................

Unit 3 Data science components


Linear regression
Linear regression is one of the simplest and most popular machine learning algorithms. It is a statistical method used for predictive analysis. It predicts continuous (real-valued, numeric) variables such as sales, salary, age, or product price. The algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x).


Linear regression
y = mx + c + ε
where
- y = dependent variable (target variable)
- x = independent variable (predictor variable)
- m = slope of the line
- c = y-intercept of the line
- ε = error
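As an illustration of the equation y = mx + c + ε, the following minimal sketch estimates m and c by ordinary least squares. The function name `fit_line` and the data values are assumptions made up for this example; they do not come from the slides.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (m, c) that minimize the sum of squared errors for y = m*x + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    m = s_xy / s_xx            # slope
    c = y_bar - m * x_bar      # y-intercept
    return m, c


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = fit_line(xs, ys)
# The error term ε for each point is the observed y minus the fitted value.
errors = [y - (m * x + c) for x, y in zip(xs, ys)]
print(f"m = {m:.2f}, c = {c:.2f}")
```

For this data the fitted line is y = 0.6x + 2.2, and the errors sum to zero, as they always do for a least-squares line with an intercept.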

Logistic regression
Logistic regression is one of the most popular machine learning algorithms and belongs to the supervised learning family. It is used to predict a categorical dependent variable from a given set of independent variables. The outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, True or False. Instead of giving the exact value 0 or 1, however, it gives probabilistic values that lie between 0 and 1.

Logistic regression
Logistic regression is used for solving classification problems. Instead of fitting a regression line, we fit an "S"-shaped logistic function whose output is bounded by two limiting values, 0 and 1. The curve from the logistic function indicates the likelihood of something, such as whether the cells are cancerous or not, or whether a dog is a puppy or not based on its weight.
Logistic regression is based on the concept of maximum likelihood estimation: the parameters are chosen so that the observed data are most probable.
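The S-shaped logistic function described above can be sketched as follows. The weight `w`, bias `b`, and the 0.5 threshold are hypothetical values picked for illustration; they are not fitted by maximum likelihood and do not come from the slides.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, w, b):
    """Probability of the positive class for a single feature x."""
    return sigmoid(w * x + b)


def classify(x, w, b, threshold=0.5):
    """Turn the probability into a 0/1 label using a cut-off."""
    return 1 if predict_probability(x, w, b) >= threshold else 0


# Hypothetical parameters (assumed, not estimated from data).
w, b = 1.5, -6.0
for x in (2.0, 4.0, 6.0):
    p = predict_probability(x, w, b)
    print(f"x = {x}: probability = {p:.3f}, class = {classify(x, w, b)}")
```

The model itself only produces a probability; the class label comes from comparing that probability with a threshold, which is a separate design choice.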

Logistic regression
Logistic regression can provide probabilities and classify new data using both continuous and discrete datasets.

Linear vs Logistic Regression

Introducing the Gaussian
Carl Friedrich Gauss, ranked among “history's most influential mathematicians”, is credited with the normal distribution, which is why it is also called the Gaussian distribution. It is often called the bell curve because the graph of its probability density looks like a bell.
The normal distribution is a probability distribution that is symmetric about the mean, showing that data near the mean occur more frequently than data far from the mean.
Examples: heights of people, measurement errors, blood pressure, test marks, IQ scores, salaries

Properties of normal distribution
- It is described by its mean, median, mode and standard deviation
- The mean, mode and median are all equal
- The curve is symmetric about the center
- Exactly half of the values are to the left of the center and exactly half are to the right

Gaussian Distribution
- About 68% of the data falls within one standard deviation of the mean
- About 95% of the data falls within two standard deviations of the mean
- About 99.7% of the data falls within three standard deviations of the mean
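The three percentages above can be checked numerically. This short sketch uses the error function from Python's standard library; the helper name `prob_within` is an assumption for this example.

```python
import math


def prob_within(k):
    """P(|Z| <= k) for a standard normal variable Z, i.e. within k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {prob_within(k):.4f}")
```

The values come out as roughly 0.6827, 0.9545 and 0.9973, which the slide rounds to 68%, 95% and 99.7%.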

Standard Deviation
- Standard deviation is a measure of the amount of variation in a set of values
- A low standard deviation indicates that the values tend to be close to the mean
- A high standard deviation indicates that the values are spread out over a wider range
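A small sketch of how the mean, variance and standard deviation are computed. The data set is hypothetical, and because the slides do not say whether they use the population or the sample formula, both are shown.

```python
import statistics

# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = statistics.mean(data)

# Population formulas divide the squared deviations by n.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

# Sample formulas divide by n - 1 instead.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

print("mean:", mean_value)
print("population variance and SD:", population_variance, population_sd)
print("sample variance and SD:", round(sample_variance, 3), round(sample_sd, 3))
```

For this data the mean is 5, the population variance is 4 and the population standard deviation is 2; the sample versions are slightly larger because of the n - 1 divisor.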

Example 1

Mean

Variance

Standard Deviation

Example 2

Solution
Mean = 27, Variance = 24.86, SD = 4.96

Solution
Mean = 107, Variance = 886.9, SD = 29.9

Introduction to Standardization
Standardization is a scaling technique in which the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes 0 and the resulting distribution has a standard deviation of 1. Standard scores are most commonly called z-scores.
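The effect of standardization can be seen in a short sketch. The `standardize` helper and the score values are assumptions for this example, and the population standard deviation is used as the divisor.

```python
import statistics


def standardize(values):
    """Return the z-score of every value: (value - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize values that are all identical")
    return [(v - mu) / sigma for v in values]


# Hypothetical scores, chosen only for illustration.
scores = [12, 15, 21, 24, 33]
z_scores = standardize(scores)

print([round(z, 3) for z in z_scores])
print("mean of z-scores:", round(statistics.mean(z_scores), 6))
print("SD of z-scores:", round(statistics.pstdev(z_scores), 6))
```

Whatever the original scale, the standardized values end up with a mean of 0 and a standard deviation of 1, up to floating-point rounding.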

Standard Normal Probability Distribution in Excel
The NORMDIST function is available under Excel Statistical functions. It returns the normal distribution for the specified mean and standard deviation.
=NORMDIST(x, mean, standard_dev, cumulative)
1. X (required argument) – This is the value for which we wish to calculate the distribution.
2. Mean (required argument) – The arithmetic mean of the distribution.
3. Standard_dev (required argument) – This is the standard deviation of the distribution.
4. Cumulative (required argument) – This is a logical value. It specifies the type of distribution to be used: TRUE (Cumulative Normal Distribution Function) or FALSE (Normal Probability Density Function).
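For readers working outside Excel, the same two quantities can be obtained in Python. This is a rough equivalent rather than Excel's own implementation: the wrapper name `normdist` and the argument values are assumptions for this example, built on `statistics.NormalDist` from the standard library (Python 3.8 or later).

```python
from statistics import NormalDist


def normdist(x, mean, standard_dev, cumulative):
    """Mimic =NORMDIST(x, mean, standard_dev, cumulative) for illustration."""
    dist = NormalDist(mu=mean, sigma=standard_dev)
    # TRUE -> cumulative distribution function, FALSE -> probability density function
    return dist.cdf(x) if cumulative else dist.pdf(x)


# Hypothetical inputs, chosen only for illustration.
cumulative_probability = normdist(42, 40, 1.5, True)
density = normdist(42, 40, 1.5, False)

print(f"cumulative: {cumulative_probability:.4f}")
print(f"density:    {density:.4f}")
```

The cumulative value is a probability (here about 0.909), while the density value (about 0.109) is the height of the bell curve at x and is not itself a probability.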

Example

If we wish to calculate the probability density function for the data above, we use NORMDIST with the cumulative argument set to FALSE.


STANDARDIZE Z-Score Function
The STANDARDIZE function is available under Excel Statistical functions. It returns a normalized value (z-score) based on the mean and standard deviation.
=STANDARDIZE(x, mean, standard_dev)
The STANDARDIZE function uses the following arguments:
1. X (required argument) – This is the value that we want to normalize.
2. Mean (required argument) – The arithmetic mean of the distribution.
3. Standard_dev (required argument) – This is the standard deviation of the distribution.

Example


Using z-Scores to find a Probability
Example: The mean score for the population is 21, and the standard deviation is 5. How will you determine the probability that a score is
- higher than 30
- between 23 and 27
- between 15 and 20
- less than 20

Standardization z score

higher than 30

between 23 and 27

between 15 and 20

less than 20
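The four cases of this example can be worked through in code. The sketch below assumes the scores are normally distributed with the stated mean of 21 and standard deviation of 5; each score is converted to a z-score and looked up in the standard normal distribution via `statistics.NormalDist`. The variable names are assumptions for this example.

```python
from statistics import NormalDist

MEAN, SD = 21, 5              # population mean and standard deviation from the example
standard_normal = NormalDist()  # mean 0, standard deviation 1


def z_score(x):
    """Standardize a raw score."""
    return (x - MEAN) / SD


def prob_below(x):
    """P(score < x), read from the standard normal CDF at the z-score of x."""
    return standard_normal.cdf(z_score(x))


p_above_30 = 1 - prob_below(30)                 # z = 1.8
p_23_to_27 = prob_below(27) - prob_below(23)    # z from 0.4 to 1.2
p_15_to_20 = prob_below(20) - prob_below(15)    # z from -1.2 to -0.2
p_below_20 = prob_below(20)                     # z = -0.2

for label, p in [("higher than 30", p_above_30),
                 ("between 23 and 27", p_23_to_27),
                 ("between 15 and 20", p_15_to_20),
                 ("less than 20", p_below_20)]:
    print(f"{label}: {p:.4f}")
```

Under the normality assumption the probabilities are roughly 3.6%, 23.0%, 30.6% and 42.1% respectively.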

Central Limit Theorem
“Given a dataset with an unknown distribution (it could be uniform, binomial or completely random), the sample means will approximate the normal distribution.”
The central limit theorem shows how the distribution of sample means approaches the normal distribution as the sample size gets larger.
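A small simulation can make the theorem concrete. The sketch below draws repeated samples from a uniform distribution, which is far from bell-shaped, and looks at how the sample means behave. The sample size, number of samples and random seed are arbitrary choices for this example.

```python
import math
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
SAMPLE_SIZE = 30
NUM_SAMPLES = 2000

# Each entry is the mean of one sample drawn from the uniform distribution on [0, 1).
sample_means = [
    statistics.mean([rng.random() for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

center = statistics.mean(sample_means)
spread = statistics.pstdev(sample_means)

# Theory for uniform(0, 1): mean 0.5, standard deviation sqrt(1/12).
expected_spread = math.sqrt(1 / 12) / math.sqrt(SAMPLE_SIZE)

# For a normal distribution, about 68% of values lie within one SD of the mean.
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / NUM_SAMPLES

print(f"mean of sample means: {center:.4f} (theory: 0.5)")
print(f"SD of sample means:   {spread:.4f} (theory: {expected_spread:.4f})")
print(f"share within one SD:  {within_one_sd:.3f}")
```

Although the individual values are uniform, the sample means cluster around 0.5 with a spread close to the theoretical value, and the share within one standard deviation is close to the 68% expected of a normal distribution.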

Algebra with Gaussians
The Gauss elimination method is used to solve a system of linear equations. Gaussian elimination is the name of the method for performing the three types of matrix row operations:
- Interchanging two rows
- Multiplying a row by a constant (any constant except 0)
- Adding a row to another row
This technique is also called row reduction and it consists of two stages:
- Forward elimination
- Back substitution

Algebra with Gaussians
The forward elimination step is the row reduction needed to simplify the matrix. The back substitution step substitutes the values already found back into the remaining equations to solve them.
Example: If we were to have the following system of linear equations containing three equations for three unknowns:

Algebra with Gaussians

Algebra with Gaussians Row reducing (applying the Gaussian elimination method to) the augmented matrix
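The two stages can be sketched in code. Since the system from the slides is not reproduced in the text, the three equations below are a hypothetical example, and the function name `gaussian_elimination` is an assumption. The sketch also swaps rows to pick the largest available pivot, a common safeguard that the slides do not mention.

```python
def gaussian_elimination(a, b):
    """Solve a square system a * x = b by forward elimination and back substitution."""
    n = len(b)
    # Augmented matrix [A | b]
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    # Stage 1: forward elimination to upper triangular form
    for col in range(n):
        # Row operation: interchange rows so the largest pivot is on the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system has no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        # Row operation: subtract a multiple of the pivot row from each row below it
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]

    # Stage 2: back substitution, from the last unknown to the first
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


# Hypothetical system (not the one from the slides):
#    2x +  y -  z =   8
#   -3x -  y + 2z = -11
#   -2x +  y + 2z =  -3
coefficients = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
constants = [8, -11, -3]
solution = gaussian_elimination(coefficients, constants)
print([round(v, 6) for v in solution])
```

For this system the solution is x = 2, y = 3, z = -1.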


Markowitz Portfolio Optimization
Markowitz Portfolio Optimization assists in the selection of the most efficient portfolio by analyzing various possible portfolios of the given securities. It is also known as the mean-variance model.
Rp = Σ (i = 1 to n) Xi Ri
where
- Rp = the expected return of the portfolio
- Xi = the proportion of the total portfolio invested in security i
- Ri = the expected return of security i
- n = the total number of securities in the portfolio

Terminologies
- Portfolio --- a collection of investments
- Expected risk --- the total amount of money that can be lost
- Expected return --- future income from invested capital
- Portfolio effect --- a portfolio that will reduce the total risk of investment
- Portfolio manager --- project manager
- Efficient portfolio --- provides the lowest expected risk for a given expected return, or the greatest expected return for a given level of risk

Markowitz Portfolio Optimization - Approach
- According to the theory, the effect of purchasing one security on the effect of purchasing another security is taken into consideration
- The results are then evaluated and used to minimize risk

Example

Security    Expected Return Ri (%)    Proportion Xi (%)
1           10                        25
2           20                        75

The return on the portfolio on combining the two securities will be
Rp = R1X1 + R2X2
Rp = 0.10(0.25) + 0.20(0.75)
Rp = 0.175 = 17.5%
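The same calculation can be written as a short function that works for any number of securities. The function name `portfolio_return` and the check that the proportions sum to 1 are assumptions added for this example.

```python
def portfolio_return(expected_returns, proportions):
    """Expected portfolio return: the sum of Ri * Xi over all securities."""
    if len(expected_returns) != len(proportions):
        raise ValueError("need one proportion per security")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return sum(r * x for r, x in zip(expected_returns, proportions))


# Values from the example: returns of 10% and 20%, held in proportions 25% and 75%.
rp = portfolio_return([0.10, 0.20], [0.25, 0.75])
print(f"Rp = {rp:.1%}")
```

This reproduces the 17.5% computed by hand above.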

Advantages
- Holding multiple securities is believed to be less risky than having only one investment in a person's portfolio
- When multiple stocks are held in a portfolio and they have negative correlation, risk can be greatly reduced because the gain on one can offset the loss on the other
- The effect of holding multiple securities can also be studied when one security is riskier than the other
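The effect of correlation on risk can be illustrated with the standard two-security variance formula, which the slides do not state explicitly: variance = (w1·σ1)² + (w2·σ2)² + 2·w1·w2·ρ·σ1·σ2. All numbers below are hypothetical, and the function name `portfolio_sd` is an assumption for this example.

```python
import math


def portfolio_sd(w1, sd1, sd2, rho):
    """Standard deviation of a two-security portfolio with weights w1 and 1 - w1."""
    w2 = 1.0 - w1
    variance = (w1 * sd1) ** 2 + (w2 * sd2) ** 2 + 2 * w1 * w2 * rho * sd1 * sd2
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))


# Hypothetical securities: standard deviations of 20% and 10%, one third held in the first.
w1, sd1, sd2 = 1 / 3, 0.20, 0.10
risk_by_correlation = {rho: portfolio_sd(w1, sd1, sd2, rho) for rho in (1.0, 0.0, -1.0)}

for rho, risk in risk_by_correlation.items():
    print(f"correlation {rho:+.1f}: portfolio SD = {risk:.4f}")
```

With these particular weights, perfectly negative correlation drives the portfolio risk to essentially zero, while perfectly positive correlation gives no diversification benefit at all. Real securities are rarely perfectly negatively correlated, so in practice risk is reduced rather than eliminated.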

Standardizing x and y Coordinates for Linear Regression
- Standardization brings coordinates that deviate widely from the normal range of values onto a common scale
- After standardization, the coordinates have a mean of zero and unit standard deviation
Mean = 0
SD = 1

Example 1

Correlation coefficient formula

Formula

Correlation coefficient

Regression equation
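A sketch of how the correlation coefficient and the regression equation relate once x and y are standardized. The data values and helper names are assumptions for this example; the Pearson formula used is r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²).

```python
import math
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

zx, zy = standardize(xs), standardize(ys)

# Least-squares line fitted to the standardized coordinates.
slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
intercept = statistics.mean(zy) - slope * statistics.mean(zx)

print(f"correlation r = {pearson_r(xs, ys):.4f}")
print(f"standardized line: y = {slope:.4f} x + {intercept:.4f}")
```

On standardized coordinates the intercept is zero and the slope equals the correlation coefficient, so the regression equation reduces to y = r·x.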

Example 2

Example

Standardization Simplifies Linear Regression
To find the equation of the line for the standardized data, the residual sum of squares (SSR) should be minimized.

Residual Sum of Squares
SSR = residual sum of squares
SSR = Σ (yi − ŷi)²
ŷi = m xi + c
where m = slope and c = intercept

Modeling Error in Linear Regression
- The coefficient of determination, or R², is a measure that provides information about the goodness of fit of a model
- In the context of regression, it is a statistical measure of how well the regression line approximates the actual data
- It is important when a statistical model is used either to predict future outcomes or to test hypotheses

R² Measure (coefficient of determination)
- R-squared estimates how much of the movement in a dependent variable is explained by the movement of the independent variable(s)
- R-squared gives the percentage of the variation in y explained by the x-variables
- The range is 0 to 1, i.e. 0% to 100% of the variation in y can be explained by the x-variables
R² = 1 − SSR / SST
where
- SSR = residual sum of squares = Σ (yi − ŷi)²
- SST = total sum of squares = Σ (yi − ȳ)²
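The formula can be followed step by step in a short sketch. The data are hypothetical, and the predictions use the line ŷ = 0.6x + 2.2, which is the least-squares line for this particular data set; the function name `r_squared` is an assumption for this example.

```python
def r_squared(ys, y_hats):
    """Return (R², SSR, SST) for observed values ys and predictions y_hats."""
    y_bar = sum(ys) / len(ys)
    ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # residual sum of squares
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total sum of squares
    return 1 - ssr / sst, ssr, sst


# Hypothetical data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
y_hats = [0.6 * x + 2.2 for x in xs]   # predictions from the fitted line

r2, ssr, sst = r_squared(ys, y_hats)
print(f"SSR = {ssr:.2f}, SST = {sst:.2f}, R² = {r2:.2f}")
```

Here SSR is 2.4 and SST is 6, so R² is 0.6: the line explains 60% of the variation in y.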

Example

Information Gain from Linear Regression
- Information gain is calculated by comparing the entropy of the dataset before and after a transformation
- The entropy of a random variable Y is written H(Y) and measures the uncertainty about that random variable
- Information gain provides a way to use entropy to calculate how a change to the dataset affects the purity of the dataset

Information Gain from Linear Regression
For example, we may wish to evaluate the impact on purity of splitting a dataset by a random variable X with a range of values. Then
IG(Y, X) = H(Y) – H(Y | X)
where
- IG(Y, X) is the information gain for the dataset Y from the variable X
- H(Y) is the entropy of the dataset before any change
- H(Y | X) is the conditional entropy of the dataset given the variable X
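The formula IG(Y, X) = H(Y) – H(Y | X) can be illustrated on a small discrete dataset. The labels and attribute values below are hypothetical, the helper names are assumptions, and entropy is measured in bits (base-2 logarithm).

```python
import math
from collections import Counter


def entropy(labels):
    """H(Y): Shannon entropy, in bits, of a list of labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def conditional_entropy(ys, xs):
    """H(Y | X): entropy of each group of ys sharing an x value, weighted by group size."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    n = len(ys)
    return sum(len(group) / n * entropy(group) for group in groups.values())


def information_gain(ys, xs):
    """IG(Y, X) = H(Y) - H(Y | X)."""
    return entropy(ys) - conditional_entropy(ys, xs)


# Hypothetical labels Y and two candidate variables X.
y = [1, 1, 0, 0, 1, 0]
x_informative = ["a", "a", "b", "b", "a", "b"]   # separates the labels perfectly
x_weak = ["a", "b", "a", "b", "a", "b"]          # tells us very little about Y

gain_informative = information_gain(y, x_informative)
gain_weak = information_gain(y, x_weak)
print(f"H(Y) = {entropy(y):.3f} bits")
print(f"IG with informative X = {gain_informative:.3f} bits")
print(f"IG with weak X        = {gain_weak:.3f} bits")
```

A variable that splits the data into pure groups removes all of the uncertainty in Y, so its information gain equals H(Y); a variable that leaves the groups mixed gains almost nothing.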

Under the Gaussian distribution, IG(Y, X) can be written as
IG(Y, X) = H(Y) – H(Y | X)
         = H(Ø(0, 1)) – H(Ø(0, ))
         = (Signal + Noise) – Noise
IG(Y, X) = Signal

Thank You
