UNIT 3.pptx

vijayannamratha · 11 views · 70 slides · Oct 10, 2024



Slide Content

Unit 3: Data Science Components


Linear regression Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, or product price. The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x).

Linear regression (figure)

Linear regression y = mx + c + ε, where y = dependent variable (target variable), x = independent variable (predictor variable), c = y-intercept of the line, m = slope, ε = error.
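The slope m and intercept c in the equation above are typically estimated by least squares. A minimal sketch in Python, using hypothetical data (the x, y values are illustrative, not from the slides):

```python
import statistics

# Hypothetical sample data: x = predictor, y = target (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

# Closed-form least-squares estimates for slope m and intercept c.
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
c = mean_y - m * mean_x

print(round(m, 2), round(c, 2))  # slope ~2, intercept ~0.1
```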

Logistic regression Logistic regression is one of the most popular machine learning algorithms; it comes under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables. The outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. But instead of giving an exact value of 0 or 1, it gives probabilistic values that lie between 0 and 1.

Logistic regression Logistic regression is used for solving classification problems. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1). The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a dog is a puppy or not based on its weight. Logistic regression is based on the concept of maximum likelihood estimation: under this estimation, the observed data should be most probable.
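The "S"-shaped curve described above is the logistic (sigmoid) function; a minimal sketch in Python:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# The linear score m*x + c is squashed into (0, 1); thresholding the
# probability (commonly at 0.5) yields the 0/1 class label.
print(sigmoid(0))        # 0.5, the midpoint of the S-curve
print(sigmoid(4) > 0.5)  # True: a large positive score is classified as 1
```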

Logistic regression can provide probabilities and classify new data using both continuous and discrete datasets.

Linear vs Logistic Regression

Introducing the Gaussian Carl Friedrich Gauss, ranked among history's most influential mathematicians, discovered the normal distribution. It is also called the Gaussian distribution, and often the bell curve, because the graph of its probability density looks like a bell. The normal distribution is a probability distribution that is symmetric about the mean, showing that data near the mean occur more frequently than data far from the mean. Examples: heights of people, measurement errors, blood pressure, test marks, IQ scores, salaries.

Properties of the normal distribution (mean, median, mode and standard deviation) The mean, median and mode are all equal. The curve is symmetric at the center. Exactly half of the values are to the left of the center and exactly half are to the right.

Gaussian Distribution 68% of the data falls within one standard deviation of the mean; 95% falls within two standard deviations; 99.7% falls within three standard deviations.
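The 68-95-99.7 rule above can be checked numerically via the error function; a small Python sketch:

```python
import math

def within(k):
    """P(|X - mu| < k*sigma) for any normal distribution."""
    return math.erf(k / math.sqrt(2))

print(round(within(1), 4))  # 0.6827 -> ~68%
print(round(within(2), 4))  # 0.9545 -> ~95%
print(round(within(3), 4))  # 0.9973 -> ~99.7%
```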

Standard Deviation Standard deviation is a measure of the amount of variation. A low standard deviation indicates that the values tend to be close to the mean; a high standard deviation indicates that the values are spread out over a wider range.

Example 1

Mean

Variance

Standard Deviation

Example 2

Solution Mean = 27, Variance = 24.86, SD = 4.96

Solution Mean = 107, Variance = 886.9, SD = 29.9
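The mean, variance and standard deviation computations in the solutions above follow the same recipe; a sketch in Python with hypothetical data (the slides' own datasets are in the figures, not the text):

```python
import math
import statistics

# Hypothetical dataset (illustrative only).
data = [22, 25, 27, 29, 32]

mean = statistics.mean(data)      # arithmetic mean
var = statistics.pvariance(data)  # population variance: mean squared deviation
sd = math.sqrt(var)               # standard deviation: square root of variance

print(mean, var, round(sd, 2))  # 27 11.6 3.41
```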

Introduction to Standardization Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation: the mean of the attribute becomes 0 and the resultant distribution has a unit (1) standard deviation. Standard scores are most commonly called z-scores.

Standard Normal Probability Distribution in Excel The NORMDIST function is available under Excel Statistical functions. It returns the normal distribution for the specified mean and standard deviation. =NORMDIST(x, mean, standard_dev, cumulative) 1. X (required argument) – the value for which we wish to calculate the distribution. 2. Mean (required argument) – the arithmetic mean of the distribution. 3. Standard_dev (required argument) – the standard deviation of the distribution. 4. Cumulative (required argument) – a logical value specifying the type of distribution to be used: TRUE (cumulative normal distribution function) or FALSE (normal probability density function).
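As a cross-check outside Excel, NORMDIST can be reproduced with the error function; a hedged sketch in Python (the function name normdist here is my own, mirroring the Excel arguments):

```python
import math

def normdist(x, mean, standard_dev, cumulative):
    """Python equivalent of Excel's NORMDIST(x, mean, standard_dev, cumulative)."""
    if cumulative:
        # TRUE: cumulative normal distribution function (CDF)
        return 0.5 * (1 + math.erf((x - mean) / (standard_dev * math.sqrt(2))))
    # FALSE: normal probability density function (PDF)
    return (math.exp(-((x - mean) ** 2) / (2 * standard_dev ** 2))
            / (standard_dev * math.sqrt(2 * math.pi)))

print(round(normdist(0, 0, 1, True), 4))   # 0.5
print(round(normdist(0, 0, 1, False), 4))  # 0.3989
```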

Example

If we wish to calculate the probability density function for the data above, the formula to use is =NORMDIST with the cumulative argument set to FALSE. We get:

Continued…

STANDARDIZE Z-Score Function The STANDARDIZE Function is available under Excel Statistical functions. It will return a normalized value (z-score) based on the mean and standard deviation. =STANDARDIZE(x, mean, standard_dev ) The STANDARDIZE function uses the following arguments: 1. X (required argument) – This is the value that we want to normalize. 2. Mean (required argument) – The arithmetic mean of the distribution. 3. Standard_dev (required argument) – This is the standard deviation of the distribution.
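STANDARDIZE is a one-line computation; a sketch in Python, applied to a score of 30 from a distribution with mean 21 and standard deviation 5 (the values used in the example that follows):

```python
def standardize(x, mean, standard_dev):
    """Python equivalent of Excel's STANDARDIZE(x, mean, standard_dev)."""
    return (x - mean) / standard_dev

print(standardize(30, 21, 5))  # 1.8
```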

Example

Continued…

Using z-Scores to Find a Probability Example: the mean score for the population is 21, and the standard deviation is 5. How will you determine the probability that a score falls: higher than 30; between 23 and 27; between 15 and 20; less than 20?

Standardization z-score

Higher than 30

Between 23 and 27

Between 15 and 20

Less than 20
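The four probabilities asked for above can be computed from the z-scores and the standard normal CDF; a sketch in Python:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sd = 21, 5               # population mean and standard deviation
z = lambda x: (x - mu) / sd  # z-score of a raw score x

print(round(1 - phi(z(30)), 4))           # P(X > 30)      ~ 0.0359
print(round(phi(z(27)) - phi(z(23)), 4))  # P(23 < X < 27) ~ 0.2295
print(round(phi(z(20)) - phi(z(15)), 4))  # P(15 < X < 20) ~ 0.3057
print(round(phi(z(20)), 4))               # P(X < 20)      ~ 0.4207
```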

Central Limit Theorem "Given a dataset with unknown distribution (it could be uniform, binomial or completely random), the sample means will approximate the normal distribution." The central limit theorem shows how the mean of a sample distribution approaches the normal distribution as the sample size gets larger.
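The theorem can be illustrated by simulation; a sketch in Python that draws sample means from a uniform (clearly non-normal) population:

```python
import random
import statistics

random.seed(0)  # reproducible draws

def sample_means(n, trials=2000):
    """Means of `trials` samples, each of size n, drawn from Uniform(0, 1)."""
    return [statistics.mean(random.uniform(0, 1) for _ in range(n))
            for _ in range(trials)]

means = sample_means(30)
# The means cluster around the population mean 0.5, with spread shrinking
# like sigma / sqrt(n) = sqrt(1/12) / sqrt(30) ~ 0.053.
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 3))
```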

Algebra with Gaussians The Gauss elimination method is used to solve a system of linear equations. Gaussian elimination is the name of the method that performs the three types of matrix row operations: interchanging two rows; multiplying a row by a constant (any nonzero constant); adding a multiple of one row to another row. This technique is also called row reduction, and it consists of two stages: forward elimination and back substitution.

Algebra with Gaussians The forward elimination step refers to the row reduction needed to simplify the matrix. The back substitution step refers to substituting the values back to solve the equations. Example: suppose we have the following system of three linear equations in three unknowns:

Algebra with Gaussians

Algebra with Gaussians Row reducing (applying the Gaussian elimination method to) the augmented matrix
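The two stages can be sketched as a small Python routine (the example system below is hypothetical; the slide's own system is shown in its figure):

```python
def gauss_solve(A, b):
    """Solve A x = b by forward elimination (with row interchanges) and
    back substitution."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    # Stage 1: forward elimination.
    for col in range(n):
        # Interchange rows so the pivot is the largest entry in the column.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]  # add a multiple of the pivot row
    # Stage 2: back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# x + y + z = 6,  2y + 5z = -4,  2x + 5y - z = 27  ->  x = 5, y = 3, z = -2
sol = gauss_solve([[1, 1, 1], [0, 2, 5], [2, 5, -1]], [6, -4, 27])
print([round(v, 6) for v in sol])  # [5.0, 3.0, -2.0]
```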


Markowitz Portfolio Optimization Markowitz portfolio optimization assists in the selection of the most efficient portfolio by analyzing the various possible portfolios of the given securities. Also known as the mean-variance model. Rp = ∑ Xi Ri (i = 1 … n), where Rp = the expected return of the portfolio, Xi = proportion of the total portfolio invested in security i, Ri = expected return of security i, n = total number of securities in the portfolio.

Terminologies Portfolio --- a collection of investments. Expected risk --- the total amount of money that can be lost. Expected return --- future income from invested capital. Portfolio effect --- a portfolio that will reduce the total risk of the investment. Portfolio manager --- project manager. Efficient portfolio --- provides the lowest expected risk for a given expected return, or the greatest expected return for a given level of risk.

Markowitz Portfolio Optimization - Approach According to the theory, the effects of one security purchase over the effects of another security purchase are taken into consideration. The results are evaluated and help in minimizing risk.

Example Security 1: expected return Ri = 10%, proportion Xi = 25%. Security 2: expected return Ri = 20%, proportion Xi = 75%. The return on the portfolio from combining the two securities will be Rp = R1X1 + R2X2 = 0.10(0.25) + 0.20(0.75) = 17.5%
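The same computation as a proportion-weighted sum in Python:

```python
# Expected portfolio return: proportion-weighted sum of security returns.
returns = [0.10, 0.20]  # R_i for securities 1 and 2
weights = [0.25, 0.75]  # X_i, proportions of the total portfolio

rp = sum(r * x for r, x in zip(returns, weights))
print(f"{rp:.1%}")  # 17.5%
```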

Advantages It is believed that holding multiple securities is less risky than having only one investment in a person's portfolio. When multiple stocks in a portfolio have negative correlation, risk can be greatly reduced because the gain on one can offset the loss on the other. The effect of multiple securities can also be studied when one security is more risky than another.

Standardizing x and y Coordinates for Linear Regression Standardize the set of coordinates that deviate from the normal range of values. Standardization makes the mean of the coordinates zero and the standard deviation one: Mean = 0, SD = 1.

Example 1

Coefficient of correlation formula

Formula

Coefficient of correlation

Regression equation

Regression equation

Example 2

Example

Standardization Simplifies Linear Regression To simplify the equation of the line after standardization, the residual sum of squares (SSR) should be minimized.

Residual Sum of Squares SSR = ∑(yᵢ − ŷᵢ)², where ŷ = mx + c, m = slope and c = intercept

Modeling Error in Linear Regression The coefficient of determination, R², is a measure that provides information about the goodness of fit of a model. In the context of regression, it is a statistical measure of how well the regression line approximates the actual data. It is important when a statistical model is used either to predict future outcomes or in hypothesis testing.

R² Measure (coefficient of determination) R-squared estimates the relationship between movements of a dependent variable and the movements of the independent variables: it gives the percentage of variation in y explained by the x-variables. The range is 0 to 1, i.e. 0% to 100% of the variation in y can be explained by the x-variables. R² = 1 − SSR/SST, where SSR = ∑(yᵢ − ŷᵢ)² is the residual sum of squares and SST = ∑(yᵢ − ȳ)² is the total sum of squares.
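SSR, SST and R² from the two formulas above, sketched in Python with hypothetical observed values and model predictions:

```python
import statistics

# Hypothetical observed values y and model predictions y_hat (illustrative).
y     = [3.0, 5.0, 7.1, 8.9, 11.0]
y_hat = [3.1, 5.0, 7.0, 9.0, 10.9]

y_bar = statistics.mean(y)
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
r2 = 1 - ssr / sst

print(round(r2, 3))  # ~0.999: the model explains ~99.9% of the variation in y
```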

Example

Information Gain from Linear Regression Information gain is calculated by comparing the entropy of the dataset before and after a transformation. The entropy of a random variable Y is written H(Y); it quantifies the uncertainty about the random variable. Information gain provides a way to use entropy to calculate how a change to the dataset impacts its purity.

Information Gain from Linear Regression For example, we may wish to evaluate the impact on purity of splitting a dataset Y by a random variable X with a range of values: IG(Y, X) = H(Y) − H(Y | X), where IG(Y, X) is the information gain for dataset Y from variable X, H(Y) is the entropy of the dataset before any change, and H(Y | X) is the conditional entropy of the dataset given the variable X.
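For a discrete target the quantities in this formula can be computed directly; a sketch in Python with hypothetical labels (y and x below are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H in bits of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical class labels Y and a binary split variable X (illustrative).
y = [1, 1, 1, 0, 0, 0, 1, 0]
x = [0, 0, 0, 0, 1, 1, 1, 1]

h_y = entropy(y)  # entropy before the split
# Conditional entropy H(Y | X): entropy within each X-group, weighted by size.
h_y_given_x = sum(
    (len(group) / len(y)) * entropy(group)
    for group in ([yi for yi, xi in zip(y, x) if xi == v] for v in set(x))
)
ig = h_y - h_y_given_x  # information gain IG(Y, X)
print(round(h_y, 3), round(ig, 3))  # 1.0 0.189
```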

Under the Gaussian distribution, IG(Y, X) can be written as IG(Y, X) = H(Y) − H(Y | X) = H(N(0, 1)) − H(N(0, σ)), where σ is the standard deviation of the residuals. This is (Signal + Noise) − Noise, so IG(Y, X) = Signal.

Thank You