UNIT 3.pptx
vijayannamratha
70 slides
Oct 10, 2024
About This Presentation
notes
Size: 8.2 MB
Language: en
Slide Content
Unit 3 Data science components
Linear regression
- Linear regression is one of the simplest and most popular machine learning algorithms.
- It is a statistical method used for predictive analysis.
- It makes predictions for continuous (real-valued or numeric) variables such as sales, salary, age, or product price.
- It models a linear relationship between a dependent variable (y) and one or more independent variables (x).
Linear regression
y = mx + c + ε
- y = dependent variable (target variable)
- x = independent variable (predictor variable)
- m = slope of the line
- c = y-intercept of the line
- ε = error term
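The line equation y = mx + c + ε can be illustrated with a minimal least-squares sketch. The data values and the helper name `fit_line` are made up for illustration, and ordinary least squares is assumed as the fitting method.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means.
    c = y_bar - m * x_bar
    return m, c

# Hypothetical data lying close to the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, c = fit_line(xs, ys)
# The error term ε for each point is what the line does not explain.
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
print(f"m = {m:.3f}, c = {c:.3f}")
```

The residuals of a least-squares fit with an intercept sum to zero, which is a quick sanity check on the fitted values.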
Logistic regression
- Logistic regression is one of the most popular machine learning algorithms and belongs to the supervised learning techniques.
- It is used to predict a categorical dependent variable from a given set of independent variables.
- The outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, and so on.
- Instead of returning exactly 0 or 1, it returns a probability that lies between 0 and 1.
Logistic regression
- Logistic regression is used for solving classification problems.
- Instead of fitting a regression line, we fit an "S"-shaped logistic function whose output is bounded by two limiting values, 0 and 1.
- The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a dog is a puppy or not based on its weight.
- Logistic regression is based on the concept of maximum likelihood estimation: the parameters are chosen so that the observed data are the most probable.
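The "S"-shaped curve can be sketched as follows. The weight `w` and bias `b` are hypothetical values picked by hand for the puppy-by-weight example, not fitted by maximum likelihood, and the 0.5 threshold is an assumed default.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for a single feature x."""
    return sigmoid(w * x + b)

def classify(x, w, b, threshold=0.5):
    """Turn the probability into a 0/1 label using a threshold."""
    return 1 if predict_proba(x, w, b) >= threshold else 0

# Hypothetical parameters: heavier dogs are less likely to be puppies.
w, b = -0.8, 8.0

for weight_kg in (5, 10, 20):
    p = predict_proba(weight_kg, w, b)
    print(f"{weight_kg} kg -> P(puppy) = {p:.3f}, class = {classify(weight_kg, w, b)}")
```

The sketch only covers moderate inputs; for very large negative `z`, `math.exp(-z)` overflows and a numerically safer formulation would be needed.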
Logistic regression
- Logistic regression can provide probabilities and classify new data using both continuous and discrete datasets.
Linear vs. logistic regression
Introducing the Gaussian
- Carl Friedrich Gauss, ranked among "history's most influential mathematicians", is the mathematician after whom the normal distribution is named; it is therefore also called the Gaussian distribution.
- It is often called the bell curve because the graph of its probability density looks like a bell.
- The normal distribution is a probability distribution that is symmetric about the mean: data near the mean occur more frequently than data far from the mean.
- Examples: heights of people, measurement errors, blood pressure, test marks, IQ scores, salaries.
Properties of the normal distribution
- It is described by its mean, median, mode and standard deviation.
- The mean, median and mode are all equal.
- The curve is symmetric about its center.
- Exactly half of the values lie to the left of the center and exactly half lie to the right.
Gaussian distribution
- About 68% of the data falls within one standard deviation of the mean.
- About 95% of the data falls within two standard deviations of the mean.
- About 99.7% of the data falls within three standard deviations of the mean.
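The 68 / 95 / 99.7 figures can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `within_k_sd` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def within_k_sd(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: within_k_sd(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} SD: {p:.4f}")
```

Because every normal distribution is a shifted and scaled standard normal, the same proportions hold for any mean and standard deviation.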
Standard deviation
- Standard deviation is a measure of the amount of variation in a set of values.
- A low standard deviation indicates that the values tend to be close to the mean.
- A high standard deviation indicates that the values are spread out over a wider range.
Example 1
Mean
Variance
Standard Deviation
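As a minimal sketch of the mean, variance and standard deviation calculation, the block below uses a small made-up data set (not the data from the slides). It assumes the population formulas, which divide by n; the sample variance, which divides by n - 1, is shown for comparison.

```python
from statistics import fmean, pstdev, pvariance, variance

# Made-up values for illustration only.
data = [4, 8, 6, 5, 3, 7, 9]

mean = fmean(data)            # sum of the values divided by their count
pop_var = pvariance(data)     # average squared deviation from the mean (divides by n)
pop_sd = pstdev(data)         # square root of the population variance
sample_var = variance(data)   # same sum of squares, but divided by n - 1

print(f"mean = {mean}, variance = {pop_var}, SD = {pop_sd}")
print(f"sample variance = {sample_var:.3f}")
```

Whichever convention is used, the standard deviation is always the square root of the corresponding variance.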
Example 2
Solution
Mean = 27, Variance = 24.86, SD = √24.86 ≈ 4.99
Solution
Mean = 107, Variance = 886.9, SD = √886.9 ≈ 29.8
Introduction to standardization
- Standardization is a scaling technique in which the values are centered around the mean with a unit standard deviation.
- After standardization, the mean of the attribute becomes 0 and the resulting distribution has a standard deviation of 1.
- Standard scores are most commonly called z-scores.
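A z-score is computed as z = (x − mean) / standard deviation. The sketch below applies this to a hypothetical list of scores; the function name `standardize` and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return the z-score of every value: (value - mean) / standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (assumed convention)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical raw scores.
scores = [12, 15, 21, 24, 28]
z = standardize(scores)

print([round(v, 3) for v in z])
print(f"mean of z = {fmean(z):.3f}, SD of z = {pstdev(z):.3f}")
```

After the transformation the values keep their order and relative spacing, but their mean is 0 and their standard deviation is 1.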
Standard normal probability distribution in Excel
The NORMDIST function is available under Excel's Statistical functions. It returns the normal distribution for the specified mean and standard deviation.
=NORMDIST(x, mean, standard_dev, cumulative)
1. x (required argument): the value for which we wish to calculate the distribution.
2. mean (required argument): the arithmetic mean of the distribution.
3. standard_dev (required argument): the standard deviation of the distribution.
4. cumulative (required argument): a logical value that specifies the type of distribution to be used: TRUE (cumulative normal distribution function) or FALSE (normal probability density function).
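For readers without Excel, the behavior of the four NORMDIST arguments can be approximated in Python. This is a sketch built on `statistics.NormalDist`; the wrapper `normdist` is a hypothetical helper, not an Excel or Python library function, and the inputs are made up.

```python
from statistics import NormalDist

def normdist(x, mean, standard_dev, cumulative):
    """Mimic the argument layout of Excel's NORMDIST for illustration."""
    dist = NormalDist(mu=mean, sigma=standard_dev)
    # cumulative=True  -> cumulative distribution function, P(X <= x)
    # cumulative=False -> probability density function at x
    return dist.cdf(x) if cumulative else dist.pdf(x)

# Hypothetical inputs: value 52, mean 50, standard deviation 4.
print(f"CDF: {normdist(52, 50, 4, True):.4f}")
print(f"PDF: {normdist(52, 50, 4, False):.4f}")
```

The cumulative form answers "what fraction of values fall at or below x", while the density form gives the height of the bell curve at x.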
Example
To calculate the probability density function for the data above, NORMDIST is used with the cumulative argument set to FALSE.
STANDARDIZE z-score function
The STANDARDIZE function is available under Excel's Statistical functions. It returns a normalized value (z-score) based on the mean and standard deviation, computed as (x − mean) / standard_dev.
=STANDARDIZE(x, mean, standard_dev)
The STANDARDIZE function uses the following arguments:
1. x (required argument): the value that we want to normalize.
2. mean (required argument): the arithmetic mean of the distribution.
3. standard_dev (required argument): the standard deviation of the distribution.
Example
Using z-scores to find a probability
Example: The mean score for the population is 21 and the standard deviation is 5. Determine the probability that a score falls:
- higher than 30
- between 23 and 27
- between 15 and 20
- less than 20
Standardization: z-score
Higher than 30
Between 23 and 27
Between 15 and 20
Less than 20
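The four probabilities for the example with mean 21 and standard deviation 5 can be worked out as sketched below. The sketch assumes the scores are normally distributed; the helper name `z_score` is made up for illustration.

```python
from statistics import NormalDist

mean, sd = 21, 5
z = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_score(x):
    """Convert a raw score to a z-score."""
    return (x - mean) / sd

# Higher than 30: area to the right of z = 1.8.
p_above_30 = 1 - z.cdf(z_score(30))
# Between 23 and 27: area between z = 0.4 and z = 1.2.
p_23_to_27 = z.cdf(z_score(27)) - z.cdf(z_score(23))
# Between 15 and 20: area between z = -1.2 and z = -0.2.
p_15_to_20 = z.cdf(z_score(20)) - z.cdf(z_score(15))
# Less than 20: area to the left of z = -0.2.
p_below_20 = z.cdf(z_score(20))

for label, p in [("> 30", p_above_30), ("23 to 27", p_23_to_27),
                 ("15 to 20", p_15_to_20), ("< 20", p_below_20)]:
    print(f"P({label}) = {p:.4f}")
```

Each probability is an area under the standard normal curve, so an interval probability is the difference of two cumulative values.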
Central limit theorem
"Given a dataset with unknown distribution (it could be uniform, binomial or completely random), the sample means will approximate the normal distribution."
The central limit theorem describes how the distribution of sample means approaches the normal distribution as the sample size gets larger.
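A small simulation can make the theorem concrete. This sketch assumes a uniform source distribution on [0, 1), a sample size of 30 and 5000 repeated samples; all of these are arbitrary illustrative choices, and the seed is fixed only to make the run repeatable.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so repeated runs give the same numbers

def sample_means(sample_size, n_samples=5000):
    """Draw n_samples samples from Uniform(0, 1) and return the mean of each."""
    return [fmean([random.random() for _ in range(sample_size)])
            for _ in range(n_samples)]

means = sample_means(sample_size=30)

center = fmean(means)
spread = pstdev(means)
# For a normal distribution, roughly 68% of values lie within one SD of the center.
share_within_1_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means = {center:.4f}")
print(f"SD of sample means   = {spread:.4f}")
print(f"share within 1 SD    = {share_within_1_sd:.3f}")
```

Although the individual values are uniformly distributed, the sample means cluster around 0.5 in a bell shape, with a spread close to the source standard deviation divided by the square root of the sample size.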
Algebra with Gaussians
- The Gauss elimination method is used to solve a system of linear equations.
- Gaussian elimination is the name of the method that performs the three types of matrix row operations:
  - interchanging two rows
  - multiplying a row by a nonzero constant
  - adding a row to another row
- This technique is also called row reduction and consists of two stages:
  - forward elimination
  - back substitution
Algebra with Gaussians
- The forward elimination step is the row reduction needed to simplify the matrix.
- The back substitution step substitutes the values already found back into the equations to solve for the remaining unknowns.
Example: Suppose we have the following system of three linear equations in three unknowns:
Algebra with Gaussians
Row reducing the augmented matrix (that is, applying the Gaussian elimination method to it)
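The two stages, forward elimination and back substitution, can be sketched in plain Python. The system of equations below is a hypothetical example chosen for illustration, not the one from the slides, and choosing the largest available pivot (partial pivoting) is an added assumption for numerical safety.

```python
def gaussian_elimination(a, b):
    """Solve the linear system a x = b by row reduction of the augmented matrix."""
    n = len(b)
    # Build the augmented matrix [A | b] on copies so the inputs stay untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    # Stage 1: forward elimination.
    for col in range(n):
        # Row operation: interchange rows so the largest entry becomes the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            # Row operation: subtract a multiple of the pivot row from this row.
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Stage 2: back substitution, starting from the last equation.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x

# Hypothetical system:
#    x +  y +  z =  6
#        2y + 5z = -4
#   2x + 5y -  z = 27
solution = gaussian_elimination([[1, 1, 1], [0, 2, 5], [2, 5, -1]], [6, -4, 27])
print([round(v, 6) for v in solution])
```

After forward elimination the matrix is upper triangular, so the last equation has a single unknown and each earlier equation adds only one more.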
Markowitz portfolio optimization
- Markowitz portfolio optimization assists in selecting the most efficient portfolio by analyzing various possible portfolios of the given securities.
- It is also known as the mean-variance model.
Rp = ∑ (i = 1 to n) Xi Ri
where
- Rp = the expected return of the portfolio
- Xi = the proportion of the total portfolio invested in security i
- Ri = the expected return of security i
- n = the total number of securities in the portfolio
Terminologies
- Portfolio: a collection of investments.
- Expected risk: the total amount of money that can be lost.
- Expected return: the future income from invested capital.
- Portfolio effect: the reduction in the total risk of investment achieved by holding a portfolio.
- Portfolio manager: the person responsible for managing the portfolio, in the role of a project manager.
- Efficient portfolio: a portfolio that provides the lowest expected risk for a given expected return, or the greatest expected return for a given level of risk.
Markowitz portfolio optimization: approach
- According to the theory, the effect of one security purchase on the effect of another security purchase is taken into consideration.
- The results are then evaluated and used to help minimize risk.
Example
Security   Expected return Ri (%)   Proportion Xi (%)
1          10                       25
2          20                       75
3          30                       80
The return on the portfolio from combining securities 1 and 2 is:
Rp = R1X1 + R2X2
Rp = 0.10(0.25) + 0.20(0.75)
Rp = 0.175 = 17.5%
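The portfolio return calculation is a weighted sum, which can be sketched as follows. The function name `portfolio_return` and the check that the proportions sum to 1 are assumptions added for illustration.

```python
def portfolio_return(returns, weights):
    """Expected portfolio return: the sum of each return times its proportion."""
    if len(returns) != len(weights):
        raise ValueError("each security needs exactly one proportion")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return sum(r * x for r, x in zip(returns, weights))

# Two securities: 10% return at 25% of the portfolio, 20% return at 75%.
rp = portfolio_return([0.10, 0.20], [0.25, 0.75])
print(f"Rp = {rp:.1%}")
```

The same function handles any number of securities, as long as the proportions cover the whole portfolio.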
Advantages
- Holding multiple securities is generally considered less risky than having only one investment in a portfolio.
- When a portfolio holds multiple stocks that are negatively correlated, risk can be substantially reduced (and, with perfect negative correlation, eliminated in principle), because the gain on one can offset the loss on the other.
- The effect of holding multiple securities can also be studied when one security is riskier than the other.
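The effect of correlation on risk can be illustrated with the standard two-security variance formula, σp² = X1²σ1² + X2²σ2² + 2·X1·X2·ρ·σ1·σ2. This formula is not stated in the slides and is brought in here as an assumption; the standard deviations and correlations below are hypothetical values.

```python
import math

def portfolio_sd(x1, sd1, sd2, rho):
    """Standard deviation of a two-security portfolio with proportion x1 in security 1."""
    x2 = 1.0 - x1
    var = (x1 * sd1) ** 2 + (x2 * sd2) ** 2 + 2 * x1 * x2 * rho * sd1 * sd2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(var, 0.0))

sd1, sd2 = 0.10, 0.20  # hypothetical risks of the two securities

# Perfect positive correlation: no diversification benefit.
risk_pos = portfolio_sd(0.5, sd1, sd2, rho=1.0)
# Zero correlation: some risk is diversified away.
risk_zero = portfolio_sd(0.5, sd1, sd2, rho=0.0)
# Perfect negative correlation with proportions sd2 / (sd1 + sd2): risk vanishes.
risk_neg = portfolio_sd(sd2 / (sd1 + sd2), sd1, sd2, rho=-1.0)

print(f"rho = +1: {risk_pos:.4f}")
print(f"rho =  0: {risk_zero:.4f}")
print(f"rho = -1: {risk_neg:.4f}")
```

Complete elimination of risk requires both perfect negative correlation and the right proportions; with weaker negative correlation the risk is reduced but not removed.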
Standardizing x and y coordinates for linear regression
- Standardization rescales a set of coordinates whose values deviate widely from the normal range of values.
- After standardization, the coordinates have a mean of zero and a unit standard deviation: Mean = 0, SD = 1.
Example 1
Correlation coefficient formula
Formula
Correlation coefficient
Regression equation
Regression equation
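The link between the correlation coefficient and the regression equation on standardized coordinates can be sketched as follows. The data are made up, Pearson's correlation coefficient is assumed, and the helper names are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def pearson_r(xs, ys):
    """Pearson correlation: the average product of paired z-scores."""
    return fmean([zx * zy for zx, zy in zip(standardize(xs), standardize(ys))])

def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return m, y_bar - m * x_bar

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
m_std, c_std = fit_line(standardize(xs), standardize(ys))

print(f"r = {r:.4f}")
print(f"standardized fit: slope = {m_std:.4f}, intercept = {c_std:.4f}")
```

On standardized coordinates the fitted slope equals the correlation coefficient and the intercept is zero, so the regression equation reduces to y = r·x.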
Example 2
Example
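Standardization simplifies linear regression
- To fit the equation of the line, the residual sum of squares (SSR) should be minimized.
- With standardized coordinates the fitted line passes through the origin, so the intercept is zero and only the slope has to be found.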
Residual sum of squares
SSR = residual sum of squares
SSR = ∑(yi − ŷi)²
ŷi = m·xi + c
where m = slope and c = intercept
Modeling error in linear regression
- The coefficient of determination, or R², is a measure that provides information about the goodness of fit of a model.
- In the context of regression, it is a statistical measure of how well the regression line approximates the actual data.
- It is important when a statistical model is used either to predict future outcomes or to test hypotheses.
R² measure (coefficient of determination)
- R-squared estimates how strongly movements in a dependent variable are related to movements in an independent variable.
- R-squared gives the percentage of the variation in y that is explained by the x-variables.
- The range is 0 to 1; that is, 0% to 100% of the variation in y can be explained by the x-variables.
R² = 1 − SSR / SST
where
- SSR = residual sum of squares, SSR = ∑(yi − ŷi)²
- SST = total sum of squares, SST = ∑(yi − ȳ)²
Example
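As a minimal sketch of the R² calculation, the block below fits a least-squares line to made-up data (not the data from the slides) and then computes SSR, SST and R². The helper names are hypothetical.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return m, y_bar - m * x_bar

def r_squared(xs, ys):
    """Coefficient of determination: 1 - SSR / SST."""
    m, c = fit_line(xs, ys)
    y_bar = fmean(ys)
    # SSR: squared gaps between observed values and the fitted line.
    ssr = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    # SST: squared gaps between observed values and their mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ssr / sst

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2 = r_squared(xs, ys)
print(f"R^2 = {r2:.2f}")
```

For this data the line explains 60% of the variation in y; the remaining 40% is left in the residuals.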
Information gain from linear regression
- Information gain is calculated by comparing the entropy of the dataset before and after a transformation.
- The entropy of a random variable Y is written H(Y) and describes the uncertainty about that random variable.
- Information gain uses entropy to calculate how a change to the dataset affects the purity of the dataset.
Information gain from linear regression
For example, to evaluate the impact on purity of splitting a dataset S by a random variable X with a range of values:
IG(Y, X) = H(Y) − H(Y | X)
where
- IG(Y, X) is the information gain for the dataset Y from the variable X
- H(Y) is the entropy of the dataset before any change
- H(Y | X) is the conditional entropy of the dataset given the variable X
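The formula IG(Y, X) = H(Y) − H(Y | X) can be sketched for a small discrete dataset. The labels and feature values are made up, entropy is measured in bits (base-2 logarithm) as an assumed convention, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) in bits for a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, feature):
    """H(Y | X): entropy of the labels within each feature group, weighted by group size."""
    n = len(labels)
    groups = {}
    for y, x in zip(labels, feature):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def information_gain(labels, feature):
    """IG(Y, X) = H(Y) - H(Y | X)."""
    return entropy(labels) - conditional_entropy(labels, feature)

# Made-up dataset: one feature separates the labels perfectly, the other not at all.
y = ["yes", "yes", "no", "no"]
x_informative = ["a", "a", "b", "b"]
x_uninformative = ["a", "b", "a", "b"]

ig_good = information_gain(y, x_informative)
ig_bad = information_gain(y, x_uninformative)
print(f"IG (informative feature)   = {ig_good:.3f} bits")
print(f"IG (uninformative feature) = {ig_bad:.3f} bits")
```

A split that produces pure groups removes all uncertainty, so the gain equals H(Y); a split that leaves every group as mixed as the original gains nothing.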
By the Gaussian distribution, IG(Y, X) can be written as:
IG(Y, X) = H(Y) − H(Y | X)
         = H(Ø(0, 1)) − H(Ø(0, ))
         = (Signal + Noise) − Noise
IG(Y, X) = Signal
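The Gaussian form can be explored numerically under stated assumptions. The sketch below uses the differential entropy of a normal distribution, 0.5·ln(2πe·σ²), and assumes that x and y are standardized so that the residual (noise) variance of the regression is 1 − r², where r is the correlation coefficient. The variance of the second Gaussian is not given in the text above, so 1 − r² is an assumption, not the source's stated value.

```python
import math

def gaussian_entropy(variance):
    """Differential entropy (in nats) of a normal distribution with the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

def information_gain_from_r(r):
    """IG(Y, X) = H(Y) - H(Y | X) for standardized Gaussian x and y."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    residual_variance = 1 - r ** 2  # ASSUMPTION: noise variance of the standardized fit
    return gaussian_entropy(1.0) - gaussian_entropy(residual_variance)

for r in (0.0, 0.5, 0.8):
    print(f"r = {r}: IG = {information_gain_from_r(r):.4f} nats")
```

Under these assumptions the gain simplifies to −0.5·ln(1 − r²): it is zero when x tells us nothing about y and grows as the correlation strengthens.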