DataScienceConcept_Kanchana_Weerasinghe.pptx


About This Presentation

I've developed a comprehensive handbook covering the core concepts of data science and the mathematics behind them. It's a culmination of my experience and knowledge in the field. I hope it will be a great starting point for beginners.


Slide Content

Data Science

Basics

Data types: ratio data has a meaningful zero (e.g., height); interval data has no meaningful zero (e.g., temperature in Celsius).

Standard deviation describes how close the values in a data set are to the mean: on average, the data points differ from the mean by one standard deviation.

Statistical inference is the process of drawing conclusions about an underlying population based on a sample, or subset, of the data. In most cases it is not practical to obtain all the measurements in a given population, so point estimators computed from the sample are used to estimate population parameters.

Z-score: a measure that indicates how many standard deviations a data point is from the mean of a dataset. Applications: Cross-group comparisons – z-scores allow the comparison of scores from different groups that may have different means, standard deviations, and distributions, for example comparing test scores from students in different schools or countries. Outlier detection – observations with z-scores significantly higher or lower than the typical range (usually z-scores less than -3 or greater than 3) are often regarded as outliers. Normalization of data – z-scores are used in statistical analysis to normalize data, ensuring that every datum has a comparable scale; this is useful in multivariate statistics where data on different scales are combined.
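A minimal Python sketch of the z-score idea; the synthetic data and the |z| > 3 cutoff are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=1000), 95.0)  # inject one extreme value

z_scores = (values - values.mean()) / values.std()   # standardize: mean 0, std 1

# Common rule of thumb: |z| > 3 marks potential outliers
# (a few points from the normal sample may also cross the threshold)
print("Flagged outliers:", values[np.abs(z_scores) > 3])
```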

The total area under the curve of any probability density function (pdf) is always equal to 1; the area under a region of the curve gives the probability of observing a value in that region.

Confidence interval: a range of values such that, with X% confidence, the range will contain the true unknown value of the parameter. The critical value used to build the interval depends on the degrees of freedom.

Sample size >= 30: use the z-statistic. Sample size < 30: use the t-statistic.
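A small illustration of the z-versus-t choice using scipy; the sample values are made up for the example:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.0])
mean, sem = sample.mean(), stats.sem(sample)     # sem = sample std / sqrt(n)

# Small sample (n < 30): t critical value with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
# Large-sample approximation: z critical value
z_crit = stats.norm.ppf(0.975)

print("95% t-interval:", (mean - t_crit * sem, mean + t_crit * sem))
print("95% z-interval:", (mean - z_crit * sem, mean + z_crit * sem))
```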


Data types:
Continuous data: numerical data that can take on any value within a range (e.g., height, weight).
Discrete data: numerical data that can take on a limited number of values, for example the number of students in a class.
Nominal data (categorical): gender (Male, Female, Other), blood type (A, B, AB, O), colors (Red, Blue, Green).
Ordinal data (categorical): there is an order or ranking among the categories, but the differences between ranks are not necessarily equal. Examples: education level (High School, Bachelor's, Master's, PhD) — while a PhD is higher than a Master's, the difference between levels is not measured; satisfaction rating (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied); economic status (Low Income, Middle Income, High Income).
Interval data (numerical): numerical data with meaningful differences between values and a specific order. Example: calendar years — the year 2000 is as long as 1990 and the difference between years is consistent, but year zero does not mean "no year."
Temporal data: also known as time-series data, a sequence of data points collected or recorded at time intervals, which can be regular or irregular. It has a timestamp, it is sequential and cannot be reordered, and it is used for identifying patterns and trends.
Ratio data: similar to interval data but with a meaningful zero, allowing all arithmetic operations. Examples include height, weight, and age.

Relationships among data: Linear relationship – the change in one variable is proportional to the change in another variable, resulting in a straight line when plotted on a graph. Exponential relationship – one variable grows or decays at a rate that depends on an exponent of another variable; this relationship often appears as a curve that rises or falls rapidly.

Logarithmic relationship: one variable is the logarithm of another variable; the curve rises or falls rapidly at first but then levels off. Polynomial relationship: one variable is a polynomial function of another; depending on the degree of the polynomial, the relationship may exhibit different degrees of curvature. Periodic relationship: the values of one variable repeat at regular intervals as the values of another variable change; common in cyclic phenomena and periodic functions. Monotonic relationship: the values of one variable consistently increase or decrease as the values of another variable increase; there are two types, strictly monotonic (every value of one variable corresponds to a unique value of the other, and the relationship never reverses direction) and non-strictly monotonic (similar, but some values of one variable may correspond to the same value of the other). Nonlinear relationship: any relationship that cannot be adequately represented by a straight line; this category encompasses all of the relationships above except linear ones.


Monotonic vs. non-monotonic relationships: a monotonic relationship is one where the values of one variable consistently increase or decrease as the values of another variable increase; in a non-monotonic relationship the direction of change reverses at some point. There are two types of monotonic relationships: strictly monotonic, where every value of one variable corresponds to a unique value of the other and the relationship never reverses direction, and non-strictly monotonic, where some values of one variable may correspond to the same value of the other.

While correlation measures the strength and direction of a linear relationship, monotonicity captures any systematic change in the relationship, whether linear or not. Therefore monotonicity can be present even if the correlation is close to 0, indicating a weak linear relationship. Correlation measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
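A short sketch contrasting Pearson correlation with Spearman (rank) correlation on a monotonic but non-linear relationship; the exponential example is an illustrative assumption:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0.1, 10, 200)
y = np.exp(x)          # strictly monotonic but highly non-linear

print("Pearson r :", round(pearsonr(x, y)[0], 3))    # understates the relationship
print("Spearman  :", round(spearmanr(x, y)[0], 3))   # 1.0: perfectly monotonic
```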

Concave and convex functions (illustration).

Parametric vs. non-parametric machine learning. Parametric examples: linear regression, logistic regression, linear discriminant analysis (LDA), and some neural networks (when they have a fixed number of layers and nodes). Non-parametric examples: k-nearest neighbors (KNN), decision trees, random forests, support vector machines (SVM) with non-linear kernels, and some types of neural networks (such as deep learning).

Histogram: A histogram is a graphical representation of the distribution of numerical data

Common data distributions:

Central tendency and variation are two measures used in statistics to summarize data. A measure of central tendency shows where the center or middle of the data set is located, whereas a measure of variation shows the dispersion among data values.

Dispersion: in statistics, dispersion is a way of describing how spread out a set of data is.

Reducible Error

Bias refers to the difference between the expected predictions of a model and the true values of the target variable. A model with high bias is not complex enough to capture the underlying patterns in the data, resulting in underfitting: the model is too simple and cannot capture the complexity of the data, leading to poor performance on both the training and test data. Variance, on the other hand, refers to the variability of the model's predictions across different training sets. A model with high variance is too complex and captures noise in the training data, resulting in overfitting: the model fits the training data too closely, leading to poor performance on new, unseen data.

Noise refers to the random variations and irrelevant information within a dataset that cannot be attributed directly to the underlying relationships being modeled. Noise can come from various sources and significantly impacts the quality of the predictions made by a model.

Minimizing irreducible error

Machine learning

Collection and Data Exploration (EDA – Exploratory Data Analysis). Data cleaning: handle missing values (impute or drop them based on context); detect and handle duplicates; identify and handle outliers; standardize data formats and units; resolve inconsistencies and errors; validate data against predefined rules or constraints.
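A hedged example of these cleaning steps with pandas; the DataFrame, column names, and thresholds are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with the usual problems: missing values, duplicates, an outlier
df = pd.DataFrame({
    "age":    [25, 32, None, 45, 45, 200],
    "city":   ["NY", "ny ", "LA", "LA", "LA", "SF"],
    "income": [50_000, 62_000, 58_000, None, None, 75_000],
})

df["city"] = df["city"].str.strip().str.upper()            # standardize formats
df = df.drop_duplicates()                                   # handle duplicates
df["income"] = df["income"].fillna(df["income"].median())   # impute missing values
df = df[df["age"].between(0, 120) | df["age"].isna()]       # drop impossible ages (outliers)
df = df.dropna(subset=["age"])                              # drop remaining missing rows
print(df)
```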

Feature Engineering: Create new features based on domain knowledge. Feature scaling Generate interaction features (e.g., product, division). Extract time-based features (e.g., day of week, hour of day). Perform dimensionality reduction (e.g., PCA, t-SNE). Engineer features from raw data (e.g., text, images). Select relevant features for modeling. Data Transformation: Normalize numeric features. Scale features to a consistent range. Encode categorical variables (one-hot encoding, label encoding, etc.). Extract features from text or datetime data. Aggregate data at different levels (e.g., group by, pivot tables). Apply mathematical transformations (log, square root, etc.).

Modeling Model Selection : Choose appropriate machine learning algorithms for the task. Model Training : Train models using the processed and engineered features. Model Evaluation : Evaluate model performance using appropriate metrics and validation techniques. Deployment Model Deployment : Deploy the model to a production environment where it can make predictions on new data. Monitoring and Maintenance : Continuously monitor the model's performance and update it as necessary when new data becomes available or when model performance degrades. Feedback Loop Iterative Improvement : Use feedback from the model's performance and any new data collected to refine the feature engineering and modeling steps, continuously improving the model over time.

Business Problem Understanding

Collection and Data Exploration (EDA – Exploratory Data Analysis)

Data collection: gather data from various sources such as databases, APIs, files, etc.; extract data using appropriate tools and techniques; ensure data integrity during extraction. Data exploration – univariate analysis – exploratory data analysis: review data documentation and metadata; understand the general shape of the dataset (types, counts, number of unique values, missing values); numerical feature understanding (min, max, mean, mode, quartiles, missing values, coefficient of variation); normality and spread (distribution, standard deviation, skewness, kurtosis); categorical feature understanding (distribution, frequency, relationships, credibility, missing values); outlier identification; correlation analysis (dependent and independent variables); multicollinearity testing.
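A rough EDA sketch with pandas covering the checks listed above; the file name data.csv and the column layout are assumptions:

```python
import pandas as pd

df = pd.read_csv("data.csv")               # hypothetical dataset

print(df.info())                           # types, counts, missing values
print(df.describe())                       # min / max / mean / quartiles
print(df.nunique())                        # number of unique values per column

num = df.select_dtypes("number")
print(num.skew(), num.kurtosis(), sep="\n")   # normality and spread
print(num.std() / num.mean())                 # coefficient of variation
print(num.corr())                             # correlation / multicollinearity screen

for col in df.select_dtypes(exclude="number"):
    print(df[col].value_counts(dropna=False))  # categorical distribution and missing values
```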

Coefficient of Variation: a measure of relative variability, often used to compare the variability of different datasets or variables, especially when their means are different.

Feature scaling and transformation: a technique used to standardize or normalize the range of independent variables (features) in a dataset. The goal of feature scaling is to bring all features to a similar scale, which can be beneficial for various machine learning algorithms.
Scaling is important in: k-nearest neighbors (KNN), support vector machines (SVM), principal component analysis (PCA), linear regression, logistic regression and regularized regression (including ridge and lasso), neural networks, and k-means clustering.
Scaling is usually not important in: tree-based algorithms, including gradient boosting algorithms (e.g., XGBoost, LightGBM), and rule-based algorithms. Sparse data: if the dataset is sparse, meaning most feature values are zero or close to zero, feature scaling may not be necessary. Non-numerical features: categorical variables represented as one-hot encoded vectors, ordinal variables, or binary features typically do not require feature scaling.
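A minimal scikit-learn sketch of the two most common scalers; the tiny array is illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # features on very different scales

print(StandardScaler().fit_transform(X))   # standardization: mean 0, unit variance per column
print(MinMaxScaler().fit_transform(X))     # normalization: rescale each column to [0, 1]
```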

Numerical features: identify the distribution of each continuous variable; most of the time it will align with one of the known distributions shown below. Based on the type of ML model, we may need to transform the feature toward a more appropriate distribution for better model performance, e.g., transforming a skewed distribution toward a normal distribution using transformation techniques.

Log Transformation for X or Y
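A small sketch of a log transformation reducing right skew; the lognormal sample is synthetic, and log1p is used as one possible variant of the transform:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.lognormal(mean=3, sigma=1, size=1000))   # right-skewed feature

x_log = np.log1p(x)                  # log(1 + x): safe even when values are 0

print("skewness before:", round(x.skew(), 2))
print("skewness after :", round(x_log.skew(), 2))
```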

Categorical feature transformation. Label encoding: works well if the categorical variable has only two categories. One-hot encoding: first check the frequency of each category and identify the most used values (the rest can be grouped as an "Other" type); each category value is converted into a new column, called a dummy variable, holding a 0/1 indicator. Disadvantages: dimensionality increase; a sparse matrix (most values are zero); loss of information (any ordering of the categories is lost).

Dummy variable encoding: uses N-1 features to represent N labels/categories. The dummy variable trap occurs when the input variables perfectly predict each other, leading to multicollinearity. Multicollinearity is a scenario in which two or more input variables are highly correlated with each other. We attempt to avoid this scenario; although it won't necessarily affect the overall predictive accuracy of the model, it makes individual coefficient estimates unstable and hard to interpret. To avoid this issue we drop one of the newly created columns produced by one-hot encoding.

Frequency encoding (count encoding): encodes categorical features based on the frequency of each category in the dataset.
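A compact pandas sketch of one-hot, dummy (N-1), and frequency encoding; the color column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "blue", "red"]})

# One-hot encoding: each category becomes a 0/1 dummy column
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy-variable encoding: drop one column to avoid the dummy variable trap
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Frequency / count encoding: replace each category by how often it occurs
freq = df["color"].map(df["color"].value_counts())

print(one_hot, dummies, freq, sep="\n\n")
```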

Feature engineering (dimensionality reduction): reduce the number of features (dimensions) in a dataset while preserving the most important information.

Feature selection – main techniques. Feature selection is the process of identifying and selecting a subset of relevant features (variables, predictors) from the original dataset that are most useful for building a predictive model. This helps to improve the model's performance by removing redundant, irrelevant, or noisy data, leading to better generalization, reduced overfitting, and often shorter training times. Methods are categorized into three types: filter methods, wrapper methods, and embedded methods. Filter methods typically use statistical techniques to assess the relationship between each feature and the target variable. Correlation coefficient: measures the linear relationship between each feature and the target; features with high correlation with the target and low correlation with other features are preferred; useful for both regression and classification tasks, especially for linear relationships. Chi-square test: evaluates the independence between categorical features and the target variable; typically used for classification tasks with categorical features. ANOVA (analysis of variance): assesses the significance of features in relation to the target by comparing group means; applicable when one variable is categorical and the other continuous (categorical features with a continuous target, or continuous features with a categorical class target).

https://medium.com/analytics-vidhya/feature-selection-extended-overview-b58f1d524c1c Wrapper methods evaluate feature subsets by training and evaluating a machine learning model. They search for the best subset of features by considering the interaction between them and their combined impact on model performance. Forward selection: starts with an empty set of features and iteratively adds the feature that improves model performance the most. Backward elimination: starts with all features and iteratively removes the least significant feature. These can be used with any type of machine learning model (e.g., linear regression, decision trees, SVMs) and apply to both regression and classification tasks, but they can be computationally expensive when there are many features. Recursive feature elimination (RFE): trains a model and removes the least important feature(s) based on the model weights, recursively, until the desired number of features is reached. Mutual information (often grouped with filter methods): measures the amount of information obtained about one variable through another variable, capturing non-linear relationships; applicable to both regression and classification tasks.
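A hedged scikit-learn sketch of one filter method (ANOVA F-score) and one wrapper method (RFE); the bundled breast-cancer dataset and k = 5 are illustrative choices, not from the slides:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

# Filter method: rank features by their ANOVA F-score against the target
filt = SelectKBest(score_func=f_classif, k=5).fit(X_scaled, y)
print("Filter picks:", list(X.columns[filt.get_support()]))

# Wrapper method: recursive feature elimination around a model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X_scaled, y)
print("RFE picks:   ", list(X.columns[rfe.get_support()]))
```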

Features in the model can be selected using the following evaluation methods. An interaction term is a product of two or more predictors; it provides a more accurate model when such interactions are present in the data.

Note: goodness of fit refers to how well a statistical model describes the observed data.

What if we do model selection based only on the p-values of the predictors? Calculate the p-values: now Age is not significant.

When we consider all the evaluation criteria together, it is easier to decide which model is better.

Forward stepwise selection Start with No Predictors : Begin with the simplest model, which includes no predictors (just the intercept). Add Predictors One by One : At each step, evaluate all predictors that are not already in the model. For each predictor not in the model, fit a model that includes all the predictors currently in the model plus this new predictor. Calculate a criterion for model performance, such as Residual Sum of Squares (RSS), Akaike Information Criterion (AIC), or Bayesian Information Criterion (BIC), for each of these models. Select the Best Predictor : Identify the predictor whose inclusion in the model results in the best performance according to the chosen criterion (e.g., the predictor that reduces the RSS the most, or has the lowest AIC/BIC).
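As a sketch of the procedure above, scikit-learn's SequentialFeatureSelector performs forward selection, though it scores candidate feature sets by cross-validated performance rather than RSS/AIC/BIC; the diabetes dataset and the choice of three features are assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Forward selection: start with no predictors and add the best one at each step
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,
                                direction="forward").fit(X, y)
print("Selected predictors:", list(X.columns[sfs.get_support()]))
```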

Embedded Methods These methods are specific to particular learning algorithms and incorporate feature selection as a part of the model building phase. Lasso Regression (L1 Regularization) : Penalizes the absolute size of the regression coefficients, effectively shrinking some coefficients to zero, thus performing feature selection. Primarily used in linear regression and logistic regression for feature selection. Ridge Regression (L2 Regularization) : Penalizes the square of the coefficient magnitudes but does not perform feature selection by shrinking coefficients exactly to zero. Used in linear models but does not perform feature selection (included here for comparison). Elastic Net : Combines both L1 and L2 regularization to balance between feature selection and regularization. Combines L1 and L2 regularization, used in linear and logistic regression. Tree-based methods (e.g., Random Forest, Gradient Boosting) : Use feature importance scores derived from the tree-building process to select the most important features. Applicable to both regression and classification tasks. These methods provide feature importance scores, which can be used for feature selection.
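A brief sketch of embedded selection with an L1 penalty (LassoCV); the dataset, scaling step, and defaults are illustrative assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5).fit(X_scaled, y)        # L1 penalty shrinks some coefficients to exactly 0
kept = X.columns[lasso.coef_ != 0]
print("Coefficients  :", dict(zip(X.columns, lasso.coef_.round(1))))
print("Kept features :", list(kept))
```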

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction and can be applied to many types of machine learning tasks. The primary goal of PCA is to transform the original feature space into a new set of orthogonal axes (principal components) that maximize the variance of the data. PCA vs. feature selection. PCA: aims to reduce dimensionality by transforming the original features into a new set of orthogonal features (principal components) that capture the maximum variance in the data; it creates new features rather than selecting a subset of existing ones. It produces composite features (principal components) that are linear combinations of the original features; these components are not directly interpretable in terms of the original features, which can be a disadvantage when model interpretability is crucial. Feature selection: seeks to identify and retain the most relevant and informative subset of the original features, improving model interpretability and performance by eliminating irrelevant, redundant, or noisy features. It retains a subset of the original features, making the model easier to interpret and understand, since it works directly with the original features; selected features remain part of the original feature set, maintaining their interpretability and relevance to the problem domain.

Use cases. PCA: often used when the goal is to reduce dimensionality for visualization, to combat the curse of dimensionality, or to preprocess data for machine learning algorithms that may struggle with high-dimensional data. Feature selection: used when the goal is to improve model performance and interpretability by focusing on the most relevant features. PCA is widely used in clustering because of several key advantages: it reduces the number of features while preserving as much variability as possible; this simplification helps clustering algorithms (like k-means or hierarchical clustering) perform better by reducing noise and focusing on the most informative components; and by reducing dimensions, PCA improves the performance and speed of clustering algorithms, making it easier to identify distinct clusters.

How PCA works: the primary goal of PCA is to transform the original feature space into a new set of orthogonal axes (principal components) that maximize the variance of the data.


The number of principal components created for a given dataset is equal to the number of features in the original dataset. However, not all principal components capture the same amount of variance in the data. Typically, only a subset of the principal components is retained for dimensionality reduction, usually those corresponding to the largest eigenvalues.
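A short PCA sketch showing how many components are kept for a chosen share of the variance; the 95% threshold and the example dataset are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first

pca = PCA().fit(X_scaled)                      # as many components as original features (30)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("Components needed for 95% of variance:", np.argmax(cum_var >= 0.95) + 1)

X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)   # keep 95% of the variance
print("Reduced shape:", X_reduced.shape)
```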

Regression Model Engineering: Model Fitting and Model Evaluation

Regularization L1 Regularization L2 Regularization

The unknown coefficients β0 and β1 in linear regression define the population regression line. We seek to estimate these unknown coefficients from a sample, much as a sample mean is used to estimate a population mean.

For linear regression, the 95% confidence interval for β1 approximately takes the form β̂1 ± 2 · SE(β̂1); there is approximately a 95% chance that this interval will contain the true value of β1.

Model fitting techniques: the goal of these techniques is to find the best parameters that allow the model to predict or classify new data accurately.

KNN Regression

https://medium.com/analytics-vidhya/k-neighbors-regression-analysis-in-python-61532d56d8e4 Low K (e.g., K=1) : Bias : With a low K value, the model tends to have lower bias because it captures more detailed patterns in the training data . Each prediction is influenced by only a single data point, leading to more complex decision boundaries. Variance : However, with low K, the model tends to have higher variance because it is more sensitive to noise in the training data. The predictions can be highly influenced by the specific training instances, leading to overfitting. High K (e.g., K=N, where N is the number of training instances) : Bias : With a high K value, the model tends to have higher bias because it averages over more data points , potentially leading to oversimplified decision boundaries. It might miss subtle patterns in the data. Variance : On the other hand, with high K, the model tends to have lower variance because it smooths out the predictions by averaging over a larger number of neighbors . This can reduce the impact of individual noisy data points, leading to more stable predictions.
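A toy sketch of the K trade-off in KNN regression on synthetic data; the sine-shaped target and the K values tried are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)          # noisy non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 50, len(X_tr)):                          # low K: high variance; high K: high bias
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    print(f"K={k:>3}  train R^2={model.score(X_tr, y_tr):.2f}  test R^2={model.score(X_te, y_te):.2f}")
```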

Ordinary Least Squares (OLS) – Model Fitting

Residual Sum of Squares (RSS): RSS = Σ (yi − ŷi)², the sum of the squared differences between the observed and predicted values.

OLS

Regularization Regularization is a technique used in machine learning and statistical modeling to prevent overfitting and improve the generalization performance of models. Overfitting occurs when a model learns the training data too well , capturing noise or random fluctuations in the data, which leads to poor performance on unseen data.

Regularization

Gradient Descent

Cost Functions

Learning Rate
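A toy gradient-descent sketch fitting a one-variable linear model by minimizing the MSE cost; the data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

# Fit y = w * x + b by gradient descent on the MSE cost
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0
learning_rate = 0.1                       # too large: divergence; too small: slow convergence
for step in range(2000):
    y_hat = w * x + b
    grad_w = -2 * np.mean((y - y_hat) * x)   # dMSE/dw
    grad_b = -2 * np.mean(y - y_hat)         # dMSE/db
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}  (true values: 3.0 and 2.0)")
```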

Validation Set Approach

Cross-validation techniques: resubstitution, hold-out, k-fold cross-validation, LOOCV (leave-one-out cross-validation), random subsampling, and bootstrapping. Validation techniques in machine learning are used to estimate the error rate of the ML model, which can be considered close to the true error rate on the population.
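A minimal scikit-learn sketch of k-fold cross-validation and LOOCV; the dataset and scoring choices are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold R^2 per fold:", kfold_scores.round(2), "mean:", kfold_scores.mean().round(2))

# LOOCV: one observation held out per fold (n folds in total) -- accurate but expensive
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV mean squared error:", round(-loo_scores.mean(), 1))
```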

Ensemble Technique

Combining multiple models to improve predictive performance over any single model.

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method: train the model on B bootstrapped training sets, then average the predictions (in regression) or take a majority vote (in classification).

Another approach for improving the predictions resulting from a decision tree

Trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set. The number of trees is B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all; we use cross-validation to select B. To find the best split, consider one feature at a time and iterate through all the features – this uses the Gini index. A stump is a tree with a single split.

Generate a random number between 0 and 1 and then pick a record from the corresponding bin. To create the second sample list, repeat the same process.

KNN Regression, SVM

Classification

Logistic Regression

Logistic regression is a type of statistical model used for binary classification tasks. It predicts the probability of a binary outcome (i.e., an event with two possible values, such as 0 and 1, true and false, yes and no). Probability output: unlike linear regression, logistic regression provides probabilities for class membership, which can be useful for decision-making processes. The core of logistic regression is the logistic function (also called the sigmoid function), which maps any real-valued number into the range (0, 1): σ(z) = 1 / (1 + e^(−z)), or equivalently e^z / (1 + e^z).
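A short scikit-learn sketch of logistic regression producing class probabilities via the sigmoid; the dataset and the 0.5 threshold are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)

proba = clf.predict_proba(scaler.transform(X_te))[:, 1]   # sigmoid output: P(class = 1)
print("first 5 probabilities:", proba[:5].round(3))
print("predicted labels     :", (proba[:5] >= 0.5).astype(int))
```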

Odds: in statistics and probability theory, odds represent the ratio of the probability of success to the probability of failure for a given event, odds = p / (1 − p). The odds of an event can be expressed in different ways: as odds in favor, odds against, or simply as odds.

Log-odds (logit): the natural logarithm of the odds, log(p / (1 − p)); in logistic regression the log-odds is modeled as a linear function of the predictors, log(p / (1 − p)) = β0 + β1X.

Likelihood Calculation

Maximum Likelihood Calculation

A Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes' theorem; the "naive" part refers to its assumption that features are conditionally independent.

Decision Tree

A non-parametric supervised learning algorithm, utilized for both classification and regression tasks. It has a hierarchical tree structure that consists of a root node, branches, internal nodes, and leaf nodes.

Decision Tree Regression

Decision tree regression is a type of supervised learning algorithm used in machine learning, primarily for regression tasks. In decision tree regression, the algorithm builds a tree-like structure by recursively splitting the data into subsets based on the features that best separate the target variable (continuous in regression) into homogeneous groups.

An impurity measure , also known as a splitting criterion or splitting rule, is a metric used in decision tree algorithms to evaluate the homogeneity of a set of data points with respect to the target variable The impurity measure serves as a criterion for selecting the best feature and split point at each node of the tree. The goal is to find the feature and split point that result in the most homogeneous child nodes, leading to better predictions and a more accurate decision tree model.

Leaf Node Prediction : Once a leaf node is reached, the prediction is made based on the majority class (for classification) or the mean (for regression) of the target variable in that leaf node. This prediction becomes the output of the decision tree model for the given instance.

Mean squared error (MSE) is used as the impurity measure in decision tree regression. By minimizing the MSE at each split, decision tree regression effectively partitions the feature space into regions that are more homogeneous with respect to the target variable, leading to accurate predictions for unseen data points.
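A small sketch of decision tree regression on synthetic two-region data; the data, max_depth=2, and the feature name are assumptions, and the default split criterion in scikit-learn is the squared error (MSE):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = np.where(X.ravel() < 5, 10.0, 20.0) + rng.normal(0, 1, 200)   # two flat regions plus noise

# Each split minimizes the squared error; each leaf predicts the mean of its region
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x0"]))
```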

X0 will be selected.

Example: a three-region partition of the feature space.

Random Forest

Bagging vs. Random Forest. Feature selection: bagging uses all available features for each split in the decision trees; random forest randomly selects a subset of features for each split, which introduces additional randomness and reduces the correlation between the trees. Bias–variance tradeoff: bagging will not lead to a substantial reduction in variance over a single tree in this setting, but by averaging multiple models it reduces the variance; however, it does not inherently reduce correlation between the models. Random forest reduces both variance and correlation between models by introducing randomness in feature selection, leading to lower overall variance and improved model performance. Random forests overcome the model correlation problem by forcing each split to consider only a subset of the predictors; therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, making the average of the resulting trees less variable and hence more reliable. Performance: bagging can be applied to any base model and improves performance by reducing overfitting through model averaging; random forest is specifically designed for decision trees and typically performs better than bagging with decision trees, due to the reduced correlation between trees.
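A hedged comparison of bagged trees and a random forest (max_features="sqrt"); the dataset, tree counts, and cross-validation setup are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: every split may consider all p features
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=0)
# Random forest: each split considers only a random subset of about sqrt(p) features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print("bagging      :", cross_val_score(bag, X, y, cv=5).mean().round(3))
print("random forest:", cross_val_score(rf, X, y, cv=5).mean().round(3))
```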

The k-nearest neighbors algorithm (k-NN) is a non-parametric, lazy learning method used for classification and regression. The output is based on the majority vote (for classification) or the mean (or median, for regression) of the k nearest neighbors in the feature space.

SVM

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It's particularly effective for binary classification problems, where the goal is to classify data points into one of two categories.

A hyperplane depends on the dimension of the space: in one dimension it is a point, in two dimensions it is a line, and in three dimensions it is a plane (a flat surface).

Regression Model Evaluation

Residual Sum of Squares (RSS) = Sum of Squared Errors (SSE). Total Sum of Squares (TSS) = SST.

Mean Squared Error (MSE): MSE measures the average squared error, with higher values indicating larger discrepancies between predicted and actual values. MSE penalizes larger errors more heavily due to squaring, making it sensitive to outliers; it is commonly used because of its mathematical properties but may be less interpretable than other metrics. Importance: MSE penalizes larger errors more than smaller ones due to squaring, and it is widely used in optimization and model training because it is differentiable, which is important for gradient-based methods. What it tells about the model: a lower MSE indicates a model with fewer large errors; it gives a sense of the average squared error, which emphasizes the impact of larger errors.

The common shape of the Mean Squared Error (MSE) graph, when plotted as a function of the model parameters, is typically a convex curve.

Mean Absolute Error (MAE) MAE measures the average magnitude of errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight. Importance : MAE is a straightforward measure of error magnitude. It is less sensitive to outliers compared to MSE and RMSE because it doesn’t square the errors. What It Tells About the Model : A lower MAE indicates a model that makes smaller errors on average . Since it uses absolute differences, it provides a clear indication of the typical size of the errors in the same units as the target variable.

Root Mean Squared Error (RMSE) RMSE is the square root of the average of squared differences between prediction and actual observation. It represents the standard deviation of the prediction errors . Importance : RMSE is the square root of MSE, bringing the metric back to the same units as the target variable . It is more sensitive to outliers than MAE due to the squaring of errors before averaging. What It Tells About the Model : A lower RMSE indicates better fit, similar to MSE but more interpretable in the context of the target variable's scale . It provides an idea of how large the errors are in absolute terms. Why RMSE is Considered as Standard Deviation of Prediction Errors If we assume that the prediction errors (residuals) are normally distributed with a mean of zero, then the RMSE provides an estimate of the standard deviation of this normal distribution. This is because, under the normal distribution, the standard deviation is a measure of the average distance of the data points from the mean, which in this case is zero.
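A tiny worked example of MAE, MSE, and RMSE on made-up predictions, showing RMSE pulled above MAE by the single larger error:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 12.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # back in the units of the target variable

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")   # RMSE > MAE because of the 2.0 error
```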

Residual Standard Error: the Residual Standard Error (RSE) is a measure used in regression analysis to quantify the typical size of the residuals (prediction errors) from a linear regression model. It provides an estimate of the standard deviation of the residuals, which helps in understanding how well the model fits the data. RSE is in the same units as the dependent variable, making it straightforward to interpret. Adjustment for predictors: unlike simpler measures such as RMSE, RSE accounts for the number of predictors in the model; this adjustment (using n − p − 1 in the denominator) helps prevent overfitting by penalizing models with more predictors. Model comparison: RSE allows the comparison of different models; when comparing models with the same dependent variable, a lower RSE indicates a better fit. Relative measure: while RSE itself does not provide an absolute goodness-of-fit measure, it is useful when comparing models to determine which one better explains the variability in the data.

Large RSE values may indicate a poor fit, suggesting that the model is not capturing all the relevant information in the data. Model Assessment : RSE helps assess the accuracy of a regression model. A lower RSE value indicates a model that better captures the data's variability. Predictive Accuracy : RSE provides insights into the model’s predictive accuracy, indicating how close the predicted values are to the actual values on average. Identification of Outliers or Influential Points : Large residuals can indicate outliers or influential points that may unduly affect the model's performance. By examining these cases closely, researchers can decide whether to include, exclude, or transform them to improve model fit. Detection of Heteroscedasticity : Heteroscedasticity occurs when the variability of the residuals is not constant across all levels of the predictor variables. RSE can help identify this issue, prompting researchers to explore transformations or alternative modeling techniques to address it.

Residual plot: a residual is a measure of how far a point is, vertically, from the regression line; simply, it is the error between a predicted value and the observed actual value. A typical residual plot has the residual values on the y-axis and the independent variable on the x-axis.
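A minimal matplotlib sketch of a residual plot for a simple linear fit on synthetic data:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 1, 200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)          # observed minus predicted

plt.scatter(X, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("independent variable")
plt.ylabel("residual")
plt.show()                                # a good fit shows no pattern around the zero line
```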

Heterogeneity in residuals" refers to the situation where the variability of the residuals is not consistent across all levels of the predictor variables. In other words, the spread or dispersion of residuals varies systematically with the values of one or more predictor variables.

Characteristics of a good residual plot:

Identifying whether the Error is high or low Scale of the Target Variable : If the target variable has a large range (e.g., house prices ranging from $100,000 to $1,000,000), an RMSE of $10,000 might be considered low. Conversely, for smaller ranges, such as predicting daily temperature, an RMSE of 10 degrees might be high. Industry Standards : Different fields have established benchmarks for acceptable error rates. For instance, in some financial models, an RMSE of a few dollars might be acceptable, while in other domains, such as temperature prediction, an RMSE of a few degrees could be too high. Historical Data : Compare the error values to those of previous models or known standards within the same domain. This helps in understanding the expected range of errors. Impact of Errors : Consider the practical implications of the error. For instance, in medical diagnostics, even small errors can be critical, whereas, in movie recommendation systems, higher errors might be more tolerable. Business Goals : Align the acceptable error rates with business goals and requirements. Sometimes, a slightly higher error might be acceptable if it results in significant cost savings or other benefits.

Residual Analysis

Coefficient analysis. H0: there is no relationship between X and Y; mathematically, this corresponds to testing H0: β1 = 0, in which case the model reduces to Y = β0 + ε and X is not associated with Y. To test the null hypothesis, we need to determine whether β̂1, our estimate for β1, is sufficiently far from zero that we can be confident that β1 is non-zero. How far is far enough?

These coefficients represent the estimated change in the dependent variable (response variable) for a one-unit change in the corresponding predictor variable , holding all other variables constant . For example, if the estimate for a predictor variable X1 is 0.5, it means that, on average, for each one-unit increase in X1, the dependent variable is estimated to increase by 0.5 units, assuming all other variables in the model remain constant.

Coefficient magnitude: look at the magnitude of the coefficients; larger coefficients imply a stronger relationship between the predictor variable and the response variable. For example, a coefficient of 2 means that a one-unit increase in the predictor variable is associated with a two-unit change in the response variable. Coefficient direction: determine the direction of the relationship between the predictor variable and the response variable; a positive coefficient indicates a positive relationship, meaning that as the predictor variable increases the response variable also tends to increase, whereas a negative coefficient suggests a negative relationship, where an increase in the predictor variable is associated with a decrease in the response variable. Confounding variables: be aware of confounding variables or multicollinearity issues; if coefficients change substantially when adding or removing variables from the model, it could indicate that the variables are correlated with each other, leading to potential issues in interpretation.

Standard Error Understanding the standard error helps in assessing the stability and robustness of the model's parameter estimates The standard error provides an estimate of how much we would expect the coefficient estimates to vary from the true population parameters across different samples of the same size from the population

T Value also known as the t-statistic , is calculated as the ratio of the coefficient estimate to its standard error in regression analysis. the t-value represents the standardized deviation of the coefficient estimate from zero, expressed in terms of standard errors Why is it important? Significance Testing: t-value is used to conduct hypothesis tests on the coefficients. whether the corresponding predictor variable has a statistically significant effect on the response variable. This is essential for understanding which predictors are truly influential in the model Higher t-values indicate stronger evidence against the null hypothesis (that the coefficient is zero), suggesting that the corresponding predictor is more likely to be important in explaining the variation in the response variable Comparing t-values across different coefficients allows researchers to assess the relative importance of different predictors in the model Lower t-values across all coefficients may indicate that the model is not capturing important relationships between the predictors and the response variable.

P Value

The p-value is a probability that measures the strength of evidence against the null hypothesis in statistical hypothesis testing. If the p-value is less than the significance level, the coefficient is considered statistically significant. When interpreting p-values, it's essential to consider the chosen significance level (e.g., 0.05) and whether multiple comparisons are being made (which may require adjusting the significance level). A low p-value indicates strong evidence against the null hypothesis, suggesting that the coefficient estimate is statistically significant; a high p-value suggests weak evidence against the null hypothesis, indicating that the coefficient estimate is not statistically significant.
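A hedged statsmodels sketch that reproduces the coefficient, standard error, t-value, and p-value columns discussed above; the income/age/spend data are synthetic, constructed so that age has no real effect:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50, 10, 200), "age": rng.normal(40, 12, 200)})
df["spend"] = 0.5 * df["income"] + rng.normal(0, 5, 200)   # age has no real effect on spend

X = sm.add_constant(df[["income", "age"]])
model = sm.OLS(df["spend"], X).fit()
print(model.summary())          # coefficient, std error, t-value and p-value per predictor
```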

R² (coefficient of determination). Importance: indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a normalized metric, ranging from 0 to 1 (it can be negative if the model is worse than a horizontal line). What it tells about the model: a higher R² (closer to 1) means a better fit; it shows how well the independent variables explain the variance in the dependent variable. However, it does not provide information on the size of the errors.

Linear Regression, Classification: Underfitting and Overfitting

Residual Plot

Bias vs Variance trade off

Bias vs Variance trade off

(Figure: training data vs. testing data error.)

https://www.youtube.com/watch?v=BGlEv2CTfeg

Multicollinearity Testing

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, which can lead to unstable estimates of the regression coefficients and inflated standard errors due to Unreliable Estimates of Regression Coefficients : When predictor variables are highly correlated with each other, it becomes difficult for the regression model to determine the individual effect of each predictor on the outcome variable. As a result, the estimated regression coefficients may be unstable or have high standard errors. Uninterpretable Coefficients : In the presence of multicollinearity, the coefficients of the regression model may have counterintuitive signs or magnitudes, making their interpretation challenging or misleading. Inflated Standard Errors : Multicollinearity inflates the standard errors of the regression coefficients, which can lead to wider confidence intervals and less precise estimates of the coefficients' true values. Reduced Statistical Power : High multicollinearity reduces the statistical power of the regression model, making it less likely to detect significant relationships between predictor variables and the outcome variable, even if those relationships truly exist.

The Variance Inflation Factor (VIF) is a measure used to quantify the severity of multicollinearity in regression analysis
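A short VIF sketch with statsmodels on synthetic data where x2 is deliberately made nearly collinear with x1; the VIF > 5 (or 10) cutoff is the usual rule of thumb, not a fixed standard:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = df["x1"] * 0.95 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
df["x3"] = rng.normal(size=200)                                 # independent predictor

X = add_constant(df)
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.round(1))     # rule of thumb: VIF > 5 (or 10) signals multicollinearity
```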

Log loss, also known as logistic loss or cross-entropy loss, is a performance metric for classification models, particularly those that output probabilities for each class. Log loss quantifies the difference between the predicted probabilities and the actual class labels. For a binary classification problem it is defined as: LogLoss = −(1/N) Σ [y·log(p) + (1 − y)·log(1 − p)], summed over the N observations.

Interpretation Lower Log Loss : Indicates that the predicted probabilities are close to the actual class labels, suggesting a better model. Higher Log Loss : Indicates that the predicted probabilities are far from the actual class labels, suggesting a poorer model. What Log Loss Tells About the Model Probability Calibration : Log loss evaluates how well the predicted probabilities are calibrated with respect to the true outcomes. It penalizes both overconfident wrong predictions and underconfident correct predictions. Model Performance : It provides a nuanced measure of model performance, beyond just accuracy. While accuracy measures the fraction of correct predictions, log loss considers the confidence of those predictions. Handling Class Imbalance : Log loss can handle imbalanced classes better than accuracy because it takes the predicted probabilities into account, rather than just the final classification.
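A tiny log-loss illustration comparing well-calibrated predictions with confidently wrong ones; the labels and probabilities are made up:

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]

confident_and_right = [0.95, 0.05, 0.90, 0.85, 0.10]
confident_but_wrong = [0.05, 0.95, 0.10, 0.15, 0.90]

print("well calibrated  :", round(log_loss(y_true, confident_and_right), 3))
print("confidently wrong:", round(log_loss(y_true, confident_but_wrong), 3))
```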

Confusion matrix: evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. The confusion matrix provides a more insightful picture: not only the overall performance of a predictive model, but also which classes are being predicted correctly and incorrectly, and what types of errors are being made. The matrix can be represented as a table whose rows are the actual classes and whose columns are the predicted classes (true positives, false positives, false negatives, true negatives).

Precision and recall should be calculated for each class. Precision is based on the predictions; recall is based on the ground truth.

Accuracy gives an overall understanding of how well the model is performing, but it can be misleading if classes are imbalanced. Precision (always based on the predictions) is important when the cost of false positives is high; it helps assess the quality of positive predictions. Recall (sensitivity, always based on the ground truth) is crucial when capturing all actual positives is essential; it measures the model's ability to identify positive instances. F1 score provides a balance between precision and recall, especially when there is an uneven class distribution; it is a better measure of a model's performance when there is a trade-off between false positives and false negatives.


Classification report. Overall metrics: accuracy = 0.60 means that 60% of the total predictions were correct. Macro average: compute the metric independently for each class and then take the average of these metrics; it treats all classes equally, without considering the class distribution.

These macro average metrics provide an overall measure of model performance that treats all classes equally, regardless of their frequency in the dataset. In the weighted average, the performance metrics for each class are weighted by the number of instances in that class, giving more importance to classes with more instances.
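A small sketch of a confusion matrix and classification report on an imbalanced toy example, where the macro and weighted averages diverge; the labels are made up:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Imbalanced toy labels: class "1" is rare and poorly predicted
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))   # compare the macro avg and weighted avg rows
```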

The weighted average provides a more realistic measure of overall model performance by giving more importance to the classes with more instances. This is particularly useful in datasets with imbalanced class distributions, as it ensures that the performance metrics reflect the model's ability to correctly classify the more prevalent classes. How to use these results for model improvement: weighted averages might be significantly higher than macro averages, indicating that the model performs well on frequent classes but poorly on rare ones. Oversampling minority classes: use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to generate more samples for underrepresented classes. Undersampling majority classes: reduce the number of samples in the overrepresented classes to balance the class distribution. Class weights: modify the loss function to give higher weights to minority classes during training, encouraging the model to focus more on these classes.

Low precision, recall, and F1 scores for specific classes Class-Specific Data Augmentation : Create additional synthetic data or collect more real data for the poorly performing classes. Feature Engineering : Develop new features that may be more informative for the difficult classes. Class-Specific Models : Train separate models for each class or use ensemble methods that can better handle class-specific peculiarities. High performance on training data but low performance on certain test classes. Regularization : Apply techniques like L1/L2 regularization to prevent overfitting. Pruning Decision Trees : If using decision trees or random forests, prune the trees to reduce complexity and prevent overfitting. Cross-Validation : Use cross-validation to ensure that the model generalizes well across different subsets of the data.

Consistent low recall or precision across multiple classes in both macro and weighted averages. Hyperparameter tuning: use grid search or random search to find the optimal hyperparameters for your model. Ensemble methods: combine multiple models to leverage their strengths and mitigate individual weaknesses; methods like bagging, boosting, and stacking can improve overall performance. Regular updates: regularly update the model with new data to ensure it captures the most recent patterns and trends. If current improvements are insufficient, it might indicate the need for a different model architecture. Algorithm choice: experiment with different algorithms (e.g., switching from a decision tree to a gradient boosting machine or neural network) to find one that better captures the data patterns. Neural network layers: for deep learning models, adjust the network architecture (for example, the number and size of layers).

Practical Steps: Evaluate Metrics : Carefully analyze the precision, recall, and F1-score for each class. Compare macro and weighted averages to understand overall versus individual class performance. Diagnose Issues : Identify which classes are underperforming and why (e.g., lack of data, inherent difficulty). Implement Improvements : Choose and apply the appropriate techniques from the actions listed above based on your diagnosis. Regularly monitor the impact of these changes on your model's performance metrics. Iterate and Optimize : Continuously iterate on the model, using new data and feedback to further refine performance. Use tools like learning curves to understand the impact of more data or different algorithms.

Logistic Regression Model Evaluation

Deviance residuals in a logistic regression output provide detailed information about the fit of the model to individual data points and help identify potential outliers or issues with the model; a high maximum value compared to the other values might suggest that there are outliers or poorly fitted observations in the data. The deviance is a measure of the difference between a fitted model and the perfect model (also called the saturated model). The deviance for a logistic regression model can be divided into two parts: Null deviance: the deviance of a model with no predictors, only an intercept; it serves as a baseline to compare with the fitted model. Residual deviance: the deviance of the fitted model with the predictors included.

Ridge

Lasso

There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.

What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different f̂. Ideally the estimate for f should not vary too much between training sets; however, if a method has high variance then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance.

Estimating the population mean µ of a random variable Y: how far off will a single estimate µ̂ be? The standard error of µ̂ answers this question (compare the residual standard error in regression). Standard errors can be used to compute confidence intervals.

https://www.youtube.com/watch?v=7WPfuHLCn_k&t=427s https://www.youtube.com/watch?v=-H5tcISshKg