DataScienceConcept_Kanchana_Weerasinghe.pptx


About This Presentation

I've developed a comprehensive handbook covering the core concepts of data science and the mathematics behind them. It's a culmination of my experience and knowledge in the field. I hope it will be a great starting point for beginners.


Slide Content

Data Science

Basics

Data types: ratio data has a meaningful zero (e.g., height); interval data has no meaningful zero (e.g., temperature in Celsius).

Standard deviation describes how close the values in a data set are to the mean: on average, the data points differ from the mean by one standard deviation.

Statistical inference is the process of drawing conclusions about an underlying population based on a sample, or subset, of the data. In most cases it is not practical to obtain all the measurements in a given population, so point estimators computed from the sample are used to estimate population parameters.

Z-score: a measure that indicates how many standard deviations a data point is from the mean of a dataset. Applications: Cross-group comparisons – z-scores allow the comparison of scores from different groups that may have different means, standard deviations, and distributions, for example comparing test scores from students in different schools or countries. Outlier detection – observations with z-scores significantly higher or lower than the typical range (usually z-scores less than -3 or greater than 3) are often regarded as outliers. Normalization of data – z-scores are used in statistical analysis to normalize data, ensuring that every datum has a comparable scale; this is useful in multivariate statistics where data on different scales are combined.
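A minimal Python sketch of the z-score idea; the synthetic data and the |z| > 3 cutoff are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=1000), 95.0)  # inject one extreme value

z_scores = (values - values.mean()) / values.std()   # standardize: mean 0, std 1

# Common rule of thumb: |z| > 3 marks potential outliers
# (a few points from the normal sample may also cross the threshold)
print("Flagged outliers:", values[np.abs(z_scores) > 3])
```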

The total area under the curve of any probability density function (pdf) is always equal to 1; the area under a region of the curve gives the probability of observing a value in that region.

Confidence interval: a range of values such that, with X% confidence, the range will contain the true unknown value of the parameter. The critical value used to build the interval depends on the degrees of freedom.

Sample size >= 30: use the z-statistic. Sample size < 30: use the t-statistic.
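A small illustration of the z-versus-t choice using scipy; the sample values are made up for the example:

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.0])
mean, sem = sample.mean(), stats.sem(sample)     # sem = sample std / sqrt(n)

# Small sample (n < 30): t critical value with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
# Large-sample approximation: z critical value
z_crit = stats.norm.ppf(0.975)

print("95% t-interval:", (mean - t_crit * sem, mean + t_crit * sem))
print("95% z-interval:", (mean - z_crit * sem, mean + z_crit * sem))
```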


Data types:
Continuous data: numerical data that can take on any value within a range (e.g., height, weight).
Discrete data: numerical data that can take on a limited number of values, for example the number of students in a class.
Nominal data (categorical): gender (Male, Female, Other), blood type (A, B, AB, O), colors (Red, Blue, Green).
Ordinal data (categorical): there is an order or ranking among the categories, but the differences between ranks are not necessarily equal. Examples: education level (High School, Bachelor's, Master's, PhD) — while a PhD is higher than a Master's, the difference between levels is not measured; satisfaction rating (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied); economic status (Low Income, Middle Income, High Income).
Interval data (numerical): numerical data with meaningful differences between values and a specific order. Example: calendar years — the year 2000 is as long as 1990 and the difference between years is consistent, but year zero does not mean "no year."
Temporal data: also known as time-series data, a sequence of data points collected or recorded at time intervals, which can be regular or irregular. It has a timestamp, it is sequential and cannot be reordered, and it is used for identifying patterns and trends.
Ratio data: similar to interval data but with a meaningful zero, allowing all arithmetic operations. Examples include height, weight, and age.

Relationships among data: Linear relationship – the change in one variable is proportional to the change in another variable, resulting in a straight line when plotted on a graph. Exponential relationship – one variable grows or decays at a rate that depends on an exponent of another variable; this relationship often appears as a curve that rises or falls rapidly.

Logarithmic relationship: one variable is the logarithm of another variable; the curve rises or falls rapidly at first but then levels off. Polynomial relationship: one variable is a polynomial function of another; depending on the degree of the polynomial, the relationship may exhibit different degrees of curvature. Periodic relationship: the values of one variable repeat at regular intervals as the values of another variable change; common in cyclic phenomena and periodic functions. Monotonic relationship: the values of one variable consistently increase or decrease as the values of another variable increase; there are two types, strictly monotonic (every value of one variable corresponds to a unique value of the other, and the relationship never reverses direction) and non-strictly monotonic (similar, but some values of one variable may correspond to the same value of the other). Nonlinear relationship: any relationship that cannot be adequately represented by a straight line; this category encompasses all of the relationships above except linear ones.


Monotonic vs. non-monotonic relationships: a monotonic relationship is one where the values of one variable consistently increase or decrease as the values of another variable increase; in a non-monotonic relationship the direction of change reverses at some point. There are two types of monotonic relationships: strictly monotonic, where every value of one variable corresponds to a unique value of the other and the relationship never reverses direction, and non-strictly monotonic, where some values of one variable may correspond to the same value of the other.

While correlation measures the strength and direction of a linear relationship, monotonicity captures any systematic change in the relationship, whether linear or not. Therefore monotonicity can be present even if the correlation is close to 0, indicating a weak linear relationship. Correlation measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
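A short sketch contrasting Pearson correlation with Spearman (rank) correlation on a monotonic but non-linear relationship; the exponential example is an illustrative assumption:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0.1, 10, 200)
y = np.exp(x)          # strictly monotonic but highly non-linear

print("Pearson r :", round(pearsonr(x, y)[0], 3))    # understates the relationship
print("Spearman  :", round(spearmanr(x, y)[0], 3))   # 1.0: perfectly monotonic
```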

Concave and convex functions (illustration).

Parametric vs. non-parametric machine learning. Parametric examples: linear regression, logistic regression, linear discriminant analysis (LDA), and some neural networks (when they have a fixed number of layers and nodes). Non-parametric examples: k-nearest neighbors (KNN), decision trees, random forests, support vector machines (SVM) with non-linear kernels, and some types of neural networks (such as deep learning).

Histogram: A histogram is a graphical representation of the distribution of numerical data

Common data distributions:

Central tendency and variation are two measures used in statistics to summarize data. A measure of central tendency shows where the center or middle of the data set is located, whereas a measure of variation shows the dispersion among data values.

Dispersion: in statistics, dispersion is a way of describing how spread out a set of data is.

Reducible Error

Bias refers to the difference between the expected predictions of a model and the true values of the target variable. A model with high bias is not complex enough to capture the underlying patterns in the data, resulting in underfitting: the model is too simple and cannot capture the complexity of the data, leading to poor performance on both the training and test data. Variance, on the other hand, refers to the variability of the model's predictions across different training sets. A model with high variance is too complex and captures noise in the training data, resulting in overfitting: the model fits the training data too closely, leading to poor performance on new, unseen data.

Noise refers to the random variations and irrelevant information within a dataset that cannot be attributed directly to the underlying relationships being modeled. Noise can come from various sources and significantly impacts the quality of the predictions made by a model.

Minimizing irreducible error

Machine learning

Collection and Data Exploration (EDA – Exploratory Data Analysis). Data cleaning: handle missing values (impute or drop them based on context); detect and handle duplicates; identify and handle outliers; standardize data formats and units; resolve inconsistencies and errors; validate data against predefined rules or constraints.
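A hedged example of these cleaning steps with pandas; the DataFrame, column names, and thresholds are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with the usual problems: missing values, duplicates, an outlier
df = pd.DataFrame({
    "age":    [25, 32, None, 45, 45, 200],
    "city":   ["NY", "ny ", "LA", "LA", "LA", "SF"],
    "income": [50_000, 62_000, 58_000, None, None, 75_000],
})

df["city"] = df["city"].str.strip().str.upper()            # standardize formats
df = df.drop_duplicates()                                   # handle duplicates
df["income"] = df["income"].fillna(df["income"].median())   # impute missing values
df = df[df["age"].between(0, 120) | df["age"].isna()]       # drop impossible ages (outliers)
df = df.dropna(subset=["age"])                              # drop remaining missing rows
print(df)
```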

Feature Engineering: Create new features based on domain knowledge. Feature scaling Generate interaction features (e.g., product, division). Extract time-based features (e.g., day of week, hour of day). Perform dimensionality reduction (e.g., PCA, t-SNE). Engineer features from raw data (e.g., text, images). Select relevant features for modeling. Data Transformation: Normalize numeric features. Scale features to a consistent range. Encode categorical variables (one-hot encoding, label encoding, etc.). Extract features from text or datetime data. Aggregate data at different levels (e.g., group by, pivot tables). Apply mathematical transformations (log, square root, etc.).

Modeling Model Selection : Choose appropriate machine learning algorithms for the task. Model Training : Train models using the processed and engineered features. Model Evaluation : Evaluate model performance using appropriate metrics and validation techniques. Deployment Model Deployment : Deploy the model to a production environment where it can make predictions on new data. Monitoring and Maintenance : Continuously monitor the model's performance and update it as necessary when new data becomes available or when model performance degrades. Feedback Loop Iterative Improvement : Use feedback from the model's performance and any new data collected to refine the feature engineering and modeling steps, continuously improving the model over time.

Business Problem Understanding

Collection and Data Exploration (EDA – Exploratory Data Analysis)

Data collection: gather data from various sources such as databases, APIs, files, etc.; extract data using appropriate tools and techniques; ensure data integrity during extraction. Data exploration – univariate analysis – exploratory data analysis: review data documentation and metadata; understand the general shape of the dataset (types, counts, number of unique values, missing values); numerical feature understanding (min, max, mean, mode, quartiles, missing values, coefficient of variation); normality and spread (distribution, standard deviation, skewness, kurtosis); categorical feature understanding (distribution, frequency, relationships, credibility, missing values); outlier identification; correlation analysis (dependent and independent variables); multicollinearity testing.
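A rough EDA sketch with pandas covering the checks listed above; the file name data.csv and the column layout are assumptions:

```python
import pandas as pd

df = pd.read_csv("data.csv")               # hypothetical dataset

print(df.info())                           # types, counts, missing values
print(df.describe())                       # min / max / mean / quartiles
print(df.nunique())                        # number of unique values per column

num = df.select_dtypes("number")
print(num.skew(), num.kurtosis(), sep="\n")   # normality and spread
print(num.std() / num.mean())                 # coefficient of variation
print(num.corr())                             # correlation / multicollinearity screen

for col in df.select_dtypes(exclude="number"):
    print(df[col].value_counts(dropna=False))  # categorical distribution and missing values
```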

Coefficient of Variation: a measure of relative variability, often used to compare the variability of different datasets or variables, especially when their means are different.

Feature scaling and transformation: a technique used to standardize or normalize the range of independent variables (features) in a dataset. The goal of feature scaling is to bring all features to a similar scale, which can be beneficial for various machine learning algorithms.
Scaling is important in: k-nearest neighbors (KNN), support vector machines (SVM), principal component analysis (PCA), linear regression, logistic regression and regularized regression (including ridge and lasso), neural networks, and k-means clustering.
Scaling is usually not important in: tree-based algorithms, including gradient boosting algorithms (e.g., XGBoost, LightGBM), and rule-based algorithms. Sparse data: if the dataset is sparse, meaning most feature values are zero or close to zero, feature scaling may not be necessary. Non-numerical features: categorical variables represented as one-hot encoded vectors, ordinal variables, or binary features typically do not require feature scaling.
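A minimal scikit-learn sketch of the two most common scalers; the tiny array is illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # features on very different scales

print(StandardScaler().fit_transform(X))   # standardization: mean 0, unit variance per column
print(MinMaxScaler().fit_transform(X))     # normalization: rescale each column to [0, 1]
```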

Numerical features: identify the distribution of each continuous variable; most of the time it will align with one of the known distributions shown below. Based on the type of ML model, we may need to transform the feature toward a more appropriate distribution for better model performance, e.g., transforming a skewed distribution toward a normal distribution using transformation techniques.

Log Transformation for X or Y
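A small sketch of a log transformation reducing right skew; the lognormal sample is synthetic, and log1p is used as one possible variant of the transform:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.lognormal(mean=3, sigma=1, size=1000))   # right-skewed feature

x_log = np.log1p(x)                  # log(1 + x): safe even when values are 0

print("skewness before:", round(x.skew(), 2))
print("skewness after :", round(x_log.skew(), 2))
```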

Categorical feature transformation. Label encoding: works well if the categorical variable has only two categories. One-hot encoding: first check the frequency of each category and identify the most used values (the rest can be grouped as an "Other" type); each category value is converted into a new column, called a dummy variable, holding a 0/1 indicator. Disadvantages: dimensionality increase; a sparse matrix (most values are zero); loss of information (any ordering of the categories is lost).

Dummy variable encoding: uses N-1 features to represent N labels/categories. The dummy variable trap occurs when the input variables perfectly predict each other, leading to multicollinearity. Multicollinearity is a scenario in which two or more input variables are highly correlated with each other. We attempt to avoid this scenario; although it won't necessarily affect the overall predictive accuracy of the model, it makes individual coefficient estimates unstable and hard to interpret. To avoid this issue we drop one of the newly created columns produced by one-hot encoding.

Frequency encoding (count encoding): encodes categorical features based on the frequency of each category in the dataset.
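A compact pandas sketch of one-hot, dummy (N-1), and frequency encoding; the color column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "blue", "red"]})

# One-hot encoding: each category becomes a 0/1 dummy column
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy-variable encoding: drop one column to avoid the dummy variable trap
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Frequency / count encoding: replace each category by how often it occurs
freq = df["color"].map(df["color"].value_counts())

print(one_hot, dummies, freq, sep="\n\n")
```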

Feature engineering (dimensionality reduction): reduce the number of features (dimensions) in a dataset while preserving the most important information.

Feature selection – main techniques. Feature selection is the process of identifying and selecting a subset of relevant features (variables, predictors) from the original dataset that are most useful for building a predictive model. This helps to improve the model's performance by removing redundant, irrelevant, or noisy data, leading to better generalization, reduced overfitting, and often shorter training times. Methods are categorized into three types: filter methods, wrapper methods, and embedded methods. Filter methods typically use statistical techniques to assess the relationship between each feature and the target variable. Correlation coefficient: measures the linear relationship between each feature and the target; features with high correlation with the target and low correlation with other features are preferred; useful for both regression and classification tasks, especially for linear relationships. Chi-square test: evaluates the independence between categorical features and the target variable; typically used for classification tasks with categorical features. ANOVA (analysis of variance): assesses the significance of features in relation to the target by comparing group means; applicable when one variable is categorical and the other continuous (categorical features with a continuous target, or continuous features with a categorical class target).

https://medium.com/analytics-vidhya/feature-selection-extended-overview-b58f1d524c1c Wrapper methods evaluate feature subsets by training and evaluating a machine learning model. They search for the best subset of features by considering the interaction between them and their combined impact on model performance. Forward selection: starts with an empty set of features and iteratively adds the feature that improves model performance the most. Backward elimination: starts with all features and iteratively removes the least significant feature. These can be used with any type of machine learning model (e.g., linear regression, decision trees, SVMs) and apply to both regression and classification tasks, but they can be computationally expensive when there are many features. Recursive feature elimination (RFE): trains a model and removes the least important feature(s) based on the model weights, recursively, until the desired number of features is reached. Mutual information (often grouped with filter methods): measures the amount of information obtained about one variable through another variable, capturing non-linear relationships; applicable to both regression and classification tasks.
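A hedged scikit-learn sketch of one filter method (ANOVA F-score) and one wrapper method (RFE); the bundled breast-cancer dataset and k = 5 are illustrative choices, not from the slides:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

# Filter method: rank features by their ANOVA F-score against the target
filt = SelectKBest(score_func=f_classif, k=5).fit(X_scaled, y)
print("Filter picks:", list(X.columns[filt.get_support()]))

# Wrapper method: recursive feature elimination around a model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X_scaled, y)
print("RFE picks:   ", list(X.columns[rfe.get_support()]))
```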

Features in the model can be selected using the following evaluation methods. An interaction term is a product of two or more predictors; it provides a more accurate model when such interactions are present in the data.

Note: goodness of fit refers to how well a statistical model describes the observed data.

What if we do model selection based only on the p-values of the predictors? Calculate the p-values: now Age is not significant.

When we consider all the evaluation criteria together, it is easier to decide which model is better.

Forward stepwise selection Start with No Predictors : Begin with the simplest model, which includes no predictors (just the intercept). Add Predictors One by One : At each step, evaluate all predictors that are not already in the model. For each predictor not in the model, fit a model that includes all the predictors currently in the model plus this new predictor. Calculate a criterion for model performance, such as Residual Sum of Squares (RSS), Akaike Information Criterion (AIC), or Bayesian Information Criterion (BIC), for each of these models. Select the Best Predictor : Identify the predictor whose inclusion in the model results in the best performance according to the chosen criterion (e.g., the predictor that reduces the RSS the most, or has the lowest AIC/BIC).
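As a sketch of the procedure above, scikit-learn's SequentialFeatureSelector performs forward selection, though it scores candidate feature sets by cross-validated performance rather than RSS/AIC/BIC; the diabetes dataset and the choice of three features are assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Forward selection: start with no predictors and add the best one at each step
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3,
                                direction="forward").fit(X, y)
print("Selected predictors:", list(X.columns[sfs.get_support()]))
```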

Embedded Methods These methods are specific to particular learning algorithms and incorporate feature selection as a part of the model building phase. Lasso Regression (L1 Regularization) : Penalizes the absolute size of the regression coefficients, effectively shrinking some coefficients to zero, thus performing feature selection. Primarily used in linear regression and logistic regression for feature selection. Ridge Regression (L2 Regularization) : Penalizes the square of the coefficient magnitudes but does not perform feature selection by shrinking coefficients exactly to zero. Used in linear models but does not perform feature selection (included here for comparison). Elastic Net : Combines both L1 and L2 regularization to balance between feature selection and regularization. Combines L1 and L2 regularization, used in linear and logistic regression. Tree-based methods (e.g., Random Forest, Gradient Boosting) : Use feature importance scores derived from the tree-building process to select the most important features. Applicable to both regression and classification tasks. These methods provide feature importance scores, which can be used for feature selection.
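A brief sketch of embedded selection with an L1 penalty (LassoCV); the dataset, scaling step, and defaults are illustrative assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5).fit(X_scaled, y)        # L1 penalty shrinks some coefficients to exactly 0
kept = X.columns[lasso.coef_ != 0]
print("Coefficients  :", dict(zip(X.columns, lasso.coef_.round(1))))
print("Kept features :", list(kept))
```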

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction and can be applied to many types of machine learning tasks. The primary goal of PCA is to transform the original feature space into a new set of orthogonal axes (principal components) that maximize the variance of the data. PCA vs. feature selection. PCA: aims to reduce dimensionality by transforming the original features into a new set of orthogonal features (principal components) that capture the maximum variance in the data; it creates new features rather than selecting a subset of existing ones. It produces composite features (principal components) that are linear combinations of the original features; these components are not directly interpretable in terms of the original features, which can be a disadvantage when model interpretability is crucial. Feature selection: seeks to identify and retain the most relevant and informative subset of the original features, improving model interpretability and performance by eliminating irrelevant, redundant, or noisy features. It retains a subset of the original features, making the model easier to interpret and understand, since it works directly with the original features; selected features remain part of the original feature set, maintaining their interpretability and relevance to the problem domain.

Use cases. PCA: often used when the goal is to reduce dimensionality for visualization, to combat the curse of dimensionality, or to preprocess data for machine learning algorithms that may struggle with high-dimensional data. Feature selection: used when the goal is to improve model performance and interpretability by focusing on the most relevant features. PCA is widely used in clustering because of several key advantages: it reduces the number of features while preserving as much variability as possible; this simplification helps clustering algorithms (like k-means or hierarchical clustering) perform better by reducing noise and focusing on the most informative components; and by reducing dimensions, PCA improves the performance and speed of clustering algorithms, making it easier to identify distinct clusters.

How PCA works: the primary goal of PCA is to transform the original feature space into a new set of orthogonal axes (principal components) that maximize the variance of the data.


The number of principal components created for a given dataset is equal to the number of features in the original dataset. However, not all principal components capture the same amount of variance in the data. Typically, only a subset of the principal components is retained for dimensionality reduction, usually those corresponding to the largest eigenvalues.
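A short PCA sketch showing how many components are kept for a chosen share of the variance; the 95% threshold and the example dataset are assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first

pca = PCA().fit(X_scaled)                      # as many components as original features (30)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("Components needed for 95% of variance:", np.argmax(cum_var >= 0.95) + 1)

X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)   # keep 95% of the variance
print("Reduced shape:", X_reduced.shape)
```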

Regression Model Engineering: Model Fitting and Model Evaluation

Regularization L1 Regularization L2 Regularization

The unknown coefficients β0 and β1 in linear regression define the population regression line. We seek to estimate these unknown coefficients from a sample, much as a sample mean is used to estimate a population mean.

For linear regression, the 95% confidence interval for β1 approximately takes the form β̂1 ± 2 · SE(β̂1); there is approximately a 95% chance that this interval will contain the true value of β1.

Model fitting techniques: the goal of these techniques is to find the best parameters that allow the model to predict or classify new data accurately.

KNN Regression

https://medium.com/analytics-vidhya/k-neighbors-regression-analysis-in-python-61532d56d8e4 Low K (e.g., K=1) : Bias : With a low K value, the model tends to have lower bias because it captures more detailed patterns in the training data . Each prediction is influenced by only a single data point, leading to more complex decision boundaries. Variance : However, with low K, the model tends to have higher variance because it is more sensitive to noise in the training data. The predictions can be highly influenced by the specific training instances, leading to overfitting. High K (e.g., K=N, where N is the number of training instances) : Bias : With a high K value, the model tends to have higher bias because it averages over more data points , potentially leading to oversimplified decision boundaries. It might miss subtle patterns in the data. Variance : On the other hand, with high K, the model tends to have lower variance because it smooths out the predictions by averaging over a larger number of neighbors . This can reduce the impact of individual noisy data points, leading to more stable predictions.
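A toy sketch of the K trade-off in KNN regression on synthetic data; the sine-shaped target and the K values tried are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)          # noisy non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 50, len(X_tr)):                          # low K: high variance; high K: high bias
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    print(f"K={k:>3}  train R^2={model.score(X_tr, y_tr):.2f}  test R^2={model.score(X_te, y_te):.2f}")
```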

Ordinary Least Squares (OLS) – Model Fitting

Residual Sum of Squares (RSS): RSS = Σ (yi − ŷi)², the sum of the squared differences between the observed and predicted values.

OLS

Regularization Regularization is a technique used in machine learning and statistical modeling to prevent overfitting and improve the generalization performance of models. Overfitting occurs when a model learns the training data too well , capturing noise or random fluctuations in the data, which leads to poor performance on unseen data.

Regularization

Gradient Descent

Cost Functions

Learning Rate
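A toy gradient-descent sketch fitting a one-variable linear model by minimizing the MSE cost; the data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

# Fit y = w * x + b by gradient descent on the MSE cost
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0
learning_rate = 0.1                       # too large: divergence; too small: slow convergence
for step in range(2000):
    y_hat = w * x + b
    grad_w = -2 * np.mean((y - y_hat) * x)   # dMSE/dw
    grad_b = -2 * np.mean(y - y_hat)         # dMSE/db
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}  (true values: 3.0 and 2.0)")
```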

Validation Set Approach

Cross-validation techniques: resubstitution, hold-out, k-fold cross-validation, LOOCV (leave-one-out cross-validation), random subsampling, and bootstrapping. Validation techniques in machine learning are used to estimate the error rate of the ML model, which can be considered close to the true error rate on the population.
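A minimal scikit-learn sketch of k-fold cross-validation and LOOCV; the dataset and scoring choices are illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold R^2 per fold:", kfold_scores.round(2), "mean:", kfold_scores.mean().round(2))

# LOOCV: one observation held out per fold (n folds in total) -- accurate but expensive
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV mean squared error:", round(-loo_scores.mean(), 1))
```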

Ensemble Technique

Combining multiple models to improve predictive performance over any single model.

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method: train the model on B bootstrapped training sets, then average the predictions (in regression) or take a majority vote (in classification).

Another approach for improving the predictions resulting from a decision tree

Trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set. The number of trees is B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all; we use cross-validation to select B. To find the best split, consider one feature at a time and iterate through all the features – this uses the Gini index. A stump is a tree with a single split.

Generate a random number between 0 and 1 and then pick a record from the corresponding bin. To create the second sample list, repeat the same process.

KNN Regression, SVM

Classification

Logistic Regression

Logistic regression is a type of statistical model used for binary classification tasks. It predicts the probability of a binary outcome (i.e., an event with two possible values, such as 0 and 1, true and false, yes and no). Probability output: unlike linear regression, logistic regression provides probabilities for class membership, which can be useful for decision-making processes. The core of logistic regression is the logistic function (also called the sigmoid function), which maps any real-valued number into the range (0, 1): σ(z) = 1 / (1 + e^(−z)), or equivalently e^z / (1 + e^z).
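A short scikit-learn sketch of logistic regression producing class probabilities via the sigmoid; the dataset and the 0.5 threshold are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)

proba = clf.predict_proba(scaler.transform(X_te))[:, 1]   # sigmoid output: P(class = 1)
print("first 5 probabilities:", proba[:5].round(3))
print("predicted labels     :", (proba[:5] >= 0.5).astype(int))
```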

Odds: in statistics and probability theory, odds represent the ratio of the probability of success to the probability of failure for a given event, odds = p / (1 − p). The odds of an event can be expressed in different ways: as odds in favor, odds against, or simply as odds.

Log-odds (logit): the natural logarithm of the odds, log(p / (1 − p)); in logistic regression the log-odds is modeled as a linear function of the predictors, log(p / (1 − p)) = β0 + β1X.

Likelihood Calculation

Maximum Likelihood Calculation

A Naive Bayes classifier is a probabilistic machine learning algorithm based on Bayes' theorem; the "naive" part refers to its assumption that features are conditionally independent.

Decision Tree

A non-parametric supervised learning algorithm, utilized for both classification and regression tasks. It has a hierarchical tree structure that consists of a root node, branches, internal nodes, and leaf nodes.

Decision Tree Regression

Decision tree regression is a type of supervised learning algorithm used in machine learning, primarily for regression tasks. In decision tree regression, the algorithm builds a tree-like structure by recursively splitting the data into subsets based on the features that best separate the target variable (continuous in regression) into homogeneous groups.

An impurity measure , also known as a splitting criterion or splitting rule, is a metric used in decision tree algorithms to evaluate the homogeneity of a set of data points with respect to the target variable The impurity measure serves as a criterion for selecting the best feature and split point at each node of the tree. The goal is to find the feature and split point that result in the most homogeneous child nodes, leading to better predictions and a more accurate decision tree model.

Leaf Node Prediction : Once a leaf node is reached, the prediction is made based on the majority class (for classification) or the mean (for regression) of the target variable in that leaf node. This prediction becomes the output of the decision tree model for the given instance.

Mean squared error (MSE) is used as the impurity measure in decision tree regression. By minimizing the MSE at each split, decision tree regression effectively partitions the feature space into regions that are more homogeneous with respect to the target variable, leading to accurate predictions for unseen data points.
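A small sketch of decision tree regression on synthetic two-region data; the data, max_depth=2, and the feature name are assumptions, and the default split criterion in scikit-learn is the squared error (MSE):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = np.where(X.ravel() < 5, 10.0, 20.0) + rng.normal(0, 1, 200)   # two flat regions plus noise

# Each split minimizes the squared error; each leaf predicts the mean of its region
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x0"]))
```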

X0 will be selected.

Example: a three-region partition of the feature space.

Random Forest

Bagging vs. Random Forest. Feature selection: bagging uses all available features for each split in the decision trees; random forest randomly selects a subset of features for each split, which introduces additional randomness and reduces the correlation between the trees. Bias–variance tradeoff: bagging will not lead to a substantial reduction in variance over a single tree in this setting, but by averaging multiple models it reduces the variance; however, it does not inherently reduce correlation between the models. Random forest reduces both variance and correlation between models by introducing randomness in feature selection, leading to lower overall variance and improved model performance. Random forests overcome the model correlation problem by forcing each split to consider only a subset of the predictors; therefore, on average (p − m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. We can think of this process as decorrelating the trees, making the average of the resulting trees less variable and hence more reliable. Performance: bagging can be applied to any base model and improves performance by reducing overfitting through model averaging; random forest is specifically designed for decision trees and typically performs better than bagging with decision trees, due to the reduced correlation between trees.
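A hedged comparison of bagged trees and a random forest (max_features="sqrt"); the dataset, tree counts, and cross-validation setup are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: every split may consider all p features
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=0)
# Random forest: each split considers only a random subset of about sqrt(p) features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print("bagging      :", cross_val_score(bag, X, y, cv=5).mean().round(3))
print("random forest:", cross_val_score(rf, X, y, cv=5).mean().round(3))
```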

The k-nearest neighbors algorithm (k-NN) is a non-parametric, lazy learning method used for classification and regression. The output is based on the majority vote (for classification) or the mean (or median, for regression) of the k nearest neighbors in the feature space.

SVM

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It's particularly effective for binary classification problems, where the goal is to classify data points into one of two categories.

A hyperplane depends on the dimension of the space: in one dimension it is a point, in two dimensions it is a line, and in three dimensions it is a plane (a flat surface).

Regression Model Evaluation

Residual Sum of Squares (RSS) = Sum of Squared Errors (SSE). Total Sum of Squares (TSS) = SST.

Mean Squared Error (MSE): MSE measures the average squared error, with higher values indicating larger discrepancies between predicted and actual values. MSE penalizes larger errors more heavily due to squaring, making it sensitive to outliers; it is commonly used because of its mathematical properties but may be less interpretable than other metrics. Importance: MSE penalizes larger errors more than smaller ones due to squaring, and it is widely used in optimization and model training because it is differentiable, which is important for gradient-based methods. What it tells about the model: a lower MSE indicates a model with fewer large errors; it gives a sense of the average squared error, which emphasizes the impact of larger errors.

The common shape of the Mean Squared Error (MSE) graph, when plotted as a function of the model parameters, is typically a convex curve.

Mean Absolute Error (MAE) MAE measures the average magnitude of errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight. Importance : MAE is a straightforward measure of error magnitude. It is less sensitive to outliers compared to MSE and RMSE because it doesn’t square the errors. What It Tells About the Model : A lower MAE indicates a model that makes smaller errors on average . Since it uses absolute differences, it provides a clear indication of the typical size of the errors in the same units as the target variable.

Root Mean Squared Error (RMSE) RMSE is the square root of the average of squared differences between prediction and actual observation. It represents the standard deviation of the prediction errors . Importance : RMSE is the square root of MSE, bringing the metric back to the same units as the target variable . It is more sensitive to outliers than MAE due to the squaring of errors before averaging. What It Tells About the Model : A lower RMSE indicates better fit, similar to MSE but more interpretable in the context of the target variable's scale . It provides an idea of how large the errors are in absolute terms. Why RMSE is Considered as Standard Deviation of Prediction Errors If we assume that the prediction errors (residuals) are normally distributed with a mean of zero, then the RMSE provides an estimate of the standard deviation of this normal distribution. This is because, under the normal distribution, the standard deviation is a measure of the average distance of the data points from the mean, which in this case is zero.
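A tiny worked example of MAE, MSE, and RMSE on made-up predictions, showing RMSE pulled above MAE by the single larger error:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 12.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # back in the units of the target variable

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")   # RMSE > MAE because of the 2.0 error
```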

Residual Standard Error: the Residual Standard Error (RSE) is a measure used in regression analysis to quantify the typical size of the residuals (prediction errors) from a linear regression model. It provides an estimate of the standard deviation of the residuals, which helps in understanding how well the model fits the data. RSE is in the same units as the dependent variable, making it straightforward to interpret. Adjustment for predictors: unlike simpler measures such as RMSE, RSE accounts for the number of predictors in the model; this adjustment (using n − p − 1 in the denominator) helps prevent overfitting by penalizing models with more predictors. Model comparison: RSE allows the comparison of different models; when comparing models with the same dependent variable, a lower RSE indicates a better fit. Relative measure: while RSE itself does not provide an absolute goodness-of-fit measure, it is useful when comparing models to determine which one better explains the variability in the data.

Large RSE values may indicate a poor fit, suggesting that the model is not capturing all the relevant information in the data. Model Assessment : RSE helps assess the accuracy of a regression model. A lower RSE value indicates a model that better captures the data's variability. Predictive Accuracy : RSE provides insights into the model’s predictive accuracy, indicating how close the predicted values are to the actual values on average. Identification of Outliers or Influential Points : Large residuals can indicate outliers or influential points that may unduly affect the model's performance. By examining these cases closely, researchers can decide whether to include, exclude, or transform them to improve model fit. Detection of Heteroscedasticity : Heteroscedasticity occurs when the variability of the residuals is not constant across all levels of the predictor variables. RSE can help identify this issue, prompting researchers to explore transformations or alternative modeling techniques to address it.

Residual plot: a residual is a measure of how far a point is, vertically, from the regression line; simply, it is the error between a predicted value and the observed actual value. A typical residual plot has the residual values on the y-axis and the independent variable on the x-axis.
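A minimal matplotlib sketch of a residual plot for a simple linear fit on synthetic data:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 1, 200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)          # observed minus predicted

plt.scatter(X, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("independent variable")
plt.ylabel("residual")
plt.show()                                # a good fit shows no pattern around the zero line
```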

Heterogeneity in residuals" refers to the situation where the variability of the residuals is not consistent across all levels of the predictor variables. In other words, the spread or dispersion of residuals varies systematically with the values of one or more predictor variables.

Characteristics of a good residual plot:

Identifying whether the Error is high or low Scale of the Target Variable : If the target variable has a large range (e.g., house prices ranging from $100,000 to $1,000,000), an RMSE of $10,000 might be considered low. Conversely, for smaller ranges, such as predicting daily temperature, an RMSE of 10 degrees might be high. Industry Standards : Different fields have established benchmarks for acceptable error rates. For instance, in some financial models, an RMSE of a few dollars might be acceptable, while in other domains, such as temperature prediction, an RMSE of a few degrees could be too high. Historical Data : Compare the error values to those of previous models or known standards within the same domain. This helps in understanding the expected range of errors. Impact of Errors : Consider the practical implications of the error. For instance, in medical diagnostics, even small errors can be critical, whereas, in movie recommendation systems, higher errors might be more tolerable. Business Goals : Align the acceptable error rates with business goals and requirements. Sometimes, a slightly higher error might be acceptable if it results in significant cost savings or other benefits.

Residual Analysis

Coefficient analysis. H0: there is no relationship between X and Y; mathematically, this corresponds to testing H0: β1 = 0, in which case the model reduces to Y = β0 + ε and X is not associated with Y. To test the null hypothesis, we need to determine whether β̂1, our estimate for β1, is sufficiently far from zero that we can be confident that β1 is non-zero. How far is far enough?

These coefficients represent the estimated change in the dependent variable (response variable) for a one-unit change in the corresponding predictor variable , holding all other variables constant . For example, if the estimate for a predictor variable X1 is 0.5, it means that, on average, for each one-unit increase in X1, the dependent variable is estimated to increase by 0.5 units, assuming all other variables in the model remain constant.

Coefficient magnitude: look at the magnitude of the coefficients; larger coefficients imply a stronger relationship between the predictor variable and the response variable. For example, a coefficient of 2 means that a one-unit increase in the predictor variable is associated with a two-unit change in the response variable. Coefficient direction: determine the direction of the relationship between the predictor variable and the response variable; a positive coefficient indicates a positive relationship, meaning that as the predictor variable increases the response variable also tends to increase, whereas a negative coefficient suggests a negative relationship, where an increase in the predictor variable is associated with a decrease in the response variable. Confounding variables: be aware of confounding variables or multicollinearity issues; if coefficients change substantially when adding or removing variables from the model, it could indicate that the variables are correlated with each other, leading to potential issues in interpretation.

Standard Error Understanding the standard error helps in assessing the stability and robustness of the model's parameter estimates The standard error provides an estimate of how much we would expect the coefficient estimates to vary from the true population parameters across different samples of the same size from the population

T Value also known as the t-statistic , is calculated as the ratio of the coefficient estimate to its standard error in regression analysis. the t-value represents the standardized deviation of the coefficient estimate from zero, expressed in terms of standard errors Why is it important? Significance Testing: t-value is used to conduct hypothesis tests on the coefficients. whether the corresponding predictor variable has a statistically significant effect on the response variable. This is essential for understanding which predictors are truly influential in the model Higher t-values indicate stronger evidence against the null hypothesis (that the coefficient is zero), suggesting that the corresponding predictor is more likely to be important in explaining the variation in the response variable Comparing t-values across different coefficients allows researchers to assess the relative importance of different predictors in the model Lower t-values across all coefficients may indicate that the model is not capturing important relationships between the predictors and the response variable.

P Value

The p-value is a probability that measures the strength of evidence against the null hypothesis in statistical hypothesis testing. If the p-value is less than the significance level, the coefficient is considered statistically significant. When interpreting p-values, it's essential to consider the chosen significance level (e.g., 0.05) and whether multiple comparisons are being made (which may require adjusting the significance level). A low p-value indicates strong evidence against the null hypothesis, suggesting that the coefficient estimate is statistically significant; a high p-value suggests weak evidence against the null hypothesis, indicating that the coefficient estimate is not statistically significant.
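A hedged statsmodels sketch that reproduces the coefficient, standard error, t-value, and p-value columns discussed above; the income/age/spend data are synthetic, constructed so that age has no real effect:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50, 10, 200), "age": rng.normal(40, 12, 200)})
df["spend"] = 0.5 * df["income"] + rng.normal(0, 5, 200)   # age has no real effect on spend

X = sm.add_constant(df[["income", "age"]])
model = sm.OLS(df["spend"], X).fit()
print(model.summary())          # coefficient, std error, t-value and p-value per predictor
```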

R² (coefficient of determination). Importance: indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a normalized metric, ranging from 0 to 1 (it can be negative if the model is worse than a horizontal line). What it tells about the model: a higher R² (closer to 1) means a better fit; it shows how well the independent variables explain the variance in the dependent variable. However, it does not provide information on the size of the errors.

Linear Regression, Classification: Underfitting and Overfitting

Residual Plot

Bias vs Variance trade off

Bias vs Variance trade off

(Figure: training data vs. testing data error.)

https://www.youtube.com/watch?v=BGlEv2CTfeg

Multicollinearity Testing

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, which can lead to unstable estimates of the regression coefficients and inflated standard errors due to Unreliable Estimates of Regression Coefficients : When predictor variables are highly correlated with each other, it becomes difficult for the regression model to determine the individual effect of each predictor on the outcome variable. As a result, the estimated regression coefficients may be unstable or have high standard errors. Uninterpretable Coefficients : In the presence of multicollinearity, the coefficients of the regression model may have counterintuitive signs or magnitudes, making their interpretation challenging or misleading. Inflated Standard Errors : Multicollinearity inflates the standard errors of the regression coefficients, which can lead to wider confidence intervals and less precise estimates of the coefficients' true values. Reduced Statistical Power : High multicollinearity reduces the statistical power of the regression model, making it less likely to detect significant relationships between predictor variables and the outcome variable, even if those relationships truly exist.

The Variance Inflation Factor (VIF) is a measure used to quantify the severity of multicollinearity in regression analysis
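A short VIF sketch with statsmodels on synthetic data where x2 is deliberately made nearly collinear with x1; the VIF > 5 (or 10) cutoff is the usual rule of thumb, not a fixed standard:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = df["x1"] * 0.95 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
df["x3"] = rng.normal(size=200)                                 # independent predictor

X = add_constant(df)
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.round(1))     # rule of thumb: VIF > 5 (or 10) signals multicollinearity
```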

Log loss, also known as logistic loss or cross-entropy loss, is a performance metric for classification models, particularly those that output probabilities for each class. Log loss quantifies the difference between the predicted probabilities and the actual class labels. For a binary classification problem it is defined as: LogLoss = −(1/N) Σ [y·log(p) + (1 − y)·log(1 − p)], summed over the N observations.

Interpretation Lower Log Loss : Indicates that the predicted probabilities are close to the actual class labels, suggesting a better model. Higher Log Loss : Indicates that the predicted probabilities are far from the actual class labels, suggesting a poorer model. What Log Loss Tells About the Model Probability Calibration : Log loss evaluates how well the predicted probabilities are calibrated with respect to the true outcomes. It penalizes both overconfident wrong predictions and underconfident correct predictions. Model Performance : It provides a nuanced measure of model performance, beyond just accuracy. While accuracy measures the fraction of correct predictions, log loss considers the confidence of those predictions. Handling Class Imbalance : Log loss can handle imbalanced classes better than accuracy because it takes the predicted probabilities into account, rather than just the final classification.
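A tiny log-loss illustration comparing well-calibrated predictions with confidently wrong ones; the labels and probabilities are made up:

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]

confident_and_right = [0.95, 0.05, 0.90, 0.85, 0.10]
confident_but_wrong = [0.05, 0.95, 0.10, 0.15, 0.90]

print("well calibrated  :", round(log_loss(y_true, confident_and_right), 3))
print("confidently wrong:", round(log_loss(y_true, confident_but_wrong), 3))
```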

Confusion matrix: evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. The confusion matrix provides a more insightful picture: not only the overall performance of a predictive model, but also which classes are being predicted correctly and incorrectly, and what types of errors are being made. The matrix can be represented as a table whose rows are the actual classes and whose columns are the predicted classes (true positives, false positives, false negatives, true negatives).

Precision and recall should be calculated for each class. Precision is based on the predictions; recall is based on the ground truth.

Accuracy gives an overall understanding of how well the model is performing, but it can be misleading if classes are imbalanced. Precision (always based on the predictions) is important when the cost of false positives is high; it helps assess the quality of positive predictions. Recall (sensitivity, always based on the ground truth) is crucial when capturing all actual positives is essential; it measures the model's ability to identify positive instances. F1 score provides a balance between precision and recall, especially when there is an uneven class distribution; it is a better measure of a model's performance when there is a trade-off between false positives and false negatives.


Classification report. Overall metrics: accuracy = 0.60 means that 60% of the total predictions were correct. Macro average: compute the metric independently for each class and then take the average of these metrics; it treats all classes equally, without considering the class distribution.

These macro average metrics provide an overall measure of model performance that treats all classes equally, regardless of their frequency in the dataset. In the weighted average, the performance metrics for each class are weighted by the number of instances in that class, giving more importance to classes with more instances.
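A small sketch of a confusion matrix and classification report on an imbalanced toy example, where the macro and weighted averages diverge; the labels are made up:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Imbalanced toy labels: class "1" is rare and poorly predicted
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))   # compare the macro avg and weighted avg rows
```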

The weighted average provides a more realistic measure of overall model performance by giving more importance to the classes with more instances. This is particularly useful in datasets with imbalanced class distributions, as it ensures that the performance metrics reflect the model's ability to correctly classify the more prevalent classes. How to use these results for model improvement: weighted averages might be significantly higher than macro averages, indicating that the model performs well on frequent classes but poorly on rare ones. Oversampling minority classes: use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to generate more samples for underrepresented classes. Undersampling majority classes: reduce the number of samples in the overrepresented classes to balance the class distribution. Class weights: modify the loss function to give higher weights to minority classes during training, encouraging the model to focus more on these classes.

Low precision, recall, and F1 scores for specific classes Class-Specific Data Augmentation : Create additional synthetic data or collect more real data for the poorly performing classes. Feature Engineering : Develop new features that may be more informative for the difficult classes. Class-Specific Models : Train separate models for each class or use ensemble methods that can better handle class-specific peculiarities. High performance on training data but low performance on certain test classes. Regularization : Apply techniques like L1/L2 regularization to prevent overfitting. Pruning Decision Trees : If using decision trees or random forests, prune the trees to reduce complexity and prevent overfitting. Cross-Validation : Use cross-validation to ensure that the model generalizes well across different subsets of the data.

Consistent low recall or precision across multiple classes in both macro and weighted averages. Hyperparameter tuning: use grid search or random search to find the optimal hyperparameters for your model. Ensemble methods: combine multiple models to leverage their strengths and mitigate individual weaknesses; methods like bagging, boosting, and stacking can improve overall performance. Regular updates: regularly update the model with new data to ensure it captures the most recent patterns and trends. If current improvements are insufficient, it might indicate the need for a different model architecture. Algorithm choice: experiment with different algorithms (e.g., switching from a decision tree to a gradient boosting machine or neural network) to find one that better captures the data patterns. Neural network layers: for deep learning models, adjust the network architecture (for example, the number and size of layers).

Practical Steps: Evaluate Metrics : Carefully analyze the precision, recall, and F1-score for each class. Compare macro and weighted averages to understand overall versus individual class performance. Diagnose Issues : Identify which classes are underperforming and why (e.g., lack of data, inherent difficulty). Implement Improvements : Choose and apply the appropriate techniques from the actions listed above based on your diagnosis. Regularly monitor the impact of these changes on your model's performance metrics. Iterate and Optimize : Continuously iterate on the model, using new data and feedback to further refine performance. Use tools like learning curves to understand the impact of more data or different algorithms.

Logistic Regression Model Evaluation

Deviance residuals in a logistic regression output provide detailed information about the fit of the model to individual data points and help identify potential outliers or issues with the model; a high maximum value compared to the other values might suggest that there are outliers or poorly fitted observations in the data. The deviance is a measure of the difference between a fitted model and the perfect model (also called the saturated model). The deviance for a logistic regression model can be divided into two parts: Null deviance: the deviance of a model with no predictors, only an intercept; it serves as a baseline to compare with the fitted model. Residual deviance: the deviance of the fitted model with the predictors included.

Ridge

Lasso

There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.

What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different f̂. Ideally the estimate for f should not vary too much between training sets; however, if a method has high variance then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance.

Estimating the population mean µ of a random variable Y: how far off will a single estimate µ̂ be? The standard error of µ̂ answers this question (compare the residual standard error in regression). Standard errors can be used to compute confidence intervals.

https://www.youtube.com/watch?v=7WPfuHLCn_k&t=427s https://www.youtube.com/watch?v=-H5tcISshKg