When Models Meet Data: From ancient science to today's Artificial Intelligence, data is the change

ssuserbbbef4 22 views 25 slides Feb 27, 2025

About This Presentation

A presentation about Data and Machine Learning


Slide Content

When Models Meet Data
Dr. R. THANGARAJAN, Professor and Head, Department of Information Technology, Kongu Engineering College, Perundurai – 638 060, Erode.

Introduction: Data, Models and Learning
Three major components of machine learning:
- Data
- Models
- Learning
How do you define a good model? A good model is one that performs well on unseen data.

Data as Vectors

Name       Gender  Degree  Postcode  Age  Annual salary
Aditya     M       MSc     W21BG     36   89,563
Bob        M       PhD     EC1A1BA   47   123,543
Chloé      F       BEcon   SW1A1BH   26   23,989
Daisuke    M       BSc     SE207AT   68   138,769
Elisabeth  F       MBA     SE10AA    33   113,888

Gender ID  Degree  Latitude (degrees)  Longitude (degrees)  Age  Annual salary (thousands)
-1         2       51.5073             0.1290               36   89.563
-1         3       51.5074             0.1275               47   123.543
+1         1       51.5071             0.1278               26   23.989
-1         1       51.5075             0.1281               68   138.769
+1         2       51.5074             0.1278               33   113.888

Each input $x_n$ is a D-dimensional vector of real numbers, which are called features, attributes, or covariates. The subscript n indicates that this is the n-th example out of a total of N examples in the dataset. Each column represents a particular feature of interest about the example, and we index the features as d = 1, …, D. Recall that data is represented as vectors, which means that each example (each data point) is a D-dimensional vector. A dataset is written as a set of example-label pairs. For instance, x represents age and y represents salary in our toy example. In short, we can write: $\{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\}$
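
As a rough illustration of this notation, the numeric table from the "Data as Vectors" slide can be stored as an array with one row per example. The variable names (`table`, `X`, `y`, `dataset`) and the choice of salary as the label are assumptions made for this sketch, not something the slides prescribe.

```python
import numpy as np

# One row per example; columns are gender ID, degree, latitude,
# longitude, age, annual salary (in thousands), as in the numeric table.
table = np.array([
    [-1, 2, 51.5073, 0.1290, 36, 89.563],
    [-1, 3, 51.5074, 0.1275, 47, 123.543],
    [+1, 1, 51.5071, 0.1278, 26, 23.989],
    [-1, 1, 51.5075, 0.1281, 68, 138.769],
    [+1, 2, 51.5074, 0.1278, 33, 113.888],
])

X = table[:, :5]   # features: each row is one D-dimensional example x_n
y = table[:, 5]    # labels: annual salary (assumed to be the target here)
N, D = X.shape     # N examples, D features

# Toy example from the slide: x is age, y is salary.
dataset = list(zip(table[:, 4], y))  # [(x_1, y_1), ..., (x_N, y_N)]
```

Here `X[n]` is the example $x_n$ and `X[:, d]` is one feature across all examples.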

Predicting Salary from Age
Better representations for data as vectors:
- Finding lower-dimensional approximations of the original feature vector.
- Using nonlinear higher-dimensional combinations of the original feature vector.

In Dimensionality Reduction, we will see an example of finding a low-dimensional approximation of the original data space by finding the principal components. Finding principal components is closely related to the concepts of eigenvalue and singular value decomposition. For the high-dimensional representation, we will see an explicit feature map that allows us to represent inputs using a higher-dimensional representation. The main motivation for higher-dimensional representations is that we can construct new features as non-linear combinations of the original features, which in turn may make the learning problem easier. Feature maps are used in Polynomial Regression, and the same kind of feature map leads to a kernel in Support Vector Machines (SVM). In recent years, deep learning methods (Goodfellow et al., 2016) have shown promise in using the data itself to learn good new features and have been very successful in areas such as computer vision, speech recognition, and natural language processing.
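
To make the idea of an explicit feature map concrete, here is a minimal sketch of a polynomial feature map for scalar inputs. The function name `poly_features` and the chosen degree are illustrative assumptions; the slides do not specify a particular map.

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x (shape (N,)) to rows [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    # np.vander with increasing=True puts the lowest power in the first column.
    return np.vander(x, N=degree + 1, increasing=True)

# Three scalar inputs become three 3-dimensional feature vectors.
Phi = poly_features([1.0, 2.0, 3.0], degree=2)
```

A model that is linear in the columns of `Phi` is non-linear in the original input, which is the sense in which the higher-dimensional representation may make learning easier.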

Models as Functions
A predictive function: $f : \mathbb{R}^D \to \mathbb{R}$
We consider the special case of linear functions (the non-probabilistic case): $f(x) = \theta^\top x + \theta_0$

We often consider data to be noisy observations of some true underlying effect, and hope that by applying machine learning we can separate the signal from the noise. This requires a language for quantifying the effect of noise. We would often also like to have predictors that express some sort of uncertainty, e.g., to quantify the confidence we have in the value of the prediction for a particular test data point. This motivates the use of probability and probabilistic graphical models.

Learning is Finding Parameters
The goal of learning is to find a model and its corresponding parameters such that the resulting predictor will perform well on unseen data. There are conceptually three distinct algorithmic phases when discussing machine learning algorithms:
- Prediction (with a predictor function) or inference (with a probabilistic predictor).
- Training or parameter estimation (Empirical Risk Minimization vs. Maximum Likelihood Estimation).
- Hyperparameter tuning or model selection (for example, the choice of the number of components or of the class of probability distribution).

Some More Terms to Define…
We are interested in learning a model based on data such that it performs well on future data. It is not enough for the model to fit only the training data well; the predictor needs to perform well on unseen data. We simulate the behavior of our predictor on future unseen data using cross-validation. To perform well on unseen data, we need to balance fitting the training data well against finding “simple” explanations of the phenomenon. This trade-off is achieved using regularization or by adding a prior. This process is called abduction. According to the Stanford Encyclopedia of Philosophy, abduction is the process of inference to the best explanation. (You may watch the movie “AI Abduction”.)

Empirical Risk Minimization
Now, we consider the case of a predictor that is a function. We describe the idea of empirical risk minimization, which was originally popularized by the proposal of the support vector machine. However, its general principles are widely applicable and allow us to ask what learning is without explicitly constructing probabilistic models. There are four design choices:
- What is the set of functions we allow the predictor to take?
- How do we measure how well the predictor performs on the training data?
- How do we construct predictors from only training data that perform well on unseen test data?
- What is the procedure for searching over the space of models?

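Hypothesis Class
Functions $f(\cdot, \theta) : \mathbb{R}^D \to \mathbb{R}$, parameterized by $\theta$.
We hope to be able to find a good parameter $\theta^*$ such that we fit the data well, that is, $f(x_n, \theta^*) \approx y_n$ for all $n = 1, \ldots, N$.
We use the notation $\hat{y}_n = f(x_n, \theta^*)$ to indicate the output of the predictor.
We consider an example with $x_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}$. The predictor is a linear function $f(x_n, \theta) = \theta_0 + \sum_{d=1}^{D} \theta_d x_n^{(d)}$ or, with a unit feature $x_n^{(0)} = 1$ prepended to $x_n$, $f(x_n, \theta) = \sum_{d=0}^{D} \theta_d x_n^{(d)}$. We can summarize as: $f(x_n, \theta) = \theta^\top x_n$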
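
The two ways of writing a linear predictor, with a separate offset or as a single inner product after prepending a constant feature 1, can be checked numerically. This is a small sketch under assumed names (`predict_affine`, `predict_augmented`) and made-up parameter values; it is not taken from the slides.

```python
import numpy as np

def predict_affine(X, theta, theta0):
    """Predictions theta^T x_n + theta_0 for every row x_n of X."""
    return X @ theta + theta0

def predict_augmented(X, theta_aug):
    """Same predictor, written as one inner product with a unit feature prepended."""
    ones = np.ones((X.shape[0], 1))
    X_aug = np.hstack([ones, X])      # each row becomes [1, x_n]
    return X_aug @ theta_aug          # theta_aug = [theta_0, theta_1, ..., theta_D]

# Illustrative values only: one feature (age) and arbitrary parameters.
X = np.array([[36.0], [47.0], [26.0]])
theta = np.array([2.0])
theta0 = 10.0

y_hat_affine = predict_affine(X, theta, theta0)
y_hat_augmented = predict_augmented(X, np.concatenate([[theta0], theta]))
```

Both calls return the same predictions, which is why the offset is often absorbed into the parameter vector.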

Loss Function for Training
Consider the label $y_n$ for a particular example and the corresponding prediction $\hat{y}_n$ that we make based on $x_n$. To define what it means to fit the data well, we need to specify a loss function $\ell(y_n, \hat{y}_n)$ that takes the ground truth label and the prediction as input and produces a non-negative number (referred to as the loss) representing how much error we have made. Our goal for finding a good parameter vector $\theta^*$ is to minimize the average loss on the set of N training examples. We further assume that the examples $(x_1, y_1), \ldots, (x_N, y_N)$ are independent and identically distributed (i.i.d.), i.e. there is no statistical dependence between any pair of examples.

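Given an input matrix $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}$ and a label vector $y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$, we can find the empirical risk as follows: $R_{\text{emp}}(f, X, y) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n)$, where $\hat{y}_n = f(x_n, \theta)$.
We take the case of the least-squares loss, $\ell(y_n, \hat{y}_n) = (y_n - \hat{y}_n)^2$, which gives: $\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} (y_n - f(x_n, \theta))^2$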

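In matrix notation, for the linear predictor $f(x_n, \theta) = \theta^\top x_n$ with the examples stacked as the rows of $X$: $\min_{\theta} \frac{1}{N} \lVert y - X\theta \rVert^2$
We are not interested in a predictor that only performs well on the training data. Instead, we seek a predictor that performs well (has low risk) on unseen test data. More formally, we are interested in finding a predictor f (with parameters fixed) that minimizes the expected risk: $R_{\text{true}}(f) = \mathbb{E}_{x, y}[\ell(y, f(x))]$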
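
A minimal sketch of empirical risk minimization with the least-squares loss follows. The synthetic data, the "true" parameter vector and the function name `empirical_risk` are assumptions for illustration; `np.linalg.lstsq` is used here as one convenient way to minimize the squared error for a linear predictor.

```python
import numpy as np

def empirical_risk(X, y, theta):
    """Average squared loss (1/N) * sum_n (y_n - theta^T x_n)^2."""
    residuals = y - X @ theta
    return float(np.mean(residuals ** 2))

# Synthetic data (assumed): linear signal plus a little Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=50)

# Least-squares solution: minimizes ||y - X theta||^2 over theta.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

risk_at_estimate = empirical_risk(X, y, theta_hat)
risk_at_truth = empirical_risk(X, y, theta_true)
```

On the training set, `risk_at_estimate` can never exceed `risk_at_truth`, because `theta_hat` minimizes the empirical risk. This is exactly why a low training risk alone says little about the expected risk.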

Regularization to Reduce Overfitting
Regularized least squares: $\min_{\theta} \frac{1}{N} \lVert y - X\theta \rVert^2 + \lambda \lVert \theta \rVert^2$
The additional term $\lVert \theta \rVert^2$ is called the regularizer, and the parameter $\lambda$ is the regularization parameter. The regularizer is also called the penalty term. The regularization parameter trades off minimizing the loss on the training set against keeping the magnitude of the parameters small.
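
The effect of the regularization parameter can be sketched as follows, assuming the objective $\frac{1}{N}\lVert y - X\theta\rVert^2 + \lambda\lVert\theta\rVert^2$. Setting its gradient to zero gives $(X^\top X + N\lambda I)\theta = X^\top y$, which the hypothetical helper `ridge_fit` solves directly. The data are synthetic.

```python
import numpy as np

def regularized_risk(X, y, theta, lam):
    """(1/N) * ||y - X theta||^2 + lam * ||theta||^2."""
    return float(np.mean((y - X @ theta) ** 2) + lam * np.sum(theta ** 2))

def ridge_fit(X, y, lam):
    """Minimizer of regularized_risk, from (X^T X + N*lam*I) theta = X^T y."""
    N, D = X.shape
    A = X.T @ X + N * lam * np.eye(D)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (assumed) for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=30)

theta_ols = ridge_fit(X, y, lam=0.0)    # no regularization
theta_ridge = ridge_fit(X, y, lam=1.0)  # strong regularization
```

With a larger `lam` the fitted parameters shrink toward zero, at the cost of a higher loss on the training set.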

Cross-Validation to Assess the Generalization Performance
We estimate the generalization error by applying the predictor to test data. This data is also sometimes referred to as the validation set. The validation set is a subset of the available training data that we keep aside. A practical issue with this approach is that the amount of data is limited, and ideally we would use as much of the available data as possible to train the model. This would require us to keep our validation set small, which would then lead to a noisy estimate (with high variance) of the predictive performance. One solution is to use K-fold cross-validation.

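K-Fold Cross-Validation
The data is split into K chunks; K − 1 of them form the training set and the remaining chunk is the validation set, and this is repeated so that each chunk serves as the validation set once. Cross-validation is an embarrassingly parallel task, because the K training runs are independent of each other. Given sufficient computing resources (e.g., cloud computing, server farms), cross-validation takes no longer than a single performance assessment.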
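
A minimal NumPy sketch of K-fold cross-validation is given below. The function names (`k_fold_cv`, `fit_least_squares`, `mse`), the random shuffling and the synthetic data are assumptions for illustration; in practice a library routine would usually be used instead.

```python
import numpy as np

def k_fold_cv(X, y, K, fit, risk, seed=0):
    """Average validation risk over K folds, plus the per-fold risks."""
    N = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(N), K)  # K disjoint index chunks
    scores = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        theta = fit(X[train_idx], y[train_idx])           # train on K-1 chunks
        scores.append(risk(X[val_idx], y[val_idx], theta))  # validate on the rest
    return float(np.mean(scores)), scores

def fit_least_squares(X, y):
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def mse(X, y, theta):
    return float(np.mean((y - X @ theta) ** 2))

# Synthetic data (assumed) for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = X @ np.array([3.0, -1.0]) + 0.1 * rng.normal(size=60)

cv_risk, fold_risks = k_fold_cv(X, y, K=5, fit=fit_least_squares, risk=mse)
```

Each iteration of the loop is independent of the others, so the K fits could be dispatched to separate workers.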

Parameter Estimation
How are probability distributions used to explicitly model the uncertainty in our observations and parameters?
- We introduce the likelihood, which is analogous to the loss function.
- The concept of priors is analogous to regularization in the minimization of loss functions.
In Maximum Likelihood (ML) Estimation, we would like to maximize the likelihood of the data as a function of the parameters $\theta$. The trick is to minimize the negative log-likelihood $\mathcal{L}_x(\theta) = -\log p(x \mid \theta)$, which is equivalent to maximizing the likelihood. Maximum a Posteriori (MAP) Estimation additionally incorporates prior knowledge about the parameters.

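Maximum Likelihood (ML) Estimation
For a given random variable $x$ and a probability distribution $p(x \mid \theta)$ parametrized by $\theta$, the negative log-likelihood is $\mathcal{L}_x(\theta) = -\log p(x \mid \theta)$.
As an example, we consider the simplest case of a Gaussian distribution: $p(y_n \mid x_n, \theta) = \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2)$
Since we assume that the examples are i.i.d., the likelihood of the whole dataset is just the product of the likelihoods of the individual examples: $p(y \mid X, \theta) = \prod_{n=1}^{N} p(y_n \mid x_n, \theta)$
And therefore, $\mathcal{L}(\theta) = -\log p(y \mid X, \theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta)$
Now, when we maximize the likelihood over the parameter $\theta$, we get: $\theta_{\text{ML}} = \arg\max_{\theta} p(y \mid X, \theta) = \arg\min_{\theta} \mathcal{L}(\theta)$

After substituting the values: $\mathcal{L}(\theta) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top \theta)^2 - \sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}}$
For a constant variance $\sigma^2$, the second term is independent of $\theta$, so minimizing the loss $\mathcal{L}(\theta)$ is equivalent to least-squares minimization.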
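
The link between Gaussian maximum likelihood and least squares can be checked numerically. The sketch below assumes a linear-Gaussian model with known noise level `sigma` and synthetic data; `gaussian_nll` is a hypothetical helper that evaluates the negative log-likelihood.

```python
import numpy as np

def gaussian_nll(X, y, theta, sigma):
    """Negative log-likelihood of y under N(y_n | x_n^T theta, sigma^2), i.i.d."""
    N = y.shape[0]
    sse = np.sum((y - X @ theta) ** 2)
    # First term depends on theta; second term is constant in theta.
    return float(sse / (2 * sigma**2) + 0.5 * N * np.log(2 * np.pi * sigma**2))

# Synthetic data (assumed) for illustration.
rng = np.random.default_rng(0)
sigma = 0.5
X = rng.normal(size=(40, 2))
y = X @ np.array([1.5, -0.7]) + sigma * rng.normal(size=40)

# The least-squares solution ...
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# ... also minimizes the negative log-likelihood: moving away increases it.
nll_at_ls = gaussian_nll(X, y, theta_ls, sigma)
nll_nearby = gaussian_nll(X, y, theta_ls + np.array([0.1, -0.1]), sigma)
```

Because only the sum of squared errors depends on `theta`, the value of `sigma` changes the height of the curve but not the location of its minimum.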

Maximum a Posteriori (MAP) Estimation
Analogous to regularization, if we have prior knowledge about the distribution of the parameters $\theta$, we can multiply the likelihood by an additional term. This additional term is a prior probability distribution $p(\theta)$ on the parameters. This is derived from Bayes' theorem: $p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$
Recall that we are interested in finding the parameter $\theta$ that maximizes the posterior. Since the distribution $p(x)$ does not depend on $\theta$, we can ignore the value of the denominator for the optimization and obtain: $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$
The parameter that maximizes this expression is called the MAP estimate.
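
The analogy with regularization can be made explicit in one special case. The derivation below assumes a Gaussian likelihood with variance $\sigma^2$ and a zero-mean Gaussian prior $p(\theta) = \mathcal{N}(0, b^2 I)$. The prior and the symbol $b$ are assumptions chosen for this sketch; the slides do not fix a particular prior.

```latex
\begin{align*}
\theta_{\text{MAP}}
  &= \arg\max_{\theta} \; p(y \mid X, \theta)\, p(\theta) \\
  &= \arg\min_{\theta} \; -\log p(y \mid X, \theta) - \log p(\theta) \\
  &= \arg\min_{\theta} \; \frac{1}{2\sigma^2} \lVert y - X\theta \rVert^2
       + \frac{1}{2b^2} \lVert \theta \rVert^2 + \text{const} \\
  &= \arg\min_{\theta} \; \frac{1}{N} \lVert y - X\theta \rVert^2
       + \lambda \lVert \theta \rVert^2,
  \qquad \lambda = \frac{\sigma^2}{N b^2}.
\end{align*}
```

Under these assumptions the MAP estimate coincides with regularized least squares, and a narrower prior (smaller $b$) corresponds to a larger regularization parameter.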

To express things diagrammatically,

Model Fitting

Thank You