Building Machine Learning System for Machine Learning Course


About This Presentation

A pipeline for building a machine learning system.


Slide Content

DTS304TC: Machine Learning, Lecture 5: Building a Machine Learning System. Dr Kang Dang, D-5032, Taicang Campus. Email: [email protected], Tel: 88973341.

Machine Learning Pipeline. Machine learning involves a comprehensive workflow, not just training models.

Q & A: In practical machine learning roles, what percentage of time do you think is typically spent on data preparation and feature engineering? (A) 20% (B) 40% (C) 60% (D) 80%

Data Preparation and Feature Engineering. "The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering." — Luca Massaron

Q & A: How would you handle missing values in a table? Fill with zeros or use other methods? What issues might arise from filling with zeros?

Different Types of Missing Values. The three main types are Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). (Reference: "3 Main Types of Missing Data | Do THIS Before Handling Missing Values!" – YouTube)

Missing Value Imputation

Mean/Median/Mode Imputation
- Missing data nature: confirmed as Missing Completely at Random (MCAR).
- Extent of missing data: limited to a maximum of 5% per variable.
- Imputation technique for categorical variables: use mode imputation (the most frequent category).
- Imputation data source: calculate the mean, median, or mode exclusively from the training dataset to prevent data leakage and maintain validation/test set integrity.
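
A minimal sketch of this setup, assuming scikit-learn and a small pandas DataFrame with hypothetical "age" and "pet" columns: the imputers are fitted on the training split only and then applied to the test split, so no test-set statistics leak into training.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test frames containing missing values
train = pd.DataFrame({"age": [25, None, 40, 31], "pet": ["cat", "dog", None, "cat"]})
test = pd.DataFrame({"age": [None, 52], "pet": ["dog", None]})

num_imputer = SimpleImputer(strategy="mean")           # mean for the numeric feature
cat_imputer = SimpleImputer(strategy="most_frequent")  # mode for the categorical feature

# Fit on the training data only, then transform both splits (prevents leakage)
train[["age"]] = num_imputer.fit_transform(train[["age"]])
test[["age"]] = num_imputer.transform(test[["age"]])
train[["pet"]] = cat_imputer.fit_transform(train[["pet"]])
test[["pet"]] = cat_imputer.transform(test[["pet"]])
```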

Regression Imputation – MissForest. Another great application of random forests! It assumes the data are Missing at Random (MAR) and uses the entire dataset's information for imputation, making the imputed values more accurate than simple mean/median/mode imputation.

Regression Imputation – MissForest: Iterative Approach
- First, fill missing values with a simple method (e.g., the mean).
- Pick one column with missing data, train a random forest on the available data, and predict that column's missing values.
- Move to the next column and repeat the process.
- Continue this cycle until the imputed values stop changing significantly, or after 5-6 rounds.
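
A minimal sketch of this iterative, forest-based imputation, approximated here with scikit-learn's experimental IterativeImputer wrapped around a RandomForestRegressor (an approximation of the MissForest idea, not the original R package; the tiny array is illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [7.0, 8.0]])

# Initial fill uses a simple statistic; then each column with missing values is
# predicted from the other columns by a random forest, cycling for max_iter rounds
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    initial_strategy="mean",
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```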

MissForest vs Zero or Mean Imputation. If computational resources are not a limitation, prefer MissForest over simple imputations like zero or mean, which can distort the dataset's original distribution.

Q & A: Suppose I train a KNN classifier without scaling the features. For instance, one feature ranges from -1000 to 1000, while another ranges from -0.001 to 0.001. What potential issues could arise?

Feature Scaling Examples - KNN. Without normalization, the nearest neighbors are dominated by the feature with the larger range (x2), leading to incorrect classification.

Feature Scaling Examples - KNN. Feature scaling can lead to a completely different model in terms of its decision boundary.

Feature Scaling
- Use when different numeric features have different scales (different ranges of values): features with much larger values may overpower the others.
- Goal: bring them all into the same range.
- Especially important for the following models:
  - KNN: distances are dominated by the features with larger values.
  - SVMs: (kernelized) dot products are also based on distances.
  - Linear models: feature scale affects regularization, and scaled features converge faster.
A sketch of the effect on KNN follows below.
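
A minimal sketch of the effect, assuming scikit-learn; the synthetic data and the StandardScaler choice are illustrative, not from the slides. The same KNN classifier is fitted with and without scaling on features whose ranges differ by several orders of magnitude.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; blow up one feature's range to mimic the -1000..1000 vs -0.001..0.001 example
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X[:, 0] *= 1000
X[:, 1] *= 0.001

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unscaled = KNeighborsClassifier().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

print("KNN without scaling:", unscaled.score(X_test, y_test))
print("KNN with scaling:   ", scaled.score(X_test, y_test))
```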

Feature Scaling

But how do we handle feature scaling with outliers? Question: what is the median? What is the 75th percentile?
Robust Scaler: reduces the influence of outliers on scaling by centering with the median and scaling with the IQR:
x_scaled = (x - median) / IQR
Use it when outliers are present and need to be mitigated. IQR calculation: IQR = Q3 - Q1, the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data.
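
A minimal sketch using scikit-learn's RobustScaler; the tiny array with an extreme outlier is illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an extreme outlier (1000) that would distort min-max or mean/std scaling
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaler = RobustScaler()               # centers with the median, scales with the IQR
X_scaled = scaler.fit_transform(X)

print(scaler.center_, scaler.scale_)  # the median and IQR learned from the data
print(X_scaled.ravel())               # the outlier no longer squashes the other values
```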

Q & A: Suppose you have a dataset with categorical features, such as 'dog' and 'cat'. Logistic regression, however, cannot directly handle categorical features. To make these features compatible with the model, we might encode 'dog' as 0 and 'cat' as 1. Is this a good approach? Why or why not?

Categorical Feature Encoding – Ordinal Encoding
- Example: "Jan, Feb, Mar, Apr".
- Simply assigns an integer value to each category in the order they are encountered.
- Only really useful if there exists a natural order among the categories: the model will consider one category to be 'higher' than or 'closer' to another.

Categorical Feature Encoding – One-Hot Encoding
- One-hot encoding (dummy encoding), e.g. for "Cat, Dog, ...".
- Simply adds a new 0/1 feature for every category, set to 1 (hot) if the sample has that category.
- Can explode if a feature has many values, causing issues with high dimensionality.
- What if the test set contains a category not seen in the training data? Either ignore it (use all 0's in that row) or handle it manually (e.g. imputation). See the sketch below.
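
A minimal sketch of both encoders with scikit-learn; the toy column is illustrative. handle_unknown="ignore" implements the "all 0's" option for categories that appear only in the test set.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

train = np.array([["cat"], ["dog"], ["cat"]])
test = np.array([["dog"], ["rabbit"]])   # 'rabbit' never appears in training

# Ordinal encoding: only sensible when the categories have a natural order
ordinal = OrdinalEncoder().fit(train)
print(ordinal.transform(train).ravel())

# One-hot encoding: one 0/1 column per category; unseen categories become all zeros
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False)
onehot = OneHotEncoder(handle_unknown="ignore", sparse_output=False).fit(train)
print(onehot.transform(test))            # the 'rabbit' row is [0. 0.]
```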

Model Validation Scheme
- Always evaluate models as if they were predicting future data.
- We do not have access to future data, so we pretend that some data is hidden.
- Simplest way: the holdout (simple train-val-test split), if the dataset is sufficiently large.
- Randomly split the data (and corresponding labels) into training, validation and test sets (e.g. 60%-20%-20%).
- Train (fit) a model on the training data, tweak it on the validation data, then score it on the test data. A sketch follows below.
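
A minimal sketch of a 60/20/20 holdout split using two calls to scikit-learn's train_test_split; the iris dataset is a placeholder, and the fractions follow the slide.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # placeholder dataset

# First carve out 20% as the test set, then 25% of the remaining 80% as validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%
```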

Q & A: What are the issues with a simple train-val-test split when the dataset is really small?

K-Fold Cross-Validation
- Each random split can yield very different models (and scores), e.g. all the easy (or hard) examples could end up in the test set.
- Split the data into k equal-sized parts, called folds.
- Create k splits, each time using a different fold as the test set.
- Compute k evaluation scores and aggregate them afterwards (e.g. take the mean).
- Examine the score variance to see how sensitive (unstable) the models are.
- A large k gives better estimates (more training data), but is expensive. See the sketch below.
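
A minimal sketch of 5-fold cross-validation with scikit-learn; the logistic-regression model and iris data are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # one accuracy score per fold

print(scores)                         # the k evaluation scores
print(scores.mean(), scores.std())    # aggregate, and check the variance
```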

K-Fold Cross-Validation for Hyperparameter Tuning. After we obtain the best hyperparameters (models) using cross-validation, we evaluate the chosen model on a separate test set. In our coursework we use a simple train-val-test split for simplicity, but you can also try this as an additional technique.
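
A minimal sketch of this pattern with scikit-learn's GridSearchCV; the SVM model, parameter grid, and data are placeholders. Cross-validation selects the hyperparameters on the training portion, and the held-out test set is used only once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV on the training data only
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Test-set score:", search.score(X_test, y_test))   # final, held-out estimate
```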

K-Fold Cross-Validation for Model Ensembling. We can create a model ensemble using K-fold cross-validation: train one model per fold and combine their predictions. This is one of the most commonly used tricks in Kaggle competitions! A sketch follows below.
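
A minimal sketch of one common variant of this trick (an assumption about which ensembling scheme the slide means): train one model per fold and average the K models' predicted probabilities on the test set. The gradient-boosting model and dataset are placeholders.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)     # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
test_preds = []
for fold_idx, (tr, va) in enumerate(kf.split(X_train)):
    model = GradientBoostingClassifier(random_state=fold_idx)
    model.fit(X_train[tr], y_train[tr])         # one model per fold
    test_preds.append(model.predict_proba(X_test)[:, 1])

# Average the five models' probabilities, then threshold for the final prediction
ensemble_pred = (np.mean(test_preds, axis=0) > 0.5).astype(int)
print("Ensemble accuracy:", (ensemble_pred == y_test).mean())
```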

Model Evaluation. With a positive and a negative class, there are two different kinds of errors:
- False positive: the model predicts positive while the true label is negative.
- False negative: the model predicts negative while the true label is positive.

Q & A: Suppose someone has cancer but was not diagnosed (a missed detection). Suppose someone was healthy but was diagnosed with cancer (a false detection). What are the consequences of each? Which situation is more serious?

Binary Model Evaluation – Confusion Matrix
- We can represent all predictions (correct and incorrect) in a confusion matrix: an n-by-n array, where n is the number of classes.
- Rows correspond to true classes, columns to predicted classes.
- Count how often samples belonging to a class C are classified as C or as any other class.
- For binary classification, we label these true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP). See the sketch below.
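
A minimal sketch with scikit-learn; the label vectors are illustrative. confusion_matrix follows the same convention of rows = true classes, columns = predicted classes.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()     # binary case: [[TN, FP], [FN, TP]]
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```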

Binary Model Evaluation – Precision, Recall and F1
- Precision = TP / (TP + FP): use when the goal is to limit false positives.
  - Clinical trials: you only want to test drugs that really work.
  - Search engines: you want to avoid bad search results.
- Recall = TP / (TP + FN): use when the goal is to limit false negatives.
  - Cancer diagnosis: you don't want to miss a serious disease.
  - Search engines: you don't want to omit important hits.
- F1-score: trades off precision and recall: F1 = 2 * (precision * recall) / (precision + recall). See the sketch below.
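
A minimal sketch computing these metrics with scikit-learn, reusing the illustrative labels from the confusion-matrix example.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of the two
```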

Multi-class Evaluation
- Train models per class: treat one class as positive and the other(s) as negative, then calculate the metrics per class to get a per-class evaluation score.
- Micro-averaging: count the total TP, FP, TN, FN (every sample equally important).
- Macro-averaging: average the scores obtained on each class; preferable for imbalanced classes (if all classes are equally important). Macro-averaged recall is also called balanced accuracy.
- Weighted averaging: weight each class's score by its number of samples. See the sketch below.
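
A minimal sketch of the different averaging modes with scikit-learn; the three-class labels are illustrative.

```python
from sklearn.metrics import f1_score, recall_score, balanced_accuracy_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

print("Per-class F1:", f1_score(y_true, y_pred, average=None))
print("Micro F1:    ", f1_score(y_true, y_pred, average="micro"))
print("Macro F1:    ", f1_score(y_true, y_pred, average="macro"))
print("Weighted F1: ", f1_score(y_true, y_pred, average="weighted"))

# Macro-averaged recall equals balanced accuracy
print(recall_score(y_true, y_pred, average="macro"),
      balanced_accuracy_score(y_true, y_pred))
```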

Summary. We discussed various feature engineering techniques, including feature scaling, missing value imputation, outlier handling, and categorical feature encoding. We also discussed the model selection and evaluation procedure, specifically cross-validation and evaluation metrics.