M4L18 Unsupervised and Semi-Supervised Learning - Slides v2.pdf


About This Presentation

Unsupervised and Semi-Supervised Learning


Slide Content

Introduction

Spectrum of Low-Labeled Learning
Supervised Learning
⬣ Train input: X, Y
⬣ Learning output: f : X → Y, P(y|x)
⬣ e.g. classification (sheep, dog, cat, lion, giraffe)
Unsupervised Learning
⬣ Input: X
⬣ Learning output: P(X)
⬣ Examples: clustering, density estimation, etc.
Fewer Labels →
⬣ Semi-Supervised Learning (10+ labels/category + unlabeled data)
⬣ Few-Shot Learning (1-5 labels/category, no unlabeled data)
⬣ Self-Supervised Learning (no labels for representation learning)
These are just common settings, and they can be combined!
⬣ E.g. semi-supervised few-shot learning

What to Learn?
Traditional unsupervised learning methods, and their deep learning counterparts (similar goals, but from a neural network/learning perspective):
⬣ Modeling P(x): density estimation → deep generative models
⬣ Comparing/grouping: clustering → metric learning & clustering
⬣ Representation learning: principal component analysis → almost all deep learning!

Dealing with Unlabeled Data
Labeled data: train with cross-entropy. Unlabeled data: ?
Several considerations:
⬣ Loss function (especially for unlabeled data)
⬣ Optimization / training procedure
⬣ Architecture
⬣ Transfer learning?

Common Key Ideas
⬣ Pseudo-labeling for unlabeled data (semi-supervised learning): cross-entropy on hard labels, or (soft) knowledge distillation
⬣ Cross-view/augmentation & consistency
⬣ Surrogate tasks (self-supervised learning)
⬣ Meta-learning (few-shot learning)
Figures adapted from: Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence; Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning, ICML 2019 Tutorial; Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations

Semi-Supervised Learning

Semi-Supervised Learning
Key question: can we overcome a small amount of labeled data using unlabeled data?
It is often much cheaper (in terms of cost, time, etc.) to get large-scale unlabeled datasets.
(Setting: somewhat few labeled examples, lots of unlabeled examples.)

Ideal Performance of Semi-Supervised Learning
(Figure: ideal performance compared to past work and recent methods (sometimes).)
It is often much cheaper (in terms of cost, time, etc.) to get large-scale unlabeled datasets.
Ideally we would like to improve performance all the way to the highly-labeled case.

An Old Idea: Predictions of Multiple Views
⬣ Simple idea: learn a model on labeled data, make predictions on unlabeled data, add these as new training data, repeat
⬣ Combine this idea with co-training: predicting across multiple views
⬣ Blum & Mitchell, Combining Labeled and Unlabeled Data with Co-Training, 1998!

Pseudo-Labeling
Normal training (cross-entropy) for labeled data; pseudo-labeling for unlabeled data.
Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

Pseudo-Labeling in Practice
(FixMatch diagram: labeled examples are weakly augmented and trained with cross-entropy; for unlabeled examples, a weakly augmented view produces a prediction used as a pseudo-label, and a strongly augmented view is trained with cross-entropy against that pseudo-label.)
Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

Details and Hyper-Parameters
Several details:
⬣ Labeled batch size of 64
⬣ Unlabeled batch size of 448
⬣ Key is a large batch size
⬣ Confidence threshold of 0.95
⬣ Cosine learning rate schedule
⬣ Differs for more complex datasets like ImageNet:
⬣ Batch sizes 1024/5120 (!)
⬣ Confidence threshold 0.7
⬣ Inference with exponential moving average (EMA) of weights
Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
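The FixMatch objective sketched in the two slides above can be written down in a few lines. This is a minimal illustration, not the authors' implementation: `model`, `weak_augment`, and `strong_augment` are hypothetical placeholders supplied by the reader, and only the confidence threshold (0.95) is taken from the slide.

```python
import torch
import torch.nn.functional as F

def fixmatch_loss(model, x_labeled, y_labeled, x_unlabeled,
                  weak_augment, strong_augment,
                  threshold=0.95, lambda_u=1.0):
    # Supervised term: standard cross-entropy on weakly augmented labeled data.
    logits_l = model(weak_augment(x_labeled))
    loss_l = F.cross_entropy(logits_l, y_labeled)

    # Pseudo-labels: predict on weakly augmented unlabeled data (no gradients).
    with torch.no_grad():
        probs = F.softmax(model(weak_augment(x_unlabeled)), dim=1)
        max_probs, pseudo_labels = probs.max(dim=1)
        mask = (max_probs >= threshold).float()  # keep only confident predictions

    # Consistency term: strongly augmented views must match the pseudo-labels.
    logits_u = model(strong_augment(x_unlabeled))
    loss_u = (F.cross_entropy(logits_u, pseudo_labels, reduction='none') * mask).mean()

    return loss_l + lambda_u * loss_u
```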

Results
Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

Scaling Semi-Supervised Learning
Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

Other Methods
Large number of methods:
⬣ MixMatch/ReMixMatch (Berthelot et al.): more complex variations prior to FixMatch
⬣ Temperature scaling and entropy minimization
⬣ Multiple augmentations and ensembling to improve pseudo-labels
⬣ Virtual Adversarial Training (Miyato et al.): augmentation through adversarial examples (via backprop)
⬣ Mean Teacher (Tarvainen et al.): student/teacher distillation consistency method with exponential moving average
Miyato et al., Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning
Berthelot et al., MixMatch: A Holistic Approach to Semi-Supervised Learning
Berthelot et al., ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring
Tarvainen et al., Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning

Label Propagation
Iscen et al., Label Propagation for Deep Semi-supervised Learning

Summary
Unlabeled data can be used successfully to improve supervised learning.
Methods are relatively simple:
⬣ Data augmentation
⬣ Pseudo-labeling
⬣ (Less necessary) label propagation
Methods scale to large unlabeled sets:
⬣ Not clear how many labels each unlabeled example is "worth"

Few-Shot Learning

Few-Shot Learning
Chen et al., A Closer Look at Few-Shot Learning
Setting: lots of labels for base categories, very few labels for new categories.

Finetuning Baseline
Chen et al., A Closer Look at Few-Shot Learning
Dhillon et al., A Baseline for Few-Shot Image Classification
Tian et al., Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?
⬣ Do what we always do: fine-tuning
⬣ Train a classifier on the base classes
⬣ Optionally freeze the feature extractor
⬣ Learn classifier weights for the new classes using the small amount of labeled data (during "query" time)
⬣ Surprisingly effective compared to more sophisticated approaches (Chen et al., Dhillon et al., Tian et al.)

Cosine Classifier
Chen et al., A Closer Look at Few-Shot Learning
https://en.wikipedia.org/wiki/Cosine_similarity
We can use a cosine (similarity-based) classifier rather than a fully connected linear layer.
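A cosine classifier can be written as a drop-in replacement for the final linear layer: both the features and the per-class weight vectors are L2-normalized, so the logits are (optionally scaled) cosine similarities. A minimal sketch; the scale value of 10 is an arbitrary illustration, not taken from the cited papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale  # temperature-like scaling of the cosine similarities

    def forward(self, features):
        # Normalize features and class weights, then take dot products:
        # each logit is the cosine similarity between a feature and a class vector.
        f = F.normalize(features, dim=1)
        w = F.normalize(self.weight, dim=1)
        return self.scale * f @ w.t()

# Usage: replace nn.Linear(feat_dim, num_classes) with CosineClassifier(feat_dim, num_classes).
```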

Cons of Normal Approach
⬣ The training we do on the base classes does not take the task into account
⬣ No notion that we will be performing a bunch of N-way tests
⬣ Idea: simulate what we will see during test time

Meta-Training
Set up a set of smaller tasks during training which simulate what we will be doing during testing: N-way K-shot tasks.
⬣ Can optionally pre-train features on held-out base classes
The testing stage is now the same, but with new classes (meta-train vs. meta-test splits).
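Meta-training therefore repeatedly samples small N-way K-shot tasks (episodes) from the base classes. A minimal sketch of episode sampling, assuming a hypothetical `data_by_class` dict that maps each class label to a list of example tensors of the same shape:

```python
import random
import torch

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=15):
    """Sample an N-way K-shot episode: a support set and a query set."""
    classes = random.sample(list(data_by_class.keys()), n_way)
    support_x, support_y, query_x, query_y = [], [], [], []
    for new_label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + n_query)
        # First k_shot examples go to the support set, the rest to the query set.
        for x in examples[:k_shot]:
            support_x.append(x); support_y.append(new_label)
        for x in examples[k_shot:]:
            query_x.append(x); query_y.append(new_label)
    return (torch.stack(support_x), torch.tensor(support_y),
            torch.stack(query_x), torch.tensor(query_y))
```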

Approaches using Meta-Training
Learning a model conditioned on the support set
Chen et al., A Closer Look at Few-Shot Learning

How to parametrize learning algorithms?
Two approaches to defining a meta-learner:
Take inspiration from a known learning algorithm
kNN/kernel machine: Matching Networks (Vinyals et al., 2016)
Gaussian classifier: Prototypical Networks (Snell et al., 2017)
Gradient descent: Meta-Learner LSTM (Ravi & Larochelle, 2017), Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017)
Derive it from a black-box neural network
MANN (Santoro et al., 2016)
SNAIL (Mishra et al., 2018)
Slide Credit: Hugo Larochelle

More Sophisticated Meta-Learning Approaches
Learn gradient descent (parameter initialization and update rules):
Output: a parameter initialization and a meta-learner that decides how to update parameters
Learn just an initialization and use normal gradient descent (MAML):
Output: just the parameter initialization!
We are using SGD
Slide Credit: Hugo Larochelle
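A minimal sketch of the MAML idea described above: an inner loop adapts a copy of the parameters on the support set with plain SGD, and the outer loop updates the initialization so that the adapted parameters do well on the query set. This is an illustration, not Finn et al.'s implementation; the functional `forward` model and the shapes in the usage note are hypothetical.

```python
import torch
import torch.nn.functional as F

def forward(params, x):
    # A tiny hypothetical functional model: params = [W, b] for a linear classifier.
    return x @ params[0] + params[1]

def maml_step(params, episodes, inner_lr=0.01, outer_lr=0.001, inner_steps=1):
    """One MAML meta-update over a batch of (support_x, support_y, query_x, query_y) episodes."""
    meta_grads = [torch.zeros_like(p) for p in params]
    for xs, ys, xq, yq in episodes:
        # Inner loop: adapt a copy of the initialization on the support set with SGD.
        fast = [p.clone() for p in params]
        for _ in range(inner_steps):
            loss = F.cross_entropy(forward(fast, xs), ys)
            grads = torch.autograd.grad(loss, fast, create_graph=True)
            fast = [w - inner_lr * g for w, g in zip(fast, grads)]
        # Outer loop: evaluate the adapted parameters on the query set and
        # backpropagate through the inner updates to the shared initialization.
        query_loss = F.cross_entropy(forward(fast, xq), yq)
        grads = torch.autograd.grad(query_loss, params)
        meta_grads = [mg + g for mg, g in zip(meta_grads, grads)]
    # Update the shared initialization (plain SGD on the meta-objective).
    with torch.no_grad():
        for p, g in zip(params, meta_grads):
            p -= outer_lr * g / len(episodes)

# Usage sketch (hypothetical shapes): a 5-way linear classifier on 64-dim features.
params = [(0.01 * torch.randn(64, 5)).requires_grad_(), torch.zeros(5, requires_grad=True)]
```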

Meta-Learner LSTM
Slide Credit: Hugo Larochelle

Model-Agnostic Meta-Learning (MAML)
Slide Credit: Hugo Larochelle

Comparison
Slide Credit: Hugo Larochelle

Unsupervised and Self-Supervised Learning

Unsupervised Learning
Supervised learning:
⬣ Classification: x → y (discrete)
⬣ Regression: x → y (continuous)
Unsupervised learning:
⬣ Clustering: x → c (discrete)
⬣ Dimensionality reduction: x → z (continuous)
⬣ Density estimation: x → p(x) (on simplex)

Spectrum of Low-Labeled Learning
What can we do with no labels at all?
⬣ Supervised learning: train input X, Y; learning output f : X → Y, P(y|x); e.g. classification
⬣ Unsupervised learning: input X; learning output P(X); e.g. clustering, density estimation, etc.

Autoencoders
Encoder → low-dimensional embedding → Decoder
⬣ Encoder: linear layers with reduced dimension, or Conv2d layers with stride
⬣ Decoder: linear layers with increasing dimension, or Conv2d layers with bilinear upsampling
⬣ Minimize the difference between input and reconstruction (with MSE)
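A minimal fully connected autoencoder along the lines described above; the dimensions and the dummy batch are arbitrary illustrations, not values from the slides.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, code_dim=32):
        super().__init__()
        # Encoder: linear layers with decreasing dimension.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim))
        # Decoder: linear layers with increasing dimension.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim))

    def forward(self, x):
        z = self.encoder(x)       # low-dimensional embedding
        return self.decoder(z)    # reconstruction

# Training objective: minimize the reconstruction error with MSE.
model = AutoEncoder()
x = torch.randn(16, 784)                       # a dummy batch
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
```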

Fine-Tuning
Encoder → Classifier

Clustering Assumption
⬣ High-density regions form clusters, while low-density regions separate clusters; each cluster holds a coherent semantic meaning.
We hope: a DNN maps the original feature space into a learned feature space where k-means clustering satisfies this assumption.

Deep Clustering
The clustering assumption leads to good feature learning, with careful engineering to avoid:
⬣ Empty clusters
⬣ Trivial parameterizations
Caron et al., Deep Clustering for Unsupervised Learning of Visual Features
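A minimal sketch of the alternating procedure in the spirit of DeepCluster (not the authors' code): cluster the current features with k-means, use the assignments as pseudo-labels, and train the network to predict them. The `encoder` is a hypothetical placeholder, only one gradient step is shown for brevity, and the safeguards against empty clusters and trivial solutions mentioned above are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

def deep_cluster_epoch(encoder, images, k=100, lr=0.01):
    # Step 1: cluster the current features; assignments become pseudo-labels.
    with torch.no_grad():
        feats = encoder(images).cpu().numpy()
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=k).fit_predict(feats), dtype=torch.long)

    # Step 2: train the encoder plus a fresh linear head to predict the pseudo-labels.
    head = nn.Linear(feats.shape[1], k)
    opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    loss = F.cross_entropy(head(encoder(images)), pseudo_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```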

Surrogate Tasks
There are a lot of other surrogate tasks!
⬣ Reconstruction
⬣ Rotate images, predict if the image is rotated or not
⬣ Colorization
⬣ Relative image patch location (jigsaw)
⬣ Video: next frame prediction
⬣ Instance prediction
⬣ …

Colorization
⬣ Input: grayscale image
⬣ Output: color image
⬣ Objective function: MSE
Zhang et al., Colorful Image Colorization

Jigsaw
⬣ Input: image patches
⬣ Output: prediction of discrete image patch location relative to the center
⬣ Objective function: cross-entropy (classification)
Doersch et al., Unsupervised Visual Representation Learning by Context Prediction

Rotation Prediction
⬣ Input: image with various rotations
⬣ Output: prediction of the rotation amount
⬣ Objective function: cross-entropy (classification)
Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations
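A minimal sketch of the rotation pretext task: generate the four rotated copies of each image (0°, 90°, 180°, 270°), label them by rotation index, and train with cross-entropy. The `encoder` and the 4-way `rotation_head` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def rotation_pretext_batch(images):
    """images: (B, C, H, W). Returns the rotated images and rotation labels 0-3."""
    rotated, labels = [], []
    for k in range(4):  # rotate by k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

def rotation_loss(encoder, rotation_head, images):
    x, y = rotation_pretext_batch(images)
    logits = rotation_head(encoder(x))   # 4-way classification of rotation amount
    return F.cross_entropy(logits, y)
```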

Evaluation
Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations
⬣ Train the model with a surrogate task
⬣ Extract the ConvNet (or encoder part)
⬣ Transfer to the actual task:
⬣ Use it to initialize the model of another supervised learning task
⬣ Use it to extract features for learning a separate classifier (e.g. NN or SVM)
⬣ Often the classifier is limited to a linear layer and the features are frozen
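A minimal sketch of the common linear-evaluation protocol mentioned above: freeze the pretrained encoder and train only a linear classifier on its features. The `encoder`, data `loader`, and dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, loader, feat_dim, num_classes, epochs=10, lr=0.1):
    # Freeze the pretrained encoder; only the linear layer is trained.
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()

    classifier = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)

    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = encoder(x)            # frozen features
            loss = F.cross_entropy(classifier(feats), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return classifier
```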

Instance Discrimination
(Figure: two augmentations of the same image form the positive pair; augmented other images serve as negative examples.)
Contrastive loss: e.g. dot product (similarity) between augmentation 1 and the positive & negative examples.
Key question: where should the negatives come from?
Considerations:
⬣ Efficiency (feature extraction)
⬣ Characteristics of negative examples
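A minimal sketch of an instance-discrimination (InfoNCE-style) contrastive loss: each image's two augmentations form the positive pair, and features of other examples (from the batch or a memory bank) serve as negatives. This is an illustration under those assumptions, not any particular paper's implementation; the temperature value is a common choice, not taken from the slides.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, negatives, temperature=0.07):
    """z1, z2: (B, D) features from two augmentations of the same images.
    negatives: (K, D) features of other (negative) examples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    negatives = F.normalize(negatives, dim=1)

    # Similarity with the positive (same instance, other augmentation) ...
    pos = (z1 * z2).sum(dim=1, keepdim=True)   # (B, 1)
    # ... and with all negatives.
    neg = z1 @ negatives.t()                   # (B, K)

    # Cross-entropy where the "correct class" is the positive, placed at index 0.
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(z1.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```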

Momentum Encoder
He et al., Momentum Contrast for Unsupervised Visual Representation Learning

Memory Banks
⬣ We can use a queue of features from previous mini-batches to serve as negative examples
⬣ This means no extra feature extraction is needed!
⬣ Features may be stale (since the encoder weights have been updated), but this still works
(Figure: features from each new mini-batch are pushed onto the queue and the oldest are popped.)
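A minimal sketch of the two ideas combined, in the spirit of MoCo (He et al.) rather than the authors' code: the key encoder tracks the query encoder as an exponential moving average, and a fixed-size queue of past key features provides the negatives. The momentum and queue size are illustrative values.

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # Key-encoder parameters follow the query encoder as an exponential moving average.
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data = m * k.data + (1.0 - m) * q.data

@torch.no_grad()
def update_queue(queue, new_keys, max_size=65536):
    # Push the newest key features, drop the oldest so the queue keeps a fixed size.
    queue = torch.cat([new_keys, queue], dim=0)
    return queue[:max_size]
```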


Results
He et al., Momentum Contrast for Unsupervised Visual Representation Learning
⬣ Linear layers learned on top of the learned, frozen encoder
⬣ Similar or even better results than supervised!
⬣ Features generalize to other tasks (object detection)

Summary
Large number of surrogate tasks and variations!
⬣ E.g. contrastive across image patches or context
⬣ Different types of loss functions and training regimes
Two have become dominant as extremely effective:
⬣ Contrastive losses (e.g. instance discrimination with positive and negative instances)
⬣ Pseudo-labeling (hard pseudo-labels) and knowledge distillation (soft teacher/student)
Data augmentation is key:
⬣ Methods tend to be sensitive to the choice of augmentation