M4L18: Unsupervised and Semi-Supervised Learning (Slides v2)
Introduction
Spectrum of Low-Labeled Learning
Supervised Learning
⬣Train input: (X, Y)
⬣Learning output: f : X → Y, P(y|x)
⬣e.g. classification (Sheep, Dog, Cat, Lion, Giraffe)
Unsupervised Learning
⬣Input: X
⬣Learning output: P(x)
⬣Example: Clustering,
density estimation, etc.
Fewer Labels
Semi-Supervised Learning (10+ labels/category + unlabeled data)
Few-Shot Learning (1-5 labels/category, no unlabeled data)
Self-Supervised Learning (no labels for representation learning)
These are just common settings and can be combined!
⬣E.g. semi-supervised few-shot learning
What to Learn?
Traditional unsupervised learning methods:
⬣Modeling P(x): density estimation
⬣Comparing/Grouping: clustering
⬣Representation Learning: Principal Component Analysis
Similar in deep learning, but from a neural network/learning perspective:
⬣Modeling P(x): deep generative models
⬣Comparing/Grouping: metric learning & clustering
⬣Representation Learning: almost all deep learning!
Dealing with Unlabeled Data
(Diagram: labeled data trained with cross-entropy; unlabeled data with an unknown loss "?")
Several considerations:
⬣Loss function (especially for
unlabeled data)
⬣Optimization / Training
Procedure
⬣Architecture
⬣Transfer learning?
Common Key Ideas
Pseudo-labeling for Unlabeled Data
(Semi-Supervised Learning)
Figures adapted from: Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
Cross-View/Augmentation & Consistency
Surrogate Tasks (Self-Supervised Learning)
Meta-Learning (Few-Shot Learning)
Figure from Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning, ICML
2019 Tutorial. Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations
⬣Cross-entropy
⬣(Soft) knowledge distillation
Semi-Supervised Learning
Semi-Supervised Learning
Key question: Can we overcome a small amount of labeled data using unlabeled data?
It is often much cheaper (in terms of cost, time, etc.) to get large-scale unlabeled datasets.
Performance of Semi-Supervised Learning
(Figure: performance vs. amount of labeled data, from somewhat few labeled examples to lots of unlabeled data; curves for "Ideal", "Past Work", and "Recent Methods (sometimes)")
Ideally, we would like to improve performance all the way to the highly-labeled case.
An Old Idea: Predictions of Multiple Views
⬣Simple idea: Learn a model on the labeled data, make predictions on the unlabeled data, add them as new training data, and repeat (a minimal sketch follows below)
⬣Combine this idea with co-training: predicting across multiple views
⬣Blum & Mitchell, Combining Labeled and Unlabeled Data with Co-Training, 1998!
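A minimal self-training sketch of the idea above (illustrative only, not the full co-training algorithm; scikit-learn's LogisticRegression stands in for an arbitrary classifier, and the confidence threshold is an assumed hyper-parameter):

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.95, rounds=5):
    # Repeatedly fit on labeled data, then absorb confidently pseudo-labeled points.
    X_train, y_train = X_labeled.copy(), y_labeled.copy()
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for _ in range(rounds):
        if len(X_unlabeled) == 0:
            break
        probs = clf.predict_proba(X_unlabeled)
        keep = probs.max(axis=1) >= confidence      # only trust confident predictions
        if not keep.any():
            break
        X_train = np.vstack([X_train, X_unlabeled[keep]])
        y_train = np.concatenate([y_train, clf.classes_[probs[keep].argmax(axis=1)]])
        X_unlabeled = X_unlabeled[~keep]
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf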
Pseudo-Labeling
Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
(Diagram: normal cross-entropy training for labeled data; pseudo-labeling for unlabeled data)
Details and Hyper-Parameters
Several details:
⬣Labeled batch size of 64
⬣Unlabeled batch size of 448
⬣Key is large batch size
⬣Confidence threshold of 0.95 (used in the sketch below)
⬣Cosine learning rate schedule
⬣Differs for more complex datasets like ImageNet:
⬣Batch sizes 1024/5120 (!)
⬣Confidence threshold 0.7
⬣Inference with exponential moving
average (EMA) of weights
Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
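A minimal sketch of the FixMatch-style pseudo-labeling loss for one unlabeled batch. It only illustrates the confidence-thresholded consistency idea; model, weak_augment, and strong_augment are assumed placeholder names, not FixMatch's actual code.

import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, unlabeled_images, threshold=0.95):
    # weak_augment / strong_augment: assumed augmentation functions (placeholders).
    # Pseudo-label from the weakly augmented view (no gradient through this branch).
    with torch.no_grad():
        probs = F.softmax(model(weak_augment(unlabeled_images)), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = (confidence >= threshold).float()   # keep only confident pseudo-labels

    # Train the strongly augmented view to match the hard pseudo-label.
    logits_strong = model(strong_augment(unlabeled_images))
    per_example = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_example * mask).mean()

# Total loss = cross-entropy on the labeled batch + a weight times this unlabeled loss.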
Results (Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence)
Scaling Semi-Supervised Learning (Sohn et al., FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence)
Other Methods
Large number of methods:
⬣MixMatch/ReMixMatch (Berthelot et al.): More complex variations prior to FixMatch
⬣Temperature scaling and entropy minimization
⬣Multiple augmentations and ensembling to improve pseudo-labels
⬣Virtual Adversarial Training (Miyato et al.): Augmentation through adversarial examples (via backprop)
⬣Mean Teacher (Tarvainen et al.): Student/teacher distillation consistency method with an exponential moving average of the weights (see the sketch after the references below)
Miyato et al., Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning
Berthelot et al., MixMatch: A Holistic Approach to Semi-Supervised Learning
Berthelot et al., ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring
Tarvainen et al., Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning
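The exponential moving average used by Mean Teacher (and by FixMatch's EMA inference) is simple enough to show. A minimal sketch, assuming teacher and student are two PyTorch modules with identical architectures; the decay value is an assumption:

import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # teacher <- decay * teacher + (1 - decay) * student, applied parameter-wise.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

This would be called once per training step; the consistency loss then compares student predictions against teacher predictions on differently augmented views.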
Label Propagation
Iscen et al., Label Propagation for Deep Semi-supervised Learning
Summary
Unlabeled data can be used successfully to improve supervised learning.
Methods are relatively simple:
⬣Data augmentation
⬣Pseudo-labeling
⬣(Less necessary) Label Propagation
Methods scale to large unlabeled sets:
⬣Not clear how many labels each unlabeled example is "worth"
Few-Shot Learning
Few-Shot Learning
Chen et al., A Closer Look at Few-Shot Learning
(Diagram: lots of labels for the base categories; very few labels for the new categories)
Finetuning Baseline
Chen et al., A Closer Look at Few-Shot Learning
Dhillon et al., A Baseline for Few-Shot Image Classification
Tian et al., Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?
⬣Do what we always do: Fine-tuning
⬣Train classifier on base classes
⬣Optionally freeze feature extractor
⬣Learn classifier weights for the new classes using the small amount of labeled data (during "query" time)
⬣Surprisingly effective compared to more sophisticated approaches (Chen et al., Dhillon et al., Tian et al.)
Cosine Classifier
Chen et al., A Closer Look at Few-Shot Learning
https://en.wikipedia.org/wiki/Cosine_similarity
We can use a cosine (similarity-based) classifier rather than a fully connected linear layer (a minimal sketch follows).
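A minimal sketch of such a cosine classifier; the scale (temperature) parameter is an assumption, and papers vary in whether it is fixed or learned:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    # Logits are scaled cosine similarities between features and per-class weight vectors.
    def __init__(self, feature_dim, num_classes, scale=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feature_dim))
        self.scale = scale  # temperature so the softmax is not too flat

    def forward(self, features):
        features = F.normalize(features, dim=1)    # unit-norm features
        weights = F.normalize(self.weight, dim=1)  # unit-norm class weights
        return self.scale * features @ weights.t()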
Cons of Normal Approach
⬣The training we do on the base classes does not take the downstream task into account
⬣No notion that we will be performing a set of N-way tests
⬣Idea: simulate what we will
see during test time
Meta-Training
Set up a set of smaller tasks during training which simulate what we will be doing during testing: N-way, K-shot tasks
⬣Can optionally pre-train features on held-out base classes
Testing stage is now the same, but with new classes
Meta-Train
Meta-Test
Approaches using Meta-Training
Learning a model conditioned on support set
Chen et al., A Closer Look at Few-Shot Learning
How to parametrize learning algorithms?
Two approaches to defining a meta-learner:
Take inspiration from a known learning algorithm
kNN/kernel machine: Matching Networks (Vinyals et al., 2016)
Gaussian classifier: Prototypical Networks (Snell et al., 2017) (see the sketch after this list)
Gradient descent: Meta-Learner LSTM (Ravi & Larochelle, 2017),
Model-Agnostic Meta-Learning, MAML (Finn et al., 2017)
Derive it from a black-box neural network:
MANN (Santoro et al., 2016)
SNAIL (Mishra et al., 2018)
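As one concrete example from this list, a minimal sketch of the Prototypical Networks episode loss. The embed encoder and the episode tensors are assumed to exist, support labels are assumed to be 0..N-1, and the encoder is assumed to return flat feature vectors:

import torch
import torch.nn.functional as F

def prototypical_logits(embed, support_x, support_y, query_x, n_way):
    # Class prototypes = mean embedding of each class's support examples;
    # queries are scored by negative squared distance to each prototype.
    support_z = embed(support_x)            # (N*K, D)
    query_z = embed(query_x)                # (Q, D)
    prototypes = torch.stack(
        [support_z[support_y == c].mean(dim=0) for c in range(n_way)]
    )                                       # (N, D)
    return -torch.cdist(query_z, prototypes) ** 2   # (Q, N) logits

# loss = F.cross_entropy(prototypical_logits(embed, xs, ys, xq, n_way), query_labels)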
Meta-Learner
Slide Credit: Hugo Larochelle
Learn gradient descent: parameter initialization and update rules
Output:
⬣Parameter initialization
⬣A meta-learner that decides how to update the parameters
Learn just an initialization and use normal gradient descent (MAML)
Output:
⬣Just the parameter initialization!
⬣We are using standard SGD (a minimal sketch follows below)
Slide Credit: Hugo Larochelle
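A minimal MAML sketch for one meta-training task, assuming a PyTorch classification model and torch.func.functional_call (PyTorch 2.x); the inner loop is plain SGD on the support set, and the outer loss backpropagates through that update to train the initialization. Hyper-parameter values are assumptions.

import torch
import torch.nn.functional as F
from torch.func import functional_call

def maml_task_loss(model, support, query, inner_lr=0.01, inner_steps=1):
    params = dict(model.named_parameters())
    x_s, y_s = support
    x_q, y_q = query

    # Inner loop: adapt a functional copy of the parameters on the support set.
    for _ in range(inner_steps):
        inner_loss = F.cross_entropy(functional_call(model, params, (x_s,)), y_s)
        grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
        params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}

    # Outer (meta) loss: evaluate the adapted parameters on the query set.
    return F.cross_entropy(functional_call(model, params, (x_q,)), y_q)

# meta_optimizer.zero_grad(); maml_task_loss(model, support, query).backward()
# meta_optimizer.step()   # updates only the shared initialization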
More Sophisticated Meta-Learning Approaches
Meta-Learner LSTM (figure slides; Slide Credit: Hugo Larochelle)
Model-Agnostic Meta-Learning (MAML) (figure slides; Slide Credit: Hugo Larochelle)
Comparison (figure; Slide Credit: Hugo Larochelle)
Unsupervised and Self-Supervised Learning
Unsupervised Learning
Supervised Learning:
⬣Classification: x → y (discrete)
⬣Regression: x → y (continuous)
Unsupervised Learning:
⬣Clustering: x → c (discrete)
⬣Dimensionality reduction: x → z (continuous)
⬣Density estimation: x → p(x) (on the simplex)
Spectrum of Low-Labeled Learning
What can we do with no labels at all?
Autoencoders
⬣Encoder: linear layers with reduced dimension, or Conv2d layers with stride → low-dimensional embedding
⬣Decoder: linear layers with increasing dimension, or Conv2d layers with bilinear upsampling
⬣Minimize the difference between input and reconstruction (with MSE); a minimal sketch follows.
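A minimal convolutional autoencoder sketch matching the slide (strided convolutions down, bilinear upsampling back up, MSE reconstruction loss); the layer sizes and image shape are arbitrary assumptions:

import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(16, 3, kernel_size=3, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)   # low-dimensional (spatially downsampled) embedding
        return self.decoder(z), z

model = ConvAutoencoder()
x = torch.randn(8, 3, 32, 32)
reconstruction, embedding = model(x)
loss = nn.functional.mse_loss(reconstruction, x)   # minimize the difference with MSE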
Fine-Tuning
(Diagram: Encoder → Classifier)
Clustering Assumption
⬣High-density regions form clusters that hold a coherent semantic meaning, while low-density regions separate clusters.
We hope:
(Diagram: original feature space → DNN → learned feature space → K-Means; the assumption)
Deep Clustering
The clustering assumption leads to good feature learning, with careful engineering to avoid:
⬣Empty clusters
⬣Trivial parameterizations
(A minimal sketch of the training loop follows the reference below.)
Caron et al., Deep Clustering for Unsupervised Learning of Visual Features
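A minimal sketch of one deep-clustering round in the spirit of Caron et al.: k-means assignments on the current features become pseudo-labels for a classification loss. The encoder (assumed to output flat feature vectors), classifier, optimizer, and a non-shuffled dataloader of unlabeled image batches are assumed names, and the safeguards against empty clusters and trivial solutions are omitted:

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def deep_cluster_round(encoder, classifier, optimizer, dataloader, k=100):
    # 1) Extract features for the whole unlabeled dataset with the current encoder.
    encoder.eval()
    with torch.no_grad():
        features = torch.cat([encoder(x) for x in dataloader]).cpu().numpy()

    # 2) Cluster the features; cluster indices act as pseudo-labels.
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=k, n_init=10).fit_predict(features)
    ).long()

    # 3) Train encoder + classifier to predict the pseudo-labels (standard cross-entropy).
    encoder.train()
    offset = 0
    for x in dataloader:
        targets = pseudo_labels[offset:offset + x.size(0)]
        offset += x.size(0)
        loss = F.cross_entropy(classifier(encoder(x)), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()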
Surrogate Tasks
There are a lot of other surrogate
tasks!
⬣Reconstruction
⬣Rotate images, predict if image is
rotated or not
⬣Colorization
⬣Relative image patch location
(jigsaw)
⬣Video: Next frame prediction
⬣Instance Prediction
⬣…
Colorization
⬣Input: grayscale image
⬣Output: color image
⬣Objective function: MSE
Zhang et al., Colorful Image Colorization
Jigsaw
⬣Input: Image patches
⬣Output: Prediction of discrete image patch location relative to center
⬣Objective function: Cross-Entropy (classification)
Doersch et al., Unsupervised Visual Representation Learning by Context Prediction
Rotation Prediction
⬣Input: image with various rotations
⬣Output: prediction of the rotation amount
⬣Objective function: Cross-Entropy (classification)
(A minimal sketch of building the rotation task follows the reference below.)
Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations
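A minimal sketch of constructing the rotation pretext task from an unlabeled batch; the 4-way rotation classifier model is an assumed name:

import torch
import torch.nn.functional as F

def rotation_batch(images):
    # Each image is rotated by 0/90/180/270 degrees; the label is which rotation was applied.
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

# x_rot, y_rot = rotation_batch(unlabeled_images)
# loss = F.cross_entropy(model(x_rot), y_rot)   # model: 4-way rotation classifier (assumed)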
Evaluation
Gidaris et al., Unsupervised Representation Learning by Predicting Image Rotations
⬣Train the model with a surrogate
task
⬣Extract the ConvNet (or encoder part)
⬣Transfer to the actual task
⬣Use it to initialize the model of
another supervised learning task
⬣Use it to extract features for
learning a separate classifier
(ex: NN or SVM)
⬣Often the classifier is limited to a linear layer and the features are frozen (a linear-probe sketch follows below)
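A minimal linear-probe sketch for this evaluation protocol; encoder, feature_dim, num_classes, and the labeled dataloader are assumed names, and the optimizer settings are assumptions:

import torch
import torch.nn as nn

def linear_probe(encoder, feature_dim, num_classes, labeled_loader, epochs=10):
    for p in encoder.parameters():
        p.requires_grad = False          # features are frozen
    encoder.eval()

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

    for _ in range(epochs):
        for x, y in labeled_loader:
            with torch.no_grad():
                features = encoder(x)    # frozen feature extraction
            loss = nn.functional.cross_entropy(classifier(features), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier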
Instance Discrimination
(Diagram: two augmentations of the same image form a positive pair; augmented other images serve as negative examples)
Key question: Where should
the negatives come from?
Considerations:
⬣Efficiency (feature extraction)
⬣Characteristics of negative
examples
Contrastive loss: e.g. dot product (similarity) between augmentation 1 and the positive & negative examples (a minimal sketch follows).
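A minimal sketch of such a contrastive (InfoNCE-style) loss over one positive and a set of negatives; the feature tensors are assumed to come from the encoder, and the temperature value is an assumption:

import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    # anchor/positive: (B, D); negatives: (K, D). The anchor should be more similar
    # (dot product of normalized features) to its positive than to every negative.
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(negatives, dim=1)

    pos_logits = (anchor * positive).sum(dim=1, keepdim=True)   # (B, 1)
    neg_logits = anchor @ negatives.t()                          # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)   # the positive sits at index 0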
Momentum Encoder
He et al., Momentum Contrast for Unsupervised Visual Representation Learning
Memory Banks
⬣We can use a queue that comes from the previous mini-batches to serve as negative examples
⬣This means no extra feature extraction is needed!
⬣Features may be stale (since the encoder weights have been updated) but this still works
(Diagram: each mini-batch's key features are pushed into the queue and the oldest are popped; a momentum-encoder sketch follows.)
He et al., Momentum Contrast for Unsupervised Visual Representation Learning
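A minimal sketch of the two MoCo-specific pieces described above: the momentum (EMA) key encoder and the queue of negatives. The function names, momentum value, and queue size are assumptions, not MoCo's actual code:

import torch

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    # Key encoder trails the query encoder, so queued features stay roughly consistent.
    for k_param, q_param in zip(key_encoder.parameters(), query_encoder.parameters()):
        k_param.mul_(m).add_(q_param, alpha=1.0 - m)

@torch.no_grad()
def update_queue(queue, new_keys, max_size=65536):
    # Push the current mini-batch's key features and drop the oldest ones.
    return torch.cat([queue, new_keys], dim=0)[-max_size:]

# The queued features serve as the negatives in the contrastive loss sketched earlier.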
Results
(Diagram: unlabeled examples and their augmented views)
He et al., Momentum Contrast for Unsupervised Visual Representation Learning
⬣Linear layers learned on top of the frozen, pretrained encoder
⬣Similar or even better results than supervised!
⬣Features generalize to other tasks (object
detection)
Summary
Large number of surrogate tasks and variations!
⬣E.g. contrastive across image patches or context
⬣Different types of loss functions and training regimes
Two have become dominant as extremely effective:
⬣Contrastive losses (e.g. instance discrimination with
positive and negative instances)
⬣Pseudo-labeling (hard pseudo-label) and knowledge
distillation (soft teacher/student)
Data augmentation is key
⬣Methods tend to be sensitive to choice of augmentation