
CSC 311: Introduction to Machine Learning
Lecture 9 - PCA, Matrix Completion, Autoencoders
Roger Grosse Rahul G. Krishnan Guodong Zhang
University of Toronto, Fall 2021
Intro ML (UofT) CSC311-Lec9 1 / 50

Today
So far in this course: supervised learning
Today we start unsupervised learning
- No labels, so the purpose is to find patterns in data
- Need to specify what kind of patterns to look for
This week: dimensionality reduction
- Linear dimensionality reduction (Principal Component Analysis)
- Matrix completion (needed for the project) is closely related to PCA.
- Nonlinear dimensionality reduction (autoencoders)
Week 11: clustering
Intro ML (UofT) CSC311-Lec9 2 / 50

Motivating Examples
Energy disaggregation
Kolter and Johnson, "REDD: A public data set for energy disaggregation research"
Intro ML (UofT) CSC311-Lec9 3 / 50

Motivating Examples
Modeling the change in scientific topics over time
Griffiths and Steyvers, "Finding scientific topics"
Intro ML (UofT) CSC311-Lec9 4 / 50

Motivating Examples
Modeling the change in scientific topics over time
Griffiths and Steyvers, "Finding scientific topics"
Intro ML (UofT) CSC311-Lec9 5 / 50

Motivating Examples
The models for those tasks are fairly complicated. In this course, we'll
focus on two simpler instances of unsupervised learning:
Clustering
Dimensionality Reduction
Intro ML (UofT) CSC311-Lec9 6 / 50

Linear Dimensionality Reduction
We'll start with a simpler form of dimensionality reduction: linear dimensionality reduction.
Example: suppose you're a psychologist interested in modeling the variation in human personality
- You've asked lots of participants to take a survey with lots of personality questions.
- By figuring out which questions are highly correlated with each other, you can uncover the main factors describing human personality.
A linear dimensionality reduction model called factor analysis found five key personality traits, called the Big Five:
- extraversion, agreeableness, openness to experience, conscientiousness, neuroticism
In this lecture, we'll consider a different but closely related model, called principal component analysis (PCA).
Intro ML (UofT) CSC311-Lec9 7 / 50

PCA: Overview
Principal Component Analysis (PCA) is our first unsupervised learning algorithm, and an example of linear dimensionality reduction.
Dimensionality reduction:
- Save computation/memory
- Reduce overfitting, achieve better generalization
- Visualize in 2 dimensions
Since PCA is a linear model, this mapping will be a linear projection.
Image credit: Elements of Statistical Learning
Intro ML (UofT) CSC311-Lec9 8 / 50

Euclidean projection
Projection onto a 1-D subspace
- Subspace $\mathcal{S}$ is the line along the unit vector $u$
- $u$ is a basis for $\mathcal{S}$: any point in $\mathcal{S}$ can be written as $zu$ for some $z$.
- Projection of $x$ on $\mathcal{S}$ is denoted by $\mathrm{Proj}_{\mathcal{S}}(x)$
Recall: $x^\top u = \lVert x \rVert \, \lVert u \rVert \cos\theta = \lVert x \rVert \cos\theta$
$$\mathrm{Proj}_{\mathcal{S}}(x) = \underbrace{x^\top u}_{\text{length of proj}} \cdot \underbrace{u}_{\text{direction of proj}} = \tilde{x}\, u$$
Intro ML (UofT) CSC311-Lec9 9 / 50
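
As a concrete aside (not from the original slides), here is a minimal NumPy sketch of this 1-D projection; the example vectors x and u are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([3.0, 4.0])           # point to project
u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)          # make u a unit vector

z = x @ u                          # scalar code: length of the projection, x^T u
proj = z * u                       # Proj_S(x) = (x^T u) u

print(z)                           # ~4.95
print(proj)                        # ~[3.5, 3.5]
```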

Supplement: Deriving PCA
Maximize $a^\top \Lambda a = \sum_{j=1}^{D} \lambda_j a_j^2$ for $a = Q^\top u$.
- This is a change of basis to the eigenbasis of $\Sigma$.
Assume the $\lambda_i$ are in sorted order, $\lambda_1 \geq \lambda_2 \geq \cdots$
Observation: since $u$ is a unit vector, then by unitarity, $a$ is also a unit vector: $a^\top a = u^\top Q Q^\top u = u^\top u$, i.e., $\sum_j a_j^2 = 1$.
By inspection, set $a_1 = 1$ and $a_j = 0$ for $j \neq 1$.
Hence, $u = Qa = q_1$ (the top eigenvector). A similar argument shows that the $k$th principal component is the $k$th eigenvector of $\Sigma$.
Intro ML (UofT) CSC311-Lec9 19 / 50
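
As a quick numerical check (my addition, not part of the slides), the snippet below verifies that projecting centered data onto the top eigenvector of the empirical covariance gives at least as much variance as any randomly drawn unit direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated synthetic data
X = X - X.mean(axis=0)                                    # center

Sigma = np.cov(X, rowvar=False)          # empirical covariance
eigvals, Q = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
q1 = Q[:, -1]                            # top eigenvector

def projected_variance(u):
    return np.var(X @ u)

best_random = max(projected_variance(v / np.linalg.norm(v))
                  for v in rng.normal(size=(1000, 5)))
print(projected_variance(q1) >= best_random)   # True
```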

Decorrelation
Interesting fact: the dimensions of $z$ are decorrelated. For now, let Cov denote the empirical covariance.
$$\mathrm{Cov}(z) = \mathrm{Cov}(U^\top x) = U^\top \mathrm{Cov}(x)\, U = U^\top \Sigma U = U^\top Q \Lambda Q^\top U \quad \text{(spectral decomposition)}$$
$$= \begin{pmatrix} I & 0 \end{pmatrix} \Lambda \begin{pmatrix} I \\ 0 \end{pmatrix} \quad \text{(by orthogonality)}$$
$$= \text{top left } K \times K \text{ block of } \Lambda$$
If the covariance matrix is diagonal, this means the features are uncorrelated.
Intro ML (UofT) CSC311-Lec9 20 / 50
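
A small NumPy check of this fact (my addition, not from the slides): the empirical covariance of the code vectors is diagonal, with the top eigenvalues of $\Sigma$ on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)

Sigma = np.cov(X, rowvar=False)
eigvals, Q = np.linalg.eigh(Sigma)             # ascending eigenvalues
U = Q[:, ::-1][:, :3]                          # top K = 3 eigenvectors as columns

Z = X @ U                                      # code vectors, one per row
print(np.round(np.cov(Z, rowvar=False), 6))    # (approximately) diagonal
print(np.round(eigvals[::-1][:3], 6))          # matches the diagonal entries
```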

Recap
Recap:
Dimensionality reduction aims to find a low-dimensional representation of the data.
PCA projects the data onto a subspace which maximizes the
projected variance, or equivalently, minimizes the reconstruction
error.
The optimal subspace is given by the top eigenvectors of the
empirical covariance matrix.
PCA gives a set of decorrelated features.
Intro ML (UofT) CSC311-Lec9 21 / 50

Applying PCA to faces
Consider running PCA on 2429 19×19 grayscale images (CBCL data)
Can get good reconstructions with only 3 components
PCA for pre-processing: can apply a classifier to the latent representation
- Original data is 361 dimensional
- For face recognition, PCA with 3 components obtains 79% accuracy on face/non-face discrimination on test data vs. 76.8% for a Gaussian mixture model (GMM) with 84 states. (We'll cover GMMs later in the course.)
Intro ML (UofT) CSC311-Lec9 22 / 50

Applying PCA to faces: Learned basis
Principal components of face images ("eigenfaces")
Intro ML (UofT) CSC311-Lec9 23 / 50

Applying PCA to digits
Intro ML (UofT) CSC311-Lec9 24 / 50

Next
Two more interpretations of PCA, which have interesting
generalizations.
1. Matrix factorization
2. Autoencoder
Intro ML (UofT) CSC311-Lec9 25 / 50

Some recommender systems in action
Ideally, recommendations should combine global and seasonal interests, look at your history if available, adapt over time, be coherent and diverse, etc.
Intro ML (UofT) CSC311-Lec9 26 / 50

Some recommender systems in action
Intro ML (UofT) CSC311-Lec9 27 / 50

The Netflix problem
Movie recommendation: users watch movies and rate them as good or bad.
Each row of the ratings data is a (user, movie, rating) triple, with ratings on a five-star scale, e.g.:
Thor ★☆☆☆☆, Chained ★★☆☆☆, Frozen ★★★☆☆, Chained ★★★★☆, Bambi ★★★★★, Titanic ★★★☆☆, Goodfellas ★★★★★, Dumbo ★★★★★, Twilight ★★☆☆☆, Frozen ★★★★★, Tangled ★☆☆☆☆
Because users only rate a few items, one would like to infer their preference for unrated items
Intro ML (UofT) CSC311-Lec9 28 / 50

Netflix Prize
Intro ML (UofT) CSC311-Lec9 29 / 50

PCA as Matrix Factorization
Recall PCA: each input vector $x_i \in \mathbb{R}^D$ is approximated as $\hat{\mu} + U z_i$,
$$x_i \approx \tilde{x}_i = \hat{\mu} + U z_i$$
where $\hat{\mu} = \frac{1}{N}\sum_i x_i$ is the data mean, $U \in \mathbb{R}^{D \times K}$ is the orthogonal basis for the principal subspace, $z_i \in \mathbb{R}^K$ is the code vector, and $\tilde{x}_i \in \mathbb{R}^D$ is $x_i$'s reconstruction or approximation.
Assume for simplicity that the data is centered: $\hat{\mu} = 0$. Then the approximation looks like
$$x_i \approx \tilde{x}_i = U z_i.$$
Intro ML (UofT) CSC311-Lec9 30 / 50

PCA as Matrix Factorization
PCA (on centered data): input vector $x_i$ is approximated as $U z_i$:
$$x_i \approx U z_i$$
Writing this in matrix form, we have $X \approx Z U^\top$, where $X$ and $Z$ are matrices with one row per data point:
$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{bmatrix} \in \mathbb{R}^{N \times D} \quad\text{and}\quad Z = \begin{bmatrix} z_1^\top \\ z_2^\top \\ \vdots \\ z_N^\top \end{bmatrix} \in \mathbb{R}^{N \times K}$$
We can write the squared reconstruction error as
$$\sum_{i=1}^{N} \lVert x_i - U z_i \rVert^2 = \lVert X - Z U^\top \rVert_F^2,$$
where $\lVert \cdot \rVert_F$ denotes the Frobenius norm:
$$\lVert Y \rVert_F^2 = \lVert Y^\top \rVert_F^2 = \sum_{i,j} y_{ij}^2 = \sum_i \lVert y_i \rVert^2.$$
Intro ML (UofT) CSC311-Lec9 31 / 50
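
The following NumPy sketch (my addition, not from the slides) builds the principal subspace from the empirical covariance and confirms that the summed per-point reconstruction error equals the Frobenius-norm form above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))
X = X - X.mean(axis=0)                         # centered data, rows are points

eigvals, Q = np.linalg.eigh(np.cov(X, rowvar=False))
U = Q[:, ::-1][:, :3]                          # D x K orthonormal basis (K = 3)
Z = X @ U                                      # N x K code vectors

err_sum  = sum(np.sum((x - U @ z) ** 2) for x, z in zip(X, Z))
err_frob = np.linalg.norm(X - Z @ U.T, 'fro') ** 2
print(np.isclose(err_sum, err_frob))           # True
```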

PCA as Matrix Factorization
So PCA is approximating $X \approx Z U^\top$, or equivalently $X^\top \approx U Z^\top$.
Based on the sizes of the matrices, this is a rank-$K$ approximation.
Since $U$ was chosen to minimize reconstruction error, this is the optimal rank-$K$ approximation, in terms of the error $\lVert X^\top - U Z^\top \rVert_F^2$.
Intro ML (UofT) CSC311-Lec9 32 / 50

Supplement: Singular-Value Decomposition (SVD)
This has a close relationship to the Singular-Value Decomposition (SVD) of $X$, which is a matrix factorization technique. Consider an $N \times D$ matrix $X \in \mathbb{R}^{N \times D}$ with SVD
$$X = Q S U^\top$$
Properties:
- $Q$, $S$, and $U^\top$ provide a real-valued matrix factorization of $X$.
- $Q$ is an $N \times D$ matrix with orthonormal columns, $Q^\top Q = I_D$, where $I_D$ is the $D \times D$ identity matrix.
- $U$ is an orthonormal $D \times D$ matrix, $U^\top = U^{-1}$.
- $S$ is a $D \times D$ diagonal matrix with non-negative singular values $s_1, s_2, \ldots, s_D$ on the diagonal, where the singular values are conventionally ordered from largest to smallest.
Note that standard SVD notation is $X = U D V^\top$. We are using $X = Q S U^\top$ for notational convenience.
Intro ML (UofT) CSC311-Lec9 33 / 50
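
A brief NumPy illustration (my addition, not from the slides): the right singular vectors of centered data match the covariance eigenvectors up to sign, and truncating the SVD to rank K gives the optimal rank-K approximation.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))
X = X - X.mean(axis=0)                        # centered, as in the PCA slides

# Thin SVD: X = Q S U^T with Q (N x D), S diagonal (stored as a vector), U (D x D)
Q, s, Ut = np.linalg.svd(X, full_matrices=False)
U = Ut.T

# Columns of U agree (up to sign) with the covariance eigenvectors
eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))
top = V[:, ::-1][:, :3]
print(np.allclose(np.abs(U[:, :3].T @ top), np.eye(3), atol=1e-6))   # True

# Rank-K truncation is the optimal rank-K approximation (error = sum of dropped s_j^2)
K = 3
X_K = Q[:, :K] @ np.diag(s[:K]) @ Ut[:K, :]
print(np.isclose(np.linalg.norm(X - X_K, 'fro') ** 2, np.sum(s[K:] ** 2)))  # True
```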

Matrix Completion
We just saw that PCA gives the optimal low-rank matrix factorization of a matrix X.
Can we generalize this to the case where X is only partially observed?
- A sparse 1000 × 1000 matrix with 50,000 observations (only 5% observed).
- A rank-5 approximation requires only 10,000 parameters, so it's reasonable to fit this.
- Unfortunately, there is no closed-form solution.
Intro ML (UofT) CSC311-Lec9 34 / 50

The Netflix problem
Movie recommendation: users watch movies and rate them as good or bad.
Each row of the ratings data is a (user, movie, rating) triple, with ratings on a five-star scale, e.g.:
Thor ★☆☆☆☆, Chained ★★☆☆☆, Frozen ★★★☆☆, Chained ★★★★☆, Bambi ★★★★★, Titanic ★★★☆☆, Goodfellas ★★★★★, Dumbo ★★★★★, Twilight ★★☆☆☆, Frozen ★★★★★, Tangled ★☆☆☆☆
Because users only rate a few items, one would like to infer their preference for unrated items
Intro ML (UofT) CSC311-Lec9 35 / 50

Matrix Completion
Matrix completion problem: N users by M movies matrix R (the rating matrix):

          Chained  Frozen  Bambi  Titanic  Goodfellas  Dumbo  Twilight  Thor  Tangled
Ninja        2        3      ?       ?         ?         ?       ?       1      ?
Cat          4        ?      5       ?         ?         ?       ?       ?      ?
Angel        ?        ?      ?       3         5         5       ?       ?      ?
Nursey       ?        ?      ?       ?         ?         ?       2       ?      ?
Tongey       ?        5      ?       ?         ?         ?       ?       ?      ?
Neutral      ?        ?      ?       ?         ?         ?       ?       ?      1

Data: users rate some movies, giving entries R_{user, movie}. Very sparse.
Task: predict missing entries, i.e. how a user would rate a movie they haven't previously rated.
Evaluation metric: squared error (used by the Netflix Competition). Is this a reasonable metric?
Intro ML (UofT) CSC311-Lec9 36 / 50

Matrix Completion
In our current setting, we want to predict ratings by characterizing both movies and users on a number of factors K inferred from the ratings patterns.
That is, we seek representations for movies and users as vectors in $\mathbb{R}^K$ that can ultimately be translated to ratings.
For simplicity, we can associate these factors (i.e. the dimensions of the vectors) with idealized concepts like
- comedy
- drama
- action
- But also uninterpretable dimensions
Can we use the sparse ratings matrix R to find these latent factors automatically?
Intro ML (UofT) CSC311-Lec9 37 / 50

Matrix Completion
Let the representation of user $i$ in the $K$-dimensional space be $u_i$ and the representation of movie $j$ be $z_j$.
- Intuition: maybe the first entry of $u_i$ says how much the user likes horror films, and the first entry of $z_j$ says how much movie $j$ is a horror film.
Assume the rating user $i$ gives to movie $j$ is given by a dot product:
$$R_{ij} \approx u_i^\top z_j$$
In matrix form, if
$$U = \begin{bmatrix} \text{---}\; u_1^\top \;\text{---} \\ \vdots \\ \text{---}\; u_N^\top \;\text{---} \end{bmatrix} \quad\text{and}\quad Z^\top = \begin{bmatrix} | & & | \\ z_1 & \cdots & z_M \\ | & & | \end{bmatrix}$$
then: $R \approx U Z^\top$.
This is a matrix factorization problem!
Intro ML (UofT) CSC311-Lec9 38 / 50

Matrix Completion
Recall PCA: to enforce $X^\top \approx U Z^\top$, we minimized
$$\min_{U, Z} \lVert X^\top - U Z^\top \rVert_F^2 = \sum_{i,j} \left( x_{ji} - u_i^\top z_j \right)^2$$
where $u_i$ and $z_j$ are the $i$-th and $j$-th rows of matrices $U$ and $Z$, respectively.
What's different about the Netflix problem?
- Most entries are missing!
- We only want to count the error for the observed entries.
Intro ML (UofT) CSC311-Lec9 39 / 50

Matrix Completion
Let $O = \{(n, m) : \text{entry } (n, m) \text{ of matrix } R \text{ is observed}\}$.
Using the squared error loss, matrix completion requires solving
$$\min_{U, Z} \; \frac{1}{2} \sum_{(i,j) \in O} \left( R_{ij} - u_i^\top z_j \right)^2$$
The objective is non-convex in $U$ and $Z$ jointly, and in fact it's generally NP-hard to minimize the above cost function exactly.
As a function of either $U$ or $Z$ individually, the problem is convex and easy to optimize. We can use coordinate descent, just like with K-means and mixture models!
Alternating Least Squares (ALS): fix $Z$ and optimize $U$, followed by fixing $U$ and optimizing $Z$, and so on until convergence.
Intro ML (UofT) CSC311-Lec9 40 / 50

Alternating Least Squares
Want to minimize the squared error cost with respect to the factor $U$. (The case of $Z$ is exactly symmetric.)
We can decompose the cost into a sum of independent terms:
$$\sum_{(i,j) \in O} \left( R_{ij} - u_i^\top z_j \right)^2 = \sum_i \underbrace{\sum_{j : (i,j) \in O} \left( R_{ij} - u_i^\top z_j \right)^2}_{\text{only depends on } u_i}$$
This can be minimized independently for each $u_i$.
This is a linear regression problem in disguise. Its optimal solution is:
$$u_i = \left( \sum_{j : (i,j) \in O} z_j z_j^\top \right)^{-1} \sum_{j : (i,j) \in O} R_{ij} z_j$$
Intro ML (UofT) CSC311-Lec9 41 / 50

Alternating Least Squares
ALS for the Matrix Completion problem:
1. Initialize $U$ and $Z$ randomly
2. Repeat until convergence:
3.   for $i = 1, \ldots, N$ do
4.     $u_i \leftarrow \left( \sum_{j : (i,j) \in O} z_j z_j^\top \right)^{-1} \sum_{j : (i,j) \in O} R_{ij} z_j$
5.   for $j = 1, \ldots, M$ do
6.     $z_j \leftarrow \left( \sum_{i : (i,j) \in O} u_i u_i^\top \right)^{-1} \sum_{i : (i,j) \in O} R_{ij} u_i$
Intro ML (UofT) CSC311-Lec9 42 / 50
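
Below is a minimal NumPy sketch of the ALS updates above (my addition, not from the slides); the function name, the small ridge term added for numerical invertibility, and the synthetic usage example are illustrative choices.

```python
import numpy as np

def als_matrix_completion(R, mask, K=3, n_iters=20, eps=1e-8, seed=0):
    """ALS for matrix completion: R is (N, M), mask is True where R is observed."""
    N, M = R.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(N, K))
    Z = rng.normal(scale=0.1, size=(M, K))
    for _ in range(n_iters):
        for i in range(N):                       # closed-form update for each user vector
            obs = np.where(mask[i])[0]
            if obs.size == 0:
                continue
            Zo = Z[obs]
            A = Zo.T @ Zo + eps * np.eye(K)      # small ridge keeps A invertible
            U[i] = np.linalg.solve(A, Zo.T @ R[i, obs])
        for j in range(M):                       # symmetric update for each movie vector
            obs = np.where(mask[:, j])[0]
            if obs.size == 0:
                continue
            Uo = U[obs]
            A = Uo.T @ Uo + eps * np.eye(K)
            Z[j] = np.linalg.solve(A, Uo.T @ R[obs, j])
    return U, Z

# Usage on synthetic low-rank data with ~40% of entries observed
rng = np.random.default_rng(1)
R_true = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 20))
mask = rng.random(R_true.shape) < 0.4
U, Z = als_matrix_completion(R_true, mask, K=3)
print(np.abs((U @ Z.T - R_true)[~mask]).mean())  # small error on held-out entries
```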

Next
Two more interpretations of PCA, which have interesting
generalizations.
1. Matrix factorization
2. Autoencoder
Intro ML (UofT) CSC311-Lec9 43 / 50

Autoencoders
An autoencoder is a feed-forward neural net whose job is to take an input x and predict x.
To make this non-trivial, we need to add a bottleneck layer whose dimension is much smaller than the input.
Intro ML (UofT) CSC311-Lec9 44 / 50

Linear Autoencoders
Why autoencoders?
Map high-dimensional data to two dimensions for visualization
Learn abstract features in an unsupervised way so you can apply
them to a supervised task
- Unlabeled data can be much more plentiful than labeled data
Intro ML (UofT) CSC311-Lec9 45 / 50

Linear Autoencoders
The simplest kind of autoencoder has one hidden layer, linear activations, and squared error loss:
$$\mathcal{L}(x, \tilde{x}) = \lVert x - \tilde{x} \rVert^2$$
This network computes $\tilde{x} = W_2 W_1 x$, which is a linear function.
If $K \geq D$, we can choose $W_2$ and $W_1$ such that $W_2 W_1$ is the identity matrix. This isn't very interesting.
But suppose $K < D$:
- $W_1$ maps $x$ to a $K$-dimensional space, so it's doing dimensionality reduction.
Intro ML (UofT) CSC311-Lec9 46 / 50

Linear Autoencoders
Observe that the output of the autoencoder must lie in the $K$-dimensional subspace spanned by the columns of $W_2$. This is because $\tilde{x} = W_2 z$.
We saw that the best possible (minimum-error) $K$-dimensional linear subspace, in terms of reconstruction error, is the PCA subspace.
The autoencoder can achieve this by setting $W_1 = U^\top$ and $W_2 = U$.
Therefore, the optimal weights for a linear autoencoder are just the principal components!
Intro ML (UofT) CSC311-Lec9 47 / 50
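
As an illustration (my addition, not from the slides), here is a minimal PyTorch sketch of a linear autoencoder; the layer sizes, optimizer, and synthetic data are arbitrary choices.

```python
import torch
import torch.nn as nn

D, K = 20, 3                                   # input and bottleneck dimensions

encoder = nn.Linear(D, K, bias=False)          # W1: D -> K
decoder = nn.Linear(K, D, bias=False)          # W2: K -> D
model = nn.Sequential(encoder, decoder)        # x_tilde = W2 W1 x (linear activations)

# Synthetic centered data lying near a 3-dimensional subspace
X = torch.randn(500, K) @ torch.randn(K, D)
X = X - X.mean(dim=0)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()                         # squared reconstruction error
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    opt.step()

# For K equal to the true rank, the loss approaches 0 and the decoder's columns
# span (approximately) the PCA subspace of X.
print(loss.item())
```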

Nonlinear Autoencoders
Deep nonlinear autoencoders learn to project the data, not onto a subspace, but onto a nonlinear manifold.
This manifold is the image of the decoder.
This is a kind of nonlinear dimensionality reduction.
Intro ML (UofT) CSC311-Lec9 48 / 50

Nonlinear Autoencoders
Nonlinear autoencoders can learn more powerful codes for a given
dimensionality, compared with linear autoencoders (PCA)
Intro ML (UofT) CSC311-Lec9 49 / 50

Nonlinear Autoencoders
Here's a 2-dimensional autoencoder representation of newsgroup
articles. They're color-coded by topic, but the algorithm wasn't given
the labels.
Intro ML (UofT) CSC311-Lec9 50 / 50