Optimization Methods for Machine Learning and Engineering: Optimization in Vector Spaces

Slide Content

Optimization Methods for Machine Learning and Engineering
Lecture 7 – Optimization in Vector Spaces
Julius Pfrommer
Updated February 12, 2021
CC BY-SA 4.0

Agenda
1. Vector Spaces
2. Norms and Banach Spaces
3. Inner Products, Hilbert Spaces and the Projection Theorem
4. Applications
1/24

Vector Spaces

Vector Spaces
A set of elements X with the operations
Addition: ∀x, y ∈ X: x + y ∈ X
Scalar Multiplication: ∀x ∈ X, α ∈ ℝ: αx ∈ X
is called a vector space if in addition the following axioms are fulfilled
for any elements x, y, z ∈ X and scalars α, β ∈ ℝ:
1. x + y = y + x (Commutative Law)
2. (x + y) + z = x + (y + z) (Associative Law)
3. (αβ)x = α(βx) (Associative Law)
4. α(x + y) = αx + αy (Distributive Law)
5. (α + β)x = αx + βx (Distributive Law)
6. ∃0 ∈ X such that x + 0 = x for all x ∈ X (Null Vector)
7. 0x = 0, 1x = x
[Figure: the operations on vectors from ℝⁿ: addition x + y and scalar multiplication 2x]
The elements of X could be from ℝⁿ, keeping with our previous notion of a "vector". But many other types
of mathematical objects also form vector spaces. And not all X ⊆ ℝⁿ obey the axioms.
2/24

Quiz: Is X a Vector Space?
X = ℝⁿ with n ∈ ℕ
Yes
3/24

Quiz: Is X a Vector Space?
[Figure: the plane M in ℝ³, shifted by the offset vector p]
M = {m ∈ ℝ³ | m₃ = 0}
p ∈ ℝ³
X = {m + p | m ∈ M}
No (unless p ∈ M, the shifted set does not contain the null vector 0)
4/24

Quiz: Is X a Vector Space?
[Figure: the set M and the null vector 0]
M ⊂ ℝⁿ with n ∈ ℕ
X = {x | ∃m ∈ M, λ ≥ 0 : x = λm}
No
5/24

Quiz: Is X a Vector Space?
[Figure: plots of f(t) = α·sin(t) + β·cos(t) together with sin(t) and cos(t)]
X = {f | ∃α, β ∈ ℝ : f = t ↦ α·sin(t) + β·cos(t)}
Yes
For the addition of functions use (f + g)(t) = f(t) + g(t).
The null vector 0 of function space is f₀(t) = 0.
6/24
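To make the function-space operations concrete, here is a minimal Julia sketch (not from the slides) that
represents an element of X by its coefficient pair (α, β); the type name SinCos is illustrative.

# Minimal sketch: elements of X = {t ↦ α·sin(t) + β·cos(t)} represented by
# their coefficients. Addition and scalar multiplication stay inside X, and
# the null vector is SinCos(0, 0).
struct SinCos
    α::Float64
    β::Float64
end
(f::SinCos)(t) = f.α * sin(t) + f.β * cos(t)   # evaluate as a function
Base.:+(f::SinCos, g::SinCos) = SinCos(f.α + g.α, f.β + g.β)
Base.:*(λ::Real, f::SinCos) = SinCos(λ * f.α, λ * f.β)

f = SinCos(1.0, 0.0)   # sin
g = SinCos(0.0, 2.0)   # 2·cos
h = 3.0 * f + g        # still an element of X
h(0.5)                 # == 3sin(0.5) + 2cos(0.5)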

Linear Combinations, Linear Dependence, Basis and Dimensions
A vector x from a vector space X is a linear combination of the vectors {y₁, y₂, …} ⊆ X if there exist
scalars {α₁, α₂, …} so that x = Σᵢ αᵢyᵢ. Note that we could have infinitely many yᵢ for the linear
combination.
The vectors {x₁, …, xₙ} from a vector space X are linearly independent if Σᵢ αᵢxᵢ = 0 implies αᵢ = 0
for all i.
A set of linearly independent vectors {x₁, x₂, …} is called a basis of X if its linear combinations span X.
The dimension of a vector space X is defined by the number of elements in its basis.
We first encountered these concepts in the context of Linear Algebra in ℝⁿ. But they are more general
and can be applied to any vector space. This is a common theme for this lecture.
[Figure: the plane X spanned by x₁ and x₂, with y pointing out of the plane]
• Vectors x₁ and x₂ are a basis for the vector space X
• y is linearly independent from x₁ and x₂ and is therefore not an element of X
7/24
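In the finite-dimensional case, linear independence can be checked numerically via the rank of the matrix
whose columns are the candidate vectors; a short sketch (the helper name is illustrative):

# Sketch: check linear (in)dependence of vectors in ℝⁿ via the matrix rank
using LinearAlgebra

linearly_independent(vs...) = rank(hcat(vs...)) == length(vs)

x1 = [1.0, 0.0, 0.0]
x2 = [0.0, 1.0, 0.0]
linearly_independent(x1, x2)            # true: a basis of the plane they span
linearly_independent(x1, x2, x1 + x2)   # false: the third vector is a linear combination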

Norms and Banach Spaces

Normed Vector Spaces
A normed vector space additionally has a real-valued function that maps each element x ∈ X to a real
number ‖x‖ called the norm of x, where the following axioms hold:
1. ‖x‖ ≥ 0, and ‖x‖ = 0 iff x = 0
2. ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ X (Triangle Inequality)
3. ‖αx‖ = |α| · ‖x‖ for all α ∈ ℝ
Every norm implies a metric, i.e. a distance function d between vectors x, y ∈ X:
d(x, y) := ‖x − y‖
From the norm axioms we then have
1. d(x, y) = 0 ⇔ x = y (Identity of Indiscernibles)
2. d(x, z) ≤ d(x, y) + d(y, z) (Triangle Inequality)
3. d(x, y) = d(y, x) (Symmetry)
[Figures: the norm ‖x‖ of a vector as its length; the distance d(x, y) between vectors x and y]
8/24

The p-Norms
For elements x ∈ ℝⁿ, the previously encountered Euclidean Norm ‖·‖₂ is only a special case from the
family of p-Norms
‖x‖ₚ = (Σᵢ |xᵢ|ᵖ)^(1/p)
for p ≥ 1. Other common values for p are:
p = 1   The Manhattan Norm is simply the sum of the absolute values.
p = ∞   The Maximum Norm arises in the limit when p is increased. It can be defined alternatively as
        ‖x‖∞ = maxᵢ |xᵢ|.
In the example on the right-hand side, there is a unique shortest path in the Euclidean distance (implied
by the Euclidean Norm) across the grid (red). In Manhattan distance there are several paths with the
same length.
[Figures: distances in the Manhattan Norm (p = 1); unit circles for p = ∞, p = 2, p = 1 and p = 1/2]
9/24
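A small Julia sketch of the p-norm (the function name pnorm is illustrative; LinearAlgebra.norm(x, p)
provides the same functionality):

# p-norm for x ∈ ℝⁿ, including the p = ∞ limit case
pnorm(x, p) = p == Inf ? maximum(abs.(x)) : sum(abs.(x) .^ p)^(1 / p)

x = [3.0, -4.0]
pnorm(x, 1)    # 7.0  (Manhattan Norm)
pnorm(x, 2)    # 5.0  (Euclidean Norm)
pnorm(x, Inf)  # 4.0  (Maximum Norm)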

Convergence and Banach Spaces
In the context of open/closed sets, we previously saw a convergent sequence.
Now we can make this notion of convergence more precise.
Let {xᵢ} ⊆ X be an infinite sequence from the normed vector space X. The sequence converges if there
exists some element y ∈ X for which ‖y − xᵢ‖ converges to zero. More precisely, for every ε > 0 there
exists an index m such that ‖y − xᵢ‖ < ε for all i ≥ m. We then write xᵢ → y.
A sequence {xᵢ} is said to be a Cauchy sequence if ‖xᵢ − xⱼ‖ → 0 as i, j → ∞; i.e., given ε > 0, there
is an index m such that ‖xᵢ − xⱼ‖ < ε for all i, j ≥ m.
In a normed space every convergent sequence is a Cauchy sequence.
A normed vector space X is complete if every Cauchy sequence from X has a limit in X. A complete normed
vector space is called a Banach space.
[Images: Stefan Banach (1892 – 1945); a non-Cauchy sequence]
10/24

Completeness and the Existence of Fixed Points
In a normed vector space, any finite-dimensional subspace is complete. So all normed vector spaces
embedded in ℝⁿ are Banach spaces.
Completeness is a prerequisite for many of the optimization algorithms we saw prior. For example, to
show convergence of Gradient Descent and the Newton Method in general normed vector spaces.
Let S be a subset of a normed vector space X and let f be a transformation f: S → S. Then f is a
contraction if there exists an α with 0 ≤ α < 1 such that ‖f(x) − f(y)‖ ≤ α‖x − y‖ for all x, y ∈ S.
Banach Fixed Point Theorem
If f is a contraction on a closed subset S of a Banach space, there is a fixed point x* ∈ S satisfying
x* = f(x*). Furthermore, x* can be obtained by starting with an arbitrary x₀ ∈ S and following the
sequence xᵢ₊₁ = f(xᵢ).
A Non-Complete Normed Space [Luenberger69]
Consider the normed vector space of continuous functions L²[0, 1]. Let a sequence of functions fᵢ from
this space:
fᵢ(t) = 0              for 0 ≤ t ≤ 1/2 − 1/i
        i·t − i/2 + 1  for 1/2 − 1/i ≤ t ≤ 1/2
        1              for t ≥ 1/2
Each function fᵢ is continuous for finite i. However, the sequence converges in the limit to the step
function, which is not continuous and not in L²[0, 1].
11/24
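The fixed-point iteration from the theorem is easy to sketch in Julia. The example below uses f(x) = cos(x),
which is a contraction on [0, 1] (|f'(x)| = |sin(x)| ≤ sin(1) < 1 there); the tolerance and iteration cap
are illustrative choices, not from the slides.

# Fixed-point iteration x_{i+1} = f(x_i) from the Banach Fixed Point Theorem
function fixed_point(f, x0; tol = 1e-12, maxiter = 1000)
    x = x0
    for _ in 1:maxiter
        xnew = f(x)
        abs(xnew - x) < tol && return xnew   # converged
        x = xnew
    end
    return x
end

xstar = fixed_point(cos, 0.5)
xstar ≈ cos(xstar)   # true: x* = f(x*)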

Inner Products, Hilbert Spaces
and the Projection Theorem

The Inner Product
Let X be a vector space. The inner product ⟨x|y⟩ is a function defined on X × X that maps each pair of
vectors x, y ∈ X to a scalar while satisfying the following axioms:
1. ⟨x + y|z⟩ = ⟨x|z⟩ + ⟨y|z⟩ (Linearity in the first argument)
2. ⟨λx|y⟩ = λ⟨x|y⟩ (Linearity in the first argument)
3. ⟨x|y⟩ = conj(⟨y|x⟩) (Conjugate Symmetry)
4. ⟨x|x⟩ ≥ 0 and ⟨x|x⟩ = 0 iff x = 0 (Positive Definiteness)
Here conj(·) denotes complex conjugation (complex-valued vector spaces are not considered in the course).
A vector space with an inner product defined is a pre-Hilbert space.
Every inner product implies a norm ‖x‖ = √⟨x|x⟩.
Euclidean Inner Product
A vector space X ⊆ ℝⁿ with elements x, y and the inner product
⟨x|y⟩ = Σᵢ₌₁ⁿ xᵢyᵢ.
Function Spaces
The vector space L²[a, b] of continuous functions f, g with ∫ₐᵇ f(t)² dt < ∞ and the inner product
⟨f|g⟩ = ∫ₐᵇ f(t)g(t) dt.
12/24

Orthogonality and the Projection Theorem
Two elements x, y from a pre-Hilbert space are said to be orthogonal if ⟨x|y⟩ = 0, denoted as x ⊥ y.
If x, y are orthogonal, x ⊥ y, then ‖x + y‖² = ‖x‖² + ‖y‖².
Proof: ‖x + y‖² = ⟨x + y|x + y⟩ = ⟨x|x + y⟩ + ⟨y|x + y⟩ = ⟨x + y|x⟩ + ⟨x + y|y⟩ =
⟨x|x⟩ + ⟨y|x⟩ + ⟨x|y⟩ + ⟨y|y⟩ = ‖x‖² + ‖y‖²
Consider the following optimization problem: Let a pre-Hilbert space X and a subspace M ⊆ X. Given an
element x ∈ X, what is the element m ∈ M that minimizes ‖x − m‖?
Projection Theorem for pre-Hilbert Spaces (see [Luenberger69])
If there is an element m* ∈ M such that ‖x − m*‖ ≤ ‖x − m‖ for all m ∈ M, then m* is unique. The
element m* is a unique minimizer in M iff the residual x − m* is orthogonal to M.
[Figure: the projection m* of x onto the subspace M, with the residual x − m* orthogonal to M]
13/24
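A quick numerical check of the Pythagorean identity above with the Euclidean inner product (illustrative,
not from the slides):

using LinearAlgebra

x = [1.0, 2.0, 0.0]
y = [-2.0, 1.0, 3.0]
dot(x, y)                              # 0.0, so x ⊥ y
norm(x + y)^2 ≈ norm(x)^2 + norm(y)^2  # true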

Hilbert Spaces
A complete pre-Hilbert space is called a Hilbert space.
Concerning the Projection Theorem, we know that a unique minimizer must exist for Hilbert spaces.
Results from Linear Algebra are generalized to infinite-dimensional vector spaces. Linear operators
translate between different Hilbert spaces. Matrix multiplication is a special case of a linear
operator in the finite-dimensional case.
Hilbert spaces are used in many different fields:
John Von Neumann. Mathematische Grundlagen der Quantenmechanik. Springer, 1932
Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization,
optimization, and beyond. MIT Press, 2002
Kevin W Cassel. Variational methods with applications in science and engineering. Cambridge University
Press, 2013
[Image: David Hilbert (1862 – 1943), "the last person who knew all of mathematics" (folklore)]
14/24

Gram-Schmidt Orthogonalization
In an orthogonal set S all elements are mutually orthogonal: ∀x, y ∈ S, x ≠ y ⇒ x ⊥ y.
If S is orthonormal (in addition to orthogonal), then ∀x ∈ S: ‖x‖ = 1.
Given x, y ∈ X and ‖y‖ = 1, then ⟨x|y⟩y is the projection of x on y.
The residual of the projection r = x − ⟨x|y⟩y is orthogonal to y.
Proof: ⟨x − ⟨x|y⟩y | y⟩ = ⟨x|y⟩ − ⟨x|y⟩⟨y|y⟩ = 0
[Figure: the projection ⟨x|y⟩y of x onto y and the residual r of the projection]
Let {b₁, b₂, …, bₙ} be a finite basis for the subspace M of a Hilbert space H, M ⊆ H. We can construct
an orthonormal basis {e₁, e₂, …, eₙ} for M using Gram-Schmidt Orthogonalization:
e₁ = b₁ / ‖b₁‖,    eₙ = (bₙ − Σᵢ₌₁ⁿ⁻¹ ⟨bₙ|eᵢ⟩eᵢ) / ‖bₙ − Σᵢ₌₁ⁿ⁻¹ ⟨bₙ|eᵢ⟩eᵢ‖
By the Projection Theorem we find the m* ∈ M with minimum distance to some x ∈ H as
m* = arg min over α₁, α₂, …, αₙ of ‖x − Σᵢ₌₁ⁿ αᵢbᵢ‖ = Σᵢ₌₁ⁿ ⟨x|eᵢ⟩eᵢ
15/24

The Normal Equations
Again, we look at the minimum-norm projection m* = arg min over α₁, α₂, …, αₙ of ‖x − Σᵢ₌₁ⁿ αᵢbᵢ‖,
where the bᵢ span a subspace of a Hilbert space H. But instead of just m* we are also interested in
the αᵢ. Gram-Schmidt Orthogonalization does not immediately give us those.
From the Projection Theorem we know that the residual x − Σᵢ₌₁ⁿ αᵢbᵢ is orthogonal to all bⱼ:
⟨x − Σᵢ₌₁ⁿ αᵢbᵢ | bⱼ⟩ = 0   for all j = 1, …, n
⟨Σᵢ₌₁ⁿ αᵢbᵢ | bⱼ⟩ = ⟨x|bⱼ⟩   for all j = 1, …, n
We can further unpack the left-hand side to get a system Gα = c of n linear equations with n unknowns.
These are known as the Normal Equations. Note that only c depends on the vector x that we want to project.
⟨b₁|b₁⟩α₁ + ⟨b₂|b₁⟩α₂ + … + ⟨bₙ|b₁⟩αₙ = ⟨x|b₁⟩
⟨b₁|b₂⟩α₁ + ⟨b₂|b₂⟩α₂ + … + ⟨bₙ|b₂⟩αₙ = ⟨x|b₂⟩
⋮
⟨b₁|bₙ⟩α₁ + ⟨b₂|bₙ⟩α₂ + … + ⟨bₙ|bₙ⟩αₙ = ⟨x|bₙ⟩
16/24

The Gram Matrix
Let {b₁, b₂, …} be a linearly independent basis from a Hilbert space. Its Gram matrix can be
precomputed as
G(b₁, b₂, …, bₙ) = [ ⟨b₁|b₁⟩  ⟨b₂|b₁⟩  …  ⟨bₙ|b₁⟩
                     ⟨b₁|b₂⟩  ⟨b₂|b₂⟩  …  ⟨bₙ|b₂⟩
                        ⋮        ⋮     ⋱     ⋮
                     ⟨b₁|bₙ⟩  ⟨b₂|bₙ⟩  …  ⟨bₙ|bₙ⟩ ]
Theorem: The determinant of the Gram matrix is non-null, |G(b₁, b₂, …, bₙ)| ≠ 0, iff the bᵢ are
linearly independent.
In that case, the matrix is invertible and we can solve Gα = c for α with standard methods.
Hence, for every finite basis embedded in a Hilbert space, we can compute the minimum-distance projection
and express it by coefficients αᵢ for the basis elements.
17/24

Minimum Distance in Julia
# Norm and distance function
norm(L, x) = sqrt(inner(L, x, x))
dist(L, x, y) = norm(L, x - y)

# Example for the Euclidean 2-norm
inner(::Val{:P2}, x, y) = x' * y
dist(Val(:P2), [0, 0], [1, 1])  # 1.4142

function gram_schmidt(L, b)
    e = copy(b)
    for i = 1:length(b)
        for j = 1:i-1
            e[i] -= e[j] * inner(L, b[i], e[j])
        end
        nn = norm(L, e[i])
        if nn > 0.0001  # Normalize if non-zero
            e[i] = e[i] / nn
        end
    end
    return e
end

# Projection on the subspace defined by a (not
# necessarily orthogonal) basis
function proj(L, x, basis)
    ob = gram_schmidt(L, basis)  # orthonormal basis
    return sum([ob[i] * inner(L, x, ob[i])
                for i = 1:length(ob)])
end

# Returns the projection and its coefficients
# for the basis elements
function proj_normal(L, x, basis)
    nb = length(basis)
    G = zeros(nb, nb)  # Gram matrix, always symmetric
    for i = 1:nb, j = 1:i
        G[i, j] = inner(L, basis[i], basis[j])
        G[j, i] = G[i, j]
    end
    c = [inner(L, x, basis[i]) for i = 1:nb]
    alpha = G \ c  # G * alpha = c
    return sum(basis .* alpha), alpha
end
18/24
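A short usage sketch (not from the slides): project x ∈ ℝ³ onto the plane spanned by two non-orthogonal
basis vectors, using the Euclidean inner product defined above.

b = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # basis, not orthogonal
x = [2.0, 3.0, 4.0]

proj(Val(:P2), x, b)                     # [2.0, 3.0, 0.0]
m, alpha = proj_normal(Val(:P2), x, b)   # m == [2.0, 3.0, 0.0], alpha == [-1.0, 3.0]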

Applications

Catching Bad Guys with Eigenfaces
[Figure: a face image decomposed as the average face + α₁ · Eigenface 1 + α₂ · Eigenface 2 + …]
• From a database of face images, compute the "average face" and n Eigenfaces.
• The Eigenfaces are extracted using the Eigen-decomposition technique already encountered for
  Fibonacci-in-constant-time (not further discussed here).
• The Eigenfaces are a basis for a (finite) n-dimensional vector space.
• For every face image, we can find a minimum-distance projection on the face-space.
  This gives us n-dimensional coefficients that we can use as features.
• Recognize a person by nearest-neighbor lookup for the Eigenface coefficients of known faces.
Lawrence Sirovich and Michael Kirby. "Low-dimensional procedure for the characterization of human faces".
In: Journal of the Optical Society of America A 4.3 (1987), pp. 519–524
19/24
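A hedged sketch of the recognition step, reusing inner and proj_normal from the earlier slide. It assumes
that avg_face (a flattened mean image), eigenfaces (basis vectors of the face-space), known_coeffs and
known_names are already computed; all of these names are illustrative.

# Coefficients of a face image in the Eigenface basis (avg_face and eigenfaces assumed precomputed)
coeffs(img) = proj_normal(Val(:P2), vec(img) .- avg_face, eigenfaces)[2]

# Nearest-neighbor lookup in coefficient space
function recognize(img, known_coeffs, known_names)
    c = coeffs(img)
    dists = [sqrt(sum((c .- kc) .^ 2)) for kc in known_coeffs]
    return known_names[argmin(dists)]
end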

Approximating sin with a Polynomial
The vector space L²[0, 1] contains continuous functions f: [0, 1] → ℝ
• with the inner product ⟨f|g⟩ = ∫₀¹ f(t)g(t) dt and the corresponding
• norm ‖f‖ = √(∫₀¹ f(t)² dt) (restrict to f where ‖f‖ < ∞).
Let Pₙ ⊆ L²[0, 1] be the vector space of the polynomials of nth degree.
• A polynomial of nth degree can be represented by an (n + 1)-vector of its coefficients (including
  the intercept).
• Any set of polynomial functions spans a subspace of L²[0, 1]. We can compute an orthonormal basis
  for it.
With this, we can perform a minimum-distance projection from the continuous functions on the polynomials
of nth degree.
Which polynomial of nth degree most closely represents g(t) = sin(πt)?
Solve as a minimum norm problem: fₙ = arg min over h ∈ Pₙ of ‖h − g‖.
f₂(t) ≈ −0.050 + 4.121t − 4.121t²
[Plot: sin(πt) and the degree-2 approximation poly-2 on [0, 1]]
f₄(t) ≈ −0.001 + 3.087t + 0.536t² − 7.247t³ + 3.623t⁴
[Plot: sin(πt) and the degree-4 approximation poly-4 on [0, 1]]
20/24

Approximating sin with a Polynomial in Julia
import Base: +, -, *, /

# Polynomial representation and evaluation
struct Poly
    c::Vector{Float64}  # Coefficients (intercept first)
end
(f::Poly)(x) = sum([x^(i-1) * f.c[i] for i = 1:length(f.c)])

# Addition and subtraction
+(f::Poly, g::Poly) = Poly(f.c .+ g.c)
-(f::Poly, g::Poly) = Poly(f.c .- g.c)

# Multiplication and division with a real scalar
*(f::Poly, y::T) where T <: Real = Poly(f.c * y)
/(f::Poly, y::T) where T <: Real = Poly(f.c / y)

# Examples
pp = Poly([1, 2, 0])
pp(2.0)  # 1 + 2*2.0 + 0*2.0^2 = 5.0
pp2 = pp*2 + Poly([1, 1, 1])
pp2(1.5)  # 3 + 5*1.5 + 1*1.5^2 = 12.75

# Inner product for functions from L2[0,1]
function inner(::Val{:L2}, f, g)
    dt = 0.001  # Approximate the integral
    return sum([f(t) * g(t) * dt for t = 0.0:dt:1.0])
end

# Project sin on the second-degree polynomials
g(t) = sin(t*pi)
p_basis = [Poly([1, 0, 0]),
           Poly([0, 1, 0]),
           Poly([0, 0, 1])]
g_proj, g_coeff = proj_normal(Val(:L2), g, p_basis)
# g_coeff = [-0.05016328783041,
#             4.12100032032210,
#            -4.12100032032211]

# How is the sine function actually computed
# by the OS/standard math library (libm)?
# - http://www.netlib.org/fdlibm/k_sin.c
# - http://www.netlib.org/fdlibm/s_sin.c
# Or via CORDIC algorithms in hardware
21/24
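As a quick sanity check (not on the slides), the largest deviation of the degree-2 projection from sin(πt)
on a grid over [0, 1] can be evaluated directly:

maximum(abs(g_proj(t) - g(t)) for t in 0.0:0.01:1.0)   # ≈ 0.05, largest near the interval ends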

Quadratic Optimization with Equality Constraints
x* = arg min over x ∈ ℝⁿ of xᵀQx
subject to Ax = b
All solutions fulfilling the equality constraint lie in a linear variety V. Linear varieties contain
elements from some vector space with an additional offset away from the null vector.
Note that ⟨x|y⟩_Q = xᵀQy is a valid inner product (for symmetric positive definite Q). Which element
of V is closest to 0 wrt. the implied distance metric?
1. Find some v that fulfills the constraint Av = b.
2. Let the Hilbert space H̃ be the nullspace of A with the inner product ⟨·|·⟩_Q. H̃ is parallel to V.
   Project v onto H̃:
   h = arg min over g ∈ H̃ of ‖v − g‖_Q
3. The solution is x* = v − h.
[Figure: projection with equality constraints, showing V, H̃, v, h and x*]
Application Example: Sea-of-Gates VLSI Optimization [Kleinhans1991]
22/24
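A hedged numerical sketch of the three steps for small dense Q (symmetric positive definite) and A, not
taken from the slides: it uses LinearAlgebra.nullspace to obtain a basis of H̃ and solves the Normal
Equations with respect to ⟨x|y⟩_Q; the function name is illustrative.

using LinearAlgebra

function min_quadratic_eq(Q, A, b)
    v = A \ b                             # 1. some v with A*v = b
    N = nullspace(A)                      # 2. columns form a basis of H̃ = null(A)
    alpha = (N' * Q * N) \ (N' * Q * v)   #    Normal Equations for the Q-projection of v
    h = N * alpha
    return v - h                          # 3. x* = v − h
end

Q = [2.0 0.0; 0.0 1.0]
A = [1.0 1.0]
b = [1.0]
min_quadratic_eq(Q, A, b)   # ≈ [1/3, 2/3], minimizing 2x₁² + x₂² on x₁ + x₂ = 1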

Conjugate Gradient (CG)
Similar to Gradient Descent, but with an additional processing step for the gradient
[Hestenes1952; Hestenes1980].
First step direction: d⁽¹⁾ = −∇f(x⁽⁰⁾)
Later step directions:
1. Start with d̃⁽ᵏ⁾ = −∇f(x⁽ᵏ⁻¹⁾).
2. Compute d⁽ᵏ⁾ by orthogonalization of d̃⁽ᵏ⁾ wrt. the previous step directions {d⁽¹⁾, …, d⁽ᵏ⁻¹⁾}.
3. Additional linesearch (specialized linesearch methods for CG exist).
For an unconstrained quadratic optimization problem in n dimensions, Conjugate Gradient converges
within n steps.
The Newton method would solve it in one step, but with the added cost of computing the Hessian and
solving a linear equation for it.
Note that Hestenes et al. developed CG on a Zuse Z4 computer.
[Figure: Conjugate Gradient for a quadratic objective. Image source: Wikipedia]
23/24
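Below is a sketch of the standard CG recurrence for the quadratic objective f(x) = ½xᵀQx − bᵀx (so
−∇f(x) = b − Qx); the explicit α/β updates are the usual way the orthogonalization and linesearch steps
above are implemented, and the tolerance is an illustrative choice.

using LinearAlgebra

function conjugate_gradient(Q, b, x0; tol = 1e-10)
    x = copy(x0)
    r = b - Q * x          # negative gradient at x
    d = copy(r)            # first step direction
    for _ in 1:length(b)   # converges within n steps for an n-dimensional problem
        Qd = Q * d
        alpha = dot(r, r) / dot(d, Qd)   # exact linesearch along d
        x += alpha * d
        rnew = r - alpha * Qd
        norm(rnew) < tol && break
        d = rnew + (dot(rnew, rnew) / dot(r, r)) * d   # Q-orthogonal to previous directions
        r = rnew
    end
    return x
end

Q = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
conjugate_gradient(Q, b, zeros(2))   # ≈ Q \ b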

Summary of what you learned today
• Vector spaces and their axioms
• Banach spaces and norms beyond Euclidean distances
• Hilbert spaces and inner products with a notion of orthogonality
• Computing an orthonormal basis with the Gram-Schmidt Algorithm
• Minimum-Norm Projection on the subspace of a Hilbert space via the Normal Equations
• Applications for Minimum-Norm Projection
  • Catching Bad Guys with Eigenfaces
  • Approximating the sine function with a polynomial
  • Quadratic Optimization with Equality Constraints
  • Conjugate Gradient
24/24

That’s it for today.
See you next week for Lecture 8 on
Duality
24/24

References
[Cassel2013] Kevin W Cassel. Variational methods with applications in science and engineering. Cambridge University Press, 2013.
[Hestenes1980] Magnus Rudolph Hestenes. Conjugate direction methods in optimization. Springer, 1980.
[Hestenes1952] Magnus R Hestenes, Eduard Stiefel, et al. "Methods of conjugate gradients for solving linear systems". In: Journal of Research of the National Bureau of Standards 49.6 (1952), pp. 409–436.
[Kleinhans1991] Jürgen M Kleinhans et al. "GORDIAN: VLSI placement by quadratic programming and slicing optimization". In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 10.3 (1991), pp. 356–365.
[Luenberger69] David G Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1969.
[Sirovich1987] Lawrence Sirovich and Michael Kirby. "Low-dimensional procedure for the characterization of human faces". In: Journal of the Optical Society of America A 4.3 (1987), pp. 519–524.
[Schölkopf2002] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[VonNeumann1932] John Von Neumann. Mathematische Grundlagen der Quantenmechanik. Springer, 1932.