Wasserstein GAN
JIN HO LEE
2018-11-30
Contents
1. Introduction
2. Different Distances
3. Wasserstein GAN
4. Empirical Results
▷ 4.1 Experimental Procedure
▷ 4.2 Meaningful loss metric
▷ 4.3 Improved stability
5. Related Work
1. Introduction
Main goal: learning GANs using the Wasserstein distance W(P_r, P_g).
In Section 2, we show how the Earth Mover (EM) distance behaves in comparison to the Total Variation (TV) distance, the Kullback-Leibler (KL) divergence, and the Jensen-Shannon (JS) divergence.
In Section 3, we define the Wasserstein GAN as an efficient approximation of the EM distance.
In Section 4, we empirically show that WGANs cure the main training problems of GANs.
2. Different Distances
A σ-algebra Σ of subsets of X is a collection of subsets of X satisfying the following conditions:
(a) ∅ ∈ Σ
(b) if B ∈ Σ, then B^c ∈ Σ
(c) if B_1, B_2, ... is a countable collection of sets in Σ, then ∪_{n=1}^∞ B_n ∈ Σ
Borel σ-algebra: the smallest σ-algebra containing the open sets.
A probability space consists of a sample space Ω, a set of events F, and a probability measure P, where the set of events F is a σ-algebra.
A function μ is a probability measure on a measurable space (X, Σ) if
(a) μ(X) = 1, μ(∅) = 0, and μ(A) ∈ [0, 1] for every A ∈ Σ
(b) countable additivity: for every countable collection {E_i} of pairwise disjoint sets in Σ,
    μ(∪_i E_i) = Σ_i μ(E_i).
The Total Variation (TV) distance
    δ(P_r, P_g) = sup_{A ∈ Σ} |P_r(A) − P_g(A)|.
The Kullback-Leibler (KL) divergence
    KL(P_r || P_g) = ∫ log( P_r(x) / P_g(x) ) P_r(x) dμ(x).
The Jensen-Shannon (JS) divergence
    JS(P_r, P_g) = KL(P_r || P_m) + KL(P_g || P_m),
where P_m = (P_r + P_g)/2 is the mixture.
The Earth-Mover (EM) distance or Wasserstein-1
    W(P_r, P_g) = inf_{γ ∈ Π(P_r, P_g)} E_{(x,y)~γ}[ ||x − y|| ],
where Π(P_r, P_g) denotes the set of all joint distributions γ(x, y) whose marginals are respectively P_r and P_g; that is, γ is a coupling of P_r and P_g.
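To make the four quantities concrete, here is a minimal numerical sketch (my illustration, not from the paper; the support points and probability vectors are made up). For discrete distributions, the TV distance as defined above equals half the L1 distance between the probability vectors, and JS is computed exactly as defined above (without a 1/2 factor).

# Illustrative only: TV, KL, JS, and W_1 for two discrete distributions
# supported on the same finite set of points on the real line.
import numpy as np
from scipy.stats import wasserstein_distance

support = np.array([0.0, 1.0, 2.0, 3.0])
p_r = np.array([0.1, 0.4, 0.4, 0.1])   # "real" distribution P_r
p_g = np.array([0.3, 0.2, 0.2, 0.3])   # "model" distribution P_g

tv = 0.5 * np.abs(p_r - p_g).sum()                     # sup_A |P_r(A) - P_g(A)|
kl = np.sum(p_r * np.log(p_r / p_g))                   # KL(P_r || P_g); assumes p_g > 0 wherever p_r > 0
p_m = 0.5 * (p_r + p_g)
js = np.sum(p_r * np.log(p_r / p_m)) + np.sum(p_g * np.log(p_g / p_m))      # JS as defined above
w1 = wasserstein_distance(support, support, u_weights=p_r, v_weights=p_g)   # EM distance in 1D

print(f"TV={tv:.3f}  KL={kl:.3f}  JS={js:.3f}  W1={w1:.3f}")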
Couplings
X: a compact metric space
Σ: the set of all Borel subsets of X
Prob(X): the set of probability measures on X
Definition
Let μ and ν be probability measures on the same measurable space (S, Σ). A coupling of μ and ν is a probability measure γ on the product space (S × S, Σ ⊗ Σ) such that the marginals of γ coincide with μ and ν, i.e.,
    γ(A × S) = μ(A) and γ(S × A) = ν(A) for all A ∈ Σ.
Example
For 0 ≤ p1 ≤ p2 ≤ 1 and q_i = 1 − p_i (i = 1, 2), we consider the following joint distributions f (the product coupling) and g (the monotone coupling) of (X̃, Ỹ) on {0, 1} × {0, 1}:

    f:           Y = 0    Y = 1          g:           Y = 0    Y = 1
       X = 0     q1 q2    q1 p2             X = 0      q2     p2 − p1
       X = 1     p1 q2    p1 p2             X = 1       0       p1

Since X̃ ~ Ber(p1) and Ỹ ~ Ber(p2) under both f and g, f and g are couplings of Ber(p1) and Ber(p2).
Example of Wasserstein Distance
Example
For the previous joint distributions f and g, assume (it is not actually true) that
    Π[Ber(p1), Ber(p2)] = {f, g}.
Then we have
    W(Ber(p1), Ber(p2)) = min{ q1 p2 + p1 q2, p2 − p1 }.
Proof.
Since Π[Ber(p1), Ber(p2)] = {f, g}, we only need to consider two cases.
Case 1. f ∈ Π[Ber(p1), Ber(p2)].
    E_{(x,y)~f}[ ||x − y|| ]
      = f(0,0)·||0 − 0|| + f(0,1)·||0 − 1|| + f(1,0)·||1 − 0|| + f(1,1)·||1 − 1||
      = q1 p2 + p1 q2
Case 2. g ∈ Π[Ber(p1), Ber(p2)].
    E_{(x,y)~g}[ ||x − y|| ]
      = g(0,0)·||0 − 0|| + g(0,1)·||0 − 1|| + g(1,0)·||1 − 0|| + g(1,1)·||1 − 1||
      = p2 − p1
By Cases 1 and 2, we have
    W(Ber(p1), Ber(p2)) = inf_{γ ∈ Π[Ber(p1), Ber(p2)]} E_{(x,y)~γ}[ ||x − y|| ]
                        = inf_{γ ∈ {f, g}} E_{(x,y)~γ}[ ||x − y|| ]
                        = min{ q1 p2 + p1 q2, p2 − p1 }.
An example of couplings
Lemma
For p1, p2 ∈ [0, 1], the set of all couplings Π[Ber(p1), Ber(p2)] of Ber(p1) and Ber(p2) is {p_a | a ∈ [0, 1]} (for the values of a that keep all four entries below nonnegative), where
    p_a(0, 0) = a
    p_a(0, 1) = q1 − a
    p_a(1, 0) = q2 − a
    p_a(1, 1) = p2 − q1 + a
Proof.
Let γ ∈ Π[Ber(p1), Ber(p2)]. Then γ fills in the following table, whose row and column sums are fixed by the marginals:

                    Y = 0    Y = 1    Σ_y γ(x, y)
    X = 0                                q1
    X = 1                                p1
    Σ_x γ(x, y)      q2       p2
For a ∈ [0, 1], if γ(0, 0) = a, then the rest of the table is completely determined by the marginal constraints:

                    Y = 0       Y = 1          Σ_y γ(x, y)
    X = 0             a         q1 − a             q1
    X = 1           q2 − a    p2 − (q1 − a)         p1
    Σ_x γ(x, y)      q2           p2

This means that, for each admissible a ∈ [0, 1], there is exactly one coupling γ of Ber(p1) and Ber(p2) with γ(0, 0) = a. This completes the proof.
A computational result of the Wasserstein Distance
Theorem
For p1 ≤ p2, we have
    W(Ber(p1), Ber(p2)) = p2 − p1.
Proof.
From the previous Lemma, we have Π[Ber(p1), Ber(p2)] = {p_a | a ∈ [0, 1]} with p_a(0, 0) = a. Then we obtain
    E_{(x,y)~p_a}[ ||x − y|| ]
      = p_a(0,0)·||0 − 0|| + p_a(0,1)·||0 − 1|| + p_a(1,0)·||1 − 0|| + p_a(1,1)·||1 − 1||
      = (q1 − a) + (q2 − a) = 2 − p1 − p2 − 2a.
Since p1 and p2 are constants and a cannot exceed either of the marginal probabilities, we have a ≤ min{q1, q2}. From the assumption p1 ≤ p2, we have q1 ≥ q2, so min{q1, q2} = q2.
The function E_{(x,y)~p_a}[ ||x − y|| ] = 2 − p1 − p2 − 2a is linear and decreasing in a, and a ≤ q2, so
    2 − p1 − p2 − 2a ≥ 2 − p1 − p2 − 2(1 − p2) = p2 − p1,
with equality at a = q2. Hence W(Ber(p1), Ber(p2)) = p2 − p1.
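As a quick sanity check (my own, not part of the slides; p1 and p2 below are arbitrary example values), one can scan the coupling family p_a from the Lemma over its admissible range and confirm numerically that the minimal expected cost is p2 − p1:

# Brute-force check of W(Ber(p1), Ber(p2)) = p2 - p1 over the family p_a.
import numpy as np

p1, p2 = 0.3, 0.7                              # example values with p1 <= p2
q1, q2 = 1.0 - p1, 1.0 - p2

a_lo, a_hi = max(0.0, q1 - p2), min(q1, q2)    # range keeping all four entries nonnegative
best = np.inf
for a in np.linspace(a_lo, a_hi, 10001):
    p_a = {(0, 0): a, (0, 1): q1 - a, (1, 0): q2 - a, (1, 1): p2 - q1 + a}
    cost = sum(prob * abs(x - y) for (x, y), prob in p_a.items())   # E_{(x,y)~p_a}[|x - y|]
    best = min(best, cost)

print(best, p2 - p1)                           # both are 0.4 (up to grid resolution)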
Example (1)
We assume that
▷ Z ~ U[0, 1]: the uniform distribution on the unit interval.
▷ P_0: the distribution of (0, Z) ∈ R², uniform on a straight vertical line passing through the origin.
▷ g_θ(z) = (θ, z), with θ a single real parameter.
Then we obtain the following.
    W(P_0, P_θ) = |θ|
    JS(P_0, P_θ) = log 2  if θ ≠ 0,   0  if θ = 0
    KL(P_θ || P_0) = KL(P_0 || P_θ) = +∞  if θ ≠ 0,   0  if θ = 0
    δ(P_0, P_θ) = 1  if θ ≠ 0,   0  if θ = 0
When θ_t → 0, the sequence (P_{θ_t})_{t∈N} converges to P_0 under the EM distance, but does not converge at all under the JS, KL, reverse KL, or TV divergences.
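A small numerical illustration of this example (mine, not from the paper): sample n points from each line and compute the optimal matching cost, which approximates W(P_0, P_θ) and comes out close to |θ|, even though the two supports are disjoint (so δ stays at 1 and JS at log 2 for every θ ≠ 0).

# Empirical W_1 between P_0 (points (0, z)) and P_theta (points (theta, z')).
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, theta = 500, 0.25
P0 = np.stack([np.zeros(n), rng.uniform(size=n)], axis=1)           # samples of (0, Z)
Pt = np.stack([np.full(n, theta), rng.uniform(size=n)], axis=1)     # samples of (theta, Z)

cost = np.linalg.norm(P0[:, None, :] - Pt[None, :, :], axis=-1)     # pairwise Euclidean distances
rows, cols = linear_sum_assignment(cost)                            # optimal one-to-one matching
print(cost[rows, cols].mean())                                      # close to |theta| = 0.25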
Theorem (1)
Let P_r be a fixed distribution over X. Let Z be a random variable (e.g. Gaussian) over another space Z. Let g : Z × R^d → X be a function, denoted g_θ(z), with z the first coordinate and θ the second. Let P_θ denote the distribution of g_θ(Z). Then,
1. If g is continuous in θ, so is W(P_r, P_θ).
2. If g is locally Lipschitz and satisfies regularity assumption 1, then W(P_r, P_θ) is continuous everywhere, and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence JS(P_r, P_θ) and all the KLs.
The following corollary tells us that learning by minimizing the EM distance makes sense (at least in theory) with neural networks.
Corollary
Let g_θ be any feedforward neural network parameterized by θ, and let p(z) be a prior over z such that E_{z~p(z)}[ ||z|| ] < ∞ (e.g. Gaussian, uniform, etc.). Then assumption 1 is satisfied, and therefore W(P_r, P_θ) is continuous everywhere and differentiable almost everywhere.
Theorem (2)
Let P be a distribution on a compact space X and (P_n)_{n∈N} be a sequence of distributions on X. Then, considering all limits as n → ∞,
1. The following statements are equivalent:
   ▷ δ(P_n, P) → 0, with δ the total variation distance.
   ▷ JS(P_n, P) → 0, with JS the Jensen-Shannon divergence.
2. The following statements are equivalent:
   ▷ W(P_n, P) → 0.
   ▷ P_n →_D P, where →_D represents convergence in distribution for random variables.
3. KL(P_n || P) → 0 or KL(P || P_n) → 0 imply the statements in (2).
3. Wasserstein GAN
Computing W(P_r, P_g) directly from the definition of the Wasserstein distance is intractable. However, the Kantorovich-Rubinstein duality tells us that
    W(P_r, P_g) = sup_{||f||_L ≤ 1} E_{x~P_r}[f(x)] − E_{x~P_g}[f(x)],
where ||f||_L ≤ 1 means that f satisfies the 1-Lipschitz condition.
Note that if we replace ||f||_L ≤ 1 with ||f||_L ≤ K for some K, we have
    K · W(P_r, P_g) = sup_{||f||_L ≤ K} E_{x~P_r}[f(x)] − E_{x~P_g}[f(x)].
If we have a parametrized family of functions {f_w}_{w∈W} that are all K-Lipschitz for some K, then we have
    max_{w∈W} E_{x~P_r}[f_w(x)] − E_{x~P_θ}[f_w(x)]
        ≤ sup_{||f||_L ≤ K} E_{x~P_r}[f(x)] − E_{x~P_θ}[f(x)] = K · W(P_r, P_θ).
Theorem (3)
Let P_r be any distribution. Let P_θ be the distribution of g_θ(Z), with Z a random variable with density p and g_θ a function satisfying assumption 1. Then, there is a solution f : X → R to the problem
    max_{||f||_L ≤ 1} E_{x~P_r}[f(x)] − E_{x~P_θ}[f(x)],
and we have
    ∇_θ W(P_r, P_θ) = −E_{z~p(z)}[ ∇_θ f(g_θ(z)) ]
when both terms are well-defined.
Objective functions:
    L_D^{WGAN} = E_{x~P_r}[f_w(x)] − E_{z~p(z)}[f_w(g_θ(z))]
    L_G^{WGAN} = −E_{z~p(z)}[f_w(g_θ(z))]
where, when training with L_D, the critic weights are clipped: w ← clip(w, −0.01, 0.01). The critic is trained to maximize L_D and the generator to minimize L_G.
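The losses above translate directly into code. The following is a minimal PyTorch sketch (my illustration; the network architectures, latent dimension, and data pipeline are placeholders rather than the paper's DCGAN setup), showing the critic update with weight clipping and the generator update:

# Minimal WGAN updates with weight clipping (illustrative sketch, not the paper's code).
import torch
import torch.nn as nn

latent_dim, data_dim, clip_value = 8, 2, 0.01
critic = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))              # f_w: scalar output, no sigmoid
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))  # g_theta
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

def critic_step(real):
    z = torch.randn(real.size(0), latent_dim)
    fake = generator(z).detach()
    # maximize E_x[f_w(x)] - E_z[f_w(g_theta(z))]  <=>  minimize the negative
    loss = -(critic(real).mean() - critic(fake).mean())
    opt_c.zero_grad(); loss.backward(); opt_c.step()
    for p in critic.parameters():                 # w <- clip(w, -0.01, 0.01)
        p.data.clamp_(-clip_value, clip_value)

def generator_step(batch_size=64):
    z = torch.randn(batch_size, latent_dim)
    loss = -critic(generator(z)).mean()           # L_G = -E_z[f_w(g_theta(z))]
    opt_g.zero_grad(); loss.backward(); opt_g.step()

# In the paper's algorithm the critic is updated several times (e.g. 5) per generator update.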
Figure 2
In this paper, the authors call the discriminator a critic. In Figure 2, a GAN discriminator and a WGAN critic are trained till optimality. The discriminator learns very quickly to distinguish between fake and real samples, and its gradients saturate. The critic, however, cannot saturate and converges to a linear function.
4. Empirical Results
We claim two main benefits:
▷ a meaningful loss metric that correlates with the generator's convergence and sample quality
▷ improved stability of the optimization process
4.1 Experimental Procedure
Training curves and the visualization of samples at different stages of training show a clear correlation between the Wasserstein estimate and the generated image quality.
Some knowledge to read the Appendix
Let X ⊂ R^d be a compact set (closed and bounded, by the Heine-Borel Theorem), and let Prob(X) denote the probability measures over X.
We define
    C_b(X) = { f : X → R | f is continuous and bounded }.
For f ∈ C_b(X) we can define the norm ||f||_∞ = max_{x∈X} |f(x)|, since f is bounded.
Then we have a normed vector space (C_b(X), ||·||_∞).
The dual space
    C_b(X)* = { φ : C_b(X) → R | φ is linear and continuous }
has norm ||φ|| = sup_{f ∈ C_b(X), ||f||_∞ ≤ 1} |φ(f)|.
Let μ be a signed measure over X, and define the total variation norm
    ||μ||_TV = sup_A |μ(A)|,
where A ranges over the Borel subsets of X. For two probability distributions P_r and P_θ, the function
    δ(P_r, P_θ) = ||P_r − P_θ||_TV
is a distance on Prob(X) (called the Total Variation distance).
We can consider the map
    Φ : (Prob(X), δ) → (C_b(X)*, ||·||),
where Φ(P)(f) = E_{x~P}[f(x)] is a linear function over C_b(X).
By the Riesz Representation Theorem, Φ is an isometric embedding, that is, δ(P, Q) = ||Φ(P) − Φ(Q)||, and Φ is one-to-one.