Learning Sparse Representation

91 slides · Dec 13, 2012

About This Presentation

Slides of the keynote presentation at the conference ADA7, Cargese, France, 14-18 May 2012.


Slide Content

Learning Sparse Representations
Gabriel Peyré
www.numerical-tours.com

Image Priors

Mathematical image prior J(f): compression, denoising, super-resolution, ...

Smooth images — Sobolev prior: J(f) = ||∇f||², i.e. low-pass Fourier coefficients.

Piecewise smooth images — total variation prior: J(f) = ||∇f||₁, i.e. sparse wavelet coefficients.

→ Learning the prior from exemplars?

Overview

• Sparsity and Redundancy
• Dictionary Learning
• Extensions
• Task-driven Learning
• Texture Synthesis

Image Representation

Dictionary D = {d_m}, m = 0, ..., Q−1, of atoms d_m ∈ R^N.

Image decomposition: f = Σ_{m=0}^{Q−1} x_m d_m = D x.

Image approximation: f ≈ D x.

Orthogonal dictionary: N = Q, with x_m = ⟨f, d_m⟩.

Redundant dictionary: N ≤ Q → x is not unique. Examples: TI wavelets, curvelets, ...
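To make the redundancy point concrete, here is a minimal NumPy sketch (illustrative, not from the slides; the 2-pixel dictionary is an arbitrary toy choice): with Q = 3 atoms in R², the same image f admits several exact decompositions f = Dx, one sparser than the other.

```python
import numpy as np

# Redundant dictionary: Q = 3 atoms (columns) in R^2, so N < Q.
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])        # third atom d_2 = d_0 + d_1
f = np.array([1.0, 1.0])

x1 = np.array([1.0, 1.0, 0.0])        # decomposition using d_0 and d_1
x2 = np.array([0.0, 0.0, 1.0])        # decomposition using d_2 only: sparser

assert np.allclose(D @ x1, f)
assert np.allclose(D @ x2, f)          # same image, different coefficients
```

This non-uniqueness is exactly what sparse coding exploits: among all x with Dx = f, prefer the sparsest one.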

Sparsity

Decomposition: f = Σ_{m=0}^{Q−1} x_m d_m = D x (image f ↔ coefficients x). Example: wavelet transform.

Sparsity: most x_m are small.

Ideal sparsity: most x_m are zero; J₀(x) = #{m : x_m ≠ 0}.

Approximate sparsity (compressibility): ||f − D x|| is small with J₀(x) ≤ M.

Sparse Coding

Redundant dictionary D = {d_m}, m = 0, ..., Q−1, with Q ≥ N
→ non-unique representation f = D x.

Sparsest decomposition: min_{f = Dx} J₀(x).

Sparsest approximation, in three equivalent forms (as λ, M and ε vary):
min_x (1/2)||f − D x||² + λ J₀(x)  ⟺  min_{J₀(x) ≤ M} ||f − D x||  ⟺  min_{||f − Dx|| ≤ ε} J₀(x)

Ortho-basis D: solved by hard thresholding,
x_m = ⟨f, d_m⟩ if |⟨f, d_m⟩| ≥ √(2λ), 0 otherwise,
i.e. pick the M largest coefficients in {⟨f, d_m⟩}_m.

General redundant dictionary: NP-hard.
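The ortho-basis case can be checked numerically. Below is an illustrative sketch (not from the slides; the random orthonormal basis and the sparsity pattern are arbitrary choices): coding amounts to hard thresholding the inner products ⟨f, d_m⟩ at level √(2λ).

```python
import numpy as np

def hard_threshold(f, D, lam):
    """Sparse coding in an orthonormal dictionary D (columns = atoms):
    keep a coefficient <f, d_m> only if its magnitude exceeds sqrt(2*lam)."""
    x = D.T @ f                                   # inner products <f, d_m>
    return np.where(np.abs(x) > np.sqrt(2 * lam), x, 0.0)

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthonormal basis
x_true = np.zeros(8)
x_true[[1, 4]] = [3.0, -2.0]                      # 2-sparse coefficients
f = D @ x_true

x = hard_threshold(f, D, lam=0.5)                 # threshold sqrt(2*0.5) = 1
assert np.count_nonzero(x) == 2                   # recovers the 2-sparse code
assert np.allclose(D @ x, f)                      # exact reconstruction here
```

In the noiseless orthonormal case the thresholding recovers the code exactly; with a redundant D this shortcut is unavailable, which is why the general problem is NP-hard.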

Convex Relaxation: L1 Prior

Image with 2 pixels (coordinates d₀, d₁):
J₀(x) = #{m : x_m ≠ 0}:
J₀(x) = 0 — null image.
J₀(x) = 1 — sparse image.
J₀(x) = 2 — non-sparse image.

ℓ^q priors: J_q(x) = Σ_m |x_m|^q (convex for q ≥ 1); level sets shown for q = 0, 1/2, 1, 3/2, 2.

Sparse ℓ¹ prior: J₁(x) = ||x||₁ = Σ_m |x_m|.
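The convexity claim can be checked numerically. This is an illustrative sketch (the test points are arbitrary): J_q satisfies the midpoint convexity inequality for q = 1 but violates it for q = 1/2, which is why ℓ¹ is the sparsest convex member of the family.

```python
import numpy as np

def J_q(x, q):
    """l^q prior J_q(x) = sum_m |x_m|^q (counts nonzeros when q = 0)."""
    x = np.asarray(x, dtype=float)
    if q == 0:
        return float(np.count_nonzero(x))
    return float(np.sum(np.abs(x) ** q))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = 0.5 * (a + b)

# Midpoint convexity holds for q = 1 ...
assert J_q(mid, 1) <= 0.5 * (J_q(a, 1) + J_q(b, 1))
# ... but fails for q = 1/2: the midpoint value exceeds the chord.
assert J_q(mid, 0.5) > 0.5 * (J_q(a, 0.5) + J_q(b, 0.5))
```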

Inverse Problems

Examples: inpainting, super-resolution, compressed sensing.
Denoising/approximation: Φ = Id.

Regularized Inversion

Denoising/compression: y = f₀ + w ∈ R^N.
Sparse approximation: f★ = D x★ where
x★ ∈ argmin_x (1/2)||y − D x||² + λ||x||₁   (fidelity + sparsity)

Inverse problems: y = Φ f₀ + w ∈ R^P. Replace D by ΦD:
x★ ∈ argmin_x (1/2)||y − ΦD x||² + λ||x||₁

Numerical solvers: proximal splitting schemes. See www.numerical-tours.com.
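The simplest proximal splitting scheme for this ℓ¹ problem is forward-backward iteration (ISTA). The sketch below is a generic illustration, not the solver behind the slides' results; A stands for D (denoising) or ΦD (inverse problem), and the step size 1/||A||² is the standard safe choice.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||.||_1 (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Forward-backward splitting for min_x 0.5*||y - A x||^2 + lam*||x||_1:
    a gradient step on the fidelity term, then the l1 proximal step."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2       # step <= 1/L, L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - step * A.T @ (A @ x - y), step * lam)
    return x

# Toy usage: recover a sparse code from a random underdetermined system.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 40)) / np.sqrt(20)
x0 = np.zeros(40)
x0[[3, 17]] = [2.0, -1.5]
x_rec = ista(A, A @ x0, lam=0.05)
```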

Inpainting Results

Overview

• Sparsity and Redundancy
• Dictionary Learning
• Extensions
• Task-driven Learning
• Texture Synthesis

Dictionary Learning: MAP Energy

Set of (noisy) exemplars {y_k}_k.

Sparse approximation: min_{x_k} (1/2)||y_k − D x_k||² + λ||x_k||₁

Dictionary learning: min_{D ∈ C} Σ_k min_{x_k} (1/2)||y_k − D x_k||² + λ||x_k||₁

Constraint: C = {D = (d_m)_m : ∀m, ||d_m|| ≤ 1}. Otherwise D → +∞, X → 0.

Matrix formulation: min over X ∈ R^{Q×K} and D ∈ C ⊂ R^{N×Q} of
f(X, D) = (1/2)||Y − D X||² + λ||X||₁

→ Convex with respect to X.
→ Convex with respect to D.
→ Non-convex with respect to (X, D): the profile D ↦ min_X f(X, D) has local minima.

Dictionary Learning: Algorithm

Initialization: choose some D.

Step 1: ∀k, minimization on x_k:
min_{x_k} (1/2)||y_k − D x_k||² + λ||x_k||₁
→ convex sparse coding.

Step 2: minimization on D:
min_{D ∈ C} ||Y − D X||²
→ convex constrained minimization, solved by projected gradient descent:
D^(ℓ+1) = Proj_C( D^(ℓ) − τ (D^(ℓ) X − Y) X^T )

Convergence: toward a stationary point of f(X, D).
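The two steps can be combined into the following toy implementation (an illustrative sketch, not the presenter's code; iteration counts and the step sizes 1/||D||², 1/||X||² are standard but arbitrary choices): ISTA for the sparse-coding step, projected gradient with atom renormalization for the dictionary step.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proj_C(D):
    """Project onto C: rescale any atom (column) with norm > 1 to norm 1."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

def learn_dictionary(Y, Q, lam=0.1, n_outer=20, n_inner=50, seed=0):
    """Alternating minimization of f(X, D) = 0.5*||Y - D X||_F^2 + lam*||X||_1."""
    rng = np.random.default_rng(seed)
    N, K = Y.shape
    D = proj_C(rng.standard_normal((N, Q)))
    X = np.zeros((Q, K))
    for _ in range(n_outer):
        # Step 1: sparse coding of all exemplars (ISTA on X, D fixed).
        step = 1.0 / np.linalg.norm(D, 2) ** 2
        for _ in range(n_inner):
            X = soft_threshold(X - step * D.T @ (D @ X - Y), step * lam)
        # Step 2: projected gradient descent on D (X fixed).
        if np.linalg.norm(X, 2) > 0:
            tau = 1.0 / np.linalg.norm(X, 2) ** 2
            for _ in range(n_inner):
                D = proj_C(D - tau * (D @ X - Y) @ X.T)
    return D, X
```

Both steps monotonically decrease f(X, D), so the iterates converge toward a stationary point, matching the slide; a local minimum of the joint non-convex problem is the best one can certify.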

Patch-based Learning

Exemplar patches y_k → learning → dictionary D [Olshausen and Field 1997].

→ State-of-the-art denoising [Elad et al. 2006]
→ Sparse texture synthesis, inpainting [Peyré 2008]

D
(k)
=(dm)
k!1
m=0PCA dimensionality reduction:
⇥k,min
D
||Y"D
(k)
X||
Linear (PCA):Fourier-like atoms.
Comparison with PCA
RUBINSTEINet al.: DICTIONARIES FOR SPARSE REPRESENTATION 3
Fig. 1. Left: A few12£12DCT atoms. Right: The first 40 KLT atoms,
trained using12£12image patches fromLena.
B. Non-Linear Revolution and Elements of Modern Dictionary
Design
In statistics research, the 1980’s saw the rise of a new
powerful approach known asrobust statistics. Robust statistics
advocates sparsity as a key for a wide range of recovery and
analysis tasks. The idea has its roots in classical Physics, and
more recently in Information Theory, and promotes simplicity
and conciseness in guiding phenomena descriptions. Motivated
by these ideas, the 1980’s and 1990’s were characterized
by a search for sparser representations and more efficient
transforms.
Increasing sparsity required departure from the linear model,
towards a more flexiblenon-linearformulation. In the non-
linear case, each signal is allowed to use a different set
of atoms from the dictionary in order to achieve the best
approximation. Thus, the approximation process becomes
x⇤
X
n!IK(x)
cn!
n, (5)
whereIK(x)is an index set adapted to each signal individually
(we refer the reader to [5], [7] for more information on this
wide topic).
The non-linear view paved the way to the design of
newer, more efficient transforms. In the process, many of
the fundamental concepts guiding modern dictionary design
were formed. Following the historic time line, we trace the
emergence of the most important modern dictionary design
concepts, which are mostly formed during the last two decades
of the 20th century.
Localization: To achieve sparsity, transforms required better
localization. Atoms with concentrated supports allow more
flexible representations based on the local signal characteris-
tics, and limit the effects of irregularities, which are observed
to be the main source of large coefficients. In this spirit, one
of the first structures to be used was the Short Time Fourier
Transform (STFT) [8], which emerges as a natural extension
to the Fourier transform. In the STFT, the Fourier transform is
applied locally to (possibly overlapping) portions of the signal,
revealing atime-frequency(or space-frequency) description
of the signal. An example of the STFT is the JPEG image
compression algorithm [9], which is based on this concept.
During the 1980’s and 1990’s, the STFT was extensively
researched and generalized, becoming more known as the
Gabortransform, named in homage of Dennis Gabor, who
first suggested the time-frequency decomposition back in
1946 [10]. Gabor’s work was independently rediscovered in
1980 by Bastiaans [11] and Janssen [12], who studied the
fundamental properties of the expansion.
A basic 1-D Gabor dictionary consists of windowed wave-
forms
G=
©
⇤n,m(x)=w(x"⇥m)e
i2!"nx

n,m!Z
,
wherew(·)is a low-pass window function localized at 0
(typically a Gaussian), and#and⇥control the time and
frequency resolution of the transform. Much of the mathe-
matical foundations of this transform were laid out during the
late 1980’s by Daubechies, Grossman and Meyer [13], [14]
who studied the transform from the angle of frame theory,
and by Feichtinger and Gr¨ochenig [15]–[17] who employed a
generalized group-theoretic point of view. Study of the discrete
version of the transform and its numerical implementation
followed in the early 1990’s, with notable contributions by
Wexler and Raz [18] and by Qian and Chen [19].
In higher dimensions, more complex Gabor structures were
developed which adddirectionality, by varying the orientation
of the sinusoidal waves. This structure gained substantial
support from the work of Daugman [20], [21], who discovered
oriented Gabor-like patterns in simple-cell receptive fields in
the visual cortex. These results motivated the deployment of
the transform to image processing tasks, led by works such as
Daugman [22] and Porat and Zeevi [23]. Today, practical uses
of the Gabor transform are mainly in analysis and detection
tasks, as a collection of directional filters. Figure 2 shows
some examples of 2-D Gabor atoms of various orientations
and sizes.
Multi-Resolution: One of the most significant conceptual
advancements achieved in the 1980’s was the rise ofmulti-
scaleanalysis. It was realized that natural signals, and images
specifically, exhibited meaningful structures over many scales,
and could be analyzed and described particularly efficiently
by multi-scale constructions. One of the simplest and best
known such structures is theLaplacian pyramid, introduced
in 1984 by Burt and Adelson [24]. The Laplacian pyramid
represents an image as a series of difference images, where
each one corresponds to a different scale and roughly a
different frequency band.
In the second half of the 1980’s, though, the signal process-
ing community was particularly excited about the development
of a new very powerful tool, known aswavelet analysis[5],
[25], [26]. In a pioneering work from 1984, Grossman and
Morlet [27] proposed a signal expansion over a series of
translated and dilated versions of a single elementary function,
taking the form
W=
n
⇤n,m(x)=#
n/2
f(#
n
x"⇥m)
o
n,m!Z
.
This simple idea captivated the signal processing and harmonic
analysis communities, and in a series of influential works by
Meyer, Daubechies, Mallat and others [13], [14], [28]–[33],
an extensive wavelet theory was formalized. The theory was
formulated for both the continuous and discrete domains, and
a complete mathematical framework relating the two was put
forth. A significant breakthrough came from Meyer’s work in
1985 [28], who found that unlike the Gabor transform (and
RUBINSTEINet al.: DICTIONARIES FOR SPARSE REPRESENTATION 3
Fig. 1. Left: A few12£12DCT atoms. Right: The first 40 KLT atoms,
trained using12£12image patches fromLena.
B. Non-Linear Revolution and Elements of Modern Dictionary
Design
In statistics research, the 1980’s saw the rise of a new
powerful approach known asrobust statistics. Robust statistics
advocates sparsity as a key for a wide range of recovery and
analysis tasks. The idea has its roots in classical Physics, and
more recently in Information Theory, and promotes simplicity
and conciseness in guiding phenomena descriptions. Motivated
by these ideas, the 1980’s and 1990’s were characterized
by a search for sparser representations and more efficient
transforms.
Increasing sparsity required departure from the linear model,
towards a more flexiblenon-linearformulation. In the non-
linear case, each signal is allowed to use a different set
of atoms from the dictionary in order to achieve the best
approximation. Thus, the approximation process becomes
x⇤
X
n!IK(x)
cn!
n, (5)
whereIK(x)is an index set adapted to each signal individually
(we refer the reader to [5], [7] for more information on this
wide topic).
The non-linear view paved the way to the design of
newer, more efficient transforms. In the process, many of
the fundamental concepts guiding modern dictionary design
were formed. Following the historic time line, we trace the
emergence of the most important modern dictionary design
concepts, which are mostly formed during the last two decades
of the 20th century.
Localization: To achieve sparsity, transforms required better
localization. Atoms with concentrated supports allow more
flexible representations based on the local signal characteris-
tics, and limit the effects of irregularities, which are observed
to be the main source of large coefficients. In this spirit, one
of the first structures to be used was the Short Time Fourier
Transform (STFT) [8], which emerges as a natural extension
to the Fourier transform. In the STFT, the Fourier transform is
applied locally to (possibly overlapping) portions of the signal,
revealing atime-frequency(or space-frequency) description
of the signal. An example of the STFT is the JPEG image
compression algorithm [9], which is based on this concept.
During the 1980’s and 1990’s, the STFT was extensively
researched and generalized, becoming more known as the
Gabortransform, named in homage of Dennis Gabor, who
first suggested the time-frequency decomposition back in
1946 [10]. Gabor’s work was independently rediscovered in
1980 by Bastiaans [11] and Janssen [12], who studied the
fundamental properties of the expansion.
A basic 1-D Gabor dictionary consists of windowed wave-
forms
G=
©
⇤n,m(x)=w(x"⇥m)e
i2!"nx

n,m!Z
,
wherew(·)is a low-pass window function localized at 0
(typically a Gaussian), and#and⇥control the time and
frequency resolution of the transform. Much of the mathe-
matical foundations of this transform were laid out during the
late 1980’s by Daubechies, Grossman and Meyer [13], [14]
who studied the transform from the angle of frame theory,
and by Feichtinger and Gr¨ochenig [15]–[17] who employed a
generalized group-theoretic point of view. Study of the discrete
version of the transform and its numerical implementation
followed in the early 1990’s, with notable contributions by
Wexler and Raz [18] and by Qian and Chen [19].
In higher dimensions, more complex Gabor structures were
developed which adddirectionality, by varying the orientation
of the sinusoidal waves. This structure gained substantial
support from the work of Daugman [20], [21], who discovered
oriented Gabor-like patterns in simple-cell receptive fields in
the visual cortex. These results motivated the deployment of
the transform to image processing tasks, led by works such as
Daugman [22] and Porat and Zeevi [23]. Today, practical uses
of the Gabor transform are mainly in analysis and detection
tasks, as a collection of directional filters. Figure 2 shows
some examples of 2-D Gabor atoms of various orientations
and sizes.
Multi-Resolution: One of the most significant conceptual
advancements achieved in the 1980’s was the rise ofmulti-
scaleanalysis. It was realized that natural signals, and images
specifically, exhibited meaningful structures over many scales,
and could be analyzed and described particularly efficiently
by multi-scale constructions. One of the simplest and best
known such structures is theLaplacian pyramid, introduced
in 1984 by Burt and Adelson [24]. The Laplacian pyramid
represents an image as a series of difference images, where
each one corresponds to a different scale and roughly a
different frequency band.
In the second half of the 1980’s, though, the signal process-
ing community was particularly excited about the development
of a new very powerful tool, known aswavelet analysis[5],
[25], [26]. In a pioneering work from 1984, Grossman and
Morlet [27] proposed a signal expansion over a series of
translated and dilated versions of a single elementary function,
taking the form
W=
n
⇤n,m(x)=#
n/2
f(#
n
x"⇥m)
o
n,m!Z
.
This simple idea captivated the signal processing and harmonic
analysis communities, and in a series of influential works by
Meyer, Daubechies, Mallat and others [13], [14], [28]–[33],
an extensive wavelet theory was formalized. The theory was
formulated for both the continuous and discrete domains, and
a complete mathematical framework relating the two was put
forth. A significant breakthrough came from Meyer’s work in
1985 [28], who found that unlike the Gabor transform (and
DCT
PCA

D
(k)
=(dm)
k!1
m=0PCA dimensionality reduction:
⇥k,min
D
||Y"D
(k)
X||
Linear (PCA):Fourier-like atoms.
Sparse (learning):Gabor-like atoms.
Comparison with PCA
RUBINSTEINet al.: DICTIONARIES FOR SPARSE REPRESENTATION 3
Fig. 1. Left: A few12£12DCT atoms. Right: The first 40 KLT atoms,
trained using12£12image patches fromLena.
B. Non-Linear Revolution and Elements of Modern Dictionary
Design
In statistics research, the 1980’s saw the rise of a new
powerful approach known asrobust statistics. Robust statistics
advocates sparsity as a key for a wide range of recovery and
analysis tasks. The idea has its roots in classical Physics, and
more recently in Information Theory, and promotes simplicity
and conciseness in guiding phenomena descriptions. Motivated
by these ideas, the 1980’s and 1990’s were characterized
by a search for sparser representations and more efficient
transforms.
Increasing sparsity required departure from the linear model,
towards a more flexiblenon-linearformulation. In the non-
linear case, each signal is allowed to use a different set
of atoms from the dictionary in order to achieve the best
approximation. Thus, the approximation process becomes
x⇤
X
n!IK(x)
cn!
n, (5)
whereIK(x)is an index set adapted to each signal individually
(we refer the reader to [5], [7] for more information on this
wide topic).
The non-linear view paved the way to the design of
newer, more efficient transforms. In the process, many of
the fundamental concepts guiding modern dictionary design
were formed. Following the historic time line, we trace the
emergence of the most important modern dictionary design
concepts, which are mostly formed during the last two decades
of the 20th century.
Localization: To achieve sparsity, transforms required better
localization. Atoms with concentrated supports allow more
flexible representations based on the local signal characteris-
tics, and limit the effects of irregularities, which are observed
to be the main source of large coefficients. In this spirit, one
of the first structures to be used was the Short Time Fourier
Transform (STFT) [8], which emerges as a natural extension
to the Fourier transform. In the STFT, the Fourier transform is
applied locally to (possibly overlapping) portions of the signal,
revealing atime-frequency(or space-frequency) description
of the signal. An example of the STFT is the JPEG image
compression algorithm [9], which is based on this concept.
During the 1980’s and 1990’s, the STFT was extensively
researched and generalized, becoming more known as the
Gabortransform, named in homage of Dennis Gabor, who
first suggested the time-frequency decomposition back in
1946 [10]. Gabor’s work was independently rediscovered in
1980 by Bastiaans [11] and Janssen [12], who studied the
fundamental properties of the expansion.
A basic 1-D Gabor dictionary consists of windowed wave-
forms
G=
©
⇤n,m(x)=w(x"⇥m)e
i2!"nx

n,m!Z
,
wherew(·)is a low-pass window function localized at 0
(typically a Gaussian), and#and⇥control the time and
frequency resolution of the transform. Much of the mathe-
matical foundations of this transform were laid out during the
late 1980’s by Daubechies, Grossman and Meyer [13], [14]
who studied the transform from the angle of frame theory,
and by Feichtinger and Gr¨ochenig [15]–[17] who employed a
generalized group-theoretic point of view. Study of the discrete
version of the transform and its numerical implementation
followed in the early 1990’s, with notable contributions by
Wexler and Raz [18] and by Qian and Chen [19].
In higher dimensions, more complex Gabor structures were
developed which adddirectionality, by varying the orientation
of the sinusoidal waves. This structure gained substantial
support from the work of Daugman [20], [21], who discovered
oriented Gabor-like patterns in simple-cell receptive fields in
the visual cortex. These results motivated the deployment of
the transform to image processing tasks, led by works such as
Daugman [22] and Porat and Zeevi [23]. Today, practical uses
of the Gabor transform are mainly in analysis and detection
tasks, as a collection of directional filters. Figure 2 shows
some examples of 2-D Gabor atoms of various orientations
and sizes.
Multi-Resolution: One of the most significant conceptual
advancements achieved in the 1980’s was the rise ofmulti-
scaleanalysis. It was realized that natural signals, and images
specifically, exhibited meaningful structures over many scales,
and could be analyzed and described particularly efficiently
by multi-scale constructions. One of the simplest and best
known such structures is theLaplacian pyramid, introduced
in 1984 by Burt and Adelson [24]. The Laplacian pyramid
represents an image as a series of difference images, where
each one corresponds to a different scale and roughly a
different frequency band.
In the second half of the 1980’s, though, the signal process-
ing community was particularly excited about the development
of a new very powerful tool, known aswavelet analysis[5],
[25], [26]. In a pioneering work from 1984, Grossman and
Morlet [27] proposed a signal expansion over a series of
translated and dilated versions of a single elementary function,
taking the form
W=
n
⇤n,m(x)=#
n/2
f(#
n
x"⇥m)
o
n,m!Z
.
This simple idea captivated the signal processing and harmonic
analysis communities, and in a series of influential works by
Meyer, Daubechies, Mallat and others [13], [14], [28]–[33],
an extensive wavelet theory was formalized. The theory was
formulated for both the continuous and discrete domains, and
a complete mathematical framework relating the two was put
forth. A significant breakthrough came from Meyer’s work in
1985 [28], who found that unlike the Gabor transform (and
RUBINSTEINet al.: DICTIONARIES FOR SPARSE REPRESENTATION 3
Fig. 1. Left: A few12£12DCT atoms. Right: The first 40 KLT atoms,
trained using12£12image patches fromLena.
B. Non-Linear Revolution and Elements of Modern Dictionary
Design
In statistics research, the 1980’s saw the rise of a new
powerful approach known asrobust statistics. Robust statistics
advocates sparsity as a key for a wide range of recovery and
analysis tasks. The idea has its roots in classical Physics, and
more recently in Information Theory, and promotes simplicity
and conciseness in guiding phenomena descriptions. Motivated
by these ideas, the 1980’s and 1990’s were characterized
by a search for sparser representations and more efficient
transforms.
Increasing sparsity required departure from the linear model,
towards a more flexiblenon-linearformulation. In the non-
linear case, each signal is allowed to use a different set
of atoms from the dictionary in order to achieve the best
approximation. Thus, the approximation process becomes
x⇤
X
n!IK(x)
cn!
n, (5)
whereIK(x)is an index set adapted to each signal individually
(we refer the reader to [5], [7] for more information on this
wide topic).
The non-linear view paved the way to the design of
newer, more efficient transforms. In the process, many of
the fundamental concepts guiding modern dictionary design
were formed. Following the historic time line, we trace the
emergence of the most important modern dictionary design
concepts, which are mostly formed during the last two decades
of the 20th century.
Localization: To achieve sparsity, transforms required better
localization. Atoms with concentrated supports allow more
flexible representations based on the local signal characteris-
tics, and limit the effects of irregularities, which are observed
to be the main source of large coefficients. In this spirit, one
of the first structures to be used was the Short Time Fourier
Transform (STFT) [8], which emerges as a natural extension
to the Fourier transform. In the STFT, the Fourier transform is
applied locally to (possibly overlapping) portions of the signal,
revealing atime-frequency(or space-frequency) description
of the signal. An example of the STFT is the JPEG image
compression algorithm [9], which is based on this concept.
During the 1980’s and 1990’s, the STFT was extensively
researched and generalized, becoming more known as the
Gabortransform, named in homage of Dennis Gabor, who
first suggested the time-frequency decomposition back in
1946 [10]. Gabor’s work was independently rediscovered in
1980 by Bastiaans [11] and Janssen [12], who studied the
fundamental properties of the expansion.
A basic 1-D Gabor dictionary consists of windowed wave-
forms
G=
©
⇤n,m(x)=w(x"⇥m)e
i2!"nx

n,m!Z
,
wherew(·)is a low-pass window function localized at 0
(typically a Gaussian), and#and⇥control the time and
frequency resolution of the transform. Much of the mathe-
matical foundations of this transform were laid out during the
late 1980’s by Daubechies, Grossman and Meyer [13], [14]
who studied the transform from the angle of frame theory,
and by Feichtinger and Gr¨ochenig [15]–[17] who employed a
generalized group-theoretic point of view. Study of the discrete
version of the transform and its numerical implementation
followed in the early 1990’s, with notable contributions by
Wexler and Raz [18] and by Qian and Chen [19].
In higher dimensions, more complex Gabor structures were
developed which adddirectionality, by varying the orientation
of the sinusoidal waves. This structure gained substantial
support from the work of Daugman [20], [21], who discovered
oriented Gabor-like patterns in simple-cell receptive fields in
the visual cortex. These results motivated the deployment of
the transform to image processing tasks, led by works such as
Daugman [22] and Porat and Zeevi [23]. Today, practical uses
of the Gabor transform are mainly in analysis and detection
tasks, as a collection of directional filters. Figure 2 shows
some examples of 2-D Gabor atoms of various orientations
and sizes.
Multi-Resolution: One of the most significant conceptual
advancements achieved in the 1980’s was the rise ofmulti-
scaleanalysis. It was realized that natural signals, and images
specifically, exhibited meaningful structures over many scales,
and could be analyzed and described particularly efficiently
by multi-scale constructions. One of the simplest and best
known such structures is theLaplacian pyramid, introduced
in 1984 by Burt and Adelson [24]. The Laplacian pyramid
represents an image as a series of difference images, where
each one corresponds to a different scale and roughly a
different frequency band.
In the second half of the 1980’s, though, the signal process-
ing community was particularly excited about the development
of a new very powerful tool, known aswavelet analysis[5],
[25], [26]. In a pioneering work from 1984, Grossman and
Morlet [27] proposed a signal expansion over a series of
translated and dilated versions of a single elementary function,
taking the form
W=
n
⇤n,m(x)=#
n/2
f(#
n
x"⇥m)
o
n,m!Z
.
This simple idea captivated the signal processing and harmonic
analysis communities, and in a series of influential works by
Meyer, Daubechies, Mallat and others [13], [14], [28]–[33],
an extensive wavelet theory was formalized. The theory was
formulated for both the continuous and discrete domains, and
a complete mathematical framework relating the two was put
forth. A significant breakthrough came from Meyer’s work in
1985 [28], who found that unlike the Gabor transform (and
DCT
PCA
Gabor
Learned
4 IEEE PROCEEDINGS, VOL. X, NO. X, XX 20XX
Fig. 2. Left: A few12£12Gabor atoms at different scales and orientations.
Right: A few atoms trained by Olshausen and Field (extracted from [34]).
contrary to common belief) the wavelet transform could be
designed to beorthogonalwhile maintaining stability — an
extremely appealing property to which much of the initial
success of the wavelets can be attributed to.
Specifically of interest to the signal processing community
was the work of Mallat and his colleagues [31]–[33] which
established the wavelet decomposition as a multi-resolution
expansion and put forth efficient algorithms for computing
it. In Mallat’s description, a multi-scale wavelet basis is
constructed from a pair of localized functions referred to as
the scaling function and the mother wavelet; see Figure 3.
The scaling function is a low frequency signal, and along
with its translations, spans the coarse approximation of the
signal. The mother wavelet is a high frequency signal, and
with its various scales and translations spans the signal detail.
In the orthogonal case, the wavelet basis functions at each
scale are critically sampled, spanning precisely the new detail
introduced by the finer level.
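The scaling/wavelet split above can be sketched in a few lines. This is a minimal one-level transform using the Haar pair (assumed here for simplicity; the Coiflet filters of Figure 3 are longer but play the same roles): the low-pass branch gives the coarse approximation, the high-pass branch the detail, and orthogonality gives perfect reconstruction and energy conservation.

```python
import numpy as np

def haar_analysis(f):
    # Low-pass branch: scaling coefficients (coarse approximation).
    a = (f[0::2] + f[1::2]) / np.sqrt(2)
    # High-pass branch: wavelet coefficients (detail).
    d = (f[0::2] - f[1::2]) / np.sqrt(2)
    return a, d

def haar_synthesis(a, d):
    f = np.empty(2 * len(a))
    f[0::2] = (a + d) / np.sqrt(2)
    f[1::2] = (a - d) / np.sqrt(2)
    return f

f = np.array([4.0, 4.0, 6.0, 2.0, 5.0, 5.0, 1.0, 3.0])
a, d = haar_analysis(f)
f_rec = haar_synthesis(a, d)
assert np.allclose(f_rec, f)                                   # perfect reconstruction
assert np.isclose(np.sum(f**2), np.sum(a**2) + np.sum(d**2))   # orthogonality
```

Iterating the analysis on the coarse branch yields the full multi-scale decomposition described above.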
Non-linear approximation in the wavelet basis was shown
to be optimal for piecewise-smooth 1-D signals with a finite
number of discontinuities; see e.g. [32]. This was a striking
finding at the time, as the optimal rate is achieved without any
prior detection of the discontinuity locations. Unfortunately,
in higher dimensions the wavelet transform loses its opti-
mality; the multi-dimensional transform is a simple separable
extension of the 1-D transform, with atoms supported over
rectangular regions of different sizes (see Figure 3). This
separability makes the transform simple to apply; however, the
resulting dictionary is only effective for signals with point
singularities, while most natural signals exhibit elongated edge
singularities. The JPEG2000 image compression standard,
based on the wavelet transform, is indeed known for its ringing
(smoothing) artifacts near edges.
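The optimality claim can be illustrated numerically. The sketch below (a toy experiment, not the cited result) applies a full Haar transform to a piecewise-constant signal; since the basis is orthogonal, the squared M-term approximation error equals the energy of the dropped coefficients, and a handful of coefficients suffice despite the discontinuities.

```python
import numpy as np

def haar(f):
    # Full orthogonal Haar transform of a length-2^J signal.
    f = np.asarray(f, dtype=float)
    out = []
    while len(f) > 1:
        out.append((f[0::2] - f[1::2]) / np.sqrt(2))  # detail coefficients
        f = (f[0::2] + f[1::2]) / np.sqrt(2)          # coarse approximation
    out.append(f)
    return np.concatenate(out)

# Piecewise-constant signal of length 256 with three discontinuities.
f = np.concatenate([np.full(60, 1.0), np.full(70, 5.0),
                    np.full(70, 2.0), np.full(56, 7.0)])
c = haar(f)
# Only Haar coefficients whose support straddles a jump are nonzero, so
# keeping the M largest already captures the signal exactly; by
# orthogonality, the squared error is the energy of the dropped ones.
M = 25
dropped = np.sort(np.abs(c))[:-M]
rel_err2 = np.sum(dropped**2) / np.sum(c**2)
assert rel_err2 < 1e-20   # 25 of 256 coefficients represent f exactly
```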
Adaptivity: Going into the 1990s, the desire to push sparsity
even further and describe increasingly complex phenomena
was gradually revealing the limits of approximation in orthog-
onal bases. The weakness was mostly associated with the small
and fixed number of atoms in the dictionary — dictated by the
orthogonality — from which the optimal representation could
be constructed. Thus, one option to obtain further sparsity was
to adapt the transform atoms themselves to the signal content.
One of the first such structures to be proposed was the
wavelet packet transform, introduced by Coifman, Meyer
and Wickerhauser in 1992 [35]. The transform is built upon
the success of the wavelet transform, adding adaptivity to
allow finer tuning to the specific signal properties. The main
observation of Coifman et al. was that the wavelet transform
Fig. 3. Left: Coiflet 1-D scaling function (solid) and mother wavelet (dashed). Right: Some 2-D separable Coiflet atoms.
enforced a very specific time-frequency structure, with high
frequency atoms having small supports and low frequency
atoms having large supports. Indeed, this choice has deep
connections to the behavior of real natural signals; however,
for specific signals, better partitionings may be possible. The
wavelet packet dictionary essentially unifies all dyadic time-
frequency atoms which can be derived from a specific pair
of scaling function and mother wavelet, so atoms of different
frequencies can come in an array of time supports. Out of
this large collection, the wavelet packet transform can
efficiently select an optimized orthogonal sub-dictionary for
any given signal, with the standard wavelet basis being just
one of an exponential number of options. The process was
thus named by the authors a Best Basis search. The wavelet
packet transform is, by definition, at least as good as wavelets
in terms of coding efficiency. However, we note that the multi-
dimensional wavelet packet transform remains a separable and
non-oriented transform, and thus does not generally provide a
substantial improvement over wavelets for images.
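The Best Basis recursion itself is short: at each node of the packet tree, compare the cost of keeping that node's coefficients against the best achievable cost of its two children. The sketch below uses the Haar split and an l1 cost as a stand-in additive measure (Coifman and Wickerhauser use an entropy); it returns only the optimal cost, but the chosen tree could be recorded alongside it.

```python
import numpy as np

def haar_split(f):
    return ((f[0::2] + f[1::2]) / np.sqrt(2),
            (f[0::2] - f[1::2]) / np.sqrt(2))

def cost(c):
    return np.sum(np.abs(c))   # additive sparsity cost (l1 stand-in)

def best_basis(f):
    # Cheapest cost over all dyadic packet splits of this node:
    # either keep the node's coefficients, or split and recurse.
    if len(f) == 1:
        return cost(f)
    a, d = haar_split(f)
    return min(cost(f), best_basis(a) + best_basis(d))

# A constant block is cheapest when split all the way down: each Haar
# split divides its l1 cost by sqrt(2), giving 8 / 2**1.5 for ones(8).
assert np.isclose(best_basis(np.ones(8)), 8 / 2**1.5)
# The adapted basis can never cost more than keeping the raw samples.
f = np.sin(2 * np.pi * 8 * np.arange(64) / 64)
assert best_basis(f) <= cost(f)
```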
Geometric Invariance and Overcompleteness: In 1992,
Simoncelli et al. [36] published a thorough work advocating a
dictionary property they termed shiftability, which describes
the invariance of the dictionary under certain geometric defor-
mations, e.g. translation, rotation or scaling. Indeed, the main
weakness of the wavelet transform is its strong translation-
sensitivity, as well as rotation-sensitivity in higher dimensions.
The authors concluded that achieving these properties required
abandoning orthogonality in favor of overcompleteness, since
the critical number of atoms in an orthogonal transform was
simply insufficient. In the same work, the authors developed
an overcomplete oriented wavelet transform — the steerable
wavelet transform — which was based on their previous work
on steerable filters and consisted of localized 2-D wavelet
atoms in many orientations, translations and scales.
For the basic 1-D wavelet transform, translation-invariance
can be achieved by increasing the sampling density of the
atoms. The stationary wavelet transform, also known as the
undecimated or non-subsampled wavelet transform, is obtained
from the orthogonal transform by eliminating the sub-sampling
and collecting all translations of the atoms over the signal
domain. The algorithmic foundation for this was laid by
Beylkin in 1992 [37], with the development of an efficient
algorithm for computing the undecimated transform. The
stationary wavelet transform was indeed found to substantially
improve signal recovery compared to orthogonal wavelets,
and its benefits were independently demonstrated in 1995 by
Nason and Silverman [38] and Coifman and Donoho [39].
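A one-level sketch of the undecimated transform makes the invariance concrete: filtering without subsampling keeps every translate of each atom, so shifting the input merely shifts the coefficients (Haar filters and circular boundary handling are assumed here).

```python
import numpy as np

def undecimated_haar(f):
    # One level without subsampling: every translate of each atom is
    # kept (circular boundaries via np.roll).
    fs = np.roll(f, -1)
    a = (f + fs) / np.sqrt(2)   # all translates of the scaling atom
    d = (f - fs) / np.sqrt(2)   # all translates of the wavelet atom
    return a, d

f = np.random.default_rng(0).standard_normal(64)
a0, d0 = undecimated_haar(f)
a1, d1 = undecimated_haar(np.roll(f, 3))
# Shifting the signal only shifts the coefficients: translation invariance.
assert np.allclose(a1, np.roll(a0, 3))
assert np.allclose(d1, np.roll(d0, 3))
```

A critically sampled transform keeps only every other output of each branch, which is exactly what breaks this shift relation.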

Patch-based Denoising   [Aharon & Elad 2006]

Noisy image: f = f0 + w.

Step 1: Extract patches. y_k(·) = f(z_k + ·)

Step 2: Dictionary learning.

    min_{D, (x_k)_k}  Σ_k (1/2) ||y_k − D x_k||² + λ ||x_k||₁

Step 3: Patch averaging. ỹ_k = D x_k,  f̃(·) ∝ Σ_k ỹ_k(· − z_k)
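A compact sketch of the three steps, with simplifications stated up front: patches are non-overlapping (so Step 3's averaging reduces to tiling), and the dictionary is a fixed set of normalized noisy patches standing in for the learned one; the actual method alternates the sparse coding below with K-SVD dictionary updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patches(f, p):
    # Step 1: y_k = f(z_k + .) -- non-overlapping patches for brevity
    # (the slide's method uses overlapping patches and averages overlaps).
    n = f.shape[0]
    return np.stack([f[i:i + p, j:j + p].ravel()
                     for i in range(0, n - p + 1, p)
                     for j in range(0, n - p + 1, p)])

def ista(D, y, lam, n_iter=200):
    # Sparse coding: min_x 0.5*||y - D x||^2 + lam*||x||_1 solved by
    # iterative soft-thresholding (a standard solver for this problem).
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = x + D.T @ (y - D @ x) / L
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)
    return x

p, lam = 8, 0.1
f0 = np.outer(np.linspace(0, 1, 32), np.ones(32))      # clean test image
f = f0 + 0.05 * rng.standard_normal(f0.shape)          # noisy f = f0 + w
Y = extract_patches(f, p)                              # Step 1
D = Y[rng.choice(len(Y), 32)].T                        # stand-in for learned D
D /= np.linalg.norm(D, axis=0)
X = np.stack([ista(D, y, lam) for y in Y])             # Step 2 (coding only)
Y_tilde = X @ D.T                                      # Step 3: y~_k = D x_k
```

With overlapping patches, the final image is obtained by summing the ỹ_k back at their locations z_k and dividing by the number of overlaps.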




Learning with Missing Data

Inverse problem: y = Φ f0 + w.
Patch extractor: p_k(f) = f(z_k + ·).

    min_{f, (x_k)_k, D ∈ C}  (1/2) ||y − Φ f||² + λ Σ_k [ (1/2) ||p_k(f) − D x_k||² + μ ||x_k||₁ ]

Step 1: ∀k, minimization on x_k → convex sparse coding.
Step 2: minimization on D → constrained quadratic program.
Step 3: minimization on f → quadratic.

From "Learning Multiscale and Sparse Representations" (© SIAM):
Fig. 14. Inpainting using N = 2 and n = 16×16 (bottom-right image), or N = 1 and n = 8×8 (bottom-left). J = 100 iterations were performed, producing an adaptive dictionary. During the learning, 50% of the patches were used. A sparsity factor L = 10 has been used during the learning process and L = 25 for the final reconstruction. The damaged image was created by removing 75% of the data from the original image. The initial PSNR is 6.13 dB. The resulting PSNR for N = 2 is 33.97 dB and 31.75 dB for N = 1. (a) Original (b) Damaged (c) Restored, N = 1 (d) Restored, N = 2
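A toy 1-D version of the alternating scheme, under stated assumptions: Φ is a random sampling mask, patches are non-overlapping, and D is a fixed DCT basis, so Step 2 (the constrained dictionary update) is omitted; Steps 1 and 3 follow the slide, with Step 3 available in closed form entrywise.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, mu = 128, 8, 1.0, 0.1
t = np.arange(n)
f0 = np.sin(2 * np.pi * t / 32)                     # ground-truth signal
mask = rng.random(n) > 0.5                          # Phi: keep ~50% of samples
y = np.where(mask, f0 + 0.01 * rng.standard_normal(n), 0.0)

# Fixed DCT dictionary for length-p patches (stands in for the learned,
# constrained D of Step 2, which this sketch omits).
k, i = np.meshgrid(np.arange(p), np.arange(p))
D = np.cos(np.pi * (i + 0.5) * k / p)
D /= np.linalg.norm(D, axis=0)
L = np.linalg.norm(D, 2) ** 2

def soft(v, s):
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

f = y.copy()
for _ in range(30):
    P = f.reshape(-1, p)                            # p_k(f), non-overlapping
    X = np.zeros_like(P)
    for _ in range(50):                             # Step 1: sparse coding
        X = soft(X + (P - X @ D.T) @ D / L, mu / L)
    R = (X @ D.T).reshape(-1)                       # D x_k, reassembled
    f = (mask * y + lam * R) / (mask + lam)         # Step 3: closed form in f
```

At the missing samples the data term vanishes, so f is filled in purely by the sparse patch model, which is the inpainting mechanism of the slide.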

Inpainting Example   [Mairal et al. 2008]

Image f0 → Observations y = Φ f0 + w → Regularized f.
Adaptive Inpainting and Separation   [Peyré, Fadili, Starck 2010]

Dictionaries compared: Local DCT, Wavelets, Learned.

Overview
• Sparsity and Redundancy
• Dictionary Learning
• Extensions
• Task-driven Learning
• Texture Synthesis

Higher Dimensional Learning

MAIRAL et al., Sparse Representation for Color Image Restoration:
Fig. 2. Dictionaries with 256 atoms learned on a generic database of natural images, with two different sizes of patches. Note the large number of color-less atoms. Since the atoms can have negative values, the vectors are presented scaled and shifted to the [0,255] range per channel: (a) 5×5×3 patches; (b) 8×8×3 patches.
Fig. 3. Examples of color artifacts while reconstructing a damaged version of the image (a) without the improvement proposed here. Color artifacts are reduced with the proposed technique (in the proposed new metric). Both images have been denoised with the same global dictionary. In (b), one observes a bias effect in the color of the castle and in some parts of the water. What is more, the color of the sky is piecewise constant (false contours), which is another artifact the approach corrected. (a) Original. (b) Original algorithm. (c) Proposed algorithm.
Fig. 4. (a) Training image; (b) resulting dictionary, learned on the image in (a). The dictionary is more colored than the global one.
Fig. 7. Data set used for evaluating denoising experiments.
Table I. PSNR results of the denoising algorithm with 256 atoms of size 7×7×3 and 6×6×3. Each case is divided in four parts: the top-left results are those given by McAuley et al. [28] with their "3×3 model"; the top-right results are those obtained by applying the grayscale K-SVD algorithm [2] on each channel separately with 8×8 atoms; the bottom-left are the results obtained with a globally trained dictionary; the bottom-right are the improvements obtained with the adaptive approach with 20 iterations. Bold indicates the best results for each group. The proposed technique consistently produces the best results.
Table II. Comparison of the PSNR results on the image "castle" between [28] and what was obtained with 256 atoms of 6×6×3 and 7×7×3 patches. For the adaptive approach, 20 iterations have been performed. Bold indicates the best result, indicating once again the consistent improvement obtained with the proposed technique.

patch), in order to prevent any learning of these artifacts (over-fitting). We then define the patch sparsity of the decomposition as this number of steps. The stopping criterion in (2) becomes the number of atoms used instead of the reconstruction error. Using a small value during the OMP permits learning a dictionary specialized in providing a coarse approximation. Our assumption is that (pattern) artifacts are less present in coarse approximations, preventing the dictionary from learning them. We then propose the algorithm described in Fig. 6. We typically used a small value to prevent the learning of artifacts and found that two outer iterations in the scheme in Fig. 6 are sufficient to give satisfactory results, while within the K-SVD, 10–20 iterations are required.

To conclude, in order to address the demosaicing problem, we use the modified K-SVD algorithm that deals with nonuniform noise, as described in the previous section, and add to it an adaptive dictionary that has been learned with low patch sparsity in order to avoid over-fitting the mosaic pattern. The same technique can be applied to generic color inpainting as demonstrated in the next section.

V. EXPERIMENTAL RESULTS

We are now ready to present the color image denoising, inpainting, and demosaicing results that are obtained with the proposed framework.

A. Denoising Color Images

The state-of-the-art performance of the algorithm on grayscale images has already been studied in [2]. We now evaluate the extension for color images. We trained dictionaries with different sizes of atoms (5×5×3, 6×6×3, 7×7×3 and 8×8×3) on 200,000 patches taken from a database of 15,000 images, with the patch-sparsity parameter set to six atoms per representation. We used the LabelMe database [55] to build the image database. Then we trained each dictionary with 600 iterations. This provided a set of generic dictionaries that were used as initial dictionaries in the denoising algorithm. Comparing the results obtained with the global approach and the adaptive one permits us to see the improvements in the learning process. We chose to evaluate
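The sparsity-stopped OMP described above can be sketched as follows; the recovery check uses an orthonormal dictionary, where exact recovery of a sparse signal is guaranteed (for overcomplete dictionaries it holds only under incoherence conditions).

```python
import numpy as np

def omp(D, y, L):
    # Orthogonal Matching Pursuit with the "patch sparsity" stopping rule:
    # stop after exactly L atoms instead of at a target reconstruction error.
    support, r = [], y.astype(float).copy()
    coef = np.zeros(0)
    for _ in range(L):
        support.append(int(np.argmax(np.abs(D.T @ r))))   # most correlated atom
        sub = D[:, support]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)    # re-fit on support
        r = y - sub @ coef                                # residual is now
    x = np.zeros(D.shape[1])                              # orthogonal to sub
    x[support] = coef
    return x

# Sanity check on an orthonormal dictionary (random orthogonal basis).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
x0 = np.zeros(16)
x0[[3, 10]] = [1.5, -2.0]                                 # 2-sparse signal
y = Q @ x0
assert np.allclose(omp(Q, y, L=2), x0)                    # exact recovery
```

Using a small L during learning, as described above, forces the dictionary to explain only the coarse content of each patch.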
Learning D

Inpainting
Higher Dimensional Learning
MAIRAL et al.: SPARSE REPRESENTATION FOR COLOR IMAGE RESTORATION
Fig. 7. Data set used for evaluating denoising experiments.
TABLE I. PSNR results of our denoising algorithm with 256 atoms, of size 7×7×3 for … and 6×6×3 for …. Each case is divided in four parts: the top-left results are those given by McAuley et al. [28] with their "3×3 model." The top-right results are those obtained by applying the grayscale K-SVD algorithm [2] on each channel separately with 8×8 atoms. The bottom-left are our results obtained with a globally trained dictionary. The bottom-right are the improvements obtained with the adaptive approach with 20 iterations. Bold indicates the best results for each group. As can be seen, our proposed technique consistently produces the best results.
TABLE II. Comparison of the PSNR results on the image "castle" between [28] and what we obtained with 256 atoms of size 6×6×3 and 7×7×3. For the adaptive approach, 20 iterations have been performed. Bold indicates the best result, indicating once again the consistent improvement obtained with our proposed technique.
patch), in order to prevent any learning of these artifacts (over-fitting). We then define the patch sparsity of the decomposition as this number of steps. The stopping criterion in (2) becomes the number of atoms used instead of the reconstruction error. Using a small patch sparsity during the OMP permits learning a dictionary specialized in providing a coarse approximation. Our assumption is that (pattern) artifacts are less present in coarse approximations, preventing the dictionary from learning them. We then propose the algorithm described in Fig. 6. We typically used a small patch sparsity to prevent the learning of artifacts, and found that two outer iterations in the scheme of Fig. 6 are sufficient to give satisfactory results, while within the K-SVD, 10–20 iterations are required.
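The scheme just described alternates fixed-sparsity sparse coding with a dictionary update. The sketch below is a simplified illustration of that alternation, not the paper's code: OMP is stopped after exactly L atoms (the patch sparsity), and a MOD-style least-squares update stands in for the K-SVD atom-by-atom update; all dimensions and data are toy.

```python
import numpy as np

def omp_fixed(D, x, L):
    """OMP stopped after exactly L atoms (the 'patch sparsity')."""
    support, residual = [], x.copy()
    coef = np.zeros(0)
    for _ in range(L):
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:          # numerically stuck; stop early
            break
        support.append(j)
        # re-fit all selected coefficients by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    a = np.zeros(D.shape[1])
    a[support] = coef
    return a

def learn_dictionary(X, k, L, n_iter=10, seed=0):
    """Alternate fixed-L sparse coding and a least-squares (MOD-style) update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], k))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        A = np.column_stack([omp_fixed(D, x, L) for x in X.T])
        D = X @ np.linalg.pinv(A)               # min_D ||X - D A||_F
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # unit-norm atoms
    return D

# toy signals that are 2-sparse in a hidden dictionary
rng = np.random.default_rng(1)
D_true = rng.standard_normal((10, 20))
D_true /= np.linalg.norm(D_true, axis=0)
X = np.column_stack([D_true[:, rng.choice(20, 2, replace=False)]
                     @ rng.standard_normal(2) for _ in range(200)])
D = learn_dictionary(X, k=20, L=2)
err = np.mean([np.linalg.norm(x - D @ omp_fixed(D, x, 2)) for x in X.T])
print(D.shape, err)
```

K-SVD would instead update one atom at a time via an SVD of the restricted residual; the least-squares update above keeps the sketch short while preserving the coding/update alternation that the fixed patch sparsity plugs into.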
To conclude: in order to address the demosaicing problem, we use the modified K-SVD algorithm that deals with nonuniform noise, as described in the previous section, and add to it an adaptive dictionary that has been learned with low patch sparsity in order to avoid over-fitting the mosaic pattern. The same technique can be applied to generic color inpainting, as demonstrated in the next section.
V. EXPERIMENTAL RESULTS
We are now ready to present the color image denoising, inpainting, and demosaicing results obtained with the proposed framework.
A. Denoising Color Images
The state-of-the-art performance of the algorithm on grayscale images has already been studied in [2]. We now evaluate our extension to color images. We trained dictionaries with atoms of different sizes (5×5×3, 6×6×3, 7×7×3 and 8×8×3) on 200,000 patches taken from a database of 15,000 images, with the patch-sparsity parameter set to six atoms per representation. We used the LabelMe database [55] to build our image database. We then trained each dictionary with 600 iterations. This provided a set of generic dictionaries that we used as initial dictionaries in our denoising algorithm. Comparing the results obtained with the global approach and the adaptive one lets us see the improvements brought by the learning process. We chose to evaluate
Learning D
ONLINE LEARNING FOR MATRIX FACTORIZATION AND SPARSE CODING
Figure 7: Inpainting example on a 12-Megapixel image. Top: damaged and restored images. Bottom: zoom on the damaged and restored images. Note that the pictures presented here have been scaled down for display. (Best seen in color.)
6.4 Application to Large-Scale Image Processing
We demonstrate in this section that our algorithm can be used for a difficult large-scale image processing task, namely removing the text (inpainting) from the damaged 12-Megapixel image of Figure 7. Using a multi-threaded version of our implementation, we have learned a dictionary with 256 elements from the roughly 7×10⁶ undamaged 12×12 color patches in the image, with two epochs, in about 8 minutes on a 2.4 GHz machine with eight cores. Once the dictionary has been learned, the text is removed using the sparse coding technique for inpainting of Mairal et al. (2008b). Our intent here is of course not to evaluate our learning procedure on inpainting tasks, which would require a thorough comparison with state-of-the-art techniques on standard data sets. Instead, we just wish to demonstrate that it can indeed be applied to a realistic, non-trivial image processing task on a large image. Indeed, to the best of our knowledge, this is the first time that dictionary learning is used for image restoration on such large-scale data. For comparison, the dictionaries used for inpainting in Mairal et al. (2008b) are learned (in batch mode) on 200,000 patches only.
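The online procedure sketched in this paper processes one patch at a time: sparse-code it against the current dictionary, fold it into two sufficient-statistic matrices, and refresh the atoms with one pass of block coordinate descent. The minimal single-threaded sketch below follows that outline under our own simplifications (a few ISTA iterations stand in for an exact lasso solver; the dimensions and the random "patch" stream are toy):

```python
import numpy as np

def ista(D, x, lam, n_steps=100):
    """A few ISTA steps as a stand-in for an exact lasso solver."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_steps):
        a = a - D.T @ (D @ a - x) / L                      # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # shrinkage
    return a

def online_dictionary_learning(stream, n, k, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, k))                   # accumulates sum of a a^T
    B = np.zeros((n, k))                   # accumulates sum of x a^T
    for x in stream:
        a = ista(D, x, lam)
        A += np.outer(a, a)
        B += np.outer(x, a)
        # one pass of block coordinate descent over the atoms
        for j in range(k):
            if A[j, j] > 1e-10:
                u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)  # project to unit ball
    return D

# stream of random 10-D "patches"
rng = np.random.default_rng(2)
stream = (rng.standard_normal(10) for _ in range(300))
D = online_dictionary_learning(stream, n=10, k=25)
print(D.shape)
```

Because only the k×k and n×k statistics A and B are kept, memory does not grow with the number of patches seen, which is the property that makes dictionary learning on millions of patches feasible.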

Movie Inpainting

Image registration.
Facial Image Compression
show recognizable faces. We use a database containing around 6,000 such facial images, some of which are used for training and tuning the algorithm, and the others for testing it, similar to the approach taken in [17].
In our work we propose a novel compression algorithm, related to the one presented in [17] and improving over it. Our algorithm relies strongly on recent advances in sparse and redundant representations of signals [18–26] and in learning their sparsifying dictionaries [27–29]. We use the K-SVD algorithm to learn dictionaries for representing small image patches in a locally adaptive way, and use these to sparse-code the patches' content. This is a relatively simple and straightforward algorithm with hardly any entropy-coding stage. Yet it is shown to be superior to several competing algorithms: (i) JPEG2000, (ii) the VQ-based algorithm presented in [17], and (iii) a Principal Component Analysis (PCA) approach.²
In the next section we provide some background material for this work: we start by presenting the details of the compression algorithm developed in [17], as their scheme is the one we start from in developing ours. We also describe sparse and redundant representations and the K-SVD, which are the foundations of our algorithm. In Section 3 we present the proposed algorithm in detail, showing its various steps and discussing its computational and memory complexities. Section 4 presents results of our method, demonstrating the claimed superiority. We conclude in Section 5 with a list of future directions that can further improve the proposed scheme.
2. Background material
2.1. VQ-based image compression
Among the thousands of papers that study still-image compression algorithms, relatively few consider the treatment of facial images [2–17]. Among those, the most recent and best-performing algorithm is the one reported in [17]. That paper also provides a thorough literature survey that compares the various methods and discusses similarities and differences between them. Therefore, rather than repeating such a survey here, we refer the interested reader to [17]. In this subsection we concentrate on the description of the algorithm in [17], as our method resembles it to some extent.
This algorithm, like some others before it, starts with a geometrical alignment of the input image, so that the main features (ears, nose, mouth, hair-line, etc.) are aligned with those of a database of pre-aligned facial images. Such alignment further increases the redundancy in the handled image, due to its high cross-similarity to the database. The warping in [17] is done by automatic detection of 13 feature points on the face, which are moved to pre-determined canonical locations. These points define a slicing of the input image into a disjoint and covering set of triangles, each exhibiting an affine warp that is a function of the motion of its three vertices. Side information on these 13 feature locations enables a reverse warp of the reconstructed image in the decoder. Fig. 1 (left side) shows the features and the induced triangles. After the warping, the image is sliced into square, non-overlapping patches (of size 8×8 pixels), each of which is coded separately. Such a possible slicing (shown with larger patches for illustration purposes) appears in Fig. 1 (right side).
Coding of the image patches in [17] is done using vector quantization (VQ) [30–32]. The VQ dictionaries are trained (using tree-K-Means) for each patch separately, using patches taken from the same location in 5,000 training images. This way, each VQ is adapted to the expected local content, hence the high performance of this algorithm. The number of codewords in the VQ is a function of the bit allocation for the patches. As we argue in the next section, VQ coding is limited by the available number of examples and the desired rate, forcing relatively small patch sizes. This, in turn, loses some of the redundancy between adjacent patches, and thus loses potential compression.
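Concretely, the VQ stage amounts to k-means codebook training followed by nearest-codeword assignment; transmitting a patch then costs only the index, i.e., log₂(codebook size) bits. A minimal sketch, with plain Lloyd's k-means standing in for the tree-K-Means of [17], on toy random "patches":

```python
import numpy as np

def train_vq_codebook(patches, n_codewords, n_iter=20, seed=0):
    """Lloyd's k-means on flattened patches (one patch per row)."""
    rng = np.random.default_rng(seed)
    codebook = patches[rng.choice(len(patches), n_codewords, replace=False)]
    for _ in range(n_iter):
        # assign each patch to its nearest codeword (squared Euclidean)
        d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # move each codeword to the mean of its cluster
        for c in range(n_codewords):
            members = patches[labels == c]
            if len(members):
                codebook[c] = members.mean(axis=0)
    return codebook

def vq_encode(patches, codebook):
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)        # one integer index per patch

rng = np.random.default_rng(3)
patches = rng.standard_normal((500, 64))     # 500 flattened 8x8 patches
codebook = train_vq_codebook(patches, n_codewords=32)
codes = vq_encode(patches, codebook)
decoded = codebook[codes]                    # reconstruction = nearest codeword
print(codebook.shape, codes.shape)
```

With 32 codewords each patch index costs 5 bits; doubling the codebook buys fidelity at one extra bit per patch, which is exactly the rate-versus-example-count tension the text points out.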
Another ingredient in this algorithm, which partly compensates for the above shortcoming, is a multi-scale coding scheme. The image is scaled down and VQ-coded using patches of size 8×8. It is then interpolated back to the original resolution, and the residual is coded using VQ on 8×8-pixel patches once again. This method can be applied on a Laplacian pyramid of the original (warped) image with several scales [33].
As already mentioned, the results shown in [17] surpass those obtained by JPEG2000, both visually and in quantitative Peak Signal-to-Noise Ratio (PSNR) comparisons. In our work we propose to replace the VQ coding stage with sparse and redundant representations; this leads us to the next subsection, where we describe the principles behind this coding strategy.
2.2. Sparse and redundant representations
We now turn to describe a model for signals known as Sparseland [29]. This model suggests a parametric description of signal sources in a way that adapts to their true nature. It will be harnessed in this work to provide the coding mechanism for the image patches. We consider a family of image patches of size N×N pixels, ordered lexicographically as column vectors x ∈ ℝⁿ (with n = N²). Assume that we are given a matrix D ∈ ℝ^{n×k} (with possibly k > n). We refer hereafter to this matrix as the dictionary. The Sparseland model suggests that every such image patch x can be represented sparsely using this dictionary, i.e., the solution of

    α̂ = argmin_α ‖α‖₀   subject to   ‖Dα − x‖₂² ≤ ε²     (1)

is expected to be very sparse: ‖α̂‖₀ ≪ n. The notation ‖α‖₀ counts the non-zero entries of α. Thus, every signal instance from the family we consider is assumed to be representable as a linear combination of few columns (referred to hereafter as atoms) of the redundant dictionary D.
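Problem (1) is NP-hard in general; a greedy pursuit such as Orthogonal Matching Pursuit (OMP) is the standard practical approximation: repeatedly pick the atom most correlated with the residual, re-fit the coefficients on the selected support by least squares, and stop once the error drops below ε. A minimal sketch with a toy random dictionary (not the authors' implementation):

```python
import numpy as np

def omp(D, x, eps):
    """Greedy OMP: add atoms until ||D a - x||_2 <= eps."""
    n, k = D.shape
    support, a = [], np.zeros(k)
    residual = x.copy()
    while np.linalg.norm(residual) > eps and len(support) < n:
        # atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:                 # numerically stuck; stop
            break
        support.append(j)
        # re-fit all coefficients on the chosen support by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        a = np.zeros(k)
        a[support] = coef
        residual = x - D @ a
    return a

rng = np.random.default_rng(0)
n, k = 16, 64                        # overcomplete dictionary: k > n
D = rng.standard_normal((n, k))
D /= np.linalg.norm(D, axis=0)       # unit-norm atoms
a0 = np.zeros(k)
a0[[3, 17, 40]] = [1.0, -2.0, 0.5]
x = D @ a0                           # exactly 3-sparse signal
a = omp(D, x, eps=1e-6)
print(np.count_nonzero(a), np.linalg.norm(D @ a - x))
```

On the exactly sparse toy signal above, the pursuit drives the residual to numerical zero after selecting a handful of atoms, illustrating the ‖α̂‖₀ ≪ n regime the model assumes.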
The requirement ‖Dα − x‖₂ ≤ ε means that the approximation of x by Dα need not be exact and can absorb a moderate error ε. This yields an approximation that trades off accuracy of representation against its simplicity, very much like the rate-distortion …
² The PCA algorithm is developed in this work as a competitive benchmark; while it generally performs very well, it is inferior to the main algorithm presented in this work.
Fig. 1. (Left) Piecewise affine warping of the image by triangulation. (Right) A uniform slicing into disjoint square patches for coding purposes.
O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282
[Elad et al. 2009]

Image registration.
Non-overlapping patches (f_k)_k.
Facial Image Compression
[Elad et al. 2009]
f_k

Image registration.
Non-overlapping patches (f_k)_k.
Dictionary learning (D_k)_k.
D_k
Facial Image Compression
Before turning to present the results, we should add the following: while all the results shown here refer to the specific database we operate on, the overall scheme proposed is general and should apply just as well to other face-image databases. Naturally, some changes in the parameters might be necessary; among those, the patch size is the most important to consider. We also note that as one shifts from one source of images to another, the relative size of the background in the photos may vary, and this necessarily leads to changes in performance. More specifically, when the background regions are larger (the images we use here have relatively small such regions), the compression performance is expected to improve.
4.1. K-SVD dictionaries
The primary stopping condition for the training process was set
to be a limitation on the maximal number of K-SVD iterations
(being 100). A secondary stopping condition was a limitation on
the minimal representation error. In the image compression stage
we added a limitation on the maximal number of atoms per patch.
These conditions were used to allow us to better control the rates
of the resulting images and the overall simulation time.
Every obtained dictionary contains 512 patches of size 15×15 pixels as atoms. In Fig. 6 we can see the dictionary that was trained for patch number 80 (the left eye) for L = 4 sparse coding atoms, and similarly, in Fig. 7 we can see the dictionary that was trained for patch number 87 (the right nostril), also for L = 4 sparse coding atoms. It can be seen that both dictionaries contain images similar in nature to the image patch for which they were trained. A similar behavior was observed in other dictionaries.
4.2. Reconstructed images
Our coding strategy allows us to learn which parts of the im-
age are more difficult than others to code. This is done by
assigning the same representation error threshold to all of the
patches, and observing how many atoms are required for the
representation of each patch on average. Clearly, patches with
a small number of allocated atoms are simpler to represent than
others. We would expect that the representation of smooth areas
of the image such as the background, parts of the face and
maybe parts of the clothes will be simpler than the representa-
tion of areas containing high frequency elements such as the
hair or the eyes. Fig. 8 shows maps of atom allocation per patch and representation error (RMSE, the square root of the mean squared error) per patch for the images in the test set at two different bit-rates. It can be seen that more atoms were allocated
to patches containing the facial details (hair, mouth, eyes, and
Fig. 6. The dictionary obtained by K-SVD for patch No. 80 (the left eye) using the OMP method with L = 4.
Fig. 7. The dictionary obtained by K-SVD for patch No. 87 (the right nostril) using the OMP method with L = 4.
O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 275
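Section 4.1 above reports dictionaries trained with K-SVD but does not restate the algorithm. As a rough illustration (not the authors' implementation; all names here are hypothetical), one dictionary-update sweep of K-SVD refits each atom, together with its coefficients, by a rank-1 SVD of the residual restricted to the patches that use it:

```python
import numpy as np

def ksvd_update(D, X, A):
    """One K-SVD dictionary-update sweep (sketch).
    D: (n, k) dictionary, X: (n, m) training patches,
    A: (k, m) current sparse codes with X ~ D A."""
    D = D.copy()
    A = A.copy()
    for j in range(D.shape[1]):
        users = np.nonzero(A[j, :])[0]   # patches whose code uses atom j
        if users.size == 0:
            continue                     # unused atom: leave it unchanged
        A[j, users] = 0.0                # residual computed without atom j
        E = X[:, users] - D @ A[:, users]
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]                # best rank-1 fit: new unit-norm atom
        A[j, users] = s[0] * Vt[0, :]    # and its updated coefficients
    return D, A
```

Because each rank-1 update minimizes the restricted residual and the previous atom/coefficient pair is a feasible candidate, the total representation error never increases during a sweep.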
show recognizable faces. We use a database containing around 6000 such facial images, some of which are used for training and tuning the algorithm, and the others for testing it, similar to the approach taken in [17].

In our work we propose a novel compression algorithm, related to the one presented in [17], improving over it. Our algorithm relies strongly on recent advancements made in using sparse and redundant representation of signals [18–26], and learning their sparsifying dictionaries [27–29]. We use the K-SVD algorithm for learning the dictionaries for representing small image patches in a locally adaptive way, and use these to sparse-code the patches' content. This is a relatively simple and straightforward algorithm with hardly any entropy coding stage. Yet, it is shown to be superior to several competing algorithms: (i) JPEG2000, (ii) the VQ-based algorithm presented in [17], and (iii) a Principal Component Analysis (PCA) approach.²

In the next section we provide some background material for this work: we start by presenting the details of the compression algorithm developed in [17], as their scheme is the one we embark from in the development of ours. We also describe the topic of sparse and redundant representations and the K-SVD, which are the foundations for our algorithm. In Section 3 we turn to present the proposed algorithm in detail, showing its various steps, and discussing its computational/memory complexities. Section 4 presents results of our method, demonstrating the claimed superiority. We conclude in Section 5 with a list of future activities that can further improve over the proposed scheme.
2. Background material
2.1. VQ-based image compression
Among the thousands of papers that study still image compression algorithms, there are relatively few that consider the treatment of facial images [2–17]. Among those, the most recent and the best performing algorithm is the one reported in [17]. That paper also provides a thorough literature survey that compares the various methods and discusses similarities and differences between them. Therefore, rather than repeating such a survey here, we refer the interested reader to [17]. In this sub-section we concentrate on the description of the algorithm in [17], as our method resembles it to some extent.
This algorithm, like some others before it, starts with a geometrical alignment of the input image, so that the main features (ears, nose, mouth, hair-line, etc.) are aligned with those of a database of pre-aligned facial images. Such alignment further increases the redundancy in the handled image, due to its high cross similarity to the database. The warping in [17] is done by an automatic detection of 13 feature points on the face, and moving them to pre-determined canonical locations. These points define a slicing of the input image into a disjoint and covering set of triangles, each exhibiting an affine warp, being a function of the motion of its three vertices. Side information on these 13 feature locations enables a reverse warp of the reconstructed image in the decoder. Fig. 1 (left side) shows the features and the induced triangles. After the warping, the image is sliced into square and non-overlapping patches (of size 8×8 pixels), each of which is coded separately. Such a possible slicing (for illustration purposes we show this slicing with larger patches) is shown in Fig. 1 (right side).
Coding of the image patches in [17] is done using vector quantization (VQ) [30–32]. The VQ dictionaries are trained (using tree-K-Means) for each patch separately, using patches taken from the same location in 5000 training images. This way, each VQ is adapted to the expected local content, hence the high performance presented by this algorithm. The number of code-words in the VQ is a function of the bit-allocation for the patches. As we argue in the next section, VQ coding is limited by the available number of examples and the desired rate, forcing relatively small patch sizes. This, in turn, leads to a loss of some redundancy between adjacent patches, and thus loss of potential compression.
Another ingredient in this algorithm that partly compensates for the above-described shortcoming is a multi-scale coding scheme. The image is scaled down and VQ-coded using patches of size 8×8. Then it is interpolated back to the original resolution, and the residual is coded using VQ on 8×8 pixel patches once again. This method can be applied on a Laplacian pyramid of the original (warped) image with several scales [33].
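The two-scale residual scheme just described can be sketched as follows. This is a simplified stand-in, not the paper's implementation: block-average downsampling and zero-order-hold upsampling replace proper pyramid filters, and an abstract per-image coder stands in for VQ.

```python
import numpy as np

def down2(img):
    """2x decimation by block averaging (stand-in for a proper low-pass)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(img):
    """2x zero-order-hold interpolation back to the finer grid."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def two_scale_code(img, code):
    """Code a coarse version first, then the residual at full resolution.
    `code` is any lossy per-image coder (e.g. the VQ or sparse patch coder)."""
    coarse = code(down2(img))
    residual = img - up2(coarse)
    return coarse, code(residual)

def two_scale_decode(coarse, residual):
    """Invert the two-scale scheme: interpolate the coarse layer, add residual."""
    return up2(coarse) + residual
```

With a lossless coder the scheme is exactly invertible; with a lossy one, the residual layer absorbs the coarse layer's coding error, which is the point of the pyramid.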
As already mentioned above, the results shown in [17] surpass those obtained by JPEG2000, both visually and in Peak-Signal-to-Noise Ratio (PSNR) quantitative comparisons. In our work we propose to replace the coding stage from VQ to sparse and redundant representations; this leads us to the next subsection, where we describe the principles behind this coding strategy.
2.2. Sparse and redundant representations
We now turn to describe a model for signals known as Sparseland [29]. This model suggests a parametric description of signal sources in a way that adapts to their true nature. This model will be harnessed in this work to provide the coding mechanism for the image patches. We consider a family of image patches of size N × N pixels, ordered lexicographically as column vectors x ∈ R^n (with n = N²). Assume that we are given a matrix D ∈ R^{n×k} (with possibly k > n). We refer hereafter to this matrix as the dictionary. The Sparseland model suggests that every such image patch, x, could be represented sparsely using this dictionary, i.e., the solution of

    â = arg min_a ‖a‖₀   subject to   ‖Da − x‖₂² ≤ ε²,    (1)

is expected to be very sparse, ‖â‖₀ ≪ n. The notation ‖a‖₀ counts the non-zero entries in a. Thus, every signal instance from the family we consider is assumed to be represented as a linear combination of few columns (referred to hereafter as atoms) from the redundant dictionary D.

The requirement ‖Da − x‖₂ ≤ ε suggests that the approximation of x using Da need not be exact, and could absorb a moderate error ε. This suggests an approximation that trades off accuracy of representation with its simplicity, very much like the rate-distortion
² The PCA algorithm is developed in this work as a competitive benchmark; while it generally performs very well, it is inferior to the main algorithm presented in this work.
Fig. 1. (Left) Piece-wise affine warping of the image by triangulation. (Right) A uniform slicing into disjoint square patches for coding purposes.
O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282 271
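Problem (1) is combinatorial in general; the paper's figures mention the OMP (Orthogonal Matching Pursuit) method, a greedy pursuit that approximates it. A minimal numpy sketch of OMP with the error-threshold stopping rule of Eq. (1) (this is an illustration, not the authors' code):

```python
import numpy as np

def omp(D, x, eps, max_atoms=None):
    """Greedy OMP for Eq. (1): min ||a||_0 s.t. ||D a - x||_2^2 <= eps^2.
    D: (n, k) dictionary with unit-norm columns; x: (n,) signal."""
    n, k = D.shape
    if max_atoms is None:
        max_atoms = n
    support = []                     # indices of selected atoms
    a = np.zeros(k)
    residual = x.copy()
    while residual @ residual > eps ** 2 and len(support) < max_atoms:
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break                    # residual orthogonal to everything useful
        support.append(j)
        # re-fit all selected coefficients by least squares on the support
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        a = np.zeros(k)
        a[support] = coeffs
        residual = x - D @ a
    return a
```

The least-squares re-fit at each step is what distinguishes OMP from plain matching pursuit: the residual stays orthogonal to all atoms already selected, so no atom is picked twice.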
Facial Image Compression [Elad et al. 2009]
Image registration.
Non-overlapping patches (f_k)_k.
Dictionary learning (D_k)_k.
Sparse approximation: f_k ≈ D_k x_k.
Entropic coding: x_k → file.
Comparison: JPEG-2000 vs. PCA vs. learned dictionaries D_k.
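The slide's per-patch pipeline (sparse-code each registered patch f_k over its own learned dictionary D_k, then store the codes x_k) can be sketched as below. The helper names are hypothetical and the entropy-coding stage is omitted; `sparse_coder` is any solver of Eq. (1), e.g. OMP.

```python
import numpy as np

def encode(patches, dicts, sparse_coder, eps):
    """Encode each patch f_k over its own dictionary D_k.
    Stores only the support and the nonzero values of each code x_k;
    a real codec would quantize and entropy-code these."""
    codes = []
    for f_k, D_k in zip(patches, dicts):
        x_k = sparse_coder(D_k, f_k, eps)
        support = np.nonzero(x_k)[0]
        codes.append((support, x_k[support]))
    return codes

def decode_patch(codes, dicts, k):
    """Reconstruct f_k ~ D_k x_k from the stored (support, values) pair."""
    D_k = dicts[k]
    support, values = codes[k]
    x_k = np.zeros(D_k.shape[1])
    x_k[support] = values
    return D_k @ x_k
```

Storing (support, values) pairs rather than full vectors is what makes the sparsity of x_k pay off at the bit-rate level.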
Much like other compression methods, the quality of the reconstructed images in our method improves as the bit-rate increases. However, the contribution gained from such a rate increment is not divided equally over the image. Additional bits are allocated to patches with higher representation error, and those are improved first. This property is directly caused by the nature of the compression process, which is RMSE-oriented and not bit-rate-oriented. The compression process sets a single RMSE threshold for all the patches, forcing each of them to reach it without fixing the number of allocated atoms per patch. Patches with simple (smooth) content are most likely to have a representation error far below the threshold even using zero or one atom, whereas patches with more complex content are expected to give a representation error very close to the threshold. Such problematic patches will be forced to improve their representation error by increasing the number of atoms they use as the RMSE threshold is decreased, while patches with a representation error below the threshold will not be forced to change at all. Fig. 11 illustrates the gradual improvement in the image quality as the bit-rate increases. As can be seen, not all the patches improve as the bit-rate increases but only some of them, such as several patches in the clothes area, in the ears and in the outline of the hair. These patches were more difficult to represent than others.
4.3. Comparing to other techniques
An important part in assessing the performance of our compression method is its comparison to known and competitive compression techniques. As mentioned before, we compare our results in this work with JPEG, JPEG2000, the VQ-based compression method described in [17], and a PCA-based compression method that was built especially for this work as a competitive benchmark. We therefore start with a brief description of the PCA technique.
The PCA-based compression method is very similar to the scheme described in this work, simply replacing the K-SVD dictionaries with Principal Component Analysis (PCA) ones. These dictionaries are square matrices storing the eigenvectors of the autocorrelation matrices of the training examples in each patch, sorted by a decreasing order of their corresponding eigenvalues.
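Such a PCA dictionary can be sketched in a few lines (an illustration under my own naming, not the authors' code): the eigenvectors of the empirical autocorrelation matrix of the training patches, sorted by decreasing eigenvalue.

```python
import numpy as np

def pca_dictionary(Y):
    """Square dictionary for training patches stored as columns of Y (n x T).

    Returns the eigenvectors of the autocorrelation matrix Y Y^T / T,
    sorted by decreasing eigenvalue (leading atoms capture most energy).
    """
    R = Y @ Y.T / Y.shape[1]              # empirical autocorrelation matrix
    eigvals, eigvecs = np.linalg.eigh(R)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # re-sort: decreasing eigenvalues
    return eigvecs[:, order]

rng = np.random.default_rng(1)
Y = rng.standard_normal((16, 200))        # 200 toy training "patches"
D = pca_dictionary(Y)                     # orthonormal: D^T D = Id
```

Because the dictionary is orthonormal, coding a patch is a single matrix-vector product, which is why PCA serves as a natural baseline against the learned K-SVD dictionaries.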
Fig. 12. Facial image compression with a bit-rate of 400 bytes. Comparing results of JPEG2000, the PCA results, and our K-SVD method. The values in the brackets are the representation RMSE.
278 O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282
[Elad et al. 2009] — facial compression at 400 bytes: exemplar $f_k$, PCA, learned dictionary.

Constraints on the Learning

Exemplars $Y$; learning problem:
$$\min_{X,\, D \in C} \; \tfrac12 \|Y - DX\|^2 + \lambda \|X\|_1$$
Dictionary learning: $C = \{ D : \|d_m\| \le 1 \}$
Non-negative matrix factorization (NMF): $C = \{ D : \|d_m\| \le 1,\; D \ge 0 \}$
Sparse PCA: $C = \{ D : \|d_m\|^2 + \mu \|d_m\|_1 \le 1 \}$

[Figure 3 from "Online Learning for Matrix Factorization and Sparse Coding": results obtained by PCA, NMF, dictionary learning, SPCA for data set D. Panels: (a) PCA, (b) SPCA $\tau=70\%$, (c) NMF, (d) SPCA $\tau=30\%$, (e) Dictionary Learning, (f) SPCA $\tau=10\%$.]
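The constraint sets above differ only in how each atom is projected back after a dictionary update step; a sketch of the first two projections (my own naming; the sparse-PCA set needs a slightly more involved elastic-net-type projection, omitted here):

```python
import numpy as np

def project_dictionary(D, mode="dl"):
    """Project each atom (column) d_m of D onto the constraint set C.

    mode "dl"  : ||d_m||_2 <= 1            (dictionary learning)
    mode "nmf" : ||d_m||_2 <= 1 and D >= 0 (non-negative factorization)
    """
    D = D.copy()
    if mode == "nmf":
        D = np.maximum(D, 0.0)                       # clip to the non-negative orthant
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms                                 # rescale only atoms with norm > 1

D = np.array([[3.0, 0.5],
              [-4.0, 0.5]])
D_dl = project_dictionary(D, "dl")
D_nmf = project_dictionary(D, "nmf")
```

Clipping then radially rescaling is the exact Euclidean projection onto the intersection of the non-negative orthant with the unit ball, since the radial rescaling preserves non-negativity.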

Dictionary Signature / Epitome

Translation invariance + patches: [Aharon & Elad 2008], [Jojic et al. 2003]
Low-dimensional dictionary parameterization: dictionary $D = (d_m)_m$ generated from a signature $d$ by $d_m(t) = d(z_m + t)$; image $f$.
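Under this parameterization the whole dictionary is generated by sliding a patch window over one small signature image; a sketch (my own naming):

```python
import numpy as np

def epitome_atoms(d, w):
    """All w x w atoms d_m(t) = d(z_m + t) extracted from a signature image d."""
    E = d.shape[0]
    atoms = []
    for i in range(E - w + 1):
        for j in range(E - w + 1):                  # shift z_m = (i, j)
            atoms.append(d[i:i + w, j:j + w].ravel())
    return np.array(atoms).T                        # shape (w*w, number of shifts)

d = np.arange(25, dtype=float).reshape(5, 5)        # toy 5x5 signature
D = epitome_atoms(d, 3)
```

A 5×5 signature with 3×3 patches yields 9 atoms from only 25 stored values: the dictionary is parameterized by far fewer degrees of freedom than its atom count suggests.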
Figure 7. 1, 2, 4 and 20 epitomes learned on the barbara image for the same parameters. They are of sizes 42, 32, 25 and 15 in order to keep the same number of elements in D. They are not represented to scale.

5.3. Influence of the Number of Epitomes

We present in this section an experiment where the number of learned epitomes varies, while keeping the same number of columns in D. The 1, 2, 4 and 20 epitomes learned on the image barbara are shown in Figure 7. When the number of epitomes is small, we observe in the epitomes some discontinuities between texture areas with different visual characteristics, which is not the case when learning several independent epitomes.

5.4. Application to Denoising

In order to evaluate the performance of epitome learning in various regimes (single epitome, multiple epitomes), we use the same methodology as [1], which relies on the successful denoising method first introduced by [9]. Let us consider first the classical problem of restoring a noisy image $y \in \mathbb{R}^n$ which has been corrupted by a white Gaussian noise of standard deviation $\sigma$. We denote by $y_i \in \mathbb{R}^m$ the patch of $y$ centered at pixel $i$ (with any arbitrary ordering of the image pixels).

The method of [9] proceeds as follows:
• Learn a dictionary $D$ adapted to all overlapping patches $y_1, y_2, \ldots$ from the noisy image $y$.
• Approximate each noisy patch using the learned dictionary with a greedy algorithm called orthogonal matching pursuit (OMP) [17] to obtain a clean estimate of every patch $y_i$ by addressing the following problem
$$\operatorname{argmin}_{\alpha_i \in \mathbb{R}^p} \|\alpha_i\|_0 \quad \text{s.t.} \quad \|y_i - D\alpha_i\|_2^2 \le (C\sigma)^2,$$
where $D\alpha_i$ is a clean estimate of the patch $y_i$, $\|\alpha_i\|_0$ is the $\ell^0$ pseudo-norm of $\alpha_i$, and $C$ is a regularization parameter. Following [9], we choose $C = 1.15$.
• Since every pixel in $y$ admits many clean estimates (one estimate for every patch the pixel belongs to), average the estimates.
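The last step can be sketched as follows (a 1-D toy version of the averaging, my own illustration rather than the paper's code): every pixel accumulates the clean estimates of all patches covering it, together with a weight counting the contributions.

```python
import numpy as np

def average_patch_estimates(estimates, n, m):
    """Average overlapping clean patch estimates into a signal of length n.

    estimates[i] is the (m,) clean estimate of the patch starting at pixel i.
    """
    acc = np.zeros(n)
    weight = np.zeros(n)
    for i, patch in enumerate(estimates):
        acc[i:i + m] += patch        # add this patch's estimate of pixels i..i+m-1
        weight[i:i + m] += 1.0       # count how many patches cover each pixel
    return acc / weight

# Toy: if every patch estimate agrees with a constant signal, averaging returns it.
n, m = 6, 3
truth = np.full(n, 5.0)
estimates = [truth[i:i + m] for i in range(n - m + 1)]
denoised = average_patch_estimates(estimates, n, m)
```

In 2-D the same accumulation runs over both patch coordinates; the interior pixels are covered by $m$ (here) or $m^2$ patches, which is what makes the aggregation act as an averaging filter.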
Figure 8. Artificially noised boat image (with standard deviation $\sigma = 15$), and the result of our denoising algorithm.

Quantitative results for a single epitome and for multi-scale multi-epitomes are presented in Table 1 on six images and five levels of noise. We evaluate the performance of the denoising process by computing the peak signal-to-noise ratio (PSNR) for each pair of images. For each level of noise, we have selected the best regularization parameter $\lambda$ over all six images, and have then used it in all the experiments. The PSNR values are averaged over 5 experiments with 5 different noise realizations. The mean standard deviation is 0.05 dB both for the single epitome and the multi-scale multi-epitomes.

We see from this experiment that the formulation we propose is competitive with that of [1]. Learning multiple epitomes instead of a single one seems to provide better results, which might be explained by the lack of flexibility of the single-epitome representation. Evidently, these results are not as good as recent state-of-the-art denoising algorithms such as [7, 15], which exploit more sophisticated
2918
ourselves for simplicity to the case of single and multiple epitomes of the same size and shape.

The multi-epitome version of our approach can be seen as an interpolation between a classical dictionary and a single epitome. Indeed, defining a multitude of epitomes of the same size as the considered patches is equivalent to working with a dictionary. Defining a large number of epitomes slightly larger than the patches is equivalent to shift-invariant dictionaries. In Section 5, we experimentally compare these different regimes for the task of image denoising.

4.4. Initialization

Because of the nonconvexity of the optimization problem, the question of the initialization is an important issue in epitome learning. We have already mentioned a multi-scale strategy to overcome this issue, but for the first scale, the problem remains. Whereas classical flat dictionaries can naturally be initialized with prespecified dictionaries such as an overcomplete DCT basis (see [9]), the epitome does not admit such a natural choice. In all the experiments (unless written otherwise), we use as the initialization a single epitome (or a collection of epitomes), common to all experiments, which is learned using our algorithm, initialized with a Gaussian low-pass filtered random image, on a set of 100 000 random patches extracted from 5 000 natural images (all different from the test images used for denoising).

5. Experimental Validation

Figure 4. House, Peppers, Cameraman, Lena, Boat and Barbara images.

We provide in this section qualitative and quantitative validation. We first study the influence of the different model hyperparameters on the visual aspect of the epitome before moving to an image denoising task. We choose to represent the epitomes as images in order to visualize more easily the patches that will be extracted to form the images. Since epitomes contain negative values, they are arbitrarily rescaled between 0 and 1 for display. In this section, we will work with several images, which are shown in Figure 4.

5.1. Influence of the Initialization

In order to measure the influence of the initialization on the resulting epitome, we have run the same experiment with different initializations. Figure 5 shows the different results obtained. The difference in contrast may be due to the scaling of the data in the displaying process. This experiment illustrates that different initializations lead to visually different epitomes. Whereas this property might not be desirable, the classical dictionary learning framework also suffers from this issue, but has nevertheless led to successful applications in image processing [9].

Figure 5. Three epitomes obtained on the boat image for different initializations, but all with the same parameters. Left: epitome obtained with initialization on an epitome learned on random patches from natural images. Middle and Right: epitomes obtained for two different random initializations.

5.2. Influence of the Size of the Patches

The size of the patches seems to play an important role in the visual aspect of the epitome. We illustrate in Figure 6 an experiment where pairs of epitomes of size 46×46 are learned with different sizes of patches.

Figure 6. Pairs of epitomes of width 46 obtained for patches of width 6, 8, 9, 10 and 12. All other parameters are unchanged. Experiments run with 2 scales (20 iterations for the first scale, 5 for the second) on the house image.

As we see, learning epitomes with small patches seems to introduce finer details and structures in the epitome, whereas large patches induce epitomes with coarser structures.
2917

Dictionary Signature / Epitome

Translation invariance + patches: [Aharon & Elad 2008], [Jojic et al. 2003]
Low-dimensional dictionary parameterization: dictionary $D = (d_m)_m$ generated from a signature $d$ by $d_m(t) = d(z_m + t)$; image $f$.
→ Faster learning.
→ Makes use of the atoms' spatial locations $z_m$.

Overview
• Sparsity and Redundancy
• Dictionary Learning
• Extensions
• Task-driven Learning
• Texture Synthesis

Task Driven Learning

Ground truth: $y_k = \Phi f_k + w_k$ (exemplar $f_k$, observation $y_k$).
Estimator $f(D, \cdot) : y_k \mapsto f(D, y_k)$; example: $\ell^1$ regularization.
Task driven learning:   [Mairal et al. 2010] [Peyré & Fadili 2010]
$$\min_D \; E(D) = \sum_k \| f_k - f(D, y_k) \|^2$$
Gradient descent:
$$D \leftarrow D - \tau \sum_k \partial f(D, y_k)^\top \, [\, f(D, y_k) - f_k \,]$$
Compute the derivative $\partial f$ w.r.t. $D$?
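This descent can be sketched numerically; since the deck's estimator is the $\ell^1$ one (whose derivative is the point of the next slides), the sketch below substitutes a simple differentiable stand-in $f(D, y) = D D^\top y$ so the gradient is explicit. All names are mine; this is an illustration, not the algorithm of [Mairal et al. 2010].

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 5, 20
Phi = np.eye(n)                                   # toy degradation operator
F = rng.standard_normal((n, K))                   # exemplars f_k (as columns)
Y = Phi @ F + 0.1 * rng.standard_normal((n, K))   # observations y_k

def estimator(D, Y):
    # Differentiable stand-in for f(D, y); NOT the l1 estimator of the slides.
    return D @ (D.T @ Y)

def energy(D):
    # E(D) = sum_k ||f_k - f(D, y_k)||^2
    return np.sum((F - estimator(D, Y)) ** 2)

D = 0.1 * rng.standard_normal((n, n))
e0 = energy(D)
tau = 1e-3
for _ in range(200):
    R = estimator(D, Y) - F                       # residuals f(D, y_k) - f_k
    grad = 2 * (R @ (Y.T @ D) + Y @ (R.T @ D))    # gradient of E w.r.t. D
    D -= tau * grad                               # the update rule of the slide
```

The task-driven energy decreases along the iterations, which is all the update rule promises; the real difficulty, addressed next, is differentiating the sparse $\ell^1$ estimator itself.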

Dictionary Sensitivity

Sparse estimator: $f(D, y) = D\, x(D, y)$, where
$$x(D, y) = \operatorname{argmin}_{x \in \mathbb{R}^P} \; \tfrac12 \|y - \Phi D x\|^2 + \lambda \|x\|_1$$
Optimality condition: $D^\top \Phi^\top (\Phi D x - y) + \lambda s = 0$, where "$s = \operatorname{sign}(x)$".
Support: $I = \{ m : x_m \neq 0 \}$, with $D_I = (d_m)_{m \in I}$.
Local expression of $x(D, y)$ around $(D, y)$:
$$x_I = (D_I^\top \Phi^\top \Phi D_I)^{-1} (D_I^\top \Phi^\top y - \lambda s_I)$$
Locally, $s_I$ is constant:
→ The map $y \mapsto x(D, y)$ is affine.
→ The map $D \mapsto x(D, y)$ is a rational function.
→ Compute the derivative of $D \mapsto x(D, y)$.
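The closed-form expression on the support can be checked against a generic solver; a sketch with $\Phi = \mathrm{Id}$ for readability, using ISTA as the reference $\ell^1$ solver (all names mine):

```python
import numpy as np

def ista(D, y, lam, n_iter=5000):
    """Proximal gradient (ISTA) for min_x 0.5 ||y - D x||^2 + lam ||x||_1."""
    L = np.linalg.norm(D, 2) ** 2                  # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - D.T @ (D @ x - y) / L              # gradient step on the quadratic term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(3)
D = rng.standard_normal((10, 20))
y = rng.standard_normal(10)
lam = 2.0
x = ista(D, y, lam)

I = np.flatnonzero(np.abs(x) > 1e-8)               # support I = {m : x_m != 0}
DI, sI = D[:, I], np.sign(x[I])
# Closed form on the support (Phi = Id): x_I = (D_I^T D_I)^{-1} (D_I^T y - lam s_I)
xI = np.linalg.solve(DI.T @ DI, DI.T @ y - lam * sI)
```

Since the formula is an explicit rational function of $D_I$ while the support and signs stay fixed, it can be differentiated in closed form with respect to the dictionary, which is exactly what the gradient descent of the previous slide needs.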

Blind Sparse Restoration

Unknown degradation operator: $y_k = \Phi_0(f_k) + w_k$.
Sparse recovery with linearized model $\Phi$:
$$f(D, \Phi, y) = D \operatorname{argmin}_{x \in \mathbb{R}^P} \; \tfrac12 \|y - \Phi D x\|^2 + \lambda \|x\|_1$$
Task driven learning of $D$ and $\Phi$:
$$\min_{D, \Phi} \; \sum_k \| f_k - f(D, \Phi, y_k) \|^2$$

[From "Task-Driven Dictionary Learning", RR n° 7400, Figure 2] From left to right: original images $f_k$, halftoned images $y_k$, reconstructed images $f(D, \Phi, y_k)$. Even though the halftoned images (center column) perceptually look relatively close to the original images (left column), they are binary. Reconstructed images (right column) are obtained by restoring the halftoned binary images. Best viewed by zooming on a computer screen.
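One ingredient of this blind setting can be illustrated in isolation: when exemplar/observation pairs are available, a linearized operator $\Phi$ can be fitted by least squares (a sketch under my own naming; the works cited above learn $D$ and $\Phi$ jointly by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 6, 50
Phi0 = rng.standard_normal((n, n))                 # unknown degradation operator
F = rng.standard_normal((n, K))                    # exemplars f_k (as columns)
Y = Phi0 @ F + 0.01 * rng.standard_normal((n, K))  # observations y_k

# Linearized model: Phi = argmin_Phi sum_k ||y_k - Phi f_k||^2 (least squares).
# lstsq solves F^T X = Y^T; transposing back gives Phi with Phi F ~= Y.
Phi = np.linalg.lstsq(F.T, Y.T, rcond=None)[0].T
```

With enough exemplars the linear fit recovers the unknown operator up to the noise level, which is what makes the linearization a reasonable starting point for the joint minimization over $D$ and $\Phi$.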

Overview
• Sparsity and Redundancy
• Dictionary Learning
• Extensions
• Task-driven Learning
• Texture Synthesis

Texture Synthesis

Generate $f$ perceptually similar to some input $f_0$.
Design and manipulate statistical constraints.
Use statistical constraints for other imaging problems.

Dictionaries for Textures

Sparse Texture Ensemble   [Peyré, 2008]

Sparse model for all the texture patches, $p_k(f) = f(\cdot + z_k)$:
$$E(f) = \sum_k \min_x \; \tfrac12 \| p_k(f) - D x \|^2 + \lambda \|x\|_1$$
Texture ensemble: $T = \{ f : f \text{ local minimum of } E \}$.
Almost bias-free sampling of $T$:
• Initialization: $f \leftarrow$ white noise.
• Iteration: $x_k \leftarrow \operatorname{argmin}_x \tfrac12 \| p_k(f) - D x \|^2 + \lambda \|x\|_1$, then $\tilde y_k = D x_k$ and $f(\cdot) \leftarrow \sum_k \tilde y_k(\cdot - z_k)$.
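A toy 1-D sketch of this sampling loop (assuming, for simplicity, an orthonormal $D$ so the inner $\ell^1$ problem reduces to a soft threshold; the naming is mine, this is not the paper's implementation):

```python
import numpy as np

def soft(z, t):
    """Soft threshold: exact lasso solution when the dictionary is orthonormal."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sample_texture(D, n, w, lam, n_iter=20, seed=0):
    """Alternate sparse coding of every patch p_k(f) = f[k:k+w] and
    re-averaging of the approximations y~_k = D x_k (1-D toy version)."""
    rng = np.random.default_rng(seed)
    f = rng.standard_normal(n)                 # initialization: white noise
    for _ in range(n_iter):
        acc, weight = np.zeros(n), np.zeros(n)
        for k in range(n - w + 1):
            x_k = soft(D.T @ f[k:k + w], lam)  # sparse code of patch k
            acc[k:k + w] += D @ x_k            # paste back y~_k at position z_k = k
            weight[k:k + w] += 1.0
        f = acc / weight                       # average the overlapping patches
    return f

D = np.eye(4)                                  # trivial orthonormal dictionary
f = sample_texture(D, n=32, w=4, lam=0.5)
```

Starting from white noise and iterating drives $f$ toward a local minimum of $E$, i.e. toward the texture ensemble $T$; with a learned (non-trivial) dictionary the fixed point inherits the patch structures encoded in $D$.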
Figure 1: Parameterization of the dictionary of edge patches and some examples.
Figure 2: Iterations of the synthesis algorithm with the dictionary of edges (sparsity $s = 2$).

Dictionary of local oscillations. In order to synthesize highly oscillating textures, we consider the following set of functions
$$\psi_\lambda(t) = \sin\big( R_\theta(t - (\delta, 0)) / \eta \big), \quad \text{and} \quad \lambda = (\theta, \eta) \in [0, 2\pi) \times \mathbb{R}^+. \qquad (15)$$
The local frequency $\eta$ controls globally the width of the oscillations whereas $\theta$ is the local orientation of these oscillations.

Dictionaries of lines. Similarly to the edge dictionary (14), a dictionary of lines is obtained by rotating and translating a straight line
$$\psi_\lambda(t) = \varphi_{\theta,\delta,\sigma}(t) = \exp\Big( -\frac{1}{2\sigma^2} \| R_\theta(t - (\delta, 0)) \|^2 \Big), \qquad (16)$$
where $\lambda = (\theta, \delta) \in [0, 2\pi) \times \mathbb{R}^+$ and where $\sigma$ controls the width of the line pattern.

Dictionaries of crossings. A dictionary of crossings is obtained by considering atoms which contain two overlapping lines
$$\psi_\lambda(t) = \max\big( \varphi_{\theta_1,\delta_1,\sigma}(t), \varphi_{\theta_2,\delta_2,\sigma}(t) \big) \quad \text{where} \quad \lambda = (\theta_1, \delta_1, \theta_2, \delta_2). \qquad (17)$$
Figure 3 shows examples of synthesis for the four dictionaries generated by the sets of functions (14), (15), (16) and (17).

3 Strict Sparsity and Non-local Expansions

Most approaches for texture synthesis in computer graphics [18, 53, 19, 28, 3, 30, 27] perform a recopy of patches from an original input texture $f$ in order to create a new texture $\tilde f$ with similar structures. These processings can be cast into the sparsity framework presented in this paper. This section indeed considers our texture model in a restricted case where one seeks a strict sparsity with $s = 1$ in a highly redundant dictionary.
6
Dictionary D
!
$\tilde y_k = D x_k$   [Peyré, 2008]

Examples of Sparse Synthesis

Dictionary $D$ and synthesized $f$, for sparsity $S = 1$ and $S = 2$.
Edges, Oscillations, Lines, Crossings.
Figure3:Examplesofsynthesisfortwosparsitylevelssforthefourkindsofdictionariesconsid-
ered.
3.1 Strict Sparsity Model

Considering the extreme case where s = 1 means that one wants each patch of the synthesized image f to be close to a patch in the original exemplar f̃. Within this assumption, one can consider as a dictionary the set of all the patches extracted from the exemplar

    D = \big( p_{x_i}(\tilde f) \big)_{i=0}^{N-1} = \Pi(\tilde f).   (18)
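Concretely, (18) just stacks every patch of the exemplar into the columns of a matrix. A numpy sketch (our own, with periodic boundary handling and illustrative sizes):

```python
import numpy as np

def patch_dictionary(f_ex, w):
    """Build D = (p_{x_i}(f_ex))_i: every w-by-w patch of the exemplar,
    flattened into the columns of a matrix (periodic boundary handling)."""
    n = f_ex.shape[0]
    fp = np.pad(f_ex, ((0, w - 1), (0, w - 1)), mode="wrap")  # periodic extension
    cols = []
    for i in range(n):
        for j in range(n):
            cols.append(fp[i:i + w, j:j + w].ravel())
    return np.stack(cols, axis=1)  # shape (w*w, n*n): highly redundant

f_ex = np.random.rand(32, 32)
D = patch_dictionary(f_ex, 6)
print(D.shape)  # (36, 1024)
```

With n*n atoms of dimension w*w, the redundancy of this dictionary is far higher than that of the parametric dictionaries of the previous section.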
This dictionary is highly redundant and the synthesis algorithm looks for a perfect match

    \forall i, \quad p_{x_i}(f) = \lambda_i \, p_{\varphi(x_i)}(\tilde f), \quad where \lambda_i \in \mathbb{R},   (19)

and the warping function \varphi : \{0, \ldots, \sqrt{N}-1\}^2 \to \{0, \ldots, \sqrt{N}-1\}^2 maps the pixel locations of the synthesized f to the pixel locations of f̃.
A further simplifying assumption, done frequently in computer graphics, assumes that \lambda_i = 1, which leads to the following definition of the mapping \varphi:

    \forall x, \quad \varphi(x) \overset{def.}{=} \operatorname{argmin}_y \; \| p_x(f) - p_y(\tilde f) \|.   (20)

In this setting, algorithm 2 iterates between the best-fit computation (20) (step 3) and the averaging of the patches (step 4). This is similar to the optimization procedure of Kwatra et al. [27].
One can apply the iterative algorithm described in Listing 2 in order to draw a random texture that minimizes E_D. Figure 4 shows the iterations of texture synthesis with this highly redundant dictionary. For these examples, the size of the patches is set to \tau = 6 pixels. Figure 6 shows other examples of synthesis and compares the results with texture quilting [19]. Methods based on pixel and region copies like [19] tend to synthesize images very close to the original. Large parts of the input are often copied verbatim in the output, sometimes with periodic repetitions. In contrast, and similarly to [27], our method treats all the pixels equally and often leads to a better layout of the structures, with less global fidelity to the original.
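The best-fit computation (20) is a nearest-neighbour search over all exemplar patches. A brute-force numpy sketch (ours; practical implementations accelerate this with tree searches or dimension reduction), using periodic patch extraction:

```python
import numpy as np

def extract_patches(img, w):
    """All w-by-w patches of img with periodic boundary, flattened to rows."""
    n = img.shape[0]
    p = np.pad(img, ((0, w - 1), (0, w - 1)), mode="wrap")
    return np.stack([p[i:i + w, j:j + w].ravel()
                     for i in range(n) for j in range(n)])

def best_match_map(f, f_ex, w):
    """phi(x) = argmin_y ||p_x(f) - p_y(f_ex)||, cf. (20).
    Returns, for each pixel x of f, the flat index of the closest exemplar patch."""
    P = extract_patches(f, w)        # (N, w*w) patches of the synthesis
    Q = extract_patches(f_ex, w)     # (M, w*w) patches of the exemplar
    # squared distances expanded as ||p||^2 - 2 <p, q> + ||q||^2
    d2 = np.sum(P**2, 1)[:, None] - 2 * P @ Q.T + np.sum(Q**2, 1)[None, :]
    return np.argmin(d2, axis=1)

f_ex = np.random.rand(16, 16)
phi = best_match_map(f_ex, f_ex, 4)
# matching the exemplar against itself returns the identity mapping
print(np.array_equal(phi, np.arange(16 * 16)))  # True
```

The averaging step then blends, at each pixel, the overlapping best-match patches p_{\varphi(x)}(f̃).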
Learning Sparse Ensemble: learn the dictionary D from a single exemplar f0.
Listing 5: Sparse texture synthesis algorithm.
(1) Initialization: set f at random.
(2) Sparse code: for all locations x, compute s_x \leftarrow \operatorname{Proj}_M(p_x(f)).
(3) Reconstruction: compute the texture f by averaging the patches
        f(x) = \frac{1}{\tau^2} \sum_{|y - x| \le \tau/2} (D s_y)(x - y).
(4) Impose constraints: perform the histogram equalization of f with f_e, see [60].
(5) Stop: while not converged, go back to (2).

The corresponding algorithm is detailed in Listing 5, and is equivalent to the iterative projection algorithm of Peyré [60]. This iterative algorithm can be seen as an extension of classical texture synthesis methods such as [18, 61]. These computer graphics approaches use the highly redundant dictionary D = (p_x(f_e))_x of all the patches of the exemplar f_e and enforce a perfect recopy by asking a strict sparsity k = 1.
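Listing 5 can be sketched end to end as follows. This is our own illustration, not the paper's implementation: Proj_M is approximated by selecting the k most correlated atoms and least-squares fitting on them, patches are taken with periodic boundaries, and all sizes are illustrative.

```python
import numpy as np

def sparse_code(p, D, k):
    """Approximate projection of patch p onto k-sparse expansions in D:
    select the k atoms most correlated with p, then least-squares fit."""
    c = D.T @ p
    keep = np.argsort(np.abs(c))[-k:]
    s = np.zeros(D.shape[1])
    s[keep], *_ = np.linalg.lstsq(D[:, keep], p, rcond=None)
    return s

def synthesize(f_ex, D, w, k, n_iter=10, seed=0):
    """Sparse texture synthesis (sketch of Listing 5), periodic boundaries."""
    rng = np.random.default_rng(seed)
    n = f_ex.shape[0]
    f = rng.random((n, n))                       # (1) random initialization
    for _ in range(n_iter):
        acc = np.zeros((n, n))
        cnt = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                rows = np.arange(i, i + w) % n
                cols = np.arange(j, j + w) % n
                p = f[np.ix_(rows, cols)].ravel()
                # (2) sparse-code the patch, (3) accumulate its reconstruction
                q = (D @ sparse_code(p, D, k)).reshape(w, w)
                acc[np.ix_(rows, cols)] += q
                cnt[np.ix_(rows, cols)] += 1
        f = acc / cnt                            # (3) patch averaging
        # (4) histogram equalization against the exemplar
        idx = np.argsort(f, axis=None)
        f.ravel()[idx] = np.sort(f_ex, axis=None)
    return f

f_ex = np.random.rand(12, 12)
D = np.random.randn(16, 24)
D /= np.linalg.norm(D, axis=0)
f = synthesize(f_ex, D, w=4, k=2, n_iter=2)
print(f.shape)  # (12, 12)
```

After step (4), the synthesized texture shares the gray-level histogram of the exemplar exactly, while the sparse coding and averaging shape its patch statistics.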
The texture model \Theta captures a compact set of parameters through the dictionary D. This model also shares similarities with statistical approaches to texture synthesis such as [62–64], where some transform-domain randomization is performed. Whereas these approaches use a fixed wavelet transform [62, 64] or filters optimized from a fixed library [63], we learn this transform in a non-parametric fashion.

Figure 21 shows examples of texture synthesis for various values of the parameters m and k. Increasing the size of the dictionary allows for a more realistic synthesis, and increasing the redundancy creates more blending between the features.

Fig. 21. Examples of texture synthesis for various redundancy m/n and sparsity (original; m/n = 1, k = 2; m/n = 1, k = 8; m/n = 2, k = 2).

Inverse problems. Figure 22 shows a reconstruction from compressive
Figure 10: Iteration of the synthesis process for s = 2.

The redundancy m/n of the dictionary. More redundancy provides more geometric fidelity during the synthesis, since patches of the original texture f̃ will be better approximated in D. In contrast, using a small m leads to a compact texture model that compresses the geometric characteristics of the original texture within a few atoms. Such a model allows good generalization performance for tasks such as texture discrimination or classification when the data to process is unknown but close to f.

The sparsity s \ge 1 of the patch expansion. Increasing the sparsity s is a way to overcome the limitations inherent to a compact dictionary (low redundancy m/n) by providing more complex linear combinations. In contrast, for very redundant dictionaries (such as the non-local expansion presented in Section 3) one can even impose that s = 1. Increasing the sparsity also allows blending of features and linear variations in intensity, which lead to slow illumination gradients not present in the original texture.
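The role of s can be seen directly on the coefficient side: for an orthonormal dictionary, the projection onto s-sparse expansions is exact hard thresholding of the coefficients. A minimal sketch (our own):

```python
import numpy as np

def s_sparse_project(coeffs, s):
    """Keep the s largest-magnitude coefficients and zero the rest.
    For an orthonormal dictionary this is the exact projection onto
    the set of s-sparse patch expansions."""
    out = np.zeros_like(coeffs)
    keep = np.argsort(np.abs(coeffs))[-s:]
    out[keep] = coeffs[keep]
    return out

c = np.array([0.1, -2.0, 0.5, 1.5, -0.2])
print(s_sparse_project(c, 2))  # only the entries -2.0 and 1.5 survive
```

Raising s retains more atoms per patch, which is exactly the feature blending visible when moving down the rows of Figure 11.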
Figure 11 shows the influence of the sparsity parameter.

In order to capture features of various sizes, one can perform a progressive synthesis with various sizes of patches \tau. This leads to a multiscale synthesis algorithm that follows the one already presented in Listing 3. Note that this synthesis algorithm implicitly considers a set D_j of highly redundant dictionaries at various resolutions. Other approaches have been proposed to learn a multiscale dictionary, see for instance [45, 35].

Figure 11: Influence of the redundancy r = m/n (columns r = 0.2, 0.5, 1, 2, 4) and sparsity s (rows s = 2, 4, 8).
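The progressive coarse-to-fine strategy can be sketched as a simple driver that warm-starts each patch width with the previous scale's result. This is our own sketch; `synth_step` is an assumed callable standing in for one single-scale synthesis pass (e.g. a run of Listing 5 at a fixed width).

```python
import numpy as np

def multiscale_synthesize(f_ex, widths, synth_step, seed=0):
    """Run a single-scale synthesis routine over a decreasing list of patch
    widths, warm-starting each scale with the previous scale's result.
    `synth_step(f_init, f_ex, w)` is an assumed callable refining f_init
    with patches of width w."""
    rng = np.random.default_rng(seed)
    f = rng.random(f_ex.shape)       # random initialization at the coarsest scale
    for w in widths:                 # e.g. widths = [12, 8, 6]: coarse to fine
        f = synth_step(f, f_ex, w)
    return f
```

Each scale then implicitly works with its own highly redundant dictionary D_j, matching the multiscale set mentioned above.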
Computer Graphics Approach
Dictionary: all patches p_z(f0) = f0(z + ·) of the exemplar f0.
Patch copy: 1-sparsity of the synthesized f.
Mapping \varphi from f to f0: p_z(f) \approx p_{\varphi(z)}(f0).

Texture Inpainting
Conclusion
Sparse dictionary adaptation:
→ Minimizing MAP on D.
→ Task-driven formulation.
Patch-based processing:
→ Connection with computer graphics.
→ Learn on the data to process.
Open problems:
→ Non-convex, slow.
→ Beyond \ell^1: structured sparsity.
→ Theoretical guarantees.