Deep Learning: a whirlwind tour of key principles

Deep Learning : 1
BMVA Computer Vision
Summer School 2019
Deep Learning within
Computer Vision

a whirlwind tour of the key principles*
[* and perhaps things we could all do with remembering ]
Toby Breckon
Engineering and Computer Science
Durham University
www.durham.ac.uk/toby.breckon/mltutorial/
Slide material acknowledgements (some material): Lee (UC Davis), Grauman (UT Austin), Lazebnik (Illinois), Fei-Fei (Stanford),
Fergus (Stony Brook), Huang (Illinois), Lee (Michigan), Ranzato (Facebook A.I. Research), Sermanet (Google), Vedaldi (Oxford), Hinton (Toronto), Fisher (HIPR2, Edinburgh) + additional URL/acknowledgement/paper refs on individual slides

Deep Learning : 2
Let’s start at the very beginning ...

Deep Learning : 3
Machine Learning ?
 Why Machine Learning?
–we cannot program everything
–some tasks are difficult to define algorithmically
–especially in computer vision
…. visual sensing has few rules
Well-defined learning problems ?
–easy to learn Vs. difficult to learn
..... varying complexity of visual patterns
An example: learning to recognise objects ...
Image: DK

Deep Learning : 4
Learning ? - in humans

Deep Learning : 5
Learning ? - in computers

Deep Learning : 6
Machine Learning
Definition:
●A set of methods for the automated analysis of structure in
data. …. two main strands of
work, (i) unsupervised learning ….
and (ii) supervised learning.
….similar to ... data mining, but ... focus .. more on
autonomous machine performance, ….
rather than enabling humans to learn from the data.
[Dictionary of Image Processing & Computer Vision, Fisher et al., 2014]

Deep Learning : 7
Supervised Vs. Unsupervised
Supervised
–knowledge of output - learning with the
presence of an “expert” / teacher
•data is labelled with a class or value
•Goal: predict class or value label
●e.g. Neural Network, Support Vector Machines, Decision
Trees, Bayesian Classifiers ....

Unsupervised
–no knowledge of output class or value
•data is unlabelled or value unknown
•Goal: determine data patterns/groupings
–Self-guided learning algorithm
●(internal self-evaluation against some criteria)
●e.g. k-means, genetic algorithms, clustering approaches ...
….
c1
c2
c3
…. ?

Deep Learning : 8
Machine Learning = “Decision or Prediction”
Pixels / Voxels / Samples OR (some) Feature Representation (e.g. SIFT, HOG, histogram, Bag of Words, PCA ...)
→ person | cat | dog | cow | …. | car | rhino | …. | position | “style” | depth
… in the big picture

Deep Learning : 9
Common Machine Learning Tasks
Object Classification
what object ?
Object Detection
object or no-object ? {people | vehicle | … intruder ….}
Instance Recognition ?
who (or what) is it ? {face | vehicle plate | gait …. → biometrics}
Sub-category analysis
which object type ? {gender | type | species | age …...}
Sequence { Recognition | Classification } ?
what is happening / occurring ?
http://pascallin.ecs.soton.ac.uk/challenges/VOC/

Deep Learning : 10
Types of Machine Learning Problem
Classification
●Predict (classify) sample → discrete set of class labels
●e.g. classes {object 1, object 2 … } for recognition task
●e.g. classes {object, !object} for detection task
Regression (traditionally less common in comp. vis.)
●Predict sample → associated numerical value (variable)
●e.g. distance to target based on shape features
●Linear and non-linear attribute to value relationships
Association & clustering
●grouping a set of instances by attribute similarity
●e.g. image segmentation
[Ess et al, 2009]
…. ?

Deep Learning : 11
Simple Regression Example – Head Pose Estimation
Input: image features (HOG)
Output: { yaw | pitch }
varying illumination + vibration
[Walger / Breckon, 2014]
http://www.youtube.com/embed/UcF_otQSMEc?rel=0
[ video ]

Deep Learning : 12
Complex Regression Example – Full-Body Pose Estimation
Input: raw image
Output: 17 pose keypoints
[Papandreou et al. 2018]
PoseNet (Google Research):
https://github.com/tensorflow/tfjs-models/tree/master/posenet
Live demo (browser):
https://storage.googleapis.com/tfjs-models/demos/posenet/camera.html

Deep Learning : 13
The move from shallow to deep …
(< ~2013): pre deep learning / pre abyssi
(> ~2013): post deep learning / post abyssi

Deep Learning : 14
Traditional (shallow) Approaches
Image / Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
•Features are not learned
● vectors of shape measures, edge distributions, colour distributions, feature points, HOG, visual words, ... etc.
● … i.e. calculated summary “numerical” descriptors
•Trainable classifier is often generic
(e.g. SVM kernel, Decision Forest)
.. see extra slides

Deep Learning : 15
Deep Learning – end to end approaches
•Learn a feature hierarchy all the way from pixels (or voxels)
to classifier
•Each layer extracts “features” from the output of previous
layer
•Layers have similar structure, performing varying functions
•Train (i.e. optimize) all layers jointly
Image / Video Pixels → Layer 1 → Layer 2 → Layer 3 → Classifier

Deep Learning : 16
“Shallow” vs. “deep” architectures
Traditional recognition: “Shallow” architecture
Image / Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class (or output prediction)
(modern) Deep learning: “Deep” architecture
Image / Video Pixels → Layer 1 → … → Layer N → Simple classifier → Object Class (or output prediction)

Deep Learning : 17
A typical end-to-end deep learning
convolutional neural network ….
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
•“AlexNet”: seminal ImageNet Challenge winner (2012)
•Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
•More data (10^6 vs. 10^3 images)
•GPU implementation (50x speedup over CPU)
•Trained on two GPUs for a week
•Algorithms: better regularization for training (Dropout)

Deep Learning : 18
The Image Classification Challenge:
1,000 object classes
1,431,167 images
[ figure: example image labelled “Steel drum”; chart with axis “% error”; Human - Russakovsky et al. IJCV 2015 ]
[ This slide – from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 19
AlexNet’s performance on this
benchmark task was the research
event (“discovery”) that led to the
shallow to deep transformation in
computer vision ...*
* although CNNs were not created overnight, and have their origins in [LeCun, 1998] among others

Deep Learning : 20
pre abyssi → post abyssi*
* here meaning post (after) the “discovery” of deep learning by the computer vision research community
Note: we are currently in what may become known as the “deep-age”, if not perhaps the “dark-age” (?) of computer vision (see final slides)

Deep Learning : 21
This talk ...
Is not about …
–how to use {tensorflow |
pytorch | keras | mxnet ….
tensor-py-flow-net (!?) … }
–hyper-parameter tuning
–specific advanced concepts
that have been published on
arXiv in the time I have been
talking …
… sorry.
Is about …
–understanding the core
concepts, well
–bringing everyone up to
speed on where we are and
how we got here
•re: deep learning
–understanding the limitations
of current understanding
+ the challenges that lie
ahead
[ itself a shallow overview of deep learning approaches ]

Deep Learning : 22
Deep learning can do some clever stuff ...
e.g. Monocular depth prediction via style transfer
[Atapour / Breckon, CVPR 2018] - https://github.com/atapour/monocularDepth-Inference
[ figure: training and testing pipeline. Training: generators G_A→B, G_B→A, G_B→C with discriminators D_A, D_B, D_C and losses l_adv, l_rec. Testing: Input RGB (I) → Restyled RGB G_A→B(I) → Output Depth G_B→C[G_A→B(I)] ]
[ video ]

Deep Learning : 23
Key question – why does this stuff
work so well ?
(let’s examine some of the fundamentals)

Deep Learning : 24
Key Principle
Input Image
Convolution (Learned)
Non-linearity
Pooling
Feature maps
….
….
….
….
….
Final
Classification
Each layer in a deep network performs a
different transformation (function) to
map from input to output – these vary from
{convolution, pooling, sub-sampling, non-linear
mapping (fully connected), ….}.
This variety of layer functions provides the
network with a larger parameterization space
to represent (complex) input to output relationships.
By contrast, within a traditional (shallow) Neural
Network, the perceptron activation
functions are all the same (only the
weights vary).

Deep Learning : 25
The rise of the ….
Convolutional Neural Networks (CNN)
•Multi-layer Neural network with:
-Local connectivity
-Shared weight parameters across spatial positions
•Stack multiple stages of feature extractors
operating directly on the image
•Higher stages compute more global, more
invariant feature representation
•Final classification layer at the end
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278–2324, 1998.
Task: digit recognition

Clarification:
Convolutional Neural Networks (CNN)
(i.e. networks using convolution layers)
are a subset of the generalized
Deep Learning extension to
Neural Networks
(i.e. Deep Neural Networks)
CNNs are specific to images (or similar densely sampled signals)**

** we'll concentrate on those here

Deep Learning : 27
A deep multi-layer architecture ...

Deep Learning : 28
Convolutional Neural Networks (CNN)
•Feed-forward network:
–Convolve input (feature extraction)
–Non-linearity (rectified linear)
–Pooling (local max)
Supervised learning (with labels)
Train convolution filters by backpropagation
Input Image
Convolution (Learned)
Non-linearity
Pooling
Feature maps
….
….
….
….
….
Final
Classification
Example: LeNet-5 (http://book.paddlepaddle.org/03.image_classification/ )

Deep Learning : 29
Convolution in the first layer
immediately reduces the
complexity of the input in a
structured and meaningful way
(based on learnt weights)
Can share parameters across filters to reduce the size of the
parameter set
Convolution Layer(s)

Deep Learning : 30
Input (image) → Intermediate Feature Map
•Convolutional
–dependencies are local
–translation invariance
–few parameters (filter weights)
–filter stride can be > 1 (faster, less memory)
Convolution Layer(s)

Deep Learning : 31
Aside : image convolution
(in general image processing)
… is essentially the localised weighted sum of the image and
a convolution kernel (mask weights) over an N x M pixel
neighbourhood, at a given location (x,y) within the image.
Input Image
Output Image
Image source: developer.apple.com
[ used in image filtering operations (e.g. smoothing) ]
smoothing
RECAP: from Low Level Vision

Deep Learning : 32
Convolution is very powerful ...
original | filter (3 x 3)
blur:
1 1 1
1 1 1
1 1 1
sharpen:
0 -1 0
-1 5 -1
0 -1 0
edges:
1 2 1
0 0 0
-1 -2 -1
Different weights have a wide range of effects on input data ….
RECAP: from Low Level Features
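As a rough illustration of the localised weighted sum, here is a minimal pure-Python sketch. The function name `convolve2d`, the 5 x 5 test image, the 1/9 normalisation of the blur kernel and the use of the standard sharpening kernel are assumptions made for illustration; like CNN layers, it slides the kernel without flipping it (strictly, cross-correlation) and only over "valid" positions.

```python
def convolve2d(image, kernel):
    """Localised weighted sum of image and kernel at every valid (x, y) location."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            total = 0.0
            for j in range(kh):          # kernel rows
                for i in range(kw):      # kernel columns
                    total += image[y + j][x + i] * kernel[j][i]
            row.append(total)
        output.append(row)
    return output

# 3 x 3 kernels (blur scaled by 1/9 here so that overall intensity is preserved)
blur = [[1 / 9] * 3 for _ in range(3)]
sharpen = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
edges = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

# hypothetical test image: two dark rows above three bright rows (a horizontal edge)
image = [[0] * 5, [0] * 5, [10] * 5, [10] * 5, [10] * 5]

blurred = convolve2d(image, blur)
sharpened = convolve2d(image, sharpen)
edge_response = convolve2d(image, edges)
```

The edge kernel responds strongly where the window straddles the dark-to-bright boundary and gives zero on the flat bright region, which is the behaviour a learned first-layer filter can also arrive at.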

Deep Learning : 33
hence → multiple layers of
convolution
(“convolutions upon convolutions”)
can approximate and provide most
feature extraction and image pre-
filtering (de-noising) approaches ...

Deep Learning : 34
Convolution Layer(s)
Produces a structured
intermediate feature map from the
input image (or previous layer in
the network)
Input - image / layer → Output - Feature Map

Deep Learning : 35
Non-linearity Layer(s)
(a.k.a. fully connected)
Provides a non-linear input to
output mapping via a (traditional)
activation function approach
layer of neurons
–may be either sub-sampling (N → M) or fully
connected (N → M, N=M) via per element
activation function, e.g.
•Tanh
•Sigmoid: 1/(1+exp(-x))
•Rectified linear (most common)
»Simplifies backprop
»Makes learning faster
»Avoids output saturation issues
Previous Layer, size N → Next Layer, size M

Deep Learning : 36
Pools the input layer to form new intermediate output layer.
By “pooling” (e.g., taking max / sum) filter responses at different
locations we gain robustness to the variance of the spatial
location of features and reduce input dimensionality.
Pooling layer(s)

Deep Learning : 37
•Performs localized sum() or max() over sub windows/regions
● non-overlapping Vs. overlapping regions
● Role of pooling:
● Invariance to small transformations
● Larger receptive fields (see more of input)
[ figure: max() and sum() pooling examples ]
Pooling layer(s)
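A minimal sketch of localized pooling, assuming a small hand-made 4 x 4 feature map; the helper name `pool2d` and its `size` / `stride` / `op` parameters are hypothetical, chosen only to show non-overlapping versus overlapping windows.

```python
def pool2d(feature_map, size=2, stride=2, op=max):
    """Apply op (e.g. max or sum) over size x size windows of a 2D feature map."""
    pooled = []
    for y in range(0, len(feature_map) - size + 1, stride):
        row = []
        for x in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[y + j][x + i]
                      for j in range(size) for i in range(size)]
            row.append(op(window))
        pooled.append(row)
    return pooled

# hypothetical 4 x 4 feature map of filter responses
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

max_pooled = pool2d(feature_map, op=max)                     # non-overlapping 2 x 2
sum_pooled = pool2d(feature_map, op=sum)                     # non-overlapping 2 x 2
overlapping = pool2d(feature_map, size=2, stride=1, op=max)  # overlapping windows
```

With stride equal to the window size the 4 x 4 map shrinks to 2 x 2 (dimensionality reduction); with stride 1 the windows overlap and the output is 3 x 3.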

Deep Learning : 38
CNN – example architecture (various layers)
Number of layers, type of layer, number of nodes/maps per
layer – all down to the (human) designer
–many variants have emerged
CNN training – efficient backpropagation
as per traditional neural network approaches (with Dropout)
Example CNN design:
https://sites.google.com/site/5kk73gpu2013/assignment/cnn

Deep Learning : 39
Seminal
Deep Learning Architectures
ImageNet Classification with Deep Convolutional Neural Networks
A Krizhevsky, I Sutskever, G Hinton (2012) - “AlexNet”
Going Deeper with Convolutions, C Szegedy et al (2014) - “GoogLeNet”

Deep Learning : 40
→ Contemporary architectures ….
VGG -16 … (deeper, smaller 3x3 convolutions throughout)
ResNet …(residual blocks connecting input to the output)
…. and more and more ...
Simonyan & Zisserman, 2014: http://www.robots.ox.ac.uk/~vgg/research/very_deep/
K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR 2016.
Figures: http://book.paddlepaddle.org/03.image_classification/

Deep Learning : 41
Winners – 2010 → 2017
Winner                          Depth
Lin et al                       shallow
Sanchez & Perronnin             shallow
Krizhevsky et al (AlexNet)      8 layers
Zeiler & Fergus                 8 layers
Simonyan & Zisserman (VGG)      19 layers
Szegedy et al (GoogLeNet)       22 layers
He et al (ResNet)               152 layers
Shao et al                      152 layers
Hu et al (SENet)                152 layers
Russakovsky et al               (human)
[ This slide – from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 42
Comparing complexity...
[ This slide – from: Fei-Fei Li & Justin Johnson & Serena Yeung]
An Analysis of Deep Neural Network Models for Practical Applications, 2017.
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced for educational use.
VGG: Highest memory,
most operations
Inception-v4: Resnet + Inception!
GoogLeNet:
most efficient
ResNet:
Moderate efficiency
depending on model,
highest accuracy
AlexNet:
Smaller compute, still memory
heavy, lower accuracy

Deep Learning : 43
Forward pass time & power consumption ...
[ This slide – from: Fei-Fei Li & Justin Johnson & Serena Yeung]
An Analysis of Deep Neural Network Models for Practical Applications, 2017.
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced for educational use.

Deep Learning : 44
Sometimes simple is better …
[for simple problems, where deeper models may just overfit]
Architecture          TPR   FPR   F     P     A
AlexNet               0.91  0.07  0.93  0.95  0.92
InceptionV1           0.96  0.09  0.95  0.94  0.93
VGG-13                0.93  0.11  0.93  0.92  0.91
FireNet               0.92  0.09  0.93  0.93  0.92
InceptionV1-OnFire    0.96  0.10  0.94  0.93  0.93
Statistical performance on full-frame fire detection – True Positive Rate
(TPR), False Positive Rate (FPR), F-score (F), Precision (P), Accuracy (A)
Reduced complexity CNN sub-architecture of InceptionV1 - InceptionV1-OnFire
[ figure: architecture diagrams – convolution feature maps (64, 128, 256), max pooling, dense layers (4096, 4096); InceptionV1 layer graph (Conv, MaxPool, LocalRespNorm, DepthConcat, AveragePool, FC, Softmax) ]
Reduced complexity CNN sub-architecture of AlexNet [1] - FireNet
… superior in-frame fire detection obtained via
reduced complexity architectures [Dunnings / Breckon, 2018]
https://github.com/tobybreckon/fire-detection-cnn

Deep Learning : 45
Varying Deep CNN Architectures
(and applications) all based on ...
•Feed-forward network:
–Convolve input (feature extraction)
–Non-linearity (rectified linear)
–Pooling (local max)
… all trained by backpropagation
Repeated multiple times over network architecture
Input
Convolution (Learned)
Non-linearity
Pooling
Feature maps
….
….
….
….
….
Final
Classification
Example: LeNet-5 (http://book.paddlepaddle.org/03.image_classification/ )

Deep Learning : 46
Train via Backpropagation
Neural Network Training: weight modifications are made in
the “backwards” direction: from the output layer, through each
hidden layer down to the first hidden layer, hence
“Backpropagation”
Key Algorithmic Steps
–Initialize weights (to small random values) in the network
–Propagate the inputs forward
•(by applying activation function) at each node
–Backpropagate the error
•(by updating weights and biases)
–Terminating condition
•when validation error is very small or enough iterations
Backpropagation details beyond scope/time (see accompanying reading)
.. see extra slides
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536.
ASIDE: deep learners should know this
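The key algorithmic steps can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the full algorithm from the accompanying reading: the XOR data, the 2-4-1 sigmoid network, the squared-error loss, the learning rate and the use of training loss (rather than a separate validation set) for termination are all choices made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: XOR (4 samples, 2 inputs, 1 target each)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initialise weights to small random values (biases to zero)
W1 = rng.normal(0.0, 0.5, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0.0, 0.5, size=(4, 1))
b2 = np.zeros((1, 1))

learning_rate = 0.5
losses = []
for iteration in range(5000):
    # Step 2: propagate the inputs forward, applying the activation at each node
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    loss = 0.5 * float(np.mean((output - y) ** 2))
    losses.append(loss)

    # Step 4: terminating condition (training loss stands in for validation error)
    if loss < 1e-3:
        break

    # Step 3: backpropagate the error from the output layer to the hidden layer
    grad_output = (output - y) / len(X) * output * (1.0 - output)
    grad_hidden = (grad_output @ W2.T) * hidden * (1.0 - hidden)

    # gradient-descent updates of weights and biases
    W2 -= learning_rate * hidden.T @ grad_output
    b2 -= learning_rate * grad_output.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ grad_hidden
    b1 -= learning_rate * grad_hidden.sum(axis=0, keepdims=True)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f} after {len(losses)} iterations")
```

Note that `grad_hidden` is computed from the output-layer gradient before any weights are updated, which is the "backwards" direction the slide describes.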

Deep Learning : 47
So overall deep network layers
provide …
feature extraction
data de-noising
dimensionality reduction
feature pooling
spatial invariance
non-linear input → output mapping
… specifically optimized towards a
given problem*
(* that is represented by a given set of defined examples)

Deep Learning : 48
… which really covers most desirable
aspects for most computer vision
problems we encounter.
[if you think about it]

Deep Learning : 49
Key question – is it really that
easy ?
(images in →results out ?!?)

Deep Learning : 50
Is it really that simple ?
https://www.youtube.com/watch?v=mxKlUO_tjcg [ video ]
Cao et al, CVPR, 2017
https://github.com/ZheC/Realtime_Multi-Person_Pose_Estimation

Deep Learning : 51
Remember, it's all about ..

Deep Learning : 52
Learning from the Data
Training data: used to train the system
–i.e. build the rules / learnt target function
–split into training (back-propagated) and
validation (when to stop back-propagating)
–specific examples (used to learn)
Test data: used to test performance of the system
–unseen by the system during training
–specific examples (used to evaluate)
e.g. face gender classification
….
….
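A minimal sketch of the training / validation / test split described above. The function name `split_dataset`, the 70/15/15 proportions and the fixed seed are assumptions for illustration; the point is that the three sets are disjoint and the test set is never seen during training.

```python
import random

def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                  # held out: used only to evaluate
    val = shuffled[n_test:n_test + n_val]     # used to decide when to stop training
    train = shuffled[n_test + n_val:]         # used to learn the weights
    return train, val, test

# hypothetical dataset of 100 sample identifiers
train_set, val_set, test_set = split_dataset(range(100))
```

In practice the split would often also be stratified by class label, which this sketch does not attempt.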

Deep Learning : 53
Simple ? - Well almost …..

provided we avoid
the pitfalls on the way
(i.e. follow good practice and do good science)
.. see extra slides

Deep Learning : 54
We must avoid over-fitting …..
(i.e. over-learning)

Deep Learning : 55
Principle of Occam's Razor
Occam's Razor
●“entia non sunt multiplicanda praeter
necessitatem” (Latin!)
●“entities should not be multiplied beyond
necessity” (English)
●“All things being equal, the simplest
solution tends to be the best one”
For Machine Learning : prefer the
simplest {model | hypothesis | …. tree |
projection | network } that fits the data
14th-century English logician
William of Ockham

Deep Learning : 56
Function f()
Learning Model
(approximation of f())
Training Samples
(from function)
Source: [PRML, Bishop, 2006]
Degree of Polynomial Model
Graphical Example: function approximation (via regression)

Deep Learning : 57
Function f()
Learning Model
(approximation of f())
Training Samples
(from function)
Source: [PRML, Bishop, 2006]
Increased Complexity

Deep Learning : 58
Function f()
Learning Model
(approximation of f())
Training Samples
(from function)
Source: [PRML, Bishop, 2006]
Increased Complexity
Good Approximation

Deep Learning : 59
Function f()
Learning Model
(approximation of f())
Training Samples
(from function)
Source: [PRML, Bishop, 2006]
Over-fitting!
Poor approximation
(as the model M=9 is not the simplest that fits the data!)

Deep Learning : 60
How to spot over-fitting ...
With increasing model complexity or training iterations:
–performance on the training data improves
–performance on the unseen test data decreases
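The polynomial example can be reproduced in a few lines. This is a sketch under assumptions: the target function sin(2πx), the noise level, the 10 training samples and the particular degrees (0, 1, 3, 9) are chosen here to mirror the style of the [PRML, Bishop, 2006] figures rather than taken from the slides themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Underlying function f() that the model tries to approximate (assumed here)."""
    return np.sin(2 * np.pi * x)

# training samples: noisy observations of f(); test samples: unseen points on f()
x_train = np.linspace(0.0, 1.0, 10)
y_train = f(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_test = np.linspace(0.0, 1.0, 100)
y_test = f(x_test)

def rmse(predicted, target):
    return float(np.sqrt(np.mean((predicted - target) ** 2)))

errors = {}
for degree in (0, 1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_error = rmse(np.polyval(coeffs, x_train), y_train)
    test_error = rmse(np.polyval(coeffs, x_test), y_test)
    errors[degree] = (train_error, test_error)
    print(f"degree {degree}: train RMSE {train_error:.3f}, test RMSE {test_error:.3f}")
```

The degree 9 model passes through all 10 noisy training samples, so its training error is close to zero; comparing its test error with that of the degree 3 model is how the over-fitting would be spotted.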

Deep Learning : 61
Key issue – data bias
Overfitting to both train/test data
distributions due to dataset bias
→ leads to poor or unintended
behaviour in deployment
–e.g.
https://www.bbc.co.uk/news/business-48842750 [8th July 2019]
Gender Shades: Intersectional Accuracy Disparities in
Commercial Gender Classification, Proceedings of Machine Learning Research 81:1–15, 2018
http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf

Deep Learning : 62
Q: – how to spot
dataset bias ?
A: in general – with great difficulty (open research area)

Deep Learning : 63
Key question – what about all this
terminology ?
( the “language” of the abyssi era )
(the basis of deep learning; things to know + understand)

Deep Learning : 64
Loss functions ?

Deep Learning : 65
Loss Functions: how good is our net ?
Suppose: 3 training examples, 3 classes.
With some parameters, W the scores are:
          example 1   example 2   example 3
cat          3.2         1.3         2.2
car          5.1         4.9         2.5
frog        -1.7         2.0        -3.1
A loss function tells how
good our current classifier is
Given a dataset of examples $\{(x_i, y_i)\}_{i=1}^{N}$
where $x_i$ is image and
$y_i$ is (integer) label
Loss over the dataset is a
sum of loss over examples:
$L = \frac{1}{N} \sum_{i} L_i(f(x_i, W), y_i)$
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 66
Loss Functions: how good is our net ?
Suppose: 3 training examples, 3 classes.
With some parameters, W the scores are:
          example 1   example 2   example 3
cat          3.2         1.3         2.2
car          5.1         4.9         2.5
frog        -1.7         2.0        -3.1
Multi-class Support Vector
Machine (SVM) loss:
Given an example $(x_i, y_i)$
where $x_i$ is the image and
$y_i$ is the (integer) label,
and using the shorthand for
the scores vector: $s = f(x_i, W)$
the SVM loss has the form:
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 67
Loss Functions: how good is our net ?
Suppose: 3 training examples, 3 classes.
With some parameters, W the scores are:
          example 1   example 2   example 3
cat          3.2         1.3         2.2
car          5.1         4.9         2.5
frog        -1.7         2.0        -3.1
Multi-class SVM loss = “Hinge loss”
the hinge loss has the form (with scores vector $s = f(x_i, W)$):
$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
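A small worked sketch of the multi-class SVM (hinge) loss on the scores above. It assumes the standard form with a margin of 1 and that the three examples are, in order, a cat, a car and a frog image; both are assumptions, since the labels and the formula itself did not survive in the slide text.

```python
CLASSES = ["cat", "car", "frog"]

# one list of class scores (cat, car, frog) per training example, with its
# assumed true label given as an index into CLASSES
examples = [
    ([3.2, 5.1, -1.7], 0),   # example 1, assumed to be a cat image
    ([1.3, 4.9, 2.0], 1),    # example 2, assumed to be a car image
    ([2.2, 2.5, -3.1], 2),   # example 3, assumed to be a frog image
]

def svm_loss(scores, label, margin=1.0):
    """Sum over the wrong classes of max(0, s_j - s_label + margin)."""
    return sum(max(0.0, scores[j] - scores[label] + margin)
               for j in range(len(scores)) if j != label)

losses = [svm_loss(scores, label) for scores, label in examples]
mean_loss = sum(losses) / len(losses)
```

Under these assumptions the car example incurs no loss because its correct score exceeds every other score by more than the margin, while the frog example is penalised by both wrong classes.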

Deep Learning : 68
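Regularization: prevent the net over-fitting
$L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W)$
Data loss (first term): Model predictions
should match training data
Regularization (second term): Prevent the model
from doing too well on training
data (i.e. overfitting)
→Occam’s Razor
$\lambda$ = regularization strength
(hyperparameter)
Simple examples
L2 regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
L1 regularization: $R(W) = \sum_k \sum_l |W_{k,l}|$
Elastic net (L1 + L2): $R(W) = \sum_k \sum_l \left( \beta W_{k,l}^2 + |W_{k,l}| \right)$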
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 69
Regularization: prevent the net over-fitting
$L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W)$
Data loss (first term): Model predictions
should match training data
Regularization (second term): Prevent the model
from doing too well on training
data (i.e. overfitting)
$\lambda$ = regularization strength
(hyperparameter)
Why regularize?
-Express preferences over weights
-Make the model simple so it works on test data
-Improve optimization by adding curvature
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 70
e.g. L2 Regularization: how and why ?
L2 Regularization
Where several W may have the same classification result,
L2 regularization expresses a preference for weight equality:
it likes to “spread out” the weights
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
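To make the "spread out" preference concrete, here is a minimal sketch. The input `x` and the two weight vectors `w1` and `w2` are hypothetical values chosen so that both give the same score; the `beta` in the elastic net is likewise an assumed mixing constant.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def l2_penalty(w):
    return sum(wi * wi for wi in w)

def l1_penalty(w):
    return sum(abs(wi) for wi in w)

def elastic_net_penalty(w, beta=0.5):
    return sum(beta * wi * wi + abs(wi) for wi in w)

x = [1.0, 1.0, 1.0, 1.0]          # hypothetical input
w1 = [1.0, 0.0, 0.0, 0.0]         # all weight concentrated on one input
w2 = [0.25, 0.25, 0.25, 0.25]     # same score, weight spread equally

same_score = dot(w1, x) == dot(w2, x)
l2_values = (l2_penalty(w1), l2_penalty(w2))
l1_values = (l1_penalty(w1), l1_penalty(w2))
```

Both weight vectors classify this input identically, but L2 regularization assigns the spread-out `w2` a quarter of the penalty of `w1`, whereas L1 regularization does not distinguish between them here.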

Deep Learning : 71
Softmax(): mapping output activation
scores to probabilities
Want to interpret raw CNN output scores as probabilities:
scores → probabilities, use softmax() function
         scores     exp       normalize (probabilities)
cat       3.2   →   24.5   →   0.13
car       5.1   →   164.0  →   0.87
frog     -1.7   →   0.18   →   0.00
Probabilities must be >= 0
Probabilities must sum to 1
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 72
Softmax(): mapping output activation
scores to probabilities
Want to interpret raw CNN output scores as probabilities:
scores → probabilities, use softmax() function
         scores     exp       normalize (probabilities)    compare    Correct (probabilities)
cat       3.2   →   24.5   →   0.13                                    1.00
car       5.1   →   164.0  →   0.87                                    0.00
frog     -1.7   →   0.18   →   0.00                                    0.00
Probabilities must be >= 0
Probabilities must sum to 1
Cross-entropy loss (diff. in correct probabilities)
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
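The exp / normalize steps and the comparison against the correct class can be sketched as follows. Subtracting the maximum score before exponentiating is an added numerical-stability step (it does not change the result), and taking "cat" as the correct class follows the table above.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities: exponentiate, then normalize to sum to 1."""
    shift = max(scores)                        # stability only; cancels on normalizing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, correct_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probabilities[correct_index])

scores = [3.2, 5.1, -1.7]                      # cat, car, frog
probabilities = softmax(scores)
loss = cross_entropy(probabilities, correct_index=0)   # correct class: cat

print([round(p, 2) for p in probabilities], round(loss, 2))
```

The network here puts most of its probability on "car", so the loss for the correct "cat" class is large; a perfect prediction (probability 1.00 for cat) would give a loss of 0.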

Deep Learning : 73
… in every node: activation functions

Deep Learning : 74
Sigmoid
tanh
ReLU
Leaky ReLU
Maxout
ELU

Output of Layer N-1 → To Layer N+1
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 75
Activation Functions – ReLU variations
Leaky ReLU [Maas et al., 2013]: $f(x) = \max(0.01x, x)$
-Does not saturate
-Computationally efficient
-Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
-will not “die” (like ReLU → 0 output)
also Parametric Rectifier (PReLU) [He et al., 2015]: $f(x) = \max(\alpha x, x)$,
with $\alpha$ a backprop parameter
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
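A minimal sketch of the two variations. The fixed slope of 0.01 is a commonly used value assumed here; the helper prelu_grad_alpha is a hypothetical name, shown only to illustrate how a single shared slope could receive a gradient during backpropagation.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # slope=0.01 is a commonly used value, assumed here
    return np.where(x > 0, x, slope * x)

def prelu(x, alpha):
    # Same form as Leaky ReLU, but alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x, upstream):
    # d(output)/d(alpha) is x where x <= 0 and 0 elsewhere; combine with
    # the upstream gradient and sum to update one shared alpha
    return float(np.sum(upstream * np.where(x > 0, 0.0, x)))

x = np.array([-3.0, -1.0, 2.0])
out = leaky_relu(x)
grad = prelu_grad_alpha(x, np.ones_like(x))
print(out, grad)
```

Because negative inputs still produce a small non-zero output and gradient, such a unit does not get stuck at zero output.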

Deep Learning : 76
… in every node: activation functions

-Use ReLU. Be careful with your learning rates
-Try out Leaky ReLU / Maxout / ELU
-Try out tanh but don’t expect much
-Don’t use sigmoid
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 77
… but how many network
architectures are we training?
(are all the node weights updated in each backpropagation cycle?)

Deep Learning : 78
Dropout: efficient and robust training
for large neural nets (Hinton et al., 2012 http://arxiv.org/abs/1207.0580)

Consider a neural net with H hidden units

Each time we present a training
example within backpropagation, we
randomly omit each hidden unit in {all
| some} layers with probability 0.5.
→ we are randomly sampling from 2^H
different architectures

At test time – use all hidden units but
halve all the outgoing weights
– computes an approximate mean of the
predictions of all 2^H models.
[ This slide – adapted from: G. Hinton]

Deep Learning : 79
Get the inputs setup correctly ...

Deep Learning : 80
Input pre-processing – image data
e.g. consider CIFAR-10
dataset with [32,32,3] images
Zero-centre all the image pixel data inputs so
backpropagation gradients are both +ve and -ve
How -
-Subtract the mean image (e.g. AlexNet)
(CIFAR - mean image = [32,32,3] array)
-Subtract per-channel mean (e.g. VGGNet)
(CIFAR - mean along each channel = 3 numbers)
Not common to normalize variance,
to do PCA or whitening
Remember to zero-centre inputs at run-time (test) also!
Otherwise – major issue!
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
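Both zero-centring options can be sketched as follows. Random arrays stand in for CIFAR-10 here (an assumption for this example; only the [32,32,3] image shape matches). Note that the mean computed on the training data is reused unchanged at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for CIFAR-10: random data with the same [32,32,3] image shape
train = rng.integers(0, 256, size=(100, 32, 32, 3)).astype(np.float64)
test = rng.integers(0, 256, size=(10, 32, 32, 3)).astype(np.float64)

# Option 1: subtract the mean image (a [32,32,3] array)
mean_image = train.mean(axis=0)
train_a = train - mean_image
test_a = test - mean_image        # reuse the training mean at test time

# Option 2: subtract the per-channel mean (3 numbers)
channel_mean = train.mean(axis=(0, 1, 2))
train_b = train - channel_mean
test_b = test - channel_mean      # reuse the training mean at test time

print(mean_image.shape, channel_mean.shape)
```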

Deep Learning : 81
Data Augmentation – generate more data!
Use random mix/combinations of : flipping, translation,
rotation, stretching, shearing, illumination changes
(log/exp/gamma transform), lens distortions, …(go crazy!)
Load image and label (“cat”) → Transform image → CNN → Compute loss
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]
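A minimal sketch of the “transform image” step, combining three of the listed transformations (flip, translation, gamma). The ranges used (shift of up to 4 pixels, gamma between 0.7 and 1.5) and the wrap-around translation are assumptions chosen for illustration.

```python
import numpy as np

def augment(image, rng):
    """Random flip, translation and gamma change for an HxWx3 image in [0, 1]."""
    out = image
    if rng.random() < 0.5:                    # random horizontal flip
        out = out[:, ::-1, :]
    dy, dx = rng.integers(-4, 5, size=2)      # random shift of up to 4 pixels
    out = np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))  # wrap-around shift
    gamma = rng.uniform(0.7, 1.5)             # random illumination change
    return np.clip(out, 0.0, 1.0) ** gamma

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))               # stand-in for a loaded "cat" image
label = "cat"                                 # the label is unchanged by augmentation
augmented = augment(image, rng)
print(augmented.shape, label)
```

Calling augment() afresh for every training step gives the CNN a slightly different version of the same labelled image each time.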

Deep Learning : 82
And finally ...

Deep Learning : 83
Hyper-parameters: within deep learning
choices about the algorithm that
we set rather than learn
–e.g.
•drop-out rate
•weight initialization
•Backprop parameters ...
•…
•cross-validation folds (?)
Highly problem-dependent.
–must explore the hyper-parameter
space (via glorified “trial and error”)
to find optimal set
https://chrisalbon.com/machine_learning/model_selection/hyperparameter_tuning_using_grid_search/
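The “trial and error” exploration can be sketched as an exhaustive grid search. The function validation_score below is a hypothetical stand-in (a toy formula, not a real training run); in practice it would train the network with the given settings and return its performance on held-out validation data.

```python
import itertools

def validation_score(dropout_rate, learning_rate):
    # Hypothetical stand-in: in practice, train the network with these
    # settings and return its accuracy on held-out validation data.
    return 1.0 - (dropout_rate - 0.5) ** 2 - (learning_rate - 0.01) ** 2

grid = {
    "dropout_rate": [0.2, 0.5, 0.8],
    "learning_rate": [0.001, 0.01, 0.1],
}

best_score, best_params = float("-inf"), None
for values in itertools.product(*grid.values()):   # every combination in the grid
    params = dict(zip(grid.keys(), values))
    score = validation_score(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params)
```

The cost grows multiplicatively with each extra hyper-parameter, which is one reason the search is highly problem-dependent.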

Deep Learning : 84
Key question – why didn’t we wake
up and realize this deep learning
stuff sooner ?
(I mean, look at all that pre abyssi work – ??? )

Deep Learning : 85
Earlier shallow neural nets had some limitations
When do we terminate backpropagation ?
How do we select the parameters ?
–learning rate (weight updates)
–network topology (number of hidden nodes / number of layers)
–choice of activation function
How can we be sure what the network is learning ?
–How can we be sure the correct (classification) function is being
learned ?
Are the network weights optimal ?
–maybe in a local minimum in the weight space
c.f. AI folk-lore “the tanks story”

Deep Learning : 86
Key Enablers
(i.e. what changed to make all this happen)
As of now (~2012 → 2016+) we have three key things
that made deep learning possible:
–data (lots of it available)
–low-cost, high-performance GPU hardware
(to train larger networks than before)
–key algorithmic insights to backpropagation training
(to train networks more efficiently + guard against local
minima and overfitting)

[Figure: three validation classification examples]
~14 million labeled images, 20k classes

Deep Learning : 87
But what about those over-fitting and
local minima problems?
(for deep networks)

Deep Learning : 88
Neural Networks – a deep reprise
Recent work shows local minima are less of a problem than thought
(Pascanu et al., 2014, Dauphin et al., 2014, Choromanska et al., 2015)
–local minima dominate low dimensions but saddle points (ridges)
dominate high dimensions
–most local minima are close to global minima in high dimensions
•… and deep neural networks use a very high dimensional weight space
Advances in training: use of drop-out to regularize the weights in the globally
connected layers (which contain most of the parameters)
–dropout: half of the hidden units in a layer are randomly removed for each
training example.
–effect: stops hidden units from relying too much on other hidden units
(hence reduces likelihood of over-fitting)

Deep Learning : 89
But what if I have limited data
examples for my problem … ?
(“deep learning” needs “big data”)

Deep Learning : 90
Transfer Learning
First train the network on a related task where sufficient data is available
–e.g. train on ImageNet for image classification
Use these weights from this training cycle as the initialization for a
second training cycle with the limited data from your task
–adjusting output layer for the number of classes
–e.g. limited images of rare tropical fish / x-ray images of guns
→ Essentially we transfer the knowledge from one task to the other
[Figure: Akcay / Breckon, 2017]
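The weight hand-over between the two training cycles can be sketched without any particular deep learning library. Everything below is hypothetical and for illustration only: the dictionary layout, the layer names and shapes, and the helper init_for_new_task. Random arrays stand in for weights learned in the first training cycle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "pretrained" weights from the first training cycle
# (e.g. a 1000-class source task); shapes are illustrative only.
pretrained = {
    "features": rng.normal(size=(512, 256)),   # feature layers
    "output": rng.normal(size=(256, 1000)),    # source-task output layer
}

def init_for_new_task(pretrained, num_classes, rng):
    """Copy the feature weights; re-create the output layer for the new classes."""
    weights = {"features": pretrained["features"].copy()}
    fan_in = pretrained["output"].shape[0]
    weights["output"] = rng.normal(scale=0.01, size=(fan_in, num_classes))
    return weights

# Initialization for the second training cycle on a 6-class target task
weights = init_for_new_task(pretrained, num_classes=6, rng=rng)
print(weights["features"].shape, weights["output"].shape)
```

The second training cycle would then start from these weights rather than from a random initialization.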

Deep Learning : 91
Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)
Object Detection (Fast R-CNN)
–Girshick, “Fast R-CNN”, ICCV 2015 (figure copyright Ross Girshick, 2015)
Image Captioning: CNN + RNN
–Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015 (figure copyright IEEE, 2015)
Pretrained components: CNN pretrained on ImageNet; word vectors pretrained with word2vec
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Example: X-ray image object detection
Transfer Learning Using Convolutional Neural Networks For Object Classification Within X-Ray Baggage Security Imagery (S. Akcay,
M.E. Kundegorski, M. Devereux, T.P. Breckon), In Proc. International Conference on Image Processing, IEEE, 2016.
●Deep CNN (via transfer learning): Features → Classification (end to end)
–95% (True+) over 6 object categories, FP (see above)
            Camera   Laptop   Gun     Gun Component   Knives   Ceramic Knives   mAP
AlexNet     97.23    99.70    97.30   89.64           93.19    94.50            95.26
GoogLeNet   97.14    92.56    99.50   97.70           95.50    98.40            98.40
mAP = mean Average Precision
(over all classes)

Deep Learning : 93
But these deep networks seem to
take the whole image for
classification ?
What about detection ?

Deep Learning : 94
Region-based CNN (R-CNN)
Learn both a Region Proposal Network (RPN)
–likely object locations, given the image
… and then classify those regions via existing CNN
architecture (jointly trained)
References:
RCNN [Girshick et al. CVPR 2014]
Fast RCNN [Girshick, ICCV 2015]
Faster RCNN [Ren et al., 2015]
https://www.youtube.com/watch?v=WZmSMkK9VuA
[Figure: Akcay / Breckon, 2017]

Deep Learning : 95
Pedestrian detection with CNN
[Sermanet et al., 2013] https://www.youtube.com/watch?v=uKU2pzpGUlM

Deep Learning : 96
Key question – how can we be sure
of what the network is learning ?
(remember – this was a problem in the good old days*)
*henceforth known as the pre abyssi era of computer vision

Deep Learning : 97
… we can map network activation back to
the input pixel space
•What input pattern originally caused a
given activation in the feature maps?
–hence trace back through the network
Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
de-Convolution Network layer (left) is
attached to a Convolution Neural Network layer (right)
[Figure: de-Convolution Network (left), Convolution Neural Network (right)]
A de-Convolution Network layer can
be used to reconstruct an
approximate version of the features
from the layer beneath.
- hence we can recover an
approximate feature visualization

Deep Learning : 98
De-convolving CNNs
This allows us to “Transpose” the architecture to go from
activations back to image
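[Figure: forward path c1 → c2 → c3 → c4 → c5 → f6 → f7 → f8 → output; transposed path from activations back to the image: f8^T → f7^T → f6^T → c5^T → c4^T → c3^T → c2^T → c1^T]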
Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
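The “transpose” idea can be illustrated on a single 1-D convolution layer written as a matrix. This is only a sketch of the linear part, with an arbitrary example kernel; the full de-Convolution Network of Zeiler and Fergus also involves unpooling and rectification steps, which are not shown. The helper conv_matrix is a hypothetical name.

```python
import numpy as np

def conv_matrix(kernel, n):
    """Matrix C such that C @ x is the 1-D 'valid' correlation of x with kernel."""
    k = len(kernel)
    C = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        C[i, i:i + k] = kernel
    return C

kernel = np.array([1.0, 2.0, -1.0])     # arbitrary example filter
x = np.arange(6, dtype=float)           # toy 1-D "image"
C = conv_matrix(kernel, len(x))

activation = C @ x                      # forward: input -> feature map
one_hot = np.zeros_like(activation)
one_hot[2] = 1.0                        # keep a single activation
back = C.T @ one_hot                    # transpose: feature map -> input space
print(back)
```

Projecting one activation back through the transposed layer places the filter pattern at the input positions that produced it, which is the basis of the feature visualization.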

Deep Learning : 99
Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Example CNN : Layer 2

Deep Learning : 100
Reference: Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
Example CNN : Layer 3

Deep Learning : 101
So … we can see some internal
representations that may convince us that
deep networks are doing a good job
(after all they appear to be learning “the right stuff”**)
[** but we could be suffering from confirmation bias - “a tendency to search for or interpret
information in a way that confirms one's preconceptions” ]
(CNNs: we already think they are working, so therefore we look for evidence to confirm this)

Deep Learning : 102
Are they always right ? - fooling CNNs
[Figure: two examples, each showing]
Original Image (correctly predicted by CNN)
+ Small Changes (via JPEG DCT changes)
→ New Image (incorrectly predicted by CNN)
Reference: Intriguing properties of neural networks [Szegedy et al., ICLR 2014]
Press article: http://www.i-programmer.info/news/105-artificial-intelligence/7352-the-flaw-lurking-in-every-deep-neural-net.html
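A toy illustration of how many tiny changes can add up to a flipped decision. This is not the procedure used in the referenced paper: it assumes a linear classifier with random weights and a simple sign-based perturbation, chosen only because the effect can be checked by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)                # weights of a toy linear classifier
x = rng.normal(size=1000)
x = x - ((w @ x - 1.0) / (w @ w)) * w    # shift x so its score is +1 (class A)

eps = 0.01                               # per-element change, small vs. unit-scale inputs
x_adv = x - eps * np.sign(w)             # nudge every element against the weights

score, score_adv = float(w @ x), float(w @ x_adv)
print(score, score_adv)                  # the sign of the score changes
```

No single element moves by more than eps, yet the score changes by eps times the sum of |w|, which grows with the input dimension.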

Deep Learning : 103
What is happening here ?
Decision boundary and feature space representation that we
(like to) think the CNN has learned is in fact not optimal
–and in fact far from perfect
Key questions remain : what is being learnt ?
–and how confident can we be it is being learnt ?
[Figure: Example Data (2 labels) Vs. Multiple decision boundaries exist – how do we know which we have ?]
Source: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html

Deep Learning : 104
… in today’s terminology we call these
adversarial examples
… which brings us on to a whole new sub-topic
of deep learning

Deep Learning : 105
Generative Adversarial Networks (GAN)
[so much to say, so little time]
train two models:
–one to generate some sort of fake examples from random noise
(or some conditioned distribution)
–one to discern fake model examples from real examples
Many applications in improving deep network performance
[Goodfellow et al., 2014
https://arxiv.org/abs/1406.2661]
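The two competing objectives can be sketched as loss functions on the discriminator's outputs. The numbers below are illustrative values, not results; the generator loss shown is the non-saturating form (maximize log D(G(z))), and no networks are actually trained in this sketch.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Discriminator wants D(real) -> 1 and D(fake) -> 0."""
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    """Generator wants the discriminator to output 1 on its fake examples
    (non-saturating form of the generator objective)."""
    return float(-np.mean(np.log(d_fake)))

# Illustrative discriminator outputs (probability that an example is real)
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.1, 0.3])
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

Training alternates between lowering the first loss (updating the discriminator) and lowering the second (updating the generator).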

Deep Learning : 106
Many researchers increasingly see CNNs
(and wider deep learning techniques) as a
“black box” approach
We are only beginning to understand how they
work and why*
* so are we perhaps still in the “dark age” of deep learning ?

Deep Learning : 107
Hence network visualization is an
important research topic.
(we try to ascertain the decision boundary and feature representation in use)
CNN visualization attempts to
understand what is being learnt.

Deep Learning : 108
Core message: CNNs, and deep learning,
have led the resurgence of Neural Networks
and generally now outperform other
methods in complex image classification
and many other tasks
However – clearly some limitations remain

Deep Learning : 109
Are they the answer to all our
(computer vision) problems ?

Deep Learning : 110
Beyond Classification ...
(applications in computer vision beyond classification include:
Detection, Segmentation, Regression, Pose estimation, Image
Synthesis ...)

Deep Learning : 111
Example: semantic pixel labelling via
SegNet ….
Application to Semantic Image Segmentation via CNN
–use of a complex network to perform per-pixel classification by
object type (i.e. semantic pixel labelling)
–Encoder ↔ Decoder architecture (but same concepts of layer types)
Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image
Segmentation." arXiv preprint arXiv:1511.00561, 2015.

http://arxiv.org/abs/1511.00561
http://www.youtube.com/embed/e9bHTlYFwhg?rel=0
http://mi.eng.cam.ac.uk/projects/segnet/

Deep Learning : 112
Labeling Pixels: Edge Detection
DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection
[Bertasius et al. CVPR 2015]

Deep Learning : 113
CNN as a Similarity Measure for Matching
Face detection/FaceNet
[Schroff et al. 2015]
Stereo vision matching [Zbontar and LeCun CVPR 2015]
Compare image patches [Zagoruyko and Komodakis 2015]
Match ground and aerial images
[Lin et al. CVPR 2015]
Optic Flow - FlowNet [Fischer et al. 2015]

Deep Learning : 114
CNN for Image Restoration/Enhancement
Image super-resolution
[Dong et al. ECCV 2014]
Non-blind deconvolution (de-blurring)
[Xu et al. NIPS 2014]
Non-uniform blur estimation - [Sun et al. CVPR 2015]

Deep Learning : 115
CNN for Image Generation (synthesis)
Learning to Generate Chairs with Convolutional Neural Networks [Dosovitskiy et al. CVPR 2015]
Video: http://lmb.informatik.uni-freiburg.de/Publications/2015/DB15/Generate_Chairs_mov_morphing.avi

Deep Learning : 116
Using CNN activation(s) as features ….
[Donahue et al. ICML 2013]
CNN Features off-the-shelf:
an Astounding Baseline for Recognition
[Razavian et al. 2014]

Deep Learning : 117
Recent trends ...

Deep Learning : 118
Images ↔ text ….
[Vinyals et al., 2015] [Karpathy and Fei-Fei, 2015]
Figure labels: No errors / Minor errors / Somewhat related
Example generated captions:
–A white teddy bear sitting in the grass
–A man riding a wave on top of a surfboard
–A man in a baseball uniform throwing a ball
–A cat sitting on a suitcase on the floor
–A woman is holding a cat in her hand
–A woman standing on a beach holding a surfboard
Captions generated by Justin Johnson using Neuraltalk2
All images are CC0 Public domain:
https://pixabay.com/en/luggage-antique-cat-1643010/
https://pixabay.com/en/teddy-plush-bears-cute-teddy-bear-1623436/
https://pixabay.com/en/surf-wave-summer-sport-litoral-1668716/
https://pixabay.com/en/woman-female-model-portrait-adult-983967/
https://pixabay.com/en/handstand-lake-meditation-496008/
https://pixabay.com/en/baseball-player-shortstop-infield-1045263/
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 119
Style Transfer ….
[ This slide – adapted from: Fei-Fei Li & Justin Johnson & Serena Yeung]

Deep Learning : 120
Efficient Networks – MobileNets (et al.)
MobileNets: Efficient Convolutional Neural Networks for Mobile
Vision Applications
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam
(Google)
https://arxiv.org/abs/1704.04861
 MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov,
Liang-Chieh Chen
CVPR 2018
https://openaccess.thecvf.com/content_cvpr_2018/CameraReady/3427.pdf
http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/

Deep Learning : 121
Neural Architecture Search (NAS / “AutoML”)
… automates the process of designing neural network architectures
3 key components:
–Search space: of child network architectures
–Search strategy: to generate child architectures
–Performance evaluation: to measure effectiveness of generated child architectures
Neural Architecture Search: A Survey (Elsken et al. 2018)
https://arxiv.org/abs/1808.05377
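The three components can be sketched with the simplest possible choices. Everything here is an assumption for illustration: the search space entries, plain random sampling as the search strategy, and a toy formula standing in for performance evaluation (which in practice would mean training each child network and measuring validation accuracy).

```python
import random

# Search space: the choices available for each child architecture (illustrative)
search_space = {
    "num_layers": [2, 4, 8],
    "width": [32, 64, 128],
    "activation": ["relu", "leaky_relu", "elu"],
}

def sample_architecture(rng):
    """Search strategy: here, plain random sampling from the search space."""
    return {name: rng.choice(options) for name, options in search_space.items()}

def evaluate(arch):
    # Hypothetical stand-in for performance evaluation: in practice, train
    # the child network and measure its validation accuracy.
    return arch["num_layers"] * arch["width"] / 1024.0

rng = random.Random(0)
candidates = [sample_architecture(rng) for _ in range(20)]
best = max(candidates, key=evaluate)
print(best)
```

More elaborate search strategies replace the random sampler while keeping the same three-part structure.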

Deep Learning : 122
So - Why does deep learning work
so well ?
(compared to other pre-abyssi techniques)

Deep Learning : 123
My answer:
larger parameter space
optimized with more data
trained to avoid over-fitting
(a ~10 billion+ parameter space can possibly represent any other ML technique as a subset)

Deep Learning : 124
Further Reading – post abyssi textbooks
Deep Learning - http://www.deeplearningbook.org/
Goodfellow / Bengio / Courville
MIT Press
2016
Available as HTML online
(free)

Deep Learning : 125
Further Reading – pre abyssi textbooks
Bayesian Reasoning and Machine
Learning
– David Barber
http://www.cs.ucl.ac.uk/staff/d.barber/brml/
(Cambridge University Press, 2012)
Computer Vision: Models, Learning,
and Inference
– Simon Prince
(Cambridge University Press, 2012)
http://www.computervisionmodels.com/
… both very probability driven, both available as free PDF online
(woo, hoo!)

Deep Learning : 126
Further Reading – key papers
Y. LeCun, Y. Bengio, and G. Hinton. "Deep learning." Nature 521.7553 (2015): 436.
http://www.csri.utoronto.ca/~hinton/absps/NatureDeepReview.pdf
J. Schmidhuber. "Deep learning in neural networks: An overview." Neural Networks 61 (2015): 85-117.
http://arxiv.org/pdf/1404.7828
A. Krizhevsky, I. Sutskever, and G. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012.
http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
M. D. Zeiler and R. Fergus. "Visualizing and understanding convolutional networks." Computer Vision – ECCV 2014. Springer International Publishing, 2014: 818-833.
http://ftp.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. "Generative adversarial nets." Advances in Neural Information Processing Systems (2014): 2672-2680.
http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems (2015): 91-99.
http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016): 770-778.
http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

Deep Learning : 127
That's all folks ...
Slides, examples, demo code, links + extra supporting slides @ www.durham.ac.uk/toby.breckon/mltutorial/
