DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

ssuserc416e2 2,434 views 34 slides May 25, 2019

About This Presentation

A presentation introducing DeepLab V3+, the state-of-the-art architecture for semantic segmentation. It also includes detailed descriptions of how 2D multi-channel convolutions function, as well as a detailed explanation of depth-wise separable convolutions.


Slide Content

DeepLabV3+ Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Background DeepLabV3+ is the latest version of the DeepLab models. DeepLab V1 : Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR 2015. DeepLab V2 : DeepLab : Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI 2017. DeepLab V3 : Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017. DeepLab V3+ : Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv 2018.

Semantic Segmentation Classifying all pixels in an image into classes: classification at the pixel level. Does not have to separate different instances of the same class. Has important applications in medical imaging.

Current Results on Pascal VOC 2012

Motivation and Key Concepts Use Atrous Convolution and Separable Convolutions to reduce computation. Combine Atrous Spatial Pyramid Pooling Modules and Encoder-Decoder Structures. ASPPs capture contextual information at multiple scales by pooling features at different resolutions. Encoder-Decoders can obtain sharp object boundaries.

Architecture Overview

Advanced Convolutions

Convolution (Cross-Correlation) for 1 Channel Convolution with zero-padding, displayed with the convolution kernel. Blue maps: inputs; cyan maps: outputs; kernel: not displayed.

Other Convolutions (Cross-Correlations) Strided convolution with padding; atrous (dilated) convolution with r=2. Blue maps: inputs; cyan maps: outputs; kernel: not displayed.

Atrous Convolution "À trous" is French for "with holes". Atrous convolution is also known as dilated convolution. Atrous convolution with r=1 is the same as ordinary convolution. The image on the left shows 1D atrous convolution.
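The 1D case from the slide can be sketched in plain Python (names and the "valid"-padding choice are illustrative, not from the slides). With rate r, the kernel taps are spaced r samples apart, inserting r-1 "holes" between them; r=1 reduces to ordinary convolution (cross-correlation):

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D atrous (dilated) cross-correlation, 'valid' padding, stride 1.

    rate=1 is ordinary convolution; rate>1 spaces the kernel taps
    `rate` samples apart, inserting rate-1 "holes" between them.
    """
    k = len(kernel)
    span = (k - 1) * rate + 1  # effective extent of the dilated kernel
    return [sum(kernel[j] * signal[start + j * rate] for j in range(k))
            for start in range(len(signal) - span + 1)]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
print(atrous_conv1d(x, w, rate=1))  # -> [-2, -2, -2, -2, -2]
print(atrous_conv1d(x, w, rate=2))  # -> [-4, -4, -4]
```

Note how the rate-2 output is shorter: the same 3-tap kernel now spans 5 input samples.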

Receptive Field of Atrous Convolutions Left: r=1, Middle: r=2, Right: r=4 Atrous Convolution has a larger receptive field than normal convolution with the same number of parameters.

Depth-wise Separable Convolution A special case of Grouped Convolution. Separate the convolution operation along the depth (channel) dimension. It can refer to both (depth -> point) and (point -> depth). It only has meaning in multi-channel convolutions (cross-correlations).

Review: Multi-Channel 2D Convolution

Exact Shapes and Terminology Filter: a collection of kernels concatenated channel-wise. With an input tensor of shape (C_in, H, W), each 2D kernel has shape (K_h, K_w), each 3D filter has shape (C_in, K_h, K_w), and all C_out filters are concatenated into a single 4D array of shape (C_out, C_in, K_h, K_w) in 2D CNNs.

Step 1: Convolution on Input Tensor Channels

Step 2: Summation along Input Channel Dimension

Step 3: Add Bias Term Key points: Each kernel of a filter iterates over only one channel of the input tensor. The number of filters is C_out. Each filter generates one output channel. Each 2D kernel is different from all other kernels in the 3D filter.
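The three steps above can be sketched in plain Python (shapes and names are illustrative; a real implementation would be vectorized):

```python
def conv2d(x, filters, biases):
    """Multi-channel 2-D convolution ('valid' padding, stride 1), a sketch.

    x       : nested lists of shape (C_in, H, W)
    filters : nested lists of shape (C_out, C_in, K, K)
    biases  : one bias per filter (length C_out)
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0][0])
    oh, ow = h - k + 1, w - k + 1
    out = []
    for filt, bias in zip(filters, biases):    # each filter -> one output channel
        ch = [[bias] * ow for _ in range(oh)]  # Step 3: start from the bias term
        for i in range(oh):
            for j in range(ow):
                for c in range(c_in):          # Step 2: sum over input channels
                    for di in range(k):        # Step 1: 2-D cross-correlation of
                        for dj in range(k):    # kernel c with input channel c
                            ch[i][j] += filt[c][di][dj] * x[c][i + di][j + dj]
        out.append(ch)
    return out

# Toy check: a 2-channel 3x3 input of ones, one all-ones 2x2x2 filter, bias 1.
x = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]]] * 2
filters = [[[[1, 1], [1, 1]], [[1, 1], [1, 1]]]]
print(conv2d(x, filters, [1]))  # -> [[[9, 9], [9, 9]]]
```

Each output element sums 2 channels x 4 kernel taps of ones plus the bias, hence 9.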

Normal Convolution Top: Input Tensor Middle: Filter Bottom: Output Tensor

Depth-wise Separable Convolution Replace Step 2: instead of summation, use point-wise convolution (1x1 convolution). There is now only one depth-wise filter. The number of 1x1 filters is C_out. Bias is usually included only at the end of both convolution operations. Usually refers to depth-wise convolution -> point-wise convolution; Xception uses point-wise convolution -> depth-wise convolution.
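A minimal sketch of the usual (depth -> point) ordering described above, in plain Python (names and shapes are illustrative; biases omitted):

```python
def depthwise_separable_conv2d(x, depth_kernels, point_filters):
    """Depth-wise separable convolution (depth-wise -> point-wise), a sketch.

    x             : nested lists of shape (C_in, H, W)
    depth_kernels : shape (C_in, K, K) - ONE 2-D kernel per input channel
    point_filters : shape (C_out, C_in) - 1x1 convolution weights

    The channel summation of ordinary convolution is replaced by a
    learned 1x1 (point-wise) convolution across channels.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(depth_kernels[0])
    oh, ow = h - k + 1, w - k + 1
    # Depth-wise step: each channel is convolved with its own 2-D kernel.
    depth = [[[sum(depth_kernels[c][di][dj] * x[c][i + di][j + dj]
                   for di in range(k) for dj in range(k))
               for j in range(ow)] for i in range(oh)] for c in range(c_in)]
    # Point-wise step: a 1x1 convolution mixes channels at each location.
    return [[[sum(pf[c] * depth[c][i][j] for c in range(c_in))
              for j in range(ow)] for i in range(oh)] for pf in point_filters]

# Toy check: 2-channel 3x3 input of ones, all-ones kernels and 1x1 weights.
x = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]]] * 2
print(depthwise_separable_conv2d(x, [[[1, 1], [1, 1]]] * 2, [[1, 1]]))
# -> [[[8, 8], [8, 8]]]
```

The depth-wise stage produces 4 per channel; the point-wise stage sums the two channels to 8.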

Depth-wise Separable Convolution

Characteristics Depth-wise Separable Convolution can be used as a drop-in replacement for ordinary convolution in DCNNs. The number of parameters is reduced significantly (sparse representation). The number of flops is reduced by several orders of magnitude (computationally efficient). There is no significant drop in performance (performance may even improve). Wall-clock time reduction is less dramatic due to GPU memory access patterns.

Example: Flop Comparison (Padding O, Bias X) Ordinary Convolution: for a 256x256x3 image with 128 filters with a kernel size of 3x3, the number of multiply-accumulates would be 256 x 256 x 128 x (3 x 3 x 3) ≈ 2.26 x 10^8. Depth-wise Separable Convolution (Left: Depth Conv, Right: Point Conv): for the same input, the count would be 256 x 256 x 3 x (3 x 3) + 256 x 256 x 128 x 3 ≈ 2.69 x 10^7. There is roughly an 8-fold reduction in the number of flops.
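The slide's arithmetic can be checked directly (counting multiply-accumulates, with "same" padding and no bias, as the slide assumes):

```python
# Multiply-accumulate counts for the slide's example:
# 256x256 input, 3 channels, 128 filters, 3x3 kernels, same padding, no bias.
H = W = 256
C_in, C_out, K = 3, 128, 3

ordinary = H * W * C_out * (K * K * C_in)  # every output taps all input channels
depthwise = H * W * C_in * (K * K)         # one 2-D kernel per input channel
pointwise = H * W * C_out * C_in           # 1x1 convolution across channels
separable = depthwise + pointwise

print(ordinary)                  # 226492416  (~2.26e8)
print(separable)                 # 26935296   (~2.69e7)
print(ordinary / separable)      # ~8.4x fewer multiply-accumulates
```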

Example: Parameter Comparison (Excluding Bias Term) Ordinary Convolution: for a 256x256x3 image with 128 filters and a 3x3 kernel size, the number of weights would be 128 x 3 x 3 x 3 = 3,456. Depth-wise Separable Convolution: for the same configuration, the number of weights would be 3 x 3 x 3 + 128 x 3 = 411. There is also a roughly 8-fold reduction in the number of parameters.
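The parameter counts follow the same pattern; note they are independent of the 256x256 spatial size:

```python
# Weight counts for the slide's example (biases excluded).
C_in, C_out, K = 3, 128, 3

ordinary_params = C_out * C_in * K * K           # 128 filters of shape 3x3x3
separable_params = C_in * K * K + C_out * C_in   # depth-wise + 1x1 point-wise

print(ordinary_params, separable_params)         # 3456 vs 411, ~8.4x fewer
```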

Atrous Depth-wise Separable Convolution

Architecture Overview

Encoder-Decoder Structures The Encoder reduces the spatial sizes of feature maps while extracting higher-level semantic information. The Decoder gradually recovers the spatial information. U-Nets are a classical example of encoder-decoder structures. In DeepLabV3+, DeepLabV3 is used as the encoder.

Architecture Overview

Decoder Layer Structure Apply 4-fold bilinear up-sampling on the ASPP outputs. Apply 1x1 Convolution with a reduced filter number on an intermediate feature layer. Concatenate ASPP outputs with intermediate features. Apply two 3x3 Convolutions. Apply 4-fold bilinear up-sampling. Purpose & Implementation The ASPP is poor at capturing fine details. The decoder is used to improve the resolution of the output. A 1x1 convolution is applied to the intermediate layer to reduce its channel number.
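The decoder steps can be traced as shape arithmetic. The channel counts below (256 ASPP channels, 48 reduced low-level channels, output stride 16, a 512x512 input) follow the paper's common configuration but are assumptions of this sketch, not stated on the slide:

```python
# Tracing (spatial_size, channels) through the DeepLabV3+ decoder.
# Assumed setup: 512x512 input, output stride 16, 256 ASPP channels,
# low-level features reduced to 48 channels by a 1x1 convolution.
size, aspp_ch, lowlevel_ch = 512, 256, 48

aspp = (size // 16, aspp_ch)        # encoder/ASPP output: 32x32x256
x = (aspp[0] * 4, aspp[1])          # step 1: 4x bilinear upsample -> 128x128x256
low = (size // 4, lowlevel_ch)      # step 2: 1x1 conv on stride-4 features
cat = (x[0], x[1] + low[1])         # step 3: channel concat -> 128x128x304
refined = (cat[0], 256)             # step 4: two 3x3 convs -> 128x128x256
out = (refined[0] * 4, refined[1])  # step 5: 4x bilinear upsample -> 512x512x256
print(aspp, cat, out)
```

A final 1x1 classifier (not traced here) would map the 256 channels to the number of classes.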

ASPP: Atrous Spatial Pyramid Pooling

The ASPP Layer Encodes multi-scale contextual information by applying atrous convolution at multiple rates. Concatenates all extracted features and an up-sampled global average pooling layer channel-wise. Uses atrous depth-wise separable convolutions for multiple channels. Poor at capturing sharp object boundaries.
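The channel bookkeeping of the concatenation can be made explicit. The branch structure below (a 1x1 conv, three 3x3 atrous convs at rates 6/12/18 for output stride 16, image-level pooling, 256 channels per branch) is the DeepLabV3 default and is an assumption of this sketch:

```python
# ASPP channel bookkeeping (assumed DeepLabV3 defaults: one 1x1 conv,
# three 3x3 atrous convs with rates 6/12/18, image-level pooling;
# each branch emits 256 channels).
branch_ch = 256
branches = 1 + 3 + 1             # 1x1 conv + three atrous rates + pooling
concat_ch = branches * branch_ch # channel-wise concatenation -> 1280
fused_ch = 256                   # a final 1x1 conv projects back to 256
print(branches, concat_ch, fused_ch)
```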

Modified Aligned Xception Network Xception: Extreme Inception Network, the backbone network for DeepLabV3+. Uses residual blocks and separable convolutions.

Explanation of Xception Takes the "Inception Hypothesis", which states that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly, to the extreme. The extensive use of separable convolutions and atrous convolutions allows the model to fit in GPU memory despite the huge number of layers. Originally applied point-wise convolution before depth-wise convolution. Invented by François Chollet.

Architecture Review

The End