DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

ssuserc416e2 2,434 views 34 slides May 25, 2019

About This Presentation

A presentation introducing DeepLab V3+, the state-of-the-art architecture for semantic segmentation. It also includes detailed descriptions of how 2D multi-channel convolutions function, as well as a detailed explanation of depth-wise separable convolutions.


Slide Content

DeepLabV3+ Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Background DeepLabV3+ is the latest version of the DeepLab models. DeepLab V1 : Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR 2015. DeepLab V2 : DeepLab : Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI 2017. DeepLab V3 : Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017. DeepLab V3+ : Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv 2018.

Semantic Segmentation Classifying all pixels in an image into classes: classification at the pixel level. Does not have to separate different instances of the same class. Has important applications in medical imaging.

Current Results on Pascal VOC 2012

Motivation and Key Concepts Use Atrous Convolution and Separable Convolutions to reduce computation. Combine Atrous Spatial Pyramid Pooling Modules and Encoder-Decoder Structures. ASPPs capture contextual information at multiple scales by pooling features at different resolutions. Encoder-Decoders can obtain sharp object boundaries.

Architecture Overview

Advanced Convolutions

Convolution (Cross-Correlation) for 1 Channel Convolution with zero-padding, displayed with the convolution kernel. Blue maps: inputs; cyan maps: outputs; kernel: not displayed.

Other Convolutions (Cross-Correlations) Strided convolution with padding; atrous (dilated) convolution with r=2. Blue maps: inputs; cyan maps: outputs; kernel: not displayed.

Atrous Convolution "À trous" is French for "with holes". Atrous convolution is also known as dilated convolution. Atrous convolution with r=1 is the same as ordinary convolution. The image on the left shows 1D atrous convolution.
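The 1D case from the slide can be sketched in plain Python (names and the "valid"-padding choice are illustrative, not from the slides). With rate r, the kernel taps are spaced r samples apart, inserting r-1 "holes" between them; r=1 reduces to ordinary convolution (cross-correlation):

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D atrous (dilated) cross-correlation, 'valid' padding, stride 1.

    rate=1 is ordinary convolution; rate>1 spaces the kernel taps
    `rate` samples apart, inserting rate-1 "holes" between them.
    """
    k = len(kernel)
    span = (k - 1) * rate + 1  # effective extent of the dilated kernel
    return [sum(kernel[j] * signal[start + j * rate] for j in range(k))
            for start in range(len(signal) - span + 1)]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
print(atrous_conv1d(x, w, rate=1))  # -> [-2, -2, -2, -2, -2]
print(atrous_conv1d(x, w, rate=2))  # -> [-4, -4, -4]
```

Note how the rate-2 output is shorter: the same 3-tap kernel now spans 5 input samples.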

Receptive Field of Atrous Convolutions Left: r=1, Middle: r=2, Right: r=4 Atrous Convolution has a larger receptive field than normal convolution with the same number of parameters.

Depth-wise Separable Convolution A special case of Grouped Convolution. Separate the convolution operation along the depth (channel) dimension. It can refer to both (depth -> point) and (point -> depth). It only has meaning in multi-channel convolutions (cross-correlations).

Review: Multi-Channel 2D Convolution

Exact Shapes and Terminology Filter: a collection of kernels concatenated channel-wise. With an input tensor of shape (C_in, H, W), each 2D kernel has shape (K_h, K_w), each 3D filter has shape (C_in, K_h, K_w), and all C_out filters are concatenated into a single 4D array of shape (C_out, C_in, K_h, K_w) in 2D CNNs.

Step 1: Convolution on Input Tensor Channels

Step 2: Summation along Input Channel Dimension

Step 3: Add Bias Term Key points: Each kernel of a filter iterates over only one channel of the input tensor. The number of filters is C_out. Each filter generates one output channel. Each 2D kernel is different from all other kernels in the 3D filter.
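The three steps above can be sketched in plain Python (shapes and names are illustrative; a real implementation would be vectorized):

```python
def conv2d(x, filters, biases):
    """Multi-channel 2-D convolution ('valid' padding, stride 1), a sketch.

    x       : nested lists of shape (C_in, H, W)
    filters : nested lists of shape (C_out, C_in, K, K)
    biases  : one bias per filter (length C_out)
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0][0])
    oh, ow = h - k + 1, w - k + 1
    out = []
    for filt, bias in zip(filters, biases):    # each filter -> one output channel
        ch = [[bias] * ow for _ in range(oh)]  # Step 3: start from the bias term
        for i in range(oh):
            for j in range(ow):
                for c in range(c_in):          # Step 2: sum over input channels
                    for di in range(k):        # Step 1: 2-D cross-correlation of
                        for dj in range(k):    # kernel c with input channel c
                            ch[i][j] += filt[c][di][dj] * x[c][i + di][j + dj]
        out.append(ch)
    return out

# Toy check: a 2-channel 3x3 input of ones, one all-ones 2x2x2 filter, bias 1.
x = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]]] * 2
filters = [[[[1, 1], [1, 1]], [[1, 1], [1, 1]]]]
print(conv2d(x, filters, [1]))  # -> [[[9, 9], [9, 9]]]
```

Each output element sums 2 channels x 4 kernel taps of ones plus the bias, hence 9.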

Normal Convolution Top: Input Tensor Middle: Filter Bottom: Output Tensor

Depth-wise Separable Convolution Replace Step 2: instead of summation, use point-wise convolution (1x1 convolution). There is now only one depth-wise filter. The number of 1x1 filters is C_out. Bias is usually included only at the end of both convolution operations. Usually refers to depth-wise convolution -> point-wise convolution; Xception uses point-wise convolution -> depth-wise convolution.
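A minimal sketch of the usual (depth -> point) ordering described above, in plain Python (names and shapes are illustrative; biases omitted):

```python
def depthwise_separable_conv2d(x, depth_kernels, point_filters):
    """Depth-wise separable convolution (depth-wise -> point-wise), a sketch.

    x             : nested lists of shape (C_in, H, W)
    depth_kernels : shape (C_in, K, K) - ONE 2-D kernel per input channel
    point_filters : shape (C_out, C_in) - 1x1 convolution weights

    The channel summation of ordinary convolution is replaced by a
    learned 1x1 (point-wise) convolution across channels.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(depth_kernels[0])
    oh, ow = h - k + 1, w - k + 1
    # Depth-wise step: each channel is convolved with its own 2-D kernel.
    depth = [[[sum(depth_kernels[c][di][dj] * x[c][i + di][j + dj]
                   for di in range(k) for dj in range(k))
               for j in range(ow)] for i in range(oh)] for c in range(c_in)]
    # Point-wise step: a 1x1 convolution mixes channels at each location.
    return [[[sum(pf[c] * depth[c][i][j] for c in range(c_in))
              for j in range(ow)] for i in range(oh)] for pf in point_filters]

# Toy check: 2-channel 3x3 input of ones, all-ones kernels and 1x1 weights.
x = [[[1, 1, 1], [1, 1, 1], [1, 1, 1]]] * 2
print(depthwise_separable_conv2d(x, [[[1, 1], [1, 1]]] * 2, [[1, 1]]))
# -> [[[8, 8], [8, 8]]]
```

The depth-wise stage produces 4 per channel; the point-wise stage sums the two channels to 8.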

Depth-wise Separable Convolution

Characteristics Depth-wise Separable Convolution can be used as a drop-in replacement for ordinary convolution in DCNNs. The number of parameters is reduced significantly (sparse representation). The number of flops is reduced by several orders of magnitude (computationally efficient). There is no significant drop in performance (performance may even improve). Wall-clock time reduction is less dramatic due to GPU memory access patterns.

Example: Flop Comparison (Padding O, Bias X) Ordinary Convolution: for a 256x256x3 image with 128 filters with a kernel size of 3x3, the number of multiply-accumulates would be 256 x 256 x 128 x (3 x 3 x 3) ≈ 2.26 x 10^8. Depth-wise Separable Convolution (Left: Depth Conv, Right: Point Conv): for the same input, the count would be 256 x 256 x 3 x (3 x 3) + 256 x 256 x 128 x 3 ≈ 2.69 x 10^7. There is roughly an 8-fold reduction in the number of flops.
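The slide's arithmetic can be checked directly (counting multiply-accumulates, with "same" padding and no bias, as the slide assumes):

```python
# Multiply-accumulate counts for the slide's example:
# 256x256 input, 3 channels, 128 filters, 3x3 kernels, same padding, no bias.
H = W = 256
C_in, C_out, K = 3, 128, 3

ordinary = H * W * C_out * (K * K * C_in)  # every output taps all input channels
depthwise = H * W * C_in * (K * K)         # one 2-D kernel per input channel
pointwise = H * W * C_out * C_in           # 1x1 convolution across channels
separable = depthwise + pointwise

print(ordinary)                  # 226492416  (~2.26e8)
print(separable)                 # 26935296   (~2.69e7)
print(ordinary / separable)      # ~8.4x fewer multiply-accumulates
```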

Example: Parameter Comparison (Excluding Bias Term) Ordinary Convolution: for a 256x256x3 image with 128 filters and a 3x3 kernel size, the number of weights would be 128 x 3 x 3 x 3 = 3,456. Depth-wise Separable Convolution: for the same configuration, the number of weights would be 3 x 3 x 3 + 128 x 3 = 411. There is also a roughly 8-fold reduction in the number of parameters.
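The parameter counts follow the same pattern; note they are independent of the 256x256 spatial size:

```python
# Weight counts for the slide's example (biases excluded).
C_in, C_out, K = 3, 128, 3

ordinary_params = C_out * C_in * K * K           # 128 filters of shape 3x3x3
separable_params = C_in * K * K + C_out * C_in   # depth-wise + 1x1 point-wise

print(ordinary_params, separable_params)         # 3456 vs 411, ~8.4x fewer
```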

Atrous Depth-wise Separable Convolution

Architecture Overview

Encoder-Decoder Structures The Encoder reduces the spatial sizes of feature maps while extracting higher-level semantic information. The Decoder gradually recovers the spatial information. U-Nets are a classical example of encoder-decoder structures. In DeepLabV3+, DeepLabV3 is used as the encoder.

Architecture Overview

Decoder Layer Structure Apply 4-fold bilinear up-sampling on the ASPP outputs. Apply 1x1 Convolution with a reduced filter number on an intermediate feature layer. Concatenate ASPP outputs with intermediate features. Apply two 3x3 Convolutions. Apply 4-fold bilinear up-sampling. Purpose & Implementation The ASPP is poor at capturing fine details. The decoder is used to improve the resolution of the output. A 1x1 convolution is applied to the intermediate layer to reduce its channel number.
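The decoder steps can be traced as shape arithmetic. The channel counts below (256 ASPP channels, 48 reduced low-level channels, output stride 16, a 512x512 input) follow the paper's common configuration but are assumptions of this sketch, not stated on the slide:

```python
# Tracing (spatial_size, channels) through the DeepLabV3+ decoder.
# Assumed setup: 512x512 input, output stride 16, 256 ASPP channels,
# low-level features reduced to 48 channels by a 1x1 convolution.
size, aspp_ch, lowlevel_ch = 512, 256, 48

aspp = (size // 16, aspp_ch)        # encoder/ASPP output: 32x32x256
x = (aspp[0] * 4, aspp[1])          # step 1: 4x bilinear upsample -> 128x128x256
low = (size // 4, lowlevel_ch)      # step 2: 1x1 conv on stride-4 features
cat = (x[0], x[1] + low[1])         # step 3: channel concat -> 128x128x304
refined = (cat[0], 256)             # step 4: two 3x3 convs -> 128x128x256
out = (refined[0] * 4, refined[1])  # step 5: 4x bilinear upsample -> 512x512x256
print(aspp, cat, out)
```

A final 1x1 classifier (not traced here) would map the 256 channels to the number of classes.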

ASPP: Atrous Spatial Pyramid Pooling

The ASPP Layer Encodes multi-scale contextual information by applying atrous convolution at multiple rates. Concatenates all extracted features and an up-sampled global average pooling layer channel-wise. Uses atrous depth-wise separable convolutions for multiple channels. Poor at capturing sharp object boundaries.
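The channel bookkeeping of the concatenation can be made explicit. The branch structure below (a 1x1 conv, three 3x3 atrous convs at rates 6/12/18 for output stride 16, image-level pooling, 256 channels per branch) is the DeepLabV3 default and is an assumption of this sketch:

```python
# ASPP channel bookkeeping (assumed DeepLabV3 defaults: one 1x1 conv,
# three 3x3 atrous convs with rates 6/12/18, image-level pooling;
# each branch emits 256 channels).
branch_ch = 256
branches = 1 + 3 + 1             # 1x1 conv + three atrous rates + pooling
concat_ch = branches * branch_ch # channel-wise concatenation -> 1280
fused_ch = 256                   # a final 1x1 conv projects back to 256
print(branches, concat_ch, fused_ch)
```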

Modified Aligned Xception Network Xception: Extreme Inception Network, the backbone network for DeepLabV3+. Uses residual blocks and separable convolutions.

Explanation of Xception Takes the "Inception Hypothesis", which states that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly, to the extreme. The extensive use of separable convolutions and atrous convolutions allows the model to fit in GPU memory despite the huge number of layers. Originally applied point-wise convolution before depth-wise convolution. Invented by François Chollet.

Architecture Review

The End