Mask-RCNN for Instance Segmentation

hitheone · 37 slides · Oct 01, 2018

About This Presentation

Presents Mask-RCNN, a state-of-the-art deep neural network that segments object instances in images.


Slide Content

Visual perception tasks

[Figure: visual perception tasks — image classification, object detection, semantic segmentation, instance segmentation]

Agenda
• Visual perception tasks
• Mask-RCNN
• Mask-RCNN architecture
• Feature Pyramid Network
• Region Proposal Network
• RoIAlign
• Mask-RCNN head network
• Results
• Summary

Introduction to Mask-RCNN
• Mask-RCNN stands for Mask Region-based Convolutional Neural Network
• State-of-the-art algorithm for Instance Segmentation
• Evolved through 4 main versions:
  • R-CNN → Fast R-CNN → Faster R-CNN → Mask R-CNN
  • The first 3 versions are for Object Detection
• Improvement over Faster R-CNN: use RoIAlign instead of RoIPool
• Employs a Fully Convolutional Network (FCN) for mask prediction; predicts a mask for each class independently
• 2 main stages:
  • 1st stage: use a Region Proposal Network (RPN) to propose candidate object bounding boxes
  • 2nd stage: classify the candidate boxes, refine the boxes, and predict masks

Terms
• Bounding box: rectangle identifying the location of an object
• Mask: set of pixels which belong to an object
• Anchor: a bounding box generated independently of image content
• RoI: Region of Interest, a bounding box which may contain an object
• Non-Maximum Suppression (NMS): a method to eliminate duplicated bounding boxes using scores and an IoU threshold
• IoU: Intersection over Union, a metric to evaluate how similar 2 areas are to each other
• RoIAlign: a method to extract features for RoIs from feature maps
• Feature Pyramid Network (FPN): a neural network to extract feature maps at different scales
• Region Proposal Network (RPN): a neural network to propose RoIs for an image
• Fully Convolutional Network (FCN): a convolution-based neural network to extract masks

Mask-RCNN architecture

[Architecture diagram: backbone with FPN extracts feature maps → RPN proposes RoIs → RoIAlign extracts a small feature map per RoI → head network with a classifier (background + number of classes), a box regressor, and a mask branch]
Multi-scale problem


Approaches for objects at multiple scales

[Diagram (cf. [6]): alternatives for handling multiple object scales — featurized image pyramid, single feature map, pyramidal feature hierarchy, and the Feature Pyramid Network]

Feature Pyramid Network (FPN)




[Diagram: FPN combines a bottom-up backbone pathway with a top-down pathway and lateral connections, producing multi-scale feature maps]

Bounding box regression

• To detect object boundaries, the network predicts offsets relative to an anchor box rather than absolute coordinates
• A box is represented by its corners (y1, x1, y2, x2); the ground-truth box is (y1*, x1*, y2*, x2*)
• Convert corners to center/size form:
  y = (y1 + y2)/2, x = (x1 + x2)/2, h = y2 − y1, w = x2 − x1
  (and y*, x*, h*, w* likewise for the ground-truth box)
• The regression targets (dy, dx, dh, dw) are the normalized center shift and log scale change:
  dy = (y* − y)/h, dx = (x* − x)/w, dh = log(h*/h), dw = log(w*/w)
• Applying predicted (dy, dx, dh, dw) to a box inverts these formulas; when all four values are zero, the predicted box coincides with the input box
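As a concrete sketch of this parameterization in NumPy (function names are illustrative, not taken from any particular Mask-RCNN codebase):

```python
import numpy as np

def box_deltas(anchor, gt):
    """Compute the (dy, dx, dh, dw) regression targets that map an
    anchor box onto a ground-truth box. Boxes are (y1, x1, y2, x2)."""
    ay1, ax1, ay2, ax2 = anchor
    gy1, gx1, gy2, gx2 = gt
    # Convert corners to center/size form.
    ah, aw = ay2 - ay1, ax2 - ax1
    ay, ax = (ay1 + ay2) / 2.0, (ax1 + ax2) / 2.0
    gh, gw = gy2 - gy1, gx2 - gx1
    gy, gx = (gy1 + gy2) / 2.0, (gx1 + gx2) / 2.0
    # Targets: normalized center shift, log scale change.
    return np.array([(gy - ay) / ah, (gx - ax) / aw,
                     np.log(gh / ah), np.log(gw / aw)])

def apply_deltas(anchor, deltas):
    """Invert box_deltas: apply predicted (dy, dx, dh, dw) to an anchor."""
    ay1, ax1, ay2, ax2 = anchor
    ah, aw = ay2 - ay1, ax2 - ax1
    ay, ax = (ay1 + ay2) / 2.0, (ax1 + ax2) / 2.0
    dy, dx, dh, dw = deltas
    cy, cx = ay + dy * ah, ax + dx * aw
    h, w = ah * np.exp(dh), aw * np.exp(dw)
    return np.array([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
```

Real implementations vectorize both functions over all anchors at once; the round trip `apply_deltas(a, box_deltas(a, g))` recovers `g` exactly.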

Region Proposal Network


[Diagram: RPN pipeline — the anchor generator produces anchors over the feature maps; the RPN head scores them (rpn_probs) and predicts box deltas; the proposal layer filters out negative anchors with rpn_probs and Non-Max Suppression]

RPN head network

[Diagram (cf. [2]): the RPN head slides a 3×3 conv over each feature map, followed by two sibling 1×1 convs — a classification branch giving 2 objectness scores per anchor (rpn_probs, via softmax) and a regression branch giving 4 box deltas per anchor (rpn_deltas)]

Anchor generator
[Diagram: at every feature-map cell, anchors are generated at several scales and aspect ratios, independently of image content; each cell maps back to a stride-spaced location on the input image]
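A minimal sketch of anchor generation, assuming anchors of roughly constant area per scale at a fixed stride (parameter names are illustrative):

```python
import numpy as np

def generate_anchors(scales, ratios, feature_shape, stride):
    """Generate anchor boxes (y1, x1, y2, x2) in image coordinates for
    every cell of a feature map. Anchors depend only on geometry, not
    on image content, as the Terms slide notes."""
    boxes = []
    for fy in range(feature_shape[0]):
        for fx in range(feature_shape[1]):
            # Anchor center in image coordinates.
            cy, cx = fy * stride, fx * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio = h / w, keeping the anchor area ~ scale**2.
                    h = scale * np.sqrt(ratio)
                    w = scale / np.sqrt(ratio)
                    boxes.append([cy - h / 2, cx - w / 2,
                                  cy + h / 2, cx + w / 2])
    return np.array(boxes)
```

For example, `generate_anchors([32, 64, 128], [0.5, 1, 2], (64, 64), 16)` yields 64·64·9 anchors for a 64×64 feature map.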

IoU –Intersection over Union

The IoU of two boxes is the area of their intersection divided by the area of their union:
  IoU(A, B) = area(A ∩ B) / area(A ∪ B)
• IoU = 1.0 when the two boxes coincide exactly
• IoU = 0.0 when the two boxes do not overlap
• The larger the overlap, the higher the IoU
[Diagram: example box pairs with low, medium, and high IoU values]
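The definition translates directly to code; this is a sketch for axis-aligned boxes in (y1, x1, y2, x2) form:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (y1, x1, y2, x2).
    Returns 1.0 for identical boxes, 0.0 for disjoint ones."""
    # Corners of the intersection rectangle.
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```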

Proposal layer

• Sort all anchors by rpn_probs (how likely an anchor contains an object)
• Choose the top N anchors, discard the rest (e.g., N ≈ 6000)
• Apply Non-Maximum Suppression (NMS) to eliminate duplicated boxes; keep up to M anchors (e.g., M ≈ 2000)

Non-Maximum Suppression (NMS)

• Input: a list of boxes with scores, and an IoU threshold
• Repeat until no candidate boxes remain:
  • Pick the remaining box with the highest score and add it to the output list
  • Remove every remaining box whose IoU with the picked box exceeds the threshold
• Output: the kept boxes; duplicated detections of the same object are suppressed
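The greedy procedure above can be sketched with NumPy (real implementations vectorize further, but the logic is the same):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    remaining boxes whose IoU with it exceeds iou_threshold.
    boxes: (N, 4) array of (y1, x1, y2, x2); returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with all remaining boxes.
        y1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        x1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        y2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        x2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0, y2 - y1) * np.maximum(0, x2 - x1)
        ious = inter / (areas[best] + areas[rest] - inter)
        # Keep only boxes that do not duplicate the best one.
        order = rest[ious <= iou_threshold]
    return keep
```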

[Illustration: NMS applied to overlapping detections — only the highest-scoring box per object survives]

Train the RPN
• Positive boxes: IoU ≥ 0.7 with any GT box
• Negative boxes: IoU < 0.3 with all GT boxes
• Ratio of positive boxes: 1/3
• Fixed number of anchors per image for training: 256
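These labeling and sampling rules can be sketched as follows (`label_anchors` and `sample_batch` are illustrative names; real implementations additionally force the best-matching anchor per GT box to be positive):

```python
import numpy as np

def label_anchors(iou_matrix, pos_thresh=0.7, neg_thresh=0.3):
    """Label anchors from an (anchors x GT boxes) IoU matrix:
    +1 positive (IoU >= 0.7 with any GT box), -1 negative
    (IoU < 0.3 with all GT boxes), 0 ignored during training."""
    labels = np.zeros(iou_matrix.shape[0], dtype=int)
    max_iou = iou_matrix.max(axis=1)  # best overlap per anchor
    labels[max_iou >= pos_thresh] = 1
    labels[max_iou < neg_thresh] = -1
    return labels

def sample_batch(labels, batch_size=256, pos_fraction=1/3, rng=None):
    """Subsample a fixed number of anchors per image (256 on the
    slide), aiming for a 1/3 fraction of positives."""
    rng = rng or np.random.default_rng(0)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == -1)
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    pos = rng.choice(pos, n_pos, replace=False)
    neg = rng.choice(neg, batch_size - n_pos, replace=False)
    return pos, neg
```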

Loss function
• i is the index of an anchor in a mini-batch
• p_i is the predicted probability of anchor i being an object
• The ground-truth label p_i* is 1 if the anchor is positive, and 0 if the anchor is negative
• t_i is a vector representing the 4 parameterized coordinates (dy, dx, dh, dw) of the predicted bounding box
• t_i* is that of the ground-truth box associated with a positive anchor
• Total loss (from [2]): L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
• Classification loss L_cls is log loss over two classes (object vs. not object)
• For regression loss L_reg, use L_reg(t_i, t_i*) = R(t_i − t_i*), where R is smooth L1 defined as:
  smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise
• While both positive and negative anchors contribute to the classification loss, only positive anchors contribute to the regression loss.
• The cls term is normalized by the mini-batch size (N_cls = 256), the reg term by the number of anchor locations (N_reg ≈ 2400); set λ = 10
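A sketch of this loss under the definitions above (NumPy; names, shapes, and the default constants follow the Faster R-CNN formulation in [2]):

```python
import numpy as np

def smooth_l1(x):
    """R in the regression loss: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10, n_cls=256, n_reg=2400):
    """Mini-batch RPN loss: log loss over objectness plus smooth-L1
    regression on positive anchors only. Shapes: p, p_star (N,);
    t, t_star (N, 4). n_cls = mini-batch size, n_reg ~ number of
    anchor locations, lambda = 10."""
    eps = 1e-7  # avoid log(0)
    cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # p_star gates the regression term: negatives contribute nothing.
    reg = p_star[:, None] * smooth_l1(t - t_star)
    return cls.sum() / n_cls + lam * reg.sum() / n_reg
```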

RoIAlign

Worked example (from [1]):
• Input image: 1024 × 1024, containing an object RoI of 540 × 540
• The backbone downscales by 16×, so the feature map is 1024/16 = 64 cells wide and the RoI covers 540/16 = 33.75 cells
• Pooling the RoI into a 7 × 7 small feature map (one per RoI) gives 33.75/7 = 4.82 cells per bin
• RoIAlign keeps these fractional coordinates (no quantization) and uses bilinear interpolation to calculate the exact value at each bin
• The resulting small feature map is then fed to the head network (FCN)
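The bilinear interpolation step can be sketched for a single-channel feature map (illustrative; real RoIAlign averages several such sample points per bin, and feature maps have a channel axis):

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location by
    bilinear interpolation of the 4 surrounding cells -- the operation
    RoIAlign uses instead of RoIPool's coordinate quantization."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    # Clamp the far neighbours to the map border.
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    dy, dx = y - y0, x - x0
    # Weighted average of the four neighbours.
    return (feature_map[y0, x0] * (1 - dy) * (1 - dx)
            + feature_map[y0, x1] * (1 - dy) * dx
            + feature_map[y1, x0] * dy * (1 - dx)
            + feature_map[y1, x1] * dy * dx)
```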

Identify Feature Pyramid level for RoIs

After resizing, each RoI is cropped from one of the feature maps P2–P5:
• w, h: width & height of a RoI
• 224: canonical ImageNet pre-training size
• k0: target level of the RoI whose w*h = 224² (here, k0 = 5)
• Target level k of a RoI is identified by: k = ⌊k0 + log2(√(w·h) / 224)⌋
Intuitions:
• Features of large RoIs come from a smaller feature map (high semantic level)
• Features of small RoIs come from a larger feature map (low semantic level)
(From [6])
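The level-assignment formula, clamped to the available levels (a sketch; `fpn_level` is an illustrative name, and k0 = 5 follows the slide):

```python
import math

def fpn_level(w, h, k0=5, canonical_size=224, k_min=2, k_max=5):
    """Assign a RoI of size w x h to a feature pyramid level: large
    RoIs go to coarser levels (higher k), small RoIs to finer ones,
    with a RoI of area 224^2 mapped to level k0. Formula from [6],
    clamped to the available levels P2..P5."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    return max(k_min, min(k_max, k))
```

For example, a 224×224 RoI lands on P5, a 112×112 RoI on P4, and a 28×28 RoI on P2.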


Mask-RCNN head network
• A classifier to identify the class of each RoI: K classes + background
• A regressor to predict the 4 values dy, dx, dh, dw for each RoI
• A Fully Convolutional Network (FCN) [5] to predict a mask per class
• Represent a mask as an m × m matrix
• For each RoI, predict a mask for each class
• Use a per-pixel sigmoid to predict the probability that each pixel belongs to the mask
• Use binary cross-entropy loss to train the network

Mask-RCNN head network architecture

Classification and box-regression branch (per RoI):
• Input: 7×7×256 small feature map (for each RoI)
• Conv1: 7×7 (1024 filters) → Conv2: 1×1 (1024 filters) → 1024-d feature
  (fully connected layers implemented by CNN; weights shared over multiple RoIs)
• Dense → softmax over K+1 classes (BG + num classes)
• Dense → (K+1) × 4 box regression values: dy, dx, dh, dw

Mask branch (per RoI):
• Input: 14×14×256 feature map
• 4 conv layers, 3×3 (256 filters) → 14×14×256
• Conv transpose 2×2, stride 2 (256 filters, upsampling) → 28×28×256
• Conv 1×1 (K+1 filters), sigmoid activation → 28×28×(K+1): predict mask per class

Loss functions
• For each sampled RoI, a multi-task loss is applied: L = L_cls + L_loc + L_mask, where
  • L_cls is the classification loss
  • L_loc is the bounding-box regression loss
  • L_mask is the mask loss
• The final loss is calculated as the mean of the loss over samples

Classification loss L_cls
• For a RoI, denote:
  • u: true class of the RoI
  • p = (p_0, …, p_K): predicted probability distribution over K+1 classes
• The classification loss L_cls for a RoI is a log-loss calculated as: L_cls = −log p_u

Bounding-box regression loss L_loc
• For a RoI, denote:
  • u: true class of the RoI
  • v = (v_y, v_x, v_h, v_w): true bounding-box regression targets of the RoI
  • t^u = (t^u_y, t^u_x, t^u_h, t^u_w): predicted bounding-box regression for the class u
• The bounding-box regression loss L_loc for the RoI is calculated as:
  L_loc(t^u, v) = Σ_{i ∈ {y,x,h,w}} smooth_L1(t^u_i − v_i)
  where smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise

Mask loss L_mask
• For a RoI, denote:
  • u: true class of the RoI
  • m, m̂: the true mask and the predicted mask for the class u of the RoI
    respectively (both of size m × m; m̂ comes from the per-pixel sigmoid)
• The mask loss L_mask for the RoI is the average binary cross-entropy loss, calculated as:
  L_mask = −(1/m²) Σ_{i,j} [ m_ij log m̂_ij + (1 − m_ij) log(1 − m̂_ij) ]
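A sketch of this average binary cross-entropy in NumPy (clipping is added for numerical stability; only the true class's mask output contributes, as stated above):

```python
import numpy as np

def mask_loss(true_mask, pred_mask):
    """Average binary cross-entropy between the m x m ground-truth
    mask and the predicted per-pixel sigmoid probabilities for the
    RoI's true class."""
    eps = 1e-7  # keep log() finite at 0 and 1
    pred = np.clip(pred_mask, eps, 1 - eps)
    bce = -(true_mask * np.log(pred)
            + (1 - true_mask) * np.log(1 - pred))
    return bce.mean()
```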

Mask-RCNN on COCO data

(From [1])

Evolution of R-CNN

Mask R-CNN [1] = Faster R-CNN [2] + Fully Convolutional Network [5], with RoIPool → RoIAlign and per-pixel softmax → per-pixel sigmoid
Faster R-CNN [2] = Fast R-CNN [3] + Region Proposal Network replacing region proposals computed on the input image
Fast R-CNN [3] = R-CNN [4] + ConvNet on whole input image first, then apply RoIPooling layer
R-CNN [4] = region proposals on the input image + a ConvNet applied to each proposal

Summary
• Introduced Mask-RCNN, an algorithm for Instance Segmentation
• Detects both bounding boxes and masks of objects in an end-to-end neural network
• Improves on Faster-RCNN by replacing RoIPool with RoIAlign
• Employs a Fully Convolutional Network for mask prediction

References
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[2] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[3] R. Girshick. Fast R-CNN. In ICCV, 2015.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[5] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[6] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
by Nguyen Phuoc Tat Dat

Appendix: Some popular DL-based algorithms for visual perception tasks

Visual perception task → Algorithms
• Image Classification: AlexNet, GoogLeNet/Inception v1, VGGNet, ResNet
• Object Detection: Fast/Faster R-CNN, SSD, YOLO
• Semantic Segmentation: Fully Convolutional Network (FCN), U-Net
• Instance Segmentation: Mask R-CNN

Thank you for listening!