Introduction to MaskRCNN
•Mask R-CNN stands for Mask Region-based Convolutional Neural Network
•State-of-the-art algorithm for Instance Segmentation
•Evolved through 4 main versions:
•R-CNN → Fast R-CNN → Faster R-CNN → Mask R-CNN
•The first 3 versions are for Object Detection
•Improvements over Faster R-CNN: use RoIAlign instead of RoIPool
•Employ a Fully Convolutional Network (FCN) for mask prediction; predict a mask for each class independently
•2 main stages:
•1st stage: use a Region Proposal Network (RPN) to propose candidate object bounding boxes
•2nd stage: classify the candidate boxes, refine the boxes and predict masks
Terms
•Bounding box: rectangle identifying the location of an object
•Mask: set of pixels which belong to an object
•Anchor: a bounding box generated independently of the image content
•RoI: Region of Interest, a bounding box which may contain an object
•Non-Maximum Suppression (NMS): a method to eliminate duplicated bounding boxes using scores and an IoU threshold
•IoU: Intersection over Union, a metric to evaluate how similar 2 areas are to each other
•RoIAlign: a method to extract features for RoIs from feature maps
•Feature Pyramid Network (FPN): a neural network to extract feature maps at different scales
•Region Proposal Network (RPN): a neural network to propose RoIs for an image
•Fully Convolutional Network (FCN): a convolution-based neural network to extract masks
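The IoU and NMS terms above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function names `iou`/`nms` and the (y1, x1, y2, x2) box convention are assumptions:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (y1, x1, y2, x2)."""
    y1 = max(box_a[0], box_b[0]); x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2]); x2 = min(box_a[3], box_b[3])
    inter = max(0.0, y2 - y1) * max(0.0, x2 - x1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
    by more than the IoU threshold, then repeat on the survivors."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep
```

With two heavily overlapping boxes and one far away, `nms` keeps the higher-scoring member of the overlapping pair plus the isolated box.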
•Sort all anchors by rpn_probs (how likely an anchor contains an object)
•Choose the top N anchors, throw away the rest (e.g., N ~ 6000)
•Apply Non-Maximum Suppression (NMS) to eliminate duplicated boxes.
Keep up to M anchors (e.g., M ~ 2000).
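The three proposal-filtering steps above (sort by score, keep top N, NMS down to M) can be sketched as one self-contained NumPy function. The name `select_proposals` and the (y1, x1, y2, x2) convention are assumptions for illustration:

```python
import numpy as np

def select_proposals(anchors, rpn_probs, pre_nms_top_n=6000,
                     post_nms_top_n=2000, iou_threshold=0.7):
    """anchors: (A, 4) boxes (y1, x1, y2, x2); rpn_probs: (A,) objectness."""
    # Step 1-2: sort anchors by rpn_probs (descending), keep top N
    order = np.argsort(rpn_probs)[::-1][:pre_nms_top_n]
    boxes = anchors[order]

    # Step 3: greedy NMS, keeping up to M boxes
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    idx = np.arange(len(boxes))
    while len(idx) > 0 and len(keep) < post_nms_top_n:
        best = idx[0]
        keep.append(best)
        rest = idx[1:]
        # vectorized IoU of the best box against all remaining boxes
        y1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        x1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        y2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        x2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0, y2 - y1) * np.maximum(0, x2 - x1)
        ious = inter / (areas[best] + areas[rest] - inter)
        idx = rest[ious < iou_threshold]  # drop duplicates of the best box
    return boxes[keep]
```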
Train the RPN
•Positive boxes: IoU >= 0.7 with any GT box
•Negative boxes: IoU < 0.3 with all GT boxes
•Ratio of positive boxes: 1/3
•Fixed number of anchors per image for training: 256
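The sampling rules above can be sketched as a labeling function over an anchor-vs-GT IoU matrix. This is a simplified sketch; the safeguard of force-labeling the best anchor per GT box follows the Faster R-CNN paper, and the function name is an assumption:

```python
import numpy as np

def label_anchors(anchor_gt_ious, num_samples=256, positive_fraction=1/3):
    """anchor_gt_ious: (num_anchors, num_gt) IoU matrix.
    Returns labels: 1 = positive, 0 = negative, -1 = not sampled."""
    labels = np.full(anchor_gt_ious.shape[0], -1)
    max_iou = anchor_gt_ious.max(axis=1)
    labels[max_iou >= 0.7] = 1   # positive: IoU >= 0.7 with any GT box
    labels[max_iou < 0.3] = 0    # negative: IoU < 0.3 with all GT boxes
    # Faster R-CNN safeguard: the highest-IoU anchor for each GT box is
    # positive even if it never reaches 0.7
    labels[anchor_gt_ious.argmax(axis=0)] = 1

    # subsample to a fixed number of anchors with the given positive ratio
    num_pos = int(num_samples * positive_fraction)
    pos = np.flatnonzero(labels == 1)
    if len(pos) > num_pos:
        labels[np.random.choice(pos, len(pos) - num_pos, replace=False)] = -1
    neg = np.flatnonzero(labels == 0)
    num_neg = num_samples - min(num_pos, int((labels == 1).sum()))
    if len(neg) > num_neg:
        labels[np.random.choice(neg, len(neg) - num_neg, replace=False)] = -1
    return labels
```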
Loss function
•The RPN is trained with the multi-task loss from Faster R-CNN [2]:
  L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
•i is the index of an anchor in a mini-batch
•p_i is the predicted probability of anchor i being an object
•The ground-truth label p_i* is 1 if the anchor is positive, and 0 if the anchor is negative
•t_i is a vector representing the 4 parameterized coordinates (dy, dx, dh, dw) of the predicted bounding box
•t_i* is that of the ground-truth box associated with a positive anchor
•Classification loss L_cls is log loss over two classes (object vs. not object)
•For regression loss L_reg, use L_reg(t_i, t_i*) = R(t_i − t_i*), where R is the smooth L1 function defined as:
  R(x) = 0.5 x^2 if |x| < 1, |x| − 0.5 otherwise
•While both positive and negative anchors contribute to the classification loss, only positive anchors contribute to the regression loss.
•The classification term is normalized by the mini-batch size (N_cls = 256), the regression term by the number of anchor locations (N_reg ≈ 2400), and λ is set to 10.
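The loss above can be written out directly in NumPy; a minimal sketch assuming already-sampled anchors (the function names and the `eps` guard against log(0) are illustration choices):

```python
import numpy as np

def smooth_l1(x):
    """R(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise (element-wise)."""
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    """p: (N,) predicted object probabilities; p_star: (N,) 0/1 labels;
    t, t_star: (N, 4) predicted / target box parameters (dy, dx, dh, dw)."""
    eps = 1e-7
    # classification: log loss over object vs. not-object, all sampled anchors
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # regression: smooth L1, weighted by p_star so that only positive
    # anchors contribute
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg
```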
RoIAlign
•Example: an input image of 1024 × 1024 with an object RoI of 540 × 540
•The feature map is 16× smaller than the input image: 1024/16 = 64
•On the feature map the RoI becomes 540/16 = 33.75, kept as a fraction (no quantization)
•Divide the RoI into a 7 × 7 grid: 33.75/7 = 4.82 for each bin
•Use bilinear interpolation to calculate the exact value at each bin
•No quantization at any step
•Result: a small 7 × 7 feature map for each RoI
(From [1])
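The RoIAlign idea can be sketched with one bilinear sample per bin (the paper samples 4 points per bin and averages; using a single bin-centre sample here is a simplification, and the function names are assumptions):

```python
import numpy as np

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at fractional (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature.shape[0] - 1)
    x1 = min(x0 + 1, feature.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feature[y0, x0] * (1 - dy) * (1 - dx)
            + feature[y0, x1] * (1 - dy) * dx
            + feature[y1, x0] * dy * (1 - dx)
            + feature[y1, x1] * dy * dx)

def roi_align(feature, roi, out_size=7):
    """Pool an RoI (y1, x1, y2, x2 in feature-map coordinates, possibly
    fractional) into an out_size x out_size grid without quantization."""
    y1, x1, y2, x2 = roi
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cy = y1 + (i + 0.5) * bin_h   # exact bin centre, no rounding
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear(feature, cy, cx)
    return out
```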
Identify Feature Pyramid level for RoIs
•FPN produces feature maps at several levels: P2, P3, P4, P5
•The target level k of an RoI is identified by:
  k = floor(k0 + log2(sqrt(w·h) / 224))
•w, h: width & height of the RoI
•224: canonical ImageNet pre-training size
•k0: target level of an RoI whose w·h = 224^2 (here, k0 = 5)
•Crop the RoIs on their feature map, then resize
Intuitions:
•Features of large RoIs come from a smaller feature map (high semantic level)
•Features of small RoIs come from a larger feature map (low semantic level)
(From [6])
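The level-assignment formula above is a one-liner; a sketch with the slide's k0 = 5 and clamping to the available levels P2–P5 (the clamp bounds are an assumption, since RoIs far larger or smaller than 224 × 224 would otherwise map outside the pyramid):

```python
import math

def fpn_level(w, h, k0=5, canonical_size=224, k_min=2, k_max=5):
    """Assign an RoI of width w and height h to pyramid level P_k:
    k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to [k_min, k_max]."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    return max(k_min, min(k_max, k))
```

A 224 × 224 RoI lands on P5; a 112 × 112 RoI (a quarter of the area) lands one level down on P4, matching the intuition that smaller RoIs draw features from larger feature maps.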
Mask-RCNN head network
•A classifier to identify the class for each RoI: K classes + background
•A regressor to predict the 4 values dy, dx, dh, dw for each RoI
•A Fully Convolutional Network (FCN) [5] to predict a mask per class
•Represent a mask as an m × m matrix
•For each RoI, predict a mask for each class
•Use a sigmoid to predict the probability for each pixel
•Use binary cross-entropy loss to train the network
Mask-RCNN head network architecture
•Input: a small 7×7×256 feature map for each RoI; weights are shared over multiple RoIs
•Classification & box branch (fully connected layers implemented by CNN):
  •Conv1: 7×7 conv, 1024 filters → 1024
  •Conv2: 1×1 conv, 1024 filters → 1024
  •Dense → K+1 with softmax: BG + K classes
  •Dense → (K+1) × 4: the 4 box regression values dy, dx, dh, dw per class
•Mask branch (predict mask per class, on a 14×14×256 feature map for each RoI):
  •Conv1 … Conv4: four 3×3 conv layers, 256 filters each → 14×14×256
  •Conv transpose (up-sampling): 2×2, 256 filters, stride 2 → 28×28×256
  •Conv: 1×1, K+1 filters, sigmoid activation → 28×28×(K+1)
Loss functions
•For each sampled RoI, a multi-task loss is applied:
  L = L_cls + L_loc + L_mask
where
•L_cls is the classification loss
•L_loc is the bounding-box regression loss
•L_mask is the mask loss
•The final loss is calculated as the mean of the loss over samples
Classification loss L_cls
•For a RoI, denote:
•u: the true class of the RoI
•p = (p_0, …, p_K): the predicted probability distribution over K+1 classes
•The classification loss L_cls for a RoI is a log-loss calculated as:
  L_cls = −log p_u
Bounding-box regression loss L_loc
•For a RoI, denote:
•u: the true class of the RoI
•v = (v_y, v_x, v_h, v_w): the true bounding-box regression targets of the RoI
•t^u = (t^u_y, t^u_x, t^u_h, t^u_w): the predicted bounding-box regression for the class u
•The bounding-box regression loss L_loc for the RoI is calculated as:
  L_loc = Σ_{i ∈ {y,x,h,w}} smooth_L1(t^u_i − v_i)
where smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| − 0.5 otherwise
Mask loss L_mask
•For a RoI, denote:
•u: the true class of the RoI
•y, ŷ^u: the true mask and the predicted mask for the class u of the RoI respectively (each of size m × m)
•The mask loss L_mask for the RoI is the average binary cross-entropy loss, calculated as:
  L_mask = −(1/m^2) Σ_{1 ≤ i,j ≤ m} [ y_ij log ŷ^u_ij + (1 − y_ij) log(1 − ŷ^u_ij) ]
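The three per-RoI losses can be combined in one sketch; variable names follow the slides, the `eps` guard and function name are illustration choices, and only the mask of the true class contributes to L_mask:

```python
import numpy as np

def multitask_loss(u, class_probs, t_u, v, pred_masks, true_mask):
    """L = L_cls + L_loc + L_mask for one sampled RoI.
    u: true class (int); class_probs: (K+1,) predicted distribution p
    t_u: (4,) predicted box regression for class u; v: (4,) targets
    pred_masks: (m, m, K+1) per-class sigmoid outputs
    true_mask: (m, m) binary ground-truth mask"""
    eps = 1e-7
    # L_cls = -log p_u
    l_cls = -np.log(class_probs[u] + eps)
    # L_loc = sum over (y, x, h, w) of smooth L1 of the residual
    d = np.abs(t_u - v)
    l_loc = np.where(d < 1, 0.5 * d**2, d - 0.5).sum()
    # L_mask: average binary cross-entropy on the true class's mask only
    m_hat = pred_masks[:, :, u]
    l_mask = -np.mean(true_mask * np.log(m_hat + eps)
                      + (1 - true_mask) * np.log(1 - m_hat + eps))
    return l_cls + l_loc + l_mask
```

A perfect prediction (probability 1 on the true class, exact box targets, exact mask) drives all three terms to zero.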
Mask-RCNN on COCO data
(From [1])
Evolution of R-CNN
•Mask R-CNN [1] = Faster R-CNN [2] + Fully Convolutional Network [5], with RoIPool replaced by RoIAlign and per-pixel softmax replaced by per-pixel sigmoid
•Faster R-CNN [2] = Fast R-CNN [3] + Region Proposal Network
•Fast R-CNN [3] = R-CNN [4] + ConvNet on whole input image first, then apply RoIPooling layer
•R-CNN [4] = Region proposal on input image + ConvNet per region
Summary
•Introduced Mask R-CNN, an algorithm for Instance Segmentation
•Detects both bounding boxes and masks of objects in an end-to-end neural network
•Improves on RoIPool from Faster R-CNN with RoIAlign
•Employs a Fully Convolutional Network for mask prediction
References
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2017.
[2] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[3] R. Girshick. Fast R-CNN. In ICCV, 2015.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[5] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[6] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
by Nguyen Phuoc Tat Dat
Appendix: Some popular DL-based algorithms for visual perception tasks
Visual perception tasks and algorithms:
•Image Classification: AlexNet, Inception, GoogLeNet/Inception v1, ResNet, VGGNet
•Object Detection: Fast/Faster R-CNN, SSD, YOLO
•Semantic Segmentation: Fully Convolutional Network (FCN), U-Net
•Instance Segmentation: Mask R-CNN