Object Detection is a very powerful field.pptx

usmanyaseen16 · Apr 29, 2024

About This Presentation

Object Detection is a very powerful field


Slide Content

Object Detection

Object classification

Object localization

Before the deep learning era, hand-crafted features like HOG and feature pyramids were used pervasively to capture localization signals in an image. However, those methods usually couldn't extend well to generic object detection, so most applications were limited to face or pedestrian detection. With the power of deep learning, we can train a network to learn which features to capture, as well as which coordinates to predict for an object.

2013: OverFeat - Integrated Recognition, Localization and Detection using Convolutional Networks

Inspired by the early success of AlexNet in the 2012 ImageNet competition, where CNN-based feature extraction defeated all hand-crafted feature extractors, OverFeat quickly brought CNNs into the object detection area as well. The idea is very straightforward: if we can classify one image using a CNN, what about greedily scrolling through the whole image with windows of different sizes, and trying to regress and classify them one by one using a CNN? This leverages the power of the CNN for feature extraction and classification, and also bypasses the hard region proposal problem via pre-defined sliding windows. Also, since nearby convolution kernels can share part of the computation, it is not necessary to recompute convolutions for overlapping areas, which reduces cost a lot. OverFeat is a pioneer among one-stage object detectors: it tried to combine feature extraction, location regression, and region classification in the same CNN. Unfortunately, such a one-stage approach also suffers from relatively poorer accuracy, due to the smaller amount of prior knowledge used.

2013: R-CNN

Also proposed in 2013, R-CNN arrived slightly later than OverFeat. However, this region-based approach eventually led to a big wave of object detection research with its two-stage framework, i.e., a region proposal stage followed by a region classification and refinement stage.

R-CNN first extracts potential regions of interest from an input image using a technique called selective search. Selective search doesn't really try to understand the foreground object; instead, it groups similar pixels by relying on a heuristic: similar pixels usually belong to the same object. Therefore, the results of selective search have a very high probability of containing something meaningful. Next, R-CNN warps these region proposals into fixed-size images with some padding, and feeds them into the second stage of the network for more fine-grained recognition. Unlike the older methods, R-CNN replaces HOG with a CNN to extract features from all region proposals in this second stage.

Region proposals from selective search depend heavily on the similarity assumption, so they can only provide a rough estimate of location. To further improve localization accuracy, R-CNN borrowed an idea from "Deep Neural Networks for Object Detection" (aka DetectorNet) and introduced an additional bounding box regression that predicts the center coordinates, width, and height of a box. This regressor has been widely used in object detectors ever since. However, a two-stage detector like R-CNN suffers from two big issues: 1) it's not fully convolutional, because selective search is not E2E trainable; 2) the region proposal stage is usually very slow compared with one-stage detectors like OverFeat, and running on each region proposal separately makes it even slower. Later, we will see how R-CNN evolved over time to address these two issues.

2015: Fast R-CNN

A quick follow-up to R-CNN was to reduce the duplicated convolutions over multiple region proposals. Since these region proposals all come from one image, it's natural to improve R-CNN by running the CNN over the entire image once and sharing the computation among the many region proposals. However, different region proposals have different sizes, which also results in different output feature map sizes when using the same CNN feature extractor. These feature maps of various sizes prevent us from using fully connected layers for further classification and regression, because an FC layer only works with fixed-size input.

Fortunately, a paper called "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition" had already solved the dynamic-scale issue for FC layers. In SPPNet, a spatial pyramid pooling layer is introduced between the convolution layers and the FC layers to create a bag-of-words-style feature vector. This vector has a fixed size and encodes features from different scales, so the convolution layers can take images of any size as input without worrying about FC-layer incompatibility. Inspired by this, Fast R-CNN proposed a similar layer called the ROI pooling layer. This pooling layer downsamples feature maps of different sizes into a fixed-size vector. With it, we can use the same FC layers for classification and box regression, no matter how large or small the ROI is.
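As a concrete illustration, torchvision ships an ROI pooling operator. A minimal sketch of pooling two differently sized ROIs from one shared feature map into fixed 7×7 outputs (the tensor shapes and the 1/16 stride are illustrative assumptions, not values from the paper):

```python
import torch
from torchvision.ops import roi_pool

# Shared feature map for the whole image: (batch, channels, H, W)
features = torch.randn(1, 256, 50, 50)

# ROIs in (batch_index, x1, y1, x2, y2) format, in input-image pixels.
rois = torch.tensor([[0.0, 0.0, 0.0, 320.0, 240.0],
                     [0.0, 100.0, 50.0, 180.0, 200.0]])

# spatial_scale maps image coordinates onto the feature map
# (e.g. 1/16 for a VGG-style backbone with stride 16).
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) regardless of ROI size
```

Whatever the ROI's size, the output is the same fixed shape, which is exactly what the downstream FC layers need.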

With a shared feature extractor and the scale-invariant ROI pooling layer, Fast R-CNN reaches similar localization accuracy while achieving 10~20x faster training and 100~200x faster inference. The near real-time inference and an easier E2E training protocol for the detection part make Fast R-CNN a popular choice in the industry as well.

Dense prediction over the entire image can be computationally expensive, so YOLO (introduced below) took the bottleneck structure from GoogLeNet to avoid this issue. Another problem with YOLO is that two objects might fall into the same coarse grid cell, so it doesn't work well with small objects such as a flock of birds. Despite its lower accuracy, YOLO's straightforward design and real-time inference ability made one-stage object detection popular again in research, and a go-to solution for the industry.

2015: Faster R-CNN - Towards Real-Time Object Detection with Region Proposal Networks

As introduced above, in early 2015 Ross Girshick proposed an improved version of R-CNN called Fast R-CNN, which uses a shared feature extractor for the proposed regions. Just a few months later, Ross and his team came back with another improvement. The new network, Faster R-CNN, is not only faster than previous versions but also marks a milestone for deep-learning-based object detection.

With Fast R-CNN, the only non-convolutional piece of the network is the selective search region proposal. As of 2015, researchers had started to realize that deep neural networks are so magical that they can learn anything given enough data. So, is it possible to also train a neural network to propose regions, instead of relying on heuristic, hand-crafted approaches like selective search? Faster R-CNN followed this line of thinking and successfully created the Region Proposal Network (RPN). Simply put, the RPN is a CNN that takes an image as input and outputs a set of rectangular object proposals, each with an objectness score. The paper originally used VGG, but other backbone networks such as ResNet became more widespread later. To generate region proposals, a 3×3 sliding window is applied over the CNN feature map output to generate 2 scores (foreground and background) and 4 coordinates at each location. In practice, this sliding window is implemented as a 3×3 convolution followed by parallel 1×1 convolutions.
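A minimal PyTorch sketch of such an RPN head (the channel sizes are illustrative; k is the number of anchors per location, introduced below):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of an RPN head: a 3x3 'sliding window' convolution followed
    by two parallel 1x1 convolutions for objectness scores and box deltas."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # fg/bg per anchor
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # dx, dy, dw, dh

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) (1, 36, 38, 50)
```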

Although the sliding window has a fixed size, our objects may appear at different scales. Therefore, Faster R-CNN introduced a technique called anchor boxes. Anchor boxes are pre-defined prior boxes with different aspect ratios and sizes that share the same central location. In Faster R-CNN there are k=9 anchors for each sliding window location, covering 3 aspect ratios at 3 scales each. These repeated anchor boxes over different scales bring nice translation-invariance and scale-invariance properties to the network while sharing the outputs of the same feature map. Note that the bounding box regression is computed relative to these anchor boxes instead of the whole image.
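Generating those k=9 anchors takes only a few lines of NumPy. A sketch, assuming the common convention of keeping anchor area roughly constant across aspect ratios (the base size and scale values are illustrative):

```python
import numpy as np

def make_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate k = len(ratios) * len(scales) anchors centered at the origin,
    as (x1, y1, x2, y2). ratio is interpreted as height / width, and the
    area is preserved across ratios, so small ratios give wide boxes."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)   # wider box for a small ratio
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4): one row per anchor shape
```

At inference time these nine shapes are simply translated to every sliding window position on the feature map.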

So far, we have discussed the new Region Proposal Network that replaces the old selective search region proposal. To make the final detection, Faster R-CNN uses the same detection head as Fast R-CNN for classification and fine-grained localization. Recall that Fast R-CNN uses a shared CNN feature extractor; now that the RPN itself is also a feature-extraction CNN, we can share it with the detection head. This sharing design does bring some trouble, though. If we train the RPN and the Fast R-CNN detector together, we treat the RPN proposals as a constant input to ROI pooling, and inevitably ignore the gradients of the RPN's bounding box proposals. One workaround is alternating training, where the RPN and Fast R-CNN are trained in turns. And later, in the paper "Instance-aware Semantic Segmentation via Multi-task Network Cascades", we can see that the ROI pooling layer can also be made differentiable w.r.t. the box coordinate proposals.

2015: YOLO v1 - You Only Look Once: Unified, Real-Time Object Detection

While the R-CNN series started a big hype over two-stage object detection in the research community, its complicated implementation brought many headaches to the engineers maintaining it. Does object detection need to be so cumbersome? If we are willing to sacrifice a bit of accuracy, can we trade it for much faster speed? With these questions, Joseph Redmon submitted a network called YOLO to arxiv.org only four days after Faster R-CNN's submission. It finally brought popularity back to one-stage object detection, two years after OverFeat's debut.

Unlike R-CNN, YOLO decided to tackle region proposal and region classification together in the same CNN. In other words, it treats object detection as a regression problem, instead of a classification problem relying on region proposals. The general idea is to split the input into an S×S grid and have each cell directly regress the bounding box location and a confidence score when an object's center falls into that cell. Because objects may have different sizes, there is more than one bounding box regressor per cell. During training, the regressor with the highest IOU is assigned to compare with the ground-truth label, so regressors at the same location learn to handle different scales over time. In the meantime, each cell also predicts C class probabilities, conditioned on the grid cell containing an object (a high confidence score). This approach was later described as dense prediction, because YOLO tries to predict classes and bounding boxes for all possible locations in an image.
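The cell-assignment rule is easy to state in code. A minimal sketch, assuming boxes come as (cx, cy, w, h) normalized to [0, 1) (the function name and target layout are illustrative, not the paper's):

```python
def yolo_cell(box, S=7):
    """Assign a ground-truth box to its YOLO v1 grid cell (a sketch).
    box is (cx, cy, w, h), all normalized to [0, 1); the cell containing
    the box center is the one responsible for predicting the object."""
    cx, cy, w, h = box
    col, row = int(cx * S), int(cy * S)
    # Regression targets: center offsets within the cell, plus box size.
    tx, ty = cx * S - col, cy * S - row
    return row, col, (tx, ty, w, h)

print(yolo_cell((0.52, 0.31, 0.2, 0.4)))  # row 2, col 3, offsets approx (0.64, 0.17)
```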

CNN model that forms the backbone of YOLO

Object localization

Steps

1. YOLO cuts an image into squares. This makes it easier for YOLO to find objects in the image: it only needs to look at one square at a time, instead of the entire image.

2. For each square, YOLO guesses whether there is an object in it and, if so, what kind of object it is. It does this using a deep learning model that has been trained on a lot of images and labels, so the model knows how to identify different types of objects in images.

3. YOLO gets rid of any extra guesses using a technique called non-maximum suppression, which removes guesses that overlap with other guesses. This makes sure that YOLO only outputs one guess for each object in the image.

4. YOLO outputs the remaining guesses as rectangles and object labels. A rectangle is a box that surrounds an object in an image, and an object label names the type of object in the box. So YOLO outputs a box and a name for each object that it finds in the image.

For multiple objects

Non-max suppression
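Step 3 above names non-maximum suppression; a minimal NumPy sketch of the greedy algorithm (the box format and the 0.5 threshold are common conventions, not mandated by YOLO):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression (a sketch).
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,)."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavy overlaps
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```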

Anchor boxes

CNN with two anchor boxes

2015: SSD - Single Shot MultiBox Detector

YOLO v1 demonstrated the potential of one-stage detection, but the performance gap from two-stage detection was still noticeable. In YOLO v1, multiple objects could be assigned to the same grid cell. This was a big challenge when detecting small objects, and it became a critical problem to solve in order to bring a one-stage detector's performance on par with two-stage detectors. SSD is such a challenger, and it attacks this problem from three angles.

Key Features of SSD

Single Shot: Unlike traditional object detection models that use a two-stage approach (first proposing regions of interest and then classifying those regions), SSD performs object detection in a single pass through the network. It directly predicts the presence of objects and their bounding box coordinates in one shot, making it faster and more efficient.

MultiBox: SSD uses a set of default bounding boxes (anchor boxes) of different scales and aspect ratios at multiple locations in the input image. These default boxes serve as prior knowledge about where objects are likely to appear. SSD predicts adjustments to these default boxes to locate objects accurately.

Key Features of SSD

Multi-Scale Detection: SSD operates on multiple feature maps with different resolutions, allowing it to detect objects of various sizes. Predictions are made at different scales to capture objects at varying levels of granularity.

Class Scores: SSD not only predicts the bounding box coordinates but also assigns class scores to each default box, indicating the likelihood of an object belonging to a specific category (e.g., car, pedestrian, bicycle).

Key Concepts of SSD

Default Bounding Boxes (Anchor Boxes): SSD uses a predefined set of default bounding boxes, also known as anchor boxes. These boxes come in various scales and aspect ratios, providing prior knowledge about where objects are likely to be located in the image. SSD predicts adjustments to these default boxes to localize objects accurately.

Multi-Scale Feature Maps: SSD operates on multiple feature maps at different resolutions. These feature maps are obtained by applying convolutional layers to the input image at various stages. Using feature maps at multiple scales allows SSD to detect objects of different sizes.

Key Concepts of SSD

Multi-Scale Predictions: For each default bounding box, SSD makes predictions at multiple feature map layers with different resolutions. This enables the model to capture objects at various scales. These predictions include class scores for different object categories and offsets for adjusting the default boxes to match the objects' positions.

Aspect Ratio Handling: SSD uses separate predictors (convolutional filters) for different aspect ratios of bounding boxes. This allows it to adapt to objects with varying shapes and aspect ratios.

Base Network (Truncated for Classification): SSD begins with a standard CNN architecture, of the kind typically used for high-quality image classification tasks. However, in SSD, this base network is truncated before any classification layers. The base network is responsible for extracting essential features from the input image.

Multi-Scale Feature Maps: Additional convolutional layers are added to the truncated base network. These layers progressively reduce the spatial dimensions while increasing the number of feature channels. This design allows SSD to produce feature maps at multiple scales, with each scale's feature map suitable for detecting objects of a different size range.

Default Bounding Boxes (Anchor Boxes): SSD associates a predefined set of default bounding boxes (anchor boxes) with each feature map cell. These default boxes have various scales and aspect ratios, and their placement relative to the corresponding cell is fixed, following a convolutional grid pattern. For each feature map cell, SSD predicts the offsets necessary to adjust these default boxes to fit objects, plus the class scores indicating the presence of specific object categories.

Aspect Ratios and Multiple Feature Maps: SSD employs default boxes with different aspect ratios and uses them across multiple feature maps at various resolutions. This approach efficiently captures a range of possible object shapes and sizes. Unlike other models, SSD doesn't rely on an intermediate fully connected layer for predictions, but applies convolutional filters directly.

Grid cell

Instead of using a sliding window, SSD divides the image with a grid and has each grid cell be responsible for detecting objects in that region of the image. Detecting objects simply means predicting the class and location of an object within that region; if no object is present, we consider it the background class and the location is ignored. For instance, with a 4×4 grid, each grid cell outputs the position and shape of the object it contains.

Anchor box

Each grid cell in SSD can be assigned multiple anchor/prior boxes. These anchor boxes are pre-defined, and each one is responsible for one size and shape within a grid cell. For example, a tall, narrow object such as a swimming pool corresponds to the taller anchor box, while a wide building corresponds to the wider box.

SSD uses a matching phase during training to match the appropriate anchor box with the bounding box of each ground-truth object within an image. Essentially, the anchor box with the highest degree of overlap with an object is responsible for predicting that object's class and location. This property is used for training the network and for predicting detected objects and their locations once the network has been trained. In practice, each anchor box is specified by an aspect ratio and a zoom level.
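A hedged sketch of that matching rule (the 0.5 threshold and box layout follow common SSD implementations, not necessarily this deck's source):

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b, all (x1, y1, x2, y2)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def match_anchors(anchors, gt_boxes, pos_thresh=0.5):
    """For each ground-truth box, the best-overlapping anchor is positive;
    any other anchor above pos_thresh is positive too. Returns, per anchor,
    the matched ground-truth index or -1 for background."""
    matches = np.full(len(anchors), -1)
    for g, gt in enumerate(gt_boxes):
        overlaps = iou(gt, anchors)
        matches[overlaps > pos_thresh] = g   # good overlaps are positive
        matches[overlaps.argmax()] = g       # the best anchor always matches
    return matches
```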

Aspect ratio

Not all objects are square in shape; some are longer and some are wider, by varying degrees. The SSD architecture allows pre-defined aspect ratios of the anchor boxes to account for this. The ratios parameter can be used to specify the different aspect ratios of the anchor boxes associated with each grid cell at each zoom/scale level.

Zoom level

It is not necessary for the anchor boxes to have the same size as the grid cell; we might be interested in finding smaller or larger objects within a grid cell. The zooms parameter specifies how much the anchor boxes need to be scaled up or down with respect to each grid cell. As in the anchor box example, the size of a building is generally larger than that of a swimming pool.

2016: FPN - Feature Pyramid Networks for Object Detection

With the launch of Faster R-CNN, YOLO, and SSD in 2015, it seemed like the general structure of an object detector had been determined, and researchers started looking at improving each individual part of these networks. Feature Pyramid Networks is an attempt to improve the detection head by using features from different layers to form a feature pyramid. The feature pyramid idea isn't novel in computer vision research: back when features were still manually designed, feature pyramids were already a very effective way to recognize patterns at different scales. However, how to share the feature pyramid between the RPN and the region-based detector still had to be worked out.

First, to rebuild the RPN with an FPN structure, we need the region proposal to run on multiple different scales of feature output. Also, we now only need 3 anchors with different aspect ratios per location, because objects of different sizes will be handled by different levels of the feature pyramid. Next, to use an FPN structure in the Fast R-CNN detector, we also need to adapt it to detect on multiple scales of feature maps. Since region proposals might have different scales too, we should use them at the corresponding level of the FPN. In short, if Faster R-CNN is a pair of an RPN and a region-based detector running at one scale, FPN converts it into multiple parallel branches running at different scales and collects the final results from all branches in the end.
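The core of the FPN neck (lateral 1×1 convolutions plus a top-down pathway of upsample-and-add) fits in a few lines. A PyTorch sketch, assuming ResNet-style C3-C5 inputs (the channel counts are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNNeck(nn.Module):
    """Minimal FPN top-down pathway (a sketch, not the paper's exact code).
    in_channels lists the channels of backbone stages C3..C5, shallow to deep."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1)
                                      for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3,
                                               padding=1) for _ in in_channels])

    def forward(self, feats):
        # feats: backbone outputs, shallow to deep, e.g. [C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # deep to shallow
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # P3..P5

feats = [torch.randn(1, c, s, s) for c, s in zip((512, 1024, 2048), (64, 32, 16))]
print([p.shape for p in FPNNeck()(feats)])  # all 256-channel, at 64/32/16
```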

2016: YOLO v2

The initial version of YOLO suffered from many shortcomings: predictions based on a coarse grid brought lower localization accuracy, and the two scale-agnostic regressors per grid cell made it difficult to recognize small, packed objects. YOLO v2 added Batch Normalization layers from the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

Just like SSD, YOLO v2 also adopted Faster R-CNN's idea of anchor boxes for bounding box regression. Its anchor sizes are determined by K-means clustering of the target dataset, to better align with object shapes. A new backbone network called Darknet is used for feature extraction, inspired by "Network in Network" and GoogLeNet's bottleneck structure. To improve the detection of small objects, YOLO v2 added a passthrough layer to merge features from an earlier layer; this part can be seen as a simplified version of SSD. YOLO v2 also experimented with a version trained on a hierarchical dataset of 9000 classes, an early trial of multi-label classification in an object detector.
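The clustering trick is easy to sketch: K-means over box shapes with 1 - IoU as the distance, so cluster centers become anchor sizes. A NumPy sketch (k, the iteration count, and the empty-cluster handling are illustrative choices):

```python
import numpy as np

def kmeans_anchors(wh, k=5, iters=50):
    """Cluster ground-truth box shapes with a 1 - IoU distance, as YOLO v2
    does to pick anchor sizes (a sketch). wh: (N, 2) widths and heights."""
    rng = np.random.default_rng(0)
    centroids = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # IoU between every box shape and every centroid, both anchored at
        # the origin, so only width and height matter.
        inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centroids[None, :, 1]))
        iou = inter / (wh.prod(1)[:, None] + centroids.prod(1)[None, :] - inter)
        assign = iou.argmax(1)               # nearest centroid = highest IoU
        centroids = np.stack([wh[assign == j].mean(0) if (assign == j).any()
                              else centroids[j] for j in range(k)])
    return centroids
```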

2017: RetinaNet

To understand why one-stage detectors are usually not as good as two-stage detectors, RetinaNet investigated the foreground-background class imbalance arising from a one-stage detector's dense predictions. RetinaNet introduced a new loss function called Focal Loss to help the network learn what's important. Focal Loss adds a power γ (the "focusing parameter") to the cross-entropy loss: as the confidence score becomes higher, the loss value becomes much lower than with normal cross-entropy, so easy examples contribute less. The network is composed of a ResNet backbone, an FPN detection neck to channel features at different scales, and two subnets for classification and box regression as the detection head. Similar to SSD and YOLO v2, RetinaNet uses anchor boxes to cover targets of various scales and aspect ratios.
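In formula form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), which reduces to cross-entropy at γ = 0. A minimal PyTorch sketch of the binary case (the α = 0.25, γ = 2 defaults follow the paper; the mean reduction is a simplification):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    logits and targets have the same shape; targets are in {0, 1}.
    A sketch of the RetinaNet loss with simplified reduction."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()    # down-weights easy examples
```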

2018: YOLO v3 - YOLOv3: An Incremental Improvement

Following YOLO v2's tradition, YOLO v3 borrowed more ideas from previous research and became an incredibly powerful one-stage detector. YOLO v3 balances speed, accuracy, and implementation complexity quite well, and it became really popular in the industry because of its fast speed and simple components.

Simply put, YOLO v3's success comes from its more powerful backbone feature extractor and a RetinaNet-like detection head with an FPN neck. The new backbone network, Darknet-53, leverages ResNet's skip connections to achieve accuracy on par with ResNet-50 while being much faster. Also, YOLO v3 ditched v2's passthrough layer and fully embraced FPN's multi-scale prediction design. With this, YOLO v3 finally reversed people's impression of YOLO's poor performance on small objects.

2019: Objects As Points

Although the image classification area has become less active recently, object detection research is still far from mature. In 2018, a paper called "CornerNet: Detecting Objects as Paired Keypoints" provided a new perspective for detector training. Since preparing anchor box targets is quite a cumbersome job, is it really necessary to use them as a prior? This new trend of ditching anchor boxes is called "anchor-free" object detection.

Inspired by the use of heat-maps in the Hourglass network for human pose estimation, CornerNet uses a heat-map generated from box corners to supervise the bounding box regression.

Objects As Points, aka CenterNet, took this a step further. It uses heat-map peaks to represent object centers, and the network regresses the box width and height directly from these centers. Essentially, CenterNet uses every pixel as a grid cell. With a Gaussian-distributed heat-map, training also converges more easily than in previous attempts that regressed bounding box size directly. Eliminating anchor boxes has another useful side effect. Previously, we relied on the IOU (e.g., > 0.7) between an anchor box and the ground-truth box to assign training targets; with that rule, a few neighboring anchors may all get assigned a positive target for the same object, and the network learns to predict multiple positive boxes for the same object too. The common way to fix this is a technique called non-maximum suppression (NMS), a greedy algorithm that filters out boxes that are too close together. Now that anchors are gone and we only have one peak per object in the heat-map, there's no need for NMS anymore. Since NMS is sometimes hard to implement and slow to run, getting rid of it is a big benefit for applications that run in environments with limited resources.
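The "one peak per object" property makes post-processing trivial: a 3×3 max-pool keeps only local maxima, which is what CenterNet uses in place of NMS. A PyTorch sketch (the top-k cap is an arbitrary choice):

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heat, k=100):
    """Extract object centers from a CenterNet-style heat-map (a sketch).
    A 3x3 max-pool keeps only local maxima, replacing NMS.
    heat: (B, C, H, W) with per-class center scores in [0, 1]."""
    B, C, H, W = heat.shape
    pooled = F.max_pool2d(heat, kernel_size=3, stride=1, padding=1)
    heat = heat * (pooled == heat)              # zero out non-peak pixels
    scores, idx = heat.view(B, -1).topk(k)      # top-k peaks per image
    cls = idx // (H * W)                        # recover the class index
    ys, xs = (idx % (H * W)) // W, idx % W      # and the spatial location
    return scores, cls, ys, xs
```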

2019: EfficientDet - Scalable and Efficient Object Detection

EfficientDet showed us some more exciting developments in the object detection area. The FPN structure has been proven to be a powerful technique for improving a detection network's performance on objects at different scales; famous detection networks such as RetinaNet and YOLO v3 all adopted an FPN neck before box regression and classification. Later, NAS-FPN and PANet both demonstrated that a plain multi-layer FPN structure may benefit from more design optimization. EfficientDet continued exploring in this direction, eventually creating a new neck called BiFPN. Basically, BiFPN features additional cross-layer connections to encourage feature aggregation back and forth. To justify the "efficient" part of the network, BiFPN also removed some less useful connections from the original PANet design. Another innovative improvement over the FPN structure is weighted feature fusion: BiFPN adds learnable weights to the feature aggregation so that the network can learn the importance of different branches.
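The weighted fusion is small enough to show directly. A sketch of BiFPN's "fast normalized fusion", where ReLU keeps the learnable weights non-negative and they are normalized to sum to one (the module name and epsilon are illustrative):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style 'fast normalized fusion' (a sketch): learnable
    non-negative weights over n input branches, normalized to sum to 1."""
    def __init__(self, n, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n))
        self.eps = eps

    def forward(self, feats):
        # feats: a list of n same-shape feature tensors to merge
        w = torch.relu(self.w)                # keep weights non-negative
        w = w / (w.sum() + self.eps)          # normalize to sum to ~1
        return sum(wi * f for wi, f in zip(w, feats))

fused = WeightedFusion(2)([torch.randn(1, 256, 32, 32),
                           torch.randn(1, 256, 32, 32)])
print(fused.shape)  # torch.Size([1, 256, 32, 32])
```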

More, less famous models…

2009: DPM - Object Detection with Discriminatively Trained Part Based Models

By matching many HOG features for each deformable part, DPM was one of the most effective object detection models before the deep learning era. Taking pedestrian detection as an example, it uses a star structure to recognize the general person pattern first, then recognizes parts with different sub-filters and calculates an overall score. Even today, the idea of recognizing objects by their deformable parts is still popular, after the switch from HOG features to CNN features.

2012: Selective Search - Selective Search for Object Recognition

Like DPM, Selective Search is not a product of the deep learning era either. However, this method combined many classical computer vision approaches and was also used in the early R-CNN detector. The core idea of selective search is inspired by semantic segmentation, where pixels are grouped by similarity. Selective Search uses different similarity criteria, such as color space and SIFT-based texture, to iteratively merge similar areas together. These merged areas serve as foreground predictions, followed by an SVM classifier for object recognition.

2016: R-FCN - R-FCN: Object Detection via Region-based Fully Convolutional Networks

Faster R-CNN finally combined the RPN and ROI feature extraction and improved speed a lot. However, for each region proposal, we still need fully connected layers to compute the class and bounding box separately. With 300 ROIs, we need to repeat this 300 times, and this is the origin of the major speed difference between one-stage and two-stage detectors. R-FCN borrowed the idea of the FCN from semantic segmentation, but instead of computing a class mask, R-FCN computes position-sensitive score maps. Each map predicts the probability of an object appearing at each location, and all locations vote (by averaging) to decide the final class and bounding box. Besides, R-FCN also used atrous convolution in its ResNet backbone, originally proposed in the DeepLab semantic segmentation network. To understand what atrous convolution is, please see my previous article "Witnessing the Progression in Semantic Segmentation: DeepLab Series from V1 to V3+".

2017: Soft-NMS - Improving Object Detection With One Line of Code

Non-maximum suppression (NMS) is widely used in anchor-based object detection networks to reduce duplicate positive proposals that are close by. More specifically, NMS iteratively eliminates candidate boxes that have a high IOU with a more confident candidate box. This can lead to unexpected behavior when two objects of the same class really are very close to each other. Soft-NMS makes a small change: instead of discarding an overlapping candidate box, it only scales down its confidence score by a parameter. This scaling parameter gives us more control when tuning localization performance, and also leads to better precision when high recall is needed (a sketch of this change follows after the next section).

2017: Cascade R-CNN - Cascade R-CNN: Delving into High Quality Object Detection

While FPN explored how to design a better neck for using backbone features, Cascade R-CNN investigated a redesign of the R-CNN classification and regression head. The underlying assumption is simple yet insightful: the higher the IOU criterion we use when preparing positive targets, the fewer false positive predictions the network will learn to make. However, we can't simply raise this IOU threshold from the commonly used 0.5 to a more aggressive 0.7, because that could also lead to overwhelmingly many negative examples during training. Cascade R-CNN's solution is to chain multiple detection heads together, each relying on the bounding box proposals from the previous head; only the first detection head uses the original RPN proposals. This effectively simulates an increasing IOU threshold for the later heads.
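Returning to Soft-NMS: relative to the hard-NMS sketch shown earlier, the linear variant decays scores instead of dropping boxes. A self-contained sketch (the thresholds are illustrative; the paper also proposes a Gaussian decay):

```python
import numpy as np

def soft_nms_linear(boxes, scores, iou_thresh=0.5, score_thresh=0.001):
    """Linear Soft-NMS (a sketch): instead of discarding boxes that
    overlap the current top box, decay their scores by (1 - IoU)."""
    scores = scores.copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        i = max(idxs, key=lambda j: scores[j])   # most confident remaining box
        idxs.remove(i)
        keep.append(i)
        for j in idxs:
            overlap = _iou(boxes[i], boxes[j])
            if overlap > iou_thresh:
                scores[j] *= 1.0 - overlap       # the 'one line' change vs hard NMS
        idxs = [j for j in idxs if scores[j] >= score_thresh]
    return keep

def _iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```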

2017: Mask R-CNN - Mask R-CNN

Mask R-CNN is not a typical object detection network. It was designed to solve a challenging instance segmentation task, i.e., creating a mask for each object in the scene. However, Mask R-CNN is a great extension of the Faster R-CNN framework, and it in turn inspired object detection research. The main idea is to add a binary mask prediction branch after ROI pooling, alongside the existing bounding box and classification branches. Besides, to address the quantization error of the original ROI pooling layer, Mask R-CNN proposed a new ROI Align layer that uses bilinear image resampling under the hood. Unsurprisingly, both multi-task training (segmentation + detection) and the new ROI Align layer contribute some improvement on the bounding box benchmark.

2018: PANet - Path Aggregation Network for Instance Segmentation

Instance segmentation has a close relationship with object detection, so a new instance segmentation network often benefits object detection research indirectly as well. PANet aims at boosting information flow in the FPN neck of Mask R-CNN by adding a bottom-up path after the original top-down path. To visualize this change: the original FPN neck has a ↑↓ structure, and PANet makes it more like a ↑↓↑ structure before pooling features from multiple layers. Also, instead of having separate pooling for each feature layer, PANet added an "adaptive feature pooling" layer after Mask R-CNN's ROIAlign to merge (by element-wise max or sum) multi-scale features.

2019: NAS-FPN - NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection

PANet's success in adapting the FPN structure drew attention from a group of NAS researchers. They used a reinforcement learning method similar to that of the image classification network NASNet, and focused on searching for the best combination of multiple merging cells. Here, a merging cell is the basic building block of an FPN that merges any two input feature layers into one output feature layer. The final results proved the idea that FPN could use further optimization, but the complex computer-searched structure was too difficult for humans to understand.