[NS][Lab_Seminar_240902]MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection.pptx
thanhdowork
Sep 02, 2024
MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection
Tien-Bach-Thanh Do
Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/09/02
Guiping Cao et al., IJCAI 2024
Introduction
Background: Object detection is essential in computer vision for identifying bounding boxes and categories in images. Traditional detectors rely on convolutional networks and complex pipelines. Transformer-based detectors like DETR simplify these pipelines but face issues with query distribution and box-sensitive category predictions.
Problem: DETR-like models suffer from box-sensitive category predictions and an imbalanced spatial distribution of queries.
Introduction: Problems
Figure 1: Comparison of AP across models with respect to training epochs and number of parameters on COCO val2017. MLP-DINO achieves the best performance under different training epoch settings.
Introduction: Problems
Figure 2: Models with (-w-) QICS achieve both higher category prediction scores and better performance than models without (-wo-) QICS on COCO val2017. The DINO and MLP-DINO models use ResNet50 and Strip-MLP-TSWMP as the backbone, respectively.
Introduction: Problems
Figure 3: Example of different spatial distributions of 20 query points on 5 objects (4 classes, marked with different shapes) within an image. (a) Queries are selected only by predicted confidence score, so many query points (in red) cluster on a single object (e.g. the one indicated by the rectangle). (b) By representing query points as nodes in a graph, spatial information is incorporated into the query selection process, yielding an updated distribution of query points (in black) and a higher hit-rate of queries to objects, as presented in Table 5.
Proposed model: Key Innovations
Query-Independent Category Supervision (QICS): decouples category prediction from bounding box regression.
Deep MLP Backbone: integrates an MLP-based model into a transformer framework to handle both long-range and short-range information.
Graph-based Query Selection (GQS): uses a graph representation to balance the spatial distribution of queries, improving the query hit-rate.
Method: Overall
Figure 4: The overall architecture of the proposed MLP-DINO. b1 ∼ b4 and e1 ∼ e4 represent the output features from the deep MLP backbone and the transformer encoder, respectively. These features are at different levels and resolutions.
Method: Query-Independent Category Supervision (QICS)
Motivation: Category prediction in DETR-like models is often influenced by the accuracy of bounding box predictions.
QICS approach: A global classification module (GCM) predicts categories independently of the queries. It applies global average pooling to backbone features to decouple category prediction from the box regression process.
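The pooling-then-classify idea can be sketched in a few lines. This is a minimal illustration, not the paper's GCM: the function names, the single linear layer, the sigmoid (multi-label) output, and the toy weights are all assumptions made for the example.

```python
import math

def global_average_pool(feature_map):
    """Average each channel of a C x H x W feature map (nested lists) to one value."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def global_classification(feature_map, weights, biases):
    """Hypothetical global classification head: pooled features -> per-class scores.

    weights has shape num_classes x C. No query or box is involved, so the
    category scores cannot depend on box regression quality.
    The sigmoid (multi-label) output is an assumption of this sketch.
    """
    pooled = global_average_pool(feature_map)
    scores = []
    for w_row, b in zip(weights, biases):
        logit = sum(w * x for w, x in zip(w_row, pooled)) + b
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores

# Toy backbone feature: 2 channels on a 2 x 2 spatial grid.
features = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel mean 4.0
    [[0.0, 0.0], [2.0, 2.0]],  # channel mean 1.0
]
weights = [[0.5, -1.0], [-0.25, 1.0]]  # 2 classes x 2 channels (made-up values)
biases = [0.0, 0.0]
scores = global_classification(features, weights, biases)
```

Because the head sees only image-level pooled features, it can be supervised with the set of categories present in the image, independently of any query.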
Method: Graph-based Query Selection (GQS)
Motivation: Standard query selection methods often lead to clustered queries, reducing the hit-rate on objects.
GQS approach: Represent query points as nodes in a graph. Incorporate spatial information through a Coefficient of Variation (CV) based metric to distribute queries more effectively. Improve the query hit-rate by ensuring queries cover a broader range of potential objects.
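The slides do not spell out the selection rule or the exact CV metric, so the following is only a sketch of the general idea under stated assumptions. Here the coefficient of variation (standard deviation divided by mean) is computed over per-cell query counts on a hypothetical grid, and the spatial selection is a simple greedy rule that treats two points closer than `min_dist` as connected by an edge and never picks two connected nodes. Both choices are stand-ins, not the paper's algorithm.

```python
import math

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean

def cell_count_cv(points, image_size, grid):
    """CV of query counts per grid cell (assumed metric): lower means more balanced."""
    cell = image_size / grid
    counts = [0] * (grid * grid)
    for x, y in points:
        col = min(int(x // cell), grid - 1)
        row = min(int(y // cell), grid - 1)
        counts[row * grid + col] += 1
    return coefficient_of_variation(counts)

def select_top_k(scores, k):
    """Baseline: keep the k most confident queries, ignoring position."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def select_spread(points, scores, k, min_dist):
    """Greedy stand-in: visit by descending score, skip any candidate that
    shares an edge (distance < min_dist) with an already chosen node."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if all(math.dist(points[i], points[j]) >= min_dist for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen

# Toy candidate pool: three high-score points on one object, three elsewhere.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0)]
scores = [0.9, 0.85, 0.8, 0.5, 0.4, 0.3]

top_k = select_top_k(scores, k=3)
spread = select_spread(points, scores, k=3, min_dist=1.0)
cv_top_k = cell_count_cv([points[i] for i in top_k], image_size=6.0, grid=2)
cv_spread = cell_count_cv([points[i] for i in spread], image_size=6.0, grid=2)
```

On this toy pool, score-only selection keeps the three clustered points, while the spatial rule trades two of them for lower-scoring points on other regions, which lowers the CV of the per-cell counts.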
Method: SWMP for Deep MLP
Challenge: MLP models are sensitive to input image size because they rely on fully-connected layers.
SWMP solution: The Sharing Weight on Mini-Patch (SWMP) method allows MLP models to handle images of arbitrary size. It crops the image into mini-patches, applies MLP layers with shared weights to each patch, and reassembles the image.
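The crop, apply, reassemble steps can be illustrated with a single shared fully-connected layer. This is a simplified sketch, assuming a single-channel image whose sides are divisible by the patch size; the function name and the toy weight matrix are hypothetical, and a real backbone would use learned weights and more layers.

```python
def apply_shared_mlp_on_patches(image, weight, patch):
    """Apply one fully-connected layer, with the same weights, to every
    patch x patch mini-patch of a single-channel image (nested lists).

    weight has shape (patch*patch) x (patch*patch), so its size depends only
    on the mini-patch size, never on the full image size.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    output = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Crop: flatten one mini-patch into a token vector.
            tokens = [image[top + r][left + c]
                      for r in range(patch) for c in range(patch)]
            # Apply: the shared fully-connected layer mixes the tokens.
            mixed = [sum(w * t for w, t in zip(row, tokens)) for row in weight]
            # Reassemble: write the result back to the same location.
            for idx, value in enumerate(mixed):
                output[top + idx // patch][left + idx % patch] = value
    return output

# Toy weights: 2 * identity, chosen only so the result is easy to check.
patch = 2
weight = [[2.0 if i == j else 0.0 for j in range(patch * patch)]
          for i in range(patch * patch)]

small = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0]]                              # 2 x 4 image
large = [[float(r * 6 + c) for c in range(6)] for r in range(4)]  # 4 x 6 image

out_small = apply_shared_mlp_on_patches(small, weight, patch)
out_large = apply_shared_mlp_on_patches(large, weight, patch)
```

The same `weight` matrix processes both the 2 x 4 and the 4 x 6 image, which is the property that removes the dependence on a fixed input size.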
Experiments: Main Results
Table 1: Comparison of MLP-DINO with other popular detection models under different backbones and training epochs on COCO val2017. The DINO and MLP-DINO models adopt 4 scales of feature maps from the backbone network.
Experiments: Ablation Study - Ablation on All Components of MLP-DINO
Table 2: Ablation results on all components of MLP-DINO.
Experiments: Ablation Study - Ablation on Different Backbones in DINO
Table 4: Ablation comparison of different backbones in DINO. Types S and L denote short-range and long-range information built by the model, respectively. The Strip-MLP-T model has fewer parameters and FLOPs than ResNet50 and Swin-T.
Experiments: Ablation Study - Effectiveness of GQS on Hit-Rate
Table 5: Ablations of GQS on different backbone models.
Experiments: Ablation Study - Ablation of QICS on Different Features
Table 6: Ablation results of applying QICS to different parts of the backbone features and encoder features of MLP-DINO.
Experiments: Ablation Study - Ablation of the Queries Pool Size for GQS
Table 7: Ablation results for different query candidate pool sizes λ for GQS. The backbone model is Strip-MLP-SWMP.
Conclusion
MLP-DINO integrates MLPs into transformer-based detection models, enhancing category prediction and query distribution.
The QICS and GQS methods significantly improve object detection performance.
Future directions: explore other tasks such as segmentation, and conduct further research on improving transformer-based models using MLPs.