[NS][Lab_Seminar_240902]MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection.pptx

thanhdowork 59 views 18 slides Sep 02, 2024


Slide Content

MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection
Tien-Bach-Thanh Do
Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: osfa19730@catholic.ac.kr
2024/09/02
Guiping Cao et al., IJCAI 2024

Introduction
Background:
  Object detection is essential in computer vision (CV) for identifying bounding boxes and categories in images.
  Traditional detectors rely on convolutional networks and complex pipelines.
  Transformer-based detectors such as DETR simplify these pipelines but face issues with query distribution and box-sensitive category predictions.
Problem:
  DETR-like models suffer from box-sensitive category predictions and an imbalanced spatial distribution of queries.

Introduction: Problems
Figure 1: AP of different models with respect to training epochs and parameter count on COCO val2017. MLP-DINO achieves the best performance across the different training-epoch settings.

Introduction: Problems
Figure 2: Models with (-w-) QICS achieve both higher category-prediction scores and better detection performance than models without (-wo-) QICS on COCO val2017. The DINO and MLP-DINO models use ResNet50 and Strip-MLP-TSWMP as backbones, respectively.

Introduction: Problems
Figure 3: Example of different spatial distributions of 20 query points on 5 objects (4 classes, marked with different shapes) within an image. (a) Queries are selected only by predicted confidence score, so many query points (in red) cluster on a single object (e.g. the one indicated by the rectangle). (b) By representing query points as nodes in a graph, spatial information is incorporated into the query selection process, yielding an updated distribution of query points (in black) and a higher hit-rate of queries to objects, as presented in Table 5.

Proposed model: Key Innovations
Query-Independent Category Supervision (QICS): decouples category prediction from bounding box regression.
Deep MLP Backbone: integrates an MLP-based model into a transformer framework to handle both long-range and short-range information.
Graph-based Query Selection (GQS): uses a graph representation to balance the spatial distribution of queries, improving the query hit-rate.

Method: Overall
Figure 4: The overall architecture of the proposed MLP-DINO. b1 ∼ b4 and e1 ∼ e4 represent the output features from the deep MLP backbone and the transformer encoder, respectively. These features are at different levels and resolutions.

Method: Query-Independent Category Supervision (QICS)
Motivation:
  Category prediction in DETR-like models is often influenced by the accuracy of the bounding box predictions.
QICS approach:
  A global classification module (GCM) predicts categories independently of the queries.
  Global average pooling on backbone features decouples category prediction from the box regression process.
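
The pooling-then-classify idea behind the GCM can be illustrated with a minimal sketch. It is not the authors' implementation: the single linear layer, the sigmoid (multi-label) output, and the names `global_average_pool` and `global_classification` are assumptions made for illustration, and the weights are toy values.

```python
import math


def global_average_pool(feature_map):
    """Collapse a [C][H][W] feature map into a length-C vector (mean over H and W)."""
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_classification(feature_map, weights, biases):
    """Hypothetical GCM head: pooled features -> linear layer -> per-category probability.

    No query or box is involved, so the category scores cannot depend on box accuracy.
    """
    pooled = global_average_pool(feature_map)
    probs = []
    for w_row, b in zip(weights, biases):
        logit = sum(w * p for w, p in zip(w_row, pooled)) + b
        probs.append(sigmoid(logit))
    return probs


# Toy input: 2 channels of size 2x2, and 3 hypothetical categories.
features = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel mean 4.0
    [[0.0, 0.0], [2.0, 2.0]],  # channel mean 1.0
]
weights = [[0.5, -1.0], [0.0, 0.0], [-1.0, 2.0]]  # assumed toy weights
biases = [0.0, 0.0, 0.0]

pooled = global_average_pool(features)
probs = global_classification(features, weights, biases)
print(pooled, probs)
```

Because the pooled vector summarizes the whole image, the resulting category probabilities act as an image-level supervision signal that is separate from the per-query box regression.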

Method: Graph-based Query Selection (GQS)
Motivation:
  Standard query selection methods often lead to clustered queries, reducing the hit-rate on objects.
GQS approach:
  Represent query points as nodes in a graph.
  Incorporate spatial information through a Coefficient of Variation (CV) based metric to distribute queries more effectively.
  Improve the query hit-rate by ensuring that queries cover a broader range of potential objects.
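
The effect of spatially aware selection can be sketched on toy data. The slides do not give the exact GQS algorithm or CV metric, so everything below is a stand-in under stated assumptions: candidates are treated as nodes of a complete graph with Euclidean edge lengths, selection is a greedy score-weighted farthest-point rule (a swapped-in technique, not the paper's), the CV is computed over per-cell query counts, and hit-rate is taken to be the fraction of object boxes containing at least one selected query. All function names are hypothetical.

```python
import math


def top_k_by_score(candidates, k):
    """Baseline: keep the k candidates (x, y, score) with the highest score."""
    return sorted(candidates, key=lambda c: c[2], reverse=True)[:k]


def greedy_spatial_select(candidates, k, pool_size):
    """Assumed stand-in for GQS: from the top pool_size candidates, repeatedly
    pick the one maximizing score * distance to its nearest selected node."""
    pool = top_k_by_score(candidates, pool_size)
    selected = [pool.pop(0)]  # start from the most confident candidate
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda c: c[2] * min(math.dist(c[:2], s[:2]) for s in selected),
        )
        pool.remove(best)
        selected.append(best)
    return selected


def cell_count_cv(points, image_size, grid):
    """Coefficient of variation (std / mean) of query counts per grid cell.
    Lower values mean a more balanced spatial distribution."""
    cell = image_size / grid
    counts = [0] * (grid * grid)
    for x, y, _ in points:
        col = min(int(x // cell), grid - 1)
        row = min(int(y // cell), grid - 1)
        counts[row * grid + col] += 1
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    std = math.sqrt(sum((n - mean) ** 2 for n in counts) / len(counts))
    return std / mean


def hit_rate(points, boxes):
    """Assumed definition: fraction of boxes (x0, y0, x1, y1) hit by >= 1 query."""
    hits = sum(
        any(x0 <= x <= x1 and y0 <= y <= y1 for x, y, _ in points)
        for x0, y0, x1, y1 in boxes
    )
    return hits / len(boxes)


# Toy 16x16 image with two objects; confident candidates cluster on the first.
boxes = [(0, 0, 4, 4), (10, 10, 14, 14)]
candidates = [
    (1, 1, 0.90), (2, 1, 0.88), (1, 2, 0.86), (2, 2, 0.84),
    (11, 11, 0.60), (12, 12, 0.55),
]

baseline = top_k_by_score(candidates, k=2)
spread = greedy_spatial_select(candidates, k=2, pool_size=6)

print(hit_rate(baseline, boxes), cell_count_cv(baseline, 16, 2))
print(hit_rate(spread, boxes), cell_count_cv(spread, 16, 2))
```

In this toy case the confidence-only baseline places both queries on one object, while the spatially aware rule covers both objects and yields a lower CV of per-cell counts. The `pool_size` argument plays the role of a candidate pool from which queries are drawn.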

Method: SWMP for Deep MLP
Challenge:
  MLP models are sensitive to input image size because they rely on fully-connected layers.
SWMP solution:
  The Sharing Weight on Mini-Patch (SWMP) method allows MLP models to handle images of arbitrary size.
  Crop the image into mini-patches, apply MLP layers with shared weights to each patch, and reassemble the image.
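
The crop, shared-weight, reassemble sequence can be sketched as follows. This is a simplified illustration rather than the Strip-MLP layers used in the paper: a single linear layer stands in for the MLP, the image is a single-channel 2D list, its height and width are assumed to be divisible by the patch size `p`, and the function names are hypothetical.

```python
def split_into_patches(image, p):
    """Crop an H x W image into p x p mini-patches, each flattened to length p*p."""
    h, w = len(image), len(image[0])
    assert h % p == 0 and w % p == 0, "sketch assumes H and W are divisible by p"
    patches = {}
    for top in range(0, h, p):
        for left in range(0, w, p):
            patches[(top, left)] = [
                image[top + i][left + j] for i in range(p) for j in range(p)
            ]
    return patches


def apply_shared_layer(flat_patch, weight):
    """One fully-connected layer (p*p -> p*p); the same weight is used for every patch."""
    return [sum(w * v for w, v in zip(row, flat_patch)) for row in weight]


def swmp(image, weight, p):
    """Apply the shared layer to each mini-patch and reassemble the full image."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for (top, left), flat in split_into_patches(image, p).items():
        mixed = apply_shared_layer(flat, weight)
        for idx, value in enumerate(mixed):
            out[top + idx // p][left + idx % p] = value
    return out


# Toy weight: every output element is the mean of its 2x2 patch.
p = 2
weight = [[0.25] * (p * p) for _ in range(p * p)]

# The same 4x4 weight matrix handles two different image sizes.
image_4x4 = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
image_2x6 = [[0, 2, 4, 4, 1, 1], [2, 4, 4, 4, 1, 1]]

out_4x4 = swmp(image_4x4, weight, p)
out_2x6 = swmp(image_2x6, weight, p)
print(out_4x4)
print(out_2x6)
```

The size of the fully-connected weight depends only on the mini-patch size, not on the image size, which is why the same layer can process both inputs.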

Experiments: Experimental Settings
Dataset: COCO2017 (118k training images, 5k validation images)
Optimizer: AdamW
Weight decay: 1 × 10^-4
Batch size: 8

Experiments: Main Results
Table 1: Comparison of MLP-DINO with other popular detection models under different backbones and training epochs on COCO val2017. The DINO and MLP-DINO models use 4 scales of feature maps from the backbone network.

Experiments: Ablation Study - Ablation on All Components of MLP-DINO
Table 2: Ablation results on all components of MLP-DINO.

Experiments: Ablation Study - Ablation on Different Backbones in DINO
Table 4: Ablation comparison of different backbones in DINO. The types S and L denote short-range and long-range information built by the model, respectively. The Strip-MLP-T model has fewer parameters and FLOPs than ResNet50 and Swin-T.

Experiments: Ablation Study - Effectiveness of GQS on Hit-Rate
Table 5: Ablations of GQS on different backbone models.

Experiments: Ablation Study - Ablation of QICS on Different Features
Table 6: Ablation results of applying QICS to different parts of the backbone features and encoder features of MLP-DINO.

Experiments: Ablation Study - Ablation of the Query Pool Size for GQS
Table 7: Ablation results for different query candidate pool sizes λ in GQS. The backbone model is Strip-MLP-SWMP.

Conclusion
MLP-DINO integrates MLPs into transformer-based detection models, enhancing category prediction and query distribution.
The QICS and GQS methods significantly improve object detection performance.
Future directions:
  Explore other tasks such as segmentation.
  Further research on improving transformer-based models using MLPs.