[20240603_LabSeminar_Huy]TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking.pptx


About This Presentation

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking


Slide Content

Quang-Huy Tran, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea
E-mail: [email protected]
2024-06-03
TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking
Peng Chu et al., WACV 2023: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

OUTLINE
- MOTIVATION
- INTRODUCTION
- METHODOLOGY
- EXPERIMENT & RESULT
- CONCLUSION

MOTIVATION - Overview
Multiple object tracking:
- Input: an image sequence from a video.
- MOT tracker: feature representation, affinity estimation, and optimization for association.
- Output: trajectories of all targets across the whole video.

MOTIVATION - Overview
- Traditional methods use a strict threshold to track only the most confident detection candidates.
- Our method leverages low-confidence detection candidates for robust tracking in occluded scenes.

INTRODUCTION
- Proposes a novel spatial-temporal graph Transformer for MOT (TransMOT).
- Objects are arranged as a temporal series of sparse weighted graphs that are constructed using their spatial relationships within each frame.
- Handles a large and varying number of tracked targets and detection candidates during tracking.
- Using the graph representation, TransMOT encodes the features and spatial-temporal relationships of all tracked targets through its Spatial-Temporal Graph Transformer Encoder.
- TransMOT learns the spatial-temporal cues for association and directly generates the assignment matrix for MOT.
- By relying on these discriminative spatial-temporal cues and its capacity to model many candidates, TransMOT can associate candidates from a large number of loosely filtered detection predictions.

METHODOLOGY - Main Architecture of TransMOT

METHODOLOGY - Problem Setting
- At the t-th frame, TransMOT maintains a set of tracklets.
- Each tracklet maintains a set of states: its past locations and appearance features over the previous T image frames (see the sketch below).
- Given a new image frame, TransMOT determines whether any tracked objects are occluded, computes new locations for the existing tracklets, and generates new tracklets for new objects that enter the scene.
- At each frame, the detection and feature extraction sub-networks generate M_t candidate object detection proposals.
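As a rough illustration, the per-tracklet state described above could be organized as in the following sketch; the class and field names are illustrative assumptions, not the paper's notation.

from collections import deque
import numpy as np

class Tracklet:
    """State kept for one tracked target over the last T frames (illustrative sketch)."""

    def __init__(self, track_id, T=5):
        self.track_id = track_id
        self.boxes = deque(maxlen=T)     # past locations, e.g. [x1, y1, x2, y2] per frame
        self.features = deque(maxlen=T)  # past appearance feature vectors
        self.occluded = False            # whether the target is currently occluded

    def update(self, box, feature):
        # Append the newest location and appearance feature for this frame.
        self.boxes.append(np.asarray(box, dtype=float))
        self.features.append(np.asarray(feature, dtype=float))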

METHODOLOGY - Spatial-Temporal Graph Transformer Encoder
Spatial Graph Transformer Encoder:
- Input: the states of the tracklets over the past T frames.
- The adjacency matrix is defined by Intersection over Union (IoU); a minimal sketch of this construction follows below.
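A minimal sketch of building such an IoU-based adjacency matrix for the tracklet boxes of one frame; the function name and box format are assumptions made for illustration.

import numpy as np

def iou_adjacency(boxes):
    """Pairwise IoU between boxes, used as a sparse weighted adjacency matrix.

    boxes: (N, 4) array of [x1, y1, x2, y2] tracklet boxes in one frame.
    Entry (i, j) is the IoU of boxes i and j, so only spatially
    overlapping tracklets end up connected by a non-zero edge.
    """
    boxes = np.asarray(boxes, dtype=float)
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-9)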

METHODOLOGY - Spatial-Temporal Graph Transformer Encoder
- Node features are embedded through a source embedding layer (a linear layer) independently for each node, then passed into the spatial graph transformer encoder layer together.
- Self-attention weights are computed for each head; graph multi-head attention produces the weighted feature tensor.
- Temporal Graph Transformer Encoder: transposes the first two dimensions of the SGTE output and applies a standard Transformer model along the temporal dimension (see the sketch below).
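The following is a minimal PyTorch sketch of these two steps under my own simplifying assumptions: a single attention head whose scores are masked and re-weighted by the IoU adjacency, followed by the transpose-and-encode step of the temporal encoder. It is a sketch of the idea, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def graph_attention_head(x, adjacency, W_q, W_k, W_v):
    # x:         (N, d) embedded node features for one frame
    # adjacency: (N, N) IoU-based edge weights (zero means no edge)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    scores = scores.masked_fill(adjacency == 0, float('-inf'))  # keep graph sparsity
    attn = F.softmax(scores, dim=-1) * adjacency                # re-weight by edge strength
    return attn @ v                                             # weighted feature output

# Temporal encoder: transpose (T, N, d) -> (N, T, d) and run a standard
# Transformer encoder along the temporal dimension for each tracklet.
T, N, d = 5, 8, 32
spatial_out = torch.randn(T, N, d)                              # stand-in for the SGTE output
temporal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=1)
temporal_out = temporal_encoder(spatial_out.transpose(0, 1))    # (N, T, d)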

METHODOLOGY - Spatial Graph Transformer Decoder
- Produces the extended assignment matrix from the candidate graph and the STGTE output.
- The candidate graph is constructed in the same way as the tracklet graph.
- Graph multi-head attention is applied to the candidate graph.
- A virtual node is added to the STGTE output to handle candidates that are either false positives or need to initiate a new tracklet in the current frame t, forming an extended tracklet embedding (see the sketch below).
- A feed-forward layer and a normalization layer generate the output.
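A minimal sketch of the virtual-node extension step; the dimensions and the use of a learned parameter as the virtual embedding are assumptions for illustration only.

import torch
import torch.nn as nn

d = 32
N_tracklets = 6
tracklet_emb = torch.randn(N_tracklets, d)        # stand-in for the STGTE output
virtual_node = nn.Parameter(torch.zeros(1, d))    # learned virtual embedding (assumed)

# Extended tracklet embedding: candidates that are false positives or that
# should start a new tracklet can be matched against this extra row.
extended_emb = torch.cat([tracklet_emb, virtual_node], dim=0)   # (N_tracklets + 1, d)

# Feed-forward layer and normalization layer to generate the decoder output.
ffn = nn.Linear(d, d)
norm = nn.LayerNorm(d)
decoder_out = norm(ffn(extended_emb))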

METHODOLOGY - Training and Tracking Framework
- The constraints in finding the assignment adjacency matrix are relaxed for efficient computation: a tracklet is always associated with either a detection candidate or a virtual source.
- Each row of the assignment matrix can then be treated as a probability distribution over the detection candidates and virtual candidates, and the model is trained with a cross-entropy loss (a sketch follows below).
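A minimal sketch of this row-wise cross-entropy training objective; the shapes and variable names are illustrative assumptions.

import torch
import torch.nn.functional as F

N_tracklets, num_categories = 6, 11                  # detection candidates plus virtual candidates (assumed)
assignment_logits = torch.randn(N_tracklets, num_categories)        # one row per tracklet, from the decoder
gt_assignment = torch.randint(0, num_categories, (N_tracklets,))    # ground-truth column index per tracklet

# Each row is treated as a distribution over the categories and trained
# with cross entropy against the ground-truth assignment index.
loss = F.cross_entropy(assignment_logits, gt_assignment)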

EXPERIMENT AND RESULT - Experiment Settings
Measurements:
- Standard ID score metrics, which measure long-term ID consistency by comparing whole trajectories with the ground truth: ID precision (IDP), ID recall (IDR), and their IDF1 score.
- Multiple object tracking accuracy (MOTA), which combines bounding-box false positives (FP), false negatives (FN), and identity switches (IDS); its standard definition is given below.
- The percentage of mostly tracked targets (MT) and the percentage of mostly lost targets (ML).
Datasets: four standard MOT Challenge datasets for pedestrian tracking: MOT15, MOT16, MOT17, and MOT20.
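For reference, the standard MOTA definition combines these error counts into a single score:

MOTA = 1 - (FP + FN + IDS) / GT

where FP, FN, and IDS are summed over all frames and GT is the total number of ground-truth objects across all frames.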

EXPERIMENT AND RESULT - Result: Overall Performance

EXPERIMENT AND RESULT - Result: Visualization

CONCLUSION
- Proposed a novel spatial-temporal graph Transformer for multiple object tracking (TransMOT).
- By formulating the tracklets and candidate detections as a series of weighted graphs, the spatial and temporal relationships of the tracklets and candidates are explicitly modeled and leveraged.
- TransMOT not only achieves higher tracking accuracy, but is also more computationally efficient than conventional Transformer-based methods.