[NS][Lab_Seminar_250317]DriveLM: Driving with Graph Visual Question Answering.pptx

thanhdowork, 24 slides, Mar 18, 2025

About This Presentation

DriveLM: Driving with Graph Visual Question Answering


Slide Content

DriveLM: Driving with Graph Visual Question Answering. Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: osfa19730@catholic.ac.kr. 2025/03/17. Chonghao Sima et al., ECCV 2024

Introduction
- DriveLM: a new task, dataset, metrics, and baseline for end-to-end autonomous driving
- Inspired by multi-step human reasoning in driving
- Introduces Graph VQA (GVQA) as a proxy task: GVQA models reasoning as a series of interconnected question-answer pairs (QAs)
- Key idea: capture logical dependencies in driving decisions using a graph structure
- Limitations of previous approaches: limited generalization, restricted human interaction, insufficient reasoning complexity, data annotation limitations, challenges in open-loop planning, efficiency constraints, and driving-specific input limitations

Introduction Figure 1: DriveLM considers GVQA, where question-answer pairs are interconnected via logical dependencies at the object level and task level

Graph Construction: Node. Nodes represent individual question-answer pairs. Each QA v = (q, a) is associated with key objects in the scene. Example: Question: "What are the important objects in the current scene?" Answer: "There is a moving car to the front left of the ego car." Fig. 3: DriveLM-Agent Pipeline. Given the scene image, a VLM performs prompting with context to model the logical dependency among the five QA stages. Context is built using preceding QAs, and can have one or more sources.

Graph Construction: Edge. A directed edge connects a parent QA v_p to a child QA v_c. The edge e = (v_p, v_c) indicates that the answer of v_p provides context for answering v_c. Two types of edges:
- Object-level: interactions between different objects (e.g. planning for a sedan depends on perception of a pedestrian)
- Task-level: the logical chain of reasoning stages
Fig. 3: DriveLM-Agent Pipeline. Given the scene image, a VLM performs prompting with context to model the logical dependency among the five QA stages. Context is built using preceding QAs, and can have one or more sources.
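The node and edge definitions above can be sketched as plain data structures. This is a minimal illustration of the GVQA graph, assuming nothing about the DriveLM codebase; all class and field names here are my own.

```python
from dataclasses import dataclass, field

# Sketch of the GVQA graph: nodes are question-answer pairs, and a
# directed edge means the parent's answer is context for the child.

@dataclass
class QANode:
    question: str
    answer: str
    key_objects: list = field(default_factory=list)  # objects this QA refers to

@dataclass
class QAEdge:
    parent: "QANode"  # the answer of this node ...
    child: "QANode"   # ... is context for answering this one
    kind: str         # "object-level" or "task-level"

# Example: a perception QA feeding a prediction QA via a task-level edge
p1 = QANode("What are the important objects in the current scene?",
            "There is a moving car to the front left of the ego car.",
            key_objects=["car_front_left"])
p2 = QANode("What is the future state of the car to the front left?",
            "It will keep moving forward in its lane.",
            key_objects=["car_front_left"])
edge = QAEdge(parent=p1, child=p2, kind="task-level")
```

An object-level edge would be built the same way, with `kind="object-level"` and QAs about two different objects.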

Model: Task-level Edges are Reasoning Stages (P1-M)
- Perception (P1): identification, description, and localization of key objects in the current scene. E.g. "What are the important objects?", "What is the moving status of object X?"
- Prediction (P2): estimation of possible actions/interactions of key objects based on perception results. E.g. "What is the future state of object X?", "Would object X be in the moving direction of the ego vehicle?"
- Planning (P3): possible safe actions of the ego vehicle. E.g. "What actions could the ego vehicle take?", "In this scenario, what are safe actions to take?"
- Behavior (B): classification of the driving decision, aggregating information from P1-P3. E.g. "Accelerate and go ahead, change to the left lane"
- Motion (M): waypoints of the ego vehicle's future trajectory as continuous (x, y) coordinates. The motion M is defined as the ego vehicle's future trajectory, a set of N points in bird's-eye view (BEV): M = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}. The displacement in x, y at each time interval is (x_t - x_{t-1}, y_t - y_{t-1}).
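The motion representation above can be sketched in a few lines; the function and variable names are illustrative, not from the DriveLM codebase.

```python
# Sketch of the motion M: a list of N future (x, y) waypoints in
# bird's-eye view, and the per-step displacement between consecutive
# waypoints at each time interval.

def step_displacements(waypoints):
    """Return (dx, dy) between consecutive BEV waypoints."""
    return [(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(waypoints, waypoints[1:])]

M = [(0, 0), (1, 2), (2, 4)]     # (x, y) at each future time step
print(step_displacements(M))      # → [(1, 2), (1, 2)]
```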

DriveLM-Agent: A GVQA Baseline. Prompting with Context. DriveLM-Agent uses contextual prompting to implement logical dependencies: for an edge, the QA from the parent node is appended to the question of the child node as context. Context can include QAs from multiple preceding nodes (concatenated), allowing the model to use prior reasoning to answer subsequent questions.
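The contextual-prompting step can be sketched as simple string assembly. The exact prompt wording is an assumption on my part, not the prompt used by DriveLM-Agent.

```python
# Sketch of "prompting with context": the parent QAs are concatenated
# and prepended to the child question before querying the VLM.

def build_prompt(child_question, parent_qas):
    """parent_qas: list of (question, answer) pairs from preceding nodes."""
    context = " ".join(f"Q: {q} A: {a}" for q, a in parent_qas)
    return f"Context: {context}\nQuestion: {child_question}"

prompt = build_prompt(
    "What actions could the ego vehicle take?",
    [("What are the important objects in the current scene?",
      "There is a moving car to the front left of the ego car.")],
)
```

The resulting string would then be passed, together with the scene image, to the VLM.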

DriveLM VLMs for end-to-end driving Table 2: Open-loop Planning on DriveLM-nuScenes and Zero-shot Generalization across Sensor Configurations on Waymo. Fig. 4: Qualitative Results of DriveLM-Agent. (Left) DriveLM-nuScenes val frame, (Right) Waymo val frame.

DriveLM Instantiation in DriveLM-Data Fig. 2: Annotation Pipeline

DriveLM Instantiation in DriveLM-Data Table 1: Comparison of DriveLM-nuScenes & -CARLA with Existing Datasets. DriveLM-Data significantly advances annotation quantity, comprehensiveness (covering perception, prediction and planning), and logic (chain to graph). † full dataset, ‡ keyframe dataset, ∗ semi-rule-based labeling (w/ human annotators), ∗∗ fully-rule-based (no human annotators). - means publicly unavailable.

DriveLM Instantiation in DriveLM-Data (nuScenes). DriveLM-nuScenes: annotated QAs arranged in a graph, linking images with driving behavior. Semi-rule-based annotation pipeline with human annotators: key frames and objects are selected, QAs are annotated for Perception, Prediction, and Planning, and rigorous multi-round quality checks are applied.

DriveLM Instantiation in DriveLM-Data (CARLA) DriveLM-CARLA: Automatically generated frame-level QAs in an interconnected graph Fully rule-based QA labeling pipeline using PDM-Lite expert algorithm Focus on road layout, stop signs, traffic lights, and vehicles Logical relationships connect the series of QAs into a graph Allows for scalable data generation

DriveLM Inference Flexibility. The size and structure of the graph can be adapted at inference time: the model is trained on all available QAs, but inference can be performed on specific subgraphs. Questions can be sampled using heuristics based on the task or compute budget.
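One way to picture this inference-time flexibility is a budget-based heuristic that keeps only a prefix of the reasoning stages. The uniform per-stage cost here is an assumption for illustration; DriveLM does not prescribe this particular heuristic.

```python
# Sketch of subgraph sampling: take stages in the fixed reasoning order
# P1 -> P2 -> P3 -> B -> M until the compute budget (one unit per stage,
# an illustrative assumption) runs out.

STAGES = ["P1", "P2", "P3", "B", "M"]

def sample_subgraph(budget):
    """Return the stages that fit within the given budget."""
    return STAGES[:max(0, min(budget, len(STAGES)))]
```

With a budget of 3, only Perception, Prediction, and Planning would be queried; a larger budget includes the Behavior and Motion stages as well.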

DriveLM Metrics: Graph Visual QA uses the GPT Score. The GPT Score measures the semantic alignment of answers and complements the SPICE metric: the question, the ground-truth answer, the predicted answer, and a prompt asking for a numerical score of the answer are sent to ChatGPT-3.5.
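Assembling the GPT-score query can be sketched as below. The grading instruction and 0-100 scale are my own illustrative wording, not the exact prompt used in the paper, and the actual call to ChatGPT-3.5 is omitted.

```python
# Sketch of the GPT-score query: pack the question, ground-truth answer,
# predicted answer, and a scoring instruction into a single prompt.

def gpt_score_prompt(question, gt_answer, pred_answer):
    return (
        "Rate how well the predicted answer matches the reference, 0-100.\n"
        f"Question: {question}\n"
        f"Reference answer: {gt_answer}\n"
        f"Predicted answer: {pred_answer}\n"
        "Reply with a single number."
    )
```

The returned string would be sent to ChatGPT-3.5, and the numeric reply parsed as the score.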

DriveLM Metrics: trajectory prediction uses displacement error. Average Displacement Error (ADE): the average L2 distance between the predicted trajectory and the ground-truth trajectory over all predicted horizons. Final Displacement Error (FDE): the Euclidean distance between the predicted endpoint and the true endpoint at the last predicted step. The Collision Rate accounts for the ratio of test frames in which the predicted trajectory collides with objects, over all test frames. For generalization, a zero-shot setting is used.
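These three metrics follow standard definitions and can be sketched directly; the function names are mine, not from any DriveLM evaluation code.

```python
import math

# Minimal sketches of the trajectory metrics: ADE, FDE, and collision rate.

def ade(pred, gt):
    """Average L2 distance over all predicted horizons."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def fde(pred, gt):
    """Euclidean distance between predicted and true endpoints."""
    return math.dist(pred[-1], gt[-1])

def collision_rate(collided_flags):
    """Fraction of test frames whose predicted trajectory hits an object."""
    return sum(collided_flags) / len(collided_flags)
```

For example, a prediction that drifts 2 m off at the last of two steps has ADE 1.0 and FDE 2.0; one collision in four frames gives a collision rate of 0.25.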

Experiments Questions How can VLMs be effectively repurposed for end-to-end autonomous driving? Can VLMs for driving generalize when evaluated with unseen sensor setups? What is the effect of each question in perception, prediction, and planning on the final behavior decision? How well do VLMs perform perception, prediction, and planning via GVQA?

Experiments Setup. All fine-tuning is implemented with LoRA. For DriveLM-nuScenes, BLIP-2 is fine-tuned.

Experiments Main results Table 3: Question-wise analysis in DriveLM-nuScenes

Experiments Main results Table 4: Baseline P1−3 Results. DriveLM-Agent and the zero-shot BLIP-2 benefit from a step-wise reasoning procedure given by our graph structure.

Experiments Qualitative result

Conclusion DriveLM utilizes a graph structure to model multi-step reasoning in autonomous driving Nodes as QAs, edges as logical dependencies at object and task levels Contextual prompting leverages prior answers for subsequent questions DriveLM-Data provides instantiations of this graph structure Offers a promising approach for improved generalization and interactivity in AD systems