“Depth Estimation from Monocular Images Using Geometric Foundation Models,” a Presentation from Toyota Research Institute

Published by embeddedvision · 33 slides · Oct 15, 2025

About This Presentation

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2025/10/depth-estimation-from-monocular-images-using-geometric-foundation-models-a-presentation-from-toyota-research-institute/

Rareș Ambruș, Senior Manager for Large Behavior Models at Toyota Research Institute...


Slide Content

© Copyright TRI
Depth Estimation from
Monocular Images Using
Geometric Foundation Models
Rareș Ambruș, PhD
Senior Manager
Large Behavior Models
1

Introduction
Mono-Depth
MultiView-Depth
Conclusion
Outline
2

TRI envisions a future where
Toyota products, enabled by
TRI technology, dramatically
improve quality of life for
individuals and society.
Mission
3

https://www.youtube.com/watch?v=kKz2mtYdQAA
4

Learning Robust 3D Perception from Cameras
Single RGB Image Predicted Depth Image
MonoDepth
Network
5

Introduction
Mono-Depth
MultiView-Depth
Conclusion
Outline
6

Zero-shot on any domain
without fine-tuning
(appearance gap)
Scale-aware (metric)
predictions on any camera
(geometric gap)
Model uncertainty
Monocular Depth Estimation
7

Large-scale diverse pre-training
→ Diffusion models
Challenges
(Diagram: data is progressively noised; a learned model denoises it back)
8
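The diffusion recipe sketched above (repeatedly add Gaussian noise, then learn to denoise) can be illustrated with a standard DDPM-style forward process. The linear noise schedule below is an assumption for illustration, not TRI's configuration:

```python
import numpy as np

def add_noise(x0, t, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (illustrative assumption)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

x0 = rng.standard_normal((8, 8))     # stand-in for a depth map
x_t, eps = add_noise(x0, t=T - 1, alpha_bar=alpha_bar, rng=rng)
# At t = T-1 almost no signal remains; a denoiser is trained to predict eps.
```

Training then reduces to regressing the injected noise `eps` from `x_t`, which is what the "Learned Denoising" arrow in the diagram stands for.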

Challenges
Prompt
"A photo of a white robot in tall grass staring at clouds in the blue sky"
9

Large-scale diverse pre-training
→ Diffusion models
Challenges
(Diagram: data is progressively noised; a learned model denoises it back)
Sparse, unstructured training labels
→ Pixel-level generation
Geometric domain gap
→ Condition on camera information
10

Geometric RIN (GRIN): Efficient Pixel-Level Diffusion with Sparse Labels
Vitor Guizilini, Pavel Tokmakov, Achal Dave, Rares Ambrus, 3DV'25 (Oral)
Architecture: GRIN (Geometric Recurrent Interface Networks)
Local conditioning with visual features + 3D geometric embeddings
Global conditioning with dense features to preserve scene-level information
11

Input: RGB image + camera intrinsics
Ground-truth: Sparse depth maps
Input Embeddings
12
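A common way to turn camera intrinsics into per-pixel geometric embeddings is to compute each pixel's viewing ray, r = K⁻¹ [u, v, 1]ᵀ. This is a plausible sketch of the idea, not necessarily the exact embedding GRIN uses:

```python
import numpy as np

def pixel_rays(K, h, w):
    """Per-pixel viewing rays r = K^-1 [u, v, 1]^T, unit-normalized."""
    u, v = np.meshgrid(np.arange(w), np.arange(h))          # pixel grids (h, w)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, h*w)
    rays = np.linalg.inv(K) @ pix                           # back-project pixels
    rays /= np.linalg.norm(rays, axis=0, keepdims=True)     # unit length
    return rays.T.reshape(h, w, 3)

# Hypothetical intrinsics: focal length 500 px, principal point (32, 24).
K = np.array([[500.0, 0.0, 32.0],
              [0.0, 500.0, 24.0],
              [0.0, 0.0, 1.0]])
rays = pixel_rays(K, h=48, w=64)   # one camera-aware 3-vector per pixel
```

Because the rays depend on K, the same network sees different embeddings for different cameras, which is what lets the model condition on camera geometry.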

Input Embeddings
Geometric embeddings and image embeddings, used both globally and locally
13

Local conditioning
Local conditioning: Image + geometric embeddings
Sparse information from valid pixels (log-encoded depth)
14
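Since the ground truth is sparse, local conditioning only draws tokens from valid pixels, with depth values log-encoded. A minimal sketch (the token format here is hypothetical):

```python
import numpy as np

def local_depth_tokens(sparse_depth):
    """Gather valid pixels of a sparse depth map and log-encode them.

    Invalid pixels are marked with 0; returns per-token (row, col) coordinates
    and log-encoded depth values.
    """
    v, u = np.nonzero(sparse_depth > 0)       # coordinates of valid pixels
    log_d = np.log(sparse_depth[v, u])        # log-encoded depth
    return np.stack([v, u], axis=-1), log_d

depth = np.zeros((4, 4))
depth[1, 2] = 5.0                             # only two valid measurements
depth[3, 0] = 10.0
coords, log_d = local_depth_tokens(depth)
```

The key point is that unlabeled pixels never enter the loss or the conditioning set, so arbitrarily sparse LiDAR-style labels can be used directly.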

Global conditioning
Global conditioning: Image + geometric embeddings
Dense information from the entire image
15

Local and global tokens are concatenated
RIN denoising to generate local predictions
The GRIN Architecture
16

*Jabri et al. Scalable Adaptive Computation for Iterative Generation, ICML 2023
Efficient Diffusion
Recurrent Interface Networks (RIN*)
Read: Input tokens are projected (cross-attention) onto a fixed-dimensional latent space
Compute: Self-attention is performed in this latent space
Write: The processed latent space is written back (cross-attention) into the input tokens
17
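The read/compute/write cycle above can be sketched with plain attention in NumPy (single head, no learned projections, purely illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention (no projection matrices, for brevity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def rin_block(tokens, latents):
    """One RIN block: read (cross-attn), compute (self-attn), write (cross-attn)."""
    latents = latents + attend(latents, tokens, tokens)    # read: tokens -> latents
    latents = latents + attend(latents, latents, latents)  # compute in latent space
    tokens = tokens + attend(tokens, latents, latents)     # write: latents -> tokens
    return tokens, latents

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4096, 64))   # many input tokens (e.g., pixels)
latents = rng.standard_normal((128, 64))   # small, fixed-size latent set
tokens, latents = rin_block(tokens, latents)
```

The efficiency argument is visible in the shapes: quadratic self-attention runs over 128 latents rather than 4096 pixel tokens, so cost grows only linearly with image size through the read/write cross-attentions.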

Experimental Results
State-of-the-art zero-shot metric depth estimation
Trained from scratch, on a combination of real-world and synthetic datasets
18

Experimental Results
19

Uncertainty Estimation
Standard deviation from multiple samples
Improvements by filtering out inaccurate pixels
20
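Sampling the diffusion model several times and taking the per-pixel standard deviation yields an uncertainty map; thresholding it filters out the least confident pixels. A sketch under assumed details (the 90% quantile cutoff is an arbitrary choice for illustration):

```python
import numpy as np

def depth_uncertainty(samples):
    """Per-pixel mean and std over multiple diffusion samples of one scene."""
    samples = np.asarray(samples)              # (n_samples, h, w)
    return samples.mean(axis=0), samples.std(axis=0)

def confident_mask(std, quantile=0.9):
    """Keep the lowest-variance pixels; drop the top-(1-q) most uncertain."""
    return std <= np.quantile(std, quantile)

rng = np.random.default_rng(0)
# Stand-in for 8 stochastic depth predictions of the same image.
samples = [5.0 + 0.1 * rng.standard_normal((32, 32)) for _ in range(8)]
mean, std = depth_uncertainty(samples)
mask = confident_mask(std, quantile=0.9)       # roughly 90% of pixels retained
```

Evaluating metrics only on `mask` is the "filtering out inaccurate pixels" step: accuracy improves because high-variance pixels correlate with prediction error.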

Introduction
Mono-Depth
MultiView-Depth
Conclusion
Outline
21

GRIN Limitations
Single input (monocular)
→ Multiple conditioning cameras
Single task (depth estimation)
→ Multi-task: depth and image
Fixed output viewpoint
→ Novel view from any viewpoint
22

Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion
Vitor Guizilini, Zubair Irshad, Dian Chen, Greg Shakhnarovich, Rares Ambrus. CVPR'25.
MVGD extends GRIN to novel view and depth synthesis from any viewpoint
Efficient network architecture allows 500+ input conditioning cameras
Scene scale normalization for training on a large-scale, heterogeneous dataset
23

Scene scale normalization
allows training on a large-
scale, heterogeneous dataset
(60M samples)
24
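Scene scale normalization can be as simple as rescaling each scene so its cameras fit a unit sphere; the recipe below is a hypothetical illustration of the idea, not MVGD's exact normalization:

```python
import numpy as np

def normalize_scene(cam_centers, depths):
    """Rescale one scene so its camera centers fit inside a unit sphere.

    Heterogeneous datasets mix metric, arbitrary, and unknown scales; dividing
    poses and depths by a per-scene scale puts every scene in a shared range.
    """
    centroid = cam_centers.mean(axis=0)
    scale = np.linalg.norm(cam_centers - centroid, axis=1).max()
    scale = max(scale, 1e-6)                     # guard against a single camera
    return (cam_centers - centroid) / scale, depths / scale, scale

cams = np.array([[0.0, 0.0, 0.0],
                 [4.0, 0.0, 0.0],
                 [0.0, 4.0, 0.0]])               # toy 3-camera scene
depths = np.full((2, 2), 8.0)                    # toy depth map in scene units
norm_cams, norm_depths, scale = normalize_scene(cams, depths)
```

The returned `scale` is kept so predictions can be mapped back to the scene's original units after inference.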

Efficient network architecture allows incremental model upsampling without restarting from scratch
25

Task and target embeddings allow switching between novel view and depth synthesis from any viewpoint
26

Red: input views; Green: novel viewpoint
27

Point clouds created by aggregating depth from all novel viewpoints
28
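Aggregating depth across viewpoints amounts to unprojecting each predicted depth map with its camera and concatenating the resulting world-space points. A minimal sketch (identity poses and tiny images are used only to keep the example small):

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """Lift a depth map to world-space points: X = T @ (d * K^-1 [u, v, 1])."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, h*w)
    cam_pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # camera frame
    cam_pts_h = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])
    return (cam_to_world @ cam_pts_h)[:3].T                   # world frame

# Hypothetical intrinsics and two predicted depth maps from novel viewpoints.
K = np.array([[100.0, 0.0, 16.0],
              [0.0, 100.0, 12.0],
              [0.0, 0.0, 1.0]])
views = [(np.full((24, 32), 2.0), np.eye(4)),
         (np.full((24, 32), 3.0), np.eye(4))]
cloud = np.concatenate([unproject(d, K, T) for d, T in views], axis=0)
```

With real poses, each view contributes points from its own vantage, and consistent depth predictions make the merged cloud align into a single coherent scene.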

Introduction
Mono-Depth
MultiView-Depth
Conclusion
Outline
29

Geometric RIN (GRIN): Efficient Pixel-Level Diffusion with
Sparse Labels → efficient zero-shot metric monocular depth
estimation
Zero-Shot Novel View and Depth Synthesis with Multi-View
Geometric Diffusion → arbitrary number of conditioning
views, large-scale pretraining and incremental upsampling
Conclusion
30

Thanks to many collaborators!
A Alspach, A Beaulieu, J Bohg, W Burgard, A Bühler, D Chen, H Chiu, A Cramariuc, A Dave, KG Derpanis, Y Du,
F Durand, J Fang, Z Fang, K Fragkiadaki, WT Freeman, A Gaidon, A Ganeshan, L Guibas, V Guizilini, AW Harley,
N Heppert, X Huang, T Ikeda, MZ Irshad, S Iwase, P Jensfelt, T Kanai, W Kehl, T Kerola, H Kim, Z Kira, K Kitani,
T Ko, T Kollar, M Kowal, Y Kudo, N Kuppuswamy, R Lee, KH Lee, J Li, S Lin, K Liu, M Lunayach, S Maeda, H Mei,
K Nishiwaki, D Park, S Pillai, A Raventos, D Rempe, G Rosman, T Sadjadpour, G Shakhnarovich, P Sharma,
V Sitzmann, J Solomon, C Stearns, X Tan, J Tang, JB Tenenbaum, A Tewari, P Tokmakov, A Valada, A Vallet,
I Vasiljevic, M Walter, Y Wang, Y Yang, S Zakharov

32
Questions?
Rareș Ambruș, PhD
Senior Manager
Large Behavior Models
Code & Data: https://github.com/TRI-ML/vidar
Blog posts: https://medium.com/toyotaresearch/
Open Positions: https://tri.global/careers
Twitter: https://twitter.com/ToyotaResearch

33
References
Depth estimation:
● Geometric RIN (GRIN): Efficient Pixel-Level Diffusion with Sparse Labels. Vitor Guizilini, Pavel Tokmakov, Achal Dave, Rares Ambrus. 3DV'25.
● Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion. Vitor Guizilini, Zubair Irshad, Dian Chen, Greg Shakhnarovich, Rares Ambrus. CVPR'25.
Object-centric representations:
● ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation. Zakharov, Liu, Gaidon, Ambrus. SIGGRAPH'24.
● ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping. Iwase, Irshad, Liu, Guizilini, Lee, Ikeda, Amma, Nishiwaki, Kitani, Ambrus, Zakharov. CVPR'25.
● OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World. Liu*, Zakharov*, Chen, Ikeda, Gaidon, Shakhnarovich, Ambrus. ICRA'25.
Robot policies: failure detection and statistical analysis:
● Is Your Imitation Learning Policy Better Than Mine? Policy Comparison with Near-Optimal Stopping. Snyder, Hancock, Badithela, Dixon, Miller, Ambrus, Majumdar, Itkina, Nishimura. RSS'25.
● Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies. Xu, Nguyen, Dixon, Rodriguez, Miller, Lee, Shah, Ambrus, Nishimura, Itkina. RSS'25.