GOSIM_China_2024_Embodied AI Data VLA World Model

yuhuang · 43 slides · Oct 17, 2024

About This Presentation

Development of AI
Levels of AI
Spatial Intelligence
Embodied Intelligence
Tricks for Generalization
Robot Types
VLA Models
Action Data Collection Platform
Public Datasets
Wearable AI or Ego AI
Simulation
Differentiable Rendering
Open Source Simulation Platform
Requirements of Embodied AI Datasets
Conclusion


Slide Content

What Data Is Imperative for Action Learning of Embodied AI?

①Development of AI
②Levels of AI
③Spatial Intelligence
④Embodied Intelligence
⑤Tricks for Generalization
⑥Robot Types
⑦VLA Models
⑧Action Data Collection Platform
⑨Public Datasets
⑩Wearable AI or Ego AI
⑪Simulation
⑫Differentiable Rendering
⑬Open Source Simulation Platform
⑭Requirements of Embodied AI Datasets
⑮Conclusion
Contents

Development of AI

Development of AI
●1) Algorithms
●CNN --> Transformer (Mamba?)
●Foundation models and LLMs
●New paradigm: pretrain + finetune
●Non-autoregressive / generative models
●Emergent capabilities: ICL / CoT / IF
●MoE, routing networks
●Vision-language models & multi-modalities
●OpenAI o1: MLM + RL
●Agents, Spatial & Embodied AI

●2) Compute
●Chips / training platforms / inference deployment
●Cost / power consumption
Development of AI

●3) Data (the bottleneck)
●Source: real / synthetic
●Processing: privacy, preference, cleaning
●Quality: diversity, domain expertise
●Closed or open source?
Development of AI

Levels of AI
●Rapid AI development raises new problems; the AI-human relationship is the core issue.
●Levels: rule-based; traditional IL- and RL-based; LLM/VLM/MLM; Spatial AI; Embodied AI; Wearable AI or Ego AI? AGI and ASI?
●Capability dimensions across these levels: Perception and Action; Reasoning & Decision Making; Memory and Reflection; Autonomous Learning and Generalization; Personality and Collaboration.

Spatial Intelligence
●The core idea of Spatial Intelligence is to "enable AI to understand and manipulate environments in a way that mimics human spatial intelligence, potentially transforming industries that rely on detailed environmental simulations."
●Trilobites were the first organisms that could sense light.
●The ability to see is thought to have ushered in the Cambrian explosion.
●Vision shifted from passive reception of light to an active process involving the nervous system, which in turn led to the creation of intelligence.

Spatial Intelligence
●Active vision was proposed back in the 1980s, but lacked the corresponding technologies to advance it:
●no good features (SIFT, SURF);
●no depth sensors (Kinect);
●no deep learning (ImageNet);
●no LLMs or VLAs (GPT, PaLM-E).
●For example: active tracking with saccade and pursuit, as in human vision.

Spatial Intelligence
●Spatial Computing "teaches computers to better understand and interact with people more naturally in the human world."
●Spatial Computing encompasses technological experiences like virtual reality (VR) and augmented reality (AR), as well as related concepts such as mixed reality (MR) and extended reality (XR).

Embodied Intelligence
●Embodied intelligence enables agents to demonstrate artificial intelligence not only in virtual worlds (such as cyberspace) but also in the physical world, which is essential for achieving AGI and even ASI.
●MLMs and world models are the prominent features of embodied intelligence.
●Embodied perception (representation, active perception, navigation)
●Embodied simulation (world model)
●Embodied interaction (QA, grasping)
●Embodied intelligence: MLMs, task and action planning
●Embodied control
●The Vision-Language-Action (VLA) model is a special MLM in embodied intelligence.

Embodied Intelligence
●Comparison of Embodied AI and Spatial Intelligence
●Spatial Intelligence focuses on understanding and interpreting the environment, predicting spatial relations, and inferring object dynamics;
●Embodied AI involves spatial intelligence but extends it by embedding this knowledge into systems capable of physical interaction (robotics, drones, autonomous systems).
●Embodied AI executes tasks based on spatial understanding.

Aspect      | Spatial Intelligence                 | Embodied AI
Definition  | Understanding and interpreting space | Physical agents performing tasks
Application | Contextual awareness in environments | Robotics, smart devices
Interaction | Data-driven insights (prediction)    | Physical interaction with the world

Light --> Sight --> Insight --> Action

Embodied Intelligence
●Two challenging problems in Embodied AI:
●Dexterity
●body, legs, arms, hands, feet
●Generalization
●objects and scenes
●embodied entity
●manipulation tasks and skills

Embodied Intelligence
●Data is one of the bottlenecks of Embodied AI.
●Diversity across:
●entity
●time
●place
●view
●goal
●skill

Embodied Intelligence
●The world model is an important way to achieve AGI and ASI, and is the cornerstone of applications from VR to decision-making systems.
●The world model understands the world by predicting the future (a minimal sketch follows below).
●The world model is an intrinsic representation of the surrounding environment and is a neural simulator.
●Based on the world model, the embodied AI agent understands its operating environment, predicts the results of its behavior, and makes intelligent decisions.
OpenAI Sora
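
To make "predicting the future" concrete, here is a minimal, generic PyTorch sketch of a latent world model: an encoder compresses the observation into a latent state, a dynamics network predicts the next latent given an action, and a decoder reconstructs the predicted next observation. The architecture and dimensions are illustrative assumptions, not the design of Sora or any other specific system.

```python
# Minimal latent world-model sketch (illustrative dimensions, not a real system).
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, obs_dim=64, act_dim=7, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

    def forward(self, obs, action):
        z = self.encoder(obs)                                   # compress observation to a latent state
        z_next = self.dynamics(torch.cat([z, action], dim=-1))  # predict how the latent evolves under the action
        return self.decoder(z_next)                             # decode the predicted next observation

# Training signal: prediction error against the actually observed next frame.
model = LatentWorldModel()
obs, action, next_obs = torch.randn(8, 64), torch.randn(8, 7), torch.randn(8, 64)
loss = nn.functional.mse_loss(model(obs, action), next_obs)
loss.backward()
```

An agent can then plan by rolling this predictor forward and scoring imagined futures, which is the "neural simulator" role described above.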

Tricks for Generalization
●1) Domain (Sim-to-Real) transfer;
●2) Efficient algorithms:
●2.1) Data(set) augmentation (see the sketch after this list)
●2.2) Representation / affordance learning
●2.3) Generative models (GenAI / AIGC)
●2.4) LLM / VLM / MLM / VLA
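
As a concrete instance of trick 2.1 applied to sim-to-real transfer (trick 1), below is a small NumPy sketch of visual domain randomization: simulator frames are perturbed so a policy does not overfit to one particular rendering. All perturbation ranges are illustrative assumptions.

```python
# Hedged sketch of visual domain randomization for sim-to-real transfer.
import numpy as np

def randomize_frame(rgb, rng):
    """rgb: H x W x 3 uint8 frame rendered by the simulator."""
    img = rgb.astype(np.float32)
    img *= rng.uniform(0.7, 1.3)                        # brightness / exposure jitter
    img += rng.normal(0.0, 5.0, size=img.shape)         # sensor noise
    if rng.random() < 0.5:                               # occasional lighting tint
        img *= rng.uniform(0.8, 1.2, size=(1, 1, 3))     # per-channel color scaling
    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
frame = np.full((224, 224, 3), 128, dtype=np.uint8)      # stand-in for a simulator render
augmented = randomize_frame(frame, rng)
```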

Robot Types
●(1) Fixed-base robots, like the Franka Emika Panda;

Robot Types
●(2) Wheeled robots, for example the Jackal robot;

Robot Types
●(3) Tracked robots, such as the iRobot PackBot;

Robot Types
●(4) Quadruped robots, like Boston Dynamics Spot and the MIT Mini Cheetah;

Robot Types
●(5) Humanoid robots, such as Tesla Optimus and Figure v2;

Robot Types
●(6)Biomimetic Robots.

●LLM-based VLA models can reason about the relationships between objects, predict future states, and even infer possible actions (a usage sketch follows below).
●OpenVLA, 3D-VLA, Bi-VLA, CoVLA, TinyVLA, QUAR-VLA;
Vision-Language-Action Models
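
For illustration, the snippet below follows the Hugging Face usage published with OpenVLA for querying a VLA model with an image and a language instruction. Method names such as predict_action and the unnorm_key argument are taken from that repository and may differ across releases; the blank image stands in for a real camera frame.

```python
# Hedged sketch based on OpenVLA's published Hugging Face usage; details may
# differ across releases. The blank PIL image is a stand-in for a camera frame.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.new("RGB", (224, 224))  # replace with the robot's current camera frame
prompt = "In: What action should the robot take to pick up the red cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-DoF action (end-effector delta pose + gripper), un-normalized with
# the statistics of the chosen training dataset.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```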

Action Data Collection Platform
●Motion capture is the process of recording the movement of objects or people; that information is used to animate digital character models in 2D or 3D computer animation.

Action Data Collection Platform
●Robotic exoskeletons: TABLIS, AirExo, wearable upper-limb exoskeletons, DexCap;

Action Data Collection Platform
●Simpler data collection platforms: GELLO, ALOHA and Mobile ALOHA;

Action Data Collection Platform
●Platforms that do not require physically moving the robot: Dobb·E, UMI, RUM;
Stick v1 / Stick v2

Action Data Collection Platform
●VR kits: Holo-Dex, Open-TV, HumanPlus, Open Teach, VisionProTeleop, ACE, Meta Project Aria;

Action Data Collection Platform
●Mobile phones: RoboTurk

Public Datasets
●Robot manipulation data: BridgeData V1/V2, RH20T, Open X-Embodiment, DROID (see the loading sketch below);

Public Datasets
●Human behavior data: EPIC-KITCHENS, Ego4D, Assembly101, Ego-Exo4D;
●Procedural activities
●Natural activities
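
As an illustration of how such manipulation datasets are typically consumed, here is a hedged sketch of iterating an RLDS-formatted dataset (the episode/steps layout used by Open X-Embodiment) with tensorflow_datasets. The directory path and feature keys are assumptions and vary per dataset; check each dataset card for the exact schema.

```python
# Hedged sketch: iterating an RLDS-formatted robot dataset. The path and keys
# below are assumptions; consult the dataset card for the actual schema.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory(builder_dir="/data/rlds/bridge/1.0.0")  # hypothetical local path
ds = builder.as_dataset(split="train[:10]")

for episode in ds:
    for step in episode["steps"]:                        # RLDS: each episode holds a "steps" sequence
        image = step["observation"]["image"]             # camera frame (key name varies per dataset)
        action = step["action"]                          # robot action, e.g., end-effector delta pose
        instruction = step.get("language_instruction")   # task description, if the dataset provides one
```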

●Mapping the activities of others to the egocentric view is a basic skill of humans from a very early age, i.e. human ego intelligence.
●Wearable AI is essentially a robotics application. Devices such as smart glasses, neural wristbands, and AR headsets use AI to perceive the user's environment, understand spatial context, and make predictions.
●"How-to" instructional videos often alternate between a third-person view of the demonstrator and a close-up egocentric view of near-field demonstration.
Wearable AI or Ego AI

●Spatial intelligence addresses the key challenges of context awareness and environment understanding for wearable AI, enabling richer interactions.
●Wearable AI captures multi-modal egocentric data (vision, audio, motion) that can be used to train and enhance spatial intelligence models.
Wearable AI or Ego AI

●Traditional graphics, kinematics, dynamics;
●PGC --> UGC --> AIGC
●Generative AI (GenAI):
●GAN, VAE, Diffusion (Stable Diffusion)
●CLIP
●Sora (DiT)
●3D representation & Real2Sim:
●NeRF
●Gaussian Splatting
●LLM-based:
●GenSim
●RoboGen
Simulation

Differentiable Rendering
●NeRF: in 3D reconstruction and novel view synthesis, NeRFs leverage NNs to model the geometry and appearance of 3D scenes as density fields and radiance fields (a rendering sketch follows below).
●Gaussian Splatting: 3D Gaussian Splatting combines the advantages of neural implicit fields and point-based rendering methods, achieving the high-fidelity rendering quality of the former while maintaining the real-time rendering capability of the latter.
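
To show what "differentiable rendering" means operationally, here is a minimal NumPy sketch of NeRF's discrete volume-rendering rule: per-sample densities and colors along a camera ray are composited into one pixel color, and because every operation is differentiable, gradients can flow back into the scene representation. 3D Gaussian Splatting relies on the same front-to-back alpha compositing, applied to projected Gaussians rather than ray samples.

```python
# Minimal sketch of NeRF's discrete volume rendering along a single camera ray.
import numpy as np

def render_ray(sigmas, colors, deltas):
    """sigmas: (N,) densities, colors: (N, 3), deltas: (N,) sample spacings."""
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity contributed by each sample
    trans = np.cumprod(1.0 - alphas + 1e-10)           # transmittance after each sample
    trans = np.concatenate([[1.0], trans[:-1]])        # T_i depends only on samples before i
    weights = trans * alphas                           # per-sample contribution to the pixel
    return (weights[:, None] * colors).sum(axis=0)     # expected color along the ray

pixel = render_ray(np.array([0.5, 1.0, 2.0]),
                   np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
                   np.array([0.1, 0.1, 0.1]))
```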

●OmniGibson / BEHAVIOR-1K, Habitat 1/2/3, SAPIEN, ManiSkill 1/2/3, BiGym (a usage sketch follows below)
Open Source Simulation Platform
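
Most of these platforms expose a gym-style interface, so a data-collection or evaluation loop looks roughly like the sketch below. The environment id, import path, and keyword arguments are assumptions modeled on ManiSkill; consult each platform's documentation for the exact names.

```python
# Hedged sketch: rolling out a policy in a gym-style manipulation simulator.
# The env id, import path, and kwargs are assumptions (ManiSkill-like).
import gymnasium as gym
import mani_skill.envs  # noqa: F401  (assumed import that registers the environments)

env = gym.make("PickCube-v1", obs_mode="rgbd", control_mode="pd_ee_delta_pose")
obs, info = env.reset(seed=0)
episode = []
for _ in range(100):
    action = env.action_space.sample()              # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode.append((obs, action, reward))
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```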

1. The dataset supports study of large-scale embodied entity learning.
2. The dataset supports generalization to new tasks, new environments, and even new embodiments of entities.
3. The dataset meets requirements in scale and diversity of entity, time, place, view, goal, and skill.
4. The dataset complies with privacy/ethical standards: de-identification.
5. The dataset includes real and virtual data: both Real2Sim and Sim2Real transfer are realized.
6. The dataset includes exo- and ego-view data, supporting flexible transformation between exo and ego views.
7. The dataset follows a unified format, convertible between various data formats (a format sketch follows below).
Requirements of Embodied AI Datasets
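
To make requirement 7 concrete, here is a hedged sketch of what a unified episode schema could look like: a minimal structure that other formats (RLDS, HDF5 logs, ROS bags) could be converted to and from. Every field name is an illustrative assumption, chosen only to show how entity, view, goal, and real/sim provenance might travel in one convertible format.

```python
# Illustrative unified episode schema; all field names are assumptions.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Step:
    image: np.ndarray        # H x W x 3 camera frame (exo or ego view)
    proprio: np.ndarray      # joint positions / end-effector pose
    action: np.ndarray       # commanded action for this step
    view: str = "exo"        # "exo" or "ego", to support exo-ego view transforms

@dataclass
class Episode:
    embodiment: str          # robot or human entity that produced the data
    instruction: str         # natural-language task goal
    source: str              # "real" or "sim", for Real2Sim / Sim2Real pairing
    anonymized: bool = True  # de-identification flag for privacy compliance
    steps: list[Step] = field(default_factory=list)
```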

●AI development is hot in spatial intelligence and embodied AI: --> AGI --> ASI?
●One of the critical bottlenecks for embodied AI is data, data, data;
●Spatial intelligence is transforming CV from passive perception to active vision with world interaction;
●Embodied AI integrates spatial intelligence with physical action in the world (robotics, autonomous systems);
●Wearable AI benefits from and contributes to spatial intelligence, creating a feedback loop of innovation;
●World models (differentiable rendering, e.g., NeRF and Gaussian splatting) and VLA models are key directions for spatial intelligence and embodied AI.
Conclusions