Large Scale Foundation Models
for Autonomous Driving
Yu Huang
Roboraction.AI
Y. Huang, Y. Chen, Z. Li, “Large Scale Foundation Models for Autonomous Driving”, arXiv:2311.12144, 2023
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Introduction
• Autonomous driving is a long-tailed AI problem;
• Foundation model is a paradigm in which a model is first pre-trained and
then fine-tuned for downstream tasks;
• Large Scale Language Models (LLMs) with billions of parameters, like
ChatGPT and GPT-4, are built on the foundation model paradigm;
• Diffusion models work for data generation;
• NeRF provides an implicit representation of 3-D structure.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Large Scale Language Models
• Transformer is the backbone architecture of
most well known LLMs;
• Modifications of the Transformer for efficiency and
scalability are as follows (a GQA sketch follows this list):
• Multi-query attention (MQA): keys and values are shared
across all attention heads;
• GQA (grouped-query attention): a generalization of MQA with
an intermediate number of key-value heads;
• RoPE (Rotary Position Embedding): position encoding with a
rotation matrix, from RoFormer;
• Switch Transformers: simplified Mixture-of-Experts routing;
• FlashAttention 1/2: tiling to reduce memory reads/writes;
• PagedAttention: the virtual memory and paging techniques of
operating systems applied to the attention KV cache.
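To make the MQA/GQA idea concrete, here is a minimal, self-contained sketch of grouped-query attention in PyTorch: n_kv_heads = 1 recovers MQA and n_kv_heads = n_heads recovers standard multi-head attention. The shapes and the plain-matrix weights are illustrative assumptions, not any particular model's layout.

```python
# Minimal sketch of grouped-query attention (GQA); n_kv_heads=1 is MQA,
# n_kv_heads=n_heads is standard multi-head attention.
import torch
import torch.nn.functional as F

def gqa(x, wq, wk, wv, n_heads, n_kv_heads):
    B, T, D = x.shape
    hd = D // n_heads                                          # per-head dim
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)       # (B, H, T, hd)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)    # (B, Hkv, T, hd)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Each group of n_heads // n_kv_heads query heads shares one K/V head:
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)                        # (B, H, T, hd)
    v = v.repeat_interleave(rep, dim=1)
    att = F.softmax((q @ k.transpose(-2, -1)) / hd**0.5, dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, D)

# Toy usage: 8 query heads sharing 2 K/V heads.
B, T, D, H, Hkv = 2, 16, 64, 8, 2
x = torch.randn(B, T, D)
wq = torch.randn(D, D)
wk = torch.randn(D, (D // H) * Hkv)   # smaller K/V projections: the memory win
wv = torch.randn(D, (D // H) * Hkv)
print(gqa(x, wq, wk, wv, H, Hkv).shape)  # torch.Size([2, 16, 64])
```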
Large Scale Language Models
• LLMs refer to Transformer-based language models that contain hundreds of
billions (or more) of parameters and are trained on massive text data:
• GPT (generative pre-trained transformer)-1/2/3/4: from text to multi-modality;
• PaLM (pathways language model): on an efficient ML system Pathways on thousands of TPUs;
• OPT (Open Pre-trained Transformer): comparable to GPT-3;
• GLM (General Language Model Pretraining): autoregressive blank infilling;
• LLaMA (LLM Meta AI)-1/2: Open Foundation Language Models, fine-tuned chat models;
• T5 (Text-to-Text Transfer Transformer): encoder-decoder models;
• LLMs significantly extend the model size, data size (tokens), and total compute
(by orders of magnitude); model capability is largely improved by this scaling.
(Figure: Kaplan (OpenAI)’s power-law scaling vs. Hoffmann (Google DeepMind)’s compute-optimal training.)
Large Scale Language Models
• Training efficiency: compute, memory, communication
• Data parallelism: distribute the whole training corpus across multiple GPUs with
replicated model parameters and states;
• Synchronous: distributed data parallelism (DDP), sketched after this list;
• Asynchronous: parameter server (PS);
• Model parallelism: partition a model graph into subgraphs and assign each
subgraph to a different GPU;
• Pipeline parallelism: distribute the different layers of an LLM across multiple GPUs;
• Tensor parallelism: decompose the tensors (the parameter matrices) across multiple GPUs;
• Zero Redundancy Optimizer (ZeRO): partition model states (optimizer states,
gradients, and parameters, in three corresponding stages) across processors to
optimize memory and communication;
• ZeRO-Offload: offload data and computation to the CPU to save GPU memory;
• ZeRO-Infinity: leverage CPU and NVMe memory across multiple devices;
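As a concrete reference point for the synchronous data-parallel option above, below is a minimal DDP sketch in PyTorch; it assumes a single-node launch via `torchrun --nproc_per_node=N` (which sets the environment variables `init_process_group` reads), and the toy model and loss are placeholders.

```python
# Hedged sketch of synchronous data parallelism with PyTorch DDP.
# Launch: torchrun --nproc_per_node=N train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")     # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(512, 512).cuda(rank)
    model = DDP(model, device_ids=[rank])       # replicates params, syncs grads
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                         # each rank sees its own data shard
        x = torch.randn(32, 512, device=rank)
        loss = model(x).pow(2).mean()           # placeholder loss
        opt.zero_grad()
        loss.backward()                         # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```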
Large Scale Language Models
• Training efficiency: compute, memory, communication
• Platforms:
• DeepSpeed: optimization library for distributed training and inference, Microsoft;
• DeepSpeed-MII: enables low-latency and high-throughput inference;
• Megatron-LM: training large transformer language models at scale, Nvidia;
• TensorRT-LLM (successor to FasterTransformer): Python API to define LLMs and build TensorRT
engines for efficient inference on NVIDIA GPUs;
• Colossal-AI: leverages a series of parallelism methods for distributed model training;
• vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs;
• lightLLM: Python-based LLM inference and serving framework with lightweight design.
Large Scale Language Models
• Emergent abilities arise in large models but not in smaller ones;
• In-context learning: Instruction and demonstrations on downstream tasks;
• Instruction following: fine-tuning with natural language descriptions;
• Step-by-step reasoning: Chain-of-Thought (CoT) prompting.
• Parameter-Efficient Fine-Tuning (PEFT): optimizing a small fraction of parameters
(a LoRA sketch follows this list);
• Addition-based: adapter tuning, prefix-tuning (soft prompting), prompt-based tuning,
P-tuning 1/2, (IA)3;
• Selection/Specification-based: BitFit, DiffPruning, Cross-attention tuning, Fish-Mask, LT-SFT;
• Reparameterization-based: LoRA (Low-Rank Adaptation), HINT (Hypernetwork instruction
tuning), QLoRA, Delta-tuning;
• LLM alignment with human preference:
• RLHF (reinforcement learning from human feedback)
• Constitutional AI: RL from AI Feedback (RLAIF).
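To illustrate the reparameterization-based family, here is a minimal LoRA sketch: the pre-trained weight is frozen and only a low-rank update BA is trained, so the effective weight is W + (alpha/r)·BA. The rank, scaling, and initialization below follow the common recipe but are illustrative choices.

```python
# Minimal LoRA sketch: freeze the pre-trained linear layer, train only the
# low-rank factors A and B. B starts at zero so BA = 0 at initialization.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pre-trained layer
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # frozen path + scaled low-rank update
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(4, 768))                  # behaves like the original layer
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)                       # torch.Size([4, 768]) 12288
```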
Large Scale Language Models
• Other issues of LLMs
• Hallucination: A situation where the model generates content that is not based
on factual or accurate information;
• Explainability: The ability to explain or present the behavior of models in
human-understandable terms.
• Evaluation: It is important to better understand the strengths and weaknesses
of LLMs, and to provide better guidance for human-LLM interaction;
• RAG (Retrieval Augmented Generation): A promising solution for LLMs to
effectively interact with the external world;
• Knowledge Graph (KG): use LLMs to augment KGs for knowledge extraction, KG
construction, and refinement, or use KGs to augment LLMs for training and
prompt learning, or knowledge augmentation.
• Others: computational tractability, continual learning, privacy, copyright, etc.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Visual Language Model, Multi-Modality
Model and Embodied AI
• Vision Transformers (ViT) (a patch-embedding sketch follows this list):
• split an image into fixed-size patches, linearly embed each, add position
embeddings, feed to a Transformer encoder;
• add an extra learnable “classification token” to the sequence.
• ViT-22B: parallel layers, query/key (QK) normalization, omitted biases;
• DINO v1: Self-Supervised vision transformers;
• DINO v2: Unsupervised visual feature pre-training;
• Pix2seq: Language Modeling for Object Detection;
• Segment Anything Model (SAM): a promptable method;
• Segmenting Everything Everywhere (SEEM): interactive;
• SAM3D: LiDAR points projected to BEV images for 3D object detection;
• SEAL: segmentation of any point cloud sequences (LiDAR+camera);
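A minimal sketch of the ViT front end described above: fixed-size patches, a linear embedding (implemented as a strided convolution, the standard equivalent), a learnable classification token, and position embeddings. The 224/16/768 configuration matches ViT-Base but is otherwise an illustrative choice.

```python
# Sketch of the ViT front end: split an image into fixed-size patches, linearly
# embed each, prepend a learnable [CLS] token, and add position embeddings.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img=224, patch=16, dim=768):
        super().__init__()
        n = (img // patch) ** 2                        # 196 patches
        # A strided conv implements "split into patches + linear embed":
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):                              # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, 196, dim)
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos   # (B, 197, dim) -> encoder

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```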
Visual Language Model, Multi-Modality
Model and Embodied AI
• Visual Language Models:
• CLIP, BLIP v1/2, PaLI-1/X/3, ImageBind, AnyMAL.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Multi-modal Model:
• PointCLIP v1/2;
• ULIP v1/2;
• CLIP2Point;
• CLIP2Scene;
• OpenShape;
Visual Language Model, Multi-Modality
Model and Embodied AI
• World Model: It explicitly represents the knowledge of an agent about its
environment, using a generative model to predict the future;
• Dynalang is an agent that learns a multi-modal world model to predict future text
and image representations, and learns to act from model rollouts;
Visual Language Model, Multi-Modality
Model and Embodied AI
• Tree of Thoughts (ToT): Problem Solving with Large Language Models;
• Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models;
• Graph of Thoughts: Solving Elaborate Problems with LLMs;
Visual Language Model, Multi-Modality
Model and Embodied AI
• Embodied AI/Agent: AI algorithms and agents no longer learn from datasets;
instead, they learn through interactions with the environment from an egocentric perception;
“The Rise and Potential of Large Language Model Based Agents: A Survey”
Visual Language Model, Multi-Modality
Model and Embodied AI
• PaLM-E: embodied language models to incorporate sensor modalities for multiple
tasks, even for sequential robotic manipulation planning;
• VOYAGER: An Open-Ended Embodied Agent with LLMs;
• EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought;
Visual Language Model, Multi-Modality
Model and Embodied AI
• A Generalist Agent (Gato);
• LM-Nav: Large Pre-trained Models of Language, Vision, and Action;
• ReAct: Synergize reasoning and acting in language models;
Visual Language Model, Multi-Modality
Model and Embodied AI
• Toolformer: Language Models Can Teach Themselves to Use Tools;
• ToolLLM: Facilitating Large Language Models To Master 16000+ Real-World APIs;
Visual Language Model, Multi-Modality
Model and Embodied AI
• RT-1: Robotics Transformer for Real-World Control at Scale;
• RT-2: Vision-Language-Action (VLA) Models Transfer Web Knowledge;
• RT-X: A high-capacity model with “generalist” X-robot policy;
Visual Language Model, Multi-Modality
Model and Embodied AI
• Habitat 1.0: A Platform for Embodied AI Research;
• Habitat 2.0: Training Home Assistants to Rearrange their Habitat;
• Habitat 3.0: A Co-Habitat For Humans, Avatars And Robots;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Diffusion Model
• It aims to generate images from Gaussian noise via an iterative denoising process
(a sketch of the forward process follows).
• Its formulation, grounded in a physical analogy, consists of a diffusion process
and a reverse process.
• In the diffusion process, an image is converted to a Gaussian distribution by
iteratively adding random Gaussian noise.
• The reverse process recovers the image from that distribution through several
denoising steps.
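A minimal sketch of the forward (diffusion) process just described; it uses the closed-form sampling of x_t given x_0, and the linear beta schedule is a common illustrative choice rather than a requirement.

```python
# Forward diffusion: with a variance schedule beta_t, x_t can be sampled in
# closed form from x_0:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I),
# where alpha_bar_t is the cumulative product of (1 - beta_t). The reverse
# process trains a network to predict eps and denoise step by step.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # common linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) for a batch of timesteps t."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

x0 = torch.randn(4, 3, 32, 32)                   # stand-in for images in [-1, 1]
t = torch.randint(0, T, (4,))
xt, eps = q_sample(x0, t)                        # training pairs for the denoiser
print(xt.shape)                                  # torch.Size([4, 3, 32, 32])
```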
Diffusion Model
• Latent Diffusion Model (LDM): model the distribution of the latent space of images;
• Two modules: an autoencoder and a diffusion model;
• Open source: Stable Diffusion.
Diffusion Model
• DALL-E v1: Zero-shot text-to-image generation, multi-modal version of GPT-3;
• DALL-E v2: Hierarchical text-conditional image generation with diffusion decoder;
• DALL-E v3: Improving Image Generation with Better Captions;
• Image Captioner: very similar to a language model, trained with a CLIP image encoder;
• A small caption subset is used to fine-tune the captioner, whose outputs serve as “short synthetic captions”;
• Long, highly descriptive captions are then used for fine-tuning, whose outputs serve as “descriptive synthetic captions”;
Diffusion Model
• Point-E: Generating 3D Point Clouds from Complex Prompts;
• LidarCLIP: learns a mapping from LiDAR point clouds to the CLIP embedding space;
• Fantasia3D: Disentangling Geometry and Appearance for High-Quality Text-to-3D Content Creation;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Neural Radiance Field (NeRF)
• NeRF uses NNs to model the 3D geometry and appearance of objects in a scene,
enabling higher-quality visualizations than traditional techniques;
• The three steps (volume rendering is sketched below): sampling 5D coordinates
(location and viewing direction) along camera rays, applying an MLP to estimate
color and volume density, and aggregating these values into an image by volume rendering;
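A minimal sketch of the third step, volume rendering: given colors and densities sampled along a ray, composite them into a single pixel color with the standard transmittance weighting. The random inputs stand in for MLP outputs.

```python
# NeRF's volume rendering along one ray:
#   C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
#   T_i = exp(-sum_{j<i} sigma_j * delta_j)   (transmittance up to sample i)
import torch

def render_ray(rgb, sigma, deltas):
    """rgb: (N, 3) colors, sigma: (N,) densities, deltas: (N,) sample spacings."""
    alpha = 1.0 - torch.exp(-sigma * deltas)             # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)    # running transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])       # shift so T_1 = 1
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(dim=0)           # composited RGB

N = 64                                  # samples along one camera ray
color = render_ray(torch.rand(N, 3), torch.rand(N) * 5.0, torch.full((N,), 0.03))
print(color)                            # the RGB value for this ray's pixel
```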
Neural Radiance Field (NeRF)
• Generalization: MVSNeRF, PixelNeRF, IBRNet;
• Quality and scalability: NeRF in the wild, Mip-NeRF, Mip-NeRF 360;
• Acceleration: KiloNeRF, Instant Neural Graphics Primitives, FastNeRF;
• Relighting: Neural Reflectance Fields, NeRV;
• Large Scale Scenes: Block-NeRF, Mega-NeRF, UE4-NeRF;
• Driving Scenes: Neural Scene Graphs, Lift3D, S-NeRF;
• NeRF and Language: Dream Fields, DietNeRF, CLIP-NeRF;
• NeRF and Diffusion: Latent NeRF, SparseFusion, Magic-3D;
• NeRF, Diffusion and Language: DreamFusion, Points-to-3D.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Applications of Foundation Models for
Autonomous Driving
• Autonomous Driving SAE Levels;
Applications of Foundation Models for
Autonomous Driving
• Modular or E2E approach?
Applications of Foundation Models for
Autonomous Driving
• Challenges or problems:
• Corner cases;
• Current popular solutions:
• Data closed loop;
• Categories of methods with
large foundation models;
• Based on grounding scenarios;
• Large Language Models?
• Potential in future:
• Diffusion model;
• NeRF.
1. “Vision Language Models in Autonomous Driving and Intelligent Transportation Systems”, arXiv:2310.14414, 2023
2. “A Survey of Large Language Models for Autonomous Driving”, arXiv:2311.01043, 2023
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Simulation for Autonomous Driving
• Simulation works as AIGC (AI-generated content);
• Sensor Data Synthesis:
• Image, video, LiDAR;
• Traffic flow synthesis;
• Technologies:
• NeRF, Diffusion, Visual-language model, LLMs;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
World Model for Autonomous Driving
• The world model is a neural simulator, synthesizing long-tailed scenarios;
• It can predict the next observations to facilitate end-to-end solutions;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Data Annotation for Autonomous Driving
• Auto labeling is important for the efficiency of a data closed loop;
• Open vocabulary annotation needs world knowledge from LLMs/VLMs;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Decision Making, Planning and E2E Driving
• LLMs’ Integration (a schematic sketch follows this list):
• LLMs can serve as the decision-making module, while various functions, such as
the perception, localization, and prediction modules, act as the vehicle’s
sensing devices or tools.
• Besides, the vehicle’s actions and controller function as its executor,
carrying out orders from the decision-making process.
• Similarly, a multi-modal language model (MMLM) can be built from sensor-text-
action data (with the help of LLMs) for E2E autonomous driving, generating
either trajectory predictions or control signals directly, like an LLM
instruction-tuning solution.
• Another way to apply LLMs is merging vectorized modalities (encoded from raw
sensor input or from tools like perception, localization, and prediction) with
a pre-trained LLM, like an LLM-augmented solution.
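A schematic, hypothetical sketch of this integration pattern: perception, localization, and prediction are stubbed as tools whose outputs are serialized into a prompt, the LLM returns a high-level decision, and the controller executes it. Every function here is a stand-in, not a specific system's API.

```python
# Hypothetical sketch of LLMs' integration: tools feed a prompt, the LLM
# decides, the controller executes. All functions are illustrative stubs.

def perceive(frame):       # perception tool (stub)
    return [{"type": "pedestrian", "dist_m": 12.0, "crossing": True}]

def localize(frame):       # localization tool (stub)
    return {"lane": "ego", "speed_mps": 8.0}

def predict(objects):      # prediction tool (stub)
    return [{"obj": 0, "enters_ego_lane_in_s": 1.5}]

def query_llm(prompt):     # stand-in for a real LLM call
    return "yield: pedestrian predicted to enter ego lane in 1.5 s"

def drive_step(frame):
    objects = perceive(frame)
    scene = {"objects": objects,
             "ego": localize(frame),
             "forecasts": predict(objects)}
    prompt = ("You are the decision-making module of an autonomous vehicle.\n"
              f"Scene: {scene}\n"
              "Choose one action: keep_lane | change_lane_left | "
              "change_lane_right | yield | stop, then give a one-line reason.")
    decision = query_llm(prompt)
    action = decision.split(":")[0]     # the controller acts as the executor
    return action

print(drive_step(frame=None))  # -> "yield"
```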
Decision Making, Planning and E2E Driving
• Tokenization like the language GPT (an illustrative sketch follows):
• It builds the model on self-collected data (with the help of LLM/VLM) in
a similar way to the language GPT.
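An illustrative sketch of such tokenization: continuous driving signals are discretized into a small vocabulary so that driving logs become token sequences suitable for GPT-style autoregressive training. The 256-bin uniform scheme is an assumption for illustration, not the paper's exact recipe.

```python
# Discretize continuous driving signals (steering, speed) into integer tokens,
# turning driving logs into sequences for GPT-style autoregressive training.
import numpy as np

N_BINS = 256  # illustrative vocabulary size per signal

def tokenize(value, lo, hi):
    """Map a continuous value in [lo, hi] to one of N_BINS integer tokens."""
    value = np.clip(value, lo, hi)
    return int((value - lo) / (hi - lo) * (N_BINS - 1))

def detokenize(token, lo, hi):
    """Invert tokenize up to quantization error."""
    return lo + token / (N_BINS - 1) * (hi - lo)

# One timestep -> a pair of tokens; a whole drive -> a token sequence.
steer_tok = tokenize(-0.1, lo=-1.0, hi=1.0)    # steering in [-1, 1]
speed_tok = tokenize(12.5, lo=0.0, hi=40.0)    # speed in m/s
print(steer_tok, speed_tok)                    # e.g. 114 79
print(round(detokenize(steer_tok, -1.0, 1.0), 3))  # ~ -0.1 (quantized)
```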
Decision Making, Planning and E2E Driving
• Pre-trained Foundation Model:
• It is self-supervised, including a perception module or a world model module;
• The perception module needs to be accompanied by planning and decision making;
• The world model module could be E2E, free from object-level understanding;
• It needs a huge amount of data to cover the diversity of driving scenarios,
even without LLMs’ support.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Conclusion
• LLMs hold human knowledge and exhibit emergent abilities;
• Vision models, visual language models, and multi-modality models extend the
LLMs’ capabilities to broad modalities;
• The diffusion model is a generative model, trained for diverse data generation;
• The neural radiance field provides a neural method for 3-D scene synthesis;
• Applications for autonomous driving fall into different grounding
cases: simulation, world model, annotation, planning, decision making,
and E2E driving;
• Embodied AI/agents can be augmented by LLMs, grounded in
autonomous driving.
End
