Applications of Foundation Models for Autonomous Driving
The document discusses the role of large-scale foundation models in autonomous driving, focusing on various models like large-scale language models, visual language models, diffusion models, and neural radiance fields. It outlines their applications in simulation, world modeling, data annotation, and decision-making processes for driving. The conclusion highlights the advanced capabilities these models provide, emphasizing their integration and potential for enhancing autonomous driving technologies.
The presentation introduces foundation models for autonomous driving, emphasizing large-scale language models (LLMs) such as ChatGPT and GPT-4, alongside diffusion models and NeRF.
Discusses the architecture and efficiency of LLMs, focusing on training techniques, platforms like DeepSpeed and Megatron-LM, and challenges like hallucination and explainability.
Explores advancements in visual language models and embodied AI, including ViTs, CLIP, multi-modal models, and agents capable of learning through environment interactions.
Explains the diffusion model and its applications in generating images from Gaussian noise, including details on Latent Diffusion Models and notable projects like DALL-E.
Describes NeRF and its methods for 3D object and scene generation, highlighting generalizations and accelerations in the field.
Analyzes the applications of foundation models in autonomous driving, discussing challenges, solutions with LLMs, and the potential of using diffusion models and NeRF.
Highlights the importance of simulation using generative models for autonomous driving training, including sensor data synthesis and world modeling.
Discusses the significance of automated data labeling for efficiency in autonomous driving and the role of LLMs in open vocabulary annotation.
Details how LLMs integrate into decision-making modules for autonomous vehicles, emphasizing sensor data handling and model training.
Summarizes the capabilities and applications of LLMs, visual models, and the integration of AI in autonomous driving, emphasizing the future landscape.
Applications of Foundation Models for Autonomous Driving
1.
Large Scale Foundation Models
for Autonomous Driving
Yu Huang
Roboraction.AI
Y. Huang, Y. Chen, Z. Li, "Large Scale Foundation Models for Autonomous Driving", arXiv:2311.12144, 2023
2.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
4.
Introduction
• Autonomous driving is a long-tailed AI problem;
• The foundation model is a paradigm in which a model is first pre-trained and
then fine-tuned for downstream tasks;
• Large Scale Language Models (LLMs) with billions of parameters, such as
ChatGPT and GPT-4, are built on the foundation-model paradigm.
• The diffusion model works for data generation;
• NeRF provides an implicit representation of 3-D structure.
6.
Large Scale Language Models
• Transformer is the backbone architecture of
most well-known LLMs;
• Modifications of Transformers for efficiency and
scalability include the following:
• Multi-query attention (MQA): keys and values are shared
across all attention "heads" (a minimal sketch follows this list)
• GQA: a generalization of MQA with an intermediate number of
key-value heads
• RoPE (Rotary Position Embedding): position encoding with a
rotation matrix for RoFormer
• Switch Transformers: simplified Mixture of Experts routing
• FlashAttention 1/2: uses tiling to reduce memory
reads/writes
• PagedAttention: applies the virtual memory and paging technique (used
in operating systems) to the attention KV cache
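A minimal sketch of multi-query attention, assuming a PyTorch-style module; the layer names and dimensions are illustrative, not taken from any particular LLM:
```python
# Hedged sketch of multi-query attention (MQA): all query heads share one
# key/value head, which shrinks the KV cache during decoding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)           # one query per head
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head)  # single shared K/V head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.d_head, dim=-1)
        k = k.unsqueeze(1)  # broadcast the single KV head over all query heads
        v = v.unsqueeze(1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(y)
```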
7.
Large Scale Language Models
• LLMs refer to Transformer-based language models that contain hundreds of
billions (or more) of parameters, which are trained on massive text data:
• GPT (generative pre-trained transformer)-1/2/3/4: from text to multi-modality;
• PaLM (pathways language model): on an efficient ML system Pathways on thousands of TPUs;
• OPT (Open Pre-trained Transformer): comparable to GPT-3;
• GLM (General Language Model Pretraining): autoregressive blank infilling;
• LLaMA (LLM Meta AI)-1/2: Open Foundation Language Models, fine-tuned chat models;
• T5 (Text-to-Text Transfer Transformer): encoder-decoder models;
• LLMs significantly scale up model size, data size (tokens), and total compute
(by orders of magnitude), and model capability is largely improved by scaling;
Scaling laws: Kaplan (OpenAI)'s power law vs. Hoffmann (Google DeepMind)'s compute-optimal training
8.
Large Scale Language Models
• Training efficiency: compute, memory, communication
• Data parallelism: distribute the whole training corpus across multiple GPUs with
replicated model parameters and states (a minimal DDP sketch follows this list);
• Synchronous: distributed data parallelism (DDP);
• Asynchronous: parameter server (PS);
• Model parallelism: partition a model graph into subgraphs, and assign each
subgraph to a different GPU;
• Pipeline parallelism: distribute the different layers of an LLM across multiple GPUs;
• Tensor parallelism: decompose the tensors (the parameter matrices) across multiple GPUs;
• Zero Redundancy Optimizer (ZeRO): partition model states in three
corresponding stages across processors to optimize the communication;
• ZeRO-Offload: offload data and computations to CPU and save the memory;
• ZeRO-Infinity: leverage CPU and NVMe memory across multiple devices;
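A minimal sketch of synchronous data parallelism with PyTorch DDP, assuming one process per GPU (e.g. launched via torchrun); the model, dataset, and hyper-parameters are placeholders:
```python
# Hedged DDP sketch: each rank holds a full model replica, the sampler shards
# the corpus, and gradients are all-reduced during backward().
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1):
    dist.init_process_group("nccl")               # one process per GPU
    device = dist.get_rank() % torch.cuda.device_count()
    model = DDP(model.to(device), device_ids=[device])
    sampler = DistributedSampler(dataset)         # shards the corpus across ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for x, y in loader:
            logits = model(x.to(device))          # placeholder forward pass
            loss = F.cross_entropy(logits, y.to(device))
            opt.zero_grad()
            loss.backward()                       # gradients all-reduced here
            opt.step()
    dist.destroy_process_group()
```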
9.
Large Scale Language Models
• Training efficiency: compute, memory, communication
• Platforms:
• DeepSpeed: optimization library for distributed training and inference, Microsoft;
• DeepSpeed MII: enables low-latency and high-throughput inference;
• Megatron-LM: training large transformer language models at scale, Nvidia;
• TensorRT-LLM (from previous FasterTransformer): Python API to define LLMs and build TensorRT
engines for inference efficiently on NVIDIA GPUs;
• Colossal-AI: leverages a series of parallel methods for distributed training of AI models;
• vLLM: a high-throughput and memory-efficient inference and serving engine for LLMs (usage example below);
• lightLLM: a Python-based LLM inference and serving framework with a lightweight design.
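A hedged example of offline batched inference, assuming vLLM's documented Python API; the model name, prompts, and sampling settings are illustrative:
```python
# Hedged vLLM sketch: batched generation with an HF-compatible causal LM;
# PagedAttention manages the KV cache under the hood.
from vllm import LLM, SamplingParams

prompts = [
    "A pedestrian suddenly steps onto the road. The safest maneuver is",
    "Summarize the right-of-way rule at a four-way stop:",
]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-1.3b")        # illustrative model choice
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```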
10.
Large Scale Language Models
• Emergent abilities arise in large models but not in smaller models;
• In-context learning: Instruction and demonstrations on downstream tasks;
• Instruction following: fine-tuning with natural language descriptions;
• Step-by-step reasoning: Chain-of-Thought (CoT) prompting.
• Parameter Efficient Fine-tuning (PEFT): optimizing a small fraction of parameters;
• Addition-based: adapter tuning, prefix-tuning (soft prompting), prompt-based tuning, P-
tuning 1/2, (IA)3;
• Selection/Specification-based: BitFit, DiffPruning, Cross-attention tuning, Fish-Mask, LT-SFT;
• Reparameterization-based: LoRA (Low-Rank Adaptation), HINT (Hypernetwork instruction
tuning), QLoRA, Delta-tuning (a minimal LoRA sketch follows this list);
• LLM alignment with human preference:
• RLHF (reinforcement learning from human feedback)
• Constitutional AI: RL from AI Feedback (RLAIF).
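A minimal sketch of a LoRA-style adapter around a frozen linear layer; the rank, scaling, and initialization are illustrative assumptions rather than a specific library's implementation:
```python
# Hedged LoRA sketch: the frozen base weight is combined with a trainable
# low-rank update B @ A; only A and B are updated during fine-tuning.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scale * x A^T B^T
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Usage: wrap an attention projection of a pre-trained model, then optimize
# only the LoRA parameters.
layer = LoRALinear(nn.Linear(768, 768), rank=8)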
11.
Large Scale Language Models
• Other issues of LLMs
• Hallucination: A situation where the model generates content that is not based
on factual or accurate information;
• Explainability: The ability to explain or present the behavior of models in
human-understandable terms.
• Evaluation: It is important to better understand the strengths and weaknesses,
and to provide better guidance for human-LLM interaction;
• RAG (Retrieval-Augmented Generation): A promising solution for LLMs to
effectively interact with the external world (a minimal sketch follows below);
• Knowledge Graph (KG): use LLMs to augment KGs for knowledge extraction, KG
construction, and refinement, or use KGs to augment LLMs for training and
prompt learning, or knowledge augmentation.
• Others: computational tractability, continual learning, privacy and copyright etc.
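A minimal RAG sketch, assuming hypothetical embed() and generate() stand-ins for an embedding model and an LLM; the prompt template and top-k value are illustrative:
```python
# Hedged RAG sketch: retrieve the most similar documents by embedding cosine
# similarity and prepend them to the prompt before calling the LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # placeholder for a sentence-embedding model

def generate(prompt: str) -> str:
    raise NotImplementedError  # placeholder for any hosted or local LLM

def rag_answer(question: str, documents: list[str], top_k: int = 3) -> str:
    doc_vecs = np.stack([embed(d) for d in documents])
    q = embed(question)
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(documents[i] for i in np.argsort(-scores)[:top_k])
    prompt = (f"Answer using only the context below.\n\nContext:\n{context}\n\n"
              f"Question: {question}")
    return generate(prompt)
```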
13.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Vision Transformers (ViT):
• split an image into fixed-size patches, linearly embed each, add position
embeddings, and feed the sequence to a Transformer encoder (a patch-embedding sketch follows this list);
• add an extra learnable "classification token" to the sequence.
• ViT-22B: parallel layers, query/key (QK) normalization, omitted biases;
• DINO v1: Self-Supervised vision transformers;
• DINO v2: Unsupervised visual feature pre-training;
• Pix2seq: Language Modeling for Object Detection;
• Segment Anything Model (SAM): a promptable method;
• Segmenting Everything Everywhere (SEEM): interactive;
• SAM3D: LiDAR points projected to BEV images for 3D object detection;
• SEAL: segmentation of any point cloud sequences (LiDAR+camera);
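A minimal sketch of ViT-style patch embedding with a learnable classification token; the dimensions follow the common ViT-Base configuration but are only illustrative:
```python
# Hedged ViT patch-embedding sketch: patchify with a strided convolution,
# prepend a learnable [CLS] token, and add position embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        patches = self.proj(x).flatten(2).transpose(1, 2)          # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        return torch.cat([cls, patches], dim=1) + self.pos_embed   # encoder input

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))  # shape (2, 197, 768)
```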
14.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Visual Language Models (a CLIP-style contrastive objective is sketched below):
• CLIP, BLIP v1/2, PaLI-1/X/3, ImageBind, AnyMAL.
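A minimal sketch of a CLIP-style contrastive objective; the encoders are omitted and the batch is assumed to contain matched image-text pairs:
```python
# Hedged CLIP-style loss sketch: image and text embeddings are aligned with a
# symmetric cross-entropy over their similarity matrix (matched pairs on the
# diagonal).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))          # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```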
15.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Multi-modal Model:
• PointCLIP v1/2;
• ULIP v1/2;
• CLIP2Point;
• CLIP2Scene;
• OpenShape;
16.
Visual Language Model, Multi-Modality
Model and Embodied AI
• World Model: It explicitly represents an agent's knowledge about its
environment, using a generative model to predict the future;
• Dynalang is an agent that learns a multi-modal world model to predict future text and image
representations, and learns to act from model rollouts;
17.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Tree of Thoughts (ToT): Deliberate Problem Solving with Large Language Models;
• Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models;
• Graph of Thoughts: Solving Elaborate Problems with LLMs;
18.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Embodied AI/Agent: AI algorithms and agents no longer learn from static datasets;
instead, they learn through interactions with the environment from an egocentric perception;
“The Rise and Potential of Large Language Model Based Agents: A Survey”
19.
Visual Language Model, Multi-Modality
Model and Embodied AI
• PaLM-E: embodied language models to incorporate sensor modalities for multiple
tasks, even for sequential robotic manipulation planning;
• VOYAGER: An Open-Ended Embodied Agent with LLMs;
• EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought;
20.
Visual Language Model, Multi-Modality
Model and Embodied AI
• A Generalist Agent (Gato);
• LM-Nav: Large Pre-trained Models of Language, Vision, and Action;
• ReAct: Synergizing Reasoning and Acting in Language Models (a minimal reason-act loop is sketched below);
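A minimal sketch of a ReAct-style reason-act loop; llm() and the tool registry are hypothetical placeholders, not the paper's exact prompt format:
```python
# Hedged ReAct sketch: the model alternates Thought/Action steps, a tool is
# called, and its result is fed back into the transcript as an Observation.
def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an instruction-following LLM

TOOLS = {
    "search": lambda q: f"(search results for: {q})",   # placeholder tool
}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")              # model reasons, then may act
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            name, _, arg = step.split("Action:")[-1].strip().partition(" ")
            obs = TOOLS.get(name, lambda a: "unknown tool")(arg)
            transcript += f"Observation: {obs}\n"
    return transcript
```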
21.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Toolformer: Language Models Can Teach Themselves to Use Tools;
• ToolLLM: Facilitating Large Language Models To Master 16000+ Real-World APIs;
22.
Visual Language Model, Multi-Modality
Model and Embodied AI
• RT-1: Robotics Transformer for Real-World Control at Scale;
• RT-2: Vision-Language-Action (VLA) Models Transfer Web Knowledge;
• RT-X: A high-capacity model with “generalist” X-robot policy;
23.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Habitat 1.0: A Platform for Embodied AI Research;
• Habitat 2.0: Training Home Assistants to Rearrange their Habitat;
• Habitat 3.0: A Co-Habitat For Humans, Avatars And Robots;
25.
Diffusion Model
• It aims to generate images from Gaussian noise via an iterative denoising process.
• Its formulation, inspired by non-equilibrium thermodynamics, consists of
a diffusion process and a reverse process.
• In the diffusion process, an image is gradually converted to a Gaussian distribution by
adding random Gaussian noise over many iterations.
• The reverse process recovers the image from that distribution through several
denoising steps (a minimal forward-noising sketch follows below).
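A minimal sketch of the DDPM forward (noising) process in closed form; the linear schedule and step count are illustrative, and the reverse process is what a trained denoising network would learn:
```python
# Hedged diffusion sketch: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) by adding Gaussian noise at step t."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps

x0 = torch.rand(1, 3, 64, 64)      # a toy "image"
xT = q_sample(x0, T - 1)           # nearly pure Gaussian noise
```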
26.
Diffusion Model
• Latent Diffusion Model (LDM): model the distribution in the latent space of images;
• Two modules: an autoencoder and a diffusion model;
• Open source: Stable Diffusion.
27.
Diffusion Model
• DALL-E v1: Zero-shot text-to-image generation, multi-modal version of GPT-3;
• DALL-E v2: Hierarchical text-conditional image generation with diffusion decoder;
• DALL-E v3: Improving Image Generation with Better Captions;
• Image Captioner: very similar to a language model, trained with a CLIP;
• A small caption subset is used to fine-tune the captioner, producing "short synthetic captions";
• Fine-tuning on long, highly descriptive captions produces "descriptive synthetic captions";
28.
Diffusion Model
• Point-E: Generating 3D Point Clouds from Complex Prompts;
• LidarCLIP: learn a mapping from LiDAR point clouds to the CLIP embedding space;
• Fantasia3D: Disentangling Geometry and Appearance for High-Quality Text-to-3D Content Creation;
30.
Neural Radiance Field (NeRF)
• NeRF uses neural networks to model the 3D geometry and appearance of objects in a scene,
enabling higher-quality visualizations than traditional techniques;
• The three steps: sampling 5D coordinates (location and viewing direction) along
camera rays, applying an MLP to estimate color and volume density, and
aggregating these values into an image by volume rendering (a minimal rendering sketch follows below);
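A minimal sketch of the volume-rendering quadrature along one camera ray; the MLP stands in for a trained radiance field, and the sampling bounds are illustrative:
```python
# Hedged NeRF rendering sketch: C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
# with T_i the transmittance accumulated along the ray.
import torch

def render_ray(mlp, origin, direction, near=0.1, far=5.0, n_samples=64):
    t = torch.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction            # (N, 3) sample locations
    dirs = direction.expand(n_samples, 3)
    rgb, sigma = mlp(pts, dirs)                      # colors (N, 3), densities (N,)
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma * delta)          # opacity of each segment
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(dim=0)       # composited pixel color
```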
31.
Neural Radiance Field (NeRF)
• Generalization: MVSNeRF, PixelNeRF, IBRNet;
• Quality and scalability: NeRF in the wild, Mip-NeRF, Mip-NeRF 360;
• Acceleration: KiloNeRF, Instant Neural Graphics Primitives, FastNeRF;
• Relighting: Neural Reflectance Fields, NeRV;
• Large Scale Scenes: Block-NeRF, Mega-NeRF, UE4-NeRF;
• Driving Scenes: Neural Scene Graphs, Lift3D, S-NeRF;
• NeRF and Language: Dream Fields, DietNeRF, CLIP-NeRF;
• NeRF and Diffusion: Latent NeRF, SparseFusion, Magic-3D;
• NeRF, Diffusion and Language: DreamFusion, Points-to-3D.
Applications of Foundation Models for
Autonomous Driving
• Challenges or problems:
• Corner cases;
• Current popular solutions:
• Data closed loop;
• Categories of methods with
large foundation models;
• Based on grounding scenarios;
• Large Language Models?
• Potential in future:
• Diffusion model;
• NeRF.
1. “Vision Language Models in Autonomous Driving and Intelligent Transportation Systems”, arXiv:2310.14414, 2023
2. “A Survey of Large Language Models for Autonomous Driving”, arXiv:2311.01043, 2023
37.
Simulation for Autonomous Driving
• Simulation works as AIGC (AI-generated content);
• Sensor Data Synthesis:
• Image, video, LiDAR;
• Traffic flow synthesis;
• Technologies:
• NeRF, Diffusion, Visual-language model, LLMs;
39.
World Model for Autonomous Driving
• The world model is a neural simulator, synthesizing long-tailed scenarios;
• It can predict the next observations to facilitate end-to-end solutions (a minimal rollout sketch follows below);
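A minimal sketch of using a learned world model as a neural simulator, rolling a latent state forward under candidate actions; all modules are hypothetical placeholders, not a specific published model:
```python
# Hedged world-model sketch: encode the current observation, imagine future
# latent states under an action sequence, and decode predicted observations.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, obs_dim=256, act_dim=2, latent=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent)
        self.dynamics = nn.GRUCell(act_dim, latent)   # latent transition model
        self.decoder = nn.Linear(latent, obs_dim)     # predicts the next observation

    @torch.no_grad()
    def rollout(self, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        z = self.encoder(obs)
        preds = []
        for a in actions:                              # imagine future steps
            z = self.dynamics(a.unsqueeze(0), z)
            preds.append(self.decoder(z))
        return torch.stack(preds)

wm = WorldModel()
future = wm.rollout(torch.randn(1, 256), torch.randn(5, 2))  # 5 imagined frames
```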
41.
Data Annotation for Autonomous Driving
• Auto labeling is important for the efficiency of a data closed loop;
• Open-vocabulary annotation needs world knowledge from LLMs/VLMs;
43.
Decision Making, Planning and E2E Driving
• LLMs' Integration:
• LLMs can serve as the decision-making module, while various functions, such as the
perception, localization and prediction modules, act as the vehicle's sensing
devices or tools.
• Besides, the vehicle's actions and controller function as its executor, carrying out
orders from the decision-making process.
• Similarly, a multi-modal language model (MMLM) can be built from sensor-text-action
data (with the help of LLMs) for E2E autonomous driving, generating either trajectory
predictions or control signals directly, like an LLM instruction-tuning solution.
• Another way to apply LLMs is to merge vectorized modalities (encoded from raw sensor
input or from tools like perception, localization and prediction) with a pre-trained LLM,
like an LLM-augmented solution (a hypothetical prompting sketch follows below).
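A hypothetical sketch of the LLM-as-decision-maker setup described above; llm(), the scene schema, and the maneuver set are illustrative assumptions:
```python
# Hedged decision-making sketch: outputs of perception/localization/prediction
# tools are serialized into a prompt and the LLM returns a high-level maneuver,
# which a downstream planner/controller would then execute.
import json

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an instruction-tuned LLM

def decide(scene: dict) -> str:
    prompt = (
        "You are the decision-making module of an autonomous vehicle.\n"
        f"Scene (from perception/localization/prediction tools):\n{json.dumps(scene, indent=2)}\n"
        "Choose one maneuver from [keep_lane, slow_down, stop, change_lane_left, "
        "change_lane_right] and justify it in one sentence."
    )
    return llm(prompt)

scene = {"ego_speed_mps": 12.0,
         "objects": [{"type": "pedestrian", "distance_m": 18.0, "predicted": "crossing"}]}
# decision = decide(scene)
```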
45.
Decision Making, Planning and E2E Driving
• Tokenization like language GPT:
• It builds the model on self-collected data (with the help of LLM/VLM) in
a similar way to the language GPT.
46.
Decision Making, Planning and E2E Driving
• Pre-trained Foundation Model:
• It is self-supervised, including the perception module or the world model module;
• The perception module needs to be accompanied by planning and decision making;
• The world model module could be E2E, free from object-level understanding;
• It needs a huge amount of data to cover the diversity of driving scenarios, even without
LLMs' support.
48.
Conclusion
• LLMs possess human knowledge and emergent abilities;
• Vision models, visual language models and multi-modality models extend the
LLMs' capabilities to broad modalities;
• The diffusion model is a generative model, trained for diverse data generation;
• The neural radiance field provides a neural method for 3-D scene synthesis;
• Applications for autonomous driving are categorized into different grounding
cases: simulation, world model, annotation, planning, decision making
and E2E driving;
• Embodied AI/agents could be augmented by LLMs, grounded in
autonomous driving.