Large Scale Foundation Models
for Autonomous Driving
Yu Huang
Roboraction.AI
Y. Huang, Y. Chen, Z. Li, “Large Scale Foundation Models for Autonomous Driving”, arXiv:2311.12144, 2023
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Introduction
• Autonomous driving is a long-tailed AI problem;
• Foundation model is a paradigm in which a model is first pre-trained and
then fine-tuned for downstream tasks;
• Large Scale Language Models (LLMs) with billions of parameters, like
ChatGPT and GPT-4, are built on the foundation model paradigm;
• Diffusion models work for data generation;
• NeRF provides an implicit representation of 3-D structure.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Large Scale Language Models
• Transformer is the backbone architecture of
most well known LLMs;
• Modifications of the Transformer for efficiency and
scalability are as follows (a GQA sketch follows this list):
• Multi-query attention (MQA): keys and values are shared
across all attention heads;
• GQA (grouped-query attention): a generalization of MQA with
an intermediate number of key-value heads;
• RoPE (Rotary Position Embedding): position encoding with a
rotation matrix, from RoFormer;
• Switch Transformers: simplified Mixture-of-Experts routing;
• FlashAttention 1/2: tiling to reduce memory reads/writes;
• PagedAttention: the virtual memory and paging techniques of
operating systems applied to the attention KV cache.
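To make the MQA/GQA idea concrete, here is a minimal, self-contained sketch of grouped-query attention in PyTorch: n_kv_heads = 1 recovers MQA and n_kv_heads = n_heads recovers standard multi-head attention. The shapes and the plain-matrix weights are illustrative assumptions, not any particular model's layout.

```python
# Minimal sketch of grouped-query attention (GQA); n_kv_heads=1 is MQA,
# n_kv_heads=n_heads is standard multi-head attention.
import torch
import torch.nn.functional as F

def gqa(x, wq, wk, wv, n_heads, n_kv_heads):
    B, T, D = x.shape
    hd = D // n_heads                                          # per-head dim
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)       # (B, H, T, hd)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)    # (B, Hkv, T, hd)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Each group of n_heads // n_kv_heads query heads shares one K/V head:
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)                        # (B, H, T, hd)
    v = v.repeat_interleave(rep, dim=1)
    att = F.softmax((q @ k.transpose(-2, -1)) / hd**0.5, dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, D)

# Toy usage: 8 query heads sharing 2 K/V heads.
B, T, D, H, Hkv = 2, 16, 64, 8, 2
x = torch.randn(B, T, D)
wq = torch.randn(D, D)
wk = torch.randn(D, (D // H) * Hkv)   # smaller K/V projections: the memory win
wv = torch.randn(D, (D // H) * Hkv)
print(gqa(x, wq, wk, wv, H, Hkv).shape)  # torch.Size([2, 16, 64])
```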
Large Scale Language Models
• LLMs refer to Transformer-based language models that contain hundreds of
billions (or more) of parameters and are trained on massive text data:
• GPT (generative pre-trained transformer)-1/2/3/4: from text to multi-modality;
• PaLM (pathways language model): on an efficient ML system Pathways on thousands of TPUs;
• OPT (Open Pre-trained Transformer): comparable to GPT-3;
• GLM (General Language Model Pretraining): autoregressive blank infilling;
• LLaMA (LLM Meta AI)-1/2: Open Foundation Language Models, fine-tuned chat models;
• T5 (Text-to-Text Transfer Transformer): encoder-decoder models;
• LLMs significantly extend the model size, data size (tokens), and total compute
(by orders of magnitude); model capability is largely improved by this scaling.
(Figure: Kaplan (OpenAI)’s power-law scaling vs. Hoffmann (Google DeepMind)’s compute-optimal training.)
Large Scale Language Models
• Training efficiency: compute, memory, communication
• Data parallelism: distribute the whole training corpus across multiple GPUs with
replicated model parameters and states;
• Synchronous: distributed data parallelism (DDP), sketched after this list;
• Asynchronous: parameter server (PS);
• Model parallelism: partition a model graph into subgraphs and assign each
subgraph to a different GPU;
• Pipeline parallelism: distribute the different layers of an LLM across multiple GPUs;
• Tensor parallelism: decompose the tensors (the parameter matrices) across multiple GPUs;
• Zero Redundancy Optimizer (ZeRO): partition model states (optimizer states,
gradients, and parameters, in three corresponding stages) across processors to
optimize memory and communication;
• ZeRO-Offload: offload data and computation to the CPU to save GPU memory;
• ZeRO-Infinity: leverage CPU and NVMe memory across multiple devices;
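As a concrete reference point for the synchronous data-parallel option above, below is a minimal DDP sketch in PyTorch; it assumes a single-node launch via `torchrun --nproc_per_node=N` (which sets the environment variables `init_process_group` reads), and the toy model and loss are placeholders.

```python
# Hedged sketch of synchronous data parallelism with PyTorch DDP.
# Launch: torchrun --nproc_per_node=N train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")     # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(512, 512).cuda(rank)
    model = DDP(model, device_ids=[rank])       # replicates params, syncs grads
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                         # each rank sees its own data shard
        x = torch.randn(32, 512, device=rank)
        loss = model(x).pow(2).mean()           # placeholder loss
        opt.zero_grad()
        loss.backward()                         # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```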
Large Scale Language Models
• Training efficiency: compute, memory, communication
• Platforms:
• DeepSpeed: optimization library for distributed training and inference, Microsoft;
• DeepSpeed-MII: enables low-latency and high-throughput inference;
• Megatron-LM: training large transformer language models at scale, Nvidia;
• TensorRT-LLM (successor to FasterTransformer): Python API to define LLMs and build TensorRT
engines for efficient inference on NVIDIA GPUs;
• Colossal-AI: leverages a series of parallelism methods for distributed model training;
• vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs;
• lightLLM: Python-based LLM inference and serving framework with lightweight design.
Large Scale Language Models
• Emergent abilities arise in large models but not in smaller ones;
• In-context learning: Instruction and demonstrations on downstream tasks;
• Instruction following: fine-tuning with natural language descriptions;
• Step-by-step reasoning: Chain-of-Thought (CoT) prompting.
• Parameter-Efficient Fine-Tuning (PEFT): optimizing a small fraction of parameters
(a LoRA sketch follows this list);
• Addition-based: adapter tuning, prefix-tuning (soft prompting), prompt-based tuning,
P-tuning 1/2, (IA)3;
• Selection/Specification-based: BitFit, DiffPruning, Cross-attention tuning, Fish-Mask, LT-SFT;
• Reparameterization-based: LoRA (Low-Rank Adaptation), HINT (Hypernetwork instruction
tuning), QLoRA, Delta-tuning;
• LLM alignment with human preference:
• RLHF (reinforcement learning from human feedback)
• Constitutional AI: RL from AI Feedback (RLAIF).
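To illustrate the reparameterization-based family, here is a minimal LoRA sketch: the pre-trained weight is frozen and only a low-rank update BA is trained, so the effective weight is W + (alpha/r)·BA. The rank, scaling, and initialization below follow the common recipe but are illustrative choices.

```python
# Minimal LoRA sketch: freeze the pre-trained linear layer, train only the
# low-rank factors A and B. B starts at zero so BA = 0 at initialization.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pre-trained layer
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # frozen path + scaled low-rank update
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(4, 768))                  # behaves like the original layer
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, trainable)                       # torch.Size([4, 768]) 12288
```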
Large Scale Language Models
• Other issues of LLMs
• Hallucination: A situation where the model generates content that is not based
on factual or accurate information;
• Explainability: The ability to explain or present the behavior of models in
human-understandable terms.
• Evaluation: It is important to better understand the strengths and weaknesses
of LLMs, and to provide better guidance for human-LLM interaction;
• RAG (Retrieval Augmented Generation): A promising solution for LLMs to
effectively interact with the external world;
• Knowledge Graph (KG): use LLMs to augment KGs for knowledge extraction, KG
construction, and refinement, or use KGs to augment LLMs for training and
prompt learning, or knowledge augmentation.
• Others: computational tractability, continual learning, privacy, copyright, etc.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Visual Language Model, Multi-Modality
Model and Embodied AI
• Vision Transformers (ViT) (a patch-embedding sketch follows this list):
• split an image into fixed-size patches, linearly embed each, add position
embeddings, feed to a Transformer encoder;
• add an extra learnable “classification token” to the sequence.
• ViT-22B: parallel layers, query/key (QK) normalization, omitted biases;
• DINO v1: Self-Supervised vision transformers;
• DINO v2: Unsupervised visual feature pre-training;
• Pix2seq: Language Modeling for Object Detection;
• Segment Anything Model (SAM): a promptable method;
• Segmenting Everything Everywhere (SEEM): interactive;
• SAM3D: LiDAR points projected to BEV images for 3D object detection;
• SEAL: segmentation of any point cloud sequences (LiDAR+camera);
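A minimal sketch of the ViT front end described above: fixed-size patches, a linear embedding (implemented as a strided convolution, the standard equivalent), a learnable classification token, and position embeddings. The 224/16/768 configuration matches ViT-Base but is otherwise an illustrative choice.

```python
# Sketch of the ViT front end: split an image into fixed-size patches, linearly
# embed each, prepend a learnable [CLS] token, and add position embeddings.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img=224, patch=16, dim=768):
        super().__init__()
        n = (img // patch) ** 2                        # 196 patches
        # A strided conv implements "split into patches + linear embed":
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))

    def forward(self, x):                              # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, 196, dim)
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos   # (B, 197, dim) -> encoder

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```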
Visual Language Model, Multi-Modality
Model and Embodied AI
• Visual Language Models:
• CLIP, BLIP v1/2, PaLI-1/X/3, ImageBind, AnyMAL.
Visual Language Model, Multi-Modality
Model and Embodied AI
• Multi-modal Model:
• PointCLIP v1/2;
• ULIP v1/2;
• CLIP2Point;
• CLIP2Scene;
• OpenShape;
Visual Language Model, Multi-Modality
Model and Embodied AI
• World Model: It explicitly represents the knowledge of an agent about its
environment, using a generative model to predict the future;
• Dynalang is an agent that learns a multi-modal world model to predict future text
and image representations, and learns to act from model rollouts;
Visual Language Model, Multi-Modality
Model and Embodied AI
• Tree of Thoughts (ToT): Problem Solving with Large Language Models;
• Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models;
• Graph of Thoughts: Solving Elaborate Problems with LLMs;
Visual Language Model, Multi-Modality
Model and Embodied AI
• Embodied AI/Agent: AI algorithms and agents no longer learn from datasets;
instead, they learn through interactions with the environment from an egocentric perception;
“The Rise and Potential of Large Language Model Based Agents: A Survey”
Visual Language Model, Multi-Modality
Model and Embodied AI
• PaLM-E: embodied language models to incorporate sensor modalities for multiple
tasks, even for sequential robotic manipulation planning;
• VOYAGER: An Open-Ended Embodied Agent with LLMs;
• EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought;
Visual Language Model, Multi-Modality
Model and Embodied AI
• A Generalist Agent (Gato);
• LM-Nav: Large Pre-trained Models of Language, Vision, and Action;
• ReAct: Synergize reasoning and acting in language models;
Visual Language Model, Multi-Modality
Model and Embodied AI
• Toolformer: Language Models Can Teach Themselves to Use Tools;
• ToolLLM: Facilitating Large Language Models To Master 16000+ Real-World APIs;
Visual Language Model, Multi-Modality
Model and Embodied AI
• RT-1: Robotics Transformer for Real-World Control at Scale;
• RT-2: Vision-Language-Action (VLA) Models Transfer Web Knowledge;
• RT-X: A high-capacity model with “generalist” X-robot policy;
Visual Language Model, Multi-Modality
Model and Embodied AI
• Habitat 1.0: A Platform for Embodied AI Research;
• Habitat 2.0: Training Home Assistants to Rearrange their Habitat;
• Habitat 3.0: A Co-Habitat For Humans, Avatars And Robots;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Diffusion Model
• It aims to generate images from Gaussian noise via an iterative denoising process
(a sketch of the forward process follows).
• Its formulation, grounded in a physical analogy, consists of a diffusion process
and a reverse process.
• In the diffusion process, an image is converted to a Gaussian distribution by
iteratively adding random Gaussian noise.
• The reverse process recovers the image from that distribution through several
denoising steps.
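A minimal sketch of the forward (diffusion) process just described; it uses the closed-form sampling of x_t given x_0, and the linear beta schedule is a common illustrative choice rather than a requirement.

```python
# Forward diffusion: with a variance schedule beta_t, x_t can be sampled in
# closed form from x_0:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I),
# where alpha_bar_t is the cumulative product of (1 - beta_t). The reverse
# process trains a network to predict eps and denoise step by step.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # common linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) for a batch of timesteps t."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps

x0 = torch.randn(4, 3, 32, 32)                   # stand-in for images in [-1, 1]
t = torch.randint(0, T, (4,))
xt, eps = q_sample(x0, t)                        # training pairs for the denoiser
print(xt.shape)                                  # torch.Size([4, 3, 32, 32])
```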
Diffusion Model
• Latent Diffusion Model (LDM): model the distribution of the latent space of images;
• Two modules: an autoencoder and a diffusion model;
• Open source: Stable Diffusion.
Diffusion Model
• DALL-E v1: Zero-shot text-to-image generation, multi-modal version of GPT-3;
• DALL-E v2: Hierarchical text-conditional image generation with diffusion decoder;
• DALL-E v3: Improving Image Generation with Better Captions;
• Image Captioner: very similar to a language model, trained with a CLIP image encoder;
• A small caption subset is used to fine-tune the captioner, whose outputs serve as “short synthetic captions”;
• Long, highly descriptive captions are then used for fine-tuning, whose outputs serve as “descriptive synthetic captions”;
Diffusion Model
• Point-E: Generating 3D Point Clouds from Complex Prompts;
• LidarCLIP: learns a mapping from LiDAR point clouds to the CLIP embedding space;
• Fantasia3D: Disentangling Geometry and Appearance for High-Quality Text-to-3D Content Creation;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Neural Radiance Field (NeRF)
• NeRF uses NNs to model the 3D geometry and appearance of objects in a scene,
enabling higher-quality visualizations than traditional techniques;
• The three steps (volume rendering is sketched below): sampling 5D coordinates
(location and viewing direction) along camera rays, applying an MLP to estimate
color and volume density, and aggregating these values into an image by volume rendering;
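A minimal sketch of the third step, volume rendering: given colors and densities sampled along a ray, composite them into a single pixel color with the standard transmittance weighting. The random inputs stand in for MLP outputs.

```python
# NeRF's volume rendering along one ray:
#   C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
#   T_i = exp(-sum_{j<i} sigma_j * delta_j)   (transmittance up to sample i)
import torch

def render_ray(rgb, sigma, deltas):
    """rgb: (N, 3) colors, sigma: (N,) densities, deltas: (N,) sample spacings."""
    alpha = 1.0 - torch.exp(-sigma * deltas)             # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)    # running transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])       # shift so T_1 = 1
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(dim=0)           # composited RGB

N = 64                                  # samples along one camera ray
color = render_ray(torch.rand(N, 3), torch.rand(N) * 5.0, torch.full((N,), 0.03))
print(color)                            # the RGB value for this ray's pixel
```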
Neural Radiance Field (NeRF)
• Generalization: MVSNeRF, PixelNeRF, IBRNet;
• Quality and scalability: NeRF in the wild, Mip-NeRF, Mip-NeRF 360;
• Acceleration: KiloNeRF, Instant Neural Graphics Primitives, FastNeRF;
• Relighting: Neural Reflectance Fields, NeRV;
• Large Scale Scenes: Block-NeRF, Mega-NeRF, UE4-NeRF;
• Driving Scenes: Neural Scene Graphs, Lift3D, S-NeRF;
• NeRF and Language: Dream Fields, DietNeRF, CLIP-NeRF;
• NeRF and Diffusion: Latent NeRF, SparseFusion, Magic-3D;
• NeRF, Diffusion and Language: DreamFusion, Points-to-3D.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Applications of Foundation Models for
Autonomous Driving
• Autonomous Driving SAE Levels;
Applications of Foundation Models for
Autonomous Driving
• Modular or E2E approach?
Applications of Foundation Models for
Autonomous Driving
• Challenges or problems:
• Corner cases;
• Current popular solutions:
• Data closed loop;
• Categories of methods with
large foundation models;
• Based on grounding scenarios;
• Large Language Models?
• Potential in future:
• Diffusion model;
• NeRF.
1. “Vision Language Models in Autonomous Driving and Intelligent Transportation Systems”, arXiv:2310.14414, 2023
2. “A Survey of Large Language Models for Autonomous Driving”, arXiv:2311.01043, 2023
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Simulation for Autonomous Driving
• Simulation works as AIGC (AI-generated content);
• Sensor Data Synthesis:
• Image, video, LiDAR;
• Traffic flow synthesis;
• Technologies:
• NeRF, Diffusion, Visual-language model, LLMs;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
World Model for Autonomous Driving
• The world model is a neural simulator, synthesizing long-tailed scenarios;
• It can predict the next observations to facilitate end-to-end solutions;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Data Annotation for Autonomous Driving
• Auto labeling is important for the efficiency of a data closed loop;
• Open vocabulary annotation needs world knowledge from LLMs/VLMs;
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Decision Making, Planning and E2E Driving
• LLMs’ Integration (a schematic sketch follows this list):
• LLMs can serve as the decision-making module, while various functions, such as
the perception, localization, and prediction modules, act as the vehicle’s
sensing devices or tools.
• Besides, the vehicle’s actions and controller function as its executor,
carrying out orders from the decision-making process.
• Similarly, a multi-modal language model (MMLM) can be built from sensor-text-
action data (with the help of LLMs) for E2E autonomous driving, generating
either trajectory predictions or control signals directly, like an LLM
instruction-tuning solution.
• Another way to apply LLMs is merging vectorized modalities (encoded from raw
sensor input or from tools like perception, localization, and prediction) with
a pre-trained LLM, like an LLM-augmented solution.
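A schematic, hypothetical sketch of this integration pattern: perception, localization, and prediction are stubbed as tools whose outputs are serialized into a prompt, the LLM returns a high-level decision, and the controller executes it. Every function here is a stand-in, not a specific system's API.

```python
# Hypothetical sketch of LLMs' integration: tools feed a prompt, the LLM
# decides, the controller executes. All functions are illustrative stubs.

def perceive(frame):       # perception tool (stub)
    return [{"type": "pedestrian", "dist_m": 12.0, "crossing": True}]

def localize(frame):       # localization tool (stub)
    return {"lane": "ego", "speed_mps": 8.0}

def predict(objects):      # prediction tool (stub)
    return [{"obj": 0, "enters_ego_lane_in_s": 1.5}]

def query_llm(prompt):     # stand-in for a real LLM call
    return "yield: pedestrian predicted to enter ego lane in 1.5 s"

def drive_step(frame):
    objects = perceive(frame)
    scene = {"objects": objects,
             "ego": localize(frame),
             "forecasts": predict(objects)}
    prompt = ("You are the decision-making module of an autonomous vehicle.\n"
              f"Scene: {scene}\n"
              "Choose one action: keep_lane | change_lane_left | "
              "change_lane_right | yield | stop, then give a one-line reason.")
    decision = query_llm(prompt)
    action = decision.split(":")[0]     # the controller acts as the executor
    return action

print(drive_step(frame=None))  # -> "yield"
```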
Decision Making, Planning and E2E Driving
• Tokenization like the language GPT (an illustrative sketch follows):
• It builds the model on self-collected data (with the help of LLM/VLM) in
a similar way to the language GPT.
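An illustrative sketch of such tokenization: continuous driving signals are discretized into a small vocabulary so that driving logs become token sequences suitable for GPT-style autoregressive training. The 256-bin uniform scheme is an assumption for illustration, not the paper's exact recipe.

```python
# Discretize continuous driving signals (steering, speed) into integer tokens,
# turning driving logs into sequences for GPT-style autoregressive training.
import numpy as np

N_BINS = 256  # illustrative vocabulary size per signal

def tokenize(value, lo, hi):
    """Map a continuous value in [lo, hi] to one of N_BINS integer tokens."""
    value = np.clip(value, lo, hi)
    return int((value - lo) / (hi - lo) * (N_BINS - 1))

def detokenize(token, lo, hi):
    """Invert tokenize up to quantization error."""
    return lo + token / (N_BINS - 1) * (hi - lo)

# One timestep -> a pair of tokens; a whole drive -> a token sequence.
steer_tok = tokenize(-0.1, lo=-1.0, hi=1.0)    # steering in [-1, 1]
speed_tok = tokenize(12.5, lo=0.0, hi=40.0)    # speed in m/s
print(steer_tok, speed_tok)                    # e.g. 114 79
print(round(detokenize(steer_tok, -1.0, 1.0), 3))  # ~ -0.1 (quantized)
```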
Decision Making, Planning and E2E Driving
• Pre-trained Foundation Model:
• It is self-supervised, including a perception module or a world model module;
• The perception module needs to be accompanied by planning and decision making;
• The world model module could be E2E, free from object-level understanding;
• It needs a huge amount of data to cover the diversity of driving scenarios,
even without LLMs’ support.
Outline
• Introduction
• Large Scale Language Models
• Visual Language Model, Multi-Modality Model and Embodied AI
• Diffusion Model
• Neural Radiance Field (NeRF)
• Applications of Foundation Models for Autonomous Driving
• Simulation
• World Model
• Data Annotation
• Decision making, planning and E2E driving
• Conclusion
Conclusion
• LLMs hold human knowledge and exhibit emergent abilities;
• Vision models, visual language models, and multi-modality models extend the
LLMs’ capabilities to broad modalities;
• The diffusion model is a generative model, trained for diverse data generation;
• The neural radiance field provides a neural method for 3-D scene synthesis;
• Applications for autonomous driving fall into different grounding
cases: simulation, world model, annotation, planning, decision making,
and E2E driving;
• Embodied AI/agents can be augmented by LLMs, grounded in
autonomous driving.
End
