taco-group/GenAI4AD

Generative AI for Autonomous Driving

Contributions are welcome. Please open an issue or pull request for any missing papers, datasets, or methods. We update the repository regularly.

Contents

Datasets

Single-Vehicle Perception Datasets

Dataset Data Source Sampling Rate Camera Type LiDAR Radar HD Map Annotation Type
KITTI (2012) Karlsruhe, Germany 10 Hz Stereo (2 cameras) 3D Bounding Boxes
Cityscapes (2016) 50 German Cities N/A Stereo (2 cameras) 2D Segmentation
ApolloScape (2018) Various Cities in China N/A Stereo (2 cameras) Semantic Segmentation
Honda H3D (2019) Bay Area, USA N/A Frontal View (1 camera) 3D Bounding Boxes
nuScenes (2019) Boston, Pittsburgh, Singapore 2 Hz Surround View (6 cameras) 3D Bounding Boxes
Waymo Open Dataset (2019) Multiple US Cities 10 Hz Frontal/Side (5 cameras) 3D Bounding Boxes
Argoverse (2019) Miami and Pittsburgh 10 Hz Surround View 3D Bounding Boxes
PandaSet (2020) San Francisco N/A Surround View (7 cameras) 3D Bounding Boxes, Segmentation
Audi A2D2 (2020) Various Cities in Germany 10 Hz Surround View (6 cameras) 3D Bounding Boxes
ONCE Dataset (2021) Various Cities in China 10 Hz Surround View (7 cameras) 3D Bounding Boxes

Motion Forecasting and Cooperative Driving Datasets

Dataset Data Source Sampling Rate Camera Type LiDAR HD Map Annotation Type
HighD (2018) German Highways N/A Drone (Bird's-eye View) Agent 2D Bounding Boxes
INTERACTION (2019) US, China, EU Intersections 10 Hz Drone and Fixed Cameras Agent Trajectories
PIE (2019) Toronto, Canada 30 Hz Frontal View (1 camera) Pedestrian Bounding Boxes, Intention Labels
Argoverse 1 & 2 (2019, 2022) Miami and Pittsburgh 10 Hz Surround View Agent Trajectories
Lyft Level 5 (2020) Palo Alto, USA 10 Hz Surround View Agent 3D Bounding Boxes
rounD (2020) German Roundabouts N/A Drone (Bird's-eye View) Vehicle 2D Bounding Boxes
Waymo Open Motion (2021) Multiple US Cities 10 Hz None Vehicle, Pedestrian, Cyclist Trajectories
nuPlan (2021) Multiple US Cities 10 Hz Surround View Agent 3D Bounding Boxes
LOKI (2021) Japan Intersections 5 Hz Vehicle Cameras 3D Bounding Boxes, Intention Labels
DAIR-V2X (2021) China Intersections N/A Vehicle and Roadside Cameras 3D Bounding Boxes
exiD (2022) German Highway Exits N/A Drone (Bird's-eye View) Vehicle 2D Bounding Boxes
V2X-Seq (2023) Urban Intersections 10 Hz Vehicle and Roadside Cameras 3D Agent Bounding Boxes
V2V4Real (2023) Ohio, USA 10 Hz Surround View 3D Bounding Boxes
UniOcc (2025) Various Cities in US 10 Hz Surround View 3D Occupancy Grids

Simulation Based Datasets

Dataset Data Source Camera Type LiDAR HD Map Simulation Task
FRIDA/FRIDA2 (2010–2012) MATLAB Monocular Foggy Images
SYNTHIA (2016) Unity Multiple Views Rain and Fog Images
Virtual KITTI (2016 & 2019) KITTI, Unity Monocular/Stereo Real2Sim Transfer
Playing for Benchmarks (2018) GTA-V Game Engine Multiple Views Interactive Driving Simulation
Foggy Cityscapes (2018) Cityscapes Monocular Foggy Images
IDDA (2020) CARLA Simulator Fisheye Semantic Segmentation
AIODrive (2021) CARLA Multiple Views Long Range Point Cloud
OPV2V (2021) CARLA Multiple Vehicles Cooperative Perception
Shift (2022) CARLA Multiple Views Weather, Lighting Simulation
DeepAccident (2023) CARLA Multiple Views Accident Scene Simulation
WARM-3D (2024) CARLA Monocular Sim2Real Transfer
SimBEV (2025) CARLA Multiple Views BEV Segmentation

Language-Based Datasets

Dataset Data Source Modality QA Type # QA Pairs
BDD-X (2018) Dashcam Recordings Videos (40s clips) Ego Intention, Scene Description 7K
DRAMA (2023) Japan Driving Videos Video Risk Object, Ego Intention, Ego Actions, Reasoning 170K
Rank2Tell (2024) US Driving Videos Video Object Importance, Ego Intention, Ego Actions, Reasoning 300K
LingoQA (2024) Driving Videos (4s clips) Video Scene Description, Recommended Actions, Reasoning 419K
NuScenes-QA (2024) nuScenes Same as nuScenes Scene Description 460K
DriveLM (2024) nuScenes, CARLA Same as nuScenes Multi-step Reasoning 360K
NuPlanQA (2025) Not Released as of April 2025 nuPlan Same as nuPlan Perception, Spatial Reasoning, Ego Intentions 1M
NuInstruct (2024) nuScenes Same as nuScenes Instruction–Response Pairs Across 17 Task Types 91K
doScenes (2024) nuScenes Same as nuScenes Free-Form Driving Instructions and Scene Reference Points 4K
MAPLM (2024) Chinese Cities Image, LiDAR Detailed Map Description (Lanes, Road, Signs) 61K
NuScenes-MQA (2024) nuScenes Same as nuScenes Scene Captioning, Visual QA 1.5M
DriveBench (2025) nuScenes Same as DriveLM Visual QA 20k

Methods

Image Generation Methods

Controllable Generation

Method Venue Dataset Modeling Type Backbone Control Variables
GeoDiffusion ICLR'24 nuScenes, COCO-Stuff Diffusion, VAE U-Net Object Box, Camera Pose, Text
DetDiffusion CVPR'24 COCO-Stuff Diffusion, VAE U-Net Object Box, Perception, Text
BEVGen IEEE RA-L'24 nuScenes, Argoverse 2 VQ-VAE Transformer BEV Map, Object Box, Text
BEVControl arXiv'23 nuScenes VAE CNN, Transformer, CLIP BEV Sketch, Text
MagicDrive ICLR'24 nuScenes Diffusion, VAE U-Net Road Map, Object Box, Camera Pose
MagicDrive3D arXiv'24 nuScenes 3DGS, Diffusion, VAE U-Net BEV Map, Object Box, Camera Pose
Drive-WM CVPR'24 Driving Data Diffusion, VAE U-Net Map, Text
SimGen NeurIPS'24 YouTube Diffusion, SDEdit U-Net BEV, Text
DatasetDM NeurIPS'23 - Diffusion, LLM, VAE U-Net, ControlNet Text
DriveGAN CVPR'21 RWD GAN, VAE CNN, LSTM, MLP Steering, Speed, Scene Features
LightDiff CVPR'24 nuScenes VAE, Diffusion U-Net Lighting Conditions
Streetscapes SIGGRAPH'24 Google Street View Diffusion ControlNet Road Map, Height Map, Camera Pose
WoVoGen ECCV'24 Urban Driving Diffusion, AutoEncoder CNN, CLIP Text, World Volumes, Ego Actions
HoloDrive arXiv'24 nuScenes VAE, Diffusion U-Net, Attention Text, 2D Layout
WeatherDG arXiv'24 Cityscapes Diffusion, LLM VAE, U-Net Text
UrbanArchitect arXiv'24 nuScenes Diffusion, ControlNet VAE Text, 3D Layout

Decompositional Generation

| Method | Venue | Dataset | Modeling Type | Backbone | Control Variables |
| --- | --- | --- | --- | --- | --- |
| ChatSim | CVPR'24 | Waymo Open Dataset | LLM, NeRF | MLP, Transformer | 3D Assets |
| UrbanGIRAFFE | ICCV'23 | KITTI-360, CLEVR-W | NeRF | MLP | Camera Pose, Panoptic Prior |
| Sat2Scene | CVPR'24 | HoliCity, OmniCity | NeRF | MLP | Satellite Images, Layout, 3D Constraints |
| Block-NeRF | CVPR'22 | Block-NeRF Dataset | NeRF | MLP | Spatial Block Layout, 3D Constraints |
| S-NeRF | ICLR'23 | nuScenes, Waymo Open Dataset | NeRF | MLP | Camera Path, 3D Constraints |
| NF-LDM | CVPR'23 | VizDoom, Replica, AVD | Diffusion, NeRF | MLP | Scene Embedding, 3D Constraints |
| Panoptic NeRF | IEEE 3DIMPVT'22 | KITTI-360 | NeRF | MLP | Semantic Segmentation, 3D Constraints |
| Neural Point Light Field | CVPR'22 | Waymo Open Dataset | NeRF | MLP | Camera Pose, 3D Constraints |
| Neural Scene Graphs | CVPR'21 | KITTI | NeRF | MLP | Object Graph Topology, 3D Constraints |
| UniSim | CVPR'23 | PandaSet | NeRF | MLP | Agent Profile, 3D Constraints |
| CADSim | CoRL'23 | MVMC, PandaSet | Differentiable CAD Rendering | MLP | CAD Geometry, 3D Constraints |

LiDAR Generation Methods

Method Venue Dataset Modeling Type Backbone Control Mechanism Generation Type
LiDMs CVPR'24 nuScenes, KITTI-360 Diffusion CNN, U-Net Multi-modal conditions Scene Generation
RangeLDM ECCV'24 KITTI-360, nuScenes Diffusion, VAE CNN, U-Net Partial Point Cloud Scene Completion, Generation
LidarDM ICRA'25 KITTI-360, WOD Diffusion, VAE CNN Semantic Map LiDAR Simulation & Raycasting
DynamicCity ICLR'25 Occ3D, CarlaSC Diffusion, VAE Transformer, CNN Layout, Trajectory, Text, Inpainting 4D Occupancy Scene Generation
GenMM arXiv'24 BDD100K, WOD Diffusion U-Net, Transformer 3D Bounding Boxes, Reference Image Object-Level Manipulation
Text2LiDAR ECCV'24 KITTI-360, nuScenes Diffusion Transformer Text Full Scene Generation
UltraLiDAR CVPR'23 PandaSet, KITTI VQ-VAE Transformer Sparse Point Cloud Scene Completion, Generation
LidarGRIT CVPR-W'24 KITTI-360, KITTI odometry VQ-VAE Transformer Unconditional Scene Generation
NeRF-LiDAR CVPR'24 nuScenes NeRF U-Net, MLP Camera Poses, Multi-view Images LiDAR Simulation
LiDAR4D CVPR'24 KITTI, nuScenes NeRF U-Net, MLP Camera Poses, Multi-view LiDAR Point Cloud LiDAR Simulation
DyNFL CVPR'24 WOD Neural SDF MLP LiDAR Scans, 3D Bounding Boxes LiDAR Simulation
LiDARsim CVPR'20 LiDARsim Dataset Physics-based Raycasting Raycasting Engine, U-Net 3D backgrounds, Dynamic Object Meshes LiDAR Simulation
PCGen ICRA'23 WOD FPA Raycasting Raycasting Engine, MLP Reconstructed Scenario LiDAR Simulation
LiDARGEN ECCV'22 KITTI-360, nuScenes Score-Based U-Net Sparse Point Cloud Scene Generation
Yue et al. ACM'18 KITTI Physics-based Raycasting Raycasting Engine Pre-defined In-game Scene Parameters LiDAR Simulation

Trajectory Generation Methods

| Method | Venue | Dataset | Modeling Type | Backbone |
| --- | --- | --- | --- | --- |
| Kim et al. | IEEE Access'21 | Real-world Driving | CVAE | DeepConvLSTM |
| Barbié et al. | JRM'19 | Synthetic | CVAE | RNN |
| CGNS | IROS'19 | ETH/UCY, SDD | GAN | CNN |
| EvolveGraph | NeurIPS'20 | ETH/UCY, SDD, H3D | Autoregressive | GNN |
| STG-DAT | T-ITS'21 | ETH/UCY, SDD | CVAE | GNN |
| PathGAN | ETRI'21 | iSUN | GAN | CNN |
| MID | CVPR'22 | ETH/UCY, Stanford Drone | Diffusion | Transformer |
| LED | CVPR'23 | ETH/UCY | Diffusion | Leapfrog |
| SingularTrajectory | CVPR'24 | Multiple Benchmarks | Diffusion | SVD |
| Diffusion-Planner | ICLR'25 | nuPlan | Diffusion | Transformer |
| GPT-Driver | NeurIPS'23 | nuScenes | LLM | Transformer |
| DriveLM | ECCV'24 | nuScenes | VLM | Transformer |
| LMDrive | CVPR'24 | CARLA | LLM | Transformer |
| OpenEMMA | WACV'25 | nuScenes | VLM | Transformer |
| DESIRE | CVPR'17 | KITTI, Stanford Drone | CVAE | RNN |
| Trajectron | ICCV'19 | ETH/UCY | CVAE | Graph RNN |
| Trajectron++ | ECCV'20 | ETH/UCY, nuScenes | CVAE | Constrained Graph RNN |
| Social GAN | CVPR'18 | ETH/UCY | GAN | RNN |
| SoPhie | CVPR'19 | ETH/UCY | GAN | Cross Attention |
| Social-BiGAT | NeurIPS'19 | ETH/UCY | Bicycle-GAN | Graph Attention Network |
| MotionDiffuser | CVPR'23 | WOMD | Diffusion | Transformer |
| SDT | OpenReview'24 | AV2 | Diffusion | Transformer |
| Westny et al. | arXiv'24 | rounD, highD | Diffusion | GNN |
| LMTrajectory | CVPR'24 | ETH/UCY | LLM | Transformer |
| TrafficSim | CVPR'21 | ATG4D (private) | CVAE | GNN |
| TrafficBots | ICRA'23 | WOMD | CVAE | MLP |
| DJINN | NeurIPS'23 | INTERACTION | Diffusion | Transformer |
| Scenario Diffusion | NeurIPS'23 | AV2 | Diffusion | U-Net |
| BehaviorGPT | NeurIPS'24 | WOMD | Autoregressive | Transformer |

3D Occupancy Generation Methods

Method Venue+Year Dataset Modeling Type Backbone Control Mechanism Generation Type Code
UrbanDiffusion arXiv'24 nuScenes via Occ3D VQ-VAE Diffusion BEV Layout Static Scene Not Released
DOME arXiv'24 nuScenes via Occ3D VAE DiT Ego Trajectory Scene and Agent Only Not Released
OccWorld ECCV'24 nuScenes via Occ3D VQ-VAE Transformer Past Occupancy Scene and Agent GitHub
OccSORA [Redacted] arXiv'24 nuScenes via Occ3D VQ-VAE DiT Ego Trajectory, Past Occupancy Scene and Agent GitHub*
OccLLaMA arXiv'24 nuScenes via Occ3D VQ-VAE LLaMA Language Scene and Agent Not Released
UnO CVPR'24 nuScenes, Argoverse2 Not Specified Transformer Past Occupancy Semantic LiDAR Not Released
DynamicCity ICLR'25 CARLA VAE DiT Ego Trajectory Scene and Agent GitHub

Video-based Scene Generation Methods

Note:
For the "Condition" column:
I = Image, T = Text, E = BEV, B = Bounding Boxes/Layout,
D = Depth, C = Camera, M = Maps, A = Driver Action,
O = Optical Flow, J = Trajectory, S = Subject, H = High-level instructions (Command, Goal Point).
Conditions in brackets are optional.

Method Venue Modeling Backbone Frames FPS Condition Closed-loop LLMs Code
Panacea CVPR'24 Diffusion ControlNet 8 2 ITEBDCM Github
Delphi CoRR'24 Diffusion U-Net 40 2 TEBC N/A
DriveDreamer ECCV'24 Diffusion U-Net, Transformer 32 12 ITMBA Github
DriveDreamer-2 ArXiv'24 Diffusion U-Net 8 4 T(ECI) Github
DriveScape ArXiv'24 Diffusion U-Net 30 2-10 IMEB N/A
DriveArena CoRR'24 Diffusion, AR U-Net N/A 12 TBCM Github
DriveGen ArXiv'24 Diffusion U-Net - - ITB Github
DrivingDiffusion ECCV'24 Diffusion U-Net - - ITBO Github
Vista CoRR'24 Diffusion, AR U-Net 25 10 I(AHJ) Github
SubjectDrive CoRR'24 Diffusion ControlNet 8 2 ITSB N/A
GenAD CVPR'24 Diffusion Transformer 8 2 ITAJ N/A
DrivingWorld ArXiv'24 AR Transformer, GPT 400 10 IJ Github
Doe-1 ArXiv'24 N/A N/A - 2 ITJ Github
ChatSim CVPR'24 Agent N/A 40 10 IT Github
ProphetDWM ArXiv'25 Diffusion U-Net 10 4 ITA N/A
LongDWM* ArXiv'25 Diffusion Transformer 13 10 ITJ Github

*: Not released as of June 2025.
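
The Condition codes in the table above can be expanded mechanically. The following is a minimal sketch, assuming the legend from the note above and treating parentheses as the "brackets" that mark optional conditions; the names `CONDITION_LEGEND` and `decode_conditions` are hypothetical and not part of this repository.

```python
# Legend copied from the note above the table (letter -> condition name).
CONDITION_LEGEND = {
    "I": "Image", "T": "Text", "E": "BEV", "B": "Bounding Boxes/Layout",
    "D": "Depth", "C": "Camera", "M": "Maps", "A": "Driver Action",
    "O": "Optical Flow", "J": "Trajectory", "S": "Subject",
    "H": "High-level instructions",
}


def decode_conditions(code):
    """Split a Condition string such as 'T(ECI)' into (required, optional) lists.

    Assumption: letters inside parentheses are the optional conditions.
    """
    required, optional = [], []
    in_brackets = False
    for ch in code:
        if ch == "(":
            in_brackets = True
        elif ch == ")":
            in_brackets = False
        elif ch in CONDITION_LEGEND:
            target = optional if in_brackets else required
            target.append(CONDITION_LEGEND[ch])
        else:
            raise ValueError(f"unknown condition letter: {ch!r}")
    return required, optional


if __name__ == "__main__":
    # DriveDreamer-2 is listed with the code T(ECI).
    print(decode_conditions("T(ECI)"))
```

For example, `decode_conditions("T(ECI)")` yields Text as the only required condition, with BEV, Camera, and Image as optional ones.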

3D/4D Generation Methods

Note:
In the "Condition" column:
M = Maps, I = Images/Videos, B = 3D Bounding Boxes/Layout, J = Trajectory, T = Text, O = Opacity, C = Camera, A = Driving Action.
* means not presented in the original paper but supported later.
† means reconstruction models with a generative prior.

| Method | Venue | Task | Modeling Type | Backbone | Condition | Output | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InfiniCube | arXiv'24 | 4D Gen. | 3DGS, DiT | 3D U-Net, ControlNet | MBJT | Video, 3DGS | N/A |
| WoVoGen | ECCV'24 | 4D Gen. | Diffusion | 3D U-Net, Transformer | MOTA | Video | GitHub |
| DriveX | arXiv'24 | 4D Gen. | Diffusion | U-Net | MOTA | Video, 3DGS | GitHub |
| ChatSim | CVPR'24 | 4D Gen. | NeRF, 3DGS* | Transformer | IT | Video | GitHub |
| MagicDrive3D | CoRR'24 | 4D Gen. | 3DGS | MLP | TEBJ | Video, 3DGS | GitHub |
| DreamDrive | arXiv'24 | 4D Gen. | 3DGS, Diffusion | MLP | IJ | Video, 3DGS | N/A |
| OmniRe | ICLR'25 | 4D Rec. | 3DGS, Graph | N/A | I(CD) | 3DGS, SMPL | GitHub |
| 4DGF | NeurIPS'24 | 4D Rec. | 3DGS, Graph | N/A | IC(D) | 3DGS | GitHub |
| StreetGaussian | ECCV'24 | 4D Rec. | 3DGS | N/A | ICD | 3DGS | GitHub |
| DrivingGaussian | CVPR'24 | 4D Rec. | 3DGS | N/A, Graph | ICD | 3DGS | N/A |
| SGD | CoRR'24 | 4D Rec.† | 3DGS | U-Net, ControlNet | ITCD | 3DGS | N/A |
| EmerNeRF | ICLR'24 | 4D Rec. | NeRF | MLP | ICD | NeRF | GitHub |
| VastGaussian | CVPR'24 | 3D Rec. | 3DGS | CNN | IC | 3DGS | N/A |
| CityGaussian | ECCV'24 | 3D Rec. | 3DGS | N/A | IC | 3DGS | GitHub |
| DNMP | ICCV'23 | 3D Rec. | Voxel, Mesh | MLP | ICD | Voxel, Mesh | GitHub |
| S-NeRF | ICLR'23 | 3D Rec. | NeRF | MLP | ICD | NeRF | GitHub |
| BlockNeRF | CVPR'22 | 3D Rec. | NeRF | MLP | IC | NeRF | N/A |
| UrbanNeRF | CVPR'22 | 3D Rec. | NeRF | MLP | ICD | NeRF | N/A |
| Julian et al. | CVPR'21 | 4D Rec. | NeRF, Graph | MLP | IC | NeRF | GitHub |
| STORM | ICLR'25 | 4D Rec. | 3DGS | Transformer | IC | 3DGS | GitHub |

3D Scene Editing Methods

The table below lists each method's supported editing operations (insertion, removal, manipulation) and output modalities (camera, LiDAR).

Method Modeling Type Insertion Removal Manipulation Camera LiDAR Code
UniSim NeRF ✔️ ✔️ ✔️ ✔️ ✔️ N/A
DrivingGaussian 3DGS ✔️ ✔️ Github
StreetGaussian 3DGS ✔️ ✔️ ✔️ ✔️ N/A
Generative LiDAR Generative Inpainting ✔️ ✔️ ✔️ ✔️ ✔️ N/A
DriveEditor SAM, Video Diffusion ✔️ ✔️ ✔️ ✔️ N/A

LLM-based Autonomous Driving Systems

In the table below, QA stands for question answering, DM for decision making, ED for environment description, SU for scene understanding, DC for driving context, and E2E for end-to-end.

| Method | Venue | Interaction | Task | Scenario | Backbone | Strategy | Input | Output | Code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiLu | arXiv'23 | Prompting | QA | DM | GPT-4 | ReAct | ED | Action | GitHub |
| Drive-Like-A-Human | WACV'24 | Prompting | QA | DM | GPT-3.5 | ReAct | ED | Action | GitHub |
| Driving-with-LLMs | ICRA'24 | Fine-tuning | QA | SU | LLaMA-7b | None | Question | Answer | GitHub |
| LaMPilot | CVPR'24 | Prompting | QA | SU | General LLMs | PoT | Instruction, DC | Code | GitHub |
| LLaDA | CVPR'24 | Prompting | QA | DM | GPT-4 | CoT | Intended Command | Action | GitHub |
| GPT-Driver | NeurIPS'23 | Fine-tuning | Planning | E2E | GPT-3.5 | CoT | Instruction, DC | Object, Action, Trajectory | GitHub |
| Talk2Drive | ITSC'24 | Prompting | Planning | E2E | GPT-4 | CoT | Instruction, DC | Executable Controls | GitHub |
| Agent-Driver | COLM'24 | Prompting | Planning | E2E | GPT-3.5 | ReAct | Observation | Object, Action, Trajectory | GitHub |

MLLM-based Autonomous Driving Systems

In the table below, VQA stands for visual question answering, SU for scene understanding, DS for driving scene, MVF for multi-view frame, TC for transportation context, and E2E for end-to-end.

| Method | Venue | Interaction | Task | Scenario | Backbone | Strategy | Input | Output | Code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HiLM-D | arXiv'23 | Prompting | VQA | SU | MiniGPT-4 | None | Question, DS (Video) | Answer | N/A |
| DriveLM | ECCV'24 | Fine-tuning | VQA | SU | BLIP-2 | CoT | Question, DS (Image) | Answer | GitHub |
| Dolphins | ECCV'24 | Fine-tuning | VQA | SU | OpenFlamingo | CoT | Question, DS (Video) | Answer | GitHub |
| EM-VLM4AD | CVPR'24 | Fine-tuning | VQA | SU | T5/T5-Large | None | Question, DS (MVF) | Answer | GitHub |
| LLM-Augmented-MTR | IROS'24 | Prompting | VQA | SU | GPT-4V | CoT | Instruction, TC-Map | Context Understanding | GitHub |
| LMDrive | CVPR'24 | Fine-tuning | Planning | E2E | LLaVA-v1.5 | CoT | Instruction, DS (MVF), LiDAR | Control Signal | GitHub |
| LeGo-Drive | IROS'24 | Fine-tuning | Planning | E2E | CLIP | None | Instruction, DS (Image) | Trajectory | GitHub |
| RAG-Driver | arXiv'24 | Fine-tuning | Planning | E2E | ViT-B/32, Vicuna-1.5 | RAG | Instruction, DS (Video) | Action, Trajectory | GitHub |
| DriveVLM | CoRL'24 | Fine-tuning | Planning | E2E | Qwen-VL | CoT | Instruction, DS (Video) | Action, Trajectory | N/A |
| EMMA | arXiv'24 | Fine-tuning | Planning | E2E | Gemini 1.0 Nano-1 | CoT | Instruction, DS (MVF) | Object, Action, Trajectory | N/A |
| OpenEMMA | WACV'25 | Prompting | Planning | E2E | General MLLMs | CoT | Instruction, DS (Image) | Object, Action, Trajectory | GitHub |

Citation

If you find this repository useful for your research, please consider citing the following paper:

@article{wang2025generative,
    title={Generative AI for Autonomous Driving: Frontiers and Opportunities},
    author={Yuping Wang and Shuo Xing and Cui Can and Renjie Li and Hongyuan Hua and Kexin Tian and Zhaobin Mo and Xiangbo Gao and Keshu Wu and Sulong Zhou and Hengxu You and Juntong Peng and Junge Zhang and Zehao Wang and Rui Song and Mingxuan Yan and Walter Zimmer and Xingcheng Zhou and Peiran Li and Zhaohan Lu and Chia-Ju Chen and Yue Huang and Ryan A. Rossi and Lichao Sun and Hongkai Yu and Zhiwen Fan and Frank Hao Yang and Yuhao Kang and Ross Greer and Chenxi Liu and Eun Hak Lee and Xuan Di and Xinyue Ye and Liu Ren and Alois Knoll and Xiaopeng Li and Shuiwang Ji and Masayoshi Tomizuka and Marco Pavone and Tianbao Yang and Jing Du and Ming-Hsuan Yang and Hua Wei and Ziran Wang and Yang Zhou and Jiachen Li and Zhengzhong Tu},
    year={2025},
    eprint={2505.08854},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
