OccLLaMA: A Unified Occupancy-Language-Action World Model for Understanding and Generation Tasks in Autonomous Driving

Abstract

Scene understanding via multi-modal large language models (MLLMs) and scene forecasting with world models have advanced the development of autonomous driving. The former maps visual inputs directly to driving-specific outputs but neglects spatial reasoning and world dynamics. The latter captures world dynamics but lacks comprehensive scene understanding. In contrast, humans seamlessly integrate understanding, forecasting, and decision-making through multi-modal representations, avoiding both misalignment and unnecessary complexity.

To this end, we propose OccLLaMA, a unified occupancy-language-action world model for multi-task learning. It uses semantic occupancy as a unified, modality-agnostic 3D visual representation, effectively integrating spatial scene understanding and scene forecasting. We further introduce a novel scene tokenizer tailored for occupancy, enabling a unified representation across understanding and generation tasks. Furthermore, we extend an LLM, specifically LLaMA, to perform end-to-end multi-task learning within a unified auto-regressive framework.

Extensive experiments demonstrate that OccLLaMA not only achieves competitive performance across multiple tasks, including scene understanding, occupancy forecasting, and motion planning, but also significantly improves motion planning through the integration of multi-task learning, showcasing its effectiveness and potential as a foundation model for autonomous driving.

Framework

The Scene Tokenizer and the Unified World Model are the core components of OccLLaMA. The Scene Tokenizer employs a sparse encoder and a decoupled decoder to efficiently tokenize the occupancy scene while addressing data sparsity and class imbalance. The Unified World Model integrates the occupancy, language, and action modalities within a unified discrete auto-regressive framework, supporting multi-task learning in autonomous driving.
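As a rough illustration, the sketch below shows how a VQ-style scene tokenizer could map a semantic occupancy grid to discrete tokens and reconstruct it. The layer shapes, codebook size, and height-collapsing trick are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class SceneTokenizer(nn.Module):
    """Minimal VQ-style tokenizer sketch: encode a semantic occupancy grid
    into discrete codes, then decode the codes back to occupancy logits.
    All hyperparameters here are placeholders for illustration."""

    def __init__(self, num_classes=17, height_bins=16, embed_dim=64, codebook_size=512):
        super().__init__()
        in_ch = num_classes * height_bins  # collapse the height axis into channels (assumption)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, occ_onehot):
        # occ_onehot: (B, num_classes * height_bins, H, W), one-hot occupancy flattened over height.
        z = self.encoder(occ_onehot)                           # (B, D, h, w)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])   # (B*h*w, D)
        dists = torch.cdist(flat, self.codebook.weight)        # distance to every codebook entry
        tokens = dists.argmin(dim=-1)                          # nearest entry = discrete scene token
        quantized = self.codebook(tokens).reshape(z.shape[0], z.shape[2], z.shape[3], -1)
        recon = self.decoder(quantized.permute(0, 3, 1, 2))    # occupancy logits back at input resolution
        return tokens.reshape(z.shape[0], -1), recon

tok = SceneTokenizer()
occ = torch.randn(1, 17 * 16, 64, 64)   # stand-in for a one-hot occupancy grid
tokens, recon = tok(occ)
print(tokens.shape, recon.shape)          # (1, 256), (1, 272, 64, 64)

The resulting scene tokens can then share one vocabulary with language and action tokens inside the auto-regressive world model.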

Experiments


OccLLaMA enables scene understanding with spatial reasoning based on occupancy observations, and enhances motion planning by using scene understanding as a prerequisite chain-of-thought.
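For concreteness, below is a hypothetical serialization of one sample: the special tokens and the ordering (scene tokens, then language question/answer, then action tokens) are assumptions about how such a chain could be laid out in an auto-regressive sequence, not the paper's exact format.

# Hypothetical special tokens; the actual vocabulary used by OccLLaMA may differ.
BOS, EOS = "<bos>", "<eos>"
OCC_BEG, OCC_END = "<occ>", "</occ>"
ACT_BEG, ACT_END = "<act>", "</act>"

def build_sequence(scene_tokens, question, answer, action_tokens):
    """Order one sample as observation -> language reasoning -> action,
    so planning is conditioned on the scene-understanding chain of thought."""
    return (
        [BOS, OCC_BEG] + [f"<scene_{t}>" for t in scene_tokens] + [OCC_END]
        + [question, answer]
        + [ACT_BEG] + [f"<wp_{t}>" for t in action_tokens] + [ACT_END, EOS]
    )

seq = build_sequence(
    scene_tokens=[12, 87, 301],
    question="Q: Is the ego lane clear?",
    answer="A: Yes, the lane ahead is free of obstacles.",
    action_tokens=[5, 5, 6],
)
print(seq)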

BibTeX

@article{wei2024occllama,
    title={OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving},
    author={Wei, Julong and Yuan, Shanshuai and Li, Pengfei and Hu, Qingda and Gan, Zhongxue and Ding, Wenchao},
    journal={arXiv preprint arXiv:2409.03272},
    year={2024}
}