Technical Report · May 11, 2026

The DAWN of World-Action Interactive Models

Hongbo Lu1,2,*, Liang Yao1,*, Chenghao He1,*, Haoyu Wang1,*, Xiang Gu3, Xianfei Li1,2, Wenlong Liao1,†, Tao He1, Pai Peng1,†,‡
1COWARobot Co. Ltd 2Shanghai Jiao Tong University 3Hohai University
*Equal Contribution · †Corresponding Author · ‡Project Lead

Correspondence: volans.liao@cowarobot.com, pengpai@cowarobot.com
Figure 1. From WAMs to WAIM: world and action are coupled during inference.

Abstract

A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines.

DAWN formalizes this perspective as World-Action Interactive Models (WAIMs) and instantiates it for autonomous driving as a compact latent generative model. It couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update world prediction, recursively refining both during inference.

Rather than eliminating test-time world evolution or rolling out the full future in pixel space, DAWN performs a short, explicit latent rollout to support long-horizon trajectory generation in complex interactive scenes. Experiments show strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks.

Core Idea

DAWN argues that useful world-action models should not merely represent world and action together. They should let the two hypotheses co-evolve during inference: the current world hypothesis refines the action hypothesis, and the emerging action hypothesis revises the predicted world evolution.

WAIM Formulation

Future world states and future actions are inferred as coupled variables instead of independent outputs or fixed pipeline stages.
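To make the contrast concrete, the three inference patterns can be written side by side. The notation below is an illustrative assumption and is not taken from the report: $o$ is the current observation, $w$ the predicted future world state, $a$ the future action, $f$ a world predictor, $g$ an action generator, and $a^{(0)}$ an initial action hypothesis.

```latex
\begin{align*}
\text{Parallel branches:} \quad & \hat{w} = f(o), \qquad \hat{a} = g(o) \\
\text{Predict-then-plan:} \quad & \hat{w} = f(o), \qquad \hat{a} = g(o, \hat{w}) \\
\text{WAIM, round } k = 1, \dots, K\text{:} \quad & w^{(k)} = f\big(o, a^{(k-1)}\big), \qquad a^{(k)} = g\big(o, w^{(k)}, a^{(k-1)}\big)
\end{align*}
```

In the first two patterns the world estimate never depends on the action. In the third, each round feeds the latest action hypothesis back into the world prediction, which is the coupling the WAIM formulation describes.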

Latent Rollout

DAWN uses a short semantic latent rollout rather than expensive pixel-space future rendering.

Recursive Refinement

A World Predictor and World-Conditioned Action Denoiser repeatedly update each other to form a coherent future-action pair.

Method

Architecture

DAWN contains a Student Vision-Encoder, a training-time Teacher Vision-Encoder, an Auto-Encoder Resampler, a World Predictor, a World-Conditioned Action Denoiser, and a lightweight Action Head. The implementation uses V-JEPA 2 Large as the vision backbone and compresses dense visual tokens into compact latent world tokens.
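The compression step can be pictured as a small set of learned queries cross-attending over the dense visual tokens. The sketch below is a generic Perceiver-style resampler in NumPy, offered only as an assumption about how such a module might look; the report's Auto-Encoder Resampler may differ, and the sizes `N`, `M`, `D` and the function `resample` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(dense_tokens, queries, w_k, w_v):
    """Cross-attend M query tokens over N dense tokens, returning M latent tokens."""
    keys = dense_tokens @ w_k                                # (N, D)
    values = dense_tokens @ w_v                              # (N, D)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])   # (M, N)
    attn = softmax(scores, axis=-1)                          # each row sums to 1
    return attn @ values, attn                               # (M, D), (M, N)

rng = np.random.default_rng(0)
N, M, D = 256, 16, 32  # hypothetical: 256 dense tokens compressed to 16 latents
dense_tokens = rng.normal(size=(N, D))        # stand-in for vision-encoder output
queries = rng.normal(size=(M, D))             # stand-in for learned query tokens
w_k = rng.normal(size=(D, D)) / np.sqrt(D)    # random stand-ins for trained weights
w_v = rng.normal(size=(D, D)) / np.sqrt(D)

latent_world_tokens, attn = resample(dense_tokens, queries, w_k, w_v)
print(latent_world_tokens.shape)
```

The output token count is set by the number of queries rather than by the image resolution, which is what keeps the downstream world rollout compact.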

Inference

At inference time, the teacher branch is removed. DAWN encodes the current observation into latent context, produces an initial action hypothesis, alternates between short latent world rollout and action denoising, and finally decodes the refined action state into a trajectory.
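The control flow of this procedure can be sketched as a short loop. Everything below is a toy stand-in under stated assumptions: the modules are random linear maps rather than trained networks, and the names `encode`, `world_predictor`, `action_denoiser`, `action_head`, `infer`, and `num_rounds` are hypothetical. Only the order of operations follows the description above.

```python
import numpy as np

D, OBS_DIM, HORIZON = 16, 32, 8  # hypothetical latent width, input size, waypoints

def make_params(rng):
    # Random linear maps standing in for trained networks (illustration only).
    shapes = {"enc": (OBS_DIM, D), "world": (2 * D, D),
              "act": (3 * D, D), "head": (D, HORIZON * 2)}
    return {k: rng.normal(size=s) / np.sqrt(s[0]) for k, s in shapes.items()}

def encode(p, observation):
    # Current observation -> latent context.
    return np.tanh(observation @ p["enc"])

def world_predictor(p, context, action_state):
    # Action -> World: the rollout is conditioned on the current action hypothesis.
    return np.tanh(np.concatenate([context, action_state]) @ p["world"])

def action_denoiser(p, context, world_state, action_state):
    # World -> Action: denoising is conditioned on the current world hypothesis.
    return np.tanh(np.concatenate([context, world_state, action_state]) @ p["act"])

def action_head(p, action_state):
    # Decode the refined action state into (x, y) waypoints.
    return (action_state @ p["head"]).reshape(HORIZON, 2)

def infer(p, observation, noise, num_rounds=3):
    context = encode(p, observation)
    # Initial action hypothesis: same denoiser, with an empty world hypothesis.
    action_state = action_denoiser(p, context, np.zeros(D), noise)
    for _ in range(num_rounds):
        world_state = world_predictor(p, context, action_state)
        action_state = action_denoiser(p, context, world_state, action_state)
    return action_head(p, action_state)

rng = np.random.default_rng(0)
params = make_params(rng)
obs = rng.normal(size=OBS_DIM)
noise = rng.normal(size=D)
trajectory = infer(params, obs, noise)
print(trajectory.shape)
```

Reusing one denoiser for both the initial proposal and the refinement rounds mirrors the report's statement that the two share weights. Passing a zero world state for the proposal is a simplification chosen for this sketch.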

Training Recipe

The report describes four stages: large-scale driving video pretraining, Auto-Encoder Resampler training, World Predictor training on downstream datasets, and joint world-action training where action proposal and interactive refinement share denoiser weights.

Figure 2. DAWN architecture with latent world prediction and world-conditioned action denoising.

Results

89.1 NAVSIM v1 PDMS

Highest PDMS among the perception-free methods compared in the report, with strong no-collision (NC), ego progress (EP), and time-to-collision (TTC) scores.

0.33 m nuScenes Avg. L2

Lowest average L2 trajectory error among compared methods, improving mid- and long-horizon accuracy.

0.11% nuScenes Avg. Collision

Lowest average collision rate among compared methods, indicating that the gain in planning accuracy does not come at the cost of safety-related behavior.

Component ablations show that interactive world-action updates improve PDMS from 85.2 to 87.9 in the lower-resolution setting. Removing either World→Action or Action→World coupling weakens the model, supporting the report's central WAIM principle that world evolution and action generation should mutually constrain each other.

NAVSIM v1 Benchmark

| Type | Method | Inputs | NC ↑ | DAC ↑ | EP ↑ | C ↑ | TTC ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Perception-based | Transfuser | C & L | 97.7 | 92.8 | 79.2 | 100 | 92.8 | 84.0 |
| Perception-based | Hydra-MDP | C & L | 98.4 | 97.7 | 85.0 | 100 | 94.5 | 89.9 |
| Perception-based | Hydra-MDP++ | C & L | 97.6 | 96.0 | 80.4 | 100 | 93.1 | 86.6 |
| Perception-based | DiffusionDrive | C & L | 98.2 | 96.2 | 82.2 | 100 | 94.7 | 88.1 |
| Perception-based | GoalFlow | C & L | 98.4 | 98.3 | 85.0 | 100 | 94.6 | 90.3 |
| Perception-based | DriveDPO | C & L | 98.5 | 98.1 | 84.3 | 100 | 94.8 | 90.0 |
| Perception-based | iPad | Camera | 99.2 | 97.4 | 87.8 | 99.7 | 96.3 | 91.7 |
| Perception-based | DriveSuprim | Camera | 98.6 | 98.6 | 91.3 | 100 | 95.5 | 93.5 |
| Perception-free | LAW | C & L | 97.4 | 93.3 | 78.8 | 100 | 91.9 | 83.8 |
| Perception-free | World4Drive | C & L | 97.4 | 94.3 | 79.9 | 100 | 92.8 | 85.1 |
| Perception-free | Epona | Camera | 97.9 | 95.1 | 80.4 | 99.9 | 93.8 | 86.2 |
| Perception-free | Drive-JEPA | Camera | 98.7 | 96.2 | 82.9 | 100 | 95.5 | 89.0 |
| Perception-free | DAWN* (Ours) | Camera | 98.2 | 95.8 | 84.2 | 100 | 95.8 | 87.9 |
| Perception-free | DAWN (Ours) | Camera | 98.7 | 95.9 | 84.3 | 100 | 96.0 | 89.1 |

DAWN* denotes the 256×256-resolution variant reported in the technical report.
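For readers unfamiliar with the PDMS column, the sketch below shows how the sub-scores combine within a single scenario. The formula is the NAVSIM v1 definition as we understand it and is not restated in the report: NC and DAC act as multiplicative penalties, while EP, TTC, and C enter a weighted average with weights 5, 5, and 2. The function name `pdms` is hypothetical.

```python
def pdms(nc, dac, ep, ttc, comfort):
    """Per-scenario PDM score; every input is a score in [0, 1]."""
    weighted = (5 * ep + 5 * ttc + 2 * comfort) / 12
    return nc * dac * weighted

perfect = pdms(1.0, 1.0, 1.0, 1.0, 1.0)    # no penalties, full marks everywhere
off_road = pdms(1.0, 0.0, 1.0, 1.0, 1.0)   # a drivable-area violation zeroes the score

# Applying the formula to DAWN's benchmark-level averages from the table
# (98.7, 95.9, 84.3, 96.0, 100, rescaled to [0, 1]).
approx = pdms(0.987, 0.959, 0.843, 0.960, 1.0)
print(perfect, off_road, round(approx, 3))
```

The last value comes out near 0.87, below the reported 89.1. That gap is expected under this definition: the benchmark averages the per-scenario product over scenarios, which is not the same as multiplying the averaged sub-scores, so the aggregate columns cannot be recombined by hand.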

nuScenes Benchmark

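| Method | L2 1s (m) ↓ | L2 2s (m) ↓ | L2 3s (m) ↓ | L2 Avg. (m) ↓ | Collision 1s (%) ↓ | Collision 2s (%) ↓ | Collision 3s (%) ↓ | Collision Avg. (%) ↓ |
|---|---|---|---|---|---|---|---|---|
| ST-P3 | 1.33 | 2.11 | 2.90 | 2.11 | 0.23 | 0.62 | 1.27 | 0.71 |
| OccNet | 1.29 | 2.13 | 2.99 | 2.13 | 0.21 | 0.59 | 1.37 | 0.72 |
| UniAD | 0.48 | 0.96 | 1.65 | 1.03 | 0.05 | 0.17 | 0.71 | 0.31 |
| VAD | 0.41 | 0.70 | 1.05 | 0.72 | 0.07 | 0.18 | 0.43 | 0.23 |
| PPAD | 0.31 | 0.56 | 0.87 | 0.58 | 0.08 | 0.12 | 0.38 | 0.19 |
| GenAD | 0.28 | 0.49 | 0.78 | 0.52 | 0.08 | 0.14 | 0.34 | 0.19 |
| BEV-Planner | 0.30 | 0.52 | 0.83 | 0.55 | 0.10 | 0.37 | 1.30 | 0.59 |
| LAW | 0.26 | 0.57 | 1.01 | 0.61 | 0.14 | 0.21 | 0.54 | 0.30 |
| World4Drive | 0.23 | 0.47 | 0.81 | 0.50 | 0.02 | 0.12 | 0.33 | 0.16 |
| WorldRFT | 0.21 | 0.44 | 0.76 | 0.47 | 0.10 | 0.11 | 0.23 | 0.15 |
| DAWN (Ours) | 0.17 | 0.31 | 0.52 | 0.33 | 0.00 | 0.10 | 0.23 | 0.11 |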
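As a reading aid for the L2 columns, the sketch below computes the displacement error of one predicted trajectory at the 1 s, 2 s, and 3 s horizons and averages the three. It assumes waypoints sampled at 2 Hz with the first waypoint at 0.5 s, and it measures the error at each horizon rather than averaging over all steps up to it. Prior work uses both conventions, and this page does not say which one the report follows. The function `l2_at_horizons` and the toy trajectories are hypothetical.

```python
import numpy as np

def l2_at_horizons(pred, gt, hz=2, horizons_s=(1, 2, 3)):
    """pred, gt: (T, 2) waypoints in meters at `hz` samples/s, first row at 1/hz s."""
    dist = np.linalg.norm(pred - gt, axis=1)             # per-waypoint error (m)
    errors = {h: float(dist[h * hz - 1]) for h in horizons_s}
    errors["avg"] = float(np.mean([errors[h] for h in horizons_s]))
    return errors

# Toy example: ground truth drives straight ahead at 5 m/s; the prediction
# drifts sideways by 0.1 m per waypoint.
steps = np.arange(1, 7)                                   # 6 waypoints = 3 s at 2 Hz
gt = np.stack([np.zeros(6), 2.5 * steps], axis=1)
pred = gt + np.stack([0.1 * steps, np.zeros(6)], axis=1)

errors = l2_at_horizons(pred, gt)
print(errors)

# The Avg. column is consistent with a plain mean of the three horizons,
# e.g. for the DAWN row of the table:
dawn_avg = round(float(np.mean([0.17, 0.31, 0.52])), 2)
print(dawn_avg)
```

The benchmark figures average such per-sample errors over the whole evaluation split. The single-trajectory version here is only meant to show what each column measures.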
Figure 3. Planning quality across different numbers of interactive rounds.
Figure 4. Qualitative planning results in representative driving scenarios.

BibTeX

@misc{lu2026dawn,
      title={The DAWN of World-Action Interactive Models},
      author={Hongbo Lu and Liang Yao and Chenghao He and Haoyu Wang and Xiang Gu and Xianfei Li and Wenlong Liao and Tao He and Pai Peng},
      year={2026},
      eprint={2605.11550},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.11550},
}