The DAWN of World-Action Interactive Models

Lu, Hongbo; Yao, Liang; He, Chenghao; Wang, Haoyu; Gu, Xiang; Li, Xianfei; Liao, Wenlong; He, Tao; Peng, Pai

Technical Report · May 11, 2026

Hongbo Lu^1,2,*, Liang Yao^1,*, Chenghao He^1,*, Haoyu Wang^1,*, Xiang Gu^3, Xianfei Li^1,2, Wenlong Liao^1,†, Tao He^1, Pai Peng^1,†,‡

^1 COWARobot Co. Ltd · ^2 Shanghai Jiao Tong University · ^3 Hohai University
^* Equal Contribution · ^† Corresponding Author · ^‡ Project Lead
Correspondence: volans.liao@cowarobot.com, pengpai@cowarobot.com

Figure 1. From WAMs to WAIM: world and action are coupled during inference.

Abstract

A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines.

DAWN formalizes this perspective as World-Action Interactive Models (WAIMs) and instantiates it for autonomous driving as a compact latent generative model. It couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update world prediction, recursively refining both during inference.

Rather than eliminating test-time world evolution or rolling out the full future in pixel space, DAWN performs a short, explicit latent rollout to support long-horizon trajectory generation in complex interactive scenes. Experiments show strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks.

Core Idea

DAWN argues that useful world-action models should not merely represent world and action together. They should let the two hypotheses co-evolve during inference: the current world hypothesis refines the action hypothesis, and the emerging action hypothesis revises the predicted world evolution.

WAIM Formulation

Future world states and future actions are inferred as coupled variables instead of independent outputs or fixed pipeline stages.
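One way to write this contrast down is sketched below. The notation is an assumption introduced here for illustration and is not taken from the report: o is the current observation, w the future world state, a the future action, and k the refinement round.

```latex
% Assumed notation (not from the report): o = observation, w = future world
% state, a = future action, k = index of the refinement round.
\begin{align*}
\text{Parallel branches:}\quad
  & p(w, a \mid o) = p(w \mid o)\, p(a \mid o) \\
\text{Predict-then-plan:}\quad
  & p(w, a \mid o) = p(w \mid o)\, p(a \mid w, o) \\
\text{Interactive (WAIM-style):}\quad
  & w^{(k+1)} = f_{\mathrm{world}}\big(o,\, a^{(k)}\big), \qquad
    a^{(k+1)} = f_{\mathrm{act}}\big(o,\, w^{(k+1)},\, a^{(k)}\big)
\end{align*}
```

In the first two forms the world estimate never depends on the candidate action. In the third, each round feeds the latest action hypothesis back into world prediction, which is the coupling the WAIM formulation asks for.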

Latent Rollout

DAWN uses a short semantic latent rollout rather than expensive pixel-space future rendering.

Recursive Refinement

A World Predictor and World-Conditioned Action Denoiser repeatedly update each other to form a coherent future-action pair.

Method

Architecture

DAWN contains a Student Vision-Encoder, a training-time Teacher Vision-Encoder, an Auto-Encoder Resampler, a World Predictor, a World-Conditioned Action Denoiser, and a lightweight Action Head. The implementation uses V-JEPA 2 Large as the vision backbone and compresses dense visual tokens into compact latent world tokens.
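The compression step can be pictured with a small sketch. It assumes a Perceiver-style design in which a few query vectors attend over the dense tokens; the summary above does not say how the Auto-Encoder Resampler is built, so the function `resample`, the sizes, and the absence of learned projections are all illustrative assumptions.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def resample(dense_tokens, queries):
    """Compress many dense tokens into len(queries) latent tokens using
    single-head cross-attention (toy sketch, no learned projections)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    latents = []
    for q in queries:
        # Scaled dot-product score between this query and every dense token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, tok))
                  for tok in dense_tokens]
        weights = softmax(scores)
        # Each latent token is an attention-weighted average of dense tokens.
        latents.append([
            sum(w * tok[d] for w, tok in zip(weights, dense_tokens))
            for d in range(dim)
        ])
    return latents

rng = random.Random(0)
NUM_DENSE, NUM_LATENT, DIM = 64, 4, 8  # hypothetical sizes, not from the report
dense = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_DENSE)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENT)]
latent_tokens = resample(dense, queries)
print(len(dense), "->", len(latent_tokens), "tokens of dim", len(latent_tokens[0]))
```

The point of the sketch is only the shape change: downstream modules operate on a handful of latent tokens rather than on every dense visual token.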

Inference

At inference time, the teacher branch is removed. DAWN encodes the current observation into latent context, produces an initial action hypothesis, alternates between short latent world rollout and action denoising, and finally decodes the refined action state into a trajectory.
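The control flow described above can be sketched with toy stand-ins. Everything in this block is a hypothetical placeholder chosen so the loop runs and converges: the real modules are learned networks, and the function names, update rules, and the one-dimensional "trajectory" are assumptions, not the report's implementation.

```python
def encode(observation):
    # Stand-in for the student encoder plus resampler: identity on a short vector.
    return list(observation)

def predict_world(context, action):
    # Toy world predictor: the predicted latent shifts with the candidate action.
    return [c + 0.5 * a for c, a in zip(context, action)]

def denoise_action(action, world):
    # Toy denoiser: move the action halfway toward a world-dependent target.
    return [a + 0.5 * ((1.0 - w) - a) for a, w in zip(action, world)]

def decode(action):
    # Toy action head: cumulative sum turns per-step actions into 1-D waypoints.
    waypoints, position = [], 0.0
    for a in action:
        position += a
        waypoints.append(position)
    return waypoints

def infer(observation, num_rounds=3):
    context = encode(observation)
    # Initial proposal reuses the same denoiser, conditioned on context only.
    action = denoise_action([0.0] * len(context), context)
    history = [action]
    for _ in range(num_rounds):
        world = predict_world(context, action)   # Action -> World
        action = denoise_action(action, world)   # World -> Action
        history.append(action)
    return decode(action), history

trajectory, history = infer([0.4, 0.1], num_rounds=6)
for k, a in enumerate(history):
    print(k, [round(x, 4) for x in a])
```

In this toy setting the change between successive action hypotheses shrinks every round, so a small number of rounds already lands near a world-action pair that is consistent with both update rules.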

Training Recipe

The report describes four stages: large-scale driving video pretraining, Auto-Encoder Resampler training, World Predictor training on downstream datasets, and joint world-action training where action proposal and interactive refinement share denoiser weights.

Figure 2. DAWN architecture with latent world prediction and world-conditioned action denoising.

Results

89.1 NAVSIM v1 PDMS

Highest PDMS among the perception-free methods compared in the report, with strong no-at-fault-collision (NC), ego-progress (EP), and time-to-collision (TTC) scores.

0.33 m nuScenes Avg. L2

Lowest average L2 trajectory error among compared methods, improving mid- and long-horizon accuracy.

0.11% nuScenes Avg. Collision

Lowest average collision rate among compared methods, showing improved planning accuracy without sacrificing safety-related behavior.

Component ablations show that interactive world-action updates improve PDMS from 85.2 to 87.9 in the lower-resolution setting. Removing either World→Action or Action→World coupling weakens the model, supporting the report's central WAIM principle that world evolution and action generation should mutually constrain each other.

NAVSIM v1 Benchmark

Type	Method	Inputs	NC ↑	DAC ↑	EP ↑	C ↑	TTC ↑	PDMS ↑
Perception-based	Transfuser	C & L	97.7	92.8	79.2	100	92.8	84.0
Perception-based	Hydra-MDP	C & L	98.4	97.7	85.0	100	94.5	89.9
Perception-based	Hydra-MDP++	C & L	97.6	96.0	80.4	100	93.1	86.6
Perception-based	DiffusionDrive	C & L	98.2	96.2	82.2	100	94.7	88.1
Perception-based	GoalFlow	C & L	98.4	98.3	85.0	100	94.6	90.3
Perception-based	DriveDPO	C & L	98.5	98.1	84.3	100	94.8	90.0
Perception-based	iPad	Camera	99.2	97.4	87.8	99.7	96.3	91.7
Perception-based	DriveSuprim	Camera	98.6	98.6	91.3	100	95.5	93.5
Perception-free	LAW	C & L	97.4	93.3	78.8	100	91.9	83.8
Perception-free	World4Drive	C & L	97.4	94.3	79.9	100	92.8	85.1
Perception-free	Epona	Camera	97.9	95.1	80.4	99.9	93.8	86.2
Perception-free	Drive-JEPA	Camera	98.7	96.2	82.9	100	95.5	89.0
Perception-free	DAWN* (Ours)	Camera	98.2	95.8	84.2	100	95.8	87.9
Perception-free	DAWN (Ours)	Camera	98.7	95.9	84.3	100	96.0	89.1

DAWN* denotes the 256×256-resolution variant reported in the technical report. In the Inputs column, C & L denotes camera and LiDAR.
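For readers unfamiliar with the columns, the sketch below shows how the NAVSIM v1 PDM score combines the sub-scores for a single scenario: NC and DAC act as multiplicative gates, while EP, TTC, and C enter a weighted average with weights 5, 5, and 2. The scenario values are hypothetical. The benchmark averages the score over scenarios, so plugging the column averages from the table into this formula will generally not reproduce the reported PDMS.

```python
def pdm_score(nc, dac, ep, ttc, comfort):
    """Per-scenario PDM score; all sub-scores are expected in [0, 1]."""
    return nc * dac * (5 * ep + 5 * ttc + 2 * comfort) / 12

# Hypothetical per-scenario sub-scores, for illustration only.
scenarios = [
    dict(nc=1.0, dac=1.0, ep=0.9, ttc=1.0, comfort=1.0),
    # Leaving the drivable area (dac=0) zeroes the score for this scenario.
    dict(nc=1.0, dac=0.0, ep=0.8, ttc=1.0, comfort=1.0),
]
scores = [pdm_score(**s) for s in scenarios]
pdms = 100 * sum(scores) / len(scores)  # benchmark-style mean, in percent
print([round(s, 4) for s in scores], round(pdms, 2))
```

Because the gates are multiplicative, a single collision or drivable-area violation removes a scenario's whole contribution regardless of progress or comfort.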

nuScenes Benchmark

Method	L2 1s (m) ↓	L2 2s (m) ↓	L2 3s (m) ↓	L2 Avg. (m) ↓	Col. 1s (%) ↓	Col. 2s (%) ↓	Col. 3s (%) ↓	Col. Avg. (%) ↓
ST-P3	1.33	2.11	2.90	2.11	0.23	0.62	1.27	0.71
OccNet	1.29	2.13	2.99	2.13	0.21	0.59	1.37	0.72
UniAD	0.48	0.96	1.65	1.03	0.05	0.17	0.71	0.31
VAD	0.41	0.70	1.05	0.72	0.07	0.18	0.43	0.23
PPAD	0.31	0.56	0.87	0.58	0.08	0.12	0.38	0.19
GenAD	0.28	0.49	0.78	0.52	0.08	0.14	0.34	0.19
BEV-Planner	0.30	0.52	0.83	0.55	0.10	0.37	1.30	0.59
LAW	0.26	0.57	1.01	0.61	0.14	0.21	0.54	0.30
World4Drive	0.23	0.47	0.81	0.50	0.02	0.12	0.33	0.16
WorldRFT	0.21	0.44	0.76	0.47	0.10	0.11	0.23	0.15
DAWN (Ours)	0.17	0.31	0.52	0.33	0.00	0.10	0.23	0.11
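As a quick consistency check, the Avg. columns can be recomputed from the per-horizon values. The sketch assumes Avg. is the arithmetic mean of the 1s, 2s, and 3s entries rounded to two decimals, which matches the rows copied below from the table; the relative-reduction figure is derived here and is not a number quoted in the report.

```python
# Per-horizon values (1s, 2s, 3s) copied from the nuScenes table.
rows = {
    "WorldRFT":    {"l2": (0.21, 0.44, 0.76), "col": (0.10, 0.11, 0.23)},
    "DAWN (Ours)": {"l2": (0.17, 0.31, 0.52), "col": (0.00, 0.10, 0.23)},
}

def mean(values):
    return sum(values) / len(values)

averages = {
    name: (round(mean(r["l2"]), 2), round(mean(r["col"]), 2))
    for name, r in rows.items()
}
for name, (avg_l2, avg_col) in averages.items():
    print(f"{name}: avg L2 = {avg_l2} m, avg collision = {avg_col} %")

# Relative reduction in average L2 versus the next-best row in the table.
reduction = 1 - averages["DAWN (Ours)"][0] / averages["WorldRFT"][0]
print(f"relative L2 reduction: {100 * reduction:.1f} %")
```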

Figure 3. Planning quality across different numbers of interactive rounds.

Figure 4. Qualitative planning results in representative driving scenarios.

BibTeX

@misc{lu2026dawn,
      title={The DAWN of World-Action Interactive Models},
      author={Hongbo Lu and Liang Yao and Chenghao He and Haoyu Wang and Xiang Gu and Xianfei Li and Wenlong Liao and Tao He and Pai Peng},
      year={2026},
      eprint={2605.11550},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.11550},
}