Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified-flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.
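To make the joint-prediction idea concrete, below is a minimal sketch (our illustration, not the released implementation) of a denoising objective over a unified latent that stacks RGB, depth, and flow channels, so one loss supervises color and perceptual channels together. The channel layout, the rectified-flow target, and the `model` interface are all assumptions.

```python
# A minimal sketch (not released code) of a joint denoising objective over a
# unified latent stacking RGB, depth, and optical-flow channels; `model`,
# the channel layout, and the rectified-flow target are assumptions.
import torch
import torch.nn.functional as F

def joint_denoising_loss(model, z_rgb, z_depth, z_flow, t):
    """z_*: clean latents of shape [B, C_*, T, H, W]; t: [B] in (0, 1)."""
    z0 = torch.cat([z_rgb, z_depth, z_flow], dim=1)  # unified representation
    noise = torch.randn_like(z0)
    tt = t.view(-1, 1, 1, 1, 1)
    zt = (1 - tt) * z0 + tt * noise                  # rectified-flow forward
    v_pred = model(zt, t)                            # one pass predicts all
    v_target = noise - z0                            # flow-matching velocity
    # A single loss supervises color and perceptual channels jointly.
    return F.mse_loss(v_pred, v_target)
```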
Given an input video, RGB, depth, and optical flow signals are encoded into a joint latent representation by a 3D VAE. The latents are then split into a memory bank and prediction groups for the Diffusion Transformer. The memory bank stores historical frames and is excluded from the loss computation: short-term memory retains a few fully denoised frames to preserve fine details, while long-term memory keeps the depth cues noise-free and injects a small amount of noise into the RGB information. During training, each prediction group is assigned a different noise level along the noise-scheduler curve, matching the schedule used at inference.
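As a concrete illustration of the memory bank and segmented noise scheduling described above, the sketch below splits the latent frames into long-term memory (clean depth, lightly noised RGB), short-term memory (fully denoised recent frames), and prediction groups noised at increasing levels along an assumed scheduler curve. Group sizes, noise levels, and tensor layout are hypothetical, not taken from the paper.

```python
# A sketch of the memory/prediction split under assumed sizes and schedule;
# not the authors' implementation. Latents are [T, C, H, W] per modality.
import torch

def build_memory_and_groups(z_rgb, z_depth, n_long=4, n_short=2,
                            n_groups=3, mem_rgb_t=0.1):
    n_mem = n_long + n_short

    # Long-term memory: depth stays noise-free (depth drifts less than RGB),
    # while the RGB history receives a small amount of noise so the model
    # does not over-trust accumulated color errors.
    lt_rgb = z_rgb[:n_long]
    lt_rgb = (1 - mem_rgb_t) * lt_rgb + mem_rgb_t * torch.randn_like(lt_rgb)
    lt_depth = z_depth[:n_long]

    # Short-term memory: the most recent fully denoised frames keep details.
    st_rgb, st_depth = z_rgb[n_long:n_mem], z_depth[n_long:n_mem]

    # Prediction groups: frames farther in the future get higher noise
    # levels along the scheduler curve, mirroring the inference schedule.
    pred = torch.cat([z_rgb[n_mem:], z_depth[n_mem:]], dim=1)
    group_idx = torch.chunk(torch.arange(pred.shape[0]), n_groups)
    ts = torch.linspace(0.3, 0.9, n_groups)          # assumed schedule
    noisy = pred.clone()
    for idx, t in zip(group_idx, ts):
        eps = torch.randn_like(pred[idx])
        noisy[idx] = (1 - t) * pred[idx] + t * eps   # rectified-flow noising

    # Memory latents condition the DiT but are excluded from the loss.
    return (lt_rgb, lt_depth, st_rgb, st_depth), noisy, ts
```

Keeping depth noise-free in long-term memory follows the paper's observation that depth is more resistant to drift than RGB, so the context the model conditions on stays reliable over long horizons.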
Below, we present qualitative results generated by our method, including human activity scenes and robotic arm manipulation tasks.
@article{liu2025worldweaver,
title={WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception},
author={Liu, Zhiheng and Deng, Xueqing and Chen, Shoufa and Wang, Angtian and Guo, Qiushan and Han, Mingfei and Xue, Zeyue and Chen, Mengzhao and Luo, Ping and Yang, Linjie},
journal={arXiv preprint arXiv:2508.15720},
year={2025}
}