Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

Accepted by ICML 2026
Xiaokun Feng1,2,3, Jiashu Zhu3, Meiqi Wu3, Chubin Chen3, Fangyuan Mao3, Haiyang Guo3, Jiahong Wu3, Xiangxiang Chu3, Kaiqi Huang1,2,*
1School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China    2The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China    3AMAP, Alibaba Group, Beijing, China
*Corresponding author: kaiqi.huang@nlpr.ia.ac.cn

Generated Long Video Samples

Prompt

A fluffy Corgi dog trots happily across a lush green lawn. Its short legs and wagging tail convey pure delight as it explores the grassy expanse, a charming and energetic display of canine joy.

Prompt

A majestic sea turtle glides effortlessly through a vibrant, colorful coral reef. The camera follows its slow, deliberate movements, revealing the intricate patterns of the reef and the calm, clear blue water surrounding it, a tranquil underwater scene.

Prompt

A hyperrealistic rendering of Iron Man soaring effortlessly through a vast, cloud-filled sky. The camera follows him smoothly as he glides, showcasing the intricate details of his suit against the dynamic, swirling clouds, evoking a sense of powerful, serene flight.

Abstract

Train-free long video generation aims to enable foundation video generation models to produce longer videos without incurring significant computational overhead. Frame-level autoregressive frameworks, e.g., FIFO-Diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models.

To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA.

Figure A1. MIGA enables temporally consistent, infinite-frame (∞) video generation in a training-free manner. We present three long videos (1000+ frames) generated by MIGA, while the foundation model used by MIGA, Wan2.1-1.3B, supports only 81 frames by default.

Approach Overview

① Two-Stage Training-Inference Alignment (TTA)

Train-free frame-level autoregressive frameworks (e.g., FIFO-Diffusion) require the foundation model to denoise latents that span a wide range of noise levels at inference, whereas the model is trained on latents that share a single noise level. This mismatch hampers generation quality. TTA alleviates the gap by explicitly shrinking the noise span seen by the model. In Stage 1, we maintain a zigzag-structured latent queue that changes the noise level only every L_zig latents instead of at every frame, providing a smoother input distribution. In Stage 2, once all latents are denoised to the same noise level, a unified denoising pass is performed, exactly matching the noise condition seen during training.

Figure 1. Inference framework comparison between FIFO-Diffusion and our Two-Stage Training-Inference Alignment (TTA) mechanism. By introducing a zigzag-structured queue and a unified denoising stage, TTA proactively reduces the noise span of latents fed to the foundation model.
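The noise-span reduction in Stage 1 can be illustrated with a small toy sketch. It is not MIGA's implementation: the function names, the use of integer denoising-step indices as "noise levels", and the queue and window sizes are all assumptions chosen for illustration. The sketch compares a FIFO-Diffusion-style queue, where every latent sits at its own noise level, with a zigzag-structured queue, where the level changes only every `l_zig` latents, and measures the largest noise span inside any window of frames passed to the model.

```python
def fifo_levels(queue_len):
    """Toy FIFO-style queue: each latent has its own noise-level index."""
    return list(range(queue_len))


def zigzag_levels(queue_len, l_zig):
    """Toy zigzag queue: the noise-level index changes only every l_zig latents."""
    if l_zig < 1:
        raise ValueError("l_zig must be a positive integer")
    return [i // l_zig for i in range(queue_len)]


def max_noise_span(levels, window):
    """Largest (max - min) noise-level gap inside any contiguous window."""
    if not 1 <= window <= len(levels):
        raise ValueError("window must be between 1 and len(levels)")
    spans = []
    for start in range(len(levels) - window + 1):
        chunk = levels[start:start + window]
        spans.append(max(chunk) - min(chunk))
    return max(spans)


if __name__ == "__main__":
    queue_len, window, l_zig = 16, 8, 4  # hypothetical sizes
    print(max_noise_span(fifo_levels(queue_len), window))           # 7
    print(max_noise_span(zigzag_levels(queue_len, l_zig), window))  # 2
```

Under these assumed sizes, a window of 8 latents spans 7 noise levels in the FIFO-style queue but only 2 in the zigzag queue. Stage 2 corresponds to the limiting case where the span is 0, because all latents share one noise level.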

② Dual Consistency Enhancement (DCE)

Although TTA narrows the training-inference gap, long-term temporal consistency still requires explicit modeling. DCE addresses this from two complementary directions on the maintained long latent queue. The self-reflection approach focuses on the queue's tail: it adaptively detects abrupt drops in self-similarity among newly added high-noise latents and triggers an expanded local search only at those anomaly points to correct them, avoiding the redundant computation of a fixed-step search. The long-range frame guidance approach focuses on the queue's head: late, low-noise latents that already cover broad temporal context are injected into each local denoising iteration, enabling feature interaction between distant frames and steering the generation toward globally consistent content.

Figure A2. Illustration of the Dual Consistency Enhancement (DCE) mechanism. Self-reflection corrects early high-noise frames at the tail of the queue, while long-range frame guidance leverages later low-noise frames at the head of the queue to jointly improve temporal consistency.
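The detection step of self-reflection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: latents are represented as flat lists of floats, self-similarity is taken to be the cosine similarity between adjacent latents, and an "abrupt drop" is a fall of more than a fixed `drop_threshold` relative to the previous adjacent similarity. The function names and the threshold value are hypothetical. The expanded local search that corrects the flagged latents is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similarity_drops(latents, drop_threshold=0.3):
    """Return indices of latents whose similarity to the previous latent
    fell by more than drop_threshold compared with the preceding pair.

    Assumption: a fixed threshold stands in for whatever adaptive rule
    the actual method uses.
    """
    sims = [
        cosine_similarity(latents[i - 1], latents[i])
        for i in range(1, len(latents))
    ]
    anomalies = []
    for k in range(1, len(sims)):
        # sims[k] compares latents[k] and latents[k + 1]
        if sims[k - 1] - sims[k] > drop_threshold:
            anomalies.append(k + 1)
    return anomalies


if __name__ == "__main__":
    # Toy 2-D "latents": index 3 points in a very different direction.
    tail = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05], [0.0, 1.0], [1.0, 0.1]]
    print(find_similarity_drops(tail))  # [3]
```

Only the flagged indices would trigger the more expensive correction, which mirrors the motivation stated above: the extra search is spent at anomaly points rather than at every fixed step.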