DIAGONAL DISTILLATION

FOR STREAMING AUTOREGRESSIVE VIDEO GENERATION

Anonymous authors

Paper under double-blind review

In the video above, the first four cases play at 1x speed (15 fps, the original speed), while the last four cases play at 2x speed (30 fps).

Abstract

The advent of large-scale diffusion models has significantly enhanced the quality of generated videos. However, their application in real-time streaming remains limited. While autoregressive models provide a natural framework for sequential frame generation, they typically require substantial computational resources to achieve high visual fidelity. Although diffusion distillation techniques can compress such models into efficient few-step variants, current video distillation methods predominantly adopt image-specific approaches that overlook critical temporal dependencies. Consequently, these techniques often excel in image generation but underperform in video synthesis, resulting in issues such as reduced motion coherence, error accumulation over long sequences, and a persistent trade-off between latency and quality. We identify that these challenges are primarily exacerbated by two factors: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (exposure bias). To address these issues, we propose a novel framework, Diagonal Distillation, that operates orthogonally to existing approaches. This method systematically leverages temporal information across both video chunks and denoising steps. Specifically, we introduce an asymmetric generation strategy: “more steps early, fewer steps later.” This enables later chunks to efficiently inherit rich appearance information from the initial, more thoroughly processed chunks while utilizing partially denoised chunks as conditional inputs for subsequent generation. By realigning the implicit prediction of subsequent noise levels during chunk generation with actual inference conditions, our approach mitigates error propagation and alleviates oversaturation in long-range sequences. Additionally, we incorporate implicit optical flow modeling to preserve motion quality under strict step constraints.
Our method takes only 2.61 seconds to generate a 5-second video and provides a 270x speedup over the undistilled model, nearly doubling the acceleration ratio of the previous state-of-the-art (140x) while preserving similar visual quality.
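The "more steps early, fewer steps later" strategy can be illustrated with a small scheduling sketch. The function name, the starting step count of 4, the floor of 2 steps, and the linear decay are all hypothetical choices made for illustration; the abstract states only that earlier chunks receive more denoising steps than later ones.

```python
def diagonal_step_schedule(num_chunks, first_steps=4, min_steps=2):
    """Return the number of denoising steps assigned to each chunk.

    Hypothetical schedule: the first chunk gets `first_steps`, each later
    chunk gets one step fewer, and no chunk drops below `min_steps`.
    """
    if num_chunks < 0:
        raise ValueError("num_chunks must be non-negative")
    if min_steps < 1 or first_steps < min_steps:
        raise ValueError("require first_steps >= min_steps >= 1")
    return [max(min_steps, first_steps - k) for k in range(num_chunks)]


if __name__ == "__main__":
    schedule = diagonal_step_schedule(num_chunks=5)
    uniform_total = 5 * 4  # same chunks, every chunk at the full step count
    print("steps per chunk:", schedule)
    print("total steps:", sum(schedule), "vs uniform:", uniform_total)
```

Under these assumed values, the asymmetric schedule spends fewer total denoising steps than a uniform one, and the saving grows with the number of chunks.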



Real-Time Video Generation

Our model can generate a 5-second video in just 2.61 seconds while maintaining comparable quality.


A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. the scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat's orange fur. The shot is clear and sharp, with a shallow depth of field.

A cheerful, fuzzy panda playing a guitar near a warm campfire. The panda has soft, black patches against a white fluffy coat, with large, expressive eyes filled with joy. It is sitting comfortably, strumming the strings with its front paws. Flames from the campfire flicker and dance, casting gentle shadows on the ground. In the background, a majestic snow-capped mountain rises, its peaks dusted with snow under a clear blue sky. The scene is captured in a medium shot, emphasizing the cozy, serene atmosphere of the winter landscape.

A cheerful and energetic Corgi playing happily in a sunlit park during the golden hour of sunset. The Corgi is running around with a playful wagging tail and joyful barks, jumping over small obstacles and chasing after a ball. The park is filled with lush green grass and scattered trees casting long shadows. In the background, the sky is painted with warm hues of orange and pink, reflecting a serene and cozy atmosphere. The scene includes an intense shaking effect that adds excitement and vibrancy to the video. Medium close-up shot focusing on the Corgi's playful antics.

A high-speed train rushes down the railway tracks, its sleek body gleaming under the bright sunlight. The train moves with great momentum, leaving a trail of steam behind as it passes through the landscape. The camera follows the train from a distance, capturing the powerful motion and speed. Surrounding the tracks are lush green fields and tall trees, creating a picturesque backdrop. The video should emphasize the train's velocity and the dynamic motion of the scenery passing by. Wide shot, showing the entirety of the train and the expansive environment surrounding it.

A fluffy white sheep grazing in a lush green meadow under a clear blue sky. The sheep has soft wool, large brown eyes, and a gentle expression. It is nibbling on grass, moving its head side to side as it eats. In the background, there are rolling hills covered with green grass and wildflowers in full bloom. The scene is peaceful and serene, captured in a medium close-up shot focusing on the sheep's calm and content demeanor.

A cute, fluffy panda sitting at a small table in a traditional Chinese restaurant, eating Chinese cuisine. The panda has soft black patches around its eyes and ears, contrasting with its white fur. It sits comfortably in a relaxed posture, holding a pair of chopsticks with its front paws, delicately picking up pieces of dumplings from a steaming plate. The restaurant is decorated with red lanterns and bamboo accents, creating a cozy and inviting atmosphere. In the background, other diners can be seen enjoying their meals. Medium close-up shot focusing on the panda's face and hands as it enjoys its meal.

A person is confidently marching forward with a determined expression. They take strong, purposeful strides as they move across a paved street under clear blue skies. The person wears a smart casual outfit consisting of a fitted t-shirt and jeans, along with sneakers. Their posture is upright and their arms swing naturally at their sides. The scene is captured in a medium shot, focusing on the individual from a slightly elevated angle to emphasize their forward motion. The background includes a few passersby and buildings, but the focus remains on the marching figure.

A sleek, modern boat accelerating across the water to gain speed. The boat's hull slices through the waves, creating a trail of white foam behind it. The sun glints off the water, casting shimmering reflections. The boat's engine roars as it speeds up, and the spray from the bow sprays upwards. Focus on the dynamic motion of the boat as it accelerates, with a medium shot capturing the entirety of the boat and the surrounding water. No camera movement, just a static yet action-packed scene.

A close-up shot of a sleek green hard-shell suitcase with a shiny metallic handle and sturdy wheels. The suitcase has a simple yet modern design, with subtle embossed lines running along its surface. In the background, there is a blurred airport terminal with people walking by, adding a sense of travel and adventure. The suitcase remains stationary, emphasizing its detailed texture and color.

A sleek, modern sports car accelerating rapidly down a straight, empty road, gaining speed with each passing second. The car's engine roars as it speeds up, tires leaving occasional smoke trails on the asphalt. The exterior of the car is a shiny metallic silver with minimalistic design elements. The background showcases a scenic countryside with rolling hills and a clear blue sky. The camera follows the car from a slight distance, capturing the intense acceleration and speed from a dynamic tracking shot.

A nighttime scene from a vintage film-style photograph, depicting a giant, otherworldly creature slowly walking down a desolate, rundown city street. Only one dim streetlamp casts flickering shadows, illuminating the creature's massive, imposing form. Its skin is rough and covered in peculiar growths, with glowing eyes that reflect the dim light. The creature's steps echo in the empty alleyways, creating a sense of eerie quiet. The background features crumbling buildings, broken windows, and trash-strewn sidewalks. The photo has a grainy texture and a muted color palette, capturing the haunting atmosphere of the scene. A medium shot with a slight tilt to the camera, emphasizing the creature's movement and presence.

A person is cycling through a scenic park trail. The rider is wearing a helmet, casual clothes, and sunglasses, pedaling steadily. They are mid-action, leaning slightly forward, with one hand on the handlebars and the other hanging loosely. The environment around them includes lush green trees, blooming flowers, and a winding dirt path. The sun is shining brightly, casting dappled shadows through the leaves. The scene captures a close-up of the rider from a side angle, focusing on their determined expression and the motion of the bicycle wheels.


Faster, and Better

Our approach generates 5-second videos in only 2.61 seconds, achieving a 270x speedup over the baseline model (almost twice the prior state-of-the-art 140x acceleration) with similar visual quality.
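The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that both speedup factors (270x and 140x) are measured against the same undistilled baseline on the same 5-second clip; the page does not state this explicitly, so the derived times are implied estimates rather than reported measurements.

```python
GEN_TIME_S = 2.61      # reported generation time for one clip
VIDEO_LEN_S = 5.0      # reported clip duration
SPEEDUP_OURS = 270.0   # reported speedup over the undistilled model
SPEEDUP_PRIOR = 140.0  # reported speedup of the previous state of the art


def implied_times(gen_time_s, speedup_ours, speedup_prior):
    """Derive baseline and prior-method times from the reported speedups.

    Assumes both speedups share one baseline (an assumption, not a
    statement from the source).
    """
    baseline_s = gen_time_s * speedup_ours
    prior_s = baseline_s / speedup_prior
    return baseline_s, prior_s


baseline_s, prior_s = implied_times(GEN_TIME_S, SPEEDUP_OURS, SPEEDUP_PRIOR)
realtime_factor = VIDEO_LEN_S / GEN_TIME_S      # > 1 means faster than playback
speedup_ratio = SPEEDUP_OURS / SPEEDUP_PRIOR    # "almost twice"

if __name__ == "__main__":
    print(f"implied baseline time: {baseline_s:.1f} s")
    print(f"implied prior-method time: {prior_s:.2f} s")
    print(f"real-time factor: {realtime_factor:.2f}")
    print(f"speedup ratio: {speedup_ratio:.2f}")
```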


Wan2.1-1.3B

CausVid

Self forcing

Ours-1.3B

A cinematic film shot in 35mm capturing a dynamic step-printing scene of a person running. The runner is a young man with short, tousled brown hair and determined eyes, sprinting down a city street lined with tall buildings and neon signs. His arms are pumped vigorously, and he looks focused and energetic. The background features blurred motion with the cityscape gradually fading into a soft, sepia tone. The camera follows him closely, capturing his every stride and movement. The scene has a nostalgic and vintage film texture, enhancing the dramatic intensity of the run. A close-up shot from a slightly behind-the-subject angle.

A cinematic film shot in 35mm capturing a dynamic step-printing scene of a person running. The runner is a young man with short, tousled brown hair and determined eyes, sprinting down a city street lined with tall buildings and neon signs. His arms are pumped vigorously,......

A petri dish with a bamboo forest growing within it that has tiny red pandas running around.

A cinematic scene from a classic western movie, featuring a rugged man riding a powerful horse through the vast Gobi Desert at sunset. The man, dressed in a dusty cowboy hat and a worn leather jacket, reins tightly on the horse's neck as he gallops across the golden sands. The sun sets dramatically behind them, casting long shadows and warm hues across the landscape. The background is filled with rolling dunes and sparse, rocky outcrops......

A dramatic and dynamic scene in the style of a disaster movie, depicting a powerful tsunami rushing through a narrow alley in Bulgaria. The water is turbulent and chaotic, with waves crashing violently against the walls and buildings on either side. The alley is lined with old, weathered houses, their facades partially submerged and splintered. The camera angle is low, capturing the full force of the tsunami as it surges forward, creating a sense of urgency and danger. People can be seen running frantically, adding to the chaos. The background features a distant horizon, hinting at the larger scale of the tsunami. A dynamic, sweeping shot from a low-angle perspective, emphasizing the movement and intensity of the event.

A dramatic and dynamic scene in the style of a disaster movie, depicting a powerful tsunami rushing through a narrow alley in Bulgaria. The water is turbulent and chaotic, with waves crashing violently against the walls and buildings on either side. The alley is lined with old, weathered houses, their facades partially submerged and splintered. The camera angle is low, capturing the full force of the tsunami as it surges forward, creating a sense of urgency and danger......

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

A dynamic and chaotic scene in a dense forest during a heavy rainstorm, capturing a real girl frantically running through the foliage. Her wild hair flows behind her as she sprints, her arms flailing and her face contorted in fear and desperation. Behind her......

A photograph in a warm and nostalgic style, capturing chimneys against a setting sun. The chimneys stand tall and sturdy, casting long shadows across a peaceful rural landscape. The sun is low in the sky, painting the scene in soft orange and pink hues. The background features a serene countryside with fields, trees, and distant hills. The chimneys are surrounded by a haze of golden light, creating a sense of warmth and tranquility. A wide-angle shot with the chimneys in the foreground, capturing the entire sunset scene.

A winter storm rages in Antarctica, with swirling snow and icy winds. A toy robot in vibrant purple overalls and sturdy cowboy boots takes a leisurely stroll across the icy landscape. Its arms are held out to maintain balance against the gusts, and its large round eyes sparkle with curiosity and joy. The robot's legs move with a mechanical yet rhythmic motion, each step steady and determined. The background shows a rugged Antarctic terrain with jagged ice formations and the occasional exposed rock. The photo has a nostalgic, retro-futuristic style, capturing a moment of whimsical adventure amidst the harsh conditions. A medium shot with a dynamic angle, emphasizing the robot's journey through the storm.

An adorable kangaroo, dressed in a cute green dress with polka dots, is wearing a small sun hat perched on its head. The kangaroo takes a pleasant stroll through the bustling streets of Mumbai during a vibrant and colorful festival. The background is filled with lively festival-goers in traditional Indian attire, adorned with intricate henna designs and bright jewelry. The scene is filled with colorful decorations, vendors selling various items, and people dancing and singing......

An aerial shot of a fast-moving drone flying through a dense green jungle, capturing the vibrant foliage and lush canopy below. The drone glides smoothly, showcasing the intricate network of vines and towering trees. The background features a mix of bright green leaves and dappled sunlight filtering through the branches. The drone's path is dynamic, suggesting a sense of speed and movement. A high-angle aerial view with a clear focus on the drone's flight path.


ABLATION STUDY

We generated videos under three configurations. Without Diagonal Distillation, video quality remained largely unchanged relative to the full method, but generation time increased by approximately 40%. Without Diagonal Forcing, generation time remained similar, but video quality deteriorated significantly.

Without Diagonal Distillation

Without Diagonal Forcing

Full Method (Ours)

Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle.


Without Diagonal Distillation

Without Diagonal Forcing

Full Method (Ours)

A close up view of a glass sphere that has a zen garden within it. There is a small dwarf in the sphere who is raking the zen garden and creating patterns in the sand.


Without Diagonal Distillation

Without Diagonal Forcing

Full Method (Ours)

A 3D animation of a small, round, fluffy creature with big, expressive eyes exploring a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bushy, striped tail. It hops along a sparkling stream, its eyes wide with wonder. The forest is alive with magical elements: flowers that glow and change colors, trees with leaves in shades of purple and silver, and small floating lights that resemble fireflies. The creature stops to interact playfully with a group of tiny, fairy-like beings dancing around a mushroom ring. The creature looks up in awe at a large, glowing tree that seems to be the heart of the forest. The scene is rendered in a detailed, fantasy style, with a soft, ethereal lighting that enhances the enchantment. The camera follows the creature as it moves, capturing its playful interactions and the magical ambiance of the forest. A medium shot with a dynamic angle that highlights the creature's expressions and the enchanting environment.


Without Diagonal Distillation

Without Diagonal Forcing

Full Method (Ours)

A romantic scene in a nighttime cityscape where a man and a woman walk hand in hand under a starry sky, their faces illuminated by the soft glow of streetlights. They are dressed in casual yet elegant attire, the man in a dark blue suit and the woman in a light green dress. A wooden bucket is placed on the ground nearby, adding a touch of rustic charm. The couple's expressions are filled with happiness and affection, as they gaze into each other's eyes. The background features tall buildings with windows lit up, creating a warm and cozy atmosphere. The stars above twinkle brightly, enhancing the serene and intimate mood. The scene is captured in a medium shot with a slightly upward angle, capturing both the couple and the surrounding environment.



Long video generation

In long-sequence video generation, our method mitigates the exposure bias and color oversaturation that accumulate over long sequences. Generation speed also improves by approximately 96% compared to Self forcing.


Self forcing

CausVid

Ours

A dynamic rally car speeding through a tight turn on a winding track, tires screeching as it navigates the curves with precision. The car is a sleek, racing machine with a vibrant red body and black accents, its headlights glowing brightly in the night. The driver, a determined-looking man with focused eyes.......

A dramatic tilt-up shot from the base of a sleek, modern skyscraper, gradually moving upward to emphasize its towering height against the vast, clear sky. The building's glass facade reflects the sunlight, creating a shimmering effect. The sky is a blend of deep blue and light clouds, adding depth to the scene. The camera angle highlights the vertical lines and sharp edges of the structure, emphasizing its imposing presence. A dynamic and cinematic view, capturing the grandeur of the skyscraper.

A dynamic FPV aerial view of a vibrant underwater suburban neighborhood, where colorful corals line the streets. The camera moves swiftly, capturing the intricate details of the coral formations and the diverse marine life swimming around. The streets are bustling with colorful fish and schools of tropical fish......

A first-person perspective photo in a realistic outdoor style, capturing a hiker ascending a winding mountain trail. Each step reveals more of the breathtaking landscape ahead, including dense green forests, rugged cliffs, and distant peaks shrouded in mist......

A high-energy road race photograph capturing a cyclist powering up a steep hill. The cyclist is a middle-aged man with a determined expression, sweat glistening on his brow. He is dressed in a sleek, aerodynamic racing jersey and cycling shorts, with number clearly visible on his back. His helmet is snugly fastened......

A bustling market scene captured in the style of a documentary photo, showcasing a large truck driving through an open-air market. The truck moves past colorful stalls filled with various goods, each stall adorned with vibrant decorations and enticing merchandise......

A dynamic photograph capturing a little boy riding his bike through a garden that transitions through the changing seasons—fall leaves crunch underfoot, winter snow blankets the ground, spring flowers bloom, and summer sunshine sparkles through the foliage.......



Contrast and saturation

To directly measure exposure bias and its relation to prompt types and motion, we performed the following analyses. We quantified exposure bias as the drift magnitude of saturation and contrast over time across a 45-frame sequence. The results reveal a critical trade-off governed by noise intensity:

The Saturation metric is computed by converting images to HSV and averaging the mean saturation channel values across all frames. A higher value indicates oversaturation.

(N: total number of frames in the sequence; S_i: saturation channel of the i-th frame in HSV color space, where each pixel's value, ranging from 0 to 1, represents how vivid the color is.)
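Read literally, the description above corresponds to Saturation = (1/N) · Σ_i mean(S_i), where mean(S_i) averages the saturation channel over all pixels of frame i. The following is a minimal sketch under that reading, using only the Python standard library. The frame representation (nested lists of RGB triples in [0, 1]) and the function names are assumptions made for illustration, not the authors' implementation.

```python
import colorsys


def frame_mean_saturation(frame):
    """Mean HSV saturation of one frame.

    `frame` is a list of rows; each row is a list of (r, g, b) floats
    in [0, 1]. This layout is an assumption for the sketch.
    """
    sats = [colorsys.rgb_to_hsv(r, g, b)[1] for row in frame for (r, g, b) in row]
    if not sats:
        raise ValueError("frame contains no pixels")
    return sum(sats) / len(sats)


def saturation_metric(frames):
    """Average the per-frame mean saturation over all N frames."""
    if not frames:
        raise ValueError("need at least one frame")
    return sum(frame_mean_saturation(f) for f in frames) / len(frames)


if __name__ == "__main__":
    vivid_and_gray = [[(1.0, 0.0, 0.0), (0.5, 0.5, 0.5)]]  # one red, one gray pixel
    all_gray = [[(0.2, 0.2, 0.2), (0.8, 0.8, 0.8)]]
    print(saturation_metric([vivid_and_gray, all_gray]))
```

In practice an array library would replace the nested loops; the pure-Python form is used here only to keep the example self-contained.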

Saturation Metric Visualization

The Contrast metric uses RMS Contrast, calculated as the standard deviation of grayscale pixel intensities.

(H, W: image height and width in pixels; p_jk: grayscale intensity of the pixel at row j, column k; μ: the average intensity of all pixels in the image.)
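With the symbols defined above, the standard RMS contrast is sqrt((1/(H·W)) · Σ_j Σ_k (p_jk − μ)²), that is, the population standard deviation of the grayscale intensities. The sketch below computes it for a single frame. The nested-list image layout and the function name are assumptions, and how per-frame values are aggregated across a sequence is not specified in the text, so that step is left out.

```python
import math


def rms_contrast(gray):
    """RMS contrast of one grayscale image.

    `gray` is a list of H rows, each a list of W intensities
    (this layout is an assumption for the sketch).
    """
    pixels = [p for row in gray for p in row]
    if not pixels:
        raise ValueError("image contains no pixels")
    mu = sum(pixels) / len(pixels)  # mean intensity over all H*W pixels
    variance = sum((p - mu) ** 2 for p in pixels) / len(pixels)
    return math.sqrt(variance)


if __name__ == "__main__":
    checker = [[0.0, 1.0], [0.0, 1.0]]  # half black, half white
    flat = [[0.4, 0.4], [0.4, 0.4]]     # uniform image
    print(rms_contrast(checker), rms_contrast(flat))
```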

Contrast Metric Visualization

Therefore, our method's saturation and contrast values are closer to those of natural image distributions.
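The text quantifies exposure bias as the drift magnitude of saturation and contrast over time but does not give a formula for the drift. One possible reading, shown here purely as an assumption, is the largest absolute deviation of a per-frame metric from its value at the first frame. The function name and the sample values are hypothetical.

```python
def drift_magnitude(per_frame_values):
    """Largest absolute deviation from the first frame's value.

    This definition of drift is an assumption for illustration; the
    source does not specify how drift magnitude is computed.
    """
    if not per_frame_values:
        raise ValueError("need at least one per-frame value")
    reference = per_frame_values[0]
    return max(abs(v - reference) for v in per_frame_values)


if __name__ == "__main__":
    # Hypothetical per-frame saturation values, not measured data.
    stable = [0.30, 0.31, 0.30, 0.29]
    drifting = [0.30, 0.35, 0.50, 0.45]
    print(drift_magnitude(stable), drift_magnitude(drifting))
```

Under this reading, a sequence that oversaturates over time yields a larger drift value than one whose per-frame statistics stay near their initial level.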

Efficient Video Discrimination in Latent Space

In video generation, a series of motion-model-based methods (such as VideoJAM, Video-LaVIT, and MicroCinema) decode latent representations back into pixel space for motion estimation or quality discrimination. Although effective, this process introduces non-negligible computational overhead. Our experimental analysis indicates that the conversion from latent space to pixel space has itself become a system bottleneck: even without the reconstruction loss of the variational autoencoder (VAE), integrating the optical flow estimation module alone causes extremely high memory usage. This prompts us to reconsider whether pixel-space decoding is necessary for discrimination at all. The core insight of the flow distribution matching method proposed in this work is that the VAE encoder already retains sufficient video structure and temporal coherence information in the latent space. We therefore drop the time-consuming pixel-space reconstruction step and perform discrimination directly in the latent space, achieving a significant speedup and memory reduction while preserving generation quality.
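A rough size comparison illustrates why working in latent space is cheaper. The sketch below counts tensor elements only; it does not model the memory of an optical flow network. All concrete numbers are hypothetical assumptions not taken from this page: an 81-frame 480x832 RGB clip, and a causal video VAE with 4x temporal compression, 8x spatial compression, and 16 latent channels.

```python
def pixel_elements(frames, height, width, channels=3):
    """Number of values in a decoded pixel-space clip."""
    return frames * height * width * channels


def latent_elements(frames, height, width,
                    t_stride=4, s_stride=8, latent_channels=16):
    """Number of values in the corresponding latent clip.

    Assumes a causal VAE that keeps the first frame and compresses the
    rest by `t_stride` in time and `s_stride` in each spatial dimension
    (hypothetical compression factors).
    """
    latent_frames = (frames - 1) // t_stride + 1
    return latent_frames * (height // s_stride) * (width // s_stride) * latent_channels


if __name__ == "__main__":
    px = pixel_elements(81, 480, 832)
    lt = latent_elements(81, 480, 832)
    print(f"pixel elements:  {px}")
    print(f"latent elements: {lt}")
    print(f"ratio: {px / lt:.1f}x")
```

Under these assumed factors, a discriminator that reads latents processes far fewer values per clip than one that reads decoded frames, before even counting the cost of the decoder itself.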



5-min video generation

Even when the video duration is extended to 5 minutes, generation quality remains highly consistent: the picture is stable, details remain rich, and no quality degradation is observed, demonstrating the method's ability to generate long sequences.


A middle-aged man in a white coat with a beard is cooking while referring to a book in his hand. The background is the kitchen

A middle-aged man with a white beard, wearing a black coat, was looking at his mobile phone on the rooftop. Behind him were tall buildings. As time went by, the man grew older and older, with more wrinkles on his face and his hair turning white

On the e-sports field, several young people sat at the game table and started playing a game. Then, they won the game and everyone stood up to cheer

The scene begins with the camera shuttling through the dark town, which is very eerie and terrifying. Gradually, the lens pulls back, revealing the entire city. The camera starts to circle around the entire city, and the picture takes on a spherical shape

The protagonist of the Dragon Ball animation, Goku, is right in the center of the screen, with lightning flashing from time to time around him and flashes of lightning light from his hand

A hacker, dressed in black, wearing a hat and black, was sitting in front of a computer typing. The computer screen was bright and very bright

A man, wearing headphones, is sitting at a desk typing on a computer, surrounded by computers.