
What would a behind-the-scenes look at a video generated by an artificial intelligence model reveal? You might think the process is similar to stop-motion animation, where many images are created and stitched together, but that is not quite the case for "diffusion models" like OpenAI's Sora and Google's Veo 2.

Instead of producing a video frame by frame (or "autoregressively"), these systems process the entire sequence at once. The resulting clips tend to be photorealistic, but the process is slow and doesn't allow changes on the fly.

Scientists at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and Adobe Research have now developed a hybrid approach, called "CausVid," that creates videos in seconds. Much like a quick-witted student learning from a well-versed teacher, a full-sequence diffusion model trains an autoregressive system to swiftly predict the next frame while ensuring high quality and consistency. CausVid's student model can then generate clips from a simple text prompt, turn a photo into a moving scene, extend a video, or alter its creations with new inputs mid-generation.
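To make the contrast between the two generation styles concrete, here is a minimal, purely illustrative Python sketch. The "networks" are random placeholders, and none of the names correspond to CausVid's actual code; the point is only the difference between refining a whole clip over many passes and emitting frames one at a time.

```python
import torch

# Toy placeholder "networks"; the real CausVid models are not reproduced here.
def denoise_whole_clip(clip):
    """One bidirectional refinement pass over the entire clip at once."""
    return clip - 0.02 * torch.randn_like(clip)

def predict_next_frame(past_frames):
    """One causal step: the next frame depends only on earlier frames."""
    return past_frames[-1] + 0.01 * torch.randn_like(past_frames[-1])

num_frames, height, width = 16, 32, 32

# Diffusion-style generation: many denoising passes over the full sequence
# (the "50-step process" the article mentions), so nothing can be shown early.
clip = torch.randn(num_frames, 3, height, width)
for _ in range(50):
    clip = denoise_whole_clip(clip)

# Autoregressive-style generation: frames are produced one at a time, so they
# can be streamed out immediately and new inputs can steer later frames.
frames = [torch.randn(3, height, width)]
for _ in range(num_frames - 1):
    frames.append(predict_next_frame(frames))
video = torch.stack(frames)
```

The causal loop is what makes interactive, mid-generation edits possible, since nothing after the current frame has been committed yet.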

The dynamic tool enables fast, interactive content creation, cutting a 50-step process into just a few actions. It can craft many imaginative and artistic scenes, such as a paper airplane morphing into a swan, woolly mammoths venturing through snow, or a child jumping in a puddle. Users can also give an initial prompt, like "generate a person crossing the street," and then make follow-up inputs to add new elements to the scene, like "he writes in his notebook when he gets to the opposite sidewalk."

A video produced by CausVid, showing a figure in an old deep-sea diving suit walking across leaves, illustrates the model's ability to create smooth, high-quality content.

AI-generated animation courtesy of the researchers.

The model could be used for various video editing tasks, such as helping viewers understand a livestream in a different language by generating a video that syncs with an audio translation, CSAIL researchers said. It could also help render new content in a video game or quickly produce training simulations to teach robots new tasks.

Tianwei Yin SM '25, PhD '25, a recent graduate in electrical engineering and computer science and a CSAIL affiliate, attributes the model's strength to its hybrid approach.

"CausVid combines a pre-trained diffusion model with the autoregressive architecture typically found in text-generation models," said Yin, co-lead author of a new paper about the tool. "This AI-powered teacher model can envision future steps to train a frame-by-frame system to avoid making rendering errors."

Yin's co-lead author, Qiang Zhang, is a research scientist at xAI and a former CSAIL visiting researcher. They worked with Adobe Research scientists Richard Zhang, Eli Shechtman, and Xun Huang, and two CSAIL principal investigators: MIT professors Bill Freeman and Frédo Durand.

Cause and effect

Many autoregressive models can create a video that is initially smooth, but the quality tends to drop off later in the sequence. A clip of a person running might seem lifelike at first, but their legs begin to flail in unnatural directions, indicating frame-to-frame inconsistencies (also known as "error accumulation").

Error-prone video generation was common in earlier causal approaches, which learned to predict frames one at a time on their own. CausVid instead uses a high-powered diffusion model to teach a simpler system its general video expertise, enabling it to create smooth visuals much faster.
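As a rough illustration of that teacher-student idea (not CausVid's actual training code, whose architectures and losses are described in the paper), the hedged sketch below distills a toy "teacher" that sees a whole clip at once into a causal "student" that only looks backward in time.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a "teacher" that processes the whole clip at once and
# a causal "student" (a GRU) that only attends to past frames. The real CausVid
# teacher and student are diffusion and autoregressive video networks.
teacher = nn.Linear(64, 64)                     # toy full-sequence model
student = nn.GRU(64, 64, batch_first=True)      # toy causal, frame-by-frame model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

clip = torch.randn(1, 16, 64)                   # (batch, frames, features) toy clip

with torch.no_grad():
    target = teacher(clip)                      # teacher's high-quality output

student_out, _ = student(clip)                  # student predicts causally
loss = nn.functional.mse_loss(student_out, target)  # match the teacher, frame by frame
loss.backward()
optimizer.step()
```

Because the student is supervised by the teacher's full-sequence output rather than only by its own previous guesses, errors are less likely to pile up from frame to frame.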

CausVid enables fast, interactive video creation, cutting a 50-step process into just a few actions.
Video courtesy of the researchers.

CausVid displayed its video-making aptitude when researchers tested its ability to make high-resolution, 10-second videos. It outperformed baselines like "OpenSora" and "MovieGen," working 100 times faster than its competition while producing the most stable, high-quality clips.

Yin and his colleagues then tested CausVid's ability to put out stable 30-second videos, where it also topped comparable models on quality and consistency. These results suggest that CausVid may eventually produce stable videos of much longer, even indefinite, duration.

A subsequent study found that users preferred the videos generated by CausVid's student model over those of its diffusion-based teacher.

"The speed of the autoregressive model really makes a difference," Yin said. "Its videos look just as good as the teacher's, but they take less time to produce; the trade-off is that the visuals are less diverse."

CausVid also excelled when tested on over 900 prompts from a text-to-video dataset, receiving the top overall score of 84.27. It posted the best metrics in categories such as imaging quality and realistic human actions, eclipsing video generation models like "Vchitect" and "Gen-3."

While an efficient step forward for AI video generation, CausVid may soon be able to design visuals even faster, perhaps instantly, with a smaller causal architecture. Yin said that if the model were trained on domain-specific datasets, it would likely create higher-quality clips for robotics and gaming.

Experts say this hybrid system is a promising upgrade over diffusion models, which are currently bogged down by slow processing speeds. "Diffusion models are way slower than LLMs [large language models] or generative image models," said Jun-Yan Zhu, an assistant professor at Carnegie Mellon University who is not involved in the paper. "This new work changes that, making video generation much more efficient. That means better streaming speed, more interactive applications, and a lower carbon footprint."

The team's work was supported, in part, by the Amazon Science Hub, the Gwangju Institute of Science and Technology, Adobe, Google, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator. CausVid will be presented at the Conference on Computer Vision and Pattern Recognition in June.
