In a recent paper, “Make-A-Video: Text-to-Video Generation without Text-Video Data”, Meta AI researchers propose a new approach to text-to-video generation that does not require paired text-video data.

The researchers’ intuition is that a model can learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage.

Their three-stage approach involves training a text-to-image model, then extending it to video by training on unlabeled video data in an unsupervised fashion, and finally adding a frame interpolation network to raise the frame rate.
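
To make the last stage more concrete, here is a minimal sketch of what raising the frame rate means. It is not the paper's method: Make-A-Video uses a learned frame interpolation network, whereas this toy stand-in simply blends neighbouring frames linearly. The function name `interpolate_frames`, the `factor` parameter, and the `(T, H, W, C)` array layout are all assumptions made for illustration.

```python
import numpy as np


def interpolate_frames(video: np.ndarray, factor: int) -> np.ndarray:
    """Naively raise the frame rate by linear blending (illustrative only).

    video:  array of shape (T, H, W, C), one entry per frame (assumed layout).
    factor: output intervals per input interval; factor=2 inserts one
            blended frame between every pair of neighbouring frames.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    video = video.astype(np.float32)
    num_in = video.shape[0]
    if num_in < 2:
        return video.copy()

    # Positions of the output frames on the input time axis.
    num_out = (num_in - 1) * factor + 1
    positions = np.linspace(0, num_in - 1, num_out)

    # Neighbouring input frames and the blend weight for each output frame.
    left = np.floor(positions).astype(int)
    right = np.minimum(left + 1, num_in - 1)
    weight = (positions - left).reshape(-1, 1, 1, 1).astype(np.float32)

    # A learned network would predict these frames; here we just blend.
    return (1.0 - weight) * video[left] + weight * video[right]
```

Calling `interpolate_frames(video, 2)` on a clip with 3 frames returns 5 frames, with each new frame being the average of its two neighbours. A learned interpolation network can produce plausible motion where simple blending would only produce ghosting.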

The researchers’ experiments show that their approach outperforms previous baselines, including CogVideo, on two standard video generation benchmarks: MSR-VTT and UCF-101.

In addition, according to human evaluation, their approach generates videos that are more faithful to the input text and of higher quality.

The approach proposed in this paper represents a significant advance in text-to-video generation and opens up new possibilities for applications of this technology.

Here is a great video made by Aleksa Gordić – The AI Epiphany on the subject:


Prompt used for the illustration:

a filmmaker robot recording a movie with a camera, futuristic style


