Google Shows Off Flexible New Generative AI Video Model VideoPoet
Google Research has introduced VideoPoet, a large language model (LLM) designed specifically for video generation tasks. VideoPoet offers text-to-video, image-to-video, and video-to-audio capabilities, combining multiple synthetic video production functions within a single LLM. Unlike approaches that require a separately trained model for each task, VideoPoet demonstrates how one LLM can handle a variety of high-quality video generation tasks.
Introducing VideoPoet
VideoPoet showcases the ability of a single LLM to handle multiple video generation tasks, including text-to-video, image-to-video, and video-to-audio conversion. The model is trained on a diverse mix of video, image, audio, and text data, which allows flexible conditioning and fine control over its outputs.
Main Points
- VideoPoet matches text prompts and renders motion more accurately than competing models, despite relying on a single LLM instead of multiple specialized models.
- The model can generate longer videos by iteratively predicting each additional second of footage, maintaining coherent object appearances over time.
- VideoPoet allows for fine-grained editing of generated clips by manipulating object motions and adding camera directions.
- LLMs operate on discrete tokens, which makes video a challenging modality for them. However, existing video and audio tokenizers can encode clips as sequences of discrete tokens, letting an LLM model video the same way it models text (a toy sketch of this pipeline follows the list).
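To make the pipeline concrete, here is a minimal, hypothetical sketch of the encode-predict-decode loop described above, including the second-by-second extension used to grow longer clips. Every class name, token budget, and vocabulary size below is an illustrative placeholder, not VideoPoet's actual component (the real system reportedly pairs its LLM with learned tokenizers such as MAGVIT-v2 for video and SoundStream for audio):

```python
"""Toy sketch of discrete-token video generation: encode frames to token
IDs, autoregressively predict the tokens for each new second, then decode
the full token stream back into frames. All components are placeholders."""

import random
from typing import List

TOKENS_PER_SECOND = 256   # assumed token budget for one second of video
VOCAB_SIZE = 8192         # assumed codebook size of the video tokenizer


class ToyVideoTokenizer:
    """Stand-in for a learned video tokenizer; this toy just maps pixel
    values to codebook indices and back."""

    def encode(self, frames: List[List[int]]) -> List[int]:
        # One token per pixel, clamped into the codebook range.
        return [p % VOCAB_SIZE for frame in frames for p in frame]

    def decode(self, tokens: List[int]) -> List[List[int]]:
        # Invert the toy mapping: chunk the token stream back into "frames".
        frame_len = 4
        return [tokens[i:i + frame_len]
                for i in range(0, len(tokens), frame_len)]


class ToyAutoregressiveLM:
    """Stand-in for the LLM: predicts the next video token given the prefix."""

    def next_token(self, prefix: List[int]) -> int:
        return random.randrange(VOCAB_SIZE)  # a trained model goes here


def extend_video(frames: List[List[int]], seconds: int) -> List[List[int]]:
    """Extend a clip by iteratively predicting one second of tokens at a
    time, as the article describes for generating longer videos."""
    tok, lm = ToyVideoTokenizer(), ToyAutoregressiveLM()
    tokens = tok.encode(frames)
    for _ in range(seconds):
        for _ in range(TOKENS_PER_SECOND):
            tokens.append(lm.next_token(tokens))
    return tok.decode(tokens)


if __name__ == "__main__":
    seed_clip = [[10, 20, 30, 40], [11, 21, 31, 41]]  # two tiny "frames"
    longer = extend_video(seed_clip, seconds=2)
    print(f"extended clip has {len(longer)} frames")
```

The point of the sketch is only that once video is reduced to discrete tokens, standard next-token prediction extends a clip indefinitely; in the real model the predicted tokens come from a transformer trained on tokenized video corpora rather than a random stand-in.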
Conclusion
VideoPoet demonstrates the potential of LLMs in video generation by producing high-quality videos with interesting, coherent motion. Google Research sees promising future directions for the model, including extending its capabilities to text-to-audio, audio-to-video, and video captioning.
Source: Voicebot.ai