Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

¹Purdue University, ²Baidu USA

Show Spider-Man jumping off a skyscraper in a metropolitan city at night.


Abstract

We build the first multimodal agent-based video generation pipeline driven by 3D engine scripting. Given a text prompt, multimodal agents collaborate to produce detailed Blender scripts that generate videos of any length with plausible character and motion consistency.



Multi-agent Collaboration Framework

Our video generation pipeline combines collaboration among multiple agents with multimodal reflection loops over both the generated visual results and the code libraries. The LLM Director agent takes the user query and transforms it into a detailed functional process. The LLM Programmer agent then composes the corresponding Blender Python scripts, using in-context function libraries for the set of basic processes. Intermediate screenshots and video outputs are reviewed by the VLLM Reviewer agent, which evaluates key features. These reflections are fed back to the Programmer agent for iterative refinement. In addition, the Reviewer agent and the function libraries are informed by retrieved public video tutorials and online documents to speed up the reflection loops.
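The control flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the released implementation: `run_pipeline`, `Review`, and the four `stub_*` functions are hypothetical names, the stubs stand in for real LLM, VLLM, and Blender calls, and the `max_rounds` cap is an assumption added to keep the loop bounded.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    """Reviewer verdict on the rendered frames (hypothetical structure)."""
    approved: bool
    feedback: str


def run_pipeline(
    prompt: str,
    director: Callable[[str], List[str]],
    programmer: Callable[[List[str], str], str],
    render: Callable[[str], List[str]],
    reviewer: Callable[[str, List[str]], Review],
    max_rounds: int = 3,  # assumed cap, not taken from the source
) -> Tuple[str, int]:
    """Director -> Programmer -> (render -> Reviewer -> Programmer) loop."""
    steps = director(prompt)           # break the query into functional steps
    script = programmer(steps, "")     # first draft, no feedback yet
    rounds = 0
    for _ in range(max_rounds):
        rounds += 1
        frames = render(script)        # screenshots / video of the draft
        review = reviewer(prompt, frames)
        if review.approved:
            break
        # Feed the reflection back to the Programmer for a revised script.
        script = programmer(steps, review.feedback)
    return script, rounds


# --- Stub agents so the sketch runs without any model or Blender install ---

def stub_director(prompt: str) -> List[str]:
    return [f"build scene for: {prompt}", "add character",
            "animate motion", "place camera"]


def stub_programmer(steps: List[str], feedback: str) -> str:
    lines = [f"# step: {s}" for s in steps]
    if feedback:
        lines.append(f"# fix: {feedback}")
    return "\n".join(lines)


def stub_render(script: str) -> List[str]:
    # Stand-in for rendering: each script line plays the role of a frame.
    return script.splitlines()


def stub_reviewer(prompt: str, frames: List[str]) -> Review:
    # Approve only once the script contains a fix from an earlier review.
    if any(f.startswith("# fix:") for f in frames):
        return Review(True, "")
    return Review(False, "character drifts between frames")


final_script, rounds_used = run_pipeline(
    "Spider-Man jumping off a skyscraper at night",
    stub_director, stub_programmer, stub_render, stub_reviewer,
)
```

With these stubs the first draft is rejected, the Programmer revises it once, and the second draft is approved. If the round cap is reached first, the sketch returns the latest revision without another review; a real system would need its own stopping policy.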

Framework image.

Demos

[Simulation + AnimateDiff] Input Prompt: Show Spider-Man jumping off a skyscraper in a metropolitan city at night, emphasizing his bold descent against the city lights.


[Compositing shot videos into a movie] A short demo film, crafted with our agent-based pipeline for generating controllable and physically plausible videos.


[Camera Control: Zooming out] Input Prompt: man looking like Elon Musk, cheering and waving his arms, rocket, cybertruck behind.


[Camera Control: Zooming in] Same Prompt as above.


Input Prompt: Anime character dva relaxing in her bedroom, sitting down to snack and play video games.


Input Prompt: A wounded female knight faltering and dramatically collapsing in a poisonous forest


Input Prompt: A man in business casual and a blonde princess dancing on the stage, facing each other, camera flys close to the blonde princess.



BibTeX


      @article{he2024kubrick,
        title={Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation}, 
        author={Liu He and Yizhi Song and Hejun Huang and Daniel Aliaga and Xin Zhou},
        year={2024},
        journal={arXiv preprint arXiv:2408.10453},
        url={https://arxiv.org/abs/2408.10453}, 
      }