A Research Exploration into Customized Video Generation without Customized Video Data
Given a text-to-video (T2V) model built over a text-to-image (T2I) model, Still-Moving can adjust any customized T2I weights to align with the T2V model. This adaptation requires only a few still reference images and preserves the motion prior of the T2V model.
Below we show examples of personalized video generation by adapting personalized T2I models (e.g., DreamBooth [Ruiz et al. 2022]).
* Hover over the video to see the prompt.
Reference images
Generated Personalized Videos
Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on frozen videos (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.
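To make this recipe concrete, the PyTorch-style sketch below mirrors the structure described above: frozen customized T2I layers wrapped with trainable Spatial Adapters, temporal layers wrapped with a removable Motion Adapter, training data built by repeating still images into frozen videos, and a switch that drops the Motion Adapters at test time. All class and function names, and the low-rank residual parameterization of the adapters, are illustrative assumptions rather than the exact implementation.

```python
# Minimal PyTorch-style sketch of the Still-Moving recipe described above.
# All names (LowRankAdapter, AdaptedSpatialLayer, make_frozen_video, ...) and the
# low-rank residual adapter parameterization are illustrative assumptions.
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Lightweight residual adapter, zero-initialized so it starts as the identity."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # no effect at initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))


class AdaptedSpatialLayer(nn.Module):
    """Injected (customized) T2I layer followed by a trainable Spatial Adapter.
    Assumes channel-last features, as in attention/linear blocks."""

    def __init__(self, t2i_layer: nn.Module, dim: int):
        super().__init__()
        self.t2i_layer = t2i_layer  # frozen; holds the customized T2I weights
        self.spatial_adapter = LowRankAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial_adapter(self.t2i_layer(x))


class AdaptedTemporalLayer(nn.Module):
    """Temporal layer of the T2V model with a removable Motion Adapter."""

    def __init__(self, temporal_layer: nn.Module, dim: int):
        super().__init__()
        self.temporal_layer = temporal_layer  # frozen; carries the motion prior
        self.motion_adapter = LowRankAdapter(dim)
        self.use_motion_adapter = True  # True only while training the adapters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.temporal_layer(x)
        return self.motion_adapter(out) if self.use_motion_adapter else out


def make_frozen_video(still_image: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Turn one customized-T2I sample (C, H, W) into a static video (F, C, H, W)."""
    return still_image.unsqueeze(0).repeat(num_frames, 1, 1, 1)


def trainable_parameters(t2v_model: nn.Module):
    """Only the adapters are trained; the T2V and injected T2I weights stay frozen."""
    for name, param in t2v_model.named_parameters():
        param.requires_grad = "adapter" in name
        if param.requires_grad:
            yield param


def set_inference_mode(t2v_model: nn.Module) -> None:
    """At test time, drop the Motion Adapters and keep the trained Spatial Adapters."""
    for module in t2v_model.modules():
        if isinstance(module, AdaptedTemporalLayer):
            module.use_motion_adapter = False


# Example: wrap a frozen customized T2I projection of width 320 with a Spatial Adapter.
layer = AdaptedSpatialLayer(nn.Linear(320, 320), dim=320)
out = layer(torch.randn(2, 16, 320))  # (batch * frames, tokens, channels)
```

In this sketch, only the adapter parameters are optimized on frozen videos sampled from the customized T2I model; calling set_inference_mode afterwards restores the original temporal pathway, and with it the motion prior of the T2V model.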
Still-Moving can also be used for generating videos with consistent styles based on pretrained stylization T2I models (e.g., StyleDrop [Sohn et al. 2023]). Each row below contains a diverse set of videos which adhere to the style of the reference image on the left, while also exhibiting the natural motion of the T2V model.
* Hover over the video to see the prompt.
Style reference image
Generated Stylized Videos
Still-Moving customized models can be combined with ControlNet [Zhang et al. 2023] to generate videos whose style adheres to that of a given customized T2I model, while their structure and dynamics are determined by a given reference video.
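As a rough illustration of how this combination fits together, the toy sketch below adds per-frame ControlNet-style residuals (derived from the source video) to the features of a customized backbone. Every module here is a random stand-in rather than real ControlNet or T2V weights, and the residual-addition interface is an assumption about a typical ControlNet setup, not the exact pipeline used for these results.

```python
# Toy, self-contained sketch: per-frame ControlNet-style residuals carry the structure
# of the source video, while the Still-Moving customized backbone carries the style.
# All modules are random stand-ins; the residual-addition interface is an assumption.
import torch
import torch.nn as nn


class ToyControlNet(nn.Module):
    """Maps per-frame control maps (e.g., edges or depth) to residual features."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, control: torch.Tensor) -> torch.Tensor:  # (F, C, H, W)
        return self.proj(control)


class ToyCustomizedBackbone(nn.Module):
    """Stand-in for the Still-Moving customized T2V backbone."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, latents: torch.Tensor, control_residuals: torch.Tensor) -> torch.Tensor:
        # Adding the residuals lets the source video dictate structure and dynamics,
        # while the customized weights dictate appearance and style.
        return self.spatial(latents + control_residuals)


frames, channels, size = 8, 4, 32
latents = torch.randn(frames, channels, size, size)  # noisy video latents
control = torch.randn(frames, channels, size, size)  # per-frame control maps

denoised = ToyCustomizedBackbone(channels)(latents, ToyControlNet(channels)(control))
print(denoised.shape)  # torch.Size([8, 4, 32, 32])
```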
Each row below pairs a style reference image with source videos, their extracted control signals, and the corresponding stylized output videos.
The videos below were generated by combining the fine-grained control and structure preservation of ControlNet with the personalization abilities of Still-Moving.
Each row below pairs a set of reference images with source videos, their extracted control signals, and the corresponding personalized output videos.
(†) First author, (*) Core technical contribution
Work was done while the first author was an intern at Google.
We would like to thank Jess Gallegos, Sarah Rumbley, Irina Blok, Daniel Hall, Parth Parekh, Quinn Perfetto, Andeep Toor, Hartwig Adam, Kai Jiang, David Hendon, JD Velasquez, William T. Freeman and David Salesin for their collaboration, insightful discussions, feedback and support.
We thank the owners of the images and videos used in our experiments (links for attribution) for sharing their valuable assets.
Our primary goal in this work is to enable novice users to generate visual content in a creative and flexible way. However, our technology could be misused to create fake or harmful content, and we believe it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure safe and fair use.