Still-Moving

A Research Exploration into Customized Video Generation without Customized Video Data


Personalized Video Generation

Given a text-to-video (T2V) model built over a text-to-image (T2I) model, Still-Moving can adapt any customized T2I weights to align with the T2V model. This adaptation uses only a few still reference images and preserves the motion prior of the T2V model.
Below we show examples of personalized video generation obtained by adapting personalized T2I models (e.g., DreamBooth [Ruiz et al. 2022]).

 

Reference images

Generated Personalized Videos

[V] chipmunk flying with a cape
[V] chipmunk riding a skateboard
[V] chipmunk picnicking under a tree with a basket of snacks
[V] chipmunk riding a toy car down a grassy hill
[V] pig swinging on a tire swing in a backyard playground
[V] pig pretending to fly with a cape in the wind
[V] pig skiing down a slope
[V] pig dancing in a field filled with colorful flowers
[V] boy in a field
[V] boy exploring an underwater world
[V] boy building a sandcastle on the beach
[V] boy splashing in puddles
[V] porcupine swinging on a tire swing in a backyard playground
[V] porcupine walking to school with a backpack and a hat on his head
[V] porcupine riding a toy car down a grassy hill
[V] porcupine surfing on a big tall wave
[V] cat running through fallen leaves in an autumn forest
[V] cat happily cuddled in a comfy blue blanket
[V] cat splashing in a puddle after rain
[V] cat rollerblading in the park
[V] woman walking down the street with a backpack
[V] woman lounging on a hammock with closed eyes
[V] woman reading a book
[V] woman in a field of flowers
[V] dog driving a race car
[V] dog sailing in a miniature boat
[V] dog dressed as a chef cooking in the kitchen
[V] dog flying in the sky

Introduction

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop).

Naively plugging the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on frozen videos (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model.

We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.
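To make the training scheme concrete, here is a minimal, hypothetical sketch in Python. The `SpatialAdapter` class, its low-rank residual form, and the zero initialization are illustrative assumptions (the paper's actual adapter architecture may differ); `make_frozen_video` shows how still images sampled from the customized T2I model become the static clips that the adapters are trained on. The Motion Adapter, which keeps the temporal layers well-behaved during this static training and is removed at test time, is not modeled here.

```python
import numpy as np

class SpatialAdapter:
    """Lightweight residual adapter placed after an injected,
    customized T2I layer (hypothetical low-rank form)."""
    def __init__(self, dim, rank=4):
        rng = np.random.default_rng(0)
        self.down = rng.normal(0.0, 0.02, (dim, rank))
        self.up = np.zeros((rank, dim))  # zero-init: identity mapping before training

    def __call__(self, x):
        # Residual correction on top of the customized T2I features.
        return x + x @ self.down @ self.up

def make_frozen_video(image, num_frames=8):
    """A 'frozen video': one still image repeated across time.
    Such clips, built from samples of the customized T2I model,
    are the only training data the Spatial Adapters ever see."""
    return np.stack([image] * num_frames, axis=0)

# At initialization the adapter is transparent, so injecting the
# customized T2I weights is the starting point, not a disruption.
feat = np.ones((16, 64))             # toy (tokens, channels) feature map
adapter = SpatialAdapter(dim=64)
out = adapter(feat)

frames = make_frozen_video(np.zeros((8, 8, 3)))  # static 8-frame clip
```

The zero-initialized `up` projection means the adapted model starts out exactly equal to the naively injected model, and training only learns the correction needed to reconcile the customized spatial features with the video backbone.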

Stylized Video Generation

Still-Moving can also be used to generate videos with consistent styles based on pretrained stylization T2I models (e.g., StyleDrop [Sohn et al. 2023]). Each row below contains a diverse set of videos that adhere to the style of the reference image on the left, while also exhibiting the natural motion of the T2V model.

 

Style reference image

Generated Stylized Videos

A sunflower blooming
A fox frolicking in the forest
A bear twirling with delight
A family of ducks swimming in a pond
A chubby panda munching on bamboo shoots
A penguin dancing
A woman drinking coffee
A boy playing with a ball
A chubby panda munching on bamboo shoots
A bear walking
A dolphin swimming
A dragon flying
Ominous moonlight filtering through twisted trees
A woman with red eyes peeking
A group of friends hanging out
A ghostly apparition reflected in a murky pond
A bionic crocodile lurking beneath a holographic swamp
A spacecraft flying through asteroids
A cyborg bear walking through a cityscape
A cat exploring a neon-lit alleyway
A soaring eagle
A woman dancing
Glowing jellyfish swimming gracefully
Surreal alien spaceship hovering in the neon-lit sky

ControlNet + Stylized Video Generation

Still-Moving customized models can be combined with ControlNet [Zhang et al. 2023], allowing the generation of videos whose style adheres to that of a given T2I model, while their structure and dynamics are determined by a given reference video.
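As a rough illustration of the control pathway, the sketch below extracts a per-frame structure map from a source video. The gradient-magnitude "edge" map and the function names are illustrative assumptions; a real ControlNet consumes edge, depth, or pose maps produced by dedicated detectors, and the customized T2V model that would be conditioned on these maps is not modeled here.

```python
import numpy as np

def edge_control(frame):
    """Toy per-frame control signal: gradient magnitude, standing in
    for the edge/depth/pose maps a real ControlNet would consume."""
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy)

def control_sequence(video):
    """Extract a control map for every frame of the source video.
    Structure and dynamics come from these maps; appearance and style
    come from the customized model (not modeled in this sketch)."""
    return np.stack([edge_control(f) for f in video], axis=0)

# A 4-frame toy video containing a static vertical edge.
video = np.zeros((4, 8, 8))
video[:, :, 4:] = 1.0
controls = control_sequence(video)
```

Since the control maps are computed per frame, the source video's motion is carried through to the output frame by frame, which is what lets the reference video dictate structure and dynamics independently of the customized style.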

Each of the five rows shows a style reference image alongside three source videos with their control signals and the corresponding stylized output videos.

ControlNet + Personalized Video Generation

The videos below were generated by combining the fine-grained control and structure preservation of ControlNet with the personalization abilities of Still-Moving.

Each of the three rows shows a set of personalization reference images alongside three source videos with their control signals and the corresponding personalized output videos.

Authors

1Google DeepMind
2Tel-Aviv University
3Weizmann Institute of Science
4Technion

(†) First author, (*) Core technical contribution
Work was done while the first author was an intern at Google.

Acknowledgements

We would like to thank Jess Gallegos, Sarah Rumbley, Irina Blok, Daniel Hall, Parth Parekh, Quinn Perfetto, Andeep Toor, Hartwig Adam, Kai Jiang, David Hendon, JD Velasquez, William T. Freeman and David Salesin for their collaboration, insightful discussions, feedback and support.
We thank the owners of the images and videos used in our experiments (links for attribution) for sharing their valuable assets.

Societal Impact

Our primary goal in this work is to enable novice users to generate visual content in a creative and flexible way. However, this technology carries a risk of misuse for creating fake or harmful content, and we believe it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure safe and fair use.