Still-Moving

A Research Exploration into Customized Video Generation without Customized Video Data

Read Paper
Supplementary Materials

Personalized Video Generation

Given a text-to-video (T2V) model built over a text-to-image (T2I) model, Still-Moving can adjust any customized T2I weights to align with the T2V model. This adaptation uses only a few still reference images and preserves the motion prior of the T2V model.
Below we show examples of personalized video generation obtained by adapting personalized T2I models (e.g., DreamBooth [Ruiz et al. 2022]).

 
* Hover over the video to see the prompt.

Reference images

Generated Personalized Videos

[V] chipmunk flying with a cape
[V] chipmunk riding a skateboard
[V] chipmunk picnicking under a tree with a basket of snacks
[V] chipmunk riding a toy car down a grassy hill
[V] chipmunk playing with a ball.
[V] chipmunk riding a colorful bicycle through a sunny park.
[V] chipmunk skiing down a slope
[V] chipmunk zooming on a hoverboard
[V] pig swinging on a tire swing in a backyard playground
[V] pig pretending to fly with a cape in the wind
[V] pig skiing down a slope
[V] pig dancing in a field filled with colorful flowers
[V] pig surfing on a big tall wave
[V] pig jumping happily in a pile of autumn leaves.
[V] pig driving a race car
[V] pig walking to school with a backpack and a hat on his head
[V] boy in a field
[V] boy exploring an underwater world
[V] boy building a sandcastle on the beach
[V] boy splashing in puddles
[V] boy in a field of flowers
[V] boy blowing bubbles
[V] boy riding a race cart
[V] boy on a camping trip
[V] porcupine swinging on a tire swing in a backyard playground
[V] porcupine walking to school with a backpack and a hat on his head
[V] porcupine riding a toy car down a grassy hill
[V] porcupine surfing on a big tall wave
[V] porcupine racing toy boats in a bathtub filled with water.
[V] porcupine building a sandcastle on a sunny beach.
[V] porcupine skateboarding down a gentle slope with excitement
[V] porcupine dressing up in different hats and accessories
[V] cat running through fallen leaves in an autumn forest
[V] cat happily cuddled in a comfy blue blanket
[V] cat splashing in a puddle after rain
[V] cat rollerblading in the park
[V] cat running through a meadow
[V] cat on a scooter
[V] cat wearing a fancy hat, looking adorable.
[V] cat lounging in a hammock.
[V] woman walking down the street with a backpack
[V] woman lounging on a hammock with closed eyes
[V] woman reading a book
[V] woman in a field of flowers
[V] woman creating an artistic masterpiece in a sunlit art studio
[V] woman enjoying morning coffee at a cozy cafe
[V] woman playing with a cute dog
[V] woman riding her bike
[V] dog driving a race car
[V] dog sailing in a miniature boat
[V] dog dressed as a chef cooking in the kitchen
[V] dog flying in the sky
[V] dog wearing sunglasses, looking cool
[V] dog riding a skateboard
[V] dog rollerblading in the park
[V] dog playing with a helicopter toy

Introduction

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on frozen videos (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and keep only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.
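To make the training recipe above concrete, below is a minimal PyTorch-style sketch: the customized T2I weights are injected into the T2V model, lightweight Spatial Adapters are trained on frozen videos built by repeating customized image samples, and a Motion Adapter modulates the temporal layers during this training so it can simply be dropped at test time. The module names follow the paper, but the LoRA-style adapter design, the scaling form of the Motion Adapter, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAdapter(nn.Module):
    """LoRA-style residual adapter placed after an injected customized T2I layer
    (illustrative design; the paper only requires the adapters to be lightweight)."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # identity mapping at initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))

class MotionAdapter(nn.Module):
    """Scales the output of a temporal layer. Trained jointly with the Spatial
    Adapters on frozen (static) videos, then removed at test time to restore
    the motion prior of the T2V model."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))

    def forward(self, temporal_out: torch.Tensor) -> torch.Tensor:
        return self.alpha * temporal_out

def make_frozen_video(image: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Turn a customized T2I sample (C, H, W) into a static 'frozen video' (T, C, H, W)."""
    return image.unsqueeze(0).repeat(num_frames, 1, 1, 1)

# Training (schematic): only the adapters are optimized; the T2V backbone and the
# injected customized T2I weights stay frozen.
#   params = list(spatial_adapters.parameters()) + list(motion_adapters.parameters())
#   optimizer = torch.optim.AdamW(params, lr=1e-4)
#   loss = denoising_loss(t2v_model, make_frozen_video(t2i_sample, num_frames))
# At test time: drop the MotionAdapter modules and keep the trained SpatialAdapters.
```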

Stylized Video Generation

Still-Moving can also be used for generating videos with consistent styles based on pretrained stylization T2I models (e.g., StyleDrop [Sohn et al. 2023]). Each row below contains a diverse set of videos which adhere to the style of the reference image on the left, while also exhibiting the natural motion of the T2V model.

 
* Hover over the video to see the prompt.

Style reference image

Generated Stylized Videos

A sunflower blooming
A fox frolicking in the forest
A bear twirling with delight
A family of ducks swimming in a pond
A chubby panda munching on bamboo shoots
A girl with a beanie dancing
A penguin dancing
A child in the snow
A chubby panda munching on bamboo shoots
A penguin dancing
A woman drinking coffee
A boy playing with a ball
A dog walking
An elephant trumpeting joyfully
Close-up of a girl with a beanie
A dolphin leaping out of the water
A man playing guitar
A family of ducks swimming in a pond
A butterfly fluttering by
A boy walking
A chubby panda munching on bamboo shoots
A bear walking
A dolphin swimming
A dragon flying
An octopus swimming
A blue bird flying
A fish swimming
A child jumping
A butterfly fluttering in the breeze
A happy bumblebee buzzing around colorful flowers
An elephant trumpeting joyfully
A girl with a beanie dancing
Ominous moonlight filtering through twisted trees
A woman with red eyes peeking
A group of friends hanging out
A ghostly apparition reflected in a murky pond
Grimacing gargoyle
A haunted mansion
Eerie doll sitting in an attic
A full moon casting eerie light on a deserted playground where a child is playing alone
Spooky mask hanging from a twisted, gnarled tree
Sinister reflection of a face in a cracked mirror
A foggy cemetery with an open grave and a mysterious figure walking beside it
A decrepit hospital room with a ghostly nurse
A bionic crocodile lurking beneath a holographic swamp
A spacecraft flying through asteroids
A cyborg bear walking through a cityscape
A cat exploring a neon-lit alleyway
A polar bear in space
A dolphin swimming
A gorilla jumping
A person skateboarding on a hoverboard
A humanoid wolf walking gracefully
A fox exploring a forest
A cybernetic wolf howling at a digital moon
A robotic eagle soaring through a digital sky
A soaring eagle
A woman dancing
Glowing jellyfish swimming gracefully
Surreal alien spaceship hovering in the neon-lit sky
A butterfly fluttering
Horse galloping majestically
Serene lotus flower blooming in neon waters
An elegant swan
Sleek sports car speeding
Mysterious moonlit castle glowing
Enchanting moonlit forest scene
Ethereal mermaid swimming in neon waters
Vibrant parrot perched on a branch
Glowing seahorse floating gracefully
Magical fairy fluttering through a garden
Glowing silhouette of a couple walking hand-in-hand

ControlNet + Stylized Video Generation

Still-Moving customized models can be combined with ControlNet [Zhang et al. 2023] to generate videos whose style adheres to that of a given customized T2I model, while their structure and dynamics are determined by a given reference video.
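As a rough illustration of the control pathway, the sketch below extracts a per-frame edge map from a source video with OpenCV's Canny detector; per-frame maps of this kind are the type of signal a ControlNet branch consumes alongside the customized T2V model. The helper name and the choice of Canny edges are assumptions made for illustration; any ControlNet conditioning signal (depth, pose, etc.) could be used instead.

```python
import cv2
import numpy as np

def video_to_canny_control(path: str, low: int = 100, high: int = 200) -> np.ndarray:
    """Extract per-frame Canny edge maps from a source video (hypothetical helper).

    Returns an array of shape (T, H, W) that could serve as the per-frame
    conditioning signal for an edge-based ControlNet branch.
    """
    cap = cv2.VideoCapture(path)
    maps = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        maps.append(cv2.Canny(gray, low, high))
    cap.release()
    return np.stack(maps)
```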

[Gallery: for each style reference image, source videos with their ControlNet control signals are shown alongside the corresponding stylized output videos.]

ControlNet + Personalized Video Generation

The videos below were generated by combining the fine-grained control and structure preservation of ControlNet with the personalization abilities of Still-Moving.

[Gallery: for each set of reference images, source videos with their ControlNet control signals are shown alongside the corresponding personalized output videos.]

Authors

1 Google DeepMind
2 Tel-Aviv University
3 Weizmann Institute of Science
4 Technion

(†) First author, (*) Core technical contribution
Work was done while the first author was an intern at Google.

Acknowledgements

We would like to thank Jess Gallegos, Sarah Rumbley, Irina Blok, Daniel Hall, Parth Parekh, Quinn Perfetto, Andeep Toor, Hartwig Adam, Kai Jiang, David Hendon, JD Velasquez, William T. Freeman and David Salesin for their collaboration, insightful discussions, feedback and support.
We thank the owners of the images and videos used in our experiments (linked for attribution) for sharing their valuable assets.

Societal Impact

Our primary goal in this work is to enable novice users to generate visual content in a creative and flexible way. However, there is a risk that our technology could be misused to create fake or harmful content, and we believe it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure safe and fair use.