Still-Moving

A Research Exploration into Customized Video Generation without Customized Video Data

Supplementary Materials


AnimateDiff Results

We demonstrate the robustness of our method by showing the results of applying Still-Moving to the AnimateDiff T2V model (see Sec. 4.1 and the Appendix). We compare our method with the naive injection approach suggested by AnimateDiff [Guo et al. 2023], using the same seeds and prompts.

As can be observed, the naive injection approach often falls short of adhering to the customized data, or leads to significant artifacts. For example, the "melting golden" style (top rows) displays a distorted background and lacks the melting drops that are characteristic of the style. The features of the chipmunk (bottom rows) are not captured accurately (e.g., the cheeks and the forehead color), and the identity of the chipmunk changes across frames. In contrast, when applying our method, the "melting golden" background matches the reference image and the model produces dripping motion. Similarly, the chipmunk maintains a consistent identity that matches the reference images.
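To make the comparison concrete, the snippet below sketches the naive injection baseline: the spatial weights of the T2V model are simply overwritten with the corresponding customized T2I weights, while the temporal layers are left untouched. This is a minimal illustration assuming a PyTorch model whose spatial layers share state-dict keys with the T2I model; the function and the "temporal" key convention are illustrative assumptions, not the released AnimateDiff code.

import torch

def naive_injection(t2v_model: torch.nn.Module,
                    custom_t2i_state: dict) -> torch.nn.Module:
    """Copy customized T2I weights into the matching spatial layers of a T2V model."""
    t2v_state = t2v_model.state_dict()
    for name, weight in custom_t2i_state.items():
        # Skip temporal layers: the T2I model has no counterpart for them.
        # (Key naming is an assumption for illustration.)
        if name in t2v_state and "temporal" not in name:
            t2v_state[name] = weight.clone()
    t2v_model.load_state_dict(t2v_state)
    return t2v_model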

Reference Image
Naive Injection
Still-Moving (Ours)
Prompt:
"A bat swooping in melting golden 3D style"
"A fish gliding in melting golden 3d rendering style"
"A seagull scavenging in melting golden 3d rendering style"
"An owl blinking in melting golden 3d rendering style"
Reference Image
Naive Injection
Still-Moving (Ours)
Prompt:
"A butterfly flying in flat cartoon illustration style."
"A jellyfish floating in flat cartoon illustration style."
"A dolphin swimming in flat cartoon illustration style."
"A fish swimming in flat cartoon illustration style."
Reference Image
Naive Injection
Still-Moving (Ours)
Prompt:
"[V] chipmunk riding a bike in the park."
"[V] chipmunk having a tea party."
"[V] chipmunk playing with a toy car on a driveway."
"[V] chipmunk running snow in a winter landscape."

Comparisons

We present a qualitative comparison of our method and the baselines described in Sec. 4.1 of the main text.

Reference Images
Interpolation
Interleaving
Still-Moving (Ours)
"a bear twirling with delight"
"a butterfly fluttering from flower to flower"
"rollerblading in the park"
"flying in the sky"



VideoBooth -- Qualitative Results

Qualitative results of VideoBooth [Jiang et al. 2023], which conditions video generation on content extracted from a single masked image (shown on the left). As can be seen, this method fails to generalize across scenes and object poses.

Reference Image
"rollerblading in the park"
"flying in the sky"

Ablations

In Sec. 4.3 we ablate the three main design choices of our method: applying Motion Adapters, using Spatial Adapters, and training with a prior preservation loss.

Reference Images
w/o Motion Adapters
w/o Spatial Adapters
w/o prior preservation
Still-Moving (Ours)
"Chipmunk wearing a hat, looking adorable"
"Chipmunk riding a toy car down a grassy hill"

Motion Adapter

We present the effect of using different scales α for the Motion Adapter, as discussed in App. 3.

α=1
α=0
α=-1
"The Himalayas Everest, winter landscape."
"Aerial around young hiker man standing on a mountain"
"Woman in white dress standing on top of a mountain"

Limitations

As described in Sec. 4.4, our method is limited by the quality of the injected customized T2I model. If the customized T2I model fails to capture the identity, or overfits to certain aspects (e.g., the background), our model inherits these properties. Below are examples of inaccurate identity preservation (top) and overfitting to the background (bottom).

Reference Images
T2I DreamBooth results
T2V generated video

Societal Impact

Our primary goal in this work is to enable novice users to generate visual content in a creative and flexible way. However, there is a risk that our technology could be misused to create fake or harmful content, and we believe it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure safe and fair use.