A Research Exploration into Customized Video Generation without Customized Video Data
Supplementary Materials
We demonstrate the robustness of our method by showing the results of applying Still-Moving to the AnimateDiff T2V model (see Sec. 4.1 and Appendix). We compare our method with the naive injection suggested by AnimateDiff [Guo et al. 2023], using the same seeds and prompts.
As can be observed, the naive injection approach often falls short of adhering to the customized data, or leads to significant artifacts. For example, the “melting golden” style (top rows) displays a distorted background and lacks the melting drops that are characteristic of the style. The features of the chipmunk (bottom rows) are not captured accurately (e.g., the cheeks and the color of the forehead). Additionally, the identity of the chipmunk changes across the frames. In contrast, when applying our method, the “melting golden” background matches the reference image and the model produces dripping motion. Similarly, the chipmunk maintains a consistent identity that matches the reference images.
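For concreteness, below is a minimal sketch of what the naive injection baseline amounts to: the spatial weights of the inflated T2V model are simply overwritten with the customized T2I weights, leaving the temporal layers untouched. This is an illustrative assumption of the procedure, not the authors' code; the dictionary names are hypothetical.

```python
# Hedged sketch of "naive injection" (not the authors' implementation):
# overwrite every shared (spatial) parameter of the T2V model with the
# customized T2I weights; temporal parameters have no T2I counterpart
# and are left as-is.
import torch


def naive_injection(t2v_state: dict, custom_t2i_state: dict) -> dict:
    """Return a T2V state dict with spatial weights replaced by the
    customized T2I weights wherever names and shapes match."""
    injected = dict(t2v_state)
    for name, weight in custom_t2i_state.items():
        if name in injected and injected[name].shape == weight.shape:
            injected[name] = weight.clone()
    return injected
```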
We present a qualitative comparison between our method and the baselines, as discussed in Sec. 4.1 of the main text.
Qualitative results of VideoBooth [Jiang et al. 2023], which conditions video generation on content extracted from a single masked image (shown on the left). As seen, this method fails to generalize across scenes and object poses.
In Sec. 4.3 we ablate the three main design choices of our method: applying the Motion Adapters, using Spatial Adapters, and the use of a prior preservation loss; a sketch of the latter follows.
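To make the prior preservation term concrete, here is a hedged sketch of such an objective: the adapters are fitted to the customized data while a second term anchors the model to its original prior on generic samples. The weighting `lambda_prior` and the loss names are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a prior-preservation training objective (assumed form):
# a reconstruction loss on the customized data plus a weighted term
# that keeps predictions on prior (generic) samples close to the
# original model's targets.
import torch
import torch.nn.functional as F


def training_loss(pred_custom, target_custom, pred_prior, target_prior,
                  lambda_prior: float = 1.0) -> torch.Tensor:
    loss_custom = F.mse_loss(pred_custom, target_custom)  # fit customized data
    loss_prior = F.mse_loss(pred_prior, target_prior)     # preserve the prior
    return loss_custom + lambda_prior * loss_prior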
We present the effect of using different scales for the Motion Adapter, as discussed in App. 3.
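One way such a scale could act, sketched below under our own assumptions, is as a scalar multiplier on the adapter's residual contribution before it is added back to the base temporal block's output; the module names and the residual formulation are hypothetical, not taken from the paper.

```python
# Hedged sketch: a scalar scale modulating a Motion Adapter's residual.
# scale = 0 recovers the original temporal block; larger values give the
# adapter more influence on the output.
import torch
import torch.nn as nn


class ScaledMotionAdapter(nn.Module):
    def __init__(self, base_block: nn.Module, adapter: nn.Module,
                 scale: float = 1.0):
        super().__init__()
        self.base_block = base_block
        self.adapter = adapter
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base_block(x)
        return h + self.scale * self.adapter(h)
```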
As described in Sec. 4.4, our method is limited by the quality of the injected customized T2I model. If the customized T2I model fails to capture the identity or overfits to certain aspects (e.g., the background), our model will inherit these properties. Below are examples of inaccurate identity (top) and overfitting to the background (bottom).
Our primary goal in this work is to enable novice users to generate visual content in a creative and flexible way. However, this technology carries a risk of misuse for creating fake or harmful content, and we believe it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure safe and fair use.