VideoTuna is a useful codebase for text-to-video applications.

- VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation (to the best of our knowledge).
- VideoTuna is the first repo that provides comprehensive pipelines for video generation, including pre-training, continuous training, post-training (alignment), and fine-tuning (to the best of our knowledge).
- The models in VideoTuna cover both U-Net and DiT architectures for visual generation tasks.
- A new 3D video VAE and a controllable facial video generation model will be released soon.
- All-in-one framework: inference and fine-tune up-to-date video generation models.
- Pre-training: build your own foundational text-to-video model.
- Continuous training: keep improving your model with new data.
- Domain-specific fine-tuning: adapt models to your specific scenario.
- Concept-specific fine-tuning: teach your models unique concepts.
- Enhanced language understanding: improve model comprehension through continuous training.
- Post-processing: enhance videos with a video-to-video enhancement model.
- Post-training/human preference alignment: post-training with RLHF for more attractive results.
- [2024-11-01] We made VideoTuna V0.1.0 public!
The 3D video VAE from VideoTuna can accurately compress and reconstruct the input videos with fine details.
Side-by-side comparisons: ground truth (left) vs. 3D video VAE reconstruction (right).
Controllable facial video generation: three input faces animated with the emotions Anger, Disgust, Fear, Happy, Sad, and Surprise.
- More demos and applications
- More functionalities such as control modules. (Suggestions are welcome!)
VideoTuna/
├── assets          # images for the README
├── checkpoints     # model checkpoints
├── configs         # model and experiment configs
├── data            # data processing scripts and dataset files
├── docs            # documentation
├── eval            # evaluation scripts
├── inputs          # input examples for testing
├── scripts         # training and inference Python scripts
├── shscripts       # training and inference shell scripts
├── src             # model-related source code
├── tests           # testing scripts
└── tools           # tool scripts
| T2V-Models | HxWxL | Checkpoints |
|---|---|---|
| CogVideoX-2B | 720x480, 6s | Hugging Face |
| CogVideoX-5B | 720x480, 6s | Hugging Face |
| Open-Sora 1.0 | 512x512x16 | Hugging Face |
| Open-Sora 1.0 | 256x256x16 | Hugging Face |
| Open-Sora 1.0 | 256x256x16 | Hugging Face |
| VideoCrafter2 | 320x512x16 | Hugging Face |
| VideoCrafter1 | 576x1024x16 | Hugging Face |
| VideoCrafter1 | 320x512x16 | Hugging Face |
| I2V-Models | HxWxL | Checkpoints |
|---|---|---|
| CogVideoX-5B-I2V | 720x480, 6s | Hugging Face |
| DynamiCrafter | 576x1024x16 | Hugging Face |
| VideoCrafter1 | 320x512x16 | Hugging Face |
- Note: H: height; W: width; L: length
Please check docs/CHECKPOINTS.md to download all the model checkpoints.
## Get started
conda create --name videotuna python=3.10 -y
conda activate videotuna
pip install -U poetry pip
poetry config virtualenvs.create false
poetry install
pip install optimum-quanto==0.2.1
pip install -r requirements.txt
git clone https://github.com/JingyeChen/SwissArmyTransformer
pip install -e SwissArmyTransformer/
rm -rf SwissArmyTransformer
git clone https://github.com/tgxs002/HPSv2.git
cd ./HPSv2
pip install -e .
cd ..
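After installation, an optional sanity check (this snippet is a suggestion, not part of the official setup) confirms that PyTorch resolves inside the new environment and can see the GPU:

```shell
# Optional sanity check (assumes a CUDA-capable GPU and that torch was installed by poetry install):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```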
Please follow docs/CHECKPOINTS.md to download model checkpoints.
After downloading, place the model checkpoints as described in the Checkpoint Structure.
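For example, a single checkpoint can be fetched with the Hugging Face CLI, assuming `huggingface_hub` is available in the environment; the repo ID and target directory below are illustrative, so use the exact entries from docs/CHECKPOINTS.md:

```shell
# Example only: the authoritative repo IDs and target sub-directories are listed in docs/CHECKPOINTS.md.
huggingface-cli download THUDM/CogVideoX-2b --local-dir checkpoints/cogvideo/CogVideoX-2b
```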
- Run inference with a set of text-to-video models in one command: `bash tools/video_comparison/compare.sh`
  - The default mode runs all models, e.g., `inference_methods="videocrafter2;dynamicrafter;cogvideo-t2v;cogvideo-i2v;opensora"`.
  - To run only specific models, modify the `inference_methods` variable in `compare.sh`, listing the desired models separated by semicolons.
  - Also specify the input directory via the `input_dir` variable. This directory should contain a `prompts.txt` file, where each line is a prompt for video generation. The default `input_dir` is `inputs/t2v` (see the example sketch after the table below).
- Run inference with a set of image-to-video models in one command: `bash tools/video_comparison/compare_i2v.sh`
- To run inference with a specific model, use the corresponding command from the table below:
| Task | Model | Command | Length (#frames) | Resolution | Inference Time (s) | GPU Memory (GiB) |
|---|---|---|---|---|---|---|
| I2V | CogVideoX-5b-I2V | `bash shscripts/inference_cogVideo_i2v_diffusers.sh` | 49 | 576x1024 | 310.4 | 4.78 |
| T2V | CogVideoX-2b | `bash shscripts/inference_cogVideo_t2v_diffusers.sh` | 49 | 576x1024 | 107.6 | 2.32 |
| T2V | Open Sora V1.0 | `bash shscripts/inference_opensora_v10_16x256x256.sh` | 16 | 256x256 | 11.2 | 23.99 |
| T2V | VideoCrafter-V2-320x512 | `bash shscripts/inference_vc2_t2v_320x512.sh` | 16 | 320x512 | 26.4 | 10.03 |
| T2V | VideoCrafter-V1-576x1024 | `bash shscripts/inference_vc1_t2v_576x1024.sh` | 16 | 576x1024 | 91.4 | 14.57 |
| I2V | DynamiCrafter | `bash shscripts/inference_dc_i2v_576x1024.sh` | 16 | 576x1024 | 101.7 | 52.23 |
| I2V | VideoCrafter-V1 | `bash shscripts/inference_vc1_i2v_320x512.sh` | 16 | 320x512 | 26.4 | 10.03 |
| T2I | Flux-dev | `bash shscripts/inference_flux.sh` | 1 | 768x1360 | 238.1 | 1.18 |
| T2I | Flux-schnell | `bash shscripts/inference_flux.sh` | 1 | 768x1360 | 5.4 | 1.20 |
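As a concrete sketch of the comparison workflow above (the prompts and model selection here are illustrative, not shipped defaults), you could prepare a `prompts.txt` and restrict `compare.sh` to two models:

```shell
# Write one prompt per line into the default input directory used by compare.sh.
mkdir -p inputs/t2v
printf '%s\n' \
  "A corgi surfing a wave at sunset" \
  "Timelapse of clouds drifting over a mountain lake" \
  > inputs/t2v/prompts.txt

# Then, inside tools/video_comparison/compare.sh, limit the run to the desired models, e.g.:
#   inference_methods="videocrafter2;cogvideo-t2v"
#   input_dir="inputs/t2v"
bash tools/video_comparison/compare.sh
```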
Please follow docs/datasets.md to try the provided toy dataset or build your own datasets.
We support Open-Sora fine-tuning; you can simply run the following command:
# finetune the Open-Sora v1.0
bash shscripts/train_opensorav10.sh
We support LoRA fine-tuning, which lets the model learn new concepts/characters/styles.
- Example config file: `configs/001_videocrafter2/vc2_t2v_lora.yaml`
- Train a LoRA based on VideoCrafter2: `bash shscripts/train_videocrafter_lora.sh`
- Run inference with the trained model: `bash shscripts/inference_vc2_t2v_320x512_lora.sh`
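A minimal sketch of a concept fine-tuning run, assuming your concept videos and captions are prepared as in docs/datasets.md; the exact keys for the data path and LoRA hyperparameters live in the config file above and may differ from the comments here:

```shell
# 1) Edit configs/001_videocrafter2/vc2_t2v_lora.yaml so its dataset entry points to your
#    concept videos/captions and the LoRA settings (e.g. rank) fit your budget (assumed keys).
# 2) Train the LoRA adapter on VideoCrafter2.
bash shscripts/train_videocrafter_lora.sh
# 3) Generate videos with the fine-tuned weights.
bash shscripts/inference_vc2_t2v_320x512_lora.sh
```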
We support VBench evaluation to evaluate the T2V generation performance. Please check eval/README.md for details.
We thank the following repos for sharing their awesome models and code!
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- Open-Sora: Democratizing Efficient Video Production for All
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
- VADER: Video Diffusion Alignment via Reward Gradients
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- Flux: Text-to-image models from Black Forest Labs.
- SimpleTuner: A fine-tuning kit for text-to-image generation.
- LLMs-Meet-MM-Generation: A paper collection of utilizing LLMs for multimodal generation (image, video, 3D and audio).
- MMTrail: A multimodal trailer video dataset with language and music descriptions.
- Seeing-and-Hearing: A versatile framework for Joint VA generation, V2A, A2V, and I2A.
- Self-Cascade: A Self-Cascade model for higher-resolution image and video generation.
- ScaleCrafter and HiPrompt: Free method for higher-resolution image and video generation.
- FreeTraj and FreeNoise: Free method for video trajectory control and longer-video generation.
- Follow-Your-Emoji, Follow-Your-Click, and Follow-Your-Pose: Follow family for controllable video generation.
- Animate-A-Story: A framework for storytelling video generation.
- LVDM: Latent Video Diffusion Model for long video generation and text-to-video generation.
Please follow the CC-BY-NC-ND license. For license authorization, please contact yhebm@connect.ust.hk and yxingag@connect.ust.hk.
To be updated...