VideoVerses/VideoTuna

VideoTuna

🤗🤗🤗 VideoTuna is a useful codebase for text-to-video applications.
🌟 VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation (to the best of our knowledge).
🌟 VideoTuna is the first repo that provides comprehensive pipelines in video generation, including pre-training, continuous training, post-training (alignment), and fine-tuning (to the best of our knowledge).
🌟 The models of VideoTuna include both U-Net and DiT architectures for visual generation tasks.
🌟 A new 3D video VAE and a controllable facial video generation model will be released soon.

Features

🌟 All-in-one framework: Run inference with and fine-tune up-to-date video generation models.
🌟 Pre-training: Build your own foundational text-to-video model.
🌟 Continuous training: Keep improving your model with new data.
🌟 Domain-specific fine-tuning: Adapt models to your specific scenario.
🌟 Concept-specific fine-tuning: Teach your models unique concepts.
🌟 Enhanced language understanding: Improve model comprehension through continuous training.
🌟 Post-processing: Enhance the videos with a video-to-video enhancement model.
🌟 Post-training/Human preference alignment: Post-training with RLHF for more attractive results.

🔆 Updates

  • [2024-11-01] We make the VideoTuna V0.1.0 public!

Demo

Model Inference and Comparison

[Demo videos: combined model outputs for the following prompts]

  • A mountain biker racing down a trail, dust flying behind
  • Fireworks exploding over a historic river, reflections twinkling in the water
  • Waves crashing against a rocky shore under a stormy sky, spray misting the air
  • A butterfly landing delicately on a wildflower in a vibrant meadow
  • Sunlight piercing through a dense canopy in a tropical rainforest, illuminating a [truncated]
  • Divers observing a group of tuna as they navigate through a vibrant coral reef teem [truncated]

3D Video VAE

The 3D video VAE from VideoTuna can accurately compress and reconstruct the input videos with fine details.

[Videos: seven side-by-side pairs, each showing Ground Truth and Reconstruction]

Face domain

[Images and videos: three input face images (Input 1, Input 2, Input 3), each shown with six generated emotions: Anger, Disgust, Fear, Happy, Sad, Surprise]

Storytelling

The picture shows a cozy room with a little girl telling her travel story to her teddybear beside the bed. As night falls, teddybear sits by the window, his eyes sparkling with longing for the distant place. Teddybear was in a corner of the room, making a small backpack out of old cloth strips, with a map, a compass and dry food next to it. The first rays of sunlight in the morning came through the window, and teddybear quietly opened the door and embarked on his adventure. In the forest, the sun shines through the treetops, and teddybear moves among various animals and communicates with them.
Teddybear leaves his mark on the edge of a clear lake, surrounded by exotic flowers, and the picture is full of mystery and exploration. Teddybear climbs the rugged mountain road, the weather is changeable, but he is determined. The picture switches to the top of the mountain, where teddybear stands in the glow of the sunrise, with a magnificent mountain view in the background. On the way home, teddybear helps a wounded bird, the picture is warm and touching. Teddybear sits by the little girl's bed and tells her his adventure story, and the little girl is fascinated.
The scene shows a peaceful village, with moonlight shining on the roofs and streets, creating a peaceful atmosphere. cat sits by the window, her eyes twinkling in the night, reflecting her special connection with the moon and stars. Villagers gather in the center of the village for the annual Moon Festival celebration, with lanterns and colored lights adorning the night sky. cat feels the call of the moon, and her beard trembles with the excitement in her heart. cat quietly leaves her home in the night and embarks on a path illuminated by the silver moonlight.
A group of forest elves dance around glowing mushrooms, their costumes and movements full of magic and vitality. cat joins the celebration and dances with the elves, the picture is full of joy and freedom. A wise old owl reveals the secret power of the moon to cat and the light of the moon in the picture becomes brighter. cat closes her eyes in the moonlight, puts her hands together, and makes a wish, surrounded by the light of stars and the moon. cat feels the surge of power, and her eyes become more determined.

⏰ TODOs

  • More demos and applications
  • More functionalities such as control modules. (Suggestions are welcome!)

🔆 Information

Code Structure

VideoTuna/
    ├── assets       # images for the README
    ├── checkpoints  # put model checkpoints here
    ├── configs      # model and experiment configs
    ├── data         # data processing scripts and dataset files
    ├── docs         # documentation
    ├── eval         # evaluation scripts
    ├── inputs       # input examples for testing
    ├── scripts      # training and inference Python scripts
    ├── shscripts    # training and inference shell scripts
    ├── src          # model-related source code
    ├── tests        # testing scripts
    ├── tools        # utility scripts
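As a quick sanity check after cloning, the layout above can be compared against a local checkout. This is a minimal sketch, not part of VideoTuna: the function name `missing_dirs` is hypothetical, the shell-script directory is spelled `shscripts` to match the commands used later in this README, and `checkpoints` may legitimately be absent until you download the models.

```python
from pathlib import Path

# Top-level directories listed in the Code Structure tree.
EXPECTED_DIRS = [
    "assets", "checkpoints", "configs", "data", "docs", "eval",
    "inputs", "scripts", "shscripts", "src", "tests", "tools",
]


def missing_dirs(repo_root: str) -> list[str]:
    """Return the expected top-level directories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "." assumes the script is run from the repository root.
    missing = missing_dirs(".")
    print("missing:", missing if missing else "none")
```

Run it from the repository root; any directory it reports is either not yet created or named differently in your checkout.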

Supported Models

| T2V-Models | HxWxL | Checkpoints |
| --- | --- | --- |
| CogVideoX-2B | 720x480, 6s | Hugging Face |
| CogVideoX-5B | 720x480, 6s | Hugging Face |
| Open-Sora 1.0 | 512×512x16 | Hugging Face |
| Open-Sora 1.0 | 256×256x16 | Hugging Face |
| Open-Sora 1.0 | 256×256x16 | Hugging Face |
| VideoCrafter2 | 320x512x16 | Hugging Face |
| VideoCrafter1 | 576x1024x16 | Hugging Face |
| VideoCrafter1 | 320x512x16 | Hugging Face |

| I2V-Models | HxWxL | Checkpoints |
| --- | --- | --- |
| CogVideoX-5B-I2V | 720x480, 6s | Hugging Face |
| DynamiCrafter | 576x1024x16 | Hugging Face |
| VideoCrafter1 | 320x512x16 | Hugging Face |
  • Note: H: height; W: width; L: length
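
To make the HxWxL notation concrete, here is a small sketch that splits a spec such as `320x512x16` into height, width, and length (number of frames). The names `VideoShape` and `parse_hxwxl` are hypothetical and not part of VideoTuna. Entries written as `720x480, 6s` do not follow the three-number pattern and are rejected rather than guessed at.

```python
import re
from typing import NamedTuple


class VideoShape(NamedTuple):
    height: int
    width: int
    length: int  # number of frames


def parse_hxwxl(spec: str) -> VideoShape:
    """Parse an 'HxWxL' string such as '320x512x16' into named fields.

    Both the letter 'x' and the multiplication sign '×' are accepted as separators.
    """
    match = re.fullmatch(r"(\d+)[x×](\d+)[x×](\d+)", spec.strip())
    if match is None:
        raise ValueError(f"not an HxWxL spec: {spec!r}")
    height, width, length = (int(group) for group in match.groups())
    return VideoShape(height, width, length)


if __name__ == "__main__":
    print(parse_hxwxl("576x1024x16"))
```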

Please check docs/CHECKPOINTS.md to download all the model checkpoints.

🔆 Get started

1. Prepare environment

conda create --name videotuna python=3.10 -y
conda activate videotuna
pip install -U poetry pip
poetry config virtualenvs.create false
poetry install
pip install optimum-quanto==0.2.1
pip install -r requirements.txt
git clone https://github.com/JingyeChen/SwissArmyTransformer
pip install -e SwissArmyTransformer/
rm -rf SwissArmyTransformer
git clone https://github.com/tgxs002/HPSv2.git
cd ./HPSv2
pip install -e .
cd ..
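
After running the commands above, a short script can confirm that the active interpreter and the pinned package match what the setup expects. This is a sketch under stated assumptions: it checks only the Python version from the `conda create` line and the `optimum-quanto==0.2.1` pin, and the name `check_environment` is hypothetical. It does not verify the packages installed from `requirements.txt` or from the cloned repositories.

```python
import sys
from importlib import metadata

# Values taken from the setup commands in this section.
REQUIRED_PYTHON = (3, 10)
PINNED = {"optimum-quanto": "0.2.1"}


def check_environment() -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] != REQUIRED_PYTHON:
        found = ".".join(map(str, sys.version_info[:2]))
        problems.append(f"Python 3.10 expected, found {found}")
    for dist, wanted in PINNED.items():
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            problems.append(f"{dist} is not installed")
            continue
        if installed != wanted:
            problems.append(f"{dist}=={wanted} expected, found {installed}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks consistent"]:
        print(line)
```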

2. Prepare checkpoints

Please follow docs/CHECKPOINTS.md to download model checkpoints.
After downloading, place the model checkpoints according to the Checkpoint Structure.

3. Run inference with state-of-the-art T2V/I2V/T2I models

  • Run a set of text-to-video models with one command: bash tools/video_comparison/compare.sh
    • By default all models are run, e.g., inference_methods="videocrafter2;dynamicrafter;cogvideo—t2v;cogvideo—i2v;opensora"
    • To run specific models only, modify the inference_methods variable in compare.sh and list the desired models separated by semicolons.
    • Also specify the input directory via the input_dir variable. This directory should contain a prompts.txt file in which each line is one prompt for video generation. The default input_dir is inputs/t2v
  • Run a set of image-to-video models with one command: bash tools/video_comparison/compare_i2v.sh
  • To run a specific model, use the corresponding command below:
| Task | Model | Command | Length (#frames) | Resolution | Inference Time (s) | GPU Memory (GiB) |
| --- | --- | --- | --- | --- | --- | --- |
| I2V | CogVideoX-5b-I2V | bash shscripts/inference_cogVideo_i2v_diffusers.sh | 49 | 576x1024 | 310.4 | 4.78 |
| T2V | CogVideoX-2b | bash shscripts/inference_cogVideo_t2v_diffusers.sh | 49 | 576x1024 | 107.6 | 2.32 |
| T2V | Open Sora V1.0 | bash shscripts/inference_opensora_v10_16x256x256.sh | 16 | 256x256 | 11.2 | 23.99 |
| T2V | VideoCrafter-V2-320x512 | bash shscripts/inference_vc2_t2v_320x512.sh | 16 | 320x512 | 26.4 | 10.03 |
| T2V | VideoCrafter-V1-576x1024 | bash shscripts/inference_vc1_t2v_576x1024.sh | 16 | 576x1024 | 91.4 | 14.57 |
| I2V | DynamiCrafter | bash shscripts/inference_dc_i2v_576x1024.sh | 16 | 576x1024 | 101.7 | 52.23 |
| I2V | VideoCrafter-V1 | bash shscripts/inference_vc1_i2v_320x512.sh | 16 | 320x512 | 26.4 | 10.03 |
| T2I | Flux-dev | bash shscripts/inference_flux.sh | 1 | 768x1360 | 238.1 | 1.18 |
| T2I | Flux-schnell | bash shscripts/inference_flux.sh | 1 | 768x1360 | 5.4 | 1.20 |
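
The comparison scripts described above read prompts from a prompts.txt file in the input directory, one prompt per line. The following sketch shows one way to prepare such a file for a custom input directory. The helper names `write_prompts` and `read_prompts` and the directory `inputs/my_t2v` are hypothetical. Dropping blank lines is this sketch's own choice; how the repository's scripts treat blank lines is not stated here.

```python
from pathlib import Path


def write_prompts(input_dir: str, prompts: list[str]) -> Path:
    """Write one prompt per line to <input_dir>/prompts.txt and return the file path."""
    cleaned = [p.strip() for p in prompts if p.strip()]
    if any(len(p.splitlines()) > 1 for p in cleaned):
        raise ValueError("each prompt must fit on a single line")
    directory = Path(input_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "prompts.txt"
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return path


def read_prompts(input_dir: str) -> list[str]:
    """Read the prompts back, skipping blank lines."""
    text = (Path(input_dir) / "prompts.txt").read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # "inputs/my_t2v" is a made-up directory; set input_dir in compare.sh to match.
    write_prompts("inputs/my_t2v", [
        "A mountain biker racing down a trail, dust flying behind",
        "A butterfly landing delicately on a wildflower in a vibrant meadow",
    ])
    print(read_prompts("inputs/my_t2v"))
```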

4. Finetune T2V models

(1). Prepare Dataset

Please follow docs/datasets.md to try the provided toy dataset or build your own dataset.

(2). Finetune

Open-Sora finetuning

We support Open-Sora finetuning. Simply run the following command:

# finetune the Open-Sora v1.0
bash shscripts/train_opensorav10.sh

LoRA finetuning

We support LoRA finetuning so that the model can learn new concepts, characters, or styles.

  • Example config file: configs/001_videocrafter2/vc2_t2v_lora.yaml
  • Train a LoRA based on VideoCrafter2: bash shscripts/train_videocrafter_lora.sh
  • Run inference with the trained model: bash shscripts/inference_vc2_t2v_320x512_lora.sh
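
For readers new to LoRA, the idea behind the scripts above can be summarized with the standard low-rank update. This is the general LoRA formulation, not something taken from VideoTuna's code. Which layers are adapted and what rank and scaling are used depend on the config file, so treat the symbols below as illustrative.

```latex
% Standard LoRA parameterization; W_0 is a frozen pretrained weight matrix,
% and only the low-rank factors A and B are trained.
W' = W_0 + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

% Trainable parameters per adapted matrix: r(d + k) instead of dk.
% Example: d = k = 1024, r = 16 gives 16 \cdot 2048 = 32768 trainable values
% versus 1024 \cdot 1024 = 1048576, i.e. about 3.1\% of the full matrix.
```

Because only A and B are updated, a trained LoRA is small compared to the base checkpoint and is applied on top of it at inference time.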

Finetuning for enhanced language understanding

5. Evaluation

We support VBench for evaluating T2V generation performance. Please check eval/README.md for details.

Acknowledgement

We thank the following repos for sharing their awesome models and code!

  • VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
  • VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
  • Open-Sora: Democratizing Efficient Video Production for All
  • CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
  • VADER: Video Diffusion Alignment via Reward Gradients
  • VBench: Comprehensive Benchmark Suite for Video Generative Models
  • Flux: Text-to-image models from Black Forest Labs.
  • SimpleTuner: A fine-tuning kit for text-to-image generation.

Some Resources

🍻 Contributors

📋 License

Please follow the CC-BY-NC-ND license. For license authorization, please contact yhebm@connect.ust.hk and yxingag@connect.ust.hk.

😊 Citation

To be updated...
