StoryTeller

A step-by-step walkthrough of the entire long video description pipeline. We also provide the intermediate results generated by each step, so you can start from any step.

Annotated Data

  1. Caption: data/raw_data/caption.json
  2. Audio Diarization: data/raw_data/diarization.json
  3. Partition of Dataset: data/raw_data/split.json
  4. Main Actor in each Clip: data/raw_data/ref_actor
  5. MovieQA: data/raw_data/movie_qa.jsonl

Video Files

As with Movie101, the video files are not public due to copyright restrictions.

Please contact hyc@bytedance.com for access.

1. Generate the frames and audio files by processing the video.

python script/preprocess/preprocess.py
  • Input video path: data/video
  • Output frame path: data/frame
  • Output audio path: data/audio
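The preprocessing step can be sketched as follows. This is a hypothetical reading of what `script/preprocess/preprocess.py` might do, assuming it wraps ffmpeg; the actual flags and sampling rates used by the repo may differ:

```python
from pathlib import Path

def build_ffmpeg_commands(video: Path, frame_dir: Path, audio_dir: Path, fps: int = 1):
    """Build (but do not run) ffmpeg commands that extract frames and audio.
    Hypothetical sketch: the repo's preprocess.py may use different settings."""
    stem = video.stem
    frames_cmd = [
        "ffmpeg", "-i", str(video),
        "-vf", f"fps={fps}",                 # sample `fps` frames per second
        str(frame_dir / stem / "%06d.jpg"),  # numbered JPEG frames
    ]
    audio_cmd = [
        "ffmpeg", "-i", str(video),
        "-vn", "-ac", "1", "-ar", "16000",   # mono 16 kHz, typical for speech models
        str(audio_dir / f"{stem}.wav"),
    ]
    return frames_cmd, audio_cmd

frames_cmd, audio_cmd = build_ffmpeg_commands(
    Path("data/video/movie.mp4"), Path("data/frame"), Path("data/audio"))
```

The commands can then be executed with `subprocess.run`, one pair per video file under `data/video`.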

Complete Data

data
├── audio
├── audio_visual_diarization
├── frame
├── global_diarization
├── long_video_description
├── raw_data
├── scene_detect
└── video

2. Split the 3-min video clip into small segments through automatic scene change detection and some post-processing rules.

First, split the video into small segments through automatic scene change detection; the results are stored in data/raw_data/scene_detect.

pip install scenedetect
scenedetect -i movie.mp4 -o data/raw_data/scene_detect -q detect-adaptive -t 2.0 list-scenes

Then post-processing steps merge short segments and, as far as possible, ensure that a single dialogue is not cut into two segments. The final split file is saved in data/scene_detect/scene_split_new.json.

python script/scene_split/scene_split.py
python script/scene_split/update_scene_split.py
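The post-processing idea can be sketched like this. This is a simplified, hypothetical stand-in for the logic in `scene_split.py`, assuming scenes and dialogues are given as `(start, end)` intervals in seconds:

```python
def merge_short_scenes(scenes, min_dur=2.0, dialogues=()):
    """Merge scene-detect segments shorter than `min_dur` seconds into their
    previous neighbour, then extend boundaries so no dialogue interval is cut.
    Hypothetical sketch of the post-processing in scene_split.py."""
    merged = []
    for start, end in scenes:
        if merged and (end - start) < min_dur:
            prev_start, _ = merged.pop()   # fold the short scene into the previous one
            start = prev_start
        merged.append((start, end))
    fixed = []
    for start, end in merged:
        for d_start, d_end in dialogues:
            if d_start < end < d_end:      # a cut falls inside a dialogue
                end = d_end                # push the cut to the dialogue's end
        if fixed and start < fixed[-1][1]:
            start = fixed[-1][1]           # keep segments contiguous
        if start < end:
            fixed.append((start, end))
    return fixed
```

For example, a 1-second scene is absorbed into its predecessor, and a boundary that lands mid-dialogue is moved to the dialogue's end.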

3. Generate reference photos for each character in the 3-min clip.

For the MovieQA dataset, we provide a cast list for each clip in advance. For other movies, a cast list can easily be obtained from IMDb, and reference photos extracted with a general face recognition algorithm.
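Matching detected faces against cast reference photos can be sketched as nearest-neighbour search over face embeddings. This is a hypothetical illustration, assuming any general face-recognition model supplies the embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def assign_faces(face_embs, cast_embs, threshold=0.5):
    """Map each detected face embedding to the closest cast member's reference
    embedding; faces scoring below `threshold` stay unassigned (None).
    Hypothetical sketch, not the repo's exact procedure."""
    assignments = []
    for emb in face_embs:
        name, score = max(
            ((n, cosine(emb, ref)) for n, ref in cast_embs.items()),
            key=lambda t: t[1])
        assignments.append(name if score >= threshold else None)
    return assignments
```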

4. Do the global audio diarization in each 3-min clip.

Model: ERes2NetV2
Model Path: checkpoints/eres2netv2/embedding_model.ckpt

Please note that our audio embedding model is fine-tuned on 167 Chinese movies. For movies in other languages, consider using the original model for better results.

We simply use the inference code from the 3D-Speaker repo to get the audio embedding of each input audio clip. The predicted embeddings are saved in data/global_diarization/embeddings.jsonl.

Evaluate the performance of the embeddings:

python script/global_diarization/eval_embedding.py --input_file data/global_diarization/embeddings.jsonl
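One plausible way to score speaker embeddings is pairwise precision/recall/F1: predict "same speaker" whenever two utterance embeddings are similar enough. This is a guessed reading of what `eval_embedding.py` measures, not the repo's exact metric:

```python
import math

def pairwise_prf(embeddings, labels, sim_threshold=0.7):
    """Pairwise precision/recall/F1 over all utterance pairs: a pair is
    predicted 'same speaker' when cosine similarity >= sim_threshold.
    Hypothetical sketch of an embedding evaluation."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    tp = fp = fn = 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            pred = cos(embeddings[i], embeddings[j]) >= sim_threshold
            gold = labels[i] == labels[j]
            tp += pred and gold
            fp += pred and not gold
            fn += (not pred) and gold
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```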

Then, a clustering algorithm assigns the same global ID to embeddings that are close in distance; the result is saved in data/global_diarization/diarization_id.jsonl. The same global ID means the model predicts that those dialogues are spoken by the same person.

python script/global_diarization/update_diarization.py --input_file data/global_diarization/embeddings.jsonl --output_file data/global_diarization/diarization_id.jsonl

| Metric | Precision | Recall | F1   |
|--------|-----------|--------|------|
| Score  | 0.90      | 0.41   | 0.56 |
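The clustering can be sketched as greedy single-linkage grouping by cosine distance. This is a simplified, hypothetical stand-in for the algorithm in `update_diarization.py`; the threshold value is an assumption:

```python
import math

def cluster_global_ids(embeddings, dist_threshold=0.6):
    """Assign a global ID to each utterance embedding: it joins the first
    cluster whose nearest member is within `dist_threshold` cosine distance,
    otherwise it starts a new cluster. Hypothetical sketch."""
    def cos_dist(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)

    clusters = []  # list of member-embedding lists, index = global ID
    ids = []
    for emb in embeddings:
        for gid, members in enumerate(clusters):
            if min(cos_dist(emb, m) for m in members) <= dist_threshold:
                members.append(emb)
                ids.append(gid)
                break
        else:
            clusters.append([emb])
            ids.append(len(clusters) - 1)
    return ids
```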

5. Audio Visual Character Identification

Model: Tarsier-7B with OpenAI Whisper-large-v2 audio encoder
Model Path: checkpoints/Whisper-large-v2-Tarsier-7B-character-identification

In step 4, we assigned a unique ID to each audio segment. Now, we use an MLLM to map each ID to a character in the actor list. If the character is not in the actor list, a description is generated for it.
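Because the MLLM predicts a name per segment, segments sharing a global ID can disagree. A hypothetical sketch of the bookkeeping that `alignment.py` might perform, grouping predictions and flagging conflicts for the correction pass:

```python
from collections import Counter, defaultdict

def collect_id_names(predictions):
    """Group per-segment name predictions by global ID and flag IDs whose
    segments disagree (these go to the correction pass).
    `predictions` is an iterable of (global_id, predicted_name) pairs.
    Hypothetical sketch of the logic in alignment.py."""
    votes = defaultdict(Counter)
    for gid, name in predictions:
        votes[gid][name] += 1
    resolved, conflicts = {}, {}
    for gid, counter in votes.items():
        if len(counter) == 1:
            resolved[gid] = next(iter(counter))   # all segments agree
        else:
            conflicts[gid] = sorted(counter)      # candidate names to re-score
    return resolved, conflicts
```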

Generate the input inference data, saved in data/audio_visual_diarization/data.jsonl:

python script/audio_visual_diarization/gen_infer.py

Infer the character name for each global ID:

python tasks/inference_quick_start.py \
    --model_name_or_path checkpoints/Whisper-large-v2-Tarsier-7B-character-identification \
    --input_path data/audio_visual_diarization/data.jsonl \
    --output_path data/audio_visual_diarization/0.jsonl

Find conflicting inferences and generate the inference data to be corrected, saved in data/audio_visual_diarization/align_data/data.jsonl:

python script/audio_visual_diarization/alignment.py

Infer the probability for each conflicting ID-name pair:

python tasks/inference_quick_start.py \
    --model_name_or_path checkpoints/Whisper-large-v2-Tarsier-7B-character-identification \
    --input_path data/audio_visual_diarization/align_data/data.jsonl \
    --output_path data/audio_visual_diarization/correct/0.jsonl

Generate the final corrected global audio diarization results, saved in data/audio_visual_diarization/correct/test_diarization.json.
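The correction step can be sketched as picking, for each conflicting global ID, the candidate name the model scored highest. A hypothetical illustration of this final selection:

```python
def resolve_conflicts(scores):
    """Pick, for each conflicting global ID, the candidate name with the
    highest model probability. `scores` maps (global_id, name) -> probability.
    Hypothetical sketch of the final correction step."""
    best = {}
    for (gid, name), prob in scores.items():
        if gid not in best or prob > best[gid][1]:
            best[gid] = (name, prob)
    return {gid: name for gid, (name, _) in best.items()}
```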

Character identification accuracy before and after global audio diarization correction:

|             | Before | After |
|-------------|--------|-------|
| Correctable | 70.0   | 81.1  |
| Total       | 75.9   | 79.1  |

6. Long Description Generation

Model: Tarsier-7B
Model Path: checkpoints/Tarsier-7B-description-generation

Generate the final video description using the character identification results from step 5.

Generate the input inference data, saved in data/long_video_description/data.jsonl:

python script/long_video_description/gen_infer.py

Infer the description for each short video clip:

python tasks/inference_quick_start.py \
    --model_name_or_path checkpoints/Tarsier-7B-description-generation \
    --input_path data/long_video_description/data.jsonl \
    --output_path data/long_video_description/0.jsonl

Generate the dense descriptions:

python dense_description.py

Evaluate the descriptions by answering MovieQA questions:

python script/long_video_description/eval_qa_accuracy.py \
    --pred_caption_path result/tarsier/dense_caption_name.json \
    --out_path result/tarsier/qa_name.jsonl
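Scoring QA results then reduces to per-category and overall accuracy. A hypothetical sketch of the bookkeeping in `eval_qa_accuracy.py`, assuming each graded record carries a question category and a correctness flag:

```python
def qa_accuracy(records):
    """Per-category and overall accuracy from graded QA records.
    Each record is a dict with 'category' and a boolean 'correct'.
    Hypothetical sketch, not the repo's exact grading code."""
    totals, hits = {}, {}
    for rec in records:
        cat = rec["category"]
        totals[cat] = totals.get(cat, 0) + 1
        hits[cat] = hits.get(cat, 0) + bool(rec["correct"])
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_category, overall
```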

Evaluating the quality of the video description using QAs:

| Model           | Character | Action | Plot  | Total |
|-----------------|-----------|--------|-------|-------|
| Gemini-1.5-pro  | 0.578     | 0.501  | 0.534 | 0.544 |
| GPT-4o          | 0.517     | 0.479  | 0.528 | 0.507 |
| VILA1.5-8B      | 0.561     | 0.459  | 0.540 | 0.524 |
| LLaVA-OneVision | 0.557     | 0.454  | 0.540 | 0.520 |
| Qwen2-VL-7B     | 0.549     | 0.468  | 0.549 | 0.523 |
| InternVL2-8B    | 0.535     | 0.448  | 0.506 | 0.501 |
| Tarsier-7B      | 0.676     | 0.583  | 0.644 | 0.639 |

The output of each model can be found in the result directory.

Checkpoints

Coming soon.

Citation

Please cite us as:

@misc{he2024storyteller,
      title={StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification}, 
      author={Yichen He and Yuan Lin and Jianchao Wu and Hanchong Zhang and Yuchen Zhang and Ruicheng Le},
      year={2024},
      eprint={2411.07076},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
