A step-by-step walkthrough of the entire Long Video Description pipeline. We also provide the intermediate results generated by each step, so you can start from any step.
- Caption: data/raw_data/caption.json
- Audio Diarization: data/raw_data/diarization.json
- Partition of Dataset: data/raw_data/split.json
- Main Actor in each Clip: data/raw_data/ref_actor
- MovieQA: data/raw_data/movie_qa.jsonl
Similar to Movie101, the videos are not public due to copyright issues. Please contact hyc@bytedance.com.
Extract frames and audio from each raw video (a rough sketch of this step follows the path list below):
python script/preprocess/preprocess.py
- Input video path: data/video
- Output frame path: data/frame
- Output audio path: data/audio
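The actual logic lives in script/preprocess/preprocess.py; as a rough, hypothetical sketch of what frame and audio extraction typically involves, something like the following could be used (the 1 fps sampling rate, JPEG naming, and 16 kHz mono WAV settings are assumptions, not necessarily what the repo script does):

```python
# Hypothetical sketch of frame/audio extraction with ffmpeg; not the repo's preprocess.py.
import subprocess
from pathlib import Path

def extract_frames_and_audio(video_path: str, frame_dir: str, audio_dir: str, fps: int = 1):
    """Dump JPEG frames and a mono 16 kHz WAV track for one video."""
    video = Path(video_path)
    frame_out = Path(frame_dir) / video.stem
    frame_out.mkdir(parents=True, exist_ok=True)
    Path(audio_dir).mkdir(parents=True, exist_ok=True)

    # Frames: sample `fps` frames per second (assumed rate).
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vf", f"fps={fps}",
         str(frame_out / "%06d.jpg")],
        check=True,
    )
    # Audio: mono, 16 kHz WAV (a common choice for speech models; assumed here).
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000",
         str(Path(audio_dir) / f"{video.stem}.wav")],
        check=True,
    )

if __name__ == "__main__":
    for mp4 in Path("data/video").glob("*.mp4"):
        extract_frames_and_audio(str(mp4), "data/frame", "data/audio")
```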
The data directory is laid out as follows:
data
├── audio
├── audio_visual_diarization
├── frame
├── global_diarization
├── long_video_description
├── raw_data
├── scene_detect
└── video
2. Split each 3-minute video clip into short segments using automatic scene change detection plus a few post-processing rules.
First, split the video into short segments with automatic scene change detection; the results are stored in data/raw_data/scene_detect.
pip install scenedetect
scenedetect -i movie.mp4 -o data/raw_data/scene_detect -q detect-adaptive -t 2.0 list-scenes
Then, post-processing steps are applied to the generated scene list: overly short segments are merged, and a single dialogue is kept within one segment as much as possible. The final split file is saved in data/scene_detect/scene_split_new.json.
python script/scene_split/scene_split.py
python script/scene_split/update_scene_split.py
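The exact rules are implemented in script/scene_split/scene_split.py and update_scene_split.py. As an illustration only, the sketch below detects scenes with PySceneDetect's Python API and greedily merges segments shorter than a minimum length; the 2.0-second minimum, the merge-into-predecessor rule, and the output format are assumptions, and the dialogue-preserving logic is omitted:

```python
# Illustrative sketch only: detect scenes, then greedily merge very short segments.
# The real rules (including keeping dialogues intact) are in script/scene_split/.
import json
from scenedetect import detect, AdaptiveDetector

def split_and_merge(video_path: str, min_len: float = 2.0):
    # PySceneDetect returns (start, end) FrameTimecode pairs per scene.
    scenes = detect(video_path, AdaptiveDetector(adaptive_threshold=2.0))
    segments = [(s.get_seconds(), e.get_seconds()) for s, e in scenes]

    merged = []
    for start, end in segments:
        # Merge a segment into its predecessor if it is too short (assumed rule).
        if merged and (end - start) < min_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

if __name__ == "__main__":
    segs = split_and_merge("movie.mp4")
    with open("scene_split_example.json", "w") as f:
        json.dump(segs, f, indent=2)
```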
For the MovieQA dataset, we provide a cast list for each clip in advance. For other movies, a cast list can easily be obtained from IMDb together with a general face recognition algorithm.
Model: ERes2NetV2
Model Path: checkpoints/eres2netv2/embedding_model.ckpt
Please note that our audio embedding model is fine-tuned on 167 Chinese movies; for movies in other languages, consider using the original model for better results.
We simply use the inference code from the 3D-Speaker repo to obtain the audio embedding of each input audio segment. The predicted embeddings are saved in data/global_diarization/embeddings.jsonl.
Evaluate the performance of the embeddings:
python script/global_diarization/eval_embedding.py --input_file data/global_diarization/embeddings.jsonl
Then, a clustering algorithm assigns the same global ID to embeddings that are close in distance; the result is saved in data/global_diarization/diarization_id.jsonl. The same global ID means the model predicts that those lines of dialogue are spoken by the same person.
python script/global_diarization/update_diarization.py --input_file data/global_diarization/embeddings.jsonl --output_file data/global_diarization/diarization_id.jsonl
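The actual clustering is implemented in update_diarization.py. For illustration, one common approach is average-linkage agglomerative clustering over cosine distances with a cut-off threshold; in the sketch below, the 0.5 threshold and the jsonl field names ("embedding", "key") are assumptions:

```python
# Sketch of global ID assignment by hierarchical clustering of speaker embeddings.
# Field names ("embedding", "key") and the 0.5 threshold are assumptions.
import json
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

records = [json.loads(line) for line in open("data/global_diarization/embeddings.jsonl")]
X = np.array([r["embedding"] for r in records])

# Average-linkage clustering on cosine distance; cut the dendrogram at a threshold
# so that embeddings closer than the threshold share one global ID.
Z = linkage(pdist(X, metric="cosine"), method="average")
global_ids = fcluster(Z, t=0.5, criterion="distance")

with open("diarization_id_example.jsonl", "w") as f:
    for rec, gid in zip(records, global_ids):
        f.write(json.dumps({"key": rec.get("key"), "global_id": int(gid)}) + "\n")
```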
| Metric | Precision | Recall | F1 |
|---|---|---|---|
| Score | 0.90 | 0.41 | 0.56 |
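For reference, pair-level precision/recall/F1 values like those above can be computed by treating every pair of utterances as one prediction: a pair is a predicted positive if the two utterances share a global ID, and a ground-truth positive if they share a speaker label. A minimal sketch (not necessarily how the repo scripts compute the reported scores):

```python
# Sketch: pair-level precision/recall/F1 of predicted global IDs vs. true speakers.
from itertools import combinations

def pairwise_prf(pred_ids, true_speakers):
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred_ids)), 2):
        same_pred = pred_ids[i] == pred_ids[j]
        same_true = true_speakers[i] == true_speakers[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```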
Model: Tarsier-7B with the OpenAI Whisper-large-v2 audio encoder
Model Path: checkpoints/Whisper-large-v2-Tarsier-7B-character-identification
In step 4, we assigned a unique ID to each audio segment. Now, we use an MLLM to map each ID to a character in the actor list. If the character is not in the actor list, a description is generated for it instead.
Generate the input inference data, saved in data/audio_visual_diarization/data.jsonl:
python script/audio_visual_diarization/gen_infer.py
Infer the character name for each global ID:
python tasks/inference_quick_start.py \
--model_name_or_path checkpoints/Whisper-large-v2-Tarsier-7B-character-identification \
--input_path data/audio_visual_diarization/data.jsonl \
--output_path data/audio_visual_diarization/0.jsonl
Find conflicting predictions and generate the to-be-corrected inference data, saved in data/audio_visual_diarization/align_data/data.jsonl:
python script/audio_visual_diarization/alignment.py
Infer the probability for each conflicting ID-name pair:
python tasks/inference_quick_start.py \
--model_name_or_path checkpoints/Whisper-large-v2-Tarsier-7B-character-identification \
--input_path data/audio_visual_diarization/align_data/data.jsonl \
--output_path data/audio_visual_diarization/correct/0.jsonl
Generate the final corrected global audio diarization results, saved in data/audio_visual_diarization/correct/test_diarization.json
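Conceptually, the correction keeps, for each global ID, the candidate name with the highest predicted probability from the second inference pass. A minimal sketch with hypothetical field names ("global_id", "candidate_name", "prob"); the real logic lives in the repo scripts:

```python
# Sketch of resolving conflicting ID-name predictions by keeping the most
# probable candidate per global ID. Field names are hypothetical.
import json
from collections import defaultdict

best = defaultdict(lambda: (None, -1.0))  # global_id -> (name, prob)
for line in open("data/audio_visual_diarization/correct/0.jsonl"):
    rec = json.loads(line)
    gid, name, prob = rec["global_id"], rec["candidate_name"], rec["prob"]
    if prob > best[gid][1]:
        best[gid] = (name, prob)

final = {gid: name for gid, (name, _) in best.items()}
print(final)
```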
Character identification accuracy before and after global audio diarization correction:
|  | Before | After |
|---|---|---|
| Correctable | 70.0 | 81.1 |
| Total | 75.9 | 79.1 |
Model: Tarsier-7B
Model Path: checkpoints/Tarsier-7B-description-generation
Generate the final video description using the identification results from step 5.
Generate the input inference data, saved in data/long_video_description/data.jsonl:
python script/long_video_description/gen_infer.py
Infer the description for each short video clip:
python tasks/inference_quick_start.py \
--model_name_or_path checkpoints/Tarsier-7B-description-generation \
--input_path data/long_video_description/data.jsonl \
--output_path data/long_video_description/0.jsonl
Generate the dense descriptions:
python dense_description.py
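Roughly speaking, dense_description.py stitches the per-segment outputs into one time-ordered description per clip. The sketch below only illustrates that idea; the field names ("clip_id", "start", "text") and the output format are assumptions:

```python
# Sketch: concatenate per-segment descriptions into one dense description per clip.
# Field names ("clip_id", "start", "text") are assumptions about the jsonl schema.
import json
from collections import defaultdict

segments = defaultdict(list)
for line in open("data/long_video_description/0.jsonl"):
    rec = json.loads(line)
    segments[rec["clip_id"]].append((rec["start"], rec["text"]))

dense = {
    clip: " ".join(text for _, text in sorted(items))
    for clip, items in segments.items()
}
with open("dense_caption_example.json", "w") as f:
    json.dump(dense, f, ensure_ascii=False, indent=2)
```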
Evaluate the descriptions by answering the MovieQA questions:
python script/long_video_description/eval_qa_accuracy.py \
--pred_caption_path result/tarsier/dense_caption_name.json \
--out_path result/tarsier/qa_name.jsonl
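The script answers each MovieQA question from the generated description and judges correctness; aggregating accuracy per question type (as in the table below) would look roughly like this, with the field names ("type", "correct") being assumptions about the output schema:

```python
# Sketch: per-category and total QA accuracy from a jsonl of judged answers.
# Field names ("type", "correct") are assumptions about the output schema.
import json
from collections import defaultdict

hits, totals = defaultdict(int), defaultdict(int)
for line in open("result/tarsier/qa_name.jsonl"):
    rec = json.loads(line)
    for key in (rec["type"], "Total"):  # e.g. Character / Action / Plot
        totals[key] += 1
        hits[key] += int(rec["correct"])

for key in totals:
    print(f"{key}: {hits[key] / totals[key]:.3f}")
```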
Evaluation of video description quality using QAs:
| Model | Character | Action | Plot | Total |
|---|---|---|---|---|
| Gemini-1.5-pro | 0.578 | 0.501 | 0.534 | 0.544 |
| GPT-4o | 0.517 | 0.479 | 0.528 | 0.507 |
| VILA1.5-8B | 0.561 | 0.459 | 0.540 | 0.524 |
| LLaVA-OneVision | 0.557 | 0.454 | 0.540 | 0.520 |
| Qwen2-VL-7B | 0.549 | 0.468 | 0.549 | 0.523 |
| InternVL2-8B | 0.535 | 0.448 | 0.506 | 0.501 |
| Tarsier-7B | 0.676 | 0.583 | 0.644 | 0.639 |
The output of each model can be found in the result directory.
Coming soon.
Please cite us as:
@misc{he2024storyteller,
title={StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification},
author={Yichen He and Yuan Lin and Jianchao Wu and Hanchong Zhang and Yuchen Zhang and Ruicheng Le},
year={2024},
eprint={2411.07076},
archivePrefix={arXiv},
primaryClass={cs.CV}
}