A step-by-step walkthrough of the entire Long Video Description pipeline. We also provide the intermediate results generated by each step, so you can start from any step.
- Caption: data/raw_data/caption.json
- Audio Diarization: data/raw_data/diarization.json
- Partition of Dataset: data/raw_data/split.json
- Main Actor in each Clip: data/raw_data/ref_actor
- MovieQA: data/raw_data/movie_qa.jsonl
Similar to Movie101, the videos are not public due to copyright issues. Please contact hyc@bytedance.com.
Extract frames and audio from each raw video (a rough sketch of this step follows the path list below):
python script/preprocess/preprocess.py
- Input video path: data/video
- Output frame path: data/frame
- Output audio path: data/audio
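The actual logic lives in script/preprocess/preprocess.py; as a rough, hypothetical sketch of what frame and audio extraction typically involves, something like the following could be used (the 1 fps sampling rate, JPEG naming, and 16 kHz mono WAV settings are assumptions, not necessarily what the repo script does):

```python
# Hypothetical sketch of frame/audio extraction with ffmpeg; not the repo's preprocess.py.
import subprocess
from pathlib import Path

def extract_frames_and_audio(video_path: str, frame_dir: str, audio_dir: str, fps: int = 1):
    """Dump JPEG frames and a mono 16 kHz WAV track for one video."""
    video = Path(video_path)
    frame_out = Path(frame_dir) / video.stem
    frame_out.mkdir(parents=True, exist_ok=True)
    Path(audio_dir).mkdir(parents=True, exist_ok=True)

    # Frames: sample `fps` frames per second (assumed rate).
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vf", f"fps={fps}",
         str(frame_out / "%06d.jpg")],
        check=True,
    )
    # Audio: mono, 16 kHz WAV (a common choice for speech models; assumed here).
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000",
         str(Path(audio_dir) / f"{video.stem}.wav")],
        check=True,
    )

if __name__ == "__main__":
    for mp4 in Path("data/video").glob("*.mp4"):
        extract_frames_and_audio(str(mp4), "data/frame", "data/audio")
```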
The data directory is laid out as follows:
data
├── audio
├── audio_visual_diarization
├── frame
├── global_diarization
├── long_video_description
├── raw_data
├── scene_detect
└── video
2. Split each 3-minute video clip into short segments using automatic scene change detection plus a few post-processing rules.
First, split the video into short segments with automatic scene change detection; the results are stored in data/raw_data/scene_detect.
pip install scenedetect
scenedetect -i movie.mp4 -o data/raw_data/scene_detect -q detect-adaptive -t 2.0 list-scenes
Then, post-processing steps are applied to the generated scene list: overly short segments are merged, and a single dialogue is kept within one segment as much as possible. The final split file is saved in data/scene_detect/scene_split_new.json.
python script/scene_split/scene_split.py
python script/scene_split/update_scene_split.py
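The exact rules are implemented in script/scene_split/scene_split.py and update_scene_split.py. As an illustration only, the sketch below detects scenes with PySceneDetect's Python API and greedily merges segments shorter than a minimum length; the 2.0-second minimum, the merge-into-predecessor rule, and the output format are assumptions, and the dialogue-preserving logic is omitted:

```python
# Illustrative sketch only: detect scenes, then greedily merge very short segments.
# The real rules (including keeping dialogues intact) are in script/scene_split/.
import json
from scenedetect import detect, AdaptiveDetector

def split_and_merge(video_path: str, min_len: float = 2.0):
    # PySceneDetect returns (start, end) FrameTimecode pairs per scene.
    scenes = detect(video_path, AdaptiveDetector(adaptive_threshold=2.0))
    segments = [(s.get_seconds(), e.get_seconds()) for s, e in scenes]

    merged = []
    for start, end in segments:
        # Merge a segment into its predecessor if it is too short (assumed rule).
        if merged and (end - start) < min_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged

if __name__ == "__main__":
    segs = split_and_merge("movie.mp4")
    with open("scene_split_example.json", "w") as f:
        json.dump(segs, f, indent=2)
```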
For the MovieQA dataset, we provide a cast list for each clip in advance. For other movies, a cast list can easily be obtained from IMDb together with a general face recognition algorithm.
Model: ERes2NetV2
Model Path: checkpoints/eres2netv2/embedding_model.ckpt
Please note that our audio embedding model is fine-tuned on 167 Chinese movies; for movies in other languages, consider using the original model for better results.
We simply use the inference code from the 3D-Speaker repo to obtain the audio embedding of each input audio segment. The predicted embeddings are saved in data/global_diarization/embeddings.jsonl.
Evaluate the performance of the embeddings:
python script/global_diarization/eval_embedding.py --input_file data/global_diarization/embeddings.jsonl
Then, a clustering algorithm assigns the same global ID to embeddings that are close in distance; the result is saved in data/global_diarization/diarization_id.jsonl. The same global ID means the model predicts that those lines of dialogue are spoken by the same person.
python script/global_diarization/update_diarization.py --input_file data/global_diarization/embeddings.jsonl --output_file data/global_diarization/diarization_id.jsonl
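The actual clustering is implemented in update_diarization.py. For illustration, one common approach is average-linkage agglomerative clustering over cosine distances with a cut-off threshold; in the sketch below, the 0.5 threshold and the jsonl field names ("embedding", "key") are assumptions:

```python
# Sketch of global ID assignment by hierarchical clustering of speaker embeddings.
# Field names ("embedding", "key") and the 0.5 threshold are assumptions.
import json
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

records = [json.loads(line) for line in open("data/global_diarization/embeddings.jsonl")]
X = np.array([r["embedding"] for r in records])

# Average-linkage clustering on cosine distance; cut the dendrogram at a threshold
# so that embeddings closer than the threshold share one global ID.
Z = linkage(pdist(X, metric="cosine"), method="average")
global_ids = fcluster(Z, t=0.5, criterion="distance")

with open("diarization_id_example.jsonl", "w") as f:
    for rec, gid in zip(records, global_ids):
        f.write(json.dumps({"key": rec.get("key"), "global_id": int(gid)}) + "\n")
```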
| Metric | Precision | Recall | F1 |
|---|---|---|---|
| Score | 0.90 | 0.41 | 0.56 |
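For reference, pair-level precision/recall/F1 values like those above can be computed by treating every pair of utterances as one prediction: a pair is a predicted positive if the two utterances share a global ID, and a ground-truth positive if they share a speaker label. A minimal sketch (not necessarily how the repo scripts compute the reported scores):

```python
# Sketch: pair-level precision/recall/F1 of predicted global IDs vs. true speakers.
from itertools import combinations

def pairwise_prf(pred_ids, true_speakers):
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred_ids)), 2):
        same_pred = pred_ids[i] == pred_ids[j]
        same_true = true_speakers[i] == true_speakers[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```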
Model: Tarsier-7B with the OpenAI Whisper-large-v2 audio encoder
Model Path: checkpoints/Whisper-large-v2-Tarsier-7B-character-identification
In step 4, we assigned a unique ID to each audio segment. Now, we use an MLLM to map each ID to a character in the actor list. If the character is not in the actor list, a description is generated for it instead.
Generate the input inference data, saved in data/audio_visual_diarization/data.jsonl:
python script/audio_visual_diarization/gen_infer.py
Infer the character name for each global ID:
python tasks/inference_quick_start.py \
--model_name_or_path checkpoints/Whisper-large-v2-Tarsier-7B-character-identification \
--input_path data/audio_visual_diarization/data.jsonl \
--output_path data/audio_visual_diarization/0.jsonl
Find conflicting predictions and generate the to-be-corrected inference data, saved in data/audio_visual_diarization/align_data/data.jsonl:
python script/audio_visual_diarization/alignment.py
Infer the probability for each conflicting ID-name pair:
python tasks/inference_quick_start.py \
--model_name_or_path checkpoints/Whisper-large-v2-Tarsier-7B-character-identification \
--input_path data/audio_visual_diarization/align_data/data.jsonl \
--output_path data/audio_visual_diarization/correct/0.jsonl
Generate the final corrected global audio diarization results, saved in data/audio_visual_diarization/correct/test_diarization.json
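Conceptually, the correction keeps, for each global ID, the candidate name with the highest predicted probability from the second inference pass. A minimal sketch with hypothetical field names ("global_id", "candidate_name", "prob"); the real logic lives in the repo scripts:

```python
# Sketch of resolving conflicting ID-name predictions by keeping the most
# probable candidate per global ID. Field names are hypothetical.
import json
from collections import defaultdict

best = defaultdict(lambda: (None, -1.0))  # global_id -> (name, prob)
for line in open("data/audio_visual_diarization/correct/0.jsonl"):
    rec = json.loads(line)
    gid, name, prob = rec["global_id"], rec["candidate_name"], rec["prob"]
    if prob > best[gid][1]:
        best[gid] = (name, prob)

final = {gid: name for gid, (name, _) in best.items()}
print(final)
```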
Character identification accuracy before and after global audio diarization correction:
|  | Before | After |
|---|---|---|
| Correctable | 70.0 | 81.1 |
| Total | 75.9 | 79.1 |
Model: Tarsier-7B
Model Path: checkpoints/Tarsier-7B-description-generation
Generate the final video description using the identification results from step 5.
Generate the input inference data, saved in data/long_video_description/data.jsonl:
python script/long_video_description/gen_infer.py
Infer the description for each short video clip:
python tasks/inference_quick_start.py \
--model_name_or_path checkpoints/Tarsier-7B-description-generation \
--input_path data/long_video_description/data.jsonl \
--output_path data/long_video_description/0.jsonl
Generate the dense descriptions:
python dense_description.py
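Roughly speaking, dense_description.py stitches the per-segment outputs into one time-ordered description per clip. The sketch below only illustrates that idea; the field names ("clip_id", "start", "text") and the output format are assumptions:

```python
# Sketch: concatenate per-segment descriptions into one dense description per clip.
# Field names ("clip_id", "start", "text") are assumptions about the jsonl schema.
import json
from collections import defaultdict

segments = defaultdict(list)
for line in open("data/long_video_description/0.jsonl"):
    rec = json.loads(line)
    segments[rec["clip_id"]].append((rec["start"], rec["text"]))

dense = {
    clip: " ".join(text for _, text in sorted(items))
    for clip, items in segments.items()
}
with open("dense_caption_example.json", "w") as f:
    json.dump(dense, f, ensure_ascii=False, indent=2)
```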
Evaluate the descriptions by answering the MovieQA questions:
python script/long_video_description/eval_qa_accuracy.py \
--pred_caption_path result/tarsier/dense_caption_name.json \
--out_path result/tarsier/qa_name.jsonl
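The script answers each MovieQA question from the generated description and judges correctness; aggregating accuracy per question type (as in the table below) would look roughly like this, with the field names ("type", "correct") being assumptions about the output schema:

```python
# Sketch: per-category and total QA accuracy from a jsonl of judged answers.
# Field names ("type", "correct") are assumptions about the output schema.
import json
from collections import defaultdict

hits, totals = defaultdict(int), defaultdict(int)
for line in open("result/tarsier/qa_name.jsonl"):
    rec = json.loads(line)
    for key in (rec["type"], "Total"):  # e.g. Character / Action / Plot
        totals[key] += 1
        hits[key] += int(rec["correct"])

for key in totals:
    print(f"{key}: {hits[key] / totals[key]:.3f}")
```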
Evaluation of video description quality using QAs:
| Model | Character | Action | Plot | Total |
|---|---|---|---|---|
| Gemini-1.5-pro | 0.578 | 0.501 | 0.534 | 0.544 |
| GPT-4o | 0.517 | 0.479 | 0.528 | 0.507 |
| VILA1.5-8B | 0.561 | 0.459 | 0.540 | 0.524 |
| LLaVA-OneVision | 0.557 | 0.454 | 0.540 | 0.520 |
| Qwen2-VL-7B | 0.549 | 0.468 | 0.549 | 0.523 |
| InternVL2-8B | 0.535 | 0.448 | 0.506 | 0.501 |
| Tarsier-7B | 0.676 | 0.583 | 0.644 | 0.639 |
The output of each model can be found in the result directory.
Coming soon.
Please cite us as:
@misc{he2024storyteller,
title={StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification},
author={Yichen He and Yuan Lin and Jianchao Wu and Hanchong Zhang and Yuchen Zhang and Ruicheng Le},
year={2024},
eprint={2411.07076},
archivePrefix={arXiv},
primaryClass={cs.CV}
}