Many news programs provide sign language for hearing-impaired viewers, and the signing is closely related to the video content. We therefore propose a new video captioning task termed Sign Language Assisted Video Captioning (SLAVC). To handle the SLAVC task, we introduce the Multimodal Relations Auxiliary Network (MRAN):
MRAN models the relations between different modalities to help generate high-quality sentences.
In addition, we propose the China Focus On (CFO) dataset, which contains three modalities (i.e., visual, sign language, and audio), to explore the SLAVC task. The preprocessed features and the JSON annotation file (including the URL, start time, end time, category, and captions of each video) are available here.
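For reference, here is a minimal sketch of inspecting the annotation file with Python. The filename and field names below are assumptions based on the description above, not the actual schema; check the downloaded JSON for the real keys.

```python
import json

# Hypothetical path and schema: field names are assumed from the
# README description (URL, start/end time, category, captions).
with open("data/CFO/CFO_annotations.json") as f:
    annotations = json.load(f)

# Print the first few entries to verify the download.
for video_id, ann in list(annotations.items())[:3]:
    print(video_id, ann["url"], ann["start_time"], ann["end_time"])
    print("category:", ann["category"])
    print("captions:", ann["captions"])
```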
conda create -n MRAN python=3.6
conda activate MRAN
pip install torch torchvision torchaudio
pip install -r requirements.txt
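After installation, you can optionally sanity-check that PyTorch was installed correctly and can see a GPU:

```python
import torch

# Quick environment check: prints the installed PyTorch version and
# whether a CUDA-capable GPU is visible to it.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```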
Download the preprocessed features and corpus and place them in `data/CFO`.
If you want to use your own dataset, preprocess it with the following steps:
- Extract appearance feature
python preprocess/extract_feat.py --dataset CFO --feature_type appearance --image_height 224 --image_width 224 --gpu_id 0
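Conceptually, the appearance step samples frames from each video and encodes each frame with a 2D CNN. The sketch below illustrates this idea; the choice of an ImageNet-pretrained ResNet-101 backbone is an assumption for illustration and may differ from what `extract_feat.py` actually uses.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Assumed backbone: ImageNet-pretrained ResNet-101 with the final
# classification layer removed, giving one 2048-d vector per frame.
backbone = models.resnet101(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

transform = T.Compose([
    T.Resize((224, 224)),  # matches --image_height / --image_width
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def appearance_features(frames):
    """frames: list of PIL images sampled from one video."""
    batch = torch.stack([transform(f) for f in frames])
    return backbone(batch)  # shape: (num_frames, 2048)
```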
- Extract motion feature
Download the ResNeXt-101 pretrained model (resnext-101-kinetics.pth) and place it in `data/preprocess/pretrained`.
python preprocess/extract_feat.py --dataset CFO --feature_type motion --image_height 112 --image_width 112 --gpu_id 0
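For intuition, loading such a Kinetics-pretrained checkpoint typically looks like the sketch below. The `resnext101` import stands in for a 3D-ResNeXt model definition; the module path and constructor signature are hypothetical and will differ from this repository's actual code.

```python
import torch

# Hypothetical import: a 3D-ResNeXt definition assumed to ship with
# the preprocessing code; the real module path may differ.
from preprocess.models.resnext import resnext101

model = resnext101(sample_size=112, sample_duration=16)  # assumed signature

# Kinetics checkpoints from the original 3D-ResNets release wrap the
# weights in a 'state_dict' entry with 'module.' prefixes left over
# from DataParallel training, so they are stripped before loading.
ckpt = torch.load("data/preprocess/pretrained/resnext-101-kinetics.pth",
                  map_location="cpu")
state_dict = {k.replace("module.", ""): v
              for k, v in ckpt["state_dict"].items()}
model.load_state_dict(state_dict)
model.eval()
```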
- Extract sign language feature
python preprocess/extract_feat.py --dataset CFO --feature_type hand --image_height 112 --image_width 112 --gpu_id 0
- Build corpus
python preprocess/build_vocab.py --dataset CFO -H 2
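Conceptually, this step tokenizes all training captions and keeps words above a frequency cutoff. The sketch below assumes `-H` is such a cutoff (i.e., a minimum word count); check `build_vocab.py` for the flag's actual meaning.

```python
from collections import Counter

def build_vocab(captions, min_count=2):
    """Minimal sketch: map each frequent word to an integer index."""
    counter = Counter(word for cap in captions
                      for word in cap.lower().split())
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for word, count in counter.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

# Toy usage with made-up captions.
vocab = build_vocab(["a reporter signs the news",
                     "the anchor reads the news"])
print(len(vocab), vocab.get("news"))
```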
Our pretrained model is available here. Download and save it for evaluation, or train a new model:
python train.py --cfg configs/CFO.yml
python evaluate.py --model_path {model_path} --save_path {save_path}
- Some code is adapted from HCRN.