
12-in-1: Multi-Task Vision and Language Representation Learning

This repository contains code and pre-trained models for 12-in-1: Multi-Task Vision and Language Representation Learning. Please cite the following if you use this code:

@InProceedings{Lu_2020_CVPR,
author = {Lu, Jiasen and Goswami, Vedanuj and Rohrbach, Marcus and Parikh, Devi and Lee, Stefan},
title = {12-in-1: Multi-Task Vision and Language Representation Learning},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}

and ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks:

@inproceedings{lu2019vilbert,
  title={Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks},
  author={Lu, Jiasen and Batra, Dhruv and Parikh, Devi and Lee, Stefan},
  booktitle={Advances in Neural Information Processing Systems},
  pages={13--23},
  year={2019}
}

Repository Setup

  1. Create a fresh conda environment and install all dependencies.
conda create -n vilbert-mt python=3.6
conda activate vilbert-mt
git clone --recursive https://github.com/facebookresearch/vilbert-multi-task.git
cd vilbert-multi-task
pip install -r requirements.txt
  2. Install PyTorch. Choose the cudatoolkit version that matches your CUDA installation (check with nvcc --version).
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
  3. Install apex, following https://github.com/NVIDIA/apex.
  4. Install this codebase as a package in this environment.
python setup.py develop
  5. Initialize the git submodules and build the refer tools.
 git submodule init
 git submodule update
cd vilbert-multi-task/tools/refer
python setup.py install
make
# Then replace refer.py with https://gist.github.com/vedanuj/9d3497d107cfca0b6f3dfdc28d5cb226 to update it from Python 2 to Python 3.

Data Setup

Check README.md under data for more details.

Visiolinguistic Pre-training and Multi Task Training

Pretraining on Conceptual Captions

python train_concap.py --bert_model bert-base-uncased --config_file config/bert_base_6layer_6conect.json --train_batch_size 512 --objective 1 --file_path <path_to_extracted_cc_features>

Download link

Multi-task Training

python train_tasks.py --bert_model bert-base-uncased --from_pretrained <pretrained_model_path> --config_file config/bert_base_6layer_6conect.json --tasks 1-2-4-7-8-9-10-11-12-13-15-17 --lr_scheduler 'warmup_linear' --train_iter_gap 4 --task_specific_tokens --save_name multi_task_model

Download link

Fine-tune from Multi-task trained model

python train_tasks.py --bert_model bert-base-uncased --from_pretrained <multi_task_model_path> --config_file config/bert_base_6layer_6conect.json --tasks 1 --lr_scheduler 'warmup_linear' --train_iter_gap 4 --task_specific_tokens --save_name finetune_from_multi_task_model

ViLBERT Feature Extraction

In this part, the fine-tuned (VQA or NLVR2) model weights are frozen.

Training the Regressor Separately

This part is implemented in the ME_train_reg_test-set.ipynb and ME_train_reg_folds.ipynb notebooks. The only difference is the evaluation protocol: in the first, training is performed on the whole dev-set and evaluation on the test-set, while in the second, 4 splits are used as explained in the report. A rough sketch of the fold-based setup follows.
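
The sketch below is only an illustration of the second setup, not the notebooks' actual code: it assumes the representations were already extracted to a pickle of per-video feature/score pairs, and the pickle path, its structure, and the choice of ridge regression are all assumptions.

```python
# Minimal sketch: a ridge regressor trained with 4-fold cross-validation on
# pre-extracted ViLBERT representations. The pickle path and its structure
# (a list of dicts with 'features' and 'score') are illustrative assumptions.
import pickle
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

with open("datasets/ME/out_features/train_features.pkl", "rb") as f:
    data = pickle.load(f)

X = np.stack([np.asarray(d["features"]).ravel() for d in data])
y = np.array([d["score"] for d in data])

fold_scores = []
for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    reg = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    preds = reg.predict(X[val_idx])
    fold_scores.append(spearmanr(preds, y[val_idx]).correlation)

print("Mean Spearman correlation over 4 folds:", np.mean(fold_scores))
```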

Prepare (Deep) Caption

Preparing (deep) captions consists of loading video IDs and captions from a .txt or .csv file (it needs 'video_id' and 'caption' columns), adding the ground-truth scores, tokenizing, tensorizing, and saving the result as a cache file. Add the --dc flag if using deep captions. An example of using this script:

python script/feature_extraction/captions_preparation.py --captions_path Surrey/captions_clean.csv --output_path Surrey/prepared_captions_surrey
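
A quick way to check that a captions file matches the expected format described above (the path is just an example):

```python
# Sanity check that a captions file has the columns captions_preparation.py
# expects ('video_id' and 'caption'); the file path is an example.
import pandas as pd

df = pd.read_csv("Surrey/captions_clean.csv")
missing = {"video_id", "caption"} - set(df.columns)
assert not missing, f"captions file is missing columns: {missing}"
print(df[["video_id", "caption"]].head())
```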

Extract Frames from Video

Use this script to extract frames from the video.

python script/feature_extraction/extract_frames.py --output_folder <output_folder> --video_dir <video_dir> --frames <frames> --frame_list <path_to_csv>

Use the frames parameter to set the number of frames to extract (default is 1, i.e., the middle frame of the video). The extracted frames are saved as <output_folder>/<video-id>_<frame_count>.jpg, where <frame_count> is in [0..<frames>-1] (or <output_folder>/<video-id>.jpg when extracting only one frame). Alternatively, pass a list of specific frames to extract with the frame_list parameter, a path to a CSV file with columns video_name and frame. Keep this naming structure, since it is expected by the script/ME/average_features.py and script/extract_features.py scripts. Make sure you have write permission for the output_folder. For example:

python script/feature_extraction/extract_frames.py --video_dir Surrey/ --output_folder Surrey/frames --frame_list Surrey/Surrey_captions_clean._bis.csv 
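
If you need to build such a frame_list file yourself, here is a minimal sketch (the video names and frame indices are placeholders):

```python
# Build a frame_list CSV with the columns extract_frames.py expects
# ('video_name' and 'frame'); the entries below are placeholders.
import pandas as pd

frame_list = pd.DataFrame(
    {
        "video_name": ["video1.mp4", "video1.mp4", "video2.mp4"],
        "frame": [10, 50, 30],
    }
)
frame_list.to_csv("Surrey/frame_list.csv", index=False)
```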

Extract Features for Multiple Frames

Use script/extract_features.py and set the samples parameter to the number of frames to use.

python script/extract_features.py --model_file data/detectron_model.pth --config_file data/detectron_config.yaml --image_dir datasets/ME/images/train --output_folder datasets/ME/features_100/ME_trainval_resnext152_faster_rcnn_genome.lmdb/ --samples 5

Average Visual Feature Vectors

If multiple frames were extracted from each video, this script averages the already-extracted features. Feature files should be named <video-id>_<feature_count>.npy, where <feature_count> is in [0..<feature_number>].

python script/ME/average_features.py --features_dir <path_to_directory_with_features> --output_folder <path_to_output_averaged_features>
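
For reference, the averaging itself amounts to the following simplified sketch, assuming each .npy file holds a plain feature array (the paths are examples, and this is not the script itself):

```python
# Average per-frame feature files named <video-id>_<feature_count>.npy into a
# single file per video. Simplified sketch; assumes each .npy holds a plain
# feature array, and the paths are examples.
import glob
import os
from collections import defaultdict

import numpy as np

features_dir = "datasets/ME/features_100/frames"
output_folder = "datasets/ME/features_100/averaged"
os.makedirs(output_folder, exist_ok=True)

groups = defaultdict(list)
for path in glob.glob(os.path.join(features_dir, "*_*.npy")):
    video_id = os.path.basename(path).rsplit("_", 1)[0]
    groups[video_id].append(path)

for video_id, paths in groups.items():
    feats = np.stack([np.load(p) for p in sorted(paths)])
    np.save(os.path.join(output_folder, video_id + ".npy"), feats.mean(axis=0))
```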

Convert Visual Feature Vectors to lmdb

python script/convert_to_lmdb.py  --features_dir <path_to_directory_with_features> --lmdb_file  <path_to_output_lmdb_file> 
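
To verify the conversion, you can list the keys stored in the resulting lmdb file; the value format is repository-specific, so it is not decoded in this sketch:

```python
# List the keys stored in the converted lmdb to verify the conversion.
# The value layout is specific to this repository, so values are not decoded.
import lmdb

env = lmdb.open("<path_to_output_lmdb_file>", readonly=True, lock=False)
with env.begin() as txn:
    keys = [key for key, _ in txn.cursor()]
print(len(keys), "entries; first keys:", keys[:3])
```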

Get ViLBERT Representations

The ME training dataset (8,000 samples) is fed to the model, and the visual and textual representations are written to --rep_save_path so they can later be used to train a regressor. For this you need to have prepared the captions (see captions_preparation.py above) and extracted the visual features as explained above. The path to the captions is not passed as an argument here; it is built in vilbert/datasets/me_dataset.py (a combination of the dataroot in the yaml file and hard-coded values). If the file does not exist, another task is called instead, so be careful with this. TODO: add more complete error messages here.

python script/ME/vilbert_representations.py --bert_model bert-base-uncased --from_pretrained save/VQA_bert_base_6layer_6conect-finetune_from_multi_task_model-task_1/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --tasks 20 --batch_size 128 --rep_save_path datasets/ME/out_features/train_features.pkl
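
The layout of the saved pickle is determined by the script, so it is worth inspecting before training the regressor, e.g.:

```python
# Peek at the saved representations without assuming a particular schema.
import pickle

with open("datasets/ME/out_features/train_features.pkl", "rb") as f:
    reps = pickle.load(f)

print(type(reps))
if isinstance(reps, dict):
    print(list(reps.keys())[:10])
elif isinstance(reps, (list, tuple)) and reps:
    print(type(reps[0]))
```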

End-to-end Training and Evaluation

Training

Training the Multi-task model for ME

python train_tasks.py --bert_model bert-base-uncased --from_pretrained models/multi_task_model.bin --config_file config/bert_base_6layer_6conect.json --tasks 19 --train_iter_gap 4 --task_specific_tokens --save_name finetune_from_multi_task_model-task_19-all_train-BASE --lr_scheduler 'warmup_linear'

Training the VQA fine-tuned model for ME

python train_tasks.py --bert_model bert-base-uncased --from_pretrained save/VQA_bert_base_6layer_6conect-finetune_from_multi_task_model-task_1/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --tasks 19 --train_iter_gap 4 --task_specific_tokens --save_name finetune_from_multi_task_model-task_19-all_train-VQA --lr_scheduler 'warmup_linear'

Training the NLVR2 fine-tuned model for ME

python train_tasks.py --bert_model bert-base-uncased --from_pretrained save/NLVR2_bert_base_6layer_6conect-finetune_from_multi_task_model-task_12/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --tasks 19 --train_iter_gap 4 --task_specific_tokens --save_name finetune_from_multi_task_model-task_19-all_train-NLVR2 --lr_scheduler 'warmup_linear'

Evaluating

Evaluate the Multi-task model previously trained for ME

python script/ME/eval_ME.py --bert_model bert-base-uncased --config_file config/bert_base_6layer_6conect.json --tasks 19 --split test --task_specific_tokens --batch_size 128 --from_pretrained save/ME_bert_base_6layer_6conect-finetune_from_multi_task_model-task_19-all_train-BASE/pytorch_model_12.bin

Evaluate the VQA fine-tuned model previously trained for ME

python script/ME/eval_ME.py --bert_model bert-base-uncased --config_file config/bert_base_6layer_6conect.json --tasks 19 --split test --task_specific_tokens --batch_size 128 --from_pretrained save/ME_bert_base_6layer_6conect-finetune_from_multi_task_model-task_19-all_train-VQA/pytorch_model_14.bin

Evaluate the NLVR2 fine-tuned model previously trained for ME

python script/ME/eval_ME.py --bert_model bert-base-uncased --config_file config/bert_base_6layer_6conect.json --tasks 19 --split test --task_specific_tokens --batch_size 128 --from_pretrained save/ME_bert_base_6layer_6conect-finetune_from_multi_task_model-task_19-all_train-NLVR2/pytorch_model_11.bin

License

vilbert-multi-task is licensed under the MIT license, available in the LICENSE file.