This repository provides the dataset and code for our paper, "Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability," to be published at ACL 2024.
Given a news thumbnail image I and its news text T, the task is to predict a binary label L indicating whether I portrays an actor of the news event that can be identified from T.
We introduce a dataset of 1,000 news thumbnail images paired with news text for the task, along with high-quality labels. The dataset is intended for the zero-shot evaluation of vision-language models.
- Image: news thumbnail image
- Title: news headline
- Summary: body text summarized by ChatGPT
- Label
  - 1: the image portrays at least one actor of the news event
  - 0: the image does not portray any actor of the news event
The dataset is available upon request: [LINK]
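Once you obtain the data files, each record exposes the fields listed above. The snippet below is a minimal loading sketch; the file name train.pkl mirrors the preprocessing commands later in this README, and the example values in the comments are invented for illustration.

import pickle

# Load a data split (the file name follows the preprocessing commands below).
with open("train.pkl", "rb") as f:
    data = pickle.load(f)

# Each record is expected to carry the fields described above, e.g.:
#   image:   the news thumbnail image (or a path to it)
#   title:   "Leaders meet to discuss flood response"   # invented example
#   summary: "The two officials agreed to ..."          # invented example
#   label:   1 (portrays an actor) or 0 (does not)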
We present CFT-CLIP, a contrastive learning framework that uses counterfactual text to update vision and language bi-encoders.
This figure illustrates the key idea of the proposed method. Given a pair of a news thumbnail image and an article, the method generates counterfactual news text and uses it as negative samples for contrastive learning. CFT-CLIP is a CLIP-like vision-language transformer encoder that represents the semantics of news thumbnails and news text. It aims to improve the vision and language bi-encoder through contrastive updates involving the counterfactual text generated from an input text.

You can use the pretrained checkpoint available at HuggingFace Hub. [LINK]
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the pretrained CFT-CLIP checkpoint and its processor.
processor = AutoProcessor.from_pretrained("humane-lab/CFT-CLIP")
model = AutoModel.from_pretrained("humane-lab/CFT-CLIP")

image = Image.open("cat.jpg")

# Tokenize the text and preprocess the image into model inputs.
inputs = processor(text=["this is a cat"], images=image, return_tensors="pt")

# Forward pass; no gradients are needed for inference.
with torch.no_grad():
    outputs = model(**inputs)

text_embeds = outputs.text_embeds
image_embeds = outputs.image_embeds
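To relate these embeddings back to the task, one simple zero-shot readout is to threshold their cosine similarity. This is a sketch only; the threshold value below is a hypothetical illustration, not a number from the paper.

# Cosine similarity between the L2-normalized embeddings.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = (image_embeds * text_embeds).sum(dim=-1)

# Hypothetical decision rule: predict 1 (actor portrayed) when the
# similarity exceeds a threshold tuned on held-out data.
threshold = 0.5  # illustrative value, not from the paper
label = (similarity > threshold).long()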
Counterfactual text generation
The pretraining corpus is not provided in this repository due to copyright issues.
python utils/save_pixel_values.py # Extract pixel values in advance to speed up training
python utils/get_ntt.py --data_path 'train.pkl' --save_path 'train.pkl' --target_text 'summary' # Extract NTT from the news text
python utils/image_text_cossine_similarity.py --data_path 'train.pkl' --save_path 'train.pkl' --target_text 'summary' # Compute CLIP cosine similarity between image-text pairs
python utils/counterfactual.py --data_path 'train.pkl' --save_path 'train.pkl' # Generate counterfactual text
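utils/counterfactual.py implements the generation step used by the paper. For rough intuition only, the sketch below shows one plausible way to produce counterfactual text by masking a salient token and infilling it with a masked language model; the model choice and masking rule here are assumptions for illustration, not the repository's exact procedure.

from transformers import pipeline

# Illustrative infilling model; the actual script may use a different one.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

summary = "The president visited the flooded region on Monday."
# Hypothetical masking rule: mask a salient (e.g., actor) token.
masked = summary.replace("president", "[MASK]")

# Take a lower-ranked infill so the counterfactual diverges from the original.
candidates = fill_mask(masked)
counterfactual = candidates[-1]["sequence"]
print(counterfactual)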
Training
Set the configuration in config.py.
python train.py
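train.py runs the contrastive update described above. For intuition only, the following sketch shows one way counterfactual text can serve as additional negatives in a CLIP-style (InfoNCE) loss; the tensor names and temperature value are illustrative assumptions, not the exact implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, cf_text_embeds, temperature=0.07):
    """CLIP-style loss where counterfactual text adds hard negatives.

    image_embeds:   (N, D) thumbnail embeddings
    text_embeds:    (N, D) original news-text embeddings (positives)
    cf_text_embeds: (N, D) counterfactual text embeddings (negatives)
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    all_text = F.normalize(torch.cat([text_embeds, cf_text_embeds]), dim=-1)

    # Similarity of each image to every text; columns 0..N-1 are the
    # positives, columns N..2N-1 are the counterfactual negatives.
    logits = image_embeds @ all_text.t() / temperature
    targets = torch.arange(image_embeds.size(0))
    return F.cross_entropy(logits, targets)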
Evaluation
python evaluation.py --pixel_path "data/pixel_values" ...
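evaluation.py computes the reported metrics on the benchmark. If you need to score your own predictions against the gold labels, a minimal sketch (the metric choice here is illustrative) is:

from sklearn.metrics import accuracy_score, f1_score

gold = [1, 0, 1, 1]  # gold labels from the dataset (toy values)
pred = [1, 0, 0, 1]  # model predictions (toy values)

print("accuracy:", accuracy_score(gold, pred))
print("f1:", f1_score(gold, pred))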
The code and dataset are shared under CC BY-NC 4.0. You are free to use these resources for non-commercial purposes.
@article{yoon2024assessing,
  title={Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability},
  author={Yoon, Yejun and Yoon, Seunghyun and Park, Kunwoo},
  journal={arXiv preprint arXiv:2402.11159},
  year={2024}
}