From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Spot-the-difference self-supervised pre-training for anomaly detection and segmentation

Y Zou, J Jeong, L Pemula, D Zhang… - European Conference on …, 2022 - Springer
Visual anomaly detection is commonly used in industrial quality inspection. In this paper, we
present a new dataset as well as a new self-supervised learning method for ImageNet pre …

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

Image retrieval on real-life images with pre-trained vision-and-language models

Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com
We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …

Fine-tuning multimodal llms to follow zero-shot demonstrative instructions

J Li, K Pan, Z Ge, M Gao, W Ji, W Zhang… - The Twelfth …, 2023 - openreview.net
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing
Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …

Evolution of visual data captioning methods, datasets, and evaluation metrics: A comprehensive survey

D Sharma, C Dhiman, D Kumar - Expert Systems with Applications, 2023 - Elsevier
Abstract Automatic Visual Captioning (AVC) generates syntactically and semantically correct
sentences by describing important objects, attributes, and their relationships with each other …

VisIT-Bench: A benchmark for vision-language instruction following inspired by real-world use

Y Bitton, H Bansal, J Hessel, R Shao, W Zhu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of
instruction-following vision-language models for real-world use. Our starting point is curating …

Remote sensing image change captioning with dual-branch transformers: A new method and a large-scale dataset

C Liu, R Zhao, H Chen, Z Zou… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Analyzing land cover changes with multitemporal remote sensing (RS) images is crucial for
environmental protection and land planning. In this article, we explore RS image change …

Visual instruction tuning with polite flamingo

D Chen, J Liu, W Dai, B Wang - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large
Language Models (LLMs) using an assortment of annotated downstream vision-language …

Fashion IQ: A new dataset towards retrieving images by natural language feedback

H Wu, Y Gao, X Guo, Z Al-Halah… - Proceedings of the …, 2021 - openaccess.thecvf.com
Conversational interfaces for the detail-oriented retail fashion domain are more natural,
expressive, and user-friendly than classical keyword-based search interfaces. In this paper …
expressive, and user friendly than classical keyword-based search interfaces. In this paper …