From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
Spot-the-difference self-supervised pre-training for anomaly detection and segmentation
Visual anomaly detection is commonly used in industrial quality inspection. In this paper, we
present a new dataset as well as a new self-supervised learning method for ImageNet pre …
LLaVA-OneVision: Easy visual task transfer
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …
Image retrieval on real-life images with pre-trained vision-and-language models
Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com
We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …
Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions
Recent advancements in Multimodal Large Language Models (MLLMs) have been utilizing
Visual Prompt Generators (VPGs) to convert visual features into tokens that LLMs can …
Evolution of visual data captioning methods, datasets, and evaluation metrics: A comprehensive survey
Abstract Automatic Visual Captioning (AVC) generates syntactically and semantically correct
sentences by describing important objects, attributes, and their relationships with each other …
VisIT-Bench: A benchmark for vision-language instruction following inspired by real-world use
We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of
instruction-following vision-language models for real-world use. Our starting point is curating …
Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset
Analyzing land cover changes with multitemporal remote sensing (RS) images is crucial for
environmental protection and land planning. In this article, we explore RS image change …
Visual instruction tuning with Polite Flamingo
Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large
Language Models (LLMs) using an assortment of annotated downstream vision-language …
Fashion IQ: A new dataset towards retrieving images by natural language feedback
Conversational interfaces for the detail-oriented retail fashion domain are more natural,
expressive, and user-friendly than classical keyword-based search interfaces. In this paper …