TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

dc.contributor.author: Patel, Maitreya
dc.contributor.author: Kusumba, Abhiram
dc.contributor.author: Cheng, Sheng
dc.contributor.author: Kim, Changhoon
dc.contributor.author: Gokhale, Tejas
dc.contributor.author: Baral, Chitta
dc.contributor.author: Yang, Yezhou
dc.date.accessioned: 2024-12-11T17:02:28Z
dc.date.available: 2024-12-11T17:02:28Z
dc.date.issued: 2024-11-04
dc.description.abstract: Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating "hard" negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark on an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Our code, models, and data are available at: https://tripletclip.github.io
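As a rough illustration of how such synthetic hard negatives can enter a contrastive objective, the PyTorch-style sketch below extends a standard CLIP InfoNCE loss with negative captions and negative images as extra candidates within each batch. It folds both kinds of negatives into a single loss rather than the alternating schedule the abstract describes, and all function and variable names are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (assumption, not the authors' code): CLIP-style contrastive loss
# with synthetic hard negative captions and images as additional candidates.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img_emb, txt_emb, neg_img_emb, neg_txt_emb,
                                          temperature=0.07):
    """All inputs are (B, D) L2-normalized embeddings for matched quadruples of
    (image, caption, synthetic negative image, synthetic negative caption)."""
    # Standard CLIP logits: similarity of every image to every caption in the batch.
    logits = img_emb @ txt_emb.t() / temperature                    # (B, B)

    # Hard negative captions become extra (incorrect) text candidates per image.
    neg_txt_logits = img_emb @ neg_txt_emb.t() / temperature        # (B, B)
    image_to_text = torch.cat([logits, neg_txt_logits], dim=1)      # (B, 2B)

    # Symmetrically, hard negative images become extra candidates per caption.
    neg_img_logits = txt_emb @ neg_img_emb.t() / temperature        # (B, B)
    text_to_image = torch.cat([logits.t(), neg_img_logits], dim=1)  # (B, 2B)

    # Positives sit on the diagonal of the original (B, B) block.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(image_to_text, targets)
    loss_t2i = F.cross_entropy(text_to_image, targets)
    return 0.5 * (loss_i2t + loss_t2i)
```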
dc.description.sponsorship: This work was supported by NSF RI grants #1750082, #2132724, and CPS grant #2038666. We thank the Research Computing (RC) at Arizona State University (ASU) for providing computing resources. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the funding agencies and employers.
dc.description.uri: http://arxiv.org/abs/2411.02545
dc.format.extent: 24 pages
dc.genre: journal articles
dc.genre: postprints
dc.identifier: doi:10.13016/m2ksck-meo0
dc.identifier.uri: https://doi.org/10.48550/arXiv.2411.02545
dc.identifier.uri: http://hdl.handle.net/11603/37072
dc.language.iso: en_US
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Computer Science and Electrical Engineering Department
dc.relation.ispartof: UMBC Faculty Collection
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Computer Science - Computation and Language
dc.subject: Computer Science - Computer Vision and Pattern Recognition
dc.title: TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-5593-2804

Files

Original bundle

Name: 2411.02545v1.pdf
Size: 2.25 MB
Format: Adobe Portable Document Format