Can we train vision and language zero-shot classification models without syntax?
dc.contributor.author | Tejankar, Ajinkya | |
dc.contributor.author | Sanjabi, Maziar | |
dc.contributor.author | Wu, Bichen | |
dc.contributor.author | Khabsa, Madian | |
dc.contributor.author | Xie, Saining | |
dc.contributor.author | Pirsiavash, Hamed | |
dc.contributor.author | Firooz, Hamed | |
dc.date.accessioned | 2023-11-10T14:44:31Z | |
dc.date.available | 2023-11-10T14:44:31Z | |
dc.date.issued | 2022-11-01 | |
dc.description | 3rd Self-Supervised Learning Theory and Practice Workshop at NeurIPS 2022; New Orleans, LA, USA; Nov 28 - Dec 4, 2022 | en_US |
dc.description.abstract | Natural language supervision in the form of image captions was recently shown to be an effective way of training zero-shot image classification models. In this work, we focus on teasing out what parts of the language supervision are essential for training zero-shot models. Through extensive and careful experiments, we show that replacing intact captions with Bag-of-Words (BoW) does not significantly degrade the zero-shot performance. Surprisingly, we can even slightly improve the performance on some datasets by balancing the frequency of words in BoW. | en_US |
dc.description.sponsorship | National Science Foundation; 1920079, 2230693, 1845216 | en_US |
dc.description.uri | https://par.nsf.gov/biblio/10393677-can-we-train-vision-language-zero-shot-classification-models-without-syntax | en_US |
dc.format.extent | 13 pages | en_US |
dc.genre | conference papers and proceedings | en_US |
dc.genre | preprints | en_US |
dc.identifier | doi:10.13016/m2i4ar-yerf | |
dc.identifier.uri | http://hdl.handle.net/11603/30685 | |
dc.language.iso | en_US | en_US |
dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
dc.relation.ispartof | UMBC Computer Science and Electrical Engineering Department Collection | |
dc.relation.ispartof | UMBC Faculty Collection | |
dc.rights | This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author. | en_US |
dc.title | Can we train vision and language zero-shot classification models without syntax? | en_US |
dc.type | Text | en_US |
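
The two ideas in the abstract, replacing intact captions with a Bag-of-Words and balancing the frequency of words in it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not describe their procedure, so the tokenizer, the function names, the example captions, and the square-root subsampling rule with its `threshold` parameter are all assumptions chosen for illustration.

```python
import random
import re
from collections import Counter


def caption_to_bow(caption):
    """Lowercase and tokenize a caption, then discard word order and repeats."""
    return sorted(set(re.findall(r"[a-z0-9]+", caption.lower())))


def keep_probabilities(bows, threshold):
    """Hypothetical balancing rule (an assumption, not taken from the paper).

    A word whose relative frequency exceeds `threshold` is kept with
    probability sqrt(threshold / frequency); rarer words are always kept.
    """
    counts = Counter(word for bow in bows for word in bow)
    total = sum(counts.values())
    return {
        word: min(1.0, (threshold / (count / total)) ** 0.5)
        for word, count in counts.items()
    }


def balance_bow(bow, keep_prob, rng):
    """Randomly drop frequent words; fall back to the full bag if all are dropped."""
    kept = [word for word in bow if rng.random() < keep_prob[word]]
    return kept or list(bow)


# Made-up captions, used only to exercise the functions.
captions = [
    "A dog chases a red ball.",
    "A cat sleeps on a sofa.",
    "A dog sleeps.",
]
bows = [caption_to_bow(c) for c in captions]
keep_prob = keep_probabilities(bows, threshold=0.1)

rng = random.Random(0)  # fixed seed so repeated runs give the same bags
balanced = [balance_bow(bow, keep_prob, rng) for bow in bows]
```

In this sketch a very common word such as "a" is dropped more often than a rarer word such as "ball", which flattens the word distribution seen during training. Any other reweighting or resampling scheme could be substituted for `keep_probabilities` without changing the rest of the code.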