Title: Can we train vision and language zero-shot classification models without syntax?
Authors: Tejankar, Ajinkya; Sanjabi, Maziar; Wu, Bichen; Khabsa, Madian; Xie, Saining; Pirsiavash, Hamed; Firooz, Hamed
Dates: 2023-11-10; 2023-11-10; 2022-11-01
URI: http://hdl.handle.net/11603/30685
Venue: 3rd Self-Supervised Learning Theory and Practice Workshop at NeurIPS 2022; New Orleans, LA, USA; Nov 28 - Dec 4, 2022
Abstract: Natural language supervision in the form of image captions was recently shown to be an effective way of training zero-shot image classification models. In this work, we focus on teasing out which parts of the language supervision are essential for training zero-shot models. Through extensive and careful experiments, we show that replacing intact captions with Bag-of-Words (BoW) does not significantly degrade the zero-shot performance. Surprisingly, we can even slightly improve the performance on some datasets by balancing the frequency of words in BoW.
Extent: 13 pages
Language: en-US
Rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
Type: Text