Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding

dc.contributor.author: Elhenawy, Mohammed
dc.contributor.author: Ashqar, Huthaifa
dc.contributor.author: Rakotonirainy, Andry
dc.contributor.author: Alhadidi, Taqwa I.
dc.contributor.author: Jaber, Ahmed
dc.contributor.author: Tami, Mohammad Abu
dc.date.accessioned: 2025-10-16T15:27:11Z
dc.date.issued: 2025-03-24
dc.description.abstract: Scene understanding is essential for enhancing driver safety, generating human-centric explanations for Automated Vehicle (AV) decisions, and leveraging Artificial Intelligence (AI) for retrospective driving video analysis. This study developed a dynamic scene retrieval system using Contrastive Language–Image Pretraining (CLIP) models, which can be optimized for real-time deployment on edge devices. The proposed system outperforms state-of-the-art in-context learning methods, including the zero-shot capabilities of GPT-4o, particularly in complex scenarios. By conducting frame-level analyses on the Honda Scenes Dataset, which contains about 80 h of annotated driving videos capturing diverse real-world road and weather conditions, our study highlights the robustness of CLIP models in learning visual concepts from natural language supervision. The results also showed that fine-tuning the CLIP models, such as ViT-L/14 (Vision Transformer) and ViT-B/32, significantly improved scene classification, achieving a top F1-score of 91.1%. These results demonstrate the ability of the system to deliver rapid and precise scene recognition, which can be used to meet the critical requirements of advanced driver assistance systems (ADASs). This study shows the potential of CLIP models to provide scalable and efficient frameworks for dynamic scene understanding and classification. Furthermore, this work lays the groundwork for advanced autonomous vehicle technologies by fostering a deeper understanding of driver behavior, road conditions, and safety-critical scenarios, marking a significant step toward smarter, safer, and more context-aware autonomous driving systems.
dc.description.sponsorship: This research was funded partially by the Australian Government through the Australian Research Council Discovery Project DP220102598.
dc.description.uri: https://www.mdpi.com/2079-9292/14/7/1282
dc.format.extent: 27 pages
dc.genre: journal articles
dc.identifier: doi:10.13016/m2zpbl-om08
dc.identifier.citation: Elhenawy, Mohammed, Huthaifa I. Ashqar, Andry Rakotonirainy, Taqwa I. Alhadidi, Ahmed Jaber, and Mohammad Abu Tami. “Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding.” Electronics 14, no. 7 (2025): 1282. https://doi.org/10.3390/electronics14071282.
dc.identifier.uri: https://doi.org/10.3390/electronics14071282
dc.identifier.uri: http://hdl.handle.net/11603/40449
dc.language.iso: en
dc.publisher: MDPI
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Data Science
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: advanced driver assistance systems (ADASs)
dc.subject: scene understanding
dc.subject: automated vehicle (AV)
dc.subject: contrastive language–image pretraining (CLIP)
dc.subject: fine-tuning
dc.title: Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding
dc.type: Text
dcterms.creator: https://orcid.org/0000-0002-6835-8338
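
The abstract above describes zero-shot and fine-tuned CLIP scene classification on driving-video frames. As a minimal illustrative sketch only (not the authors' code), the snippet below shows zero-shot classification of a single frame with the Hugging Face transformers CLIP ViT-B/32 checkpoint; the scene prompts and the frame path are hypothetical placeholders, and the real study fine-tunes the models on the Honda Scenes Dataset rather than relying on pretrained weights alone.

# Minimal sketch: zero-shot CLIP scene classification for one driving-video frame.
# Assumptions: Hugging Face `transformers` CLIP ViT-B/32; the prompts and frame
# path below are illustrative placeholders, not taken from the paper.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # swap for a ViT-L/14 checkpoint if desired
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Scene labels phrased as natural-language prompts (hypothetical label set).
scene_prompts = [
    "a photo of driving on a highway in clear weather",
    "a photo of driving on a city street at night",
    "a photo of driving in heavy rain",
    "a photo of driving through a construction zone",
]

frame = Image.open("frame_000123.jpg")  # hypothetical frame extracted from a driving video

inputs = processor(text=scene_prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity logits, softmaxed into per-scene scores.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
best = probs.argmax().item()
print(f"Predicted scene: {scene_prompts[best]} (score {probs[best]:.3f})")

Fine-tuning, as reported in the paper, would continue training this image-text matching objective on labeled frames so that the scene prompts and image embeddings align for driving-specific conditions.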

Files

Original bundle

Name: electronics1401282.pdf
Size: 9.88 MB
Format: Adobe Portable Document Format