Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning

dc.contributor.authorMasri, Sari
dc.contributor.authorAshqar, Huthaifa
dc.contributor.authorElhenawy, Mohammed
dc.date.accessioned2025-10-16T15:27:10Z
dc.date.issued2025-05-07
dc.description.abstractManaging traffic flow through urban intersections is challenging. Conflicts involving a mix of different vehicles with blind spots make such intersections relatively vulnerable to crashes. This paper presents a new framework based on a fine-tuned Multimodal Large Language Model (MLLM), GPT-4o, that can control intersections in real time using bird's-eye-view videos captured by drones. The fine-tuned GPT-4o model is used to reason logically and visually about traffic conflicts and to provide instructions to drivers, which helps create a safer and more efficient traffic flow. To fine-tune and evaluate the model, we labeled a dataset comprising three months of drone videos and their corresponding trajectories, recorded at a four-way intersection in Dresden, Germany. Preliminary results showed that the fine-tuned GPT-4o achieved an accuracy of about 77%, outperforming zero-shot baselines. When continuous video-frame sequences were used, model performance increased to about 89% on a time-serialized dataset and about 90% on an unbalanced real-world dataset, demonstrating the model's robustness under different conditions. Furthermore, experts manually evaluated the usefulness of the explanations and recommendations predicted by the model, which received average ratings of 8.99 out of 10 and 9.23 out of 10, respectively. The results demonstrate the advantages of combining MLLMs with structured prompts and temporal information for conflict detection, and they offer a flexible and robust prototype framework for improving the safety and effectiveness of uncontrolled intersections. The code and labeled dataset used in this study are publicly available (see Data Availability Statement).
dc.description.urihttps://www.mdpi.com/2313-576X/11/2/40
dc.format.extent15 pages
dc.genrejournal articles
dc.identifierdoi:10.13016/m26dtz-fldb
dc.identifier.citationMasri, Sari, Huthaifa I. Ashqar, and Mohammed Elhenawy. “Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning.” Safety 11, no. 2 (2025): 40. https://doi.org/10.3390/safety11020040.
dc.identifier.urihttps://doi.org/10.3390/safety11020040
dc.identifier.urihttp://hdl.handle.net/11603/40445
dc.language.isoen
dc.publisherMDPI
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Data Science
dc.rightsAttribution 4.0 International
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/
dc.subjectunsignalized intersections
dc.subjectconflict detection
dc.subjectfine-tuning
dc.subjectMultimodal Large Language Models (MLLMs)
dc.subjectvisual and logical reasoning
dc.subjectprompt design
dc.subjecturban traffic management
dc.titleLeveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning
dc.typeText
dcterms.creatorhttps://orcid.org/0000-0002-6835-8338

Files

Original bundle

Name: safety1100040.pdf
Size: 3.02 MB
Format: Adobe Portable Document Format