Frequency-Aware Mixture of Experts Model for Robust Multimodal Perception

dc.contributor.advisor: Gangopadhyay, Aryya
dc.contributor.author: Khan, Azim
dc.contributor.department: Information Systems
dc.contributor.program: Information Systems
dc.date.accessioned: 2025-09-24T14:07:10Z
dc.date.issued: 2025-01-01
dc.description.abstract: Robust multimodal perception is essential for understanding real-world scenes, particularly under degraded, noisy, or low-visibility conditions. This dissertation introduces a Frequency-Aware Mixture-of-Experts model that combines structural features from the frequency domain with semantic and spatial representations across RGB, infrared (IR), text, and audio modalities. The work advances through a progression of perception tasks, beginning with single-modality perception, extending to vision-language modeling, and culminating in a four-modality adaptive model. We begin by addressing domain-specific perception using single-modality visual learning, which highlights the limitations of relying on a single source of information in complex environments. This motivates the integration of frequency-domain reasoning into multimodal architectures. In the next stage, we enhance vision-language modeling by introducing frequency-based low-rank features into pretrained visual encoders. These features provide noise-resilient representations while maintaining compatibility with language models, leading to improved performance in caption generation and visual question answering (VQA), particularly under visual degradation. Finally, we propose a hybrid Frequency-Aware Mixture-of-Experts (FreqMoE) model that dynamically fuses RGB and IR image features, guided by synchronized text and audio signals. The model combines a frequency-domain gating mechanism, which computes reliability scores from log-magnitude spectral features, with a feature-wise modulation module that adapts visual features based on fused semantic embeddings. To support this four-modality setup, we extend three public RGB-IR datasets (M3FD, RoadScene, and MSRS) by adding aligned textual and audio annotations. This results in a synchronized four-modality setup that includes RGB images, IR data, captions, and audio, without requiring new data collection. Experimental results demonstrate that our method outperforms state-of-the-art baselines in both detection and fusion quality metrics. Ablation studies further validate the contributions of frequency-aware gating and semantic conditioning. Our approach offers an interpretable and adaptive solution for robust cross-modal perception under real-world constraints.
dc.format: application/pdf
dc.genre: dissertation
dc.identifier: doi:10.13016/m2zyry-oslb
dc.identifier.other: 13091
dc.identifier.uri: http://hdl.handle.net/11603/40268
dc.language: en
dc.relation.isAvailableAt: The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof: UMBC Information Systems Department Collection
dc.relation.ispartof: UMBC Theses and Dissertations Collection
dc.relation.ispartof: UMBC Graduate School Collection
dc.relation.ispartof: UMBC Student Collection
dc.rights: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
dc.source: Original File Name: Khan_umbc_0434D_13091.pdf
dc.subject: Discrete Fourier Transform
dc.subject: Feature Modulation
dc.subject: Mixture of Experts
dc.subject: Multimodal AI
dc.subject: Singular Value Decomposition
dc.subject: Vision Language Model
dc.title: Frequency-Aware Mixture of Experts Model for Robust Multimodal Perception
dc.type: Text
dcterms.accessRights: Distribution Rights granted to UMBC by the author.

Files

Name: Khan-Azim_Open.pdf
Size: 1.57 MB
Format: Adobe Portable Document Format