Frequency-Aware Mixture of Experts Model for Robust Multimodal Perception
Department
Information Systems
Program
Information Systems
Rights
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Abstract
Robust multimodal perception is essential for understanding real-world scenes, particularly under degraded, noisy, or low-visibility conditions. This dissertation introduces a Frequency-Aware Mixture-of-Experts model that combines structural features from the frequency domain with semantic and spatial representations across the RGB, infrared (IR), text, and audio modalities. The work advances through a progression of perception tasks, beginning with single-modality perception, extending to vision-language modeling, and culminating in a four-modality adaptive model. We begin by addressing domain-specific perception with single-modality visual learning, which highlights the limitations of relying on a single source of information in complex environments and motivates the integration of frequency-domain reasoning into multimodal architectures. In the next stage, we enhance vision-language modeling by introducing frequency-based low-rank features into pretrained visual encoders. These features provide noise-resilient representations while maintaining compatibility with language models, improving caption generation and visual question answering (VQA), particularly under visual degradation. Finally, we propose a hybrid Frequency-Aware Mixture-of-Experts (FreqMoE) model that dynamically fuses RGB and IR image features, guided by synchronized text and audio signals. The model pairs a frequency-domain gating mechanism, which computes reliability scores from log-magnitude spectral features, with a feature-wise modulation module that adapts visual features based on fused semantic embeddings. To support this four-modality setup, we extend three public RGB-IR datasets (M3FD, RoadScene, and MSRS) with aligned textual and audio annotations, yielding a synchronized corpus of RGB images, IR data, captions, and audio without requiring new data collection. Experimental results demonstrate that our method outperforms state-of-the-art baselines on both detection and fusion-quality metrics, and ablation studies validate the contributions of frequency-aware gating and semantic conditioning. Our approach offers an interpretable and adaptive solution for robust cross-modal perception under real-world constraints.
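
The two mechanisms the abstract names admit a compact illustration. The PyTorch sketch below is one plausible realization under stated assumptions, not the dissertation's implementation: a gate that pools log-magnitude FFT spectra of the RGB and IR inputs into per-modality reliability weights, and a FiLM-style modulation that scales and shifts the fused visual features using a semantic embedding (here standing in for fused text and audio signals). All module names, dimensions, and the weighted-sum fusion are illustrative.

```python
# Illustrative sketch of frequency-aware gating + feature-wise modulation.
# Module names, dimensions, and the fusion scheme are assumptions for
# demonstration; they are not the FreqMoE authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyGate(nn.Module):
    """Scores RGB vs. IR reliability from log-magnitude FFT features."""
    def __init__(self, spec_dim: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(8)  # 8x8 spectral summary -> 64 dims
        self.scorer = nn.Sequential(
            nn.Linear(2 * spec_dim, spec_dim), nn.ReLU(),
            nn.Linear(spec_dim, 2),  # one reliability logit per modality
        )

    def spectrum(self, img: torch.Tensor) -> torch.Tensor:
        # Log-magnitude of the 2D FFT, averaged over channels.
        mag = torch.log1p(torch.abs(torch.fft.fft2(img)))
        return self.pool(mag.mean(dim=1, keepdim=True)).flatten(1)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.spectrum(rgb), self.spectrum(ir)], dim=1)
        return F.softmax(self.scorer(feats), dim=-1)  # (B, 2) weights

class FiLMModulation(nn.Module):
    """Adapts visual features with scale/shift from a semantic embedding."""
    def __init__(self, sem_dim: int, feat_ch: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(sem_dim, 2 * feat_ch)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(sem).chunk(2, dim=-1)
        return vis * (1 + gamma[..., None, None]) + beta[..., None, None]

# Usage: weight RGB/IR feature maps by spectral reliability, then condition
# the fused map on a (hypothetical) fused text+audio embedding.
gate, film = FrequencyGate(), FiLMModulation(sem_dim=512, feat_ch=256)
rgb, ir = torch.rand(4, 3, 128, 128), torch.rand(4, 1, 128, 128)
f_rgb, f_ir = torch.rand(4, 256, 32, 32), torch.rand(4, 256, 32, 32)
w = gate(rgb, ir)
fused = w[:, 0, None, None, None] * f_rgb + w[:, 1, None, None, None] * f_ir
out = film(fused, torch.rand(4, 512))
```

One design note: a softmax gate over spectral summaries yields explicit per-modality weights that can be inspected directly, which is consistent with the abstract's claim of an interpretable fusion mechanism.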
