Frequency-Aware Mixture of Experts Model for Robust Multimodal Perception
Department
Information Systems
Program
Information Systems
Rights
This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Abstract
Robust multimodal perception is essential for understanding real-world scenes, particularly under degraded, noisy, or low-visibility conditions. This dissertation introduces a Frequency-Aware Mixture-of-Experts model that combines structural features from the frequency domain with semantic and spatial representations across the RGB, infrared (IR), text, and audio modalities. The work advances through a progression of perception tasks, beginning with single-modality perception, extending to vision-language modeling, and culminating in a four-modality adaptive model. We begin by addressing domain-specific perception with single-modality visual learning, which highlights the limitations of relying on a single source of information in complex environments and motivates the integration of frequency-domain reasoning into multimodal architectures. In the next stage, we enhance vision-language modeling by introducing frequency-based low-rank features into pretrained visual encoders. These features provide noise-resilient representations while maintaining compatibility with language models, improving caption generation and visual question answering (VQA), particularly under visual degradation. Finally, we propose a hybrid Frequency-Aware Mixture-of-Experts (FreqMoE) model that dynamically fuses RGB and IR image features, guided by synchronized text and audio signals. The model pairs a frequency-domain gating mechanism, which computes reliability scores from log-magnitude spectral features, with a feature-wise modulation module that adapts visual features based on fused semantic embeddings. To support this four-modality setup, we extend three public RGB-IR datasets (M3FD, RoadScene, and MSRS) with aligned textual and audio annotations, yielding a synchronized corpus of RGB images, IR data, captions, and audio without requiring new data collection. Experimental results demonstrate that our method outperforms state-of-the-art baselines on both detection and fusion-quality metrics, and ablation studies validate the contributions of frequency-aware gating and semantic conditioning. Our approach offers an interpretable and adaptive solution for robust cross-modal perception under real-world constraints.
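
The two mechanisms the abstract names admit a compact illustration. The PyTorch sketch below is one plausible realization under stated assumptions, not the dissertation's implementation: a gate that pools log-magnitude FFT spectra of the RGB and IR inputs into per-modality reliability weights, and a FiLM-style modulation that scales and shifts the fused visual features using a semantic embedding (here standing in for fused text and audio signals). All module names, dimensions, and the weighted-sum fusion are illustrative.

```python
# Illustrative sketch of frequency-aware gating + feature-wise modulation.
# Module names, dimensions, and the fusion scheme are assumptions for
# demonstration; they are not the FreqMoE authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyGate(nn.Module):
    """Scores RGB vs. IR reliability from log-magnitude FFT features."""
    def __init__(self, spec_dim: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(8)  # 8x8 spectral summary -> 64 dims
        self.scorer = nn.Sequential(
            nn.Linear(2 * spec_dim, spec_dim), nn.ReLU(),
            nn.Linear(spec_dim, 2),  # one reliability logit per modality
        )

    def spectrum(self, img: torch.Tensor) -> torch.Tensor:
        # Log-magnitude of the 2D FFT, averaged over channels.
        mag = torch.log1p(torch.abs(torch.fft.fft2(img)))
        return self.pool(mag.mean(dim=1, keepdim=True)).flatten(1)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.spectrum(rgb), self.spectrum(ir)], dim=1)
        return F.softmax(self.scorer(feats), dim=-1)  # (B, 2) weights

class FiLMModulation(nn.Module):
    """Adapts visual features with scale/shift from a semantic embedding."""
    def __init__(self, sem_dim: int, feat_ch: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(sem_dim, 2 * feat_ch)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(sem).chunk(2, dim=-1)
        return vis * (1 + gamma[..., None, None]) + beta[..., None, None]

# Usage: weight RGB/IR feature maps by spectral reliability, then condition
# the fused map on a (hypothetical) fused text+audio embedding.
gate, film = FrequencyGate(), FiLMModulation(sem_dim=512, feat_ch=256)
rgb, ir = torch.rand(4, 3, 128, 128), torch.rand(4, 1, 128, 128)
f_rgb, f_ir = torch.rand(4, 256, 32, 32), torch.rand(4, 256, 32, 32)
w = gate(rgb, ir)
fused = w[:, 0, None, None, None] * f_rgb + w[:, 1, None, None, None] * f_ir
out = film(fused, torch.rand(4, 512))
```

One design note: a softmax gate over spectral summaries yields explicit per-modality weights that can be inspected directly, which is consistent with the abstract's claim of an interpretable fusion mechanism.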
