Frequency-Aware Mixture of Experts Model for Robust Multimodal Perception
| dc.contributor.advisor | Gangopadhyay, Aryya | |
| dc.contributor.author | Khan, Azim | |
| dc.contributor.department | Information Systems | |
| dc.contributor.program | Information Systems | |
| dc.date.accessioned | 2025-09-24T14:07:10Z | |
| dc.date.issued | 2025-01-01 | |
| dc.description.abstract | Robust multimodal perception is essential for understanding real-world scenes, particularly under degraded, noisy, or low-visibility conditions. This dissertation introduces a Frequency-Aware Mixture-of-Experts model that combines structural features from the frequency domain with semantic and spatial representations across RGB, infrared (IR), text, and audio modalities. The work advances through a progression of perception tasks, beginning with single-modality perception, extending to vision-language modeling, and culminating in a four-modality adaptive model. We begin by addressing domain-specific perception using single-modality visual learning, which highlights the limitations of relying on a single source of information in complex environments. This motivates the integration of frequency-domain reasoning into multimodal architectures. In the next stage, we enhance vision-language modeling by introducing frequency-based low-rank features into pretrained visual encoders. These features provide noise-resilient representations while maintaining compatibility with language models, leading to improved performance in caption generation and visual question answering (VQA), particularly under visual degradation. Finally, we propose a hybrid Frequency-Aware Mixture-of-Experts (FreqMoE) model that dynamically fuses RGB and IR image features, guided by synchronized text and audio signals. Two components drive this fusion: a frequency-domain gating mechanism that computes reliability scores from log-magnitude spectral features, and a feature-wise modulation module that adapts visual features based on fused semantic embeddings. To support this four-modality setup, we extend three public RGB-IR datasets (M3FD, RoadScene, and MSRS) by adding aligned textual and audio annotations. This results in a synchronized four-modality dataset that includes RGB images, IR data, captions, and audio, without requiring new data collection. 
Experimental results demonstrate that our method outperforms state-of-the-art baselines in both detection and fusion quality metrics. Ablation studies further validate the contributions of frequency-aware gating and semantic conditioning. Our approach offers an interpretable and adaptive solution for robust cross-modal perception under real-world constraints. | |
| dc.format | application/pdf | |
| dc.genre | dissertation | |
| dc.identifier | doi:10.13016/m2zyry-oslb | |
| dc.identifier.other | 13091 | |
| dc.identifier.uri | http://hdl.handle.net/11603/40268 | |
| dc.language | en | |
| dc.relation.isAvailableAt | The University of Maryland, Baltimore County (UMBC) | |
| dc.relation.ispartof | UMBC Information Systems Department Collection | |
| dc.relation.ispartof | UMBC Theses and Dissertations Collection | |
| dc.relation.ispartof | UMBC Graduate School Collection | |
| dc.relation.ispartof | UMBC Student Collection | |
| dc.rights | This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu | |
| dc.source | Original File Name: Khan_umbc_0434D_13091.pdf | |
| dc.subject | Discrete Fourier Transform | |
| dc.subject | Feature Modulation | |
| dc.subject | Mixture of Experts | |
| dc.subject | Multimodal AI | |
| dc.subject | Singular Value Decomposition | |
| dc.subject | Vision Language Model | |
| dc.title | Frequency-Aware Mixture of Experts Model for Robust Multimodal Perception | |
| dc.type | Text | |
| dcterms.accessRights | Distribution Rights granted to UMBC by the author. |
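The abstract's two fusion components can be illustrated with a minimal sketch. The gating mechanism is described as computing reliability scores from log-magnitude spectral features, and the modulation module as adapting visual features via fused semantic embeddings. The snippet below is an interpretation, not the dissertation's implementation: the choice of mean log-magnitude energy as the reliability score, the softmax normalization, and the function names (`reliability_gate`, `film_modulate`) are all assumptions introduced for illustration.

```python
import numpy as np

def log_magnitude_spectrum(img):
    """2D DFT log-magnitude features of a single-channel image."""
    spec = np.fft.fft2(img)
    return np.log1p(np.abs(np.fft.fftshift(spec)))

def reliability_gate(rgb, ir):
    """Softmax gate over per-modality spectral energy.

    Assumption: mean log-magnitude stands in for the dissertation's
    reliability score; returns (w_rgb, w_ir) summing to 1.
    """
    scores = np.array([log_magnitude_spectrum(rgb).mean(),
                       log_magnitude_spectrum(ir).mean()])
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

def film_modulate(features, gamma, beta):
    """Feature-wise modulation (FiLM-style scale-and-shift).

    gamma and beta would be predicted from fused text/audio embeddings
    in the full model; here they are passed in directly.
    """
    return gamma * features + beta
```

Under this reading, the fused visual representation would be a gated combination such as `w[0] * f_rgb + w[1] * f_ir`, subsequently conditioned by `film_modulate` using semantics from the text and audio streams.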
Files
- Name: Khan-Azim_Open.pdf
- Size: 1.57 MB
- Format: Adobe Portable Document Format