Facial Expression Recognition with an Efficient Mix Transformer for Affective Human-Robot Interaction
Author/Creator
Safavi, Farshad; Patel, Kulin; Vinjamuri, Ramana
Date
2025
Type of Work
Department
Program
Citation of Original Publication
Safavi, Farshad, Kulin Patel, and Ramana Vinjamuri. “Facial Expression Recognition with an Efficient Mix Transformer for Affective Human-Robot Interaction.” IEEE Transactions on Affective Computing, 2025, 1–14. https://doi.org/10.1109/TAFFC.2025.3567966.
Rights
© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract
Emotion recognition can significantly enhance interactions between humans and robots, particularly in shared tasks and collaborative processes. Facial Expression Recognition (FER) allows affective robots to adapt their behavior in a socially appropriate manner. However, the potential of efficient Transformers for FER remains underexplored. Additionally, leveraging self-attention mechanisms to create segmentation masks that accentuate facial landmarks for improved accuracy has not been fully investigated. Furthermore, current FER methods lack computational efficiency and scalability, limiting their applicability in real-time scenarios. We therefore developed EmoFormer, a robust, scalable, and generalizable model that incorporates an efficient Mix Transformer block along with a novel fusion block, and scales across a range of models from EmoFormer-B0 to EmoFormer-B2. The main innovation lies in the fusion block, which uses element-wise multiplication of facial-landmark masks with the feature map to emphasize landmark regions. This integration of local and global attention yields powerful representations. The efficient self-attention mechanism within the Mix Transformer establishes connections among facial regions, improving efficiency while maintaining accuracy in emotion classification from facial landmarks. We evaluated our approach for both categorical and dimensional facial expression recognition on four datasets: FER2013, AffectNet-7, AffectNet-8, and DEAP. Our ensemble method achieved state-of-the-art results, with accuracies of 77.35% on FER2013, 67.71% on AffectNet-7, and 65.14% on AffectNet-8. For the DEAP dataset, our method achieved 98.07% accuracy for arousal and 97.86% for valence, confirming the robustness and generalizability of our models. As an application of our method, we implemented EmoFormer in an affective robotic arm, enabling the human-robot interaction system to adjust its speed based on the user's facial expressions. This was validated through a user experiment with six subjects, demonstrating the feasibility and effectiveness of our approach in creating emotionally intelligent human-robot interactions. Overall, our results show that EmoFormer is a robust, efficient, and scalable solution for FER, with significant potential for advancing human-robot interaction through emotion-aware robotics.
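To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch of a landmark fusion block: a soft facial-landmark mask is projected to the feature depth and multiplied element-wise with the backbone feature map. The layer choices, tensor shapes, and residual connection are illustrative assumptions, not the authors' published EmoFormer implementation.

```python
import torch
import torch.nn as nn


class LandmarkFusionBlock(nn.Module):
    """Illustrative fusion of a landmark mask with backbone features (assumed design)."""

    def __init__(self, channels: int):
        super().__init__()
        # Project the single-channel landmark mask to the feature depth.
        self.mask_proj = nn.Conv2d(1, channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, features: torch.Tensor, landmark_mask: torch.Tensor) -> torch.Tensor:
        # features:      (B, C, H, W) feature map from a Mix Transformer stage
        # landmark_mask: (B, 1, H, W) soft mask highlighting facial landmarks
        attention = torch.sigmoid(self.mask_proj(landmark_mask))
        # Element-wise multiplication emphasizes landmark regions, while the
        # residual sum preserves the global context from the backbone.
        return self.norm(features * attention + features)


if __name__ == "__main__":
    block = LandmarkFusionBlock(channels=64)
    feats = torch.randn(2, 64, 56, 56)
    mask = torch.rand(2, 1, 56, 56)
    print(block(feats, mask).shape)  # torch.Size([2, 64, 56, 56])
```

In the architecture described above, such a fused map would feed subsequent Mix Transformer stages; the sketch only illustrates how element-wise multiplication lets landmark cues reweight globally attended features.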