Person Re-Identification using Vision Transformer with Auxiliary Tokens

Author/Creator

Author/Creator ORCID

Date

2021-01-01

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.

Abstract

Person re-identification (re-ID) is an object re-ID problem that aims to re-identify a person by finding associations between images of that person captured by multiple cameras. Because of its foundational role in computer-vision-based video surveillance applications, it is vital to generate a robust feature embedding to represent a person. CNN-based methods are known for their feature-learning abilities and were for many years the prime choice for person re-ID. In this thesis, we explore a method that combines auxiliary local tokens with the global token of the Vision Transformer to generate the final feature embedding. We also propose a novel blockwise fine-tuning technique that improves the performance of the Vision Transformer. Our model trained with blockwise fine-tuning achieves 96.6% rank-1 accuracy and a 90.3% mAP score on the Market-1501 dataset. On the CUHK-03 dataset, it achieves 97.5% rank-1 accuracy and a 95.03% mAP score. These results are comparable to those of many recently published methods.
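The abstract describes the two techniques only at a high level, so the sketches below illustrate one plausible reading in PyTorch; they are not the thesis's actual implementation. The first shows how a final re-ID embedding could be formed by concatenating a Vision Transformer's global [cls] token with a handful of auxiliary local tokens; the token layout and the number of auxiliary tokens (num_aux) are assumptions made for illustration.

import torch

# Encoder output with assumed shape (batch, 1 + K + N, dim): one global
# [cls] token, K auxiliary local tokens, then N patch tokens.
def build_embedding(tokens: torch.Tensor, num_aux: int = 4) -> torch.Tensor:
    global_feat = tokens[:, 0]              # global [cls] token
    local_feats = tokens[:, 1:1 + num_aux]  # auxiliary local tokens
    # Final embedding: global token concatenated with flattened local tokens.
    return torch.cat([global_feat, local_feats.flatten(1)], dim=1)

The second sketch gives one possible interpretation of blockwise fine-tuning: transformer blocks are unfrozen progressively, from the last block backward, so early stages adapt only the top of the network. It assumes a timm-style ViT that exposes its blocks as model.blocks and its classifier as model.head; the unfreezing schedule and learning rate are placeholders.

import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True)

# Begin with the entire backbone frozen.
for p in model.parameters():
    p.requires_grad = False

def unfreeze_last_n_blocks(model, n: int) -> None:
    """Unfreeze the head plus the last n transformer blocks."""
    for p in model.head.parameters():
        p.requires_grad = True
    for block in model.blocks[len(model.blocks) - n:]:
        for p in block.parameters():
            p.requires_grad = True

# Illustrative schedule: open more blocks at each training stage.
for n_open in (1, 2, 4, 8, 12):
    unfreeze_last_n_blocks(model, n_open)
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
    # ... train for a few epochs with this optimizer ...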