Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention

Xuran Pan, Tianzhu Ye, Zhuofan Xia, Shiji Song, Gao Huang

March 2023

Abstract

Self-attention has played an important role in the recent progress of Vision Transformers. Modern Transformers mainly adopt sparse global attention or window attention to alleviate excessive computational complexity, but the former is inefficient at extracting local features, and the latter may depend on handcrafted designs. In comparison, local attention, which constrains the receptive field of each query to its own neighboring pixels, enjoys the advantages of both convolution and self-attention, namely local inductive bias and adaptive feature extraction. Nevertheless, current local attention modules either use the inefficient Im2Col function or rely on specific CUDA kernels that are hard to generalize to devices without CUDA support. In this paper, we propose a novel local attention module, dubbed Slide Attention, which uses only common convolution operations and achieves high efficiency, flexibility, and generalizability. Specifically, we first re-interpret the column-based Im2Col function from a new row-based view and use Depthwise Convolution as an efficient substitute. On this basis, we propose a deformed shifting module based on the re-parameterization technique, which further relaxes the fixed key/value positions to deformed features in the local region. In this way, our module realizes the local attention paradigm in a manner that is both efficient and flexible. Extensive experiments show that our Slide Attention module is applicable to a variety of advanced Vision Transformer models, is compatible with various hardware devices, and achieves consistently improved performance on comprehensive benchmarks.
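The row-based reading of Im2Col described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`im2col_neighbors`, `shift_neighbors`, `depthwise_conv2d`, `local_attention`), the zero padding at the borders, and the single-head layout are assumptions made for illustration, and the deformed shifting module is not included. The sketch only checks that gathering a k×k neighborhood per pixel (column view) yields the same keys and values as shifting the whole feature map in k×k directions with one-hot depthwise kernels (row view).

```python
import numpy as np

def im2col_neighbors(x, k):
    """Column-based view: for every pixel, gather its k*k neighbors.
    x: (C, H, W) -> (k*k, C, H, W), with zero padding at the borders."""
    C, H, W = x.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.empty((k * k, C, H, W), dtype=x.dtype)
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]            # (C, k, k)
            out[:, :, i, j] = patch.reshape(C, k * k).T
    return out

def depthwise_conv2d(x, weight):
    """Naive depthwise cross-correlation with zero padding.
    x: (C, H, W), weight: (C, k, k) -> (C, H, W)."""
    C, H, W = x.shape
    k = weight.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for di in range(k):
        for dj in range(k):
            out += weight[:, di, dj][:, None, None] * xp[:, di:di + H, dj:dj + W]
    return out

def shift_neighbors(x, k):
    """Row-based view: each of the k*k rows is the whole feature map
    shifted in one direction, produced by a one-hot depthwise kernel."""
    C = x.shape[0]
    rows = []
    for di in range(k):
        for dj in range(k):
            w = np.zeros((C, k, k), dtype=x.dtype)
            w[:, di, dj] = 1.0                          # one-hot kernel = pure shift
            rows.append(depthwise_conv2d(x, w))
    return np.stack(rows)                               # (k*k, C, H, W)

def local_attention(q, keys, values):
    """Single-head local attention over the k*k neighbors of each pixel.
    q: (C, H, W); keys, values: (k*k, C, H, W) -> (C, H, W).
    Assumption: zero-padded border positions take part in the softmax."""
    C = q.shape[0]
    logits = (q[None] * keys).sum(axis=1) / np.sqrt(C)  # (k*k, H, W)
    logits = logits - logits.max(axis=0, keepdims=True) # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)
    return (attn[:, None] * values).sum(axis=0)

rng = np.random.default_rng(0)
q, k_feat, v_feat = rng.standard_normal((3, 4, 6, 5))   # three (C=4, H=6, W=5) maps
out_cols = local_attention(q, im2col_neighbors(k_feat, 3), im2col_neighbors(v_feat, 3))
out_rows = local_attention(q, shift_neighbors(k_feat, 3), shift_neighbors(v_feat, 3))
print(np.allclose(out_cols, out_rows))
```

In a deep learning framework, the k×k one-hot kernels would typically be stacked into a single depthwise (grouped) convolution rather than looped over; the loop is kept here only to make the per-direction shift explicit.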

Type

Conference paper

Publication

In Computer Vision and Pattern Recognition (CVPR) 2023

Xuran Pan

Ph.D. Student