In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of the computations of these two paradigms is in fact performed with the same operation. This observation naturally leads to an elegant integration of the two seemingly distinct paradigms, i.e., a mixed model that enjoys the benefits of both self-Attention and Convolution (ACmix), while incurring minimal computational overhead compared to its pure convolution or self-attention counterparts.
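To make the shared-operation idea concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation): the input is projected once with 1x1 convolutions, and the projected features are then reused by both a self-attention branch and a convolution branch, whose outputs are mixed with learnable scalars. The module name `ACmixSketch`, the depth-wise convolution used as a stand-in for the paper's shift-and-aggregate step, and the mixing parameters `alpha`/`beta` are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class ACmixSketch(nn.Module):
    """Illustrative sketch: share 1x1 projections between a self-attention
    branch and a convolution branch, then mix the two outputs with learnable
    scalars. Names, shapes, and the conv-branch aggregation are assumptions,
    not the paper's exact design."""

    def __init__(self, channels, heads=4, kernel_size=3):
        super().__init__()
        assert channels % heads == 0, "channels must be divisible by heads"
        self.heads = heads
        # Shared 1x1 projections: the common "bulk" operation of both paradigms.
        self.proj_q = nn.Conv2d(channels, channels, 1)
        self.proj_k = nn.Conv2d(channels, channels, 1)
        self.proj_v = nn.Conv2d(channels, channels, 1)
        # Convolution branch: fuse the three projected maps, then aggregate
        # locally with a depth-wise k x k convolution (a stand-in for the
        # paper's shift-and-sum aggregation).
        self.conv_fuse = nn.Conv2d(3 * channels, channels, 1)
        self.conv_agg = nn.Conv2d(channels, channels, kernel_size,
                                  padding=kernel_size // 2, groups=channels)
        # Learnable weights for mixing the two branches.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)

        # Self-attention branch: global attention over all spatial positions,
        # split into heads (a windowed variant would be used in practice).
        d = c // self.heads

        def split(t):  # (b, c, h, w) -> (b, heads, h*w, d)
            return t.view(b, self.heads, d, h * w).transpose(2, 3)

        attn = torch.softmax(
            split(q) @ split(k).transpose(2, 3) / d ** 0.5, dim=-1)
        out_att = (attn @ split(v)).transpose(2, 3).reshape(b, c, h, w)

        # Convolution branch reusing the same projected features.
        out_conv = self.conv_agg(self.conv_fuse(torch.cat([q, k, v], dim=1)))

        # Mix the two paths.
        return self.alpha * out_att + self.beta * out_conv


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    y = ACmixSketch(64)(x)
    print(y.shape)  # torch.Size([2, 64, 32, 32])
```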
Xuran Pan, Chunjiang Ge, Rui Lu, Shiji Song, Guanfu Chen, Zeyi Huang, Gao Huang