In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of the computations of these two paradigms is in fact performed with the same operation. This observation naturally leads to an elegant integration of the two seemingly distinct paradigms, i.e., a mixed model that enjoys the benefits of both self-Attention and Convolution (ACmix), while incurring minimal computational overhead compared to its pure convolution or self-attention counterparts.
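To make the shared-operation idea concrete, the following is a minimal, illustrative sketch (not the paper's actual implementation): the input is projected once with 1x1 convolutions, and the projected features are then reused by both a self-attention branch and a convolution branch, whose outputs are mixed with learnable scalars. The module name `ACmixSketch`, the depth-wise convolution used as a stand-in for the paper's shift-and-aggregate step, and the mixing parameters `alpha`/`beta` are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class ACmixSketch(nn.Module):
    """Illustrative sketch: share 1x1 projections between a self-attention
    branch and a convolution branch, then mix the two outputs with learnable
    scalars. Names, shapes, and the conv-branch aggregation are assumptions,
    not the paper's exact design."""

    def __init__(self, channels, heads=4, kernel_size=3):
        super().__init__()
        assert channels % heads == 0, "channels must be divisible by heads"
        self.heads = heads
        # Shared 1x1 projections: the common "bulk" operation of both paradigms.
        self.proj_q = nn.Conv2d(channels, channels, 1)
        self.proj_k = nn.Conv2d(channels, channels, 1)
        self.proj_v = nn.Conv2d(channels, channels, 1)
        # Convolution branch: fuse the three projected maps, then aggregate
        # locally with a depth-wise k x k convolution (a stand-in for the
        # paper's shift-and-sum aggregation).
        self.conv_fuse = nn.Conv2d(3 * channels, channels, 1)
        self.conv_agg = nn.Conv2d(channels, channels, kernel_size,
                                  padding=kernel_size // 2, groups=channels)
        # Learnable weights for mixing the two branches.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)

        # Self-attention branch: global attention over all spatial positions,
        # split into heads (a windowed variant would be used in practice).
        d = c // self.heads

        def split(t):  # (b, c, h, w) -> (b, heads, h*w, d)
            return t.view(b, self.heads, d, h * w).transpose(2, 3)

        attn = torch.softmax(
            split(q) @ split(k).transpose(2, 3) / d ** 0.5, dim=-1)
        out_att = (attn @ split(v)).transpose(2, 3).reshape(b, c, h, w)

        # Convolution branch reusing the same projected features.
        out_conv = self.conv_agg(self.conv_fuse(torch.cat([q, k, v], dim=1)))

        # Mix the two paths.
        return self.alpha * out_att + self.beta * out_conv


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    y = ACmixSketch(64)(x)
    print(y.shape)  # torch.Size([2, 64, 32, 32])
```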
Xuran Pan, Chunjiang Ge, Rui Lu, Shiji Song, Guanfu Chen, Zeyi Huang, Gao Huang