Computer Science - Computer Vision and Pattern Recognition
Modeling in computer vision has evolved toward MLPs. Vision MLPs inherently lack
local modeling capability, and the simplest remedy is to combine them with
convolutional layers. Convolution, famous for its sliding-window scheme, also
suffers from the redundancy and low computational efficiency of that scheme. In
this paper, we seek to dispense with the windowing scheme and introduce a more
elaborate and effective approach to exploiting locality. To this end, we
propose a new MLP module, namely Shifted-Pillars-Concatenation (SPC), that
consists of two steps: (1) Pillars-Shift, which generates four
neighboring maps by shifting the input image along four directions, and (2)
Pillars-Concatenation, which applies linear transformations and concatenation
on the maps to aggregate local features. The SPC module offers superior local
modeling power and performance gains, making it a promising alternative to the
convolutional layer. We then build a pure-MLP architecture called Caterpillar
by replacing the convolutional layer in the hybrid model sMLPNet with the SPC
module. Extensive experiments show Caterpillar's excellent performance and
scalability on both ImageNet-1K and small-scale classification benchmarks.
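
The two SPC steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how borders are handled or how the linear transformations are sized, so the zero padding, the one-pixel shift, the projection of each shifted map to C/4 channels, and the names `shift_map`, `linear`, and `spc` are all assumptions made for illustration. Plain Python lists are used so that the sketch runs without dependencies.

```python
def shift_map(x, dy, dx):
    """Shift an H x W x C map by (dy, dx) pixels.

    Vacated border positions are zero-filled (an assumption; the abstract
    does not specify the padding scheme).
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si, sj = i - dy, j - dx  # source pixel that lands on (i, j)
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = list(x[si][sj])
    return out


def linear(x, weight):
    """Apply the same linear map to every pixel; weight is C_out x C_in."""
    return [
        [[sum(wk * v for wk, v in zip(row, pix)) for row in weight] for pix in line]
        for line in x
    ]


def spc(x, weights):
    """Hypothetical SPC sketch: shift along four directions, project, concatenate.

    weights holds four C_out x C_in matrices, one per direction.
    """
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    # Step 1 (Pillars-Shift) and the per-map linear transformation.
    maps = [linear(shift_map(x, dy, dx), wt) for (dy, dx), wt in zip(directions, weights)]
    # Step 2 (Pillars-Concatenation): join the four maps along the channel axis.
    h, w = len(x), len(x[0])
    return [[sum((m[i][j] for m in maps), []) for j in range(w)] for i in range(h)]


# Usage: a 3 x 3 map with C = 4 channels, each pixel filled with 10 * row + col.
x = [[[10 * i + j] * 4 for j in range(3)] for i in range(3)]
# Each weight maps 4 channels to 1 (C/4) by selecting the first channel.
weights = [[[1, 0, 0, 0]] for _ in range(4)]
y = spc(x, weights)
```

In this sketch, each output pixel gathers one projected feature from each of its four neighbors, so local context is aggregated without sliding a window over the input.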
Details
Title
Caterpillar: A Pure-MLP Architecture with Shifted-Pillars-Concatenation