Spatial Attention and Channel Attention

WidedMiled2 · Feb 27, 2025

About This Presentation

spatial and channel attention


Slide Content

What are Spatial Attention and Channel Attention? Ref: https://blog.paperspace.com/attention-mechanisms-in-computer-vision-cbam/

Although the Convolutional Block Attention Module (CBAM) was brought into fashion by the ECCV 2018 paper titled "CBAM: Convolutional Block Attention Module", the general concept was introduced in the 2016 paper titled "SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning". SCA-CNN demonstrated the potential of combining multi-layered attention, i.e. Spatial Attention and Channel Attention, the two building blocks of CBAM, for Image Captioning. The CBAM paper was the first to successfully showcase the wide applicability of the module, especially for Image Classification and Object Detection tasks. CBAM contains two sequential sub-modules called the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which are applied in that particular order.

So, what is meant by Spatial Attention? "Spatial" refers to the domain space encapsulated within each feature map. Spatial attention represents the attention mechanism/attention mask applied to a feature map, i.e. a single cross-sectional slice of the tensor. For instance, if the object of interest in an image is a bird, the Spatial Attention will generate a mask that enhances the features defining that bird. By refining the feature maps with Spatial Attention in this way, we enhance the input to the subsequent convolutional layers, which improves the performance of the model.

Then what is Channel Attention, and do we even need it? As discussed above, channels are essentially the feature maps stacked in a tensor, where each cross-sectional slice is a feature map of dimension (h × w). Usually in convolutional layers, the trainable weights making up the filters learn generally small values (close to zero), so we observe similar feature maps, with many appearing to be near-copies of one another. This observation was a main driver for the CVPR 2020 paper titled "GhostNet: More Features from Cheap Operations". Even though they look similar, these filters are extremely useful in learning different types of features: while some are specific to learning horizontal and vertical edges, others are more general and learn a particular texture in the image. Channel attention essentially provides a weight for each channel, enhancing the channels that contribute most to learning and thereby boosting overall model performance.

Why use both, isn't either one sufficient? Well, technically yes and no; the authors' code implementation provides the option to use only Channel Attention and switch off Spatial Attention. However, for best results it is advised to use both. In layman's terms, channel attention says which feature map is important for learning and enhances, or as the authors say, "refines" it, while spatial attention conveys what within the feature map is essential to learn. Combining both robustly enhances the feature maps, which explains the significant improvement in model performance.

Spatial Attention Module (SAM): SAM comprises a three-fold sequential operation. The first part is called the Channel Pool, where the input tensor of dimensions (c × h × w) is decomposed into 2 channels, i.e. (2 × h × w), with the 2 channels representing Max Pooling and Average Pooling across the channels. This serves as the input to a convolution layer which outputs a 1-channel feature map, i.e., the dimension of the output is (1 × h × w). This convolution layer preserves the spatial dimensions and uses padding to do so. The output is then passed to a Sigmoid activation layer. Sigmoid, being a probabilistic activation, maps all the values to a range between 0 and 1. This Spatial Attention mask is then applied to all the feature maps in the input tensor using a simple element-wise product.
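To make the sequence concrete, here is a minimal PyTorch-style sketch of such a spatial attention block. It is not the authors' official implementation; the class name and the 7×7 kernel default are illustrative choices.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of a SAM-style block: channel pool (max + mean across channels)
    -> padded k x k conv -> sigmoid, then the (1 x h x w) mask scales every
    channel of the input."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Padding keeps the spatial dimensions (h, w) unchanged.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel pool: (b, c, h, w) -> (b, 2, h, w)
        max_pool, _ = torch.max(x, dim=1, keepdim=True)
        avg_pool = torch.mean(x, dim=1, keepdim=True)
        pooled = torch.cat([max_pool, avg_pool], dim=1)
        # (b, 2, h, w) -> (b, 1, h, w) attention mask with values in [0, 1]
        mask = torch.sigmoid(self.conv(pooled))
        # Broadcasted element-wise product over all channels
        return x * mask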

Channel Attention Module (CAM): At first glance, CAM resembles the Squeeze-and-Excitation (SE) layer, first proposed in the CVPR/TPAMI 2018 paper titled "Squeeze-and-Excitation Networks". Let's do a quick review of the Squeeze-Excitation module. It has the following components: Global Average Pooling (GAP), a Multi-Layer Perceptron (MLP) bottleneck whose width is governed by a reduction ratio (r), and a sigmoid activation. The input to the SE block is a tensor of dimension (c × h × w). Global Average Pooling is an Average Pooling operation in which each feature map is reduced to a single pixel, so each channel is decomposed to a (1 × 1) spatial dimension. The output of the GAP is thus a 1-D vector of length c, which can be represented as (c × 1 × 1). This vector is passed as the input to the MLP, which has a bottleneck whose width, or number of neurons, is decided by the reduction ratio (r): the higher the reduction ratio, the fewer the neurons in the bottleneck, and vice versa. The output vector from the MLP is then passed to a sigmoid activation layer, which maps the values in the vector to the range between 0 and 1.
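As a reference point, the following is a minimal PyTorch-style sketch of an SE-style block built from the description above; it is not the original authors' code, and the class name and the reduction default of 16 are illustrative assumptions.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Sketch of an SE-style block: GAP squeezes each channel to one value,
    then a bottleneck MLP (width c // r) plus sigmoid produces per-channel
    weights in [0, 1]."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Squeeze: (b, c, h, w) -> (b, c), one value per channel
        squeezed = x.mean(dim=(2, 3))
        # Excite: bottleneck MLP + sigmoid -> per-channel weights
        weights = torch.sigmoid(self.mlp(squeezed)).view(b, c, 1, 1)
        return x * weights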

Channel Attention Module (CAM): CAM is pretty similar to the Squeeze-Excitation layer, with a small modification. Instead of reducing the feature maps to a single pixel only by Global Average Pooling (GAP), it decomposes the input tensor into 2 separate vectors of dimensionality (c × 1 × 1), one generated by GAP and the other by Global Max Pooling (GMP). Both descriptors are passed through a shared MLP, and the two outputs are summed before the sigmoid produces the final per-channel weights. Average pooling is mainly used for aggregating spatial information, whereas max pooling preserves much richer contextual information in the form of the edges of the object within the image, which leads to finer channel attention. Simply put, average pooling has a smoothing effect, while max pooling has a much sharper effect and preserves the natural edges of objects more precisely. The authors validate this in their experiments, showing that using both Global Average Pooling and Global Max Pooling gives better results than using just GAP, as in the case of Squeeze-and-Excitation Networks.
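Putting the pieces together, the sketch below shows a CAM-style block with a shared MLP over the GAP and GMP descriptors, plus a small wrapper that applies channel attention followed by spatial attention in the CBAM order. It reuses the SpatialAttention class from the SAM sketch above; as before, the names and defaults are illustrative, not the authors' official implementation.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of a CAM-style block: GAP and GMP descriptors go through the
    same shared MLP; the two outputs are summed and squashed by a sigmoid
    into per-channel weights."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg_desc = self.mlp(x.mean(dim=(2, 3)))   # GAP branch: (b, c)
        max_desc = self.mlp(x.amax(dim=(2, 3)))   # GMP branch: (b, c)
        weights = torch.sigmoid(avg_desc + max_desc).view(b, c, 1, 1)
        return x * weights


class CBAM(nn.Module):
    """Channel attention followed by spatial attention, in the CBAM order.
    SpatialAttention is the class from the earlier SAM sketch."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention(kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sam(self.cam(x))

For example, CBAM(channels=64)(torch.randn(2, 64, 32, 32)) returns a tensor of the same shape, since both sub-modules only rescale the input feature maps rather than changing their dimensions.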