Image Classification in 2022

Written by joshini | Published 2022/08/10
Tech Story Tags: swin-transformer | image-classification | dog-breed-identification | cnn-model | efficientnet | ensemble-model | hackernoon-top-story | machine-learning


Transformer-based vision models are gradually evolving and are reported to be as good as Convolutional models on Classification, Segmentation and Action Recognition tasks. We already have a whole array of Convolutional models for vision tasks, and they remain more popular than Transformer-based ones. This blog delves into the Swin Transformer vision model presented at the International Conference on Computer Vision (ICCV) 2021 by the Microsoft Research team, and benchmarks its performance against several SOTA Convolution-based models on the Dog Breed Image Classification task.

Will transformer-based models become the next big thing in Computer Vision? With transformers being a successful solution for language tasks, will they unify the various AI subfields and present powerful solutions to more complex problems? So I rolled up my sleeves to evaluate, to begin with, how good they are at the classification task.

Chosen Data and its Problem Statement:

The myriad of dog breeds, with subtle differences in their physical appearances, has been a challenge to veterinarians, dog owners, animal shelter staff and potential dog owners in identifying the right breed. They need to identify the breed in order to provide appropriate training and treatment and to meet nutritional needs. The data is sourced from the Stanford Dogs Dataset, which contains ~20K images of 120 breeds of dogs from across the world. This data has been split almost equally into train and test sets for the Kaggle competition Dog Breed Identification.

The objective of the solution is to build a dog breed identification system capable of correctly identifying dog breeds with minimal data, including similar-looking breeds. This is a multi-class classification task: for every image, the model has to predict a probability for each of the 120 breeds, and the breed with the highest probability is the most probable breed of the dog in the image.

Exploratory Data Analysis

  • Per breed, the training data has a maximum of a little over 120 images and a minimum of 66 images; each class has about 85 images on average. From the bar plot, it is clear that there is no class imbalance.

  • Though there is no class imbalance, the data may be insufficient to train a neural network from scratch. Image augmentation using random image perturbations, together with pre-trained models, can circumvent this problem (a minimal augmentation sketch follows this list).

  • The top 5 breeds with the most images are scottish_deerhound, maltese_dog, afghan_hound, entlebucher and bernese_mountain_dog. The bottom 5 breeds with the fewest images are golden_retriever, brabancon_griffon, komondor, eskimo_dog and briard.

  • A quick analysis of the spatial dimensions of the training images is done to understand the distribution of image heights, widths and aspect ratios.

  • Images with very low (<0.5) or very high (>2.3) aspect ratios are treated as anomalies; an aspect ratio of around 1.5 is considered good.

  • While analysing the various dog breeds, the following pairs and groups of breeds were generally found to look alike.

    • 'boston_bull', 'french_bulldog'
    • 'beagle', 'english_foxhound'
    • 'bernese_mountain_dog', 'greater_swiss_mountain_dog'
    • 'malamute', 'siberian_husky'
    • 'basenji', 'ibizan_hound'
    • 'vizsla', 'rhodesian_ridgeback'
    • 'lhasa', 'shih-tzu'
    • 'whippet', 'italian_greyhound'
    • 'brittany_spaniel', 'welsh_springer_spaniel', 'blenheim_spaniel'
    • 'malinois', 'german_shepherd'
    • 'border_collie', 'collie'
    • 'norfolk_terrier', 'norwich_terrier'
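
A minimal sketch of the kind of random-perturbation augmentation mentioned above, using Keras preprocessing layers. The layer choices and parameter values are illustrative assumptions, not the exact pipeline used in these experiments; train_ds is assumed to yield (image, label) batches as in the training code later in this post.

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative augmentation pipeline (assumed values): random flips, rotations
# and zooms give the network perturbed copies of the limited training images.
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),   # rotate by up to ~10% of a full circle (~36 degrees)
    layers.RandomZoom(0.2),
])

# Applied on the fly while building the training pipeline.
augmented_train_ds = train_ds.map(
    lambda images, labels: (augmentation(images, training=True), labels)
)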

Evaluation Metrics

  • LogLoss, which stringently evaluates the confidence of the model predictions for all 120 classes by comparing the prediction probability to the ground truth probability, is the key metric.
  • An N*N confusion matrix (120*120 here) is used to study the predictions and check whether any pairs of classes are systematically confused by the model and incorrectly predicted (a short computation sketch follows this list).
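
A minimal sketch of how both metrics can be computed with scikit-learn; y_true and y_pred_proba are assumed placeholders for the ground-truth class ids and the model's predicted probabilities.

import numpy as np
from sklearn.metrics import log_loss, confusion_matrix

# y_true       : integer class ids, shape (n_samples,)           -- assumed
# y_pred_proba : predicted probabilities, shape (n_samples, 120) -- assumed

# Multi-class log loss: heavily penalises confident wrong predictions.
ll = log_loss(y_true, y_pred_proba, labels=np.arange(120))

# 120 x 120 confusion matrix: off-diagonal cells reveal breed pairs
# that the model confuses with each other.
cm = confusion_matrix(y_true, np.argmax(y_pred_proba, axis=1))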

SWIN - Transformer Based Model

This architecture is based on the paper “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” by the Microsoft Research team. The paper presents an improved ViT architecture that produces a hierarchical representation of feature maps, reducing the computational complexity of the self-attention mechanism from quadratic to linear. It is reported to give results comparable to SOTA convolutional networks like EfficientNet on the ImageNet classification problem.

The building blocks of this architecture are explained in the notes below:

Image Patching:

In NLP, the tokens, which are the processing elements of a model, are the words in a sentence, and the size of 1 token is 1 (just 1 word). ViT (Vision Transformers) treat “image patches” as tokens, where each patch is a partition of an image consisting of a group of neighbouring pixels. Each image patch is a token. The size of 1 token in any ViT is patch_height * patch_width * number of channels. Based on the patch dimensions, we get a number of patches or tokens for a single image. If the image size (H*W*3) is 224 * 224 * 3 pixels and the patch size is 4 * 4, then we get 224/4 * 224/4 = 56 * 56 patches, i.e. 3136 tokens from the image. Each token or patch will be of size 4*4*3 = 48 dimensions, i.e. 48 pixel values of data. So the input to the architecture for this image consists of 3136 tokens, each of 48 dimensions.
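
The arithmetic above can be sanity-checked in a couple of lines (values taken from the example in the text):

image_height, image_width, channels = 224, 224, 3
patch_size = 4

num_patches = (image_height // patch_size) * (image_width // patch_size)
token_dim   = patch_size * patch_size * channels

print(num_patches, token_dim)   # 3136 tokens, each of 48 dimensions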

The underlying mechanism of the SWIN transformer is analogous to any CNN-based architecture, where the spatial dimensions of the image are decreased and the number of channels is increased. At every stage of its hierarchical architecture, the SWIN transformer likewise reduces the number of image patches (tokens) while increasing the token dimensions. With this mechanism in mind, the SWIN architecture is easier to understand.


At every stage of the architecture, we can see the number of tokens decreasing while the token size is increasing.

Apart from the “Patch Partitioner”, the SWIN-T architecture is made up of 3 other building blocks - Linear Embedding, Swin Transformer Block and Patch Merging. These building blocks are repeated and process the feature maps in a hierarchical fashion.

Linear Embedding:

The 3136 tokens of 48 dimensions each from the “Patch Partitioner” are fed to a feed-forward layer that embeds each 48-dimensional token into a feature vector of size ‘C’. ‘C’ acts as the capacity of the transformer, and the SWIN architecture has 4 variants based on it:

  • Tiny (Swin-T)  - C is ‘96’
  • Small (Swin-S) - C is ‘96’
  • Base(Swin-B)  -  C is ‘128’
  • Large(Swin-L) - C is ‘192’

Image patching and linear embedding are jointly implemented as a single convolution whose kernel size and stride length are the same as the patch size, and whose number of output channels is ‘C’.
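
A minimal sketch of that joint patch-partition plus linear-embedding step as a single strided convolution (C = 96 for Swin-T); this mirrors the idea described above rather than the exact tfswin implementation.

import tensorflow as tf
from tensorflow.keras import layers

C, patch_size = 96, 4   # Swin-T capacity and patch size

# One convolution with kernel size == stride == patch size plays the role of
# "Patch Partitioner" + "Linear Embedding": each 4x4x3 patch is projected to C dims.
patch_embed = layers.Conv2D(filters=C, kernel_size=patch_size, strides=patch_size)

images = tf.random.uniform((1, 224, 224, 3))
x = patch_embed(images)                    # shape (1, 56, 56, 96)
tokens = tf.reshape(x, (1, 56 * 56, C))    # 3136 tokens of dimension 96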

SWin Transformer Block:

The SWin Transformer Block is different from the standard transformer block in the ViT architecture. In SWin Transformers, the Multi-head Self Attention (MSA) layer is replaced by either the Window MSA (W-MSA) module or the Shifted Window MSA (SW-MSA) module.


Stage 1 consists of 2 SWIN-T Transformer Blocks (see the architecture figure), where the first block uses the Window MSA (W-MSA) module and the second uses the Shifted Window MSA (SW-MSA) module. In the SWin Transformer Block, the inputs to the W-MSA and SW-MSA layers pass through normalization layers, and their outputs are then fed to a 2-layer feed-forward network with Gaussian Error Linear Unit (GELU) activation, again preceded by a normalization layer. There are residual connections within each block and between these 2 blocks.
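
A schematic sketch of one such block. The window attention itself is abstracted behind an attention_layer callable, which is an assumption here; the real implementation also reshapes tokens into windows and adds relative position biases.

import tensorflow as tf
from tensorflow.keras import layers

def swin_block(x, attention_layer, mlp_dim):
    # One Swin block: LN -> (S)W-MSA -> residual, then LN -> MLP(GELU) -> residual.
    # mlp_dim is the hidden size of the feed-forward network (4*C in the paper).
    shortcut = x
    x = layers.LayerNormalization()(x)
    x = attention_layer(x)            # W-MSA in the first block, SW-MSA in the second
    x = x + shortcut                  # residual connection around the attention

    shortcut = x
    x = layers.LayerNormalization()(x)
    x = layers.Dense(mlp_dim, activation="gelu")(x)   # 2-layer feed-forward network
    x = layers.Dense(shortcut.shape[-1])(x)
    return x + shortcut               # residual connection around the MLP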

Window MSA (W-MSA) and Shifted Window MSA (SW-MSA) modules

Why is the standard attention layer in ViT replaced with the Windowed MSA layer?

The standard attention layer in ViT is a global one, calculating the attention of each patch with every other patch in the image, leading to computational complexity that is quadratic in the number of patches. This doesn't scale well for high-resolution images.

The self-attention mechanism in the W-MSA or SW-MSA module is a local one that calculates self-attention only between patches within the same window of the image and not outside the windows.

Windows are larger partitions of the image, where each window comprises M*M patches. Replacing global self-attention with local self-attention reduces the computational complexity from quadratic to linear in the number of patches.
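
The complexity formulas from the paper make the difference concrete; a quick calculation for a 56*56 patch grid with C = 96 and window size M = 7 (the Swin-T defaults) is sketched below.

# Complexity formulas from the Swin paper (h*w patches, embedding dim C, window size M):
#   global MSA :  4*h*w*C^2 + 2*(h*w)^2*C    -- quadratic in the number of patches
#   W-MSA      :  4*h*w*C^2 + 2*M^2*h*w*C    -- linear in the number of patches
h = w = 56
C = 96
M = 7

global_msa = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C
window_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C

print(f"windowed attention needs ~{global_msa / window_msa:.0f}x fewer operations")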

Why both W-MSA and SW-MSA for local self-attention? What is the difference between them?

The key difference between W-MSA and SW-MSA attention modules is in the way how the windows for the image are configured.

In the W-MSA module, a regular window partitioning strategy is followed. The image is evenly partitioned into non-overlapping windows starting from the top-left pixel of the image, and each window contains M*M (M²) patches.

In SW-MSA module, the window configuration is shifted from that of the W-MSA layer, by displacing the windows by (M/2, M/2) patches from the regular partitioning strategy.

Why a shifted window partitioning strategy in SW-MSA?

Since attention is restricted locally within a window in W-MSA, the shifted window configuration enables cross-window attention, which still yields the benefits of global attention. This is possible because the boundaries of window 1 in the W-MSA layer are shared with windows W2, W4 and W5 in the SW-MSA layer. Hence global attention happens indirectly via “local attention on shifted windows”.
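
In the paper this shifted configuration is implemented efficiently as a cyclic shift of the patch grid before the regular window partition (and reversed afterwards). A minimal sketch of that idea with tf.roll, assuming a (batch, height, width, channels) patch grid and window size M = 7:

import tensorflow as tf

M = 7  # window size (number of patches per window side)

def shift_windows(feature_map):
    # Cyclically shift the patch grid by (M/2, M/2) so that the regular window
    # partition of the shifted map corresponds to the shifted windows.
    return tf.roll(feature_map, shift=[-(M // 2), -(M // 2)], axis=[1, 2])

def unshift_windows(feature_map):
    # Reverse the cyclic shift after attention has been computed.
    return tf.roll(feature_map, shift=[M // 2, M // 2], axis=[1, 2])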

Patch Merging Layer

The Patch Merging layer reduces the number of tokens as the network gets deeper. The first patch merging layer concatenates the features of each group of 2×2 neighbouring patches (giving 4C-dimensional features) and applies a linear layer that projects them to 2C dimensions, halving the patch-grid resolution while doubling the token size.
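
A minimal sketch of that merging step on a (batch, height, width, C) patch grid; the norm-then-linear ordering follows the paper, while the helper name is just illustrative.

import tensorflow as tf
from tensorflow.keras import layers

def patch_merging(x, C):
  # Gather the four patches of every 2x2 neighbourhood and concatenate their features.
  x0 = x[:, 0::2, 0::2, :]
  x1 = x[:, 1::2, 0::2, :]
  x2 = x[:, 0::2, 1::2, :]
  x3 = x[:, 1::2, 1::2, :]
  merged = tf.concat([x0, x1, x2, x3], axis=-1)        # (B, H/2, W/2, 4C)
  merged = layers.LayerNormalization()(merged)
  return layers.Dense(2 * C, use_bias=False)(merged)   # (B, H/2, W/2, 2C)

x = tf.random.uniform((1, 56, 56, 96))
print(patch_merging(x, 96).shape)                      # (1, 28, 28, 192)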

SWIN Transformers on Dog Breed Classification

The package tfswin on PyPI has pretrained TF-Keras variants of the SWIN Transformers and is built based on the official PyTorch implementation. Its code is available on GitHub. tfswin is used here to train on the dog breed images.

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.models import Model
from tfswin import SwinTransformerBase224, preprocess_input

def build_model1(swintransformer):
  tf.keras.backend.clear_session()

  inputs = Input(shape=(resize_height, resize_width, 3))

  outputs = Lambda(preprocess_input)(inputs)
  outputs = swintransformer(outputs)
  outputs = Dense(num_classes, activation='softmax')(outputs)

  swin_model = Model(inputs=inputs, outputs=outputs)

  return swin_model

#build the model
swintransformer = SwinTransformerBase224(include_top=False,pooling='avg')
swin_model1     = build_model1(swintransformer)

#set the layers of the pretrained model as non-trainable
for layer in swin_model1.layers[2].layers:   
  layer.trainable = False

swin_model1.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),loss='categorical_crossentropy',metrics=['accuracy'])
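
Training then follows the same pattern as the convolutional models below; train_ds, val_ds and callbacks_list are assumed to be defined as in the ResNet example further down.

history = swin_model1.fit(train_ds,
                          epochs=50,
                          validation_data=val_ds,
                          callbacks=callbacks_list)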

Convolution-Based Models

  • To begin with, various ResNet variants were trained with different replacements for the top layers, with the network partially to fully frozen.
#Logloss of the test set using various ResNet variants
+------------+---------------+-------------------------+----------+
| Model Name |   Retrained   |  Top Layers Replacement | Log_Loss |
+------------+---------------+-------------------------+----------+
|  ResNet50  |      None     |   ConvBlock_FC_Output   | 0.96463  |
|  ResNet50  |      None     | GlobalAvgPooling_Output | 0.58147  |
|  ResNet50  | last 4 layers |   ConvBlock_FC_Output   | 2.10158  |
|  ResNet50  | last 4 layers | GlobalAvgPooling_Output | 0.57019  |
+------------+---------------+-------------------------+----------+

Code corresponding to the ResNet50 model with the lowest log loss:

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, Dense, BatchNormalization, Flatten, Concatenate, Dropout, MaxPooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

def build_model():

  tf.keras.backend.clear_session()

  inputs = Input(shape=(resize_height, resize_width, 3))

  #added preprocess_input method as a layer to convert input images to those expected by Resnet
  processed_inputs = preprocess_input(inputs)  

  #use the pretrained ResNet model (parameter pooling='avg' takes care of the Global Average Pooling of the ResNet model features)
  base_model = ResNet50(weights="imagenet", include_top=False,pooling='avg')(processed_inputs) 
  
  #output layer
  output = Dense(units=num_classes,activation='softmax',name='Output')(base_model)

  resnet_model = Model(inputs=inputs, outputs=output)
  return resnet_model

#build the model
resnet_model = build_model()

#set the layers of the resnet pretrained model as non-trainable, except for its last 4 layers which need to be re-trained for this data
for layer in resnet_model.layers[3].layers[:-4]:   
  layer.trainable = False

#compile the model
resnet_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),loss='categorical_crossentropy',metrics=['accuracy'])
print(resnet_model.summary())

history = resnet_model.fit(train_ds,
                    epochs=50,
                    validation_data=val_ds, callbacks=callbacks_list)

  • More sophisticated pre-trained convolutional models like EfficientNet, NASNet, InceptionResNet, Xception and InceptionV3 were also trained on the dog breed images, both standalone and in an ensembled fashion. The log loss on the test data (consisting of 10K images) from these models was lower than that of any of the ResNet variants.
#Logloss of the standalone model variants
+----------------------------+-------------+
|         Model Name         |   Log_Loss  |
+----------------------------+-------------+
|      EfficientNetV2M       |   0.28347   |
|      Inception ResNet      |   0.28623   |
|        NasNetLarge         |   0.33285   |
|          Xception          |   0.34187   |
|        Inception_V3        |   0.54297   |
| EfficientNetV2M_GlobalAveg |   0.50423   |
|   InceptionV3_GlobalAveg   |   0.46402   |
+----------------------------+-------------+

  • For standalone models, the layers of the pre-trained models were frozen and their top layers were replaced by either
    • Convolutional Layer followed by MaxPooling Layer
    • Global Average Pooling Layer

  • The EfficientNet architecture scales the model in a balanced way across all 3 dimensions - depth, width and resolution.
  • NASNet-Large is the outcome of a neural architecture search that used reinforcement learning along with various hyperparameter search and optimization algorithms.
  • InceptionV3 factorizes larger convolutions into smaller and asymmetric convolutions and also includes auxiliary classifiers to improve model convergence.
  • The Xception architecture uses depthwise separable convolutions in place of the Inception modules of the Inception architecture.
  • The log loss of the ensemble models was more than 0.1 lower than that of the best-performing standalone model.
+--------------------------------------------------------------------------+-----------+
|                                Model Name                                |  Log_Loss |
+--------------------------------------------------------------------------+-----------+
| Ensemble1 - EfficientNet, InceptionResNet, NasNet, Xception              |  0.17363  |
| Ensemble2 - EfficientNet, InceptionResNet, NasNet, Xception, InceptionV3 |  0.16914  |
| Ensemble3 - Ensemble2 with 50% dropout                                   |  0.16678  |
| Ensemble4 - Ensemble of various EfficientNet architectures               |  0.16519  |
+--------------------------------------------------------------------------+-----------+

Each of these models expects a different input format, and in Keras each has its own preprocessing function.
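
A minimal sketch of how such an ensemble can be wired up in Keras, applying each backbone's own preprocessing inside its branch and averaging the softmax outputs. The two backbones shown are only illustrative (the actual ensembles in the table combine up to five models); num_classes, resize_height and resize_width are assumed to be defined as in the earlier snippets.

import tensorflow as tf
from tensorflow.keras.layers import Input, Lambda, Dense, Average
from tensorflow.keras.models import Model
from tensorflow.keras.applications import xception, inception_v3

def branch(inputs, preprocess, backbone):
  # Each branch applies its backbone-specific preprocessing before the frozen
  # pretrained feature extractor and its own softmax head.
  x = Lambda(preprocess)(inputs)
  x = backbone(x)
  return Dense(num_classes, activation='softmax')(x)

inputs = Input(shape=(resize_height, resize_width, 3))

xcep  = tf.keras.applications.Xception(weights="imagenet", include_top=False, pooling='avg')
incv3 = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False, pooling='avg')
xcep.trainable  = False
incv3.trainable = False

# Average the per-model probability vectors to form the ensemble prediction.
outputs = Average()([
    branch(inputs, xception.preprocess_input, xcep),
    branch(inputs, inception_v3.preprocess_input, incv3),
])
ensemble_model = Model(inputs=inputs, outputs=outputs)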

Benchmarking Outcome

+----------------------------------+------------+----------------------+----------+
|            Model Name            | Parameters | Train time (seconds) | Log_Loss |
+----------------------------------+------------+----------------------+----------+
|  EfficientNet_ConvBlock_Output   |   54.7M    |        ~260s         | 0.28347  |
| InceptionResNet_ConvBlock_Output |   56.1M    |        ~260s         | 0.28623  |
|   NASNetLarge_ConvBlock_Output   |   89.6M    |        ~330s         | 0.33285  |
|    XCeption_ConvBlock_Output     |   23.3M    |        ~240s         | 0.34187  |
|   InceptionV3_ConvBlock_Output   |   24.2M    |        ~225s         | 0.54297  |
|      EfficientNet_GlobalAvg      |   53.3M    |        ~260s         | 0.50423  |
|      InceptionV3_GlobalAvg       |    22M     |        ~215s         | 0.46402  |
|           swin_base224           |   86.8M    |        ~550s         | 0.47289  |
|           swin_base384           |    87M     |        ~600s         | 0.41902  |
|          swin_large384           |    195M    |        ~1000s        | 0.42207  |
+----------------------------------+------------+----------------------+----------+

  • SWIN Transformers performed better than all of the ResNet50 variants and the InceptionV3 model.

  • The log loss of the SWIN Transformers on this data is slightly higher than that of the InceptionResNet, EfficientNet, Xception and NASNetLarge models when their outputs are further processed by a convolutional layer followed by max pooling.

  • SWIN, however, performs as well as the EfficientNet model when the globally average-pooled outputs are processed directly by the output layer.

  • SWIN models are larger than any of the convolutional models used here and hence take a hit on system throughput and latency.

This study proved useful in understanding the application of transformer-based models for computer vision.

You can find the code for the notebooks on my GitHub and the GUI for this use case here.

References:

https://www.kaggle.com/competitions/dog-breed-identification

https://arxiv.org/pdf/2103.14030.pdf

https://www.youtube.com/watch?v=tFYxJZBAbE8&t=362s

https://towardsdatascience.com/swin-vision-transformers-hacking-the-human-eye-4223ba9764c3

https://github.com/shkarupa-alex/tfswin

