Bag of Visual Words Applied to Image Classification
A new and better representation of the image
Bag of Visual Words
Bag of Visual Words (BoVW) or Bag of Features (BoF) is an approach that represents an image as an unordered collection of its features. The inspiration for BoVW came from the bag of words model, commonly used in the Natural Language Processing (NLP) context. In the computer vision context, the approach can be used for image classification.
In the BoVW, a dictionary of visual words is created and used to generate a histogram of the occurrences of these visual words in an image. The histogram is a new and better representation of the image.
The BoVW can be divided into three stages:
- Extraction of distinctive invariant features from images.
- Visual vocabulary (or codebook) generation.
- Vector of visual words (or codewords) occurrence.
Each stage is explained and applied to the image (face) classification of three celebrities, namely, Arnold Schwarzenegger, Sylvester Stallone, and Taylor Swift. The classification algorithm is the Extreme Learning Machine (ELM), used for regression and classification in previous blog posts.
The figure above shows some images of each celebrity (label) used in this blog post.
Feature Extraction
In the first stage, the Scale-Invariant Feature Transform (SIFT) algorithm is used to extract features from each image of each label (3 classes of faces). SIFT detects keypoints and computes descriptors that describe the local features of each image.
The figure above shows the 200 keypoints detected in the Schwarzenegger image. Each keypoint has a descriptor of 128 values, so this image is described by a 200 x 128 matrix. It is worth mentioning that each image has a different number of keypoints.
The function below performs the feature extraction: the keypoints and descriptors are computed and appended to a single vector per label.
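A minimal sketch of such a function, assuming OpenCV's SIFT implementation (the helper name extract_sift_features and its exact signature are illustrative, not the original code):

```python
import cv2
import numpy as np

def extract_sift_features(image_paths):
    """Extract SIFT keypoints and descriptors from all images of one label
    and stack the descriptors into a single (n_keypoints, 128) array."""
    sift = cv2.SIFT_create()
    keypoints_all, descriptors_all = [], []
    for path in image_paths:
        image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        keypoints, descriptors = sift.detectAndCompute(image, None)
        keypoints_all.extend(keypoints)
        descriptors_all.append(descriptors)
    return keypoints_all, np.vstack(descriptors_all)
```

Calling this function once per label yields the per-label keypoint and descriptor vectors used in the next stage.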
At the end of this stage, label 0 (Arnold Schwarzenegger) has 4225 keypoints, label 1 (Sylvester Stallone) has 3750 keypoints, and label 2 (Taylor Swift) has 2025 keypoints.
Visual vocabulary generation
After the feature extraction, only 1620 keypoints of each label are used to create the visual vocabulary. This number is chosen as 80% of the keypoint count of the label with the fewest keypoints, in this case label 2, with only 2025 keypoints.
The descriptor vectors of all labels are joined into a single large vector. Then, this vector is grouped into K clusters by the K-means clustering algorithm. The K centroids generated are known as the visual vocabulary or codebook. Each centroid represents a main feature present in the training set and is known as a visual word or codeword.
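A sketch of this stage, assuming descriptors_by_label holds the per-label descriptor arrays produced by the extraction stage (the variable names n_descriptors and bag match those referenced below; the fixed seed is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

n_descriptors = 1620  # 80% of the keypoint count of the smallest label
K = 100               # number of visual words

# descriptors_by_label: one (n_keypoints, 128) array per label,
# produced by the feature extraction stage
training_descriptors = np.vstack(
    [d[:n_descriptors] for d in descriptors_by_label]
)

# the K centroids form the visual vocabulary (codebook)
kmeans = KMeans(n_clusters=K, random_state=0)
kmeans.fit(training_descriptors)
bag = kmeans.cluster_centers_
```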
The block of code above shows how the per-label descriptor vectors are joined, each limited to 1620 keypoints (n_descriptors). In the end, K-means clustering is applied to generate the visual vocabulary (bag in the code above).
Vector of visual words occurrence
In the last stage, the visual vocabulary is used to create the vector of visual words occurrences (or feature vector). For each keypoint descriptor, the vocabulary determines which cluster (or visual word) it best fits. The feature vector of size K is built by incrementing by 1 the position corresponding to that cluster. For example, if a keypoint descriptor of an image is assigned to cluster (or visual word) 0, position 0 of the feature vector is incremented by 1.
Thus, the feature vector of each image is generated, behaving like a histogram. In the end, the feature vector is normalized by the L2 norm to better fit the classification algorithm.
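A possible sketch of this stage, reusing the fitted K-means object from the vocabulary generation (the function name bovw_feature_vector is illustrative):

```python
import numpy as np

def bovw_feature_vector(descriptors, kmeans):
    """Map each keypoint descriptor to its nearest visual word and
    count the occurrences, producing a K-sized histogram."""
    words = kmeans.predict(descriptors)  # cluster index of each descriptor
    histogram = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    # L2 normalization, so images with many keypoints do not dominate
    return histogram / np.linalg.norm(histogram)
```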
The figure above shows the (unnormalized) feature vector histogram of the Schwarzenegger image presented before. In this case, the K-means clustering was performed with K = 100 clusters, that is, the BoVW has 100 visual words. The highest bar indicates that 44 keypoint descriptors are assigned to visual word 6.
The BoVW model is created from the training data through the feature extraction and visual vocabulary generation stages. The new representation of the training and test data is generated by the vector of visual words occurrences, and this representation has dimension N x K, where N is the number of images and K is the number of visual words. The training data is 75 x 100, with 25 images per label, and the test data is 9 x 100, with 3 images per label.
Training and results
The Scikit-learn Pipeline is used to chain the BoVW and classification steps in a single call. The input of the Pipeline is the training data (images), and the output is the trained classifier after the transformation performed by the BoVW. After fitting the Pipeline, we can predict the labels of the training and test data.
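A sketch of how this chaining could look, where BoVWTransformer and ELMClassifier are assumed custom estimators implementing the scikit-learn fit/transform and fit/predict interfaces (their names are illustrative; the parameter values match those reported below):

```python
from sklearn.pipeline import Pipeline

# BoVWTransformer and ELMClassifier are assumed custom estimators;
# the names are illustrative, not necessarily those of the original code
pipeline = Pipeline([
    ("bovw", BoVWTransformer(n_clusters=100)),
    ("elm", ELMClassifier(L=88, random_state=10)),
])

pipeline.fit(X_train, y_train)  # X_train: training images, y_train: labels
y_pred_train = pipeline.predict(X_train)
y_pred_test = pipeline.predict(X_test)
```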
The ELM classifier is used for the classification task on the new representations of the images. A more detailed explanation of the ELM classifier is given in previous blog posts on Iris plant classification and time series forecasting.
The Pipeline is trained with 100 clusters in the BoVW, and with L = 88 hidden neurons and, for reproducibility, a random state of 10 in the ELM classifier. The classification results reach 100% training accuracy and 100% test accuracy; that is, all 9 test images were correctly classified.
Investigating the result obtained for the Schwarzenegger image presented before, we obtain the probability distribution [0.5761, 0.2119, 0.2119], indicating a probability of 57.61% of being Schwarzenegger (label 0). The probability distribution is obtained from the ELM prediction through the Softmax activation function.
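For reference, a minimal Softmax sketch; since Softmax is shift-invariant, raw ELM outputs of the form [1, 0, 0] (an illustrative example, not the actual model output) reproduce exactly the reported distribution:

```python
import numpy as np

def softmax(z):
    """Turn raw ELM output scores into a probability distribution."""
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / np.sum(e)

print(softmax(np.array([1.0, 0.0, 0.0])))  # [0.5761..., 0.2119..., 0.2119...]
```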
Conclusion
In this way, the BoVW is a robust approach for transforming the image representation into a new and better one, and its efficiency in the classification task is demonstrated with the ELM classifier. The Scikit-learn Pipeline allows the BoVW and ELM classifier to be reused with another set of images, and the classifier can be replaced by another one of your choice.
The complete code is available on GitHub. Deployment of the built model can be tested on Hugging Face.
If you are interested, I wrote two blog posts about ELM, applied to time series forecasting and Iris plant classification; check them out if you want.
Follow the blog if this post was helpful to you.