Bag of Visual Words (BoVW) is a feature representation method for images that draws direct inspiration from the Bag of Words model in natural language processing. Just as BoW in NLP represents a document as the frequency histogram of its constituent words, BoVW represents an image as the frequency histogram of its constituent visual "words" — compact feature descriptors that encode local image patches. The pipeline consists of three main stages: local feature extraction, visual vocabulary construction via clustering, and classification using the histogram representation.
Local feature extraction typically uses SIFT (Scale-Invariant Feature Transform) or variants such as SURF or ORB. SIFT detects keypoints at extrema in scale space (blob-like structures at multiple scales) and computes a 128-dimensional descriptor for each keypoint: a 4×4 grid of 8-bin gradient orientation histograms over the surrounding 16×16-pixel region. The resulting keypoints and descriptors are scale- and rotation-invariant, making them robust to the viewpoint variations common in real-world classification datasets. For BoVW, you extract SIFT descriptors from all training images and pool them together for the next stage.
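The descriptor structure can be sketched in a few lines of NumPy. This is a simplified illustration of the 4×4-cell, 8-orientation-bin layout, not real SIFT: it omits Gaussian weighting, trilinear interpolation, rotation to the dominant orientation, and the clip-and-renormalize step.

```python
import numpy as np

def sift_like_descriptor(patch):
    """Toy SIFT-style descriptor for one 16x16 patch: a 4x4 grid of
    cells, each an 8-bin gradient orientation histogram weighted by
    gradient magnitude, concatenated into a 128-D vector."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)          # image gradients
    mag = np.hypot(gx, gy)               # gradient magnitude
    ori = np.arctan2(gy, gx)             # orientation in [-pi, pi]
    bins = ((ori + np.pi) / (2 * np.pi) * 8).astype(int) % 8
    desc = np.zeros(128)
    for cy in range(4):                  # 4x4 grid of 4x4-pixel cells
        for cx in range(4):
            cell = np.s_[cy*4:(cy+1)*4, cx*4:(cx+1)*4]
            hist = np.bincount(bins[cell].ravel(),
                               weights=mag[cell].ravel(), minlength=8)
            desc[(cy*4 + cx)*8:(cy*4 + cx)*8 + 8] = hist
    norm = np.linalg.norm(desc)          # L2-normalize for invariance
    return desc / norm if norm > 0 else desc

rng = np.random.default_rng(0)
d = sift_like_descriptor(rng.random((16, 16)))
print(d.shape)  # (128,)
```

In practice you would call a library implementation (e.g. OpenCV's SIFT) rather than hand-rolling the descriptor; the sketch only shows where the 128 dimensions come from.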
Visual vocabulary construction applies k-means clustering to the pooled descriptor set, where k is the vocabulary size (typically 500–5000 for standard benchmarks). Each cluster centroid becomes a "visual word." The optimal vocabulary size depends on the discriminability required: small vocabularies (500 words) provide coarse representations that are fast to compute and robust to noise, while large vocabularies (5000+) capture finer distinctions but require more training data to avoid overfitting. For industrial inspection applications with limited training data, vocabularies in the 500–1000 range typically work well.
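The clustering stage maps directly onto scikit-learn. A minimal sketch, with random vectors standing in for pooled SIFT descriptors and k reduced well below the 500–5000 range so the toy example runs quickly:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the pooled descriptor set: real SIFT descriptors are
# 128-D, with thousands pooled from all training images.
descriptors = rng.random((1000, 128)).astype(np.float32)

k = 50  # vocabulary size; toy value here, 500-5000 in practice
kmeans = KMeans(n_clusters=k, n_init=2, random_state=0).fit(descriptors)

# Each cluster centroid is one "visual word" of the vocabulary.
vocabulary = kmeans.cluster_centers_
print(vocabulary.shape)  # (50, 128)
```

For large descriptor pools, `MiniBatchKMeans` is a common drop-in replacement that trades a little cluster quality for much lower time and memory.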
With the vocabulary constructed, each image is represented by assigning each of its SIFT descriptors to the nearest cluster centroid (visual word) and computing the resulting term-frequency histogram. A common enhancement is TF-IDF weighting, which down-weights visual words that appear frequently across all training images (uninformative) and up-weights rare, discriminative words — directly analogous to TF-IDF in text retrieval. The histogram vectors are then fed to a Support Vector Machine (SVM) classifier, typically with a histogram intersection or chi-squared kernel, which are better suited to comparing histograms than the standard RBF kernel.
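The whole encoding-and-classification stage can be sketched end to end with scikit-learn's `chi2_kernel` as the SVM kernel. Everything here is synthetic: the centroids, the two-class "images", and the 8-D toy descriptors are stand-ins, not a real dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
k = 20                          # toy vocabulary size
vocab = rng.random((k, 8))      # stand-in centroids (8-D toy descriptors)

def bovw_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest visual word and return
    the normalized term-frequency histogram."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()

# Toy training set: 40 "images", two classes with biased descriptors.
X, y = [], []
for i in range(40):
    label = i % 2
    desc = rng.random((100, 8)) * (0.6 if label else 1.0)
    X.append(bovw_histogram(desc, vocab))
    y.append(label)
X, y = np.array(X), np.array(y)

# TF-IDF reweighting: down-weight words common across training images.
df = (X > 0).sum(axis=0)                 # document frequency per word
idf = np.log(len(X) / np.maximum(df, 1))
X_tfidf = X * idf

# Chi-squared kernel SVM (features must be non-negative, as here).
clf = SVC(kernel=chi2_kernel).fit(X_tfidf, y)
print(clf.score(X_tfidf, y))
```

Passing `chi2_kernel` as a callable is the simplest route; for large datasets a precomputed kernel matrix or an explicit additive-chi-squared feature map is often faster.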
While deep learning has largely superseded BoVW for general image classification, BoVW remains relevant in scenarios where training data is scarce, inference hardware is severely constrained, or model interpretability is required. The visual vocabulary can be inspected and individual visual words can be traced back to specific image patches — a degree of interpretability that is difficult to achieve with deep feature representations. For texture classification, material recognition, and other tasks where SIFT-type descriptors remain competitive, BoVW is still a practical, well-understood baseline.