Object detection is one of the most important computer vision tasks. It is extensively used whenever one needs to localize objects in visual data. However, training an object detection model requires box annotations, which are expensive to collect because someone has to draw a box around every object of interest.
In this article, we explain how we built a simple and accurate few-shot object detection system to quickly train new detection concepts with less data. The core idea behind our approach is to pre-train the model with a vast variety of visual concepts. As a result, we can build high-quality features, and have access to different weight candidates that we can re-use as an initial guess to guide the few-shot learning task.
To build our system, we integrate the following ideas that we will detail throughout this article:
- Dedicated concept grids
- Training linear weights during the few-shot learning phase
- Re-using pre-trained classifiers and box regressors.
We first present the architecture we adopted that uses dedicated concept grids and simple detection sub-networks. Then we talk about the dataset and the annotations we built for pre-training. Finally, we discuss how to do custom training with few samples on top of the pre-trained base detector.
In this section, we describe the architecture components we use. More specifically, we cover the class-dedicated grids and the sub-network blocks used in the base pre-trained detector.
One of the core ideas of our approach is that each detection class has its own dedicated grid, contrary to the classical YOLO approach, which uses a class-agnostic grid, as you can see in the example in Fig 1.
The main reason why we use dedicated grids is to learn class-specific box regressors that we can re-use later for few-shot learning. For example, if our pre-trained model has a dedicated box regressor for “baked goods” and we want to train a new “cake” detector via few-shot learning, the dedicated “baked goods” box regressor is surely a better initial guess than a class-agnostic regressor that might include irrelevant objects such as bottles and tables. In fact, in practice, we noticed that in some cases, if the right box regressor is selected (e.g. “baked goods” to train “cake”), one might not even need to fine-tune it and train only the classifier part.
We adopt a simple model architecture with an ImageNet pre-trained ResNet-50 [1] as the backbone. We also tried other backbones such as MobileNet [2] and large EfficientNets [3], but for simplicity we stick to ResNet-50 in this article.
Fig 2. shows an overview of the modules used in our architecture. Note that we use features from lower levels (2, 3, 4 and not 5) which makes the model smaller and faster. The detection model is composed of three parts:
- A shared sub-network that takes the backbone features as input and outputs detection features. This network is very basic, consisting of a few depthwise and pointwise convolutions to keep the model fast and small. One could use more advanced blocks such as FPN [4] or BiFPN [5], but just a couple of depthwise and pointwise convolutions are enough to train a good base detector.
- Two small networks: one for the class predictions and one for box regression. They output respectively class and box features that we will use as the main ingredients later for few-shot learning.
- Simple 1x1 convolutions with Sigmoid activation that take the features and produce the class probabilities and box regression coordinates. So the outputs of the class and box predictions in the pre-trained models are GxGxC and GxGx4xC respectively, where GxG is the grid size and C is the number of pre-training classes.
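To make the output shapes concrete, here is a minimal NumPy sketch of the two prediction heads. It assumes that a 1x1 convolution is just a per-cell linear map over the channel axis; the sizes `G`, `F` and `C` and the random weights are hypothetical placeholders, not values from our model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(features, weights, bias):
    """1x1 convolution: the same linear map applied to every grid cell."""
    # features: (G, G, F), weights: (F, K), bias: (K,) -> output: (G, G, K)
    return features @ weights + bias

G, F, C = 18, 64, 5  # hypothetical grid size, feature size and class count
rng = np.random.default_rng(0)

# Stand-ins for the outputs of the class and box sub-networks.
class_feats = rng.normal(size=(G, G, F))
box_feats = rng.normal(size=(G, G, F))

W_cls, b_cls = 0.1 * rng.normal(size=(F, C)), np.zeros(C)
W_box, b_box = 0.1 * rng.normal(size=(F, 4 * C)), np.zeros(4 * C)

class_probs = sigmoid(conv1x1(class_feats, W_cls, b_cls))   # (G, G, C)
box_coords = sigmoid(conv1x1(box_feats, W_box, b_box))      # (G, G, 4 * C)
box_coords = box_coords.reshape(G, G, 4, C)                 # (G, G, 4, C)
```

Each class gets its own column in the classifier weights and its own four columns in the box weights, which is what makes the grids class-specific.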
As you might have noticed, we keep it super simple and don’t even use anchor boxes. The reason for this is that we don’t want to put any prior on the box sizes since we don’t know in advance what kind of box classes we will need to train via few-shot learning later. The box centers are encoded as offsets with respect to the closest grid cell center normalized by the maximum shift value so we get values between 0 and 1. The box dimensions are simply the normalized box height and width.
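The encoding can be sketched as follows. This is one possible reading of the description above, not necessarily the exact scheme we use: it assumes boxes are given as normalized `(cx, cy, w, h)` values, that the closest cell is the one containing the box center, and that the offset is rescaled by twice the maximum shift and recentered so it lands in [0, 1]. The function names are hypothetical.

```python
def encode_box(cx, cy, w, h, G):
    """Encode a normalized (cx, cy, w, h) box against a G x G grid."""
    col = min(int(cx * G), G - 1)   # cell whose center is closest to cx
    row = min(int(cy * G), G - 1)   # cell whose center is closest to cy
    max_shift = 0.5 / G             # largest possible distance to the cell center
    tx = (cx - (col + 0.5) / G) / (2 * max_shift) + 0.5
    ty = (cy - (row + 0.5) / G) / (2 * max_shift) + 0.5
    return row, col, (tx, ty, w, h)

def decode_box(row, col, target, G):
    """Invert encode_box and recover the normalized (cx, cy, w, h) box."""
    tx, ty, w, h = target
    max_shift = 0.5 / G
    cx = (col + 0.5) / G + (tx - 0.5) * 2 * max_shift
    cy = (row + 0.5) / G + (ty - 0.5) * 2 * max_shift
    return cx, cy, w, h

row, col, target = encode_box(0.52, 0.31, 0.2, 0.4, G=18)
decoded = decode_box(row, col, target, G=18)
```

All four target values lie between 0 and 1, so they can be predicted directly by a Sigmoid output.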
The main goal of pre-training the detector model is to have good class and box regression features. The assumption is that the more visual concepts we use during the pre-training phase, the better the features will be for few-shot learning. The problem is that good-quality detection annotations are harder to get and more costly, since it takes more effort to draw boxes than to just say whether an image contains an object or not. Since we don’t have access to clean box annotations for millions of images, we instead use different pre-trained models to build the annotations. More specifically, we use 3 publicly available detection models to generate annotations on around 3M images sampled from the OpenImages dataset:
- Faster RCNN Inception ResNet V2 trained on OpenImages: https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1
- YOLO V3 trained on OpenImages: https://pjreddie.com/darknet/yolo/
- Faster RCNN ResNet50 trained on COCO: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html. The COCO annotations are mapped into OpenImages annotations via a manual mapping to use uniform labels.
To add more diversity in the visual concepts during the pre-training phase without annotating more boxes, we use our internal keywording model to generate multi-label concepts which cover more than 6000 concepts. We adopt a multi-task approach to train the model so that the class features predict both the detection and multi-label concepts while the box features only predict boxes for the detection concepts. To do this, we simply add another branch to the detection model in Fig. 2. and use global average pooling on the class features and an additional 1x1 convolution to predict the multi-label concepts. We found that this is a cheap way to introduce more visual diversity without using more concept boxes.
The annotations from the different detection models and the multi-label concepts are merged and processed to produce the training annotations. There are many cool things that can be done when having access to annotations from models trained on different datasets. For example, we measure the similarity between the different detection annotations and multi-label concept predictions to reject samples if the annotations are not consistent. Fig 3. shows an overview of the data annotation pipeline we used.
Fig. 4 shows how merging outputs from different detection models can help produce better annotations.
We do the pre-training in 2 phases:
- Phase 1: the ImageNet pre-trained backbone is frozen and the detection model is trained on 288x288 inputs (18x18 grid)
- Phase 2: the whole model (backbone + detector) is trained with a lower learning rate on 480x480 inputs (30x30 grid)
We use a weighted binary cross-entropy for the detection classes, L1 loss for the boxes and binary cross-entropy for the multi-label concepts. We adopt a class-balanced sampling strategy by sampling one image per keywording concept in the batch to get diverse data.
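As a rough illustration of the two detection losses, here is a NumPy sketch. The weighting scheme (`pos_weight`, `neg_weight`) and the box mask are assumptions made for this example; the actual weights used in pre-training are not specified here.

```python
import numpy as np

def weighted_bce(probs, labels, pos_weight=1.0, neg_weight=1.0, eps=1e-7):
    """Binary cross-entropy with separate weights for positives and negatives."""
    p = np.clip(probs, eps, 1 - eps)
    loss = -(pos_weight * labels * np.log(p)
             + neg_weight * (1 - labels) * np.log(1 - p))
    return loss.mean()

def l1_box_loss(pred, target, mask):
    """Mean absolute error over the entries selected by mask.

    mask has the same shape as pred and is 1 where a box annotation exists.
    """
    diff = np.abs(pred - target) * mask
    return diff.sum() / max(mask.sum(), 1)
```

The multi-label concepts would use the same `weighted_bce` function with both weights set to 1.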
Fig 5. shows an example of the output after the pre-training. Notice that a single box can predict different concepts (“Eyewear” + “Glasses” or “Clothing” + “Suit”). That’s because the grids are class-specific, as mentioned before. After getting the output annotations, we use non-maximum suppression (NMS) across classes to combine the boxes and get multi-label classes per box.
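A simple way to picture the cross-class merging step is a greedy NMS variant that, instead of discarding an overlapping box, attaches its label to the box already kept. This is a sketch under assumptions: the `(x1, y1, x2, y2)` box format, the IoU threshold of 0.7 and the function names are all hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def merge_across_classes(detections, iou_threshold=0.7):
    """Greedily group (box, label, score) detections that overlap strongly."""
    merged = []
    for box, label, score in sorted(detections, key=lambda d: d[2], reverse=True):
        for kept in merged:
            if iou(kept["box"], box) >= iou_threshold:
                # Same object seen by another class grid: keep one box, add the label.
                kept["labels"].setdefault(label, score)
                break
        else:
            merged.append({"box": box, "labels": {label: score}})
    return merged

detections = [
    ((10, 10, 50, 50), "Eyewear", 0.9),
    ((11, 10, 50, 51), "Glasses", 0.8),
    ((100, 100, 200, 220), "Suit", 0.7),
]
result = merge_across_classes(detections)
```

Here the two nearly identical boxes collapse into a single box carrying both “Eyewear” and “Glasses”, while the “Suit” box stays separate.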
So now that we have a pre-trained detection model from which we can use class and box features to train new concepts, how can we do the training efficiently with few samples?
We have pre-trained the detection model in a way that the features are processed with a single 1x1 convolution layer (one for the classes and one for the boxes). This is done on purpose to make custom training as easy as training separate linear models for different new classes. This allows us to add new classes by simply merging the linear models in one layer.
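Merging linear models into one layer amounts to concatenating weight columns. The sketch below assumes the 1x1 convolution weights are stored as an `(F, C)` matrix plus a `(C,)` bias; the function name and shapes are hypothetical.

```python
import numpy as np

def add_linear_classifiers(W, b, new_W, new_b):
    """Append new per-class linear models to an existing 1x1 conv head."""
    # W: (F, C), b: (C,), new_W: (F, K), new_b: (K,)
    return np.concatenate([W, new_W], axis=1), np.concatenate([b, new_b])

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 3, 8))                      # (G, G, F) class features
W, b = rng.normal(size=(8, 4)), rng.normal(size=4)      # 4 existing classes
new_W, new_b = rng.normal(size=(8, 2)), rng.normal(size=2)  # 2 new classes

W_all, b_all = add_linear_classifiers(W, b, new_W, new_b)
logits_old = feats @ W + b            # (G, G, 4)
logits_all = feats @ W_all + b_all    # (G, G, 6)
```

The predictions for the existing classes are untouched, because each class only depends on its own column.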
We now present different ways to do custom training. We will first take a look at the classifier part only, and then we will address the box regression problem.
In the few-shot learning setting, where we don’t have a lot of samples, blindly training a model with binary cross-entropy actually doesn’t work well.
To train our few-shot learning model, we need to extract features for both positive and negative samples. First, we run the available annotated samples through heavy data augmentation to generate more training examples. The features for the positives are the features of the grid cell matching a given annotated box. The negatives are composed of two types of data: negatives from the training images, taken from the cells that don’t correspond to a box annotation, and negatives randomly sampled from an unlabeled set. The unlabeled set is just a set of randomly sampled images from different concepts (cats, dogs, mountains, etc.). The idea here is that, since there are GxG features for a given image, the probability of picking a false negative from the grids of the unlabeled images is very low, and even if one is picked, a little noise wouldn’t affect the training much. So that’s a cheap way to augment the negatives, and it’s one of the main ingredients that make this few-shot learning approach work with few samples.
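The sample collection described above can be sketched like this, assuming each image yields a `(G, G, F)` feature map and that the positive cells have already been matched to the annotated boxes. Function and argument names are hypothetical.

```python
import numpy as np

def collect_samples(feature_map, positive_cells):
    """Split a (G, G, F) feature map into positive and negative feature vectors."""
    G = feature_map.shape[0]
    mask = np.zeros((G, G), dtype=bool)
    for row, col in positive_cells:
        mask[row, col] = True
    # Cells matching a box are positives, every other cell is a negative.
    return feature_map[mask], feature_map[~mask]

def sample_unlabeled_negatives(unlabeled_maps, n, rng):
    """Randomly pick up to n cell features from unlabeled images as negatives."""
    cells = np.concatenate([m.reshape(-1, m.shape[-1]) for m in unlabeled_maps])
    idx = rng.choice(len(cells), size=min(n, len(cells)), replace=False)
    return cells[idx]
```

The positives and both kinds of negatives can then be stacked into a single feature matrix with 1/0 labels.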
Now that we have our features for the positives and negatives, how can we train the model? The trick here is to make good use of the pre-trained classifier weights as a regularizer or an initial guess. To get a reference classifier, when we extract the features, we also get the class predictions of the pre-trained model. For each cell of the GxG grid, we pick the class with the highest confidence, then find its channel index in the 1x1 conv weights and use this index to get both the reference classifier weights and the reference box regressor (more on the box regressor later). As an illustrative example, let us suppose that you want to build a “Tiger” detector with few examples, and your pre-trained detector was trained on different concepts including “Big cat”, “Car” and “Tree” (but not “Tiger”). You run your Tiger images through the pre-trained model. Provided that the model performs well, the cell corresponding to the Tiger bounding box will have high confidence for “Big cat” since a tiger is a big cat. This way, we start with a very good guess instead of random weights.
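Selecting the reference weights for one annotated cell might look like the sketch below. The weight layouts (`(F, C)` for the classifier, `(F, 4, C)` for the box regressors) and the function name are assumptions made for the example.

```python
import numpy as np

def select_reference(class_probs, W_cls, b_cls, W_box, b_box, cell):
    """Return the pre-trained class with the highest confidence in a cell,
    together with its classifier and box-regressor weights."""
    row, col = cell
    k = int(np.argmax(class_probs[row, col]))   # channel index of the best class
    ref_classifier = (W_cls[:, k], b_cls[k])    # shapes (F,) and scalar
    ref_regressor = (W_box[:, :, k], b_box[:, k])  # shapes (F, 4) and (4,)
    return k, ref_classifier, ref_regressor
```

In the “Tiger” example, `k` would be the channel of “Big cat”, and its weights would serve as the starting point for both the new classifier and the new box regressor.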
The classifier can be trained either with stochastic gradient descent (SGD) on a weighted binary cross-entropy loss that gives more weight to the negative samples, or by solving a linear system. We noticed that it’s necessary to give significantly more weight (100x) to the negatives, otherwise we get too many false positives when the batch is balanced. With SGD, one can start with the reference classifier as an initial guess and train using a small learning rate. I personally prefer solving a linear system because it’s actually faster and more stable. We solve the following linear system:
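One plausible form of such a system, assuming a weighted least-squares fit with a ridge-style penalty that pulls the weights towards the reference classifier, is the objective below; setting its gradient to zero gives a linear system in the unknowns. The features x_i, labels y_i, sample weights α_i and regularization strength λ are notation introduced here for illustration only.

```latex
\min_{W_c,\, b_c} \; \sum_{i} \alpha_i \left( W_c^{\top} x_i + b_c - \sigma^{-1}(y_i) \right)^2
  + \lambda \left\lVert W_c - W_{c,\mathrm{ref}} \right\rVert_2^2
```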
where Wc and bc are the classifier parameters and Wc,ref is the reference classifier. σ is the Sigmoid function. We use the inverse Sigmoid function on the labels so we can add the classifier weights to the 1x1 conv layer in the model.
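Here is a minimal NumPy sketch of solving such a system in closed form. It is an illustration under assumptions, not our exact solver: it uses a ridge-style penalty towards the reference weights (applied to the bias as well), clips the labels before the inverse Sigmoid so the targets stay finite, and the names `lam`, `eps` and `solve_with_reference` are hypothetical.

```python
import numpy as np

def logit(y, eps=0.01):
    """Inverse Sigmoid; labels are clipped so 0/1 map to finite targets."""
    y = np.clip(y, eps, 1 - eps)
    return np.log(y / (1 - y))

def solve_with_reference(X, targets, w_ref, b_ref, sample_weights=None, lam=1.0):
    """Weighted least squares pulled towards reference weights.

    Minimizes sum_i a_i (x_i . w + b - t_i)^2
              + lam * (||w - w_ref||^2 + (b - b_ref)^2).
    """
    n, f = X.shape
    a = np.ones(n) if sample_weights is None else sample_weights
    Xb = np.hstack([X, np.ones((n, 1))])     # extra column of ones for the bias
    theta_ref = np.append(w_ref, b_ref)
    A = Xb.T @ (a[:, None] * Xb) + lam * np.eye(f + 1)
    rhs = Xb.T @ (a * targets) + lam * theta_ref
    theta = np.linalg.solve(A, rhs)
    return theta[:-1], theta[-1]
```

For the classifier, `targets` would be `logit(labels)` and `sample_weights` would give the negatives about 100x the weight of the positives, as discussed above. The resulting `w` and `b` live in logit space, so they can be dropped into the 1x1 conv layer that sits before the Sigmoid.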
The box regressor training process follows a similar approach including the data augmentation step and feature extraction but we keep only the features corresponding to bounding box annotations. We can train a regressor using SGD and L1-loss, or we can also solve a linear system for each box coordinate regressor, so 4 linear systems of the form:
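A plausible form for each of these systems, under the same ridge-style assumption as for the classifier, is shown below. Here t_{i,j} denotes the j-th encoded coordinate of box i, and the subscripted regressor symbols and λ are notation introduced for illustration only.

```latex
\min_{W_{b,j},\, b_{b,j}} \; \sum_{i} \left( W_{b,j}^{\top} x_i + b_{b,j} - \sigma^{-1}(t_{i,j}) \right)^2
  + \lambda \left\lVert W_{b,j} - W_{b,j,\mathrm{ref}} \right\rVert_2^2,
  \qquad j \in \{1, 2, 3, 4\}
```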
As before, we use the inverse Sigmoid on the encoded box coordinates as target values and use the reference box regressors as regularizers.
Fig 6. shows a result example for detecting “Tayto” chip bags using the approach described above.
One of the most important factors for us when designing models is speed. We want our detector to be fast at both inference and custom training.
Prediction takes around 15ms/sample in a single batch for a 480x480 resized input on an RTX 2080 GPU.
For custom training, the majority of the processing time is taken by data augmentation that we run in parallel. The training itself takes only a few seconds.
Special Case: One-Shot Learning
One-shot learning is an extreme case of few-shot learning that uses just a single sample. One option that does not require training the classifier is an L2-norm trick, which turns the reference feature prototype into a classifier.
To do this, the features in the detection model should be L2-normalized during the pre-training before the final 1x1 convolutions.
Let’s suppose that W is our reference feature prototype and x is a given feature vector that we want to classify. Note that each cell of the grid corresponds to a feature vector, so one input image would output GxGxF feature maps per concept, where F is the feature size produced by the class sub-network. Let’s do some simple math to see how distances translate into a linear classifier:
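Written out, and assuming both x and W have unit L2 norm, the squared distance between them expands as:

```latex
\lVert x - W \rVert_2^2
  = \lVert x \rVert_2^2 + \lVert W \rVert_2^2 - 2\, W^{\top} x
  = 2 - 2\, W^{\top} x
  = -2 \left( W^{\top} x - 1 \right)
```

So a small distance to the prototype corresponds to a large value of the linear score W^T x - 1.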
As you can see from the formula above, since the features are L2-normalized, we can treat the reference feature prototype W as a classifier and put the bias to -1 and it would work as a normal linear classifier. This trick can be done for various concepts, and subsequently the linear models can be merged into one single 1x1 convolution layer to do prediction efficiently.
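The trick is easy to check numerically. The sketch below assumes unit-norm features and uses hypothetical function names; it scores feature vectors against a prototype with weights `W` and bias -1.

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    norm = np.linalg.norm(v, axis=axis, keepdims=True)
    return v / np.maximum(norm, eps)

def prototype_scores(features, prototype):
    """Linear score W . x - 1, equal to -0.5 * ||x - W||^2 for unit vectors."""
    return features @ prototype - 1.0

rng = np.random.default_rng(0)
x = l2_normalize(rng.normal(size=(20, 16)))   # 20 cell features of size F = 16
w = l2_normalize(rng.normal(size=16))         # prototype from the single example

scores = prototype_scores(x, w)
squared_dist = ((x - w) ** 2).sum(axis=1)
```

The score is 0 for a feature identical to the prototype and becomes more negative as the feature moves away from it. If `prototype` is an `(F, K)` matrix holding K prototypes as columns, the same function scores all K concepts at once, which is the merged 1x1 convolution mentioned above.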
Fig 7. shows results using one single detection example predicted using the approach described above. Not bad for a single training example!
The Domain Shift Problem
The approach above works quite well when the few-shot learning concept is related to the pre-training data. For example, pre-training on different types of animals and doing few-shot learning for a specific animal breed works well since we start with a very good guess using the reference weights. On the other hand, if the few-shot learning data is completely unrelated to the pre-training data, the approach would not perform as well. For this reason, our SDK also includes tools to handle this case by training the class and box sub-models instead of only the linear models.
One example we faced that falls into this category is when we tried to use the few-shot learning method - pre-trained using natural images from OpenImages - on satellite images to detect cars, swimming pools and buildings. In this situation, since the nature of the images is very different from the pre-training phase, it is better to train more layers (class and box modules in Fig 2.) or pre-train the model on the satellite images first and then use the new features for few-shot learning. Fig 8. shows a prediction example produced using our custom training approach for the satellite images use-case, using crops from only a few full-resolution training samples.
In this article, we demonstrated a simple way to do few-shot object detection that works pretty well in practice. The core idea relies heavily on pre-training with a wide variety of visual concepts and re-using the weights as an initial guess to train new concepts with little data. It is worth pointing out that some concepts need more data than others to reach good performance. We believe that this is due to a lack of some visual concepts, such as logos, during the pre-training phase. Getting access to more visual concepts during the pre-training phase would produce better quality features that would enable training with even less data - or, better yet: having dedicated expert models for different types of visual data.
[1] "Deep Residual Learning for Image Recognition." Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016.
[2] "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications." Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. arXiv:1704.04861, 2017.
[3] "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." Mingxing Tan, Quoc V. Le. ICML 2019.
[4] "Feature Pyramid Networks for Object Detection." Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie. CVPR 2017.
[5] "EfficientDet: Scalable and Efficient Object Detection." Mingxing Tan, Ruoming Pang, Quoc V. Le. CVPR 2020.