What is Object Detection?

In the CNN approach, image classification takes an image and predicts the class of the object in it.

Let's say we built a cat-dog classifier with a CNN and use it to predict images. What if an image contains both a cat and a dog?

The main reason you cannot solve this problem with a standard convolutional network followed by a fully connected layer is that the length of the output layer is variable, not constant: the number of occurrences of the objects of interest is not fixed.

A naive approach to solve this problem would be to take different regions of interest from the image and use a CNN to classify the presence of the object within each region. The problem with this approach is that the objects of interest might have different spatial locations within the image and different aspect ratios. Hence, you would have to select a huge number of regions, and the computational cost would blow up. Therefore, algorithms like R-CNN and YOLO have been developed to find these occurrences and find them fast.
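To get a feel for how quickly the naive approach grows, the sketch below counts the candidate boxes an exhaustive sliding-window search would visit. The function name, the image size, the window sizes, the aspect ratios, and the stride are all illustrative assumptions, not values from any of the papers discussed here.

```python
def count_windows(image_w, image_h, window_sizes, aspect_ratios, stride):
    """Count the candidate boxes an exhaustive sliding-window search visits."""
    total = 0
    for size in window_sizes:
        for ratio in aspect_ratios:
            # ratio is width / height; the box area stays close to size * size
            box_w = int(round(size * ratio ** 0.5))
            box_h = int(round(size / ratio ** 0.5))
            if box_w > image_w or box_h > image_h:
                continue  # this box shape does not fit in the image
            steps_x = (image_w - box_w) // stride + 1
            steps_y = (image_h - box_h) // stride + 1
            total += steps_x * steps_y
    return total


# Hypothetical setup: a 640x480 image, 3 window sizes, 3 aspect ratios, stride 8
n = count_windows(640, 480,
                  window_sizes=[64, 128, 256],
                  aspect_ratios=[0.5, 1.0, 2.0],
                  stride=8)
print(n)
```

Even with this coarse stride and only nine box shapes, the count lands in the tens of thousands, and each box would need its own CNN forward pass.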

We need to identify the location of the objects in the image with an object detection algorithm (e.g. RCNN).

Unlike image classification, detection requires localizing (likely many) objects within an image.

$$\text{Classification} + \text{Localization} = \text{Object Detection}$$

The difference between object detection algorithms (e.g. RCNN) and classification algorithms (e.g. CNN) is that in detection algorithms, we try to draw a bounding box around the object of interest (localization) to locate it within the image.

  • Classification: What is it?
  • Localization: What and Where is it?
  • Detection: What and Where are they?

What are we predicting?

Instead of predicting only the class of the object in an image, we now have to predict the class as well as a rectangle (called a bounding box) containing that object. It takes 4 variables to uniquely identify a rectangle. So, for each instance of an object in the image, we predict the following variables:

  • Class name
  • x - bounding_box_top_left_x_coordinate
  • y - bounding_box_top_left_y_coordinate
  • w - bounding_box_width
  • h - bounding_box_height
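As a minimal sketch, the five predicted values can be kept in a small record type. The class name `Detection`, its helper methods, and the sample numbers are hypothetical and only illustrate the (x, y, w, h) convention listed above.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    class_name: str
    x: float  # top-left x coordinate of the bounding box
    y: float  # top-left y coordinate of the bounding box
    w: float  # bounding box width
    h: float  # bounding box height

    def corners(self):
        """Return the box as (x_min, y_min, x_max, y_max)."""
        return (self.x, self.y, self.x + self.w, self.y + self.h)

    def center(self):
        """Return the centre point of the box."""
        return (self.x + self.w / 2, self.y + self.h / 2)


# One image can yield any number of detections, one record per object instance
detections = [
    Detection("cat", 30, 40, 120, 90),
    Detection("dog", 200, 60, 150, 180),
]
for d in detections:
    print(d.class_name, d.corners())
```

The list can have any length, which is exactly the variable-size output that a plain classifier cannot produce.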

We can also have a multi-class object detection problem, where we detect multiple kinds of objects in a single image.

The Algorithms for Object Detection

The common algorithms for object detection:

  • Histogram of Oriented Gradients (HOG)
  • Region-based Convolutional Neural Networks (R-CNN)
  • Fast R-CNN
  • Faster R-CNN
  • Region-based Fully Convolutional Network (R-FCN)
  • Single Shot Detector (SSD)
  • Spatial Pyramid Pooling (SPP-net)
  • YOLO (You Only Look Once)

Conventional Approaches: Template matching, HOG, etc

  • Feature extraction => Feature vector formation => Similarity measure (Euclidean Distance, SVM, or NN)
    • Search: to look for a match: Exhaustive Search, Quick search techniques

This article briefly introduces the RCNN family (RCNN, Fast RCNN, Faster RCNN) and YOLO.

What is RCNN?

RCNN has nothing to do with RNN (Recurrent neural networks).

R-CNN is short for “Region-based Convolutional Neural Networks.”

Paper of RCNN: Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5)

How Does RCNN Detect Things?

  1. Takes in an input image
  2. Extracts around 2000 bottom-up region proposals
  3. Computes features for each proposal using a large convolutional neural network (CNN)
  4. Classifies each region using class-specific linear SVMs

Object detection with R-CNN

R-CNN solves the problem of choosing which regions to examine by using an object proposal algorithm called Selective Search.

RCNN is done by:

  • Step1: Extract regions from the input image using Selective Search
  • Step2: Resize each region (image patch) to the input size of the model and run a forward pass through the CNN to get its features
  • Step3: Classify the features of each region using multiple SVM classifiers
  • Step4: Apply a linear regression model (bounding box regression) for the predicted class to slightly adjust the coordinate offsets of the bounding box of the region proposal.
    • The bounding box regressor is trained separately to refine the patches.
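Step4 can be sketched as follows. The parameterization (centre shifts scaled by the proposal size, width and height scaled in log space) follows the one described in the R-CNN paper, but the function name and the sample numbers are illustrative assumptions, and the regressor that would produce the deltas is not shown.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Refine a proposal (x, y, w, h) with regression outputs (dx, dy, dw, dh).

    x and y are the top-left corner. The deltas shift the box centre by a
    fraction of the proposal size and rescale width and height in log space.
    """
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    center_x = px + pw / 2
    center_y = py + ph / 2
    new_center_x = center_x + pw * dx
    new_center_y = center_y + ph * dy
    new_w = pw * math.exp(dw)
    new_h = ph * math.exp(dh)
    return (new_center_x - new_w / 2, new_center_y - new_h / 2, new_w, new_h)


# Hypothetical proposal and regression output: shift right by 10% of the width
refined = apply_box_deltas((0, 0, 100, 50), (0.1, 0.0, 0.0, 0.0))
print(refined)
```

Expressing the offsets relative to the proposal size keeps the regression targets on a similar scale for small and large boxes.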

The Disadvantages of RCNN

  • It cannot run in real time.

R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation.

What is Fast RCNN?

The approach is similar to the R-CNN algorithm. But, instead of feeding the region proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature map.

Motivation:

  • By multiplying the feature maps with the weights, we can obtain a full-resolution probability map and locate the object.

Fast RCNN combines ideas from SPP-net and RCNN and fixes the key problem in SPP-net: it makes end-to-end training possible.

Fast R-CNN is faster than R-CNN because you don't have to feed 2000 region proposals to the convolutional neural network every time. Instead, the convolution operation is done only once per image and a feature map is generated from it.

To make this possible, Fast RCNN uses a different architecture.

Fast RCNN is done by:

  • Step1: Input whole image into CNN and get the feature map at 5th conv layer
  • Step2: ROI (Regions of Interest) Pooling
    • Transform the ROI on the feature map to a fixed-size feature vector
  • Step3: FC layers
  • Step4: Bounding box regressors and Softmax classifier from the FC layers
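The ROI pooling step (Step2) is the part that turns regions of different sizes into vectors of one fixed length. Below is a minimal NumPy sketch of max-based ROI pooling. The function name, the `(H, W, C)` layout, the ROI given directly in feature-map cells, and the 2x2 output size are assumptions chosen for readability, not the exact layer from the paper.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one ROI of a (H, W, C) feature map into a fixed-size grid.

    roi is (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    out_h, out_w = output_size
    region = feature_map[y0:y1, x0:x1]
    h, w, c = region.shape
    # Bin edges that split the region into out_h x out_w roughly equal bins
    ys = np.linspace(0, h, out_h + 1)
    xs = np.linspace(0, w, out_w + 1)
    pooled = np.empty((out_h, out_w, c), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            ya, yb = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            xa, xb = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            pooled[i, j] = region[ya:yb, xa:xb].max(axis=(0, 1))
    return pooled


rng = np.random.default_rng(0)
feature_map = rng.random((8, 8, 3))                # stand-in for a conv feature map
small = roi_max_pool(feature_map, (1, 1, 4, 5))    # 3 wide, 4 tall
large = roi_max_pool(feature_map, (0, 2, 8, 8))    # 8 wide, 6 tall
print(small.reshape(-1).shape, large.reshape(-1).shape)  # both are (12,)
```

Because every ROI ends up with the same number of values, the same FC layers in Step3 can process all proposals.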

Why Fast?

  • The convolution operation is done only once per image, instead of once for each of the 2000 region proposals, and all proposals share the resulting feature map.
  • Final classification and bounding box regression are done concurrently
  • With VGG16, the Fast R-CNN paper reports 9x faster training and 213x faster test-time speed than RCNN

Multi-task loss

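  • Log loss (for classification) + Smooth L1 loss (for bounding box regression) is used
  • Smooth L1 loss can be interpreted as a combination of L1 loss and L2 loss: it is quadratic (like L2) for small errors and linear (like L1) for large errors.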
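A small sketch of the two loss terms, assuming the usual Smooth L1 definition with the switch point at an absolute error of 1. The function names, the convention that class 0 is background, and the weight `lam` are illustrative choices.

```python
import numpy as np


def smooth_l1(x):
    """Quadratic (L2-like) for |x| < 1, linear (L1-like) otherwise."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)


def multi_task_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Log loss on the true class plus Smooth L1 on the four box offsets.

    Assumption: class 0 is background, which has no box and so no box loss.
    """
    cls_loss = -np.log(class_probs[true_class])
    if true_class == 0:
        return cls_loss
    diff = np.asarray(box_pred, dtype=float) - np.asarray(box_target, dtype=float)
    return cls_loss + lam * smooth_l1(diff).sum()


print(smooth_l1([0.5, 2.0]))  # small error is squared, large error grows linearly
print(multi_task_loss(np.array([0.1, 0.9]), 1,
                      [0.1, 0.0, 0.2, 0.0], [0.0, 0.0, 0.0, 0.0]))
```

The linear branch keeps a single badly wrong box from dominating the gradient, which is the practical argument for preferring it over a plain L2 loss.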

Paper of Fast R-CNN

What is Faster RCNN?

Faster R-CNN is a single, unified network for object detection.

Motivation:

  • The slowest part of Fast RCNN and RCNN was generating region proposals with Selective Search or EdgeBoxes. Faster RCNN replaces Selective Search with a very small convolutional network called the Region Proposal Network (RPN) to generate regions of interest.

Paper of Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

How Does Faster RCNN Detect Things?

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.

Faster RCNN is done by:

  • Step1: Input whole image into CNN and get the feature map at 5th conv layer
  • Step2: Get region proposals using region proposal network
    • Input of RPN is the entire feature map, and the output is a set of object proposals with objectness scores.
    • Region proposals are generated by sliding a small window, say a 3x3 window, over the feature map output by the CNN
    • At each window position, up to k regions are proposed relative to k reference boxes, which are called anchors
      • Anchors are boxes with different scales and aspect ratios (the paper uses 3 scales and 3 aspect ratios, so k = 9)
      • After the proposals are scored, only the ones above a threshold score are kept
  • Step3: ROI (Regions of Interest) Pooling using the proposals and feature map
    • Transform the ROI on the feature map to a fixed-size feature vector
  • Step4: Bounding box regressors and Softmax classifier from the FC layers
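The anchor idea in Step2 can be sketched as below: for one sliding-window position, generate k boxes that share a centre but differ in scale and aspect ratio, then keep only the ones whose objectness score passes a threshold. The scales and ratios follow the defaults reported in the Faster R-CNN paper. The function names, the made-up scores, and the threshold of 0.7 are assumptions for illustration.

```python
import numpy as np


def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors as (x0, y0, x1, y1) rows."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # ratio is width / height; each anchor keeps an area of scale ** 2
            w = scale * np.sqrt(ratio)
            h = scale / np.sqrt(ratio)
            anchors.append([center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2])
    return np.array(anchors)


def keep_confident(boxes, scores, threshold=0.7):
    """Keep only the boxes whose objectness score exceeds the threshold."""
    mask = scores > threshold
    return boxes[mask], scores[mask]


anchors = make_anchors(300, 200)
# Made-up objectness scores; in the real network the RPN predicts these
scores = np.array([0.9, 0.1, 0.3, 0.2, 0.8, 0.1, 0.4, 0.2, 0.1])
kept_boxes, kept_scores = keep_confident(anchors, scores)
print(anchors.shape, kept_boxes.shape)
```

In the full network the same k anchors are repeated at every sliding-window position, and the RPN also regresses offsets for each anchor, which this sketch leaves out.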

Faster RCNN can run in nearly real time.

What is YOLO?

YOLO is a convolutional neural network (CNN) for real-time object detection. At runtime, the image passes through the CNN only once, which is why YOLO is very fast and can run in real time.

YOLO stands for You Only Look Once. It is similar to RCNN, but in practice it runs a lot faster than Faster RCNN due to its simpler architecture. Unlike Faster RCNN, which first proposes regions and then classifies them, it is trained to predict class probabilities and bounding boxes in a single pass over the image.

How Does YOLO Detect Things?

Processing images with YOLO is simple and straightforward.

  1. Resizes the input image to $448 \times 448$
  2. Runs a single convolutional network on the image
  3. Thresholds the resulting detections by the model’s confidence
  • Does not undergo the region proposal step
  • Only predicts over a limited number of bounding boxes
  • The image is divided into an SxS grid
    • Each grid cell predicts B bounding boxes, each with a confidence score (YOLOv1 predicts these boxes directly; anchor boxes are only introduced in YOLOv2)
      • confidence = Pr(Object) × IoU (Intersection over Union) between the predicted and the ground truth boxes
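IoU is the overlap measure behind the confidence score. The sketch below computes it for two boxes in the (x, y, w, h) top-left convention and then forms the confidence target as Pr(Object) × IoU, as defined in the YOLO paper. The function names and the sample boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes with top-left origin."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def confidence_target(object_present, predicted_box, truth_box):
    """Confidence target: Pr(Object) times the IoU with the ground truth box."""
    pr_object = 1.0 if object_present else 0.0
    return pr_object * iou(predicted_box, truth_box)


print(iou((0, 0, 2, 2), (1, 1, 2, 2)))
print(confidence_target(True, (0, 0, 2, 2), (0, 0, 2, 2)))
```

A cell without an object gets a target of zero regardless of the box, while a cell with an object is rewarded according to how well its predicted box overlaps the ground truth.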

YOLOv1 Architecture (Network Design)

The Detection network has 24 convolutional layers followed by 2 fully connected layers.

  • Alternating 1 × 1 convolutional layers (reduction layers) reduce the feature space from preceding layers

Input is the image with dimension $448 \times 448$

Final output is the $7 \times 7 \times 30$ tensor of predictions

  • Leaky ReLU as the activation in all layers except the last
  • Linear activation function for the final layer
  • Sum-squared error (SSE) as the loss function
  • Batch size of 64, momentum of 0.9 and decay of 0.0005
  • Dropout (rate = 0.5) is used after the first fully connected layer
  • Data augmentation is used (random scaling, translation, exposure, saturation)

YOLO unifies the separate components of object detection into a single neural network. YOLO’s network divides the input image into an $S \times S$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts $B$ bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is.

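$S = 7$ in YOLO’s paper. Therefore, in the Network Design part, the final output is the $7 \times 7 \times 30$ tensor of predictions. The depth of 30 comes from $B \times 5 + C$ with $B = 2$ boxes per cell (each with x, y, w, h and a confidence) and $C = 20$ classes.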
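The arithmetic behind the output shape, and the rule that the cell containing an object's centre is responsible for it, can be sketched as follows. The values S = 7, B = 2 and C = 20 are the ones used in the YOLOv1 paper for PASCAL VOC. The helper name `responsible_cell` and the sample coordinates are hypothetical.

```python
import numpy as np

S, B, C = 7, 2, 20          # grid size, boxes per cell, number of classes
depth = B * 5 + C           # each box has x, y, w, h and a confidence
output = np.zeros((S, S, depth))
print(output.shape)         # (7, 7, 30)


def responsible_cell(center_x, center_y, image_size=448, grid=S):
    """Return (row, col) of the grid cell that contains an object's centre."""
    col = min(int(center_x / image_size * grid), grid - 1)
    row = min(int(center_y / image_size * grid), grid - 1)
    return row, col


# An object centred in the middle of a 448x448 image falls into the middle cell
print(responsible_cell(224, 224))  # (3, 3)
```

The `min(..., grid - 1)` guard only matters for a centre lying exactly on the right or bottom image border.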

Note that starting from YOLOv3, a new backbone network (Darknet-53) is used.

Reference

R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms

Zero to Hero: Guide to Object Detection using Deep Learning: Faster R-CNN,YOLO,SSD

Paper of RCNN: Rich feature hierarchies for accurate object detection and semantic segmentation Tech report (v5)

Paper of Fast R-CNN

Paper of Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Darknet: YOLO

Paper of Yolo v1 - You Only Look Once: Unified, Real-Time Object Detection

Paper of Yolo v2 - YOLO9000: Better, Faster, Stronger

Paper of YOLOv3: An Incremental Improvement

Paper of YOLOv4: Optimal Speed and Accuracy of Object Detection

YOLO, YOLOv2 and YOLOv3: All You want to know