Computer Vision: Top Object Detection Techniques

Object detection: Locating objects in an image or video with bounding boxes, and identifying the type or class of each located object, is called object detection.

Input: An image with one or more objects. Output: One or more bounding boxes (e.g. defined by a point, width, and height), and a class label for each bounding box.
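
As a small illustration of this output format, here is a minimal sketch of how one detection could be stored. The class name `Detection`, its field names, and the sample values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: top-left point, width, height, and a class label."""
    x: float
    y: float
    width: float
    height: float
    label: str

    def corners(self):
        """Return the same box as (x_min, y_min, x_max, y_max)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


# Made-up detector output for one image containing two objects.
detections = [
    Detection(x=40, y=30, width=100, height=200, label="person"),
    Detection(x=180, y=120, width=220, height=90, label="car"),
]

for d in detections:
    print(d.label, d.corners())
```

Both representations (point plus size, or two corners) describe the same box. The corner form is convenient later when comparing boxes with each other.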

Some well-known object detection techniques

1. Viola-Jones

This technique was formulated by Paul Viola and Michael Jones in 2001. Although it can be trained to detect a variety of object classes, it gives its best results on face detection. It uses Haar cascade classifiers to detect objects without using neural networks. The algorithm looks for the most relevant features of a human face (e.g. eyes, nose, lips, and forehead). If it does not find these features in a region of the image, it concludes that the region contains no human face.
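
A Haar-like feature is the difference between the pixel sums of adjacent rectangles, and the original Viola-Jones paper computes those sums quickly with an integral image. The sketch below shows that idea in plain Python. The function names and the tiny 4x4 "image" are assumptions made for illustration, not part of any library.

```python
def integral_image(img):
    """Build a zero-padded table where ii[r][c] is the sum of img[:r][:c]."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using only four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: lower half minus upper half."""
    half = height // 2
    upper = rect_sum(ii, top, left, half, width)
    lower = rect_sum(ii, top + half, left, half, width)
    return lower - upper


# Toy image: a dark band (like the eye region) above a bright band (like the cheeks).
image = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]
ii = integral_image(image)
feature_value = two_rect_feature(ii, top=0, left=0, height=4, width=4)
print(feature_value)
```

A large value means the window has the dark-over-bright pattern this feature looks for. A real cascade evaluates many such features at many positions and scales and rejects a window as soon as an early stage fails.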


Advantages

• Detection is very fast
• Simple to understand and implement
• Less data needed for training than other ML models
• No resizing of images needed (unlike with CNNs)


Disadvantages

• Training is very slow
• Restricted to binary classification
• Mostly effective when the face is in frontal view
• May be sensitive to very high or very low exposure (brightness)
• High true detection rate, but also a high false detection rate


Applications

• Attendance recording for employees based on face detection instead of fingerprints.

2. HOG (Histogram of Oriented Gradients)

This approach was first published by Navneet Dalal and Bill Triggs in 2005, and it performs well on face detection, object detection, and similar tasks. It is built on gradients: the differences in intensity between a pixel and the pixels right next to it (the surrounding pixels). Using grayscale images and applying blurring (Gaussian smoothing) both have a negative effect on the precision of the final classification algorithm.

An oriented gradient tells us the direction of greatest intensity change in the neighborhood of pixel (x, y).

The angles are within the range [0, 180] degrees. These are so-called unsigned gradients: a gradient arrow and the one 180 degrees opposite to it are considered the same. Finally, we calculate a histogram with 9 bins corresponding to angles in degrees, so each bin covers 20 degrees.
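
The histogram step can be sketched as follows for a single small cell of a grayscale image. This is a simplified illustration under a few assumptions: the function name `cell_histogram` is hypothetical, border pixels are skipped, and each pixel votes for exactly one bin (full HOG implementations usually split the vote between neighboring bins and normalize over blocks of cells).

```python
import math


def cell_histogram(cell, bins=9):
    """Histogram of unsigned gradient orientations, weighted by gradient magnitude."""
    rows, cols = len(cell), len(cell[0])
    bin_width = 180.0 / bins  # 20 degrees per bin when bins == 9
    hist = [0.0] * bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Differences between the neighboring pixels on each axis.
            gx = cell[r][c + 1] - cell[r][c - 1]
            gy = cell[r + 1][c] - cell[r - 1][c]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation: fold opposite directions together.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Intensity changes from left to right, so the gradient points along the x axis.
left_to_right = [[0, 0, 255, 255] for _ in range(4)]
# Intensity changes from top to bottom, so the gradient points along the y axis.
top_to_bottom = [[0] * 4, [0] * 4, [255] * 4, [255] * 4]

hist_x = cell_histogram(left_to_right)
hist_y = cell_histogram(top_to_bottom)
print(hist_x)
print(hist_y)
```

In the first cell all the gradient energy lands in the bin around 0 degrees, and in the second cell it lands in the bin around 90 degrees. That difference between histograms is what the classifier on top of HOG learns from.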


Advantages

• Gives good results on face detection even when the location of the face changes.
• Easy to use for binary (object or no object) detection in an image.


Disadvantages

• Takes a lot of time to execute.
• Does not give good results when there are multiple objects.
• Applying the sliding window and normalizing the data manually is challenging.


Applications

• Pedestrian detection on highways.
• Face detection for attendance and mobile lock screens.

3. Region-Based Convolutional Neural Networks (R-CNNs)

This approach uses a convolutional neural network that takes an input image and produces a set of bounding boxes as output, where each bounding box contains an object together with its category.

• It is used for object detection: cars, pedestrians, people, etc.
• The main problem with using a standard CNN for detection is that we have to consider a very large number of regions of the image, many of which contain no objects at all.

R-CNNs effectively reduce the number of regions to evaluate using an algorithm called selective search.

The selective search algorithm generates so-called region proposals, and these regions are fed into a neural network.

Selective search uses segmentation: the algorithm groups adjacent regions that are similar to each other, so that the pixels are grouped into a smaller number of segments.

• Selective search uses the result of the Felzenszwalb-Huttenlocher segmentation of the actual image.
• The initial proposed regions are the segments produced by that segmentation.
• The algorithm keeps grouping segments based on similarity (color, texture, size, and shape).

After selective search, the proposed regions are sent to the neural network.

The neural network has already been trained on the training data. Each proposed region of the image is passed through the network, and the output feature vectors are sent to support vector machines (SVMs) to find the class label.

Problems with R-CNNs

• It still takes a huge amount of time to train the network, as you have to classify 2000 region proposals per image.
• It cannot be used in real time, as it takes around 47 seconds for each test image.
• The selective search algorithm is a fixed algorithm, so no learning happens at that stage. This can lead to the generation of bad candidate region proposals.
• Training time is around 84 hours.

4. Fast R-CNN

This approach is similar to the R-CNN algorithm. But instead of feeding the region proposals to the neural network, we feed the whole input image to the network to generate a convolutional feature map. From the convolutional feature map, we identify the proposed regions and warp them into squares, and an RoI (region of interest) pooling layer reshapes each of them to a fixed size so that it can be fed into a fully connected layer. From the resulting RoI feature vector, a softmax layer predicts the class of the proposed region, and the network also predicts the offset values for the bounding box.
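
To make the RoI pooling step more concrete, here is a minimal sketch that max-pools one region of a 2-D feature map into a fixed-size grid. It assumes a single-channel feature map stored as nested lists and an RoI already expressed in feature-map coordinates. The function name `roi_max_pool` and the sample values are illustrative only.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region into an out_size x out_size grid.

    roi is (top, left, bottom, right) in feature-map cells, end-exclusive.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_size):
        # Row range of the i-th output bin (always at least one row).
        r0 = top + (i * height) // out_size
        r1 = max(top + ((i + 1) * height) // out_size, r0 + 1)
        row = []
        for j in range(out_size):
            # Column range of the j-th output bin (always at least one column).
            c0 = left + (j * width) // out_size
            c1 = max(left + ((j + 1) * width) // out_size, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

large_roi = roi_max_pool(feature_map, (0, 0, 4, 6))  # a 4x6 region
small_roi = roi_max_pool(feature_map, (1, 1, 4, 4))  # a 3x3 region
print(large_roi)
print(small_roi)
```

Both regions come out as 2x2 grids even though they have different sizes, which is what lets every proposal feed the same fully connected layer.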

Fast R-CNN is faster than R-CNN because you don't have to feed 2000 region proposals to the convolutional neural network for every image. Instead, the convolution operation is done only once per image, and a feature map is generated from it.

Fast R-CNN drastically improves training time (8.75 hrs vs 84 hrs) and detection time compared to R-CNN. It also improves mean average precision (mAP) marginally.

Problems with Fast R-CNN

Most of the time taken by Fast R-CNN during detection is spent in the selective search region proposal algorithm. This is the bottleneck of the architecture, and it is addressed in Faster R-CNN.

5. Faster R-CNN

Faster R-CNN uses a learned region proposal method to create the sets of regions. It has an extra CNN for generating the region proposals, which we call the region proposal network (RPN). During training, the region proposal network takes the feature map as input and outputs region proposals. These proposals then go to the RoI pooling layer for further processing.

Comparison between Faster R-CNN and Fast R-CNN

• Faster R-CNN is much faster than Fast R-CNN and R-CNN because it uses an RPN (Region Proposal Network), which generates region proposals from anchor boxes instead of running selective search.
• It is fast enough to be used for near real-time object detection.
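
The anchor boxes mentioned above can be sketched like this: at every position of the feature map, the RPN considers a fixed set of boxes with different sizes and aspect ratios. The default scales and ratios below follow the 3 x 3 setup described in the Faster R-CNN paper, while the function name `make_anchors` and the center point are assumptions for this example.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes (x_min, y_min, x_max, y_max) centered at (cx, cy).

    Each anchor has an area of scale**2 and a width/height ratio of `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(cx=100, cy=100)
for box in anchors:
    print([round(v, 1) for v in box])
```

The RPN then scores each anchor as object or background and predicts small corrections to its coordinates, so the expensive search for candidate regions becomes one more pass through a network.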


Applications

• Face detection using neural networks.
• Face recognition.
• Object tracking.

6. YOLO (You Only Look Once)

Region-based neural networks have a few issues: they are very accurate but not fast, so they are hard to use in real time, and several components of the approach have to be trained separately (CNN, SVM, linear regression model for bounding boxes, etc.). YOLO deals with these issues. It was first published in 2015 by Joseph Redmon, Ross Girshick, and co-authors. With YOLO, we can train the whole model as a single neural network.

First, YOLO divides the image into a grid (for example 19*19 cells). For each grid cell, the network outputs a vector containing 5 values (an objectness score and the position of the bounding box) plus one score per class label, so the entire detection happens within a single CNN. To handle multiple overlapping detections, we can use non-max suppression and intersection over union (IoU). Compared to region-based neural networks, YOLO is therefore much faster.

Pre-trained YOLO models are trained on the COCO dataset, which has 80 labels, so we can use the pre-trained weights directly to detect any of those 80 objects. It is also possible to use YOLO on custom data: we just need to annotate the data and train the network.
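
Intersection over union and non-max suppression are simple enough to sketch in a few lines. The version below is a plain-Python illustration, not the code of any particular YOLO release. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, and the 0.5 threshold and sample boxes are made up for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The first two boxes overlap heavily, so only the higher-scoring one survives, while the third box covers a different object and is kept. In practice non-max suppression is usually run separately for each class.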

Advantages and disadvantages of YOLO

• YOLO is orders of magnitude faster (45 frames per second) than other object detection algorithms.
• The limitation of the YOLO algorithm is that it struggles with small objects within the image, for example, it might have difficulties in detecting a flock of birds. This is due to the spatial constraints of the algorithm.


Applications

• Vehicle detection.
• Number plate detection.
• Covid detection.

7. SSD (Single Shot MultiBox Detector)

The first step is very similar to the YOLO algorithm: we divide the original image into S*S grid cells.
Each grid cell is responsible for detecting objects in that region of the image. There is a problem again: what if there are multiple objects in the same region of the image? We can use default boxes, also called anchor boxes (these are predefined boxes).
The SSD algorithm makes an assumption about the sizes and aspect ratios of the default boxes (we can learn about them from the training dataset), since there are patterns in the aspect ratios and sizes of the bounding boxes. We assign 6 different types of bounding boxes to every single grid cell.

The algorithm finds the default box that is most similar to the detected object. The advantage of the SSD algorithm is that it makes several predictions at different stages of the CNN.

A feature map is the output of one filter applied to the previous layer.

The convolutional layers decrease in size progressively, which allows detections to be predicted at multiple scales.

Features at different layers represent regions of different sizes in the input image. The region of the image represented by a feature gets larger and larger in deeper layers, so predictions from the earlier layers help in dealing with smaller objects.
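
To get a feeling for how many predictions this multi-scale design produces, we can count the default boxes. The layer sizes below are an assumption taken from the SSD300 configuration in the original SSD paper, where some feature maps use 4 default boxes per cell and others use 6. The helper name `count_default_boxes` is hypothetical.

```python
# (feature map side length, default boxes per cell), assumed SSD300 layout.
ssd300_layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_default_boxes(layers):
    """Total number of default boxes over all prediction layers."""
    return sum(side * side * boxes_per_cell for side, boxes_per_cell in layers)


for side, boxes_per_cell in ssd300_layers:
    print(f"{side}x{side} feature map: {side * side * boxes_per_cell} boxes")

total_boxes = count_default_boxes(ssd300_layers)
print("total:", total_boxes)
```

Most of the boxes come from the large early feature maps, which look at small regions of the image, while the last 1x1 map only contributes a handful of boxes that cover almost the whole image.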


Advantages

• Accuracy increases with the number of default bounding boxes, at the cost of speed.
• Multi-scale feature maps improve the detection of objects at different scales.


Disadvantages

• Shallow layers in a neural network may not generate enough high-level features to make predictions for small objects.
• Therefore, SSD does worse on smaller objects than on bigger objects.
• The need for complex data augmentation also suggests it needs a large amount of data to train.
