Object Detection: a computer technology, related to computer vision and image processing, that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos.
Classification: the process of categorization, in which ideas and objects are recognized, differentiated, and understood.
Localization: placing a rectangular box around each classified object.
Why YOLO?
The R-CNN family of techniques localizes objects within an image by proposing regions: the network does not look at the entire image, only at the parts of the image that have a higher chance of containing an object.
The YOLO framework, on the other hand, approaches object detection differently. It takes the entire image in a single pass and predicts bounding box coordinates and class probabilities for those boxes. The biggest advantage of YOLO is its superb speed: it is incredibly fast and can process 45 frames per second. YOLO also learns generalized object representations. It is one of the best algorithms for object detection and has shown performance comparable to the R-CNN algorithms, so for better results we are using YOLOv4 to develop custom object detection.
How Does YOLO (You Only Look Once) Work?
Contents :
- What is YOLO and Why is it Useful?
- How does the YOLO Framework Function?
- How to Encode Bounding Boxes?
- Intersection over Union and Non-Max Suppression
What is YOLO and Why is it Useful?
YOLO is an algorithm that uses neural networks to provide real-time
object detection. It is popular because of its speed and accuracy,
and it has been used in various applications to detect traffic
signals, people, parking meters, and animals. The YOLO model
processes 45 frames per second, faster than other detection models.
How does YOLO work?
After reading an image, YOLO splits it into a 19 * 19 grid. For
easier understanding, we will take a 3 * 3 grid as an example.

In the 3 * 3 grid above, each cell is treated as an individual image
and sent to the neural network. If a labelled object is found in a
cell, we collect four parameters from that cell:
- Height
- Width
- X coordinate of the midpoint of the label
- Y coordinate of the midpoint of the label
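These four parameters can be sketched in code. A minimal illustration, where the image size, grid size, and box values are made-up assumptions:

```python
# Sketch: map a labelled box to its grid cell and collect the four
# parameters (assumption: a 3x3 grid on a 300x300 image; the box is
# given as pixel corners x1, y1, x2, y2).

def box_to_cell(x1, y1, x2, y2, img_w=300, img_h=300, grid=3):
    """Return (row, col) of the cell holding the box midpoint, plus
    height, width, x and y, all normalised to the 0..1 range."""
    cell_w, cell_h = img_w / grid, img_h / grid
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2            # midpoint of the label
    col, row = int(mx // cell_w), int(my // cell_h)  # owning grid cell
    x = (mx - col * cell_w) / cell_w                 # x offset inside the cell
    y = (my - row * cell_h) / cell_h                 # y offset inside the cell
    w = (x2 - x1) / img_w                            # width relative to image
    h = (y2 - y1) / img_h                            # height relative to image
    return row, col, h, w, x, y

# A 100x50 box centred at (150, 150) falls in the middle cell (1, 1):
print(box_to_cell(100, 125, 200, 175))
```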
How Does YOLO Train on Images?
We need to pass labelled data to the architecture in order to train
it. Suppose we have divided the image into a 3 * 3 grid and there are
a total of 3 classes into which we want the objects to be classified.
Let's say the classes are Bus, Car, and Person. Then, for each grid
cell, the label y will be an eight-dimensional vector:
y = [pc, height, width, x, y, c1, c2, c3]
Here,
- pc defines whether an object is present in the grid cell (it is a probability, ranging from 0 to 1)
- height, width, x coordinate, and y coordinate specify the bounding box if there is an object
- c1, c2, c3 represent the classes; if the object is a car, c2 will be 1 and c1 and c3 will be 0, and so on
Let's say we select the first grid cell from the above example. Since
there is no object in this cell, pc will be zero and the y label for
this cell will be:
y = [0, ?, ?, ?, ?, ?, ?, ?]
The ? means it doesn't matter what height, width, x coordinate, y
coordinate, c1, c2, and c3 contain, as there is no object in the
cell. Let's take another cell in which we have a car (c2 = 1):

In the above image there is a car, so YOLO takes the midpoint of the
object. The y label for the center-left cell containing the car will
be:
y = [1, height, width, x, y, 0, 1, 0]
Since there is an object in this cell, pc is equal to 1, and the
height, width, x coordinate, and y coordinate are filled in from the
bounding box. Since a car is the second class, c2 = 1 and c1 and c3
are 0. So, for each of the 9 grid cells, we get an eight-dimensional
output vector, and the complete output has a shape of 3 X 3 X 8.
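A minimal sketch of this 3 X 3 X 8 training target in NumPy; the box values and the cell position are illustrative assumptions:

```python
# Sketch: build the 3 x 3 x 8 training target described above, assuming
# classes (Bus, Car, Person) map to (c1, c2, c3) and a car sits in the
# center-left cell (row=1, col=0).
import numpy as np

GRID, DEPTH = 3, 8          # 8 = pc + height + width + x + y + c1 + c2 + c3
y_true = np.zeros((GRID, GRID, DEPTH))

# Car (c2 = 1) in cell (1, 0) with example box values:
pc, h, w, x, y = 1.0, 0.3, 0.4, 0.5, 0.6
c1, c2, c3 = 0.0, 1.0, 0.0
y_true[1, 0] = [pc, h, w, x, y, c1, c2, c3]

print(y_true.shape)   # (3, 3, 8)
print(y_true[0, 0])   # empty cell: pc = 0, the rest are "don't care"
```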
How do bounding boxes really work?
If a label is found in a cell, YOLO takes care of finding its height,
width, and both midpoint coordinates.

There is a labelled car in one grid cell.

Collecting the values from that cell, the height, width, x, and y
entries of y are updated according to the label found inside the
cell. This happens for every cell in the image.
Intersection Over Union:
IoU = area of intersection / area of union
Intersection over Union measures the overlap between two bounding
boxes. During training we calculate the IoU between the predicted
bounding box and the ground truth (the pre-labelled bounding box we
aim to match). If the IoU value is > 0.5, the predicted box is a good
boundary; if it is < 0.5, it is not.
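The formula above is straightforward to implement. A minimal sketch, with boxes given as corner coordinates (the coordinate convention is an assumption):

```python
# Sketch: IoU between two boxes given as (x1, y1, x2, y2) corners.
def iou(a, b):
    """Area of intersection divided by area of union of boxes a and b."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping in a 5x10 strip share 50 of 150 units:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # about 0.33, so not a good match
```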
Non-Max Suppression:
In a 19 * 19 grid, several cells may each predict that they hold an
object's midpoint, so during training we get several different
predicted boundary positions for the same object. From among these we
select the boundary with the highest pc value. Each selected high-pc
boundary is then compared, using IoU, with the neighbouring
boundaries in the grid: boxes whose IoU with the selected boundary is
> 0.5 are suppressed, leaving a single boundary per object.
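The procedure above is standard non-max suppression; a minimal sketch, with predictions as (pc, x1, y1, x2, y2) tuples (this representation is an assumption):

```python
# Sketch of non-max suppression: keep the highest-pc box, drop neighbours
# that overlap it by IoU > 0.5, and repeat with the remaining boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def non_max_suppression(boxes, iou_threshold=0.5):
    """boxes: list of (pc, x1, y1, x2, y2). Returns the surviving boxes."""
    boxes = sorted(boxes, key=lambda b: b[0], reverse=True)  # highest pc first
    kept = []
    while boxes:
        best = boxes.pop(0)
        kept.append(best)
        # discard boxes that overlap the selected one too much
        boxes = [b for b in boxes if iou(best[1:], b[1:]) <= iou_threshold]
    return kept

preds = [(0.9, 0, 0, 10, 10), (0.8, 1, 1, 11, 11), (0.7, 20, 20, 30, 30)]
print(non_max_suppression(preds))  # the overlapping 0.8 box is suppressed
```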
YOLOv4 on the COCO dataset:
The YOLOv4 model was trained on the COCO dataset, which has 80
different labels. It is very easy to test these labels using the
pretrained weights. For this purpose we used a framework called
Darknet.
Ex:
- person
- bicycle
- car
- motorbike
- aeroplane
- bus
- train
- truck
- boat
- traffic light, etc.
Finding detections on a video:
How to do it for our own data?
To develop our own object detection model, we need to select the
labels we want to detect and annotate them in YOLO format. For
annotating the images we use a tool known as LabelImg.
Labelling our own data for object detection
To create our own object detector, we need to collect data from
external sources. We collected data for 10 different labels from
Kaggle and some other websites; the main task then is to annotate the
images in YOLO format. For that purpose we used the LabelImg tool.
In the first step, select the directory of the images and draw a
rectangular box around every label in each image.
What happens when we draw a rectangular box?
We get the box's x coordinate, y coordinate, height, and width, and
we also need to specify the class label for every object. By doing
this for each image, we get the images together with their annotation
files.
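One line of a YOLO annotation file holds the class index plus the box's centre, width, and height, all normalised by the image size. A minimal sketch of the conversion from pixel corners (the example numbers are made up):

```python
# Sketch: convert a LabelImg-style pixel rectangle into one line of a
# YOLO annotation file: "class x_center y_center width height".
def to_yolo_line(cls, x1, y1, x2, y2, img_w, img_h):
    xc = (x1 + x2) / 2 / img_w   # box centre, normalised to 0..1
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # box size, normalised to 0..1
    h = (y2 - y1) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A car (class 0) boxed at (100, 200)-(300, 400) in a 640x480 image:
print(to_yolo_line(0, 100, 200, 300, 400, 640, 480))
# -> 0 0.312500 0.625000 0.312500 0.416667
```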
Custom object detection
Multiclass object detection
Developing a custom object detection model using 10 labels
The YOLO implementations are amazing tools for detecting common
objects in images and videos. However, there are many cases when the
object we want to detect is not part of a popular dataset. In such
cases we need to create our own training set and run our own
training. Vehicle and licence plate detection is one such case. We
used YOLOv4 to detect the desired classes:
- car – 0
- truck – 1
- bus – 2
- motorcycle – 3
- auto – 4
- car LP – 5
- truck LP – 6
- bus LP – 7
- motorcycle LP – 8
- auto LP – 9
Later we split the dataset into training and validation sets and
stored the paths of all the images in files named train.txt and
valid.txt.
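The split can be sketched in a few lines of Python; the file names stand in for the real collected images, and the 80/20 ratio is an assumption:

```python
# Sketch: split image paths into train.txt and valid.txt.
import random

def split_paths(paths, valid_fraction=0.2, seed=42):
    """Shuffle deterministically and return (train, valid) path lists."""
    paths = sorted(paths)
    random.Random(seed).shuffle(paths)
    n_valid = int(len(paths) * valid_fraction)
    return paths[n_valid:], paths[:n_valid]

# Hypothetical file names standing in for the collected images:
images = [f"data/obj/img_{i:03d}.jpg" for i in range(100)]
train, valid = split_paths(images)

# Write the two list files that darknet reads:
with open("train.txt", "w") as f:
    f.write("\n".join(train) + "\n")
with open("valid.txt", "w") as f:
    f.write("\n".join(valid) + "\n")

print(len(train), len(valid))  # 80 20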
Configuring Files
YOLOv4 needs certain specific files to know how and what to train:
- obj.data
- obj.names
- obj.cfg
obj.data
This file basically says that we are training 10 classes, names the
train and validation files, and points to the file containing the
names of the objects we want to detect. During training, the weights
are saved in the backup folder.
- classes = 10
- train = train.txt
- valid = valid.txt
- names = obj.names
- backup = backup
obj.cfg
We copied the yolov4.cfg file and made a few changes in it:
- Set batch=24 to use 24 images for every training step.
- Set subdivisions=8 to split each batch into 8 mini-batches so it fits in GPU memory.
- Set filters=(classes + 5)*3 in the convolutional layers just before each [yolo] layer; with 10 classes, filters=45.
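Put together, the edited lines of the copied cfg might look like the fragment below (only the changed settings are shown; everything else stays as in yolov4.cfg):

```ini
batch=24
subdivisions=8

# in each [convolutional] layer just before a [yolo] layer:
[convolutional]
filters=45

# in each [yolo] layer:
[yolo]
classes=10
```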
After completing the training, the total accuracy score was 89%. For
better model performance we needed to increase this score.
Steps used to increase the accuracy:
- Data augmentation
- Increasing the subdivisions for each batch
- Training the model for 6000 iterations
After applying all 3 steps above, training was run once again, and the accuracy increased to 94%.
What is Data Augmentation?
Data augmentation techniques increase the amount of data by adding
slightly modified copies of already existing data, or newly created
synthetic data derived from existing data. So data augmentation
involves creating new and representative data; for example, some
images are rotated by 40 degrees and some by 180, 270, etc.
- Rotation_range = 40
- width_shift_range=0.2
- height_shift_range=0.2
- shear_range=0.2
- zoom_range=0.3
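The parameters listed above match the arguments of Keras's ImageDataGenerator. The idea itself can be sketched with plain NumPy, here simplified to right-angle rotations and fixed shifts rather than the random ranges above:

```python
# Simplified augmentation sketch: generate rotated and shifted copies
# of one image array (a stand-in for the Keras random transforms).
import numpy as np

def augment(img, rotations=(0, 1, 2, 3), shift=2):
    """Return rotated and shifted copies of one image."""
    copies = [np.rot90(img, k) for k in rotations]  # 0/90/180/270 degrees
    copies.append(np.roll(img, shift, axis=1))      # width shift
    copies.append(np.roll(img, shift, axis=0))      # height shift
    return copies

img = np.arange(16).reshape(4, 4)   # toy 4x4 "image"
print(len(augment(img)))            # 6 augmented copies from 1 image
```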
Final result on a video: