YOLO!!! So do we only live once? I sure don't know. What I do know is that we only
have to LOOK once. Wait, what?
That’s right. If you want to detect and localize objects in an image, there is
no need to go through the whole process of proposing regions of interest,
classifying them and correcting their bounding boxes. If you recall from my
previous post, that is exactly what models like R-CNN and Faster R-CNN do.
Do we really need all that complexity and computation? If we want top-notch
accuracy, we certainly do. Luckily, there is a simpler way to perform the same
task: process the image only once and output the predictions immediately.
Models of this type are called single shot detectors.
Single shot detectors
Instead of a dedicated system that proposes regions of interest, we use a set
of predefined boxes to look for objects, and forward the image through a stack
of convolutional layers that predict class scores and bounding box offsets.
For each predefined box, the model predicts a number of bounding boxes with a
confidence score assigned to each one, detects one object centered in that box,
and outputs a set of probabilities for each possible class. Once we have all
that, we simply (and maybe naively) keep only the boxes with a high confidence
score. And it works. With very impressive results actually. To illustrate the
overall flow, let’s use one of the most popular single shot detectors, called YOLO.
You only look once (YOLO)
There have been 3 versions of the model so far, each one improving on the
previous in terms of both speed and accuracy. The number of grid cells and the
number of bounding boxes predicted for each cell are defined based on the
input size and the number of classes. In our case, we are going to use the
actual numbers used to evaluate the model on the PASCAL VOC dataset.
First, we divide the image into a 13×13 grid, resulting in 169 cells in total.
For every one of the cells, the model predicts 5 bounding boxes (x, y, w, h),
each with a confidence score, detects one object regardless of the number of
boxes, and outputs 20 probabilities for the 20 classes.
In total, we have 169*5=845 bounding boxes, and the shape of the model's output
tensor is going to be (13, 13, 5*5+20) = (13, 13, 45). The whole essence of the
YOLO models is to build this (13, 13, 45) tensor. To accomplish that, YOLO uses
a convolutional network followed by 2 fully connected layers that perform the
actual regression.
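To make the tensor layout concrete, here is a small sketch (using numpy and random values as a stand-in for real network output) of how each cell's 45 numbers split into 5 boxes of (x, y, w, h, confidence) plus 20 class probabilities:

```python
import numpy as np

S, B, C = 13, 5, 20  # grid size, boxes per cell, classes (PASCAL VOC)

# Stand-in for the raw network output: one vector of length B*5 + C per cell
output = np.random.rand(S, S, B * 5 + C)

# Split each cell's vector into its parts
boxes = output[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence) per box
class_probs = output[..., B * 5:]                # 20 class probabilities per cell

print(output.shape)       # (13, 13, 45)
print(boxes.shape)        # (13, 13, 5, 5)
print(class_probs.shape)  # (13, 13, 20)
print(S * S * B)          # 845 boxes in total
```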
The final prediction is extracted by keeping only the bounding boxes with a
high confidence score (higher than a threshold such as 0.3).
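The thresholding step itself is a one-liner. Below is a minimal sketch with a few hypothetical decoded predictions (the coordinates and class ids are made up for illustration):

```python
import numpy as np

conf_threshold = 0.3

# Hypothetical decoded predictions: rows of (x, y, w, h, confidence, class_id)
preds = np.array([
    [0.50, 0.50, 0.20, 0.30, 0.90, 7],
    [0.52, 0.48, 0.22, 0.28, 0.25, 7],  # confidence below 0.3 -> dropped
    [0.10, 0.80, 0.15, 0.15, 0.60, 3],
])

kept = preds[preds[:, 4] > conf_threshold]
print(len(kept))  # 2
```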
Because the model may output duplicate detections for the same object, we use a
technique called non-maximum suppression to remove duplicates. In a simple
implementation, we sort the predictions by confidence score and, as we iterate
over them, discard any box that overlaps heavily with a box we have already kept.
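A minimal greedy implementation of that idea looks roughly like this (boxes here are in (x1, y1, x2, y2) corner format, and the 0.5 IoU threshold is a common but arbitrary choice):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep this box only if it does not overlap too much with a kept one
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two overlapping detections of the same object plus one separate object
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first one almost entirely, so only the higher-scoring duplicate survives.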
As far as the actual model is concerned, the architecture is quite simple, as
it consists of only convolutional and pooling layers, without any fancy tricks.
We train the model using a multi-part loss function, which includes a
classification loss, a localization loss and a confidence loss.
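To give a feel for how those three parts combine, here is a deliberately simplified sketch with one box per cell and plain squared errors (the actual YOLO loss also takes square roots of width and height, assigns a responsible box among the 5 per cell, and uses the paper's weighting; the lambda values below follow the original paper):

```python
import numpy as np

def yolo_style_loss(pred, target, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified YOLO-style loss over (S, S, 5 + C) tensors holding one
    box (x, y, w, h, confidence) plus C class probabilities per cell.
    Cells that contain an object are marked by target confidence = 1."""
    obj_mask = target[..., 4]      # 1 where a cell holds an object
    noobj_mask = 1.0 - obj_mask

    # Localization loss: squared error on coordinates, object cells only
    loc = np.sum(obj_mask[..., None] * (pred[..., :4] - target[..., :4]) ** 2)

    # Confidence loss: down-weighted for the many empty cells
    conf_err = (pred[..., 4] - target[..., 4]) ** 2
    conf = np.sum(obj_mask * conf_err) + lambda_noobj * np.sum(noobj_mask * conf_err)

    # Classification loss: squared error on class probabilities, object cells only
    cls = np.sum(obj_mask[..., None] * (pred[..., 5:] - target[..., 5:]) ** 2)

    return lambda_coord * loc + conf + cls
```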
The most recent versions of YOLO have introduced some special tricks to improve
the accuracy and reduce the training and inference time. Some examples are batch
normalization, anchor boxes, dimension clusters and others. If you want to get
into more details, you should definitely check the original papers.
The power of YOLO is not its spectacular accuracy or the very clever ideas
behind it; it is its superb speed, which makes it ideal for embedded systems
and low-power applications. That’s why self-driving cars and surveillance
cameras are among its most common real-world use cases.
As deep learning continues to play along with computer vision (and it surely
will), we can expect many more models tailored for low-power systems, even if
they sometimes sacrifice accuracy. And don’t forget the whole Internet of
Things thing. This is where these models shine.