Deep Learning is eating the world.
The hype began around 2012, when a Neural Network achieved record-breaking performance on image recognition tasks and only a few people could predict what was about to happen.
Over the past decade, more and more algorithms have come to life, and more and more companies have started to add them to their daily business.
Here, I tried to cover all the most important Deep Learning algorithms and architectures conceived over the years for use in a variety of applications such as Computer Vision and Natural Language Processing.
Some of them are used more frequently than others and each one has its own strengths and weaknesses.
My main goal is to give you a general idea of the field and help you understand which algorithm you should use in each specific case, because I know it seems chaotic for someone who wants to start from scratch.
But after reading the guide, I am confident that you will be able to recognize what is what and you will be ready to begin using them right away.
So if you are looking for a truly complete guide on Deep Learning, let’s get started.
Deep Learning is gaining crazy amounts of
popularity in the scientific and
corporate communities. Since 2012, the year when a Convolutional Neural Network
achieved unprecedented accuracy on an image recognition competition (ImageNet
Large Scale Visual Recognition Challenge), more and more research papers have come out
and more and more companies have started to incorporate Neural Networks into their
businesses. It is estimated that Deep Learning is right now a 2.5 Billion Market
and expected to become 18.16 Billion by
But what is Deep learning?
According to Wikipedia: “Deep
learning (also known as deep structured learning or differential programming) is
part of a broader family of machine learning methods based on artificial neural
networks with representation learning. Learning can be supervised,
semi-supervised or unsupervised”.
In my mind, Deep Learning is a collection of algorithms inspired by the way the
human brain processes data and creates patterns for use in decision making.
These algorithms expand and improve on the idea of a single model
architecture called the Artificial Neural Network.
Just like the human brain, Neural
Networks consist of Neurons. Each Neuron
receives signals as an input, multiplies them by weights, sums them up and
applies a non-linear function. These neurons are stacked next to each other and
organized in layers.
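To make the description of a neuron concrete, here is a minimal sketch in plain Python. The sigmoid activation, the bias term and all the numbers are assumptions chosen for illustration. The text above only says "a non-linear function".

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Multiply the inputs by the weights, sum them up, apply a non-linearity."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Illustrative values only: the weighted sum is 1*0.5 + 2*(-0.25) = 0.
out = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
two_outputs = layer([1.0, 2.0], [[0.5, -0.25], [1.0, 1.0]], [0.0, 0.0])
```

Stacking several calls to `layer`, each one reading the previous one's output, gives the layered structure described above.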
But what do we accomplish by doing that?
It turns out that Neural Networks are excellent function approximators.
We can assume that every behavior and every system can ultimately be represented
as a mathematical function (sometimes an incredibly complex one). If we somehow
manage to find that function, we can essentially understand everything about the
system. But finding the function can be extremely hard. So, we need to estimate
it. Enter Neural Networks.
Neural Networks are able to learn the desired function using large amounts of data
and an iterative algorithm called backpropagation. We feed the
network with data, it produces an output, we compare that output with a desired
one (using a loss function) and we readjust the weights based on the difference.
And repeat. And repeat. The adjustment of weights is performed using a
non-linear optimization technique called stochastic gradient descent.
After a while, the network will become really good at producing the output.
At that point, the training is over and we have managed to approximate our function.
If we pass an input with an unknown output to the network, it will give us an
answer based on the approximated function.
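The feed, compare, adjust, repeat loop can be sketched with the smallest possible "network": a single linear neuron trained with stochastic gradient descent. The target function `y = 2x + 1`, the squared-error loss, the learning rate and the number of epochs are all assumptions picked for this example. With only one neuron, backpropagation reduces to a single application of the chain rule.

```python
import random

def train(samples, epochs=200, lr=0.05, seed=0):
    """Fit pred = w * x + b to (x, target) pairs with stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)                 # "stochastic": visit samples in random order
        for x, target in data:
            pred = w * x + b              # the network produces an output
            error = pred - target         # gradient of 0.5 * (pred - target)**2 w.r.t. pred
            w -= lr * error * x           # adjust the weights based on the difference
            b -= lr * error
    return w, b

# Training data sampled from the function we pretend not to know.
samples = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = train(samples)

# An input the network has never seen: the answer comes from the learned function.
unseen_prediction = w * 3.0 + b
```

After training, `w` and `b` should sit very close to 2 and 1, so the prediction for the unseen input is close to 7.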
Let’s use an example to make this clearer. Let’s say that for some reason we
want to identify images with a tree. We feed the network with all kinds of images
and it produces an output. Since we know whether each image actually contains a tree or
not, we can compare the output with our truth and adjust the network.
As we pass more and more images, the network will make fewer and fewer mistakes. Now we can
feed it with an unknown image, and it will tell us if the image contains a tree.
Pretty cool, right?
Over the years researchers came up with amazing improvements on the original
idea. Each new architecture targeted a specific problem and achieved
better accuracy and speed. We can classify all those new models into specific categories:
Feedforward Neural Networks (FNN)
Feedforward Neural Networks are usually fully connected, which means
that every neuron in a layer is connected to every neuron in the next
layer. The described structure is called a Multilayer Perceptron, and the
perceptron it builds on originated back in 1958. A single-layer perceptron can
only learn linearly separable patterns, but a multilayer perceptron is able to
learn non-linear relationships between the data.
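XOR is the classic pattern that is not linearly separable, so it makes a handy illustration of why the extra layer matters. The sketch below is a tiny fully connected network with one hidden layer. Its weights are hand-picked for the example rather than learned, and the step activation is an assumption made to keep the arithmetic easy to follow.

```python
def step(z):
    """A very simple non-linear activation: fire (1) or stay silent (0)."""
    return 1 if z > 0 else 0

def dense(inputs, weight_rows, biases, activation):
    """Fully connected layer: every output neuron sees every input."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weight_rows, biases)
    ]

def xor_mlp(x1, x2):
    # Hidden layer: the first neuron acts like OR, the second like AND.
    hidden = dense([x1, x2], [[1, 1], [1, 1]], [-0.5, -1.5], step)
    # Output neuron: "OR but not AND", which is exactly XOR.
    return dense(hidden, [[1, -1]], [-0.5], step)[0]

truth_table = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

No single neuron of this kind can produce that truth table on its own, because one neuron can only draw one straight line through the input space.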
They perform exceptionally well on tasks like classification and regression.
Contrary to other machine learning algorithms, they don’t converge so easily.
The more data they have, the higher their accuracy.
Convolutional Neural Networks (CNN)
Convolutional Neural Networks employ a function called convolution. The
concept behind them is that instead of connecting each neuron with all the next
ones, we connect it with only a handful of them (the receptive field).
In a way, they try to regularize feedforward networks to avoid overfitting (when the model
memorizes the data it has already seen and can’t generalize), which makes them very good at
identifying spatial relationships between the data.
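Here is a rough sketch of the operation itself, written with plain lists. Each output value only looks at a small patch of the image (its receptive field) and the same small kernel is reused everywhere. The tiny image and the edge-detecting kernel are made up for the example, and the function computes the unflipped variant (cross-correlation) with no padding and stride 1, which is an assumption on my part.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Only the kh x kw patch under the kernel contributes to this output.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output

# A dark left half and a bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Responds where the intensity increases from left to right.
edge_kernel = [[-1, 1]]
result = conv2d(image, edge_kernel)
```

The result is non-zero only in the column where the dark region meets the bright one, which is the kind of spatial relationship a convolutional layer picks up.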
That’s why their primary use case is Computer Vision and applications such as
image classification, video recognition, medical image analysis and
self-driving cars where they
achieve literally superhuman performance.
They are also ideal to combine with other types of models such as Recurrent Networks and Autoencoders. One such example is Sign Language Recognition.
Recurrent Neural Networks (RNN)
Recurrent networks are perfect for time-related data and they are used in time
series forecasting. They use some form of feedback, where they return the output
back to the input. You can think of it as a loop from the output to the input in
order to pass information back to the network. Therefore, they are capable of
remembering past data and using that information in their predictions.
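The feedback loop can be sketched with a single recurrent neuron. The scalar weights, the tanh activation and the input sequence below are assumptions picked for illustration.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """Combine the current input with the previous output (the hidden state)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence one element at a time, feeding each output back in."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# One impulse followed by silence.
states = run_rnn([1.0, 0.0, 0.0])
```

Even though the last two inputs are zero, the state stays positive and fades gradually. That lingering value is the network's memory of the first input.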
To achieve better performance researchers have modified the original neuron into
more complex structures such as GRU
and LSTM Units. LSTM units
have been used extensively in natural language processing in tasks such as
language translation, speech generation and text-to-speech synthesis.
Recursive Neural Network
Recursive Neural Networks are another form of recurrent networks with the
difference that they are structured in a tree-like form. As a result, they can
model hierarchical structures in the training dataset.
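As a sketch of the tree-like structure, the function below applies the same small combining unit at every node of a binary tree, from the leaves up to the root. The nested-tuple tree format, the scalar leaf values and the two weights are assumptions made for this example. Real models work with vectors and learned weight matrices.

```python
import math

def encode(tree, w_left=0.6, w_right=0.4):
    """Return one number summarizing the whole tree.

    A tree is either a leaf (a number) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, (int, float)):
        return float(tree)                      # leaves are used as they are
    left, right = tree
    combined = (w_left * encode(left, w_left, w_right)
                + w_right * encode(right, w_left, w_right))
    return math.tanh(combined)                  # same unit reused at every node

# Same leaves, different hierarchy.
right_heavy = encode((1.0, (0.5, -0.5)))
left_heavy = encode(((1.0, 0.5), -0.5))
```

The two trees contain the same leaves in the same order, yet they produce different encodings because the grouping differs. That sensitivity to hierarchy is the point of the architecture.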
They are traditionally used in NLP in applications such as Audio to text
transcription and sentiment analysis because of their ties to binary trees,
contexts, and natural-language-based parsers. However, they tend to be much
slower than Recurrent Networks.
Autoencoders
Autoencoders are mostly used as an
unsupervised algorithm and their main use case is dimensionality reduction and
compression. Their trick is that they try to make the output equal to the input.
In other words, they are trying to reconstruct the data.
They consist of an encoder and a decoder. The encoder receives the input and it encodes it in a
latent space of a lower dimension. The decoder takes that vector and decodes it
back to the original input.
That way we can extract from the middle of the network a representation of the
input with fewer dimensions. Genius, right?
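To show the encoder and decoder shape without a training loop, here is a toy linear autoencoder whose weights are hand-picked rather than learned. It assumes, purely for illustration, that every valid input has the form `(a, b, a + b)`, so two numbers are enough to describe three.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hand-picked weights (an assumption for this sketch, normally these are learned).
ENCODER = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]          # 3 dimensions -> 2 dimensions
DECODER = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]               # 2 dimensions -> 3 dimensions

def encode(x):
    return matvec(ENCODER, x)

def decode(code):
    return matvec(DECODER, code)

def reconstruction_error(x):
    """Squared distance between the input and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

on_pattern = [2.0, 3.0, 5.0]     # follows the (a, b, a + b) structure
off_pattern = [2.0, 3.0, 9.0]    # breaks it

code = encode(on_pattern)
```

The two-number `code` is the representation taken from the middle of the network. Inputs that follow the pattern are reconstructed perfectly, while inputs that break it come back with a large error.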
Of course, we can use the same idea to produce data that is slightly different from,
or even better than, the input (training data augmentation, data denoising, etc.).
Deep Belief Networks and Restricted Boltzmann Machines
Restricted Boltzmann Machines are stochastic neural networks with generative capabilities as they are able to
learn a probability distribution over their inputs. Unlike other networks, they
consist of only input and hidden layers (no outputs).
In the forward part of the training they take the input and produce a
representation of it. In the backward pass they reconstruct the original input
from the representation. (Exactly like autoencoders but in a single network).
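The two passes can be sketched with a single weight matrix that is used in both directions. This is a simplification under stated assumptions: the weights and biases are made up, and the code passes probabilities around instead of sampling binary states, so it leaves out the stochastic part and the training rule entirely.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(visible, weights, hidden_bias):
    """Input -> representation: activation probability of each hidden unit."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, visible)) + b)
        for row, b in zip(weights, hidden_bias)
    ]

def backward(hidden, weights, visible_bias):
    """Representation -> reconstruction, reusing the same weights transposed."""
    n_visible = len(weights[0])
    return [
        sigmoid(sum(weights[j][i] * hidden[j] for j in range(len(hidden)))
                + visible_bias[i])
        for i in range(n_visible)
    ]

# One hidden unit that "likes" the first two inputs and "dislikes" the last two.
weights = [[4.0, 4.0, -4.0, -4.0]]
visible = [1.0, 1.0, 0.0, 0.0]

hidden = forward(visible, weights, [0.0])
reconstruction = backward(hidden, weights, [0.0, 0.0, 0.0, 0.0])
```

The reconstruction lands close to the original input, and there is no separate decoder: the same `weights` serve both directions.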
Multiple RBMs can be stacked to form a Deep Belief
Network. They look exactly like
Fully Connected layers, but they differ in how they are trained. That’s because
they train layers in pairs, following the training process of RBMs (described
above).
However, DBNs and RBMs have kind of been abandoned by the scientific community in
favor of Variational Autoencoders and GANs.
Generative Adversarial Networks (GAN)
GANs were introduced in 2014 by Ian Goodfellow and they are based on a simple but elegant
idea: You want to generate data, let’s say images. What do you do?
You build two models. You train the first one to generate fake data (generator) and the second
one to distinguish real from fake ones (discriminator). And you put them to
compete against each other.
The generator becomes better and better at image generation, as its ultimate
goal is to fool the discriminator. The discriminator becomes better and better
at distinguishing fake from real images, as its goal is to not be fooled. The
result is that we now have incredibly realistic fake data from the generator.
Applications of Generative Adversarial Networks include video games,
astronomical images, interior design and fashion. Basically, if you have images in
your field, you can potentially use GANs. Oooh, do you remember Deep Fakes?
Yeah, GANs are one of the techniques used to make them.
Transformers
Transformers are also very new and they are mostly used in language applications as they are
starting to make recurrent networks obsolete. They are based on a concept called
attention, which is used to force the network to focus on a particular data point.
Instead of having overly complex LSTM units, you use Attention mechanisms to
weigh different parts of the input based on their significance. The attention
mechanism is nothing more than another layer with weights and its sole purpose is to
adjust the weights in a way that prioritizes some segments of the input while
deprioritizing others.
Transformers, in fact, consist of a number of stacked encoders (forming the encoder
layer), a number of stacked decoders (the decoder layer) and a bunch of
attention layers (self-attentions and encoder-decoder attentions).
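To make the idea of weighing parts of the input concrete, here is a small sketch of scaled dot-product attention, the form used inside Transformers. The query, key and value names and all the numbers are assumptions introduced for this example. In a real model these vectors come from learned projections of the input.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Weigh each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return output, weights

# Three input positions, each with a key and a value (illustrative numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

output, weights = attention([4.0, 0.0], keys, values)
```

The second position barely matches the query, so it gets a small weight and contributes little to the output. Nothing in the computation depends on processing the positions one after another, which is part of why this replaces recurrence so well.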
Transformers are designed to handle ordered sequences of data, such as natural
language, for various tasks such as machine translation and text summarization.
Nowadays BERT and GPT-2 are the two most prominent pretrained natural language
systems, used in a variety of NLP tasks, and they are both based on Transformers.
Graph Neural Networks
Unstructured data are not a great fit for Deep Learning in general. And there
are many real-world applications where data are unstructured and organized in a
graph format. Think social networks, chemical compounds, knowledge graphs,
The purpose of Graph Neural Networks
is to model graph data, meaning that they identify the relationships between the
nodes in a graph and produce a numeric representation of it, just like an
embedding. These representations can later be used in any other machine learning model for
all sorts of tasks like clustering, classification, etc.
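One common way to produce such node representations is to let every node repeatedly average its own features with those of its neighbors. The sketch below shows one round of that idea on a made-up three-node graph. It deliberately leaves out the learned weights and non-linearities a real Graph Neural Network would apply, so treat it as an illustration of the information flow only.

```python
def message_passing(features, edges, rounds=1):
    """Update each node's vector with the mean of itself and its neighbors."""
    neighbors = {node: [] for node in features}
    for a, b in edges:                      # undirected edges
        neighbors[a].append(b)
        neighbors[b].append(a)

    current = dict(features)
    for _ in range(rounds):
        updated = {}
        for node, own in current.items():
            group = [own] + [current[n] for n in neighbors[node]]
            updated[node] = [
                sum(vec[i] for vec in group) / len(group)
                for i in range(len(own))
            ]
        current = updated
    return current

# A small chain graph: a - b - c (names and features are illustrative).
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
edges = [("a", "b"), ("b", "c")]

embeddings = message_passing(features, edges)
```

After one round, node `c` has picked up part of `b`'s features even though its own were all zeros. The resulting vectors are the kind of numeric representation that can be handed to a clustering or classification model.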
Deep Learning in Natural Language Processing (NLP)
Word Embeddings represent words as numeric vectors in a way
that captures the semantic and syntactic similarity between them. This is
necessary because neural networks can only learn from numeric data, so we need
a way to encode words and text into numbers.
Word2Vec is the most popular technique
and it tries to learn the embeddings by predicting a word based on its
context (CBOW) or by predicting the surrounding words based on the word
(Skip-Gram). Word2Vec is nothing more than a simple neural network with 2
layers that has words as inputs and outputs. Words are fed to the Neural
Network in the form of one-hot encoding.
In the case of CBOW, the inputs are the adjacent words and the output is the
desired word. In the case of Skip-Gram, it’s the other way around.
GloVe is another model that extends the idea of Word2Vec by combining it with
matrix factorization techniques such as Latent Semantic Analysis, which are
proven to be really good at capturing global text statistics but unable to capture
local context. So the union of those two gives us the best of both worlds.
FastText by Facebook
utilizes a different approach by making use of character-level
representation instead of words.
Contextual Word Embeddings replace Word2Vec with Recurrent Neural
Networks to predict, given a current word in a sentence, the next word. That
way we can capture long term dependencies between words and each vector
contains both the information on the current word and on the past ones. The
most famous version is called ELMo and it
consists of a two-layer bi-directional LSTM network.
Attention Mechanisms and
Transformers are making RNNs obsolete (as mentioned before), by weighting
the most related words and forgetting the unimportant ones.
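Going back to Word2Vec for a moment, the difference between CBOW and Skip-Gram is easiest to see in the training pairs each one produces. The sketch below builds those pairs and the one-hot inputs for a made-up sentence. The window size of one word on each side and the helper names are assumptions for this example, and the network that would be trained on the pairs is left out.

```python
def one_hot(index, size):
    """A vector of zeros with a single 1 at the word's position."""
    return [1 if i == index else 0 for i in range(size)]

def training_pairs(tokens, window=1):
    """Build (input, output) examples for both Word2Vec variants."""
    cbow, skip_gram = [], []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, stop) if j != i]
        cbow.append((context, target))                    # context -> word
        skip_gram.extend((target, c) for c in context)    # word -> each context word
    return cbow, skip_gram

tokens = "deep learning is fun".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

cbow, skip_gram = training_pairs(tokens)
encoded_is = one_hot(vocab["is"], len(vocab))
```

For the word "learning", CBOW gets the example `(["deep", "is"], "learning")`, while Skip-Gram gets two examples, `("learning", "deep")` and `("learning", "is")`.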
Sequence models are an integral part of Natural Language Processing as they
appear in lots of common applications such as Machine Translation,
Speech Recognition, Autocompletion and Sentiment Classification. Sequence models
are able to process a sequence of inputs or events such as a document of words.
For example, imagine that you want to translate a sentence from English to
French.
To do that you need a Sequence to Sequence model.
Seq2Seq models include an encoder and a decoder. The encoder takes the
sequence (a sentence in English) and produces an output, a representation of the
input in a latent space. This representation is fed to the decoder, which gives
us the new sequence (the sentence in French).
The most common architectures for the encoder and decoder are Recurrent Neural
Networks (mostly LSTMs), because they are great at capturing long term
dependencies, and Transformers, which tend to be faster and easier to parallelize.
Sometimes they are also combined with Convolutional Networks for better
Deep Learning in Computer Vision
Localization and Object Detection
Localization is the task of locating objects in an image and marking them with a bounding box,
while object detection also includes the classification of the object.
The interconnected tasks are tackled by a fundamental model (and its
improvements) in Computer Vision called R-CNN. R-CNN and its successors Fast
R-CNN and Faster R-CNN take advantage of region proposals and Convolutional
Neural Networks.
An external system, or the network itself (in the case of Faster R-CNN), proposes
some regions of interest in the form of fixed-size boxes, which might contain
objects. These boxes are classified and corrected via a CNN (such as AlexNet),
which decides if the box contains an object, identifies what the object is and fixes the
dimensions of the bounding box.
Single-shot detectors and their most famous member YOLO (You Only Look
Once) ditch the idea of region proposals and
use a set of predefined boxes instead.
These boxes are forwarded to a CNN, which predicts for each one a number of
bounding boxes with a confidence score, detects one object centered in it and
classifies the object into a category. In the end, we keep only the bounding
boxes with a high score.
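The final filtering step is simple enough to sketch directly. The detection format (label, confidence score, box corners), the threshold of 0.5 and the sample detections are all assumptions made up for this example. They do not come from any particular YOLO implementation.

```python
def keep_confident(detections, threshold=0.5):
    """Drop low-confidence boxes and return the rest, best first."""
    kept = [d for d in detections if d[1] >= threshold]
    return sorted(kept, key=lambda d: d[1], reverse=True)

# (label, confidence score, (x_min, y_min, x_max, y_max)), illustrative values.
detections = [
    ("dog", 0.92, (10, 20, 110, 220)),
    ("cat", 0.15, (30, 40, 90, 100)),
    ("bicycle", 0.61, (120, 60, 300, 240)),
]

kept = keep_confident(detections)
labels = [label for label, score, box in kept]
```

With the default threshold, the low-scoring "cat" box is discarded and the remaining two boxes are returned in order of confidence.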
Over the years YOLOv2, YOLOv3, and YOLO9000 improved on the original idea both on
speed and accuracy.
One of the fundamental tasks in computer vision is the classification of all
pixels in an image into classes based on their context, aka Semantic
Segmentation. In this
direction, Fully Convolutional Networks (FCN) and U-Nets are the two most
Fully Convolutional Networks (FCN) are an encoder-decoder architecture
with one convolutional and one deconvolutional network. The encoder
downsamples the image to capture semantic and contextual information while
the decoder upsamples to retrieve spatial information. That way we manage to
retrieve the context of the image with lower time and space complexity.
U-Nets are based on the ingenious idea of skip-connections. Their
encoder has the same size as the decoder and skip-connections transfer
information from the former to the latter in order to increase the
resolution of the final output.
Pose Estimation is the problem of
localizing human joints in images and videos and it can either be 2D or 3D. In
2D we estimate the (x,y) coordinates of each joint while in 3D the (x,y,z)
coordinates.
PoseNet dominates the field of pose estimation (it’s the go-to model for most smartphone applications) and
it uses Convolutional Neural Networks (didn’t see that coming, did you?). We
feed the image to a CNN and we use a single-pose or a multi-pose algorithm to
detect poses. Each pose is associated with a confidence score and some key
point coordinates. In the end, we keep the ones with the highest confidence
score.
There you have it. All the essential Deep Learning algorithms to date.
Of course, I couldn’t include all the published architectures, because there are
literally thousands. But most of them are based on one of those basic models and
improve it with different techniques and tricks.
I am also confident that I will need to update this guide pretty soon, as new
papers are coming out as we speak. But that is the beauty of Deep Learning.
There is so much room for new breakthroughs that it’s kind of scary.
If you think that I forgot something, don’t hesitate to contact us on Social
media or via email. I want this post to be as complete as possible.
Now it’s your time. Go and build your own amazing applications using these
algorithms. Or even create a new one that will make the cut in our list. Why not?
Have fun and keep learning AI.