This article demystifies the machine learning modeling process through the prism of statistics. We will see how our assumptions about the data enable us to create meaningful optimization problems. In fact, we will derive commonly used criteria such as cross-entropy in classification and mean squared error in regression. Finally, I try to answer an interview question that I encountered: what would happen if we used MSE on binary classification?

Likelihood vs. probability and probability density

To begin, let's start with a fundamental question: what is the difference between likelihood and probability? The data $x$ are connected to the possible models $\theta$ by means of a probability $P(x,\theta)$ or a probability density function (pdf) $p(x,\theta)$.

In short, a pdf assigns a density to each possible value rather than a probability: the probability of any single value is infinitesimally small, and actual probabilities are obtained by integrating the density over a range of values. We'll stick with the pdf notation here. For any given set of parameters $\theta$, $p(x,\theta)$ is intended to be the probability density function of $x$.

The likelihood $p(x,\theta)$ is defined as the joint density of the observed data as a function of model parameters. That means that, for any fixed observation $x$, $p(x=\operatorname{fixed},\theta)$ is a function of $\theta$: it tells us how plausible each candidate model is for the data we actually saw.
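To make the distinction concrete, here is a minimal sketch (my own illustration, using a Gaussian density and made-up values) showing that the same function $p(x, \theta)$ is read in two ways: fix $\theta$ and vary $x$ for a density, or fix $x$ and vary $\theta$ for a likelihood:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma=1.0):
    # p(x, theta) for a Gaussian with parameters theta = (mu, sigma)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Probability density: the model (mu=0) is fixed and the observation x varies.
densities = [gaussian_pdf(x, mu=0.0) for x in (-1.0, 0.0, 1.0)]

# Likelihood: the observation (x=1) is fixed and the parameter mu varies.
likelihoods = [gaussian_pdf(1.0, mu) for mu in (-1.0, 0.0, 1.0)]

print(densities)    # density of different observations under one model
print(likelihoods)  # likelihood of different models for one observation
```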

Notations

We will consider the case where we are given a set $X$ of $m$ data instances $X= \{ \mathbf{x}^{(1)}, \dots, \mathbf{x}^{(m)} \}$.

The independent and identically distributed (IID) assumption

This brings us to the most fundamental assumption of ML: the data (random variables) are Independent and Identically Distributed (IID). Statistical independence means that for random variables $A$ and $B$, the joint distribution factorizes: $P_{A,B}(a,b) = P_A(a) P_B(b)$. Identically distributed means that every instance is drawn from the same underlying data distribution $p_{data}(\mathbf{x})$.

Our estimator (model) will have some learnable parameters $\boldsymbol{\theta}$ that define another probability distribution $p_{model}(\mathbf{x}, \boldsymbol{\theta})$ over the same space.

The essence of ML is to pick a good initial model that exploits the assumptions and the structure of the data; less formally, a model with a decent inductive bias. As the parameters are iteratively optimized, $p_{model}(\mathbf{x}, \boldsymbol{\theta})$ moves closer and closer to $p_{data}(\mathbf{x})$.

In neural networks, because the iterations happen in a mini-batch fashion instead of over the whole dataset, $m$ will be the mini-batch size.

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is simply a common principled method with which we can derive good estimators, hence picking $\boldsymbol{\theta}$ such that it fits the data.

To disentangle this concept, let’s observe the formula in the most intuitive form:

\boldsymbol{\theta}_{\mathrm{MLE}}= \underset{\operatorname{params}}{\arg \max } \; p_{model}( \operatorname{output} \mid \operatorname{inputs}, \operatorname{params})

The optimization problem is to maximize the likelihood of the given data. In the probability notation above, the inputs play the role of the conditioning variables. Unconditional MLE means there is no conditioning at all, so no labels; we simply maximize the joint likelihood of the data:

\begin{aligned}
\boldsymbol{\theta}_{\mathrm{MLE}} &=\underset{\boldsymbol{\theta}}{\arg \max }\; p_{model}(X, \boldsymbol{\theta}) \\
&=\underset{\boldsymbol{\theta}}{\arg \max } \prod_{i=1}^{m} p_{model}\left(\mathbf{x}^{(i)}, \boldsymbol{\theta}\right)
\end{aligned}

The product over instances follows directly from the IID assumption. In a supervised ML context, we model the labels conditioned on the inputs, and we take the logarithm, which turns the product into a sum without changing the argmax:

\boldsymbol{\theta}_{\mathrm{ML}}=\underset{\boldsymbol{\theta}}{\arg \max } \sum_{i=1}^{m} \log p_{model}\left(\boldsymbol{y}^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}\right)
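As a minimal sketch (my own example, assuming NumPy and synthetic Gaussian data), here is MLE in action: a grid search over candidate means shows that the log-likelihood is maximized at the sample mean:

```python
import numpy as np

# Synthetic IID data drawn from a Gaussian with true mean 3.0 (known sigma=1).
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

def log_likelihood(mu, x, sigma=1.0):
    # sum_i log p(x^(i), theta) for a Gaussian pdf
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                  - (x - mu) ** 2 / (2 * sigma ** 2))

# Grid search over candidate parameters: the argmax is the MLE.
candidates = np.linspace(0.0, 6.0, 601)
theta_mle = candidates[np.argmax([log_likelihood(mu, data) for mu in candidates])]
print(theta_mle, data.mean())  # both are close to the true mean 3.0
```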

Quantifying distribution closeness: KL-div

One way to interpret MLE is to view it as minimizing the "closeness" between the training data distribution $p_{data}(\mathbf{x})$ and the model distribution $p_{model}(\mathbf{x}, \boldsymbol{\theta})$. The standard way to quantify this closeness between distributions is the Kullback-Leibler (KL) divergence:

\begin{gathered}
D_{KL}( p_{data} \| p_{model})=E_{x\sim p_{data}}\left[ \log \frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x}, \boldsymbol{\theta})}\right] = \\
E_{x\sim p_{data}}\left[ \log p_{data}(\mathbf{x}) - \log p_{model}(\mathbf{x}, \boldsymbol{\theta})\right],
\end{gathered}

where $E$ denotes the expectation over all possible training data. In general, the expected value is a weighted average of all possible outcomes. We will replace the expectation with a sum, whilst multiplying each term by its "weight" of happening, that is, $p_{data}(\mathbf{x})$.




[Figure: Illustration of the relative entropy (KL divergence) for two normal distributions. The typical asymmetry is clearly visible. By Mundhenk at English Wikipedia, CC BY-SA 3.0]

Notice that I intentionally avoided using the term distance. Why? Because a distance function is defined to be symmetric. KL divergence, on the other hand, is asymmetric, meaning that $D_{KL}( p_{data} \| p_{model}) \neq D_{KL}(p_{model} \| p_{data})$ in general.




[Figure: The asymmetry of the KL divergence. Source: Datumorphism]

Intuitively, you can think of $p_{data}$ as the fixed reference distribution: since the expectation is taken under $p_{data}$, the divergence penalizes the model most in the regions where the data places high probability but the model does not.

By replacing the expectation $E$ with our lovely sum:

\begin{gathered}
D_{KL}(p_{data} \| p_{model})=\sum_{x=1}^{N} p_{data}(\mathbf{x}) \log \frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x}, \boldsymbol{\theta})} \\
=\sum_{x=1}^{N} p_{data}(\mathbf{x})\left[\log p_{data}(\mathbf{x})-\log p_{model}(\mathbf{x}, \boldsymbol{\theta})\right]
\end{gathered}

When we minimize the KL divergence with respect to the parameters of our estimator, the term $\log p_{data}(\mathbf{x})$ does not depend on $\boldsymbol{\theta}$, so its gradient vanishes and only the second term survives:

\nabla_{\theta} D_{KL}(p_{data} \| p_{model}) = - \nabla_{\theta} \sum_{x=1}^{N} p_{data}(\mathbf{x}) \log p_{model}(\mathbf{x}, \boldsymbol{\theta}).

In other words, minimizing KL divergence is mathematically equivalent to minimizing cross-entropy, defined as $H(P, Q)=-\sum_{x} P(x) \log Q(x)$. The two objectives differ only by the entropy of the data, which is constant with respect to the parameters:

\begin{aligned}
H\left(p_{data}, p_{model}\right) &=H(p_{data})+D_{KL}\left(p_{data} \| p_{model}\right) \\
\nabla_{\theta} H\left(p_{data}, p_{model}\right) &=\nabla_{\theta}\left(H(p_{data})+D_{KL}\left(p_{data} \| p_{model}\right)\right) \\
&=\nabla_{\theta} D_{KL}\left(p_{data} \| p_{model}\right)
\end{aligned}

The optimal parameters $\boldsymbol{\theta}$ will, in principle, be the same: even though the optimization landscapes defined by the two objective functions differ, maximizing the likelihood is equivalent to minimizing the KL divergence. In this case, the entropy of the data $H(p_{data})$ is a constant that the parameters cannot influence.

From the statistical point of view, it is more natural to think of bringing the two distributions close together, hence KL divergence. From the information-theoretic point of view, cross-entropy might make more sense to you.
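A small numerical check (my own sketch, with two made-up discrete distributions) verifying both the asymmetry of KL divergence and the identity $H(P, Q) = H(P) + D_{KL}(P \| Q)$:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # stand-in for p_data
q = np.array([0.4, 0.4, 0.2])  # stand-in for p_model

def kl(a, b):
    # D_KL(a || b) for discrete distributions
    return np.sum(a * np.log(a / b))

def cross_entropy(a, b):
    # H(a, b) = -sum_x a(x) log b(x)
    return -np.sum(a * np.log(b))

entropy_p = -np.sum(p * np.log(p))
print(kl(p, q), kl(q, p))                         # asymmetric: the values differ
print(cross_entropy(p, q), entropy_p + kl(p, q))  # identical: H(P,Q) = H(P) + KL
```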

MLE in linear regression

Let's consider linear regression. Imagine that each single prediction $\hat{y}$ is the mean of a Gaussian distribution over the corresponding target $y$; in other words, we assume the targets are the model's predictions corrupted by Gaussian noise.




[Figure: linear regression]

Now we need an assumption. We hypothesize that the neural network, or any estimator $f$, produces the mean of that Gaussian, $\hat{y}=f(\mathbf{x}, \boldsymbol{\theta})$, with some fixed variance $\sigma^{2}$:

\begin{aligned}
\hat{y} &=f(\mathbf{x}, \boldsymbol{\theta}) \\
y & \sim \mathcal{N}\left(\mu=\hat{y}, \sigma^{2}\right) \\
p(y \mid \mathbf{x}, \boldsymbol{\theta}) &=\frac{1}{\sigma \sqrt{2 \pi}} \exp \left(\frac{-(y-\hat{y})^{2}}{2 \sigma^{2}}\right)
\end{aligned}

In terms of log-likelihood we can form a loss function:

\begin{aligned}
L &=\sum_{i=1}^{m} \log p\left(y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}\right) \\
&=\sum_{i=1}^{m} \log \frac{1}{\sigma \sqrt{2 \pi}} \exp \left(\frac{-\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}\right) \\
&=\sum_{i=1}^{m}-\log (\sigma \sqrt{2 \pi})-\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}} \\
&=\sum_{i=1}^{m}-\log (\sigma)-\frac{1}{2} \log (2 \pi)-\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}} \\
&=-m \log (\sigma)-\frac{m}{2} \log (2 \pi)-\sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}
\end{aligned}

Only the squared-error term depends on the parameters, so the constant terms vanish when we take the gradient with respect to $\boldsymbol{\theta}$:

\nabla_{\theta} L =-\nabla_{\theta} \sum_{i=1}^{m} \frac{\left\|\hat{y}^{(i)}-y^{(i)}\right\|^{2}}{2 \sigma^{2}}

Since $\operatorname{MSE}=\frac{1}{m} \sum_{i=1}^{m}\left\|\hat{y}^{(i)}-y^{(i)}\right\|^{2}$, this gradient equals $-\frac{m}{2 \sigma^{2}} \nabla_{\theta} \operatorname{MSE}$: maximizing the Gaussian log-likelihood is exactly equivalent to minimizing the mean squared error.
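As a sanity check, here is a short sketch (my own, using a toy one-parameter linear model with NumPy) showing that the parameter minimizing the Gaussian negative log-likelihood coincides with the one minimizing MSE:

```python
import numpy as np

# Toy 1D linear regression: y = 2x + Gaussian noise.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
sigma = 0.5

def neg_log_likelihood(theta):
    residuals = y - theta * x
    return np.sum(np.log(sigma * np.sqrt(2 * np.pi))
                  + residuals ** 2 / (2 * sigma ** 2))

def mse(theta):
    return np.mean((y - theta * x) ** 2)

thetas = np.linspace(0.0, 4.0, 401)
print(thetas[np.argmin([neg_log_likelihood(t) for t in thetas])],
      thetas[np.argmin([mse(t) for t in thetas])])  # same argmin, near 2.0
```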

MLE in supervised classification

In linear regression, we parametrized $p_{model}(y \mid \mathbf{x}, \boldsymbol{\theta})$ as a Gaussian whose mean is the model's prediction. In classification, we instead parametrize a discrete distribution over the class labels.

It is possible to convert this regression setup into a classification problem. All we need to do is encode the ground truth as a one-hot vector:

p_{data}\left(y \mid \mathbf{x}_{i}\right)= \begin{cases}1 & \text { if } y=y_{i} \\ 0 & \text { otherwise }\end{cases},

where $i$ refers to a single data instance. Because all entries except the true label are zero, the per-instance cross-entropy collapses to a single term:

\begin{aligned}
H_{i}\left(p_{data}, p_{model}\right) &=-\sum_{y \in Y} p_{data}\left(y \mid \mathbf{x}_{i}\right) \log p_{model}\left(y \mid \mathbf{x}_{i}\right) \\
&=-\log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right)
\end{aligned}
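A quick numerical check (my own sketch, with made-up probabilities) that the full cross-entropy sum and the collapsed single term agree:

```python
import numpy as np

p_model = np.array([0.1, 0.7, 0.2])  # model's predicted class probabilities
one_hot = np.array([0.0, 1.0, 0.0])  # ground truth: class index 1

full_sum = -np.sum(one_hot * np.log(p_model))  # cross-entropy over all classes
collapsed = -np.log(p_model[1])                # -log p_model(y_i | x_i)
print(full_sum, collapsed)                     # identical values
```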

For simplicity let’s consider the binary case of two labels, 0 and 1.

\begin{aligned}
L &=\sum_{i=1}^{n} H_{i}\left(p_{data}, p_{model}\right) \\
&=\sum_{i=1}^{n}-\log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right) \\
&=-\sum_{i=1}^{n} \log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right)
\end{aligned}

Minimizing this loss means:

\underset{\boldsymbol{\theta}}{\arg \min }\, L = \underset{\boldsymbol{\theta}}{\arg \min } \left(-\sum_{i=1}^{n} \log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right)\right)

This is in line with our definition of conditional MLE:

\boldsymbol{\theta}_{\mathrm{ML}}=\underset{\boldsymbol{\theta}}{\arg \max } \sum_{i=1}^{m} \log p_{model}\left(\boldsymbol{y}^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}\right)

Broadly speaking, MLE can be applied to most (supervised) learning problems by specifying a parametric family of (conditional) probability distributions.

Another way to achieve this in a binary classification problem would be to take the scalar output of the linear layer and pass it through a sigmoid function. The output will then lie in the range $[0,1]$, and we define it as the probability $p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta})$:

p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}) = \sigma( \boldsymbol{\theta}^T \mathbf{x}) = \operatorname{sigmoid}( \boldsymbol{\theta}^T \mathbf{x}) \in [0,1]

Consequently, $p(y = 0 \mid \mathbf{x}, \boldsymbol{\theta}) = 1 - p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta})$, so the two class probabilities form a valid Bernoulli distribution.
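A minimal sketch (my own, with made-up weights and input) of this Bernoulli parametrization; note that the per-instance negative log-likelihood is exactly the binary cross-entropy term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0])  # made-up weights
x = np.array([2.0, 1.0])       # made-up input

p1 = sigmoid(theta @ x)  # p(y=1 | x, theta), always in (0, 1)
p0 = 1.0 - p1            # p(y=0 | x, theta)

# The negative log-likelihood of a Bernoulli label under this model
# is exactly the binary cross-entropy term for that instance.
y = 1
nll = -(y * np.log(p1) + (1 - y) * np.log(p0))
print(p1, p0, nll)
```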

Bonus: What would happen if we use MSE on binary classification?

So far, I have presented the basics. Here is a bonus question that I was asked during an ML interview: what happens if we use MSE on binary classification?

When the ground truth is $y^{(i)}=0$ and the prediction is $\hat{y}^{(i)}=\sigma(\boldsymbol{\theta}^T \mathbf{x})$:

\operatorname{MSE}=\frac{1}{m} \sum_{i=1}^{m}\left\|0-\hat{y}^{(i)}\right\|^{2}= \frac{1}{m} \sum_{i=1}^{m}\left\|\sigma( \boldsymbol{\theta}^T \mathbf{x}) \right\|^{2}

When the ground truth is $y^{(i)}=1$:

\operatorname{MSE}=\frac{1}{m} \sum_{i=1}^{m}\left\|1-\hat{y}^{(i)}\right\|^{2}= \frac{1}{m} \sum_{i=1}^{m}\left\|1 - \sigma( \boldsymbol{\theta}^T \mathbf{x}) \right\|^{2}

One intuitive way to guess what happens, without diving into the math, is this: at the beginning of training the network outputs something very close to 0.5, which gives roughly the same weak signal for both classes. Below is a more principled demonstration, proposed by Jonas Maison after the initial release of this article.

Proposed demonstration by Jonas Maison

Let's assume that we have a simple neural network with weights $\theta$, such that $z=\theta^\intercal x$ and $\hat{y}=\sigma(z)$. By the chain rule, the gradient of the loss with respect to the weights is:

\frac{\partial L}{\partial \theta}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial \theta}

MSE Loss

L(y, \hat{y}) = \frac{1}{2}(y-\hat{y})^2
\frac{\partial L}{\partial \theta}=-(y-\hat{y})\sigma(z)(1-\sigma(z))x
\frac{\partial L}{\partial \theta}=-(y-\hat{y})\hat{y}(1-\hat{y})x

The factor $\sigma(z)(1-\sigma(z))$ is the derivative of the sigmoid. It approaches zero whenever $\sigma(z)$ saturates near 0 or 1, so the MSE gradient can vanish even when the prediction is confidently wrong. This is why MSE trains poorly on binary classification.

Binary Cross Entropy (BCE) Loss

L(y, \hat{y}) = -y\log(\hat{y})-(1-y)\log(1-\hat{y})

For $y=0$:

\frac{\partial L}{\partial \theta}=\frac{1-y}{1-\hat{y}}\sigma(z)(1-\sigma(z))x
\frac{\partial L}{\partial \theta}=\frac{1-y}{1-\hat{y}}\hat{y}(1-\hat{y})x
\frac{\partial L}{\partial \theta}=(1-y)\hat{y}x
\frac{\partial L}{\partial \theta}=\hat{y}x

If the network is right, $\hat{y}=0$ and the gradient is zero; if it is wrong, $\hat{y}\to 1$ and the gradient magnitude approaches its maximum. The sigmoid's derivative has cancelled out.

For $y=1$:

\frac{\partial L}{\partial \theta}=-\frac{y}{\hat{y}}\sigma(z)(1-\sigma(z))x
\frac{\partial L}{\partial \theta}=-\frac{y}{\hat{y}}\hat{y}(1-\hat{y})x
\frac{\partial L}{\partial \theta}=-y(1-\hat{y})x
\frac{\partial L}{\partial \theta}=-(1-\hat{y})x

If the network is right, $\hat{y}=1$ and the gradient is again zero; if it is wrong, the magnitude $(1-\hat{y})$ approaches 1. With BCE, the gradient never vanishes for a confidently wrong prediction, unlike with MSE.
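To see the difference numerically, here is a small sketch (my own, plugging a confidently wrong prediction into the two gradient formulas derived above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 0.0  # scalar input, true label 0
z = 6.0          # confidently wrong: y_hat = sigmoid(6) ~ 0.9975
y_hat = sigmoid(z)

grad_mse = -(y - y_hat) * y_hat * (1 - y_hat) * x  # MSE gradient from above
grad_bce = y_hat * x                               # BCE gradient for y = 0

print(grad_mse, grad_bce)  # ~0.0025 vs ~0.9975: the MSE gradient vanishes
```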

Conclusion

This short analysis explains why the objective functions we often pick "blindly", such as cross-entropy, are in fact principled choices. MLE is a principled way to define an optimization problem, and I find it a common discussion topic for backing up design choices during interviews.
