How do we interpret the outputs of a neural network trained on classification?

This post shows how neural networks trained for classification approximate the Bayesian posterior, explaining the theoretical basis and providing empirical examples.

What do output unit activities really mean?

Deep neural networks trained for classification have been a key driver in the rise of modern deep learning, and training a classifier is often one of the first lessons or tutorials people encounter when they start learning the field. Among classification tasks, multi-class classification is the most common setting: widely used datasets, like MNIST, CIFAR-10/100, and ImageNet, are all framed this way. In this problem, the model is trained to receive some data \(X\) as input and infer its class label \(C\), where each data point has exactly one label from the \(M\) possible classes \(c \in \{1 ... M\}\). For example, if we train a convolutional neural network (CNN) to do classification on ImageNet, the data \(X\) is the input image, and the output class label \(C\) is one of the 1000 classes in the ImageNet dataset.

Top 10 class predictions from an ImageNet-trained ResNet-50 model. Apart from the best class prediction, many other classes also have non-negligible output activities that make sense. What do these output activities mean?

Neural networks trained for this kind of problem are trained on many input-class label pairs \(\{ (x_j, c_j), j \in 1 ... N \}\) and typically have \(M\) output units in their last layer, called the logit layer, one for each of the \(M\) classes. After training, most people simply take the unit with the highest value and report its corresponding class as the model’s prediction; the exact values of the activations in the logit layer are often ignored. However, we know that they may be telling us something. For example, if we pass the image in the above figure to a CNN model trained on ImageNet, the best class prediction is “golden retriever.” But the units corresponding to other dog breed categories, like “Brittany spaniel,” “Blenheim spaniel,” “Pekinese,” and “papillon,” also have high activations. In a more complex image of a dog racing a go-kart, the highest output is for the class “go-kart.” However, we also see that the outputs corresponding to “sunglasses,” “golden retriever,” and “crash helmet” have non-negligible activations, which makes sense to some extent.
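As a concrete illustration, the sketch below reads out the top-10 class probabilities from a torchvision ResNet-50 pretrained on ImageNet. It is a minimal sketch rather than the exact script used for the figure above, and it assumes a recent torchvision (0.13 or later); the image path is a placeholder.

```python
# Minimal sketch: top-10 ImageNet class probabilities from a pretrained ResNet-50.
# Assumes torchvision >= 0.13; "dog.jpg" is a placeholder path.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()             # resize, center-crop, normalize

image = Image.open("dog.jpg").convert("RGB")
x = preprocess(image).unsqueeze(0)            # shape (1, 3, 224, 224)

with torch.no_grad():
    logits = model(x)                         # shape (1, 1000): the logit layer
    probs = torch.softmax(logits, dim=1)      # probabilities over the 1000 classes

top_probs, top_ids = probs[0].topk(10)
for p, i in zip(top_probs, top_ids):
    print(f"{weights.meta['categories'][i.item()]}: {p.item():.3f}")
```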

This raises the question: what do the activations in the output layer actually represent? Many people understand that after applying the softmax function to the logit layer, the resulting values represent probabilities, namely how much the model believes that the input belongs to each class. But why can a model trained only with input-class label pairs represent probabilities over many classes at all? And, more deeply, what probabilities are they? To make this question concrete, consider the example of coin flipping. When we say the probability of a coin landing heads is 1/2, we generally mean that if we flip the coin many times, it will land on heads approximately half the time. However, when a neural network assigns a probability of 1/2 to an image being in the “dog” category, what does this actually mean?

Here, we try to answer the question: what do the probabilities predicted by a neural network in classification tasks actually represent? We will show that, in the limit, these probabilities are trained to match the Bayesian posterior probability computed by an ideal observer with access to the generative model that produced the data. This also implies that if we have access to a probabilistic generator of the data and class labels, we can train neural network models to infer the class from data without performing the often intractable posterior calculation (usually called the “inference” problem).

Previous studies on this topic

Many studies, particularly from the early 1990s, have demonstrated that neural networks can approximate the Bayesian posterior when trained on classification tasks. The mathematical proof and empirical study in this post are inspired by Richard and Lippmann (1991). While they provided proofs for cases where the output units are unconstrained, they did not address the case where the output units are softmax-transformed, which is the standard setup today. Several other papers also explored this problem and presented similar proofs. This post aims to synthesize these earlier studies and offer a clearer understanding of the results in a modern context. For those interested, we also provide the code repository for implementing the experiments using PyTorch: https://github.com/YudiXie/interpret-classification.

Deriving the classification objective using maximum likelihood

Before we dive into the interpretation of output activations, let’s first clarify how neural network training is typically set up. For multi-class classification problems, neural networks are commonly designed with \(M\) output units, each corresponding to one of the \(M\) classes. The activations of these output units are passed through a softmax layer, transforming them into a probability distribution over the \(M\) classes, denoted as \(q_{\theta}(C=i \\| X)\), for \(i = 1, \dots, M\). This ensures that each output value lies between 0 and 1, and that the activations of all \(M\) units sum to 1. We denote the activation of the \(i\)-th output unit as \(f_{\theta}^i(X)\), where \(X\) represents the input (e.g., an image), and \(\theta\) represents the trainable parameters of the neural network. After applying the softmax function, the probability assigned to the \(i\)-th class is given by:

\begin{equation} \begin{split} q_{\theta}(C=i|X) = \frac{e^{f_{\theta}^i(X)}}{\sum_{j=1}^M e^{f_{\theta}^j(X)}} \end{split} \end{equation}
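To make this concrete, the short sketch below evaluates the softmax formula above directly from a vector of logits and checks that it matches PyTorch’s built-in softmax; the logit values are arbitrary placeholders.

```python
# Minimal check: the softmax formula above vs. PyTorch's built-in softmax.
# The logit values are arbitrary, standing in for f_theta^i(X) with M = 4 classes.
import torch

logits = torch.tensor([2.0, -1.0, 0.5, 3.0])

manual = torch.exp(logits) / torch.exp(logits).sum()
builtin = torch.softmax(logits, dim=0)    # subtracts the max logit internally for stability

print(manual)                             # each entry in (0, 1), entries sum to 1
print(torch.allclose(manual, builtin))    # True
```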

The most commonly used loss function for multi-class classification can be derived from the maximum likelihood principle. Specifically, we aim to maximize the conditional log-likelihood of the training data under the parameterized model. The training data is represented as \(\{(x_j, c_j), j = 1, \ldots, N \}\), where \(x_j\) is an input (e.g., an image), \(c_j\) is the corresponding class label, and \(N\) is the number of training examples. The resulting loss function, \(\mathbb{L}\), is given below. In practice, if you use PyTorch, this objective is implemented by the built-in cross-entropy loss (torch.nn.CrossEntropyLoss), which combines the log-softmax and the negative log-likelihood in one operation. (As we will discuss later, it is called cross-entropy because the loss can be understood as the cross-entropy between the model’s predicted output distribution and the empirical distribution of the data.)

\begin{equation} \label{eq:1} \begin{split} \mathbb{L} & = - \frac{1}{N} \sum_{j=1}^N \sum_{i=1}^M \log ({q_{\theta}(C=i|x_j)}) \cdot 1\{c_j = i\}
\end{split} \end{equation}

Here, \(1\{c_j = i\}\) is the indicator function, which takes the value 1 when the class label \(c_j\) equals \(i\), and 0 otherwise. Essentially, for each input, we apply the softmax function over all the output units, then select the unit corresponding to the true class label, take the negative logarithm of that softmax value, and average over all data points. In practice, we optimize this loss function using gradient descent, typically over mini-batches of data, to train the model.
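The following sketch checks this correspondence numerically: it evaluates Equation \eqref{eq:1} by hand (log-softmax, select the true-class entry, negate, average) and compares the result with PyTorch’s torch.nn.functional.cross_entropy on the same random logits and labels.

```python
# Minimal check: Equation (1) computed by hand vs. torch.nn.functional.cross_entropy.
# The logits and labels are random placeholders (N = 8 examples, M = 5 classes).
import torch
import torch.nn.functional as F

N, M = 8, 5
logits = torch.randn(N, M)            # f_theta(x_j) for each example
labels = torch.randint(0, M, (N,))    # class labels c_j in {0, ..., M-1}

# By hand: log-softmax, pick the entry selected by the indicator 1{c_j = i}, negate, average.
log_q = torch.log_softmax(logits, dim=1)             # log q_theta(C = i | x_j)
manual_loss = -log_q[torch.arange(N), labels].mean()

builtin_loss = F.cross_entropy(logits, labels)       # combines log-softmax and NLL

print(manual_loss.item(), builtin_loss.item())       # the two values agree
```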

While maximum likelihood is a well-founded approach, it does not directly explain what the output activations represent in a well-trained model when applied to test data. In the following sections, we will show that, in the limit, a model trained with this objective will converge to the Bayesian posterior probability over the classes given the data, \(P(C\\|X)\).

Interpretation of outputs as the Bayesian posterior

We denote the joint distribution of the input data \(X\), and class labels \(C\) as \(P(X, C)\). The posterior distribution, \(P(C\\|X)\), can be understood as a function that maps the input \(X\) to a probability distribution over the class variable \(C\). Intuitively, computing this posterior aligns with our goal of inferring the probability of each class given the input data. In fact, we will demonstrate that optimizing the network to approximate the posterior \(P(C\\|X)\) is equivalent to training a neural network using the loss function derived in the above section.

We start from the goal of learning the parameters \(\theta\) of a neural network model \(q_{\theta}(C\\|X)\) to approximate the posterior \(P(C\\|X) = \frac{P(X, C)}{P(X)}\). The parameters are learned from the training data \(\{(x_j, c_j), j = 1, \ldots, N \}\). We try to find \(q_\theta\) by minimizing the conditional KL divergence between \(P(C\\|X)\) and \(q_\theta(C\\|X)\):

\begin{equation} \begin{split} \theta^* & = \arg \min_{\theta} \mathbb{L}_{KL}(P(C|X), q_\theta(C|X)) \end{split} \end{equation}

where

\begin{equation} \label{eq:2} \begin{split} \mathbb{L}_{KL}(P(C|X), q_\theta(C|X)) & = \mathbb{E}_{x \sim P(X)} D_{KL}(P(C|x) | q_\theta(C|x)) \end{split} \end{equation}

and we have,

\begin{equation} \begin{split} D_{KL}(P(C|x) | q_\theta(C|x)) & = \mathbb{E}_{c \sim P(C|x)} \log \frac{P(c|x)}{q_\theta(c|x)} \newline & = - H(P(C|x)) - \mathbb{E}_{c \sim P(C|x)} \log q_\theta(c|x) \end{split} \end{equation}

Because \(P\), the ground truth data distribution, does not depend on \(\theta\), the first term in the above equation is fixed. The second term is the cross entropy. So, minimizing \(\mathbb{L}_{KL}(P(C\\|X), q_\theta(C\\|X))\) is equivalent to minimizing the following expected cross-entropy loss:

\begin{equation} \label{eq:3} \begin{split} \mathbb{L}_{CE}(P(C|X), q_\theta(C|X)) & = \mathbb{E}_{x \sim P(X)} \Big[ - \mathbb{E}_{c \sim P(C|x)} \log q_\theta(c|x) \Big] \end{split} \end{equation}

So that,

\begin{equation} \begin{split} \theta^* & = \arg \min_{\theta} \mathbb{L}_{KL}(P(C|X), q_\theta(C|X)) \newline & = \arg \min_{\theta} \mathbb{L}_{CE}(P(C|X), q_\theta(C|X)) \newline & = \arg \min_{\theta} \mathbb{E}_{x \sim P(X)} \Big[ - \mathbb{E}_{c \sim P(C|x)} \log q_\theta(c|x) \Big] \newline & = \arg \min_{\theta} \mathbb{E}_{x \sim P(X)} \Big[ - \int P(c|x) \log q_\theta(c|x) \,dc \Big] \end{split} \end{equation}
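As a quick sanity check on the step from the KL objective to the cross-entropy objective, the snippet below verifies numerically, for a single arbitrary input \(x\), that the KL divergence equals the cross-entropy minus the entropy \(H(P(C\\|x))\); both distributions are arbitrary placeholders over four classes.

```python
# Numeric check of the decomposition used above:
# D_KL(P(C|x) || q(C|x)) = cross-entropy(P, q) - entropy(P), for one fixed x.
# Both distributions are arbitrary placeholders over M = 4 classes.
import numpy as np

p = np.array([0.10, 0.20, 0.60, 0.10])    # "true" posterior P(C|x)
q = np.array([0.25, 0.25, 0.30, 0.20])    # model output q_theta(C|x)

kl = np.sum(p * np.log(p / q))
cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))

print(np.isclose(kl, cross_entropy - entropy))    # True
```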

In general, computing this cross-entropy loss requires access to the posterior \(P(C\\|X)\) itself. In classification problems, however, we can avoid this requirement by using samples from the joint distribution \((x, c) \sim P(X, C)\). Estimating the loss this way would be tricky if \(C\) were a continuous variable, but here \(C\) is a discrete random variable over the class labels, taking values in the finite set \(c \in \{1, ..., M\}\). So, we can write the loss function as:

\begin{equation} \begin{split} \mathbb{L}_{CE}(P(C|X), q_\theta(C|X)) & = \mathbb{E}_{x \sim P(X)} \Big[ - \sum_{i=1}^M P(C=i|x) \log q_\theta(C=i|x) \Big] \end{split} \end{equation}

We can derive a loss function \(\overline{\mathbb{L}}_{CE}(P(C\\|X), q_\theta(C\\|X))\) that achieves the same objective as \(\mathbb{L}_{CE}(P(C\\|X), q_\theta(C\\|X))\) but requires only samples from the joint distribution. Instead of Equation \eqref{eq:3}, we can optimize the following sample-based cross-entropy loss:

\begin{equation} \label{eq:4} \begin{split} \overline{\mathbb{L}}_{CE}(P(C|X), q_\theta(C|X)) & = \mathbb{E}_{(x,c) \sim P(X,C)} \Big[ - \sum_{i=1}^M \log ({q_{\theta}(C=i|x)}) \cdot 1\{c = i\} \Big] \end{split} \end{equation}

The indicator function \(1\{c = i\}\) takes the value 1 only when the class label of a given data sample is \(c = i\), and 0 otherwise. We can show that \(\overline{\mathbb{L}}_{CE}(P(C\\|X), q_\theta(C\\|X))\) is identical to the cross-entropy loss \(\mathbb{L}_{CE}(P(C\\|X), q_\theta(C\\|X))\):

\begin{equation} \begin{split} \overline{\mathbb{L}}_{CE}(P(C|X), q_\theta(C|X)) & = \mathbb{E}_{(x,c) \sim P(X,C)} \Big[ - \sum_{i=1}^M \log ({q_{\theta}(C=i|x)}) \cdot 1\{c = i\} \Big] \newline & = - \mathbb{E}_{x \sim P(X)} \Big[ \mathbb{E}_{c \sim P(C|x)} \Big[ \sum_{i=1}^M \log ({q_{\theta}(C=i|x)}) \cdot 1\{c = i\} \Big] \Big] \newline & = - \mathbb{E}_{x \sim P(X)} \Big[ \sum_{i=1}^M P(C=i|x) \log ({q_{\theta}(C=i|x)}) \Big] \newline & = \mathbb{L}_{CE}(P(C|X), q_\theta(C|X)) \end{split} \end{equation}

In summary, we have demonstrated that optimizing Equation \eqref{eq:4} is equivalent to optimizing Equation \eqref{eq:3}, which in turn is equivalent to optimizing Equation \eqref{eq:2}.
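The equivalence between Equation \eqref{eq:4} and Equation \eqref{eq:3} can also be checked by simulation: for a small discrete joint distribution \(P(X, C)\), averaging the indicator-based loss over samples \((x, c) \sim P(X, C)\) converges to the posterior-weighted cross-entropy. The joint distribution and the fixed model \(q_\theta\) below are arbitrary placeholders.

```python
# Monte-Carlo check that the sample-based loss (Eq. 4) matches the
# posterior-weighted cross-entropy (Eq. 3) for a toy discrete joint P(X, C).
# All probability tables are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(0)

p_x = np.array([0.5, 0.3, 0.2])                  # marginal P(X), three possible inputs
p_c_given_x = np.array([[0.7, 0.2, 0.1],         # posterior P(C|X=x), rows sum to 1
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4]])
q_c_given_x = np.array([[0.6, 0.3, 0.1],         # a fixed "model" q(C|X=x)
                        [0.2, 0.6, 0.2],
                        [0.2, 0.4, 0.4]])

# Eq. (3): expectation over x of the posterior-weighted cross-entropy.
loss_eq3 = -np.sum(p_x[:, None] * p_c_given_x * np.log(q_c_given_x))

# Eq. (4): sample (x, c) from the joint and average -log q(c|x).
n_samples = 50_000
xs = rng.choice(3, size=n_samples, p=p_x)
cs = np.array([rng.choice(3, p=p_c_given_x[x]) for x in xs])
loss_eq4 = -np.mean(np.log(q_c_given_x[xs, cs]))

print(loss_eq3, loss_eq4)    # agree up to Monte-Carlo error
```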

For Equation \eqref{eq:4}, we can derive a lower bound on the objective function:

\begin{equation} \begin{split} \overline{\mathbb{L}}_{CE}(P(C|X), q_\theta(C|X)) & = - \mathbb{E}_{x \sim P(X)} \Big[ \sum_{i=1}^M P(C=i|x) \log ({q_{\theta}(C=i|x)}) \Big] \newline & \geq - \mathbb{E}_{x \sim P(X)} \Big[ \sum_{i=1}^M P(C=i|x) \log ({P(C=i|x)}) \Big] \end{split} \end{equation}

In the final step, we apply Gibbs’ inequality: the cross-entropy between two distributions is never smaller than the entropy of the first, with equality if and only if the two distributions are identical. For \(x \sim P(X)\), the loss \(\overline{\mathbb{L}}_{CE}(P(C\\|X), q_\theta(C\\|X))\) is therefore minimized when the bound becomes tight, which occurs when:

\begin{equation} \begin{split} q_{\theta^*}(C=i|x) = P(C=i|x), \quad i \in \{1, \ldots, M\} \end{split} \end{equation}
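Gibbs’ inequality itself is easy to check numerically: for any two distributions over the \(M\) classes, the cross-entropy is at least the entropy. The sketch below tests this on randomly drawn distributions.

```python
# Numeric check of Gibbs' inequality: -sum_i p_i log q_i >= -sum_i p_i log p_i.
# Distributions over M classes are drawn at random from a Dirichlet.
import numpy as np

rng = np.random.default_rng(0)
M = 10

for _ in range(1000):
    p = rng.dirichlet(np.ones(M))          # random distribution over M classes
    q = rng.dirichlet(np.ones(M))
    cross_entropy = -np.sum(p * np.log(q))
    entropy = -np.sum(p * np.log(p))
    assert cross_entropy >= entropy        # never fails

print("Gibbs' inequality held in all trials")
```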

In practice, the loss is estimated by taking mini-batches of data \(\{(x_j, c_j), j = 1, ..., N\}\) and we use stochastic gradient descent to find \(\theta^*\) by optimizing the following:

\begin{equation} \begin{split} \overline{\mathbb{L}}_{CE}(P(C|X), q_\theta(C|X)) & \approx \widehat{\mathbb{L}}_{CE}(P(C|X), q_\theta(C|X)) \newline & = - \frac{1}{N} \sum_{j=1}^N \sum_{i=1}^M \log ({q_{\theta}(C=i|x_j)}) \cdot 1\{c_j = i\} \end{split} \end{equation}

This objective is identical to the one derived from the maximum likelihood principle in Equation \eqref{eq:1}.

In summary, we have shown that minimizing the objective derived from maximum likelihood (Equation \eqref{eq:1}) is equivalent to minimizing the conditional KL divergence between the model’s output and the Bayesian posterior (Equation \eqref{eq:2}). The loss is minimized when the model’s output exactly matches the Bayesian posterior.

Let us return to our original question: What do the activations in the output layer actually represent? Our analysis has demonstrated that the softmax output activations are trained to approximate the Bayesian posterior as closely as possible. The Bayesian posterior is computed under the assumption that there is a generative model \(P(X, C)\) that has produced the data. To calculate the posterior, Bayes’ rule is applied: \(P(C\\|X) = \frac{P(X, C)}{P(X)}\).

In most real-world tasks, such as ImageNet classification, we do not have access to the true generative model of the data, so the exact posterior cannot be computed. In that sense, the interpretation of the output unit activities is hypothetical: the softmax outputs approximate the Bayesian posterior under whatever generative model actually produced the data, if one could write it down. In some cases, however, we do have access to the generative model. Our results then imply that the softmax output activities of the network will approximate the Bayesian posterior calculated analytically from that generative model. We explore this with specific examples in the following section.

Empirical studies with known generative models

A simple classification example

Now, let us explore a basic empirical example. Suppose we observe a single data point, \(x\), which is drawn from one of two possible classes, \(C = c_1\) or \(C = c_2\), with equal prior probabilities, i.e., \(P(c_1) = P(c_2) = 0.5\). Once the class is determined, the probability distribution of \(x\) given each class follows a Gaussian distribution, as shown in the figure below.

\begin{equation} \begin{split} P(x|c_1) = \frac{1}{\sqrt{2 \pi \sigma_1^2} } e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} \text{, where} \, \sigma_1=1 \text{, } \mu_1=0 \newline P(x|c_2) = \frac{1}{\sqrt{2 \pi \sigma_2^2} } e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}} \text{, where} \, \sigma_2=1 \text{, } \mu_2=3 \end{split} \end{equation}

Probability density of the two classes and their mixture.

When we observe a new data point \(x\), we want to infer which class it belongs to. Intuitively, in the example above, if \(x\) is small, it is more likely to belong to the first class, while if \(x\) is large, it is more likely to belong to the second class. Concretely, there are two ways we can approach this problem:

1. Train neural networks with data. One approach is to train a neural network using a dataset of many labeled examples, represented as pairs \(\{(x_j, c_j), j = 1 \dots N \}\). We can train a neural network model that takes \(x\) as input and predicts the corresponding class, optimizing the loss function in Equation \eqref{eq:1}. This method allows us to infer the class of new data without knowing the underlying generative process. The code to do the above is posted here: https://github.com/YudiXie/interpret-classification (a self-contained sketch of the same idea is shown right after this list).

2. Derive the Bayesian posterior theoretically. Another way to solve this problem is to derive the posterior analytically using Bayes’ rule. Given \(x\), the posterior probability that it has class label \(c_1\) or \(c_2\) can be computed as:

\begin{equation} \begin{split} P(c_1|x) & = \frac{P(x|c_1) P(c_1)}{P(x)} = \frac{P(x|c_1) P(c_1)}{P(x|c_1) P(c_1) + P(x|c_2) P(c_2)} \newline & = \frac{1}{1 + \frac{\sigma_1}{\sigma_2} e^{\frac{\sigma_2^2(x-\mu_1)^2 - \sigma_1^2(x-\mu_2)^2}{2 \sigma_1^2 \sigma_2^2}}} \newline & = \frac{1}{1 + e^{\frac{6x-9}{2}}} \end{split} \end{equation}

\begin{equation} \begin{split} P(c_2|x) & = \frac{P(x|c_2) P(c_2)}{P(x)} = \frac{P(x|c_2) P(c_2)}{P(x|c_1) P(c_1) + P(x|c_2) P(c_2)} \newline & = \frac{1}{1 + \frac{\sigma_2}{\sigma_1} e^{\frac{\sigma_1^2(x-\mu_2)^2 - \sigma_2^2(x-\mu_1)^2}{2 \sigma_1^2 \sigma_2^2}}} \newline & = \frac{1}{1 + e^{\frac{-6x+9}{2}}} \end{split} \end{equation}
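Putting the two approaches together, here is a minimal, self-contained sketch of the first approach (a simplified stand-in for the linked repository: the network size, learning rate, and number of updates are hypothetical choices, not the exact settings used for the figures below). It samples data from the two-Gaussian generative model, trains a small network with the cross-entropy loss, and compares its softmax output with the analytic posterior derived above.

```python
# Minimal sketch: train a small classifier on samples from the two-Gaussian
# generative model and compare its softmax output with the analytic posterior.
# Network size, learning rate, and update count are hypothetical choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

mu1, sigma1 = 0.0, 1.0    # class 1: P(x|c1) = N(mu1, sigma1^2)
mu2, sigma2 = 3.0, 1.0    # class 2: P(x|c2) = N(mu2, sigma2^2)

def sample_batch(n):
    """Sample (x, c) pairs from the generative model with P(c1) = P(c2) = 0.5."""
    c = torch.randint(0, 2, (n,))                    # class labels 0 (c1) or 1 (c2)
    mu = torch.tensor([mu1, mu2])[c]
    sigma = torch.tensor([sigma1, sigma2])[c]
    x = mu + sigma * torch.randn(n)
    return x.unsqueeze(1), c                         # x: (n, 1), c: (n,)

def analytic_posterior_c1(x):
    """P(c1|x) = 1 / (1 + (s1/s2) * exp((s2^2 (x-m1)^2 - s1^2 (x-m2)^2) / (2 s1^2 s2^2)))."""
    a = (torch.log(torch.tensor(sigma1 / sigma2))
         + (sigma2**2 * (x - mu1)**2 - sigma1**2 * (x - mu2)**2)
         / (2 * sigma1**2 * sigma2**2))
    return torch.sigmoid(-a)                         # 1 / (1 + e^a)

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 2))              # two output units: the logit layer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):                             # mini-batch updates
    x, c = sample_batch(128)
    loss = F.cross_entropy(model(x), c)              # the loss in Equation (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Compare the learned softmax output with the analytic posterior on a grid of x values.
x_grid = torch.linspace(-4.0, 7.0, 111).unsqueeze(1)
with torch.no_grad():
    q_c1 = torch.softmax(model(x_grid), dim=1)[:, 0]  # network's q_theta(c1|x)
p_c1 = analytic_posterior_c1(x_grid.squeeze(1))       # Bayesian posterior P(c1|x)
print(torch.max(torch.abs(q_c1 - p_c1)).item())       # should shrink as training proceeds
```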

Comparing the theoretically derived Bayesian posterior with neural network output at different stages of training. Showing neural network results after 10, 50, and 100 mini-batch updates.

In the previous section, we proved that the theoretically derived Bayesian posterior is the solution that minimizes the loss function used to train the neural network. But does this hold empirically? The above figure compares the theoretical solution with the solution learned by the neural network at different stages of training. As shown, the model gradually approximates the theoretical solution over the course of mini-batch updates. After 100 batches, the model’s output is nearly identical to the theoretical solution.

A harder example with a more complex posterior

Probability density of the two classes and their mixture.

Let us look at another example, shown in the figure above, where \(\sigma_1=2 \text{, } \mu_1=0\) and \(\sigma_2=1 \text{, } \mu_2=1\). The posterior distribution is more complex in this case than in the previous example. As we move along the x-axis from left to right, \(P(C=c_1\\|x)\) initially dominates over \(P(C=c_2\\|x)\), but this reverses when \(x\) becomes greater than roughly 0. Interestingly, for \(x\) values larger than around 3, \(P(C=c_1\\|x)\) once again surpasses \(P(C=c_2\\|x)\), because the probability density of class 2 drops faster than that of class 1. This raises the question: can a neural network capture this more complex posterior after training?
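Before turning to the network, we can verify where the two crossing points sit by plugging the new parameters into the analytic posterior formula derived in the first example and locating where \(P(C=c_1\\|x)\) crosses 0.5; a small numpy check of this is sketched below.

```python
# Locate where P(c1|x) crosses 0.5 for sigma1=2, mu1=0, sigma2=1, mu2=1,
# using the analytic posterior formula derived in the first example.
import numpy as np

mu1, sigma1 = 0.0, 2.0
mu2, sigma2 = 1.0, 1.0

def posterior_c1(x):
    a = (np.log(sigma1 / sigma2)
         + (sigma2**2 * (x - mu1)**2 - sigma1**2 * (x - mu2)**2)
         / (2 * sigma1**2 * sigma2**2))
    return 1.0 / (1.0 + np.exp(a))

x = np.linspace(-5, 6, 2201)
p1 = posterior_c1(x)
crossings = x[:-1][np.diff(np.sign(p1 - 0.5)) != 0]
print(crossings)    # two crossings: one slightly below 0, one a little below 3
```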

Comparing the theoretically derived Bayesian posterior with neural network output at different stages of training. Showing neural network results after 10, 100, and 1000 mini-batch updates.

In above figure, we compare the theoretically derived Bayesian posterior with the neural network output at different stages of training. Like the previous example, the network gradually approximates the posterior shape during training. However, unlike the earlier case, the network requires more training examples to approximate this more complex posterior accurately. Noticeable gaps remain between the model’s output and the theoretical result at 100 batches, and only after 1000 mini-batch updates does the network achieve a reasonably accurate approximation of the posterior.

An even harder classification example

Probability density of the two classes and their mixture.

Let us consider an even more challenging example, shown in the above figure, where we set \(\sigma_1=2\), \(\mu_1=0\), and \(\sigma_2=1\), \(\mu_2=3\). Compared to the previous example, the only change is that the center of the class 2 distribution has been shifted to the right. While the shape of the theoretically derived posterior remains quite similar to the earlier case, this shift makes it even more difficult for the model to learn the posterior accurately.

Comparing the theoretically derived Bayesian posterior with neural network output at different stages of training. Showing neural network results after 100, 1000, and 10000 mini-batch updates.

As shown in the above figure, after 1000 mini-batch updates, the model successfully learns the first crossing point of the posterior on the left, but it struggles to capture the second crossing point. Even after 10,000 batches, the model’s output still deviates significantly from the theoretical posterior. One potential reason for this difficulty is that very little probability mass is concentrated on the right side of the x-axis for both classes. The sparsity of the training data in that region makes it much harder for the model to learn the posterior shape correctly.

In summary, we have shown that the model approximates the posterior well in many simple cases. However, the model’s ability to accurately approximate the posterior depends not only on the complexity of the shape of the posterior, but also on the underlying data distribution. In regions where the data is sparse, it becomes much more challenging for the model to learn the correct posterior shape.


If you spot any mistakes or errors in this post, feel free to leave a comment below or reach out to me via the email provided on the main page. I’d be happy to correct them promptly 😊

If you found this useful, you can cite this as:

@article{xie2024how,
  title   = {How do we interpret the outputs of a neural network trained on classification?},
  author  = {Xie, Yudi},
  journal = {yudixie.github.io},
  year    = {2024},
  month   = {Oct},
  url     = {https://yudixie.github.io/blog/2024/interpret-classification/}
}