Information Theory of Neural Networks

Opening the Black Box …. Somewhat

“Information: the negative reciprocal value of probability.” — Claude Shannon

Aim of this blog is not to understand the underlying mathematical concepts behind Neural Network but to visualise Neural Networks in terms of information manipulation.

Encoder-Decoder

Before we start:

Encoder-Decoder is not two CNNs/RNNs combined together! Neither have to be neural network in fact!

Originally, a concept of information theory. Encoder is simply compresses the information and decoder expands the encoded information.

In case of machine learning, both encoding and decoding are both lose-full processes i.e. some information is always lost.

The encoded output of the encoder is called the context vector and this is the input for the decoder.

There are two ways to setup an encoder-decoder setting:

The Decoder in inverse function of Encoder. This way the decoder tries to reproduce the original information. This is used to de-noise the data. This setting has a special name, called auto-encoder.
The Encoder is a compression algorithm and decoder is generating algorithm. This helps to transfer context from one format to another.

Example Applications:

Auto-Encoder: Encoder compressing English text into a vector. Decoder generating original English text from the vector.

Encoder-Decoder: Encoder compressing English text into a vector. Decoder generating French translation of the original text from the vector.

Encoder-Decoder: Encoder compressing English text into a vector. Decoder generating image from the content of text.

Information Theory

Now, if I say every neural network, itself, is an encoder-decoder setting; it would sound absurd to most.

Let’s re-imagine the neural networks.

Let input layer be X and their real tags/classes (present in the training set) be Y. Now we already know Neural Networks find the underlying function between X and Y.

So X can be seen as High-Entropy distribution of Y. High entropy because X contains the information Y but it also a lot of other information.

Example:

“This boy is good.” contains enough information to tell us about it’s ‘positive’ sentiment.

But.

It also contains things like:

1. It’s a specific Boy

2. It’s just one Boy

3. Tense of sentence is present

Now no-entropy version of this sentence would be ‘positive’. Yes that’s also the output. We’ll come back to that in a while.

Now imagine every hidden layer as a single variable H ( so layers will be named as H0, H1 ….. H(n-1) )

Now every layer becomes a variable and the neural network becomes a Markov Chain. Markov Chain because each variable is only dependent upon the previous layer only.

So essentially each layer forms a partisan of information.

Following is a visualisation of a Neural Network as a Markov Chain.

The last layer Y_ is supposed to result in least entropy output (with respect to ‘Y’, the original tag/class).

This process of obtaining Y_ is squeezing the information in X layer as it flows through H layers and retaining only the information most relevant to Y. This is the information bottleneck.

Mutual Information

I(X,Y) = H(X) — H(X|Y)

H -> Entropy

H(X) -> Entropy of X

H(X|Y) -> Conditional Entropy of X given Y

In other words, H(X|Y) signifies how much uncertainty is removed form X if Y is known.

Properties of Mutual Information

When you move along the Markov Chain the mutual information only decreases
Mutual Information is invariant to reparameterization i.e. shuffling values in a layer won’t change the output

Revisiting Bottleneck

In the Markov Representation of Neural Network, every layer becomes a partition of information.

In Information Theory these partitions are known as Successive Refinement of Relevant Information. You don’t have to worry about the details.

Another way of seeing this is the input being encoded and decoded into the output.

So, for enough hidden layers:

the Sample Complexity of deep neural network is determined by encoded Mutual Information at Last Hidden Layer
the Accuracy is determined by Decoded Mutual Information of Last Hidden Layer

Sample Complexity is the number and variety of examples one needs to receive certain accuracy.

Mutual Information along the Training Phase

We calculate Mutual Information between

Layer and Input
Layer and Output

Initial Conditions

Initially,weights are randomly initialised. So barely anything is known about the correct output. With successive layers the mutual information about input decreases and the information in hidden layers about the output is low as well.

Compression Phase

As we train the Neural Network the plots start moving up, signifying gain of information about the output.

But.

Plots also start shifting towards the right side, signifying increase of information in latter layers about the input.

This his the longest phase. Here the density of the plots is maximum and plots are concentrated at the top right. This signifies compression of information about the input in relation to the output.

This is called the compression phase.

Expansion Phase

After the compression phase, the plots starts shifting towards the top but also to left side.

This signifies, with successive layers, there is loss of information about the input and what’s retained in the last layer is the lowest entropy information about the output.

Virtualisation

The Markov Chain version of Neural Network highlights one more point, learning happens from layer to layer. A layer has all the information it needs to predict the output (plus some noise).

So we use each layer to predict the output. This helps us peep into the layer-wise knowledge of the so called black box.

This gives us a perspective about how many layers are required to make an accurate enough prediction of output. If saturation is achieved at earlier layers, the layers succeeding this layer can be pruned/dropped.

These layers are usually very of hundreds or thousands of dimensions. Our evolution doesn’t allow us to visualise anything beyond the 3-dimensions. So we use dimension reduction techniques.

There are various methods of performing dimension reduction. Cristopher Olah has a brilliant blog explaining those methods. I won’t go into the details of t-SNE, you can check this blog for details.

To keep it short, t-SNE tries to reduce dimension retaining the neighbours from higher dimensions in lower dimensions. So this results in pretty accurate 2D and 3D plots.

Following are the layer plots of a language model with 2 layers.

About the plots:

Selected 16 words
Used the final language model to find N synonyms of the above 16 words (N=200 for 2D and N=50 for 3D)
Find the representation vector for each word at each layer
Find 2D and 3D reduced representation of each of above selected words and their synonyms using t-SNE
Plot the reduced representations

2D plots of Layer 1 and Layer 2

3D plots of Layer 1 and Layer 2

Summary

Most every deep neural network acts like an encoder-decoder
Most of the training time is spent in compression phase
Layers are learned bottom-up
Beyond compression phase, more the neural network forgets about the input, the more accurate it gets (wiping out the irrelevant part of input).

I have excluded mathematics from this blog. If you are comfortable enough with the mathematics of Information Theory, Game Theory, Learning Theory etc then do watch this video of the Mastero Naftali Tishby.

Thank You \(-_- )/