Chapter 13: Information Theory

This chapter covers basic information-theoretic concepts and discusses their relation to machine learning.

§13.01: Entropy I



§13.02: Entropy II



§13.03: Differential Entropy


Properties of Differential Entropy

  1. $h(f)$ can be negative
  2. $h(f)$ is additive for independent random variables
  3. $h(f)$ is maximized by the multivariate normal among all distributions with the same covariance $\Sigma$, so $h(X) \leq \frac{1}{2} \log\left((2 \pi e)^n \lvert \Sigma \rvert\right)$ for $n$-dimensional $X$; in one dimension, $h(X) \leq \frac{1}{2} \log(2 \pi e \sigma^2)$
  4. $h(f)$ is maximized by the continuous uniform distribution for a random variable with a fixed range
  5. $h(f)$ is translation invariant, i.e., $h(X + a) = h(X)$
  6. $h(aX) = h(X) + \log \lvert a \rvert$ (see the numerical sketch after this list)
  7. $h(AX) = h(X) + \log \lvert \det A \rvert$ for a random vector $X$ and a nonsingular matrix $A$.
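
A minimal numerical sketch of properties 1, 3, and 6, using the closed-form entropy of a univariate Gaussian, $h(\mathcal{N}(\mu, \sigma^2)) = \frac{1}{2}\log(2\pi e \sigma^2)$ (natural log, so entropy in nats); the function name and the concrete numbers are illustrative assumptions, not part of the chapter:

```python
import numpy as np

def gaussian_entropy(sigma2):
    """Differential entropy (in nats) of a univariate Gaussian with variance sigma2."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma2)

# Property 1: h(f) can be negative for a sufficiently peaked density.
print(gaussian_entropy(0.01))                     # approx -0.88, i.e. negative

# Property 3: among all densities with variance sigma2, the Gaussian maximizes h;
# compare with a Uniform(-b, b), which has variance b^2 / 3 and entropy log(2b).
sigma2 = 1.0
b = np.sqrt(3 * sigma2)
print(gaussian_entropy(sigma2) >= np.log(2 * b))  # True

# Property 6: h(aX) = h(X) + log|a|; scaling a Gaussian by a multiplies
# its variance by a^2.
a = 3.0
lhs = gaussian_entropy(a**2 * sigma2)
rhs = gaussian_entropy(sigma2) + np.log(abs(a))
print(np.isclose(lhs, rhs))                       # True
```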

§13.04: KL Divergence



§13.05: Cross-Entropy and KL



§13.06: Information Theory for Machine Learning



§13.07: Joint Entropy and Mutual Information I


Mutual Information

The mutual information describes the amount of information about one random variable obtained by observing another, or equivalently, how far their joint distribution is from the product of its marginals (i.e., from independence). The mutual information $I(X;Y)$ is therefore the KL divergence between the joint distribution and the product of the marginals:

\[I(X;Y) = D_{KL}\left(p(x,y) \,\|\, p(x)\,p(y)\right) = E_{p(x,y)}\left[\log_2 \frac{p(X,Y)}{p(X)\, p(Y)}\right]\]
  1. $I(X;Y) = \mathrm{H}(X) - \mathrm{H}(X \mid Y) = \mathrm{H}(Y) - \mathrm{H}(Y \mid X)$
  2. $I(X;Y) \leq \min\{\mathrm{H}(X), \mathrm{H}(Y)\}$
  3. $I(X;Y) = \mathrm{H}(X) + \mathrm{H}(Y) - \mathrm{H}(X,Y)$ (checked numerically in the sketch below)
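
A minimal sketch computing $I(X;Y)$ for a small, made-up joint table over two binary variables, both as the KL divergence between $p(x,y)$ and $p(x)p(y)$ and via the identity $\mathrm{H}(X) + \mathrm{H}(Y) - \mathrm{H}(X,Y)$; entropies are in bits, and the array `pxy` and function names are illustrative assumptions, not from the chapter:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y) of two binary random variables.
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px = pxy.sum(axis=1)   # marginal p(x)
py = pxy.sum(axis=0)   # marginal p(y)

# I(X;Y) as the KL divergence D_KL(p(x,y) || p(x)p(y)).
prod = np.outer(px, py)
mask = pxy > 0
mi_kl = np.sum(pxy[mask] * np.log2(pxy[mask] / prod[mask]))

# I(X;Y) via the identity H(X) + H(Y) - H(X,Y).
mi_h = entropy(px) + entropy(py) - entropy(pxy.flatten())

print(mi_kl, mi_h, np.isclose(mi_kl, mi_h))   # both approx 0.12 bits, True
```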

§13.08: Joint Entropy and Mutual Information II