19. Etymology of Entropy#

This lecture describes and compares several notions of entropy.

Among the senses of entropy, we’ll encounter these:

  • A measure of uncertainty of a random variable advanced by Claude Shannon [Shannon and Weaver, 1949]

  • A key object governing thermodynamics

  • Kullback and Leibler’s measure of the statistical divergence between two probability distributions

  • A measure of the volatility of stochastic discount factors that appear in asset pricing theory

  • Measures of unpredictability that occur in classical Wiener-Kolmogorov linear prediction theory

  • A frequency domain criterion for constructing robust decision rules

The concept of entropy plays an important role in robust control formulations described in the lectures Risk and Model Uncertainty and Robustness.

19.1. Information Theory#

In information theory [Shannon and Weaver, 1949], entropy is a measure of the unpredictability of a random variable.

To illustrate things, let $X$ be a discrete random variable taking values $x_1, \ldots, x_n$ with probabilities $p_i = \textrm{Prob}(X = x_i) \geq 0$, $\sum_i p_i = 1$.

Claude Shannon’s [Shannon and Weaver, 1949] definition of entropy is

$$
H(p) = \sum_i p_i \log_b (p_i^{-1}) = - \sum_i p_i \log_b (p_i) \tag{19.1}
$$

where $\log_b$ denotes the log function with base $b$.

Inspired by the limit

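$$
\lim_{p \downarrow 0} p \log p = \lim_{p \downarrow 0} \frac{\log p}{p^{-1}} = \lim_{p \downarrow 0} (-p) = 0,
$$

we set $p \log p = 0$ whenever $p = 0$ in equation (19.1).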

Typical bases for the logarithm are 2, e, and 10.

In the information theory literature, logarithms of base 2, e, and 10 are associated with units of information called bits, nats, and dits, respectively.

Shannon typically used base 2.
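As a minimal sketch of definition (19.1), the following Python snippet computes entropy in bits, nats, and dits. The function name `shannon_entropy` and the example probability vector are illustrative assumptions, not part of the lecture.

```python
import math

def shannon_entropy(p, base=2.0):
    """Entropy H(p) = -sum_i p_i log_b(p_i), with the convention 0 log 0 = 0."""
    if any(pi < 0 for pi in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability vector")
    # Skipping zero-probability states implements p log p = 0 at p = 0.
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25, 0.0]                # hypothetical distribution
bits = shannon_entropy(p, base=2)         # base 2: bits
nats = shannon_entropy(p, base=math.e)    # base e: nats
dits = shannon_entropy(p, base=10)        # base 10: dits
print(bits, nats, dits)
```

Changing the base only rescales entropy: the value in nats equals the value in bits multiplied by $\log 2$.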

19.2. A Measure of Unpredictability#

For a discrete random variable $X$ with probability density $p = \{p_i\}_{i=1}^n$, the surprisal for state $i$ is $s_i = \log\left(\frac{1}{p_i}\right)$.

The quantity $\log\left(\frac{1}{p_i}\right)$ is called the surprisal because it is inversely related to the likelihood that state $i$ will occur.

Note that entropy $H(p)$ equals the expected surprisal

$$
H(p) = \sum_i p_i s_i = \sum_i p_i \log\left(\frac{1}{p_i}\right) = - \sum_i p_i \log(p_i).
$$

19.2.1. Example#

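Take a possibly unfair coin, so $X = \{0,1\}$ with $p = \textrm{Prob}(X=1) \in [0,1]$.

Then

$$
H(p) = -(1-p)\log(1-p) - p \log p .
$$

Evidently,

$$
H'(p) = \log(1-p) - \log p = 0
$$

at $p = .5$ and $H''(p) = -\frac{1}{1-p} - \frac{1}{p} < 0$ for $p \in (0,1)$.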

So p=.5 maximizes entropy, while entropy is minimized at p=0 and p=1.

Thus, among all coins, a fair coin is the most unpredictable.

See Fig. 19.1.

_images/MyGraph5.png

Fig. 19.1 Entropy as a function of $\hat \pi_1$ when $\pi_1 = .5$.#
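The coin example can be checked numerically. The sketch below evaluates the coin's entropy on an evenly spaced grid (a choice made here purely for illustration) and locates the maximizer; `coin_entropy` is a hypothetical helper name.

```python
import math

def coin_entropy(p):
    """H(p) = -(1-p) log(1-p) - p log p in nats, with 0 log 0 = 0."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

grid = [i / 100 for i in range(101)]          # p = 0, .01, ..., 1
values = [coin_entropy(p) for p in grid]
p_star = grid[values.index(max(values))]      # grid point with largest entropy
print(p_star, max(values), math.log(2))
```

On this grid the maximizer is the fair coin, and entropy vanishes at the two degenerate coins $p = 0$ and $p = 1$.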

19.2.2. Example#

Take an $n$-sided possibly unfair die with a probability distribution $\{p_i\}_{i=1}^n$. The die is fair if $p_i = \frac{1}{n} \ \forall i$.

Among all dice, a fair die maximizes entropy.

For a fair die, entropy equals $H(p) = -n^{-1} \sum_i \log\left(\frac{1}{n}\right) = \log(n)$.

Isolating the outcome of one roll of a fair $n$-sided die requires $\log_2(n)$ bits of information in expectation.

For example, if $n = 2$, $\log_2(2) = 1$.

For $n = 3$, $\log_2(3) \approx 1.585$.
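A short sketch of the die example follows. The loaded six-sided die is a made-up distribution used only to show that departing from fairness lowers entropy; `entropy_bits` is a hypothetical helper name.

```python
import math

def entropy_bits(p):
    """Entropy in bits, skipping zero-probability faces."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Fair dice: entropy should match log2(n).
for n in (2, 3, 6):
    fair = [1.0 / n] * n
    print(n, entropy_bits(fair), math.log2(n))

# A hypothetical loaded six-sided die has lower entropy than a fair one.
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(entropy_bits(loaded), math.log2(6))
```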

19.3. Mathematical Properties of Entropy#

For a discrete random variable with probability vector $p$, entropy $H(p)$ is a function that satisfies

  • $H$ is continuous.

  • $H$ is symmetric: $H(p_1, p_2, \ldots, p_n) = H(p_{r_1}, \ldots, p_{r_n})$ for any permutation $r_1, \ldots, r_n$ of $1, \ldots, n$.

  • A uniform distribution maximizes $H(p)$: $H(p_1, \ldots, p_n) \leq H(\frac{1}{n}, \ldots, \frac{1}{n})$.

  • Maximum entropy increases with the number of states: $H(\frac{1}{n}, \ldots, \frac{1}{n}) \leq H(\frac{1}{n+1}, \ldots, \frac{1}{n+1})$.

  • Entropy is not affected by events with zero probability.
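The properties listed above can be spot-checked on a randomly drawn probability vector. This is a numerical illustration under an arbitrary seed and dimension chosen here, not a proof.

```python
import math
import random

def entropy(p):
    """Entropy in nats with the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

random.seed(0)                                   # arbitrary seed
raw = [random.random() for _ in range(5)]
p = [r / sum(raw) for r in raw]                  # random probability vector

shuffled = p[:]
random.shuffle(shuffled)

symmetric = math.isclose(entropy(p), entropy(shuffled))        # permutation invariance
below_uniform = entropy(p) <= entropy([1 / 5] * 5)             # uniform maximizes H
growing = entropy([1 / 5] * 5) <= entropy([1 / 6] * 6)         # more states, more entropy
padding_invariant = math.isclose(entropy(p), entropy(p + [0.0]))  # zero-probability event
print(symmetric, below_uniform, growing, padding_invariant)
```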

19.4. Conditional Entropy#

Let $(X,Y)$ be a bivariate discrete random vector with outcomes $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$, respectively, occurring with probability density $p(x_i, y_j)$.

Conditional entropy $H(X|Y)$ is defined as

$$
H(X|Y) = \sum_{i,j} p(x_i, y_j) \log \frac{p(y_j)}{p(x_i, y_j)} . \tag{19.2}
$$

Here $\frac{p(y_j)}{p(x_i, y_j)}$ is the reciprocal of the conditional probability of $x_i$ given $y_j$, and its logarithm can be defined as the conditional surprisal.
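To make definition (19.2) concrete, here is a small sketch that evaluates conditional entropy for a hypothetical $2 \times 2$ joint distribution. It also compares the result with joint entropy minus the entropy of $Y$, a standard chain-rule identity that the lecture does not state but that follows directly from (19.2).

```python
import math

def entropy(p):
    """Entropy in nats with the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def conditional_entropy(joint):
    """H(X|Y) from (19.2), where joint[i][j] = p(x_i, y_j)."""
    n, m = len(joint), len(joint[0])
    p_y = [sum(joint[i][j] for i in range(n)) for j in range(m)]  # marginal of Y
    return sum(
        joint[i][j] * math.log(p_y[j] / joint[i][j])
        for i in range(n) for j in range(m) if joint[i][j] > 0
    )

joint = [[0.3, 0.1],
         [0.2, 0.4]]                      # hypothetical joint distribution
flat = [pij for row in joint for pij in row]
p_y = [0.5, 0.5]                          # column sums of `joint`

h_x_given_y = conditional_entropy(joint)
chain_rule = entropy(flat) - entropy(p_y)  # H(X,Y) - H(Y)
print(h_x_given_y, chain_rule)
```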

19.5. Independence as Maximum Conditional Entropy#

Let $m = n$ and $[x_1, \ldots, x_n] = [y_1, \ldots, y_n]$.

Let $\sum_j p(x_i, y_j) = \sum_j p(x_j, y_i)$ for all $i$, so that the marginal distributions of $x$ and $y$ are identical.

Thus, $x$ and $y$ are identically distributed, but they are not necessarily independent.

Consider the following problem: choose a joint distribution $p(x_i, y_j)$ to maximize conditional entropy (19.2) subject to the restriction that $x$ and $y$ are identically distributed.

The conditional-entropy-maximizing $p(x_i, y_j)$ sets

$$
\frac{p(x_i, y_j)}{p(y_j)} = \sum_j p(x_i, y_j) = p(x_i) \quad \forall i .
$$

Thus, among all joint distributions with identical marginal distributions, the conditional-entropy-maximizing joint distribution makes $x$ and $y$ independent.
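The claim can be illustrated in a simple special case. The sketch below assumes two states and holds the common marginal fixed at $(.5, .5)$, so that every admissible joint distribution has the form $[[a, .5-a],[.5-a, a]]$; this parameterization is an assumption made here for illustration. Independence corresponds to $a = .25$.

```python
import math

def conditional_entropy(joint):
    """H(X|Y) from (19.2), where joint[i][j] = p(x_i, y_j)."""
    n, m = len(joint), len(joint[0])
    p_y = [sum(joint[i][j] for i in range(n)) for j in range(m)]
    return sum(
        joint[i][j] * math.log(p_y[j] / joint[i][j])
        for i in range(n) for j in range(m) if joint[i][j] > 0
    )

def joint_with_fair_marginals(a):
    """2x2 joint distribution whose two marginals both equal (.5, .5)."""
    return [[a, 0.5 - a], [0.5 - a, a]]

grid = [i / 100 for i in range(1, 50)]        # a in (0, .5)
values = [conditional_entropy(joint_with_fair_marginals(a)) for a in grid]
a_star = grid[values.index(max(values))]      # maximizer on the grid
print(a_star, max(values), math.log(2))
```

At the maximizer every cell equals the product of its marginals, and conditional entropy equals the unconditional entropy $\log 2$ of a fair coin.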

19.6. Thermodynamics#

Josiah Willard Gibbs (see https://en.wikipedia.org/wiki/Josiah_Willard_Gibbs) defined entropy as

$$
S = - k_B \sum_i p_i \log p_i \tag{19.3}
$$

where $p_i$ is the probability of a micro state and $k_B$ is Boltzmann’s constant.

  • The Boltzmann constant $k_B$ relates energy at the micro particle level with the temperature observed at the macro level. It equals what is called a gas constant divided by an Avogadro constant.

The second law of thermodynamics states that the entropy of a closed physical system increases until S defined in (19.3) attains a maximum.

19.7. Statistical Divergence#

Let $\mathcal{X}$ be a discrete state space $x_1, \ldots, x_n$ and let $p$ and $q$ be two discrete probability distributions on $\mathcal{X}$.

Assume that $\frac{p_i}{q_i} \in (0, \infty)$ for all $i$ for which $p_i > 0$.

Then the Kullback-Leibler statistical divergence, also called relative entropy, is defined as

$$
D(p|q) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right) = \sum_i q_i \left(\frac{p_i}{q_i}\right) \log\left(\frac{p_i}{q_i}\right) . \tag{19.4}
$$

Evidently,

$$
D(p|q) = - \sum_i p_i \log q_i + \sum_i p_i \log p_i = H(p,q) - H(p),
$$

where $H(p,q) = - \sum_i p_i \log q_i$ is the cross-entropy.

It can be verified that $D(p|q) \geq 0$ and that $D(p|q) = 0$ implies that $p_i = q_i$ when $q_i > 0$.
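The decomposition of relative entropy into cross-entropy minus entropy can be checked numerically. The two distributions below are hypothetical and the function names are chosen here for illustration.

```python
import math

def kl_divergence(p, q):
    """D(p|q) = sum_i p_i log(p_i / q_i), summing over states with p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) is the cross-entropy of p with itself."""
    return cross_entropy(p, p)

p = [0.5, 0.3, 0.2]       # hypothetical distributions
q = [0.25, 0.25, 0.5]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, cross_entropy(p, q) - entropy(p))   # the two numbers should agree
print(d_qp)                                     # differs from d_pq in general
print(kl_divergence(p, p))                      # zero when the distributions coincide
```

The example also shows that relative entropy is not symmetric in its arguments, which is why it is called a divergence rather than a distance.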

19.8. Continuous distributions#

For a continuous random variable, Kullback-Leibler divergence between two densities $p$ and $q$ is defined as

$$
D(p|q) = \int p(x) \log\left(\frac{p(x)}{q(x)}\right) d x .
$$

19.9. Relative entropy and Gaussian distributions#

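We want to compute relative entropy for two continuous densities $\phi$ and $\hat \phi$ when $\phi$ is $\mathcal{N}(0, I)$ and $\hat \phi$ is $\mathcal{N}(w, \Sigma)$, where the covariance matrix $\Sigma$ is nonsingular.

We seek a formula for

$$
\textrm{ent} = \int \left( \log \hat \phi(\varepsilon) - \log \phi(\varepsilon) \right) \hat \phi(\varepsilon) d \varepsilon .
$$

Claim

$$
\textrm{ent} = - \frac{1}{2} \log \det \Sigma + \frac{1}{2} w'w + \frac{1}{2} \textrm{trace}(\Sigma - I) . \tag{19.5}
$$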

Proof

The log likelihood ratio is

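$$
\log \hat \phi(\varepsilon) - \log \phi(\varepsilon) = \frac{1}{2} \left[ - (\varepsilon - w)' \Sigma^{-1} (\varepsilon - w) + \varepsilon' \varepsilon - \log \det \Sigma \right] . \tag{19.6}
$$

Observe that, because $(\varepsilon - w)$ has covariance matrix $\Sigma$ under $\hat \phi$,

$$
- \int \frac{1}{2} (\varepsilon - w)' \Sigma^{-1} (\varepsilon - w) \hat \phi(\varepsilon) d \varepsilon = - \frac{1}{2} \textrm{trace}(I) .
$$

Applying the identity $\varepsilon = w + (\varepsilon - w)$ gives

$$
\frac{1}{2} \varepsilon' \varepsilon = \frac{1}{2} w'w + \frac{1}{2} (\varepsilon - w)'(\varepsilon - w) + w'(\varepsilon - w) .
$$

Taking mathematical expectations under $\hat \phi$, so that the cross term $w'(\varepsilon - w)$ has mean zero, gives

$$
\frac{1}{2} \int \varepsilon' \varepsilon \hat \phi(\varepsilon) d \varepsilon = \frac{1}{2} w'w + \frac{1}{2} \textrm{trace}(\Sigma) .
$$

Combining terms gives

$$
\textrm{ent} = \int (\log \hat \phi - \log \phi) \hat \phi d \varepsilon = - \frac{1}{2} \log \det \Sigma + \frac{1}{2} w'w + \frac{1}{2} \textrm{trace}(\Sigma - I) , \tag{19.7}
$$

which agrees with equation (19.5). Notice the separate appearances of the mean distortion $w$ and the covariance distortion $\Sigma - I$ in equation (19.7).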
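As a sanity check on (19.5), the sketch below specializes to the scalar case, where $\Sigma$ is a variance and $I = 1$, and compares the closed form with a direct midpoint-rule approximation of the defining integral. The parameter values, integration range, and step count are arbitrary choices made for illustration.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a scalar normal with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def ent_closed_form(w, var):
    """Scalar version of (19.5) with Sigma = var and I = 1."""
    return -0.5 * math.log(var) + 0.5 * w ** 2 + 0.5 * (var - 1.0)

def ent_numerical(w, var, half_width=20.0, steps=40_000):
    """Midpoint rule for the integral of (log phi_hat - log phi) * phi_hat."""
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        eps = -half_width + (k + 0.5) * h
        log_hat = normal_logpdf(eps, w, var)             # distorted density N(w, var)
        log_base = normal_logpdf(eps, 0.0, 1.0)          # baseline density N(0, 1)
        total += (log_hat - log_base) * math.exp(log_hat)
    return total * h

w, var = 0.5, 2.0            # hypothetical mean and variance distortions
closed = ent_closed_form(w, var)
numeric = ent_numerical(w, var)
print(closed, numeric)
```

Setting `w = 0` and `var = 1` removes both distortions and drives the closed form to zero.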

Extension

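Let $N_0 = \mathcal{N}(\mu_0, \Sigma_0)$ and $N_1 = \mathcal{N}(\mu_1, \Sigma_1)$ be two multivariate Gaussian distributions.

Then

$$
D(N_0|N_1) = \frac{1}{2} \left( \textrm{trace}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)' \Sigma_1^{-1} (\mu_1 - \mu_0) - \log\left( \frac{\det \Sigma_0}{\det \Sigma_1} \right) - k \right) , \tag{19.8}
$$

where $k$ is the dimension of the random vectors.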

19.10. Von Neumann Entropy#

Let P and Q be two positive-definite symmetric matrices.

A measure of the divergence between $P$ and $Q$ is

$$
D(P|Q) = \textrm{trace}(P \ln P - P \ln Q - P + Q)
$$

where the log of a matrix is defined here (https://en.wikipedia.org/wiki/Logarithm_of_a_matrix).

A density matrix $P$ from quantum mechanics is a positive definite matrix with trace $1$.

The von Neumann entropy of a density matrix $P$ is

$$
S = - \textrm{trace}(P \ln P)
$$
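For a symmetric density matrix, $\textrm{trace}(P \ln P)$ equals $\sum_i \lambda_i \ln \lambda_i$, where the $\lambda_i$ are the eigenvalues of $P$. The sketch below uses that fact and, to stay within the Python standard library, restricts attention to real symmetric $2 \times 2$ matrices, whose eigenvalues have a closed form. Both example matrices are hypothetical.

```python
import math

def eigenvalues_sym_2x2(P):
    """Eigenvalues of a real symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = P
    center = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b ** 2)
    return center - radius, center + radius

def von_neumann_entropy(P):
    """S = -trace(P ln P), computed from the eigenvalues of P."""
    return -sum(lam * math.log(lam) for lam in eigenvalues_sym_2x2(P) if lam > 0)

maximally_mixed = [[0.5, 0.0], [0.0, 0.5]]
correlated = [[0.5, 0.4], [0.4, 0.5]]        # eigenvalues .1 and .9, trace 1

print(von_neumann_entropy(maximally_mixed), math.log(2))
print(von_neumann_entropy(correlated))
```

The eigenvalues of a density matrix form a probability vector, so von Neumann entropy is Shannon entropy applied to that vector.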

19.11. Backus-Chernov-Zin Entropy#

After flipping signs, [Backus et al., 2014] use Kullback-Leibler relative entropy as a measure of volatility of stochastic discount factors that they assert is useful for characterizing features of both the data and various theoretical models of stochastic discount factors.

Where $p_{t+1}$ is the physical or true measure, $p_{t+1}^*$ is the risk-neutral measure, and $E_t$ denotes conditional expectation under the $p_{t+1}$ measure, [Backus et al., 2014] define entropy as

$$
L_t(p_{t+1}^* / p_{t+1}) = - E_t \log(p_{t+1}^* / p_{t+1}) . \tag{19.9}
$$

Evidently, by virtue of the minus sign in equation (19.9),

$$
L_t(p_{t+1}^* / p_{t+1}) = D_{KL,t}(p_{t+1} | p_{t+1}^*) , \tag{19.10}
$$

where $D_{KL,t}$ denotes conditional relative entropy.

Let $m_{t+1}$ be a stochastic discount factor, $r_{t+1}$ a gross one-period return on a risky security, and $(r_{t+1}^1)^{-1} \equiv q_t^1 = E_t m_{t+1}$ be the reciprocal of a risk-free one-period gross rate of return. Then

$$
E_t(m_{t+1} r_{t+1}) = 1
$$

[Backus et al., 2014] note that a stochastic discount factor satisfies

$$
m_{t+1} = q_t^1 p_{t+1}^* / p_{t+1} .
$$

They derive the following entropy bound

$$
E L_t(m_{t+1}) \geq E(\log r_{t+1} - \log r_{t+1}^1)
$$

which they propose as a complement to a Hansen-Jagannathan [Hansen and Jagannathan, 1991] bound.
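The bound can be illustrated in a static two-state economy. Everything numerical below is hypothetical. The sketch also assumes that the entropy of a stochastic discount factor is $\log E m - E \log m$, which the text above does not spell out; under $m = q^1 p^* / p$ this coincides with (19.9).

```python
import math

p = [0.5, 0.5]          # physical probabilities (hypothetical)
p_star = [0.6, 0.4]     # risk-neutral probabilities (hypothetical)
q1 = 0.95               # price of a risk-free claim, so r1 = 1 / q1 (hypothetical)

def expect(values):
    """Expectation under the physical measure p."""
    return sum(pi * v for pi, v in zip(p, values))

# Stochastic discount factor m = q1 * p_star / p, state by state.
m = [q1 * ps / pi for ps, pi in zip(p_star, p)]

# Entropy as in (19.9): minus the expected log likelihood ratio.
entropy_ratio = -expect([math.log(ps / pi) for ps, pi in zip(p_star, p)])

# Assumed definition of the entropy of m: log E m - E log m.
entropy_sdf = math.log(expect(m)) - expect([math.log(mi) for mi in m])

def excess_log_return(payoff):
    """E log r - log r1 for the return on a positive payoff priced by m."""
    price = expect([mi * x for mi, x in zip(m, payoff)])
    r = [x / price for x in payoff]                 # satisfies E(m r) = 1
    return expect([math.log(ri) for ri in r]) + math.log(q1)

risky = excess_log_return([1.0, 2.0])                    # an arbitrary risky payoff
growth_optimal = excess_log_return([1.0 / mi for mi in m])  # payoff 1 / m
print(entropy_ratio, entropy_sdf, risky, growth_optimal)
```

In this example the arbitrary risky payoff earns an expected excess log return below the entropy of $m$, while the payoff $1/m$ attains the bound.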

19.12. Wiener-Kolmogorov Prediction Error Formula as Entropy#

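Let $\{x_t\}_{t=-\infty}^\infty$ be a covariance stationary stochastic process with mean zero and spectral density $S_x(\omega)$.

The variance of $x$ is

$$
\sigma_x^2 = \left( \frac{1}{2\pi} \right) \int_{-\pi}^\pi S_x(\omega) d \omega .
$$

As described in chapter XIV of [Sargent, 1987], the Wiener-Kolmogorov formula for the one-period-ahead prediction error variance is

$$
\sigma_\epsilon^2 = \exp\left[ \left( \frac{1}{2\pi} \right) \int_{-\pi}^\pi \log S_x(\omega) d \omega \right] . \tag{19.11}
$$

Occasionally the logarithm of the one-step-ahead prediction error variance $\sigma_\epsilon^2$ is called entropy because it measures unpredictability.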

Consider the following problem reminiscent of one described earlier.

Problem:

Among all covariance stationary univariate processes with unconditional variance $\sigma_x^2$, find a process with maximal one-step-ahead prediction error.

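Under the normalization $\sigma_x^2 = \left( \frac{1}{2\pi} \right) \int_{-\pi}^\pi S_x(\omega) d \omega$ used above, the maximizer is a process with the flat spectral density

$$
S_x(\omega) = \sigma_x^2 .
$$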

Thus, among all univariate covariance stationary processes with variance $\sigma_x^2$, a process with a flat spectral density is the most uncertain, in the sense of one-step-ahead prediction error variance.

This no-patterns-across-time outcome for a temporally dependent process resembles the no-pattern-across-states outcome for the static entropy maximizing coin or die in the classic information theoretic analysis described above.
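Formula (19.11) can be tried out on a first-order autoregression $x_t = \rho x_{t-1} + \sigma \epsilon_t$, an example chosen here rather than taken from the lecture. Its spectral density is $\sigma^2 / (1 - 2 \rho \cos \omega + \rho^2)$, so the formula should return $\sigma^2$ as the one-step-ahead prediction error variance. The sketch then compares this with a flat-spectrum process that has the same variance.

```python
import math

def circle_average(f, steps=2_000):
    """Midpoint-rule approximation of (1 / 2 pi) * integral of f over [-pi, pi]."""
    h = 2 * math.pi / steps
    return sum(f(-math.pi + (k + 0.5) * h) for k in range(steps)) / steps

rho, sigma2 = 0.8, 1.0       # hypothetical AR(1) parameters

def ar1_spectrum(omega):
    """Spectral density of x_t = rho x_{t-1} + sigma eps_t."""
    return sigma2 / (1.0 - 2.0 * rho * math.cos(omega) + rho ** 2)

var_x = circle_average(ar1_spectrum)                                  # variance of x
pred_var_ar1 = math.exp(circle_average(lambda w: math.log(ar1_spectrum(w))))

def flat_spectrum(omega):
    """Constant spectral density with the same variance as the AR(1)."""
    return var_x

pred_var_flat = math.exp(circle_average(lambda w: math.log(flat_spectrum(w))))
print(var_x, pred_var_ar1, pred_var_flat)
```

With the same unconditional variance, the flat-spectrum process has a prediction error variance equal to its full variance, while the autoregression is partly predictable from its past.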

19.13. Multivariate Processes#

Let $y_t$ be an $n \times 1$ covariance stationary stochastic process with mean $0$ with matrix covariogram $C_y(j) = E y_t y_{t-j}'$ and spectral density matrix

$$
S_y(\omega) = \sum_{j=-\infty}^\infty e^{- i \omega j} C_y(j), \quad \omega \in [-\pi, \pi] .
$$

Let

$$
y_t = D(L) \epsilon_t \equiv \sum_{j=0}^\infty D_j \epsilon_{t-j}
$$

be a Wold representation for $y$, where $D(0) \epsilon_t$ is a vector of one-step-ahead errors in predicting $y_t$ conditional on the infinite history $y^{t-1} = [y_{t-1}, y_{t-2}, \ldots]$ and $\epsilon_t$ is an $n \times 1$ vector of serially uncorrelated random disturbances with mean zero and identity contemporaneous covariance matrix $E \epsilon_t \epsilon_t' = I$.

Linear-least-squares predictors have one-step-ahead prediction error covariance matrix $D(0) D(0)'$ that satisfies

$$
\log \det [D(0) D(0)'] = \left( \frac{1}{2\pi} \right) \int_{-\pi}^\pi \log \det [S_y(\omega)] d \omega . \tag{19.12}
$$

Being a measure of the unpredictability of an $n \times 1$ vector covariance stationary stochastic process, the left side of (19.12) is sometimes called entropy.

19.14. Frequency Domain Robust Control#

Chapter 8 of [Hansen and Sargent, 2008] adapts work in the control theory literature to define a frequency domain entropy criterion for robust control as

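$$
\int_\Gamma \log \det [\theta I - G_F(\zeta)' G_F(\zeta)] d \lambda(\zeta) , \tag{19.13}
$$

where $\theta \in (\underline \theta, +\infty)$ is a positive robustness parameter and $G_F(\zeta)$ is a $\zeta$-transform of the objective function.

Hansen and Sargent [Hansen and Sargent, 2008] show that criterion (19.13) can be represented as

$$
\log \det [D(0) D(0)'] = \int_\Gamma \log \det [\theta I - G_F(\zeta)' G_F(\zeta)] d \lambda(\zeta) , \tag{19.14}
$$

for an appropriate covariance stationary stochastic process derived from $\theta, G_F(\zeta)$.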

This explains the moniker maximum entropy robust control for decision rules F designed to maximize criterion (19.13).

19.15. Relative Entropy for a Continuous Random Variable#

Let $x$ be a continuous random variable with density $\phi(x)$, and let $g(x)$ be a nonnegative random variable satisfying $\int g(x) \phi(x) d x = 1$.

The relative entropy of the distorted density $\hat \phi(x) = g(x) \phi(x)$ is defined as

$$
\textrm{ent}(g) = \int g(x) \log g(x) \phi(x) d x .
$$

Fig. 19.2 plots the functions $g \log g$ and $g - 1$ over the interval $g \geq 0$.

That relative entropy $\textrm{ent}(g) \geq 0$ can be established by noting (a) that $g \log g \geq g - 1$ (see Fig. 19.2) and (b) that under $\phi$, $E g = 1$.

Fig. 19.3 and Fig. 19.4 display aspects of relative entropy visually for a continuous random variable $x$ for two densities with likelihood ratio $g \geq 0$.

Where the numerator density is $\mathcal{N}(0,1)$, for two denominator Gaussian densities $\mathcal{N}(0,1.5)$ and $\mathcal{N}(0,.95)$, respectively, Fig. 19.3 and Fig. 19.4 display the functions $g \log g$ and $g - 1$ as functions of $x$.

_images/entropy_glogg.png

Fig. 19.2 The function $g \log g$ for $g \geq 0$. For a random variable $g$ with $E g = 1$, $E g \log g \geq 0$.#

_images/entropy_1_over_15.jpg

Fig. 19.3 Graphs of $g \log g$ and $g - 1$ where $g$ is the ratio of the density of a $\mathcal{N}(0,1)$ random variable to the density of a $\mathcal{N}(0,1.5)$ random variable. Under the $\mathcal{N}(0,1.5)$ density, $E g = 1$.#

_images/entropy_1_over_95.png

Fig. 19.4 Graphs of $g \log g$ and $g - 1$ where $g$ is the ratio of the density of a $\mathcal{N}(0,1)$ random variable to the density of a $\mathcal{N}(0,.95)$ random variable. Under the $\mathcal{N}(0,.95)$ density, $E g = 1$.#
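The argument for $\textrm{ent}(g) \geq 0$ can be reproduced numerically for the likelihood ratio of a $\mathcal{N}(0,1)$ density to a $\mathcal{N}(0,1.5)$ density. The sketch assumes that the second argument of $\mathcal{N}(\cdot,\cdot)$ is a variance, and the integration range and step count are arbitrary choices.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var); the second argument is treated as a variance."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

VAR_NUM, VAR_DEN = 1.0, 1.5

def g(x):
    """Likelihood ratio of the N(0, 1) density to the N(0, 1.5) density."""
    return normal_pdf(x, VAR_NUM) / normal_pdf(x, VAR_DEN)

def expect_under_denominator(f, half_width=25.0, steps=50_000):
    """Midpoint-rule approximation of the integral of f(x) * phi(x) dx."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += f(x) * normal_pdf(x, VAR_DEN)
    return total * h

mean_g = expect_under_denominator(g)                                 # should be 1
ent_g = expect_under_denominator(lambda x: g(x) * math.log(g(x)))    # relative entropy

# Pointwise check of g log g >= g - 1 on a grid of positive values of g.
pointwise = all(
    gv * math.log(gv) >= gv - 1.0 - 1e-12
    for gv in (0.01 * k for k in range(1, 501))
)
print(mean_g, ent_g, pointwise)
```

Integrating the pointwise inequality against $\phi$ and using $E g = 1$ gives $\textrm{ent}(g) \geq E g - 1 = 0$, which is what the computed value reflects.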