The Central Limit Theorem: how to measure information loss.


Hi All

This might be the last post for a while, as I have begun thinking about writing a proposal for a text book, and such a committed undertaking will occupy (and has already occupied) most of my writing time. This post only exists because I have been trying to clarify my understanding of some basic statistical concepts.

The Central Limit Theorem (CLT) states that the average (or sum) of a collection of random variables drawn from any distribution with defined first and second moments (this excludes, for example, the Cauchy distribution) will be normally distributed. Formally, let x_1, x_2, ..., x_n be an independent, identically distributed set of n random variables with PDF f_x(x; \mu, \sigma^2), where \mu and \sigma^2 are the mean and variance of f_x, respectively; then \bar{x} \sim \text{Norm}(\mu, \sigma^2 / n) as n \rightarrow \infty.
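As a quick empirical illustration (a sketch of my own, not from any derivation below; the Exp(1) example and the sample sizes are arbitrary choices), the skewness of the sample mean shrinks toward the Gaussian value of zero as n grows:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Skewness of a single Exp(1) draw is 2; for the mean of n iid draws the
# exact value is 2 / sqrt(n), shrinking toward the Gaussian's 0.
skews = {n: skew(rng.exponential(scale=1.0, size=(200_000, n)).mean(axis=1))
         for n in (4, 16, 64)}
print(skews)
```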

In practice the CLT is the justification for the use of the z- and t-tests. Basically, I do not have to care what the underlying distribution of some process is, because the average of its observed random variables will be Gaussian. While I use t-tests (or more commonly evidence tests) for split (or A/B) testing, I have never been truly comfortable with the CLT. In particular, I don't doubt the various proofs that it works; rather, I have never seen a non-experimental process for showing how fast it becomes true. For example, the sum of 12 uniform random variables makes a near-perfect Gaussian, while the sum of 12 log-normal random variables does not. This post shows a process for evaluating the convergence (information loss) of the CLT. The process is demonstrated with a set of exponentially distributed variables.
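The uniform-versus-log-normal point can be checked directly (a sketch; the log-normal parameters are my own choice). Skewness is a convenient one-number probe of non-Gaussianity:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
u = rng.uniform(size=(200_000, 12)).sum(axis=1)
ln12 = rng.lognormal(mean=0.0, sigma=1.0, size=(200_000, 12)).sum(axis=1)

# A Gaussian has zero skewness: twelve uniforms get there almost exactly,
# twelve log-normals remain visibly skewed.
su, sl = skew(u), skew(ln12)
print(su, sl)
```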

The process has two steps. The first step is to derive the distribution, f_t(t), of the average of the observed random variables. The second step is to calculate the Kullback–Leibler divergence between f_t(t) and a Gaussian as a function of n. Note that here the math is tractable and so an analytic result is derived; however, both integrals could be evaluated numerically.

Step 1) Finding f_t(t): Let x_0, x_1, ..., x_n be an independent, identically distributed set of n+1 exponentially distributed random variables (f_x(x_i) = \lambda \exp[- \lambda x_i ]). Let z_k = x_0 + x_1 + ... + x_k. The distribution of z_k is f_{z_k}(z_k) = \int_0^{z_k} dx_k f_{z_{k-1}}(z_k - x_k) f_x(x_k); i.e., each random variable is integrated in turn into the sum.

To find f_{z_k}(z_k) it is, as usual, easiest to start by knowing the answer and work backwards; I got to the answer by a ponderous and indirect path, so I will spare you the details, but it involved failing to sleep on an uncomfortable couch. Let f_{z_k}(z_k) = \lambda^{k+1} (z_k)^{k} \exp(- \lambda z_k) / \Gamma(k+1); thus f_{z_{k+1}}(z_{k+1}) = \int_0^{z_{k+1}} dx_{k+1} [ \lambda^{k+1} (z_{k+1} - x_{k+1})^{k} \exp(- \lambda [z_{k+1} - x_{k+1}]) / \Gamma(k+1) ] \times \lambda \exp(- \lambda x_{k+1} ). A small amount of algebra leads to f_{z_{k+1}}(z_{k+1}) = [\lambda^{k+2} \exp(- \lambda z_{k+1}) / \Gamma(k+1)] \int_0^{z_{k+1}} dx_{k+1} (z_{k+1} - x_{k+1})^{k}.

The remaining integral is trivial (i.e., Wolfram Alpha can solve it for me... ok, in this case it is just a straight substitution of variables, but normally that is what "trivial integral" means). Thus f_{z_{k+1}}(z_{k+1}) = [\lambda^{k+2} \exp(- \lambda z_{k+1}) / \Gamma(k+1)] [- (z_{k+1}-x_{k+1})^{k+1} / (k+1) ]_0^{z_{k+1}} = [\lambda^{k+2} \exp(- \lambda z_{k+1}) / \Gamma(k+1)][z_{k+1}^{k+1} / (k+1)]. And one more "thus" leads to f_{z_{k+1}}(z_{k+1}) = \lambda^{k+2} (z_{k+1})^{k+1} \exp(- \lambda z_{k+1}) / \Gamma(k+2). Since f_x(x_0) = \lambda \exp(- \lambda x_0 ) is also of the form of f_{z_k}(z_k) (the k = 0 case), the principle of induction shows the form holds for all k, and all that I have to do is turn the "k"s into "n"s. I debated leaving that as an exercise for the reader, but here it is: f_{z_n}(z_n) = \lambda^{n+1} (z_n)^{n} \exp(- \lambda z_n) / \Gamma(n+1).

Next, since z_n is a sum of n+1 variables, not an average, it must be transformed to t = z_n / (n+1) = g(z_n). Recall that for monotonic functions changes of variables are conducted using f_y(y) = |(d/dy) g^{-1}(y)| f_x(g^{-1}(y)). So f_t(t) = \lambda^{n+1} (n+1)^{n+1} t^{n} \exp(- \lambda [n+1] t) / \Gamma(n+1). For the sake of notation let \beta = \lambda (n+1); then f_t(t) = \beta^{n+1} t^{n} \exp(- \beta t) / \Gamma(n+1).

That is the distribution of the average of n+1 exponentially distributed random variables. I was slightly surprised to note that it is a Gamma distribution, as I had not previously realized that a Gamma (with integer shape parameter) is a sum of exponential distributions!
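A quick simulation backs this up (a sketch; the rate \lambda, the number of variables m, and the sample size are arbitrary choices). The average of m iid Exp(\lambda) draws should follow a Gamma distribution with shape m and rate m\lambda:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lam, m = 2.0, 6                        # average m = 6 exponential variables
t = rng.exponential(scale=1 / lam, size=(100_000, m)).mean(axis=1)

# Predicted Gamma(shape=m, rate=m*lam); scipy uses scale = 1/rate.
# Its mean is 1/lam and its variance is 1/(lam^2 * m).
g = stats.gamma(a=m, scale=1 / (m * lam))
print(t.mean(), g.mean())
print(t.var(), g.var())
```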

Step 2) Calculate the KL divergence between f_t and a Gaussian f_g(t) with the same mean and variance:

The mean and variance of f_t are (n+1)/\beta and (n+1)/\beta^2, respectively. The KL divergence between two distributions f_1(x) and f_2(x) is \int_{\text{range}} dx f_1(x) [\log(f_1[x]) - \log(f_2[x])]; i.e., the KL divergence is the expected log-ratio of the two distributions with respect to the first. It can be thought of as a measure of the difference between the two distributions, weighted by the relative importance of the first. It has a few nice properties. First, it is always positive unless the two distributions are the same, at which point it is zero. It works for both continuous and discrete distributions. It is commonly used to measure the information gained in moving from one distribution to another (normally the gain going from the prior to the posterior). Thus here it can be thought of as the information lost by using the CLT Gaussian approximation. The KL divergence in this case is \int_0^{\infty} dt f_t(t; \beta) [\log(f_t[t; \beta]) - \log(f_g[t; (n+1)/\beta, (n+1)/\beta^2])].
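This integral can also be evaluated by brute-force quadrature before doing any algebra (a sketch; the shape value is an arbitrary choice). Here f_1 is the Gamma density of the average and f_2 its moment-matched Gaussian:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

m = 6                                     # Gamma shape (n + 1 in the notation above)
f1 = stats.gamma(a=m)                     # rate 1; KL is unchanged by rescaling both
f2 = stats.norm(loc=m, scale=np.sqrt(m))  # Gaussian with the same mean and variance

kl, _ = quad(lambda t: f1.pdf(t) * (f1.logpdf(t) - f2.logpdf(t)), 0, np.inf)
print(kl)                                 # about 0.058 for shape 6
```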

This integral is large but straightforward. The first step is to group all of the constant terms in one place; let \alpha = n \log(\beta) + 0.5 \log(n+1) - \log( \Gamma[n+1] ) + 0.5 \log(2 \pi). The integral can then be rewritten in the more manageable form \int_0^{\infty} dt [\beta^{n+1} t^{n} \exp(- \beta t) / \Gamma(n+1)] \times [\alpha + n \log(t) - \beta t + 0.5 (t-[n+1]/\beta)^2/([n+1]/\beta^2) ], which can be parsed as \alpha + n E[\log(t)] - \beta E[t] + \beta^2/(2[n+1]) \text{Var}[t], where the expectations and variances are with respect to f_t(t). Using the many known properties of the Gamma distribution, all of these terms can be evaluated: E[\log(t)] = \Psi(n+1) - \log(\beta), E[t] = (n+1)/\beta, and \text{Var}[t] = (n+1)/\beta^2. Thus the KL divergence is \Delta(n) = 0.5 \log(2 \pi) + 0.5 + 0.5 \log (n+1) - \log( \Gamma[n+1]) + n \Psi (n+1) - (n+1), where \Psi(x) is the digamma function (closely related to the harmonic series, i.e., \sum 1/k).
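The closed form is easy to evaluate with scipy's gammaln and digamma (a sketch; the particular n values printed are arbitrary choices):

```python
import numpy as np
from scipy.special import gammaln, digamma

def delta(n):
    """Closed-form KL divergence Delta(n) between the exact Gamma density
    of the average and its moment-matched Gaussian."""
    return (0.5 * np.log(2 * np.pi) + 0.5 + 0.5 * np.log(n + 1)
            - gammaln(n + 1) + n * digamma(n + 1) - (n + 1))

vals = {n: delta(n) for n in (1, 5, 20, 100)}
print(vals)   # positive and shrinking toward zero as n grows
```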

The final function \Delta(n) is messy, but it gives an objective measure of the difference between the true distribution of the average of observed exponential random variables and the Gaussian distribution assumed by the CLT. It is therefore possible to use this method to check, directly, whether the n for a given problem is large enough for the CLT to be valid.

Also, as noted in the introduction, while I chose to do this algebraically (because it is raining), both sets of integrals could be done numerically.
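For instance, the whole calculation can be done with no algebra at all (a sketch; the grid spacing, truncation point, and number of variables are my own choices): build the density of the sum by repeated numerical convolution, rescale to the average, and evaluate the KL integral as a sum on the grid.

```python
import numpy as np

lam, m, dx = 1.0, 6, 0.01              # average of m = 6 Exp(1) variables
x = np.arange(0.0, 40.0, dx)
f = lam * np.exp(-lam * x)             # exponential pdf sampled on the grid

f_sum = f.copy()
for _ in range(m - 1):                 # one numerical convolution per extra variable
    f_sum = np.convolve(f_sum, f)[: len(x)] * dx
f_sum /= f_sum.sum() * dx              # renormalise away discretisation error

# change of variables t = z / m: f_t(t) = m * f_sum(m t), grid spacing dx / m
t, f_t, w = x / m, m * f_sum, dx / m
mu = np.sum(t * f_t) * w
var = np.sum((t - mu) ** 2 * f_t) * w
f_g = np.exp(-0.5 * (t - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

good = f_t > 0                         # guard the log against underflowed bins
kl = np.sum(f_t[good] * (np.log(f_t[good]) - np.log(f_g[good]))) * w
print(kl)
```

Up to discretisation error, this reproduces the analytic divergence without ever knowing the sum is a Gamma; for a distribution where the convolution integral is not tractable, this is the whole method.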

That is it for now. Cheers.

P.S. Stay tuned for more text book developments!