# Computationally Efficient Estimation and a Frustrating Distribution that is Fat in the Middle

Hi All

This week we will look at how to efficiently make estimators for likelihood functions with lots of data. This will be a maths heavy post; but not all maths is awful. This will be used in the fabled neural net post. With those disclaimers aside to motivate why efficient estimation is a thing consider Bayesian statistical methods (i.e., sampling from the posterior). The only real problems with Bayesian statistical methods are: 1, freqentists (i.e., those statisticians with poor personal hygiene and low moral character) dislike them and 2, they are computationally intensive. In many cases the likelihood function needs to be evaluated many many times. So decreasing the time it takes to do so can radically improve the speed of sampling. This week, basically, we will look at how to evaluate the likelihoods when there are no sufficient statistics.To illustrate these techniques the likelihoods of the Weibull distribution used in Split testing with RJMCMC and a fat middled distribution (proposed by James Kay as a frankly superior alternative to my Close-enough distribution) are evaluated.

Starting with the Weibull. Its distribution is $f(x; k, \lambda ) = (k/ \lambda) (x/ \lambda)^{(k-1)}$exp$\{-(x/ \lambda)^k \}$ and its log distribution is $\log(k) - \log(\lambda) + (k-1) \log(x) -(k-1) \log( \lambda) - (x/ \lambda)^k$. Thus the log-likelihood is $\log[ L( k, \lambda | {\bf x} ) ]= \log [ \Pi_{i=1}^{n} f(x_i; k, \lambda ) ] = \sum_{i=1}^{n} \log[f(x_i; k, \lambda )]$. This can be evaluated as $\log[ L( k, \lambda | {\bf x} ) ] = n \log(k) - n \log(\lambda) - n (k-1) \log( \lambda)+$ $(k-1) \sum_{i=1}^{n} \log(x_i) - \lambda^{-k} \sum_{i=1}^{n} x_i^{k}$. The point is that only the $\sum_{i=1}^{n} \log(x_i)$ and $\sum_{i=1}^{n} x_i^{k}$ terms depend on the data. So the rest of the terms in the likelihood takes the same time to compute no mater how many data points are available. The $\sum_{i=1}^{n} \log(x_i)$ does not depend on any parameters; it can be pre-computed before any MCMC sampling. However the $\sum_{i=1}^{n} x_i^{k}$ does depend on both the data, ${\bf x}$, and the parameter $k$, so it must be computed every time $k$ is updated…

Or does it?

Alternatively $\sum_{i=1}^{n} x_i^{k}$ can be pre-computed for a range of value and an interpolater used to find arbitrarily values while sampling. Code for this is below. Figure 1 show observed (blue line) and interpolated (black cross) values for $\sum_{i=1}^{n} x_i^{k}$.

Figure 1: Observed (black cross) and Interpolated (blue line) values for pre-computed x^k.

There is more after this code!

g.intop <-function(x = rweibull(10000, shape = 10, scale = 1.5), k = seq(0, 20, length.out = 101))
{
n.k = length(k)
x.k = 1:n.k
for ( i in 1:n.k)
{
print(i)
x.k[i] = sum(x^k[i])
}
plot(k, x.k, xlab = "k", ylab = "sum(x^k)", type = "l", lwd=3, col="blue")
delta = k[2]-k[1]
poly.pars = data.frame(k= k, k.next = k+delta, A=1:n.k, B=1:n.k, C=1:n.k, D=1:n.k  )
for (i in 1:n.k)
{
if ( (i >= 2) & (i <= (n.k -2))  )
{
#xA = c(k[i-1]^0, k[i]^0, k[i+1]^0, k[i+2]^0 )
xB = c(k[i-1]^1, k[i]^1, k[i+1]^1, k[i+2]^1 )
xC = c(k[i-1]^2, k[i]^2, k[i+1]^2, k[i+2]^2 )
xD = c(k[i-1]^3, k[i]^3, k[i+1]^3, k[i+2]^3 )
y  = c(x.k[i-1], x.k[i], x.k[i+1], x.k[i+2] )
}
if ( (i == 1)  )
{
#xA = c(k[i]^0, k[i+1]^0, k[i+2]^0, k[i+3]^0 )
xB = c(k[i]^1, k[i+1]^1, k[i+2]^1, k[i+3]^1 )
xC = c(k[i]^2, k[i+1]^2, k[i+2]^2, k[i+3]^2 )
xD = c(k[i]^3, k[i+1]^3, k[i+2]^3, k[i+3]^3 )
y  = c(x.k[i], x.k[i+1], x.k[i+2], x.k[i+3] )
}

if ( (i > (n.k -2))  )
{
#xA = c(k[n.k-3]^0, k[n.k-2]^0, k[n.k-1]^0, k[n.k]^0 )
xB = c(k[n.k-3]^1, k[n.k-2]^1, k[n.k-1]^1, k[n.k]^1 )
xC = c(k[n.k-3]^2, k[n.k-2]^2, k[n.k-1]^2, k[n.k]^2 )
xD = c(k[n.k-3]^3, k[n.k-2]^3, k[n.k-1]^3, k[n.k]^3 )
y  = c(x.k[n.k-3], x.k[n.k-2], x.k[n.k-1], x.k[n.k] )
}

mod = lm(y~ xB+xC+xD)
#print(mod$coeff[1:4]) poly.pars[i,3:6] = mod$coeff[1:4]
}

k.random = 20*runif(100)
x.k.int = 1:100

for (i in 1:100)
{
pars = poly.pars[(poly.pars$k<= k.random[i])&(poly.pars$k.next > k.random[i]), ]
x.k.int[i] = pars[1,3] + pars[1,4]*k.random[i] + pars[1,5]*k.random[i]^2 +pars[1,6]*k.random[i]^3
}

points(k.random, x.k.int, pch =3)

print(poly.pars)

}
g.intop()

The pre-computing idea only works if the likelihood function is tractable. Consider James’ fat-middled distribution (FMD). It has pdf $f(x; \mu, \sigma) = 0.5 g(x; \mu, \sigma^2) + 0.5 g(x, \mu + \sigma, \sigma^2 )$ where $g$ is a Gaussian/ Normal distribution (i.e., $g(x; \nu, \tau^2) = (2 \pi \tau^2)^{-0.5} \exp \{-0.5 (x-\nu)^2 / \tau^2 \}$). There is no way to simplify the likelihood; each additional data point doubles the number of terms that need to be computed. The rest of this post explains a way of avoiding the likelihood; this is my own idea (probably already written up in many soviet papers of the 1970’s but I am not doing Lit. reviews any more.) so it might be just wrong? However it seems very sensible; anyway if you don’t like it don’t use it!

The basic idea is to conduct MCMC sampling on the Gaussian around the observed moments. For the FMD the expectations of the first two moments [$\textrm{E} (\sum_{i=1}^{n} x_i )/n$ and $\textrm{E} (\sum_{i=1}^{n} x_i^2 )/n$] are relatively easy to calculate as they are just the average of the moments of the two Gaussians [$g(x; \mu, \sigma^2)$ and $g(x, \mu + \sigma, \sigma^2 )$]. They are $\mu + 0.5 \sigma$ and $\mu^2-1.5 \sigma^2 - \mu \sigma$. The variances of the moments [$\textrm{E}( \textrm{E} (\sum_{i=1}^{n} x_i )/n -(\sum_{i=1}^{n} x_i )/n))^2$ and $\textrm{E}( \textrm{E} (\sum_{i=1}^{n} x_i ^2)/n -(\sum_{i=1}^{n} x_i^2 )/n))^2$] are a bit harder but can be found to be $(5/4) \sigma^2$ and $(1/4) \sigma^2 (20 \mu^2 + 20 \mu \sigma + 17 \sigma^2)$ respectively. Thus using the central limit theorem the probabilities of the observed first two moments are $P_1 = g( \sum_{i=1}^{n} x_i /n; \mu + 0.5 \sigma, (5/4n) \sigma^2)$ and $P_2 = g(\sum_{i=1}^{n} x_i ^2/n; \mu^2-1.5 \sigma^2 - \mu \sigma, (1/4n) \sigma^2 (20 \mu^2 + 20 \mu \sigma + 17 \sigma^2))$. The two probabilities are used to approximate the likelihood function; $L(\mu, \sigma) \approx P_1 P_2 = \hat{L}(\mu, \sigma)$. Additional moments  can be used but it was difficult, for me anyway, to find the variance of the higher order moments. Standard MCMC sampling is now conducted with $\hat{L}(\mu, \sigma)$ replacing $L(\mu, \sigma)$.

Tune in next time for more Bayesian Business intelligence!