Over-Dispersion: How sure are you?

“Precision (certainty) is important in making decisions” is a statement that I have little uncertainty that even the most disagreeable of people will accept. But how often do you check that your estimated uncertainty is consistent with your observed uncertainty?

A common barrier is that for GLMs it may not be clear what the model predicted variance should be. Which is why as always your selfless author has taken it upon him self to explain (i.e., I talk too much about things that no one else cares about for the gratification of my ego).

For example Fig. 1 shows histograms of two sets of quiz scores. The means of both sets are indistinguishable from their true mean of 5, however the variances are 21.30 and 2.72 respectively. The variance of a binomial is n p (1-p); for n = 10 and p= 0.5 the variance should be 2.5, thus the first has a much higher observed variance then theoretical variance.

Screen Shot 2015-07-18 at 11.22.02 PM

Histograms of Quiz scores for two classes of 100 students. The means are 4.97 and 5.01; the standard deviations are 4.62 and 1.65 for the two classes respectively.

The ratio of the theoretical variance to the observed variance is called dispersion (\phi ). If \phi > 1 then the model is over-dispersed; if  \phi < 1 then the model is under-dispersed.  This post will consider only over-dispersion. Over-dispersion is evidence that there are missing explanatory variables.

In the quiz example there are clearly two groups in class 1 but they are modeled as only one group. Thus the best way to model over-dispersed data is to find the missing underling predictor however in general this might not be possible (or desirable if a simple uni-variate relationship is sought). The non-Bayesian way (i.e., inferior way) to handle dispersion is to use a quasi-likelihood GLM (Prof. Altman class is where I learned about this in my under grad and I still refer to her excellent notes; If you are reading this Dr Altman please convert your notes into a text book). To summarize this method, the log likelihood is replaced with a score function (U = \sigma^{-2}V(m)^{-2} \sum_{i=1}^{n} (Y_i - m) ) that has the same properties (\text{E}[U( m ; {\bf \text{d}})]\text{Var}[U( m ; {\bf \text{d}})] = n/(\sigma^2 V(m) ), and -\text{E}[ \partial U / \partial m] = \text{Var}[U( m ; {\bf \text{d}})]) used to derive the asymptotic results in exponential family GLM modeling. Here n is the number of observations, m is the model parameter, and \sigma is the data uncertainty. The parameters are then estimated optimizing U instead of likelihood.

This method has the advantage that it can be implement in r with only a few extra key strokes:

“glm(cbind(Correct,Wrong  )~factor(Class), data=data.example, family=binomial(link=”probit”) )”

is the normal way and

“glm(cbind(Correct,Wrong  )~factor(Class), data=data.example, family=quasibinomial(link=”probit”) )”

is the quasi likelihood way. The down side is that quasi likelihood is not likelihood and consequently can not be used in many Bayesian model selection schemes (e.g., there is no evidence estimate or even BIC for quasi-likelihood models).

The Bayesian way (i.e., the correct way) is to use a prior distribution with non stated hyper parameters ({\bf m}_0 ) to give the model the flexibility to account for dispersion. The “recipe” is to make \int dm [p(d|m) \times \pi(m|{\bf m}_0)] = f(d|{\bf m}_0). For the quiz example this means using a binomial distribution for p(d|m) and a beta distribution for \pi(m|{\bf m}_0) the result of the integral is then the beta binomial distribution, though if you don’t want to take wiki’s word for it, this is a fun integral to work though in a pretentious coffee shop on a rain day. Particularly if you are trying to pull an attractive 3rd math student; the \Gamma functions give it a pleasantly complex appearance. In other words, the beta binomial distribution is the product of a binomial distribution with a beta prior having the probability of a success integrated out. Conceptually this means that the probability of a success is random from a beta distribution; unknown predictors are then absorbed into this distribution.

The pdf of beta binomial is beta_bin_pdfwhere \mu is the probability of a success and \psi is precision. For a GLM it would be normal to define \mu = h(\text{A} {\bf m}) (the function h would typically be either a probit or logit) and estimate \psi on its own. This is analogous to estimating the mean and sd for a linear regression.

The appended code gives an example of estimating the probability of success answering a question for both class with 4 additional confounding parameters. A classic binomial GLM is also run. For the GLM, most simulated data sets (about 80% but I have not conducted an exhaustive simulation) find at least one confounding parameters to be a significant predictor of probability of success. The GLM summary for one run is

glm(formula = cbind(k, n – k) ~ ., family = binomial(link = “logit”),
    data = X)

Deviance Residuals:
    Min       1Q   Median       3Q      Max  
-4.0594  -2.1550  -0.1402   2.1930   4.4550  

             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.137667   0.126606   1.087 0.276875    
V1          -0.235301   0.181677  -1.295 0.195264    
V2          -0.177382   0.085473  -2.075 0.037959 *  
V3           0.078763   0.090477   0.871 0.384009    
V4           0.351333   0.102759   3.419 0.000629 ***
V5           0.001959   0.092348   0.021 0.983073    

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 381.54  on 59  degrees of freedom
Residual deviance: 366.40  on 54  degrees of freedom
AIC: 478.64

Number of Fisher Scoring iterations: 4″

The result of the Beta Binomial model is


The parameter estimates with standard errors and 95 percent central credibility intervals (denoted by their lower and upper bounds, LB and UB).

The Beta binomial model correctly does not find any of the parameters to be significantly different from zero.

g.beta.binomial <- function(x, n, mu=0.5, phy=1000, Log.out =T )
#temp=data.frame(A1 = x+phy*mu, B1 = n-x+phy*(1-mu), A2 = phy*mu, B2= phy*(1-mu))
#LP = lbeta(temp$A1, temp$B1)-lbeta( phy*mu, phy*(1-mu))+lchoose(n, x)

LP = lbeta(x+phy*mu, n-x+phy*(1-mu))-lbeta( phy*mu, phy*(1-mu))+lchoose(n, x)
if(Log.out == F) {LP = exp(LP)}

LP[is.nan(LP)] = -300 # get NaNs out

LP = sum(LP)


g.LL <- function(M, K, n, X, npar, ndat) # the log likelihood
eta = 0*(1:ndat) + M[2] # the systimatic part
for (i in 1:npar)
eta = eta + M[2+i]*X[,i]
Mu = 1/(1+exp(-eta)) # the expected value

Phy = M[npar+3]

LL = g.beta.binomial(K, n, mu=Mu, phy=Phy, Log.out =T )


# does not do rotation
g.BBstep <- function(M, k., n., X., Q.sd., npar., ndat.)

# int step
new = M
new[2] = new[2] + Q.sd.[1]*rnorm(1)
new[1] = g.LL(new, k., n., X., npar., ndat.)
r = runif(1)
A = exp(new[1]-M[1])
if (is.na(A)) {A= 0}
if (A>=r)
M = new
# the model pars
for (i in 1:npar.)
j = i + 2
new = M
new[j] = new[j] + Q.sd.[i+1]*rnorm(1)
new[1] = g.LL(new, k., n., X., npar., ndat.)
r = runif(1)
A = exp(new[1]-M[1])
if (is.na(A)) {A= 0}
if (A>=r)
M = new
# int step
new = M
new[npar.+3] = new[npar.+3] + Q.sd.[npar.+2]*rnorm(1)
if(new[npar.+3] > 0)
{new[1] = g.LL(new, k., n., X., npar., ndat.)
r = runif(1)
A = exp(new[1]-M[1])
if (is.na(A)) {A= 0}
if (A>=r)
M = new
g.BB.glm <- function(k, n, X, NSTEPS = 50000, nsave = 1000)
npar = ncol(X) # get the number of pars
ndat = length(k)
parnames = colnames(X) # par names
old = (1:(npar+3))*0 # LL, intersept, pars, dispersion

# this gets the starting model and step size
MOD.start = glm(cbind(k, n-k)~., data=X, family = binomial(link = “logit”))
old[1] = -Inf
old[2:(npar+2)] = MOD.start$coefficients # staring model
old[npar+3] = 100*sum(n) # starting dispersion

Q.sd = sqrt(diag(vcov(MOD.start))) # step size
Q.sd = c(Q.sd, sum(n))
# fixed length thinning
par.buf = matrix(0, nrow=nsave, ncol=(npar+1+1+1+1)) # the saved models pars
# this part samples and saves the models
burn= 10000
for (i in 1:(nsave+burn)) # loop over sampling first
old = g.BBstep(old, k, n, X, Q.sd.=Q.sd, npar, ndat) # new model (1 swweep) with LL

if (i >= burn +1)
j = i – burn
par.buf.temp = c(j, old)
par.buf[j,] = par.buf.temp

for (i in 3:(ncol(par.buf)))
Q.sd[i-2] = sd(par.buf[,i])
Q.sd[length(Q.sd)] = Q.sd[length(Q.sd)]

alpha = 1 # start flt
for (i in 1:NSTEPS) # loop over sampling first
old = g.BBstep(old, k, n, X, Q.sd.=Q.sd, npar, ndat) # new model with LL
# flt
ori = sample(1:nsave, 1) # the model that gets writen over
alpha = (alpha*nsave)/(alpha+nsave) # prob current model
r = runif(1) # random value

if (r <= alpha)
par.buf.temp = c(i+nsave, old)
par.buf[ori,] = par.buf.temp


phy = par.buf[,ncol(par.buf)]
rho = (sum(n)+phy)/(phy+1)

rownames(par.buf) = 1:nsave

par.buf = cbind(par.buf, rho)
index.s = sort.int(par.buf[,1],index.return = TRUE)
par.buf = par.buf[index.s$ix,]


M.list = 3:(ncol(par.buf))
SD.list = 3:(ncol(par.buf))
LB.list = 3:(ncol(par.buf))
UB.list = 3:(ncol(par.buf))

for (i in 3:(ncol(par.buf)))
j = i-2
M.list[j] = mean(par.buf[,i])
SD.list[j] = sd(par.buf[,i])
LB.list[j] = quantile(par.buf[,i], prob=0.025)
UB.list[j] = quantile(par.buf[,i], prob=0.975)
tab = data.frame(Estimate = M.list, SD = SD.list, LB = LB.list, UB = UB.list)
rownames(tab) = c(“Intercept”, colnames(X), “Psi”, “Rho”)

print(round(tab, 4))


plot(par.buf[,1], phy, xlab=”Index”, ylab=”Percision”)
plot(par.buf[,1], rho, xlab=”Index”, ylab=”Dispersion”)

#breaks.rho = seq(1, 1.01*max(rho), length.out=201)
#h1=hist(rho, breaks=breaks.rho, plot =F)
#plot(h1$den,, h1$mids, type=”n”, xaxs=”i”, yaxs=”i”)
#polygon( c(0, h1$den, 0), c(min(h1$mids), h1$mids, max(h1$mids)), col=”grey”)

g.BB.glm(c(rbinom(15, size =10, prob = 0.05), rbinom(15, size =10, prob = 0.95),rbinom(30, size =10, prob = 0.5) ),
rep(10, 20),
X=as.data.frame(matrix(c(rep(0,30), rep(1,30), rnorm(60*4)), nrow=60, ncol=5)))

What does that look like? The output


Horseshoes, Hand grenades and Slow-dancing

Things where close counts!

Hi All. This morning I had an idea about a distribution for data where close was good enough. For example what if you are trying to model where users touch on a screen given that any place they touch a target is equivalent but any where off of the target is bad. There is no reason for the the distribution to be peaked like a Gaussian; so estimates from peaked distributions will have incorrect variances. Also It is nice for the function to be made of simple functions so that it is not computationally intensive to calculate the cdf / pdf.

The basic idea for distribution came from a Prof I had in my undergrad who used the presented cdf (with a =1) to plot limits for students. Note I was hoping to find a better web page to link to  but Google is obsessed with some yob that is a professor at Harvard with the same name. So I call this the “Close-enough” distribution. Enough rambling!

The CDF is a piece wise function F(x) = [h(x)+2a]/(4a) where h(x) = x if x in (-a , a ),  h(x) = 2a - a^2/x if x > a, and h(x) = -2a - a^2/x if x < -a.

Screen Shot 2015-07-13 at 11.11.16 PM

The cumulative density function for Close-enough distribution.

The pdf is f(x) = 1/(4a) where x in (-a , a ),  f(x) = a/(4x^2) other wise.

Screen Shot 2015-07-13 at 11.28.37 PM

The pdf of the Close-enough distribution.

I would like to like to provide a most likely estimate of a however I am lazy and MCMC is a good way to get an estimate.Code for the MCMC done here is given below.

Screen Shot 2015-07-14 at 12.23.36 AM

The PPD of estimates for a simulated example of data drawn from the Close-enough distribution with true scale parameter given by the vertical line.


g.pdf <-function (a, x=seq(-11,11, length.out=10001) )
f1 = 1/(4*a) + 0*x
f2 = (a/4)*x^(-2)
f3 = (a/4)*x^(-2)

f = f1
f[x>a] = f2[x>a]
f[x<(-a)] = f3[x<(-a)]

#plot(c(-11,x,11), c(0,f,0), type=”n”, xlim=c(-10, 10), xaxs=”i”, xlab=”x”, ylab=”P(x)”  )
#polygon(c(-11,x,11), c(0,f,0), col=”grey”)



g.inv.cdf <-function (a, P)
tv = 4*a*P – 2*a

if ((tv >= -a ) & (tv <= a ))
x = tv

if ((tv > a ))
x = -a^2/(4*a*(P-1))

if ((tv < -a ))
x = -a/(4*P)

g.rced <-function (n= 1000, A=1)
out = 1:n

for (i in 1:n)
out[i] = g.inv.cdf(P= runif(1), a=A)

#out = out[out>-10]
#out = out[out<10]

h=hist(out, breaks=101, freq=F)
d= g.pdf(x=h$mids, a=A)

lines(h$mids, d, col=”blue”)

g.mcmc <-function (N = 100, n.=100)
d = g.rced(n=n., A=1)

LL= -Inf
a = 1

mod  = c(LL, a)
rec  = data.frame(LL= 1:N, a= 1:N)
Q.sd = 0.1

for (i in 1:N )
mod.new= mod
mod.new[2] = mod.new[2] + rnorm(1, sd= Q.sd, mean=0)

mod.new[1] = sum(log(g.pdf(x=d, a = mod.new[2])))
A = exp(mod.new[1]-mod[1])
r = runif(1)

if (r<=A)
mod = mod.new
rec[i,c(1,2)] = mod
h=hist(rec[,2], breaks=101, plot = F)

plot(h$mids, h$dens, xlab=”a”, ylab=”P(a)”,type=”n”)
polygon(c(h$mids[1], h$mids, h$mids[2]), c(0,h$dens,0), col=”grey”, border=F)
abline(v=1, lwd=2, col=”blue”)


The long promised and anti climatic first draft of data augmentation

Some time people fail….

For the last few weeks I have been working on a way to do something interesting for this entry and as of yet all my ideas have had limited success. So basically this post is just going to rehash what wiki says about data augmentation. But I have appended working code so win?

The basic idea of data augmentation is that MCMC sampling would accept 100 percent of proposed models if the proposal distribution was the target distribution (this is Gibbs’ sampling). Data augmentation creates axillary parameters such that the conditional distribution of each parameter (axillary and original) is known analytically given the remaining parameters.

This is not possible for logistic regression but it is possible for probit regression. Thus there is an example of that in code below and not for example a logistic regression. Also the idea I was working on required finding the integral of a Gaussian (with an arbitrary correlation matrix)  from 0-inf in all dimensions. If you have any leads on this please leave them in the comments.



The simulated data (circles), true probability of success (blue line) and posterior predicted probability of success (heat scale).


P.S. this needs MASS to be loaded to run.
rtnorm<-function(n, mu, type=1, N=10001)
if (type == 1)
x    = seq(0, max(mu+5,5), length.out = N)
W    = exp(-0.5*(x-mu)^2)
samp = sample(1:N, n, prob=W, replace=T)
samp = x[samp]
if (type == 0)
x    = seq(min(mu-5,-5), 0, length.out = N)
W    = exp(-0.5*(x-mu)^2)
samp = sample(1:N, n, prob=W, replace=T)
samp = x[samp]



g.probit<-function(nd= 100, N= 1000, burn=100)
# the sensitivity matrix
A = matrix(0, nrow = nd, ncol=3)
A[,1] = 1
A[,2] = 1:nd
A[,3] = (1:nd)^2

# the sim data
x         = 1:nd
beta.true =  matrix(c(-10/30,3/30,-0.035/30), nrow=3, ncol=1)
eta.true  = A%*%beta.true
P.true    = pnorm(eta.true)
r         = runif(nd)
d         = 0*r
d[r<=P.true] = 1

plot(x, P.true, xlab=”x”, ylab=”y”, type=”l”, col=”blue”, lwd=2, ylim=c(0,1))
points(x, d)

# beta prior is  N(beta.0, tau)
beta.0 = matrix(c(0,0,0), nrow=3, ncol=1)
tau    = matrix(c(1,0,0,0,1,0,0,0,1), nrow=3, ncol=3)
tau.inv = solve(tau)

# useful values
A.t.A = t(A)%*%A
Sigma.inv  = (tau.inv + A.t.A )
Sigma = solve(Sigma.inv)
tau.inv.beta.0 = tau.inv%*%beta.0

# record
beta = beta.true
Rec = data.frame(M1 = 1:N, M2= 1:N, M3= 1:N )
d.star = 1:nd
pred.rec = matrix(0, nrow=N, ncol=nd)

for ( i in 1:(N+burn))
# sample d.star | beta
A.beta = A%*%beta
for (j in 1:nd)
d.star[j] = rtnorm(1, mu=A.beta[j], type=d[j], N=10001)

# sample beta   | d.star
mu.temp = Sigma%*%(tau.inv.beta.0 + t(A)%*%d.star)
beta = mvrnorm(1, mu = mu.temp, Sigma = Sigma )

# record beta
if ( i > burn)
pred.temp  = pnorm(t(A%*%beta))
pred.rec[i-burn,] = pred.temp
Rec[i-burn,] = beta

ncolor = 21
layers = matrix(0, nrow=ncolor, ncol=nd)
for (i in 1:nd)
q = quantile(pred.rec[,i], probs=seq(0.025, 0.975, length.out=ncolor))
layers[,i] = q

c.ind = (ncolor-1)/2
gcol = rev(heat.colors(c.ind + 5))[6:(c.ind + 5) ]
#gcol = c(“white”, rev(heat.colors((ncolor-1)/2) ))
#       1            2           34        4  3 21

min.y. = 0
max.y. = 1

plot(x, P.true, xlab=”x”, ylab=”y”, type=”l”, col=”blue”, lwd=2, ylim=c(0,1))

for (i in 1:((ncolor-1)/2))
PY = c(layers[i,], rev(layers[ncolor+1-i,]))
Px   = c(x, rev(x))
polygon(Px, PY, col=gcol[i], border=NA)
lines(x, layers[((ncolor-1)/2)+1,], col=gcol[((ncolor-1)/2)+1], lwd=2)
points(x, d)

lines(x, P.true, xlab=”x”, ylab=”y”, type=”l”, col=”blue”, lwd=2, ylim=c(0,1))
points(x, d)