# An ordered categorical regression approach to rating chess players.

Hi all

I was watching the Tata Steel chess tournament; the big question of the event was whether Anand would qualify for the final in London.

As I pondered the question I thought, not for the first time, that Elo is more or less useless for making such predictions. The main problems are that Elo does not take into account the value of the white vs the black pieces and does not account for a player's propensity to draw.

This post considers an ordered categorical regression model as an alternative to Elo. The model has four parameters per player; I know having only one would be nice, as it allows a clear "A is better than B" sort of comparison, but real life is not always nice. The four parameters are a player's strength with white ($S_{W}$), strength with black ($S_{B}$), propensity to draw with white ($D_{W}$), and propensity to draw with black ($D_{B}$). To explain what these parameters mean, consider a match between two players (let's call them Vladimir and Nikita). Vlad "the lad" has the white pieces; he has $S^{\text{Vlad}}_{W} = 1$ and $D^{\text{Vlad}}_{W} = 0.25$. Nikita "the shoe" has black, with $S^{\text{Nik}}_{B} = 0$ and $D^{\text{Nik}}_{B} = 0.5$.

The first value that must be computed is the "mid point" ($x_{\text{mid}} = S_{B}^{\text{Nik}} - S^{\text{Vlad}}_{W} = -1$). This can be thought of as the border between white's territory in the outcome space and black's. Next, the "draw zone" must be calculated. It has two borders: the black-wins-to-draw border and the draw-to-white-wins border. The black-wins-to-draw border is $x_{\text{BD}} = x_{\text{mid}} - D^{\text{Nik}}_{B} = -1.5$ and the draw-to-white-wins border is $x_{\text{DW}} = x_{\text{mid}} + D^{\text{Vlad}}_{W} = -0.75$. Figure 1 shows the three lines for this example.

Figure 1: The outcome regions of a match between Vlad and Nikita. The red area is the probability that Nikita wins, the blue area is the probability that the game is a draw, and the green area is the probability that Vlad wins.

Next, if we imagine a standard Gaussian distribution (mean zero and standard deviation one) behind the $x_{\text{mid}}$, $x_{\text{BD}}$, and $x_{\text{DW}}$ points on the number line, then the probability of each result (black wins, draw, or white wins) can be calculated by integrating over the appropriate region. That is, in Fig. 1 the red area from $-\infty$ to the black-wins-to-draw border is the probability that black wins: $P[\text{black wins}] = \Phi[x_{\text{BD}}]$. The blue area from the black-wins-to-draw border to the draw-to-white-wins border is the probability that the game is a draw: $P[\text{draw}] = \Phi[x_{\text{DW}}] - \Phi[x_{\text{BD}}]$. And $P[\text{white wins}] = 1 - \Phi[x_{\text{DW}}]$. In the Vlad vs Nik example these probabilities are 6.68% black wins, 15.98% draw, and 77.34% white wins.
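The example arithmetic is easy to sanity-check. Here is a quick sketch in Python (the fitting code below is in R; the function and variable names here are mine, invented for illustration):

```python
from math import erf, sqrt

def phi(x):
    # standard Gaussian CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def outcome_probs(s_white, d_white, s_black, d_black):
    """Return (P[black wins], P[draw], P[white wins]) for one game."""
    x_mid = s_black - s_white   # border between white's and black's territory
    x_bd = x_mid - d_black      # black-wins / draw border
    x_dw = x_mid + d_white      # draw / white-wins border
    p_black = phi(x_bd)
    p_draw = phi(x_dw) - phi(x_bd)
    p_white = 1.0 - phi(x_dw)
    return p_black, p_draw, p_white

# Vlad (white): S = 1, D = 0.25; Nikita (black): S = 0, D = 0.5
p_b, p_d, p_w = outcome_probs(1.0, 0.25, 0.0, 0.5)
print(round(p_b, 4), round(p_d, 4), round(p_w, 4))  # 0.0668 0.1598 0.7734
```

The three probabilities always sum to one, since the three regions partition the number line.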

Applying this method to the Tata Steel tournament and using MCMC to estimate the parameters, I get the distributions of player strengths shown in Fig. 2. Note that Anand is defined as having strength 0 with both black and white, as this method is only ordering the players (ok, not really ordering, but giving relative strengths with black and white…), not creating an absolute measure.

Figure 2: the marginal posterior densities of player strengths with black and white.

A surprising result of Fig. 2 is that Carlsen is not the strongest player; he is roughly second strongest with white and third with black. What gives? Figure 3 shows the marginal posterior histograms of the draw parameters for each player; there it can be seen that Giri and Nakamura have a much higher propensity to draw!

Figure 3: the marginal posterior densities of player propensities to draw with black and white.

That is it for this week. I hope this was interesting.

# full result data from Tata Steel

# reducing it down to just which player has white, which has black, and the result.
# note: player "names" are indices
d.tata = data.tata[,c(5,6,4)]

g.LL <- function(data, m)
{
  n = nrow(data)
  np = 10

  log.P = 0

  for ( i in 1:n)
  {
    # get the indices of the players
    W_i = data[i,1]
    B_i = data[i,2]
    R   = data[i,3]

    # calculate the boundaries
    mid = m[(np)+B_i] - m[W_i]
    DW  = mid + m[(2*np)+W_i]
    BD  = mid - m[(3*np)+B_i]

    # calculate the probability of the result
    if( R == "W") {P = log(1-pnorm(DW))           }
    if( R == "D") {P = log(pnorm(DW) - pnorm(BD)) }
    if( R == "B") {P = log(pnorm(BD))             }
    log.P = log.P + P
  }

  return(log.P)
}
#g.LL(data= d.tata, m= c(rep(0, 10), rep(0, 10), rep(0.1, 10), rep(0.1,10) ) )

g.MCMC <- function(N = 100, m.start = c(rep(0, 10), rep(0, 10), rep(0.1, 10), rep(0.1, 10)))
{
  np = 10 # number of players in the event

  # the upper and lower prior bounds of the ratings
  # strengths can be between -3 (very weak) and 3 (very strong)
  # draw propensity must be > 0; 0 is never draws, 3 is draws almost always
  LB = c(rep(-3, np), # player strength with white
         rep(-3, np), # player strength with black
         rep(0, np),  # player propensity to draw with white
         rep(0, np))  # player propensity to draw with black
  UB = c(rep(3, np), rep(3, np), rep(6, np), rep(6, np))

  # proposal standard error
  Q.sigma = 0.25

  m.old = m.start
  LL.old = -Inf

  # the record of the sampled parameter values
  REC = data.frame(matrix(0, ncol = 4*np+1, nrow = N))
  colnames(REC) = c("LL",
    "SW1", "SW2", "SW3", "SW4", "SW5", "SW6", "SW7", "SW8", "SW9", "SW10",
    "SB1", "SB2", "SB3", "SB4", "SB5", "SB6", "SB7", "SB8", "SB9", "SB10",
    "DW1", "DW2", "DW3", "DW4", "DW5", "DW6", "DW7", "DW8", "DW9", "DW10",
    "DB1", "DB2", "DB3", "DB4", "DB5", "DB6", "DB7", "DB8", "DB9", "DB10")

  # standard Metropolis MCMC steps
  for ( i in 1:N)
  {
    # random update order; indices 1 and 11 (Anand's white and black strengths)
    # are excluded because Anand is fixed at 0 as the reference player
    ORDER = sample(c(2:10, 12:(np*4)))
    for (j in ORDER)
    {
      m.prime = m.old
      m.prime[j] = m.prime[j] + rnorm(1, sd = Q.sigma)
      LL.prime = -Inf
      if ( min(m.prime >= LB) & min(m.prime <= UB) )
      {
        LL.prime = g.LL(data = d.tata, m = m.prime)
      }

      # Metropolis acceptance ratio
      A = exp(LL.prime - LL.old)
      r = runif(1)
      if (r <= A)
      {
        m.old = m.prime
        LL.old = LL.prime
      }
    }

    REC[i,1] = LL.old
    REC[i,2:41] = m.old

    print(i)
    print(LL.old)
    print(round(m.old[1:10],3))
  }

  return(REC)
}



# plotting code

Mat = matrix(c(21,22,23,
21, 1, 2,
21, 3, 4,
21, 5, 6,
21, 7, 8,
21, 9,10,
21,11,12,
21,13,14,
21,15,16,
21,17,18,
21,19,20), nrow=11, ncol=3, byrow=T)
layout(Mat, widths = c(0.2, 1,1), heights=c(0.5, 1,1,1,1,1 , 1,1,1,1,1))
par(mar=c(0.5,0,0,1))

g.hist <- function(x, bounds = c(-12, 12), nb, ylab. = "Anand", xlab. = "Strength White")
{
  b = seq(bounds[1], bounds[2], length.out = nb+1)
  h = hist(x, breaks = b, plot = F)
  plot(h$mids, h$den, type = "n", xlab = "", ylab = ylab., xaxt = "n", yaxt = "n")
  axis(3, at = mean(b), tick = F, labels = xlab.)
  axis(2, at = max(h$den)/2, tick = F, labels = ylab.)
  polygon(c(min(b), h$mids, max(b)), c(0, h$den, 0), col = "grey", border = F)
  abline(v = 0, lwd = 3, col = "blue")
}

# strength histograms (Fig. 2)
g.hist(x=Samp[,2],  bounds=c(-3, 3), nb= 50, ylab. = "Anand", xlab. = "Strength White")
g.hist(x=Samp[,12], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "Strength Black")
g.hist(x=Samp[,3],  bounds=c(-3, 3), nb= 50, ylab. = "Aronian", xlab. = "")
g.hist(x=Samp[,13], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,4],  bounds=c(-3, 3), nb= 50, ylab. = "Carlsen", xlab. = "")
g.hist(x=Samp[,14], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,5],  bounds=c(-3, 3), nb= 50, ylab. = "Giri", xlab. = "")
g.hist(x=Samp[,15], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,6],  bounds=c(-3, 3), nb= 50, ylab. = "Gujrathi", xlab. = "")
g.hist(x=Samp[,16], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,7],  bounds=c(-3, 3), nb= 50, ylab. = "Harikrishna", xlab. = "")
g.hist(x=Samp[,17], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,8],  bounds=c(-3, 3), nb= 50, ylab. = "Liren", xlab. = "")
g.hist(x=Samp[,18], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,9],  bounds=c(-3, 3), nb= 50, ylab. = "Nakamura", xlab. = "")
g.hist(x=Samp[,19], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,10], bounds=c(-3, 3), nb= 50, ylab. = "Nepomniachtchi", xlab. = "")
g.hist(x=Samp[,20], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,11], bounds=c(-3, 3), nb= 50, ylab. = "So", xlab. = "")
g.hist(x=Samp[,21], bounds=c(-3, 3), nb= 50, ylab. = "", xlab. = "")

# same layout for the draw-propensity histograms (Fig. 3)
Mat = matrix(c(21,22,23,
               21, 1, 2,
               21, 3, 4,
               21, 5, 6,
               21, 7, 8,
               21, 9,10,
               21,11,12,
               21,13,14,
               21,15,16,
               21,17,18,
               21,19,20), nrow=11, ncol=3, byrow=T)
layout(Mat, widths = c(0.2, 1,1), heights=c(0.5, 1,1,1,1,1, 1,1,1,1,1))
par(mar=c(0.5,0,0,1))

g.hist(x=Samp[,2+20],  bounds=c(0, 3), nb= 50, ylab. = "Anand", xlab. = "Draw White")
g.hist(x=Samp[,12+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "Draw Black")
g.hist(x=Samp[,3+20],  bounds=c(0, 3), nb= 50, ylab. = "Aronian", xlab. = "")
g.hist(x=Samp[,13+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,4+20],  bounds=c(0, 3), nb= 50, ylab. = "Carlsen", xlab. = "")
g.hist(x=Samp[,14+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,5+20],  bounds=c(0, 3), nb= 50, ylab. = "Giri", xlab. = "")
g.hist(x=Samp[,15+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,6+20],  bounds=c(0, 3), nb= 50, ylab. = "Gujrathi", xlab. = "")
g.hist(x=Samp[,16+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,7+20],  bounds=c(0, 3), nb= 50, ylab. = "Harikrishna", xlab. = "")
g.hist(x=Samp[,17+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,8+20],  bounds=c(0, 3), nb= 50, ylab. = "Liren", xlab. = "")
g.hist(x=Samp[,18+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,9+20],  bounds=c(0, 3), nb= 50, ylab. = "Nakamura", xlab. = "")
g.hist(x=Samp[,19+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,10+20], bounds=c(0, 3), nb= 50, ylab. = "Nepomniachtchi", xlab. = "")
g.hist(x=Samp[,20+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")
g.hist(x=Samp[,11+20], bounds=c(0, 3), nb= 50, ylab. = "So", xlab. = "")
g.hist(x=Samp[,21+20], bounds=c(0, 3), nb= 50, ylab. = "", xlab. = "")

# Annealing Neural Networks: Another keeping it real post

Hi all

This post discusses a couple of strategies for estimating the weights of (training) a neural network (NN). Both strategies are based on down-sampling the data (${\bf d}$). Both strategies make an analogy with annealing; i.e., ignoring some of the observations makes the objective function flatter and thus easier to traverse. This is roughly equivalent to raising the likelihood to the power $1/t$, as is done in simulated annealing.
If the objective function is $\phi({\bf w}) = \log ( \prod_{i=1}^{N} P[d_i|{\bf w}])$, then $\frac{1}{t}\phi({\bf w}) \approx \log ( \prod_{i \in S} P[d_i|{\bf w}])$, where $S$ is a random sample of size $\sim N/t$ from the $N$ observations. For example, if $t = 2$ then $S$ should be of size $N/2$; in other words, using half the data is similar to raising the objective function to a temperature of 2. This is somewhat intuitive (at least to me): if there is enough data, most of it will just be repeating itself.
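The equivalence can be checked numerically: for roughly i.i.d. terms, subsampling scales the log objective by the sampling fraction. A hypothetical Python sketch (the per-observation log-likelihoods here are made-up numbers, purely for illustration):

```python
import random

random.seed(0)

# made-up per-observation log-likelihoods for N = 1000 observations
N = 1000
log_p = [random.uniform(-2.0, -0.5) for _ in range(N)]

phi_full = sum(log_p)                    # full-data objective
t = 2                                    # "temperature"
S = random.sample(range(N), N // t)      # keep N/t observations
phi_sub = sum(log_p[i] for i in S)

# the subsampled objective approximates phi_full / t
print(phi_sub, phi_full / t)
```

The approximation gets tighter as $N$ grows, which matches the intuition that with enough data most observations are redundant.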

Before going into the details I would like to highlight that neither of these strategies worked particularly well. In general, a solution of similar quality could be found faster (in fewer forward computations) just by using all the data with the backward propagation with random jitter of my earlier post, restarting a few times (at different random locations), and taking the best result. This might mean that my test problem is too simple (a global optimum is easy to find, so there is no need to explore the whole space) or that the methods themselves are just bad.

The first method I am calling batched annealing. Here the data (${\bf d}$, with corresponding feature matrix $A$) are put in a random order. Training is then conducted repeatedly; at each iteration one or more data points are added to the training set. Intuitively this is how I personally memorize things; I start with a small set of things to memorize and add more as I go. In pseudocode:

index_rand = random_order(1:n) # randomly sort the data
d_star = d[index_rand]
A_star = A[index_rand, ]
for i in 1:n # loop over observations
    d_temp = d_star[1:i]   # down-sample the data
    A_temp = A_star[1:i, ] # down-sample the feature matrix
    W_old = train_NN(d_temp, A_temp, W_old)

What I found doing this was that very often I would get two subsets of data that could not both be fit; whichever one got more members into the training set first would be fit and the other would be ignored / poorly fit. The issue is that even if fitting the other subset would have given a better solution (by hold-out error, likelihood, or misfit of the data), that solution would not be found if the first subset had its data added first.

The second method is batched tempering (at least I am good at coming up with names for these methods…); it is an analogy to parallel tempering. Again data are excluded to mimic higher temperatures. This is a bit more complex than the first method. Instead of one optimizer there are $k$ optimizers. These optimizers exchange information every 1000 iterations. By "exchange information" I really mean they reorder themselves such that the better optimizers (those with a better objective function; higher for log-likelihood, lower for misfit) are more likely to use the fuller data sets. In this way optimizers that have found good weights locally optimize them, and ones that have not found good weight values keep exploring the space. I also resample the data subsets once the new optimizer order is assigned.

In true parallel tempering the individual Markov chains (which are analogous to the optimizers used here) swap with randomly chosen partners. Here, because there is no requirement to make the process ergodic, I simply order the optimizers using a weighted random order. Note that it is very important not to force the optimizers into a deterministic / strict ordering; doing so stops the best chains from having an opportunity to explore around their space.
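The weighted random reordering step can be sketched as a weighted sample without replacement: better objective values get higher weight, so good optimizers tend toward the fuller data sets, but the ordering stays stochastic. This is a hypothetical Python sketch (the function name and the exact weighting are mine; the R implementation is in `g.sto` below):

```python
import math
import random

random.seed(1)

def weighted_order(phis):
    """Return optimizer indices in a weighted random order, worse-ish first;
    phis are misfits (lower = better)."""
    scores = [-0.5 * p for p in phis]            # turn misfit into a score
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # stabilized exponential weights
    idx = list(range(len(phis)))
    order = []
    while idx:
        # draw one index with probability proportional to its weight
        pick = random.choices(idx, weights=[weights[i] for i in idx])[0]
        order.append(pick)
        idx.remove(pick)
    # better optimizers tend to be drawn early, so reverse: they land at the
    # end of the list, which gets the fuller data sets
    return order[::-1]

print(weighted_order([10.0, 2.0, 5.0, 1.0]))  # a random permutation of 0..3
```

Because the draw is random rather than a strict sort, a currently-worse optimizer occasionally gets a full data set, which is exactly the exploration behavior the post argues for.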

The code is appended; relative to the previous posts on NNs, the only really important part is the "g.sto" function. In pseudocode:

for i in 1:N # loop over the number of exchange steps
    for j in 1:num_opt # loop through each of the optimizers
        W_old = record[j] # get the previous model for optimizer j
        index_rand = random_samp(sample_from = 1:n, size = NS[j]) # NS is a list of subset sizes
        d_temp = d[index_rand]   # down-sampled data
        A_temp = A[index_rand, ] # down-sampled sensitivity matrix
        W_temp = train_NN(d_temp, A_temp, maxit = 1000) # train the NN for 1000 iterations on the subset
        phi_all_temp = objective_function_NN(d, A, W_temp) # evaluate the model on all data
        record = record_update(W_temp, phi_all_temp) # record the model
    record = order(record) # weighted random reorder of the optimizers

That is really it for now. Sorry this is more of a reminder than a description.

Tune in next time for more adventures in analytics.

g.logit <-function(x)
{
return(1/(1+exp(-x)))
}
g.pred <-function(w, A, d)
{
zeta21 = w[1]*A[,1] + w[ 2]*A[,2] +w[ 3]*A[,3] +w[ 4]*A[,4] +w[ 5]*A[,5] +w[ 6]*A[,6] +w[ 7]*A[,7] +w[ 8]*A[,8] + w[9]
zeta22 = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3] +w[13]*A[,4] +w[14]*A[,5] +w[15]*A[,6] +w[16]*A[,7] +w[17]*A[,8] +w[18]
zeta23 = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3] +w[22]*A[,4] +w[23]*A[,5] +w[24]*A[,6] +w[25]*A[,7] +w[26]*A[,8] +w[27]
zeta24 = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3] +w[31]*A[,4] +w[32]*A[,5] +w[33]*A[,6] +w[34]*A[,7] +w[35]*A[,8] +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31 = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32 = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41 = w[47]*sigma31+w[48]*sigma32 +w[49]
sigma41 =g.logit(zeta41)

return(sigma41)
}
g.phy <-function(w, A, d)
{
sigma41 = g.pred(w, A, d)
#phy = 0.5*sum((d-sigma41)^2)
phy = -2*sum( d* log(sigma41) + (1-d)*log(1-sigma41) ) # + sum(w^2)/nrow(A) # the penalty for size

return(phy)
}

g.phy.prime <-function(w, A, d)
{
num.obs = nrow(A)

zeta21 = w[1]*A[,1] + w[ 2]*A[,2] +w[ 3]*A[,3] +w[ 4]*A[,4] +w[ 5]*A[,5] +w[ 6]*A[,6] +w[ 7]*A[,7] +w[ 8]*A[,8] + w[9]
zeta22 = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3] +w[13]*A[,4] +w[14]*A[,5] +w[15]*A[,6] +w[16]*A[,7] +w[17]*A[,8] +w[18]
zeta23 = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3] +w[22]*A[,4] +w[23]*A[,5] +w[24]*A[,6] +w[25]*A[,7] +w[26]*A[,8] +w[27]
zeta24 = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3] +w[31]*A[,4] +w[32]*A[,5] +w[33]*A[,6] +w[34]*A[,7] +w[35]*A[,8] +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31 = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32 = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41 = w[47]*sigma31+w[48]*sigma32 +w[49]
sigma41 =g.logit(zeta41)

gradphy = -2*( (d/sigma41) - (1-d)/(1-sigma41) )

return(gradphy)

}
NNplot<-function(w = w.old)
{

plot(1:4, 1:4, type="n", xlab= "", ylab="", ylim=c(0,9), xlim=c(0,5), xaxt="n", yaxt="n")

w = 10*w/max(abs(w))
g.col = rep("blue", 49)
g.col[w < 0] = "red"

#print(w[1:8])
segments(x0=1, x1=2, y0=1, y1 = 3, lwd=abs(w[1]), col=g.col[1])
segments(x0=1, x1=2, y0=2, y1 = 3, lwd=abs(w[2]), col=g.col[2])
segments(x0=1, x1=2, y0=3, y1 = 3, lwd=abs(w[3]), col=g.col[3])
segments(x0=1, x1=2, y0=4, y1 = 3, lwd=abs(w[4]), col=g.col[4])
segments(x0=1, x1=2, y0=5, y1 = 3, lwd=abs(w[5]), col=g.col[5])
segments(x0=1, x1=2, y0=6, y1 = 3, lwd=abs(w[6]), col=g.col[6])
segments(x0=1, x1=2, y0=7, y1 = 3, lwd=abs(w[7]), col=g.col[7])
segments(x0=1, x1=2, y0=8, y1 = 3, lwd=abs(w[8]), col=g.col[8])

segments(x0=1, x1=2, y0=1, y1 = 4, lwd=abs(w[10]), col=g.col[10])
segments(x0=1, x1=2, y0=2, y1 = 4, lwd=abs(w[11]), col=g.col[11])
segments(x0=1, x1=2, y0=3, y1 = 4, lwd=abs(w[12]), col=g.col[12])
segments(x0=1, x1=2, y0=4, y1 = 4, lwd=abs(w[13]), col=g.col[13])
segments(x0=1, x1=2, y0=5, y1 = 4, lwd=abs(w[14]), col=g.col[14])
segments(x0=1, x1=2, y0=6, y1 = 4, lwd=abs(w[15]), col=g.col[15])
segments(x0=1, x1=2, y0=7, y1 = 4, lwd=abs(w[16]), col=g.col[16])
segments(x0=1, x1=2, y0=8, y1 = 4, lwd=abs(w[17]), col=g.col[17])

segments(x0=1, x1=2, y0=1, y1 = 5, lwd=abs(w[19]), col=g.col[19])
segments(x0=1, x1=2, y0=2, y1 = 5, lwd=abs(w[20]), col=g.col[20])
segments(x0=1, x1=2, y0=3, y1 = 5, lwd=abs(w[21]), col=g.col[21])
segments(x0=1, x1=2, y0=4, y1 = 5, lwd=abs(w[22]), col=g.col[22])
segments(x0=1, x1=2, y0=5, y1 = 5, lwd=abs(w[23]), col=g.col[23])
segments(x0=1, x1=2, y0=6, y1 = 5, lwd=abs(w[24]), col=g.col[24])
segments(x0=1, x1=2, y0=7, y1 = 5, lwd=abs(w[25]), col=g.col[25])
segments(x0=1, x1=2, y0=8, y1 = 5, lwd=abs(w[26]), col=g.col[26])

segments(x0=1, x1=2, y0=1, y1 = 6, lwd=abs(w[28]), col=g.col[28])
segments(x0=1, x1=2, y0=2, y1 = 6, lwd=abs(w[29]), col=g.col[29])
segments(x0=1, x1=2, y0=3, y1 = 6, lwd=abs(w[30]), col=g.col[30])
segments(x0=1, x1=2, y0=4, y1 = 6, lwd=abs(w[31]), col=g.col[31])
segments(x0=1, x1=2, y0=5, y1 = 6, lwd=abs(w[32]), col=g.col[32])
segments(x0=1, x1=2, y0=6, y1 = 6, lwd=abs(w[33]), col=g.col[33])
segments(x0=1, x1=2, y0=7, y1 = 6, lwd=abs(w[34]), col=g.col[34])
segments(x0=1, x1=2, y0=8, y1 = 6, lwd=abs(w[35]), col=g.col[35])

segments(x0=2, x1=3, y0=3, y1 = 4, lwd=abs(w[37]), col=g.col[37])
segments(x0=2, x1=3, y0=4, y1 = 4, lwd=abs(w[38]), col=g.col[38])
segments(x0=2, x1=3, y0=5, y1 = 4, lwd=abs(w[39]), col=g.col[39])
segments(x0=2, x1=3, y0=6, y1 = 4, lwd=abs(w[40]), col=g.col[40])

segments(x0=2, x1=3, y0=3, y1 = 5, lwd=abs(w[42]), col=g.col[42])
segments(x0=2, x1=3, y0=4, y1 = 5, lwd=abs(w[43]), col=g.col[43])
segments(x0=2, x1=3, y0=5, y1 = 5, lwd=abs(w[44]), col=g.col[44])
segments(x0=2, x1=3, y0=6, y1 = 5, lwd=abs(w[45]), col=g.col[45])

segments(x0=3, x1=4, y0=4, y1 = 4.5, lwd=abs(w[47]), col=g.col[47])
segments(x0=3, x1=4, y0=5, y1 = 4.5, lwd=abs(w[48]), col=g.col[48])

symbols(x=1, y=1, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=2, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=3, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=6, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=7, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=8, circles=0.5 , add=T, inches = .25, bg ="grey")

text(x=rep(1,8), y= 1:8, labels = c("Not", "And", "or", "Xor", "contra", "IfThen", "alpha", "beta") )
symbols(x=2, y=3, circles=0.5 , add=T, inches = .20, bg ="grey")
symbols(x=2, y=4, circles=0.5 , add=T, inches = .20, bg ="grey")
symbols(x=2, y=5, circles=0.5 , add=T, inches = .20, bg ="grey")
symbols(x=2, y=6, circles=0.5 , add=T, inches = .20, bg ="grey")

symbols(x=3, y=4, circles=0.5 , add=T, inches = .20, bg ="grey")
symbols(x=3, y=5, circles=0.5 , add=T, inches = .20, bg ="grey")

symbols(x=4, y=4.5, circles=0.5 , add=T, inches = .20, bg ="grey")
}

NN2 <-function(w = (runif(49)-0.5)*10 , A=A., d=d., N= 10000, mod = 1, t = 100, ME = 0.0001,verb = 0 )
{

PP = 0.01
eta = 1
ND = nrow(A)
#print(w)
phy = g.phy(w, A,d)
#print(phy)

rec = c(phy, w)
phy.best = phy
w.best = w

fm = 0
phy.last = phy
i.last = 1
w.last = w

for ( i in 1:N)
{
phy = g.phy(w, A, d)

#update the weights
if (mod == 6 )
{
eta = 0.5

r = runif(1)
if( r <= 0.01)
{
mag.w = sqrt(sum(w*w))
j = rnorm(49, sd = mag.w /sqrt(49))/5
}
if( r > 0.01)
{
j = rnorm(49, sd = mag.grad /sqrt(49))/5
}
#print(sqrt(sum(j*j)))
#new misfit
phy.new = g.phy(w.new, A,d)

}
phy.new = g.phy(w.new, A,d)
if(is.nan(phy.new) )
{
phy.new = Inf
}
k = 0
fm = fm + 1

#if(verb == 1)
#{
#print(phy)
#print(phy.new)
#}
#print(phy.new > phy )
TEST = ( phy.new > phy )
if(is.na(TEST)) {TEST = FALSE}

while( ( TEST ) & ( k <= 100 ) )
{

#if(verb == 1)
#{
#print("---------------------- in the while loop ----------------------")
#print(phy)
#print(phy.new)
#print(k)
#}

# normal back prop step shrink

if (mod == 6 )
{
#print("here")
eta = eta/2
r = runif(1)
if( r <= 0.1)
{
mag.w = sqrt(sum(w*w))
j = rnorm(49, sd = mag.w /sqrt(49))
}
if( r > 0.1)
{
j = rnorm(49, sd = mag.grad /sqrt(49))
}

#j = rnorm(49, sd = 1)

#new misfit
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

k = k+ 1
#if(k == 50)
# {print("k==50!")}

if(is.nan(phy.new) )
{
phy.new = Inf

}
if(phy < phy.best)
{
phy.best = phy
w.best = w
}
TEST = ( phy.new > phy )
if(is.na(TEST)) {TEST = FALSE}

}
if(phy.new < phy)
{
phy = phy.new
w = w.new
}
# add the model to the
rec = rbind(rec, c(phy,w) )
rownames(rec) = NULL
sigma41 = g.pred(w, A, d)
r = runif(1)

if (r <= PP)
{
PP = PP/2
ND.40 = min(c(ND, 40))
#if ( verb == 1)
# {
# print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
# print(paste("# Obs.", ND) )
# print(phy)
# print(paste("log phy best = ", log(phy.best) ))
# print(paste("max abs error", max(abs(d-sigma41))) )
# print(paste("max abs error", max(abs(d[1:ND.40]-sigma41[1:ND.40]))) )
# }
##print( data.frame(A[1:ND.40,], value= d[1:ND.40], pred = round(sigma41[1:ND.40], 4), error = round(abs(d[1:ND.40] -sigma41[1:ND.40] ),4) ))
#if ( log(phy.last) - log(phy) >= 0.5 )
# {
# x11()
par(mfrow=c(2,1), mar=c(4,4,1,1))
NNplot(w.last)
plot( log(rec[1:i.last,1] ), pch=".", xlab = "Index", ylab = "Phy")
# }
phy.last = phy
w.last = w
i.last = i
}

if( (max(abs(d-sigma41)) <= ME ) & (i >= 10 ) ) # & (i >= 1000)
{
if ( verb == 1)
{
ND.40 = min(c(ND, 40))
sigma41 = g.pred(w, A, d)
print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
print(paste("# Obs.", ND) )
print(phy)
print(paste("log phy best = ", log(phy.best) ))
print(paste("max abs error", max(abs(d-sigma41))) )
print(paste("max abs error", max(abs(d[1:ND.40]-sigma41[1:ND.40]))) )
}
par(mfrow=c(2,1), mar=c(4,4,1,1))
NNplot(w.last)
plot( log(rec[1:i.last,1] ), pch=".", xlab = "Index", ylab = "Phy")

return(c(1,phy , w))

}

}

if ( verb == 1)
{
ND.40 = min(c(ND, 40))
sigma41 = g.pred(w, A, d)
print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
print(paste("# Obs.", ND) )
print(phy)
print(paste("log phy best = ", log(phy.best) ))
print(paste("max abs error", max(abs(d-sigma41))) )
print(paste("max abs error", max(abs(d[1:ND.40]-sigma41[1:ND.40]))) )
}

return(c(0,phy,w))
}
g.sto<-function( ME. = 0.01)
{

A.= matrix(c(0, 1,0,0,0,0, 1, 1,
0, 1,0,0,0,0, 0, 1,
0, 1,0,0,0,0, 1, 0,
0, 1,0,0,0,0, 0, 0,
0, 0,1,0,0,0, 1, 1,
0, 0,1,0,0,0, 0, 1,
0, 0,1,0,0,0, 1, 0,
0, 0,1,0,0,0, 0, 0,
0, 0,0,1,0,0, 1, 1,
0, 0,0,1,0,0, 0, 1,
0, 0,0,1,0,0, 1, 0,
0, 0,0,1,0,0, 0, 0,
0, 0,0,0,1,0, 1, 1,
0, 0,0,0,1,0, 0, 1,
0, 0,0,0,1,0, 1, 0,
0, 0,0,0,1,0, 0, 0,
0, 0,0,0,0,1, 1, 1,
0, 0,0,0,0,1, 0, 1,
0, 0,0,0,0,1, 1, 0,
0, 0,0,0,0,1, 0, 0,
1, 1,0,0,0,0, 1, 1,
1, 1,0,0,0,0, 0, 1,
1, 1,0,0,0,0, 1, 0,
1, 1,0,0,0,0, 0, 0,
1, 0,1,0,0,0, 1, 1,
1, 0,1,0,0,0, 0, 1,
1, 0,1,0,0,0, 1, 0,
1, 0,1,0,0,0, 0, 0,
1, 0,0,1,0,0, 1, 1,
1, 0,0,1,0,0, 0, 1,
1, 0,0,1,0,0, 1, 0,
1, 0,0,1,0,0, 0, 0,
1, 0,0,0,1,0, 1, 1,
1, 0,0,0,1,0, 0, 1,
1, 0,0,0,1,0, 1, 0,
1, 0,0,0,1,0, 0, 0,
1, 0,0,0,0,1, 1, 1,
1, 0,0,0,0,1, 0, 1,
1, 0,0,0,0,1, 1, 0,
1, 0,0,0,0,1, 0, 0)
, nrow=40, ncol=8, byrow=T)

colnames(A.) = c("Not", "And", "or", "Xor", "contra", "IfThen", "alpha", "beta")

A.=rbind(A.,
A. + (runif(320)-0.5)*0.5,
A. + (runif(320)-0.5)*0.5,
A. + (runif(320)-0.5)*0.5,
A. + (runif(320)-0.5)*0.5,
A. + (runif(320)-0.5)*0.5,
A. + (runif(320)-0.5)*0.5,
A. + (runif(320)-0.5)*0.5,
A. + (runif(320)-0.5)*0.5,
A. + (runif(320)-0.5)*0.5
)
d. = matrix(c(1,0,0,0, 1,1,1,0, 0,1,1,0, 1,0,1,1, 1,1,0,1,
0,1,1,1, 0,0,0,1, 1,0,0,1, 0,1,0,0, 0,0,1,0), nrow= 40, ncol=1 )

d. = rbind(d., d., d., d., d.,
d., d., d., d.,
d.)

N = nrow(A.)

NC = 40
NS = floor(seq(0, N, length.out = (NC+1)))[2:(NC+1)]
# make the intial data sets
DS = list()
for ( i in 1:NC)
{
DS[[i]] = sort(sample(1:N, NS[i]))
}
Ws.all = list()
Ws.sub = list()
for ( i in 1:NC)
{
Ws.sub[[i]] = c(0, Inf, rnorm(49))

# c(-8.08041439 , 2.06346497 , -1.54031571 , -2.88445872 , 6.47911272 , -8.49851364 , -7.15282381 , -7.68137939 , 11.59349865 , -11.86608750 , 1.99563545 , -2.98180221 , 10.13098873 , -6.46718542 , 4.07691956 , 6.02385674 , -7.61151232 , 7.61163996 , -3.35555794 , 0.03171125 , -1.51265323 , 1.08372308 , 0.74957913 , -2.19298751 , -1.24309429 , -4.29052458 , 5.43952098 , -1.52537648 , 4.59789938 , 1.61840032 , 1.20710391 , -2.13490775 , -1.63754937 , -3.15367652 , -2.95884757 , 1.42927769 , -17.70410965 , -6.64222147 , 8.18730908 , -11.45157253 , 24.59673331 , -18.76273941 , -14.44924125 , 20.51598107 , -17.67475199 , 6.38152918 , 33.44444894 , -35.86176580 , -14.85993747))
Ws.all[[i]] = Ws.sub[[i]]

}

MBL = 100
BLI = 0
while ( (Ws.all[[NC]][1] == 0) & (BLI <= MBL) )
{
print("-------------------------------------------------------------")
print(BLI)
BLI = BLI +1

for( i in 1:NC)
{
w.temp = Ws.sub[[i]][3:51]
SAMP.temp = DS[[i]]

w.temp = NN2(w= w.temp, N= 500, A= A.[SAMP.temp,], d = d.[SAMP.temp,], mod = 6, t = 100, ME = 0.1 , verb = 0 )
#w.temp = NN2(w= w.temp, N= 100, A= A., d = d., mod = 6, t = 100, ME = 0.05 , verb = 0 )
Ws.sub[[i]] = w.temp

#print("------------------- pre sort ------------------")
#print(NS[[i]])
#print(Ws.sub[[i]][1:5])
#print(i)
}
phys = 1:NC
for( i in 1:NC)
{
w.temp = Ws.sub[[i]][3:51]
v. = 0
if(i ==NC) {v. = 1}
w.temp = NN2(w= w.temp, N= 10, A= A., d = d., mod = 6, t = 100, ME = 0.1 , verb = v. )
Ws.sub[[i]] = w.temp
phys[i] = w.temp[2]

}
#print(phys)
probs = -0.5*phys
#print(probs)
probs = ( probs - min(probs) )
#print(probs)
probs = exp(10*probs/max(probs))
#print(probs)

NI = rev(sample(1:NC, prob = probs))

#for( i in 1:NC)
# {
# print("------------------- pre sort ------------------")
# print(NS[[i]])
# print(Ws.sub[[i]][1:5])
# print(i)
# }
print(
data.frame(true.order = order(sapply(Ws.sub, function(x) x[2], simplify=TRUE), decreasing=TRUE) ,
probs_true = probs[ order(sapply(Ws.sub, function(x) x[2], simplify=TRUE), decreasing=TRUE)],
phy = phys[ order(sapply(Ws.sub, function(x) x[2], simplify=TRUE), decreasing=TRUE)],
rand.order = NI, probs = probs[NI] )

)

#Ws.sub = Ws.sub[order(sapply(Ws.sub, function(x) x[2], simplify=TRUE), decreasing=TRUE)]

Ws.sub.temp = Ws.sub

for( i in 1:NC)
{
Ws.sub[i] = Ws.sub.temp[NI[i]]
}
best.index = order(sapply(Ws.sub, function(x) x[2], simplify=TRUE), decreasing=TRUE)[NC]
for ( i in 1:NC)
{
DS[[i]] = sort(sample(1:N, NS[i]))
r = runif(1)
if( (i <= NC/2) )
{
if( (r<= 0.45) & (r >=0) ) {Ws.sub[[i]] = Ws.sub[[best.index]]}
if( (r<= 0.5) & (r >=0.45) ) {Ws.sub[[i]] = c(0, Inf, rnorm(49)) }
}

}
print("Model with full data ")
print(Ws.sub[[best.index]])
}
w.old = Ws.sub[[NC]]

return(w.old)
}
w.old = g.sto( )


# Optimizing the choice of objective function

Hi all

In my last couple of posts I have been exploring neural nets. This post continues the topic with a quick look at objective functions.

In the last two posts I glossed over the choice of objective function. I used the sum of squared errors (SSE), but that is a bit of a strange choice as the data being modelled are dichotomous. This post derives a more rigorous objective function and explores the difference between it and SSE.

Recall that the SSE objective function for a network with one output node is $\phi ({\bf W}) = \sum_i [d_i - v_i ({\bf W})]^2$, where ${\bf d}$ are the observed data, ${\bf W}$ are the network weights, and $v_i ({\bf W})$ is the output of the neural network for the $i$'th observation, written as a function of the network weights.

Also recall that the density of a sample of $n$ IID Gaussian random variables with known variance $\sigma^2 = 1$ is $P({\bf d}) = (2 \pi)^{-n/2} \exp[-\frac{1}{2} \sum_i (d_i - \mu)^2 ]$. Thus the $-2$ log-probability for a Gaussian distribution is, up to an additive constant, the SSE; this is also called the deviance.

Therefore (my favourite pretentious word), a good way to make an objective function is to use the $-2$ log-probability of a distribution that is more representative of dichotomous data. Here I will use the Bernoulli distribution. The probability of a set of observed Bernoulli random variables with $p_i = v_i ({\bf W})$ is $\text{Prob}({\bf d}) = \prod_i [ v_i ({\bf W})]^{d_i} [1-v_i ({\bf W}) ]^{1-d_i}$. Thus the deviance, and our objective function, is $\phi({\bf W}) = -2 \sum_i (d_i \log[v_i ({\bf W})]+ [1-d_i] \log[1-v_i ({\bf W})] )$. The partial derivative (used in back propagation) for this objective function is $(\partial / \partial v_i)\, \phi({\bf W}) = -2 (d_i / v_i ({\bf W}) - [1-d_i]/[1-v_i ({\bf W})] )$.
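In code the deviance and its derivative are only a few lines. A Python sketch (the function names are mine, and the small clip on the predictions is an extra guard against $\log 0$ that the math above does not need):

```python
from math import log

def bernoulli_deviance(d, v, eps=1e-12):
    """-2 log-likelihood (deviance) of dichotomous data d under predictions v."""
    total = 0.0
    for di, vi in zip(d, v):
        vi = min(max(vi, eps), 1.0 - eps)  # clip to avoid log(0)
        total += di * log(vi) + (1.0 - di) * log(1.0 - vi)
    return -2.0 * total

def bernoulli_deviance_grad(d, v):
    """Partial derivative of the deviance with respect to each v_i."""
    return [-2.0 * (di / vi - (1.0 - di) / (1.0 - vi)) for di, vi in zip(d, v)]

print(round(bernoulli_deviance([1, 0], [0.9, 0.1]), 4))  # -4*log(0.9) ≈ 0.4214
```

Note that for $d_i = 1$ and $v_i = 0.5$ the derivative is $-2 (1/0.5) = -4$: the objective pushes $v_i$ upward toward the observed value.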

So what is the difference between the SSE and Bernoulli objective functions? Consider a data set with one observation, $d_1 = 1$. Figure 1 shows the two objective functions as a function of $v_1$ in the range $[0.001, 1]$. As can be seen, the SSE penalizes bad fits (relative to good fits) much less than the Bernoulli objective does. In particular, under the Bernoulli objective a fit with $v_1 = 0.01$ is meaningfully better than a fit with $v_1 = 0.001$, whereas the SSE barely distinguishes the two.
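The two points from Fig. 1 can be reproduced numerically; this quick Python sketch is not the plotting code, just the arithmetic for the single observation $d_1 = 1$:

```python
from math import log

d1 = 1.0
for v1 in (0.01, 0.001):
    sse = (d1 - v1) ** 2
    bern = -2.0 * log(v1)  # Bernoulli deviance reduces to -2 log(v1) when d1 = 1
    print(v1, round(sse, 4), round(bern, 2))
```

The SSE values differ by less than 0.02 while the Bernoulli deviances differ by about 4.6, which is the "penalizes bad fits much more" behaviour described above.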

That is it for now.

# Neural Nets and linearization; what seemed to be a good idea just does not work.

Hi all

The TLDR of this post is that I tried something and it did not work, but it seemed enough of a good idea that it is worth writing down in case anyone can improve it.

Those who looked at the code in the last post will have seen an unmentioned "mode 4" for getting the weights of a NN. That mode, or method, is to linearize the NN inverse problem (using the data to solve for the unknown weights).

To describe linearization, consider a classic linear inverse problem. Let $A$ be the sensitivity matrix ($n$ rows by $k$ columns), ${\bf d}$ be the data (a column vector of $n$ rows), and ${\bf m}$ be the unknown model parameters (a column vector of $k$ rows). Furthermore, let the data covariance matrix be known to be $I_n$ (the $n$ by $n$ identity matrix); i.e., the model errors ${\bf e}$ are independent Gaussians with variance equal to one. Thus ${\bf d} = A{\bf m} + {\bf e}$.

The MLE estimate for the parameters is the well known $\hat{\bf m} = (A' A)^{-1} A' {\bf d}$ and the posterior covariance matrix of ${\bf m}$ is $(A' A)^{-1}$. Note “prime” here denotes matrix transpose.
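As a sketch in R (with a made-up sensitivity matrix and simulated unit-variance errors, not anything from the post):

```r
set.seed(1)
A <- matrix(rnorm(20 * 3), nrow = 20, ncol = 3)  # sensitivity matrix: n = 20, k = 3
m <- c(2, -1, 0.5)                               # the "unknown" model parameters
d <- A %*% m + rnorm(20)                         # d = A m + e, with e ~ N(0, I_n)

m.hat    <- solve(t(A) %*% A) %*% t(A) %*% d     # MLE estimate of m
post.cov <- solve(t(A) %*% A)                    # posterior covariance of m
```

The same estimate is what `lm(d ~ A - 1)` returns, which makes a handy check.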

So how is this relevant to NNs?

The point is that the $i,j$th element of $A$ is the partial derivative of the $i$th observation with respect to the $j$th parameter. In back propagation those derivatives are calculated analytically. So it is possible to construct an approximation to $A$ that is denoted here as $B$ (I know, a creative naming scheme). The $i,j$th element of $B$ is $b_{i,j} = (\partial /\partial w_j ) d_{i}$, where the $w_j$ are the weights of the NN.

Thus it should (remember, this did not actually work in the end) be possible to make iterative estimates of ${\bf W}$ by ${ \bf W}_t = ({B'}_{t-1} B_{t-1})^{-1} {B'}_{t-1} {\bf d}$, but in practice I found that this caused the ${\bf W}_t$ to cycle over a group of bad (poorly fitting) values and more or less never converge.
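For intuition, here is the linearized iteration on a toy one-parameter model where it does behave (a sketch using the residual form of the update; the 49-weight NN version is the one that cycled):

```r
logistic <- function(x) 1 / (1 + exp(-x))

a <- seq(-2, 2, length.out = 20)
d <- logistic(1.5 * a)                    # noiseless data; true w = 1.5

w <- 0.1
for (t in 1:25) {
  p <- logistic(w * a)                    # current predictions
  B <- matrix(p * (1 - p) * a, ncol = 1)  # b_i = d(prediction_i) / dw
  w <- w + solve(t(B) %*% B, t(B) %*% (d - p))[1, 1]
}
w                                         # converges to 1.5 here
```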

I don’t have a good reason why this does not work, other than it looks like the NN gradients are just not smooth enough. It is also possible that my toy problem simply does not have enough data (asymptotically all posteriors become Gaussian as data is added; the proof of this is very similar to the central limit theorem). I have not tried starting at an optimal solution and seeing if this method stays there; I will be doing that soon because, regardless of linearization’s value in optimization, it is useful for initializing MCMC sampling.

That’s it for now.

# Some thoughts on neural nets; the importance of random in back-propagation.

Hi All

(this post is a bit rough; I am hoping to clean it up over the next few days, but I just wanted to get this down)

Artificial Neural Networks are a subject on which, traditionally, I have held views. Basically, I became very interested in them in the late 1990s/early 2000s, at which time they were more or less useless. As an undergrad I enjoyed Radford Neal‘s thesis; I did not really understand it. What I did understand was that I did not have access to the computing power required to do much. I also more or less learned through experience that NNs at that time were just not very good: Loess is better at fitting an arbitrary curve, ARIMA at short term forecasting, etc. Well, things seem to have changed (with strong GPU/CPU combos NNs now really are able to tackle important and difficult problems) so I need to change too. This is going to be the first of a few posts on Artificial Neural Networks. As usual I am going to approach the subject from a mathematical standpoint; as is my affliction, I am trying not to use any pre-packaged libraries. This particular post is going to focus on the staple of NNs, back propagation.

Neural Networks, like many regression strategies, require the estimation of unknown parameters. Back propagation (a special case of gradient descent) is an iterative way of estimating NN parameters for a given data set. It is in principle a deterministic algorithm; something that grates on the MCMC of my soul. This post considers some ways to incorporate random steps into back propagation and tests them on a simple toy problem.

Before exploring any of the modifications to back propagation it is useful to review it. And to do that it is necessary to create a notation for Neural Networks.  Let ${\bf d}$ be an observed data set (what is predicted by a NN), let $A$ be a feature matrix (the input of a NN), and let ${\bf w}$ be the unknown parameters of the NN.

Figure 1: The structure of an example Neural Network. The circles indicate nodes; they are labelled by their depth/rank and order in their rank. The lines indicate the weights (w); two of the weights are labelled. Weights are specified using their destination node and the rank order of their origin node.

The outputs of nodes “n 1 1” to “n 1 8” are the feature vectors. The outputs of nodes “n 2 1” to “n 2 4” are denoted $v_{2,1}$ to $v_{2,4}$ and the $k$‘th value can be calculated as

$$v_{2,k} = \frac{1}{1+\exp\left[ -\left( w_{2,k,b} + \sum_{j=1}^{8} w_{2,k,j} \, v_{1,j} \right) \right]}.$$

That is, the logistic of the dot product of the previous layer’s outputs plus the node bias $w_{2,k,b}$. As an aside, note that the use of the logistic function is purely out of mathematical convenience; any “s” shaped function (think a CDF) would work, it is just that the logistic has nice derivatives.
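The nice derivative in question is that the logistic can be differentiated in terms of its own output, $\sigma'(x) = \sigma(x)[1-\sigma(x)]$, which is exactly the quantity back propagation reuses. A quick numerical check:

```r
g.logit <- function(x) 1 / (1 + exp(-x))

x <- 0.7
h <- 1e-6
numeric.d  <- (g.logit(x + h) - g.logit(x - h)) / (2 * h)  # central difference
analytic.d <- g.logit(x) * (1 - g.logit(x))                # sigma * (1 - sigma)
c(numeric.d, analytic.d)                                   # the two agree
```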

The result can of course be generalized (but it made describing it a bit easier to show only one case first) as

$$v_{r,k} = \frac{1}{1+\exp\left[ -\left( w_{r,k,b} + \sum_{j} w_{r,k,j} \, v_{r-1,j} \right) \right]}.$$

The NN can thus be understood as a series of logistic regressions feeding into each other.

It is also important to consider the objective function [$\phi({\bf w})$] of the inversion. That is, how are two NNs evaluated with respect to a given data set? In this work the sum of squared errors is used; I hope to have a post exploring different objective functions soon. For the sum of squared errors, $\phi({\bf w}) = \sum_i (d_i - v_{4,1} )^2$.

Finally, before describing back propagation it is helpful to consider optimization in general. That is, given local information about an objective function, what are good choices to make in order to find the global best (smallest) value? First, it is probably a good idea to generally move in a direction that is “downhill” from your current location. Second, if your current location is very flat you might be near the global value (as the derivative will be zero at the lowest value), so it is advisable to take smaller steps. Third, any optimal point found might not be global, so it is worth investing some time into checking areas that are not near an observed extremum. Spoiler: back propagation does the first two very well but ignores the third.

Back propagation works by subtracting the gradient vector ($\Delta {\bf w}$) from the current weights; i.e., it takes a step downhill. In detail, ${\bf w}_t = {\bf w}_{t-1} - \eta \Delta {\bf w}_{t-1}$. The gradient vector can be found using the chain rule; it is the partial derivative of the objective function ($\phi$) with respect to each weight ($w_i$). These derivatives are actually quite easy to get (traditionally I am always trying to integrate things, so this was a nice change).

The point is that each term closer to the start of the network can have its gradient calculated by multiplying in one more derivative from the previous term. The only complication comes from possible multiple paths. For example consider $w_{1,1,1}$,

The summation in the fourth line is the weighting of the two paths from the input layer to the 4th layer. As networks become more complicated the number of paths increases; a good exercise is to add some more nodes in the second and third layers and find the derivatives.
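Hand-derived gradients like these are easy to get wrong (a dropped path is the classic mistake), so a generic central-difference check is worth keeping around; this utility is mine, not part of the appended code:

```r
# numerically approximate the gradient of any objective phi at w;
# compare its output against the analytic back-propagation gradient
num.grad <- function(phi, w, h = 1e-6) {
  g <- numeric(length(w))
  for (j in seq_along(w)) {
    e    <- rep(0, length(w))
    e[j] <- h
    g[j] <- (phi(w + e) - phi(w - e)) / (2 * h)
  }
  g
}

num.grad(function(w) sum(w^2), c(1, -2, 3))  # gradient of sum(w^2) is 2w
```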

The rest of this post considers three small modifications to back propagation and tests them on a small data set. The data set (${\bf d}$) has 40 observations. Each observation has eight features (represented by the matrix $A$) and represents a logic gate. The features are the two propositions $\alpha$ and $\beta$, the five logic gate types (“if-then”, “Contrapositive”, “Xor”, “Or”, “And”), and a negation term. The data are noiseless. The full tableau of the data is

given by the matrices A. and d. in the code appended at the end of this post.

The first modification is to iteratively reduce the step size ($\eta$) if $\phi_t > \phi_{t-1}$. That is, at each step $t$, while $\phi_t > \phi_{t-1}$, $\eta$ is halved. See the appended code; this is done using a while loop so many reductions may occur at a single step. The reason for this is that, since the gradient is known, there must be a small enough step size such that a step of that magnitude in the direction of the gradient improves the model. Thus if a step in the direction of the gradient worsens the model the step must be too big.

The second modification is to add random noise to the proposed model; i.e., ${\bf w}_t = {\bf w}_{t-1} + {\bf e}_t$. The ${\bf e}_{\star}$ are Gaussian random variables with mean zero and variance selected such that $\text{E}\left( |{\bf e}_t | \right) = 0.5 \eta |\Delta {\bf w}|$.

The third modification is to periodically add not just the small jitter of the second modification to the proposed weights, but also errors with a magnitude equal to $|{\bf w}_{t-1}|$. Here this additional randomization is done on 1/10th of the iterations.

In order to add some more interest to the test, two initial step sizes were used for the first modification. Thus there are five methods (denoted M1-M5) in total:

1. Standard back propagation with $\eta = 0.1$
2. Reduced step size with $\eta = 1$
3. Reduced step size with $\eta = 10$
4. Small random jitter with $\eta = 1$
5. Small random jitter and periodic large random jitter with $\eta = 1$

All methods were run 100 times. Each run used the same random starting weights for all methods and each run was capped at a maximum of 25,000 iterations. Optimizations were stopped before 25,000 iterations if the largest absolute error term was lower than 0.01.

The results of the inversions are shown in Table 1. In order to evaluate the efficiency of the five methods two values are considered; these are the % of models that managed to converge (that is, had maximum absolute misfit < 0.01 before 25,000 steps) and the average number of forward model calculations (evaluating the NN for data with the given weights; this is done once every time a step is evaluated) that were needed. It seems reasonable to assume an exponential distribution and thus adjust the average number of forward models by the number that are censored (took more than 25k steps).
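Concretely, the exponential/censoring adjustment is just the total number of forward models divided by the number of uncensored runs (the MLE of an exponential mean under right-censoring); a sketch with made-up counts:

```r
fm   <- c(5000, 12000, 25000, 25000, 8000)   # hypothetical forward-model counts
cens <- c(FALSE, FALSE, TRUE, TRUE, FALSE)   # TRUE = run hit the 25k cap

naive.mean <- mean(fm)              # 15000: biased low by the capped runs
adj.mean   <- sum(fm) / sum(!cens)  # 25000: censoring-adjusted average
```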

Table 1: The average number of forward models, convergence rates, and adjusted forward models for the 5 estimation methods.

The clear winner is M5, adding both the small and large jitter!

That is it for now

P.S.

The code used in this example. As usual I have tried to reduce the number of “black box” functions. Also note there was a failed 6th inversion method that I will talk about next week.

A.= matrix(c(0, 1,0,0,0,0, 1, 1,
0, 1,0,0,0,0, 0, 1,
0, 1,0,0,0,0, 1, 0,
0, 1,0,0,0,0, 0, 0,
0, 0,1,0,0,0, 1, 1,
0, 0,1,0,0,0, 0, 1,
0, 0,1,0,0,0, 1, 0,
0, 0,1,0,0,0, 0, 0,
0, 0,0,1,0,0, 1, 1,
0, 0,0,1,0,0, 0, 1,
0, 0,0,1,0,0, 1, 0,
0, 0,0,1,0,0, 0, 0,
0, 0,0,0,1,0, 1, 1,
0, 0,0,0,1,0, 0, 1,
0, 0,0,0,1,0, 1, 0,
0, 0,0,0,1,0, 0, 0,
0, 0,0,0,0,1, 1, 1,
0, 0,0,0,0,1, 0, 1,
0, 0,0,0,0,1, 1, 0,
0, 0,0,0,0,1, 0, 0,
1, 1,0,0,0,0, 1, 1,
1, 1,0,0,0,0, 0, 1,
1, 1,0,0,0,0, 1, 0,
1, 1,0,0,0,0, 0, 0,
1, 0,1,0,0,0, 1, 1,
1, 0,1,0,0,0, 0, 1,
1, 0,1,0,0,0, 1, 0,
1, 0,1,0,0,0, 0, 0,
1, 0,0,1,0,0, 1, 1,
1, 0,0,1,0,0, 0, 1,
1, 0,0,1,0,0, 1, 0,
1, 0,0,1,0,0, 0, 0,
1, 0,0,0,1,0, 1, 1,
1, 0,0,0,1,0, 0, 1,
1, 0,0,0,1,0, 1, 0,
1, 0,0,0,1,0, 0, 0,
1, 0,0,0,0,1, 1, 1,
1, 0,0,0,0,1, 0, 1,
1, 0,0,0,0,1, 1, 0,
1, 0,0,0,0,1, 0, 0)
, nrow=40, ncol=8, byrow=T)

colnames(A.) = c("Not", "And", "or", "Xor", "contra", "IfThen", "alpha", "beta")

#A.=rbind(A., A. + (runif(320)-0.5)/2,  A. + (runif(320)-0.5)/2,  A. + (runif(320)-0.5)/2 )

d. = matrix(c(1,0,0,0, 1,1,1,0, 0,1,1,0, 1,0,1,1, 1,1,0,1,
0,1,1,1, 0,0,0,1, 1,0,0,1, 0,1,0,0, 0,0,1,0), nrow= 40, ncol=1 )

g.logit <-function(x)
{
return(1/(1+exp(-x)))

}

g.phy <-function(w, A, d)
{
sigma41 = g.pred(w, A, d)
phy  = 0.5*sum((d-sigma41)^2)
return(phy)
}

g.pred <-function(w, A, d)
{
zeta21  = w[1]*A[,1] + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 =g.logit(zeta41)

return(sigma41)
}

g.phy.prime <-function(w, A, d)
{
# forward pass (identical to g.pred)
zeta21  = w[1]*A[,1]  + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 = g.logit(zeta41)

# backward pass; the "delta" at a node is d phy / d zeta for that node
delta4  = -(d - sigma41)*sigma41*(1-sigma41)
delta31 = delta4*w[47]*sigma31*(1-sigma31)
delta32 = delta4*w[48]*sigma32*(1-sigma32)
sig2    = cbind(sigma21, sigma22, sigma23, sigma24)

grad = rep(0, 49)
grad[47] = sum(delta4*sigma31)
grad[48] = sum(delta4*sigma32)
grad[49] = sum(delta4)
grad[41] = sum(delta31)
grad[46] = sum(delta32)
for (k in 1:4)
{
grad[36+k] = sum(delta31*sig2[,k])
grad[41+k] = sum(delta32*sig2[,k])
# two paths (through n 3 1 and n 3 2) feed each second-layer node
delta2k = (delta31*w[36+k] + delta32*w[41+k])*sig2[,k]*(1-sig2[,k])
for (j in 1:8)
{
grad[(k-1)*9+j] = sum(delta2k*A[,j])
}
grad[k*9] = sum(delta2k)
}

return(grad)
}

g.LinPro <-function(w, A, d)
{
# forward pass (identical to g.pred)
zeta21  = w[1]*A[,1]  + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 = g.logit(zeta41)

# A_hat is the matrix B of the post: A_hat[i,j] = d sigma41_i / d w_j
s4p  = sigma41*(1-sigma41)
d31  = s4p*w[47]*sigma31*(1-sigma31)
d32  = s4p*w[48]*sigma32*(1-sigma32)
sig2 = cbind(sigma21, sigma22, sigma23, sigma24)

A_hat = matrix(0, nrow=nrow(A), ncol=49)
A_hat[,47] = s4p*sigma31
A_hat[,48] = s4p*sigma32
A_hat[,49] = s4p
A_hat[,41] = d31
A_hat[,46] = d32
for (k in 1:4)
{
A_hat[,36+k] = d31*sig2[,k]
A_hat[,41+k] = d32*sig2[,k]
d2k = (d31*w[36+k] + d32*w[41+k])*sig2[,k]*(1-sig2[,k])
for (j in 1:8)
{
A_hat[,(k-1)*9+j] = d2k*A[,j]
}
A_hat[,k*9] = d2k
}

AtA     = t(A_hat)%*%A_hat
Q       = svd(AtA)
#print(Q)
D       = Q$d + 0.001 # stabilize eigenvalues before inverting
inv.D   = diag(1/D)
inv.AtA = Q$u%*%inv.D%*%t(Q$v)
#print(round(inv.AtA%*%AtA),3)

w.prime = inv.AtA%*%t(A_hat)%*%d
#print(w.prime)

return(w.prime)

}

NNplot<-function(w = w.old)
{

plot(1:4, 1:4, type="n", xlab= "", ylab="", ylim=c(0,9), xlim=c(0,5), xaxt="n", yaxt="n")

max.abs.w = max(abs(w))
w.raw = w

w  = 10*w/max(abs(w))
g.col = rep("blue", 49)
g.col[w < 0] = "red"

#print(w[1:8])

segments(x0=1, x1=2, y0=1, y1 = 3, lwd=abs(w[1]), col=g.col[1])
segments(x0=1, x1=2, y0=2, y1 = 3, lwd=abs(w[2]), col=g.col[2])
segments(x0=1, x1=2, y0=3, y1 = 3, lwd=abs(w[3]), col=g.col[3])
segments(x0=1, x1=2, y0=4, y1 = 3, lwd=abs(w[4]), col=g.col[4])
segments(x0=1, x1=2, y0=5, y1 = 3, lwd=abs(w[5]), col=g.col[5])
segments(x0=1, x1=2, y0=6, y1 = 3, lwd=abs(w[6]), col=g.col[6])
segments(x0=1, x1=2, y0=7, y1 = 3, lwd=abs(w[7]), col=g.col[7])
segments(x0=1, x1=2, y0=8, y1 = 3, lwd=abs(w[8]), col=g.col[8])

segments(x0=1, x1=2, y0=1, y1 = 4, lwd=abs(w[10]), col=g.col[10])
segments(x0=1, x1=2, y0=2, y1 = 4, lwd=abs(w[11]), col=g.col[11])
segments(x0=1, x1=2, y0=3, y1 = 4, lwd=abs(w[12]), col=g.col[12])
segments(x0=1, x1=2, y0=4, y1 = 4, lwd=abs(w[13]), col=g.col[13])
segments(x0=1, x1=2, y0=5, y1 = 4, lwd=abs(w[14]), col=g.col[14])
segments(x0=1, x1=2, y0=6, y1 = 4, lwd=abs(w[15]), col=g.col[15])
segments(x0=1, x1=2, y0=7, y1 = 4, lwd=abs(w[16]), col=g.col[16])
segments(x0=1, x1=2, y0=8, y1 = 4, lwd=abs(w[17]), col=g.col[17])

segments(x0=1, x1=2, y0=1, y1 = 5, lwd=abs(w[19]), col=g.col[19])
segments(x0=1, x1=2, y0=2, y1 = 5, lwd=abs(w[20]), col=g.col[20])
segments(x0=1, x1=2, y0=3, y1 = 5, lwd=abs(w[21]), col=g.col[21])
segments(x0=1, x1=2, y0=4, y1 = 5, lwd=abs(w[22]), col=g.col[22])
segments(x0=1, x1=2, y0=5, y1 = 5, lwd=abs(w[23]), col=g.col[23])
segments(x0=1, x1=2, y0=6, y1 = 5, lwd=abs(w[24]), col=g.col[24])
segments(x0=1, x1=2, y0=7, y1 = 5, lwd=abs(w[25]), col=g.col[25])
segments(x0=1, x1=2, y0=8, y1 = 5, lwd=abs(w[26]), col=g.col[26])

segments(x0=1, x1=2, y0=1, y1 = 6, lwd=abs(w[28]), col=g.col[28])
segments(x0=1, x1=2, y0=2, y1 = 6, lwd=abs(w[29]), col=g.col[29])
segments(x0=1, x1=2, y0=3, y1 = 6, lwd=abs(w[30]), col=g.col[30])
segments(x0=1, x1=2, y0=4, y1 = 6, lwd=abs(w[31]), col=g.col[31])
segments(x0=1, x1=2, y0=5, y1 = 6, lwd=abs(w[32]), col=g.col[32])
segments(x0=1, x1=2, y0=6, y1 = 6, lwd=abs(w[33]), col=g.col[33])
segments(x0=1, x1=2, y0=7, y1 = 6, lwd=abs(w[34]), col=g.col[34])
segments(x0=1, x1=2, y0=8, y1 = 6, lwd=abs(w[35]), col=g.col[35])

segments(x0=2, x1=3, y0=3, y1 = 4, lwd=abs(w[37]), col=g.col[37])
segments(x0=2, x1=3, y0=4, y1 = 4, lwd=abs(w[38]), col=g.col[38])
segments(x0=2, x1=3, y0=5, y1 = 4, lwd=abs(w[39]), col=g.col[39])
segments(x0=2, x1=3, y0=6, y1 = 4, lwd=abs(w[40]), col=g.col[40])

segments(x0=2, x1=3, y0=3, y1 = 5, lwd=abs(w[42]), col=g.col[42])
segments(x0=2, x1=3, y0=4, y1 = 5, lwd=abs(w[43]), col=g.col[43])
segments(x0=2, x1=3, y0=5, y1 = 5, lwd=abs(w[44]), col=g.col[44])
segments(x0=2, x1=3, y0=6, y1 = 5, lwd=abs(w[45]), col=g.col[45])

segments(x0=3, x1=4, y0=4, y1 = 4.5, lwd=abs(w[47]), col=g.col[47])
segments(x0=3, x1=4, y0=5, y1 = 4.5, lwd=abs(w[48]), col=g.col[48])

symbols(x=1, y=1, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=2, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=3, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=6, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=7, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=8, circles=0.5 , add=T, inches = .25, bg ="grey")

text(x=rep(1,8), y= 1:8, labels = c("Not", "And", "or", "Xor", "contra", "IfThen", "alpha", "beta") )

symbols(x=2, y=3, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=6, circles=0.5 , add=T, inches = .25, bg ="grey")

text(x=rep(2,4), y= 3:6, labels = round(c(w.raw[9], w.raw[18], w.raw[27], w.raw[36]),3) )

symbols(x=3, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=3, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
text(x=rep(3,2), y= 4:5, labels = round(c(w.raw[41], w.raw[46]),3) )

symbols(x=4, y=4.5, circles=0.5 , add=T, inches = .25, bg ="grey")
text(x=4, y= 4.5, labels = round(c(w.raw[49]),3) )

rang.w   = rev(seq(- max.abs.w, max.abs.w, length.out = 6 ))
rang.lwd = abs(seq(-10, 10, length.out = 6 ))
rang.col = rev(c(rep("red",3), rep("blue",3)))
legend("topright", col= rang.col, legend = round(rang.w, 3), lwd=rang.lwd, title = "line - w" )

}

NN2 <-function(w = (runif(49)-0.5)*10 , A=A., d=d., N= 10000, mod = 1  )
{

#print(w)
phy = g.phy(w, A,d)
#print(phy)

rec = c(phy, w)

phy.best = phy
w.best   = w

fm        = 0
phy.last  = phy
i.last    = 1
w.last    = w

for ( i in 1:N)
{

#update the weights
if (mod == 1 )
{
eta       = 0.1
# take a step downhill along the gradient
grad      = g.phy.prime(w, A, d)
w.new     = w - eta*grad
#new misfit
phy.new = g.phy(w.new, A,d)
}
if (mod == 2 )
{
eta       = 1
grad      = g.phy.prime(w, A, d)
w.new     = w - eta*grad
#new misfit
phy.new = g.phy(w.new, A,d)
}

if (mod == 3 )
{
eta       = 10
grad      = g.phy.prime(w, A, d)
w.new     = w - eta*grad
#new misfit
phy.new = g.phy(w.new, A,d)
}

if (mod == 4 )
{
eta       = 1
w.new     =eta*g.LinPro(w, A, d)+(1-eta)*w
#new misfit
phy.new = g.phy(w.new, A,d)
}

if (mod == 5 )
{
eta       = 1
grad      = g.phy.prime(w, A, d)
# jitter magnitude chosen so E(|j|) is about 0.5*eta*|grad| (see text)
mag.grad  = 0.5*eta*sqrt(sum(grad*grad))
j = rnorm(49, sd = mag.grad /sqrt(49))
#j = rnorm(49, sd = 1)

#print(sqrt(sum(j*j)))
w.new     = w - eta*grad + j
#new misfit
phy.new = g.phy(w.new, A,d)

}

if (mod == 6 )
{
eta       = 1
grad      = g.phy.prime(w, A, d)
mag.grad  = 0.5*eta*sqrt(sum(grad*grad))

r = runif(1)
if( r <= 0.1)
{
# periodic large jitter, with magnitude comparable to |w|
mag.w = sqrt(sum(w*w))
j = rnorm(49, sd = mag.w /sqrt(49))
}
if( r > 0.1)
{
j = rnorm(49, sd = mag.grad /sqrt(49))
}

#j = rnorm(49, sd = 1)

#print(sqrt(sum(j*j)))
w.new     = w - eta*grad + j
#new misfit
phy.new = g.phy(w.new, A,d)

}

k = 0
fm = fm + 1

# check  that new is better
while( (phy.new > phy)& (k <= 100) )
{
# normal back prop step shrink
if(mod <= 3)
{
eta = eta/2
#print(paste("eta", eta))
w.new = w - eta*grad
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

# Linearized step
if (mod == 4 )
{
eta       = eta/2
w.new     =eta*g.LinPro(w, A, d)+(1-eta)*w
#new misfit
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

if (mod == 5 )
{
#print("here")
eta       = eta/2
mag.grad  = 0.5*eta*sqrt(sum(grad*grad))
j = rnorm(49, sd = mag.grad /sqrt(49))
#j = rnorm(49, sd = 1)

w.new     = w - eta*grad + j
#new misfit
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

if (mod == 6 )
{
#print("here")
eta       = eta/2
mag.grad  = 0.5*eta*sqrt(sum(grad*grad))

r = runif(1)
if( r <= 0.1)
{
mag.w = sqrt(sum(w*w))
j = rnorm(49, sd = mag.w /sqrt(49))
}
if( r > 0.1)
{
j = rnorm(49, sd = mag.grad /sqrt(49))
}

#j = rnorm(49, sd = 1)

w.new     = w - eta*grad + j
#new misfit
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

k = k+ 1

if(k == 50)
{print("k==50!")}

if(k == 100)
{print("k==100!")}
}
phy = phy.new
w = w.new

if(phy < phy.best)
{
phy.best = phy
w.best = w
}

# add the model to the
rec = rbind(rec, c(phy,w))
rownames(rec)= NULL
sigma41 = g.pred(w, A, d)
r = runif(1)
if (r <= 0.001)
{
#print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
#print(phy)
#print(paste("delta log phy = ", log(phy.last) - log(phy) ))
#print(paste("max abs error", max(abs(d-sigma41))) )
#if ( log(phy.last) - log(phy) >= 0.5 )
#    {
#    quartz()
#    par(mfrow=c(2,1), mar=c(4,4,1,1))
#    NNplot(w.last)
#    plot( log(rec[1:i.last,1] ), pch=".", xlab = "Index", ylab = "Phy")
#    print("------------------> new plot <---------------------------------")
#    }
phy.last = phy
w.last = w
i.last = i

}

if(max(abs(d-sigma41)) <= 0.01 )
{
#print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
# converged: return the forward-model count and a success flag,
# matching the c(fm, 0) returned when the iteration cap is reached
return(c(fm, 1))

}

}

return(c(fm, 0))
}
w.start = (runif(49)-0.5)
rec1  = NN2(  mod = 6,   w = w.start,      N= 25000,A= A.,  d = d.)

#NNplot(w.start) # visualize a set of weights (NN2 itself returns c(forward models, converged))

out = matrix(0, nrow=100, ncol = 10 )
for(i in 1:100)
{
print(i)
w.start = (runif(49)-0.5)
rec1  = NN2(  mod = 1,   w = w.start,      N= 25000,A= A.,  d = d.)
rec2  = NN2(  mod = 2,   w = w.start,      N= 25000,A= A.,  d = d.)
rec3  = NN2(  mod = 3,   w = w.start,      N= 25000,A= A.,  d = d.)
#rec4  = NN2(  mod = 4,   w = w.start,      N= 25000,A= A.,  d = d.)
rec5  = NN2(  mod = 5,   w = w.start,      N= 25000,A= A.,  d = d.)
rec6  = NN2(  mod = 6,   w = w.start,      N= 25000,A= A.,  d = d.)
out[i, 1] = rec1[1]
out[i, 2] = rec2[1]
out[i, 3] = rec3[1]
out[i, 4] = rec5[1]
out[i, 5] = rec6[1]

out[i, 6] = rec1[2]
out[i, 7] = rec2[2]
out[i, 8] = rec3[2]
out[i, 9] = rec5[2]
out[i,10] = rec6[2]

print(i)
print(apply(out, 2, mean))
print("--------------------------")

}



# Clustering and other news

Hi All

It has been a while. Mostly I have just been focused on other projects. In particular, I wrote a paper on passive churn that was rejected (a first for me). So I am trying to wash the bitter taste of failure out of my mind with this post.

An interesting problem that has recently been on my mind is clustering users so that we can use qualitative methods to understand better what they want from our product.

I do not have a lot of experience with clustering, so I started by looking at a few generic algorithms: non-negative matrix factorization (NNMF) and K-means. In exploring these I more or less reinvented expectation maximization (although, to be fair, I have not researched it deeply, so I might have reinvented a janky, inefficient EM clustering algorithm).

The concerns that I had with both NNMF and k-means are that they do not include prior knowledge, they are deterministic and only find local optima (i.e., they are extremely sensitive to the starting model), and they have no inherent ability to determine the number of clusters. My algorithm (see how I took ownership of EM; let’s call my algorithm GEM… because she is truly outrageous) addresses all of these concerns (at least partly) and is demonstrated on a racist’s flower data set.

The algorithm must be initialized with prior knowledge (even if it is uninformative) of cluster group memberships and an assumed known number $k$ of clusters. GEM is then iterative and has two steps:

1. Determine the cluster membership $M$ from the current cluster attributes [$Q = (\mu_1, \mu_2, ... \mu_k, \Sigma_1, ... \Sigma_k)$ ].
2. Update the estimate of $Q$ given $M$.

Steps one and two are repeated until the algorithm converges. Also, GEM is run at several temperatures simultaneously to help with convergence. The chains at each temperature are allowed to interact using standard REM (replica exchange) methods.

In more detail:

$M$ is a matrix with $n$ rows (where $n$ is the number of observations) and $k$ columns. The $M_{i,j}$ is the probability that observation $i$ is in group $j$.

Where

${\bf d}$ is the matrix of observations, $\hat{P}_{i,j}$ is the prior probability of observation $i$ being in cluster $j$, and $\alpha_0$ is the weight of the prior. For the initialization of the algorithm $\alpha_0 = 1$. At each step of the GEM iteration each observation is randomly assigned to a cluster with the corresponding probability from $M$.

For step two the $\mu$‘s and $\Sigma$‘s must be estimated. For $\Sigma_j$ I just use the data covariance matrix of the observations assigned to cluster $j$. For the $\mu$‘s a modification to the standard Bayesian posterior distribution is calculated and then sampled from. The purpose of sampling instead of taking the MAP is to add additional variation to the process, to help navigate out of local minima toward a global estimate (I have no proof this actually works but it seems like a good idea). The modified posterior distribution is

where $n_j$ is the number of observations assigned to cluster $j$ and $t$ is the temperature of the chain. (And yes, I am using an improper prior; yes, it is philosophically inconsistent; I don’t care!) And repeat!
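For one cluster, the draw of $\mu_j$ amounts to sampling from $N(\bar{\bf x}_j,\; t\,\Sigma_j / n_j)$; here is a base-R sketch using a Cholesky factor (the appended code uses mvtnorm::rmvnorm for the same thing; the data below are made up):

```r
set.seed(2)
X <- matrix(rnorm(50 * 2, mean = 1), nrow = 50, ncol = 2)  # obs assigned to cluster j
mu.hat <- colMeans(X)   # sample mean of the cluster
S      <- cov(X)        # Sigma_j from the assigned observations
t.temp <- 0.5           # chain temperature; at t = 0 the draw collapses to mu.hat

R <- chol(t.temp * S / nrow(X))                  # upper-triangular factor
mu.draw <- mu.hat + as.vector(t(R) %*% rnorm(2)) # draw from N(mu.hat, t*S/n)
```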

The code for my example is appended at the bottom of the post:

Now let’s see how it does with the irises.

Figure 1: Fisher iris data, color indicates iris species

For this data set looking at petal length vs petal width is likely sufficient to see all clusterings and allows for much simpler plots.

Figure 2: Fisher iris data, with GEM cluster centroid paths

From Fig. 2 it is fairly clear that the GEM algorithm is able to converge to reasonable centroids for the three species of iris. Figure 3 shows the membership probabilities of the clusters. In almost all cases the observations from a single species are grouped into the same cluster and are decisively more likely to be part of that cluster.

Figure 3: The log probability of cluster membership for the 3 clusters. The colors indicate the true iris species

To help interpret the clusters it is useful to do two things (or more, but I like these two): first, correlate the membership probabilities ($M_{\star, j}$) of the clusters with the observation dimensions, and second, plot the membership probabilities as a function of the observation dimensions. To that end, Tab. 1 shows the correlation of the clusters with the observation dimensions. Thus cluster 3 is the smallest in sepal length, petal length and petal width but largest in sepal width. Cluster 2 is the opposite. Cluster 1 is a less extreme version of cluster 2.

Table 1: Correlation between the ${\bf d}_j$ and $M_{\star, k}$

Because correlations hide a wealth of non-linear information, the second step of looking at the scatter plots of cluster membership against the data attributes is also helpful. Here a smoothing trend line is added by regressing $\log( - \log[M_{\star, k}] )$ on ${\bf d}_j$. The complementary log-log transform is used to ensure that the smooth is always between 0 and 1.
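The transform-fit-invert step looks like this (a sketch with fabricated membership probabilities M and a single covariate x):

```r
x <- seq(0, 1, length.out = 100)
M <- plogis(6 * x - 3)               # stand-in membership probabilities in (0, 1)

fit      <- lm(log(-log(M)) ~ x)     # regress on the complementary log-log scale
M.smooth <- exp(-exp(fitted(fit)))   # invert the transform: always in (0, 1)

range(M.smooth)
```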

Figure 4: Smooths regressing $M_{\star, k}$ on ${\bf d}_j$. Each row gives membership in a cluster and each column gives a dimension of the observed data. The black lines give a lowess smooth through the data.

Finally, my last concern was how to know the number of clusters in a data set. This can be accomplished naturally by using the BIC. The BIC is a selection criterion that weighs model fit against model complexity. Here $BIC = (4+4+3+2+1) \cdot g \cdot \log(150) - 2 \log(L)$; that is, each cluster has 4 parameters for the mean and 10 for the covariance matrix. Frustratingly (yes, I write these posts as I go sometimes, instead of doing all the work first and then writing it up), when I got here the BIC was optimized for 4 groups in the data instead of 3. That more or less destroys my ending to the post, but I think it is quite interesting; I am still thinking about why BIC does not work here and don’t have a good explanation.
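Written out, with the 14 parameters per cluster (4 for the mean and $4+3+2+1 = 10$ for the symmetric covariance matrix):

```r
# BIC for a g-cluster model on n = 150 observations; logL is the maximized
# log-likelihood (sum(log(P.norm)) in the appended code)
bic <- function(logL, g, n = 150, params.per.cluster = 14) {
  params.per.cluster * g * log(n) - 2 * logL
}

bic(logL = -180, g = 3)  # compute for each candidate g and keep the smallest
```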

Anyway, that is it for now. Hopefully I will be able to write more of these in the coming weeks.

library(mvtnorm) # for dmvnorm / rmvnorm

V = iris[,1:4]
n = 150 # number of data
g = 3 # the number of groups

P = seq(0, 1, length.out = g + 1  )
P = P[2:(g+1)]

P.1 = c(0.8, 0.9,1)
P.2 = c(0.1, 0.9,1)
P.3 = c(0.1, 0.2,1)

MEM.P = matrix(P, nrow= n, ncol= g, byrow=T)
#MEM.P = rbind(matrix(P.1, nrow= 50, ncol= g, byrow=T),
#	          matrix(P.2, nrow= 50, ncol= g, byrow=T),
#	          matrix(P.3, nrow= 50, ncol= g, byrow=T))

MEM.P.0 = MEM.P
alpha = 0.01

plot(V, col=c(rep("red", 50), rep("blue", 50), rep("green",50)))

L = list()
ref = cbind(
mu.1 = apply(V[1:50, ],   2, mean),
mu.2 = apply(V[51:100,],  2, mean),
mu.3 = apply(V[101:150,], 2, mean))

t.list = c(0, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 4)
MEM.P.list = list()
for (i in 1:10)
{
MEM.P.list[[i]] = MEM.P
}
prob.mod.list = rep(-Inf, 10)

P.best = -Inf
MEM.P.best = MEM.P

nt = length(t.list)
N= 2000
prob.mod = 1:N
prob.mod.best = 1:N
for (i in 1:N)
{
print(i)

for (k in 1:10)
{
MEM.P = MEM.P.list[[k]]
t = t.list[k]

r= runif(150)

MEM = rep(g, 150)
for ( j in (g-1):1)
{
MEM[r <= MEM.P[,j]] = j
}

P.g  = list()                           # per-cluster density of each observation
mu.g = matrix(0, nrow=g, ncol=ncol(V))  # record of the cluster centroids
for ( j in 1:g)
{
V.temp     = V[MEM==j,]
mu.temp    = apply(V.temp, 2, mean)
Sigma.temp = cov(V.temp)
n.temp     = nrow(V.temp)
mu.g[j,]   = mu.temp

if (t > 0.001 )
{
mu.temp    = as.vector(rmvnorm(1, mean= mu.temp, sigma = t*Sigma.temp/(n.temp)  ))
}

if(i == 1)
{
print(mu.temp)
print(Sigma.temp)
}

P.g[[j]]   = dmvnorm(V, mean= mu.temp, sigma = Sigma.temp)
}

P.norm = rep(0, n)
for(j in 1:(g))
{
P.norm = P.norm + P.g[[j]]
}

MEM.P[,1] = P.g[[1]]
for(j in 2:(g-1))
{
MEM.P[,j] = MEM.P[,j-1] + P.g[[j]]
}
MEM.P[,1:(g-1)] =(1-alpha)*MEM.P[,1:(g-1)]/ P.norm + alpha*MEM.P.0[,1:(g-1)]
MEM.P.list[[k]] = MEM.P

P.mod.temp =  sum(log(P.norm))
prob.mod.list[k] = P.mod.temp

if (P.mod.temp  >= P.best )
{
P.best = P.mod.temp
MEM.P.best = MEM.P

}

if (k == 1)
{
L[[i]] = mu.g
prob.mod[i] = P.mod.temp
prob.mod.best[i] = P.best

}

}
r = runif(1) # for REM step
if(r <= 0.1)
for (k in 1:100)
{
SAMP   = sample(1:10, 2)
index1 = SAMP[1]
index2 = SAMP[2]
t1  = t.list[index1]
t2  = t.list[index2]
LL1 = prob.mod.list[index1]
LL2 = prob.mod.list[index2]
MEM.P1  = MEM.P.list[[index1]]
MEM.P2  = MEM.P.list[[index2]]
A = exp(  (LL2 - LL1) * (1/t1 - 1/t2) ) # acceptance probability for the swap
r = runif(1)
if(index1 == 1)
{
print("-----------> temp zero is 1 <----------")
print(LL1)
print(LL2)
print(t1)
print(t2)

print(A)

}

if(r <= A)
{

MEM.P.list[[index1]] = MEM.P2
MEM.P.list[[index2]] = MEM.P1
prob.mod.list[index1] = LL2
prob.mod.list[index2] = LL1

}

}
print(t(prob.mod.list))

}



# An introduction or “vulgarization” on Churn

Hi all, my friend and former colleague Alex has asked me to write a short “vulgarization” on my methods for modelling churn. This will likely be a review for all my loyal readers; however, summarizing and clarifying is rarely a waste of time.

Before continuing I should just add that Alex’s blog has had some very insightful posts about churn. In general, I think he is very good (definitely better than myself) at describing the qualitative, non-mathematical ideas of business intelligence.

On to vulgarization:

In short, I model login velocity (time between logins) using survival analysis. This approach solves the two primary difficulties of modelling churn on mobile products. The first is that churn is passive, not active; i.e., users do not cancel their subscription, they simply stop returning. The second is that users have had different lengths of time to return; e.g., if user A stopped playing 6 months ago they have had more time to return than user B who stopped 1 week ago.

How does a survival analysis approach address these problems? First, with regards to the passive nature of churn: by modelling the time until return, the passive non-action of not returning is replaced by the active action of returning. The second problem, how to account for each user having a different length of time to return in, is handled through the use of data censoring.

Data censoring is probably the only tricky part of using this method; to describe it, consider the problem of measuring phone call durations (trust me, the alternative examples are far more depressing). If right now a large phone company were to pull the durations of all phone calls using its network it could get an estimate; however, many phone calls would be ongoing at the time it pulled the information, and importantly, some of the ongoing calls might be the very longest. Thus for phone calls that have already ended there is a measurable duration, but for ongoing phone calls all that is known is that they have not ended yet (in other words, that they have lasted at least this long). The way survival analysis uses this censored information can get a bit mathy. A more common-sense way of thinking about it: if you are calculating the percentage of users that have not returned by some length of time from their previous session, then censored users count in both the numerator and the denominator before their censor time, and in neither after it.
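To make the numerator/denominator description concrete, here is a minimal Kaplan-Meier-style calculation. This is a Python sketch for illustration only (the actual analysis later in this post uses R), and the three subjects are made up:

```python
# Minimal Kaplan-Meier estimator: a censored observation stays in the
# risk set (the denominator) up to its censor time and is then dropped,
# without ever counting as an event.
def kaplan_meier(times, events):
    """times: duration for each subject; events: 1 = observed, 0 = censored."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    s = 1.0
    curve = []  # (time, survival estimate) at each distinct event time
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = sum(1 for tt, e in pairs if tt == t and e == 1)
        removed = sum(1 for tt, e in pairs if tt == t)
        if deaths > 0:
            s *= 1 - deaths / n_at_risk
            curve.append((t, s))
        n_at_risk -= removed
        i += removed
    return curve

# Two subjects return after 2 and 5 days; one is censored at 3 days.
curve = kaplan_meier([2, 3, 5], [1, 0, 1])
```

Note how the censored subject inflates the risk set at day 2 (so the first drop is to 2/3) but has vanished from both numerator and denominator by day 5.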

In mobile churn the censor time, just as in the phone example, is the current time. For users that have already returned there is a defined time between logins; for users that have not yet returned all that is known is that they have not returned by now (the current time). The main difference in this application of survival analysis is that, unlike in most conventional problems, most of our observations will be censored rather than observed.

The SQL code to pull the inter arrival times should look something like this:

SELECT
player_id,
DATETIME_DIFF(COALESCE(NEXT_DATE, EOT), LOGIN_DATE, DAY) as EVENT_TIME, -- the time between events, or between login and censor date
CASE WHEN NEXT_DATE IS NULL THEN 0 ELSE 1 END as EVENT -- 0 if the event is censored
FROM
(
SELECT
player_id,
LOGIN_DATE,
max(LOGIN_DATE) over (PARTITION BY player_id ORDER BY LOGIN_DATE rows between 1 following and 1 following) as NEXT_DATE, -- get next login
max(LOGIN_DATE) over () as EOT,    -- get time at end of data set
ROW_NUMBER() OVER (PARTITION BY DATETIME_TRUNC(LOGIN_DATE, MONTH) ORDER BY RAND()) as index -- this query can't be aggregated so we sample the observations
FROM logins -- substitute your login-events table here
)
WHERE index <= 5000 -- 5k observations per month

Note that this query was written assuming we are interested in cohorting the users by last-contact month. If a different attribute is of interest it should be used instead.

In order to actually conduct the survival analysis I use R, though in some parametric cases it is possible to do so in SQL.
The two most important survival analysis methods are Kaplan Meier plots and the Cox proportional hazards model. Describing the nuts and bolts of these probably falls outside the scope of a “vulgarization” like this, so I will just leave you with the R code for them (note you will need to load the survival package for this code to work) and a short description of how to interpret the results.

Before starting let's simulate some data.

########## simulate data
d.g1 = rexp(1000, rate = 1/720)
C.g1 = rev(seq(1,365, length.out = 1000 ))
E.g1 = rep(1,1000)
E.g1[d.g1 >= C.g1] = 0
d.g1[d.g1 >= C.g1] = C.g1[d.g1 >= C.g1]

d.g2 = rexp(1000, rate = 1/180)
C.g2 = rev(seq(1,365, length.out = 1000 ))
E.g2 = rep(1,1000)
E.g2[d.g2 >= C.g2] = 0
d.g2[d.g2 >= C.g2] = C.g2[d.g2 >= C.g2]

d.g3 = rweibull(1000, shape =1/2, scale = 180)
C.g3 = rev(seq(1,365, length.out = 1000 ))
E.g3 = rep(1,1000)
E.g3[d.g3 >= C.g3] = 0
d.g3[d.g3 >= C.g3] = C.g3[d.g3 >= C.g3]

event.time = c(d.g1, d.g2, d.g3)
event.result = c(E.g1, E.g2, E.g3)
Group = c(rep(1, 1000), rep(2, 1000), rep(3, 1000))

Here the “Group” field of the data frame would be equivalent to the last-contact-month cohort from the SQL query.

########################## Kaplan Meier plot with groups
km = survfit(Surv(time=event.time , event=event.result )~factor(Group))
plot(km,
col=c("red", "blue", "green"),
lwd=3,
mark.time=TRUE,
mark=3,
xlim=c(0, 365),
xlab = "time (days)", ylab = "S(d)")



This produces Fig. 1, a Kaplan Meier plot that shows the empirical survival function (how many people have not returned by the indicated number of days) for the simulated data. The crosses are times at which an observation was censored (i.e., hit the ever-moving now).

Figure 1

The next step is to fit a Cox proportional hazards model to the data and plot the predicted (modelled) survival functions.

###################################### cox model
mod = coxph(Surv(time=event.time , event=event.result )~factor(Group))
km = survfit(Surv(time=event.time , event=event.result )~factor(Group))
plot(km, col=c("red", "blue", "green"), lwd=3,
mark.time=FALSE, mark=3,
xlim=c(0, 365),
xlab = "time (days)", ylab = "S(d)")
lines(survfit(mod, newdata = data.frame(Group = 1:3)),
col=c("red", "blue", "green"), lty=2,
mark.time=FALSE )

Figure 2: Kaplan Meier plot (solid lines) with Cox estimated survival functions (dashed lines)

Really that is it for now, but one final point: in this example the Cox model is not a good choice because the groups do not have proportional hazards (the Kaplan Meier plots cross).

Cheers Gavin.

P.S. if you are new to my blog (or have been here for a long time) please ask questions in the comments.

# Estimating Passive Churn or Lifetime Conversion.

Hi All

This week I have been thinking about an idea that seems to have enormous potential but that I have not yet been able to make work with real data. For churn analysis the basic question is what percentage of users will never return. For conversion (making an in-app purchase) the question is what percentage will ever make a purchase. In both cases it is common to create a window of time, say 7 or 14 days, and then ask the simpler questions “what percentage of users will not return within 7 days?” or “what percentage of users will have converted by 14 days after install?”, and to model this with a probit or logit regression. On this blog I have been advocating the use of survival analysis to instead estimate the time until return, or the time from install until conversion. While both methods can provide useful information, they both fail to answer the question “given forever, what percentage will have converted (or churned)?”. To address this I have been considering cumulative survival functions $S(t)$ that do not have the property that $\lim_{t\to\infty} S(t) = 0$. For example, in the time-to-conversion case $\lim_{t\to\infty} S(t) = 1-P$ where $P$ is the probability that a user will convert given forever. In the time-to-return example, $P$ is the probability that a user will ever return.

Before getting into the challenges with this idea, first I want to talk about what has been working. Namely, I have been able to derive a parametric likelihood function for lifetime conversion or passive churn data. To get the likelihood [$L(t| {\bf m})$] of an observation $t_i$ with a survival function that does not asymptote to zero there are 3 steps. The first is to define the survival function, the second is to use the survival function to derive a hazard function [$h(t)$], and the third is to “build” the likelihood from the survival function, hazard function, and censoring information ($d_i$).

Step 1:
If $\hat{S}(t)$ is a conventional survival function (i.e., one that asymptotes to zero) then $S(t) = P\hat{S}(t) + (1-P)$ is a survival function that asymptotes to $(1-P)$. Here by survival function I mean a function that is positive, real, and monotonically decreasing. What is the underlying distribution of $S(t)$? It has a point mass of probability $(1-P)$ at $\infty$; otherwise it is the scaled density $P\hat{f}(t)$, where $\hat{f}(t)$ is the density that underlies $\hat{S}(t)$.
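A quick numeric sketch of this plateau, in Python with a hypothetical exponential $\hat{S}(t)$ and made-up values $P = 0.4$ and a 30-day mean:

```python
import math

P = 0.4          # probability the user ever converts (hypothetical value)
lam = 1 / 30.0   # rate of the underlying exponential (hypothetical value)

def S_hat(t):    # conventional survival function: asymptotes to 0
    return math.exp(-lam * t)

def S(t):        # "defective" survival function: asymptotes to 1 - P
    return P * S_hat(t) + (1 - P)

# S starts at 1 and flattens out at 1 - P = 0.6 instead of falling to zero.
print(S(0.0), S(10_000.0))   # -> 1.0 0.6
```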

Step 2:
To derive the hazard function recall that $h(t) = -S'(t)/S(t)$. Thus  $h(t) = P \hat{f}(t)/ [P\hat{S}(t) + (1-P)]$.

Step 3:
To get the likelihood for observed and right-censored data, again recall (with all these recollections is this blog really needed?) that the probability of an observed event is $f(t) = h(t) S(t)$ and the probability of a right-censored event is $S(t)$. Thus if $d_i$ is 1 if and only if the $i$th observation has an observed event time and zero otherwise, then the likelihood can be written as $L({\bf t }, {\bf d} | {\bf m}) = \prod_i h(t_i)^{d_i} S(t_i)$.
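The three steps can be collected into a log-likelihood. Here is a Python sketch under the (hypothetical) assumption that $\hat{f}$ is exponential with rate `lam`:

```python
import math

def log_lik(times, d, P, lam):
    """Log-likelihood of the defective-survival model with an exponential
    underlying density; d[i] is 1 for observed events, 0 for censored ones."""
    ll = 0.0
    for t_i, d_i in zip(times, d):
        S_hat = math.exp(-lam * t_i)           # conventional survival function
        f_hat = lam * S_hat                    # its density
        S = P * S_hat + (1 - P)                # step 1: defective survival function
        h = P * f_hat / S                      # step 2: hazard function
        ll += d_i * math.log(h) + math.log(S)  # step 3: h(t)^d * S(t) per observation
    return ll
```

A handy sanity check: with $P = 1$ this reduces to the ordinary exponential survival likelihood, and a censored observation contributes only $\log S(t_i)$.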

What has not worked very well so far is picking a parametric distribution for $\hat{f}(t)$. I have tried the usual suspects (the exponential, Weibull, and gamma distributions); however, those have not adequately described my data. To estimate the model parameters (${\bf m}$; I know it is sloppy to only define these now) I have used MCMC. It may be that I need a better way of describing the prior distribution of ${\bf m}$ (currently I am using bounded uniforms) and that this is disrupting my estimates, though in that case I am surprised that the MLE point estimate is so bad. In any event, stay tuned as I improve this process.

Cheers

# A short note on risk adjusted Survival functions

Hi all

First, no, this post does not mean that I have finished writing my textbook. In fact, truth be told, I have made very little progress since my last post. I choose to highlight that I have given four talks on data science in that time, so call it a 1-1 tie?

Secondly, I want to mention that my former colleague Alexandre Macmillan has also started a blog on business analytics. Alex is less mathy than me; however, he is extremely insightful when considering user behavior. If he applies to your company, just hire him.

Now onto risk adjusted survival functions…

As noted before, survival analysis is great for modelling both churn and conversion, and Cox proportional hazards (Cox PH) is a great way to do survival analysis. However, Cox PH models the relative hazards of members of a population, while the common way to plot (and think about?) survival data is the empirical survival function. This post addresses how to integrate these two methods.

Before continuing I should note that R does this auto-magically with the “survfit” function:

# make some simulated data
d.g1 = rexp(1000, rate = 1/720)
C.g1 = rev(seq(1,365, length.out = 1000 ))
E.g1 = rep(1,1000)
E.g1[d.g1 >= C.g1] = 0
d.g1[d.g1 >= C.g1] = C.g1[d.g1 >= C.g1]

d.g2 = rexp(1000, rate = 1/180)
C.g2 = rev(seq(1,365, length.out = 1000 ))
E.g2 = rep(1,1000)
E.g2[d.g2 >= C.g2] = 0
d.g2[d.g2 >= C.g2] = C.g2[d.g2 >= C.g2]

d.g3 = rweibull(1000, shape =1/2, scale = 180)
C.g3 = rev(seq(1,365, length.out = 1000 ))
E.g3 = rep(1,1000)
E.g3[d.g3 >= C.g3] = 0
d.g3[d.g3 >= C.g3] = C.g3[d.g3 >= C.g3]

event.time = c(d.g1, d.g2, d.g3)
event.result = c(E.g1, E.g2, E.g3)
Group = c(rep(1, 1000), rep(2, 1000), rep(3, 1000))

# fit a cox model and plot the result with a "predicted" survival curve
mod = coxph(Surv(time=event.time , event=event.result )~factor(Group))
km = survfit(Surv(time=event.time , event=event.result )~factor(Group))
plot(km, col=c("red", "blue", "green"), lwd=3,
mark.time=FALSE, mark=3,
xlim=c(0, 365),
xlab = "time (days)", ylab = "S(d)")
lines(survfit(mod, newdata = data.frame(Group = 1:3)),
col=c("red", "blue", "green"), lty=2,
mark.time=FALSE )

Figure 1: a Kaplan Meier Curve for simulated data (solid lines) with a Cox predicted (risk adjusted) survival function (dashed lines). Colors indicate the data cohort.

Figure 1 shows the result of the R code. Note that the Cox model would not be appropriate in this case, as the proportional hazards assumption is not met; i.e., the KM lines cross!

So how does R calculate the risk-adjusted survival function from the Cox model? To answer this question recall the definition of a hazard function: $h(x) = f(x)/[1-F(x)]$. Also recall that the Cox model describes the hazard function as $\exp({\bf m}^t {\bf z}) h_0(x)$. Thus what is needed is some function $g(y)$ such that

$\frac{\frac{d}{dx} g[F(x)]}{1 - g[F(x)]} = \exp({\bf m}^t {\bf z}) \frac{f(x)}{1-F(x)}$

is true.

All we have to do is find $g(y)$ and use it to calculate the risk-adjusted survival function from the empirical survival function. But how to find $g(y)$? Well, like a lot of math, it is much easier to know the answer and work backwards!

In this case I tried a few functions and could not get any of them to work. Then I had the idea to assume $F(x) = 1- \exp(-\lambda x)$, i.e., that it is exponential. Using this it became clear that $g(y) = 1-(1-y)^{\exp({\bf m}^t {\bf z})}$. It then turns out that this definition of $g(y)$ works for all $F(x)$!

Thus the risk-adjusted survival function is $\hat{S}(x) = S(x)^{\exp({\bf m}^t {\bf z})}$.
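This power relation is easy to verify numerically. A Python sketch with an exponential baseline and a made-up risk score: raising $S(x)$ to the power $\exp({\bf m}^t {\bf z})$ gives exactly the survival function whose hazard is scaled by that factor.

```python
import math

lam = 1 / 180.0        # baseline hazard rate (hypothetical value)
risk = math.exp(0.7)   # exp(m^t z) for some covariate vector (hypothetical value)

def S(x):              # baseline survival function
    return math.exp(-lam * x)

def S_adj(x):          # risk-adjusted survival: S(x) raised to the risk score
    return S(x) ** risk

def S_direct(x):       # survival function with the hazard scaled directly
    return math.exp(-risk * lam * x)

# The two constructions agree at every x.
for x in (1.0, 30.0, 365.0):
    assert abs(S_adj(x) - S_direct(x)) < 1e-12
```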

That’s it for now.

# More on Bayesian Split testing

Hi All

With this post I am expanding on the previous post on split testing. In particular, the last post assumed that the test groups had distinct and unknown means and variances. While that case is interesting, it might not be useful. The problem is that it tests whether the mean or variance is different between the groups; life is too short to care about differences in variance.

Here I give a table of evidence ratios for three types of tests: first, where the variance of the two groups is known; second, where the variances of the two groups are unknown but assumed to be the same; and third, where the variances are unknown and allowed to be different (the previously presented case).

Before jumping into that it is useful (though maybe not interesting? look you chose to read this) to review the paradigm of Bayesian evidence as used in split testing.

Bayesian evidence is the probability of the observations marginalizing over all parameters. To justify that claim consider Bayes' formula as commonly used in statistical inference:

$P({\bf m}_{j} | {\bf d}) = \frac{\pi({\bf m}_{j}) \mathcal{L}({\bf m}_{j})}{\mathcal{Z}_{j}}$

Here $\pi({\bf m}_{j})$ is the prior distribution of ${\bf m}_{j}$, $\mathcal{L}({\bf m}_{j})$ is the likelihood
of the parameter vector (i.e., the probability of the data given the model), and $\mathcal{Z}_{j}$ is the Bayesian evidence. The subscript $j$ denotes the choice of parameterization. The posterior distribution $P({\bf m}_{j} | {\bf d})$, is a distribution! That might seem obvious; however, it has a significant consequence: distributions must sum / integrate to 1. Thus $\int d{\bf m}_{j} P({\bf m}_{j} | {\bf d}) = 1$. Which in turn means that $\mathcal{Z}_{j} = \int d{\bf m}_{j} \pi({\bf m}_{j}) \mathcal{L}({\bf m}_{j})$.

Returning to Bayes theorem:

$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$

If we relabel A, B as the model ${\bf m}_{j}$ and data ${\bf d}$ then

$P({\bf m}_{j} | {\bf d}) = \frac{P({\bf d} | {\bf m}_{j}) P({\bf m}_{j})}{P({\bf d})}$

i.e., $P( {\bf d}) = \mathcal{Z}_{j} = \int d{\bf m}_{j} \pi({\bf m}_{j}) \mathcal{L}({\bf m}_{j})$. The probability of the data is the integral over all parameters of the prior and likelihood.
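As a toy numerical check of this identity (Python; a Bernoulli likelihood with k successes in n trials and a uniform prior on the rate, chosen only because its evidence has a simple closed form, not related to the split-test formulas below):

```python
import math

# Z = integral of prior * likelihood over the parameter.
def evidence_numeric(k, n, steps=100_000):
    """Midpoint-rule integral of p^k (1-p)^(n-k) over the uniform prior on [0,1]."""
    total = 0.0
    dp = 1.0 / steps
    for i in range(steps):
        p = (i + 0.5) * dp
        total += p ** k * (1 - p) ** (n - k) * dp
    return total

def evidence_exact(k, n):
    """Closed form: the Beta function B(k+1, n-k+1) = k!(n-k)!/(n+1)!."""
    return math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

# The marginalized probability of the data matches the closed form.
print(evidence_numeric(3, 10), evidence_exact(3, 10))
```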

The evidence ratio of two groups of sizes N and M with unknown means and the same known variance to one group with known variance is:

The evidence ratio of two groups of sizes N and M with unknown means and the same unknown variance to one group with unknown variance is:

The evidence ratio of two groups of sizes N and M with unknown means and different unknown variances to one group with unknown variance is:

A few final notes. These equations look ugly, but actually they are very easy to interpret: if the ratio is greater than 1 then the data (observations) are more likely to have come from the more complex model, and if it is less than 1 they are more likely to have come from the simpler model. None of the straw-man garbage of the t test. Also, expanding these for multiple groups does not fall into the p-value problem of multiple comparisons. Finally, you probably want to compute the numerators and denominators in logs, as the values get really big / small for any reasonably sized data set.
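On that last point, a small Python sketch of working in log space (the log-evidence values are made-up placeholders):

```python
import math

# Log-evidences for the complex and simple models; for real data these are
# huge negative numbers, so never exponentiate them individually.
log_Z_complex = -12_345.6   # hypothetical value
log_Z_simple = -12_350.1    # hypothetical value

# Only exponentiate the (small) difference of the logs.
log_ratio = log_Z_complex - log_Z_simple
ratio = math.exp(log_ratio)
print(ratio)   # roughly 90: the data favour the more complex model
```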