# Clustering and other news

Hi All

It has been a while. Mostly I have just been focused on other projects. In particular I wrote a paper on passive churn that was rejected (a first for me). So I am trying to wash the bitter taste of failure out of my mind with this post.

An interesting problem that has recently been on my mind is clustering users so that we can use qualitative methods to understand better what they want from our product.

I do not have a lot of experience with clustering, so I started by looking at a few generic algorithms: non-negative matrix factorization (NNMF) and k-means. In exploring these I more or less reinvented expectation maximization (although to be fair I have not researched it deeply, so I might have reinvented a janky, inefficient/ineffective EM clustering algorithm).

The concerns I had with both NNMF and k-means are that they do not include prior knowledge, they are deterministic and only find local optima (i.e., they are extremely sensitive to the starting model), and they have no inherent ability to determine the number of clusters. My algorithm (see how I took ownership of EM; let's call my algorithm GEM… because she is truly outrageous) addresses all of these concerns (at least partly) and is demonstrated on a racist's flower data set.

The algorithm must be initialized with prior knowledge (even if it is uninformative) of cluster group memberships and an assumed known number $k$ of clusters. GEM is then iterative and has two steps:

1. Determine the cluster membership $M$ from the current cluster attributes [$Q = (\mu_1, \mu_2, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k)$].
2. Update the estimate of $Q$ given $M$.

Steps one and two are repeated until the algorithm converges. GEM is also run at several temperatures simultaneously to help with convergence; the chains at the different temperatures are allowed to interact using standard replica exchange (REM) methods.
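The temperature-swap step can be sketched in a few lines. Here is a minimal Python illustration (function and variable names are mine, not from the R code at the bottom of the post), using the standard REM acceptance probability:

```python
import numpy as np

def rem_swap(states, log_liks, temps, rng):
    """Propose one replica-exchange swap between two randomly chosen chains.

    Assumes strictly positive temperatures.
    """
    i, j = rng.choice(len(temps), size=2, replace=False)
    t1, t2 = temps[i], temps[j]
    ll1, ll2 = log_liks[i], log_liks[j]
    # standard REM acceptance probability for swapping chains i and j
    A = np.exp((ll2 - ll1) * (1.0 / t1 - 1.0 / t2))
    if rng.uniform() <= A:
        states[i], states[j] = states[j], states[i]
        log_liks[i], log_liks[j] = log_liks[j], log_liks[i]
    return states, log_liks
```

A swap that moves a high-likelihood state to a colder chain is always accepted, which is what lets the cold chain exploit what the hot chains explore.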

In more detail:

$M$ is a matrix with $n$ rows (where $n$ is the number of observations) and $k$ columns. Entry $M_{i,j}$ is the probability that observation $i$ is in group $j$.

$$M_{i,j} = (1-\alpha_0)\,\frac{\phi({\bf d}_i;\, \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \phi({\bf d}_i;\, \mu_l, \Sigma_l)} + \alpha_0 \hat{P}_{i,j}$$

where ${\bf d}$ is the matrix of observations and $\phi$ is the multivariate normal density, $\hat{P}_{i,j}$ is the prior probability of observation $i$ being in cluster $j$, and $\alpha_0$ is the weight of the prior. For the initialization of the algorithm $\alpha_0 = 1$. At each step of the GEM iteration each observation is randomly assigned to a cluster with the corresponding probability from $M$.
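As a concrete illustration of this membership update (in Python rather than R; the helper name is mine, and the multivariate normal likelihood is my reading of the code at the bottom of the post):

```python
import numpy as np
from scipy.stats import multivariate_normal

def update_membership(X, mus, Sigmas, prior, alpha):
    """M[i, j] = (1 - alpha) * lik[i, j] / sum_l lik[i, l] + alpha * prior[i, j]."""
    # per-cluster likelihood of every observation, one column per cluster
    lik = np.column_stack([
        multivariate_normal.pdf(X, mean=m, cov=S) for m, S in zip(mus, Sigmas)
    ])
    # normalize the likelihoods row-wise and mix the prior back in
    return (1 - alpha) * lik / lik.sum(axis=1, keepdims=True) + alpha * prior
```

Because each row of the likelihood term is normalized and the prior rows sum to one, each row of the result is again a proper probability vector.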

For step two the $\mu$'s and $\Sigma$'s must be estimated. For $\Sigma_j$ I just use the sample covariance matrix of the observations assigned to cluster $j$. For the $\mu$'s a modification of the standard Bayesian posterior distribution is calculated and then sampled from. The purpose of sampling instead of taking the MAP is to add additional variation to the process, to help navigate out of local minima toward a global estimate (I have no proof this actually works, but it seems like a good idea). The modified posterior distribution is

$$\mu_j \mid M \sim \mathcal{N}\!\left( \bar{\bf d}_j,\; \frac{t\,\Sigma_j}{n_j} \right)$$

where $\bar{\bf d}_j$ is the mean of the observations assigned to cluster $j$, $n_j$ is the number of observations assigned to cluster $j$, and $t$ is the temperature of the chain. (And yes, I am one of those people using an improper prior; yes, it is philosophically inconsistent; I don't care!) And repeat!
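A minimal sketch of this sampling step in Python (names are mine; the normal centered at the cluster mean with covariance scaled by $t/n_j$ matches the `rmvnorm` call in the R code at the bottom):

```python
import numpy as np

def sample_mu(X_j, t, rng):
    """Draw mu_j from N(mean(X_j), t * Sigma_j / n_j); t = 0 returns the plain mean."""
    mu_hat = X_j.mean(axis=0)
    if t <= 0:
        return mu_hat
    Sigma_j = np.cov(X_j, rowvar=False)
    return rng.multivariate_normal(mu_hat, t * Sigma_j / len(X_j))
```

Higher temperatures inflate the sampling covariance, so hot chains take bigger steps around the cluster mean while the cold chain barely moves.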

The code for my example is appended at the bottom of the post:

Now let's see how it does with the irises.

Figure 1: Fisher iris data; color indicates iris species

For this data set looking at petal length vs petal width is likely sufficient to see all clusterings and allows for much simpler plots.

Figure 2: Fisher iris data, with GEM cluster centroid paths

From Fig. 2 it is fairly clear that the GEM algorithm is able to converge to reasonable centroids for the three species of iris. Figure 3 shows the membership probabilities of the clusters. In almost all cases, all the observations from a single species are grouped into the same cluster and are decisively more likely to be part of that cluster.

Figure 3: The log probability of cluster membership for the three clusters. The colors indicate the true iris species

To help interpret the clusters it is useful to do two things (or more, but I like these two): first, correlate the membership probabilities ($M_{\star, j}$) of the clusters with the observation dimensions; and second, plot the membership probabilities as a function of the observation dimensions. To that end, Tab. 1 shows the correlation of the clusters with the observation dimensions. Thus cluster 3 is the smallest in sepal length, petal length, and petal width but largest in sepal width. Cluster 2 is the opposite. Cluster 1 is a less extreme version of cluster 2.

Table 1: Correlation between the ${\bf d}_j$ and $M_{\star, k}$
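The correlation table is straightforward to compute; a Python sketch (the helper name is mine):

```python
import numpy as np

def membership_correlations(X, M):
    """Correlation of each observation dimension (columns of X)
    with each cluster-membership column (columns of M)."""
    return np.array([[np.corrcoef(X[:, a], M[:, b])[0, 1]
                      for b in range(M.shape[1])]
                     for a in range(X.shape[1])])
```

Each row of the result corresponds to a data dimension and each column to a cluster, matching the layout of Tab. 1.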

Because correlations hide a wealth of nonlinear information, the second step of looking at scatter plots of cluster membership against the data attributes is also helpful. Here a smoothing trend line is added by regressing $\log( - \log[M_{\star, k}] )$ on ${\bf d}_j$. The complementary log-log transform is used to ensure that the smooth is always between 0 and 1.
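To see why this bounds the smooth: the regression is fit on the transformed scale $z = \log(-\log M)$, and the fitted values are mapped back with $M = \exp(-\exp z)$, which lands in $(0, 1)$ for any real $z$. A small sketch:

```python
import numpy as np

def cloglog(m):
    """Transform a probability m in (0, 1) to the unbounded log(-log(m)) scale."""
    return np.log(-np.log(m))

def inv_cloglog(z):
    """Invert the transform; the result is always strictly between 0 and 1."""
    return np.exp(-np.exp(z))
```

Any linear model fit to `cloglog(M)` can therefore produce arbitrary fitted values without the back-transformed smooth ever escaping the unit interval.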

Figure 4: Smooths regressing $M_{\star, k}$ on ${\bf d}_j$. Each row gives membership in a cluster and each column gives a dimension of the observed data. The black lines give a lowess smooth through the data.

Finally, my last concern was how to know the number of clusters in a data set. This can be accomplished naturally by using the BIC. The BIC is a selection criterion that weighs model fit against model complexity. Here $BIC = (4+4+3+2+1) \cdot g \cdot \log(150) - 2 \log(L)$; that is, each cluster has 4 parameters for the mean and 10 for the covariance matrix. Frustratingly (yes, I sometimes write these posts as I go instead of doing all the work first and then writing it up), when I got here the BIC was optimized for 4 groups in the data instead of 3. That more or less destroys my ending to the post, but I think it is quite interesting; I am still thinking about why BIC does not work here and don't have a good explanation.
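The parameter count works out to 14 per cluster (4 for the mean, $4 \cdot 5 / 2 = 10$ for the symmetric $4 \times 4$ covariance). As a sketch of the criterion (the helper name is mine):

```python
import numpy as np

def gem_bic(log_lik, g, n=150, d=4):
    """BIC = p * g * log(n) - 2 * log(L), with p = d + d(d+1)/2 parameters per cluster."""
    p = d + d * (d + 1) // 2  # 4 mean + 10 unique covariance entries = 14 for d = 4
    return p * g * np.log(n) - 2.0 * log_lik
```

Evaluating this over a range of $g$ (refitting the model each time) and taking the minimum is the natural way to pick the number of clusters.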

Anyway, that is it for now. Hopefully I will be able to write more of these in the coming weeks.

```r
library(mvtnorm) # for rmvnorm and dmvnorm

V = iris[, 1:4]
n = 150 # number of observations
g = 3   # number of clusters

# uninformative prior: cumulative membership probabilities for each observation
P = seq(0, 1, length.out = g + 1)
P = P[2:(g + 1)]

# an informative alternative (cumulative probabilities by species block)
P.1 = c(0.8, 0.9, 1)
P.2 = c(0.1, 0.9, 1)
P.3 = c(0.1, 0.2, 1)

MEM.P = matrix(P, nrow = n, ncol = g, byrow = TRUE)
#MEM.P = rbind(matrix(P.1, nrow = 50, ncol = g, byrow = TRUE),
#              matrix(P.2, nrow = 50, ncol = g, byrow = TRUE),
#              matrix(P.3, nrow = 50, ncol = g, byrow = TRUE))

MEM.P.0 = MEM.P # the prior, mixed back in at every update
alpha = 0.01    # weight of the prior

plot(V, col = c(rep("red", 50), rep("blue", 50), rep("green", 50)))

L = list()   # centroid path of the t = 0 chain
ref = cbind( # true species means, for reference
  mu.1 = apply(V[1:50, ],   2, mean),
  mu.2 = apply(V[51:100, ], 2, mean),
  mu.3 = apply(V[101:150, ], 2, mean))

# temperature ladder for the replica exchange (REM) step
t.list = c(0, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 4)
nt = length(t.list)

MEM.P.list = list()
for (i in 1:nt)
{
  MEM.P.list[[i]] = MEM.P
}
prob.mod.list = rep(-Inf, nt)

P.best = -Inf
MEM.P.best = MEM.P
P.g = list() # per-cluster likelihood of each observation

N = 2000
prob.mod = 1:N
prob.mod.best = 1:N

for (i in 1:N)
{
  print(i)

  for (k in 1:nt)
  {
    MEM.P = MEM.P.list[[k]]
    t = t.list[k]

    # step 1: randomly assign each observation to a cluster using the
    # cumulative membership probabilities in MEM.P
    r = runif(n)
    MEM = rep(g, n)
    for (j in (g - 1):1)
    {
      MEM[r <= MEM.P[, j]] = j
    }

    # step 2: update mu and Sigma for each cluster; at t > 0 sample mu
    # from the tempered posterior N(mean, t * Sigma / n_j)
    mu.g = matrix(NA, nrow = g, ncol = ncol(V))
    for (j in 1:g)
    {
      V.temp     = V[MEM == j, ]
      mu.temp    = apply(V.temp, 2, mean)
      Sigma.temp = cov(V.temp)
      n.temp     = nrow(V.temp)
      if (t > 0.001)
      {
        mu.temp = rmvnorm(1, mean = mu.temp, sigma = t * Sigma.temp / n.temp)
      }

      if (i == 1)
      {
        print(mu.temp)
        print(Sigma.temp)
      }

      mu.g[j, ] = mu.temp
      P.g[[j]] = dmvnorm(V, mean = mu.temp, sigma = Sigma.temp)
    }

    # normalizing constant: total likelihood of each observation
    P.norm = rep(0, n)
    for (j in 1:g)
    {
      P.norm = P.norm + P.g[[j]]
    }

    # rebuild the cumulative membership matrix and mix the prior back in
    MEM.P[, 1] = P.g[[1]]
    for (j in 2:(g - 1))
    {
      MEM.P[, j] = MEM.P[, j - 1] + P.g[[j]]
    }
    MEM.P[, 1:(g - 1)] = (1 - alpha) * MEM.P[, 1:(g - 1)] / P.norm +
      alpha * MEM.P.0[, 1:(g - 1)]
    MEM.P.list[[k]] = MEM.P

    # model log likelihood; keep track of the best model seen so far
    P.mod.temp = sum(log(P.norm))
    prob.mod.list[k] = P.mod.temp
    if (P.mod.temp >= P.best)
    {
      P.best = P.mod.temp
      MEM.P.best = MEM.P
    }

    if (k == 1) # record the t = 0 chain
    {
      L[[i]] = mu.g
      prob.mod[i] = P.mod.temp
      prob.mod.best[i] = P.best
    }
  }

  # REM step: occasionally propose swaps between randomly chosen temperatures
  r = runif(1)
  if (r <= 0.1)
  {
    for (k in 1:100)
    {
      SAMP = sample(1:nt, 2)
      index1 = SAMP[1]
      index2 = SAMP[2]
      t1 = t.list[index1]
      t2 = t.list[index2]
      LL1 = prob.mod.list[index1]
      LL2 = prob.mod.list[index2]
      MEM.P1 = MEM.P.list[[index1]]
      MEM.P2 = MEM.P.list[[index2]]
      A = exp((LL2 - LL1) * (1 / t1 - 1 / t2)) # REM acceptance probability
      r = runif(1)
      if (index1 == 1)
      {
        print("-----------> temp zero is 1 <----------")
        print(LL1)
        print(LL2)
        print(t1)
        print(t2)
        print(A)
      }
      if (r <= A)
      {
        MEM.P.list[[index1]] = MEM.P2
        MEM.P.list[[index2]] = MEM.P1
        prob.mod.list[index1] = LL2
        prob.mod.list[index2] = LL1
      }
    }
  }
  print(t(prob.mod.list))
}
```