A Hybrid Optimizer

Hi all

I am taking a break from neural nets to post some notes on numerical optimization. The algorithm described here is called adaptive simplex simulated annealing (ASSA) and was developed by my Ph.D. supervisor. It is a fusion of the adaptive simplex and simulated annealing optimizers. In short, ASSA combines the fast convergence of the geometric AS optimizer with the global search of simulated annealing.


Figure 1: Possible moves of the AS algorithm through a two-dimensional parameter space.

 

Before describing the full ASSA algorithm it is helpful to go over the adaptive simplex (AS) optimizer. The AS algorithm starts by constructing a simplex (think of a generalization of a triangle in 2D space to nD space, i.e., a polytope with one more vertex than the dimension of the space it inhabits) and then reflecting, contracting, or expanding the simplex at each step to move around the parameter space. At each iteration the worst model (ranked by objective function value, e.g., energy, misfit, or negative log likelihood) is moved to a better position. The possible moves are shown in Fig. 1 (the explicit move formulas are given after the list), and the steps of the AS algorithm are:

  1. Choose a simplex with N + 1 models at random in the parameter space
  2. Reflect the model with the highest objective function value in the simplex, mhigh. If the objective function value of the reflected model, f(mrefl), fulfils:
    • A) f(mlow) < f(mrefl) < f(m2ndhigh), replace mhigh with mrefl, go to 7;
    • B) f(mrefl) < f(mlow), go to 3;
    • C) f(mrefl) > f(mhigh), go to 4;
    • D) f(m2ndhigh) < f(mrefl) < f(mhigh), go to 5;
  3. Reflect mhigh and expand by a factor two to get mreflExp. The one of mrefl and mreflExp with the lowest objective function value replaces mhigh. Go to 7;
  4. Contract mhigh by a factor two to get mcon. If f(mcon) < f(mhigh) then replace mhigh with mcon and go to 7, else go to 6;
  5. Reflect mhigh and contract by a factor two to get mreflCon. If f(mreflCon) < f(mhigh) then replace mhigh with mreflCon and go to 7, else go to 6;
  6. Perform a multiple contraction around mlow (see Fig. 1). Go to 7;
  7. Continue from 2 until the stop criterion is fulfilled.
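
For concreteness, the moves as implemented in my 2D code below are as follows. With \bar{m} the centroid of the vertices other than mhigh, and \alpha = 1: the reflection is mrefl = (1 + \alpha)\bar{m} - \alpha mhigh; the expansion simply repeats the reflection with a larger \alpha (the code grows \alpha by a factor of 1.5 each time); the contraction is the midpoint mcon = (\bar{m} + mhigh)/2; and the multiple contraction is implemented by repeatedly pulling mhigh towards mlow, mnew = (w mlow + mhigh)/(1 + w) with increasing weight w.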

Intuitively, the AS algorithm encodes local gradient information of the objective function without computing derivatives. My R code for two-dimensional AS is given below. Note that the code treats larger objective function values as better, so the vertex it moves at each step is the one with the lowest value; it also calls g.error.function(x, y), which is not defined in this post.
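
A minimal stand-in for g.error.function, assuming a smooth test surface on [-50, 50]^2 with a couple of local maxima (the exact function I used may differ), is:

g.error.function <- function(x, y)
    {
    # a stand-in objective to maximize: a broad peak at (10, -20)
    # plus a smaller local peak at (-30, 30)
    exp(-((x - 10)^2 + (y + 20)^2)/400) +
    0.5*exp(-((x + 30)^2 + (y - 30)^2)/100)
    }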

g.simplex <-function(N = 100)
{

x1 = 100*runif(1) - 50 
x2 = 100*runif(1) - 50 
x3 = 100*runif(1) - 50  

y1 = 100*runif(1) - 50 
y2 = 100*runif(1) - 50 
y3 = 100*runif(1) - 50

MF1 = g.error.function(x1,y1)
MF2 = g.error.function(x2,y2)
MF3 = g.error.function(x3,y3)

simplex = data.frame(x = c(x1, x2, x3), y = c(y1, y2, y3), MF = c(MF1, MF2, MF3) )
simplex = simplex[sort(simplex[,3], index.return =T)$ix,] # worst is always the first row
rownames(simplex ) =NULL

S.E     = simplex$MF[3] - simplex$MF[1]
k = 0

for(i in  1:N)
    {
    print(k)
    k = k + 1
    simplex.temp = simplex
    x = simplex$x 
    y = simplex$y 
    MF = simplex$MF

    #plot(c(-50, 50), c(-50, 50), xaxt= "n", yaxt ="n", xlab ="", ylab ="", type= "n")
    #polygon(simplex$x, simplex$y, border = "grey")

    # reflection step 
    alpha = 1 
    x.0 = (x[2] + x[3])/2
    y.0 = (y[2] + y[3])/2
    x.new  = (1+alpha)*x.0 - alpha*x[1]
    y.new  = (1+alpha)*y.0 - alpha*y[1]
    MF.new = g.error.function(x.new,y.new)

    if(MF.new > MF[2])
        {
        MF.new2 = MF.new # expand step 
        j = 0
        while((MF.new2 >= MF[3]) & (j <= 50) ) # keep expanding until the expanded point is no longer the best point
            {
            j = j +1
            alpha  = 1.5*alpha
            x.new2  = (1+alpha)*x.0 - alpha*x[1]
            y.new2  = (1+alpha)*y.0 - alpha*y[1]
            MF.new2 = g.error.function(x.new2,y.new2)
            
            if (MF.new2 >=  MF.new) 
                {
                x.new  = x.new2
                y.new  = y.new2
                MF.new = MF.new2
                }
    
            }
        }

    if(MF.new <= MF[2]) # reflection did not beat the second-best point, so contract instead
        {
        x.new = (x.0 + x[1])/2
        y.new = (y.0 + y[1])/2
        MF.new = g.error.function(x.new,y.new)
        }
    w = 1                  # if the point is still bad then it goes towards the best point
    while((MF.new <  MF[2]) & (w <=100) )
        {
        x.new = (w*x[3] + x[1])/(1+w)
        y.new = (w*y[3] + y[1])/(1+w)
        MF.new = g.error.function(x.new,y.new)
        w = w + 1
        }
    simplex[1,1:3] = c(x.new, y.new, MF.new)
    simplex = simplex[sort(simplex[,3], index.return =T)$ix,] # worst is always the first row
    rownames(simplex ) =NULL
    dif  = log(simplex[3,3] / simplex[1,3])
    print(dif)
    points(simplex$x[3], simplex$y[3])
    print(simplex)
    }
}

 

The idea of simulated annealing should be straightforward to anyone who has used MCMC. The steps of the SA algorithm are:

  1. Initialize with random model (parameter values) m, with function value V.
  2. Propose a new model m' (add random noise) and calculate its value V'.
  3. Accept the new model with probability A = \min(1, \exp[(V' - V)/t_i]) (here a higher value is better, so an improvement is always accepted).
  4. Set the new temperature t_{i+1} = \alpha t_i, where \alpha < 1, and go to 2.

The choice of \alpha and the proposal distribution determines the speed and stability of convergence. I don't believe there is a one-size-fits-all solution to this; however, I do believe there are a few universal considerations. The main one is to use a wide-tailed proposal distribution such as a Cauchy, or a uniform for bounded problems. This helps stop the optimizer from getting trapped in local structure. Because optimizers do not need to be ergodic, it is also possible to have more complex rules such as: propose from a uniform, if that fails propose from a Cauchy, and only if that also fails try a Gaussian.
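
The Cauchy proposals in the code below are drawn by inverting the Cauchy CDF, i.e., scale*tan(pi*(u - 0.5)) with u uniform on (0, 1); this is equivalent to calling rcauchy directly:

# two equivalent ways of drawing a Cauchy-distributed step with scale 0.1
step1 = 0.1*tan(pi*(runif(1) - 0.5))
step2 = rcauchy(1, location = 0, scale = 0.1)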

My code for an SA optimizer is

g.aneal <-function(N = 100, n= 10, t0 = 5, tstep = 0.995)
{
t = t0
x = 100*runif(1) - 50  
y = 100*runif(1) - 50 
MF = g.error.function(x,y)
x.best  = x 
y.best  = y
MF.best = MF
MF.list = MF
X.list  = x 
y.list  = y
t.list  = t
for (i in 1:N)
    {
    t.0.2 = t^0.2 
    for ( j in 1:n)
        {
        x.prime  = x +   0.1*tan(pi*(runif(1) - 0.5))
        y.prime  = y +   0.1*tan(pi*(runif(1) - 0.5))
        if ( (x.prime > 50) | (x.prime < -50)  ) {x.prime = 100*runif(1) - 50  }
        if ( (y.prime > 50) | (y.prime < -50)  ) {y.prime = 100*runif(1) - 50  }
        MF.prime = g.error.function(x =x.prime , y = y.prime)
        A = exp( (MF.prime - MF)/t )
        if (is.na(A) ) {A= 0 }
        r = runif(1)
        if(MF.best <= MF.prime) 
            {
            x.best  = x.prime 
            y.best  = y.prime
            MF.best = MF.prime
            }
        if ( r <= A) 
            {
            x  = x.prime 
            y  = y.prime
            MF = MF.prime
            }
        }
    MF.list = c(MF.list, MF)
    X.list  = c(X.list , x )
    y.list  = c(y.list , y )
    t.list  = c(t.list , t )
    t = tstep*t
    print(round(c(i,x.best, y.best, MF.best, t),3))
    }    
out = data.frame(MF = MF.list, x =X.list, y =y.list, t=t.list )
return(out)
}
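
A quick way to sanity check the cooling schedule is to run the sampler and plot the trace of the objective value and the temperature (this assumes the stand-in g.error.function defined above):

out = g.aneal(N = 1000, n = 10, t0 = 5, tstep = 0.995)
par(mfrow = c(2, 1))
plot(out$MF, type = "l", xlab = "temperature step", ylab = "objective value")
plot(out$t,  type = "l", xlab = "temperature step", ylab = "temperature")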

 

So finally the ASSA optimizer. The steps of the algorithm are:

  1. Randomly initialize N+1 models to generate a starting simplex, for a problem with dimension N;
  2. Reflect and perturb the model with the highest objective function value in the simplex, mhigh, to get mreflPert; Check with the SA criteria if mreflPert is to replace mhigh in the simplex;
  3. If the objective function value of the reflected and perturbed model, mreflPert, fulfills:
    • A) f(mreflPert) < f(mlow), go to 4;
    • B) f(mreflPert) > f(m2ndhigh), go to 5.
    • C) Else, go to 6;
  4. Expand and perturb mreflPert, which must have been included in the simplex in this case, to get mexpPert; Check with the first SA criterion if mexpPert is to replace mreflPert in the simplex. Go to 6;
  5. Contract and perturb one of mhigh and mreflPert that is present in the simplex to get mconPert; Check with the two SA criteria if mconPert is to replace the model in the simplex which was contracted. Go to 6;
  6. If at least one new model replacement has taken place in the simplex, the number of accepted models is increased by one. If enough models have been accepted at the present temperature, go to 7, else go to 2.
  7. Reduce the temperature and go to 2, unless the required number of temperature steps has been taken.
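
For reference, the SA check used in my code below is the standard Metropolis rule evaluated against the middle (second-best) vertex of the simplex: the perturbed model mnew is accepted with probability A = \min(1, \exp[(f(mnew) - f(m2nd))/t]), where in my implementation larger objective function values are better.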

And my code for a 2D ASSA optimization is

g.ASSA <-function(N = 100, n= 100, delta = 0.995, sig = 0.01)
{
x1 = 100*runif(1) - 50 
x2 = 100*runif(1) - 50 
x3 = 100*runif(1) - 50  
y1 = 100*runif(1) - 50 
y2 = 100*runif(1) - 50 
y3 = 100*runif(1) - 50
MF1 = g.error.function(x1,y1)
MF2 = g.error.function(x2,y2)
MF3 = g.error.function(x3,y3)
simplex = data.frame(x = c(x1, x2, x3), y = c(y1, y2, y3), MF = c(MF1, MF2, MF3) )
simplex = simplex[sort(simplex[,3], index.return =T)$ix,] # worst is always the first row
rownames(simplex ) =NULL
S.E     = simplex$MF[3] - simplex$MF[1]
k = 0
t = 10
for(i in  1:N)
    {
    for (j in 1:n)
        {
        print(k)
        k = k + 1
        simplex.temp = simplex
        x = simplex$x 
        y = simplex$y 
        MF = simplex$MF
         # reflection step
         AC1   = 0  
         alpha = 1 
        x.0 = (x[2] + x[3])/2
        y.0 = (y[2] + y[3])/2
         x.new  = (1+alpha)*x.0 - alpha*x[1] +  sig*tan(pi*(runif(1) - 0.5))
         y.new  = (1+alpha)*y.0 - alpha*y[1] +  sig*tan(pi*(runif(1) - 0.5))

        if ( (x.new > 50) | (x.new < -50)  ) {x.new = 100*runif(1) - 50  }
        if ( (y.new > 50) | (y.new < -50)  ) {y.new = 100*runif(1) - 50  }
        MF.new = g.error.function(x.new,y.new)

        A = exp( (MF.new - MF[2])/t )
        if (is.na(A) ) {A= 0 }
        r = runif(1)

        if(r <= A) 
            {
            simplex[1,1:3] = c(x.new, y.new, MF.new)
            AC1 = 1
            simplex = simplex[sort(simplex[,3], index.return =T)$ix,] # worst is always the first row
            rownames(simplex ) =NULL
            x = simplex$x 
            y = simplex$y 
            MF = simplex$MF
            x.0 = (x[2] + x[3])/2
            y.0 = (y[2] + y[3])/2
            }
       if(MF.new > MF[3])
             {
             MF.new2 = MF.new # expand step 
             j = 0
            while((MF.new2 >= MF[3]) & (j <= 50) ) # keep expanding until the expanded point is no longer the best point
                {
                j = j +1
                alpha  = 1.5*alpha
                x.new2  = (1+alpha)*x.0 - alpha*x[1] +  sig*tan(pi*(runif(1) - 0.5))
                y.new2  = (1+alpha)*y.0 - alpha*y[1] +  sig*tan(pi*(runif(1) - 0.5))
                if ( (x.new2 > 50) | (x.new2 < -50)  ) {x.new2 = 100*runif(1) - 50  }
                if ( (y.new2 > 50) | (y.new2 < -50)  ) {y.new2 = 100*runif(1) - 50  }
                MF.new2 = g.error.function(x.new2,y.new2) 
                if (MF.new2 >=  MF.new) 
                    {
                    x.new  = x.new2
                    y.new  = y.new2
                    MF.new = MF.new2
                    }
                }
            }
        if(AC1 == 0) # if the reflected point was not accepted, try a contraction
            {
            x.new = (x.0 + x[1])/2 +  sig*tan(pi*(runif(1) - 0.5))
            y.new = (y.0 + y[1])/2 +  sig*tan(pi*(runif(1) - 0.5))
            if ( (x.new > 50) | (x.new < -50)  ) {x.new = 100*runif(1) - 50  }
            if ( (y.new > 50) | (y.new < -50)  ) {y.new = 100*runif(1) - 50  }
            MF.new = g.error.function(x.new,y.new)
            A = exp( (MF.new - MF[2])/t )
            if (is.na(A) ) {A= 0 }
            r = runif(1)
            if(r <= A) 
                {
                simplex[1,1:3] = c(x.new, y.new, MF.new)
                AC1 = 1
                simplex = simplex[sort(simplex[,3], index.return =T)$ix,] # worst is always the first row
                rownames(simplex ) =NULL
                x = simplex$x 
                y = simplex$y 
                MF = simplex$MF
                x.0 = (x[2] + x[3])/2
                y.0 = (y[2] + y[3])/2
                }
            }
         w = 1                  # if the point is still bad then it goes towards the best point
        while((AC1 == 0) & (w <=100) & (MF.new < MF[1])  )
            {
            x.new = (w*x[3] + x[1])/(1+w)
            y.new = (w*y[3] + y[1])/(1+w)
            MF.new = g.error.function(x.new,y.new)
            w = w + 1
            }
            A = exp( (MF.new - MF[2])/t )
            if (is.na(A) ) {A= 0 }
            r = runif(1)
            if(r <= A) 
                {
                simplex[1,1:3] = c(x.new, y.new, MF.new)
                AC1 = 1
                simplex = simplex[sort(simplex[,3], index.return =T)$ix,] # worst is always the first row
                rownames(simplex ) =NULL
                x = simplex$x 
                y = simplex$y 
                MF = simplex$MF
                x.0 = (x[2] + x[3])/2
                y.0 = (y[2] + y[3])/2
                }
    
        #dif  = log(simplex[3,3] / simplex[1,3])
        #print(dif)
        ####polygon(simplex$x, simplex$y)
        points(x[3], y[3])
        #points(x[2], y[2], pch=".", col="grey")
        #print(simplex)
        print(t)
        #print(AC1)
        }
    t = t*delta
    }
}
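
Both g.simplex and g.ASSA call points() to mark the best vertex as they run, so a plotting window needs to be opened first. A minimal run, again assuming the stand-in g.error.function from above, might look like:

# open a blank plot covering the parameter space
plot(c(-50, 50), c(-50, 50), type = "n", xlab = "x", ylab = "y")

g.simplex(N = 100)                                   # adaptive simplex only
g.ASSA(N = 100, n = 100, delta = 0.995, sig = 0.01)  # hybrid ASSA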

Thanks All! Tune in next time for part 3 of Bayesian text recognition.

 


Bayesian Text Recognition part 2

Hi All
Sorry for the short part one of this post, but I have had this on the back burner for too long and just need to start getting things written. So far all BBI has is an alphabet and a way of plotting it. This post will show how to estimate parameters for a 1-layer neural network (NN) with a fixed number of hidden nodes. The parallel tempering sampling method will be used; this is less efficient than NN-specific optimizers, however in the third entry of this series an unknown number of hidden nodes will be considered.


Figure 1: A smaller neural net with the same structure as the one used here. Red blocks are the input nodes, blue blocks are the hidden nodes, and green blocks are the output nodes.

Figure 1 shows an architecture "similar" to the NN that will be used here. However, the NN in Figure 1 is for a problem with symbols described by 4 rather than 25 binary dimensions, only 6 hidden nodes, and only 2 rather than 16 different symbols. The red blocks are the input nodes. Here these will be either zero or one, determined by X1 – X25 from Table 1 of Bayesian Text Recognition. The hidden nodes output a weighted sum of the input nodes; there is one weight per input and hidden node combination, so the small NN in Fig. 1 has 24 (4*6) such weights. In linear algebra notation, the jth hidden node has value v^{\star}_j = {\bf X} \cdot {\bf W}_j. The v^{\star}_j are transformed to v_j by the probit function (Update: having tried a few things, this seems to only make the problem harder and I don't currently do it, but I think it is worth keeping for generality). Output nodes scale and sum the outputs of the hidden nodes; i.e., the kth output node has value P^{\star}_k = {\bf V} \cdot {\bf Q}_k. The P^{\star}_k are exponentiated and normalized to give a probability of each state; i.e., P_k = \exp(P^{\star}_k) / \sum_{j=1}^K \exp(P^{\star}_j). Thus the probability of the kth symbol in the alphabet is P_k. The use of the exponential to create probabilities in this way is slightly non-standard, but it is similar to the multinomial-Poisson transformation. In any event it seems reasonable.
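
As a sanity check on the notation, here is the forward pass for a single input vector written out in R; this is a sketch that mirrors what the g.LL function in the appended code does with matrices (the probit step mentioned above is left out, as in my current code):

g.forward <- function(x, W, Q)
    {
    # x: length-25 binary input vector
    # W: nnodes x 25 matrix of input-to-hidden weights
    # Q: nnodes x 16 matrix of hidden-to-output weights
    v      = as.vector(W %*% x)      # hidden node values v*_j
    p.star = as.vector(t(Q) %*% v)   # output node values P*_k
    exp(p.star)/sum(exp(p.star))     # normalized probabilities P_k
    }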

For a prior distribution on the w_{i,j} I tried two ideas. The first is a Gaussian with variance 1/N, where N is the number of hidden nodes. The second is just an unsophisticated uniform distribution. I believe the first is Radford Neal's idea, but I no longer have free access to his thesis (graduating sucks). The point is that as the number of hidden nodes of the NN goes up, the prior variance of each weight goes down, keeping the overall flexibility in check. There is a chapter in his (or maybe someone else's) thesis with some great analysis of this; but because this is based on my fallible monkey brain's recollection, it might be some other function that decreases as N gets bigger. Frustratingly, the unsophisticated uniform distribution seems to be the better choice for this problem. "Sometimes reality is unworthy of our creativity."
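
In code the two priors are simple. The commented-out lines in g.REM.mcmc use the Gaussian version with sd = 1/4, which is 1/sqrt(N) for the N = 16 hidden nodes used here; the uniform version is just the bounds check on the parameters (a sketch):

# Gaussian prior: each weight has variance 1/N, N = number of hidden nodes
log.prior.gauss   <- function(w, N = 16) sum(dnorm(w, mean = 0, sd = 1/sqrt(N), log = TRUE))
# uniform prior on [-12, 12]: constant inside the bounds, so it cancels in the MCMC ratio
log.prior.uniform <- function(w, LB = -12, UB = 12) if (all(w >= LB & w <= UB)) 0 else -Inf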

To initialize the sampling process I cheat. See the “g.syms.2.pars” function in the code.

Table 1 shows the confusion matrix of the best model found by the optimization process. The matrix is nearly diagonal; i.e., it classifies all symbols correctly. This is not surprising, as there is one output node for each symbol, so all that needs to happen is that each node fires for exactly one symbol.


Table 1: The confusion matrix of the best model from the MCMC sampling. Each row is a true symbol and each column gives the probability assigned to that symbol by the best model.
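
The probabilities in Table 1 come from the print.name switch of g.LL: calling it on the best model prints the row-normalized output probabilities for every symbol, which is the confusion matrix shown above. With samples being the output of g.REM.mcmc as in the appended code, the best model sits in the first row:

m.best = as.numeric(samples[1, 3:(2 + 41*16)])   # parameter part of the first (best) row
g.LL(P.star = m.best, d.obs = syms, X = syms[,5:29], type = 30, print.name = 1)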

 

So once again I am out of time to write this post. Code is appended.
Tune in next time for RJMCMC choosing the number of nodes.

 

g.sym.plot <-function(v, name = "x")
    {
    plot(0:5,0:5, type= "n", asp=1, xaxt="n", yaxt="n")
    
    x = 0
    y = 1
    for ( i in 1:25)
        {
        x = x+1

        if ( v[i] == 1)
            {
            x0 = x - 1
            x1 = x  
                
            y0 = y - 1
            y1 = y  

            polygon(c(x0, x0, x1, x1), c(y0, y1, y1, y0), col="black" )

            }

        if (x == 5) 
            {
            x = 0
            y = y+1    
            }
        }

    text(2.5, 2.5, name, col="red", cex=6)



    }           
#                1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5     



g.v.maker <-function(type = "x", noise = F, P=0.05)
    {
    v = rep(0, 25)


    if  (type == "x") 
        {
        v[c(1,5,7,9,13,17,19,21,25)] = 1     
        }

        print(type)

    if  (type == "x1") 
        {
        v[c(1,5,7,9,13)] = 1     
        }
    
    if  (type == "x2") 
        {
        v[c(13,17,19,21,25)] = 1     
        }

    if  (type == "x3") 
        {
        v[c(1,7,13,17,21)] = 1     
        }

    if  (type == "x4") 
        {
        v[c(5,9,13,19,25)] = 1     
        }

    if  (type == "x5") 
        {
        v[c(1,7,13,19,25)] = 1     
        }

    if  (type == "x6") 
        {
        v[c(5,9,13,17,21)] = 1     
        }



    if  (type == "o") 
        {
        v[c(1:5,6,10,11,15,16,20,21:25)] = 1     
        }    

    if  (type == "oc1") 
        {
        v[c(1:3,6,11)] = 1     
        }    

    if  (type == "oc2") 
        {
        v[c(3:5,10,15)] = 1     
        }    

    if  (type == "oc3") 
        {
        v[c(23:25,15,20)] = 1     
        }    
    if  (type == "oc4") 
        {
        v[c(11,16,21:23)] = 1     
        }    
    if  (type == "ou1") 
        {
        v[c(1:3,6,11,3:5,10,15 )] = 1     
        }    
    if  (type == "ou2") 
        {
        v[c(3:5,10,15, 23:25,15,20 )] = 1     
        }    
    if  (type == "ou3") 
        {
        v[c(23:25,15,20, 11,16,21:23)] = 1     
        }    
    if  (type == "ou4") 
        {
        v[c(11,16,21:23,1:3,6,11 )] = 1     
        }

    if(noise)
        {
        for ( i in 1:25)
            {
            r = runif(1)
            if (r <= P)
                {
                v[i] = 1
                }    
            if (r >= 1-P)
                {
                v[i] = 0
                }

            }
            
        }


    return(v)    

    }



g.data.maker<-function(n=16, set = c("x", "x1","x2","x3","x4","x5","x6","o", "oc1","oc2","oc3","oc4","ou1","ou2","ou3","ou4") )
{
out1 = matrix(0, nrow= n, ncol=25)
out2 = rep("junk", n)
out3 = rep("x", n)
out4 = rep("x", n)
out5 = rep("Part", n)
out6 = 1:n

for ( i in 1:n)
    {
    j = sample(1:16, 1)
    j = i 
    out2[i]  = set[j]
    out6[i]  = j  
    if (j >= 8)  {out3[i] = "o"}
    if (j >= 2)  {out4[i] = "chevron"}
    if (j >= 6)  {out4[i] = "slash"}
    if (j == 8)  {out4[i] = "o"}
    if (j >= 9)  {out4[i] = "corner"}
    if (j >= 13) {out4[i] = "cup"}
    if (j %in% c(1,8) ) {out5[i] = "Full"}

    out1[i,] = g.v.maker(type = set[j],  noise = F, P=0.05)
    }

out = data.frame( out2,out3,out4,out5, out1, out6)

return(out)

}


syms = g.data.maker()

par(mfrow = c(4,4),mar = c(0.1,0.1, 0.1,0.1))

for ( i in 1:16)
    {
    g.sym.plot(v= syms[i,5:29], name = syms[i,1])
    }

matrix(rnorm(41), nrow=1, ncol=41)

g.LL <-function( P.star = g.syms.2.pars(data= syms) , d.obs = syms, X = syms[,5:29], type =30, print.name = 0 , nnodes =16)
    {
    P =matrix(P.star, nrow = nnodes , ncol =41, byrow=T)
    #print(P)
    Y  = d.obs[,type]

    n.data = nrow(X)

    LL.NA = 0

    #print(P)

    Q  = as.matrix(X)%*%as.matrix(t(P[,1:25]))
    #print("Q")
    #print(Q)
    # keeping this off #Q  = pnorm(Q)

    #print("secound weights")
    #print(as.matrix(P[,26:41]))
    Q2 = Q%*%as.matrix(P[,26:41])

    #print("Q2")
    #print(Q2)

    Q3 = exp(Q2)
    RS = rowSums(Q3)

    #print(Q3)
    #print(RS)

    if(print.name ==1) 
        {
        Q4 = Q3
        for ( i in 1:n.data)
            {
            Q4[i,] = Q4[i,]/RS[i]    
            }
        print(round(Q4,3))
        }

    L = Q3[1:n.data, Y]
    L = diag(L)

    LL.out = sum(log(L) - log(RS))

    #LL.out = LL.out^1  # there are only 16 data points but I am weighing each 1 as 10 data points 

        
    return(LL.out)
    }

g.LL(  d.obs = syms, X = syms[,5:29], type =30)



g.syms.2.pars<-function(data)
{

n= nrow(data)

W = as.numeric(data[1, 5:29])
#colnames(W) = NULL

W    = 5*(W-0.5)
Q    = rep(0,16) 
Q[1] = 1 
#print(1)

out = c(W,Q)

for ( i in 2:n)
    {
    W    = as.numeric(data[i, 5:29])
    W    = 5*(W-0.5)
    Q    = rep(0,16) 
    Q[i] = 1
    #Q    = rnorm(16) 

    out  = c(out, W,Q)

    }
out = out #+ rnorm(length(out), sd=0.01)

return(out)
}
junk= g.syms.2.pars(data= syms)



g.REM.mcmc <- function(N= 1000, DREM = 1 )
{


N.temps  = 10
EX       = seq(0, 2.5, length.out = N.temps )
t.star   = 2^EX
NP       = 41*16  
ACC      = 1
trys     = 2
UB       = 12
LB       = -12
    

M = matrix(0, nrow = N.temps, ncol = 2 + NP)
for ( i in 1:N.temps)
    {
    m.old               = g.syms.2.pars(data= syms)
    ll.old              = g.LL(P.star = m.old)
    print(ll.old)
    M[i,1]              = ll.old
    M[i,3:(2 + NP)]     = m.old 
    M[i,2]              = t.star[i]
    }
#return(0)



M.list = c(ll.old,1, m.old,0)
alpha = 1 
scale.jump =0.000001

l.pi.old = 0 # sum(dnorm(m.old, sd= 1/4, log =T))
best.ll = ll.old +l.pi.old 

for ( i in 1:N)
    {
    print(i)
    # the mcmc steps for each chain
    #if (i <= 2000)
    #    {
        delta.step = (ACC/trys)/(1-ACC/trys) 
        scale.jump  = delta.step* scale.jump 
    #    }

    for (j in 1:N.temps)
        {
        m.old  = M[j,3:(NP+2)]
        ll.old = M[j,1]
        t.star = M[j,2]

        for (k in 1:16)
            { 
            #print(k)
            # next post I am going to write about how to linearize a problem for better choice of step sds
            jump   = rcauchy(n =41, scale=scale.jump*t.star  ) 
            node.pars.index = (1+ 41*(k-1)):(k*41)
            m.new  = m.old[node.pars.index] + jump
            m.new  = m.old + jump
    
            l.pi.old = 0 #= sum(dnorm(m.old, sd= 1/4, log =T))
            l.pi.new = 0 #= sum(dnorm(m.new, sd= 1/4, log =T))
    
            A = 0
            if ((max(m.new) <=UB)&(min(m.new) >= LB))
                {    
                ll.new = g.LL(P.star = m.new)
                # standard mcmc acceptance rule 

                A = exp((ll.new - ll.old + l.pi.new -l.pi.old)/t.star)
                #print(ll.new)
                #print(ll.old)
                #print("A")
                #print(A)
                if(is.na(A)) {A = 0}
                } 
            r = runif(1)
    
            if(t.star == 1) 
                {
                trys = trys +1
                flag =0
                }
    
            
            if (r <= A)
                {
                if (t.star == 1) 
                    {
                    ACC = ACC + 1
                    flag =1
                    }
                M[j,1:(NP +2)] = c(l.pi.new + ll.new , t.star, m.new) 

                if ( l.pi.new + ll.new  > best.ll)
                    {
                        best.ll = g.LL(P.star = m.new, print.name=1)
                        m.best  = m.new
                        

                    }


                }
            }
        }


    # Replica exchange method steps  
    
    r = runif(1)
    if ( (r<= 0.25) & (DREM == 1)) # clearly this (the DREM on/off switch) should also be in the likelihood step, but I am lazy and this is fast anyway 
        {
    
        for (j  in 1:100)
            {
            k = sample(x = 1:N.temps, size= 2, replace =T)
            T1  = M[k[1],2]
            T2  = M[k[2],2]
            LL1 = M[k[1],1]
            LL2 = M[k[2],1]


            A = exp( (LL1- LL2)*( (1/ T1) - (1/T2 )  ))
            r = runif(1)
            if ( r <= A)
                {
                M[k[1],2] = T2 
                M[k[2],2] = T1
                }
            }
        }

    if ( i <= 999) 
        {

        M.list = rbind(M.list, c(M[M[,2]==1,1:(NP+2) ],i))
        rownames(M.list) =NULL
        #print(M.list[,1:10])
        print(round(c(trys, M[M[,2]==1,1],scale.jump,flag,ACC,ACC/trys),6))
        #print(M.list[,c(1:5,659)])

        }    

    # fixed length thinning 
    if ( i > 999) 
        {
        alpha = alpha*1000/(alpha+1000)
        r = runif(1)
        if ( r <= alpha)
            { 
            row = sample(1:1000,1)
            M.list[row,] = c(M[M[,2]==1,1:(NP+2) ],i)
            }
        print(round(c(trys, M[M[,2]==1,1],scale.jump,flag,ACC,ACC/trys),6))
        }    
       

    }




N.s = nrow(M.list)

best.ll = g.LL(P.star = m.best, print.name=1)
M.list = rbind( c(best.ll, 1, m.best, 0) , M.list)

return(M.list) 
}
samples = g.REM.mcmc(N=10000, DREM = 1 )

Bayesian Text Recognition

Hi All
This week I am finally getting around to my neural net post. Thanks for waiting, but to be clear, you are still going to be waiting, because this is not a complete post, just an introduction to the problem.

So I invented an alphabet. It's not useful, but it might be interesting. My alphabet is shown in Table 1.

Table 1: The name, type, sub-type, and dimensions of my alphabet.

There are two types of symbols in my alphabet: x's and o's. These have sub-types: for the x's the sub-types are x, chevron, and slash; for the o's the sub-types are o, corner, and cup. X1 through X25 are the dimensions of the symbols. Figure 1 shows a plot of the 16 symbols.
Figure 1: The alphabet plotted with the dimensions arranged on a 5×5 grid. Ones are plotted as black. The name of each symbol is shown in red.

That is it for now! Tune in later this week (I mean this week) for the actual recognition part of this post.