Some thoughts on neural nets; the importance of random in back-propagation.

Hi All

(this post is a bit rough I am hoping to clean it up over the next few days but I just wanted to get this down)

Artificial Neural Networks are a subject on which traditionally I have held views. Basically, I became very interested in them in the lat 1990s/ early 2000s at which time they were more or less useless. As an undergrad I enjoyed Radford Neal‘s thesis; I did not really understand it. What I did understand was that I did not have access to the computing power required to do much. I also more or less learned though experience that NNs at that time were just not very good, Loess is better at fitting an arbitrary curve, ARIMA at short term for-casting etc. Well things seem to have changed (with strong GPU/CPU combos NNs now really are able to tackle important and difficult problems) so I need to change.  This is going to be the first of a few posts on Artificial Neural Networks. As usual I am going to approach the subject from a mathematical stand point; as is my affliction I am trying not to use any pre packages libraries. This particular post is going to focus on the staple of NNs, back propagation.

Neural Networks like many regression strategies require the estimation of unknown parameters. Back propagation (a special case of gradient decent) is an iterative way of estimating NN parameters for a given data set. It is in principle a deterministic algorithm; something that grates on the MCMC of my sole. This post considers some was to incorporate random steps into Back propagation and tests them on a simple toy problem.

Before exploring any of the modifications to back propagation it is useful to review it. And to do that it is necessary to create a notation for Neural Networks.  Let {\bf d} be an observed data set (what is predicted by a NN), let A be a feature matrix (the input of a NN), and let {\bf w} be the unknown parameters of the NN.

Screen Shot 2019-09-02 at 2.43.57 PM

Figure 1: The structure of an example Neural Network. The circles indicate nodes; they are labelled by their depth/rank and order in their rank. The lines indicate the weights (w); two of the weights are labelled. Weights are specified using their destination node and the rank order of their origin node.

The output of nodes “n 1 1” to “n 1 8” are the feature vectors. The outputs of nodes “n 2 1” to “n 2 4” are denoted v_{2,1} to v_{2,4} and k‘th value can be calculated as

Screen Shot 2019-09-02 at 3.16.44 PM .

That is the logistic of the dot product of the previously layer outputs plus the node bias w_{2,k,b}. As an aside note that the use of the logistic function is purely out of mathematical connivance and “s” shaped function (think a CDF) would work it is just that the logit has nice derivatives.

The result can of course be generalized (but it made describing it a bit easier to show only one case first) as

Screen Shot 2019-09-02 at 3.16.12 PM.

The NN can thus be understood as series of logistic regressions feeding into each other.

 

It is also important to consider the objective function [\phi({\bf w})] of the inversion. That is, how are two NNs evaluated with respect to a given data set? In this work the sum of squared error is used; I hop to have a post exploring different objective function soon.  For sum of squared error is used \phi({\bf w}) = \sum_i (d_i - v_{4,1} )^2.

Finally, before describing back propagation it is helpful to consider optimization in general. That is, given local information about an objective function what are good choices to make in order to find the global best (smallest) value. First it is portably a good idea to generally move in a direction that is “downhill” from your current location. Second If you current location is very flat it you might be near the global value (as the derivative will be zero at the lowest value) so it is advisable to take smaller steps. Third any optimal point found might not be global so it is worth investing some time into checking areas that are not near an observed extrema. Spoilers, back Propagation does the first two very well but ignores the third.

Back propagation works by subtracting the gradient vector (\Delta {\bf w}) to the current weights; i.e., it takes a step down hill. In detail {\bf w}_t = {\bf w}_{t-1} - \eta \Delta {\bf w}_{t-1}. The gradient vector can be found using the chain rule; it is the partial derivative of the objective function (\phi) with respect to each weight (w_i). These derivatives are actually quite easy to get (traditionally I am always trying to integrate things this was a nice change).

Screen Shot 2019-09-02 at 7.09.34 PM The point is that each term closer to the start of the network can have its gradient calculated by multiplying on another derivative of the last term. The only complication comes from possible multiple paths. For example consider w_{1,1,1},

Screen Shot 2019-09-05 at 9.30.38 PM

The summation of the fourth line is the weighting of the two paths from the input layer to the 4th layer. As networks become more complicated the number of paths increase; a good exercise is to add some more nodes in the second and third layers and find the derivatives.

The rest of this point considers three small modifications to back propagation and tests them on a small data set. The data ({\bf d}) set has 40 observations. Each observation is has eight features (resented by the matrix A) and represents a logic gate. The features are the two propositions \alpha and \beta, the five logic gate types (“if-then”, “Contrapositive”, “Xor”, “Or”, “And”), and a negation term. The data are noiseless. The full tableau of the data is

Screen Shot 2019-09-06 at 7.06.45 PM.png.

The first modification is to iteratively reduce the step size (\eta) if  \phi_t > \phi_{t-1}. That is at each step t while \phi_t > \phi_{t-1} \eta is halved. The see appended code; this is done using a while loop so many reductions may occur at a single step. The reason for this is that since the gradient is know there must be a small enough step size such that a step of that magnitude in the direction of the gradient must improve the model. Thus if a step in the direction of the gradient worsens the model the step must be too big.

The second modification is to add random noise to proposed model; i.e., {\bf w}_t =  {\bf w}_{t-1} + {\bf e}_t. The {\bf e}_{\star} are Gaussian random variables with mean zero and variance selected such that \text{E}\left( |{\bf e}_t | \right) = 0.5 \eta |\Delta {\bf w}|.

The third modification is to periodically add not just the small jitter of the second modification to the proposed weights but also to add errors with a magnitude equal to ${\bf w}_{t-1}$. Here this additional randomization is done 1/10th of the iterations.

In order to add some more interest to the test two initial step sizes where used for the first modifications. Thus there are five methods (denoted M1-M5) in total:

  1. Standard Backwards propagation with \eta = 0.1
  2. Reduced step size with \eta = 1
  3. Reduced step size \eta = 10
  4.  Small random jitter \eta = 1
  5. Small random jitter and periodic large random jitter \eta = 1

All methods were run 100 times. Each run used the same random starting weights for all methods and each run was capped to a maximum of 25,000 iterations. Optimizations were stoped before 25,000 iterations if the smallest absolute error term was lower than 0.01.

The results of the inversions are shown in table 1. In order to evaluate the efficiency of the five methods two values are considered; these are the % of models that managed to converge (that is had maximum absolute misfit  < 0.01 before 25,000 steps) and the average number of forward model (evaluating the NN for data with the given weights; this is done once every time a step is evaluated) calculations that were needed. It seems reasonable to assume an exponential distribution and thus adjust the average number of forward models by the number that are censored (took more than 25k steps).

 

Screen Shot 2019-09-05 at 10.24.27 PM

Table 1: The average number of forward models, conversion rates and adjusted forward models for the 5 estimation methods.

The clear winner is M5 adding both the small and larger jitter!

That is it for now

P.S.

The code used in this example. As usual I have tried to reduce the number of “black box” functions. Also note there was a failed 6th inversion method that I will talk about next week.

 

A.= matrix(c(0, 1,0,0,0,0, 1, 1,
             0, 1,0,0,0,0, 0, 1,
             0, 1,0,0,0,0, 1, 0,
             0, 1,0,0,0,0, 0, 0,
             0, 0,1,0,0,0, 1, 1,
             0, 0,1,0,0,0, 0, 1,
             0, 0,1,0,0,0, 1, 0,
             0, 0,1,0,0,0, 0, 0,
             0, 0,0,1,0,0, 1, 1,
             0, 0,0,1,0,0, 0, 1,
             0, 0,0,1,0,0, 1, 0,
             0, 0,0,1,0,0, 0, 0,
             0, 0,0,0,1,0, 1, 1,
             0, 0,0,0,1,0, 0, 1,
             0, 0,0,0,1,0, 1, 0,
             0, 0,0,0,1,0, 0, 0,
             0, 0,0,0,0,1, 1, 1,
             0, 0,0,0,0,1, 0, 1,
             0, 0,0,0,0,1, 1, 0,
             0, 0,0,0,0,1, 0, 0,
             1, 1,0,0,0,0, 1, 1,
             1, 1,0,0,0,0, 0, 1,
             1, 1,0,0,0,0, 1, 0,
             1, 1,0,0,0,0, 0, 0,
             1, 0,1,0,0,0, 1, 1,
             1, 0,1,0,0,0, 0, 1,
             1, 0,1,0,0,0, 1, 0,
             1, 0,1,0,0,0, 0, 0,
             1, 0,0,1,0,0, 1, 1,
             1, 0,0,1,0,0, 0, 1,
             1, 0,0,1,0,0, 1, 0,
             1, 0,0,1,0,0, 0, 0,
             1, 0,0,0,1,0, 1, 1,
             1, 0,0,0,1,0, 0, 1,
             1, 0,0,0,1,0, 1, 0,
             1, 0,0,0,1,0, 0, 0,
             1, 0,0,0,0,1, 1, 1,
             1, 0,0,0,0,1, 0, 1,
             1, 0,0,0,0,1, 1, 0,
             1, 0,0,0,0,1, 0, 0)
               , nrow=40, ncol=8, byrow=T)

colnames(A.) = c("Not", "And", "or", "Xor", "contra", "IfThen", "alpha", "beta")

#A.=rbind(A., A. + (runif(320)-0.5)/2,  A. + (runif(320)-0.5)/2,  A. + (runif(320)-0.5)/2 )


d. = matrix(c(1,0,0,0, 1,1,1,0, 0,1,1,0, 1,0,1,1, 1,1,0,1,
              0,1,1,1, 0,0,0,1, 1,0,0,1, 0,1,0,0, 0,0,1,0), nrow= 40, ncol=1 ) 


g.logit <-function(x)
{
	 return(1/(1+exp(-x)))

}


g.phy <-function(w, A, d) 
{
sigma41 = g.pred(w, A, d)
phy  = 0.5*sum((d-sigma41)^2) 
return(phy)	
}


g.pred <-function(w, A, d) 
{
zeta21  = w[1]*A[,1] + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 =g.logit(zeta41)

return(sigma41)	
}

g.phy.prime <-function(w, A, d)
{
# get the grad vecter 

zeta21  = w[1]*A[,1]  + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 =g.logit(zeta41)
 
gradphy = (sigma41 - d)

grad41 = sigma41*(1-sigma41)

grad31 = sigma31*(1-sigma31)
grad32 = sigma32*(1-sigma32)

grad21 = sigma21*(1-sigma21)
grad22 = sigma22*(1-sigma22)
grad23 = sigma23*(1-sigma23)
grad24 = sigma24*(1-sigma24)

 #print(grad31)
 #return(0)

gradmat = cbind( (grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21*A[,1],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21*A[,2],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21*A[,3],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21*A[,4],
	             (grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21*A[,5],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21*A[,6],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21*A[,7],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21*A[,8],(grad32*w[48]*w[42] + grad31*w[47]*w[37])*gradphy*grad41*grad21 ,
                 (grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22*A[,1],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22*A[,2],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22*A[,3],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22*A[,4], 
                 (grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22*A[,5],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22*A[,6],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22*A[,7],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22*A[,8],(grad32*w[48]*w[43] + grad31*w[47]*w[38])*gradphy*grad41*grad22, 
                 (grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23*A[,1],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23*A[,2],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23*A[,3],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23*A[,4], 
                 (grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23*A[,5],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23*A[,6],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23*A[,7],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23*A[,8],(grad32*w[48]*w[44] + grad31*w[47]*w[39])*gradphy*grad41*grad23, 
                 (grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24*A[,1],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24*A[,2],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24*A[,3],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24*A[,4], 
                 (grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24*A[,5],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24*A[,6],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24*A[,7],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24*A[,8],(grad32*w[48]*w[45] + grad31*w[47]*w[40])*gradphy*grad41*grad24, 

                 w[47]*gradphy*grad41*grad31*sigma21, w[47]*gradphy*grad41*grad31*sigma22, w[47]*gradphy*grad41*grad31*sigma23, w[47]*gradphy*grad41*grad31*sigma24, w[47]*gradphy*grad41*grad31, # 5 
                 w[48]*gradphy*grad41*grad32*sigma21, w[48]*gradphy*grad41*grad32*sigma22, w[48]*gradphy*grad41*grad32*sigma23, w[48]*gradphy*grad41*grad32*sigma24, w[48]*gradphy*grad41*grad32, # 5 
                 
                 gradphy*grad41*sigma31, gradphy*grad41*sigma32 ,gradphy*grad41 # 3 
                 )
#print(gradmat)

grad = apply(gradmat, 2, sum )	

return(grad)

}


g.LinPro <-function(w, A, d)
{
# get the grad vecter 

zeta21  = w[1]*A[,1]  + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 =g.logit(zeta41)
 
gradphy = (sigma41 - d)

grad41 = sigma41*(1-sigma41)

grad31 = sigma31*(1-sigma31)
grad32 = sigma32*(1-sigma32)

grad21 = sigma21*(1-sigma21)
grad22 = sigma22*(1-sigma22)
grad23 = sigma23*(1-sigma23)
grad24 = sigma24*(1-sigma24)

 #print(grad31)
 #return(0)

A_hat   = cbind( (grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21*A[,1],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21*A[,2],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21*A[,3],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21*A[,4],
	             (grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21*A[,5],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21*A[,6],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21*A[,7],  (grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21*A[,8],(grad32*w[48]*w[42] + grad31*w[47]*w[37])*grad41*grad21 ,
                 (grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22*A[,1],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22*A[,2],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22*A[,3],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22*A[,4], 
                 (grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22*A[,5],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22*A[,6],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22*A[,7],  (grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22*A[,8],(grad32*w[48]*w[43] + grad31*w[47]*w[38])*grad41*grad22, 
                 (grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23*A[,1],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23*A[,2],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23*A[,3],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23*A[,4], 
                 (grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23*A[,5],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23*A[,6],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23*A[,7],  (grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23*A[,8],(grad32*w[48]*w[44] + grad31*w[47]*w[39])*grad41*grad23, 
                 (grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24*A[,1],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24*A[,2],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24*A[,3],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24*A[,4], 
                 (grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24*A[,5],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24*A[,6],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24*A[,7],  (grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24*A[,8],(grad32*w[48]*w[45] + grad31*w[47]*w[40])*grad41*grad24, 

                 w[47]*grad41*grad31*sigma21, w[47]*grad41*grad31*sigma22, w[47]*grad41*grad31*sigma23, w[47]*grad41*grad31*sigma24, w[47]*grad41*grad31, # 5 
                 w[48]*grad41*grad32*sigma21, w[48]*grad41*grad32*sigma22, w[48]*grad41*grad32*sigma23, w[48]*grad41*grad32*sigma24, w[48]*grad41*grad32, # 5 
                 
                 grad41*sigma31, grad41*sigma32 ,grad41 # 3 
                 )
#print(gradmat)


AtA     =  t(A_hat)%*%A_hat
Q       = svd(AtA)
#print(Q)
D       = Q$d+0.001 # stablize eigen value  
inv.D = 1/D 
inv.D = diag(inv.D)

inv.AtA = Q$u%*%inv.D%*%t(Q$v)
#print(round(inv.AtA%*%AtA),3)

w.prime = inv.AtA%*%t(A_hat)%*%d
#print(w.prime)	

return(w.prime)

}


NNplot<-function(w = w.old)
{

plot(1:4, 1:4, type="n", xlab= "", ylab="", ylim=c(0,9), xlim=c(0,5), xaxt="n", yaxt="n")


max.abs.w = max(abs(w))  
w.raw = w 

w  = 10*w/max(abs(w)) 
g.col = rep("blue", 49)
g.col[w < 0] = "red"

#print(w[1:8])


segments(x0=1, x1=2, y0=1, y1 = 3, lwd=abs(w[1]), col=g.col[1])
segments(x0=1, x1=2, y0=2, y1 = 3, lwd=abs(w[2]), col=g.col[2])
segments(x0=1, x1=2, y0=3, y1 = 3, lwd=abs(w[3]), col=g.col[3])
segments(x0=1, x1=2, y0=4, y1 = 3, lwd=abs(w[4]), col=g.col[4])
segments(x0=1, x1=2, y0=5, y1 = 3, lwd=abs(w[5]), col=g.col[5])
segments(x0=1, x1=2, y0=6, y1 = 3, lwd=abs(w[6]), col=g.col[6])
segments(x0=1, x1=2, y0=7, y1 = 3, lwd=abs(w[7]), col=g.col[7])
segments(x0=1, x1=2, y0=8, y1 = 3, lwd=abs(w[8]), col=g.col[8])

segments(x0=1, x1=2, y0=1, y1 = 4, lwd=abs(w[10]), col=g.col[10])
segments(x0=1, x1=2, y0=2, y1 = 4, lwd=abs(w[11]), col=g.col[11])
segments(x0=1, x1=2, y0=3, y1 = 4, lwd=abs(w[12]), col=g.col[12])
segments(x0=1, x1=2, y0=4, y1 = 4, lwd=abs(w[13]), col=g.col[13])
segments(x0=1, x1=2, y0=5, y1 = 4, lwd=abs(w[14]), col=g.col[14])
segments(x0=1, x1=2, y0=6, y1 = 4, lwd=abs(w[15]), col=g.col[15])
segments(x0=1, x1=2, y0=7, y1 = 4, lwd=abs(w[16]), col=g.col[16])
segments(x0=1, x1=2, y0=8, y1 = 4, lwd=abs(w[17]), col=g.col[17])

segments(x0=1, x1=2, y0=1, y1 = 5, lwd=abs(w[19]), col=g.col[19])
segments(x0=1, x1=2, y0=2, y1 = 5, lwd=abs(w[20]), col=g.col[20])
segments(x0=1, x1=2, y0=3, y1 = 5, lwd=abs(w[21]), col=g.col[21])
segments(x0=1, x1=2, y0=4, y1 = 5, lwd=abs(w[22]), col=g.col[22])
segments(x0=1, x1=2, y0=5, y1 = 5, lwd=abs(w[23]), col=g.col[23])
segments(x0=1, x1=2, y0=6, y1 = 5, lwd=abs(w[24]), col=g.col[24])
segments(x0=1, x1=2, y0=7, y1 = 5, lwd=abs(w[25]), col=g.col[25])
segments(x0=1, x1=2, y0=8, y1 = 5, lwd=abs(w[26]), col=g.col[26])

segments(x0=1, x1=2, y0=1, y1 = 6, lwd=abs(w[28]), col=g.col[28])
segments(x0=1, x1=2, y0=2, y1 = 6, lwd=abs(w[29]), col=g.col[29])
segments(x0=1, x1=2, y0=3, y1 = 6, lwd=abs(w[30]), col=g.col[30])
segments(x0=1, x1=2, y0=4, y1 = 6, lwd=abs(w[31]), col=g.col[31])
segments(x0=1, x1=2, y0=5, y1 = 6, lwd=abs(w[32]), col=g.col[32])
segments(x0=1, x1=2, y0=6, y1 = 6, lwd=abs(w[33]), col=g.col[33])
segments(x0=1, x1=2, y0=7, y1 = 6, lwd=abs(w[34]), col=g.col[34])
segments(x0=1, x1=2, y0=8, y1 = 6, lwd=abs(w[35]), col=g.col[35])

segments(x0=2, x1=3, y0=3, y1 = 4, lwd=abs(w[37]), col=g.col[37])
segments(x0=2, x1=3, y0=4, y1 = 4, lwd=abs(w[38]), col=g.col[38])
segments(x0=2, x1=3, y0=5, y1 = 4, lwd=abs(w[39]), col=g.col[39])
segments(x0=2, x1=3, y0=6, y1 = 4, lwd=abs(w[40]), col=g.col[40])

segments(x0=2, x1=3, y0=3, y1 = 5, lwd=abs(w[42]), col=g.col[42])
segments(x0=2, x1=3, y0=4, y1 = 5, lwd=abs(w[43]), col=g.col[43])
segments(x0=2, x1=3, y0=5, y1 = 5, lwd=abs(w[44]), col=g.col[44])
segments(x0=2, x1=3, y0=6, y1 = 5, lwd=abs(w[45]), col=g.col[45])

segments(x0=3, x1=4, y0=4, y1 = 4.5, lwd=abs(w[47]), col=g.col[47])
segments(x0=3, x1=4, y0=5, y1 = 4.5, lwd=abs(w[48]), col=g.col[48])

symbols(x=1, y=1, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=2, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=3, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=6, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=7, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=8, circles=0.5 , add=T, inches = .25, bg ="grey")

text(x=rep(1,8), y= 1:8, labels = c("Not", "And", "or", "Xor", "contra", "IfThen", "alpha", "beta") )


symbols(x=2, y=3, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=6, circles=0.5 , add=T, inches = .25, bg ="grey")

text(x=rep(2,4), y= 3:6, labels = round(c(w.raw[9], w.raw[18], w.raw[27], w.raw[36]),3) )


symbols(x=3, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=3, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
text(x=rep(3,2), y= 4:5, labels = round(c(w.raw[41], w.raw[46]),3) )


symbols(x=4, y=4.5, circles=0.5 , add=T, inches = .25, bg ="grey")
text(x=4, y= 4.5, labels = round(c(w.raw[49]),3) )


rang.w   = rev(seq(- max.abs.w, max.abs.w, length.out = 6 ))
rang.lwd = abs(seq(-10, 10, length.out = 6 ))
rang.col = rev(c(rep("red",3), rep("blue",3)))
legend("topright", col= rang.col, legend = round(rang.w, 3), lwd=rang.lwd, title = "line - w" )

}

NN2 <-function(w = (runif(49)-0.5)*10 , A=A., d=d., N= 10000, mod = 1  )
{

#print(w)
phy = g.phy(w, A,d)
#print(phy)

rec = c(phy, w) 


phy.best = phy 
w.best   = w 

fm        = 0 
phy.last  = phy 
i.last    = 1 
w.last    = w 

for ( i in 1:N)
		{

		# get the grad vecter 
		grad = g.phy.prime(w, A, d)

		#update the weights 
		if (mod == 1 )
			{
		    eta       = 0.1
            w.new     = w-eta*grad
            #new misfit 
            phy.new = g.phy(w.new, A,d)
			}
		if (mod == 2 )
			{
		    eta       = 1
            w.new     = w-eta*grad
            #new misfit 
            phy.new = g.phy(w.new, A,d)
			}

		if (mod == 3 )
			{
		    eta       = 10
            w.new     = w-eta*grad
            #new misfit 
            phy.new = g.phy(w.new, A,d)
			}

        if (mod == 4 )
            {
            eta       = 1
            w.new     =eta*g.LinPro(w, A, d)+(1-eta)*w
            #new misfit 
            phy.new = g.phy(w.new, A,d)
            }


        if (mod == 5 )
            {
            eta       = 1
            mag.grad  = sqrt(sum(grad*grad))
            j = rnorm(49, sd = mag.grad /sqrt(49))
            #j = rnorm(49, sd = 1)


            #print(sqrt(sum(j*j)))
            w.new     = w-eta*(grad+j)
            #new misfit 
            phy.new = g.phy(w.new, A,d)

            }

        if (mod == 6 )
            {
            eta       = 1

            r = runif(1)
            if( r <= 0.1)
                {
                mag.w = sqrt(sum(w*w)) 
                j = rnorm(49, sd = mag.w /sqrt(49))
                }
            if( r > 0.1)
                {
                mag.grad  = sqrt(sum(grad*grad))
                j = rnorm(49, sd = mag.grad /sqrt(49))
                }

            #j = rnorm(49, sd = 1)


            #print(sqrt(sum(j*j)))
            w.new     = w-eta*(grad+j)
            #new misfit 
            phy.new = g.phy(w.new, A,d)

            }

        k = 0 
        fm = fm + 1 

        # check  that new is better  
        while( (phy.new > phy)& (k <= 100) )
        		{
                # normal back prop step shrink
                if(mod <= 3)
                    {    
                    eta = eta/2
                    #print(paste("eta", eta))
		            w.new     = w-eta*grad
                    phy.new = g.phy(w.new, A,d)
                    fm = fm + 1 
                    }

                # Linearized step 
                if (mod == 4 )
                    {
                    eta       = eta/2
                    w.new     =eta*g.LinPro(w, A, d)+(1-eta)*w
                    #new misfit 
                    phy.new = g.phy(w.new, A,d)
                    fm = fm + 1 
                    }


                if (mod == 5 )
                    {
                    #print("here")
                    eta       = eta/2
                    mag.grad  = sqrt(sum(grad*grad))
                    j = rnorm(49, sd = mag.grad /sqrt(49))
                    #j = rnorm(49, sd = 1)
                    w.new     = w-eta*(grad+j)

                    #new misfit 
                    phy.new = g.phy(w.new, A,d)
                    fm = fm + 1 
                    }


                if (mod == 6 )
                    {
                    #print("here")
                    eta       = eta/2


                r = runif(1)
                if( r <= 0.1)
                    {
                    mag.w = sqrt(sum(w*w)) 
                    j = rnorm(49, sd = mag.w /sqrt(49))
                    }
                if( r > 0.1)
                    {
                    mag.grad  = sqrt(sum(grad*grad))
                    j = rnorm(49, sd = mag.grad /sqrt(49))
                    }
    
                    #j = rnorm(49, sd = 1)
                    w.new     = w-eta*(grad+j)

                    #new misfit 
                    phy.new = g.phy(w.new, A,d)
                    fm = fm + 1 
                    }

                k = k+ 1 

                if(k == 50)
                    {print("k==100!")}

                if(k == 100)
                    {print("k==100!")}
        		}
        phy = phy.new
        w = w.new

        if(phy < phy.best)
        		{
        		phy.best = phy
        		w.best = w
        		}

        # add the model to the  
		rec = rbind(rec, c(phy,w))
        rownames(rec)= NULL 
        sigma41 = g.pred(w, A, d)
        r = runif(1)
        if (r <= 0.001)
        		{
        		#print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
		        #print(phy)
                #print(paste("delta log phy = ", log(phy.last) - log(phy) ))
                #print(paste("max abs error", max(abs(d-sigma41))) )
                #if ( log(phy.last) - log(phy) >= 0.5 )
                #    {
                #    quartz()
                #    par(mfrow=c(2,1), mar=c(4,4,1,1))
                #    NNplot(w.last)
                #    plot( log(rec[1:i.last,1] ), pch=".", xlab = "Index", ylab = "Phy")
                #    print("------------------> new plot <---------------------------------")
                #    }                
                phy.last = phy 
                w.last = w
                i.last = i  

        		}

		if(max(abs(d-sigma41)) <= 0.01 )
			{
            #print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
            #quartz()
            #par(mfrow=c(2,1), mar=c(4,4,1,1))
            #NNplot(w.last)
            #plot( log(rec[1:i.last,1] ), pch=".", xlab = "Index", ylab = "Phy")
			#return(c(fm, 1))
            return(rec)

			}

		}

return(c(fm, 0))
}
    w.start = (runif(49)-0.5) 
    rec1  = NN2(  mod = 6,   w = w.start,      N= 25000,A= A.,  d = d.)

 NNplot(w.last[10,2:50])


out = matrix(0, nrow=100, ncol = 10 )
for(i in 1:100)
    {
        print(i)
    w.start = (runif(49)-0.5) 
    rec1  = NN2(  mod = 1,   w = w.start,      N= 25000,A= A.,  d = d.)
    rec2  = NN2(  mod = 2,   w = w.start,      N= 25000,A= A.,  d = d.)
    rec3  = NN2(  mod = 3,   w = w.start,      N= 25000,A= A.,  d = d.)
    #rec4  = NN2(  mod = 4,   w = w.start,      N= 25000,A= A.,  d = d.)
    rec5  = NN2(  mod = 5,   w = w.start,      N= 25000,A= A.,  d = d.)
    rec6  = NN2(  mod = 6,   w = w.start,      N= 25000,A= A.,  d = d.)
    out[i, 1] = rec1[1]
    out[i, 2] = rec2[1]
    out[i, 3] = rec3[1]
    out[i, 4] = rec5[1]
    out[i, 5] = rec6[1]

    out[i, 6] = rec1[2]
    out[i, 7] = rec2[2]
    out[i, 8] = rec3[2]
    out[i, 9] = rec5[2]
    out[i,10] = rec6[2]

    print(i)
    print(apply(out, 2, mean))
    print("--------------------------")

    }




 

1 thought on “Some thoughts on neural nets; the importance of random in back-propagation.

  1. Pingback: Neural Nets and linearization; what seemed to be a good idea just does not work. | Bayesian Business Intelligence

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s