Optimizing the choice of objective function

Hi all

With my last couple posts I have been exploring neural nets. This post continues the topic with a quick look at objective functions.

In the last two posts I glossed over the assumption of the objective function. I used the sum of squared error (SSE) but that is a bit of strange choice as the data being modelled are dichotomous. This post gives a more rigorously objective function and explore the difference between it and SSE.

Recall that the SEE objective function for a network with one output node is $\phi ({\bf W}) = \sum_i [d_i - v_i ({\bf W})]^2$ where ${\bf d}$ are the observed data, ${\bf W}$ are the network weights and $v_i ({\bf W})$ is the output of the neural network for the i’th observation written as a function of the network weights.

Also recall that the probability of a sample of IID Gaussian random variables with known variance $\sigma^2 = 1$ is $P({\bf d}) = (2 \pi)^{-1/2} \exp[-1/2 \sum_i (d_i - \mu)^2 ]$. Thus the -2 log-probability for a Gaussian distribution is the SSE; this is also called the deviance.

Therefore (my favourite pretentious word), a good way to make a make an objective function is to use the  -2 log-probability for a distribution that is more representative of dichotomous data.  Here I will use the Bernoulli distribution. The probability of a set of observed Bernoulli random variables with $p = v_i ({\bf W})$ is $\text{Prob}({\bf d}) = \Pi_i [ v_i ({\bf W})]^{d_i} [1-v_i ({\bf W}) ]^{1-d_i}$. Thus the deviance and our objective function is $\phi({\bf W}) = -2 \sum_i (d_i \log[v_i ({\bf W})]+ [1-d_i] \log[1-v_i ({\bf W})] )$. The partial derivative (used in back propagation) for this objective function is $(\partial / \partial v_i) \phi({\bf W}) = -2 \sum_i (d_i / v_i ({\bf W}) - [1-d_i]/[1-v_i ({\bf W})] )$.

So what is the difference between the SSE and Bernoulli objective functions? Consider a data set with one observation $d_1 = 1$. Figure 1 shows the two objective functions as a function of $v_1$ in the range $[0.001, 1]$. As can be seen the SSE penalizes bad fits (relative to good fits) much less than Bernoulli. In particular as fit with $v_1 = 0.01$ is meaningfully better than a fit with $v_1 = 0.001$.

That is it for now.

Neural Nets and linearization; what seemed to be a good idea just does not work.

Hi all

The TLDR of this post is that I tried something, it did not work but it seemed enough of a good idea that it is worth writing down incase anyone can improve it.

Those that look at the code in the last post would have seen an un mentioned “mode 4” of getting the weights of a NN.  That mode or method is to linearize the NN inverse (using the data to solve for the unknown weights) problem.

To describe linearization consider a classic linear inverse problem. Let $A$ be the sensitivity matrix ( $n$ rows by $k$ columns), ${\bf d}$ be the data (a column vector of $n$ rows), and ${\bf m}$ be the unknown model parameters (a column vector of $k$ rows). Further more, let the the data covariance matrix be known to be the $I_n$ (an $n$ by $n$ identify matrix); i.e., the model errors ${\bf e}$ are independent gaussians with variance equal to one . Thus ${\bf d} = A{\bf m} + {\bf e}$.

The MLE estimate for the parameters is the well known $\hat{\bf m} = (A' A)^{-1} A' {\bf d}$ and the posterior covariance matrix of ${\bf m}$ is $(A' A)$. Note “prime” here denotes matrix transpose.

So how is this relevant to NNs?

The point is that the $i,j$th element of $A$ is the partial derivative of the $i$th observation with respect to the $j$th parameter. In back propagation those derivatives are analytically calculated. So it is possible to construct an approximation to $A$ that is denoted here as $B$ (I know a creative naming scheme). The $i,j$th element of $B$ is $b_{i,j} = (\partial /\partial w_j ) d_{i}$ where ${\bf W}$ are the weights to the NN.

Thus it should (remember this did not actually work in the end) be possible make iterative estimates of ${\bf W}$ by ${ \bf W}_t = ({B'}_{t-1} B_{t-1})^{-1} {B'}_{t-1} {\bf d}$, but in practice I found that this caused the ${\bf W}_t$ to cycle over a group of bad (poorly fitting values) and more or less never converge.

I don’t have a good reason why this does not work other than it looks like the NN gradients are just not smooth enough. It is also possible that because the in my toy problem there is not enough data (asymptotically all posteriors become gaussians as data is added; the proof of this is very similar to the central limit theorem). I have not tried starting at an optimal solution and seeing if this method stays there; but will be doing that soon as regardless of linearization value in optimization it is useful for initializing MCMC sampling.

That’s it for now.

Some thoughts on neural nets; the importance of random in back-propagation.

Hi All

(this post is a bit rough I am hoping to clean it up over the next few days but I just wanted to get this down)

Artificial Neural Networks are a subject on which traditionally I have held views. Basically, I became very interested in them in the lat 1990s/ early 2000s at which time they were more or less useless. As an undergrad I enjoyed Radford Neal‘s thesis; I did not really understand it. What I did understand was that I did not have access to the computing power required to do much. I also more or less learned though experience that NNs at that time were just not very good, Loess is better at fitting an arbitrary curve, ARIMA at short term for-casting etc. Well things seem to have changed (with strong GPU/CPU combos NNs now really are able to tackle important and difficult problems) so I need to change.  This is going to be the first of a few posts on Artificial Neural Networks. As usual I am going to approach the subject from a mathematical stand point; as is my affliction I am trying not to use any pre packages libraries. This particular post is going to focus on the staple of NNs, back propagation.

Neural Networks like many regression strategies require the estimation of unknown parameters. Back propagation (a special case of gradient decent) is an iterative way of estimating NN parameters for a given data set. It is in principle a deterministic algorithm; something that grates on the MCMC of my sole. This post considers some was to incorporate random steps into Back propagation and tests them on a simple toy problem.

Before exploring any of the modifications to back propagation it is useful to review it. And to do that it is necessary to create a notation for Neural Networks.  Let ${\bf d}$ be an observed data set (what is predicted by a NN), let $A$ be a feature matrix (the input of a NN), and let ${\bf w}$ be the unknown parameters of the NN.

Figure 1: The structure of an example Neural Network. The circles indicate nodes; they are labelled by their depth/rank and order in their rank. The lines indicate the weights (w); two of the weights are labelled. Weights are specified using their destination node and the rank order of their origin node.

The output of nodes “n 1 1” to “n 1 8” are the feature vectors. The outputs of nodes “n 2 1” to “n 2 4” are denoted $v_{2,1}$ to $v_{2,4}$ and $k$‘th value can be calculated as

.

That is the logistic of the dot product of the previously layer outputs plus the node bias $w_{2,k,b}$. As an aside note that the use of the logistic function is purely out of mathematical connivance and “s” shaped function (think a CDF) would work it is just that the logit has nice derivatives.

The result can of course be generalized (but it made describing it a bit easier to show only one case first) as

.

The NN can thus be understood as series of logistic regressions feeding into each other.

It is also important to consider the objective function [$\phi({\bf w})$] of the inversion. That is, how are two NNs evaluated with respect to a given data set? In this work the sum of squared error is used; I hop to have a post exploring different objective function soon.  For sum of squared error is used $\phi({\bf w}) = \sum_i (d_i - v_{4,1} )^2$.

Finally, before describing back propagation it is helpful to consider optimization in general. That is, given local information about an objective function what are good choices to make in order to find the global best (smallest) value. First it is portably a good idea to generally move in a direction that is “downhill” from your current location. Second If you current location is very flat it you might be near the global value (as the derivative will be zero at the lowest value) so it is advisable to take smaller steps. Third any optimal point found might not be global so it is worth investing some time into checking areas that are not near an observed extrema. Spoilers, back Propagation does the first two very well but ignores the third.

Back propagation works by subtracting the gradient vector ($\Delta {\bf w}$) to the current weights; i.e., it takes a step down hill. In detail ${\bf w}_t = {\bf w}_{t-1} - \eta \Delta {\bf w}_{t-1}$. The gradient vector can be found using the chain rule; it is the partial derivative of the objective function ($\phi$) with respect to each weight ($w_i$). These derivatives are actually quite easy to get (traditionally I am always trying to integrate things this was a nice change).

The point is that each term closer to the start of the network can have its gradient calculated by multiplying on another derivative of the last term. The only complication comes from possible multiple paths. For example consider $w_{1,1,1}$,

The summation of the fourth line is the weighting of the two paths from the input layer to the 4th layer. As networks become more complicated the number of paths increase; a good exercise is to add some more nodes in the second and third layers and find the derivatives.

The rest of this point considers three small modifications to back propagation and tests them on a small data set. The data (${\bf d}$) set has 40 observations. Each observation is has eight features (resented by the matrix $A$) and represents a logic gate. The features are the two propositions $\alpha$ and $\beta$, the five logic gate types (“if-then”, “Contrapositive”, “Xor”, “Or”, “And”), and a negation term. The data are noiseless. The full tableau of the data is

.

The first modification is to iteratively reduce the step size ($\eta$) if  $\phi_t$ > $\phi_{t-1}$. That is at each step $t$ while $\phi_t$ > $\phi_{t-1}$ $\eta$ is halved. The see appended code; this is done using a while loop so many reductions may occur at a single step. The reason for this is that since the gradient is know there must be a small enough step size such that a step of that magnitude in the direction of the gradient must improve the model. Thus if a step in the direction of the gradient worsens the model the step must be too big.

The second modification is to add random noise to proposed model; i.e., ${\bf w}_t = {\bf w}_{t-1} + {\bf e}_t$. The ${\bf e}_{\star}$ are Gaussian random variables with mean zero and variance selected such that $\text{E}\left( |{\bf e}_t | \right) = 0.5 \eta |\Delta {\bf w}|$.

The third modification is to periodically add not just the small jitter of the second modification to the proposed weights but also to add errors with a magnitude equal to ${\bf w}_{t-1}$. Here this additional randomization is done 1/10th of the iterations.

In order to add some more interest to the test two initial step sizes where used for the first modifications. Thus there are five methods (denoted M1-M5) in total:

1. Standard Backwards propagation with $\eta = 0.1$
2. Reduced step size with $\eta = 1$
3. Reduced step size $\eta = 10$
4.  Small random jitter $\eta = 1$
5. Small random jitter and periodic large random jitter $\eta = 1$

All methods were run 100 times. Each run used the same random starting weights for all methods and each run was capped to a maximum of 25,000 iterations. Optimizations were stoped before 25,000 iterations if the smallest absolute error term was lower than 0.01.

The results of the inversions are shown in table 1. In order to evaluate the efficiency of the five methods two values are considered; these are the % of models that managed to converge (that is had maximum absolute misfit  < 0.01 before 25,000 steps) and the average number of forward model (evaluating the NN for data with the given weights; this is done once every time a step is evaluated) calculations that were needed. It seems reasonable to assume an exponential distribution and thus adjust the average number of forward models by the number that are censored (took more than 25k steps).

Table 1: The average number of forward models, conversion rates and adjusted forward models for the 5 estimation methods.

The clear winner is M5 adding both the small and larger jitter!

That is it for now

P.S.

The code used in this example. As usual I have tried to reduce the number of “black box” functions. Also note there was a failed 6th inversion method that I will talk about next week.

A.= matrix(c(0, 1,0,0,0,0, 1, 1,
0, 1,0,0,0,0, 0, 1,
0, 1,0,0,0,0, 1, 0,
0, 1,0,0,0,0, 0, 0,
0, 0,1,0,0,0, 1, 1,
0, 0,1,0,0,0, 0, 1,
0, 0,1,0,0,0, 1, 0,
0, 0,1,0,0,0, 0, 0,
0, 0,0,1,0,0, 1, 1,
0, 0,0,1,0,0, 0, 1,
0, 0,0,1,0,0, 1, 0,
0, 0,0,1,0,0, 0, 0,
0, 0,0,0,1,0, 1, 1,
0, 0,0,0,1,0, 0, 1,
0, 0,0,0,1,0, 1, 0,
0, 0,0,0,1,0, 0, 0,
0, 0,0,0,0,1, 1, 1,
0, 0,0,0,0,1, 0, 1,
0, 0,0,0,0,1, 1, 0,
0, 0,0,0,0,1, 0, 0,
1, 1,0,0,0,0, 1, 1,
1, 1,0,0,0,0, 0, 1,
1, 1,0,0,0,0, 1, 0,
1, 1,0,0,0,0, 0, 0,
1, 0,1,0,0,0, 1, 1,
1, 0,1,0,0,0, 0, 1,
1, 0,1,0,0,0, 1, 0,
1, 0,1,0,0,0, 0, 0,
1, 0,0,1,0,0, 1, 1,
1, 0,0,1,0,0, 0, 1,
1, 0,0,1,0,0, 1, 0,
1, 0,0,1,0,0, 0, 0,
1, 0,0,0,1,0, 1, 1,
1, 0,0,0,1,0, 0, 1,
1, 0,0,0,1,0, 1, 0,
1, 0,0,0,1,0, 0, 0,
1, 0,0,0,0,1, 1, 1,
1, 0,0,0,0,1, 0, 1,
1, 0,0,0,0,1, 1, 0,
1, 0,0,0,0,1, 0, 0)
, nrow=40, ncol=8, byrow=T)

colnames(A.) = c("Not", "And", "or", "Xor", "contra", "IfThen", "alpha", "beta")

#A.=rbind(A., A. + (runif(320)-0.5)/2,  A. + (runif(320)-0.5)/2,  A. + (runif(320)-0.5)/2 )

d. = matrix(c(1,0,0,0, 1,1,1,0, 0,1,1,0, 1,0,1,1, 1,1,0,1,
0,1,1,1, 0,0,0,1, 1,0,0,1, 0,1,0,0, 0,0,1,0), nrow= 40, ncol=1 )

g.logit <-function(x)
{
return(1/(1+exp(-x)))

}

g.phy <-function(w, A, d)
{
sigma41 = g.pred(w, A, d)
phy  = 0.5*sum((d-sigma41)^2)
return(phy)
}

g.pred <-function(w, A, d)
{
zeta21  = w[1]*A[,1] + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 =g.logit(zeta41)

return(sigma41)
}

g.phy.prime <-function(w, A, d)
{

zeta21  = w[1]*A[,1]  + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 =g.logit(zeta41)

#return(0)

)

}

g.LinPro <-function(w, A, d)
{

zeta21  = w[1]*A[,1]  + w[ 2]*A[,2] +w[ 3]*A[,3]  +w[ 4]*A[,4]    +w[ 5]*A[,5]    +w[ 6]*A[,6]    +w[ 7]*A[,7]  +w[ 8]*A[,8]   + w[9]
zeta22  = w[10]*A[,1] + w[11]*A[,2] +w[12]*A[,3]  +w[13]*A[,4]    +w[14]*A[,5]    +w[15]*A[,6]    +w[16]*A[,7]  +w[17]*A[,8]   +w[18]
zeta23  = w[19]*A[,1] + w[20]*A[,2] +w[21]*A[,3]  +w[22]*A[,4]    +w[23]*A[,5]    +w[24]*A[,6]    +w[25]*A[,7]  +w[26]*A[,8]   +w[27]
zeta24  = w[28]*A[,1] + w[29]*A[,2] +w[30]*A[,3]  +w[31]*A[,4]    +w[32]*A[,5]    +w[33]*A[,6]    +w[34]*A[,7]  +w[35]*A[,8]   +w[36]
sigma21 = g.logit(zeta21)
sigma22 = g.logit(zeta22)
sigma23 = g.logit(zeta23)
sigma24 = g.logit(zeta24)
zeta31  = w[37]*sigma21+w[38]*sigma22+w[39]*sigma23 +w[40]*sigma24 +w[41]
zeta32  = w[42]*sigma21+w[43]*sigma22+w[44]*sigma23 +w[45]*sigma24 +w[46]
sigma31 = g.logit(zeta31)
sigma32 = g.logit(zeta32)
zeta41  = w[47]*sigma31+w[48]*sigma32  +w[49]
sigma41 =g.logit(zeta41)

#return(0)

)

AtA     =  t(A_hat)%*%A_hat
Q       = svd(AtA)
#print(Q)
D       = Q$d+0.001 # stablize eigen value inv.D = 1/D inv.D = diag(inv.D) inv.AtA = Q$u%*%inv.D%*%t(Q\$v)
#print(round(inv.AtA%*%AtA),3)

w.prime = inv.AtA%*%t(A_hat)%*%d
#print(w.prime)

return(w.prime)

}

NNplot<-function(w = w.old)
{

plot(1:4, 1:4, type="n", xlab= "", ylab="", ylim=c(0,9), xlim=c(0,5), xaxt="n", yaxt="n")

max.abs.w = max(abs(w))
w.raw = w

w  = 10*w/max(abs(w))
g.col = rep("blue", 49)
g.col[w < 0] = "red"

#print(w[1:8])

segments(x0=1, x1=2, y0=1, y1 = 3, lwd=abs(w[1]), col=g.col[1])
segments(x0=1, x1=2, y0=2, y1 = 3, lwd=abs(w[2]), col=g.col[2])
segments(x0=1, x1=2, y0=3, y1 = 3, lwd=abs(w[3]), col=g.col[3])
segments(x0=1, x1=2, y0=4, y1 = 3, lwd=abs(w[4]), col=g.col[4])
segments(x0=1, x1=2, y0=5, y1 = 3, lwd=abs(w[5]), col=g.col[5])
segments(x0=1, x1=2, y0=6, y1 = 3, lwd=abs(w[6]), col=g.col[6])
segments(x0=1, x1=2, y0=7, y1 = 3, lwd=abs(w[7]), col=g.col[7])
segments(x0=1, x1=2, y0=8, y1 = 3, lwd=abs(w[8]), col=g.col[8])

segments(x0=1, x1=2, y0=1, y1 = 4, lwd=abs(w[10]), col=g.col[10])
segments(x0=1, x1=2, y0=2, y1 = 4, lwd=abs(w[11]), col=g.col[11])
segments(x0=1, x1=2, y0=3, y1 = 4, lwd=abs(w[12]), col=g.col[12])
segments(x0=1, x1=2, y0=4, y1 = 4, lwd=abs(w[13]), col=g.col[13])
segments(x0=1, x1=2, y0=5, y1 = 4, lwd=abs(w[14]), col=g.col[14])
segments(x0=1, x1=2, y0=6, y1 = 4, lwd=abs(w[15]), col=g.col[15])
segments(x0=1, x1=2, y0=7, y1 = 4, lwd=abs(w[16]), col=g.col[16])
segments(x0=1, x1=2, y0=8, y1 = 4, lwd=abs(w[17]), col=g.col[17])

segments(x0=1, x1=2, y0=1, y1 = 5, lwd=abs(w[19]), col=g.col[19])
segments(x0=1, x1=2, y0=2, y1 = 5, lwd=abs(w[20]), col=g.col[20])
segments(x0=1, x1=2, y0=3, y1 = 5, lwd=abs(w[21]), col=g.col[21])
segments(x0=1, x1=2, y0=4, y1 = 5, lwd=abs(w[22]), col=g.col[22])
segments(x0=1, x1=2, y0=5, y1 = 5, lwd=abs(w[23]), col=g.col[23])
segments(x0=1, x1=2, y0=6, y1 = 5, lwd=abs(w[24]), col=g.col[24])
segments(x0=1, x1=2, y0=7, y1 = 5, lwd=abs(w[25]), col=g.col[25])
segments(x0=1, x1=2, y0=8, y1 = 5, lwd=abs(w[26]), col=g.col[26])

segments(x0=1, x1=2, y0=1, y1 = 6, lwd=abs(w[28]), col=g.col[28])
segments(x0=1, x1=2, y0=2, y1 = 6, lwd=abs(w[29]), col=g.col[29])
segments(x0=1, x1=2, y0=3, y1 = 6, lwd=abs(w[30]), col=g.col[30])
segments(x0=1, x1=2, y0=4, y1 = 6, lwd=abs(w[31]), col=g.col[31])
segments(x0=1, x1=2, y0=5, y1 = 6, lwd=abs(w[32]), col=g.col[32])
segments(x0=1, x1=2, y0=6, y1 = 6, lwd=abs(w[33]), col=g.col[33])
segments(x0=1, x1=2, y0=7, y1 = 6, lwd=abs(w[34]), col=g.col[34])
segments(x0=1, x1=2, y0=8, y1 = 6, lwd=abs(w[35]), col=g.col[35])

segments(x0=2, x1=3, y0=3, y1 = 4, lwd=abs(w[37]), col=g.col[37])
segments(x0=2, x1=3, y0=4, y1 = 4, lwd=abs(w[38]), col=g.col[38])
segments(x0=2, x1=3, y0=5, y1 = 4, lwd=abs(w[39]), col=g.col[39])
segments(x0=2, x1=3, y0=6, y1 = 4, lwd=abs(w[40]), col=g.col[40])

segments(x0=2, x1=3, y0=3, y1 = 5, lwd=abs(w[42]), col=g.col[42])
segments(x0=2, x1=3, y0=4, y1 = 5, lwd=abs(w[43]), col=g.col[43])
segments(x0=2, x1=3, y0=5, y1 = 5, lwd=abs(w[44]), col=g.col[44])
segments(x0=2, x1=3, y0=6, y1 = 5, lwd=abs(w[45]), col=g.col[45])

segments(x0=3, x1=4, y0=4, y1 = 4.5, lwd=abs(w[47]), col=g.col[47])
segments(x0=3, x1=4, y0=5, y1 = 4.5, lwd=abs(w[48]), col=g.col[48])

symbols(x=1, y=1, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=2, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=3, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=6, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=7, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=1, y=8, circles=0.5 , add=T, inches = .25, bg ="grey")

text(x=rep(1,8), y= 1:8, labels = c("Not", "And", "or", "Xor", "contra", "IfThen", "alpha", "beta") )

symbols(x=2, y=3, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=2, y=6, circles=0.5 , add=T, inches = .25, bg ="grey")

text(x=rep(2,4), y= 3:6, labels = round(c(w.raw[9], w.raw[18], w.raw[27], w.raw[36]),3) )

symbols(x=3, y=4, circles=0.5 , add=T, inches = .25, bg ="grey")
symbols(x=3, y=5, circles=0.5 , add=T, inches = .25, bg ="grey")
text(x=rep(3,2), y= 4:5, labels = round(c(w.raw[41], w.raw[46]),3) )

symbols(x=4, y=4.5, circles=0.5 , add=T, inches = .25, bg ="grey")
text(x=4, y= 4.5, labels = round(c(w.raw[49]),3) )

rang.w   = rev(seq(- max.abs.w, max.abs.w, length.out = 6 ))
rang.lwd = abs(seq(-10, 10, length.out = 6 ))
rang.col = rev(c(rep("red",3), rep("blue",3)))
legend("topright", col= rang.col, legend = round(rang.w, 3), lwd=rang.lwd, title = "line - w" )

}

NN2 <-function(w = (runif(49)-0.5)*10 , A=A., d=d., N= 10000, mod = 1  )
{

#print(w)
phy = g.phy(w, A,d)
#print(phy)

rec = c(phy, w)

phy.best = phy
w.best   = w

fm        = 0
phy.last  = phy
i.last    = 1
w.last    = w

for ( i in 1:N)
{

#update the weights
if (mod == 1 )
{
eta       = 0.1
#new misfit
phy.new = g.phy(w.new, A,d)
}
if (mod == 2 )
{
eta       = 1
#new misfit
phy.new = g.phy(w.new, A,d)
}

if (mod == 3 )
{
eta       = 10
#new misfit
phy.new = g.phy(w.new, A,d)
}

if (mod == 4 )
{
eta       = 1
w.new     =eta*g.LinPro(w, A, d)+(1-eta)*w
#new misfit
phy.new = g.phy(w.new, A,d)
}

if (mod == 5 )
{
eta       = 1
j = rnorm(49, sd = mag.grad /sqrt(49))
#j = rnorm(49, sd = 1)

#print(sqrt(sum(j*j)))
#new misfit
phy.new = g.phy(w.new, A,d)

}

if (mod == 6 )
{
eta       = 1

r = runif(1)
if( r <= 0.1)
{
mag.w = sqrt(sum(w*w))
j = rnorm(49, sd = mag.w /sqrt(49))
}
if( r > 0.1)
{
j = rnorm(49, sd = mag.grad /sqrt(49))
}

#j = rnorm(49, sd = 1)

#print(sqrt(sum(j*j)))
#new misfit
phy.new = g.phy(w.new, A,d)

}

k = 0
fm = fm + 1

# check  that new is better
while( (phy.new > phy)& (k <= 100) )
{
# normal back prop step shrink
if(mod <= 3)
{
eta = eta/2
#print(paste("eta", eta))
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

# Linearized step
if (mod == 4 )
{
eta       = eta/2
w.new     =eta*g.LinPro(w, A, d)+(1-eta)*w
#new misfit
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

if (mod == 5 )
{
#print("here")
eta       = eta/2
j = rnorm(49, sd = mag.grad /sqrt(49))
#j = rnorm(49, sd = 1)

#new misfit
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

if (mod == 6 )
{
#print("here")
eta       = eta/2

r = runif(1)
if( r <= 0.1)
{
mag.w = sqrt(sum(w*w))
j = rnorm(49, sd = mag.w /sqrt(49))
}
if( r > 0.1)
{
j = rnorm(49, sd = mag.grad /sqrt(49))
}

#j = rnorm(49, sd = 1)

#new misfit
phy.new = g.phy(w.new, A,d)
fm = fm + 1
}

k = k+ 1

if(k == 50)
{print("k==100!")}

if(k == 100)
{print("k==100!")}
}
phy = phy.new
w = w.new

if(phy < phy.best)
{
phy.best = phy
w.best = w
}

# add the model to the
rec = rbind(rec, c(phy,w))
rownames(rec)= NULL
sigma41 = g.pred(w, A, d)
r = runif(1)
if (r <= 0.001)
{
#print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
#print(phy)
#print(paste("delta log phy = ", log(phy.last) - log(phy) ))
#print(paste("max abs error", max(abs(d-sigma41))) )
#if ( log(phy.last) - log(phy) >= 0.5 )
#    {
#    quartz()
#    par(mfrow=c(2,1), mar=c(4,4,1,1))
#    NNplot(w.last)
#    plot( log(rec[1:i.last,1] ), pch=".", xlab = "Index", ylab = "Phy")
#    print("------------------> new plot <---------------------------------")
#    }
phy.last = phy
w.last = w
i.last = i

}

if(max(abs(d-sigma41)) <= 0.01 )
{
#print(paste(i, "of", N, "(=", i/N,") and", fm, "forward models"))
#quartz()
#par(mfrow=c(2,1), mar=c(4,4,1,1))
#NNplot(w.last)
#plot( log(rec[1:i.last,1] ), pch=".", xlab = "Index", ylab = "Phy")
#return(c(fm, 1))
return(rec)

}

}

return(c(fm, 0))
}
w.start = (runif(49)-0.5)
rec1  = NN2(  mod = 6,   w = w.start,      N= 25000,A= A.,  d = d.)

NNplot(w.last[10,2:50])

out = matrix(0, nrow=100, ncol = 10 )
for(i in 1:100)
{
print(i)
w.start = (runif(49)-0.5)
rec1  = NN2(  mod = 1,   w = w.start,      N= 25000,A= A.,  d = d.)
rec2  = NN2(  mod = 2,   w = w.start,      N= 25000,A= A.,  d = d.)
rec3  = NN2(  mod = 3,   w = w.start,      N= 25000,A= A.,  d = d.)
#rec4  = NN2(  mod = 4,   w = w.start,      N= 25000,A= A.,  d = d.)
rec5  = NN2(  mod = 5,   w = w.start,      N= 25000,A= A.,  d = d.)
rec6  = NN2(  mod = 6,   w = w.start,      N= 25000,A= A.,  d = d.)
out[i, 1] = rec1[1]
out[i, 2] = rec2[1]
out[i, 3] = rec3[1]
out[i, 4] = rec5[1]
out[i, 5] = rec6[1]

out[i, 6] = rec1[2]
out[i, 7] = rec2[2]
out[i, 8] = rec3[2]
out[i, 9] = rec5[2]
out[i,10] = rec6[2]

print(i)
print(apply(out, 2, mean))
print("--------------------------")

}