Hi all

The TL;DR of this post is that I tried something and it did not work, but it seemed like enough of a good idea that it is worth writing down in case anyone can improve it.

Those who looked at the code in the last post will have seen an unmentioned “mode 4” for getting the weights of a NN. That mode, or method, is to linearize the NN inverse problem (using the data to solve for the unknown weights).

To describe linearization, consider a classic linear inverse problem. Let $G$ be the sensitivity matrix ($n$ rows by $k$ columns), $d$ be the data (a column vector of $n$ rows), and $m$ be the unknown model parameters (a column vector of $k$ rows). Furthermore, let the data covariance matrix be known to be $I_n$ (an $n$ by $n$ identity matrix); i.e., the data errors are independent Gaussians with variance equal to one. Thus $d = Gm + \epsilon$ with $\epsilon \sim N(0, I_n)$.

The MLE estimate for the parameters is the well-known $\hat{m} = (G'G)^{-1} G' d$, and the posterior covariance matrix of $m$ is $(G'G)^{-1}$. Note “prime” here denotes matrix transpose.
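As a quick numerical check of those two formulas, here is a minimal sketch on an invented toy problem (the sizes, the true $m$, and the random $G$ are all made up for illustration, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear inverse problem: d = G m + noise, noise ~ N(0, I).
n, k = 50, 3                        # n observations, k parameters
G = rng.normal(size=(n, k))         # sensitivity matrix
m_true = np.array([1.0, -2.0, 0.5])
d = G @ m_true + rng.normal(size=n)

# MLE: m_hat = (G'G)^{-1} G' d; posterior covariance: (G'G)^{-1}.
GtG = G.T @ G
m_hat = np.linalg.solve(GtG, G.T @ d)
cov_post = np.linalg.inv(GtG)
```

With 50 observations and unit noise, `m_hat` lands close to `m_true`, and the diagonal of `cov_post` gives the expected squared error of each parameter.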

So how is this relevant to NNs?

The point is that the $ij$th element of $G$ is the partial derivative of the $i$th observation with respect to the $j$th parameter. In back propagation those derivatives are calculated analytically. So it is possible to construct an approximation to $G$ that is denoted here as $\hat{G}$ (I know, a creative naming scheme). The $ij$th element of $\hat{G}$ is $\partial\,\mathrm{NN}(x_i; w) / \partial w_j$, where $w$ are the weights of the NN.
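To make $\hat{G}$ concrete, here is a sketch using a hypothetical toy network (one tanh hidden layer, invented for illustration) with a finite-difference Jacobian standing in for the backprop-computed derivatives:

```python
import numpy as np

# Hypothetical toy network (not from the post): one tanh hidden layer of
# h units mapping scalar x to scalar output. The weight vector w packs
# [input weights (h), hidden biases (h), output weights (h)].
def nn(x, w, h=4):
    W1, b1, w2 = w[:h], w[h:2 * h], w[2 * h:]
    return np.tanh(np.outer(x, W1) + b1) @ w2

def jacobian_fd(x, w, eps=1e-6):
    """Finite-difference stand-in for G_hat: G_hat[i, j] = d NN(x_i; w) / d w_j.
    Backprop would give the same derivatives analytically."""
    f0 = nn(x, w)
    G_hat = np.empty((x.size, w.size))
    for j in range(w.size):
        wp = w.copy()
        wp[j] += eps
        G_hat[:, j] = (nn(x, wp) - f0) / eps
    return G_hat
```

For `h=4` and 10 inputs, `jacobian_fd` returns a 10-by-12 matrix: one row per observation, one column per weight, exactly the shape of $\hat{G}$.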

Thus it should (remember, this did not actually work in the end) be possible to make iterative estimates of $w$ by $w_{k+1} = w_k + (\hat{G}'\hat{G})^{-1} \hat{G}' (d - \mathrm{NN}(x; w_k))$, but in practice I found that this caused the $w$ to cycle over a group of bad (poorly fitting) values and more or less never converge.
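One iteration of that update can be sketched as follows, again assuming a hypothetical toy tanh network and a finite-difference stand-in for the backprop $\hat{G}$. The small damping term is my own addition (Levenberg-Marquardt style) to keep $\hat{G}'\hat{G}$ invertible; the plain iteration above has none:

```python
import numpy as np

# Hypothetical toy network: one tanh hidden layer of h units, scalar output.
def nn(x, w, h=4):
    W1, b1, w2 = w[:h], w[h:2 * h], w[2 * h:]
    return np.tanh(np.outer(x, W1) + b1) @ w2

def jacobian_fd(x, w, eps=1e-6):
    # Finite-difference stand-in for the backprop Jacobian G_hat.
    f0 = nn(x, w)
    G = np.empty((x.size, w.size))
    for j in range(w.size):
        wp = w.copy()
        wp[j] += eps
        G[:, j] = (nn(x, wp) - f0) / eps
    return G

def gauss_newton_step(x, d, w, damping=1e-3):
    """One update w <- w + (G'G)^{-1} G' (d - NN(x; w)).
    The damping term (not in the post's iteration) keeps G'G invertible."""
    G = jacobian_fd(x, w)
    r = d - nn(x, w)
    return w + np.linalg.solve(G.T @ G + damping * np.eye(w.size), G.T @ r)
```

Started very close to a good $w$, a step like this does shrink the residual; the cycling described above showed up when starting from generic initial weights.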

I don’t have a good reason why this does not work, other than that it looks like the NN gradients are just not smooth enough. It is also possible that in my toy problem there is not enough data (asymptotically, all posteriors become Gaussian as data is added; the proof of this is very similar to the central limit theorem). I have not tried starting at an optimal solution and seeing if this method stays there, but I will be doing that soon, because regardless of linearization’s value in optimization, it is useful for initializing MCMC sampling.

That’s it for now.