Neural Nets and linearization: what seemed like a good idea just does not work.

Hi all

The TL;DR of this post is that I tried something and it did not work, but it seemed enough of a good idea that it is worth writing down in case anyone can improve it.

Those who looked at the code in the last post would have seen an unmentioned “mode 4” for getting the weights of a NN. That mode, or method, is to linearize the NN inverse problem (using the data to solve for the unknown weights).

To describe linearization, consider a classic linear inverse problem. Let A be the sensitivity matrix (n rows by k columns), {\bf d} be the data (a column vector of n rows), and {\bf m} be the unknown model parameters (a column vector of k rows). Furthermore, let the data covariance matrix be known to be I_n (the n by n identity matrix); i.e., the errors {\bf e} are independent Gaussians with variance equal to one. Thus {\bf d} = A{\bf m} + {\bf e}.

The MLE estimate of the parameters is the well-known \hat{\bf m} = (A' A)^{-1} A' {\bf d}, and the posterior covariance matrix of {\bf m} is (A' A)^{-1}. Note that “prime” here denotes matrix transpose.
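For concreteness, here is a minimal numpy sketch of that linear case (the matrix A, the data, and the noise are made-up toy values, not anything from my actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear inverse problem: d = A m + e, with unit-variance Gaussian errors.
n, k = 50, 3                       # n observations, k unknown parameters
A = rng.normal(size=(n, k))        # sensitivity matrix
m_true = np.array([1.0, -2.0, 0.5])
d = A @ m_true + rng.normal(size=n)

# MLE: m_hat = (A'A)^{-1} A' d, solved as a linear system rather than via an explicit inverse.
AtA = A.T @ A
m_hat = np.linalg.solve(AtA, A.T @ d)

# Posterior covariance of m (with unit data covariance): (A'A)^{-1}
post_cov = np.linalg.inv(AtA)

print("m_hat:", m_hat)
print("posterior std devs:", np.sqrt(np.diag(post_cov)))
```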

So how is this relevant to NNs?

The point is that the i,jth element of A is the partial derivative of the ith predicted observation with respect to the jth parameter. In back propagation those derivatives are calculated analytically, so it is possible to construct an approximation to A, denoted here as B (I know, a creative naming scheme). The i,jth element of B is b_{i,j} = (\partial / \partial w_j) d_{i}, where {\bf W} are the weights of the NN.
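To make B concrete, here is a sketch for a tiny one-hidden-layer network. My actual code gets these derivatives analytically from back propagation; for brevity this sketch approximates them with central finite differences, and the network shape and names (`predict`, `jacobian_fd`) are just illustration, not the code from the last post:

```python
import numpy as np

def predict(w, X, hidden=4):
    """Tiny one-hidden-layer net: one prediction per row of X.
    w is a flat vector holding all weights (no biases, for brevity)."""
    n_in = X.shape[1]
    W1 = w[: n_in * hidden].reshape(n_in, hidden)
    W2 = w[n_in * hidden :].reshape(hidden, 1)
    return (np.tanh(X @ W1) @ W2).ravel()

def jacobian_fd(w, X, eps=1e-6):
    """B[i, j] = d(prediction_i) / d(w_j), by central finite differences."""
    base = predict(w, X)
    B = np.zeros((base.size, w.size))
    for j in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[j] += eps
        wm[j] -= eps
        B[:, j] = (predict(wp, X) - predict(wm, X)) / (2 * eps)
    return B

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))           # 20 observations, 2 inputs
w = rng.normal(size=2 * 4 + 4) * 0.1   # flattened weights for hidden=4
B = jacobian_fd(w, X)
print(B.shape)                         # (20, 12): n observations by k weights
```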

Thus it should (remember, this did not actually work in the end) be possible to make iterative estimates of {\bf W} by {\bf W}_t = ({B'}_{t-1} B_{t-1})^{-1} {B'}_{t-1} {\bf d}. In practice, however, I found that the {\bf W}_t cycle over a group of bad (poorly fitting) values and more or less never converge.
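Here is a self-contained sketch of that iteration on a synthetic toy problem, using the same tiny net and finite-difference Jacobian as above (again, stand-ins for the analytic backprop derivatives in my actual code):

```python
import numpy as np

def predict(w, X, hidden=4):
    """Same tiny one-hidden-layer net as in the previous sketch."""
    n_in = X.shape[1]
    W1 = w[: n_in * hidden].reshape(n_in, hidden)
    W2 = w[n_in * hidden :].reshape(hidden, 1)
    return (np.tanh(X @ W1) @ W2).ravel()

def jacobian_fd(w, X, eps=1e-6):
    """B[i, j] = d(prediction_i) / d(w_j), by central finite differences."""
    base = predict(w, X)
    B = np.zeros((base.size, w.size))
    for j in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[j] += eps
        wm[j] -= eps
        B[:, j] = (predict(wp, X) - predict(wm, X)) / (2 * eps)
    return B

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
w_true = rng.normal(size=12)
d = predict(w_true, X) + 0.1 * rng.normal(size=40)   # synthetic data

w = rng.normal(size=12) * 0.1                        # starting weights
for t in range(20):
    B = jacobian_fd(w, X)
    # The update as written above: W_t = (B'B)^{-1} B' d,
    # solved here with least squares for numerical stability.
    # (A standard Gauss-Newton step would instead add the increment
    # (B'B)^{-1} B' (d - predict(w, X)) to the current w.)
    w = np.linalg.lstsq(B, d, rcond=None)[0]
    rms = np.sqrt(np.mean((d - predict(w, X)) ** 2))
    print(f"iteration {t:2d}  rms misfit {rms:.4f}")
```

Whether the misfit settles or cycles will depend on the particulars of the toy problem; the sketch is only meant to show the shape of the iteration.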

I don’t have a good reason why this does not work, other than that the NN gradients just do not look smooth enough. It is also possible that there is simply not enough data in my toy problem (asymptotically, all posteriors become Gaussian as data is added; the proof of this is very similar to that of the central limit theorem). I have not tried starting at an optimal solution and seeing whether this method stays there, but I will be doing that soon, since regardless of its value for optimization, linearization is useful for initializing MCMC sampling.

That’s it for now.
