Hi all

In my last couple of posts I have been exploring neural nets. This post continues the topic with a quick look at objective functions.

In the last two posts I glossed over the choice of objective function. I used the sum of squared error (SSE), but that is a bit of a strange choice since the data being modelled are dichotomous. This post derives a more rigorous objective function and explores the difference between it and the SSE.

Recall that the SSE objective function for a network with one output node is $O(\beta) = \sum_{i=1}^{n} (y_i - f_i(\beta))^2$, where the $y_i$ are the observed data, $\beta$ are the network weights, and $f_i(\beta)$ is the output of the neural network for the $i$'th observation, written as a function of the network weights.
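As a quick sketch, the SSE objective can be computed directly; the observations and outputs below are made-up values for illustration:

```python
import numpy as np

def sse(y, f):
    """Sum of squared errors between observations y and network outputs f."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    return np.sum((y - f) ** 2)

# Hypothetical dichotomous observations and network outputs:
y = np.array([1.0, 0.0, 1.0])
f = np.array([0.9, 0.2, 0.7])
print(sse(y, f))  # 0.01 + 0.04 + 0.09, i.e. about 0.14
```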

Also recall that the probability of a sample of IID Gaussian random variables with known variance is $\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right)$. Thus, up to an additive constant and the scale $\sigma^2$, the $-2$ log-probability for a Gaussian distribution is the SSE; this is also called the deviance.
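To see that the Gaussian deviance and the SSE differ only by a constant that does not depend on the fit, here is a small check (with $\sigma^2 = 1$ and made-up values):

```python
import numpy as np

def neg2_log_gaussian(y, mu, sigma2=1.0):
    """-2 log-probability of IID Gaussian observations with known variance."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    n = y.size
    return n * np.log(2 * np.pi * sigma2) + np.sum((y - mu) ** 2) / sigma2

y = np.array([1.0, 0.0, 1.0])
mu = np.array([0.9, 0.2, 0.7])
# Subtracting the constant n*log(2*pi) leaves exactly the SSE:
print(neg2_log_gaussian(y, mu) - 3 * np.log(2 * np.pi))  # about 0.14
```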

Therefore (my favourite pretentious word), a good way to make an objective function is to use the $-2$ log-probability for a distribution that is more representative of dichotomous data. Here I will use the Bernoulli distribution. The probability of a set of observed Bernoulli random variables with $P(y_i = 1) = p_i$ is $\prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$. Thus, taking $p_i = f_i(\beta)$, the deviance and our objective function is $O(\beta) = -2 \sum_{i=1}^{n} \left[ y_i \log f_i(\beta) + (1 - y_i) \log(1 - f_i(\beta)) \right]$. The partial derivative (used in back propagation) for this objective function is $\frac{\partial O}{\partial f_i(\beta)} = -2 \left[ \frac{y_i}{f_i(\beta)} - \frac{1 - y_i}{1 - f_i(\beta)} \right]$.
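A minimal sketch of the Bernoulli deviance and its partial derivative, with the analytic gradient checked against a finite difference (the data values are made up for illustration):

```python
import numpy as np

def bernoulli_deviance(y, f):
    """-2 log-probability of Bernoulli observations with P(y_i = 1) = f_i."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    return -2.0 * np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

def bernoulli_deviance_grad(y, f):
    """Partial derivative of the deviance with respect to each output f_i."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    return -2.0 * (y / f - (1 - y) / (1 - f))

# Check the analytic gradient against a central finite difference:
y = np.array([1.0, 0.0, 1.0])
f = np.array([0.9, 0.2, 0.7])
eps = 1e-6
bump = np.array([eps, 0.0, 0.0])
numeric = (bernoulli_deviance(y, f + bump)
           - bernoulli_deviance(y, f - bump)) / (2 * eps)
print(numeric, bernoulli_deviance_grad(y, f)[0])  # both about -2/0.9
```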

So what is the difference between the SSE and Bernoulli objective functions? Consider a data set with one observation $y_1 = 1$. Figure 1 shows the two objective functions as a function of $f_1$ in the range $(0, 1)$. As can be seen, the SSE penalizes bad fits (relative to good fits) much less than the Bernoulli objective. In particular, as $f_1 \to 0$ the Bernoulli objective diverges, so it distinguishes sharply between poor fits that the SSE treats as nearly equivalent.

Figure 1: The two objective functions.
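The shape of Figure 1 can be reproduced numerically; the grid below tabulates both objectives for the single observation $y_1 = 1$:

```python
import numpy as np

# For a single observation y = 1, compare the two objectives as the
# fitted value f ranges over (0, 1).
f = np.linspace(0.01, 0.99, 5)
sse_vals = (1.0 - f) ** 2            # SSE for y = 1
bern_vals = -2.0 * np.log(f)         # Bernoulli deviance for y = 1
for fi, s, b in zip(f, sse_vals, bern_vals):
    print(f"f={fi:.2f}  SSE={s:.3f}  Bernoulli={b:.3f}")
# The SSE is bounded by 1, while the Bernoulli deviance blows up near f = 0.
```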

That is it for now.
