# Optimizing the choice of objective function

Hi all

With my last couple posts I have been exploring neural nets. This post continues the topic with a quick look at objective functions.

In the last two posts I glossed over the choice of objective function. I used the sum of squared errors (SSE), but that is a bit of a strange choice since the data being modelled are dichotomous. This post derives a more rigorous objective function and explores how it differs from the SSE.

Recall that the SSE objective function for a network with one output node is $\phi ({\bf W}) = \sum_i [d_i - v_i ({\bf W})]^2$, where ${\bf d}$ are the observed data, ${\bf W}$ are the network weights and $v_i ({\bf W})$ is the output of the neural network for the $i$'th observation, written as a function of the network weights.
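As a concrete sketch (the post itself contains no code; the function name and example values here are mine), the SSE objective is just a vectorized sum of squared differences:

```python
import numpy as np

def sse(d, v):
    """Sum of squared errors between observed data d and network outputs v."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    return np.sum((d - v) ** 2)

# Example: three dichotomous observations against fitted outputs.
# (1-0.9)^2 + (0-0.2)^2 + (1-0.7)^2 = 0.01 + 0.04 + 0.09 = 0.14
print(sse([1, 0, 1], [0.9, 0.2, 0.7]))  # 0.14
```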

Also recall that the probability of a sample of $n$ IID Gaussian random variables with known variance $\sigma^2 = 1$ is $P({\bf d}) = (2 \pi)^{-n/2} \exp[-\tfrac{1}{2} \sum_i (d_i - \mu)^2 ]$. Thus the $-2$ log-probability for a Gaussian distribution is the SSE plus a constant that does not depend on $\mu$; this quantity is also called the deviance.
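A quick numeric check of that identity (a sketch of mine, not from the post): the Gaussian deviance and the SSE should differ by exactly $n \log(2\pi)$, regardless of $\mu$.

```python
import numpy as np

def neg2_log_gaussian(d, mu):
    """-2 log-likelihood of IID Gaussian data d with mean mu and variance 1."""
    d = np.asarray(d, float)
    return len(d) * np.log(2 * np.pi) + np.sum((d - mu) ** 2)

d = np.array([1.0, 0.0, 1.0])
for mu in (0.0, 0.6, 1.0):
    gap = neg2_log_gaussian(d, mu) - np.sum((d - mu) ** 2)
    # The gap is the constant n*log(2*pi), independent of mu.
    assert np.isclose(gap, len(d) * np.log(2 * np.pi))
```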

Therefore (my favourite pretentious word), a good way to make an objective function is to use the $-2$ log-probability for a distribution that is more representative of dichotomous data. Here I will use the Bernoulli distribution. The probability of a set of observed Bernoulli random variables with $p_i = v_i ({\bf W})$ is $\text{Prob}({\bf d}) = \prod_i [ v_i ({\bf W})]^{d_i} [1-v_i ({\bf W}) ]^{1-d_i}$. Thus the deviance, and our objective function, is $\phi({\bf W}) = -2 \sum_i (d_i \log[v_i ({\bf W})]+ [1-d_i] \log[1-v_i ({\bf W})] )$. The partial derivative (used in back-propagation) with respect to the $i$'th output is $(\partial / \partial v_i) \phi({\bf W}) = -2 (d_i / v_i ({\bf W}) - [1-d_i]/[1-v_i ({\bf W})] )$.
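The deviance and its partial derivatives translate directly into code. Here is a minimal sketch (function names and the finite-difference check are mine, not from the post):

```python
import numpy as np

def bernoulli_deviance(d, v):
    """-2 log-likelihood of Bernoulli data d given fitted probabilities v."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    return -2.0 * np.sum(d * np.log(v) + (1 - d) * np.log(1 - v))

def bernoulli_deviance_grad(d, v):
    """Partial derivatives d(phi)/d(v_i), as used in back-propagation."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    return -2.0 * (d / v - (1 - d) / (1 - v))

# Verify the analytic gradient against a finite difference in v_1.
d = np.array([1.0, 0.0, 1.0])
v = np.array([0.8, 0.3, 0.6])
eps = 1e-6
v_plus = v.copy()
v_plus[0] += eps
fd = (bernoulli_deviance(d, v_plus) - bernoulli_deviance(d, v)) / eps
assert np.isclose(bernoulli_deviance_grad(d, v)[0], fd, atol=1e-4)
```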

So what is the difference between the SSE and Bernoulli objective functions? Consider a data set with one observation, $d_1 = 1$. Figure 1 shows the two objective functions as a function of $v_1$ over the range $[0.001, 1]$. As can be seen, the SSE penalizes bad fits (relative to good fits) much less than the Bernoulli objective does. In particular, under the Bernoulli objective a fit with $v_1 = 0.01$ is meaningfully better than a fit with $v_1 = 0.001$, whereas under the SSE the two are nearly indistinguishable.
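You can reproduce the comparison behind Figure 1 with a few lines (my sketch; the specific $v_1$ grid is mine):

```python
import numpy as np

# One observation d_1 = 1: compare the two objectives over a range of outputs.
def sse(v):
    return (1 - v) ** 2

def bernoulli(v):
    return -2 * np.log(v)  # deviance when d_1 = 1

for v in (0.001, 0.01, 0.5, 0.99):
    print(f"v_1 = {v:5.3f}  SSE = {sse(v):8.6f}  Bernoulli = {bernoulli(v):8.4f}")
```

Running this shows the point numerically: moving from $v_1 = 0.001$ to $v_1 = 0.01$ changes the SSE by less than 0.02 but drops the Bernoulli deviance by more than 4.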

That is it for now.