Optimizing the choice of objective function

Hi all

With my last couple of posts I have been exploring neural nets. This post continues the topic with a quick look at objective functions.

In the last two posts I glossed over the assumptions behind the objective function. I used the sum of squared errors (SSE), but that is a bit of a strange choice since the data being modelled are dichotomous. This post derives a more rigorous objective function and explores the difference between it and the SSE.

Recall that the SSE objective function for a network with one output node is \phi ({\bf W}) = \sum_i [d_i - v_i ({\bf W})]^2, where {\bf d} are the observed data, {\bf W} are the network weights and v_i ({\bf W}) is the output of the neural network for the i’th observation, written as a function of the network weights.
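
To make this concrete, here is a minimal numpy sketch of that objective. The function name sse_objective and the convention that v holds the outputs v_i ({\bf W}) from a forward pass are my own assumptions, not code from the earlier posts.

import numpy as np

def sse_objective(d, v):
    # Sum of squared errors between the observed data d and the network
    # outputs v (i.e. v_i(W) computed by a forward pass).
    d = np.asarray(d, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.sum((d - v) ** 2)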

Also recall that the probability density of a sample of n IID Gaussian random variables with known variance \sigma^2 = 1 is P({\bf d}) = (2 \pi)^{-n/2} \exp[-1/2  \sum_i (d_i - \mu)^2 ]. Thus, up to an additive constant, the -2 log-probability for a Gaussian distribution is the SSE; this quantity is also called the deviance.
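
To spell out that last step: -2 \log P({\bf d}) = n \log(2 \pi) + \sum_i (d_i - \mu)^2, and the first term is a constant that does not depend on \mu, so minimizing the Gaussian deviance is exactly the same as minimizing the SSE.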

Therefore (my favourite pretentious word), a good way to make an objective function is to use the -2 log-probability for a distribution that is more representative of dichotomous data. Here I will use the Bernoulli distribution. The probability of a set of observed Bernoulli random variables with p = v_i ({\bf W}) is \text{Prob}({\bf d}) = \Pi_i [ v_i ({\bf W})]^{d_i} [1-v_i ({\bf W}) ]^{1-d_i}. Thus the deviance, and our objective function, is \phi({\bf W}) = -2 \sum_i (d_i \log[v_i ({\bf W})]+ [1-d_i] \log[1-v_i ({\bf W})]). The partial derivative (used in back-propagation) for this objective function is (\partial / \partial v_i)  \phi({\bf W}) = -2 (d_i / v_i ({\bf W}) -  [1-d_i]/[1-v_i ({\bf W})]); note that only the i’th term of the sum survives the differentiation.
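
In the same spirit, here is a minimal sketch of the Bernoulli deviance and its derivative with respect to the outputs. Again the names are hypothetical, and a real implementation would clip v away from exactly 0 or 1, where the logs blow up.

import numpy as np

def bernoulli_deviance(d, v):
    # -2 log-likelihood of the 0/1 data d given network outputs v.
    d = np.asarray(d, dtype=float)
    v = np.asarray(v, dtype=float)
    return -2.0 * np.sum(d * np.log(v) + (1.0 - d) * np.log(1.0 - v))

def bernoulli_deviance_grad(d, v):
    # Partial derivative of the deviance with respect to each output v_i.
    d = np.asarray(d, dtype=float)
    v = np.asarray(v, dtype=float)
    return -2.0 * (d / v - (1.0 - d) / (1.0 - v))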

So what is the difference between the SSE and Bernoulli objective functions? Consider a data set with one observation, d_1 = 1. Figure 1 shows the two objective functions as a function of v_1 over the range [0.001, 1]. As can be seen, the SSE penalizes bad fits (relative to good fits) much less than the Bernoulli objective does. In particular, under the Bernoulli objective a fit with v_1 = 0.01 is meaningfully better than a fit with v_1 = 0.001, whereas under the SSE the two are nearly indistinguishable.
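
The numbers behind that last claim are easy to check directly; a small self-contained sketch using the values discussed above:

import math

d1 = 1.0
for v1 in (0.001, 0.01):
    sse = (d1 - v1) ** 2
    deviance = -2.0 * (d1 * math.log(v1) + (1.0 - d1) * math.log(1.0 - v1))
    print(f"v1={v1}: SSE={sse:.3f}, Bernoulli deviance={deviance:.2f}")

# The SSE barely moves (0.998 vs 0.980) while the Bernoulli deviance
# drops from about 13.8 to about 9.2.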

That is it for now.
