Here is another small tip regarding the loss function. In the book, the loss function is implemented as
-np.sum(Y * np.log(y_hat)) / Y.shape[0]
Numerically that isn't sound, because y_hat can be zero and the log of zero is undefined. What I have seen some people do is add a small number to y_hat, like
-np.sum(Y * np.log(y_hat+1e-8)) / Y.shape[0]
That works, but it introduces a small error in the reported loss. For a well-trained model the loss can get quite small, and then that small error becomes noticeable. It isn't a big deal, since this is only a reporting function, but I think a better way to deal with it is NumPy's masked array feature:
-np.sum(Y * np.ma.log(y_hat)) / Y.shape[0]
So instead of np.log, simply use np.ma.log. This masks out the elements for which y_hat is zero. For a well-trained network, wherever y_hat is zero, Y is usually zero as well, which means the product should be treated as zero too: x log(x) goes to zero as x goes to zero. The masking does the right thing in this case, because for the sum it makes no difference whether a zero is added or the term is skipped; the result is exactly the same.
The masked implementation is only wrong when y_hat is zero and Y is non-zero, because that term is then silently dropped instead of contributing an infinite loss. While this can potentially happen early in training with randomly initialized weights, once a network has been sufficiently trained I think it essentially never happens.
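To make that caveat concrete, here is a small made-up example of that one failure case, where the true class of a sample ends up with a predicted probability of exactly zero. The epsilon version at least reports a huge loss, while the masked version silently drops the term:

import numpy as np

# Pathological case (made-up values): the true class of the first sample
# received a predicted probability of exactly 0.
Y = np.array([[0.0, 1.0],
              [1.0, 0.0]])
y_hat = np.array([[1.0, 0.0],
                  [0.5, 0.5]])

print(-np.sum(Y * np.log(y_hat + 1e-8)) / Y.shape[0])  # ~9.56, the collapsed term dominates
print(-np.sum(Y * np.ma.log(y_hat)) / Y.shape[0])      # ~0.347, the collapsed term is ignored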