Programming Machine Learning: Help: weird results I don't understand

I encountered something that I can’t explain. Any help, tips, or explanations would be great.

I followed the one-hidden-layer example with 100 nodes and a sigmoid activation function. It works great, and I can get to 98.6% accuracy with a learning rate of 1.0, a batch size of 1000, and 100 epochs.

I then decided to replace the sigmoid activation function with ReLU. This is not done in the book at this point, but it is easy enough to program ReLU and its derivative. Here is the Python code I used:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)  # element-wise max(0, z)

def relu_gradient(z):
    return (z > 0) * 1  # 1 where z > 0, otherwise 0

This works fine as long as one reduces the learning rate, which I lowered to 0.1. It reaches about the same level of accuracy as with the sigmoid. I then made one seemingly insignificant change to the gradient of the ReLU. Instead of z > 0 I wrote z >= 0. So the code for the gradient was now:

def relu_gradient(z):
    return (z >= 0)*1

I thought this should not make any difference, because how often would z be exactly zero? How often would the weighted sum of all inputs, in floating point format, be exactly zero? Perhaps never. Even if it is zero occasionally, it should hardly make a big difference. But to my surprise, it makes a profound difference. I can only get to about 95%. Why? Why is there a difference of almost 4 percentage points in accuracy for this insignificant change? There must be something weird happening.

I tried this several times to rule out that somehow the random initialization was unusual. I tried it with different learning rates and different batch sizes. None made any difference in the result. I checked for dead neurons. Found none. If somebody can tell me what is going on here I would really appreciate it.

This is intriguing. Are you running this network on MNIST, or is it another data set? Also: did you try counting the number of times relu_gradient() actually receives a 0, and what the inputs look like for those cases? (Just to rule out bugs like having all the inputs at 0.)
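The counting check suggested above could look roughly like the sketch below. The helper name `count_exact_zeros` and the random stand-in data are assumptions for illustration only; in a real run you would pass whatever array your code actually hands to relu_gradient().

```python
import numpy as np

def count_exact_zeros(z):
    """Return (number of entries exactly equal to 0, total number of entries)."""
    z = np.asarray(z)
    return int(np.count_nonzero(z == 0)), z.size

# Hypothetical stand-in for one batch of weighted sums (not real MNIST data).
rng = np.random.default_rng(0)
x = rng.random((1000, 784))                    # fake inputs in [0, 1)
w1 = rng.standard_normal((784, 100)) * 0.01    # fake first-layer weights
z = x @ w1

zeros, total = count_exact_zeros(z)
print(f"{zeros} of {total} values are exactly zero")
print("min:", z.min(), "max:", z.max())        # a quick look at the value range
```

If the count turns out to be a large fraction of the total, or the minimum is never negative, that hints that the function is receiving something other than the weighted sums.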

Turns out it was a bug. Using the nomenclature of the book I was feeding h into the gradient function when I should have fed a into it. With the >= comparison this made all the gradients 1, so the layer behaved as if it had a linear activation function. (The linear activation function does produce about 94% accuracy.) Properly using the gradient function produces the expected results. It doesn’t matter if one uses > or >=.
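A small sketch of why the bug only showed up with >=. It assumes that h is the hidden layer's output after ReLU and that the gradient function should instead receive the weighted sum before the activation; the names `z`, `relu_gradient_strict` and `relu_gradient_inclusive` are made up for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_gradient_strict(z):
    return (z > 0) * 1

def relu_gradient_inclusive(z):
    return (z >= 0) * 1

z = np.array([-2.0, -0.5, 0.3, 1.7])   # weighted sums (before the activation)
h = relu(z)                             # layer output: negatives are clipped to 0

# Intended use: both variants agree when given the weighted sums.
print(relu_gradient_strict(z))          # [0 0 1 1]
print(relu_gradient_inclusive(z))       # [0 0 1 1]

# Buggy use: the ReLU output is never negative, so ">= 0" is always true.
print(relu_gradient_strict(h))          # [0 0 1 1]  (h > 0 exactly where z > 0)
print(relu_gradient_inclusive(h))       # [1 1 1 1]  (acts like a linear activation)
```

Under these assumptions the > version hides the mistake, because relu(z) is positive exactly where z is positive, while the >= version turns every gradient into 1.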

I am happy I found this bug. But this is also part of why your book is so great. Programming it yourself forces one to understand the little details and allows one to change and modify the algorithms at the very core, which leads to much deeper understanding of how this all works.

Here is an insight that my experimentation produced. I tested a bunch of different activation functions, including weird piecewise linear ones, periodic ones with sin and cos, combinations thereof, etc. It surprised me that many work just as well as ReLU or sigmoid with a single hidden layer. (I intend to extend this experimentation to multiple hidden layers.) For example, it is kind of shocking at first that the absolute value function works just as well as ReLU. This kind of makes sense in the biological case. A neuron, being a cell, would not be completely identical to its neighbor neuron. Neurons in nature would certainly have different activation functions. Perhaps not as different as the ones I experimented with, but they would perhaps be noisy and distorted versions of sigmoid or ReLU. It doesn’t matter, it still works fine.
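For anyone who wants to repeat this kind of experiment, here is a minimal sketch of two of the activation functions mentioned above, each paired with its derivative, plus a finite-difference check that the pair is consistent. The names `ACTIVATIONS` and `numerical_gradient` are assumptions for this example, not code from the book or from the original experiments.

```python
import numpy as np

def absolute(z):
    return np.abs(z)

def absolute_gradient(z):
    # Slope is -1 for negative z and +1 for positive z.
    # np.sign returns 0 at exactly z == 0, where the true slope is undefined.
    return np.sign(z)

def sine(z):
    return np.sin(z)

def sine_gradient(z):
    return np.cos(z)

# Hypothetical registry: name -> (activation, derivative of the activation).
ACTIVATIONS = {
    "absolute": (absolute, absolute_gradient),
    "sine": (sine, sine_gradient),
}

def numerical_gradient(f, z, eps=1e-6):
    """Central finite-difference estimate of the element-wise derivative of f."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.7, 3.0])   # test points, kept away from the kink at 0
for name, (f, f_gradient) in ACTIVATIONS.items():
    matches = np.allclose(numerical_gradient(f, z), f_gradient(z), atol=1e-5)
    print(name, "gradient matches finite differences:", matches)
```

A check like this is cheap insurance when swapping activation functions, since a mismatched derivative can still train to a plausible but lower accuracy.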

Further, this makes me wonder if perhaps that variation in activation functions in nature is a benefit. I am wondering if folks have tried to make nets where each activation function of each neuron is different. Perhaps that confers a training advantage to the network because not everything behaves in exactly the same way? I will try to explore this question. But first I need to extend the code to allow for multiple hidden layers.

This is one critique I have to make. In my opinion, it would have been better to go further with the code and extend it to multiple hidden layers than to switch to libraries. The point of the book is programming it yourself to allow full, unrestricted experimentation. I would have added one or two chapters to extend the code further, even if that would have meant leaving out libraries altogether. NumPy should be fast enough to explore multilayer networks on a single average computer.

Using the nomenclature of the book I was feeding h into the gradient function […]

Aaah, that explains it. Machine learning bugs can manifest themselves in subtle ways, can’t they? It’s also interesting that having an identity activation function still results in pretty high accuracy.

Thank you for the appreciation you show for the book, @wasshuber! I totally agree that getting your hands dirty is the best way to understand ML inside out. I also appreciated your ideas about using multiple activation functions: either different functions for different neurons, as you say, or even different functions for the same neuron at different times. Would that instability result in better training overall in some cases? In an academic setting, that might turn out to be a worthy question. If you ever follow up on this hunch, by all means let me know the results.

About the switch from hand-coded algorithms to libraries in the book: I tried, but the code for arbitrary neural networks was becoming too long, and the advanced chapters were losing focus as a result. Switching to libraries was a way to abstract out the details and focus on the higher-level picture. That being said, I was also sorry to switch approaches. I was even tempted to leave Part 3 to another book, but the book felt incomplete as a result, and I had to bite that bullet and go down the Keras path. I wish I could have avoided it. (As a side effect, Keras and TensorFlow are giving me a lot of headaches, with installation instructions becoming obsolete on different platforms.)

I haven’t done it yet, but I don’t think going to multi-layer networks is a big step. In fact, it looks rather trivial. If we put the weights in a list rather than named variables, and we do the same for the activation functions, then all we need to do is add a loop in the initialize, forward and back functions to sequentially go over the layers. One combines h and y-hat into another list of layer output arrays. Maybe I am missing something, but this would add merely a few lines of code.
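The list-based idea described above could be sketched as follows. This is an illustration under stated assumptions, not the book's code: it assumes a softmax output layer trained with cross-entropy loss, a bias column prepended at every layer, and full-batch gradient descent on tiny random data. All function names and the pairing of each activation with its derivative in `activations` are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_gradient(z):
    return (z > 0) * 1

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def prepend_bias(X):
    return np.insert(X, 0, 1, axis=1)

def loss(Y, y_hat):
    return -np.sum(Y * np.log(y_hat)) / Y.shape[0]  # mean cross-entropy

def initialize_weights(layer_sizes, rng):
    # One matrix per pair of consecutive layers; the extra row holds the bias.
    return [rng.standard_normal((n_in + 1, n_out)) * np.sqrt(1 / n_in)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(X, weights, activations):
    # outputs[0] is the input; outputs[i] is the output of layer i.
    outputs, pre_activations = [X], []
    for w, (activation, _) in zip(weights[:-1], activations):
        z = prepend_bias(outputs[-1]) @ w
        pre_activations.append(z)
        outputs.append(activation(z))
    outputs.append(softmax(prepend_bias(outputs[-1]) @ weights[-1]))
    return pre_activations, outputs

def back(Y, weights, activations, pre_activations, outputs):
    gradients = [None] * len(weights)
    delta = outputs[-1] - Y                         # softmax + cross-entropy
    for i in reversed(range(len(weights))):
        gradients[i] = prepend_bias(outputs[i]).T @ delta / Y.shape[0]
        if i > 0:
            _, activation_gradient = activations[i - 1]
            # Drop the bias row, then apply the derivative at the weighted sums.
            delta = (delta @ weights[i][1:].T) * activation_gradient(pre_activations[i - 1])
    return gradients

# Tiny demo on random data: 4 inputs, hidden layers of 5 and 6 nodes, 3 classes.
rng = np.random.default_rng(0)
X = rng.random((8, 4))
Y = np.eye(3)[rng.integers(0, 3, size=8)]           # one-hot labels
activations = [(sigmoid, sigmoid_gradient), (relu, relu_gradient)]
weights = initialize_weights([4, 5, 6, 3], rng)
for _ in range(200):
    pre_activations, outputs = forward(X, weights, activations)
    gradients = back(Y, weights, activations, pre_activations, outputs)
    weights = [w - 0.5 * g for w, g in zip(weights, gradients)]
print("final loss:", loss(Y, forward(X, weights, activations)[1][-1]))
```

Keeping each activation next to its derivative in one list also makes it easy to give every layer a different activation function. Note that the derivative is applied to the stored weighted sums, not to the layer outputs.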

It’s been a while, so I honestly don’t remember how much more code it was in practice, just that the current examples are about as long as I could comfortably fit in the book without turning it into pages upon pages of code. Also, there was the problem of performance: once you scale to four or five layers, training the network can take a very long time if you rely on NumPy directly, and that can discourage experimentation in chapters like the one on overfitting. And even though the deep learning discussion in the last few chapters is little more than a high-level overview, doing things like CNNs without an ML library becomes a whole different level of complexity.

All in all, it was a painful trade-off, but I decided that the “write everything by yourself” approach was starting to burst at the seams by the time we reached Part 3. I see your point, however.

I totally understand the difficult tradeoffs you had to make. In any case, it was the type of book I was looking for and it enabled me to experiment with a technology I wanted to understand and comprehend on a code level, not just a conceptual level. For that I thank you!

And thank you for the kind feedback!

There is one little coding snafu that I forgot to point out. You have one fairly big inefficiency in the code. In every training loop, you calculate y_hat twice for the training data. The first time it is calculated to pass to the backprop function, the second time it is calculated in the reporting function. If the reporting function is moved up before the weights update, one can pass to the reporting function y_hat rather than the x-training-data and thus only calculate it once.
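A rough sketch of the reordering described above. To stay short it uses a single-layer softmax classifier on random data rather than the book's network, and the functions `forward`, `report` and `train` are illustrative assumptions, not the book's actual code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, w):
    return softmax(X @ w)

def loss(Y, y_hat):
    return -np.sum(Y * np.log(y_hat)) / Y.shape[0]

def report(iteration, Y, y_hat):
    # Takes the predictions directly, so no second forward pass is needed.
    training_loss = loss(Y, y_hat)
    accuracy = np.mean(np.argmax(y_hat, axis=1) == np.argmax(Y, axis=1))
    if iteration % 10 == 0:
        print(f"{iteration:3d}  loss: {training_loss:.4f}  accuracy: {accuracy:.2%}")
    return training_loss

def train(X, Y, iterations, lr):
    w = np.zeros((X.shape[1], Y.shape[1]))
    history = []
    for i in range(iterations):
        y_hat = forward(X, w)                   # the only forward pass this iteration
        history.append(report(i, Y, y_hat))     # report before the weights change
        w -= lr * (X.T @ (y_hat - Y)) / X.shape[0]
    return w, history

rng = np.random.default_rng(0)
X = rng.random((50, 4))
Y = np.eye(3)[rng.integers(0, 3, size=50)]      # random one-hot labels
w, history = train(X, Y, iterations=50, lr=0.5)
```

With this ordering, the reported numbers describe the weights from before the current update, so they lag one step behind. Reporting on a separate test set would still need its own forward pass.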

That’s intentional, to avoid explaining the refactoring and to preserve the flow of the explanation, but I might mention it later, so that people can fix it if they want to.