Tip: If you are on a slow or old machine like mine, or if you want to run many different examples to explore the design space, you can speed up the calculations by removing a border from the MNIST image data. Every image has a 1-pixel white border. Removing this border reduces the number of input variables by 108 (from 784 to 676), or more than 13%. In fact, you can drop even a 3-pixel border without any impact that I can notice. Dropping more is also possible, but then the expected max accuracy will also start to drop. Still, it is quite remarkable that even using only the innermost 8x8 image fragment one can easily get above 80% accuracy.
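A minimal sketch of the border cropping described above. It assumes the images are stored as flattened rows of 784 values (28x28) and that no bias column has been added yet; the name crop_border is hypothetical and not taken from the book.

```python
import numpy as np

def crop_border(X, border, side=28):
    """Remove `border` pixels from every edge of each flattened square image."""
    images = X.reshape(-1, side, side)
    inner = images[:, border:side - border, border:side - border]
    return inner.reshape(images.shape[0], -1)

# Stand-in data with the MNIST layout: 5 images of 28 * 28 = 784 pixels each
X = np.zeros((5, 784))
print(crop_border(X, 1).shape)   # (5, 676): 108 fewer inputs per image
print(crop_border(X, 10).shape)  # (5, 64): only the central 8x8 fragment
```

The same crop has to be applied to the training, validation, and test images so that every set has the same number of input variables.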
Gotcha: I ran the scenario with one hidden layer of 100 nodes against the original test set of 10,000 examples, without splitting it into 5,000 for validation and 5,000 for testing. I was surprised that the maximum accuracy I could achieve was only 97.8%, not the 98.6% stated in the book. However, this is purely an effect of the test set. When I split it into validation and test sets and tested on the 5,000 test examples only, I got the 98.6% accuracy with the same network weights. It surprised me that the size of the test set changes the accuracy that much.
About the tip: That’s smart unconventional thinking. I’ll be honest: I never thought about that. I wouldn’t have put that technique in the book anyway, because it might confuse matters (and result in different outcomes from the examples and end-of-chapter exercises), but if you’re willing to sacrifice a few percentage points for speed, it might be worth it. Just out of curiosity: in the exercise where you aim for 99% accuracy, how much do you lose by removing the border?
About the gotcha: if I understand correctly, you trained over the whole MNIST (training and test sets together) and tested over all of it as well. If so, then I’m surprised by this result. If anything, I’d expect training/testing over the same exact set to give an unrealistically high %, because of overfitting. Can you please confirm that you’re training and testing over the same set of 10,000 examples? Or maybe you’re training over 10,000, and then testing over the 5,000 test examples only?
I haven’t yet fully explored the 99% exercise with varying boundaries removed, but from the testing I have done so far, removing a 3-pixel boundary from the images doesn’t reduce max accuracy at all. The changes are within the noise. Even removing a bigger boundary, say 5 or 6 pixels, has only a relatively moderate impact on max accuracy, around 0.2-0.4%. There doesn’t seem to be a big drop-off, because removing even a 10-pixel boundary, leaving only the center 8x8 pixels, still achieves accuracies in the mid-80% range.
Gotcha: No, I did not train over the entire 70,000-example MNIST data set. I trained over the 60,000-example training set and then tested over the 10,000 validation + test examples. In other words, I trained exactly as in the book, but I tested over the validation and test sets combined. This produces almost 1% lower accuracy. Testing only over the 5,000-example test set produces a higher accuracy. So the choice to split the original 10,000 test examples into 5,000 for validation and 5,000 for testing was a lucky decision; otherwise it would not be possible to reach 99% with a one-hidden-layer net.
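A quick back-of-the-envelope check of what these numbers imply, assuming (as stated earlier in the thread) that the 97.8% and 98.6% figures come from the same network weights. The standard-error estimate additionally assumes independent errors and is only a rough guide.

```python
import math

total_examples = 10_000
test_examples = 5_000
validation_examples = total_examples - test_examples

correct_total = round(0.978 * total_examples)  # 9780 correct on all 10,000
correct_test = round(0.986 * test_examples)    # 4930 correct on the test half
correct_validation = correct_total - correct_test

validation_accuracy = correct_validation / validation_examples
print(validation_accuracy)  # 0.97

# Rough binomial standard error of an accuracy near 98% measured on 5,000 examples
std_error = math.sqrt(0.98 * 0.02 / test_examples)
print(std_error * 100)  # roughly 0.2 percentage points
```

Under these assumptions the validation half would sit at about 97.0%, so the gap between the two halves is well above what sampling noise of this size would typically explain.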
About the border removal: that is indeed interesting. Today I learned! I wonder what happens if one removes random pixels from the image instead (keeping the choice consistent across images). It would be fun to find the breaking point where inference really starts to suffer. I expect that the border pixels are less important than the central ones (for the carefully resized and centered MNIST numbers at least), but I’d be curious to see how much information the algorithm actually needs.
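One way the random-pixel experiment could be set up, as a sketch only: pick a fixed random subset of pixel positions once and keep exactly those columns in every data set. The names make_pixel_subset, X_train, and X_test are hypothetical, and the arrays here are zero-filled stand-ins for flattened 784-pixel images.

```python
import numpy as np

def make_pixel_subset(n_pixels, n_keep, seed=0):
    """Choose which pixel positions to keep; the same positions are used for every image."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_pixels, size=n_keep, replace=False))

# Keep half of the 784 pixels, chosen once with a fixed seed
keep = make_pixel_subset(784, 392)

# Stand-in arrays; real data would be the flattened MNIST images
X_train = np.zeros((6, 784))
X_test = np.zeros((3, 784))
X_train_small = X_train[:, keep]
X_test_small = X_test[:, keep]
print(X_train_small.shape, X_test_small.shape)  # (6, 392) (3, 392)
```

Lowering n_keep step by step and retraining each time would show where accuracy starts to break down.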
About the gotcha: Aaah, OK. I see. I was expecting that a few thousand test cases would basically level off random variations like those, but apparently that is not the case. On the other hand, 99% is a pretty arbitrary number: I checked how far I could get with one hidden layer, and used that number as a challenge. Did you counter-check by testing over the validation set only, instead of the test set? By your result, it seems that more of the “harder” images ended up in the validation set.
By the way, thank you for sharing this information. I’m enjoying reading about your experiments.
Here is another small tip regarding the loss function. In the book the loss function is implemented with
-np.sum(Y * np.log(y_hat)) / Y.shape[0]
Numerically that isn’t sound because y_hat can be zero, and log(0) is not finite (NumPy returns -inf with a warning, and 0 * -inf then yields nan). What I have seen some do is add a small number to y_hat, like
-np.sum(Y * np.log(y_hat + 1e-8)) / Y.shape[0]
That works but introduces a small error in the reported loss. For a well-trained model the loss can get pretty small, and then that small error can become noticeable. It isn’t a big deal since this is only a reporting function, but I think a better way to deal with it is the masked array feature of NumPy:
-np.sum(Y * np.ma.log(y_hat)) / Y.shape[0]
So instead of np.log simply use np.ma.log. This masks out the elements for which y_hat is zero. In most cases, for a well-trained network, when y_hat is zero so is Y, which means the product should be zero as well (x log(x) is taken to be zero at x equal to zero, which is its limit). The masking does the right thing in this case because for the sum it doesn’t matter whether a zero is added or the element is ignored. The result is exactly the same.
This implementation is only wrong for y_hat=0 and Y non-zero. While this can potentially happen early on in the training with randomly initialized weights, once a network has been sufficiently trained I think this essentially never happens.
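To make the comparison concrete, here is a small sketch of the two variants side by side. The function names loss_masked and loss_epsilon and the toy arrays are made up for illustration; Y is assumed to be one-hot encoded with one row per example.

```python
import numpy as np

def loss_masked(Y, y_hat):
    # np.ma.log masks entries where y_hat <= 0, so they are skipped by the sum
    return -np.sum(Y * np.ma.log(y_hat)) / Y.shape[0]

def loss_epsilon(Y, y_hat, eps=1e-8):
    # Adding eps avoids log(0) but slightly shifts every term of the loss
    return -np.sum(Y * np.log(y_hat + eps)) / Y.shape[0]

# Two one-hot labels and predictions that contain exact zeros
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y_hat = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.8, 0.0]])

exact = -(np.log(0.9) + np.log(0.8)) / 2
print(loss_masked(Y, y_hat) - exact)   # no deviation beyond floating-point rounding
print(loss_epsilon(Y, y_hat) - exact)  # small negative deviation caused by eps
```

Note the caveat from the text: if y_hat is zero where Y is one, the masked version silently drops a term that should be infinite, so the reported loss is too low in that case.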
If you like numerical issues, here is a problem I chased for three days. While implementing dropout regularization I ran into an issue with the implementation of softmax. In your book the implementation of softmax is fine but basic, meaning it does not protect against over- or underflow in the exponentials. What some do, for example, is subtract the maximum value before the exponential is applied. Mathematically this is equivalent, because it simply multiplies the numerator and the denominator of the softmax formula by the same constant factor. Nothing changes. Online I even found Python code for it that was something like
e = np.exp(x - np.max(x))
The problem with this code is subtle, but numerically it is stupid. What happens is the following. np.max(x) returns the maximum from the entire matrix, meaning the maximum in the entire mini-batch. But we only need the maximum for each input (image), not across several inputs. Numerically this causes problems because, for a row whose values all lie far below the batch maximum, it can push the arguments of the exponential so far to negative values that they all underflow and all exponentials return zero, so the softmax for that row divides zero by zero. The solution is to subtract only the row maximum, not the maximum across the entire mini-batch. Something like
e = np.exp(x - np.max(x,axis=1).reshape(-1,1))
This numerical issue manifested itself in the following way. Initially, the network was training perfectly fine. It reached about the accuracy it should reach. Then the accuracy started to drop, first slowly but then very quickly, and over the course of a few epochs the entire network blew up, with all weights increasing until everything was saturated. Nothing could stop it. I tried clipping the gradients, limiting the weight norms, etc. The issue was the above-mentioned bad implementation of the softmax function.
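The difference between the two variants can be reproduced with a tiny, deliberately extreme example. This is a sketch under the assumption that each row of the input holds the pre-softmax values of one example; the function names and the logits array are made up for illustration.

```python
import numpy as np

def softmax_global_max(logits):
    # Subtracts the maximum of the whole mini-batch (the problematic variant)
    e = np.exp(logits - np.max(logits))
    return e / np.sum(e, axis=1).reshape(-1, 1)

def softmax_row_max(logits):
    # Subtracts each row's own maximum, so every row has at least one exp(0) = 1
    e = np.exp(logits - np.max(logits, axis=1).reshape(-1, 1))
    return e / np.sum(e, axis=1).reshape(-1, 1)

# Two examples whose values differ hugely in scale
logits = np.array([[1000.0, 1001.0],
                   [-1000.0, -999.0]])

with np.errstate(invalid="ignore"):
    # Second row underflows to 0 / 0 and becomes nan
    print(softmax_global_max(logits))

# Both rows are valid probabilities here, and they are identical
print(softmax_row_max(logits))
```

With the row maximum, the denominator of every row is at least 1, so the division can never be zero by zero.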
Here is another numerical improvement I found. When using ReLU in a multi-layer network, the weights get bigger on average with each layer. With two or three hidden layers this isn’t a big problem, but for deeper networks it becomes an issue. Typically this is corrected with some kind of normalization layer. However, I found a simple solution that doesn’t require any normalization strategy. Rather than a ReLU I use a shifted-down ReLU:
f(x) = max(x, -1)
Instead of being zero for negative values, this function is -1 for anything below -1, and it is x for anything larger than -1. It is exactly the same function, just shifted down and to the left by 1. Using this as an activation function eliminates the progressively growing weights with deeper layers. No normalization is needed. On top of this, my first tests indicate that this version of ReLU works somewhat better on the MNIST data in combination with dropout regularization. I have no idea why, but it does.
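A minimal sketch of this activation in NumPy, based only on the description above (-1 below -1, x otherwise). The function names are hypothetical, and the gradient helper is an assumption about how it would plug into backpropagation.

```python
import numpy as np

def shifted_relu(z):
    # -1 for inputs below -1, the input itself otherwise
    return np.maximum(z, -1.0)

def shifted_relu_gradient(z):
    # Slope is 1 where the input passes through, 0 where it is clamped to -1
    return (z > -1.0).astype(float)

z = np.array([-3.0, -1.0, -0.5, 0.0, 2.0])
print(shifted_relu(z))           # values: -1, -1, -0.5, 0, 2
print(shifted_relu_gradient(z))  # values: 0, 0, 1, 1, 1

# Same as a standard ReLU shifted left and down by 1
print(np.allclose(shifted_relu(z), np.maximum(z + 1.0, 0.0) - 1.0))  # True
```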
@wasshuber, this idea sounds brilliant. Did you borrow it from somewhere else, or did you have it yourself? It seems to me that some of your ideas would deserve some further investigation by people with an academic bent. I love your experimental approach!
Do you have a hypothesis for why exactly a shifted-down ReLU would partially eliminate the need for normalization? Might that be something that could alternatively be done by picking different initialization values for the weights?
I discovered this myself by experimenting with all kinds of activation functions. It was easy to change the code from sigmoid to other activation functions and I was curious about what changes if I used different functions. I tried some really weird ones, too.
This is why I chose your path of coding it myself, because then it is much easier to change the things I want to change. With a library, one is in a straitjacket and can only change what the library allows one to change.
What made me analyze it more carefully was the fact that this shifted ReLU learned better in combination with dropout. So I tried to see why, and noticed that the magnitude of the weights stayed about the same from layer to layer, whereas with ReLU they keep growing. I don’t have a good explanation for why this is better, except that if there is a sort of additional bias the weights have to learn (their magnitude increases with deeper layers), then this will take longer in the learning process than if they do not have to learn this bias.
Then again, this is such a simple modification that I would be surprised if nobody has tried this before and noted the improvement. Searching online I do see shifted ReLUs being mentioned in lists of activation functions, but I have not found anything that mentions the improvement to learning they achieve and how this may be connected to the weight magnitude staying the same. We should also not forget that I only applied this to the MNIST data set. I don’t know if my observations hold in general.
Another tip that seems to help speed up training: I do a batch-size ramp. I start with batches of about 2-3 times the number of classes (for MNIST there are 10 classes). For example, I start with a batch size of 20. I double the batch size with each epoch until I reach the final batch size of my choice, and then continue with this batch size until the end.
The advantage here is that at the beginning when the weights are far away from their optimum, it is not necessary to have a particularly good estimator for the gradient, thus small batch sizes are fine and faster. But as we are approaching the optimum larger batch sizes are helpful to get an accurate gradient.
This reduces the importance of setting a proper batch size. One can take a larger batch size without negatively impacting the final accuracy of the model. Large batch size can sometimes mean that one gets stuck in a local minimum and the final accuracy of the model suffers. Ramping the batch size combines the advantages of small and large batch sizes.
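The ramp described above could be sketched as follows. The starting size of 20 comes from the post; the final size of 256, the toy data set size of 1,000, and the helper names are assumptions made for illustration.

```python
import numpy as np

def ramped_batch_size(epoch, start=20, final=256):
    """Double the batch size every epoch until the final size is reached."""
    return min(start * 2 ** epoch, final)

def batches(n_examples, batch_size, rng):
    """Yield shuffled index arrays, one per mini-batch."""
    order = rng.permutation(n_examples)
    for first in range(0, n_examples, batch_size):
        yield order[first:first + batch_size]

schedule = [ramped_batch_size(epoch) for epoch in range(7)]
print(schedule)  # [20, 40, 80, 160, 256, 256, 256]

rng = np.random.default_rng(0)
for epoch in range(3):
    size = ramped_batch_size(epoch)
    n_batches = sum(1 for _ in batches(1000, size, rng))
    print(epoch, size, n_batches)  # each index array would select one mini-batch
```

In a training loop, each yielded index array would be used to slice the training inputs and labels before the gradient step.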
@wasshuber, you should consider starting a blog/YT channel on ML training tricks and tips. I’m not kidding: you have a brilliant approach. I like how you explore so many different approaches and develop a deep intuition about them. You’re also good at explaining the reason why something works. I recognize the mark of an educator here.
That is an interesting suggestion. I have been mulling over the possibility of either releasing my code with notes, perhaps in a Jupyter notebook, or writing a book, sort of the sequel to your book where I continue without any libraries to expand to multilayer, dropout, better optimizers like Adam, etc.
For example, I have now expanded the code to allow any general combination of layers, either serially or in parallel, any directed graph. This allows one to experiment with skip connections or layers that have a mix of activation functions, etc. All of this is still less than 300 lines of code. The primary advantage I see with this type of minimalistic code implementation is that one can make code-level changes almost immediately and thus experiment with ideas that would otherwise be much harder to try. If one wanted to make code-level changes to the Keras library, this would require a much steeper learning curve. Very few are willing to do that.
A lot of experimentation is done on the application level where folks change the size and number of layers and play around with hyperparameters, but very little gets done on the underlying algorithms. I think there is an opportunity here to encourage more experimentation on the code level.
Margaret, is it possible to publish a Jupyter notebook as an ebook-only publication with you? I think the Jupyter notebook would be the ideal format because it encourages code-level experimentation while also allowing extensive text and illustrations for explanations.