Programming Machine Learning: MNIST benchmark for multi-layer networks

Does anybody have benchmark results for what accuracy is achievable on the MNIST data with a multi-layer network? I am particularly interested in networks with smaller node counts but more layers. For example, what can be done with two, three, or four layers of 100 nodes each? Or similar.

I have extended the one-hidden-layer code to multiple hidden layers. Now I am wondering how much better the results should get. For example, with a single 100-node hidden layer, the book and my own experiments achieve 98.6%. How much better should it get if one adds a second 100-node layer? I am asking because my early results do not show much, if any, improvement. I have even upgraded from SGD to the Adam algorithm, which trains a lot faster, but the final accuracy it achieves is pretty much identical, perhaps higher by 0.05% or so.
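
For reference, the forward pass I have in mind after the extension is essentially this (a simplified NumPy sketch with sigmoid hidden layers and a softmax output, not my actual code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exps = np.exp(z - z.max(axis=1, keepdims=True))   # shift for numerical stability
    return exps / exps.sum(axis=1, keepdims=True)

def forward(x, layers):
    """x: (batch, inputs); layers: list of (weights, biases), one pair per layer."""
    a = x
    for w, b in layers[:-1]:       # hidden layers
        a = sigmoid(a @ w + b)
    w, b = layers[-1]              # output layer
    return softmax(a @ w + b)
```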

I found a few resources while preparing the book, starting with the official MNIST site, which collects results from multiple papers:

http://yann.lecun.com/exdb/mnist/

Also check out:

In the early 2010s, around the emergence of AlexNet, the error rates on MNIST dropped so low that the dataset stopped being a useful benchmark for advanced classifiers, and it was replaced by more complex datasets like ImageNet.

Thank you! This was very helpful.

One question that I am asking myself is the following. Let’s say we give ourselves a certain node budget, for example 1000 nodes. What is the best way to distribute them over the various layers? Is it best to put them all in one hidden layer? Is it better to do 10 layers with 100 nodes each? Or something in between? From my reading so far, it appears to be advantageous to make the first hidden layers larger and then progressively reduce the number of nodes. But I have not found anything more specific, something like ‘reducing the number of nodes in each layer by a factor of 2 is optimal’. It looks like there should be some kind of optimum, at least when we limit ourselves to one dataset like MNIST.
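
To make the sweep concrete, these are the kinds of configurations I am planning to try, all with a budget of roughly 1000 hidden nodes (the list and the halving helper are just my own starting point, nothing canonical):

```python
# Candidate ways to spend a budget of ~1000 hidden nodes.
configurations = [
    [1000],            # everything in one hidden layer
    [500, 500],        # two equal layers
    [800, 200],        # front-loaded
    [100] * 10,        # ten layers of 100 nodes each
]

def halving(budget, n_layers):
    """Split a node budget over n_layers, halving the layer size each time."""
    weights = [2 ** (n_layers - 1 - i) for i in range(n_layers)]   # e.g. [4, 2, 1]
    scale = budget / sum(weights)
    return [max(1, round(w * scale)) for w in weights]

configurations.append(halving(1000, 3))    # roughly [571, 286, 143]
```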

I never found an authoritative answer to questions such as this one, at least not in the general case. I came to the conclusion that deciding on the number of layers and the number of nodes per layer is more art/experimentation than science. Rules of thumb like the one you mentioned are all we have.

When it comes to modern deep learning models with trillions of parameters, I see a general consensus that “more is better”: more layers, more nodes, and more training data. Having more of everything almost always helps, and almost never hurts, at least when it comes to accuracy. (On the other hand, if you care about considerations such as training time, memory footprint, and overfitting on small data sets, then having more nodes and layers can, of course, hurt. Having more data, though, generally always helps, as long as the data distribution during training is comparable to the one you get during inference.)

It makes sense that throwing more resources at the problem helps. But that is exactly why I think limiting the resources and then asking what is optimal would be very educational. Fixing the total number of nodes seems to me a particularly good and simple constraint, because the number of nodes determines how much memory/storage the model needs, and it also determines how much computation has to be done in the forward pass.

Now with my multi-layer code I will explore the 1000-node space. I am really curious what the optimum is. Perhaps there is no real optimum and it doesn’t matter much, within reason, how the nodes are distributed over layers.

I expect that you’ll start overfitting quite soon, at least if you keep using a simple dataset such as MNIST. That’s what I got from my experiments. Let me know if that matches your experience.

I have now run a good number of combinations of layers and nodes. The best I have found so far is a two-hidden-layer configuration with 950 nodes in the first layer with ReLU activation and 50 nodes in the second hidden layer with sigmoid; the output layer is a softmax. This gives me 99.28% accuracy on the test set of the last 5000 images in MNIST.

With numbers that high, I’d say you need to scale up to a more complex dataset. A random fluctuation in recognizing a handful of images could now make a significant difference.

I agree. The optimum is not very pronounced and there is noise to deal with. But, for example, going from a single layer with 1000 nodes to two layers with 950-50 nodes provides a 0.2% improvement, which is statistically significant. This nicely demonstrates the benefit of a deeper network. Going even deeper, I didn’t find any improvements, which I guess is due to what you said: the training data is probably the limiting factor.

This is an off-topic question, but do you know by any chance how the word embeddings in BERT are calculated? Is this a separate process, or does it happen together with training the stack of transformer layers? If it is separate, I guess it is some kind of autoencoder structure, but I am wondering what it looks like in detail: how many layers, how many nodes?

I would love to see a similar code-level treatment of the transformer architecture, like the one you did here. Perhaps on a language that has only about 100 different words (such as Toki Pona, which has only 123 words), or some other reduced vocabulary that allows one to train a transformer on an average computer and, in doing so, have a chance to really understand how it works.

I don’t know how the word vectors are calculated. I always assumed that they’re the outcome of training, but I might be wrong.

I would really love to extend this book to more complex/modern architectures like transformers and attention layers. I doubt I’ll ever have the resources, considering that the further you go into implementation, the more niche the information becomes. Most people working in ML don’t care that much about the implementation, and they don’t consider themselves programmers.

Yeah, wishful thinking on my part :slight_smile: Though I would think there are a good number of programmers who want to learn about these algorithms. The sales of your book should be able to answer that question.

More on topic: is the dropout method applied in both the forward pass and the backprop training, or is it only implemented for the backprop training? In other words, is it implemented by simply setting a certain portion of the gradients to zero, or does one also have to remove those nodes from the forward calculation?

I am trying to implement layer normalization. In the forward direction this is easy, but computing the gradient going backward is not so trivial: because the mean and variance depend on every input, each variable ends up with gradient contributions from all the other variables. These contributions are relatively small, so I am wondering: are they ignored in practice, or are they actually calculated and used during backprop training?
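
To be concrete, this is my current attempt at the full backward pass, cross terms included (plain NumPy; the variable names are mine and there may well be mistakes in it):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); normalize each sample over its features."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta
    cache = (x, mu, var, x_hat, gamma, eps)
    return out, cache

def layernorm_backward(dout, cache):
    x, mu, var, x_hat, gamma, eps = cache
    D = x.shape[-1]
    inv_std = 1.0 / np.sqrt(var + eps)

    dgamma = (dout * x_hat).sum(axis=0)
    dbeta = dout.sum(axis=0)

    dx_hat = dout * gamma
    # These are the cross terms: every input affects its row's mean and variance.
    dvar = (dx_hat * (x - mu)).sum(axis=-1, keepdims=True) * -0.5 * inv_std**3
    dmu = (-dx_hat * inv_std).sum(axis=-1, keepdims=True) \
          + dvar * (-2.0 * (x - mu)).mean(axis=-1, keepdims=True)
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / D + dmu / D
    return dx, dgamma, dbeta
```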

I never implemented dropout myself (although I did use it). However, off the top of my head: if you “turn off” a node in the forward calculations, then its gradient during backprop must be zero.

BTW, the difficulties you’re experiencing with implementing layer normalization are the reason why I didn’t plan to extend the “learn by coding” approach to more advanced architectures. Coding backprop for these algorithms is actually non-trivial.

Thanks. Makes sense. So for dropout one essentially pretends that node doesn’t exist.
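
In code, I imagine it as something like this (an “inverted dropout” sketch, untested; the mask from the forward pass is reused to zero the gradients on the way back):

```python
import numpy as np

def dropout_forward(a, drop_prob, training=True):
    """Randomly zero a fraction of activations; scale the rest to keep the expected value."""
    if not training or drop_prob == 0.0:
        return a, None
    mask = (np.random.rand(*a.shape) >= drop_prob) / (1.0 - drop_prob)
    return a * mask, mask

def dropout_backward(da, mask):
    """Reuse the same mask: the gradients of the dropped nodes are zeroed."""
    if mask is None:
        return da
    return da * mask
```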

It is true that it gets a bit harder for some algorithms, but then again, I think it is just one more tensor multiplication added. Other advanced algorithms, however, are quite easy to implement. For example, I implemented the Adam optimizer: it was straightforward, just a few lines of code, and it provides a wonderful speed-up.
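
For what it is worth, my Adam update is essentially the textbook version (a sketch with the usual default hyperparameters):

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single weight tensor; t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```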
