Programming Machine Learning: MNIST benchmark for multi-layer networks

I never found an authoritative answer to questions such as this one, at least not in the general case. My conclusion is that deciding on the number of layers and the number of nodes per layer is more art and experimentation than science. Rules of thumb like the one you mentioned are all we have.
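To make that experimentation concrete, here is a minimal sketch of the kind of sweep I mean, assuming scikit-learn is installed and can download the OpenML copy of MNIST. The candidate architectures in the list are arbitrary illustrative choices, not recommendations:

```python
# A minimal sketch of trying out layer counts and sizes on MNIST.
# Assumes scikit-learn and network access to OpenML; the specific
# architectures below are illustrative, not tuned recommendations.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel values to [0, 1]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=10000, random_state=42
)

# Try a few architectures; no formula picks the winner up front,
# so we just measure each one on held-out data.
for hidden in [(100,), (300,), (100, 100), (300, 100)]:
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=20, random_state=42)
    clf.fit(X_train, y_train)
    print(hidden, "validation accuracy:", clf.score(X_val, y_val))
```

In practice you would keep whichever configuration scores best on the validation set, then confirm it once on a separate test set.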

When it comes to modern deep learning models with trillions of parameters, I see a general consensus that “more is better”: more layers, more nodes, and more training data. Having more of everything almost always helps, and almost never hurts, at least when it comes to accuracy. (OTOH, if you care about considerations such as training time, memory footprint, and overfitting on small data sets, then having more nodes and layers can, of course, hurt. Having more data, though, generally helps, as long as the data distribution during training is comparable to the one you see during inference.)
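To illustrate the overfitting caveat, here is a hedged sketch under the same assumptions as above (scikit-learn plus the OpenML copy of MNIST; the network and subset sizes are arbitrary): an oversized network trained on a tiny slice of the data tends to memorize the training set while doing noticeably worse on the test set.

```python
# A sketch of the overfitting caveat: a large network trained on a
# deliberately tiny training set. Sizes here are arbitrary examples.
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0
X_small, y_small = X[:500], y[:500]    # tiny slice of the training data
X_test, y_test = X[60000:], y[60000:]  # the standard MNIST test split

big = MLPClassifier(hidden_layer_sizes=(1000, 1000), max_iter=200, random_state=0)
big.fit(X_small, y_small)
print("train accuracy:", big.score(X_small, y_small))  # typically near 1.0
print("test accuracy: ", big.score(X_test, y_test))    # typically much lower
```

The gap between the two printed numbers is the overfitting; adding more training data (keeping the same distribution) is usually the most reliable way to close it.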