# Programming Machine Learning: Training the network

@nusco

Programming Machine Learning
From Coding to Deep Learning
German edition

Chapter 11, "Training the network", page 191

```
def back(X, Y, y_hat, w2, h):
    w2_gradient = np.matmul(prepend_bias(h).T, y_hat - Y) / X.shape[0]
    w1_gradient = np.matmul(prepend_bias(X).T,
                            np.matmul(y_hat - Y, w2[1:].T)
                            * sigmoid_gradient(h)) / X.shape[0]
    return (w1_gradient, w2_gradient)
```

Hi Paolo,
can you explain to me how you came to the decision to multiply this expression:
`np.matmul(y_hat - Y, w2[1:].T) * sigmoid_gradient(h)`
element by element?

Sincerely,
Kai

Hello, Kai! Sorry that I took some time to reply.

I'll start from the formula that you can read right before the code. That formula, in words, would read like this:

`(Y_hat - Y) multiplied by w2, multiplied by the derivative of the sigmoid`

I'm assuming that the `sigmoid_gradient()` function is clear. The difference between the network's predictions and the ground truth (`Y_hat - Y`) should also be clear. (Please let me know if it's not.) To make things easier in the following text, I'll call `Y_hat - Y` "the errors" and `w2` "the weights".
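(For reference, here is a minimal sketch of those two pieces. The names follow the code in the book, but the exact definitions below are my assumption of the standard logistic function and its derivative, not a quote from the book:)

```python
import numpy as np

def sigmoid(z):
    # the logistic function, applied element by element
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(sigmoid_output):
    # derivative of the sigmoid, written in terms of the sigmoid's own output
    return np.multiply(sigmoid_output, 1 - sigmoid_output)

h = sigmoid(np.array([[0.0, 2.0]]))
print(sigmoid_gradient(h))  # element-wise derivative, same shape as h
```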

So I suspect that the confusing part is the multiplication of the errors by the weights: `np.matmul(y_hat - Y, w2[1:].T)`. I can see two things here that might be confusing. One is the `w2[1:]` operation, and the other is the fact that we transpose the weights with `.T`.

About the `w2[1:]` operation: remember that the first row of the weights contains the weights associated with the bias column, the one that is filled with 1s. During forward propagation, the biases get added right after calculating the intermediate result `h`. So now that we're backpropagating, we need to remove the relevant weights at this same point. As the book says, those weights won't have any effect on what happens in the previous layer, because the biases haven't been added yet in the previous layer. That's what `w2[1:]` does: it removes the first row of weights from `w2`. The result is a smaller matrix of weights, without the bias weights. (Let's call it that: "weights-without-biases".)
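To see concretely what that slicing does, here is a tiny example with made-up numbers:

```python
import numpy as np

# a made-up w2: the first row holds the bias weights, the rest the "real" weights
w2 = np.array([[0.1, 0.2],    # bias row, to be dropped during backprop
               [1.0, 2.0],
               [3.0, 4.0]])

print(w2[1:])        # weights-without-biases: the same matrix minus row 0
print(w2[1:].shape)  # one row fewer than w2
```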

The second step is multiplying the errors by the weights-without-biases. You might wonder why we have to transpose that weight matrix. As I wrote in the previous section of the book, matrix dimensions are hard to explain on paper. Your best option might be to experiment with that code in a command line or a debugger, and see what the matrix dimensions are… But maybe I can make it easier for you by giving you a very simplified example of matrix multiplication. Consider what happens in the neural network during forward propagation. Here is the line that computes `y_hat`:

```
y_hat = softmax(np.matmul(prepend_bias(h), w2))
```

Let's reason about the matrix dimensions involved. First, forget about `prepend_bias`, because we already discussed how to deal with the biases above. Also, consider that `y_hat` has the same dimensions as `(y_hat - Y)`. And finally, consider that the softmax doesn't change the dimensions of the matrix. So, to have a simplified example, let's focus on a formula like:

```
y = np.matmul(h, w)
```

I renamed `y_hat` to `y` and `w2` to `w` to make it as simple as possible. Now let's simulate that operation with a couple of tiny matrices in a Python interpreter:

```
>>> import numpy as np

>>> h = np.array([[1, 2], [3, 4]])
>>> h
array([[1, 2],
       [3, 4]])

>>> w = np.array([[10], [20]])
>>> w
array([[10],
       [20]])

>>> y = np.matmul(h, w)
>>> y
array([[ 50],
       [110]])
```

That's the calculation that happens during forward propagation. Note the dimensions involved:

```
>>> y.shape
(2, 1)
>>> h.shape
(2, 2)
>>> w.shape
(2, 1)
```

Now, consider backward propagation. This time we need to multiply `y` (or rather, something with the same dimensions as `y`) by `w`. We cannot do that by just multiplying them:

```
>>> np.matmul(y, w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
```

Instead, because of the way matrix multiplication is defined, we need to transpose `w` so that the dimensions match, and each prediction gets matched with the "right" weight:

```
>>> np.matmul(y, w.T)
array([[ 500, 1000],
       [1100, 2200]])
```

Hello Paolo,
first of all I would like to thank you for your very thorough answer. At the same time I would like to apologize for not having spelled out my question clearly enough.
Your answer unfortunately missed my point, so I will try to formulate my question again, more precisely this time:

How do I come to the conclusion that I should multiply the two matrices element by element, rather than transposing one of the two matrices and then multiplying them together by conventional matrix multiplication?

For me it is surprising that in the end the correct information is contained in the gradient, although we exchange the operands, transpose, and multiply matrices both conventionally and element-wise.

Is it the intended dimension of the gradient that leads us to these operations? If so, there are also wrong paths through which the correct information is lost, but which still end up with the correct dimensions.

I have translated the text with DeepL into English and hope that it is sufficiently grammatically correct.

I wish you a nice day
Kai

Aaah, I understand what you mean.

About your specific question: no, it's not that the matrix dimensions are leading us to do this computation. Instead, the matrix dimensions are a decent guideline to double-check that we're not doing the wrong operations. You're right that there are paths that might result in operations that are possible (i.e., the matrix dimensions are OK) but not right. However, dimensions are a decent guideline to rule out a few obvious mistakes. In a relatively short sequence of operations like this one, which basically consists of a matrix multiplication followed by an element-by-element multiplication, matrix dimensions are a decently reliable way of detecting mistakes.

That being said, the operations in this line (and most operations that involve matrices in general) are hard to justify in text. The second-best way to justify them would be a concrete numerical example, possibly involving very small matrices. (I usually go for (2, 3) or similar small sizes.) That's something you can do on paper, maybe starting with the example in my previous answer.
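Here is a sketch of what such a numerical double-check could look like in code rather than on paper: build a tiny two-layer network, compute the gradient with the backpropagation formula, then nudge one weight by a small amount and see how the loss actually changes. The function names mirror the book's, but the exact network sizes and the cross-entropy loss below are my own assumptions for the sake of a self-contained example:

```python
import numpy as np

np.random.seed(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(s):
    return s * (1 - s)

def softmax(logits):
    e = np.exp(logits)
    return e / np.sum(e, axis=1, keepdims=True)

def prepend_bias(M):
    # add a column of 1s in front of the matrix
    return np.insert(M, 0, 1, axis=1)

def forward(X, w1, w2):
    h = sigmoid(np.matmul(prepend_bias(X), w1))
    y_hat = softmax(np.matmul(prepend_bias(h), w2))
    return y_hat, h

def loss(Y, y_hat):
    # cross-entropy, averaged over the examples
    return -np.sum(Y * np.log(y_hat)) / Y.shape[0]

def back(X, Y, y_hat, w2, h):
    w2_gradient = np.matmul(prepend_bias(h).T, y_hat - Y) / X.shape[0]
    w1_gradient = np.matmul(prepend_bias(X).T,
                            np.matmul(y_hat - Y, w2[1:].T)
                            * sigmoid_gradient(h)) / X.shape[0]
    return w1_gradient, w2_gradient

# made-up tiny data: 2 examples, 3 inputs, 2 hidden nodes, 2 classes
X = np.random.randn(2, 3)
Y = np.array([[1, 0], [0, 1]])
w1 = np.random.randn(4, 2)   # (3 inputs + bias) x 2 hidden nodes
w2 = np.random.randn(3, 2)   # (2 hidden + bias) x 2 classes

y_hat, h = forward(X, w1, w2)
w1_gradient, _ = back(X, Y, y_hat, w2, h)

# finite-difference check: nudge one weight of w1 and watch the loss
i, j, eps = 1, 0, 1e-6
w1_plus = w1.copy()
w1_plus[i, j] += eps
numeric = (loss(Y, forward(X, w1_plus, w2)[0]) - loss(Y, y_hat)) / eps
print(numeric, w1_gradient[i, j])  # the two numbers should nearly match
```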

But of course, even numbers don't prove anything; they're just a deeper, safer double-check. The only thing that proves something would be a formal mathematical proof that the gradient of forward propagation is indeed expressed by the formula above. But I'm not much of a mathematician, and I didn't produce this proof. It is, however, part of the literature.

So, if you ask me: "Why does the formula look like that?", I have to give you a handwavy answer: "Because it has been proven mathematically". That sounds like I'm running away from the question, but it's a core idea in this book: don't bother proving the math. For example, I never prove the chain rule of differentiation, or the fact that the derivative of the mean squared error is the one we're using in the book; I just lift those formulae from math books, or from my own university knowledge.

OTOH, it seems to me that you'd be happy with an intuitive explanation, more than a formal proof. So, to attempt an answer to your question: why are we multiplying those matrices element by element? Here is the intuition: during forward propagation, we take a matrix and pass it through the sigmoid function. So we have function composition:

```
y = sigmoid(M)
```

M is the result of whatever comes before the sigmoid. The output of this composition is a matrix that has the same dimensions as M. We just take each element and apply the sigmoid.

When we do backprop, this composition becomes a multiplication, because of the chain rule:

`y' = sigmoid'(M) * M'`

There is no reason to transpose anything here. In forward propagation, each element in M was passed through the sigmoid; now, in backpropagation, the gradient of each element is multiplied by the gradient of the sigmoid. We didn't do anything special with sums or matrix multiplication on our way forward, so we have no reason to do it on our way back. As a concrete example, imagine that M is a (1, 1) matrix with one element. Now you can think of M as a scalar, and these formulae are just a straightforward application of the chain rule. Now imagine doing that on many elements instead of one, and you can get a mental picture of why we're just multiplying element by element.
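To make that intuition concrete in code, here is a hypothetical check: apply the chain rule to each element on its own, and confirm that it agrees with a numerical derivative of the whole composition. The matrix-valued function `M(x)` below is made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# a made-up M(x): each element of the matrix depends on a scalar x
def M(x):
    return np.array([[x, 2 * x], [3 * x, -x]])

def M_prime(x):
    return np.array([[1.0, 2.0], [3.0, -1.0]])

x, eps = 0.5, 1e-6

# chain rule, element by element: y' = sigmoid'(M) * M'
s = sigmoid(M(x))
chain_rule = s * (1 - s) * M_prime(x)

# numerical derivative of the whole composition, also element by element
numeric = (sigmoid(M(x + eps)) - sigmoid(M(x))) / eps

print(np.allclose(chain_rule, numeric, atol=1e-4))  # True
```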

In contrast, the operations involving `h` and `w2` were matrix operations on the way forward, so we should take that into account when we backpropagate.

I hope that the above explanation helped your intuition, Kai!

As I understand it, there is no general recipe for differentiating compositions of functions whose inputs are matrices with respect to those matrices, as there is in the scalar case.

Thank you for giving me an answer that helped my understanding anyway.

I am fascinated by mathematics and its power. But to be able to see its beauty, it requires a lot of time. That is why I asked such a question.

Your book was recommended to me and declared very worthwhile, which I can confirm 100%. It is my introduction to ML. Thank you for it!

