# Programming Machine Learning: Training the network

@nusco

Programming Machine Learning
From Coding to Deep Learning
German edition

Chapter 11, "Training the network", page 191

```
def back(X, Y, y_hat, w2, h):
    w2_gradient = np.matmul(prepend_bias(h).T, y_hat - Y) / X.shape[0]
    w1_gradient = np.matmul(prepend_bias(X).T,
                            np.matmul(y_hat - Y, w2[1:].T)
                            * sigmoid_gradient(h)) / X.shape[0]
    return (w1_gradient, w2_gradient)
```

Hi Paolo,
can you explain to me how you came to the decision to multiply this expression:
`np.matmul(y_hat - Y, w2[1:].T) * sigmoid_gradient(h)`
element by element?

Sincerely,
Kai

Hello, Kai! Sorry that I took some time to reply.

I'll start from the formula that you can read right before the code. That formula, in words, would read like this:

`(Y_hat - Y) multiplied by w2, multiplied by the derivative of the sigmoid`

I'm assuming that the `sigmoid_gradient()` function is clear. The difference between the network's predictions and the ground truth (`Y_hat - Y`) should also be clear. (Please let me know if it's not.) To make things easier in the following text, I'll call `Y_hat - Y` "the errors" and `w2` "the weights".
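(For reference, here is a minimal sketch of those two pieces. The names follow the code in the book, but the exact definitions below are my assumption of the standard logistic function and its derivative, not a quote from the book:)

```python
import numpy as np

def sigmoid(z):
    # the logistic function, applied element by element
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(sigmoid_output):
    # derivative of the sigmoid, written in terms of the sigmoid's own output
    return np.multiply(sigmoid_output, 1 - sigmoid_output)

h = sigmoid(np.array([[0.0, 2.0]]))
print(sigmoid_gradient(h))  # element-wise derivative, same shape as h
```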

So I suspect that the confusing part is the multiplication of the errors by the weights: `np.matmul(y_hat - Y, w2[1:].T)`. I can see two things here that might be confusing. One is the `w2[1:]` operation, and the other is the fact that we transpose the weights with `.T`.

About the `w2[1:]` operation: remember that the first row of the weights contains the weights associated with the bias column, the one that is filled with 1s. During forward propagation, the biases get added right after calculating the intermediate result `h`. So now that we're backpropagating, we need to remove the relevant weights at this same point. As the book says, those weights won't have any effect on what happens in the previous layer, because the biases haven't been added yet in the previous layer. That's what `w2[1:]` does: it removes the first row of weights from `w2`. The result is a smaller matrix of weights, without the bias weights. (Let's call it that: "weights-without-biases".)
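To see concretely what that slicing does, here is a tiny example with made-up numbers:

```python
import numpy as np

# a made-up w2: the first row holds the bias weights, the rest the "real" weights
w2 = np.array([[0.1, 0.2],    # bias row, to be dropped during backprop
               [1.0, 2.0],
               [3.0, 4.0]])

print(w2[1:])        # weights-without-biases: the same matrix minus row 0
print(w2[1:].shape)  # one row fewer than w2
```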

The second step is multiplying the errors by the weights-without-biases. You might wonder why we have to transpose that weight matrix. As I wrote in the previous section of the book, matrix dimensions are hard to explain on paper. Your best option might be to experiment with that code in a command line or a debugger, and see what the matrix dimensions are… But maybe I can make it easier for you by giving you a very simplified example of matrix multiplication. Consider what happens in the neural network during forward propagation. Here is the line that computes `y_hat`:

```
y_hat = softmax(np.matmul(prepend_bias(h), w2))
```

Let's reason about the matrix dimensions involved. First, forget about `prepend_bias`, because we already discussed how to deal with the biases above. Also, consider that `y_hat` has the same dimensions as `(y_hat - Y)`. And finally, consider that the softmax doesn't change the dimensions of the matrix. So, to have a simplified example, let's focus on a formula like:

```
y = np.matmul(h, w)
```

I renamed `y_hat` to `y` and `w2` to `w` to make it as simple as possible. Now let's simulate that operation with a couple of tiny matrices in a Python interpreter:

```
>>> import numpy as np

>>> h = np.array([[1, 2], [3, 4]])
>>> h
array([[1, 2],
       [3, 4]])

>>> w = np.array([[10], [20]])
>>> w
array([[10],
       [20]])

>>> y = np.matmul(h, w)
>>> y
array([[ 50],
       [110]])
```

That's the calculation that happens during forward propagation. Note the dimensions involved:

```
>>> y.shape
(2, 1)
>>> h.shape
(2, 2)
>>> w.shape
(2, 1)
```

Now, consider backward propagation. This time we need to multiply `y` (or rather, something with the same dimensions as `y`) by `w`. We cannot do that by just multiplying them:

```
>>> np.matmul(y, w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
```

Instead, because of the way matrix multiplication is defined, we need to transpose `w` so that the dimensions match, and each prediction gets matched with the "right" weight:

```
>>> np.matmul(y, w.T)
array([[ 500, 1000],
       [1100, 2200]])
```

Hello Paolo,
first of all I would like to thank you for your very thorough answer. At the same time I would like to apologize for not having spelled out my question clearly enough.
Your answer unfortunately missed my point, so I will try to formulate my question again, more precisely this time:

How do I come to the conclusion that I should multiply the two matrices element by element, rather than transposing one of the two matrices and then multiplying them together by conventional matrix multiplication?

For me it is surprising that in the end the correct information is contained in the gradient, although we exchange the operands, transpose, and multiply matrices both conventionally and element-wise.

Is it the intended dimension of the gradient that leads us to these operations? If so, there are also wrong paths through which the correct information is lost, but which still end up with the correct dimensions.

I have translated the text with DeepL into English and hope that it is sufficiently grammatically correct.

I wish you a nice day
Kai

Aaah, I understand what you mean.

About your specific question: no, it's not that the matrix dimensions are leading us to do this computation. Instead, the matrix dimensions are a decent guideline to double-check that we're not doing the wrong operations. You're right that there are paths that might result in operations that are possible (i.e., the matrix dimensions are OK) but not right. However, dimensions are a decent guideline to rule out a few obvious mistakes. In a relatively short sequence of operations like this one, which basically consists of a matrix multiplication followed by an element-by-element multiplication, matrix dimensions are a decently reliable way of detecting mistakes.

That being said, the operations in this line (and most operations that involve matrices in general) are hard to justify in text. The second-best way to justify them would be a concrete numerical example, possibly involving very small matrices. (I usually go for (2, 3) or similar small sizes.) That's something you can do on paper, maybe starting with the example in my previous answer.
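Here is a sketch of what such a numerical double-check could look like in code rather than on paper: build a tiny two-layer network, compute the gradient with the backpropagation formula, then nudge one weight by a small amount and see how the loss actually changes. The function names mirror the book's, but the exact network sizes and the cross-entropy loss below are my own assumptions for the sake of a self-contained example:

```python
import numpy as np

np.random.seed(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(s):
    return s * (1 - s)

def softmax(logits):
    e = np.exp(logits)
    return e / np.sum(e, axis=1, keepdims=True)

def prepend_bias(M):
    # add a column of 1s in front of the matrix
    return np.insert(M, 0, 1, axis=1)

def forward(X, w1, w2):
    h = sigmoid(np.matmul(prepend_bias(X), w1))
    y_hat = softmax(np.matmul(prepend_bias(h), w2))
    return y_hat, h

def loss(Y, y_hat):
    # cross-entropy, averaged over the examples
    return -np.sum(Y * np.log(y_hat)) / Y.shape[0]

def back(X, Y, y_hat, w2, h):
    w2_gradient = np.matmul(prepend_bias(h).T, y_hat - Y) / X.shape[0]
    w1_gradient = np.matmul(prepend_bias(X).T,
                            np.matmul(y_hat - Y, w2[1:].T)
                            * sigmoid_gradient(h)) / X.shape[0]
    return w1_gradient, w2_gradient

# made-up tiny data: 2 examples, 3 inputs, 2 hidden nodes, 2 classes
X = np.random.randn(2, 3)
Y = np.array([[1, 0], [0, 1]])
w1 = np.random.randn(4, 2)   # (3 inputs + bias) x 2 hidden nodes
w2 = np.random.randn(3, 2)   # (2 hidden + bias) x 2 classes

y_hat, h = forward(X, w1, w2)
w1_gradient, _ = back(X, Y, y_hat, w2, h)

# finite-difference check: nudge one weight of w1 and watch the loss
i, j, eps = 1, 0, 1e-6
w1_plus = w1.copy()
w1_plus[i, j] += eps
numeric = (loss(Y, forward(X, w1_plus, w2)[0]) - loss(Y, y_hat)) / eps
print(numeric, w1_gradient[i, j])  # the two numbers should nearly match
```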

But of course, even numbers don't prove anything; they're just a deeper, safer double-check. The only thing that proves something would be a formal mathematical proof that the gradient of forward propagation is indeed expressed by the formula above. But I'm not much of a mathematician, and I didn't produce this proof. It is, however, part of the literature.

So, if you ask me: "Why does the formula look like that?", I have to give you a handwavy answer: "Because it has been proven mathematically". That sounds like I'm running away from the question, but it's a core idea in this book: don't bother proving the math. For example, I never prove the chain rule of differentiation, or the fact that the derivative of the mean squared error is the one we're using in the book; I just lift those formulae from math books, or from my own university knowledge.

OTOH, it seems to me that you'd be happy with an intuitive explanation, more than a formal proof. So, to attempt an answer to your question: why are we multiplying those matrices element by element? Here is the intuition: during forward propagation, we take a matrix and pass it through the sigmoid function. So we have function composition:

```
y = sigmoid(M)
```

M is the result of whatever comes before the sigmoid. The output of this composition is a matrix that has the same dimensions as M. We just take each element and apply the sigmoid.

When we do backprop, this composition becomes a multiplication, because of the chain rule:

`y' = sigmoid'(M) * M'`

There is no reason to transpose anything here. In forward propagation, each element in M was passed through the sigmoid; now, in backpropagation, the gradient of each element is multiplied by the gradient of the sigmoid. We didn't do anything special with sums or matrix multiplication on our way forward, so we have no reason to do it on our way back. As a concrete example, imagine that M is a (1, 1) matrix with one element. Now you can think of M as a scalar, and these formulae are just a straightforward application of the chain rule. Now imagine doing that on many elements instead of one, and you can get a mental picture of why we're just multiplying element by element.
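To make that intuition concrete in code, here is a hypothetical check: apply the chain rule to each element on its own, and confirm that it agrees with a numerical derivative of the whole composition. The matrix-valued function `M(x)` below is made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# a made-up M(x): each element of the matrix depends on a scalar x
def M(x):
    return np.array([[x, 2 * x], [3 * x, -x]])

def M_prime(x):
    return np.array([[1.0, 2.0], [3.0, -1.0]])

x, eps = 0.5, 1e-6

# chain rule, element by element: y' = sigmoid'(M) * M'
s = sigmoid(M(x))
chain_rule = s * (1 - s) * M_prime(x)

# numerical derivative of the whole composition, also element by element
numeric = (sigmoid(M(x + eps)) - sigmoid(M(x))) / eps

print(np.allclose(chain_rule, numeric, atol=1e-4))  # True
```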

In contrast, the operations involving `h` and `w2` were matrix operations on the way forward, so we should take that into account when we backpropagate.

I hope that the above explanation helped your intuition, Kai!

As I understand it, there is no general recipe for differentiating compositions of functions whose inputs are matrices with respect to those matrices, as there is in the scalar case.

Thank you for giving me an answer that helped my understanding anyway.

I am fascinated by mathematics and its power. But to be able to see its beauty, it requires a lot of time. That is why I asked such a question.

Your book was recommended to me and declared very worthwhile, which I can confirm 100%. It is my introduction to ML. Thank you for it!

