Hello, Kai! Sorry it took me some time to reply.
I’ll start with the formula that you can read right before the code. In words, that formula reads like this:
(Y_hat - Y) multiplied by w2, multiplied by the derivative of the sigmoid
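Spelled out in code, that formula corresponds to something like this (I’m combining the pieces we’ll look at below, so the exact line in your listing may look slightly different):

np.matmul(y_hat - Y, w2[1:].T) * sigmoid_gradient(h)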
I’m assuming that the sigmoid_gradient() function is clear. The difference between the network’s output and the ground truth (Y_hat - Y) should also be clear. (Please let me know if it’s not.) To make things easier in the following text, I’ll call Y_hat - Y “the errors” and w2 “the weights”.
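Just in case, here is a minimal sketch of how a sigmoid_gradient() function like this one is typically implemented (I’m assuming it takes the sigmoid’s output rather than its raw input, which is the common convention):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(sigmoid_output):
    # The sigmoid's derivative, expressed in terms of the sigmoid's own
    # output: s'(z) = s(z) * (1 - s(z))
    return np.multiply(sigmoid_output, 1 - sigmoid_output)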
So I suspect that the confusing part is the multiplication of the errors by the weights: np.matmul(y_hat - Y, w2[1:].T). I can see two things here that might be confusing. One is the w2[1:] operation, and the other is the fact that we transpose the weights with .T.
About the w2[1:] operation: remember that the first row of the weights contains the weights associated with the bias column (the one that is filled with 1s). During forward propagation, the biases get added right after calculating the intermediate result h. So now that we’re backpropagating, we need to remove the relevant weights at this same point. Like the book says, those weights won’t have any effect on what happens in the previous layer, because the biases haven’t been added yet in the previous layer. That’s what w2[1:] does: it removes the first row of weights from w2. The result is a smaller matrix of weights, without the biases. (Let’s call it that: the “weights-without-biases”.)
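To see that slicing at work, here is a made-up tiny weights matrix (the numbers are arbitrary; only the shapes matter):

>>> import numpy as np
>>> w2 = np.array([[0.1, 0.2],
...                [0.3, 0.4],
...                [0.5, 0.6]])
>>> w2[1:]
array([[0.3, 0.4],
       [0.5, 0.6]])

The first row (the bias weights) is gone, and the rows of regular weights remain.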
The second step is multiplying the errors by the weights-without-biases. You might wonder why we have to transpose that weight matrix. As I wrote in the previous section of the book, matrix dimensions are hard to explain on paper. Your best option might be to experiment with that code in a command line or a debugger, and see what the matrix dimensions are… But maybe I can make it easier for you by giving you a very simplified example of matrix multiplication. Consider what happens in the neural network during forward propagation. Here is the line that computes y_hat:
y_hat = softmax(np.matmul(prepend_bias(h), w2))
Let’s reason about the matrix dimensions involved. First, forget about prepend_bias, because we already discussed how to deal with the biases above. Also, consider that y_hat has the same dimensions as (y_hat - Y). And finally, consider that the softmax doesn’t change the dimensions of the matrix. So, to have a simplified example, let’s focus on a formula like:
y = np.matmul(h, w)
I renamed y_hat to y and w2 to w to make it as simple as possible. Now let’s simulate that operation with a couple of tiny matrices in a Python interpreter:
>>> import numpy as np
>>> h = np.array([[1, 2], [3, 4]])
>>> h
array([[1, 2],
       [3, 4]])
>>> w = np.array([[10], [20]])
>>> w
array([[10],
       [20]])
>>> y = np.matmul(h, w)
>>> y
array([[ 50],
       [110]])
That’s the calculation that happens during forward propagation. Note the dimensions involved:
>>> y.shape
(2, 1)
>>> h.shape
(2, 2)
>>> w.shape
(2, 1)
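By the way, here is the general rule of np.matmul: an (n, k) matrix multiplied by a (k, m) matrix gives an (n, m) matrix, and the “inner” dimensions k have to match. In our case, (2, 2) multiplied by (2, 1) gives (2, 1).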
Now, consider backward propagation. This time, we need to multiply y (or rather, something with the same dimensions as y) by w. We cannot do that by just multiplying them:
>>> np.matmul(y, w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
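You can see where the mismatch comes from by checking the shapes:

>>> y.shape, w.shape
((2, 1), (2, 1))

The inner dimensions (1 and 2) don’t match, so NumPy refuses to multiply.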
Instead, because of the way matrix multiplication is defined, we need to transpose w so that the dimensions match, and each prediction gets matched with the “right” weight:
>>> np.matmul(y, w.T)
array([[ 500, 1000],
       [1100, 2200]])
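And now the shapes line up:

>>> w.T.shape
(1, 2)
>>> np.matmul(y, w.T).shape
(2, 2)

Note that the result has the same shape as h. That’s exactly what we need, because this matrix then gets multiplied element by element by the sigmoid’s gradient, which also has the shape of h.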
Does that explanation help your intuition?