Hello, Kai! Sorry it took me some time to reply.

I'll start from the formula that you can read right before the code. That formula, in words, would read like this:

`(Y_hat - Y) multiplied by w2, multiplied by the derivative of the sigmoid`
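Just as a sketch, here's what that formula would look like as a runnable snippet. The tiny matrices are made up purely to check the shapes, and I'm assuming the derivative of the sigmoid is applied to the sigmoid outputs `h` and expressed in terms of those outputs:

```python
import numpy as np

def sigmoid_gradient(s):
    # derivative of the sigmoid, expressed in terms of the sigmoid's output s
    return s * (1 - s)

# tiny made-up matrices, just to check that the shapes line up
y_hat = np.array([[0.7], [0.4]])        # network output, shape (2, 1)
Y = np.array([[1.0], [0.0]])            # ground truth, shape (2, 1)
w2 = np.array([[0.5], [10.0], [20.0]])  # weights; the first row is the bias row
h = np.array([[0.6, 0.9], [0.1, 0.8]])  # sigmoid outputs, shape (2, 2)

# (Y_hat - Y), multiplied by w2 (without the bias row),
# multiplied by the derivative of the sigmoid:
a_gradient = np.matmul(y_hat - Y, w2[1:].T) * sigmoid_gradient(h)
print(a_gradient.shape)  # (2, 2), the same shape as h
```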

I'm assuming that the `sigmoid_gradient()` function is clear. The difference between the network's output and the ground truth (`Y_hat - Y`) should also be clear. (Please let me know if it's not.) To make things easier in the following text, I'll call `Y_hat - Y` "the errors" and `w2` "the weights".

So I suspect that the confusing part is the multiplication of the errors by the weights: `np.matmul(y_hat - Y, w2[1:].T)`. I can see two things here that might be confusing. One is the `w2[1:]` operation, and the other is the fact that we transpose the weights with `.T`.

About the `w2[1:]` operation: remember that the first row of the weights contains the weights associated with the bias column, the one that is filled with 1s. During forward propagation, the biases get added right after calculating the intermediate result `h`. So now that we're backpropagating, we need to *remove* the relevant weights at this same point. Like the book says, those weights won't have any effect on what happens in the previous layer, because the biases haven't been added yet in the previous layer. That's what `w2[1:]` does: it removes the first row of weights from `w2`. The result is a smaller matrix of weights, without the biases. (Let's call it that: "weights-without-biases".)
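If it helps, here's that slicing in isolation, with a made-up `w2` just to show what the operation does:

```python
import numpy as np

# a tiny stand-in for w2: the first row holds the bias weights
w2 = np.array([[0.5, 0.5],     # bias row
               [10.0, 20.0],
               [30.0, 40.0]])

# w2[1:] drops the first row, leaving the "weights-without-biases"
print(w2[1:])
# [[10. 20.]
#  [30. 40.]]
```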

The second step is multiplying the errors by the weights-without-biases. You might wonder why we have to transpose that weight matrix. As I wrote in the previous section of the book, matrix dimensions are hard to explain on paper. Your best option might be to experiment with that code in a command line or a debugger, and see what the matrix dimensions are… But maybe I can make it easier for you by giving you a very simplified example of matrix multiplication. Consider what happens in the neural network during forward propagation. Here is the line that computes `y_hat`:

```
y_hat = softmax(np.matmul(prepend_bias(h), w2))
```

Let's reason about the matrix dimensions involved. First, forget about `prepend_bias`, because we already discussed how to deal with the biases above. Also, consider that `y_hat` has the same dimensions as `(y_hat - Y)`. And finally, consider that the softmax doesn't change the dimensions of the matrix. So, to have a simplified example, let's focus on a formula like:

```
y = np.matmul(h, w)
```

I renamed `y_hat` to `y` and `w2` to `w` to make it as simple as possible. Now let's simulate that operation with a couple of tiny matrices in a Python interpreter:

```
>>> import numpy as np
>>> h = np.array([[1, 2], [3, 4]])
>>> h
array([[1, 2],
       [3, 4]])
>>> w = np.array([[10], [20]])
>>> w
array([[10],
       [20]])
>>> y = np.matmul(h, w)
>>> y
array([[ 50],
       [110]])
```

That's the calculation that happens during forward propagation. Note the dimensions involved:

```
>>> y.shape
(2, 1)
>>> h.shape
(2, 2)
>>> w.shape
(2, 1)
```

Now, consider backward propagation. We need to multiply `y` (or rather, something with the same dimensions as `y`) by `w`. We cannot do that by just multiplying them:

```
>>> np.matmul(y, w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
```

Instead, because of the way matrix multiplication is defined, we need to transpose `w` so that the dimensions match, and each prediction gets matched with the "right" weight:

```
>>> np.matmul(y, w.T)
array([[ 500, 1000],
       [1100, 2200]])
```
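One last sanity check, in case it helps your intuition: the result of that transposed multiplication has the same shape as `h`, which is exactly what we need to keep propagating the errors backwards through the previous layer. Here is the same toy example again as a plain script:

```python
import numpy as np

# the same tiny matrices as in the interpreter session above
h = np.array([[1, 2], [3, 4]])
w = np.array([[10], [20]])
y = np.matmul(h, w)  # forward: (2, 2) times (2, 1) gives (2, 1)

# backward: multiplying by the transposed weights brings us back to h's shape
back = np.matmul(y, w.T)
print(back.shape)  # (2, 2)
print(h.shape)     # (2, 2)
```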

Does that explanation help your intuition?