I’m following the tutorial over at https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ and so far everything makes sense to me. I am now trying to reason about how these formulas extend to any number of hidden layers, specifically how to proceed when you have the partial derivative of the total error with respect to a weight. In the post above, ∂E_total/∂w1 is just the sum of the partials of E_o1 and E_o2 with respect to w1. Let’s say that there is an extra layer between the h nodes and the o nodes, with nodes j1 and j2. Is ∂E_total/∂w1 = ∂E_j1/∂w1 + ∂E_j2/∂w1, or is it more complicated? Something like: ∂E_total/∂w1 = ∂E_o1/∂w1 + ∂E_o2/∂w1 = (∂E_j1/∂w1 + ∂E_j2/∂w1) + (∂E_j1/∂w1 + ∂E_j2/∂w1)
The short answer is yes: ∂E_total/∂w1 = ∂E_j1/∂w1 + ∂E_j2/∂w1.
The long answer. The key formula is the chain rule, as D.W. mentioned:

∂E_total/∂w1 = ∂E_total/∂out_h1 · ∂out_h1/∂net_h1 · ∂net_h1/∂w1

What’s good about it is that all three factors on the right-hand side are local information: no matter what the next or previous layers look like, the same rule applies at every node.
Hence, the calculation at the current node depends only on forward messages from its direct neighbors to the left and backward messages from its direct neighbors to the right:

∂E_total/∂w1 = ∂E_total/∂out_h1 · out_h1(1 − out_h1) · i1

(using the logistic activation from the post, whose derivative is out_h1(1 − out_h1)). Here out_h1 and i1 are known from the forward pass, and ∂E_total/∂out_h1 is the total backward message.
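As a minimal sketch of this local computation (assuming the logistic activation from the post; the numeric values of i1, w1, and the backward message below are made up for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forward pass at h1 (single input, no bias, for brevity).
i1, w1 = 0.05, 0.15
net_h1 = w1 * i1
out_h1 = sigmoid(net_h1)

# Suppose the total backward message arriving at h1 is:
dE_dout_h1 = 0.7   # sum of messages from the right-hand neighbors

# Local chain rule: all three factors are available at the node.
dout_dnet = out_h1 * (1.0 - out_h1)   # derivative of the logistic function
dnet_dw1 = i1                          # ∂net_h1/∂w1
dE_dw1 = dE_dout_h1 * dout_dnet * dnet_dw1
```

Nothing here depends on how many layers sit to the left or right; only the incoming message `dE_dout_h1` changes.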
In the architecture from the post, the node h1 has two direct neighbors to the right, o1 and o2, and that explains the sum ∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1. In your example, the neighbors are j1 and j2, so it will be the sum ∂E_total/∂out_h1 = ∂E_j1/∂out_h1 + ∂E_j2/∂out_h1. If h1 has even more connections, all of them will pass a backward message, and the messages will be added up.
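To see the messages being summed, here is a sketch of your extra-layer architecture (one input, h1, then j1/j2, then o1/o2, squared error, logistic activations; all weights and targets are made-up values), checked against a numerical gradient:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

i1 = 0.5
w1 = 0.4                       # i1 -> h1
w_hj = [0.3, -0.2]             # h1 -> j1, h1 -> j2
w_jo = [[0.5, 0.1],            # j1 -> o1, j2 -> o1
        [-0.3, 0.7]]           # j1 -> o2, j2 -> o2
targets = [0.9, 0.1]

def forward(w1):
    out_h = sigmoid(w1 * i1)
    out_j = [sigmoid(w * out_h) for w in w_hj]
    out_o = [sigmoid(sum(w_jo[k][m] * out_j[m] for m in range(2)))
             for k in range(2)]
    E = sum(0.5 * (targets[k] - out_o[k]) ** 2 for k in range(2))
    return out_h, out_j, out_o, E

out_h, out_j, out_o, E = forward(w1)

# Backward messages from the outputs to j1 and j2 ...
dE_dout_j = [sum((out_o[k] - targets[k]) * out_o[k] * (1 - out_o[k]) * w_jo[k][m]
                 for k in range(2))
             for m in range(2)]
# ... then from j1 and j2 to h1: the messages from the right neighbors add up.
dE_dout_h = sum(dE_dout_j[m] * out_j[m] * (1 - out_j[m]) * w_hj[m]
                for m in range(2))
dE_dw1 = dE_dout_h * out_h * (1 - out_h) * i1

# Sanity check against a central-difference numerical gradient.
eps = 1e-6
num = (forward(w1 + eps)[3] - forward(w1 - eps)[3]) / (2 * eps)
assert abs(dE_dw1 - num) < 1e-7
```

The line computing `dE_dout_h` is exactly the sum ∂E_j1/∂out_h1 + ∂E_j2/∂out_h1 from the answer; with more neighbors, the sum simply gains more terms.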