Calculus: What are derivatives and why should we care?
Jacobian Matrix
Motivation As we have seen, computing multivariate derivatives can be quite convoluted. Luckily for us, we can often save a lot of work in machine learning. In this theory page, we will cover some basic patterns of computing higher-dimensional derivatives.
The linear layer One of the most iconic functions in deep learning is the `linear layer', which takes some input and forms linear combinations (with different factors) of the inputs. This linear layer can be considered a function $f: \mathbb{R}^n \to \mathbb{R}^m$ such that $y_i = \sum_{j=1}^{n} W_{ij} x_j$, where we still write $y = f(x)$. We call the $W_{ij}$ the weights of the function. Notice that for the entire function we have $m$ of such sets of weights, i.e., in total $m \cdot n$ weights. Please verify that we can write this function more compactly as $y = Wx$, where $W \in \mathbb{R}^{m \times n}$ is the matrix with entries $W_{ij}$. When we now imagine all the streams of influence found between the $x_j$ and the $y_i$, we realize that each output element $y_i$ is dependent on each variable $x_j$, since each $x_j$ contributes to each output. If this is unclear, it is recommended to draw out the graphical representation of this function. A consequence of this is that we have a lot of derivatives, namely for each of the $m$ outputs we have $n$ different derivatives (one for each of the $n$ inputs).
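To make this concrete, here is a minimal numerical sketch (assuming numpy; the sizes $m = 3$ and $n = 4$ are arbitrary) that checks the compact form $y = Wx$ against the element-wise sums $y_i = \sum_j W_{ij} x_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))   # the m x n weight matrix
x = rng.normal(size=n)        # the n-dimensional input

# Compact form: a single matrix-vector product.
y = W @ x

# Element-wise form: y_i = sum_j W_ij * x_j.
y_elementwise = np.array([sum(W[i, j] * x[j] for j in range(n)) for i in range(m)])

assert np.allclose(y, y_elementwise)
```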
To make our lives a whole lot easier, we simply determine the derivative of the $i$th output with respect to the $j$th input and see if what we end up with generalizes. We hence want to find $\frac{\partial y_i}{\partial x_j}$. This is not too bad to do, since we know that $y_i = \sum_{k=1}^{n} W_{ik} x_k$. Let us now zoom in on one of the terms of the summation, i.e., we only consider $W_{ik} x_k$. In case $k \neq j$, we will always have $\frac{\partial}{\partial x_j} W_{ik} x_k = 0$, because the entire term does not depend on $x_j$. In case $k = j$, however, we see that the partial derivative is given by $\frac{\partial}{\partial x_j} W_{ij} x_j = W_{ij}$. Make sure this makes sense: we basically arrive at an `if-else' statement in our derivative, yielding $$\frac{\partial}{\partial x_j} W_{ik} x_k = \begin{cases} W_{ij} & \text{if } k = j, \\ 0 & \text{otherwise.} \end{cases}$$ We can express this `if-else' statement quite easily mathematically using something called the Kronecker delta. The Kronecker delta over two variables $i$ and $j$ is equal to $1$ if $i$ is equal to $j$, and equal to $0$ otherwise, i.e., $$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$ Sometimes this is written with so-called Iverson brackets as $[i = j]$. These brackets do the same thing, i.e. $[P] = 1$ if $P$ is true, else $[P] = 0$, for any statement $P$. You might now ask yourself: how does this help us find the derivative? For this, we need to use one very important property of the Kronecker delta: given that it encodes an if-else statement, it can be used to filter out relevant terms. Essentially, we can drop summations by `summing out' the Kronecker delta. For this, observe that $$\sum_{k} \delta_{jk} v_k = v_j,$$ i.e., when summing over elements $v_k$, we can filter out $v_j$ by introducing $\delta_{jk}$. Please verify this carefully, for this will be our main workhorse throughout this theory page.
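The `summing out' property is easy to check numerically. Below is a minimal sketch (assuming numpy; over indices $1, \dots, n$ the Kronecker delta is just the identity matrix, here np.eye(n)) verifying that $\sum_k \delta_{jk} v_k = v_j$:

```python
import numpy as np

n = 5
v = np.arange(1.0, n + 1)   # some vector v_1, ..., v_n
delta = np.eye(n)           # delta[j, k] is the Kronecker delta over j and k

j = 2
summed_out = sum(delta[j, k] * v[k] for k in range(n))

# Summing over k against delta_jk picks out exactly v_j.
assert summed_out == v[j]
```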
If we go back to our example, we see that this is exactly the pattern of the Kronecker delta, hence our derivative is given by $\frac{\partial}{\partial x_j} W_{ik} x_k = \delta_{jk} W_{ik}$ for any combination of $k$ and $j$. That is, the derivative is equal to $0$ if $k$ and $j$ are different, and equal to $W_{ij}$ when $k$ and $j$ are the same. Plugging this back in, we find $$\frac{\partial y_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^{n} W_{ik} x_k = \sum_{k=1}^{n} \delta_{jk} W_{ik}.$$ This we know how to evaluate using our workhorse, and hence we see that $$\frac{\partial y_i}{\partial x_j} = W_{ij}.$$
Neat! We just found a general approach to taking the derivative of the linear layer and concluded that the effect of the $j$th variable on the $i$th output is given by the weight $W_{ij}$. This makes sense: the influence of $x_j$ on output $y_i$ is exactly given by its `weight', i.e., by how much it contributes to the sum that forms $y_i$. We can write out the entire Jacobian (where the element in the $i$th row, $j$th column is the derivative of $y_i$ with respect to $x_j$) again: $$\frac{\partial y}{\partial x} = \begin{pmatrix} W_{11} & \cdots & W_{1n} \\ \vdots & \ddots & \vdots \\ W_{m1} & \cdots & W_{mn} \end{pmatrix}.$$ But wait! This matrix we recognize from earlier, namely as our matrix $W$. This allows us to write $$\frac{\partial y}{\partial x} = W.$$ So, not only did we find a very efficient way to reduce our problem from calculating $m \cdot n$ derivatives to simply finding one general derivative, we also end up with a very clean Jacobian when they are combined back together again. The fact that this derivative is so clean is one of the major reasons deep learning is so efficient, as computing derivatives can be done very, very cheaply computationally. We call this approach of finding a single entry of the derivative and then generalizing the `index method', and it will pop up a lot while doing deep learning.
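As a sanity check on this identity, the sketch below (assuming numpy; the sizes and step size are arbitrary choices) approximates the Jacobian of $x \mapsto Wx$ with central finite differences and compares it to $W$:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the Jacobian of f at x with central differences."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.normal(size=(m, n))
x = rng.normal(size=n)

J = numerical_jacobian(lambda x: W @ x, x)
assert np.allclose(J, W, atol=1e-6)   # the Jacobian of x -> Wx is W itself
```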
Matrix cookbook Please note that since we just found the derivative for arbitrary matrices and vectors, we now also know, for instance, that $\frac{\partial (Ax)}{\partial x} = A$ by simply remembering that $\frac{\partial (Wx)}{\partial x} = W$ for some matrix $W$. This is another common pattern: we derive such identities once and then use and combine them to find more complex derivatives. One common resource for these identities is the Matrix Cookbook, which can be found online.
Another very common expression that you will encounter is $c^\top x$, where $c$ is a constant vector of the same shape as $x$. In this case, we have that $c^\top x$ is a scalar, and hence our gradient and Jacobian matrix will be of shape $n \times 1$ and $1 \times n$, respectively, if $x$ is $n$-dimensional. Let us again use the index method, and aim to determine $$\frac{\partial (c^\top x)}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k} c_k x_k = \sum_{k} \delta_{jk} c_k = c_j.$$ This means that the $j$th element of the gradient is given by $c_j$, or stated in another way, the gradient is given by $c$. Thus $$\nabla_x \, (c^\top x) = c,$$ where the left part indeed denotes the gradient of $c^\top x$.
However, the derivative $\frac{\partial (c^\top x)}{\partial x}$ is the Jacobian matrix with one row, which is the transpose of the gradient. So the Jacobian matrix has only one row, which is equal to $c^\top$. This is slightly annoying, but luckily our answer is always either correct or only needs to be transposed, and checking it will become second nature soon enough! Let this inconvenience not distract us from the fact that we did just find our new identity though, that is: $$\frac{\partial (c^\top x)}{\partial x} = c^\top.$$
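The same kind of numerical check applies here. A minimal sketch (assuming numpy, reusing the central-difference idea from above) confirming that the gradient of $c^\top x$ is $c$, and that the one-row Jacobian is its transpose $c^\top$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
c = rng.normal(size=n)
x = rng.normal(size=n)

eps = 1e-6
grad = np.zeros(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    grad[j] = (c @ (x + e) - c @ (x - e)) / (2 * eps)

assert np.allclose(grad, c, atol=1e-6)        # the gradient of c^T x is c
jacobian = grad.reshape(1, n)                 # the Jacobian is the 1 x n row c^T
assert np.allclose(jacobian, c.reshape(1, n))
```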
Try and verify that the derivative $\frac{\partial (a^\top W x)}{\partial x} = a^\top W$, where $a \in \mathbb{R}^m$, $W \in \mathbb{R}^{m \times n}$, and $x \in \mathbb{R}^n$.
Please do this by using
- index notation, and
- the identities we have already found.
We know that $a^\top W x$ is a scalar (why?), and hence the Jacobian will again be of shape $1 \times n$ if $x$ is an $n$-dimensional vector. Using index notation, we aim to find $\frac{\partial (a^\top W x)}{\partial x_j}$. We observe that $$a^\top W x = \sum_{i} a_i \sum_{k} W_{ik} x_k = \sum_{i} \sum_{k} a_i W_{ik} x_k.$$ Again, because the $a_i$ and $W_{ik}$ are just scalars, we know that the derivative is simply found by $$\frac{\partial}{\partial x_j} \sum_{i} \sum_{k} a_i W_{ik} x_k = \sum_{i} \sum_{k} a_i W_{ik} \delta_{jk}.$$ This gives us the following derivative: $$\frac{\partial (a^\top W x)}{\partial x_j} = \sum_{i} a_i W_{ij}.$$ But this term we recognize as $(a^\top W)_j$. This means that the derivative is simply given by $a^\top W$, which aligns with our desired shape; so we are done with part 1 of the exercise.
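As a quick numerical check of this index-notation result, here is a minimal sketch (assuming numpy; the sizes are arbitrary) showing that the explicit sums $\sum_i a_i W_{ij}$ match the entries of the row vector $a^\top W$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
a = rng.normal(size=m)
W = rng.normal(size=(m, n))

# The index-method answer: entry j of the derivative is sum_i a_i W_ij.
deriv_by_index = np.array([sum(a[i] * W[i, j] for i in range(m)) for j in range(n)])

# The compact answer: the row vector a^T W.
assert np.allclose(deriv_by_index, a @ W)
```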
Using index notation is quite a lot of work, though. Actually, we could have done way less work using a previously found identity, namely $\frac{\partial (c^\top x)}{\partial x} = c^\top$ for a constant vector $c$ of the same shape as $x$. Observe that $a^\top W$ is just a row vector, that is, it can be written as $c^\top$ for some vector $c$. Thus we can write $a^\top W x = c^\top x$. But we know how to differentiate $c^\top x$ with our tricks, that is, $\frac{\partial (c^\top x)}{\partial x} = c^\top$, and hence we know that $\frac{\partial (a^\top W x)}{\partial x} = c^\top = a^\top W$. This second method is faster and less error-prone.
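Both routes can also be verified numerically against a finite-difference approximation; a minimal sketch (assuming numpy, with arbitrary sizes and step size):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 4
a = rng.normal(size=m)
W = rng.normal(size=(m, n))
x = rng.normal(size=n)

f = lambda x: a @ W @ x          # the scalar a^T W x

eps = 1e-6
deriv = np.zeros(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    deriv[j] = (f(x + e) - f(x - e)) / (2 * eps)

# Both the index method and the identity give a^T W as the derivative.
assert np.allclose(deriv, a @ W, atol=1e-6)
```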
This should cover the basics of vector calculus. If you understand these basics, you are well on your way to doing machine learning soon enough!
Summary In this theory page, you have seen the Kronecker delta and a general strategy, the index method, for finding multivariate derivatives.