Calculus: What are derivatives and why should we care?
Jacobian Matrix
Motivation As we have seen, computing multivariate derivatives can be quite convoluted. Luckily for us, we can often save a lot of work in machine learning. In this theory page, we will cover some basic patterns of computing higher-dimensional derivatives.
The linear layer One of the most iconic functions in deep learning is the `linear layer', which takes some input \(\boldsymbol{x}\in\mathbb{R}^n\) and forms \(m\) linear combinations (with different factors) of the inputs. This linear layer can be considered a function \(\boldsymbol{f}: \mathbb{R}^n\to \mathbb{R}^m\) such that \(y_i=w_{i1}x_1+\cdots+ w_{in}x_n=\sum_{j=1}^{n}w_{ij}x_j\), where we still write \(\boldsymbol{y}=\boldsymbol{f}(\boldsymbol{x})\). We call the \(\{w_{ij}\}_{j=1}^{n}\) the weights of the function. Notice that for the entire function \(\boldsymbol{f}\) we have \(m\) such sets of weights, i.e., \(m\times n\) weights in total. Please verify that we can write this function more compactly as \(\boldsymbol{y}=\boldsymbol{W}\boldsymbol{x}\), where \[\boldsymbol{W}= \matrix{w_{11} & w_{12} & \ldots & w_{1n}\\ w_{21} & w_{22} & \ldots & w_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ w_{m1} & w_{m2} & \ldots & w_{mn}}\] When we now imagine all the streams of influence between \(\boldsymbol{y}\) and \(\boldsymbol{x}\), we realize that each element \(y_i\) depends on each variable \(x_j\), since every \(x_j\) contributes to every output. If this is unclear, it is recommended to draw out the graphical representation of this function. A consequence of this is that we have a lot of derivatives: for each of the \(m\) outputs \(y_i\) we have \(n\) different derivatives (one for each of the \(n\) inputs).
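To make the two notations concrete, here is a minimal sketch (not part of the original derivation), assuming NumPy is available; the names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4                      # m outputs, n inputs
W = rng.standard_normal((m, n))  # weight matrix of shape (m, n)
x = rng.standard_normal(n)       # input vector of shape (n,)

# Elementwise definition: y_i = sum_j w_ij * x_j
y_elementwise = np.array([sum(W[i, j] * x[j] for j in range(n)) for i in range(m)])

# Compact matrix form: y = W x
y_matrix = W @ x

assert np.allclose(y_elementwise, y_matrix)
```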
To make our lives a whole lot easier, we simply determine the derivative of the ith output with respect to the jth input and see if what we end up with generalizes. We hence want to find \[\frac{\partial y_i}{\partial x_j} = \frac{\partial}{\partial x_j}\left(\sum_{k=1}^{n}w_{ik}x_k\right)\text.\] This is not too bad to do, since we know that \[\frac{\partial y_i}{\partial x_j} = \sum_{k=1}^{n}\frac{\partial }{\partial x_j}(w_{ik}x_k)\text.\] Let us now zoom in on one of the terms of the summation, i.e., we only consider \(\dfrac{\partial }{\partial x_j}(w_{ik}x_k)\). In case \(x_k\neq x_j\), we will always have \(\dfrac{\partial }{\partial x_j}(w_{ik}x_k)=0\), because the entire term does not depend on \(x_j\). In case \(x_k= x_j\), however, we see that the partial derivative is given by \(w_{ik}\). Make sure this makes sense: we basically arrive at an `if-else' statement in our derivative, yielding \[\dfrac{\partial }{\partial x_j}(w_{ik}x_k)=\begin{cases} w_{ik} & \text{if } k=j\\ 0 & \text{if } k\neq j\end{cases}\] We can express this `if-else' statement quite easily mathematically using something called the Kronecker delta. The Kronecker delta over two variables \(i\) and \(j\) is equal to \(1\) if \(i\) is equal to \(j\), and equal to \(0\) otherwise, i.e., \[\delta_{ij}=\begin{cases} 1 & \text{if } i=j\\ 0 & \text{otherwise } \end{cases}\] Sometimes this is written with so-called Iverson brackets as \([i=j]\). These brackets do the same thing, i.e., \([S]=1\) if \(S\) is true, else \([S]=0\), for any statement \(S\). You might now ask yourself: how does this help us find the derivative? For this, we need to use one very important property of the Kronecker delta: given that it encodes an if-else statement, it can be used to filter out relevant terms. Essentially, we can drop summations by `summing out' the Kronecker delta. For this, observe that \[\sum_{j} \delta_{ij}x_j=x_i,\] i.e., when summing over elements \(x_j\), we can filter out \(x_i\) by introducing \(\delta_{ij}\). Please verify this carefully, for this will be our main workhorse throughout this theory page.
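As an optional sanity check of the `summing out' property, here is a minimal NumPy sketch (not part of the original text); note that the Kronecker delta over two indices is exactly the identity matrix:

```python
import numpy as np

n = 5
x = np.arange(1.0, n + 1)   # x = [1, 2, 3, 4, 5]
i = 2

# delta[i, j] == 1 if i == j else 0, i.e., the identity matrix
delta = np.eye(n)

# 'Summing out' the delta filters out the i-th element:
# sum_j delta[i, j] * x[j] == x[i]
assert np.isclose(np.sum(delta[i, :] * x), x[i])
```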
If we go back to our example, we see that our derivative is hence given by \(\dfrac{\partial }{\partial x_j}(w_{ik}x_k)=\delta_{jk}w_{ik}\) for any combination of \(x_j\) and \(x_k\). That is, the derivative is equal to \(0\) if \(x_j\) and \(x_k\) are different, and equal to \(w_{ik}\) when \(x_j\) and \(x_k\) are the same. Plugging this back in, we find \[\sum_{k=1}^{n}\frac{\partial }{\partial x_j}(w_{ik}x_k)= \sum_{k=1}^{n}\delta_{jk}w_{ik}\text.\] This we know how to evaluate using our workhorse, and hence we see that \[\begin{aligned}\frac{\partial y_i}{\partial x_j}&=\sum_{k=1}^{n}\delta_{jk}w_{ik}\\[0.25cm] &=w_{ij}\end{aligned}\]
Neat! We just found a general approach to taking the derivative of the linear layer and concluded that the effect of the jth variable on the ith output is given by the weight \(w_{ij}\). This makes sense: the influence of \(x_j\) on output \(y_i\) is exactly given by its `weight', i.e., by how much it contributes to the sum that forms \(y_i\). We can write out the entire Jacobian (where the element in the ith row, jth column is the derivative of \(y_i\) with respect to \(x_j\)) again: \[\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}=\matrix{w_{11} & w_{12} & \ldots & w_{1n}\\ w_{21} & w_{22} & \ldots & w_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ w_{m1} & w_{m2} & \ldots & w_{mn}}\in\mathbb{R}^{m\times n}\] But wait! This matrix we recognize from earlier, namely as our matrix \(\boldsymbol {W}\). This allows us to write \[\frac{\partial\boldsymbol{y}}{\partial\boldsymbol{x}}=\boldsymbol {W}\] So, not only did we find a very efficient way to reduce our problem from calculating \(m\times n\) derivatives to simply finding one general derivative \(\dfrac{\partial y_i}{\partial x_j}\), we also end up with a very clean Jacobian when the entries are combined back together again. The fact that this derivative is so clean is one of the major reasons deep learning is so efficient, as computing derivatives can be done very, very cheaply computationally. We call this approach of finding a single entry of the derivative and then generalizing the `index method', and it will pop up a lot while doing deep learning.
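If you want to convince yourself numerically that the Jacobian of the linear layer is indeed \(\boldsymbol{W}\), the following sketch (a finite-difference check, assuming NumPy; not part of the original text) estimates the Jacobian column by column and compares it with \(\boldsymbol{W}\):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
f = lambda x: W @ x              # the linear layer y = W x

# Estimate the Jacobian dy/dx column by column with central differences
eps = 1e-6
J = np.zeros((m, n))
for j in range(n):
    e_j = np.zeros(n)
    e_j[j] = eps
    J[:, j] = (f(x + e_j) - f(x - e_j)) / (2 * eps)

# The numerical Jacobian matches W up to floating-point error
assert np.allclose(J, W, atol=1e-5)
```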
Matrix cookbook Please note that we just found the derivative \(\dfrac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{W}\boldsymbol{x})=\boldsymbol{W}\) for arbitrary matrices and vectors. That is, we also now know that \[\dfrac{\partial}{\partial \boldsymbol{v}}\bigl((\boldsymbol{A}\boldsymbol{B}+\boldsymbol{C})\boldsymbol{v}\bigr)=\boldsymbol{A}\boldsymbol{B}+\boldsymbol{C},\] by simply remembering that \(\boldsymbol{A}\boldsymbol{B}+\boldsymbol{C}=\boldsymbol{W'}\) for some matrix \(\boldsymbol{W'}\). This is another common pattern: we derive these identities and use and combine them to find more complex derivatives. One common resource for these identities is the Matrix Cookbook, which can be found online.
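The same kind of finite-difference check (again only an illustrative sketch, assuming NumPy; the shapes below are arbitrary) confirms the collapsed identity above:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 4))
C = rng.standard_normal((3, 4))
v = rng.standard_normal(4)

W_prime = A @ B + C              # collapse A B + C into a single matrix W'
f = lambda v: W_prime @ v

# Central-difference Jacobian of f at v, one basis vector at a time
eps = 1e-6
J = np.array([(f(v + eps * e) - f(v - eps * e)) / (2 * eps) for e in np.eye(4)]).T

assert np.allclose(J, W_prime, atol=1e-5)
```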
Another very common expression that you will encounter is \(\dfrac{\dd}{\dd \boldsymbol{x}}(\boldsymbol{a}^{\top}\boldsymbol{x})\), where \(\boldsymbol{a}\) is a constant vector of the same shape as \(\boldsymbol{x}\). In this case, we have that \(\boldsymbol{a}^{\top}\boldsymbol{x}\) is a scalar, and hence our gradient and Jacobian matrix will be of shape \(n\times 1\) and \(1\times n\), respectively, if \(\boldsymbol{x}\) is \(n\)-dimensional. Let us again use the index method, and aim to determine \[\begin{aligned}\dfrac{\dd}{\dd x_j}(\boldsymbol{a}^{\top}\boldsymbol{x})&= \dfrac{\dd}{\dd x_j}\left(\sum_{k=1}^{n}a_kx_k\right) \\[0.25cm] &= \sum_{k=1}^{n}\dfrac{\dd}{\dd x_j}( a_kx_k)\\[0.25cm] &=\sum_{k=1}^{n}\delta_{jk}a_k\\[0.25cm] &=a_j\end{aligned}\] This means that the jth element of the gradient is given by
\(a_j\), or stated in another way, the gradient is given by \(\boldsymbol{a}\). Thus \[\text{The gradient: }\dfrac{\dd }{\dd \boldsymbol{x}}(\boldsymbol{a}^{\top}\boldsymbol{x}) = \boldsymbol {a},\] where the left part indeed denotes the gradient of \(\boldsymbol{a}^{\top}\boldsymbol{x}\).
However, the derivative is the Jacobian matrix, which here has a single row and is the transpose of the gradient. So the Jacobian matrix has only one row, equal to \(\boldsymbol{a}^{\top}\). This is slightly annoying, but luckily our answer is always either correct or only needs to be transposed, and checking it will become second nature soon enough! Let this inconvenience not distract us from the fact that we did just find our new identity, that is: \[\text{The derivative: }\frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{a}^{\top}\boldsymbol{x})=\boldsymbol{a}^{\top}\text.\]
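Here is a short numerical sketch (assuming NumPy; the names are illustrative and not from the original text) that checks that the gradient of \(\boldsymbol{a}^{\top}\boldsymbol{x}\) is \(\boldsymbol{a}\), with the one-row Jacobian being its transpose:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
a = rng.standard_normal(n)
x = rng.standard_normal(n)
f = lambda x: a @ x              # the scalar a^T x

# Central-difference gradient: one entry per input dimension
eps = 1e-6
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

# The gradient equals a; the 1-by-n Jacobian (derivative) is its transpose
assert np.allclose(grad, a, atol=1e-5)
```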
Try to verify that the derivative \(\dfrac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\boldsymbol{x})=\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\), where \(\boldsymbol{x}\in\mathbb{R}^n\), \(\boldsymbol{A}\in\mathbb{R}^{m\times n}\), and \(\boldsymbol{y}\in\mathbb{R}^m\).
Please do this by using
- index notation, and
- the identities we have already found.
We know that \(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\boldsymbol{x}\) is a scalar (why?), and hence the Jacobian will again be of shape \(1\times n\) if \(\boldsymbol{x}\) is an \(n\)-dimensional vector. Using index notation, we aim to find \(\dfrac{\partial}{\partial x_i}(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\boldsymbol{x})\). We observe that \[\begin{aligned}\dfrac{\partial}{\partial x_i}(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\boldsymbol{x})&=\dfrac{\partial}{\partial x_i}\left(\sum_{k=1}^{m}\sum_{j=1}^{n}y_{k}A_{kj}x_{j}\right)\\[0.25cm] &=\sum_{k=1}^{m}\sum_{j=1}^{n}\dfrac{\partial}{\partial x_i}\left(y_{k}A_{kj}x_{j}\right)\\[0.25cm] \end{aligned}\] Again, because the \(y_kA_{kj}\) are just scalars, we know that the derivative is simply found by \[\dfrac{\partial}{\partial x_i}\left(y_{k}A_{kj}x_{j}\right)=\delta_{ij}y_{k}A_{kj}\] This gives us the following derivative: \[\begin{aligned}\dfrac{\partial}{\partial x_i}(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\boldsymbol{x})&=\sum_{k=1}^{m}\sum_{j=1}^{n}\dfrac{\partial}{\partial x_i}\left(y_{k}A_{kj}x_{j}\right)\\[0.25cm] &=\sum_{k=1}^{m}\sum_{j=1}^{n}\delta_{ij}y_{k}A_{kj}\\[0.25cm] &=\sum_{k=1}^{m}y_{k}A_{ki}\end{aligned}\] But this term we recognize as \(\left(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\right)_i\). This means that the derivative \(\dfrac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\boldsymbol{x})\) is simply given by \(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\), which aligns with our desired shape; so we are done with part 1 of the exercise.
So\(\ldots\) Using index notation is quite a lot of work. And actually, we could have done far less work using a previously found identity, namely \(\frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{a}^{\top}\!\boldsymbol{x})=\boldsymbol{a}^{\top}\) for a constant vector \(\boldsymbol{a}\) of the same shape as \(\boldsymbol{x}\). Observe that \(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\) is just a row vector, that is, it can be written as \(\boldsymbol{v}^{\top}=\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\) for some vector \(\boldsymbol{v}\). Thus we can write \(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\boldsymbol{x}=\boldsymbol{v}^{\top}\!\boldsymbol{x}\). But this we know how to differentiate with our tricks, that is, \(\frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{v}^{\top}\!\boldsymbol{x})=\boldsymbol{v}^{\top}\), and hence we know that \(\dfrac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\boldsymbol{x})=\boldsymbol{y}^{\top}\!\!\boldsymbol{A}\). This second method is faster and less error-prone.
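To close the exercise, here is one more optional finite-difference sketch (assuming NumPy; not part of the original text) confirming that both derivations give \(\boldsymbol{y}^{\top}\!\boldsymbol{A}\):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 4
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
x = rng.standard_normal(n)
f = lambda x: y @ A @ x          # the scalar y^T A x

# Central-difference derivative with respect to x (conceptually a 1-by-n row)
eps = 1e-6
deriv = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

# Matches the closed form: d/dx (y^T A x) = y^T A
assert np.allclose(deriv, y @ A, atol=1e-5)
```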
This should cover the basics of vector calculus. If you understand these basics, you are well on your way to doing machine learning soon enough!
Summary In this theory page, you have seen the Kronecker delta and a general strategy for finding multivariate derivatives.