
Commit 14cd32a

lecture 3
1 parent a43e891 commit 14cd32a

5 files changed: +473 −61 lines

code/lec2-space-stretching.ipynb

+363 −18
Large diffs are not rendered by default.

code/lec2-spiral-classification.ipynb

+83 −31
Large diffs are not rendered by default.

lecture2.md

+1 −2
@@ -69,7 +69,7 @@ class: middle, center, black-slide

<iframe width="600" height="450" src="https://www.youtube.com/embed/cNxadbrN_aI" frameborder="0" allowfullscreen></iframe>

-The machine could classify simple images.
+The machine could learn to classify simple images.

---

@@ -767,7 +767,6 @@ $$\nabla \mathcal{\ell}(\theta) =
$$
i.e., a vector that gathers the partial derivatives of the loss for each model parameter $\theta\_k$ for $k=0, \ldots, K-1$.

-
These derivatives can be evaluated automatically from the *computational graph* of $\ell$ using **automatic differentiation**.

---
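The gradient described in this hunk is exactly what a `backward()` call produces in PyTorch; a minimal sketch (not part of the commit, with toy data and shapes assumed purely for illustration):

```python
import torch

theta = torch.randn(3, requires_grad=True)   # model parameters theta_k, k = 0, ..., K-1
x, y = torch.randn(10, 3), torch.randn(10)   # toy inputs and targets (illustrative only)

loss = ((x @ theta - y) ** 2).mean()         # scalar loss, recorded as a computational graph
loss.backward()                              # automatic differentiation through that graph

print(theta.grad)                            # vector of partial derivatives of the loss w.r.t. each theta_k
```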

lecture3.md

+26 −10
@@ -201,7 +201,7 @@ class: middle

Let us assume a function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ that decomposes as a chain composition
$$\mathbf{f} = \mathbf{f}\_t \circ \mathbf{f}\_{t-1} \circ \ldots \circ \mathbf{f}\_1,$$
-for functions $\mathbf{f}\_k : \mathbb{R}^{n\_{k-1}} \times \mathbb{R}^{n\_k}$, for $k=1, \ldots, t$.
+for functions $\mathbf{f}\_k : \mathbb{R}^{n\_{k-1}} \to \mathbb{R}^{n\_k}$, for $k=1, \ldots, t$.

---

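A chain composition like the one fixed in this hunk can be evaluated by threading intermediate values forward; a minimal scalar sketch (not part of the commit, with the functions chosen arbitrarily):

```python
import math

fs = [math.sin, math.exp, lambda x: x ** 2]   # f_1, f_2, f_3 (arbitrary scalar examples)

def f(x):
    for f_k in fs:    # x_k = f_k(x_{k-1}), for k = 1, ..., t
        x = f_k(x)
    return x          # x_t = f(x_0)

print(f(0.5))
```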
@@ -445,7 +445,6 @@ $$\frac{\partial \mathbf{x}\_t}{\partial \mathbf{x}\_k} = \underbrace{\frac{\par
The Jacobian $\left[ \frac{\partial \mathbf{x}\_m}{\partial \mathbf{x}\_k} \right]$ is never explicitly built. It is usually simpler, faster, and more memory efficient to compute the VJP directly.
- Most reverse mode AD systems compose VJPs backward to compute $\frac{\partial \mathbf{x}\_t}{\partial \mathbf{x}\_1}$.

-
---

class: middle
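The point made in this hunk, that the VJP is obtained without ever materializing the Jacobian, is what `torch.autograd.grad` does when given `grad_outputs`; a minimal sketch (not part of the commit, shapes chosen arbitrarily):

```python
import torch

x = torch.randn(5, requires_grad=True)
W = torch.randn(5, 3)
y = torch.tanh(x) @ W                                 # an intermediate mapping from R^5 to R^3

v = torch.randn(3)                                    # the vector in the vector-Jacobian product
vjp = torch.autograd.grad(y, x, grad_outputs=v)[0]    # v^T (dy/dx), shape (5,); no 3x5 Jacobian is built

print(vjp.shape)
```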
@@ -460,6 +459,31 @@ is usually implemented in terms of **Jacobian-vector products** (JVP) locally de

class: middle

+## Checkpointing
+
+Checkpointing consists in marking intermediate variables for which the forward values are not stored in memory. These are recomputed during the backward pass, which can save memory at the cost of recomputation.
+
+```python
+class MLP(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.layer1 = nn.Linear(100, 200)
+        self.relu = nn.ReLU()
+        self.layer2 = nn.Linear(200, 10)
+
+    def forward(self, x):
+        x = checkpoint(self.layer1, x)  # x is not stored;
+                                        # it will be recomputed
+                                        # during the backward pass
+        x = self.relu(x)
+        x = self.layer2(x)
+        return x
+```
+
+---
+
+class: middle
+
## Higher-order derivatives

```python
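The checkpointing snippet added above assumes the usual PyTorch imports; a minimal, self-contained usage sketch (not part of the commit, layer shapes copied from the added `MLP`):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint   # the `checkpoint` function used in the snippet above

layer1 = nn.Linear(100, 200)                     # same shapes as in the added MLP
rest = nn.Sequential(nn.ReLU(), nn.Linear(200, 10))

x = torch.randn(32, 100, requires_grad=True)     # requires_grad so gradients flow through the checkpoint
h = checkpoint(layer1, x)                        # layer1's output is not kept in memory...
loss = rest(h).sum()
loss.backward()                                  # ...it is recomputed here, during the backward pass

print(layer1.weight.grad.shape)
```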
@@ -518,14 +542,6 @@ Optimizing a wing (Sam Greydanus, 2020)

---

-class: middle, center
-
-.width-75[![](figures/lec3/tweet.png)]
-
-... and plenty of other applications! (See this [thread](https://twitter.com/glouppe/status/1361941266901131265))
-
----
-
# Summary

- Automatic differentiation is one of the keys that enabled the deep learning revolution.

pdf/lec3.pdf (724 KB)

Binary file not shown.
