
Commit 272ad5d

Lecture 12
1 parent 0dd57eb commit 272ad5d

10 files changed: +418 -88 lines changed

closing.md

+7-1
@@ -30,6 +30,12 @@ The models covered in this course have broad applications in artificial intellig
 
 class: middle
 
+The field of deep learning is evolving rapidly. What you have learned in this course is just the beginning!
+
+---
+
+class: middle
+
 ## Exam
 
 - 1 question on the fundamentals of deep learning (lectures 1 to 4)
@@ -73,7 +79,7 @@ class: black-slide, middle
 - Deep Learning is more than feedforward networks.
 - It is a .bold[methodology]:
   - assemble networks of parameterized functional blocks
-  - train them from examples using some form of gradient-based optimisation.
+  - train them from data using some form of gradient-based optimisation.
 - Bricks are simple, but their nested composition can be arbitrarily complicated.
 - Think like an architect: make cathedrals!
 ]

code/lec11-vae.ipynb

+48-61
Large diffs are not rendered by default.

course-syllabus.md

+3
@@ -12,6 +12,9 @@ Prof. Gilles Louppe<br>
 
 R: paper https://t.co/wVg6xUmt7d
 
+Q&A: ask me anything (during the course, on any topic, questions collected on a platform)
+Give examples more related to engineering and science. Focus on people rather than technology.
+
 ---
 
 # Us

figures/lec11/embedding0.png

37.1 KB

figures/lec12/ald.gif

5.73 MB

figures/lec12/assimilation.svg

+255

figures/lec12/sda-qg.png

1.09 MB

lecture11.md

+11-1
@@ -148,6 +148,15 @@ count: false
 
 class: middle
 
+.center.width-90[![](figures/lec11/embedding0.png)]
+
+.footnote[Credits: Francois Fleuret, [Deep Learning](https://fleuret.org/dlc/), UNIGE/EPFL.]
+
+---
+
+class: middle
+count: false
+
 .center.width-90[![](figures/lec11/embedding1.png)]
 
 .footnote[Credits: Francois Fleuret, [Deep Learning](https://fleuret.org/dlc/), UNIGE/EPFL.]
@@ -558,7 +567,8 @@ Unbiased gradients of the ELBO with respect to the generative model parameters $
 $$\begin{aligned}
 \nabla\_\theta \text{ELBO}(\mathbf{x};\theta,\phi) &= \nabla\_\theta \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p\_\theta(\mathbf{x},\mathbf{z}) - \log q\_\phi(\mathbf{z}|\mathbf{x})\right] \\\\
 &= \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla\_\theta ( \log p\_\theta(\mathbf{x},\mathbf{z}) - \log q\_\phi(\mathbf{z}|\mathbf{x}) ) \right] \\\\
-&= \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla\_\theta \log p\_\theta(\mathbf{x},\mathbf{z}) \right],
+&= \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla\_\theta \log p\_\theta(\mathbf{x},\mathbf{z}) \right] \\\\
+&= \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla\_\theta \log p\_\theta(\mathbf{x} | \mathbf{z}) \right],
 \end{aligned}$$
 which can be estimated with Monte Carlo integration.
 
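For reference, a minimal PyTorch sketch (not part of this commit) of the Monte Carlo estimator that the corrected derivation above refers to; the `encoder`, `decoder` and dimensions are hypothetical placeholders, assuming binary data and a Gaussian q_phi(z|x):

```python
# A minimal sketch of estimating E_q[grad_theta log p_theta(x|z)] with Monte Carlo:
# sample z ~ q_phi(z|x), then differentiate log p_theta(x|z) with respect to theta only.
import torch
import torch.nn as nn

x_dim, z_dim = 784, 16
decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))      # logits of p_theta(x|z)
encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 2 * z_dim))  # mean, log-variance of q_phi(z|x)

def grad_theta_elbo_estimate(x, n_samples=1):
    """Accumulate a Monte Carlo gradient estimate in the decoder parameters."""
    mu, log_var = encoder(x).chunk(2, dim=-1)
    # Sampling z need not be differentiable here: theta does not appear in q_phi.
    z = (mu + log_var.mul(0.5).exp() * torch.randn(n_samples, *mu.shape)).detach()
    logits = decoder(z)
    log_px_given_z = -nn.functional.binary_cross_entropy_with_logits(
        logits, x.expand_as(logits), reduction="none"
    ).sum(-1)                      # log p_theta(x|z) for binary data
    loss = -log_px_given_z.mean()  # ascend the ELBO = descend its negation
    loss.backward()
```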

lecture12.md

+94-25
@@ -8,17 +8,6 @@ Lecture 12: Diffusion models
 Prof. Gilles Louppe<br>
 
 
-???
-
-Good references:
-- https://arxiv.org/pdf/2208.11970.pdf
-- https://cvpr2022-tutorial-diffusion-models.github.io/
-- Understanding Deep Learning book
-- Continuous : infinite noise levels https://www.youtube.com/watch?v=wMmqCMwuM2Q (build some intuition first)
-
-- Rewrite to better match the sidenotes
-- Give more intuition about the score function and about the annealing schedule
-
 ---
 
 # Today
@@ -113,6 +102,16 @@ class: middle
 
 class: middle
 
+## Data assimilation in ocean models
+
+.center.width-65[![](./figures/lec12/sda-qg.png)]
+
+.footnote[Credits: [Rozet and Louppe](https://arxiv.org/pdf/2306.10574.pdf), 2023.]
+
+---
+
+class: middle
+
 # VAEs
 
 A short recap.
@@ -141,27 +140,23 @@ $$\begin{aligned}
 &= \arg \max\_{\theta,\phi} \mathbb{E}\_{p(\mathbf{x})} \left[ \mathbb{E}\_{q\_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p\_\theta(\mathbf{x}|\mathbf{z})\right] - \text{KL}(q\_\phi(\mathbf{z}|\mathbf{x}) || p(\mathbf{z})) \right].
 \end{aligned}$$
 
+.alert[Issue: The prior matching term limits the expressivity of the model.]
+
 ---
 
-class: middle
+class: middle, black-slide, center
+count: false
 
-The prior matching term limits the expressivity of the model.
+Solution: Make $p(\mathbf{z})$ a learnable distribution.
 
-Solution: Make $p(\mathbf{z})$ a learnable distribution.
+.width-80[![](figures/lec12/deeper.jpg)]
 
 ???
 
 Explain the maths on the black board, taking the expectation wrt $p(\mathbf{x})$ of the ELBO and consider the expected KL terms.
 
 ---
 
-class: middle, black-slide, center
-count: false
-
-.width-80[![](figures/lec12/deeper.jpg)]
-
----
-
 class: middle
 
 ## (Markovian) Hierarchical VAEs
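As a reminder of what the prior matching term looks like in practice, a minimal sketch (not from the course notebooks) of the negative ELBO for a Gaussian q_phi(z|x), a fixed standard normal prior p(z) and a Bernoulli decoder; `encoder` and `decoder` are hypothetical modules:

```python
# Reconstruction term minus the prior matching (KL) term, negated for minimization.
import torch
import torch.nn.functional as F

def negative_elbo(x, encoder, decoder):
    mu, log_var = encoder(x).chunk(2, dim=-1)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)         # reparameterized z ~ q_phi(z|x)
    reconstruction = -F.binary_cross_entropy_with_logits(
        decoder(z), x, reduction="none").sum(-1)                   # one-sample estimate of E_q[log p_theta(x|z)]
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)   # KL(q_phi(z|x) || N(0, I)), closed form
    return (kl - reconstruction).mean()
```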
@@ -262,6 +257,12 @@ class: middle
 
 .center.width-100[![](figures/lec12/diffusion-kernel-1.png)]
 
+.center[
+
+Diffusion kernel $q(\mathbf{x}\_t | \mathbf{x}\_{0})$ for different noise levels $t$.
+
+]
+
 .footnote[Credits: [Simon J.D. Prince](https://udlbook.github.io/udlbook/), 2023.]
 
 ---
@@ -270,6 +271,12 @@ class: middle
 
 .center.width-100[![](figures/lec12/diffusion-kernel-2.png)]
 
+.center[
+
+Marginal distribution $q(\mathbf{x}\_t)$.
+
+]
+
 .footnote[Credits: [Simon J.D. Prince](https://udlbook.github.io/udlbook/), 2023.]
 
 ---
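The two captions added above refer to the perturbation kernel q(x_t | x_0) and the marginal q(x_t). A small NumPy sketch, assuming the standard variance-preserving Gaussian kernel and an illustrative linear beta schedule (neither is specified in this diff):

```python
# Assumes q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I); schedule values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)               # alpha_bar_t = prod_s (1 - beta_s)

def sample_kernel(x0, t):
    """Draw x_t ~ q(x_t | x_0) for a batch of clean samples x0."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * rng.standard_normal(x0.shape)

# Marginal q(x_t): first sample x_0 ~ q(x_0) (here, a toy 1D mixture), then apply the kernel.
x0 = np.concatenate([rng.normal(-2.0, 0.3, 5000), rng.normal(1.5, 0.5, 5000)])
for t in (0, 250, 999):
    xt = sample_kernel(x0, t)
    print(f"t={t:4d}  mean={xt.mean():+.2f}  std={xt.std():.2f}")  # drifts toward N(0, 1) as t grows
```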
@@ -416,6 +423,8 @@ $$\begin{aligned}
 
 class: middle
 
+In summary, training and sampling eventually boil down to:
+
 .center.width-100[![](figures/lec12/algorithms.png)]
 
 ???
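For concreteness, a hedged PyTorch sketch of the usual DDPM-style training and sampling loops that `figures/lec12/algorithms.png` summarizes; `eps_model` and the linear beta schedule are placeholder assumptions, not the repository's code:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def training_step(eps_model, x0):
    """One stochastic estimate of E_{t, eps} ||eps - eps_theta(x_t, t)||^2."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # x_t ~ q(x_t | x_0)
    return ((eps - eps_model(xt, t)) ** 2).mean()        # backpropagate this loss

@torch.no_grad()
def sample(eps_model, shape):
    """Ancestral sampling: start from x_T ~ N(0, I) and denoise step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = eps_model(x, torch.full((shape[0],), t))
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```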
@@ -428,7 +437,7 @@ class: middle
 
 ## Network architectures
 
-Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent $\hat{\mathbf{x}}\_\theta(\mathbf{x}\_t, t)$ or $\epsilon\_\theta(\mathbf{x}\_t, t)$.
+Diffusion models often use U-Net architectures (at least for image data) with ResNet blocks and self-attention layers to represent $\hat{\mathbf{x}}\_\theta(\mathbf{x}\_t, t)$ or $\epsilon\_\theta(\mathbf{x}\_t, t)$.
 
 <br>
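To make the (x_t, t) interface concrete, a tiny stand-in for such a network (an MLP with sinusoidal time embeddings, an illustrative assumption rather than the actual U-Net):

```python
import math
import torch
import torch.nn as nn

class TinyNoiseModel(nn.Module):
    """Minimal eps_theta(x_t, t) with the same call signature as the real architecture."""
    def __init__(self, x_dim=2, t_dim=32, hidden=128):
        super().__init__()
        self.t_dim = t_dim
        self.net = nn.Sequential(
            nn.Linear(x_dim + t_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def time_embedding(self, t):
        """Standard sinusoidal embedding of the (integer) diffusion step t."""
        half = self.t_dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        angles = t.float()[:, None] * freqs[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    def forward(self, xt, t):
        return self.net(torch.cat([xt, self.time_embedding(t)], dim=-1))
```

A module like this could play the role of `eps_model` in the training/sampling sketch above.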

@@ -446,11 +455,71 @@ class: middle
 
 class: middle
 
-The .bold[score function] $\nabla\_{\mathbf{x}\_0} \log q(\mathbf{x}\_0)$ is a vector field that points in the direction of the highest density of the data distribution $q(\mathbf{x}\_0)$.
+## Score-based models
+
+Maximum likelihood estimation for energy-based probabilistic models $$p\_{\theta}(\mathbf{x}) = \frac{1}{Z\_{\theta}} \exp(-f\_{\theta}(\mathbf{x}))$$ can be intractable when the partition function $Z\_{\theta}$ is unknown.
+We can sidestep this issue with a score-based model $$s\_\theta(\mathbf{x}) \approx \nabla\_{\mathbf{x}} \log p(\mathbf{x})$$ that approximates the (Stein) .bold[score function] of the data distribution. If we parameterize the score-based model with an energy-based model, then we have $$s\_\theta(\mathbf{x}) = \nabla\_{\mathbf{x}} \log p\_{\theta}(\mathbf{x}) = -\nabla\_{\mathbf{x}} f\_{\theta}(\mathbf{x}) - \nabla\_{\mathbf{x}} \log Z\_{\theta} = -\nabla\_{\mathbf{x}} f\_{\theta}(\mathbf{x}),$$
+which discards the intractable partition function and expands the family of models that can be used.
+
+---
+
+class: middle
+
+The score function points in the direction of the highest density of the data distribution.
+It can be used to find modes of the data distribution or to generate samples by .bold[Langevin dynamics] by iterating the following sampling rule
+$$\mathbf{x}\_{i+1} = \mathbf{x}\_i + \epsilon \nabla\_{\mathbf{x}\_i} \log p(\mathbf{x}\_i) + \sqrt{2\epsilon} \mathbf{z}\_i,$$
+where $\epsilon$ is the step size and $\mathbf{z}\_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. When $\epsilon$ is small, Langevin dynamics will converge to the data distribution $p(\mathbf{x})$.
+
+.center.width-30[![](figures/lec12/langevin.gif)]
+
+.footnote[Credits: [Song](https://yang-song.net/blog/2021/score/), 2021.]
+
+---
+
+class: middle
+
+Similarly to likelihood-based models, score-based models can be trained by minimizing the .bold[Fisher divergence] between the data distribution $p(\mathbf{x})$ and the model distribution $p\_\theta(\mathbf{x})$ as
+$$\mathbb{E}\_{p(\mathbf{x})} \left[ || \nabla\_{\mathbf{x}} \log p(\mathbf{x}) - s\_\theta(\mathbf{x}) ||\_2^2 \right].$$
+
+---
+
+class: middle
+
+Unfortunately, the explicit score matching objective leads to inaccurate estimates in low-density regions, where few data points are available to constrain the score.
+
+Since initial sample points are likely to be in low-density regions in high-dimensional spaces, the inaccurate score-based model will derail the Langevin dynamics and lead to poor sample quality.
+
+.center.width-100[![](figures/lec12/pitfalls.jpg)]
+
+.footnote[Credits: [Song](https://yang-song.net/blog/2021/score/), 2021.]
+
+---
+
+class: middle
+
+To address this issue, .bold[denoising score matching] can be used to train the score-based model to predict the score of increasingly noisified data points.
+
+For each noise level $t$, the score-based model $s\_\theta(\mathbf{x}\_t, t)$ is trained to predict the score of the noisified data point $\mathbf{x}\_t$ as
+$$s\_\theta(\mathbf{x}\_t, t) \approx \nabla\_{\mathbf{x}\_t} \log p\_{t} (\mathbf{x}\_t)$$
+where $p\_{t} (\mathbf{x}\_t)$ is the noise-perturbed data distribution
+$$p\_{t} (\mathbf{x}\_t) = \int p(\mathbf{x}\_0) \mathcal{N}(\mathbf{x}\_t ; \mathbf{x}\_0, \sigma^2\_t \mathbf{I}) d\mathbf{x}\_0$$
+and $\sigma^2\_t$ is an increasing sequence of noise levels.
+
+---
+
+class: middle
+
+The training objective for $s\_\theta(\mathbf{x}\_t, t)$ is then a weighted sum of Fisher divergences for all noise levels $t$,
+$$\sum\_{t=1}^T \lambda(t) \mathbb{E}\_{p\_{t}(\mathbf{x}\_t)} \left[ || \nabla\_{\mathbf{x}\_t} \log p\_{t}(\mathbf{x}\_t) - s\_\theta(\mathbf{x}\_t, t) ||\_2^2 \right]$$
+where $\lambda(t)$ is a weighting function that increases with $t$ to give more importance to the noisier samples.
+
+---
+
+class: middle
 
-It can be used to find modes of the data distribution or to generate samples by Langevin dynamics.
+Finally, annealed Langevin dynamics can be used to sample from the score-based model by running Langevin dynamics with decreasing noise levels $t=T, ..., 1$.
 
-.center.width-40[![](figures/lec12/langevin.gif)]
+.center.width-100[![](figures/lec12/ald.gif)]
 
 .footnote[Credits: [Song](https://yang-song.net/blog/2021/score/), 2021.]
 
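A hedged PyTorch sketch (not the repository's code) of the denoising score matching loss and annealed Langevin sampling described in this hunk. It relies on the conditional score of the perturbation kernel, (x_0 - x_t) / sigma_t^2; the noise levels, the weighting lambda(t) = sigma_t^2 and the step sizes are illustrative assumptions rather than values from the slides:

```python
import torch

sigmas = torch.logspace(-2, 1, 10)                             # increasing noise levels sigma_t

def dsm_loss(score_model, x0):
    """Weighted denoising score matching over random noise levels t."""
    t = torch.randint(0, len(sigmas), (x0.shape[0],))
    sigma = sigmas[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = x0 + sigma * torch.randn_like(x0)                     # x_t ~ N(x_0, sigma_t^2 I)
    target = (x0 - xt) / sigma**2                              # score of the perturbation kernel
    return (sigma**2 * (score_model(xt, t) - target) ** 2).flatten(1).sum(-1).mean()

@torch.no_grad()
def annealed_langevin(score_model, shape, n_steps=100, base_eps=2e-5):
    """Langevin dynamics run at decreasing noise levels t = T, ..., 1."""
    x = torch.randn(shape) * sigmas[-1]
    for t in reversed(range(len(sigmas))):
        eps = base_eps * (sigmas[t] / sigmas[0]) ** 2           # common heuristic step size
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + eps * score_model(x, torch.full((shape[0],), t)) + (2 * eps) ** 0.5 * z
    return x
```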
pdf/lec12.pdf

1.32 MB
Binary file not shown.

0 commit comments
