class: middle, center, title-slide
Lecture 11: Auto-encoders and variational auto-encoders
Prof. Gilles Louppe
[email protected]
Learn a model of the data.
- Auto-encoders
- Variational inference
- Variational auto-encoders
class: middle
.italic["The brain has about
.pull-right[Geoffrey Hinton, 2014.]
class: middle
.grid[
.kol-1-3[.circle.width-95[]]
.kol-2-3[.width-100[
]]
]
.italic["We need tremendous amount of information to build machines that have common sense and generalize."]
.pull-right[Yann LeCun, 2016.]
class: middle
Deep unsupervised learning is about learning a model of the data, explicitly or implicitly, without requiring labels.
- Generative models: recreate the raw data distribution (e.g., the distribution of natural images).
- Self-supervised learning: solve puzzle tasks that require semantic understanding (e.g., predict a missing word in a sequence).
class: middle
A (deep) generative model is a probabilistic model $p_\theta$ that can be used as a simulator of the data.
Formally, a generative model defines a probability distribution $p_\theta(\mathbf{x})$ over the data $\mathbf{x} \in \mathcal{X}$, whose parameters $\theta$ are learned such that the model distribution matches the (unknown) data distribution $p(\mathbf{x})$.
???
This is conceptually identical to what we already did in Lecture 10 when we wanted to learn
class: middle, black-slide
.grid[
.kol-1-2.center[
.width-80[]
]
.kol-1-2.center[
.width-75[]
]
]
.grid[
.kol-1-2.center[
Variational auto-encoders
(Kingma and Welling, 2013)
]
.kol-1-2.center[
Diffusion models
(Midjourney, 2023)
]
]
class: black-slide
background-image: url(./figures/lec11/landscape.png)
background-size: contain
.footnote[Credits: Karsten et al, 2022; Siddharth Mishra-Sharma, 2023.]
class: middle
.grid[
.kol-1-3.center[
Produce samples
]
.kol-1-3.center[
Evaluate densities
]
.kol-1-3.center[
Encode complex priors
]
]
.grid[
.kol-1-3.center[.width-100[]]
.kol-1-3.center[.width-100[
]]
.kol-1-3.center[.width-90[
]]
]
.footnote[Credits: Siddharth Mishra-Sharma, 2023.]
class: middle
count: false
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
count: false
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
count: false
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
An auto-encoder is a composite function made of
- an encoder $f$ from the original space $\mathcal{X}$ to a latent space $\mathcal{Z}$,
- a decoder $g$ to map back to $\mathcal{X}$,

such that $g \circ f$ is close to the identity on the data.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
Let $p(\mathbf{x})$ be the data distribution over $\mathcal{X}$. A good auto-encoder should minimize the reconstruction error
$$\mathbb{E}_{p(\mathbf{x})}\left[ || \mathbf{x} - g(f(\mathbf{x})) ||^2 \right].$$

Given two parameterized mappings $f(\cdot; \theta_f)$ and $g(\cdot; \theta_g)$, training consists of minimizing an empirical estimate of this loss over the training data,
$$\theta_f, \theta_g = \arg \min_{\theta_f, \theta_g} \frac{1}{N} \sum_{n=1}^N || \mathbf{x}_n - g(f(\mathbf{x}_n; \theta_f); \theta_g) ||^2.$$
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
For example, when the auto-encoder is linear,
$$
\begin{aligned}
f: \mathbf{z} &= \mathbf{U}^T \mathbf{x} \\
g: \hat{\mathbf{x}} &= \mathbf{U} \mathbf{z},
\end{aligned}
$$
with $\mathbf{U} \in \mathbb{R}^{p \times d}$, the reconstruction error reduces to $\mathbb{E}_{p(\mathbf{x})}\left[ || \mathbf{x} - \mathbf{U}\mathbf{U}^T \mathbf{x} ||^2 \right]$.
In this case, an optimal solution is given by PCA.
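
As a minimal sketch (assuming NumPy, which these slides do not prescribe, and hypothetical data `X`), the PCA solution can be computed and used as a linear auto-encoder as follows:

```python
# Minimal sketch: linear auto-encoder whose optimal U contains the top-d
# principal directions of the (centered) data. X is hypothetical.
import numpy as np

X = np.random.randn(1000, 10)                          # hypothetical data, p = 10
d = 3                                                  # latent dimension
Xc = X - X.mean(axis=0)                                # center the data
U = np.linalg.svd(Xc, full_matrices=False)[2][:d].T    # p x d matrix of principal directions
Z = Xc @ U                                             # encode: z = U^T x
X_hat = Z @ U.T + X.mean(axis=0)                       # decode: x_hat = U z (plus the mean)
```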
class: middle
Better results can be achieved with more sophisticated classes of mappings than linear projections: use deep neural networks for $f$ and $g$.
For instance,
- by combining a multi-layer perceptron encoder $f : \mathbb{R}^p \to \mathbb{R}^d$ with a multi-layer perceptron decoder $g: \mathbb{R}^d \to \mathbb{R}^p$,
- by combining a convolutional network encoder $f : \mathbb{R}^{w\times h \times c} \to \mathbb{R}^d$ with a decoder $g : \mathbb{R}^d \to \mathbb{R}^{w\times h \times c}$ composed of the reciprocal transposed convolutional layers.
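
As a minimal sketch (assuming PyTorch; the layer sizes are illustrative), the first option could be implemented and trained with the quadratic reconstruction loss as follows:

```python
# Minimal sketch: MLP encoder f and decoder g trained to make g(f(x)) close to x.
import torch
import torch.nn as nn

p, d = 784, 32                                                       # illustrative dimensions
f = nn.Sequential(nn.Linear(p, 256), nn.ReLU(), nn.Linear(256, d))   # encoder f: R^p -> R^d
g = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, p))   # decoder g: R^d -> R^p

x = torch.randn(64, p)                 # a batch of (fake) inputs
loss = ((g(f(x)) - x) ** 2).mean()     # quadratic reconstruction loss
loss.backward()                        # gradients w.r.t. the parameters of f and g
```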
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
To get an intuition of the learned latent representation, we can pick two samples $\mathbf{x}$ and $\mathbf{x}'$ at random and interpolate samples along the line in the latent space, decoding $g\big((1-\alpha) f(\mathbf{x}) + \alpha f(\mathbf{x}')\big)$ for $\alpha \in [0, 1]$.
.center.width-80[]
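
As a minimal sketch (assuming PyTorch and the hypothetical trained encoder `f` and decoder `g` from above, with `x1` and `x2` two inputs of shape `(1, p)`):

```python
# Minimal sketch: linear interpolation between two latent codes, decoded back to X.
import torch

with torch.no_grad():
    z1, z2 = f(x1), f(x2)                            # latent codes of the two samples
    alphas = torch.linspace(0, 1, 10).view(-1, 1)    # 10 interpolation coefficients
    x_interp = g((1 - alphas) * z1 + alphas * z2)    # decode g((1 - a) z1 + a z2)
```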
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
Besides dimension reduction, auto-encoders can capture dependencies between signal components to restore degraded or noisy signals.
In this case, the composition $g \circ f$ is a denoising auto-encoder: it takes as input a corrupted signal $\tilde{\mathbf{x}}$ and outputs an estimate of the original signal $\mathbf{x}$.
The goal is to optimize $f$ and $g$ such that the reconstruction of the corrupted signal is as close as possible to the original one, e.g., by minimizing $\mathbb{E}\left[ || \mathbf{x} - g(f(\tilde{\mathbf{x}})) ||^2 \right]$.
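
As a minimal sketch (assuming PyTorch, the hypothetical `f` and `g` modules, and an `optimizer` over their parameters), one denoising training step could look like:

```python
# Minimal sketch: corrupt x, reconstruct it from the corrupted version,
# and compare the reconstruction to the *clean* signal.
import torch

def denoising_step(f, g, x, optimizer, noise_std=0.5):
    x_tilde = x + noise_std * torch.randn_like(x)   # corruption (additive Gaussian noise here)
    loss = ((g(f(x_tilde)) - x) ** 2).mean()        # reconstruction error against the clean x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```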
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
A fundamental weakness of denoising auto-encoders is that the posterior $p(\mathbf{x}|\tilde{\mathbf{x}})$ of the original signal given its corrupted version may be multi-modal.
If we train an auto-encoder with the quadratic loss (i.e., implicitly assuming a Gaussian likelihood), then the best reconstruction is the conditional mean $\hat{\mathbf{x}} = \mathbb{E}\left[ \mathbf{x} | \tilde{\mathbf{x}} \right]$, which may be very unlikely under $p(\mathbf{x}|\tilde{\mathbf{x}})$.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
???
Also, the quadratic loss leads to blurry and unrealistic reconstructions, for the reason that the quadratic loss minimizer may be very unlikely under the posterior.
The generative capability of the decoder $g$ can be assessed by introducing a (simple) density model $q$ over the latent space $\mathcal{Z}$, sampling from it, and mapping the samples into the data space $\mathcal{X}$ with $g$.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
For instance, a factored Gaussian model with diagonal covariance matrix,
$$q(\mathbf{z}) = \mathcal{N}(\hat{\mu}, \hat{\Sigma}),$$
where both $\hat{\mu}$ and $\hat{\Sigma}$ are estimated on training data.
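
As a minimal sketch (assuming PyTorch, the hypothetical trained `f` and `g`, and a training set `X_train`):

```python
# Minimal sketch: fit a factored Gaussian on the latent codes of the training data,
# sample from it, and map the samples back to the data space with the decoder g.
import torch

with torch.no_grad():
    Z = f(X_train)                                   # latent codes of the training set
    mu, sigma = Z.mean(dim=0), Z.std(dim=0)          # q(z) = N(mu, diag(sigma^2))
    z = mu + sigma * torch.randn(16, Z.shape[1])     # 16 samples z ~ q(z)
    x_new = g(z)                                     # decoded samples in the data space
```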
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
These results are not satisfactory because the density model on the latent space is too simple and inadequate.
Building a good model in latent space amounts to our original problem of modeling an empirical distribution, although it may now be in a lower-dimensional space.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
count: false
???
Switch to BB.
class: middle
Consider for now a prescribed latent variable model that relates a set of observable variables $\mathbf{x} \in \mathcal{X}$ to a set of unobserved (latent) variables $\mathbf{z} \in \mathcal{Z}$.
The probabilistic model defines a joint probability distribution $p_\theta(\mathbf{x}, \mathbf{z})$, which decomposes as
$$p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}).$$
???
The probabilistic model is given and motivated by domain knowledge assumptions.
Examples include:
- Linear discriminant analysis
- Bayesian networks
- Hidden Markov models
- Probabilistic programs
class: middle, black-slide
.center[
???
If we interpret
--
count: false
.alert[The curse of dimensionality will lead to poor estimates of the expectation.]
class: middle
Let us instead consider a variational approach to fit the model parameters $\theta$.
Using a variational distribution $q_\phi(\mathbf{z})$ over the latent variables, the log-evidence can be bounded from below:
$$\begin{aligned}
\log p_\theta(\mathbf{x}) &= \log \mathbb{E}_{p(\mathbf{z})}\left[ p_\theta(\mathbf{x}|\mathbf{z}) \right] \\
&= \log \mathbb{E}_{q_\phi(\mathbf{z})}\left[ \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z})} \right] \\
&\geq \mathbb{E}_{q_\phi(\mathbf{z})}\left[ \log \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z})} \right] \triangleq \text{ELBO}(\mathbf{x};\theta,\phi),
\end{aligned}$$
where the inequality follows from Jensen's inequality. This lower bound is called the evidence lower bound (ELBO).
class: middle
Using the Bayes rule, we can also write
$$\begin{aligned}
\text{ELBO}(\mathbf{x};\theta, \phi) &= \mathbb{E}_{q_\phi(\mathbf{z})}\left[ \log \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z})} \right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z})}\left[ \log \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z})} \frac{p_\theta(\mathbf{x})}{p_\theta(\mathbf{x})} \right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z})}\left[ \log \frac{p_\theta(\mathbf{z}|\mathbf{x})}{q_\phi(\mathbf{z})} p_\theta(\mathbf{x}) \right] \\
&= \log p_\theta(\mathbf{x}) - \text{KL}(q_\phi(\mathbf{z}) || p_\theta(\mathbf{z}|\mathbf{x})).
\end{aligned}$$
Therefore,
$$\log p_\theta(\mathbf{x}) = \text{ELBO}(\mathbf{x};\theta,\phi) + \text{KL}(q_\phi(\mathbf{z}) || p_\theta(\mathbf{z}|\mathbf{x})),$$
which shows that maximizing the ELBO with respect to $\phi$ amounts to minimizing the KL divergence between the variational distribution $q_\phi(\mathbf{z})$ and the true posterior $p_\theta(\mathbf{z}|\mathbf{x})$.
class: middle
Provided the KL gap remains small, the model parameters can now be optimized by maximizing the ELBO,
$$\theta^*, \phi^* = \arg \max_{\theta, \phi} \text{ELBO}(\mathbf{x};\theta,\phi).$$
class: middle
We can proceed by gradient ascent, provided we can evaluate the gradients $\nabla_\theta \text{ELBO}(\mathbf{x};\theta,\phi)$ and $\nabla_\phi \text{ELBO}(\mathbf{x};\theta,\phi)$.
In general, the gradient of the ELBO is intractable to compute, but we can estimate it with Monte Carlo integration.
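
As a minimal sketch (assuming PyTorch and its `torch.distributions` API; `log_p_xz` is a hypothetical function evaluating $\log p_\theta(\mathbf{x}, \mathbf{z})$), the ELBO itself can be estimated by Monte Carlo as:

```python
# Minimal sketch: Monte Carlo estimate of ELBO = E_q[log p(x, z) - log q(z)].
import torch
from torch.distributions import Independent, Normal

def elbo_estimate(log_p_xz, q, x, K=64):
    z = q.sample((K,))                               # K samples z_k ~ q(z)
    return (log_p_xz(x, z) - q.log_prob(z)).mean()   # (1/K) sum_k [log p(x, z_k) - log q(z_k)]

# example variational distribution: a factored Gaussian over a 2d latent space
q = Independent(Normal(torch.zeros(2), torch.ones(2)), 1)
```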
class: middle
count: false
class: middle
class: middle
So far we assumed a prescribed probabilistic model motivated by domain knowledge.
We will now directly learn a stochastic generating process $p_\theta(\mathbf{x}|\mathbf{z})$ with a neural network.

We will also amortize the inference process by learning a second neural network $q_\phi(\mathbf{z}|\mathbf{x})$ that approximates the posterior $p_\theta(\mathbf{z}|\mathbf{x})$.
class: middle
class: middle
A variational auto-encoder is a deep latent variable model where:
- The prior $p(\mathbf{z})$ is prescribed, and usually chosen to be Gaussian.
- The likelihood $p_\theta(\mathbf{x}|\mathbf{z})$ is parameterized with a generative network $\text{NN}_\theta$ (or decoder) that takes as input $\mathbf{z}$ and outputs parameters $\varphi = \text{NN}_\theta(\mathbf{z})$ to the data distribution. E.g.,
$$\begin{aligned}
\mu, \sigma &= \text{NN}_\theta(\mathbf{z}) \\
p_\theta(\mathbf{x}|\mathbf{z}) &= \mathcal{N}(\mathbf{x}; \mu, \sigma^2\mathbf{I})
\end{aligned}$$
- The approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ is parameterized with an inference network $\text{NN}_\phi$ (or encoder) that takes as input $\mathbf{x}$ and outputs parameters $\nu = \text{NN}_\phi(\mathbf{x})$ to the approximate posterior. E.g.,
$$\begin{aligned}
\mu, \sigma &= \text{NN}_\phi(\mathbf{x}) \\
q_\phi(\mathbf{z}|\mathbf{x}) &= \mathcal{N}(\mathbf{z}; \mu, \sigma^2\mathbf{I})
\end{aligned}$$
class: middle
As before, we can use variational inference to jointly optimize the encoder and decoder networks parameters $\phi$ and $\theta$:
$$\begin{aligned}
\theta^*, \phi^* &= \arg \max_{\theta,\phi} \text{ELBO}(\mathbf{x};\theta,\phi) \\
&= \arg \max_{\theta,\phi} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \log \frac{p_\theta(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} \right] \\
&= \arg \max_{\theta,\phi} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p_\theta(\mathbf{x}|\mathbf{z}) \right] - \text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) || p(\mathbf{z})).
\end{aligned}$$
Interpretation:
- Given some decoder network set at $\theta$, we want to put the mass of the latent variables, by adjusting $\phi$, such that they explain the observed data, while remaining close to the prior.
- Given some encoder network set at $\phi$, we want to put the mass of the observed variables, by adjusting $\theta$, such that they are well explained by the latent variables.
class: middle
Unbiased gradients of the ELBO with respect to the generative model parameters $\theta$ are simple to obtain, since
$$\begin{aligned}
\nabla_\theta \text{ELBO}(\mathbf{x};\theta,\phi) &= \nabla_\theta \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p_\theta(\mathbf{x},\mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x}) \right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \nabla_\theta \log p_\theta(\mathbf{x},\mathbf{z}) \right],
\end{aligned}$$
which can be estimated with Monte Carlo integration.
However, gradients with respect to the inference model parameters $\phi$ are more difficult to obtain, since
$$\nabla_\phi \text{ELBO}(\mathbf{x};\theta,\phi) = \nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p_\theta(\mathbf{x},\mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x}) \right],$$
where the expectation is taken with respect to a distribution that itself depends on $\phi$, so the gradient cannot simply be moved inside the expectation.
class: middle
Let us abbreviate
$$\begin{aligned}
\text{ELBO}(\mathbf{x};\theta,\phi) &= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ \log p_\theta(\mathbf{x},\mathbf{z}) - \log q_\phi(\mathbf{z}|\mathbf{x})\right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ f(\mathbf{x}, \mathbf{z}; \phi) \right].
\end{aligned}$$
The computational graph of a Monte Carlo estimate of the ELBO would look like:

.grid[
.kol-1-5[]
.kol-4-5[.center.width-75[]]
]
Issue: We cannot backpropagate through the stochastic node $\mathbf{z}$ to compute the gradient $\nabla_\phi f$.
class: middle
The reparameterization trick consists in re-expressing the variable $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$ as some differentiable and invertible transformation of another random variable $\epsilon$, given $\mathbf{x}$ and $\phi$,
$$\mathbf{z} = g(\phi, \mathbf{x}, \epsilon),$$
such that the distribution of $\epsilon$ is independent of $\mathbf{x}$ and $\phi$.
class: middle
.grid[
.kol-1-5[]
.kol-4-5[.center.width-70[]]
]
If $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \mu(\mathbf{x};\phi), \sigma^2(\mathbf{x};\phi)\mathbf{I})$, where $\mu(\mathbf{x};\phi)$ and $\sigma^2(\mathbf{x};\phi)$ are the outputs of the inference network, then a common reparameterization is
$$\begin{aligned}
p(\epsilon) &= \mathcal{N}(\epsilon; \mathbf{0}, \mathbf{I}) \\
\mathbf{z} &= \mu(\mathbf{x};\phi) + \sigma(\mathbf{x};\phi) \odot \epsilon.
\end{aligned}$$
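
As a minimal sketch (assuming PyTorch; `mu` and `log_sigma2` stand for the outputs of the inference network), the reparameterization makes the sample differentiable with respect to $\phi$:

```python
# Minimal sketch: z = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu, sigma.
import torch

mu = torch.zeros(16, 2, requires_grad=True)          # stands for mu(x; phi)
log_sigma2 = torch.zeros(16, 2, requires_grad=True)  # stands for log sigma^2(x; phi)

eps = torch.randn_like(mu)                           # eps ~ N(0, I), independent of x and phi
z = mu + torch.exp(0.5 * log_sigma2) * eps           # z ~ N(mu, sigma^2 I)
z.sum().backward()                                   # dz/dmu and dz/dlog_sigma2 are well defined
```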
class: middle
Given this change of variable, the ELBO can be rewritten as
$$\begin{aligned}
\text{ELBO}(\mathbf{x};\theta,\phi) &= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[ f(\mathbf{x}, \mathbf{z}; \phi) \right]\\
&= \mathbb{E}_{p(\epsilon)} \left[ f(\mathbf{x}, g(\phi,\mathbf{x},\epsilon); \phi) \right].
\end{aligned}$$
Therefore, estimating the gradient of the ELBO with respect to $\theta$ and $\phi$ reduces to computing
$$\nabla_{\theta,\phi} \, \mathbb{E}_{p(\epsilon)} \left[ f(\mathbf{x}, g(\phi,\mathbf{x},\epsilon); \phi) \right] = \mathbb{E}_{p(\epsilon)} \left[ \nabla_{\theta,\phi} \, f(\mathbf{x}, g(\phi,\mathbf{x},\epsilon); \phi) \right],$$
which can be estimated with Monte Carlo integration.
The last required ingredient is the evaluation of the approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ given the change of variable $g$. As long as $g$ is invertible in $\epsilon$, we have
$$\log q_\phi(\mathbf{z}|\mathbf{x}) = \log p(\epsilon) - \log \left| \det \frac{\partial \mathbf{z}}{\partial \epsilon} \right|.$$
class: middle, center
(demo)
class: middle
Consider the MNIST digit dataset as data:
class: middle
class: middle
- Decoder $p_\theta(\mathbf{x}|\mathbf{z})$:
$$\begin{aligned}
\mathbf{z} &\in \mathbb{R}^d \\
p(\mathbf{z}) &= \mathcal{N}(\mathbf{z}; \mathbf{0},\mathbf{I})\\
p_\theta(\mathbf{x}|\mathbf{z}) &= \mathcal{N}(\mathbf{x};\mu(\mathbf{z};\theta), \sigma^2(\mathbf{z};\theta)\mathbf{I}) \\
\mu(\mathbf{z};\theta) &= \mathbf{W}_2^T\mathbf{h} + \mathbf{b}_2 \\
\log \sigma^2(\mathbf{z};\theta) &= \mathbf{W}_3^T\mathbf{h} + \mathbf{b}_3 \\
\mathbf{h} &= \text{ReLU}(\mathbf{W}_1^T \mathbf{z} + \mathbf{b}_1)\\
\theta &= \{ \mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2, \mathbf{W}_3, \mathbf{b}_3 \}
\end{aligned}$$
class: middle
- Encoder $q_\phi(\mathbf{z}|\mathbf{x})$:
$$\begin{aligned}
q_\phi(\mathbf{z}|\mathbf{x}) &= \mathcal{N}(\mathbf{z};\mu(\mathbf{x};\phi), \sigma^2(\mathbf{x};\phi)\mathbf{I}) \\
p(\epsilon) &= \mathcal{N}(\epsilon; \mathbf{0}, \mathbf{I}) \\
\mathbf{z} &= \mu(\mathbf{x};\phi) + \sigma(\mathbf{x};\phi) \odot \epsilon \\
\mu(\mathbf{x};\phi) &= \mathbf{W}_5^T\mathbf{h} + \mathbf{b}_5 \\
\log \sigma^2(\mathbf{x};\phi) &= \mathbf{W}_6^T\mathbf{h} + \mathbf{b}_6 \\
\mathbf{h} &= \text{ReLU}(\mathbf{W}_4^T \mathbf{x} + \mathbf{b}_4)\\
\phi &= \{ \mathbf{W}_4, \mathbf{b}_4, \mathbf{W}_5, \mathbf{b}_5, \mathbf{W}_6, \mathbf{b}_6 \}
\end{aligned}$$
Note that there is no restriction on the encoder and decoder network architectures; they could just as well be arbitrarily complex convolutional networks.
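
As a minimal sketch (assuming PyTorch; the hidden size is illustrative), this encoder/decoder pair with the reparameterization trick could be written as:

```python
# Minimal sketch: Gaussian encoder and decoder, with one hidden ReLU layer each.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, p=784, d=2, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(p, h), nn.ReLU(), nn.Linear(h, 2 * d))  # -> (mu, log sigma^2) of q(z|x)
        self.dec = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 2 * p))  # -> (mu, log sigma^2) of p(x|z)

    def forward(self, x):
        mu_z, log_var_z = self.enc(x).chunk(2, dim=-1)                  # parameters of q(z|x)
        z = mu_z + torch.exp(0.5 * log_var_z) * torch.randn_like(mu_z)  # reparameterized sample
        mu_x, log_var_x = self.dec(z).chunk(2, dim=-1)                  # parameters of p(x|z)
        return mu_x, log_var_x, mu_z, log_var_z
```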
class: middle
Plugging everything together, the objective can be expressed as
$$\begin{aligned}
\text{ELBO}(\mathbf{x};\theta,\phi) &= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} \left[ \log p_\theta(\mathbf{x}|\mathbf{z}) \right] - \text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) || p(\mathbf{z})) \\
&= \mathbb{E}_{p(\epsilon)} \left[ \log p(\mathbf{x}|\mathbf{z}=g(\phi,\mathbf{x},\epsilon);\theta) \right] - \text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) || p(\mathbf{z})),
\end{aligned}
$$
where the negative KL divergence can be expressed analytically as
$$-\text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) || p(\mathbf{z})) = \frac{1}{2} \sum_{j=1}^d \left( 1 + \log \sigma_j^2(\mathbf{x};\phi) - \mu_j^2(\mathbf{x};\phi) - \sigma_j^2(\mathbf{x};\phi) \right),$$
which allows its gradient to be evaluated without approximation.
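
As a minimal sketch (assuming PyTorch and the hypothetical `VAE` module above; the Gaussian log-likelihood is written up to an additive constant), the training loss is the negative ELBO estimated with a single reparameterized sample:

```python
# Minimal sketch: negative ELBO = KL(q(z|x) || p(z)) - E_q[log p(x|z)], single-sample estimate.
import torch

def neg_elbo(x, mu_x, log_var_x, mu_z, log_var_z):
    # Gaussian log-likelihood log p(x|z), up to the constant -p/2 log(2 pi)
    log_px_z = -0.5 * (log_var_x + (x - mu_x) ** 2 / log_var_x.exp()).sum(dim=-1)
    # analytic -KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    neg_kl = 0.5 * (1 + log_var_z - mu_z ** 2 - log_var_z.exp()).sum(dim=-1)
    return -(log_px_z + neg_kl).mean()               # average over the batch
```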
class: middle
.footnote[Credits: Kingma and Welling, 2013.]
class: middle
.footnote[Credits: Kingma and Welling, 2013.]
class: middle
The prior-matching term $\text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))$ regularizes the approximate posterior towards the prior $p(\mathbf{z})$, which encourages a smooth latent space from which new samples can be generated.
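
As a minimal sketch (assuming PyTorch and a trained instance `vae` of the hypothetical module above), new samples can then be generated by sampling from the prior and decoding:

```python
# Minimal sketch: sample z ~ p(z) = N(0, I) and decode it with the generative network.
import torch

with torch.no_grad():
    z = torch.randn(16, 2)                            # 16 latent samples, d = 2
    mu_x, log_var_x = vae.dec(z).chunk(2, dim=-1)     # parameters of p(x|z)
    x_new = mu_x                                      # e.g., keep the mean as the sample
```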
.footnote[Credits: Siddharth Mishra-Sharma, 2023.]
Hierarchical .bold[compression of images and other data],
e.g., in video conferencing systems (Gregor et al, 2016).
]
exclude: true
class: middle
.bold[Understanding the factors of variation and invariances] (Higgins et al, 2017). ]
class: middle
.center[
.bold[Voice style transfer] [demo] (van den Oord et al, 2017). ]
class: middle
.center[.bold[Design of new molecules] with desired chemical properties
(Gomez-Bombarelli et al, 2016).]
exclude: true
class: middle
.center[
<iframe width="640" height="400" src="https://www.youtube.com/embed/Wd-1WU8emkw?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>Bridging the .bold[simulation-to-reality] gap (Inoue et al, 2017).
]
class: end-slide, center
count: false
The end.