class: middle, center, title-slide
Lecture 7: Machine learning and neural networks
Prof. Gilles Louppe
[email protected]
???
!!! The transition, motivation and intuition towards CNNs should be improved. This is going too fast and not explained as clearly as MLPs.
Learning from data is a key component of artificial intelligence. In this lecture, we will introduce the principles of:
- Machine learning
- Neural networks
.footnote[Credits: CS188, UC Berkeley.]
class: middle
What if the environment is unknown?
- Learning provides an automated way to modify the agent's internal decision mechanisms to improve its own performance.
- It exposes the agent to reality rather than trying to hardcode reality into the agent's program.
More generally, learning is useful for any task where it is difficult to write a program that performs the task but easy to obtain examples of desired behavior.
class: middle
class: middle
.center[
.width-40[]
.width-40[
]
]
.question[How would you write a computer program that recognizes cats from dogs?]
class: middle
count: false
class: middle
count: false
class: black-slide, middle
background-image: url(./figures/lec7/cat3.png)
background-size: cover
count: false
class: black-slide, middle
background-image: url(./figures/lec7/cat4.png)
background-size: cover
class: middle
.center[The deep learning approach.]
.grid[ .kol-2-3[
Let
From this data, we want to identify a probabilistic model
]
.kol-1-3[
.center.width-80[]]
]
class: middle
.center[Regression (
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
.center[Supervised learning with structured outputs (
.footnote[Credits: Simon J.D. Prince, 2023.]
Let us first assume that
.center.width-90[]
.footnote[Credits: CS188, UC Berkeley.]
???
Do it on the blackboard.
class: middle
.grid[
.kol-1-5[
.center.width-100[]]
.kol-4-5[.center.width-50[
]]
]
Linear regression considers a parameterized linear Gaussian model for its parametric model of
.footnote[Credits: Simon J.D. Prince, 2023.]
To learn the conditional distribution
--
count: false
By constraining the derivatives of the log-likelihood to
class: middle
.center[
Minimizing the negative log-likelihood of a linear Gaussian model reduces to minimizing the sum of squared residuals.]
.footnote[Credits: Simon J.D. Prince, 2023.]
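To make this concrete, here is a minimal NumPy sketch of the closed-form least-squares solution; the synthetic data and the use of `np.linalg.lstsq` are illustrative choices, not part of the original slides.

```python
import numpy as np

# Synthetic data: y = 2x + 1 + noise (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)

# Design matrix with a column of ones to absorb the bias term.
X = np.stack([x, np.ones_like(x)], axis=1)

# Least-squares solution of X theta ~= y (normal equations, solved stably).
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # approximately [2.0, 1.0]
```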
class: middle
If we absorb the bias term
Let us now assume
.center.width-50[]
.footnote[Credits: CS188, UC Berkeley.]
class: middle
Logistic regression models the conditional as
???
This model is the core building block of deep neural networks!
class: middle
Following the principle of maximum likelihood estimation, we have
This loss is an estimator of the cross-entropy
Unfortunately, there is no closed-form solution for the MLE of
class: middle
Let
To minimize
For
class: middle
A minimizer of the approximation
Therefore, model parameters can be updated iteratively using the update rule
- $\theta_0$ are the initial parameters of the model,
- $\gamma$ is the learning rate.
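As a minimal illustration of this update rule (a sketch only, reusing the mean squared error objective from the linear regression example above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)
X = np.stack([x, np.ones_like(x)], axis=1)

theta = np.zeros(2)   # initial parameters theta_0
gamma = 0.1           # learning rate

for _ in range(500):
    residuals = X @ theta - y
    grad = 2 / len(y) * X.T @ residuals  # gradient of the mean squared error
    theta = theta - gamma * grad         # gradient descent update

print(theta)  # close to the least-squares solution
```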
class: center, middle
count: false
class: center, middle
count: false
class: center, middle
count: false
class: center, middle
count: false
class: center, middle
count: false
class: center, middle
count: false
class: center, middle
count: false
class: center, middle
class: middle, center
(Step-by-step code example)
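The original step-by-step notebook is not reproduced here; the sketch below is in the same spirit and assumes a toy dataset: logistic regression trained by gradient descent on the cross-entropy loss.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy binary classification data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
gamma = 0.5

for _ in range(1000):
    p = sigmoid(X @ w + b)            # predicted probabilities p(y=1|x)
    grad_w = X.T @ (p - y) / len(y)   # gradient of the cross-entropy
    grad_b = np.mean(p - y)
    w -= gamma * grad_w
    b -= gamma * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(w, b, accuracy)
```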
class: middle
Can we learn to play Pacman only from observations?
- Feature vectors $\mathbf{x} = g(s)$ are extracted from the game states $s$. Output values $y$ correspond to actions $a$.
- State-action pairs $(\mathbf{x}, y)$ are collected by observing an expert playing.
- We want to learn the actions that the expert would take in a given situation. That is, learn the mapping $f:\mathbb{R}^d \to \mathcal{A}$.
- This is a multiclass classification problem that can be solved by combining binary classifiers.
.footnote[Credits: CS188, UC Berkeley.]
class: middle, black-slide
.center[
The agent observes a very good Minimax-based agent for two games and updates its weight vectors as data are collected. ]
.footnote[Credits: CS188, UC Berkeley.]
class: middle, black-slide
.center[
]
.footnote[Credits: CS188, UC Berkeley.]
class: middle, black-slide
.center[
After two training episodes, the ML-based agent plays.
No more Minimax!
]
.footnote[Credits: CS188, UC Berkeley.]
class: middle
(a short introduction)
A shallow network is a function
???
Draw the (generic) architecture of a shallow network.
class: middle
We first consider the case where
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
a) The input
b) More compact representation of the same network where we omit the bias terms, the weight labels and the activation functions.
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
This network defines a family of piecewise linear functions where the positions of the joints, the slopes and the heights of the functions are determined by the 10 parameters
.footnote[Credits: Simon J.D. Prince, 2023.]
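A minimal NumPy sketch of such a shallow network, assuming a ReLU activation and arbitrary values for the 10 parameters (three input weights, three biases, three output weights, and one output bias):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def shallow_net(x, w, b, v, c):
    """Shallow network: y = sum_i v_i * relu(w_i * x + b_i) + c."""
    h = relu(np.outer(x, w) + b)   # hidden activations, shape (len(x), 3)
    return h @ v + c

# Arbitrary parameter values, chosen only to show the piecewise linear shape.
w = np.array([1.0, -1.0, 2.0])
b = np.array([0.0, 0.5, -1.0])
v = np.array([1.0, 2.0, -0.5])
c = 0.1

x = np.linspace(-2, 2, 9)
print(shallow_net(x, w, b, v, c))  # a piecewise linear function of x
```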
class: middle
The number
The universal approximation theorem states that a single-hidden-layer network with a finite number of hidden units can approximate any continuous function on a compact subset of
class: middle
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
To extend the network to multivariate outputs
For example, a network with two output units
class: middle
a) With two output units, the network can model two functions of the input
b) The four joints of these functions are constrained to be at the same positions, but the slopes and heights of the functions can vary independently.
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
To extend the network to multivariate inputs
For example, a network with two inputs
class: middle
.footnote[Credits: Simon J.D. Prince, 2023.]
We first consider the composition of two shallow networks, where the output of the first network is fed as input to the second network as
$$\begin{aligned}
h_0 &= \sigma\left( w_{0} x + b_0 \right) \\
h_1 &= \sigma\left( w_{1} x + b_1 \right) \\
h_2 &= \sigma\left( w_{2} x + b_2 \right) \\
y &= v_{0} h_0 + v_{1} h_1 + v_{2} h_2 + c \\
h_0' &= \sigma\left( w'_{0} y + b'_0 \right) \\
h_1' &= \sigma\left( w'_{1} y + b'_1 \right) \\
h_2' &= \sigma\left( w'_{2} y + b'_2 \right) \\
y' &= v'_{0} h_0' + v'_{1} h_1' + v'_{2} h_2' + c'.
\end{aligned}$$
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
With
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
Folding interpretation of a deep network:
a) The first network folds the input space back on itself.
b) The second network applies its function to the folded space.
c) The final output is revealed by unfolding the folded space.
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
Similarly, composing a multivariate shallow network with a shallow network further divides the input space into more linear regions.
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
Since the operation from
It follows that the composition of the two shallow networks is a special case of a deep network with two hidden layers where the first layer is defined as
$$\begin{aligned}
h_0 &= \sigma\left( w_{0} x + b_0 \right) \\
h_1 &= \sigma\left( w_{1} x + b_1 \right) \\
h_2 &= \sigma\left( w_{2} x + b_2 \right),
\end{aligned}$$
the second layer is defined from the outputs of the first layer as
$$\begin{aligned}
h_0' &= \sigma\left( w'_{00} h_0 + w'_{01} h_1 + w'_{02} h_2 + b'_0 \right) \\
h_1' &= \sigma\left( w'_{10} h_0 + w'_{11} h_1 + w'_{12} h_2 + b'_1 \right) \\
h_2' &= \sigma\left( w'_{20} h_0 + w'_{21} h_1 + w'_{22} h_2 + b'_2 \right),
\end{aligned}$$
and the output is defined as
class: middle
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
The computation of a hidden layer can be written in matrix form as
$$\begin{aligned}
\mathbf{h} &= \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{q-1} \end{bmatrix} = \sigma\left( \begin{bmatrix} w_{00} & w_{01} & \cdots & w_{0(d_\text{in}-1)} \\ w_{10} & w_{11} & \cdots & w_{1(d_\text{in}-1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{(q-1)0} & w_{(q-1)1} & \cdots & w_{(q-1)(d_\text{in}-1)} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{d_\text{in}-1} \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{q-1} \end{bmatrix} \right) \\
&= \sigma(\mathbf{W}^T \mathbf{x} + \mathbf{b})
\end{aligned}$$
where
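In code, this matrix form is a one-liner. A minimal NumPy sketch, with shapes chosen arbitrarily for illustration and the sigmoid as a placeholder activation:

```python
import numpy as np

def layer(x, W, b, sigma=lambda z: 1 / (1 + np.exp(-z))):
    """One hidden layer: h = sigma(W^T x + b)."""
    return sigma(W.T @ x + b)

# Example shapes: d_in = 4 inputs, q = 3 hidden units (arbitrary values).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weights, one column per hidden unit
b = rng.normal(size=3)        # biases
x = rng.normal(size=4)        # input vector
print(layer(x, W, b))         # hidden activations, shape (3,)
```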
class: middle
Hidden layers can be composed in series to form a deep network with
This model is known as the feedforward neural network, the fully connected network, or the .bold[multilayer perceptron] (MLP).
class: middle
The choice of the activation function
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
- For regression, the width $q$ of the last layer $L$ is set to the dimensionality of the output $d_\text{out}$ and the activation function is the identity $\sigma(\cdot) = \cdot$, which results in a vector $\mathbf{h}_L \in \mathbb{R}^{d_\text{out}}$.
- For binary classification, the width $q$ of the last layer $L$ is set to $1$ and the activation function is the sigmoid $\sigma(\cdot) = \frac{1}{1 + \exp(-\cdot)}$, which results in a single output $h_L \in [0,1]$ that models the probability $p(y=1|\mathbf{x})$.
- For multi-class classification, the sigmoid activation $\sigma$ in the last layer can be generalized to produce a vector $\mathbf{h}_L \in \bigtriangleup^C$ of probability estimates $p(y=i|\mathbf{x})$. This activation is the $\text{Softmax}$ function, whose $i$-th output is defined as $$\text{Softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^C \exp(z_j)},$$ for $i=1, ..., C$.
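A small NumPy implementation of this Softmax (the max-shift is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, with the usual max-shift for stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))  # sums to 1, largest entry wins
```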
class: middle
The parameters (e.g.,
The loss function is derived from the likelihood:
- For regression, assuming a Gaussian likelihood, the loss is the mean squared error $\mathcal{L}(\theta) = \frac{1}{N} \sum_{(\mathbf{x}_j, \mathbf{y}_j) \in \mathbf{d}} (\mathbf{y}_j - f(\mathbf{x}_j; \theta))^2$.
- For classification, assuming a categorical likelihood, the loss is the cross-entropy $\mathcal{L}(\theta) = -\frac{1}{N} \sum_{(\mathbf{x}_j, \mathbf{y}_j) \in \mathbf{d}} \sum_{i=1}^C y_{ij} \log f_{i}(\mathbf{x}_j; \theta)$.
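Minimal NumPy versions of these two losses (the one-hot targets and the small epsilon inside the logarithm are implementation choices, not part of the slides):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, averaged over the dataset."""
    return np.mean(np.sum((y_true - y_pred) ** 2, axis=-1))

def cross_entropy(y_onehot, p_pred, eps=1e-12):
    """Cross-entropy between one-hot targets and predicted class probabilities."""
    return -np.mean(np.sum(y_onehot * np.log(p_pred + eps), axis=-1))

# Tiny usage example with two samples and three classes.
y_onehot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
p_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_onehot, p_pred))
```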
class: middle, center
(Step-by-step code example)
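The original notebook is not reproduced here; below is a minimal PyTorch sketch of an MLP classifier trained for a few gradient steps on fake data (the architecture and dimensions are illustrative assumptions).

```python
import torch
from torch import nn

# A small MLP for 10-class classification on flattened 28x28 inputs
# (dimensions are illustrative, e.g. MNIST-like data).
model = nn.Sequential(
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),   # logits; the softmax is folded into the loss below
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 28 * 28)       # a batch of fake inputs
y = torch.randint(0, 10, (64,))    # fake integer class labels

for _ in range(10):                # a few gradient descent steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(loss.item())
```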
class: middle
The MLP architecture is appropriate for tabular data, but not for images.
- Each pixel of an image is an input feature, leading to a high-dimensional input vector.
- Each hidden unit is connected to all input units, leading to a high-dimensional weight matrix.
class: middle
We want to design a neural architecture such that:
- in the earliest layers, the network responds similarly to similar patches of the image, regardless of their location;
- the earliest layers focus on local regions of the image, without regard for the contents of the image in distant regions;
- in the later layers, the network combines the information from the earlier layers to focus on larger and larger regions of the image, eventually combining all the information from the image to classify the image into a category.
Convolutional neural networks extend fully connected architectures with
- convolutional layers acting as local feature detectors;
- pooling layers acting as spatial down-samplers.
.center.width-80[]
class: middle
For the one-dimensional input
class: middle
Convolutions can implement differential operators:
or crude template matchers:
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
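For instance, a minimal NumPy sketch of a discrete 1D convolution (implemented as a cross-correlation, the usual convention in deep learning libraries), applied with a finite-difference kernel:

```python
import numpy as np

def conv1d(x, u):
    """Valid 1D convolution (cross-correlation convention)."""
    w = len(u)
    return np.array([np.sum(x[i:i + w] * u) for i in range(len(x) - w + 1)])

x = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0])
u = np.array([-1.0, 1.0])   # finite-difference kernel
print(conv1d(x, u))         # approximates the derivative of x
```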
class: middle
A convolutional layer is defined by a set of
Assuming as input a 3D tensor
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
???
Give some intuition about the interpretation of the convolution in terms of similarity between the input and the kernel.
class: middle
Convolutional layers (c-f) are a special case of fully connected layers (a-b) where hidden units are connected to local regions of the input through shared weights (the kernels).
- The connectivity allows the network to learn local patterns in the input.
- Weight sharing allows the network to learn the same patterns at different locations in the input.
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
Pooling layers are used to progressively reduce the spatial size of the representation, hence capturing longer-range dependencies between features.
Considering a pooling area of size
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle, center
(Step-by-step code example)
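The original notebook is not reproduced here; below is a minimal PyTorch sketch of a small convolutional network alternating convolutional and pooling layers (channel counts and image sizes are illustrative assumptions).

```python
import torch
from torch import nn

# A small convolutional network for 10-class image classification.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local feature detectors
    nn.ReLU(),
    nn.MaxPool2d(2),                             # spatial down-sampling
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # classifier head for 28x28 inputs
)

x = torch.randn(64, 1, 28, 28)   # a batch of fake grayscale images
print(model(x).shape)            # torch.Size([64, 10])
```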
???
See also https://poloclub.github.io/cnn-explainer/
When the input is a sequence
For example,
???
Skip or go fast.
class: middle
Notice how this is similar to filtering and dynamic decision networks:
- $\mathbf{h}_t$ can be viewed as some current belief state;
- $\mathbf{x}_{1:T}$ is a sequence of observations;
- $\mathbf{h}_{t+1}$ is computed from the current belief state $\mathbf{h}_t$ and the latest evidence $\mathbf{x}_t$ through some fixed computation (in this case a neural network, instead of being inferred from the assumed dynamics).
- $\mathbf{h}_t$ can also be used to decide on some action, through another network $f$ such that $a_t = f(\mathbf{h}_t;\theta)$.
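A minimal NumPy sketch of such a recurrent update (the tanh recurrence and the dimensions are illustrative choices; the exact update used on the slides is not reproduced here):

```python
import numpy as np

def rnn_step(h, x, Wh, Wx, b):
    """One recurrent update: the new state depends on the previous state and the latest input."""
    return np.tanh(Wh @ h + Wx @ x + b)

# Arbitrary dimensions: state of size 8, inputs of size 3.
rng = np.random.default_rng(0)
Wh, Wx, b = rng.normal(size=(8, 8)), rng.normal(size=(8, 3)), np.zeros(8)

h = np.zeros(8)                       # initial belief-like state h_0
for x in rng.normal(size=(5, 3)):     # a sequence of 5 observations
    h = rnn_step(h, x, Wh, Wx, b)
print(h.shape)                        # (8,)
```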
class: middle, black-slide
.center[
<iframe width="640" height="400" src="https://www.youtube.com/embed/Ipi40cb_RsI?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>A recurrent network playing Mario Kart. ]
Transformers are deep neural networks at the core of large-scale language models.
.footnote[Credits: Simon J.D. Prince, 2023.]
class: middle
For language modeling, transformers define an .bold[autoregressive model] that predicts the next word in a sequence given the previous words.
Formally,
class: middle
The decoder-only transformer is a stack of
The output of the last block is used to predict the next word in the sequence, as in a regular classifier.
.footnote[Credits: Simon J.D. Prince, 2023.]
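To illustrate the autoregressive factorization only (not the transformer architecture itself), here is a minimal sketch where a toy stand-in replaces the network that maps a prefix to next-word probabilities:

```python
import numpy as np

def next_word_probs(prefix, vocab_size=5):
    """Stand-in for a transformer: any function mapping a prefix to a
    distribution over the next word (here, a fixed toy distribution)."""
    rng = np.random.default_rng(len(prefix))  # deterministic toy output
    logits = rng.normal(size=vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sequence_log_prob(sequence):
    """Autoregressive factorization: log p(x_1..x_T) = sum_t log p(x_t | x_<t)."""
    return sum(np.log(next_word_probs(sequence[:t])[sequence[t]])
               for t in range(len(sequence)))

print(sequence_log_prob([0, 3, 1, 4]))
```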
class: middle
- The more data, the better the model.
- The more parameters, the better the model.
- The more compute, the better the model.
class: middle
class: black-slide, middle
.center[
<iframe width="640" height="400" src="https://www.youtube.com/embed/HS1wV9NMLr8?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>How AI Helps Autonomous Vehicles See Outside the Box
(See also other episodes from NVIDIA DRIVE Labs)
]
class: black-slide, middle, center
Hydranet (Tesla, 2021)
???
70 networks
class: middle, black-slide, center
<iframe width="600" height="450" src="https://www.youtube.com/embed/AbdVsi1VjQY" frameborder="0" allowfullscreen></iframe>How machine learning is advancing medicine (Google, 2018)
- Deep learning is a powerful tool for learning from data.
- Neural networks are composed of layers of neurons that are connected to each other.
- The weights of the connections are learned by minimizing a loss function.
- Convolutional networks are used for image processing.
- Transformers are used for language processing.
class: middle
.italic[For the last forty years we have programmed computers; for the next forty years we will train them.]
.pull-right[Chris Bishop, 2020.]
class: end-slide, center
count: false
The end.