
Commit a43e891 ("save")
1 parent ea60033

File tree

2 files changed: +17 -18 lines changed


lecture2.md (+17 -18)
@@ -61,23 +61,23 @@ class: middle, center, black-slide
 .kol-1-2[<br><br>.width-100[![](figures/lec2/perceptron3.jpg)]]
 ]

-The Mark I Percetron (Rosenblatt, 1960).
+The Mark I Perceptron was implemented in hardware.

 ---

 class: middle, center, black-slide

 <iframe width="600" height="450" src="https://www.youtube.com/embed/cNxadbrN_aI" frameborder="0" allowfullscreen></iframe>

-The Perceptron
+The machine could classify simple images.

 ---

 class: middle

-The Mark I Perceptron is composed of association and response units, each acting as a binary classifier that computes a linear combination of its inputs and applies a step function to the result.
+The Mark I Perceptron is composed of association and response units (or "perceptrons"), each acting as a binary classifier that computes a linear combination of its inputs and applies a step function to the result.

-Formally, given an input vector $\mathbf{x} \in \mathbb{R}^p$, each unit computes its output as
+In the modern sense, given an input $\mathbf{x} \in \mathbb{R}^p$, each unit computes its output as
 $$f(\mathbf{x}) = \begin{cases}
 1 &\text{if } \sum_i w_i x_i + b \geq 0 \\\\
 0 &\text{otherwise}
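
As a quick companion to the threshold unit in the hunk above, here is a minimal NumPy sketch of the step-function computation; the function name and the AND-gate weights are illustrative choices, not part of the lecture.

```python
import numpy as np

def perceptron_unit(x, w, b):
    """Binary threshold unit: 1 if w.x + b >= 0, else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# Example: a unit whose weights implement a 2-input AND gate.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_unit(np.array(x, dtype=float), w, b))
```
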
@@ -308,6 +308,10 @@ $$\text{Softmax}(\mathbf{z})\_i = \frac{\exp(z\_i)}{\sum\_{j=1}^C \exp(z\_j)},$$
 for $i=1, ..., C$.
 - For regression, the width $q$ of the last layer $L$ is set to the dimensionality of the output $d\_\text{out}$ and the activation function is the identity $\sigma(\cdot) = \cdot$, which results in a vector $\mathbf{h}\_L \in \mathbb{R}^{d\_\text{out}}$.

+???
+
+Draw each.
+
 ---

 class: middle, center
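
For the softmax output layer referenced in the hunk header above, a small NumPy sketch of Softmax(z)_i = exp(z_i) / sum_j exp(z_j); the max-subtraction is a standard numerical-stability detail added here, not taken from the slide.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow in exp."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # C = 3 classes
probs = softmax(logits)
print(probs, probs.sum())            # class probabilities, summing to 1
```
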
@@ -320,7 +324,7 @@ class: middle

 ## Expressiveness

-Let us consider the 1-hidden layer MLP $$f(x) = \sum w\_i \text{sign}(x + b_i).$$ This model can approximate .bold[any] smooth 1D function, provided enough hidden units.
+Let us consider the 1-hidden layer MLP $$f(x) = \sum w\_i \text{sign}(x + b_i).$$ This model can approximate any smooth 1D function to arbitrary precision, provided enough hidden units.

 ---

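
To make the expressiveness claim in the hunk above concrete, a sketch that fits the output weights of f(x) = sum_i w_i sign(x + b_i) by least squares against a smooth target; the sin target, the offset grid, and the unit count are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Approximate a smooth 1D function with a staircase of sign units.
xs = np.linspace(-3, 3, 500)
target = np.sin(xs)

n_units = 50
b = np.linspace(-3, 3, n_units)               # offsets b_i of the hidden units
features = np.sign(xs[:, None] + b[None, :])  # shape (500, n_units)

# Fit the output weights w_i by least squares on the grid.
w, *_ = np.linalg.lstsq(features, target, rcond=None)
approx = features @ w

# Increasing n_units shrinks this error toward zero.
print("max abs error with", n_units, "units:", np.max(np.abs(approx - target)))
```
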
@@ -431,7 +435,7 @@ class: middle

 # Loss functions

-The parameters (e.g., $\mathbf{W}\_k$ and $\mathbf{b}\_k$ for each layer $k$ of $f(\mathbf{x}; \theta)$ are learned by minimizing a loss function $\mathcal{L}(\theta)$ over a dataset $\mathbf{d} = \\\{ (\mathbf{x}\_j, \mathbf{y}\_j) \\\}$ of input-output pairs.
+The parameters (e.g., $\mathbf{W}\_k$ and $\mathbf{b}\_k$ for each layer $k$) of an MLP $f(\mathbf{x}; \theta)$ are learned by minimizing a loss function $\mathcal{L}(\theta)$ over a dataset $\mathbf{d} = \\\{ (\mathbf{x}\_j, \mathbf{y}\_j) \\\}$ of input-output pairs.

 The loss function is derived from the likelihood:
 - For classification, assuming a categorical likelihood, the loss is the cross-entropy $\mathcal{L}(\theta) = -\frac{1}{N} \sum\_{(\mathbf{x}\_j, \mathbf{y}\_j) \in \mathbf{d}} \sum\_{i=1}^C y\_{ji} \log f\_{i}(\mathbf{x}\_j; \theta)$.
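
For the cross-entropy loss written in the hunk above, a minimal NumPy sketch of -1/N sum_j sum_i y_ji log f_i(x_j); the toy probabilities and one-hot targets are made up for illustration.

```python
import numpy as np

def cross_entropy(probs, y_onehot, eps=1e-12):
    """Mean negative log-likelihood under a categorical model.
    probs:    (N, C) predicted class probabilities f(x_j; theta)
    y_onehot: (N, C) one-hot targets y_j
    """
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

# Toy example with N = 2 samples and C = 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([[1, 0, 0],
              [0, 1, 0]])
print(cross_entropy(probs, y))   # about 0.29, i.e. -(log 0.7 + log 0.8) / 2
```
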
@@ -718,13 +722,13 @@ Why is stochastic gradient descent still a good idea?
 - Informally, averaging the update
 $$\theta\_{t+1} = \theta\_t - \gamma \nabla \ell(y\_{i(t+1)}, f(\mathbf{x}\_{i(t+1)}; \theta\_t)) $$
 over all choices $i(t+1)$ restores batch gradient descent.
-- Formally, if the gradient estimate is **unbiased**, e.g., if
+- Formally, if the gradient estimate is **unbiased**, that is, if
 $$\begin{aligned}
 \mathbb{E}\_{i(t+1)}[\nabla \ell(y\_{i(t+1)}, f(\mathbf{x}\_{i(t+1)}; \theta\_t))] &= \frac{1}{N} \sum\_{\mathbf{x}\_i, y\_i \in \mathbf{d}} \nabla \ell(y\_i, f(\mathbf{x}\_i; \theta\_t)) \\\\
 &= \nabla \mathcal{L}(\theta\_t)
 \end{aligned}$$
-then the formal convergence of SGD can be proved, under appropriate assumptions (see references).
-- If training is limited to single pass over the data, then SGD directly minimizes the **expected** risk.
+then the formal convergence of SGD can be proved, under appropriate assumptions.
+- If training is limited to a single pass over the data, then SGD directly minimizes the **expected** risk.

 ---

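
To illustrate the stochastic update discussed in the hunk above, a short NumPy sketch of SGD on a toy least-squares problem; sampling the index uniformly is what makes the per-sample gradient an unbiased estimate of the full gradient. The problem size, step size, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: y = X @ theta_true + noise.
N, p = 200, 3
X = rng.normal(size=(N, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(p)
gamma = 0.01                                  # learning rate
for t in range(5000):
    i = rng.integers(N)                       # uniform sampling -> unbiased gradient estimate
    grad = 2 * (X[i] @ theta - y[i]) * X[i]   # gradient of (x_i . theta - y_i)^2
    theta = theta - gamma * grad

print(theta)                                  # close to theta_true
```
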
@@ -745,7 +749,7 @@ where

 class: middle

-A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield the best generalization performance (in terms of excess error) despite being the worst optimization algorithms for minimizing the empirical risk.
+A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield strong generalization performance (in terms of excess error) despite being poor optimization algorithms for minimizing the empirical risk.

 ---

@@ -770,18 +774,13 @@ These derivatives can be evaluated automatically from the *computational graph*

 class: middle

-In Leibniz notations, the **chain rule** states that
+## Backpropagation
+
+- In Leibniz notations, the **chain rule** states that
 $$
 \begin{aligned}
 \frac{\partial \ell}{\partial \theta\_i} &= \sum\_{k \in \text{parents}(\ell)} \frac{\partial \ell}{\partial u\_k} \underbrace{\frac{\partial u\_k}{\partial \theta\_i}}\_{\text{recursive case}}
 \end{aligned}$$
-
----
-
-class: middle
-
-## Backpropagation
-
 - Since a neural network is a **composition of differentiable functions**, the total
 derivatives of the loss can be evaluated backward, by applying the chain rule
 recursively over its computational graph.
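
As a concrete companion to the chain-rule slide in the hunk above, a tiny sketch that applies the rule by hand on a two-node composition and checks the result with a finite difference; the toy function and names are illustrative, not from the lecture.

```python
# Composition: u = w * x + b, then l = (u - y)**2.  The backward pass applies
# the chain rule dl/dw = dl/du * du/dw and dl/db = dl/du * du/db.
def forward_backward(w, b, x, y):
    u = w * x + b                 # forward pass
    l = (u - y) ** 2
    dl_du = 2 * (u - y)           # local derivative at the loss node
    dl_dw = dl_du * x             # chain rule, recursive case: du/dw = x
    dl_db = dl_du * 1.0           # du/db = 1
    return l, dl_dw, dl_db

w, b, x, y = 1.5, -0.5, 2.0, 1.0
l, dl_dw, dl_db = forward_backward(w, b, x, y)

# Finite-difference check of dl/dw.
eps = 1e-6
l_eps, _, _ = forward_backward(w + eps, b, x, y)
print(dl_dw, (l_eps - l) / eps)   # the two numbers should agree closely
```
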

pdf/lec2.pdf (-120 KB)

Binary file not shown.
