The Mark I Perceptron is composed of association and response units (or "perceptrons"), each acting as a binary classifier that computes a linear combination of its inputs and applies a step function to the result.
In the modern sense, given an input $\mathbf{x} \in \mathbb{R}^p$, each unit computes its output as
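A standard way to write this (a reconstruction, consistent with the linear-combination-plus-step description above; the exact notation is assumed):

$$f(\mathbf{x}) = \text{step}\left( \sum\_{i=1}^p w\_i x\_i + b \right),$$

where $\text{step}(z) = 1$ if $z \geq 0$ and $0$ otherwise.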
- For regression, the width $q$ of the last layer $L$ is set to the dimensionality of the output $d\_\text{out}$ and the activation function is the identity $\sigma(\cdot) = \cdot$, which results in a vector $\mathbf{h}\_L \in \mathbb{R}^{d\_\text{out}}$.
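As a minimal sketch of the regression head (the layer widths, random parameters, and the ReLU hidden activation are assumptions for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, q, d_out = 3, 16, 2  # assumed widths

x = rng.normal(size=d_in)
W1, b1 = rng.normal(size=(q, d_in)), np.zeros(q)
W2, b2 = rng.normal(size=(d_out, q)), np.zeros(d_out)

h = np.maximum(W1 @ x + b1, 0.0)  # hidden layer (ReLU, an assumption)
h_L = W2 @ h + b2                 # last layer: width d_out, identity activation
# h_L is a vector in R^{d_out}
```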
???
Draw each.
---
class: middle, center
## Expressiveness
Let us consider the 1-hidden layer MLP $$f(x) = \sum w\_i \text{sign}(x + b_i).$$ This model can approximate any smooth 1D function to arbitrary precision, provided enough hidden units.
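A quick numerical illustration (an assumed toy setup: fixed offsets $b\_i$ on a grid, weights $w\_i$ fit by least squares) shows that even 50 sign units give a close piecewise-constant fit of a smooth target:

```python
import numpy as np

# Fit f(x) = sum_i w_i * sign(x + b_i) to sin(x) over [-pi, pi].
xs = np.linspace(-np.pi, np.pi, 200)
target = np.sin(xs)

bs = np.linspace(-np.pi, np.pi, 50)            # hidden-unit offsets b_i (assumed fixed)
features = np.sign(xs[:, None] + bs[None, :])  # one column sign(x + b_i) per unit
w, *_ = np.linalg.lstsq(features, target, rcond=None)

max_err = np.max(np.abs(features @ w - target))
# The fit is piecewise constant with jumps at the -b_i; adding more
# hidden units shrinks the intervals and drives the error to zero.
```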
---
# Loss functions
The parameters (e.g., $\mathbf{W}\_k$ and $\mathbf{b}\_k$ for each layer $k$) of an MLP $f(\mathbf{x}; \theta)$ are learned by minimizing a loss function $\mathcal{L}(\theta)$ over a dataset $\mathbf{d} = \\\{ (\mathbf{x}\_j, \mathbf{y}\_j) \\\}$ of input-output pairs.
The loss function is derived from the likelihood:
- For classification, assuming a categorical likelihood, the loss is the cross-entropy $\mathcal{L}(\theta) = -\frac{1}{N} \sum\_{(\mathbf{x}\_j, \mathbf{y}\_j) \in \mathbf{d}} \sum\_{i=1}^C y\_{ji} \log f\_{i}(\mathbf{x}\_j; \theta)$.
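For concreteness, a small sketch of this cross-entropy with hypothetical predicted probabilities $f\_i(\mathbf{x}\_j; \theta)$ and one-hot targets $y\_{ji}$ (the numbers are made up for illustration):

```python
import numpy as np

# Predicted class probabilities for N=2 examples, C=3 classes (hypothetical values)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
# One-hot targets y_ji
onehot = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

# L(theta) = -(1/N) sum_j sum_i y_ji log f_i(x_j; theta)
cross_entropy = -np.mean(np.sum(onehot * np.log(probs), axis=1))
# Only the log-probability assigned to the true class of each example contributes.
```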
then the formal convergence of SGD can be proved, under appropriate assumptions.
- If training is limited to a single pass over the data, then SGD directly minimizes the **expected** risk.
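A toy illustration of this point (an assumed setup, not from the slides): when every sample is drawn fresh and seen exactly once, each stochastic gradient is an unbiased estimate of the expected-risk gradient, so SGD descends the expected risk directly.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])  # assumed ground-truth linear model

w = np.zeros(2)
lr = 0.05
for _ in range(5000):             # single pass: each sample is used once
    x = rng.normal(size=2)        # fresh draw from the data distribution
    y = w_true @ x                # noise-free linear target
    grad = 2.0 * (w @ x - y) * x  # gradient of the squared error on this sample
    w -= lr * grad
# w approaches w_true, the minimizer of the expected risk.
```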
---
class: middle
A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield strong generalization performance (in terms of excess error) despite being poor optimization algorithms for minimizing the empirical risk.
---
class: middle
## Backpropagation
- In Leibniz notations, the **chain rule** states that
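For a composition $y = f(u)$ with $u = g(x)$, the standard Leibniz form is

$$\frac{\text{d} y}{\text{d} x} = \frac{\text{d} y}{\text{d} u} \frac{\text{d} u}{\text{d} x}.$$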