
Commit ca27ebe ("save"), 1 parent: 417d14c

2 files changed: +128, -124 lines

lecture2.md (+128, -124)
@@ -297,12 +297,6 @@ Draw the NN diagram.
 
 ---
 
-class: middle, center
-
-(demo)
-
----
-
 class: middle
 
 ## Output layers
@@ -316,7 +310,133 @@ for $i=1, ..., C$.
 
 ---
 
-# Training neural networks
+class: middle, center
+
+(demo)
+
+---
+
+class: middle
+
+## Expressiveness
+
+Let us consider the 1-hidden layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$ This model can approximate .bold[any] smooth 1D function, provided enough hidden units.
+
+---
+
+class: middle
+
+.center[![](figures/lec2/ua-0.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-1.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-2.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-3.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-4.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-5.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-6.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-7.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-8.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-9.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-10.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-11.png)]
+
+---
+
+class: middle
+count: false
+
+.center[![](figures/lec2/ua-12.png)]
+
+---
+
+class: middle
+
+.bold[Universal approximation theorem.] (Cybenko, 1989; Hornik et al., 1991) Let $\sigma(\cdot)$ be a
+bounded, non-constant continuous function. Let $I\_p$ denote the $p$-dimensional hypercube, and
+$C(I\_p)$ denote the space of continuous functions on $I\_p$. Given any $f \in C(I\_p)$ and $\epsilon > 0$, there exists $q > 0$ and $v\_i, w\_i, b\_i, i=1, ..., q$ such that
+$$F(x) = \sum\_{i \leq q} v\_i \sigma(w\_i^T x + b\_i)$$
+satisfies
+$$\sup\_{x \in I\_p} |f(x) - F(x)| < \epsilon.$$
+
+- It guarantees that even a single hidden-layer network can represent any classification
+problem in which the boundary is locally linear (smooth);
+- It does not inform about good/bad architectures, nor how they relate to the optimization procedure.
+- The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno, 1993).
+
+---
+
+class: middle
+
+# Training
+
+---
+
+# Loss functions
 
 The parameters (e.g., $\mathbf{W}\_k$ and $\mathbf{b}\_k$ for each layer $k$ of $f(\mathbf{x}; \theta)$) are learned by minimizing a loss function $\mathcal{L}(\theta)$ over a dataset $\mathbf{d} = \\\{ (\mathbf{x}\_j, \mathbf{y}\_j) \\\}$ of input-output pairs.
 
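An aside on the expressiveness slides added above: the model $f(x) = \sum w\_i \text{ReLU}(x + b\_i)$ can be fit to a smooth 1D target quite directly. Below is a minimal numpy sketch, not part of the committed lecture2.md; fixing the biases $b\_i$ on a grid and fitting only the weights $w\_i$ by least squares are simplifications chosen for this sketch, not the lecture's demo.

```python
# Fit f(x) = sum_i w_i * ReLU(x + b_i) to a smooth 1D target by fixing the
# biases b_i on a grid and solving for the weights w_i by least squares.
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)         # 1D inputs
y = np.sin(x)                                  # smooth target to approximate

q = 32                                         # number of hidden units
b = np.linspace(-2.0 * np.pi, 0.0, q)          # fixed biases: one kink per unit at x = -b_i
H = np.maximum(x[:, None] + b[None, :], 0.0)   # hidden activations ReLU(x + b_i), shape (200, q)

w, *_ = np.linalg.lstsq(H, y, rcond=None)      # least-squares fit of the output weights w_i
y_hat = H @ w                                  # f(x) evaluated on the grid

print(f"q = {q}, max |f(x) - y|: {np.max(np.abs(y_hat - y)):.4f}")
```

With more hidden units the grid of kinks gets finer and the piecewise-linear fit tightens, which is the progression the ua-0 to ua-12 figures animate.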
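In the same spirit, a rough 1D illustration of the form $F(x) = \sum\_{i \leq q} v\_i \sigma(w\_i^T x + b\_i)$ used in the universal approximation theorem, with a bounded sigmoid $\sigma$. The random inner parameters and the least-squares fit of the $v\_i$ are choices made for this sketch only; the theorem is an existence statement and says nothing about how to find the parameters.

```python
# Approximate a continuous 1D function with F(x) = sum_{i<=q} v_i * sigma(w_i x + b_i),
# using random inner parameters and least squares for the outer weights v_i.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))             # bounded, non-constant, continuous

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 400)                 # a 1D interval standing in for I_1
f = np.cos(3.0 * x)                             # some continuous target f

for q in (4, 16, 64, 256):
    w = rng.normal(scale=10.0, size=q)          # random inner weights w_i
    b = rng.uniform(-10.0, 10.0, size=q)        # random biases b_i
    Phi = sigma(x[:, None] * w[None, :] + b[None, :])   # features, shape (400, q)
    v, *_ = np.linalg.lstsq(Phi, f, rcond=None) # outer weights v_i
    F = Phi @ v
    print(f"q = {q:3d}, sup |f - F| ~ {np.max(np.abs(f - F)):.4f}")
```

The reported sup error typically shrinks as $q$ grows, in line with the theorem, although nothing in this construction is guaranteed to be monotone.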
@@ -330,9 +450,7 @@ Switch to blackboard.
 
 ---
 
-class: middle
-
-## Gradient descent
+# Gradient descent
 
 To minimize $\mathcal{L}(\theta)$, **gradient descent** uses local linear information to iteratively move towards a (local) minimum.
 
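To make the # Gradient descent slide above concrete: a minimal numpy sketch, not part of the committed slides, of minimizing a loss $\mathcal{L}(\theta)$ averaged over a dataset of input-output pairs $(\mathbf{x}\_j, \mathbf{y}\_j)$ with the update $\theta \leftarrow \theta - \gamma \nabla\_\theta \mathcal{L}(\theta)$. The squared-error loss, the linear model, and the learning rate $\gamma = 0.1$ are assumptions made for this example.

```python
# Gradient descent on L(theta) = mean squared error over a dataset {(x_j, y_j)},
# for a linear model y ~ theta[0] * x + theta[1], with the gradient written by hand.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=100)   # synthetic input-output pairs

theta = np.zeros(2)                              # parameters: [slope, intercept]
gamma = 0.1                                      # learning rate

for step in range(500):
    residual = theta[0] * x + theta[1] - y       # f(x_j; theta) - y_j
    # Gradient of L(theta) = mean(residual^2) with respect to theta.
    grad = np.array([2.0 * np.mean(residual * x), 2.0 * np.mean(residual)])
    theta = theta - gamma * grad                 # local linear step downhill

print("estimated [slope, intercept]:", theta)    # should be close to [3.0, 0.5]
```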
@@ -845,120 +963,6 @@ Don't forget the magic trick!
 
 ---
 
-# Universal approximation
-
-Let us consider the 1-hidden layer MLP
-$$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$
-This model can approximate *any* smooth 1D function, provided enough hidden units.
-
----
-
-class: middle
-
-.center[![](figures/lec2/ua-0.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-1.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-2.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-3.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-4.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-5.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-6.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-7.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-8.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-9.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-10.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-11.png)]
-
----
-
-class: middle
-count: false
-
-.center[![](figures/lec2/ua-12.png)]
-
----
-
-class: middle
-
-.bold[Universal approximation theorem.] (Cybenko 1989; Hornik et al, 1991) Let $\sigma(\cdot)$ be a
-bounded, non-constant continuous function. Let $I\_p$ denote the $p$-dimensional hypercube, and
-$C(I\_p)$ denote the space of continuous functions on $I\_p$. Given any $f \in C(I\_p)$ and $\epsilon > 0$, there exists $q > 0$ and $v\_i, w\_i, b\_i, i=1, ..., q$ such that
-$$F(x) = \sum\_{i \leq q} v\_i \sigma(w\_i^T x + b\_i)$$
-satisfies
-$$\sup\_{x \in I\_p} |f(x) - F(x)| < \epsilon.$$
-
-- It guarantees that even a single hidden-layer network can represent any classification
-problem in which the boundary is locally linear (smooth);
-- It does not inform about good/bad architectures, nor how they relate to the optimization procedure.
-- The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno, 1993).
-
----
-
 class: middle
 
 .center.circle.width-30[![](figures/lec2/lecun.jpg)]

pdf/lec2.pdf (95.4 KB)

Binary file not shown.

0 commit comments
