lecture2.md (+128 −124)
@@ -297,12 +297,6 @@ Draw the NN diagram.
 
 ---
 
-class: middle, center
-
-(demo)
-
----
-
 class: middle
 
 ## Output layers
@@ -316,7 +310,133 @@ for $i=1, ..., C$.
 
 ---
 
-# Training neural networks
+class: middle, center
+
+(demo)
+
+---
+
+class: middle
+
+## Expressiveness
+
+Let us consider the 1-hidden layer MLP $$f(x) = \sum w\_i \text{ReLU}(x + b\_i).$$ This model can approximate .bold[any] smooth 1D function, provided enough hidden units.
+
+---
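For instance, the construction can be checked numerically. The sketch below is a minimal illustration: it fixes the offsets $b\_i$ on a grid and solves for the weights $w\_i$ of $f(x) = \sum w\_i \text{ReLU}(x + b\_i)$ by least squares against an arbitrary smooth target. The target function, grid, and number of hidden units are illustrative assumptions, and least squares only shows that good weights exist; the lecture itself trains such models with gradient descent later.

```python
import numpy as np

# Approximate a smooth 1D function with f(x) = sum_i w_i ReLU(x + b_i).
# The target function, grid, and number of hidden units are arbitrary illustrative choices.
x = np.linspace(-3, 3, 500)
target = np.sin(2 * x) + 0.5 * x                 # some smooth 1D function

q = 50                                           # number of hidden units
b = np.linspace(-3, 4, q)                        # fixed offsets b_i (slightly wider than the domain)
H = np.maximum(x[:, None] + b[None, :], 0.0)     # hidden activations ReLU(x + b_i), shape (500, q)

# Solve for the output weights w_i by least squares; this only shows that such
# weights exist, it is not how the lecture trains the model.
w, *_ = np.linalg.lstsq(H, target, rcond=None)
approx = H @ w

print("max |f(x) - target(x)| =", np.abs(approx - target).max())
```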
+
+class: middle
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+count: false
+
+.center[]
+
+---
+
+class: middle
+
+.bold[Universal approximation theorem.] (Cybenko, 1989; Hornik et al., 1991) Let $\sigma(\cdot)$ be a
+bounded, non-constant continuous function. Let $I\_p$ denote the $p$-dimensional hypercube, and
+$C(I\_p)$ denote the space of continuous functions on $I\_p$. Given any $f \in C(I\_p)$ and $\epsilon > 0$, there exists $q > 0$ and $v\_i, w\_i, b\_i, i=1, ..., q$ such that
+$$F(x) = \sum\_{i \leq q} v\_i \sigma(w\_i^T x + b\_i)$$
+satisfies
+$$\sup\_{x \in I\_p} |f(x) - F(x)| < \epsilon.$$
+
+- It guarantees that even a single hidden-layer network can represent any classification
+problem in which the boundary is locally linear (smooth);
+- It does not inform about good/bad architectures, nor how they relate to the optimization procedure.
+- The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno et al., 1993).
+
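A way to build intuition for the one-dimensional case: a difference of two steep sigmoids is approximately an indicator function, so two hidden units make a "bump" of any height anywhere, and a sum of bumps over a fine partition of the interval tracks any continuous $f$. This is a standard illustration of the construction rather than a statement from the slides; $k$ denotes a large slope and $a\_0 < a\_1 < \dots < a\_q$ a partition.

```latex
% Intuition sketch for the 1D case: two steep sigmoids make an approximate indicator,
\[
  \sigma\big(k(x - a)\big) - \sigma\big(k(x - b)\big) \;\approx\; \mathbf{1}_{(a,b)}(x),
  \qquad a < b, \quad k \gg 1,
\]
% and summing such bumps over a partition a_0 < a_1 < ... < a_q, each weighted by a
% sample of f, yields a network of the form F(x) = sum_i v_i sigma(w_i x + b_i):
\[
  F(x) \;=\; \sum_{i=1}^{q} f(a_i)\,
    \Big[\sigma\big(k(x - a_{i-1})\big) - \sigma\big(k(x - a_i)\big)\Big]
  \;\approx\; f(x).
\]
```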
+---
+
+class: middle
+
+# Training
+
+---
+
+# Loss functions
 
 The parameters (e.g., $\mathbf{W}\_k$ and $\mathbf{b}\_k$ for each layer $k$) of $f(\mathbf{x}; \theta)$ are learned by minimizing a loss function $\mathcal{L}(\theta)$ over a dataset $\mathbf{d} = \\\{ (\mathbf{x}\_j, \mathbf{y}\_j) \\\}$ of input-output pairs.
 
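For a concrete instance of such a loss, the sketch below computes an average cross-entropy $\mathcal{L}(\theta) = \frac{1}{N} \sum\_j \ell(\mathbf{y}\_j, f(\mathbf{x}\_j; \theta))$ over toy predictions; the choice of cross-entropy and the numbers are illustrative assumptions, not the lecture's blackboard derivation.

```python
import numpy as np

# Empirical loss L(theta) = 1/N * sum_j loss(y_j, f(x_j; theta)), here with
# cross-entropy as the per-example loss (an illustrative assumption).
def cross_entropy_loss(probs, labels):
    """probs: (N, C) predicted class probabilities; labels: (N,) integer classes."""
    eps = 1e-12                                   # guard against log(0)
    nll = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return nll.mean()

# Toy usage with made-up predictions for N = 3 examples and C = 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 0])
print(cross_entropy_loss(probs, labels))
```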
@@ -330,9 +450,7 @@ Switch to blackboard.
 
 ---
 
-class: middle
-
-## Gradient descent
+# Gradient descent
 
 To minimize $\mathcal{L}(\theta)$, **gradient descent** uses local linear information to iteratively move towards a (local) minimum.
 
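The update rule is $\theta \leftarrow \theta - \gamma \nabla\_\theta \mathcal{L}(\theta)$ for some step size $\gamma$. The sketch below applies it to a stand-in problem, a linear model with a mean-squared-error loss; the model, data, and step size are illustrative assumptions rather than the lecture's example.

```python
import numpy as np

# Gradient descent on a stand-in problem: a linear model with MSE loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
gamma = 0.1                                      # step size (learning rate)

for _ in range(500):
    grad = X.T @ (X @ theta - y) / len(y)        # gradient of L(theta) = 1/(2N) ||X theta - y||^2
    theta = theta - gamma * grad                 # theta <- theta - gamma * grad L(theta)

print(theta)                                     # converges towards true_theta
```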
@@ -845,120 +963,6 @@ Don't forget the magic trick!
 
 ---
 
-# Universal approximation
-
-Let us consider the 1-hidden layer MLP
-$$f(x) = \sum w\_i \text{ReLU}(x + b_i).$$
-This model can approximate *any* smooth 1D function, provided enough hidden units.
-
----
-
-class: middle
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-count: false
-
-.center[]
-
----
-
-class: middle
-
-.bold[Universal approximation theorem.] (Cybenko 1989; Hornik et al, 1991) Let $\sigma(\cdot)$ be a
-bounded, non-constant continuous function. Let $I\_p$ denote the $p$-dimensional hypercube, and
-$C(I\_p)$ denote the space of continuous functions on $I\_p$. Given any $f \in C(I\_p)$ and $\epsilon > 0$, there exists $q > 0$ and $v\_i, w\_i, b\_i, i=1, ..., q$ such that
-$$F(x) = \sum\_{i \leq q} v\_i \sigma(w\_i^T x + b\_i)$$
-satisfies
-$$\sup\_{x \in I\_p} |f(x) - F(x)| < \epsilon.$$
-
-- It guarantees that even a single hidden-layer network can represent any classification
-problem in which the boundary is locally linear (smooth);
-- It does not inform about good/bad architectures, nor how they relate to the optimization procedure.
-- The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno, 1993).