@@ -397,18 +402,13 @@ or $O(\log_k(n))$ in the case of dilated convolutions [18], increasing the length
between any two positions in the network. Convolutional layers are generally more expensive than
recurrent layers, by a factor of $k$.
-As a side benefit, self-attention could yield more interpretable models. We inspect attention distributions
-from our models and present and discuss examples in the appendix. Not only do individual attention
-heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic
-and semantic structure of the sentences.
-
---
class: middle
## A toy example
-To illustrate the behavior of the attention mechanism, we consider a toy problem with 1D sequences composed of two triangular and two rectangular patterns. The target sequence averages the heights in each pair of shapes.
+To illustrate the behavior of the attention mechanism, we consider a toy problem with 1d sequences composed of two triangular and two rectangular patterns. The target sequence averages the heights in each pair of shapes.
.center.width-100[]
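For concreteness, a minimal NumPy sketch of how such (input, target) pairs could be generated; the function name, pattern width, and height range are illustrative assumptions, not the actual data-generation code behind the figure.

```python
import numpy as np

def toy_pair(length=100, width=10, seed=0):
    """One (input, target) pair: two triangular and two rectangular patterns with
    random heights at random, non-overlapping positions of a 1d sequence. In the
    target, each pattern keeps its position and kind, but its height is replaced
    by the average height of the two patterns of the same kind."""
    rng = np.random.default_rng(seed)
    x, y = np.zeros(length), np.zeros(length)
    slots = rng.choice(length // width, size=4, replace=False) * width
    heights = rng.uniform(1.0, 5.0, size=4)
    tri_mean, rect_mean = heights[:2].mean(), heights[2:].mean()
    ramp = np.linspace(0.0, 1.0, width)
    for i, (p, h) in enumerate(zip(slots, heights)):
        if i < 2:  # the two triangular patterns
            x[p:p + width], y[p:p + width] = h * ramp, tri_mean * ramp
        else:      # the two rectangular patterns
            x[p:p + width], y[p:p + width] = h, rect_mean
    return x, y
```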
@@ -443,6 +443,8 @@ $$\begin{aligned}
\end{aligned}$$
for any permutation $\sigma$ of the key-value pairs.
+(It is also permutation-equivariant: applying a permutation $\sigma$ to the queries applies the same permutation to the outputs.)
+
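Both properties are easy to check numerically with a plain scaled dot-product attention written in NumPy (an illustrative sketch, not the code used in these slides):

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d)) v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))

sigma = rng.permutation(6)   # permuting the key-value pairs leaves the output unchanged
assert np.allclose(attention(q, k, v), attention(q, k[sigma], v[sigma]))

tau = rng.permutation(4)     # permuting the queries permutes the output the same way
assert np.allclose(attention(q, k, v)[tau], attention(q[tau], k, v))
```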
---
class: middle
@@ -527,10 +529,6 @@ The encoders start by processing the input sequence. The output of the top encod
.footnote[Credits: Jay Alammar, [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).]
-???
-
-R: Check UDL
-
---
class: middle
@@ -543,10 +541,6 @@ The output of each step is fed to the bottom decoder in the next time step, and
.footnote[Credits: Jay Alammar, [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).]
-???
-
-R: Check UDL
-
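A minimal sketch of this encode-once, decode-step-by-step loop, built on `torch.nn.Transformer` for illustration; the vocabulary size, dimensions, and start-token convention are placeholders, not the model discussed in the slides:

```python
import torch
import torch.nn as nn

vocab, d = 100, 64                                  # placeholder sizes
embed = nn.Embedding(vocab, d)
model = nn.Transformer(d_model=d, nhead=4, batch_first=True)
to_logits = nn.Linear(d, vocab)

src = torch.randint(vocab, (1, 10))                 # a dummy tokenized input sequence
memory = model.encoder(embed(src))                  # the encoders process the input once

tokens = torch.tensor([[1]])                        # start-of-sequence token (id 1 here)
for _ in range(20):                                 # greedy decoding, one token per step
    tgt = embed(tokens)
    mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
    out = model.decoder(tgt, memory, tgt_mask=mask) # decoders attend to the encoder memory
    next_token = to_logits(out[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1) # output fed back in at the next time step
```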
---
class: middle
@@ -674,6 +668,16 @@ Large models also enjoy better sample efficiency than small models.
---
+class: middle
+
+## Conversational agents
+
+.center.width-70[]
+
+All modern conversational agents are based on the same transformer models, scaled up to billions of parameters, trillions of training tokens, and thousands of petaflop/s-days of compute.
+
+---
+
class: middle
count: false
@@ -709,8 +713,6 @@ class: middle
Just like text transformers, vision transformers learn representations of the input image that can be used for various tasks, such as image classification, object detection, and image generation.
-
-
---
class: middle
@@ -719,7 +721,15 @@ class: middle
.center.width-100[]
-.center[Segment anything (Kirillov et al., 2024) combines a vision transformer with a prompt encoder to produce masks with a transformer-based decoder.]
+.center[Segment anything (Kirillov et al., 2023) combines a vision transformer with a prompt encoder to produce masks with a transformer-based decoder.]