class: middle, center, title-slide
Lecture: Communication
Prof. Gilles Louppe
[email protected]
Can you talk to an artificial agent? Can it understand what you say?
- Machine translation
- Speech recognition
- Text-to-speech synthesis
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.kol-1-3[Machine translation:][Hello, my name is HAL.][$\rightarrow$][Bonjour, mon nom est HAL.]
.kol-1-3[Speech recognition:][.width-100[]][$\rightarrow$][Hello, my name is HAL.]
.kol-1-3[Text-to-speech synthesis:][Hello, my name is HAL.][$\rightarrow$][.width-100[]]
class: middle
class: middle
Automatic translation of text from one natural language (the source) to another (the target), while preserving the intended meaning.
.exercise[How would you engineer a machine translation system?]
Expect the students to come up with a dictionary-based solution.
class: middle
.center[Natural languages are not 1:1 mappings of each other!]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.center[To obtain a correct translation, one must decide
whether "it" refers to the soccer ball or to the window.
Therefore, one must understand physics as well as language.]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Translation systems must model the source and target languages, but systems vary in the type of models they use.
- Some systems analyze the source language text all the way into an interlingua knowledge representation and then generate sentences in the target language from that representation.
- Other systems are based on a transfer model. They keep a database of translation rules and, whenever a rule matches, they translate directly. Transfer can occur at the lexical, syntactic or semantic level.
class: middle
To translate an English sentence $e$ into a French sentence $f$, we seek the most probable translation $$f^* = \arg \max_f P(f|e).$$
- The translation model $P(f|e)$ is learned from a bilingual corpus, i.e. a collection of parallel texts, each an English/French pair.
- Most of the English sentences to be translated will be novel, but will be composed of phrases that have been seen before.
- The corresponding French phrases will be reassembled to form a French sentence that makes sense.
phrase = locution
class: middle
Given an English source sentence $e$, a French translation $f$ is obtained in three steps:
- Break $e$ into phrases $e_1, ..., e_n$.
- For each phrase $e_i$, choose a corresponding French phrase $f_i$. We use the notation $P(f_i|e_i)$ for the phrasal probability that $f_i$ is a translation of $e_i$.
- Choose a permutation of the phrases $f_1, ..., f_n$. For each $f_i$, we choose a distortion $$d_i = \text{start}(f_i) - \text{end}(f_{i-1}) - 1,$$ which is the number of words that phrase $f_i$ has moved with respect to $f_{i-1}$; positive for moving to the right, negative for moving to the left.
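As a rough illustration of the distortion formula $d_i = \text{start}(f_i) - \text{end}(f_{i-1}) - 1$, the sketch below computes the distortions from the word positions of each French phrase. The function name `distortions`, the 1-indexed inclusive spans and the convention $\text{end}(f_0) = 0$ are assumptions made for this example.

```python
def distortions(spans):
    """Compute d_i = start(f_i) - end(f_{i-1}) - 1 for each French phrase.

    spans[i] is the (start, end) word position of f_i in the French sentence
    (1-indexed, inclusive), listed in the order of the English phrases.
    Assumption: end(f_0) = 0, so an unmoved first phrase has distortion 0.
    """
    result = []
    prev_end = 0
    for start, end in spans:
        result.append(start - prev_end - 1)
        prev_end = end
    return result


# Hypothetical alignment: 7 English phrases mapped to French word spans,
# where the 3rd and 4th French phrases are swapped (adjective after noun).
spans = [(1, 3), (4, 4), (6, 6), (5, 5), (7, 8), (9, 9), (10, 11)]
print(distortions(spans))
```

A phrase that directly follows its predecessor gets distortion 0; the swapped pair produces one positive and one negative value.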
class: middle
class: middle
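We define the probability $P(f, d|e)$ that the sequence of phrases $f$ with distortions $d$ is a translation of the sequence of phrases $e$.
Assuming that each phrase translation and each distortion is independent of the others, we have $$P(f, d|e) = \prod_i P(f_i|e_i) P(d_i).$$
- The best $f$ and $d$ cannot be found through enumeration because of the combinatorial explosion.
- Instead, local beam search with a heuristic that estimates probability has proven effective at finding a nearly-most-probable translation.
With maybe 100 French phrases for each English phrase in the corpus, there are $100^n$ different $n$-phrase translations, and $n!$ reorderings for each of them.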
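To make the independence assumption concrete, here is a minimal sketch that scores one candidate translation as the product of phrasal probabilities $P(f_i|e_i)$ and distortion probabilities $P(d_i)$, and counts the candidates an exhaustive search would have to visit. The probability tables and the names `log_score` and `search_space_size` are hypothetical.

```python
import math


def log_score(phrase_pairs, distortions, phrase_prob, distortion_prob):
    """Log of prod_i P(f_i | e_i) * P(d_i), assuming independent factors."""
    total = 0.0
    for (e_i, f_i), d_i in zip(phrase_pairs, distortions):
        total += math.log(phrase_prob[(e_i, f_i)])
        total += math.log(distortion_prob[d_i])
    return total


def search_space_size(n_phrases, candidates_per_phrase=100):
    """Number of (translation, reordering) pairs for an n-phrase sentence."""
    return candidates_per_phrase ** n_phrases * math.factorial(n_phrases)


# Hypothetical toy tables, for illustration only.
phrase_prob = {("a", "un"): 0.5, ("wumpus", "wumpus"): 1.0}
distortion_prob = {0: 0.8}

score = log_score([("a", "un"), ("wumpus", "wumpus")], [0, 0],
                  phrase_prob, distortion_prob)
print(math.exp(score))
print(search_space_size(5))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is also why beam search implementations typically compare log scores.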
class: middle
All that remains is to learn the phrasal and distortion probabilities:
- Find parallel texts.
- Segment into sentences.
- Align sentences.
- Align phrases.
- Extract distortions.
- Improve estimates with expectation-maximization.
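Assuming the alignment steps have already produced phrase pairs and distortions, initial estimates can be obtained by simple counting. The sketch below shows only these relative-frequency estimates; the expectation-maximization refinement is not shown, and the function names are hypothetical.

```python
from collections import Counter


def estimate_phrase_probs(aligned_pairs):
    """Relative-frequency estimate of P(f | e) from (e, f) phrase pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(e for e, _ in aligned_pairs)
    return {(e, f): c / source_counts[e] for (e, f), c in pair_counts.items()}


def estimate_distortion_probs(observed_distortions):
    """Relative-frequency estimate of P(d) from extracted distortions."""
    counts = Counter(observed_distortions)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


# Hypothetical aligned phrase pairs and distortions.
pairs = [("a", "un"), ("a", "une"), ("a", "un"), ("wumpus", "wumpus")]
print(estimate_phrase_probs(pairs))
print(estimate_distortion_probs([0, 0, 1, -2, 1, 0, 0]))
```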
Modern machine translation systems are all based on neural networks of various types, often built as compositions of
- recurrent networks for sequence-to-sequence learning,
- convolutional networks for modeling spatial dependencies,
- transformer networks.
class: middle
.grid[ .kol-1-2[
- Encoder: bidirectional RNN, producing a set of annotation vectors $h_i$.
- Decoder: attention-based.
class: middle
class: middle, center, black-slide
My name is HAL.]
Speech recognition can be viewed as an instance of the problem of finding the most likely sequence of state variables given a sequence of observations.
In this case, (hidden) state variables are the words and the observations are sounds.
- The input audio waveform from a microphone is converted into a sequence of fixed-size acoustic vectors $\mathbf{y}_{1:T}$ in a process called feature extraction.
- The decoder attempts to find the sequence of words $\mathbf{w}_{1:L} = w_1, ..., w_L$ which is the most likely given the sequence $\mathbf{y}_{1:T}$: $$\hat{\mathbf{w}}_{1:L} = \arg \max_{\mathbf{w}_{1:L}} P(\mathbf{w}_{1:L}|\mathbf{y}_{1:T})$$
class: middle
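By Bayes' rule, the decoding problem is equivalent to $$\hat{\mathbf{w}}_{1:L} = \arg \max_{\mathbf{w}_{1:L}} p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L}) P(\mathbf{w}_{1:L}),$$ where
- the likelihood $p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L})$ is the acoustic model;
- the prior $P(\mathbf{w}_{1:L})$ is the language model.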
class: middle
class: middle
- The feature extraction seeks to provide a compact representation $\mathbf{y}_{1:T}$ of the speech waveform.
- This form should minimize the loss of information that discriminates between words.
- One of the most widely used encoding schemes is based on mel-frequency cepstral coefficients (MFCCs).
class: middle
.center[MFCCs calculation.]
.footnote[Image credits: Giampiero Salvi, 2016. DT2118.]
- Pre-emphasis: amplify the high frequencies.
- Windowing: split the signal into short-time frames.
- FFT: calculate the frequency spectrum and compute the power spectrum (periodogram).
- Filter banks: apply triangular filters (around 40) on a Mel-scale to the power spectrum to extract frequency bands.
- The Mel-scale aims to mimic the non-linear human perception of sound, by being more discriminative at lower frequencies and less discriminative at higher frequencies.
- DCT: decorrelate the filter bank coefficients through a Discrete Cosine Transform.
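The steps listed above can be sketched with NumPy as follows. This is only an illustrative pipeline, not a reference implementation: the 25 ms frames with 10 ms step, the 0.97 pre-emphasis coefficient, the Hamming window, the 512-point FFT, the 12 retained coefficients and the unnormalized DCT-II are all assumed values, and the input is assumed to be a float array at least one frame long.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc(signal, sample_rate, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=40, n_ceps=12, pre_emphasis=0.97):
    # 1. Pre-emphasis: amplify the high frequencies.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # 2. Windowing: overlapping short-time frames, tapered by a Hamming window.
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)

    # 3. FFT: power spectrum (periodogram) of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 4. Filter banks: triangular filters equally spaced on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)
    log_energies = np.log(power @ fbank.T + 1e-10)

    # 5. DCT: decorrelate the log filter bank energies (DCT-II, dropping c_0).
    n = np.arange(n_filters)
    k = np.arange(1, n_ceps + 1)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * k[:, None] / n_filters)
    return log_energies @ basis.T


# One second of synthetic noise at 16 kHz gives one MFCC vector per frame.
rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(16000), 16000)
print(features.shape)
```

Each row of the result is one acoustic vector $\mathbf{y}_t$; production systems usually append energy and delta features, which are omitted here.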
class: middle
.center[Feature extraction from the signal in the time domain to MFCCs.]
.footnote[Image credits: Haytham Fayek, 2016.]
class: middle
A spoken word $w$ is decomposed into a sequence of $K_w$ basic sounds called base phones.
- This sequence is called its pronunciation $\mathbf{q}^{w}_{1:K_w} = q_1, ..., q_{K_w}$.
- Pronunciations are related to words through pronunciation models defined for each word.
- e.g. "Artificial intelligence" is pronounced /ɑːtɪˈfɪʃ(ə)l ɪnˈtɛlɪdʒ(ə)ns/.
class: middle
class: middle
class: middle
Each base phone $q$ is represented by a continuous-density HMM, where
- the transition probability parameter $a_{ij}$ corresponds to the probability of making the particular transition from state $s_i$ to $s_j$;
- the output sensor models are Gaussians $b_j(\mathbf{y}) = \mathcal{N}(\mathbf{y}; \mu^{(j)}, \Sigma^{(j)})$ and relate state variables $s_j$ to MFCCs $\mathbf{y}$.
class: middle
The full acoustic model can now be defined as a composition of pronunciation models with individual phone models:
$$p(\mathbf{y}_{1:T}|\mathbf{w}_{1:L}) = \sum_{\mathbf{Q}} P(\mathbf{y}_{1:T} | \mathbf{Q}) P(\mathbf{Q} | \mathbf{w}_{1:L})$$
where the summation is over all valid pronunciation sequences $\mathbf{Q}$ for $\mathbf{w}_{1:L}$.
class: middle
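Given the composite HMM formed by concatenating all the constituent pronunciations, the acoustic likelihood $P(\mathbf{y}_{1:T} | \mathbf{Q})$ is obtained by summing over all state sequences through the composite model.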
From this formulation, all model parameters can be efficiently estimated from a corpus of training utterances with expectation-maximization.
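The likelihood of an observation sequence under an HMM with Gaussian outputs can be computed without enumerating state sequences, using the forward recursion. The sketch below is a simplified illustration: it assumes scalar observations with one-dimensional Gaussians (real acoustic vectors are multivariate), an explicit initial state distribution `init`, and hypothetical function names.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a 1-D Gaussian, standing in for b_j(y)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def forward_likelihood(ys, init, trans, means, variances):
    """p(y_1..y_T) for an HMM with transitions trans[i][j] = a_ij.

    alpha[j] holds p(y_1..y_t, state at time t = j); summing the final alpha
    marginalizes over all state sequences.
    """
    n = len(init)
    alpha = [init[j] * gaussian_pdf(ys[0], means[j], variances[j])
             for j in range(n)]
    for y in ys[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n))
            * gaussian_pdf(y, means[j], variances[j])
            for j in range(n)
        ]
    return sum(alpha)


# Hypothetical 2-state left-to-right model.
init = [1.0, 0.0]
trans = [[0.6, 0.4], [0.0, 1.0]]
print(forward_likelihood([0.1, 2.5, 3.2], init, trans, [0.0, 3.0], [1.0, 1.0]))
```

For long utterances the recursion should be carried out in log space or with per-step rescaling, since the raw probabilities underflow quickly.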
class: middle
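The prior probability of a word sequence $\mathbf{w}_{1:L}$ is given by an N-gram language model, $$P(\mathbf{w}_{1:L}) = \prod_{l=1}^{L} P(w_l | w_{l-N+1}, ..., w_{l-1}).$$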
The N-gram probabilities are estimated from training texts by counting N-gram occurrences to form maximum likelihood estimates.
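As a small illustration of maximum likelihood estimation by counting, here is a sketch for the bigram case ($N = 2$). The start-of-sentence token `<s>`, the function names and the toy corpus are assumptions; no smoothing is applied, so unseen bigrams get probability zero.

```python
from collections import Counter


def bigram_mle(sentences):
    """Estimate P(w_l | w_{l-1}) = count(w_{l-1}, w_l) / count(w_{l-1})."""
    context_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {pair: c / context_counts[pair[0]]
            for pair, c in bigram_counts.items()}


def sequence_prob(words, probs):
    """Prior probability of a word sequence under the bigram model."""
    p = 1.0
    padded = ["<s>"] + words
    for prev, cur in zip(padded, padded[1:]):
        p *= probs.get((prev, cur), 0.0)
    return p


# Hypothetical two-sentence training corpus.
corpus = [["my", "name", "is", "HAL"], ["my", "name", "is", "Dave"]]
probs = bigram_mle(corpus)
print(sequence_prob(["my", "name", "is", "HAL"], probs))
```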
class: middle
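The composite model corresponds to an HMM, from which the most likely state sequence can be inferred with the Viterbi algorithm.
By construction, states relate to phones, phones to pronunciations, and pronunciations to words, so the most likely state sequence determines the recognized words.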
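The most likely state sequence of an HMM is commonly found with the Viterbi algorithm. The sketch below is a generic log-space version under simplifying assumptions: the emission log-probabilities are precomputed in a table `log_emit[t][j]`, and the model is a hypothetical two-state toy rather than a composite phone model.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence.

    log_init[j]     : log probability of starting in state j
    log_trans[i][j] : log a_ij
    log_emit[t][j]  : log b_j(y_t), precomputed for each time step
    """
    n = len(log_init)
    delta = [log_init[j] + log_emit[0][j] for j in range(n)]
    backpointers = []
    for t in range(1, len(log_emit)):
        prev = delta
        pointers, delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best_i)
            delta.append(prev[best_i] + log_trans[best_i][j] + log_emit[t][j])
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical 2-state model and emission probabilities for 3 observations.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.8), math.log(0.2)],
            [math.log(0.05), math.log(0.95)]]
print(viterbi(log_init, log_trans, log_emit))
```

In a recognizer, the decoded states would then be mapped back to phones, pronunciations and words; that mapping is not shown here.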
Modern speech recognition systems are now based on end-to-end deep neural network architectures trained on large corpora of data.
.grid[ .kol-2-3[
- Recurrent neural network with
  - one or more convolutional input layers,
  - followed by multiple recurrent layers,
  - and one fully connected layer before a softmax layer.
- Total of 35M parameters.
- Same architecture for both English and Mandarin.
]] ]
.footnote[Image credits: Amodei et al, 2015. arXiv:1512.02595.]
class: middle, black-slide
.center[<iframe width="640" height="400" src="" frameborder="0" volume="0" allowfullscreen></iframe> Deep Speech 2]
class: middle
class: middle
My name is HAL.][
The Tacotron 2 system is a sequence-to-sequence neural network architecture for text-to-speech. It consists of two components:
- a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence;
- a Wavenet vocoder which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames.
class: middle
.footnote[Image credits: Shen et al, 2017. arXiv:1712.05884.]
class: middle
- The Tacotron 2 architecture produces mel spectrograms as outputs, which remain to be synthesized as waveforms.
- This last step can be performed through another autoregressive neural model, such as Wavenet, to transform mel-scale spectrograms into high-fidelity waveforms.
.center[ .width-30[] .width-50[] ]
class: middle
Audio samples at
class: middle, black-slide
.center[<iframe width="640" height="400" src="" frameborder="0" volume="0" allowfullscreen></iframe> Google Assistant: Soon in your smartphone.]
- Natural language understanding is one of the most important subfields of AI.
- Machine translation, speech recognition and text-to-speech synthesis are instances of sequence-to-sequence problems.
- All three problems can be tackled with traditional statistical inference methods, but these require sophisticated engineering.
- State-of-the-art methods are now based on neural networks.
class: end-slide, center
count: false
The end.