Machine Translation
Machine Translation is the task of translating a sentence $x$ from one language (the source language) into a sentence $y$ in another language (the target language).
A brief history:
- 1990s-2010s: Statistical Machine Translation
- After 2014: Neural Machine Translation
Sequence-to-Sequence Model
The sequence-to-sequence model is an example of a Conditional Language Model
- Language Model because the decoder is predicting the next word of the target sentence $y$
- Conditional because its predictions are also conditioned on the source sentence $x$
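Concretely, the model decomposes the probability of the target sentence into per-word conditionals, each conditioned on the source sentence and the previously generated words:
$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$
where $T$ is the length of the target sentence.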
Multi-layer RNNs in practice
High-performing RNNs are usually multi-layer.
For example: In a 2017 paper, Britz et al. find that for Neural Machine Translation, 2 to 4 layers is best for the encoder RNN, and 4 layers is best for the decoder RNN.
Transformer-based networks (e.g., BERT) are usually deeper, like 12 or 24 layers.
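As a concrete illustration, here is a minimal PyTorch sketch of a 4-layer (stacked) LSTM encoder. The depth mirrors the deeper configurations mentioned above; the vocabulary, embedding, and hidden sizes are arbitrary values chosen for the example, not numbers from the paper.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10_000, 256, 512    # illustrative sizes

embedding = nn.Embedding(vocab_size, emb_dim)
# num_layers=4 stacks four LSTM layers on top of each other
encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=4, batch_first=True)

src = torch.randint(0, vocab_size, (8, 20))            # batch of 8 source sentences, 20 tokens each
outputs, (h_n, c_n) = encoder(embedding(src))          # h_n has shape (num_layers, batch, hidden_dim)
```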
Greedy decoding
We saw how to generate (or “decode”) the target sentence by taking the argmax on each step of the decoder. This is greedy decoding: take the most probable word at each step.
But greedy decoding has a problem: it has no way to undo decisions. Once a wrong word is chosen, the model cannot backtrack; it can only keep extending that prefix. A minimal sketch of the procedure follows.
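The sketch below assumes a hypothetical `step_fn` that maps the current decoded prefix (with the encoded source sentence implicit) to log-probabilities over the vocabulary; any trained decoder could play this role.

```python
from typing import Callable, List, Sequence

START, END = "<START>", "<END>"

def greedy_decode(step_fn: Callable[[List[str]], Sequence[float]],
                  vocab: List[str],
                  max_len: int = 50) -> List[str]:
    prefix = [START]
    for _ in range(max_len):
        log_probs = step_fn(prefix)                                # log P(y_t | y_1..y_{t-1}, x)
        best = max(range(len(vocab)), key=lambda i: log_probs[i])  # argmax: most probable word
        if vocab[best] == END:
            break
        prefix.append(vocab[best])                                 # committed: no way to undo this choice
    return prefix[1:]
```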
Exhaustive search decoding
Ideally, we want to find a (length $T$) translation $y$ that maximizes $P(y \mid x)$, i.e. the product of per-step conditionals given above.
We could try computing all possible sequences $y$
- This means that on each step t of the decoder, we’re tracking $V^t$ possible partial translations, where $V$ is vocab size
- This $O(V^T)$ complexity is far too expensive!
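For a sense of scale (an illustrative calculation, not a figure from the text): with a vocabulary of, say, $V = 50{,}000$ words and a target length of $T = 10$, exhaustive search would have to score $V^T = 50{,}000^{10} \approx 10^{47}$ candidate translations.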
Beam search decoding
Core idea: On each step of decoder, keep track of the $k$ most probable partial translations (which we call hypotheses)
- $k$ is the beam size (in practice around 5 to 10)
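Each hypothesis $y_1, \dots, y_t$ is scored by its log-probability under the model:
$\mathrm{score}(y_1, \dots, y_t) = \log P(y_1, \dots, y_t \mid x) = \sum_{i=1}^{t} \log P(y_i \mid y_1, \dots, y_{i-1}, x)$
All scores are negative, and a higher score is better.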
Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search.
In greedy decoding, usually we decode until the model produces an “<END>” token.
In beam search decoding, different hypotheses may produce <END> tokens on different timesteps.
When a hypothesis produces <END>, it is complete: we set it aside and continue expanding the remaining hypotheses. Once decoding stops, we have a list of completed hypotheses and must pick the one with the highest score.
Problem with this: each score is a sum of (negative) log-probabilities, so longer hypotheses tend to have lower scores.
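The standard fix is to normalize each hypothesis's score by its length, i.e. divide the summed log-probabilities by $t$, before selecting the best one. Below is a minimal sketch of beam search with this normalization, reusing the hypothetical `step_fn` interface from the greedy example; it illustrates the idea rather than a production decoder.

```python
from typing import Callable, List, Sequence, Tuple

START, END = "<START>", "<END>"

def beam_search(step_fn: Callable[[List[str]], Sequence[float]],
                vocab: List[str],
                beam_size: int = 5,
                max_len: int = 50) -> List[str]:
    # Each hypothesis is (prefix, sum of log-probabilities so far).
    beams: List[Tuple[List[str], float]] = [([START], 0.0)]
    completed: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            for i, lp in enumerate(log_probs):
                candidates.append((prefix + [vocab[i]], score + lp))
        # Keep only the k most probable partial translations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == END:
                completed.append((prefix, score))    # finished hypothesis: set it aside
            else:
                beams.append((prefix, score))
        if not beams:
            break
    completed.extend(beams)                          # include any hypotheses still unfinished
    # Length-normalize: divide each score by the number of decoded words.
    best, _ = max(completed, key=lambda c: c[1] / max(len(c[0]) - 1, 1))
    return [w for w in best if w not in (START, END)]
```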