Sequence-to-sequence with attention

Attention: in equations

  • We have encoder hidden states $h_1,...,h_N \in \R^h$

  • On timestep $t$, we have decoder hidden state $s_t \in \R^h$

  • We get the attention scores $e^t$ for this step:

    $$ e^t=[s^T_th_1,...,s^T_th_N] \in \R^N $$
  • We take softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and sums to 1)

    $$ \alpha^t=\mathrm{softmax}(e^t) \in \R^N $$
  • We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$

    $$ a_t=\sum^N_{i=1}\alpha_i^th_i \in \R^h $$
  • Finally, we concatenate the attention output $a_t$ with the decoder hidden state $s_t$ and proceed as in the non-attention seq2seq model (all four steps are sketched in code after this list)

    $$ [a_t;s_t] \in \R^{2h} $$
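Putting the four equations together, here is a minimal NumPy sketch of one decoder step with dot-product attention (the dimensions, variable names, and random toy data are illustrative assumptions, not from the notes):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector."""
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy sizes (hypothetical): N encoder positions, hidden size h
N, h = 5, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(N, h))    # encoder hidden states h_1..h_N, stacked as rows
s_t = rng.normal(size=h)       # decoder hidden state s_t at timestep t

e_t = H @ s_t                          # scores e^t = [s_t^T h_1, ..., s_t^T h_N]
alpha_t = softmax(e_t)                 # attention distribution, sums to 1
a_t = alpha_t @ H                      # attention output: weighted sum of rows of H
combined = np.concatenate([a_t, s_t])  # [a_t; s_t] in R^{2h}, used as in the non-attention model
print(alpha_t.sum(), combined.shape)   # -> 1.0 (8,)
```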

Attention is great

  • Attention significantly improves NMT (neural machine translation) performance
  • Attention provides a more “human-like” model of the MT process
  • Attention solves the bottleneck problem
    • The decoder can look directly at the source instead of relying on a single fixed-size encoding
  • Attention helps with the vanishing gradient problem
    • Provides shortcut to faraway states
  • Attention provides some interpretability
    • By inspecting attention distribution, we can see what the decoder was focusing on

Attention variants

  • Attention variants differ in how the scores $e^t$ are computed from the query $s_t$ and the values $h_1,...,h_N$:
    • Basic dot-product attention: $e_i=s^Th_i \in \R$ (the version used above; it assumes the query and values have the same dimensionality)
    • Multiplicative attention: $e_i=s^TWh_i \in \R$, where $W$ is a learned weight matrix
    • Additive attention: $e_i=v^T\tanh(W_1h_i+W_2s) \in \R$, where $W_1$, $W_2$ are learned weight matrices and $v$ is a learned weight vector
  • Each scoring function is sketched in code below
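A sketch of the three scoring functions, under assumed toy dimensions (d1 for values, d2 for the query, d3 for additive attention's inner dimension); the "learned" parameters here are random stand-ins, not trained weights:

```python
import numpy as np

def dot_score(s, H):
    """Basic dot-product attention: e_i = s^T h_i (query and value dims must match)."""
    return H @ s

def multiplicative_score(s, H, W):
    """Multiplicative attention: e_i = s^T W h_i, with a learned matrix W."""
    return H @ (W.T @ s)

def additive_score(s, H, W1, W2, v):
    """Additive attention: e_i = v^T tanh(W1 h_i + W2 s)."""
    return np.tanh(H @ W1.T + W2 @ s) @ v

# Toy shapes (hypothetical): N values of dim d1, query of dim d2, inner dim d3
N, d1, d2, d3 = 5, 4, 4, 3
rng = np.random.default_rng(1)
H, s = rng.normal(size=(N, d1)), rng.normal(size=d2)
W = rng.normal(size=(d2, d1))
W1, W2, v = rng.normal(size=(d3, d1)), rng.normal(size=(d3, d2)), rng.normal(size=d3)

for name, e in [("dot", dot_score(s, H)),
                ("multiplicative", multiplicative_score(s, H, W)),
                ("additive", additive_score(s, H, W1, W2, v))]:
    print(name, e.shape)  # each variant produces N scores, one per value vector
```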

Attention is a general Deep Learning technique

More general definition of attention:

  • Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.

The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.

Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
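As a concrete illustration of this general definition (the vectors here are arbitrary toy data, not tied to any encoder or decoder):

```python
import numpy as np

def attention(query, values):
    """Weighted sum of `values` (N x d), with weights determined by `query` (d,)."""
    scores = values @ query               # one score per value vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax: the attention distribution
    return weights @ values               # fixed-size (d,) summary, whatever N is

values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # an arbitrary set of vectors
query = np.array([4.0, 0.0])                             # query favors the first axis
print(attention(query, values))  # summary dominated by values aligned with the query
```

Note that the output has dimension d regardless of how many values there are, which is exactly the "fixed-size representation of an arbitrary set" property described above.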