Sequence-to-sequence with attention

Attention: in equations

  • We have encoder hidden states $h_1,...,h_N \in \R^h$

  • On timestep $t$, we have decoder hidden state $s_t \in \R^h$

  • We get the attention scores $e^t$ for this step:

    $$ e^t=[s^T_th_1,...,s^T_th_N] \in \R^N $$
  • We take softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and sums to 1)

    $$ \alpha^t=\mathrm{softmax}(e^t) \in \R^N $$
  • We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$

    $$ a_t=\sum^N_{i=1}\alpha_i^th_i \in \R^h $$
  • Finally, we concatenate the attention output $a_t$ with the decoder hidden state $s_t$ and proceed as in the non-attention seq2seq model (all four steps are sketched in code after this list)

    $$ [a_t;s_t] \in \R^{2h} $$
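Putting the four equations together, here is a minimal NumPy sketch of one decoder step with dot-product attention (the dimensions, variable names, and random toy data are illustrative assumptions, not from the notes):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector."""
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy sizes (hypothetical): N encoder positions, hidden size h
N, h = 5, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(N, h))    # encoder hidden states h_1..h_N, stacked as rows
s_t = rng.normal(size=h)       # decoder hidden state s_t at timestep t

e_t = H @ s_t                          # scores e^t = [s_t^T h_1, ..., s_t^T h_N]
alpha_t = softmax(e_t)                 # attention distribution, sums to 1
a_t = alpha_t @ H                      # attention output: weighted sum of rows of H
combined = np.concatenate([a_t, s_t])  # [a_t; s_t] in R^{2h}, used as in the non-attention model
print(alpha_t.sum(), combined.shape)   # -> 1.0 (8,)
```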

Attention is great

  • Attention significantly improves NMT (neural machine translation) performance
  • Attention provides a more “human-like” model of the MT process
  • Attention solves the bottleneck problem
    • The decoder can look directly at the source instead of relying on a single fixed-size encoding
  • Attention helps with the vanishing gradient problem
    • Provides shortcut to faraway states
  • Attention provides some interpretability
    • By inspecting attention distribution, we can see what the decoder was focusing on

Attention variants

  • Attention variants differ in how the scores $e^t$ are computed from the query $s_t$ and the values $h_1,...,h_N$:
    • Basic dot-product attention: $e_i=s^Th_i \in \R$ (the version used above; it assumes the query and values have the same dimensionality)
    • Multiplicative attention: $e_i=s^TWh_i \in \R$, where $W$ is a learned weight matrix
    • Additive attention: $e_i=v^T\tanh(W_1h_i+W_2s) \in \R$, where $W_1$, $W_2$ are learned weight matrices and $v$ is a learned weight vector
  • Each scoring function is sketched in code below
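A sketch of the three scoring functions, under assumed toy dimensions (d1 for values, d2 for the query, d3 for additive attention's inner dimension); the "learned" parameters here are random stand-ins, not trained weights:

```python
import numpy as np

def dot_score(s, H):
    """Basic dot-product attention: e_i = s^T h_i (query and value dims must match)."""
    return H @ s

def multiplicative_score(s, H, W):
    """Multiplicative attention: e_i = s^T W h_i, with a learned matrix W."""
    return H @ (W.T @ s)

def additive_score(s, H, W1, W2, v):
    """Additive attention: e_i = v^T tanh(W1 h_i + W2 s)."""
    return np.tanh(H @ W1.T + W2 @ s) @ v

# Toy shapes (hypothetical): N values of dim d1, query of dim d2, inner dim d3
N, d1, d2, d3 = 5, 4, 4, 3
rng = np.random.default_rng(1)
H, s = rng.normal(size=(N, d1)), rng.normal(size=d2)
W = rng.normal(size=(d2, d1))
W1, W2, v = rng.normal(size=(d3, d1)), rng.normal(size=(d3, d2)), rng.normal(size=d3)

for name, e in [("dot", dot_score(s, H)),
                ("multiplicative", multiplicative_score(s, H, W)),
                ("additive", additive_score(s, H, W1, W2, v))]:
    print(name, e.shape)  # each variant produces N scores, one per value vector
```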

Attention is a general Deep Learning technique

More general definition of attention:

  • Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.

The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.

Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
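As a concrete illustration of this general definition (the vectors here are arbitrary toy data, not tied to any encoder or decoder):

```python
import numpy as np

def attention(query, values):
    """Weighted sum of `values` (N x d), with weights determined by `query` (d,)."""
    scores = values @ query               # one score per value vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax: the attention distribution
    return weights @ values               # fixed-size (d,) summary, whatever N is

values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # an arbitrary set of vectors
query = np.array([4.0, 0.0])                             # query favors the first axis
print(attention(query, values))  # summary dominated by values aligned with the query
```

Note that the output has dimension d regardless of how many values there are, which is exactly the "fixed-size representation of an arbitrary set" property described above.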