Issues with recurrent models

Linear interaction distance

  • RNNs are unrolled “left-to-right”

  • Problem: RNNs take O(sequence length) steps for distant word pairs to interact

What does this O(sequence length) problem mean?

  • Hard to learn long-distance dependencies (because of gradient problems!)
  • Linear order of words is “baked in”; we already know linear order isn’t the right way to think about sentences…

Lack of parallelizability

  • Forward and backward passes have O(sequence length) unparallelizable operations
    • GPUs can perform a bunch of independent computations at once, but future RNN hidden states can’t be computed in full before past RNN hidden states have been computed

Self-Attention

Recall: Attention operates on queries, keys, and values.

  • Each query $q_i$, key $k_i$, and value $v_i$ is a vector of the same dimension: $$ q_i \in \R^d, \ k_i \in \R^d, \ v_i \in \R^d $$

In self-attention, the queries, keys, and values are drawn from the same source.

  • For example, if the output of the previous layer is $x_1, ..., x_T$ (one vector per word), we could let $v_i=k_i=q_i=x_i$ (that is, use the same vectors for all of them)

The (dot product) self-attention operation is as follows:

$$ e_{ij}=q_i^Tk_j \\ \alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{j'}\exp(e_{ij'})} \\ output_i=\sum_j\alpha_{ij}v_j $$
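
As a concrete illustration, here is a minimal sketch of this dot-product self-attention in PyTorch (the shapes `T`, `d` and the tensor `x` are illustrative placeholders, not from the notes):

```python
import torch

T, d = 5, 8                       # sequence length, model dimension
x = torch.randn(T, d)             # previous layer's outputs, one vector per word

# In self-attention, queries, keys, and values come from the same source
q, k, v = x, x, x

e = q @ k.T                       # e_ij = q_i^T k_j, shape (T, T)
alpha = torch.softmax(e, dim=-1)  # alpha_ij = exp(e_ij) / sum_j' exp(e_ij')
output = alpha @ v                # output_i = sum_j alpha_ij v_j, shape (T, d)
```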

Fixing the first self-attention problem: sequence order

Since self-attention doesn’t build in order information, we need to encode the order of the sentence in our keys, queries, and values.

Consider representing each sequence index as a vector:

$$ p_i \in \R^d, \ \text{for} \ i \in \{1,2,...,T\}, \ \text{are position vectors} $$

Easy to incorporate this info into our self-attention block: just add the $p_i$ to our inputs.

Let $\hat{v}_i,\hat{k}_i,\hat{q}_i$ be our old values, keys, and queries.

$$ v_i = \hat{v_i}+p_i \\ q_i = \hat{q_i}+p_i \\ k_i = \hat{k_i}+p_i $$

Position representation vectors through sinusoids

Sinusoidal position representations: concatenate sinusoidal functions of varying periods

Pros:

  • Periodicity indicates that maybe “absolute position” isn’t as important
  • Maybe can extrapolate to longer sequences as periods restart

Cons:

  • Not learnable; also the extrapolation doesn’t really work
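
For concreteness, here is a minimal sketch of sinusoidal position vectors and of adding them to the inputs. The frequencies follow the convention of the original Transformer paper, and the helper name `sinusoidal_positions` is my own:

```python
import torch

def sinusoidal_positions(T, d):
    """Build p_i in R^d for i = 1..T from sines/cosines of varying periods."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)        # (T, 1)
    freq = torch.exp(-torch.arange(0, d, 2, dtype=torch.float32)
                     * (torch.log(torch.tensor(10000.0)) / d))     # (d/2,)
    p = torch.zeros(T, d)
    p[:, 0::2] = torch.sin(pos * freq)   # even dimensions: sines
    p[:, 1::2] = torch.cos(pos * freq)   # odd dimensions: cosines
    return p

x = torch.randn(5, 8)                    # inputs x_1, ..., x_T
x = x + sinusoidal_positions(5, 8)       # add p_i to the inputs, as above
```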

Position representation vectors learned from scratch

Learned absolute position representations: Learn a matrix $p \in \R^{d \times T }$ , and let each $p_i$ be a column of that matrix.

Pros:

  • Flexibility: each position gets to be learned to fit the data

Cons:

  • Definitely can’t extrapolate to indices outside $1,...,T$
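
For comparison, a minimal sketch of learned absolute position representations using `torch.nn.Embedding` (which stores one row per position, i.e. the transpose of the $d \times T$ layout above):

```python
import torch
import torch.nn as nn

T, d = 5, 8
pos_emb = nn.Embedding(T, d)       # learnable matrix; row i is p_i
x = torch.randn(T, d)
positions = torch.arange(T)        # position indices for a length-T sequence
x = x + pos_emb(positions)         # add the learned p_i to the inputs
# Indices >= T have no embedding, so this cannot extrapolate to longer sequences.
```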

Fixing the second self-attention problem: Nonlinearities

Easy fix: add a feed-forward network to post-process each output vector
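
A minimal sketch of such a position-wise feed-forward network (the hidden width $4d$ is a common convention, not stated in the notes):

```python
import torch
import torch.nn as nn

d = 8
ff = nn.Sequential(
    nn.Linear(d, 4 * d),   # expand
    nn.ReLU(),             # the nonlinearity that stacked self-attention lacks
    nn.Linear(4 * d, d),   # project back to model dimension
)

outputs = torch.randn(5, d)    # self-attention outputs, one vector per position
processed = ff(outputs)        # applied independently to each position
```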

Fixing the third self-attention problem: Mask

To use self-attention in decoders, we need to ensure we can’t peek at the future.

To enable parallelization, we mask out attention to future words by setting those attention scores to $-\infty$ before the softmax.
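
A minimal sketch of this masking in PyTorch (tensor names are placeholders):

```python
import torch

T = 5
e = torch.randn(T, T)                                  # raw attention scores e_ij

# Position i may only attend to positions j <= i; mask the strict upper triangle
future = torch.triu(torch.ones(T, T), diagonal=1).bool()
e = e.masked_fill(future, float('-inf'))               # scores to the future -> -inf

alpha = torch.softmax(e, dim=-1)                       # future attention weights become 0
```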

Overall, the necessities for a self-attention building block are: the self-attention operation itself, position representations, nonlinearities (feed-forward networks), and masking (in decoders).

The Transformer Encoder

Key-Query-Value Attention

Let’s look at how key-query-value attention is computed, in matrices.

  • Let $X=[x_1;...;x_T] \in \R^{T \times d}$ be the row-wise stacking of the input vectors
  • Let $K, Q, V \in \R^{d \times d}$ be the key, query, and value weight matrices
  • First, note that $XK \in \R^{T \times d}, \ XQ \in \R^{T \times d}, \ XV \in \R^{T \times d}$
  • The output is defined as $output=\text{softmax}(XQ(XK)^T)XV \in \R^{T \times d}$ (see the sketch below)
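
A minimal sketch of this matrix form, with the weight matrices as plain random tensors rather than learned parameters:

```python
import torch

T, d = 5, 8
X = torch.randn(T, d)        # stacked inputs x_1, ..., x_T
K = torch.randn(d, d)        # key weight matrix
Q = torch.randn(d, d)        # query weight matrix
V = torch.randn(d, d)        # value weight matrix

scores = (X @ Q) @ (X @ K).T                        # XQ(XK)^T, shape (T, T)
output = torch.softmax(scores, dim=-1) @ (X @ V)    # shape (T, d)
```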

Multi-headed attention

Residual connections

See my article: Residual Connections (残差连接) | KurongBlog (705248010.github.io)

Layer normalization

Scaled Dot Product

“Scaled Dot Product” attention is a final variation to aid in Transformer training.

When the dimensionality $d$ becomes large, dot products between vectors tend to become large.

  • Because of this, inputs to the softmax function can be large, making the gradients small
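
The standard remedy, as in the original Transformer, is to divide the attention scores by $\sqrt{d/h}$ (with $h$ attention heads). For a single head, in the matrix notation used above:

$$ output=\text{softmax}\left(\frac{XQ(XK)^T}{\sqrt{d}}\right)XV $$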

The Transformer Encoder-Decoder

The Transformer Decoder

Cross-attention