Training an RNN Language Model

  • Get a big corpus of text which is a sequence of words $x^{(1)},...,x^{(T)}$
  • Feed into RNN-LM; compute output distribution $\hat y^{(t)}$ for every timestep $t$
  • Loss on step $t$ is the cross-entropy between $\hat y^{(t)}$ and the true next word $x^{(t+1)}$; average over all timesteps to get the overall loss (a minimal training-loop sketch follows)
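
A minimal PyTorch-style sketch of this training loop, assuming teacher forcing, a vanilla (Elman) RNN, and per-timestep cross-entropy loss; the class name `RNNLM` and all sizes are illustrative, not from the original notes.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal RNN language model: embed -> vanilla RNN -> project to vocabulary."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))   # h: (batch, T, hidden_dim)
        return self.proj(h)              # logits \hat{y}^{(t)}: (batch, T, vocab_size)

vocab_size = 10_000                      # illustrative sizes
model = RNNLM(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()        # cross-entropy against the true next word

# One training step on a toy batch: predict x^(t+1) from x^(1..t).
x = torch.randint(0, vocab_size, (32, 35))   # (batch, T) token ids
inputs, targets = x[:, :-1], x[:, 1:]        # shift by one timestep
optimizer.zero_grad()
logits = model(inputs)                       # \hat{y}^{(t)} for every t
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                              # backpropagation through time
optimizer.step()
```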

Backpropagation for RNNs
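
A sketch of the key fact (writing $W_h$ for the hidden-to-hidden weight matrix, a symbol not defined elsewhere in these notes): because $W_h$ is reused at every timestep, the gradient of the loss at step $t$ with respect to $W_h$ is the sum of its contributions from each timestep at which it appears, accumulated by backpropagating through time:

$$\frac{\partial J^{(t)}}{\partial W_h} \;=\; \sum_{i=1}^{t} \left.\frac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}$$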

Problems with Vanishing and Exploding Gradients

Vanishing gradient intuition
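
A sketch of the intuition (writing $h^{(t)}$ for the RNN hidden state at step $t$): by the chain rule, the gradient of a loss $J^{(i)}$ at step $i$ with respect to a much earlier hidden state $h^{(j)}$ contains a product of per-step Jacobians,

$$\frac{\partial J^{(i)}}{\partial h^{(j)}} \;=\; \frac{\partial J^{(i)}}{\partial h^{(i)}} \prod_{j < t \le i} \frac{\partial h^{(t)}}{\partial h^{(t-1)}}$$

If each factor is "small" (e.g., its largest singular value is below 1), the product shrinks exponentially as the distance $i - j$ grows, so the gradient signal from far away all but disappears.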

Why is vanishing gradient a problem?

  • The gradient signal from far away is lost because it is much smaller than the gradient signal from nearby
  • So model weights are updated only with respect to near effects, not long-term effects

If the gradient over a long distance is vanishingly small, the model cannot learn that long-distance dependency (e.g., making a later word agree with one that appeared many steps earlier), so it is unable to predict similar long-distance dependencies at test time.

Why is exploding gradient a problem?

  • This can cause bad updates: we take too large a step and reach a bad parameter configuration (with large loss); see the update rule below
  • In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)
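
Concretely, writing $J(\theta)$ for the training loss, the SGD update is

$$\theta^{\,\mathrm{new}} \;=\; \theta^{\,\mathrm{old}} - \alpha \, \nabla_\theta J(\theta),$$

so if $\nabla_\theta J(\theta)$ is enormous, the step is enormous too, no matter how reasonable the learning rate $\alpha$ is.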

Gradient clipping: solution for exploding gradient

  • Gradient clipping: if the norm of the gradient is greater than some threshold, scale it down before applying the SGD update
  • Intuition: take a step in the same direction, but a smaller step (see the sketch below)
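
A minimal sketch of the clipping rule, reusing the hypothetical `model`/`optimizer` names from the training sketch above; the threshold value is illustrative. (PyTorch's built-in `torch.nn.utils.clip_grad_norm_` implements the same idea.)

```python
import torch

def clip_gradients(parameters, threshold):
    """If the global gradient norm exceeds threshold, rescale every gradient
    so that the total norm equals threshold: same direction, smaller step."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
    if total_norm > threshold:
        scale = threshold / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)
    return total_norm

# Usage: between loss.backward() and optimizer.step() in the training loop.
# loss.backward()
# clip_gradients(model.parameters(), threshold=5.0)
# optimizer.step()
```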

Long Short-Term Memory RNNs

On step $t$, there is a hidden state $h^{(t)}$ and a cell state $c^{(t)}$

  • Both are vectors of length $n$
  • The cell stores long-term information
  • The LSTM can read, erase, and write information in the cell

The selection of which information is erased/written/read is controlled by three corresponding gates

  • The gates are also vectors of length $n$
  • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between
  • The gates are dynamic: their value is computed based on the current context

  • Forget gate: controls what is kept vs. forgotten from the previous cell state
  • Input gate: controls which parts of the new cell content are written to the cell
  • Output gate: controls which parts of the cell are output to the hidden state
  • New cell content: the new content to be written to the cell
  • Cell state: erase ("forget") some content from the previous cell state and write ("input") some new cell content
  • Hidden state: read ("output") some content from the cell (one common parameterization of these updates is sketched below)
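
Putting the six bullets together, one common parameterization is ($\sigma$ is the sigmoid, $\odot$ the elementwise product; the weight names $W$, $U$, $b$ are assumptions of this sketch, not from the original notes):

$$
\begin{aligned}
f^{(t)} &= \sigma\!\left(W_f\, h^{(t-1)} + U_f\, x^{(t)} + b_f\right) && \text{forget gate} \\
i^{(t)} &= \sigma\!\left(W_i\, h^{(t-1)} + U_i\, x^{(t)} + b_i\right) && \text{input gate} \\
o^{(t)} &= \sigma\!\left(W_o\, h^{(t-1)} + U_o\, x^{(t)} + b_o\right) && \text{output gate} \\
\tilde{c}^{(t)} &= \tanh\!\left(W_c\, h^{(t-1)} + U_c\, x^{(t)} + b_c\right) && \text{new cell content} \\
c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)} && \text{cell state} \\
h^{(t)} &= o^{(t)} \odot \tanh\!\left(c^{(t)}\right) && \text{hidden state}
\end{aligned}
$$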

How does LSTM solve vanishing gradients?

  • The LSTM architecture makes it easier for the RNN to preserve information over many timesteps: e.g., if the forget gate is set to 1 and the input gate to 0 for some cell dimension, the information in that dimension is preserved indefinitely.

  • The LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does give the model an easier way to learn long-distance dependencies (using an LSTM in the earlier sketch is a one-line change, shown below)
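
In code, switching the earlier hypothetical `RNNLM` sketch from a vanilla RNN to an LSTM is a one-line change; `nn.LSTM` manages both $h^{(t)}$ and $c^{(t)}$ internally, and the rest of the training loop (including gradient clipping) stays the same.

```python
# Inside RNNLM.__init__ from the earlier sketch, replace the vanilla RNN:
self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
# forward() is unchanged: nn.LSTM returns (output, (h_n, c_n)), and the
# existing `h, _ = self.rnn(...)` already discards the second return value.
```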