Lecture 10: Pretrained Models

Word structure and subword models We assume a fixed vocab of tens of thousands of words, built from the training set. All novel words seen at test time are mapped to a single UNK token. Finite vocabulary assumptions make even less sense in many languages, since many languages exhibit complex morphology, or word structure. The byte-pair encoding (BPE) algorithm. Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level....
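As a rough illustration of how BPE builds a subword vocabulary, here is a minimal sketch of the merge loop; the toy corpus, word frequencies, and number of merges are illustrative assumptions, not taken from the notes.

```python
# Minimal sketch of byte-pair encoding (BPE) merges on a toy corpus.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word split into characters plus an end marker.
corpus = {tuple("low") + ("</w>",): 5,
          tuple("lower") + ("</w>",): 2,
          tuple("newest") + ("</w>",): 6,
          tuple("widest") + ("</w>",): 3}

for _ in range(5):                      # learn 5 merges (illustrative count)
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)
```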

August 16, 2024 · 302 words · Kurong

Lecture 9: Transformer

Issues with recurrent models Linear interaction distance RNNs are unrolled “left-to-right” Problem: RNNs take O(sequence length) steps for distant word pairs to interact Why is this O(sequence length) interaction a problem? It is hard to learn long-distance dependencies (because of gradient problems!), and the linear order of words is “baked in”; we already know linear order isn’t the right way to think about sentences… Lack of parallelizability Forward and backward passes have O(sequence length) unparallelizable operations GPUs can perform a bunch of independent computations at once, but future RNN hidden states can’t be computed in full before past RNN hidden states have been computed Self-Attention Recall: Attention operates on queries, keys, and values....
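To make the contrast with the O(sequence length) sequential chain concrete, here is a minimal numpy sketch of single-head self-attention, where every position interacts with every other position in one parallel matrix product; the dimensions and random projection matrices are illustrative stand-ins for learned parameters.

```python
# Sketch of single-head self-attention: all pairwise interactions in one step,
# rather than O(sequence length) sequential RNN updates.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                         # sequence length, model dimension (illustrative)
X = rng.normal(size=(T, d))         # input word representations

W_q = rng.normal(size=(d, d))       # query/key/value projections (random stand-ins)
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)       # (T, T): every position scores every other position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
output = weights @ V                # (T, d) contextualized representations
print(output.shape)
```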

August 13, 2024 · 553 words · Kurong

Lecture 8: Attention

Sequence-to-sequence with attention Attention: in equations We have encoder hidden states $h_1,...,h_N \in \R^h$ On timestep $t$, we have decoder hidden state $s_t \in \R^h$ We get the attention scores $e^t$ for this step: $$ e^t=[s^T_th_1,...,s^T_th_N] \in \R^N $$ We take softmax to get the attention distribution $\alpha^t$ for this step (this is a probability distribution and sums to 1) $$ \alpha^t=softmax(e^t) \in \R^N $$ We use $\alpha^t$ to take a weighted sum of the encoder hidden states to get the attention output $a_t$ $$ a_t=\sum^N_{i=1}\alpha_i^t h_i \in \R^h $$ Finally we concatenate the attention output $a_t$ with the decoder hidden state $s_t$ and proceed as in the non-attention seq2seq model $$ [a_t;s_t] \in \R^{2h} $$ Attention is great Attention significantly improves NMT performance Attention provides a more “human-like” model of the MT process Attention solves the bottleneck problem Attention helps with the vanishing gradient problem by providing a shortcut to faraway states Attention provides some interpretability: by inspecting the attention distribution, we can see what the decoder was focusing on There are several attention variants Attention variants Attention is a general Deep Learning technique More general definition of attention:...
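Here is a minimal numpy sketch of these equations for a single decoder timestep $t$, assuming dot-product scores; the sizes and random vectors are stand-ins for real encoder and decoder states.

```python
# Sketch of the attention equations for one decoder timestep t.
import numpy as np

rng = np.random.default_rng(1)
N, h = 5, 4                                  # number of encoder states, hidden size
H = rng.normal(size=(N, h))                  # encoder hidden states h_1..h_N
s_t = rng.normal(size=h)                     # decoder hidden state at step t

e_t = H @ s_t                                # scores e^t = [s_t^T h_1, ..., s_t^T h_N]
alpha_t = np.exp(e_t - e_t.max())
alpha_t /= alpha_t.sum()                     # attention distribution (sums to 1)
a_t = alpha_t @ H                            # weighted sum of encoder hidden states
decoder_input = np.concatenate([a_t, s_t])   # [a_t; s_t] in R^{2h}
print(decoder_input.shape)                   # (2h,)
```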

July 27, 2024 · 259 words · Kurong

Lecture 7: Machine Translation and Sequence to Sequence

Machine Translation Machine Translation is the task of translating a sentence $x$ from one language to a sentence $y$ in another language. A brief history: 1990s-2010s: Statistical Machine Translation; after 2014: Neural Machine Translation Sequence to Sequence Model The sequence-to-sequence model is an example of a Conditional Language Model: a Language Model because the decoder is predicting the next word of the target sentence $y$, and Conditional because its predictions are also conditioned on the source sentence $x$ Multi-layer RNNs in practice High-performing RNNs are usually multi-layer....
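Concretely, "conditional language model" refers to the standard factorization of the target sentence's probability given the source (written out here for completeness, in the same $x$, $y$ notation):

$$ P(y|x)=\prod_{t=1}^{T} P(y_t \mid y_1,\dots,y_{t-1},\, x) $$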

July 13, 2024 · 315 words · Kurong

Lecture 6: Long Short-Term Memory RNNs

Training an RNN Language Model Get a big corpus of text, which is a sequence of words $x^{(1)},...,x^{(T)}$ Feed into RNN-LM; compute output distribution $\hat y^{(t)}$ for every timestep $t$ Backpropagation for RNNs Problems with Vanishing and Exploding Gradients Vanishing gradient intuition Why is vanishing gradient a problem? The gradient signal from far away is lost because it is much smaller than the gradient signal from nearby, so model weights are updated only with respect to near effects rather than long-term effects. If the gradient is small, the model can’t learn this dependency. So, the model is unable to predict similar long-distance dependencies at test time....
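A small numpy sketch of that intuition: backpropagating through many timesteps multiplies the gradient by the recurrent Jacobian over and over, so the far-away signal shrinks toward zero when the weights' largest singular value is below 1. The hidden size, scaling factor, and step count are illustrative assumptions, and the nonlinearity's derivative is ignored (for tanh/sigmoid it would only shrink the gradient further).

```python
# Sketch of the vanishing-gradient intuition for an RNN.
import numpy as np

rng = np.random.default_rng(2)
h = 16
W = rng.normal(size=(h, h))
W *= 0.6 / np.linalg.norm(W, 2)      # rescale so the largest singular value is 0.6 (< 1)
grad = rng.normal(size=h)            # gradient arriving at the final timestep

for step in range(1, 21):
    grad = W.T @ grad                # one step of backpropagation through time
    if step % 5 == 0:
        print(f"after {step:2d} steps, gradient norm = {np.linalg.norm(grad):.2e}")
```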

July 11, 2024 · 306 words · Kurong

Lecture 5: Language Models and Recurrent Neural Networks

Basic Tricks for NNs L2 Regularization A full loss function includes regularization over all parameters $\theta$, e.g., L2 regularization: $$ J(\theta)=f(x)+\lambda \sum_k \theta^2_k $$ Regularization produces models that generalize well when we have a “big” model. Dropout Training time: at each instance of evaluation (in online SGD-training), randomly set 50% of the inputs to each neuron to 0 Test time: halve the model weights (since twice as many inputs are now active) This prevents feature co-adaptation Can be thought of as a form of model bagging (i....
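A minimal numpy sketch of the dropout recipe described above (training: zero out half of the inputs to a layer; test: scale the weights instead); the layer sizes, dropout rate, and random data are illustrative assumptions.

```python
# Sketch of dropout: mask inputs at training time, scale weights at test time.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=64)              # inputs to a layer
W = rng.normal(size=(32, 64))        # layer weights
p_drop = 0.5

# Training time: random binary mask sets ~50% of inputs to 0.
mask = rng.random(64) >= p_drop
h_train = W @ (x * mask)

# Test time: no mask, but weights are scaled by (1 - p_drop), i.e. halved,
# to match the expected input magnitude seen during training.
h_test = (W * (1 - p_drop)) @ x
print(h_train.shape, h_test.shape)
```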

July 5, 2024 · 488 words · Kurong

Lecture 4: Dependency Parsing

Two views of linguistic structure Context-free grammars (CFGs): phrase structure organizes words into nested constituents. Dependency structure Dependency structure shows which words depend on (modify, attach to, or are arguments of) which other words. Why do we need sentence structure? Humans communicate complex ideas by composing words together into bigger units to convey complex meanings. Listeners need to work out what modifies [attaches to] what A model needs to understand sentence structure in order to be able to interpret language correctly Linguistic Ambiguities Prepositional phrase attachment ambiguity Coordination scope ambiguity Adjectival/adverbial modifier ambiguity Verb phrase (VP) attachment ambiguity Dependency paths identify semantic relations and help extract semantic interpretation....
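To make "which words depend on which" concrete, here is a tiny sketch encoding the two readings of the classic PP-attachment example "I saw the man with a telescope" as dependent-to-head maps; the sentence and the simplified convention of letting the preposition head its phrase are my illustrative choices, not taken from the notes.

```python
# Dependency parses as (dependent -> head) index maps; index 0 is a ROOT placeholder,
# other indices are 1-based word positions.
words = ["ROOT", "I", "saw", "the", "man", "with", "a", "telescope"]

# Reading 1: the PP "with a telescope" attaches to the verb "saw" (instrument).
heads_instrument = {1: 2, 2: 0, 3: 4, 4: 2, 5: 2, 6: 7, 7: 5}

# Reading 2: the PP attaches to the noun "man" (the man has the telescope).
heads_possession = dict(heads_instrument, **{5: 4})

for name, heads in [("instrument", heads_instrument), ("possession", heads_possession)]:
    arcs = [f"{words[h]} -> {words[d]}" for d, h in sorted(heads.items())]
    print(name, ":", ", ".join(arcs))
```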

July 4, 2024 · 229 words · Kurong

Lecture 1: Introduction and Word Vectors

Meaning Definition: meaning: the idea that is represented by a word, phrase, etc.; the idea that a person wants to express by using words, signs, etc.; the idea that is expressed in a work of writing, art, etc. WordNet Common NLP solution: use, e.g., WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships). Problems with resources like WordNet Great as a resource but missing nuance...
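A short sketch of the kind of WordNet lookup meant by the "common NLP solution" above, using NLTK; it assumes nltk is installed and that the WordNet corpus can be downloaded, and the example words are illustrative choices.

```python
# Query WordNet via NLTK for synonym sets and hypernyms ("is a" relationships).
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)   # fetch the WordNet corpus if not present

# Synonym sets for the word "good".
for synset in wn.synsets("good")[:5]:
    lemmas = ", ".join(lemma.name() for lemma in synset.lemmas())
    print(f"{synset.name():<20} {lemmas}")

# Walk up the "is a" hierarchy (hypernyms) from "panda".
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
print([s.name() for s in panda.closure(hyper)])
```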

June 27, 2024 · 382 words · Kurong