Meaning
Definition: meaning
- the idea that is represented by a word, phrase, etc.
- the idea that a person wants to express by using words, signs, etc.
- the idea that is expressed in a work of writing, art, etc.
WordNet
Common NLP solution: Use, e.g., WordNet, a thesaurus containing lists of synonym sets and hypernyms (“is a” relationships).
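As a minimal sketch, assuming `nltk` and its WordNet corpus are installed (`pip install nltk`, then `nltk.download("wordnet")` once), synonym sets and hypernym chains can be inspected like this:

```python
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for the word "good", grouped by part of speech.
for synset in wn.synsets("good"):
    print(synset.pos(), synset.lemma_names())

# Hypernyms ("is a" relationships) for one WordNet sense of "panda".
panda = wn.synset("panda.n.01")
hypernym_chain = panda.closure(lambda s: s.hypernyms())
print(list(hypernym_chain))  # e.g. procyonid -> carnivore -> ... -> entity
```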
Problems with resources like WordNet
Great as a resource but missing nuance
Missing new meanings of words
Subjective
Requires human labor to create and adapt
Can’t compute accurate word similarity
Representing words as discrete symbols
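Traditionally, words are treated as discrete symbols, e.g. one-hot vectors. A minimal sketch (the tiny vocabulary is a made-up example) of why such discrete representations cannot express word similarity:

```python
import numpy as np

# Hypothetical tiny vocabulary; a real vocabulary has hundreds of thousands of words.
vocab = ["motel", "hotel", "banking", "crisis"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Represent a word as a discrete symbol: a vector with a single 1."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal, so their dot product is 0:
print(one_hot("motel") @ one_hot("hotel"))  # 0.0 -- no similarity signal at all
```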
Word Vectors
Distributional semantics: A word’s meaning is given by the words that frequently appear close-by.
- When a word $w$ appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
- Use the many contexts of $w$ to build up a representation of $w$ (see the sketch below)
Word vectors are also called word embeddings or (neural) word representations.
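A minimal sketch of the distributional idea (the corpus and window size are toy assumptions): count which words appear within a fixed-size window of a target word, and use those context counts as a crude representation of it.

```python
from collections import Counter

# Toy corpus; a real corpus would contain millions of tokens.
tokens = "the bank approved the loan while the river bank flooded".split()
window = 2  # fixed-size context window on each side

def context_counts(target, tokens, window):
    """Collect counts of words that co-occur with `target` within the window."""
    counts = Counter()
    for t, word in enumerate(tokens):
        if word != target:
            continue
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                counts[tokens[j]] += 1
    return counts

print(context_counts("bank", tokens, window))
```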
Word2vec
Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.
- Every word in a fixed vocabulary is represented by a vector
- Go through each position $t$ in the text, which has a center word $c$ and context (“outside”) words $o$ (see the sketch after this list)
- Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$
- Keep adjusting the word vectors to maximize this probability
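A minimal sketch of the pairing step referenced above (toy corpus, window size $m=2$): every position $t$ yields one center word and up to $2m$ outside words.

```python
# Toy example of enumerating (center, outside) training pairs with window m = 2.
tokens = "problems turning into banking crises as".split()
m = 2

pairs = []
for t, center in enumerate(tokens):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(tokens)):
            continue
        pairs.append((center, tokens[t + j]))

# e.g. ('into', 'problems'), ('into', 'turning'), ('into', 'banking'), ('into', 'crises'), ...
print(pairs[:8])
```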
For Word2vec, the objective function $J(\theta)$ is the average negative log likelihood:
$$ J(\theta) = -\frac{1}{T}\sum^T_{t=1}\sum_{\substack{-m \le j \le m \\ j\neq 0}} \log P(w_{t+j} \mid w_t;\ \theta) $$
Minimizing this objective function is equivalent to maximizing predictive accuracy.
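A minimal sketch of evaluating $J(\theta)$, assuming a hypothetical helper `prob(o, c, theta)` that returns $P(o \mid c;\theta)$ (that probability is defined just below):

```python
import math

def objective(tokens, m, prob, theta):
    """Average negative log likelihood over all (center, outside) pairs."""
    total, T = 0.0, len(tokens)
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue
            total -= math.log(prob(tokens[t + j], tokens[t], theta))
    return total / T
```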
We use two vectors per word: $v_w$ when $w$ is a center word and $u_w$ when $w$ is a context word. Then, for a center word $c$ and a context word $o$:
$$ P(o \mid c) = \frac{\exp(u^T_o v_c)}{\sum_{w \in V}\exp(u^T_w v_c)} $$
This is an example of the **softmax function**:
$$ \operatorname{softmax}(x_i)=\frac{\exp(x_i)}{\sum^n_{j=1}\exp(x_j)}=p_i $$
The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$:
- “max” because it amplifies the probability of the largest $x_i$
- “soft” because it still assigns some probability to smaller $x_i$
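A minimal sketch of the softmax and of $P(o \mid c)$, assuming `U` and `V` are matrices whose rows are the context vectors $u_w$ and center vectors $v_w$ (hypothetical names):

```python
import numpy as np

def softmax(x):
    """Map arbitrary scores to a probability distribution (shift by max for stability)."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def prob_o_given_c(o, c, U, V):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]           # u_w^T v_c for every word w in the vocabulary
    return softmax(scores)[o]   # pick out the entry for the outside word o

# Toy usage: vocabulary of 5 words, embedding dimension 3, random vectors.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
print(prob_o_given_c(o=2, c=0, U=U, V=V))
```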
Variants
Two model variants:
- Skip-grams (SG): predict context (“outside”) words (position independent) given the center word
- Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words
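For instance, a hedged sketch with gensim (assuming gensim ≥ 4.0 is installed), where the `sg` flag switches between the two variants:

```python
from gensim.models import Word2Vec

# Toy corpus; real training needs a large corpus of tokenized sentences.
sentences = [["the", "bank", "approved", "the", "loan"],
             ["the", "river", "bank", "flooded"]]

# sg=1 -> skip-gram (predict outside words from the center word);
# sg=0 -> CBOW (predict the center word from the bag of context words).
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["bank"][:5])  # first few dimensions of the learned vector
```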