Word Embeddings
Revision as of 07:01, 24 April 2026
| Article | Details |
|---|---|
| Topic area | NLP |
| Difficulty | Intermediate |
| Prerequisites | Neural Networks |
Word embeddings are dense, low-dimensional vector representations of words in which semantically similar words are mapped to nearby points in the vector space. They are a foundational component of modern natural language processing (NLP), replacing sparse one-hot encodings with representations that capture meaning, analogy, and syntactic relationships.
The distributional hypothesis
Word embeddings are grounded in the distributional hypothesis, famously stated by J. R. Firth (1957): "You shall know a word by the company it keeps." The idea is that words appearing in similar contexts tend to have similar meanings. For example, "dog" and "cat" frequently appear near words like "pet", "fur", and "veterinarian", so they should have similar representations.
Early approaches to exploiting distributional information include co-occurrence matrices, pointwise mutual information (PMI), and latent semantic analysis (LSA). Modern word embedding methods learn dense vectors directly using neural networks.
One-hot vs dense representations
One-hot encoding
In a vocabulary of $ V $ words, a one-hot vector for the $ i $-th word is a $ V $-dimensional vector with a 1 in position $ i $ and 0 elsewhere. This representation has two critical shortcomings:
- High dimensionality — vectors are extremely high-dimensional (typically $ V > 100{,}000 $) and almost entirely zero.
- No similarity — every pair of one-hot vectors is equally distant: $ \mathbf{e}_i^\top \mathbf{e}_j = 0 $ for $ i \neq j $. "Cat" is as far from "dog" as it is from "democracy."
Dense embeddings
A word embedding maps each word to a real-valued vector of $ d $ dimensions (typically $ d = 100 $–$ 300 $):
- $ \mathbf{w}_i \in \mathbb{R}^d, \quad d \ll V $
Similar words have high cosine similarity:
- $ \text{sim}(\mathbf{w}_a, \mathbf{w}_b) = \frac{\mathbf{w}_a \cdot \mathbf{w}_b}{\|\mathbf{w}_a\|\;\|\mathbf{w}_b\|} $
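The contrast between the two representations can be made concrete. The sketch below (toy vectors with illustrative values, not trained embeddings) shows that any two distinct one-hot vectors have cosine similarity 0, while dense vectors for related words can score close to 1:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every distinct pair is orthogonal, so similarity is 0.
cat_onehot = np.eye(5)[0]
dog_onehot = np.eye(5)[1]
print(cosine_sim(cat_onehot, dog_onehot))  # 0.0

# Dense vectors (toy values): related words can score close to 1.
cat = np.array([0.9, 0.1, 0.3])
dog = np.array([0.8, 0.2, 0.25])
print(cosine_sim(cat, dog) > 0.9)  # True
```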
Word2Vec
Word2Vec (Mikolov et al., 2013) introduced two efficient architectures for learning word embeddings from large corpora.
Continuous Bag of Words (CBOW)
CBOW predicts a target word from its surrounding context words. Given a window of context words $ \{w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}\} $, the model maximises:
- $ P(w_t \mid w_{t-c}, \ldots, w_{t+c}) $
The context word vectors are averaged and passed through a softmax layer. CBOW is faster to train and works well for frequent words.
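A minimal sketch of the CBOW forward pass described above, using randomly initialised toy matrices (vocabulary size, dimension, and index values are illustrative): context vectors are averaged, then scored against every output embedding with a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, d))    # input (context) embeddings
W_out = rng.normal(size=(V, d))   # output (target) embeddings

def cbow_probs(context_ids):
    """CBOW forward pass: average the context vectors, then
    softmax over the vocabulary to get P(target | context)."""
    h = W_in[context_ids].mean(axis=0)      # averaged context vector
    scores = W_out @ h                      # one score per vocabulary word
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()

p = cbow_probs([2, 3, 5, 6])  # a context window around some target word
assert np.isclose(p.sum(), 1.0)
```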
Skip-gram
Skip-gram reverses the prediction: given a target word, it predicts the surrounding context words. For each pair $ (w_t, w_{t+j}) $ where $ j \in [-c, c] \setminus \{0\} $, the model maximises:
- $ P(w_{t+j} \mid w_t) = \frac{\exp(\mathbf{v}'_{w_{t+j}}{}^\top \mathbf{v}_{w_t})}{\sum_{w=1}^{V}\exp(\mathbf{v}'_w{}^\top \mathbf{v}_{w_t})} $
where $ \mathbf{v}_w $ and $ \mathbf{v}'_w $ are the input and output embedding vectors. Computing the full softmax over the vocabulary is expensive, so two approximations are commonly used:
- Negative sampling — instead of computing the full softmax, the model contrasts the true context word against $ k $ randomly sampled "negative" words.
- Hierarchical softmax — organises the vocabulary in a binary tree, reducing the softmax cost from $ O(V) $ to $ O(\log V) $.
Skip-gram performs well on rare words and captures subtle relationships. The famous analogy "king − man + woman ≈ queen" emerged from Skip-gram embeddings.
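The negative-sampling objective above can be sketched in a few lines. This toy implementation (random embeddings, illustrative sizes) computes the loss for one (target, context) pair against $ k $ sampled negatives: $ -\log\sigma(\mathbf{v}'_c{}^\top \mathbf{v}_t) - \sum_{n}\log\sigma(-\mathbf{v}'_n{}^\top \mathbf{v}_t) $.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10, 4, 3                        # toy vocab size, dimension, negatives
V_in = rng.normal(scale=0.1, size=(V, d))   # input embeddings v_w
V_out = rng.normal(scale=0.1, size=(V, d))  # output embeddings v'_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(target, context, negatives):
    """Skip-gram negative-sampling loss for one (target, context) pair."""
    v_t = V_in[target]
    pos = -np.log(sigmoid(V_out[context] @ v_t))         # true context word
    neg = -np.log(sigmoid(-V_out[negatives] @ v_t)).sum()  # k negatives
    return pos + neg

loss = sgns_loss(target=1, context=2, negatives=rng.integers(0, V, size=k))
assert loss > 0
```

Minimising this loss pushes the true context word's score up and the sampled negatives' scores down, avoiding the $ O(V) $ softmax entirely.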
GloVe
GloVe (Global Vectors, Pennington et al., 2014) combines the strengths of global matrix factorisation and local context window methods. It constructs a word co-occurrence matrix $ X $ from the corpus, where $ X_{ij} $ counts how often word $ j $ appears in the context of word $ i $, and then optimises:
- $ J = \sum_{i,j=1}^{V} f(X_{ij})\bigl(\mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2 $
where $ f $ is a weighting function that caps the influence of very frequent co-occurrences. GloVe embeddings often match or exceed Word2Vec quality, and the explicit use of global statistics can improve performance on analogy tasks.
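One term of the objective, together with the weighting function $ f(x) = \min\bigl((x/x_{\max})^{\alpha}, 1\bigr) $ used in the original paper (with $ x_{\max} = 100 $, $ \alpha = 0.75 $), can be sketched as follows; the toy vectors are illustrative:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows sublinearly, capped at 1 so very
    frequent co-occurrences do not dominate the objective."""
    return min((x / x_max) ** alpha, 1.0)

def glove_term(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """One (i, j) term of the GloVe objective J."""
    err = w_i @ w_j_tilde + b_i + b_j_tilde - np.log(x_ij)
    return glove_weight(x_ij) * err ** 2

# Toy example: a single co-occurrence cell.
w_i = np.array([0.3, -0.1])
w_j = np.array([0.2, 0.4])
assert glove_term(w_i, w_j, 0.1, 0.05, 50.0) >= 0
```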
fastText
fastText (Bojanowski et al., 2017) extends Word2Vec by representing each word as a bag of character n-grams. For example, the word "where" with $ n = 3 $ is represented by the n-grams {"<wh", "whe", "her", "ere", "re>"} plus the whole word "<where>". The embedding for a word is the sum of its n-gram vectors.
This approach has two key advantages:
- Handling rare and unseen words — even words not in the training vocabulary can receive embeddings by summing their character n-gram vectors.
- Morphological awareness — words sharing substrings (e.g. "teach", "teacher", "teaching") automatically share embedding components.
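The subword decomposition above is easy to reproduce. A minimal sketch of the n-gram extraction (the function name is illustrative, not fastText's API):

```python
def char_ngrams(word, n=3):
    """Character n-grams of '<word>' plus the full bracketed word,
    mirroring fastText's subword representation."""
    w = f"<{word}>"                                    # boundary markers
    grams = [w[i:i + n] for i in range(len(w) - n + 1)]
    grams.append(w)                                    # the whole word itself
    return grams

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

The word's embedding is then the sum of the vectors for these units, which is why an out-of-vocabulary word still gets a representation.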
Evaluation of embeddings
Word embeddings are evaluated through:
| Evaluation type | Examples | What it measures |
|---|---|---|
| Intrinsic: analogy | "king : queen :: man : ?" | Linear structure of the space |
| Intrinsic: similarity | Correlation with human similarity judgements (SimLex-999, WS-353) | Semantic quality |
| Extrinsic: downstream | Named entity recognition, sentiment analysis, parsing | Practical utility |
Intrinsic evaluations are fast but do not always predict downstream performance. Extrinsic evaluation on the target task is ultimately the most reliable measure.
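The analogy evaluation in the table reduces to vector arithmetic plus a nearest-neighbour search. A toy sketch (hand-picked 2-D vectors arranged so the gender offset is constant; real evaluations use trained embeddings and large word lists):

```python
import numpy as np

def most_similar(query_vec, vocab, exclude=()):
    """Return the word whose embedding has the highest cosine
    similarity to query_vec, skipping excluded query words."""
    best, best_sim = None, -1.0
    for word, vec in vocab.items():
        if word in exclude:
            continue
        sim = query_vec @ vec / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy embeddings with a roughly constant "gender" offset on axis 1.
vocab = {
    "king":  np.array([0.8, 0.9]),
    "queen": np.array([0.8, 0.1]),
    "man":   np.array([0.2, 0.9]),
    "woman": np.array([0.2, 0.1]),
}
q = vocab["king"] - vocab["man"] + vocab["woman"]
print(most_similar(q, vocab, exclude={"king", "man", "woman"}))  # queen
```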
Contextual embeddings
Traditional word embeddings assign a single vector per word regardless of context — the word "bank" has the same embedding whether it refers to a river bank or a financial institution. Contextual embeddings address this limitation by producing different representations depending on the surrounding text.
Notable contextual embedding models include:
- ELMo (Peters et al., 2018) — uses a bidirectional LSTM to generate context-dependent word representations.
- BERT (Devlin et al., 2019) — uses a Transformer encoder trained with masked language modelling.
- GPT series (Radford et al., 2018–) — uses a Transformer decoder trained autoregressively.
These models have largely superseded static embeddings for most NLP tasks, though static embeddings remain useful for efficiency, interpretability, and low-resource settings.
References
- Firth, J. R. (1957). "A synopsis of linguistic theory, 1930–1955". In Studies in Linguistic Analysis.
- Mikolov, T. et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781.
- Pennington, J., Socher, R. and Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation". EMNLP.
- Bojanowski, P. et al. (2017). "Enriching Word Vectors with Subword Information". TACL, 5, 135–146.
- Peters, M. E. et al. (2018). "Deep contextualized word representations". NAACL.
- Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL.