Professor: Freda Shi | Term: Winter 2026
Lecture 1
This course will be helpful if:
- You are curious about the fundamentals and detailed implementations of language models
- You are looking to apply state-of-the-art NLP techniques to your own goals
- You are interested in NLP and computational linguistics research
This course will not cover:
- Basics of Python programming, probability, and algorithms
- System and architectures for efficient language model training and deployment
- GPU programming (except for a few high-level basics)
The following background knowledge is strongly recommended:
- Basic knowledge of calculus, linear algebra, and probability
- Python programming proficiency
- Fundamentals of algorithms
- Understanding of basic data structures
Question
What is Natural Language Processing?
Goal: Understand natural language with computational models.
End system we want to build
- Simple: Text classification, grammatical error correction
- Complex: Translation, question answering, speech recognition
- Unknown: Human-level comprehension
This course will cover the foundations of language models.
Question
Why is NLP difficult?
Ambiguity: One form can have multiple meanings.
"She saw a cat with a telescope." Who was with a telescope?
Variability: Multiple forms can share the same meaning.
"The cat is chasing the mouse." "The mouse is being chased by the cat."
Cross-lingual awareness, dialects and accents.
Orilla (Spanish) vs. 银行 (Chinese)
Orilla is the bank of a river (a shore); 银行 is a financial bank. Translation that uses English as the middle language can conflate the two senses of "bank".
Underlying meanings: Politeness, humour, irony.
Q: Do you have iced tea? A1: No A2: We have iced coffee
Roadmap: Levels of language
- Morphology: What words/subwords are we dealing with?
- Syntax: What phrases are we dealing with?
- Semantics: What is the literal meaning?
- Pragmatics: What is the underlying meaning?
Roadmap: Modeling Approaches
Classification:
- Sentiment Classifier: Take input sentence, run it through a sentiment classifier, and output a label.
- Local Consistency Checker: Take a pair of sentences, and output a label describing the relationship between the sentences.
Language Modeling:
- GPT: Next-word predictor. Take a prefix of a sentence, and predict the next word
- BERT: [Mask] Token Predictor. Collect text, mask some word in the middle of the text, and predict the masked word based on the context.
Brief History of NLP
1960s-1990s: Classical NLP
- Linguistic theories
- Manually-defined rules
1980s-2013: Statistical NLP
- Supervised machine learning with annotated data
- Mostly linear models with manually-defined features
2012-now: Neural NLP
- Neural network as primary architecture
- Learning representations from large corpora
- Less hand-crafting features/linguistic knowledge
2022-now: Large language models
- Pretrain a language model
- Use the pretrained language model for various tasks
Common notations of discrete probability
P(X = x) denotes the probability that random variable X takes on the value x.
Given two random variables X and Y:
Joint probability: P(X = x, Y = y).
Marginal probability: P(X = x) = Σ_y P(X = x, Y = y). X and Y are independent iff P(X = x, Y = y) = P(X = x) P(Y = y) for all x, y.
Conditional probability: P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y). X and Y are independent iff P(X = x | Y = y) = P(X = x) whenever P(Y = y) > 0.
Expectation: E[X] = Σ_x x · P(X = x).
Finding the optimum of a function
Find the minimum of a convex function f(x).
Gradient descent: x_{t+1} = x_t − η ∇f(x_t), where η is the learning rate ("step size").
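The update rule can be sketched in a few lines of Python; the convex function f(x) = (x − 3)² and the learning rate 0.1 below are illustrative choices, not from the lecture.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # x_{t+1} = x_t - eta * f'(x_t)
    return x

# Example: f(x) = (x - 3)^2 is convex with minimum at x = 3; f'(x) = 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Because f here is convex, any sufficiently small learning rate converges to the unique minimum; non-convex losses (as in neural networks) only reach a local optimum.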
Lecture 2
Question
What is a word?
A single distinct meaningful element of speech or writing, used with others to form a sentence.
Things in dictionaries? (new words are often created).
Things between spaces and punctuation? (not all languages treat spaces the same).
Smallest unit that can be uttered in isolation? (one can utter “unimpressively” and “impress”).
Each of the above captures some but not all aspects of what a word is.
Linguistic Morphology
Definition
Morphology is the study of how words are built from smaller meaning-bearing units
Types of morphemes:
- Stem: Core meaning-bearing unit
- Affix: A piece that attaches to a stem, adding some function or meaning (e.g. prefix, suffix). The '-o-' in 'speedometer' is an interfix. Infixes and circumfixes exist in other languages.
Definition
Inflection: Adding morphemes to a word to indicate grammatical information.
walk → walked
cat → cats
Definition
Derivation: Adding morphemes to a word to create a new word with a different meaning.
happy → happiness
define → predefine
Definition
Compounding: Combining two or more words to create a new word.
key + board → keyboard
law + suit → lawsuit
In languages like Classical Chinese, Vietnamese, and Thai
Each word form typically consists of one single morpheme. There is little morphology other than compounding.
Few examples of inflection and derivation.
Most Chinese words are created by compounding.
Usually, morphological decomposition is simply splitting a word into its morphemes:
walked = walk + ed
greatness = great + ness
But it can be a hierarchical structure:
unbreakable = un + (break + able)
internationalization = (((inter + nation) + al) + iz[e]) + tion
There is ambiguity in hierarchical decomposition. "The door is unlockable" can mean (un + lock) + able (able to be unlocked) or un + (lock + able) (impossible to lock).
Tasks that address morphology:
Lemmatization: Putting words/tokens in a standard format.
- Lemma: the canonical/dictionary form of a word.
- Wordform: the fully inflected or derived form of a word as it appears in text.
| wordform | lemma |
|---|---|
| run | run |
| ran | run |
| running | run |
| keyboards | keyboard |
Stemming: Reducing words to their stems by removing affixes. More conventional engineering-oriented approach used in applications such as retrieval.
"Caillou is an average, imaginative" → "Caillou is an averag imagin".
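A minimal suffix-stripping stemmer illustrates the idea; the rule list below is a toy invention (real stemmers such as the Porter stemmer use many ordered, conditioned rules):

```python
def toy_stem(word):
    """Strip one common English suffix, if the remainder is long enough.
    A toy sketch of rule-based stemming, not a real stemmer."""
    for suffix in ("ative", "ation", "ness", "ing", "ive", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in "caillou is an average imaginative".split()]
```

As with real stemmers, the output ("averag", "imagin") need not be a dictionary word; that is acceptable for applications like retrieval, where queries and documents are stemmed the same way.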
Lexical Semantics
Lemmatization and stemming tackle the problem of variability: multiple forms can share the same or similar meanings.
One wordform could refer to multiple meanings.
Definition
Polysemy: A word has multiple related meanings.
"She is a star" (a celebrity) or "The star is shining" (a celestial object).
Definition
Homonymy: A word has multiple meanings originated from different sources.
I need to go to the bank for cash or I am sitting on the bank of the river.
Question
Which one is the case for "crane"?
Crane is a case of polysemy: the machine was named after the bird.
Definition
Synonyms: Words that have the same meanings according to some criteria.
There are few examples of perfect synonymy.
Synonymy is a relation between senses rather than words.
Definition
Antonyms: Senses that are opposite with respect to at least one dimensionality of meaning.
dark and light (colours)
dark and bright (light)
Sense A is a hyponym of sense B if A is more specific, denoting a subclass of B.
Conversely, B is a hypernym of A.
Dog is a hyponym of animal
Corgi is a hyponym of dog
Sense A is a meronym of sense B if A is a part of B.
Conversely, B is a holonym of A.
Hand is a meronym of body
Finger is a meronym of hand
Definition
Word-Sense Disambiguation (WSD): The task of determining which sense of a word is used in a particular context, given a set of predefined possible senses.
Definition
Word Sense Induction (WSI): Requires clustering word usages into senses without predefined ground truths.
Default solution: encode the context of words with a pretrained model, and train a neural network to predict the sense.
Question
We now have powerful neural language models, which do not distinguish word senses. Is WSD still a meaningful task? Do discrete word senses even exist?
Definition
Tokenization: The process that converts running text into a sequence of tokens.
| | Penn Treebank | Moses |
|---|---|---|
| don't | do n't | don 't |
| aren't | are n't | aren 't |
| can't | ca n't | can 't |
| won't | wo n't | won 't |
Important to check and ensure consistency when comparing results across tokenizers.
There is no explicit whitespace between words in some languages, and tokenization becomes highly nontrivial in these cases.
姚明 进入 总决赛 (Chinese Treebank) vs 姚 明 进入 总 决赛 (Peking University).
Type: A unique word.
Token: An instance of a type in the text.
Question
How does the type/token ratio change when adding more data?
The ratio drops fast: common words appear a lot.
Zipf’s Law (also covered in CS451): Frequency of a word is roughly inversely proportional to its rank in the word frequency list.
Tokenization in Modern NLP Systems
There are many words, but the words have shared internal structures and meanings.
Modern NLP systems always convert tokens into numerical indices for further processing. Can we do better than assigning each word a unique index?
flowchart LR
tokenizer --> a[NLP model]
a[NLP model] --> detokenizer
Data-driven tokenizers offer an option that learns the tokenization rules from data, tokenizing texts into subword units using statistics of character sequences in the dataset.
Two most popular methods:
- Byte Pair Encoding (BPE)
- SentencePiece
Byte Pair Encoding
Originally introduced in 1994 for data compression, later adapted and revived for NLP.
Key idea: Merge symbols with a greedy algorithm.
Initialize the vocabulary with the set of characters, and iteratively merge the most frequent pair of symbols to extend the vocabulary.
Training Corpus
cat cats concatenation categorization
Initial vocabulary
a c e g i n o r s t z
Count Symbol-Pair Frequencies
<a t>: 6, <c a>: 4, <o n>: 3,
Update vocabulary
a c e g i n o r s t z at
Count Symbol-Pair Frequencies
<c at>: 4, <o n>: 3,
Update vocabulary
a c e g i n o r s t z at cat
We repeat this process until the vocabulary reaches the desired size.
The BPE proposal is not optimal in terms of compression rate under the same vocabulary size.
A better vocabulary
a c e g i n o r s t z cat on
Apply Trained BPE to a New Corpus
In addition to the tokens, we also need the merge rules. Start from individual characters and apply the merges in the order they were learned.
Final vocabulary
a c e g i n o r s t z at cat
Merge Rules
a t → at; c at → cat
Word to be tokenized
c a t e g o r y
Result
cat e g o r y
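The training and application procedure above can be sketched as follows; the corpus is the four words from the running example, and two merges reproduce the "at" and "cat" entries:

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every occurrence of the symbol pair with its concatenation."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn BPE merge rules: repeatedly merge the most frequent adjacent pair."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges

def apply_bpe(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    seq = list(word)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

corpus = ["cat", "cats", "concatenation", "categorization"]
merges = train_bpe(corpus, num_merges=2)   # [('a', 't'), ('c', 'at')]
tokens = apply_bpe("category", merges)     # ['cat', 'e', 'g', 'o', 'r', 'y']
```

Real implementations additionally handle word boundaries (e.g. an end-of-word marker) and tie-breaking; this sketch omits both.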
SentencePiece Tokenization
Find the vocabulary for a unigram language model that maximizes the likelihood of the training corpus.
Byte-Level BPE
Question
How do large language models tokenize texts from different languages, with a unified tokenizer and fixed vocabulary size?
Convert the text to bytes (shown in hexadecimal).
Prepend zeroes to fix the length of tokens (to ensure unique decoding), and do BPE on the bit/multi-bit level.
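A quick way to inspect these bytes with Python's built-in UTF-8 codec:

```python
# Byte-level BPE starts from the UTF-8 bytes of the text, not from characters.
text = "Hello 世界"
for ch in text:
    # bytes.hex(sep) requires Python 3.8+
    print(ch, ch.encode("utf-8").hex(" ").upper())
```

ASCII characters occupy one byte each, while 世 and 界 occupy three bytes each, so a byte-level tokenizer must learn multi-byte merges to keep such characters intact.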
Lecture 3
Recap: Tokenization in Modern NLP Systems
The natural first step is to convert tokens into numerical indices for further processing.
Question
Can we do better than assigning each word a unique index?
Recap: Byte-Level BPE Tokenization
Consider UTF-8 encoding of “Hello world!”
| Character | UTF-8 Hex |
|---|---|
| H | 48 |
| e | 65 |
| l | 6C |
| l | 6C |
| o | 6F |
| (space) | 20 |
| W | 57 |
| o | 6F |
| r | 72 |
| l | 6C |
| d | 64 |
| ! | 21 |
We will work with the sequence 48 65 6C 6C 6F 20 57 6F 72 6C 64 21.
The base vocabulary: entries from 00 to FF. At each step, evaluate frequency of consecutive vocabulary entry pairs, and add a new entry.
This is not much different from character-based BPE.
Question
What about non-English characters?
Consider UTF-8 encoding of “Hello 世界” (world).
| Character | UTF-8 Hex |
|---|---|
| H | 48 |
| e | 65 |
| l | 6C |
| l | 6C |
| o | 6F |
| (space) | 20 |
| 世 | E4 B8 96 |
| 界 | E7 95 8C |
We will work with the sequence 48 65 6C 6C 6F 20 E4 B8 96 E7 95 8C. The base vocabulary is still the 256 entries from 00 to FF.
Edit Distance: Comparing similarity of two sequences
Question
How to measure the similarity between two strings?
We need to design algorithms to assign a numerical score to measure similarity.
A proposal: the minimum number of single-character edits required to change one string into the other.
Three allowed operations:
- Insertion
- Deletion
- Substitution
This is known as Levenshtein Distance
teh → the
- Delete 'e' at position 2, add 'e' at position 3; or
- Add 'h' at position 2, delete 'h' at position 4; or
- Substitute 'e' at position 2 with 'h', substitute 'h' at position 3 with 'e'.
cat → bat
- Substitute 'c' at position 1 with 'b'.
cat → cats
- Add 's' at position 4.
Unified Algorithmic Solution
Question
How about sitting → extension?
Dynamic Programming (CS341):
Let D(i, j) represent the edit distance (minimal number of edits) between the first i characters of s and the first j characters of t.
Edge cases: D(i, 0) = i and D(0, j) = j.
| | 0 | 1 e | 2 x | 3 t | 4 e | 5 n | 6 s | 7 i | 8 o | 9 n |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 1 s | 1 | 1 | 2 | 3 | 4 | 5 | 5 | 6 | 7 | 8 |
| 2 i | 2 | 2 | 2 | 3 | 4 | 5 | 6 | 5 | 6 | 7 |
| 3 t | 3 | 3 | 3 | 2 | 3 | 4 | 5 | 6 | 6 | 7 |
| 4 t | 4 | 4 | 4 | 3 | 3 | 4 | 5 | 6 | 7 | 7 |
| 5 i | 5 | 5 | 5 | 4 | 4 | 4 | 5 | 5 | 6 | 7 |
| 6 n | 6 | 6 | 6 | 5 | 5 | 4 | 5 | 6 | 6 | 6 |
| 7 g | 7 | 7 | 7 | 6 | 6 | 5 | 5 | 6 | 7 | 7 |
The recurrence implies doing nothing when s_i = t_j: in that case D(i, j) = D(i − 1, j − 1).
Question
Why is this correct?
Intuition: In such cases, doing nothing may not be the unique best solution, but it is one of the best.
For example, D(7, 9) = 7 for sitting → extension, whose candidates are D(6, 9) + 1 (deletion), D(7, 8) + 1 (insertion), and D(6, 8) + 1 (substitution).
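The recurrence can be implemented directly; this sketch uses unit costs for all three operations:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming.
    D[i][j] = edit distance between the first i chars of s and first j chars of t."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i              # delete all i characters
    for j in range(n + 1):
        D[0][j] = j              # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                D[i][j] = D[i - 1][j - 1]           # match: do nothing
            else:
                D[i][j] = 1 + min(D[i - 1][j - 1],  # substitution
                                  D[i - 1][j],      # deletion
                                  D[i][j - 1])      # insertion
    return D[m][n]
```

The table fills in O(mn) time and space; tracing back through the min choices recovers an optimal edit sequence.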
Extension: Different Operations with Different Costs
The minimum cost of single-character edits required to change one word into the other. Each operation could have a different non-negative cost.
Three allowed operations:
- Insertion with cost c_ins
- Deletion with cost c_del
- Substitution with cost c_sub
Word Vectors
Until 2010, in NLP, words meant atomic symbols.
Nowadays, it’s natural to think about word vectors when talking about words in NLP. Each word is represented by a vector.
Key idea: Similar words are nearby in a good vector space.
How models represent words
We map each word to a (very high-dimensional) vector.
One of the key challenges for NLP is variability of language (multiple forms having the same meaning).
Representation Learning for Engineering
Engineering: These representations are often useful for downstream tasks.
Transfer learning:
- Image segmentation, visual QA ← object classification
- Text classification, QA ← context prediction
How to represent a word: One-hot representation of words
Each word is a one-hot vector with a single 1 at its own index: w_cat = [0, …, 0, 1, 0, …, 0]ᵀ ∈ R^|V|.
|V| could be very large, and any two different word vectors are orthogonal: one-hot vectors encode no notion of similarity.
Question
What is an ideal word representation?
It should probably capture information about usage and meaning:
- Part of speech tags (noun, verb, adj., adv., etc.)
- The intended sense
- Semantic similarities (winner vs champion)
- Semantic relationships (antonyms, hypernyms)
Features
Features could extend infinitely.
Distributional Semantics: How much of this can we capture from context/data alone?
”The meaning of a word is its use in the language.” - Ludwig Wittgenstein.
The use of a word is defined by its contexts.
Distributional Semantics
Consider a new word: tezgüino.
- A bottle of tezgüino is on the table.
- Everybody likes tezgüino.
- Don’t have tezgüino before you drive.
- We make tezgüino out of corn.
Question
What do you think tezgüino is?
Loud, motor oil, tortillas, choices, wine.
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| tezgüino | 1 | 1 | 1 | 1 |
| loud | 0 | 0 | 0 | 0 |
| motor oil | 1 | 0 | 0 | 1 |
| tortillas | 0 | 1 | 0 | 1 |
| choices | 0 | 1 | 0 | 0 |
| wine | 1 | 1 | 1 | 0 |
Question
How can we automate the process of constructing representations of word meaning from its company?
First solution: word-word cooccurrence counts (CS451).
Counting for Word Vectors
'the club may also employ a chef to prepare and cook food items'
'is up to Remy, Linguini, and the chef Colette to cook for many people'
'cooking program the cook and the chef with Simon Bryant, who'
The top word is the word we are computing the vector for, the words on the side are the context words
Once we have word vectors, we can compute word similarities.
Among many ways to define the similarity of two vectors, a simple way is the dot product: sim(u, v) = u · v = Σ_i u_i v_i.
The dot product is large when the vectors have entries with large absolute values in the same dimensions.
With dot product as the similarity function, we can find the most similar words (nearest neighbours) to each word:
| cat | chef | chicken | civic | cooked | council |
|---|---|---|---|---|---|
| council | council | council | council | council | council |
| cat | cat | cat | cat | cat | cat |
| civic | civic | civic | civic | civic | civic |
| chicken | chicken | chicken | chicken | chicken | chicken |
| chef | chef | chef | chef | chef | chef |
| cooked | cooked | cooked | cooked | cooked | cooked |
Council is always a top neighbour because its vector has very large magnitudes, and dot products care about magnitude.
Cosine similarity: cos(u, v) = (u · v) / (‖u‖ ‖v‖).
Now using cosine similarity:
| cat | chef | chicken | civic | cooked | council |
|---|---|---|---|---|---|
| cat | chef | chicken | civic | cooked | council |
| chef | civic | cooked | council | chef | civic |
| cooked | cooked | chef | chef | civic | chef |
| civic | council | civic | cooked | council | cooked |
| council | cat | council | cat | cat | cat |
| chicken | chicken | cat | chicken | chicken | chicken |
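A small sketch of the contrast between dot product and cosine similarity; the three toy count vectors below are invented for illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product normalized by vector lengths: insensitive to magnitude."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical count vectors: 'council' is frequent, so its counts are large.
council = [80, 95, 60]
cat     = [2, 1, 9]
chicken = [3, 1, 8]

# Dot product ranks the high-magnitude vector first for the query 'cat' ...
by_dot = dot(cat, council) > dot(cat, chicken)
# ... while cosine similarity ranks the genuinely similar word first.
by_cos = cosine(cat, chicken) > cosine(cat, council)
```

This reproduces the phenomenon in the tables above: under the dot product, a frequent word dominates every neighbour list; cosine removes the magnitude effect.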
Issues with Counting-Based Vectors
Raw frequency counts are probably a bad representation: counts of common words are very large, but not very useful. "The", "it", "they" are not very informative.
There are many ways proposed for improving raw counts.
- Removing “stop words”
- Down-weight less informative words
TF (Term Frequency) - IDF (Inverse Document Frequency)
- Information Retrieval (IR) workhorse
- A common baseline model
- Sparse vectors
- Words are represented by a simple function of nearby words
Consider a matrix of word counts across documents: term-document matrix
Term Frequency:
| As You Like it | Twelfth Night | Julius Caesar | Henry V | |
|---|---|---|---|---|
| battle | 1 | 0 | 7 | 13 |
| good | 114 | 80 | 62 | 89 |
| fool | 36 | 58 | 1 | 4 |
| wit | 20 | 15 | 2 | 3 |
Columns are bag-of-words (document representation). Rows are word vectors.
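One common tf-idf variant (log-scaled term frequency times log inverse document frequency) can be computed over the term-document table above; the exact weighting scheme and the play abbreviations below are assumptions, not from the lecture:

```python
import math

# Term-document counts from the Shakespeare table above.
counts = {
    "battle": {"AYLI": 1,   "TN": 0,  "JC": 7,  "HV": 13},
    "good":   {"AYLI": 114, "TN": 80, "JC": 62, "HV": 89},
    "fool":   {"AYLI": 36,  "TN": 58, "JC": 1,  "HV": 4},
    "wit":    {"AYLI": 20,  "TN": 15, "JC": 2,  "HV": 3},
}
N = 4  # number of documents

def tf(term, doc):
    """Log-scaled term frequency (one common variant among several)."""
    return math.log10(1 + counts[term][doc])

def idf(term):
    """Inverse document frequency: terms rare across documents score higher."""
    df = sum(1 for c in counts[term].values() if c > 0)
    return math.log10(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)
```

Note that "good" occurs in all four plays, so its idf is 0 and its tf-idf weight vanishes everywhere: exactly the down-weighting of uninformative common words motivated above.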
Lecture 4
Pointwise Mutual Information
Consider two random variables, X and Y.
Question
Do two events x and y occur together more often than they would if X and Y were independent?
If they are independent, then P(x, y) = P(x) P(y). PMI measures the deviation: PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ].
PMI for word vectors
For a word w and its context word c, PMI(w, c) = log2 [ P(w, c) / (P(w) P(c)) ]; each probability can be estimated using the counts we already computed.
Some have found benefit from truncating PMI at 0 (positive PMI): PPMI(w, c) = max(PMI(w, c), 0).
Negative PMI: words occur together less than we would expect (they are anti-correlated).
These anti-correlations may need more data to estimate reliably.
But, negative PMIs do seem reasonable.
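A sketch of PPMI from raw word-context cooccurrence counts; the counts below are hypothetical:

```python
import math
from collections import Counter

def ppmi_table(pair_counts):
    """PPMI(w, c) = max(0, log2(P(w, c) / (P(w) P(c)))), probabilities by MLE."""
    total = sum(pair_counts.values())
    w_counts, c_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        w_counts[w] += n
        c_counts[c] += n
    table = {}
    for (w, c), n in pair_counts.items():
        pmi = math.log2((n / total) /
                        ((w_counts[w] / total) * (c_counts[c] / total)))
        table[(w, c)] = max(0.0, pmi)   # truncate negative PMI at 0
    return table

# Hypothetical counts: 'chef' cooccurs with 'cook' far more than chance predicts.
counts = {("chef", "cook"): 8, ("chef", "the"): 2,
          ("cat", "cook"): 1, ("cat", "the"): 9}
ppmi = ppmi_table(counts)
```

Here ("chef", "cook") gets a positive weight while the anti-correlated ("cat", "cook") is truncated to 0, illustrating why PPMI discards the hard-to-estimate negative values.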
Word2Vec
Instead of counting, train a classifier (neural network) to predict context.
Training is self-supervised: no annotated data is required, just raw text. Word embeddings are learned via backpropagation.
Definition
CBOW (Continuous bag-of-words): learning representations that predict a word given a bag of context (many-to-one-prediction).
Given context, need to predict the centre word.
Definition
Skipgram: Learning representations that predict the context given a word.
Given centre word, and need to predict surrounding words.
Skipgram
This is a log-linear model.
‘it is a far, far better rest that I go to, than I have ever known’
CBOW
Use the context to predict the centre word
Information we get from the ordering of words is lost with bag of words.
Skipgram with negative sampling
The denominator of the softmax is very expensive: it requires the dot product of the input vector with every row of the output embedding matrix at every step.
The vocabulary size |V| is typically large, ranging from tens of thousands to millions of entries.
Very expensive: each update costs O(|V|).
Treat the target word and neighbouring context word as positive examples. Randomly sample other words outside of context to get negative samples.
Learn to distinguish between positive and negative samples with a binary classifier.
The binary classifier uses the sigmoid of the dot product, which is essentially a two-way softmax between the dot product and zero.
Generalizing Firth's "You shall know a word by the company it keeps" to "You shall know anything by the company it keeps": Node2Vec, Concept2Vec, World2Vec.
Classification
The simplest user-facing NLP application.
flowchart LR
Text --> Model
Model --> Category
Rule-Based Classifier
Sentiment classification of sentence x (classes: positive, negative).
If x contains words in [good, excellent, extraordinary, …], return positive.
If x contains words in [bad, terrible, awful, …], return negative.
This has nice interpretability and can be very accurate. But rules are difficult to define, system can be very complicated, and is hardly generalizable.
Statistical Classifiers: General Formulation
Data: A set of labeled sentences {(x_i, y_i)}.
- x_i: a sentence (or a piece of text)
- y_i: label, usually represented as an integer
Deliverable: A classifier f that takes an arbitrary sentence x as input, and predicts the label y.
Inference: solve ŷ = argmax_y P(y | x; θ). Modeling: define the function. Learning: choose the parameters θ.
First approach to the modeling problem, Naïve Bayes.
Probabilistic model: P(y | x), estimated from the data.
Question
How to estimate P(y | x) to ensure sufficient generalizability?
The Bayes rule: P(y | x) = P(x | y) P(y) / P(x) ∝ P(x | y) P(y).
Unigram assumption: P(x | y) = Π_i P(w_i | y), where w_1, …, w_m are the words in x.
Parameters θ: a look-up table that stores P(w | y) and P(y).
Each P(w | y) and P(y) can be estimated by counting from the corpora.
This is the estimation that maximizes dataset probability.
Issue: an unseen word in a certain class will lead to a zero probability (same problem as in CS451)
Solution: Smoothing
Inference: ŷ = argmax_y P(y) Π_i P(w_i | y).
Smoothing applies to words seen in the training data. If the word never existed in the training data, we just ignore it.
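Putting estimation, add-1 smoothing, and inference together, a toy Naive Bayes classifier might look like this (the two training sentences below are invented):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-1 (Laplace) smoothing; a toy sketch."""

    def fit(self, sentences, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)                     # counts for P(y)
        self.word_counts = {y: Counter() for y in self.classes}
        for sent, y in zip(sentences, labels):
            self.word_counts[y].update(sent.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, sentence):
        scores = {}
        for y in self.classes:
            total = sum(self.word_counts[y].values())
            score = math.log(self.prior[y] / sum(self.prior.values()))
            for w in sentence.split():
                if w not in self.vocab:
                    continue                             # ignore unseen words
                # Add-1 smoothing: (count + 1) / (total + |V|)
                score += math.log((self.word_counts[y][w] + 1)
                                  / (total + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.fit(["good excellent movie", "terrible awful movie"], ["pos", "neg"])
```

Working in log space avoids numerical underflow from multiplying many small probabilities, and the smoothing guarantees no in-vocabulary word zeroes out a class.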
Lecture 5
Recap
Logistic function: σ(z) = 1 / (1 + e^{−z}), with σ(z) ∈ (0, 1).
Logistic Regression: Modeling
Suppose we can represent sentence x with a vector x.
Probabilistic model: P(y = 1 | x) = σ(w · x + b).
w · x + b can be written as θ · x′ if x has a constant dimension, by appending a constant 1 to x.
Logistic Regression: Learning
Objective: Maximizing the dataset probability, under the assumption that each example is sampled independently.
For better numerical behavior, we usually take the negative logarithm of the probability as the loss and minimize it: L(θ) = −Σ_i log P(y_i | x_i; θ).
Gradient descent: θ ← θ − η ∇_θ L(θ).
Logistic Regression: Inference
Compute whether class 0 or 1 has larger probability.
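A minimal sketch of binary logistic regression trained with per-example gradient steps; the toy dataset and hyperparameters below are invented:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(data, lr=0.5, epochs=200):
    """Minimize the negative log-likelihood with per-example gradient steps.
    The bias is folded in as a constant feature of value 1."""
    dim = len(data[0][0]) + 1
    theta = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            x = list(x) + [1.0]                     # append constant 1 for bias
            p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for i in range(dim):
                theta[i] -= lr * (p - y) * x[i]     # gradient of -log P(y|x)
    return theta

def predict(theta, x):
    x = list(x) + [1.0]
    return int(sigmoid(sum(t * xi for t, xi in zip(theta, x))) >= 0.5)

# Toy data: label 1 iff the first feature exceeds the second.
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
theta = train_logreg(data)
```

The per-example gradient (p − y) x follows from differentiating −log P(y | x) for the sigmoid model; batch gradient descent would simply sum these terms.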
Question
What if there are more than 2 classes?
- 1 vs. 1 for all k(k − 1)/2 class pairs and do voting; or
- 1 vs. all with k classifiers and take the argmax.
The probabilistic interpretation over classes no longer holds for multi-class classification built from binary logistic regressions.
Generative vs. Discriminative Model for Classification
- Generative models: P(x | y) is accessible when modeling P(y | x).
- Discriminative models: P(y | x) is directly modelled.
Question
What are the key differences?
Difference: whether you can generate a new data example once you have the model. Naïve Bayes is a generative model; logistic regression is a discriminative model.
Neural-Network Classifier
A neural network is a function. It has inputs and outputs. Neural modeling now is better thought of as dense representation learning.
With a neural-network-based function f, we input x and collect a vector h; P(y | x) is defined by selecting the corresponding entry of the softmax output.
Common Neural Network Notations:
- v: a vector
- v_i: the i-th entry in the vector
- M: a matrix
- M_{ij}: the entry at row i, column j of the matrix
Perceptron
If the dot product between w and x is less than 0, then the predicted category for x will be 0.
Predict the label: ŷ = 1[w · x ≥ 0].
Update weights: w ← w + (y − ŷ) x.
| y | ŷ | What happens |
|---|---|---|
| 0 | 0 | Nothing |
| 1 | 1 | Nothing |
| 1 | 0 | w ← w + x |
| 0 | 1 | w ← w − x |
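The update table translates directly into code; the toy dataset below (with a constant bias feature appended to each input) is invented:

```python
def perceptron_predict(w, x):
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= 0)

def perceptron_train(data, epochs=20):
    """Perceptron learning rule: w <- w + (y - y_hat) * x.
    Correct predictions change nothing; a missed positive adds x to w,
    a false positive subtracts x from w."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            y_hat = perceptron_predict(w, x)
            if y_hat != y:
                w = [wi + (y - y_hat) * xi for wi, xi in zip(w, x)]
    return w

# Toy linearly separable data; the last feature is a constant bias term.
data = [([1.0, 0.0, 1.0], 1), ([0.9, 0.2, 1.0], 1),
        ([0.0, 1.0, 1.0], 0), ([0.1, 0.8, 1.0], 0)]
w = perceptron_train(data)
```

On linearly separable data the perceptron convergence theorem bounds the total number of mistakes, so after enough epochs the returned weights classify all training examples correctly.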
Neural Layer: Generalized Perceptron
A neural layer = affine transformation + nonlinearity: h = f(Wx + b).
Output is a vector (results from multiple independent perceptrons). Can have other activation functions for nonlinearity.
Stacking Neural Layers
Multiple neural layers can be stacked together
We use the output of one layer as input to the next. This is a feed-forward network (with fully-connected layers), also called a multi-layer perceptron (MLP).
Nonlinearities (activation function)
f can be applied to each entry in a vector in an element-wise manner. Common activation functions: sigmoid, tanh, and ReLU.
Question
Why nonlinearities?
Otherwise stacking neural layers results in a simple affine transformation.
Nonlinearity: sigmoid(x) = 1 / (1 + e^{−x})
Nonlinearity: tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
Nonlinearity: ReLU(x) = max(0, x)
Sentiment Classification with Neural Networks
Empirically, we don't pass the final layer's output into an activation function; the logits go directly into the softmax.
Question
How can we get a vector x for a sentence?
Average word embeddings, or more complicated neural network structures.
Neural Network: Training
Maximize the probability (minimizing the negative log probability) of gold standard label.
Also called cross entropy loss.
Backpropagation
Chain rule: suppose y = f(u) and u = g(x); then dy/dx = (dy/du)(du/dx).
Question
Now we have ∂L/∂h; how should we update the weight matrix W?
What is actually happening: view W as a concatenation of its entries.
We calculate ∂L/∂W_{ij} entry by entry, as they are scalars, and then reshape the results to get the original shape of the matrix.
Visualization of an Average Word Vector Classifier
Common Neural Architectures
Convolutional Neural Networks
Introduced for vision tasks; also used in NLP to extract feature vectors.
We apply kernels (filters) to image patches (local receptive fields). The kernels are learnable.
Take the dot product between the filter and the stretched word embeddings.
Lecture 6
Pooling
Each kernel/filter extracts one type of feature.
A kernel’s output size depends on sentence length. A fixed dimensional vector is desirable for MLP inputs.
Solution: Mean pooling/max pooling converts a vector to a scalar.
Final feature: Concatenating pooling results of all filters.
Word order matters
Example
Kernel size 2:
'a cat drinks milk' → (a cat), (cat drinks), (drinks milk)
'a milk drinks cat' → (a milk), (milk drinks), (drinks cat)
An n-gram "matches" with a kernel when they have a high dot product.
Drawbacks
Cannot capture long-term dependencies. Often used for character-level processing: filters look at character n-grams.
Recurrent Neural Networks (RNNs)
Idea: Apply the same transformation to tokens in time order.
flowchart LR
a[x_t-1]
b[x_t]
c[x_t+1]
d[h_t-1]
e[h_t]
f[h_t+1]
a --> d
d --> e
b --> e
e --> f
c --> f
Gradient update for the recurrent weight matrix.
Suppose h_T is the representation passed to the classifier.
We can easily calculate ∂L/∂h_T.
Question
What about ∂L/∂h_1?
Bug
An important issue of simple RNNs.
The absolute values of the entries of the gradient grow or vanish exponentially with respect to the sequence length.
This motivates the development of more advanced RNN architectures.
Long Short-Term Memory Networks (LSTMs)
Designed to tackle the gradient vanishing problem.
Idea: Keep the entries of h_t and c_t in the range of (−1, 1).
Gated Recurrent Units
Fewer parameters and generally works quite well.
- Update gate: z_t = σ(W_z x_t + U_z h_{t−1})
- Reset gate: r_t = σ(W_r x_t + U_r h_{t−1})
The "forget" weight is just 1 minus the "input" weight: h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t. The reset gate is applied to the previous hidden state's contribution: h̃_t = tanh(W x_t + U(r_t ⊙ h_{t−1})).
RNN: Practical Approaches
Gradient clipping: gradients sometimes become very large even with LSTMs. Empirical solution: after calculating gradients, require the norm to be at most a threshold τ, set as a hyperparameter.
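Norm-based clipping is a few lines; the threshold below is arbitrary:

```python
import math

def clip_gradient(grad, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm,
    preserving its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad

clipped = clip_gradient([30.0, 40.0], max_norm=5.0)  # norm 50 -> rescaled to 5
```

In practice the norm is computed jointly over all parameters' gradients rather than per vector, but the rescaling step is the same.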
At time step t, what matters to h_t is mostly x_{t′} where t′ is close to t.
Bidirectional modeling typically results in more powerful features.
Recursive Neural Networks
Run constituency parser on sentence, and construct vector recursively. All nodes share the same set of parameters.
flowchart BT
a[h5]
b[h1]
c[h4]
d[h2]
e[h3]
b --> a
c --> a
d --> c
e --> c
Tree LSTMs typically work well. A slight modification of LSTM cells is needed.
Recursive neural networks with left-branching trees are basically equivalent to recurrent neural networks.
Syntactically meaningful parse trees are not necessary for good representations. Balanced trees work well for most tasks.
Attention
Attention can be thought of as a weighted sum: each token receives a weight.
From (unweighted) bag of words to (weighted) bag of words. Each word receives a fixed weight; normalize the weights with softmax.
Parameterized attention: Word tokens with the same word type should probably receive different weights in different sentences. Implement attention with an MLP.
Self-Attentive RNNs
The last hidden state of RNN could be a bad feature. Why?
At time step t, what matters to h_t is mostly x_{t′} where t′ is close to t.
Attention weights over RNN hidden states could be bad indicators of which token is more important.
Lecture 7
Transformer: Attention-based sentence encoding, and optionally, decoding.
Idea: Every token has “attention” to every other token.
For a sentence with n tokens x_1, …, x_n, compute Q = X W_Q, K = X W_K, V = X W_V, and Attention(Q, K, V) = softmax(QKᵀ / √d) V.
W_Q, W_K, W_V are trainable parameters.
Question
What is each of Q, K, and V for?
Question
Consider the dot product q · k: if each entry in both vectors is drawn from a distribution with zero mean and unit variance, what happens as the dimensionality grows?
The variance of the dot product grows linearly with the dimensionality.
For independent zero-mean, unit-variance random variables X and Y, recall (STAT230): E[XY] = E[X] E[Y] = 0.
For independent zero-mean, unit-variance random variables X and Y: Var(XY) = E[X²Y²] − (E[XY])² = 1.
If we have independent zero-mean, unit-variance variables X_1, …, X_d and Y_1, …, Y_d, then Var(Σ_i X_i Y_i) = d.
Transformer Encoder
The application of the 1/√d scaling is theoretically motivated: it keeps the variance of the attention logits at 1.
See also Xavier initialization: initialize parameters with values drawn from a zero-mean distribution whose variance scales inversely with the layer dimensionality.
Positional Encoding
Without positional information, the columns of the output for "a cat" are a permutation of the columns of the output for "cat a".
The choice of sinusoidal functions is somewhat arbitrary, but it's overall theoretically motivated: the relation between a position and a position shifted by a fixed offset can be represented by a linear transformation.
Proof idea: Use the addition theorems on trigonometric functions.
Limitation: Only fixed number of positions are available. Another option is learnable positional encoding.
Multi-Head Attention
We can run multiple sets of W_Q, W_K, W_V in parallel with different random initializations (and hope they learn different ways to attend to tokens).
We stack transformer layers. Earlier layers' outputs are added to the vector stream (residual connections).
Normalization: preserve variance. For example, layer normalization rescales each vector to zero mean and unit variance.
Lecture 8
Recap: Distributional Semantics
Language Models: General Formulation
Language models compute a probability distribution over strings in a language: p(s), where s = w_1 w_2 … w_n is a string with n tokens.
Language modeling: Assign probabilities to token sequences.
Modeling: Define a statistical model , where is a string.
Learning: Estimate the parameters from data.
Goal: compute the probability of a sequence of words, P(w_1, w_2, …, w_n).
Relatedly, modeling the probability of the next word: P(w_n | w_1, …, w_{n−1}).
Relatedly, modeling the probability of a masked word given its context: P(w_i | w_1, …, w_{i−1}, w_{i+1}, …, w_n).
A model that computes any of the above is called a language model. A good language model will assign higher probabilities to sentences that are more likely to appear.
Question
How do we model this?
Recap: the chain rule of probability: P(w_1, …, w_n) = Π_{i=1}^{n} P(w_i | w_1, …, w_{i−1}).
We haven’t yet made independence assumptions.
This is autoregressive language modeling.
Important detail: Modeling length
A language model assigns probability to token sequences w_1, …, w_n. The sequence can be of any length, and the probabilities should sum to 1 across all possible sequences of all lengths.
Usually length is modelled by including a stop symbol </s> at the end of each sequence. Predicting </s> = modeling stopping probabilities. Relatedly, a start symbol <s> should be added to the beginning.
Language model with both start and stop symbols: P(w_1, …, w_n, </s> | <s>).
We need to ensure: Σ_s P(s) = 1 over all finite sequences s.
Consider removing stopping probabilities
Question
What happens if we don’t model stopping probability?
The probabilities of all possible length-1 sequences already sum to 1, and so do those of all length-2 sequences; summing across lengths, the total exceeds 1.
With the stop symbol
Proof sketch: Once you reach the </s> token after sampling some certain sequence, certain probability mass is taken away, because </s> is in the same distribution as other vocabulary entries.
Longer sequences that have the same prefix share the remaining probability.
Alternatively, we can model the length explicitly (e.g., using a zero-truncated Poisson distribution).
Estimating language model probabilities.
<s>I do not like green eggs and ham</s>
We can use maximum likelihood estimation (from STAT231): P(w_i | w_1, …, w_{i−1}) = count(w_1 … w_i) / count(w_1 … w_{i−1}).
Problem: We will never have enough data.
Markov Assumption
Independence assumption: The next word only depends on the most recent past (most recent words). Reduces the number of estimated parameters in exchange for modeling capacity.
1st order Markov: P(w_i | w_1, …, w_{i−1}) ≈ P(w_i | w_{i−1}).
2nd order Markov: P(w_i | w_1, …, w_{i−1}) ≈ P(w_i | w_{i−2}, w_{i−1}).
N-Gram Language Models
P(w_i): Unigram language model
P(w_i | w_{i−1}): Bigram language model
Recap: P(w_1, …, w_n) ≈ Π_i P(w_i | w_{i−N+1}, …, w_{i−1}) for an N-gram model.
Modeling, learning, and inference
Estimating n-gram probabilities
Maximum likelihood estimate (MLE): P(w_i | w_{i−1}) = count(w_{i−1}, w_i) / count(w_{i−1}).
Training data:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
A few estimated bigram probabilities given our MLE estimator: P(I | <s>) = 2/3, P(Sam | <s>) = 1/3, P(am | I) = 2/3, P(do | I) = 1/3, P(Sam | am) = 1/2, P(</s> | Sam) = 1/2.
Test data:
<s> I like green eggs and ham</s>
Problem: P(like | I) = 0, which makes the whole test sentence receive probability 0. This is an over-penalization.
Smoothing in n-gram LMs: just add 1 to all counts. This is Laplace smoothing.
MLE estimate: P(w_i | w_{i−1}) = c(w_{i−1}, w_i) / c(w_{i−1}).
Add-1 estimate: P(w_i | w_{i−1}) = (c(w_{i−1}, w_i) + 1) / (c(w_{i−1}) + |V|).
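Both estimators can be computed directly from the toy corpus above:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigrams, context_counts = Counter(), Counter()
vocab = set()
for sent in corpus:
    toks = sent.split()
    vocab.update(toks[1:])          # every token that can follow a context
    for a, b in zip(toks, toks[1:]):
        bigrams[(a, b)] += 1
        context_counts[a] += 1

def p_mle(w, prev):
    return bigrams[(prev, w)] / context_counts[prev]

def p_add1(w, prev):
    # Add-1 smoothing: no in-vocabulary bigram gets zero probability.
    return (bigrams[(prev, w)] + 1) / (context_counts[prev] + len(vocab))
```

Under MLE, P(like | I) = 0 because "I like" never occurs in training; add-1 smoothing raises it to 1/(3 + |V|) at the cost of slightly discounting the observed bigrams.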
Greedy search: Choose the most likely word at every step.
To predict the next word given the previous two words: ŵ = argmax_w P(w | w_{i−2}, w_{i−1}).
The problem is that we will generate the same sentence every time, and may miss higher-probability continuations that only pay off multiple words down the line.
Bigram model - sampling
- Generate the first word: w_1 ∼ P(w | <s>).
- Generate the second word: w_2 ∼ P(w | w_1).
- Generate the third word: w_3 ∼ P(w | w_2). …
Generating from a language model
Two effective sampling strategies exclude tokens with very low probabilities: top-$k$ vs. top-$p$ sampling.
Top-$k$: sort the vocabulary by next-token probability and only take the top $k$ tokens.
Top-$p$: do the same thing, but instead of fixing the number of tokens, fix the portion of cumulative probability mass we are considering. The number of tokens we take is a function of $p$.
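A minimal numpy sketch of both filters on a toy distribution (function names are illustrative):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest descending-probability prefix with mass >= p."""
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    n_keep = np.searchsorted(csum, p) + 1   # number of tokens kept
    out = np.zeros_like(probs)
    out[order[:n_keep]] = probs[order[:n_keep]]
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
```

With $p = 0.9$, three tokens survive here; with a sharper distribution fewer would, which is exactly why the kept count is a function of $p$.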
A good language model will assign higher probabilities to sentences that are more likely to appear.
Compute probability on held-out data. Standard metric is perplexity.
Probability of held-out sentences: $\prod_s P(s)$.
Let's work with log-probabilities: $\sum_s \log P(s)$.
Divide by the number of words $N$ (including stop symbols) in the held-out sentences.
From probability to perplexity.
Average token log-probability of held-out data: $\ell = \frac{1}{N} \sum_s \log P(s)$.
Perplexity: $\mathrm{PPL} = \exp(-\ell)$.
The lower the perplexity, the better the model
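A short sketch of the perplexity computation from per-token log-probabilities (assuming natural log throughout):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/N) * sum of per-token log-probs), N incl. stop symbols."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)
```

For example, a model that always assigns probability 1/10 to the correct token has perplexity 10.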
Lecture 9
Neural Language Modeling
Recap: Language Modeling as Classification
This is just a probabilistic classification problem.
Neural Trigram Language Model
Given two previous words, compute probability distribution over possible next words.
Input is concatenation of vectors (embeddings) of previous two words.
Output is a vector containing probabilities of all possible next words.
To get the output distribution, do a matrix multiplication of the parameter matrix and the input, then apply a softmax transformation.
Neural Trigram Language Model
$i$ denotes indices of sentences, $j$ denotes indices of tokens.
Learn the parameters via backpropagation.
Trigram vs Neural Trigram LMs
Trigram language model: Separate parameters for every combination of $w_{i-2}$, $w_{i-1}$, $w_i$, so approximately $V^3$ parameters. The number of parameters is exponential in the $n$-gram size. Most parameters are zero (even with smoothing).
Neural trigram language model: Only has $O(V \cdot d)$ parameters. The embedding dimensionality $d$ can be chosen to scale the number of parameters up or down, and the count is linear in the $n$-gram size. Almost no parameters are zero. No explicit smoothing, though smoothing is done implicitly via distributed representations.
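A rough numpy sketch of the neural trigram forward pass under these assumptions (random toy parameters; names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def neural_trigram_forward(w1, w2, E, W, b):
    """P(next | w1, w2): concatenate two embeddings, affine map, softmax."""
    x = np.concatenate([E[w1], E[w2]])  # input: 2d-dimensional vector
    return softmax(W @ x + b)           # output: distribution over V words

rng = np.random.default_rng(0)
V, d = 8, 4
E = rng.normal(size=(V, d))       # embedding table: V * d parameters
W = rng.normal(size=(V, 2 * d))   # output weights:  V * 2d parameters
b = np.zeros(V)
p = neural_trigram_forward(1, 2, E, W, b)
```

Counting `E`, `W`, and `b` makes the parameter comparison concrete: linear in $V$ and $d$, rather than $V^3$.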
Removing N-Gram Constraints
RNN Language Models
Hidden state is a function of the previous hidden state and the current input. The same weights are used at each step.
The vector for each word is combined with the history vector. Use the last hidden state as the representation of the sequence so far, instead of averaging over the previous states.
Learn the parameters via backpropagation (through time).
Transformer Language Models
A token “attends” to all previous tokens.
Use the resulting hidden state as the feature to predict the next token.
Language models encode knowledge about language.
The pre-training-finetuning paradigm: Language modeling, as the pre-training task, helps encode knowledge. The knowledge helps with downstream tasks. Use the hidden state as the feature for the downstream task.
Masked Language Models
Motivation: learning useful representations of text.
Replace the token at position $i$ with a placeholder ([MASK]). Use the hidden state at position $i$ as the feature to predict the token at position $i$.
Probing
Question
What is encoded in a trained language model?
Empirical answer: linguistic probe.
Take a fixed model as the "frozen" feature extractor, and train a lightweight model (the probe) to predict labels.
Frozen: the base model never gets updated when training the lightweight model.
Confounding
Question
Does above-chance performance on held-out data mean the model encodes part-of-speech knowledge?
Not necessarily: the model might just encode word identity, and the probe learns to group words together.
Solution (control tasks): only draw the conclusion if performance on the main task is significantly better than on the control task.
Syntax: Constituency
Sentence: the cat is cute
Bracketing: ((the cat) (is cute))
Task: given any span of words, is it a constituent?
Pooling is required, as candidate constituents may have different lengths but an MLP can only take a fixed-dimensional vector.
Task: Given a constituent, what’s the label?
Constituent labels: Syntactic Substitutability
Pooling is required to accommodate the variable length of constituents.
Lecture 10
Phrase Structures/Constituency Grammar
Constituency grammars focus on the constituent relation.
Informally: Sentences have hierarchical structures.
A sentence is made up of two pieces:
- Subject, typically a noun phrase (NP)
- Predicate, typically a verb phrase (VP)
NPs and VPs are made up of pieces:
- A cat = (a + cat)
- Walked to the park = (walked + (to + (the + park)))
Each parenthesized phrase is a constituent in the constituent parse.
Constituent: A group of words that functions as a single unit.
Linguists try to determine constituents via constituency tests. A constituency test follows some rules to construct a new sentence, focusing on the constituent candidate of interest. If the constructed sentence looks good, we have some evidence of constituency.
Consider the sentence: Drunks could put off the customers.
Constituency Test: Coordination
Coordinate the candidate constituent with something else.
- Drunks could [put off the customers] and sing.
- Drunks could put off [the customers] and the neighbours.
- Drunks [could] and [would] put off the customers.
Constituency Test: Topicalization
Moving the candidate constituent to the front. Modal adverbs can be added to improve naturalness.
- [The customers], drunks certainly could put off.
- *[Customers], drunks could certainly put off the. (ungrammatical)
Constituency Test: Deletion
Delete the span of interest. Word orders can be changed to improve naturalness.
- Drunks could put off the customers [in the bar].
- Drunks could put off the customers [in the] bar.
Constituency Test: Substitution
Substitute the candidate constituent with the appropriate proform (pronoun, pro-verb, etc.). Slight word order adjustment is allowed to improve naturalness.
- Drunks could [do so = put off the customers].
- Drunks could put [them = the customers] off.
- Drunks could put the [them = customers] off.
Constituency Parsing as Bracketing
Question
Brackets: Which spans of words are the constituents in a sentence?
Sentence: The man walked to the park.
Bracketing: ((the man) (walked (to (the park))))
The brackets can be translated into trees.
There are categories associated with constituents.
flowchart TD
a[NP]
b[NP]
c[the]
d[the]
S --> a
a -----> c
a -----> man
PP ----> to
PP --> b
b ---> d
b ---> park
S --> VP
VP -----> walked
VP --> PP
The internal nodes are called nonterminals; the leaves are called terminals. Nonterminals connect to pre-terminals, which connect to terminals.
The head of a constituent is the most responsible/important word for the constituent label.
Question
Which word makes ‘the cat’ an NP?
Cat
Question
Which word makes ‘walked to the park’ a VP?
Walked
There are syntactic ambiguities. Consider the sentences ‘time flies like an arrow’ and ‘fruit flies like a banana’.
flowchart TD
a[NP]
b[NP]
c[NN]
d[NN]
S --> a
a --> c
c --> time
S --> VP
VP --> V
VP --> PP
V --> flies
PP --> P
PP --> b
P --> like
b --> DT
DT --> an
b --> d
d --> arrow
flowchart TD
x[NP]
b[NP]
c[NN]
d[NN]
S --> x
x --> ADJ
x --> c
ADJ --> fruit
c --> flies
S --> VP
VP --> VBP
VP --> b
VBP --> like
b --> DT
b --> d
DT --> a
d --> banana
NLP Task: Constituency Parsing
Given a sentence, output its constituency parse. Widely studied task with a rich history. Most studies are based on the Penn Treebank.
Constituency parsing: general formulation.
The score of a tree is defined by the sum of its constituent (span) scores: $s(T) = \sum_{(i, j) \in T} s(i, j)$.
Each span score can be modeled with a neural network.
Training objective: Let the collection of the true spans have the highest accumulated span scores among all parses.
Question
How do we solve this?
Constituency Parsing: Inference (solving $\hat{T} = \arg\max_T s(T)$).
Let's first assume $T$ is a binary unlabeled parse tree. Each node is either a terminal node or the parent of two other nodes. There is one root node.
The simplified CKY Algorithm
$\mathrm{best}(i, j)$: the maximum sum of subtree scores if span $[i, j]$ is a constituent: $\mathrm{best}(i, j) = s(i, j) + \max_{i \le k < j} [\mathrm{best}(i, k) + \mathrm{best}(k+1, j)]$.
$\mathrm{best}(1, n)$: the maximum possible sum of subtree scores if the sentence is fully parsed.
Edge case: $\mathrm{best}(i, i) = s(i, i)$ for single-word spans.
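A compact sketch of the simplified CKY recursion, using half-open spans $[i, j)$ and an arbitrary span-scoring function (the names and the span convention are my own):

```python
def cky_best(n, s):
    """Max total score of a binary tree over [0, n), where s(i, j) scores
    the half-open span [i, j) as a constituent."""
    best = {(i, i + 1): s(i, i + 1) for i in range(n)}  # single-word spans
    for length in range(2, n + 1):          # fill in spans bottom-up
        for i in range(n - length + 1):
            j = i + length
            split = max(best[(i, k)] + best[(k, j)] for k in range(i + 1, j))
            best[(i, j)] = s(i, j) + split
    return best[(0, n)]
```

In practice `s(i, j)` would come from a neural span scorer; here any function of the span endpoints works.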
Context-Free Grammar (CFG) (CS241)
A generative way to describe constituency parsing. A CFG defines some “rewrite rules” to rewrite nonterminals as other nonterminals or terminals.
In previous ‘the man walked to the park’ example, we would have a sequence of rewrites corresponding to a bracketing.
Question
Why context-free?
A rule to rewrite NP does not depend on the context of that NP. The left-hand side (LHS) of a rule is only a single nonterminal (without any other context).
Probabilistic Context-Free Grammar (PCFG)
We assign probabilities to rewrite rules.
Probabilities must sum to 1 for each left-hand-side nonterminal. Given a sentence $x$ and its tree $T$, the probability of generating $x$ with rules in grammar $G$ is $P(x, T) = \prod_{r \in T} P(r)$, where $r$ denotes a rule.
Given a treebank, what is the MAP estimation of the PCFG?
A PCFG assigns probabilities to sequences of rewrite operations that terminate in terminals; each such sequence implies the natural-language yield and bracketing of the sentence.
CKY with PCFG Formalism
Find the max-probability tree for a sentence: $\hat{T} = \arg\max_T P(x, T)$.
$\pi(i, j, A)$: the maximum possible log probability that the words within range $[i, j]$ are the outcome of nonterminal label $A$.
Edge case: set $\pi(i, i, A) = \log P(A \to w_i)$ when word $w_i$ could have label $A$, and $-\infty$ otherwise.
Inside algorithm
Find the probability of generating a sequence from a certain nonterminal (summing over all possible trees).
The Chomsky Normal Form (CNF)
For any free-form PCFG, there exists an equivalent PCFG in which each node has either zero or two children, i.e., every rule is of the form $A \to B\,C$ or $A \to a$. Trees satisfying this condition are said to be in Chomsky normal form.
To go from constituency to dependency, we need to propagate lexical heads up the tree, remove non-lexical nodes, and merge redundant nodes. The result is the dependency parse tree.
Dependency parses directly model the relation between words.
Lecture 11
Grounded Semantics: Meanings demonstrated from other sources of data in addition to the language systems.
Distributional Semantics
A bottle of tezgüino is on the table. Everybody likes tezgüino. Don’t have tezgüino before you drive.
Visually grounded semantics
Picture of tezgüino.
Symbol Grounding Problem
Symbol meaning: How do we make sense of symbols?
Practical implication: Enable reliable, meaningful interaction between language models and humans/the physical world.
Whether this is a stochastic parrot or semantic comprehension is under debate.
Grounding can be categorized into:
- A: Referential grounding
- B: Sensorimotor grounding
- C: Relational grounding
- D: Communicative grounding
- E: Epistemic grounding
A, B, C, E: Semantic (static) grounding
D: Communicative (dynamic) grounding
Recap: Text-Only Language Models
Two types of text-only ungrounded language models:
Autoregressive models, good for generation.
Masked language models, good for feature extraction.
Incorporating visual signals leads to two families of vision-language models.
BERT-style: joint visual-semantic embeddings.
GPT-style: generative vision-language models.
Joint Visual-Semantic Embeddings (and CLIP)
Idea: Encode visual and textual information into a shared space.
Embedding: Vector space
- Text encoder (turn text into vector)
- Image encoder (turn image into vector)
Design a loss function to “align” the two vector spaces. End up with one single vector space that encodes both.
Training data: Images and their descriptions.
Visual encoding: Convert an image to a fixed-dimensional vector representation.
Joint Visual-Semantic Embedding Space Objective
Idea: Matched image-caption pair should be closer than mismatched pairs in the embedding space.
This is a triplet-based hinge loss: anything semantically close should be close in the joint embedding space, anything far should be far.
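A minimal sketch of a triplet hinge loss with cosine similarity (the margin value and names are illustrative, not the exact formulation from the lecture):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_hinge_loss(img, pos_txt, neg_txt, margin=0.2):
    """max(0, margin - sim(img, matched text) + sim(img, mismatched text))."""
    return max(0.0, margin - cosine(img, pos_txt) + cosine(img, neg_txt))
```

The loss is zero once the matched caption is at least `margin` closer to the image than the mismatched one, so well-separated pairs stop contributing gradient.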
Properties of the Joint Space
Images and text are close in a good joint embedding space if they are semantically related.
Example applications:
- Bidirectional image-caption retrieval
- Image captioning
Text in the training data can be at any level of granularity (words, phrases, sentences, paragraphs, etc.)
Contrastive Language-Image Pretraining (CLIP)
Image-to-text retrieval: Given a pool of text, model the probability of choosing the correct text; and vice versa.
Generative vision-language models
Recap: Generative autoregressive language models
Text-only language models: Predicting the next token conditioned on the history.
Extending to Vision-Language Models (VLMs).
General VLM: Training Objective
Loss function only calculated on textual positions.
Fine-Grained Vision-Language Tasks
- Object retrieval (assuming objects’ bounding boxes are given).
- Cognitive plausibility: Recognizing objects is very easy for humans.
- Multimodal coreference resolution (without assuming bounding boxes)
- Phrase grounding: Mapping phrases to objects in the image.
- Dense captioning (reverse): Write a short description for each detected object.
Limitations of current VLMs
- Lack of physical knowledge, and the neural architecture makes it hard to incorporate knowledge.
- Poor in recognizing spatial relations.
- Lack of cultural diversity representation.
Lecture 12
Definition
Large Language Model: A computational agent capable of conversational interaction and text generation.
Fundamentally: A probabilistic model that predicts the next token in a sequence based on context.
Core Mechanism: Conditional generation. Input is context (prefix), and output is the probability distribution over next tokens.
Pretraining
Dataset of 100B to > 5T tokens. Task is next-token prediction on unlabeled texts, and the output is a base model.
ELIZA (1966)
Joseph Weizenbaum’s rule-based system simulating a Rogerian psychologist. The ELIZA effect: Humans easily form emotional connections with machines. Limitation is that there is no real understanding, purely pattern matching/rules.
Modern LLMs: Unlike rule-based systems, these learn language patterns and world knowledge from vast corpora.
The Knowledge Bottleneck
Human learning
Children learn ~7-10 words/day to reach adult vocabularies of 30k-100k words. Most knowledge is acquired as a by-product of reading and contextual processing.
Machine Learning
Distributional Hypothesis: We learn meaning from the company words keep. The NLP revolution relies on this principle: learning syntax, semantics, and facts from data distribution.
Pretraining: The Core Idea
Definition
Pretraining: Learning knowledge about language and the world by iteratively predicting tokens in massive text corpora.
The Result: A pretrained model containing rich representations of syntax, semantics, and facts.
Foundation: Serves as the base for downstream tasks (QA, translation). This phase is self-supervised.
Question
What does a model learn by predicting the blank?
”With roses, dahlias, and peonies, I was surrounded by x” learns ontology (category: flowers)
“The room wasn’t just big it was x” learns semantics/intensity (enormous > big)
“The square root of 4 is x” learns arithmetic/math
”The author of ‘A room of One’s Own’ is x” learns world knowledge (Virginia Woolf)
Three major architectures
- Decoder-only (Generative: GPT-4, Llama)
- Encoder-only (Understanding: BERT)
- Encoder-decoder (Translation: T5, BART)
Decoder Architecture
Mechanism: Auto-regressive generation. Takes tokens as input and generates tokens one by one (left-to-right).
Causal: Masked attention ensures it can only see past tokens, not future ones.
Use Case: Generative tasks (text generation, code completion).
flowchart LR
a[The cat sat on]
b[Decoder-only model]
c[the]
a --> b
b --> c
Example
GPT-3, GPT-4, Llama, Claude, Mistral
We focus on decoders as they are the standard for generative LLMs.
Encoder Architecture
Mechanism: Outputs a vector representation (encoding) for each input token.
Bidirectional: Can see context from both left and right directions simultaneously.
Training: Typically masked language modeling.
Use case: Classification, sentiment analysis, NER.
flowchart LR
a["The [MASK] sat on the mat"]
b[Encoder-only model]
c[cat]
a --> b
b --> c
Example
BERT, RoBERTa
Encoder-Decoder Architecture
Mechanism: Maps an input sequence of tokens to a (potentially different) output sequence.
Key feature: Decouples input understanding from output generation. Input and output can have different lengths.
Use case: Translation (English → French), summarization.
flowchart LR
a[Hello world]
b[Encoder]
c[Decoder]
d[le]
e[Bonjour]
a --> b
b --> c
c --> d
e --> c
The “Black box” view
Input: Sequence of tokens (context)
The cat sat on the
Output: Probability distribution over the vocabulary for the next token
mat: 0.5 bench: 0.3 dog: 0.01
We sample from this distribution to generate the next word.
Self-supervised learning
The idea: We do not need manually labeled data; the text itself is the supervision. At every step $t$, predict the next word $w_t$ given the history $w_{<t}$. Since we possess the source text, we always know the correct next token.
We train the network to minimize Cross-Entropy Loss
It measures the difference between the model’s predicted distribution and the true distribution.
The loss simplifies to the negative log probability of the correct next word.
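This simplification can be written as a one-liner (assuming the model outputs a probability vector over the vocabulary):

```python
import math

def cross_entropy_onehot(pred_probs, target_index):
    """With a one-hot target, cross-entropy reduces to -log p(correct token)."""
    return -math.log(pred_probs[target_index])
```

A confident, correct prediction (probability near 1) yields a loss near 0; low probability on the correct token yields a large loss.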
Teacher Forcing
Inference vs Training
Inference: Errors accumulate as model predicts its own future context.
Training (Teacher Forcing): We do not use the model’s own prediction. We always feed the correct history sequence.
Advantage: Allows massive parallelization (all tokens trained simultaneously).
Training Visualization
- Input: Sequence of tokens
- Forward pass: Model computes distribution for every position simultaneously.
- Update: Calculate the loss over the batch and backpropagate to update weights
Conditional Generation
We model NLP tasks as conditional text generation.
The probability of a token given all previous tokens is $P(x_t \mid x_{<t})$.
Sampling loop: Compute probability distribution, sample token, add to context, repeat.
We go from logits to text by taking the logits (raw scores from the model), converting to probabilities via softmax, and then decoding (selecting the next token).
Greedy decoding
Algorithm: Always select the single token with the highest probability
Pros: Deterministic, optimal for short factual queries.
Cons: Often leads to repetitive, degenerate text. Misses creative paths.
Random Sampling
Algorithm: Sample the next token randomly according to the probability distribution $P(x_t \mid x_{<t})$. Effect: A word with 10% probability is chosen 10% of the time.
Pros: High diversity, human-like variance.
Cons: Tail risk, sampling a very low probability word that derails coherence.
Temperature Sampling
Rescale logits by a temperature parameter $T$ before the softmax: $p_i \propto \exp(z_i / T)$.
Low ($T < 1$): Confident/Greedy
High ($T > 1$): Diverse/Creative
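A small sketch of temperature scaling (names are illustrative):

```python
import numpy as np

def temperature_softmax(logits, T=1.0):
    """Divide logits by T before the softmax: T < 1 sharpens, T > 1 flattens."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

With the same logits, lowering $T$ concentrates mass on the top token; raising $T$ spreads it out.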
Top-k and Nucleus Sampling
Top-$k$: Only sample from the $k$ most likely words. Renormalize the weights and do random sampling over just those $k$.
Nucleus (top-$p$): Sample from the smallest set of words whose cumulative probability exceeds $p$ (e.g., 0.9).
Preview: Task Modeling
Almost any task can be cast as next-token prediction.
Sentiment Analysis
Input: “The sentiment of the sentence ‘I like Jackie Chan’ is:” Model compares: $P(\text{positive})$ vs. $P(\text{negative})$. Decision: Choose the token with the higher probability.
Question
What determines performance?
Performance (Loss) is determined by three main power-law factors:
- Number of parameters (model size)
- Number of tokens (dataset size)
- Compute budget (FLOPS)
Pretraining Corpora: LLMs are trained on massive datasets (trillions of tokens)
- Web crawls
- Quality data
- Code
Example
The Pile: An 825 GB corpus of diverse text sources.
Filtering: Quality
Raw web text is noisy, we must filter for quality.
- Heuristics: Removing short documents, excessive symbols, or malformed text
- Model-based: Train classifiers to distinguish high quality text from low-quality text
- Deduplication: Remove duplicate documents. Reduces memorization and improves generalization.
Filtering: Safety & Ethics
PII Removal: Scrubbing Personally Identifiable Information.
Toxicity Filtering: Removing hate speech and abusive content.
Error
Bias: Classifiers may be biased against minority dialects. Trade-off: Models trained on sanitized data may be worse at detecting toxicity themselves.
Ethical & Legal Issues
- Copyright
- Consent
- Privacy
- Skew
The Power Law of Scaling
Kaplan et al. (2020) empirically showed that loss scales as a power law in each resource, e.g. $L(N) \approx (N_c / N)^{\alpha_N}$ for parameter count $N$, with analogous laws for dataset size $D$ and compute $C$.
Implication: To improve performance, we must scale up $N$ and $D$ simultaneously.
Diminishing returns: To halve the error, you need exponentially more compute.
Compute-Optimal Scaling (Chinchilla)
The Chinchilla ratio: To train optimally for a fixed compute budget, scale parameters and data equally. (~20 tokens per parameter)
Example: A 70B model ↔ 1.4 trillion tokens (the current trend is to over-train models to make inference cheaper).
Approximate parameter count:
For a Transformer model, the number of non-embedding parameters is roughly $N \approx 12 \cdot n_{\text{layer}} \cdot d_{\text{model}}^2$, where
- $n_{\text{layer}}$: Number of layers
- $d_{\text{model}}$: Model dimensionality
GPT-3: 96 layers, $d_{\text{model}} = 12288$ → approximately 175 billion parameters.
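Both rules of thumb are easy to check numerically (function names are my own):

```python
def approx_params(n_layer, d_model):
    """Rough non-embedding Transformer parameter count: 12 * layers * d^2."""
    return 12 * n_layer * d_model ** 2

def chinchilla_tokens(n_params):
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * n_params

gpt3_params = approx_params(96, 12288)  # close to the quoted 175B
```

Plugging in GPT-3's configuration lands within about 1% of the quoted 175B, and 70B parameters times 20 gives the 1.4T-token Chinchilla budget.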
Question
How do we know it works?
Perplexity (Intrinsic)
Definition
The inverse probability of the test set, normalized by the number of words: $\mathrm{PPL}(W) = P(w_1, \dots, w_N)^{-1/N}$.
Interpretation: The branching factor of the model.
If $\mathrm{PPL} = 10$, the model is as confused as if it were choosing uniformly from 10 words.
Downstream Benchmarks (Extrinsic)
Perplexity does not always predict reasoning ability. We use standard benchmarks:
- MMLU: 15k+ multiple choice questions
- Code: HumanEval.
- Reasoning: GSM8K
Method: Zero-shot or few-shot prompting.
Data contamination
Error
Did the model memorize the answer?
Contamination: When test set questions appear in the training data.
Consequence: High scores but fails to generalize to new problems.
Mitigation: Rigorous decontamination ($n$-gram matching) before training.
Using Pretrained Models: Prompting
- Prompt: The input text provided to the model to elicit a specific response.
- System Prompt: A hidden prefix instruction that defines the model’s personality.
- Prompt Engineering: The empirical science of designing prompts to maximize model performance.
Definition
In-Context Learning (ICL): The ability of a model to improve performance on a specific task given context in the prompt, without updating model parameters.
Mechanism: The model ‘learns’ the pattern or format from the input buffer activations.
Zero-Shot Prompting
Providing the task instruction without any examples
”Translate the following sentence into French: The cat sat on the mat.”
Reliance: Relies entirely on the model’s pretraining data to understand the instruction verb.
Few-Shot Prompting
Providing examples (demonstrations) of the task before the actual query.
Translate English to French: Dog → Chien, Cheese → Fromage, Cat → ?
Effect: Significantly improves performance on complex or novel tasks by demonstrating the expected output formats.
Question
Why do demonstrations work?
- Format constraints: They teach the model the output structure.
- Task location: They help the model locate the specific task manifold in its parameter space.
- Counter-intuitive finding: Correctness of labels matters less than format. Models improve even with incorrect labels.
Lecture 13
Pretraining creates a base model. Excellent at completing text, possessing vast world knowledge and syntactic fluency.
However there’s an alignment gap.
We expect an intelligent assistant that follows instructions. In reality, base models are trained to complete text, not obey commands.
Failure 1: Misinterpretation
Prompt:
“Explain the moon landing to a six year old in a few sentences”
Base Model Output:
“Explain the theory of gravity to a 6 year old”
The model saw a pattern of lists of questions in its training data, and generated the next item in the list.
Failure 2: Continuation vs Answer
Prompt:
“Translate to French: The small dog”
Base Model Output:
“The small dog crossed the road”
Model treated the input as the start of a story rather than a translation command.
The goal of post-training is to bridge the gap between next-token prediction and intent following. We want the model to be helpful, honest, and harmless.
Three stages of training
- Pretraining (covered in last lecture)
- Instruction Tuning (Supervised fine-tuning)
- Alignment (RLHF/DPO)
Definition
Instruction Tuning: The process of further training a base model on a dataset of (Instruction, Response) pairs.
Goal: Teach the model to recognize the “instruction” format and generate the appropriate “response”.
Meta-learning: The goal is not just to learn about the specific tasks in the training set, but to learn the general skill of following instructions.
flowchart LR
a[Base model weights]
b[Fine-tuning]
c[Instruction]
d[Response]
e[Adapted Model Weights]
a --> b
c --> d
d --> b
b --> e
Supervised fine-tuning (SFT) uses the same cross-entropy loss as pretraining.
We want the model to answer the user, not learn to predict/mimic the user’s questions.
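A minimal sketch of loss masking, where instruction tokens are excluded from the cross-entropy average (the names and toy numbers are illustrative):

```python
import math

def masked_nll(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only;
    instruction tokens (mask 0) contribute nothing to the loss."""
    picked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(picked) / len(picked)

# 4 instruction tokens (masked out) followed by 3 response tokens
logprobs = [math.log(0.5)] * 4 + [math.log(0.25)] * 3
mask = [0, 0, 0, 0, 1, 1, 1]
loss = masked_nll(logprobs, mask)
```

Only the response tokens shape the loss, so the model is trained to answer rather than to mimic the question.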
Domain adaptation vs instruction tuning
The goal of domain adaptation is to adapt a model to new jargon. The data is unlabeled documents and we continue pretraining on raw text.
The goal of instruction tuning is to adapt model to a behavioural interface. Data is labeled pairs (Q, A) via supervised training.
Full SFT: Update all parameters of the model. This requires significant compute/memory.
PEFT: Freezes the base model, adds small trainable adapter matrices. Updates only the adapters.
Traditional Fine-Tuning (BERT-style masked LM): Add a task-specific head on top. Train for one specific task. The model cannot do other tasks anymore.
Instruction Tuning: No new layers, the model outputs text. Task is specified in natural language in the prompt. Our model remains a general-purpose engine.
Formatting of SFT Data
We wrap data in a conversational format. The model learns that after the <Instruction> tag comes a command, and it should generate text after the <Response> tag.
<Instruction> Summarize the main idea of the text.
[Input text]...
[Input text]...
<Response> The main idea is that instruction tuning adapts base models to follow user commands effectively.

<Instruction> Translate the following sentence into French: "The weather is beautiful today."
<Response> Le temps est magnifique aujourd'hui.

To train a robust SFT model, we need thousands of diverse instructions.
Source 1: Human Annotation
Hire crowd-workers or experts to write realistic prompts and high-quality answers. This does result in high quality, “real” user distribution, but is expensive and slow.
Aya: A massive multilingual instruction dataset developed by 3,000 fluent speakers across 114 languages.
Source 2: Templating Existing Datasets
NLP researchers have created thousands of labeled datasets over the decades. Convert these rigid datasets into natural language prompts.
How templating works.
Original Dataset (Sentiment)
Input: “The movie was terrible” Label: 0 (Negative)
Template 1
”Review: {Input}. Is this positive or negative? Answer: Negative”
Template 2
”I just read the following review: {Input}. How did the reviewer feel? Answer: They hated it”
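A tiny sketch of templating (names are illustrative; real pipelines use many templates per task):

```python
def apply_templates(example, templates):
    """Render one labeled example through several prompt templates."""
    return [t.format(Input=example["input"], Answer=example["answer"])
            for t in templates]

templates = [
    "Review: {Input}. Is this positive or negative? Answer: {Answer}",
    "I just read the following review: {Input}. "
    "How did the reviewer feel? Answer: {Answer}",
]
ex = {"input": "The movie was terrible", "answer": "Negative"}
prompts = apply_templates(ex, templates)
```

One labeled example thus yields several distinct natural-language training instances.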
SuperNaturalInstructions
12 million examples from 1,600 different NLP tasks. Each task is paired with multiple natural language templates to ensure the model doesn’t overfit to one phrasing.
Source 3: LLM Synthesis (Self-Instruct)
Use a very strong model to generate training data for a smaller model
Loop:
- Give GPT-4 a few seed examples of tasks
- Ask it to generate 100 new, unique tasks.
- Ask it to generate the solutions.
- Filter for quality.
- Train the small model on this synthetic data.
We can explicitly engineer safety into SFT. Use an LLM to generate a “safe” refusal, add this pair to the training set. The model learns to refuse harmful instructions via pattern matching, even before RLHF.
Our goal is to generalize.
If we train on 1,000 tasks, we don’t want the model to be good at just those 1,000 tasks. Learning to follow instructions transfers.
Evaluation Methodology
Hold-one-out: Train on $N - 1$ tasks, test on the $N$th task.
Caution
If you train on “SQuAD” (QA) and test on “NaturalQuestions” (QA), it’s not really an unseen task.
Solution is task clustering. Group datasets by type, hold out entire clusters for evaluation.
Instruction-tuned models significantly outperform base models on zero-shot tasks. The more tasks you add during fine-tuning, the better the generalization (but there are diminishing returns).
Question
Can we teach models to reason via SFT?
Chain-of-Thought (CoT): Prompting models to think step-by-step.
SFT can be used to “bake in” this behaviour.
Data: (Question, Rationale + Answer) pairs. Result: Models learn to output reasoning steps automatically before answering.
The quality vs quantity trade-off
Early SFT (Flan): Focused on quantity (millions of examples)
Recent work (LIMA): Suggests quality is more important
Pretraining takes months on thousands of GPUs. SFT can take days on dozens of GPUs (or hours on 1 GPU for PEFT).
Problem
SFT can cause the model to forget knowledge from pretraining.
Mitigations include mixing in some pretraining data during SFT, use low learning rates, use PEFT to preserve original weights.
Evaluating generation is hard. There are metrics (but most are bad), human eval is good but expensive, LLM-as-a-Judge (using GPT-4 to grade the outputs of smaller models).
SFT datasets can contain test set leakage. If MMLU questions are in your SFT set, your evaluation score is invalid. Need to decontaminate by creating private, held-out evaluation sets.
Challenge: Multilingual SFT
Most SFT data is in English. SFT in English often improves performance in other languages too. The model learns the concept of following instructions, which maps to its internal multilingual representations. The best practice is still to use multilingual data for best results.
Challenge: Reproducibility Issues
Many open-weights models release the weights but not the SFT data. The SFT data is often the secret sauce, or proprietary.
Challenge: Synthetic Data Risks
Using GPT-4 to train Llama creates a feedback loop. Model Collapse: If we keep training on synthetic data, models might drift away from real human language distribution.
SFT makes model helpful, but not aligned. If the training data contains errors, the model mimics them. Sycophancy - models might agree with the user’s incorrect premise just to be helpful.