Professor: Freda Shi | Term: Winter 2026

Lecture 1

This course will be helpful if:

  • You are curious about the fundamentals and detailed implementations of language models
  • You are looking to apply state-of-the-art NLP techniques to your own goals
  • You are interested in NLP and computational linguistics research

This course will not cover:

  • Basics of Python programming, probability, and algorithms
  • Systems and architectures for efficient language model training and deployment
  • GPU programming (except for a few high-level basics)

The following background knowledge is strongly recommended:

  • Basic knowledge of calculus, linear algebra, and probability
  • Python programming proficiency
  • Fundamentals of algorithms
  • Understanding of basic data structures

Question

What is Natural Language Processing?

Goal: Understand natural language with computational models.

End system we want to build

  • Simple: Text classification, grammatical error correction
  • Complex: Translation, question answering, speech recognition
  • Unknown: Human-level comprehension

This course will cover the foundations of language models.

Question

Why is NLP difficult?

Ambiguity: One form can have multiple meanings.

“She saw a cat with a telescope.” Who was with a telescope?

Variability: Multiple forms can share the same meaning.

“The cat is chasing the mouse.” “The mouse is being chased by the cat.”

Cross-lingual awareness, dialects and accents.

Orilla (Spanish) vs. 银行 (Chinese)

Orilla is a shore bank, while 银行 is a financial bank; both translate to the English “bank”. Translation systems often use English as the middle (pivot) language, which can conflate such senses.

Underlying meanings: Politeness, humour, irony.

Q: Do you have iced tea? A1: No A2: We have iced coffee

Roadmap: Levels of language

  • Morphology: What words/subwords are we dealing with?
  • Syntax: What phrases are we dealing with?
  • Semantics: What is the literal meaning?
  • Pragmatics: What is the underlying meaning?

Roadmap: Modeling Approaches

Classification:

  • Sentiment Classifier: Take input sentence, run it through a sentiment classifier, and output a label.
  • Local Consistency Checker: Take a pair of sentences, and output a label to the relationship of the sentences.

Language Modeling:

  • GPT: Next-word predictor. Take a prefix of a sentence, and predict the next word
  • BERT: [Mask] Token Predictor. Collect text, mask some word in the middle of the text, and predict the masked word based on the context.

Brief History of NLP

1960s-1990s: Classical NLP

  • Linguistic theories
  • Manually-defined rules

1980s-2013: Statistical NLP

  • Supervised machine learning with annotated data
  • Mostly linear models with manually-defined features

2012-now: Neural NLP

  • Neural network as primary architecture
  • Learning representations from large corpora
  • Less hand-crafting features/linguistic knowledge

2022-now: Large language models

  • Pretrain a language model
  • Use the pretrained language model for various tasks

Common notations of discrete probability

P(X = x) denotes the probability that random variable X takes on the value x.

Given two random variables X and Y:

Joint probability: P(X = x, Y = y).

Marginal probability: P(X = x) = Σ_y P(X = x, Y = y). X and Y are independent iff P(X = x, Y = y) = P(X = x) P(Y = y) for all x, y.

Conditional probability: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x). X and Y are independent iff P(Y = y | X = x) = P(Y = y).

Expectation: E[X] = Σ_x x · P(X = x).

Finding the optimum of a function

Find the minimum of a convex function f(x).

Gradient descent: x ← x − η ∇f(x), where η is the learning rate (“step size”).

Lecture 2

Question

What is a word?

A single distinct meaningful element of speech or writing, used with others to form a sentence.

Things in dictionaries? (new words are often created).

Things between spaces and punctuation? (not all languages treat spaces the same).

Smallest unit that can be uttered in isolation? (one can utter “unimpressively” and “impress”).

Each of the above captures some but not all aspects of what a word is.

Linguistic Morphology

Definition

Morphology is the study of how words are built from smaller meaning-bearing units

Types of morphemes:

  • Stem: Core meaning-bearing unit
  • Affix: A piece that attaches to a stem, adding some function or meaning (e.g. prefix, suffix). The ‘o’ in ‘speedometer’ is an interfix. Infixes and circumfixes exist in other languages.

Definition

Inflection: Adding morphemes to a word to indicate grammatical information.

walk → walked

cat → cats

Definition

Derivation: Adding morphemes to a word to create a new word with a different meaning.

happy → happiness

define → predefine

Definition

Compounding: Combining two or more words to create a new word.

key + board → keyboard

law + suit → lawsuit

In languages like Classical Chinese, Vietnamese, and Thai

Each word form typically consists of one single morpheme. There is little morphology other than compounding.

Few examples of inflection and derivation.

Most Chinese words are created by compounding.

Usually, morphological decomposition is simply splitting a word into its morphemes:

walked = walk + ed

greatness = great + ness

But it can be a hierarchical structure:

unbreakable = un + (break + able)

internationalization = (((inter + nation) + al) + iz[e]) + tion

There is ambiguity in hierarchical decomposition: “The door is unlockable” can be un + (lockable) (cannot be locked) or (unlock) + able (can be unlocked).

Tasks that address morphology:

Lemmatization: Putting words/tokens in a standard format.

  • Lemma is a canonical/dictionary form of a word.

  • Wordform: Fully inflected or derived form of a word as it appears in text.

wordform     lemma
run          run
ran          run
running      run
keyboards    keyboard

Stemming: Reducing words to their stems by removing affixes. More conventional engineering-oriented approach used in applications such as retrieval.

“Caillou is an average, imaginative” → “Caillou is an averag imagin”.

Lexical Semantics

Lemmatization and stemming tackle the problem of variability: multiple forms could share the same or similar meanings.

One wordform could refer to multiple meanings.

Definition

Polysemy: A word has multiple related meanings.

She is a star or the star is shining.

Definition

Homonymy: A word has multiple meanings originated from different sources.

I need to go to the bank for cash or I am sitting on the bank of the river.

Question

Which one is the case for a crane?

‘Crane’ is a case of polysemy: the machine was named after the bird.

Definition

Synonyms: Words that have the same meanings according to some criteria.

There are few examples of perfect synonymy.

Synonymy is a relation between senses rather than words.

Definition

Antonyms: Senses that are opposite with respect to at least one dimensionality of meaning.

dark and light (colours)

dark and bright (light)

Sense A is a hyponym of sense B if A is more specific, denoting a subclass of B.

Conversely, B is a hypernym of A.

Dog is a hyponym of animal

Corgi is a hyponym of dog

Sense A is a meronym of sense B if A is a part of B.

Conversely, B is a holonym of A.

Hand is a meronym of body

Finger is a meronym of hand

Definition

Word-Sense Disambiguation (WSD): The task of determining which sense of a word is used in a particular context, given a set of predefined possible senses.

Definition

Word Sense Induction (WSI): Requires clustering word usages into senses without predefined ground truths.

Default solution: encode the context of words with a pretrained model, and train a neural network to predict the sense.

Question

We now have powerful neural language models, which do not distinguish word senses. Is WSD still a meaningful task? Do discrete word senses even exist?

Definition

Tokenization: The process that converts running text into a sequence of tokens.

           Penn Treebank    Moses
don’t      do n’t           don ’t
aren’t     are n’t          aren ’t
can’t      ca n’t           can ’t
won’t      wo n’t           won ’t

Important to check and ensure consistency when comparing results across tokenizers.

There is no explicit whitespace between words in some languages, and tokenization becomes highly nontrivial in these cases.

姚明 进入 总决赛 (Chinese Treebank) vs 姚 明 进入 总 决赛 (Peking University).

Type: A unique word.

Token: An instance of a type in the text.

Question

How does the type/token ratio change when adding more data?

The ratio drops fast, since common words appear a lot.

Zipf’s Law (also covered in CS451): Frequency of a word is roughly inversely proportional to its rank in the word frequency list.
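The type/token observation above can be checked in a few lines; a minimal sketch with a toy text (illustrative only):

```python
from collections import Counter

def type_token_ratio(tokens):
    """Type/token ratio: number of unique words divided by total tokens.
    It falls as the text grows, because common words keep repeating."""
    return len(set(tokens)) / len(tokens)

text = "the cat sat on the mat and the dog sat on the rug".split()

# Ratio over growing prefixes: drops as "the", "sat", "on" repeat.
ratios = [type_token_ratio(text[:n]) for n in (4, 8, 13)]

# Word frequencies for a Zipf-style rank/frequency list.
freqs = Counter(text).most_common()
```

Even on this tiny text, the most frequent type (“the”) dominates the counts, which is the qualitative pattern Zipf’s Law describes.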

Tokenization in Modern NLP Systems

There are many words, but the words have shared internal structures and meanings.

Modern NLP systems always convert tokens into numerical indices for further processing. Can we do better than assigning each word a unique index?

flowchart LR
tokenizer --> a[NLP model]
a[NLP model] --> detokenizer

Data-driven tokenizers offer an option that learns the tokenization rules from data, tokenizing texts into subword units using statistics of character sequences in the dataset.

Two most popular methods:

  • Byte Pair Encoding (BPE)
  • SentencePiece

Byte Pair Encoding

Originally introduced in 1994 for data compression, later adapted and revived for NLP.

Key idea: Merge symbols with a greedy algorithm.

Initialize the vocabulary with the set of characters, and iteratively merge the most frequent pair of symbols to extend the vocabulary.

Training Corpus

  • cat
  • cats
  • concatenation
  • categorization

Initial vocabulary

  • a c e g i n o r s t z

Count Symbol-Pair Frequencies

  • <a t>: 6, <c a>: 4, <o n>: 3,

Update vocabulary

  • a c e g i n o r s t z at

Count Symbol-Pair Frequencies

  • <c at>: 4, <o n>: 3,

Update vocabulary

  • a c e g i n o r s t z at cat

We repeat this process until the vocabulary reaches the desired size.
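The training loop above can be sketched directly; a minimal BPE trainer in Python (illustrative, no vocabulary-size handling beyond a merge count):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges greedily: start from characters, then repeatedly
    merge the most frequent adjacent symbol pair and add it to the vocabulary."""
    corpus = [list(w) for w in words]           # each word as a symbol sequence
    vocab = {c for w in corpus for c in w}      # initial vocabulary: characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]     # most frequent pair
        merges.append((a, b))
        vocab.add(a + b)
        new_corpus = []                         # apply the merge everywhere
        for w in corpus:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return vocab, merges

vocab, merges = train_bpe(["cat", "cats", "concatenation", "categorization"], 2)
```

On the toy corpus from the walkthrough, the first two merges are exactly (a, t) and (c, at), yielding the vocabulary extended with “at” and “cat”.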

The BPE proposal is not optimal in terms of compression rate under the same vocabulary size.

A better vocabulary

  • a c e g i n o r s t z cat on

Apply Trained BPE to a New Corpus

In addition to having the tokens, we will also need to know the merge rules. Start from individual characters, and merge following the rules.

Final vocabulary

  • a c e g i n o r s t z at cat

Merge Rules

  • a t → at
  • c at → cat

Word to be tokenized

  • c a t e g o r y

Result

  • cat e g o r y
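Applying trained merge rules to a new word can be sketched as follows (rules are applied in the order they were learned):

```python
def apply_bpe(word, merges):
    """Tokenize a new word with trained BPE merge rules: start from
    characters and apply each learned merge, in learning order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)   # merge the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(apply_bpe("category", [("a", "t"), ("c", "at")]))
# → ['cat', 'e', 'g', 'o', 'r', 'y'], matching the walkthrough result
```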

SentencePiece Tokenization

Find the vocabulary for a unigram language model that maximizes the likelihood of the training corpus.

Byte-Level BPE

Question

How do large language models tokenize texts from different languages, with a unified tokenizer and fixed vocabulary size?

Convert the text to its UTF-8 byte sequence, written in hexadecimal.

Prepend zeroes to fix the length of tokens (to ensure unique decoding), and run BPE at the byte/multi-byte level.

Lecture 3

Recap: Tokenization in Modern NLP Systems

The natural first step is to convert tokens into numerical indices for further processing.

Question

Can we do better than assigning each word a unique index?

Recap: Byte-Level BPE Tokenization

Consider the UTF-8 encoding of “Hello World!”

Character    UTF-8 Hex
H            48
e            65
l            6C
l            6C
o            6F
(space)      20
W            57
o            6F
r            72
l            6C
d            64
!            21

We will work with the sequence 48 65 6C 6C 6F 20 57 6F 72 6C 64 21.

The base vocabulary: entries from 00 to FF. At each step, evaluate frequency of consecutive vocabulary entry pairs, and add a new entry.

This is not much different from character-based BPE.

Question

What about non-English characters?

Consider the UTF-8 encoding of “Hello 世界” (world).

Character    UTF-8 Hex
H            48
e            65
l            6C
l            6C
o            6F
(space)      20
世           E4 B8 96
界           E7 95 8C

We will work with the sequence 48 65 6C 6C 6F 20 E4 B8 96 E7 95 8C. The base vocabulary is still the 256 entries from 00 to FF.
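The byte sequence can be produced directly from Python's UTF-8 encoder (a quick illustration):

```python
# Byte-level BPE operates on UTF-8 bytes, so the base vocabulary is the
# 256 possible byte values (00..FF), regardless of language or script.
text = "Hello 世界"
byte_seq = [f"{b:02X}" for b in text.encode("utf-8")]
print(byte_seq)
# ASCII characters take one byte each; 世 and 界 take three bytes each.
```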

Edit Distance: Comparing similarity of two sequences

Question

How to measure the similarity between two strings?

We need to design algorithms to assign a numerical score to measure similarity.

A proposal: The minimum number of single-character edits required to change one string into the other.

Three allowed operations:

  • Insertion
  • Deletion
  • Substitution

This is known as Levenshtein Distance

teh → the

  • Delete e at position 2, add e at position 3; or
  • Add h at position 2, delete h at position 4; or
  • Substitute e at position 2 with h, substitute h at position 3 with e .

cat → bat

  • Substitute c at position 1 with b

cat → cats

  • Add s at position 4

Unified Algorithmic Solution

Question

How about sitting → extension?

Dynamic Programming (CS341):

Let D(i, j) represent the edit distance (minimal number of edits) between the first i characters of x and the first j characters of y.

Edge cases: D(i, 0) = i and D(0, j) = j. Recurrence: D(i, j) = min(D(i−1, j) + 1, D(i, j−1) + 1, D(i−1, j−1) + [x_i ≠ y_j]).

            0   1   2   3   4   5   6   7   8   9
                e   x   t   e   n   s   i   o   n
0           0   1   2   3   4   5   6   7   8   9
1   s       1   1   2   3   4   5   5   6   7   8
2   i       2   2   2   3   4   5   6   5   6   7
3   t       3   3   3   2   3   4   5   6   6   7
4   t       4   4   4   3   3   4   5   6   7   7
5   i       5   5   5   4   4   4   5   5   6   7
6   n       6   6   6   5   5   4   5   6   6   6
7   g       7   7   7   6   6   5   5   6   7   7

The recurrence implies doing nothing (taking D(i−1, j−1) unchanged) when x_i = y_j when calculating D(i, j).

Question

Why is this correct?

Intuition: In such cases, doing nothing may not be the unique best solution, but it is one of the best.

D(7, 9) = 7 for sitting → extension, which could come from D(6, 8) + 1, D(6, 9) + 1, or D(7, 8) + 1.
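The dynamic program above translates almost line-by-line into code; a minimal sketch with unit costs:

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming.
    D[i][j] = edit distance between s[:i] and t[:j]."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # edge case: delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j                      # edge case: insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or match)
    return D[m][n]
```

With different non-negative costs per operation (the extension below), the three `+ 1` terms simply become the corresponding cost constants.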

Extension: Different Operations with Different Costs

The minimum cost of single-character edits required to change one word into the other. Each operation could have a different non-negative cost.

Three allowed operations:

  • Insertion with cost c_ins
  • Deletion with cost c_del
  • Substitution with cost c_sub

Word Vectors

Until 2010, in NLP, words meant atomic symbols.

Nowadays, it’s natural to think about word vectors when talking about words in NLP. Each word is represented by a vector.

Key idea: Similar words are nearby in a good vector space.

How models represent words

We map each word to a (very high-dimensional) vector.

One of the key challenges for NLP is variability of language (multiple forms having the same meaning).

Representation Learning for Engineering

Engineering: These representations are often useful for downstream tasks.

Transfer learning:

  • Vision: object classification pretraining transfers to image segmentation and visual QA
  • Language: context prediction pretraining transfers to text classification and QA

How to represent a word: One-hot representation of words

Each word corresponds to a vector with a single 1 in its own dimension and 0s elsewhere; the dimensionality equals the vocabulary size |V|.

|V| could be very large, and all word vectors are mutually orthogonal.

Question

What is an ideal word representation?

It should probably capture information about usage and meaning:

  • Part of speech tags (noun, verb, adj., adv., etc.)
  • The intended sense
  • Semantic similarities (winner vs champion)
  • Semantic relationships (antonyms, hypernyms)

Features

Features could extend infinitely.

Distributional Semantics: How much of this can we capture from context/data alone?

”The meaning of a word is its use in the language.” - Ludwig Wittgenstein.

The use of a word is defined by its contexts.

Distributional Semantics

Consider a new word: tezgüino.

  1. A bottle of tezgüino is on the table.
  2. Everybody likes tezgüino.
  3. Don’t have tezgüino before you drive.
  4. We make tezgüino out of corn.

Question

What do you think tezgüino is?

Compare its contexts with those of other words: loud, motor oil, tortillas, choices, wine.

             (1)  (2)  (3)  (4)
tezgüino      1    1    1    1
loud          0    0    0    0
motor oil     1    0    0    1
tortillas     0    1    0    1
choices       0    1    0    0
wine          1    1    1    0

Question

How can we automate the process of constructing representations of word meaning from its company?

First solution: word-word cooccurrence counts (CS451).

Counting for Word Vectors

’the club may also employ a chef to prepare and cook food items'

'is up to Remy, Linguini, and the chef Colette to cook for many people'

'cooking program the cook and the chef with Simon Bryant, who’

The top word is the word we are computing the vector for, the words on the side are the context words

Once we have word vectors, we can compute word similarities.

Among many ways to define the similarity of two vectors, a simple way is the dot product:

The dot product is large when the vectors have entries with large absolute values in the same dimensions.

With dot product as the similarity function, we can find the most similar words (nearest neighbours) to each word:

cat       chef      chicken   civic     cooked    council
council   council   council   council   council   council
cat       cat       cat       cat       cat       cat
civic     civic     civic     civic     civic     civic
chicken   chicken   chicken   chicken   chicken   chicken
chef      chef      chef      chef      chef      chef
cooked    cooked    cooked    cooked    cooked    cooked

Council is always a top neighbour because its vector has very large magnitudes, and dot products care about magnitude.

Cosine similarity: cos(u, v) = (u · v) / (‖u‖ ‖v‖).

Now using cosine similarity:

cat       chef      chicken   civic     cooked    council
cat       chef      chicken   civic     cooked    council
chef      civic     cooked    council   chef      civic
cooked    cooked    chef      chef      civic     chef
civic     council   civic     cooked    council   cooked
council   cat       council   cat       cat       cat
chicken   chicken   cat       chicken   chicken   chicken
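The contrast between the two similarity functions can be shown with toy count vectors (the numbers here are made up for illustration, not the real corpus counts):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product normalized by vector magnitudes,
    so high-frequency words with huge counts no longer dominate."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical count vectors: "council" has very large magnitude.
council = [90, 80, 70]
cat     = [2, 1, 0]
chef    = [3, 2, 0]
# By dot product, council looks closest to cat; by cosine, chef does.
```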

Issues with Counting-Based Vectors

Raw frequency counts are probably a bad representation. Counts of common words are very large, but not very useful: “the”, “it”, “they” are not very informative.

There are many ways proposed for improving raw counts.

  • Removing “stop words”
  • Down-weight less informative words

TF (Term Frequency) - IDF (Inverse Document Frequency)

  • Information Retrieval (IR) workhorse
  • A common baseline model
  • Sparse vectors
  • Words are represented by a simple function of nearby words

Consider a matrix of word counts across documents: term-document matrix

Term Frequency:

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1               0               7            13
good           114              80              62            89
fool            36              58               1             4
wit             20              15               2             3
Columns are bag-of-words (document representation). Rows are word vectors.
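One common TF-IDF weighting (log-scaled term frequency times inverse document frequency) can be sketched over the term-document counts above; the exact weighting scheme varies across systems, so treat this as one representative variant:

```python
import math

def tf_idf(term_doc_counts):
    """TF-IDF sketch: weight = (1 + log10 tf) * log10(N / df) for tf > 0.
    term_doc_counts maps each term to its raw counts across N documents."""
    n_docs = len(next(iter(term_doc_counts.values())))
    weights = {}
    for term, counts in term_doc_counts.items():
        df = sum(1 for c in counts if c > 0)        # document frequency
        idf = math.log10(n_docs / df)
        weights[term] = [(1 + math.log10(c)) * idf if c > 0 else 0.0
                         for c in counts]
    return weights

# Counts from the Shakespeare term-document matrix above.
counts = {
    "battle": [1, 0, 7, 13],
    "good":   [114, 80, 62, 89],
    "fool":   [36, 58, 1, 4],
    "wit":    [20, 15, 2, 3],
}
w = tf_idf(counts)
# "good" occurs in every play, so its idf (and hence every weight) is 0:
# frequent-everywhere words are down-weighted, as intended.
```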

Lecture 4

Pointwise Mutual Information

Consider two random variables, X and Y.

Question

Do two events x and y occur together more often than if they were independent?

If they are independent, then P(x, y) = P(x) P(y). Pointwise mutual information measures the deviation: PMI(x, y) = log [ P(x, y) / (P(x) P(y)) ].

PMI for word vectors

For a word w and its context word c, each probability in PMI(w, c) can be estimated using counts we already computed.

Some have found benefit by truncating PMI at 0 (positive PMI, or PPMI).

Negative PMI: words occur together less than we would expect (they are anti-correlated).

These anti-correlations may need more data to reliably estimate.

But negative PMIs do seem reasonable.
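Estimating PMI and PPMI from counts can be sketched in a few lines (counts here are hypothetical):

```python
import math

def pmi(count_wc, count_w, count_c, total):
    """PMI(w, c) = log2( P(w, c) / (P(w) P(c)) ), with probabilities
    estimated from cooccurrence and marginal counts."""
    p_wc = count_wc / total
    p_w = count_w / total
    p_c = count_c / total
    return math.log2(p_wc / (p_w * p_c))

def ppmi(count_wc, count_w, count_c, total):
    """Positive PMI: truncate negative (anti-correlation) values at 0."""
    return max(0.0, pmi(count_wc, count_w, count_c, total))
```

If the pair cooccurs exactly as often as independence predicts, PMI is 0; pairs occurring less often than expected get negative PMI, which PPMI zeroes out.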

Word2Vec

Instead of counting, train a classifier (neural network) to predict context.

Training is self-supervised: no annotated data is required, just raw text. Word embeddings are learned via backpropagation.

Definition

CBOW (Continuous bag-of-words): learning representations that predict a word given a bag of context words (many-to-one prediction).

Given context, need to predict the centre word.

Definition

Skipgram: Learning representations that predict the context given a word.

Given centre word, and need to predict surrounding words.

Skipgram

This is a log-linear model.

‘it is a far, far better rest that I go to, than I have ever known’

CBOW

Use the context to predict the centre word

Information we get from the ordering of words is lost with bag of words.

Skipgram with negative sampling

The softmax denominator is very compute-intensive: it requires the dot product of a word vector with the entire output embedding matrix at every training step.

The vocabulary size |V| typically ranges from the thousands (k) to the millions (M).

Normalizing over the full vocabulary is therefore very expensive.

Treat the target word and neighbouring context word as positive examples. Randomly sample other words outside of context to get negative samples.

Learn to distinguish between positive and negative samples with a binary classifier.

This is essentially a softmax between the dot product and the negated dot product.
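The negative-sampling objective for one (word, context) pair can be sketched directly (plain lists stand in for embedding rows; names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_loss(w_vec, c_pos, c_negs):
    """Skipgram-with-negative-sampling loss for one training pair:
    push the positive pair's dot product up, and the k sampled
    negatives' dot products down, via a binary classifier."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    loss = -math.log(sigmoid(dot(w_vec, c_pos)))          # positive example
    for c in c_negs:
        loss += -math.log(sigmoid(-dot(w_vec, c)))        # negative samples
    return loss
```

Better-aligned positive pairs (and better-separated negatives) give a lower loss, which is what gradient descent on the embeddings drives toward.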

“You shall know a thing by the company it keeps”: the same idea extends beyond words, e.g., Node2Vec, Concept2Vec, World2Vec.

Classification

Simplest user facing NLP application.

flowchart LR
Text --> Model
Model --> Category

Rule-Based Classifier

Sentiment classification of sentence (classes: positive, negative)

If s contains words in [good, excellent, extraordinary, …], return positive.

If s contains words in [bad, terrible, awful, …], return negative.

This has nice interpretability and can be very accurate. But rules are difficult to define, system can be very complicated, and is hardly generalizable.

Statistical Classifiers: General Formulation

Data: A set of labeled sentences {(x⁽ⁱ⁾, y⁽ⁱ⁾)}.

  • x: a sentence (or a piece of text)
  • y: label, usually represented as an integer

Deliverable: A classifier that takes an arbitrary sentence x as input, and predicts the label ŷ.

Inference: solve ŷ = argmax_y P_θ(y | x). Modeling: define the function P_θ(y | x). Learning: choose the parameters θ.

First approach to the modeling problem, Naïve Bayes.

Probabilistic model: P_θ(y | x), estimated from the data.

Question

How to estimate P(y | x) to ensure sufficient generalizability?

The Bayes rule: P(y | x) = P(x | y) P(y) / P(x) ∝ P(x | y) P(y).

Unigram assumption: P(x | y) = ∏_t P(w_t | y).

Parameters in θ: a look-up table that stores P(y) and P(w | y).

Each P(y) and P(w | y) can be estimated by counting from the corpora.

This is the estimation that maximizes dataset probability.

Issue: an unseen word in certain class will lead to a zero probability (same problem from CS451)

Solution: Smoothing

Inference

Smoothing applies to words seen in the training data. If the word never existed in the training data, we just ignore it.
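The whole pipeline (counting, add-1 smoothing, ignoring fully unseen words) fits in a short class; a minimal sketch with a tiny hypothetical training set:

```python
from collections import Counter
import math

class NaiveBayes:
    """Unigram Naive Bayes with add-1 (Laplace) smoothing, as described above.
    Words never seen in training (in any class) are ignored at inference."""
    def fit(self, sentences, labels):
        self.classes = set(labels)
        self.prior = {y: labels.count(y) / len(labels) for y in self.classes}
        self.counts = {y: Counter() for y in self.classes}
        for words, y in zip(sentences, labels):
            self.counts[y].update(words)
        self.vocab = {w for c in self.counts.values() for w in c}

    def log_prob(self, words, y):
        lp = math.log(self.prior[y])
        total = sum(self.counts[y].values())
        for w in words:
            if w not in self.vocab:          # unseen everywhere: ignore
                continue
            # add-1 smoothing over the shared vocabulary
            lp += math.log((self.counts[y][w] + 1) / (total + len(self.vocab)))
        return lp

    def predict(self, words):
        return max(self.classes, key=lambda y: self.log_prob(words, y))

nb = NaiveBayes()
nb.fit([["good", "great"], ["bad", "awful"]], ["pos", "neg"])
```

Working in log space avoids underflow when multiplying many small per-word probabilities.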

Lecture 5

Recap

Logistic function: σ(z) = 1 / (1 + e^{−z}), with σ(z) ∈ (0, 1).

Logistic Regression: Modeling

Suppose we can represent sentence x with a vector ϕ(x).

Probabilistic model: P(y = 1 | x) = σ(w · ϕ(x) + b).

The bias b can be folded into w (writing just w · ϕ(x)) if ϕ(x) is extended with a constant dimension that is always 1.

Logistic Regression: Learning

Objective: Maximizing the dataset probability, under the assumption that each example is sampled independently.

For better numerical behaviour, we usually take the negative logarithm of the probability as the loss and minimize it: L(θ) = −Σ_i log P_θ(y⁽ⁱ⁾ | x⁽ⁱ⁾).

Gradient descent: θ ← θ − η ∇_θ L(θ).

Logistic Regression: Inference

Compute whether class 0 or 1 has larger probability.

Question

What if there are more than 2 classes?

  • 1 vs. 1: train a classifier for each pair of classes and do voting; or
  • 1 vs. all: train one classifier per class and take the argmax of the scores.

Probabilistic interpretation over classes no longer holds for multi-class classification with logistic regression.
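Binary logistic regression with gradient descent on the negative log-likelihood can be sketched end to end (the toy data below is hypothetical; each input ends with a constant 1 for the bias):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=200):
    """Binary logistic regression trained by (batch) gradient descent
    on the negative log-likelihood; gradient of -log P is (p - y) * x."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                grad[i] += (p - t) * xi
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

# Tiny separable example: label is 1 iff the first feature is positive.
X = [[2.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [1, 1, 0, 0]
w = train_logreg(X, y)
```

Inference then just checks whether σ(w · ϕ(x)) is above or below 0.5.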

Generative vs. Discriminative Model for Classification

  • Generative models: P(x, y) (equivalently, P(x | y) and P(y)) is accessible when modeling P(y | x).
  • Discriminative models: P(y | x) is directly modelled.

Question

What are the key differences?

Difference: If you can generate a new data example once you have the model. Naïve Bayes is a generative model, logistic regression is a discriminative model.

Neural-Network Classifier

A neural network is a function. It has inputs and outputs. Neural modeling now is better thought of as dense representation learning.

With a neural network based function f, we input x and collect a score vector f(x); P(y | x) is defined by selecting the corresponding entry of the (normalized) scores.

Common Neural Network Notations:

  • v: a vector
  • v_i: the i-th entry in the vector
  • M: a matrix
  • M_{ij}: the (i, j) entry in the matrix

Perceptron

If the dot product between w and x is less than 0, then the predicted category for x will be 0.

Predict the label: ŷ = 1 if w · x ≥ 0, else ŷ = 0.

Update weights: w ← w + (y − ŷ) x.

y   ŷ   What happens
0   0   Nothing
1   1   Nothing
1   0   w ← w + x
0   1   w ← w − x
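The update table above can be sketched as a training loop (toy 2-D data, illustrative only):

```python
def perceptron_train(X, y, epochs=10):
    """Perceptron: predict with the sign of the dot product; on a mistake,
    add x to w (missed a 1) or subtract x from w (missed a 0)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, t in zip(X, y):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            if pred != t:                       # update only on mistakes
                sign = 1 if t == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
    return w

X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, 0, 0]
w = perceptron_train(X, y)
```

On linearly separable data like this, the loop converges to weights that classify every training point correctly.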

Neural Layer: Generalized Perceptron

A neural layer = affine transformation + nonlinearity

Output is a vector (results from multiple independent perceptrons). Can have other activation functions for nonlinearity.

Stacking Neural Layers

Multiple neural layers can be stacked together

We use the output of one layer as input to the next. This is a feed-forward network (with fully-connected layers), also called a multi-layer perceptron (MLP).

Nonlinearities (activation function)

The nonlinearity can be applied to each entry in a vector in an element-wise manner. Common activation functions: sigmoid, tanh, and ReLU.

Question

Why nonlinearities?

Otherwise stacking neural layers results in a simple affine transformation.

Nonlinearity: sigmoid(z) = 1 / (1 + e^{−z})

Nonlinearity: tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})

Nonlinearity: ReLU(z) = max(0, z)

Sentiment Classification with Neural Networks

We empirically don’t pass the final layer into an activation function.

Question

How can we get for a sentence?

Average word embeddings, or more complicated neural network structures.

Neural Network: Training

Maximize the probability (minimize the negative log probability) of the gold-standard label.

This is also called the cross-entropy loss.

Backpropagation

Chain rule: suppose y = f(u) and u = g(x); then dy/dx = (dy/du)(du/dx).

Question

Now we have the gradient with respect to the layer output; how should we update the weight matrix W?

What actually happens is that the weight matrix is flattened into a vector of scalar entries, as below.

We calculate the partial derivatives entry by entry, as they are scalars, and then reshape to get the original shape of the matrix.

Visualization of an Average Word Vector Classifier

Common Neural Architectures

Convolutional Neural Networks

Introduced for vision tasks; also used in NLP to extract feature vectors.

We apply kernels (filters) to image patches (local receptive fields). The kernels are learnable.

In NLP, take the dot product between the filter and the concatenated (flattened) embeddings of the words in a window.

Lecture 6

Pooling

Each kernel/filter extracts one type of feature.

A kernel’s output size depends on sentence length. A fixed dimensional vector is desirable for MLP inputs.

Solution: Mean pooling/max pooling converts a vector to a scalar.

Final feature: Concatenating pooling results of all filters.

Word order matters

Example

Kernel size 2: ‘a cat drinks milk’ → (a cat), (cat drinks), (drinks milk); ‘a milk drinks cat’ → (a milk), (milk drinks), (drinks cat).

An n-gram “matches” with a kernel when they have a high dot product.

Drawbacks

Cannot capture long-term dependency. Often used for character-level processing: filters look at character n-grams.

Recurrent Neural Networks (RNNs)

Idea: Apply the same transformation to tokens in time order.

flowchart LR
a[xt-1]
b[xt]
c[xt+1]
d[ht-1]
e[h_t]
f[ht+1]
a --> d
d --> e
b --> e
e --> f
c --> f

Gradient update for the shared recurrent weight matrix.

Suppose h_T (the last hidden state) is the representation passed to the classifier.

We can easily calculate ∂L/∂h_T.

Question

What about ∂L/∂h_t for earlier time steps t?

Bug

An important issue of simple RNNs:

Absolute values of gradient entries grow or vanish exponentially with respect to sequence length, since backpropagation through time repeatedly multiplies by the same recurrent weight matrix.

This motivates the development of more advanced RNN architecture.

Long Short-Term Memory Networks (LSTMs)

Designed to tackle the gradient vanishing problem.

Idea: Keep the gate entries in the range (0, 1) (via sigmoid) and the candidate-state entries in the range (−1, 1) (via tanh).

Gated Recurrent Units

Fewer parameters and generally works quite well.

  • Update gate: z_t = σ(W_z x_t + U_z h_{t−1})
  • Reset gate: r_t = σ(W_r x_t + U_r h_{t−1})

Forget gate is just 1 minus the input gate. Reset gate is applied to the previous hidden contribution.

RNN: Practical Approaches

Gradient clipping: the gradient sometimes grows very large even with LSTMs. Empirical solution: after calculating gradients, require the norm to be at most a threshold c (a hyperparameter); if it is larger, rescale the gradient to norm c.
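Clipping by global norm can be sketched in a few lines (the gradient here is a flat list of scalars for simplicity):

```python
import math

def clip_gradient(grad, max_norm):
    """Gradient clipping by norm: if ||g|| exceeds the threshold,
    rescale g so its norm is exactly max_norm; otherwise leave it."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad
```

The direction of the update is preserved; only its magnitude is capped.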

At time step t, what matters to h_t is mostly x_{t′} where t′ is close to t.

Bidirectional modeling typically results in more powerful features.

Recursive Neural Networks

Run constituency parser on sentence, and construct vector recursively. All nodes share the same set of parameters.

flowchart BT
a[h5]
b[h1]
c[h4]
d[h2]
e[h3]
b --> a
c --> a
d --> c
e --> c

Tree LSTMs typically work well. A slight modification of LSTM cells is needed.

Recursive neural networks with left-branching trees are basically equivalent to recurrent neural networks.

Syntactically meaningful parse trees are not necessary for good representations. Balanced trees work well for most tasks.

Attention

Can be thought of as weighted sum; each token receives weight.

From (unweighted) bag of words to (weighted) bag of words. Each word type receives a fixed weight; normalize the weights with softmax.

Parameterized attention: Word tokens with the same word type should probably receive different weights in different sentences. Implement attention with an MLP.

Self-Attentive RNNs

The last hidden state of RNN could be a bad feature. Why?

At time step t, what matters to h_t is mostly x_{t′} where t′ is close to t.

Attention weights over RNN hidden states could be bad indicators on which token is more important.

Lecture 7

Transformer: Attention-based sentence encoding, and optionally, decoding.

Idea: Every token has “attention” to every other token.

For a sentence with n tokens x_1, …, x_n.

The projection matrices W^Q, W^K, W^V are trainable parameters.

Question

What is for?

Question

Consider the dot product q · k: if each entry in both vectors is drawn from a distribution with zero mean and unit variance, what would happen as the dimensionality d grows?

The variance of the dot product grows linearly with d.

For independent zero-mean, unit-variance random variables X and Y, recall (STAT230): E[XY] = E[X] E[Y] = 0.

For independent zero-mean, unit-variance random variables X and Y: Var(XY) = E[X²Y²] − (E[XY])² = E[X²] E[Y²] = 1.

If we have d independent such products, Var(q · k) = Var(Σ_{i=1}^{d} q_i k_i) = d.
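The claim can be checked empirically with a small simulation (pure stdlib, illustrative only; sample variances will fluctuate around d):

```python
import random
import statistics

random.seed(0)

def dot_samples(d, n=2000):
    """Sample n dot products of independent zero-mean, unit-variance
    d-dimensional Gaussian vectors."""
    out = []
    for _ in range(n):
        u = [random.gauss(0, 1) for _ in range(d)]
        v = [random.gauss(0, 1) for _ in range(d)]
        out.append(sum(a * b for a, b in zip(u, v)))
    return out

var_4 = statistics.pvariance(dot_samples(4))
var_64 = statistics.pvariance(dot_samples(64))
# var_4 is near 4 and var_64 near 64 (up to sampling noise), which is
# why attention scores q . k are divided by sqrt(d) before the softmax.
```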

Transformer Encoder

The application of the 1/√d scaling factor is theoretically motivated.

See also Xavier initialization: initialize parameters involved in dot products with values drawn from a distribution whose variance is inversely proportional to the dimensionality.

Positional Encoding

The columns of the input matrix for “a cat” are a permutation of the columns for “cat a”; without positional information, attention cannot tell them apart.

The choice of sinusoidal functions is somewhat arbitrary, but it is overall theoretically motivated: the encoding of position pos + k can be obtained from that of position pos by a linear transformation.

Proof idea: Use the addition theorems on trigonometric functions.

Limitation: Only a fixed number of positions is available. Another option is learnable positional encoding.
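A sinusoidal positional encoding table can be sketched as follows (this follows the common sine/cosine interleaving; conventions differ slightly across implementations):

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding (sketch):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

Each row is added to the corresponding token's embedding, so permuted inputs no longer produce permuted-identical representations.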

Multi-Head Attention

We can parallelize multiple heads, each with its own W^Q, W^K, W^V and a different random initialization (and hope they learn different ways to attend to tokens).

We stack transformer layers. Earlier layers' outputs are added to the vector stream (residual connections).

Normalization: preserve the variance of the vector stream (layer normalization).

Lecture 8

Recap: Distributional Semantics

Language Models: General Formulation

Language models compute a probability distribution over strings in a language: P(s), where s is a string with n tokens w_1, …, w_n.

Language modeling: Assign probabilities to token sequences.

Modeling: Define a statistical model P_θ(s), where s is a string.

Learning: Estimate the parameters θ from data.

Goal - compute the probability of a sequence of words.

Relatedly, modeling the probability of the next word: P(w_{t+1} | w_1, …, w_t).

Relatedly, modeling the probability of a masked word given its context: P(w_t | w_1, …, w_{t−1}, w_{t+1}, …, w_n).

A model that computes any of the above is called a language model. A good language model will assign higher probabilities to sentences that are more likely to appear.

Question

How do we model this?

Recap: The chain rule of probability: P(w_1, …, w_n) = ∏_{t=1}^{n} P(w_t | w_1, …, w_{t−1}).

We haven't yet made independence assumptions.

This is autoregressive language modeling.

Important detail: Modeling length

A language model assigns probability to token sequences s = (w_1, …, w_n). A sequence can be of any length, and the probabilities should sum to 1 across all possible sequences of all lengths.

Usually length is modelled by including a stop symbol </s> at the end of each sequence. Predicting </s> = modeling stopping probabilities. Relatedly, a start symbol <s> should be added to the beginning.

Language model with both start and stop symbols: P(s) = ∏_{t=1}^{n+1} P(w_t | <s>, w_1, …, w_{t−1}), where w_{n+1} = </s>.

We need to ensure: Σ_s P(s) = 1.

Consider removing stopping probabilities

Question

What happens if we don’t model stopping probability?

The probabilities of all length-1 sequences already sum to 1, and likewise for the sequences of each other length, so the total over all lengths diverges.

With the stop symbol

Proof sketch: Once you reach the </s> token after sampling some certain sequence, certain probability mass is taken away because </s> is in the same distribution as other vocabulary entries.

Longer sequences that have the same prefix share the remaining probability.

Alternatively, we can model the length explicitly (e.g., using a zero-truncated Poisson distribution).

Estimating language model probabilities.

<s>I do not like green eggs and ham</s>

We can use maximum likelihood estimation (from STAT231)

Problem: We will never have enough data.

Markov Assumption

Independence assumption: The next word only depends on the most recent past (the most recent n − 1 words). This reduces the number of estimated parameters in exchange for modeling capacity.

1st order Markov: P(w_t | w_1, …, w_{t−1}) ≈ P(w_t | w_{t−1}).

2nd order Markov: P(w_t | w_1, …, w_{t−1}) ≈ P(w_t | w_{t−2}, w_{t−1}).

N-Gram Language Models

n = 1: Unigram language model

n = 2: Bigram language model

Recap:

Modeling, learning, and inference

Estimating n-gram probabilities

Maximum likelihood estimate (MLE): P(w_t | w_{t−1}) = count(w_{t−1}, w_t) / count(w_{t−1}).

Training data:

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

A few estimated bigram probabilities given our MLE estimator:

Test data: <s> I like green eggs and ham</s>

Problem: P(like | I) = 0 under the MLE, so the whole sentence gets probability 0. This is an over-penalization.

Smoothing in n-gram LMs: just add 1 to all counts. This is Laplace (add-1) smoothing.

MLE estimate: P(w_t | w_{t−1}) = count(w_{t−1}, w_t) / count(w_{t−1}).

Add-1 estimate: P(w_t | w_{t−1}) = (count(w_{t−1}, w_t) + 1) / (count(w_{t−1}) + |V|).
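Both estimators can be computed from the toy corpus above; a minimal sketch (alpha = 0 gives the MLE, alpha = 1 gives add-1 smoothing):

```python
from collections import Counter

def bigram_probs(corpus, alpha=0):
    """Bigram LM with optional add-alpha smoothing (alpha=1 is Laplace).
    Sentences are padded with <s> and </s> as in the notes."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])            # history counts
        bigrams.update(zip(toks, toks[1:]))   # (history, word) counts
    V = len(vocab)

    def p(w, h):
        return (bigrams[(h, w)] + alpha) / (unigrams[h] + alpha * V)
    return p

corpus = [["I", "am", "Sam"], ["Sam", "I", "am"],
          ["I", "do", "not", "like", "green", "eggs", "and", "ham"]]
p_mle = bigram_probs(corpus)
p_add1 = bigram_probs(corpus, alpha=1)
# P(I | <s>) = 2/3 under MLE; P(like | I) = 0 under MLE but > 0 with add-1.
```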

Greedy search: Choose the most likely word at every step.

To predict the next word given the previous two words, take argmax_w P(w | w_{t−2}, w_{t−1}).

Problem: we will generate the same sentence every time, and may be missing out on higher-probability continuations (multiple words down the line).

Bigram model - sampling

  • Generate the first word: w_1 ∼ P(w | <s>).
  • Generate the second word: w_2 ∼ P(w | w_1).
  • Generate the third word: w_3 ∼ P(w | w_2).

Generating from a language model

Two effective sampling strategies: excluding the possibility for tokens with very-low probabilities. Top- vs top- sampling.

Top-k: sort all vocabulary entries by next-token probability, and only take the top k.

Top-p: do the same thing, but don't fix the number of tokens; fix the portion of cumulative probability mass we are considering. The number of tokens we take is then a function of the context.
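Both truncation strategies can be sketched over an explicit next-token distribution (the toy distribution below is hypothetical):

```python
import random

def sample_next(probs, k=None, p=None):
    """Sample the next token from a distribution, truncated by top-k
    (keep the k most likely) or top-p (smallest set with mass >= p)."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if k is not None:
        items = items[:k]
    if p is not None:
        kept, mass = [], 0.0
        for tok, pr in items:
            kept.append((tok, pr))
            mass += pr
            if mass >= p:                     # enough cumulative mass
                break
        items = kept
    toks, weights = zip(*items)
    return random.choices(toks, weights=weights)[0]

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
# With k=2 (or p=0.8), the low-probability tail can never be sampled.
```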

A good language model will assign higher probabilities to sentences that are more likely to appear.

Compute probability on held-out data. Standard metric is perplexity.

Probability of held-out sentences: $P(\text{held-out}) = \prod_{i=1}^{N} P(w_i \mid w_{<i})$

Let's work with log-probabilities: $\log P(\text{held-out}) = \sum_{i=1}^{N} \log P(w_i \mid w_{<i})$

Divide by the number of words $N$ (including stop symbols) in the held-out sentences.

From probability to perplexity.

Average token log-probability of held-out data: $\ell = \frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{<i})$

Perplexity: $\mathrm{PPL} = \exp(-\ell)$

The lower the perplexity, the better the model
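The computation above, sketched on hypothetical per-token probabilities (perplexity is the exponentiated negative average log-probability, equivalently the geometric-mean inverse probability):

```python
import math

# Per-token probabilities a hypothetical model assigns to a held-out
# sentence, one probability per token including the stop symbol.
token_probs = [0.2, 0.5, 0.1, 0.25]

# Average log-probability per token, then exponentiate the negative.
avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(-avg_log_prob)

print(perplexity)  # ~4.47: same as (0.2 * 0.5 * 0.1 * 0.25) ** (-1/4)
```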

Lecture 9

Neural Language Modeling

Recap: Language Modeling as Classification

This is just a probabilistic classification problem.

Neural Trigram Language Model

Given two previous words, compute probability distribution over possible next words.

Input is concatenation of vectors (embeddings) of previous two words.

Output is a vector containing probabilities of all possible next words.

To get the output probabilities $\mathbf{y}$, multiply a parameter matrix $W$ with the input $\mathbf{x}$, then apply a softmax transformation: $\mathbf{y} = \operatorname{softmax}(W\mathbf{x} + \mathbf{b})$.

Neural Trigram Language Model

$i$ denotes indices of sentences, $j$ denotes indices of tokens.

Optimize the parameters via backpropagation.

Trigram vs Neural Trigram LMs

Trigram language model: Separate parameters for every combination of $w_{i-2}$, $w_{i-1}$, $w_i$, so approximately $|V|^3$ parameters. The number of parameters is exponential in the $n$-gram size. Most parameters are zero (even with smoothing).

Neural trigram language model: Only has on the order of $|V| \cdot d$ parameters (embeddings plus output projection). The embedding dimensionality $d$ can be chosen to scale the parameter count up or down, and the count grows linearly in the $n$-gram size. Almost no parameters are zero. No explicit smoothing, though smoothing happens implicitly via distributed representations.
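A minimal numpy sketch of the neural trigram forward pass, assuming random toy parameters and a softmax output layer (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                         # toy vocabulary size and embedding dim

E = rng.normal(size=(V, d))          # word embeddings (V x d)
W = rng.normal(size=(V, 2 * d))      # output projection (V x 2d)
b = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def next_word_probs(w1, w2):
    """P(. | w1, w2): concatenate the two embeddings, project, softmax."""
    x = np.concatenate([E[w1], E[w2]])   # input: 2d-dimensional
    return softmax(W @ x + b)

p = next_word_probs(3, 7)
print(p.shape)   # (10,): one probability per vocabulary word
print(p.sum())   # 1.0
```

Every vocabulary word gets nonzero probability, which is the implicit smoothing mentioned above.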

Removing N-Gram Constraints

RNN Language Models

The hidden state is a function of the previous hidden state and the current input. The same weights are used at each step.

The vector for each word is combined with the history vector. Use the last hidden state as the new history representation, rather than an average of the previous states.

Optimize via backpropagation.
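A minimal sketch of the recurrence, with toy dimensions and a tanh nonlinearity (one common choice; the lecture does not fix a specific activation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10, 4, 6       # vocab size, embedding dim, hidden dim

E = rng.normal(size=(V, d))
W_hh = rng.normal(size=(h, h)) * 0.1   # recurrent weights, shared across steps
W_xh = rng.normal(size=(h, d)) * 0.1   # input weights, shared across steps
W_out = rng.normal(size=(V, h)) * 0.1  # projects hidden state to vocab logits

def rnn_step(h_prev, word_id):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t): same weights at every step."""
    return np.tanh(W_hh @ h_prev + W_xh @ E[word_id])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Run over a toy sentence and predict the next token from the last state.
state = np.zeros(h)
for w in [1, 4, 2, 8]:
    state = rnn_step(state, w)
probs = softmax(W_out @ state)
print(probs.shape)  # (10,)
```

Because the same `W_hh` and `W_xh` are reused at every step, the model handles arbitrarily long histories without any $n$-gram cutoff.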

Transformer Language Models

A token “attends” to all previous tokens.

Use the hidden state $\mathbf{h}_i$ as the feature to predict the next token.

Language models encode knowledge about language.

The pre-training-finetuning paradigm: Language modeling, as the pre-training task, helps encode knowledge. The knowledge helps with downstream tasks. Use the hidden state as the feature for the downstream task.

Masked Language Models

Motivation: learning useful representations of text.

Replace the token at position $i$ with a placeholder ([MASK]). Use the hidden state at position $i$ as the feature to predict the token at position $i$.

Probing

Question

What is encoded in a trained language model?

Empirical answer: linguistic probe.

Take a fixed model as a “frozen” feature extractor, and train a lightweight model (the probe) to predict labels.

Frozen: the base model never gets updated when training the lightweight model.

Confounding

Question

Does above-chance performance on held-out data mean the model encodes part-of-speech knowledge?

Not necessarily: the model might just encode word identity, and the probe learns to group the words together.

Solution (control tasks): draw conclusions only when performance on the main task is significantly better than performance on the control task.

Syntax: Constituency

Sentence: the cat is cute

Bracketing: ((the cat) (is cute))

Task: given any span of words, is it a constituent?

Pooling is required, as candidate constituents may have different lengths but an MLP can only take a fixed-dimensional vector.

Task: Given a constituent, what’s the label?

Constituent labels: Syntactic Substitutability

Pooling is required to accommodate the variable length of constituents.

Lecture 10

Phrase Structures/Constituency Grammar

Constituency grammars focus on the constituent relation.

Informally: Sentences have hierarchical structures.

A sentence is made up of two pieces:

  • Subject, typically a noun phrase (NP)
  • Predicate, typically a verb phrase (VP)

NPs and VPs are made up of pieces:

  • A cat = (a + cat)
  • Walked to the park = (walked + (to + (the + park)))

Each parenthesized phrase is a constituent in the constituent parse.

Constituent: A group of words that functions as a single unit.

Linguists try to determine constituents via constituency tests. A constituency test follows some rules to construct a new sentence, focusing on the constituent candidate of interest. If the constructed sentence sounds good, we have some evidence of constituency.

Consider the sentence: Drunks could put off the customers.

Constituency Test: Coordination

Coordinate the candidate constituent with something else.

  • Drunks could [put off the customers] and sing.
  • Drunks could put off [the customers] and the neighbours.
  • Drunks [could] and [would] put off the customers.

Constituency Test: Topicalization

Moving the candidate constituent to the front. Modal adverbs can be added to improve naturalness.

  • [The customers], drunks certainly could put off.
  • [Customers], drunks could certainly put off the. (bad: stranding “the” suggests “customers” alone is not a constituent here)

Constituency Test: Deletion

Delete the span of interest. Word orders can be changed to improve naturalness.

  • Drunks could put off the customers [in the bar].
  • Drunks could put off the customers [in the] bar.

Constituency Test: Substitution

Substitute the candidate constituent with an appropriate proform (pronoun, pro-verb, etc.). Slight word-order adjustment is allowed to improve naturalness.

  • Drunks could [do so = put off the customers].
  • Drunks could put [them = the customers] off.
  • Drunks could put the [them = customers] off.

Constituency Parsing as Bracketing

Question

Brackets: Which spans of words are the constituents in a sentence?

Sentence: The man walked to the park.

Bracketing: ((the man) (walked (to (the park))))

The brackets can be translated into trees.

There are categories associated with constituents.

flowchart TD
a[NP]
b[NP]
c[the]
d[the]

S --> a
a -----> c
a -----> man
PP ----> to
PP --> b
b ---> d
b ---> park
S --> VP
VP -----> walked
VP --> PP

The internal nodes are called nonterminals; the leaves are called terminals. Nonterminals connect to pre-terminals, which connect to terminals.

The head of a constituent is the most responsible/important word for the constituent label.

Question

Which word makes ‘the cat’ an NP?

Cat

Question

Which word makes ‘walked to the park’ a VP?

Walked

There are syntactic ambiguities. Consider the sentences ‘time flies like an arrow’ and ‘fruit flies like a banana’.

flowchart TD
a[NP]
b[NP]
c[NN]
d[NN]

S --> a
a --> c
c --> time
S --> VP
VP --> V
VP --> PP
V --> flies
PP --> P
PP --> b
P --> like
b --> DT
DT --> an
b --> d
d --> arrow
flowchart TD
x[NP]
b[NP]
c[NN]
d[NN]

S --> x
x --> ADJ
x --> c
ADJ --> fruit
c --> flies


S --> VP
VP --> VBP
VP --> b
VBP --> like

b --> DT
b --> d
DT --> a
d --> banana

NLP Task: Constituency Parsing

Given a sentence, output its constituency parse. Widely studied task with a rich history. Most studies are based on the Penn Treebank.

Constituency parsing: general formulation.

The score of a tree is defined as the sum of its constituent (span) scores: $s(T) = \sum_{(i, j) \in T} s(i, j)$

Each span score can be modeled with a neural network.

Training objective: Let the collection of the true spans have the highest accumulated span scores among all parses.

Question

How do we solve this?

Constituency Parsing: Inference (solving $\hat{T} = \arg\max_{T} s(T)$).

Let’s first assume $T$ is a binary unlabeled parse tree. Each node is either a terminal node or the parent of two other nodes. There is one root node.

The simplified CKY Algorithm

Let $\mathrm{best}(i, j)$ be the maximum sum of subtree scores if $[i, j]$ is a constituent:

$\mathrm{best}(i, j) = s(i, j) + \max_{i \le k < j} \left[ \mathrm{best}(i, k) + \mathrm{best}(k + 1, j) \right]$

The maximum possible sum of subtree scores if the sentence is fully parsed is then $\mathrm{best}(1, n)$.

Edge case: single-word spans, $\mathrm{best}(i, i) = s(i, i)$.
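The recurrence can be sketched as a memoized recursion (spans are half-open $[i, j)$ here; the span scores are hypothetical stand-ins for a neural scorer):

```python
import functools

def cky_best(score, n):
    """Max total score over unlabeled binary trees for words [0, n).

    `score(i, j)` is the model's score for the claim that words i..j-1
    form a constituent (any function; in the lecture, a neural network).
    """
    @functools.lru_cache(maxsize=None)
    def best(i, j):
        if j - i == 1:               # edge case: single word, no split
            return score(i, j)
        return score(i, j) + max(
            best(i, k) + best(k, j) for k in range(i + 1, j)
        )
    return best(0, n)

# Hypothetical span scores favouring the bracketing ((0 1) (2 3)).
table = {(0, 4): 1.0, (0, 2): 2.0, (2, 4): 2.0}
print(cky_best(lambda i, j: table.get((i, j), 0.0), 4))  # 5.0
```

The memoization gives the usual $O(n^3)$ chart-parsing behaviour: each span is solved once and reused by every larger span containing it.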

Context-Free Grammar (CFG) (CS241)

A generative way to describe constituency parsing. A CFG defines some “rewrite rules” to rewrite nonterminals as other nonterminals or terminals.

In previous ‘the man walked to the park’ example, we would have a sequence of rewrites corresponding to a bracketing.

Question

Why context-free?

A rule to rewrite NP does not depend on the context of that NP. The left-hand side (LHS) of a rule is a single nonterminal (without any other context).

Probabilistic Context-Free Grammar (PCFG)

We assign probabilities to rewrite rules.

Probabilities must sum to 1 for each left-hand-side nonterminal. Given a sentence and its tree $T$, the probability of generating $T$ with rules in grammar $G$ is $P(T) = \prod_{r \in T} P(r)$, where $r$ denotes a rule.

Given a treebank, what is the MAP estimation of the PCFG?

A PCFG assigns probabilities to sequences of rewrite operations that terminate in terminals; such a sequence determines the natural-language yield and the bracketing of the sentence.
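A sketch of the count-based estimator and of the product-of-rules tree probability, on a hypothetical toy treebank (the rule counts are made up):

```python
from collections import Counter
import math

# Hypothetical treebank rule occurrences: (lhs, rhs) -> count.
rule_counts = Counter({
    ("S", ("NP", "VP")): 10,
    ("NP", ("DT", "NN")): 6,
    ("NP", ("NN",)): 4,
    ("VP", ("V", "NP")): 7,
    ("VP", ("V",)): 3,
})

lhs_counts = Counter()
for (lhs, rhs), c in rule_counts.items():
    lhs_counts[lhs] += c

def rule_prob(lhs, rhs):
    """Count-based estimate: count(lhs -> rhs) / count(lhs)."""
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

# Probabilities sum to 1 for each left-hand side.
print(rule_prob("NP", ("DT", "NN")) + rule_prob("NP", ("NN",)))  # 1.0

# Probability of a tree = product of its rule probabilities.
tree_rules = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("VP", ("V",))]
p_tree = math.prod(rule_prob(l, r) for l, r in tree_rules)
print(p_tree)  # 1.0 * 0.6 * 0.3 = 0.18
```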

CKY with PCFG Formalism

Find the max-probability tree for a sentence

Let $\pi(i, j, A)$ be the maximum possible log probability that the words within range $[i, j]$ are the yield of nonterminal label $A$.

Edge case: set $\pi(i, i, A) = \log P(A \to w_i)$ when word $w_i$ could have label $A$, and $-\infty$ otherwise.

Inside algorithm

Find the probability for generating a sequence from a certain non-terminal (counting all possible trees).

The Chomsky Normal Form (CNF)

For any free-form PCFG, there exists an equivalent PCFG in which each node has zero or two children. Trees satisfying the latter conditions are said to be in the Chomsky normal form.

To go from constituency to dependency, we need to propagate lexical heads up the tree, remove non-lexical parts, merge redundant nodes. Result is the dependency parse tree.

Dependency parses directly model the relation between words.

Lecture 11

Grounded Semantics: Meanings demonstrated from other sources of data in addition to the language systems.

Distributional Semantics

A bottle of tezgüino is on the table. Everybody likes tezgüino. Don’t have tezgüino before you drive.

Visually grounded semantics

Picture of tezgüino.

Symbol Grounding Problem

Symbol meaning: How do we make sense of symbols?

Practical implication: Enable the reliably meaningful interaction between language models and humans/physical world.

Whether this is a stochastic parrot or semantic comprehension is under debate.

Grounding can be categorized into:

  • A: Referential grounding
  • B: Sensorimotor grounding
  • C: Relational grounding
  • D: Communicative grounding
  • E: Epistemic grounding

A, B, C, E: Semantic (static) grounding

D: Communicative (dynamic) grounding

Recap: Text-Only Language Models

Two types of text-only ungrounded language models:

Autoregressive models, good for generation.

Masked language models, good for feature extraction.

Incorporating visual signals leads to two families of vision-language models.

BERT-style: joint visual-semantic embeddings.

GPT-style: generative vision-language models.

Joint Visual-Semantic Embeddings (and CLIP)

Idea: Encode visual and textual information into a shared space.

Embedding: Vector space

  • Text encoder (turn text into vector)
  • Image encoder (turn image into vector)

Design a loss function to “align” the two vector spaces. End up with one single vector space that encodes both.

Training data: Images and their descriptions.

Visual encoding: Convert an image to a fixed-dimensional vector representation.

Joint Visual-Semantic Embedding Space Objective

Idea: Matched image-caption pair should be closer than mismatched pairs in the embedding space.

This is a triplet-based hinge loss: $\mathcal{L} = \max(0, \alpha - s(i, c^{+}) + s(i, c^{-}))$ for margin $\alpha$, matched caption $c^{+}$, and mismatched caption $c^{-}$. Anything semantically close should be close in the joint embedding space; anything far should be far.
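A minimal sketch of the triplet hinge loss with cosine similarity (the margin value and the toy 2-d embeddings are illustrative):

```python
import numpy as np

def triplet_hinge_loss(img, pos_txt, neg_txt, margin=0.2):
    """Matched image-caption pair should score higher than a mismatched
    pair by at least `margin` (cosine similarity in the joint space)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, margin - cos(img, pos_txt) + cos(img, neg_txt))

img = np.array([1.0, 0.0])
caption = np.array([0.9, 0.1])   # semantically matched: high similarity
mismatch = np.array([0.0, 1.0])  # unrelated caption: orthogonal
print(triplet_hinge_loss(img, caption, mismatch))  # 0.0: already separated
```

The loss is zero once the matched pair clears the margin, so gradient updates focus on pairs that are still confusable.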

Properties of the Joint Space

Images and text are close in a good joint embedding space if they are semantically related.

Example applications:

  • Bidirectional image-caption retrieval
  • Image captioning

Text in the training data can be at any level of granularity (words, phrases, sentences, paragraphs, etc.)

Contrastive Language-Image Pretraining (CLIP)

Image-to-text retrieval: Given a pool of text, model the probability of choosing the correct text; and vice versa.
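The symmetric retrieval objective can be sketched as a contrastive loss over a batch of matched pairs, a common formulation of the CLIP objective (the temperature value and toy embeddings are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text)
    pairs: each image should retrieve its own caption, and vice versa."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature    # pairwise cosine similarities
    n = len(logits)
    p_i2t = softmax(logits, axis=1)       # image-to-text retrieval
    p_t2i = softmax(logits, axis=0)       # text-to-image retrieval
    diag = np.arange(n)
    return -0.5 * (
        np.log(p_i2t[diag, diag]).mean() + np.log(p_t2i[diag, diag]).mean()
    )

matched = np.eye(4, 8)                    # each image aligned with its text
print(clip_loss(matched, matched))        # near 0: perfect retrieval
print(clip_loss(matched, np.roll(matched, 1, axis=0)))  # large: misaligned
```

The diagonal of the similarity matrix holds the matched pairs; the loss pushes it above every row and every column, which is exactly bidirectional retrieval.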

Generative vision-language models

Recap: Generative autoregressive language models

Text-only language models: Predicting the next token conditioned on the history.

Extending to Vision-Language Models (VLMs).

General VLM: Training Objective

Loss function only calculated on textual positions.

Fine-Grained Vision-Language Tasks

  • Object retrieval (assuming objects’ bounding boxes are given).
    • Cognitive plausibility: Recognizing objects is very easy for humans.
  • Multimodal coreference resolution (without assuming bounding boxes)
  • Phrase grounding: Mapping phrases to objects in the image.
  • Dense captioning (reverse): Write a short description for each detected object.

Limitations of current VLMs

  • Lack of physical knowledge, and the neural architecture makes it hard to incorporate knowledge.
  • Poor in recognizing spatial relations.
  • Lack of cultural diversity representation.

Lecture 12

Definition

Large Language Model: A computational agent capable of conversational interaction and text generation.

Fundamentally: A probabilistic model that predicts the next token in a sequence based on context.

Core Mechanism: Conditional generation. Input is context (prefix), and output is the probability distribution over next tokens.

Pretraining

Dataset of 100B to > 5T tokens. Task is next-token prediction on unlabeled texts, and the output is a base model.

ELIZA (1966)

Joseph Weizenbaum’s rule-based system simulating a Rogerian psychologist. The ELIZA effect: Humans easily form emotional connections with machines. Limitation is that there is no real understanding, purely pattern matching/rules.

Modern LLMs: Unlike rule-based systems, these learn language patterns and world knowledge from vast corpora.

The Knowledge Bottleneck

Human learning

Children learn ~7-10 words/day to reach adult vocabularies of 30k-100k words. Most knowledge is acquired as a by-product of reading and contextual processing.

Machine Learning

Distributional Hypothesis: We learn meaning from the company words keep. The NLP revolution relies on this principle: learning syntax, semantics, and facts from data distribution.

Pretraining: The Core Idea

Definition

Pretraining: Learning knowledge about language and the world by iteratively predicting tokens in massive text corpora.

The Result: A pretrained model containing rich representations of syntax, semantics, and facts.

Foundation: Serves as the base for downstream tasks (QA, translation). This phase is self-supervised.

Question

What does a model learn by predicting the blank?

”With roses, dahlias, and peonies, I was surrounded by x” learns ontology (category: flowers)

“The room wasn’t just big it was x” learns semantics/intensity (enormous > big)

“The square root of 4 is x” learns arithmetic/math

”The author of ‘A room of One’s Own’ is x” learns world knowledge (Virginia Woolf)

Three major architectures

  • Decoder-only (Generative: GPT-4, Llama)
  • Encoder-only (Understanding: BERT)
  • Encoder-decoder (Translation: T5, BART)

Decoder Architecture

Mechanism: Auto-regressive generation. Takes tokens as input and generates tokens one by one (left-to-right).

Causal: Masked attention ensures it can only see past tokens, not future ones.

Use Case: Generative tasks (text generation, code completion).

flowchart LR
a[The cat sat on]
b[Decoder-only model]
c[the]

a --> b
b --> c

Example

GPT-3, GPT-4, Llama, Claude, Mistral

We focus on decoders as they are the standard for generative LLMs.

Encoder Architecture

Mechanism: Outputs a vector representation (encoding) for each input token.

Bidirectional: Can see context from both left and right directions simultaneously.

Training: Typically masked language modeling.

Use case: Classification, sentiment analysis, NER.

flowchart LR
a["The [MASK] sat on the mat"]
b[Encoder-only model]
c[cat]

a --> b
b --> c

Example

BERT, RoBERTa

Encoder-Decoder Architecture

Mechanism: Maps an input sequence of tokens to a (potentially different) output sequence.

Key feature: Decouples input understanding from output generation. Input and output can have different lengths.

Use case: Translation (English → French), summarization.

flowchart LR
a[Hello world]
b[Encoder]
c[Decoder]
d[le]
e[Bonjour]

a --> b
b --> c
c --> d
e --> c

The “Black box” view

Input: Sequence of tokens (context)

The cat sat on the

Output: Probability distribution over the vocabulary for the next token

mat: 0.5 bench: 0.3 dog: 0.01

We sample from this distribution to generate the next word.

Self-supervised learning

The idea: We do not need manually labeled data; the text itself is the supervision. At every step $t$, predict the next word $w_t$ given the history $w_{<t}$. Since we possess the source text, we always know the correct next token.

We train the network to minimize Cross-Entropy Loss

It measures the difference between the model’s predicted distribution and the true distribution.

The loss simplifies to the negative log probability of the correct next word: $\mathcal{L}_t = -\log P(w_t \mid w_{<t})$.

Teacher Forcing

Inference vs Training

Inference: Errors accumulate as model predicts its own future context.

Training (Teacher Forcing): We do not use the model’s own prediction. We always feed the correct history sequence.

Advantage: Allows massive parallelization (all tokens trained simultaneously).

Training Visualization

  1. Input: Sequence of tokens
  2. Forward pass: Model computes distribution for every position simultaneously.
  3. Update: Calculate the loss over the batch and backpropagate to update the weights
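The three steps can be sketched with teacher forcing: targets are the inputs shifted by one, and the cross-entropy is computed at every position at once (random logits stand in for model outputs; the token ids are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Token ids for a 4-token sequence under a hypothetical 6-word vocabulary.
tokens = np.array([0, 3, 4, 5])
inputs, targets = tokens[:-1], tokens[1:]   # teacher forcing: shift by one

rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), 6))  # stand-in for model outputs

# Cross-entropy: negative log probability of the correct next token,
# computed at every position simultaneously.
probs = softmax(logits)
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
print(loss)
```

Because the correct history is always fed in, all positions can be scored in a single forward pass, which is what makes training parallelizable.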

Conditional Generation

We model NLP tasks as conditional text generation.

The probability of token $w_t$ given all previous tokens is $P(w_t \mid w_1, \dots, w_{t-1})$.

Sampling loop: Compute probability distribution, sample token, add to context, repeat.

We go from logits to text by taking the logits (raw scores from the model), converting to probabilities via softmax, and then decoding (selecting the next token).

Greedy decoding

Algorithm: Always select the single token with the highest probability

Pros: Deterministic, optimal for short factual queries.

Cons: Often leads to repetitive, degenerate text. Misses creative paths.

Random Sampling

Algorithm: Sample the next token randomly according to the probability distribution . Effect: A word with 10% probability is chosen 10% of the time.

Pros: High diversity, human-like variance.

Cons: Tail risk, sampling a very low probability word that derails coherence.

Temperature Sampling

Rescale logits by a temperature parameter $\tau$: $p_i \propto \exp(z_i / \tau)$.

Low ($\tau < 1$): Confident/Greedy

High ($\tau > 1$): Diverse/Creative
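A minimal sketch of the rescaling (the logit values are illustrative):

```python
import numpy as np

def temperature_softmax(logits, tau):
    """Rescale logits by temperature tau before the softmax."""
    z = np.asarray(logits) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
print(temperature_softmax(logits, tau=0.1))   # low tau: nearly greedy
print(temperature_softmax(logits, tau=10.0))  # high tau: nearly uniform
```

As $\tau \to 0$ the distribution collapses onto the argmax; as $\tau \to \infty$ it flattens toward uniform.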

Top-k and Nucleus Sampling

Top-$k$: Only sample from the $k$ most likely words. Renormalize the weights and do random sampling over just those $k$.

Nucleus (top-$p$): Sample from the smallest set of words whose cumulative probability exceeds $p$ (e.g. 0.9).

Preview: Task Modeling

Almost any task can be cast as next-token prediction.

Sentiment Analysis

Input: “The sentiment of the sentence ‘I like Jackie Chan’ is:” Model compares: $P(\text{positive})$ vs $P(\text{negative})$ as the next token. Decision: Choose the token with higher probability.

Question

What determines performance?

Performance (Loss) is determined by three main power-law factors:

  • Number of parameters (model size)
  • Number of tokens (dataset size)
  • Compute budget (FLOPS)

Pretraining Corpora: LLMs are trained on massive datasets (trillions of tokens)

  • Web crawls
  • Quality data
  • Code

Example

The Pile: An 825 GB corpus of diverse text sources.

Filtering: Quality

Raw web text is noisy, we must filter for quality.

  • Heuristics: Removing short documents, excessive symbols, or malformed text
  • Model-based: Train classifiers to distinguish high quality text from low-quality text
  • Deduplication: Remove duplicate documents. Reduces memorization and improves generalization.

Filtering: Safety & Ethics

PII Removal: Scrubbing Personally Identifiable Information.

Toxicity Filtering: Removing hate speech and abusive content.

Error

Bias: Classifiers may be biased against minority dialects. Trade-off: Models trained on sanitized data may be worse at detecting toxicity themselves.

Ethical & Legal Issues

  • Copyright
  • Consent
  • Privacy
  • Skew

The Power Law of Scaling

Kaplan et al. (2020) empirically showed that loss scales as a power law in each resource: $L(N) \propto N^{-\alpha_N}$, $L(D) \propto D^{-\alpha_D}$, $L(C) \propto C^{-\alpha_C}$.

Implication: To improve performance, we must scale up parameters $N$, data $D$, and compute $C$ simultaneously.

Diminishing returns: To halve the error, you need exponentially more compute.

Compute-Optimal Scaling (Chinchilla)

The Chinchilla ratio: To train optimally for a fixed compute budget, scale parameters and data equally. (~20 tokens per parameter)

Example: A 70B model should be trained on ~1.4 trillion tokens (the current trend is to over-train models to make inference cheaper).

Approximate parameter count:

For a Transformer model, the number of non-embedding parameters is roughly $N \approx 12 \cdot n_{\text{layer}} \cdot d_{\text{model}}^2$, where:

  • $n_{\text{layer}}$: Number of layers
  • $d_{\text{model}}$: Model dimensionality

GPT-3: 96 layers, $d_{\text{model}} = 12288$ → roughly 175 billion parameters.
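The two rules of thumb above in code (both the parameter formula and the 20-tokens-per-parameter ratio are approximations, not exact counts):

```python
def approx_params(n_layer, d_model):
    """Non-embedding Transformer parameters: ~12 * n_layer * d_model^2."""
    return 12 * n_layer * d_model ** 2

def chinchilla_tokens(n_params):
    """Chinchilla ratio: ~20 training tokens per parameter."""
    return 20 * n_params

# GPT-3: 96 layers, d_model = 12288.
print(approx_params(96, 12288) / 1e9)    # ~174 (billions of parameters)
# A 70B model under the Chinchilla ratio.
print(chinchilla_tokens(70e9) / 1e12)    # 1.4 (trillion tokens)
```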

Question

How do we know it works?

Perplexity (Intrinsic)

Definition

The inverse probability of the test set, normalized by the number of words.

Interpretation: The branching factor of the model.

If $\mathrm{PPL} = 10$, the model is as confused as if it were choosing uniformly from 10 words.

Downstream Benchmarks (Extrinsic)

Perplexity does not always predict reasoning ability. We use standard benchmarks:

  • MMLU: 15k+ multiple choice questions
  • Code: HumanEval.
  • Reasoning: GSM8K

Method: Zero-shot or few-shot prompting.

Data contamination

Error

Did the model memorize the answer?

Contamination: When test set questions appear in the training data.

Consequence: High scores but fails to generalize to new problems.

Mitigation: Rigorous decontamination ($n$-gram matching) before training.

Using Pretrained Models: Prompting

  • Prompt: The input text provided to the model to elicit a specific response.
  • System Prompt: A hidden prefix instruction that defines the model’s personality.
  • Prompt Engineering: The empirical science of designing prompts to maximize model performance.

Definition

In-Context Learning (ICL): The ability of a model to improve performance on a specific task given context in the prompt, without updating model parameters.

Mechanism: The model ‘learns’ the pattern or format from the input buffer activations.

Zero-Shot Prompting

Providing the task instruction without any examples

”Translate the following sentence into French: The cat sat on the mat.”

Reliance: Relies entirely on the model’s pretraining data to understand the instruction verb.

Few-Shot Prompting

Providing examples (demonstrations) of the task before the actual query.

Translate English to French: Dog → Chien, Cheese → Fromage, Cat → ?

Effect: Significantly improves performance on complex or novel tasks by demonstrating the expected output formats.

Question

Why do demonstrations work?

  • Format constraints: They teach the model the output structure.
  • Task location: They help the model locate the specific task manifold in its parameter space.
  • Counter-intuitive finding: Correctness of labels matters less than format. Models improve even with incorrect labels.

Lecture 13

Pretraining creates a base model. Excellent at completing text, possessing vast world knowledge and syntactic fluency.

However there’s an alignment gap.

We expect an intelligent assistant that follows instructions. In reality, base models are trained to complete text, not obey commands.

Failure 1: Misinterpretation

Prompt:

“Explain the moon landing to a six year old in a few sentences”

Base Model Output:

“Explain the theory of gravity to a 6 year old”

The model saw lists of questions in its training data, and generated the next item in the list.

Failure 2: Continuation vs Answer

Prompt:

“Translate to French: The small dog”

Base Model Output:

“The small dog crossed the road”

Model treated the input as the start of a story rather than a translation command.

The goal of post-training is to bridge the gap between next-token prediction and intent following. We want the model to be helpful, honest, and harmless.

Three stages of training

  1. Pretraining (covered in last lecture)
  2. Instruction Tuning (Supervised fine-tuning)
  3. Alignment (RLHF/DPO)

Definition

Instruction Tuning: The process of further training a base model on a dataset of (Instruction, Response) pairs.

Goal: Teach the model to recognize the “instruction” format and generate the appropriate “response”.

Meta-learning: The goal is not just to learn about the specific tasks in the training set, but to learn the general skill of following instructions.

flowchart LR

a[Base model weights]
b[Fine-tuning]
c[Instruction]
d[Response]
e[Adapted Model Weights]

a --> b
c --> d
d --> b
b --> e

Supervised fine-tuning (SFT) uses the same cross-entropy loss as pretraining.

We want the model to answer the user, not learn to predict/mimic the user’s questions.

Domain adaptation vs instruction tuning

The goal of domain adaptation is to adapt a model to new jargon. The data is unlabeled documents and we continue pretraining on raw text.

The goal of instruction tuning is to adapt the model to a behavioural interface. The data is labeled (Q, A) pairs, trained via supervised learning.

Full SFT: Update all parameters of the model. This requires significant compute/memory.

PEFT: Freezes the base model, adds small trainable adapter matrices. Updates only the adapters.

Traditional Fine-Tuning (BERT-style masked LM): Add a task-specific classification head on top. Train for one specific task; the model cannot do other tasks anymore.

Instruction Tuning: No new layers, the model outputs text. Task is specified in natural language in the prompt. Our model remains a general-purpose engine.

Formatting of SFT Data

We wrap data in a conversational format. The model learns that after the <Instruction> tag comes a command, and it should generate text after the <Response> tag.

<Instruction> Summarize the main idea of the text.
[Input text]...
[Input text]...
<Response> The main idea is that instruction tuning adapts base models to follow user commands effectively.
<Instruction> Translate the following sentence into French: "The weather is beautiful today."
<Response> Le temps est magnifique aujourd'hui.
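One common way to implement “answer the user, not mimic the question” is to mask the loss over instruction tokens; a sketch, with illustrative tokens and a placeholder per-token loss:

```python
import numpy as np

# Token-level loss mask for SFT: compute loss only on response tokens so
# the model learns to answer rather than to predict the user's question.
# The token split and tag positions here are illustrative.
tokens = ["<Instruction>", "Translate", "to", "French", ":", "The", "cat",
          "<Response>", "Le", "chat"]
per_token_loss = np.ones(len(tokens))        # stand-in for cross-entropy values

resp_start = tokens.index("<Response>") + 1  # loss applies after the tag
mask = np.zeros(len(tokens))
mask[resp_start:] = 1.0

sft_loss = (per_token_loss * mask).sum() / mask.sum()
print(mask)      # zeros over the instruction, ones over the response
print(sft_loss)
```

The instruction tokens still condition the model (they are in the input), but they contribute nothing to the gradient.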

To train a robust SFT model, we need thousands of diverse instructions.

Source 1: Human Annotation

Hire crowd-workers or experts to write realistic prompts and high-quality answers. This does result in high quality, “real” user distribution, but is expensive and slow.

Aya: A massive multilingual instruction dataset developed by 3,000 fluent speakers across 114 languages.

Source 2: Templating Existing Datasets

NLP researchers have created thousands of labeled datasets over the decades. Convert these rigid datasets into natural language prompts.

How templating works.

Original Dataset (Sentiment)

Input: “The movie was terrible” Label: 0 (Negative)

Template 1

”Review: {Input}. Is this positive or negative? Answer: Negative”

Template 2

”I just read the following review: {Input}. How did the reviewer feel? Answer: They hated it”

SuperNaturalInstructions

12 million examples from 1,600 different NLP tasks. Each task is paired with multiple natural language templates to ensure the model doesn’t overfit to one phrasing.

Source 3: LLM Synthesis (Self-Instruct)

Use a very strong model to generate training data for a smaller model

Loop:

  • Give GPT-4 a few seed examples of tasks
  • Ask it to generate 100 new, unique tasks.
  • Ask it to generate the solutions.
  • Filter for quality.
  • Train the small model on this synthetic data.

We can explicitly engineer safety into SFT. Use an LLM to generate a “safe” refusal, add this pair to the training set. The model learns to refuse harmful instructions via pattern matching, even before RLHF.

Our goal is to generalize.

If we train on 1,000 tasks, we don’t want the model to be good at just those 1,000 tasks. Learning to follow instructions transfers.

Evaluation Methodology

Hold-one-out: Train on $n - 1$ tasks, test on the $n$-th task.

Caution

If you train on “SQuAD” (QA) and test on “NaturalQuestions” (QA), it’s not really an unseen task.

Solution is task clustering. Group datasets by type, hold out entire clusters for evaluation.

Instruction-tuned models significantly outperform base models on zero-shot tasks. The more tasks you add during fine-tuning, the better the generalization (but there are diminishing returns).

Question

Can we teach models to reason via SFT?

Chain-of-Thought (CoT): Prompting models to think step-by-step.

SFT can be used to “bake in” this behaviour.

Data: (Question, Rationale + Answer) pairs. Result: Models learn to output reasoning steps automatically before answering.

The quality vs quantity trade-off

Early SFT (Flan): Focused on quantity (millions of examples)

Recent work (LIMA): Suggests quality is more important

Pretraining takes months on thousands of GPUs. SFT can take days on dozens of GPUs (or hours on 1 GPU for PEFT).

Problem

SFT can cause the model to forget knowledge from pretraining.

Mitigations include mixing in some pretraining data during SFT, using low learning rates, and using PEFT to preserve the original weights.

Evaluating generation is hard. Automatic metrics exist (but most are bad), human evaluation is good but expensive, and LLM-as-a-Judge (using GPT-4 to grade the outputs of smaller models) is a common compromise.

SFT datasets can contain test set leakage. If MMLU questions are in your SFT set, your evaluation score is invalid. Need to decontaminate by creating private, held-out evaluation sets.

Challenge: Multilingual SFT

Most SFT data is in English. SFT in English often improves performance in other languages too. The model learns the concept of following instructions, which maps to its internal multilingual representations. The best practice is still to use multilingual data for best results.

Challenge: Reproducibility Issues

Many open-weights models release the weights but not the SFT data. The SFT data is often the secret sauce, or proprietary.

Challenge: Synthetic Data Risks

Using GPT-4 to train Llama creates a feedback loop. Model Collapse: If we keep training on synthetic data, models might drift away from real human language distribution.

SFT makes the model helpful, but not aligned. If the training data contains errors, the model mimics them. Sycophancy: models might agree with the user's incorrect premise just to be helpful.