Professor: Michael Wallace | Term: Fall 2025
Lecture 1
The 4 assignments in this course (which are not graded) give the opportunity to analyze a real-world dataset (the Stanford Open Policing Project) using R.
Question
Why study stats?
- To be a statistician.
- Problem solving: This reminds me of a puzzle.
- Variety: Theory, application, anything in between.
- Utility: Useful to almost everyone.
- Coolness: You will be cool at parties.
- To be a user of statistics.
- Science: Want to know if you’ve done an experiment properly? Will need statistics.
- Industry: Does our new product work better? Why are we losing customers?
- Spare time: Can I play the stock market?
- Being informed on issues: Should guns be regulated?
- To improve critical, analytical, and communication skills.
- Learn to ask the right questions.
- Understand the importance of precision, in words and measurements.
Key features of STAT231:
- The language of statistics
- Understanding what data can and can’t tell us
- Methods of estimation and analysis
- Principles over proofs (this is a STAT, not a MATH, course)
A key feature of an empirical study is that it involves uncertainty.
If we run an experiment more than once, we’ll almost certainly get different results.
We use probability models to try and model this uncertainty.
Definition
A unit is an individual person, place or thing about which we can take some measurement(s).
Definition
A population is a collection of units.
Definition
A process is a collection of units, but those units are ‘produced’ over time.
Populations and processes are both collections of units. A key feature of processes is that they usually occur over time, whereas populations are static.
Population: All current UW undergraduate students.
Process: All UW undergraduate students for the next ten years.
Definition
Variates are characteristics of units, which are usually represented by letters such as $x$, $y$, and $z$.
Variates come in many flavours, including:
- Continuous
- Discrete
- Categorical
- Ordinal
- Complex
Continuous variates are those that can be measured - at least in theory - to an infinite degree of accuracy.
Example
Height and weight, the lifetime of an electrical component etc.
Discrete variates, in contrast, are those that can only take a finite or countably infinite number of values.
Example
The number of car accidents on a certain stretch of highway in a year.
The distinction between discrete and continuous can be unclear.
The distinction affects the assumptions we make and the probability models we use to investigate the data.
Categorical variates are those where units fall into a non-numeric category.
Ordinal variates are those where an ordering is implied, but not necessarily through a numeric measure.
| Size | Volume (oz) |
|---|---|
| Small | 10 |
| Medium | 14 |
| Large | 20 |
| Extra Large | 24 |
This table maps an ordinal variate (size) to a continuous one (volume). We need to be careful when doing this: mapping ordinal variates to numbers can misrepresent the data.
Complex variates are more unusual, and include open-ended responses to survey questions, or an image. Usually requires processing to ‘convert’ them into one of the other types.
Warning
If one variate we care about is the number of seconds from an assignment's release until it is due (capped at 7 days), the set of seconds seems discrete, but the variate is still continuous (since we can always measure time even more precisely).
If we decide a variate is discrete, we usually use a discrete probability distribution to model it.
If we decide a variate is continuous, we usually use a continuous probability distribution to model it.
However this is not always the case. For example, age is a measure of time; it is continuous, but we often use discrete probability distributions to describe the variate.
Lecture 2
Definition
An attribute of a population or process is a function of a variate, which is defined for all units in the population or process.
A sample survey is where information is obtained about a finite population by selecting a ‘representative’ sample of units from the population and determining the variates of interest for each unit in the sample.
Example
Fortnite V-Bucks
Consider two possible studies to investigate how often players buy V-Bucks.
Study 1: A random sample of players is selected on 1 September 2025, and all in-game activity is logged for one week.
Study 2: A random sample of players is selected on 1 September 2025. Half the players are shown the V-Bucks pricing options in a different order.
Study 1 is an observational study - we learn about the population / process without any attempt to change any variates for the sampled units.
Study 2 is an experimental study - the experimenter intervenes and changes or sets the values of one or more variates for the units in the study.
| Observational | Survey |
|---|---|
| Population of interest infinite or conceptual | Finite, tangible/‘real’ |
| Data collected routinely over time | Often only one point of contact with participants |
| More passive, learning about population’s daily lives and/or habits | Specific questions about participants’ lives and experiences |
Sometimes a study may include a sample survey while also being observational. The above are examples, not rules that are always followed.
Example
20 STAT 231 students were asked how many assignments they completed.
How do we usefully display these data? We consider both numerical and graphical summaries.
Types of numerical measures:
- Measures of location
- Measures of variability
- Measures of shape
Location
Recall from STAT230: if $X$ is a discrete random variable with range $X(S)$ and probability function $f(x)$, then $E(X) = \sum_{x \in X(S)} x f(x)$; if $X$ is a continuous random variable with range $X(S)$ and probability density function $f(x)$, then $E(X) = \int_{-\infty}^{\infty} x f(x)\, dx$.
Suppose we roll a six-sided die 100 times. What do we expect to see?
Let the data be represented by $y_1, y_2, \ldots, y_n$, where each $y_i$ is a real number and our sample size is $n$.
One numerical measure of the ‘centre’ of the data is the sample mean.
Sample mean: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.
The idea is to give an empirical version of the previous theoretical idea of expectation.
We also know about another theoretical measure of the centre of a distribution: the median.
For a discrete random variable $X$ we might define the median to be any value $m$ which satisfies $P(X \le m) \ge 0.5$ and $P(X \ge m) \ge 0.5$.
To formally define the sample median, we introduce the ordered sample $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$, where $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$.
For an odd number of observations: sample median $= y_{((n+1)/2)}$. Informally, this is just the middle value.
The median is not unique in the case of an even number of observations.
The average of the middle two observations is chosen for convenience: sample median $= \frac{1}{2}\left(y_{(n/2)} + y_{(n/2+1)}\right)$. Informally, this is just the average of the middle two values.
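As a sketch of these definitions (in Python rather than the course's R; the dataset is made up for illustration):

```python
def sample_mean(ys):
    """Empirical analogue of E(Y): the average of the observations."""
    return sum(ys) / len(ys)

def sample_median(ys):
    """Middle value of the ordered sample; average of the two middle
    values when the number of observations is even."""
    s = sorted(ys)                 # ordered sample y_(1) <= ... <= y_(n)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                        # y_((n+1)/2)
    return (s[n // 2 - 1] + s[n // 2]) / 2      # average of y_(n/2), y_(n/2+1)

data = [3, 1, 4, 1, 5, 9, 2, 6, 5]
print(sample_mean(data))    # 4.0
print(sample_median(data))  # 4
```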
The sample mode is the most common value in a set of data. The sample mode is most useful for discrete or categorical data.
For frequency or grouped data, the group or class with the highest frequency is called the sample modal class.
Watch out for uses of the word ‘average’ in the media, sometimes it is used to refer to the mean, sometimes the median.
Example
Mean household income > Median household income due to long right tail.
In contrast to measures of central tendency, we are also interested in the spread / variability of data.
Examples include:
- Sample variance
- Range
- IQR
In STAT 230 we learned about the variance in the context of random variables. Recall: $\mathrm{Var}(X) = E\left[(X - \mu)^2\right] = E(X^2) - \mu^2$.
If rolling a die 100 times, we can calculate the sample mean, and it should be close to the ‘theoretical’ expectation $E(X) = 3.5$.
What about sample variance? Can we get something empirically close to the theoretical idea of variance?
Our data are denoted $y_1, \ldots, y_n$.
Definition
The sample variance is defined as
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$$
The sample standard deviation, denoted $s$, is just the square root of the sample variance.
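A minimal sketch of these two measures (Python for illustration; note the $n-1$ divisor, and the made-up data):

```python
def sample_variance(ys):
    """s^2 = (1 / (n - 1)) * sum of (y_i - ybar)^2."""
    n = len(ys)
    ybar = sum(ys) / n
    return sum((y - ybar) ** 2 for y in ys) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(data)
s = s2 ** 0.5               # sample standard deviation
print(s2)                   # 4.571... (= 32/7)
```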
Suppose we have a sample of data from a Gaussian distribution. We can expect:
- Approximately 68% of the sample should lie in the interval $[\bar{y} - s, \bar{y} + s]$
- Approximately 95% of the sample should lie in the interval $[\bar{y} - 2s, \bar{y} + 2s]$
Exercise: Show that if $Y \sim G(\mu, \sigma)$, then $P(\mu - \sigma \le Y \le \mu + \sigma) \approx 0.68$.
The range is defined as: range $= y_{(n)} - y_{(1)}$. The range is very susceptible to outliers.
Recall the $p$th quantile of a continuous distribution is given by the value $q(p)$ where $P(X \le q(p)) = p$.
Recall the $0.5$ quantile (the 50th percentile) is just the median.
One way to define the $p$th quantile (or $100p$th percentile) of a sample is the value, denoted $q(p)$, found as follows:
- Let $m = np$, where $n$ is the sample size
- If $m$ is an integer, take $q(p) = \frac{1}{2}\left(y_{(m)} + y_{(m+1)}\right)$
- If $m$ is not an integer, determine the smallest integer $j$ such that $j > m$ and take $q(p) = y_{(j)}$
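The steps above can be sketched as follows (a Python illustration of this simple textbook rule; it is not the interpolating rule statistical software typically defaults to):

```python
import math

def sample_quantile(ys, p):
    """p-th sample quantile via the simple rule above (0 < p < 1)."""
    s = sorted(ys)                    # ordered sample
    n = len(s)
    m = n * p
    if m == int(m):                   # m = np is an integer
        m = int(m)
        if m == n:                    # guard the upper edge
            return s[-1]
        return (s[m - 1] + s[m]) / 2  # average of y_(m) and y_(m+1)
    j = math.ceil(m)                  # smallest integer j with j > m
    return s[j - 1]                   # y_(j) (1-indexed)

data = [1, 2, 3, 4, 5, 6, 7, 8]
print(sample_quantile(data, 0.25))   # m = 2 -> (y_(2) + y_(3))/2 = 2.5
print(sample_quantile(data, 0.50))   # m = 4 -> (y_(4) + y_(5))/2 = 4.5
print(sample_quantile(data, 0.75) - sample_quantile(data, 0.25))  # IQR = 4.0
```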
R uses a different method of calculating quantiles by default.
The 25th, 50th, 75th percentiles are known as quartiles.
It is common for data to be divided into quartiles.
The 25th percentile is the lower (first) quartile, and the 75th percentile is the upper (third) quartile.
The interquartile range, or IQR, is defined as $\mathrm{IQR} = q(0.75) - q(0.25)$; roughly 50% of the observations should lie between the lower and upper quartiles.
Question
Why do we want this?
The IQR is more robust: it is less affected by outliers than the mean or variance.
We finish with measures of shape, which quantify differently shaped distributions.
Sample skewness measures the asymmetry of the data, and is calculated as:
$$\text{sample skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{3/2}}$$
The denominator is equal to the sample standard deviation cubed, except we replace the $n-1$ in the denominator of $s^2$ with $n$.
We can think of sample skewness as an empirical version of the theoretical concept of the third moment.
The numerator can be positive or negative.
We can infer some properties of the shape of a dataset’s distribution by looking at the sign of the skewness.
For a ‘normal distribution’ we have skewness $\approx 0$, since the negative cubed deviations $(y_i - \bar{y})^3 < 0$ (on the left) and the positive cubed deviations $(y_i - \bar{y})^3 > 0$ (on the right) cancel each other out.
For a long right tail we have skewness $> 0$: the large positive values of $(y_i - \bar{y})^3$ from the right tail outweigh the negative values from the left. This results in positive skewness (skewed to the right).
For a long left tail we have skewness $< 0$ (skewed to the left).
Kurtosis measures whether data are concentrated in a central peak, or in the tails.
Sample kurtosis is calculated as
$$\text{sample kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{2}}$$
This is the empirical version of the theoretical concept of the fourth moment.
Now, both the numerator and denominator are positive.
Data that look Gaussian have a sample kurtosis close to 3. Data with heavy tails have a sample kurtosis larger than 3. Data with shorter tails have a sample kurtosis less than 3. Data that look uniform have a sample kurtosis close to 1.8.
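Both shape measures can be sketched directly from the $\frac{1}{n}$ central moments (Python for illustration; the toy data are made up). The roughly uniform data below give a kurtosis near the 1.8 mentioned above:

```python
def central_moment(ys, k):
    """(1/n) * sum of (y_i - ybar)^k."""
    n = len(ys)
    ybar = sum(ys) / n
    return sum((y - ybar) ** k for y in ys) / n

def sample_skewness(ys):
    # third central moment over the second raised to the 3/2 power
    return central_moment(ys, 3) / central_moment(ys, 2) ** 1.5

def sample_kurtosis(ys):
    # fourth central moment over the second squared
    return central_moment(ys, 4) / central_moment(ys, 2) ** 2

data = [1, 2, 3, 4, 5]               # symmetric, roughly uniform
print(sample_skewness(data))          # 0.0 (symmetric)
print(sample_kurtosis(data))          # 1.7 (short tails, near-uniform)
```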
We are often interested in whether a Gaussian model is appropriate for a particular sample.
- Is the sample mean close to the sample median?
- Is the sample skewness close to 0?
- Is the sample kurtosis close to 3?
We never prove an assumption is true. Instead we see if we can find evidence against an assumption.
Never use definitive statements such as “the assumption is true” or “the assumption is false”.
Statistics are built on assumptions.
Lecture 3
A useful data summary is the five number summary:
- The minimum
- The lower quartile
- The median
- The upper quartile
- The maximum
Even though this is simple, it is a pretty good summary.
Graphical summaries can be useful as well.
Histograms create graphical summaries of our data that show their distribution.
We partition the range of the data into non-overlapping intervals $I_j = [a_{j-1}, a_j)$ for $j = 1, \ldots, k$. Let $f_j$ be the number of values from $y_1, \ldots, y_n$ that are in $I_j$. The $f_j$ are called observed frequencies. Draw a rectangle above each of the intervals $I_j$ with height such that the rectangle’s area is proportional to the corresponding observed frequency $f_j$.
In a relative frequency histogram, the height of the rectangle over $I_j$ is chosen so that the area of the rectangle equals $f_j / n$, that is
$$\text{height}_j = \frac{f_j}{n(a_j - a_{j-1})}$$
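A sketch of the frequency and height calculations (Python; the breakpoints $a_0 < a_1 < \cdots < a_k$ and the data are made up). Each bar's area works out to $f_j/n$, so the areas sum to 1:

```python
def rel_freq_heights(ys, breaks):
    """For intervals I_j = [a_{j-1}, a_j), return pairs (f_j, height_j)
    with height_j = f_j / (n * (a_j - a_{j-1}))."""
    n = len(ys)
    out = []
    for lo, hi in zip(breaks, breaks[1:]):
        f = sum(1 for y in ys if lo <= y < hi)   # observed frequency f_j
        out.append((f, f / (n * (hi - lo))))
    return out

data = [0.1, 0.2, 0.4, 0.5, 0.9, 1.5]
hs = rel_freq_heights(data, [0, 0.5, 1.0, 2.0])
print(hs)   # frequencies 3, 2, 1; bar areas 3/6, 2/6, 1/6
```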
We can now compare this with the probability density function: the relative frequency histogram can be seen as an empirical version of the p.d.f.
An empirical c.d.f., in contrast, lets us compare the distribution of a dataset with a c.d.f. of a random variable.
Recall: A c.d.f. for a random variable $Y$ is a function giving $F(y) = P(Y \le y)$.
In general we estimate the probability of values at or below a given $y$ as the proportion of the sample at or below $y$.
Definition of the empirical c.d.f.:
$$\hat{F}(y) = \frac{\#\{i : y_i \le y\}}{n}$$
defined for all real values $y$.
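The empirical c.d.f. is essentially a one-liner (a Python sketch with made-up data):

```python
def ecdf(ys, y):
    """F-hat(y): the proportion of observations at or below y."""
    return sum(1 for yi in ys if yi <= y) / len(ys)

data = [1, 3, 3, 7]
print(ecdf(data, 3))    # 0.75
print(ecdf(data, 0))    # 0.0
print(ecdf(data, 10))   # 1.0
```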
The sample mean being lower than the sample median is common for negatively skewed data, but not guaranteed.
Another graphical way to summarize data is a box-plot.
Box-plots can help demonstrate skewness. It can also be used to compare the values of variates in two or more groups.
So far we’ve only considered univariate datasets: we only had one observation for each unit in our dataset, denoted $y_1, \ldots, y_n$.
We often have bivariate data, of the form $(x_i, y_i)$, where $x_i$ and $y_i$ are real numbers both observed on unit $i$.
The most obvious way of graphically summarizing these data is simply to plot the points $(x_i, y_i)$ for $i = 1, \ldots, n$.
This is a scatterplot.
For random variables, we can define the correlation formally.
If $X$ and $Y$ are random variables with expectations $\mu_X, \mu_Y$ and standard deviations $\sigma_X, \sigma_Y$, then the correlation between them is:
$$\rho = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$
The sample correlation gives us a numerical summary of a bivariate dataset.
For data $(x_1, y_1), \ldots, (x_n, y_n)$, the sample correlation is defined as
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \quad S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
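The sample correlation can be sketched directly from $S_{xy}$, $S_{xx}$, and $S_{yy}$ (Python for illustration; the toy data are made up):

```python
def sample_correlation(xs, ys):
    """r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4]
print(sample_correlation(xs, [2, 4, 6, 8]))   # 1.0  (perfect positive linear)
print(sample_correlation(xs, [8, 6, 4, 2]))   # -1.0 (perfect negative linear)
```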
The sample correlation takes values between $-1$ and $1$. It is a measure of the linear relationship between $x$ and $y$.
If the value of $r$ is close to $1$, we say that there is a strong positive linear relationship between the two variates. (We can infer what this means for when $r$ is close to $-1$.)
When $r$ is close to $0$, we say there is no linear relationship between the two variates. But this does not mean they are unrelated; they are just uncorrelated.
A strong linear relationship is not necessarily a causal relationship ($x$ may not cause changes in $y$).
Correlation does not necessarily imply causation.
Response variates (dependent) vs explanatory (independent) variates: the explanatory variate can partially explain the distribution of the response variate. Which variate should be the response and which the explanatory requires investigation.
Proper analysis of data is crucial. Two broad aspects of the analysis and interpretation of data are:
- Descriptive statistics
- Statistical inference
Descriptive statistics are portrayals of the data, or parts of the data, in numerical and graphical ways to show features of interest. When data are used to draw general conclusions about a population, we call this statistical inference.
When we reason from the specific data to the general population, this is inductive reasoning. Statistical inference is a form of inductive reasoning.
Using general results (axioms) to prove theorems is called deductive reasoning.
Proof by induction is still deductive reasoning.
Lecture 4
Proposing a statistical model for data allows us to use our knowledge of that distribution’s theoretical properties to answer questions about our study.
In a statistical model, a random variable is used to represent a characteristic or variate of a randomly selected unit from the population or process.
A model is usually chosen based on: background knowledge or assumptions about the population; past experience with data sets from the population; mathematical convenience; and the current data set, against which the model can be assessed.
”All models are wrong, but some are useful” - George Box.
- Binomial$(n, \theta)$: model for the number of successes in repeated independent trials with two possible outcomes on each trial
- Poisson$(\theta)$: model for the random occurrence of events in time or space
- Exponential$(\theta)$: model to represent the distribution of waiting times until the occurrence of an event of interest
- Gaussian$(\mu, \sigma)$: model to represent the distribution of continuous measurements such as the heights or weights of individuals
Sequence of steps when choosing the model
- Collect and examine the data
- Propose a model
- Fit the model
- Check the model
- Propose a revised model (if necessary)
- Draw conclusions using the chosen model and the observed data
We have a family of models which is indexed by the parameter $\theta$. When we don’t know $\theta$, we write the p.d.f. of a random variable $Y$ as $f(y; \theta)$ for $\theta \in \Omega$ to emphasize the dependence of the model on the parameter.
Estimation of unknown parameters.
We need a value of $\theta$ estimated using the data. We denote this value $\hat{\theta}$. This is ‘estimating’ the value of $\theta$.
One particular way of estimating the model parameters is maximum likelihood estimation.
Suppose the random variable $Y$ models the weight of a randomly chosen goose on campus; we are interested in estimating the unknown quantity $\mu = E(Y)$.
If we randomly select $n$ geese, measure their weights, and estimate $\mu$ using $\bar{y}$, the sample mean, we might write $\hat{\mu} = \bar{y}$.
$\mu$ is not necessarily equal to the sample mean: different draws of the sample result in different sample means.
Definition
A point estimate $\hat{\theta}$ of a parameter $\theta$ is the value of a function of the observed data $y_1, \ldots, y_n$ and other known quantities such as the sample size $n$.
For Binomial$(n, \theta)$ data $y$: we estimate $\theta$ by $\hat{\theta} = y/n$, the sample proportion.
We use the method of maximum likelihood to estimate an unknown parameter $\theta$ in an assumed model for the observed data $y$.
The likelihood function for $\theta$ is defined as
$$L(\theta) = P(Y = y; \theta), \quad \theta \in \Omega$$
The likelihood function is the probability of observing the data, viewed as a function of $\theta$.
The value of $\theta$ that maximizes $L(\theta)$ for given data is called the maximum likelihood estimate of $\theta$, and is denoted by $\hat{\theta}$.
Suppose a Binomial$(n, \theta)$ experiment is conducted and $y$ successes are observed. The likelihood function for $\theta$ based on the observed data is
$$L(\theta) = \binom{n}{y} \theta^{y} (1 - \theta)^{n - y}, \quad 0 \le \theta \le 1$$
The maximum likelihood estimate of $\theta$ is $\hat{\theta} = y/n$.
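As a sanity check on $\hat{\theta} = y/n$ (a sketch, not part of the notes), a crude grid search over $\theta$ lands on the closed-form answer:

```python
from math import comb

def binom_likelihood(theta, n, y):
    """L(theta) = C(n, y) * theta^y * (1 - theta)^(n - y)."""
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

n, y = 10, 3
grid = [i / 1000 for i in range(1, 1000)]       # theta values in (0, 1)
theta_hat = max(grid, key=lambda t: binom_likelihood(t, n, y))
print(theta_hat)   # 0.3, matching y/n
```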
Relative likelihood is the function $R(\theta) = L(\theta) / L(\hat{\theta})$; for all values of $\theta$ that are not the maximum likelihood estimate, this relative likelihood will be less than 1.
More formally,
$$0 \le R(\theta) \le 1 \quad \text{for all } \theta \in \Omega, \qquad R(\hat{\theta}) = 1$$
For binomial data,
$$R(\theta) = \frac{\theta^{y} (1 - \theta)^{n - y}}{\hat{\theta}^{y} (1 - \hat{\theta})^{n - y}}$$
The log likelihood function is defined as $\ell(\theta) = \ln L(\theta)$. Taking the logarithm makes the algebra easier.
For the binomial likelihood function, we have the log likelihood
$$\ell(\theta) = \ln \binom{n}{y} + y \ln \theta + (n - y) \ln(1 - \theta)$$
The graph of $\ell(\theta)$ is quadratic in shape.
To maximize the log likelihood function, we can differentiate each term separately.
Example
Poisson data
For Poisson data $y_1, \ldots, y_n$, we have
$$f(y; \theta) = \frac{\theta^{y} e^{-\theta}}{y!}, \quad y = 0, 1, 2, \ldots$$
and we can derive the likelihood as
$$L(\theta) = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!} = \frac{\theta^{n\bar{y}} e^{-n\theta}}{\prod_{i=1}^{n} y_i!}$$
The term $1 / \prod_{i=1}^{n} y_i!$ in front doesn’t depend on $\theta$, so we can ignore it:
$$L(\theta) = \theta^{n\bar{y}} e^{-n\theta}$$
We differentiate the log likelihood $\ell(\theta) = n\bar{y} \ln \theta - n\theta$:
$$\ell'(\theta) = \frac{n\bar{y}}{\theta} - n$$
Setting $\ell'(\theta) = 0$ leads to $\hat{\theta} = \bar{y}$.
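The same grid-search sanity check for the Poisson case (a sketch with made-up counts) recovers the sample mean:

```python
from math import log

def poisson_log_lik(theta, ys):
    """l(theta) = (sum y_i) * ln(theta) - n * theta, constants dropped."""
    return sum(ys) * log(theta) - len(ys) * theta

data = [2, 3, 1, 4, 2]                          # made-up counts; ybar = 2.4
grid = [i / 100 for i in range(1, 1000)]        # theta values in (0, 10)
theta_hat = max(grid, key=lambda t: poisson_log_lik(t, data))
print(theta_hat)   # 2.4, matching the sample mean
```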
We need to modify this approach for continuous distributions.
Let’s suppose $y_1, \ldots, y_n$ is a random sample from a continuous distribution with probability density function $f(y; \theta)$ for $\theta \in \Omega$.
We define the likelihood function for $\theta$ based on the observed data as
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta), \quad \theta \in \Omega$$
Invariance property of maximum likelihood estimates.
We’ve found that a good way to estimate something is to find the MLE.
One reason the method of maximum likelihood is so popular is the invariance property.
Definition
If $\hat{\theta}$ is the maximum likelihood estimate of $\theta$, then $g(\hat{\theta})$ is the maximum likelihood estimate of $g(\theta)$.
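A small numerical illustration (a sketch; the odds reparameterization $\psi = \theta/(1-\theta)$, so $\theta = \psi/(1+\psi)$, is chosen purely for demonstration). Maximizing the same binomial likelihood over $\psi$ directly gives $\hat{\psi} = \hat{\theta}/(1-\hat{\theta})$:

```python
from math import comb

def L(theta, n=10, y=5):
    # binomial likelihood with made-up data: 5 successes in 10 trials
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

thetas = [i / 1000 for i in range(1, 1000)]
theta_hat = max(thetas, key=L)                       # MLE of theta

psis = [i / 100 for i in range(1, 1001)]
psi_hat = max(psis, key=lambda p: L(p / (1 + p)))    # MLE of the odds psi

print(theta_hat)   # 0.5
print(psi_hat)     # 1.0 = theta_hat / (1 - theta_hat)
```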
Another way to check model fit is to compare the observed values with the expected values (both numerically and graphically). We can also use the e.c.d.f. of the data and compare it with the theoretical c.d.f.
We can also use Q-Q plots.
If $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$ are the observed data, ordered from smallest to largest, then the plot of the points
$$\left( F^{-1}\!\left( \frac{i}{n+1} \right),\; y_{(i)} \right), \quad i = 1, \ldots, n$$
(where $F^{-1}$ is the inverse c.d.f. of a $G(0, 1)$ random variable) should be approximately a straight line if the data are well modelled by the normal distribution.
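The coordinates can be computed with the standard normal inverse c.d.f. from Python's standard library (a sketch; the plotting itself is omitted and the data are made up):

```python
from statistics import NormalDist

def qq_points(ys):
    """Pairs (F^{-1}(i / (n + 1)), y_(i)) against the standard normal;
    roughly a straight line if a Gaussian model fits."""
    s = sorted(ys)
    n = len(s)
    inv = NormalDist().inv_cdf          # standard normal quantile function
    return [(inv(i / (n + 1)), s[i - 1]) for i in range(1, n + 1)]

pts = qq_points([1.2, -0.3, 0.1, 0.8, -1.0])
for theo, obs in pts:
    print(round(theo, 3), obs)
```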
We can see skewness and kurtosis via the Q-Q plots. If the points are S-shaped, this usually indicates symmetry and low skewness. If they are U-shaped, this indicates asymmetry. The tails can tell us about skewness and kurtosis.
Lecture 5
Remember, we never prove an assumption is true. We see if we can find evidence against an assumption. Do not use definitive statements.