Professor: Michael Wallace | Term: Fall 2025
Lecture 1
The 4 assignments in this course (which are not graded) give the opportunity to analyze a real-world dataset (the Stanford Open Policing Project) using R.
Question
Why study stats?
- To be a statistician.
- Problem solving: This reminds me of a puzzle.
- Variety: Theory, application, anything in between.
- Utility: Useful to almost everyone.
- Coolness: You will be cool at parties.
- To be a user of statistics.
- Science: Want to know if you’ve done an experiment properly? Will need statistics.
- Industry: Does our new product work better? Why are we losing customers?
- Spare time: Can I play the stock market?
- Being informed on issues: Should guns be regulated?
- To improve critical, analytical, and communication skills.
- Learn to ask the right questions.
- Understand the importance of precision, in words and measurements.
Key features of STAT231:
- The language of statistics
- Understanding what data can and can’t tell us
- Methods of estimation and analysis
- Principles over proofs (this is a STAT, not a MATH, course)
A key feature of an empirical study is that it involves uncertainty.
If we run an experiment more than once, we’ll almost certainly get different results.
We use probability models to try and model this uncertainty.
Definition
A unit is an individual person, place or thing about which we can take some measurement(s).
Definition
A population is a collection of units.
Definition
A process is a collection of units, but those units are ‘produced’ over time.
Populations and processes are both collections of units. A key feature of processes is that they usually occur over time, whereas populations are static.
Population: All current UW undergraduate students.
Process: All UW undergraduate students for the next ten years.
Definition
Variates are characteristics of units, which are usually represented by letters such as $x$, $y$, and $z$.
Variates come in many flavours, including:
- Continuous
- Discrete
- Categorical
- Ordinal
- Complex
Continuous variates are those that can be measured - at least in theory - to an infinite degree of accuracy.
Example
Height and weight, the lifetime of an electrical component etc.
Discrete variates, in contrast, are those that can only take a finite or countably infinite number of values.
Example
The number of car accidents on a certain stretch of highway in a year.
The distinction between discrete and continuous can be unclear.
The distinction affects the assumptions we make and the probability models we use to investigate the data.
Categorical variates are those where units fall into a non-numeric category.
Ordinal variates are those where an ordering is implied, but not necessarily through a numeric measure.
| Size | Volume (oz) |
|---|---|
| Small | 10 |
| Medium | 14 |
| Large | 20 |
| Extra Large | 24 |
This table maps an ordinal variate (size) to a continuous one (volume). We need to be careful when doing this: mapping ordinal variates to numbers can misrepresent the data.
Complex variates are more unusual, and include open-ended responses to survey questions, or an image. Usually requires processing to ‘convert’ them into one of the other types.
Warning
If one variate we care about is the number of seconds from an assignment's release until it is due (capped at 7 days), the set of seconds seems discrete, but the variate is still continuous (since we can always measure time even more precisely).
If we decide a variate is discrete, we usually use a discrete probability distribution to model it.
If we decide a variate is continuous, we usually use a continuous probability distribution to model it.
However this is not always the case. For example, age is a measure of time; it is continuous, but we often use discrete probability distributions to describe the variate.
Lecture 2
Definition
An attribute of a population or process is a function of a variate, which is defined for all units in the population or process.
A sample survey is where information is obtained about a finite population by selecting a ‘representative’ sample of units from the population and determining the variates of interest for each unit in the sample.
Example
Fortnite V-Bucks
Consider two possible studies to investigate how often players buy V-Bucks.
Study 1: A random sample of players is selected on 1 September 2025, and all in-game activity is logged for one week.
Study 2: A random sample of players is selected on 1 September 2025. Half the players are shown the V-Bucks pricing options in a different order.
Study 1 is an observational study - we learn about the population / process without any attempt to change any variates for the sampled units.
Study 2 is an experimental study - the experimenter intervenes and changes or sets the values of one or more variates for the units in the study.
| Observational | Survey |
|---|---|
| Population of interest infinite or conceptual | Finite, tangible/‘real’ |
| Data collected routinely over time | Often only one point of contact with participants |
| More passive, learning about population’s daily lives and/or habits | Specific questions about participants’ lives and experiences |
Sometimes a study may include a sample survey while also being observational. The above are examples, not rules that are always followed.
Example
20 STAT 231 students were asked how many assignments they completed.
How do we usefully display these data? We consider both numerical and graphical summaries.
Types of numerical measures:
- Measures of location
- Measures of variability
- Measures of shape
Location
Recall from STAT230: if $X$ is a discrete random variable with range $X(S)$ and probability function $f(x)$, then $E(X) = \sum_{x \in X(S)} x f(x)$; if $X$ is a continuous random variable with range $X(S)$ and probability density function $f(x)$, then $E(X) = \int_{-\infty}^{\infty} x f(x)\, dx$.
Suppose we roll a six-sided die 100 times. What do we expect to see?
Let the data be represented by $y_1, y_2, \ldots, y_n$, where each $y_i$ is a real number and our sample size is $n$.
One numerical measure of the ‘centre’ of the data is the sample mean.
Sample mean: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.
The idea is to give an empirical version of the previous theoretical idea of expectation.
We also know about another theoretical measure of the centre of a distribution: the median.
For a discrete random variable $X$ we might define the median to be any value $m$ which satisfies $P(X \le m) \ge 0.5$ and $P(X \ge m) \ge 0.5$.
To formally define the sample median, we introduce the ordered sample $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$, where $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$.
For an odd number of observations: sample median $= y_{((n+1)/2)}$. Informally, this is just the middle value.
The median is not unique in the case of an even number of observations.
The average of the middle two observations is chosen for convenience: sample median $= \frac{1}{2}\left(y_{(n/2)} + y_{(n/2+1)}\right)$. Informally, this is just the average of the middle two values.
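As a sketch of these definitions (in Python rather than the course's R; the dataset is made up for illustration):

```python
def sample_mean(ys):
    """Empirical analogue of E(Y): the average of the observations."""
    return sum(ys) / len(ys)

def sample_median(ys):
    """Middle value of the ordered sample; average of the two middle
    values when the number of observations is even."""
    s = sorted(ys)                 # ordered sample y_(1) <= ... <= y_(n)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                        # y_((n+1)/2)
    return (s[n // 2 - 1] + s[n // 2]) / 2      # average of y_(n/2), y_(n/2+1)

data = [3, 1, 4, 1, 5, 9, 2, 6, 5]
print(sample_mean(data))    # 4.0
print(sample_median(data))  # 4
```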
The sample mode is the most common value in a set of data. The sample mode is most useful for discrete or categorical data.
For frequency or grouped data, the group or class with the highest frequency is called the sample modal class.
Watch out for uses of the word ‘average’ in the media, sometimes it is used to refer to the mean, sometimes the median.
Example
Mean household income > Median household income due to long right tail.
In contrast to measures of central tendency, we are also interested in the spread / variability of data.
Examples include:
- Sample variance
- Range
- IQR
In STAT 230 we learned about the variance in the context of random variables. Recall: $\mathrm{Var}(X) = E\left[(X - \mu)^2\right] = E(X^2) - \mu^2$.
If rolling a die 100 times, we can calculate the sample mean, and it should be close to the ‘theoretical’ expectation $E(X) = 3.5$.
What about sample variance? Can we get something empirically close to the theoretical idea of variance?
Our data are denoted $y_1, \ldots, y_n$.
Definition
The sample variance is defined as
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$$
The sample standard deviation, denoted $s$, is just the square root of the sample variance.
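A minimal sketch of these two measures (Python for illustration; note the $n-1$ divisor, and the made-up data):

```python
def sample_variance(ys):
    """s^2 = (1 / (n - 1)) * sum of (y_i - ybar)^2."""
    n = len(ys)
    ybar = sum(ys) / n
    return sum((y - ybar) ** 2 for y in ys) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(data)
s = s2 ** 0.5               # sample standard deviation
print(s2)                   # 4.571... (= 32/7)
```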
Suppose we have a sample of data from a Gaussian distribution. We can expect:
- Approximately 68% of the sample should lie in the interval $[\bar{y} - s, \bar{y} + s]$
- Approximately 95% of the sample should lie in the interval $[\bar{y} - 2s, \bar{y} + 2s]$
Exercise: Show that if $Y \sim G(\mu, \sigma)$, then $P(\mu - \sigma \le Y \le \mu + \sigma) \approx 0.68$.
The range is defined as: range $= y_{(n)} - y_{(1)}$. The range is very susceptible to outliers.
Recall the $p$th quantile of a continuous distribution is given by the value $q(p)$ where $P(X \le q(p)) = p$.
Recall the $0.5$ quantile (the 50th percentile) is just the median.
One way to define the $p$th quantile (or $100p$th percentile) of a sample is the value, denoted $q(p)$, found as follows:
- Let $m = np$, where $n$ is the sample size
- If $m$ is an integer, take $q(p) = \frac{1}{2}\left(y_{(m)} + y_{(m+1)}\right)$
- If $m$ is not an integer, determine the smallest integer $j$ such that $j > m$ and take $q(p) = y_{(j)}$
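The steps above can be sketched as follows (a Python illustration of this simple textbook rule; it is not the interpolating rule statistical software typically defaults to):

```python
import math

def sample_quantile(ys, p):
    """p-th sample quantile via the simple rule above (0 < p < 1)."""
    s = sorted(ys)                    # ordered sample
    n = len(s)
    m = n * p
    if m == int(m):                   # m = np is an integer
        m = int(m)
        if m == n:                    # guard the upper edge
            return s[-1]
        return (s[m - 1] + s[m]) / 2  # average of y_(m) and y_(m+1)
    j = math.ceil(m)                  # smallest integer j with j > m
    return s[j - 1]                   # y_(j) (1-indexed)

data = [1, 2, 3, 4, 5, 6, 7, 8]
print(sample_quantile(data, 0.25))   # m = 2 -> (y_(2) + y_(3))/2 = 2.5
print(sample_quantile(data, 0.50))   # m = 4 -> (y_(4) + y_(5))/2 = 4.5
print(sample_quantile(data, 0.75) - sample_quantile(data, 0.25))  # IQR = 4.0
```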
R uses a different method of calculating quantiles by default.
The 25th, 50th, 75th percentiles are known as quartiles.
It is common for data to be divided into quartiles.
The 25th percentile is the lower (first) quartile, and the 75th percentile is the upper (third) quartile.
The interquartile range, or IQR, is defined as $\mathrm{IQR} = q(0.75) - q(0.25)$; roughly 50% of the observations should lie between the lower and upper quartiles.
Question
Why do we want this?
The IQR is more robust: it is less affected by outliers than the mean or variance.
We finish with measures of shape, which quantify differently shaped distributions.
Sample skewness measures the asymmetry of the data, and is calculated as:
$$\text{sample skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{3/2}}$$
The denominator is equal to the sample standard deviation cubed, except we replace the $n-1$ in the denominator of $s^2$ with $n$.
We can think of sample skewness as an empirical version of the theoretical concept of the third moment.
The numerator can be positive or negative.
We can infer some properties of the shape of a dataset’s distribution by looking at the sign of the skewness.
For a ‘normal distribution’ we have skewness $\approx 0$, since the negative cubed deviations $(y_i - \bar{y})^3 < 0$ (on the left) and the positive cubed deviations $(y_i - \bar{y})^3 > 0$ (on the right) cancel each other out.
For a long right tail we have skewness $> 0$: the large positive values of $(y_i - \bar{y})^3$ from the right tail outweigh the negative values from the left. This results in positive skewness (skewed to the right).
For a long left tail we have skewness $< 0$ (skewed to the left).
Kurtosis measures whether data are concentrated in a central peak, or in the tails.
Sample kurtosis is calculated as
$$\text{sample kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{2}}$$
This is the empirical version of the theoretical concept of the fourth moment.
Now, both the numerator and denominator are positive.
Data that look Gaussian have a sample kurtosis close to 3. Data with heavy tails have a sample kurtosis larger than 3. Data with shorter tails have a sample kurtosis less than 3. Data that look uniform have a sample kurtosis close to 1.8.
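Both shape measures can be sketched directly from the $\frac{1}{n}$ central moments (Python for illustration; the toy data are made up). The roughly uniform data below give a kurtosis near the 1.8 mentioned above:

```python
def central_moment(ys, k):
    """(1/n) * sum of (y_i - ybar)^k."""
    n = len(ys)
    ybar = sum(ys) / n
    return sum((y - ybar) ** k for y in ys) / n

def sample_skewness(ys):
    # third central moment over the second raised to the 3/2 power
    return central_moment(ys, 3) / central_moment(ys, 2) ** 1.5

def sample_kurtosis(ys):
    # fourth central moment over the second squared
    return central_moment(ys, 4) / central_moment(ys, 2) ** 2

data = [1, 2, 3, 4, 5]               # symmetric, roughly uniform
print(sample_skewness(data))          # 0.0 (symmetric)
print(sample_kurtosis(data))          # 1.7 (short tails, near-uniform)
```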
We are often interested in whether a Gaussian model is appropriate for a particular sample.
- Is the sample mean close to the sample median?
- Is the sample skewness close to 0?
- Is the sample kurtosis close to 3?
We never prove an assumption is true. Instead we see if we can find evidence against an assumption.
Never use definitive statements such as “the assumption is true” or “the assumption is false”.
Statistics are built on assumptions.
Lecture 3
A useful data summary is the five number summary:
- The minimum
- The lower quartile
- The median
- The upper quartile
- The maximum
Even though this is simple, it is a pretty good summary.
Graphical summaries can be useful as well.
Histograms create graphical summaries of our data that show their distribution.
We partition the range of the data into non-overlapping intervals $I_j = [a_{j-1}, a_j)$ for $j = 1, \ldots, k$. Let $f_j$ be the number of values from $y_1, \ldots, y_n$ that are in $I_j$. The $f_j$ are called observed frequencies. Draw a rectangle above each of the intervals $I_j$ with height such that the rectangle’s area is proportional to the corresponding observed frequency $f_j$.
In a relative frequency histogram, the height of the rectangle over $I_j$ is chosen so that the area of the rectangle equals $f_j / n$, that is
$$\text{height}_j = \frac{f_j}{n(a_j - a_{j-1})}$$
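A sketch of the frequency and height calculations (Python; the breakpoints $a_0 < a_1 < \cdots < a_k$ and the data are made up). Each bar's area works out to $f_j/n$, so the areas sum to 1:

```python
def rel_freq_heights(ys, breaks):
    """For intervals I_j = [a_{j-1}, a_j), return pairs (f_j, height_j)
    with height_j = f_j / (n * (a_j - a_{j-1}))."""
    n = len(ys)
    out = []
    for lo, hi in zip(breaks, breaks[1:]):
        f = sum(1 for y in ys if lo <= y < hi)   # observed frequency f_j
        out.append((f, f / (n * (hi - lo))))
    return out

data = [0.1, 0.2, 0.4, 0.5, 0.9, 1.5]
hs = rel_freq_heights(data, [0, 0.5, 1.0, 2.0])
print(hs)   # frequencies 3, 2, 1; bar areas 3/6, 2/6, 1/6
```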
We can now compare this with the probability density function: the relative frequency histogram can be seen as an empirical version of the p.d.f.
An empirical c.d.f., in contrast, lets us compare the distribution of a dataset with a c.d.f. of a random variable.
Recall: A c.d.f. for a random variable $Y$ is a function giving $F(y) = P(Y \le y)$.
In general we estimate the probability of values at or below a given $y$ as the proportion of the sample at or below $y$.
Definition of the empirical c.d.f.:
$$\hat{F}(y) = \frac{\#\{i : y_i \le y\}}{n}$$
defined for all real values $y$.
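The empirical c.d.f. is essentially a one-liner (a Python sketch with made-up data):

```python
def ecdf(ys, y):
    """F-hat(y): the proportion of observations at or below y."""
    return sum(1 for yi in ys if yi <= y) / len(ys)

data = [1, 3, 3, 7]
print(ecdf(data, 3))    # 0.75
print(ecdf(data, 0))    # 0.0
print(ecdf(data, 10))   # 1.0
```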
The sample mean being lower than the sample median is common for negatively skewed data, but not guaranteed.
Another graphical way to summarize data is a box-plot.
Box-plots can help demonstrate skewness. It can also be used to compare the values of variates in two or more groups.
So far we’ve only considered univariate datasets: we only had one observation for each unit in our dataset, denoted $y_1, \ldots, y_n$.
We often have bivariate data, of the form $(x_i, y_i)$, where $x_i$ and $y_i$ are real numbers both observed on unit $i$.
The most obvious way of graphically summarizing these data is simply to plot the points $(x_i, y_i)$ for $i = 1, \ldots, n$.
This is a scatterplot.
For random variables, we can define the correlation formally.
If $X$ and $Y$ are random variables with expectations $\mu_X, \mu_Y$ and standard deviations $\sigma_X, \sigma_Y$, then the correlation between them is:
$$\rho = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$
The sample correlation gives us a numerical summary of a bivariate dataset.
For data $(x_1, y_1), \ldots, (x_n, y_n)$, the sample correlation is defined as
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \quad S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
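The sample correlation can be sketched directly from $S_{xy}$, $S_{xx}$, and $S_{yy}$ (Python for illustration; the toy data are made up):

```python
def sample_correlation(xs, ys):
    """r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4]
print(sample_correlation(xs, [2, 4, 6, 8]))   # 1.0  (perfect positive linear)
print(sample_correlation(xs, [8, 6, 4, 2]))   # -1.0 (perfect negative linear)
```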
The sample correlation takes values between $-1$ and $1$. It is a measure of the linear relationship between $x$ and $y$.
If the value of $r$ is close to $1$, we say that there is a strong positive linear relationship between the two variates. (We can infer what this means for when $r$ is close to $-1$.)
When $r$ is close to $0$, we say there is no linear relationship between the two variates. But this does not mean they are unrelated; they are just uncorrelated.
A strong linear relationship is not necessarily a causal relationship ($x$ may not cause changes in $y$).
Correlation does not necessarily imply causation.
Response variates (dependent) vs explanatory (independent) variates: the explanatory variate can partially explain the distribution of the response variate. Which variate should be the response and which the explanatory requires investigation.
Proper analysis of data is crucial. Two broad aspects of the analysis and interpretation of data are:
- Descriptive statistics
- Statistical inference
Descriptive statistics are portrayals of the data, or parts of the data, in numerical and graphical ways to show features of interest. When data are used to draw general conclusions about a population, we call this statistical inference.
When we reason from the specific data to the general population, this is inductive reasoning. Statistical inference is a form of inductive reasoning.
Using general results (axioms) to prove theorems is called deductive reasoning.
Proof by induction is still deductive reasoning.
Lecture 4
Proposing a statistical model for data allows us to use our knowledge of that distribution’s theoretical properties to answer questions about our study.
In a statistical model, a random variable is used to represent a characteristic or variate of a randomly selected unit from the population or process.
A model is usually chosen based on: background knowledge or assumptions about the population; past experience with data sets from the population; mathematical convenience; and the current data set, against which the model can be assessed.
”All models are wrong, but some are useful” - George Box.
- Binomial$(n, \theta)$: model for the number of successes in repeated independent trials with two possible outcomes on each trial
- Poisson$(\theta)$: model for the random occurrence of events in time or space
- Exponential$(\theta)$: model to represent the distribution of waiting times until the occurrence of an event of interest
- Gaussian$(\mu, \sigma)$: model to represent the distribution of continuous measurements such as the heights or weights of individuals
Sequence of steps when choosing the model
- Collect and examine the data
- Propose a model
- Fit the model
- Check the model
- Propose a revised model (if necessary)
- Draw conclusions using the chosen model and the observed data
We have a family of models which is indexed by the parameter $\theta$. When we don’t know $\theta$, we write the p.d.f. of a random variable $Y$ as $f(y; \theta)$ for $\theta \in \Omega$ to emphasize the dependence of the model on the parameter.
Estimation of unknown parameters.
We need a value of $\theta$ estimated using the data. We denote this value $\hat{\theta}$. This is ‘estimating’ the value of $\theta$.
One particular way of estimating the model parameters is maximum likelihood estimation.
Suppose the random variable $Y$ models the weight of a randomly chosen goose on campus; we are interested in estimating the unknown quantity $\mu = E(Y)$.
If we randomly select $n$ geese, measure their weights, and estimate $\mu$ using $\bar{y}$, the sample mean, we might write $\hat{\mu} = \bar{y}$.
$\mu$ is not necessarily equal to the sample mean: different draws of the sample result in different sample means.
Definition
A point estimate $\hat{\theta}$ of a parameter $\theta$ is the value of a function of the observed data $y_1, \ldots, y_n$ and other known quantities such as the sample size $n$.
For Binomial$(n, \theta)$ data $y$: we estimate $\theta$ by $\hat{\theta} = y/n$, the sample proportion.
We use the method of maximum likelihood to estimate an unknown parameter $\theta$ in an assumed model for the observed data $y$.
The likelihood function for $\theta$ is defined as
$$L(\theta) = P(Y = y; \theta), \quad \theta \in \Omega$$
The likelihood function is the probability of observing the data, viewed as a function of $\theta$.
The value of $\theta$ that maximizes $L(\theta)$ for given data is called the maximum likelihood estimate of $\theta$, and is denoted by $\hat{\theta}$.
Suppose a Binomial$(n, \theta)$ experiment is conducted and $y$ successes are observed. The likelihood function for $\theta$ based on the observed data is
$$L(\theta) = \binom{n}{y} \theta^{y} (1 - \theta)^{n - y}, \quad 0 \le \theta \le 1$$
The maximum likelihood estimate of $\theta$ is $\hat{\theta} = y/n$.
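As a sanity check on $\hat{\theta} = y/n$ (a sketch, not part of the notes), a crude grid search over $\theta$ lands on the closed-form answer:

```python
from math import comb

def binom_likelihood(theta, n, y):
    """L(theta) = C(n, y) * theta^y * (1 - theta)^(n - y)."""
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

n, y = 10, 3
grid = [i / 1000 for i in range(1, 1000)]       # theta values in (0, 1)
theta_hat = max(grid, key=lambda t: binom_likelihood(t, n, y))
print(theta_hat)   # 0.3, matching y/n
```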
Relative likelihood is the function $R(\theta) = L(\theta) / L(\hat{\theta})$; for all values of $\theta$ that are not the maximum likelihood estimate, this relative likelihood will be less than 1.
More formally,
$$0 \le R(\theta) \le 1 \quad \text{for all } \theta \in \Omega, \qquad R(\hat{\theta}) = 1$$
For binomial data,
$$R(\theta) = \frac{\theta^{y} (1 - \theta)^{n - y}}{\hat{\theta}^{y} (1 - \hat{\theta})^{n - y}}$$
The log likelihood function is defined as $\ell(\theta) = \ln L(\theta)$. Taking the logarithm makes the algebra easier.
For the binomial likelihood function, we have the log likelihood
$$\ell(\theta) = \ln \binom{n}{y} + y \ln \theta + (n - y) \ln(1 - \theta)$$
The graph of $\ell(\theta)$ is quadratic in shape.
To maximize the log likelihood function, we can differentiate each term separately.
Example
Poisson data
For Poisson data $y_1, \ldots, y_n$, we have
$$f(y; \theta) = \frac{\theta^{y} e^{-\theta}}{y!}, \quad y = 0, 1, 2, \ldots$$
and we can derive the likelihood as
$$L(\theta) = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!} = \frac{\theta^{n\bar{y}} e^{-n\theta}}{\prod_{i=1}^{n} y_i!}$$
The term $1 / \prod_{i=1}^{n} y_i!$ in front doesn’t depend on $\theta$, so we can ignore it:
$$L(\theta) = \theta^{n\bar{y}} e^{-n\theta}$$
We differentiate the log likelihood $\ell(\theta) = n\bar{y} \ln \theta - n\theta$:
$$\ell'(\theta) = \frac{n\bar{y}}{\theta} - n$$
Setting $\ell'(\theta) = 0$ leads to $\hat{\theta} = \bar{y}$.
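The same grid-search sanity check for the Poisson case (a sketch with made-up counts) recovers the sample mean:

```python
from math import log

def poisson_log_lik(theta, ys):
    """l(theta) = (sum y_i) * ln(theta) - n * theta, constants dropped."""
    return sum(ys) * log(theta) - len(ys) * theta

data = [2, 3, 1, 4, 2]                          # made-up counts; ybar = 2.4
grid = [i / 100 for i in range(1, 1000)]        # theta values in (0, 10)
theta_hat = max(grid, key=lambda t: poisson_log_lik(t, data))
print(theta_hat)   # 2.4, matching the sample mean
```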
We need to modify this approach for continuous distributions.
Let’s suppose $y_1, \ldots, y_n$ is a random sample from a continuous distribution with probability density function $f(y; \theta)$ for $\theta \in \Omega$.
We define the likelihood function for $\theta$ based on the observed data as
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta), \quad \theta \in \Omega$$
Invariance property of maximum likelihood estimates.
We’ve found that a good way to estimate something is to find the MLE.
One reason the method of maximum likelihood is so popular is the invariance property.
Definition
If $\hat{\theta}$ is the maximum likelihood estimate of $\theta$, then $g(\hat{\theta})$ is the maximum likelihood estimate of $g(\theta)$.
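A small numerical illustration (a sketch; the odds reparameterization $\psi = \theta/(1-\theta)$, so $\theta = \psi/(1+\psi)$, is chosen purely for demonstration). Maximizing the same binomial likelihood over $\psi$ directly gives $\hat{\psi} = \hat{\theta}/(1-\hat{\theta})$:

```python
from math import comb

def L(theta, n=10, y=5):
    # binomial likelihood with made-up data: 5 successes in 10 trials
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

thetas = [i / 1000 for i in range(1, 1000)]
theta_hat = max(thetas, key=L)                       # MLE of theta

psis = [i / 100 for i in range(1, 1001)]
psi_hat = max(psis, key=lambda p: L(p / (1 + p)))    # MLE of the odds psi

print(theta_hat)   # 0.5
print(psi_hat)     # 1.0 = theta_hat / (1 - theta_hat)
```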
Another way to check model fit is to compare the observed values with the expected values (both numerically and graphically). We can also use the e.c.d.f. of the data and compare it with the theoretical c.d.f.
We can also use Q-Q plots.
If $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$ are the observed data, ordered from smallest to largest, then the plot of the points
$$\left( F^{-1}\!\left( \frac{i}{n+1} \right),\; y_{(i)} \right), \quad i = 1, \ldots, n$$
(where $F^{-1}$ is the inverse c.d.f. of a $G(0, 1)$ random variable) should be approximately a straight line if the data are well modelled by the normal distribution.
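The coordinates can be computed with the standard normal inverse c.d.f. from Python's standard library (a sketch; the plotting itself is omitted and the data are made up):

```python
from statistics import NormalDist

def qq_points(ys):
    """Pairs (F^{-1}(i / (n + 1)), y_(i)) against the standard normal;
    roughly a straight line if a Gaussian model fits."""
    s = sorted(ys)
    n = len(s)
    inv = NormalDist().inv_cdf          # standard normal quantile function
    return [(inv(i / (n + 1)), s[i - 1]) for i in range(1, n + 1)]

pts = qq_points([1.2, -0.3, 0.1, 0.8, -1.0])
for theo, obs in pts:
    print(round(theo, 3), obs)
```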
We can see skewness and kurtosis via the Q-Q plots. If the points are S-shaped, this usually indicates symmetry and low skewness. If they are U-shaped, this indicates asymmetry. The tails can tell us about skewness and kurtosis.
Lecture 5
Remember, we never prove an assumption is true. We see if we can find evidence against an assumption. Do not use definitive statements.