Sampling and Sampling Distributions

Statistics draws conclusions about a population from a sample. To do so reliably, we must understand how samples behave. Sampling distributions describe the probabilistic behaviour of sample statistics across many hypothetical samples.

Population and Sample

A population is the entire collection of items of interest, described by parameters like μ and σ. A sample is a subset actually observed. Summaries of the sample, called statistics, estimate population parameters.

Random Sampling

Simple random sampling selects n items such that every subset of size n has an equal chance of being chosen. Stratified, cluster, and systematic sampling are alternatives used when the population has natural groupings or when simple random sampling is impractical. Good sampling design is the foundation of valid inference.
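As a rough illustration of the difference between simple random and stratified sampling, the sketch below uses only Python's standard library. The function names, the toy population of the integers 1 to 100, and the split into "low" and "high" strata are all assumptions made for this example, not part of any standard procedure.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the same fraction within each stratum."""
    rng = random.Random(seed)
    sample = []
    for items in strata.values():
        k = max(1, round(fraction * len(items)))  # at least one item per stratum
        sample.extend(rng.sample(items, k))
    return sample


# Hypothetical population: the integers 1..100, split into two equal strata.
population = list(range(1, 101))
strata = {"low": list(range(1, 51)), "high": list(range(51, 101))}

srs = simple_random_sample(population, 10, seed=0)
strat = stratified_sample(strata, 0.1, seed=0)
```

The simple random sample may by chance contain mostly low or mostly high values, whereas the stratified sample always contains five items from each stratum.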

Sources of Error

Sampling produces sampling error, variability simply due to chance. Non-sampling errors come from bias, measurement error, or non-response. Statistical theory quantifies sampling error but cannot fix systematic biases; careful design and data collection are essential.

Sampling Distribution of the Mean

If X₁, ..., X_n are independent with mean μ and variance σ², the sample mean X̄ has mean μ and variance σ²/n. The standard error of the mean is σ/√n. Doubling the sample size halves the variance of the sample mean but shrinks the standard error only by a factor of √2; halving the standard error requires quadrupling n.
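The σ/√n formula can be checked by simulation. The following is a minimal sketch that assumes a normal population with μ = 50 and σ = 10 (values chosen only for illustration) and compares the spread of many simulated sample means with the theoretical standard error.

```python
import math
import random
import statistics


def standard_error(sigma, n):
    """Theoretical standard error of the mean, sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def simulate_sample_means(n, reps, mu=50.0, sigma=10.0, seed=0):
    """Return the means of `reps` independent samples of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]


# Quadrupling n from 25 to 100 should roughly halve the spread of the means.
means_25 = simulate_sample_means(n=25, reps=4000)
means_100 = simulate_sample_means(n=100, reps=4000)

empirical_se_25 = statistics.stdev(means_25)    # theory: 10 / sqrt(25)  = 2.0
empirical_se_100 = statistics.stdev(means_100)  # theory: 10 / sqrt(100) = 1.0
```

The empirical values will not match the theoretical ones exactly, since they are themselves estimates from a finite number of simulated samples.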

Central Limit Theorem

For large n, the sample mean of independent observations from a common population is approximately normal, whatever the shape of the population distribution, provided the population variance is finite. This central limit theorem is the foundation of most inferential techniques and justifies using the normal distribution for standardized sample means.
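One way to see the theorem at work is to draw samples from a clearly non-normal population and standardize their means. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1), a sample size of 50, and 5000 replications; all of these are illustrative choices.

```python
import math
import random
import statistics


def standardized_means(n, reps, seed=0):
    """Standardized sample means (xbar - mu) / (sigma / sqrt(n)) for Exponential(1) data."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(rate=1)
    z_values = []
    for _ in range(reps):
        xbar = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        z_values.append((xbar - mu) / (sigma / math.sqrt(n)))
    return z_values


z_values = standardized_means(n=50, reps=5000)

# For a standard normal variable, about 95% of values fall within +/-1.96.
share_within_196 = sum(abs(z) <= 1.96 for z in z_values) / len(z_values)
```

If the normal approximation is reasonable at this sample size, share_within_196 should be close to 0.95 even though the underlying population is strongly skewed.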

Sampling Distribution of the Proportion

The sample proportion p̂ for a binary population has mean p and variance p(1 − p)/n. For large n (a common rule of thumb is that np and n(1 − p) are both at least about 10), it is approximately normal with these parameters, enabling normal approximation inference for proportions and polls.
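As a worked example of the normal approximation, the sketch below computes the standard error and an approximate 95% interval for a hypothetical poll in which 52% of 1000 respondents favour a proposal. Substituting p̂ for the unknown p in the variance formula is an additional assumption (the usual Wald interval), and the poll figures are invented for illustration.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a sample proportion, sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


def wald_interval(p_hat, n, z=1.96):
    """Approximate interval p_hat +/- z * SE, with p_hat plugged in for p."""
    se = proportion_standard_error(p_hat, n)
    return p_hat - z * se, p_hat + z * se


# Hypothetical poll: 52% support among n = 1000 respondents.
low, high = wald_interval(0.52, 1000)
margin_of_error = (high - low) / 2  # roughly 3 percentage points
```

The margin of error of about three percentage points is why polls of roughly a thousand people are often quoted as accurate to within ±3%.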

Sampling Distribution of the Variance

If the observations are independent draws from a normal population and S² is the sample variance, then (n − 1)S²/σ² has a chi-square distribution with n − 1 degrees of freedom. This result underpins confidence intervals and tests for variances and is central to quality control.
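The chi-square result can be checked informally by simulation. A chi-square distribution with k degrees of freedom has mean k and variance 2k, so for n = 10 the simulated values of (n − 1)S²/σ² should average about 9 with a variance of about 18. The population parameters (μ = 0, σ = 2) and the number of replications below are arbitrary choices for this sketch.

```python
import random
import statistics


def scaled_variance_draws(n, reps, mu=0.0, sigma=2.0, seed=0):
    """Simulate (n - 1) * S^2 / sigma^2 for normal samples of size n."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # sample variance, divisor n - 1
        draws.append((n - 1) * s2 / sigma**2)
    return draws


draws = scaled_variance_draws(n=10, reps=5000)

mean_of_draws = statistics.fmean(draws)         # theory: n - 1 = 9
variance_of_draws = statistics.variance(draws)  # theory: 2 * (n - 1) = 18
```

Matching the first two moments does not prove the distribution is chi-square, but it is a quick sanity check that the scaling by (n − 1)/σ² is right.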

Finite Population Correction

When sampling a substantial fraction of a finite population of size N without replacement, the variance of the sample mean is multiplied by a finite population correction factor (N − n)/(N − 1), so the standard error is multiplied by its square root. The correction reduces the standard error when the sample is large relative to the population and is negligible when n is a small fraction of N.
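The effect of the correction is easiest to see with numbers. The sketch below assumes σ = 10 and n = 100 and compares a very large population with one of only 200 items; the function names and figures are illustrative.

```python
import math


def fpc(N, n):
    """Finite population correction factor applied to the variance."""
    return (N - n) / (N - 1)


def se_mean_finite(sigma, n, N):
    """Standard error of the mean when sampling n of N items without replacement."""
    return math.sqrt(sigma**2 / n * fpc(N, n))


# Uncorrected standard error is 10 / sqrt(100) = 1.0 in both cases.
se_small_fraction = se_mean_finite(10, 100, 1_000_000)  # sample is 0.01% of population
se_large_fraction = se_mean_finite(10, 100, 200)        # sample is 50% of population
```

With half the population sampled, the standard error falls to about 0.71 of its uncorrected value, while for the very large population the correction changes almost nothing. When n equals N the factor is zero, reflecting that a full census has no sampling error.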

Bootstrap Methods

Modern computers enable resampling approaches. The bootstrap repeatedly samples with replacement from the observed data to approximate the sampling distribution of almost any statistic. It replaces analytical formulas with simulation and handles complicated estimators easily.
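A minimal sketch of the idea, using only the standard library, is shown below. The twelve data values are made up for illustration, and the choice of the median as the statistic and of 2000 resamples are assumptions of this example.

```python
import random
import statistics


def bootstrap_distribution(data, statistic, reps=2000, seed=0):
    """Recompute `statistic` on `reps` resamples drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(reps)]


# Made-up observations, used only to demonstrate the mechanics.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9, 2.5, 3.9]

# The median has no simple standard error formula, so estimate it by resampling.
boot_medians = bootstrap_distribution(data, statistics.median)
boot_se = statistics.stdev(boot_medians)
```

The standard deviation of the resampled medians serves as an estimate of the median's standard error. Passing a different function, such as statistics.fmean, gives the corresponding bootstrap distribution for another statistic without any change to the resampling code.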

Summary

Sample statistics fluctuate from sample to sample, and that fluctuation is quantified by sampling distributions. The central limit theorem, standard errors, and resampling methods allow us to reason about how confident we can be in estimates drawn from data.
