Sampling and Sampling Distributions
Statistics draws conclusions about a population from a sample. To do so reliably, we must understand how samples behave. Sampling distributions describe the probabilistic behaviour of sample statistics across many hypothetical samples.
Population and Sample
A population is the entire collection of items of interest, described by parameters such as the mean μ and the standard deviation σ. A sample is the subset that is actually observed. Summaries computed from the sample, called statistics, are used to estimate the population parameters.
Random Sampling
Simple random sampling selects n items such that every subset of size n has equal chance of being chosen. Stratified, cluster, and systematic sampling are variants used when the population has natural groupings or when simple random sampling is impractical. Good sampling design is the foundation of valid inference.
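As an illustration of the designs named above, here is a minimal sketch of simple random sampling and systematic sampling using only Python's standard library. The function names, the seed, and the population of 100 labelled units are hypothetical and chosen only for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Take every k-th item after a random start, with k = len(population) // n."""
    rng = random.Random(seed)
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

# Hypothetical population: units labelled 1..100.
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=1)
sys_sample = systematic_sample(population, 10, seed=1)
print(srs)
print(sys_sample)
```

Systematic sampling is easier to carry out in the field, but unlike simple random sampling it does not give every subset of size n the same chance of selection.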
Sources of Error
Sampling error is the chance variability that arises because only part of the population is observed. Non-sampling errors come from bias, measurement error, or non-response. Statistical theory quantifies sampling error but cannot fix systematic biases; careful design and data collection are essential.
Sampling Distribution of the Mean
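If X1, ..., Xn are independent with common mean μ and variance σ², the sample mean X̅ has mean μ and variance σ²/n. Its standard deviation, σ/√n, is called the standard error of the mean. Because the variance shrinks in proportion to 1/n while the standard error shrinks in proportion to 1/√n, doubling the sample size halves the variance of X̅, but halving the standard error requires quadrupling n.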
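The σ/√n formula can be checked by simulation. The sketch below assumes a normal population with μ = 50 and σ = 10 (values chosen only for illustration), draws many samples of size 25 and of size 100, and compares the standard deviation of the simulated sample means with the theoretical standard errors of 2 and 1.

```python
import math
import random
import statistics

def simulated_standard_error(n, mu=50.0, sigma=10.0, reps=4000, seed=0):
    """Standard deviation of the sample mean across `reps` simulated samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

se_25 = simulated_standard_error(25)
se_100 = simulated_standard_error(100)

# Theoretical values under the assumed sigma = 10.
theory_25 = 10.0 / math.sqrt(25)    # 2.0
theory_100 = 10.0 / math.sqrt(100)  # 1.0

print(se_25, theory_25)
print(se_100, theory_100)
```

Quadrupling n from 25 to 100 should roughly halve the simulated standard error, up to simulation noise.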
Central Limit Theorem
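For large n, the sample mean of independent, identically distributed observations is approximately normal, whatever the shape of the population distribution, provided the population variance is finite. How large n must be depends on how skewed or heavy-tailed the population is. This result, the central limit theorem (CLT), is the foundation of most inferential techniques: it justifies treating the standardized sample mean (X̅ − μ)/(σ/√n) as approximately standard normal.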
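A small simulation can make the theorem concrete. The sketch below assumes a strongly skewed population, the exponential distribution with rate 1 (so μ = 1 and σ = 1), and checks how often the standardized sample mean for n = 50 falls within ±1.96. If the normal approximation is reasonable, that fraction should be near 0.95. The sample size, repetition count, and seed are illustrative choices.

```python
import math
import random

def standardized_means(n, reps=5000, seed=0):
    """Standardized sample means from an Exponential(1) population."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    zs = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append((xbar - mu) / (sigma / math.sqrt(n)))
    return zs

zs = standardized_means(50)
# Fraction of standardized means inside the central 95% band of N(0, 1).
coverage = sum(abs(z) < 1.96 for z in zs) / len(zs)
print(coverage)
```

Rerunning with a smaller n, such as 5, should show the approximation degrading for this skewed population.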
Sampling Distribution of the Proportion
The sample proportion p̂, computed from n independent draws from a binary population with success probability p, has mean p and variance p(1 − p)/n. For large n, it is approximately normal with these parameters, which enables normal-approximation inference for proportions, as used in opinion polls.
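As a worked example, the sketch below computes an approximate 95% interval for a proportion from a hypothetical poll of 1000 respondents, 520 of whom answer yes. Note the assumption: because p is unknown in practice, the code plugs p̂ into the variance formula, which is a common approximation rather than something stated above.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Normal-approximation interval for a proportion (plug-in standard error)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # p_hat stands in for the unknown p
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Hypothetical poll: 520 "yes" answers out of 1000 respondents.
p_hat, se, (lower, upper) = proportion_interval(520, 1000)
margin = 1.96 * se
print(p_hat, se, margin)
print(lower, upper)
```

The margin of error comes out at roughly three percentage points, the figure often quoted for polls of about this size.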
Sampling Distribution of the Variance
If the population is normal, (n − 1)S²/σ² has a chi-square distribution with n − 1 degrees of freedom, where S² is the sample variance. This result underpins confidence intervals and tests for variances and is central to quality control.
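The chi-square result can be checked informally by simulation. A chi-square variable with k degrees of freedom has mean k and variance 2k, so for n = 10 the simulated values of (n − 1)S²/σ² should average about 9 with variance about 18. The sketch assumes a normal population with σ = 2; that value, the repetition count, and the seed are illustrative.

```python
import random
import statistics

def scaled_variances(n, sigma=2.0, reps=5000, seed=0):
    """Simulated values of (n - 1) * S^2 / sigma^2 for normal samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # sample variance, divisor n - 1
        values.append((n - 1) * s2 / sigma**2)
    return values

n = 10
values = scaled_variances(n)
sim_mean = statistics.fmean(values)    # should be near n - 1 = 9
sim_var = statistics.variance(values)  # should be near 2 * (n - 1) = 18
print(sim_mean, sim_var)
```

Matching the first two moments does not prove the distribution is chi-square, but a clear mismatch would signal that the normality assumption is violated.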
Finite Population Correction
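When sampling without replacement takes a substantial fraction of a finite population of size N, the variance of the sample mean is multiplied by the finite population correction factor (N − n)/(N − 1), so the standard error is multiplied by its square root. The correction reduces the standard error when the sample is large relative to the population and is negligible when n is a small fraction of N.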
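The effect of the correction is easy to compute directly. The sketch below uses hypothetical values N = 1000, n = 200, and σ = 10, and applies the square root of the correction factor to the standard error.

```python
import math

def standard_error(sigma, n, population_size=None):
    """Standard error of the mean, with the finite population correction
    applied when a population size N is supplied."""
    se = sigma / math.sqrt(n)
    if population_size is not None:
        fpc = (population_size - n) / (population_size - 1)  # variance factor
        se *= math.sqrt(fpc)  # the standard error scales by the square root
    return se

# Hypothetical values: sigma = 10, n = 200 drawn from N = 1000.
uncorrected = standard_error(10.0, 200)
corrected = standard_error(10.0, 200, population_size=1000)
census = standard_error(10.0, 1000, population_size=1000)
print(uncorrected, corrected, census)
```

Sampling a fifth of the population shrinks the standard error by about ten percent here, and a full census (n = N) drives it to zero, as expected when nothing is left unobserved.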
Bootstrap Methods
Modern computers enable resampling approaches. The bootstrap repeatedly samples with replacement from the observed data to approximate the sampling distribution of almost any statistic. It replaces analytical formulas with simulation and handles complicated estimators easily.
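The procedure can be sketched in a few lines. The example below bootstraps the median, a statistic with no simple standard-error formula, and reports a bootstrap standard error and a percentile interval. The data values are invented for illustration, and the percentile interval is only one of several ways to turn bootstrap replicates into an interval.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, reps=2000, seed=0):
    """Recompute `statistic` on `reps` resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(reps)]

def percentile_interval(replicates, level=0.95):
    """Interval between the lower and upper tail percentiles of the replicates."""
    ordered = sorted(replicates)
    alpha = (1 - level) / 2
    lo = ordered[int(alpha * len(ordered))]
    hi = ordered[int((1 - alpha) * len(ordered)) - 1]
    return lo, hi

# Invented data, used only to illustrate the mechanics.
data = [12.1, 9.8, 15.3, 11.0, 10.4, 22.7, 9.1, 13.6,
        10.9, 12.8, 18.2, 11.5, 9.9, 14.0, 10.2]

replicates = bootstrap_replicates(data, statistics.median)
boot_se = statistics.stdev(replicates)  # bootstrap standard error of the median
lo, hi = percentile_interval(replicates)
print(statistics.median(data), boot_se, (lo, hi))
```

Swapping `statistics.median` for any other function of a list gives the bootstrap distribution of that statistic with no further changes.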
Summary
Sample statistics fluctuate from sample to sample, and that fluctuation is quantified by sampling distributions. The CLT, standard errors, and resampling methods allow us to reason about how confident we can be in estimates drawn from data.