Correlation and Regression
Correlation measures the strength of association between variables, and regression models one variable as a function of others. Both are foundational tools for exploring relationships and making predictions from data.
Scatter Plots
Before any numerical summary, a scatter plot of Y versus X reveals shape, direction, and outliers. Linear trends call for Pearson correlation and linear regression; nonlinear patterns suggest transformations or alternative models.
Pearson Correlation Coefficient
The Pearson correlation coefficient $r$ measures linear association: $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$. It lies in $[-1, 1]$. Values near +1 or −1 indicate strong linear association; values near 0 indicate weak linear association. Because $r$ captures only the linear component, it can be close to 0 even when a strong nonlinear relationship exists.
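The formula can be computed directly from the sums of deviations. The following is a minimal sketch in plain Python; the function name `pearson_r` and the small data sets are made up for illustration and are not taken from any particular library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Illustrative data (invented for this example).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_linear = pearson_r(x, y)  # 6 / sqrt(60), about 0.775

# A perfect but symmetric quadratic relationship has no linear component.
x_sym = [-2, -1, 0, 1, 2]
y_sq = [v * v for v in x_sym]
r_quadratic = pearson_r(x_sym, y_sq)  # 0.0
```

The second data set shows why a value of r near zero should not be read as "no relationship": y is fully determined by x, yet the linear association is zero.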
Spearman's Rank Correlation
Spearman's ρ applies Pearson correlation to the ranks of the data rather than to the raw values. It detects monotone relationships and is less sensitive to outliers than Pearson's r. It is useful when data are ordinal or when the relationship is nonlinear but monotone.
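As a rough sketch of the rank-based idea, the code below ranks each variable (tied values share the mean of their ranks, a common convention that is assumed here) and then applies Pearson correlation to the ranks. The helper names `average_ranks` and `spearman_rho` are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Monotone but strongly nonlinear relationship (illustrative data).
x = [1, 2, 3, 4, 5]
y = [math.exp(v) for v in x]
rho = spearman_rho(x, y)  # ranks agree perfectly
r = pearson_r(x, y)       # smaller, because the trend is not a line
```

On this data the ranks of x and y coincide, so ρ reaches its maximum while Pearson's r stays below it.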
Simple Linear Regression
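Simple linear regression fits $y = \beta_0 + \beta_1 x + \varepsilon$. Ordinary least squares chooses the estimates that minimize the sum of squared residuals $\sum_i e_i^2$, where $e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$. The slope is $\hat{\beta}_1 = r \cdot (s_y / s_x)$, where $s_x$ and $s_y$ are the sample standard deviations; the intercept is $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. The fitted values are $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, and the residuals are $e_i = y_i - \hat{y}_i$.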
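A small sketch of the least-squares fit follows. It computes the slope as the ratio of the cross-product sum to the sum of squared x deviations, which is algebraically the same as r times the ratio of standard deviations. The function name `fit_simple_ols` and the data are invented for illustration.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                     # equals r * (s_y / s_x)
    intercept = mean_y - slope * mean_x   # line passes through the means
    return intercept, slope

# Illustrative data (invented for this example).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_simple_ols(x, y)             # about 2.2 and 0.6
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

With an intercept in the model, the residuals sum to zero (up to rounding), which is a quick sanity check on any implementation.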
Coefficient of Determination
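$R^2$ is the fraction of variance in y explained by the model: $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res} = \sum (y - \hat{y})^2$ is the residual sum of squares and $SS_{tot} = \sum (y - \bar{y})^2$ is the total sum of squares. For simple linear regression, $R^2 = r^2$. Higher values indicate a closer fit to the observed data, but a high $R^2$ alone does not establish causation or guarantee predictive accuracy on new data.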
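The definition can be checked numerically. The sketch below fits a line to a small invented data set, computes the coefficient of determination from the two sums of squares, and compares it with the squared Pearson correlation; all names are illustrative.

```python
import math

def r_squared(ys, fitted):
    """1 - SS_res / SS_tot for observed ys and model predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Illustrative data (invented for this example).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares line of y on x.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * a for a in x]

r2 = r_squared(y, fitted)              # 1 - 2.4 / 6 = 0.6
r = sxy / math.sqrt(sxx * syy)         # Pearson correlation
```

Here `r2` and `r ** 2` agree up to rounding, as expected for a model with a single predictor and an intercept.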
Inference for Regression
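Under the standard assumptions (linearity, and independent, normally distributed errors with constant variance), the slope estimator $\hat{\beta}_1$ is normally distributed around the true slope $\beta_1$. A t-test assesses whether $\beta_1$ differs from zero, and a confidence interval quantifies the uncertainty about the slope. Prediction intervals quantify the uncertainty in a future observation; they are wider than confidence intervals for the mean response because they also include the error variance of the new observation.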
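The slope test can be sketched as follows. The standard error of the slope is the square root of the residual variance (residual sum of squares divided by n − 2) over the sum of squared x deviations, and the t statistic has n − 2 degrees of freedom. The Python standard library has no t distribution, so this sketch assumes the critical value is looked up in a table and passed in; `slope_inference` and the data are hypothetical.

```python
import math

def slope_inference(xs, ys, t_crit):
    """Slope, its standard error, t statistic, and confidence interval.

    t_crit is the two-sided critical value of the t distribution with
    n - 2 degrees of freedom, supplied by the caller (e.g. from a table).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s2 = ss_res / (n - 2)            # unbiased estimate of error variance
    se = math.sqrt(s2 / sxx)         # standard error of the slope
    t_stat = slope / se              # tests H0: slope = 0
    ci = (slope - t_crit * se, slope + t_crit * se)
    return slope, se, t_stat, ci

# Illustrative data (invented for this example).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

T_CRIT_95_DF3 = 3.182  # table value: 97.5th percentile of t with 3 df
slope, se, t_stat, (ci_low, ci_high) = slope_inference(x, y, T_CRIT_95_DF3)
reject_null = abs(t_stat) > T_CRIT_95_DF3
```

With only five points the interval is wide and contains zero, so the test does not reject a zero slope at the 5% level even though the sample correlation looks fairly strong.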
Multiple Linear Regression
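Multiple linear regression models $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$. In matrix form, the OLS estimator is $\hat{\beta} = (X^T X)^{-1} X^T y$, where $X$ is the design matrix with a leading column of ones, provided $X^T X$ is invertible. Adjusted $R^2$ corrects $R^2$ for the number of predictors, so adding an irrelevant predictor can lower it.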
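The matrix formula amounts to solving the normal equations. The sketch below does this in plain Python with Gaussian elimination rather than forming an explicit inverse; numerical libraries typically use QR or SVD factorizations instead, which are more stable. The function names and the noise-free data are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix copy, so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, ys):
    """rows: one list of predictor values per observation.
    Returns [b0, b1, ..., bk] solving (X^T X) b = X^T y."""
    design = [[1.0] + list(r) for r in rows]   # leading column of ones
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Noise-free illustrative data generated from y = 1 + 2*x1 - 0.5*x2.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
ys = [1 + 2 * a - 0.5 * b for a, b in rows]
beta = fit_ols(rows, ys)                       # close to [1, 2, -0.5]
adj = adjusted_r_squared(0.6, n=5, k=1)        # about 0.467
```

Because the data contain no noise, the fit recovers the generating coefficients up to rounding; with real data the estimates would only approximate them.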
Assumptions and Diagnostics
Residual plots (residuals against fitted values) check linearity and constant variance, and a normal Q-Q plot of the residuals checks normality. The Durbin–Watson test detects first-order autocorrelation in residuals. Multicollinearity (high correlation among predictors) inflates the variance of the coefficient estimates and is detected by variance inflation factors (VIFs).
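Two of these diagnostics reduce to short formulas. The Durbin–Watson statistic is the sum of squared successive residual differences divided by the sum of squared residuals; values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the others; the sketch below covers only the two-predictor case, where that R² is the squared correlation between the two predictors. Names and data are invented for illustration.

```python
import math

def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in time order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r * r)

# Residuals that flip sign every step: negative autocorrelation.
dw_alternating = durbin_watson([1, -1, 1, -1])        # 12 / 4 = 3.0
# Residuals that stay on one side for a while: positive autocorrelation.
dw_persistent = durbin_watson([1, 1, 1, -1, -1, -1])  # 4 / 6, about 0.67

# Two correlated predictors (r = 0.8), illustrative values.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)                      # 1 / 0.36, about 2.78
```

A common rule of thumb treats VIF values above roughly 5 or 10 as a sign of problematic multicollinearity, though the threshold is a convention rather than a formal test.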
Nonlinear and Logistic Regression
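When y is binary, logistic regression models the log-odds of the outcome as a linear function of the predictors. When y is a count, Poisson regression models the log of the expected count as a linear function of the predictors. More general nonlinear models are fitted by iterative methods such as Gauss–Newton or Levenberg–Marquardt.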
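To make the log-odds model concrete, here is a minimal sketch of a one-predictor logistic regression fitted by Newton's method on the log-likelihood. Newton's method is an assumption of this sketch, not something the text prescribes, and the fixed iteration count, the function names, and the data are all illustrative. The sketch also assumes the classes overlap; with perfectly separable data the maximum-likelihood estimate does not exist and the iteration would diverge.

```python
import math

def sigmoid(z):
    """Inverse of the log-odds (logit) transform."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit log-odds = b0 + b1 * x by Newton's method. Returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Negative Hessian, with weights w = p * (1 - p).
        ws = [p * (1 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Illustrative binary outcomes with overlapping classes.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
probs = [sigmoid(b0 + b1 * x) for x in xs]
odds_ratio = math.exp(b1)  # multiplicative change in odds per unit of x
```

At the fitted values the predicted probabilities sum to the number of observed successes, a property of the maximum-likelihood solution when the model includes an intercept.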
Correlation Is Not Causation
Strong correlation does not imply that one variable causes the other. Lurking variables, reverse causation, and selection bias can all produce correlations that do not reflect the causal effect they appear to show. Causal inference requires randomized experimental design or specialized methods such as instrumental variables.
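A small simulation can illustrate the lurking-variable case. In the sketch below, x and y are each generated from a shared variable z plus independent noise, so neither causes the other. The noise level, sample size, seed, and helper names are arbitrary choices made for this example. Removing the linear effect of z from both variables and correlating the residuals (a partial correlation) shows how much association remains once z is accounted for.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residualize(ys, zs):
    """Residuals of ys after a least-squares line on zs is removed."""
    n = len(ys)
    mean_z = sum(zs) / n
    mean_y = sum(ys) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    intercept = mean_y - slope * mean_z
    return [y - (intercept + slope * z) for y, z in zip(ys, zs)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]          # lurking variable
x = [zi + rng.gauss(0, 0.5) for zi in z]         # driven by z only
y = [zi + rng.gauss(0, 0.5) for zi in z]         # driven by z only

r_xy = pearson_r(x, y)                           # strong, despite no x-y link
r_partial = pearson_r(residualize(x, z), residualize(y, z))  # near zero
```

In this setup the theoretical correlation between x and y is 1 / 1.25 = 0.8, entirely due to z, while the partial correlation given z is zero in expectation.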
Summary
Correlation quantifies linear or monotone association; regression models relationships for explanation or prediction. Together they are the workhorses of statistical analysis, machine learning, and applied data science.