Exploratory Data Analysis
Exploratory data analysis (EDA) uses statistical and visual methods to understand a dataset before modelling. It surfaces patterns, relationships between variables, and anomalies such as outliers or missing values.
Descriptive Statistics
Central tendency: mean, median, mode. Dispersion: range, variance, standard deviation, interquartile range (IQR). Shape: skewness (asymmetry) and kurtosis (tail weight).
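As a minimal sketch of these measures, the snippet below computes each one with pandas. The sample values are hypothetical and chosen only to show how a single large value pulls the mean above the median.

```python
import pandas as pd

# Hypothetical sample: mostly small values plus one large value (20).
values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 7, 20], name="value")

# Central tendency
mean = values.mean()
median = values.median()
mode = values.mode().tolist()   # a Series can have more than one mode

# Dispersion
value_range = values.max() - values.min()
variance = values.var()         # sample variance (ddof=1 is the pandas default)
std_dev = values.std()          # sample standard deviation (ddof=1)
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Shape
skewness = values.skew()        # positive values suggest a longer right tail
kurtosis = values.kurt()        # excess kurtosis (about 0 for a normal distribution)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, variance={variance:.2f}, std={std_dev:.2f}, IQR={iqr}")
print(f"skewness={skewness:.2f}, kurtosis={kurtosis:.2f}")
```

The median and IQR are less sensitive to the large value than the mean and standard deviation, which is one reason to report both kinds of measure.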
Data Visualization
Histograms (distribution), box plots (quartiles, outliers), scatter plots (relationships), bar charts (categories), line charts (trends), heatmaps (correlations). Libraries: matplotlib, seaborn, plotly.
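A short matplotlib sketch of three of these chart types follows. The data are synthetic (seeded random numbers), and the non-interactive "Agg" backend is an assumption made so the example runs without a display; drop that line in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless environment; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: x is roughly normal, y depends linearly on x plus noise.
x = rng.normal(loc=50, scale=10, size=300)
y = 0.8 * x + rng.normal(scale=5, size=300)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: shape of the distribution of x
axes[0].hist(x, bins=30)
axes[0].set_title("Histogram")

# Box plot: median, quartiles, and points beyond the whiskers
axes[1].boxplot(x)
axes[1].set_title("Box plot")

# Scatter plot: relationship between x and y
axes[2].scatter(x, y, s=10)
axes[2].set_title("Scatter plot")

fig.tight_layout()
```

Calling fig.savefig with a file name of your choice writes the figure to disk; seaborn and plotly offer higher-level equivalents of the same charts.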
Correlation Analysis
Pearson measures linear association; Spearman is rank-based and measures monotonic association. Both range from -1 to +1. Correlation ≠ causation. Use correlation matrices and pair plots to compare many variables at once.
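To illustrate the difference between the two coefficients, the sketch below builds a relationship that is perfectly monotonic but not linear. The data are made up for the example; DataFrame.corr returns a full correlation matrix for the chosen method.

```python
import numpy as np
import pandas as pd

# Made-up data: y grows exponentially with x, so the relationship is
# monotonic (always increasing) but far from a straight line.
x = np.arange(1, 11)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson_matrix = df.corr(method="pearson")
spearman_matrix = df.corr(method="spearman")

pearson_xy = pearson_matrix.loc["x", "y"]
spearman_xy = spearman_matrix.loc["x", "y"]

print(f"Pearson:  {pearson_xy:.3f}")
print(f"Spearman: {spearman_xy:.3f}")
```

Spearman reaches its maximum because the ranks of x and y agree exactly, while Pearson is lower because the points do not lie on a line.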
Distribution Analysis
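Common distributions: normal, exponential, uniform, Poisson. Check fit with Q-Q plots (visual) and the Shapiro-Wilk test (normality). Knowing the distribution guides model selection, for example whether a method's normality assumption is reasonable.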
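The following sketch applies both checks to two synthetic samples, one drawn from a normal distribution and one from an exponential distribution. It assumes SciPy is available; scipy.stats.probplot returns the Q-Q plot coordinates together with the correlation r of the fitted line, so no plotting is needed to compare the samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples for illustration.
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
exp_sample = rng.exponential(scale=1.0, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
w_norm, p_norm = stats.shapiro(normal_sample)
w_exp, p_exp = stats.shapiro(exp_sample)

# Q-Q data against the normal distribution; r close to 1 means the points
# fall close to a straight line.
(_, _), (_, _, r_norm) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_exp) = stats.probplot(exp_sample, dist="norm")

print(f"normal sample:      W={w_norm:.3f}, p={p_norm:.3g}, Q-Q r={r_norm:.3f}")
print(f"exponential sample: W={w_exp:.3f}, p={p_exp:.3g}, Q-Q r={r_exp:.3f}")
```

A small p-value is evidence against normality; a large one only means the test found no evidence against it, so it is worth reading the Q-Q plot alongside the test.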
Univariate, Bivariate, and Multivariate
Univariate (one variable: histograms, summary statistics), bivariate (two variables: scatter plots, correlation), multivariate (three or more variables: clustering and dimensionality reduction such as PCA).
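As one illustration of the multivariate case, the sketch below runs PCA on synthetic three-feature data. It computes the components with NumPy's singular value decomposition rather than a dedicated library class, which is a choice made here to keep the example dependency-light; the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 3 features. The second feature is almost a
# multiple of the first, the third is independent noise.
n = 200
a = rng.normal(size=n)
X = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.1, size=n),
    rng.normal(size=n),
])

# Standardise each column so no feature dominates because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD: rows of Vt are the principal directions, and the squared
# singular values are proportional to the variance each one explains.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_variance = S**2 / (n - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first two components, e.g. for a 2-D scatter plot.
scores = Z @ Vt[:2].T

print("explained variance ratio:", np.round(explained_ratio, 3))
```

Because two of the three features carry nearly the same information, two components capture almost all of the variance, which is the kind of redundancy PCA is meant to expose.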
Best Practices
Start with shape and types. Check missing values. Visualise distributions. Examine correlations. Look for outliers. Document findings. EDA is iterative: revisit earlier steps as new questions arise.
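The first steps of this checklist can be sketched as a first pass with pandas. The DataFrame, its column names, and its values are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, np.nan, 29, 52, 38, 95],
    "income": [31_000, 48_000, 52_000, 45_000, np.nan, 61_000, 50_000, 49_000],
    "segment": ["a", "b", "b", "a", "c", "b", "a", "c"],
})

# 1. Shape and types
n_rows, n_cols = df.shape
dtypes = df.dtypes

# 2. Missing values per column
missing = df.isna().sum()

# 3. Numeric summary (count, mean, std, min, quartiles, max)
summary = df.describe()

# 4. Correlations between numeric columns
corr = df.select_dtypes("number").corr()

# 5. Outliers in one column using the 1.5 * IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(f"{n_rows} rows, {n_cols} columns")
print(missing)
print(outliers)
```

Flagged rows are candidates for inspection rather than automatic removal; whether a value such as an age of 95 is an error or a genuine observation depends on the domain.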
Summary
EDA reveals data characteristics and relationships through statistics and visualisation, informing all subsequent analysis decisions.