Data Collection and Preprocessing
Data preprocessing is often the most time-consuming phase of a data project (commonly estimated at 60-80% of project time). Raw data must be collected, cleaned, transformed, and prepared for modelling.
Data Sources
Data types: structured (databases, CSV), semi-structured (JSON, XML, logs), unstructured (text, images, audio). Sources: databases, APIs, web scraping, surveys, sensors/IoT, social media, open data (Kaggle, UCI).
Data Cleaning
Missing values: delete, impute (mean/median/mode, KNN). Outliers: detect with IQR/Z-scores, handle by capping/removing/transforming. Duplicates: identify and remove. Inconsistencies: standardise formats, fix typos.
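The cleaning steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended implementation: the function names, the sample `ages` list, and the choice of median imputation with IQR capping (k = 1.5) are assumptions made for the example. In practice a library such as pandas would usually do this work.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def cap_outliers_iqr(values, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the fence values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical sample: one missing value and one implausible age.
ages = [23, 25, None, 24, 26, 120, 25]
filled = impute_median(ages)       # None is replaced by the median
capped = cap_outliers_iqr(filled)  # 120 is pulled down to the upper fence

rows = [("a", 1), ("b", 2), ("a", 1)]
unique_rows = drop_duplicates(rows)
```

Capping keeps the row while limiting its influence. Removing or transforming the outlier are the alternatives, and the right choice depends on whether the extreme value is an error or a genuine observation.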
Data Transformation
Normalisation (Min-Max to [0,1]), standardisation (Z-score), log transformation, one-hot encoding (nominal), label encoding (ordinal), binning (continuous to categorical).
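As a rough sketch of these transformations, again using only the standard library: the helper names and sample values below are hypothetical, and the scaling functions assume the input is not constant (otherwise they would divide by zero). scikit-learn offers equivalent, more robust preprocessing tools.

```python
import math
import statistics

def min_max_scale(values):
    """Rescale values linearly to [0, 1]. Assumes max != min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardise to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def log_transform(values):
    """Apply log(1 + x), which compresses right-skewed, non-negative data."""
    return [math.log1p(v) for v in values]

def one_hot(values):
    """One-hot encode a nominal feature; columns follow sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def label_encode(values, order):
    """Map an ordinal feature to integers using an explicit level order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

scaled = min_max_scale([10, 20, 30])
standardised = z_score([2, 4, 6])
colours = one_hot(["red", "blue", "red"])  # columns: blue, red
sizes = label_encode(["low", "high", "medium"], order=["low", "medium", "high"])
```

The explicit `order` argument matters for ordinal data: it makes the integer codes reflect the real ranking rather than alphabetical order.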
Feature Engineering
Create new features: combine features, extract date components, text features (word count, sentiment), domain-specific features. Good features often matter more than complex models.
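A small sketch of three of these ideas follows: date components, a simple text feature, and a combined feature. The function names and the housing-style `price_per_sqm` ratio are hypothetical and chosen only for illustration.

```python
from datetime import date

def date_features(d):
    """Extract calendar components from a date."""
    return {
        "year": d.year,
        "month": d.month,
        "weekday": d.weekday(),        # Monday = 0, Sunday = 6
        "is_weekend": d.weekday() >= 5,
    }

def word_count(text):
    """Count whitespace-separated tokens as a simple text feature."""
    return len(text.split())

def price_per_sqm(price, area_sqm):
    """Combine two raw features into a ratio (hypothetical domain feature)."""
    return price / area_sqm

features = date_features(date(2024, 3, 16))  # a Saturday
n_words = word_count("great product, fast delivery")
ratio = price_per_sqm(300_000, 75)
```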
Feature Selection
Filter methods (correlation, chi-square), wrapper methods (forward/backward selection), embedded methods (Lasso, tree-based importance). PCA also reduces dimensionality, but it builds new components from the original features rather than selecting a subset of them.
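A filter method can be illustrated with a minimal sketch that keeps the features whose absolute Pearson correlation with the target meets a threshold. The 0.5 threshold, the feature names, and the data are assumptions for the example. Wrapper and embedded methods need a model and are not shown here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def select_by_correlation(features, target, threshold=0.5):
    """Return names of features with |r| >= threshold against the target."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return [name for name, r in scores.items() if abs(r) >= threshold]

# Hypothetical data: "size" tracks the target closely, "noise" only weakly.
target = [1, 2, 3, 4, 5]
features = {
    "size": [2, 4, 6, 8, 10],
    "noise": [3, 1, 4, 1, 5],
}
selected = select_by_correlation(features, target)
```

Correlation only captures linear relationships with a numeric target, which is why filter methods are typically a first pass rather than the final selection step.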
ETL Pipelines
Extract, Transform, Load. Tools: Apache Airflow and Luigi (workflow orchestration), dbt (transformation), AWS Glue and Azure Data Factory (managed cloud services). Automate data processing for consistency.
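The three stages can be sketched as plain functions, assuming an inline CSV string as the source and an in-memory SQLite database as the destination. The `payments` table, column names, and sample rows are hypothetical. The sketch does not use any of the tools listed above; an orchestrator would typically schedule and monitor steps like these.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: stray whitespace and one missing amount.
RAW_CSV = """id,name,amount
1, Alice ,10.5
2,Bob,
3,Carol,7
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount, trim names, cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue
        cleaned.append((int(row["id"]), row["name"].strip(), float(row["amount"])))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM payments").fetchone()
```

Keeping each stage as a separate function makes it straightforward to test the stages in isolation and to rerun a single stage when it fails.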
Summary
Data preprocessing — cleaning, transformation, feature engineering, and pipelines — is critical for data quality. Garbage in, garbage out.