Machine Learning for Data Science

Machine learning (ML) automates pattern discovery in data and turns those patterns into predictions that can be acted on at scale.

Regression

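Linear regression (predicts continuous values), multiple regression (several predictors), polynomial regression (curved relationships). Metrics: MSE, RMSE, MAE, R-squared. Regularisation (Ridge with an L2 penalty, Lasso with an L1 penalty) reduces overfitting by shrinking coefficients.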
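The metrics and the effect of regularisation can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `regression_metrics` and `ridge_slope` and all data values are made up for the example, and the ridge formula covers only the simplest case of one feature with no intercept.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                       # undefined if y_true is constant
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

def ridge_slope(x, y, alpha):
    """Slope of a one-feature, no-intercept ridge fit.

    Minimises sum((y - w*x)^2) + alpha * w^2, which gives
    w = sum(x*y) / (sum(x^2) + alpha).
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

# Illustrative values only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(ridge_slope(x, y, alpha=0.0))    # no penalty: ordinary least-squares slope
print(ridge_slope(x, y, alpha=14.0))   # the penalty shrinks the slope toward zero
```

With `alpha=0.0` the slope equals the least-squares slope; increasing `alpha` pulls it toward zero, which is the shrinkage that Ridge relies on.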

Classification

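Logistic regression, decision trees, random forests, support vector machines (SVM), Naive Bayes (good for text). Metrics: accuracy, precision, recall, F1-score, ROC-AUC. The confusion matrix tabulates the true/false positives and negatives behind the first four.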
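As a rough sketch of how the threshold-based metrics fall out of the confusion matrix, the following computes them by hand for a binary problem. The helper names and the labels are hypothetical; the data is deliberately imbalanced to show accuracy looking better than recall.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Fall back to 0.0 when a denominator is empty (a convention, not a rule)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative, imbalanced labels: two positives out of ten
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
report = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred))
print(report)
```

Here accuracy is high because negatives dominate, while recall shows that half of the positives were missed.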

Clustering

K-means (elbow method to choose k), hierarchical clustering, DBSCAN (density-based, labels low-density points as noise). Evaluation: silhouette score. Used for customer segmentation and anomaly detection.
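The elbow method and the silhouette score can be illustrated with a small hand-rolled K-means (Lloyd's algorithm). This is a sketch under stated assumptions: the initialisation simply picks evenly spaced points from the input, which is a simplification chosen for reproducibility rather than a recommended scheme, and the six points are invented. It needs Python 3.8+ for `math.dist`.

```python
import math

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; returns (labels, centers, inertia)."""
    n = len(points)
    # Assumption: deterministic init from evenly spaced input points
    centers = [points[i * n // k] for i in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    inertia = sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
    return labels, centers, inertia

def silhouette_score(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated groups (illustrative values)
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
inertias = {k: kmeans(points, k)[2] for k in (1, 2, 3)}
labels, centers, _ = kmeans(points, 2)
score = silhouette_score(points, labels)
print(inertias)
print(labels, score)
```

The inertia drops sharply from k=1 to k=2 and only slightly from k=2 to k=3; that bend is the "elbow" suggesting k=2 for this data.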

Model Selection

Train/test split (commonly 80/20), k-fold cross-validation, hyperparameter tuning (grid search, random search). Bias-variance trade-off: overly simple models underfit (high bias), overly complex models overfit (high variance).
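A minimal sketch of k-fold cross-validation driving a grid search, written without external libraries. The model is a one-dimensional k-nearest-neighbour regressor chosen only because its single hyperparameter makes the trade-off visible (few neighbours: flexible, many neighbours: smooth). All names, the synthetic data, and the grid values are assumptions for illustration.

```python
import random

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def knn_predict(train_x, train_y, x, n_neighbors):
    """Average the targets of the n_neighbors closest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_mse(xs, ys, n_neighbors, folds=5):
    """Mean of the per-fold test MSE values."""
    fold_errors = []
    for train, test in k_fold_indices(len(xs), folds):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        sq = [(ys[i] - knn_predict(tx, ty, xs[i], n_neighbors)) ** 2
              for i in test]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / len(fold_errors)

# Synthetic data: a noisy line. The x values are drawn in random order,
# so contiguous folds are acceptable here; ordered data should be
# shuffled before splitting.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

# Grid search: evaluate every candidate and keep the lowest CV error
grid = [1, 3, 5, 15]
scores = {k: cv_mse(xs, ys, k) for k in grid}
best_k = min(scores, key=scores.get)
print(scores)
print("best n_neighbors:", best_k)
```

Random search would differ only in how the candidate values are produced: sampled from a range instead of enumerated from a fixed grid.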

Ensemble Methods

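Bagging (Random Forest) averages models trained on bootstrap samples; boosting (XGBoost, LightGBM, AdaBoost) fits models sequentially, each correcting the errors of the previous ones; stacking trains a meta-model on the base models' predictions. Ensembles typically outperform their individual base models.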
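Of the three approaches, bagging is the simplest to sketch by hand. The example below bags one-dimensional decision stumps and combines them by majority vote. It is an illustration under assumptions, not how Random Forest is implemented: the stump learner, function names, and data are all invented for the example, and labels are assumed to be 0 or 1.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Return (threshold, left_label, right_label) with the fewest training errors."""
    majority = Counter(ys).most_common(1)[0][0]
    # Fallback: constant prediction (every x is <= infinity)
    best = (float("inf"), majority, majority)
    best_errors = sum(y != majority for y in ys)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighbouring values
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y
                         for x, y in zip(xs, ys))
            if errors < best_errors:
                best, best_errors = (t, left, right), errors
    return best

def predict_stump(stump, x):
    t, left, right = stump
    return left if x <= t else right

def fit_bagging(xs, ys, n_estimators, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagging(stumps, x):
    """Majority vote across the ensemble."""
    votes = Counter(predict_stump(s, x) for s in stumps)
    return votes.most_common(1)[0][0]

# Illustrative, linearly separable data
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

rng = random.Random(0)
stumps = fit_bagging(xs, ys, n_estimators=25, rng=rng)  # odd count avoids ties
print(sorted({s[0] for s in stumps}))  # thresholds vary between bootstrap samples
print(predict_bagging(stumps, 0.0), predict_bagging(stumps, 10.0))
```

Each stump sees a slightly different sample and so picks a slightly different threshold; voting averages out that variation, which is the variance reduction bagging aims for.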

Feature Importance

Tree-based importance, SHAP values, partial dependence plots. These support model interpretation and feature selection.
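The values behind a partial dependence plot can be computed for any fitted model: fix one feature at a grid value in every row, predict, and average. A minimal sketch follows, where `toy_model` is a hypothetical stand-in for a fitted model and the rows are invented.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction with `feature` forced to each grid value in turn."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the original data is untouched
            modified[feature] = value  # override the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    """Hypothetical fitted model: linear in feature 0, quadratic in feature 1."""
    return 2.0 * row[0] + row[1] ** 2

# Illustrative data: three rows, two features
rows = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
grid = [0.0, 1.0, 2.0]

pd_feature0 = partial_dependence(toy_model, rows, feature=0, grid=grid)
pd_feature1 = partial_dependence(toy_model, rows, feature=1, grid=grid)
print(pd_feature0)  # rises linearly with the grid value
print(pd_feature1)  # rises quadratically with the grid value
```

Plotting each curve against the grid gives the partial dependence plot; a flat curve suggests the model barely uses that feature.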

Summary

ML combines regression, classification, and clustering with sound evaluation, ensemble methods, and feature-importance analysis to produce reliable, actionable results.
