Machine Learning for Data Science
Machine learning (ML) automates pattern discovery in data and turns those patterns into actionable predictions at scale.
Regression
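Linear regression (predicts continuous values), multiple regression (several predictors), polynomial regression (non-linear trends). Metrics: MSE, RMSE, MAE, R-squared. Regularisation (Ridge with an L2 penalty, Lasso with an L1 penalty) reduces overfitting by shrinking coefficients.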
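As a minimal sketch of the ideas above, the following fits a one-feature line with an optional Ridge penalty and computes the four metrics by hand. The function names (fit_line, regression_metrics), the alpha parameter, and the data are illustrative assumptions; a real project would more likely use an established library.

```python
import math

def fit_line(x, y, alpha=0.0):
    """Fit y = intercept + slope * x; alpha > 0 adds a Ridge (L2) penalty on the slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / (s_xx + alpha)        # alpha = 0 gives ordinary least squares
    intercept = mean_y - slope * mean_x  # the intercept is left unpenalised
    return intercept, slope

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # zero if y_true is constant
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

# Made-up data lying exactly on y = 1 + 2x, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

ols_intercept, ols_slope = fit_line(x, y)
ridge_intercept, ridge_slope = fit_line(x, y, alpha=5.0)
ridge_pred = [ridge_intercept + ridge_slope * xi for xi in x]
ridge_metrics = regression_metrics(y, ridge_pred)
```

On this toy data the penalty shrinks the slope from 2.0 to 1.0, and the training R-squared drops to 0.75. Regularisation deliberately gives up some training fit in exchange for a simpler model.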
Classification
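Logistic regression, decision trees, random forests, support vector machines (SVM), Naive Bayes (good for text). Metrics: accuracy, precision, recall, F1-score, ROC-AUC. The confusion matrix shows the counts of true/false positives and negatives from which most of these are derived.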
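To make the relationship between the confusion matrix and the metrics concrete, here is a small sketch for the binary case. The function name and the label lists are hypothetical and chosen only for illustration; labels are assumed to be 0 (negative) and 1 (positive).

```python
def binary_classification_metrics(y_true, y_pred):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
report = binary_classification_metrics(y_true, y_pred)
```

Here accuracy is 0.8 while precision, recall, and F1 are each 0.75, which shows how accuracy alone can look better than the model's performance on the positive class.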
Clustering
K-means (elbow method to choose k), hierarchical clustering, DBSCAN (density-based, handles noise). Evaluation: silhouette score. Typical uses: customer segmentation, anomaly detection.
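The sketch below shows one way K-means could be written from scratch, using Lloyd's algorithm, and returns the inertia (within-cluster sum of squared distances) that the elbow method compares across values of k. The initial centroids are passed in explicitly so the run is deterministic. That is a simplifying assumption, as are the function name and the toy points.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(
        math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels)
    )
    return labels, centroids, inertia

# Two visually obvious groups; values are made up for illustration.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

labels_k2, centroids_k2, inertia_k2 = k_means(points, [points[0], points[3]])
labels_k1, centroids_k1, inertia_k1 = k_means(points, [points[0]])
```

Moving from k = 1 to k = 2 cuts the inertia sharply on this data. The elbow method looks for the k after which such drops level off.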
Model Selection
Train/test split (commonly 80/20), k-fold cross-validation, hyperparameter tuning (grid search, random search). Bias-variance trade-off: overly simple models underfit (high bias), overly complex models overfit (high variance).
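K-fold cross-validation can be sketched as an index-splitting routine: every sample lands in the test fold exactly once and in the training set the other k - 1 times. The helper name k_fold_indices is hypothetical, and the folds are contiguous with no shuffling, so data is assumed to be shuffled beforehand if its order carries meaning.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

# With 10 samples and 5 folds, each round is an 80/20 train/test split.
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    # A model would be fitted on train_idx and scored on test_idx here;
    # the k scores are then averaged into one cross-validated estimate.
    pass
```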
Ensemble Methods
Bagging (Random Forest), boosting (XGBoost, LightGBM, AdaBoost), stacking. Ensembles often outperform individual models, most clearly when the base models make different errors.
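The following sketch illustrates majority voting, the aggregation step that bagging-style classifiers use to combine their members. It does not train anything. The three prediction lists are made up and stand in for models fitted on different bootstrap samples, arranged so that no two models are wrong on the same example.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label per sample."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical labels and predictions, for illustration only.
y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [0, 1, 1, 1, 0, 1, 0, 0]  # wrong on samples 0 and 1
model_b = [1, 0, 0, 0, 0, 1, 0, 0]  # wrong on samples 2 and 3
model_c = [1, 0, 1, 1, 1, 0, 0, 0]  # wrong on samples 4 and 5

ensemble = majority_vote([model_a, model_b, model_c])
individual_scores = [accuracy(y_true, m) for m in (model_a, model_b, model_c)]
ensemble_score = accuracy(y_true, ensemble)
```

Each model scores 0.75 alone, yet the vote is correct on every sample because the errors never overlap. If the models made the same mistakes, voting would not help.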
Feature Importance
Tree-based importance, SHAP values, partial dependence plots. These support model interpretation and feature selection.
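The values behind a partial dependence plot can be computed with a short loop: fix one feature to each grid value for every row, predict, and average. In this sketch toy_model is a hypothetical stand-in for a fitted model (any callable that maps a feature row to a number would do), and the rows and grid are made up.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # copy so the original data is untouched
            modified[feature_index] = value  # override the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_model(row):
    # Hypothetical model: linear in feature 0, quadratic in feature 1.
    return 2.0 * row[0] + row[1] ** 2

rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = [0.0, 1.0, 2.0]
pd_curve = partial_dependence(toy_model, rows, feature_index=0, grid=grid)
```

The resulting curve rises by 2.0 per unit of feature 0, recovering that feature's average effect while the other feature keeps its observed values.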
Summary
ML combines regression, classification, and clustering with sound evaluation, model selection, and ensemble methods to produce reliable, actionable results.