Ensemble Methods
Code, images, and data splits: Rishekesan3012/ML_WR
Ensemble learning combines multiple machine learning models to improve predictive performance, increase robustness, and reduce overfitting. Common ensemble methods include:
Bagging (Bootstrap Aggregating):
Trains multiple models on different random subsets of the data and averages their predictions to reduce variance.
Boosting:
Builds models sequentially, where each new model focuses on correcting the errors of the previous ones. This can reduce both bias and variance.
Random Forest:
A type of bagging that uses many decision trees and averages their predictions. It also adds randomness in feature selection for each split.
AdaBoost:
A boosting method that adjusts the weights of incorrectly classified instances so that subsequent models focus more on difficult cases.
Voting:
Combines the predictions of multiple different models (e.g., SVM, Random Forest, Logistic Regression) by majority or average vote.
Stacking:
Trains multiple models and then uses another model (meta-learner) to combine their outputs.
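As a quick illustration of the voting approach described above, scikit-learn's VotingClassifier can combine heterogeneous models such as SVM, Random Forest, and Logistic Regression by majority vote. This is a minimal sketch on synthetic data, not the project's actual pipeline; model choices and parameters here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy binary classification problem (stand-in for real data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hard voting: each fitted model casts one vote and the majority wins.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print(f"Voting accuracy: {ensemble.score(X_test, y_test):.3f}")
```

Soft voting (`voting="soft"`) averages predicted probabilities instead, which can help when the base models are well calibrated.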
Methods Used in Our Analysis
1. Random Forest
Random Forest is an ensemble of decision trees, each trained on a random subset of the data and features. The final prediction is made by averaging (regression) or majority vote (classification).
Strengths: Handles high-dimensional data, reduces overfitting, provides feature importance.
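A minimal Random Forest sketch on synthetic data, showing the `feature_importances_` attribute that the feature-importance analysis below relies on (dataset and hyperparameters here are illustrative, not the project's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the delivery dataset.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

rf = RandomForestClassifier(n_estimators=200, random_state=1)
rf.fit(X_train, y_train)
print("accuracy:", round(rf.score(X_test, y_test), 3))

# Impurity-based importances (sum to 1.0), used to rank predictors.
for i in np.argsort(rf.feature_importances_)[::-1][:3]:
    print(f"feature {i}: {rf.feature_importances_[i]:.3f}")
```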
2. AdaBoost
AdaBoost builds a sequence of weak learners (usually shallow trees), each focusing more on the errors of the previous ones.
Strengths: Often improves performance on difficult datasets, robust to overfitting if not too many estimators are used.
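The sequential re-weighting idea can be sketched with scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree (a stump). Again a hedged illustration on synthetic data, not the project's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=2
)

# Each boosting round up-weights the samples the previous stumps
# misclassified, so later learners concentrate on the hard cases.
ada = AdaBoostClassifier(n_estimators=100, random_state=2)
ada.fit(X_train, y_train)
print("accuracy:", round(ada.score(X_test, y_test), 3))
```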
Data Preparation
Data Source and Cleaning
Dataset: master_delivery_data.csv containing restaurant and delivery information.
Sampling: 10,000 rows randomly sampled for analysis.
Feature Engineering:
- Categorical features (restaurant_name, city, cuisine) encoded using LabelEncoder.
- Missing values in numeric features imputed using SimpleImputer (mean strategy).
- Target Variable: has_delivery (whether a restaurant offers delivery).
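The encoding and imputation steps above can be sketched as follows. The tiny DataFrame is a hypothetical stand-in; the real features come from master_delivery_data.csv:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Illustrative rows only; column names mirror the report's features.
df = pd.DataFrame({
    "restaurant_name": ["A", "B", "C", "A"],
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "cuisine": ["indian", "cafe", "indian", "pizza"],
    "rating": [4.1, np.nan, 3.8, 4.5],
    "has_delivery": [1, 0, 0, 1],
})

# Label-encode each categorical column in place.
for col in ["restaurant_name", "city", "cuisine"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Mean-impute missing numeric values.
df[["rating"]] = SimpleImputer(strategy="mean").fit_transform(df[["rating"]])

X = df.drop(columns="has_delivery")
y = df["has_delivery"]
print(df)
```

One caveat worth noting: LabelEncoder imposes an arbitrary ordering on categories, which tree-based models like Random Forest tolerate better than linear models do.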
Splitting and Balancing
Train/Test Split:
Data split into 80% training and 20% testing using train_test_split with stratification to preserve class distribution.
Class Imbalance:
Only ~11% of restaurants offer delivery.
SMOTE (Synthetic Minority Over-sampling Technique) was applied to the training set to balance classes.
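The project used SMOTE from the imbalanced-learn library; the sketch below reproduces its core idea (synthesizing minority points by interpolating between a minority sample and one of its nearest minority neighbors) with scikit-learn only, on synthetic data with roughly the same ~10% positive rate. The `smote_like` helper is a simplified illustration, not the library implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Imbalanced toy data, mirroring the ~11% delivery rate.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)

def smote_like(X, y, minority=1, k=5, rng=np.random.default_rng(3)):
    """Minimal SMOTE: interpolate between a minority sample and one
    of its k nearest minority-class neighbors until classes balance."""
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), n_new)
    nbr = idx[base, rng.integers(1, k + 1, n_new)]  # column 0 is self
    gap = rng.random((n_new, 1))
    X_syn = X_min[base] + gap * (X_min[nbr] - X_min[base])
    return np.vstack([X, X_syn]), np.concatenate([y, np.full(n_new, minority)])

X_bal, y_bal = smote_like(X_train, y_train)
print("class counts after balancing:", np.bincount(y_bal))
```

Balancing is applied to the training set only, as in the report; the test set keeps its natural class distribution so evaluation reflects real-world conditions.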
Random Forest Results:
Feature Importance:
The most important feature was has_takeaway, followed by longitude, latitude, restaurant_name, and cuisine.
Performance:
Random Forest achieved high accuracy and strong recall for both classes, especially after SMOTE balancing. The model is particularly strong at identifying restaurants that do not offer delivery, and it also performs well on the minority class, correctly identifying most delivery-offering restaurants.
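Per-class precision and recall figures like those above come from scikit-learn's classification report. The snippet below shows the evaluation pattern on a synthetic imbalanced set; the printed numbers are illustrative and are not the results reported here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~10% positives) as a stand-in.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=4
)

rf = RandomForestClassifier(n_estimators=200, random_state=4).fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Per-class precision/recall/F1; the minority class (1) is the one to watch.
print(classification_report(y_test, y_pred, digits=3))
print("minority recall:", round(recall_score(y_test, y_pred, pos_label=1), 3))
```

On imbalanced data, overall accuracy alone is misleading (a model predicting "no delivery" everywhere would score ~89%), which is why the report emphasizes minority-class recall.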
AdaBoost Results:
Performance:
AdaBoost achieved a recall of 81% for the minority class, correctly identifying most restaurants offering delivery. Precision was 69%, and overall accuracy was 93.3%. AdaBoost produced more false positives than Random Forest but slightly higher recall for delivery restaurants.
Comparison
Random Forest had higher overall accuracy and precision for the minority class, while AdaBoost achieved slightly higher recall for delivery restaurants.
Both ensemble methods significantly outperform single classifiers, especially for the minority class (has_delivery = 1.0).
Feature importance analysis revealed that has_takeaway, location, and cuisine are the most influential predictors of delivery availability.
(e) Conclusions
Ensemble methods such as Random Forest and AdaBoost, especially when combined with SMOTE for class balancing, significantly improve model performance on imbalanced datasets.
Both models were able to detect restaurants offering delivery with much higher recall and precision than single classifiers, making them valuable for real-world applications.
The most important features for predicting delivery were has_takeaway, location, and cuisine, suggesting that both restaurant characteristics and geographic factors play a key role.
Practical Implication:
These results can help food delivery platforms or restaurant aggregators better identify and target restaurants likely to offer delivery, improving operational planning and user experience. The ensemble approach is robust, generalizes well, and is recommended for similar predictive tasks in the food service domain.