Regression



In our project, we applied two types of regression models: Linear Regression and Logistic Regression. These models are often used to predict continuous values (Linear Regression) or binary outcomes (Logistic Regression), but both can be used to analyze the relationship between input features and the target variable.

  • Linear Regression is normally used to predict continuous outcomes; in the context of our dataset, its role was mainly to test how a simple regression model behaves when applied to a binary classification task.

  • Logistic Regression is more appropriate for binary classification tasks and was primarily used to predict whether a restaurant offers delivery services (1 for delivery and 0 for no delivery). Logistic Regression predicts probabilities, which are then used to classify observations into one of two categories.
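The thresholding step described above can be sketched in a few lines. This is a toy illustration with made-up probabilities, not model output from our dataset:

```python
import numpy as np

# Hypothetical predicted probabilities from a logistic regression model
probs = np.array([0.12, 0.55, 0.91, 0.33])

# Observations are assigned to class 1 (delivery) when the predicted
# probability reaches the default 0.5 threshold, else class 0 (no delivery)
preds = (probs >= 0.5).astype(int)
print(preds.tolist())  # [0, 1, 1, 0]
```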



Data Preparation for Regression Models

For both Linear Regression and Logistic Regression, we used the same dataset with features like temperature_F, precipitation, humidity, wind_speed, and cuisine_encoded. The target variable was has_delivery, where 1 indicates that the restaurant offers delivery, and 0 indicates that it does not.

We followed these steps for data preparation:

  1. Handling Missing Values: Any missing data in the target variable (has_delivery) or features was imputed or removed as necessary.

  2. Feature Encoding: The cuisine column was encoded into numeric values using the category data type to make it compatible with the models.

  3. Train-Test Split: The data was split into training and testing sets (75% for training and 25% for testing) using the train_test_split() function from sklearn. This ensured that the models would be trained on one subset of the data and evaluated on another, minimizing the risk of overfitting.



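The three preparation steps can be sketched as follows. The miniature DataFrame below is hypothetical (the column names follow the report, but the values are invented), and missing values are simply dropped here as one of the options mentioned in step 1:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature version of the dataset; column names follow the report
df = pd.DataFrame({
    "temperature_F": [55.0, 72.0, None, 40.0, 65.0, 80.0, 50.0, 60.0],
    "precipitation": [0.1, 0.0, 0.3, 0.0, 0.2, 0.0, 0.5, 0.0],
    "humidity": [60, 40, 80, 55, 70, 35, 90, 50],
    "wind_speed": [5, 10, 15, 3, 8, 12, 20, 6],
    "cuisine": ["thai", "pizza", "sushi", "thai", "pizza", "sushi", "thai", "pizza"],
    "has_delivery": [1, 0, 1, 0, 1, 0, 1, 0],
})

# 1. Handle missing values (here: drop any row containing a NaN)
df = df.dropna()

# 2. Encode the cuisine column numerically via the category dtype
df["cuisine_encoded"] = df["cuisine"].astype("category").cat.codes

# 3. 75/25 train-test split
features = ["temperature_F", "precipitation", "humidity",
            "wind_speed", "cuisine_encoded"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["has_delivery"], test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))
```

The `random_state` argument fixes the shuffle so the split is reproducible between runs.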









Logistic Regression Model (Baseline)

For Logistic Regression, the model was applied to predict the probability of a restaurant offering delivery (has_delivery).
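A minimal sketch of this step is shown below. The features and labels are randomly generated stand-ins (mimicking five weather/cuisine features and an imbalanced target), not our actual data, so the printed numbers will not match the report's results:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)

# Hypothetical stand-in features and an imbalanced binary target
X = rng.normal(size=(400, 5))
y = (rng.random(400) < 0.12).astype(int)  # ~12% of restaurants deliver

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Fit a plain (untuned) logistic regression classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(round(accuracy_score(y_test, pred), 2))
print(confusion_matrix(y_test, pred))
```

With a heavily imbalanced target, a classifier like this tends to favor the majority class, which is why accuracy alone can look deceptively high.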



Logistic Regression performs better than Linear Regression for binary classification tasks, achieving an accuracy of 88.26%. However, this figure is misleading: the confusion matrix shows that the model is biased towards predicting the majority class (no delivery) and never predicts the minority class (delivery) at all, indicating that it is struggling with the class imbalance.


Handling Class Imbalance and Tuning Logistic Regression

To improve Logistic Regression's performance, especially in predicting the minority class (restaurants offering delivery), we applied upsampling to balance the classes and tuned the model.
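One common way to upsample is with `sklearn.utils.resample`, sampling the minority class with replacement until it matches the majority class. The tiny frame below is hypothetical, and `C`/`max_iter` are shown only as examples of tuning knobs (the report does not specify which hyperparameters were tuned):

```python
import pandas as pd
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced training frame (9 "no delivery" vs 3 "delivery")
train = pd.DataFrame({
    "feature": range(12),
    "has_delivery": [0] * 9 + [1] * 3,
})

majority = train[train["has_delivery"] == 0]
minority = train[train["has_delivery"] == 1]

# Upsample the minority class (with replacement) to the majority size
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_up])
print(balanced["has_delivery"].value_counts().to_dict())

# Refit on the balanced data; C and max_iter are example tuning knobs
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(balanced[["feature"]], balanced["has_delivery"])
```

Upsampling is applied only to the training split; the test set is left untouched so that evaluation reflects the real class distribution.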




After balancing the dataset using upsampling and tuning the Logistic Regression model, overall accuracy dropped to 60.94%. However, the confusion matrix shows markedly improved predictions for the minority class, indicating that the class imbalance issue has been mitigated to some extent; the model now trades raw accuracy for a more balanced treatment of both classes.







Conclusion and Discussion


After evaluating several machine learning algorithms—including Decision Tree, Multinomial Naive Bayes, Bernoulli Naive Bayes, Gaussian Naive Bayes, and Logistic Regression—we gained valuable insights into what influences food delivery availability and which modeling approaches are most effective for this task.





Visualization and Interpretation


The confusion matrices and accuracy scores show that Decision Tree and Logistic Regression models performed the best, with higher accuracy and a more balanced distribution of true positives and true negatives. These models were able to capture the relationships between weather, restaurant features, and delivery availability. In contrast, Multinomial Naive Bayes had lower performance, likely because it is designed for count-based features and our data consists mainly of continuous and categorical variables.


What Did We Learn?


Feature Importance: Weather factors such as precipitation, temperature, and wind speed, along with cuisine type, significantly impact whether a restaurant offers delivery.


Model Suitability: Models that handle continuous and categorical features (like Decision Tree and Logistic Regression) are better suited for this dataset than models like Multinomial NB, which expect count data.


Practical Implications: Both environmental and restaurant-specific factors should be considered by businesses and delivery platforms when planning or optimizing delivery services.


Why Logistic Regression Outperformed Multinomial NB:


Logistic Regression is designed for continuous and categorical features, matching our dataset structure. Multinomial NB, intended for count data (like word frequencies), is less effective when applied to weather and restaurant features that are not counts.

Why Decision Tree Works Well:

Decision Trees can model complex, non-linear relationships and interactions between features, which are likely present in our data.


In summary, we systematically compared multiple classification algorithms—including Decision Tree, Multinomial Naive Bayes, Bernoulli Naive Bayes, Gaussian Naive Bayes, and Logistic Regression—on the task of predicting food delivery availability using restaurant and weather data. By evaluating each model with accuracy, precision, recall, F1 score, and confusion matrices, we observed that Decision Tree and Logistic Regression consistently outperformed the Naive Bayes variants. Their confusion matrices showed a more balanced classification of both delivery and non-delivery cases, with higher true positive and true negative rates. In contrast, Multinomial Naive Bayes, which is designed for count-based features, underperformed due to the continuous and categorical nature of our dataset, resulting in more misclassifications and lower metric scores across the board.


Overall, the summary table of metrics clearly demonstrates that models capable of handling mixed data types and capturing complex relationships—such as Decision Tree and Logistic Regression—are best suited for this prediction task. The confusion matrix analysis further revealed that these models maintain a good balance between sensitivity (recall) and precision, minimizing both false positives and false negatives. This comprehensive evaluation confirms that careful model selection and thorough metric-based comparison are crucial for achieving reliable predictions in real-world, heterogeneous datasets like ours.