Principal Component Analysis of a Food Delivery Dataset

Overview of Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset of potentially correlated variables into a set of linearly uncorrelated variables called principal components. Each component is a direction of maximum remaining variance in the data: the first component captures the most variance, the second captures the most variance orthogonal to the first, and so on. By keeping only the leading components, PCA reduces data complexity while retaining most of the original information, which makes high-dimensional data easier to visualize and helps identify the most influential variables.
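The ordering property described above can be seen on a tiny synthetic example (the data below is hypothetical, not the delivery dataset): two strongly correlated variables collapse almost entirely onto the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: two correlated variables, so nearly all of the
# variance lies along a single direction in the 2-D plane.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

pca = PCA(n_components=2)
pca.fit(X)

# Components are ordered by explained variance: PC1 captures the most,
# PC2 the most of what remains (orthogonal to PC1).
print(pca.explained_variance_ratio_)
```

Because the two columns are nearly collinear, the first ratio printed is close to 1, illustrating why correlated features compress well under PCA.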

Data Preparation and Implementation

The analysis began by loading the food delivery dataset, which combines restaurant attributes with weather metrics. Data cleaning revealed missing values in the 'has_delivery' (4,047 missing) and 'has_takeaway' (5,560 missing) columns, which were filled by mean imputation. Nine numeric features were selected for PCA: latitude, longitude, has_delivery, has_takeaway, has_opening_hours, temperature_F, precipitation, humidity, and wind_speed. These features were standardized with StandardScaler so that each variable contributed equally to the analysis regardless of its original scale.
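The preparation steps above can be sketched as follows. The nine feature names come from the report; the DataFrame here is a small synthetic stand-in for the real dataset, since the original file is not available.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Feature list as given in the report.
features = ["latitude", "longitude", "has_delivery", "has_takeaway",
            "has_opening_hours", "temperature_F", "precipitation",
            "humidity", "wind_speed"]

# Synthetic stand-in data with missing values in the two columns the
# report flags (the real dataset would be loaded from file instead).
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 9)), columns=features)
df.loc[:10, "has_delivery"] = np.nan
df.loc[:15, "has_takeaway"] = np.nan

# Mean imputation for the missing entries, as in the report.
X = SimpleImputer(strategy="mean").fit_transform(df[features])

# Standardize so every feature has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
```

After this step each column of `X_scaled` contributes on the same scale, which is what lets PCA compare variance across features fairly.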

PCA Results and Visualizations

With 2 components, the PCA preserved 55.39% of the original variance: the first principal component alone explained 35.93%, and the second an additional 19.46%. Extending to 3 components raised the total variance explained to 66.19%, with the third component contributing a further 10.80%. The visualizations showed the data points distributed in the reduced-dimensional space, with patterns emerging that were not immediately apparent in the original nine-dimensional data.
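A minimal sketch of this step, assuming a standardized feature matrix (here a synthetic placeholder rather than the real data): fit PCA at each component count, read off the cumulative explained variance, and project to 2-D for the scatter plots.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized nine-feature matrix.
rng = np.random.default_rng(2)
X = StandardScaler().fit_transform(rng.normal(size=(500, 9)))

# Total variance retained at 2 and at 3 components.
ratios = {}
for k in (2, 3):
    ratios[k] = PCA(n_components=k).fit(X).explained_variance_ratio_.sum()
    print(f"{k} components retain {ratios[k]:.2%} of the variance")

# 2-D scores used for the reduced-dimensional scatter visualization.
scores = PCA(n_components=2).fit_transform(X)
```

On the real dataset these sums would come out to 55.39% and 66.19% respectively, per the figures reported above.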

Feature Importance Analysis

The PCA also revealed which features contribute most to the data's variance. In the first principal component, geographical factors (latitude: 0.4936, longitude: 0.4791) and temperature (0.4693) carried the highest loadings, suggesting that these features explain the most variance in the dataset. The second principal component was dominated by service-related features, with has_takeaway (0.6057) and has_delivery (0.5911) having the strongest influence. After geography and temperature, then, the availability of delivery and takeaway services is the next most important source of variation in the data.
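The loadings quoted above come from the fitted model's component vectors. A sketch of how to extract them, using the report's feature names but a synthetic placeholder matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Feature names from the report; the data is a synthetic placeholder.
features = ["latitude", "longitude", "has_delivery", "has_takeaway",
            "has_opening_hours", "temperature_F", "precipitation",
            "humidity", "wind_speed"]
rng = np.random.default_rng(3)
X = rng.normal(size=(300, len(features)))

pca = PCA(n_components=2).fit(X)

# Each row of components_ holds one component's loadings; the entries
# with the largest absolute value mark the most influential features.
for i, row in enumerate(pca.components_):
    top = features[np.argmax(np.abs(row))]
    print(f"PC{i + 1}: strongest loading on {top}")
```

The sign of a loading only fixes the component's orientation; it is the absolute magnitude that ranks feature influence, which is why the comparison uses `np.abs`.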

Variance Retention and Dimensionality

To retain at least 95% of the original variance, substantially more components are needed than for visualization: since the first component explains only 35.93% and the first three together explain 66.19%, with each later component contributing at most the third's 10.80%, at least six of the nine components must be kept. The top eigenvalues were 3.23 and 1.75 for the first and second components respectively, confirming the first component's dominance. Even so, reducing nine features to two or three components while retaining over half to two-thirds of the variance makes visualization and subsequent analysis far more tractable at a modest cost in information.
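The component count for any variance target can be read off the cumulative explained-variance curve, or requested from scikit-learn directly by passing a fraction as `n_components`. A sketch on synthetic stand-in data (built from two latent factors, so it compresses far more than the real dataset would):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: nine columns driven by two latent factors plus a
# little noise, so very few components cover most of the variance.
rng = np.random.default_rng(4)
base = rng.normal(size=(400, 2))
X = base @ rng.normal(size=(2, 9)) + 0.05 * rng.normal(size=(400, 9))

# Fit all components and find the first index where the cumulative
# explained-variance ratio reaches the 95% target.
pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cum, 0.95) + 1)
print("components needed for 95% variance:", n_95)

# Equivalently, PCA accepts the variance target directly.
pca95 = PCA(n_components=0.95).fit(X)
```

The same recipe on the real standardized feature matrix gives the count discussed above; only the data changes.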