Clustering Analysis

Comparison of Clustering Algorithms

K-means, hierarchical clustering, and DBSCAN represent three fundamentally different approaches to clustering, each with distinct strengths and limitations. K-means partitions data into a predetermined number of clusters, assigning each observation to the cluster with the nearest mean; it scales efficiently to large datasets but assumes roughly spherical clusters of similar size and is sensitive to outliers. Hierarchical clustering builds a tree of clusters without requiring a pre-specified cluster count, providing a multi-level view of relationships at a higher computational cost. DBSCAN forms clusters from regions of high density, determining the number of clusters automatically and flagging isolated points as noise; it can discover arbitrarily shaped clusters but requires careful parameter tuning and struggles with clusters of varying density.
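
The contrast is easy to see on a small synthetic example. The sketch below is a minimal illustration using scikit-learn's make_moons (not the report's dataset), running all three algorithms on the same non-spherical data:

    from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler

    # Two interleaved half-moons: non-spherical clusters that violate the
    # K-means assumptions but suit a density-based method.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
    X = StandardScaler().fit_transform(X)

    # K-means forces two centroid-based partitions, splitting each moon.
    km = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

    # Ward-linkage hierarchical clustering, cut at two clusters.
    hc = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

    # DBSCAN follows density, tracing each moon; noise is labeled -1.
    db = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

    for name, labels in [("k-means", km), ("ward", hc), ("dbscan", db)]:
        print(name, "->", len(set(labels) - {-1}), "clusters")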

Data Preparation

The document describes a comprehensive data preparation process for the clustering analysis. The researchers began by sampling the dataset to manage computational load and selecting the most relevant features: temperature, precipitation, humidity, wind speed, and delivery availability. Missing values were addressed through median imputation, which preserves the center of each feature's distribution without introducing extreme values. The features were then standardized with StandardScaler so that each contributed equally to the distance calculations regardless of its original scale. This standardization is particularly important for weather data, where temperature readings can be numerically much larger than precipitation values.
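
A minimal sketch of this pipeline follows, using a synthetic stand-in for the dataset; the column names, dataset size, and sample size below are assumptions, since the report does not list the exact identifiers:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(42)

    # Synthetic stand-in for the weather/delivery data; the real column
    # names and dataset size are assumptions, not taken from the report.
    weather_df = pd.DataFrame({
        "temperature": rng.normal(20, 8, 10_000),
        "precipitation": rng.exponential(2, 10_000),
        "humidity": rng.uniform(20, 100, 10_000),
        "wind_speed": rng.exponential(10, 10_000),
        "delivery_available": rng.integers(0, 2, 10_000).astype(float),
    })
    # Inject some missing humidity readings to exercise the imputation.
    weather_df.loc[weather_df.sample(frac=0.05, random_state=0).index, "humidity"] = np.nan

    # Sample to manage computational load (sample size is an assumption).
    sample = weather_df.sample(n=2_000, random_state=42)

    # Median imputation preserves each feature's center without extremes.
    sample = sample.fillna(sample.median())

    # Standardize so every feature contributes equally to distances.
    X = StandardScaler().fit_transform(sample)
    print(X.shape, X.mean(axis=0).round(2), X.std(axis=0).round(2))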

K-means Clustering Analysis

The K-means clustering analysis employed the silhouette method to determine an appropriate cluster count. By testing values of k from 2 to 5 and computing the silhouette score for each, the researchers identified that 2, 4, and 5 clusters produced the most cohesive groupings. Silhouette plots visualized the quality of these solutions, showing how well each observation fit within its assigned cluster relative to neighboring clusters. The scatter plots clearly marked the centroids, aiding interpretation of each cluster's center. This methodical approach mitigates one of the main limitations of K-means clustering – the need to specify k in advance.
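
A sketch of the silhouette sweep, with synthetic blobs standing in for the standardized weather features:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Stand-in for the standardized matrix from the preparation step.
    X, _ = make_blobs(n_samples=1_000, centers=4, random_state=42)

    best_k, best_score = None, -1.0
    for k in range(2, 6):  # the report sweeps k = 2 through 5
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        score = silhouette_score(X, km.labels_)
        print(f"k={k}: mean silhouette = {score:.3f}")
        if score > best_score:
            best_k, best_score = k, score

    # km.cluster_centers_ holds the centroids marked on the report's plots.
    print("best k by silhouette:", best_k)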

Hierarchical Clustering Analysis

The hierarchical clustering implementation used Ward's method for linkage, which at each merge joins the pair of clusters that least increases within-cluster variance. The resulting dendrogram provided insight into the hierarchical relationships among data points and allowed interpretation at different levels of granularity. This multi-level perspective reveals how clusters merge as the distance threshold increases, offering a more nuanced understanding of the data than single-level clustering algorithms. The implementation also included a 3-cluster solution for direct comparison with the other methods, demonstrating how cutting the dendrogram at different heights yields different clusterings.
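
A sketch of the Ward linkage and the 3-cluster cut, again on stand-in data rather than the report's features:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.datasets import make_blobs

    # Stand-in data; the report applies this to the standardized features.
    X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

    # Ward linkage: each merge joins the pair of clusters that least
    # increases the total within-cluster variance.
    Z = linkage(X, method="ward")

    # Cutting the tree at three clusters mirrors the report's 3-cluster
    # solution; scipy.cluster.hierarchy.dendrogram(Z) would draw the tree.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print("cluster sizes:", np.bincount(labels)[1:])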

DBSCAN Clustering Analysis

The DBSCAN analysis systematically tested different parameter combinations to optimize clustering performance. By experimenting with various values of epsilon (the neighborhood radius) and the minimum-samples threshold, the researchers explored how these parameters affect cluster formation. The analysis demonstrated DBSCAN's ability to identify irregularly shaped clusters and to isolate outliers as noise points. The final parameter selection (eps=0.5, min_samples=5) balanced identifying meaningful clusters against excluding noise. This highlights DBSCAN's flexibility in handling complex data distributions, but also the difficulty of selecting parameters without labeled data for validation.
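
A sketch of such a sweep on synthetic stand-in data; the grid values other than the reported eps=0.5 and min_samples=5 are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in data; the report runs this sweep on the
    # standardized weather/delivery features.
    X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)
    X = StandardScaler().fit_transform(X)

    # Grid over the two DBSCAN parameters; values other than the
    # reported eps=0.5, min_samples=5 are illustrative choices.
    for eps in (0.3, 0.5, 0.7):
        for min_samples in (3, 5, 10):
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            n_clusters = len(set(labels) - {-1})
            n_noise = int(np.sum(labels == -1))
            print(f"eps={eps}, min_samples={min_samples}: "
                  f"{n_clusters} clusters, {n_noise} noise points")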

Comparative Analysis

The comparative analysis juxtaposed all three clustering methods, revealing different perspectives on the same dataset. K-means produced well-separated, spherical clusters with clear centroids but might oversimplify complex relationships. Hierarchical clustering preserved nested relationships between observations but with less distinct boundaries between major groups. DBSCAN identified core dense regions while marking peripheral points as noise, potentially revealing delivery patterns that occur only under specific weather conditions. This multi-algorithm approach demonstrates how different clustering techniques can reveal complementary insights, with each method highlighting different aspects of the underlying data structure in the weather and food delivery dataset.
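
One way to make such a comparison concrete is to run all three algorithms on the same standardized matrix and score each partition. The sketch below uses synthetic data; excluding DBSCAN's noise points before computing the silhouette is a common convention here, not the report's stated procedure:

    from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in with unequal cluster densities, standardized as
    # in the report's pipeline.
    X, _ = make_blobs(n_samples=800, centers=3, cluster_std=[1.0, 2.0, 0.5],
                      random_state=42)
    X = StandardScaler().fit_transform(X)

    results = {
        "k-means": KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X),
        "ward": AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X),
        "dbscan": DBSCAN(eps=0.5, min_samples=5).fit_predict(X),
    }
    for name, labels in results.items():
        mask = labels != -1  # silhouette is undefined for DBSCAN noise points
        found = len(set(labels[mask]))
        score = silhouette_score(X[mask], labels[mask]) if found > 1 else float("nan")
        print(f"{name}: {found} clusters, silhouette = {score:.3f}")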