Bikeshare User Clustering

Introduction

We were interested in how user patterns might vary between Omaha and Lincoln. Omaha is the larger city (metro population of about one million people), whereas Lincoln is about a quarter of the size and a college town. To compare users, we first filtered out admin and other invalid trips.

Most of the available data pertains to time, so we made maximum use of the timestamps. We considered whether a person made a trip in the winter (defined as December, January, February, and March) as a measure of their being an all-season cyclist. We also considered whether they biked at night. Nightfall is a tricky variable because sunset shifts throughout the year. Fortunately, the Python suntime package can give us the sunrise and sunset times on any day, adjusting for latitude and daylight saving time. We also flagged whether a trip returns to the same station at which it started (the one_way variable). This variable gives an indication of whether the trip was utilitarian or recreational. Finally, we considered the trip count and average trip duration by user.
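The feature construction described above might look roughly like the following sketch. It is not the code used for this analysis. The trip field names (user_id, start, duration, start_station, end_station) and the helper names are hypothetical, and fixed_sun_times is a placeholder for a real sunrise/sunset lookup such as one built on suntime. Two further assumptions: a trip counts as a night trip based on its start time only, and the indicator variables are averaged per user, which is what the fractional values in the cluster tables seem to suggest.

```python
from collections import defaultdict
from datetime import datetime

WINTER_MONTHS = {12, 1, 2, 3}  # winter as defined in the text


def fixed_sun_times(start):
    """Hypothetical stand-in: 07:00 sunrise and 18:00 sunset on the trip's date.

    A real lookup would return that day's actual times for the city.
    """
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    return day.replace(hour=7), day.replace(hour=18)


def trip_flags(trip, sun_times=fixed_sun_times):
    """Return the per-trip 0/1 indicators described in the text."""
    start = trip["start"]
    sunrise, sunset = sun_times(start)
    return {
        "is_winter": int(start.month in WINTER_MONTHS),
        # Assumption: only the start time decides whether a trip is at night.
        "is_night": int(start < sunrise or start >= sunset),
        # As described in the text: the trip ends at its starting station.
        "one_way": int(trip["start_station"] == trip["end_station"]),
    }


def user_features(trips, sun_times=fixed_sun_times):
    """Aggregate trips to one row per user: trip count plus per-user means."""
    by_user = defaultdict(list)
    for trip in trips:
        by_user[trip["user_id"]].append(trip)

    features = {}
    for user, rows in by_user.items():
        flags = [trip_flags(r, sun_times) for r in rows]
        n = len(rows)
        features[user] = {
            "trip_ct": n,
            "is_winter": sum(f["is_winter"] for f in flags) / n,
            "is_night": sum(f["is_night"] for f in flags) / n,
            "one_way": sum(f["one_way"] for f in flags) / n,
            "duration": sum(r["duration"] for r in rows) / n,
        }
    return features


# Two made-up trips for one made-up user.
trips = [
    {"user_id": "a", "start": datetime(2020, 1, 15, 8, 0), "duration": 10,
     "start_station": "S1", "end_station": "S2"},
    {"user_id": "a", "start": datetime(2020, 7, 15, 22, 0), "duration": 30,
     "start_station": "S1", "end_station": "S1"},
]
features = user_features(trips)
```

Passing the sunrise/sunset lookup in as a function keeps the night flag independent of any particular package and makes it easy to test with fixed times.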

Perform Silhouette Analysis to Define K Value for K-Means Clustering

We use k-means clustering to define user groups. This algorithm partitions observations (in our case, users) into k clusters so as to minimize the within-cluster variation in the input features, which is equivalent to maximizing the variation between clusters. Our features are the variables described above. We use silhouette analysis to find the optimal number of clusters. The silhouette value is a measure of how similar an observation is to its own cluster (cohesion) compared to other clusters (separation). It ranges from −1 to +1, where a high value indicates that the observation is well matched to its own cluster and poorly matched to neighboring clusters. Four clusters give the highest average silhouette score (about 0.51), which suggests that four clusters are ideal here.

For n_clusters = 2 The average silhouette_score is : 0.4653705024493721
For n_clusters = 3 The average silhouette_score is : 0.48854051475775756
For n_clusters = 4 The average silhouette_score is : 0.5065985416901237
For n_clusters = 5 The average silhouette_score is : 0.4497914115339878
For n_clusters = 6 The average silhouette_score is : 0.4615478858524191
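To make the silhouette definition concrete, here is a minimal pure-Python sketch of the calculation. It is an illustration only, not the code that produced the scores above, and the function names are hypothetical. For each observation, a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster; the silhouette is (b - a) / max(a, b). The last lines pick the k with the highest average score from the values reported above, rounded to four decimals.

```python
from math import dist


def silhouette_values(points, labels):
    """Per-observation silhouette s = (b - a) / max(a, b)."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster (separation)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values


def mean_silhouette(points, labels):
    """Average silhouette over all observations."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Average scores reported above, rounded to four decimals.
reported = {2: 0.4654, 3: 0.4885, 4: 0.5066, 5: 0.4498, 6: 0.4615}
best_k = max(reported, key=reported.get)
```

This version compares every pair of observations, so it is only practical for small samples. With roughly 79,000 users, a library implementation or a random subsample would be the realistic choice.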

K-Means Clustering for Four Clusters
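As a rough sketch of the k-means procedure itself (Lloyd's algorithm), the following pure-Python version alternates between assigning each observation to its nearest centroid and moving each centroid to the mean of its members. It is not the implementation used for this analysis. The function name, the random initialization from the data points, and the toy data are assumptions made for illustration.

```python
import random
from math import dist


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for a list of points."""
    rng = random.Random(seed)
    # Assumption: initialize centroids at k randomly chosen observations.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the fit has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups of three points each.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centroids = kmeans(toy, k=2)
```

Because distances are Euclidean, features on larger scales (here trip_ct and duration) carry more weight in the assignment step than the 0-to-1 indicator variables unless the features are rescaled first.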

Describing The Clusters

With four clusters and the descriptive statistics below, we can define the clusters as follows:

  • Cluster 0 (Local infrequency): These are occasional users who rarely use the system during the winter. They may use it during the evening.
  • Cluster 1 (Tourist): These are also occasional users who rarely use the system during the winter and may use it during the evening. They differ from Cluster 0 in that they have a higher one_way share (0.76 versus 0.53 on average), make slightly fewer trips, and take trips that are on average more than twice as long.
  • Cluster 2 (Frequent): These are the most frequent users of the system (22 users with at least 1,111 trips each). They are likely to make a trip during the winter months.
  • Cluster 3 (Frequent social): These are frequent users of the system, though they make fewer trips than Cluster 2 users. They have slightly higher one_way and is_night shares than Cluster 2 users.

Statistics for Cluster 0 (Local infrequency)

       trip_ct     is_winter   is_night    one_way     duration
count  42836       42836       42836       42836       42836
mean   5.89523     0.0884465   0.25797     0.532147    24.2404
std    15.7375     0.266521    0.403309    0.465253    9.93219
min    1           0           0           0           2
25%    1           0           0           0           16.5
50%    2           0           0           0.555556    25.5
75%    4           0           0.5         1           32.5
max    185         1           1           1           49.2203

Statistics for Cluster 1 (Tourist)

       trip_ct     is_winter   is_night    one_way     duration
count  35900       35900       35900       35900       35900
mean   2.47889     0.0779591   0.20361     0.76414     55.7638
std    2.97054     0.255467    0.385383    0.400037    12.9453
min    1           0           0           0           40
25%    1           0           0           0.583333    46
50%    2           0           0           1           52.3333
75%    3           0           0           1           62
max    122         1           1           1           96

Statistics for Cluster 2 (Frequent)

       trip_ct     is_winter   is_night    one_way     duration
count  22          22          22          22          22
mean   1820.91     0.23552     0.159872    0.0988543   13.6594
std    996.699     0.0663609   0.110981    0.132088    8.6342
min    1111        0.0878261   0.0483019   0.00126984  3.53841
25%    1307.75     0.205502    0.0686736   0.00993912  6.66342
50%    1454        0.244916    0.111485    0.0573053   11.3514
75%    1855.25     0.284529    0.257867    0.12855     18.3469
max    5447        0.332565    0.404865    0.561656    33.7786

Statistics for Cluster 3 (Frequent social)

       trip_ct     is_winter   is_night    one_way     duration
count  347         347         347         347         347
mean   365.585     0.156223    0.19016     0.103283    12.6812
std    182.188     0.126897    0.130693    0.160661    8.8064
min    186         0           0           0           3.68841
25%    230         0.0556363   0.0815076   0.0195956   6.7967
50%    304         0.13215     0.182979    0.0501882   9.26971
75%    442.5       0.229686    0.270637    0.113309    15.2136
max    1029        0.720096    0.669565    1           49.4857

Potential Additional Dimensions to Explore:

  • Use date to get weekday vs. weekend
  • Travel speed (duration seems inaccurate for some trips and we do not have route distance, so it may be hard to have confidence in the result)
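The weekday vs. weekend dimension could be derived from the existing timestamps with something like the sketch below. The function names are hypothetical, and the per-user share mirrors how the other indicator variables appear to be aggregated, which is an assumption.

```python
from datetime import datetime


def is_weekend(start):
    """True for Saturday or Sunday (datetime.weekday(): Monday is 0, Sunday is 6)."""
    return start.weekday() >= 5


def weekend_share(starts):
    """Share of a user's trips that start on a weekend."""
    return sum(is_weekend(s) for s in starts) / len(starts)


# Made-up example: one Saturday trip and one Monday trip.
example_starts = [datetime(2021, 6, 5, 10, 0), datetime(2021, 6, 7, 8, 30)]
example_share = weekend_share(example_starts)
```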