Teammates: Jessica Greene, Brian Rawn
Rebalancing has long been a difficult task in bikeshare systems: stations are constantly draining (running out of bikes) or spilling (filling up with them). Elevation, weather, holidays, events, and the pandemic all affect people's preferences and commuting modes. The purpose of rebalancing is to determine how many bikes should be at a station at a given time, so one of the most important things for a bikeshare company is to account for these temporal and spatial patterns and redistribute bikes across stations at the right time.
The most manageable and efficient approach seems to be moving bikes around by truck. This is especially useful when a surge in demand is predicted for certain areas at certain times, such as parks on sunny holidays or stadiums after a show. It can also be used on a daily basis to routinely move bikes from spilling stations back to draining ones, assuming demand recurs at roughly the same times. For stations where spilling and draining are less severe or only occasional, it would help to offer riders bonuses or coupons for suggested routes or return spots. Without changing a customer's destination, the company could suggest a nearby return station within walking distance of the original destination, avoiding a pile-up of returns at a few hot spots.
Commuting patterns, holidays, weather, and events are some of the relatively predictable factors in supply and demand analysis, and each scenario implies a different rebalancing time scale. For daily commuting patterns, e.g. the morning and afternoon rush hours, redistribution may happen once or twice a day. For holidays, weekends, or special events, stations with predicted high demand should be stocked several hours before the spike arrives.
In this report, we use historical bikeshare data from Boston (April 2018) to build a prediction model. The main features we take into account are time lags, holiday lags, weather, and spatial features including station location, university areas, and possible hotspots such as parks, water, and parkways.
Load the libraries
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gganimate)
library(gifski)
library(FNN)    # provides get.knnx(), used in nn_function below
library(caret)  # provides trainControl() and train(), used for cross-validation
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.2),
axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
palette6 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c",'#053260')
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
# Load Quantile break functions
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}
qBr2 <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(round(quantile(round(df[[variable]], 0),
                                c(.01, .2, .4, .6, .8), na.rm = TRUE)))
  } else if (rnd == FALSE) {
    as.character(round(quantile(df[[variable]],
                                c(.01, .2, .4, .6, .8), na.rm = TRUE), 0))
  }
}
# Average distance from each point in measureFrom to its k nearest
# neighbours in measureTo (uses get.knnx from the FNN package)
nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <-
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
We use trip data linked from Boston's open data portal (the dataset itself is hosted by Bluebikes), with April 2018 as the sample month for building our predictions.
The data includes:
Trip data: trip duration, start time and date, stop time and date, from station, to station
Customer data: member/casual, birth year, gender
In this model, we didn’t include the customer data.
Link on Open Data - the dataset is managed by an outside resource
https://data.boston.gov/dataset/blue-bikes-system-data
Bluebike System Data
https://www.bluebikes.com/system-data
bostonbikes <-
read.csv('D:/Rdata/Homework5_Bikeshare/data/HubwayTrips.csv', header=TRUE)
dat <- bostonbikes
Use date parsing to bin the data by 15 and 60 minute intervals
dat2 <- dat %>%
mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
Census data is used to get tract-level socio-economic information for later analysis and generalizability tests. Features included are: total population, median income, median age, white population, travel time, number of commuters, and means of transportation.
bostonCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2017,
state = "MA",
geometry = TRUE,
county=c("Suffolk"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
bostonTracts <-
bostonCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf
#BR note: changed "left = TRUE" to "left = FALSE" to exclude data that was outside of boston tracts. Can someone confirm that the second "TRUE" should not also be made false?
dat_census <- st_join(dat2 %>%
filter(is.na(from_longitude) == FALSE &
is.na(from_latitude) == FALSE &
is.na(to_latitude) == FALSE &
is.na(to_longitude) == FALSE) %>%
st_as_sf(., coords = c("from_longitude", "from_latitude"), crs = 4326),
bostonTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = FALSE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(from_longitude = unlist(map(geometry, 1)),
from_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("to_longitude", "to_latitude"), crs = 4326) %>%
st_join(., bostonTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(to_longitude = unlist(map(geometry, 1)),
to_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
Weather data is downloaded for April 1 through April 30, 2018 from the weather station at Boston Logan International Airport (KBOS), then summarized to hourly intervals across the month.
Weather features include: temperature, precipitation, and wind speed.
weather.Panel <-
riem_measures(station = "KBOS", date_start = "2018-04-01", date_end = "2018-04-30") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.Panel)
grid.arrange(
  ggplot(weather.Panel, aes(interval60, Precipitation)) + geom_line() +
    labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60, Wind_Speed)) + geom_line() +
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60, Temperature)) + geom_line() +
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Boston KBOS - April 2018")
Create a time-space panel
length(unique(dat_census$interval60)) * length(unique(dat_census$from_station_id))
## [1] 90610
study.panel <-
expand.grid(interval60=unique(dat_census$interval60),
from_station_id = unique(dat_census$from_station_id)) %>%
left_join(., dat_census %>%
select(from_station_id, from_station_name, Origin.Tract,
from_longitude, from_latitude, to_longitude,to_latitude )%>%
distinct() %>%
group_by(from_station_id) %>%
slice(1))%>%
na.omit()
nrow(study.panel)
## [1] 90610
ride.panel <-
dat_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, from_station_id, from_station_name, Origin.Tract,
from_longitude, from_latitude,to_longitude,to_latitude) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(from_station_id) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel <-
left_join(ride.panel, bostonCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
Time lag variables are created and their correlations with trip count are tested in this part.
Time lags: lag 1 hour, lag 2 hours, lag 3 hours, lag 4 hours, lag 12 hours, lag 1 day
Holiday lags: the only holiday in April 2018 was Easter on April 1st, the start of the month, so the holiday lags are plus one day, plus two days, and plus three days.
From the correlation test, we found that the 1-hour lag is most strongly correlated with trip count, followed by the 2-hour lag and the 1-day lag. These three time lag features are later included in the model.
ride.panel <-
  ride.panel %>%
  arrange(from_station_id, interval60) %>%
  mutate(lagHour = dplyr::lag(Trip_Count, 1),
         lag2Hours = dplyr::lag(Trip_Count, 2),
         lag3Hours = dplyr::lag(Trip_Count, 3),
         lag4Hours = dplyr::lag(Trip_Count, 4),
         lag12Hours = dplyr::lag(Trip_Count, 12),
         lag1day = dplyr::lag(Trip_Count, 24),
         holiday = ifelse(yday(interval60) == 91, 1, 0)) %>%  # April 1 is day 91 of 2018
  mutate(day = yday(interval60)) %>%
  mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays"),
         holidayLag = replace_na(holidayLag, "0"))
as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 x 2
## Variable correlation
## <fct> <dbl>
## 1 lagHour 0.54
## 2 lag2Hours 0.35
## 3 lag3Hours 0.18
## 4 lag4Hours 0.11
## 5 lag12Hours -0.17
## 6 lag1day 0.28
#add row index
ride.panel <- ride.panel %>% mutate(id = row_number())
A new time feature and three spatial features are added to the model. Since travel patterns differ substantially between weekdays and weekends, we expect this dummy variable to improve the model. Because bikeshare demand may be tied to particular destinations, we also measure proximity to open space and amenities to capture possible spatial autocorrelation.
ride.panel <- ride.panel %>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))
ride.panel.geo.from <-
st_as_sf(ride.panel,coords = c('from_longitude','from_latitude'),crs=4326) %>%
st_transform(st_crs(bostonTracts))
ride.panel.geo.to <-
st_as_sf(ride.panel,coords = c('to_longitude','to_latitude'),crs=4326) %>%
st_transform(st_crs(bostonTracts))
The first spatial feature is the proximity of the to-stations to water and the Emerald Necklace (only the linear parkway systems are included here). We assume that if a to-station is within 500 meters of the water or parkways, riders' destinations are more likely to be these leisure locations.
The feature is divided into two categories: near water/parkways or not near water/parkways.
WaterParkways <-
st_read("https://opendata.arcgis.com/datasets/2868d370c55d4d458d4ae2224ef8cddd_7.geojson") %>%
st_transform(st_crs(bostonTracts)) %>%
filter(TypeLong =="Parkways, Reservations & Beaches" )
WP.point <- st_cast(WaterParkways,"POINT")
ride.panel.geo.to <-
ride.panel.geo.to %>%
mutate(dist.WP = nn_function(st_coordinates(ride.panel.geo.to),
st_coordinates(WP.point), 1))
ride.panel.geo.to <-
ride.panel.geo.to %>%
mutate(WP.cat = case_when(
dist.WP >= 0 & dist.WP < 0.005 ~ "near WP",
dist.WP >= 0.005 ~ "not near WP"))
This feature aims to account for weekend leisure rides. We filter for parks larger than 15 acres and, as with the water and parkways feature, assume that a to-station within a 300-meter buffer of a park suggests the park is the rider's destination.
openspace <-
st_read("https://opendata.arcgis.com/datasets/2868d370c55d4d458d4ae2224ef8cddd_7.geojson") %>%
st_transform(st_crs(bostonTracts)) %>%
filter(TypeLong =="Parks, Playgrounds & Athletic Fields") %>%
filter(ACRES >= 15)
openspaceBuffer <-
st_buffer(openspace, 0.003)
nearpark <- st_intersection(ride.panel.geo.to,openspaceBuffer) %>%
dplyr::select(id) %>%
st_drop_geometry() %>%
mutate(nearpark = 'yes')
ride.panel.geo.to <- left_join(ride.panel.geo.to,nearpark, by='id')
ride.panel.geo.to$nearpark <- replace_na(ride.panel.geo.to$nearpark, 'no')
There might be more bike trips in university areas, so we use two radii, 500m and 1000m, to cover them.
colleges <-
st_read("https://opendata.arcgis.com/datasets/cbf14bb032ef4bd38e20429f71acb61a_2.geojson") %>%
st_transform(st_crs(bostonTracts))
ride.panel.geo.from <-
ride.panel.geo.from %>%
mutate(dist.college = nn_function(st_coordinates(ride.panel.geo.from),
                                  st_coordinates(st_centroid(colleges)), 1))
ride.panel.geo.from <-
ride.panel.geo.from %>%
mutate(colleges.cat = case_when(
dist.college >= 0 & dist.college < 0.005 ~ "0-500m",
dist.college >= 0.005 & dist.college < 0.01 ~ "500-1000",
dist.college >= 0.01 & dist.college < 50 ~ "outside"))
Join new features to ride.panel
ride.panel.geo.from.dat <- ride.panel.geo.from %>%
st_drop_geometry() %>%
dplyr::select(id,dist.college, colleges.cat)
ride.panel.geo.to.dat <- as.data.frame(ride.panel.geo.to) %>%
dplyr::select(id,nearpark,dist.WP,WP.cat)
ride.panel <- left_join(ride.panel,ride.panel.geo.from.dat, by='id')
ride.panel <- left_join(ride.panel,ride.panel.geo.to.dat, by='id')
We split the data into a training and a test set, training on the first three weeks and testing on the last two. We chose the first three weeks as the training set because they include the April 1st holiday. As mentioned above, however, there was a notable spike on April 24th, which falls in the test set, so the model was likely to miss it.
#BR Note:
#Previously was week >= 20 and week < 20
ride.Train <- filter(ride.panel, week < 16)
ride.Test <- filter(ride.panel, week >= 16)
We create six linear models using the lm function:
reg 1: only time
reg 2: only space
reg 3: time and space
reg 4: time, space and time lag
reg 5: time, space, time lag and holiday lag
reg 6: time, space, time lag, holiday lag and new spatial features
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=ride.Train)
reg2 <-
lm(Trip_Count ~ from_station_name + dotw + Temperature, data=ride.Train)
reg3 <-
lm(Trip_Count ~ from_station_name + hour(interval60) + dotw + Temperature + Precipitation,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ from_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.Train)
reg5 <-
lm(Trip_Count ~ from_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag1day + holidayLag + holiday,
data=ride.Train)
reg6 <-
lm(Trip_Count ~ from_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag1day + holidayLag + holiday+ weekend +
colleges.cat + nearpark + WP.cat ,
data=ride.Train)
Create a nested data frame of test data by week
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
Create a function called model_pred
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
Run Predictions
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
FAll_Ameneties = map(.x = data, fit = reg6, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions
We run a 50-fold cross-validation to test generalizability.
The R-squared is about 0.16, suggesting a poor fit. One reason may be that a single month of data is quite small for bikeshare prediction, and trip volumes are low at most stations. The following sections discuss where the problems might lie.
# use caret package cross-validation method
fitControl <- trainControl(method = "cv",
number = 50,
# savePredictions differs from book
savePredictions = TRUE)
set.seed(856)
# for k-folds CV
#Run Regression using K fold CV
reg.cv <-
train(Trip_Count ~ ., data = ride.panel %>%
dplyr::select(Trip_Count,from_station_name ,interval60 ,dotw , Temperature , Precipitation ,
lagHour , lag2Hours ,lag1day , holidayLag , holiday, weekend ,
colleges.cat , nearpark , WP.cat
),
method = "lm",
trControl = fitControl,
na.action = na.pass)
reg.cv
## Linear Regression
##
## 93398 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 84058, 84059, 84058, 84058, 84059, 84058, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1950763 0.1649965 0.04153874
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
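To look at generalizability beyond the single averaged R-squared, the fold-level errors that caret stores in the fitted object can be plotted. This is a minimal sketch; it assumes caret's default resampling summary, which keeps one row of RMSE, Rsquared, and MAE per fold in reg.cv$resample.
# Sketch: distribution of MAE across the cross-validation folds
ggplot(reg.cv$resample, aes(x = MAE)) +
  geom_histogram(bins = 15, fill = palette2[2]) +
  labs(title = "Distribution of MAE across CV folds",
       x = "Mean Absolute Error (trips)", y = "Count") +
  plotTheme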
Including time, space, time lags, and holiday lags lowers the MAE and improves the regression. Adding the spatial features and the new weekday/weekend time feature in reg 6 (FAll_Ameneties), however, improves the model only marginally.
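As a quick tabular companion to the bar chart below, the same errors can be summarized by model and test week. This is a minimal sketch using only the columns already present in week_predictions.
# Sketch: MAE by model specification and test week as a table
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  spread(week, MAE) %>%
  kable(digits = 2, caption = "MAE by model specification and test week") %>%
  kable_styling()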
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette6) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme
## Warning: Removed 6 rows containing missing values (geom_bar).
The line plots show that the model significantly underpredicts the spikes; it performs better when predicting average bikeshare volumes.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
from_station_id = map(data, pull, from_station_id)) %>%
dplyr::select(interval60, from_station_id, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -from_station_id) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Boston; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
plotTheme
For most stations the error is low, but there is a cluster of higher errors in downtown Boston, plus one outlier. Since we are underestimating the spikes and downtown is where bikeshare volume is highest, it is likely that we are underpredicting at the stations with the most trips.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
from_station_id = map(data, pull, from_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude)) %>%
select(interval60, from_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
group_by(from_station_id, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = bostonCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", alpha = 1)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
labs(title="Mean Abs Error, Test Set, Model 5")+
mapTheme
Compared to the perfect fit (the 45-degree line), our model underpredicts bikeshare trips. The gap in slope is largest during the PM rush, suggesting errors are largest in that period.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
from_station_id = map(data, pull, from_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw)) %>%
select(interval60, from_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
plotTheme
The outlier remains visible in all of the maps, but errors are smaller overnight, when bikeshare volume is also lower. The clustering of errors downtown is more pronounced on weekdays, while weekend errors are more evenly distributed.
Hence, we are underpredicting high-volume downtown stations, especially during weekday rush hours. These are the cases to focus on when improving the model.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
from_station_id = map(data, pull, from_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw) ) %>%
select(interval60, from_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(from_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = bostonCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", size = 1, alpha = 1)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme
The line for Percent_Taking_Public_Trans is quite flat, a good sign that our model predicts equally well across populations with different access to public transportation. However, errors grow as median income and percent white population increase. Given our underprediction at high-volume stations, it is possible that higher-income, whiter neighborhoods are where more bikeshare trips occur.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
from_station_id = map(data, pull, from_station_id),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White)) %>%
select(interval60, from_station_id, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(from_station_id, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-from_station_id, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme
In sum, we do not recommend putting this model into production without further modification. The main purpose of the prediction is to forecast where bikeshare demand will be high so rebalancing can be prepared, but our model significantly underpredicts the spikes, making it inaccurate for the trends and actual demand we most need to anticipate.
Next steps to improve the model:
Take a longer period of time into consideration
We trained the model on a very short period of time; although the weekly time pattern is repetitive, there is not enough volatility in the training set. To increase generalizability, the model should include more historical data, expanding to half a year or even a full year, as sketched below.
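A minimal sketch of how several monthly trip files could be combined, assuming they share the schema of HubwayTrips.csv; the folder path and file pattern are placeholders.
# Sketch: combine multiple monthly Bluebikes trip files into one data frame.
# The path and pattern are placeholders; files are assumed to share the
# columns used above (start_time, station ids, coordinates).
trip_files <- list.files("D:/Rdata/Homework5_Bikeshare/data",
                         pattern = "\\.csv$", full.names = TRUE)
dat_long <- map_dfr(trip_files, read.csv, header = TRUE) %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label = TRUE))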
Include more spatial features
The model does not distinguish downtown from other areas of Boston, and the clustering of errors suggests spatial autocorrelation there. Other spatial features, such as elevation (proximity to hills), proximity to metro stations (as a categorical feature to avoid collinearity with the station name), and proximity to other amenities, might also help, as in the sketch below.
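For example, a metro-proximity feature could be built the same way as the college feature, assuming a point layer of transit stations is available locally; the file name below is a placeholder, not a verified data source.
# Hypothetical metro-proximity feature; mbta_stations.geojson is a placeholder
# for a local point layer of transit stations, and the 0.005-degree (~500m)
# cutoff mirrors the thresholds used above.
mbta <- st_read("mbta_stations.geojson") %>%
  st_transform(st_crs(bostonTracts))
ride.panel.geo.from <- ride.panel.geo.from %>%
  mutate(dist.metro = nn_function(st_coordinates(ride.panel.geo.from),
                                  st_coordinates(mbta), 1),
         metro.cat = ifelse(dist.metro < 0.005, "near metro", "not near metro"))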
Consider customer features
Customers' gender, age, and membership status likely influence their behavior and preferences; for example, younger cohorts may be more likely to use bikeshare during weekday rush hours. A sketch of deriving such features follows.
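A minimal sketch of how customer features could be derived from the raw trip data, assuming the member/casual, birth year, and gender fields listed earlier are named usertype, birth_year, and gender; the column names are an assumption about the raw file.
# Hypothetical customer features; column names (usertype, birth_year) are
# assumed, not verified against the raw file.
dat_customers <- dat %>%
  mutate(age = 2018 - as.numeric(birth_year),
         is_member = ifelse(usertype == "Subscriber", 1, 0),
         age_group = case_when(is.na(age) ~ NA_character_,
                               age < 25 ~ "under 25",
                               age < 45 ~ "25-44",
                               TRUE ~ "45 and over"))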
It should also be noted that, because of the pandemic, historical data is becoming less reliable; the current model and feature engineering may not explain time-space patterns in the post-pandemic world.