Teammates: Jessica Greene, Brian Rawn

1 Introduction

Rebalancing has long been a difficult task in bikeshare systems: at any given time some stations are draining while others are spilling over. Elevation, weather, holidays, events and the pandemic are all variables that affect people's preferences and commuting modes. The purpose of rebalancing is to determine how many bikes should be at a station at a given time. One of the most important things a bikeshare company can do is accommodate these temporal and spatial patterns and redistribute bikes across stations at the proper time.

The most manageable and efficient approach seems to be using a truck to move bikes around. This mode is especially useful when a surge of bike demand is predicted in certain areas at certain times, such as parks on sunny holidays or stadiums after a show. It can also be used on a daily basis to routinely move bikes from spilling stations back to draining ones, assuming that demand recurs at the usual times. For stations where spilling and draining problems are less severe or only occasional, it would be useful to offer riders bonuses or coupons for a suggested route or return spot. Rather than changing the customer's destination, the bikeshare company could suggest a nearby return station within walking distance of the customer's original destination, avoiding clusters of bike returns at a few hot spots.

Commuting patterns, holidays, weather and events are some of the relatively predictable factors in supply and demand analysis, and different scenarios call for different rebalancing time scales. For daily commuting patterns, e.g. the morning and afternoon rush hours, redistribution may happen once or twice a day. For holidays, weekends or special events, stations with predicted high demand should be stocked several hours before the spike comes.

In this report, we use historical bikeshare data from Boston (April 2018) to build a prediction model. The main features we take into account are time lags, holiday lags, weather, and spatial features including the location of the station, university areas, and possible hotspots such as parks, water and parkways.

2 Set-up and Import Data

2.1 Setup

Load the libraries

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gganimate)
library(gifski)
library(FNN)    # for get.knnx(), used in nn_function below
library(caret)  # for trainControl()/train(), used in cross-validation

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette6 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c",'#053260')
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")


# Load Quantile break functions

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

qBr2 <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(round(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T)))
  } else if (rnd == FALSE | rnd == F) {
    as.character(round(quantile(df[[variable]],
                                c(.01,.2,.4,.6,.8), na.rm=T), 0))
  }
}

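# nn_function: mean distance from each point in measureFrom to its k nearest
# neighbors in measureTo (via FNN::get.knnx)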
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
    output <-
      as.data.frame(nn) %>%
      rownames_to_column(var = "thisPoint") %>%
      gather(points, point_distance, V1:ncol(.)) %>%
      arrange(as.numeric(thisPoint)) %>%
      group_by(thisPoint) %>%
      summarize(pointDistance = mean(point_distance)) %>%
      arrange(as.numeric(thisPoint)) %>% 
      dplyr::select(-thisPoint) %>%
      pull()
  
  return(output)  
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

2.2 Import Data

We use data provided via an external link on Boston's open data portal, with April 2018 as the sample month for building our predictions.

The data includes:

  • Trip data: trip duration, start time and date, stop time and date, from station, to station

  • Customer data: member/casual, birth year, gender

In this model, we didn’t include the customer data.

Link on Open Data - the dataset is managed by an outside resource
https://data.boston.gov/dataset/blue-bikes-system-data

Bluebikes System Data
https://www.bluebikes.com/system-data

bostonbikes <-
  read.csv('D:/Rdata/Homework5_Bikeshare/data/HubwayTrips.csv', header=TRUE)

dat <- bostonbikes

Use date parsing to bin the data by 15 and 60 minute intervals

dat2 <- dat %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

2.3 Import Census Information

Census data is used to get tract-scale socio-economic information for later analysis and generalizability tests. The features included are: total population, median income, median age, white population, travel time, number of commuters, and means of transportation.

bostonCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2017, 
          state = "MA", 
          geometry = TRUE, 
          county=c("Suffolk"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
bostonTracts <- 
  bostonCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf
# Join trip origins to Boston tracts with left = FALSE to drop trips starting
# outside Boston; the destination join below keeps left = TRUE so trips ending
# outside Boston (e.g., in Cambridge) are retained with an NA Destination.Tract
dat_census <- st_join(dat2 %>% 
          filter(is.na(from_longitude) == FALSE &
                   is.na(from_latitude) == FALSE &
                   is.na(to_latitude) == FALSE &
                   is.na(to_longitude) == FALSE) %>%
          st_as_sf(., coords = c("from_longitude", "from_latitude"), crs = 4326),
        bostonTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = FALSE) %>%
    rename(Origin.Tract = GEOID) %>%
  mutate(from_longitude = unlist(map(geometry, 1)),
         from_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("to_longitude", "to_latitude"), crs = 4326) %>%
  st_join(., bostonTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(to_longitude = unlist(map(geometry, 1)),
         to_latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

2.4 Import Weather Data

Weather data for April 1 through April 30, 2018 is downloaded from the weather station at Boston Logan International Airport (KBOS), then summarized to hourly intervals across the month.

Weather features include temperature, precipitation and wind speed.

weather.Panel <- 
  riem_measures(station = "KBOS", date_start = "2018-04-01", date_end = "2018-04-30") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
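    # a Temperature of 0 is missing data (NAs were coded 0 above);
    # impute 42°F, a typical April temperature in Boston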
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

glimpse(weather.Panel)
grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - Boston KBOS - May, 2018")

3 Data Wrangling and Exploration

3.1 Describe and Explore the Data

3.1.1 Bikeshare frequency

Generally, the bikeshare pattern remains consistent across weekdays, with a dip each weekend. Easter on April 1st depressed bikeshare volumes on the holiday and for about two days after. There is also a striking spike on April 24th; we are not sure what caused the surge, but it is worth noting.

ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike Share Trips per hr. Boston, April, 2018",
       x="Date", 
       y="Number of trips")+
  plotTheme
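
To probe the April 24th spike, a quick diagnostic (a sketch only, reusing dat_census from above) is to list the busiest station-hours on that day:

# Which station-hours drove the April 24th surge?
dat_census %>%
  filter(as.Date(interval60) == as.Date("2018-04-24")) %>%
  group_by(from_station_name, hour = hour(interval60)) %>%
  tally() %>%
  arrange(desc(n)) %>%
  head(10)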

  • Trip volume by station and time

In general, most stations see zero or very few rides in any given hour. But the AM and PM rush hours show more stations with higher ride counts compared to mid-day and overnight, and the AM rush appears to produce more extreme high-demand stations than the PM rush.

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, from_station_name, time_of_day) %>%
         tally()%>%
  group_by(from_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Boston, April, 2018",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

ggplot(dat_census %>%
         group_by(interval60, from_station_name) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike Share Trips per hr by Station. Boston, April, 2018",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme

  • Bikeshare Trips in Boston, by day and by week

The line plots of bikeshare trips by day and by week show two weekday spikes, for the AM and PM rushes. Mondays seem to have the highest bikeshare volume. The temporal pattern differs significantly between weekdays and weekends: on weekends there are no rush-hour spikes but a single broad hump from late morning to afternoon, probably reflecting weekend leisure rides.

ggplot(dat_census %>% mutate(hour = hour(start_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Boston, by day of the week, April, 2018",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

ggplot(dat_census %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Boston - weekend vs weekday, April, 2018",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

  • Bikeshare Trips by hour, station and weekday/weekend

On weekdays, high-volume bikeshare rides cluster in Back Bay and the South End, Boston's downtown. The cluster is most pronounced during the AM and PM rush hours. On weekends, bikeshare volume is more evenly distributed across the city, though mid-day also sees a rise in downtown Boston.

#Some points do not align with the tracts because the trip data includes stations in Cambridge, which our tract data does not cover

ggplot()+
  geom_sf(data = bostonTracts %>%
          st_transform(crs=4326))+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(from_station_id, from_latitude, from_longitude, weekend, time_of_day) %>%
              tally(),
            aes(x=from_longitude, y = from_latitude, color = n), 
            fill = "transparent", alpha = 1, size = 0.6)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike Share Trips per hr by Station. Boston, April, 2018")+
  mapTheme

  • Animated Map

The time-space pattern is visualized more clearly in the animated map. As found before, bikeshare trip volume increases significantly in the Back Bay tracts during rush hours, especially the PM rush.

week14 <-
  filter(dat_census , week == 14 & dotw == "Mon")

week14.panel <-
  expand.grid(
    interval15 = unique(week14$interval15),
    Origin.Tract = unique(dat_census$Origin.Tract))

ride.animation.data <-
  mutate(week14, Trip_Counter = 1) %>%
    right_join(week14.panel) %>% 
    group_by(interval15, Origin.Tract) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>% 
    left_join(bostonTracts, by=c("Origin.Tract" = "GEOID")) %>%
    st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                             Trip_Count > 10 ~ "11+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                                       "7-10 trips","11+ trips"))

rideshare_animation <-
  ggplot() +
    geom_sf(data = ride.animation.data, aes(fill = Trips)) +
    scale_fill_manual(values = palette5) +
    labs(title = "Rideshare pickups for a Monday in April 2018",
         subtitle = "15 minute intervals: {current_frame}") +
    transition_manual(interval15) +
    mapTheme

animate(rideshare_animation, duration=20, renderer = gifski_renderer())

3.2 Create Time-Space Panel

Create a time-space panel with one row for every station-hour combination. First check the expected number of rows:

length(unique(dat_census$interval60)) * length(unique(dat_census$from_station_id))
## [1] 90610
study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              from_station_id = unique(dat_census$from_station_id)) %>%
  left_join(., dat_census %>%
              select(from_station_id, from_station_name, Origin.Tract, 
                     from_longitude, from_latitude, to_longitude,to_latitude )%>%
              distinct() %>%
              group_by(from_station_id) %>%
              slice(1))%>%
            na.omit()

nrow(study.panel)           
## [1] 90610
ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, from_station_id, from_station_name, Origin.Tract, 
           from_longitude, from_latitude,to_longitude,to_latitude) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(from_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)
ride.panel <- 
  left_join(ride.panel, bostonCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

3.3 Create Time Lags

Time lag variables are created and their correlations with trip count are tested in this part.

  • Time lag: lag 1 hour, lag 2 hours, lag 3 hours, lag 4 hours, lag 12 hours, lag 1 day

  • Holiday lags: the only holiday in April 2018 was Easter on April 1st, at the start of the month, so the holiday lags are plus one day, plus two days and plus three days.

From the correlation test, we find that the 1-hour lag is most correlated (0.54), followed by the 2-hour lag (0.35) and the 1-day lag (0.28). These three time lag features are later included in the model as representatives.

ride.panel <- 
  ride.panel %>% 
  arrange(from_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 91,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays"),
         holidayLag = replace_na(holidayLag, "0"))
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 x 2
##   Variable   correlation
##   <fct>            <dbl>
## 1 lagHour           0.54
## 2 lag2Hours         0.35
## 3 lag3Hours         0.18
## 4 lag4Hours         0.11
## 5 lag12Hours       -0.17
## 6 lag1day           0.28
#add row index
ride.panel <- ride.panel %>% mutate(id = row_number())

3.4 Other Features

One time feature and three spatial features are added to the model. Since the pattern differs so much between weekdays and weekends, we expect a weekend dummy variable to improve the model. And because ridership may be tied to particular destinations, we measure proximity to open spaces and amenities to capture possible spatial autocorrelation.

3.4.1 Time Features

ride.panel <- ride.panel %>% 
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))

3.4.2 Spatial Features

ride.panel.geo.from <- 
  st_as_sf(ride.panel,coords = c('from_longitude','from_latitude'),crs=4326) %>% 
  st_transform(st_crs(bostonTracts))

ride.panel.geo.to <- 
  st_as_sf(ride.panel,coords = c('to_longitude','to_latitude'),crs=4326) %>% 
  st_transform(st_crs(bostonTracts))

  • water and parkways

This feature measures the proximity of the to-stations to water and the Emerald Necklace (only the linear parkway systems are included here). We assume that if a to-station is within roughly 500 meters of water or a parkway, riders' destinations are more likely to be these leisure locations.

The feature is divided to two categories: near water/parkways or not near water/parkways.

WaterParkways <- 
  st_read("https://opendata.arcgis.com/datasets/2868d370c55d4d458d4ae2224ef8cddd_7.geojson") %>%
  st_transform(st_crs(bostonTracts)) %>% 
  filter(TypeLong =="Parkways, Reservations & Beaches" )

WP.point <- st_cast(WaterParkways,"POINT") 

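# Note: coordinates here are geographic degrees, so dist.WP is in degrees;
# the 0.005 threshold below corresponds to roughly 500 m at Boston's latitude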
ride.panel.geo.to <- 
  ride.panel.geo.to %>%
  mutate(dist.WP = nn_function(st_coordinates(ride.panel.geo.to), 
                                  st_coordinates(WP.point), 1))

ride.panel.geo.to <- 
  ride.panel.geo.to %>%
  mutate(WP.cat = case_when(
                  dist.WP >= 0 & dist.WP < 0.005  ~ "near WP",
                  dist.WP >= 0.005  ~ "not near WP"))

  • open space

This feature aims to account for weekend leisure rides. We filter for parks larger than 15 acres and, as with water and parkways, assume that a to-station within a roughly 300-meter buffer of a park suggests the park is the rider's destination.

openspace <- 
  st_read("https://opendata.arcgis.com/datasets/2868d370c55d4d458d4ae2224ef8cddd_7.geojson") %>%
  st_transform(st_crs(bostonTracts)) %>% 
  filter(TypeLong =="Parks, Playgrounds & Athletic Fields") %>% 
  filter(ACRES >= 15)

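# the buffer distance is likewise in degrees: 0.003 deg is roughly 300 m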
openspaceBuffer <-
  st_buffer(openspace, 0.003) 

nearpark <- st_intersection(ride.panel.geo.to,openspaceBuffer) %>% 
  dplyr::select(id) %>% 
  st_drop_geometry() %>% 
  mutate(nearpark = 'yes')

ride.panel.geo.to <- left_join(ride.panel.geo.to,nearpark, by='id') 
ride.panel.geo.to$nearpark <- replace_na(ride.panel.geo.to$nearpark, 'no')

  • colleges and universities

There may be more bike rides around university areas, so we use two radii, 500 m and 1000 m, to cover them.

colleges <- 
  st_read("https://opendata.arcgis.com/datasets/cbf14bb032ef4bd38e20429f71acb61a_2.geojson") %>%
  st_transform(st_crs(bostonTracts)) 

ride.panel.geo.from <-
  ride.panel.geo.from %>% 
  mutate(dist.college =  nn_function(st_coordinates(ride.panel.geo.from), 
                      st_coordinates(st_centroid(colleges)),1))

ride.panel.geo.from <-
  ride.panel.geo.from %>%
  mutate(colleges.cat = case_when(
                  dist.college >= 0 & dist.college < 0.005  ~ "0-500m",
                  dist.college >= 0.005 & dist.college < 0.01  ~ "500-1000m",
                  dist.college >= 0.01 & dist.college < 50 ~ "outside"))

Join new features to ride.panel

ride.panel.geo.from.dat <- ride.panel.geo.from %>% 
  st_drop_geometry() %>% 
  dplyr::select(id,dist.college, colleges.cat)

ride.panel.geo.to.dat <- as.data.frame(ride.panel.geo.to) %>% 
  dplyr::select(id,nearpark,dist.WP,WP.cat)

ride.panel <- left_join(ride.panel,ride.panel.geo.from.dat, by='id')
ride.panel <- left_join(ride.panel,ride.panel.geo.to.dat, by='id')

4 Regression

4.1 Run Models

We split our data into a training and a test set: we train on the first half of April (weeks 13-15, which include the April 1st holiday) and test on the second half (weeks 16 and later). But as mentioned above, there was a striking spike on April 24th, which falls in the test set, so the model is likely to fail to predict it.

# Train on weeks 13-15 (April 1-15, including Easter); test on weeks 16+
ride.Train <- filter(ride.panel, week < 16)
ride.Test <- filter(ride.panel, week >= 16)

We create six linear models using the lm function:

  • reg 1: only time

  • reg 2: only space

  • reg 3: time and space

  • reg 4: time, space and time lag

  • reg 5: time, space, time lag and holiday lag

  • reg 6: time, space, time lag, holiday lag and new spatial features

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  from_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  from_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  from_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  from_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag1day + holidayLag + holiday, 
     data=ride.Train)

reg6 <- 
  lm(Trip_Count ~  from_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag1day + holidayLag + holiday+ weekend +
                    colleges.cat + nearpark + WP.cat , 
     data=ride.Train)

4.2 Predict for Test Data

Create a nested data frame of test data by week

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

Create a function called model_pred

model_pred <- function(dat, fit){
   predict(fit, newdata = dat)
}

Run Predictions

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
           FAll_Amenities = map(.x = data, fit = reg6, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions

4.3 K-fold Cross Validation

We run a 50-fold cross-validation to test generalizability.

The R-squared is about 0.16, suggesting a poor fit. One reason may be that a single month of data is quite small for bikeshare prediction, and hourly bikeshare volumes are low at most stations. In the following sections we discuss what the problems might be.

# use caret package cross-validation method
fitControl <- trainControl(method = "cv", 
                           number = 50,
                           savePredictions = TRUE)

set.seed(856)

# run the regression using k-fold cross-validation
reg.cv <- 
  train(Trip_Count ~ ., data = ride.panel %>% 
    dplyr::select(Trip_Count,from_station_name ,interval60 ,dotw , Temperature , Precipitation ,
                   lagHour , lag2Hours ,lag1day , holidayLag , holiday, weekend ,
                    colleges.cat , nearpark , WP.cat
                  ), 
     method = "lm", 
     trControl = fitControl, 
     na.action = na.pass)
reg.cv
## Linear Regression 
## 
## 93398 samples
##    14 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 84058, 84059, 84058, 84058, 84059, 84058, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE       
##   0.1950763  0.1649965  0.04153874
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

5 Test Generalizability

5.1. Examine Error Metrics for Accuracy

Including time, space, time lags and holiday lags in the model lowers the MAE and improves the regression. But adding the spatial features and the weekday/weekend time feature in reg 6 (FAll_Amenities) improves the model only marginally.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette6) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme
## Warning: Removed 6 rows containing missing values (geom_bar).

  • Predicted/observed bikeshare time series

The line plots show that the model significantly underpredicts the spikes; it performs better when predicting average bikeshare volumes.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_station_id = map(data, pull, from_station_id)) %>%
    dplyr::select(interval60, from_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -from_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Boston; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme
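
One way to quantify the underprediction of spikes (a sketch reusing week_predictions from above; the cutoff of 3+ observed trips per hour for a "spike" is our own assumption) is to compare MAE on high-volume hours against the rest:

# Compare MAE on spike hours vs. all other hours for model E
week_predictions %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
  dplyr::select(week, Observed, Prediction) %>%
  unnest() %>%
  mutate(spike = ifelse(Observed >= 3, "Spike hours (3+ trips)", "Other hours")) %>%
  group_by(spike) %>%
  summarize(MAE = mean(abs(Observed - Prediction), na.rm = TRUE))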

  • Plot Mean Absolute Errors by station

For most stations the error is low, but there is a cluster of higher errors in downtown Boston, plus one outlier. Since we underestimate the spikes and downtown is where bikeshare volume is highest, it is likely that we are underpredicting at the stations with the most trips.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_station_id = map(data, pull, from_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude)) %>%
    select(interval60, from_station_id, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
  group_by(from_station_id, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = bostonCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", alpha = 1)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme

5.2 Space-Time Error Evaluation

Compared to the perfect-fit line, our model underpredicts bikeshare trips. The slope difference is biggest during the PM rush, suggesting the errors are largest in that period.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_station_id = map(data, pull, from_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, from_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

  • Plot errors on a map by weekend/weekday and time of day

The outlier is still obvious in all of the maps, but errors are smaller overnight, when bikeshare trip volume is also smaller. The clustering of errors downtown is more pronounced on weekdays, while weekend errors are more evenly distributed.

Hence, we are underpredicting higher ridership volumes at downtown stations, especially during weekday rush hours. These are the cases we should target to improve our model.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_station_id = map(data, pull, from_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, from_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(from_station_id, weekend, time_of_day, from_longitude, from_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = bostonCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = from_longitude, y = from_latitude, color = MAE), 
             fill = "transparent", size = 1, alpha = 1)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
  xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

  • Plot Errors as a function of socio-economic variables

The line for Percent_Taking_Public_Trans is quite flat. This is a good sign: our model predicts equally well across populations with different access to public transportation. However, errors grow as median income and the white population share increase. Tying back to our underprediction at high-volume stations, it is plausible that higher-income, whiter neighborhoods are where more rides occur.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_station_id = map(data, pull, from_station_id), 
           from_latitude = map(data, pull, from_latitude), 
           from_longitude = map(data, pull, from_longitude),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, from_station_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(from_station_id, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-from_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

6 Conclusion

In sum, we would not recommend putting this model into production without further modification. The main purpose of the prediction is to forecast where ridership will be high so that operators can prepare for rebalancing, but our model significantly underpredicts the spikes, making it inaccurate for both the trend and the actual demand.

Our next steps to improve the model are:

  • Take a longer period of time into consideration
    We train our model on a very short period. Though the weekly time pattern is repetitive, there is not enough volatility in the training set. To increase generalizability, the model should include more historical data, expanding the dataset to half a year or even a full year.

  • Include more spatial features
    The model does not distinguish downtown from other areas of Boston, and the clustering of errors there suggests spatial autocorrelation. Other spatial features, such as elevation (proximity to hills), proximity to metro stations (as categorical features, to avoid collinearity with the station name), and proximity to other amenities, might also help; see the sketch after this list.

  • Consider customer features
    The gender, age and membership status of customers likely influence their behavior and preferences. For example, younger cohorts may be more likely to ride during weekday rush hours.
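
As a minimal sketch of the downtown indicator proposed above (the centroid coordinates and the 1500 m cutoff are illustrative assumptions, not derived from the data):

# Hypothetical downtown indicator: distance from each origin station to an
# assumed downtown centroid near Back Bay, cut at 1500 m
downtown <- st_sfc(st_point(c(-71.07, 42.35)), crs = 4326) %>%
  st_transform(st_crs(ride.panel.geo.from))

ride.panel.geo.from <- ride.panel.geo.from %>%
  mutate(dist.downtown = as.numeric(st_distance(geometry, downtown)),
         downtown.cat = ifelse(dist.downtown < 1500, "downtown", "not downtown"))

As with the other spatial features, dist.downtown and downtown.cat would then be joined back to ride.panel by id before refitting the regressions.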

Also, it should be noted that because of the pandemic, historical data is becoming less reliable. The previous model and feature engineering might not explain time-space patterns in the post-pandemic world.