Teammate: Jessica Greene

1 Introduction

According to crime data published on the Chicago Open Data portal, there were 9,157 motor vehicle thefts in 2017. Because motor vehicle theft is correlated with the unemployment rate and with unguarded parking spaces, a surge in motor vehicle theft was reported during the pandemic. In this report, we therefore build a motor vehicle theft prediction model based on 2017 crime data and environmental risk factors.

Looking at the non-white to white arrest rate ratio in 2001, the top five crimes are gambling (6.451), robbery (5.733), murder (4.933), stolen property (3.282) and motor vehicle theft (3.261)1. For these crimes, significantly more non-whites are arrested than whites. Among property crimes, which include burglary, larceny-theft, motor vehicle theft and arson, motor vehicle theft has the highest non-white to white arrest rate ratio. From these data, we assume there may be racial bias in motor vehicle theft policing.

Through our research, we found that motor vehicle thefts are more likely to happen in neglected places, unguarded parking lots, and at car dealerships and car repair shops, where people tend to leave their keys inside the vehicle.

In this model, we focus on environments that are “unobserved”, or in Jane Jacobs’s words, places without “eyes on the street”. By focusing on the physical environment itself and excluding the racial and income context of each area, we try to depict the “crime scene” rather than the socio-economic context of crime.

2 Set-up

2.1 Setup

load('D:/Rdata/508_Homework3/workspace/esther_workspace.RData')

2.2 Read in Data from Chicago - Motor Vehicle Thefts

Loading boundary data and crime data for Chicago.

  • Boundaries: Chicago city boundary, Police Districts, Police Beats

  • Crime Data: Motor Vehicle Theft (2017)

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
  
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

mvtheft <-
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>% 
    filter(Primary.Type == "MOTOR VEHICLE THEFT" & Description == "AUTOMOBILE") %>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

2.3 Visualizing Motor Vehicle Theft

There is obvious clustering of motor vehicle thefts in some parts of the city. Referring to community area boundaries, the top five communities with the most auto thefts in 2017 were Austin, West Town, Humboldt Park, North Lawndale and Near West Side2.
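To double-check this ranking, a minimal sketch is shown below. It assumes a communityAreas sf polygon layer of Chicago community areas (with a community name column) has been read from the Chicago Open Data portal and transformed to the same coordinate system as mvtheft; communityAreas is not created elsewhere in this report.

# hypothetical check of the top-5 claim: spatially join thefts to their
# community area and count thefts per community (communityAreas is assumed)
mvtheft %>%
  st_join(dplyr::select(communityAreas, community)) %>%
  st_drop_geometry() %>%
  count(community, sort = TRUE) %>%
  head(5)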

# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary) +
  geom_sf(data = mvtheft, colour="red", size=0.1, show.legend = "point") +
  labs(title= "Motor Vehicle Theft, Chicago - 2017") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(mvtheft)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Motor Vehicle Theft") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

2.4 Creating a fishnet grid

A fishnet grid is created in place of fixed administrative boundaries, which may not capture conditions that vary continuously across space. The cell size of the fishnet is 500 meters.

fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>% 
  st_sf() %>%
  mutate(uniqueID = rownames(.))

2.5 Aggregate points to the fishnet

Motor vehicle theft points are aggregated to the fishnet, producing a theft count per cell.

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(mvtheft) %>% 
  mutate(countmvtheft = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countmvtheft = replace_na(countmvtheft, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24),             # one of ~n/24 random fold IDs per cell (~24 cells per fold) for k-fold CV
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countmvtheft), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of MVTheft for the fishnet") +
  mapTheme()

3 Modeling Spatial Features

3.1 Data Gathering

To depict the crime scene, we used 311 reports from the Chicago Open Data portal and points of interest from OpenStreetMap.

Below is the list of features included:

  • Risk factors from 311 reports
    Abandoned cars, abandoned buildings, graffiti, street lights out, sanitation complaints, liquor retail, garbage carts

  • Points of interest from OpenStreetMap
    Parking meters (for street parking), parking lots, supermarkets (for possible large unguarded parking spaces), car dealerships, car repair shops

## pull each risk factor from Chicago Open Data via Socrata,
## reduce it to point coordinates, and tag it with a Legend label
abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")

abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")

parking_meters <- 
  st_read('D:/Rdata/508_Homework3/data and references/chicagoparkingmeters.geojson') %>% 
  dplyr::select(geometry) %>% 
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "parking_meters")

garbage_cart <- read.socrata("https://data.cityofchicago.org/resource/cry7-g5xt.json") %>%  
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "garbage_carts")

traffic <-
  read.socrata("https://data.cityofchicago.org/resource/pfsx-4n4m.json") %>% 
  dplyr::select(Y = latitude, X = longitude,total_passing_vehicle_volume) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "traffic_count")

library(osmdata)

# parking lots (single points)
parking <- getbb('Chicago') %>% 
  opq() %>% 
  add_osm_feature('amenity','parking') %>% 
  osmdata_sf()
parking <- parking$osm_points %>% 
  dplyr::select(geometry) %>% 
  st_transform(st_crs(fishnet)) %>%
  st_intersection(.,chicagoBoundary) %>% 
  dplyr::select(geometry) %>% 
  mutate(Legend = "parking")

# supermarkets (points)
supermarkets <- getbb('Chicago') %>% 
  opq() %>% 
  add_osm_feature('building','supermarket') %>% 
  osmdata_sf() 
supermarkets <- supermarkets$osm_points %>% 
  dplyr::select(geometry)%>% 
  st_transform(st_crs(fishnet)) %>%
  st_intersection(.,chicagoBoundary) %>% 
  dplyr::select(geometry) %>% 
  mutate(Legend = "supermarkets")

#car dealership
car_shop <- getbb('Chicago') %>% 
  opq() %>% 
  add_osm_feature('shop','car') %>% 
  osmdata_sf() 
car_shop <- car_shop$osm_points %>% 
  dplyr::select(geometry)%>% 
  st_transform(st_crs(fishnet)) %>%
  st_intersection(.,chicagoBoundary) %>% 
  dplyr::select(geometry) %>% 
  mutate(Legend = "car_shops")

#car Repair
car_repair <- getbb('Chicago') %>% 
  opq() %>% 
  add_osm_feature('shop','car_repair') %>% 
  osmdata_sf() 
car_repair <- car_repair$osm_points %>% 
  dplyr::select(geometry)%>% 
  st_transform(st_crs(fishnet)) %>%
  st_intersection(.,chicagoBoundary) %>% 
  dplyr::select(geometry) %>% 
  mutate(Legend = "car_repair")


## Neighborhoods to use in LOGO-CV in a bit
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 

3.2 Aggregating Features to Fishnet

Features are aggregated to the fishnet in two ways: by counting the points that fall within each cell, and by calculating each cell’s average distance to its nearest feature points.

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        parking_meters, garbage_cart,
        liquorRetail, graffiti, sanitation, parking, 
        supermarkets, car_shop, car_repair ) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

 traffic_net <- st_join(traffic,fishnet,join=st_within) %>% 
   st_drop_geometry() %>% 
   group_by(uniqueID,Legend) %>% 
   summarise(count_traffic=sum(as.numeric(total_passing_vehicle_volume))) %>% 
   dplyr::select(uniqueID,count_traffic)
 
vars_net <- left_join(vars_net,traffic_net,by='uniqueID') %>% 
   mutate(count_traffic=replace_na(count_traffic,0)) 

3.2.1 Map Risk Factors by Counts

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol =3, top = "Risk Factors by Fishnet"))

3.2.2 Map Risk Factors by Nearest Neighbor Distance

# average distance from each point in measureFrom to its k nearest
# neighbors in measureTo (uses FNN::get.knnx)
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%   # mean of the k distances
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

st_c <- st_coordinates
st_coid <- st_centroid

library(stringr)


# drop NA values from the new features and extract their coordinates
parking_meters.c <- data.frame(st_coordinates(parking_meters)) %>%
                                 na.omit()
garbage_cart.c <- data.frame(st_coordinates(garbage_cart)) %>%
                                 na.omit() 

parking.c <- data.frame(st_coordinates(parking)) %>%
                                 na.omit()
supermarkets.c <- data.frame(st_coordinates(supermarkets)) %>%
                                 na.omit()
car_shop.c <- data.frame(st_coordinates(car_shop)) %>%
                                 na.omit()
car_repair.c <- data.frame(st_coordinates(car_repair)) %>%
                                 na.omit()
vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_coordinates(st_centroid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_coordinates(st_centroid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_coordinates(st_centroid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_coordinates(st_centroid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_coordinates(st_centroid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_coordinates(st_centroid(vars_net)), st_c(sanitation),3),
      Parking_meters.nn =
        nn_function(st_coordinates(st_centroid(vars_net)), parking_meters.c,3),
      Garbage_cart.nn =
        nn_function(st_c(st_coid(vars_net)), garbage_cart.c,3),
      Parking.nn =
        nn_function(st_c(st_coid(vars_net)), parking.c,3),
      Supermarkets.nn =
        nn_function(st_c(st_coid(vars_net)), supermarkets.c,3),
      Car_shop.nn =
        nn_function(st_c(st_coid(vars_net)), car_shop.c,3),
      Car_repair.nn =
        nn_function(st_c(st_coid(vars_net)), car_repair.c,3))

#Plot Nearest Neighbor

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol = 3, top = "Nearest Neighbor risk Factors by Fishnet"))

3.3 Distance to CBD

The Loop is the central business district of Chicago. The distance from each cell to the Loop is calculated here as a feature measuring distance to the city center.

loopPoint <-
  filter(neighborhoods, name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric() 

3.4 Create the Final Net

final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

3.5 Spatial Process of Motor Vehicle Thefts

Local Moran’s I is used in this part to analyze local-scale spatial clustering of crime, and its p-value is used to identify motor vehicle theft hot spots in the city. After comparing different p-value thresholds, we found that a threshold of 0.0001 performed best when accounting for the spatial process in our model.
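For reference, localmoran() computes the standard local Moran’s I statistic for each cell i, applied here to the theft count x:

I_i = \frac{x_i - \bar{x}}{S_i^2} \sum_{j \ne i} w_{ij}\,(x_j - \bar{x}), \qquad S_i^2 = \frac{\sum_{j \ne i} (x_j - \bar{x})^2}{n - 1}

where the w_{ij} are the row-standardized queen-contiguity weights built below. Large positive values with very small p-values mark cells whose high counts are surrounded by other high counts.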

# build queen-contiguity neighbors and row-standardized spatial weights for the fishnet
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

# visualize the local spatial process of auto theft

final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countmvtheft, final_net.weights)),
    as.data.frame(final_net)) %>% 
    st_sf() %>%
      dplyr::select(mvtheft_Count = countmvtheft, 
                    Local_Morans_I = Ii, 
                    P_Value = `Pr(z > 0)`) %>%
      mutate(Significant_Hotspots = ifelse(P_Value <= 0.0001, 1, 0)) %>%
      gather(Variable, Value, -geometry)
  
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, MV Theft"))

The distance from each cell to its nearest motor vehicle theft hot spot is added to the model as a spatial process feature.

final_net <-
  final_net %>% 
  mutate(mvtheft.isSig = 
           ifelse(localmoran(final_net$countmvtheft, 
                             final_net.weights)[,5] <= 0.0001, 1, 0)) %>%
  mutate(mvtheft.isSig.dist = 
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(
                         filter(final_net, mvtheft.isSig == 1))), 1))

#plot distance to significant MV theft hotspots

ggplot() +
  geom_sf(data = final_net, aes(fill = mvtheft.isSig.dist), color = NA) +
  scale_fill_viridis() +
  labs(title = "Distance to highly significant local MV theft hotspots") +
  mapTheme()

3.6 Correlation of Different Features

Correlations between the motor vehicle theft count and both the count and nearest neighbor (nn) versions of each feature are displayed in the following plots.

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -loopDistance, -name, -District,-count_traffic) %>%
    gather(Variable, Value, -countmvtheft)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countmvtheft, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countmvtheft)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "MV THEFT count as a function of risk factors") +
  plotTheme()

3.7 Distribution of the Dependent Variable (mvTheft)

The distribution of the dependent variable fits a Poisson distribution: most cells in the Chicago fishnet have zero or very few motor vehicle thefts.

crime_distribution <- crime_net %>% st_drop_geometry() %>% 
  dplyr::select(countmvtheft)

ggplot(data.frame(crime_distribution), aes(x=countmvtheft)) +
  geom_bar()+
  labs(title = "MVTheft Distribution",
       subtitle = 'A histogram of Motor Vehicle Theft Count') +
  plotTheme()
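As a quick numerical check of the distribution claim above, the sketch below compares the mean and variance of the cell counts. For a Poisson variable these are roughly equal; a variance much larger than the mean would indicate overdispersion.

# sketch: dispersion check on the fishnet counts (a ratio near 1 is consistent with Poisson)
crime_net %>%
  st_drop_geometry() %>%
  summarize(mean_count = mean(countmvtheft),
            var_count  = var(countmvtheft),
            dispersion = var_count / mean_count)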

4 Regression

4.1 Building Regressions

Features included in the final regression are listed below; the Poisson model form that uses them is sketched after the list.

  • Risk factors: distance to abandoned buildings, distance to abandoned cars, distance to graffiti, distance to liquor retail, distance to street lights out, distance to sanitation complaints, distance to garbage carts

  • Parking-related factors: distance to parking meters, distance to parking lots, distance to supermarkets, distance to car dealerships, distance to car repair shops

  • Distance to CBD: distance to the Loop

  • Spatial process: distance to the nearest mvTheft hot spot
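With these nearest neighbor features as predictors, each regression below fits a Poisson GLM (glm(..., family = "poisson")) of the form

\log \mathbb{E}[\text{countmvtheft}_i] = \beta_0 + \sum_k \beta_k x_{ik}

where x_{ik} is the value of feature k in fishnet cell i, so each coefficient acts multiplicatively (e^{\beta_k}) on the expected theft count.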

reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "Parking_meters.nn", "Garbage_cart.nn", 
              "Parking.nn", "Supermarkets.nn", "Car_shop.nn", "Car_repair.nn", 
              "loopDistance")

#book says we can add neighborhood name and police district to this
reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "Parking_meters.nn", "Garbage_cart.nn", 
              "Parking.nn", "Supermarkets.nn", "Car_shop.nn", "Car_repair.nn", 
                 "loopDistance", "mvtheft.isSig", "mvtheft.isSig.dist")

4.2 Cross Validation

k-fold cross-validation and leave-one-group-out (LOGO) cross-validation are used here to test the accuracy and generalizability of the final regression model.

# generic cross-validation helper: hold out one fold (a random cvID or a
# neighborhood) at a time, fit the Poisson regression on the remaining cells,
# and predict on the hold-out fold
crossValidate <- function(dataset, id, dependentVariable, indVariables) {

allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])

for (i in cvID_list) {

  thisFold <- i
  cat("This hold out fold is", thisFold, "\n")

  fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  
  regression <-
    glm(countmvtheft ~ ., family = "poisson", 
      data = fold.train %>% 
      dplyr::select(-geometry, -id))
  
  thisPrediction <- 
    mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
  allPredictions <-
    rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}
reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countmvtheft",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countmvtheft, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countmvtheft",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countmvtheft, Prediction, geometry)
  
reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countmvtheft",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countmvtheft, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countmvtheft",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countmvtheft, Prediction, geometry)

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countmvtheft,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countmvtheft,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countmvtheft,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countmvtheft,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 

4.3 Accuracy and Generalizability

4.3.1 Accuracy

The results in the following histogram and table suggest that including the spatial process improved our model. Compared with our previous baseline regression (not presented here), which included only the six risk features from the 311 reports, the mean MAE dropped from 0.50 to 0.47 and the SD of MAE dropped from 0.42 to 0.36 under k-fold cross-validation. We therefore conclude that including the parking-related factors improved the model.

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countmvtheft, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
         x="Mean Absolute Error", y="Count") 

Create a table of the mean and standard deviation of MAE across folds.

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(caption = "MAE by regression") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF") 

MAE by regression

Regression                             Mean_MAE  SD_MAE
Random k-fold CV: Just Risk Factors        0.57    0.44
Random k-fold CV: Spatial Process          0.47    0.36
Spatial LOGO-CV: Just Risk Factors         1.35    1.32
Spatial LOGO-CV: Spatial Process           1.06    1.04

The map below visualizes where the higher errors occur.

error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "Mvtheft errors by LOGO-CV Regression") +
    mapTheme() + theme(legend.position="bottom")

It is possible that accounting for the local spatial process removes most of the spatial variation in the motor vehicle theft count errors. To test this, Moran’s I is calculated on the absolute LOGO-CV errors: the p-value is higher, and Moran’s I lower, for the regression that includes the spatial process, suggesting less spatial autocorrelation remains in its errors.
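moran.mc() runs a permutation test on the standard global Moran’s I statistic, computed here on the absolute fold errors e:

I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}\,(e_i - \bar{e})(e_j - \bar{e})}{\sum_i (e_i - \bar{e})^2}

where the w_{ij} are the neighborhood contiguity weights built below; values near zero indicate little remaining spatial clustering in the errors.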

neighborhood.weights <-
  filter(error_by_reg_and_fold, Regression == "Spatial LOGO-CV: Spatial Process") %>%
    group_by(cvID) %>%
      poly2nb(as_Spatial(.), queen=TRUE) %>%
      nb2listw(., style="W", zero.policy=TRUE)

filter(error_by_reg_and_fold, str_detect(Regression, "LOGO"))  %>% 
    st_drop_geometry() %>%
    group_by(Regression) %>%
    summarize(Morans_I = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[1]],
              p_value = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[3]]) %>%  
    kable(caption = "Spatial Process") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF")

Spatial Process

Regression                            Morans_I  p_value
Spatial LOGO-CV: Just Risk Factors   0.2087628    0.002
Spatial LOGO-CV: Spatial Process     0.1291231    0.019

st_drop_geometry(reg.summary) %>%
  group_by(Regression) %>%
    mutate(mvtheft_Decile = ntile(countmvtheft, 10)) %>%
  group_by(Regression, mvtheft_Decile) %>%
    summarize(meanObserved = mean(countmvtheft, na.rm=T),
              meanPrediction = mean(Prediction, na.rm=T)) %>%
    gather(Variable, Value, -Regression, -mvtheft_Decile) %>%          
    ggplot(aes(mvtheft_Decile, Value, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = mvtheft_Decile), colour = "black") +
      scale_shape_manual(values = c(2, 17)) +
      facet_wrap(~Regression) + xlim(0,10) +
      labs(title = "Predicted and observed MV theft by observed burglary decile")

4.3.2 Generalizability by Race Context

Generally, the model underestimates motor vehicle thefts in majority non-white neighborhoods and overestimates them in majority white neighborhoods. This could, to some extent, mitigate over-policing in non-white neighborhoods.

tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2018, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]
reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
    st_centroid() %>%
    st_join(tracts18) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error) %>%
      kable(caption = "Mean Error by neighborhood racial context") %>%
        kable_styling("striped", full_width = F) 

Mean Error by neighborhood racial context

Regression                            Majority_Non_White  Majority_White
Spatial LOGO-CV: Just Risk Factors            -0.3227800       0.3188473
Spatial LOGO-CV: Spatial Process              -0.2699454       0.2693431

5 Comparison with Traditional Crime Hotspots

Traditional crime hot spot maps are generated by kernel density estimation. To compare our model with this traditional approach, we trained both on 2017 crime data and tested them against 2018 crime data to see which generalizes better.
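For reference, density.ppp() with sigma = 1000 estimates the theft intensity at a location s with a Gaussian kernel (setting aside the edge correction it applies by default):

\hat{\lambda}(s) = \sum_{i=1}^{n} \frac{1}{2\pi\sigma^2} \exp\!\left( -\frac{\lVert s - s_i \rVert^2}{2\sigma^2} \right)

where the s_i are the 2017 theft locations and \sigma = 1000 meters is the kernel bandwidth.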

5.1 Kernel Density Map

Create Kernel density map

mvtheft_ppp <- as.ppp(st_coordinates(mvtheft), W = st_bbox(final_net))
mvtheft_KD <- spatstat::density.ppp(mvtheft_ppp, 1000)

as.data.frame(mvtheft_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(mvtheft, 1500), size = .5) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2017 MV Theft") +
     mapTheme()

Load 2018 motor vehicle theft data.

#download mvtheft data 2018

mvtheft18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "MOTOR VEHICLE THEFT" & 
         Description == "AUTOMOBILE") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

mvtheft_KDE_sf <- as.data.frame(mvtheft_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(mvtheft18) %>% mutate(mvtheftCount = 1), ., sum) %>%
    mutate(mvtheftCount = replace_na(mvtheftCount, 0))) %>%
  dplyr::select(label, Risk_Category, mvtheftCount)

Repeat process for risk predictions

mvtheft_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(mvtheft18) %>% mutate(mvtheftCount = 1), ., sum) %>%
      mutate(mvtheftCount = replace_na(mvtheftCount, 0))) %>%
  dplyr::select(label,Risk_Category, mvtheftCount)

Generate map of risk categories with mvtheft18 overlaid.

rbind(mvtheft_KDE_sf, mvtheft_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(mvtheft18, 3000), size = .5, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 MVTHEFT risk predictions; 2018 MV Thefts") +
    mapTheme()

Calculate the rate of 2018 theft points captured by each risk category and model type.

rbind(mvtheft_KDE_sf, mvtheft_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countmvtheft = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countmvtheft / sum(countmvtheft)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE) +
      labs(title = "Risk prediction vs. Kernel density, 2018 MV Theft") +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

This plot suggests that our model predicted better than the kernel density model in the highest risk category (90% to 100%) and in the mid-to-low risk categories (30% to 49%).

6 Conclusion

Overall, this model successfully identified features related to motor vehicle theft and made strong predictions for 2018. We recommend putting this algorithm into production for the following reasons:

  • The accuracy of the model is good
    The mean MAE from cross-validation is 0.47 and the SD of MAE is 0.36, both noticeably lower than the baseline regression. The features included in the model successfully account for the risk factors, the spatial process and the parking-related factors in the prediction.

  • The model might mitigate racial bias in motor vehicle theft policing
    As mentioned in the introduction, motor vehicle theft may suffer from greater racial bias than other property crimes. Our model potentially mitigates this bias by helping to prevent over-policing of majority non-white neighborhoods.

  • The model generalizes well to motor vehicle theft hot spots
    The model performs well in motor vehicle theft prediction, especially in the highest risk category.

However, selection biases are still present in this model. Risk factors that describe street conditions (graffiti, garbage carts, sanitation complaints, and so on) may be reported unevenly across different areas of the city. In addition, the condition of the parking space features used in the model is not taken into account: we do not have the data to differentiate parking spaces as busy versus derelict, or guarded versus unguarded.

To optimize predictions across all risk categories, we suggest that the police department combine and cross-validate the predictions produced by this model and by the traditional kernel density model for future forecasting.

References

  1. Piquero, Alex R., and Robert W. Brame. “Assessing the Race–Crime and Ethnicity–Crime Relationship in a Sample of Serious Adolescent Delinquents.” Crime & Delinquency 54, no. 3 (July 2008): 390–422. https://doi.org/10.1177/0011128707307219. Table 1. Data originally from “Age-Specific Arrest Rates and Race-Specific Arrest Rates for Selected Offenses, 1991-2001” (retrieved September 10, 2005, from http://www.fbi.gov/ucr/addpubs.htm).

  2. https://www.illinoisvehicle.com/about-us/blog/2016-chicago-auto-thefts/