Introducing ‘GridPred’ – A Blog for Data Stuff

Spatial Crime Prediction Models

There is no shortage of crime prediction models available today. In fact, prediction models are fairly ubiquitous nowadays in almost every field. Crime prediction specifically has an unusual place in the public consciousness, primarily due to fanciful allusions to books like The Minority Report. However, at its core, crime prediction does not differ much from other types of prediction models. Crime, like many other human processes, follows consistent, repeatable patterns (McDowall, Loftin, and Pate 2012). People congregate in similar places during specific times of the year. Certain places contribute to crime by providing opportunities (like parked vehicles to break into) or excuses (like bar districts where intoxicated people may interact). Prediction models can take advantage of these patterns and provide some insight into the drivers and causes.

However, making prediction models useful is a different issue altogether. One of my biggest complaints when I was in the academic world was that generating highly complex models was often easy, while using them in a real way, with real on-the-ground officers, was much more difficult. The translation from model to action is likely the most difficult step.

My goal for this project was to put together a fairly simple, minimal, and effective workflow for turning disparate sets of data into something that can be used in a machine learning model (a sort of plug-and-play model). I wanted to avoid unnecessary complexity, because there is ample research showing that prior crime counts are often the best predictor of future crime events (Gorr, Olligschlaeger, and Thompson 2003). The workflow described below is, in fact, inspired by earlier work that was demonstrated in Dallas (Wheeler and Steenbeek 2021). I also wanted the results to be simple and clean, and to allow for quick assessment without too much additional fuss. Finally, the inputs should be general enough that, if additional work is needed, using them in another format is easy.

Working with `GridPred`

With all this in mind, I tried to define a fairly limited Python workflow for my project¹. At an absolute minimum, GridPred does three things: (1) creates a spatial grid over a set of input points (like crimes), (2) splits them into a training and an evaluation set, and (3) allows users to pass them into a machine learning model. For a lot of use cases, this might get someone 80-90% of the way to a completed analysis.
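
To make step (1) concrete, here is a minimal sketch of how points in a projected coordinate system could be binned into square grid cells and counted. This is not GridPred's internal code; the function names `assign_grid_cells` and `count_per_cell` and the sample coordinates are assumptions made up for illustration.

```python
import numpy as np

def assign_grid_cells(x, y, cell_size):
    """Map projected point coordinates to integer (column, row) grid indices."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Anchor the grid at the lower-left corner of the point extent
    x0, y0 = x.min(), y.min()
    col = np.floor((x - x0) / cell_size).astype(int)
    row = np.floor((y - y0) / cell_size).astype(int)
    return col, row

def count_per_cell(col, row):
    """Count the points that fall in each occupied grid cell."""
    cells, counts = np.unique(
        np.column_stack([col, row]), axis=0, return_counts=True
    )
    return {tuple(int(v) for v in cell): int(n) for cell, n in zip(cells, counts)}

# Four hypothetical points in projected units, binned into 400-unit cells
xs = [0.0, 100.0, 450.0, 900.0]
ys = [0.0, 50.0, 10.0, 820.0]
col, row = assign_grid_cells(xs, ys, cell_size=400)
cell_counts = count_per_cell(col, row)
print(cell_counts)
```

A real workflow would normally anchor the grid to the study region's bounding box rather than to the point extent, so that empty cells are represented as zeros.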

Let’s start with a quick walkthrough. Below, we import the key modules and define the inputs. The primary functions are all contained in the core GridPred class under gridpred.prediction. The library also contains a number of optional helper functions to aid with model fitting, visualization, and metrics calculations.

For importing data, at a minimum, you need to provide either a pandas dataframe or a CSV file containing a longitudinal set of crime data with a date field, longitude and latitude fields, and at least 3 distinct time periods:

# import all relevant functions
import pandas as pd
import numpy as np
from gridpred.evaluate.metrics import evaluate, pai, pei, rri
from gridpred.model.random_forest import RandomForestGridPred
from gridpred.plotting import visualize_predictions
from gridpred.prediction import GridPred

# inputs
crime_data = "input/hartford_robberies.csv"
predictor_features = "input/hartford_pois.csv"
region_shapefile = "input/hartford.shp"
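
As a rough illustration of the minimum input described above, the sketch below builds a tiny in-memory CSV and checks that it has a time field, coordinate fields, and at least three distinct periods. The column names `longitude` and `latitude`, the sample rows, and the helper `check_minimum_input` are all hypothetical; only the `year` field name comes from this walkthrough.

```python
import csv
import io

# Hypothetical minimal input: one row per incident, with a time field
# and longitude/latitude fields (column names and values are illustrative)
raw = io.StringIO(
    "year,longitude,latitude\n"
    "2017,-72.68,41.76\n"
    "2018,-72.69,41.77\n"
    "2018,-72.67,41.75\n"
    "2019,-72.68,41.78\n"
)

def check_minimum_input(handle, time_col="year",
                        lon_col="longitude", lat_col="latitude"):
    """Return the sorted distinct periods, or raise if the input is too thin."""
    reader = csv.DictReader(handle)
    missing = {time_col, lon_col, lat_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    periods = sorted({row[time_col] for row in reader})
    if len(periods) < 3:
        raise ValueError("need at least 3 distinct time periods")
    return periods

periods = check_minimum_input(raw)
print(periods)
```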

If possible, you should also provide additional predictor features in the same format as the input crime data. If you have a shapefile that marks the boundary of your study area, you can pass that as well to make sure the metrics calculated correspond strictly to the region’s boundaries.

# define variable names
time_var = "year"
features_var = "types"

# spatial projections
# includes the coordinate reference system of the input crime data
# as well as a desired projection for all spatial objects
input_crime_crs = 3508
projected_crs = 3508

# size of the grid to use (in units based on projection)
grid_size = 400


# This initializes the GridPred class with the specified data
gridpred = GridPred(
    input_crime_data=crime_data,
    input_features_data=predictor_features,
    input_study_region=region_shapefile,
    crime_time_variable=time_var,
    features_names_variable=features_var,
    input_crs=input_crime_crs,
)

The prepare_data function takes all of the inputs and creates a tabular dataset X that can be passed as the predictor features in a machine learning model. Using this demo data, printing the generated regression input shows that it created a feature set with the counts of 2017 robberies, as well as the nearest distance to a variety of potentially criminogenic features (gas stations, bars, night clubs, etc.). The outcome used to fit the model is stored as y, and the hold-out evaluation set is stored as eval.

# This generates the input to the regression model
gridpred.prepare_data(
    grid_cell_size=grid_size,
    do_projection=True,
    projected_crs=projected_crs
)

# Look at the first 5 rows of the predictor matrix,
# which is stored on the instance as `X`
gridpred.X.head(5).round(2)

	liquor_store	bar	gas_station	restaurant	pharmacy	night_club	x	y
231	641.01	648.26	2878.30	4489.75	5964.84	9438.41	1010113.10	824653.91
308	254.13	255.45	2915.07	4217.04	5942.46	9098.14	1010511.83	824679.61
154	970.47	979.03	2892.32	4732.11	6006.24	9726.21	1009781.55	824629.96
232	751.07	744.97	2534.31	4231.56	5618.71	9241.02	1010119.00	825000.00
309	467.13	449.17	2605.54	3962.67	5622.69	8907.88	1010519.00	825000.00
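
The distance columns in the table above are nearest-distance features: for each grid cell, the distance to the closest point of a given type. A minimal sketch of that calculation with plain NumPy is shown here. It assumes distances are measured from cell centroids in projected units, which may not match what GridPred does internally; `nearest_distance` and the sample coordinates are hypothetical.

```python
import numpy as np

def nearest_distance(cell_xy, poi_xy):
    """Euclidean distance from each grid-cell centroid to its closest point of interest."""
    cell_xy = np.asarray(cell_xy, dtype=float)
    poi_xy = np.asarray(poi_xy, dtype=float)
    # Pairwise differences via broadcasting: shape (n_cells, n_pois, 2)
    diff = cell_xy[:, None, :] - poi_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Keep only the closest point of interest for each cell
    return dist.min(axis=1)

# Hypothetical centroids and bar locations in projected units
centroids = [(0.0, 0.0), (400.0, 0.0), (800.0, 0.0)]
bars = [(300.0, 400.0), (800.0, 30.0)]
bar_distance = nearest_distance(centroids, bars)
print(bar_distance.round(2))
```

The brute-force pairwise approach uses memory proportional to the number of cells times the number of points, so a spatial index would be a better fit for large study areas.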

By default, when creating the predictor dataset, GridPred sets aside \(t-2\) as the outcome for training the model, and \(t-1\) for evaluation. All remaining time periods are used as count features in the model. Here, the most recent period in the data is 2019, so 2019 robberies are held out for evaluation, 2018 robberies are the training outcome, and 2017 counts enter as a feature.
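
The period split described above can be sketched as follows, assuming each event has already been assigned a grid cell index. The function `split_periods` and the sample events are hypothetical and only illustrate the logic of using the two most recent periods as training outcome and evaluation set.

```python
import numpy as np

def split_periods(cell_ids, periods, n_cells):
    """Build per-cell counts by period, then split into features, training outcome, and evaluation."""
    cell_ids = np.asarray(cell_ids)
    periods = np.asarray(periods)
    ordered = np.unique(periods)  # sorted distinct periods
    if ordered.size < 3:
        raise ValueError("need at least 3 distinct time periods")
    # counts[i, j] = number of events in cell i during period ordered[j]
    counts = np.zeros((n_cells, ordered.size), dtype=int)
    period_index = np.searchsorted(ordered, periods)
    np.add.at(counts, (cell_ids, period_index), 1)
    features = counts[:, :-2]  # all earlier periods become count features
    y_train = counts[:, -2]    # second most recent period: training outcome
    y_eval = counts[:, -1]     # most recent period: held out for evaluation
    return features, y_train, y_eval

# Hypothetical events: the grid cell each fell in and the year it occurred
cells = [0, 0, 1, 2, 2, 2, 1]
years = [2017, 2017, 2017, 2018, 2018, 2019, 2019]
features, y_train, y_eval = split_periods(cells, years, n_cells=3)
print(features.ravel(), y_train, y_eval)
```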

Predictor Models

For the actual modeling portion, users can specify any model that accepts a tabular input format. For simplicity, we can just use RandomForestGridPred, which is primarily a wrapper around a scikit-learn Random Forest model with some convenience functions added on top. But you could just as easily import any other model and run it separately. Here, I just define a very simple model with 500 trees and a Poisson criterion. For this scenario, the model fits very quickly.

# very basic demo model workflow
# can replace with xgboost or whatever model
X = gridpred.X
y = gridpred.y

rf = RandomForestGridPred(
    n_estimators=500, criterion="poisson", random_state=42
)
rf.fit(X, y)
y_pred = rf.predict(X)

And because the base model derives from scikit-learn, we have access to the same underlying attributes, such as feature importances. For now, we can just print them out manually (however, there will be some more automated reporting available down the road).

# print feature importances
# TODO: in future, can be logged and plotted
importances = pd.Series(rf.get_feature_importances(), index=X.columns)
print(importances.sort_values(ascending=False))

events_2017     0.270618
liquor_store    0.240522
pharmacy        0.089078
restaurant      0.084547
gas_station     0.081187
night_club      0.072135
bar             0.071431
y               0.053238
x               0.037243
dtype: float64

We can also rely on functions under gridpred.plotting to visually assess the predictions on a gridded map. By default, this returns the results on an OpenStreetMap basemap. The predictions are on the original scale of the input, which in this case is a count variable. For flagging hotspots, we can select the highest percentile grid cells based on the predicted values.

# plotting
region_grid = gridpred.region_grid
visualize_predictions(region_grid, y_pred)
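
For the hotspot flagging mentioned above, one simple option is to mark the top share of grid cells by predicted value. The sketch below assumes a flat array of predictions, one per cell; `flag_hotspots` and the sample values are hypothetical and not part of the library.

```python
import numpy as np

def flag_hotspots(y_pred, top_fraction=0.01):
    """Return a boolean mask marking the top share of cells by predicted value."""
    y_pred = np.asarray(y_pred, dtype=float)
    # Always flag at least one cell, rounding the share up
    k = max(1, int(np.ceil(y_pred.size * top_fraction)))
    # Indices of the k largest predictions (ties broken by sort order)
    top_idx = np.argsort(y_pred)[-k:]
    mask = np.zeros(y_pred.size, dtype=bool)
    mask[top_idx] = True
    return mask

# Hypothetical predicted counts for ten grid cells; flag the top 20%
preds = [0.1, 0.0, 2.5, 0.3, 0.2, 1.7, 0.0, 0.4, 0.1, 0.6]
hot = flag_hotspots(preds, top_fraction=0.2)
print(np.flatnonzero(hot))
```

The resulting mask could then be attached to the grid as a column and used to style or filter the map.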

Evaluation Metrics

GridPred has a number of standard crime-prediction metrics like the Predictive Accuracy Index (PAI), the Predictive Efficiency Index (PEI), and the Recapture Rate Index (RRI) already pre-defined in the library. These can be passed as a dict of metrics or as a simple list of the functions on their own.

# Pass a dict of pre-defined library metrics
METRICS = {'PAI': pai, 'PEI': pei, 'RRI': rri}

print(
    evaluate(
        y_true=gridpred.eval,
        y_pred=y_pred,
        metrics=METRICS,
        region_grid=region_grid,
        round_digits=2
    )
)

{'PAI': 14.21, 'PEI': 0.49, 'RRI': 14.96}
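
For readers unfamiliar with these metrics, here is a small sketch of PAI and PEI under their commonly used definitions, assuming equal-area grid cells: PAI is the share of crime captured divided by the share of area flagged, and PEI is the crime captured divided by the most that could have been captured in the same number of cells. This is an illustration rather than the library's implementation, `pai_pei_sketch` and the sample data are hypothetical, and RRI is left out because its exact definition in GridPred is not described here.

```python
import numpy as np

def pai_pei_sketch(y_true, y_pred, top_fraction=0.01):
    """PAI and PEI for equal-area grid cells, under commonly used definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_cells = y_true.size
    k = max(1, int(np.ceil(n_cells * top_fraction)))
    total = y_true.sum()
    if total == 0:
        return 0.0, 0.0
    # Crimes captured in the k cells with the highest predictions
    captured = y_true[np.argsort(y_pred)[-k:]].sum()
    # Best case: the k cells that actually had the most crime
    best = np.sort(y_true)[-k:].sum()
    pai_value = (captured / total) / (k / n_cells)  # share of crime / share of area
    pei_value = captured / best                     # captured / maximum capturable
    return pai_value, pei_value

# Hypothetical observed counts and predictions for ten cells, top 20% flagged
observed = [0, 0, 5, 1, 0, 3, 0, 1, 0, 0]
predicted = [0.1, 0.0, 2.5, 0.3, 0.2, 0.4, 0.0, 1.7, 0.1, 0.6]
pai_value, pei_value = pai_pei_sketch(observed, predicted, top_fraction=0.2)
print(round(pai_value, 2), round(pei_value, 2))
```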

Furthermore, you can pass any valid function to evaluate. Any arbitrary function is valid as long as it takes y_true and y_pred as arguments. The example below shows a custom function that computes the hit rate (the proportion of total crimes in the top x% of predicted hot spots).

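# function for hit rate, add to dict and evaluate alongside others
def hit_rate(y_true, y_pred, top_fraction=0.01, **kwargs):
    """Share of all observed crimes that fall in the top `top_fraction` of predicted cells."""
    # coerce to NumPy arrays so pandas Series and plain arrays both work
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(np.ceil(len(y_pred) * top_fraction))

    # indices of the k cells with the highest predicted values
    idx = np.argsort(y_pred)[-k:]

    # guard against division by zero when no crimes were observed
    total = y_true.sum()
    return y_true[idx].sum() / total if total > 0 else 0.0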

print(
    evaluate(
        y_true=gridpred.eval,
        y_pred=y_pred,
        metrics=[pai, hit_rate],
        region_grid=region_grid,
        round_digits=2,
    )
)

{'pai': 14.21, 'hit_rate': 0.15}
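
The two outputs above suggest that a dict of metrics is keyed by the names you supply, while a plain list is keyed by each function's own name. A guess at how such a dispatcher could work is sketched below. `evaluate_sketch` and `mean_abs_error` are hypothetical stand-ins, not the library's code, and extra keyword arguments are simply forwarded to each metric (which is why a custom metric would accept `**kwargs`).

```python
def evaluate_sketch(y_true, y_pred, metrics, round_digits=2, **kwargs):
    """Apply each metric to (y_true, y_pred) and collect the rounded results by name."""
    if isinstance(metrics, dict):
        named = dict(metrics)
    else:
        # For a plain list, fall back to each function's own name
        named = {fn.__name__: fn for fn in metrics}
    return {
        name: round(float(fn(y_true, y_pred, **kwargs)), round_digits)
        for name, fn in named.items()
    }

def mean_abs_error(y_true, y_pred, **kwargs):
    """Toy metric: average absolute difference between observed and predicted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

observed = [0, 2, 1, 4]
predicted = [0.5, 1.5, 1.0, 3.0]
from_list = evaluate_sketch(observed, predicted, metrics=[mean_abs_error])
from_dict = evaluate_sketch(observed, predicted, metrics={"MAE": mean_abs_error})
print(from_list)
print(from_dict)
```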

What’s next?

In a few follow-up blog posts, I’ll be talking a bit more about features as I continue to add them. Specifically, I’m looking to add additional plotting metrics for feature importance (e.g. SHAP), as well as some diagnostics that help plot out the evaluation metrics like PEI and PAI across different hot spot thresholds.

References

Gorr, Wilpen, Andreas Olligschlaeger, and Yvonne Thompson. 2003. “Short-Term Forecasting of Crime.” International Journal of Forecasting 19 (4): 579–94.

McDowall, David, Colin Loftin, and Matthew Pate. 2012. “Seasonal Cycles in Crime, and Their Variability.” Journal of Quantitative Criminology 28 (3): 389–410.

Wheeler, Andrew P., and Wouter Steenbeek. 2021. “Mapping the Risk Terrain for Crime Using Machine Learning.” Journal of Quantitative Criminology 37 (2): 445–80.

Footnotes

¹ I had a previous project named quickGrid that was written in R. But given the number of deprecated packages, it became too much work to update.