Visualizing predictor effects using accumulated local effect (ALE) plots
Identifying ‘hot spots’ of car crashes in NYC
Python
Data Science Applications
Data Visualization
Author
Gio Circo, Ph.D.
Published
February 20, 2026
Improving Model Explainability with ALE
One of the major drawbacks of many machine learning models is that explainability is often challenging. Unlike methods that rely on more conventional statistical inference, machine learning models are often opaque about why a certain prediction was made. Consider the example of a random forest: a specific prediction derives from aggregating over many shallow decision trees, so understanding how predictions vary on average is much less straightforward than with a linear model.
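To make this concrete, here is a small scikit-learn sketch (toy data, not the crash model used later) showing that a regression forest's prediction is just the average over its many trees, each of which may disagree:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# fit a small random forest on toy regression data
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0).fit(X, y)

# each shallow tree makes its own prediction for the same observation
x0 = X[:1]
tree_preds = np.array([tree.predict(x0)[0] for tree in rf.estimators_])

print(tree_preds.std())  # trees disagree with each other
print(np.allclose(tree_preds.mean(), rf.predict(x0)[0]))  # → True: the forest averages them
```

No single coefficient summarizes the effect of a feature here, which is the gap ALE plots are meant to fill.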
Accumulated local effects (ALE) are a method for decomposing the effect of a predictor feature on a model's predictions. In a more conventional setup you can consider these very similar to a partial effects plot. The end goal is to isolate how the effect of the predictor changes, on average, over different input values (e.g., does it monotonically increase? Does it decay?).
To show this, I put together another example using GridPred to build a highly simplistic prediction model for car accidents. If you want a more step-by-step walkthrough of how GridPred works, you can go to the first blog post on it here.
Pulling crash data from the NYC data portal
For this example I rely on car accident report data from the New York City Open Data portal. For predictor features, I also extract the locations of pedestrian crossings and bus stops. The assumption is that these locations are more prone to accidents, with cars suddenly stopping or maneuvering around buses.
The code below does a lot of the work of pulling crash data from the NYC Open Data portal via Socrata API requests. I use a handful of helper functions to query each of the individual datasets, which I then combine into a format usable by GridPred. At the end of the process we have a gridded map of NYC containing counts of vehicle accidents over a roughly 16-week lookback period (eight 2-week intervals), as well as the distance to the nearest instance of each predictor feature.
Code
```python
import os
from datetime import datetime, timedelta

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import skexplain
from dotenv import load_dotenv
from shapely.geometry import shape
from sodapy import Socrata

from gridpred.evaluate.metrics import evaluate, pai, pei, rri
from gridpred.model.random_forest import RandomForestGridPred
from gridpred.plotting import visualize_predictions
from gridpred.prediction import GridPred

# Set up client
load_dotenv()
APP_TOKEN = os.getenv("APP_TOKEN")
client = Socrata("data.cityofnewyork.us", APP_TOKEN)

# Set plotting theme
sns.set_theme(context='notebook', style="ticks", palette='deep',
              font='sans-serif', font_scale=1, color_codes=True, rc=None)

# local funcs for data
def get_date_lookback(lookback_days=14):
    "Retrieve a date from a number of lookback days"
    current_date = datetime.today()
    lookback_date = current_date - timedelta(days=lookback_days)
    return lookback_date.strftime("%Y-%m-%d")

def get_features(borough):
    "Retrieve all predictor features"
    # bus stop point locations
    bus_stops = pd.DataFrame(client.get(
        "t4f2-8md7",
        select="'BusStop' AS type, Longitude, Latitude",
        where=f"boro_name = '{borough.capitalize()}'",
        limit=1000,
    ))
    # pedestrian crossing point locations
    ped_xing = pd.DataFrame(client.get(
        "xc4v-ntf4",
        select="'PedXing' AS type, long AS Longitude, lat AS Latitude",
        limit=7000,
    ))
    return pd.concat([bus_stops, ped_xing])

def get_study_region_spatial(borough, espg="4326"):
    "Retrieve a boro boundary, parse as a geopandas object"
    data = client.get("gthc-hcne", select="*",
                      where=f"boroname = '{borough.capitalize()}'")
    # parse api response to geopandas
    rows = []
    for r in data:
        rows.append({
            **{k: v for k, v in r.items() if k != "the_geom"},
            "geometry": shape(r["the_geom"])
        })
    return gpd.GeoDataFrame(rows, crs=f"EPSG:{espg}")

# --- Workflow stuff ---
# define a borough and a lookback period
BOROUGH = "BROOKLYN"
LOOKBACK_DAYS = 28 * 4

# retrieve point locations of crashes within date range
date = get_date_lookback(LOOKBACK_DAYS)
results = client.get(
    "h9gi-nx95",
    select="crash_date, borough, latitude, longitude",
    where=f"crash_date >= '{date}' AND borough = '{BOROUGH}' "
          "AND latitude != 0 AND longitude != 0",
    limit=10000
)
results_df = pd.DataFrame.from_records(results)

# point features
feat_df = get_features(BOROUGH)

# region spatial
boro_region = get_study_region_spatial(BOROUGH)

# create a dataframe where we define time periods
# model uses t-1 for training, t=0 for validation

# Parse datetime
results_df["crash_date"] = pd.to_datetime(results_df["crash_date"])

# Get the oldest date
min_date = results_df["crash_date"].min()

# Compute 2-week periods forward in time
results_df["timevar"] = ((results_df["crash_date"] - min_date).dt.days // 14) + 1
```
After setting up the data we define the input variables and feature names, and create the GridPred object. Below, the workflow runs a test model using the first six 2-week periods of data as predictors, period 7 as the outcome, and period 8 as the validation set. We compute the performance metrics and then refit the model on the full set (periods 1-7 as predictors, period 8 as the outcome) to get predictions for the following period.
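GridPred assembles this lagged layout internally, but the idea can be sketched with a toy pandas pivot (column names here are hypothetical, not GridPred's actual internals): long-format event counts per cell and period become one wide row per cell, with earlier periods as features and the final period as the target.

```python
import pandas as pd

# toy long-format data: one row per (cell, 2-week period) with an event count
long = pd.DataFrame({
    "cell":    [1, 1, 1, 2, 2, 2],
    "timevar": [1, 2, 3, 1, 2, 3],
    "events":  [4, 2, 0, 1, 0, 3],
})

# pivot to wide: one row per grid cell, one column per period
wide = long.pivot(index="cell", columns="timevar", values="events")
wide.columns = [f"events_{t}" for t in wide.columns]

X = wide[["events_1", "events_2"]]  # earlier periods as predictors
y = wide["events_3"]                # final period as the outcome

print(X.shape, y.tolist())  # (2, 2) [0, 3]
```

Shifting which period plays the role of the outcome is what distinguishes the test fit from the final next-period fit.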
Code
```python
# define variable names
time_var = "timevar"
features_var = "type"

# spatial projections
# includes the coordinate reference system of the input crime data
# as well as a desired projection for all spatial objects
input_crime_crs = 4326
projected_crs = 3857

# size of the grid to use (in units based on projection)
grid_size = 300

# This initializes the GridPred class with the specified data
gridpred = GridPred(
    input_crime_data=results_df,
    input_features_data=feat_df,
    features_names_variable=features_var,
    crime_time_variable=time_var,
    input_study_region=boro_region,
    input_crs=input_crime_crs,
)

# This generates the input to the regression model
gridpred.prepare_data(
    grid_cell_size=grid_size,
    do_projection=True,
    projected_crs=3857
)

# very basic demo model workflow
# can replace with xgboost or whatever model
X = gridpred.X
y = gridpred.y

rf = RandomForestGridPred(
    n_estimators=1000,
    criterion="poisson",
    random_state=42
)
rf.fit(X, y)
y_pred = rf.predict(X)

# check eval metrics on test then rerun
region_grid = gridpred.region_grid
METRICS = {'PAI': pai, 'PEI': pei, 'RRI': rri}
print(
    evaluate(
        y_true=gridpred.eval,
        y_pred=y_pred,
        metrics=METRICS,
        region_grid=region_grid,
        round_digits=2
    )
)

# now next period's prediction
Xfinal = gridpred.X_final
yfinal = gridpred.y_final

rf_final = RandomForestGridPred(
    n_estimators=1000,
    criterion="poisson",
    random_state=42
)
rf_final.fit(Xfinal, yfinal)
y_pred_final = rf_final.predict(Xfinal)
```
{'PAI': 3.84, 'PEI': 0.25, 'RRI': 4.3}
For reference, the final input dataset looks like the table below, where the predictors are counts of accidents in each of the six prior 2-week periods (events_1 through events_6), along with the distance, in feet, to the nearest pedestrian crossing (PedXing) and bus stop (BusStop).
| | events_1 | events_2 | events_3 | events_4 | events_5 | events_6 | BusStop | PedXing | x | y |
|---|---|---|---|---|---|---|---|---|---|---|
| 1992 | 0 | 0 | 0 | 0 | 0 | 0 | 541.92 | 532.64 | -8234909.75 | 4949342.61 |
| 2159 | 0 | 0 | 0 | 0 | 0 | 0 | 459.34 | 391.51 | -8234365.42 | 4949584.69 |
| 2076 | 0 | 0 | 0 | 0 | 0 | 0 | 304.09 | 261.92 | -8234666.15 | 4949560.02 |
| 1993 | 0 | 0 | 0 | 0 | 0 | 0 | 392.06 | 373.60 | -8234959.60 | 4949533.84 |
| 1994 | 0 | 0 | 0 | 0 | 1 | 0 | 226.19 | 125.60 | -8234956.00 | 4949803.00 |
| 2077 | 0 | 0 | 0 | 0 | 0 | 0 | 96.02 | 28.15 | -8234656.00 | 4949803.00 |
| 2160 | 0 | 0 | 0 | 0 | 0 | 0 | 330.80 | 321.57 | -8234356.00 | 4949803.00 |
| 1995 | 0 | 0 | 0 | 0 | 1 | 0 | 330.44 | 218.86 | -8234956.00 | 4950103.00 |
| 1494 | 0 | 0 | 0 | 0 | 0 | 0 | 318.01 | 349.33 | -8236761.77 | 4949272.93 |
| 1577 | 0 | 0 | 0 | 0 | 0 | 0 | 388.02 | 395.93 | -8236466.29 | 4949296.27 |
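The distance columns boil down to a nearest-neighbor lookup: for each grid-cell centroid, find the closest point of each feature type. A minimal NumPy sketch of that computation (toy coordinates in projected units; GridPred's actual implementation may differ):

```python
import numpy as np

# hypothetical projected coordinates: grid-cell centroids and bus stop locations
centroids = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
bus_stops = np.array([[10.0, 0.0], [0.0, 90.0]])

# pairwise Euclidean distances via broadcasting: shape (n_centroids, n_stops)
d = np.linalg.norm(centroids[:, None, :] - bus_stops[None, :, :], axis=-1)

# distance from each centroid to its nearest bus stop
nearest = d.min(axis=1)
print(nearest)  # [10. 90. 10.]
```

Repeating this per feature type gives one distance column per feature, as in the BusStop and PedXing columns above.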
The map below shows the predicted hot spots of car crashes. Primarily, these are near the main highways around the Brooklyn and Williamsburg bridges.
Predicted intensity of auto accidents for week of 2026-02-23. Gridded squares are 300x300 foot zones.
Accumulated Local Effects
To start, I think reading through Christoph Molnar’s free book Interpretable Machine Learning is an essential starting point (Molnar 2020). To reiterate (without getting too far into the weeds), our goal is to identify differences in predictions within a small “window” around the observed feature values. This second part is important, because ALE plots are constrained to observed, local differences in predictions. This avoids generating predictions from combinations of features that would never exist (one of the examples Molnar gives is an improbable price prediction for a 30 square meter house containing 10 rooms). Because we accumulate predictions over small regions of the observed feature space, we avoid many of these issues.
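To make the mechanics concrete, here is a minimal from-scratch sketch of first-order ALE (my own simplified version, not skexplain's implementation): quantile bins keep each window inside the observed data, local prediction differences are averaged within each bin, accumulated across bins, and then centered.

```python
import numpy as np

def ale_1d(predict, X, j, n_bins=10):
    """Simplified first-order ALE for feature j."""
    X = np.asarray(X, dtype=float)
    # quantile bin edges, so each window stays inside the observed data
    edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, n_bins - 1)
    local = np.zeros(n_bins)
    for k in range(n_bins):
        mask = idx == k
        if not mask.any():
            continue
        lo, hi = X[mask].copy(), X[mask].copy()
        lo[:, j], hi[:, j] = edges[k], edges[k + 1]
        # local effect: prediction change from moving x_j across this bin only
        local[k] = (predict(hi) - predict(lo)).mean()
    ale = np.cumsum(local)          # accumulate the local differences
    return edges, ale - ale.mean()  # center so effects are relative to the mean

# sanity check on a known-linear model: the ALE of feature 0 should be linear
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
f = lambda x: 3.0 * x[:, 0] + x[:, 1]
edges, ale = ale_1d(f, X, j=0)
```

(The centering step here uses a simple mean over bins; the full definition weights by bin counts, but the idea is identical.)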
For a quick example, we can look at the most important features in our random forest model. We see that distance to the nearest PedXing and BusStop are the top two features. One question might be: “within what distance of the nearest pedestrian crossing or bus stop is car accident risk elevated?” This is a practically useful question, as it might affect where signs are placed or inform environmental design choices.
To do this, we use some underlying code from skexplain. Below I generate estimates of the ALE for all features in the dataset. The ALE is computed over repeated bootstrap samples of the original dataset, which gives us a sense of the uncertainty around each estimate. The result is a multivariate array with ALE estimates for each feature, over each bootstrap sample. I then use some helper functions to pull out the estimates for plotting. skexplain has some nice built-in plotting functions, but I prefer to generate the plots myself.
Code
```python
def unpack_feature(ds, feat_name):
    """
    Extracts the bin values and ALE values for a specific feature.
    """
    # get the bin values and ale calculations for selected feature
    values_bin = ds[f"{feat_name}__bin_values"].values
    values_ale = ds[f"{feat_name}__name__ale"].values
    return values_bin, values_ale

def unpack_feature_2d(ds, feat_x, feat_y):
    """
    Extracts bin values and ALE values for a 2D feature interaction.
    """
    # bin values
    x_bin = ds[f"{feat_x}__bin_values"].values
    y_bin = ds[f"{feat_y}__bin_values"].values

    # handle possible ordering of feature names
    key1 = f"{feat_x}__{feat_y}__name__ale"
    key2 = f"{feat_y}__{feat_x}__name__ale"

    if key1 in ds:
        ale_vals = ds[key1].values
    elif key2 in ds:
        ale_vals = ds[key2].values
    else:
        raise KeyError(f"No 2D ALE found for {feat_x}, {feat_y}")

    return x_bin, y_bin, ale_vals

def plot_feature_ale_1d(ds, feature, xlim=None, ylim=None):
    # unpack feature data
    values_bin, values_ale = unpack_feature(ds, feature)

    # aggregate across bootstrap iterations
    val_min = values_ale.min(axis=0)
    val_avg = values_ale.mean(axis=0)
    val_max = values_ale.max(axis=0)

    # plot
    plt.figure(figsize=(5, 4))
    plt.fill_between(
        x=values_bin, y1=val_min, y2=val_max, alpha=0.2, color='#0077BB'
    )
    sns.lineplot(x=values_bin, y=val_avg, color='#0077BB', label='Mean ALE')
    plt.axhline(y=0, color='#CC3311', linestyle='--', linewidth=1)
    plt.title(f"1D ALE Plot: {feature}")
    plt.xlabel(f"{feature} Values")
    plt.ylabel("ALE Effect")

    # axis constraints
    if xlim is not None:
        plt.xlim(xlim)
    if ylim is not None:
        plt.ylim(ylim)

    sns.despine()
    plt.tight_layout()
    plt.show()

def plot_feature_ale_2d(
    ds, feat_x, feat_y,
    xlim=None, ylim=None,
    clim=None  # (vmin, vmax) for color scaling
):
    # unpack
    x_bin, y_bin, ale_vals = unpack_feature_2d(ds, feat_x, feat_y)

    # aggregate across bootstraps
    ale_mean = ale_vals.mean(axis=0)

    # meshgrid
    X, Y = np.meshgrid(x_bin, y_bin, indexing="ij")

    # symmetric scaling by default
    if clim is None:
        vmax = np.abs(ale_mean).max()
        vmin = -vmax
    else:
        vmin, vmax = clim

    plt.figure(figsize=(6, 5))
    contour = plt.contourf(
        X, Y, ale_mean, levels=20, cmap="RdBu_r", vmin=vmin, vmax=vmax
    )
    plt.colorbar(contour, label="ALE Effect")
    plt.title(f"2D ALE Plot: {feat_x} × {feat_y}")
    plt.xlabel(feat_x)
    plt.ylabel(feat_y)

    # axis constraints
    if xlim is not None:
        plt.xlim(xlim)
    if ylim is not None:
        plt.ylim(ylim)

    sns.despine()
    plt.tight_layout()
    plt.show()
```
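The aggregation step in the 1D plotting helper amounts to stacking the per-bootstrap ALE curves into a 2D array and summarizing across the first axis. A quick illustration with synthetic curves (fake data standing in for real ALE output):

```python
import numpy as np

# synthetic stand-in: 100 bootstrap ALE curves, each evaluated at 25 bins
rng = np.random.default_rng(1)
n_boot, n_bins = 100, 25
curves = np.stack([
    np.linspace(1, 0, n_bins) + rng.normal(0, 0.05, n_bins)
    for _ in range(n_boot)
])

# min/max give the shaded uncertainty band, the mean gives the plotted line
band_lo = curves.min(axis=0)
band_hi = curves.max(axis=0)
mean_ale = curves.mean(axis=0)

print(curves.shape)  # (100, 25)
```

This is why the fill_between band in the 1D plots always brackets the mean ALE line.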
Below are the 1D ALE plots for two of the features: distance to bus stops and distance to pedestrian crossings. We limit the plots to the first 1,000 feet, since that is well within the area of interest. The ALE is a centered effect, so the y-axis shows increases or decreases relative to the mean prediction.
The results here make logical sense. For both pedestrian crossings and bus stops, the closer we are, the higher the predicted accident risk. This effect decreases fairly steadily as distance increases. For bus stops there is no real effect beyond roughly 1,000 feet, while greater distances from pedestrian crossings are associated with fewer accidents. The latter effect is probably because neighborhoods with few crossings are structurally different from those with many.
We can also examine the interaction between the two features using a contour plot. Below, we see an interesting interaction: areas very close to both a bus stop AND a pedestrian crossing have lower risk. However, accident risk is elevated in areas near a bus stop OR a pedestrian crossing, but not both. Again, some of this is probably structural, related to how the road network is laid out in different neighborhoods.