Fake Tickets Galore
In 2022 news broke that Connecticut State Police officers had likely been submitting tens of thousands of fake traffic tickets. What started as a CT Insider report about four officers expanded into an audit that implicated dozens more officers in the scandal.
Before I start, I should note that this is more of an academic exercise. I actually think the approach taken by the folks who did the audit is probably the best and simplest way: they found fake tickets by counting up the stops which didn't have a matching citation in the system. For this example we're going to rely only on the raw data, which is publicly available.
Working on the outside
My hypothesis for the fake ticketing was that officers wanted to quickly pad their activity logs with fake stops. Because creating a stop log requires filling out a standardized form, I assumed that these officers would do it as quickly and as lazily as possible. I can imagine this looking like an officer at the end of their shift padding out their activity with a bunch of zero-effort fake stops. Relying on this hunch, I figured I could detect unusual officer behavior by finding officers whose distribution of stop times looked much different from the average officer's. Humans are actually quite bad at generating “random” numbers, and a padded shift would probably contain a lot of stops logged with very short times between them.
The Data
The data for this comes from Connecticut's Racial Profiling Prohibition Project data portal, which contains records of all traffic stops conducted by local and state police, including the time, date, reason for the stop, and some basic demographic information on the driver. For this analysis I use the data from 2018 and focus on unique stops involving only officers from the Connecticut State Police.
Unfortunately for me, the publicly available data doesn't directly disclose the length of each stop. There is a variable, InterventionDurationCode, which only reports the length of the stop in 3 bins: (0-15, 16-30, 30+) minutes. For my purposes this is far too coarse a measure to distinguish abnormal stop lengths. However, as an analog, I figured I could instead compute the time interval between stops: the time between when one stop begins and when the next stop begins. This gives us an indirect way of measuring how long a stop took before the next one was initiated:
\[\text{interval} = time_{stop_2} - time_{stop_1}\]
So if we had two stops, one at 16:31:31 and one at 16:40:15, the interval would be about 8.7 minutes. As an example, this would look like:

  ID         Troop       Violation_Type Date       Time     Duration Interval
1 1000001884 CSP Troop E STC Violation  2018-01-07 16:31:31 <NA>     NA
2 1000001884 CSP Troop E Speed Related  2018-01-07 16:40:15 524 secs 8.733333
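Computing these intervals from the raw stop records is straightforward. Below is a minimal sketch of how it could be done with dplyr; the stops data frame and its column names are my own invention for illustration (the portal's field names differ), but the resulting officer_id and time_diff_min columns match the daily_stop_interval object used in the code later in this post.

Code
library(dplyr)

# Toy input: one row per stop, with a POSIXct start time
# (column names here are assumptions, not the portal's field names)
stops <- tibble::tibble(
  officer_id = c("1000001884", "1000001884"),
  stop_time  = as.POSIXct(c("2018-01-07 16:31:31", "2018-01-07 16:40:15"), tz = "UTC")
)

daily_stop_interval <- stops %>%
  mutate(stop_date = as.Date(stop_time)) %>%
  arrange(officer_id, stop_time) %>%
  group_by(officer_id, stop_date) %>%   # only compare stops within the same day
  mutate(duration      = difftime(stop_time, lag(stop_time), units = "secs"),
         time_diff_min = as.numeric(duration, units = "mins")) %>%
  ungroup()

daily_stop_interval$time_diff_min
#> [1]       NA 8.733333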
Results
To start, it makes sense to first evaluate what “average” stop length intervals look like. One way of doing this is just a bunch of density plots, one for each officer. Here we're looking at the top 100 officers by number of stops for 2018. The dark line is the average for all 100 officers. We can see that the majority of stop intervals are between 10 and 20 minutes, with a long tail reaching out to our maximum of 1 hour. This makes sense, because most traffic stops are pretty perfunctory (warning drivers that a tail light is out, writing a ticket for speeding, etc.). However, a small number of traffic stops are more complex and might involve searches, DWI investigations, or require another officer to attend. Regardless, the plot below gives us some idea of the “average” stop time, as well as the individual behaviors of different officers.
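For reference, something like the following base R sketch could draw this kind of overlay. It assumes the daily_stop_interval data frame (with officer_id and time_diff_min columns, as in the code later in this post) has already been limited to the top 100 officers and to intervals under an hour; the y-axis limit is an arbitrary choice.

Code
# one faint density line per officer, plus a dark pooled curve
plot(NULL, xlim = c(0, 60), ylim = c(0, 0.08),
     main = "Density of Time Intervals Between Stops",
     xlab = "Interval Time (minutes)", ylab = "Density")

for (id in unique(daily_stop_interval$officer_id)) {
  d <- density(daily_stop_interval$time_diff_min[daily_stop_interval$officer_id == id],
               na.rm = TRUE, from = 0, to = 60)
  lines(d, col = rgb(0, .267, .533, .1))
}

# pooled density across all 100 officers (the dark line)
lines(density(daily_stop_interval$time_diff_min, na.rm = TRUE, from = 0, to = 60),
      col = "#004488", lwd = 3)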
But I think an easier way to visualize this is to transform the distributions above into the empirical cumulative distribution function (ECDF) for each officer. The ECDF is actually a very useful tool for this kind of question because it makes no assumptions about the distribution of the data. Simply put, an ECDF reports the proportion of observations at or below each given interval, on a scale of \([0,1]\). All we do is step the cumulative value up by \(1/n\) at each of the \(n\) data points in the distribution. This is closely related to the quantile: the interval at which the ECDF reaches 0.5 is the median.
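As a quick illustration with made-up numbers, base R's ecdf() returns exactly this step function:

Code
x  <- c(4, 9, 15, 18, 22, 31, 45)   # toy stop intervals, in minutes
Fn <- ecdf(x)                        # step function: share of intervals <= t

Fn(15)             # 3 of the 7 intervals are 15 minutes or shorter: ~0.43
Fn(median(x))      # at the median (18) the ECDF has reached at least 0.5
quantile(x, 0.5)   # and the 0.5 quantile recovers the median: 18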
The plot below shows the distribution of stop time intervals for all intervals between 0 and 60 minutes. The dark line is the average for all 100 officers, and the lighter lines are the individual officers' ECDFs. We see that the median interval between stops is about 18 minutes, and, as we'd expect, officers' stop intervals are scattered above and below this average. What we are actually interested in are officers whose ECDFs are shifted toward unusually short intervals compared to everyone else.
Code
# plot all ecdfs
ecdf_plot(id = unique(daily_stop_interval$officer_id),
          lwd = 3,
          main = "Empirical CDF, Time Interval Between Stops",
          xlab = "Interval Time (minutes)",
          ylab = "Percentiles",
          col = "#004488")

# overlay each individual officer's ECDF as a faint line
for (id in times$officer_id) {
  ecdf_plot(id, lines = TRUE, col = alpha(rgb(0, .267, .533), .1))
}
Intervals
To get these bands, we evaluate each officer's ECDF on a common grid of interval times and take the 5th and 95th percentiles at each grid point:
# compute confidence bands on all ecdfs over a grid of N_points interval values
time_min <- min(daily_stop_interval$time_diff_min)
time_max <- max(daily_stop_interval$time_diff_min)
N_points <- 125

x_vals <- seq(time_min, time_max, length.out = N_points)

# Compute all ECDF values at each x for all officer IDs
ecdf_values <- sapply(times$officer_id, function(id) {
  ecdf_func <- ecdf(daily_stop_interval$time_diff_min[daily_stop_interval$officer_id == id])
  ecdf_func(x_vals)
})

# Compute 5th and 95th percentiles for the confidence bands
lower_band <- apply(ecdf_values, 1, function(row) quantile(row, 0.05))
upper_band <- apply(ecdf_values, 1, function(row) quantile(row, 0.95))
Officer “88185785”
This officer had the shortest average interval between stops at 14.2 minutes. In fact, more than half of their stops were at intervals of 10 minutes or less. This feels incredibly fast if you consider how long it takes to pull someone over. It seems implausible that this person was conducting almost non-stop traffic stops, taking driver information, issuing a warning (I assume?), and then getting another stop almost immediately afterwards.
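One way an officer like this can be surfaced in the first place is simply to summarise each officer's intervals and sort. A rough sketch, again assuming the daily_stop_interval data frame restricted to the top 100 officers:

Code
library(dplyr)

# Per-officer summaries of the interval distribution
officer_summary <- daily_stop_interval %>%
  group_by(officer_id) %>%
  summarise(n_intervals    = n(),
            mean_interval  = mean(time_diff_min, na.rm = TRUE),
            share_under_10 = mean(time_diff_min <= 10, na.rm = TRUE)) %>%
  arrange(mean_interval)

# officer "88185785" should sit at the top of this ranking
head(officer_summary)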
Code
officer_id <- '88185785'

# Set up an empty plot frame
plot(c(0, 60), c(0, 1), col = "white",
     main = "Empirical CDF, Time Interval Between Stops",
     xlab = "Interval Time (minutes)",
     ylab = "Percentiles")

# Add confidence bands (shaded region)
polygon(c(x_vals, rev(x_vals)), c(lower_band, rev(upper_band)),
        col = rgb(0, .267, .533, 0.3), border = NA)
lines(x_vals, lower_band, col = "#004488", lwd = 2, lty = 3)
lines(x_vals, upper_band, col = "#004488", lwd = 2, lty = 3)

# Overlay this officer's ECDF
ecdf_plot(officer_id, col = '#BB5566', lwd = 2, lines = TRUE)