A (very, very late) discussion of election forecasting
R
Bayesian Statistics
Author
Gio Circo, Ph.D.
Published
January 13, 2025
Revisiting the 2024 Election
The 2024 election is over. Donald Trump beat Kamala Harris by about 2 million popular votes and 86 electoral votes. All the posturing, forecasting, and betting is over. So why am I posting about the election now?
Well, at the outset of the election I became interested in running my own forecasting model. Nowadays there are plenty of folks running their own forecasting models - from individual bloggers to big news agencies like the New York Times. Poll aggregation is one of those things that is very "Bayesian", and I learned a lot about it reading through various books. To boot, I've always been a big fan of Andrew Gelman, as well as some other folks who do election and public policy forecasting - like Nate Silver and Elliott Morris. So I thought, "what the hell," and threw my hat into the ring. I held off on posting this because I figured there was already enough noise out there about the election, and maybe the world didn't need one more person adding to the cacophony.
My Model
To keep it simple, I focused on the simplest kind of model - a poll aggregation model estimating the national vote for the two-party candidates. This avoids much of the additional complexity of estimating electoral college votes: no need to model state-level vote share, correlations between states, and many other things. This was truly just meant for a bit of academic fun. As a source of data, I used the very nicely curated tables supplied by 538 here, which were continuously updated during the election cycle. Helpfully, this data contains other useful variables, such as information about pollsters and 538's own ratings of pollster reliability.
For my model I applied filtering to get a subset of polls:
Only polls that included a Trump-Harris match up
Only polls conducted on or after 6/1/2024 (just before Biden dropped out)
Only polls including likely voters, registered voters, or adults
Removed overlapping polls
Dropped polls with more than 5,000 respondents
Dropping polls with more than 5,000 respondents is largely a way to remove pollsters who "dredge" for large numbers of responses that are typically of low quality. In total, this leaves us with about 255 unique polls in the 165 days between 6/1/2024 and 11/4/2024. A condensed sketch of this filtering is shown below; the full, exact version is in the script at the end of the post.
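(This is simplified: the full script also requires that Harris and Trump be the only two answer options for a given question, de-duplicates overlapping polls, and drops polls with a combined two-party share below 85%.)

library(tidyverse)

# read 538's historical presidential polling file
polls <- read_csv("https://projects.fivethirtyeight.com/polls-page/data/president_polls_historical.csv")

# condensed version of the filters described above
national_polls <- polls %>%
  rename(pop = sample_size, vtype = population) %>%
  mutate(end = as.Date(end_date, "%m/%d/%y")) %>%
  filter(
    answer %in% c("Harris", "Trump"),  # Trump-Harris match-ups only
    is.na(state),                      # national polls only
    end > as.Date("2024-06-01"),       # polls ending after 6/1/2024
    vtype %in% c("lv", "rv", "a"),     # likely voters, registered voters, adults
    pop <= 5000                        # drop very large "dredged" polls
  )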
The poll aggregation approach
For the modelling approach I drew some inspiration from Andrew Gelman and others. Most of the current poll aggregation models apply some form of hierarchical linear modelling (HLM), combined with time trends, and often a “fundamentals” component. The general idea here is to partially pool information from many different pollsters, who are all utilizing slightly different samples of the population, with slightly different approaches.
For any individual poll we want to weight its contribution to the daily estimate based on known variables that affect responses (e.g., is this a partisan poll? does it sample registered or likely voters?), as well as "random" factors that vary around a mean value (the distribution of responses by day and by pollster). In addition, I wanted a model that also updated based on smooth time trends. While there are some nifty approaches using things like random-walk priors, I opted for a very simple cubic spline regression.
Fitting the models
To do this we fit the HLM component by applying fixed effects for partisan identification, the survey method (online, phone, or other), and the self-identified voting population (registered voters, likely voters, or adults). We apply varying intercepts for a day indicator, as well as for a pollster indicator. These varying intercepts provide partial pooling across polling days and pollsters. In short, this pulls polling estimates toward a group-level mean and keeps any individual poll from dragging the estimates too far in one direction - an effect often referred to as "shrinkage". I've actually written about this before in a blog post about estimating rare injury rates. We also fit a very simple cubic spline regression with default brms parameters. The idea here is to get smoothed estimates over time to account for a baseline trend in public opinion.
We do all the model fitting in brms, using a binomial regression model of the form n_votes | trials(pop), where n_votes is the predicted number of votes for a candidate given a survey sample size pop. Finally, we also weight the polls by the numeric grade assigned to them by 538. An example brms model looks something like the following below:
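(These are the Democratic-candidate fits, lightly trimmed from the full script at the end of the post; the Republican models are identical, fit to the GOP subset.)

# weakly informative priors for the HLM and spline components
bprior <- c(
  prior(normal(0, 0.5), class = 'Intercept'),
  prior(normal(0, 0.5), class = 'b'),
  prior(student_t(3, 0, 1), class = 'sd')
)
sprior <- c(
  prior(normal(0, 0.5), class = 'Intercept'),
  prior(student_t(3, 0, 1), class = 'sds')
)

# HLM component: fixed effects for partisanship, survey method, and voter type,
# varying intercepts by day (index_t) and pollster (index_p),
# polls weighted by 538's numeric grade (W_dem, rescaled to a mean of 1)
fit2.1 <- brm(
  n_votes | trials(pop) + weights(W_dem) ~ 1 + partisan + method + vtype +
    (1 | index_t) + (1 | index_p),
  family = "binomial",
  data = all_polls_df[dem, ],
  data2 = list(W_dem = W_dem),
  prior = bprior,
  chains = 4, cores = 4
)

# smoothing component: cubic regression spline over the day index
fit2.1s <- brm(
  n_votes | trials(pop) + weights(W_dem) ~ 1 + s(index_t, bs = 'cr'),
  family = "binomial",
  data = all_polls_df[dem, ],
  data2 = list(W_dem = W_dem),
  prior = sprior,
  chains = 4, cores = 4,
  control = list(adapt_delta = 0.99)
)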
I likely deviated from accepted statistical dogma here by weighting the smoothing spline model and the HLM using a very ad hoc approach. Early on I decided that I wanted my estimates to be controlled roughly evenly by the HLM and the smoothing spline, but I felt the smoothed time trend would help eliminate a lot of the spikes on a day-to-day basis. I opted for a 60/40 weighting scheme that gave slightly more weight to the smoothing model. The combined "post-hoc" weighted predictions are shown below.
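Mechanically, this weighting is just a convex combination of the two models' posterior predictive draws. This is the helper from the full script:

# average weighted predictions from two models, then summarize
stack_extract_posterior_predictions_naive <- function(x1, x2, weights = c(.5, .5), return = "agg") {
  pred1 <- posterior_predict(x1)
  pred2 <- posterior_predict(x2)
  pred_avg <- (pred1 * weights[1]) + (pred2 * weights[2])

  if (return == "agg") {
    # median and 95% credible interval for each poll-day
    ypred <- apply(pred_avg, 2, quantile, probs = c(.025, .5, .975)) %>% round()
    out <- data.frame(t(ypred)) %>% set_names(c("ymin", "median", "ymax"))
  } else if (return == "raw") {
    out <- pred_avg
  }
  return(out)
}

# 40% HLM, 60% smoothing spline
pred_dem <- stack_extract_posterior_predictions_naive(fit2.1, fit2.1s, weights = c(.4, .6))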
For clarity, we can add the individual poll results, showing the estimated support for Harris and Trump, with point size weighted by the number of respondents (larger polls get larger circles). Shown below, on the last day before the election we estimate a popular vote of 49.1% for Trump and 48.9% for Harris. Once we include the uncertainty, estimating the winner is essentially a coin flip.
Results
And below we have a table showing the estimated results for the last day of polling, 11/4/2024.
Election day estimates, national popular vote
party  end         ymin   median  ymax
DEM    2024-11-04  47.34  48.92   50.59
REP    2024-11-04  47.49  49.11   50.76
Comparing my estimates to the final count (as of 1/9/2025), the point estimates my model came up with are quite close to the observed results (Trump: 49.9% observed vs. 49.1% predicted; Harris: 48.4% observed vs. 48.9% predicted). Getting within a percentage point of the true value is pretty good, I think, for a somewhat half-baked model! That said, for the purpose of predicting who would win on election day, the margin of error on the predictions gives us essentially no additional confidence beyond a 50/50 chance. This is pretty consistent with a lot of other pollsters who had fancier models. In the end, it was a very close election that was decided by a relatively small number of voters in key areas.
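That coin-flip conclusion comes straight from the posterior draws. Using the raw weighted draws from both parties' models over the most recent ten days of polls (as at the end of the full script), the share of draws in which the Democratic candidate wins the two-party popular vote works out to essentially one half:

# raw posterior draws of predicted vote counts (return = "raw")
raw_dem <- stack_extract_posterior_predictions_naive(fit2.1, fit2.1s, weights = c(.4, .6), return = "raw")
raw_gop <- stack_extract_posterior_predictions_naive(fit2.2, fit2.2s, weights = c(.4, .6), return = "raw")

# Harris's share of the two-party vote, draw by draw
dem_margin <- raw_dem / (raw_dem + raw_gop)

# restrict to the most recent 10 days of polls
max_date <- max(pred_dem$end)
date_idx <- seq.Date(from = max_date, by = "-1 day", length.out = 10)
T <- as.numeric(rownames(pred_dem[pred_dem$end %in% date_idx, ]))

# share of draws where the Dem candidate wins the popular vote
# (NOT the same as winning the election!)
mean(dem_margin[, T] > .5)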
Full Script
library(tidyverse)
library(brms)

polls <- read_csv("https://projects.fivethirtyeight.com/polls-page/data/president_polls_historical.csv")

set.seed(8372)

# set up data, consistent with some other data sources
# this is just national polls
max_size = 5000
matchup = c("Harris", "Trump")

# get just harris-trump matchups
harris_trump <- polls %>%
  group_by(poll_id, question_id) %>%
  summarise(all_reps = paste0(answer, collapse = ",")) %>%
  filter(all_reps %in% c("Harris,Trump", "Trump,Harris")) %>%
  pull(question_id)

# select data
# only national polls where trump-harris are the options
# remove polls that are overlapping
all_polls_df <- polls %>%
  rename(pop = sample_size, vtype = population) %>%
  mutate(
    begin = as.Date(start_date, "%m/%d/%y"),
    end = as.Date(end_date, "%m/%d/%y"),
    t = end - (1 + as.numeric(end - begin)) %/% 2,
    entry_date = as.Date(created_at, "%m/%d/%y")
  ) %>%
  filter(
    question_id %in% harris_trump,
    is.na(state),
    !is.na(pollscore),
    !is.na(numeric_grade),
    answer %in% matchup,
    end > as.Date("2024-06-01"),
    t >= begin & !is.na(t) & (vtype %in% c("lv", "rv", "a")),
    pop > 1,
    pop <= max_size
  ) %>%
  mutate(
    pollster = str_extract(pollster, pattern = "[A-z0-9 ]+") %>% sub("\\s+$", "", .),
    pollster = replace(pollster, pollster == "Fox News", "FOX"),
    pollster = replace(pollster, pollster == "WashPost", "Washington Post"),
    pollster = replace(pollster, pollster == "ABC News", "ABC"),
    partisan = ifelse(is.na(partisan), "NP", partisan),
    method = case_when(
      str_detect(tolower(methodology), "online") ~ "online",
      str_detect(tolower(methodology), "phone") ~ "phone",
      TRUE ~ "other"
    ),
    week = floor_date(t - days(2), unit = "week") + days(2),
    day_of_week = as.integer(t - week),
    index_t = 1 + as.numeric(t) - min(as.numeric(t)),
    index_w = as.numeric(as.factor(week)),
    index_p = as.numeric(as.factor(as.character(pollster))),
    n_votes = round(pop * (pct / 100))
  ) %>%
  distinct(t, pollster, pop, party, .keep_all = TRUE) %>%
  select(
    poll_id, t, begin, end, entry_date, pollster, partisan, numeric_grade,
    pollscore, vtype, method, pop, n_votes, pct, party, answer, week,
    day_of_week, starts_with("index_")
  )

# remove overlapping polls
all_polls_df <- all_polls_df %>%
  group_by(entry_date, pollster, pop, party) %>%
  arrange(desc(entry_date), desc(end)) %>%
  slice(1)

# drop polls with combined 2-party vote share < 85%
low_vote <- all_polls_df %>%
  group_by(poll_id, entry_date, pollster, pop) %>%
  summarise(total_votes = sum(n_votes), .groups = 'drop') %>%
  mutate(prop = total_votes / pop) %>%
  filter(prop < .85) %>%
  select(poll_id, entry_date, pollster, pop)

all_polls_df <- anti_join(all_polls_df, low_vote)

# plot the 2 party pct
all_polls_df %>%
  ggplot(aes(x = t, y = pct, color = party)) +
  geom_point(aes(size = pop, fill = party), shape = 21, alpha = .2) +
  geom_smooth(aes(color = party)) +
  scale_color_manual(values = c("#00AEF3", "#E81B23")) +
  scale_fill_manual(values = c("#00AEF3", "#E81B23")) +
  scale_y_continuous(limits = c(30, 60)) +
  theme_minimal() +
  theme(legend.position = 'none')

# RUN MODELS
# -------------------- #

extract_posterior_predictions <- function(x) {
  # get the median and 95% credible interval
  ypred <- posterior_predict(x)
  ypred <- apply(ypred, 2, quantile, probs = c(.025, .5, .975)) %>% round()
  data.frame(t(ypred)) %>% set_names(c("ymin", "median", "ymax"))
}

stack_extract_posterior_predictions_naive <- function(x1, x2, weights = c(.5, .5), return = "agg") {
  # average weighted predictions from two models
  pred1 <- posterior_predict(x1)
  pred2 <- posterior_predict(x2)
  pred_avg <- (pred1 * weights[1]) + (pred2 * weights[2])

  if (return == "agg") {
    ypred <- apply(pred_avg, 2, quantile, probs = c(.025, .5, .975)) %>% round()
    out <- data.frame(t(ypred)) %>% set_names(c("ymin", "median", "ymax"))
  } else if (return == "raw") {
    out <- pred_avg
  }
  return(out)
}

stack_extract_posterior_predictions <- function(x1, x2) {
  # get the median and 95% credible interval
  ypred <- pp_average(x1, x2, summary = FALSE)
  ypred <- apply(ypred, 2, quantile, probs = c(.025, .5, .975)) %>% round()
  data.frame(t(ypred)) %>% set_names(c("ymin", "median", "ymax"))
}

# political idx
dem <- all_polls_df$party == 'DEM'
gop <- all_polls_df$party == 'REP'

# adjust for poll type, partisan
bprior <- c(
  prior(normal(0, 0.5), class = 'Intercept'),
  prior(normal(0, 0.5), class = 'b'),
  prior(student_t(3, 0, 1), class = 'sd')
)
sprior <- c(
  prior(normal(0, 0.5), class = 'Intercept'),
  prior(student_t(3, 0, 1), class = 'sds')
)

# rescaled pollster weights, mean of 1
W <- all_polls_df$numeric_grade / mean(all_polls_df$numeric_grade, na.rm = TRUE)
W_dem = W[dem]
W_gop = W[gop]

# aggregation model
fit2.1 <- brm(
  n_votes | trials(pop) + weights(W_dem) ~ 1 + partisan + method + vtype + (1 | index_t) + (1 | index_p),
  family = "binomial",
  data = all_polls_df[dem, ],
  data2 = list(W_dem = W_dem),
  prior = bprior,
  chains = 4,
  cores = 4,
  file = "data\\election_model\\fit2.1"
)

fit2.2 <- brm(
  n_votes | trials(pop) + weights(W_gop) ~ 1 + partisan + method + vtype + (1 | index_t) + (1 | index_p),
  family = "binomial",
  data = all_polls_df[gop, ],
  data2 = list(W_gop = W_gop),
  prior = bprior,
  chains = 4,
  cores = 4,
  file = "data\\election_model\\fit2.2"
)

# using a cubic regression spline for smoothing
fit2.1s <- brm(
  n_votes | trials(pop) + weights(W_dem) ~ 1 + s(index_t, bs = 'cr'),
  data = all_polls_df[dem, ],
  data2 = list(W_dem = W_dem),
  family = "binomial",
  prior = sprior,
  chains = 4,
  cores = 4,
  control = list(adapt_delta = 0.99),
  file = "data\\election_model\\fit2.1s"
)

fit2.2s <- brm(
  n_votes | trials(pop) + weights(W_gop) ~ 1 + s(index_t, bs = 'cr'),
  data = all_polls_df[gop, ],
  family = "binomial",
  prior = sprior,
  data2 = list(W_gop = W_gop),
  chains = 4,
  cores = 4,
  control = list(adapt_delta = 0.99),
  file = "data\\election_model\\fit2.2s"
)

# add predictions back to dataframe with weighted predictions
# weight 1 = hlm, weight 2 = smoothing model
# more weight on #2 = more smoothing
# default is about 40% hlm, 60% smoothing
weights = c(.4, .6)

pred_dem <- cbind.data.frame(
  stack_extract_posterior_predictions_naive(fit2.1, fit2.1s, weights = weights),
  all_polls_df[dem, ]
)
pred_gop <- cbind.data.frame(
  stack_extract_posterior_predictions_naive(fit2.2, fit2.2s, weights = weights),
  all_polls_df[gop, ]
)

test <- rbind.data.frame(pred_dem, pred_gop) %>%
  mutate(across(ymin:ymax, function(x) (x / pop) * 100)) %>%
  group_by(party, end) %>%
  summarise(across(ymin:ymax, mean)) %>%
  ungroup()

plot1 <- test %>%
  group_by(party, end) %>%
  summarise(across(ymin:ymax, mean)) %>%
  ggplot(aes(x = end)) +
  geom_line(aes(y = median, group = party, color = party), linewidth = 1.2) +
  scale_color_manual(values = c("#00AEF3", "#E81B23")) +
  scale_fill_manual(values = c("#00AEF3", "#E81B23")) +
  scale_y_continuous(limits = c(30, 60)) +
  geom_hline(yintercept = 30, color = 'grey20') +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_line(color = "grey90", linewidth = 1),
    panel.grid.minor.y = element_line(color = "grey90", linewidth = 1),
    axis.ticks.x = element_line(lineend = "round", linewidth = 1, color = 'grey50'),
    axis.title = element_blank(),
    axis.text = element_text(size = 10, color = 'grey50', face = 'bold'),
    axis.text.y = element_text(vjust = -0.5)
  )

plot1

# labels for end points
end_labels <- test %>%
  filter(end == max(end)) %>%
  group_by(party) %>%
  slice(1)

# w/o error bars
plot1 +
  geom_point(data = all_polls_df, aes(x = end, y = pct, color = party, fill = party),
             shape = 21, size = 2, alpha = .2)

# add point sizes and label
plot1 +
  geom_point(data = all_polls_df, aes(x = end, y = pct, color = party, fill = party, size = pop), alpha = .2) +
  geom_point(data = end_labels, aes(x = end, y = median, color = party), size = 2.5) +
  geom_label(data = end_labels, aes(x = end, y = median, label = round(median, 1), fill = party),
             color = 'white', fontface = 'bold', nudge_x = 5, nudge_y = c(-.75, .75), size = 3.2)

# facet plots
plot1 +
  geom_point(data = all_polls_df, aes(x = end, y = pct, color = party, fill = party, size = pop), alpha = .2) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax, group = party, fill = party), color = 'white', alpha = .2)

# difference on max data
raw_dem <- stack_extract_posterior_predictions_naive(fit2.1, fit2.1s, weights = weights, return = "raw")
raw_gop <- stack_extract_posterior_predictions_naive(fit2.2, fit2.2s, weights = weights, return = "raw")

# compute dem margin based on most recent 10 days worth of polls
dem_margin <- raw_dem / (raw_dem + raw_gop)

max_date <- max(pred_dem$end)
date_idx <- seq.Date(from = max_date, by = "-1 day", length.out = 10)
T <- as.numeric(rownames(pred_dem[pred_dem$end %in% date_idx, ]))

mean(apply(dem_margin[, T], 1, function(x) mean(x > .5)))

# 95% CI and mean
# predicted share of 2-party vote
mean(apply(dem_margin[, T], 1, quantile, probs = c(0.025)))
mean(apply(dem_margin[, T], 1, mean))
mean(apply(dem_margin[, T], 1, quantile, probs = c(0.975)))

# predicted % times that dem candidate wins popular vote
# NOT the same as winning the election !!!
mean(dem_margin[, T] > .5)
hist(dem_margin)