An Outsider’s Perspective On Media Mix Modelling

Media Mix Modelling

I’m trying something a bit new this time. Typically how I learn is that I see something interesting (either in a blog post, an academic article, or through something a co-worker is working on). I’ll then try and work through the problem via code on my own to see how I can make it work. It’s not always perfect, but it gets me started. Today I’m going to go out of my comfort zone and try my hand at Media Mix Modelling (MMM).

In general, the stated goal of MMM is to determine the optimal distribution of advertising money, given \(n\) different venues. For instance, this could determine how much to spend on internet, TV, or radio advertising given the costs of running ads and the expected return for each venue. Typically this is done using a regression to try and parse out the effect of each type of advertising net of many other factors (e.g. seasonal and trend effects, costs of the product, demand, etc…).

Getting some data

Getting reliable open-source data for MMM is actually a bit more difficult than you think. There are a number of very trivial simulated datasets scattered about on places like Kaggle, but these aren’t terribly useful. I was able to find a strange mostly undocumented set of data from a git repo here. Per the author, the data purports:

“…data contain information on demand, sales, supply, POS data, advertisiment expenditure and different impressions recorded across multiple channels for calculating the advertising campaign effectiveness such as Mobile SMS, Newspaper Ads, Radio, TV, Poster, Internet etc.in the form of GRP (Gross Rating Point) in Shenzhen city”

Good enough for me.

The Adstock Function

The paper Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects is a pretty clear and concise introduction to media mix modelling. This paper is a pretty good introduction into many of the issues related to MMM. One of the biggest issues is the need to transform the ad spend variables to better represent how they behave in real life. For example, we don’t necessarily expect an ad campaign to have an instantaneous lift, nor do we expect its effect to end immediately either. Hence, the need to model appropriate carryover effects.

To do this I’ll borrow a function Kylie Fu to calculate the weights for the delayed adstock function. The goal here is to define a function that can transform the ad spend variables to reflect our belief that they have delays of decay, delays of peak effect, and an upper maximum carryover effect. Below the function takes a vector of ad spend data and creates the transformation given the parameters lambda, theta, and L.

DelayedSimpleAdstock <- function(advertising, lambda, theta, L){
  N <- length(advertising)
  weights <- matrix(0, N, N)
  for (i in 1:N){
    for (j in 1:N){
      k = i - j
      if (k < L && k >= 0){
        weights[i, j] = lambda ** ((k - theta) ** 2)
      }
    }
  }
  
  adstock <- as.numeric(weights %*% matrix(advertising))
  
  return(adstock)  
}

Setting up an adstock transformation

Now we can choose the parameters for the adstock transformation. Ideally, we want a transformation that captures the decay of the advertising program (lambda), its delayed peak onset (theta), and its maximum effect duration (L). With a bit of simulation we can see what each parameter does across a value of different lags. The goal here is to have a function that matches what we believe the actual effect of ad spending looks like for different media regions (e.g. TV, radio, internet). Below, we can see that increasing lambda increases the decay of the effect up, while varying theta sets the peak onset of the ad campaign to later lags. The value of L simply sets the maximum effect to a specific lag.

# set up grid of params to iterate over

x <- c(1, rep(0, 15))
lambda = seq(0,1, by = .1)
theta = seq(0,10, by = 1)
L = seq(1,12, by = 1)

Based on a visual assesment I just chose an adstock function with a lambda of .8 (suggesting moderate decay of the initial ad effect), a theta of 2 (implying a peak onset of 2 weeks), and an L of 13 which is a rule-of-thumb that makes the maximum effect quite large.

Adstock function (Lambda = .8, theta = 2, L = 13)

The code below applies our adstock function to each of the spend variables. For simplicity here I am making the assumption that all of the modes have similar adstock functions, but this can (and should) vary per modality based on expert prior information. We convert the daily data (which is quite noisy) to a more commonly utilized weekly format. We then limit the focus of our analysis to a 4-year time span.

Code

# setup weekly data
# setup weekly data
mmm_weekly_data <-
  mmm_raw %>%
  mutate(date = as.Date(DATE, "%m/%d/%Y"),
         year = year(date),
         month = month(date),
         week = week(date)) %>%
  select(
    date,
    year,
    month,
    week,
    sales = `SALES ($)`,
    spend_sms = `Advertising Expenses (SMS)`,
    spend_news = `Advertising Expenses(Newspaper ads)`,
    spend_radio = `Advertising Expenses(Radio)`,
    spend_tv = `Advertising Expenses(TV)`,
    spend_net = `Advertising Expenses(Internet)`,
    demand = DEMAND,
    supply = `POS/ Supply Data`,
    price = `Unit Price ($)`
  ) %>%
  filter(year %in% 2014:2017)


weekly_spend <-
  mmm_weekly_data %>%
  group_by(year, month, week) %>%
  summarise(across(sales:spend_net, sum), across(demand:price, mean), .groups = 'drop') %>%
  mutate(index = 1:nrow(.))


# Apply transformation to advertising variables, scale dollar values to per $1,000
X <-
  weekly_spend %>%
  mutate(across(spend_sms:spend_net,~ DelayedSimpleAdstock(.,lambda = .8,theta = 2,L = 13)),
         across(spend_sms:price, function(x) x/1e3),
         trend = 1:nrow(.)/nrow(.))

Setting up the model

Before we fit the model we can plot out the primary variables of interest, along with our dependent variable sales. Looking below we can see a few potential issues. One which should jump out immediately is that there is a very high correlation between several of our ad spend categories. For example, the correlation between TV spending and news spending is almost 1. In the case of MMM this is a common problem, which makes unique identification of the effect of ad spends much more difficult. More troubling, by just eyeballing these plots there doesn’t seem to be a terribly strong relationship between any of the advertising venues and sales.

plot(X[, c('sales',
           'spend_sms',
           'spend_news',
           'spend_radio',
           'spend_tv',
           'spend_net')], col = '#004488')

Pairwise relationships between sales ~ ad venues

Nor do the pairwise correlations seem to be very high either (in fact, they are very nearly zero). Regardless, we’ll continue by fitting a simple set of models.

Code

round(cor(X[, c('sales',
           'spend_sms',
           'spend_news',
           'spend_radio',
           'spend_tv',
           'spend_net')]),2)

            sales spend_sms spend_news spend_radio spend_tv spend_net
sales        1.00     -0.01      -0.04       -0.03    -0.04     -0.07
spend_sms   -0.01      1.00       0.89        0.91     0.90     -0.06
spend_news  -0.04      0.89       1.00        0.86     1.00      0.30
spend_radio -0.03      0.91       0.86        1.00     0.87     -0.09
spend_tv    -0.04      0.90       1.00        0.87     1.00      0.28
spend_net   -0.07     -0.06       0.30       -0.09     0.28      1.00

Fitting A Model

One of the biggest challenges with MMM is that many of the model coefficients including the advertising venues, are likely to be very highly correlated. For example, the advertising spend on TV ads is almost perfectly correlated with the spend on news ads. We can set some moderately strong priors on the ad spend coefficients to ensure that the estimates don’t explode due to multicollinearity. Here I’m just placing a normal(0, .5) prior which is still pretty wide for a coefficient on the logrithmic scale.

Code

# set up priors
bprior <- c(prior(normal(0,.5), class = "b", coef = "spend_sms"),
            prior(normal(0,.5), class = "b", coef = "spend_news"),
            prior(normal(0,.5), class = "b", coef = "spend_radio"),
            prior(normal(0,.5), class = "b", coef = "spend_tv"),
            prior(normal(0,.5), class = "b", coef = "spend_net"))

# just intercept
brm_fit_0 <- brm(log(sales) ~ 1, data = X,
                 chains = 4,
                 cores = 4,
                 family = gaussian(),
                 file = "C:/Users/gioc4/Documents/blog/data/brms_models/mmm_fit0.Rdata")

# no random effects, linear trend effect
brm_fit_1 <- brm(log(sales) ~                 
                   demand +
                   supply +
                   price +
                   as.factor(month) +
                   as.factor(week) +
                   trend +
                   spend_sms +
                   spend_news +
                   spend_radio +
                   spend_tv +
                   spend_net,
                 data = X,
                 chains = 4,
                 cores = 4,
                 prior = bprior,
                 family = gaussian(),
                 file = "C:/Users/gioc4/Documents/blog/data/brms_models/mmm_fit1.Rdata")

# random effects for month+week
# smoothing spline for trend
brm_fit_2 <- brm(log(sales) ~                 
                   demand +
                   supply +
                   price +
                   s(trend) +
                   spend_sms +
                   spend_news +
                   spend_radio +
                   spend_tv +
                   spend_net +
                   (1|month) +
                   (1|week),
                 data = X,
                 chains = 4,
                 cores = 4,
                 prior = bprior,
                 family = gaussian(),
                 control = list(adapt_delta = .9),
                 file = "C:/Users/gioc4/Documents/blog/data/brms_models/mmm_fit2.Rdata")

Model Evaluation

Now we can evaluate the models. Here, I evaluate a model with random effects for month and week, and a smoothing spline for the trend component against a fixed effects model with a linear trend, and a “null” model with just an intercept. The model with a smooth trend spline appears to beat out the linear trend, which makes sense given the non-linear bump in sales observed in the raw data.

Code

# evaluate models using leave-out-out criterion
# looks like smoothing spline is slightly better
loo_eval <- loo(brm_fit_0, brm_fit_1, brm_fit_2)

Warning: Found 12 observations with a pareto_k > 0.7 in model 'brm_fit_1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

Warning: Found 15 observations with a pareto_k > 0.7 in model 'brm_fit_2'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

Code

loo_eval$diffs

          elpd_diff se_diff
brm_fit_2    0.0       0.0 
brm_fit_1  -16.2       8.3 
brm_fit_0 -291.7      18.9

We can (and should) also check out the predictions from the model using a posterior predictive check. In Bayesian terms what this means is we take a sample of draws from our fitted model and compare them against the observed data. If our model is capturing the process well, the predicted values should generally follow the observed process. Below we see that our model does a fairly decent job.

Code

pp_check(brm_fit_2)

Using 10 posterior draws for ppc type 'dens_overlay' by default.

Posterior predictive check, smoothing spline model

Finally, we can look at some of the model coefficients

summary(brm_fit_2)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: log(sales) ~ demand + supply + price + s(trend) + spend_sms + spend_news + spend_radio + spend_tv + spend_net + (1 | month) + (1 | week) 
   Data: X (Number of observations: 197) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Smoothing Spline Hyperparameters:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(strend_1)     1.53      0.73     0.54     3.35 1.00     1739     2509

Multilevel Hyperparameters:
~month (Number of levels: 12) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.86      0.22     0.55     1.38 1.00     1413     2226

~week (Number of levels: 53) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.79      0.09     0.64     1.00 1.00      721     1142

Regression Coefficients:
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept      13.88      0.48    12.94    14.83 1.00     2447     3039
demand          0.01      0.03    -0.06     0.07 1.00     4078     3029
supply          0.19      0.04     0.12     0.26 1.00     4860     3017
price           2.88      0.99     0.83     4.86 1.00     4064     2908
spend_sms       0.12      0.17    -0.22     0.45 1.00     5927     3202
spend_news     -0.01      0.51    -1.01     0.99 1.00     6425     2879
spend_radio     0.18      0.14    -0.10     0.46 1.00     4534     2610
spend_tv       -0.02      0.02    -0.05     0.01 1.00     5170     2835
spend_net       0.00      0.00    -0.00     0.01 1.00     4820     3243
strend_1       -2.42      1.55    -5.84     0.34 1.00     2711     2636

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.15      0.01     0.13     0.17 1.00     1381     2410

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Model Predictions

And here’s the estimated lift for Radio by week. While there is definitely some lift, it is pretty tiny here

Code

# get predictions 
X_no_radio <- X %>% mutate(spend_radio= 0)

pred1 <- predict(brm_fit_2)
pred1_no_radio <- predict(brm_fit_2, newdata = X_no_radio)

pred_dataframe <-
  cbind.data.frame(week = 1:197,
                   obs = pred1[, 1],
                   pred = pred1_no_radio[, 1]) %>%
  pivot_longer(-week) %>%
  group_by(week) %>%
  mutate(diff = value - lead(value, 1))

# predicted (all) vs predicted (no radio)
ggplot(pred_dataframe) +
  geom_line(aes(x = week, y = value, color = name)) +
  theme_bw() +
  scale_color_manual(values = c("#0077BB", "#EE7733")) +
  labs(y = "(log) Sales", x = "Week") +
  theme(legend.position = 'none')
# predicted lift from radio
pred_dataframe %>%
  na.omit() %>%
  ggplot() +
  geom_line(aes(x = week, y = diff)) +
  theme_bw() +
  labs(y = "(log) Sales", x = "Week")

If we average the estimated mean lift across the entire time frame we get an additional value of about 2%.

mean((pred1[,1] - pred1_no_radio[,1])/pred1[,1])

[1] 0.02156586

At this point there is a lot of additional work that can be done. Most applied uses of MMM apply some optimization algorithms to determine the best ad spend mix given a fixed budget. The data I have here isn’t really good enough to delve any deeper into - but its important to note that fitting the model is really only the beginning.

In Closing: My Take

My biggest problem with Mixed Media Modelling is that it seems like it is easy to implement badly, but much harder to do well. Not only do you have to appropriately model an adstock function for your advertising venues, you also have very high correlation between your variables. This, in turn, makes the choice of model specification even more important because the coefficients with be highly sensitive. Personally, a true experiment or even quasi-experiment would be preferable to this - although I’m all too aware that this is often impossible. Like everything there is no magic bullet, and choosing one approach over another will always introduce trade offs.