Non-Flat Implications of Flat Priors

Motivation

Many researchers, when they’re introduced to Bayesian methods, are nervous about the possibility that their prior distributions will corrupt their posterior inferences. Since they learn that the posterior distribution is, roughly speaking, a precision-weighted average of the prior and the data (or “likelihood”), it initially makes sense to err toward a flatter, more diffuse prior density for model parameters. These diffuse densities let the model put relatively more weight on the data, which feels safer.

The purpose of this post is to highlight a few areas where this “default tendency” to use flat priors runs into unexpected consequences. We show how functions of model parameters have implied priors: density functions of their own that inherit the prior uncertainty about the parameters that compose the function. These implied priors can have strange shapes that you wouldn’t anticipate based on the raw parameters.1 We then show two cases where these strange shapes appear. The first case comes from my own published work on voter ID in Wisconsin. The second case is a hypothetical experiment where we “accidentally” create a non-flat prior for the treatment effect in a randomized experiment where we weren’t expecting it.

Together, these exercises give concrete examples of how “flat” and “uninformative” don’t necessarily mean the same thing.

Implied priors

We’ll begin with the notion of an implied prior. Given some random variable \(\theta\), we can construct a function \(f\left(\theta\right)\), which is necessarily a random variable as well. If \(\theta\) has a probability distribution \(p\left(\theta\right)\), then \(f\left(\theta\right)\) has some probability distribution \(p\left(f\left(\theta\right)\right)\).

A simple example. Imagine a standard normal variable \(\nu\) with a \(\mathrm{Normal}\left(0, 1\right)\) prior. If we define a function \(f(\nu) = \mu + \sigma\nu\), then \(f(\nu)\) has a probability distribution of its own. In this case it is straightforward to see that the implied prior is \(\mathrm{Normal}(\mu, \sigma)\), but in more complicated examples it won’t be so easy to glean the implied prior directly. We can simulate this example in code, setting \(\mu = 4\) and \(\sigma = 2\), to reassure you that I’m telling the truth.

library("tidyverse")
library("hrbrthemes")
library("latex2exp")
library("viridisLite")

# set a global ggplot theme for the post
theme_ipsum(base_family = "Fira Sans") %+replace%
  theme(
    panel.grid.minor = element_blank(),
    axis.title.x.bottom = element_text(
      margin = margin(t = 0.35, unit = "cm"),
      size = rel(1.5)
    )
  ) %>%
  theme_set()

# accent fill color for plots
accent <- viridis(n = 1, begin = 0.5, end = 0.5)
# set mu and sigma values
mu <- 4
sigma <- 2

# simulate nu and f(nu)
normal_example <- 
  tibble(
    nu = rnorm(100000),
    f_nu = mu + (nu * sigma)
  ) %>%
  print(n = 4) 
## # A tibble: 100,000 x 2
##       nu  f_nu
##    <dbl> <dbl>
## 1 -1.02  1.95 
## 2  0.908 5.82 
## 3 -1.72  0.556
## 4  0.184 4.37 
## # … with 99,996 more rows

# implied prior is Normal(mu, sigma)
ggplot(normal_example) +
  aes(x = f_nu) +
  geom_histogram(binwidth = .1, fill = accent) +
  geom_vline(
    xintercept = c(mu - sigma, mu + sigma),
    linetype = "dashed",
    size = 0.25
  ) +
  geom_vline(xintercept = mu) +
  scale_x_continuous(
    breaks = seq(mu - 5*sigma, mu + 5*sigma, sigma)
  ) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Implied Prior for Transformed Normal",
    subtitle = "Histogram of prior samples",
    x = TeX("$f(\\nu) = \\mu + \\sigma\\nu, \\; \\nu \\sim N(0, 1)$"),
    y = NULL
  )

Bayesians will recognize this as a “non-centered parameterization” of a Normal distribution: a normal distribution that moves the mean and standard deviation outside of the random variable itself. Bayesian modelers invoke this trick all the time in hierarchical models, since sampling \(\nu\), \(\mu\), and \(\sigma\) separately is easier for a computer than sampling a distribution for \(f(\nu)\) that itself contains \(\mu\) and \(\sigma\). Parameterizations that de-correlate these parameters are generally easier to sample and, conveniently, more manageable to set priors for.
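To connect this to the hierarchical setting, here is a small sketch of my own (a toy example, not tied to any particular model) that builds many group effects from shared \(\mu\) and \(\sigma\) plus standardized offsets, and checks that this implies the same \(\mathrm{Normal}(\mu, \sigma)\) distribution as drawing each effect directly.

# toy sketch: centered vs. non-centered group effects
J <- 100000

# centered: draw each group effect directly from Normal(mu, sigma)
theta_centered <- rnorm(J, mean = mu, sd = sigma)

# non-centered: draw standardized offsets, then shift and scale
nu_j <- rnorm(J)
theta_noncentered <- mu + sigma * nu_j

# both versions imply the same Normal(mu, sigma) distribution
c(mean(theta_centered), sd(theta_centered))
c(mean(theta_noncentered), sd(theta_noncentered))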

Flat priors meet nonlinear transformations

Suppose a parameter \(\pi\) has a flat prior on the \([0, 1]\) interval, and we calculate some function \(g(\pi)\). Will \(g(\pi)\) have a flat distribution? It depends.
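Before a real example, a quick toy illustration of my own (the transformations are arbitrary): a linear rescaling of a flat prior stays flat, while a nonlinear transformation does not.

# toy illustration: linear vs. nonlinear functions of a flat prior
tibble(
  pi = rbeta(100000, 1, 1),
  `linear: 2*pi - 1` = 2 * pi - 1,  # still flat, just stretched to [-1, 1]
  `nonlinear: pi^2` = pi^2          # piles prior mass near zero
) %>%
  pivot_longer(-pi, names_to = "transform", values_to = "value") %>%
  ggplot() +
  aes(x = value) +
  geom_histogram(bins = 50, fill = accent) +
  facet_wrap(~ transform, scales = "free_x") +
  labs(x = TeX("$g(\\pi)$"), y = NULL)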

I encountered an example of this in my work with Ken Mayer on voter ID in Wisconsin,2 although I didn’t put this lesson about prior flatness in the paper. We wanted to estimate the number of eligible registered voters in two Wisconsin counties for whom Wisconsin’s voter ID requirement hindered their voting in 2016. This estimate contains three ingredients.

  • \(N\): the number of individual records in the registered voter file for the target population. This is an observed constant.
  • \(\epsilon\): the proportion of the population in the voter file that was eligible to vote in 2016. This matters because voter files contain individuals who moved, died, or were otherwise ineligible, and those records have to be cleaned out of the files periodically. The population size must be penalized by \(\epsilon\) to remove these ineligible records from our estimate. This is a random variable, estimated by coding a finite sample of the voter file.
  • \(\pi\): the proportion of eligible registrants who experienced ID-related difficulty voting. This is a random variable, estimated using a survey sample of registrants in the voter file.

The quantity we want to estimate is \(N\times \epsilon \times \pi\), an eligibility-penalized population estimate for the number of voters affected by the voter ID requirement. Suppose that we know \(N = 229,625\) from the voter file, but \(\pi\) and \(\epsilon\) are proportions that must be estimated. We give each proportion a flat \(\mathrm{Beta}(1, 1)\) prior on the \([0, 1]\) interval. What is our implied prior for the population estimate?

# simulate flat priors for pi and epsilon, then form the population estimate
tibble(
  pi = rbeta(10000, 1, 1),
  epsilon = rbeta(10000, 1, 1),
  N = 229625,
  pop_estimate = N * epsilon * pi
) %>%
  ggplot() +
  aes(x = pop_estimate) +
  geom_histogram(fill = accent, bins = 100, boundary = 0) +
  labs(
    title = "Implied Prior for Population Estimate",
    subtitle = TeX("Histogram of prior simulations"),
    x = TeX("Population estimate $= N \\epsilon \\pi$"),
    y = NULL
  ) +
  scale_x_continuous(labels = scales::comma)

Seriously, what? If I had plopped this graphic into my paper and said that it was my prior for this population quantity, I would have been in trouble. Look at how swoopy that prior is! How can that be an uninformative prior? Well, we know that the random components have vague priors, so this is the natural result of sending these parameters through a nonlinear function. If you don’t like it, you should think about how this quantity is parameterized and whether some other priors make more sense.

Bayesians may recognize this feature of prior distributions as well, where nonlinear functions of parameters have a density that does not simply reflect a shifting or scaling of the original density. This happens because nonlinear transformations of parameters can “squish” or “stretch” probability mass into different areas/volumes than they were previously, thereby changing probability density. If we wanted to write out the new density, we would need to start with the old density and use (spooky voice) the Jacobian.
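For the one-dimensional, monotone case, the change-of-variables formula makes this precise: if \(y = g(\theta)\) with \(g\) monotone, then \[\begin{align} p_{Y}(y) = p_{\theta}\left(g^{-1}(y)\right) \left| \frac{d}{dy} g^{-1}(y) \right| \end{align}\] For example, if \(\pi \sim \mathrm{Uniform}(0, 1)\) and \(y = \pi^{2}\), then \(g^{-1}(y) = \sqrt{y}\) and \(p_{Y}(y) = \frac{1}{2\sqrt{y}}\) on \((0, 1]\), which piles density near zero even though the original prior was flat. (For multivariate transformations, the multiplier becomes the absolute determinant of the Jacobian matrix, hence the name.)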

Honey, I shrunk my treatment effect

When we think about causal inference, we are thinking about methods that aim to be light on their assumptions. If we imagine a Bayesian interpretation of an experiment (even if we don’t specify the Bayesian model per se), it makes sense that we want vague priors on important quantities in order to “let the data speak” instead of deriving results from the prior. This turns out to be less straightforward than you might think.

Imagine an experiment where we treat individuals with an advertisement or we don’t, \(z_{i} \in \{0, 1\}\), and then we measure whether they intend to vote for the Democratic candidate or not, \(y_{i} \in \{0, 1\}\). My treatment effect \(\tau\) is the comparison between the Democratic vote proportion in the control group, \(\mu_{z = 0}\), and the Democratic vote proportion in the treatment group, \(\mu_{z = 1}\). \[\begin{align} \tau = \mu_{1} - \mu_{0} \end{align}\]

If we estimate this effect with a linear model, we have a choice about how to parameterize the regression function. We could use a constant and a treatment effect with error term \(u_{i}\), \[\begin{align} y_{i} = \mu_{0} + \tau z_{i} + u_{i} \end{align}\] or we could use a separate intercept for each condition. \[\begin{align} y_{i} = \mu_{z[i]} + u_{i} \end{align}\] Bayesians think it makes more sense to use the second parameterization. Why? If we want to set the same prior on both groups, it’s easier to do that when we can directly set priors on each mean instead of a prior on one mean and then a prior on the difference in means. So let’s take that approach, giving each group’s vote share a flat prior that says any value is a priori equally likely. I will write these as flat Beta densities, but you could imagine them as standard Uniform densities as well. \[\begin{align} \mu_{1}, \mu_{0} \sim \mathrm{Beta}(1, 1) \end{align}\] What is the prior for the treatment effect (the difference-in-means)? Not flat! After a quick aside below on what these two parameterizations look like in R, we will again simulate to see the effect of combining parameters into a single function.
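As that aside (a sketch of my own with made-up simulated data, not part of any real analysis), the two parameterizations map directly onto R formula syntax; the fitted values are identical either way, and the difference is only which quantities get coefficients and, in a Bayesian model, which quantities get priors.

# sketch: the same linear model, parameterized two ways
# (hypothetical simulated data, for illustration only)
experiment <- tibble(
  z = rbinom(1000, size = 1, prob = 0.5),
  y = rbinom(1000, size = 1, prob = 0.5 + 0.05 * z)
)

# parameterization 1: an intercept (mu_0) and a treatment effect (tau)
lm(y ~ z, data = experiment)

# parameterization 2: a separate mean for each condition (mu_0 and mu_1)
lm(y ~ 0 + factor(z), data = experiment)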

# simulate means and calculate difference
tibble(
  mu_0 = rbeta(100000, 1, 1), 
  mu_1 = rbeta(100000, 1, 1),
  trt = mu_1 - mu_0 
) %>%
  ggplot() +
  aes(x = trt) +
  geom_histogram(fill = accent, binwidth = .05, boundary = 0) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    title = "Implied Prior for Treatment Effect",
    subtitle = "Histogram of prior samples",
    x = "Difference in means",
    y = NULL
  )

What happened to my vague prior beliefs? Why do I have a non-flat prior for a quantity I thought I wanted vague information about?3

It is still a vague prior, but we’re wrong to expect it to be flat. Why? Well, averaging over my prior uncertainty in both groups, my expected difference in means ought to be zero (natch). But more than that, the reason we get a mode at zero is that there are many more ways to produce differences near zero from my raw means than ways to produce differences far from zero. The only way to get big differences (near \(-1\) or \(1\)) is for both means to land near opposite ends of the interval at the same time, which isn’t likely to happen randomly with a flat prior on each group mean. When we think about the treatment effect prior in this way, we can understand why this actually feels less informed than a direct flat prior on the treatment effect. Putting a flat prior on the treatment effect says that we think big differences are just as likely as small differences. That is like a prior that says my group means should be negatively correlated, effectively upweighting bigger differences relative to what we’d otherwise expect. Weird! I’d rather set reasonable priors for my means and let my treatment prior do what it do.
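In fact, the implied prior here has a known closed form: the difference of two independent \(\mathrm{Uniform}(0, 1)\) (equivalently, \(\mathrm{Beta}(1, 1)\)) variables follows a symmetric triangular distribution on \([-1, 1]\), \[\begin{align} p(\tau) = 1 - \left|\tau\right|, \quad \tau \in [-1, 1] \end{align}\] Under this prior, 75 percent of the mass falls in \(|\tau| \leq 0.5\), compared to 50 percent under a flat prior on the difference itself, which is exactly the sense in which a flat treatment effect prior upweights big differences.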

Flat priors often have non-flat implications

These implications feel strange at first, but they are all around us whether or not we notice them. The flatness of a prior (or any shape, flat or not) is a relative feature of a model parameterization or a quantity of interest, not an absolute one. Inasmuch as we believe priors are at work even when we don’t want to think about them (i.e. we accept Bayesian models as generalizations of likelihood models), we should respect how reparameterizing a likelihood affects which parameters are exposed to the researcher, and which spaces those parameters are defined in. We should know that flat doesn’t imply uninformative, and that non-flat doesn’t imply informative. What we’re seeing here is that flatness begets non-flatness in tons of circumstances, and that’s totally ordinary. These examples also show how prior predictive checks reveal what our model thinks about key quantities of interest.


  1. These strange shapes tend to be extremely useful in practice. For example, it is straightforward to create Bayesian versions of “L1” and “L2” regularization by combining parameters with particular densities. Topic for a future post, maybe.

  2. Or see the published version.

  3. You’ve probably seen this phenomenon before in your stats education: if you keep adding and subtracting more uniform variables, the result approaches a Normal distribution.
