Zero/one-inflated Beta regression

Key points
  • The Beta distribution represents a proportion between 0 and 1, but does not include 0 or 1. Zero/one-inflated Beta regression models separate processes for the 0s and 1s

Code
knitr::opts_chunk$set(fig.align = "center", fig.retina = 3)

library(tidyverse)
library(marginaleffects)
library(brms)
library(tidybayes)
library(parameters)
library(tinytable)
library(patchwork)
library(extraDistr)
library(ggridges)
library(scales)
library(betareg)

# Data via the WHO via Kaggle
# https://www.kaggle.com/datasets/lsind18/who-immunization-coverage
tetanus <- readRDS("data/tetanus_pab.rds")
tetanus_2020 <- tetanus |> 
  filter(year == 2020) |> 
  # Cheat a little
  mutate(prop_pab = ifelse(prop_pab >= 0.97, 1, prop_pab)) |> 
  # Reverse this
  mutate(prop_unvacc = 1 - prop_pab)

theme_set(theme_minimal())

options(
  digits = 3, width = 120,
  tinytable_tt_digits = 2
)

Distribution intuition

The Beta distribution is great for proportions, but it doesn’t work with values that are exactly 0 or 1.

To get around this, we can use a special zero-inflated Beta regression. We’ll still model the \(\mu\) (mu) and \(\phi\) (phi) (or mean and precision) of the beta distribution, but we’ll also model one new special parameter \(\pi\) (pi; not 3.14, though). With zero-inflated regression, we’re actually modelling a mixture of data-generating processes:

  1. A logistic regression model that predicts if an outcome is 0 or not, defined by \(\pi\)
  2. A Beta regression model that predicts if an outcome is between 0 and 1 if it’s not zero, defined by \(\mu\) and \(\phi\)

In general, the model looks like this:

\[ \begin{aligned} Y_i &\sim \operatorname{Zero-inflated\ Beta}(\pi_i, \mu_i, \phi_i) \\ \operatorname{logit}(\pi_i) &= \gamma_0 + \gamma \mathbf{X}_i & \text{on/off process} \\ \operatorname{logit}(\mu_i) &= \beta_0 + \beta \mathbf{X}_i & \gt 0 \text{ process} \\ \log(\phi_i) &= \eta_0 + \eta \mathbf{X}_i & \text{dispersion} \end{aligned} \]

Example: Modeling the proprotion of tetanus non-vaccinations

To illustrate this, we’ll look at 2020 PAB tetanus vaccination rates, since far more countries have reached or are approaching universal coverage. We’ll also cheat a little, since only 3 countries are actually at 100%—for the sake of this example, if a country is at 97% or more, we’ll consider that 100%.

Also, we’ll look at the reverse of the proportion, or the proportion of unvaccinated, since there is no one-inflated regression built into {brms}. Their suggestion is to either flip the value (like we’re doing here) or use zero-one-inflated regression and tell the zero-focused process there to be constant. Flipping the value is easier, so we’ll do that.

Let’s first see how many 0s we’re working with:

Code
tetanus_2020 |> 
  count(prop_unvacc == 0) |> 
  mutate(prop = n / sum(n))
# A tibble: 2 × 3
  `prop_unvacc == 0`     n  prop
  <lgl>              <int> <dbl>
1 FALSE                 84 0.792
2 TRUE                  22 0.208

20% of the countries are fully vaccinated here (or have 0% unvaccinated kids). The 80% of countries with some unvaccinated children still follows a Beta distribution:

Code
tetanus_2020 |> 
  mutate(is_zero = prop_unvacc == 0) |> 
  mutate(prop_unvacc = ifelse(is_zero, -0.01, prop_unvacc)) |> 
  ggplot(aes(x = prop_unvacc, fill = is_zero)) +
  geom_histogram(binwidth = 0.04, boundary = 0, color = "white") +
  geom_vline(xintercept = 0) +
  scale_x_continuous(labels = label_percent()) +
  scale_fill_viridis_d(
    option = "plasma", end = 0.9,
    guide = guide_legend(reverse = TRUE)
  ) +
  labs(
    x = "Proportion of unvaccinated children", 
    y = "Count", fill = "Is zero?"
  ) +
  theme(legend.position = "bottom")

The only difference between regular Beta regression and zero-inflated Beta regression is that we have to specify one more parameter: zi. This corresponds to the \(\pi\) parameter and determines the zero/not-zero process.

To help with the intuition, we’ll first run a model where we don’t actually define a model for zi—it’ll just return the intercept for the \(\pi\) parameter.

Code
priors <- c(
  set_prior("student_t(3, 0, 2.5)", class = "Intercept"),
  set_prior("normal(0, 1)", class = "b"),
  set_prior("logistic(0, 1", class = "Intercept", dpar = "zi")
)
Code
# Here we log gdp_per_cap because Stan really doesn't like big numbers
model_beta_zi_int_only <- brm(
  bf(
    prop_unvacc ~ log(gdp_per_cap) + region,
    phi ~ 1,
    zi ~ 1
  ),
  data = tetanus_2020,
  family = zero_inflated_beta(),
  prior = priors,
  chains = 4, iter = 2000, seed = 1234,
  file = "models/model_beta_zi_int_only"
)
Code
model_parameters(model_beta_zi_int_only, verbose = FALSE) |> tt()
tinytable_kq2ypdvwd1hp2i6locyv
Parameter Component Median CI CI_low CI_high pd Rhat ESS
b_Intercept conditional -0.8 0.95 -2.29 0.694 0.85 1 3140
b_phi_Intercept conditional 2.99 0.95 2.65 3.286 1 1 4220
b_loggdp_per_cap conditional -0.16 0.95 -0.34 0.021 0.96 1 3252
b_regionEurope&CentralAsia conditional -0.44 0.95 -1.51 0.446 0.83 1 3071
b_regionLatinAmerica&Caribbean conditional -0.13 0.95 -0.62 0.357 0.7 1 3356
b_regionMiddleEast&NorthAfrica conditional 0.23 0.95 -0.34 0.741 0.79 1 3168
b_regionSouthAsia conditional 0.3 0.95 -0.32 0.84 0.83 1 3280
b_regionSubMSaharanAfrica conditional 0.24 0.95 -0.15 0.662 0.89 1 2752
b_zi_Intercept zero_inflated -1.39 0.95 -1.92 -0.921 1 1 3948

We now have a new parameter here, b_zi_Intercept, or the \(\pi\) in the model. It’s on the logit scale, so we can back-transform it with plogis():

Code
plogis(-1.39)
[1] 0.199

That 20%ish represents the number of rows in the dataset with 0 unvaccinated children, which is what we found earlier!

For now, we’ve only done an intercept-only model, but we can model the exact 0/not-0 process. There are probably regional differences (and probably lots of other things) in whether a country is fully vaccinated, so we can include that in the model:

Code
model_beta_zi <- brm(
  bf(
    prop_unvacc ~ log(gdp_per_cap) + region,
    phi ~ 1,
    zi ~ region
  ),
  data = tetanus_2020,
  family = zero_inflated_beta(),
  prior = priors,
  chains = 4, iter = 2000, seed = 1234,
  file = "models/model_beta_zi"
)

We can look at the individual parts of this model. Here are the posterior predictions of 0s across each region:

Code
# With {marginaleffects}
model_beta_zi |> 
  plot_predictions(condition = c("region", "region"), dpar = "zi") +
  scale_y_continuous(labels = label_percent()) +
  labs(x = NULL, y = "Proportion unvaccinated") +
  scale_color_viridis_d(option = "plasma", end = 0.9, guide = "none") +
  coord_flip() 

Code
# With {tidybayes}
model_beta_zi |> 
  linpred_draws(
    newdata = data.frame(gdp_per_cap = 10000, region = unique(tetanus_2020$region)),
    dpar = "zi", transform = TRUE
  ) |> 
  ggplot(aes(y = region, x = zi)) +
  stat_halfeye(aes(fill = region), normalize = "xy") +
  scale_fill_viridis_d(option = "plasma", end = 0.9, guide = "none") +
  scale_x_continuous(labels = label_percent()) +
  labs(y = NULL, x = "Proportion unvaccinated")

All the regular {marginaleffects} and {tidybayes} functions work too, either on individual parts of the model (with the dpar argument), or on both combined (without any extra arguments). Like this—we can visualize the posterior distributions of the specific marginal effects while also incorporating the 0 process!:

Code
model_beta_zi |> 
  slopes(
    newdata = datagrid(gdp_per_cap = c(1000, 10000, 25000), region = unique),
    variables = "gdp_per_cap"
  ) |> 
  posterior_draws() |> 
  mutate(draw = draw * 1000) |> 
  ggplot(aes(x = draw, y = factor(gdp_per_cap), fill = region)) +
  stat_halfeye(normalize = "xy") +
  scale_x_continuous(labels = label_number(scale = 100, suffix = " pp.")) +
  scale_fill_viridis_d(option = "plasma", end = 0.9, guide = "none") +
  facet_wrap(vars(region), ncol = 1) +
  labs(
    x = "Percentage point change in proportion of PAB non-vaccination", 
    y = "GDP per capita"
  )