Chapter 16 greta: Bayesian Updating And Probabilistic Statements About Posteriors

In the Bayesian Inference chapter, we calculated (by hand) a posterior distribution using a single data point about whether the one Chili’s store being tested increased sales. In this original Chili’s story, we limited ourselves to just two possible models of success:

  1. The Pessimist Model: \(\theta = 20\%\)
  2. The Optimist Model: \(\theta = 70\%\)
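As a quick refresher, that two-model update takes only a few lines of R. The equal prior weights and the single observed success below are illustrative assumptions for this sketch, not necessarily the exact numbers used in the earlier chapter:

## refresher sketch of the two-model update
## (prior weights and the observed success are assumptions for illustration)
prior = c(pessimist = 0.5, optimist = 0.5)       # P(model)
likelihood = c(pessimist = 0.2, optimist = 0.7)  # P(success | model)
posterior = prior * likelihood / sum(prior * likelihood)
posterior
## pessimist: about 0.22, optimist: about 0.78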

We subsequently represented the Chili’s narrative using the generative DAG of Figure 16.1, but we never did get a posterior distribution for this model. In this chapter, we use the computer to remedy this issue.

Figure 16.1: Modelling sales increase outcomes from renovating three Chili’s stores and accommodating all possible values for the probability of a sales increase.

16.1 Bayesian Updating With greta

Typically, you would go to https://greta-stats.org/ for more information about installing greta along with other necessary software like TensorFlow and DiagrammeR. Currently though, as of January 2020, there are some issues with that install process. Refer to the install greta chapter at http://causact.com/ for the most up-to-date version of this chapter.

Manually applying Bayes’ rule to calculate the posterior distribution over the infinitely many possible values of \(\theta\) is impossible; it would simply take too long. Instead, we will rely on the power of the R ecosystem to deliver a representative sample from the joint posterior distribution. In fact, the software we will use allows us tremendous flexibility, seemingly accommodating any generative DAG built from the most commonly used probability distributions. (In the next chapter, we will explore commonly used probability distributions and their roles as priors in simple models.) In other words, we as analysts supply a generative DAG (our model and prior) and some data. Then, the computer spits out the posterior - using Bayes’ rule behind the scenes to update our prior assumptions so that more plausibility is placed on model parameters that are consistent with the observed data.
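To build intuition for what this updating accomplishes (and not as a description of how greta works internally), here is a small sketch that approximates the posterior by applying Bayes’ rule over a fine grid of candidate \(\theta\) values; it assumes the made-up data introduced in Step 1 below, namely two successes out of three stores:

## grid approximation sketch (not how greta computes the posterior)
## assumes the Step 1 data below: 2 successes out of 3 stores
thetaGrid = seq(0, 1, by = 0.001)                   # candidate theta values
prior = dunif(thetaGrid)                            # uniform(0,1) prior density
likelihood = dbinom(2, size = 3, prob = thetaGrid)  # P(data | theta)
posterior = prior * likelihood /
  sum(prior * likelihood * 0.001)                   # rescale so area equals 1
plot(thetaGrid, posterior, type = "l",
     xlab = "theta", ylab = "density")              # density peaks near 2/3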

Figure 16.2: The greta logo. greta calculates joint posterior distributions.


To get the computer to do our work, we need to translate our data and generative DAG into a computational model (i.e. a computer-world model that calculates posterior distributions). The first way we will learn to do this is using the intuitive and fantastic greta package within R. The greta package was developed by Nick Golding, who maintains the https://greta-stats.org/ website where you can find further documentation. Here are the six steps needed to have greta yield a posterior distribution:

For Step 2, the following probability distribution names are recognized by greta: uniform, normal, lognormal, bernoulli, binomial, beta_binomial, negative_binomial, hypergeometric, poisson, gamma, inverse_gamma, weibull, exponential, pareto, student, laplace, beta, cauchy, chi_squared, logistic, f, multivariate_normal, multinomial, categorical, dirichlet, dirichlet_multinomial, wishart, lkj_correlation. Many of these will be introduced in subsequent chapters and all others can be found on Wikipedia. Do not let the fancy names scare you; they are just ways of compactly representing uncertainty.

Step 1 - Tell greta about your observed random variables (i.e. data) using as_data()

library(greta)
## assume first two stores are a success and 
## the third store is not (i.e. make up some data)
storeData = c(1,1,0)

## tell greta about your observed data X 
## recall that X represents sales increase data
x = as_data(storeData)
## you can safely ignore the plate for observed data
## R automatically knows how many observed realizations
## are in the data because of the storeData vector length

Step 2 - Tell greta about your prior for unobserved random variables

## tell greta about what values of theta seem
## plausible prior to observing the data
theta = uniform(0,1)
## note: greta requires you use the assignment operator (=)
## the ~ of the DAG is inferred because uniform is the name
## of a probability distribution
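Any distribution from the list of recognized names shown earlier can serve as a prior in exactly the same way. Two hedged illustrations, not part of this chapter's model, are shown here (argument names follow the greta documentation):

## not run: other priors are declared just like uniform()
# theta = beta(shape1 = 2, shape2 = 2)   # gentle peak at 0.5, still on [0,1]
# mu = normal(mean = 0, sd = 10)         # vague prior over the whole real line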

Step 3 - Tell greta how observed RVs from step 1 are generated using unobserved RVs from step 2

## use the distribution() function to say how the distribution
## of an observed random variable might depend on an unobserved 
## random variable.  This represents the likelihood in Bayesian inference.
distribution(x) = bernoulli(theta)

Step 4 - Define an R object that stores a complete greta model

# use the model() function to tell greta which random variables
# you want to analyze in the posterior distribution
fit = model(theta) 
# by convention, the object name fit 
# is often used to store model definitions

Step 5 - Verify the greta model looks okay

## use the plot() function to see how greta organized your model
plot(fit)

Figure 16.3: Verify model.

where the following legend is used to understand greta’s graphical model.

Figure 16.4: Legend for understanding a greta model.


Take a moment to notice that greta’s graphical model (Figure 16.3) is just a restating of the generative DAG in Figure 16.1. While I personally love greta’s graphical model, the generative DAG we have learned so far (e.g. Figure 16.1) is better suited for communicating with business stakeholders. The greta graphical model is simply too confusing for those not trained to read it; for us though, it is a pretty awesome check that our computational model matches our generative DAG!

Step 6 - Create a representative sample of the joint posterior distribution

## this function might take some time
## it runs for about 20 seconds on my machine
## it stores the posterior distribution
## in an object called draws
draws = mcmc(fit)
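By default, recent versions of greta run 4 chains of 1,000 samples each after a warmup period, yielding 4,000 draws in total. If you want a longer run, the mcmc() arguments documented on the greta website can be adjusted; a hedged example, not needed for this chapter:

## optional: request more draws (argument names per the greta documentation)
# moreDraws = mcmc(fit, n_samples = 2000, warmup = 1000, chains = 4)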

This completes the updating from prior to posterior. Assuming all ran without error, we can now query our representative sample of the posterior distribution. This sample is often referred to simply as the posterior distribution.

16.2 Investigating the Posterior Distribution

When investigating the posterior distribution, please note that your representative sample may differ slightly from the sample shown here. Because of some inherent randomness in the way the posterior distribution is generated, slight differences in our results are expected.

The posterior distribution is buried in the object we named draws. Think of a draw as an individual sample or realization from the posterior distribution; hence, draws is a collection of realizations that collectively serve as our representative sample of the posterior distribution. At this point, please realize the posterior distribution is only expressed as a representative sample (revisit the Representing Uncertainty chapter for a comparison of the different ways of expressing uncertainty) - that is the best we can do for all but a few special cases. Also note that unlike draw and draws, which are singular and plural respectively, the word sample can refer to either the collection of draws or just one single draw of the joint posterior distribution.

To see a representative sample of the posterior distribution, we access the R object, draws, created in step 6 above. Ignore the details of why the draws object is a list instead of a data frame and simply use as.matrix() along with the as_tibble() function from the tidyverse package to return a data frame:

library(tidyverse)
# extract samples as data frame
drawsDF = draws %>% as.matrix() %>% as_tibble()
drawsDF
## # A tibble: 4,000 x 1
##    theta
##    <dbl>
##  1 0.479
##  2 0.601
##  3 0.827
##  4 0.533
##  5 0.728
##  6 0.457
##  7 0.288
##  8 0.612
##  9 0.708
## 10 0.763
## # ... with 3,990 more rows

The theta column of drawsDF contains our representative sample of the posterior distribution. In this case, that representative sample includes 4,000 samples of theta. Recall from Figure 16.1 that our initial generative model assumed theta, the probability of success, was uniformly distributed between 0 and 1. Now, after observing 2 successes and 1 failure (see Step 1 above), we should expect our plausibility to favor \(\theta\) values closer to \(2/3\). Let’s visualize the plausibility using a histogram.

To get a refresher on histograms and density curves, check out this constructive video from Khan Academy: https://youtu.be/PUvUQMQ7xQk.

Using ggplot, we can visualize how plausibility was reallocated from our uniform prior of \(\theta\) in light of the observed data; in other words, we can see the posterior distribution for \(\theta\) after observing 2 out of 3 stores increase sales:

## make histogram with bins spaced every 0.025
library(tidyverse)
ggplot(drawsDF, aes(x = theta)) + 
  geom_histogram(breaks = seq(0,1,0.025))

Figure 16.5: Histogram of the posterior distribution for theta.


Notice how low values of \(\theta\) are represented less frequently than before and values from \(\theta = 50\%\) to \(\theta = 75\%\) seem to be represented more frequently. Given our training in the tidyverse, we can create a more effective visual: a comparison of prior and posterior (remember to google any functions/packages/etc. you are not familiar with):

library(tidyverse)

# named vector mapping colors to labels
colors = c("Uniform Prior" = "aliceblue", 
           "Posterior" = "cadetblue")

# use geom_density to create smoothed histogram
ggplot(drawsDF, aes(x = theta)) +
  geom_area(stat = "function", fun = dunif, 
            alpha = 0.5, color = "black", 
            aes(fill = "Uniform Prior"))+ 
  geom_density(aes(fill = "Posterior"),
               alpha = 0.5) +
  scale_fill_manual(values = colors) +
  labs(y = "Probability Density",
       fill = "Distribution Type")

Figure 16.6: Comparison of prior and posterior probability density function estimates.


Some takeaways from this graph regarding plausibility reallocation:

  • Low values of \(\theta\) (roughly below \(30\%\)) are now deemed less plausible; two out of three successes is simply inconsistent with low values of \(\theta\).
  • Very high values of \(\theta\) (roughly above \(90\%\)) are also less plausible; one failure in three tries is unlikely to occur if \(\theta\) were super-high.
  • Our best guess at \(\theta\) went from 0.5 (i.e. the mean of theta when uniformly distributed) to mean(drawsDF$theta) = 0.598.
  • We are still very unsure of \(\theta\) after only 3 stores. We can see this by looking at quantiles of the posterior distribution, which suggest a 90% percentile interval from 0.241 to 0.899:

See here for more details about interpreting a quantile function: https://en.wikipedia.org/wiki/Quantile_function.

quantile(drawsDF$theta, probs = c(0.05,0.5,0.95))
##        5%       50%       95% 
## 0.2408417 0.6158816 0.8993699

This continued large band of uncertainty is a good thing. With only three data points, we should not be very confident in \(\theta\). If we want a tighter interval of uncertainty, then we simply need to get more data, as shown in the sketch below. Exploring more data, different models, and other priors are all coming in subsequent chapters.
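To see this concretely, here is a hedged sketch that repeats the six steps with made-up data for thirty stores, twenty successes and ten failures; the 90% interval should come back noticeably narrower:

## hedged sketch: redo the updating with a made-up larger dataset
biggerData = c(rep(1, 20), rep(0, 10))   # 20 successes, 10 failures
xBig = as_data(biggerData)
thetaBig = uniform(0, 1)
distribution(xBig) = bernoulli(thetaBig)
fitBig = model(thetaBig)
drawsBigDF = mcmc(fitBig) %>% as.matrix() %>% as_tibble()
quantile(drawsBigDF$thetaBig, probs = c(0.05, 0.5, 0.95))
## expect an interval of roughly 0.52 to 0.79 instead of 0.24 to 0.90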

16.3 Making Probabilistic Statements From Posterior Distributions

With a representative sample of the joint posterior distribution available to us, namely drawsDF, we can expand on the strategies of the “Joint Distributions Tell Us Everything” chapter to make probabilistic statements from representative samples. For example, we might be curious to know \(P(\theta > 50\%)\). Using indicator functions, we can get our estimate (representative samples from posterior distributions will vary each time you run the code, so your answer might be slightly different than the one shown here):

## estimate that theta > 50 percent
drawsDF %>%
  mutate(indicatorFlag = 
           ifelse(theta > 0.5, TRUE, FALSE)) %>%
  summarize(pctOfThetaVals = mean(indicatorFlag))
## # A tibble: 1 x 1
##   pctOfThetaVals
##            <dbl>
## 1          0.681

We would then state that “the probability that \(\theta\) is greater than \(50\%\) is approximately 68%”.

Similarly, we can answer more complicated queries. For example, what is \(P(40\% < \theta < 60\%)\)?

## estimate that 40% < theta < 60%
drawsDF %>%
  mutate(indicatorFlag = 
           ifelse(theta > 0.4 & theta < 0.6, 
                  TRUE, FALSE)) %>%
  summarize(pctOfThetaVals = mean(indicatorFlag))
## # A tibble: 1 x 1
##   pctOfThetaVals
##            <dbl>
## 1          0.292
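For quick checks, the same estimates can be computed directly from the theta column, since the mean of a logical vector is just the proportion of TRUE values:

## equivalent one-line estimates using the draws directly
mean(drawsDF$theta > 0.5)                        # about 0.68
mean(drawsDF$theta > 0.4 & drawsDF$theta < 0.6)  # about 0.29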

The power of this workflow cannot be overstated, and you will use it frequently for making probabilistic statements - statements that will come in handy when it is time to use data to inform decision making under uncertainty.

16.4 Getting More Help

Make sure to check out the website: https://greta-stats.org/articles/get_started.html for more detailed information on using greta.

If you are interested in understanding how the representative sample is computed, see Michael Betancourt’s excellent videos on Scalable Bayesian Inference with Hamiltonian Monte Carlo (HMC). HMC is the computational technique used by greta/TensorFlow when computing the representative sample from the posterior distribution. The videos are available at https://betanalpha.github.io/speaking/.