Chapter 13 Bayesian Inference On Graphical Models

Suppose some dark night a policeman walks down a street, apparently deserted; but suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying a bag which turns out to be full of expensive jewelry. The policeman doesn’t hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion? … A moment’s thought makes it clear that our policeman’s conclusion was not a logical deduction from the evidence; for there may have been a perfectly innocent explanation for everything. It might be, for example, that this gentleman was the owner of the jewelry store and he was coming home from a masquerade party, and didn’t have the key with him. But just as he walked by his store a passing truck threw a stone through the window; and he was only protecting his own property. Now while the policeman’s reasoning process was not logical deduction, we will grant that it had a certain degree of validity. The evidence did not make the gentleman’s dishonesty certain, but it did make it extremely plausible. (- Jaynes 2003Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge university press.)

In the above story, the policeman used what we call plausible reasoning - an allocation of credibility to all possible explanations for his observations. While yes, just maybe, a masquerade party explains the man’s outfit - more likely, the model that best explains all of the collected data is that this is a robbery in progress.

In this book, our version of plausible reasoning is called Bayesian inference - a methodology for updating our mathematical representations of the world - or small pieces of the world - as more data becomes available.

13.1 An Illustrative Example

Figure 13.1: Before and after picture’s of Chili’s store remodelling initiative.

Before and after picture's of Chili's store remodelling initiative.

Figure 13.1 shows a sample of how Chili’s restaurants were refreshing their stores’ look and feel. You can imagine that through this investment Chili’s expects to increase store traffic. Let’s walk through a hypothetical argument between two managers as to the effect of this type of investment.

Assume the two managers have competing models for what effect a proposed Chili’s exterior remodel will have on its stores:

  1. The Optimist Model (Model1) : This model is from a manager who is very excited about the effect of exterior renovations and argues that 70% of all Chili’s stores will see at least a 5% increase in sales.
  2. The Pessimist Model (Model2) : This model is from a manager who felt the old facade was still fresh. This manager argues that only 20% of all Chili’s stores will see at least a 5% increase in sales.

These managers recognize and respect their differing opinions - they agree to test the remodel in a randomly chosen store. They hire you as a business data analyst and ask you to make the decision as to whose model is more right in light of the test’s results. Your job is to allocate credibility to the two competing models both before seeing results and after seeing results. Initially, you might not have any reason to favor one model over another, but as data is collected, your belief in whose model is more believable will change. For example, if the tested store sales decrease, then the pessimist model would seem more credible. Quantifying - using probability - how to allocate credibility both with and without data is your task.

13.2 Building A Graphical Model of the Real-World

The first step is to create a graphical model representation of the Chili’s question. Starting simple, let’s only imagine that we test the remodel in one store and our single data point (i.e. whether the one tested store increases sales or not) follows a Bernoulli distribution. The graphical model is simply the random variable oval:

Figure 13.2: The sales increase outcome from renovating a Chili’s store.

And, the statistical model with the mathematical details is represented like this:

\[ \begin{aligned} X \equiv& \textrm{ Sales Increase: } \\ & \textrm{ If sales increase more than 5}\% \textrm{, then }X=1 \textrm{ otherwise, }X=0.\\ X \sim & \textrm{ Bernoulli}(\theta)\\ \end{aligned} \]

You might notice that \(\theta\) is being used in place of \(p\) for describing the parameter of the Bernoulli Distribution. Often, mathmaticians will use greek letters to describe distribution parameters and use regular (Latin) letters for data that is observable in the real-world (e.g. a sales increase). Sometimes, we will spell out theta or other greek letters because the computational world does not make the actual greek letters like \(\theta\) easy to work with. **We will explore the more realistic case of multiple and infinite possiblities for \(\theta\) in subsequent sections.

Great, we have seen this model before when representing coin flips. Our data is analogous to heads or tails of a coin flip. The data will be reduced to a zero or one for each store. If given \(\theta\), we could generate random sales data using rbern(n=1, prob=theta), but, we do not know \(\theta\); the reason we are looking at data is to answer the question: “what is \(\theta\)?”

From the managerial story above, \(\theta\) can only take on two values.\(^{**}\) In the optimistic model \(\theta=70\%\) and in the pessimistic model \(\theta=20\%\). So, we have two models of the world and are uncertain as to which one is correct. Without data, we have no reason to believe one manager over another, so \(P(\theta=70\%)=50\%\) and \(P(\theta=20\%)=50\%\) - i.e. each manager is equally likely to be correct. This is just like saying \(P(model1)=P(model2)=50\%\). Before any data is considered, this allocation of credibility assigning probability to all considered models is called the prior. The prior is the initial probability allocated among all the possible models.

See https://youtu.be/nCRTuwCdmP0 for gaining some intuition about prior probabilities.

So now, we can more completely specify our data story using both a graphical and a statistical model with specified prior probabilities. The graphical model we would use to communicate with stakeholders is just the oval representing our uncertainty in a store’s success (random variable math-labels for \(\Theta\) and \(X\) are included for extra clarity in connecting the graphical model and statistical model):

Figure 13.3: A model of how sales increases at Chili’s store are determined. First, a success probability is specified and then, the successfulness of a sales increase is determined as a function of the success probability.

And, the statistical model is represented like this:

\[ \begin{aligned} X \equiv& \textrm{ Sales Increase: } \\ & \textrm{ If sales increase more than 5} \% \textrm{, then }X=1 \textrm{ otherwise, }X=0.\\ X \sim & \textrm{ Bernoulli}(\theta)\\ \Theta \equiv& \textrm{ Store Success Probability: } \\ \end{aligned} \]

\[ \Theta \sim \begin{array}{ccc} \textrm{Outcome } & \textrm{ Realization }(\theta) & f(\theta) \\ \hline Model1 & \textrm{ 70%} & \textrm{ 50%} \\ Model2 & \textrm{ 20%} & \textrm{ 50%} \\ \end{array} \]

\(\Theta\) is the uppercase Greek letter for lowercase \(\theta\). Recall, we try to use uppercase letters for random variables and lowercase letters for their realizations.

where the prior probability distribution for \(\Theta\) is given in tabular form.

\(\Theta\) is just capital \(\theta\). Remember we use capital letters, even capital Greek letters, to represent a random variable. Lowercase letters represent realizations or observations.)

Two main classes of statistical models exist. A generative model can be used to generate random data as it gives a full probabilistic recipe (i.e. a joint distribution) for creating random data. In contrast, a discriminative model classifies or predicts without the ability to simulate all the random variables of the model.

Figure 13.3 and the accompanying statistical model represents a generative model. A crude definition of a generative model is that it is a recipe for simulating real-world data observations. In this case, simulating a single store’s success mimics a top-down reading of Figure 13.3:

  1. Simulate a potential success probability by randomly picking, with equal probability, either the optimist (\(\theta = 70\%\)) or pessimist model (\(\theta = 20\%\)).
  2. Simulate a store’s success by using the probability from (1) and generating a Bernoulli random variable with that probability of success (\(X \sim \textrm{Bernoulli}(\theta)\)).

We can easily show how to simulate an observation by writing code for the recipe in R:

library(causact) # for rbern function
set.seed(111) # to get the same random numbers

# Generate Which Manager Model is Correct
# Map theta = 70% to the number 1 and
# map theta = 20% to the number 0.
sampleModel = rbern(n=1, p = 0.5)
theta = ifelse(sampleModel == 1, 0.7, 0.2)

theta ## print the random value
## [1] 0.7
# Generate 20 sample stores for that theta
rbern(n = 20, p = theta)
##  [1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1

KEY CONCEPT: Much of your subsequent work in this course will use this notion of generative models as recipes. You will 1) create generative models that serve as skeleton recipes - recipes with unknown parameters - for how real-world data arises, and 2) inform models with data by reallocating plausibility to the recipe parameters that are most consistent with the data. In the above, had the value of theta been hidden from you, you would be able to guess it from the data; most of the 20 stores succeeded in the simulation.

A guiding principle for creating good generative models is that the generated data should mimic data that you, as a business data analyst, believe is plausible. If the generative model outputs implausible data in high frequency, then your model is not capturing the essence of the underlying data story; your modelling assumptions will need work. When the generative model seems correct in the absence of data, then data can feed the updating process (to be explored in Section 13.4) which sensibly reallocates prior probability so that model parameters that tend to generate data similar to the observed data are deemed more plausible.

13.3 From Model To Joint Distribution

The graphical model of Figure 13.3 shows that we have uncertainty about two related random variables: 1) Success Probability (\(\Theta\)) and 2) Sales Increase (\(X\)). Our assumption - built into our prior - is that one of our two models, \(\theta = 70\%\) or \(\theta = 20\%\), is correct.\(^{**}\) We are also confident about what collecting data from one store might yield: either a single success or a single failure in terms of the sales increase. Prior to collecting data, there are four combinations of model and data that are potential truths. Each combination’s prior plausibility, represented by \(a\%, b\%, c\%,\) and \(d\%\), are elements in the below table:

\(^{**}\) This implicit assumption that one of the considered generative models is correct is sometimes referred to as the small world assumption. See this comparison by Richard McElreath of small world versus large world highlighting the implications of this assumption: https://youtu.be/WFv2vS8ESkk.

\[ \begin{array}{cc|cc} & & \textrm{ Possible} & \textrm{Models } \\ & & \theta = 70\% & \theta = 20\% \\ \hline \textrm{Possible} & \textit{Success} & a\% & b\% \\ \textrm{Data} & \textit{Failure} & c\% & d\% \end{array} \]

Prior to data collection, we know (based on assumptions) a few things must be true:

If you are rusty on your defintion of conditional probability, take a moment to refresh yourself by viewing this set of videos from Khan Academy: https://www.khanacademy.org/math/statistics-probability/probability-library/conditional-probability-independence/v/calculating-conditional-probability.

  1. One of the four scenarios must be true, thus \(a+b+c+d = 1\).
  2. Equal probability is given to the possible models such that \(P(\theta = 70\%)=P(\theta = 20\%)=0.5\), thus \(a+c = b+d = 0.5\).
  3. Assuming any model is true, then we know the conditional probability of success - i.e. \(P(Success|\theta = 70\%) = 70\%\) and \(P(Success|theta = 20\%) = 20\%\) (this is just the definition of \(\theta\)).
  4. If we apply the definition of conditional probability - i.e. \(P(A \textrm{ and } B) = P(B) * P(A|B)\), we can define all the probabilities:
    1. \(a = P(\theta = 70\%) * P(Success|\theta = 70\%) = 50\% * 70\% = 1/2 * 7/10 = 7/20\)
    2. \(b = P(\theta = 20\%) * P(Success|\theta = 20\%) = 50\% * 20\% = 1/2 * 2/10 = 2/20\)
    3. \(c = P(\theta = 70\%) * P(Failure|\theta = 70\%) = 50\% * 30\% = 1/2 * 3/10 = 3/20\)
    4. \(d = P(\theta = 20\%) * P(Failure|\theta = 20\%) = 50\% * 80\% = 1/2 * 8/10 = 8/20\)
  5. The marginal probability of success (i.e. the probabilty of success without reference to the model) is the sum of the first row’s elements: \(P(Success) = P(Success \textrm{ and } \theta = 70\%) + P(Success \textrm{ and } \theta = 20\%) = a + b = 7/20 + 2/20 = 9/20\).
  6. The marginal probability of failure (i.e. the probabilty of failure without reference to the model) is the sum of the second row’s elements: \(P(Failure) = P(Failure \textrm{ and } \theta = 70\%) + P(Failure \textrm{ and } \theta = 20\%) = c + d = 3/20 + 8/20 = 11/20\).

Taken together, the above truths enable us to fully specify the marginal distributions \(P(Model)\) and \(P(Data)\), as well as the joint distribution \(P(Model \textrm{ and } Data)\):

\[ \begin{array}{cc|cc|c} & & \textrm{ Possible} & \textrm{Models } & \textbf{} \\ & & \theta = 70\% & \theta = 20\% & P(Data) \\ \hline \textrm{Possible} & \textit{Success} & 7/20 & 2/20 & 9/20 \\ \textrm{Data} & \textit{Failure} & 3/20 & 8/20 & 11/20 \\ \hline \textbf{} & P(Model) & 10/20 & 10/20 & 20/20 \\ \end{array} \]

When you see something like \(P(Model)\) or \(P(Data)\) what this really means is that if I specify a model (or data), you are able to return a probability. For example, \(P(\theta = 70\%) = 50\%\) and \(P(Success) = 9/20)\). These are called marginal distributions (literally because they are often calculated in the margins by summing rows or columns of a two-way table). Often, we are interested in the distribution over several random variables. This distribution is referred to as a joint distribution over those random variables. In our example, the joint distribution is represented as \(P(Model \textrm{ and } Data)\). This means if I give you any specific realizations of both \(Model\) and \(Data\), you can return a probability. For example, \(P(\theta = 70\% \textrm{ and } Success) = 7/20\).

In data analysis, we are always intereseted in updating our beliefs about \(P(Model)\). Right now, both models are allocated \(50\%\) plausibility, but as soon as we see a success or failure, we will want to reallocate that plausibility towards either the optimist model in the case of success or the pessimist model in the case of failure.

13.4 Bayesian Updating Of The Joint Distribution

Our generative model recipe led us to this prior joint distribution for \(P(Model \textrm{ and } Data)\):

\[ \begin{array}{cc|cc} & & \textrm{ Possible} & \textrm{Models } \\ & & \theta = 70\% & \theta = 20\% \\ \hline \textrm{Possible} & \textit{Success} & 7/20 & 2/20 \\ \textrm{Data} & \textit{Failure} & 3/20 & 8/20 \\ \end{array} \]

Inference is the process of reallocating prior probability in models based on observed data, i.e. determining \(P(model|data)\). Mathematically, there is only one way to do this- Bayes rule.

13.4.1 A Quick Derivation of Bayes Rule

From the definition of conditional probability, we know that \(P(A \textrm{ and } B)\) can be represented in two ways:

\[ P(A \textrm{ and } B)=P(A│B) \times P(B)=P(B|A) \times P(A) \]

Thus, by simple mathematical manipulation of the last equality we get Bayes rule:

\[ P(A│B)=\frac{P(B│A) \times P(A)}{P(B)} \]

Since A and B are somewhat arbitrary, let’s relate Bayes Rule to the concept of collecting data to reallocate probability among competing models:

\[ P(Model | Data) = \frac{P(Data|Model) \times P(Model)}{P(Data)} \]

13.4.2 Using Bayes Rule

The mathematics of Bayes rule might seem intimidating, but showing how Bayes rule works using our joint distribution table can make it more intuitive. Let’s assume the first store is a success - sales increase by more than 5%. Thus, we no longer need to worry about the case of \(Failure\) because we are certain it did not happen. So, we zoom in on the row that did happen - \(Success\):

\[ \begin{array}{cc|cc} & & \textrm{ Possible} & \textrm{Models } \\ & & \theta = 70\% & \theta = 20\% \\ \hline \textrm{Possible} & Success & \textbf{7/20} & \textbf{2/20} \\ \textrm{Data} & Failure & \tiny{3/20} & \tiny{8/20} \\ \end{array} \]

Once zoomed in on this row, we need to re-allocate 100% of our plausability measure to just this row because we are 100% certain it happened. Intuitively, it seems that the proportion allocated to each model should be proportional to our prior beliefs - and this is actually what the mathematics of Bayes rule does. See Khan Academy for a refresher on dividing fractions if you need it: https://youtu.be/f3ySpxX9oeM To make the success row sum to 1 instead of \(\frac{9}{20}\), simply divide each element in the row by \(\frac{9}{20}\). This would yield updated probabilities of \(\frac{7/20}{9/20} = \frac{7}{9}\) for the optimist model and \(\frac{2/20}{9/20} = \frac{2}{9}\) for the pessimist model. Notice how we can arrive at the same result using the Bayes rule formula below:

\[ \begin{aligned} P(Model | Data) &= P(\theta = 70\% | Success) \\ &= \frac{P(Success | \theta = 70\%) \times P(\theta = 70\%)}{P(Success)}\\ &= \frac{P(Success | \theta = 70\%) \times P(\theta = 70\%)}{P(Success \textrm{ and } \theta = 70\%) + P(Success \textrm{ and } \theta = 20\%)}\\ &= \frac{7/10 \times 1/2}{7/20 + 2/20}\\ &= 7/9 \approx 77.8\% \end{aligned} \]

and

\[ \begin{aligned} P(Model | Data) &= P(\theta = 20\% | Success) \\ &= \frac{P(Success | \theta = 20\%) \times P(\theta = 20\%)}{P(Success)}\\ &= \frac{P(Success | \theta = 20\%) \times P(\theta = 20\%)}{P(Success \textrm{ and } \theta = 70\%) + P(Success \textrm{ and } \theta = 20\%)}\\ &= \frac{2/10 \times 1/2}{7/20 + 2/20}\\ &= 2/9 \approx 22.2\% \end{aligned} \] Take time to convince yourself that Bayes rule is just a mathematically formal way of saying take each element in the observed row and divide by the sum of that row.

A good check that you did the Bayes rule math correctly is to ensure your posterior probabilities sum to 1 (i.e. \(\frac{7}{9} + \frac{2}{9} = 1 \textrm{ }\checkmark\)). So, while initally we were unsure as to which manager might be right, the data leads us to allocate more probability to the optimistic model - from 50% prior probability to 77.8% after observing the one store. This might seem like an overly big increase. However, the pessimist model suggests that there is only a 20% chance of success, thus \(Success\) is a rare outcome in that model - so rare in fact, that the mathematics deems that model inconsistent with the data and hence, allocated alot of probability away from that model.

Bayes rule takes some working with it to fully digest. Spend a moment to make sure you know where each part of the above Bayesian updating calculation comes from. The components of a Bayes rule calculation are frequently referred to and each one has a special name:

  1. Prior - \(P(Model)\): The probability allocated to each model prior to processing collected data - i.e. \(P(\theta = 70\%)\) and \(P(\theta = 50\%)\). This is decided by the business analyst after consulting with and learning from domain experts.
  2. Likelihood - \(P(Data|Model)\): The probability of observing any potential data given a specified model. Our model of how data gets generated was simple, it is just \(Data \sim \textrm{ Bernoulli}(\theta)\).
  3. Evidence or Marginal Likelihood - \(P(Data)\): This is the probability of observing a particular data value. For the two-way table representation of a joint distribution, it is the sum of each row.
  4. Posterior - \(P(Model|Data)\): This is the updated probabily allocated to each model after observing data. If the data is \(Success\), then the posterior gives \(P(\theta = 70\% | Success)\) and \(P(\theta = 20\% | Success)\). Only the calculation for \(Data = Success)\) was shown above. If the data was \(Failure\), then the posterior gives a different allocation of probability, namely \(P(\theta = 70\% | Failure)\) and \(P(\theta = 20\% | Failure)\)

Thus, Bayes rule restated for data analysis purposes:

\[ Posterior = \frac{Prior * Likelihood}{Evidence} = \frac{P(Model) * P(Data|Model)}{P(Data)} = P(Model|Data) \]

shows exactly how we update our prior original assumptions \(P(Model)\) in response to \(Data\). In words, we reallocate plausability to the models where the data is more consistent with the implications of the model as determined by the likelihood.

13.5 Inference Summary So Far

The steps for initially allocating and then, realloocating probability in light of data are to:

  1. Make A Graphical Model - Construct plausible relationships among the inter-related random variables of interest - in this chapter’s example the two random variables were \(Success \textrm{ }Probability\) and \(Sales \textrm{ }Increase\).
  2. Provide Prior Probabilities Via A Statistical Model - Create prior probabilities to ensure that your model can generate seemingly plausible data. Our prior was easy to figure out (50%/50%) based on the context, but in future chapters we will explore more complex scenarios.
  3. Update Using Bayes Rule - Use Bayesian updating to process data and reallocate probability among the competing models. So far, we learned to do this mathematically, but we will leverage the power of R and its packages to make the computer do this tedious work for us.

When additional data is received, please know that Bayes rule still works. Just take the posterior output from the original data, and then use that as your prior when digesting the new information. Equivalently, you can just update the original prior with the combined set of original data and new data. Either way, Bayes rule will yield the same results.

While the math may seem tedious now, we are actually learning the math now to move away from it in the future; once we understand how the math works, we can leverage computation to do the math for us with a deeper understanding of what the computation does.