We now continue last chapter’s investigation of Crossfit Gym’s use of yoga. Our goal is to bring the mathematical estimates of our posterior distribution’s parameters back into the real world with compelling, visually-based recommendations. To do this, we 1) define our *outcomes of interest*, 2) *compute a posterior distribution* for those outcomes, and 3) communicate our beliefs about those outcomes by *visualizing the outcomes of interest*.

Our main outcome of interest is signup probability. We investigate it by examining that particular node in relation to our decision. This relationship is represented by the deceptively simple generative decision DAG shown in Figure 25.2.

```
dag_create() %>%
  dag_node("Yoga Stretching?","yoga",
           dec = TRUE) %>%
  dag_node("Signup Probability","prob") %>%
  dag_edge("yoga","prob") %>%
  dag_plate("Stretch Type","",
            nodeLabels = c("yoga","prob")) %>%
  dag_render(shortLabel = TRUE, wrapWidth = 12)
```

The top-down narrative of Figure 25.2 is as follows. Crossfit Gym decides whether to offer yoga stretching, and the signup probability across their gyms changes as a result. Our job is to quantify this change and form an opinion on whether or not to offer yoga stretching across the gyms; we form that opinion using the posterior distribution.

Running the generative DAG of Figure 25.1 through the `dag_greta()` function yields a posterior distribution, but our job as analysts does not end there. We now have to make sense of the posterior distribution and communicate its implications to stakeholders.

Rerunning the analysis of the previous chapter gives us `drawsDF`, now a representative sample of the posterior distribution associated with Figure 25.1. This data frame is a sample of 4,000 draws of 28 variables. Let’s list the 28 variables using `names(drawsDF)`:

|            |             |           |            |
|:-----------|:------------|:----------|:-----------|
| `alpha_1`  | `alpha_8`   | `beta_3`  | `beta_10`  |
| `alpha_2`  | `alpha_9`   | `beta_4`  | `beta_11`  |
| `alpha_3`  | `alpha_10`  | `beta_5`  | `beta_12`  |
| `alpha_4`  | `alpha_11`  | `beta_6`  | `mu_alpha` |
| `alpha_5`  | `alpha_12`  | `beta_7`  | `mu_beta`  |
| `alpha_6`  | `beta_1`    | `beta_8`  | `sd_alpha` |
| `alpha_7`  | `beta_2`    | `beta_9`  | `sd_beta`  |

Figure 25.3 is a subset of Figure 25.1, and our objective node, `theta` (a.k.a. *Signup Probability*), is the last descendant or bottom-child node of the graph. We safely omit the children of this variable to save space since they do not affect our decision. Perusing our posterior’s 28 random variables, we might notice that `theta` is not one of them. Bummer! We are going to need to do some coding to get a representative sample for `theta`.

`theta` is omitted because it is a calculated node; its realization is a deterministic function of its parent `y`, which is also a calculated node. Note that an oval’s double perimeter is the visual clue for a calculated node (see Figure 25.3). Its parents include both random and observed nodes. So to actually determine `theta`, we need to follow the generative recipe from grandparents (`alpha`, `beta`, `j`, and `x`) to grandchild (`theta`) via linear predictor `y`.
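Written out, this generative recipe is just the linear predictor followed by the inverse-logit link shown in Figure 25.3:

\[ y = \alpha_{j} + \beta_{j} \times x, \qquad \theta = \frac{1}{1+e^{-y}} \]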

Notice, we do not care about `mu_alpha` and the other parents of `alpha` and `beta`. Once we have a representative sample of `alpha` and `beta`, plus the observed nodes `j` and `x`, we can calculate `theta`.

For example, let’s estimate the additional signup probability when using “Yoga Stretch” at gym number 12.

1. Get a single draw of the required nodes from the representative sample.
2. Compute a value for the linear predictor with and without yoga (i.e. `x=1` and `x=0`, respectively) using the formula shown for this node in Figure 25.3. Plugging in for `x`, we get the two different values of linear predictor `y` that interest us: with yoga (`x=1` \(\rightarrow y_{yoga} = \alpha_{12}+\beta_{12} \times 1\)) and without yoga (`x=0` \(\rightarrow y_{trad} = \alpha_{12}+\beta_{12} \times 0 = \alpha_{12}\)).
3. Compute the values for `theta` with and without yoga at gym 12 using the inverse-logit function (i.e. the link function formula shown for this node in Figure 25.3).
4. Compute the increased probability of signup when using yoga at gym 12.
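As a minimal sketch of these four steps in base R, we hard-code the single draw shown in the output below (`alpha_12 = -1.88`, `beta_12 = 0.123`); in the actual analysis, these values come from one row of `drawsDF`:

```r
# Step 1: a single draw of the required nodes (hard-coded here for
# illustration; in practice, take one row of drawsDF)
alpha_12 = -1.88
beta_12  = 0.123

# Step 2: linear predictor with yoga (x = 1) and without yoga (x = 0)
y_yoga = alpha_12 + beta_12 * 1
y_trad = alpha_12 + beta_12 * 0

# Step 3: inverse-logit link converts linear predictors to probabilities
theta_yoga = 1 / (1 + exp(-y_yoga))
theta_trad = 1 / (1 + exp(-y_trad))

# Step 4: increased signup probability due to yoga for this draw
probIncDueToYoga = theta_yoga - theta_trad
```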

And now, viewing these computed values:

```
## # A tibble: 1 x 7
##   alpha_12 beta_12 y_yoga y_trad theta_yoga theta_trad probIncDueToYoga
##      <dbl>   <dbl>  <dbl>  <dbl>      <dbl>      <dbl>            <dbl>
## 1    -1.88   0.123  -1.76  -1.88      0.147      0.132           0.0147
```

we see that for this draw, around 15% of yoga-trial customers end up signing up for a membership versus around 13% of customers doing traditional stretching. According to this draw then, yoga stretching increases signup probability by approximately 1.5 percentage points.

Let’s get a little mathematical and declare this difference in probabilities to be a new random variable \(Z_{gymID}\), defined for gym 12 as:

\[ Z_{12} \equiv \textrm{Probability increase due to yoga stretching at gym 12}. \]
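In terms of the quantities computed in the four steps above, this is:

\[ Z_{12} = \theta_{yoga} - \theta_{trad} = \frac{1}{1+e^{-(\alpha_{12}+\beta_{12})}} - \frac{1}{1+e^{-\alpha_{12}}} \]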

The following code scales the above four steps for creating one draw to creating a column of representative samples for \(Z_{12}\):

```
postDF = drawsDF %>%
  select(alpha_12, beta_12) %>%
  mutate(y_yoga = alpha_12 + beta_12 * 1) %>%
  mutate(y_trad = alpha_12) %>%
  mutate(theta_yoga = 1 / (1 + exp(-y_yoga))) %>%
  mutate(theta_trad = 1 / (1 + exp(-y_trad))) %>%
  mutate(z_12 = theta_yoga - theta_trad)
```

The column we just made, `z_12`, is our posterior distribution for the change in probability due to yoga stretching. We can visualize and summarize this posterior density:
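As a sketch of that step, the draws below are a simulated placeholder so the snippet runs on its own; in the actual analysis, use `postDF$z_12` from the pipeline above:

```r
# Simulated placeholder for postDF$z_12 (NOT the real posterior; swap in
# the column computed above when running the actual analysis)
set.seed(2023)
z_12 = rnorm(4000, mean = 0.065, sd = 0.08)

# Density plot of the change in signup probability due to yoga
plot(density(z_12), main = "Posterior density of z_12")

# Six-number summary of the posterior draws
summary(z_12)
```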

```
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## -0.18377   0.01167   0.04423   0.06520   0.09634   0.71258
```

and form only a mild opinion that yoga is probably helpful at gym 12 (i.e. most of the plausible values are above zero).

To convert posterior probabilities into decisions, we want a visual that communicates a recommendation. In business, a random-variable outcome of interest is usually made more compelling by converting it to some measure of money. Let’s assume that the value of each new customer is estimated to be $500 in *net present value* terms. We can then create a mathematical formula for the value created by yoga stretching per trial customer:

\[ ValueOfYogaStretchingForGym12 = 500 \times Z_{12} \]

and also represent it computationally:
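A sketch of that computation; the three-draw `postDF` below is a hypothetical stand-in so the snippet runs on its own, whereas the actual analysis uses the `postDF` built earlier from `drawsDF`:

```r
# Hypothetical stand-in for postDF (three draws of z_12); in the actual
# analysis, use the postDF computed from drawsDF above
postDF = data.frame(z_12 = c(-0.02, 0.044, 0.10))

# $500 net present value per new customer converts the probability lift
# into dollars of value created per trial customer
moneyDF = postDF
moneyDF$ValueCreated = 500 * moneyDF$z_12
```

In the tidyverse style used throughout the chapter, the last two lines would read `moneyDF = postDF %>% mutate(ValueCreated = 500 * z_12)`.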

We now have a random variable for the per-customer profit estimate if gym 12 adopts yoga stretching for the next year versus not adopting it. We can summarize this random variable graphically,

```
moneyDF %>%
  ggplot(aes(x = ValueCreated)) +
  geom_density(fill = "cadetblue", alpha = 0.5) +
  scale_x_continuous(labels = scales::dollar)
```

which shows both the plausibility of losing money as well as of making up to, say, $100 per customer as a result of the decision. We can find some additional metrics:

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##  -91.886    5.835   22.114   32.600   48.169  356.290
```

telling us the median value for gym 12 is about $22 of extra value per customer. This means we assign a 50/50 chance to being above or below this number in terms of value created per customer. Hence, if offering the class costs extra money (e.g. licensing fees, additional labor expense, additional equipment expense, etc.), say $25 per customer, then this investment at gym 12 might not be recovered.

Lastly, since continuous probability estimates are sometimes difficult for decision makers (and ourselves) to understand, we can create a discrete probability distribution by creating bins (see http://wilkelab.org/classes/SDS348/2016_spring/projects/project1/project1_hints.html).

```
breaks = c(-1000,-20,0,20,40,60,80,100,1000)
labels = c("<-$20","-$20 - $0","$0 - $20",
           "$20 - $40","$40 - $60","$60 - $80",
           "$80 - $100","$100+")
bins = cut(moneyDF$ValueCreated,
           breaks,
           include.lowest = TRUE,
           right = FALSE,
           labels = labels)
moneyDF$bins = bins ## add new column
```

And then, we use the newly created bins to give us a very nice and interpretable plot as shown in Figure 25.4.

```
## add label for percentage in each bin
plotDF = moneyDF %>%
  group_by(bins) %>%
  summarize(countInBin = n()) %>%
  mutate(pctInBin = countInBin / sum(countInBin)) %>%
  mutate(label = paste0(round(100*pctInBin,0),"%")) %>%
  mutate(makeMoney = ifelse(bins %in% levels(bins)[1:2],
                            "Not Profitable",
                            "Profitable"))
## Create more interpretable plot
plotDF %>%
  ggplot(aes(x = bins, y = pctInBin,
             fill = makeMoney)) +
  geom_col(color = "black") +
  geom_text(aes(label = label), nudge_y = 0.015) +
  xlab("Value Added Per Trial Customer") +
  ylab("Probability of Outcome") +
  scale_fill_manual(values = c("lightcoral",
                               "cadetblue")) +
  theme(legend.position = "none") +
  coord_flip() +
  ggtitle("Making Yoga Stretching Mandatory for Gym 12")
```

From the above, we can easily talk to any decision maker about the possibilities for various outcomes. For example, summing the bottom two percentages tells us there is an approximately 17% chance of yoga stretching creating negative value. Additionally, if Crossfit anticipates a cost of $20 per customer, then there is a 46% chance of not breaking even by using this policy.
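These headline probabilities can also be read straight off the posterior draws without binning. A sketch, using a hypothetical six-draw stand-in for `moneyDF$ValueCreated`:

```r
# Hypothetical stand-in draws of value created per trial customer; the
# actual analysis uses moneyDF$ValueCreated
valueCreated = c(-50, -10, 5, 22, 48, 150)

# Proportion of draws creating negative value
mean(valueCreated < 0)

# Proportion of draws that fail to cover a $20 per-customer cost
mean(valueCreated < 20)
```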