
The fun brickr package converts images into a mosaic made of Lego building blocks. The above mosaic is put here to emphasize that we are learning building blocks for making models of data-generating processes. Each block is used to make some mathematical representation of the real world. The better our representations, the better our insights. Instead of using Lego bricks, our tool of choice is the generative DAG. We have almost all the building blocks we need (latent nodes, observed nodes, calculated nodes, edges, plates, linear models, and probability distributions), but this chapter introduces one last powerful building block: the inverse link function.

The *range* of a function is the set of values that the
function can give as output. For a linear predictor with non-zero slope,
this range is any number from -\(\infty\) to \(\infty\).

In this chapter, we focus on restricting the *range* of linear predictors. A linear predictor for data observation \(i\) is any function expressible in this form:

\[ f(x_{i1},x_{i2},\ldots,x_{in}) = \alpha + \beta_1 * x_{i1} + \beta_2 * x_{i2} + \cdots + \beta_n * x_{in} \]

where \(x_{i1},x_{i2},\ldots,x_{in}\) is the \(i^{th}\) observation of a set of \(n\) explanatory variables, \(\alpha\) is the base-level output when all the explanatory variables are zero (e.g. the y-intercept when \(n=1\)), and \(\beta_j\) is the coefficient for the \(j^{th}\) explanatory variable (\(j \in \{1,2,\ldots,n\}\)). When \(n=1\), this is just the equation of a line as in the last chapter. When there is more than one explanatory variable, we are making a function with *high-dimensional* input, meaning the input includes multiple explanatory RV realizations per observed row. High-dimensional functions are no longer easily plotted, but the interpretation of the coefficients remains consistent with our developing intuition.

Explanatory variable effects are fully summarized in the corresponding coefficients, \(\beta\). If an individual coefficient \(\beta\) is positive, the linear prediction increases by \(\beta\) units for each one-unit increase in the explanatory variable. For example, we thought it plausible for the expected sales price of a home to go up by $120 for every additional square foot. With 10 additional square feet, the home value increases by $1,200; with 100 additional square feet, it increases by $12,000. You can continue this logic ad nauseam until you have infinitely big houses with infinite home prices. The takeaway is that linear predictors, in theory, can take on values anywhere from -\(\infty\) to \(\infty\).
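The home price arithmetic above can be checked with a minimal sketch in R. The slope of $120 per square foot comes from the text; the intercept of 50,000 and the function name `linear_predictor` are assumptions made purely for illustration.

```r
# Hypothetical linear predictor for expected home price.
# beta = 120 ($ per square foot) is from the text; alpha = 50000 is an
# assumed placeholder intercept, not an estimate from any data.
linear_predictor <- function(sqft, alpha = 50000, beta = 120) {
  alpha + beta * sqft
}

base_price <- linear_predictor(2000)

# Increase in predicted price from additional square footage
linear_predictor(2010) - base_price   # 1200
linear_predictor(2100) - base_price   # 12000

# Nothing bounds the output: extreme inputs still return a prediction,
# including a negative "price" for a nonsensical negative square footage.
linear_predictor(c(-1000, 1e9))
```

The last call is the point of the sketch: a linear predictor on its own has no notion of which outputs are sensible.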

An inverse link function takes linear predictor output, which ranges from -\(\infty\) to \(\infty\), and confines it in some way to a different scale. For example, if we want to use many explanatory variables to explain success probability, our method will be to estimate a linear predictor and then transform it so its value is forced to lie between 0 and 1 (i.e. the interval on which probabilities are defined). More generally, inverse link functions are used to make linear predictors map to predicted values that are on a different scale. For our purposes, we will look at two specific inverse link functions:

*Exponential*: The exponential function converts a linear predictor of the form \(\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n\) into a curve that is restricted to values between 0 and \(+\infty\). This is useful for converting a linear predictor into a strictly positive value. For example, the rate of tickets issued in New York City can be modelled by taking a linear predictor for tickets and turning it into a positive rate of ticket issuance. If we label the linear predictor value \(y\) and the transformed value \(\lambda\), the exponential function converting \(y\) to \(\lambda\) is defined here:

\[ \lambda = \exp(y) = \exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n) \]

*Inverse Logit* (aka logistic): This function provides a way to convert a linear predictor of the form \(\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n\) into a curve that is restricted to values between 0 and 1. This is useful for converting a linear predictor to a probability. If we label the linear predictor value \(y\) and the transformed value \(\theta\), the inverse logit function converting \(y\) to \(\theta\) is defined here (note the negative sign):

\[ \theta = \frac{1}{1+\exp(-y)}= \frac{1}{1+\exp(-(\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n))} \]
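As a quick sanity check on the two inverse link functions, the sketch below applies each one to a handful of linear predictor values. The helper name `inv_logit` is our own; base R's `plogis()` computes the same function and is used only as a cross-check.

```r
# Linear predictor values spanning negative, zero, and positive
y <- c(-10, -1, 0, 1, 10)

# Exponential inverse link: every output is strictly positive
lambda <- exp(y)

# Inverse logit inverse link: every output lies between 0 and 1
inv_logit <- function(y) 1 / (1 + exp(-y))
theta <- inv_logit(y)

# Base R's plogis() is the same inverse logit function
all.equal(theta, plogis(y))   # TRUE

# Side-by-side view of the linear predictor and both transformations
data.frame(y, lambda, theta)
```

However extreme the input, `lambda` never reaches zero or below and `theta` never leaves the interval between 0 and 1.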

The beauty of these functions is that they let us keep the easily understood linear model form and still have a form that is useful in a generative DAG. The downside is that we lose easy interpretability of the coefficients. The only thing we get to say easily is that higher values of the linear predictor correspond to higher values of the transformed output.

When communicating the effects of explanatory variables that are put through inverse link functions, you should either: 1) simulate observed data using the prior or posterior’s generative recipe, or 2) consult one of the more rigorous texts on Bayesian data analysis for some mathematical tricks to interpreting generative recipes with these inverse link functions (see references at end of book).
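The first option, simulation, might look something like the sketch below for a Poisson count with an exponential inverse link. The draws for `alpha` and `beta` are hypothetical stand-ins generated with `rnorm()`; in a real analysis they would be prior draws or posterior draws from your own model, and the comparison values of `x` (0 and 1) are likewise assumptions.

```r
set.seed(123)
n_draws <- 4000

# Hypothetical draws for intercept and slope (assumed for illustration);
# replace with draws from your own prior or posterior.
alpha <- rnorm(n_draws, mean = 1, sd = 0.2)
beta  <- rnorm(n_draws, mean = 0.5, sd = 0.1)

# Follow the generative recipe at two values of the explanatory variable:
# linear predictor -> exp() inverse link -> Poisson rate
rate_x0 <- exp(alpha + beta * 0)
rate_x1 <- exp(alpha + beta * 1)

# Simulate observed counts, one per draw
k_x0 <- rpois(n_draws, rate_x0)
k_x1 <- rpois(n_draws, rate_x1)

# Communicate the effect on the scale of the outcome, not the coefficient
quantile(rate_x1 - rate_x0, probs = c(0.05, 0.5, 0.95))
mean(k_x1) - mean(k_x0)
```

Reporting the change in the rate or in the simulated counts sidesteps the need to explain what a coefficient means on the transformed scale.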

Figure 22.1 takes a generic example of a Poisson count variable and makes the expected rate of occurrence a function of an explanatory variable.

For a specific example, think about modelling daily traffic tickets in New York City. The expected rate of issuance would be a linear predictor based on explanatory variables such as inches of snow, holiday, president in town, end-of-month, etc. Since linear predictors can turn negative and the rate parameter of a Poisson random variable must be strictly positive, we use the exponential function to get from linear predictor to rate.

```
library(causact)

# Generative DAG: Poisson counts whose rate is exp() of a linear predictor
dag_create() %>%
  dag_node("Count Data", "k",
           rhs = poisson(rate),
           obs = TRUE) %>%
  # inverse link: exp() maps the linear predictor to a positive rate
  dag_node("Exp Rate", "rate",
           rhs = exp(y),
           child = "k") %>%
  dag_node("Linear Predictor", "y",
           rhs = alpha + beta * x,
           child = "rate") %>%
  dag_node("Intercept", "alpha",
           child = "y") %>%
  dag_node("Explanatory Var Coeff", "beta",
           child = "y") %>%
  dag_node("Observed Expl Var", "x",
           child = "y",
           obs = TRUE) %>%
  # one copy of k, rate, y, and x per observation i
  dag_plate("Observation", "i",
            nodeLabels = c("k", "rate", "y", "x")) %>%
  dag_render()
```

Figure 22.2: Graph of the exponential function. The linear predictor in our case is alpha + beta * x. The role of the exp function is to map this linear predictor to a scale that is strictly positive. This essentially takes any number from -\(\infty\) to \(\infty\) and provides a positive number as an output.

The inverse link function transformation takes place in the node for `rate`. The linear predictor, \(y\), can take on any value from -\(\infty\) to \(\infty\), but as soon as it is transformed with the exponential function, it can only be a positive number. This transformation is shown in Figure 22.2 where positive and negative x-axis values become solely positive y-axis values.

From Figure 22.2, we see that negative values of the linear predictor are transformed into values of rate between 0 and 1 and positive values of the linear predictor get transformed into rate values greater than 1. Notice this transformation is non-linear, and hence caution must be used interpreting the slope coefficients of the linear predictor. We will see this in the next chapter.
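The non-linearity described above is easy to see numerically. This short sketch, using only base R, evaluates the exponential function at a few linear predictor values chosen for illustration.

```r
# Linear predictor values on either side of zero
y <- c(-2, -1, 0, 1, 2)
rate <- exp(y)

round(rate, 3)   # 0.135 0.368 1.000 2.718 7.389

# The same one-unit step in y produces ever larger additive jumps in rate...
diff(rate)

# ...because each one-unit step multiplies the rate by the same factor, exp(1)
rate[-1] / rate[-length(rate)]
```

Negative linear predictor values land between 0 and 1, positive ones land above 1, and a one-unit change in the linear predictor acts multiplicatively rather than additively on the rate.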

Figure 22.3 shows a generic generative DAG which leverages the inverse logit link function.

```
library(causact)

# Generative DAG: Bernoulli outcomes whose success probability is the
# inverse logit of a linear predictor
dag_create() %>%
  dag_node("Bernoulli Data", "z",
           rhs = bernoulli(theta),
           obs = TRUE) %>%
  # inverse link: maps the linear predictor to a value between 0 and 1
  dag_node("Success Probability", "theta",
           rhs = 1 / (1 + exp(-y)),
           child = "z") %>%
  dag_node("Linear Predictor", "y",
           rhs = alpha + beta * x,
           child = "theta") %>%
  dag_node("Intercept", "alpha",
           child = "y") %>%
  dag_node("Explanatory Var Coeff", "beta",
           child = "y") %>%
  dag_node("Observed Expl Var", "x",
           child = "y",
           obs = TRUE) %>%
  # one copy of z, theta, y, and x per observation i
  dag_plate("Observation", "i",
            nodeLabels = c("z", "theta", "y", "x")) %>%
  dag_render()
```

The inverse logit function is used inside a method called logistic regression. Check out this sequence of videos, beginning here (https://youtu.be/zAULhNrnuL4), on logistic regression for some additional insight.

Note the inverse link function transformation takes place in the node for `theta`. To start to get a feel for what this transformation does, observe Figure 22.4. When the linear predictor is zero, the associated probability is 50%. Increasing the linear predictor will increase the associated probability, but with diminishing effect. When the linear predictor is increased by one unit from, say, 1 to 2, the corresponding probability goes from about 73% to 88% (i.e. from \(\frac{1}{1+\exp(-1)}\) to \(\frac{1}{1+\exp(-2)}\)). This is a jump of 15 percentage points. However, increasing the linear predictor by one additional unit, from 2 to 3, moves the probability from 88% to 95%, a jump of only 7 percentage points. Further increasing the linear predictor has diminishing effect. Likewise, large negative values of the linear predictor lead to probabilities ever closer to zero.

Figure 22.4: Graph of the inverse logit function (aka the logistic function). The linear predictor in our case is alpha + beta * x. The role of the inverse logit function is to map this linear predictor to a scale bounded by zero and one. This essentially takes any number from -\(\infty\) to \(\infty\) and provides a probability value as an output.
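The probabilities quoted for the inverse logit function can be reproduced with a few lines of base R. The sketch uses `plogis()`, base R's implementation of the inverse logit; the grid of linear predictor values is chosen only to match the discussion.

```r
# Linear predictor values 0, 1, 2, 3
y <- 0:3
theta <- plogis(y)   # same as 1 / (1 + exp(-y))

round(theta, 2)        # 0.50 0.73 0.88 0.95

# Change in probability for each one-unit increase in the linear predictor
round(diff(theta), 2)  # 0.23 0.15 0.07

# Large negative values push the probability toward zero
plogis(c(-3, -6, -9))
```

Each additional unit of the linear predictor buys a smaller gain in probability, which is why a coefficient cannot be read as a fixed change in probability.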

Almost any time you are modelling a probability as a function of many explanatory variables, using the inverse logit link function is an obvious choice to make the mathematics work.

You have officially been exposed to all the building blocks you need for executing Bayesian inference of ever-increasing complexity. These include latent nodes, observed nodes, calculated nodes, edges, plates, probability distributions, linear predictors, and inverse-link functions. While you have not seen every probability distribution or every inverse-link function, you have now seen enough that you should be able to digest new instances of these things. In the next chapter, we seek to build confidence by increasing the complexity of the business narrative and the resulting generative DAG to yield insights. Insights you might not even have thought possible!

**Exercise 22.1** Assume a linear predictor, \(y\), is being run through the exponential function to ensure a positive value, \(k\). This positive value, \(k\), represents the average number of customers in the drive-thru line at a local Wendy’s restaurant. Currently, your data analysis suggests that \(y = 1.1\). What is the associated value of \(k\), i.e. the average number of customers in line?

**Exercise 22.2** Continue from the previous exercise with the following additional information. According to your data analysis, a certain process change lowers the value of the linear predictor by 0.4 to \(y = 0.7\). What is the implied reduction in the number of customers in the drive-thru?

**Exercise 22.3** Assume a linear predictor, \(y\), is being run through the inverse logit link function to transform the linear predictor to a probability. This probability, \(\theta\), represents the probability a customer purchases french fries at a local Wendy’s restaurant. For a particular draw of your posterior distribution, the increase in the linear predictor due to the coupon promotion is 1.5. Assuming that the value of the linear predictor without the coupon is 0.5, what is the absolute change in the probability that a customer orders french fries when presented with the coupon (according to this draw of the posterior)?


YouTube playlist link for videos that accompany each chapter: https://youtube.com/playlist?list=PLassxuIVwGLPy-mtohX-NXrjD8fc9FBOc

Buy a beautifully printed full-color version of "A Business Analyst's Guide to Business Analytics" on Amazon: http://www.amazon.com/dp/B08DBYPRD2