Do not let the fancy calligraphic notation of \(\mathcal{P}(X_1,\ldots,X_n)\) scare you; it’s equivalent to last chapter’s \(f(X_1,\ldots,X_n)\). A frustration of learning math is that there are multiple conventions for naming the same things. Let’s start getting used to this so we can read further notes beyond this book. The fancy \(\mathcal{P}\) is just a function that takes input and spits out a probability. In this textbook and others, you will see all sorts of equivalent notation for a probability function, including \(\mathcal{P}(\cdot), f(\cdot), Pr(\cdot), P(\cdot), \textrm{and } p(\cdot)\), where \(\cdot\) is replaced by a set of random variable values. For joint distributions, the input is values of \(n\) random variables. Think of \(X_1,\ldots,X_n\) as placeholders: you supply realizations \(x_1,\ldots,x_n\) and the function returns a probability. The use of the capital \(X\)’s in \(\mathcal{P}(X_1,\ldots,X_n)\) refers to the fact that a distribution gives a probability for all possible realizations of the random variables.
The most complete method of reasoning about sets of random variables is by having a joint probability distribution. A joint probability distribution, \(\mathcal{P}(X_1,\ldots,X_n)\), assigns a probability value to all possible assignments or realizations of sets of random variables. The goal of this chapter is to 1) introduce you to the notation of joint probability distributions and 2) convince you that, if you are given a joint probability distribution, you would be able to answer some very useful questions related to probability.
Consider the graphical model from Shenoy and Shenoy (2000) depicted in Figure 11.1. (Shenoy, Catherine, and Prakash P. Shenoy. 2000. Bayesian Network Models of Portfolio Risk and Return. The MIT Press.)
In the diagram, there are four random variables: 1) Interest Rate \((IR)\), 2) Stock Market \((SM)\), 3) Oil Industry \((OI)\), and 4) Stock Price \((SP)\) (assume for an oil company). The arrows between the random variables tell a story of precedence and causal direction: interest rate influences the stock market which then, in combination with the state of the oil industry, will determine the stock price (we will learn more about these arrows in the Graphical Models chapter). For simplicity and to gain intuition about joint distributions, assume that each of these four random variables is binary-valued, meaning they can each take two possible assignments:
Random Variable \((X_i)\) | Set of Possible Values (i.e. \(Val(X_i)\)) |
---|---|
\(IR\) | \(\{high, low\}\) |
\(SM\) | \(\{good, bad\}\) |
\(OI\) | \(\{good, bad\}\) |
\(SP\) | \(\{high, low\}\) |
Thus, our probability space has \(2 \times 2 \times 2 \times 2 = 16\) values corresponding to the possible assignments to these four variables. So, a joint distribution must be able to assign probability to these 16 combinations. Here is one possible joint distribution:
Note that probability notation on the Internet or in books does not conform to just one convention. I will mix conventions in this book, not to confuse you, but to give you exposure to other conventions you might encounter when going forward on your learning journey outside of this text. Every time I introduce a slightly changed notation, I will add comments in the margin introducing it.
\(IR\) | \(SM\) | \(OI\) | \(SP\) | \(P(IR,SM,OI,SP)\) |
---|---|---|---|---|
\(high\) | \(good\) | \(good\) | \(high\) | 0.016 |
\(low\) | \(good\) | \(good\) | \(high\) | 0.168 |
\(high\) | \(bad\) | \(good\) | \(high\) | 0.04 |
\(low\) | \(bad\) | \(good\) | \(high\) | 0.045 |
\(high\) | \(good\) | \(bad\) | \(high\) | 0.018 |
\(low\) | \(good\) | \(bad\) | \(high\) | 0.189 |
\(high\) | \(bad\) | \(bad\) | \(high\) | 0.012 |
\(low\) | \(bad\) | \(bad\) | \(high\) | 0.0135 |
\(high\) | \(good\) | \(good\) | \(low\) | 0.004 |
\(low\) | \(good\) | \(good\) | \(low\) | 0.042 |
\(high\) | \(bad\) | \(good\) | \(low\) | 0.04 |
\(low\) | \(bad\) | \(good\) | \(low\) | 0.045 |
\(high\) | \(good\) | \(bad\) | \(low\) | 0.012 |
\(low\) | \(good\) | \(bad\) | \(low\) | 0.126 |
\(high\) | \(bad\) | \(bad\) | \(low\) | 0.108 |
\(low\) | \(bad\) | \(bad\) | \(low\) | 0.1215 |
Notation note: \(P(X,Y) = P(X \textrm{ and } Y)\). Each defines a function where you supply realizations \(x\) and \(y\) and the probability function returns \(P(X=x \textrm{ and } Y=y)\). For example, let \(X \equiv\) outcome of a die roll and \(Y \equiv\) outcome of a coin flip. You can then supply potential outcomes, like \(P(6,Heads)\), which means \(P(X=6,Y=Heads)\), and the function output would be \(\frac{1}{12}\) (if you were to do the math).
Collectively, the above 16 probabilities represent the joint distribution \(P(IR,SM,OI,SP)\) - meaning, you plug in values for all four random variables and it gives you a probability. For example, \(P(IR=low,SM=bad,OI=bad,SP=low)\) yields a probability assignment of 12.15%.
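To make the "plug in values, get a probability" idea concrete, here is a minimal Python sketch that stores the 16 rows of the table in a dictionary and looks one of them up. The names `joint` and `p` are illustrative choices, not notation from the text, and a dictionary is only one of several reasonable ways to hold a small joint distribution.

```python
import math

# Joint distribution P(IR, SM, OI, SP), copied row by row from the table above.
# Keys are (IR, SM, OI, SP) tuples; `joint` is a hypothetical name for illustration.
joint = {
    ("high", "good", "good", "high"): 0.016,
    ("low",  "good", "good", "high"): 0.168,
    ("high", "bad",  "good", "high"): 0.04,
    ("low",  "bad",  "good", "high"): 0.045,
    ("high", "good", "bad",  "high"): 0.018,
    ("low",  "good", "bad",  "high"): 0.189,
    ("high", "bad",  "bad",  "high"): 0.012,
    ("low",  "bad",  "bad",  "high"): 0.0135,
    ("high", "good", "good", "low"):  0.004,
    ("low",  "good", "good", "low"):  0.042,
    ("high", "bad",  "good", "low"):  0.04,
    ("low",  "bad",  "good", "low"):  0.045,
    ("high", "good", "bad",  "low"):  0.012,
    ("low",  "good", "bad",  "low"):  0.126,
    ("high", "bad",  "bad",  "low"):  0.108,
    ("low",  "bad",  "bad",  "low"):  0.1215,
}


def p(ir, sm, oi, sp):
    """Return P(IR=ir, SM=sm, OI=oi, SP=sp) by table lookup."""
    return joint[(ir, sm, oi, sp)]


# A valid joint distribution covers every assignment and sums to 1.
total = sum(joint.values())

print(len(joint))                     # 16
print(p("low", "bad", "bad", "low"))  # 0.1215
print(math.isclose(total, 1.0))       # True
```

The comparison uses `math.isclose` rather than `==` because summing decimal fractions in floating point can leave a tiny rounding error.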
If it’s been a while since you have seen the summation notation, \(\sum\), or set notation like \(\in\), you can do a quick review of them at Wikipedia: https://en.wikipedia.org/wiki/Summation and https://en.wikipedia.org/wiki/Set_(mathematics), respectively.
More generally speaking, a marginal distribution is a compression of information where only information regarding the marginal variables is maintained. Take a set of random variables, \(X\) (e.g. \(\{IR,SM,OI,SP\}\)), and a subset of those variables, \(Y\) (e.g. \(\{OI\}\)). Using standard mathematical convention, let \(Z = X \setminus Y\) be the set of random variables in \(X\) that are not in \(Y\) (i.e. \(Z = \{IR,SM,SP\}\)). Assuming discrete random variables, the marginal distribution \(P(Y)\) is calculated from the joint distribution as \(P(Y=y) = \sum_{z} P(Y=y,Z=z)\), where the sum runs over every possible assignment \(z\) to the variables in \(Z\). Effectively, when the joint probability distribution is in tabular form, one just sums up the probabilities of all rows where \(Y=y\).
One might also be curious about probability assignments for just a subset of the random variables. This smaller subset of variables can be called marginal variables and their probability distribution is called a marginal distribution. For example, the marginal distribution for oil industry \((OI)\) is notated as \(P(OI)\) and represents a probability distribution over just one of the four variables - ignoring the others. The marginal distribution can be derived from the joint distribution using the formula:
\[ P(OI = x) = \sum_{i \in IR, j \in SM, k \in SP} \left( P(OI=x,IR=i,SM=j,SP=k) \right) \]
Think of a marginal distribution as a function of the marginal variables. Given realizations of the marginal variables, the function returns a probability. Applying the above formula to determine the marginal distribution of \(OI\) yields a tabular representation of the marginal distribution (Table 11.1).
Table 11.1: A marginal distribution shown in table form.
Realization (\(x\)) | \(P(OI = x)\) |
---|---|
\(Good\) | 0.016 + 0.168 + 0.04 + 0.045 + 0.004 + 0.042 + 0.04 + 0.045 = 0.4 |
\(Bad\) | 0.018 + 0.189 + 0.012 + 0.0135 + 0.012 + 0.126 + 0.108 + 0.1215 = 0.6 |
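The row-summing behind Table 11.1 can also be sketched in Python. This is a minimal illustration, assuming the joint distribution is stored as a dictionary keyed by \((IR, SM, OI, SP)\) tuples; the names `VARIABLES`, `joint`, and `marginal` are hypothetical and not part of the text.

```python
from collections import defaultdict

# Order of the variables inside each key tuple (an assumption of this sketch).
VARIABLES = ("IR", "SM", "OI", "SP")

# Joint distribution P(IR, SM, OI, SP), copied from the joint table.
joint = {
    ("high", "good", "good", "high"): 0.016,
    ("low",  "good", "good", "high"): 0.168,
    ("high", "bad",  "good", "high"): 0.04,
    ("low",  "bad",  "good", "high"): 0.045,
    ("high", "good", "bad",  "high"): 0.018,
    ("low",  "good", "bad",  "high"): 0.189,
    ("high", "bad",  "bad",  "high"): 0.012,
    ("low",  "bad",  "bad",  "high"): 0.0135,
    ("high", "good", "good", "low"):  0.004,
    ("low",  "good", "good", "low"):  0.042,
    ("high", "bad",  "good", "low"):  0.04,
    ("low",  "bad",  "good", "low"):  0.045,
    ("high", "good", "bad",  "low"):  0.012,
    ("low",  "good", "bad",  "low"):  0.126,
    ("high", "bad",  "bad",  "low"):  0.108,
    ("low",  "bad",  "bad",  "low"):  0.1215,
}


def marginal(joint, keep):
    """Sum out every variable that is not named in `keep`."""
    positions = [VARIABLES.index(name) for name in keep]
    result = defaultdict(float)
    for assignment, prob in joint.items():
        # Project the full assignment onto the kept variables and accumulate.
        key = tuple(assignment[i] for i in positions)
        result[key] += prob
    return dict(result)


p_oi = marginal(joint, ["OI"])
for (oi,), prob in p_oi.items():
    print(oi, round(prob, 4))
# good 0.4
# bad 0.6
```

Because `keep` can name more than one variable, the same function also produces marginals over several variables at once.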
Exercise 11.2 Suppose we are interested in both the Stock Market (\(SM\)) and the Oil Industry (\(OI\)). We can find the marginal distribution for these two variables, \(P(SM,OI)\). This is sometimes called a joint marginal distribution; joint referring to the presence of multiple variables and marginal referring to the notion that this is a subset of the original joint distribution. So, given the probabilities in the above joint distribution, what is the marginal distribution for \(\{SM,OI\}\) - i.e. give a probability function for
If you need a refresher on conditional probability, see the Wikipedia article here: https://en.wikipedia.org/wiki/Conditional_probability and the Khan Academy video here: https://youtu.be/6xPkG2pA-TU.
Conditional distributions can be used to model scenarios where a subset of the random variables is known (e.g. data) and the remaining subset is of interest (e.g. model parameters). For example, we might be interested in getting the conditional distribution of Stock Price (\(SP\)) given that we know the Interest Rate. The notation for this is \(P(SP|IR)\) and can be calculated using the definition of conditional probability:
Think of a conditional distribution as a function of the variables to the left of the conditioning pipe (\(|\)). The right-side variables are assumed given, so you already know their values with 100% certainty. You supply realizations of the left-side variables, and the function returns a probability.
\[ P(A |B) = \frac{P(A \textrm{ and } B)}{P(B)} \]
For our specific problem:
\[ P(SP|IR) = \frac{P(SP \textrm{ and } IR)}{P(IR)} \]
To calculate conditional probabilities when already given the joint distribution, use a two-step process:
\(1.\) First, to simplify the problem, calculate the numerator, i.e. the marginal distribution \(P(SP,IR)\), and rid ourselves of the variables that we are not interested in. To get the marginal distribution, just aggregate the rows in the joint distribution as done in the previous section on marginal distributions:
\(IR\) | \(SP\) | \(P(IR,SP)\) |
---|---|---|
\(high\) | \(high\) | 0.086 |
\(low\) | \(high\) | 0.4155 |
\(high\) | \(low\) | 0.164 |
\(low\) | \(low\) | 0.3345 |
\(2.\) Then, calculate any conditional distribution \(P(SP|IR)\) of interest by plugging in the given value for \(IR\) and all of the possible \(SP\) values. For example \(P(SP|IR=high)\) means we need to be able to find a probability for the two \(SP\) outcomes given that we know \(IR = high\). Hence, we calculate \(P(SP=high|IR=high)\) and \(P(SP=low|IR=high)\):
\[ \begin{aligned} P(SP=high|IR=high) &= \frac{P(SP=high,IR=high)}{P(IR=high)} \\ &= \frac{P(SP=high,IR=high)}{P(SP=high,IR=high) + P(SP=low,IR=high)} \\ &= \frac{0.086}{0.086 + 0.164} \\ &= 0.344 \end{aligned} \]
and,
\[ \begin{aligned} P(SP=low|IR=high) &= \frac{P(SP=low,IR=high)}{P(IR=high)} \\ &= \frac{P(SP=low,IR=high)}{P(SP=high,IR=high) + P(SP=low,IR=high)} \\ &= \frac{0.164}{0.086 + 0.164} \\ &= 0.656 \end{aligned} \]
which yields the following tabular representation of the conditional distribution for \(P(SP=x|IR=high)\):
\(x\) | \(P(SP=x \lvert IR=high)\) |
---|---|
\(high\) | \(0.344\) |
\(low\) | \(0.656\) |
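The two-step calculation can be sketched in a few lines of Python. The sketch starts from the marginal table \(P(IR,SP)\) computed in step 1; the dictionary name `p_ir_sp` and the function name `conditional_sp_given_ir` are illustrative assumptions, not notation from the text.

```python
# Marginal distribution P(IR, SP) from step 1; keys are (IR, SP) tuples.
p_ir_sp = {
    ("high", "high"): 0.086,
    ("low",  "high"): 0.4155,
    ("high", "low"):  0.164,
    ("low",  "low"):  0.3345,
}


def conditional_sp_given_ir(p_ir_sp, ir):
    """Return P(SP | IR=ir) as a dict keyed by the SP realization."""
    # Denominator P(IR=ir): sum the rows that match the given IR value.
    p_ir = sum(prob for (i, _), prob in p_ir_sp.items() if i == ir)
    # Numerator P(SP=sp, IR=ir) divided by the denominator, for each sp.
    return {sp: prob / p_ir for (i, sp), prob in p_ir_sp.items() if i == ir}


cond = conditional_sp_given_ir(p_ir_sp, "high")
print({sp: round(prob, 3) for sp, prob in cond.items()})
# {'high': 0.344, 'low': 0.656}
```

Dividing by \(P(IR=high)\) simply rescales the matching rows so that they sum to 1, which is why the two conditional probabilities add up to 100%.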
Sometimes, we are not interested in a complete probability distribution, but rather seek a high-probability assignment to some subset of variables. For this, we can use a \(\textrm{MAP}\) query (maximum a posteriori query). A \(\textrm{MAP}\) query finds the most likely assignment of all non-evidentiary variables (i.e. those with unknown values). Basically, once the evidence is fixed, you search for the assignment with the largest probability value. For example, the maximum a posteriori estimate of stock price given \(IR=high\) would be given by the following formula:
\[ \arg \max_{x \in SP} P(SP=x|IR=high) \]
which in natural language asks for the argument (i.e. the realization of stock price) that maximizes the conditional probability \(P(SP=x|IR=high)\). From above, we realize that \(P(SP=high|IR=high) = 0.344\) and \(P(SP=low|IR=high) = 0.656\) and hence, the MAP estimate is that \(SP = low\) because \(0.656 > 0.344\).
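As a minimal sketch, the \(\arg\max\) can be computed in Python by asking for the dictionary key with the largest value. The variable names here are illustrative only. The second part of the sketch uses the unnormalized values \(P(SP=x, IR=high)\) from the marginal table; since dividing every candidate by the same \(P(IR=high)\) does not change their ranking, it gives the same answer.

```python
# Conditional distribution P(SP = x | IR = high) from the table above.
p_sp_given_ir_high = {"high": 0.344, "low": 0.656}

# arg max: the realization x with the largest conditional probability.
map_sp = max(p_sp_given_ir_high, key=p_sp_given_ir_high.get)
print(map_sp)  # low

# Same query on the unnormalized joint values P(SP = x, IR = high).
p_sp_and_ir_high = {"high": 0.086, "low": 0.164}
map_sp_unnormalized = max(p_sp_and_ir_high, key=p_sp_and_ir_high.get)
print(map_sp_unnormalized)  # low
```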
Why don’t we just use joint probability distributions all the time? Despite its expressive power, a joint probability distribution is not easy to construct directly due to the curse of dimensionality. As the number of random variables being considered in a dataset grows, the number of potential probability assignments grows exponentially. Even in the era of big data, this curse of dimensionality still exists. Generally speaking, the size of the dataset required grows exponentially with the number of descriptive features.
Let’s assume we have \(n\) random variables, each having \(k\) possible values. Thus, the joint distribution requires \(k^n\) probabilities. Even if \(k=2\) and \(n=34\), this leads to 17,179,869,184 possibilities (over 17 billion). To make this concrete, a typical car purchase decision might easily look at 34 different variables (e.g. make, model, color, style, financing, etc.). So, to model this decision would require a very large joint distribution which actually dwarfs the amount of data that is available. As a point of comparison, well under 100 million motor vehicles were sold worldwide in 2019 - i.e. far less than one data point per possible combination of features. Despite this “curse”, we will learn to get around it with more compact representations of joint distributions. These representations will require less data, but will still yield the power to answer queries of interest, just as if we had access to the full joint distribution.
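The growth of \(k^n\) is easy to check numerically. The short Python sketch below uses a hypothetical helper, `joint_table_size`, to reproduce the two table sizes mentioned in this chapter, and treats the 100 million vehicle figure purely as the upper bound stated above.

```python
def joint_table_size(k, n):
    """Number of probabilities in a joint table of n variables with k values each."""
    return k ** n


# The four binary variables of the stock price example.
print(joint_table_size(2, 4))            # 16

# The 34 binary variables of the car purchase example.
print(f"{joint_table_size(2, 34):,}")    # 17,179,869,184

# Upper bound on data points per combination, using 100 million sales as a ceiling.
sales_upper_bound = 100_000_000
ratio = sales_upper_bound / joint_table_size(2, 34)
print(ratio < 0.01)                      # True
```

Each additional binary variable doubles the table, so going from 4 to 34 variables multiplies the number of required probabilities by \(2^{30}\).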