# Notes from Probability Theory Chapter 2

Chapter 2, "The Quantitative Rules," begins with a quote:
"Probability theory is nothing but common sense reduced to calculation." --Laplace, 1819.

In my quotes file I have it in the (original?) French: "La théorie des probabilités n'est que le bon sens réduit au calcul." --Pierre Simon, Marquis de Laplace

This chapter is devoted to the mathematical deduction of the quantitative rules of inference which follow from the three desiderata of last time. They were 1) representation of degrees of plausibility by real numbers; 2) qualitative correspondence with common sense; 3) consistency.

### 2.1 - The product rule

We begin by finding a consistent rule that relates the plausibility of $A\cdot B$ to the plausibilities of A and B separately. Specifically, we are trying to find $(A\cdot B|C)$, where C represents the general background context: the information or knowledge we have and assume to be true. (This is intentionally vague.)

There are two processes our robot can follow when trying to decide whether $AB$ is true. They are mirror images of each other, but I'll list them both. The first process is two steps:

1: decide whether B is true, whose plausibility is represented as $(B|C)$
2: if we accepted B as true, decide whether A is true. $(A|BC)$

The symmetric process is:

1*: decide whether A is true; $(A|C)$
2*: if we accepted A as true, decide whether B is true. $(B|AC)$

Focusing on the first process: if AB is to be true, step 1 must end in deciding that B is true, so the plausibility $(B|C)$ should be involved somehow. Then, provided step 1 resulted in B being true (otherwise we can short-circuit and declare AB false), we need to decide whether A is true, so $(A|BC)$ should be involved as well. The same reasoning applies to the second process. Also remember that AB is commutative, so any process that computes AB must give the same result as one computing BA; therefore we only need the two plausibilities from either of the above processes and nothing else. Whatever function we come up with to describe $(AB|C)$ depends only on $(B|C)$ and $(A|BC)$. Let's write that down:

$(A \cdot B | C) = F[(B|C), (A|B\cdot C)]$

We don't yet know what the implementation of F should be, because we don't yet know how those two plausibilities ought to relate.

One mistake people might make is to assume we only need $(A|C)$ and $(B|C)$. This is easily shown to be false: it's quite plausible that the next person you meet has blue eyes, quite plausible that that person's hair is black, and reasonably plausible that both are true. On the other hand, it's quite plausible that the person's left eye is blue, and quite plausible that the right eye is brown, but very implausible that both are true. So we need more than $(A|C)$ and $(B|C)$; we need to follow one of the two processes outlined above.
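To make this concrete, here is a small sketch (the joint distributions are made up for illustration): two joint distributions can have identical $(A|C)$ and $(B|C)$ yet different $(A\cdot B|C)$, so no function of the two marginal plausibilities alone can compute the plausibility of the conjunction.

```python
# Two made-up joint distributions over binary propositions A and B.
# Both have p(A) = p(B) = 0.5, but different p(A and B).

# Joint 1: A and B independent, so p(AB) = 0.25.
joint1 = {(True, True): 0.25, (True, False): 0.25,
          (False, True): 0.25, (False, False): 0.25}

# Joint 2: A and B mutually exclusive, so p(AB) = 0.
joint2 = {(True, True): 0.0, (True, False): 0.5,
          (False, True): 0.5, (False, False): 0.0}

def marginal_a(joint):
    return sum(p for (a, b), p in joint.items() if a)

def marginal_b(joint):
    return sum(p for (a, b), p in joint.items() if b)

def p_ab(joint):
    return joint[(True, True)]

# Same marginal plausibilities...
assert marginal_a(joint1) == marginal_a(joint2) == 0.5
assert marginal_b(joint1) == marginal_b(joint2) == 0.5
# ...but different joint plausibilities.
print(p_ab(joint1), p_ab(joint2))  # 0.25 0.0
```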

Do our desiderata imply a form that F must take? Yes. The math is simple and straightforward if you know calculus, but it is only marginally interesting; if you want to see it, go read Jaynes.

For now I'll appeal to the name of $AB$--the logical product. The form F takes is simple multiplication. Let us define the function $w(X|C)$ to represent the plausibility of some proposition X given C as a real number between 0 and 1, with 0 representing absolute impossibility and 1 representing absolute certainty. (This definition, too, can be derived from the desiderata.) Given that F is multiplication, the previous equation takes on the form:

$w(A \cdot B | C) = w(B|C) * w(A|B \cdot C) = w(A|C) * w(B|A \cdot C)$

The middle expression is used if the first process is taken, the rightmost expression if the second. This is henceforth known as the product rule.
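As a sanity check, here's the product rule verified both ways on a made-up joint distribution (the background C is simply the assumption that the table is correct, and conditional plausibilities are computed by restriction):

```python
from itertools import product

# A made-up joint distribution over two binary propositions A, B.
joint = {(True, True): 0.10, (True, False): 0.30,
         (False, True): 0.40, (False, False): 0.20}

outcomes = set(product([True, False], repeat=2))
A = {o for o in outcomes if o[0]}  # outcomes where A is true
B = {o for o in outcomes if o[1]}  # outcomes where B is true

def p(event):
    """Plausibility of an event, given as a set of (a, b) outcomes."""
    return sum(joint[o] for o in event)

def p_given(x, y):
    """p(X|Y), computed by restricting attention to outcomes where Y holds."""
    return p(x & y) / p(y)

# w(AB|C) = w(B|C) * w(A|BC) = w(A|C) * w(B|AC)
lhs = p(A & B)
assert abs(lhs - p(B) * p_given(A, B)) < 1e-12  # first process
assert abs(lhs - p(A) * p_given(B, A)) < 1e-12  # second process
print(lhs)  # 0.1
```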

### 2.2 - The Sum Rule

If we consider a single logical proposition A, we know that the logical product $A \cdot \overline{A}$ is always false, and the logical sum $A + \overline{A}$ is always true. Thus the plausibility of A being false must depend in some way on the plausibility that A is true. In other words,

$w(\overline{A}|C) = F[ w(A|C) ]$

where we again don't know the nature of F (not necessarily the same F as before). Through some modestly interesting math, which you can find in Jaynes, we can again derive the required form of F from our desiderata. For now I'll appeal to the name of this section and tell you that $F(x) = 1 - x$. In other words (to expose the sum):

$w(A|C) + w(\overline{A}|C) = 1$

This is the sum rule. It should be intuitively true that this is the case, and Jaynes is there if you need help convincing yourself it's true.

Now, you might have been expecting something different from the title of this section. You may be wondering, "What's the plausibility $w(A+B|C)$?" A fine question! Let's see how, by repeatedly using the sum and product rules along with our knowledge of negation, we can algebraically derive a way to compute it. From now on I'm going to use p instead of w.

\begin{align} p(A+B|C) &= 1 - p(\overline{A} \cdot \overline{B} | C) = 1 - p(\overline{A}|C)*p(\overline{B}|\overline{A} \cdot C) \\ &= 1 - p(\overline{A}|C)[1 - p(B|\overline{A} \cdot C)] = p(A | C) + p(\overline{A} \cdot B|C) \\ &= p(A|C) + p(B|C) * p(\overline{A}|B \cdot C) = p(A|C) + p(B|C) [1 - p(A|B \cdot C)] \\ &= p(A|C) + p(B|C) - p(A \cdot B | C) \end{align}

This is known as the generalized sum rule. The primitive sum rule is the special case $B = \overline{A}$. Please spend the few minutes necessary to understand this derivation. If you don't understand it (a lot of "obvious" steps are omitted), please leave a comment and I will post a more complete step-by-step derivation with an explanation of each step.
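Both the generalized sum rule and its $B = \overline{A}$ special case are easy to check numerically on a made-up joint table, in the same style as before:

```python
from itertools import product

# Made-up joint distribution over binary propositions A, B, given background C.
joint = {(True, True): 0.15, (True, False): 0.25,
         (False, True): 0.35, (False, False): 0.25}

outcomes = set(product([True, False], repeat=2))
A = {o for o in outcomes if o[0]}
B = {o for o in outcomes if o[1]}
not_A = outcomes - A

def p(event):
    return sum(joint[o] for o in event)

# Generalized sum rule: p(A+B|C) = p(A|C) + p(B|C) - p(AB|C)
assert abs(p(A | B) - (p(A) + p(B) - p(A & B))) < 1e-12

# Special case B = not-A: p(A|C) + p(not-A|C) = 1
assert abs(p(A) + p(not_A) - 1.0) < 1e-12
```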

It is important to note that the product rule and the non-general sum rule form an adequate set for plausible-inference operations, in the same sense that NAND alone is adequate for all Boolean logic operations. We saw that the generalized sum rule follows from the other two; shortly we will see that the famous Bayes' theorem also follows from a trivial application of the product rule.

In practice we will likely have enough background information that brute-force calculation with just the two rules won't be necessary, and we can apply higher-level theorems (like Bayes' theorem and marginalization, and higher ones still), just as in practice computer engineers with breadboards use many more kinds of logic chips than just NAND chips. Still, it's neat to play axiom-golf, and it's neat that two very simple rules determine all there is to something as complicated-sounding as inferential reasoning. Extending logic according to our desiderata wasn't so hard, was it?

Here are two exercises you're invited to do, along with my own solutions, which I have fairly high confidence in but would nevertheless like reassurance about.

Exercise 2.1: Is there a general formula for $p(C|A+B)$ analogous to the generalized sum rule? Go ahead and try to find it.

Exercise 2.2: Suppose we have a set of propositions $\{A_1, \ldots, A_n\}$ which, on some general information X, are mutually exclusive. Show that $p(C|(A_1 + A_2 + \ldots + A_n) \cdot X)$ is the weighted average of the separate plausibilities $p(C|A_i \cdot X)$. That is, show

$p(C|(A_1 + A_2 + \ldots + A_n) \cdot X) = p(C|A_1\cdot X + A_2 \cdot X + \ldots + A_n \cdot X) = \frac{\Sigma_i p(A_i|X)*p(C|A_i \cdot X)}{\Sigma_i p(A_i|X)}$

Solution for 2.1: a general formula does in fact exist, if we assume there is additional information we can factor out of A and B: some proposition known to be true, kind of analogous to factoring out a 1. Let's call this true proposition $I$, such that $p(A|X\cdot I) = p(A|X)$. First let's rewrite the product rule:

\begin{align} p(AB|C) &= p(A|BC) * p(B|C) \\ p(A|BC) &= \frac{p(AB|C)}{p(B|C)} \end{align}

also let's derive Bayes' Theorem from there:

$p(A|BC) = \frac{p(B|AC)*p(A|C)}{p(B|C)}$

We'll use these presently:

\begin{align} p(C|A+B) &= p(C|(A+B)\cdot I) = \frac{p(C \cdot (A+B)|I)}{p(A+B|I)} \\ &= \frac{p(C\cdot A + C\cdot B | I)}{p(A|I) + p(B|I) - p(A\cdot B|I)} \\ &= \frac{p(C\cdot A|I) + p(C\cdot B|I) - p(A\cdot B \cdot C | I)}{p(A|I) + p(B|I) - p(A\cdot B|I)} \\ &= \frac{p(C|A\cdot I)*p(A|I) + p(C|B\cdot I)*p(B|I) - p(C|A\cdot B \cdot I)*p(A\cdot B|I)}{p(A|I) + p(B|I) - p(A\cdot B|I)} \\ &= \frac{p(C|A)*p(A|I) + p(C|B)*p(B|I) - p(C|A\cdot B)*p(A\cdot B|I)}{p(A|I) + p(B|I) - p(A\cdot B|I)} \\ &= \frac{p(C|A)*p(A|I) + p(C|B)*p(B|I) - p(C|A\cdot B)*p(A|B\cdot I)*p(B|I)}{p(A|I) + p(B|I) - p(A|B\cdot I)*p(B|I)} \\ &= \frac{p(C|A)*p(A|I) + p(B|I)*\Big[p(C|B) - p(C|A \cdot B)*p(A|B)\Big]}{p(A|I) + p(B|I) * \Big[1 - p(A|B)\Big]} \end{align}

Now multiply the numerator and denominator by $\frac{p(B|A)}{p(B|I)}$, using the product rule in the form $\frac{p(A|I)*p(B|A\cdot I)}{p(B|I)} = \frac{p(A\cdot B|I)}{p(B|I)} = p(A|B)$:

\begin{align} p(C|A+B) &= \frac{p(C|A)*p(A|B) + p(B|A)*\Big[p(C|B) - p(C|A \cdot B)*p(A|B)\Big]}{p(A|B) + p(B|A)*\Big[1 - p(A|B)\Big]} \\ &= \frac{p(C|A)*p(A|B) + p(B|A)*p(C|B) - p(B|A)*p(C \cdot A | B)}{p(A|B) + p(B|A) - p(A|B)*p(B|A)} \end{align}

That's as far as I care to reduce it. I'm not sure how useful this equation might be, but there you go. I'd appreciate someone double-checking my math and making sure it's actually valid.
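Derivations like this are easy to get wrong, so here's a quick numeric spot-check in Python (on a made-up random joint distribution over three binary propositions, standing in for the background $I$). It checks the line of the derivation before the final simplifications: $p(C|A+B) = \frac{p(C|A)*p(A|I) + p(C|B)*p(B|I) - p(C|A\cdot B)*p(A\cdot B|I)}{p(A|I) + p(B|I) - p(A\cdot B|I)}$.

```python
from itertools import product
import random

# Made-up random joint distribution over three binary propositions A, B, C.
random.seed(0)
outcomes = list(product([True, False], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

def p(event):
    return sum(joint[o] for o in event)

A = {o for o in outcomes if o[0]}
B = {o for o in outcomes if o[1]}
C = {o for o in outcomes if o[2]}

def given(x, y):
    """p(X|Y) by restriction to outcomes where Y holds."""
    return p(x & y) / p(y)

# Direct computation of p(C | A+B)
direct = given(C, A | B)

# Via the derived identity
num = given(C, A) * p(A) + given(C, B) * p(B) - given(C, A & B) * p(A & B)
den = p(A) + p(B) - p(A & B)
assert abs(direct - num / den) < 1e-12
```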

Solution for 2.2: one part of the exercise I left out is what "mutually exclusive" means. Jaynes writes $p(A_i \cdot A_j | X) = p(A_i | X) \delta_{ij}$, which isn't very illuminating for the novice. The $\delta_{ij}$ is the Kronecker delta, shorthand for "1 if $i = j$ and 0 otherwise." Consider rolling a six-sided die: the possible outcomes are mutually exclusive. So if you were to ask our robot "how plausible is it that I roll a 1 and a 6 on the same roll?" (that is, $p(1 \cdot 6 | fair\ 6\ sided\ die\ being\ rolled\ once)$), our robot had better answer something like "impossible!" or just "0". However, if you ask it about "1 and 1", this is no different from just asking about "1", so if it's a fair die the robot should answer something like "1/6".
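The die example, with the Kronecker delta behavior made explicit, can be sketched as:

```python
# Fair six-sided die rolled once: the propositions A_i = "the roll is i"
# are mutually exclusive, so p(A_i and A_j | X) = p(A_i | X) * delta_ij.
faces = range(1, 7)
p_face = {i: 1 / 6 for i in faces}

def p_and(i, j):
    """p(A_i and A_j | X), using mutual exclusivity."""
    return p_face[i] if i == j else 0.0

assert p_and(1, 6) == 0.0               # "1 and 6 on the same roll": impossible
assert abs(p_and(1, 1) - 1 / 6) < 1e-12 # "1 and 1" is just "1"
```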

So, with the knowledge of what that delta represents, we can continue! Let's restate the exercise: show that

\begin{align} p(C|(A_1 + A_2 + \ldots + A_n) \cdot X) &= p(C|A_1\cdot X + A_2 \cdot X + \ldots + A_n \cdot X) \\ &= \frac{\Sigma_i p(A_i|X)*p(C|A_i \cdot X)}{\Sigma_i p(A_i|X)} \end{align}

This is pretty straightforward to show with the help of our N = 2 result from exercise 2.1. Let's start midway through that derivation and apply the mutual-exclusivity assumption, replacing $I$ with $X$ and $A, B$ with $A_1, A_2$.

\begin{align} p(C|(A_1 + A_2) \cdot X) &= \frac{p(C|A_1\cdot X)*p(A_1|X) + p(C|A_2\cdot X)*p(A_2|X) - p(C|A_1\cdot A_2 \cdot X)*p(A_1\cdot A_2|X)}{p(A_1|X) + p(A_2|X) - p(A_1\cdot A_2|X)} \\ &= \frac{p(C|A_1\cdot X)*p(A_1|X) + p(C|A_2\cdot X)*p(A_2|X)}{p(A_1|X) + p(A_2|X)} \\ &= \frac{\sum_{i=1}^{2} p(A_i|X)*p(C|A_i \cdot X)}{\sum_{i=1}^{2} p(A_i|X)} \end{align}

I will appeal to intuition that this generalizes to arbitrary N. (Convince yourself of this if you don't believe me.)
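For the skeptical, here's a numeric check of the weighted-average formula with more than two propositions, using the die again (the choice of C = "the roll is even" and the subset $A_1 + A_2 + A_3$ are made up to keep the check non-trivial):

```python
# Fair die: A_i = "the roll is i" are mutually exclusive propositions.
faces = set(range(1, 7))
p_face = {i: 1 / 6 for i in faces}

def p(event):
    return sum(p_face[i] for i in event)

def given(x, y):
    return p(x & y) / p(y)

C = {2, 4, 6}                    # "the roll is even"
A = {i: {i} for i in faces}      # A_i as events
disjunction = A[1] | A[2] | A[3] # A_1 + A_2 + A_3

# Direct: p(C | A_1 + A_2 + A_3)
direct = given(C, disjunction)

# Weighted average of p(C|A_i) with weights p(A_i)
weighted = (sum(p(A[i]) * given(C, A[i]) for i in (1, 2, 3))
            / sum(p(A[i]) for i in (1, 2, 3)))

assert abs(direct - weighted) < 1e-12  # both equal 1/3: only face 2 is even
```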

### 2.3 - Qualitative properties

Join me next time!

#### Posted on 2012-01-25 by Jach

Tags: bayes, books, math, probability
