Jach's personal blog

(Largely containing a mind-dump to myselves: past, present, and future)
Current favorite quote: "Supposedly smart people are weirdly ignorant of Bayes' Rule." William B Vogt, 2010

Notes from Probability Theory Chapter 1 continued

Last time I covered some notes on the first section of the first chapter; today we'll go a little further.

1.2 - Analogies with physical theories

A quote that directly precedes this section in its expanded form:
"A mathematician is a person who can find analogies between theorems; a better mathematician is one who can see analogies between proofs and the best mathematician can notice analogies between theories. One can imagine that the ultimate mathematician is one who can see analogies between analogies."
--Stefan Banach

Jaynes mentions some interesting analogies between probability theory and physical theories. The first is how reality is complicated; there is so much stuff out there that even our best theories cannot handle it all. Where does reality get its computation power? We need some of that! So our theories generally start small and look at little pieces of things, and when they work out they're expanded into larger theories to look at larger things. We're still not sure if we'll eventually get a Theory of Everything or not, but at least history seems to indicate that we'll get a Theory of Very Nearly Approximately Everything at some point.

As physical models get bigger they also get more complicated, and so too with models founded on probability theory. If you've read Feynman's QED, he elegantly expresses the very simple rules by which single photons and other particles behave. The problem is when you try to reason about billions of them all over the place.

The two share another analogy: the peculiar notion that our theories often have trouble with things "familiar" to us. Jaynes gives the example that the difference in the ultraviolet spectra of iron and nickel can be explained in exhaustive mathematical detail, even though the vast majority of humans don't know those spectra exist, while something as familiar and ordinary as the growth of grass pushes our limits, and in some cases our theories are just utterly useless. This should put some prior constraint on our models: we shouldn't expect too much out of them, and we should be prepared to change and update them.

Another analogy is that any advances frequently lead to consequences that have great practical value, but it's somewhat unpredictable when this will happen or if it will happen. I've heard rumor about a cheap way to do a full intuitive calculation that's new, and while it may be wrapped up in useless-sounding papers like "here's how we can do a database join faster", the eventual consequences could be great. Jaynes gives two scientific discoveries: first, Röntgen's discovery of X-rays leading to many new forms of medical diagnosis. Second, Maxwell's discovery of "another term in the equation for curl H" which led to near-instant communication around the Earth.

1.3 - The thinking computer

From the legendary John von Neumann: "You insist that there is something a machine cannot do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that!"

The problem with fully general AI isn't AI or slow machines or something like that, it's that we are lacking a key piece or pieces of knowledge to build it right. One piece of knowledge we lack is what exactly "thinking" consists of. Probability theory helps us answer many of those questions, because it's also the study of common sense.

Probability theory has also led to the creation of very powerful software whose abilities humans cannot match unaided. Humans may be able to reason about a couple of competing hypotheses at once, as long as they're really different, but what about the kind of problem some software is made for, like determining the relative plausibilities of 100 different hypotheses competing to explain 10,000 separate observations and pieces of evidence? Good luck doing that with paper and pen! But similarly good luck if all you had were a computer and no probability theory. In the policeman example from before, what determines whether the policeman's suspicion of a crime should rise a lot or a little when he sees a broken window? Probability theory will tell us.

The notion of a thinking machine is also useful for developing the theory. So far we've been talking about "human common sense" and things like that, but humans are weird creatures and prone to craziness and outbursts that don't match the common sense of others around them. So instead of building something that perfectly matches human common sense, let's build a machine that can do useful plausible reasoning following a set of rules, and let's make sure these rules express a sort of idealized common sense that respected, educated, [otherfeelgoodwordwehumansusehere] humans can agree to when they're not under emotional distress.

1.4 - Introducing the Robot

Probability theory is about describing the actions of a robot brain we design according to some rules. These rules will be proposed from considered desiderata; that is, things that are desirable in a working human brain, and things that we think a respected, educated, rational individual, on discovering they were violating one of these desirable traits, would wish to revise their thinking.

Our premises are truly arbitrary assumptions (sort of). We can make the robot do whatever we like with whatever rules we like. But this robot should have a purpose. It should provide "useful plausible reasoning". So when we make a set of rules, we need to test the robot and see if its reasoning seems comparable to ours and if we think it might be a good candidate on problems we can't do ourselves. When the rules work, it's an accomplishment of the theory, not a particular premise.

The robot will reason about propositions typically denoted with capital letters but I like to write out words, and I may get in the habit of using sigils to denote them such as $myprop is better than $yourprop instead of saying A is better than B. For now, we'll restrict the robot to classical logic where all propositions must be exactly true or exactly false. This limits our robot severely since it can't deal with uncertainty. Our robot is no better than Prolog. Later on we'll see about how we might be able to extend our two-valued 1-bit logic to a more useful logic that can handle uncertainties of any kind. (They say Fuzzy Logic handles epistemic uncertainty while Probability Theory handles uncertainty of events, but probability theory can be made to handle epistemic uncertainty as well. We can either start talking about how plausible propositions are along with events, or we can embed our robot into the problem. "What is our uncertainty about the robot answering some $answer to this epistemic question?")

1.5 - Boolean Algebra

George Boole gave us several nice notations for doing algebra with 1-bit true/false logic instead of having to reason it out in the head all the time. The two most common ideas (for which there are several notations) are those of the "logical product" and the "logical sum".

The logical product is this: $$AB$$. Sometimes you'll see it as $$A,B$$ with a comma in between, sometimes with an explicit "product" sign as in $$A \cdot B$$, sometimes you'll see it as $$A \wedge B$$, but they all mean to say "both A and B are true". Sometimes it's called the conjunction of A and B.

The logical sum is this: $$A+B$$. Sometimes you'll see it as $$A \vee B$$, it means "at least one of A or B is true, they might both be". Sometimes it's called the disjunction of A and B. Note that order doesn't matter for either the product or the sum, but also note that the symbols are only notation--A and B are true/false propositions, not numbers.

The order of operations should be the standard expected one. $$AB + C$$ denotes $$(AB) + C$$ and not $$A(B + C)$$.
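As a small illustration of that precedence rule, here is a sketch in Python, where `and` plays the role of the logical product and `or` the logical sum. Python happens to bind `and` more tightly than `or`, which matches the convention above. The particular truth values are arbitrary picks of mine (not from the book), chosen so that the two groupings disagree.

```python
# Hypothetical example values, chosen so the two groupings give different answers.
A, B, C = False, True, True

product_first = (A and B) or C   # the intended reading: (AB) + C
sum_first = A and (B or C)       # the other grouping: A(B + C)
unparenthesized = A and B or C   # Python's own precedence: `and` before `or`

print(product_first, sum_first, unparenthesized)
```

The unparenthesized expression agrees with `product_first`, not with `sum_first`.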

When considering multiple propositions, there might be dependencies. For instance, A is true if and only if B is true (which means equivalently that B is true if and only if A is true). If we can demonstrate such a dependency, that allows us to substitute A for B and B for A anywhere we see fit. This is a very powerful tool since it allows us to deductively prove that some proposition B, that we can't find out about for some reason, is in fact true or false based solely on what we know about another proposition A and knowing that A and B are linked. When A and B are linked we say they have the same "truth value".

Let us make our first real axiom to set in stone for the theory of plausible reasoning Jaynes is developing: any two propositions that share a truth value are equally plausible. Reasonable enough?

The denial, or the negation, of a proposition has several notations. The most common I use is overlining the proposition with a bar. Thus $$\overline{A} := A\ is\ false$$. It's saying "Not A". Three other notations that come up: $$\sim A$$, $$!A$$, and $$A^c$$. (C for complement.)

Watch out for ambiguity with the bar since it's easy to abuse. $$\overline{AB}$$ means "'AB' is false". $$\overline{A}\overline{B} = \overline{A}\ \overline{B} = \overline{A}\ \ \ \ \overline{B}$$ (note the separation in the bar) means "both 'A' and 'B' are false". For the first one, in order for "AB" to be false only one of A or B has to be false, the other could be true. The $$\LaTeX$$ renderer on my blog makes a difference by shifting the bar height, but to aid avoiding ambiguity I'll try to either put a space or use the explicit product dot.

Let us further clarify this by bringing up two negation rules commonly known as De Morgan's laws.

$$\overline{AB} = \overline{A} + \overline{B}$$
$$\overline{A+B} = \overline{A} \cdot \overline{B}$$

Convince yourself of this using truth tables. These two laws are also known as the "Duality Property".
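If you'd rather have a machine fill in the truth tables, here is a minimal sketch in Python that enumerates all four rows and checks both laws. The function and variable names are my own. It also lists the rows where "'AB' is false" differs from "both 'A' and 'B' are false", which is the bar ambiguity mentioned earlier.

```python
from itertools import product

def de_morgan_holds(a: bool, b: bool) -> bool:
    """Check both De Morgan identities for one assignment of A and B."""
    first = (not (a and b)) == ((not a) or (not b))    # not(AB) = notA + notB
    second = (not (a or b)) == ((not a) and (not b))   # not(A+B) = notA . notB
    return first and second

# The full truth table for two propositions has four rows.
rows = list(product([False, True], repeat=2))
all_hold = all(de_morgan_holds(a, b) for a, b in rows)

# Rows where not(AB) and (notA)(notB) disagree: exactly one of A, B is true.
bar_ambiguity = [(a, b) for a, b in rows
                 if (not (a and b)) != ((not a) and (not b))]

print(all_hold)
print(bar_ambiguity)
```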

Jaynes lists a few other properties. Commutativity and Associativity work as expected. Distributivity is also fairly standard, but contains a new identity you might not be familiar with. The common Distributivity is $$A(B+C) = AB + AC$$, the uncommon one is $$A + (BC) = (A+B)(A+C)$$. We also have the Idempotence property, which states that $$AA = A$$ and $$A+A = A$$.
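The uncommon distributive identity looks wrong if you read the symbols as ordinary arithmetic (with numbers, 1 + 2·3 is 7 but (1+2)(1+3) is 12), so it's worth checking by brute force. Below is a rough sketch in Python that tests both distributive identities and idempotence over all eight assignments of A, B, C. The helper name `check_identities` is just something I made up for the illustration.

```python
from itertools import product

def check_identities(a: bool, b: bool, c: bool) -> dict:
    """Evaluate each identity for one assignment; True means the identity holds."""
    return {
        "A(B+C) = AB + AC": (a and (b or c)) == ((a and b) or (a and c)),
        "A + BC = (A+B)(A+C)": (a or (b and c)) == ((a or b) and (a or c)),
        "AA = A": (a and a) == a,
        "A + A = A": (a or a) == a,
    }

# Collect every (identity, assignment) pair that fails.
failures = [
    (name, (a, b, c))
    for a, b, c in product([False, True], repeat=3)
    for name, ok in check_identities(a, b, c).items()
    if not ok
]

print(failures)  # an empty list means no row violates any identity
```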

The last point to be made about Boolean Algebra is that of Implication. It is written $$A \rightarrow B$$ or more often $$A \Rightarrow B$$. It can be read multiple ways, most commonly as "A implies B". Sometimes it's read as "If A, then B". Sometimes it's read slightly reversed, as "B follows from A". Note that this is a proposition in itself; it does not assert that either A or B is true or false, it simply asserts a relationship between them. (That doesn't even have to be causal.) This is the truth table for Implication given all values of A and B:

A      B      A => B
True   True   True
True   False  False
False  True   True
False  False  True

If you recall the first two strong syllogisms from last time, they fall naturally out of Implication. $$A \rightarrow B$$ is exactly the same as saying that $$A \overline{B} = false$$ as above. By the duality property, this is also the same as saying that $$\overline{A} + B$$ is true. We can also represent the entire truth table in the form of an equation: $$A = AB$$, which is our actual definition of $$A \rightarrow B$$. If we know A is true, that determines that B is true, and if we know B is false, that determines that A is false.

If A is True, B must also be True to get a True implication statement as the table shows. This is the same as the first strong syllogism.

If B is False, A must also be False in order to get a True implication. This is the same as the second strong syllogism.

If A is false, B can be either True or False for a True implication as the table shows. If B is True, A can be True or False for a True implication as the table shows. Whereas pure logic regards these last two statements as not saying much, we defined two weak syllogisms that did in fact say something. Jaynes calls the term "weak syllogism" misleading. Because pure logic doesn't say anything about those two statements, the theory of plausible reasoning which does say something about them should not be considered a "weakened" form of logic but an extension. The theory of plausible reasoning contains classic, pure, deductive logic as a special case.
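The equivalent ways of writing Implication mentioned above ("A and AB have the same truth value", "A and not-B is false", "not-A or B is true") can be compared row by row. This is only a sketch in Python with names of my own choosing. Here `implies` is defined directly from the equation A = AB.

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication, defined as 'A and AB have the same truth value'."""
    return a == (a and b)

table = []
for a, b in product([True, False], repeat=2):
    table.append({
        "A": a,
        "B": b,
        "A => B": implies(a, b),
        "not (A and not B)": not (a and not b),
        "(not A) or B": (not a) or b,
    })

# All three forms should give the same column of truth values.
forms_agree = all(
    row["A => B"] == row["not (A and not B)"] == row["(not A) or B"]
    for row in table
)

for row in table:
    print(row["A"], row["B"], row["A => B"])
print(forms_agree)
```

The only row where the implication comes out False is A True, B False, which is the content of both strong syllogisms.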

Jaynes mentions a tricky point I don't think is very tricky, but I'll mention it again here for completeness. In common talk people might say "A implies B" to mean that we can logically deduce B from A. In real, formal logic, however, it means only that "A" and "AB" have the same truth value represented by the equation above. To actually deduce and derive requires a whole host of propositions that we accept as true that we can use in the deduction. Logic simply says that every true statement implies every other true statement, and that every false proposition implies all propositions. If this was to be interpreted as logical deduction, then every false proposition would be logically contradictory. Clearly then deduction is not the intended purpose of logical implication alone.

1.6 - Adequate sets of operations

So far we've talked about four core "connectives", or "operations". To recap, we started with just two propositions "A" and "B", then we defined 1) the logical product, 2) the logical sum, 3) the logical implication, 4) the logical negation.

Using those four primitives we can generate other new propositions. They turn out to be sufficient for generating every logical proposition, but can we play the common math game of "find the minimal number of axioms that still lets us do it"? That is, can we arrive at, say, the logical sum from only the other three?

Yes. In fact we just need AND and NOT: $$A + B = \overline{(\overline{A} \cdot \overline{B})}$$. If we combine AND and NOT into a single operation "NAND", we have: $$A \uparrow B = \overline{AB} = \overline{A} + \overline{B}$$. This gives us immediately negation, the logical product, and the logical sum: $$\overline{A} = A \uparrow A$$, $$AB = (A \uparrow B) \uparrow (A \uparrow B)$$, and $$A + B = (A \uparrow A) \uparrow (B \uparrow B)$$ respectively. Which means that NAND is sufficient to generate all other logical propositions or functions.

NOR is similarly powerful: $$A \downarrow B = \overline{A + B} = \overline{A} \cdot \overline{B}$$.
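To check that NAND alone really does rebuild negation, the product, and the sum, here is a minimal sketch in Python. The NAND constructions are the three given above. The NOR constructions are the standard dual ones, which I'm adding as an assumption since they aren't spelled out in the text. All function names are hypothetical.

```python
from itertools import product

def nand(a: bool, b: bool) -> bool:
    return not (a and b)

def nor(a: bool, b: bool) -> bool:
    return not (a or b)

# NOT, AND, OR built only from NAND.
def not_from_nand(a): return nand(a, a)
def and_from_nand(a, b): return nand(nand(a, b), nand(a, b))
def or_from_nand(a, b): return nand(nand(a, a), nand(b, b))

# NOT, OR, AND built only from NOR (dual constructions, assumed here).
def not_from_nor(a): return nor(a, a)
def or_from_nor(a, b): return nor(nor(a, b), nor(a, b))
def and_from_nor(a, b): return nor(nor(a, a), nor(b, b))

rows = list(product([False, True], repeat=2))

nand_ok = all(
    not_from_nand(a) == (not a)
    and and_from_nand(a, b) == (a and b)
    and or_from_nand(a, b) == (a or b)
    for a, b in rows
)
nor_ok = all(
    not_from_nor(a) == (not a)
    and or_from_nor(a, b) == (a or b)
    and and_from_nor(a, b) == (a and b)
    for a, b in rows
)

print(nand_ok, nor_ok)
```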

Is this just a mathematical game though? There is a practical purpose to this exercise, and that's in the design of computer circuits. For those not in the know, computers are just 1s and 0s, trues and falses, but those 1s and 0s have to be represented by electricity which isn't quite as discrete and simple. And so we have a concept of "logic gates" that have input pins and output pins, the output pins changing electrical characteristics depending on the input pins. So a very simple "AND gate" might have three pins: the first two pins represent A and B, the third represents the logical product. Due to some magical electronics outside the scope of this post, the third pin will only carry the electrical characteristic corresponding to a "1" if both of the first two pins do as well. Because NAND and NOR are each sufficient for all other logic, it can be beneficial to stuff as many NAND gates as you can onto a single chip, which lets you perform any logical function by hooking them up in various ways.

Jaynes feels like we've gone far enough into deductive logic at this point. The book is after all about Probability Theory, an extension to logic, so we should start focusing more on that!

1.7 - The basic desiderata

We're going to build an extension to classical logic on the following conditions. These are not axioms because we're not asserting anything is true. They are called "desiderata" because they're simply goals we think are desirable, and that we think any reasonable human should also consider desirable. Chapter two will address the question of whether these goals are actually possible to realize mathematically (they are). And if they're realizable mathematically, we can implement them in our robot! I will list here the three desiderata then talk about them. They are three goals we wish our robot to satisfy before we can even begin to consider it a worthy robot for the purpose of broad rational reasoning in the face of imperfect information that we're up against all the time as humans.

1: Degrees of plausibility are represented by Real numbers.

2: Qualitative correspondence with common sense.

3: Consistency--in three senses of the word as below
3a: If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
3b: The robot always takes into account all of the evidence it has relevant to a question. It does not arbitrarily ignore some of the information, basing its conclusions only on what remains. In other words, the robot is completely non-ideological.
3c: The robot always represents equivalent states of knowledge by equivalent plausibility assignments. That is, if in two problems the robot's state of knowledge is the same (except perhaps for the labeling of the propositions), then it must assign the same plausibilities in both.

Desideratum 1 shouldn't be controversial. It turns out to be required theoretically, but the fact our goal is to program a robot to think about plausibilities kind of insists that we use numbers somewhere.

Desideratum 2 also shouldn't be controversial. While the robot might sometimes come to different conclusions than humans, it should at least have a reasoning process that is qualitatively similar to how humans reason. We hinted at that last time with the weak syllogisms, and there are other interesting ways to discover how humans reason.

Jaynes says that 1, 2, and 3a are the "structural" requirements for our robot, and 3b and 3c are "interface" requirements that relate the robot's behavior to the outer world. These desiderata implore us to make some notational assumptions that we'll go over here.

Since we're using real numbers for plausibilities, we'll adopt the convention that a larger number corresponds to being more plausible. We'll also assume what's called a "continuity property", which, informally, simply states that a slight rise in something's plausibility should correspond only to a slight rise in the number.

Another notational convention is that of conditional dependence. As humans we don't really consider propositions in isolation, we consider them in the context of other propositions that we assume are true. The symbol $$A | B$$ represents this, and can be read as "the conditional plausibility that A is true, given that B is true." It's usually simply stated as "A given B". We can use the same notation as above with regular logic for either side of the bar. So we might construct $$A+B | CD$$ to represent the plausibility that at least one of the propositions A and B is true, given that both C and D are true.

Since we have this notion of a range of possible numbers, we're no longer limited to the equals comparison operator of classical logic. Therefore $$(A|B) > (C|B)$$ says that, given B, A is more plausible than C. (It does not say how much.)

Suppose that we have some old information C that gets updated from new evidence to new information K, in such a way that we now consider A more likely. That is, $$A|K > A|C$$. Then, therefore, $$\overline{A} | K < \overline{A} | C$$; in other words, the plausibility that A is false decreases with the new information K.

1.8 - Comments

Jaynes deserves an entire quote here:

As politicians, advertisers, salesmen, and propagandists for various political, economic, moral, religious, psychic, environmental, dietary, and artistic doctrinaire positions know only too well, fallible human minds are easily tricked, by clever verbiage, into committing violations of the above desiderata. We shall try to ensure that they do not succeed with our robot.

We emphasize another contrast between the robot and a human brain. By Desideratum I, the robot’s mental state about any proposition is to be represented by a real number. Now, it is clear that our attitude toward any given proposition may have more than one ‘coordinate’. You and I form simultaneous judgments about a proposition not only as to whether it is plausible, but also whether it is desirable, whether it is important, whether it is useful, whether it is interesting, whether it is amusing, whether it is morally right, etc. If we assume that each of these judgments might be represented by a number, then a fully adequate description of a human state of mind would be represented by a vector in a space of a rather large number of dimensions.

Not all propositions require this. For example, the proposition ‘The refractive index of water is less than 1.3’ generates no emotions; consequently the state of mind which it produces has very few coordinates. On the other hand, the proposition, ‘Your mother-in-law just wrecked your new car’ generates a state of mind with many coordinates. Quite generally, the situations of everyday life are those involving many coordinates. It is just for this reason, we suggest, that the most familiar examples of mental activity are often the most difficult to reproduce by a model. Perhaps we have here the reason why science and mathematics are the most successful of human activities: they deal with propositions which produce the simplest of all mental states. Such states would be the ones least perturbed by a given amount of imperfection in the human mind.

Of course, for many purposes we would not want our robot to adopt any of these more ‘human’ features arising from the other coordinates. It is just the fact that computers do not get confused by emotional factors, do not get bored with a lengthy problem, do not pursue hidden motives opposed to ours, that makes them safer agents than men for carrying out certain tasks.

These remarks are interjected to point out that there is a large unexplored area of possible generalizations and extensions of the theory to be developed here; perhaps this may inspire others to try their hand at developing ‘multidimensional theories’ of mental activity, which would more and more resemble the behavior of actual human brains – not all of which is undesirable. Such a theory, if successful, might have an importance beyond our present ability to imagine.[5]

For the present, however, we shall have to be content with a much more modest undertaking. Is it possible to develop a consistent ‘one-dimensional’ model of plausible reasoning? Evidently, our problem will be simplest if we can manage to represent a degree of plausibility uniquely by a single real number, and ignore the other ‘coordinates’ just mentioned.

We stress that we are in no way asserting that degrees of plausibility in actual human minds have a unique numerical measure. Our job is not to postulate – or indeed to conjecture about – any such thing; it is to investigate whether it is possible, in our robot, to set up such a correspondence without contradictions.

But to some it may appear that we have already assumed more than is necessary, thereby putting gratuitous restrictions on the generality of our theory. Why must we represent degrees of plausibility by real numbers? Would not a ‘comparative’ theory based on a system of qualitative ordering relations such as (A|C) > (B|C) suffice? This point is discussed further in Appendix A, where we describe other approaches to probability theory and note that some attempts have been made to develop comparative theories which it was thought would be logically simpler, or more general. But this turned out not to be the case; so, although it is quite possible to develop the foundations in other ways than ours, the final results will not be different.

[5] Indeed, some psychologists think that as few as five dimensions might suffice to characterize a human personality; that is, that we all differ only in having different mixes of five basic personality traits which may be genetically determined. But it seems to us that this must be grossly oversimplified; identifiable chemical factors continuously varying in both space and time (such as the distribution of glucose metabolism in the brain) affect mental activity but cannot be represented faithfully in a space of only five dimensions. Yet it may be that five numbers can capture enough of the truth to be useful for many purposes.

My own comment on that is that since dimensionality is an interpretation, even our one-dimensional probability theory is general enough for highly dimensional problems so long as we're careful enough when compressing them.

1.8.1 - Common language vs. formal logic

Jaynes mentions that common language is perfectly capable of being as precise as formal logic statements, but common language is a lot more complicated with plenty of room for error and subtle nuances that simplified logic cannot grasp without being explicitly told. He gives this example: "Mr. A, to affirm his objectivity, says, 'I believe what I see.' Mr. B retorts: 'He doesn't see what he doesn't believe.'" A logical standpoint considers both statements to mean the same thing, a common language standpoint suggests the statements had the intent and effect to carry opposite meanings.

Another example is to be wary of mixing up "is" and "=". Judea Pearl would also warn against mixing up either of those with "flows causally from", and especially don't mix that up with "implies", and don't mix that up with "derives" or even "is derived from". One nice thing about math and logic is that the symbols generally have formal definitions devoid of ambiguity. A downside of course is that the same symbols tend to get reused and so acquire different meanings in different contexts, as seen above with the use of "+". In programming we also have things like "i = i+1", which is not an equation at all but an instruction telling the computer to do something in a certain order.

He gives another example that "Knowledge is Power" is true in both human relations and thermodynamics, but "Power is Knowledge" is absurd (and obscene) and most importantly false.

This is more than just the fact that "is" like other verbs has a subject and predicate. Native English speakers may forget that "is" has two major meanings. It's the difference between "The room is noisy" and "There is noise in the room." French sometimes distinguishes between the two types (not in this case) by saying something more like "The room has noise" (but again not in this case). Apparently Turkish enforces the distinction strictly. The second statement is ontological and asserts something's existence, the first statement is epistemological and expresses the speaker's personal perception.

This is also a great example of the dumb "If a tree falls in a forest does it make a sound?" question. "This room is noisy!" "No it's not!" "Yes it is!" That is an argument over personal perceptions, and says nothing about whether there are in fact vibrations in the air of the room. Similarly with the tree, clearly if you believe the laws of physics a falling tree will create air vibrations, thus there will be a sound, but if there is no human brain around to hear the vibrations, the tree will cause no perception of sound to be had by a human. Many arguments are simply over definitions or a failure to see multiple definitions.

Anyway, English is particularly bad about mixing up epistemological statements and ontological ones. If we interpret the sentence "It's noisy in here!" ontologically instead of epistemologically, we have committed the mind projection fallacy by asserting that our own private thoughts and sensations are realities existing externally in Nature. It's a common cause of trouble both in language and in probability theory. And in philosophy. And in psychology. And in physics. Indeed, it's a general problem with the human brain that we do not want our robot to share. Traditionally, epistemological statements can be handled decently by fuzzy logic. But we can also handle them by our robot with sufficient care. Instead of asking our robot for the plausibility of "It's noisy in here!", we ask our robot for the plausibility of "Kevin thinks it's noisy in here." Or we might ask it the plausibility of "People like Kevin think it's noisy in here", and then we've really got ourselves a fuzzy problem! It would be interesting perhaps to have a robot that could automatically detect the difference between those three questions and give the response for "the intended meaning". I'm not sure if this is even possible though without a robot that can simulate human minds.

1.8.2 - Nitpicking

Jaynes has a great reply to anyone who thinks two-valued deductive logic is insufficient for all types of reasoning we might want to make, but who also thinks probability theory isn't the way forward. He says: "By all means, investigate other possibilities if you wish to; and please let us know about it as soon as you have found a new result that was not contained in two-valued logic or our extension of it, and is useful in scientific inference."

Indeed, some people have developed "ternary logic" where propositions can have three values. And indeed there are multiple "n-valued logics" out there. However, it can be shown that an "n-valued logic" applied to a particular set of propositions is either equivalent to the classic 2-valued logic applied to a bigger set, or it has internal inconsistencies, and then why would you use it?

That's it for chapter 1. Chapter 2 actually starts doing some math! Oh noes! It's pretty simple math. Until the exercises at the end (the math is still simple but there's a lot of it). It's about the same length as chapter 1, so expect a couple blog posts from me as I cover it. I will also provide solutions for the exercises.

Posted on 2012-01-22 by Jach

Tags: bayes, books, math, probability

