## Distance to the truth

Here’s my brief proposal in response to the debate found here regarding verisimilitude.

Theories are probability distributions over observations, and convertible to probability distributions over possible worlds. This way to describe theories is richer than a compatibility relation between theories and worlds. Theories are not just compatible or incompatible with possible worlds, they assign probabilities to them.

The notion of truth, in the sense of the complete truth, can be described by a probability distribution as well. In a scenario with no indeterminacy, the true theory, call it T, is a degenerate case of a probability distribution: it assigns probability 1 to the actual world, and zero elsewhere. In a scenario with indeterminacy, the true theory assigns probabilities to possible worlds; this is similar to the indeterminacy present in some interpretations of quantum mechanics.

Once we have established that theories, including the true theory T, are probability distributions, distance to the truth is a matter of choosing (somewhat arbitrarily) a metric on probability distributions. We can choose, for example, the Jensen–Shannon divergence, because it has the nice property of always being finite:

JSD(P, Q) = (1/2) D(P || M) + (1/2) D(Q || M), with M = (1/2)(P + Q)

where D is the Kullback–Leibler divergence. So the distance to the truth for a theory H is

D(H) = JSD(T, H)

where T, the true theory, is given. Of course, it is more interesting to consider what we think is the distance to the truth, as we don’t have magical access to what the truth really is. Our estimation of what the truth is can be obtained via Bayesian inference using experimental evidence. So we could define what we think is the truth as

T = argmax(H) P(H|E)

where H are theories, and E is observed evidence. Then we would estimate the distance to the truth as the distance to the theory we think is most likely (given by argmax)

D(H, E) = JSD(T, H)

where

T = argmax(H) P(H|E)
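As a sketch, the JSD (and the KL divergence D it is built from) is easy to compute for discrete distributions over worlds; the three-world numbers below are hypothetical, chosen only for illustration:

```python
import math

def kl(p, q):
    """Kullback–Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen–Shannon divergence: always finite, bounded by 1 in bits."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example with three possible worlds: T is degenerate,
# putting all its mass on the actual world.
truth = [1.0, 0.0, 0.0]   # the true theory T
theory = [0.7, 0.2, 0.1]  # a candidate theory H

print(jsd(truth, theory))  # the distance to the truth D(H)
```

Note that JSD(T, T) = 0, so the true theory is at distance zero from itself, as a distance should behave.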

But there is another possibility. In the above formula, we are not using all our experimental evidence. Some of the information is thrown away by taking only the most likely theory, and ignoring the rest of the probability distribution over theories that evidence establishes. Remember that what we want is to compare probability distributions over worlds. In order to integrate all the information that evidence provides, we can compare our theory against the predictive distribution over worlds that all theories contribute to, not just the most likely one.  We define the predictive distribution over worlds as

Pd(W) = ∑ P(W|H)P(H|E)

where the sum ∑ is over theories H. Finally, our new estimate of distance to the truth becomes

D(H, E) = JSD(Pd, H)

where

Pd = ∑ P(W|H)P(H|E)
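A minimal sketch of this mixture, with hypothetical theories and posteriors (all numbers are illustrative, not taken from any real inference):

```python
# Hypothetical setup: two theories over three possible worlds.
p_w_given_h = {
    "H1": [0.8, 0.1, 0.1],  # P(W|H1)
    "H2": [0.2, 0.5, 0.3],  # P(W|H2)
}
p_h_given_e = {"H1": 0.75, "H2": 0.25}  # P(H|E) from Bayesian updating

# Pd(W) = sum over H of P(W|H) * P(H|E)
pd = [sum(p_h_given_e[h] * p_w_given_h[h][w] for h in p_w_given_h)
      for w in range(3)]
print(pd)  # a single distribution over worlds, mixing all theories
```

The result is itself a probability distribution over worlds, so it can be compared against any theory H with the JSD, exactly as before.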

## Meta-theoretic induction in action

In my last post I presented a simple model of meta-theoretic induction. Let’s instantiate it with concrete data and run through it. Say we have

E1, E2, E3: Observations made for domains 1–3

S1, S2, S3: Simple theories for domains 1–3

C1, C2, C3: Complex theories for domains 1–3

S: Meta-theory favoring simple theories

C: Meta-theory favoring complex theories

That is, we have three domains of observation with corresponding theories. We also have two meta-theories that will produce priors on theories. The meta-theories themselves will be supported by theories’ successes or failures: successes of simple theories support S, successes of complex theories support C. Now define the content of the theories through their likelihoods

| En | P(En\|Sn) | P(En\|Cn) |
|----|-----------|-----------|
| E1 | 3/4       | 1/4       |
| E2 | 3/4       | 1/4       |
| E3 | 3/4       | 3/4       |

Given that E1, E2 and E3 are evidence, this presents a scenario where theories S1 and S2 were successful, whereas theories C1 and C2 were not. S3 and C3 represent theories that are equally well supported by previous evidence (E3) but with different future predictions. This is the crux of the example, where the simplicity bias enters into the picture. Our meta-theories are defined by

P(Sn|S) = 3/4, P(Sn|C) = 1/4

P(Cn|C) = 3/4, P(Cn|S) = 1/4

Meta-theory S favors simple theories, whereas meta-theory C favors complex theories. Finally, our priors are neutral

P(Sn) = P(Cn) = 1/2

P(S) = P(C) = 1/2

We want to process evidence E1 and E2, and see what happens at the critical point, where S3 and C3 make the same predictions. The sequence is as follows

1. Update meta-theories S and C with E1 and E2
2. Produce a prior on S3 and C3 with the updated C and S
3. Update S3 and C3 with E3

The last step produces probabilities for S3 and C3; these theories make identical predictions but will have different priors granted by S and C. This will formalize the statement

Simpler theories are more likely to be true because they have been so in the past
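As a sketch, the whole sequence can be run by hand with the numbers given above (likelihoods 3/4 and 1/4, neutral priors); the variable names are mine, and within each domain Sn and Cn are treated as the only alternatives:

```python
# Step-by-step run of the example; in each domain n, Cn = ¬Sn.
lik_S = [0.75, 0.75, 0.75]  # P(En|Sn)
lik_C = [0.25, 0.25, 0.75]  # P(En|Cn)

p_meta_S = 0.5  # P(S); P(C) = 1 - p_meta_S

# Step 1: update the meta-theories with E1 and E2.
for n in (0, 1):
    # P(En|M) = P(En|Sn) P(Sn|M) + P(En|Cn) P(Cn|M)
    e_given_S = lik_S[n] * 0.75 + lik_C[n] * 0.25
    e_given_C = lik_S[n] * 0.25 + lik_C[n] * 0.75
    p_meta_S = (e_given_S * p_meta_S
                / (e_given_S * p_meta_S + e_given_C * (1 - p_meta_S)))
print(round(p_meta_S, 3))  # 0.735: meta-theory S favored at ~73%

# Step 2: prior on S3 produced by the updated meta-theories.
prior_S3 = 0.75 * p_meta_S + 0.25 * (1 - p_meta_S)

# Step 3: update S3 with E3. Its likelihood equals C3's, so the
# posterior is exactly the prior granted by the meta-theories.
post_S3 = (lik_S[2] * prior_S3
           / (lik_S[2] * prior_S3 + lik_C[2] * (1 - prior_S3)))
print(round(post_S3, 3))  # 0.618: S3 favored over C3
```

The two printed numbers match the probabilities reported by the network below: roughly 73% for meta-theory S, and roughly 61–62% for S3 over C3.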

### The model as a bayesian network

Instead of doing all the above by hand (using equations 3, 4, 5 and 6), it’s easier to construct the corresponding bayesian network and let some general algorithm do the work. Formulating the model this way makes it much easier to understand; in fact it seems almost trivial. Additionally, our assumptions of conditional independence (1 and 2) map directly into the bayesian network formalism of nodes and edges, quite convenient! Node M represents the meta-theory, with possible values S and C; the H nodes represent theories, with possible values Sn and Cn. Note the lack of edges between Hn and Ex formalizing (1), and the lack of edges between M and En formalizing (2) (these were our assumptions of conditional independence).

I constructed this network using the SamIam tool developed at UCLA. With this tool we can construct the network and then monitor probabilities as we input data into the model, using the tool’s Query Mode. So let’s do that, fixing the actual outcome of the evidence nodes E1, E2 and E3. Theories S1 and S2 make correct predictions and are thus favoured by the data over C1 and C2. This in turn favours the meta-theory S, which is assigned a probability of 73%, over meta-theory C with 26%. Now, theories S3 and C3 make the same predictions about E3, but because our meta-theory is better supported, they are assigned different probabilities. Again, recall our starting point

Simpler theories are more likely to be true because they have been so in the past

We can finally state this technically: the simple theory S3 is favored at 61% over C3, with 38%, even though they make the same predictions. In fact, we can see how this works if we compare what happens with and without meta-theoretic induction: as expected, the mirrors of S3 and C3 (updated without a meta-theory) are granted the same probabilities. So everything seems to work: our meta-theory discriminates different theories and is itself justified via experience, as was the objective.

Occam seems like an unjustified and arbitrary principle, in effect, an unsupported bias. Surely, there should be some way to anchor this widely applicable principle on something other than arbitrary choice. We need a way to represent a meta-theory such that it favours some theories over others and such that it can be justified through observations.

But, what happens when we add a meta-theory like Occam(t) into the picture? What happens when we apply at the meta-level the same argument that prompted the meta-theoretic justification of simplicity we’ve developed? We define a meta-theory S-until-T with

P(S1|S-until-T) =  P(S2|S-until-T) = 3/4

P(S3|S-until-T) = 1/4

which yields this network. Now both S and S-until-T accrue the same probability through evidence and therefore produce the same prior on S3 and C3: 50%. It seems we can’t escape our original problem.
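This symmetry can be verified with a short sketch, reusing the numbers from the example and defining S-until-T only by the conditional probabilities just given:

```python
# Meta-theory C replaced by S-until-T, which flips its preference at n = 3.
lik_S = [0.75, 0.75, 0.75]  # P(En|Sn)
lik_C = [0.25, 0.25, 0.75]  # P(En|Cn)

# P(Sn|M) for each meta-theory.
s_given = {"S": [0.75, 0.75, 0.75], "S-until-T": [0.75, 0.75, 0.25]}
p_meta = {"S": 0.5, "S-until-T": 0.5}

for n in (0, 1):  # update both meta-theories with E1 and E2
    e_given = {m: lik_S[n] * s_given[m][n] + lik_C[n] * (1 - s_given[m][n])
               for m in p_meta}
    z = sum(e_given[m] * p_meta[m] for m in p_meta)
    p_meta = {m: e_given[m] * p_meta[m] / z for m in p_meta}
print(p_meta)  # both remain at 0.5: E1 and E2 cannot tell them apart

prior_S3 = sum(s_given[m][2] * p_meta[m] for m in p_meta)
print(prior_S3)  # 0.5: the prior on S3 is neutral again
```

Because S and S-until-T agree on every observed domain, no amount of past evidence can separate them; the disagreement only shows up in the prior they hand to S3, and there it cancels out.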

Because both Occam and Occam(t) are supported by the same amount of evidence, equal priors will be assigned to S3 and C3. The only way out of this is for Occam and Occam(t) to have different priors themselves. But this leaves us back where we started!

We are just recasting the original problem at the meta level; we end up begging the question or falling into an infinite regress.

In conclusion, we have succeeded in formalizing meta-theoretic induction in a bayesian setting, and have verified that it works as intended. However, it ultimately does not solve the problem of justifying simplicity. The simplicity principle remains a prior belief independent of experience.

(The two networks used in this post are metainduction1.net and metainduction2.net; you need the SamIam tool to open these files.)

 Simplicity is justified if we previously assume simplicity

## Formalizing meta-theoretic induction

In this post I formalize the discussion presented here. Recall

Simpler theories are more likely to be true because they have been so in the past

We want to formalize this statement into something that integrates into a bayesian scheme, such that the usual inference process, updating probabilities with evidence, works. The first element we want to introduce into our model is the notion of a meta-theory. A meta-theory is a statement about theories, just as a theory is a statement about observations (or the world if you prefer a realist language).

As a first approximation, we could formalize meta-theories as priors over theories. In this way, a meta-theory prior, together with observations, would yield probabilities for theories through the usual updating process. This formalization is technically trivial: we just relabel priors over theories as meta-theories. But this approach does not account for the second half of the original statement

..because they have been so in the past.

As pure priors, meta-theories would never be the object of justification. We need a way to represent a meta-theory such that it favours some theories over others and such that it can be justified through observations. In order to integrate with normal theories, meta-theories must accumulate probability via conditioning on observations, just as normal theories do.

We cannot depend on or add spurious observations like “this theory was right” as a naive mechanism for updating; this would split the meta and theory level. Evidence like “this theory was right” must be embedded in existing observations, not duplicated somewhere else as a stand alone, ad-hoc ingredient.

Finally, the notion of meta-theory introduces another concept, that of distinct theory domains. This concept is necessary because it is through cross-theory performance that a meta-theoretical principle can emerge. No generalization or principle would be even possible if there were no different theories to begin with. Because different theories may belong to different domains, meta-theoretic induction must account for logical dependencies pertaining to distinct domains; these theories make explicit predictions only about their domain.

Summing up:

Our model will consist of observations/evidence, theories and meta-theories. Theories and corresponding observations are divided into different domains; meta-theories are theories about theories, and capture inter-theoretic dependencies (see below). Meta-theories do not make explicit predictions.

Let’s begin by introducing terms

En: An element of evidence for domain n 

Hn: A theory over domain n

M: A meta-theory

Observations that do not pertain to a theory’s domain will be called external evidence. An important assumption in this model is that theories are conditionally independent of external observations given a meta-theory. This means that a theory depends on external observations only through those observations’ effects on meta-theories.

We start the formalization of the model with our last remark, conditional independence of theories and external observations given a meta-theory

P(Hn|Ex,M) = P(Hn|M) …………………… (1)

Additionally, any evidence is conditionally independent of a meta-theory given its corresponding theory, i.e. it is theories that make predictions, meta-theories only make predictions indirectly by supporting theories.

P(En|M,Hn) = P(En|Hn) …………………… (2)

Now we define how a meta-theory is updated

P(M|En) = P(En|M) * P(M) / P(En) …………………… (3)

this is just Bayes’ theorem. The important term is the likelihood, which by the law of total probability is

P(En|M) = P(En|M,Hn) * P(Hn|M) + P(En|M,¬Hn) * P(¬Hn|M)

which by conditional independence (2)

P(En|M) = P(En|Hn) * P(Hn|M) + P(En|¬Hn) * P(¬Hn|M) …………………… (4)

This equation governs how a meta-theory is updated with new evidence En. Now to determine how the meta-theory determines a theory’s prior. Again by total probability

P(Hn|Ex) = P(Hn|Ex,M) * P(M|Ex) + P(Hn|Ex,¬M) * P(¬M|Ex)

which by conditional independence (1)

P(Hn|Ex) = P(Hn|M) * P(M|Ex) + P(Hn|¬M) * P(¬M|Ex) …………………… (5)

The following picture illustrates how evidence updates a meta-theory, which in turn produces a prior; note that evidence E1 and E2 are external to H3. Lastly, updating a theory based on matching evidence is, as usual

P(Hn|En) = P(En|Hn) * P(Hn) / P(En) …………………… (6)

Equations 3,4,5 and 6 are the machinery of the model through which evidence can be processed in sequence. See it in action in the next post.
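As a sketch, the four equations translate directly into code; the function names below are mine, not part of the model, and the trailing example numbers are illustrative:

```python
def update_meta(p_m, lik_h, lik_not_h, h_given_m, h_given_not_m):
    """Equations (3) and (4): update P(M) with one piece of evidence En."""
    # P(En|M) = P(En|Hn) P(Hn|M) + P(En|¬Hn) P(¬Hn|M)
    e_m = lik_h * h_given_m + lik_not_h * (1 - h_given_m)
    e_not_m = lik_h * h_given_not_m + lik_not_h * (1 - h_given_not_m)
    # P(M|En) = P(En|M) P(M) / P(En)
    return e_m * p_m / (e_m * p_m + e_not_m * (1 - p_m))

def theory_prior(p_m, h_given_m, h_given_not_m):
    """Equation (5): the prior on Hn produced by the meta-theory."""
    return h_given_m * p_m + h_given_not_m * (1 - p_m)

def update_theory(p_h, lik_h, lik_not_h):
    """Equation (6): ordinary Bayesian update of a theory."""
    return lik_h * p_h / (lik_h * p_h + lik_not_h * (1 - p_h))

# One update with illustrative numbers: P(M) = 1/2, P(En|Hn) = 3/4,
# P(En|¬Hn) = 1/4, P(Hn|M) = 3/4, P(Hn|¬M) = 1/4.
print(update_meta(0.5, 0.75, 0.25, 0.75, 0.25))  # 0.625
```

Chaining these three functions, in that order, is exactly the evidence-processing sequence used in the worked example.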

 A given En represents a sequence of observations made for a domain n. So Hn|En represents induction in a single step, although in practice it would occur with successive bayesian updates for each subelement of evidence.

 This characteristic is the meta analogue of conditional independence between observations given theories. In other words, just as logical dependencies between observations are mediated by theories, inter-domain logical dependencies between theories are mediated by meta-theories.

## Simplicity and meta-theoretic induction

During my discussion of induction and Occam’s razor, I have said

The key is that our intuition, in fact our intelligence in general, has a built-in simplicity bias. We strongly favor the odd number theory because it is the simplest theory that fits the facts. Hence induction, including our everyday intuitions and the scientific method, is founded upon Occam’s razor as a way to discriminate between equally supported theories.

Note that I have described Occam as a simplicity bias. I deliberately chose this word in order to convey that simplicity is a guideline that we use prior and independent of experience. In the language of probability, Occam takes the form of a prior probability distribution that favors simpler theories before any updating has occured, and is unaffected by posterior evidence.

This state of affairs does not seem satisfactory; as described above, Occam seems like an unjustified and arbitrary principle, in effect, an unsupported bias. Surely, there should be some way to anchor this widely applicable principle on something other than arbitrary choice. The obvious course would be to say something like this:

Simpler theories are more likely to be true because they have been so in the past

thus grounding the principle on experience, just like any other case of inductive inference and scientific knowledge. But let’s go back to the source of the problem that this principle is trying to fix. We have two theories, S and C, which make identical, correct predictions with respect to all observations made to date. These two theories only differ in their future predictions.

And yet, in practice, we consider the predictions made by S much more likely than those made by C. Because by definition these two theories share likelihoods for observed evidence, it is only through their priors that we can assign them different probabilities. Here’s where the simplicity principle comes in. We favour theory S because it is simple, hence granting it a greater prior probability and consequently a greater posterior despite its shared likelihood with C. When asking ourselves how we justify the simplicity principle, we answer

Because simple theories have been true in the past.

So the simplicity principle acts like a meta-theory and can accrue probability through experience just like any other theory. Until now, everything seems to work, but here’s the problem. Let’s say we have two guiding principles:

Occam: Simpler theories are more likely to be true

Occam(t): Simpler theories are more likely to be true until time t

Whatever the mechanism by which the simplicity meta-theory accumulates posterior probability, so shall its peculiar brother, and in the exact same amount. When going back to our two theories, S and C, Occam will favour S while Occam(t) will favour C. Because both Occam and Occam(t) are supported by the same amount of evidence, equal priors will be assigned to S and C. The only way out of this is for Occam and Occam(t) to have different priors themselves. But this leaves us back where we started!

So in conclusion, if we try to solve the problem with

Simpler theories are more likely to be true because they have been so in the past

we are just recasting the original problem at the meta level; we end up begging the question or falling into an infinite regress. This should not really come as a surprise: there is no way to justify knowledge absolutely; there must be unjustified assumptions somewhere. In my view, Occam is such a bedrock

The way I see it, the foundations of scientific knowledge are the postulates of probability theory (as derived for example by Bernardo-Smith or Cox) together with Occam’s razor.

Please see my next post, where I formalize meta-theoretic induction.

 In essence, this is the problem of induction, which dates back to Hume who first posed it in its original form.

 The future is like the past because in the past, the future was like the past

## Epistemological dive

Note: I wrote this piece before the two posts presenting a simple model of learning using bayesian inference. There is significant overlap, and conclusions are stated without complete explanations.

I attended an informal talk on climate change recently, after which I had several discussions regarding the scientific process and the foundations of knowledge (in science).

One question was: is scientific knowledge inductive or deductive? Well, the scientific method requires deductive inference to establish the logical consequences of a theory in order to make predictions. But the justification of theories, the method by which a theory is temporarily accepted or discarded, is inductive. In the language of Bayes, theory confirmation/invalidation occurs by updating theory posteriors inductively (P(H|E)), whereas evidence conditioning on theories (P(E|H)) is derived deductively.

So, although deduction plays a part in establishing the logical consequences of theories in the form of testable predictions, the nature of the knowledge, or rather, the process by which that knowledge is gained, is fundamentally inductive.

What does this say about the foundations of knowledge in science? If scientific knowledge were deductive, we could simply say that its foundations are axiomatic. We could also talk about incompleteness and other interesting things. But if as we have stated this knowledge is inductive, what are its foundations? Is induction a valid procedure, and what are its endpoints?

This is a very deep subject, trying to go all the way to the bottom is what I have titled this post as an epistemological dive. I’m not going to give this a thorough treatment here, but I’ll briefly state what my position is and what I argued in discussion that day.

The way I see it, the foundations of scientific knowledge are the postulates of probability theory (as derived for example by Bernardo-Smith or Cox) together with Occam’s razor. In fact, given that most people are aware of probability theory I would say that the best single answer to the foundation of knowledge, in the sense that it is something we are less aware of, is Occam’s razor. I will give a brief example of this, borrowed from a talk by Shane Legg on machine super intelligence.

Let’s consider a minimal example of a scientific process. An agent is placed in an environment and must form theories whose predictions correctly match the agent’s observations. Although minimal, this description accounts for the fundamental elements of science. There is one missing element, and that is a specification of how the agent forms theories, but for now we will use our own intuition, as if we were the agent.

For this minimal example we will say that the agent observes a sequence of numbers which its environment produces. Thus, the agent’s observations are the sequence, and it must form a theory which correctly describes past observations and predicts future ones. Let’s imagine this is what happens as time goes forward, beginning with

1

For the moment there is only one data point, so it seems impossible to form a theory in a principled way.

1,3

Among others, two theories could be proposed here, odd numbers, and powers of 3, with corresponding predictions of 5 and 9:

f(n) = 2n – 1

f(n) = 3^(n-1)

the observations continue:

1,3,5

The powers of three theory is ruled out due to the incorrect prediction 9, while odd number theory was correct.

1,3,5,7

The odd number theory has described all observations and made correct predictions. At this point our agent would be pretty confident that the next observation will be 9.

1,3,5,7,57

What?! That really threw the agent off; it was very confident that the next item would be 9. But it turned out to be 57. As the builder of this small universe, I’ll let you know the correct theory, call it theory_57:

f(n) = 2n – 1 + 2(n-1)(n-2)(n-3)(n-4)

which if you check correctly describes all the numbers in the sequence of observations. If the 5th observation had instead been 9, our odd number theory would have been correct again, and we would have stayed with it. So depending on this 5th observation:

9 => f(n) = 2n-1

57 => f(n) = 2n – 1 + 2(n-1)(n-2)(n-3)(n-4)

Although we only list two items, the list is actually infinite because there are an infinite number of theories that correctly predict the observations up until the 4th result. In fact, and here is the key, there are an infinite number of theories that correctly predict any number of observations! But let us restrict the discussion to only the two seen above.
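Both formulas can be checked directly; a quick sketch:

```python
def odd(n):
    """The odd number theory: f(n) = 2n - 1."""
    return 2 * n - 1

def theory_57(n):
    """f(n) = 2n - 1 + 2(n-1)(n-2)(n-3)(n-4)."""
    return 2 * n - 1 + 2 * (n - 1) * (n - 2) * (n - 3) * (n - 4)

# Both theories reproduce the first four observations exactly...
print([odd(n) for n in range(1, 5)])        # [1, 3, 5, 7]
print([theory_57(n) for n in range(1, 5)])  # [1, 3, 5, 7]

# ...and only diverge at the fifth.
print(odd(5), theory_57(5))  # 9 57
```

The extra polynomial term vanishes for n = 1 through 4, which is precisely why no finite set of observations can separate the two theories until the divergence point arrives.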

What our intuition tells us is that no reasonable agent would believe in theory_57 after the fourth observation, even though it is just as compatible with the data as the odd number theory. Our intuition strongly asserts that the odd number theory is the correct theory for that data. But how can we justify that on the basis of induction, if they make the same predictions (i.e. they have the same P(E|H))?

The key is that our intuition, in fact our intelligence in general, has a built-in simplicity bias. We strongly favor the odd number theory because it is the simplest theory that fits the facts. Hence induction, including our everyday intuitions and the scientific method, is founded upon Occam’s razor as a way to discriminate between equally supported theories.

Without this or some more specific bias (in machine learning we would call this inductive bias), induction would be impossible, as there would be too many theories to entertain. Occam’s razor is the most generally applicable bias; it is a prerequisite for any kind of induction, in science or anywhere else.