Simplicity and meta-theoretic induction

During my discussion of induction and occams razor, I have said

The key is that our intuition, in fact our intelligence in general, has a built-in simplicity bias. We strongly favor the odd number theory because it is the simplest theory that fits the facts. Hence induction, including our everyday intuitions and the scientific method, is founded upon Occam’s razor as a way to discriminate between equally supported theories.

Note that I have described Occam as a simplicity bias. I deliberately chose this word in order to convey that simplicity is a guideline that we use prior and independent of experience. In the language of probability, Occam takes the form of a prior probability distribution that favors simpler theories before any updating has occured, and is unaffected by posterior evidence.

This state of affairs does not seem satisfactory; as described above, Occam seems like an unjustified and arbitrary principle, in effect, an unsupported bias. Surely, there should be some way to anchor this widely applicable principle on something other than arbitrary choice[1]. The obvious course would be to say something like this:

Simpler theories are more likely to be true because they have been so in the past

thus grounding the principle on experience, just like any other case of inductive inference and scientific knowledge. But let’s go back to the source of the problem that this principle is trying to fix. We have two theories, S and C, which make identical, correct predictions with respect to all observations made to date. These two theories only differ in their future predictions.

And yet, in practice, we consider the predictions made by S much more likely than those made by C. Because by definition these two theories share likelihoods for observed evidence, it is only through their priors that we can assign them different probabilities. Here’s where the simplicity principle comes in. We favour theory S because it is simple, hence granting it a greater prior probability and consequently a greater posterior despite its shared likelihood with C. When asking ourselves how we justify the simplcity principle, we answer

Because simple theories have been true in the past.

So the simplicity principle acts like a meta-theory and can accrue probability through experience just like any other theory. Until now, everything seems to work, but here’s the problem. Let’s say we have two guiding principles:

Occam: Simpler theories are more likely to be true

Occam(t): Simpler theories are more likely to be true until time t

Whatever the mechanism by which the simplicity meta-theory accumulates posterior probability, so shall its peculiar brother, and in the same exact amount. When going back to our two theories, S and C, Occam will favour S while Occam(t) will favour C. Because both Occam and Occam(t) are supported by the same amount of evidence, equal priors will be assigned to S and C. The only way out of this is for Occam and Occam(t) to have different priors themselves. But this leaves us back where we started!

So in conclusion, if we try to solve the problem with

Simpler theories are more likely to be true because they have been so in the past

we are just recasting the original problem at the meta level, we end up begging the question[2] or in an infinite regress. Which should not really come as a surprise, there is no way to justify knowledge absolutely, there must be unjustified assumptions somewhere. In my point of view, Occam is such a bedrock

The way I see it, the foundations of scientific knowledge are the postulates of probability theory (as derived for example by Bernardo-Smith or Cox) together with Occam’s razor.

Please see my next post where I formalize meta-theoretic induction


[1] In essence, this is the problem of induction, which dates back to Hume who first posed it in its original form.

[2] The future is like the past because in the past, the future was like the past

Epistemological dive

Note: I wrote this piece before the two posts presenting simple model of learning using bayesian inference. There is significant overlap, and  conclusions are stated without complete explanations.

I attended an informal talk on climate change recently, after which I had several discussion regarding the scientific process and the foundations of knowledge (in science).

One question was Is scientific knowledge is inductive or deductive? Well, the scientific method requires deductive inference to establish the logical consequences of a theory in order to make predictions. But the justification of theories, the method by which a theory is temporarily accepted or discarded is inductive. In the language of Bayes, theory confirmation/invalidation occurs by updating theory posteriors inductively (P(H|E)), whereas evidence conditioning on theories (P(E|H)) is derived deductively.

So, although deduction plays a part in establishing the logical consequences of theories in the form of testable predictions, the nature of the knowledge, or rather, the process by which that knowledge is gained, is fundamentally inductive.

What does this say about the foundations of knowledge in science? If scientific knowledge were deductive, we could simply say that its foundations are axiomatic. We could also talk about incompleteness and other interesting things. But if as we have stated this knowledge is inductive, what are its foundations? Is induction a valid procedure, and what are its endpoints?

This is a very deep subject, trying to go all the way to the bottom is what I have titled this post as an epistemological dive. I’m not going to give this a thorough treatment here, but I’ll briefly state what my position is and what I argued in discussion that day.

The way I see it, the foundations of scientific knowledge are the postulates of probability theory (as derived for example by Bernardo-Smith or Cox) together with Occam’s razor. In fact, given that most people are aware of probability theory I would say that the best single answer to the foundation of knowledge, in the sense that it is something we are less aware of, is Occam’s razor. I will give a brief example of this, borrowed from a talk by Shane Legg on machine super intelligence.

Let’s consider a minimal example of a scientific process. An agent is placed in an environment and must form theories whose predictions correctly match the agent’s observations. Although minimal, this description accounts for the fundamental elements of science. There is one missing element, and that is a specification of how the agent forms theories, but for now we will use our own intuition, as if we were the agent.

For this minimal example we will say that the agent observes a sequence of numbers which its environment produces. Thus, the agent’s observations are the sequence, and it must form a theory which correctly describes past observations and predicts future ones. Let’s imagine this is what happens a time goes forward, beginning with


For the moment there is only one data point, so it seems impossible to form a theory in a principled way.


Among others, two theories could be proposed here, odd numbers, and powers of 3, with corresponding predictions of 5 and 9:

f(n) = 2n – 1

f(n) = 3^n

the observations continue:


The powers of three theory is ruled out due to the incorrect prediction 9, while odd number theory was correct.


The odd number theory has described all observations and made correct predictions. At this point our agent would be pretty confident that the next observation will be 9.


What?! That really threw the agent off, it was very confident that the next item would be 9. But it turned out to be 57.  As the builder of this small universe I’ll let you know the correct theory, call it theory_57:

f(n) = 2n – 1 + 2(n-1)(n-2)(n-3)(n-4)

which if you check correctly describes all the numbers in the sequence of observations. If the 5th observation had instead been 9, our odd number theory would have been correct again, and we would have stayed with it. So depending on this 5th observation:

9 => f(n) = 2n-1

57 => f(n) = 2n – 1 + 2(n-1)(n-2)(n-3)(n-4)

Although we only list two items, the list is actually infinite because there are an infinite number of theories that correctly predict the observations up until the 4th result. In fact, and here is the key, there are an infinite number of theories that correctly predict any number of observations! But let us restrict the discussion to only the two seen above.

What our intuition tells us is that no reasonable agent would believe in theory_57 after the fourth observation, even though it is just as compatible with the odd number theory. Our intuition strongly asserts that the odd number theory is the correct theory for that data. But how can we justify that on the basis of induction, if they make the same predictions (ie they have the same P(E|H))?

The key is that our intuition, in fact our intelligence in general, has a built-in simplicity bias. We strongly favor the odd number theory because it is the simplest theory that fits the facts. Hence induction, including our everyday intuitions and the scientific method, is founded upon Occam’s razor as a way to discriminate between equally supported theories.

Without this or some more specific bias (in machine learning we would call this inductive bias), induction would be impossible, as there would be too many theories to entertain. Occam’s razor is the most generally applicable bias; it is a prerequisite for any kind of induction, in science or anywhere else.

Learning and the subject-object distinction

Previously I presented a model where learning is impossible. In this post I want to emphasize

..not only must the phenomenon be learnable, but additionally the learning agent must incorporate a bias to exploit existing regularity. Without such a bias, the learner cannot penalize complex “noisy” hypotheses that fit the data.

In the model, the environment (phenomenon) that gives rise to observations in the form of the binary sequence of ‘0’ and ‘1’ is left unspecified. Nothing is said about how the environment evolves, whether it is deterministic or stochastic, or whether it follows a certain rule or not. The model I presented is thus completely orthogonal to the nature of the environment. And yet

Whatever the sequence of events, the learning agent does not gain any knowledge about the future from the past, learning is impossible.

So a property of the learning agent, the subject, makes learning impossible irrespective of the environment, the object. This property of the subject is the belief that all sequences of observations are equally likely, that is, the lack of a-priori bias favoring any of the outcomes[1]. Even if the object was completely predictable, the subject would be unable to learn. Learning imposes constraints on both, hence the subject-object distinction;

To drive this point home, we could specify any environment and note how the conclusions regarding the model would not change. I hinted at this by presenting an example of the environment’s evolution that began with 111. Now assume the environment is such that it produces ‘1’ indefinitely in a completely deterministic and predictable way. You could interpret this as the classical example in philosophical treatments of induction: ‘1’ means that the sun rises the next day[2], and ‘0’ means that the sun does not rise. But again, this would make no difference, the learner would never catch on to this regularity.

Conversely, specifying an unlearnable environment will not do the learner any good either, of course. In fact, the astute reader will have realized that the learning agent’s prior corresponds exactly to the belief that the sequence of ‘0’ and ‘1’ are the results of a series of coin flips, where the coin is fair. And of course, given the assumption describing the coin, previous coin flips do not yield any information that serve to make predictions about future coin flips; the environment has no structure to be learned.

Most problems of bayesian inference include, in the problem statement, a description of the environment that is automatically used as the agent’s prior knowledge or at least a starting point, as in the typical case of drawing balls from an urn. This prior knowledge is incomplete of course, as no inference would be necessary otherwise. But in these cases the subject-object distinction is not so apparent; analysis about the agent’s learning performance assumes the problem definition is of course true!

However, the subject-object distinction is more important when asking what model of inference applies to the scientific investigation of nature and the problem of induction. This is because in these models, there is no problem definition, prior knowledge is genuinely prior to any experience.

Pending questions: What problem definition applies to inductive inference in science? What happens when extending our model to cases with infinite observations/theories? Does learning logically require bias, and if so, what bias is appropriate universally and intuitively acceptable?[3]

[1] In fact, not only must there be bias, but it must be a bias that exploits structure. Altering the distribution such that it favors more ‘1’ in the sequence irrespective of previous observations is a bias, but does not allow learning. This is another important distinction, the entropy-learnability distinction.

[2] Or after 24 hours if you want to be picky about tautologies

[3] Another more technical question is, can prior knowledge in inference problems always be recast as a bias over theories that are deterministic predictions over entire sequences of possible events? (as we saw when noting that the binary sequence model is equivalent to a repeated coin flip scenario) If so, what property of these distributions allows learning?

When learning is impossible

(image taken from

I’ve defined learning as the extraction of generally applicable knowledge from specific examples. In that post I remarked

An agent may have the ability to learn, but that is not enough to guarantee that learning does in fact take place [1]. The extra necessary ingredient is that the target of learning must be learnable.

Today I’m going to a present a model where learning is impossible in the context of bayesian inference. We will see in this case that not only must the phenomenon be learnable, but also that the learning agent must incorporate a bias to exploit existing regularity. Without such a bias, the learner cannot penalize complex “noisy” hypotheses that fit the data.

As components of the model we have an agent, an environment from which observations are made, and theories the agent probabilistically reasons about as the object of its learning. For observations we use a binary sequence, S = {0, 1}^n, like for example

S = 1010111010

The learning agent sees a number of elements and must try to predict subsequent ones according to different theories, which are of the form H = {0,1}^n. An important aspect of the model is that the agent will consider all possible theories that can explain and predict observations. The number of theories is therefore equal to the number of possible observations, which are both 2^n. If the agent considered a smaller number of theories it could be that the true theory describing the environment would be left out.

Furthermore, let’s say that a-priori, the agent has no reason to consider any theory more likely than the rest. So it will assign an a-priori equal probability to each theory:

P(H) = 1 / 2^n

Define the total observations up to a given time as Si, where i <= n, and that a given theory is Hk, where k <= n. We can apply bayes theorem to obtain the probability that a given theory is true given the observations (feel free to skip the math down to the conclusion):

P(Hk|Si) = P(Si|Hk)*P(Hk) / P(Si)

and the probability of a given sequence of observations P(Si) is obtained by summing[1] over all theories that yield such a prediction:

P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)

in other words, summing over all theories that begin with Si. To see exactly whats happening let’s restrict the example to n = 4. This gives us a total of 2^4 = 16 possible observations and theories. Say the agent has observed three elements ‘111’ and call the sequence S3:

S3 = 111

Let’s calculate the posterior probability on theories for this case. First for theories that do not predict 111:

P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)

but since P(Si|Hk) = 0, then

P(Hk|Si) = 0

ie theories that do not predict 111 are ruled out as should be the case. There are two theories that do predict 111:

H1 = {1110}

H2 = {1111}

the denominator of the posterior is

Sum(k) P(S3|Hk)*P(Hk)

there are two theories that predict the sequence, therefore

Sum(k) P(S3|Hk)*P(Hk) = P(H1) + P(H2)

plugin this in, the posterior is therefore

P(H1|S3) = P(S3|H1)*P(H1) / [P(H1) + P(H2)]

P(H2|S3) = P(S3|H2)*P(H2) / [P(H1) + P(H2)]

since both H1 and H2 predict S3 (P(S3|H) = 1), this reduces to

P(H1|S3) = P(H1) / [P(H1) + P(H2)]

P(H2|S3) = P(H2) / [P(H1) + P(H2)]

but because all theories are equally likely a priori

P(H1) = P(H2) = 1/16


P(H1|S3) = 1/16/ [1/16 + 1/16] = 1/2

and similarly

P(H2|S3) = 1/16/ [1/16 + 1/16] = 1/2

So H1 and H2 are assigned equal probabilities, 1/2. Because no other theories are possible and 1/2 + 1/2 = 1, it all works out. Now, the agent will use these two theories to predict the next observation:

P(1110|S3) = 1 * 1/2 + 0 * 1/2 = 1/2

P(1111|S3) = 0 * 1/2 + 1 * 1/2 = 1/2

Thus, the agent considers that it is equally likely for the next element to be 1 or 0.

But there is nothing special about the example we chose with n = 4 and S3 = 111. In fact, you could carry out the exact same calculations for any n, and S. Here’s the key point, the learning agent makes the exact same predictions as to what will happen no matter how many observations it has made, and no matter what those observations are. Whatever the sequence of events, it does not gain any knowledge about the future from the past, learning is impossible.

I’m going to leave the discussion for later posts, but here are some relevant questions that will come up:

Does learning logically require bias? Can one meaningfully speak of theories when there is no compression of observations? What happens when the model is extended to an infinite number of observations/theories? Is this an adequate (though simplistic) model of scientific investigation/knowledge?


[1] I’m using the notation Sum(n) as the equivalent of the Sigma sum over elements with subscript n

The most incomprehensible thing about the universe is that it is comprehensible

It’s a quote by Albert Einstein, which is where we left off last time. Comprehensible translates to, for example, mathematically intelligible, regular or lawful. These are different ways to say that it is possible to arrive at descriptions of the world that allow us to understand it and make predictions. Einstein’s point was that there is no particular reason to expect the universe to be the way it is, i.e. following elegant mathematical laws. It could have just as well been a chaotic mess impossible to make sense out of.

Regularity in the Sierpinski triangle

It’s hard to tell whether it’s even meaningful to speak of the way the universe could have been without speaking of how the universe and its characteristics arise. Indeed, one of the deepest questions in physics is, why does the universe have the laws it has? (Second only to why is there something rather than nothing?)

But imagine for the moment that the universe was in fact a messy chaos. Well, one thing seems clear, that kind of universe would not contain life, because life is one of the most obvious examples of order and regularity (or if you like, life requires order and regularity to exist), and intelligent life is precisely the kind of life that requires most order.

The point is that our very existence screens off the possibility of a non-regular universe, it is impossible for us to observe anything different because we would not have existed under those circumstances. This point is known as the anthropic principle. Does it answer the question? Not really; the anthropic principle has been labeled as unscientific and metaphysical by critics. You have to be careful to not take the point too far. In this case I’m just saying that life implies a selection effect to the universe it inhabits.

But again, that does not answer the question. However, if we additionally postulate that there isn’t one universe, but many, the situation makes some sense:

Alice: Why is the universe comprehensible?

Bob: The thing is, there isn’t just one, there are many, so it turns out that some of them are comprehensible, just like in a lottery someone must end up winning.

Alice: But what about the coincidence that we landed precisely on a comprehensible one?

Bob: That’s not a coincidence, our very existence implies that the universe we are in must be orderly. We couldn’t have landed in any other one.

Alice: So it’s a combination of those two things that answers the question, the anthropic principle is not enough..

Bob: Yes

Although in fact the question is still not answered because we had to postulate the existence of many universes, and we could in turn ask ourselves why that is the case. Oh well.