Causation and determinism

In this post I will try to disentangle the notions of determinism and causality, and suggest a different way to think about them. I came to think about these issues via the following informal assertion:

The decay of a radioactive atom has no cause

I will not be discussing hidden variable theories or Bell’s inequalities; I will assume outright that radioactive decay is intrinsically random (as opposed to the “apparent randomness” induced by ignorance), and that its quantum mechanical description, which assigns probabilities to outcomes, is the most complete model possible. With that out of the way, the usual argument that arrives at the above statement is:

1) Radioactive decay is intrinsically random and indeterminate, and cannot be predicted

2) There is no physical factor which determines whether a radioactive atom decays or not

3) Therefore, that a specific atom decays has no cause

Although the argument makes sense, I am hesitant to accept 3) as is, along with what it implies about how we think of causality.

Causality has been confusing minds for hundreds of years; it is a very difficult subject, as evidenced by the volumes that have been written on it. So there is not much point in trying to figure out exhaustively what causality means via conceptual analysis in the tradition of analytic philosophy. Instead we will quickly define causality using the mathematics of causal models, and see where that takes us for a specific scenario. In the context of these models, we will define causality according to two complementary questions:

A) what is meant by “the effect of a cause”

B) what is meant by “the cause of an effect”

Two types of causal models have been developed over the last thirty years: causal Bayesian networks and structural causal models. These two formalisms are largely equivalent[4]; both make use of graph representations. Vertices in these graphs correspond to the variables under study, whereas edges represent causal influences between the variables. Guided by these graphs, one follows precise procedures to obtain mathematical expressions for causal queries over variables. These expressions are cast in the language of probability.

From [Pearl2000] page 15

In this post I will refer to structural causal models as described in Pearl’s Causality: Models, Reasoning, and Inference [Pearl2000]. To begin, we have the definition[2]


In the above, D is a directed graph whose edges represent causal influences; these influences are quantitatively specified by functions (structural equations) on variables. Finally, probabilities are assigned to the variables not constrained by functions; these are the exogenous variables.

The effect of a cause

Given a structural causal model, question A) can be answered with the following result



The difference E(Y | do(x’)) – E(Y | do(x”)) is sometimes taken as the definition of “causal effect” (Rosenbaum and Rubin 1983)

The causal effect on y of changing the variable from x’ to x” is defined as the difference in the expectation of the value that y will take. Note how the formalism includes explicit notation for interventions, do(x).
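As an illustrative sketch (a toy linear mechanism of my own, not from [Pearl2000]): in a model where y = 2x + u with exogenous noise u, the causal effect of x’ = 1 versus x” = 0 comes out as the mechanism’s coefficient:

```python
import random

# A minimal structural causal model x -> y with exogenous noise u.
# The equation y = 2x + u is an assumed toy mechanism for illustration.
def y(x, u):
    return 2.0 * x + u

random.seed(1)
us = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Under an intervention do(x), x is set from outside the model, so
# E(Y | do(x)) is just the average of y(x, u) over the exogenous noise.
def expected_y_do(x):
    return sum(y(x, u) for u in us) / len(us)

causal_effect = expected_y_do(1.0) - expected_y_do(0.0)
print(causal_effect)   # ≈ 2.0, the coefficient of x in the mechanism
```

The same noise samples appear in both terms, so the difference isolates the effect of the intervention.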

The cause of an effect

Question B) looks at things from a different point of view. Instead of asking what the effects of some cause are, we ask what the cause of some effect is; it is a question of attribution. These questions naturally assume the form of counterfactuals (Wikipedia):


A counterfactual conditional is a conditional (or “if-then”) statement indicating what would be the case if its antecedent were true.

For example

If it were raining, then he would be inside.

Before you run off screaming “metaphysics!”, “non-falsifiability!” or other variants of hocus pocus, rest assured: counterfactuals have clear empirical content. In fact, what grants counterfactuals their empirical content is the same assumption that allows confirmation of theories via experiments: that physical laws are invariant. Counterfactuals make predictions in the same way that experiments validate hypotheses. If I say

“if you had dropped the glass it would have accelerated downwards at g”

I am also saying that

“If you now drop the glass, it will accelerate downwards at g”

given the assumption that all relevant factors remain equal (i.e., gravity has not suddenly disappeared).

The following result allows us to answer queries about counterfactuals[3]:


Once we have expressions for counterfactuals, we can answer questions of type B) with the following results. Note that these results are expressed in terms of counterfactuals, which is why theorem 7.1 is needed as a prerequisite.


This completes the brief listing of key results for the purposes of the discussion.


So what was the point of pasting in all these definitions without going into the details? The point is that, given the formalisms of these models and their associated assumptions[5], we can think quantitatively about questions A) and B) without going into the nightmare of trying to figure out from scratch what causality “really means”. Our original criteria have now assumed a quantitative form:

A) The difference in expectation on some value Y when changing some variable X

B1) The probability that some variable X is a necessary requirement for the value of some observed variable Y

B2) The probability that some variable X is a sufficient requirement for the value of some observed variable Y

Thankfully our example of a radioactive atom is very simple compared to the applications causal models were designed for; for our purposes we do not need to work hard to identify the structure or the probabilities involved, since these are given to us by the physics of nuclear decay.

Feynman diagram for Beta- decay (wikipedia)

Having said this, we construct a minimal model for negative beta (β-) decay with the following two variables:

r: The neutron-proton ratio, with values High, Normal, Low (using some arbitrary numerical threshold)

d: Whether β- decay occurs at some time t, with values True, False

Our questions, then, are

Q1) What is the causal effect of r=High on d?

Q2) What is the probability of necessity P(N) of r = High, relative to the observed effect d = True?

Q3) What is the probability of sufficiency P(S) of r = High, relative to the observed effect d = True?

In order to interpret the answers to the above questions we must first go into some more detail about causality and the models we have used.

General causes, singular causes, and probabilities

Research into causality has distinguished two categories of causal claims:

General (or type-level) causal claims:

Drunk driving causes accidents.

Singular (or token-level) causal claims:

The light turning on was caused by me flipping the switch.

General claims describe overall patterns in events; singular claims describe specific events. This distinction brings us to another consideration. The language in which causal models yield expressions is that of probability. We have seen probabilities assigned to the value of some effect, as well as probabilities assigned to the statement that a cause is sufficient, or necessary. But how do these probabilities arise?

Functional causal models are deterministic; the structural equations that describe causal mechanisms (graph arrows) yield unique values for variables as a function of their influences. On the other hand, the exogenous variables, those that are not specified within the model but rather are inputs to it, have an associated uncertainty. Thus probabilities arise from our lack of knowledge about the exact values of these external conditions. This epistemic uncertainty spreads from the exogenous variables throughout the rest of the model.

[Pearl2000] handles the general/singular dichotomy elegantly: there is no crisp border; rather, there is a continuous spectrum as models range from general to specific, corresponding to how much uncertainty exists in the associated variables. A general causal claim is one where exogenous variables have wide probability distributions; as information is added these probabilities tighten and the claim becomes singular. In the limit there is no uncertainty and the model is deterministic.

We can now go back to question Q1), whose answer can be interpreted without much difficulty.

Q1) What is the causal effect of r=High (high nuclear ratio) on d (decay)?

If physics is correct, having a certain value for r will increase the expectation of d being equal to True, relative to some other value for r. This becomes a general causal claim:

A1) High nuclear ratio causes Beta- decay

So, relative to our model, high nuclear ratio is a cause of Beta- decay. Note that we can say this despite the fact that decay is intrinsically indeterministic. Even though the probabilities are of a fundamentally different nature, the empirical content is indistinguishable from that of any other general claim with epistemic uncertainty. Hence, in this particular case, determinism is not required to speak of causation.

The more controversial matter is attribution of cause for a singular indeterministic phenomenon, which is where we began.

3) The decay of a specific atom has no cause

This is addressed by questions Q2) and Q3).

Q2) What is the probability of necessity P(N) of r = High, relative to the observed effect d = True?

Q3) What is the probability of sufficiency P(S) of r = High, relative to the observed effect d = True?

Recall that functional causal models assign probabilities arising from uncertainty in exogenous variables; this is what we see in definitions 9.2.1 and 9.2.2. The phrase “probability of sufficiency/necessity” conveys that sufficiency/necessity is a determinate property of the phenomenon; it’s just that we don’t have enough information to identify it. Therefore, in the singular limit these properties can be expressed as logical predicates:

Sufficiency(C, E): Cause => Effect

Necessity(C, E): Effect => Cause

In the case of the decay of a specific atom at some time, the causal claims become completely singular, and definitions 9.2.1 and 9.2.2 reduce to evaluations of whether the above predicates hold. If we assume that atoms with low nuclear ratio do not undergo Beta- decay, our answers are:

A2) High nuclear ratio is a necessary cause of Beta- decay

A3) High nuclear ratio is not a sufficient cause of Beta- decay
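To make A2) and A3) concrete, here is a small Monte Carlo sketch in the functional-model reading, where an exogenous variable u stands in for the quantum indeterminacy. The decay probability is an arbitrary assumption, not a physical value:

```python
import random

P_DECAY_HIGH = 0.3   # assumed probability of decay before time t when r = High
P_DECAY_LOW  = 0.0   # assumption from the text: low-ratio atoms never Beta- decay

def decays(r_high, u):
    """Structural equation d = f(r, u); u is the exogenous 'quantum' variable."""
    return u < (P_DECAY_HIGH if r_high else P_DECAY_LOW)

random.seed(0)
us = [random.random() for _ in range(100_000)]

# Probability of necessity: among atoms with r = High that decayed, how many
# would NOT have decayed under the counterfactual do(r = Low), same u?
decayed = [u for u in us if decays(True, u)]
pn = sum(not decays(False, u) for u in decayed) / len(decayed)

# Probability of sufficiency: among atoms with r = Low that did not decay, how
# many WOULD have decayed under do(r = High)?
intact = [u for u in us if not decays(False, u)]
ps = sum(decays(True, u) for u in intact) / len(intact)

print(pn, ps)   # pn = 1.0 (necessary), ps ≈ 0.3 (not sufficient)
```

With these assumptions PN is 1, matching A2), while PS stays well below 1, matching A3).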

Thus the truth of the statement that the decay of a radioactive atom has no cause depends on whether you are interested in sufficiency or necessity. In particular, that the atom would not have decayed were it not for its high nuclear ratio suggests this ratio was a cause of its decay.

But let’s make things more complicated: say there is a small probability that atoms with low nuclear ratios show Beta- decay. We would then have to say that (remember, relative to our model) the decay of a specific atom at some time has no cause, because neither the criterion of sufficiency nor that of necessity is met.

The essence of causality, determinism?

We can continue to stretch the concept. Imagine that a specific nuclear ratio for a specific atom implied a 99.99% probability of decay at some time t, and also that said probability of decay for any other nuclear ratio were 0.001%. Would we still be comfortable saying that the decay of such an atom had no cause?

Singular indeterministic events are peculiar things. They behave according to probabilities, like those of general causation, but are fully specified, like instances of singular deterministic causation. Can we not just apply the methods and vocabulary of general causation to singular indeterministic events?

In fact, we can. We can modify functional causal models so that the underlying structural equations are stochastic, as mentioned in [Pearl2000] section 7.2.2. Another method, found in [Steel2005], is to add un-physical exogenous variables that account for the outcomes of indeterministic events. Either can be swapped into regular functional models. This should yield analogues of definitions 9.2.1 and 9.2.2 in which the probabilities of sufficiency and necessity are replaced with degrees, giving corresponding versions of A2) and A3).

In this approach, singular causation is not an all-or-nothing property; it is graded. Just as general causal claims are expressed with epistemic probabilities, singular causal claims are expressed in terms of ontological probabilities. In this picture, saying that a particular radioactive decay had no cause would be wrong. Instead, perhaps we could say that a specific decay was “partially” or “mostly” caused by some property of that atom, rather than that there was no cause.
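Under such a stochastic variant, probability of necessity becomes a degree. A Monte Carlo sketch with the thought experiment’s numbers (the low-ratio decay probability is an assumption, and sharing u between the factual and counterfactual runs is itself a modeling choice in the spirit of [Steel2005]):

```python
import random

P_HIGH = 0.9999    # from the thought experiment above
P_LOW  = 0.00001   # assumed: low ratios now decay with tiny probability

def decays(r_high, u):
    # u is the exogenous variable standing in for the indeterministic outcome
    return u < (P_HIGH if r_high else P_LOW)

random.seed(2)
us = [random.random() for _ in range(200_000)]

# Degree of necessity: among decays under r = High, the fraction that would
# not have occurred under the counterfactual do(r = Low).
decayed = [u for u in us if decays(True, u)]
degree_n = sum(not decays(False, u) for u in decayed) / len(decayed)

print(degree_n)   # ≈ 0.99999: "mostly", though not strictly, necessary
```

The result is a degree close to, but not exactly, 1: the high ratio “mostly” caused the decay.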

I believe this conception of causality is more informative. Throwing out causation just because probabilities are not 100% is excessive and misleading; it ignores regularities and discards information that has predictive content. The essence of causation, I believe, is not determinism but counterfactual prediction, which banks on regularity, not certainty. It seems reasonable to extend the language we use for general causes to singular ones, as their implications have the same empirical form: both make probabilistic predictions, and both can be tested.

No cause

What would it mean to say that some event has no cause, according to this interpretation? It would mean that the event is entirely unaffected by, and independent of, any part of the universe’s state; no changes made anywhere would alter the probabilities we assign to its occurrence. Such an event would be effectively “disconnected” or “transparent”.

Pollock – Lavender Mist Number 1

We could even imagine a completely causeless universe, where all events would be of this kind. It is not easy to see what such a strange place would look like. The most obvious possibility would be a chaotic universe, with no regularities. If we described such a universe as an n-dimensional (e.g. 3 + 1) collection of random variables, a causeless universe would exhibit zero interaction information and zero intelligibility, as if every variable resulted from an independent coin flip. But it is not clear to me whether this scenario necessarily follows from the assumption of a causeless universe.








[5] See, e.g., the causal Markov condition, minimality, and stability


The essential ingredient of causation, as argued in Pearl (2009:361) is responsiveness, namely, the capacity of some variables to respond to variations in other variables, regardless of how those variations came about.


Occam’s razor in a cellular physics universe

A cellular automaton (CA) is an algorithm acting on cells in a grid at discrete time steps. The cells can typically be in one of two states, on or off. At each step, the CA computes the new state of each cell as a function of the states of its neighbors. Here is a simple example of how the new cells are calculated from the old ones:

In this example, the new cell is shown below, and the input cells (its neighbors) are the three above it. The image at the top of this post shows the evolution of a CA by displaying the new cells at each row. In other words, time flows vertically downwards.

CAs were discovered in the 1940s by Stanislaw Ulam and John von Neumann, who were working together at Los Alamos National Laboratory. Perhaps the most famous automaton is the Game of Life, invented by John Conway in 1970.

In this post we will consider a model of a universe based on cellular automata and see what it says about Occam’s razor and the problem of induction. The idea that the universe is describable by a cellular automaton is not new:

many scholars have raised the question of whether the universe is a cellular automaton. Consider the evolution of rule 110: if it were some kind of “alien physics”, what would be a reasonable description of the observed patterns?

If you didn’t know how the images were generated, you might end up conjecturing about the movement of some particle-like objects (indeed, the physicist James Crutchfield turned this idea into a rigorous mathematical theory, proving the statistical emergence of “particles” from CAs). Then, as the argument goes, one might wonder whether our world, which is currently well described by physics with particle-like objects, could be a CA at its most fundamental level.

This idea is a specific variant of a more general perspective known as digital physics

In physics and cosmology, digital physics is a collection of theoretical perspectives based on the premise that the universe is, at heart, describable by information, and is therefore computable.

Note that we are not claiming digital physics here, but rather constructing a model based on some initial postulates and seeing where it leads us.

Given this background we can consider the problem of induction in a CA universe. The properties of this model are:

1) The universe consists of an n-dimensional infinite grid of cells

2) The time evolution of cells is governed by a cellular automaton

Let’s add our scientist. An agent in this universe makes observations and must formulate hypotheses as to which natural laws describe reality. If we accept a Bayesian model, the problem of induction is how to construct a prior over possible theories such that inference is possible. But what form do theories take in this model?

From CA’s to Boolean functions

Although not immediately obvious, typical (2-state) CAs are equivalent to boolean functions. This is something I noticed when I came across the equation that describes the number of CAs as a function of states and neighbors:

The general equation for such a system (CA) of rules is k^(k^s), where k is the number of possible states for a cell, and s is the number of neighboring cells

This has the same shape as the expression 2^(2^k), which is the number of boolean functions of arity k. The connection is simple: a CA with 2-state cells that takes k cells as inputs to produce a new output cell (again 2-state) is equivalent to a function

f : B^k → B, where B = {0, 1}

which is precisely the definition of a k-ary boolean function. In this CA -> boolean function correspondence, the arity is given by the CA’s dimensionality and neighborhood. Below is a one-dimensional CA in which each cell’s new value is a function of its two adjacent neighbors plus its own value (arity 3).

Rule 179

This CA is known as rule 179 because that number encodes the binary specification of the boolean function. You can see this by looking at its truth table (I’m using bexpred):


Truth table specification for Rule 179

The table shows the output of the 3-ary function with inputs A, B, C. If you read the output bits from the bottom up you get 10110011, which in decimal is 179.

Boolean functions, expressions and trees

Besides the equivalence with CAs, boolean functions are in general described by boolean algebra and are specified with boolean expressions or formulas. In this algebra variables take on the values true (T) and false (F), and the operators are disjunction (v), conjunction (^) and negation (~). For example, Rule 179 above can be formulated as

(A^C) v ~B

where A is the left neighbor cell, B is the center, and C is the right neighbor; you can check that this in fact corresponds to the CA by applying the formula to cells: doing this repeatedly would produce the pattern in the image above.
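We can verify this correspondence in a few lines. The sketch below uses the standard Wolfram rule-numbering convention, in which the neighborhood (A, B, C) selects bit 4A+2B+C of the rule number:

```python
RULE = 179                      # 0b10110011

def rule_output(rule, a, b, c):
    """Wolfram convention: neighborhood (a, b, c) indexes bit 4a+2b+c of the rule number."""
    return (rule >> (4 * a + 2 * b + c)) & 1

# Check the equivalence with the boolean formula (A ^ C) v ~B
# over all 8 possible neighborhoods:
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert rule_output(RULE, a, b, c) == ((a and c) or (not b))

# There are 2^(2^3) = 256 such 3-ary boolean functions, i.e. 256 elementary CA rules.
print(2 ** 2 ** 3)   # 256
```

The loop passing for all eight neighborhoods is exactly the truth-table check described above.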

The nature of boolean expressions is such that you can represent them as trees. For example (from D. Gardy[1]), the expression

x ^ (y v z v ~y v y) ^ (~y v t v (x ^ ~v) v u)

can be represented as


Image taken from [1]

This representation is very similar to that of boolean circuits, in which boolean expressions are represented as directed acyclic graphs. This representation allows classifying boolean circuits in terms of their computational complexity:

In theoretical computer science, circuit complexity is a branch of computational complexity theory in which Boolean functions are classified according to the size or depth of Boolean circuits that compute them.

There are two measures of complexity: depth and circuit-size complexity. In this post we will use a boolean-expression analog of circuit-size complexity, which measures the computational complexity of a boolean function by the number of nodes of the minimal circuit that computes it:

L(f) = length of shortest formula (boolean expression) computing f

With this last piece we can revisit our model and add some further detail:

1) The universe consists of an n-dimensional infinite grid of cells

2) The time evolution of cells is governed by some 2-state cellular automaton describable by a boolean tree of complexity L(f)

We can also answer the question posed earlier:

What form do theories have in this model?

The theories our scientist constructs take the form of boolean expressions, or equivalently boolean trees. As stated before, the problem of induction in a Bayesian setting is about constructing priors over theories. In our model this now translates into constructing a prior over boolean expressions.

Finally, we will postulate two desirable properties such a prior must have, following the spirit of work on algorithmic probability[3][4]. One is Epicurus’ Principle of Multiple Explanations:

Epicurus: if several theories are consistent with the observed data, retain them all

The other is the Principle of Insufficient Reason

when we have no other information than that exactly N mutually exclusive events can occur, we are justified in assigning each the probability 1/N.

These last three epistemological characteristics complete our model:

3) Theories take the form of boolean expressions with tree complexity L(f)

4) A-priori all theories are consistent with evidence (Epicurus)

5) A-priori all theories are equally likely

A uniform prior on boolean expressions

Per the characteristics of our model we wish to construct a prior probability distribution over boolean expressions such that

a) The distribution’s support comprises all boolean expressions for some n-dimensional 2-state CA

b) All boolean expressions are assigned equal probability

In order to achieve this we turn to results by Lefmann and Savicky[2] and others on a specific tree representation of boolean formulas, And/Or trees:

We consider such formulas to be rooted binary trees... each of the inner nodes... is labeled by AND or OR. Each leaf is labelled by a literal, i.e. a variable or its negation

Note that these properties of And/Or trees do not reduce their expressiveness: any boolean expression can be formulated as an And/Or tree.

We wish to construct a uniform (b) probability distribution over all (a) And/Or trees, of which there are infinitely many. Lefmann and Savicky (see also Woods[6]) proved that such a probability distribution exists as an asymptotic limit of uniform finite distributions:


Finally we will use two results (later improved in Chauvin[7]) which relate the probability P(f) to the boolean expression complexity L(f) in the form of probability bounds:




establishing upper and lower bounds. Note the L(f) term in both cases.


Let’s recap. We defined a toy universe governed by a variant of CA physics, then showed the equivalence between these CAs, boolean functions, expressions, and finally trees. After adding two epistemological principles, we recast the problem of induction in this model as constructing a uniform prior over boolean expressions (theories). Further restrictions (the And/Or tree representation of theories) allowed us to use existing results to establish the existence of, and then provide upper and lower bounds on, our uniform prior.

The key characteristic of these bounds is the term for the boolean function’s complexity. In theorem 3.1, L(f) appears as a positive exponent on a number < 1. In theorem 3.5, L(f) appears as a negative exponent on a number > 1. This means that both bounds are monotonically decreasing with increasing expression complexity. This is essentially Occam’s razor.

Thus we have shown[8] that Occam’s razor emerges automatically from the properties of our model; we get the razor “for free”, without having to add it as a separate assumption. Our scientist would therefore be justified in assigning higher probabilities to simpler hypotheses.
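The mechanism can be seen concretely with a small brute-force enumeration (my own toy check, not the asymptotic theorems): a uniform distribution over all And/Or trees with at most 4 leaves on 3 variables already assigns higher probability to simpler functions.

```python
from itertools import product
from collections import Counter

N = 3                                    # variables of a 1-D, radius-1, 2-state CA
ASSIGN = list(product((0, 1), repeat=N))

# Truth tables of the 2N literals x_i and ~x_i, as tuples over all assignments.
LITERALS = [tuple(a[i] for a in ASSIGN) for i in range(N)] + \
           [tuple(1 - a[i] for a in ASSIGN) for i in range(N)]

def tables(m):
    """Yield the truth table of every And/Or tree with exactly m leaves."""
    if m == 1:
        yield from LITERALS
        return
    for k in range(1, m):                # split the m leaves between the subtrees
        for left in tables(k):
            for right in tables(m - k):
                yield tuple(x & y for x, y in zip(left, right))   # AND root
                yield tuple(x | y for x, y in zip(left, right))   # OR root

# Uniform distribution over all trees with <= 4 leaves: count trees per function.
counts = Counter()
for m in range(1, 5):
    counts.update(tables(m))

x0    = LITERALS[0]                               # a very simple function
conj3 = tuple(a[0] & a[1] & a[2] for a in ASSIGN) # needs at least 3 leaves
print(counts[x0] > counts[conj3])   # True: the simpler function is more probable
```

The syntactic multiplicity does the work: many distinct trees reduce to x0, while far fewer compute the 3-way conjunction.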

As an example, we can see concrete values, not just bounds, for the prior distribution in Chauvin[7], for the specific case of n = 3 (this corresponds to a 2-state, 1-dimensional CA).


Sample P(f) for n = 3 (taken from [7])

The column of interest is labelled P(f). We can see how probabilities decrease with increasing boolean expression complexity. Refer to section 2.4 of that paper to see the corresponding increasing values of L(f).


Although we have reviewed the basic steps outlining how Occam’s razor follows from our simple model’s properties, we have not discussed the details of how and why this happens. In a future post we’ll discuss these details, and the possibility that the mechanism at work may (or may not) generalize to other formalizations of universe-theory-prior.


[1] D. Gardy. Random Boolean expressions. In Colloquium on Computational Logic and Applications, volume AF, pages 1–36. DMTCS Proceedings, 2006.

[2] H. Lefmann and P. Savicky. Some typical properties of large And/Or Boolean formulas. Random Structures and Algorithms, 10:337–351, 1997.

[3] Principles of Solomonoff Induction and AIXI 

[4] A Philosophical Treatise of Universal Induction


[6] A. Woods. Coloring rules for finite trees, and probabilities of monadic second order sentences. Random Structures and Algorithms, 10:453–485, 1997.

[7] B. Chauvin, P. Flajolet, D. Gardy, and B. Gittenberger. And/Or trees revisited. Combinatorics, Probability and Computing, 13(4-5):475–497, July–September 2004.

[8] We are leaving out some technical details here. One is that monotonically decreasing bounds do not imply a monotonically decreasing probability. There may be local violations of Occam’s razor, but the razor holds apart from minor fluctuations. In the sample results for n = 3 in Chauvin[7], probabilities are in fact monotonically decreasing.

Two, the asymptotics of P(f) for fixed m and of P(f) for trees <= m are the same; see [7] 2.1 and [1] 3.3.3.

Another detail is the assumption that, ceteris paribus, the minimal expression computing f1, corresponding to expression e1, will be shorter than the minimal expression computing f2, corresponding to expression e2, if e1 < e2. That is, e1 < e2 implies on average L(f1) < L(f2).

Finally, it is worth noting that it is the syntactic prior over boolean expressions that induces an occamian prior over boolean functions. What makes this work is that formula reductions[9] produce multiplicities in the syntactic space for any given element of the semantic space. A uniform prior over boolean functions would not yield Occam; it would have to be added separately (i.e., the problem of induction).

[9] Boolean expressions may be reduced (simplified) using the laws of boolean algebra. Here is an example boolean reduction:

The image above shows a reduction of the 3-ary boolean expression


which yields

A*C + !B

This is in fact the boolean function corresponding to Rule 179.

Distance to the truth

Here’s my brief proposal in response to the debate found here regarding verisimilitude.

Theories are probability distributions over observations, convertible to probability distributions over possible worlds. This way of describing theories is richer than a mere compatibility relation between theories and worlds: theories are not just compatible or incompatible with possible worlds; they assign probabilities to them.

The notion of truth, in the sense of the complete truth, can be described by a probability distribution as well. In a scenario with no indeterminacy, the true theory, call it T, is a degenerate probability distribution: it assigns probability 1 to the actual world, and zero elsewhere. In a scenario with indeterminacy, the true theory assigns probabilities to possible worlds; this is similar to the indeterminacy present in some interpretations of quantum mechanics.

Once we have established that theories, including the true theory T, are probability distributions, distance to the truth is a matter of choosing (somewhat arbitrarily) a metric on probability distributions. We can choose, for example, the Jensen–Shannon divergence, because it has the nice property of always being finite:

JSD(P, Q) = (1/2) D(P ‖ M) + (1/2) D(Q ‖ M), where M = (1/2)(P + Q)

and D is the Kullback–Leibler divergence. So the distance to the truth for a theory H is

D(H) = JSD(T, H)

where T, the true theory, is given. Of course, it is more interesting to consider what we think the distance to the truth is, since we don’t have magical access to what the truth really is. Our estimate of the truth can be obtained via Bayesian inference from experimental evidence. So we could define what we think is the truth as

T = argmax(H) P(H|E)

where the H are theories and E is the observed evidence. Then we estimate the distance to the truth as the distance to the theory we think is most likely (given by the argmax):

D(H, E) = JSD(T, H)


T = argmax(H) P(H|E)

But there is another possibility. In the above formula, we are not using all our experimental evidence. Some of the information is thrown away by taking only the most likely theory and ignoring the rest of the probability distribution over theories that the evidence establishes. Remember that what we want is to compare probability distributions over worlds. In order to integrate all the information that the evidence provides, we can compare our theory against the predictive distribution over worlds that all theories contribute to, not just the most likely one. We define the predictive distribution over worlds as

Pd(W) = ∑ P(W|H)P(H|E)

where the sum ∑ is over theories H. Finally, our new estimate of distance to the truth becomes

D(H, E) = JSD(Pd, H)


Pd = ∑ P(W|H)P(H|E)
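These formulas can be sketched numerically. The distributions below are made-up toy numbers over three worlds; JSD is implemented directly from its definition:

```python
from math import log2

# Toy worlds and theories (assumed numbers for illustration).
T  = [1.0, 0.0, 0.0]    # the true theory: world 0 is the actual world
H1 = [0.8, 0.1, 0.1]
H2 = [0.2, 0.4, 0.4]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q), in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and always finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Distance to the truth: H1 concentrates on the actual world, so it is closer.
assert jsd(T, H1) < jsd(T, H2)

# Predictive-distribution variant: mix worlds over an assumed posterior P(H|E).
posterior = [0.7, 0.3]
Pd = [posterior[0] * w1 + posterior[1] * w2 for w1, w2 in zip(H1, H2)]
print(jsd(Pd, H1), jsd(Pd, H2))
```

Note that JSD stays finite even where T assigns probability zero, which is what makes it a convenient choice here.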

Meta-theoretic induction in action

In my last post I presented a simple model of meta-theoretic induction. Let’s instantiate it with concrete data and run through it. Say we have

E1 E2 E3 Observations made for different domains 1-3
S1 S2 S3 Simple theories for domains 1-3
C1 C2 C3 Complex theories for domains 1-3
S Meta-theory favoring simple theories
C Meta-theory favoring complex theories

That is, we have three domains of observation with corresponding theories. We also have two meta-theories that will produce priors on theories. The meta-theories themselves are supported by the theories’ successes or failures: successes of simple theories support S, successes of complex theories support C. Now define the content of the theories through their likelihoods:

En P(En|Sn) P(En|Cn)
E1 3/4 1/4
E2 3/4 1/4
E3 3/4 3/4

Given evidence E1, E2 and E3, this presents a scenario where theories S1 and S2 were successful, whereas theories C1 and C2 were not. S3 and C3 represent theories that are equally well supported by previous evidence (E3) but make different future predictions. This is the crux of the example, where the simplicity bias enters the picture. Our meta-theories are defined by

P(Sn|S) = 3/4, P(Sn|C) = 1/4

P(Cn|C) = 3/4, P(Cn|S) = 1/4

Meta-theory S favors simple theories, whereas meta-theory C favors complex theories. Finally, our priors are neutral

P(Sn) = P(Cn) = 1/2

P(S) = P(C) = 1/2

We want to process evidence E1 and E2, and see what happens at the critical point, where S3 and C3 make the same predictions. The sequence is as follows:

  1. Update meta theories S and C with E1 and E2
  2. Produce a prior on S3 and C3 with the updated C and S
  3. Update S3 and C3 with E3

The last step produces probabilities for S3 and C3; these theories make identical predictions but will have different priors granted by S and C. This will formalize the statement

Simpler theories are more likely to be true because they have been so in the past
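The three update steps above can also be carried out exactly by hand; here is a minimal sketch with fractions (variable names are mine, the numbers are the ones defined above):

```python
from fractions import Fraction as F

half, q34, q14 = F(1, 2), F(3, 4), F(1, 4)

# Likelihoods P(En | Sn), P(En | Cn) for the three domains, as defined above.
like = {1: (q34, q14), 2: (q34, q14), 3: (q34, q34)}

# P(Hn = simple | M): meta-theory S favors simple theories, C favors complex.
p_simple = {'S': q34, 'C': q14}

# P(En | M) = sum over Hn of P(En | Hn) P(Hn | M)
def p_e_given_m(n, m):
    return like[n][0] * p_simple[m] + like[n][1] * (1 - p_simple[m])

# Step 1: update the meta-theories on E1 and E2.
joint = {m: half * p_e_given_m(1, m) * p_e_given_m(2, m) for m in 'SC'}
z = sum(joint.values())
post_m = {m: v / z for m, v in joint.items()}

# Step 2: prior on S3 induced by the updated meta-theories.
prior_s3 = sum(post_m[m] * p_simple[m] for m in 'SC')

# Step 3: update on E3; a no-op here, since P(E3 | S3) = P(E3 | C3).
num = prior_s3 * like[3][0]
post_s3 = num / (num + (1 - prior_s3) * like[3][1])

print(post_m['S'], post_s3)   # 25/34 and 21/34
```

The exact values 25/34 (about 74%) and 21/34 (about 62%) are what the network tool below reports as rounded percentages.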

The model as a bayesian network

Instead of doing all of the above by hand (using equations 3, 4, 5, 6), it is easier to construct the corresponding Bayesian network and let a general algorithm do the work. Formulating the model this way makes it much easier to understand; in fact, it seems almost trivial. Additionally, our assumptions of conditional independence (1 and 2) map directly onto the Bayesian network formalism of nodes and edges, which is quite convenient.


Node M represents the meta-theory, with possible values S and C, the H nodes represent theories, with possible values Sn and Cn. Note the lack of edges between Hn and Ex formalizing (1), and the lack of edges between M and En formalizing (2) (these were our assumptions of conditional independence).

I constructed this network using the SamIam tool developed at UCLA. With this tool we can construct the network and then monitor probabilities as we input data into the model, using the tool’s Query Mode. So let’s do that, fixing the actual outcome of the evidence nodes E1, E2 and E3
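For readers without the tool, the same posteriors can be recovered by brute-force enumeration of the network’s joint distribution; a sketch in Python (the boolean encoding of node values is my own):

```python
from fractions import Fraction as F
from itertools import product

half, q3, q1 = F(1, 2), F(3, 4), F(1, 4)

# CPTs read straight off the definitions above; booleans mean
# "the simple option" (M = S, Hn = Sn) or "En came out true".
def p_H_given_M(h, m):            # P(Hn = Sn | M) = 3/4 if M = S, else 1/4
    return (q3 if m else q1) if h else (q1 if m else q3)

def p_E_given_H(n, e, h):         # P(En | Hn); E3 is predicted at 3/4 by both
    p_true = q3 if (h or n == 3) else q1
    return p_true if e else 1 - p_true

# The joint P(M, H1..H3, E1..E3) factorizes along the network's edges.
posterior = {True: F(0), False: F(0)}   # unnormalized P(H3, evidence)
pM_S = F(0)
for m, h1, h2, h3 in product([True, False], repeat=4):
    w = half * p_H_given_M(h1, m) * p_H_given_M(h2, m) * p_H_given_M(h3, m)
    w *= p_E_given_H(1, True, h1) * p_E_given_H(2, True, h2) * p_E_given_H(3, True, h3)
    posterior[h3] += w
    if m:
        pM_S += w

Z = posterior[True] + posterior[False]
print(pM_S / Z)             # 25/34, ~73% for meta-theory S
print(posterior[True] / Z)  # 21/34, ~61% for S3
```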

Theories S1 and S2 make correct predictions and are thus favoured by the data over C1 and C2. This in turn favours the meta-theory S, which is assigned a probability of 73% over meta-theory C, with 26%. Now, theories S3 and C3 make the same predictions about E3, but because of our meta-theory being better supported, they are assigned different probabilities. Again, recall our starting point

Simpler theories are more likely to be true because they have been so in the past

We can finally state this technically, as seen here

The simple theory S3 is favoured at 61% over C3 with 38%, even though they make the same predictions. In fact, we can see how this works if we look at what happens with and without meta-theoretic induction

where, as expected, the counterparts of S3 and C3 without meta-theoretic induction are granted the same probabilities. So everything seems to work: our meta-theory discriminates between different theories and is itself justified via experience, as was the objective.

Recall the original problem: Occam seems like an unjustified and arbitrary principle, in effect an unsupported bias. Surely there should be some way to anchor this widely applicable principle on something other than arbitrary choice; we needed a way to represent a meta-theory such that it favours some theories over others and such that it can be justified through observations.

But what happens when we add a meta-theory like Occam(t) into the picture? What happens when we apply, at the meta level, the same argument that prompted the meta-theoretic justification of simplicity we’ve developed? We define a meta-theory S-until-T with

P(S1|S-until-T) = P(S2|S-until-T) = 3/4

P(S3|S-until-T) = 1/4

which yields this network

Now both S and S-until-T accrue the same probability through evidence and therefore produce the same prior on S3 and C3, 50%. It seems we can’t escape our original problem.
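A quick computation confirms this, under the same illustrative numbers as before (the encoding of S-until-T is my own):

```python
from fractions import Fraction as F

pM = {"S": F(1, 2), "S-until-T": F(1, 2)}   # neutral priors on the meta-theories

# P(Sn | M): S always favors the simple theory; S-until-T flips at n = 3.
def p_Sn(m, n):
    if m == "S" or n < 3:
        return F(3, 4)
    return F(1, 4)

lik = {"S": F(3, 4), "C": F(1, 4)}  # P(En | Sn) and P(En | Cn) for n = 1, 2

for n in (1, 2):  # both meta-theories back S1 and S2 equally...
    like = {m: lik["S"] * p_Sn(m, n) + lik["C"] * (1 - p_Sn(m, n)) for m in pM}
    Z = sum(like[m] * pM[m] for m in pM)
    pM = {m: like[m] * pM[m] / Z for m in pM}

print(pM)  # ...so each remains at 1/2

# ...and the prior they jointly produce on S3 is exactly 1/2.
prior_S3 = sum(p_Sn(m, 3) * pM[m] for m in pM)
print(prior_S3)  # 1/2
```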

Because both Occam and Occam(t) are supported by the same amount of evidence, equal priors will be assigned to S3 and C3. The only way out of this is for Occam and Occam(t) to have different priors themselves. But this leaves us back where we started!

We are just recasting the original problem at the meta level; we end up begging the question[1] or in an infinite regress.

In conclusion, we have succeeded in formalizing meta-theoretic induction in a bayesian setting, and have verified that it works as intended. However, it ultimately does not solve the problem of justifying simplicity. The simplicity principle remains a prior belief independent of experience.

(The two networks used in this post are available as SamIam files; you need the SamIam tool to open them)

[1] Simplicity is justified if we previously assume simplicity

Formalizing meta-theoretic induction

In this post I formalize the discussion presented here; recall

Simpler theories are more likely to be true because they have been so in the past

We want to formalize this statement into something that integrates into a bayesian scheme, such that the usual inference process, updating probabilities with evidence, works. The first element we want to introduce into our model is the notion of a meta-theory. A meta-theory is a statement about theories, just as a theory is a statement about observations (or the world if you prefer a realist language).

As a first approximation, we could formalize meta-theories as priors over theories. In this way, a meta-theory prior, together with observations, would yield probabilities for theories through the usual updating process. This formalization is technically trivial: we just relabel priors over theories as meta-theories. But this approach does not account for the second half of the original statement

…because they have been so in the past.

As pure priors, meta-theories would never be the object of justification. We need a way to represent a meta-theory such that it favours some theories over others and such that it can be justified through observations. In order to integrate with normal theories, meta-theories must accumulate probability via conditioning on observations, just as normal theories do.

We cannot depend on spurious observations like “this theory was right” as a naive mechanism for updating; this would split the meta and theory levels. Evidence like “this theory was right” must be embedded in existing observations, not duplicated somewhere else as a stand-alone, ad-hoc ingredient.

Finally, the notion of meta-theory introduces another concept, that of distinct theory domains. This concept is necessary because it is through cross-theory performance that a meta-theoretical principle can emerge. No generalization or principle would even be possible if there were no different theories to begin with. Because different theories may belong to different domains, meta-theoretic induction must account for logical dependencies pertaining to distinct domains; these theories make explicit predictions only about their domain.

Summing up:

Our model will consist of observations/evidence, theories and meta-theories. Theories and corresponding observations are divided into different domains; meta-theories are theories about theories, and capture inter-theoretic dependencies (see below). Meta-theories do not make explicit predictions.

Let’s begin by introducing terms

En: An element of evidence for domain n [1]

Hn: A theory over domain n

M: A meta-theory

Observations that do not pertain to a theory’s domain will be called external evidence. An important assumption in this model is that theories are conditionally independent of external observations given a meta-theory. This means that a theory depends on external observations only through those observations’ effects on meta-theories[2].

We start the formalization of the model with our last remark, conditional independence of theories and external observations given a meta-theory

P(Hn|Ex,M) = P(Hn|M) …………………… (1)

Additionally, any evidence is conditionally independent of a meta-theory given its corresponding theory; i.e. it is theories that make predictions, while meta-theories make predictions only indirectly, by supporting theories.

P(En|M,Hn) = P(En|Hn) …………………… (2)

Now we define how a meta-theory is updated

P(M|En) = P(En|M) * P(M) / P(En) …………………… (3)

this is just Bayes’ theorem. The important term is the likelihood, which by the law of total probability is

P(En|M) = P(En|M,Hn) * P(Hn|M) + P(En|M,¬Hn) * P(¬Hn|M)

which by conditional independence (2)

P(En|M) = P(En|Hn) * P(Hn|M) + P(En|¬Hn) * P(¬Hn|M) …………………… (4)

This equation governs how a meta-theory is updated with new evidence En. Next we determine how the meta-theory produces a theory’s prior. Again, by total probability

P(Hn|Ex) = P(Hn|Ex,M) * P(M|Ex) + P(Hn|Ex,¬M) * P(¬M|Ex)

which by conditional independence (1)

P(Hn|Ex) = P(Hn|M) * P(M|Ex) + P(Hn|¬M) * P(¬M|Ex) …………………… (5)

The following picture illustrates how evidence updates a meta-theory which in turn produces a prior. Note that evidence E1 and E2 are external to H3

Lastly, updating a theory based on matching evidence is, as usual

P(Hn|En) = P(En|Hn) * P(Hn) / P(En) …………………… (6)

Equations 3,4,5 and 6 are the machinery of the model through which evidence can be processed in sequence. See it in action in the next post.
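As a sanity check, this machinery can be packaged as a few reusable functions; here is a minimal Python sketch (the function names and the illustrative 3/4, 1/4 numbers, echoing the worked example above, are my own choices):

```python
from fractions import Fraction as F

# Equations 3-6 as reusable update rules. All distributions are over a
# binary meta-theory M / theory Hn; a probability p stands for P(M) or P(Hn).

def meta_likelihood(e_given_h, e_given_not_h, h_given_m):
    # Equation 4: P(En|M) = P(En|Hn) P(Hn|M) + P(En|¬Hn) P(¬Hn|M)
    return e_given_h * h_given_m + e_given_not_h * (1 - h_given_m)

def update(p, lik_true, lik_false):
    # Equations 3 and 6 share the same Bayes-rule shape.
    return lik_true * p / (lik_true * p + lik_false * (1 - p))

def theory_prior(h_given_m, h_given_not_m, p_m):
    # Equation 5: P(Hn|Ex) = P(Hn|M) P(M|Ex) + P(Hn|¬M) P(¬M|Ex)
    return h_given_m * p_m + h_given_not_m * (1 - p_m)

# One full pass: external evidence E1 updates M (equations 3 and 4),
# the updated M sets a prior on H2 (equation 5), and matching evidence
# E2 then updates H2 directly (equation 6).
p_m = F(1, 2)
p_m = update(p_m,
             meta_likelihood(F(3, 4), F(1, 4), F(3, 4)),
             meta_likelihood(F(3, 4), F(1, 4), F(1, 4)))
p_h2 = theory_prior(F(3, 4), F(1, 4), p_m)
p_h2 = update(p_h2, F(3, 4), F(1, 4))
print(p_m, p_h2)  # 5/8 27/34
```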


[1] A given En represents a sequence of observations made for a domain n. So Hn|En represents induction in a single step, although in practice it would occur with successive bayesian updates for each subelement of evidence.

[2] This characteristic is the meta analogue of conditional independence between observations given theories. In other words, just as logical dependencies between observations are mediated by theories, inter-domain logical dependencies between theories are mediated by meta-theories.