Occam’s razor in a cellular physics universe

A cellular automaton (http://www.noyzelab.com/)

A cellular automaton (CA) is an algorithm acting on cells in a grid at discrete time steps. Each cell is typically in one of two states, on or off. At each step, the CA computes the new state of every cell as a function of the states of its neighbors. Here is a simple example of how the new cells are calculated from the old ones:

In this example, the new cell is shown below, and the input cells (neighbors) are the three above it. The image at the top of this post shows the evolution of a CA by displaying the new cells in each successive row. In other words, time flows vertically downwards.
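To make the update step concrete, here is a minimal Python sketch of a one-dimensional, two-state CA. The function names, the wrap-around boundary and the particular local rule (XOR of the two outer neighbors) are illustrative assumptions, not the rule shown in the image.

```python
def step(cells, local_rule):
    """Compute the next row: each new cell depends on the three cells above it."""
    n = len(cells)
    # Assumption: the grid wraps around so every cell has two neighbors.
    return [local_rule(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])
            for i in range(n)]


def evolve(cells, local_rule, steps):
    """Return all rows; time flows downwards, one row per step."""
    rows = [cells]
    for _ in range(steps):
        rows.append(step(rows[-1], local_rule))
    return rows


def xor_rule(left, center, right):
    # Hypothetical example rule: the new cell is on iff exactly one
    # of the two outer neighbors is on (the center is ignored).
    return left ^ right


if __name__ == "__main__":
    start = [0] * 15
    start[7] = 1  # a single "on" cell in the middle
    for row in evolve(start, xor_rule, 7):
        print("".join("#" if c else "." for c in row))
```

Passing the local rule as a function keeps the stepping code independent of any particular rule.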

CA’s were discovered in the 1940’s by Stanislaw Ulam and John von Neumann, who were working together at Los Alamos National Laboratory. Perhaps the most famous automaton is the Game of Life, invented by John Conway in 1970.

In this post we will consider a model of a universe based on cellular automata and see what it says about Occam’s razor and the problem of induction. The idea that the universe is describable by a cellular automaton is not new:

many scholars have raised the question of whether the universe is a cellular automaton. Consider the evolution of rule 110: if it were some kind of “alien physics”, what would be a reasonable description of the observed patterns?

If you didn’t know how the images were generated, you might end up conjecturing about the movement of some particle-like objects (indeed, physicist James Crutchfield made a rigorous mathematical theory out of this idea, proving the statistical emergence of “particles” from CA). Then, as the argument goes, one might wonder if our world, which is currently well described by physics with particle-like objects, could be a CA at its most fundamental level.

This idea is a specific variant of a more general perspective known as digital physics:

In physics and cosmology, digital physics is a collection of theoretical perspectives based on the premise that the universe is, at heart, describable by information, and is therefore computable.

Note that we are not claiming digital physics here, but rather constructing a model based on some initial postulates and seeing where it leads us.

Given this background we can consider the problem of induction in a CA universe. The properties of this model are:

1) The universe consists of an n-dimensional infinite grid of cells

2) The time evolution of cells is governed by a cellular automaton

Let’s add our scientist. An agent in this universe makes observations and must formulate hypotheses as to what natural laws describe reality. If we accept a Bayesian model, the problem of induction is how to construct a prior over possible theories such that inference is possible. But what form do theories have in this model?

From CA’s to Boolean functions

Although not immediately obvious, typical (2-state) CA’s are equivalent to boolean functions. This is something I noticed when I came across the equation that describes the number of CA’s as a function of states and neighbors:

The general equation for such a system (CA) of rules is $k^{k^s}$, where k is the number of possible states for a cell, and s is the number of neighboring cells

This has the same shape as the expression $2^{2^k}$, which is the number of boolean functions of arity k. The connection is simple: a CA with 2-state cells that takes k neighbors as inputs to produce a new output cell (again 2-state) is equivalent to a function

$f : B^k \to B$, where $B = \{0, 1\}$

which is precisely the definition of a boolean function of arity k. In this CA -> Boolean Function correspondence, the arity is given by the CA’s dimensionality and neighborhood. Below is a one-dimensional CA in which each cell’s new value is a function of its two adjacent neighbors plus its own value (arity of 3).

Rule 179

This CA is known as rule 179 because that number encodes the binary specification of the boolean function. You can see this by looking at its truth table (I’m using bexpred):

Truth table specification for Rule 179

The table shows the output of the 3-ary function, inputs A,B,C. If you read the output bits bottom up you get 10110011 which in decimal is 179.
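The encoding can be checked with a few lines of Python. This is only a sketch: the helper name `truth_table` and the row ordering (A as the most significant input bit, rows listed from 000 to 111) are assumptions chosen to match the table described above.

```python
def truth_table(rule_number, arity=3):
    """Return (inputs, output) rows, with inputs ordered from 00..0 to 11..1."""
    rows = []
    for index in range(2 ** arity):
        # Unpack the row index into input bits, most significant (A) first.
        inputs = tuple((index >> shift) & 1 for shift in reversed(range(arity)))
        # The output for this row is bit number `index` of the rule number.
        rows.append((inputs, (rule_number >> index) & 1))
    return rows


table = truth_table(179)
for (a, b, c), out in table:
    print(a, b, c, "->", out)

# Reading the output column bottom-up gives the binary form of the rule number.
bits_bottom_up = "".join(str(out) for _, out in reversed(table))
print(bits_bottom_up, int(bits_bottom_up, 2))  # prints "10110011 179"
```

With three binary inputs the table has 2**3 = 8 rows, so there are 2**8 = 256 possible rules of this kind, one for each way of filling in the output column.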

Boolean functions, expressions and trees

Besides the equivalence with CA’s, boolean functions are in general described by boolean algebra and are specified with boolean expressions or formulas. In this algebra, variables take on the values true (T) or false (F), and the operators are disjunction (v), conjunction (^) and negation (~). For example, Rule 179 above can be formulated as

(A^C) v ~B

where A is the left neighbor cell, B is the center, and C is the right neighbor. You can check that this in fact corresponds to the CA by applying the formula to cells: doing this repeatedly would result in the pattern in the image above.

The nature of boolean expressions is such that you can represent them as trees. For example (from D. Gardy[1]), the expression
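The correspondence between the formula and the rule number can be verified exhaustively, since there are only eight input combinations. The sketch below assumes the standard numbering convention in which bit 4A + 2B + C of the rule number holds the output; the function names are made up for illustration.

```python
from itertools import product


def formula(a, b, c):
    """(A ^ C) v ~B, with A = left neighbor, B = center, C = right neighbor."""
    return int((a and c) or not b)


def rule_output(rule_number, a, b, c):
    # Assumption: standard numbering, where bit (4A + 2B + C) holds the output.
    return (rule_number >> (4 * a + 2 * b + c)) & 1


matches = all(formula(a, b, c) == rule_output(179, a, b, c)
              for a, b, c in product((0, 1), repeat=3))
print(matches)  # prints True
```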

x ^ (y v z v ~y v y) ^ (~y v t v (x ^ ~v) v u)

can be represented as

Image taken from [1]

This representation is very similar to that of boolean circuits, in which boolean expressions are represented as directed acyclic graphs. It allows classifying boolean circuits in terms of their computational complexity:

In theoretical computer science, circuit complexity is a branch of computational complexity theory in which Boolean functions are classified according to the size or depth of Boolean circuits that compute them.

There are two measures of complexity, depth and circuit-size complexity. In this post we will use a boolean expression analog of circuit-size complexity, which measures the computational complexity of a boolean function by the number of nodes of the minimal circuit that computes it.

L(f) = length of shortest formula (boolean expression) computing f
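For small arities, L(f) can be found by brute force. The sketch below assumes formulas are binary AND/OR trees with negation only at the leaves, and measures length as the number of leaves; both choices, like the function names, are illustrative assumptions rather than the definition used in the cited literature. Functions are stored as truth-table bitmasks, so with three variables the function for Rule 179 is simply the integer 179.

```python
def literals(n):
    """Truth tables (bitmasks over 2**n rows) of each variable and its negation."""
    rows = 1 << n
    full = (1 << rows) - 1
    tables = []
    for i in range(n):
        t = sum(1 << row for row in range(rows) if (row >> i) & 1)
        tables += [t, full ^ t]
    return tables


def formula_complexity(n, max_leaves):
    """Map each reachable function to the leaf count of its shortest formula."""
    by_size = {1: set(literals(n))}
    shortest = {f: 1 for f in by_size[1]}
    for m in range(2, max_leaves + 1):
        found = set()
        for left in range(1, m):  # split the m leaves between two subtrees
            for f in by_size[left]:
                for g in by_size[m - left]:
                    found.add(f & g)  # root is AND
                    found.add(f | g)  # root is OR
        by_size[m] = found
        for f in found:
            shortest.setdefault(f, m)  # keep the first (smallest) size seen
    return shortest


L = formula_complexity(3, 5)
print(L[179])  # prints 3: (A ^ C) v ~B uses three leaves
```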

With this last piece we can revisit our model and add some further detail:

1) The universe consists of an n-dimensional infinite grid of cells

2) The time evolution of cells is governed by some 2-state cellular automaton describable by a boolean tree of complexity L(f)

We can also answer the question posed earlier:

What form do theories have in this model?

The theories our scientist constructs take the form of boolean expressions or, equivalently, boolean trees. As stated before, the problem of induction in a Bayesian setting is about constructing priors over theories. In our model this now translates into constructing a prior over boolean expressions.

Finally, we will postulate two desirable properties such a prior must have, following the spirit of work on algorithmic probability[3][4]. One is Epicurus’ Principle of Multiple Explanations:

Epicurus: if several theories are consistent with the observed data, retain them all

The other is the Principle of Insufficient Reason

when we have no other information than that exactly N mutually exclusive events can occur, we are justified in assigning each the probability 1/N.

These last three epistemological characteristics complete our model:

3) Theories take the form of boolean expressions with tree complexity L(f)

4) A-priori all theories are consistent with evidence (Epicurus)

5) A-priori all theories are equally likely

A uniform prior on boolean expressions

Per the characteristics of our model we wish to construct a prior probability distribution over boolean expressions such that

a) The distribution’s support comprises all boolean expressions for some n-dimensional 2-state CA

b) All boolean expressions are assigned equal probability

In order to achieve this we turn to results by Lefmann and Savicky[2], among others, on a specific tree representation of boolean formulas, And/Or trees:

We consider such formulas to be rooted binary trees.. each of the inner nodes .. is labeled by AND or OR. Each leaf is labelled by a literal, i.e. a variable or its negation

Note that these properties of And/Or trees do not reduce their expressiveness: any boolean expression can be formulated as an And/Or tree.

We wish to construct a uniform (b) probability distribution over all (a) And/Or trees, of which there are infinitely many. Lefmann and Savicky (see also Woods[6]) proved that such a probability distribution exists as an asymptotic limit of a uniform finite distribution:


Finally we will use two results (later improved in Chauvin[7]) which relate the probability P(f) and the boolean expression complexity L(f) in the form of probability bounds:




establishing upper and lower bounds. Note the L(f) term in both cases.


Let’s recap. We defined a toy universe governed by a variant of CA physics, then showed the equivalence between these CA’s, boolean functions, expressions and finally trees. After adding two epistemological principles we recast the problem of induction in this model in terms of constructing a uniform prior over boolean expressions (theories). Further restrictions (And/Or tree representation of theories) allowed us to use existing results to establish the existence of, and then provide upper and lower bounds on, our uniform prior.

The key characteristic in these bounds is the term for the boolean function’s complexity. In theorem 3.1, the L(f) term appears as a positive exponential on a number < 1. In theorem 3.5, L(f) appears as a negative exponential on a number > 1. This means that the complete bounds are monotonically decreasing with increasing expression complexity. This is essentially equivalent to Occam’s razor.

Thus we have shown[8] that Occam’s razor emerges automatically from the properties of our model; we get the razor “for free”, without having to add it as a separate assumption. Our scientist would therefore be justified in assigning higher probabilities to simpler hypotheses.

As an example, we can see concrete values, not just bounds, for the prior distribution in Chauvin[7], for the specific case of n = 3 (this would correspond to a 2-state 1-dimensional CA with a three-cell neighborhood).

Sample P(f) for n = 3 (taken from [7])

The column of interest is labelled P(f). We can see how probabilities decrease with increasing boolean expression complexity. Refer to section 2.4 of that paper to see the corresponding increasing values of L(f).
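The qualitative trend can also be explored by direct enumeration for a finite tree size. The sketch below counts And/Or trees with a fixed number of leaves over n = 3 variables and groups the resulting probabilities by complexity. It is not the asymptotic distribution from [7]: the finite size (6 leaves), the use of leaf count as the size and complexity measure, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def literals(n):
    """Truth tables (bitmasks over 2**n rows) of each variable and its negation."""
    rows = 1 << n
    full = (1 << rows) - 1
    tables = []
    for i in range(n):
        t = sum(1 << row for row in range(rows) if (row >> i) & 1)
        tables += [t, full ^ t]
    return tables


def tree_counts(n, max_leaves):
    """counts[m][f] = number of And/Or trees with m leaves that compute f."""
    counts = {1: Counter(literals(n))}
    for m in range(2, max_leaves + 1):
        c = Counter()
        for left in range(1, m):
            for f, cf in counts[left].items():
                for g, cg in counts[m - left].items():
                    c[f & g] += cf * cg  # root labelled AND
                    c[f | g] += cf * cg  # root labelled OR
        counts[m] = c
    return counts


n, m = 3, 6
counts = tree_counts(n, m)
total = sum(counts[m].values())

# Complexity of f: the smallest leaf count at which f first appears.
complexity = {}
for size in range(1, m + 1):
    for f in counts[size]:
        complexity.setdefault(f, size)

# Mean probability of a function under the uniform distribution on
# trees with m leaves, grouped by the function's complexity.
groups = defaultdict(list)
for f, c in counts[m].items():
    groups[complexity[f]].append(Fraction(c, total))
for size in sorted(groups):
    mean = sum(groups[size]) / len(groups[size])
    print(size, len(groups[size]), float(mean))
```

Each printed row gives a complexity value, the number of functions with that complexity, and their mean probability. Because every tree is weighted equally, any difference between rows comes only from how many trees compute each function.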


Although we have reviewed the basic steps that outline how Occam’s razor follows from our simple model’s properties, we have not discussed the details as to how and why this happens. In a future post we’ll discuss these details, and the possibility that the mechanism at work may (or may not) generalize to other formalizations of universe-theory-prior.


[1] D. Gardy. Random Boolean expressions. In Colloquium on Computational Logic and Applications, volume AF, pages 1–36. DMTCS Proceedings, 2006.

[2] H. Lefmann and P. Savicky. Some typical properties of large And/Or Boolean formulas. Random Structures and Algorithms, 10:337–351, 1997.

[3] Principles of Solomonoff Induction and AIXI 

[4] A Philosophical Treatise of Universal Induction

[5] http://www.scholarpedia.org/article/Algorithmic_probability#Bayes.2C_Occam_and_Epicurus

[6] A. Woods. Coloring rules for finite trees, and probabilities of monadic second order sentences. Random Structures and Algorithms, 10:453–485, 1997.

[7] B. Chauvin, P. Flajolet, D. Gardy, and B. Gittenberger. And/Or trees revisited. Combinatorics, Probability and Computing, 13(4–5):475–497, July–September 2004.

[8] We are leaving out some technical details here. One is that monotonically decreasing bounds do not imply a monotonically decreasing probability. There may be local violations of Occam’s razor, but the razor must hold apart from minor fluctuations. In the sample results for n=3 in Chauvin[7], probabilities are in fact monotonically decreasing.

Two, the asymptotics for P(f) for fixed m and P(f) for trees <= m are the same, see [7] 2.1 and [1] 3.3.3

Another detail is the assumption that, ceteris paribus, a minimal expression computing f1 corresponding to expression e1 will be shorter than the minimal expression computing f2 corresponding to expression e2, if e1 < e2. I.e., e1 < e2 implies on average L(f1) < L(f2).

Finally, it is worth noting that it is the syntactic prior over boolean expressions that induces an occamian prior over boolean functions. What makes this work is that formula reductions[9] produce multiplicities in the syntactic space for any given element in semantic space. A uniform prior over boolean functions would not yield Occam; it would have to be added separately (i.e., the problem of induction).

[9] Boolean expressions may be reduced (simplified) using the laws of boolean algebra. Here is an example boolean reduction

The image above shows a reduction of the 3-ary boolean expression


which yields

A*C + !B

Which is in fact the boolean function corresponding to Rule 179

Causal entropy maximization and intelligence

Taken from Causal Entropic Forces

Recently I was referred to a paper titled Causal Entropic Forces published in Physical Review Letters that attempts to link intelligence and entropy maximization. You can find reviews of this paper here and here. The paper starts with

Recent advances in fields ranging from cosmology to computer science have hinted at a possible deep connection between intelligence and entropy maximization….In this Letter, we explicitly propose a first step toward such a relationship in the form of a causal generalization of entropic forces that we show can spontaneously induce remarkably sophisticated behaviors associated with the human ‘‘cognitive niche,’’ including tool use and social cooperation, in simple physical systems.

The authors then go on to define a causal path version of entropy. Briefly, this is a generalization from standard entropy, a measure of how many states a system can be in at a specific point in time, to causal path entropy, a measure of how many paths that system can follow during a given time horizon. In technical language, microstates are mapped to paths in configuration space, and macrostates are mapped to configuration space volumes:

In particular, we can promote microstates from instantaneous configurations to fixed-duration paths through configuration space while still partitioning such microstates into macrostates according to the initial coordinate of each path

In other words, an initial coordinate establishes a volume in configuration space which represents possible future histories starting at that point. This is the macrostate (depicted as a cone in the image above).

Having defined this version of entropy, the authors then add the condition of entropy maximization to their model; this is what they call causal entropic forcing. For this to have a net effect, some macrostates must have volumes which are partially blocked off for physical reasons. Consequently these macrostates have fewer available future paths, and less causal path entropy. The result is that different macrostates with different entropies can be differentially favored by the condition of causal entropy maximization:

there is an environmentally imposed excluded path-space volume that breaks translational symmetry, resulting in a causal entropic force F directed away from the excluded volume.
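A discrete toy analogue may help build intuition. The sketch below is not the authors' simulation: it assumes a one-dimensional box of cells, treats every allowed path as equally likely (so the path entropy of a starting cell is just the logarithm of the number of allowed paths), and replaces the force with a greedy step towards the neighboring cell of highest entropy. All names and parameters are hypothetical.

```python
import math


def path_counts(width, horizon):
    """counts[x] = number of allowed paths of `horizon` steps starting at cell x."""
    counts = [1] * width  # zero steps: exactly one (empty) path from each cell
    for _ in range(horizon):
        # A path may move left, stay, or move right; steps through a wall are excluded.
        counts = [sum(counts[y] for y in (x - 1, x, x + 1) if 0 <= y < width)
                  for x in range(width)]
    return counts


def path_entropy(width, horizon):
    # Assumption: all allowed paths are equally likely, so entropy = log(count).
    return [math.log(c) for c in path_counts(width, horizon)]


def greedy_move(x, entropy):
    """Step to the neighboring cell (or stay put) with the highest path entropy."""
    options = [y for y in (x - 1, x, x + 1) if 0 <= y < len(entropy)]
    return max(options, key=lambda y: entropy[y])


entropy = path_entropy(width=11, horizon=10)
x = 0
trajectory = [x]
for _ in range(8):
    x = greedy_move(x, entropy)
    trajectory.append(x)
print(trajectory)  # prints [0, 1, 2, 3, 4, 5, 5, 5, 5]
```

Starting next to a wall, the agent drifts to the middle of the box and stays there, because the walls exclude the most paths for cells near them.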

Note that, contrary to actual thermodynamical systems that naturally exhibit entropy maximization for statistical reasons, causal entropic forcing is not physical; it is a thermodynamics-inspired premise the authors add to their model as a “what if” condition, to see what behaviour results. So, what happens when systems are subject to causal entropic forcing?

 we simulated its effect on the evolution of the causal macrostates of a variety of simple mechanical systems: (i) a particle in a box, (ii) a cart and pole system, (iii) a tool use puzzle, and (iv) a social cooperation puzzle…The latter two systems were selected because they isolate major behavioral capabilities associated with the human ‘‘cognitive niche’’

Before you get excited, the “tool use puzzle” and “social cooperation puzzle” are not what one would imagine. They are simple “toy simulations” that can be interpreted as tool use and social cooperation. In any case, the result was surprising. When running these simulations the authors observed adaptive behaviour that was remarkably sophisticated given the simplicity of the physics model it emerged from. What’s more, not only was the behaviour adaptive, but it exhibited a degree of generality; the same basic model was applied to both examples without specific tuning.

The remarkable spontaneous emergence of these sophisticated behaviors from such a simple physical process suggests that causal entropic forces might be used as the basis for a general—and potentially universal—thermodynamic model for adaptive behavior.

How does this fit in with intelligence?

I see two different ways one can think about this new approach. One, as an independent definition of intelligence from very simple physical principles. Two, in terms of existing definitions of intelligence, seeing where it fits in and if it can be shown to be equivalent or recovered partially.

Defining intelligence as causal entropy maximization (CEM) is very appealing, as it only requires a few very basic physical principles to work. In this sense it is a very powerful concept. But like all definitions it is neither right nor wrong; its merit rests on how useful it is. The question is thus how well this version of intelligence captures our intuitions about the concept, and how well it fits with existing phenomena that we currently classify as intelligent[1]. I’ll consider a simple example to suggest that intelligence defined this way cannot be the entire picture.

That example is unsurprisingly life, the cradle of intelligence. The concept that directly collides with intelligence defined as CEM is negentropy. Organisms behave adaptively to keep their biological systems within the narrow space of parameters that is compatible with life. We would call this adaptive behaviour intelligent, and yet its effect is precisely that of reducing entropy. Indeed, maximizing causal entropy for a living being means one thing, death.

One could argue that the system is not just the living organism, but the living organism plus its environment, and that in that case the entropy perhaps would be maximized[2]. This could resolve the apparent incompatibility, but CEM still seems unsatisfying. How can a good definition of intelligence leave out an essential aspect of intelligent life: the entropy minimization that all living beings must carry out? Is this local entropy minimization implicit in the overall causal entropy maximization?

CEM, intelligence and goals

Although there is no single correct existing definition of intelligence, it can be said that current working definitions share certain common features. Citing [3]:

If we scan through the definitions pulling out commonly occurring features we find that intelligence is:

• A property that an individual agent has as it interacts with its environment or environments.

• Is related to the agent’s ability to succeed or profit with respect to some goal or objective.

• Depends on how able the agent is to adapt to different objectives and environments.

In particular, intelligence is related to the ability to achieve goals. One of the appealing characteristics of CEM as defining intelligence is that it does not need to define goals explicitly. In the simulations carried out by the authors the resulting behaviour seemed to be directed at achieving some goal that was not specified by the experimenters. It could be said that the goals emerged spontaneously from CEM.  But it remains to be seen whether this goal directed behaviour results automatically in real complex environments. For the example of life I mentioned above, it looks to be just the opposite.

So in general, how does CEM fit in with existing frameworks where intelligence is the ability to achieve goals in a wide range of environments[3]? Again, I see two possibilities:

  • CEM is a very general heuristic[4] that aligns with standard intelligence when there is uncertainty in the utility of different courses of action
  • CEM can be shown to be equivalent if there exists an encoding that represents specific goals via blocked off regions in configuration space (macrostates)

The idea behind the first possibility is very simple. If an agent is faced with many possibilities where it is unclear which one will lead to achieving its goals, then maximizing expected utility would seek to follow courses of action that allow it to react adaptively and flexibly when more information becomes available. This heuristic is just a version of keep your options open.

The second idea is just a matter of realizing that CEM’s resulting behaviour depends on how you count possible paths to determine the causal entropy of a macrostate. If one were to rule out paths that result in low utility given certain goals, then CEM could turn out to be equivalent to existing approaches to intelligence. Is it possible to recover intelligent goal directed behaviour as an instance of CEM given the right configuration space restrictions?


[1] Our intuitions about intelligence exist prior to any technical definition. For example, we would agree that a monkey is more intelligent than a rock, and that a person is more intelligent than a fly. A definition that does not fit these basic notions would be unsatisfactory.

[2] http://prd.aps.org/abstract/PRD/v76/i4/e043513

[3] http://www.vetta.org/documents/A-Collection-of-Definitions-of-Intelligence.pdf

[4] This seems related to the idea of basic AI drives identified by Omohundro in his paper http://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf. In particular 6. AIs will want to acquire resources and use them efficiently. Availability of resources translates to the ability to follow more paths in configuration space, paths that would be unavailable otherwise.

A cognitive approach to evaluating programming languages

Programmers have been known to engage in flame wars about programming languages (and related matters like choice of text editor, operating system or even code indent style). Rational arguments are absent from these heated debates; differences in opinion usually reduce to personal preferences and strongly held allegiances without much objective basis. I have discussed this pattern of thinking before as found in politics.

Although humans have a natural tendency to engage in this type of thought and debate for any subject matter, the phenomenon is exacerbated in fields where there is no objective evidence available to reach conclusions and no method to settle questions in a technical and precise way. Programming languages are a clear example of this, and so better/worse opinions are more or less free to roam without the constraints of well established knowledge. Quoting from a presentation I link to below:

Many claims are made for the efficacy and utility of new approaches to software engineering – structured methodologies, new programming paradigms, new tools, and so on. Evidence to support such claims is thin and such evidence, as there is, is largely anecdotal. Of proper scientific evidence there is remarkably little. – Frank Bott

Fortunately there is a recognized wisdom that can settle some debates: there is no overall better programming language, you merely pick the right tool for the job. This piece of wisdom is valuable for two reasons. First, because it is most probably true. Second, its down-to-earth characterization of a programming language as just a tool inoculates against religious attitudes towards it; you don’t worship tools, you merely use them.

But even though this change of attitude is welcome and definitely more productive than the usual pointless flame wars, it does not automatically imply that there is no such thing as a better or worse programming language for some class of problems, or that better or worse cannot be defined in some technical yet meaningful way. After all, programming languages should be subject to advances like any other engineering tool. The question is, what approach can be used to even begin to think about programming, programs, and programming languages in a rigorous way?

One approach is to establish objective metrics on source code that reflect some property of the program that is relevant for the purposes of writing better software. One such metric is cyclomatic complexity, a measure of source code complexity. The motivation for this metric is clear: complex programs are harder to understand, maintain and debug. In this sense, cyclomatic complexity is an objective metric that tries to reflect a property that can be interpreted as better/worse; a practical recommendation could be to write and refactor code in a way that minimizes the value of this metric.

But the problem with cyclomatic complexity, or any measure, is whether it in fact reflects some property that is relevant and has meaningful consequences. It is not enough that the metric is precisely defined and objective if it doesn’t mean anything. In the above, it would be important to determine that cyclomatic complexity is in fact correlated with difficulty in understanding, maintaining, and debugging. Absent this verified correlation, one cannot make the jump from an objective metric on code to some interpretation in terms of better/worse, and we’re back where we started.
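As an illustration of how mechanical such a metric is, here is a rough Python sketch that approximates cyclomatic complexity as one plus the number of decision points found in a program's syntax tree. The set of constructs counted as decision points is a simplifying assumption, and the sample function is made up; dedicated tools differ in the details.

```python
import ast

# Assumption: each of these constructs counts as one decision point.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))  # prints 5
```

The sample scores 5: one for the function itself, plus two `if` statements, one `for` loop and one `and`.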

The important thing to note is that correctly assigning some property of source code a better/worse interpretation is partly a matter of human psychology, a field whose methods and conclusions can be exploited. The fact that some program is hard to understand (or maintain, debug, etc) is a consequence both of some property of the program and some aspect of the way we understand programs. This brings us to the concept of the psychology of programming as a necessary piece in the quest to investigate programming in a rigorous and empirical way.

Michael Hansen discusses these ideas in this talk: Cognitive Architectures: A Way Forward for the Psychology of Programming. His approach is very interesting: it attempts to simulate cognition via the same cognitive architectures that play a role in artificial general intelligence. Data from these simulations can cast light on how different programming language features impact cognition, and therefore how these features perform in the real world.

The ACT-R cognitive architecture

I have to say, however, that this approach seems very ambitious to me. First, because modeling cognition is incredibly hard to get right. Otherwise we’d already have machine intelligence. Secondly, because it is hard to isolate the effects of anything beyond a low granularity feature. And programming languages, let alone paradigms, are defined by the interplay of many of these features and characteristics. Both of these problems are recognized by the speaker.

[1] Image taken from http://www.lackuna.com/2013/01/02/4-programming-languages-to-ace-your-job-interviews/

Jürgen Schmidhuber at AGI-2011: Fast Deep/Recurrent Nets for AGI Vision


It’s all about deep learning these days. I previously posted a video here of a talk by Andrew Ng in which one can also see unsupervised feature learning at work: features such as Gabor filters are learned by the network automatically.