Causal entropy maximization and intelligence

Taken from Causal Entropic Forces

Recently I was referred to a paper titled Causal Entropic Forces published in Physical Review Letters that attempts to link intelligence and entropy maximization. You can find reviews of this paper here and here. The paper starts with

Recent advances in fields ranging from cosmology to computer science have hinted at a possible deep connection between intelligence and entropy maximization….In this Letter, we explicitly propose a first step toward such a relationship in the form of a causal generalization of entropic forces that we show can spontaneously induce remarkably sophisticated behaviors associated with the human ‘‘cognitive niche,’’ including tool use and social cooperation, in simple physical systems.

The authors then go on to define a causal path version of entropy. Briefly, this is a generalization from standard entropy, a measure of how many states a system can be in at a specific point in time, to causal path entropy, a measure of how many paths that system can follow during a given time horizon. In technical language, microstates are mapped to paths in configuration space, and macrostates are mapped to configuration space volumes:

In particular, we can promote microstates from instantaneous configurations to fixed-duration paths through configuration space while still partitioning such microstates into macrostates according to the initial coordinate of each path

In other words, an initial coordinate establishes a volume in configuration space which represents possible future histories starting at that point. This is the macrostate (depicted as a cone in the image above)

Having defined this version of entropy, the authors then add the condition of entropy maximization to their model; this is what they call causal entropic forcing. For this to have a net effect, some macrostates must have volumes that are partially blocked off for physical reasons. Consequently these macrostates have fewer available future paths, and lower causal path entropy. The result is that macrostates with different entropies can be differentially favored by the condition of causal entropy maximization:

there is an environmentally imposed excluded path-space volume that breaks translational symmetry, resulting in a causal entropic force F directed away from the excluded volume.
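To make the idea of an excluded path-space volume concrete, here is a minimal sketch of a discrete analogue. It is not the paper's formulation: the lattice, the step set (-1, 0, +1), the uniform weighting of paths and the names `count_paths` and `path_entropy` are all assumptions made for illustration. It counts the fixed-duration paths available from each starting position of a particle in a one-dimensional box and takes the logarithm of that count as a stand-in for causal path entropy.

```python
import math

def count_paths(start, width, horizon):
    """Count lattice paths of `horizon` steps from `start` that never leave
    the box [0, width - 1]. Each step moves -1, 0 or +1 (assumed toy dynamics)."""
    counts = [0] * width
    counts[start] = 1
    for _ in range(horizon):
        nxt = [0] * width
        for x, c in enumerate(counts):
            for dx in (-1, 0, 1):
                y = x + dx
                if 0 <= y < width:  # paths that would cross a wall are excluded
                    nxt[y] += c
        counts = nxt
    return sum(counts)

def path_entropy(start, width, horizon):
    """Log of the number of allowed paths: a discrete stand-in for causal path entropy."""
    return math.log(count_paths(start, width, horizon))

WIDTH, HORIZON = 11, 10
entropies = [path_entropy(x, WIDTH, HORIZON) for x in range(WIDTH)]
best_start = max(range(WIDTH), key=lambda x: entropies[x])
print(best_start)  # 5, the centre of the box
```

In this toy setting the walls remove paths from starting positions near them, so the path count is largest at the centre. A system that climbs the gradient of this quantity would drift away from the walls, which is the qualitative behaviour the quoted passage describes.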

Note that, unlike actual thermodynamic systems, which naturally exhibit entropy maximization for statistical reasons, causal entropic forcing is not physical. It is a thermodynamics-inspired premise that the authors add to their model as a “what if” condition, to see what behaviour results. So, what happens when systems are subject to causal entropic forcing?

 we simulated its effect on the evolution of the causal macrostates of a variety of simple mechanical systems: (i) a particle in a box, (ii) a cart and pole system, (iii) a tool use puzzle, and (iv) a social cooperation puzzle…The latter two systems were selected because they isolate major behavioral capabilities associated with the human ‘‘cognitive niche’’

Before you get excited, the “tool use puzzle” and “social cooperation puzzle” are not what one would imagine. They are simple “toy simulations” that can be interpreted as tool use and social cooperation. In any case, the result was surprising. When running these simulations the authors observed adaptive behaviour that was remarkably sophisticated given the simplicity of the physics model it emerged from. What’s more, not only was the behaviour adaptive, but it exhibited a degree of generality; the same basic model was applied to both examples without specific tuning.

The remarkable spontaneous emergence of these sophisticated behaviors from such a simple physical process suggests that causal entropic forces might be used as the basis for a general—and potentially universal—thermodynamic model for adaptive behavior.

How does this fit in with intelligence?

I see two different ways one can think about this new approach. One, as an independent definition of intelligence from very simple physical principles. Two, in terms of existing definitions of intelligence, seeing where it fits in and if it can be shown to be equivalent or recovered partially.

Defining intelligence as causal entropy maximization (CEM) is very appealing, as it only requires a few very basic physical principles to work. In this sense it is a very powerful concept. But like all definitions it is neither right nor wrong; its merit rests on how useful it is. The question is thus how well this version of intelligence captures our intuitions about the concept, and how well it fits with existing phenomena that we currently classify as intelligent[1]. I’ll consider a simple example to suggest that intelligence defined this way cannot be the entire picture.

That example is unsurprisingly life, the cradle of intelligence. The concept that directly collides with intelligence defined as CEM is negentropy. Organisms behave adaptively to keep their biological systems within the narrow space of parameters that is compatible with life. We would call this adaptive behaviour intelligent, and yet its effect is precisely that of reducing entropy. Indeed, maximizing causal entropy for a living being means one thing, death.

One could argue that the system is not just the living organism, but the living organism plus its environment, and that in that case the entropy perhaps would be maximized[2]. This could resolve the apparent incompatibility, but CEM still seems unsatisfying. How can a good definition of intelligence leave out an essential aspect of intelligent life: the entropy minimization that all living beings must carry out? Is this local entropy minimization implicit in the overall causal entropy maximization?

CEM, intelligence and goals

Although there is no single correct existing definition of intelligence, it can be said that current working definitions share certain common features. Citing [3]

If we scan through the definitions pulling out commonly occurring features we find that intelligence is:

• A property that an individual agent has as it interacts with its environment or environments.

• Is related to the agent’s ability to succeed or profit with respect to some goal or objective.

• Depends on how able the agent is to adapt to different objectives and environments.

In particular, intelligence is related to the ability to achieve goals. One of the appealing characteristics of CEM as defining intelligence is that it does not need to define goals explicitly. In the simulations carried out by the authors the resulting behaviour seemed to be directed at achieving some goal that was not specified by the experimenters. It could be said that the goals emerged spontaneously from CEM.  But it remains to be seen whether this goal directed behaviour results automatically in real complex environments. For the example of life I mentioned above, it looks to be just the opposite.

So in general, how does CEM fit in with existing frameworks where intelligence is the ability to achieve goals in a wide range of environments[3]? Again, I see two possibilities:

  • CEM is a very general heuristic[4] that aligns with standard intelligence when there is uncertainty in the utility of different courses of action
  • CEM can be shown to be equivalent if there exists an encoding that represents specific goals via blocked off regions in configuration space (macrostates)

The idea behind the first possibility is very simple. If an agent faces many possibilities and it is unclear which one will lead to achieving its goals, then maximizing expected utility would favor courses of action that allow it to react adaptively and flexibly when more information becomes available. This heuristic is just a version of “keep your options open”.

The second idea is just a matter of realizing that CEM’s resulting behaviour depends on how you count possible paths to determine the causal entropy of a macrostate. If one were to rule out paths that result in low utility given certain goals, then CEM could turn out to be equivalent to existing approaches to intelligence. Is it possible to recover intelligent goal directed behaviour as an instance of CEM given the right configuration space restrictions?


[1] Our intuitions about intelligence exist prior to any technical definition. For example, we would agree that a monkey is more intelligent than a rock, and that a person is more intelligent than a fly. A definition that does not fit these basic notions would be unsatisfactory.



[4] This seems related to the idea of basic AI drives identified by Omohundro in his paper, in particular “6. AIs will want to acquire resources and use them efficiently”. Availability of resources translates to the ability to follow more paths in configuration space, paths that would be unavailable otherwise.

Jürgen Schmidhuber at AGI-2011: Fast Deep/Recurrent Nets for AGI Vision


It’s all about deep learning these days. I previously posted a video here of a talk by Andrew Ng where one can also see unsupervised feature learning, for example of Gabor filters, i.e. these features are learned by the network automatically.

Meta-theoretic induction in action

In my last post I presented a simple model of meta-theoretic induction. Let’s instantiate it with concrete data and run through it. Say we have

E1, E2, E3: Observations made for domains 1-3
S1, S2, S3: Simple theories for domains 1-3
C1, C2, C3: Complex theories for domains 1-3
S: Meta-theory favoring simple theories
C: Meta-theory favoring complex theories

That is, we have three domains of observation with corresponding theories. We also have two meta-theories that will produce priors on theories. The meta-theories themselves will be supported by theories’ successes or failures. Successes of simple theories support S, successes of complex theories support C. Now define the content of the theories through their likelihoods

En   P(En|Sn)   P(En|Cn)
E1   3/4        1/4
E2   3/4        1/4
E3   3/4        3/4

Given that E1, E2 and E3 are evidence, this presents a scenario where theories S1 and S2 were successful, whereas theories C1 and C2 were not. S3 and C3 represent theories that are equally well supported by previous evidence (E3) but with different future predictions. This is the crux of the example, where the simplicity bias enters into the picture. Our meta-theories are defined by

P(Sn|S) = 3/4, P(Sn|C) = 1/4

P(Cn|C) = 3/4, P(Cn|S) = 1/4

Meta-theory S favors simple theories, whereas meta-theory C favors complex theories. Finally, our priors are neutral

P(Sn) = P(Cn) = 1/2

P(S) = P(C) = 1/2

We want to process evidence E1 E2, and see what happens at the critical point, where S3 and C3 make the same predictions. The sequence is as follows

  1. Update meta theories S and C with E1 and E2
  2. Produce a prior on S3 and C3 with the updated C and S
  3. Update S3 and C3 with E3

The last step produces probabilities for S3 and C3; these theories make identical predictions but will have different priors granted by S and C. This will formalize the statement

Simpler theories are more likely to be true because they have been so in the past

The model as a bayesian network

Instead of doing all the above by hand (using equations 3,4,5,6), it’s easier to construct the corresponding bayesian network and let some general algorithm do the work. Formulating the model this way makes it much easier to understand, in fact it seems almost trivial. Additionally, our assumptions of conditional independence (1 and 2) map directly into the bayesian network formalism of nodes and edges, quite convenient!


Node M represents the meta-theory, with possible values S and C, the H nodes represent theories, with possible values Sn and Cn. Note the lack of edges between Hn and Ex formalizing (1), and the lack of edges between M and En formalizing (2) (these were our assumptions of conditional independence).

I constructed this network using the SamIam tool developed at UCLA. With this tool we can construct the network and then monitor probabilities as we input data into the model, using the tool’s Query Mode. So let’s do that, fixing the actual outcome of the evidence nodes E1, E2 and E3

Theories S1 and S2 make correct predictions and are thus favoured by the data over C1 and C2. This in turn favours the meta-theory S, which is assigned a probability of 73% over meta-theory C, with 26%. Now, theories S3 and C3 make the same predictions about E3, but because of our meta-theory being better supported, they are assigned different probabilities. Again, recall our starting point

Simpler theories are more likely to be true because they have been so in the past

We can finally state this technically, as seen here

The simple theory S3 is favored at 61% over C3 with 38%, even though they make the same predictions. In fact, we can see how this works if we look at what happens with and without meta-theoretic induction
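The percentages above can be cross-checked without SamIam by brute-force enumeration of the network's joint distribution. The following is a minimal sketch using the likelihoods and priors given earlier in this post; the function names and the labels "simple" and "complex" are my own, and this is not how the tool itself does inference.

```python
from fractions import Fraction as F
from itertools import product

# Numbers taken from the definitions earlier in the post.
P_M = {"S": F(1, 2), "C": F(1, 2)}  # neutral prior on meta-theories
P_H_GIVEN_M = {
    "S": {"simple": F(3, 4), "complex": F(1, 4)},
    "C": {"simple": F(1, 4), "complex": F(3, 4)},
}
# P(En = observed outcome | Hn) for domains 1, 2, 3
P_E_GIVEN_H = [
    {"simple": F(3, 4), "complex": F(1, 4)},
    {"simple": F(3, 4), "complex": F(1, 4)},
    {"simple": F(3, 4), "complex": F(3, 4)},
]

def joint_with_evidence():
    """Yield (m, (h1, h2, h3), weight) with E1, E2, E3 fixed to what was observed."""
    for m in P_M:
        for hs in product(("simple", "complex"), repeat=3):
            w = P_M[m]
            for n, h in enumerate(hs):
                # Edges M -> Hn -> En only: this encodes assumptions (1) and (2).
                w *= P_H_GIVEN_M[m][h] * P_E_GIVEN_H[n][h]
            yield m, hs, w

def posterior(predicate):
    """Probability of the event described by `predicate`, given all the evidence."""
    total = sum(w for _, _, w in joint_with_evidence())
    hit = sum(w for m, hs, w in joint_with_evidence() if predicate(m, hs))
    return hit / total

p_S = posterior(lambda m, hs: m == "S")
p_S3 = posterior(lambda m, hs: hs[2] == "simple")
print(p_S, p_S3)  # 25/34 21/34
```

Here 25/34 is about 73.5% and 21/34 is about 61.8%, which agrees with the 73% and 61% shown by the tool if its display truncates rather than rounds.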

where as expected the mirrors of S3 and C3 would be granted the same probabilities. So everything seems to work, our meta-theory discriminates different theories and is itself justified via experience, as was the objective

Occam seems like an unjustified and arbitrary principle, in effect, an unsupported bias. Surely, there should be some way to anchor this widely applicable principle on something other than arbitrary choice. We need a way to represent a meta-theory such that it favours some theories over others and such that it can be justified through observations.

But what happens when we add a meta-theory like Occam(t) into the picture? What happens when we apply the same argument at the meta-level that prompted the meta-theoretic justification of simplicity we’ve developed? We define a meta-theory S-until-T with

P(S1|S-until-T) =  P(S2|S-until-T) = 3/4  

P(S3|S-until-T) = 1/4

which yields this network

Now both S and S-until-T accrue the same probability through evidence and therefore produce the same prior on S3 and C3, 50%. It seems we can’t escape our original problem.
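The 50% figure can be reproduced with a few lines of arithmetic. This sketch assumes that the network contains only the two meta-theories S and S-until-T, with equal priors; the post does not state this explicitly, but it is the reading under which the numbers come out as described. Variable names are mine.

```python
from fractions import Fraction as F

# P(Sn | meta-theory) for domains 1..3; P(Cn | meta-theory) is the complement.
P_SIMPLE = {
    "S":         [F(3, 4), F(3, 4), F(3, 4)],
    "S-until-T": [F(3, 4), F(3, 4), F(1, 4)],
}
P_E_SIMPLE = [F(3, 4), F(3, 4), F(3, 4)]   # P(En | Sn)
P_E_COMPLEX = [F(1, 4), F(1, 4), F(3, 4)]  # P(En | Cn)

def evidence_likelihood(meta, n):
    """P(En | meta-theory), summing over the simple and complex theory of domain n."""
    s = P_SIMPLE[meta][n]
    return P_E_SIMPLE[n] * s + P_E_COMPLEX[n] * (1 - s)

# Assumption: equal priors over the two competing meta-theories.
weights = {m: F(1, 2) for m in P_SIMPLE}
for n in (0, 1):  # update with E1 and E2
    for m in weights:
        weights[m] *= evidence_likelihood(m, n)
total = sum(weights.values())
post = {m: w / total for m, w in weights.items()}

# Prior on S3 produced by the updated meta-theories.
prior_S3 = sum(P_SIMPLE[m][2] * post[m] for m in post)
print(post["S"], prior_S3)  # 1/2 1/2
```

Both meta-theories assign the same likelihood to E1 and E2, so the evidence cannot separate them, and their opposite opinions about S3 cancel out exactly.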

Because both Occam and Occam(t) are supported by the same amount of evidence, equal priors will be assigned to S3 and C3. The only way out of this is for Occam and Occam(t) to have different priors themselves. But this leaves us back where we started!

We are just recasting the original problem at the meta level; we end up begging the question[1] or in an infinite regress.

In conclusion, we have succeeded in formalizing meta-theoretic induction in a bayesian setting, and have verified that it works as intended. However, it ultimately does not solve the problem of justifying simplicity. The simplicity principle remains a prior belief independent of experience.

(The two networks used in this post are and, you need the SamIam tool to open these files)

[1] Simplicity is justified if we previously assume simplicity

Formalizing meta-theoretic induction

In this post I formalize the discussion presented here, recall

Simpler theories are more likely to be true because they have been so in the past

We want to formalize this statement into something that integrates into a bayesian scheme, such that the usual inference process, updating probabilities with evidence, works. The first element we want to introduce into our model is the notion of a meta-theory. A meta-theory is a statement about theories, just as a theory is a statement about observations (or the world if you prefer a realist language).

As a first approximation, we could formalize meta-theories as priors over theories. In this way, a meta-theory prior, together with observations, would yield probabilities for theories through the usual updating process. This formalization is technically trivial, we just relabel priors over theories as meta-theories. But this approach does not account for the second half of the original statement

..because they have been so in the past.

As pure priors, meta-theories would never be the object of justification. We need a way to represent a meta-theory such that it favours some theories over others and such that it can be justified through observations. In order to integrate with normal theories, meta-theories must accumulate probability via conditioning on observations, just as normal theories do.

We cannot depend on or add spurious observations like “this theory was right” as a naive mechanism for updating; this would split the meta and theory level. Evidence like “this theory was right” must be embedded in existing observations, not duplicated somewhere else as a stand alone, ad-hoc ingredient.

Finally, the notion of meta-theory introduces another concept, that of distinct theory domains. This concept is necessary because it is through cross-theory performance that a meta-theoretical principle can emerge. No generalization or principle would be even possible if there were no different theories to begin with. Because different theories may belong to different domains, meta-theoretic induction must account for logical dependencies pertaining to distinct domains; these theories make explicit predictions only about their domain.

Summing up:

Our model will consist of observations/evidence, theories and meta-theories. Theories and corresponding observations are divided into different domains; meta-theories are theories about theories, and capture inter-theoretic dependencies (see below). Meta-theories do not make explicit predictions.

Let’s begin by introducing terms

En: An element of evidence for domain n [1]

Hn: A theory over domain n

M: A meta-theory

Observations that do not pertain to a theory’s domain will be called external evidence. An important assumption in this model is that theories are conditionally independent of external observations given a meta-theory. This means that a theory depends on external observations only through those observations’ effects on meta-theories[2].

We start the formalization of the model with our last remark, conditional independence of theories and external observations given a meta-theory

P(Hn|Ex,M) = P(Hn|M) …………………… (1)

Additionally, any evidence is conditionally independent of a meta-theory given its corresponding theory, i.e. it is theories that make predictions, meta-theories only make predictions indirectly by supporting theories.

P(En|M,Hn) = P(En|Hn) …………………… (2)

Now we define how a meta-theory is updated

P(M|En) = P(En|M) * P(M) / P(En) …………………… (3)

this is just Bayes’ theorem. The important term is the likelihood, which by the law of total probability is

P(En|M) = P(En|M,Hn) * P(Hn|M) + P(En|M,¬Hn) * P(¬Hn|M)

which by conditional independence (2)

P(En|M) = P(En|Hn) * P(Hn|M) + P(En|¬Hn) * P(¬Hn|M) …………………… (4)

This equation governs how a meta-theory is updated with new evidence En. Now to determine how the meta-theory determines a theory’s prior. Again by total probability

P(Hn|Ex) = P(Hn|Ex,M) * P(M|Ex) + P(Hn|Ex,¬M) * P(¬M|Ex)

which by conditional independence (1)

P(Hn|Ex) = P(Hn|M) * P(M|Ex) + P(Hn|¬M) * P(¬M|Ex) …………………… (5)

The following picture illustrates how evidence updates a meta-theory which in turn produces a prior. Note that evidence E1 and E2 are external to H3

Lastly, updating a theory based on matching evidence is, as usual

P(Hn|En) = P(En|Hn) * P(Hn) / P(En) …………………… (6)

Equations 3,4,5 and 6 are the machinery of the model through which evidence can be processed in sequence. See it in action in the next post.
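As a rough sketch of how equations 3 to 6 chain together, here they are as small functions for the binary case used in the derivation (M versus ¬M, Hn versus ¬Hn). The function names and the numbers in the example are illustrative assumptions of mine, chosen only to show the flow from external evidence to a theory's posterior.

```python
from fractions import Fraction as F

def evidence_given_meta(p_e_h, p_e_not_h, p_h_m):
    """Equation (4): P(En|M) = P(En|Hn) P(Hn|M) + P(En|¬Hn) P(¬Hn|M)."""
    return p_e_h * p_h_m + p_e_not_h * (1 - p_h_m)

def update_meta(p_m, p_e_m, p_e_not_m):
    """Equation (3), with P(En) expanded over M and ¬M."""
    p_e = p_e_m * p_m + p_e_not_m * (1 - p_m)
    return p_e_m * p_m / p_e

def theory_prior(p_h_m, p_h_not_m, p_m):
    """Equation (5): P(Hn|Ex) = P(Hn|M) P(M|Ex) + P(Hn|¬M) P(¬M|Ex)."""
    return p_h_m * p_m + p_h_not_m * (1 - p_m)

def update_theory(p_h, p_e_h, p_e_not_h):
    """Equation (6), with P(En) expanded over Hn and ¬Hn."""
    p_e = p_e_h * p_h + p_e_not_h * (1 - p_h)
    return p_e_h * p_h / p_e

# Illustrative numbers (assumed): one external domain 1, then matching evidence in domain 2.
p_m = F(1, 2)                                             # neutral prior on M
p_e1_m = evidence_given_meta(F(3, 4), F(1, 4), F(3, 4))   # P(E1|M),  with P(H1|M)  = 3/4
p_e1_not_m = evidence_given_meta(F(3, 4), F(1, 4), F(1, 4))  # P(E1|¬M), with P(H1|¬M) = 1/4
p_m = update_meta(p_m, p_e1_m, p_e1_not_m)                # P(M|E1)
p_h2 = theory_prior(F(3, 4), F(1, 4), p_m)                # P(H2|E1), prior from the meta-theory
p_h2_post = update_theory(p_h2, F(3, 4), F(1, 4))         # P(H2|E1,E2)
print(p_m, p_h2, p_h2_post)  # 5/8 9/16 27/34
```

The external evidence E1 never touches H2 directly. It raises the probability of M, M raises the prior on H2, and only then does the matching evidence E2 update H2 in the usual way.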


[1] A given En represents a sequence of observations made for a domain n. So Hn|En represents induction in a single step, although in practice it would occur with successive bayesian updates for each subelement of evidence.

[2] This characteristic is the meta analogue of conditional independence between observations given theories. In other words, just as logical dependencies between observations are mediated by theories, inter-domain logical dependencies between theories are mediated by meta-theories.