# Meta-theoretic induction in action

In my last post I presented a simple model of meta-theoretic induction. Let’s instantiate it with concrete data and run through it. Say we have

- E1, E2, E3: observations made for domains 1-3
- S1, S2, S3: simple theories for domains 1-3
- C1, C2, C3: complex theories for domains 1-3
- S: meta-theory favoring simple theories
- C: meta-theory favoring complex theories

That is, we have three domains of observation with corresponding theories. We also have two meta-theories that will produce priors on theories. The meta-theories themselves will be supported by the theories’ successes or failures: successes of simple theories support S, successes of complex theories support C. Now define the content of the theories through their likelihoods

En P(En|Sn) P(En|Cn)
E1 3/4 1/4
E2 3/4 1/4
E3 3/4 3/4

Given that E1, E2 and E3 are evidence, this presents a scenario where theories S1 and S2 were successful, whereas theories C1 and C2 were not. S3 and C3 represent theories that are equally well supported by previous evidence (E3) but with different future predictions. This is the crux of the example, where the simplicity bias enters into the picture. Our meta-theories are defined by

P(Sn|S) = 3/4, P(Sn|C) = 1/4

P(Cn|C) = 3/4, P(Cn|S) = 1/4

Meta-theory S favors simple theories, whereas meta-theory C favors complex theories. Finally, our priors are neutral

P(Sn) = P(Cn) = 1/2

P(S) = P(C) = 1/2

We want to process evidence E1 and E2, and see what happens at the critical point, where S3 and C3 make the same predictions about E3. The sequence is as follows

1. Update meta-theories S and C with E1 and E2
2. Produce a prior on S3 and C3 with the updated C and S
3. Update S3 and C3 with E3

The last step produces probabilities for S3 and C3; these theories make identical predictions but will have different priors granted by S and C. This will formalize the statement

Simpler theories are more likely to be true because they have been so in the past
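The three-step sequence above can be sketched by hand in a few lines of Python. This is only an illustrative sketch using the numbers defined so far; the names `likelihood`, `theory_given_meta`, `normalize` and `evidence_given_meta` are my own and not part of the original model files. Exact fractions are used so that no rounding hides the result.

```python
from fractions import Fraction as F

# Likelihoods P(En | Hn) from the table above, keyed by domain, then theory kind.
likelihood = {
    1: {"S": F(3, 4), "C": F(1, 4)},
    2: {"S": F(3, 4), "C": F(1, 4)},
    3: {"S": F(3, 4), "C": F(3, 4)},
}
# P(theory kind | meta-theory): outer key is the meta-theory, inner key the theory kind.
theory_given_meta = {
    "S": {"S": F(3, 4), "C": F(1, 4)},
    "C": {"S": F(1, 4), "C": F(3, 4)},
}
meta = {"S": F(1, 2), "C": F(1, 2)}  # neutral prior on the meta-theories


def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def evidence_given_meta(n, m):
    """P(En | M=m), marginalizing over the two theories of domain n."""
    return sum(likelihood[n][h] * theory_given_meta[m][h] for h in ("S", "C"))


# Step 1: update the meta-theories with E1 and E2.
for n in (1, 2):
    meta = normalize({m: p * evidence_given_meta(n, m) for m, p in meta.items()})

# Step 2: the updated meta-theories induce a prior on S3 and C3.
prior3 = {h: sum(meta[m] * theory_given_meta[m][h] for m in meta) for h in ("S", "C")}

# Step 3: update S3 and C3 with E3 (equal likelihoods, so the prior carries through).
posterior3 = normalize({h: prior3[h] * likelihood[3][h] for h in prior3})

print("P(M | E1, E2):", meta)                 # S: 25/34, C: 9/34
print("P(H3 | E1, E2, E3):", posterior3)      # S: 21/34, C: 13/34
```

Since E3 is equally likely under S3 and C3, step 3 leaves the prior from step 2 unchanged; any difference between S3 and C3 comes entirely from the meta-theories.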

### The model as a Bayesian network

Instead of doing all the above by hand (using equations 3, 4, 5 and 6), it’s easier to construct the corresponding Bayesian network and let some general algorithm do the work. Formulating the model this way makes it much easier to understand; in fact, it seems almost trivial. Additionally, our assumptions of conditional independence (1 and 2) map directly onto the Bayesian network formalism of nodes and edges, which is quite convenient. Node M represents the meta-theory, with possible values S and C; the Hn nodes represent theories, with possible values Sn and Cn. Note the lack of edges between Hn and Ex, formalizing (1), and the lack of edges between M and En, formalizing (2) (these were our assumptions of conditional independence).
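For readers without a Bayesian network tool at hand, the same network can be queried by brute-force enumeration of its joint distribution, P(M) times the product over n of P(Hn|M) P(En|Hn). The sketch below is an assumption-laden stand-in for a real inference engine, not how any particular tool works internally; `joint` and `posterior` are hypothetical helper names, and all three evidence nodes are assumed to be fixed to their observed outcome.

```python
from fractions import Fraction as F
from itertools import product

# Network parameters taken from the definitions above.
p_meta = {"S": F(1, 2), "C": F(1, 2)}                      # P(M)
p_theory = {                                                # P(Hn | M), same for every n
    "S": {"S": F(3, 4), "C": F(1, 4)},
    "C": {"S": F(1, 4), "C": F(3, 4)},
}
p_obs = [                                                   # P(En = observed | Hn)
    {"S": F(3, 4), "C": F(1, 4)},
    {"S": F(3, 4), "C": F(1, 4)},
    {"S": F(3, 4), "C": F(3, 4)},
]


def joint(m, hs):
    """Joint probability of meta-theory m, theories hs and the observed evidence."""
    p = p_meta[m]
    for n, h in enumerate(hs):
        p *= p_theory[m][h] * p_obs[n][h]
    return p


def posterior(query):
    """Posterior over query(m, hs), summing the joint over every assignment."""
    weights = {}
    for m in p_meta:
        for hs in product("SC", repeat=3):
            key = query(m, hs)
            weights[key] = weights.get(key, 0) + joint(m, hs)
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


meta_post = posterior(lambda m, hs: m)       # P(M | E1, E2, E3)
h3_post = posterior(lambda m, hs: hs[2])     # P(H3 | E1, E2, E3)

print({k: float(v) for k, v in meta_post.items()})
print({k: float(v) for k, v in h3_post.items()})
```

Enumeration is exponential in the number of nodes, which is harmless for a network this small but is exactly why dedicated tools use smarter algorithms.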

I constructed this network using the SamIam tool developed at UCLA. With this tool we can construct the network and then monitor probabilities as we input data into the model, using the tool’s Query Mode. So let’s do that, fixing the actual outcome of the evidence nodes E1, E2 and E3.

Theories S1 and S2 make correct predictions and are thus favoured by the data over C1 and C2. This in turn favours the meta-theory S, which is assigned a probability of 73% over meta-theory C, with 26%. Now, theories S3 and C3 make the same predictions about E3, but because meta-theory S is better supported, they are assigned different probabilities. Again, recall our starting point

Simpler theories are more likely to be true because they have been so in the past

We can finally state this technically: the simple theory S3 is favored at 61% over C3 with 38%, even though they make the same predictions. In fact, we can see how this works if we compare what happens with and without meta-theoretic induction: without it, as expected, the mirrors of S3 and C3 would be granted the same probabilities. So everything seems to work: our meta-theory discriminates between different theories and is itself justified via experience, as was the objective

Occam seems like an unjustified and arbitrary principle, in effect, an unsupported bias. Surely, there should be some way to anchor this widely applicable principle on something other than arbitrary choice. We need a way to represent a meta-theory such that it favours some theories over others and such that it can be justified through observations.

But what happens when we add a meta-theory like Occam(t) into the picture? What happens when we apply, at the meta-level, the same argument that prompted the meta-theoretic justification of simplicity we’ve developed? We define a meta-theory S-until-T with

P(S1|S-until-T) = P(S2|S-until-T) = 3/4

P(S3|S-until-T) = 1/4

which yields this network. Now both S and S-until-T accrue the same probability through evidence, so S3 and C3 receive the same prior, 50%. It seems we can’t escape our original problem.
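The stalemate can be reproduced with a short calculation. The sketch below assumes the meta-theory node now takes the two values S and S-until-T with equal priors, and that P(Cn|M) = 1 - P(Sn|M); both are my reading of the setup rather than something read off the network file, and the variable names are illustrative only.

```python
from fractions import Fraction as F

# P(Sn | M) for domains 1-3; P(Cn | M) is assumed to be the complement.
p_simple = {
    "S":         [F(3, 4), F(3, 4), F(3, 4)],
    "S-until-T": [F(3, 4), F(3, 4), F(1, 4)],
}
p_e_simple = [F(3, 4), F(3, 4), F(3, 4)]    # P(En | Sn)
p_e_complex = [F(1, 4), F(1, 4), F(3, 4)]   # P(En | Cn)

# Assumed neutral prior over the two competing meta-theories.
meta = {m: F(1, 2) for m in p_simple}

# Update the meta-theories with E1 and E2 (indices 0 and 1).
for n in (0, 1):
    weights = {
        m: meta[m] * (p_simple[m][n] * p_e_simple[n]
                      + (1 - p_simple[m][n]) * p_e_complex[n])
        for m in meta
    }
    total = sum(weights.values())
    meta = {m: w / total for m, w in weights.items()}

# Prior on S3 induced by the updated meta-theories.
prior_s3 = sum(meta[m] * p_simple[m][2] for m in meta)

print(meta)       # both meta-theories remain at 1/2
print(prior_s3)   # 1/2
```

The two meta-theories agree on domains 1 and 2, so E1 and E2 cannot separate them; they only disagree on domain 3, where their opposite preferences cancel out.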

Because both Occam and Occam(t) are supported by the same amount of evidence, equal priors will be assigned to S3 and C3. The only way out of this is for Occam and Occam(t) to have different priors themselves. But this leaves us back where we started!

We are just recasting the original problem at the meta level; we end up either begging the question or in an infinite regress.

In conclusion, we have succeeded in formalizing meta-theoretic induction in a Bayesian setting, and have verified that it works as intended. However, it ultimately does not solve the problem of justifying simplicity. The simplicity principle remains a prior belief independent of experience.

(The two networks used in this post are metainduction1.net and metainduction2.net; you need the SamIam tool to open these files.)

 Simplicity is justified if we previously assume simplicity