I’ve defined learning as *the extraction of generally applicable knowledge from specific examples*. In that post I remarked

An agent may have the ability to learn, but that is not enough to guarantee that learning does in fact take place [1]. The extra necessary ingredient is that the target of learning must be

learnable.

Today I’m going to a present a model where learning is impossible in the context of bayesian inference. We will see in this case that not only must the phenomenon be learnable, but also that the learning agent must incorporate a bias to exploit existing regularity. Without such a bias, the learner cannot penalize complex “noisy” hypotheses that fit the data.

As components of the model we have an **agent**, an **environment** from which **observations** are made, and **theories** the agent probabilistically reasons about as the object of its learning. For observations we use a binary sequence, *S = {0, 1}^n*, like for example

*S = 1010111010*

The learning agent sees a number of elements and must try to predict subsequent ones according to different theories, which are of the form *H = {0,1}^n*. An important aspect of the model is that the agent will consider *all possible theories* that can explain and predict observations. The number of theories is therefore equal to the number of possible observations, which are both *2^n*. If the agent considered a smaller number of theories it could be that the true theory describing the environment would be left out.

Furthermore, let’s say that a-priori, the agent *has no reason to consider any theory more likely than the rest*. So it will assign an a-priori equal probability to each theory:

*P(H) = 1 / 2^n*

Define the total observations up to a given time as *Si*, where* i <= n*, and that a given theory is *Hk*, where *k <= n*. We can apply bayes theorem to obtain the probability that a given theory is true given the observations (feel free to skip the math down to the conclusion):

*P(Hk|Si) = P(Si|Hk)*P(Hk) / P(Si)*

and the probability of a given sequence of observations P(Si) is obtained by summing[1] over all theories that yield such a prediction:

*P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)*

in other words, summing over all theories that begin with Si. To see exactly whats happening let’s restrict the example to n = 4. This gives us a total of 2^4 = 16 possible observations and theories. Say the agent has observed three elements ‘111’ and call the sequence S3:

S3 = 111

Let’s calculate the posterior probability on theories for this case. First for theories that do* not* predict *111*:

*P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)
*

but since P(Si|Hk) = 0, then

*P(Hk|Si) = 0*

ie theories that do not predict *111* are ruled out as should be the case. There are two theories that *do* predict *111*:

*H1 = {1110}*

*H2 = {1111}*

the denominator of the posterior is

*Sum(k) P(S3|Hk)*P(Hk)*

there are two theories that predict the sequence, therefore

*Sum(k) P(S3|Hk)*P(Hk) = P(H1) + P(H2)*

plugin this in, the posterior is therefore

*P(H1|S3) = P(S3|H1)*P(H1) / [P(H1) + P(H2)]*

*P(H2|S3) = P(S3|H2)*P(H2) / [P(H1) + P(H2)]*

since both H1 and H2 predict S3 (P(S3|H) = 1), this reduces to

*P(H1|S3) = P(H1) / [P(H1) + P(H2)]*

*P(H2|S3) = P(H2) / [P(H1) + P(H2)]*

but because all theories are equally likely a priori

*P(H1) = P(H2) = 1/16*

so

*P(H1|S3) = 1/16/ [1/16 + 1/16] = 1/2*

and similarly

*P(H2|S3) = **1/16/ [1/16 + 1/16] =** 1/2*

So H1 and H2 are assigned equal probabilities, 1/2. Because no other theories are possible and 1/2 + 1/2 = 1, it all works out. Now, the agent will use these two theories to predict the next observation:

*P(1110|S3) = 1 * 1/2 + 0 * 1/2 = 1/2*

P(1111|S3) = 0 * 1/2 + 1 * 1/2 = **1/2**

Thus, the agent considers that it is equally likely for the next element to be *1* or *0*.

But there is nothing special about the example we chose with n = 4 and S3 = 111. In fact, you could carry out the exact same calculations for any n, and S. Here’s the key point, the* learning agent makes the exact same predictions as to what will happen no matter how many observations it has made, and no matter what those observations are*. Whatever the sequence of events, it does not gain any knowledge about the future from the past, **learning is impossible**.

I’m going to leave the discussion for later posts, but here are some relevant questions that will come up:

Does learning *logically* require bias? Can one meaningfully speak of theories when there is no compression of observations? What happens when the model is extended to an infinite number of observations/theories? Is this an adequate (though simplistic) model of scientific investigation/knowledge?

Notes/references

[1] I’m using the notation Sum(n) as the equivalent of the Sigma sum over elements with subscript n

David: You have defined a model in which it is impossible to learn because there is nothing to be learned. There is no structure in the underlying reality. Although one can say that in your example we learn precisely that, that the next value is random with probabilities 50:50.

Learning is about having structure. For instance, if observations provide you with 90% ones and 10% zeroes, your reasoning seems to say that still you don’t have a reason to assign a higher probability to a theory that predicts the same proportion in future observations. Any theory is as good as any other (this does not imply that you must give them equal probabilities either). You would be right in saying this, of course, but here is where the structure comes. We also have observations that say that considering theories with structure is better than otherwise, and so we use this kind of theories.

Of course, you can take the argument one step further (or nerarer) and say that reality may not humor us with the usefulness of these theories in the future. My point is not to say that we can have an ultimate logical support for knowlegde (I deny that, and I think that you do too), rather, I am saying that the use of statistical inference is not a circular argument, as you can have models in which it works. Of course you can have models in which it doesn’t. But this is besides my point, and I think this is something we have to live with.

I disagree, the model I have provided may or may not have structure, the model is agnostic about that. I could tell you that in that model, the underlying reality always shows ‘1’. The point is that, despite the structure present in the environment, the agent is unable to learn anything due to its prior assumptions. In other words, it is the combination of structure in the environment AND bias in the agent, that allows learning (in this case). In a further post, I’ll show what kind of bias it is and how it would work in this particular example (ie, with a neverending sequence of ‘1’)

As for the rest of your comment, agreed.

We don’t need to disagree in the first part. I was taking the process as a whole. I am also saying is that there must be structure, as you also point out. What I am saying is that what you call bias may not be independent from the fact that there is structure. In a formal model, as the one you present, it may be, but in all (I think all) applications our bias is about structure, and may well be dependent on it. Part of our priors may come from models about reality endowed in our instincts by evolution, or may come from having learned other structures in other realms.

My point is, again: we have models in which learning is possible as well as models in which learning is impossible. But this does not seem a point of disagreement. What seems to be de disagreement is about the need of a bias, and this may be more a matter of semantics. We may call it a structure on the priors that may or may not have a relation with the structure in the reality. Again, we can construct models of the former (showing some relation) that show learning. I don’t know if we can construct models of the later (with no relation) whatsoever. But I will wait for your next post before going any further.

Agreed. However, the problem of induction, in a universal sense, is that the prior need not correspond to reality, as in the end, it is _prior_ to experience

Yes