Previously I presented a model where learning is impossible. In this post I want to emphasize that
...not only must the phenomenon be learnable, but additionally the learning agent must incorporate a bias to exploit existing regularity. Without such a bias, the learner cannot penalize complex “noisy” hypotheses that fit the data.
In the model, the environment (phenomenon) that gives rise to observations in the form of a binary sequence of ‘0’ and ‘1’ is left unspecified. Nothing is said about how the environment evolves, whether it is deterministic or stochastic, or whether it follows a certain rule or not. The model I presented is thus completely orthogonal to the nature of the environment. And yet:
Whatever the sequence of events, the learning agent does not gain any knowledge about the future from the past; learning is impossible.
So a property of the learning agent, the subject, makes learning impossible irrespective of the environment, the object. This property of the subject is the belief that all sequences of observations are equally likely, that is, the lack of an a priori bias favoring any of the outcomes. Even if the object were completely predictable, the subject would be unable to learn. Learning imposes constraints on both subject and object, hence the subject-object distinction.
To drive this point home, we could specify any environment and note how the conclusions regarding the model would not change. I hinted at this by presenting an example of the environment’s evolution that began with 111. Now assume the environment is such that it produces ‘1’ indefinitely in a completely deterministic and predictable way. You could interpret this as the classical example in philosophical treatments of induction: ‘1’ means that the sun rises the next day, and ‘0’ means that the sun does not rise. But again, this would make no difference: the learner would never catch on to this regularity.
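To make this concrete, here is a minimal sketch that enumerates every binary sequence of a small, assumed finite length, gives each the same prior weight, and conditions on having seen only ‘1’ so far. The function name `predictive_prob_of_one` and the `horizon` parameter are made up for this illustration; they are not part of the original model.

```python
from fractions import Fraction
from itertools import product


def predictive_prob_of_one(observed, horizon):
    """P(next bit = 1 | observed) under a uniform prior over all
    binary sequences of length `horizon` (illustrative helper)."""
    n = len(observed)
    assert n < horizon, "need at least one unobserved bit"
    prior = Fraction(1, 2 ** horizon)  # every full sequence equally likely
    consistent = Fraction(0)           # mass of sequences matching the data
    next_is_one = Fraction(0)          # ... that also continue with a 1
    for seq in product((0, 1), repeat=horizon):
        if seq[:n] == tuple(observed):
            consistent += prior
            if seq[n] == 1:
                next_is_one += prior
    return next_is_one / consistent


# The sun has risen n days in a row; how confident is the learner about tomorrow?
for n in range(6):
    print(n, predictive_prob_of_one([1] * n, horizon=8))
```

Under these assumptions the printed probability is 1/2 on every line: no run of sunrises, however long, moves the learner's prediction.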
Conversely, specifying an unlearnable environment will not do the learner any good either, of course. In fact, the astute reader will have realized that the learning agent’s prior corresponds exactly to the belief that the sequence of ‘0’ and ‘1’ is the result of a series of flips of a fair coin. And given that assumption about the coin, previous coin flips do not yield any information that serves to make predictions about future coin flips; the environment has no structure to be learned.
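The equivalence with fair coin flips can be checked in a couple of lines. As a sketch, write N for an assumed finite sequence length and x_1, ..., x_N for the individual bits (notation introduced here for illustration only):

```latex
P(x_1,\dots,x_N) = 2^{-N} = \prod_{i=1}^{N} \tfrac{1}{2},
\qquad
P(x_{n+1}=1 \mid x_1,\dots,x_n)
  = \frac{2^{N-n-1}\cdot 2^{-N}}{2^{N-n}\cdot 2^{-N}}
  = \tfrac{1}{2}.
```

The numerator counts the full sequences that extend the observed prefix with a ‘1’, the denominator counts all full sequences that extend the prefix. The observed bits cancel out entirely, which is exactly what independence of fair coin flips means.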
Most problems of Bayesian inference include, in the problem statement, a description of the environment that is automatically used as the agent’s prior knowledge, or at least as a starting point, as in the typical case of drawing balls from an urn. This prior knowledge is incomplete, as no inference would be necessary otherwise. But in these cases the subject-object distinction is not so apparent; analysis of the agent’s learning performance assumes, of course, that the problem definition is true!
However, the subject-object distinction is more important when asking what model of inference applies to the scientific investigation of nature and the problem of induction. This is because in these models there is no problem definition; prior knowledge is genuinely prior to any experience.
Pending questions: What problem definition applies to inductive inference in science? What happens when extending our model to cases with infinite observations/theories? Does learning logically require bias, and if so, what bias is universally appropriate and intuitively acceptable?
 In fact, not only must there be bias, but it must be a bias that exploits structure. Altering the distribution so that it favors ‘1’ in the sequence irrespective of previous observations is a bias, but it does not allow learning. This is another important distinction: the entropy-learnability distinction.
 Or after 24 hours, if you want to be picky about tautologies.
 Another, more technical question: can prior knowledge in inference problems always be recast as a bias over theories that are deterministic predictions over entire sequences of possible events (as we saw when noting that the binary sequence model is equivalent to a repeated coin flip scenario)? If so, what property of these distributions allows learning?
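As a rough illustration of the notes above (and not an answer to the question just posed), the sketch below compares three priors written as distributions over entire sequences: the uniform prior, a prior that favors ‘1’ independently of previous observations, and a prior that puts extra mass on the single theory "always 1". The last prior, the parameter values 9/10 and 1/2, the finite `HORIZON`, and all function names are assumptions chosen for illustration only.

```python
from fractions import Fraction
from itertools import product

HORIZON = 8  # assumed finite sequence length, used only for enumeration


def uniform_prior(seq):
    return Fraction(1, 2 ** len(seq))


def iid_biased_prior(seq, p_one=Fraction(9, 10)):
    # Each bit is 1 with probability p_one, independently of the others.
    prob = Fraction(1)
    for bit in seq:
        prob *= p_one if bit == 1 else 1 - p_one
    return prob


def all_ones_favoring_prior(seq, weight=Fraction(1, 2)):
    # Hypothetical prior: mass `weight` on the theory "always 1",
    # with the remaining mass spread uniformly over all sequences.
    bonus = weight if all(bit == 1 for bit in seq) else Fraction(0)
    return bonus + (1 - weight) * uniform_prior(seq)


def predictive_prob_of_one(prior, observed):
    """P(next bit = 1 | observed) for a prior over full sequences."""
    n = len(observed)
    consistent = Fraction(0)
    next_is_one = Fraction(0)
    for seq in product((0, 1), repeat=HORIZON):
        if seq[:n] == tuple(observed):
            mass = prior(seq)
            consistent += mass
            if seq[n] == 1:
                next_is_one += mass
    return next_is_one / consistent


priors = [
    ("uniform", uniform_prior),
    ("iid biased", iid_biased_prior),
    ("all-ones favoring", all_ones_favoring_prior),
]
for name, prior in priors:
    # Predictions after observing 0, 1, 2, 3, 4 consecutive ones.
    print(name, [float(predictive_prob_of_one(prior, [1] * n)) for n in range(5)])
```

With these particular choices, the first two priors give a prediction that never changes with the data (1/2 and 9/10 respectively), while the third grows more confident with each additional ‘1’. Only the third prior lets the past inform the future, even though the second is also biased.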