Learning and the subject-object distinction

Previously I presented a model where learning is impossible. In this post I want to emphasize

…not only must the phenomenon be learnable, but additionally the learning agent must incorporate a bias to exploit existing regularity. Without such a bias, the learner cannot penalize complex “noisy” hypotheses that fit the data.

In the model, the environment (phenomenon) that gives rise to observations in the form of a binary sequence of ‘0’s and ‘1’s is left unspecified. Nothing is said about how the environment evolves, whether it is deterministic or stochastic, or whether it follows a certain rule or not. The model I presented is thus completely orthogonal to the nature of the environment. And yet:

Whatever the sequence of events, the learning agent does not gain any knowledge about the future from the past: learning is impossible.

So a property of the learning agent, the subject, makes learning impossible irrespective of the environment, the object. This property of the subject is the belief that all sequences of observations are equally likely, that is, the lack of a-priori bias favoring any of the outcomes[1]. Even if the object were completely predictable, the subject would be unable to learn. Learning imposes constraints on both; hence the subject-object distinction.

To drive this point home, we could specify any environment and note how the conclusions regarding the model would not change. I hinted at this by presenting an example of the environment’s evolution that began with 111. Now assume the environment is such that it produces ‘1’ indefinitely, in a completely deterministic and predictable way. You could interpret this as the classical example in philosophical treatments of induction: ‘1’ means that the sun rises the next day[2], and ‘0’ means that the sun does not rise. But again, this would make no difference: the learner would never catch on to this regularity.

Conversely, specifying an unlearnable environment will not do the learner any good either, of course. In fact, the astute reader will have realized that the learning agent’s prior corresponds exactly to the belief that the sequence of ‘0’s and ‘1’s is the result of a series of flips of a fair coin. And of course, given that assumption about the coin, previous coin flips do not yield any information that serves to make predictions about future coin flips; the environment has no structure to be learned.
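To make this concrete, here is a minimal Python sketch (my own illustration, not part of the original model) that enumerates all 2^n theories, conditions on an observed prefix, and computes the probability that the next element is ‘1’. Under the uniform prior the answer is always 1/2, and under a ‘1’-favoring but structure-blind prior, the kind of bias footnote [1] mentions (the 0.7 weight is an arbitrary choice of mine), it is always the fixed bias, whatever has been observed.

from itertools import product

def predictive_prob_one(prefix, n, weight):
    # P(next element = '1' | observed prefix), given a weight over all 2^n theories
    theories = [''.join(bits) for bits in product('01', repeat=n)]
    consistent = [t for t in theories if t.startswith(prefix)]   # theories not ruled out
    total = sum(weight(t) for t in consistent)
    ones = sum(weight(t) for t in consistent if t[len(prefix)] == '1')
    return ones / total

n = 10
uniform = lambda t: 1.0                                          # every theory equally weighted
biased = lambda t: 0.7 ** t.count('1') * 0.3 ** t.count('0')     # favors '1', but blind to structure

for prefix in ['', '1', '111', '101011']:
    p_unif = predictive_prob_one(prefix, n, uniform)
    p_bias = predictive_prob_one(prefix, n, biased)
    print(repr(prefix), round(p_unif, 3), round(p_bias, 3))      # always 0.5 and 0.7, for every prefix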

Most problems of Bayesian inference include, in the problem statement, a description of the environment that is automatically used as the agent’s prior knowledge, or at least as a starting point, as in the typical case of drawing balls from an urn. This prior knowledge is incomplete of course, as otherwise no inference would be necessary. But in these cases the subject-object distinction is not so apparent; any analysis of the agent’s learning performance simply assumes that the problem definition is true!

However, the subject-object distinction is more important when asking what model of inference applies to the scientific investigation of nature and the problem of induction. This is because in these models there is no problem definition; prior knowledge is genuinely prior to any experience.

Pending questions: What problem definition applies to inductive inference in science? What happens when extending our model to cases with infinite observations/theories? Does learning logically require bias, and if so, what bias is both universally appropriate and intuitively acceptable?[3]


[1] In fact, not only must there be bias, but it must be a bias that exploits structure. Altering the distribution so that it favors more ‘1’s in the sequence, irrespective of previous observations, is a bias, but it does not allow learning. This is another important distinction: the entropy-learnability distinction.

[2] Or after 24 hours, if you want to be picky about tautologies.

[3] Another, more technical question: can prior knowledge in inference problems always be recast as a bias over theories that are deterministic predictions over entire sequences of possible events (as we saw when noting that the binary sequence model is equivalent to a repeated coin flip scenario)? If so, what property of these distributions allows learning?

When learning is impossible


I’ve defined learning as the extraction of generally applicable knowledge from specific examples. In that post I remarked:

An agent may have the ability to learn, but that is not enough to guarantee that learning does in fact take place. The extra necessary ingredient is that the target of learning must be learnable.

Today I’m going to present a model where learning is impossible in the context of Bayesian inference. We will see in this case that not only must the phenomenon be learnable, but also that the learning agent must incorporate a bias to exploit existing regularity. Without such a bias, the learner cannot penalize complex “noisy” hypotheses that fit the data.

As components of the model we have an agent, an environment from which observations are made, and theories the agent probabilistically reasons about as the object of its learning. For observations we use a binary sequence S ∈ {0, 1}^n, for example

S = 1010111010

The learning agent sees a number of elements and must try to predict subsequent ones according to different theories, which are of the form H ∈ {0, 1}^n. An important aspect of the model is that the agent will consider all possible theories that can explain and predict observations. The number of theories is therefore equal to the number of possible observation sequences; both are 2^n. If the agent considered a smaller number of theories, it could be that the true theory describing the environment would be left out.

Furthermore, let’s say that a-priori, the agent has no reason to consider any theory more likely than the rest. So it will assign an a-priori equal probability to each theory:

P(H) = 1 / 2^n

Define the total observations up to a given time as Si, where i <= n, and index the theories as Hk, where k <= 2^n. We can apply Bayes’ theorem to obtain the probability that a given theory is true given the observations (feel free to skip the math down to the conclusion):

P(Hk|Si) = P(Si|Hk)*P(Hk) / P(Si)

and the probability of a given sequence of observations P(Si) is obtained by summing[1] over all theories that yield such a prediction:

P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)

in other words, summing over all theories that begin with Si. To see exactly what’s happening, let’s restrict the example to n = 4. This gives us a total of 2^4 = 16 possible observations and theories. Say the agent has observed three elements ‘111’; call the sequence S3:

S3 = 111

Let’s calculate the posterior probability on theories for this case. First for theories that do not predict 111:

P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)

but since P(Si|Hk) = 0, then

P(Hk|Si) = 0

i.e. theories that do not predict 111 are ruled out, as should be the case. There are two theories that do predict 111:

H1 = {1110}

H2 = {1111}

the denominator of the posterior is

Sum(k) P(S3|Hk)*P(Hk)

there are two theories that predict the sequence, therefore

Sum(k) P(S3|Hk)*P(Hk) = P(H1) + P(H2)

Plugging this in, the posterior is therefore

P(H1|S3) = P(S3|H1)*P(H1) / [P(H1) + P(H2)]

P(H2|S3) = P(S3|H2)*P(H2) / [P(H1) + P(H2)]

since both H1 and H2 predict S3 (P(S3|H) = 1), this reduces to

P(H1|S3) = P(H1) / [P(H1) + P(H2)]

P(H2|S3) = P(H2) / [P(H1) + P(H2)]

but because all theories are equally likely a priori

P(H1) = P(H2) = 1/16

so

P(H1|S3) = (1/16) / [1/16 + 1/16] = 1/2

and similarly

P(H2|S3) = (1/16) / [1/16 + 1/16] = 1/2

So H1 and H2 are assigned equal probabilities, 1/2. Because no other theories are possible and 1/2 + 1/2 = 1, it all works out. Now, the agent will use these two theories to predict the next observation:

P(1110|S3) = 1 * 1/2 + 0 * 1/2 = 1/2

P(1111|S3) = 0 * 1/2 + 1 * 1/2 = 1/2

Thus, the agent considers that it is equally likely for the next element to be 1 or 0.

But there is nothing special about the example we chose with n = 4 and S3 = 111. In fact, you could carry out the exact same calculations for any n and S. Here’s the key point: the learning agent makes the exact same predictions as to what will happen, no matter how many observations it has made and no matter what those observations are. Whatever the sequence of events, it does not gain any knowledge about the future from the past: learning is impossible.
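Here is a minimal Python sketch of the calculation (my own illustration, not code from the post) that reproduces the numbers above for n = 4 and S3 = 111; changing n or the observed prefix leaves the predictive probability at 1/2.

from itertools import product

n = 4
prefix = '111'                                     # S3
theories = [''.join(bits) for bits in product('01', repeat=n)]
prior = {t: 1 / 2**n for t in theories}            # P(Hk) = 1/16

# Likelihood P(S3|Hk): 1 if the theory begins with the observed prefix, 0 otherwise
likelihood = {t: 1.0 if t.startswith(prefix) else 0.0 for t in theories}

evidence = sum(likelihood[t] * prior[t] for t in theories)             # P(S3) = 1/8
posterior = {t: likelihood[t] * prior[t] / evidence for t in theories}

surviving = {t: p for t, p in posterior.items() if p > 0}
print(surviving)                                   # {'1110': 0.5, '1111': 0.5}

# Predictive probability that the next element is '1'
p_next_one = sum(p for t, p in surviving.items() if t[len(prefix)] == '1')
print(p_next_one)                                  # 0.5, whatever the prefix and n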

I’m going to leave the discussion for later posts, but here are some relevant questions that will come up:

Does learning logically require bias? Can one meaningfully speak of theories when there is no compression of observations? What happens when the model is extended to an infinite number of observations/theories? Is this an adequate (though simplistic) model of scientific investigation/knowledge?


Notes/references

[1] I’m using the notation Sum(k) as the equivalent of the Sigma sum over elements with subscript k.

Coincidences and explanations

I was reading about a famous article by physicist Eugene Wigner titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences, where, citing Wikipedia:

In the paper, Wigner observed that the mathematical structure of a physics theory often points the way to further advances in that theory and even to empirical predictions, and argued that this is not just a coincidence and therefore must reflect some larger and deeper truth about both mathematics and physics.

I’ll write about this in a later post, but for now this brings me to consider what we mean by coincidences and how we think about them.

In the above, a coincidence is remarked upon between two apparently independent domains: that of mathematics and that of the structure of the world. In general, when finding striking coincidences our instinct is to reach for an explanation. Why? Because by definition a striking coincidence is something of a-priori very low probability, something implausible that merits investigation to “make sense of things”.

An explanation of a coincidence is a restatement of its content that raises its probability to a level such that it is no longer a striking state of affairs; the coincidence is dissolved. Example:

Bob: Have you noticed that every time the sun rises the rooster crows? What an extraordinary coincidence!

Alice: Don’t be silly Bob, that’s not a coincidence at all, the rooster crows when it sees the sun rise. Nothing special

Bob: Erm… true. And why did David choose me to play the part of fool in this dialogue?

Alice’s everyday response to coincidence is at heart nothing other than statistical inference, be it Bayesian or classical hypothesis testing[1]. The coincidence at face value plays the role of a hypothesis (the null hypothesis) that assigns a low probability to the event, i.e. the hypothesis of a chance occurrence between two seemingly independent things. The explanation in turn plays the role of the accepted hypothesis by virtue of assigning a high probability to what is observed.
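A toy numerical version of this reasoning, with made-up numbers purely for illustration: compare a chance hypothesis, under which the rooster happens to crow on any given morning with probability 0.3, against a causal-link hypothesis, under which it crows at sunrise with probability 0.95, after observing crowing at sunrise on 10 consecutive mornings.

# Made-up illustration: comparing the two hypotheses with Bayes' theorem
p_chance, p_link = 0.5, 0.5              # prior probabilities of the hypotheses
like_chance = 0.3 ** 10                  # P(10 sunrise crowings | chance)
like_link = 0.95 ** 10                   # P(10 sunrise crowings | causal link)

evidence = like_chance * p_chance + like_link * p_link
posterior_link = like_link * p_link / evidence
print(round(posterior_link, 5))          # ~0.99999: the explanation absorbs the coincidence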

So one could say that the way we respond and deal with coincidence is really a mundane form of how science works, where theories are presented in response to facts, and those that better fit those facts are accepted as explanations of the world.

But how do explanations work internally? The content of an explanation is the establishment of a relationship between the two a-priori independent facts, typically through causal mechanisms. The causal link is what raises the probability of one given the other, and therefore of the joint event. In the example, the causal link is ‘the rooster crows when it sees the sun rise’.

But the links are not always direct. An interesting example comes from what in statistics is called a spurious relationship. Again, Wikipedia says:

An example of a spurious relationship can be illuminated examining a city’s ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice-versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both

Although the emphasis here is on the lack of a direct causal relationship, the point regarding coincidence is the same. Prior to realizing that both facts have a common cause (the explanation is the heat wave), one would have regarded the relationship between ice cream sales and drownings as a strange coincidence.

In the extreme case the explanation reveals that the two facts are really two facets of the same thing. The coincidence is dissolved: any given fact must necessarily coincide with itself. Before the universal law of gravitation, it would have been regarded as extraordinary that apples falling from a tree and the movement of the planets in the heavenly skies exhibited the same behavior. But we know now that they are really different aspects of the same phenomenon.


Notes

[1] The act of explanation is, in classical statistics language, the act of rejecting the null hypothesis. In the Bayesian picture, the explanation is what is probabilistically inferred due to the higher likelihood it assigns to the facts (and its sufficient prior probability).

Rationality slides

Here is a presentation (in Spanish) on rationality for SkeptiCamp that I’ve been working on recently. It needs some polish but the content is essentially complete. A summary of the main points:

  • Optimization and the second law of thermodynamics are opposing forces
  • Intelligence is a type of optimization that evolved in certain species to counteract the 2nd law through behavior. Intelligence functions through observation, learning and prediction
  • Prediction requires a correct representation of the environment; this defines epistemic rationality as a component of intelligence
  • Classical logic fails to model rationality as it cannot deal with uncertainty
  • Probability theory is an extension of logic to domains with uncertainty
  • Probability theory and Bayes define a standard of ideal rationality. Other methods are suboptimal approximations
  • Probability theory as formalization of rationality:
    • Provides a quantitative model of the scientific method as a special case of Bayes theorem
    • Provides operational, quantitative definitions of belief and evidence
    • Naturally relates predictive power and falsifiability through the sum rule of probability
    • Explains pathological beliefs of the vacuous kind: astrology, card reading, divination, etc.
    • Explains pathological beliefs of the floating kind: “There is a dragon in my garage”
    • Exposes fraudulent retrodiction: astrology, cold reading, ad-hoc hypotheses, bad science, bad economics, etc.
    • Dissolves false disagreements described by matching predictions but different verbal formulations
    • Naturally embeds empiricism, positivism and falsificationism
  • Pathological beliefs can be analyzed empirically by re-casting them as physical phenomena in brains, the province of cognitive science.
  • A naturalistic perspective automatically explains human deviations from rationality; evolution will always favor adaptations that increase fitness even if they penalize rationality
  • Today, politics is an example of rationality catastrophe; in the ancestral environment, irrationality that favored survival in a social context (tribes) was a successful adaptation. (Wright, Yudkowsky)

Recommended reading

Probability Theory: The Logic of Science
Bayesian Theory (Wiley Series in Probability and Statistics)
Hume’s Problem: Induction and the Justification of Belief

Various papers

Bayesian probability – Bruyninckx (2002)
Philosophy and the practice of Bayesian statistics – Gelman (2011)
Varieties of Bayesianism – Weisberg
No Free Lunch versus Occam’s Razor in Supervised Learning – Lattimore, Hutter (2011)
A Material Theory of Induction – Norton (2002)
Bayesian epistemology – Hartmann, Sprenger (2010)
The Illusion of Ambiguity: from Bistable Perception to Anthropomorphism – Ghedini (2011)
Bayesian Rationality and Decision Making: A Critical Review – Albert (2003)
Why Bayesian Rationality Is Empty, Perfect Rationality Doesn’t Exist, Ecological Rationality Is Too Simple, and Critical Rationality Does the Job* – Albert (2009)
A Better Bayesian Convergence Theorem – Hawthorne

Bayesian epistemology (Stanford)
lesswrong.com (excellent blog on rationality)

Worse than ignorant

A maximum entropy, uniform probability distribution over some set of outcomes corresponds to a state of zero knowledge about the phenomenon in question. Such a distribution assigns equal probability to every result, favoring none and prohibiting none; moreover, any result that comes about is equally compatible with said distribution. So it seems this maximum entropy distribution is the worst-case scenario in terms of knowledge about a phenomenon. Indeed, in this state of knowledge, transmitting a description of the results requires the maximum amount of information, hence maximum entropy.

However, one can in fact do worse than zero knowledge. It is worse to have a low-entropy but incorrect belief than to have a maximum entropy, ignorant lack of belief. We could informally call this state one not of zero knowledge but of negative knowledge. Not only do we not know anything about the phenomenon; worse still, we hold a false belief.
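The point can be made numerically with the Kullback-Leibler divergence. Here is a small sketch with made-up distributions (the numbers are arbitrary choices of mine): a confident but wrong distribution sits farther from the truth, in KL terms, than the uniform maximum entropy distribution does.

import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q) in bits
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.8, 0.1, 0.1]              # the actual phenomenon (made-up numbers)
uniform = [1/3, 1/3, 1/3]                # maximum entropy: zero knowledge
confident_wrong = [0.01, 0.98, 0.01]     # low entropy, concentrated on the wrong outcome

print(round(kl(true_dist, uniform), 3))          # ~0.663 bits of coding inefficiency
print(round(kl(true_dist, confident_wrong), 3))  # ~5.06 bits: far worse than ignorance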

These notions can be well understood in terms of the Kullback-Leibler divergence. Starting from a low-entropy but incorrect probability distribution, Bayesian updates will generally modify said distribution into one of higher entropy. Instinctively, it seems that going to higher entropy is a step backwards: we now need more information to describe the phenomenon than before.

The key feature of KL divergence that corrects this wrong intuition is that, in the context of Bayesian updates, it measures the change in the quantity of information necessary to describe the phenomenon, as modeled by our updated probability distribution, from the standpoint of the state of knowledge prior to updating, that is, from the standpoint of our previous, non-updated distribution. So even though our new distribution is of higher entropy, it is more efficient at coding (describing) the phenomenon than our previous low-entropy but incorrect distribution.

The KL divergence between posterior and prior measures the information gain of a Bayesian update. It is a non-negative quantity, and so is its expectation over possible observations; updating with new evidence will, on average, leave us in a state of knowledge that is more accurate.
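A small numerical check of this claim; the toy example (a coin whose bias towards ‘1’ has three candidate values, and a confident prior) is my own, not from the post. For each possible observation we compute the KL divergence from prior to posterior and average it under the prior predictive; the expectation is positive even though some individual updates raise the entropy of the belief.

import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

biases = [0.2, 0.5, 0.8]                 # three hypotheses about a coin's bias towards '1'
prior = [0.05, 0.05, 0.90]               # confident, low-entropy prior (~0.57 bits)

expected_gain = 0.0
for outcome in (0, 1):
    likelihoods = [b if outcome == 1 else 1 - b for b in biases]
    p_outcome = sum(l * p for l, p in zip(likelihoods, prior))         # prior predictive
    posterior = [l * p / p_outcome for l, p in zip(likelihoods, prior)]
    gain = kl(posterior, prior)                                        # information gained from this outcome
    print(outcome, round(entropy(posterior), 2), round(gain, 3))
    expected_gain += p_outcome * gain

# Observing '0' raises the belief's entropy (~0.57 -> ~1.09 bits),
# yet the expected information gain of the update is positive:
print(round(expected_gain, 3))           # ~0.067 bits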