I’ve been following Jeff Hawkins’ work at Numenta for a while, although I’ve never played with their HTM technology[1]. My view is that the fastest route to AGI is through neuroscience inspired algorithms and cognitive architectures. What Demis Hassabis’ calls a systems neuroscience approach. Numenta’s work does not seem to me to qualify[2] as an integral route to AGI, but definitely looks like a promising building block, not at the systems level, but at the perception algorithm level.

The news is that Numenta will be pushing a cloud based platform comercially, called Grok. Its strong points stem from those things the neo-cortex does well, mainly dealing with time based data robustly and autonomously (online learning), with little of the fine tuning that is involved in traditional machine learning approaches. It is aimed at prediction and anomaly detection in data streams.

The pieces of AGI will be falling into place over the next decades, and maybe this is one of them.

I’ve defined learning as the extraction of generally applicable knowledge from specific examples. In that post I remarked

An agent may have the ability to learn, but that is not enough to guarantee that learning does in fact take place [1]. The extra necessary ingredient is that the target of learning must be learnable.

Today I’m going to a present a model where learning is impossible in the context of bayesian inference. We will see in this case that not only must the phenomenon be learnable, but also that the learning agent must incorporate a bias to exploit existing regularity. Without such a bias, the learner cannot penalize complex “noisy” hypotheses that fit the data.

As components of the model we have an agent, an environment from which observations are made, and theories the agent probabilistically reasons about as the object of its learning. For observations we use a binary sequence, S = {0, 1}^n, like for example

S = 1010111010

The learning agent sees a number of elements and must try to predict subsequent ones according to different theories, which are of the form H = {0,1}^n. An important aspect of the model is that the agent will consider all possible theories that can explain and predict observations. The number of theories is therefore equal to the number of possible observations, which are both 2^n. If the agent considered a smaller number of theories it could be that the true theory describing the environment would be left out.

Furthermore, let’s say that a-priori, the agent has no reason to consider any theory more likely than the rest. So it will assign an a-priori equal probability to each theory:

P(H) = 1 / 2^n

Define the total observations up to a given time as Si, where i <= n, and that a given theory is Hk, where k <= n. We can apply bayes theorem to obtain the probability that a given theory is true given the observations (feel free to skip the math down to the conclusion):

P(Hk|Si) = P(Si|Hk)*P(Hk) / P(Si)

and the probability of a given sequence of observations P(Si) is obtained by summing[1] over all theories that yield such a prediction:

P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)

in other words, summing over all theories that begin with Si. To see exactly whats happening let’s restrict the example to n = 4. This gives us a total of 2^4 = 16 possible observations and theories. Say the agent has observed three elements ‘111’ and call the sequence S3:

S3 = 111

Let’s calculate the posterior probability on theories for this case. First for theories that do not predict 111:

P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)

but since P(Si|Hk) = 0, then

P(Hk|Si) = 0

ie theories that do not predict 111 are ruled out as should be the case. There are two theories that do predict 111:

H1 = {1110}

H2 = {1111}

the denominator of the posterior is

Sum(k) P(S3|Hk)*P(Hk)

there are two theories that predict the sequence, therefore

Sum(k) P(S3|Hk)*P(Hk) = P(H1) + P(H2)

plugin this in, the posterior is therefore

P(H1|S3) = P(S3|H1)*P(H1) / [P(H1) + P(H2)]

P(H2|S3) = P(S3|H2)*P(H2) / [P(H1) + P(H2)]

since both H1 and H2 predict S3 (P(S3|H) = 1), this reduces to

P(H1|S3) = P(H1) / [P(H1) + P(H2)]

P(H2|S3) = P(H2) / [P(H1) + P(H2)]

but because all theories are equally likely a priori

P(H1) = P(H2) = 1/16

so

P(H1|S3) = 1/16/ [1/16 + 1/16] = 1/2

and similarly

P(H2|S3) = 1/16/ [1/16 + 1/16] =1/2

So H1 and H2 are assigned equal probabilities, 1/2. Because no other theories are possible and 1/2 + 1/2 = 1, it all works out. Now, the agent will use these two theories to predict the next observation:

P(1110|S3) = 1 * 1/2 + 0 * 1/2 = 1/2

P(1111|S3) = 0 * 1/2 + 1 * 1/2 = 1/2

Thus, the agent considers that it is equally likely for the next element to be 1 or 0.

But there is nothing special about the example we chose with n = 4 and S3 = 111. In fact, you could carry out the exact same calculations for any n, and S. Here’s the key point, the learning agent makes the exact same predictions as to what will happen no matter how many observations it has made, and no matter what those observations are. Whatever the sequence of events, it does not gain any knowledge about the future from the past, learning is impossible.

I’m going to leave the discussion for later posts, but here are some relevant questions that will come up:

Does learning logically require bias? Can one meaningfully speak of theories when there is no compression of observations? What happens when the model is extended to an infinite number of observations/theories? Is this an adequate (though simplistic) model of scientific investigation/knowledge?

Notes/references

[1] I’m using the notation Sum(n) as the equivalent of the Sigma sum over elements with subscript n

I’ve spokenbefore of regularity, but haven’t defined it exactly. But before going into that, let’s first consider the intuitive notion that comes to mind. By regularity we mean something exhibiting pattern, repetition, invariance. We say something is regular if it follows a rule. In fact, the word’s etymology matches this, as regular is derived from the latin regula, rule. Repetition and invariance result from the continued applicability of the rule, over time and/or space, to that which is regular. For example

1, 3, 5, 7, 9, 11, 13, 15….

we say this sequence is regular because it follows a rule. The rule is

each number is the result of adding 2 to the number before it

As per our scheme above, the rule is applicable throughout the sequence, the +2 difference repeats, and it is invariant.

This way of looking at regularity matches the language we’ve used previously when defining the key aspect of learning as the extraction of generally applicable knowledge from specific examples. In this case the specific examples would be any subset of the sequence, the general case is the sequence in its entirety, and the extracted knowledge is the rule “+2”.

We can take this further and try to formalize it by realizing one consequence of the repetition characteristic. And it is that something that repeats can be shortened. The reason is simple, if we know the rule, we can describe the entire object[1] by just describing the rule. The rule, by continued repetition, will reproduce the object up to any length. We can use the example before, and note how the sequence can be described succinctly as

f(x) = 2x + 1

which is much shorter than the sequence (which can be infinite in fact). So we can think of the rule as a compression of the object, or from the other point of view, the object is the expansion of the rule. Here’s another example

In this case, the object is a fractal, which can be described graphically by a potentially infinite set of points. The level of detail is infinite in the sense that one can zoom-in to arbitrary levels without loss of detail. This is why the description at a literal level (ie pixels) is infinitely long. However, like all fractals, the Mandelbrot set can be compactly described mathematically. So we say the set is highly regular by virtue of the existence of a short description that can reproduce all its detail. Here’s the short (formal) description for the Mandelbrot set

In mathematics the length of what we have called short description is known as Kolmogorov complexity, or alternatively, Algorithmic Information Content. It is a measure of the quantity of information in an object, and is the inverse of regularity as we have discussed it here[2]. We say that something with a comparably low AIC exhibits regularity as it can be compressed down to something much shorter.

I’ll regularly return to the concept of regularity as it is a fundamental way to look at pretty much everything, and is thus a very deep subject.

[1] For lack of a better word, I’m using the word object to refer to that which can house regularity, which is basically anything you can think of

[2] Note that this is not the only way to look at regularity, but rather one of two main formalizations. What we have seen here is the algorithmic approach to complexity (and regularity), but there is also a statistical view that is more suited to objects that do not have a fixed description, but rather a statistical one.

The other day the feasibility of AI came up in casual conversation. Somebody argued that machines can only do what they’re told to, that they just follow a pre-programmed set of rules or commands, and so on.

In response to this I normally cite the field of machine learning as an example of computers doing things (I’ll use classification here) they were not explicitly programmed to do. When recognizing characters, for instance, the computer is not explicitly given a set of rules to determine what letter a given handwritten element corresponds to. Rather it learns from a training set, which for classification is a set of elements together with their correct labels (classifications).

An objection was made to my response: that the process is supervised; even though the programming is not explicit, it is nonetheless programming in the form of example-classification pairs. In effect, the computer is just being fed rules, even if we call that process learning. Now, I could have gone on about unsupervised learning, but there’s a better point to make.

The supervised/unsupervised aspect is not the key to whether learning occurs or not. What is the defining characteristic of learning is the fact that the computer can correctly classify examples it has not seen before.

An agent that only memorizes the training set can correctly identify previously seen elements, but it cannot deal with unseen ones, its memorized knowledge has nothing to say about them. A learning agent, in contrast, manages to extract some knowledge from the training set that can be used to classify new unseen elements. The key to learning is the thus the ability to generalize from examples. It is the acquisition of knowledge from specific cases that is applicable in general.

However, this is not the end of the story. An agent may have the ability to learn, but that is not enough to guarantee that learning does in fact take place [1]. The extra necessary ingredient is that the target of learning must be learnable. If no “explanation” exists that reproduces the data, then there is nothing to be learned [2], the data is just “noise” [3]. This is why a phone book cannot be learned, it can only be memorized.

[2] Alternative formulations of this are: that the data has no internal structure; the data cannot be compressed; the data’s constituent elements have zero mutual information

[3] Unfortunately there is no way to tell whether an “explanation” exists or not, short of finding it. Hence the uncomputability of Kolmogorov complexity.

One of the themes in AI is the progressive replacement of hand crafted knowledge/programming with autonomous learning. Another theme is the progressive shift from narrow domain specific abilities to generally applicable performance. These themes are closely related: domain specific performance is usually achieved via encoding domain specific expert knowledge, provided by humans, into the AI itself. This encoding is fixed and static, and is potentially brittle if the agent is subjected to a domain outside its narrow region of applicability.

In this talk we see hints of generality in machine learning via the replacement of hand crafted, tuned features (as input to a later stage learning phase) with a learning phase that autonomously learns these features. Because domain specific features, for example for vision or speech recognition, are typically designed by subject matter experts, learning is constrained by that specific manual task. One cannot use vision features in an audio problem and so forth.

However, when machine learning is targeted at learning the features themselves (in an unsupervised scheme), the potential for generality becomes present. If features can be autonomously learned for various domains, the machine learning process becomes general in so far as the feature learning’s performance is comparable or superior to that using hand crafted knowledge. This is exactly what is demonstrated here.

And, to make it even more relevant in terms of current research avenues, this is related to findings in neuroscience that suggest the general applicability of some kinds of learning in the neocortex, where for example patients can learn to see with their auditory cortex or their somatosensory cortex (by rewiring the optic nerve). This suggests the possibility that there is a unique learning algorithm at work, at least in the learning of concepts at low levels of the hierarchy close to the perceptual level, which is exactly where features reside.