I’ve been following Jeff Hawkins’ work at Numenta for a while, although I’ve never played with their HTM technology[1]. My view is that the fastest route to AGI is through neuroscience inspired algorithms and cognitive architectures. What Demis Hassabis’ calls a systems neuroscience approach. Numenta’s work does not seem to me to qualify[2] as an integral route to AGI, but definitely looks like a promising building block, not at the systems level, but at the perception algorithm level.

The news is that Numenta will be pushing a cloud based platform comercially, called Grok. Its strong points stem from those things the neo-cortex does well, mainly dealing with time based data robustly and autonomously (online learning), with little of the fine tuning that is involved in traditional machine learning approaches. It is aimed at prediction and anomaly detection in data streams.

The pieces of AGI will be falling into place over the next decades, and maybe this is one of them.

In this talkTim Sweeney discusses the long term future of computing and its impact on society, and specifically that of 3D graphics technology. It’s something I’ve asked myself before, how far away are photorealistic realtime 3D graphics? When will games and computer generated graphics in movies be realistic to the point of indistinguishable from the real thing?

The route that Sweeney follows to answer this question is to first ask what the limits of our perception are, in particular our vision (eyes plus visual cortex). Any improvements beyond these limits cannot be perceived, so they establish a definitive upper bound on realism. And this bound can be convincingly determined due to well established knowledge of our biology and sensory apparatus. A few numbers are given, for example in terms of maximum resolution and frames per second that we can perceive.

These numbers together with somewhat less objective intuitions as to which graphics approximations (such as lighting algorithms) are necessary to achieve realism, yield a factor of 2000 times today’s hardware performance to reach the target.

Sweeney then rightly notes that some aspects of presenting realistic graphics are not just a matter of simulating physical processes that are completely understood, but rather require theoretical advances. The main one is artificial intelligence, necessary to realistically simulate characters in terms of animation and even behavior. Still, the term photorealistic doesn’t necessarily include things like these.

Summing up, I’d say the 2000x figure is his rough answer, and it seems to me about right. How does this translate into years? You could use Moore’s law, but I suspect that would underestimate the potential improvements in graphics performance which could advance faster thanks to the particular parallel nature of rendering. So, I’d venture circa 2020 to get a nice round figure.

I’ve defined learning as the extraction of generally applicable knowledge from specific examples. In that post I remarked

An agent may have the ability to learn, but that is not enough to guarantee that learning does in fact take place [1]. The extra necessary ingredient is that the target of learning must be learnable.

Today I’m going to a present a model where learning is impossible in the context of bayesian inference. We will see in this case that not only must the phenomenon be learnable, but also that the learning agent must incorporate a bias to exploit existing regularity. Without such a bias, the learner cannot penalize complex “noisy” hypotheses that fit the data.

As components of the model we have an agent, an environment from which observations are made, and theories the agent probabilistically reasons about as the object of its learning. For observations we use a binary sequence, S = {0, 1}^n, like for example

S = 1010111010

The learning agent sees a number of elements and must try to predict subsequent ones according to different theories, which are of the form H = {0,1}^n. An important aspect of the model is that the agent will consider all possible theories that can explain and predict observations. The number of theories is therefore equal to the number of possible observations, which are both 2^n. If the agent considered a smaller number of theories it could be that the true theory describing the environment would be left out.

Furthermore, let’s say that a-priori, the agent has no reason to consider any theory more likely than the rest. So it will assign an a-priori equal probability to each theory:

P(H) = 1 / 2^n

Define the total observations up to a given time as Si, where i <= n, and that a given theory is Hk, where k <= n. We can apply bayes theorem to obtain the probability that a given theory is true given the observations (feel free to skip the math down to the conclusion):

P(Hk|Si) = P(Si|Hk)*P(Hk) / P(Si)

and the probability of a given sequence of observations P(Si) is obtained by summing[1] over all theories that yield such a prediction:

P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)

in other words, summing over all theories that begin with Si. To see exactly whats happening let’s restrict the example to n = 4. This gives us a total of 2^4 = 16 possible observations and theories. Say the agent has observed three elements ‘111’ and call the sequence S3:

S3 = 111

Let’s calculate the posterior probability on theories for this case. First for theories that do not predict 111:

P(Hk|Si) = P(Si|Hk)*P(Hk) / Sum(k) P(Si|Hk)*P(Hk)

but since P(Si|Hk) = 0, then

P(Hk|Si) = 0

ie theories that do not predict 111 are ruled out as should be the case. There are two theories that do predict 111:

H1 = {1110}

H2 = {1111}

the denominator of the posterior is

Sum(k) P(S3|Hk)*P(Hk)

there are two theories that predict the sequence, therefore

Sum(k) P(S3|Hk)*P(Hk) = P(H1) + P(H2)

plugin this in, the posterior is therefore

P(H1|S3) = P(S3|H1)*P(H1) / [P(H1) + P(H2)]

P(H2|S3) = P(S3|H2)*P(H2) / [P(H1) + P(H2)]

since both H1 and H2 predict S3 (P(S3|H) = 1), this reduces to

P(H1|S3) = P(H1) / [P(H1) + P(H2)]

P(H2|S3) = P(H2) / [P(H1) + P(H2)]

but because all theories are equally likely a priori

P(H1) = P(H2) = 1/16

so

P(H1|S3) = 1/16/ [1/16 + 1/16] = 1/2

and similarly

P(H2|S3) = 1/16/ [1/16 + 1/16] =1/2

So H1 and H2 are assigned equal probabilities, 1/2. Because no other theories are possible and 1/2 + 1/2 = 1, it all works out. Now, the agent will use these two theories to predict the next observation:

P(1110|S3) = 1 * 1/2 + 0 * 1/2 = 1/2

P(1111|S3) = 0 * 1/2 + 1 * 1/2 = 1/2

Thus, the agent considers that it is equally likely for the next element to be 1 or 0.

But there is nothing special about the example we chose with n = 4 and S3 = 111. In fact, you could carry out the exact same calculations for any n, and S. Here’s the key point, the learning agent makes the exact same predictions as to what will happen no matter how many observations it has made, and no matter what those observations are. Whatever the sequence of events, it does not gain any knowledge about the future from the past, learning is impossible.

I’m going to leave the discussion for later posts, but here are some relevant questions that will come up:

Does learning logically require bias? Can one meaningfully speak of theories when there is no compression of observations? What happens when the model is extended to an infinite number of observations/theories? Is this an adequate (though simplistic) model of scientific investigation/knowledge?

Notes/references

[1] I’m using the notation Sum(n) as the equivalent of the Sigma sum over elements with subscript n