A maximum entropy uniform probability distribution over some set of outcomes corresponds to a state of zero knowledge about the phenomenon in question. Such a distribution assigns equal probability to every result, neither favoring nor prohibiting any of them; moreover, any result that comes about is equally compatible with said probability distribution. So it seems this maximum entropy probability distribution is the worst case scenario in terms of knowledge about a phenomenon. Indeed, in this state of knowledge, transmitting a description of the results requires the maximum amount of information, hence maximum entropy.
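This can be checked numerically; here is a minimal sketch (the four-outcome distributions are illustrative assumptions, not anything from the discussion above):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25] * 4           # zero-knowledge prior over 4 outcomes
skewed = [0.7, 0.1, 0.1, 0.1]  # a more committed (lower-entropy) belief

print(entropy(uniform))  # 2.0 bits -- the maximum for 4 outcomes
print(entropy(skewed))   # strictly less than 2.0 bits
```

Any departure from uniformity lowers the entropy, i.e. shortens the expected description length of the results, which is exactly the sense in which the uniform distribution is "worst case".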

However, one can in fact do worse than zero knowledge. It is worse to hold a low entropy but incorrect belief than to have a maximum entropy, ignorant lack of belief. We could informally call this state one not of zero knowledge, but of negative knowledge: not only do we not know anything about the phenomenon, but worse still, we hold a false belief.

These notions can be well understood in terms of the Kullback-Leibler divergence. Starting from a low entropy but incorrect probability distribution, Bayesian updates will generally modify said distribution into one of higher entropy. Instinctively, going to higher entropy seems like a step backwards: we now need more information to describe the phenomenon than before.

The key feature of the KL divergence that corrects this wrong intuition is that, in the context of Bayesian updates, it measures the change in the quantity of information necessary to describe the phenomenon, as modeled by our updated probability distribution, *from the standpoint of the state of knowledge prior to updating*, that is, from the standpoint of our previous, non-updated distribution. So even though our new distribution has higher entropy, it is more efficient at coding (describing) the phenomenon than our previous low entropy but incorrect distribution.

The KL divergence measures the expected information gain of a Bayesian update. It is a non-negative quantity; updating on new evidence will, on average, leave us in a state of knowledge that is more accurate.
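The "negative knowledge" point can be made concrete with the KL divergence read as a coding overhead: a confident but wrong distribution costs more than the ignorant uniform one. A minimal sketch, with a hypothetical two-outcome truth chosen for illustration:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in bits: the expected extra description
    length when coding samples from p with a code optimized for q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

truth    = [0.7, 0.3]  # hypothetical true distribution
ignorant = [0.5, 0.5]  # maximum-entropy, zero-knowledge prior
deluded  = [0.1, 0.9]  # low entropy but wrong: "negative knowledge"

print(kl(truth, ignorant))  # modest overhead relative to the truth
print(kl(truth, deluded))   # much larger overhead
```

The confident-but-wrong distribution sits farther from the truth in KL terms than the uniform one does, even though its entropy is lower, which is the sense in which a false belief is worse than ignorance.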

Hi David.

Let us consider three claims:

1) the state of knowledge of anyone is COMPLETELY described by a probability distribution.

2) if we know absolutely nothing about a coin, then we know it is as likely to land heads as tails.

3) if we know that, after a very large number of trials, the frequency of the coin landing heads fluctuates closely around 1/2, then we know (or are very confident) it is as likely to land heads as tails.

Are situations 2) and 3) really equivalent?

If not, it seems we’ve got to give up 1).

Otherwise, I can’t help but quote Wesley Salmon:

“Knowledge of probabilities is concrete knowledge about occurrences; otherwise it is useless for prediction and action. According to the principle of indifference, this kind of knowledge can result immediately from our ignorance of reasons to regard one occurrence as more probable than another. This is epistemological magic. Of course, there are ways of transforming ignorance into knowledge – by further investigation and the accumulation of more information. It is the same with all “magic”: to get the rabbit out of the hat you first have to put him in. The principle of indifference tries to perform “real magic”.”

Clearly, saying “since we do NOT know anything we KNOW it is equally likely to occur” seems to be an instance of magical thinking.

This is why it appears to me preferable to

– either think that the probability is indefinite

– or think that all prior values are allowed (subjective Bayesian stance)

– or think that one’s conviction is expressed by the interval [0;1] (imprecise probability)

I’d greatly appreciate it if you could explain to me why I’m irrational to think that way.

Cheers.

I must edit 1):

“1) the state of knowledge of anyone about a proposition is COMPLETELY described by a single value”

There is an enormous number of articles concerning imprecise probability which attack that claim.

And if one’s state of knowledge is, say, represented by an upper and a lower value instead of a single one, arguments from maximizing entropy are doomed to fail, since they rely on knowledge being described by a single real probability that takes on a definite value in every situation.
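One common way to make the upper-and-lower-value picture concrete is robust Bayes over a credal set of priors: update every prior in the set and report the spread of posterior means as the belief interval. A minimal sketch, where the Beta prior family, the strength values, and the trial counts are all hypothetical choices for illustration:

```python
def posterior_mean(a, b, heads, tails):
    """Posterior mean of a coin's heads-probability under a Beta(a, b)
    prior after observing the given counts (Beta-Binomial conjugacy)."""
    return (a + heads) / (a + b + heads + tails)

# A small credal set: several Beta(a, b) priors of varying strength and
# skew, standing in for "all prior values are allowed".
priors = [(a, b) for a in (0.01, 1, 4) for b in (0.01, 1, 4)]

heads, tails = 48, 52  # hypothetical trial data
means = [posterior_mean(a, b, heads, tails) for a, b in priors]
lower, upper = min(means), max(means)
print(lower, upper)  # the interval tightens around the frequency as data accumulate
```

Before any data the interval spanned by such a set is wide (near-vacuous as the priors weaken), and it narrows only through observation, which matches the point above: the precision comes from the evidence, not from a principle of indifference.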