Uncertainty, information and cryptography

(cross posted from here)

In this post I'm going to talk about three types of uncertainty, and how the foundations of cryptography can be understood in their terms. Wikipedia says

Uncertainty is the situation which involves imperfect and / or unknown information.  It applies to predictions of future events, to physical measurements that are already made, or to the unknown.

There are two main concepts to note here: information and knowledge. We could say that uncertainty is lack of knowledge or lack of information. As we'll see, these two ideas are not equivalent and do not cover all cases. We start with the strongest form of uncertainty.

Ontological uncertainty: indeterminacy

The Bloch sphere representing the state of a spin 1/2 particle.

In quantum mechanics, certain particles (spin 1/2 particles such as electrons) have a property called spin that when measured[1] can give two discrete results, call them “spin up” and “spin down”. This is described by the equation

|\psi\rangle = \alpha|0\rangle + \beta|1\rangle

such that when measured, the probability of obtaining spin up is |α|², and that of spin down is |β|². A question we could ask ourselves is: before we measure it, is the spin up or is it down? But the equation above only gives us probabilities of what will happen when we make the measurement.

In mainstream interpretations of quantum mechanics there is no fact of the matter as to what the value of the spin was before we made the measurement. And there is no fact of the matter as to what the measurement will yield prior to it happening. The information that our question is asking for simply does not exist in the universe.

It is this intrinsic indeterminacy that makes us use the term ontological uncertainty: the uncertainty is not a property of our knowledge, but a property of nature. Our confusion is a consequence of the ontological mismatch between nature and our model of it. We sum up this type of uncertainty as:

The information does not exist and therefore we cannot know it.

By the way, the Heisenberg uncertainty principle, which is of this type, is not very well named, as it can be confused with the subject of the next section. A better name would be the indeterminacy principle.

Epistemic uncertainty: information deficit

We started with the strongest and also strangest form of uncertainty. The second is the everyday type encountered when dealing with incomplete information. In contrast to the previous type, this uncertainty is a property of our state of knowledge, not a property of nature itself. So when we ask, for example, what caused the dinosaur extinction, we are referring to some fact about reality, whether or not we have or will ever have access to it. Or if, playing poker, we wonder whether we have the best hand, we are referring to an unknown but existing fact: the set of all hands dealt.

Boltzmann’s tombstone with the entropy equation.

Uncertainty as incomplete information is central to fields like information theory, probability and thermodynamics, where it is given a formal and quantitative treatment. The technical term is entropy, and it is measured in bits. We say a description has high entropy if a lot of information is missing from it. If we ask whether a fair coin will land heads or tails, we are missing 1 bit of information. If we ask what number will come up when throwing a fair 8-sided die, we are missing 3 bits. The die throw has more possible results than the coin flip, and therefore there is higher uncertainty about it: more bits of entropy. We sum up this type of uncertainty as:

The information exists, but we do not know it.
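
As a quick illustration, here is a minimal Python sketch computing this missing information as Shannon entropy for the two distributions just mentioned:

from math import log2

def entropy(probs):
    """Missing information, in bits, for a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([1.0/8] * 8))    # fair 8-sided die: 3.0 bits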

Before finishing, a small clarification. If you were expecting the concept of randomness to appear when talking about coin flips and die rolls, here's the reason why it did not. In this section I have restricted the discussion to classical physics, where phenomena are deterministic although we may not know all the initial conditions. This combination of determinism plus unknown initial conditions is what underlies the use of randomness in the macroscopic world. This type of randomness is sometimes called subjective randomness, to distinguish it from intrinsic randomness, which is essentially another term for the ontological uncertainty of the first section.

A deterministic coin flipping machine

The third type…

And now to the interesting bit. Let’s say I tell you that I have all the information about something, but I still don’t know everything about it. Sounds contradictory right? Here’s an example to illustrate this kind of situation.

  1. All men are mortal
  2. Socrates is a man

If now somebody tells you that

3. Socrates is mortal.

Are they giving you any information? Hopefully it seems to you like they told you something you already knew. Does that mean you had all the information before being given statement 3? Put differently, does statement 3 contain any information not present in 1 and 2?

One of the 24 valid syllogism types

Consider another example.

  1. x = 1051
  2. y = 3067
  3. x * y = 3223417

In this case statement 3 tells us something we probably didn't know. But does statement 3 contain information not present in 1 and 2? We can use definitions from information theory to offer one answer. Define three random variables (for convenience, over some arbitrary range a to b):

x ∈ {a, …, b},  y ∈ {a, …, b},  x*y ∈ {a², …, b²}

We can calculate the conditional entropy according to the standard equation

H(Y | X) = − ∑ P(x, y) · log2 P(y | x)    (summing over all pairs x, y)

which in our case gives

H(x*y | x, y) = 0
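
We can check this numerically. Here is a minimal sketch, assuming for the sake of the example that x and y are independent and uniform over a small range:

from itertools import product
from math import log2

def H(dist):
    """Shannon entropy, in bits, of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

values = range(1, 9)                # x and y uniform over 1..8 for the demo
p = 1.0 / len(values) ** 2          # each (x, y) pair equally likely

joint_xy  = {(x, y): p for x, y in product(values, values)}
joint_xyz = {(x, y, x * y): p for x, y in product(values, values)}

# H(x*y | x, y) = H(x, y, x*y) - H(x, y): the product adds no information
print(H(joint_xyz) - H(joint_xy))   # -> 0.0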

The conditional entropy of x*y given x and y is zero. This is just a technical way to say that given statements 1 and 2, statement 3 contains no extra information: whatever 3 tells us was already contained in 1,2. Once x and y are fixed, x*y follows necessarily. This brings us back to the beginning of the post

We could say that uncertainty is lack of knowledge or lack of information. As we'll see, these two ideas are not equivalent and do not cover all cases.

It should be apparent now that these two ideas are different. We have here cases where we have all the information about something (x, y), and yet we do not know everything about it (x*y).

Logical uncertainty: computation deficit

The step that bridges having all the information with having all the knowledge has a name: computation. Deducing (computing) the conclusion from the premises in the Socrates syllogism does not add any information. Neither does computing x*y from x and y. But computation can tell us things we did not know even though the information was there all along.

We are uncertain about the blanks, even though we have all the necessary information to fill them.

In this context, computing is a process that extracts consequences implicitly present in information. The difference between deducing the conclusion of a simple syllogism and multiplying two large numbers is a difference in degree, not a difference in kind. The practical difference is that without sufficient computation we will remain uncertain about things that are, in a sense, already there. At the upper end we have cases like Fermat's Last Theorem, about which mathematicians were uncertain for 350 years. We finish with this summary of logical uncertainty:

The information exists, we have all of it, but there are logical consequences we don’t know.

Pierre de Fermat

Cryptography: secrecy and uncertainty

Cryptography  (from Greek κρυπτός kryptós, “hidden, secret”; and γράφειν graphein, “writing”) is the practice and study of techniques for secure communication in the presence of third parties called adversaries

The important word here is secret, which should remind us of uncertainty. Saying that we want a message to remain secret with respect to an adversary is equivalent to saying that we want this adversary to be uncertain about the message's content. Although our first intuition might point in the direction of epistemic uncertainty, in practice this is usually not the case.

Let's look at an example with the Caesar cipher, named after Julius Caesar, who used it some 2000 years ago. The Caesar cipher replaces each letter in the message with another letter obtained by shifting the alphabet a fixed number of places; this number of places plays the role of the encryption key. For example, with a shift of +3:

abcdefghijklmnopqrstuvwxyz
defghijklmnopqrstuvwxyzabc

Let’s encrypt a message using this +3 key:

cryptography is based on uncertainty
fubswrjudskb lv edvhg rq xqfhuwdlqwb

We hope that if our adversary gets hold of the encrypted message he/she will not learn its secret, whereas our intended recipient, knowing the +3 shift key, just needs to apply the reverse procedure (a -3 shift) to recover it. When analyzing ciphers it is assumed that our adversary will capture our messages and will also know the procedure used to encrypt, if not the key (in this case +3). Under these assumptions, let's imagine we are the adversary and capture this encrypted message:

govv nyxo iye rkfo pyexn dro combod

We want to know the secret, but we don't know the secret key shift value. But then we realize that the alphabet has 26 characters, and therefore there are only 25 possible shifts; a shift of 26 leaves the message unchanged. So how about trying all the keys and seeing what happens:

FNUU MXWN HXD QJEN OXDWM CQN BNLANC
EMTT LWVM GWC PIDM NWCVL BPM AMKZMB
DLSS KVUL FVB OHCL MVBUK AOL ZLJYLA
CKRR JUTK EUA NGBK LUATJ ZNK YKIXKZ
BJQQ ITSJ DTZ MFAJ KTZSI YMJ XJHWJY
AIPP HSRI CSY LEZI JSYRH XLI WIGVIX
ZHOO GRQH BRX KDYH IRXQG WKH VHFUHW
YGNN FQPG AQW JCXG HQWPF VJG UGETGV
XFMM EPOF ZPV IBWF GPVOE UIF TFDSFU
WELL DONE YOU HAVE FOUND THE SECRET
VDKK CNMD XNT GZUD ENTMC SGD RDBQDS
UCJJ BMLC WMS FYTC DMSLB RFC QCAPCR
TBII ALKB VLR EXSB CLRKA QEB PBZOBQ
SAHH ZKJA UKQ DWRA BKQJZ PDA OAYNAP
RZGG YJIZ TJP CVQZ AJPIY OCZ NZXMZO
QYFF XIHY SIO BUPY ZIOHX NBY MYWLYN
PXEE WHGX RHN ATOX YHNGW MAX LXVKXM
OWDD VGFW QGM ZSNW XGMFV LZW KWUJWL
NVCC UFEV PFL YRMV WFLEU KYV JVTIVK
MUBB TEDU OEK XQLU VEKDT JXU IUSHUJ
LTAA SDCT NDJ WPKT UDJCS IWT HTRGTI
KSZZ RCBS MCI VOJS TCIBR HVS GSQFSH
JRYY QBAR LBH UNIR SBHAQ GUR FRPERG
IQXX PAZQ KAG TMHQ RAGZP FTQ EQODQF
HPWW OZYP JZF SLGP QZFYO ESP DPNCPE
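
A table like the one above takes only a few lines of Python to produce; here is a minimal sketch (output in lowercase):

import string

ALPHABET = string.ascii_lowercase

def caesar(text, shift):
    """Shift each letter by `shift` alphabet places, leaving other characters untouched."""
    shifted = ALPHABET[shift % 26:] + ALPHABET[:shift % 26]
    return text.translate(str.maketrans(ALPHABET, shifted))

ciphertext = "govv nyxo iye rkfo pyexn dro combod"
for key in range(1, 26):
    # decrypting with a key of +key is the same as encrypting with a shift of -key
    print(key, caesar(ciphertext, -key))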

We find that the secret is revealed when trying a key shift of +10. Note how we were able to pick out the correct message because none of the other attempts gave a meaningful result. This happens because the space of possible keys is so small that only one of them decrypts to a possible message. In technical terms, the key space and message space[2] are small enough compared to the length of the message that only one key will decrypt it. The following equation[3] states this in terms of uncertainty:

H(K | C) = ∑ P(c) · log2 S(c)    (summing over all possible ciphertexts c)

The left part of the expression, H(Key | Ciphertext), tells us how much uncertainty about the key remains once we have obtained the encrypted message. Note the term S(c) which represents how many keys decrypt a meaningful message. As we saw above, S(c) = 1, which yields

H(K | C) = ∑ P(c) * log2 (1) = ∑ P(c) * 0 = 0

In words, there is no uncertainty about the key, and therefore the secret message, once we know the encrypted message[4]. Of course, when we initially captured this

govv nyxo iye rkfo pyexn dro combod

we did not know the secret, but we had all the information necessary to reveal it. We were only logically uncertain about the secret and needed computation, not information, to find it out.

Alberti’s cipher disk (1470)

Although we have seen this only for the simple Caesar cipher, it turns out that except for special cases, many ciphers have this property given a large enough message to encrypt. In public key ciphers, like those used in many secure voting systems, this is the case irrespective of message size. So we can say that practical cryptography is based around logical uncertainty, since our adversaries have enough information to obtain the secret. But as we saw previously, there are different degrees of logical uncertainty. Cryptography depends on this uncertainty being “strong” enough to protect secrets.

Talking about degrees of logical uncertainty leads us to computational complexity.

Computational complexity and logical uncertainty

Just as entropy measures epistemic uncertainty, computational complexity can be said to measure logical uncertainty. In probability theory we study how much information one needs to remove epistemic uncertainty. Computational complexity studies how much computation one needs to remove logical uncertainty. We saw that deducing the conclusion of the Socrates syllogism was easy, but multiplying two large numbers was harder. Complexity looks at how hard these problems are relative to each other. So if we are looking for the foundations of cryptography we should definitely look there.

Take for example the widely used RSA public key cryptosystem. This scheme is based (among other things) on the computational difficulty of factoring large numbers. We can represent this situation with two statements, for example

  1. X = 1522605027922533360535618378132637429718068114961380688657908494580122963258952897654000350692006139
  2. X = 37975227936943673922808872755445627854565536638199 * 40094690950920881030683735292761468389214899724061

Statement 2 (the factors) is entailed by statement 1, but obtaining 2 from 1 requires significant computational resources. In a real-world case, an adversary that captures a message encrypted under the RSA scheme would require such an amount of computation to reveal its content that the possibility is labeled infeasible. Let's be a bit more precise than that: it means that an adversary, using the fastest known algorithm for the task, would require thousands of years of computing on a modern PC.
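
To see concretely why the factors are the whole story, here is a toy sketch with deliberately tiny, illustrative primes (real RSA uses enormous primes and proper padding): once the factors of the modulus are known, deriving the private key is immediate.

# Toy RSA with tiny illustrative primes; knowing the factors of n breaks everything.
p, q = 61, 53
n = p * q                   # public modulus
phi = (p - 1) * (q - 1)     # easy to compute *only* if you can factor n
e = 17                      # public exponent
d = pow(e, -1, phi)         # private exponent via modular inverse (Python 3.8+)

m = 65                      # a message, encoded as a number
c = pow(m, e, n)            # encryption with the public key (n, e)
assert pow(c, d, n) == m    # with d derived from the factors, decryption succeeds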

If that claim of thousands of years didn't trigger alarm bells, perhaps I should emphasize the words "known algorithm". We know that with known algorithms the task is infeasible, but what if a faster algorithm is found? You would expect complexity theory to have an answer to that hypothetical situation. The simple fact of the matter is that it doesn't.

In complexity theory, problems for which efficient algorithms exist are put into a class called P. Although no efficient algorithm is known for integer factorization, whether it is in P or not is an open problem[5].  In other words, we are logically uncertain about whether factorization is in P or not!

Several complexity classes

If we assume that integer factorization is not in P then a message encrypted with RSA is secure. So in order to guarantee an adversary’s logical uncertainty about secret messages, cryptographic techniques rely on assumptions that are themselves the object of logical uncertainty at the computational complexity level! Not the kind of thing you want to find when looking for foundations.

The bottom line

It’s really not that bad though. If you think carefully about it, what matters is not just whether factorization and other problems are in P or not, but whether adversaries will find the corresponding efficient algorithms. The condition that factorization is in P and that the efficient algorithms are secretly found by adversaries is much stronger than the first requirement on its own. More importantly, the second condition seems to be one we can find partial evidence for.

Whether or not evidence can be found for a logical statement is a controversial subject. Does the fact that no one has proved that factorization is in P count as evidence that it is not in P? Some say yes and some say no. But it seems less controversial to say that the fact that no algorithm has been found counts as evidence for the possibility that we (as a species with given cognitive and scientific level of advancement) will not find it in the near future.

A quantum subroutine

The bottom line for the foundations of cryptography is that it is a question of both logical and epistemic uncertainty. On one hand, computational complexity questions belong in the realm of logic, and empirical evidence for them seems conceptually shaky. On the other hand, the practical aspects of cryptography depend not only on complexity questions, but also on our ability to solve them. Another point along these lines is that computational complexity tells us about difficulty for algorithms given certain computational primitives. But the question of what primitives we have access to when building computing devices is a question of physics (as quantum computing illustrates). This means we can justify or question confidence in the security of cryptography through empirical evidence about the physical world. Today, it is the combination of results from computational complexity together with empirical evidence about the world that forms the ultimate foundation of cryptography.


References

[1] Along the x, y, or z axes

[2] Without going into details, the message space is smaller than the set of all combinations of letters given that most of these combinations are meaningless. Meaningful messages are redundantly encoded.

[3] http://www14.in.tum.de/konferenzen/Jass05/courses/1/papers/gruber_paper.pdf

[4] The equation refers to the general case, but we can still use it to illustrate a particular case.

[5] To be precise, it’s that and the more general question of whether P=NP.

Reddit-style filtering for e-democracy

(Cross posted from here.)

When voting, people select among several options that have been predefined prior to the election. In more open scenarios of citizen participation, people are asked not only to vote for predefined options, but also to contribute their own ideas before the final vote. This introduces another activity into the democratic process, which we can call filtering.

The aim of filtering is to select, out of a very large number of ideas, the few best ones, either to implement them directly or to put them to a formal vote. Filtering has to address a problem that voting cannot: how do we select from potentially hundreds of ideas without having each voter evaluate them all? The solution provided by filtering is a common theme in many proposals to augment democracy: division of cognitive labour.

A single voter cannot rank hundreds of ideas, but the cognitive load of selecting the best ones can be spread across the many citizens that are contributing them. Filtering can scale because as the number of ideas grows with the number of people suggesting them, so too does the cognitive effort available to evaluate them.

However, a key distinction must be noted between voting and filtering. Distributing cognitive labour allows filtering many ideas, but once we accept that not everybody evaluates everything, the available information is necessarily incomplete. In other words, the process of ranking ideas has an element of uncertainty: it is statistical.

Reddit-style filtering

When looking for tools to implement internet mediated filtering, sooner or later one runs into systems like reddit, slashdot, digg and the like. At an abstract level, we can pick out the following features which are relevant to the problem of filtering:

  1. Many users can submit a large number of items as well as evaluate them
  2. Users evaluate items with a binary signal, e.g. voting up/down
  3. A ranking method exists that sorts items for some definition of better => worse.

At this level of description, these three properties fit filtering well. We have ideas that we wish to rank, and we cope with the large volume by having the users themselves evaluate them, using a binary signal such as approval (or disapproval). The meat of the matter lies in the third property, specifically in the phrase "for some definition of better => worse".

Quality of ideas

Above we have used the abstract term "items". In reddit, items can be stories or comments, and the distinction is important because reddit uses different ranking algorithms for each. The key way in which filtering departs from the reddit model is that reddit and similar sites are news aggregation tools. This is why stories in reddit include a novelty component in their ranking: ceteris paribus, new items rank higher than old items.

Filtering, on the other hand, is not about news but about ideas/proposals. Although novelty may still be a factor, it is not a central one. Idea quality must be determined mostly by voters' evaluations through the binary signal, and not so much by submission date. Filtering is therefore closer to reddit's ranking of comments than to its ranking of stories.

The problem is simple then: we must rank ideas according to users' binary evaluations in a way that copes with incomplete information. That is, we will not have a +/- judgement for every idea from every user, only subsets of these. Fortunately, the problem as posed is directly an instance of binomial proportion estimation, an urn model.

We have, for every idea, an unknown quantity which represents the fraction of all users who would evaluate it positively. This quantity can be statistically inferred from the number of voters that have in fact evaluated the idea and approved it. To do this we use the beta-binomial model, see here for an introduction. Briefly, we choose a uniform, non-informative prior distribution with parameters alpha = beta = 1. Because the beta is conjugate to the binomial, the posterior distribution is also a beta, given by

Beta(alpha + approvals, beta + total - approvals)

The mean of a beta distribution is given by

mean = alpha / (alpha + beta)

If we plug the posterior parameters into this, we see that the mean is (approvals + 1) / (total + 2), which for any sizable number of evaluations is essentially the fraction of approvals, which makes sense. So a first approximation would be to rank ideas by the mean of the posterior probability distribution, i.e. (roughly) the fraction of approvals out of total evaluations.
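
Here is a minimal sketch of this first approximation; the idea names and counts are made up for illustration:

def posterior_mean(approvals, total, alpha=1.0, beta=1.0):
    """Mean of the Beta(alpha + approvals, beta + total - approvals) posterior."""
    return (alpha + approvals) / (alpha + beta + total)

# hypothetical ideas as (approvals, total evaluations)
ideas = {"idea A": (500, 1000), "idea B": (1, 2), "idea C": (90, 100)}
ranking = sorted(ideas, key=lambda name: posterior_mean(*ideas[name]), reverse=True)
print(ranking)  # ranks by (smoothed) approval fraction alone, ignoring confidence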

What about uncertainty?

The problem with the above is that it ignores how confident we are about the estimation of an idea's quality. For example, an idea evaluated 1000 times with 500 approvals would rank equally with an idea evaluated twice and approved once. In technical terms, using the mean throws away information about the probability distribution. This is what the author of this post realized:

suppose item 1 has 2 positive ratings and 0 negative ratings. Suppose item 2 has 100 positive ratings and 1 negative rating. This algorithm puts item two (tons of positive ratings) below item one (very few positive ratings). WRONG.

The author then goes on to suggest the lower bound of the Wilson confidence interval, which is how reddit currently ranks comments. Another option along these lines is to use a lower bound on our beta posterior obtained from its cumulative distribution function (I won't go into more detail here). In either case, the spirit of the measure is to rank ideas that we are confident are good above ideas which have not been evaluated enough.
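
As an illustration, here is a minimal sketch of the beta-posterior lower bound using SciPy; the 5% quantile is an arbitrary choice:

from scipy.stats import beta

def lower_bound(approvals, total, alpha=1.0, b=1.0, quantile=0.05):
    """Pessimistic quality estimate: a low quantile of the beta posterior."""
    return beta.ppf(quantile, alpha + approvals, b + total - approvals)

# the two items from the quoted example: 2/2 positive vs 100/101 positive
print(lower_bound(2, 2))      # few evaluations -> low confidence -> low bound
print(lower_bound(100, 101))  # many evaluations -> ranked higher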

What about equality/winner takes all/sunk ideas/information gain?

This is where filtering departs strongly from systems like reddit. Consider:

  • Ideas should have an equal opportunity of being selected.
  • Once ideas are ranked highly there is a snowball effect, as they become more visible and can accumulate approvals.
  • New good ideas will be lost and sunk, as they cannot compete with the volume of older ones that have accumulated approvals.
  • By penalizing uncertainty you concentrate evaluation away from where it's most needed; information gain is minimized.

All these objections follow a common theme: the need to balance two competing objectives:

  • We want to rank good ideas highly
  • We want to allow for the discovery of good yet unproven ideas

Thompson Sampling and multi-armed bandits

The above dilemma is another form of the exploration-exploitation tradeoff that appears in reinforcement learning, an area of machine learning. There is one specific problem in machine learning whose structure is very similar to the problem of rating items and filtering: multi-armed bandits. In this post, James Neufeld first makes the connection between rating comments on reddit and multi-armed bandits.

The comment scoring problem on reddit is slightly different from the basic setting described above. This is because we are not actually choosing a single comment to present to the user (pulling one arm) but, instead, producing a ranking of comments. There are some interesting research papers on modelling this problem precisely, for example [Radlinski, et al., 2008], but, it turns out to be a combinatorial optimization (hard). However, rather than going down this complex modelling path, one could simply rank all of the μ̄ᵢ samples instead of taking just the max, this gives us a full ranking and, since the max is still at the top, is unlikely to adversely affect the convergence of the algorithm.

He then goes on to propose adapting Thompson sampling, a solution applied to multi-armed bandits in reinforcement learning, to the case of rating comments on reddit. The method of Thompson sampling (or probability matching) is simple given what we've seen above regarding the beta posterior. Instead of using the beta posterior mean, or a lower bound on its cumulative distribution, we simply sample from it. The procedure is:

  1. For each idea, sample from its posterior beta probability distribution
  2. Construct a ranking according to the sampled values (a minimal sketch in code follows)
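
A possible implementation of these two steps, again with made-up counts for illustration:

import numpy as np

rng = np.random.default_rng()

# hypothetical ideas as (approvals, total evaluations)
ideas = {"idea A": (500, 1000), "idea B": (1, 2), "idea C": (90, 100)}

def thompson_rank(ideas):
    """Rank by one draw from each idea's Beta(1 + approvals, 1 + rejections) posterior."""
    samples = {
        name: rng.beta(1 + approvals, 1 + total - approvals)
        for name, (approvals, total) in ideas.items()
    }
    return sorted(samples, key=samples.get, reverse=True)

print(thompson_rank(ideas))  # the ranking varies from call to call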

Here's how Thompson sampling relates to the two objectives stated above:

  • We want to rank good ideas highly

Sampling from a high quality posterior will tend to produce larger values.

  • We want to allow for the discovery of good yet unproven ideas

Because we are sampling, there is a chance that unproven ideas will be ranked highly. Furthermore, this possibility is greater for better ideas and for ideas with high uncertainty. There is more to be investigated here about the relationship between Thompson sampling and maximizing information gain (e.g. Kullback-Leibler divergence).

Extensions

The base Thompson sampling model can be extended in several ways. For example:

  • Time: we can incorporate time factors by having decays on the beta distribution parameters.
  • Collaborative filtering: we can apply weight factors on beta parameters depending on affinity with the user.
  • Boosting newly submitted ideas: we can choose non uniform priors to ensure high exposure to new ideas.

Probability of an election tie

(from http://www.uncommondescent.com/wp-content/uploads/2013/06/coin-flip.jpg)

Sparked by recent events in politics, a lot of debate and controversy has occurred on the Spanish blogosphere around a simple question of probability:

What is the probability that a Yes/No election with 3030 voters results in a tie?

Before suggesting answers, let me make it clear that the main controversy has occurred when trying to answer this question in its barest form, without any additional information besides its simplest formulation above. To make it doubly clear, this is all the information that defines the problem:

1) There are 3030 voters that can vote Yes or No.
2) Yes and No votes are treated as Bernoulli trials.

We model a series of Bernoulli trials with a binomial distribution. It has two parameters: the number of trials, and the probability of success for each trial:

X ~ Bin(n, p)

Our question is answered by this piece-wise function:

P(tie) = P(X = n/2)   if n is even
P(tie) = 0            otherwise

All we need to do is plug in the parameters and we’re done. We’ve been given n = 3030 in our problem definition. But wait a minute, what about p? The problem definition states that votes are Bernoulli trials, but we know nothing about p!

In order to create intuition for the situation, let me pose two related questions.

What is the probability of getting 5 heads in 10 trials when tossing a fair coin?

What is the probability of getting 5 heads in 10 trials when tossing a coin where the only thing we know about the coin is that it can land heads or tails?

In the first question we have been given information about the coin we are tossing, which we input into the binomial model. In the second question we know nothing about the coin, and therefore nothing about the second parameter p.

This is precisely the case with our election problem. So how do we proceed? In both cases the answer is the same: we must construct a version of the binomial that allows us to represent this state of information. The beta-binomial probability distribution comes to the rescue. From Wikipedia:

In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random.

I hope something rang a bell when you saw the word "unknown" above; this is exactly our situation. What we do, therefore, is construct a non-informative prior over p that represents our lack of information about that parameter. In the beta-binomial distribution this prior takes the form of a beta distribution, and the usual choice of non-informative prior is Beta(1, 1), with alpha = beta = 1. You can see how this choice of prior favors no value of p:

Remember it’s a probability density function

Having represented our state of knowledge about p as the choice of prior Beta(1, 1), and given that the parameter n is 3030, we now have all the ingredients to calculate things in a way that is consistent with our problem definition. We do this by using the probability mass function of the beta-binomial:

P(X = k) = (n choose k) * Beta(k + alpha, n - k + beta) / Beta(alpha, beta)

We therefore want (since 3030 is even):

P(X = 1515)

= (3030 choose 1515) * Beta(1515 + 1, 1515 + 1) / Beta(1, 1)

= 1/3031

Does that fraction seem funny? That value is precisely one divided by the total number of possible election results. You can see this by considering that results can go from [Yes 0 – 3030 No] all the way up to [Yes 3030 – 0 No]. And in fact, using our beta-binomial model with Beta(1, 1)  all these results are given the same probability: 1/3031.
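
A quick numerical check (scipy.stats.betabinom requires a reasonably recent SciPy release):

from scipy.stats import betabinom

n = 3030
# flat Beta(1, 1) prior over p: every result from 0 to 3030 Yes votes is equally likely
print(betabinom(n, 1, 1).pmf(1515))  # ~0.00033, i.e. 1/3031
print(1.0 / 3031)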

This should come as no surprise: as we've said repeatedly, the problem definition is such that we know nothing about the election, so we have no way to favor one result over another. You can see this in the plot below; every result has probability 1/3031 = 0.00033.

probability of yes votes

The p = 0.5 mistake

In spite of all of the above, most of the people that analyzed our problem got another result, not 1/3031, but instead 0.0145. This corresponds to calculating a binomial distribution with p = 0.5:

X ~ Binomial(3030, 0.5)

P(X=1515)

= (3030 choose 1515) * (0.5^1515) * (0.5^1515)

= 0.01449

How did they arrive at the assumption p = 0.5? Well, it seems that those who went this route did not know about the beta-binomial distribution and the beta prior that allows us to represent knowledge about p. Without these tools they made an unwarranted assumption: that the lack of information about p is equivalent to 100% certainty that p is 0.5. The source of this confusion is an insidious "coincidence":

The probability of heads and tails for an unknown coin “happens” to be exactly the same as that for a coin which we know with 100% probability that it is fair.

Let me restate that

P(head for a single coin toss we know nothing about) = 0.5
P(head for a single coin toss we know 100% to be fair) = 0.5

Because the value is the same, it’s easy to jump to the conclusion that a series of coin tosses for a coin that we know nothing about is treated the same way as a series of coin tosses for which we know for sure that the coin is fair! Unfortunately the above coincidence does not reappear:

P(n successes for coin tosses we know nothing about) ≠ P(n successes for coin tosses we know 100% to be fair)

To illustrate that setting p=0.5 (or any other point value) represents zero uncertainty about the value of p, let’s plot a few graphs for priors and probabilities of ties. This will show how our non-informative prior Beta(1, 1) progressively approximates the p=0.5 assumption as we reduce its uncertainty.

  • alpha = beta = 1 (non-informative)
probability over p
probability of yes votes

probability of tie = 1 / 3031 = 0.00033

  • alpha = beta = 10

With Beta(10, 10) the probability of tie increases from 1/3031 to 0.0012:

probability over p
probability of yes votes

probability of tie = 0.0012

  • alpha = beta = 100
probability over p
probability of yes votes

probability of tie = 0.0036

  • alpha = beta = 10000
probability over p
probability of yes votes

probability of tie = 0.0135

  • alpha = beta = 5×10^5

probability of tie = 0.01447

A formal version of the above trend is:

As the variance of our prior tends to zero, the probability of a tie tends to 0.01449 (the value obtained with p = 0.5).
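
The whole trend can be reproduced in a few lines, again assuming a SciPy release that includes scipy.stats.betabinom:

from scipy.stats import betabinom, binom

n = 3030
for a in (1, 10, 100, 10000, 500000):
    # symmetric Beta(a, a) priors concentrate more and more tightly around p = 0.5
    print(a, betabinom(n, a, a).pmf(n // 2))

print(binom(n, 0.5).pmf(n // 2))  # the limiting value, ~0.01449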

How can we interpret p?

Given the assumption that voter choices are  Bernoulli trials, what can be said about the significance of p? We can offer an interpretation, although this won’t change any of our results above.

Having said this, consider that p describes how likely it is that a voter selects Yes in the election. If we interpret that a voter chooses Yes or No depending on his/her preferences and the content of the question posed in the election, then p represents the relationship between the election content and the set of voters' preferences.

Saying that we don't know anything about the election is like saying that we know nothing about the voters' preferences and the question posed to them. If, for example, we knew that the question was about animal rights, and we also knew that the voters were animal rights activists, then we'd probably have a high value of p. Conversely, if we asked gun supporters about gun control, p would be low. And if we asked a generally controversial question of the general population, we'd have p around 0.5.

Unless we have prior information that rules out or penalizes certain combinations of election questions and preferences, we must use a non-informative prior for p, such as Beta(1, 1).

What about using more information?

In most of this post I’ve been talking about an ideal case that is unrealistic. If we do have information about the election beyond its bare description we can incorporate it into our beta prior the same way we’ve done above.

This is precisely what makes bayesian models useful: prior information is made explicit and inferences are principled. I won’t go into the details of this as it would merit a post on its own, only note that using prior information must be done carefully to avoid inconsistencies like the ones described here.

The pairwise-bradleyterry model for pairwise voting

In a previous post we discussed pairwise voting and the pairwise-beta model as a way to obtain a global ranking over candidates using bayesian inference with the beta distribution. In that post we remarked in the final paragraph that the pairwise-beta model is not perfect:

In particular, it does not exploit information about the strength of the opposing item in a pairwise comparison.

In this post we will look at a better model which addresses this particular problem, albeit at a computational cost. To begin we present a pathological case which exhibits the problem when using the pairwise-beta.

Consider the following set of pairwise ballots, where A, B, C, D, E and F are options, and A > B indicates that A is preferred to B. There are 5 ballots:

A > B

B > C

C > D

D > E

F > E

Applying the pairwise-beta method to this set of ballots yields the following output (options A-F are referred to as numbers 0-5):

[(0, {'wins': 1, 'losses': 0, u'score': 0.6666666666666666}), 
(5, {'wins': 1, 'losses': 0, u'score': 0.6666666666666666}), 
(1, {'wins': 1, 'losses': 1, u'score': 0.5}), 
(2, {'wins': 1, 'losses': 1, u'score': 0.5}), 
(3, {'wins': 1, 'losses': 1, u'score': 0.5}), 
(4, {'wins': 0, 'losses': 2, u'score': 0.25})]

which is equivalent to the following ranking:

  1. A, F
  2. B, C, D
  3. E

A and F share the first position. B, C and D share the second position. E is last.

Hopefully the problem in this ranking is apparent: the strength of the opposing option in a pairwise comparison is not affecting the global ranking. This is why option F, which only beats the last option, is ranked at the same position as A, which “transitively” beats every other option. Similarly, options B, C and D are ranked at the same level, even though presumably option B should be stronger as it beats option C which beats option D.

In other words, beating a strong option should indicate more strength than beating a weak option. Similarly, being beaten by a strong option should indicate less weakness than being beaten by a weak option.

We can resort to the Bradley-Terry [1] model to address these shortcomings. The Bradley-Terry model is a probabilistic model that can be used to predict the outcome of pairwise comparisons, as well as to obtain a global ranking from them. It has the following form:

P(i beats j) = p_i / (p_i + p_j)

and in logit form[2]:

logit P(i beats j) = log(p_i) - log(p_j) = λ_i - λ_j

The parameters (the p's, or equivalently the lambdas) can be fitted using maximum likelihood estimation. One can consider them to represent the relative strength of options and therefore to give a global ranking, although strictly speaking their interpretation is rooted in probabilities of outcomes of comparisons.
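
As a small illustration of the logit form, here is a sketch of how assumed strengths translate into win probabilities (the λ values are made up):

import math

def p_beats(lambda_i, lambda_j):
    """Bradley-Terry win probability from the logit form: logit P(i > j) = lambda_i - lambda_j."""
    return 1.0 / (1.0 + math.exp(lambda_j - lambda_i))

print(p_beats(2.0, 0.5))  # a stronger option wins often (~0.82)
print(p_beats(0.5, 0.5))  # equal strengths -> 0.5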

In order to apply this model we can use the BradleyTerry2 R package by Turner and Firth [2], which fits the model using tabular input data. Armed with this package, all we need is some extra plumbing in our Agora Voting tallying component and we're ready to go. Running it against the same set of ballots as above, we get:

[(0, {'score': 73.69821}),
 (1, {'score': 49.13214}),
 (2, {'score': 24.56607}),
 (3, {'score': 1.186101e-06}),
 (5, {'score': 0.0}),
 (4, {'score': -24.56607})]

which is equivalent to the following ranking:

  1. A
  2. B
  3. C
  4. D
  5. F
  6. E

Notice how this ranking does away with all the problems we mentioned with the pairwise-beta result. In particular, note how option F, which above was ranked joint first, is in this case ranked fifth. This is because it beat option E, which is last, and therefore not much strength can be inferred from that comparison.

Before concluding that the pairwise-beta model is terrible, remember that the results we got here correspond to a handpicked pathological set of ballots. In general it seems reasonable to expect results from both models to converge as more data accumulates and the strength of opponents evens out. This hypothesis seems to match what is stated in the work by Salganik [3], where the pairwise-beta and a more robust model are compared:

In the cases considered in the paper, the two estimates of the score were very similar; there was a correlation of about 0.95 in both cases.

In summary, in this and the previous post we have described two models that can be used for pairwise elections, where voters are asked to compare options in pairs. We have seen how one of the models works well and is easy to calculate, but can potentially give unrealistic rankings when data is sparse. We then considered a second, more robust model which addresses this problem but is computationally more expensive. Further work is required to determine exactly how computationally demanding our pairwise-bradleyterry implementation is.

 


[1] BRADLEY, R. A. and TERRY, M. E. (1952). Rank analysis of incomplete block designs. I. The method of paired comparisons.  – http://www.jstor.org/stable/2334029

[2] Bradley-Terry Models in R: The BradleyTerry2 Package – http://www.jstatsoft.org/v48/i09/paper

[3] Wiki surveys: Open and quantifiable social data collection -http://arxiv.org/pdf/1202.0500v2.pdf

The pairwise-beta model for pairwise voting

In a pairwise vote, voters are asked to repeatedly pick between pairs of options, selecting the one they favor of the two. The procedure then combines all these pairwise choices to establish a global ranking over all options. A pairwise vote is statistical in nature: it must infer preference data that voters have not explicitly stated in order to obtain a result.

This statistical property allows obtaining an approximate preference ordering over a large list of options without overwhelming the voter with too much work. For example, if each voter was asked to establish a preference over 50 items, they would be exhausted and participation would suffer.

The pairwise-beta is a simple Bayesian method used to rank items in pairwise votes. It is based on the beta-binomial model [1], which is composed of a beta prior and a binomial likelihood. This model is very tractable: because the beta distribution is conjugate to the binomial, the posterior also has beta form and is easily obtained:

Beta(alpha + successes, beta + failures)

I will not present a formal justification of the pairwise-beta model for pairwise comparisons, rather I will present some intuitions that should convey how and why the model works.

The key bridge for interpreting pairwise comparisons in terms of the beta-binomial model is to realise that the better/worse relation between options maps directly onto the success/failure outcomes of a Bernoulli trial. We can thus establish a correspondence[4] between pairwise comparisons and Bernoulli trials:

  • each item i corresponds to a sequence of Bernoulli trials, Bi
  • a comparison in which i wins corresponds to a success in Bi
  • a comparison in which i loses corresponds to a failure in Bi

The question we are trying to answer is

Given the proportion of comparisons in which i wins, what is the proportion of items that are globally better than i?

that reformulated in terms of our correspondences becomes

Given a sequence of bernoulli trials Bi, what is the proportion of successes/losses for i?

Which is a case of standard binomial proportion estimation[2]. As we noted before, the posterior of the beta-binomial is also a beta distribution, for item i given by

Beta(alpha + wins_i, beta + losses_i)

If we want a point estimate we just use the mean of this distribution, which is

mean_i = (alpha + wins_i) / (alpha + beta + wins_i + losses_i)

This gives us, for each item i, an estimate of the proportion of items that are better/worse than it. This leads directly to a global ranking: the best-ranked items will be those which are estimated to be better than most other items.

In summary, the procedure is

  1. For each item i, obtain the corresponding sequence of Bernoulli trials Bi
  2. For each item i, calculate the mean of the posterior beta distribution given the data from 1)
  3. Create a global ranking based on the proportions for each item, as calculated in 2) (a minimal sketch in code follows)
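
Here is a minimal sketch of the whole procedure in Python, reusing the example ballots from the previous section; the options and counts are purely illustrative:

from collections import defaultdict

# pairwise ballots as (winner, loser) pairs
ballots = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("F", "E")]

wins, losses = defaultdict(int), defaultdict(int)
for winner, loser in ballots:
    wins[winner] += 1
    losses[loser] += 1

options = set(wins) | set(losses)
# posterior mean with a uniform Beta(1, 1) prior: (wins + 1) / (wins + losses + 2)
scores = {o: (wins[o] + 1.0) / (wins[o] + losses[o] + 2.0) for o in options}
for o in sorted(scores, key=scores.get, reverse=True):
    print(o, scores[o])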

The pairwise-beta model is simple but not perfect. In particular it does not exploit information about the strength of the opposing item in a pairwise comparison. However, despite this drawback it performs well in practice. Please refer to [3] for details.

 


[1] http://www.cs.cmu.edu/~10701/lecture/technote2_betabinomial.pdf

[2] http://homepage.ntu.edu.tw/~ntucbsc/%A5%CD%AA%AB%C2%E5%BE%C7%B2%CE%ADp%BF%D4%B8%DF%A4@971008/[3]%20Chapter%208.pdf

[3] http://arxiv.org/pdf/1202.0500v2.pdf

[4] The correspondence is two to one, as each comparison yields two bernoulli trials

Error handling: return values and exceptions

My colleague edulix started a discussion on the golang list about the merits of Go's error handling. This got me thinking about the problem of error handling in general, and about how no language seems to have gotten it quite right. What follows is a quick braindump.

Two approaches

There are two prominent approaches to error handling in software engineering today, using return values and exceptions.

With return value error handling, errors are indicated through the values returned from functions; the caller has to write checks on these values to detect error conditions. Originally, error returns were encoded as values falling outside the range corresponding to normal operation.

For example, when opening a file with fopen in C or C++, the normal return is a pointer to an object representing the file. But if an error occurs, the function returns null (additional information about the error is available from a global variable, errno). Just as the caller of the function uses normal return values, that same caller is responsible for dealing with errors present in the return.

Exceptions establish a separate channel to relay error information back from functions. When using exceptions, a function's return only holds values that correspond to normal behavior. In this sense the values returned by functions are more immediately identified with the function's purpose, while the details of what can and does go wrong are separated into the exceptional channel.

For example, the purpose of Java's parseInt function is to parse an integer, so the return value's content is exclusively a consequence of this purpose: the integer parsed from the string. If something goes wrong when attempting to achieve this purpose, it is represented by the throwing of an exception which holds the error details. Whereas with return values there is no need for additional machinery beyond normal function calls, exceptions require an extra mechanism: try/catch blocks.
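
The contrast can be sketched in a few lines of Python; parse_int below is a made-up helper that mimics the convention of returning the error as a value:

# Exception channel: int() signals failure out of band by raising ValueError.
try:
    n = int("not a number")
except ValueError:
    n = 0  # handle the error here, or let it propagate up the stack

# Return-value style: the error is folded into the return value itself.
def parse_int(s):
    """Return (value, error); exactly one of the two is meaningful."""
    try:
        return int(s), None
    except ValueError as err:
        return None, err

value, err = parse_int("not a number")
if err is not None:  # the caller must remember to check
    value = 0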

Exceptions’ claimed virtues

Exceptions try to declutter code (http://mortoray.com/2012/03/08/the-necessity-of-exceptions/)

Exceptions aim to be an improvement over return values. Here are some possible shortcomings of the return value approach:

  • The code is cluttered with error checking that reduces its readability; intent is obscured.
  • Sometimes the scope at which an error occurs is not the best one to handle it in. In these cases one has to manually propagate errors backwards, through several returns, until an adequate error handling scope is reached. This is cumbersome.
  • If one forgets to check for errors the program may continue to run in an inconsistent state, causing fatal errors down the line that may not be easy to understand.

Exceptions try to address these possible shortcomings:

  • With exceptions normal program flow is separated from error handling code, which lives in special catch blocks for that purpose. This declutters normal behaviour code, making intent clear and readable.
  • With exceptions, programmers have the option of handling errors at the right scope, without having to manually return across several levels. When exceptions are not handled at the scope where the error occurs, they are propagated automatically up the stack until a suitable handler is found.
  • If exceptions are not handled, the program crashes immediately instead of continuing in an inconsistent state. A stack trace shows the error, and the call stack leading to the error.

But it’s not all good

The story doesn't end here, of course: exceptions have their own problems. Some say that they are worse than the solution they were trying to improve upon. This is the position that many proponents of Go's error handling defend, stating that returns with multiple values (a feature not present originally in C, hence errno and the like) are not only sufficient but simply better than exceptions:

  • Because exceptions propagate up and can be handled elsewhere, programmers are tempted to ignore errors and “let someone else deal with it”. This lazy behaviour can lead to errors being handled too late or not at all.
  • It is in principle impossible to know whether a given function call can throw an exception, short of analyzing its entire forward call tree. As a consequence, the use of exceptions in a language is equivalent to hidden goto’s that can jump control back up the call stack at any function invocation. In contrast, when using return values error information is formally present in the function signature.

Whereas before we listed exception propagation as an improvement (allowing error handling at the right scope), here it is listed as a weakness: it can encourage bad programming practice. But it is the second criticism that I find more significant: exceptions are opaque and may lead to unpredictable behaviour. With such an unpredictable ingredient, the programmer is unable to anticipate and control the potential paths his/her code can take, as lurking under every function call is the potential for a jump in execution back up the call stack.

Of course, in real world practice, exceptions are not the randomness disaster that this description may suggest[1]. Properly documented functions do communicate to the programmer, to a reasonable approximation, what can and cannot happen exception-wise. And properly implemented functions do not whimsically throw exceptions at every opportunity, “magically” making your code jump to a random location.

Good intentions: checked exceptions

What about checked exceptions, you say? The rationale for checked exceptions in Java is precisely one that formally addresses the very problem we just described, because

With checked exceptions, function calls must, in their type signature, specify what exceptions can be thrown.

Checked exceptions are part of a function’s type and the compiler ensures that exceptions are either caught or declared to be thrown (a use of the type system that fits well with what I’ve advocated here). Isn’t this exactly what’s needed? The subtle problem is that the phrase “caught or declared to be thrown” does not mean the same as “correctly handled”. The use of checked exceptions encourages lazy behaviour from programmers that makes things even worse. When forced by the compiler to deal with checked exceptions, programmers bypass it by

  • Indiscriminately adding throws clauses, which just add noise to the code
  • Writing empty try/catch blocks that never get populated with error handling code, potentially resulting in silent failures

The second problem is especially dangerous: it can result in exceptions being swallowed and silenced until some larger error occurs down the line, an error which will be very hard to diagnose. If you recall, this is precisely the third drawback we listed for error handling via return values. Both problems arise not from the language feature itself, but from imperfect programming.

The bottom line for a language feature is not what some ideal programmer can do with it, but the use it encourages in the real world. In this case the argument goes that checked exceptions (and as we saw above, error return values) make it harder to write correct error handling code; it requires additional discipline not to shoot oneself in the foot. Which is a shame, since the rationale behind checked exceptions is very appealing: documentation at the type level and compiler enforcement of proper error handling.

Quick recap. Exceptions aim to be an improvement over return values offering benefits mentioned above: code decluttering, error handling at the right scope via propagation and failing fast to avoid hidden errors. Of these three features, the second is questioned and undermined by the two points that proponents of return values make. Opacity in particular is a strong drawback. Checked exceptions aim to resolve this by formally including exception information in the function type and enforcing error handling through the compiler, but the industry consensus seems to be that they do more harm than good.

The way forward

It’s difficult to reach a general conclusion in favor of exceptions or return values, the matter is unclear. But let’s assume that exceptions are no good. Granted this assumption, it would not mean that the problems with return values magically went away. Is there some way, besides exceptions, to address these problems?

One approach I find promising achieves decluttering within the confines of errors as return values, using a functional style. This technique exploits several programming language features: sum types, pattern matching and monads. But those are just details; what's important is the end result. See this example in Rust:

fn from_file(path: &Path) -> ProgramResult<()> {
    let file = try!(File::open(path).read_to_str().map_err(to_program));
    try!(do_something(file.as_str()));
    Ok(())
}

The first two lines of the function are operations that can fail; they return a sum type that can represent either a success or a failure. The desired behaviour is that if any error occurs at these lines, the function from_file should stop executing and return that error.

Notice how the logic that achieves this does not clutter the code; it is embedded in the try! macro, which uses pattern matching on the sum type either to return a value to be used in the next line (the file variable) or to short-circuit program flow back to the caller. If everything works, the code eventually returns Ok. Let's restate the main points:

  • from_file has a sum type return that indicates that the function may fail.
  • the individual invocations within the function itself also return that type.
  • the try! macro extracts successful values from these invocations, or short-circuits execution, passing the error as the return of from_file.
  • the caller of the from_file function can proceed the same way or deal with errors directly (via pattern matching)

Besides the internal machinery and language features in use, I hope it’s clear that this style of handling errors mimics some of the positive aspects of exceptions without using anything but return values. Here is another example which has a similar[2] intent, this time in Scala

val value = for {
    row <- database fetchRowById 42
    key <- row get "port_key"
    value <- myMap get key
} yield value

val result = value getOrElse defaultValue

This code shows a series of operations that can each fail: getting a row from a database, getting one of its columns, and then doing a lookup on a map. Finally, if any of the steps fail a default value is assigned. The equivalent code with standard null checking would be an ugly and repetitive series of nested if-blocks checking for null. As in the previous example, this error checking logic is not cluttering the code, but occurs behind the scenes thanks to the Option monad. Again, no exceptions, just return values.

Error handling is hard. After years of using exceptions it is still controversial whether they are a net gain or a step back. We cannot expect that a language feature will turn up and suddenly solve everything. Having said that, it seems to me that the functional style we have seen has something to offer in at least one of the areas where traditional return values fall short. Time will tell whether this approach is a net gain.


Notes

[1] If real-world application of exceptions were in fact disastrous, software written with exceptions would simply never work, and apparently it does; exceptions are in wide use in countless systems today.

So a reasonable position to take is that criticisms of exceptions are on the mark when pointing out that the exception mechanism is opaque and potentially unpredictable. This criticism does not mean that software written with exceptions is inherently flawed, but that exceptions make correct error handling hard. Then again, error handling is a hard problem to begin with. Does the opacity of exceptions nullify their advantages?

[2] In this case the problem is how to deal with failure in the form of null values.

[3] In Rust for example, the problem of swallowing errors is also addressed with #[must_use] , see http://doc.rust-lang.org/std/result/

Further reading

Exceptions

http://www.joelonsoftware.com/items/2003/10/13.html

http://www.lighterra.com/papers/exceptionsharmful/

The Necessity of Exceptions

Checked exceptions

http://www.artima.com/intv/handcuffs3.html

http://googletesting.blogspot.ru/2009/09/checked-exceptions-i-love-you-but-you.html

Rust

http://doc.rust-lang.org/std/io/index.html

http://rustbyexample.com/result/try.html

http://www.hoverbear.org/2014/08/12/Option-Monads-in-Rust/

Scala

http://danielwestheide.com/blog/2012/12/26/the-neophytes-guide-to-scala-part-6-error-handling-with-try.html

http://twitter.github.io/effectivescala/#Error handling-Handling exceptions

Exploring Scala Options

https://groups.google.com/forum/#!topic/scala-debate/K6wv6KphJK0%5B101-125-false%5D

Integer encoding of multiple-choice ballots (2)

In the last post we saw how simple arithmetic with the right choice of base can encode integer lists for multiple ballot choices in an elegant and compact way. A couple of points were mentioned in the notes section, one of them was

our scheme allows encoding repeated choices, which are usually not legal ballots. For example, one can encode the choice 1,1,1 but that would be an illegal vote under all the voting systems we mentioned as examples. The fact that legal ballots are a subset of all possible encodings means that the encoding is necessarily suboptimal with respect to those requirements. We will look at an alternative scheme in the next post.

The alternative scheme we show below attempts to optimize the encoding by using a variable base as digits are encoded. The main idea is that as the chosen integers in the list are encoded, the remaining ones are constrained, since they cannot be repeated. Thus the base can be reduced as fewer choices remain possible. In practice the process is complicated by the need to keep track of which digits correspond to which integers, as gaps form among the remaining integers and the meaning of each digit changes. Here is a Python implementation

from collections import deque

def encode2(digits, choices = 9):
  radix = choices + 1
  # integers still available to be chosen
  present = list(range(0, radix))
  digits2 = deque()
  # smallest base used; it grows by one for each digit processed below
  radix = radix - len(digits) + 2
  for d in digits:
    # each choice is encoded as its index among the remaining integers
    index = present.index(int(d))
    present.remove(int(d))
    digits2.appendleft(index)

  number = 0
  print(digits2)
  for d in digits2:
    number += d
    number = number * radix
    # print(d, radix, number)
    radix = radix + 1

  # undo the last multiplication of the loop above
  return number // (radix - 1)

def decode2(number, choices = 9):
  radix = choices + 1
  print("decode %d" % number)
  digits = deque()
  present = list(range(0, radix))
  while(number > 0):
    # peel off digits, decreasing the base at each step
    # print(number, radix, number % radix)
    digits.append(number % radix)
    number = number // radix
    radix = radix - 1

  print(digits)
  ret = deque()
  for d in digits:
    # map each index back to the corresponding remaining integer
    # print(d, present, ret)
    val = present[d]
    ret.append(val)
    present.remove(val)

  return ret

and a sample session using both encoders

>>> execfile("encoder.py")
>>> decode(encode([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], 21), 21)
decode 14606467545964956303452810
deque([1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L])
>>> decode2(encode2([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], 21), 21)
decode 36697695360790800022
deque([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

Note the shorter value produced by the second encoder

encode: 14606467545964956303452810

encode2: 36697695360790800022

Despite the reduction, the second encoder is still not optimal (the first encoder is optimal if repeated choices are allowed); the range of output numbers is larger than the number of legal ballots. It would be interesting to work out the most compact solution; a detailed analysis could compare these schemes systematically to obtain quantitative measures of space efficiency.
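
As a rough starting point for such an analysis, we could compare the sizes of the two encoded values above against log2 of the number of legal ballots, which lower-bounds any encoding in the worst case. A quick sketch of mine (and a crude one: a single ballot’s encoding says nothing definitive about the worst case):

from math import log2, factorial

# the 20-out-of-21 example from the session above
e1 = 14606467545964956303452810   # output of encode
e2 = 36697695360790800022         # output of encode2

# legal ballots: ordered selections of 20 distinct choices out of 21
legal = factorial(21) // factorial(21 - 20)

# bits used by each encoded value vs. the worst-case minimum
print(e1.bit_length(), e2.bit_length(), log2(legal))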


Integer encoding of multiple-choice ballots

Secure voting systems supporting privacy through encryption must encode ballot contents into integers before they can be encrypted[1]. This encoding step is mostly trivial. For example, imagine a yes-no-abstain ballot. One can simply apply the following mapping to yield integer plaintexts for corresponding ballots:

Yes => 1
No => 2
Abstain => 3

But things can get a bit more involved when dealing with multiple-selection ballots. These are ballots where the voter makes more than one choice. They can be either ranked ballots, where the voter specifies a preference relation over the selections, or unranked ballots, where no such preference exists. Examples of voting systems using the former are single transferable vote and instant runoff voting. Systems like approval voting or plurality at large are examples of the second type, using unranked ballots.

Imagine we are using one of these systems to elect a candidate out of a field of four: Alice, Bob, Charlie, and Donna. We first apply the trivial mapping:

Alice => 1
Bob => 2
Charlie => 3
Donna => 4

But how do we encode a complete ballot? For example, a ballot like this (X corresponds to marked choices)

Alice X
Bob X
Charlie O
Donna O

Unlike the yes-no-abstain ballot above, the content of the ballot corresponds to a list of integers: 1 and 2. We could use the following mapping

encode: Alice, Bob => 1, 2 => 12

The ballot is encoded as the number 12, resulting from the concatenation of the string representations of each of the integers. But what if there are more candidates, and we need to encode something like:

encode: 11, 3 => 113

That won’t work, because 113 could represent either 11 and 3, or 1 and 13.

decode: 113 => ?

We can remedy this by adding some padding such that each choice is well separated:

encode: 11, 3 => “1103” => 1103

Then when decoding, we convert the integer to a string, split it every 2 characters, and finally obtain integers for each of the candidates:

decode: 1103 => “1103” => “11”, “03” => 11, 03

But there’s still a problem: what about this choice:

encode: 1, 13 => “0113” => 113

We run into trouble here, because the string “0113” corresponds to the integer 113; there is no mathematical difference between “0113” and “113”. To fix this, when decoding we can first check that the string length is a multiple of 2 (since we are using 2 chars per candidate integer); if it is not, we prepend the required zeros. The encode-decode process would be

encode: 1, 13 => “0113” => 113
decode: 113 => “113” (prepend zero) => “0113”  => “01”, “13” => 1, 13
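
As a toy illustration of this string-based scheme (my own code, assuming two characters per candidate and nothing more):

WIDTH = 2  # two characters per candidate integer

def encode_padded(choices):
    # concatenate zero-padded string representations, then parse as an integer
    return int("".join(str(c).zfill(WIDTH) for c in choices))

def decode_padded(number):
    s = str(number)
    # restore the leading zeros lost in the integer representation
    if len(s) % WIDTH != 0:
        s = s.zfill(len(s) + WIDTH - len(s) % WIDTH)
    return [int(s[i:i + WIDTH]) for i in range(0, len(s), WIDTH)]

print(encode_padded([11, 3]))   # 1103
print(encode_padded([1, 13]))   # 113, the leading zero disappears
print(decode_padded(113))       # [1, 13]
print(decode_padded(1103))      # [11, 3]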

I hear you complain that all this concatenation, padding, and prepending looks a bit hackish. Can we do better?

Let’s go back to our first example, when we simply wanted to do

encode: Alice, Bob => 1, 2 => 12

This looked very nice and simple. Can’t we do something like this in general, without any string hackery? The first step is to go back to the definition of decimal numbers.

(Figure: the positional definition of decimal numbers, from cplusplus.com. A numeral with digits d_n … d_1 d_0 stands for d_n * 10^n + … + d_1 * 10^1 + d_0 * 10^0.)

In these terms, the encoding 1, 2 => 12 corresponds to

(10^1) * 1 + (10^0) * 2 = 12

Here we have expressed the encoding of 1, 2 using arithmetic, no string operations involved. The ballot choices are interpreted as digits according to the mathematical definition of decimal numbers. (In fact, this is what goes on under the covers when you convert a string like “12” into the number 12.) This gives us a purely arithmetical description of the simple mapping we started with. Things then got complicated when we considered the possibility of more choices (candidates) in the ballot. Let’s apply our mapping to that problematic ballot:

encode: 11, 3 => (10^1) * 11 + (10^0) * 3 = 113

Our new procedure fails the same way: the simple scheme where each digit represents one choice cannot be accommodated by the decimal representation of choices, and the result 113 is ambiguous. But wait, who says we have to encode according to a decimal representation? What if we were to map choices to hexadecimal digits instead:

encode: 11, 3 => (10^1) * B + (10^0) * 3 = B3

And we’ve restored simplicity and correctness to our scheme. B3 encodes the choice 11, 3 with one choice per digit and no ambiguity! If B3 looks like cheating, just remember: B3 is the representation of a number that in decimal turns out to be 179. The encode-decode process could just as well be written as

encode: 11, 3 => 179
decode: 179 => 11, 3
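
We can verify the arithmetic with a couple of lines of plain Python (my own check, not part of the post’s code):

>>> int("B3", 16)
179
>>> divmod(179, 16)
(11, 3)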

The bottom line is we can encode lists of integers into an integer provided we use the optimal base, which is equal to the number of possible choices in the ballot plus one.

Let’s revisit our original example, with Alice, Bob, Charlie and Donna. Since we have four candidates, our base is 4 + 1 = 5. The encoding is thus:

encode: Alice, Bob => 1, 2 => (10^1) * 1 + (10^0) * 2 = 12 (base 5) = 7 (decimal)

in short:

encode: 1, 2 => 7
decode: 7 => 1, 2

Note that not only is this method simpler with no string operations or padding, but the encoded values are smaller. Compare:

encode: 1, 2 => 12
encode: 11, 3 => 1103

with

encode: 1, 2 => 7
encode: 11, 3 => 179

This should not come as a surprise: encoding with the specified base is the most compact[2] encoding possible (proof left as an exercise for the reader). A larger base wastes space encoding ballot contents that are not possible, whereas a smaller base is insufficient to encode all possible ballots.

Finally, here is a Python implementation of the encoder we have proposed

from collections import deque

def encode(choices, max_choices):
    radix = max_choices + 1
    number = 0
    for d in choices:
        # each choice becomes one digit of a number written in base radix
        number += int(d)
        number = number * radix

    # undo the last multiplication of the loop
    return number // radix

def decode(number, max_choices):
    radix = max_choices + 1
    choices = deque()
    while(number > 0):
        # the least significant digit corresponds to the last choice
        choices.appendleft(number % radix)
        number = number // radix

    return choices
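
For example, with the Alice and Bob ballot from before (a hypothetical session, assuming the two functions above are loaded):

>>> encode([1, 2], 4)
7
>>> decode(7, 4)
deque([1, 2])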

In the next post we will further discuss details as to the compactness of the encoding mentioned in [2].


[1] In the case of ElGamal encryption used in Agora Voting, the plaintext must be encoded into an element of the multiplicative subgroup G of order q of the ring Zp, where p and q are suitably chosen prime numbers. In order to do this, the plaintext must be first encoded into an integer, after which it is mapped to a member of G, and subsequently encrypted.

[2] A few caveats must be mentioned. First, we are using 1-based indices to represent choices, which means some values are unused. Second, our scheme allows encoding repeated choices, which are usually not legal ballots. For example, one can encode the choice 1,1,1 but that would be an illegal vote under all the voting systems we mentioned as examples. The fact that legal ballots are a subset of all possible encodings means that the encoding is necessarily suboptimal with respect to those requirements. We will look at an alternative scheme in the next post.

Voter fraud and bayesian inference – part 3

Here’s part1 and part2.

Welcome back. In the previous posts we saw how to do inference using the beta-binomial to get probabilities for the proportion of fake ballots in an election, as well as an upper bound on the probability that the election result is incorrect. We briefly mentioned the hypergeometric distribution but did not discuss it further or use it.

Like the binomial (and beta-binomial), the hypergeometric distribution can be used to model the number of successes in a series of sampling events with a binary outcome. The distinction is that the binomial models sampling with replacement, whereas the hypergeometric models sampling without replacement. In other words, if we are sampling from a box, the binomial applies when each sample is returned to the box before drawing again. The hypergeometric applies when the sample is not returned. But wait, doesn’t that mean that we’ve been doing it wrong?

When auditing ballots we keep track of those already checked; a ballot is never audited twice. Shouldn’t we then be using the hypergeometric distribution? It turns out that the hypergeometric distribution approaches the binomial in the limit of a large total number of items compared to the number sampled. This fits our case, as we can only audit a limited number of ballots compared to all those cast.
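
To see this convergence concretely, here is a small sketch of mine (using scipy, which is an assumption; the post’s own figures were generated in javascript) comparing the two distributions for a fixed audit size as the total number of ballots grows:

from scipy.stats import hypergeom, binom

n_draws = 100   # ballots audited
p_fake = 0.05   # proportion of fake ballots in the population

for N in (200, 1000, 10000, 100000):
    n_fake = int(N * p_fake)
    # hypergeom.pmf(k, M, n, N): k successes when drawing N items
    # from a population of M containing n successes
    h = hypergeom.pmf(5, N, n_fake, n_draws)
    b = binom.pmf(5, n_draws, p_fake)
    print(N, round(h, 5), round(b, 5))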

(Figure: the hypergeometric posterior for increasing values of N; the bottom right panel is the corresponding beta-binomial.)

As we saw in the previous post, the beta distribution is a conjugate prior for the binomial, which makes inference very easy. Unfortunately this is not the case for the hypergeometric. But because of the converging behaviour seen above, we can stick to the beta-binomial’s easy calculations without sacrificing accuracy. For the sake of completeness we will quickly show the posterior for the hypergeometric, following [1]. Incidentally, this “manual calculation” is what allowed us to obtain the images above, through the javascript implementation in the jsfiddle.

P(m fake ballots among N total | k fake found in a sample of n)

= C(m, k) * C(N - m, n - k) / sum over j of [ C(j, k) * C(N - j, n - k) ]

where C(a, b) denotes the binomial coefficient and the sum runs over all values of j consistent with the evidence.

Again, this is just Bayes theorem with the hypergeometric likelihood function and a uniform prior. In [1] it is also pointed out that the normalization factor can be computed directly with this expression

sum over m of [ C(m, k) * C(N - m, n - k) ] = C(N + 1, n + 1)

We use this in the jsfiddle implementation to normalize. Another thing to note is that the hypergeometric posterior is 0 at positions that are inconsistent with the evidence: one cannot have fewer successes than have been observed, nor more than are possible given the evidence. These conditions are checked explicitly in the implementation. Finally, the jsfiddle does not contain an implementation for obtaining the upper bound on the probability of election error; only code for the posterior is present. How about forking it and adding that yourself?
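
The same calculation is easy to sketch in Python (this is my own sketch, not the jsfiddle’s code; math.comb requires Python 3.8+):

from math import comb

def hypergeom_posterior(N, n, k):
    # posterior over m, the total number of fake ballots among N,
    # after observing k fake ballots in an audit of n, with a uniform prior
    post = []
    for m in range(N + 1):
        # zero probability at values inconsistent with the evidence
        if m < k or (N - m) < (n - k):
            post.append(0)
        else:
            post.append(comb(m, k) * comb(N - m, n - k))
    total = comb(N + 1, n + 1)   # closed-form normalization factor
    return [p / total for p in post]

# example: 1000 ballots, 100 audited, 4 fake found
posterior = hypergeom_posterior(1000, 100, 4)
print(sum(posterior))   # ~1.0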

In these three posts we have used bayesian inference to calculate probabilities over the proportion of fake ballots, and from there probabilities that an election result was incorrect. These probabilities could be used to give stakeholders confidence that everything went well, or, conversely, to detect possible fraud and invalidate an election.

I’ll finish the post by mentioning that the techniques we have seen here can be generalized beyond the special case of detecting fake ballots for plurality votes. For example, one could use bayesian inference to conduct ballot audits that check tally correctness, covering not just failures in authentication but also errors in counting. See [2] for this kind of more general treatment.


[1] http://www.wmbriggs.com/public/HGDAmstat4.pdf

[2] https://www.usenix.org/system/files/conference/evtwote12/rivest_bayes_rev_073112.pdf

In this work, because the results of audits are not just binary, and because tallies are not only plurality, the authors use Dirichlet distributions and sampling from posteriors to project possible alternative tallies.

Voter fraud and bayesian inference – part 2

We left off the discussion with

We want to calculate the proportion of fake ballots in an election based on the results of limited audits. We have seen how the binomial and hypergeometric distributions give probabilities for the results of an audit given an assumption about the proportion of fake ballots. Bayes theorem can be used to calculate the inverse probability that we are after, once we have specified a prior.

Bayesian inference is a process that takes prior information and adds evidence to obtain a posterior distribution. In our case this posterior distribution will be over the possible proportion of fake ballots in the set of all ballots. Let’s begin with the binomial case. What prior should we use? One answer is that, since we know nothing about the proportion of fake ballots, we should be indifferent about each possibility. This translates into a uniform prior, where all proportions are equally likely. For example

P(proportion = fake ballots/ total ballots) = 1 / (total ballots + 1)

Since there are n + 1 possibilities for the number of fake ballots, we give each of them the same weight, which is 1 / (n + 1).

Beta + Binomial = Beta-Binomial

Before plugging this into Bayes theorem, a small technical detour. Notice how the prior is itself a probability distribution, defined over the 0.0 – 1.0 interval. That is, the minimum proportion (0.0) means no fake ballots and the maximum (1.0) means all fake ballots. It turns out there is a parametric probability distribution one can use for this interval: the Beta distribution. The Beta distribution has two parameters, alpha and beta. The neutral prior we defined above is equivalent to the Beta distribution with parameters (1, 1)

P(proportion) = 1 / (n + 1) = Beta(1, 1)

We could express other knowledge with different choices of alpha and beta. But what’s the point of using the Beta, besides having a convenient way to specify priors? The point is that the Beta distribution is a conjugate prior of the binomial distribution. This means that the posterior distribution, once the available evidence has been taken into account, is also a Beta distribution, which makes calculating the posterior much easier: inference is just a matter of remapping the parameters of the Beta. Here is the posterior of the Beta distribution when it is used as the prior of the binomial (this is called the beta-binomial model).

P(p | k, n) = P(k | p, n) * P(p) / P(k)

∝ [ p^k * (1 - p)^(n - k) ] * [ p^(alpha - 1) * (1 - p)^(beta - 1) ]

= p^(alpha + k - 1) * (1 - p)^(beta + n - k - 1)

∝ Beta(alpha + k, beta + n - k)

Equations taken from [1]. The first line is just Bayes theorem; the payoff is that the last line corresponds to a beta distribution with different parameters. In summary

P(p | k successes in n trials) = Beta(alpha + k, beta + n - k)

with a beta prior, bayesian inference reduces to remapping the initial parameters alpha and beta to alpha + k and beta + n - k, where k is the number of successes and n is the number of trials. Conjugate priors are an algebraic convenience that makes it easy to obtain analytic expressions for posteriors. End of detour; please refer to [1] for further details.
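
In code, the whole inference step is just this parameter remapping (a trivial sketch; the function name is mine):

def beta_binomial_update(alpha, beta, k, n):
    # beta prior + k successes in n binomial trials -> beta posterior
    return alpha + k, beta + n - k

print(beta_binomial_update(1, 1, 3, 10))   # (4, 8)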

Armed with the beta-binomial, obtaining the posterior given some audit results is simple. If we audited 10 ballots and 3 of them were fake, our posterior would simply be

P(proportion = p | fake audit count = 3 out of 10)

= Beta(1 + 3, 1 + 10 – 3)

= Beta(4, 8)

Here’s what Beta(4, 8) looks like

(Figure: the Beta(4, 8) distribution, peaked at 0.3.)

Note how the peak of the distribution is at 0.3; this makes sense since 3 out of 10 ballots in the sample were invalid. Evidence has transformed our initial uniform prior into the distribution seen above. This meets our original objective: a way to judge how many ballots are fake in the entire set based on limited audits. But it’s not the end of the story. We would also like an estimate of whether or not the election result is correct. As we said previously, this estimate can be used either as a guarantee that all went well or, in the opposite case, to detect a problem and even invalidate the results.
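
To reproduce this posterior numerically, here is a minimal sketch using scipy.stats (my own choice of library; the post’s interactive calculations live in the jsfiddle mentioned at the end):

from scipy.stats import beta

posterior = beta(4, 8)   # Beta(1 + 3, 1 + 10 - 3)

# the mode (4 - 1) / (4 + 8 - 2) = 0.3 matches the sample proportion
print((4 - 1) / (4 + 8 - 2))   # 0.3
print(posterior.mean())        # ~0.333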


The probability that an election result was correct, given uncertainty about fake ballots, depends on two things. One is the proportion of ballots that are fake; this is what we already have a calculation for. The other is the actual election outcome, specifically a measure of how close the result was. The reason is simple: if the election was close, a small number of invalid ballots could cast doubt on its correctness. Conversely, if the result was a landslide, a small number of fake votes cannot change the outcome. For our purposes we will stick with a simple example in which the election decides between two options via simple plurality.

Call the difference between the winning and losing option d

d = winner votes – loser votes

In order for the election to be wrong, there must be a minimum of d fake votes. The existence of d fake votes does not imply that the result was wrong, but d fake votes are a necessary condition. Thus the probability that the number of fake votes is greater than or equal to d is an upper bound on the probability that the election was incorrect. Call this E (for error)

P(proportion of fake votes >= d / total votes) = E

(upper limit on the probability that the election was wrong)

We have P(proportion): it is the posterior we got above. How do we get P(proportion >= some constant)? Through the beta distribution’s cumulative distribution function, which for a Beta(alpha, beta) distribution is

CDF(x) = P(proportion <= x) = (1 / B(alpha, beta)) * integral from 0 to x of t^(alpha - 1) * (1 - t)^(beta - 1) dt

where B(alpha, beta) is the beta function; this is the regularized incomplete beta function.

In order to reverse the inequality, we just need to subtract it from 1 (which gives us the tail distribution). We finally have

Probability of incorrect result

= P(proportion >= d / total ballots)

= 1 – CDF(d / total ballots), where CDF is the posterior’s cumulative distribution function

One final correction: because we have sampled a number of ballots with known results, we must apply our calculation to the remaining ballots.

P(E) = 1 – CDF((d – sampled fake ballots) / (total ballots – sampled ballots))

Let’s try an example: an election between option A and option B with the following numbers.

Votes for A = 550

Votes for B = 450

Total ballots = 1000

Audited ballots = 100

Audited fake ballots = 4

which gives

Posterior = Beta(5, 97)

d = 100

Minimum fraction of fake votes required to change result = (100 – 4) / (1000 – 100) = 0.1066

Upper bound on probability of error

= 1 – CDF(Beta(5, 97), 0.1066)

= 0.01352

In conclusion, the probability of error due to fake ballots in this election is less than or equal to 1.35%.
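
The final figure can be reproduced in a few lines; this is a sketch using scipy.stats rather than the javascript implementation referenced below:

from scipy.stats import beta

a, b = 1 + 4, 1 + 100 - 4                # posterior Beta(5, 97)
threshold = (100 - 4) / (1000 - 100)     # ~0.1066, over the remaining ballots

upper_bound = 1 - beta.cdf(threshold, a, b)
print(upper_bound)   # the post reports roughly 0.0135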


You can find a javascript implementation for everything we’ve seen until now in this jsfiddle. Calculations for the binomial, beta, hypergeometric and cumulative distribution function are done with the jStat javascript library. Wait, you say, what about the hypergeometric? We’ll leave that for the next post, which should be pretty short.


[1] http://www.cs.cmu.edu/~10701/lecture/technote2_betabinomial.pdf