Secure online voting

What do we mean by secure voting?

This is the subject of a new post on the nvotes blog, found here. Here’s the summary:

  • The term “secure voting” is generally thought to refer to cybersecurity and resistance to cyberattacks.
  • However, cybersecurity is a general property of hardware/software that does not reflect the specific requirements of voting as a political process. The secret ballot is an established and indispensable requirement for voting.
  • Secure voting systems must support privacy as well as integrity; these two requirements stand in opposition.
  • In a system supporting privacy, no one, not even system administrators or other privileged observers can violate the secret ballot.
  • In a system supporting end-to-end verifiability, voters can ensure that their vote was cast, recorded, and counted correctly.
  • Cryptographically secure voting systems employ cryptographic technology to satisfy these two properties simultaneously. The gold standard are end-to-end verifiable voting systems.
  • A secure voting system is an end-to-end verifiable voting system that also employs general computer security principles.

A mixnet-based secure voting protocol

This post will describe the main steps and operations that compose the cryptographic protocol of a re-encryption mixnet based voting system we are currently prototyping. This prototype is based around work[1][2] by the E-Voting Group at the Bern University of Applied Sciences, and uses their unicrypt library.  The main elements of the cryptographic scheme are

  • Public, tamper resistant, access controlled bulletin board
  • ElGamal cryptosystem[6]
  • Distributed public key generation
  • Election public key vote encryption with proof of plaintext knowledge
  • Authenticated vote casting (optionally with ciphertext signing) with cast as intended verifiability (Benaloh cast-or-cancel)
  • Re-encryption mixnet using Terelius-Wikstrom[4][5] proof of shuffle
  • Joint decryption via partial decryption of ciphertexts
  • Individual verifiability via hash checking on bulletin board (recorded as cast)
  • Universal verifiability via execution of verifier against bulletin board (counted as recorded)

These have been listed to roughly correspond chronologically with the protocol phases. Here they are at a glance.

protocol

This example shows two authorities both for key generation/joint decryption and mixing, but the protocol generalizes to any number of authorities. Also note that although above the key generation/decryption authorities are the same as the mixing authorities, this need not be the case. One could have four authorities such that two of them were key custodians and two of them were mixers. It is standard practice, however, that the number of authorities of each type is the same. There would be no privacy gain having more of one type as the limiting factor would the smaller number.

Key generation

In the first step the key generation/decryption authorities jointly create the election public key. This is the key with which voters will encrypt their votes before casting them. As can be seen in the diagram, this process occurs in parallel at each authority. Furthermore the simple distributed key generation scheme does not require communication between the authorities (as would be the case for example with a threshold scheme such as joint-feldman/pedersen). Each authority creates its share of the key and posts the public fragment, along with proofs of correctness, at the bulletin board. The bulletin board then checks the proofs and combines the shares, resulting in the public key which is also posted. The purpose of a distributed key generation scheme is to distribute trust such that the privacy of the vote is safe as long as just one of the authorities is not corrupted. Note that the corresponding private key only exists conceptually as the combination of private information at each authority, it is never recreated.

Voting

Once the public key is generated and publicly available at the bulletin board the election can begin. When casting votes, user’s voting clients, in this case the voting booth hosted by the browser, download the public key from the bulletin board. Once they have made their selection, the voting client encrypts the ballot and produces a hash. Before casting, users are presented with the option to audit their ballots according to Benaloh’s cast-or-cancel procedure. This provides cast-as-intended verifiability. Finally, the ballot is cast and sent to the bulletin board. The bulletin board verifies the user’s eligibility in the election as well as the proofs of plaintext knowledge, and then posts the vote. The user records the hash corresponding to their vote to allow verifying that their ballot was actually stored and counted correctly.

Mixing

When the election period is over and all the votes have been recorded the mixing phase begins. The purpose of this phase is to anonymize the votes such that it is impossible to trace what ciphertext belongs to what voter. This is necessary to protect privacy, as the joint decryption phase will reveal vote contents to allow tallying. Just like in the key generation phase is trust distributed across several authorities, so too is the mixing. Each mixing authority permutes and re-encrypts the votes handling them as input to the next authority. Only if all of the mixing authorities are corrupt is it possible to establish the correspondence between the cast votes on the bulletin board and the output of the mixing phase, which will be decrypted. The prototype uses the Terelius-Wikstrom proof of shuffle, which is composed of an offline and online phase. Although the offline phase (permutation) can be precomputed prior the election start our prototype does not exploit this feature, opting for simplicity. However, what is exploited is the parallelism made possible by the fact that the offline only depends on the vote count. This allows all authorities to this phase simultaneously once the election period ends. This is in contrast to the online shuffle phase, where each authority must wait for the mixing results of the previous authority. The diagram above reflects this; we can see the computation bars overlap in time for the permutation but not the shuffle. Each authority submits the vote mixing results along with the proofs to the bulletin board. The bulletin board verifies the shuffle and posts the mixed votes for the next authority to mix. Once all the authorities have completed the mix and the bulletin board has verified all the proofs the mixing phase is over.

Decrypting

Having completed the mixing phase the bulleting board contains the set of anonymized votes for the election. Because they are anonymized these votes can now be decrypted without compromising privacy. Just as the key generation phase was distributed across several authorities, these same authorities must intervene to decrypt the votes. Since the scheme is distributed but not a threshold system, all the authorities must participate in joint decryption. Similarly, as long as one authority remains honest it is not possible to decrypt non-anonymized votes. To carry out the joint decryption each authority downloads the mixed votes from the bulletin board and calculates their partial decryptions using its private share of the key, along with corresponding proofs of correctness. As can be seen above this process is parallel and occurs simultaneously at all authorities once the mixing phase is finished. The partial decryptions and proofs are then posted to the bulletin board, which verifies the proofs. Once all partial decryptions are available the bulletin board combines them and subsequently obtains the plaintexts. As noted previously, the private key is never reconstructed, the combination occurs only for partial decryptions.

Tally and verification

The plaintexts are posted on the bulletin board. This completes the public data for the election, which we summarize below:

  • Election public key shares and proofs of correctness (for each key authority)
  • Election public key
  • Cast votes, proofs of knowledge of plaintext, and signatures
  • Votes mixes and proofs of shuffle (for each mix authority)
  • Mixed votes partial decryptions and proofs (for each key authority)
  • Combined partial decryption and plaintexts

With this information:

  • The election result can be obtained by tallying plaintext votes
  • Each voter can verify that their vote was recorded correctly
  • Anyone can verify that the set of mixed votes corresponds to the recorded (cast) votes
  • Anyone can verify that the plaintexts correspond to correct decryption of the mixed votes
  • Anyone can verify that the election results corresponds to a correct tally of the plaintexts

The above properties, together with the ballot auditing procedure, make the prototype a secure[3] end-to-end verifiable voting system.


References

[1] https://github.com/bfh-evg/univote2/raw/development/doc/report/report.pdf

[2] http://subs.emis.de/LNI/Proceedings/Proceedings232/article100.html

[3] By secure we mean specifically that it employs cryptography to support privacy. No voting system is 100% secure in the general sense.

[4] B. Terelius and D. Wikstrom. Proofs of Restricted Shuffles. In D. J. Bernstein and ¨ T. Lange, editors, AFRICACRYPT’10, 3rd International Conference on Cryptology in Africa, LNCS 6055, pages 100–113, Stellenbosch, South Africa, 2010.

[5] D. Wikstrom. A Commitment-Consistent Proof of a Shuffle. In C. Boyd and J. Gonz ¨ alez ´ Nieto, editors, ACISP’09, 14th Australasian Conference on Information Security and Privacy, LNCS 5594, pages 407–421, Brisbane, Australia, 2009.

[6] http://caislab.kaist.ac.kr/lecture/2010/spring/cs548/basic/B02.pdf

Uncertainty, information and cryptography

(cross posted from here)

In this post I’m going to talk about three types of uncertainty, and how the foundations of cryptography can be understood in their terms. Wikpedia says

Uncertainty is the situation which involves imperfect and / or unknown information.  It applies to predictions of future events, to physical measurements that are already made, or to the unknown.

Two main concepts to note here, information and knowledge. We could say that uncertainty is lack of knowledge or lack of information. As we’ll see these two ideas are not equivalent and do not cover all cases. We start at the strongest form of uncertainty.

Ontological uncertainty: indeterminacy

The Bloch sphere representing the state of a spin 1/2 particle.

In quantum mechanics, certain particles (spin 1/2 particles such as electrons) have a property called spin that when measured[1] can give two discrete results, call them “spin up” and “spin down”. This is described by the equation

| \psi \rangle = \alpha |0 \rangle + \beta |1 \rangle,\,

such that when measured, the probability of obtaining spin up is α², and spin down is β². A question we could ask ourselves is, before we measure it, is the spin up or is it down? But the equation above only gives us probabilities of what will happen when we make the measurement.

In mainstream interpretations of quantum mechanics there is no fact of the matter as to what the value of the spin was before we made the measurement. And there is no fact of the matter as to what the measurement will yield prior to it happening. The information that our question is asking for simply does not exist in the universe.

It is this intrinsic indeterminacy that makes us use the term ontological uncertainty: the uncertainty is not a property of our knowledge, but a property of nature. Our confusion is a consequence of the ontological mismatch between nature and our model of it. We sum up this type of uncertainty as:

The information does not exist and therefore we cannot know it.

By the way, the heisenberg uncertainty principle which is of this type is not very well named, as it can be confused with the subject of the next section. A better name would be indeterminacy principle.

Epistemic uncertainty: information deficit

We started with the strongest and also strangest form of uncertainty. The second is the every day type encountered when dealing with incomplete information. In contrast to the previous type, this uncertainty is a property of our state knowledge, not a property of nature itself. So when we ask, for example, what caused the dinosaur extinction, we are referring to some fact about reality, whether or not we have or will have access to it. Or if playing poker we wonder if we have the best hand, we are referring to an unknown but existing fact, the set of all hands dealt.

Boltzmann’s tombstone with the entropy equation.

Uncertainty as incomplete information is central to fields like information theory, probability and thermodynamics where it is given a formal and quantitative treatment. The technical term is entropy, and it’s measured in bits. We say a description has high entropy if there is a lot of information missing from it. If we ask whether a fair coin will land heads or tails we are missing 1 bit of information. If we ask what number will come out from throwing a fair 8 sided die, we are missing 3 bits. The result of the die throw has more possible results, and therefore higher uncertainty about it than the coin flip result. So it has more bits of entropy. We sum up this type of uncertainty as:

The information exists, but we do not know it.

Before finishing a small clarification. If you were expecting the concept of randomness to appear when talking about coin flips and die rolls here’s the reason why it did not. In this section I have restricted the discussion to classical physics, where phenomena are deterministic although we may not know all the initial conditions. The combination of determinism + unknown initial conditions is what underlies the use of randomness in the macroscopic world. This type of randomness is sometimes called subjective randomness to distinguish it from intrinsic randomness, which is basically another term for ontological uncertainty of the first section.

11486.jpg

A deterministic coin flipping machine

The third type..

And now to the interesting bit. Let’s say I tell you that I have all the information about something, but I still don’t know everything about it. Sounds contradictory right? Here’s an example to illustrate this kind of situation.

  1. All men are mortal
  2. Socrates is a man

If now somebody tells you that

3. Socrates is mortal.

Are they giving you any information? Hopefully it seems to you like they told you something you already knew. Does that mean you had all the information before given statement 3? Put differently, does statement 3 contain any information not present in 1,2?

Modus Barbara.svg

One of the 24 valid syllogism types

Consider another example.

  1. x = 1051
  2. y = 3067
  3. x * y = 3223417

In this case statement 3 tells us something we probably didn’t know. But does statement 3 contain information not present int 1,2? We can use definitions from information theory to offer one answer. Define three random variables (for convenience in some arbitrary range a-b)

x ∈ {a-b}, y ∈ {a-b}, x*y {…}

We can calculate the conditional entropy according to the standard equation

which in our case gives

H(x*y | x, y) = 0

The conditional entropy of x*y given x and y is zero. This is just a technical way to say that given statements 1 and 2, statement 3 contains no extra information: whatever 3 tells us was already contained in 1,2. Once x and y are fixed, x*y follows necessarily. This brings us back to the beginning of the post

We could say that uncertainty is lack of knowledge or lack of information. As we’ll see these two ideas are not equivalent and do not cover all cases.

It should be apparent now that these two ideas are different. We have here cases where we have all the information about something (x, y), and yet we do not know everything about it (x*y).

Logical uncertainty: computation deficit

The step that bridges having all the information with having all the knowledge has a name: computation. Deducing (computing) the conclusion from the premises in the Socrates syllogism does not add any information. Neither does computing x*y from x and y. But computation can tell us things we did not know even though the information was there all along.

We are uncertain about the blanks, even though we have all the necessary information to fill them.

In this context, computing is a process that extracts consequences present implicitly in information. The difference between deducing the conclusion of a simple syllogism, and multiplying two large numbers is a difference in degree, not a difference in kind. However, there is a clear difference in that without sufficient computation, we will remain uncertain about things that are in a sense already there. At the upper end we have cases like Fermat’s last theorem, about which mathematicians had been uncertain for 350 years. We finish with this summary of logical uncertainty:

The information exists, we have all of it, but there are logical consequences we don’t know.

Pierre de Fermat

Cryptography: secrecy and uncertainty

Cryptography  (from Greek κρυπτός kryptós, “hidden, secret”; and γράφειν graphein, “writing”) is the practice and study of techniques for secure communication in the presence of third parties called adversaries

The important word here is secret, which should remind of us uncertainty. Saying that we want a message to remain secret with respect to an adversary is equivalent to saying that we want this adversary to be uncertain about the message content. Although our first intuition would point in the direction of epistemic uncertainty, the fact is that in practice this is not usually the case.

Let’s look at an example with the Caesar cipher, named after Julius Caesar, who used it ~2000 years go. The Caesar replaces each letter in the message with another letter obtained by shifting the alphabet a fixed number of places. This number of places plays the role of encryption key.  For example, with a shift of +3

Let’s encrypt a message using this +3 key:

We hope that if our adversary gets hold of the encrypted message he/she will not learn its secret, whereas our intended recipient, knowing the +3 shift key just needs to apply the reverse procedure (-3 shift) to recover it. When analyzing ciphers it is  assumed that our adversary will capture our messages and also will know the procedure, if not the key (in this case +3) used to encrypt. Using these assumptions, let’s imagine we are the adversary and capture this encrypted message:

We want to know the secret, but we don’t know the secret key shift value. But then we realize that the alphabet has 26 characters, and therefore there are only 25 possible shifts, a shift of 26 leaves the message unchanged. So how about trying all the keys and seeing what happens:

We found that the secret was revealed when trying a key shift of +10. Note how we were able to pick out the correct message because none of the other attempts gave meaningful results. This happens because the space of possible keys is so small that only one of them decrypts to a possible message. In technical terms, the key space and message space[2] are small enough compared to the length of the message that only one key will decrypt. The following equation[3] states this in terms of uncertainty:

Untitled2

The left part of the expression, H(Key | Ciphertext), tells us how much uncertainty about the key remains once we have obtained the encrypted message. Note the term S(c) which represents how many keys decrypt a meaningful message. As we saw above, S(c) = 1, which yields

H(K | C) = ∑ P(c) * log2 (1) = ∑ P(c) * 0 = 0

In words, there is no uncertainty about the key, and therefore the secret message, once we know the encrypted message[4]. Of course, when we initially captured this

we did not know the secret, but we had all the information necessary to reveal it. We were only logically uncertain about the secret and needed computation, not information, to find it out.

Alberti’s cipher disk (1470)

Although we have seen this only for the simple Caesar cipher, it turns out that except for special cases, many ciphers have this property given a large enough message to encrypt. In public key ciphers, like those used in many secure voting systems, this is the case irrespective of message size. So we can say that practical cryptography is based around logical uncertainty, since our adversaries have enough information to obtain the secret. But as we saw previously, there are different degrees of logical uncertainty. Cryptography depends on this uncertainty being “strong” enough to protect secrets.

Talking about degrees of logical uncertainty leads us to computational complexity.

Computational complexity and logical uncertainty

Just as entropy measures epistemic uncertainty, computational complexity can be said to measure logical uncertainty. In probability theory we study how much information one needs to remove epistemic uncertainty. Computational complexity studies how much computation one needs to remove logical uncertainty. We saw that deducing the conclusion of the Socrates syllogism was easy, but multiplying two large numbers was harder. Complexity looks at how hard these problems are relative to each other. So if we are looking for the foundations of cryptography we should definitely look there.

Take for example the widely used RSA public key cryptosystem. This scheme is based (among other things) on the computational difficulty of factoring large numbers. We can represent this situation with two statements, for example

  1. X=1522605027922533360535618378132637429718068114961380688657908494580122963258952897654000350692006139
  2. X=3797522793694367392280887275544562785456553663 199*40094690950920881030683735292761468389214899724061

Statement 2 (the factors) is entailed by statement 1, but obtaining 2 from 1 requires significant computational resources. In a real world case, an adversary that captures a message encrypted under the RSA scheme will require such an amount of computation to reveal its content, that this possibility is labeled infeasible. Let’s be a bit more precise than that. This means that an adversary, using the fastest known algorithm for the task, will require thousands of years of computing on a modern pc.

If the last statement didn’t trigger alarm bells, perhaps I should emphasize the words “known algorithm”. We know that with known algorithms the task is infeasible, but what if a faster algorithm is found? You would expect complexity theory would have an answer to that hypothetical situation. The simple fact of the matter is that it doesn’t.

In complexity theory, problems for which efficient algorithms exist are put into a class called P. Although no efficient algorithm is known for integer factorization, whether it is in P or not is an open problem[5].  In other words, we are logically uncertain about whether factorization is in P or not!

Several complexity classes

If we assume that integer factorization is not in P then a message encrypted with RSA is secure. So in order to guarantee an adversary’s logical uncertainty about secret messages, cryptographic techniques rely on assumptions that are themselves the object of logical uncertainty at the computational complexity level! Not the kind of thing you want to find when looking for foundations.

The bottom line

It’s really not that bad though. If you think carefully about it, what matters is not just whether factorization and other problems are in P or not, but whether adversaries will find the corresponding efficient algorithms. The condition that factorization is in P and that the efficient algorithms are secretly found by adversaries is much stronger than the first requirement on its own. More importantly, the second condition seems to be one we can find partial evidence for.

Whether or not evidence can be found for a logical statement is a controversial subject. Does the fact that no one has proved that factorization is in P count as evidence that it is not in P? Some say yes and some say no. But it seems less controversial to say that the fact that no algorithm has been found counts as evidence for the possibility that we (as a species with given cognitive and scientific level of advancement) will not find it in the near future.

A quantum subroutine

The bottom line for the foundations of cryptography is a question of both logical and epistemic uncertainty. On one hand, computational complexity questions belong in the realm of logic, and empirical evidence for this seems conceptually shaky. But the practical aspects of cryptography not only depend on complexity questions, but also on our ability to solve them. Another point along these lines is that computational complexity tells us about difficulty for algorithms given certain computational primitives. But the question of what primitives we have access to when building computing devices is a question of physics (as quantum computing illustrates). This means we can justify or question confidence in the security of cryptography through empirical evidence about the physical world. Today, it is the combination of results from computational complexity together with empirical evidence about the world that form the ultimate foundations of cryptography.


References

[1] Along the x, y, or z axes

[2] Without going into details, the message space is smaller than the set of all combinations of letters given that most of these combinations are meaningless. Meaningful messages are redundantly encoded.

[3] http://www14.in.tum.de/konferenzen/Jass05/courses/1/papers/gruber_paper.pdf

[4] The equation refers to the general case, but we can still use it to illustrate a particular case.

[5] To be precise, it’s that and the more general question of whether P=NP.

Reddit-style filtering for e-democracy

(Cross posted from here.)

When voting, people select among several options that have been predefined prior to the election. In more open scenarios of citizen participation, people are asked not only to vote for predefined options, but to contribute their own ideas before the final vote. This introduces another activity to the democratic process, we can call it filtering.

The aim of filtering is to select, out of a very large number of ideas, the very few best ones either to directly implement them or to put them to a formal vote. Filtering has to address the problem that voting cannot: How do we select out of potentially hundreds of ideas without having each voter evaluate them all? The solution provided by filtering is a common theme in many proposals to augment democracy: division of cognitive labour.

A single voter cannot rank hundreds of ideas, but the cognitive load of selecting the best ones can be spread across the many citizens that are contributing them. Filtering can scale because as the number of ideas grows with the number of people suggesting them, so too does the available cognitive effort available to evaluate them.

However, a key distinction must be noted between voting and filtering. The process of distributing cognitive labour allows filtering many ideas, but once we accept that not all people are evaluating everything, the information available is necessarily incomplete. In other words, the process of ranking ideas has an element of uncertainty, it is statistical.

Reddit-style filtering

When looking for tools to implement internet mediated filtering, sooner or later one runs into systems like reddit, slashdot, digg and the like. At an abstract level, we can pick out the following features which are relevant to the problem of filtering:

  1. Many users can submit a large amount of items as well as evaluate them
  2. Users evaluate items with a binary signal, eg voting up/down
  3. A ranking method exists that sorts items for some definition of better => worse.

At this level of description, these three properties fit filtering well. We have ideas that we wish to rank, and we cope with the large volume by having the users evaluate them, possibly using binary signal such as approval (or not approval). The meat of the matter lines in the third property, specifically in the phrase “for some definition of better => worse”.

Quality of ideas

Above we have used the abstract term “items”. In reddit, items can be stories or comments. The distinction is important because reddit uses different ranking algorithms for each. The key feature of filtering that departs from the reddit model is that reddit and similar sites are news aggregation tools. This is why stories in reddit include a novelty component in their ranking. Ceteris paribus, new items are better than old items.

Filtering on the other hand is not about news, but about ideas/proposals. Although novelty may still be a factor, it is not a central one. Idea quality must be mostly about voters evaluation through a binary signal, and not very much about its submission date. Filtering ranking is more like reddit’s ranking of comments than it is of stories.

The problem is simple then: we must rank ideas according to users’ binary evaluation in a way that deals with incomplete information. That is, we will not have a +/- judgement for every idea from every user, only subsets of this. Fortunately the problem as posed is directly an instance of binomial proportion estimation, an urn model.

We have, for every idea, an unknown quantity which represents the fraction of all users who would evaluate it positively. This quantity can be statistically inferred using the number voters that have in fact evaluated the idea and approved it. To do this we use the betabinomial model, see here for an introduction. Briefly, we choose a uniform uninformative prior distribution with parameters alpha = beta = 1. Because the beta is conjugate to the binomial, posterior distribution is also a beta given by

The mean of a beta distrubution is given by

If we plug this into the posterior we see that the mean is simply approvals / total, which makes sense. So a first approximation would be to rank ideas by the mean of the (posterior) probability distribution, which is just the fraction of approvals out of total evaluations.

What about uncertainty?

The problem with the above is that it ignores how confident we are about the estimation of an idea’s quality. For example, an idea evaluated 1000 times with 500 approvals would rank equally to an idea evaluated twice and approved once. In technical terms, using the mean throws away information about the probability distribution. This is what the author of this post realized:

suppose item 1 has 2 positive ratings and 0 negative ratings. Suppose item 2 has 100 positive ratings and 1 negative rating. This algorithm puts item two (tons of positive ratings) below item one (very few positive ratings). WRONG.

The author then goes on to suggest a lower bound wilson confidence interval, this is how reddit currently ranks comments. Another option along these lines is to use a lower bound on our beta posterior using its cumulative distribution function (I wont go into more details here). In either case, the spirit of this measure is to rank ideas that we are confident are good higher than ideas which have not been evaluated enough.

What about equality/winner takes all/sunk ideas/information gain?

This is where filtering departs strongly from systems like reddit. Consider:

  • Ideas should have an equal opportunity of being selected.
  • Once ideas are ranked highly there is a snowball effect as they are more visible and can accumulate approvals.
  • New good ideas will be lost and sunk as they cannot compete with the volume of older ones that have accumulated approvals.
  • By penalizing uncertainty you concentrate evaluation away from where its most needed, information gain is minimized.

All these objections follow a common theme, the theme of balancing two competing objectives:

  • We want to rank good ideas highly
  • We want to allow for the discovery of good yet unproven ideas

Thompson Sampling and multi-armed bandits

The above dilemma is another form of the exploration-exploitation tradeoff that appears in reinforcement learning, an area of machine learning. There is one specific problem in machine learning whose structure is very similar to the problem of rating items and filtering: multi-armed bandits. In this post James Neufeld first makes the connection between rating comments on reddit and muti-armed bandits.

The comment scoring problem on reddit is slightly different from the basic setting described above. This is because we are not actually choosing a single comment to present to the user (pulling one arm) but, instead, producing a ranking of comments. There are some interesting research papers on modelling this problem precisely, for example [Radlinski, et al.,2008], but, it turns out to be a combinatorial optimization (hard). However, rather than going down this complex modelling path, one could simply rank all of the μ¯iμ¯i samples instead of taking just the max, this gives us a full ranking and, since the max is still at the top, is unlikely to adversely affect the convergence of the algorithm.

He then goes on to propose adapting Thompson sampling, a solution applied to multiarmed bandits in reinforcement learning, to the case of rating comments in reddit. The method of Thompson Sampling (or probability matching) is simple given what we’ve seen above regarding the beta posterior. Instead of using the beta posterior mean, or a lower bound on its cumulative distribution, we simply sample from it. The procedure is:

  1. For each idea, sample from its posterior beta probability distribution
  2.  Construct a ranking according to the sampled value

Here’s how Thompson sampling relates to the two objetives stated above

  • We want to rank good ideas highly

Sampling from a high quality posterior will tend to produce larger values.

  • We want to allow for the discovery of good yet unproven ideas

Because we are sampling there is a chance that unproven ideas will be ranked highly. Furthermore this possibility is greater for better ideas and for ideas with high uncertainty. There is more to be investigated here about the relationship between Thompson sampling and maximizing information gain (eg Kullback Leibler divergence).

Extensions

The base Thompson sampling model can be extended in several ways. For example:

  • Time: we can incorporate time factors by having decays on the beta distribution parameters.
  • Collaborative filtering: we can apply weight factors on beta parameters depending on affinity with the user.
  • Boosting newly submitted ideas: we can choose non uniform priors to ensure high exposure to new ideas.

Probability of an election tie

(from http://www.uncommondescent.com/wp-content/uploads/2013/06/coin-flip.jpg)

Sparked by recent events in politics, a lot of debate and controversy has occurred on the Spanish blogosphere around a simple question of probability:

What is the probability that a Yes/No election with 3030 voters results in a tie?

Before suggesting answers, let me make it clear that the main controversy has occurred when trying to answer this question in its barest form, without any additional information besides its simplest formulation above. To make it doubly clear, this is all the information that defines the problem:

1) There are 3030 voters that can vote Yes or No.
2) Yes and No votes are treated as Bernoulli trials.

We model a series of Bernoulli trials with a binomial distribution. It has two parameters, the number of events, and the probability of success for each event:

X ~ Bin(n, p)

Our question is answered by this piece-wise function

P(tie) =

P(X = n/2) if n is even

0 otherwise

All we need to do is plug in the parameters and we’re done. We’ve been given n = 3030 in our problem definition. But wait a minute, what about p? The problem definition states that votes are Bernoulli trials, but we know nothing about p!

In order to create intuition for the situation, let me pose two related questions.

What is the probability of getting 5 heads in 10 trials when tossing a fair coin?

What is the probability of getting 5 heads in 10 trials when tossing a coin where the only thing we know about the coin is that it can land heads or tails?

In the first question we have been given information about the coin we are tossing, which we input into the binomial model. In the second question we know nothing about the coin, and therefore nothing about the second parameter p.

This is precisely the case with our election problem. So how do we proceed? In both cases the answer is the same, we must construct a version of the binomial that allows us to represent this state of information. The beta-binomial probability distribution comes to the rescue. From wikipedia:

In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random.

I hope something rang a bell when you saw the word “unknown” above, this is exactly our situation. What we do, therefore, is to construct a non-informative prior over p that represents our lack of information about said parameter. In the beta-binomial distribution this prior takes the form of a Beta distribution, and the usual choice as non-informative prior is Beta(1, 1), with alpha = beta = 1. You can see how this choice of prior favors no values of p:

MSP3179205a459f8dg1af060000219hieg2gbbdd77f

Remember it’s a probability density function

Having represented our state of knowledge about p as the choice of prior Beta(1, 1), and given that the parameter n is 3030, we now have all the ingredients to calculate things in a way that is consistent with our problem definition. We do this by using the probability mass function of the beta binomial:

We therefore want (since 3030 is even):

P(X = 1515)

= (3030 choose 1515) * Beta[1515 + 1, 1515 +1] / Beta[1, 1]

= 1/3031

Does that fraction seem funny? That value is precisely one divided by the total number of possible election results. You can see this by considering that results can go from [Yes 0 – 3030 No] all the way up to [Yes 3030 – 0 No]. And in fact, using our beta-binomial model with Beta(1, 1)  all these results are given the same probability: 1/3031.

This should come as no surprise, given that as we’ve said repeatedly, the problem definition is such that we know nothing about the election, we have no way to favor one result over the other. You can see this below, the probability for all results is 1/3031 = 0.00033.

MSP111h4205fe7g2d41h20000325i704i6g13h403

The p = 0.5 mistake

In spite of all of the above, most of the people that analyzed our problem got another result, not 1/3031, but instead 0.0145. This corresponds to calculating a binomial distribution with p = 0.5:

X ~ Binomial(3030, 0.5)

P(X=1515)

= (3030 choose 1515) * (0.5^1515) * (0.5^1515)

= 0.01449

How did they get to assuming p = 0.5? Well, it seems that those who went this route did not know about the beta-binomial distribution, and the beta prior that allows us to represent knowledge about p. Without these tools they made an unwarranted assumption: that the lack of information about p is equivalent to 100% certainty that p is 0.5. The source of that confusion is an insidious “coincidence”:

The probability of heads and tails for an unknown coin “happens” to be exactly the same as that for a coin which we know with 100% probability that it is fair.

Let me restate that

P(head for a single coin toss we know nothing about) = 0.5
P(head for a single coin toss we know 100% to be fair) = 0.5

Because the value is the same, it’s easy to jump to the conclusion that a series of coin tosses for a coin that we know nothing about is treated the same way as a series of coin tosses for which we know for sure that the coin is fair! Unfortunately the above coincidence does not reappear:

P(n successes for coin tosses we know nothing about)  P(n successes for coin tosses we know 100% to be fair)

To illustrate that setting p=0.5 (or any other point value) represents zero uncertainty about the value of p, let’s plot a few graphs for priors and probabilities of ties. This will show how our non-informative prior Beta(1, 1) progressively approximates the p=0.5 assumption as we reduce its uncertainty.

  • alpha = beta = 1 (non-informative)
MSP3179205a459f8dg1af060000219hieg2gbbdd77f

probability over p

MSP111h4205fe7g2d41h20000325i704i6g13h403

probability of yes votes

probability of tie = 1 / 3031 = 0.00033

  • alpha = beta = 10

With Beta(10, 10) the probability of tie increases from 1/3031 to 0.0012:

MSP19171egg4d3i2229cgcb00003h2cg0i369hi8g37

probability over p

MSP10531g554a6fc043h21c0000144a972697b87g94

probability of yes votes

probability of tie = 0.0012

  • alpha = beta = 100
probability over p

probability over p

MSP852227956eegbd4a60000003f290b49gdh19a2e

probability of yes votes

probability of tie = 0.0036

  • alpha = beta = 10000
probability over p

probability over p

probability of yes votes

probability of yes votes

probability of tie = 0.0135

  • alpha = beta = 5×10^5

probability of tie = 0.01447

A formal version of the above trend is:

As the variance of our prior tends to zero, the probability of a tie tends to 0.1449 (the value obtained with p = 0.5)

How can we interpret p?

Given the assumption that voter choices are  Bernoulli trials, what can be said about the significance of p? We can offer an interpretation, although this won’t change any of our results above.

Having said this, consider that p describes how likely it is that a voter selects Yes or No in an election. If we interpret that a voter chooses Yes or No depending on his/her preferences and the content of the question posed in the election, then p represents the relationship between the election content and the set of voter’s preferences.

Saying that we don’t know anything about the election is like saying that we know nothing about the voter’s preferences and question posed to them. If, for example, we knew that the question was about animal rights, and we also knew that the set of voters were animal activists, then we’d probably have a high value of p. Conversely if we asked gun supporters about gun control, p would be low. And if we asked a generally controversial question to the general population, we’d have p around 0.5.

Unless we have prior information that rules out or penalizes certain combinations of elections questions and preferences, we must use a non-informative prior for p, such as Beta(1,1).

What about using more information?

In most of this post I’ve been talking about an ideal case that is unrealistic. If we do have information about the election beyond its bare description we can incorporate it into our beta prior the same way we’ve done above.

This is precisely what makes bayesian models useful: prior information is made explicit and inferences are principled. I won’t go into the details of this as it would merit a post on its own, only note that using prior information must be done carefully to avoid inconsistencies like the ones described here.