Reddit-style filtering for e-democracy


When voting, people select among several options that have been predefined prior to the election. In more open scenarios of citizen participation, people are asked not only to vote for predefined options, but also to contribute their own ideas before the final vote. This introduces another activity into the democratic process, which we can call filtering.

The aim of filtering is to select, out of a very large number of ideas, the few best ones, either to implement them directly or to put them to a formal vote. Filtering has to address a problem that voting cannot: how do we select from potentially hundreds of ideas without having each voter evaluate them all? The solution provided by filtering is a common theme in many proposals to augment democracy: division of cognitive labour.

A single voter cannot rank hundreds of ideas, but the cognitive load of selecting the best ones can be spread across the many citizens that are contributing them. Filtering can scale because as the number of ideas grows with the number of people suggesting them, so too does the cognitive effort available to evaluate them.

However, a key distinction must be noted between voting and filtering. Distributing cognitive labour allows filtering many ideas, but once we accept that not everyone evaluates everything, the information available is necessarily incomplete. In other words, the process of ranking ideas has an element of uncertainty; it is statistical.

Reddit-style filtering

When looking for tools to implement internet-mediated filtering, sooner or later one runs into systems like reddit, slashdot, digg and the like. At an abstract level, we can pick out the following features that are relevant to the problem of filtering:

  1. Many users can submit a large number of items as well as evaluate them
  2. Users evaluate items with a binary signal, e.g. voting up/down
  3. A ranking method exists that sorts items for some definition of better => worse.

At this level of description, these three properties fit filtering well. We have ideas that we wish to rank, and we cope with the large volume by having the users evaluate them, possibly using a binary signal such as approval (or disapproval). The meat of the matter lies in the third property, specifically in the phrase “for some definition of better => worse”.

Quality of ideas

Above we have used the abstract term “items”. In reddit, items can be stories or comments. The distinction is important because reddit uses different ranking algorithms for each. The key way filtering departs from the reddit model is that reddit and similar sites are news aggregation tools. This is why stories in reddit include a novelty component in their ranking. Ceteris paribus, new items are better than old items.

Filtering, on the other hand, is not about news but about ideas/proposals. Although novelty may still be a factor, it is not a central one. Idea quality must be mostly about voters' evaluations through a binary signal, and not very much about submission date. Filtering's ranking is more like reddit's ranking of comments than its ranking of stories.

The problem is simple then: we must rank ideas according to users' binary evaluations in a way that deals with incomplete information. That is, we will not have a +/- judgement for every idea from every user, only a subset of them. Fortunately, the problem as posed is directly an instance of binomial proportion estimation, an urn model.

We have, for every idea, an unknown quantity which represents the fraction of all users who would evaluate it positively. This quantity can be statistically inferred from the number of voters that have in fact evaluated the idea and approved it. To do this we use the beta-binomial model; see here for an introduction. Briefly, we choose a uniform, uninformative prior distribution with parameters alpha = beta = 1. Because the beta is conjugate to the binomial, the posterior distribution is also a beta, given by

Beta(alpha + approvals, beta + total - approvals)

The mean of a beta distribution is given by

E[Beta(alpha, beta)] = alpha / (alpha + beta)

If we plug the posterior parameters into this formula, the mean comes out as (approvals + 1) / (total + 2), which for all but the smallest counts is essentially the fraction of approvals; this makes sense. So a first approximation would be to rank ideas by the mean of the (posterior) probability distribution, which is roughly the fraction of approvals out of total evaluations.
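As a minimal sketch in Python (the idea names and counts are made up for illustration), ranking by the posterior mean looks like this:

def posterior_mean(approvals, total, alpha=1.0, beta=1.0):
    # mean of the Beta(alpha + approvals, beta + total - approvals) posterior
    return (alpha + approvals) / (alpha + beta + total)

# hypothetical evaluation counts per idea: (approvals, total evaluations)
ideas = {"idea-a": (500, 1000), "idea-b": (1, 2), "idea-c": (9, 10)}

ranking = sorted(ideas, key=lambda i: posterior_mean(*ideas[i]), reverse=True)
print(ranking)  # idea-c first; idea-a and idea-b tie on the mean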

What about uncertainty?

The problem with the above is that it ignores how confident we are about the estimation of an idea's quality. For example, an idea evaluated 1000 times with 500 approvals would rank equally with an idea evaluated twice and approved once. In technical terms, using the mean throws away information about the probability distribution. This is what the author of this post realized:

suppose item 1 has 2 positive ratings and 0 negative ratings. Suppose item 2 has 100 positive ratings and 1 negative rating. This algorithm puts item two (tons of positive ratings) below item one (very few positive ratings). WRONG.

The author then goes on to suggest using the lower bound of the Wilson score confidence interval; this is how reddit currently ranks comments. Another option along these lines is to use a lower bound on our beta posterior via its cumulative distribution function (I won't go into more detail here). In either case, the spirit of the measure is to rank ideas that we are confident are good above ideas that have not been evaluated enough.
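For reference, here is a small Python sketch of the Wilson score lower bound (at roughly 95% confidence, z = 1.96); this follows the standard formula rather than reddit's exact implementation:

import math

def wilson_lower_bound(approvals, total, z=1.96):
    # lower bound of the Wilson score interval for a binomial proportion
    if total == 0:
        return 0.0
    phat = approvals / total
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (phat + z * z / (2 * total) - margin) / (1 + z * z / total)

print(wilson_lower_bound(1, 2))       # ~0.09: same mean, little confidence
print(wilson_lower_bound(500, 1000))  # ~0.47: same mean, high confidence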

What about equality/winner takes all/sunk ideas/information gain?

This is where filtering departs strongly from systems like reddit. Consider:

  • Ideas should have an equal opportunity of being selected.
  • Once ideas are ranked highly there is a snowball effect as they are more visible and can accumulate approvals.
  • New good ideas will be lost and sunk as they cannot compete with the volume of older ones that have accumulated approvals.
  • By penalizing uncertainty we concentrate evaluation away from where it's most needed; information gain is minimized.

All these objections follow a common theme, that of balancing two competing objectives:

  • We want to rank good ideas highly
  • We want to allow for the discovery of good yet unproven ideas

Thompson Sampling and multi-armed bandits

The above dilemma is another form of the exploration-exploitation tradeoff that appears in reinforcement learning, an area of machine learning. There is one specific problem in machine learning whose structure is very similar to the problem of rating items and filtering: multi-armed bandits. In this post James Neufeld first makes the connection between rating comments on reddit and multi-armed bandits.

The comment scoring problem on reddit is slightly different from the basic setting described above. This is because we are not actually choosing a single comment to present to the user (pulling one arm) but, instead, producing a ranking of comments. There are some interesting research papers on modelling this problem precisely, for example [Radlinski et al., 2008], but, it turns out to be a combinatorial optimization (hard). However, rather than going down this complex modelling path, one could simply rank all of the μ̄ᵢ samples instead of taking just the max, this gives us a full ranking and, since the max is still at the top, is unlikely to adversely affect the convergence of the algorithm.

He then goes on to propose adapting Thompson sampling, a solution applied to multi-armed bandits in reinforcement learning, to the case of rating comments on reddit. The method of Thompson sampling (or probability matching) is simple given what we've seen above regarding the beta posterior. Instead of using the beta posterior mean, or a lower bound on its cumulative distribution, we simply sample from it. The procedure is:

  1. For each idea, sample from its posterior beta probability distribution
  2. Construct a ranking according to the sampled values
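A sketch of this procedure in Python (reusing the made-up counts from the earlier example):

import random

def thompson_rank(ideas, alpha=1.0, beta=1.0):
    # draw one sample from each idea's beta posterior, then sort by sample
    samples = {
        idea: random.betavariate(alpha + approvals, beta + total - approvals)
        for idea, (approvals, total) in ideas.items()
    }
    return sorted(samples, key=samples.get, reverse=True)

ideas = {"idea-a": (500, 1000), "idea-b": (1, 2), "idea-c": (9, 10)}
print(thompson_rank(ideas))  # the order varies from call to call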

Here's how Thompson sampling relates to the two objectives stated above:

  • We want to rank good ideas highly

Sampling from the posterior of a high-quality idea will tend to produce larger values.

  • We want to allow for the discovery of good yet unproven ideas

Because we are sampling, there is a chance that unproven ideas will be ranked highly. Furthermore, this possibility is greater for better ideas and for ideas with high uncertainty. There is more to be investigated here about the relationship between Thompson sampling and maximizing information gain (e.g. Kullback-Leibler divergence).

Extensions

The base Thompson sampling model can be extended in several ways. For example:

  • Time: we can incorporate time factors by applying decays to the beta distribution parameters (one possible form is sketched after this list).
  • Collaborative filtering: we can apply weight factors on beta parameters depending on affinity with the user.
  • Boosting newly submitted ideas: we can choose non-uniform priors to ensure high exposure for new ideas.
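As an example of the first extension, a decay on the parameters could take this (hypothetical) form, shrinking accumulated evidence back toward the Beta(1, 1) prior at each time step so that older evaluations weigh less:

def decay_parameters(alpha, beta, factor=0.99):
    # geometric decay toward the uniform Beta(1, 1) prior
    return (1.0 + (alpha - 1.0) * factor,
            1.0 + (beta - 1.0) * factor)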

Probability of an election tie


Sparked by recent events in politics, a lot of debate and controversy has occurred in the Spanish blogosphere around a simple question of probability:

What is the probability that a Yes/No election with 3030 voters results in a tie?

Before suggesting answers, let me make it clear that the main controversy has occurred when trying to answer this question in its barest form, without any additional information besides its simplest formulation above. To make it doubly clear, this is all the information that defines the problem:

1) There are 3030 voters that can vote Yes or No.
2) Yes and No votes are treated as Bernoulli trials.

We model a series of Bernoulli trials with a binomial distribution. It has two parameters: the number of trials, and the probability of success for each trial:

X ~ Bin(n, p)

Our question is answered by this piece-wise function:

P(tie) = P(X = n/2)   if n is even
P(tie) = 0            otherwise

All we need to do is plug in the parameters and we’re done. We’ve been given n = 3030 in our problem definition. But wait a minute, what about p? The problem definition states that votes are Bernoulli trials, but we know nothing about p!

In order to create intuition for the situation, let me pose two related questions.

What is the probability of getting 5 heads in 10 trials when tossing a fair coin?

What is the probability of getting 5 heads in 10 trials when tossing a coin where the only thing we know about the coin is that it can land heads or tails?

In the first question we have been given information about the coin we are tossing, which we input into the binomial model. In the second question we know nothing about the coin, and therefore nothing about the second parameter p.

This is precisely the case with our election problem. So how do we proceed? In both cases the answer is the same: we must construct a version of the binomial that allows us to represent this state of information. The beta-binomial probability distribution comes to the rescue. From wikipedia:

In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random.

I hope something rang a bell when you saw the word “unknown” above; this is exactly our situation. What we do, therefore, is construct a non-informative prior over p that represents our lack of information about said parameter. In the beta-binomial distribution this prior takes the form of a Beta distribution, and the usual choice of non-informative prior is Beta(1, 1), with alpha = beta = 1. You can see how this choice of prior favors no value of p:

[Plot: the flat density of the Beta(1, 1) prior over p. Remember it's a probability density function.]

Having represented our state of knowledge about p as the choice of prior Beta(1, 1), and given that the parameter n is 3030, we now have all the ingredients to calculate things in a way that is consistent with our problem definition. We do this by using the probability mass function of the beta-binomial:

P(X = k) = (n choose k) * Beta(k + alpha, n - k + beta) / Beta(alpha, beta)

We therefore want (since 3030 is even):

P(X = 1515)

= (3030 choose 1515) * Beta[1515 + 1, 1515 + 1] / Beta[1, 1]

= 1/3031

Does that fraction seem funny? That value is precisely one divided by the total number of possible election results. You can see this by considering that results can go from [Yes 0 – 3030 No] all the way up to [Yes 3030 – 0 No]. And in fact, using our beta-binomial model with Beta(1, 1), all these results are given the same probability: 1/3031.

This should come as no surprise, given that as we’ve said repeatedly, the problem definition is such that we know nothing about the election, we have no way to favor one result over the other. You can see this below, the probability for all results is 1/3031 = 0.00033.
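We can check this numerically with scipy's beta-binomial distribution (a quick sketch):

from scipy.stats import betabinom

n = 3030
print(betabinom(n, 1, 1).pmf(n // 2))  # 0.000329... = 1/3031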


The p = 0.5 mistake

In spite of all of the above, most of the people that analyzed our problem got another result, not 1/3031, but instead 0.0145. This corresponds to calculating a binomial distribution with p = 0.5:

X ~ Binomial(3030, 0.5)

P(X=1515)

= (3030 choose 1515) * (0.5^1515) * (0.5^1515)

= 0.01449

How did they get to assuming p = 0.5? Well, it seems that those who went this route did not know about the beta-binomial distribution, and the beta prior that allows us to represent knowledge about p. Without these tools they made an unwarranted assumption: that the lack of information about p is equivalent to 100% certainty that p is 0.5. The source of that confusion is an insidious “coincidence”:

The probability of heads (and of tails) for an unknown coin “happens” to be exactly the same as for a coin which we know with 100% certainty to be fair.

Let me restate that:

P(head for a single coin toss we know nothing about) = 0.5
P(head for a single coin toss we know 100% to be fair) = 0.5

Because the value is the same, it’s easy to jump to the conclusion that a series of coin tosses for a coin that we know nothing about is treated the same way as a series of coin tosses for which we know for sure that the coin is fair! Unfortunately the above coincidence does not reappear:

P(n successes for coin tosses we know nothing about) ≠ P(n successes for coin tosses we know 100% to be fair)

To illustrate that setting p = 0.5 (or any other point value) represents zero uncertainty about the value of p, let's plot a few graphs of priors and tie probabilities. These show how the results progressively approximate the p = 0.5 assumption as we reduce the prior's uncertainty, starting from the non-informative Beta(1, 1).

  • alpha = beta = 1 (non-informative)

[Plots: probability over p; probability of yes votes]

probability of tie = 1 / 3031 = 0.00033

  • alpha = beta = 10

With Beta(10, 10) the probability of tie increases from 1/3031 to 0.0012:

[Plots: probability over p; probability of yes votes]

probability of tie = 0.0012

  • alpha = beta = 100
[Plots: probability over p; probability of yes votes]

probability of tie = 0.0036

  • alpha = beta = 10000
[Plots: probability over p; probability of yes votes]

probability of tie = 0.0135

  • alpha = beta = 5×10^5

probability of tie = 0.01447

A formal version of the above trend is:

As the variance of our prior tends to zero, the probability of a tie tends to 0.01449 (the value obtained with p = 0.5)
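The trend is easy to reproduce numerically, again with scipy's beta-binomial (a sketch):

from scipy.stats import betabinom

n = 3030
for a in [1, 10, 100, 10000, 500000]:
    # P(tie) under a Beta(a, a) prior; as the prior concentrates around
    # 0.5, this approaches the fixed p = 0.5 binomial value 0.01449
    print(a, betabinom(n, a, a).pmf(n // 2))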

How can we interpret p?

Given the assumption that voter choices are Bernoulli trials, what can be said about the significance of p? We can offer an interpretation, although this won't change any of our results above.

Having said this, consider that p describes how likely it is that a voter selects Yes (rather than No) in the election. If we interpret that a voter chooses Yes or No depending on his/her preferences and the content of the question posed in the election, then p represents the relationship between the election content and the set of voters' preferences.

Saying that we don't know anything about the election is like saying that we know nothing about the voters' preferences and the question posed to them. If, for example, we knew that the question was about animal rights, and we also knew that the set of voters were animal rights activists, then we'd probably have a high value of p. Conversely, if we asked gun supporters about gun control, p would be low. And if we asked a generally controversial question of the general population, we'd have p around 0.5.

Unless we have prior information that rules out or penalizes certain combinations of elections questions and preferences, we must use a non-informative prior for p, such as Beta(1,1).

What about using more information?

In most of this post I’ve been talking about an ideal case that is unrealistic. If we do have information about the election beyond its bare description we can incorporate it into our beta prior the same way we’ve done above.

This is precisely what makes bayesian models useful: prior information is made explicit and inferences are principled. I won’t go into the details of this as it would merit a post on its own, only note that using prior information must be done carefully to avoid inconsistencies like the ones described here.

The pairwise-bradleyterry model for pairwise voting

In a previous post we discussed pairwise voting and the pairwise-beta model as a way to obtain a global ranking over candidates using bayesian inference with the beta distribution. In that post we remarked in the final paragraph that the pairwise-beta model is not perfect:

In particular, it does not exploit information about the strength of the opposing item in a pairwise comparison.

In this post we will look at a better model which addresses this particular problem, albeit at a computational cost. To begin we present a pathological case which exhibits the problem when using the pairwise-beta.

Consider the following set of pairwise ballots, where A, B, C, D, E and F are options, and A > B indicates that A is preferred to B. There are 5 ballots:

A > B

B > C

C > D

D > E

F > E

Applying the pairwise-beta method to this set of ballots yields an output (options A-F are referred to as numbers 0-5) which is equivalent to the following ranking:

  1. A, F
  2. B, C, D
  3. E

A and F share the first position. B, C and D share the second position. E is last.
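Assuming, as described in the previous post, that each option is scored from its pairwise wins and losses via a Beta(1, 1) posterior (here summarized by the posterior mean; the exact summary used may differ), a small Python sketch reproduces this three-tier result:

from collections import defaultdict

ballots = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("F", "E")]

wins = defaultdict(int)
total = defaultdict(int)
for winner, loser in ballots:
    wins[winner] += 1
    total[winner] += 1
    total[loser] += 1

# posterior mean of each option's win proportion under a Beta(1, 1) prior
scores = {o: (1 + wins[o]) / (2 + total[o]) for o in sorted(total)}
print(scores)  # A and F: 0.667; B, C and D: 0.5; E: 0.25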

Hopefully the problem in this ranking is apparent: the strength of the opposing option in a pairwise comparison is not affecting the global ranking. This is why option F, which only beats the last option, is ranked at the same position as A, which “transitively” beats every other option. Similarly, options B, C and D are ranked at the same level, even though presumably option B should be stronger as it beats option C which beats option D.

In other words, beating a strong option should indicate more strength than beating a weak option. Similarly, being beaten by a strong option should indicate less weakness than being beaten by a weak option.

We can resort to the Bradley-Terry [1] model to address these shortcomings. The Bradley-Terry is a probabilistic model that can be used to predict the outcome of pairwise comparisons, as well as to obtain a global ranking from them. It has the following form:

P(i beats j) = p_i / (p_i + p_j)

and in logit form[2]:

logit(P(i beats j)) = log(p_i) - log(p_j) = lambda_i - lambda_j

The parameters (p’s and lambdas) can be fit using maximum likelihood estimation. One can consider these to represent the relative strength of options and therefore give a global ranking, although strictly speaking their interpretation is rooted in probabilities of outcomes of comparisons.
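To make the model concrete, here is a toy Python fit using the standard minorization-maximization updates for Bradley-Terry. This is only a sketch, not the BradleyTerry2 code used below: the small pseudo-counts are our own regularization, needed because in this pathological data A and F never lose (so the unregularized maximum likelihood estimate does not exist), and the exact mid-table order can depend on that choice:

from collections import defaultdict
from itertools import combinations

ballots = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("F", "E")]
options = sorted({o for pair in ballots for o in pair})

# tally wins per option and comparisons per unordered pair, adding a small
# symmetric pseudo-count eps so every strength estimate stays finite
eps = 0.01
wins = defaultdict(float)
count = defaultdict(float)
for i, j in combinations(options, 2):
    wins[i] += eps
    wins[j] += eps
    count[(i, j)] = 2 * eps
for w, l in ballots:
    wins[w] += 1
    count[tuple(sorted((w, l)))] += 1

# minorization-maximization updates: p_i = W_i / sum_j n_ij / (p_i + p_j)
p = {o: 1.0 for o in options}
for _ in range(2000):
    p = {i: wins[i] / sum(count[tuple(sorted((i, j)))] / (p[i] + p[j])
                          for j in options if j != i)
         for i in options}
    norm = sum(p.values())
    p = {o: v / norm for o, v in p.items()}

print(sorted(options, key=p.get, reverse=True))  # A first, E last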

In order to apply this model we can use the BradleyTerry2 R package by Turner and Firth[2], which fits the model using tabular input data. Armed with this package, all we need is some extra plumbing in our Agora Voting tallying component and we're ready to go. Running it against the same ballots as above gives an output which is equivalent to the following ranking:

  1. A
  2. B
  3. C
  4. D
  5. F
  6. E

Notice how this ranking does away with all the problems we mentioned with the pairwise-beta result. In particular, note how option F, which above was ranked joint first, is in this case ranked fifth. This is because it beat option E, which is last, and therefore not much strength can be inferred from that comparison.

Before concluding that the pairwise-beta model is terrible, remember that the results we got here correspond to a handpicked pathological set of ballots. In general it seems reasonable to expect results from both models to converge as more data accumulates and the strength of opponents evens out. This hypothesis seems to match work by Salganik[3], where the pairwise-beta and a more robust model are compared:

In the cases considered in the paper, the two estimates of the score were very similar; there was a correlation of about 0.95 in both cases.

In summary, in this and the previous post we have described two models that can be used for pairwise elections, where voters are asked to compare options in pairs. We have seen how one of the models works well and is easy to calculate, but can potentially give unrealistic rankings when data is sparse. We then considered a second, more robust model which addresses this problem but is computationally more expensive. Further work is required to determine exactly how computationally demanding our pairwise-bradleyterry implementation is.

 


[1] Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs. I. The method of paired comparisons. – http://www.jstor.org/stable/2334029

[2] Turner, H. and Firth, D. Bradley-Terry Models in R: The BradleyTerry2 Package. – http://www.jstatsoft.org/v48/i09/paper

[3] Salganik, M. J. and Levy, K. E. C. Wiki surveys: Open and quantifiable social data collection. – http://arxiv.org/pdf/1202.0500v2.pdf

Integer encoding of multiple-choice ballots (2)

In the last post we saw how simple arithmetic with the right choice of base can encode integer lists for multiple ballot choices in an elegant and compact way. A couple of points were mentioned in the notes section; one of them was:

our scheme allows encoding repeated choices, which are usually not legal ballots. For example, one can encode the choice 1,1,1 but that would be an illegal vote under all the voting systems we mentioned as examples. The fact that legal ballots are a subset of all possible encodings means that the encoding is necessarily suboptimal with respect to those requirements. We will look at an alternative scheme in the next post.

The alternative scheme we show below attempts to optimize the encoding by using a variable base as digits are encoded. The main idea is that as chosen integers in the list are encoded, the remaining ones are constrained since they cannot be repeated. Thus the base can be reduced as fewer choices remain possible. In practice the process is complicated by the need to keep track of which digits correspond to which integers, as gaps form in the remaining integers and the meaning of digits changes. Here is a python implementation
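(What follows is a minimal sketch consistent with the description above, not necessarily the original snippet; the 1-based digits mirror the previous post's scheme and the function names are illustrative.)

def encode2(choices, num_options):
    # variable-base encoding: each chosen integer is encoded as its rank
    # among the options still available, and the base shrinks by one with
    # every choice, since repeats are impossible
    available = list(range(1, num_options + 1))
    value = 0
    for choice in choices:
        base = len(available) + 1            # +1 keeps digits 1-based
        digit = available.index(choice) + 1  # rank among remaining options
        value = value * base + digit
        available.remove(choice)
    return value

def decode2(value, num_choices, num_options):
    # peel off digits least-significant first; the base for the k-th digit
    # (0-indexed) is num_options + 1 - k
    digits = []
    for k in reversed(range(num_choices)):
        digits.append(value % (num_options + 1 - k))
        value //= num_options + 1 - k
    digits.reverse()
    available = list(range(1, num_options + 1))
    choices = []
    for digit in digits:
        choice = available[digit - 1]
        choices.append(choice)
        available.remove(choice)
    return choices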

and a sample session using both encoders. Note the shorter value produced by the second encoder:

encode: 14606467545964956303452810

encode2: 36697695360790800022

Despite the reduction, the second encoder is not optimal (the first encoder is optimal given repeatable choices); the range of output numbers is larger than that of legal ballots. It would be interesting to see how to obtain the most compact solution; a detailed analysis could compare these schemes systematically to get quantitative measures of space efficiency.


Integer encoding of multiple-choice ballots

Secure voting systems supporting privacy through encryption must encode ballot contents into integers before they can be encrypted[1]. This encoding step is mostly trivial. For example, imagine a yes-no-abstain ballot. One can simply apply the following mapping to yield integer plaintexts for corresponding ballots:

Yes => 1
No => 2
Abstain => 3

But things can get a bit more involved when dealing with multiple-selection ballots. These are ballots where the voter makes more than one choice. They can be either ranked ballots, where the voter specifies a preference relation over the selections, or unranked ballots, where no such preference exists. Examples of voting systems using the former are single transferable vote and instant runoff voting. Systems like approval voting or plurality at large are examples of the second type, using unranked ballots.

Imagine we are using one of these systems to elect a candidate out of a field of four: Alice, Bob, Charlie, and Donna. We first apply the trivial mapping:

Alice => 1
Bob => 2
Charlie => 3
Donna => 4

But how do we encode a complete ballot? For example, a ballot with (X corresponds to marked choices):

Alice X
Bob X
Charlie O
Donna O

Unlike the yes-no-abstain ballot above, the content of the ballot corresponds to a list of integers: 1 and 2. We could use the following mapping

encode: Alice, Bob => 1, 2 => 12

The ballot is encoded as the number 12, resulting from the concatenation of the string representations of each of the integers. But what if there are more candidates, and we need to encode something like:

encode: 11, 3 => 113

That won’t work, because 113 could represent either 11 and 3, or 1 and 13.

decode: 113 => ?

We can remedy this by adding some padding such that each choice is well separated:

encode: 11, 3 => “1103” => 1103

Then when decoding, we convert the integer to a string, split it every 2 characters, and finally obtain integers for each of the candidates:

decode: 1103 => “1103” => “11”, “03” => 11, 03

But there's still a problem. What about this choice:

encode: 1, 13 => “0113” => 113

We run into trouble here, because the string “0113” corresponds to the integer 113; there is no mathematical difference between “0113” and “113”. To fix this, when decoding we can first check that the string length is a multiple of 2 (since we are using 2 chars per candidate integer); if it is not, we prepend the required zeros. The encode-decode process would be:

encode: 1, 13 => “0113” => 113
decode: 113 => “113” (prepend zero) => “0113”  => “01”, “13” => 1, 13
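As a sketch, the string-based scheme with padding might be implemented like this (function names are illustrative):

def encode_str(choices, width=2):
    # concatenate each choice as a zero-padded, fixed-width string
    return int("".join(str(c).zfill(width) for c in choices))

def decode_str(value, width=2):
    s = str(value)
    if len(s) % width:  # leading zeros were lost in the int conversion
        s = "0" * (width - len(s) % width) + s
    return [int(s[i:i + width]) for i in range(0, len(s), width)]

print(encode_str([11, 3]))  # 1103
print(decode_str(113))      # [1, 13]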

I hear you complain that all this concatenation, padding, and prepending looks a bit hackish. Can we do better?

Let’s go back to our first example, when we simply wanted to do

encode: Alice, Bob => 1, 2 => 12

This looked very nice and simple. Can’t we do something like this in general, without any string hackery? The first step is to go back to the definition of decimal numbers.

[Figure: positional definition of decimal numbers (cplusplus.com)]

In these terms, the encoding 1, 2 => 12 corresponds to

(10^1) * 1 + (10^0) * 2 = 12

Here we have expressed the encoding of 1, 2 using arithmetic, no string operations involved. The ballot choices are interpreted as digits according to the mathematical definition of decimal numbers. (In fact, this is what goes on under the covers when you convert a string like “12” into the number 12.) This gives us a purely arithmetical description of the simple mapping we started with. Things then got complicated when we considered the possibility of more choices (candidates) in the ballot. Let's apply our mapping to that problematic ballot:

encode: 11, 3 => (10^1) * 11 + (10^0) * 3 = 113

Our new procedure fails the same way: the simple scheme where each digit represents one choice cannot be accommodated by the decimal digit representation of choices, and the result 113 is ambiguous. But wait, who says we have to encode according to a decimal representation? What if we were to map choices to hexadecimal digits:

encode: 11, 3 => (10^1) * B + (10^0) * 3 = B3 (digits interpreted in base 16)

And we’ve restored simplicity and correctness to our scheme. B3 encodes the choice 11, 3 with one choice per digit and no ambiguity! If the B3 looks like cheating, just remember, B3 is a representation of a number that in decimal format turns out to be 179. The encode-decode process could just as well be written as

encode: 11, 3 => 179
decode: 179 => 11, 3

The bottom line is that we can encode a list of integers into a single integer provided we use the optimal base, which is equal to the number of possible choices in the ballot plus one.

Let’s revisit our original example, with Alice, Bob, Charlie and Donna. Since we have four candidates, our base is 4 + 1 = 5. The encoding is thus:

encode: Alice, Bob => 1, 2 => (10^1) * 1 + (10^0) * 2 = 12 (base 5) = 7 (decimal)

in short:

encode: 1, 2 => 7
decode: 7 => 1, 2

Note that not only is this method simpler, with no string operations or padding, but the encoded values are also smaller. Compare:

encode: 1, 2 => 12
encode: 11, 3 => 1103

with

encode: 1, 2 => 7
encode: 11, 3 => 179

This should not come as a surprise: encoding with the specified base is the most compact[2] encoding possible (proof left as an exercise for the reader). A larger base wastes space encoding ballot contents that are not possible, whereas a smaller base is insufficient to encode all possible ballots.

Finally, here is a python implementation of the encoder we have proposed:
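(A minimal sketch along the lines proposed; this is our reconstruction, not necessarily the original snippet.)

def encode(choices, num_options):
    # interpret the choices as digits of a number in base num_options + 1
    base = num_options + 1
    value = 0
    for choice in choices:
        value = value * base + choice
    return value

def decode(value, num_options):
    base = num_options + 1
    choices = []
    while value > 0:
        choices.append(value % base)
        value //= base
    return list(reversed(choices))

print(encode([1, 2], 4))    # 7
print(decode(7, 4))         # [1, 2]
print(encode([11, 3], 15))  # 179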

In the next post we will further discuss details as to the compactness of the encoding mentioned in [2].


[1] In the case of ElGamal encryption used in Agora Voting, the plaintext must be encoded into an element of the multiplicative subgroup G of order q of the ring Zp, where p and q are suitably chosen prime numbers. In order to do this, the plaintext must be first encoded into an integer, after which it is mapped to a member of G, and subsequently encrypted.

[2] A few caveats must be mentioned. First, we are using 1-based indices to represent choices, which means some values are unused. Second, our scheme allows encoding repeated choices, which are usually not legal ballots. For example, one can encode the choice 1,1,1 but that would be an illegal vote under all the voting systems we mentioned as examples. The fact that legal ballots are a subset of all possible encodings means that the encoding is necessarily suboptimal with respect to those requirements. We will look at an alternative scheme in the next post.