The set of votes cast in a liquid democracy can be used to define a vector space over voters. Voter similarity can thus be specified in terms of a distance function, or metric, over this space. The question is, how to represent voters as vectors?

The most obvious approach would be to assign one dimension to each election. The content of the vote cast in election *n* would be the *n*th component of the vector. So for example, if there are two elections, each with three options (1,2,3), and we had the following voters:

John chose 1 in the first election, 1 in the second election and 3 in the third election

Sally chose 2 in the first election, 1 in the second election and 1 in the third election

we would have these column vectors

John = (1, 1, 3)

Sally = (2, 1, 1)

We could them simply choose the euclidean distance as our metric in this vector space (wikpedia)

So we would have

d(John, Sally) = Sqrt( 1 + 0 + 2) = Sqrt(3)

But there’s a problem with this approach. If, for example, John had chosen option 2 in the third election, we would have

d(John, Sally) = Sqrt(1 + 0 + 1) = Sqrt(2)

So we would say that John and Sally would be more similar as voters in the second case. But this requires the interpretation that option 2 is “semantically” closer to option 3 than option 1. In other words, the encoding we proposed implies similarity between options that are numerically close together. But we do not want this; a priori, options in an election can be arbitrarily close or far apart, option numbering should impose no “semantic structure” on the options.

What we need is an encoding where the distance between Sally and John only depends on whether the same or different option was chosen, not on the numerical distance between the numbers assigned to each option. One way to do this[1] is to encode votes into *string sequences* that correspond to the cast vote. For example, choosing option 2 is encoded as *the string “2”*. Vectors are the concatenation of these individual vote. With this approach, the metric is the Hamming distance

In information theory, the **Hamming distance** between two strings of equal length is the number of positions at which the corresponding symbols are different. For a fixed length *n*, the Hamming distance is a metric on the vector space of the words of length n, as it obviously fulfills the conditions of non-negativity, identity of indiscernibles and symmetry, and it can be shown easily by complete induction that it satisfies the triangle inequality as well.

Let’s encode our voters again, in the first example

John = (“1”, “1”, “3”) = (“113”)

Sally = (“2”, “1”, “1”) = (“211”)

d(John, Sally) = 2

In the second example,

John = (“1”, “1”, “2”) = (“112”)

Sally = (“2”, “1”, “1”) = (“211”)

d(John, Sally) = 2

And, as desired, the distances in both examples are the same, 2.

So, what is the point of all this? Defining voter similarity can have many applications, for example

- Collaborative filtering for delegate recommendation (delegate discovery when the number of delegates is potentially very large)
- Activity and notification filtering in a social network voting context (similar to filtering based on proximity in a social graph)
- Anomaly detection (e.g. via k-nearest neighbours) for delegate recall or switching (i.e. alert users when their delegates may not be acting as expected)
- Visualization, using clusters to identify voting blocs

It should be noted, however, that all of this only applies to public voting, where vote informaton is available to compute similarites.

[1] Of course, other metrics can be defined. For example the Rajski distance is used here