Voter fraud and bayesian inference – part 2

We left off the discussion with

We want to calculate the proportion of fake ballots in an election based on the results of limited audits. We have seen how the binomial and hypergeometric distributions give probabilities for the results of an audit given an assumption about the proportion of fake ballots. Bayes theorem can be used to calculate the inverse probability that we are after, once we have specified a prior.

Bayesian inference is a process that takes prior information, and adds evidence to obtain a posterior distribution. In our case this posterior distribution will be over the possible proportion of fake ballots in the set of all ballots. Let’s begin with the binomial case. What prior should we use? One answer is that, since we know nothing about the proportion of fake ballots we should be indifferent about each possibility. This translates into a uniform prior, where all proportions are equally likely. For example

P(proportion = fake ballots/ total ballots) = 1 / (total ballots + 1)

Since there are n + 1 possibilities for the number of fake ballots, we give each of them the same weight, which is 1 / (n + 1).

Beta + Binomial = Beta-Binomial

Before plugging this into bayes, a small technical detour.  Notice how the prior is itself a probability distribution, defined over the 0.0 – 1.0 interval. That is, the minimum proportion (0.0) is no fake ballots and maximum (1.0) is all fake ballots. It turns out there is a paramateric probability distribution one can use for this interval, it’s called the Beta distribution. The Beta distribution has two parameters, alpha and beta. The case of our neutral prior we defined above is equivalent to the Beta distribution with parameters (1, 1)

P(proportion ) = 1 / (n + 1) = Beta(1, 1)

We could express other knowledge with different choices of alpha and beta. But what’s the point of using the Beta, besides having a convenient way to specify priors? The point is that the Beta distribution is a conjugate prior of the binomial distribution. This means that the posterior distribution, once having taken into account available evidence, is also a Beta distribution. Meaning that the calculation of the posterior is much easier, as inference is just a matter of mapping the values of parameters of the Beta to some other values. Here is the posterior of the Beta distribution when it is used as the prior of the binomial (this is called the beta-binomial model).

Equations taken from [1]. The first line is just Bayes theorem, but the payoff is that the last line corresponds to a beta distribution, but with different parameters. In summary

with a beta prior, bayesian inference reduces to remapping the initial parameters alpha and beta, to alpha + k and beta + n – k, where k is the number of successes and n is the number of trials. Conjugate priors are an algebraic convenience that allow obtaining analytic expressions for posteriors easily. End of detour, please refer to [1] for further details.

Armed with our use of the beta-binomial obtaining the posterior given some audit results is simple. If we audited 10 ballots and 3 of them were fake our posterior would simply be

P(proportion = p | fake audit count = 3 out of 10)

= Beta(1 + 3, 1 + 10 – 3)

= Beta(4, 8)

here’s what Beta(4, 8) looks like

note how the peak of the distribution is at 0.3, it makes sense since in the sample 3 out 10 ballots where invalid. Evidence has transformed our initial uniform prior into the distribution seen above. This meets our original objective, a way to judge how many ballots are fake in the entire set of ballots based on limited audits. But it’s not the end of the story. What we would like also is to have an estimate as to whether or not the election result is correct. As we said previously, this estimation can be used either as a guarantee that all went well or in the opposite case to detect a problem and even invalidate the results.

The probablity that an election result was correct, given uncertainty about fake ballots, depends on two things. One is the proportion of ballots that are fake, this is what we already have a calculation for. The other thing is the actual election outcome; specifically a measure of how close the result was. The reason is simple, if the election was close, a small number of invalid ballots could cast doubts on its correctness. Conversely, if the result was a landslide, the presence of fake votes has no bearing on its correctness. For our purposes we will stick with a simple example in which the election decides between two options via simple plurality.

Call the difference between the winning and losing option d

d = winner votes – loser votes

In order for the election to be wrong, there must be a minimum of d fake votes. The existence of d fake votes does not imply that the result was wrong, but d fake votes are a necessary condition. Thus a probability that the number of fake votes is greater than or equal to d represents an upper bound on probability that the election was incorrect. Call this E (for error)

P(proportion of fake votes >= d / total votes) = E

(upper limit on the probability that the election was wrong)

We have P(proportion), it is the posterior we got above. How do we get P(proportion >= some constant)? Through the beta distribution’s cumulative distribution function, which is defined in general as

In order to reverse the inequality, we just need to subtract it from 1 (gives us the tail distribution). We finally have

Probability of incorrect result

= P(proportion >= d / total ballots)

= 1 – Cumulative Distribution Function of P(d / total ballots)

 One final correction. Because we have sampled a number of votes with known results, we must apply our calculations to the remaining ballots.

P(E) = 1 – CDF(d – sampled ballots / total ballots – sampled ballots)

Let’s try an example, an election between option A and option B with the following numbers.

Votes for A = 550

Votes for B = 450

Total ballots = 1000

Audited ballots = 100

Audited fake ballots = 4

which gives

Posterior = Beta(5, 97)

d = 100

Minimum fraction of fake votes required to change result = (100 – 4) / (1000 – 10) = 0.1066

Upper bound on probability of error

= 1 – CDF(Beta(5, 97), 0.1066)

= 0.01352

In conclusion, the probability of error due to fake ballots in this election is less than or equal to 1.35%.

You can find a javascript implementation for everything we’ve seen until now in this jsfiddle. Calculations for the binomial, beta, hypergeometric and cumulative distribution function are done with the jStat javascript library. Wait, you say, what about the hypergeometric? We’ll leave that for the next post, which should be pretty short.


[1] http://www.cs.cmu.edu/~10701/lecture/technote2_betabinomial.pdf

Voter fraud and bayesian inference

Wikipedia says

Personation  is a term used in law for the specific kind of voter fraud where an individual votes in an election, whilst pretending to be a different elector.

when someone practices personation multiple times to cast multiple votes we are talking about ballot stuffing. In this post we will consider an election in which authentication is not 100% secure, where personation is difficult but not impossible.  Furthermore we will assume there is some available, but costly method by which  ballots can be audited to determine whether or not they were cast via personation or were in fact valid.

What makes the problem non trivial is that ballot auditing is costly and cannot in principle be performed for the entirety of the ballots cast. Hence we would like to estimate, from a limited number of audited ballots, how severe ballot stuffing was for an election. This estimation can be used either as a guarantee that all went well or in the opposite case to detect a problem and even invalidate the results.

What we need is a mathematical model that given some information about the results of an auditing processes allows us to estimate the proportion of “fake” ballots in the set of all those cast. In other words, we are talking about statistical inference; in this post will use a bayesian approach. Let’s get to work.

Imagine we have a box with all the ballots for an election, and the auditing process consists in randomly taking one out and determining whether it is valid or not, recording the result, and then repeating a limited number of times. After we have recorded the results of all the audits, we would like to know how many of ballots in the entire box are fake. Two things come to mind. First, that the result of each audit is binary, we either get FAKE or VALID. Second, that if the proportion of fake ballots in the box is p, then probability that a randomly chosen ballot is fake is p; the probability that it is valid is 1 – p.

The auditing process as a whole yields a count of fake ballots and a count of valid ballots. If we have probablity p for the result of a single audit , can we assign a probablity to the count resulting from the complete audit process? Yes, the binomial distribution and its brother the hypergeometric distribution do just that[1]. Here they are

In our example, k above corresponds to the count of fake ballots. So these distributions give us a way to calculate the probability that a specific number of fake ballots is obtained in the audit assuming a certain proportion of fake ballots in the entire election. For example, let’s say we have 100 ballots total and we know that 10 of them are fake. What is the probability that if we audited 10 ballots we would obtain a fake count of 3?

P(X = 3) = 0.057395628 (binomial)

P(X = 3) = 0.0517937053324283 (hypergeometric)

Only 5%, it is unlikely we’d find 3 fake ballots with only 10 audits, given that there are only 10 out of 100 in total.

We have a way to calculate the probability of some outcome given some assumption about the proportion of fake ballots. But remember, what we want is exactly the opposite: given a certain result for an audit, we’d like to estimate the proportion of fake ballots in the entire set. It is this inversion that makes the problem a case of bayesian inference, and our friend Bayes theorem gives us the relationship between what we have and what we want.

in our case, it translates down to

P(Proportion of fake ballots | Audit Result) = P(Audit Result | Proportion of fake ballots) * P(Proportion of fake ballots) / P(Audit Result)

What we were calculating before is P(Audit Result | Proportion of fake ballots), which we can plug into the formula, together with the other terms, to get what we want: P(Proportion of fake ballots | Audit Result). The other terms are

P(Audit Result) =

The unconditional probability that a certain audit result occurs. It can be calculated by summing over all possible proportions, like this version of Bayes theorem shows:

As seen in the bottom term. Because the bottom term is common to all values of Ai, it can be interpreted as a normalizing constant that ensures that probabilities sum to 1.

P(Proportion of fake ballots) =

The prior probability that some proportion of fake ballots occurs. This ingredient is crucial for Bayesian inference and the Bayesian approach in general. It is an estimate of  the quantity we want to calculate that is prior to any of the evidence we obtain. It can be used to encode prior knowledge about what we are calculating. If we have no knowledge, we can try to encode that in a “neutral prior”. This last point is a very deep problem in Bayesian inference, as is the general problem of choosing priors. We won’t go into in detail here.

Recap. We want to calculate the proportion of fake ballots in an election based on the results of limited audits. We have seen how the binomial and hypergeometric distributions give probabilities for the results of an audit given an assumption about the proportion of fake ballots. Bayes theorem can be used to calculate the inverse probability that we are after, once we have specified a prior. See it in action in the next post.


[1] There is an important difference, the binomial distribution models sampling with replacement, whereas the hypergeometric models sampling without replacement. In the next post we will consider this difference and its significance for our problem.

The “Straw-Scotsman” pattern of ideological debate

Recently I had a short exchange on twitter on the subject of feminism. Reflecting on the nature of the disagreement, I realized that the structure of the arguments conformed to a pattern I had seen many times before, but never identified. It is a pattern that arises frequently in ideological disputes, I will mnemonically call it the “Straw-Scotsman” pattern.

Suppose Alice and Bob are debating about ideology X. Alice is a supporter of X, whereas Bob is a detractor.

Bob proceeds to criticize X:

Bob: I find ideology X unsatisfactory because of its properties a, b, c.

Alice retorts that Bob is mischaracterizing X:

Alice: Ideology X does not in fact have the properties a,b,c that you are wrongly assigning to it. You should inform yourself about what X really is before criticizing it.

Bob finds this to be a disingenuous answer:

Bob: You’re just dodging my criticisms by redefining X in a way that suits your argument. It seems to me that whatever criticism one could make of X you would simply reply that the real X is not like that.

In the language of fallacies, the pattern can be succintly described with

  • From Alice’s point of view, Bob is committing the straw man fallacy by attacking a position that does not in fact correspond to X.
  • From Bob’s point of view, Alice is committing the no true scotsman fallacy, by responding to any criticism of X by saying that the real X is not at all like that.

I offer no resolution here. From both points of view the opponent is engaging in fallacies, the argument is a stalemate and leads nowhere. The pattern typically also takes the following form when discussing the merits of ideologies in terms of historical outcomes.

Alice criticizes X using historical examples:

Alice: Ideology X is flawed, one only needs to look at what happened in the following examples a,b,c where it was applied and led to disastrous results.

Bob responds:

Bob: I disagree, examples a,b,c only show that the implementation of X was flawed. X was applied incorrectly or not at all, and it is this that led to bad results. However, if one applies X correctly, the results would be satisfactory.

A response which Alice finds unsatisfactory:

Alice: You could always dodge any critcism of a real world case of X by insisting that the implementation was wrong, rather than X is itself faulty.

In this manifestation the Straw-Scotsman pattern is best summed up as

  • When criticizing opposing ideologies people refer to their real world implementations, whereas when defending their own, they insist on its idealized form.

I realize now that I have seen this pattern occur many times when people debate, for example, communism and libertarianism.

Liquid democracy and spectral theory

In this post we will show some practical examples of how liquid democracy can be understood in terms of, and make use of results from spectral graph theory. For more background please see [Vigna2009]. Wikipedia says:

In mathematics, spectral graph theory is the study of properties of a graph in relationship to the characteristic polynomial, eigenvalues, and eigenvectors of matrices associated to the graph, such as its adjacency matrix or Laplacian matrix.

What does this have to do with liquid democracy? To answer this, let’s remember what defines liquid democracy is: system of transitive proxy voting. Proxy in that I can delegate my vote to some entity. Transitive because that entity can itself delegate the vote, and so on. Imagine a simple case with three voters, Alice, Bob, and Charlie. Alice delegates to Bob, and Bob delegates to Charlie. It is very natural to represent this as a graph, this is what we call the delegation graph

graph_decycled
Alice chooses Bob, Bob chooses Charlie

Assuming each voter starts with one vote, this would give us the following weights for each voter:

Alice = 1, Bob = 2, Charlie = 3

Because Alice and Bob have delegated their votes, Charlie will end up casting three votes, one for himself, one for Bob and one for Alice. What determines these weights is the structure of the graph; who is voting for who is encoded in the connections between vertices. In graph theory, the object that encodes this information is the adjacency matrix. Here’s the adjacency matrix for our example:

M

Where each row shows who each voter delegated to. Alice (A) delegated to Bob (B), hence the 1 in the second position of the first row. Similarly, Bob (B) delegated to Charlie, (C) as can be seen in the second row. Because Charlie did not delegate, the third row is all zeroes.

We can express the liquid tally above with these equations (1’s represent voter’s initial weights)

0*A + 0*B + 0*C + 1 = A

1*A + 0*B + 0*C + 1 = B

0*A + 1*B + 0*C  + 1 = C

Note how the 0’s and 1’s above correspond to the columns of the adjacency matrix. The above can be represented[1] in matrix form:

(A B C) * AdjacencyMatrix = (A B C)

This is an eigenvalue equation, whose eigenvector (the (A B C) row vector) corresponds to the result of the liquid tally. Note how the equation is recursive, which fits the recursive nature of transitive delegation. Vote weights are calculated in terms of vote weights themselves, the same way each delegate transmits votes that result from previous delegations.

When the adjacency matrix is used to evaluate the importance or influence of nodes in a graph in this way we are speaking of eigenvector centrality. Our example shows that calculating centrality is basically equivalent to tallying in liquid democracy. This is what makes the connection between spectral theory and liquid democracy.

Eigenvector centrality, Katz and Pagerank

Yes, that’s PageRank as in google, in case you thought this whole talk of centrality was a bit abstract, it’s what made google what it is today. Eigenvector centrality, Katz centrality, and PageRank are related methods to measure the importance of a node in a network. We won’t go into the differences between each measure, besides noting that both Katz and PageRank include an attenuation factor that decreases contributions from distant nodes, whereas eigenvector centrality does not.

In order to run some examples we will use the networkX library, which includes several functions for calculating centrality as well as many other useful features. If you want to play along with this code you will need to install the library as well as its dependencies, including matplotlib and numpy. If you’re on windows I suggest downloading the comprehensive WinPython package which includes everything you need here.

Let’s first create the graph corresponding to our example with Alice, Bob and Charlie. Here’s the code that does that

This generated the image shown earlier. Now to calculate the tally, we will run both Katz and PageRank.

which gives us

Both results match the tally we showed before. A couple of minor points above. First, the PageRank result was rescaled to make it match Katz. Second, the adjacency matrix for Katz was reversed as the networkx 1.8.1 Katz implementation is using a right eigenvector (this has been changed to left eigenvector in master).

More importantly, the alpha parameter is a damping factor. In the language of liquid democracy it modulates just how transitive delegation is by reducing contributions the further away the originate. For example, let’s change the above to alpha = 0.5:

Now Charlie receives 25% of Alice’s vote and 50% of Bob’s vote. So alpha quantifies the fraction of the vote that is effectively delegated. We can interpret then that a liquid democracy tally is a special case of Katz centrality and PageRank. In fact, liquid democracy is the limiting case of Katz and PageRank when alpha = 1.0, ie no damping (which is why you get viscous democracy in [Boldi2011]).

What about cycles?

One of the first things you have to deal with if you’ve implemented a liquid tallying algorithm is the possibility of cycles in the delegation graph, otherwise the procedure will blow up. Having detected a cycle at tally time the standard treatment is to consider votes that enter it as lost. In order to prevent that undesirable situation you can do cycle detection at vote time to warn the user that his/her vote may produce such a cycle.

What happens if we add a cycle to our example? Let’s try it

graph

and we get

The reason this happens has to do with the details of the algorithm that calculates eigenvectors; in particular the relationship between its convergence and the attenuation factor alpha[2]. The short story is this: using an attenuation factor of 1.0 on a graph with cycles may cause problems.

Just as liquid tally algorithms have to deal with cycles, so do we in order to make centrality work correctly. Fortunately there are fast algorithms to detect cycles in graphs. NetworkX offers an implementaton of an improved version of Tarjan’s strongly connected components algorithm, we will use it to define a function that removes cycles in a graph

Using this function we can obtain liquid tallies for any delegation graph correctly, using either Katz or PageRank. See the bottom of this post for the full python script demonstrating this.

Liquid democracy and degree (or branching factor)

Before we said that liquid democracy is the limiting case of Katz centrality and PageRank when alpha = 1.0. In the last section we saw another requirement besides that of alpha = 1.0: that the delegation graph must be acyclic, in other words a DAG. There is one more property that we can consider, degree.

A node’s degree is the number of (in our case, outward) connections with other nodes. In terms of delegation, it is the number of delegates that a voter chooses. Standard liquid democracy uses degree = 1, but such a requirement could in theory be relaxed. How does this fit in with Katz and PageRank? Lets construct a graph where voters may choose one or two delegates.

which gives

graph

resulting values

We see how Katz centrality does not yield a correct tally as it is not dividing outgoing weights for voters who split their delegation among two delegates, instead we get inflated weights. But the PageRank result does work, Bob’s two votes are split correctly, and the delegation proceeds normally from then on.

In summary

  • Liquid democracy is a special case of Katz centrality given
    • a damping factor alpha = 1.0
    • a directed acyclic graph of degree d = 1
  • Liquid democracy is a special case of PageRank given
    • a damping factor alpha = 1.0
    • a directed acyclic graph of degree d >= 1

That’s it for our quick tour of the relationship between liquid democracy and spectral theory. We have also seen how liquid democracy could be extended to include damping (as in [Boldi2011]), or to allow “multi delegation”.


Notes/References

[Vigna2009] Sebastiano Vigna – Spectral Ranking  http://arxiv.org/abs/0912.0238

[Page1998] The PageRank Citation Ranking: Bringing Order to the Web http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

[Boldi2011] Viscous Democracy for Social Networks http://chato.cl/papers/boldi_bonchi_castillo_vigna_2011_viscous_democracy_social_networks.pdf

[1] For simplicity, I have ignored the initial weights associated with each voter in the matrix equation. These initial weights are what makes liquid tallying equivalent to undamped Katz centrality rather than eigenvector centrality.

[2] For details as to alpha and PageRank See http://vigna.di.unimi.it/ftp/papers/PageRankFunctional.pdf section 5 Limit behaviour.

In the case of the networkX implementation of Katz centrality an alpha of 1.0 is guaranteed to converge as all eigenvalues of an acyclic graph are 0 (see http://www.emis.de/journals/JIS/VOL7/Sloane/sloane15.pdf and http://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.centrality.katz_centrality.html#networkx.algorithms.centrality.katz_centrality)

Python Script: Random liquid Tallying with Katz/PageRank

Causation and determinism

In this post I will try to disentangle the notions of determinism and causality, and suggest a different way to think of them. I came to think of these issues via the following informal assertion

The decay of a radioactive atom has no cause

I will not be discussing hidden variable theories nor Bell’s inequalities; I will assume outright that the phenomenon of radioactive decay is intrinsically random (as opposed to “apparent randomness” induced by ignorance), its quantum mechanical description is the most complete model possible; said model assigns probabilities to outcomes. With that out of the way, the usual argument that arrives at the above statement is

1) Radioactive decay is intrinsically random, indeterminate and cannot be predicted

2) There is no physical factor which determines whether a radioactive atom decays or not

3) Therefore, that a specific atom decays has no cause

Although the argument makes sense I am hesitant to accept 3) as is, and what it implies about how we think of causality.

Causality has been confusing minds for hundreds of years, it is a very difficult subject as evidenced by the volumes that have been written on it. So there’s not much point in trying to figure out what causality means exhaustively, via conceptual analysis in the tradition of analytic philosophy. Instead we will just quickly define causality using the mathematics of causal models, and see where that takes us for a specific scenario. In the context of these models, we will define causality according to two complementary questions:

A) what is meant by “the effect of a cause”

B) what is meant by “the cause of an effect”

Two types of causal models have been developed over the last thirty years, causal bayesian networks and structural causal models. These two formalisms are largely equivalent[4]; both make use of graph representations. Vertices in these graphs correspond to the variables under study whereas edges represent causal influences between the variables.  Guided by these graphs, one follows precise procedures to obtain mathematical expressions for causal queries over variables. These expressions are cast in the language of probability.

From [Pearl2000] page 15

In this post I will refer to structural causal models as described in Pearl’s Causality: Models, Reasoning, and Inference [Pearl2000]. To begin, we have the definition[2]

causal_model

In the above, D is a directed graph whose edges represent causal influences; these influences are quantitavely specified by functions (structural equations) on variables. Finally, probabilities are assigned to variables not constrained by functions, these are exogenous variables.

The effect of a cause

Given a structural causal model, question A) can be answered with the following result

causal_effect

alternatively

The difference E(Y | do(x’)) – E(Y | do(x”)) is sometimes taken as the definition of “causal effect” (Rosenbaum and Rubin 1983)

The causal effect of changing the variable x’ => x” on y is defined as the difference in expectation of the value that y will take. Note how the formalism includes explicit notation for interventions, do(x).

The cause of an effect

Question b) looks at it from a different point of view. Instead of asking what the effects of some cause are, we ask what the cause of some effect is; it’s a question of attribution. These questions naturally assume the form of counterfactuals (wikipedia):

Counterfactuals

A counterfactual conditional, is a conditional (or “if-then”) statement indicating what would be the case if its antecedent were true.

For example

If it were raining, then he would be inside.

Before you run off screaming “metaphysics!”, “non-falsifiability!” or other variants of hocus pocus, rest assured: counterfactuals have a clear empirical content. In fact, what grants counterfactuals their empirical content is the same assumption that allows confirmation of theories via experiments: that physical laws are invariant. Counterfactuals make predictions just the same way as experiments validate hypothesis. If I say

“if you had dropped the glass it would have accelerated downwards at g”

I am also saying that

“If you now drop the glass, it will accelerate downwards at g”

Given that I make the assumption that all relevant factors remain equal (ie, gravity has not suddenly disappeared).

The following result allows us to answer queries about counterfactuals[3]:

counterfactuals

Once we have expressions for counterfactuals, we can answer questions of type B), with the following results. Note that these results are expressed in terms of counterfactuals, which is why one needs theorem 7.1 as a prerequisite.

necessity

This completes the brief listing of key results for the purposes of the discussion.

Consequences

So what was the point of pasting in all these definitions without going into the details? The point is that given the formalisms of these models and their associated assumptions[5], we can think quantitatively about questions A) and B), without going into the nightmare of trying to figure out what causality “really means” from scratch. Our original criteria now have assumed a quantitative form:

A) The difference in expectation on some value Y when changing some variable X

B1)The probability that some variable X is a necessary requirement for the value of some observed variable Y

B2) The probability that some variable X is a sufficient requirement for the value of some observed variable Y

Thankfully our example of a radioactive atom is very simple compared to the applications causal models were designed for; for our purposes we do not need to work hard to identify the structure nor the probabilities  involved, these are given to us by the physics of nuclear decay.

Feynman diagram for Beta- decay (wikipedia)

Having said this, we construct a minimal model for eg. negative beta decay with the following two variables

r: The neutron-proton ratio, with values High, Normal, Low (using some arbitrary numerical threshold)

d: Whether β- decay occurs at some time t, with values True, False

Our questions, then, are

Q1) What is the causal effect of r=High on d?

Q2) What is the probability of necessity P(N) of r = High, relative to the observed effect d = True?

Q3) What is the probability of necessity P(S) of r = High, relative to the observed effect d = True?

In order to interpret the answers to the above questions we must first go into some more details about causality and the models we have used.

General causes, singular causes, and probabilities

Research into causality has distinguished two categories of causal claims:

General (or type-level) causal claims:

Drunk driving causes accidents.

Singular (or token-level) causal claims:

The light turning on was caused by me flipping the switch.

General claims describe overall patterns in events, singular claims describe specific events. This distinction brings us to another consideration. The language in which causal models yield expressions is that of probability. We have seen probabilities assigned to the value of some effect, as well as probabilities assigned to the statement that a cause is sufficient, or is necessary. But how do these probabilities arise?

Functional causal models are deterministic; the structural equations that describe causal mechanisms (graph arrows) yield unique values for variables as a function of their influences. On the other hand, the exogenous variables, those that are not specified within the model, but rather are inputs to it, have an associated uncertainty. Thus probabilities arise from our lack of knowledge about the exact values that these external conditions have. The epistemic uncertainty spreads from exogeneous variables throughout the rest of the model.

[Pearl2000] handles the general/singular dichotomy elegantly: there is no crisp border, rather there is a continuous spectrum as models range from general to specific, corresponding to how much uncertainty exists in the associated variables. A general causal claim is one where exogenous variables have wide probability distributions; as information is added these probabilities are tightened and the claim becomes singular. In the limit, there is no uncertainty, the model is deterministic.

We can go back to question 1) whose answer can be interpreted without much difficulty.

Q1) What is the causal effect of r=High (high nuclear ratio) on d (decay)?

If physics is correct, having a certain values for r will increase the expectaton of d being equal to True, relative to some other value for r. This becomes a general causal claim,

A1) High nuclear ratio causes Beta- decay

So, relative to our model, high nuclear ratio is a cause of Beta- decay. Note that we can say this despite the fact that decay is intrinsically indeterministic. Even though the probabilities are of a fundamentally different nature, the empirical content is indistinguishable from any other general claim with epistemic uncertainty. Hence, in this particular case determinism is not required to speak of causation.

The more controversial matter is attribution of cause for a singular indeterministic phenomenon, which is where we began.

3) A specific atom decay has no cause

This is addressed by questions 2) and 3).

Q2) What is the probability of necessity P(N) of r = High, relative to the observed effect d = True?

Q3) What is the probability of necessity P(S) of r = High, relative to the observed effect d = True?

Recall, functional causal models assign probabilities that arise from uncertainty in exogenous variables; this is what we see in definitions 9.2.1 and 9.2.2. The phrase “probability of sufficiency/necessity” conveys that sufficiency/necessity is a determinate property of the phenomenon, it’s just that we don’t have enough information to identify it. Therefore, in the singular limit these properties can be expressed as logical predicates

Sufficiency(C, E): Cause => Effect

Necessity(C, E): Effect => Cause

In the case of the decay of a specific atom at some time the causal claims become completely singular, definitions 9.2.1 and 9.2.2 reduce to evaluations of whether the above predicates hold. If we assume that atoms with low nuclear ratio do not undergo Beta- decay, our answers are:

A2) High nuclear ratio is a necessary cause of Beta- decay

A3) High nuclear ratio is not a sufficient cause of Beta- decay

Thus the truth of the statement that the decay of a radioactive atom has no cause depends on whether you are interested in sufficiency or necessity. In particular, that the atom would not have decayed were it not for its high nuclear ratio suggests this ratio was a cause of its decay.

But let’s make things more complicated, let’s say there is a small probability that atoms with low nuclear ratios show Beta- decay. We’d have to say that (remember, relative to our model) the decay of a specific atom at some time has no cause, because neither criterion of sufficiency or necessity is met.

The essence of causality, determinism?

We can continue to stretch the concept. Imagine that a specific nuclear ratio for a specific atom implied a 99.99% probability of decay at some time t, and also that said probability of decay for any other nuclear ratio were 0.001%. Would we still be comfortable saying that the decay of such an atom had no cause?

Singular indeterministic events are peculiar things. They behave according to probabilities, like those of general causation, but are fully specified, like instances of singular deterministic causation. Can we not just apply the methods and vocabulary of general causation to singular indeterministic events?

In fact, we can. We can modify functional causal models such that the underlying structural equations are stochastic, as mentioned in [Pearl2000] section 7.2.2. Another method found in [Steel2005] is to add un-physical exogenous variables that account for the outcomes of indeterministic events. Both of these can be swapped into regular functional models. This should yield equivalent definitions of 9.2.1 and 9.2.2, where probability of sufficiency and necessity are replaced with degrees, giving corresponding versions of A2) and A3).

In this approach, singular causation is not an all or nothing property, it is progressive. Just as general causal claims are expressed with epistemic probabilities, singular causal claims are expressed in terms of ontological probabilities. In this picture, saying that a particular radioactive decay had no cause would be wrong. Instead, perhaps we could say that a specific decay was “partially” or “mostly” caused by some property of that atom, rather than that there was no cause.

I believe this conception of causality is more informative. Throwing out causation just because probabilities are not 100% is excessive and misleading, it ignores regularities and discards information that has predictive content. The essence of causation, I believe, is not determinism, but counterfactual prediction, which banks on regularity, not certainty. It seems reasonable to extend the language we use for general causes onto singular ones, as their implications have the same empirical form. Both make probabilistic predictions, both can be tested.

No cause

What would it mean to say that some event has no cause, according to this interpretation? It would mean that an event is entirely unaffected by, and independent of, any of the universe’s state; no changes made anywhere would alter the probabilities we assign to its occurence. Such an event would be effectively “disconnected” or “transparent”.

Pollock – Lavender Mist Number 1

We could even imagine a completely causeless universe, where all events would be of this kind. It is not easy to see how such a strange place would look like. The most obvious possibility would be a chaotic universe, with no regularities. If we described such a universe as an n-dimensional (eg 3 + 1) collection of random variables, a causeless universe would exhibit zero interaction information, and zero intelligibility, as if every variable resulted of an independent coinflip. But it is not clear to me whether this scenario necessarily follows from a causeless  universe assumption.


References

[Pearl2000] http://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/0521773628

[Pearl2009] http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf

[Steel2005] http://philoscience.unibe.ch/documents/causality/Steel2005.pdf

[2] http://bayes.cs.ucla.edu/BOOK-2K/ch2-2.pdf

[3] http://www.cs.ucla.edu/~kaoru/ch7-final

[4] http://www.mii.ucla.edu/causality/?p=571

[5] See eg causal markov condition, minimality, stability

[6] ftp://ftp.cs.ucla.edu/pub/stat_ser/r393.pdf

The essential ingredient of causation, as argued in Pearl (2009:361) is responsiveness, namely, the capacity of some variables to respond to variations in other variables, regardless of how those variations came about.

[7] Laurea and her tolerance