Beauty and simplicity in physics

A theory appears beautiful or elegant [..] when it’s simple, [..] when it can be expressed very concisely. – Murray Gell-Mann

The question of whether one should use beauty as a heuristic in physics is a difficult problem with deep philosophical roots. It is hard to find appropriate assumptions to even frame the problem. As with the classic philosophical problems it is related to, falling into infinite regress[5] is always right around the corner. In the limit, if we make no assumptions, nothing can be said.

I sketch an argument, in the form of a series of claims, as to how one should think about the problem. This is then applied in two ways: first, to argue that the term “beauty” is not a useful way to frame the problem; second, to evaluate one example of a heuristic that is normally considered an instance of beauty, naturalness.

1. Science is a process in which theories are assigned relative degrees of certainty.

1.1 We can model degrees of certainty with probabilities.

1.2 Theories must be logically consistent.

1.3 Theories gain (or lose) certainty through empirical confirmation.

2. Scientific theories allow calculations that specify expected observations.

2.1 To yield calculations, these theories must be computable, in the Church-Turing sense.

2.2 These theories can be re-cast as algorithms.

3. The routine practice of science includes embedded assumptions. These assumptions are usually taken for granted and even invisible [4].

3.1 These assumptions affect the way probabilities are assigned to theories.

3.2 The way these assumptions affect probability assignments must be consistent.

We can accommodate the problem of beauty in physics in terms of point 3. I argue later that the word “beauty” is not the right word to describe the real problem at hand. This real problem can best be described as follows.

3.3 If competing theories yield (through calculation) empirical claims that are equivalent in our domain of observation, then they cannot be assigned different probabilities from observation alone.

3.3.1 The problem of assigning different probabilities to empirically equivalent theories (up to observation) is an instance of the philosophical problems of underdetermination, and its cousin, the problem of induction.

3.3.2 The problem is not vacuous, as theories whose calculations are equivalent so far may diverge for observations not yet made.

The problem of beauty in physics can now be described as the question of whether beauty should be used as an additional criterion to answer 3.3.1. Before going forward, we identify the main classes of assumptions present and accepted in scientific practice.

4. Science includes a class of assumptions about nature’s uniformity; this is known as Uniformitarianism.

4.1 Uniformitarianism asserts, for example, that the laws of nature are invariant in space and time: we expect the laws of physics to be the same everywhere. We also expect the laws of physics to be the same in the past as in the future.

4.2 Uniformitarianism is an a priori assumption: it is not subject to confirmation as scientific theories are.

4.3 Following 3.1, uniformitarianism affects the way probabilities are assigned to theories in a way that is complementary to empirical confirmation (extra-empirical in [3]).

5. Science includes an assumption about nature’s simplicity; this is known as Occam’s Razor.

5.1 Occam’s razor asserts that simpler theories are, ceteris paribus, more likely to be true.

5.2 Occam’s razor is an a priori assumption: it is not subject to confirmation as scientific theories are.

5.3 Following 3.1, Occam’s razor affects the way probabilities are assigned to theories in a way that is complementary to empirical confirmation (extra-empirical in [3]).

It should be apparent that the claims in 4.3 and 5.3 have the same form as that required by 3.3: probability assigning mechanisms that, given equivalently supported theories, allow assigning probabilities differentially.

6. Uniformitarianism and Occam’s razor are the only a priori assumptions accepted in science. They can be reduced/unified into a bias in favour of simpler theories.

6.1 Uniformitarianism and Occam’s razor can be formalized as a bias that assigns relatively more probability to simpler theories. In a Bayesian setting, this bias takes the form of a prior over theories.

6.2 Given that theories take the form of algorithms, one needs a measure of simplicity for an algorithm. An example of such a formalism is found in Solomonoff induction, also in a Bayesian setting.

6.3 Although Kolmogorov complexity is uncomputable, and the complexity depends (up to a constant overhead) on the choice of Turing machine, Solomonoff induction illustrates that the complexity of a theory is amenable to formal treatment[7]. This provides a model with which to think about the underlying philosophical issue in a principled way.
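As a toy illustration of 6.1–6.3 (not Solomonoff induction itself, which sums over all programs of a universal machine), one can sketch a finite simplicity prior in Python. The theory names and description lengths below are assumed purely for illustration:

```python
from fractions import Fraction

# Hypothetical "theories", each reduced to an assumed description
# length in bits. A simplicity bias assigns prior mass proportional
# to 2^(-length), a finite caricature of the universal prior.
theory_lengths = {"T_simple": 10, "T_medium": 14, "T_complex": 20}

weights = {name: Fraction(1, 2 ** bits) for name, bits in theory_lengths.items()}
total = sum(weights.values())
prior = {name: w / total for name, w in weights.items()}

# Shorter descriptions receive strictly more prior probability.
```

Empirical confirmation (1.3) would then update such a prior through Bayes’ rule; the prior only breaks ties between theories that the data cannot distinguish, which is exactly the situation of 3.3.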

6.4 Other considerations such as “testability, falsifiability, naturalness, calculability, and diversity”[5] are not epistemological assumptions, but rather desirable criteria for the efficient practice of science.

Now we can say what the problem with “beauty” is: that it is ambiguous and/or inconsistent with scientific practice.

7. The word “beauty” is not a good way to think about the problem of selecting theories with equivalent empirical support. Whenever “beauty” does not refer to simplicity it is ambiguous or inconsistent.

7.1 If beauty does not refer to simplicity, there is no objective way to formalize it, and it is, in principle, not possible to construct a prior.

7.2 If beauty does not refer to simplicity, applying it is inconsistent: it is not part of the accepted a priori assumptions used in science.

7.3 If it is a problem to use “beauty” when it does not mean simplicity, then one can drop beauty altogether and use simplicity instead.

Now I turn to a specific application of this approach: naturalness, as described by Hossenfelder [1]:

Physicists use the word “natural” in general to mean that a theory’s dimensionless parameters (i.e. those without units) are not much larger or much smaller than 1 [..] The belief in naturalness is usually rationalized by claiming that numbers which are very large or very small are unlikely. They appear cherry-picked for no other reason than correctly describing an observation and require further explanation.

I argue that naturalness, as defined above, could be formulated as an instance of a simplicity bias:

8. If a theory is to be encoded as an algorithm, then the parameter values it uses to make calculations must be present as data in the theory.

8.1 Relative to the choice of Universal Turing Machine (UTM), the number of parameters and their values will contribute to the length of the algorithm.

8.2 Due to the invariance theorem for Kolmogorov complexity, the choice of UTM cannot arbitrarily reduce the length contributions of parameters. In particular, if the encoding of certain parameter values is efficient in one UTM, it cannot be efficient in all others. There is no free-lunch UTM, and a given number of parameters and their values contribute to algorithm length in a non-arbitrary way (up to a constant overhead).

9. If we are using simplicity priors over theories due to 4) and 5), then it is reasonable to expect that these priors over theories have logical implications as to the number and values of theory parameters.

9.1 If 4) and 5) take the form of priors over theories, then they imply priors over parameter values.

9.2 If 9.1) is the case, then not only are simplicity favouring priors over parameter numbers and values compatible with scientific practice, they are, in fact, necessary. For these formulations of 4) and 5) it would be inconsistent not to employ a priori probability distributions over parameter numbers and values.
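A minimal sketch of how a description-length prior over integer parameter values might look, assuming (purely for illustration) that values are encoded with the Elias gamma code:

```python
def gamma_code_length(v: int) -> int:
    # Elias gamma code length for a positive integer:
    # 2 * floor(log2 v) + 1 bits; int.bit_length() is floor(log2 v) + 1.
    return 2 * v.bit_length() - 1

def log2_prior(v: int) -> int:
    # Description-length prior: P(v) proportional to 2^(-code length),
    # so log2 P(v) = -code length, up to a normalizing constant.
    return -gamma_code_length(v)

# A parameter value of order 1 is far cheaper to encode than 10^16,
# so it receives overwhelmingly more prior mass.
```

Under this assumed encoding, log2_prior(1) = -1 while log2_prior(10**16) = -107: the “unnatural” value pays roughly a hundred extra bits of description length, echoing the intuition behind 8) and 9). Any other self-delimiting code would change the numbers but not the qualitative penalty.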

We can compare 9.1 to observations made by Hossenfelder[1][2] and Wells[3].

The major problem with finetuning arguments both in cosmology and in particle physics is the reference to probabilities without defining a probability distribution, or, if a distribution is defined, the failure to explain where the distribution itself comes from.

If naturalness has any connection to the extra-empirical judgments of theory plausibility, which is surely what naturalness discussion is after, we have no recourse but to introduce probability measures across parameters of the theory.

The main idea is that simplicity assumptions provide these probability distributions, albeit implicitly.

The claims made in 9) are not general, but rather point to the possibility of accommodating naturalness into the framework of scientific practice in a principled way. This is a philosophical argument, because I provide no concrete specifications here for 4) and 5), or for naturalness. So one cannot say whether naturalness follows in general; one can only say that it is conceivable, as in the above model, that it does.

We can imagine a scenario where we could go from the philosophical to the technical, in a way reminiscent of E.T. Jaynes’ reasoning robot. The key to this idea is the need to automate the relevant probabilistic reasoning, which forces the formalization of the method into precise mathematical instantiations. This would mean:

Step 1. Choose some Turing machine for the robot to run on

Step 2. Restrict the class of theories such that priors are computable

Step 3. Formalize the principles 4) and 5) into precise probability priors over theories

Step 4. Formalize naturalness

Step 5. Extract the logical relationship between the formalizations of Step 3 and Step 4.

This would give a concrete answer as to whether, in that scenario, naturalness follows from previous principles, and in effect, whether naturalness is consistent with and necessary for the robot’s “scientific practice”.
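Steps 1–5 are far beyond current formal tools, but a toy caricature of the robot’s update rule can be sketched. The theory class, description lengths, and data below are all assumed for illustration:

```python
from fractions import Fraction

# Step 2 in miniature: a restricted theory class. Each entry is a name,
# an assumed description length in bits, and a predictor for binary data.
theories = [
    ("all zeros", 8,  lambda n: [0] * n),
    ("all ones",  8,  lambda n: [1] * n),
    ("alternate", 12, lambda n: [i % 2 for i in range(n)]),
]

data = [0, 1, 0, 1, 0, 1]

# Step 3 in miniature: simplicity prior 2^(-length), with a
# deterministic likelihood (1 if the theory reproduces the data, else 0).
posterior = {}
for name, bits, predict in theories:
    prior = Fraction(1, 2 ** bits)
    likelihood = 1 if predict(len(data)) == data else 0
    posterior[name] = prior * likelihood

total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}
```

Here only one theory survives the data, so the prior is irrelevant to the outcome; the prior matters precisely when several surviving theories remain empirically equivalent, as in 3.3.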

Note that Step 3 is related to Wells’ comment [3]:

The critical discussion above leads to a credible skepticism to extra-empirical theory assessments (SEETA). We do not have a meta-theory that provides statistical distributions to parameters of theories, which is a key problem naturalness faces

Step 3 would, in effect, construct the meta-theory for the robot’s accepted scientific principles.

[1] Screams for Explanation: Finetuning and Naturalness in the Foundations of Physics

[2] How nature became unnatural

[3] Naturalness, Extra-Empirical Theory Assessments, and the Implications



[6] Beauty and Elegance in Physics

[7] Machine Super Intelligence

[8] Probability Theory: The Logic of Science


A toy model of information-theoretic security

In a previous post we discussed the foundations of cryptography in terms of different types of uncertainty stemming from limited information or computation. We saw an example using the Caesar cipher that demonstrated how an adversary may have all the information necessary to reveal a secret if the relationship between the message space, key space and length of the ciphertext satisfies certain conditions:

Note how we were able to pick out the correct message because none of the other attempts gave meaningful results. This happens because the space of possible keys is so small that only one of them decrypts to a possible message. In technical terms, the key space and message space[2] are small enough compared to the length of the message that only one key will decrypt.

The situation can be coarsely classified into three cases:

  1. H(K | C) = H(K) – Perfect secrecy
  2. 0 < H(K | C) < H(K) – Information-theoretic security
  3. H(K | C) = 0 – Computational security

Information-theoretic security refers to the epistemic uncertainty of the previous entry. We can use a simple model to illustrate how the conditions governing information-theoretic privacy work, and how, when these conditions fail, the system is left with only computational security.

Elements of our model

The XOR function

Our toy model needs to specify how messages are encrypted from some plaintext language to ciphertexts, using a key that determines the exact transformation and can be used by the intended recipient to recover the secret. We can use the simplest language possible, a two character alphabet whose messages are equivalent to binary sequences. Our encryption function will be the XOR function, which takes two binary sequences and produces a third one as its output. This choice fixes our key space to also be binary sequences. Here’s an example encryption:

1010011000 XOR 1100111010 = 0110100010

The XOR function takes as input a message (1010011000) and a key (1100111010), and produces a ciphertext (0110100010). Of course, the message in this case means nothing to us, but the process is the same; we can simply imagine that the 1010011000 above is some meaningful content like “WELL DONE YOU HAVE FOUND THE SECRET”. The important thing to note is that, just like in English, a subset of the combinations in the plaintext space (binary sequences) make meaningful messages, whereas the rest do not. This brings us to the notion of language entropy, which precisely quantifies how big the meaningful subset of messages is relative to the entire plaintext space.
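The encryption above can be reproduced in a few lines of Python (the helper name is ours):

```python
def xor_bits(a: str, b: str) -> str:
    # Bitwise XOR of two equal-length binary strings.
    assert len(a) == len(b)
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

ciphertext = xor_bits("1010011000", "1100111010")
print(ciphertext)  # 0110100010

# XOR is its own inverse: applying the key again recovers the message.
assert xor_bits(ciphertext, "1100111010") == "1010011000"
```

This self-inverse property is what lets the intended recipient, who holds the key, decrypt with the very same operation.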


The higher the language entropy, the bigger the blue region will be compared to the entire plaintext space. In the case of a binary language the entropy ranges from 0 to 1, since this quantity is measured in bits per character. So far our toy model has these elements:

  • Plaintext space: P = {0, 1}^n
  • Message space: M ⊂ P
  • Key space: K ⊂ {0, 1}^n
  • Ciphertext space: C ⊂ {0, 1}^n
  • Encryption function: XOR: P × K → C
  • Language entropy: HL ∈ [0, 1]
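Assuming the standard information-theoretic reading that a language with entropy HL bits per character has roughly 2^(HL·n) meaningful strings of length n, the message counts used in the figure captions can be reproduced:

```python
def message_count(n: int, h_l: float) -> int:
    # Approximate size of the meaningful subset M of the plaintext
    # space {0, 1}^n for a language with entropy h_l bits per character.
    return round(2 ** (h_l * n))

print(message_count(2, 0.8))  # 3
print(message_count(6, 0.8))  # 28
```

These match the M = 3 and M = 28 values that appear with n = 2 and n = 6 at entropy 0.8 below.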

The security properties of our system depend on three parameters related to the above:

  • n: the number of characters in the plaintext
  • |K|: the size of the key space
  • HL: the language entropy
  • RL = 1 – HL: the language redundancy

The last parameter, redundancy, is just a rewriting of the language entropy. The equation that describes the security in terms of these parameters is:

sn ≥ |K| / 2^(n·RL) − 1
This equation gives a lower bound for the expected number of spurious keys, represented by the term sn. A spurious key, for a given ciphertext, is a key that decrypts said ciphertext into a meaningful message that is not the one encrypted with the real key. In the example encryption at the beginning of the post we saw that when trying all the keys to decrypt the ciphertext only one of them yielded a meaningful plaintext: the ciphertext had no spurious keys. If, on the other hand, one of the keys, s, had decrypted the ciphertext into some other meaningful message, then that key s would be a spurious key. An attacker trying all possible keys would obtain two candidate messages and would not know which of them was the real one; the secret would remain somewhat protected. The existence and number of expected spurious keys determines which of the three coarse categories above a cryptosystem belongs to. Looking at the spurious key equation we can see the following trends:

  • sn increases with the size of the key space, |K|
  • sn decreases with the size of the plaintext, n
  • sn decreases with language redundancy, RL
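The spurious-key count can be checked by brute force on a hypothetical miniature instance. The message and key spaces below are hand-picked for illustration, not derived from an entropy value:

```python
def xor_bits(a: str, b: str) -> str:
    # Bitwise XOR of two equal-length binary strings.
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

# Hand-picked toy spaces with n = 4.
meaningful = {"0000", "0101", "1111"}   # assumed message space M
keys = ["0011", "0101", "1001"]          # assumed key space K

message, true_key = "0101", "0011"
ciphertext = xor_bits(message, true_key)  # "0110"

# A key is spurious if it decrypts this ciphertext to a meaningful
# message other than the one actually sent.
spurious = [k for k in keys
            if k != true_key and xor_bits(ciphertext, k) in meaningful]
print(spurious)  # ['1001']
```

An attacker who tries all three keys recovers both 0101 and 1111 as meaningful candidates and cannot tell them apart: one spurious key, hence partial information-theoretic protection for this ciphertext.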

A visual representation

Encryption representation with n = 2, H = 0.8, M = 3, K = 1

The left part of the image is a visual representation of our toy model for the given parameter values: the vertical axis is the plaintext space and the horizontal axis is the ciphertext space. A point on the plot represents an encryption, in other words, a mapping from the plaintext space to the ciphertext space. We have used n = 2, giving a plaintext space of size 4. Of these four plaintexts, three are meaningful messages (for a language entropy of 0.8). The 3 red dots on the plot therefore correspond to the encryptions of these 3 meaningful messages. Because K = 1, there is only one ciphertext per plaintext, or visually, only one point on any given horizontal line. Compare with:

n = 2, H = 0.8, M = 3, K = 3

Here we can see 9 points, corresponding to 3 messages x 3 keys. A horizontal line thus represents all the encryptions for a given message under different keys. What about the color difference? Here’s another set of parameters:

n = 6, H = 0.8, M = 28, K = 3

With a higher plaintext length the number of meaningful messages rises to 28, resulting in a total of 28 x 3 = 84 encryptions, which show up as red and blue points in this case. Can you spot the pattern that explains this?

It’s not easy to see, but the answer lies in understanding what vertical lines mean in the representation. Points lying on the same horizontal line are different encryptions of the same message. Points lying on the same vertical line are different messages encrypted to the same ciphertext. As we saw before, this is exactly the situation where it is not possible for an adversary to determine the secret by trying all keys in reverse, as there is no way to tell which of the resulting messages is the original.

Blue points are ciphertexts for which there is more than one key that decrypts to a meaningful message, or equivalently, blue points are ciphertexts with one or more spurious keys.

sn > 0 ⇒ blue point

sn = 0 ⇒ red point

Now we can see the visual equivalent of the properties of information-theoretic security we mentioned before.

  • sn increases with the size of the key space, |K|
Fixed n, H, increasing values of K

Visually: the proportion of blue dots increases.

  • sn decreases with the size of the plaintext, n
Fixed K, H, increasing values of n

Visually: the proportion of red dots increases.

  • sn decreases with language redundancy, RL
Fixed n, K, decreasing values of H (increasing values of R)

Visually: the proportion of red dots increases.

Visualizing the categories

Besides these trends, we also spoke about three broad categories into which cryptosystems fit:

  • H(K | C) = H(K) – Perfect secrecy
n = 8, H = 0.55, K = 256

Visually: the number of blue dots per column is equal to the number of horizontal lines. This means that the adversary gains no information from the ciphertext. Note also that 2^8 = 256, which is the value of K.

  • 0 < H(K | C) < H(K) – Information-theoretic security
n = 9, H = 0.55, K = 106

Visually: there are only blue dots. Every ciphertext is partially protected in the sense that the adversary does not have enough information to reveal the secret unequivocally.

  • H(K | C) = 0 – Computational security
n = 13, H = 0.53, K = 29

Visually: there are red dots, these ciphertexts have no information-theoretic protection and depend on computational security for their confidentiality.

Try it yourself

In this post we have described some of the key ideas presented in Claude Shannon’s seminal Communication Theory of Secrecy Systems published in 1949. We have also constructed a toy model that allowed us to visualize the security properties of cryptosystems and how they vary with its main parameters. You can try playing around with the model here. If you are a teacher and find it helpful please let us know!