Over the past few decades, we have witnessed a quiet revolution in the understanding of probability and, with it, the production of knowledge. One could call it the Bayesian revolution, after the 18th-century statistician Thomas Bayes, who proposed that probability represents not an objective assessment of an event’s frequency but a subjective measurement of belief: an informed prediction about whether an event will occur.
While relatively simple mathematically, Bayesian statistics is computationally intensive, which, until recently, made it prohibitively difficult to operationalize. But with the development of modern computational capacity, Bayesian probability is ascendant, offering a way to turn the large volumes of data now being captured into predictions and “insights.” It provides the underlying logic for Google’s search engine, FiveThirtyEight’s mode of political coverage, and algorithmic trading in financial markets. Almost everything that we think of as artificial intelligence is also a product of applying statistics and probability to a growing field of problems: everything from predictive policing to online dating to surveillance assessment to targeted ads to voice detection and interpretation. Any time you see a moving needle assessing an election-night outcome, use a spam filter, or receive a recommendation based on what others “like you” have done, you are experiencing the products of Bayesian probability in action.
Bayesian thinking has been hailed by a handful of technologists and scientists as a panacea for our broken times. Countless articles (including this one in Aeon and this peer-reviewed article in PLOS ONE about replication problems in psychology) have suggested that Bayesian statistics could, as though by magic, cure the replication crises that have plagued various scientific disciplines, addressing the corrupting influences of incentives. The Bayesian approach has even inspired a self-help movement of sorts devoted to promoting “rational thinking,” published under the moniker Less Wrong. Bayesian thinking has become idealized as a norm to aspire to and ultimately to be controlled by.
Whereas previous ways of conceiving statistics imagined creating a hypothesis and designing an experiment to evaluate it, Bayesian statistics starts with a belief about the world and updates that belief as new data becomes available. This makes it better suited to an era in which computers can work on a constant stream of real-time data. As Michael Lynch, the founder of the data analytics company Autonomy, explained in an article in Wired in 2000, “Bayes gave us a key to a secret garden … With the new, superpowerful computers, we can explore that country.” But that country is turning out to be a place where seeking the most profitable course of action supplants efforts to evaluate other kinds of truth.
To appreciate how revolutionary Bayesian probability is, it is helpful to understand the kind of statistics it is supplanting: frequentism, the dominant mode of statistical analysis and scientific inference in the 20th century. This approach defines probability as the long-run frequency of a system, with statisticians designing experiments to gather evidence to prove or disprove a proposed claim about what that long-run frequency is. (Though it should be noted there are many debates internal to frequentism about exactly what is proved or disproved in an experiment.) This initial claim does not change during the analysis but is instead determined as disproven or not. For example, a statistician might claim that a flipped coin will land heads 50 percent of the time and hold this claim static while they gather evidence (i.e. flip the coin). Their prediction addresses not a particular coin flip but rather what is expected to happen over a series of flips, based on some theory.
The upside of this approach to probability is its apparent objectivity: Probability merely represents the expected frequency of a physical system — or something that can be imagined as analogous to one. What one believes about the next coin flip does not matter; the long-run frequency of heads will be the basis for evaluating whether the hypothesis about its probability is accurate (i.e. whether the flips we observe correspond with how we believe the system operates). In this way, scientific analysis proceeds by setting a hypothesis at the beginning, and, only after all of the data is gathered, evaluating whether that hypothesis objectively corresponds with the data.
While this approach seems sensible, it is also limiting. It prevents statisticians from directly assigning a probability to a single event, which has no long-run frequency. Actual events do not have a probability: They happen or they do not. It rains tomorrow or it doesn’t; a car crashes or it doesn’t; a scientific hypothesis is true or not. A coin lands on heads or tails; it does not half-land on heads. So for a frequentist, a claim like “there is a 70 percent chance of rain tomorrow” requires imagining a series of similar days, what statisticians refer to as a “reference class,” that could function like a series of coin flips. Moreover, this means that under a traditional frequentist interpretation, a hypothesis cannot be given a probability, since it is not a frequency (e.g. either a coin is fair or it isn’t).
Frequentism thus required extensive human involvement in defining problems and designing experiments. Despite frequentism’s often-claimed objectivity, the selection of the reference class is subjective. For a coin flip, for instance, the reference class could be every flip with a given coin, or it could be only those made under specific atmospheric conditions or those performed by a specific flipper. Different reference classes can have substantial effects on the long-run frequency of an event and hence on the accuracy of hypotheses about its probability. In the 1970s, two Bayesian statisticians, D.V. Lindley and L.D. Phillips, demonstrated how, under a frequentist interpretation of probability, variations in what a researcher thinks about an experiment can substantially change the results even with the same data.
A Bayesian approach, however, provides a means to do away with the reference class. It further proposes that one can dispense with the established process of formulating a hypothesis, testing it, and then evaluating it. Instead researchers make a generalized model of some problem and then update the model as new evidence is gathered. As opposed to the supposedly “objective” understanding of probability as a stable long-run frequency, Bayesian interpretations understand probability to be an ever-updating subjective measure of what researchers believe about whether an event will occur or not. This allows probability to be established for single events, and any event happening or not can be its own hypothesis.
Researchers might initially assume a coin will land heads 50 percent of the time, but if results defy that prediction, they will adjust their expectations to correspond with the conditions on the ground, like an election-night prediction needle, without having to invent a whole new theory. In fact, Bayesians can treat every possible bias for the coin as a possible hypothesis and calculate the probability that each is true. When frequentists say a coin flip should be 50/50, they are saying something constant about a whole series of flips. If that proves false, they come up with a new hypothesis to test. But a Bayesian would say, “I believe it is 50/50 now, and I will adjust my beliefs based on the results as they come in.” For example, after a number of flips, 51/49 might appear the most probable bias for the coin.
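The coin example can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular system: the grid of candidate biases, the flat starting belief, and the tally of 51 heads and 49 tails are all assumptions chosen for the example, and `update_coin_beliefs` is a hypothetical helper name.

```python
def update_coin_beliefs(biases, prior, heads, tails):
    """Reweight each candidate bias by how well it explains the observed flips."""
    unnormalized = [
        p * (b ** heads) * ((1 - b) ** tails)  # prior belief times likelihood of the data
        for b, p in zip(biases, prior)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]  # rescale so the beliefs sum to 1


# Every candidate hypothesis: the coin lands heads 1%, 2%, ..., 99% of the time.
biases = [i / 100 for i in range(1, 100)]

# Assumption: a flat starting belief that treats every bias as equally plausible.
prior = [1 / len(biases)] * len(biases)

# Hypothetical evidence: 51 heads and 49 tails.
posterior = update_coin_beliefs(biases, prior, heads=51, tails=49)

# The single most probable bias after seeing the data.
most_probable = biases[max(range(len(biases)), key=posterior.__getitem__)]
print(most_probable)
```

Nothing is accepted or rejected here. Each hypothesis about the coin simply carries more or less belief than it did before, and the same function can be called again with the posterior as the new prior when more flips arrive.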
To understand how this works with a more practical example, imagine a drug test for cyclists that can detect doping 98 percent of the time, with a 3 percent false-positive rate. We may initially assume that if a rider tests positive, there is only a 3 percent chance that they are not doping. But Bayes’s theorem can be used to calculate the probability a cyclist is doping based on how pervasive we believe doping is in the field. Bayesian inference starts with a gut feeling (5 percent of cyclists are doping) and, as more evidence is obtained (the results of actual drug tests), the probability becomes increasingly objective. If we believe that only 5 percent of the field is doping, it turns out that a positive test corresponds with just over a 60 percent chance (about 63 percent) that a cyclist is doping. This may seem counterintuitive, but it is simply because, based on our assumption about how prevalent doping is generally, there are so many more non-doping cyclists who could produce a false positive than doping cyclists who could produce a true positive. As we conduct more tests, we increase the accuracy of our guess about how much of the field is doping, as the output of subsequent rounds of calculation reflects the data more and more and our initial assumptions less and less.
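The arithmetic behind the cyclist example is short enough to check directly. The sketch below applies Bayes’s theorem to the three numbers given in the text (a 98 percent detection rate, a 3 percent false-positive rate, and an assumed 5 percent doping rate); the function and parameter names are illustrative only.

```python
def posterior_doping(prior, sensitivity, false_positive_rate):
    """Probability that a rider is doping, given a positive test (Bayes's theorem)."""
    true_positives = sensitivity * prior                     # dopers who test positive
    false_positives = false_positive_rate * (1.0 - prior)    # clean riders who test positive
    return true_positives / (true_positives + false_positives)


p = posterior_doping(prior=0.05, sensitivity=0.98, false_positive_rate=0.03)
print(f"{p:.3f}")  # 0.632
```

Out of every 10,000 riders, roughly 490 dopers and 285 clean riders would test positive, which is why a positive result leaves so much room for doubt. Changing `prior` shows how strongly the answer depends on the starting belief about how common doping is.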
Though people start from different assumptions, beliefs (such as how common doping is) should converge with additional evidence, becoming less subjective and more objective. In this way, Bayesian thinking provides a guide for how to think — or more specifically, how to turn our subjective beliefs into increasingly objective statements about the world. Using Bayes’s theorem to update our beliefs based on evidence, we arrive ever closer to the same conclusions that others with the same evidence arrive at, slowly moving away from our starting assumptions.
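The convergence claim can be illustrated with two hypothetical observers who begin with very different beliefs about how common doping is. The sketch below makes several simplifying assumptions that are not in the text: each belief is modeled as a Beta distribution, the evidence counts are invented, and doping status is treated as directly observed rather than filtered through an imperfect test.

```python
def posterior_mean(prior_a, prior_b, dopers_found, riders_checked):
    """Mean of the Beta posterior after finding dopers_found among riders_checked."""
    return (prior_a + dopers_found) / (prior_a + prior_b + riders_checked)


# Two hypothetical observers with very different starting beliefs.
skeptic = (1, 19)   # Beta(1, 19): expects about 5 percent of the field to be doping
cynic = (10, 10)    # Beta(10, 10): expects about 50 percent

# Invented evidence, growing over time: (dopers found, riders checked).
evidence = [(0, 0), (8, 100), (80, 1000), (800, 10000)]

gaps = []
for found, checked in evidence:
    s = posterior_mean(*skeptic, found, checked)
    c = posterior_mean(*cynic, found, checked)
    gaps.append(abs(c - s))
    print(f"n={checked:>5}  skeptic={s:.3f}  cynic={c:.3f}  gap={gaps[-1]:.3f}")
```

With no evidence the two observers disagree by 45 percentage points; after 10,000 observations both sit close to 8 percent and the gap between them is under a tenth of a point. The starting assumptions never vanish entirely, but their weight shrinks as the shared evidence accumulates.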
Bayesian statistics adopts a behavioral as opposed to epistemological understanding of statistics, building on the work of early frequentist statisticians like Jerzy Neyman and Egon Pearson. Prior to this behavioral shift (and in the way many introductory statistics classes are still taught), the point of evaluating data was to determine whether one can say some hypothesis is true, e.g. whether a daily low dose of aspirin really reduces the risk of a heart attack, yes or no. As Leonard Savage, an early American Bayesian, noted, “the problems of statistics were almost always thought of as problems of deciding what to say rather than what to do.”
But for many Bayesians, the question is not whether it is permissible to say a claim is true; the point is whether it is profitable to act as if it were true. With a Bayesian approach, it is possible to weigh the certitude of one’s predictions against the costs of being wrong. It allows computation to capture subjective belief and translate it directly into value, into an economic calculation. For example, if one knows the cost of shutting down a production line to recalibrate it as well as the cost of selling defective products, one can calculate how many products to inspect in order to minimize needless inspections and lost production time while keeping the rate of defective products within an acceptable range. In this way, one worries less about knowing the truth (in this case, assuring the quality of every product) and more about how to behave as if one knows it.
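A stripped-down version of the production-line decision shows how belief gets converted into cost. The sketch below is a deliberate simplification: instead of choosing how many products to inspect, it compares two actions given a current belief that the line is miscalibrated. The probabilities, the dollar figures, and the name `cheapest_action` are all hypothetical.

```python
def cheapest_action(p_miscalibrated, cost_recalibrate, cost_defects):
    """Pick the action with the lower expected cost under the current belief."""
    expected_cost = {
        # Stopping the line costs the same whether or not it was miscalibrated.
        "recalibrate": cost_recalibrate,
        # Carrying on only costs something if the line really is off.
        "keep running": p_miscalibrated * cost_defects,
    }
    action = min(expected_cost, key=expected_cost.get)
    return action, expected_cost


# Hypothetical figures: a $5,000 shutdown versus $100,000 in defective goods.
print(cheapest_action(0.10, 5_000, 100_000)[0])  # recalibrate
print(cheapest_action(0.02, 5_000, 100_000)[0])  # keep running
```

The decision flips once the belief crosses the ratio of the two costs (here 5 percent). At no point does the calculation ask whether the line is in fact miscalibrated, only whether acting as if it were is the cheaper bet.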
By understanding statistics as measuring how profitable something will be rather than how accurate a statement about it is, the Bayesian revolution provides a methodology for data analysis that can be applied directly to maximize the return on investment (whether for ads, mortgages, or political campaigns) that can be squeezed from any data at hand. We can see this same underlying logic at work in a whole host of other technologies, from online advertising to determining who will repay credit card debt. The Bayesian revolution ultimately ties knowledge to its ability to be worth something. Statistics — and with it machine learning — turns out to be, in a sense, a subdiscipline of economics.
The Bayesian revolution taught the world how to use any available data as evidence to produce predictions instead of theories. It has likewise convinced a generation of engineers and data scientists that computers can calculate belief just as well as humans can establish and test their own beliefs. In her book about Bayesian statistics, for instance, Sharon McGrayne quotes Peter Norvig, Google’s research director, who told her that “there must have been dozens of times when a project started with naïve Bayes” — one of the simplest and most straightforward implementations of the Bayes classifier, which probabilistically classifies things into categories — “just because it was easy to do and we expected to replace it with something more sophisticated later, but in the end the vast amount of data meant that a more complex technique was not needed.”
Bayesianism in machine learning allows computers to find patterns in large data sets by creating a nearly infinite set of hypotheses that can be given probabilities directly and can be constantly updated as new data is processed. In this way, each possible category for a given thing can be given a probability, and the one with the highest probability is selected as the one to use (at least until the probabilities update when new data is gathered). Thus no complex theory is needed to set up a hypothesis or an experiment; instead a whole field of hypotheses can be automatically generated and evaluated. With cheap computing and massive data stores, there is little cost to evaluating these myriad hypotheses; the most probable can be extracted from a data set and acted upon immediately to produce profit, whether by selling ads, moving money to new investments, or introducing two users to each other.
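A toy naïve Bayes text classifier shows what it means to give every category a probability and pick the highest. This is a sketch under stated assumptions: the four training messages are invented, add-one smoothing is one common choice among several, and the `train` and `classify` helpers are illustrative rather than any production system’s interface.

```python
import math
from collections import Counter


def train(examples):
    """Count how often each label and each word-per-label appears."""
    label_counts, word_counts, vocab = Counter(), {}, set()
    for label, words in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, words):
    """Return the most probable label and the probability of every label."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # prior: how common the label is
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((word_counts[label][w] + 1) / denom)
        log_scores[label] = score
    # Convert log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    z = sum(weights.values())
    probs = {lab: w / z for lab, w in weights.items()}
    return max(probs, key=probs.get), probs


# Invented training data for illustration only.
examples = [
    ("spam", ["win", "cash", "now"]),
    ("spam", ["cheap", "cash", "offer"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "offer", "at", "noon"]),
]
model = train(examples)
label, probs = classify(model, ["cash", "offer"])
print(label, probs)
```

No theory of what makes a message spam appears anywhere in the code. The categories are scored from word counts alone, and retraining on more messages shifts the probabilities without anyone having to revise a hypothesis.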
It’s this approach to data that has led to such claims as Chris Anderson’s 2008 declaration in Wired that we are approaching the end of theory: “Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns.” The subjective theory of statistics has allowed some to dream of the complete evacuation of human involvement from science, rather than asserting the importance of human understanding to that process. Taken to its extreme, the outcome of this thinking is that humans should think and evaluate their own beliefs more like computers rather than the other way around.
Bayesian thinking and its valorization of a science of doing rather than knowing has allowed a whole host of human activities to be predicted rather than theorized. Read in its most dystopian light, this revolution has allowed algorithms to treat subjective knowledge as though it were objective, calculable, and ultimately predictable. This belief in the ability to predict probabilistically has allowed data scientists to try to control who sees what news, which friends people make, what dates they go on, the credit they are given, the jobs their applications are considered for. In the cases of predictive policing, antiterrorism operations, and prison sentencing, algorithms overseen by probability dictate how and to whom state violence is meted out, often reproducing racial, ethnic, and gender biases, as numerous studies (including this recent work by ProPublica on predicting recidivism rates and Safiya Noble’s book about racial bias in search algorithm results) have shown.
We may look back at this revolution in statistical methodology as being as important as the Taylorist and Fordist revolutions in production at the turn of the 20th century. While those earlier revolutions created the conditions for the “automatic” production of material goods, the Bayesian revolution is cementing the conditions for the automatic production of knowledge. Those 20th-century revolutions greatly increased the power and scope of the capitalist class as they desubjectified labor, making workers more like machines. Likewise, the Bayesian automation of knowledge production has now concentrated wealth and information, threatening to circumscribe both the world we see and the way the world sees us.
While it is perhaps tempting to be nostalgic for frequentism’s commitment to knowing over doing, it is increasingly unlikely we could ever go back. Today frequentism appears broken and untenable. (One has only to look at the number of articles — including this recent statement from the American Statistical Association and this editorial in Nature — declaring the dangers of p-values and significance testing, two of the workhorses of frequentism.) The task ahead is to make modes of knowing that are more just, that do not simply serve to reproduce social injustices or concentrate wealth and knowledge in the hands of the few. Without an understanding of the stakes and implications of the Bayesian revolution, this would not only be improbable but impossible.