What Kind of Data Revolution is This?

October 21, 2014

(Above: 17th-century mathematicians Blaise Pascal and Pierre de Fermat)

The word “revolution” has become commonplace among big data proponents for characterizing the changes brought about by new analytics tools and their associated technologies. Some have argued that big data analytics herald the “end of theory,” the end of uncertainty about the future, or the end of traditional sample-based testing methods. While almost invariably overblown, such proclamations are nevertheless right about one thing: big data is revolutionary. But what kind of revolution is it?

In addition to transforming the nature and processes of communication and economic production, big data analytics provide new kinds of tools for the natural and social sciences. The sciences have undergone few methodological transformations of this kind since the 19th century, when probability theory became, for the first time, not just a branch of mathematics but also a method for practicing the sciences.

Until about 200 years ago, probability and statistics were relatively disparate fields. The calculus of probabilities was a branch of mathematics dating back to the 17th-century mathematicians Blaise Pascal and Pierre de Fermat; it was the purview of mathematicians. Statistics, meanwhile, consisted largely in the cataloguing of large data sets, such as population and demographic records. Statisticians, as Louis Menand describes in The Metaphysical Club, were mostly number crunchers employed by, and supplying data to, the state – hence the term “statistics.” Then two things happened.

First, mathematicians such as Carl Friedrich Gauss and Pierre-Simon Laplace honed the newly developed probabilistic methods and applied them to the natural sciences. With this application of probability theory to scientific inquiry, a whole new framework for thinking about science and scientific knowledge emerged.

The reason, as Ian Hacking argues in his book The Emergence of Probability, was that probabilistic mathematical tools allowed uncertainty to be incorporated into the exact sciences in a new way: uncertainty itself could be quantified. Accordingly, the aim of science was no longer scientia – the Latin term for “knowledge,” signifying perfect and unchanging knowledge. Science now aimed at knowledge that was neither “perfect” nor “unchangeable,” but rather probabilistically precise.

For example, the method of least squares helped to solve practical challenges presented by uncertain astronomical observations: various and divergent observations of a single star could be seen to converge around a mean taken to be that star’s most probable location. Uncertainty was no longer an unfortunate, though unavoidable, byproduct of observation; it became an integral, quantifiable part of scientific practice.
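To see the mechanics at work, here is the calculation in its simplest textbook form (a schematic illustration; the notation is modern, not that of the original astronomers). Given $n$ divergent observations $x_1, \dots, x_n$ of a star’s position, least squares seeks the value $\hat{\mu}$ that minimizes the sum of squared errors:

$$
\hat{\mu} = \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2 = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

Setting the derivative $-2 \sum_{i=1}^{n} (x_i - \mu)$ to zero shows that the minimizer is simply the arithmetic mean: the single location around which the observational errors cancel. And the residual scatter around that mean is not discarded but measured, which is exactly the sense in which uncertainty became quantifiable.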

Second, the non-exact sciences became relatively exact for the first time. It did not take long for scientists to begin applying the new probabilistic methods to the messier domains of research into what would later be called the social and human sciences. The large data sets collected, parsed, and catalogued by statisticians since the time of the ancient Egyptians could now be subjected to the same probabilistic tools.

The result was that social events and trends, hitherto thought to be random, were shown to obey strict probabilistic laws. As with the application of probability to the exact sciences, this transformed, for better or for worse, the way people thought about the objects of scientific investigation – in this case, people, populations, and demographics.

Similarly, today we have a whole new body of data: the gargantuan data sets pertaining to people, things, and events recorded by smart sensors and social media, all of which were until recently too large or too expensive to store. But more importantly, these data sets can now be subjected to new mathematical tools. Big data analytics provide a whole new method for deciphering, communicating, and making predictions about these ever-growing data.

Moreover, in a fortuitous parallel with the 19th-century revolution in statistics, the impact of these new analytics tools in the sciences is being felt especially in the study of social phenomena – thanks to the data about social and consumer behavior made available through social media, smart devices, and Internet searching – and in astronomy, where enormous and fast-growing sets of astronomical data are now stored, parsed, and deciphered. But as with the maturation and application of probability to the natural and social sciences, big data doesn’t represent a scientific revolution so much as a revolution in the tools that science has at its disposal.

We have yet to see just how much these new data analytics tools will augment or transform classical analytics (that is a question to be answered by computer scientists). We have also yet to see just how much they will transform the way we think about knowledge, uncertainty, and observation. But there is no doubt that as big data analytics enrich the scientist’s toolkit, the framework for practicing all of the sciences will change. And as in the 19th century, this change will affect – is already affecting – not only the sciences themselves, but also the way we think about and act in the political, business, and social domains of life.

The emergence of probability coincided with – and even helped to bring about – a whole new way of thinking about knowledge. The tool changed not only the practice but ultimately the nature of the thing being practiced. We may one day look back on the early 21st century as marking the emergence of a new and broader scientific framework – an improvement upon, if not a break from, the statistical methods and assumptions that dominated the way people thought about natural and social phenomena for two centuries.