Big Data and Science: Part 1
In early 2011, Science devoted a special issue to the impact big data is having on science. A series of articles described the effect that the increased ease and affordability of collecting new streams of data are likely to have on individual disciplines ranging from astronomy to genetics to ecology.
1. Verifying scientific results will increasingly require access to the large amounts of data derived from an experiment as well as the software used to analyze it.
Traditional science relied on the reproducibility of experiments to ensure the validity of reported results. Researchers published their methods and results so that others could duplicate them. This is usually impossible with experiments involving large amounts of data. Moreover, any ultimate conclusions will often depend upon subjective judgments made about the collection, synchronization, processing, and interpretation of large amounts of data.
As a result, efforts to ensure either the validity or applicability of reported results will require access to the data itself, as well as a detailed description of how it was collected and processed, and to any software used to interpret it. Unless this information is made public, it will be very difficult to spot any errors in methods or reasoning. In economics, debate over the research of economists Ken Rogoff and Carmen Reinhart on the one hand and Thomas Piketty on the other was only possible because they shared access to both their data and their methods.
2. Big Data is especially complicated in multidisciplinary research.
Much of the most interesting science is increasingly done at the intersection of different disciplines. In areas like neuroscience, social sciences, and ecology, data often comes from experts working in different fields. Yet each discipline usually uses its own terminology and data reporting methods. Integrating data from different databases across disciplines will be an increasingly difficult challenge, and making that data easily usable by scientists from any of those disciplines will often require new tools. For example, the Neuroscience Information Framework has developed a new lexicon and ontology to deal with the terminology of different approaches.
3. Integrating data from multiple sources requires a lot of work.
Data from multiple sensors and satellites must be calibrated to ensure that it can be compared across instruments. The problem grows when dozens or hundreds of data sources are involved. Successful integration usually requires a detailed knowledge of how the original data were collected, how the various sensors operate (or do not), who owns the data, and what, if any, restrictions govern its use. The problem is further complicated when data has been collected at different times or by different people, or when it resides in different databases.
4. Proper curation of data is increasingly important.
In the past, data was usually collected by individual researchers and then kept by them. It was often thrown away when researchers moved, retired, or died. Even when it was not, rapid technological innovation meant that the format in which data were stored, or the computer language in which code was written, quickly became obsolete. If a number of researchers worked on the same project, the data might have been split up among them haphazardly. Andrew Curry describes how one researcher spent years reassembling data from an experiment conducted more than twenty years earlier in order to apply new theoretical insights to the results.
5. Technology has shifted the constraint on scientific discovery from data collection to data storage, processing, and analysis.
Until recently, the dominant constraint on the advance of scientific knowledge was often the lack of data. Most of the time and cost of new scientific projects was usually spent on generating new data. This is still the case in many areas, such as the Food and Drug Administration’s traditional requirements for field testing of new drugs. Increasingly, however, the arrival of vast amounts of new data is overwhelming the rest of the scientific process. In order to be useful, data must also be transmitted, stored, and analyzed. Bottlenecks at each of these stages are increasingly likely to constrain the use of big data.
Cameras in the space satellites of the Autonomous Real-time Ground Ubiquitous Surveillance Imaging System generate hundreds of times more data than can ever be transmitted to the analysts on earth. Two solutions are to either compress the data or to move the processing to the satellite so that only the results are transmitted.
In either case, some data is lost to researchers. Even when scientists have access to all of the data, storing it is a problem. As of 2010, data generation was growing 58% each year while storage grew only 40% each year. As a consequence, finding the capacity and money to store data as it is being generated is increasingly a problem. Finally, the data must be analyzed and written up. This also requires large amounts of processing power and researcher time, at least the latter of which is not growing exponentially.
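The storage figures above compound from year to year, which a short calculation makes concrete. The sketch below is illustrative only: it assumes the 58% and 40% rates hold constant every year, which the original figures do not claim, and the function names are hypothetical ones chosen for this example.

```python
import math


def storage_gap(data_growth: float, storage_growth: float, years: int) -> float:
    """Ratio of data generated to storage capacity after `years`,
    relative to the starting year (where the ratio is normalized to 1.0).

    Assumes both growth rates stay constant, which is a simplification.
    """
    return ((1 + data_growth) / (1 + storage_growth)) ** years


def gap_doubling_time(data_growth: float, storage_growth: float) -> float:
    """Years until the data-to-storage ratio doubles under constant rates."""
    return math.log(2) / math.log((1 + data_growth) / (1 + storage_growth))


if __name__ == "__main__":
    # Rates quoted in the text for 2010: data +58% per year, storage +40% per year.
    for years in (1, 5, 10):
        ratio = storage_gap(0.58, 0.40, years)
        print(f"after {years:2d} years: data/storage ratio is {ratio:.2f}x the starting ratio")
    print(f"the gap doubles roughly every {gap_doubling_time(0.58, 0.40):.1f} years")
```

Under these assumptions the shortfall grows by about 13% a year (1.58 / 1.40), so the gap between what is generated and what can be stored doubles in a little under six years.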