Big Data and Science: Part 2

June 11, 2014

This is the second of two blog posts describing the common themes that emerged from a compilation of articles on big data and science that appeared in the February 11, 2011 issue of Science magazine. This post completes the list of themes and offers some concluding thoughts.

[Read: Big Data and Science, Part I]

6. The primary users of data will increasingly be machines, not humans.

Petabytes of data cannot be searched by hand. As a result, several layers of software will be needed to help separate the wheat from the chaff. First, researchers will need software that can spot the needle in the haystack, such as the unique signature of an atomic particle or the number of cancer cells in a sample. Second, they will need help spotting common traits within large amounts of data, such as genetic markers that influence individual reactions to drugs. Third, they will need help keeping up with the constantly growing volume of research results in a particular field. Scientific results, including publications, must increasingly be in machine-readable format, and work will have to continue on both improving data software and advancing artificial intelligence.

7. The rise of metadata and artificial intelligence may lead to a better understanding and improvement of the scientific process itself.

Science occurs in a specific cultural setting filled with personal relationships, biases, and patterns of practice. Thomas Kuhn analyzed how the dominant paradigms within a scientific discipline can shift over time. The ability to collect and store all of the data surrounding scientific efforts, combined with the development of artificial intelligence software capable of analyzing it and spotting common themes in the literature, should increase our understanding of the scientific process. For example, it should be possible to spot how the interpersonal relationships between coauthors change over time and how groups of coauthors merge, split, and evolve. It should also be possible to analyze text and references to trace the influence of specific scientific ideas through the literature and see whether certain areas of research have been relatively neglected.
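As a rough illustration of the coauthor analysis described above, the following minimal Python sketch counts how often pairs of authors publish together in each year and then compares two years to see which collaborations appeared or disappeared. The paper records, author names, and function names are all hypothetical; a real study would draw on a bibliographic database and would need to handle author-name disambiguation, which this sketch ignores.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (publication year, list of author names).
papers = [
    (2009, ["Ames", "Brown"]),
    (2009, ["Ames", "Brown", "Chen"]),
    (2010, ["Brown", "Chen"]),
    (2011, ["Chen", "Diaz"]),
]

def coauthor_pairs_by_year(records):
    """Count how often each pair of authors publishes together, per year."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, authors in records:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(authors)), 2):
            counts[year][pair] += 1
    return {year: dict(pairs) for year, pairs in counts.items()}

def new_and_dropped(counts, earlier, later):
    """Return pairs seen only in the later year and pairs seen only in the earlier year."""
    before = set(counts.get(earlier, {}))
    after = set(counts.get(later, {}))
    return after - before, before - after

pairs = coauthor_pairs_by_year(papers)
formed, dissolved = new_and_dropped(pairs, 2009, 2010)
```

Comparing successive years in this way gives a crude picture of how groups of coauthors merge and split; tracing the influence of ideas through text and references would require considerably more machinery.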

8. The biggest obstacles to shared data may be behavioral.

Having invested great time and effort in gathering data, many researchers are reluctant to share it with others, some of whom may be trying to answer the same scientific question. Alternatively, if the data is no longer immediately useful, individuals have little incentive to maintain it on the off chance that someone else may attach value to it later. Many authors mentioned the need to change both the informal expectations surrounding data and the formal reward structure for scientists. Increasingly, scientific journals are requiring authors to make the full data sets behind their research publicly available. Genomics is often cited as a leading example of what data practices should look like. The growing practice of formally citing data collections, thereby giving credit to those who originally gathered and now maintain them, also matters. Some funding sources, such as the National Science Foundation, increasingly require information about data maintenance as part of grant applications.

9. Proper data handling requires new institutional responses. 

There is a growing demand for people with the skills to properly store, maintain, and access growing volumes of data. In many cases maintaining the provenance of the data is just as important as maintaining the data itself. There have been attempts both to set up central repositories where data from individual experiments can be stored and accessed and to create intermediary institutions for coordinating the use of data that is stored in different repositories. Just as a librarian performs a unique service separate from that of authors or researchers, the need for data curators supplements the need for more scientists.
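To make the idea of provenance concrete, here is a minimal Python sketch of what a curator might store alongside a data set: a checksum of the data, its source, and the processing steps applied to it. The field names and the example source label are assumptions made for illustration, not an established metadata standard.

```python
import hashlib

def provenance_record(data, source, steps):
    """Build a simple provenance record for a block of raw data (bytes)."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # fingerprint of the exact bytes
        "size_bytes": len(data),
        "source": source,                # hypothetical label for where the data came from
        "processing_steps": list(steps), # what was done to the data, in order
    }

def matches_record(data, record):
    """Check that the data is byte-for-byte what the record describes."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]

raw = b"sample_id,count\n1,42\n"
record = provenance_record(raw, "hypothetical-lab-instrument", ["calibrated", "deduplicated"])
```

A later user can call matches_record on a downloaded copy to confirm it is the same data the record describes before relying on the listed source and processing history.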

10. We need better policies regarding data ownership, privacy, use, and security. 

Although many of the most difficult problems arise in connection with health care, where data about individuals is inextricably linked to issues of their personal privacy, health, and financial security, these problems affect all sciences. Already there are increased demands to ensure that data generated with the help of government funding is more widely shared. But sharing will not automatically settle the question of who is responsible for ensuring data accuracy, for maintaining accompanying software and metadata, or for responding when there is a data breach. In health care and other fields, a delicate balance needs to be struck between the privacy interests of individual patients and researchers and the social interest (including that of patients collectively) in drawing as much information as possible out of large pools of individual data.

Big data will continue to have a huge effect on both the process and output of science in almost every field. Scientists and policy makers can increase the positive impact of this trend by developing proactive responses to each of these ten points.
