Scientific Big Data Access Soon to Get Bigger

Scientists receiving federal funds will soon have to include plans for public access to much of their raw data. This new requirement was spelled out earlier this year in a memo from the executive branch’s Office of Science and Technology Policy. Once there is an official statement, scientists applying for federal grants will likely start seeing requirements for a data management plan in the coming months. Many of the details of how and where this data will be stored are still unclear, and the timeframe is still uncertain.

The enormous amount and variety of data across all scientific disciplines pose a significant challenge to anyone trying to put together a single centralized warehouse of research data. It is possible that individual agencies, or even publishers, might be the stewards of the raw data files.

Generally speaking, the data that would have to be included are the individual data points used in the preparation of a published paper. Data points that have been expunged from the final analysis will likely have to be included as well, the idea being that scientists can evaluate why those points were eliminated.

Victoria Stodden, a statistician at Columbia University and an expert on open data, expressed concern at the prospect that computer code might not be included in the requirements, since having the code is as important as having the raw data. Without the code, experimental results would not be replicable.

This upcoming era of openness in scientific data holds special significance for me, since in a previous life I was an independent researcher in astrophysics. For several years, I used my machine learning skills to detect gravitational wave signatures in data collected by the Laser Interferometer Gravitational-Wave Observatory (LIGO), managed by Caltech and MIT. But when I visited one of the detectors, in Hanford, Washington, and spoke with one of the experimental physicists in residence, I was told that the raw data were unavailable to the public. I brought up the point that my tax dollars funded the research, so why couldn’t I access the data? The only reason given was that if the data were made publicly available, there would be too many “false positives” that the project scientists would have to check out. Whether that is true or not, I’m pleased to see this new spirit of openness and what it could mean for scientific discovery.




