Academic Data Curation: Who Checks? Who Pays? How Much?
The research publishing system works. We get new drugs and new breakthrough discoveries every year. The goal of FAIR research data is to optimise this.
In the last decade, we’ve seen repeated reports highlighting a lack of reproducibility and replicability in published academic research results. This has led to institutional, publisher and, most significantly, funder mandates for research data to be made openly available at the point of publication of the paper.
Over 10 years, we have seen a groundswell of these policies and mandates. This has led to large numbers of data files, with associated metadata, being made available in many data repositories around the world. The birth of the concept of FAIR (Findable, Accessible, Interoperable, Reusable) data in 2014(1) has helped unite global initiatives with a broad, common goal. We have even been hearing from researchers themselves that this is a change they want(2). Doing your research becomes so much easier when you can build on top of the raw research that has gone before, and not just the summarised findings in the conclusions section of a peer-reviewed article.
So, what are the next steps in ensuring all of this information can be turned into knowledge, to be read by fellow researchers, or to train AI models?
Several recent, high-profile publications have highlighted some of the problems that stand in the way(3). As data publishing becomes the norm, questions are surfacing about who should be responsible for checking these datasets and their associated metadata, and to what level they should examine the research. This is in addition to basic technological needs, such as open APIs and compliance with web accessibility standards.
For me, there are several tiers of checking, which can be mapped back to the FAIR data principles. This list is not intended to be exhaustive, but to indicate the complexity and effort required at each stage.
Data with no checks, which can still be useful
1. Academic files and metadata are available on the internet in repositories that follow best-practice norms.
Top-level checks to make data Findable and Accessible
2. The metadata is sufficient for the dataset to be discoverable through a Google search.
3. Policy compliance checks: the files contain no PII and are under the correct license.
Interoperable and Reusable
4. Files are in an open, preservation-optimised format.
5. Subject-specific metadata schemas are applied in compliance with community best practice.
6. Forensic data checks for editing and augmenting.
7. Re-running of the results to ensure replicability.
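Some of the cheaper checks in this list could plausibly be attempted by a machine before a human curator gets involved. The following is a minimal sketch in Python of what that might look like for three of them: descriptive metadata, license and open file formats. The field names, the accepted licenses and the list of open formats are all illustrative assumptions, not any repository's actual policy, and the harder checks (PII, forensic analysis, re-running results) are deliberately left out.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Illustrative assumptions only: a real repository would define its own
# required fields, accepted licenses and preservation-friendly formats.
REQUIRED_FIELDS = ("title", "description", "authors", "license")
ACCEPTED_LICENSES = {"CC0-1.0", "CC-BY-4.0"}
OPEN_FORMATS = {".csv", ".json", ".txt", ".xml"}


@dataclass
class DatasetRecord:
    """Hypothetical, simplified view of a deposited dataset."""
    metadata: dict
    files: list = field(default_factory=list)


def automated_checks(record):
    """Return {check name: passed?} for checks a machine can attempt."""
    meta = record.metadata
    return {
        # Every required field is present and non-empty.
        "descriptive_metadata": all(meta.get(k) for k in REQUIRED_FIELDS),
        # The declared license is on the accepted list.
        "accepted_license": meta.get("license") in ACCEPTED_LICENSES,
        # At least one file exists and every file has an open-format extension.
        "open_formats": bool(record.files) and all(
            PurePosixPath(name).suffix.lower() in OPEN_FORMATS
            for name in record.files
        ),
    }


def needs_human_review(results):
    """List the failed checks, i.e. where curator time would be spent."""
    return sorted(name for name, passed in results.items() if not passed)


# Made-up example record, for illustration only.
example = DatasetRecord(
    metadata={
        "title": "Soil core measurements",
        "description": "Moisture readings from soil cores",
        "authors": ["A. Researcher"],
        "license": "CC-BY-4.0",
    },
    files=["cores.csv", "notes.docx"],
)
print(needs_human_review(automated_checks(example)))
```

A sketch like this only flags records for attention. Passing all three checks says nothing about whether the data itself is correct or reusable, which is where the more expensive human checks come in.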
One important discussion topic is understanding that post-publication metadata curation, whether by humans or machines, can help improve datasets over time. The caveat is that researchers may be intentionally obfuscating the research, or providing so little descriptive metadata that the dataset will always be useless.
As we move through checks 1-7, the human curation, technical expertise and time required all increase, and so does the cost. The scalability of these costs needs to be thought about(4). We cannot rely on volunteers to take all research to levels 2 and 3.
The research data community needs to come up with a plan to move to fully FAIR data by 2030 (points 4+ above), with a full understanding of how each of the steps above is carried out and by whom.
The research publishing system works. We get new drugs and new breakthrough discoveries every year. The goal of FAIR research data is to optimise this: to use machine learning, AI and all human knowledge to reach breakthrough discoveries, develop treatments for pandemics and improve our understanding of climate change faster.
References
1. Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
2. Science, Digital; Fane, Briony; Ayris, Paul; Hahnel, Mark; Hrynaszkiewicz, Iain; Baynes, Grace; et al. (2019): The State of Open Data Report 2019. figshare. Report. https://doi.org/10.6084/m9.figshare.9980783.v2
3. Nature 578, 199-200 (2020) https://doi.org/10.1038/d41586-020-00287-y
4. Nature 578, 491 (2020) https://doi.org/10.1038/d41586-020-00505-7