The Data Citation Corpus - tracking NIH-funded open academic data
In 2023, the Wellcome Trust awarded funds to build an open Data Citation Corpus to dramatically transform the data citation landscape. Through this award, DataCite has partnered with Chan Zuckerberg Initiative, EMBL-EBI, and other organizations that identify and assert data citations.
We at Digital Science have been looking at the Data Citation Corpus to dig deeper into data citation counts.
The first release is based on a seed file that includes data citations from the following sources:
- Data citations from DataCite and Crossref DOI metadata, via Event Data.
- Data citations from the CZI Science Knowledge Graph, identified via a Named Entity Recognition model that searches for mentions of datasets in the full text of journal articles and preprints in Europe PMC.
In other words, we are looking at papers that link to a DataCite DOI or an accession number.
By combining this dataset with Dimensions.ai data in Google BigQuery, we were able to add more dimensions to the dataset (pardon the pun), such as funder or institution. Only about 70% of the paper links in the Data Citation Corpus were resolvable DOIs. This should improve over time.
The combined data allows us to track how well policies such as the NIH open data policy are encouraging linking to datasets from papers.
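To make the idea of combining the two sources concrete, here is a minimal sketch in plain Python rather than BigQuery. It is not the query we ran: the field names (`publication`, `dataset`, `doi`, `funders`), the helper names, and the sample rows are all made up for illustration. It normalises the paper identifiers, joins citation rows to publication metadata on DOI, counts data citations per funder, and reports the share of rows that matched.

```python
from collections import Counter

# Common ways a DOI is written as a link or prefixed string.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw):
    """Lower-case a DOI and strip URL prefixes; return None if it is not a DOI."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Very loose check: DOIs start with "10." and contain a slash.
    return doi if doi.startswith("10.") and "/" in doi else None


def citations_by_funder(citations, publications):
    """Join citation rows to publication metadata on DOI and count per funder.

    Returns (counts, match_rate), where match_rate is the share of citation
    rows whose paper identifier matched a publication record.
    """
    by_doi = {normalise_doi(p["doi"]): p for p in publications}
    by_doi.pop(None, None)  # drop records whose DOI could not be normalised
    counts = Counter()
    matched = 0
    for row in citations:
        pub = by_doi.get(normalise_doi(row["publication"]))
        if pub is None:
            continue  # identifier was not a DOI, or no metadata was found
        matched += 1
        for funder in pub["funders"]:
            counts[funder] += 1
    match_rate = matched / len(citations) if citations else 0.0
    return counts, match_rate


# Hypothetical sample rows, invented purely for this example.
citations = [
    {"publication": "https://doi.org/10.1000/paper-a", "dataset": "10.5061/dryad.example1"},
    {"publication": "10.1000/PAPER-A", "dataset": "GSE000001"},
    {"publication": "10.1000/paper-b", "dataset": "10.6084/m9.figshare.example2"},
    {"publication": "PMC0000000", "dataset": "10.5061/dryad.example3"},
]
publications = [
    {"doi": "10.1000/paper-a", "funders": ["NIH"]},
    {"doi": "10.1000/paper-b", "funders": ["NIH", "Wellcome Trust"]},
]

counts, match_rate = citations_by_funder(citations, publications)
print(dict(counts), match_rate)
```

The fourth sample row uses a non-DOI identifier for the paper, so it drops out of the join. That is the same kind of gap as the unresolvable paper links mentioned above.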
The data behind the graphs is on Figshare here: https://doi.org/10.6084/m9.figshare.26649703.v1
If you would like to play around with the data yourself, you can request it here. One limitation of the Data Citation Corpus is that it only covers data up to 2022. At Digital Science, we have up-to-date data available through Dimensions.ai, so we will continue to look for ways to track compliance and monitor the growth and reuse of open academic data going forward.