The Data Citation Corpus - tracking NIH-funded open academic data


In 2023, the Wellcome Trust awarded funds to build an open Data Citation Corpus to dramatically transform the data citation landscape. Through this award, DataCite has partnered with Chan Zuckerberg Initiative, EMBL-EBI, and other organizations that identify and assert data citations.

Data Citation Corpus – Make Data Count

We at Digital Science have been looking at the Data Citation Corpus to dig deeper into data citation counts.

The first release is based on a seed file that includes data citations from the following sources:

  • Data citations from DataCite and Crossref DOI metadata, via Event Data.
  • Data citations from the CZI Science Knowledge Graph, identified via a Named Entity Recognition model that searches for mentions of datasets in the full text of journal articles and preprints in Europe PMC.

So we are basically looking at papers that link to a DataCite DOI or an accession number.

By combining this dataset with Dimensions.ai data in Google BigQuery, we were able to add more dimensions to the dataset (pardon the pun), such as funder or institution. Only about 70% of the paper links that the Data Citation Corpus gave us were resolvable DOIs, though this should improve over time.
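As a rough illustration of that matching step, here is a minimal Python sketch. It is not the query we ran in BigQuery. The field names (`publication`, `dataset`, `year`, `funders`), the helper names, and the sample records are all made up for the example. It normalises each paper identifier to a bare DOI, sets aside anything that does not look like a DOI, and attaches metadata to the rest.

```python
import re

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that are often wrapped around a DOI in metadata.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Return a lowercase bare DOI, or None if the value does not look like one."""
    if not raw:
        return None
    value = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if value.startswith(prefix):
            value = value[len(prefix):]
            break
    return value if DOI_PATTERN.match(value) else None


def enrich(citations, publications):
    """Split citations into those matched to publication metadata and the rest."""
    matched, unmatched = [], []
    for row in citations:
        doi = normalize_doi(row.get("publication"))
        meta = publications.get(doi) if doi else None
        if meta is None:
            unmatched.append(row)
        else:
            matched.append({**row, "doi": doi, **meta})
    return matched, unmatched


# Made-up records for illustration only (hypothetical field names and values).
citations = [
    {"publication": "https://doi.org/10.1234/ABC.1", "dataset": "10.5555/dataset.1"},
    {"publication": "doi:10.1234/abc.2", "dataset": "ACC0001"},
    {"publication": "not-a-doi", "dataset": "ACC0002"},
]
publications = {
    "10.1234/abc.1": {"year": 2021, "funders": ["NIH"]},
    "10.1234/abc.2": {"year": 2022, "funders": ["Wellcome Trust"]},
}

matched, unmatched = enrich(citations, publications)
print(len(matched), len(unmatched))
```

Keeping the unmatched rows in their own list makes it easy to see how much of the corpus drops out at this step.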

Combining the two datasets allows us to track how well policies such as the NIH open data policy are encouraging papers to link to datasets.
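To make the kind of count behind this tracking concrete, here is a small Python sketch. It assumes citation rows that already carry the paper's DOI, publication year and funder list. Those field names, the function name and the sample rows are hypothetical, not the Data Citation Corpus or Dimensions schema. Each paper is counted once per year, however many datasets it links to.

```python
from collections import defaultdict


def papers_with_data_links_by_year(rows, funder):
    """Count distinct papers per year that list `funder` and link to a dataset."""
    papers = defaultdict(set)
    for row in rows:
        if funder in row["funders"]:
            # A set de-duplicates papers that cite more than one dataset.
            papers[row["year"]].add(row["doi"])
    return {year: len(dois) for year, dois in sorted(papers.items())}


# Made-up rows for illustration only; one row per paper-to-dataset link.
rows = [
    {"doi": "10.1234/abc.1", "year": 2021, "funders": ["NIH"], "dataset": "10.5555/dataset.1"},
    {"doi": "10.1234/abc.1", "year": 2021, "funders": ["NIH"], "dataset": "ACC0001"},
    {"doi": "10.1234/abc.3", "year": 2022, "funders": ["NIH", "Wellcome Trust"], "dataset": "ACC0002"},
    {"doi": "10.1234/abc.4", "year": 2022, "funders": ["Wellcome Trust"], "dataset": "ACC0003"},
]

counts = papers_with_data_links_by_year(rows, "NIH")
print(counts)
```

Counting distinct papers rather than links keeps a single paper with many accession numbers from inflating the yearly total.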

The data behind the graphs is on Figshare here: https://doi.org/10.6084/m9.figshare.26649703.v1

Number of NIH-funded papers with a link to a dataset - based on Data Citation Corpus

If you would like to play around with the data yourself, you can request it here. One limitation of the Data Citation Corpus is that it only includes data up to 2022. At Digital Science, we have data up to the present available through Dimensions.ai, so we will continue to look for ways to track compliance and monitor the growth and reuse of open academic data going forward.

Engage with the Data Citation Corpus!
The Make Data Count initiative works with researchers, repositories, librarians, publishers, institutional representatives, funders, policymakers and infrastructure providers to promote open data metrics and the responsible evaluation of research data usage. We want to hear feedback from the community to inform the development of the Data Citation Corpus. If you would like to provide feedback on the corpus or discuss a collaboration as a pilot partner, please complete the form below and we will follow up with you. For any questions about the corpus or Make Data Count, you can also contact Iratxe Puebla, Director of Make Data Count.
