Methodology for Comparing Citation Database Coverage of Dataset Usage
Terminology
Citation databases form the foundation of modern research tracking and analysis. These digital repositories, such as those examined in this report, systematically catalog scholarly publications and the references among them (De Bellis, 2009). Citation databases differ in how they curate and maintain this information: some focus exclusively on peer-reviewed journal articles with strict inclusion criteria, while others index a broader range of research outputs, including preprints, technical reports, and conference proceedings (Martín-Martín et al., 2021; Mongeon & Paul-Hus, 2016). These curation approaches affect how comprehensively each database captures research impact (Visser et al., 2021).
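To make the effect of curation differences concrete, the sketch below compares the publication coverage of two hypothetical databases by intersecting their sets of indexed identifiers; the database contents shown are placeholders, not the actual holdings of any real citation database.

    # Illustrative sketch only: the DOIs below are placeholder values used to
    # show how coverage overlap between two citation databases can be measured.
    database_a = {"10.1000/alpha", "10.1000/beta", "10.1000/gamma"}
    database_b = {"10.1000/beta", "10.1000/gamma", "10.1000/delta"}

    shared = database_a & database_b   # publications indexed by both sources
    only_a = database_a - database_b   # publications unique to database A
    only_b = database_b - database_a   # publications unique to database B

    # Jaccard overlap: shared publications as a share of the combined corpus.
    overlap = len(shared) / len(database_a | database_b)
    print(f"shared={len(shared)} A-only={len(only_a)} B-only={len(only_b)} "
          f"overlap={overlap:.2f}")

In practice the same set comparison is run over the full publication lists retrieved from each database, and any publication missing from a source cannot contribute dataset mentions to analyses based on that source.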
Understanding how these databases work requires familiarity with bibliometrics, the statistical analysis of published works and their impact (Broadus, 1987). Bibliometric analysis examines patterns in publication, citation networks, and research influence (Hood & Wilson, 2001). The field emerged from early citation indices, which mapped relationships between papers through their references (Garfield, 1955).
These concepts apply directly to tracking USDA dataset usage. Accurate tracking of dataset usage in scientific literature serves multiple purposes: for federal agencies like the USDA, it helps monitor the return on public data investments, identify gaps in dataset use, plan future data collection, and support evidence-based policy decisions. Such tracking requires reliable publication and citation data from citation databases. Unlike journal articles, however, datasets are often referenced informally within the text of a publication rather than cited in its reference list, which makes tracking dataset usage more complex.
To address this challenge, methods have been developed that scan publication text for dataset mentions (Lane et al., 2022). The scope and accuracy of dataset tracking therefore depend on which publications can be accessed and analyzed. Because databases curate content in different ways, they vary in which dataset mentions they capture and how frequently those mentions appear, and this variation affects our ability to accurately track dataset impact and adoption. The DemocratizingData.ai platform, for example, uses bibliometric data to monitor these dataset usage patterns, helping USDA understand how its data supports research. By comparing how different citation databases track this information, we can better understand their strengths and limitations for monitoring research impact.
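As a rough illustration of text-based dataset-mention detection, the sketch below scans sample text for a small set of dataset names and aliases. The alias list and sample sentence are illustrative only; production systems such as those described by Lane et al. (2022) rely on machine-learning models rather than simple pattern matching.

    import re

    # Hypothetical alias list for illustration; real pipelines use curated
    # name variants and trained models, not plain keyword matching.
    DATASET_ALIASES = {
        "Census of Agriculture": ["Census of Agriculture", "Ag Census"],
        "ARMS": ["Agricultural Resource Management Survey", "ARMS"],
    }

    def find_dataset_mentions(full_text):
        """Count case-insensitive mentions of each dataset's aliases in the text."""
        counts = {}
        for dataset, aliases in DATASET_ALIASES.items():
            pattern = "|".join(re.escape(alias) for alias in aliases)
            hits = re.findall(pattern, full_text, flags=re.IGNORECASE)
            if hits:
                counts[dataset] = len(hits)
        return counts

    sample = ("We estimated farm income with the Agricultural Resource "
              "Management Survey (ARMS) and county totals from the Ag Census.")
    print(find_dataset_mentions(sample))
    # {'Census of Agriculture': 1, 'ARMS': 2}

Because detection like this runs over whatever full text a citation database exposes, the same publication can yield dataset mentions in one source and none in another, which is precisely the coverage variation this methodology is designed to measure.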