Methodology for Comparing Citation Database Coverage of Dataset Usage
Terminology
Citation databases form the foundation of modern research tracking and analysis. These digital repositories, such as those examined in this report, systematically catalog scholarly publications and the references among them (De Bellis, 2009). Citation databases differ in how they curate and maintain this information: some focus exclusively on peer-reviewed journal articles with strict inclusion criteria, while others index a broader range of research outputs, including preprints, technical reports, and conference proceedings (Martín-Martín et al., 2021; Mongeon & Paul-Hus, 2016). These curation approaches affect how comprehensively each database captures research impact (Visser et al., 2021).
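To make the effect of curation differences concrete, the sketch below compares the publication coverage of two hypothetical databases by intersecting their sets of indexed identifiers; the database contents shown are placeholders, not the actual holdings of any real citation database.

    # Illustrative sketch only: the DOIs below are placeholder values used to
    # show how coverage overlap between two citation databases can be measured.
    database_a = {"10.1000/alpha", "10.1000/beta", "10.1000/gamma"}
    database_b = {"10.1000/beta", "10.1000/gamma", "10.1000/delta"}

    shared = database_a & database_b   # publications indexed by both sources
    only_a = database_a - database_b   # publications unique to database A
    only_b = database_b - database_a   # publications unique to database B

    # Jaccard overlap: shared publications as a share of the combined corpus.
    overlap = len(shared) / len(database_a | database_b)
    print(f"shared={len(shared)} A-only={len(only_a)} B-only={len(only_b)} "
          f"overlap={overlap:.2f}")

In practice the same set comparison is run over the full publication lists retrieved from each database, and any publication missing from a source cannot contribute dataset mentions to analyses based on that source.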
Understanding how these databases work requires familiarity with bibliometrics, the statistical analysis of published works and their impact (Broadus, 1987). Bibliometric analysis examines patterns in publication, citation networks, and research influence (Hood & Wilson, 2001). The field emerged from early citation indices, which mapped relationships between papers through their references (Garfield, 1955).
These concepts apply directly to tracking USDA dataset usage. Accurate tracking of dataset usage in scientific literature serves multiple purposes: for federal agencies like the USDA, it helps monitor the return on public data investments, identify gaps in dataset use, plan future data collection, and support evidence-based policy decisions. Such tracking requires reliable publication and citation data from citation databases. Unlike journal articles, however, datasets are often referenced informally within the text of a publication rather than cited in its reference list, which makes tracking dataset usage more complex.
To address this challenge, methods have been developed that scan publication text for dataset mentions (Lane et al., 2022). The scope and accuracy of dataset tracking therefore depend on which publications can be accessed and analyzed. Because databases curate content in different ways, they vary in which dataset mentions they capture and how frequently those mentions appear, and this variation affects our ability to accurately track dataset impact and adoption. The DemocratizingData.ai platform, for example, uses bibliometric data to monitor these dataset usage patterns, helping USDA understand how its data supports research. By comparing how different citation databases track this information, we can better understand their strengths and limitations for monitoring research impact.
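As a rough illustration of text-based dataset-mention detection, the sketch below scans sample text for a small set of dataset names and aliases. The alias list and sample sentence are illustrative only; production systems such as those described by Lane et al. (2022) rely on machine-learning models rather than simple pattern matching.

    import re

    # Hypothetical alias list for illustration; real pipelines use curated
    # name variants and trained models, not plain keyword matching.
    DATASET_ALIASES = {
        "Census of Agriculture": ["Census of Agriculture", "Ag Census"],
        "ARMS": ["Agricultural Resource Management Survey", "ARMS"],
    }

    def find_dataset_mentions(full_text):
        """Count case-insensitive mentions of each dataset's aliases in the text."""
        counts = {}
        for dataset, aliases in DATASET_ALIASES.items():
            pattern = "|".join(re.escape(alias) for alias in aliases)
            hits = re.findall(pattern, full_text, flags=re.IGNORECASE)
            if hits:
                counts[dataset] = len(hits)
        return counts

    sample = ("We estimated farm income with the Agricultural Resource "
              "Management Survey (ARMS) and county totals from the Ag Census.")
    print(find_dataset_mentions(sample))
    # {'Census of Agriculture': 1, 'ARMS': 2}

Because detection like this runs over whatever full text a citation database exposes, the same publication can yield dataset mentions in one source and none in another, which is precisely the coverage variation this methodology is designed to measure.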