Methodology for Comparing Citation Database Coverage of Dataset Usage
Step 02: Extract Dataset Mentions
Extract Dataset Mentions to Build a Publication Dataset
To generate this dataset, the process requires:
- Dataset name aliases (from Step 1)
- Search routines tailored to each citation database to extract relevant publications
Search routines, described below, guide this step, as dataset mentions are often inconsistent across publications, appearing variously in titles, abstracts, full text, or reference lists. Scopus uses a structured seed corpus to refine searches, while OpenAlex and Dimensions rely on direct queries across their full publication records. The outputs of this step are three publication-level datasets, one per citation database, which are analyzed further in subsequent steps.
Search Routines
The search routines for Scopus were designed to systematically identify mentions of USDA data assets across a vast collection of academic publications. Multiple approaches were used to maximize dataset identification. These included (1) full-text searches, which leveraged Scopus’s licensed access to retrieve dataset mentions directly from publication text, (2) reference searches, which scanned citation lists for dataset appearances, and (3) machine learning models, which applied text-matching algorithms to improve accuracy.
Running the search routines identifies candidate matches against the list of data assets. Each candidate match is effectively a “publication ID – dataset ID” combination and is referred to as a dyad. Any given publication may yield multiple dyads.
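To make the dyad concept concrete, the following minimal sketch models a dyad as a record pairing a publication identifier with a dataset identifier. The field names and identifier values are hypothetical, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dyad:
    """A candidate match between one publication and one target data asset."""
    publication_id: str  # e.g., a Scopus record identifier (hypothetical values below)
    dataset_id: str      # identifier of the target data asset

# One publication can yield multiple dyads, one per mentioned data asset.
dyads = {
    Dyad(publication_id="pub-001", dataset_id="RUCC"),
    Dyad(publication_id="pub-001", dataset_id="CENSUS_OF_AG"),
}
print(len(dyads))  # 2 dyads for a single publication
```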
Identifying references to datasets within scientific publications is inherently difficult for several reasons, including:
- No defined format for dataset references: Datasets are often not cited formally; instead, they are referred to in unpredictable textual contexts and formats.
- Name disambiguation: Datasets can be referred to by their full name, an acronym, or many other valid variants. For instance, the dataset “Rural-Urban Continuum Codes” may also be referred to as “Rural Urban Continuum Codes” or “RUCC”, or by a URL reference.
- Conflicts with other terms and phrases: Contextual cues are needed to confirm, for example, that a mention of a data asset such as “Feed Outlook” is indeed the relevant USDA reference.
- Spelling errors and other imprecise references: Ideally, search algorithms should allow “fuzzy” matching to catch slightly misspelled or misnamed datasets (illustrated in the sketch after this list).
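As a minimal sketch of the kind of fuzzy matching this calls for, the routine below normalizes candidate strings and compares them against a small alias list using Python's standard difflib. The alias list and similarity threshold are illustrative only and do not reproduce the project's actual configuration.

```python
import difflib
import re

# Illustrative aliases for one data asset (not the project's full alias list).
ALIASES = ["Rural-Urban Continuum Codes", "Rural Urban Continuum Codes", "RUCC"]

def normalize(text: str) -> str:
    """Lowercase and replace punctuation so near-duplicate forms align."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).strip()

def fuzzy_match(candidate: str, threshold: float = 0.9) -> bool:
    """Return True if the candidate closely resembles any known alias."""
    cand = normalize(candidate)
    return any(
        difflib.SequenceMatcher(None, cand, normalize(alias)).ratio() >= threshold
        for alias in ALIASES
    )

print(fuzzy_match("Rural-Urban Continum Codes"))  # True: catches the misspelling
print(fuzzy_match("Urban Development Codes"))     # False: too dissimilar
```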
To address these challenges, the project employed the top three models from the 2021 Kaggle competition sponsored by the Coleridge Initiative (see the machine learning routines below).
1. Full Text Search Corpus
Scopus is a large, curated abstract and citation database of scientific literature including scientific journals, books, and conference proceedings. Around 11,000 new records are added each day from over 7,000 publishers worldwide. Elsevier is also licensed to access the full text of publications from many, although not all, of these publishers for internal analysis (the full text is not licensed for public use). Where the appropriate licenses do not exist, the records are excluded from the search. To provide some context in this respect, for calendar year 2022, Elsevier estimates that full text records exist for 91% of records published and captured in that year. With license restrictions also considered, the estimate is that it is possible to undertake full text searches on approximately 82% of the total records for that year.
The USDA full text search corpus was created using Scopus, with a publication year range of 2017 to 2023 inclusive, drawing on three components: relevant Topics, Journals, and Top Authors.
The full text records associated with the USDA search corpus are shown in Table 1:
| | Number of Records |
|---|---|
| 2017-2023 Articles from Topics | 726,423 |
| 2017-2023 Articles from Journals | 1,537,851 |
| 2017-2023 Articles from Top Authors | 21,938 |
| De-duplicated articles from above | 2,089,728 |
| De-duplicated articles where we have full text | 1,630,958 |
| De-duplicated articles where we have full text and are licensed to search | 1,450,086 |
2. References Search Corpus
A search through the references list of Scopus records is also undertaken as a separate and distinct step from the full text search. The search corpus here is broader than for full text, as there are no license conditions restricting the search. In addition, because references contain highly structured data, it is feasible to search through all of Scopus, as the computational limitations of full-text search do not apply.
Because of this, all Scopus records within the publication date range are searched. For the USDA search period of 2017 to 2023, this amounted to 25,110,182 records.
The reference search employs an exact text string matching routine across the references of the identified Scopus records.
Because of the issues associated with generic terms, the same flags as applied in the machine learning step (described below) were also applied here.
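The following sketch illustrates the combination of exact alias matching over reference strings with a flag requirement for generic aliases. The records, aliases, and per-alias flag policy are illustrative examples, not the project's actual data or configuration.

```python
# Flag terms that signal a USDA context (a subset, for illustration).
FLAGS = ("USDA", "NASS", "National Agricultural Statistics Service")

# Maps alias -> (dataset_id, whether a flag term is required).
ALIASES = {
    "Rural-Urban Continuum Codes": ("RUCC", False),
    "Crop Progress Report": ("CROP_PROGRESS", True),  # generic term: flag needed
}

def search_references(pub_id, references):
    """Return (publication_id, dataset_id) dyads found in a reference list."""
    dyads = set()
    for ref in references:
        for alias, (dataset_id, requires_flag) in ALIASES.items():
            if alias in ref:  # exact text string match
                if requires_flag and not any(flag in ref for flag in FLAGS):
                    continue  # generic alias without a USDA flag: filtered out
                dyads.add((pub_id, dataset_id))
    return dyads

refs = [
    "USDA NASS (2021). Crop Progress Report. Washington, DC.",
    "Statistics Canada (2020). Crop Progress Report.",  # no flag: excluded
]
print(search_references("pub-001", refs))  # {('pub-001', 'CROP_PROGRESS')}
```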
A summary of the key results from the reference search is provided in Table 2:
| Process Step Outputs | Number of Records |
|---|---|
| Number of unique Scopus publications identified in reference search | 25,588 |
| Number of those publications that were unique to the reference search (i.e., not found by the Kaggle models) | 22,818 |
| Number of target data assets matched with the above publications | 34,526 |
3. Machine Learning (Kaggle) Routines (Full Text Search)
The top three models from the 2021 Kaggle competition sponsored by the Coleridge Initiative differ in their approaches, strengths, and weaknesses; the strategy was to use all three, aggregating and filtering their results so that the combination would outperform any individual model. The same Kaggle models used in support of the Year 1 USDA project were employed on the data assets available to this project.
The models are applied to the full text search corpus and generate a series of outputs identifying potential dataset matches. Two of the Kaggle models focus on identifying data assets in general, so many of the matches they generate are not relevant to USDA and its target data assets. A further fuzzy text matching routine is therefore applied to the Kaggle output to produce a subset of candidate matches (dyads) linked to the target data assets.
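The aggregate-then-filter step might be pictured as follows: candidate mentions from the three models are pooled, then each candidate string is fuzzy-matched against the target alias list, and only target-linked dyads survive. The model outputs, alias table, threshold, and the difflib-based matcher below are illustrative stand-ins for the project's actual routine.

```python
import difflib

# Hypothetical candidate mentions from the three Kaggle models:
# each entry is (publication_id, candidate_dataset_string).
model_a = [("pub-1", "Census of Agriculture"), ("pub-2", "World Values Survey")]
model_b = [("pub-1", "Census of Agricluture")]  # misspelled mention
model_c = [("pub-3", "Crop Progress Report")]

# Illustrative mapping of target alias -> dataset identifier.
TARGETS = {
    "Census of Agriculture": "CENSUS_OF_AG",
    "Crop Progress Report": "CROP_PROGRESS",
}

def best_target(candidate, threshold=0.9):
    """Fuzzy-match a candidate string to the target aliases; None if no match."""
    score, dataset_id = max(
        (difflib.SequenceMatcher(None, candidate.lower(), alias.lower()).ratio(), did)
        for alias, did in TARGETS.items()
    )
    return dataset_id if score >= threshold else None

# Pool the three models' outputs, then keep only target-linked dyads.
candidates = set(model_a) | set(model_b) | set(model_c)
dyads = {(pub, did) for pub, text in candidates if (did := best_target(text))}
print(sorted(dyads))  # [('pub-1', 'CENSUS_OF_AG'), ('pub-3', 'CROP_PROGRESS')]
```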
As well as producing metadata for the publications and associated dyads, the process records which Kaggle model produced each dyad and the scores associated with the matching routines. In addition, for all returned records where publisher licensing allows, a snippet is produced. The snippet is a fragment of text showing both the referenced dataset and the contextual text surrounding it, giving human validators sufficient context to determine the validity of the candidate reference. The machine learning phase of the project therefore aims to locate all mentions of the target data assets within the full text search corpus and to deliver the candidate matches, along with their snippets, to a database that facilitates subsequent validation by subject matter experts.
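A snippet can be generated as a fixed-width window of text around the matched mention, as in the minimal sketch below. The window size and example text are illustrative; the project's actual snippet logic is not reproduced here.

```python
def make_snippet(full_text, mention, window=80):
    """Return the mention plus surrounding context, or None if not found."""
    pos = full_text.find(mention)
    if pos == -1:
        return None
    start = max(0, pos - window)
    end = min(len(full_text), pos + len(mention) + window)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(full_text) else ""
    return prefix + full_text[start:end] + suffix

text = ("County-level poverty rates were merged with the USDA "
        "Rural-Urban Continuum Codes to classify study sites.")
# Prints the mention with roughly 30 characters of context on each side.
print(make_snippet(text, "Rural-Urban Continuum Codes", window=30))
```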
With a focus on data assets rather than datasets, and with some of the name aliases comprising short acronyms and/or very generic terms, there is a risk of generating high levels of false positives. For example, one of the search terms was “Crop Progress Report”, and many countries beyond the US issue reports on crop progress. Hence, as well as searching for the alias terms, a set of flags/filters was included so that dyads were retained only when a flagged term also appeared. Typically, the filters chosen related either to the agency itself or to research produced in the US. Specifically, for the full text search in the USDA project, the following terms were employed as filters:
NASS, USDA, US Department of Agriculture, United States Department of Agriculture, National Agricultural Statistics Service, Economic Research Service
In total, the use of flags was judged appropriate for 112 of the data assets.
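Using the filter terms listed above, the flag requirement can be expressed as a simple post-filter on candidate dyads, as in the sketch below. The candidate records are hypothetical.

```python
import re

# The six filter terms listed above.
FLAG_TERMS = [
    "NASS", "USDA", "US Department of Agriculture",
    "United States Department of Agriculture",
    "National Agricultural Statistics Service", "Economic Research Service",
]
FLAG_PATTERN = re.compile("|".join(re.escape(term) for term in FLAG_TERMS))

def passes_flag_filter(full_text):
    """Keep a candidate dyad only if at least one flag term appears in the text."""
    return FLAG_PATTERN.search(full_text) is not None

# Hypothetical candidates: (publication_id, dataset_id, publication full text).
candidates = [
    ("pub-1", "CROP_PROGRESS", "We use the USDA Crop Progress Report for Iowa."),
    ("pub-2", "CROP_PROGRESS", "India's weekly Crop Progress Report shows..."),
]
kept = [(pub, did) for pub, did, text in candidates if passes_flag_filter(text)]
print(kept)  # [('pub-1', 'CROP_PROGRESS')]
```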
The Kaggle routines were run in early December 2023 with the process completing on 14 December.
A summary of some of the key results from the Full Text search is provided in Table 3:
| Process Step Outputs | Number of Records |
|---|---|
| Number of unique Scopus publications identified by the three Kaggle algorithms | 635,831 |
| Number of unique publications identified after fuzzy text matching to target data assets | 4,104 |
| Number of target data assets matched in the above publications | 4,392¹ |
| Number of snippets generated | 14,377² |
Post Processing Adjustments – RUCC and QuickStat Increment
Note that the RUCC and QuickStat increment was applied after the Kaggle routines were initially run. The process for running that increment involved two steps:
- A new search of the Scopus reference search corpus using the RUCC and QuickStat aliases.
- A fuzzy text search of the previously generated Kaggle output using the RUCC and QuickStat aliases.
OpenAlex Search Routines
To collect publications mentioning the NASS Census of Agriculture from the OpenAlex Catalog, I conducted a string search using a predefined set of dataset aliases: “Census of Agriculture,” “USDA Census,” “NASS Census,” “Agricultural Census,” and “AG Census.” To minimize false positives, I applied several filters: the publications had to be in English, published between 2017 and 2024, and include at least one author affiliated with an American institution. Additionally, to ensure that the publications were indeed referring to the correct dataset, I required that they also contain specific flag terms within the full text body, such as “USDA,” “US Department of Agriculture,” “United States Department of Agriculture,” “NASS,” or “National Agricultural Statistics Service.”
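A query of this shape could be expressed against the OpenAlex works API roughly as sketched below; one alias is searched per request, and the filter names follow the public OpenAlex documentation as I understand it, so they should be verified before reuse. The flag-term check on the full text would still be applied to the returned records as described above.

```python
import requests

BASE = "https://api.openalex.org/works"
params = {
    # `search` covers title, abstract, and (where indexed) full text.
    "search": '"Census of Agriculture"',
    "filter": ",".join([
        "language:en",
        "from_publication_date:2017-01-01",
        "to_publication_date:2024-12-31",
        "institutions.country_code:US",  # at least one US-affiliated author
    ]),
    "per-page": 200,
    "cursor": "*",  # cursor paging walks the full result set
}

works = []
while True:
    page = requests.get(BASE, params=params, timeout=30).json()
    works.extend(page["results"])
    cursor = page["meta"].get("next_cursor")
    if not cursor:
        break
    params["cursor"] = cursor

print(f"Retrieved {len(works)} candidate works for one alias")
```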
This method closely mirrors the approach used in the USDA Briefing Book (Appendix 1: Data Search), where a similar string search was applied to the Scopus catalog. In the Scopus analysis, the string search was performed primarily on the references text body rather than the full text, and was executed only within a seed corpus. In contrast, our search in OpenAlex was conducted across the entire OpenAlex database. Notably, the references string search in Scopus identified over 80% of the findings documented in the briefing book, highlighting the effectiveness of this approach.
Refer to this Appendix for additional details on file construction.
Coming Soon
Footnotes
Explanatory Note 1: A publication may contain references to more than one target data asset. It may also contain multiple references to the same target data asset. As an example, a publication containing references to three target data assets (Data Asset A = 3 references, Data Asset B = 2 references, Data Asset C = 4 references) would have the value “3” recorded in this field.↩︎
Explanatory Note 2: For the same publication as in Explanatory Note 1, the value here would be 9 (3 + 2 + 4 snippets), provided the license for the publication allowed snippet generation.↩︎