Methodology for Comparing Citation Database Coverage of Dataset Usage

Step 02: Extract Dataset Mentions

Published

March 16, 2025

Extract Dataset Mentions to Build a Publication Dataset

The goal of this step is to build a dataset of publications that reference the dataset name aliases for the USDA data assets across Scopus, OpenAlex, and Dimensions.

To generate this dataset, the process requires:

  • Dataset name aliases (from Step 1)
  • Search routines tailored to each citation database to extract relevant publications

Search routines, described below, guide this step, as dataset mentions are often inconsistent across publications—appearing in titles, abstracts, full text, or reference lists. Scopus uses a structured seed corpus to refine searches, while OpenAlex and Dimensions rely on direct queries across their full publication records. The outputs of this step are three publication-level datasets, one for each citation database, which will be further analyzed in subsequent steps.

Search Routines

The search routines for Scopus were designed to systematically identify mentions of USDA data assets across a vast collection of academic publications. Multiple approaches were used to maximize dataset identification. These included (1) full-text searches, which leveraged Scopus’s licensed access to retrieve dataset mentions directly from publication text, (2) reference searches, which scanned citation lists for dataset appearances, and (3) machine learning models, which applied text-matching algorithms to improve accuracy.

Process of Running Search Routines

Running the search routines produces candidate matches against the list of data assets. A candidate match is effectively a “publication ID – dataset ID” combination and is referred to as a dyad. Any given publication may yield multiple dyads.
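As a minimal sketch of the dyad concept (using hypothetical identifiers, since the document does not show the actual ID schemes), candidate matches can be modeled as unique publication ID – dataset ID pairs:

```python
# Sketch: represent candidate matches ("dyads") as publication ID - dataset ID
# pairs. All identifiers below are hypothetical, not real Scopus or USDA IDs.

def build_dyads(matches):
    """Collapse raw match records into unique (publication_id, dataset_id) dyads."""
    return sorted({(m["publication_id"], m["dataset_id"]) for m in matches})

# One publication can yield multiple dyads, and repeated mentions of the same
# dataset collapse to a single dyad.
raw_matches = [
    {"publication_id": "pub-001", "dataset_id": "rucc"},
    {"publication_id": "pub-001", "dataset_id": "rucc"},       # duplicate mention
    {"publication_id": "pub-001", "dataset_id": "ag-census"},
    {"publication_id": "pub-002", "dataset_id": "feed-outlook"},
]

dyads = build_dyads(raw_matches)
# pub-001 contributes two dyads; pub-002 contributes one.
```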

Identifying references to datasets within scientific publications is inherently difficult for a number of reasons, including:

  1. No defined format for dataset references: Datasets are often not cited formally; instead, they are referred to using unpredictable textual contexts and formats.
  2. Name disambiguation: Datasets can be referred to by their full name, an acronym, or many other valid variants. For instance, the dataset “Rural-Urban Continuum Codes” may also be referred to as “Rural Urban Continuum Codes,” as “RUCC,” or by a URL reference.
  3. Conflicts with other terms and phrases: Contextual cues are needed to ensure, for example, that a mention of a data asset such as “Feed Outlook” is indeed the relevant USDA reference.
  4. Misspellings and other invalid references: Ideally, search algorithms need to allow for “fuzzy” matching to catch slightly misspelled or mis-named datasets.
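To illustrate the fuzzy-matching idea in point 4, here is a minimal sketch using Python's standard-library difflib; this is an illustration of the concept only, not the Kaggle models the project actually used:

```python
import difflib

# Sketch of "fuzzy" alias matching with stdlib difflib. The alias list is
# taken from the example in the text; the cutoff value is an assumption.

ALIASES = ["Rural-Urban Continuum Codes", "Rural Urban Continuum Codes", "RUCC"]

def fuzzy_match(phrase, aliases=ALIASES, cutoff=0.85):
    """Return the best-matching alias for a phrase, or None below the cutoff."""
    hits = difflib.get_close_matches(phrase, aliases, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Catches a slightly misspelled mention...
fuzzy_match("Rural-Urban Continum Codes")
# ...while rejecting an unrelated phrase.
fuzzy_match("Consumer Price Index")
```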

To address these challenges, the project employed the top three models from the 2021 Kaggle competition sponsored by the Coleridge Initiative.

1. Full Text Search Corpus

Scopus is a large, curated abstract and citation database of scientific literature, including scientific journals, books, and conference proceedings. Around 11,000 new records are added each day from over 7,000 publishers worldwide. Elsevier is also licensed to access the full text of publications from many, although not all, of these publishers for internal analysis (the full text is not licensed for public use). Where the appropriate licenses do not exist, the records are excluded from the search. For context, Elsevier estimates that full text records exist for 91% of the records published and captured in calendar year 2022; once license restrictions are also considered, full text searches are possible on approximately 82% of the total records for that year.

The USDA full text search corpus was created using Scopus, covering publication years 2017 to 2023 inclusive and drawing articles from three components: Topics, Journals, and Top Authors.

The full text records associated with the USDA search corpus are shown in Table 1:

Table 1: Full Text Records Associated with USDA Search Corpus

  Corpus component                                                            Number of Records
  2017-2023 articles from Topics                                              726,423
  2017-2023 articles from Journals                                            1,537,851
  2017-2023 articles from Top Authors                                         21,938
  De-duplicated articles from the above                                       2,089,728
  De-duplicated articles where we have full text                              1,630,958
  De-duplicated articles where we have full text and are licensed to search   1,450,086
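The de-duplication step in Table 1 amounts to a set union across the three corpus components. A minimal sketch, with hypothetical article IDs standing in for Scopus record identifiers:

```python
# Sketch: de-duplicate articles pooled from the three corpus components
# (Topics, Journals, Top Authors). IDs are hypothetical, not real Scopus IDs.

def deduplicate(*sources):
    """Union the article-ID sets from each source, keeping each article once."""
    unique = set()
    for source in sources:
        unique.update(source)
    return unique

topics      = {"a1", "a2", "a3"}
journals    = {"a2", "a3", "a4", "a5"}
top_authors = {"a5", "a6"}

corpus = deduplicate(topics, journals, top_authors)
len(corpus)  # 6 unique articles from 9 total hits
```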

2. References Search Corpus

A search through the references list of Scopus records is also undertaken as a separate and distinct step from the full text search. The search corpus here is broader than for full text, as there are no license conditions restricting the search. In addition, because references contain highly structured data, it is feasible to search through all of Scopus, as the computational limitations of full-text search do not apply.

Because of this, all Scopus records within the publication date range are searched. For the USDA search period of 2017 to 2023, this amounted to 25,110,182 records.

The reference search employs an exact text string matching routine across the references of the identified Scopus records.

Because of the issues associated with generic terms, the same flags applied in the Machine Learning step were also applied here.
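A sketch of what exact reference matching combined with generic-term flags might look like; the alias and flag lists below are illustrative assumptions, not the project's actual lists:

```python
# Sketch: exact string matching over a reference string, with a "flag term"
# check to disambiguate generic dataset names. Lists here are illustrative.

FLAG_TERMS = ["USDA", "United States Department of Agriculture"]

def match_reference(reference, alias, flags=FLAG_TERMS):
    """True if the alias appears verbatim and a disambiguating flag is present."""
    return alias in reference and any(f in reference for f in flags)

ref = "USDA Economic Research Service. Feed Outlook, March 2021."
match_reference(ref, "Feed Outlook")  # matched: alias plus USDA flag present
match_reference("Feed Outlook for dairy farmers, Anon.", "Feed Outlook")  # rejected: no flag
```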

Table 2: Number of Records from Scopus References Search Routine

  Process step output                                                                  Number of Records
  Unique Scopus publications identified in the reference search                        25,588
  Publications unique to the reference search (i.e., not found by the Kaggle models)   22,818
  Target data assets matched with the above publications                               34,526

3. OpenAlex Search Routine

To collect publications mentioning the NASS Census of Agriculture from the OpenAlex Catalog, I conducted a string search using a predefined set of dataset aliases: “Census of Agriculture,” “USDA Census,” “NASS Census,” “Agricultural Census,” and “AG Census.” To minimize false positives, I applied several filters: the publications had to be in English, published between 2017 and 2024, and include at least one author affiliated with an American institution. Additionally, to ensure that the publications were indeed referring to the correct dataset, I required that they also contain specific flag terms within the full text body, such as “USDA,” “US Department of Agriculture,” “United States Department of Agriculture,” “NASS,” or “National Agricultural Statistics Service.”
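One way such a query could be expressed against the OpenAlex API is sketched below. The filter names follow the public OpenAlex documentation, but the retrieval code actually used is not shown in this document, so treat this as an illustrative reconstruction:

```python
from urllib.parse import urlencode

# Sketch of an OpenAlex works query for one dataset alias, applying the
# language, date-range, and US-affiliation filters described in the text.

BASE = "https://api.openalex.org/works"

def alias_query_url(alias):
    """Build a query URL for one alias; flag-term filtering would follow separately."""
    params = {
        "search": alias,
        "filter": ",".join([
            "language:en",
            "from_publication_date:2017-01-01",
            "to_publication_date:2024-12-31",
            "authorships.institutions.country_code:US",
        ]),
        "per-page": "200",
    }
    return f"{BASE}?{urlencode(params)}"

url = alias_query_url("Census of Agriculture")
```

The flag-term check (“USDA,” “NASS,” etc.) described above would then be applied to the full text of each returned record, since it cannot be expressed in a single API filter.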

This method closely mirrors the approach used in the USDA Briefing Book sent by Julia (Appendix 1: Data Search), where a similar string search was applied to the Scopus catalog. In the Scopus analysis, the string search was performed primarily on the references text body rather than the full text and was executed only within a seed corpus. In contrast, our search in OpenAlex was conducted across the entire OpenAlex database. Notably, the references string search in Scopus identified over 80% of the findings, as documented in the briefing book, highlighting the effectiveness of this approach.

Refer to this Appendix for additional details on file construction.

Coming Soon

Footnotes

  1. Explanatory Note 1: A publication may contain references to more than one target data asset. It may also contain multiple references to the same target data asset. As an example, a publication may contain the following references to target assets: Data Asset A = 3 references, Data Asset B = 2 references, Data Asset C = 4 references. Because this publication references three distinct target data assets, the value included in this field would be “3”.↩︎

  2. Explanatory Note 2: For the same publication as in Explanatory Note 1, the value here would be “9” (3 + 2 + 4 references), provided the license for the publication allowed for snippet generation.↩︎