Methodology for Comparing Citation Database Coverage of Dataset Usage
Step 2: Extract Dataset Mentions
Extract Dataset Mentions to Build a Publication Dataset
To generate this dataset, the process requires:
- Dataset name aliases (from Step 1)
- Search routines tailored to each citation database to extract relevant publications
Search routines, described below, guide this step because dataset mentions are inconsistent across publications, appearing in titles, abstracts, full text, or reference lists. Scopus uses a structured seed corpus to refine searches; OpenAlex uses both a full-text search and a seed corpus; Dimensions relies only on direct queries across its full publication records. The outputs of this step are three publication-level datasets, one for each citation database, which are analyzed further in subsequent steps.
Search Routines (Scopus)
The search routines for Scopus were designed to systematically identify mentions of USDA data assets across a vast collection of academic publications. Multiple approaches were used to maximize dataset identification. These included (1) full-text searches, which leveraged Scopus’s licensed access to retrieve dataset mentions directly from publication text, (2) reference searches, which scanned citation lists for dataset appearances, and (3) machine learning models, which applied text-matching algorithms to improve accuracy.
Running the search routines identifies candidate matches against the list of data assets. A candidate match is effectively a “publication ID – dataset ID” combination and is referred to as a dyad. Any given publication may contain multiple dyads.
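As a minimal sketch, a dyad might be represented as follows; the class and field names are hypothetical illustrations, not the project's actual schema.

```python
# Minimal illustration of a dyad; names are hypothetical, not the project schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Dyad:
    publication_id: str  # e.g., a Scopus record identifier or DOI
    dataset_id: str      # identifier of a target data asset from Step 1

# A single publication can yield multiple dyads, one per mentioned data asset.
dyads = [
    Dyad(publication_id="pub-001", dataset_id="rural-urban-continuum-codes"),
    Dyad(publication_id="pub-001", dataset_id="census-of-agriculture"),
]
```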
Identifying references to datasets within scientific publications is inherently difficult for a number of reasons including:
- No defined format for dataset references: Datasets are often not cited formally; instead, they are referred to in unpredictable textual contexts and formats.
- Name disambiguation: Datasets can be referred to by their full name, acronym, and many other valid forms. For instance, the dataset “Rural-Urban Continuum Codes” may also be referred to as “Rural Urban Continuum Codes”, “RUCC”, or by a URL reference.
- Conflicts with other terms and phrases: Contextual cues need to be used to ensure, for example, that a data asset such as “Feed Outlook” is indeed the relevant USDA reference.
- Misspellings and other invalid references: Ideally, search algorithms should allow for “fuzzy” matching to catch slightly misspelled or mis-named datasets (see the sketch following this list).
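As a minimal illustration of such fuzzy matching, the following standard-library sketch resolves near-miss spellings of an alias; the aliases and the 0.9 similarity cutoff are assumptions chosen for demonstration, not the project's configuration.

```python
# Fuzzy alias matching with only the standard library; the cutoff is
# an illustrative assumption, not the project's tuned threshold.
from difflib import get_close_matches

ALIASES = [
    "Rural-Urban Continuum Codes",
    "Rural Urban Continuum Codes",
    "RUCC",
]

def fuzzy_match(candidate: str, cutoff: float = 0.9) -> list[str]:
    """Return aliases that approximately match a candidate mention."""
    return get_close_matches(candidate, ALIASES, n=3, cutoff=cutoff)

# A slightly misspelled mention still resolves to the intended alias.
print(fuzzy_match("Rural-Urban Continum Codes"))
# e.g. ['Rural-Urban Continuum Codes', 'Rural Urban Continuum Codes']
```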
To address this challenge, the project employed the top three models from the 2021 Kaggle competition sponsored by the Coleridge Initiative.
1. Full Text Search Corpus
Scopus is a large, curated abstract and citation database of scientific literature, including scientific journals, books, and conference proceedings. Around 11,000 new records are added each day from over 7,000 publishers worldwide. Elsevier is also licensed to access the full text of publications from many, although not all, of these publishers for internal analysis (the full text is not licensed for public use). Where the appropriate licenses do not exist, the records are excluded from the search. For context, Elsevier estimates that full text records exist for 91% of the records published and captured in calendar year 2022; with license restrictions also considered, full text searches are possible on approximately 82% of the total records for that year.
The USDA full text search corpus was created using Scopus, with a publication year range of 2017 to 2023 inclusive, drawing on searches across the Topics, Journals, and Top Authors.
The full text records associated with the USDA search corpus are shown in Table 1:
| | Number of Records |
|---|---|
| 2017-2023 Articles from Topics | 726,423 |
| 2017-2023 Articles from Journals | 1,537,851 |
| 2017-2023 Articles from Top Authors | 21,938 |
| De-duplicated Articles from Above | 2,089,728 |
| De-duplicated articles where full text is available | 1,630,958 |
| De-duplicated articles where full text is available and licensed to search | 1,450,086 |
2. References Search Corpus
A search through the references list of Scopus records is also undertaken as a separate and distinct step from the full text search. The search corpus here is broader than for full text, as there are no license conditions restricting the search. In addition, because references contain highly structured data, it is feasible to search through all of Scopus, as the computational limitations of full-text search do not apply.
Because of this, all Scopus records within the publication date range are searched. For the USDA search period of 2017 to 2023, this amounted to 25,110,182 records.
The reference search employs an exact text string matching routine across the references of the identified Scopus records.
Because of the issues associated with generic terms, the same flags as applied in the Machine Learning step (described below) were also applied here.
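The core of the reference scan can be pictured with a short sketch; the reference strings below are invented for illustration, and the real routine runs over structured Scopus reference data at scale rather than a Python list.

```python
# Illustrative exact-string scan over reference entries; data is invented.
references = [
    "USDA NASS (2019). Crop Progress Report. Washington, DC.",
    "Smith, J. (2020). Soil health outcomes. Journal of Agronomy 12(3).",
]
alias = "Crop Progress Report"

# Exact text string matching: the alias must appear verbatim in a reference.
hits = [ref for ref in references if alias in ref]
print(hits)  # ['USDA NASS (2019). Crop Progress Report. Washington, DC.']
```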
| Process Step Outputs | Number of Records |
|---|---|
| Number of unique Scopus publications identified in reference search | 25,588 |
| Number of those publications that were unique to the reference search (i.e., not found by the Kaggle models) | 22,818 |
| Number of target data assets matched with the above publications | 34,526 |
3. Machine Learning (Kaggle) Routines (Full Text Search)
The top three models from the 2021 Kaggle competition sponsored by the Coleridge Initiative differ in their approaches, strengths, and weaknesses; the strategy was to use all three to generate results, then aggregate and filter those results so the combination would outperform any of the models individually. The same Kaggle models that were used in support of the Year 1 USDA project were employed on the data assets available to this project.
The models are applied to the full text search corpus and generate a series of outputs identifying potential dataset matches. For two of the Kaggle models, the focus is on identifying general data assets so many matches will be generated that are not relevant to USDA and its target data assets. Thus, a further fuzzy text matching routine is applied to the Kaggle output to produce a subset of candidate matches (dyads) that are linked to the target data assets.
As well as producing metadata for the publications and associated dyads, the process records the Kaggle record that produced the dyad and the scores associated with the matching routines. In addition, for all returned records where publisher licensing allows, a snippet is produced. The snippet is a fragment of text that shows both the referenced dataset and the contextual text that surrounds it, to provide human validators sufficient context to enable them to determine the validity of that candidate reference. The machine learning phase of the project therefore aims to locate all mentions of the target data assets within the search corpus of full text publications and to provide the candidate matches along with their snippets of text to a database that can facilitate subsequent validation by subject matter experts.
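A snippet extractor of this kind might look like the following sketch; the 150-character window and case-insensitive matching are illustrative assumptions rather than the project's actual settings.

```python
# Illustrative snippet extraction around a candidate dataset mention.
import re

def make_snippet(full_text: str, alias: str, window: int = 150) -> str | None:
    """Return the alias with ~`window` characters of context on each side."""
    match = re.search(re.escape(alias), full_text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - window)
    end = min(len(full_text), match.end() + window)
    return f"...{full_text[start:end]}..."

text = ("Planting estimates were taken from the USDA Crop Progress Report, "
        "which tracks weekly field conditions across states.")
print(make_snippet(text, "Crop Progress Report"))
```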
With a focus on data assets rather than datasets, and with some of the name aliases comprising short acronyms and/or very generic terms, there is a risk of high levels of false positives. For example, one of the search terms was “Crop Progress Report”, and there are likely to be many countries beyond the US that issue reports on crop progress. Hence, as well as searching for the terms, a set of flags/filters was also applied, so that dyads were identified only where the flagged terms also appeared. Typically, the filters chosen focused either on the agency or on research produced in the US. Specifically, for the full text search in the USDA project, the following terms were employed as filters:
NASS, USDA, US Department of Agriculture, United States Department of Agriculture, National Agricultural Statistics Service, Economic Research Service
In total, the use of flags was identified as being appropriate for 112 of the data assets.
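In simplified form, the flag filter amounts to requiring that an alias and at least one flag term co-occur before a dyad is kept; the case-insensitive substring matching below is an illustrative simplification of the actual routines.

```python
# Simplified alias + flag co-occurrence test for candidate dyads.
FLAG_TERMS = [
    "NASS", "USDA", "US Department of Agriculture",
    "United States Department of Agriculture",
    "National Agricultural Statistics Service", "Economic Research Service",
]

def keep_dyad(text: str, alias: str) -> bool:
    """Keep a candidate dyad only if the alias co-occurs with a flag term."""
    lowered = text.lower()
    return (alias.lower() in lowered
            and any(flag.lower() in lowered for flag in FLAG_TERMS))

print(keep_dyad("...as shown in the USDA Crop Progress Report...",
                "Crop Progress Report"))   # True
print(keep_dyad("...the national crop progress report for Ukraine...",
                "Crop Progress Report"))   # False: no flag term present
```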
The Kaggle routines were run in early December 2023 with the process completing on 14 December.
A summary of some of the key results from the Full Text search is provided in Table 3:
| Process Step Outputs | Number of Records |
|---|---|
| Number of unique Scopus publications identified by the three Kaggle algorithms | 635,831 |
| Number of unique publications identified after fuzzy text matching to target data assets | 4,104 |
| Number of target data assets matched in the above publications | 4,392¹ |
| Number of snippets generated | 14,377² |
Post Processing Adjustments – RUCC and QuickStat Increment
Note that the RUCC and QuickStat increment was applied after the Kaggle routines were initially run. Running that increment involved two steps:
- A new search of the Scopus reference search corpus using the RUCC and QuickStat aliases.
- A fuzzy text search, using the RUCC and QuickStat aliases, of the Kaggle output that had already been generated.
Search Routines (OpenAlex)
The second citation database used is OpenAlex, an open catalog of scholarly publications that provides public access to metadata and, when available, full-text content for open-access publications via its API. Unlike Scopus, which provides controlled access to licensed content, OpenAlex indexes only open-access publications or those for which publishers have made open metadata available.
Two methods were used to identify USDA dataset mentions in OpenAlex: a full-text search (described below) and a seed corpus approach (described in the following section). Both methods focused on peer-reviewed journal articles published between 2017 and 2023 and restricted the dataset to final published versions, excluding preprints and earlier drafts to avoid duplication across versions.
The full-text search relied on querying OpenAlex’s full-text search index using combinations of dataset aliases (e.g., alternate names, acronyms) and institutional flag terms (e.g., “USDA,” “NASS”). The combination of dataset alias and flag terms ensured that retrieved publications made an explicit connection to the correct data source. A “true” dataset mention was recorded only when at least one alias and one flag term appeared in the same publication, increasing the precision of captured dataset mentions.³
Queries were implemented using the `pyalex` Python package⁴, which manages API requests and enforces OpenAlex’s usage rate limits. The search used the `search` and `filter` endpoints, targeting English-language, open-access articles or reviews published after 2017. Results were returned in JSON format based on the OpenAlex Work object schema, including fields for publication metadata, authorship, journal, concepts, citations, and open access status. Each record included metadata fields such as:
- `display_name` (publication title)
- `authorships` (authors and affiliations)
- `host_venue.display_name` (journal)
- `doi` (digital object identifier)
- `concepts` (topics)
- `cited_by_count` (citation counts)
- `type` (publication type, e.g., “article”)
- `publication_year` (year the article was published)
- `language` (language, English only)
- `is_oa` (open access status)
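A hedged sketch of this querying pattern with pyalex follows; the alias and flag terms, the boolean query construction, and the specific filter values are illustrative assumptions rather than the project's published code.

```python
# Sketch of an alias + flag full-text query against OpenAlex via pyalex;
# terms and filter values are illustrative assumptions.
import pyalex
from pyalex import Works

pyalex.config.email = "you@example.org"  # joins OpenAlex's "polite pool"

aliases = ['"Rural-Urban Continuum Codes"', 'RUCC']
flags = ['USDA', '"Economic Research Service"']

# Require an alias AND at least one flag term in the same work.
query = f'({" OR ".join(aliases)}) AND ({" OR ".join(flags)})'

pager = (
    Works()
    .search(query)
    .filter(
        type="article",
        language="en",
        is_oa=True,
        from_publication_date="2017-01-01",
        to_publication_date="2023-12-31",
    )
    .paginate(per_page=200)
)

for page in pager:
    for work in page:
        print(work["doi"], work["display_name"])
```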
The code used to implement this querying and filtering process is publicly available here.
Limitations of the Full-Text Search Method
Although the OpenAlex API provides access to full-text search, limitations in content ingestion affect result completeness. OpenAlex receives publication text through two primary ingestion methods: PDF extraction and n-grams delivery.
In the PDF ingestion method, OpenAlex extracts text directly from the article PDF. However, the references section is not included in the searchable text. References are processed separately to create citation pointers between scholarly works, meaning that mentions of datasets appearing only in bibliographies are not discoverable through full-text search.
In the n-grams ingestion method, OpenAlex does not receive the full article text. Instead, it receives a set of extracted word sequences (n-grams) from the publisher or author. These n-grams represent fragments of text—typically short sequences of one, two, or three words—which are not guaranteed to preserve full continuous phrases. As a result, complete dataset names may be broken apart or omitted, reducing the likelihood that search queries match the intended aliases.
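A toy example makes the fragmentation concrete; the trigram function below is a simplification of how n-gram delivery might behave, not OpenAlex's actual ingestion pipeline.

```python
# Toy trigram extraction: long dataset names never appear as one indexed unit.
def ngrams(text: str, n: int = 3) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the Rural-Urban Continuum Codes dataset"))
# ['the Rural-Urban Continuum', 'Rural-Urban Continuum Codes',
#  'Continuum Codes dataset']
# A longer alias such as "USDA Rural-Urban Continuum Codes" would be split
# across trigrams, so a phrase query for the full alias can fail to match.
```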
These ingestion and indexing limitations affect the completeness of results when relying solely on OpenAlex full-text search. Mentions of USDA datasets that appear either exclusively in references or are fragmented within n-grams may be missed. To address these limitations, an alternative search method was developed based on constructing a filtered seed corpus of publications for local full-text analysis.
Refer to this Appendix for additional details on file construction.
Search Routines (Dimensions)
To identify publications mentioning USDA datasets, we used the Dimensions.ai API, following the same general methodology applied in Scopus and OpenAlex. We reused the same dataset aliases, institutional flag terms, and overall search criteria to ensure consistency across sources. The search covered scholarly publications from 2017 to 2023 and was restricted to works authored by at least one researcher affiliated with a U.S.-based institution.
Dimensions queries are written using a structured Domain Specific Language (DSL). We constructed boolean queries that combined multiple dataset aliases (e.g., “NASS Census of Agriculture”, “USDA Census”, “Agricultural Census”) with institutional identifiers (e.g., “USDA”, “NASS”, “U.S. Department of Agriculture”). As with Scopus and OpenAlex, both a dataset alias and an institutional flag term were required to appear in each result. These terms were grouped using `OR` within each category and then combined with `AND` across categories. For example:
("NASS Census of Agriculture" OR "Census of Agriculture" OR "USDA Census of Agriculture" OR "Agricultural Census" OR "USDA Census" OR "AG Census") AND (USDA OR "US Department of Agriculture" OR "United States Department of Agriculture" OR NASS OR "National Agricultural Statistics Service")
We implemented this process using the `dimcli` Python library, which provides a streamlined interface to the Dimensions.ai API and automates result pagination. A significant advantage of this approach is that the Dimensions.ai platform handles complex searches directly, yielding precise results with reduced computational overhead. By executing these queries through the API, we avoided the technical complexity of downloading and locally processing large amounts of textual content. Moreover, the Dimensions.ai API results can be automatically structured into an analysis-ready DataFrame format, which greatly simplified our subsequent validation, data integration, and analytical workflows.
To maintain methodological consistency with Scopus and OpenAlex, the following filters were applied to the search:
- English-language publications
- Works published between 2017 and 2023
- Document types: articles, chapters, proceedings, monographs, and preprints
- Author affiliations: Publications were filtered to include only those authored by researchers affiliated with at least one U.S.-based institution.
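A hedged sketch of such a query with dimcli follows. The API key and the returned field list are assumptions, the DSL syntax should be checked against the Dimensions documentation, and the U.S.-affiliation restriction is omitted because the exact filter field is not specified here.

```python
# Hedged sketch of a Dimensions DSL query via dimcli; fields are assumptions.
import dimcli

dimcli.login(key="YOUR_API_KEY", endpoint="https://app.dimensions.ai")
dsl = dimcli.Dsl()

# Inside the DSL `for` clause, phrases use escaped double quotes.
aliases = r'\"Census of Agriculture\" OR \"USDA Census\" OR \"Agricultural Census\"'
flags = r'USDA OR NASS OR \"National Agricultural Statistics Service\"'

query = f"""
search publications in full_data
    for "({aliases}) AND ({flags})"
where year in [2017:2023]
    and type in ["article", "chapter", "proceeding", "monograph", "preprint"]
return publications[id + title + doi + year + type + times_cited]
"""

result = dsl.query_iterative(query)  # dimcli pages through all results
df = result.as_dataframe()           # analysis-ready DataFrame
print(df.head())
```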
For comparability with the Scopus and OpenAlex samples, only publications classified as “articles” were retained for final analysis. This restriction reduces duplication across versions (e.g., preprints, proceedings) and reflects our focus on peer-reviewed scholarly output.
For each article, we retrieved metadata including title, authors, DOI, journal, abstract, publication date, citation counts, subject classifications, and links. These fields supported topic-level analysis, author and institution mapping, and validation of dataset mentions.
Using Dimensions.ai provided two main technical advantages. First, because the platform supports full-text query execution natively, we avoided the need to download or parse external files. Second, the API responses were easily converted into analysis-ready DataFrames, which simplified downstream validation and integration with other sources.
Footnotes
Explanatory Note 1: A publication may contain references to more than one target data asset. It may also contain multiple references to the same target data asset. For example, a publication may contain the following references to target assets: Data Asset A = 3 references, Data Asset B = 2 references, Data Asset C = 4 references. Because three target data assets are referenced, the value included in this field would be “3”.↩︎
Explanatory Note 2: For the same publication as in Explanatory Note 1, the value here would be 9 provided the license for the publication allowed for snippet generation.↩︎
This procedure increased the likelihood of capturing genuine dataset references rather than incidental matches to individual words. Initial drafts of the query incorrectly included terms like “NASS” and “USDA” in the alias list. This was corrected to ensure that aliases strictly referred to dataset names, and flag terms referred to organizations.↩︎
`Pyalex` is an open-source library designed to facilitate interaction with the OpenAlex API; see https://help.openalex.org/hc/en-us/articles/27086501974551-Projects-Using-OpenAlex for more information. The package manages request formatting and automates compliance with OpenAlex’s “polite pool” rate limits, which restrict the number of requests per minute and impose backoff delays. Pyalex introduced automatic pauses between requests, with a default `retry_backoff_factor` of 100 milliseconds, to ensure stable and continuous retrieval. This setup enabled systematic querying while adhering to OpenAlex’s usage policies.↩︎