Methodology for Comparing Citation Database Coverage of Dataset Usage

Step 03: Clean Publication Data

Published

March 16, 2025

Pre-Process and Clean Publication Datasets

The goal of this step is to standardize and resolve inconsistencies in publication records by disambiguating journal names, author affiliations, and institutions across the three databases.

To compare publication coverage across citation databases, we first identify all journals that contain publications using each dataset in Scopus, OpenAlex, and Dimensions. See “Define Data Assets” for the list of datasets whose coverage we evaluate.

We subset the publication dataset from Step 2 by filtering for dataset mentions. For example, if a publication references Ag Census, it is included in the Ag Census sub-dataset; otherwise, it is excluded. This process identifies dataset-specific publication patterns in Scopus, OpenAlex, and Dimensions.
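The subsetting rule described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the record layout, the `datasets` field, the placeholder DOIs, and the name "Other Dataset" are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical publication records; each lists the datasets it mentions.
publications = [
    {"doi": "10.0000/example.1", "datasets": ["Ag Census"]},
    {"doi": "10.0000/example.2", "datasets": ["Other Dataset", "Ag Census"]},
    {"doi": "10.0000/example.3", "datasets": ["Other Dataset"]},
]


def subset_by_dataset(records):
    """Build one sub-dataset per dataset name.

    A publication is placed in a sub-dataset only if it mentions that
    dataset; a publication mentioning several datasets appears in each.
    """
    subsets = defaultdict(list)
    for record in records:
        # set() guards against the same dataset being listed twice.
        for name in set(record["datasets"]):
            subsets[name].append(record)
    return dict(subsets)


subsets = subset_by_dataset(publications)
ag_census_dois = [record["doi"] for record in subsets["Ag Census"]]
```

Here `ag_census_dois` holds the first two placeholder DOIs, and the third publication is excluded from the Ag Census sub-dataset because it never mentions that dataset.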

We take a hierarchical approach to understanding how USDA data assets appear in these citation databases, working through four levels.

Journal Level – Identifies journals publishing research using USDA datasets. A journal is included if at least one of its publications references the dataset, but this does not indicate overall dataset prevalence within that journal.
Publication Level – Examines individual publications within these journals to assess how often and in what context USDA datasets appear.
Author Level – Tracks authors of these publications, analyzing institutional affiliations and research networks to understand dataset reach.
Institution Level – Maps dataset usage across institutions to identify geographic and organizational research patterns.

This structured approach standardizes dataset mention analysis across databases, allowing for direct comparisons of coverage and research impact.

Case Study: Census of Agriculture

To illustrate the data cleaning and disambiguation process, we use the Census of Agriculture as a case study to systematically compare coverage, overlap, and differences between the three citation databases. The Census of Agriculture (also referred to as “Ag Census”) is widely used in agricultural and economic research, making it an ideal dataset for assessing database differences.

Pre-Processing Steps by Citation Database

Scopus

Journal Coverage

To analyze journal coverage in Scopus, we generate a dataset containing all unique journals that include at least one publication referencing Ag Census data. This dataset is built from an initial publication-level dataset, which captures individual research articles mentioning Ag Census.

We restrict the publication-level dataset to Ag Census mentions and retain the following metadata fields for each record:

Publication identifier (DOI)
Journal name
Publisher
ISSN (International Standard Serial Number, a unique journal identifier)
Dataset alias (alternate names used to reference Ag Census)
Dyad (dataset mention pair)

This data structure follows the format outlined in the data schema (Figure XX).

Crosswalk of Dataset Identifiers between Scopus and OpenAlex

Scopus assigns multiple identifiers to the same dataset depending on how it is reported, rather than a single, standardized identifier. We therefore create a crosswalk between Scopus and OpenAlex so that each dataset has one common identifier.

Link to crosswalk file
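As an illustration of how such a crosswalk might be applied, the sketch below maps source-specific dataset identifiers to a single common identifier. The identifier strings, the `crosswalk` dictionary, and the helper name `to_common_id` are hypothetical; the real crosswalk file may be structured differently.

```python
# Hypothetical crosswalk: several source-side identifiers for the same
# dataset all map to one common identifier.
crosswalk = {
    "scopus-ds-001": "ag_census",
    "scopus-ds-002": "ag_census",
    "openalex-ds-A": "ag_census",
}


def to_common_id(source_id, mapping):
    """Return the common dataset identifier, or None if unmapped."""
    return mapping.get(source_id)


# Hypothetical (DOI, source dataset identifier) mention pairs.
mentions = [
    ("10.0000/example.1", "scopus-ds-001"),
    ("10.0000/example.2", "scopus-ds-002"),
    ("10.0000/example.3", "scopus-ds-999"),  # not in the crosswalk
]

harmonized = [(doi, to_common_id(ds_id, crosswalk)) for doi, ds_id in mentions]

# Mentions with no crosswalk entry are kept aside for manual review
# rather than silently dropped.
unmatched = [doi for doi, common_id in harmonized if common_id is None]
```

Returning `None` for unmapped identifiers, instead of raising an error, makes gaps in the crosswalk easy to list and inspect.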

After assembling the publication-level dataset, the final step in preparing the Scopus journal-level dataset is to aggregate publications at the journal level based on their ISSN.
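The journal-level aggregation can be sketched as below. This is a simplified example under stated assumptions: the field names (`doi`, `journal`, `issn`), the placeholder ISSNs, and the normalization rules (removing the hyphen, uppercasing the check character, lowercasing DOIs) are illustrative choices rather than the documented procedure.

```python
from collections import defaultdict


def normalize_issn(issn):
    """Strip whitespace and the hyphen, and uppercase a trailing 'x'."""
    return issn.strip().replace("-", "").upper()


def aggregate_by_issn(records):
    """Collapse publication records to one row per ISSN.

    Publications are counted by distinct DOI so that a record appearing
    twice does not inflate a journal's total.
    """
    journals = defaultdict(lambda: {"journal": None, "dois": set()})
    for record in records:
        entry = journals[normalize_issn(record["issn"])]
        # Keep the first journal name seen for this ISSN.
        entry["journal"] = entry["journal"] or record["journal"]
        entry["dois"].add(record["doi"].lower())
    return {
        issn: {"journal": e["journal"], "n_publications": len(e["dois"])}
        for issn, e in journals.items()
    }


# Hypothetical publication-level records (placeholder DOIs and ISSNs).
records = [
    {"doi": "10.0000/example.1", "journal": "Journal A", "issn": "1111-1111"},
    {"doi": "10.0000/example.2", "journal": "Journal A", "issn": "11111111"},
    {"doi": "10.0000/EXAMPLE.2", "journal": "Journal A", "issn": "1111-1111"},
    {"doi": "10.0000/example.3", "journal": "Journal B", "issn": "2222-222x"},
]

journal_level = aggregate_by_issn(records)
```

In this toy input, Journal A ends up with two publications despite three rows, because two of the rows share a DOI that differs only in letter case.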

Author Disambiguation

COMING SOON

Institution Disambiguation

COMING SOON

Results from Data Cleaning

This section presents overall statistics for Scopus and OpenAlex. Each subsection reports results for one dataset.

Step 4 produces two publication-level datasets: one of all academic papers indexed in Scopus that use Ag Census data, and an equivalent one for OpenAlex.

There are 4,712 unique publications reported in Scopus and 1,266 in OpenAlex. These data are collapsed into a journal-level dataset based on the International Standard Serial Number (ISSN) that is unique to each academic journal.
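One way to count unique publications per database is to deduplicate on a normalized DOI, as sketched below. The normalization rules and the sample DOIs are assumptions made for illustration; the sketch does not reproduce the counts reported above and may differ from how uniqueness was actually determined.

```python
def normalize_doi(doi):
    """Lowercase a DOI and drop a leading resolver URL or 'doi:' prefix."""
    cleaned = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return cleaned


def count_unique_publications(dois):
    """Count distinct DOIs after normalization, skipping empty values."""
    return len({normalize_doi(d) for d in dois if d and d.strip()})


# Hypothetical DOI lists for each database (placeholder values).
scopus_dois = [
    "10.0000/example.1",
    "https://doi.org/10.0000/EXAMPLE.1",  # same work, different form
    "doi:10.0000/example.2",
]
openalex_dois = ["https://doi.org/10.0000/example.2", ""]

counts = {
    "Scopus": count_unique_publications(scopus_dois),
    "OpenAlex": count_unique_publications(openalex_dois),
}
```

Because DOIs are case-insensitive and are exported in several forms, normalizing before counting avoids treating the same publication as two records.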