Methodology for Comparing Citation Database Coverage of Dataset Usage
Step 01: Define Data Assets
Define Scope of Data Assets to Be Searched
The data assets featured here consist of those collected by the USDA, primarily from the Economic Research Service (ERS) and the National Agricultural Statistics Service (NASS). These data assets are widely used in agricultural economics and food systems research.
National Agricultural Statistics Service (NASS) Data Assets
In late July 2023, an initial set of 21 reports were received containing a report name and a URL link to database curated by Cornell University. These reports are part of the USDA’s efforts to track data usage across various research applications. However, the names of these reports were highly generic, making it difficult to precisely identify them in citation databases. Examples include reports titled “Agricultural Prices” and “Farm Labor,” which lack specificity when compared to more structured dataset identifiers.
Data Processing and Standardization
To improve identification and searchability, the input was analyzed and transformed into a structured list that included:
- International Standard Serial Numbers (ISSNs): Each of the 21 reports was assigned an ISSN, where available, to provide a standardized identifier.
- Alias Creation: Generic report names were appended with the term report to better distinguish them from other similarly named publications in research literature.
- Expanded Search Terms: Additional variations of dataset names were included to account for different citation styles and possible ways authors reference these reports.
The final dataset classification involved:
- Main Data Asset (Parent Record): The original 21 reports, each representing a distinct dataset.
- Aliases: ISSNs and URLs served as aliases to improve retrieval accuracy.
- Search Term Expansion: Combining report names with different citation formats led to a total of 64 search terms (21 parent records + 43 aliases).
This standardization process improved the efficiency of identifying NASS datasets across publications indexed by citation databases such as Scopus, OpenAlex, and Dimensions.
Economic Research Service (ERS) Data Assets
The process of identifying ERS data assets occurred in two phases: (1) an initial dataset compilation, and (2) a refinement process incorporating feedback from a team of agricultural economists at Colorado State University (CSU). This process was meant to yield a list of data assets was both comprehensive and relevant to the research community tracking USDA dataset usage.
Phase 1: Initial Compilation of ERS Data Assets
In October 2023, an initial list of 2,103 ERS records was compiled. These records included dataset names and, in some cases, associated aliases. The list was then reviewed by Professor Julia Lane, who identified and removed 144 records that were not suitable for machine learning-based dataset tracking.
Reasons for Exclusion
- Records were too generic – Terms such as “Milk, Cotton, and CSV Format of National Data” were too broad to be meaningfully identified in citation databases.
- Records were too specific – Entries such as “Table 15—Agricultural Chemical Input” and “Southeast: 1982-91, 1992-97” were references within broader reports rather than standalone data assets.
After these exclusions, the remaining 1,959 records represented the initial list of ERS data assets.
Phase 2: Refinement with CSU Team
A team of agricultural economists at CSU were consulted to refine the list so that it accurately captured key USDA datasets that may have been overlooked in the initial process. This involved:
- Reviewing dataset usage in prior USDA research – Identifying which datasets were frequently cited.
- Cross-checking with known data users – Ensuring that key datasets used by agricultural economists were included.
- Expanding alias definitions – Recognizing dataset acronyms and alternative naming conventions.
As part of this process, an additional set of assets was incorporated, including datasets that had been previously identified in the Year 1 USDA project. Notably, datasets such as the Census of Agriculture and the Agricultural Resource Management Survey (ARMS) were added, along with key acronyms like FoodAPS. This phase contributed:
- 12 new parent records
- 8 additional alias records
- Total: 20 new search terms
Final Data Asset Identification
Unlike NASS data assets, which had ISSNs and DOIs, ERS datasets were primarily linked through URLs. The final structured dataset included:
- 1,959 parent records (main ERS datasets)
- 1,959 alias records (URLs serving as dataset identifiers)
- 20 additional records from the CSU consultation
- Total: 3,918 search terms
Through this two-phase process, the list of ERS data assets evolved from an initial broad set of records into a refined, structured collection of datasets that could be effectively tracked across citation databases.
Final List of USDA Data Assets
The data assets represent those most frequently used in agricultural economics research, spanning topics from farm management to food security. The final set of data assets, their producing agencies, and descriptions are presented in Table 1.
Dataset Name | Produced By | Description |
---|---|---|
Census of Agriculture | NASS | Conducted every five years, it provides comprehensive data on U.S. farms, ranches, and producers. |
Agricultural Resource Management Survey (ARMS) | ERS | A USDA survey on farm financials, production practices, and resource use. |
Food Acquisition and Purchase Survey (FoodAPS) | ERS | A nationally representative survey tracking U.S. household food purchases and acquisitions. |
Current Population Survey Food Security Supplement (CPS-FSS) | ERS | An annual supplement to the Current Population Survey (CPS) measuring U.S. household food security. |
Food Access Research Atlas (FARA) | ERS | A USDA tool mapping food access based on store locations and socioeconomic data. |
Rural-Urban Continuum Code (RUCC) | ERS | A classification system distinguishing U.S. counties by rural and urban characteristics. |
Household Food Security Survey Module | ERS | A USDA survey module used to assess food insecurity levels in households. |
Local Food Marketing Practices Survey | NASS | A USDA survey on U.S. farms’ local food sales, direct-to-consumer marketing, and supply chains. |
Farm to School Census | FNS | A USDA survey tracking school food procurement and local farm partnerships. |
Quarterly Food at Home Price Database (QFAHPD) | ERS | A database of U.S. retail food prices by product, region, and time. |
Tenure Ownership and Transition of Agricultural Land (TOTAL) | NASS | A survey collecting data on farmland ownership, leasing, and transfer. |
Transition of Agricultural Land Survey | NASS | A component of TOTAL that examines farmland ownership changes and succession plans. |
Information Resources, Inc. (IRI) InfoScan | Circana (formerly IRI) | A commercial scanner dataset tracking retail food and consumer goods purchases. |
To provide a comprehensive reference for dataset tracking, this Appendix includes a detailed list of data assets and their corresponding aliases, collectively referred to as dyads. Each dyad represents a dataset-name and alias pair used in citation database searches, allowing for more precise identification of dataset mentions in research publications. These aliases include acronyms, alternate spellings, dataset variations, and associated URLs, ensuring broad coverage across different citation practices. The dyad list serves as the foundation for dataset extraction and disambiguation across Scopus, OpenAlex, and Dimensions.