OpenAlex Data Dictionary

Published

March 30, 2025

Overview of OpenAlex Reference Files

This section describes the process of construction CSV files extracted from a SQL Server database. These files contain interconnected data about publications and datasets, specifically focusing on how datasets are mentioned within publications. The main goal is to enable you to analyze the relationships between publications and datasets, particularly those identified using specific search models.

Below is a detailed explanation of primary tables and how they relate to one another. For a complete list of tables, please refer to the data schema.

File Organization in GitHub Repository

PLACEHOLDER

Category File Path & Link
Processed IPEDS Dataset compare_scopus_openalex/resources/IPEDS/IPEDS.csv
Raw IPEDS Data compare_scopus_openalex/resources/raw_data_IPEDS/
Data Processing Code compare_scopus_openalex/resources/documentation/IPEDSdata.rmd
Data Documentation compare_scopus_openalex/resources/documentation/IPEDS_Data.md

Data Dictionary

Download OpenAlex Source Files

You can download the source files from this link.

dataset.csv

  • Description: Lists all datasets identified in OpenAlex. This contains details of all USDA datasets.

publication_dataset_links.csv

  • Description: Connects publications with one or more datasets in the OpenAlex data. Indicates which publications are associated with which datasets.

publication.csv

  • Description: This file contains information about publications, which are the central entities in this dataset.
  • Key Columns:
    • id: Unique identifier for each publication.
    • title: Title of the publication.
    • doi: Digital Object Identifier of the publication.
    • year and month: Publication date.
    • Other metadata such as citation_count, pub_type, etc.

dataset_alias.csv

  • Description: Contains all the aliases (alternative names) of datasets. This helps in identifying datasets that might be referred to by different names in publications.
  • Key Columns:
    • alias_id: Unique identifier for each alias.
    • parent_alias_id: Identifies the main alias for a dataset. If parent_alias_id equals alias_id, it is the primary alias.
    • alias: The alias name of the dataset.

Note: The search in OpenAlex was performed using the same aliases and flag terms applied in the Scopus data, without any optimization.

How to Use:

  • To find the main alias of a dataset, look for rows where alias_id equals parent_alias_id.
  • To find all aliases of a dataset, filter by parent_alias_id corresponding to the main alias.

dyad.csv

  • Description: Represents the mentions of dataset aliases found within publications. Acts as a linking table between publication.csv and dataset_alias.csv.
  • Key Columns:
    • id: Unique identifier for each dyad (mention).
    • publication_id: References the id in publication.csv.
    • alias_id: References the alias_id in dataset_alias.csv.
    • mention_candidate: The actual text mentioning the dataset in the publication.

model.csv

  • Description: Lists the different models or methods used to identify dataset mentions within publications.
  • Key Columns:
    • id: Unique identifier for each model.
    • name: Name of the model (e.g., string_matching, refmatch).

Relevant Models:

  • Model ID 1: string_matching
  • Model ID 5: refmatch

These are the models we are focusing on to compare with data extracted from OpenAlex, as no Kaggle model has been applied there.

dyad_model.csv

  • Description: Connects dyads with the models that identified them. Allows filtering dyads based on the models used.
  • Key Columns:
    • dyad_id: References the id in dyad.csv.
    • model_id: References the id in model.csv.

How to Use:

  • To filter dyads (and thus publications) identified by specific models, perform an inner join with dyad.csv on dyad_id and filter by model_id.
How the Files are Related

The files are structured to represent entities (publications, journals, institutions, authors) and their relationships. The main publication data is in publications_main.csv, and the details about journals, institutions, and authors are in their respective files.

The link files (publication_journal_links.csv, publication_institution_links.csv, publication_author_links.csv) represent the many-to-many relationships between publications and these entities.

To find all publications by a specific author, you can use authors.csv to find the author’s author_openalex_id and then use publication_author_links.csv to find the associated publications.

To analyze the distribution of publications across journals, you can join publications_main.csv, publication_journal_links.csv, and journals.csv on publication_openalex_id and journal_openalex_id.

Sample Data

id title doi year month
321613 New estimates for CRNA vacancies 2009 4
321614 Crossing county lines: The impact of crash location and driver’s… 10.1016/j.aap.2006… 2006 7
id alias_id parent_alias_id alias
1676 87 89 Census of Agriculture
1673 12 282 ARMS Farm Financial and Crop Production Practices
1671 88 89 USDA Census of Agriculture
id publication_id alias_id mention_candidate
2569 1211491 87 census of agriculture
2573 1199598 88 usda census of agriculture
id name
1 string_matching
5 refmatch
id dyad_id model_id score
4928 2569 1 2.0
4929 2569 4 1.0
4930 2569 2 1.0

How to Extract Publications for a Specific Dataset

To find all publications associated with a particular dataset, such as the NASS Census of Agriculture, follow these steps:

  1. Identify the Main Alias:

    • Find the alias_id where alias_id equals parent_alias_id for the dataset.
    • For NASS Census of Agriculture, the main alias has alias_id = 89.
  2. Get All Aliases:

    • In dataset_alias.csv, filter rows where parent_alias_id equals 89.
    • This gives you all aliases associated with the NASS Census of Agriculture dataset.
  3. Link Aliases to Publications:

    • In dyad.csv, filter rows where alias_id matches any of the alias_ids obtained in step 2.
    • This will give you publication_ids of publications mentioning any alias of the dataset.
  4. Retrieve Publication Details:

    • Using the publication_ids from step 3, retrieve the corresponding records from publication.csv.

Filtering Publications by Specific Models

Since we’re interested in mentions identified by the string_matching and refmatch models (models with id 1 and 5), follow these steps:

  1. Filter Dyads by Model:

    • In dyad_model.csv, filter rows where model_id is 1 or 5.
    • This gives you dyad_ids linked to these models.
  2. Get Relevant Dyads:

    • Perform an inner join with dyad.csv on dyad_id.
    • This filters dyads to only those identified by the specified models.
  3. Proceed as Before:

    • Continue with the steps in the previous section, but using the filtered dyads from step 2.