OpenAlex Data Dictionary
Overview of OpenAlex Reference Files
This section describes the process of construction CSV files extracted from a SQL Server database. These files contain interconnected data about publications and datasets, specifically focusing on how datasets are mentioned within publications. The main goal is to enable you to analyze the relationships between publications and datasets, particularly those identified using specific search models.
Below is a detailed explanation of primary tables and how they relate to one another. For a complete list of tables, please refer to the data schema.
File Organization in GitHub Repository
PLACEHOLDER
Category | File Path & Link |
---|---|
Processed IPEDS Dataset | compare_scopus_openalex/resources/IPEDS/IPEDS.csv |
Raw IPEDS Data | compare_scopus_openalex/resources/raw_data_IPEDS/ |
Data Processing Code | compare_scopus_openalex/resources/documentation/IPEDSdata.rmd |
Data Documentation | compare_scopus_openalex/resources/documentation/IPEDS_Data.md |
Data Dictionary
You can download the source files from this link.
dataset.csv
- Description: Lists all datasets identified in OpenAlex. This contains details of all USDA datasets.
publication_dataset_links.csv
- Description: Connects publications with one or more datasets in the OpenAlex data. Indicates which publications are associated with which datasets.
publication.csv
- Description: This file contains information about publications, which are the central entities in this dataset.
- Key Columns:
id
: Unique identifier for each publication.title
: Title of the publication.doi
: Digital Object Identifier of the publication.year
andmonth
: Publication date.- Other metadata such as
citation_count
,pub_type
, etc.
dataset_alias.csv
- Description: Contains all the aliases (alternative names) of datasets. This helps in identifying datasets that might be referred to by different names in publications.
- Key Columns:
alias_id
: Unique identifier for each alias.parent_alias_id
: Identifies the main alias for a dataset. Ifparent_alias_id
equalsalias_id
, it is the primary alias.alias
: The alias name of the dataset.
Note: The search in OpenAlex was performed using the same aliases and flag terms applied in the Scopus data, without any optimization.
How to Use:
- To find the main alias of a dataset, look for rows where
alias_id
equalsparent_alias_id
. - To find all aliases of a dataset, filter by
parent_alias_id
corresponding to the main alias.
dyad.csv
- Description: Represents the mentions of dataset aliases found within publications. Acts as a linking table between
publication.csv
anddataset_alias.csv
. - Key Columns:
id
: Unique identifier for each dyad (mention).publication_id
: References theid
inpublication.csv
.alias_id
: References thealias_id
indataset_alias.csv
.mention_candidate
: The actual text mentioning the dataset in the publication.
model.csv
- Description: Lists the different models or methods used to identify dataset mentions within publications.
- Key Columns:
id
: Unique identifier for each model.name
: Name of the model (e.g.,string_matching
,refmatch
).
Relevant Models:
- Model ID 1:
string_matching
- Model ID 5:
refmatch
These are the models we are focusing on to compare with data extracted from OpenAlex, as no Kaggle model has been applied there.
dyad_model.csv
- Description: Connects dyads with the models that identified them. Allows filtering dyads based on the models used.
- Key Columns:
dyad_id
: References theid
indyad.csv
.model_id
: References theid
inmodel.csv
.
How to Use:
- To filter dyads (and thus publications) identified by specific models, perform an inner join with
dyad.csv
ondyad_id
and filter bymodel_id
.
The files are structured to represent entities (publications, journals, institutions, authors) and their relationships. The main publication data is in publications_main.csv
, and the details about journals, institutions, and authors are in their respective files.
The link files (publication_journal_links.csv
, publication_institution_links.csv
, publication_author_links.csv
) represent the many-to-many relationships between publications and these entities.
To find all publications by a specific author, you can use authors.csv
to find the author’s author_openalex_id
and then use publication_author_links.csv
to find the associated publications.
To analyze the distribution of publications across journals, you can join publications_main.csv
, publication_journal_links.csv
, and journals.csv
on publication_openalex_id
and journal_openalex_id
.
Sample Data
id | title | doi | year | month |
---|---|---|---|---|
321613 | New estimates for CRNA vacancies | 2009 | 4 | |
321614 | Crossing county lines: The impact of crash location and driver’s… | 10.1016/j.aap.2006… | 2006 | 7 |
id | alias_id | parent_alias_id | alias |
---|---|---|---|
1676 | 87 | 89 | Census of Agriculture |
1673 | 12 | 282 | ARMS Farm Financial and Crop Production Practices |
1671 | 88 | 89 | USDA Census of Agriculture |
id | publication_id | alias_id | mention_candidate |
---|---|---|---|
2569 | 1211491 | 87 | census of agriculture |
2573 | 1199598 | 88 | usda census of agriculture |
id | name |
---|---|
1 | string_matching |
5 | refmatch |
id | dyad_id | model_id | score |
---|---|---|---|
4928 | 2569 | 1 | 2.0 |
4929 | 2569 | 4 | 1.0 |
4930 | 2569 | 2 | 1.0 |
How to Extract Publications for a Specific Dataset
To find all publications associated with a particular dataset, such as the NASS Census of Agriculture, follow these steps:
Identify the Main Alias:
- Find the
alias_id
wherealias_id
equalsparent_alias_id
for the dataset. - For NASS Census of Agriculture, the main alias has
alias_id
= 89.
- Find the
Get All Aliases:
- In
dataset_alias.csv
, filter rows whereparent_alias_id
equals 89. - This gives you all aliases associated with the NASS Census of Agriculture dataset.
- In
Link Aliases to Publications:
- In
dyad.csv
, filter rows wherealias_id
matches any of thealias_id
s obtained in step 2. - This will give you
publication_id
s of publications mentioning any alias of the dataset.
- In
Retrieve Publication Details:
- Using the
publication_id
s from step 3, retrieve the corresponding records frompublication.csv
.
- Using the
Filtering Publications by Specific Models
Since we’re interested in mentions identified by the string_matching
and refmatch
models (models with id
1 and 5), follow these steps:
Filter Dyads by Model:
- In
dyad_model.csv
, filter rows wheremodel_id
is 1 or 5. - This gives you
dyad_id
s linked to these models.
- In
Get Relevant Dyads:
- Perform an inner join with
dyad.csv
ondyad_id
. - This filters dyads to only those identified by the specified models.
- Perform an inner join with
Proceed as Before:
- Continue with the steps in the previous section, but using the filtered dyads from step 2.