Scopus Data Dictionary
Overview of Scopus Reference Files
This section describes the CSV files extracted from a SQL Server database. These files contain interconnected data about publications and datasets, focusing on how datasets are mentioned within publications. The main goal is to let you analyze the relationships between publications and datasets, particularly those identified by specific search models.
Below is a detailed explanation of the primary tables and how they relate to one another. For a complete list of tables, please refer to the data schema.
File Organization in GitHub Repository
Category | File Path & Link |
---|---|
Processed IPEDS Dataset | compare_scopus_openalex/resources/IPEDS/IPEDS.csv |
Raw IPEDS Data | compare_scopus_openalex/resources/raw_data_IPEDS/ |
Data Processing Code | compare_scopus_openalex/resources/documentation/IPEDSdata.rmd |
Data Documentation | compare_scopus_openalex/resources/documentation/IPEDS_Data.md |
Data Dictionary
You can download the source files from this link.
publication.csv
(Primary table)
- Description: Central table with metadata about publications.
- Columns:
  - `id`: Unique publication identifier
  - `title`: Publication title
  - `doi`: Digital Object Identifier
  - `year`, `month`: Date of publication
  - `citation_count`, `pub_type`: Additional metadata
dataset_alias.csv
- Description: Lists dataset aliases (alternate names).
- Columns:
  - `alias_id`: Unique alias identifier
  - `parent_alias_id`: Primary alias identifier (if `alias_id = parent_alias_id`, it’s primary)
  - `alias`: Name of the alias
- How to use:
  - Identify primary aliases where `alias_id = parent_alias_id`
  - Find all aliases of a dataset by filtering on `parent_alias_id`
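
The two lookups described for dataset_alias.csv can be sketched with Python's standard `csv` module. This is a minimal illustration, not the project's own code: the inline rows are toy data (only the `alias_id` 87 row mirrors the sample data; the rows for 88 and 89 are assumed), and the helper names `load_rows`, `primary_aliases`, and `aliases_of` are hypothetical.

```python
import csv
import io

# Toy stand-in for dataset_alias.csv. Only the alias_id 87 row mirrors the
# sample data; the other rows are assumed so the example is self-contained.
ALIAS_CSV = """alias_id,parent_alias_id,alias
89,89,NASS Census of Agriculture
87,89,Census of Agriculture
88,89,USDA Census of Agriculture
12,282,ARMS Farm Financial and Crop Production Practices
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row (values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def primary_aliases(rows):
    """Return rows whose alias_id equals parent_alias_id (primary aliases)."""
    return [r for r in rows if r["alias_id"] == r["parent_alias_id"]]


def aliases_of(rows, parent_alias_id):
    """Return every alias row that shares the given parent_alias_id."""
    parent = str(parent_alias_id)
    return [r for r in rows if r["parent_alias_id"] == parent]


rows = load_rows(ALIAS_CSV)
primary_ids = [r["alias_id"] for r in primary_aliases(rows)]
family_ids = sorted(r["alias_id"] for r in aliases_of(rows, 89))
print(primary_ids)
print(family_ids)
```

To run this against the real file, the inline text could be replaced by reading `dataset_alias.csv` with `open(path, newline="")`. Note that `csv.DictReader` returns every value as a string, which is why the identifiers are compared as strings here.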
dyad.csv
- Description: Links publications (`publication.csv`) and dataset aliases (`dataset_alias.csv`).
- Columns:
  - `id`: Unique identifier for each mention (dyad)
  - `publication_id`: Foreign key to `publication.csv`
  - `alias_id`: Foreign key to `dataset_alias.csv`
  - `mention_candidate`: Mention text from the publication
model.csv
- Description: Lists methods/models for identifying dataset mentions.
- Columns:
  - `id`: Unique model identifier
  - `name`: Name of the identification model
dyad_model.csv
- Description: Connects dyads and models used to identify them.
- Columns:
  - `dyad_id`: Foreign key to `dyad.csv`
  - `model_id`: Foreign key to `model.csv`
  - `score`: Confidence or relevance score
- How to use:
  - Filter mentions by joining with `dyad.csv` on `dyad_id` (which matches `id` in `dyad.csv`) and filtering by `model_id`
Sample Data
publication.csv

id | title | doi | year | month |
---|---|---|---|---|
321613 | New estimates for CRNA vacancies | | 2009 | 4 |
321614 | Crossing county lines: The impact of crash location and driver’s… | 10.1016/j.aap.2006… | 2006 | 7 |

dataset_alias.csv

id | alias_id | parent_alias_id | alias |
---|---|---|---|
1676 | 87 | 89 | Census of Agriculture |
1673 | 12 | 282 | ARMS Farm Financial and Crop Production Practices |

dyad.csv

id | publication_id | alias_id | mention_candidate |
---|---|---|---|
2569 | 1211491 | 87 | census of agriculture |
2573 | 1199598 | 88 | usda census of agriculture |

model.csv

id | name |
---|---|
1 | string_matching |
5 | refmatch |

dyad_model.csv

id | dyad_id | model_id | score |
---|---|---|---|
4928 | 2569 | 1 | 2.0 |
4929 | 2569 | 4 | 1.0 |
4930 | 2569 | 2 | 1.0 |
How to Extract Publications for a Specific Dataset
To find all publications associated with a particular dataset, such as the NASS Census of Agriculture, follow these steps:
1. Identify the Main Alias:
   - Find the `alias_id` where `alias_id` equals `parent_alias_id` for the dataset.
   - For NASS Census of Agriculture, the main alias has `alias_id` = 89.
2. Get All Aliases:
   - In `dataset_alias.csv`, filter rows where `parent_alias_id` equals 89.
   - This gives you all aliases associated with the NASS Census of Agriculture dataset.
3. Link Aliases to Publications:
   - In `dyad.csv`, filter rows where `alias_id` matches any of the `alias_id`s obtained in step 2.
   - This gives you the `publication_id`s of publications mentioning any alias of the dataset.
4. Retrieve Publication Details:
   - Using the `publication_id`s from step 3, retrieve the corresponding records from `publication.csv`.
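
The steps in this section can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions: the inline CSV text is toy data rather than a real extract (the publication titles are placeholders and the second dyad row is made up), and the names `read` and `publications_for_dataset` are hypothetical helpers, not part of any published code.

```python
import csv
import io

# Toy stand-ins for the three files (illustrative rows, not real extracts).
ALIASES = """alias_id,parent_alias_id,alias
89,89,NASS Census of Agriculture
87,89,Census of Agriculture
12,282,ARMS Farm Financial and Crop Production Practices
"""
DYADS = """id,publication_id,alias_id,mention_candidate
2569,1211491,87,census of agriculture
9001,321614,12,arms
"""
PUBLICATIONS = """id,title,year
1211491,Placeholder title A,2010
321614,Placeholder title B,2006
"""


def read(text):
    """Parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def publications_for_dataset(parent_alias_id, aliases, dyads, publications):
    """Return publication rows that mention any alias of the given dataset."""
    parent = str(parent_alias_id)  # step 1: the main alias is supplied by the caller
    # Step 2: collect every alias_id that points at the main alias.
    alias_ids = {a["alias_id"] for a in aliases if a["parent_alias_id"] == parent}
    # Step 3: collect publication_ids of dyads that use one of those aliases.
    pub_ids = {d["publication_id"] for d in dyads if d["alias_id"] in alias_ids}
    # Step 4: look up the matching publication records.
    return [p for p in publications if p["id"] in pub_ids]


result = publications_for_dataset(89, read(ALIASES), read(DYADS), read(PUBLICATIONS))
result_ids = [p["id"] for p in result]
print(result_ids)
```

Collecting the identifiers into sets means a publication that mentions the dataset several times, or under several aliases, is returned only once.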
Filtering Publications by Specific Models
Since we’re interested in mentions identified by the `string_matching` and `refmatch` models (models with `id` 1 and 5), follow these steps:
1. Filter Dyads by Model:
   - In `dyad_model.csv`, filter rows where `model_id` is 1 or 5.
   - This gives you the `dyad_id`s linked to these models.
2. Get Relevant Dyads:
   - Perform an inner join with `dyad.csv`, matching `dyad_id` in `dyad_model.csv` to `id` in `dyad.csv`.
   - This filters dyads to only those identified by the specified models.
3. Proceed as Before:
   - Continue with the steps in the previous section, but using the filtered dyads from step 2.
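
The model filter can be sketched as follows, again with the standard `csv` module. The inline rows are adapted from the sample data for illustration only (the last `dyad_model` row is altered so that one dyad is excluded), and `dyads_for_models` is a hypothetical helper name. Instead of a literal inner join, the sketch keeps each matching dyad once, which avoids duplicate rows when both models identify the same mention; that is a design choice of this example, not something the source prescribes.

```python
import csv
import io

# Toy rows adapted from the sample data; the last dyad_model row is altered
# for illustration so that dyad 2573 is matched only by an unwanted model.
DYAD_MODEL = """id,dyad_id,model_id,score
4928,2569,1,2.0
4929,2569,4,1.0
4930,2573,2,1.0
"""
DYADS = """id,publication_id,alias_id,mention_candidate
2569,1211491,87,census of agriculture
2573,1199598,88,usda census of agriculture
"""


def read(text):
    """Parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def dyads_for_models(dyads, dyad_models, model_ids):
    """Keep only dyads identified by at least one of the given models."""
    wanted = {str(m) for m in model_ids}
    # Step 1: dyad_ids linked to the wanted models.
    dyad_ids = {dm["dyad_id"] for dm in dyad_models if dm["model_id"] in wanted}
    # Step 2: match dyad_model.dyad_id against dyad.id.
    return [d for d in dyads if d["id"] in dyad_ids]


filtered = dyads_for_models(read(DYADS), read(DYAD_MODEL), model_ids=(1, 5))
filtered_ids = [d["id"] for d in filtered]
print(filtered_ids)
```

The filtered dyads can then take the place of the full dyad.csv rows when linking aliases to publications.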