Scopus Data Dictionary

Published

March 30, 2025

Overview of Scopus Reference Files

This section describes the construction of the CSV files extracted from a SQL Server database. These files contain interconnected data about publications and datasets, focusing on how datasets are mentioned within publications. The main goal is to let you analyze the relationships between publications and datasets, particularly those identified using specific search models.

Below is a detailed explanation of the primary tables and how they relate to one another. For a complete list of tables, please refer to the data schema.

File Organization in GitHub Repository

PLACEHOLDER

Category	File Path & Link
Processed IPEDS Dataset	`compare_scopus_openalex/resources/IPEDS/IPEDS.csv`
Raw IPEDS Data	`compare_scopus_openalex/resources/raw_data_IPEDS/`
Data Processing Code	`compare_scopus_openalex/resources/documentation/IPEDSdata.rmd`
Data Documentation	`compare_scopus_openalex/resources/documentation/IPEDS_Data.md`

Data Dictionary

Download Scopus Source Files

You can download the source files from this link.

`publication.csv` (Primary table)

Description: Central table with metadata about publications.
Columns:
- id: Unique publication identifier
- title: Publication title
- doi: Digital Object Identifier
- year, month: Date of publication
- citation_count, pub_type: Additional metadata

`dataset_alias.csv`

Description: Lists dataset aliases (alternate names).
Columns:
- alias_id: Unique alias identifier
- parent_alias_id: Identifier of the primary alias for the dataset (a row where alias_id = parent_alias_id is itself the primary alias)
- alias: Name of the alias
How to use:
- Identify primary aliases where alias_id = parent_alias_id
- Find all aliases by filtering on parent_alias_id
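
The two lookups above can be sketched in plain Python. This is a minimal illustration, not the project's own code: the CSV text is an inline stand-in for dataset_alias.csv, the helper names (`load_rows`, `primary_aliases`, `aliases_of`) are hypothetical, and the row for alias_id 89 (including its id value and alias name) is an assumed example row, since the sample data does not show it.

```python
import csv
import io

# Inline stand-in for dataset_alias.csv.
# Rows 1676 and 1673 come from the sample data; row 9001 is a
# HYPOTHETICAL primary-alias row added only for illustration.
ALIAS_CSV = """id,alias_id,parent_alias_id,alias
1676,87,89,Census of Agriculture
9001,89,89,NASS Census of Agriculture
1673,12,282,ARMS Farm Financial and Crop Production Practices
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def primary_aliases(rows):
    """Return rows whose alias_id equals their own parent_alias_id."""
    return [r for r in rows if r["alias_id"] == r["parent_alias_id"]]


def aliases_of(rows, parent_alias_id):
    """Return every row (primary included) that points at the given parent."""
    return [r for r in rows if r["parent_alias_id"] == str(parent_alias_id)]


alias_rows = load_rows(ALIAS_CSV)
print([r["alias"] for r in primary_aliases(alias_rows)])
print(sorted(r["alias_id"] for r in aliases_of(alias_rows, 89)))
```

Values are compared as strings because `csv.DictReader` does not convert types; with the real file, replace `io.StringIO(ALIAS_CSV)` with an opened file handle.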

`dyad.csv`

Description: Links publications (publication.csv) and dataset aliases (dataset_alias.csv).
Columns:
- id: Unique identifier for each mention (dyad)
- publication_id: Foreign key to publication.csv
- alias_id: Foreign key to dataset_alias.csv
- mention_candidate: Mention text from publication

`model.csv`

Description: Lists methods/models for identifying dataset mentions.
Columns:
- id: Unique model identifier
- name: Name of the identification model

`dyad_model.csv`

Description: Connects dyads and models used to identify them.
Columns:
- dyad_id: Foreign key to dyad.csv
- model_id: Foreign key to model.csv
- score: Confidence or relevance score
How to use:
- Filter mentions by joining to dyad.csv (dyad_id in this file matches id in dyad.csv) and filtering by model_id
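
As a rough sketch of that join, the snippet below pairs each dyad_model row for one model with its mention from dyad.csv. The inline CSV text reuses the sample rows shown in this document; the function name `mentions_for_model` is hypothetical, and the real files would be read from disk instead.

```python
import csv
import io

# Inline stand-ins for dyad.csv and dyad_model.csv (rows from the sample data).
DYAD_CSV = """id,publication_id,alias_id,mention_candidate
2569,1211491,87,census of agriculture
2573,1199598,88,usda census of agriculture
"""
DYAD_MODEL_CSV = """id,dyad_id,model_id,score
4928,2569,1,2.0
4929,2569,4,1.0
4930,2569,2,1.0
"""


def mentions_for_model(dyads, dyad_models, model_id):
    """Join dyad_model rows to dyads (dyad_id -> dyad id) for one model."""
    dyad_by_id = {d["id"]: d for d in dyads}
    matches = []
    for dm in dyad_models:
        dyad = dyad_by_id.get(dm["dyad_id"])
        if dyad is not None and dm["model_id"] == str(model_id):
            matches.append(
                (dyad["publication_id"], dyad["mention_candidate"], float(dm["score"]))
            )
    return matches


dyads = list(csv.DictReader(io.StringIO(DYAD_CSV)))
dyad_models = list(csv.DictReader(io.StringIO(DYAD_MODEL_CSV)))
print(mentions_for_model(dyads, dyad_models, 1))
```

Indexing dyads by id first keeps the join to a single pass over dyad_model.csv, which is typically the larger of the two files.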

Sample Data

Sample rows from `publication.csv`:

id	title	doi	year	month
321613	New estimates for CRNA vacancies		2009	4
321614	Crossing county lines: The impact of crash location and driver’s…	10.1016/j.aap.2006…	2006	7

Sample rows from `dataset_alias.csv`:

id	alias_id	parent_alias_id	alias
1676	87	89	Census of Agriculture
1673	12	282	ARMS Farm Financial and Crop Production Practices

Sample rows from `dyad.csv`:

id	publication_id	alias_id	mention_candidate
2569	1211491	87	census of agriculture
2573	1199598	88	usda census of agriculture

Sample rows from `model.csv`:

id	name
1	string_matching
5	refmatch

Sample rows from `dyad_model.csv`:

id	dyad_id	model_id	score
4928	2569	1	2.0
4929	2569	4	1.0
4930	2569	2	1.0

How to Extract Publications for a Specific Dataset

To find all publications associated with a particular dataset, such as the NASS Census of Agriculture, follow these steps:

1. Identify the Main Alias:
- Find the alias_id where alias_id equals parent_alias_id for the dataset.
- For NASS Census of Agriculture, the main alias has alias_id = 89.
2. Get All Aliases:
- In dataset_alias.csv, filter rows where parent_alias_id equals 89.
- This gives you all aliases associated with the NASS Census of Agriculture dataset.
3. Link Aliases to Publications:
- In dyad.csv, filter rows where alias_id matches any of the alias_ids obtained in step 2.
- This will give you publication_ids of publications mentioning any alias of the dataset.
4. Retrieve Publication Details:
- Using the publication_ids from step 3, retrieve the corresponding records from publication.csv.
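
The steps above could be chained roughly as follows. This is a sketch under stated assumptions: the function name `publications_for_dataset` is hypothetical, and the inline data mixes sample rows with made-up ones. In particular, the alias rows for alias_id 88 and 89, the dyad row 9003, and all three publication rows (ids aside, their titles and dates are placeholders) are invented so that the example runs end to end.

```python
import csv
import io


def read_csv_text(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Rows 1676 and 1673 are sample rows; 9001 and 9002 are HYPOTHETICAL.
aliases = read_csv_text("""id,alias_id,parent_alias_id,alias
1676,87,89,Census of Agriculture
9001,88,89,USDA Census of Agriculture
9002,89,89,NASS Census of Agriculture
1673,12,282,ARMS Farm Financial and Crop Production Practices
""")

# Rows 2569 and 2573 are sample rows; 9003 is HYPOTHETICAL.
dyads = read_csv_text("""id,publication_id,alias_id,mention_candidate
2569,1211491,87,census of agriculture
2573,1199598,88,usda census of agriculture
9003,7777777,12,arms
""")

# All publication rows here are HYPOTHETICAL placeholders.
publications = read_csv_text("""id,title,doi,year,month
1211491,Placeholder title A,,2010,1
1199598,Placeholder title B,,2011,2
7777777,Placeholder title C,,2012,3
""")


def publications_for_dataset(primary_alias_id, aliases, dyads, publications):
    """Return publication rows that mention any alias of one dataset."""
    # Step 2: every alias whose parent is the dataset's primary alias.
    alias_ids = {
        a["alias_id"] for a in aliases
        if a["parent_alias_id"] == str(primary_alias_id)
    }
    # Step 3: publication ids of dyads that use one of those aliases.
    publication_ids = {
        d["publication_id"] for d in dyads if d["alias_id"] in alias_ids
    }
    # Step 4: look up the publication records, keeping file order.
    return [p for p in publications if p["id"] in publication_ids]


result = publications_for_dataset(89, aliases, dyads, publications)
print([p["id"] for p in result])
```

Collecting ids into sets means a publication that mentions several aliases of the same dataset is returned only once.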

Filtering Publications by Specific Models

To restrict the results to mentions identified by the string_matching and refmatch models (model ids 1 and 5), follow these steps:

1. Filter Dyads by Model:
- In dyad_model.csv, filter rows where model_id is 1 or 5.
- This gives you dyad_ids linked to these models.
2. Get Relevant Dyads:
- Perform an inner join between these rows and dyad.csv, matching dyad_id to the id column of dyad.csv.
- This filters dyads to only those identified by the specified models.
3. Proceed as Before:
- Continue with the steps in the previous section, using the filtered dyads from step 2 in place of the full dyad.csv.
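
A minimal sketch of the model filter, using only the sample rows shown in this document. The function names `dyad_ids_for_models` and `filter_dyads` are hypothetical. In the sample, only dyad 2569 has a row for model 1 and no dyad has a row for model 5, so a single dyad survives the filter.

```python
import csv
import io

# Inline stand-ins for dyad.csv and dyad_model.csv (rows from the sample data).
dyads = list(csv.DictReader(io.StringIO("""id,publication_id,alias_id,mention_candidate
2569,1211491,87,census of agriculture
2573,1199598,88,usda census of agriculture
""")))
dyad_models = list(csv.DictReader(io.StringIO("""id,dyad_id,model_id,score
4928,2569,1,2.0
4929,2569,4,1.0
4930,2569,2,1.0
""")))


def dyad_ids_for_models(dyad_models, model_ids):
    """Step 1: dyad ids flagged by at least one of the given models."""
    wanted = {str(m) for m in model_ids}
    return {dm["dyad_id"] for dm in dyad_models if dm["model_id"] in wanted}


def filter_dyads(dyads, keep_ids):
    """Step 2: inner join, keeping dyads whose id is in keep_ids."""
    return [d for d in dyads if d["id"] in keep_ids]


keep_ids = dyad_ids_for_models(dyad_models, [1, 5])
filtered_dyads = filter_dyads(dyads, keep_ids)
print([d["publication_id"] for d in filtered_dyads])
```

Because the dyad ids are gathered into a set, a mention matched by both models appears once. The filtered list can then take the place of the full dyad.csv when linking aliases to publications.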
