Scopus Data Dictionary

Published

March 30, 2025

Overview of Scopus Reference Files

This section describes how the CSV files extracted from a SQL Server database are constructed. These files contain interconnected data about publications and datasets, with a focus on how datasets are mentioned within publications. The main goal is to let you analyze the relationships between publications and datasets, particularly those identified by specific search models.

Below is a detailed explanation of the primary tables and how they relate to one another. For a complete list of tables, please refer to the data schema.

File Organization in GitHub Repository


Category                | File Path & Link
Processed IPEDS Dataset | compare_scopus_openalex/resources/IPEDS/IPEDS.csv
Raw IPEDS Data          | compare_scopus_openalex/resources/raw_data_IPEDS/
Data Processing Code    | compare_scopus_openalex/resources/documentation/IPEDSdata.rmd
Data Documentation      | compare_scopus_openalex/resources/documentation/IPEDS_Data.md

Data Dictionary

Download Scopus Source Files

You can download the source files from this link.

publication.csv (Primary table)

  • Description: Central table with metadata about publications.
  • Columns:
    • id: Unique publication identifier
    • title: Publication title
    • doi: Digital Object Identifier
    • year, month: Date of publication
    • citation_count, pub_type: Additional metadata

dataset_alias.csv

  • Description: Lists dataset aliases (alternate names).
  • Columns:
    • alias_id: Unique alias identifier
    • parent_alias_id: alias_id of the dataset's primary alias (a row is itself the primary alias when alias_id = parent_alias_id)
    • alias: Name of the alias
  • How to use:
    • Identify primary aliases by selecting rows where alias_id = parent_alias_id
    • Find all aliases of a dataset by filtering on the parent_alias_id of its primary alias
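
The two "How to use" rules can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's own code. The header row follows the sample data, only the first data row is taken from that sample, and the other two rows (ids 9001 and 9002) are invented so that the toy table contains a primary alias.

```python
import csv
import io

# Toy rows shaped like dataset_alias.csv. Only the first row comes from the
# sample data; the other two are invented so the example has a primary alias.
DATASET_ALIAS_CSV = """id,alias_id,parent_alias_id,alias
1676,87,89,Census of Agriculture
9001,89,89,Hypothetical primary alias
9002,88,89,Hypothetical alternate alias
"""

rows = list(csv.DictReader(io.StringIO(DATASET_ALIAS_CSV)))

# A primary alias is a row whose alias_id equals its own parent_alias_id.
primary_ids = [int(r["alias_id"]) for r in rows
               if r["alias_id"] == r["parent_alias_id"]]

# Group every alias name under the primary alias it belongs to.
aliases_by_parent = {}
for r in rows:
    aliases_by_parent.setdefault(int(r["parent_alias_id"]), []).append(r["alias"])

print(primary_ids)
print(aliases_by_parent[89])
```

With the real file, the in-memory string would be replaced by an opened dataset_alias.csv, assuming its header matches the column names used here.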

dyad.csv

  • Description: Links publications (publication.csv) and dataset aliases (dataset_alias.csv).
  • Columns:
    • id: Unique identifier for each mention (dyad)
    • publication_id: Foreign key to publication.csv
    • alias_id: Foreign key to dataset_alias.csv
    • mention_candidate: Mention text from publication

model.csv

  • Description: Lists methods/models for identifying dataset mentions.
  • Columns:
    • id: Unique model identifier
    • name: Name of the identification model

dyad_model.csv

  • Description: Connects dyads and models used to identify them.
  • Columns:
    • dyad_id: Foreign key to dyad.csv
    • model_id: Foreign key to model.csv
    • score: Confidence or relevance score
  • How to use:
    • Filter mentions by model: select rows by model_id, then join to dyad.csv by matching dyad_id to the id column of dyad.csv
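
As a rough sketch of that join, the snippet below keeps only the dyads identified by a chosen set of models. It uses the standard library and the rows from the sample data; the helper names read_rows and mentions_for_models are hypothetical and chosen for illustration.

```python
import csv
import io

# Rows copied from the sample data for dyad.csv and dyad_model.csv.
DYAD_CSV = """id,publication_id,alias_id,mention_candidate
2569,1211491,87,census of agriculture
2573,1199598,88,usda census of agriculture
"""
DYAD_MODEL_CSV = """id,dyad_id,model_id,score
4928,2569,1,2.0
4929,2569,4,1.0
4930,2569,2,1.0
"""

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def mentions_for_models(dyads, dyad_models, model_ids):
    """Return dyad rows identified by at least one model in model_ids."""
    wanted = {str(m) for m in model_ids}
    # dyad_model.dyad_id refers to dyad.id; a set removes repeated dyads.
    dyad_ids = {dm["dyad_id"] for dm in dyad_models if dm["model_id"] in wanted}
    return [d for d in dyads if d["id"] in dyad_ids]

dyads = read_rows(DYAD_CSV)
dyad_models = read_rows(DYAD_MODEL_CSV)
hits = mentions_for_models(dyads, dyad_models, [1])
for d in hits:
    print(d["id"], d["publication_id"], d["mention_candidate"])
```

Collecting dyad_ids in a set matters because one dyad can be linked to several models (dyad 2569 appears three times in the sample), and a plain join would repeat it.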

Sample Data

publication.csv

id     | title                                                             | doi                 | year | month
321613 | New estimates for CRNA vacancies                                  |                     | 2009 | 4
321614 | Crossing county lines: The impact of crash location and driver’s… | 10.1016/j.aap.2006… | 2006 | 7

dataset_alias.csv

id   | alias_id | parent_alias_id | alias
1676 | 87       | 89              | Census of Agriculture
1673 | 12       | 282             | ARMS Farm Financial and Crop Production Practices

dyad.csv

id   | publication_id | alias_id | mention_candidate
2569 | 1211491        | 87       | census of agriculture
2573 | 1199598        | 88       | usda census of agriculture

model.csv

id | name
1  | string_matching
5  | refmatch

dyad_model.csv

id   | dyad_id | model_id | score
4928 | 2569    | 1        | 2.0
4929 | 2569    | 4        | 1.0
4930 | 2569    | 2        | 1.0

How to Extract Publications for a Specific Dataset

To find all publications associated with a particular dataset, such as the NASS Census of Agriculture, follow these steps:

  1. Identify the Main Alias:

    • Find the alias_id where alias_id equals parent_alias_id for the dataset.
    • For NASS Census of Agriculture, the main alias has alias_id = 89.
  2. Get All Aliases:

    • In dataset_alias.csv, filter rows where parent_alias_id equals 89.
    • This gives you all aliases associated with the NASS Census of Agriculture dataset.
  3. Link Aliases to Publications:

    • In dyad.csv, filter rows where alias_id matches any of the alias_ids obtained in step 2.
    • This will give you publication_ids of publications mentioning any alias of the dataset.
  4. Retrieve Publication Details:

    • Using the publication_ids from step 3, retrieve the corresponding records from publication.csv.
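
The four steps can be chained as in the sketch below. It is an illustration under stated assumptions rather than the project's own code. The function name publications_for_dataset is hypothetical. Rows 1676, 1673 and 2569 come from the sample data, and every row with "Hypothetical" in it is invented so that the toy tables link up.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def publications_for_dataset(aliases, dyads, publications, primary_alias_id):
    """Return publication rows that mention any alias of one dataset."""
    key = str(primary_alias_id)  # step 1: the primary alias is given
    # Step 2: every alias whose parent is the primary alias.
    alias_ids = {a["alias_id"] for a in aliases if a["parent_alias_id"] == key}
    # Step 3: publication ids of dyads that mention any of those aliases.
    pub_ids = {d["publication_id"] for d in dyads if d["alias_id"] in alias_ids}
    # Step 4: the matching publication records.
    return [p for p in publications if p["id"] in pub_ids]

# Toy tables: rows 1676, 1673 and 2569 are sample rows, the rest are invented.
aliases = read_rows("""id,alias_id,parent_alias_id,alias
1676,87,89,Census of Agriculture
9001,89,89,Hypothetical primary alias
1673,12,282,ARMS Farm Financial and Crop Production Practices
""")
dyads = read_rows("""id,publication_id,alias_id,mention_candidate
2569,1211491,87,census of agriculture
9100,5000001,12,hypothetical arms mention
""")
publications = read_rows("""id,title,doi,year,month
1211491,Hypothetical title A,,2010,1
5000001,Hypothetical title B,,2011,2
""")

result = publications_for_dataset(aliases, dyads, publications, 89)
print([p["id"] for p in result])
```

Using sets for the intermediate ids means a publication that mentions several aliases of the same dataset is returned once.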

Filtering Publications by Specific Models

Since we’re interested in mentions identified by the string_matching and refmatch models (models with id 1 and 5), follow these steps:

  1. Filter Dyads by Model:

    • In dyad_model.csv, filter rows where model_id is 1 or 5.
    • This gives you dyad_ids linked to these models.
  2. Get Relevant Dyads:

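    • Perform an inner join with dyad.csv, matching dyad_id to the id column of dyad.csv.
    • This keeps only the dyads identified by the specified models. A dyad linked to more than one of the models appears once per model, so deduplicate on the dyad id.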
  3. Proceed as Before:

    • Continue with the steps in the previous section, using the filtered dyads from step 2 in place of the full dyad.csv when linking aliases to publications.
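
Putting the model filter and the dataset filter together might look like the sketch below. It is a hedged illustration only. The function name publication_ids is hypothetical, the parent of alias 88 is an assumption, and dyad_model row 9003 is invented so that one dyad falls outside the chosen models. The remaining rows come from the sample data.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

# First row from the sample data; the second row's parent is an assumption.
aliases = read_rows("""id,alias_id,parent_alias_id,alias
1676,87,89,Census of Agriculture
9002,88,89,Hypothetical alternate alias
""")
# Both rows from the sample data.
dyads = read_rows("""id,publication_id,alias_id,mention_candidate
2569,1211491,87,census of agriculture
2573,1199598,88,usda census of agriculture
""")
# First three rows from the sample data; the last row is invented.
dyad_models = read_rows("""id,dyad_id,model_id,score
4928,2569,1,2.0
4929,2569,4,1.0
4930,2569,2,1.0
9003,2573,3,1.0
""")

def publication_ids(aliases, dyads, dyad_models, primary_alias_id, model_ids):
    """Publication ids mentioning a dataset, restricted to the given models."""
    wanted = {str(m) for m in model_ids}
    # Steps 1 and 2: dyads identified by at least one wanted model.
    dyad_ids = {dm["dyad_id"] for dm in dyad_models if dm["model_id"] in wanted}
    # Step 3: the dataset's aliases, then the publications that mention them.
    alias_ids = {a["alias_id"] for a in aliases
                 if a["parent_alias_id"] == str(primary_alias_id)}
    return sorted({d["publication_id"] for d in dyads
                   if d["id"] in dyad_ids and d["alias_id"] in alias_ids})

result = publication_ids(aliases, dyads, dyad_models, 89, [1, 5])
print(result)
```

The returned ids can then be looked up in publication.csv to retrieve titles, DOIs and dates.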