Methodology for Comparing Citation Database Coverage of Dataset Usage

This repository contains a set of R scripts that process, harmonize, and compare publication-level datasets from Scopus, OpenAlex (seed corpus and full-text), and Dimensions. The workflow supports analysis of dataset usage across platforms, topic-level comparisons, and source intersections.


Program Order

  1. process_openalex.R
  2. process_scopus_seed_corpus.R
  3. process_dimensions.R
  4. compare_publications.R
  5. construct_treemaps.R
  6. construct_sankey_plots.R
  7. aggregate_topics.R
  8. construct_bubble_charts.R
  9. construct_word_clouds.R
  10. clean_author_names.R
  11. clean_institutional_affil.R
  12. construct_maps.R


Workflow Overview

The main R script is 00_master_script.R. It sequentially sources individual scripts to process input files, generate harmonized datasets, and produce comparative visualizations and tables.
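As a rough illustration of sequential sourcing, the sketch below loops over the numbered script names from this README and sources each one in turn. The function name run_pipeline and its script_dir and env arguments are hypothetical; the actual contents of 00_master_script.R may differ.

```r
# Script names as listed in the Script Descriptions section of this README.
scripts <- c(
  "01_process_openalex.R",
  "02_process_scopus_seed_corpus.R",
  "03_process_dimensions.R",
  "04_compare_publications.R",
  "05_construct_treemaps.R",
  "06_construct_sankey_plots.R",
  "07_aggregate_topics.R",
  "08_construct_bubble_charts.R",
  "09_construct_word_clouds.R",
  "10_clean_author_names.R",
  "11_clean_institutional_affil.R",
  "12_construct_maps.R"
)

# Hypothetical helper: source each script in order, stopping at the first
# missing file so later steps never run on incomplete inputs.
run_pipeline <- function(script_dir, scripts, env = globalenv()) {
  for (s in scripts) {
    path <- file.path(script_dir, s)
    if (!file.exists(path)) stop("Missing script: ", path)
    message("Sourcing ", s)
    source(path, local = env)  # objects created by the script land in env
  }
  invisible(scripts)
}
```

Under these assumptions, calling run_pipeline("path/to/scripts", scripts) would run the whole workflow; passing a subset of scripts would rerun only selected steps, provided their inputs are already in memory.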


Script Descriptions (Execution Order)

Part 1: Cleaning and Deduplicating Data

Step 1: Clean OpenAlex Publications

Script name: 01_process_openalex.R

Processes OpenAlex full-text and seed corpus data and deduplicates records by dataset.

Note: Before running this script, run the flatten_openalex_all_datasets.ipynb Jupyter notebook. The notebook reads the raw JSON files from the OpenAlex Seed Search Corpus and flattens them into structured CSV tables. Refer to the OpenAlex data schema for a full overview of available fields.

Outputs retained:

  • master_openalex
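Deduplicating by dataset can be sketched in base R as follows. The column names doi and dataset_id and the toy records are assumptions made for illustration; the real script may use different keys or matching rules.

```r
# Toy input: the column names (doi, dataset_id) are assumed, not taken
# from the actual OpenAlex tables.
openalex_raw <- data.frame(
  doi = c("10.1/a", "10.1/A ", "10.1/b", "10.1/a"),
  dataset_id = c("d1", "d1", "d1", "d2"),
  stringsAsFactors = FALSE
)

# Normalize the DOI so that case and stray whitespace do not hide duplicates.
openalex_raw$doi_clean <- tolower(trimws(openalex_raw$doi))

# Keep the first record for each (dataset, DOI) pair. The same DOI may still
# appear under several datasets, which is what "by dataset" means here.
is_dup <- duplicated(openalex_raw[, c("dataset_id", "doi_clean")])
master_openalex_demo <- openalex_raw[!is_dup, ]
```

In this toy example the second row is dropped as a duplicate within dataset d1, while the d2 record with the same DOI is kept.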

Step 2: Clean Scopus Publications

Script name: 02_process_scopus_seed_corpus.R

Processes matched Scopus publications using the same seed DOIs. Structures the output to match OpenAlex and supports dataset-level comparisons.

Additional outputs:

  • master_scopus, scopus_pubs
  • scopus_doi, scopus_topic, scopus_author
  • dyad_model_dataset

Step 3: Clean Dimensions Publications

Script name: 03_process_dimensions.R

Processes Dimensions publication metadata, focusing on concept tags, authorship, and dataset mentions.

Additional outputs:

  • master_dimensions, dim_publications
  • dim_authors, dim_affiliations, dim_concepts
  • dataset_reference_table

Part 2: Comparing Publications

Step 1: Build a comparable sample

Script name: 04_compare_publications.R

Keeps articles published from 2017 through 2023. Merges all four datasets and compares coverage and intersections.

Outputs:

  • Unified publication-level dataset
  • Source flags: scopus_yes, oa_yes, dim_yes
  • Harmonized dataset indicators
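A minimal sketch of the year filter and the source flags, using toy data in base R. The flag names scopus_yes, oa_yes, and dim_yes come from this README; the doi and year columns, the integer coding of the flags, and the use of a single OpenAlex table are simplifying assumptions.

```r
# Toy inputs with assumed column names (doi, year).
scopus     <- data.frame(doi = c("10.1/a", "10.1/b"), year = c(2018, 2016))
openalex   <- data.frame(doi = c("10.1/a", "10.1/c"), year = c(2018, 2021))
dimensions <- data.frame(doi = c("10.1/c", "10.1/d"), year = c(2021, 2024))

# Keep only publications from 2017 through 2023.
in_window <- function(df) df[df$year >= 2017 & df$year <= 2023, ]
scopus_w     <- in_window(scopus)
openalex_w   <- in_window(openalex)
dimensions_w <- in_window(dimensions)

# Stack the filtered sources and keep one row per publication.
all_pubs <- unique(rbind(scopus_w, openalex_w, dimensions_w))

# One indicator per source: 1 if the DOI appears in that source, else 0.
all_pubs$scopus_yes <- as.integer(all_pubs$doi %in% scopus_w$doi)
all_pubs$oa_yes     <- as.integer(all_pubs$doi %in% openalex_w$doi)
all_pubs$dim_yes    <- as.integer(all_pubs$doi %in% dimensions_w$doi)
```

With flags like these, coverage by source and pairwise intersections reduce to sums and cross-tabulations of the indicator columns.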

Step 2: Construct treemap visualizations

Script name: 05_construct_treemaps.R

Visualizes the overlap in publication coverage using treemaps across 15 mutually exclusive source combinations.

Output:

  • Treemaps (one per dataset) saved as PNG files
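The count of 15 is consistent with four sources: each publication is either present or absent in each source, and the all-absent case is excluded, so 2^4 - 1 = 15. The sketch below enumerates those combinations. The source labels are placeholders and may not match the names used in 05_construct_treemaps.R.

```r
# Placeholder labels for the four sources (assumed, not from the script).
sources <- c("scopus", "oa_seed", "oa_fulltext", "dimensions")

# Every presence/absence pattern across the four sources.
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(sources)))
names(grid) <- sources

# Drop the pattern in which a publication appears in no source at all.
grid <- grid[rowSums(as.matrix(grid)) > 0, ]

# A readable label for each mutually exclusive combination.
grid$label <- apply(as.matrix(grid[, sources]), 1, function(present) {
  paste(sources[present], collapse = " + ")
})

n_combinations <- nrow(grid)
```

Each publication falls into exactly one row of grid, which is why the treemap areas can be read as a partition of the full sample.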

Part 3: Comparing Journals

Step 1: Construct Sankey plots

Script name: 06_construct_sankey_plots.R

Generates Sankey plots to show source pathways for dataset mentions across citation platforms.

Output:

  • PNG diagrams representing flow across sources

Part 4: Comparing Topics

Step 1: Harmonize topics across sources

Script name: 07_aggregate_topics.R

Harmonizes and aggregates topic metadata across all four sources. Produces a unified dataset used for topic-level summaries.

Outputs:

  • dataset##_topics for each dataset
  • master_topics_df, food_security_flag_terms
  • Count tables for topics by source and overlap group
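One way a term list such as food_security_flag_terms could be applied is simple case-insensitive pattern matching on harmonized topic labels. The terms and topic labels below are invented for illustration and are not the values used in 07_aggregate_topics.R.

```r
# Illustrative terms only; the real food_security_flag_terms is not shown
# in this README.
flag_terms_demo <- c("food security", "hunger", "nutrition")

# Illustrative harmonized topic labels.
topics_demo <- data.frame(
  topic = c("Food Security and Household Welfare",
            "Machine Learning",
            "Child Nutrition Programs"),
  stringsAsFactors = FALSE
)

# Flag a topic if any term appears anywhere in its label.
pattern <- paste(flag_terms_demo, collapse = "|")
topics_demo$food_security_flag <- grepl(pattern, topics_demo$topic,
                                        ignore.case = TRUE)

# Count flagged versus unflagged topics.
flag_counts <- table(topics_demo$food_security_flag)
```

Substring matching is deliberately permissive; a stricter variant could match whole words or exact labels.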

Step 2: Construct bubble charts

Script name: 08_construct_bubble_charts.R

Creates bubble charts of topic coverage by dataset and source. Topics are aggregated using harmonized labels.

Output:

  • PNG files grouped by dataset
  • Aggregated topic flags

Step 3: Construct word clouds

Script name: 09_construct_word_clouds.R

Generates word clouds for top topics flagged as relevant to food security or other themes of interest.

Output:

  • Word cloud PNGs by dataset and topic flag

Part 5: Comparing Authors and Institutions

Step 1: Clean and standardize author names

Script name: 10_clean_author_names.R

Standardizes and deduplicates author names for consistent identification across citation sources.
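As a rough sketch of name standardization, the helper below lowercases names, replaces punctuation with spaces, and collapses repeated whitespace. The function standardize_name is hypothetical, and the actual rules in 10_clean_author_names.R are likely more involved (for example, handling initials or diacritics).

```r
# Hypothetical helper: reduce superficial variation in author name strings.
standardize_name <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", " ", x)   # commas, periods, hyphens become spaces
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)
}

# Three spellings of the same (made-up) author.
authors_demo <- c("Smith, J.", "smith  j", " SMITH, J ")
authors_clean <- unique(standardize_name(authors_demo))
```

After standardization the three variants collapse to a single key, which can then be used to deduplicate or join author records across sources.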


Step 2: Clean institutional affiliations

Script name: 11_clean_institutional_affil.R

Standardizes and assigns ROR identifiers to institutional affiliations, where available.


Step 3: Construct maps

Script name: 12_construct_maps.R

Maps institutional affiliations by dataset and source, using geocoded or ROR-derived location data.


Notes on Replicability

  • All scripts assume that required paths are defined at the top of 00_master_script.R.
  • A keep_list strategy is used in each script to manage memory and retain only needed objects.
  • Output directories for each figure type are defined explicitly before sourcing visualization scripts.
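The keep_list idea can be sketched as removing every object that is not named in a character vector of objects to retain. The helper cleanup_env and the demo objects are hypothetical; the scripts may implement this inline rather than through a function.

```r
# Hypothetical helper: drop everything in env that is not in keep_list.
cleanup_env <- function(keep_list, env = globalenv()) {
  drop <- setdiff(ls(envir = env), keep_list)
  rm(list = drop, envir = env)
  invisible(drop)  # return the removed names for logging
}

# Demonstration in a scratch environment so the global workspace is untouched.
scratch <- new.env()
assign("master_openalex", data.frame(doi = "10.1/a"), envir = scratch)
assign("tmp_raw", 1:10, envir = scratch)

removed <- cleanup_env(keep_list = c("master_openalex"), env = scratch)
remaining <- ls(scratch)
```

Running the cleanup at the end of each script keeps intermediate objects from accumulating as the master script sources later steps.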

For access to the input data, please contact the authors.
To cite this workflow, please use:
Chenarides, L., Bryan, C., & Ladislau, R. (2025). Methodology for comparing citation database coverage of dataset usage.
Available at: https://laurenchenarides.github.io/data_usage_report/report.html
