Methodology for Comparing Citation Database Coverage of Dataset Usage
This repository contains a set of R scripts that process, harmonize,
and compare publication-level datasets from Scopus, OpenAlex (seed
corpus and full-text), and Dimensions. The workflow supports analysis of
dataset usage across platforms, topic-level comparisons, and source
intersections.
Program Order
process_openalex.R
process_scopus_seed_corpus.R
process_dimensions.R
compare_publications.R
construct_treemaps.R
construct_sankey_plots.R
aggregate_topics.R
construct_bubble_charts.R
construct_word_clouds.R
clean_author_names.R
clean_institutional_affil.R
construct_maps.R
Workflow Overview
The main R script is 00_master_script.R. It sequentially sources
individual scripts to process input files, generate harmonized
datasets, and produce comparative visualizations and tables.
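The sequential sourcing described above could look roughly like the following. This is a minimal sketch, not the contents of 00_master_script.R: the function name run_pipeline and the script_dir argument are assumptions made for illustration, and the script names are taken from the descriptions below.

```r
# Hypothetical sketch of a master script. `run_pipeline` and `script_dir`
# are illustrative names, not taken from the repository.
scripts <- c(
  "01_process_openalex.R",
  "02_process_scopus_seed_corpus.R",
  "03_process_dimensions.R",
  "04_compare_publications.R",
  "05_construct_treemaps.R",
  "06_construct_sankey_plots.R",
  "07_aggregate_topics.R",
  "08_construct_bubble_charts.R",
  "09_construct_word_clouds.R",
  "10_clean_author_names.R",
  "11_clean_institutional_affil.R",
  "12_construct_maps.R"
)

run_pipeline <- function(script_dir, scripts, envir = globalenv()) {
  paths <- file.path(script_dir, scripts)

  # Fail early if any script is missing, before any processing starts
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing scripts: ", paste(basename(missing), collapse = ", "))
  }

  # Source each script in order; later scripts see objects created earlier
  for (p in paths) {
    message("Sourcing ", basename(p))
    source(p, local = envir)
  }
  invisible(paths)
}
```

Checking for missing files up front avoids a failure partway through a long run. Because every script is sourced into the same environment, objects created by one step are available to the next.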
Script Descriptions (Execution Order)
Part 1: Cleaning and Deduplicating Data
Step 1: Clean OpenAlex Publications
Script name: 01_process_openalex.R
Processes OpenAlex full-text and seed corpus data and deduplicates
records by dataset.
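As an illustration of deduplicating records by dataset, the sketch below keeps one record per dataset and DOI pair in base R. The column names (dataset, doi, title) and the object names are assumptions; the actual script may use different keys or rules.

```r
# Toy records; the column names `dataset`, `doi`, and `title` are assumptions.
openalex_raw <- data.frame(
  dataset = c("A", "A", "A", "B"),
  doi     = c("10.1000/X1", "10.1000/x1 ", "10.1000/y2", "10.1000/x1"),
  title   = c("Paper one", "Paper one", "Paper two", "Paper one"),
  stringsAsFactors = FALSE
)

# Normalize DOIs so that case and stray whitespace do not hide duplicates
openalex_raw$doi <- tolower(trimws(openalex_raw$doi))

# Keep the first record for each dataset/DOI pair; the same DOI may still
# appear under a different dataset
is_dup <- duplicated(openalex_raw[, c("dataset", "doi")])
openalex_dedup <- openalex_raw[!is_dup, ]
```

Deduplicating within a dataset rather than globally preserves the case where one publication mentions several datasets.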
Note: Prior to running this script, you must run the
flatten_openalex_all_datasets.ipynb Jupyter Notebook. This notebook
processes the raw JSON files from the OpenAlex Seed Search Corpus and
flattens them into structured CSV tables. Refer to the OpenAlex data
schema for a full overview of available fields.
Outputs retained:
- master_openalex
Step 2: Clean Scopus Publications
Script name: 02_process_scopus_seed_corpus.R
Processes matched Scopus publications using the same seed DOIs.
Structures the output to match OpenAlex and supports dataset-level
comparisons.
Additional outputs:
- master_scopus, scopus_pubs
- scopus_doi, scopus_topic, scopus_author
- dyad_model_dataset
Step 3: Clean Dimensions Publications
Script name: 03_process_dimensions.R
Processes Dimensions publication metadata, focusing on concept tags,
authorship, and dataset mentions.
Additional outputs:
- master_dimensions, dim_publications
- dim_authors, dim_affiliations, dim_concepts
- dataset_reference_table
Part 2: Comparing Publications
Step 1: Build a comparable sample
Script name: 04_compare_publications.R
Filters for articles published between 2017 and 2023. Merges the four
sources (Scopus, OpenAlex seed corpus, OpenAlex full-text, and
Dimensions) and compares coverage and intersections.
Outputs:
- Unified publication-level dataset
- Source flags: scopus_yes, oa_yes, dim_yes
- Harmonized dataset indicators
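The filtering and flagging steps above can be sketched as follows. This is an illustration only: it uses three toy tables for brevity, assumes the year window is inclusive, and assumes columns named doi and year. Only the flag names scopus_yes, oa_yes, and dim_yes come from the list above.

```r
# Toy per-source tables; the column names `doi` and `year` are assumptions.
scopus     <- data.frame(doi = c("d1", "d2", "d3"), year = c(2016, 2018, 2020))
openalex   <- data.frame(doi = c("d2", "d4"),       year = c(2018, 2023))
dimensions <- data.frame(doi = c("d3", "d4", "d5"), year = c(2020, 2023, 2024))

# Keep publications from 2017 through 2023 (inclusive bounds are an assumption)
in_window <- function(df) df[df$year >= 2017 & df$year <= 2023, ]
scopus     <- in_window(scopus)
openalex   <- in_window(openalex)
dimensions <- in_window(dimensions)

# One row per DOI found in any source, with a 0/1 flag per source
all_dois <- sort(unique(c(scopus$doi, openalex$doi, dimensions$doi)))
pubs <- data.frame(
  doi        = all_dois,
  scopus_yes = as.integer(all_dois %in% scopus$doi),
  oa_yes     = as.integer(all_dois %in% openalex$doi),
  dim_yes    = as.integer(all_dois %in% dimensions$doi)
)
```

Storing membership as one flag per source keeps the unified table at one row per publication, so intersections reduce to simple conditions on the flags.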
Step 2: Construct treemap visualizations
Script name: 05_construct_treemaps.R
Visualizes the overlap in publication coverage using treemaps across
15 mutually exclusive source combinations.
Output:
- Treemaps (one per dataset) saved as PNG files
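One way to read the count of 15: with four sources, each publication is either present or absent in each source, which gives 2^4 = 16 patterns, and dropping the pattern where it appears in no source leaves 15 mutually exclusive combinations. The sketch below enumerates them; the source labels are hypothetical and are not the column names used in the scripts.

```r
# Hypothetical source labels used only for this illustration
sources <- c("scopus", "oa_seed", "oa_fulltext", "dimensions")

# All present/absent patterns across the four sources (2^4 = 16 rows)
grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(sources)))
names(grid) <- sources

# Drop the pattern in which a publication appears in no source at all
grid <- grid[rowSums(grid) > 0, ]

# A readable label per combination, e.g. "scopus + dimensions"
combo_label <- apply(grid, 1, function(r) paste(sources[r], collapse = " + "))

nrow(grid)  # 15
```

Because each publication matches exactly one pattern, the treemap tile areas add up to the total number of publications for a dataset.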
Part 3: Comparing Journals
Step 1: Construct Sankey plots
Script name: 06_construct_sankey_plots.R
Generates Sankey plots to show source pathways for dataset mentions
across citation platforms.
Output:
- PNG diagrams representing flow across sources
Part 4: Comparing Topics
Step 1: Harmonize topics across sources
Script name: 07_aggregate_topics.R
Harmonizes and aggregates topic metadata across all four sources.
Produces a unified dataset used for topic-level summaries.
Outputs:
- dataset##_topics for each dataset
- master_topics_df, food_security_flag_terms
- Count tables for topics by source and overlap group
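A minimal sketch of topic harmonization and counting in base R follows. The normalization rule (lowercasing and collapsing whitespace), the flag terms, and all object and column names are assumptions made for illustration; the actual script may harmonize topics differently.

```r
# Toy topic records; column names are assumptions.
topics <- data.frame(
  source = c("scopus", "openalex", "dimensions", "openalex"),
  topic  = c("Food Security ", "food security", "Crop yields", "FOOD  insecurity"),
  stringsAsFactors = FALSE
)

# Harmonize labels: lowercase, trim, and collapse repeated whitespace
harmonize_label <- function(x) {
  x <- tolower(trimws(x))
  gsub("\\s+", " ", x)
}
topics$topic_harmonized <- harmonize_label(topics$topic)

# Hypothetical flag terms; the real list is not shown in this README
flag_terms <- c("food security", "food insecurity")
topics$food_security_flag <- topics$topic_harmonized %in% flag_terms

# Count harmonized topics by source, dropping empty cells
topic_counts <- as.data.frame(
  table(source = topics$source, topic = topics$topic_harmonized),
  stringsAsFactors = FALSE
)
topic_counts <- topic_counts[topic_counts$Freq > 0, ]
```

Harmonizing before counting means that spelling variants of the same topic from different sources fall into the same cell of the count table.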
Step 2: Construct bubble charts
Script name: 08_construct_bubble_charts.R
Creates bubble charts of topic coverage by dataset and source. Topics
are aggregated using harmonized labels.
Output:
- PNG files grouped by dataset
- Aggregated topic flags
Step 3: Construct word clouds
Script name: 09_construct_word_clouds.R
Generates word clouds for top topics flagged as relevant to food
security or other themes of interest.
Output:
- Word cloud PNGs by dataset and topic flag
Part 5: Comparing Authors and Institutions
Step 1: Clean and standardize author names
Script name: 10_clean_author_names.R
Standardizes and deduplicates author names for consistent
identification across citation sources.
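To illustrate what standardizing author names can involve, the sketch below normalizes case, punctuation, and whitespace so that variants of the same name collapse to one key. The function name clean_author_name and the rules themselves are assumptions; the actual script may apply more careful matching.

```r
# Hypothetical helper: lowercase, replace punctuation with spaces,
# collapse repeated whitespace, and trim
clean_author_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Toy author strings as they might appear in different sources
authors <- data.frame(
  source = c("scopus", "openalex", "dimensions"),
  author = c("Smith, J.", "smith  j", "SMITH, J"),
  stringsAsFactors = FALSE
)
authors$author_clean <- clean_author_name(authors$author)

# All three variants reduce to the same key
unique_authors <- unique(authors$author_clean)
```

A rule this simple will merge distinct people who share a surname and initial, so in practice it would usually be combined with identifiers or affiliations.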
Step 2: Clean institutional affiliations
Script name: 11_clean_institutional_affil.R
Standardizes institutional affiliations and assigns ROR identifiers
where available.
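Assigning identifiers where available amounts to a left join against a lookup table, leaving unmatched affiliations as missing. The sketch below shows that pattern with base R. The institution names, the lookup table, and the placeholder identifier are all made up for illustration and are not real ROR records.

```r
# Toy affiliation strings; names are invented for illustration
affiliations <- data.frame(
  affil_raw = c("Example University", "example university ", "Unknown Institute"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table; the identifier is a placeholder, not a real ROR ID
ror_lookup <- data.frame(
  affil_key = "example university",
  ror_id    = "https://ror.org/placeholder01",
  stringsAsFactors = FALSE
)

# Standardize the join key, then keep every affiliation (all.x = TRUE);
# rows without a match get NA in `ror_id`
affiliations$affil_key <- tolower(trimws(affiliations$affil_raw))
matched <- merge(affiliations, ror_lookup, by = "affil_key", all.x = TRUE)
```

Keeping unmatched rows with a missing identifier makes it possible to report how many affiliations could not be resolved.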
Step 3: Construct maps
Script name: 12_construct_maps.R
Maps institutional affiliations by dataset and source, using geocoded
or ROR-derived location data.
Notes on Replicability
- All scripts assume that required paths are defined at the top of
00_master_script.R.
- A keep_list strategy is used in each script to manage memory and
retain only needed objects.
- Output directories for each figure type are defined explicitly
before sourcing visualization scripts.
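The keep_list idea can be sketched as follows: name the objects worth retaining, then remove everything else. The example runs inside a separate environment so that it is safe to try; the temporary object names are hypothetical, and the exact form used in the scripts may differ.

```r
# Work in a separate environment so the demo does not touch the workspace
work_env <- new.env()
assign("master_openalex", data.frame(id = 1:3), envir = work_env)
assign("tmp_raw", 1:10, envir = work_env)       # hypothetical intermediate
assign("tmp_lookup", letters, envir = work_env) # hypothetical intermediate

# Names of the objects to retain after this step
keep_list <- c("master_openalex")

# Remove everything that is not on the keep list
drop <- setdiff(ls(work_env), keep_list)
rm(list = drop, envir = work_env)

ls(work_env)
```

At the top level of a script the equivalent call would be rm(list = setdiff(ls(), keep_list)). Note that this form also removes keep_list itself unless its own name is included in the vector.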
For access to the input data, please contact the authors.
To cite this workflow, please use:
Chenarides, L., Bryan, C., & Ladislau, R. (2025). Methodology for
comparing citation database coverage of dataset usage.
Available at: https://laurenchenarides.github.io/data_usage_report/report.html