Scripted process for retrieving metadata on institution-affiliated research dataset publications

Metadata

  • Version: 3.5.2
  • Released: 2025/06/09 (end of sprint 4)
  • Author(s): Bryan Gee (UT Libraries, University of Texas at Austin; bryan.gee@austin.utexas.edu; ORCID: 0000-0003-4517-3290)
  • Contributor(s): None
  • License: MIT
  • README last updated: 2025/06/09 (end of sprint 4)

Table of Contents

  1. Purpose
  2. Organization & file list
  3. Overview
  4. Important caveats
  5. Re-use
  6. Planned development

Purpose

This repository contains Python code designed to gather and organize metadata from a number of individual research data repository/platform APIs in order to analyze and summarize research dataset publications that are affiliated with at least one researcher from a particular institution. This code is being developed in the specific context of retrieving data for the University of Texas at Austin and is intended eventually to be used in tandem with separate but related work that searches for UT-Austin-affiliated GitHub repositories, in order to build a more comprehensive understanding of how researchers on campus are sharing their research outputs. However, this code has been constructed to be stand-alone and can be adapted for use at other institutions.

Organization & file list

  1. dataset-records-retrieval.py: This is the primary Python script for conducting large-scale records retrieval through the DataCite API. It also includes functionality for using a set of different APIs to try to identify deposits on Figshare that lack affiliation metadata but that can be connected to an article with at least one author from a focal institution.
  2. config-template.json: This is the config file that contains most parameters and stores API keys. It should be populated with personal information as necessary, with affiliation permutations modified if applying this to a different institution, and renamed config.json in order for the scripts to work. For some fields, the UT Austin-specific information is left in as a model of the format that should be entered.
  3. journal-list.json: This file contains the official journal names and ISSNs to be queried as part of one of the possible Figshare workflows (construction of a hypothetical SI DOI and testing its existence). This file contains all PLOS titles as an example but could be expanded to any other journal that uses the model of appending '.s00x' to the article DOI for mediated Figshare deposits.
  4. data-dictionary.csv: This file describes the columns that are contained in each output and accessory-output file. Note that the export of certain files is commented out in the present script, and columns for those files are not defined here since those files are not considered essential (e.g., checkpoint files).
  5. accessory-scripts/dataset-records-retrieval-visualization.py: This file contains the code used to generate visuals for the 2025 RDAP Summit and the CNI 2025 Spring Meeting.
  6. accessory-scripts/plos-osi-search.py: This file contains the code used to retrieve the latest version of the PLOS Open Science Indicators (OSI) dataset, identify articles that list data as having been shared in part or in whole through Supplemental Information (a mediated Figshare deposit for all PLOS titles), retrieve a list of PLOS articles with at least one author from a focal institution, and search for matches to identify PLOS articles co-authored by a university researcher where 'data' were deposited on Figshare through the mediated process. It does the same for affiliated articles that link to NCBI deposits (note that this could be reuse rather than novel generation).
  7. accessory-scripts/datacite-ror-query.py: This file contains a trimmed version of the main workflow and uses the ROR identifier (specified in the config.json file) to search for affiliated datasets instead of one or more affiliation name strings. The only purpose of this script is to quantify the degree to which a ROR-based query will result in an incomplete retrieval due to the lack of widespread adoption of ROR / re-curation of deposits published prior to ROR integration.
  8. accessory-scripts/datacite-figshare-partner-query.py: This file is an adaptation of one of the secondary Figshare workflows that identifies journal-mediated, DataCite-minted deposits without affiliation metadata and attempts to connect them to articles that were (co)authored by a researcher at a focal institution. In the workflow in the main codebase, a publisher (e.g., 'Taylor & Francis') and the resourceTypeGeneral of 'dataset' are specified, with the script set to loop through a list of publishers. In this accessory script, only a single publisher is queried, but the query is broadened to capture any resource type (e.g., 'audiovisual', 'image'); note that this query has to be time-capped to the past few years to keep the scale of the retrieval manageable, as a single publisher could otherwise yield hundreds of thousands of records. The script runs the same cross-matching of DataCite-minted deposits against a list of affiliated articles retrieved through OpenAlex. The purpose of this workflow is to explore whether some objects not labeled as 'dataset' might contain data and whether some objects labeled as 'dataset' might not be data. The use of file formats to attempt to predict the 'data' nature of an object is still very preliminary. It is intended to lay the foundation for a process in the main workflow to more rigorously assess the contents of objects labeled as 'dataset.'
  9. accessory-scripts/datacite-figshare-partner-query_metadata-only.py: This file is used only to retrieve the automated metadata summaries that are included in DataCite API responses (e.g., number of DOIs published per year). This can potentially be useful for rapid summaries when gathering the data would otherwise be an intensive API query process due to the number of records, but it should be used carefully, especially for repositories that may mint DOIs at a very granular scale (e.g., Dataverse installations, Figshare, Zenodo).
  10. accessory-scripts/crossref-query.py: This file conducts a general institution-based query to the Crossref REST API. It is separated from the primary workflow based on the results for UT Austin (hundreds of thousands of results, most of which have nothing to do with UT Austin), which indicate that this does not need to be run as frequently as a DataCite query and could instead have a recently generated output file pulled in to concatenate with the primary workflow's output. This script is configured to export a CSV file with the same fields as the primary workflow output.
  11. accessory-data/20250310-mediated-figshare-metadata-summary.csv: This file contains a manually compiled summary of select metadata for Figshare deposits mediated through publisher partners (filter on 'Publishers'); it is intended to provide insight into possible filter parameters that may permit their programmatic retrieval. This is a static file created on 2025/03/10, and partners/metadata may change in the future (e.g., 'SciELO journals' was listed when I last examined this in October 2024). Briefly, I accessed each publisher's Figshare collection through the web interface and selected 10 random deposits, with preference given to recent deposits. A few listed publishers are not recorded in the CSV file: JACC and SAGE redirect to the publishers' homepages, not a Figshare collection; Human Genome Variation is a database; and IEEE Standards, Medical Affairs Professional Society, Optica Open, and Physiome appeared to contain out-of-scope content (e.g., only preprints in Optica Open). I recorded which indexer (DataCite vs. Crossref) was used to mint the DOI; the listed publisher name (listed_publisher); the client-id and provider-id if minted through DataCite, as these represent queryable fields that indicate a Figshare connection; up to 10 DOIs that were examined; whether the DOIs contain the string 'figshare' (doi_figshare); and how the DOIs were constructed (doi_construction). There are only a few DOI construction models: a default Figshare DOI that is randomly created and assigned with the prefix '10.6084/m9.figshare'; appending .t00x or .s00x to the end of the article DOI, where 'x' is a sequential integer; or a randomly created DOI using the associated journal/publisher's DOI prefix.
  12. accessory-scripts/preprint-dryad-date-comparison.py: This file is narrowly focused, designed only to examine discrepancies in timestamps in Dryad datasets as part of a case study in the forthcoming preprint. It retrieves all records for Dryad from both the DataCite and Dryad APIs, along with all timestamps that are available in each.
  13. accessory-scripts/preprint-reads-dataset-reanalysis.py: This file is narrowly focused, designed only to reanalyze select parts of the RADS dataset (Johnston et al., 2024; PLOS ONE). It retrieves additional metadata for deposits linked to certain repositories and then either (1) retrieves information on linked articles or (2) applies the same deduplication steps as the primary workflow to the RADS data; these functions are not mutually exclusive but are not directly related.

Several scripts will create additional subdirectories as part of their workflow (e.g., 'outputs'); these subdirectories are not provided here.
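
For orientation, the sketch below shows the general pattern by which the scripts consume the config file described above. It is a minimal sketch only: the key names (affiliation_permutations, page_size, api_keys) are illustrative placeholders, not the exact schema of config-template.json.

```python
import json

# Load the runtime configuration; config-template.json must be populated
# and renamed config.json for the scripts to find it.
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

# Illustrative key names only -- consult config-template.json for the real schema
affiliation_permutations = config["affiliation_permutations"]
page_size = config["page_size"]  # numerical API query parameter
api_keys = config["api_keys"]    # e.g., Dataverse and Zenodo tokens
```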

Overview

Primary workflow

The core workflow contained in dataset-records-retrieval.py makes use of four REST APIs in order to conduct a large-scale initial sweep for university-affiliated datasets based on a set of permutations for the institutional name: DataCite; Dataverse; Dryad; and Zenodo. Note that the Dataverse code is configured specifically for the Texas Data Repository (TDR)'s instance. Other APIs have been incorporated into secondary workflows or accessory scripts: Crossref; Figshare; and OpenAlex. Finally, a few APIs have been explored for dataset retrieval but are not currently incorporated here: Mendeley Data; and Open Science Framework (OSF).

The code is designed to maximize the potential retrieval scope of a given API query, specifically as it relates to the fields in which an affiliation may be found (which is not always the 'affiliation' field) and the various permutations for UT Austin specifically (e.g., 'University of Texas at Austin' versus 'University of Texas, Austin'). Even though the three repositories that are integrated into the primary workflow (Dataverse, Dryad, Zenodo) all mint their DOIs through DataCite and should thus be discoverable collectively through the DataCite API, the individual repository APIs were queried as both a cross-validation process and an exploration of whether there might be some important variability in metadata cross-walks; an initial inability to perfectly cross-validate all three repositories' records in the early stages of this code's development facilitated refinement of the workflow and identified edge-case scenarios. An additional benefit of exploring repository-specific APIs is the potential to identify additional metadata that are not cross-walked to DataCite (possibly because they are not supported in the present schema), such as certain controlled vocabularies.

The primary script consists of four major components:

  1. API query construction and calls;
  2. Filtering of the JSON response and conversion to a pandas dataframe;
  3. Cross-validation checks of the responses from individual repositories' API against their equivalent output as retrieved from the DataCite API; and
  4. Concatenation and de-duplication, with the 'original' (specific repository API) source preferred when a dataset was returned by both the repository API and the DataCite API.

The cross-validation step is optional and can be enabled/disabled with a single Boolean variable; if disabled, DataCite will be the exclusive source of retrieved information. De-duplication is necessary regardless of whether cross-validation is implemented or not, primarily due to variable granularity of DOI assignment between repositories. The current process also handles 'double-minting' of DOIs for one deposit, a practice found in some repositories (e.g., Zenodo), and includes a toggle to de-duplicate Dataverse deposits that have the same list of authors and affiliations, the same publication date, and the same rights/licensing. This accounts for the tendency of some users to oversplit materials for one manuscript into multiple DOI-backed datasets, all nested under a non-DOI-backed dataverse (see ticket). This is a highly conservative approach that should avoid accidental removal of linked deposits that were separated intentionally (e.g., data in one deposit, software in another; different authors for different datasets for one study).
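
As a minimal sketch of the first component (query construction and calls), the snippet below pages through the DataCite REST API with cursor-based pagination. The single OR-joined query shown here is a simplification of the actual script, which searches several fields, and the max_pages argument stands in for the 'test' toggle described later in this README.

```python
import requests

DATACITE_API = "https://api.datacite.org/dois"

def retrieve_datacite_records(permutations, page_size=1000, max_pages=None):
    """Sweep DataCite for datasets mentioning any permutation of the
    institutional name; simplified relative to the actual script."""
    query = " OR ".join(f'"{p}"' for p in permutations)
    params = {
        "query": query,
        "resource-type-id": "dataset",  # restrict to the 'Dataset' type
        "page[size]": page_size,
        "page[cursor]": 1,              # cursor pagination for deep result sets
    }
    records, url, pages = [], DATACITE_API, 0
    while url:
        response = requests.get(url, params=params, timeout=60)
        response.raise_for_status()
        payload = response.json()
        records.extend(payload["data"])
        # DataCite supplies the next cursor as a ready-made URL
        url, params = payload.get("links", {}).get("next"), None
        pages += 1
        if max_pages is not None and pages >= max_pages:
            break
    return records
```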

Secondary workflows

Based on the results of early testing with the primary workflow, I am now developing targeted secondary workflows that attempt to fill known gaps (e.g., paucity of Figshare deposits). Version 2.0.0 introduced several secondary workflows aimed at finding Figshare deposits, which largely lack affiliation metadata but which can be discovered through a number of different means, especially when they have been automatically created through a partner journal's manuscript submission process/portal. The dataset-records-retrieval.py script contains three secondary workflows that can be toggled on or off with Boolean variables.

The first workflow (figshareArticleLink) takes Figshare deposits that were recovered with affiliation metadata and 'figshare' listed as the publisher and searches for article metadata in order to identify patterns (e.g., for UT Austin, all such deposits are linked to articles in Springer Nature journals). This workflow is mostly a matter of intellectual curiosity, since it does not increase the number of identified datasets; the second and third workflows represent two different ways to identify additional affiliated Figshare deposits that lack affiliation metadata.

The second workflow (figshareWorkflow1) takes advantage of the fact that for many partner journals, mediated Figshare deposits are listed with the publisher in the 'publisher' metadata field, rather than Figshare. This workflow thus retrieves all datasets with a publisher listing like 'Taylor & Francis' through DataCite, retrieves university-affiliated articles published by that same publisher from OpenAlex (which uses ROR to standardize affiliation metadata), and looks for matches. It should be noted that there is widespread variation in how mediated Figshare deposits are classified in the DataCite resource type schema, so not all objects labeled as 'dataset' are necessarily datasets, and not all objects containing 'data' proper will be labeled as 'dataset' (other options include 'text', 'component', and 'collection').
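
A rough sketch of this retrieval-and-matching logic is below. It assumes title-based matching for simplicity; the publisher filter and the exact join keys used by the actual script may differ.

```python
import requests

def openalex_affiliated_articles(ror_id, mailto="you@example.edu"):
    """Page through OpenAlex works with at least one author at the focal
    institution (identified by ROR), using cursor pagination."""
    params = {
        "filter": f"authorships.institutions.ror:{ror_id}",
        "per-page": 200,
        "cursor": "*",
        "mailto": mailto,  # identifies you to OpenAlex's polite pool
    }
    works = []
    while True:
        payload = requests.get("https://api.openalex.org/works",
                               params=params, timeout=60).json()
        works.extend(payload["results"])
        cursor = payload["meta"].get("next_cursor")
        if not cursor:
            return works
        params["cursor"] = cursor

def match_deposits_to_articles(deposits, works):
    """Pair DataCite deposits with OpenAlex articles on normalized title;
    a stand-in for the actual script's matching criteria."""
    by_title = {w["title"].strip().lower(): w for w in works if w.get("title")}
    return [(d, by_title[d["title"].strip().lower()])
            for d in deposits
            if d.get("title") and d["title"].strip().lower() in by_title]
```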

The third workflow (figshareWorkflow2) takes advantage of a different configuration in certain journals in which mediated Figshare deposits are minted through Crossref with a DOI that appends '.s00x' (or sometimes '.t00x') to the end of the associated article DOI where 'x' is a sequential number. This workflow retrieves all university-affiliated articles from a publisher that is known to do this (e.g., PLOS) with journal-list.json, constructs a hypothetical mediated Figshare DOI by adding '.s001' to the article DOI, and then tests whether that link exists. This is a more time-intensive process and will only identify that there is some sort of Figshare deposit - this may not be classified as a 'dataset' in the metadata or in a conceptual sense.
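
A minimal sketch of the existence test is below. Note that doi.org resolution behavior varies and some publisher sites reject HEAD requests, so the actual script's handling may be more involved.

```python
import requests

def hypothetical_si_doi_exists(article_doi):
    """Append '.s001' to an article DOI and test whether the resulting
    DOI resolves, indicating a journal-mediated Figshare deposit."""
    candidate = f"https://doi.org/{article_doi}.s001"
    response = requests.head(candidate, allow_redirects=True, timeout=30)
    # Unregistered DOIs return 404 from doi.org; anything below 400 resolves
    return response.status_code < 400
```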

There is also a highly specific script, plos-osi-search.py, which retrieves the PLOS Open Science Indicators (OSI) dataset through the Figshare API. This dataset encompasses all PLOS articles (through a certain timeframe) and identifies locations of data sharing. Because PLOS uses the mediated Figshare process, any article listed as having shared data as Supplemental Information has created a mediated Figshare deposit. This workflow (which runs in about 30 seconds) will identify all such articles, retrieve a list of university-affiliated PLOS articles through OpenAlex, and find matches. This script was mostly used as an initial proof of concept for developing the above workarounds for Figshare deposits that can be applied across more publishers, but it can be useful for quickly estimating what proportion of articles have generated a Figshare deposit. Refer to the PLOS OSI methodology documentation for more details on how their dataset was generated; the same caveats about whether a deposit is a 'dataset proper' remain, although they attempted to infer whether SI was data or not.

Version 3.0.0 introduces a fourth Figshare workflow, presently called a 'validator': it takes Figshare deposits that use the standard Figshare DOI construction and loops them through the Figshare API (the numerical string at the end of the DOI is the 'ID' accepted by the API), returning information on the files within each deposit. Some basic rules are then applied to flag deposits that contain software (e.g., R scripts), deposits that comprise only software, and deposits that contain only formats that are less likely to be data (e.g., MS Word). The current logic is rather basic, in part because there are not many unique file formats recognized in UT-affiliated deposits, but more robust criteria could be established (e.g., using the filenames). A reuser at another institution would likely encounter other file formats, as well as formats where the mimetype assignment is incorrect (one accounted for here is the labeling of CSV files as 'text/plain' rather than 'text/csv').
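
The sketch below illustrates the validator's general shape, keyed on file extensions for readability; the actual script reasons over mimetypes (including mislabeled ones), and the Figshare response fields shown here are abbreviated.

```python
import requests

CODE_EXTENSIONS = {".r", ".py", ".m", ".ipynb", ".sh"}
LOW_PROBABILITY_DATA = {".doc", ".docx", ".pdf"}  # formats less likely to be data

def validate_figshare_deposit(figshare_id):
    """Fetch file-level metadata for one deposit (the numeric tail of a
    standard '10.6084/m9.figshare.<ID>' DOI) and apply basic flagging rules."""
    url = f"https://api.figshare.com/v2/articles/{figshare_id}"
    files = requests.get(url, timeout=30).json().get("files", [])
    extensions = {"." + f["name"].rsplit(".", 1)[-1].lower()
                  for f in files if "." in f["name"]}
    return {
        "contains_code": bool(extensions & CODE_EXTENSIONS),
        "only_code": bool(extensions) and extensions <= CODE_EXTENSIONS,
        "low_probability_data_only": bool(extensions)
                                     and extensions <= LOW_PROBABILITY_DATA,
    }
```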

Version 3.0.0 also introduces a new secondary workflow to retrieve NCBI BioProjects. NCBI does not use DOIs or similar digital PIDs, instead issuing collection/accession/project IDs that, while persistent, do not have a persistently resolving URL like an ARK or a handle. There is also no API specifically designed for retrieving metadata records by institution. This workflow therefore uses the Selenium library for Python. Selenium is normally used by developers to test web functionality, but the ability to script a series of interactions with a webpage (e.g., click this dropdown, then select this radio button) allows the retrieval of an XML of query results to be automated and integrated into the larger codebase. This workflow requires a separate installation of a browser-specific WebDriver proxy. The code was developed for Mozilla Firefox and thus uses GeckoDriver, but other browsers have their own proxies (e.g., ChromeDriver); you may need to do some basic path mapping for the installed proxy (e.g., for GeckoDriver). If you use a browser other than Firefox, some modifications to the code will be required - an AI chatbot can probably convert the relevant chunks for you with minimal prompting, since a chatbot facilitated the development of this workflow. You do not need Firefox configured in any particular way or used on any regular basis; as long as it is installed on your computer, you may not need or want to rework the script. Code for other browsers may be added in the future with a toggle to indicate which codeblock to use. This workflow is not any faster than manually querying and downloading the XML - you will see your computer open a browser window, go to the link, and click some buttons as you would manually - but it allows for more complete automation of the process.
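
The skeleton below shows the general Selenium pattern in question. The element IDs and query term are placeholders rather than NCBI's actual page structure, and GeckoDriver must be installed separately (pass its path to Service if it is not on your PATH).

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

service = Service()  # or Service("/path/to/geckodriver") if not on PATH
driver = webdriver.Firefox(service=service)
try:
    # Navigate to a BioProject query; the term here is a placeholder
    driver.get("https://www.ncbi.nlm.nih.gov/bioproject/?term=placeholder")
    # Scripted clicks stand in for the manual steps (placeholder element IDs)
    driver.find_element(By.ID, "send-to-menu").click()
    driver.find_element(By.ID, "download-xml").click()
finally:
    driver.quit()  # the XML lands in the browser's default download folder
```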

Version 3.3.0 introduces an external accessory workflow that makes a general query to the Crossref REST API. For UT Austin, this retrieved an unwieldy number of results (>600,000), most of which were not actually related to UT Austin and even fewer of which were actually datasets (i.e. some UT Austin-affiliated objects were mislabeled as 'dataset' in the Crossref schema). This query is separated out since it is a high-cost, low-reward process that will not be run as routinely as the DataCite query.

The DataCite Citation Corpus was explored but is not presently incorporated because it returned very few results for UT Austin (fewer than 150).

Outputs

Main workflow

Four files will always be saved regardless of whether cross-validation or Figshare workflows are used:

  1. date_datacite-initial-output.csv: This file is exported immediately after subsetting the API response for select fields and has had no additional processing (e.g., deduplication) performed; as a result, it can be quite large if your query retrieves a large number of file-level DOIs or other forms of overgranularized datasets.

  2. date_datacite-output-for-affiliation-source.csv: This file is exported relatively early in the process with fields retrieved from DataCite and is used exclusively to summarize the field in which an institutional affiliation was detected (affiliation_source) and the permutation of the institutional name that was detected (affiliation_permutation). This categorization is hierarchical (see the sketch after this list): the script first looks in the creator.affiliation.name field and, if it finds a focal affiliation, records the source as 'creator.affiliation.' If no affiliation is found, it then checks the contributor.affiliation.name field and, if it finds a focal affiliation, records the source as 'contributor.affiliation.' For remaining entries, it checks creator.name and contributor.name. Whenever a match is found, the entry is removed from further consideration, so a dataset with the affiliation listed in both the creator.affiliation.name and contributor.affiliation.name fields will be listed here as 'creator.affiliation.' There are also additional fields recorded here for metadata assessment that are similarly not carried through to the final output.

  3. date_datacite-output-for-metadata-assessment.csv: A nearly identical file to date_datacite-output-for-affiliation-source.csv; currently, the information unique to this file is a column that converts mimeTypes to standardized, 'friendly' file formats, columns for whether a dataset contains any code format and whether it contains only code formats, and columns recording the calculated number of descriptive and non-descriptive words in the dataset title.

  4. date_full-concatenated-dataframe.csv: This file is the final output of the main workflow. It has applied all filtering and de-duplication steps, trimmed the listed variables to a select few for readability, and has added some categorical variables (e.g., whether a repository is part of GREI).
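
Referring back to the hierarchical categorization described for the affiliation-source file, a simplified sketch of that logic is below. The DataCite field paths are the ones named above, but the record structure is abbreviated and the string matching is looser than the script's.

```python
def classify_affiliation_source(record, permutations):
    """Return the first (field, permutation) match in hierarchical order,
    mirroring the precedence described above."""
    attributes = record["attributes"]

    def affiliation_names(people):
        for person in people:
            for aff in person.get("affiliation", []):
                # treat entries as {'name': ...} objects; plain strings also occur
                yield aff.get("name", "") if isinstance(aff, dict) else aff

    def person_names(people):
        for person in people:
            yield person.get("name", "")

    checks = [
        ("creator.affiliation", affiliation_names(attributes.get("creators", []))),
        ("contributor.affiliation", affiliation_names(attributes.get("contributors", []))),
        ("creator.name", person_names(attributes.get("creators", []))),
        ("contributor.name", person_names(attributes.get("contributors", []))),
    ]
    for source, values in checks:
        for value in values:
            for permutation in permutations:
                if permutation.lower() in value.lower():
                    return source, permutation  # first match wins
    return None, None
```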

If you run the cross-validation process, the script will return a list of datasets from a specific repository that were retrieved from DataCite but not from that repository's API and vice versa (e.g., date_DataCite-into-Dryad_joint-unmatched-dataframe.csv). It will also combine all of the affiliated datasets that were only retrieved from the repository-specific API into one file (date_datacite-additional-cross-validation.csv).

If you run figshareArticleLink, it will output a file with a few fields from Crossref in a CSV (date_figshare-datasets-with-article-info.csv), including the publisher (e.g., 'Springer Nature'), the specific journal, the DOI of the article, a list of the article authors, the title of the article, and the publication date of the article.

If you run figshareWorkflow1, it will output three different files. The first is the result of merging the list of datasets from DataCite with the list of articles from OpenAlex, date_figshare-discovery-all.csv. This file contains a mix of fields from DataCite (e.g., dataset DOI) and OpenAlex (e.g., DOI and title of the article). It is only used to explore whether a 'dataset' metadata characterization is appropriate based on attributes such as file type(s) in the fourth Figshare workflow. The second is the result of a deduplication process that seeks to mitigate the overgranularization of some manuscripts' SI (one DOI for each file), date_figshare-discovery-deduplicated.csv; this is formatted for vertical concatenation with date_full-concatenated-dataframe.csv and is the one that is worked with further downstream. The third file is date_full-concatenated-dataframe-plus-figshare.csv, which appends the newly discovered Figshare records that do not record affiliation metadata but are linked to university-affiliated articles to the date_full-concatenated-dataframe.csv file. The source is listed as 'Datacite+' to account for the nuanced workflow.

If you run figshareWorkflow2, it will output one of two files depending on whether you use OpenAlex or Crossref; the file will look like this: date_indexer-articles-with-hypothetical-deposits.csv. This will return a list of all articles from the queried publisher(s) with a column ('Valid') indicating whether or not a hypothetical DOI was found for an article. There is no assurance that this object is a 'dataset' (in metadata classification or general classification), and there may be other objects of similar or different form that are related to the same article (e.g., .s001 is a supplemental figure and .s002 is a character matrix). For this reason, these records are not appended to the date_full-concatenated-dataframe.csv file as they will require more steps to assess.

If you run the Figshare validator, it will output one file, date_figshare-discovery-all-metadata_combined.csv, which merges the basic dataframe returned from DataCite for these deposits with the filtered dataframe returned from the Figshare API. This is not reconcatenated with the larger dataset of all affiliated datasets.

If you run the NCBI workflow, it will always output the XML download from NCBI (bioproject_results.xml), which is set to overwrite any previous version. The converted dataframe is edited to be as similar as possible to the main dataframe to ensure alignment (e.g., NCBI identifiers are listed in the DOI column). The filtered dataframe that has the same columns as the main dataframe is exported as date_NCBI-select-output-aligned.csv. If you run the second Figshare workflow or pull in a previously generated output that used the second Figshare workflow, the NCBI dataframe will be concatenated with it to produce a date_full-concatenated-dataframe-plus-figshare-ncbi.csv file. If you don't run the second Figshare workflow, the NCBI dataframe will be concatenated with only the main output to produce a date_full-concatenated-dataframe-plus-ncbi.csv file.

Accessory scripts

If you run the accessory-scripts/plos-osi-search.py script, it will return two files. The first is date_PLOS-articles-with-data-in-SI.csv and the second is date_PLOS-articles-with-data-in-NCBI.csv. Both represent the subset of the PLOS OSI dataset that could be linked to affiliated articles of the focal institution and for which data location was indicated to be either in SI or NCBI.

If you run the accessory-scripts/datacite-ror-query.py script, it will return four files. The first is date_datacite-ror-retrieval.csv, which returns all of the ROR-affiliated deposits in DataCite. The second is a filtered version of this file that handles de-duplication in the same fashion as the main workflow: date_datacite-ror-retrieval-filtered.csv. The third and fourth files are the same, but use a single-permutation-affiliation query parameter instead of the ROR identifier.

If you run the accessory-scripts/datacite-figshare-partner-query.py script, it will return four files. The first is date_publisher_figshare-discovery-all.csv, which is the result of the initial DataCite query being cross-referenced with the OpenAlex output and thus returns a list of all Figshare deposits that can be linked to a university-affiliated article. This file includes all deposits when multiple files associated with one article were each split into a separate deposit (e.g., Supplemental Table 1 gets a DOI, Supplemental Table 2 gets a different DOI) and is the one that is looped through the Figshare API. If you want a de-duplicated version that retains only one deposit per article (i.e. how many unique articles have mediated Figshare deposits), that is generated as date_publisher_figshare-discovery-deduplicated.csv. The third file is the first output from the Figshare API and contains a row for each file from the linked deposits with a few metadata fields: date_publisher_figshare-discovery-all_metadata.csv. The final file is date_publisher_figshare-discovery-all_metadata_combined.csv. This merges the original DataCite/OpenAlex dataframe with the extra metadata retrieved from the Figshare API, with several columns added to identify whether a deposit contains any software, contains only software, is labeled 'dataset' but contains only formats with a low probability of being data, or is not labeled 'dataset' but could be a dataset.

If you run the accessory-scripts/crossref-query.py script, it will return two files. date_crossref-institution-objects.csv contains all of the DOIs that are positively identified as being linked to the focal institution. date_crossref-institution-true-datasets.csv is a subset of that file and removes entries for any 'repository' that is not a data repository (e.g., Authorea Preprints).

Important caveats

Object classification

The primary workflow and certain secondary workflows only collect items that are labeled as a 'Dataset' in the DataCite metadata schema (for some repositories, this is the only allowable object type). The same is true of the accessory Crossref script; Crossref explicitly recommends using 'dataset' for non-datasets if another label is not available or more appropriate. It is a given that not all of these meet the criteria for 'data' proper, in part or in whole, and they may include other materials like appendices or software; the present workflow does not attempt to make inferences on the precise nature of content (although this is planned). Conversely, some deposits that do constitute 'data' proper are labeled as another object type (e.g., 'Component,' 'Text'), and these are not presently detected. Retrieving objects through the DataCite API requires downstream processing, as some objects labeled as 'datasets' are either individual files within a DOI-backed deposit (common to Dataverse installations) or versions of the same deposit (Zenodo, which mints a parent DOI and then a separate DOI for each version). The primary script omits individual files that are part of a larger project and restricts the Zenodo deposits to a single record per 'lineage' of deposits.

Distinctiveness of deposits

There are additional considerations related to how researchers organize materials within a single project. In some instances, distinct deposits with separate DOIs may in fact be part of the same project (e.g., associated with a single manuscript), and some analyses might wish to further consolidate these to attempt to capture the number of 'unique projects' with at least one dataset. For example, Dataverse has the relatively unique 'dataverse' object, a non-DOI-backed structure in which other dataverses and DOI-backed datasets can be nested. For this reason, some researchers will separate the materials for a single manuscript along some logical delineation (e.g., by data format; data vs. software) into multiple DOI-backed deposits housed within a single dataverse (example in the Texas Data Repository), whereas if those materials had been deposited in a different repository without an equivalent higher-level structure, they might have been deposited together in one PID-backed deposit.

Consolidation along these lines can be done by deduplicating with a stricter combination of attributes on the assumption that related deposits likely have nearly identical metadata (e.g., publication date, author list); it may also be possible to use relations to other objects, if provided (these are more likely to be exclusively recorded in a repository-specific API). The theoretical consolidation described above for Dataverse could be accomplished with the Dataverse API, since the dataverse in which a dataset is housed can be retrieved, but not through the DataCite API, since information about dataverses does not cross-walk (likely because dataverses do not receive DOIs and an equivalent structure is otherwise rare in other repositories). The present version of the primary workflow attempts to consolidate Dataverse deposits using a combination of variables.

The Zenodo practice of minting two DOIs for an initial release of a deposit was noted previously. Other repositories do the same (e.g., Figshare, ICPSR, Mendeley Data) and need to be de-duplicated in the same fashion. Whether all versions of a single dataset should be counted as separate datasets may vary between institutions, but this workflow treats a 'lineage' of many versions as a single dataset. A final consideration with Figshare deposits (with or without affiliation metadata) is that there is variation in whether a journal-mediated process creates one deposit for all files associated with one manuscript or one deposit for each file. The latter is considered overly granular, since those objects (e.g., two supplemental tables) probably would not be deposited separately in a human-controlled process. The workflow accounts for this as well, using the relatedIdentifier field to consolidate entries; see the sketch below.
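
Both consolidations described in this subsection reduce to grouped deduplication over a pandas dataframe. The sketch below assumes illustrative column names ('parent_doi', 'authors', etc.); the actual script derives lineages from relatedIdentifier relations rather than a ready-made column.

```python
import pandas as pd

def deduplicate_versions(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse double-minted and versioned DOIs to one record per lineage,
    preferring the parent/concept DOI where one is recorded."""
    df = df.copy()
    df["lineage"] = df["parent_doi"].fillna(df["doi"])
    return (df.sort_values("publication_date")
              .drop_duplicates(subset="lineage", keep="first"))

def consolidate_dataverse(df: pd.DataFrame) -> pd.DataFrame:
    """Conservative pass for oversplit Dataverse deposits: rows sharing
    authors, affiliations, publication date, and license are treated as
    one project."""
    return df.drop_duplicates(
        subset=["authors", "affiliations", "publication_date", "license"],
        keep="first",
    )
```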

Use of 'contact' rather than 'creator' for Dataverse API retrieval

How affiliation metadata are cross-walked and exposed determines the scale of what can and cannot be retrieved with this workflow. Specifically for Dataverse, the Search API, which casts a wide net to retrieve many records (efficient, not impacted by rate limiting), does not return affiliation metadata for the 'creator' (author) field, only for the 'contact.' It is possible to get the 'creator' affiliation metadata through the Native API, but this requires passing a list of DOIs of interest, with one request made per DOI (less efficient, can be impacted by rate limiting). The Dataverse component of the cross-validation process thus returns the listed 'contact' affiliation; it is inferred that in the overwhelming majority of cases, the point(s) of contact are probably also listed as authors or, at the very least, are from the same institution (i.e. that 'contact' affiliation is a good proxy for 'creator' affiliation).
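
The trade-off between the two Dataverse APIs looks roughly like the sketch below. The base URL is a placeholder, the Search API's response fields are abbreviated, and the Native API response still requires parsing of its citation block for author affiliations.

```python
import requests

BASE = "https://dataverse.example.edu"  # placeholder; use your installation's URL

def search_datasets(query, api_token, per_page=1000):
    """Broad sweep via the Search API (efficient, but only 'contact'
    affiliations are returned for datasets)."""
    headers = {"X-Dataverse-key": api_token}
    params = {"q": query, "type": "dataset", "per_page": per_page, "start": 0}
    items = []
    while True:
        data = requests.get(f"{BASE}/api/search", params=params,
                            headers=headers, timeout=60).json()["data"]
        items.extend(data["items"])
        params["start"] += per_page
        if params["start"] >= data["total_count"]:
            return items

def native_record(doi, api_token):
    """Per-DOI lookup via the Native API, which exposes 'creator' (author)
    affiliations at the cost of one request per dataset."""
    url = f"{BASE}/api/datasets/:persistentId/?persistentId=doi:{doi}"
    headers = {"X-Dataverse-key": api_token}
    return requests.get(url, headers=headers, timeout=60).json()
```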

Re-use

This script can be freely re-used, re-distributed, and modified in line with the associated MIT license. If a re-user is only seeking to replicate a UT-Austin-specific output or to retrieve an equivalent output for a different institution, the script will require very little modification - essentially, only the affiliation parameters will need to be defined. For other Dataverse-based platforms that have significantly altered the metadata framework, additional edits to the API call and the subsetting of the response may be necessary if you run cross-validation. If additional fields or processing of the output are desired, the script will require more substantive modification and knowledge of the specific structure of a given API response.

Disclaimer

This workflow is, and likely always will be, under continual development. Because of the marked heterogeneity in how datasets are shared (e.g., lack of persistent identifiers; use of identifiers other than DOIs; variation in affiliation metadata), it is practically assured that not all datasets will be captured by this workflow or any other, and that substantial gaps may exist for certain platforms/avenues of data sharing. Reusers should be cognizant of these limitations in determining how data obtained from this workflow may inform decision-making. The creator(s) and contributor(s) of this repository and any entities with which they are affiliated are not responsible for any decisions, policies, or other actions taken on the basis of obtained data.

Config file

API keys and numerical API query parameters (e.g., records per page, page limit) are defined in a config.json file. The file included in this repository called config-template.json should be populated with API keys (see below) and renamed.

Third-party API access

Users will need to create accounts for Dataverse and Zenodo in order to obtain personalized API keys, add those to the config-template.json file, and rename it as config.json. Note that if you wish to query multiple Dataverse installations (e.g., a non-Harvard institutional dataverse and Harvard Dataverse), you will need to create an account and get a separate API key for each installation. Crossref, DataCite, Dryad, Figshare, and OpenAlex do not require API keys for standard access.

Constructing API query parameters

If users need to modify the existing API queries, they should refer to the previously linked API documentation for specific APIs. For targeting a different institution (or set of institutions), users will need to identify a list of possible permutations of the institutional name; the use of ROR identifiers in either the DataCite API or most repositories' specific APIs will fail to retrieve most related deposits, because most repositories have not implemented ROR into their platforms given its relatively recent addition to the DataCite schema (Dryad, an early adopter of ROR, is a notable exception). It may also not be feasible for platforms to retroactively add ROR identifiers to all previously published deposits in an efficient programmatic fashion without potentially introducing errors. The optional cross-validation step can facilitate identification of some permutations if querying an API that does not require an exact string match for retrieval based on affiliation. Another approach is to compile a list of known affiliated deposits within and across different repositories and then to examine their metadata in the DataCite API; testing this on some of my own datasets led to the discovery of a lack of recorded affiliation in Figshare metadata, for example. A third approach would be to survey affiliated scholarly articles, books, and preprints (e.g., through the Crossref API, which does not require an exact string match for affiliation).

If the Dataverse cross-validation step is enabled for a different installation than the Texas Data Repository, the DOI prefix should be modified accordingly. The 'subtree' parameter can probably be removed as well since most Dataverse installations are not multi-institutional (each Texas-based partner has a separate subtree within TDR).

Test environment

A Boolean variable called test, located immediately after the importing of packages, can be used to create a 'test environment.' If this variable is set to True, the script will only retrieve 5 pages from the DataCite API (currently, a full run requires more than 77 pages with a page size of 1,000 records for UT Austin). The number of records retrieved from the three other APIs used here (Dataverse, Dryad, Zenodo) is currently much smaller, so separate test-run page limits are not defined for them (but could be added).

Cross-validation

Similar to the test environment, a Boolean variable called crossValidate (located immediately after test) can be used to toggle the cross-validation component on and off (True will retrieve records from the other APIs and cross-validate against DataCite). A future version of the script will allow for toggling the use of the Dataverse API in the cross-validation process.

  • For UT Austin users: there should be no reason why you need to run the cross-validation process since I have used it to refine the workflow into the present state (e.g., to account for different permutations of the institutional name).
  • For non-UT Austin users: if you are at another institution and want to adapt this workflow, running the cross-validation is recommended to identify alternative/unexpected permutations of the name, especially if you are at an institution that similarly is part of a broader system or that has its name listed in a wide variety of ways. However, you may be able to intuit these on your own or have the fortune of being at an institution with few options (e.g., Stanford University).

Rate limiting

In the present configuration, rate limiting is unlikely to affect the workflows or require modification because queries do not target specific DOIs (i.e. a few requests return many records). However, potential/planned expansion may necessitate targeted single-object retrieval, and users should be aware that many public APIs impose some form of rate limiting (e.g., Dryad; OpenAlex; Zenodo). Note that Zenodo also restricts the total number of records that can be retrieved with one query to 10,000. Dataverse installations may or may not have rate limits; for users attempting to retrieve data from the Texas Data Repository, there are currently no rate limits, although a to-be-determined limit is planned in the near future.
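
If targeted single-object retrieval is added, a generic backoff wrapper along these lines (a sketch, not part of the current scripts) is usually enough to stay within published limits:

```python
import time
import requests

def get_with_backoff(url, params=None, max_retries=5):
    """Retry on HTTP 429, honoring the Retry-After header when present
    and falling back to exponential backoff otherwise."""
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        wait = int(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")
```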

Predicted runtime

Exact runtime will vary with local internet speed and external server traffic. Typically, a run of the primary workflow without cross-validation (only retrieving from DataCite) or any of the secondary Figshare workflows should complete in under 20 minutes for UT Austin or an institution of similar predicted research output. If cross-validation is employed, a run should complete in under 25 minutes. The test environment without cross-validation should complete in about 1 minute; with cross-validation, in about 8-9 minutes.

Adding one of the Figshare workflows can significantly increase the runtime, especially if many publishers are queried; significant variation between institutions in publishing volume with a certain publisher is expected. Typical runtime of the main workflow, the first Figshare workflow (DataCite + OpenAlex), and the NCBI step is around 40-55 minutes at present. The NCBI workflow on its own should complete in about 30-40 seconds.

If a script finishes significantly faster than you expect or have experienced in previous runs, check the number of returned records; sometimes server instability or high traffic leads to incomplete retrieval.

Planned development

This workflow is intended to be continually developed by members of the Research Data Services team at UT Libraries in order to refine the process and expand its capture potential. Product development ideas/plans are listed as 'Issues'. The projected timelines are listed in a linked Project.

The use of OAI-PMH protocols and large "data dumps" (like the 200 GB Crossref public data file) is under consideration for future incorporation, but a secondary objective of this workflow is to employ code and data sources that are both accessible and computationally tractable for a wide range of potential users who may not have access to above-average storage or computing capacities, or even much exposure to code.

Questions / comments

If you have any questions or comments that don't warrant creating an issue, feel free to email me - I'm happy to help with any troubleshooting for re-users. As a note, I plan to write up at least a preprint describing the conceptual basis for the workflow and all of its nuances, which will hopefully provide more robust documentation beyond this README and the in-script annotations.

Contributions

Version 3.2.0 added a CONTRIBUTING.md file that has been temporarily removed after internal discussion about university policies on distributing open-source software. The gist is that there are still some policy details to be sorted out (not just for this project) with our technology transfer office, so we're unable to accept external pull requests at the moment (this does not affect others' ability to fork and modify the repository). We hope to get clarity on this in the near future so that contributions can be opened up. In the interim, if you have ideas for how to improve something (excluding aesthetics), feel free to reach out by email or open an issue.

Version notes (3.6.0)

The current version scheme follows a MAJOR.MINOR.PATCH format, with a 'major' change involving added functionality or significant revisions to the workflow; a 'minor' change involving the addition of accessory files or minor revisions to the workflow (e.g., refactoring); and a 'patch' involving a bug fix. We plan to make formal releases synced with a DOI-backed deposit and will reset the version at that point.

Version 3.6.0 provides the actual fix for the bug related to the refactoring of the primary workflow for the cross-validation process. It also includes other bug fixes and refinements around deduplication and cleaning of Figshare deposits. Additionally, it adds an alternative way to retrieve NCBI deposits through the biopython module and updates the config.json file to account for even more permutations of UT Austin that were identified only in NCBI (e.g., all-lowercase affiliations). The crossref-query accessory script received a minor update to a dataframe export; the datacite-ror-query.py accessory script was overhauled to include 'quick retrieval' code for both a ROR-based DataCite query and a single-affiliation-permutation DataCite query (e.g., 'The University of Texas at Austin'). The dataset-records-retrieval-visualization.py accessory script has been significantly expanded to include new visualizations for the forthcoming preprint. The dryad-date-comparison.py file was renamed to prepend 'preprint'; the script was otherwise not altered. Another script related to the preprint, accessory-scripts/preprint-reads-dataset-reanalysis.py, was also added; this script downloads and reanalyzes certain parts of the RADS dataset.

Version 3.5.1 is a bug fix related to the primary workflow being refactored for the cross-validation process.

Version 3.5.0 is a 'moderate' release. It mainly advances some of the metadata assessment steps, both expanding 'as-is' metadata from a DataCite field and calculating some metadata (e.g., number of descriptive words in dataset title; whether code is present [for files with file formats recorded]) based on other existing fields. Additionally, there is some refactoring in the cross-validation process to better streamline the incorporation and processing of datasets that were only found in a repository API and not the DataCite API. Rather than coerce the variably structured fields of interest from each repository's API output into the same format as the DataCite fields, the code now identifies all previously undetected DOIs and then retrieves the same metadata fields as the general capture through the DataCite API. A new accessory script designed exclusively for the preprint (comparison of all timestamps for Dryad datasets in the DataCite and Dryad APIs) is also added.

Version 3.4.0 is a very minor release. It does a small amount of cleaning in the main workflow, edits the datacite-ror-query.py script to use the same method for extracting publication year from a different field than publicationYear, and adds a data dictionary to explain the columns in the various output files.

Version 3.3.0 is the first release under the new sprint/project management scheme and thus contains a substantial amount of work from the preceding two weeks. A number of notable updates have been made to the primary workflow script. First, it begins staging for metadata assessment (e.g., file size, file formats, rights/licensing); these variables are not included in the final output, but they are returned in the initial API response. Some related functions and code blocks for cleaning this information, as well as a growing map of mimetypes to common-language formats in the config.json file, are included. Second, it adds functionality to handle partial over-granularization of Dataverse datasets where a researcher creates several deposits for the same manuscript; this is enabled/disabled via a toggle. Third, it adds functionality to categorize whether a focal institution is listed for the first author only, the last author only, both, or neither, or whether there is only a single author; the idea is to be able to quickly filter on datasets thought to have a higher probability of having been managed by a researcher at the focal institution (rather than casual authorship). Fourth, functionality is added to create a dynamic variable that can be updated with different resource types for the API call, as well as a parameter in the config file that allows for dynamic insertion of a focal institution's name into filenames (not applicable to all files so far). Fifth, the data visualization script is updated to use 'if' statements to avoid run-time errors if one of the potential data files does not exist and cannot be loaded. Finally, it adds a toggle to concatenate the data generated by a new accessory script that queries the Crossref API, crossref-query.py. One other accessory script is added: datacite-figshare-partner-query_metadata-only.py, used to retrieve the metadata summaries that are included in any DataCite API response; this provides a quick means to obtain certain summaries (e.g., per-year DOI publication volumes for a given repository), although there are many caveats to this coarse approach.

Version 3.2.0 implements some minor accessory functionality in the main workflow (dynamic variable for specifying DataCite resource type(s) to query; expands process to infer role UT researchers played in dataset based on author order; process for handling oversplitting of Dataverse datasets [not file-level DOIs; see ticket]). A new accessory script to retrieve metadata summaries from DataCite queries (vastly more time-efficient for "quick-look" information) is also added. Single vs. double quotation marks were also standardized in the main script and will be standardized in other scripts as they are modified.

Version 3.1.2 fixes a bug in the data visualization script that accidentally removed the path/filename for one plot. It also begins staging increased retrieval of DataCite fields in the primary workflow.

Version 3.1.1 temporarily reworks the retrieval of the publication year from DataCite, switching to a different variable (registered) after discovery of an issue with how Dryad has cross-walked metadata to the previously utilized publicationYear field for many recently published datasets. It also fixes the code that creates several columns for the Figshare and NCBI workarounds so that they align with the remaining output. The accessory visualization script has also been updated with code for a line graph comparing annual publishing volumes of UT-linked deposits in select repositories and a toggle for different export formats (TIFF vs. PNG). This script also incorporates the first implementation of a process to move older versions of files to a nested subdirectory. The quick ROR-based DataCite query script has been updated to direct its outputs to a different folder.

Version 3.0.1 fixes the code that identifies which permutation of an institutional name was detected by the general DataCite query to prioritize an exact match over a partial match. It also fixes a minor bug in the accessory visualization script where an off-orange color was specified for one plot's title.

Version 3.0.0 adds functionality for retrieving file-level metadata for affiliated Figshare deposits and using a basic set of rules to assess whether they are properly characterized as 'datasets,' as well as functionality for programmatic retrieval of affiliated NCBI BioProjects. Extra functionality is added to the config.json file for the second Figshare workflow (a map of publisher names as listed in DataCite and their OpenAlex codes; test-environment page limits for the OpenAlex API), and more permutations of 'UT Austin' were added based on NCBI data (although most of these, such as ones with typos, are not expected to be encountered outside of NCBI). figshare-plos-osi-search.py was renamed to plos-osi-search.py after being expanded to also run a quick check on NCBI-linked articles. Two accessory scripts were added: datacite-ror-query.py and datacite-figshare-partner-query.py.

Version 2.2.0 refactors the primary workflow and makes minor edits to the README.

Version 2.1.0 adds the summary CSV of publisher-mediated Figshare deposits and makes minor edits to the README, including adding a disclaimer about completeness of records.

Version 2.0.0 adds functionality to the primary dataset-records-retrieval.py script to enable workarounds to the lack of affiliation metadata for most Figshare deposits; the script was also slightly refactored in some areas for consolidation. A sample input file for one of the possible Figshare workarounds (journal-list.json) has also been added. This version also adds the two scripts in accessory-scripts. Ancillary files (config-template.json, .gitignore, and this README) have been updated.
