Cluster Analysis of Open Research Data: A Case for Replication Metadata

Trisovic, Ana

doi:10.2218/ijdc.v17i1.833

Journal Article

FZJ-2023-04211

Trisovic, A.

2023
Digital Curation Centre Bath

International Journal of Digital Curation 17(1), 13 - (2023) [10.2218/ijdc.v17i1.833]

Please use a persistent id in citations: doi:10.2218/ijdc.v17i1.833 doi:10.34734/FZJ-2023-04211

Abstract: Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets coming from numerous disciplines, from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR. Aiming to contribute to the discussions on the development of metadata for research outputs, I conduct an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together. The content of over 40,000 datasets from the Harvard Dataverse research data repository is used as a sample for the cluster analysis. I find that the majority of the clusters are formed by single-type datasets, while in the rest of the sample no meaningful clusters can be identified. For the result interpretation, I use the metadata standard employed by DataCite, a leading organization for documenting a scholarly record, and map existing resource types to my results. About 65% of the sample can be described with a single-type metadata (such as Dataset, Software or Report), while the rest would require aggregate metadata types. Though DataCite supports an aggregate type such as a Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20% of the sample), would be more accurately described as a Replication resource metadata type. Such resource type would be particularly useful in facilitating research reproducibility.
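
The mapping from a deposit's file contents to a resource type, as described in the abstract, might be sketched as follows. This is a rule-based illustration only and not the cluster analysis the article performs. The extension table and the function names are assumptions made for this sketch, and "Replication" is the resource type the article proposes, not one that DataCite currently defines.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-category table (an assumption for illustration;
# the article's actual file categories are not listed in this record).
CATEGORY_BY_EXTENSION = {
    ".csv": "data", ".tab": "data", ".dta": "data", ".xlsx": "data",
    ".r": "code", ".py": "code", ".do": "code",
    ".pdf": "document", ".docx": "document", ".txt": "document",
}


def file_categories(filenames):
    """Count how many files in a deposit fall into each coarse category."""
    counts = Counter()
    for name in filenames:
        ext = PurePosixPath(name).suffix.lower()
        counts[CATEGORY_BY_EXTENSION.get(ext, "other")] += 1
    return counts


def suggest_resource_type(filenames):
    """Suggest a resource type label for a deposit from its file categories."""
    categories = set(file_categories(filenames))
    # Single-type deposits map onto existing DataCite resource types.
    if categories == {"data"}:
        return "Dataset"
    if categories == {"code"}:
        return "Software"
    if categories == {"document"}:
        return "Report"
    # Data and code deposited together: the type proposed in the article.
    if {"data", "code"} <= categories:
        return "Replication"
    # Any other mixture falls back to the existing aggregate type.
    return "Collection"
```

For example, `suggest_resource_type(["analysis.R", "survey.csv", "readme.txt"])` yields "Replication", because the deposit contains both data and code files, while a deposit holding only `.csv` files yields "Dataset".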
