% IMPORTANT: The following is UTF-8 encoded. This means that in the presence
% of non-ASCII characters, it will not work with BibTeX 0.99 or older.
% Instead, you should use an up-to-date BibTeX implementation like “bibtex8” or
% “biber”.
@INPROCEEDINGS{Koehler:892675,
author = {Köhler, Cristiano and Ulianych, Danylo and Gerkin, Richard
C. and Davison, Andrew P. and Grün, Sonja and Denker,
Michael},
title = {{F}acilitating the sharing of data analysis results through
in-depth provenance capture},
reportid = {FZJ-2021-02261},
year = {2021},
abstract = {INTRODUCTION/MOTIVATION Workflows for the analysis of
electrophysiology activity data are typically composed of
multiple steps. In the simplest case, these comprise several
scripts executed in sequence, with dependencies on data and
parameter sets. However, workflows can become
increasingly complex during the course of an analysis
project: researchers can investigate alternative analysis
paths or adjust the workflow components according to new
hypotheses or additional experimental data. Considering this
complexity and iterative nature, robust tools forming the
basis of the workflow are necessary [1] to fully document
the workflow and improve the reproducibility of the results.
Provenance is the capture and characterization of data
manipulations and parameters throughout the workflow [2].
This requires complete and self-explanatory descriptions of
the generated data and a method to minimize the need for
manually tracking the workflow execution, while maximizing
the information content of the provenance trail. While
frameworks to structure the input data and associated
metadata exist, a similar representation for the outputs of
the analysis part of the workflow is missing. Moreover,
workflow management systems capture limited provenance
information, as they do not provide details about the
functions used inside each analysis script. Finally, the
workflow output lacks information pertaining to its
generation. Therefore, to satisfy the requirements of a
practically useful provenance trail, existing tools must be
improved to implement a data model that captures analysis
outputs and their detailed provenance and, ultimately,
represents the analysis and its results in accordance with
the FAIR principles [3]. METHODS We focus on two open-source
tools for the analysis of electrophysiology data developed
in EBRAINS. The Neo (RRID:SCR_000634) framework provides
an object model to standardize neural activity data acquired
from different sources [4]. Elephant (RRID:SCR_003833) is
a Python toolbox for analyses of electrophysiology data [5].
We implemented two synergistic prototype solutions that
extend the functionality of these tools with respect to (i)
the systematic standardization of analysis results and (ii)
the automatic capture of provenance information during the
execution of a Python analysis script. Both solutions are
under development and being incorporated as new
functionality into the Elephant package. The first solution
represents the output of Elephant functions in a data model
inspired by Neo. Objects for specific analysis results
(e.g., a time histogram) inherit from a base Python class
that supports storage of provenance information such
as timestamps and unique identifiers. The second solution is
a provenance tracker implemented as a function decorator. It
identifies the objects that are input to and output from the
function, creating unique hashes. It also captures
timestamps, statement code lines, and additional function
parameters. Extended dependencies between objects (such as
indexing and attributes) are mapped using the analysis of
the abstract syntax tree (AST) obtained from the code.
RESULTS AND DISCUSSION The solutions presented here capture
provenance during the analysis of electrophysiology data
with minimal user intervention. The data objects support a
hierarchical standardization of the output of Elephant
functions (e.g., a time histogram is a specific type of
histogram) while encapsulating all the information about the
generation of an analysis output. Therefore, these objects
can be easily re-used or shared. This will eliminate the
need to manually annotate the output of the analysis with
corresponding parameters. The new objects also seamlessly
extend the functionality of the Neo classes currently used
as output of Elephant functions, and can be integrated into
the existing code bases with minimal disruption.
Additionally, we describe how to capture provenance
information throughout the Python analysis script using
decorators. These track the Elephant and user-defined
functions used in the script while mapping the inputs to the
outputs. We demonstrate how the captured information can be
used to build a graph showing the steps followed in the
script, which can be stored as metadata. The analysis
results obtained with or without the use of the two
solutions are compared, highlighting the potential benefits
for reproducibility and data re-use. The provenance tracker
and the standard data objects capture and manage distinct
aspects of the provenance information. In the end, both
solutions are complementary. On one hand, the decorator is
focused on building the provenance trail and the
relationships between the different steps of the analysis
within the script. On the other hand, the standard objects
focus on the representation of the data, standardizing
information shared among the outputs of different functions
and storing the relevant provenance information as
metadata. Ultimately, those two
developments aim to increase data interoperability and
reusability in accordance with the FAIR principles.
REFERENCES: [1] Denker, M. and Grün, S. (2016). Designing
Workflows for the Reproducible Analysis of
Electrophysiological Data. In Brain-Inspired Computing,
Amunts, K. et al., eds. (Cham: Springer International
Publishing), pp. 58-72. [2] Ragan, E.D. et al. (2016).
Characterizing Provenance in Visualization and Data
Analysis: An Organizational Framework of Provenance Types
and Purposes. IEEE Transactions on Visualization and
Computer Graphics. 22(1):31–40. [3] Wilkinson, M.D. et al.
(2016). The FAIR Guiding Principles for scientific data
management and stewardship. Scientific Data 3, 160018. [4]
Garcia, S. et al. (2014) Neo: an object model for handling
electrophysiology data in multiple formats. Frontiers in
Neuroinformatics 8:10. [5] http://python-elephant.org},
month = {Feb},
date = {2021-02-01},
organization = {5th HBP Student Conference on
Interdisciplinary Brain Research,
online (online), 1 Feb 2021 - 4 Feb
2021},
subtyp = {Other},
cin = {INM-6 / INM-10 / IAS-6},
cid = {I:(DE-Juel1)INM-6-20090406 / I:(DE-Juel1)INM-10-20170113 /
I:(DE-Juel1)IAS-6-20130828},
pnm = {5235 - Digitization of Neuroscience and User-Community
Building (POF4-523) / 5231 - Neuroscientific Foundations
(POF4-523) / 571 - Connectivity and Activity (POF3-571) /
574 - Theory, modelling and simulation (POF3-574) / HDS LEE
- Helmholtz School for Data Science in Life, Earth and
Energy (HDS LEE) (HDS-LEE-20190612) / HBP SGA2 - Human Brain
Project Specific Grant Agreement 2 (785907) / HBP SGA3 -
Human Brain Project Specific Grant Agreement 3 (945539) /
HAF - Helmholtz Analytics Framework (ZT-I-0003)},
pid = {G:(DE-HGF)POF4-5235 / G:(DE-HGF)POF4-5231 /
G:(DE-HGF)POF3-571 / G:(DE-HGF)POF3-574 /
G:(DE-Juel1)HDS-LEE-20190612 / G:(EU-Grant)785907 /
G:(EU-Grant)945539 / G:(DE-HGF)ZT-I-0003},
typ = {PUB:(DE-HGF)24},
url = {https://juser.fz-juelich.de/record/892675},
}