% IMPORTANT: The following is UTF-8 encoded. This means that in the presence
% of non-ASCII characters, it will not work with BibTeX 0.99 or older.
% Instead, you should use an up-to-date BibTeX implementation like “bibtex8” or
% “biber”.
@INPROCEEDINGS{Koehler:892675,
author = {Köhler, Cristiano and Ulianych, Danylo and Gerkin, Richard
C. and Davison, Andrew P. and Grün, Sonja and Denker,
Michael},
title = {{F}acilitating the sharing of data analysis results through
in-depth provenance capture},
reportid = {FZJ-2021-02261},
year = {2021},
abstract = {INTRODUCTION/MOTIVATION Workflows for the analysis of
electrophysiology activity data are typically composed of
multiple steps. In the simplest case, these comprise several
scripts executed in sequence, with dependencies on data and
parameter sets. However, workflows can become
increasingly complex during the course of an analysis
project: researchers can investigate alternative analysis
paths or adjust the workflow components according to new
hypotheses or additional experimental data. Considering this
complexity and iterative nature, robust tools forming the
basis of the workflow are necessary [1] to fully document
the workflow and improve the reproducibility of the results.
Provenance is the capture and characterization of data
manipulations and parameters throughout the workflow [2].
This requires complete and self-explanatory descriptions of
the generated data and a method to minimize the need for
manually tracking the workflow execution, while maximizing
the information content of the provenance trail. While
frameworks to structure the input data and associated
metadata exist, a similar representation for the outputs of
the analysis part of the workflow is missing. Moreover,
workflow management systems capture limited provenance
information, as they do not provide details about the
functions used inside each analysis script. Finally, the
workflow output lacks information pertaining to its
generation. Therefore, to satisfy the requirements of a
practically useful provenance trail, existing tools must be
improved to implement a data model that captures analysis
outputs and their detailed provenance and, ultimately,
represents the analysis and its results in accordance with
the FAIR principles [3]. METHODS We focus on two open-source
tools for the analysis of electrophysiology data developed
in EBRAINS. The Neo (RRID:SCR_000634) framework provides
an object model to standardize neural activity data acquired
from different sources [4]. Elephant (RRID:SCR_003833) is
a Python toolbox for analyses of electrophysiology data [5].
We implemented two synergistic prototype solutions that
extend the functionality of these tools with respect to (i)
the systematic standardization of analysis results and (ii)
the automatic capture of provenance information during the
execution of a Python analysis script. Both solutions are
under development and being incorporated as new
functionality into the Elephant package. The first solution
represents the output of Elephant functions in a data model
inspired by Neo. Objects for specific analysis results
(e.g., a time histogram) inherit from a base Python class
that supports storage of provenance information such
as timestamps and unique identifiers. The second solution is
a provenance tracker implemented as a function decorator. It
identifies the objects that are input to and output from the
function, creating unique hashes. It also captures
timestamps, statement code lines, and additional function
parameters. Extended dependencies between objects (such as
indexing and attributes) are mapped using the analysis of
the abstract syntax tree (AST) obtained from the code.
RESULTS AND DISCUSSION The solutions presented here capture
provenance during the analysis of electrophysiology data
with minimal user intervention. The data objects support a
hierarchical standardization of the output of Elephant
functions (e.g., a time histogram is a specific type of
histogram) while encapsulating all the information about the
generation of an analysis output. Therefore, these objects
can be easily re-used or shared. This will eliminate the
need to manually annotate the output of the analysis with
corresponding parameters. The new objects also seamlessly
extend the functionality of the Neo classes currently used
as output of Elephant functions, and can be integrated into
the existing code bases with minimal disruption.
Additionally, we describe how to capture provenance
information throughout the Python analysis script using
decorators. These track the Elephant and user-defined
functions used in the script while mapping the inputs to the
outputs. We demonstrate how the captured information can be
used to build a graph showing the steps followed in the
script, which can be stored as metadata. The analysis
results obtained with or without the use of the two
solutions are compared, highlighting the potential benefits
for reproducibility and data re-use. The provenance tracker
and the standard data objects capture and manage distinct
aspects of the provenance information. In the end, both
solutions are complementary. On one hand, the decorator is
focused on building the provenance trail and the
relationships between the different steps of the analysis
within the script. On the other hand, the standard objects
focus on the representation of the data, standardizing
information shared among the outputs of different functions
and storing the relevant provenance information as
metadata. Ultimately, those two
developments aim to increase data interoperability and
reusability in accordance with the FAIR principles.
REFERENCES: [1] Denker, M. and Grün, S. (2016). Designing
Workflows for the Reproducible Analysis of
Electrophysiological Data. In Brain-Inspired Computing,
Amunts, K. et al., eds. (Cham: Springer International
Publishing), pp. 58-72. [2] Ragan, E.D. et al. (2016).
Characterizing Provenance in Visualization and Data
Analysis: An Organizational Framework of Provenance Types
and Purposes. IEEE Transactions on Visualization and
Computer Graphics. 22(1):31–40. [3] Wilkinson, M.D. et al.
(2016). The FAIR Guiding Principles for scientific data
management and stewardship. Scientific Data 3, 160018. [4]
Garcia, S. et al. (2014) Neo: an object model for handling
electrophysiology data in multiple formats. Frontiers in
Neuroinformatics 8:10. [5] http://python-elephant.org},
month = {Feb},
date = {2021-02-01},
organization = {5th HBP Student Conference on
Interdisciplinary Brain Research,
online (online), 1 Feb 2021 - 4 Feb
2021},
subtyp = {Other},
cin = {INM-6 / INM-10 / IAS-6},
cid = {I:(DE-Juel1)INM-6-20090406 / I:(DE-Juel1)INM-10-20170113 /
I:(DE-Juel1)IAS-6-20130828},
pnm = {5235 - Digitization of Neuroscience and User-Community
Building (POF4-523) / 5231 - Neuroscientific Foundations
(POF4-523) / 571 - Connectivity and Activity (POF3-571) /
574 - Theory, modelling and simulation (POF3-574) / HDS LEE
- Helmholtz School for Data Science in Life, Earth and
Energy (HDS LEE) (HDS-LEE-20190612) / HBP SGA2 - Human Brain
Project Specific Grant Agreement 2 (785907) / HBP SGA3 -
Human Brain Project Specific Grant Agreement 3 (945539) /
HAF - Helmholtz Analytics Framework (ZT-I-0003)},
pid = {G:(DE-HGF)POF4-5235 / G:(DE-HGF)POF4-5231 /
G:(DE-HGF)POF3-571 / G:(DE-HGF)POF3-574 /
G:(DE-Juel1)HDS-LEE-20190612 / G:(EU-Grant)785907 /
G:(EU-Grant)945539 / G:(DE-HGF)ZT-I-0003},
typ = {PUB:(DE-HGF)24},
url = {https://juser.fz-juelich.de/record/892675},
}