Conference Presentation (After Call) FZJ-2024-06473

Scalable Data Management for High-Resolution Microscopy of the Human Brain: Challenges and Future Directions


2024

INM Retreat 2024, Jülich, Germany, 19 Nov 2024 - 20 Nov 2024

Abstract: In order to investigate the complex structural and functional organization of the human brain, data must be integrated across multiple modalities and resolutions. This requires the implementation of scalable workflows for data extraction, AI-driven analysis, and visualization. A key challenge in this process is the storage and organization of large image datasets in suitable repositories. Due to the prohibitive cost of data duplication at this scale, storage systems must adhere to community standards, enable provenance tracking, and meet the performance demands of high-throughput data ingestion, highly parallel processing on HPC systems, and random access for interactive visualization. In this context we address the case of building a repository of high-resolution microscopy scans for whole human brain sections in the order of multiple petabytes.

Digitizing a human brain using whole-slide imaging of cell body stained tissue sections requires capturing about 7,000-8,000 histological sections at 20 micrometer thickness using high-throughput scanners. When aiming for an isotropic resolution of 1 micrometer, each histological section generates 29 TIFF images (“z-stack”), representing different focus levels. These images are automatically transferred to a gateway server for initial organization and automated quality control (QC). After QC, the z-stack is moved to a parallel file system (GPFS) on a supercomputer, generating approximately 2 petabytes of image data across 200,000 files for a single brain. These data are then accessed by various applications and pipelines, each with distinct requirements. HPC applications, such as deep learning-based cell segmentation and brain mapping, rely on fast random access and parallel I/O to efficiently stream image patches to GPUs. In contrast, remote visualization and annotation require access via an HTTP service, along with higher-capacity storage for serving diverse data concurrently.
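The volume figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: it uses the numbers stated in the abstract, assumes decimal petabytes (10^15 bytes), and the function names are hypothetical.

```python
# Back-of-the-envelope check of the data volume figures from the abstract.
# Assumption: "petabyte" means 10**15 bytes and "gigabyte" means 10**9 bytes.

SECTIONS_RANGE = (7_000, 8_000)   # histological sections per brain
IMAGES_PER_SECTION = 29           # TIFF images per z-stack
TOTAL_BYTES = 2 * 10**15          # "approximately 2 petabytes"
TOTAL_FILES = 200_000             # "across 200,000 files"


def files_for(sections: int) -> int:
    """Number of TIFF files produced by the given number of sections."""
    return sections * IMAGES_PER_SECTION


def mean_size_gb(total_bytes: int, count: int) -> float:
    """Average size in decimal gigabytes when total_bytes is split over count items."""
    return total_bytes / count / 10**9


if __name__ == "__main__":
    low, high = SECTIONS_RANGE
    print("files per brain:", files_for(low), "to", files_for(high))
    print("mean file size (GB):", mean_size_gb(TOTAL_BYTES, TOTAL_FILES))
    print("mean z-stack size (GB):",
          round(mean_size_gb(TOTAL_BYTES, high)), "to",
          round(mean_size_gb(TOTAL_BYTES, low)))
```

Under these assumptions, 7,000-8,000 sections with 29 images each give slightly more than the stated 200,000 files, and the average TIFF image is on the order of 10 GB.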
A multi-tier HPC storage system addresses these needs: the high-performance storage tier offers low latency and high bandwidth for analysis, while the capacity-optimized extended storage tier meets visualization requirements. Controlled data staging across these tiers is crucial and is managed using DataLad, which enables well-defined staging, comprehensive tracking, and version control of image datasets across distributed storage systems. Each brain section is organized as a distinct DataLad dataset to minimize the number of files per repository.

However, the current data management approach presents two major challenges. First, the TIFF format lacks support for parallel I/O, leading to data duplication when converting to HDF5 for HPC workflows. Second, the existing data organization is not aligned with community standards, hindering collaboration. Therefore, a major objective is standardization of both file formats and folder structures. However, adopting standards such as the Brain Imaging Data Structure (BIDS) poses significant challenges: the multiple folders and sidecar files it prescribes create a large number of files, and the small-file structure of OME-ZARR conflicts with the inode limits imposed on GPFS file systems.

To address these challenges, optimizing the size of DataLad datasets and exploring ways to reduce inode usage are essential. Questions remain about whether file formats like ZARR v3 or HDF5, which minimize inode consumption, should be integrated into the BIDS standard. Community discussions may provide solutions to these issues.
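To make the inode concern concrete, the following sketch estimates how many files a chunked image produces when every chunk is stored as its own file, and how that count shrinks when chunks are packed into larger container files. All numbers are assumptions chosen for illustration and do not come from the abstract: a hypothetical 100,000 x 100,000 pixel image, 1024 x 1024 pixel chunks, and 1024 chunks per container file. The function names are hypothetical as well.

```python
# Rough inode estimate for a chunked 2D image (illustrative assumptions only).


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive integers."""
    return -(-a // b)


def chunk_files(width: int, height: int, chunk: int) -> int:
    """Files needed when each square chunk of one image is stored as one file."""
    return ceil_div(width, chunk) * ceil_div(height, chunk)


def packed_files(n_chunks: int, chunks_per_file: int) -> int:
    """Files needed when several chunks are packed into each container file."""
    return ceil_div(n_chunks, chunks_per_file)


if __name__ == "__main__":
    # Hypothetical image and chunk dimensions, not taken from the abstract.
    per_image = chunk_files(100_000, 100_000, 1024)
    n_images = 200_000  # file count per brain as stated in the abstract
    print("chunk files per image:", per_image)
    print("chunk files per brain:", per_image * n_images)
    print("container files per image:", packed_files(per_image, 1024))
```

With one file per chunk, the file count grows with image area divided by chunk area, which is why formats that keep many chunks in a single file, such as HDF5, use far fewer inodes for the same data.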


Contributing Institute(s):
  1. Strukturelle und funktionelle Organisation des Gehirns (INM-1)
Research Program(s):
  1. 5251 - Multilevel Brain Organization and Variability (POF4-525)
  2. 5254 - Neuroscientific Data Analytics and AI (POF4-525)
  3. HIBALL - Helmholtz International BigBrain Analytics and Learning Laboratory (HIBALL) (InterLabs-0015)
  4. EBRAINS 2.0 - EBRAINS 2.0: A Research Infrastructure to Advance Neuroscience and Brain Health (101147319)
  5. DFG project G:(GEPRIS)501864659 - NFDI4BIOIMAGE - Nationale Forschungsdateninfrastruktur für Mikroskopie und Bildanalyse (501864659)

Appears in the scientific report 2024

The record appears in these collections:
Document types > Presentations > Conference Presentations
Institute Collections > INM > INM-1
Workflow collections > Public records
Publications database

 Record created 2024-11-26, last modified 2024-12-13


