| HPC system and job monitoring with LLview |
| Talk (non-conference) (Other) | FZJ-2023-03546 |
2022
Please use a persistent id in citations: doi:10.34734/FZJ-2023-03546
Abstract: LLview is a monitoring infrastructure developed by the Jülich Supercomputing Centre to provide an easy-to-use and adaptable software suite for monitoring High Performance Computing systems. With the emergence of large heterogeneous machines in the exascale range, the challenges of monitoring such systems increase significantly. To address them, LLview is under continuous development so that it works with a wide range of hardware systems and software interfaces with negligible overhead, while providing fast, reliable access to job reports, system-wide monitoring data, and real-time system information. This information is provided to system users, project advisors, support teams, and system administrators. It helps them manage jobs and identify performance issues at many levels, and it helps system administrators find failures and system malfunctions. This webinar gives an overview of the different LLview components and their interaction with each other and with the system. Particular attention is given to the system monitoring views and the job reporting features, as they make it possible to trace the entire life cycle of a job and can help identify problems and bottlenecks at a very early stage.