001043552 001__ 1043552
001043552 005__ 20250724210254.0
001043552 0247_ $$2datacite_doi$$a10.34734/FZJ-2025-02926
001043552 037__ $$aFZJ-2025-02926
001043552 1001_ $$0P:(DE-Juel1)138707$$aBreuer, Thomas$$b0$$eCorresponding author$$ufzj
001043552 1112_ $$aISC High Performance 2025$$cHamburg$$d2025-06-10 - 2025-06-14$$gISC25$$wGermany
001043552 245__ $$aThe Art of Process Pinning: Turning Chaos into Core Harmony
001043552 260__ $$c2025
001043552 3367_ $$033$$2EndNote$$aConference Paper
001043552 3367_ $$2BibTeX$$aINPROCEEDINGS
001043552 3367_ $$2DRIVER$$aconferenceObject
001043552 3367_ $$2ORCID$$aCONFERENCE_POSTER
001043552 3367_ $$2DataCite$$aOutput Types/Conference Poster
001043552 3367_ $$0PUB:(DE-HGF)24$$2PUB:(DE-HGF)$$aPoster$$bposter$$mposter$$s1753340973_1149$$xAfter Call
001043552 500__ $$aThis poster was awarded second prize in the Best Research Poster category.
001043552 520__ $$aHigh-Performance Computing (HPC) centres face growing challenges as user numbers and application diversity increase, requiring systems to manage a wide range of workflows. While users prioritise scientific output over specific configurations, administrators strive to maintain fully utilised systems with optimised jobs, avoiding resource waste. However, because use cases vary so widely, no single default environment can meet the needs of all users and applications. Process pinning - the binding of tasks and threads to specific CPU cores - is a vital yet often overlooked optimisation that can significantly improve job performance. This technique benefits both CPU-intensive and GPU-enabled jobs. Proper pinning prevents process migration, ensures efficient memory access, and enables faster communication. It improves system performance simply by adjusting workload manager parameters (e.g., in Slurm), without altering code. Metrics from various applications and benchmarks show that suboptimal pinning can drastically reduce performance, with production scenarios likely to be affected even more. Achieving optimal process pinning is challenging due to three interrelated factors: - System side: Application layers and libraries (e.g., MPI, OpenMP, Slurm) interact with hardware architectures, affecting task and thread placement. Updates to these components can disrupt the expected pinning behaviour. - User side: Users must consider system architecture and configuration options, such as how to split processes and threads or distribute them across cores. Even with the same core usage pattern, distribution can vary based on workload manager options (e.g., Slurm `cpu-bind` and `distribution` values). Portability across systems is not guaranteed, often leading to suboptimal performance. - Operator side: Administrators and support staff must monitor systems to ensure effective resource utilisation and address issues proactively. 
Identifying problematic jobs is difficult due to the variety of job characteristics, with inefficiencies often hidden in core usage patterns. To address these challenges, we developed tools and processes based on investigations across diverse HPC systems. These solutions enhance overall system throughput by identifying binding errors, guiding users in optimisation, and monitoring core usage. Our solutions include: - A workflow that validates pinning distributions by running automated test jobs, periodically or manually, via the GitLab-CI framework. Results are compared to expected outputs, with summaries generated and the full comparison displayed on the provider-targeted part of the JuPin pinning tool (https://go.fzj.de/pinning). Tests help HPC providers address issues pre-production, update documentation, and notify users of changes. - A user-targeted interactive visualisation in JuPin that lets users test pinning options, visualise task distributions, and generate Slurm-compatible commands. Though focused on Slurm, it can be adapted for other workload managers. - An extension of LLview (https://go.fzj.de/llview), an open-source monitoring and operational data analytics tool, that monitors core usage patterns and provides statistics and aggregated computing times. This helps operators identify inefficiencies and intervene proactively. JuPin and LLview collectively improve node utilisation, reduce waste, and simplify achieving optimal pinning. These advancements translate into more results delivered in less time. We published JuPin as open-source software on GitHub in May 2025 (https://github.com/FZJ-JSC/jupin). In conclusion, resolving pinning challenges is critical for optimising HPC systems. Our tools establish a foundation for scaling operations, including preparations for the JUPITER exascale supercomputer.
001043552 536__ $$0G:(DE-HGF)POF4-5112$$a5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)$$cPOF4-511$$fPOF IV$$x0
001043552 536__ $$0G:(DE-HGF)POF4-5121$$a5121 - Supercomputing & Big Data Facilities (POF4-512)$$cPOF4-512$$fPOF IV$$x1
001043552 536__ $$0G:(DE-Juel-1)DB001492$$aBMBF 01 1H1 6013, NRW 325 – 8.03 – 133340 - SiVeGCS (DB001492)$$cDB001492$$x2
001043552 536__ $$0G:(DE-Juel-1)ATMLAO$$aATMLAO - ATML Application Optimization and User Service Tools (ATMLAO)$$cATMLAO$$x3
001043552 7001_ $$0P:(DE-Juel1)162225$$aGuimaraes, Filipe$$b1$$ufzj
001043552 7001_ $$0P:(DE-Juel1)184480$$aHimmels, Carina$$b2$$ufzj
001043552 7001_ $$0P:(DE-Juel1)132108$$aFrings, Wolfgang$$b3$$ufzj
001043552 7001_ $$0P:(DE-Juel1)137040$$aPaschoulas, Chrysovalantis$$b4$$ufzj
001043552 7001_ $$0P:(DE-Juel1)168541$$aGöbbert, Jens Henrik$$b5$$ufzj
001043552 8564_ $$u//juser.fz-juelich.de/record/1043552/files/ISC25_JuPin_ResearchPoster.pdf
001043552 8564_ $$uhttps://juser.fz-juelich.de/record/1043552/files/ISC25_JuPin_ResearchPoster.pdf$$yOpenAccess
001043552 909CO $$ooai:juser.fz-juelich.de:1043552$$popenaire$$popen_access$$pVDB$$pdriver
001043552 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)138707$$aForschungszentrum Jülich$$b0$$kFZJ
001043552 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)162225$$aForschungszentrum Jülich$$b1$$kFZJ
001043552 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)184480$$aForschungszentrum Jülich$$b2$$kFZJ
001043552 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)132108$$aForschungszentrum Jülich$$b3$$kFZJ
001043552 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)137040$$aForschungszentrum Jülich$$b4$$kFZJ
001043552 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)168541$$aForschungszentrum Jülich$$b5$$kFZJ
001043552 9131_ $$0G:(DE-HGF)POF4-511$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5112$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vEnabling Computational- & Data-Intensive Science and Engineering$$x0
001043552 9131_ $$0G:(DE-HGF)POF4-512$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5121$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vSupercomputing & Big Data Infrastructures$$x1
001043552 9141_ $$y2025
001043552 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001043552 920__ $$lyes
001043552 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x0
001043552 980__ $$aposter
001043552 980__ $$aVDB
001043552 980__ $$aUNRESTRICTED
001043552 980__ $$aI:(DE-Juel1)JSC-20090406
001043552 9801_ $$aFullTexts