001     1038510
005     20250203124521.0
020 _ _ |a 978-3-031-69765-4 (print)
020 _ _ |a 978-3-031-69766-1 (electronic)
024 7 _ |a 10.1007/978-3-031-69766-1_4
|2 doi
024 7 _ |a 0302-9743
|2 ISSN
024 7 _ |a 1611-3349
|2 ISSN
024 7 _ |a WOS:001308370400004
|2 WOS
037 _ _ |a FZJ-2025-01495
100 1 _ |a Nassyr, Stepan
|0 P:(DE-Juel1)172888
|b 0
|u fzj
111 2 _ |a 30th European Conference on Parallel and Distributed Processing
|g Euro-Par 2024
|c Madrid
|d 2024-08-26 - 2024-08-30
|w Spain
245 _ _ |a Exploring Processor Micro-architectures Optimised for BLAS3 Micro-kernels
260 _ _ |a Cham
|c 2024
|b Springer Nature Switzerland
295 1 0 |a Euro-Par 2024: Parallel Processing
300 _ _ |a 47 - 61
336 7 _ |a CONFERENCE_PAPER
|2 ORCID
336 7 _ |a Conference Paper
|0 33
|2 EndNote
336 7 _ |a INPROCEEDINGS
|2 BibTeX
336 7 _ |a conferenceObject
|2 DRIVER
336 7 _ |a Output Types/Conference Paper
|2 DataCite
336 7 _ |a Contribution to a conference proceedings
|b contrib
|m contrib
|0 PUB:(DE-HGF)8
|s 1738307726_672
|2 PUB:(DE-HGF)
336 7 _ |a Contribution to a book
|0 PUB:(DE-HGF)7
|2 PUB:(DE-HGF)
|m contb
490 0 _ |a Lecture Notes in Computer Science
|v 14802
520 _ _ |a Dense matrix-matrix operations are relevant for a broad range of numerical applications, e.g. for implementing deep neural networks. Past research has led to a good understanding of how these operations can be mapped in a generic manner on typical processor architectures with multiple cache levels such that near-optimal performance can be reached. However, while commonly used micro-architectures are typically suitable for such operations, their architectural parameters need to be suitably tuned. The performance of highly optimised implementations of these operations relies on micro-kernels that are often handwritten. Given the increased variety of instruction set architectures and SIMD instruction extensions, this becomes challenging. In this paper, we present and implement a methodology for an exhaustive exploration of a processor core micro-architecture design space based on gem5 simulations. Furthermore, we present a tool for generating efficiently vectorised code leveraging Arm’s SVE and RISC-V’s RVV instructions. It enables automatisation of the generation of micro-kernels and, therefore, the generation of a large range of such kernels. The results provide insights to both micro-architecture architects and micro-kernel developers. The assembler generator is open-sourced and the simulation data is available as supplementary material.
536 _ _ |a 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)
|0 G:(DE-HGF)POF4-5112
|c POF4-511
|f POF IV
|x 0
536 _ _ |a 5122 - Future Computing & Big Data Systems (POF4-512)
|0 G:(DE-HGF)POF4-5122
|c POF4-512
|f POF IV
|x 1
536 _ _ |a PhD no Grant - Doktorand ohne besondere Förderung (PHD-NO-GRANT-20170405)
|0 G:(DE-Juel1)PHD-NO-GRANT-20170405
|c PHD-NO-GRANT-20170405
|x 2
536 _ _ |a EPI SGA2 (16ME0507K)
|0 G:(BMBF)16ME0507K
|c 16ME0507K
|x 3
588 _ _ |a Dataset connected to CrossRef Book Series, Journals: juser.fz-juelich.de
700 1 _ |a Pleiter, Dirk
|0 P:(DE-Juel1)144441
|b 1
|e Corresponding author
770 _ _ |z 978-3-031-69765-4=978-3-031-69766-1
773 _ _ |a 10.1007/978-3-031-69766-1_4
856 4 _ |u https://juser.fz-juelich.de/record/1038510/files/978-3-031-69766-1.pdf
|y Restricted
856 4 _ |u https://juser.fz-juelich.de/record/1038510/files/Preprint.pdf
|y Restricted
909 C O |o oai:juser.fz-juelich.de:1038510
|p VDB
910 1 _ |a Forschungszentrum Jülich
|0 I:(DE-588b)5008462-8
|k FZJ
|b 0
|6 P:(DE-Juel1)172888
913 1 _ |a DE-HGF
|b Key Technologies
|l Engineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action
|1 G:(DE-HGF)POF4-510
|0 G:(DE-HGF)POF4-511
|3 G:(DE-HGF)POF4
|2 G:(DE-HGF)POF4-500
|4 G:(DE-HGF)POF
|v Enabling Computational- & Data-Intensive Science and Engineering
|9 G:(DE-HGF)POF4-5112
|x 0
913 1 _ |a DE-HGF
|b Key Technologies
|l Engineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action
|1 G:(DE-HGF)POF4-510
|0 G:(DE-HGF)POF4-512
|3 G:(DE-HGF)POF4
|2 G:(DE-HGF)POF4-500
|4 G:(DE-HGF)POF
|v Supercomputing & Big Data Infrastructures
|9 G:(DE-HGF)POF4-5122
|x 1
914 1 _ |y 2024
915 _ _ |a Nationallizenz
|0 StatID:(DE-HGF)0420
|2 StatID
|d 2024-12-28
|w ger
915 _ _ |a DBCoverage
|0 StatID:(DE-HGF)0200
|2 StatID
|b SCOPUS
|d 2024-12-28
920 _ _ |l yes
920 1 _ |0 I:(DE-Juel1)JSC-20090406
|k JSC
|l Jülich Supercomputing Center
|x 0
980 _ _ |a contrib
980 _ _ |a VDB
980 _ _ |a contb
980 _ _ |a I:(DE-Juel1)JSC-20090406
980 _ _ |a UNRESTRICTED

