Master Thesis FZJ-2026-00950

Unravelling DNA Base Composition Behind Budding Yeast DNA Replication Origins Using Large Language Models

2025

66 pp. [10.34734/FZJ-2026-00950] = Master's thesis, Saarland University, 2025

Please use a persistent id in citations: doi:10.34734/FZJ-2026-00950

Abstract: In living cells, DNA replication begins at multiple genomic sites called replication origins. Identifying these origins and their underlying base sequence composition is crucial for understanding the replication process. Existing machine learning methods for origin prediction often require labor-intensive feature engineering or lack interpretability.

In this study, we employ genome-based pre-trained LLMs to predict replication origins in budding yeast. By leveraging pre-training, LLMs automatically capture complex genomic patterns, eliminating the need for extensive feature engineering. The attention mechanism further enables the recognition of important sequence dependencies and patterns.

We fine-tuned the pre-trained DNABERT and DNABERT-2 models for our downstream task. To reveal the DNA base composition behind replication origins, we emphasize data engineering and explainability, rather than solely using models for prediction. Therefore, we evaluate model performance across datasets of varying complexity using a structured data engineering strategy, to ensure robustness.

We developed a comprehensive pipeline for identifying sequence motifs using attention maps and bioinformatics post-processing, making DNABERT more interpretable. We also discussed explainability of the attention maps extracted by DNABERT-2, as well as its learning mechanism using various approaches.


Note: Master's thesis, Saarland University, 2025

Contributing Institute(s):
  1. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)

Appears in the scientific report 2025
Database coverage:
OpenAccess

The record appears in these collections:
Document types > Theses > Master Theses
Workflow collections > Public records
Institute Collections > JSC
Publications database
Open Access

 Record created 2026-01-23, last modified 2026-02-20

