| Unravelling DNA Base Composition Behind Budding Yeast DNA Replication Origins Using Large Language Models |
| Master Thesis | FZJ-2026-00950 |
2025
Please use a persistent id in citations: doi:10.34734/FZJ-2026-00950
Abstract: In living cells, DNA replication begins at multiple genomic sites called replication origins. Identifying these origins and their underlying base sequence composition is crucial for understanding the replication process. Existing machine learning methods for origin prediction often require labor-intensive feature engineering or lack interpretability. In this study, we employ genome-based pre-trained LLMs to predict replication origins in budding yeast. By leveraging pre-training, LLMs automatically capture complex genomic patterns, eliminating the need for extensive feature engineering. The attention mechanism further enables the recognition of important sequence dependencies and patterns. We fine-tuned the pre-trained DNABERT and DNABERT-2 models for our downstream task. To reveal the DNA base composition behind replication origins, we emphasize data engineering and explainability, rather than solely using models for prediction. Therefore, we evaluate model performance across datasets of varying complexity using a structured data engineering strategy, to ensure robustness. We developed a comprehensive pipeline for identifying sequence motifs using attention maps and bioinformatics post-processing, making DNABERT more interpretable. We also discussed explainability of the attention maps extracted by DNABERT-2, as well as its learning mechanism using various approaches.