Bachelor Thesis FZJ-2025-02306

Exploring Linguistic Proximity in C4 Multilingual Data through Efficient Embedding Model Analysis and Visualization on HPC



2025

77 p. [10.34734/FZJ-2025-02306] = Bachelor's thesis, Rheinisch-Westfälische Technische Hochschule Aachen, 2025

Please use a persistent id in citations: doi:10.34734/FZJ-2025-02306

Abstract: This thesis investigates the proximity of languages and language families by analysing how multilingual text data are represented in a shared latent space, focusing on the multilingual extension (mC4) of the Colossal Clean Crawled Corpus (C4). The central question is whether embeddings of different languages group by linguistic family, by topical content, or by both. To answer it, a high-performance computing (HPC) system was used to embed 6.1 TB of text from 24 diverse languages. The BAAI bge-m3 embedding model produced 1,024-dimensional embeddings, which were stored in a ChromaDB vector database to support scalable analysis and querying. Dimensionality reduction with t-distributed Stochastic Neighbor Embedding (t-SNE) then projected the embeddings into two dimensions so that language clusters could be visualized. The results show that similar topical content often leads the embedding model to place vectors close together, even when the texts are in different languages. However, some clusters reflect linguistic closeness, especially among languages from the same family, which indicates that the model also captures linguistic features. Overall, the thesis uses multilingual embeddings to test whether the semantic representation of texts as vectors (embeddings) is related to the linguistic structure of their source languages, and it demonstrates how HPC resources combined with advanced embedding models can handle large datasets efficiently and offer deeper insight into language proximity and topic similarity.
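The question posed in the abstract, whether embeddings group by language family or by topic, can be illustrated with a small nearest-neighbour check. The following is only a minimal sketch and not the method used in the thesis: it replaces the 1,024-dimensional bge-m3 embeddings with hand-made 3-dimensional toy vectors, and the records, labels, and function names (`cosine_similarity`, `nearest_neighbor`, `neighbor_agreement`) are all hypothetical. It measures how often a text's closest neighbour by cosine similarity shares its topic label versus its language-family label.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(index, vectors):
    """Index of the vector most similar to vectors[index], excluding itself."""
    best_j, best_sim = None, -2.0  # cosine similarity is always >= -1
    for j, candidate in enumerate(vectors):
        if j == index:
            continue
        sim = cosine_similarity(vectors[index], candidate)
        if sim > best_sim:
            best_j, best_sim = j, sim
    return best_j


def neighbor_agreement(vectors, labels):
    """Fraction of points whose nearest neighbour carries the same label."""
    hits = sum(
        labels[i] == labels[nearest_neighbor(i, vectors)]
        for i in range(len(vectors))
    )
    return hits / len(vectors)


# Hypothetical toy records: (language, family, topic, embedding).
# These vectors are invented for illustration; they are NOT thesis data.
records = [
    ("de", "Germanic", "sports", [0.90, 0.10, 0.20]),
    ("en", "Germanic", "sports", [0.80, 0.20, 0.10]),
    ("es", "Romance", "sports", [0.85, 0.10, 0.30]),
    ("de", "Germanic", "cooking", [0.10, 0.90, 0.20]),
    ("en", "Germanic", "cooking", [0.20, 0.80, 0.10]),
    ("es", "Romance", "cooking", [0.10, 0.85, 0.30]),
]

vectors = [r[3] for r in records]
families = [r[1] for r in records]
topics = [r[2] for r in records]

topic_agreement = neighbor_agreement(vectors, topics)
family_agreement = neighbor_agreement(vectors, families)

print(f"nearest neighbour shares topic:  {topic_agreement:.2f}")
print(f"nearest neighbour shares family: {family_agreement:.2f}")
```

Because the toy vectors were constructed so that topic dominates, the topic score is higher here by design; the numbers say nothing about the thesis results. With real embeddings, the same two scores would give a rough numeric companion to a two-dimensional t-SNE plot.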


Note: Bachelor's thesis, Rheinisch-Westfälische Technische Hochschule Aachen, 2025

Contributing Institute(s):
  1. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)

Appears in the scientific report 2025
Database coverage:
OpenAccess

The record appears in these collections:
Document types > University theses > Bachelor's theses
Workflow collections > Public records
Institute collections > JSC
Publications database
Open Access

 Record created 2025-04-22, last modified 2025-04-23

