Exploring Linguistic Proximity in C4 Multilingual Data through Efficient Embedding Model Analysis and Visualization on HPC
Bachelor Thesis | FZJ-2025-02306
2025
Please use a persistent id in citations: doi:10.34734/FZJ-2025-02306
Abstract: This thesis investigates the proximity of languages and language families by analysing how multilingual text data are represented in a shared latent space, focusing on the Colossal Clean Crawled Corpus (C4) and its multilingual extension (mC4). The main aim is to determine whether embeddings of different languages group together based on their linguistic families, their topical content, or both. To this end, a high-performance computing (HPC) system was used to embed 6.1 TB of textual data from 24 diverse languages. The BAAI bge-m3 embedding model produced embeddings of dimension 1,024, which were stored in a ChromaDB vector database to facilitate scalable analysis and querying. Subsequent dimensionality reduction with t-distributed Stochastic Neighbor Embedding (t-SNE) allowed the language clusters to be visualized in two-dimensional space, making them easier to interpret. Results reveal that similar thematic or topical content often drives the embedding model to generate vectors that lie close together, even for texts in different languages. However, certain clusters reflect linguistic closeness, especially among languages from the same family, indicating that the model also captures linguistic features. Overall, the thesis uses multilingual embeddings to examine whether a relation exists between the semantic representation of texts as vectors (embeddings) and the linguistic structure of their source languages, and demonstrates how HPC resources, combined with advanced embedding models, can efficiently handle large datasets and offer deeper insights into language proximity and topic similarity.