Exploring Linguistic Proximity in C4 Multilingual Data through Efficient Embedding Model Analysis and Visualization on HPC

Rahmdel, Sahand
001041549 001__ 1041549
001041549 005__ 20250423202217.0
001041549 0247_ $$2datacite_doi$$a10.34734/FZJ-2025-02306
001041549 037__ $$aFZJ-2025-02306
001041549 1001_ $$0P:(DE-Juel1)207054$$aRahmdel, Sahand$$b0$$eCorresponding author$$ufzj
001041549 245__ $$aExploring Linguistic Proximity in C4 Multilingual Data through Efficient Embedding Model Analysis and Visualization on HPC$$f - 2025-03-25
001041549 260__ $$c2025
001041549 300__ $$a77 p.
001041549 3367_ $$2DRIVER$$abachelorThesis
001041549 3367_ $$02$$2EndNote$$aThesis
001041549 3367_ $$2DataCite$$aOutput Types/Supervised Student Publication
001041549 3367_ $$0PUB:(DE-HGF)2$$2PUB:(DE-HGF)$$aBachelor Thesis$$bbachelor$$mbachelor$$s1745388543_23740
001041549 3367_ $$2BibTeX$$aMASTERSTHESIS
001041549 3367_ $$2ORCID$$aSUPERVISED_STUDENT_PUBLICATION
001041549 502__ $$aBachelorarbeit, Rheinisch-Westfälische Technische Hochschule Aachen, 2025$$bBachelorarbeit$$cRheinisch-Westfälische Technische Hochschule Aachen$$d2025
001041549 520__ $$aThis thesis investigates the proximity of languages and language families by analysing how multilingual text data are represented in a shared latent space, focusing on the Colossal Clean Crawled Corpus (C4) and its multilingual extension (mC4). The main goal is to determine whether embeddings of different languages group together by linguistic family, by topical content, or by both. To this end, a high-performance computing (HPC) system was used to embed 6.1 TB of textual data from 24 diverse languages. The BAAI bge-m3 embedding model produced 1,024-dimensional embeddings, which were stored in a ChromaDB vector database to facilitate scalable analysis and querying. Subsequent dimensionality reduction with t-distributed Stochastic Neighbor Embedding (t-SNE) allowed the language clusters to be visualized in two-dimensional space. Results reveal that similar thematic or topical content often drives the embedding model to generate vectors that lie close together, even across different languages. However, certain clusters reflect linguistic closeness, especially among languages from the same family, indicating that the model also captures linguistic features. Overall, the thesis uses multilingual embeddings to test for a relation between the semantic representation of texts as vectors (embeddings) and the linguistic structure of the source languages, demonstrating how HPC resources, combined with advanced embedding models, can efficiently handle large datasets and offer deeper insights into language proximity and topic similarity.
001041549 536__ $$0G:(DE-HGF)POF4-5112$$a5112 - Cross-Domain Algorithms, Tools, Methods Labs (ATMLs) and Research Groups (POF4-511)$$cPOF4-511$$fPOF IV$$x0
001041549 8564_ $$uhttps://juser.fz-juelich.de/record/1041549/files/Bachelorarbeit_Rahmdel_424069.pdf$$yOpenAccess
001041549 909CO $$ooai:juser.fz-juelich.de:1041549$$popenaire$$popen_access$$pVDB$$pdriver$$pdnbdelivery
001041549 9101_ $$0I:(DE-588b)5008462-8$$6P:(DE-Juel1)207054$$aForschungszentrum Jülich$$b0$$kFZJ
001041549 9131_ $$0G:(DE-HGF)POF4-511$$1G:(DE-HGF)POF4-510$$2G:(DE-HGF)POF4-500$$3G:(DE-HGF)POF4$$4G:(DE-HGF)POF$$9G:(DE-HGF)POF4-5112$$aDE-HGF$$bKey Technologies$$lEngineering Digital Futures – Supercomputing, Data Management and Information Security for Knowledge and Action$$vEnabling Computational- & Data-Intensive Science and Engineering$$x0
001041549 9141_ $$y2025
001041549 915__ $$0StatID:(DE-HGF)0510$$2StatID$$aOpenAccess
001041549 920__ $$lyes
001041549 9201_ $$0I:(DE-Juel1)JSC-20090406$$kJSC$$lJülich Supercomputing Center$$x0
001041549 9801_ $$aFullTexts
001041549 980__ $$abachelor
001041549 980__ $$aVDB
001041549 980__ $$aUNRESTRICTED
001041549 980__ $$aI:(DE-Juel1)JSC-20090406
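
The abstract asks whether embeddings group by language family or by topic. One simple way to probe such a question, offered here only as a minimal sketch and not as the thesis's actual method, is to compare the cosine similarity of per-language centroid vectors. Everything in the block is hypothetical: the helper names, the language labels, and the 8-dimensional synthetic vectors, which stand in for the 1,024-dimensional bge-m3 embeddings mentioned in the record.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def language_centroid_similarity(embeddings_by_lang):
    """Cosine similarity for every unordered pair of language centroids."""
    cents = {lang: centroid(vs) for lang, vs in embeddings_by_lang.items()}
    langs = sorted(cents)
    return {
        (a, b): cosine(cents[a], cents[b])
        for i, a in enumerate(langs)
        for b in langs[i + 1:]
    }


def toy_embeddings(center, n, noise, rng):
    """Synthetic stand-in data: n noisy copies of a center vector."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


rng = random.Random(0)
# Assumption: two artificial "families" with orthogonal centers.
base_a = [1.0] * 4 + [0.0] * 4
base_b = [0.0] * 4 + [1.0] * 4
data = {
    "lang_a1": toy_embeddings(base_a, 50, 0.1, rng),
    "lang_a2": toy_embeddings(base_a, 50, 0.1, rng),
    "lang_b1": toy_embeddings(base_b, 50, 0.1, rng),
}
sims = language_centroid_similarity(data)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

On this toy data the two languages built from the same center score far higher than the cross-family pairs. With real embeddings, topic and language effects would be mixed, so a centroid comparison like this could only be a first, coarse check.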