TurkEmbed: Turkish embedding model on natural language inference & sentence text similarity tasks

dc.authorid: 0000-0002-7877-7528
dc.authorid: 0000-0002-9502-7817
dc.authorid: 0000-0001-9033-8934
dc.authorid: 0000-0003-2008-243X
dc.contributor.author: Ezerceli, Özay [en_US]
dc.contributor.author: Gümüşçekiçci, Gizem [en_US]
dc.contributor.author: Erkoç, Tuğba [en_US]
dc.contributor.author: Özenç, Berke [en_US]
dc.date.accessioned: 2026-05-06T07:02:07Z
dc.date.available: 2026-05-06T07:02:07Z
dc.date.issued: 2025-11-10
dc.department: Işık Üniversitesi, Mühendislik ve Doğa Bilimleri Fakültesi, Bilgisayar Mühendisliği Bölümü [en_US]
dc.department: Işık University, Faculty of Engineering and Natural Sciences, Department of Computer Engineering [en_US]
dc.description.abstract: This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications. [en_US]
dc.identifier.citation: Ezerceli, Ö., Gümüşçekiçci, G., Erkoç, T. & Özenç, B. (2025). TurkEmbed: Turkish embedding model on natural language inference & sentence text similarity tasks. Arxiv, 1-9. doi: https://doi.org/10.48550/arXiv.2511.08376 [en_US]
dc.identifier.endpage: 9
dc.identifier.startpage: 1
dc.identifier.uri: https://hdl.handle.net/11729/7379
dc.identifier.uri: https://doi.org/10.48550/arXiv.2511.08376
dc.identifier.wos: PPRN:161696255
dc.identifier.wosquality: N/A
dc.indekslendigikaynak: Web of Science [en_US]
dc.indekslendigikaynak: Preprint Citation Index [en_US]
dc.institutionauthor: Gümüşçekiçci, Gizem [en_US]
dc.institutionauthor: Erkoç, Tuğba [en_US]
dc.institutionauthor: Özenç, Berke [en_US]
dc.institutionauthorid: 0000-0002-9502-7817
dc.institutionauthorid: 0000-0001-9033-8934
dc.institutionauthorid: 0000-0003-2008-243X
dc.language.iso: en [en_US]
dc.publisher: Cornell Univ [en_US]
dc.relation.ispartof: Arxiv [en_US]
dc.relation.publicationcategory: Preprint – International – Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Semantic text similarity [en_US]
dc.subject: Matryoshka representation [en_US]
dc.subject: Embedding model [en_US]
dc.subject: Natural language inference [en_US]
dc.subject: Downstream task [en_US]
dc.title: TurkEmbed: Turkish embedding model on natural language inference & sentence text similarity tasks [en_US]
dc.type: Preprint [en_US]
dspace.entity.type: Publication [en_US]

Files

Original bundle
Name: TurkEmbed_Turkish_Embedding_Model_on_NLI_STS_Tasks.pdf
Size: 1.17 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.17 KB
Format: Item-specific license agreed upon to submission