TurkEmbed: Turkish embedding model on natural language inference & sentence text similarity tasks

Date

2025-11-10

Publisher

Cornell Univ

Access Rights

info:eu-repo/semantics/openAccess

Abstract

This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.
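The abstract names two mechanics that a small example may make concrete: Matryoshka-style embeddings, where a prefix of the vector can be used as a smaller embedding, and STS scoring by Pearson and Spearman correlation between predicted similarities and gold scores. The following is a minimal sketch under stated assumptions, not the authors' code: all function names (`truncate_and_normalize`, `evaluate`, and so on) are hypothetical, and the inputs are plain Python lists standing in for real model outputs.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def cosine(u, v):
    # Both inputs are unit length, so the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def evaluate(pairs, gold, dim):
    """Score sentence pairs at a given prefix dimension against gold labels.

    `pairs` is a list of (embedding_a, embedding_b) tuples and `gold` holds
    the human similarity scores; both are assumed inputs for illustration.
    """
    preds = [
        cosine(truncate_and_normalize(a, dim), truncate_and_normalize(b, dim))
        for a, b in pairs
    ]
    return pearson(preds, gold), spearman(preds, gold)
```

Calling `evaluate` with several values of `dim` on the same pairs would show how correlation changes as the embedding is shortened, which is the trade-off the abstract alludes to for resource-constrained settings.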

Keywords

Semantic text similarity, Matryoshka representation, Embedding model, Natural language inference, Downstream task

Source

arXiv

WoS Q Value

N/A

Citation

Ezerceli, Ö., Gümüşçekiçci, G., Erkoç, T. & Özenç, B. (2025). TurkEmbed: Turkish embedding model on natural language inference & sentence text similarity tasks. arXiv, 1-9. doi: https://doi.org/10.48550/arXiv.2511.08376