Effective semi-supervised learning strategies for automatic sentence segmentation

Dalva, Doğan; Güz, Ümit; Gürkan, Hakan

dc.contributor.author	Dalva, Doğan	en_US
dc.contributor.author	Güz, Ümit	en_US
dc.contributor.author	Gürkan, Hakan	en_US
dc.date.accessioned	2018-12-13T01:04:03Z
dc.date.available	2018-12-13T01:04:03Z
dc.date.issued	2018-04-01
dc.identifier.citation	Dalva, D., Güz, Ü. & Gürkan, H. (2018). Effective semi-supervised learning strategies for automatic sentence segmentation. Pattern Recognition Letters, 105(SI), 76-86. doi:10.1016/j.patrec.2017.10.010	en_US
dc.identifier.issn	0167-8655
dc.identifier.issn	1872-7344
dc.identifier.uri	https://hdl.handle.net/11729/1416
dc.identifier.uri	http://dx.doi.org/10.1016/j.patrec.2017.10.010
dc.description.abstract	The primary objective of the sentence segmentation process is to determine the sentence boundaries in a stream of words output by automatic speech recognizers. Statistical methods developed for sentence segmentation require a significant amount of labeled data, which is time-consuming, labor-intensive, and expensive to produce. In this work, we propose new multi-view semi-supervised learning strategies for the sentence boundary classification problem using lexical, prosodic, and morphological information. The aim is to find effective semi-supervised machine learning strategies when only small sets of sentence-boundary-labeled data are available. We primarily investigate two semi-supervised learning approaches, self-training and co-training. Different example selection strategies were also used for co-training, namely agreement, disagreement, and self-combined. Furthermore, we propose three-view and committee-based algorithms that incorporate the agreement, disagreement, and self-combined strategies using three disjoint feature sets. We present comparative results of different learning strategies on the sentence segmentation task. The experimental results show that sentence segmentation performance can be substantially improved using the proposed multi-view learning strategies, since the data sets can be represented by three redundantly sufficient and disjoint feature sets. We show that, when only a small set of manually labeled data is available, the proposed strategies improve the average baseline F-measure from 67.66% to 75.15% for spoken Turkish and from 64.84% to 66.32% for spoken English.	en_US
dc.description.sponsorship	This material is based upon work supported by the Scientific and Technological Research Council of Turkey (TUBITAK) (Project Number: 107E182 and Project Number: 111E228), Isik University Scientific Research Projects Fund (Project Number: 09A301 and Project Number: 14A201), and TUBITAK BIDEB and J. William Fulbright Post-Doctoral Research Fellowship, USA funding at SRI-International, Speech Technology and Research (STAR) Lab., Menlo Park, CA, USA and International Computer Science Institute (ICSI) Speech Group, University of California at Berkeley, CA, USA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies. The authors thank Gokhan Tur, Dilek Hakkani-Tur, Benoit Favre, Sebastien Cuendet, Murat Saraclar, Siddika Parlak, Erinc Dikici, Izel D. Revidi, Cenk Demiroglu, Fatih Ozaydin, and the Bogazici University Signal and Image Processing (BUSIM) Group for many helpful discussions. The authors also thank the anonymous reviewers for their useful comments on an earlier version of this paper.	en_US
dc.language.iso	eng	en_US
dc.publisher	Elsevier Science BV	en_US
dc.relation.isversionof	10.1016/j.patrec.2017.10.010
dc.rights	info:eu-repo/semantics/closedAccess	en_US
dc.subject	Machine learning	en_US
dc.subject	Multi-view semi-supervised learning	en_US
dc.subject	Co-training	en_US
dc.subject	Sentence segmentation	en_US
dc.subject	Boosting	en_US
dc.subject	Speech	en_US
dc.subject	Recognition	en_US
dc.subject	Multiview	en_US
dc.subject	Speech recognition	en_US
dc.subject	Sentence boundary	en_US
dc.subject	Adaptive boosting	en_US
dc.subject	Artificial intelligence	en_US
dc.subject	Classification (of information)	en_US
dc.subject	Learning algorithms	en_US
dc.subject	Learning systems	en_US
dc.subject	Speech processing	en_US
dc.subject	Automatic speech recognizers	en_US
dc.subject	Morphological information	en_US
dc.subject	Multi-view learning	en_US
dc.subject	Semi-supervised learning	en_US
dc.subject	Sentence boundaries	en_US
dc.subject	Supervised learning	en_US
dc.title	Effective semi-supervised learning strategies for automatic sentence segmentation	en_US
dc.type	article	en_US
dc.description.version	Publisher's Version	en_US
dc.relation.journal	Pattern Recognition Letters	en_US
dc.contributor.department	Işık Üniversitesi, Mühendislik Fakültesi, Elektrik-Elektronik Mühendisliği Bölümü	en_US
dc.contributor.department	Işık University, Faculty of Engineering, Department of Electrical-Electronics Engineering	en_US
dc.contributor.authorID	0000-0002-4597-0954
dc.contributor.authorID	0000-0002-7008-4778
dc.identifier.volume	105
dc.identifier.issue	SI
dc.identifier.startpage	76
dc.identifier.endpage	86
dc.peerreviewed	Yes	en_US
dc.publicationstatus	Published	en_US
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
dc.contributor.institutionauthor	Dalva, Doğan	en_US
dc.contributor.institutionauthor	Güz, Ümit	en_US
dc.contributor.institutionauthor	Gürkan, Hakan	en_US
dc.relation.index	WOS	en_US
dc.relation.index	Scopus	en_US
dc.relation.index	Science Citation Index Expanded (SCI-EXPANDED)	en_US
dc.description.quality	Q2
dc.description.wosid	WOS:000428363000010
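
The abstract names co-training with an "agreement" example selection strategy but this record does not describe the procedure. The following is a minimal sketch of one common reading of that idea, not the authors' implementation: two classifiers are trained on disjoint feature views, and an unlabeled example is moved into the labeled set only when both views predict the same label. Everything in it is an assumption made for illustration: the names `CentroidClassifier` and `co_train_agreement`, the nearest-centroid model (a stand-in for whatever classifiers the paper uses), and the ranking of candidates by the smaller of the two margins.

```python
from statistics import mean


class CentroidClassifier:
    """Hypothetical stand-in classifier: predicts the label of the nearest class centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            # Per-feature mean of all rows carrying this label.
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict_with_margin(self, x):
        # Squared distance to every centroid, closest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, c)), label)
            for label, c in self.centroids.items()
        )
        label = dists[0][1]
        # Margin = gap between the two closest centroids (assumed confidence proxy).
        margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
        return label, margin


def co_train_agreement(labeled, unlabeled, rounds=3, per_round=2):
    """Sketch of co-training with an agreement selection rule.

    labeled:   list of (view_a, view_b, label)
    unlabeled: list of (view_a, view_b)
    Returns the grown labeled set and the remaining unlabeled pool.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        labels = [y for _, _, y in labeled]
        clf_a = CentroidClassifier().fit([a for a, _, _ in labeled], labels)
        clf_b = CentroidClassifier().fit([b for _, b, _ in labeled], labels)

        candidates = []
        for idx, (a, b) in enumerate(pool):
            label_a, margin_a = clf_a.predict_with_margin(a)
            label_b, margin_b = clf_b.predict_with_margin(b)
            if label_a == label_b:  # agreement: both views vote for the same label
                candidates.append((min(margin_a, margin_b), idx, label_a))

        # Keep the most confident agreed-upon examples for this round.
        chosen = sorted(candidates, reverse=True)[:per_round]
        if not chosen:
            break
        for _, idx, label in chosen:
            a, b = pool[idx]
            labeled.append((a, b, label))
        taken = {idx for _, idx, _ in chosen}
        pool = [p for i, p in enumerate(pool) if i not in taken]
    return labeled, pool
```

Under the same assumptions, a disagreement strategy would flip the selection test so that examples on which the two views differ are the ones picked out; how such examples are then labeled is not stated in this record.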

Files in this item:

Name: 1416.pdf
Size: 1.098 MB
Format: PDF
Description: Publisher's Version

This item appears in the following collection(s):

MF - Article Collection | Department of Electrical-Electronics Engineering [181]
Contains the article collection of the Department of Electrical-Electronics Engineering.
Scopus Indexed Article Collection [1009]
WOS Indexed Article Collection [1025]

Related Items

Extension of conventional co-training learning strategies to three-view and committee-based learning strategies for effective automatic sentence segmentation

Aynı oteli temsil eden farklı kayıtlar için akıllı eşleştirme [Smart matching for different records representing the same hotel]

An incremental model selection algorithm based on cross-validation for finding the architecture of a Hidden Markov model on hand gesture data sets