19 sonuçlar
Arama Sonuçları
Listeleniyor 1 - 10 / 19
Yayın Calculating the VC-dimension of decision trees(IEEE, 2009) Aslan, Özlem; Yıldız, Olcay Taner; Alpaydın, Ahmet İbrahim EthemWe propose an exhaustive search algorithm that calculates the VC-dimension of univariate decision trees with binary features. The VC-dimension of the univariate decision tree with binary features depends on (i) the VC-dimension values of the left and right subtrees, (ii) the number of inputs, and (iii) the number of nodes in the tree. From a training set of example trees whose VC-dimensions are calculated by exhaustive search, we fit a general regressor to estimate the VC-dimension of any binary tree. These VC-dimension estimates are then used to get VC-generalization bounds for complexity control using SRM in decision trees, i.e., pruning. Our simulation results shows that SRM-pruning using the estimated VC-dimensions finds trees that are as accurate as those pruned using cross-validation.Yayın Univariate margin tree(Springer, 2010) Yıldız, Olcay TanerIn many pattern recognition applications, first decision trees are used due to their simplicity and easily interpretable nature. In this paper, we propose a new decision tree learning algorithm called univariate margin tree, where for each continuous attribute, the best split is found using convex optimization. Our simulation results on 47 datasets show that the novel margin tree classifier performs at least as good as C4.5 and LDT with a similar time complexity. For two class datasets it generates smaller trees than C4.5 and LDT without sacrificing from accuracy, and generates significantly more accurate trees than C4.5 and LDT for multiclass datasets with one-vs-rest methodology.Yayın Maintenance policy analysis of the regenerative air heater system using factored POMDPs(Elsevier Ltd, 2022-03) Kıvanç, İpek; Özgür Ünlüakın, Demet; Bilgiç, TanerMaintenance optimization of multi-component systems is a difficult problem. Partially Observable Markov Decision Processes (POMDPs) are powerful tools for such problems under uncertainty in stochastic environments. In this study, the main POMDP solution approaches and solvers are surveyed. Then, based on experimental models with different complexities in the size of the system space, selected POMDP solvers using different representation patterns for modeling and different procedures for updating the value function while solving are compared. Furthermore, to show that factored representations are advantageous in modeling and solving the maintenance problem of multi-component systems where there exist also stochastic dependencies among the components, the maintenance problem of the one-line regenerative air heater system available in thermal power plants is modeled and solved with factored POMDPs. In-depth sensitivity analyses are performed on the obtained policy. The results show that factored POMDPs enable compact modeling, efficient policy generation and practical policy analysis for the tackled problem. Furthermore, they also motivate the use of factored POMDPs in the generation and analysis of maintenance policies for similar multi-component systems.Yayın İlişkisel veri tabanlarında mükerrer kayıtların makine öğrenmesiyle tespiti(Institute of Electrical and Electronics Engineers Inc., 2018-07-05) Bayrak, Ahmet Tuğrul; Yılmaz, Aykut İnan; Yılmaz, Kemal Burak; Düzağaç, Remzi; Yıldız, Olcay TanerVeri miktarının artışına paralel olarak, ilişkisel veri tabanlarında mükerrer kayıtlar da artmaktadır. Artan bu kayıtlar kullanıldıkları rapor veya analizlerde tutarsızlığa sebep olabilmektedir. Bu sorunu en aza indirgemek için yaptığımız çalışmada, kayıtların birbirlerine olan benzerlikleri ve alan uzmanlık bilgisiyle belirlenen ağırlıklar, öznitelik olarak kullanılarak makine öğrenmesi algoritmaları ile mükerrer kayıtların bulunması hedeflenmiştir. Yapılan işlem sonucunda 9301467 satır veride 28412 mükerrer çift tespit edilmiştir. Bulunan bu mükerrer kayıtlar veri kaynağından temizlenerek verinin daha tutarlı hale gelmesi sağlanmaktadır.Yayın Effective semi-supervised learning strategies for automatic sentence segmentation(Elsevier Science BV, 2018-04-01) Dalva, Doğan; Güz, Ümit; Gürkan, HakanThe primary objective of sentence segmentation process is to determine the sentence boundaries of a stream of words output by the automatic speech recognizers. Statistical methods developed for sentence segmentation requires a significant amount of labeled data which is time-consuming, labor intensive and expensive. In this work, we propose new multi-view semi-supervised learning strategies for sentence boundary classification problem using lexical, prosodic, and morphological information. The aim is to find effective semi-supervised machine learning strategies when only small sets of sentence boundary labeled data are available. We primarily investigate two semi-supervised learning approaches, called self-training and co-training. Different example selection strategies were also used for co-training, namely, agreement, disagreement and self-combined. Furthermore, we propose three-view and committee-based algorithms incorporating with agreement, disagreement and self-combined strategies using three disjoint feature sets. We present comparative results of different learning strategies on the sentence segmentation task. The experimental results show that the sentence segmentation performance can be highly improved using multi-view learning strategies that we proposed since data sets can be represented by three redundantly sufficient and disjoint feature sets. We show that the proposed strategies substantially improve the average baseline F-measure of 67.66% to 75.15% and 64.84% to 66.32% when only a small set of manually labeled data is available for Turkish and English spoken languages, respectively.Yayın Extension of conventional co-training learning strategies to three-view and committee-based learning strategies for effective automatic sentence segmentation(IEEE, 2018) Dalva, Doğan; Güz, Ümit; Gürkan, HakanThe objective of this work is to develop effective multi-view semi-supervised machine learning strategies for sentence boundary classification problem when only small sets of sentence boundary labeled data are available. We propose three-view and committee-based learning strategies incorporating with co-training algorithms with agreement, disagreement, and self-combined learning strategies using prosodic, lexical and morphological information. We compare experimental results of proposed three-view and committee-based learning strategies to other semi-supervised learning strategies in the literature namely, self-training and co-training with agreement, disagreement, and self-combined strategies. The experiment results show that sentence segmentation performance can be highly improved using multi-view learning strategies that we propose since data sets can be represented by three redundantly sufficient and disjoint feature sets. We show that the proposed strategies substantially improve the average performance when only a small set of manually labeled data is available for Turkish and English spoken languages, respectively.Yayın ISIKUN at the FinCausal 2020: Linguistically informed machine-learning approach for causality identification in financial documents(Association for Computational Linguistics (ACL), 2020) Özenir, Hüseyin Gökberk; Karadeniz, İlknurThis paper presents our participation to the FinCausal-2020 Shared Task whose ultimate aim is to extract cause-effect relations from a given financial text. Our participation includes two systems for the two sub-tasks of the FinCausal-2020 Shared Task. The first sub-task (Task-1) consists of the binary classification of the given sentences as causal meaningful (1) or causal meaningless (0). Our approach for the Task-1 includes applying linear support vector machines after transforming the input sentences into vector representations using term frequency-inverse document frequency scheme with 3-grams. The second sub-task (Task-2) consists of the identification of the cause-effect relations in the sentences, which are detected as causal meaningful. Our approach for the Task-2 is a CRF-based model which uses linguistically informed features. For the Task-1, the obtained results show that there is a small difference between the proposed approach based on linear support vector machines (F-score 94%), which requires less time compared to the BERT-based baseline (F-score 95%). For the Task-2, although a minor modifications such as the learning algorithm type and the feature representations are made in the conditional random fields based baseline (F-score 52%), we have obtained better results (F-score 60%). The source codes for the both tasks are available online (https://github.com/ozenirgokberk/FinCausal2020.git/).Yayın Cost-conscious comparison of supervised learning algorithms over multiple data sets(Elsevier Sci Ltd, 2012-04) Ulaş, Aydın; Yıldız, Olcay Taner; Alpaydın, Ahmet İbrahim EthemIn the literature, there exist statistical tests to compare supervised learning algorithms on multiple data sets in terms of accuracy but they do not always generate an ordering. We propose Multi(2)Test, a generalization of our previous work, for ordering multiple learning algorithms on multiple data sets from "best" to "worst" where our goodness measure is composed of a prior cost term additional to generalization error. Our simulations show that Multi2Test generates orderings using pairwise tests on error and different types of cost using time and space complexity of the learning algorithms.Yayın Mapping classifiers and datasets(Pergamon-Elsevier Science Ltd, 2011-04) Yıldız, Olcay TanerGiven the posterior probability estimates of 14 classifiers on 38 datasets, we plot two-dimensional maps of classifiers and datasets using principal component analysis (PCA) and Isomap. The similarity between classifiers indicate correlation (or diversity) between them and can be used in deciding whether to include both in an ensemble. Similarly, datasets which are too similar need not both be used in a general comparison experiment. The results show that (i) most of the datasets (approximately two third) we used are similar to each other, (ii) multilayer perceptrons and k-nearest neighbor variants are more similar to each other than support vector machine and decision tree variants. (iii) the number of classes and the sample size has an effect on similarity.Yayın Feature extraction from discrete attributes(IEEE, 2010) Yıldız, Olcay TanerIn many pattern recognition applications, first decision trees are used due to their simplicity and easily interpretable nature. In this paper, we extract new features by combining k discrete attributes, where for each subset of size k of the attributes, we generate all orderings of values of those attributes exhaustively. We then apply the usual univariate decision tree classifier using these orderings as the new attributes. Our simulation results on 16 datasets from UCI repository [2] show that the novel decision tree classifier performs better than the proper in terms of error rate and tree complexity. The same idea can also be applied to other univariate rule learning algorithms such as C4.5Rules [7] and Ripper [3].












