Software defect prediction using Bayesian networks and kernel methods

Okutan, Ahmet

dc.contributor.advisor	Yıldız, Olcay Taner	en_US
dc.contributor.author	Okutan, Ahmet	en_US
dc.contributor.other	Işık Üniversitesi, Fen Bilimleri Enstitüsü, Bilgisayar Mühendisliği Doktora Programı	en_US
dc.date.accessioned	2016-05-31T12:07:36Z
dc.date.available	2016-05-31T12:07:36Z
dc.date.issued	2012-07-01
dc.identifier.citation	Okutan, A. (2012). Software defect prediction using Bayesian networks and kernel methods. İstanbul: Işık Üniversitesi Fen Bilimleri Enstitüsü.	en_US
dc.identifier.uri	https://hdl.handle.net/11729/891
dc.description	Text in English; Abstract: English and Turkish	en_US
dc.description	Includes bibliographical references (leaves 115-127)	en_US
dc.description	xix, 128 leaves	en_US
dc.description.abstract	Many different software metrics have been proposed and used for defect prediction in the literature. Instead of dealing with so many metrics, it would be more practical to determine the set of most important metrics and focus on them to predict defectiveness. We use Bayesian modelling to determine the influential relationships among software metrics and defect proneness. In addition to the metrics used in the Promise data repository, we define two more metrics: NOD for the number of developers and LOCQ for the source code quality. We extract these metrics by inspecting the source code repositories of the selected Promise data repository data sets. At the end of our modelling, we learn both the marginal defect proneness probability of the whole software system and the set of most effective metrics. Our experiments on nine open source Promise data repository data sets show that response for class (RFC), lines of code (LOC), and lack of coding quality (LOCQ) are the most effective metrics, whereas coupling between objects (CBO), weighted method per class (WMC), and lack of cohesion of methods (LCOM) are less effective metrics on defect proneness. Furthermore, number of children (NOC) and depth of inheritance tree (DIT) have very limited effect and are untrustworthy. On the other hand, based on the experiments on the Poi, Tomcat, and Xalan data sets, we observe that there is a positive correlation between the number of developers (NOD) and the level of defectiveness. However, further investigation involving a greater number of projects is needed to confirm our findings. Furthermore, we propose a novel technique for defect prediction that uses plagiarism detection tools. Although the defect prediction problem has been researched for a long time, the results achieved are not very satisfactory. We use kernel methods to model the relationship between source code similarity and defectiveness.
Each value in the kernel matrix shows how much parallelism exists between the corresponding files in the software system chosen. Our experiments on 10 real world data sets indicate that support vector machines (SVM) with a precalculated kernel matrix perform better than SVM with the usual linear and RBF kernels and generate results comparable with well-known defect prediction methods like linear logistic regression and J48 in terms of the area under the curve (AUC). Furthermore, we observed that when the amount of similarity among the files of a software system is high, the AUC found by the SVM with precomputed kernel is also high. We also show that the SVM with a precomputed kernel can be used to predict the number of defects in the files or classes of a software system, because we observe a relationship between source code similarity and the number of defects. Based on the results of our analysis, developers can focus on more defective modules rather than on less defective or non-defective ones during testing activities. The experiments on 10 Promise data sets indicate that, while predicting the number of defects, SVM with a precomputed kernel performs as well as SVM with the usual linear and RBF kernels in terms of the root mean square error (RMSE). The proposed method is also comparable with other regression methods like linear regression and IBK. The results of these experiments suggest that source code similarity is a good means of predicting both defectiveness and the number of defects in software modules.	en_US
dc.description.abstract	Literatürde kullanılan çok çeşitli yazılım ölçütleri mevcuttur. Çok sayıda ölçütle hata tahmini yapmak yerine, en önemli ölçüt kümesini belirleyip bu kümedeki ölçütleri hata tahmininde kullanmak daha pratik ve kolay olacaktır. Bu tezde yazılım ölçütleri ile hataya yatkınlık arasındaki etkileşimi ortaya çıkarmak için Bayesian modelleme yöntemi kullanılmıştır. Promise veri deposundaki yazılım ölçütlerine ek olarak, yazılım geliştiricisi sayısı (NOD) ve kaynak kodu kalitesi (LOCQ) adlı 2 yeni ölçüt tanımlanmıştır. Bu ölçütleri çıkarmak için Promise veri deposundaki veri kümelerinin açık kaynak kodları kullanılmıştır. Yapılan modelleme sonucunda, hem sınanan sistemin hatalı olma ihtimali, hem de en etkili ölçüt kümesi bulunmaktadır. 9 Promise veri kümesi üzerindeki deneyler, RFC, LOC ve LOCQ ölçütlerinin en etkili ölçütler olduğunu, CBO, WMC ve LCOM ölçütlerinin ise daha az etkili olduğunu ortaya koymuştur. Ayrıca, NOC ve DIT ölçütlerinin sınırlı bir etkiye sahip olduğu ve güvenilir olmadığı gözlemlenmiştir. Öte yandan, Poi, Tomcat ve Xalan veri kümeleri üzerinde yapılan deneyler sonucunda, yazılım geliştirici sayısı (NOD) ile hata seviyesi arasında doğru orantı olduğu sonucuna varılmıştır. Bununla birlikte, tespitlerimizi doğrulamak için daha fazla veri kümesi üzerinde deney yapmaya ihtiyaç vardır. Ayrıca bu tezde, hata tahmini için intihal tespit araçlarını kullanan yeni bir yöntem önerilmiştir. Hata tahmin problemi uzun zamandan beri araştırılmaktadır, fakat ortaya çıkan sonuçlar çok parlak değildir. Farklı bir bakış açısı getirmek üzere, kaynak kod benzerliği ve hataya yatkınlık arasındaki ilişkiyi modelleyen çekirdek metodu yöntemi kullanılmıştır. Bu yöntemde, üretilen çekirdek matrisindeki her bir değer, matrisin satır ve sütununda bulunan kaynak kodu dosyaları arasındaki paralelliği göstermektedir.
10 veri kümesi üzerindeki deneyler, önceden hesaplanmış çekirdek matrisi kullanan SVM yönteminin, doğrusal veya RBF çekirdek kullanan SVM yöntemlerine göre daha başarılı olduğunu, ayrıca mevcut hata tahmin yöntemlerinden doğrusal lojistik regresyon ve J48 ile benzer sonuçlar ürettiğini göstermiştir. Ayrıca, bir yazılım sistemi içerisinde bulunan dosyalar arasındaki kod benzerliğinin daha fazla olduğu durumlarda, ROC eğrisi altındaki alan (AUC) ölçütünün de daha yüksek olduğu görülmüştür. Ayrıca, önceden hesaplanmış çekirdek matris kullanan SVM yönteminin, hata sayısı ile kaynak kodu benzerliği arasında gözlemlenen ilişkiden ötürü, bir yazılım sistemindeki hata sayısının tahmin edilmesinde de kullanılabileceği gösterilmiştir. Yapılan analiz sonucunda, yazılım geliştiriciler hatasız veya daha az hatalı modüllere odaklanmak yerine, daha fazla hata içeren modüllere odaklanabilirler. 10 Promise veri kümesi üzerinde yapılan deneyler, hata sayısını tahmin ederken, önceden hesaplanan çekirdek matris kullanan SVM yönteminin ortalama karesel hata (RMSE) açısından doğrusal ve RBF çekirdek kullanan SVM yöntemi kadar başarılı olduğunu göstermiştir. Uygulanan yöntem, doğrusal regresyon ve IBK gibi diğer regresyon yöntemleri ile benzer sonuçlar üretmiştir. Yapılan deneylerin sonuçları, kaynak kodu benzerliğinin hataya yatkınlığı ve hata sayısını tahmin etmede iyi bir araç olduğunu ortaya koymuştur.	en_US
dc.description.tableofcontents	Software Metrics	en_US
dc.description.tableofcontents	Static Code Metrics	en_US
dc.description.tableofcontents	McCabe Metrics	en_US
dc.description.tableofcontents	Line of Code Metrics	en_US
dc.description.tableofcontents	Halstead Metrics	en_US
dc.description.tableofcontents	Object Oriented Metrics	en_US
dc.description.tableofcontents	Developer Metrics	en_US
dc.description.tableofcontents	Process Metrics	en_US
dc.description.tableofcontents	Defect Prediction	en_US
dc.description.tableofcontents	Defect Prediction Data	en_US
dc.description.tableofcontents	Performance Measure	en_US
dc.description.tableofcontents	An Overview of the Defect Prediction Studies	en_US
dc.description.tableofcontents	Defect Prediction Using Statistical Methods	en_US
dc.description.tableofcontents	Defect Prediction Using Machine Learning Methods	en_US
dc.description.tableofcontents	Previous Work on Defect Prediction	en_US
dc.description.tableofcontents	Critics About Studies	en_US
dc.description.tableofcontents	Benchmarking Studies	en_US
dc.description.tableofcontents	Bayesian Networks	en_US
dc.description.tableofcontents	Background on Bayesian Networks	en_US
dc.description.tableofcontents	K2 Algorithm	en_US
dc.description.tableofcontents	Previous Work on Bayesian Networks	en_US
dc.description.tableofcontents	Kernel Machines	en_US
dc.description.tableofcontents	Background on Kernel Machines	en_US
dc.description.tableofcontents	Support Vector Machines	en_US
dc.description.tableofcontents	Support Vector Machines for Regression	en_US
dc.description.tableofcontents	Kernel Functions	en_US
dc.description.tableofcontents	String Kernels	en_US
dc.description.tableofcontents	Previous Work on Kernel Machines	en_US
dc.description.tableofcontents	Plagiarism Tools	en_US
dc.description.tableofcontents	Similarity Detection	en_US
dc.description.tableofcontents	Kernel Methods for Defect Prediction	en_US
dc.description.tableofcontents	Proposed Method	en_US
dc.description.tableofcontents	Bayesian Networks	en_US
dc.description.tableofcontents	Bayesian Network of Metrics and Defect Proneness	en_US
dc.description.tableofcontents	Ordering Metrics for Bayesian Network Construction	en_US
dc.description.tableofcontents	Kernel Methods to Predict Defectiveness	en_US
dc.description.tableofcontents	Selecting Plagiarism Tools and Tuning Their Input Parameters	en_US
dc.description.tableofcontents	Data Set Selection	en_US
dc.description.tableofcontents	Kernel Matrix Generation	en_US
dc.description.tableofcontents	Kernel Methods to Predict the Number of Defects	en_US
dc.description.tableofcontents	Experiments and Results	en_US
dc.description.tableofcontents	Experiment I: Determine Influential Relationships Among Metrics and Defectiveness Using Bayesian Networks	en_US
dc.description.tableofcontents	Experiment Design	en_US
dc.description.tableofcontents	Experiment II: Determine The Role Of Coding Quality And Number Of Developers On Defectiveness Using Bayesian Networks	en_US
dc.description.tableofcontents	Conclusion Instability Test	en_US
dc.description.tableofcontents	Effectiveness of Metric Pairs	en_US
dc.description.tableofcontents	Feature Selection Tests	en_US
dc.description.tableofcontents	Effectiveness of the Number of Developers (NOD)	en_US
dc.description.tableofcontents	Experiment III: Defect Proneness Prediction Using Kernel Methods	en_US
dc.description.tableofcontents	Experiment IV: Prediction of the Number of Defects with Kernel Methods	en_US
dc.description.tableofcontents	Threats to Validity	en_US
dc.description.tableofcontents	Summary of Results	en_US
dc.description.tableofcontents	Bayesian Networks	en_US
dc.description.tableofcontents	Kernel Methods to Predict Defectiveness	en_US
dc.description.tableofcontents	Kernel Methods to Predict the Number of Defects	en_US
dc.description.tableofcontents	Contributions	en_US
dc.description.tableofcontents	Bayesian Networks	en_US
dc.description.tableofcontents	Kernel Methods to Predict Defectiveness	en_US
dc.description.tableofcontents	Kernel Methods to Predict the Number of Defects	en_US
dc.description.tableofcontents	Future Work	en_US
dc.language.iso	eng	en_US
dc.publisher	Işık Üniversitesi	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject.lcc	QA76.76.Q35 O38 2012
dc.subject.lcsh	Neural networks (Computer science)	en_US
dc.subject.lcsh	Computer software -- Quality control	en_US
dc.subject.lcsh	Artificial intelligence	en_US
dc.subject.lcsh	Bayesian statistical decision theory	en_US
dc.title	Software defect prediction using Bayesian networks and kernel methods	en_US
dc.title.alternative	Bayesian ağları ve çekirdek yöntemleri ile yazılım hata tahmini	en_US
dc.type	doctoralThesis	en_US
dc.contributor.department	Işık Üniversitesi, Fen Bilimleri Enstitüsü, Bilgisayar Mühendisliği Doktora Programı	en_US
dc.contributor.authorID	0000-0001-6664-515X
dc.relation.publicationcategory	Tez	en_US
dc.contributor.institutionauthor	Okutan, Ahmet	en_US

Files in this item:

Name: Ahmet_Okutan.pdf
Size: 1.268Mb
Format: PDF
Description: DoctoralThesis

This item appears in the following collection(s).

FBE - Tez Koleksiyonu | Bilgisayar Mühendisliği / Computer Engineering [4]
Contains the thesis collection of the Computer Engineering doctoral program.
