Posts

Showing posts from July, 2010

Naive bayes classifier for discrete predictors

The naive bayes approach is a supervised learning method which is based on a simplistic hypothesis: it assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Yet, despite this, it appears robust and efficient. Its performance is comparable to other supervised learning techniques. We introduce in Tanagra ( version 1.4.36 and later) a new presentation of the results of the learning process. The classifier is easier to understand, and its deployment is also made easier. In the first part of this tutorial, we present some theoretical aspects of the naive bayes classifier. Then, we implement the approach on a dataset with Tanagra. We compare the obtained results (the parameters of the model) to those obtained with other linear approaches such as the logistic regression, the linear discriminant analysis and the linear SVM. We note that the results are highly consistent. This largely explains the good perform

Interactive decision tree learning with Spad

In this tutorial, we will be interested in SPAD. This is a French software specialized in exploratory data analysis which evolved much these last years. We would perform a sequence of analysis from a dataset collected into 3 worksheets of a Excel data file: (1) we create a classification tree from the learning sample into the first worksheet, we try to analyze deeply some nodes of the tree to highlight the characteristics of covered instances, we try also to modify interactively (manually) the properties of some splitting operation; (2) we apply the classifier on unseen cases of the second worksheet; (3) we compare the prediction of the model with the actual values of the target attribute contained into the third worksheet. Of course, we can perform this process using free tools such as SIPINA (the interactive construction of the tree) or R (the programming of the sequence of operations, in particular the applying of the model on unlabeled dataset). But with Spad or other commercial to

Supervised learning from imbalanced dataset

In real problems, the classes are not equally represented in dataset. The instances corresponding to positive class, the one that we want to detect often, are few. For instance, in a fraud detection problem, there are a very few cases of fraud comparing to the large number of honest connections; in a medical problem, the ill persons are fortunately rare; etc. In these situations, using the standard learning process and assessing the classifier with the confusion matrix and the misclassification rate are not appropriate. We observe that the default classifier consisting to assign all the instances to the majority class is the one which minimizes the error rate. For the dataset that we analyze in this tutorial, 1.77% of all the examples belong to the positive class. If we assign all the instances to the negative class - this is the default classifier - the misclassification rate is 1.77%. It is difficult to find a classifier which is able to do better. Even if we know that we have not a