Showing posts from February, 2014

Introduction to Decision Trees

Here are the lecture notes I use for my course “Introduction to Decision Trees”. The basic concepts of the decision tree algorithm are described. The underlying method is rather similar to the CHAID approach.
Keywords: machine learning, supervised methods, decision tree learning, classification tree
Slides: Introduction to Decision Trees
References:
T. Mitchell, "Decision Tree Learning", in "Machine Learning", McGraw Hill, 1997; Chapter 3, pp. 52-80.
L. Rokach, O. Maimon, "Decision Trees", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 9, pp. 165-192.
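To give a flavor of how a CHAID-like method picks a split, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the method from the slides: it scores each categorical attribute by the chi-square statistic of its contingency table with the class attribute and keeps the highest score. A real CHAID implementation compares adjusted p-values (which account for differing degrees of freedom) and merges similar categories. The function names and the toy dataset are hypothetical.

```python
from collections import Counter


def chi_square(xs, ys):
    """Chi-square statistic of independence between two categorical lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    row = Counter(xs)              # marginal counts of the attribute
    col = Counter(ys)              # marginal counts of the class
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def best_split(rows, attributes, target):
    """Return the attribute most associated with the target, plus all scores."""
    ys = [r[target] for r in rows]
    scores = {a: chi_square([r[a] for r in rows], ys) for a in attributes}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): "outlook" determines "play", "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]

best, scores = best_split(rows, ["outlook", "windy"], "play")
print(best, scores)
```

Comparing raw statistics is only fair here because both attributes have the same number of categories; with attributes of different cardinality, a p-value or a normalized measure would be the safer criterion.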

Introduction to Supervised Learning

Here are the lecture notes I use for my course “Introduction to Supervised Learning”. The presentation is very simplified, but all the important elements are described: the goal of the supervised learning process, the Bayes rule, and the evaluation of models using the confusion matrix.
Keywords: machine learning, supervised methods, model, classifier, target attribute, class attribute, input attributes, descriptors, Bayes rule, confusion matrix, error rate, sensitivity, precision, specificity
Slides: Introduction to Supervised Learning
References:
O. Maimon, L. Rokach, "Introduction to Supervised Methods", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 8, pp. 149-164.
T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning", Springer, 2009.
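As a small illustration of the evaluation measures listed in the keywords, the following Python sketch builds a binary confusion matrix and derives the error rate, sensitivity, precision and specificity from it. The function names and the toy labels are assumptions made for this example; they do not come from the slides.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard indicators derived from the confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "error_rate": (fp + fn) / total,   # share of misclassified instances
        "sensitivity": tp / (tp + fn),     # recall on the positive class
        "precision": tp / (tp + fp),       # reliability of positive predictions
        "specificity": tn / (tn + fp),     # recall on the negative class
    }


# Toy labels (hypothetical)
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]

counts = confusion_counts(actual, predicted, positive="yes")
result = metrics(*counts)
print(counts, result)
```

The sketch assumes each denominator is non-zero; a classifier that never predicts the positive class, for instance, would need a guard on the precision.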

Cluster analysis for mixed data

The aim of clustering is to group the instances of a dataset into a set of clusters. Instances in the same cluster are similar according to a similarity (or dissimilarity) measure; instances in distinct clusters are different. The choice of this measure, which is often a distance, therefore has a decisive influence on the result. Suitable measures are well known when all the attributes are of the same type: the Euclidean distance is often used for numeric variables, and the chi-square distance is more appropriate for categorical variables. The problem is much more complicated when we deal with mixed data, i.e. with both numeric and categorical variables. It is admittedly possible to define a measure that handles the two kinds of variables simultaneously, but we then run into a weighting problem: we must define a weighting system that balances the influence of the attributes, because the results must not depend on the type of the variables. This is not ea