Posts

Showing posts from August, 2014

Association rule learning (slides)

Association rule learning is a popular approach for extracting rules from large databases. Initially intended for transactional data, especially market basket analysis, the method can be applied to any binary or binarized data. In these slides, we outline the approach and present a basic algorithm for generating association rules from data. We highlight how the settings (minimum support and minimum confidence) reduce the search space, and thus the amount of computation.
Keywords: association rule, association rules, itemset, frequent itemset, eclat algorithm, support, confidence, lift
Components (Tanagra): A PRIORI, A PRIORI MR, A PRIORI PT, FREQUENT ITEMSETS, SPV ASSOC RULE, SPV ASSOC TREE
Slides: Association rule learning
References: Wikipedia, "Association Rule Learning". M. Zaki, S. Parthasarathy, M. Ogihara, W. Li, "New Algorithms for Fast Discovery of Association Rules", in Proc. of KDD'97, p. 283-296,
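As an illustration of the support, confidence and lift measures listed in the keywords, here is a minimal Python sketch of a level-wise frequent itemset search followed by rule generation. It is not the algorithm from the slides or the Tanagra components; the transactions and the function names (`support`, `frequent_itemsets`, `rules`) are hypothetical and chosen only for this example.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: only itemsets that are already frequent are extended."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = support(itemset, transactions)
        # Candidates: unions of two frequent k-itemsets that form a (k+1)-itemset.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return result


def rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    out = []
    for itemset, supp in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), k)):
                consequent = itemset - antecedent
                confidence = supp / frequent[antecedent]
                lift = confidence / frequent[consequent]
                if confidence >= min_confidence:
                    out.append((antecedent, consequent, supp, confidence, lift))
    return out


if __name__ == "__main__":
    for min_support in (0.4, 0.6):
        found = frequent_itemsets(transactions, min_support)
        print(min_support, len(found), "frequent itemsets")
    for a, c, s, conf, lift in rules(frequent_itemsets(transactions, 0.4), 0.6):
        print(sorted(a), "->", sorted(c), round(s, 2), round(conf, 2), round(lift, 2))
```

On these toy transactions, raising the minimum support from 0.4 to 0.6 reduces the number of frequent itemsets from 10 to 5, which illustrates how the settings prune the search space before any rule is generated.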

ROC curve (slides)

The ROC curve is a graphical tool for the evaluation and comparison of binary classifiers. It provides a more complete evaluation than the confusion matrix and the error rate. It remains valid even with a non-representative test set, i.e. when the observed class frequencies are not an estimate of the prior class probabilities. It is especially useful when we deal with class imbalance, and when the misclassification cost matrix is not well established. In these slides, we show: the ideas underlying the ROC curve; the construction of the curve from a dataset; the calculation of the AUC (area under the curve), a synthetic indicator derived from the ROC curve; and the use of the ROC curve for model comparison.
Keywords: receiver operating characteristic, roc curve, auc, area under curve, binary classifier, evaluation, model comparison, class probability estimate, score
Components (Tanagra): SCORING, ROC CURVE
Slides: ROC curve
References: Wikipedia, "Receiver Operating Characteristic
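The construction of the curve from a scored dataset and the calculation of the AUC can be sketched as follows. This is a minimal pure-Python illustration, not the Tanagra ROC CURVE component; the labels, scores and function names are hypothetical, and the AUC is computed with the trapezoidal rule.

```python
def roc_curve(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the score threshold.

    labels: 1 for the positive class, 0 for the negative class.
    scores: higher values mean "more likely positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only when the next score differs, so ties share one point.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical test sample: true classes and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

if __name__ == "__main__":
    curve = roc_curve(labels, scores)
    print(curve)
    print("AUC =", auc(curve))
```

Because the curve depends only on the ranking of the scores within each class, it is unaffected by the proportion of positives in the test set, which is why it remains usable with a non-representative sample.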

Customer targeting (slides)

Customer targeting is one component of direct marketing. The aim is to identify the customers who are most likely to be interested in a new product. We are in a data mining context because we build a classifier from a learning sample. However, we do not want to classify the instances; we want to estimate each individual's probability of buying the product, i.e. their score or propensity to purchase. In this context, we use a specific tool, the gain chart (or cumulative lift curve), to assess the effectiveness of the analysis. In these slides, we detail the overall process. We emphasize how to read the gain chart, especially how to transpose a reading made on a labeled sample to the customer database (for which the values of the target attribute are unknown).
Keywords: customer targeting, direct marketing, scoring, score, propensity to purchase
Components (Tanagra): SCORING, LIFT CURVE
Slides: Customer targeting
References: Microsoft, "Lift chart (Analysis S
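To make the reading of a gain chart concrete, here is a minimal Python sketch that builds the chart from a labeled sample and then transposes one reading to a customer database. The sample, the function name `gain_chart` and the database figures are all hypothetical, and the transposition assumes that the labeled sample is representative of the database and that the number of potential buyers in the database can be estimated.

```python
def gain_chart(labels, scores):
    """Return (fraction targeted, fraction of positives reached) points.

    Individuals are ranked by decreasing score; ties keep their input order.
    """
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    n, n_pos = len(ranked), sum(ranked)
    points = [(0.0, 0.0)]
    reached = 0
    for i, label in enumerate(ranked, start=1):
        reached += label
        points.append((i / n, reached / n_pos))
    return points


# Hypothetical labeled sample: 1 = bought the product, 0 = did not.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]

points = gain_chart(labels, scores)

# Reading on the labeled sample: share of buyers reached in the top 30%.
targeted, gain = points[3]

# Transposition to the database (assumed figures, for illustration only).
database_size = 10_000        # hypothetical number of customers
assumed_buyers = 300          # hypothetical estimate of potential buyers
customers_to_contact = targeted * database_size
expected_buyers_reached = gain * assumed_buyers

if __name__ == "__main__":
    print(f"Targeting {targeted:.0%} of the sample reaches {gain:.0%} of buyers")
    print(f"Contact about {customers_to_contact:.0f} customers, "
          f"expect about {expected_buyers_reached:.0f} buyers")
```

The chart is read in proportions precisely so that the reading can be carried over to a database whose size differs from that of the labeled sample.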

Descriptive discriminant analysis (slides)

Descriptive discriminant analysis (DDA), or canonical discriminant analysis, is a statistical approach that performs a multivariate characterization of the differences between groups. It is related to other factorial approaches such as principal component analysis and canonical correlation analysis. In these slides, we present the key points of the approach and how to read the results. We also show how descriptive discriminant analysis relates to predictive discriminant analysis (linear discriminant analysis), which, however, relies on restrictive statistical assumptions.
Keywords: discriminant analysis, descriptive discriminant analysis, canonical discriminant analysis, predictive discriminant analysis, correlation ratio, R, lda package MASS, sas, proc candisc
Components (Tanagra): CANONICAL DISCRIMINANT ANALYSIS
Slides: DDA
Dataset: wine_quality.xls
References: SAS, "CANDISC procedure".
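The correlation ratio listed in the keywords is the criterion that a discriminant axis seeks to maximize: the share of a variable's total variation that is explained by the group means. The Python sketch below only illustrates that criterion on hypothetical toy data; it does not reproduce the Tanagra component, and the direction `x1 - x2` is picked by hand here, whereas a full analysis would obtain it from an eigen-decomposition.

```python
def correlation_ratio(values, groups):
    """Between-group sum of squares divided by total sum of squares."""
    overall = sum(values) / len(values)
    total_ss = sum((v - overall) ** 2 for v in values)
    between_ss = 0.0
    for g in set(groups):
        members = [v for v, gg in zip(values, groups) if gg == g]
        group_mean = sum(members) / len(members)
        between_ss += len(members) * (group_mean - overall) ** 2
    return between_ss / total_ss


# Hypothetical toy data: two groups described by two variables.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
x1 = [1, 2, 3, 4, 2, 3, 4, 5]
x2 = [2, 3, 4, 5, 1, 2, 3, 4]

# A hand-picked linear combination of the two variables (assumption).
z = [a - b for a, b in zip(x1, x2)]

if __name__ == "__main__":
    print("x1:", correlation_ratio(x1, groups))
    print("x2:", correlation_ratio(x2, groups))
    print("x1 - x2:", correlation_ratio(z, groups))
```

Neither variable separates the groups well on its own, yet their difference separates them completely, which is the kind of multivariate characterization that a univariate comparison of means would miss.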