
Showing posts with the label Decision tree

Gradient boosting with R and Python

This tutorial follows the course material devoted to “Gradient Boosting”, to which we refer constantly in this document. It also complements the slides and tutorials for the Bagging, Random Forest and Boosting approaches (see References). The outline is basic: after importing the data, which are split in advance into two data files (learning and testing), we build predictive models and evaluate them. The test error rate criterion is used to compare the performance of the various classifiers. The question of the parameters, particularly sensitive in the context of gradient boosting, is studied. Indeed, there are many parameters, and their influence on the behavior of the classifier is considerable. Unfortunately, even if we can guess which paths to explore to improve the quality of the models (more or less regularization), accurately identifying the parameters to modify and setting the right values is difficult, especially because the various parameters can interact with eac...
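As a minimal illustration of the mechanism studied in the tutorial (this is not the tutorial's own code, which relies on R and scikit-learn), the sketch below implements gradient boosting for squared loss with one-variable regression stumps in plain Python. The function names and the toy data are invented for the example; note that two of the sensitive parameters discussed above (the number of trees and the shrinkage / learning rate) appear explicitly.

```python
def fit_stump(x, r):
    """Best one-split regression stump (least squares) fitted to residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the
    current residuals (the negative gradient) and added with shrinkage."""
    f0 = sum(y) / len(y)                      # initial constant model
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
        stumps.append(stump)
    return lambda xi: f0 + learning_rate * sum(s(xi) for s in stumps)
```

Lowering `learning_rate` slows the fit to the residuals, which usually has to be compensated by a larger `n_trees` — a typical example of the parameter interaction mentioned above.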

Cost-Sensitive Learning (slides)

This course material presents approaches for taking misclassification costs into account in supervised learning. The baseline method is the one which does not take the costs into account. Two issues are studied: the metric used for the evaluation of the classifier when a misclassification cost matrix is provided, i.e. the expected cost of misclassification (ECM); and some approaches which guide the machine learning algorithm towards the minimization of the ECM. Keywords: cost matrix, misclassification, expected cost of misclassification, bagging, metacost, multicost. Slides: Cost-Sensitive Learning. References: Tanagra Tutorial, "Cost-sensitive learning - Comparison of tools", March 2009. Tanagra Tutorial, "Cost-sensitive decision tree", November 2008.
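To make the ECM criterion concrete, here is a small sketch (the confusion counts and the costs are invented for illustration): the ECM is the average, over the test sample, of the cost attached to each (actual class, predicted class) pair.

```python
# Hypothetical 2-class example: counts of (actual, predicted) pairs on a
# test sample of 100 instances, and the associated misclassification costs.
confusion = {("pos", "pos"): 40, ("pos", "neg"): 10,
             ("neg", "pos"): 5,  ("neg", "neg"): 45}
cost = {("pos", "pos"): 0, ("pos", "neg"): 5,   # missing a positive costs 5
        ("neg", "pos"): 1, ("neg", "neg"): 0}   # a false alarm costs only 1

n = sum(confusion.values())
# Expected cost of misclassification: cost-weighted average over the sample
ecm = sum(confusion[k] * cost[k] for k in confusion) / n
```

With these invented numbers, the 10 missed positives dominate the criterion even though there are more correct predictions overall — which is exactly why a cost-aware learner may prefer a classifier with a worse raw error rate but a lower ECM.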

Hyper-threading and solid-state drive

After more than 6 years of good and faithful service, I decided to change my computer. It must be said that the former one (Intel Core 2 Quad Q9400 2.66 GHz - 4 cores - running Windows 7 - 64 bit) began to make disturbing sounds. I had to play music to cover the rumbling of the beast and be able to work quietly. Choosing the new computer was another matter. I am past the age of the race for power, which is necessarily fruitless anyway given the rapid evolution of PCs. Nevertheless, I was sensitive to two aspects that I could not evaluate previously: Is the hyper-threading technology effective for multithreaded data mining algorithms? Does the use of temporary files to relieve memory occupation take advantage of SSD technology? The new PC runs under Windows 8.1 (I wrote the French version of this tutorial one year ago). The processor is a Core i7 4770S (3.1 GHz). It has 4 physical cores but 8 logical cores with the hyper-threading technology. The sy...

Random Forest - Boosting with R and Python

This tutorial follows the slideshow devoted to "Bagging, Random Forest and Boosting". We show the implementation of these methods on a data file. We follow the same steps as the slideshow, i.e. we first describe the construction of a decision tree, we measure its prediction performance, and then we see how ensemble methods can improve the results. Various aspects of these methods are highlighted: the measure of variable importance, the influence of the parameters, the influence of the characteristics of the underlying classifier (e.g. controlling the tree size), etc. As a first step, we focus on R (rpart, adabag and randomForest packages) and Python (scikit-learn package). Programming lets us multiply the analyses; among others, we can evaluate the influence of the parameters on the performance. As a second step, we explore the capabilities of software (Tanagra and Knime) providing turnkey solutions, very simple to implement, more accessible for peopl...
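As a minimal, self-contained illustration of the bagging scheme discussed here (not the tutorial's code, which uses rpart, adabag and scikit-learn), the sketch below bags decision stumps over bootstrap samples and aggregates them by majority vote; all names and the toy data are invented for the example.

```python
import random

def fit_stump(sample):
    """Best threshold stump on (x, y) pairs, y in {0, 1}: lowest error count."""
    best = None
    for t, _ in sample:
        for left in (0, 1):
            errors = sum((left if x <= t else 1 - left) != y
                         for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, t, left)
    _, t, left = best
    return lambda x: left if x <= t else 1 - left

def bagging(data, n_models=25, seed=1):
    """Fit each stump on a bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    models = [fit_stump([rng.choice(data) for _ in data])   # bootstrap draw
              for _ in range(n_models)]
    return lambda x: round(sum(m(x) for m in models) / n_models)
```

Replacing the stump with a full (unpruned) tree, as the tutorial does, is what makes bagging really pay off: the base learner must be unstable enough for the bootstrap replicates to produce diverse models.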

Bagging, Random Forest, Boosting (slides)

This course material presents ensemble methods: bagging, random forest and boosting. These approaches are based on the same guiding idea: a set of base classifiers, produced by a single learning algorithm, is fitted to different versions of the dataset. For bagging and random forest, the models are fitted independently on bootstrap samples. Random forest incorporates an additional mechanism in order to “decorrelate” the models, which are necessarily decision trees. Boosting works in a sequential fashion: the model at step (t) is fitted to a weighted version of the sample in order to correct the errors of the model learned at the preceding step (t-1). Keywords: bagging, boosting, random forest, decision tree, rpart package, adabag package, randomForest package, R software. Slides: Bagging - Random Forest - Boosting. References: Breiman L., "Bagging Predictors", Machine Learning, 26, p. 123-140, 1996. Breiman L., "Random Forests", Machine Learning, 45, p. 5-32,...
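The sequential reweighting described for boosting can be sketched as follows: a bare-bones AdaBoost with threshold stumps. The slides do not necessarily use this exact variant, and all names and data here are illustrative.

```python
import math

def fit_weighted_stump(data, w):
    """Threshold stump minimizing the *weighted* error on (x, y), y in {0, 1}."""
    best = None
    for t, _ in data:
        for left in (0, 1):
            err = sum(wi for (x, y), wi in zip(data, w)
                      if (left if x <= t else 1 - left) != y)
            if best is None or err < best[0]:
                best = (err, t, left)
    err, t, left = best
    return err, (lambda x: left if x <= t else 1 - left)

def adaboost(data, n_rounds=10):
    n = len(data)
    w = [1.0 / n] * n                       # start from a uniform weighting
    models = []
    for _ in range(n_rounds):
        err, h = fit_weighted_stump(data, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        # step (t): increase the weight of the points misclassified at (t-1)
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for (x, y), wi in zip(data, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        models.append((alpha, h))
    def predict(x):
        score = sum(a * (1 if h(x) == 1 else -1) for a, h in models)
        return 1 if score > 0 else 0
    return predict
```

The contrast with bagging is visible in the code: the rounds cannot run independently, because the weights used at step (t) depend on the model fitted at step (t-1).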

Clustering tree (slides)

The clustering tree algorithm is both a clustering approach and a multi-objective supervised learning method. In the cluster analysis framework, the aim is to group objects into clusters, where the objects in the same cluster are similar in a certain sense. The clustering tree algorithm makes it possible to perform this kind of task. We obtain a decision tree as the clustering structure; thus, the deployment of the classification rule in the information system is really easy. But we can also consider the clustering tree as an extension of the classification/regression tree, because we can distinguish two sets of variables: the explained (active) variables, which are used to determine the similarities between the objects; and the predictive (illustrative) variables, which allow us to describe the groups. In these slides, we show the main features of this approach. Keywords: cluster analysis, clustering, clustering tree, groups characterization. Slides: Clustering tree. References: M. Chavent (1998), « A monoth...

Decision tree learning algorithms

Here are the slides I use for my course about existing decision tree learning algorithms. Only the most popular ones are described: C4.5, CART and CHAID (a variant). The differences between these approaches are highlighted according to: the splitting measure; the merging strategy during the splitting process; the approach for determining the right-sized tree. Keywords: machine learning, supervised methods, decision tree learning, classification tree, chaid, cart, c4.5. Slides: C4.5, CART and CHAID. References: L. Breiman, J. Friedman, R. Olshen and C. Stone, “Classification and Regression Trees”, Wadsworth Int. Group, 1984. G. Kass, “An exploratory technique for Investigating Large Quantities of Categorical Data”, Applied Statistics, 29(2), 1980, pp. 119-127. R. Quinlan, “C4.5: Programs for Machine Learning”, Morgan Kaufmann, 1993.
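The splitting measures that distinguish these algorithms are easy to compute by hand. The sketch below contrasts the Shannon entropy behind C4.5's information gain with the Gini index used by CART, on an invented label sample; the function names are mine, not the slides'.

```python
from math import log2

def entropy(labels):
    """Shannon entropy: the impurity behind C4.5's information gain."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gini(labels):
    """Gini index: the impurity used by CART."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_gain(measure, parent, children):
    """Impurity decrease of a candidate split: the splitting criterion."""
    n = len(parent)
    return measure(parent) - sum(len(c) / n * measure(c) for c in children)
```

For a balanced binary node split into two mostly pure children, e.g. `impurity_gain(gini, [0]*5 + [1]*5, [[0, 0, 0, 0, 1], [1, 1, 1, 1, 0]])`, both measures rank candidate splits similarly in practice; the methods differ more in the merging strategy (CHAID) and the pruning approach (CART's cost-complexity versus C4.5's pessimistic pruning) than in the raw impurity formula.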

Introduction to Decision Trees

Here are the lecture notes I use for my course "Introduction to Decision Trees". The basic concepts of the decision tree algorithm are described. The underlying method is rather similar to the CHAID approach. Keywords: machine learning, supervised methods, decision tree learning, classification tree. Slides: Introduction to Decision Trees. References: T. Mitchell, "Decision Tree Learning", in "Machine Learning", McGraw Hill, 1997; Chapter 3, pp. 52-80. L. Rokach, O. Maimon, "Decision Trees", in "The Data Mining and Knowledge Discovery Handbook", Springer, 2005; Chapter 9, pp. 165-192.

Revolution R Community 5.0

The R software is a fascinating project. It has become a reference tool for the data mining process. With the R package system, we can extend its features almost infinitely. Nearly all existing statistical / data mining techniques are available in R. But if there are many packages, there are very few projects which intend to improve the R core itself. The source code is freely available; in theory anyone can modify a part or even the whole software. Revolution Analytics proposes an improved version of R. It provides Revolution R Enterprise; it seems (according to their website) that: it dramatically improves the speed of some calculations; it can handle very large databases; it provides a visual development environment with a debugger. Unfortunately, this is a commercial tool and I could not check these features. Fortunately, a community version is available. Of course, I have downloaded the tool to study its behavior. Revolution R Community is a slightly improved version of the...

Tanagra - Version 1.4.45

New features for the principal component analysis (PCA). PRINCIPAL COMPONENT ANALYSIS. Additional outputs for the component: scree plot and cumulative explained-variance curve. PCA Correlation Matrix - some outputs are provided for the detection of the significant factors (Kaiser-Guttman, Karlis-Saporta-Spinaki, Legendre-Legendre broken-stick test). PCA Correlation Matrix - Bartlett's sphericity test is performed and Kaiser's measure of sampling adequacy (MSA) is calculated. PCA Correlation Matrix - the correlation matrix and the partial correlations between each pair of variables controlling for all the other variables (the negative anti-image correlation) are produced. PARALLEL ANALYSIS. The component calculates the distribution of the eigenvalues for a set of randomly generated data. It proceeds by randomization. It applies to the principal component analysis and the multiple correspondence analysis. A factor is considered significant if its observed eigenvalue is greater than t...
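The randomization idea behind parallel analysis can be sketched in a few lines of NumPy. This is not Tanagra's implementation (its exact thresholding rule is not given here); the function name, the quantile choice and the test data are assumptions for the illustration.

```python
import numpy as np

def parallel_analysis(X, n_iter=200, quantile=0.95, seed=0):
    """Parallel analysis sketch: a factor is retained when its observed
    correlation-matrix eigenvalue exceeds the chosen quantile of the
    eigenvalues obtained on random data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    random_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        R = rng.standard_normal((n, p))     # same shape, but no structure
        random_eigs[i] = np.sort(
            np.linalg.eigvalsh(np.corrcoef(R, rowvar=False)))[::-1]
    threshold = np.quantile(random_eigs, quantile, axis=0)
    return observed, threshold, observed > threshold
```

The point of the randomization is that the threshold adapts to the dimensions (n, p) of the dataset, unlike the fixed Kaiser-Guttman cut-off at eigenvalue 1.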

Using PDI-CE for model deployment (PMML)

Model deployment is a crucial task of the data mining process. In supervised learning, it can be the application of the predictive model to new unlabeled cases. We have already described this task for various tools (e.g. Tanagra, Sipina, Spad, R). They have as a common feature the use of the same tool for the model construction and the model deployment. In this tutorial, we describe a process where we do not use the same tool for the model construction and the model deployment. This is only possible if (1) the model is described in a standard format, and (2) the tool used for the deployment can handle both the database with the unlabeled instances and the model. Here, we use the PMML standard for sharing the model, and PDI-CE (Pentaho Data Integration Community Edition) for applying the model to the unseen cases. We create a decision tree with various tools such as SIPINA, KNIME or RAPIDMINER; we export the model in the PMML format; then, we use PDI-CE for ...
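For readers unfamiliar with the format, here is an illustrative, hand-written fragment of a PMML TreeModel encoding a one-split decision tree — it is not the output of any of the tools above, and the field names and split value are invented:

```xml
<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header description="Illustrative one-split decision tree"/>
  <DataDictionary numberOfFields="2">
    <DataField name="petal_length" optype="continuous" dataType="double"/>
    <DataField name="species" optype="categorical" dataType="string">
      <Value value="setosa"/>
      <Value value="other"/>
    </DataField>
  </DataDictionary>
  <TreeModel modelName="example_tree" functionName="classification">
    <MiningSchema>
      <MiningField name="petal_length"/>
      <MiningField name="species" usageType="predicted"/>
    </MiningSchema>
    <Node score="other">
      <True/>
      <Node score="setosa">
        <SimplePredicate field="petal_length" operator="lessThan" value="2.45"/>
      </Node>
      <Node score="other">
        <SimplePredicate field="petal_length" operator="greaterOrEqual" value="2.45"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
```

Because the tree is described declaratively like this, any PMML-aware consumer — PDI-CE here — can score unseen rows without knowing which tool built the model.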

Sipina - Version 3.8

The tools (SIPINA RESEARCH, REGRESS and the ASSOCIATION RULE SOFTWARE) included in the SIPINA distribution have been updated with some improvements. SIPINA.XLA: the add-in for Excel can now work with both the 32-bit and 64-bit versions of EXCEL. Importation of text data files: processing time has been improved. This improvement also reduces the transfer time when we use the SIPINA.XLA add-in for Excel (which uses a temporary file in the text file format). Association rule software: the GUI has been simplified; the display of the rules is made more readable. Because they are internally based on the FastMM memory manager, these tools can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows. The processing capabilities are improved. Keywords: sipina, decision tree induction, association rule, multiple linear regression. Sipina website: Sipina. Download: Setup file. References: Tanagra - SIPINA add-in for Excel. Tanagra - Tanagra add-in for Excel 2007 and 2010. Delp...

Dealing with very large dataset (continuation)

Because I have recently updated my operating system (OS), I wondered how the 64-bit versions of Knime 2.4.2 and RapidMiner 5.1.011 would handle a very large dataset, one which cannot be loaded into main memory on a 32-bit OS. This article completes a previous study where we dealt with a moderately sized dataset of 500,000 instances and 22 variables. Here, we handle a dataset with 9,634,198 instances and 41 variables. We have already used this dataset in another tutorial, where we showed that we cannot perform a decision tree induction on this kind of database on a 32-bit OS without a swapping system such as the one implemented in SIPINA. We note that Tanagra can handle the dataset, but this is because it encodes the values of the categorical attributes with a single byte, so the memory occupation remains moderate. In this tutorial, I analyze the behavior of the 64-bit Knime and RapidMiner on this database. I use a 64-bit OS and tools, but I have "only" 4 GB of available memory on my perso...

Decision tree and large dataset (continuation)

One of the exciting aspects of computing is that things change very quickly. The machines are ever more efficient, the operating systems are improved, and so is the software. Since writing an old tutorial about the induction of a decision tree on a large dataset, I have a new computer and I use a 64-bit OS (Windows 7). Some of the tools studied propose a 64-bit version (Knime, RapidMiner, R). I wondered how the various tools behave in this new context, so I repeated the same experiment. We note that a more efficient computer improves the computation time (by about 20%). The specific gain for a 64-bit version is relatively low, but it is real (about 10%). And some tools have clearly improved their programming of the decision tree induction (Knime, RapidMiner). On the other hand, we observe that the memory occupation remains stable for most of the tools in the new context. Keywords: c4.5, decision tree, large dataset, wave dataset, knime2.4.2, orange 2.0b, r 2.13.2, rapi...

New GUI for RapidMiner 5.0

RapidMiner is a very popular data mining tool. It is (one of) the most used by data miners according to the annual Kdnuggets polls (2011, 2010, 2009, 2008, 2007). There are two versions. We describe here the Community Edition, which is freely downloadable from the editor's website. The new RapidMiner 5.0 has a new graphical user interface which is very similar to that of Knime. The organization of the workspace is the same. The sequence of data processing (using operators) is described with a diagram called a "process" in the RapidMiner documentation. In fact, with this version 5.0, RapidMiner adopts the presentation used by the vast majority of data mining software. Some features are shared with many tools, among others: the connection to the R software; the meta-nodes, which implement a loop or a standard succession of operations; and the description of the method underlying each operator, displayed continuously in the right-hand part of the main window. RapidMiner 5.0 having evolved substantiall...

Data Mining with R - The Rattle Package

R (http://www.r-project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is absolutely justified (see Kdnuggets Polls - Data Mining / Analytic Tools Used - 2011). Among the reasons which explain this success, we distinguish two very interesting characteristics: (1) we can extend almost indefinitely the features of the tool with the packages; (2) we have a programming language which makes it easy to perform sequences of complex operations. But this second property can also be a drawback. Indeed, some users do not want to learn a new programming language before being able to carry out projects. For this reason, tools which allow us to define the sequence of commands with diagrams (such as Tanagra, Knime, RapidMiner, etc.) still remain a valuable alternative for data miners. In this tutorial, we present the "Rattle" package, which allows data miners to use R without needing to know the associated programming langua...

Creating reports with Tanagra

The ability to automatically create reports from the results of an analysis is a valuable functionality for data mining, but it is rather an asset of professional tools. The programming of this kind of functionality is not really promoted in the academic domain: I do not think that I could publish a paper in a journal describing the ability of Tanagra to create attractive reports. This is the reason why the output of academic tools, such as R or Weka, is mainly plain formatted text. Tanagra, which is an academic tool, also provides text outputs, and the programming remains simple, as a glance at the source code shows. But, in order to make the presentation more attractive, it uses HTML to format the results. I take advantage of this special feature to generate reports without making a particular programming effort. Tanagra is one of the few academic tools able to produce reports that can easily be displayed in office automation software. For instance, th...

Multithreading for decision tree induction

Nowadays, most modern personal computers (PCs) have multicore processors: the computer operates as if it had multiple processors. Software and data mining algorithms must be modified in order to benefit from this new feature. Currently, few free tools exploit this opportunity, because it is impossible to define a generic approach that would be valid regardless of the learning method used: each existing learning algorithm must be modified. For a given technique, decomposing the algorithm into elementary tasks that can execute in parallel is a research field in itself. In a second step, we must adopt a programming technology which is easy to implement. In this tutorial, I propose a technology based on threads for the induction of decision trees. It is well suited to our context for various reasons: (1) it is easy to program with modern programming languages; (2) threads can share information and can also modify common objects, and efficient synchronization tools make it possible to avoid data corrup...
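The decomposition idea can be sketched as follows — this is not the tutorial's implementation, just a minimal illustration in which the candidate split variables of a tree node are scored in parallel by a thread pool. The names and the toy impurity measure are invented; note also that in CPython the GIL limits the actual speedup for pure-Python workloads, so this sketch shows the structure rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor

def split_score(data, j):
    """Toy score for splitting on variable j (lower is better): the
    misclassification count of the best threshold stump on column j."""
    values = sorted({row[j] for row, _ in data})
    best = len(data)
    for t in values:
        for left in (0, 1):
            errs = sum((left if row[j] <= t else 1 - left) != y
                       for row, y in data)
            best = min(best, errs)
    return best

def best_split_variable(data, n_vars, n_threads=4):
    """Score each candidate variable in its own task. The threads read
    `data` without modifying it, so the scoring needs no synchronization."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        scores = list(pool.map(lambda j: split_score(data, j), range(n_vars)))
    return min(range(n_vars), key=scores.__getitem__)
```

The shared read-only dataset and the per-task independence are exactly the properties that make split evaluation a natural unit of parallel work in decision tree induction; synchronization is only needed when the threads write back into a common tree structure.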

Connecting Sipina and Excel using OLE

The connection between a data mining tool and Excel (and, more generally, spreadsheets) is a very important issue. We have addressed this topic many times in our tutorials. With hindsight, I think the solution based on add-ins for Excel is the best one, both for SIPINA and for TANAGRA. It is simple, reliable and highly efficient. It does not require developing specific versions: the connection with Excel is a simple additional functionality of the standard distribution. Prior to reaching this solution, we had explored different trails. In this tutorial, we present the XL-SIPINA software, based on Microsoft's OLE technology. Unlike the add-in solution, this version of SIPINA chooses to embed Excel into the data mining tool. The system works rather well. Nevertheless, it was finally dropped for two reasons: (1) we were forced to compile special versions that work only if Excel is installed on the user's machine; (2) the transfer time between Excel and Sipina usi...

Interactive decision tree learning with Spad

In this tutorial, we are interested in SPAD, a French program specialized in exploratory data analysis which has evolved considerably over the last years. We perform a sequence of analyses on a dataset stored in 3 worksheets of an Excel data file: (1) we create a classification tree from the learning sample in the first worksheet, we analyze some nodes of the tree in depth to highlight the characteristics of the covered instances, and we also modify interactively (manually) the properties of some splitting operations; (2) we apply the classifier to the unseen cases of the second worksheet; (3) we compare the predictions of the model with the actual values of the target attribute contained in the third worksheet. Of course, we can perform this process using free tools such as SIPINA (the interactive construction of the tree) or R (the programming of the sequence of operations, in particular applying the model to an unlabeled dataset). But with Spad or other commercial to...