Posts

Connecting Sipina and Excel using OLE

The connection between a data mining tool and Excel (and more generally spreadsheet) is a very important issue. We had addressed many times this topic in our tutorials. With hindsight, I think the solution based on add-ins for Excel is the best one, both for SIPINA and for TANAGRA . It is simple, reliable and highly efficient. It does not require developing specific versions. The connection with Excel is a simple additional functionality of the standard distribution. Prior to reaching this solution, we had explored different trails. In this tutorial, we present the XL-SIPINA software based on Microsoft's OLE technology. At the opposite of the add-in solution, this version of SIPINA chooses to embed Excel into the Data Mining tool. The system works rather well. Nevertheless, it has finally been dropped for two reasons: (1) we were forced to compile special versions that work only if Excel is installed on the user's machine; (2) the transferring time between Excel and Sipina usi...

Sipina add-in for Excel

The data importation is a bottleneck for Data Mining Tools. The majority of users are working with a spreadsheet tool such as Excel, mainly in the coupling with specialized software for data mining (see KDnuggets polls ). Therefore, a recurring issue for users is "how to send my data from Excel to SIPINA?" It is possible to import different types of formats into SIPINA. About Excel workbooks, one particular device has been implemented. An add-in is automatically copied to the computer during the installation process. It must be integrated into Excel. The add-in incorporates a new menu into Excel. After selecting the data range, the user only has to activate it, this leads to the following: (1) SIPINA starts automatically, (2) the data are transferred via the clipboard and (3) SIPINA considers the first row of the range of cells corresponds to the names of variables, (4) columns with numerical values of the variables are quantitative (5) columns with alphanumeric values are ca...

Tanagra add-in for Office 2007 and Office 2010

The "tanagra.xla" add-in for Excel contributes to the wide diffusion of Tanagra. The principle is simple. It is to embed a Tanagra menu in Excel. Thus the user can run statistical calculations without having to leave the spreadsheet. It seems simplistic. But this feature facilitates immensely the work of data miner. Indeed, the spreadsheet is one of the most used tools for preparing dataset (see KDNuggets Polls: Tools / Languages for Data Cleaning - 2008). By embedding the data mining tool in the spreadsheet environment, it avoids to the practitioner the tedious and repetitive manipulations: importing the dataset, exporting the dataset, checking the compatibilities between data file formats, etc. The installation and the use of the "tanagra.xla" add-in under the previous versions of Office are described elsewhere (Office 1997 to Office 2003). This description is obsolete for the latest version of Office because the organization of the menus is modified for these v...

Naive bayes classifier for discrete predictors

The naive bayes approach is a supervised learning method which is based on a simplistic hypothesis: it assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Yet, despite this, it appears robust and efficient. Its performance is comparable to other supervised learning techniques. We introduce in Tanagra ( version 1.4.36 and later) a new presentation of the results of the learning process. The classifier is easier to understand, and its deployment is also made easier. In the first part of this tutorial, we present some theoretical aspects of the naive bayes classifier. Then, we implement the approach on a dataset with Tanagra. We compare the obtained results (the parameters of the model) to those obtained with other linear approaches such as the logistic regression, the linear discriminant analysis and the linear SVM. We note that the results are highly consistent. This largely explains the good perform...

Interactive decision tree learning with Spad

In this tutorial, we will be interested in SPAD. This is a French software specialized in exploratory data analysis which evolved much these last years. We would perform a sequence of analysis from a dataset collected into 3 worksheets of a Excel data file: (1) we create a classification tree from the learning sample into the first worksheet, we try to analyze deeply some nodes of the tree to highlight the characteristics of covered instances, we try also to modify interactively (manually) the properties of some splitting operation; (2) we apply the classifier on unseen cases of the second worksheet; (3) we compare the prediction of the model with the actual values of the target attribute contained into the third worksheet. Of course, we can perform this process using free tools such as SIPINA (the interactive construction of the tree) or R (the programming of the sequence of operations, in particular the applying of the model on unlabeled dataset). But with Spad or other commercial to...

Supervised learning from imbalanced dataset

In real problems, the classes are not equally represented in dataset. The instances corresponding to positive class, the one that we want to detect often, are few. For instance, in a fraud detection problem, there are a very few cases of fraud comparing to the large number of honest connections; in a medical problem, the ill persons are fortunately rare; etc. In these situations, using the standard learning process and assessing the classifier with the confusion matrix and the misclassification rate are not appropriate. We observe that the default classifier consisting to assign all the instances to the majority class is the one which minimizes the error rate. For the dataset that we analyze in this tutorial, 1.77% of all the examples belong to the positive class. If we assign all the instances to the negative class - this is the default classifier - the misclassification rate is 1.77%. It is difficult to find a classifier which is able to do better. Even if we know that we have not a ...

Handling large dataset in R - The "filehash" package

The processing of very large datasets is a crucial problem in data mining. To handle them, we must avoid to load the whole dataset into memory. The idea is quite simple: (1) we write all or a part of the dataset on the disk in a binary file format to allow a direct access; (2) the machine learning algorithms must be modified to efficiently access the values stored on the disk. Thus, the characteristics of the computer are no longer a bottleneck for the handling of a large dataset. In this tutorial, we describe the great "filehash" package for R. It allows to copy (to dump) any kind of R objects into a file. We can handle these objects without loading them into main memory. This is especially useful for the data frame object. Indeed, we can perform a statistical analysis with the usual functions directly from a database on the disk. The processing capacities are vastly improved and, in the same time, we will note that the increase in computation time remains moderate. To evalu...