
Showing posts from February, 2012

Checking missing values in Tanagra

Up to version 1.4.41, Tanagra did not handle missing values. It seemed worthwhile to force students, who are the main users of Tanagra, to think about and propose the most appropriate solution given the characteristics of their dataset and the goal of their analysis. Tanagra therefore simply truncated the imported file at the first obstacle. This behavior often disconcerted users, especially since no error message was displayed. They wondered why the data were not properly loaded even though everything looked right. From version 1.4.42, the import of the tab-separated text file format and of the XLS file format (Excel 97-2003), as well as the data transfer using the add-in for Excel (up to Excel 2010) and for LibreOffice 3.5/OpenOffice 3.3, have been modified. Tanagra now reads all rows of the dataset, but it skips rows that are incomplete or inconsistent (e.g. a column contains a numeric value whereas it is a discrete attribute). And above all, an

Logistic regression on large dataset

Programming fast and reliable tools is a constant challenge for a computer scientist. In the data mining context, it leads to a better capacity to handle large datasets. When we build the final model that we want to deploy, speed is not really important. But in the exploratory phase, where we search for the best model, it is decisive: it improves our chances of obtaining the best model simply because we can try more configurations. I have tried many solutions to improve the computation time of logistic regression. In fact, I think the performance rests heavily on the optimization algorithm used. The source code of Tanagra shows that I hesitated a great deal. Some studies helped me make the right choice. Several tools offer logistic regression. It is interesting to compare their computation times and memory usage. I have already carried out this kind of comparison in the past. The novelty here is that I use a new operating system (64-bit version of Windows 7),

Tanagra - Version 1.4.42

The Tanagra.xla add-in for Excel now works with both the 32-bit and 64-bit versions of Excel. With the FastMM memory manager, Tanagra can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows. The processing capabilities, especially for handling large datasets, are improved. The import of the tab-delimited text file format and of the XLS file format (Excel 97-2003) is made safer. Previously, the import was interrupted and the dataset truncated when an invalid line (with missing or inconsistent values) was read. Now, Tanagra skips the line and continues with the next rows. The number of skipped lines is given in the import report. Download page: setup
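To make the "skip the line and continue" behavior concrete, here is a minimal Python sketch of a tab-delimited import that skips invalid rows and counts them. It is not Tanagra's actual code: the function names (`is_number`, `load_skipping_invalid_rows`) are hypothetical, and the consistency rule (each column's type is inferred from the first complete row, and later rows must match it) is an assumption made for the example.

```python
import csv
import io


def is_number(text):
    """Return True if the field can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def load_skipping_invalid_rows(stream):
    """Read tab-separated data, skipping incomplete or inconsistent rows.

    Assumption for this sketch: column types (numeric vs. discrete) are
    inferred from the first complete row; a later row is inconsistent
    when one of its fields does not match the type of its column.
    Returns (header, rows, skipped).
    """
    reader = csv.reader(stream, delimiter="\t")
    header = next(reader)
    rows = []
    skipped = 0
    column_is_numeric = None
    for fields in reader:
        # Incomplete row: wrong number of fields or an empty field.
        complete = len(fields) == len(header) and all(f.strip() for f in fields)
        if not complete:
            skipped += 1
            continue
        kinds = [is_number(f) for f in fields]
        if column_is_numeric is None:
            # The first complete row fixes the type of each column.
            column_is_numeric = kinds
        elif kinds != column_is_numeric:
            # Inconsistent row: e.g. a numeric value in a discrete column.
            skipped += 1
            continue
        rows.append(fields)
    return header, rows, skipped


# Made-up sample: one row with a missing value, one with a number
# in the discrete "color" column.
sample = "age\tcolor\n25\tred\n\tblue\n31\t7\n40\tgreen\n"
header, rows, skipped = load_skipping_invalid_rows(io.StringIO(sample))
print(f"{len(rows)} rows loaded, {skipped} rows skipped")
```

Reporting the count of skipped rows, rather than silently dropping them, is what lets the user notice that the loaded dataset is smaller than the file.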