Dealing with a very large dataset (continuation)

Because I have recently updated my operating system (OS), I am wondering how the 64-bit versions of Knime 2.4.2 and RapidMiner 5.1.011 handle a very large dataset, one that cannot be loaded into main memory on a 32-bit OS. This article completes a previous study in which we dealt with a moderately sized dataset of 500,000 instances and 22 variables. Here, we handle a dataset with 9,634,198 instances and 41 variables. We have already used this dataset in another tutorial, where we showed that, on a 32-bit OS, decision tree induction on a database of this size is not feasible without a swapping mechanism such as the one implemented in SIPINA. Tanagra can nevertheless handle the dataset, because it encodes each value of a categorical attribute on a single byte; the memory occupation therefore remains moderate.
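
To give an idea of what the single-byte encoding buys, here is a back-of-the-envelope sketch in Python. The dataset dimensions come from the paragraph above; the 4-byte comparison is an assumption about a common default encoding for integer codes, not a measurement of Knime or RapidMiner.

  n_instances, n_variables = 9_634_198, 41
  cells = n_instances * n_variables

  # One byte per categorical value, as Tanagra does: about 377 MB
  print(f"1-byte codes: {cells / 1024 ** 2:,.0f} MB")

  # Four bytes per value, a common default for integer codes: about 1,507 MB
  print(f"4-byte codes: {cells * 4 / 1024 ** 2:,.0f} MB")

With 4 GB of RAM, the first layout fits comfortably, while the second already consumes more than a third of the available memory before any learning algorithm runs.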

In this tutorial, I analyze the behavior of the 64-bit versions of Knime and RapidMiner on this database. I use a 64-bit OS and 64-bit tools, but I have "only" 4 GB of available memory on my personal computer.

Keywords: very large dataset, decision tree, sampling, sipina, knime, rapidminer
Components: ID3
Tutorial: en_Tanagra_Tree_Very_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
References:
Tanagra, "Dealing with very large dataset in Sipina".
Tanagra, "Decision tree and large dataset (continuation)".
Tanagra, "Decision tree and large dataset".
Tanagra, "Local sampling for decision tree learning".
