Posts

Showing posts from December, 2015

R online with R-Fiddle

R-Fiddle is an online programming environment for R. It allows us to write and run R programs. Although R is free and good free programming environments for it already exist (e.g. R-Studio desktop, Tinn-R), this type of tool has several advantages. It suits mobile users who frequently change machines: with an Internet connection, we can work on a project without worrying about installing R on each PC. Collaborative work is another context in which this tool can be particularly useful, since it avoids transferring files and managing versions. Finally, it allows us to work on a lightweight front-end, a laptop for example, and to offload the calculations to a powerful remote server (in the cloud, as we would say today). In this tutorial, we briefly review the features of R-Fiddle. Keywords: R software, R programming, cloud computing, linear discriminant analysis, logistic regression, classification tree, klaR ...

Random Forest - Boosting with R and Python

This tutorial follows the slideshow devoted to "Bagging, Random Forest and Boosting". We show how to apply these methods to a data file. We follow the same steps as the slideshow: we first describe the construction of a decision tree, we measure its prediction performance, and then we see how ensemble methods can improve the results. Various aspects of these methods are highlighted: the measurement of variable importance, the influence of the parameters, the influence of the characteristics of the underlying classifier (e.g. controlling the tree size), etc. As a first step, we focus on R (rpart, adabag and randomforest packages) and Python (scikit-learn package). Programming lets us multiply the analyses; among other things, we can evaluate the influence of the parameters on performance. As a second step, we explore the capabilities of software (Tanagra and Knime) providing turnkey solutions, very simple to implement, more accessible for peopl...
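
The steps described above (a single decision tree first, then ensemble methods, then variable importance) can be sketched in Python with scikit-learn. This is only an illustration under stated assumptions, not the tutorial's own program: the dataset is synthetic (generated with make_classification), and the number of trees (50) and the 70/30 split are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a synthetic dataset stands in for the tutorial's data file.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# One reference tree, then three ensemble methods built on trees.
models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the training set and measure accuracy on the test set.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")

# Variable importance as estimated by the random forest (sums to 1).
importances = models["random forest"].feature_importances_
```

Changing n_estimators, or the max_depth of the underlying tree, and rerunning the loop is a simple way to study the influence of the parameters on performance.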

Bagging, Random Forest, Boosting (slides)

This course material presents ensemble methods: bagging, random forest and boosting. These approaches are based on the same guiding idea: a set of base classifiers, all learned with a single learning algorithm, are fitted to different versions of the dataset. For bagging and random forest, the models are fitted independently on bootstrap samples. Random Forest incorporates an additional mechanism in order to "decorrelate" the models, which are necessarily decision trees. Boosting works in a sequential fashion: the model at step (t) is fitted to a weighted version of the sample in order to correct the errors of the model learned at the preceding step (t-1). Keywords: bagging, boosting, random forest, decision tree, rpart package, adabag package, randomforest package, R software Slides: Bagging - Random Forest - Boosting References: Breiman L., "Bagging Predictors", Machine Learning, 24, p. 123-140, 1996. Breiman L., "Random Forests", Machine Learning, 45, p. 5-32,...
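
To make the two mechanisms concrete, here is a minimal hand-written sketch in Python: bagging fits independent trees on bootstrap samples and takes a majority vote, while a discrete AdaBoost-style loop reweights the sample after each step. The function names (fit_bagging, fit_adaboost, etc.) are hypothetical, the data are synthetic, and the class labels are assumed to be coded 0/1. It is not the implementation used by adabag or scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bagging(X, y, n_models=25, seed=0):
    """Fit independent trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models


def predict_bagging(models, X):
    """Majority vote over the trees (labels assumed to be 0/1)."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)


def fit_adaboost(X, y, n_rounds=25):
    """Discrete AdaBoost with decision stumps (labels assumed to be 0/1)."""
    y_pm = 2 * y - 1                      # recode labels to -1/+1
    w = np.full(len(y), 1.0 / len(y))     # uniform weights at step 1
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = 2 * stump.predict(X) - 1
        err = np.clip(np.sum(w * (pred != y_pm)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Misclassified rows gain weight, so the next model focuses on them.
        w = w * np.exp(-alpha * y_pm * pred)
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas


def predict_adaboost(models, alphas, X):
    """Weighted vote of the stumps, mapped back to 0/1 labels."""
    score = sum(a * (2 * m.predict(X) - 1) for m, a in zip(models, alphas))
    return (score > 0).astype(int)


X, y = make_classification(n_samples=400, n_features=10, random_state=0)
bag_pred = predict_bagging(fit_bagging(X, y), X)
boost_models, boost_alphas = fit_adaboost(X, y)
boost_pred = predict_adaboost(boost_models, boost_alphas, X)
```

Random forest would differ from fit_bagging only in that each tree also considers a random subset of the variables at every split, which is what "decorrelates" the trees.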

Python - Machine Learning with scikit-learn (slides)

This course material presents some modules and classes of scikit-learn, a library for machine learning in Python. As a first step, we focus on a typical classification process: splitting the dataset into training and test sets; fitting a logistic regression on the training sample; applying the model to the test set in order to obtain the predicted class values; evaluating the classifier using the confusion matrix and computing performance measures. As a second step, we study other important aspects of the classification task: error estimation by cross-validation when we deal with a small dataset; the scoring process for direct marketing; grid search for finding the optimal parameters of an algorithm for a given dataset; the feature selection issue. Keywords: python, numpy, pandas, scikit-learn, logistic regression, predictive analytics Slides: Machine Learning with scikit-learn Dataset and programs: scikit-learn - Programs and datas...
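
The process outlined above can be sketched as follows. This is a minimal illustration, assuming a synthetic dataset and an arbitrary grid of values for the regularization parameter C; it is not the program distributed with the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)

# Assumption: synthetic data replaces the dataset used in the slides.
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=4, random_state=0)

# Step 1: split, fit on the training sample, predict on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 1 (continued): confusion matrix and accuracy on the test set.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"test accuracy: {acc:.3f}")

# Step 2: 10-fold cross-validation, useful when the dataset is small.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(f"cross-validated accuracy: {cv_scores.mean():.3f}")

# Step 2: scores (estimated probability of the positive class) for ranking.
scores = model.predict_proba(X_test)[:, 1]

# Step 2: grid search over C (the candidate values are arbitrary).
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X_train, y_train)
print(grid.best_params_)
```

In a direct marketing setting, the individuals would be sorted by decreasing value of scores so that those most likely to respond are contacted first.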

Python - Statistics with SciPy (slides)

This course material presents the use of some modules of SciPy, a library for scientific computing in Python. We study in particular the stats package, which allows us to perform statistical tests such as the comparison of means for independent and related samples, the comparison of variances, and the measurement of the association between two variables. We also study the cluster package, especially the k-means and the hierarchical agglomerative clustering algorithms. SciPy handles the NumPy vectors and matrices which were presented previously. Keywords: python, numpy, scipy, descriptive statistics, cumulative distribution functions, sampling, random number generator, normality test, test for comparing populations, pearson correlation, spearman correlation, cluster analysis, k-means, hac, dendrogram Slides: scipy.stats and scipy.cluster Dataset and programs: SciPy - Programs and dataset References: SciPy Reference Guide on SciPy.org Python - Official Site
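
The tests and clustering algorithms mentioned above can be tried with a few calls. This is a small sketch, assuming randomly generated samples rather than the dataset supplied with the slides; Levene's test is used here as one possible comparison of variances, and the seed argument of kmeans2 assumes a reasonably recent SciPy release.

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)   # assumption: simulated samples
b = rng.normal(0.5, 1.0, 50)

# Comparison of means: independent samples, then related (paired) samples.
t_ind = stats.ttest_ind(a, b)
t_rel = stats.ttest_rel(a, b)

# Comparison of variances (Levene's test).
lev = stats.levene(a, b)

# Association between two variables: Pearson and Spearman correlations.
r, p_r = stats.pearsonr(a, b)
rho, p_rho = stats.spearmanr(a, b)

print(f"t-test (independent) p-value: {t_ind.pvalue:.4f}")
print(f"t-test (related) p-value: {t_rel.pvalue:.4f}")
print(f"Levene p-value: {lev.pvalue:.4f}")
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")

# Two well-separated groups of 2D points for the clustering examples.
points = np.vstack([rng.normal(0.0, 0.5, (30, 2)),
                    rng.normal(5.0, 0.5, (30, 2))])

# k-means with 2 clusters.
centroids, km_labels = kmeans2(points, 2, minit="++", seed=0)

# Hierarchical agglomerative clustering (Ward), cut into 2 groups.
Z = linkage(points, method="ward")
hac_labels = fcluster(Z, t=2, criterion="maxclust")
```

The matrix Z returned by linkage can be passed to scipy.cluster.hierarchy.dendrogram to draw the dendrogram.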