Showing posts with the label Python

CDF and PPF in Excel, R and Python

How to compute the cumulative distribution functions (CDF) and the percent point functions (PPF) of several commonly used distributions in Excel, R and Python. I use Excel (in conjunction with Tanagra or Sipina), R and Python for the practical classes of my data mining and statistics courses at the University. I often ask students to perform hypothesis tests, calculate confidence intervals, etc. Since we work on computers, using statistical tables to obtain the quantiles or p-values of the usual distributions is out of the question. In this tutorial, I present the main functions for the normal distribution, Student's t-distribution, the chi-squared distribution and the Fisher-Snedecor distribution. I realized that students sometimes find it difficult to relate what they read in statistical tables to the corresponding software functions, which they struggle to identify. It is also an opportunity for us to verify the equivalences between the functions proposed by Excel, R (sta...
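As a minimal sketch of the table-to-function correspondence described above, the following uses only Python's standard library (`statistics.NormalDist`), so it covers the normal distribution alone. The observed statistic `z_obs` is a hypothetical value chosen for illustration. In SciPy, the equivalent calls are `scipy.stats.norm.cdf` and `scipy.stats.norm.ppf`, and the `t`, `chi2` and `f` objects of `scipy.stats` expose the same two methods.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# CDF: probability that Z <= 1.96 (the value a statistical table lists).
p = std_normal.cdf(1.96)

# PPF (quantile function): the value z such that P(Z <= z) = 0.975.
z = std_normal.inv_cdf(0.975)

# Two-sided p-value for an observed z statistic (hypothetical value).
z_obs = 2.5
p_value = 2 * (1 - std_normal.cdf(abs(z_obs)))

print(round(p, 4), round(z, 4), round(p_value, 4))
```

The CDF and the PPF are inverses of each other: feeding the result of `cdf` into `inv_cdf` returns the original quantile, which is a quick way to check that a function has been matched to the right table.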

Regression analysis in Python

Statsmodels is a Python module that provides classes and functions for estimating many different statistical models, conducting statistical tests, and exploring statistical data. In this tutorial, we explore the capabilities of StatsModels through a case study in multiple linear regression. We will discuss: the estimation of model parameters using the ordinary least squares method, the implementation of some statistical tests, the checking of the model assumptions by analyzing the residuals, the detection of outliers and influential points, the analysis of multicollinearity, and the calculation of the prediction interval for a new instance. Keywords: regression, statsmodels, pandas, matplotlib Tutorial: en_Tanagra_Python_StatsModels.pdf Dataset and program: en_python_statsmodels.zip References: StatsModels: Statistics in Python
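To make the ordinary least squares step concrete, here is a small sketch written with plain NumPy rather than StatsModels. The data are invented for illustration and do not come from the tutorial's dataset. In StatsModels itself, the same fit would typically go through `statsmodels.api.OLS(y, A).fit()`, which also reports the tests and diagnostics listed above.

```python
import numpy as np

# Hypothetical toy data: 6 observations, 2 explanatory variables.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([6.1, 6.9, 12.2, 12.8, 18.1, 19.0])

# Add an intercept column, then solve the least squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Fitted values and residuals, the raw material of the assumption checks.
fitted = A @ coef
residuals = y - fitted

# Coefficient of determination R^2.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("intercept and slopes:", coef)
print("R^2:", r_squared)
```

Because the model includes an intercept, the residuals sum to zero and are orthogonal to every column of `A`; checking this is a simple sanity test of the fit.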

Document classification in Python

The aim of text categorization is to assign documents to predefined categories as accurately as possible. We are within the supervised learning framework, with a categorical target attribute, often binary. The originality lies in the nature of the input attribute, which is a textual document. Predictive methods cannot be applied to it directly; a data preparation phase is needed first. In this tutorial, we describe a text categorization process in Python that relies mainly on the text mining capabilities of the scikit-learn package, which also provides the data mining methods (logistic regression). We want to classify SMS messages as "spam" (unsolicited, malicious) or "ham" (legitimate). We use the “SMS Spam Collection v.1” dataset. Keywords: text mining, document categorization, corpus, bag of words, f1-score, recall, precision, dimensionality reduction, variable selection, logistic regression, scikit-learn, python Tutorial: Spam identification Dataset: Corpu...
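The data preparation phase mentioned above usually means turning each document into a vector of term counts (the bag of words). The sketch below shows the idea with the standard library only; the three messages are invented for illustration and are not taken from the SMS Spam Collection. In scikit-learn, `CountVectorizer` performs this transformation.

```python
import re
from collections import Counter

# Hypothetical messages, invented for illustration (not from the dataset).
messages = [
    "Win a free prize now",
    "Are we still meeting for lunch",
    "Free entry win cash now",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every distinct term in the corpus, in sorted order.
vocabulary = sorted({term for msg in messages for term in tokenize(msg)})

def to_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

# Document-term matrix: one row per message, one column per term.
doc_term_matrix = [to_vector(msg) for msg in messages]

for row in doc_term_matrix:
    print(row)
```

Once the documents are rows of a numeric matrix, any supervised method that accepts a table of features, such as logistic regression, can be trained on them.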

SVM: Support Vector Machine in R and Python

This tutorial complements the course material devoted to the Support Vector Machine (SVM) approach. It highlights two important aspects of the method: the position of the support points and the definition of the decision boundaries in the representation space when we construct a linear separator; and the difficulty of determining the “best” parameter values for a given problem. We will use R (“e1071” package) and Python (“scikit-learn” package). Keywords: svm, e1071 package, R software, Python software, scikit-learn package, sklearn Tutorial: SVM - Support Vector Machine Dataset and programs: svm_r_python.zip References: Tanagra Tutorial, "Support Vector Machine", May 2017. Tanagra Tutorial, "Implementing SVM on large dataset", July 2009.
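The role of the support points and of the decision boundary can be illustrated without any library. In the sketch below, the separator parameters `w` and `b` and the four points are hypothetical values chosen by hand, not the output of a fitted model. The points whose score has absolute value 1 lie exactly on the margin and are the candidate support points.

```python
# Hypothetical separator parameters, chosen by hand for illustration.
w = [1.0, -1.0]   # normal vector of the separating hyperplane
b = 0.0           # intercept

def decision_function(x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    """Class +1 on the non-negative side of the hyperplane, -1 otherwise."""
    return 1 if decision_function(x) >= 0 else -1

# Hypothetical labelled points in a two-dimensional representation space.
points = [(2.0, 1.0), (1.0, 2.0), (3.0, 0.5), (0.0, 1.0)]
labels = [1, -1, 1, -1]

scores = [decision_function(p) for p in points]
predictions = [predict(p) for p in points]

# Points with |w.x + b| = 1 sit on the margin boundaries.
on_margin = [p for p, s in zip(points, scores) if abs(abs(s) - 1.0) < 1e-9]

print(predictions)
print(on_margin)
```

With scikit-learn, a fitted linear `SVC` exposes the corresponding quantities through its `coef_`, `intercept_` and `support_vectors_` attributes.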

Gradient boosting with R and Python

This tutorial follows the course material devoted to “Gradient Boosting”, to which we refer constantly in this document. It also complements the course materials and tutorials on the Bagging, Random Forest and Boosting approaches (see References). The outline is simple: after importing the data, which have been split in advance into two files (learning and testing), we build predictive models and evaluate them. The test error rate is used to compare the performance of the various classifiers. We also study the question of parameters, which is particularly sensitive in gradient boosting. Indeed, there are many parameters, and their influence on the behavior of the classifier is considerable. Unfortunately, although we can guess which directions to explore to improve the quality of the models (more or less regularization), accurately identifying the parameters to modify and setting the right values is difficult, especially because they (the various parameters) can interact with eac...
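To show how two of these parameters interact, here is a deliberately simplified sketch of gradient boosting. It is an assumption-laden illustration, not the tutorial's code: it handles regression with squared loss on a single feature, uses one-split trees (stumps) as base learners, and runs on invented data, whereas the tutorial deals with classification. The names `n_estimators` and `learning_rate` mirror the usual parameter names.

```python
def fit_stump(x, residuals):
    """Best single-threshold split on a 1-D feature (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_estimators, learning_rate):
    """Fit stumps one after another on the current residuals."""
    base = sum(y) / len(y)          # initial constant prediction
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_estimators):
        # For squared loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mse(model, x, y):
    """Mean squared error of a model on a sample."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

# Hypothetical training data, invented for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

few_rounds = gradient_boost(x, y, n_estimators=5, learning_rate=0.1)
many_rounds = gradient_boost(x, y, n_estimators=100, learning_rate=0.1)

print(mse(few_rounds, x, y), mse(many_rounds, x, y))
```

A small learning rate shrinks each stump's contribution, so more rounds are needed to reach the same training error. The error printed here is measured on the training sample; judging the models as the tutorial does requires a separate test sample.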

Support vector machine (slides)

In machine learning, support vector machines (SVM) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis (Wikipedia). These slides present the background of the approach in the classification context. We address the binary classification problem, the soft-margin principle, the construction of nonlinear classifiers by means of kernel functions, the feature selection process, and the multiclass SVM. The presentation is complemented by an implementation of the approach in the open source software Python (Scikit-Learn), R (e1071) and Tanagra (SVM and C-SVC). Keywords: svm, e1071 package, R software, Python, scikit-learn package, sklearn Components: SVM, C-SVC Slides: Support Vector Machine (SVM) Dataset: svm exemples.xlsx References: Abe S., "Support Vector Machines for Pattern Classification", Springer, 2010.
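A kernel function replaces the dot product between two points with a similarity computed in an implicit feature space, which is what allows a nonlinear classifier. As a short sketch, the Gaussian (RBF) kernel can be written in a few lines; the value of `gamma` and the three points are hypothetical choices made for illustration.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)

def gram_matrix(points, gamma=0.5):
    """Matrix of kernel values between every pair of points."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]

# Hypothetical points, invented for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points)

for row in K:
    print([round(value, 4) for value in row])
```

The resulting matrix is symmetric with ones on the diagonal, and its entries decrease toward zero as the points move apart; a larger `gamma` makes that decay faster and the decision boundary more flexible.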

Random Forest - Boosting with R and Python

This tutorial follows the slideshow devoted to "Bagging, Random Forest and Boosting". We show the implementation of these methods on a data file. We follow the same steps as the slideshow, i.e. we first describe the construction of a decision tree, we measure its prediction performance, and then we see how ensemble methods can improve the results. Various aspects of these methods are highlighted: the measure of variable importance, the influence of the parameters, the influence of the characteristics of the underlying classifier (e.g. controlling the tree size), etc. As a first step, we focus on R (rpart, adabag and randomForest packages) and Python (scikit-learn package). Programming lets us multiply the analyses; among other things, we can evaluate the influence of parameters on the performance. As a second step, we will explore the capabilities of software ( Tanagra and Knime ) providing turnkey solutions, very simple to implement, more accessible for peopl...

Python - Machine Learning with scikit-learn (slides)

This course material presents some modules and classes of scikit-learn, a library for machine learning in Python. As a first step, we focus on a typical classification process: the subdivision of the dataset into training and test sets; the learning of the logistic regression on the training sample; the application of the model to the test set in order to obtain the predicted class values; the evaluation of the classifier using the confusion matrix and the calculation of the performance measurements. As a second step, we study other important aspects of the classification task: the cross-validation error evaluation when we deal with a small dataset; the scoring process for direct marketing; the grid search for detecting the optimal parameters of algorithms for a given dataset; the feature selection issue. Keywords: python, numpy, pandas, scikit-learn, logistic regression, predictive analytics Slides: Machine Learning with scikit-learn Dataset and programs: scikit-learn - Programs and datas...
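The evaluation step can be sketched in pure Python to show what the confusion matrix and the derived measurements compute. The true and predicted labels below are invented for illustration. In scikit-learn, `sklearn.metrics.confusion_matrix` and the related scoring functions give the same quantities.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical test-set labels and predictions, invented for illustration.
y_true = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos", "neg", "neg", "pos"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="pos")

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)

print("confusion matrix:", [[tp, fn], [fp, tn]])
print(accuracy, precision, recall, f1_score)
```

The error rate is simply `1 - accuracy`. Precision and recall matter most when the classes are unbalanced, where accuracy alone can look good for a classifier that rarely predicts the positive class.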

Python - Statistics with SciPy (slides)

This course material presents the use of some modules of SciPy, a library for scientific computing in Python. We focus on the stats package, which provides statistical tests such as the comparison of means for independent and related samples, the comparison of variances, and measures of association between two variables. We also study the cluster package, especially the k-means and the hierarchical agglomerative clustering algorithms. SciPy handles the NumPy vectors and matrices presented previously. Keywords: python, numpy, scipy, descriptive statistics, cumulative distribution functions, sampling, random number generator, normality test, test for comparing populations, pearson correlation, spearman correlation, cluster analysis, k-means, hac, dendrogram Slides: scipy.stats and scipy.cluster Dataset and programs: SciPy - Programs and dataset References: SciPy Reference Guide on SciPy.org Python - Official Site
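As an illustration of a comparison of means for two independent samples, the sketch below computes Welch's t statistic and its approximate degrees of freedom with the standard library only. The two samples are invented for illustration. With SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with its p-value.

```python
import math
from statistics import mean, variance

# Hypothetical independent samples, invented for illustration.
a = [5.1, 4.9, 5.6, 5.8, 5.3, 5.0]
b = [4.4, 4.8, 4.6, 5.0, 4.3, 4.7]

def welch_t(sample_a, sample_b):
    """Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t_stat, dof

t_stat, dof = welch_t(a, b)
print(t_stat, dof)
```

The p-value is then obtained from the CDF of Student's t-distribution with `dof` degrees of freedom, which is where the distribution functions of the stats package come in.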

Python - Handling matrices with NumPy (slides)

This course material presents the manipulation of matrices using NumPy. The array type is common to vectors and matrices. What distinguishes a matrix is a second dimension, which arranges the values in a rows x columns structure. Matrices pave the way for operations that play a fundamental role in statistical modeling and exploratory data analysis (e.g. matrix inversion, solving equations, calculation of eigenvalues and eigenvectors, singular value decomposition, etc.). Keywords: Python language, numpy, vector, matrix, array, creation, extraction Slides: NumPy Matrices Datasets and programs: Matrices References: NumPy Reference on SciPy.org Haenel, Gouillart, Varoquaux, "Python Scientific Lecture Notes". Python - Official Site
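The operations listed above map onto functions of `numpy.linalg`. The following short sketch runs them on a small matrix whose values are chosen arbitrarily for illustration.

```python
import numpy as np

# A 2 x 2 matrix and a right-hand side vector (arbitrary values).
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

print(A.shape)                      # (rows, columns)

# Matrix inversion.
A_inv = np.linalg.inv(A)

# Solving the linear system A x = b.
x = np.linalg.solve(A, b)

# Eigenvalues and eigenvectors (eigenvectors are the columns).
eigenvalues, eigenvectors = np.linalg.eig(A)

# Singular value decomposition: A = U diag(s) Vt.
U, s, Vt = np.linalg.svd(A)

print(A_inv)
print(x)
print(eigenvalues)
print(s)
```

For solving equations, `np.linalg.solve` is generally preferable to multiplying by the explicit inverse, since it is both faster and numerically more stable.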

Python - Handling vectors with NumPy (slides)

Python is becoming more and more popular among data scientists. I decided to introduce statistical programming in Python into my teaching at the University (reference page in French). This first course material describes the handling of vectors with the NumPy library. Their structure and functionality are fairly similar to those of vectors in R. Keywords: Python language, numpy, vector, array, creation, extraction Slides: NumPy Vectors Datasets and programs: Vectors References: NumPy Reference on SciPy.org Haenel, Gouillart, Varoquaux, "Python Scientific Lecture Notes". Python - Official Site
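A few lines are enough to show the creation and extraction operations named in the keywords. The values below are arbitrary and serve only as an illustration.

```python
import numpy as np

# Creation: from a list, from a range, and filled with zeros.
v = np.array([1.2, 2.5, 3.2, 1.8])
r = np.arange(0, 10, 2)     # start, stop (excluded), step
z = np.zeros(3)

# Extraction by position (indices start at 0, unlike R where they start at 1).
first = v[0]
last = v[-1]
middle = v[1:3]             # positions 1 and 2; the end bound is excluded

# Extraction by condition and by a list of positions.
large = v[v > 2.0]
picked = v[[0, 2]]

print(v.size, v.dtype)
print(r, z)
print(first, last, middle, large, picked)
```

Readers coming from R will recognize the logical indexing `v[v > 2.0]`, which works the same way; the main differences are the zero-based positions and the meaning of negative indices, which count from the end in NumPy instead of excluding elements.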