
Showing posts from May, 2010

Logistic Regression Diagnostics

This tutorial describes tools for the diagnosis and assessment of a logistic regression. These tools are available in Tanagra version 1.4.33 (and later). We deal with a credit scoring problem: we try to determine, using logistic regression, the factors underlying the agreement or refusal of a credit to customers. We perform the following steps: estimating the parameters of the classifier; retrieving the covariance matrix of the coefficients; assessing the model with the Hosmer-Lemeshow goodness-of-fit test; assessing it with the reliability diagram; assessing it with the ROC curve; analyzing the residuals to detect outliers and influential points. We first use Tanagra 1.4.33, then we perform the same analysis with the R 2.9.2 software [glm(.) procedure]. Keywords: logistic regression, residual analysis, outliers, influential points, Pearson residual, deviance residual, leverage, Cook's distance, DFBETA, DFBETAS, Hosmer-Lemeshow test
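As a sketch of the R side of this analysis, most of the diagnostics listed above come directly out of the glm(.) output; the data frame credit and the binary response accept below are hypothetical placeholders, not the dataset of the tutorial, and the Hosmer-Lemeshow statistic is computed by hand rather than with a dedicated package.

# Hypothetical data frame `credit` with a 0/1 response `accept`
model <- glm(accept ~ ., data = credit, family = binomial)

summary(model)   # estimated coefficients
vcov(model)      # covariance matrix of the coefficients

# Residual analysis, outliers and influential points
r.pearson  <- residuals(model, type = "pearson")
r.deviance <- residuals(model, type = "deviance")
lev  <- hatvalues(model)        # leverage
cook <- cooks.distance(model)   # Cook's distance
dfb  <- dfbetas(model)          # standardized DFBETAS

# Hosmer-Lemeshow statistic computed by hand, with 10 groups
p <- fitted(model)
g <- cut(p, breaks = quantile(p, probs = seq(0, 1, 0.1)), include.lowest = TRUE)
obs <- tapply(model$y, g, sum)   # observed positives per group
n   <- tapply(p, g, length)      # group sizes
e   <- tapply(p, g, sum)         # expected positives per group
hl  <- sum((obs - e)^2 / (e * (1 - e / n)))
pchisq(hl, df = length(n) - 2, lower.tail = FALSE)   # p-value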

Discretization of continuous features

Discretization transforms a continuous attribute into a discrete one. To do so, it partitions the range into a set of intervals by defining a set of cut points. Thus we must answer two questions to carry out this data transformation: (1) how to determine the right number of intervals; (2) how to compute the cut points. These questions are not necessarily resolved in that order. The best discretization is the one performed by a domain expert: indeed, the expert takes into account information beyond what the available dataset alone provides. Unfortunately, this kind of approach is not always feasible because the domain knowledge is often unavailable or does not suffice to determine the appropriate discretization, and the process cannot be automated to handle a large number of attributes. So, we are often forced to base the determination of the best discretization on a numerical process, typically as preprocessing for a supervised learning process. First, we must d…
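To illustrate the two questions above, here is a minimal sketch of the two standard unsupervised strategies, using a built-in R dataset; the number of intervals k is fixed beforehand.

x <- iris$Sepal.Length   # a continuous attribute
k <- 4                   # number of intervals, chosen a priori

# (1) Equal width: cut points evenly spaced over the range of x
eq.width <- cut(x, breaks = k)

# (2) Equal frequency: cut points taken at the quantiles of x
eq.freq <- cut(x,
               breaks = quantile(x, probs = seq(0, 1, length.out = k + 1)),
               include.lowest = TRUE)

table(eq.width)   # intervals of equal length, uneven counts
table(eq.freq)    # uneven lengths, roughly equal counts

Supervised methods such as MDLP instead choose both the number of intervals and the cut points from the class attribute, which is the situation the tutorial focuses on.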

Sipina Decision Graph Algorithm (case study)

SIPINA is a data mining tool, but it is also the name of a machine learning method: an algorithm for the induction of decision graphs (see References, section 9). A decision graph is a generalization of a decision tree in which we can merge any two terminal nodes of the graph, not only the leaves issued from the same node. The SIPINA method is only available in version 2.5 of the SIPINA data mining tool. This version has some drawbacks; among others, it cannot handle large datasets (more than 16,383 instances). But it is the only tool which implements the decision graph algorithm, and this is the main reason why this version is still available online today. If we want to run a decision tree algorithm such as C4.5 or CHAID, or if we want to build a decision tree interactively, it is more advantageous to use the research version (also known as version 3.0), which is more powerful and supplies many more features for data exploration. In this tutorial…

User's guide for the old Sipina 2.5 version

SIPINA has a long history. Before the current version (version 3.3, May 2010), we distributed a data mining tool dedicated exclusively to the induction of decision graphs, a generalization of decision trees. Of course, state-of-the-art decision tree algorithms are also included (such as C4.5 and CHAID). This version, called 2.5, has been online since 1995. Its development was suspended in 1998, when I started programming version 3.0. Version 2.5 is the only free tool which implements the decision graph algorithm; it is a real curiosity in this respect, and this is why I still distribute it today. On the other hand, this 2.5 version has some severe limitations. Among others, it can only handle small datasets, up to 16,380 instances. If you want to build a decision tree or if you want to handle a large dataset, it is always advisable to use the current version (version 3.0 and later). Setup of the old 2.5 version: Setup_Sipina_V25.exe. User's guide: …

Solutions for multicollinearity in multiple regression

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with others (Wikipedia). Sometimes the signs of the coefficients are inconsistent with the domain knowledge; sometimes, explanatory variables which seem individually significant are invalidated when we add other variables. There are two steps in treating this kind of problem: (1) detecting the presence of the collinearity…
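The detection step can be illustrated with variance inflation factors (VIF), computed by hand: each predictor is regressed on all the others, and VIF_j = 1 / (1 - R²_j). Below is a sketch assuming a hypothetical data frame df whose columns are the numeric predictors; it is not the procedure of the tutorial itself.

# Hypothetical data frame `df` containing only the numeric predictors
vif <- sapply(names(df), function(v) {
  others <- setdiff(names(df), v)
  r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
  1 / (1 - r2)   # VIF_j = 1 / (1 - R^2_j)
})
vif[vif > 5]   # a common rule of thumb flags VIF above 5 (or 10)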