Offering a Data Science course, R Programming speed tutorials on the 'LearnR' Youtube channel, and a collection of R code that is fun and maybe useful for work. http://bit.ly/1qWGnsJ
Video
R Programming For Beginners
If you are a relative beginner in data science, a non-programmer statistician, a college student on an R project, or an analytics professional, then this is the R course for you.
R Programming for Beginners is the latest R course offered on youtube, with short, crisp videos and content that covers the essential concepts R programmers are expected to know. You will save months of time and effort by taking this course on your way to mastering the R language and becoming a good data science professional.
So skyrocket your R learning. Subscribe and follow along with these awesome tutorials, brought to you free by the "LearnR" channel.
Video
This video lesson gives hands-on practice solving R programming exercises. There are 4 unique practice exercises in this playlist that address various sticking points beginners face on their journey to mastering the R language.
So turbocharge your learning with these highly effective and speedy practice exercises - created especially for beginners and non-programmers.
Text
Advanced Regression Analysis: How To Find And Print 100 Best Regression Models (Works well!)
In this post, I am going to explain a simple way to find as many best-fitting regression models as you want from any given dataset of predictors.
I will show you a method, along with code, that prints the summary statistics of all the best models to a separate text file, reports the essential regression statistics, and prints the fitted values from lm() and rlm() (robust regression), along with deviations and plots.
You also have the option to choose your best models based on the number of variables in each model and on multiple selection parameters such as adjusted R-squared and Mallows' Cp. All in one piece of code.
We have all had our shots at regression analysis. Though there is nothing as exciting as the moment when you lay your hands on freshly prepared data, it can get frustrating when you need to deliver results regularly in a time-sensitive manner. The code I show in this post should help alleviate the routineness of the regression modelling process. In other words, it is for stats analysts with routine deadlines.
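The lm()/rlm() pairing this post relies on can be previewed in miniature. A minimal sketch on the built-in mtcars data (not the wine dataset used below; MASS ships with every R installation):

```r
library(MASS)  # provides rlm() for robust regression

# Fit an ordinary and a robust linear model on the same formula
fit.lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit.rlm <- rlm(mpg ~ wt + hp, data = mtcars)

# Percent deviation of each model's fitted values from the actuals,
# in the same spirit as the deviation columns in the script further down
dev.lm  <- (fitted(fit.lm)  - mtcars$mpg) / mtcars$mpg * 100
dev.rlm <- (fitted(fit.rlm) - mtcars$mpg) / mtcars$mpg * 100
round(head(cbind(lm = dev.lm, rlm = dev.rlm)), 2)
```

Where the data contain outliers, the two deviation columns will disagree noticeably; that disagreement is exactly why the script prints both.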
"Best subsets regression with leaps"
I have seen people coming from other platforms where they typically use a built-in software procedure to run a forecast or regression model, or just use mouse clicks in a GUI to build their models. Doing this procedurally leads to routine and boredom when you have to get results out repeatedly.
It would be a grave mistake if R programmers took the same route and repeated the mistakes of GUI analysts and procedural statisticians. There is so much amateurish R code out there that people underestimate the potential of R as a programming language, often comparing it with other statistical software packages. This view then becomes a benchmark for newcomers to the language, who tend to learn it in parts and end up with an incomplete idea of its potential, a fate JavaScript suffered for a while. Who are we, after all, if we don't use the excellent algorithmic capabilities that R generously offers?
So remember, R is not just a statistical package, it's a good programming language too.
Now, coming back to the discussion. Let's load the 'leaps', 'car' and 'MASS' packages. The steps below should not be treated as a holy grail; rather, you should do your prior variable selection before feeding the selected variables into the procedure below. In other words, feed in only the variables that might be valuable. This script will generate the following outputs in the working directory:
Output Models.txt - Contains the summary, VIF, model fit, forecasts of the holdout observations, and deviations for lm() and rlm().
> This is the main output file. It prints the 10 best models of every possible size: the first 10 models have 1 predictor variable, the next 10 have 2 predictor variables, the next 10 have 3, and so on. Within each group, the models are ranked by adjusted R-squared. Quite a convenience, eh?
Actual vs Fit/Forecast Charts - The actual vs fitted charts for all the best models are stored in PNG format.
> Should you wish to adapt this for your own work, change the read.table() and colnames(dat) statements to suit your needs, and do credit this blog, 'rprogrammingblog.wordpress.com', at the top of your source code. I would like to know how it works out for you if you try it.
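The size-grouped, adjusted-R-squared-ranked behaviour described above can be mimicked in a few lines of base R. Here is a sketch (on the built-in mtcars data, not the script's wine dataset) that scores every 2-variable model by adjusted R-squared, the same criterion leaps uses with method = "adjr2":

```r
preds  <- c("wt", "hp", "qsec", "drat")
combos <- combn(preds, 2, simplify = FALSE)  # all 2-variable subsets

# Fit each subset and record its adjusted R-squared
adjr2 <- sapply(combos, function(v) {
  f <- reformulate(v, response = "mpg")      # e.g. mpg ~ wt + hp
  summary(lm(f, data = mtcars))$adj.r.squared
})

# Rank the subsets from best to worst, as the output file does within each group
ranked <- combos[order(adjr2, decreasing = TRUE)]
ranked[[1]]  # the predictors of the best 2-variable model
```

leaps() does this exhaustively and far more efficiently via branch-and-bound, but the sketch shows what the ranking inside each model-size group means.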
"Leaps for the Best Solution"
How to use this code?
Step 1: Install the "leaps", "car" and "MASS" packages using the install.packages(c("leaps","car","MASS")) command.
Step 2: If you have your own response-predictor data, update the dat <- read.table() and colnames(dat) sections in the code below to point to the correct location on your local disk. If your dataset already has headers, comment out the colnames(dat) part. Finally, the first column of your dataset is assumed to be a time column, such as month-year. If your analysis does not involve time series data like demand or sales, fill the first column of your data with row indices. You are now ready to run the code.
If you do not have your own dataset and just want to see how the code works, install the packages as explained above and run the rest as-is in your R console.
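Two mechanics from the full script can be previewed in isolation before running it: building a model formula from predictor names with paste()/as.formula(), and holding out the tail of the data for forecasting. A minimal sketch on the built-in mtcars data (standing in for your own dataset):

```r
# Build a formula string from chosen predictor names, as the script does
preds <- c("wt", "hp")
fm <- paste("mpg ~", paste("I(", preds, "^1)", sep = "", collapse = " + "))
fm  # "mpg ~ I(wt^1) + I(hp^1)"

# Hold out the last few rows, fit on the rest, forecast the holdout
holdout.obs <- 6
n <- nrow(mtcars)
train <- mtcars[1:(n - holdout.obs), ]
fit   <- lm(as.formula(fm), data = train)
pred  <- predict(fit, newdata = mtcars[(n - holdout.obs + 1):n, ])

# Percent deviation of forecast from actual, as in the script's output file
actual <- mtcars$mpg[(n - holdout.obs + 1):n]
sprintf("%1.2f%%", (pred - actual) / actual * 100)
```

The I(x^1) wrapping is redundant for linear terms, but the script uses it so the exponent can be changed in one place if you want polynomial terms.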
library(leaps)
library(car)
library(MASS)

cat(paste("Outputs will be collected in this location:", getwd()))

store.graphs <- TRUE
holdout.obs  <- 6  # No. of observations to hold out at the tail of the input data.
                   # These values will be predicted using lm() and rlm().

# Change the following 2 statements as per your need.
# Load the input data. The last few observations are held out for prediction,
# as directed by the 'holdout.obs' variable.
dat <- read.table("http://www.stat.ufl.edu/~winner/data/winepop.dat", header = T)
colnames(dat) <- c("Year", "Total.Population.Thousands", "years.5", "years.5to14",
                   "years.15to24", "years.25to34", "years.35to44", "years.45to54",
                   "years.55to64", "sixtyfiveyears", "Wine.Consumption.Millions.of.gallons")

predictors.df <- dat[, c(3:ncol(dat))]  # The predictor variables in your input dataset (dat)
target.df     <- dat[, 2]               # The response variable in 'dat'

LPS <- leaps(x = predictors.df, y = target.df,
             names = colnames(predictors.df), nbest = 10, method = "adjr2")

sink("Output Models.txt")
for (i in 1:nrow(LPS$which)) {
  # Create a formula using the variables selected for this model
  preds <- paste("I(", names(which(LPS$which[i, ] != "FALSE")), "^", 1, ")",
                 sep = "", collapse = " + ")
  fm <- paste("(", colnames(dat)[2], " ~ ", preds, ")", sep = "")

  # Create a linear model (lm) and a robust linear model (rlm) on the training rows
  train <- as.data.frame(dat[c(1:(nrow(dat) - holdout.obs)), ])
  rg  <- lm(as.formula(fm), data = train)
  rrg <- rlm(as.formula(fm), data = train)

  # Store the actuals and the predictions
  out <- data.frame(actuals = dat[, 2],
                    predicted.robust = round(predict(rrg, dat)),
                    predicted.lm = round(predict(rg, dat)))

  # Get the deviations of the linear model and robust linear model from the actuals
  robdev <- as.numeric(as.matrix(out[1:nrow(dat), 2])) - as.numeric(as.matrix(out[1:nrow(dat), 1]))
  lmdev  <- as.numeric(as.matrix(out[1:nrow(dat), 3])) - as.numeric(as.matrix(out[1:nrow(dat), 1]))
  deviation <- cbind(robdev, lmdev)

  # Express the deviations as percentages
  devperc.rob <- sprintf("%1.2f%%", deviation[, 1] / as.numeric(as.matrix(out[1:nrow(dat), 1])) * 100)
  devperc.lm  <- sprintf("%1.2f%%", deviation[, 2] / as.numeric(as.matrix(out[1:nrow(dat), 1])) * 100)

  # Add the deviations to the output
  devperc <- cbind(rob.lm.dev = c(devperc.rob, rep("-", nrow(out) - length(devperc.rob))),
                   lm.dev     = c(devperc.lm,  rep("-", nrow(out) - length(devperc.lm))))
  out <- cbind(dat$Year, out, devperc)

  # If we chose to store the actual vs. predicted graphs, generate them and save
  # them as PNG files in the working directory.
  if (store.graphs) {
    # Set up output to PNG, then set the margins of the graph
    png(file = paste("Model ", i, ".png", sep = ""), width = 900, height = 550)
    par(mar = c(4, 4, 4, 1))
    # Plot the actuals vs predicted and set the parameters of the graph
    plot(c(1:nrow(dat)), out[, 2], type = "b", cex.axis = 0.5, lwd = 2,
         xlab = "Year-Month", ylab = "Demand",
         sub  = paste("Plot from: ", dat[1, 1], " to ", dat[nrow(dat), 1], sep = ""),
         main = paste("Model ", i, ", No. Forecasts: ", holdout.obs, sep = ""))
    text(12, ((max(out[, c(2:4)]) - min(out[, c(2:4)])) / 2) + 10,
         paste(names(which(LPS$which[i, ] != "FALSE")), collapse = " "), adj = 1)
    box("figure")
    lines(out[, 3], col = "red",  lwd = 2)
    lines(out[, 4], col = "blue", lwd = 2)
    legend("bottomright", inset = .01, c("Actual", "Robust", "linear"),
           fill = c(1, 4, 2), horiz = TRUE, cex = 0.75)
    # End output to PNG
    dev.off()
  }

  # PRINT THE OUTPUT
  cat("Model", i, "\n")
  print(summary(rg))
  cat("vif\n")
  if (length(rg$coefficients) > 2) { print(vif(rg)) }
  print(out)
  cat("------------------------------------------------------------------- \n")
}
# Close the output file
sink()

Author: Selva Prabhakaran Sanjeevi Julian
If you are a beginner to the R programming language and would like to get started and learn R fast and efficiently, just subscribe to this youtube channel and bring yourself up to speed. Remember to do the practice exercises too and save months of time and effort. R Programming for Beginners - Speed Video Lessons http://bit.ly/1kmkxen
Selva Prabhakaran
Video
Rapidly learn R in a week, saving 6 months of effort and research. Ideal for non-programming statisticians, college projects, and data science newbies.
Link
R Programming Practice Exercise - from the R Beginners course on youtube.
Link
Learn how to access built-in data frames in R. The latest video in the R Programming course for data science aspirants.