Tumgik
#Running a Lasso Regression Analysis
apoorvaml-week3 · 1 year
Text
Week 3: Peer-graded Assignment: Running a Lasso Regression Analysis
This assignment is intended for Coursera course "Machine Learning for Data Analysis by Wesleyan University”.
It is for " Week 3: Peer-graded Assignment: Running a Lasso Regression Analysis".
I am working on Lasso Regression Analysis in Python.
Syntax used to run Lasso Regression Analysis
Dataset description: hourly rental data spanning two years.
Dataset can be found at Kaggle
Features:
yr - year
mnth - month
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weathersit - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
hum - relative humidity
windspeed (mph) - wind speed, miles per hour
windspeed (ms) - wind speed, metre per second
Target:
cnt - number of total rentals
Code used to run Lasso Regression Analysis
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
Corresponding Output
Tumblr media Tumblr media
Interpretation
A lasso regression analysis was conducted to predict a number of total bikes rentals from a pool of 12 categorical and quantitative predictor variables that best predicted a quantitative response variable. Categorical predictors included weather condition and a series of 2 binary categorical variables for holiday and working day to improve interpretability of the selected model with fewer predictors. Quantitative predictor variables include year, month, temperature, humidity and wind speed. Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
It tends to make coefficients to absolute zero as compared to Ridge which never sets the value of coefficient to absolute zero.
0 notes
shihab1992 · 2 years
Text
Assignment: Running a Lasso Regression Analysis
import pandas as pandas from sklearn.model_selection import train_test_split import numpy as np import matplotlib.pylab as plt CSV_PATH = 'gapminder.csv' data = pandas.read_csv(CSV_PATH) print('Total number of countries: {0}'.format(len(data)))
PREDICTORS = [ 'incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate' ] clean = data.copy() for key in PREDICTORS + ['lifeexpectancy']: clean[key] = pandas.to_numeric(clean[key], errors='coerce') clean = clean.dropna() print('Countries remaining:', len(clean)) clean.head()
0 notes
megumi-fm · 8 months
Text
Tumblr media Tumblr media Tumblr media Tumblr media
this week on megumi.fm ▸ media analysis brainrot
📋 Tasks
💻 Internship ↳ setup Linux system on alternate drive (this took me wayy more time than i anticipated) ✅ ↳ install yet more dependencies ✅ ↳ read up on protein folding and families + CATH and SCOP classifications ✅ ↳ download protein structure repositories ✅ ↳ run protein modeller pipeline ✅ ↳ read papers [3/3] ✅ ↳ set up a literature review tracker ✅ ↳ code for a program to parse PDB files to obtain protein seq ✅ 🎓 Uni Final Project our manuscript got a conditional acceptance!! ↳ revise and update manuscript and images according to changes mentioned ✅ 🩺 Radiomics Projects ↳ feature extraction from radiomics data using variance-based analysis ✅ ↳ setup LASSO regression (errors? look into this) 📧 Application-related ↳ collect internship experience letter ✅ ↳ collect degree transcripts ✅ ↳ request for referee report from my prof ✅
📅 Daily-s
🛌 consistent sleep [6/7] (binge watched too much TV and forgot about bed time booo) 💧 good water intake [5/7] (need to start carrying a bottle to work) 👟 exercise [4/7] (I really need to find time between work to move around)
Fun Stuff this week
🧁 met up with my bestfriends! we collected the mugs we painted last year and gifted them to each other! we also surprised one of our besties by showing up at her place. had waffles too ^=^ 📘 met up with another close friend for dinner! hung out at a bookshop after <3 🎮back at game videos: watched this critique on a time loop game called 12 minutes //then i switched up and got super obsessed with this game called The Beginner's Guide. I watched a video analysis on it, then went on to watch the entire gameplay, then read an article on the game's concept and what it means to analyze art and yeah. wow. after which I finally started playing the game with my best friend!! 📺 ongoing: Marry my Husband, Cherry Magic Th, Last Twilight 📺 binged: Taikan Yoho (aka My Personal Weatherman), Hometown Cha Cha Cha 📹 Horror Storytelling in the internet era
📻 This week's soundtrack
so. the Taikan Yoho brainrot was followed by me listening entirely to songs that evoked similar emotions to watching the main couple. personal fav emotions include a love that feels like you could die, a love that feels like losing yourself, a love that makes you feel like you could disappear, a love asking to be held, a love that reminds you that you're not alone, and a love that feels like a promise <3
---
[Jan 15 to 21; week 3/52 || I am having a blast at work ♡ I feel like I'm really learning and checking out a lot of cool stuff. That being said, I think I'm slacking when it comes to my daily routines in regards to my health. and I'm spending wayy too much time chained to my desk. maybe I'll request for an option to work from home so that I can cut on time taken on commute and spend that time exercising or walking
also. my obsession with tv shows is getting a bit. out of hand I think. not that it's particularly an issue? but I think I should switch back to my unread pile of books (or resume magpod) instead of spending my evenings on ki**a*ian. this could be unhealthy for my eyes in the long run, considering my work also involves staring at a screen all day. let's see.]
29 notes · View notes
Text
Capstone Milestone Assignment 2: Methods
Full Code: https://github.com/sonarsujit/MSC-Data-Science_Project/blob/main/capstoneprojectcode.py
Methods
1. Sample: The data for this study was drawn from the World Bank's global dataset, which includes comprehensive economic and health indicators from various countries across the world. The final sample consists of N = 248 countries, and I am only looking at the 2012 data for this sample data.
Sample Description: The sample includes countries from various income levels, regions, and development stages. It encompasses both high-income countries with advanced healthcare systems and low-income countries where access to healthcare services might be limited. The sample is diverse in terms of geographic distribution, economic conditions, and public health outcomes, providing a comprehensive view of global health disparities.
2. Measures: The given dataset has 86 variables and form the perspective of answering the research question, I focused on life expectancy and access to healthcare services.  The objective is to look into these features statistically and narrow down to relevant and important features that will align to my research problem.
Here's a breakdown of the selected features and how they relate to my research:
Healthcare Access and Infrastructure: Access to electricity, Access to non-solid fuel,Fixed broadband subscriptions, Improved sanitation facilities, Improved water source, Internet users
Key Health and Demographic Indicators: Adolescent fertility rate, Birth rate, Cause of death by communicable diseases and maternal/prenatal/nutrition conditions, Fertility rate, Mortality rate, infant , Mortality rate, neonatal , Mortality rate, under-5, Population ages 0-14 ,Urban population growth
Socioeconomic Factors: Population ages 65 and above, Survival to age 65, female, Survival to age 65, male, Adjusted net national income per capita, Automated teller machines (ATMs), GDP per capita, Health expenditure per capita, Population ages 15-64, Urban population.
Variable Management:
The Life Expectancy variable was used as a continuous variable in the analysis.
All the independent variables were also used as continuous variables.
Out of the 84 in quantitative independent variables, I found that the above 26 features closely describe the health access and infrastructure, key health and demographic indicators and socioeconomic factors based on the literature review.
I run the Lasso regression to get insights on features that will align more closely with my research question
Table1: Lasso regression to find important features that support my research question.
Tumblr media
Based on the result from lasso regression, I finalized 8 predictor variables which I believe will potentially help me answer my research question.
To further support the selection of these 8 features, I run Correlation Analysis for these 8 features and found to have both positive and negative correlations with the target variable (Life Expectancy at Birth, Total (Years))
Table 2: Pearson Correlation values and relative p values
Tumblr media
The inclusion of both positive and negative correlations provides a balanced view of the factors affecting life expectancy, making these features suitable for your analysis.
Incorporating these variables should allow us to capture the multifaceted relationship between healthcare access and life expectancy across different countries, and effectively address our research question.
Analyses:
The primary goal of the analysis was to explore and understand the factors that influence life expectancy across different countries. This involved using Lasso regression for feature selection and Pearson correlation for assessing the strength of relationships between life expectancy and various predictor variables.
The Lasso model revealed that factors such as survival rates to age 65, health expenditure, broadband access, and mortality rate under 5 were the most significant predictors of life expectancy.
The mean squared error (MSE)  = 1.2686 of the model was calculated to assess its predictive performance.
Survival to age 65 (both male and female) had strong positive correlations with life expectancy, indicating that populations with higher survival rates to age 65 tend to have higher overall life expectancy.
Health expenditure per capita showed a moderate positive correlation, suggesting that countries investing more in healthcare tend to have longer life expectancy.
Mortality rate, under-5 (per 1,000) had a strong negative correlation with life expectancy, highlighting that higher child mortality rates are associated with lower life expectancy.
Rural population (% of total) had a negative correlation, indicating that countries with a higher percentage of rural populations might face challenges that reduce life expectancy.
0 notes
dataanalyst75 · 2 months
Text
Running a Lasso Regression Analysis to identify a subset of variables that best predicted the alcohol drinking behavior of individuals
A lasso regression analysis is performed to detect a subgroup of variables from a set of 23 categorical and quantitative predictor variables in the Nesarc Wave 1 dataset which best predicted a quantitative response variable assessing the alcohol drinking behaviour of individuals 18 years and older in terms of number of any alcohol drunk per month.  
Categorical predictors include a series of 5 binary categorical variables for ethnicity (Hispanic, White, Black, Native American and Asian) to improve interpretability of the selected model with fewer predictors.
Other binary categorical predictors are related to substance consumption, and in particular  to the fact whether or not individuals are lifetime opioids, cannabis, cocaine, sedatives, tranquilizers consumers, as well as suffer from depression.
About dependencies, other binary categorical variables are those regarding if individuals
are lifetime affected by nicotine dependency,
abuse lifetime of alcohol
or not.
Quantitative predictor variables include age when the aforementioned substances have been taken for the first time, the usual frequency of cigarettes and of any alcohol drunk per month, the number of cigarettes smoked per month.
In a nutshell, looking at the SAS output,
the survey select procedure, used to split the observations in the dataset into training and test data, shows that the sample size is 30,166.
the so-called “NUMALCMO_EST” dependent variable - number of any alcohol drunk per month - along with the selection method used, are being displayed, together with information such as
the choice of a cross validation criteria as criterion for choosing the best model, with K equals 10-fold cross validation,
the random assignments of observations to the folds
the total number of observations in the data set are 43093 and the number of observations used for training and testing the statistical models are 11, where 9 are for training and 2 for testing
the number of parameters to be estimated is 24 for the intercept plus the 23 predictors. 
of the 23 predictor variables, 8 have been maintained in the selected model:
LifetimeAlcAbuseDepy  - alcohol abuse / dependence both in last 12 months and prior to the last 12 months
and
OPIOIDSRegularConsumer – used opioids both in the last 12 months and prior to the last 12 months
have the largest regression coefficient, followed by COCAINERegularConsum behaviour.
LifetimeAlcAbuseDepy and OPIOIDSRegularConsum are positively associated with the response variable, while COCAINERegularConsum is negatively associated with NUMALCMO_EST.
Other predictors associated with lower number of any alcohol drunk per month are S3BD6Q2A – “age first used cocaine or crack” and USEFREQMO – “usual frequency of cigarettes smoked per month”.
These 8 variables accounted for 33.6% of the variance in the number of any alcohol drunk per month response variable.
Hereunder the SAS code used to generate the present analysis and the plots on which the above described results are depicted
PROC IMPORT DATAFILE ='/home/u63783903/my_courses/nesarc_pds.csv' OUT = imported REPLACE; RUN; DATA new; set imported;
/* lib name statement and data step to call in the NESARC data set for the purpose of growing decision trees*/
LABEL MAJORDEPLIFE = "MAJOR DEPRESSION - LIFETIME" ETHRACE2A = "IMPUTED RACE/ETHNICITY" WhiteGroup = "White, Not Hispanic or Latino" BlackGroup = "Black, Not Hispanic or Latino" NamericaGroup = "American Indian/Alaska Native, Not Hispanic or Latino" AsianGroup = "Asian/Native Hawaiian/Pacific Islander, Not Hispanic or Latino" HispanicGroup = "Hispanic or Latino" S3BD3Q2B = "USED OPIOIDS IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS" OPIOIDSRegularConsumer = "USED OPIOIDS both IN THE LAST 12 MONTHS AND PRIOR TO LAST 12 MONTHS" S3BD3Q2A = "AGE FIRST USED OPIOIDS" S3BD5Q2B = "USED CANNABIS IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS" CANNABISRegularConsumer = "USED CANNABIS both IN THE LAST 12 MONTHS AND PRIOR TO LAST 12 MONTHS" S3BD5Q2A = "AGE FIRST USED CANNABIS" S3BD1Q2B = "USED SEDATIVES IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS" SEDATIVESRegularConsumer = "USED SEDATIVES both IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS" S3BD1Q2A = "AGE FIRST USED SEDATIVES" S3BD2Q2B = "USED TRANQUILIZERS IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS" TRANQUILIZERSRegularConsumer = "USED TRANQUILIZERS both IN THE LAST 12 MONTHS AND PRIOR TO LAST 12 MONTHS" S3BD2Q2A = "AGE FIRST USED TRANQUILIZERS" S3BD6Q2B = "USED COCAINE OR CRACK IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS" COCAINERegularConsumer = "USED COCAINE both IN THE LAST 12 MONTHS AND PRIOR TO LAST 12 MONTHS" S3BD6Q2A = "AGE FIRST USED COCAINE OR CRACK" TABLIFEDX = "NICOTINE DEPENDENCE - LIFETIME" ALCABDEP12DX = "ALCOHOL ABUSE/DEPENDENCE IN LAST 12 MONTHS" ALCABDEPP12DX = "ALCOHOL ABUSE/DEPENDENCE PRIOR TO THE LAST 12 MONTHS" LifetimeAlcAbuseDepy = "ALCOHOL ABUSE/DEPENDENCE both IN LAST 12 MONTHS and PRIOR TO THE LAST 12 MONTHS" S3AQ3C1 = "USUAL QUANTITY WHEN SMOKED CIGARETTES" S3AQ3B1 = "USUAL FREQUENCY WHEN SMOKED CIGARETTES" USFREQMO = "usual frequency of cigarettes smoked per month" NUMCIGMO_EST= "NUMBER OF cigarettes smoked per month" S2AQ8A = "HOW OFTEN DRANK ANY ALCOHOL IN LAST 12 MONTHS" S2AQ8B = "NUMBER OF DRINKS OF ANY ALCOHOL USUALLY CONSUMED ON DAYS WHEN DRANK ALCOHOL IN LAST 12 MONTHS" USFREQALCMO = "usual frequency of any alcohol drunk per month" NUMALCMO_EST = "NUMBER OF ANY ALCOHOL drunk per month" S3BD1Q2E = "HOW OFTEN USED SEDATIVES WHEN USING THE MOST" USFREQSEDATIVESMO = "usual frequency of any alcohol drunk per month" S1Q1F = "BORN IN UNITED STATES"
if cmiss(of _all_) then delete; /* delete observations with missing data on any of the variables in the NESARC dataset */
if ETHRACE2A=1 then WhiteGroup=1; else WhiteGroup=0; /* creation of a variable for white ethnicity coded for 0 for non white ethnicity and 1 for white ethnicity */
if ETHRACE2A=2 then BlackGroup=1; else BlackGroup=0; /* creation of a variable for black ethnicity coded for 0 for non black ethnicity and 1 for black ethnicity */
if ETHRACE2A=3 then NamericaGroup=1; else NamericaGrouGroup=0; /* same for native american ethnicity*/
if ETHRACE2A=4 then AsianGroup=1; else AsianGroup=0; /* same for asian ethnicity */
if ETHRACE2A=5 then HispanicGroup=1; else HispanicGroup=0; /* same for hispanic ethnicity */
if S3BD3Q2B = 9 then S3BD3Q2B = .; /* unknown observations set to missing data wrt usage of OPIOIDS IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS */
if S3BD3Q2B = 3 then OPIOIDSRegularConsumer = 1; if S3BD3Q2B = 1 or S3BD3Q2B = 2 then OPIOIDSRegularConsumer = 0; if S3BD3Q2B = . then OPIOIDSRegularConsumer = .; /* creation of a group variable where lifetime opioids consumers are coded to 1 and 0 for non lifetime opioids consumers */
if S3BD3Q2A = 99 then S3BD3Q2A = . ; /* unknown observations set to missing data wrt AGE FIRST USED OPIOIDS */
if S3BD5Q2B = 9 then S3BD5Q2B = .; /* unknown observations set to missing data wrt usage of CANNABIS IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS */
if S3BD5Q2B = 3 then CANNABISRegularConsumer = 1; if S3BD5Q2B = 1 or S3BD5Q2B = 2 then CANNABISRegularConsumer = 0; if S3BD5Q2B = . then CANNABISRegularConsumer = .; /* creation of a group variable where lifetime cannabis consumers are coded to 1 and 0 for non lifetime cannabis consumers */
if S3BD5Q2A = 99 then S3BD5Q2A = . ; /* unknown observations set to missing data wrt AGE FIRST USED CANNABIS */
if S3BD1Q2B = 9 then S3BD1Q2B = .; /* unknown observations set to missing data wrt usage of SEDATIVES IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS */
if S3BD1Q2B = 3 then SEDATIVESRegularConsumer = 1; if S3BD1Q2B = 1 or S3BD1Q2B = 2 then SEDATIVESRegularConsumer = 0; if S3BD1Q2B = . then SEDATIVESRegularConsumer = .; /* creation of a group variable where lifetime sedatives consumers are coded to 1 and 0 for non lifetime sedatives consumers */
if S3BD1Q2A = 99 then S3BD1Q2A = . ; /* unknown observations set to missing data wrt AGE FIRST USED sedatives */
if S3BD2Q2B = 9 then S3BD1Q2B = .; /* unknown observations set to missing data wrt usage of TRANQUILIZERS IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS */
if S3BD2Q2B = 3 then TRANQUILIZERSRegularConsumer = 1; if S3BD2Q2B = 1 or S3BD2Q2B = 2 then TRANQUILIZERSRegularConsumer = 0; if S3BD2Q2B = . then TRANQUILIZERSRegularConsumer = .; /* creation of a group variable where lifetime TRANQUILIZERS consumers are coded to 1 and 0 for non lifetime TRANQUILIZERS consumers */
if S3BD2Q2A = 99 then S3BD2Q2A = . ; /* unknown observations set to missing data wrt AGE FIRST USED TRANQUILIZERS */
if S3BD6Q2A = 9 then S3BD6Q2A = .; /* unknown observations set to missing data wrt usage of COCAINE IN THE LAST 12 MONTHS/PRIOR TO LAST 12 MONTHS/BOTH TIME PERIODS */
if S3BD6Q2B = 3 then COCAINERegularConsumer = 1; if S3BD6Q2B = 1 or S3BD6Q2B = 2 then COCAINERegularConsumer = 0; if S3BD6Q2B = . then COCAINERegularConsumer = .; /* creation of a group variable where lifetime COCAINE consumers are coded to 1 and 0 for non lifetime COCAINE consumers */
if S3BD6Q2A = 99 then S3BD2Q2A = . ; /* unknown observations set to missing data wrt AGE FIRST USED COCAINE */
if ALCABDEP12DX = 3 and ALCABDEPP12DX = 3 then LifetimeAlcAbuseDepy =1; else LifetimeAlcAbuseDepy = 0; /* creation of a group variable where consumers with lifetime alcohol abuse and dependence are coded to 1 and 0 for consumers with no lifetime alcohol abuse and dependence */
if S3AQ3C1=99 THEN S3AQ3C1=.;
IF S3AQ3B1=9 THEN SS3AQ3B1=.;
IF S3AQ3B1=1 THEN USFREQMO=30; ELSE IF S3AQ3B1=2 THEN USFREQMO=22; ELSE IF S3AQ3B1=3 THEN USFREQMO=14; ELSE IF S3AQ3B1=4 THEN USFREQMO=5; ELSE IF S3AQ3B1=5 THEN USFREQMO=2.5; ELSE IF S3AQ3B1=6 THEN USFREQMO=1; /* usual frequency of smoking per month */
NUMCIGMO_EST=USFREQMO*S3AQ3C1; /* number of cigarettes smoked per month */
if S2AQ8A=99 THEN S2AQ8A=.;
if S2AQ8B = 99 then S2AQ8B = . ;
IF S2AQ8A=1 THEN USFREQALCMO=30; ELSE IF S2AQ8A=2 THEN USFREQALCMO=30; ELSE IF S2AQ8A=3 THEN USFREQALCMO=14; ELSE IF S2AQ8A=4 THEN USFREQALCMO=8; ELSE IF S2AQ8A=5 THEN USFREQALCMO=4; ELSE IF S2AQ8A=6 THEN USFREQALCMO=2.5; ELSE IF S2AQ8A=7 THEN USFREQALCMO=1; ELSE IF S2AQ8A=8 THEN USFREQALCMO=0.75; ELSE IF S2AQ8A=9 THEN USFREQALCMO=0.375; ELSE IF S2AQ8A=10 THEN USFREQALCMO=0.125; /* usual frquency of alcohol drinking per month */
NUMALCMO_EST=USFREQALCMO*S2AQ8B; /* number of any alcohol drunk per month */
if S3BD1Q2E=99 THEN S3BD1Q2E=.;
IF S3BD1Q2E=1 THEN USFREQSEDATIVESMO=30; ELSE IF S3BD1Q2E=2 THEN USFREQSEDATIVESMO=30; ELSE IF S3BD1Q2E=3 THEN USFREQSEDATIVESMO=14; ELSE IF S3BD1Q2E=4 THEN USFREQSEDATIVESMO=6; ELSE IF S3BD1Q2E=5 THEN USFREQSEDATIVESMO=2.5; ELSE IF S3BD1Q2E=6 THEN USFREQSEDATIVESMO=1; ELSE IF S3BD1Q2E=7 THEN USFREQSEDATIVESMO=0.75; ELSE IF S3BD1Q2E=8 THEN USFREQSEDATIVESMO=0.375; ELSE IF S3BD1Q2E=9 THEN USFREQSEDATIVESMO=0.17; ELSE IF S3BD1Q2E=10 THEN USFREQSEDATIVESMO=0.083; /* usual frequency of seadtives assumption per month */
run;
ods graphics on; /* ODS graphics turned on to manage the output and displays in HTML */
proc surveyselect data=new out=traintest seed = 123 samprate=0.7 method=srs outall; run;
/* split data randomly into training data consisting of 70% of the total observations () test dataset consisting of the other 30% of the observations respectively)*/ /* data=new specifies the name of the managed input data set */ /* out equals the name of the randomly split output dataset, called traintest */ /* seed option to specify a random number seed to ensure that the data are randomly split the same way if the code being run again */ /* samprate command split the input data set so that 70% of the observations are being designated as training observations (the remaining 30% are being designated as test observations respectively) */ /* method=srs specifies that the data are to be split using simple random sampling */ /* outall option includes, both the training and test observations in a single output dataset which has a new variable called "selected", to indicate if an observation belongs to the training set, or the test set */
proc glmselect data=traintest plots=all seed=123; partition ROLE=selected(train='1' test='0'); /* glmselect procedure to test the lasso multiple regression w/ Least Angled Regression algorithm k=10 fold validation glmselect procedure standardize the predictor variables, so that they all have a mean equal to 0 and a standard deviation equal to 1, which places them all on the same scale data=traintest to use the randomly split dataset plots=all option to require that all plots associated w/ the lasso regression being printed seed option to allow to specify a random number seed, being used in the cross-validation process partition command to assign each observation a role, based on the variable called selected, indicating if the observation is a training or test observation. Observations with a value of 1 on the selected variable are assigned the role of training observation (observations with a value of 0, are assigned the role of test observation respectively) */
model NUMALCMO_EST = MAJORDEPLIFE WhiteGroup BlackGroup NamericaGroup AsianGroup HispanicGroup OPIOIDSRegularConsumer S3BD3Q2A CANNABISRegularConsumer S3BD5Q2A SEDATIVESRegularConsumer S3BD1Q2A TRANQUILIZERSRegularConsumer S3BD2Q2A COCAINERegularConsumer S3BD6Q2A TABLIFEDX LifetimeAlcAbuseDepy USFREQMO NUMCIGMO_EST USFREQALCMO S3BD1Q2E USFREQSEDATIVESMO/selection=lar(choose=cv stop=none) cvmethod=random(10); /* model command to specify the regression model for which the response variable, NUMALCMO_EST, is equal to the list of the 14 candidate predictor variables */ /* selection option to tell which method to use to compute the parameters for variable selection */ /*Least Angled Regression algorithm is being used */ /* choose=cv option to use cross validation for choosing the final statistical model */ /* stop=none to guarantee the model doesn't stop running until each of the candidate predictor variables is being tested */ /* cvmethod=random(10) to specify a K-fold cross-validation method with ten randomly selected folds is being used */ run;
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
0 notes
ramanidevi16 · 3 months
Text
Lasso Regression Analysis
To run a Lasso regression analysis, you will use a programming language like Python with appropriate libraries. Here’s a guide to help you complete this assignment:### Step 1: Prepare Your DataEnsure your data is ready for analysis, including explanatory variables and a quantitative response variable.###
Step 2: Import Necessary LibrariesFor this example, I’ll use Python and the `scikit-learn` library.#### Python```pythonimport pandas as pdimport numpy as npfrom sklearn.model_selection import train_test_split, KFold, cross_val_scorefrom sklearn.linear_model import LassoCVfrom sklearn.metrics import mean_squared_errorimport matplotlib.pyplot as plt```###
Step 3: Load Your Data```python# Load your datasetdata = pd.read_csv('your_dataset.csv')# Define explanatory variables (X) and response variable (y)X = data.drop('target_variable', axis=1)y = data['target_variable']```###
Step 4: Set Up k-Fold Cross-Validation```python# Define k-fold cross-validationkf = KFold(n_splits=5, shuffle=True, random_state=42)```###
Step 5: Train the Lasso Regression Model with Cross-Validation```python# Initialize and train the LassoCV modellasso = LassoCV(cv=kf, random_state=42)lasso.fit(X, y)```###
Step 6: Evaluate the Model```python# Evaluate the model's performancemse = mean_squared_error(y, lasso.predict(X))print(f'Mean Squared Error: {mse:.2f}')# Coefficients of the modelcoefficients = pd.Series(lasso.coef_, index=X.columns)print('Lasso Coefficients:')print(coefficients)```###
Step 7: Visualize the Coefficients```python# Plot non-zero coefficientsplt.figure(figsize=(10,6))coefficients[coefficients != 0].plot(kind='barh')plt.title('Lasso Regression Coefficients')plt.show()```### InterpretationAfter running the above code, you'll have the output from your model, including the mean squared error, coefficients of the model, and a plot of the non-zero coefficients. Here’s an example of how you might interpret the results:- **Mean Squared Error (MSE)**: This metric shows the average squared difference between the observed actual outcomes and the outcomes predicted by the model. A lower MSE indicates better model performance.- **Lasso Coefficients**: The coefficients show the importance of each feature in the model. Features with coefficients equal to zero are excluded from the model, while those with non-zero coefficients are retained. The bar plot visualizes these non-zero coefficients, indicating which features are most strongly associated with the response variable.
### Blog Entry SubmissionFor your blog entry, include:-
The code used to run the Lasso regression (as shown above).- Screenshots or text of the output (MSE, coefficients, and coefficient plot).- A brief interpretation of the results.If your dataset is small and you decide not to split it into training and test sets, provide a rationale for this decision in your summary. Ensure the content is clear and understandable for peers who may not be experts in the field. This will help them effectively assess your work.
0 notes
krishnamanohari2108 · 3 months
Text
Run a lasso regression analysis
To run a Lasso regression analysis, you will use a programming language like Python with appropriate libraries. Here’s a guide to help you complete this assignment:
Step 1: Prepare Your Data
Ensure your data is ready for analysis, including explanatory variables and a quantitative response variable.
Step 2: Import Necessary Libraries
For this example, I’ll use Python and the scikit-learn library.
Python
import pandas as pd import numpy as np from sklearn.model_selection import train_test_split, KFold, cross_val_score from sklearn.linear_model import LassoCV from sklearn.metrics import mean_squared_error import matplotlib.pyplot as plt
Step 3: Load Your Data
# Load your dataset data = pd.read_csv('your_dataset.csv') # Define explanatory variables (X) and response variable (y) X = data.drop('target_variable', axis=1) y = data['target_variable']
Step 4: Set Up k-Fold Cross-Validation
# Define k-fold cross-validation kf = KFold(n_splits=5, shuffle=True, random_state=42)
Step 5: Train the Lasso Regression Model with Cross-Validation
# Initialize and train the LassoCV model lasso = LassoCV(cv=kf, random_state=42) lasso.fit(X, y)
Step 6: Evaluate the Model
# Evaluate the model's performance mse = mean_squared_error(y, lasso.predict(X)) print(f'Mean Squared Error: {mse:.2f}') # Coefficients of the model coefficients = pd.Series(lasso.coef_, index=X.columns) print('Lasso Coefficients:') print(coefficients)
Step 7: Visualize the Coefficients
# Plot non-zero coefficients plt.figure(figsize=(10,6)) coefficients[coefficients != 0].plot(kind='barh') plt.title('Lasso Regression Coefficients') plt.show()
Interpretation
After running the above code, you'll have the output from your model, including the mean squared error, coefficients of the model, and a plot of the non-zero coefficients. Here’s an example of how you might interpret the results:
Mean Squared Error (MSE): This metric shows the average squared difference between the observed actual outcomes and the outcomes predicted by the model. A lower MSE indicates better model performance.
Lasso Coefficients: The coefficients show the importance of each feature in the model. Features with coefficients equal to zero are excluded from the model, while those with non-zero coefficients are retained. The bar plot visualizes these non-zero coefficients, indicating which features are most strongly associated with the response variable.
Blog Entry Submission
For your blog entry, include:
The code used to run the Lasso regression (as shown above).
Screenshots or text of the output (MSE, coefficients, and coefficient plot).
A brief interpretation of the results.
If your dataset is small and you decide not to split it into training and test sets, provide a rationale for this decision in your summary. Ensure the content is clear and understandable for peers who may not be experts in the field. This will help them effectively assess your work.
0 notes
shwetha18112002 · 3 months
Text
To run a Lasso regression analysis, you will use a programming language like Python with appropriate libraries. Here’s a guide to help you complete this assignment:### Step 1: Prepare Your DataEnsure your data is ready for analysis, including explanatory variables and a quantitative response variable.### Step 2: Import Necessary LibrariesFor this example, I’ll use Python and the `scikit-learn` library.#### Python```pythonimport pandas as pdimport numpy as npfrom sklearn.model_selection import train_test_split, KFold, cross_val_scorefrom sklearn.linear_model import LassoCVfrom sklearn.metrics import mean_squared_errorimport matplotlib.pyplot as plt```### Step 3: Load Your Data```python# Load your datasetdata = pd.read_csv('your_dataset.csv')# Define explanatory variables (X) and response variable (y)X = data.drop('target_variable', axis=1)y = data['target_variable']```### Step 4: Set Up k-Fold Cross-Validation```python# Define k-fold cross-validationkf = KFold(n_splits=5, shuffle=True, random_state=42)```### Step 5: Train the Lasso Regression Model with Cross-Validation```python# Initialize and train the LassoCV modellasso = LassoCV(cv=kf, random_state=42)lasso.fit(X, y)```### Step 6: Evaluate the Model```python# Evaluate the model's performancemse = mean_squared_error(y, lasso.predict(X))print(f'Mean Squared Error: {mse:.2f}')# Coefficients of the modelcoefficients = pd.Series(lasso.coef_, index=X.columns)print('Lasso Coefficients:')print(coefficients)```### Step 7: Visualize the Coefficients```python# Plot non-zero coefficientsplt.figure(figsize=(10,6))coefficients[coefficients != 0].plot(kind='barh')plt.title('Lasso Regression Coefficients')plt.show()```### InterpretationAfter running the above code, you'll have the output from your model, including the mean squared error, coefficients of the model, and a plot of the non-zero coefficients. Here’s an example of how you might interpret the results:- **Mean Squared Error (MSE)**: This metric shows the average squared difference between the observed actual outcomes and the outcomes predicted by the model. A lower MSE indicates better model performance.- **Lasso Coefficients**: The coefficients show the importance of each feature in the model. Features with coefficients equal to zero are excluded from the model, while those with non-zero coefficients are retained. The bar plot visualizes these non-zero coefficients, indicating which features are most strongly associated with the response variable.### Blog Entry SubmissionFor your blog entry, include:- The code used to run the Lasso regression (as shown above).- Screenshots or text of the output (MSE, coefficients, and coefficient plot).- A brief interpretation of the results.If your dataset is small and you decide not to split it into training and test sets, provide a rationale for this decision in your summary. Ensure the content is clear and understandable for peers who may not be experts in the field. This will help them effectively assess your work.
0 notes
ratthika · 3 months
Text
To run a Lasso regression analysis, you will use a programming language like Python with appropriate libraries. Here’s a guide to help you complete this assignment:### Step 1: Prepare Your DataEnsure your data is ready for analysis, including explanatory variables and a quantitative response variable.### Step 2: Import Necessary LibrariesFor this example, I’ll use Python and the `scikit-learn` library.#### Python```pythonimport pandas as pdimport numpy as npfrom sklearn.model_selection import train_test_split, KFold, cross_val_scorefrom sklearn.linear_model import LassoCVfrom sklearn.metrics import mean_squared_errorimport matplotlib.pyplot as plt```### Step 3: Load Your Data```python# Load your datasetdata = pd.read_csv('your_dataset.csv')# Define explanatory variables (X) and response variable (y)X = data.drop('target_variable', axis=1)y = data['target_variable']```### Step 4: Set Up k-Fold Cross-Validation```python# Define k-fold cross-validationkf = KFold(n_splits=5, shuffle=True, random_state=42)```### Step 5: Train the Lasso Regression Model with Cross-Validation```python# Initialize and train the LassoCV modellasso = LassoCV(cv=kf, random_state=42)lasso.fit(X, y)```### Step 6: Evaluate the Model```python# Evaluate the model's performancemse = mean_squared_error(y, lasso.predict(X))print(f'Mean Squared Error: {mse:.2f}')# Coefficients of the modelcoefficients = pd.Series(lasso.coef_, index=X.columns)print('Lasso Coefficients:')print(coefficients)```### Step 7: Visualize the Coefficients```python# Plot non-zero coefficientsplt.figure(figsize=(10,6))coefficients[coefficients != 0].plot(kind='barh')plt.title('Lasso Regression Coefficients')plt.show()```### InterpretationAfter running the above code, you'll have the output from your model, including the mean squared error, coefficients of the model, and a plot of the non-zero coefficients. Here’s an example of how you might interpret the results:- **Mean Squared Error (MSE)**: This metric shows the average squared difference between the observed actual outcomes and the outcomes predicted by the model. A lower MSE indicates better model performance.- **Lasso Coefficients**: The coefficients show the importance of each feature in the model. Features with coefficients equal to zero are excluded from the model, while those with non-zero coefficients are retained. The bar plot visualizes these non-zero coefficients, indicating which features are most strongly associated with the response variable.### Blog Entry SubmissionFor your blog entry, include:- The code used to run the Lasso regression (as shown above).- Screenshots or text of the output (MSE, coefficients, and coefficient plot).- A brief interpretation of the results.If your dataset is small and you decide not to split it into training and test sets, provide a rationale for this decision in your summary. Ensure the content is clear and understandable for peers who may not be experts in the field. This will help them effectively assess your work.
0 notes
divya08112002 · 3 months
Text
To run a Lasso regression analysis, you will use a programming language like Python with appropriate libraries. Here’s a guide to help you complete this assignment:
Step 1: Prepare Your Data
Ensure your data is ready for analysis, including explanatory variables and a quantitative response variable.
Step 2: Import Necessary Libraries
For this example, I’ll use Python and the scikit-learn library.
Python
import pandas as pd import numpy as np from sklearn.model_selection import train_test_split, KFold, cross_val_score from sklearn.linear_model import LassoCV from sklearn.metrics import mean_squared_error import matplotlib.pyplot as plt
Step 3: Load Your Data
# Load your dataset data = pd.read_csv('your_dataset.csv') # Define explanatory variables (X) and response variable (y) X = data.drop('target_variable', axis=1) y = data['target_variable']
Step 4: Set Up k-Fold Cross-Validation
# Define k-fold cross-validation kf = KFold(n_splits=5, shuffle=True, random_state=42)
Step 5: Train the Lasso Regression Model with Cross-Validation
# Initialize and train the LassoCV model lasso = LassoCV(cv=kf, random_state=42) lasso.fit(X, y)
Step 6: Evaluate the Model
# Evaluate the model's performance mse = mean_squared_error(y, lasso.predict(X)) print(f'Mean Squared Error: {mse:.2f}') # Coefficients of the model coefficients = pd.Series(lasso.coef_, index=X.columns) print('Lasso Coefficients:') print(coefficients)
Step 7: Visualize the Coefficients
# Plot non-zero coefficients plt.figure(figsize=(10,6)) coefficients[coefficients != 0].plot(kind='barh') plt.title('Lasso Regression Coefficients') plt.show()
Interpretation
After running the above code, you'll have the output from your model, including the mean squared error, coefficients of the model, and a plot of the non-zero coefficients. Here’s an example of how you might interpret the results:
Mean Squared Error (MSE): This metric shows the average squared difference between the observed actual outcomes and the outcomes predicted by the model. A lower MSE indicates better model performance.
Lasso Coefficients: The coefficients show the importance of each feature in the model. Features with coefficients equal to zero are excluded from the model, while those with non-zero coefficients are retained. The bar plot visualizes these non-zero coefficients, indicating which features are most strongly associated with the response variable.
Blog Entry Submission
For your blog entry, include:
The code used to run the Lasso regression (as shown above).
Screenshots or text of the output (MSE, coefficients, and coefficient plot).
A brief interpretation of the results.
If your dataset is small and you decide not to split it into training and test sets, provide a rationale for this decision in your summary. Ensure the content is clear and understandable for peers who may not be experts in the field. This will help them effectively assess your work.
0 notes
varsha172003 · 3 months
Text
To run a Lasso regression analysis, you will use a programming language like Python with appropriate libraries. Here’s a guide to help you complete this assignment:
Step 1: Prepare Your Data
Ensure your data is ready for analysis, including explanatory variables and a quantitative response variable.
Step 2: Import Necessary Libraries
For this example, I’ll use Python and the scikit-learn library.
Python
import pandas as pd import numpy as np from sklearn.model_selection import train_test_split, KFold, cross_val_score from sklearn.linear_model import LassoCV from sklearn.metrics import mean_squared_error import matplotlib.pyplot as plt
Step 3: Load Your Data
# Load your dataset data = pd.read_csv('your_dataset.csv') # Define explanatory variables (X) and response variable (y) X = data.drop('target_variable', axis=1) y = data['target_variable']
Step 4: Set Up k-Fold Cross-Validation
# Define k-fold cross-validation kf = KFold(n_splits=5, shuffle=True, random_state=42)
Step 5: Train the Lasso Regression Model with Cross-Validation
# Initialize and train the LassoCV model lasso = LassoCV(cv=kf, random_state=42) lasso.fit(X, y)
Step 6: Evaluate the Model
# Evaluate the model's performance mse = mean_squared_error(y, lasso.predict(X)) print(f'Mean Squared Error: {mse:.2f}') # Coefficients of the model coefficients = pd.Series(lasso.coef_, index=X.columns) print('Lasso Coefficients:') print(coefficients)
Step 7: Visualize the Coefficients
# Plot non-zero coefficients plt.figure(figsize=(10,6)) coefficients[coefficients != 0].plot(kind='barh') plt.title('Lasso Regression Coefficients') plt.show()
Interpretation
After running the above code, you'll have the output from your model, including the mean squared error, coefficients of the model, and a plot of the non-zero coefficients. Here’s an example of how you might interpret the results:
Mean Squared Error (MSE): This metric shows the average squared difference between the observed actual outcomes and the outcomes predicted by the model. A lower MSE indicates better model performance.
Lasso Coefficients: The coefficients show the importance of each feature in the model. Features with coefficients equal to zero are excluded from the model, while those with non-zero coefficients are retained. The bar plot visualizes these non-zero coefficients, indicating which features are most strongly associated with the response variable.
Blog Entry Submission
For your blog entry, include:
The code used to run the Lasso regression (as shown above).
Screenshots or text of the output (MSE, coefficients, and coefficient plot).
A brief interpretation of the results.
If your dataset is small and you decide not to split it into training and test sets, provide a rationale for this decision in your summary. Ensure the content is clear and understandable for peers who may not be experts in the field. This will help them effectively assess your work.
0 notes
melolon · 3 months
Text
Running a Lasso Regression Analysis
Tumblr media
Target response variable: school connectedness (SCHCONN1)
Of the 23 predictor variables, 18 were retained in the selected model. (WHITE, ALCPROBS1, INHEVER1, PASSIST, PARPRES were removed in this model)
Self-esteem (ESTEEM1) is most positively associated with school connectedness (SCHCONN1).
Depression(DEP1) is most negatively associated with school connectedness (SCHCONN1).
Tumblr media
0 notes
sanaablog1 · 3 months
Text
Assignment 3
Lasso regression analysis for shrinkage variable selection in the banking system
Introduction:
A personal equity plan (PEP) was an investment plan introduced in the United Kingdom that encouraged people over the age of 18 to invest in British companies. Participants could invest in shares, authorized unit trusts, or investment trusts and receive both income and capital gains free of tax. The PEP was designed to encourage investment by individuals. Banks engage in data analysis related to Personal Equity Plans (PEPs) for various reasons. They use it to assess the risk associated with these investment plans. By examining historical performance, market trends, and individual investor behavior, banks can make informed decisions about offering PEPs to their clients.
In general, banks analyze PEP-related data to make informed investment decisions, comply with regulations, and tailor their offerings to customer needs. The goal is to provide equitable opportunities for investors while managing risks effectively. 
SAS Code
LIBNAME mylib "/home/u63879373";
proc import out=mylib.mydata datafile='/home/u63879373/bank.csv' dbms=CSV replace;
proc print data=mylib.mydata;
run;
/*********DATA MANAGEMENT****************/
data new; set mylib.mydata; /*using numerical values for new variables*/
if pep="YES" then res=1; else res=0;
if sex="MALE" then gender=1; else gender=0;
if married="YES" then status=1; else status=0;
if car="YES" then cars=1; else cars=0;
if save_act="YES" then save=1; else save=0;
if current_act="YES" then current=1; else current=0;
if mortgage="YES" then mortg=1; else mortg=0;
ods graphics on;
* Split data randomly into test and training data;
proc surveyselect data=new out=traintest seed = 123 samprate=0.7 method=srs outall;
Run;  
* lasso multiple regression with lars algorithm k=10 fold validation;
proc glmselect data=traintest plots=all seed=123;
     partition ROLE=selected(train='1' test='0');
     model res=age gender income status children cars save
     current mortg/selection=lar(choose=cv stop=none) cvmethod=random(10);
RUN;
Dataset
The dataset I used in this assignment contains information about customers in a bank. The Data analysis used will help the bank take know the important features that can affect the PEP of a client from the following features: age, sex, region, income, married, children, car, save_act, current_act and the mortgage.
Id: a unique identification number,
age: age of customer in years (numeric),
income: income of customer (numeric)
sex: MALE / FEMALE
married: is the customer married (YES/NO)
children: number of children (numeric)
car: does the customer own a car (YES/NO)
save_acct: does the customer have a saving account (YES/NO)
current_acct: does the customer have a current account (YES/NO)
mortgage: does the customer have a mortgage (YES/NO)
Figure1: dataset
Tumblr media
Lasso Regression
LASSO is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients variables are most strongly associated with the response variable. Explanatory variables can be either quantitative, categorical or both.
In this assignement, we will use Lasso on the above dataset to identify a set of variables that best predicts the binary response variable which is the pep for the bank customers.
Lasso Regression Analysis
I used the libname statement to call my dataset.
data management: I created new variables that will hold numerical values instead of the string values we had in the table:
for the pep variable res=1; if pep is yes and res=0 otherwise;
if sex="MALE" then gender=1; else gender=0; 0 for Female and 1 for Male
if married="YES" then status=1; else status=0; the customer is married or not
if car="YES" then cars=1; else cars=0; the customer has a car or not
if save_act="YES" then save=1; else save=0; the customet has a saving account or not
if current_act="YES" then current=1; else current=0; the customer has a current account or not
if mortgage="YES" then mortg=1; else mortg=0; the customer has a mortgage or not
The SURVEYSELECT procedure
ODS graphics is turned on for the SAS to plot graphics. The data set is randomly split into a training data set consisting of 70% of the total observations in the data set, and a test data set consisting of the other 30% of the observations as seen in the figure below.
Figure 2
Tumblr media
The GLMSELECT procedure
This figure shows information about the Lasso regression. The dependent Variable, res (which is the pep in the bank table), and the LAR selection method used. It also shows that I used as a criterion for choosing the best model, K equals 10-fold cross validation, with random assignments of observations to the folds.
Figure 3
Tumblr media
The GLMSELECT procedure
The figure below shows the total number of observations in the data set, 600 observations, which is the same as the number of observations used for training and testing the statistical models.
It shows also the number of parameters to be estimated which is 10 for the intercept plus the 9 predictors.
Figure 4
Tumblr media
The LAR Selection summary
The next figure shows information about the LAR selection. It shows the steps in the analysis and the variable that is entered at each step. We can see that the variables are entered as follows, income then status (married or not), then age, and so on.
Figure 5
Tumblr media
At the beginning, there are no predictors in the model. Just the intercept. Then variables are entered one at a time in order of the magnitude of the reduction in the mean, or average squared error (ASE). The variables are ordered in terms of how important they are in predicting the customer’s PEP. From the figure and according to the lasso regression results, we can see that the most important predictor of the PEP is the customer’s income, then whether the customer is married or not then the age and so on. You can also see how the average square error declines as variables are added to the model, indicating that the prediction accuracy improves as each variable is added to the model.
The CV PRESS shows the sum of the residual sum of squares in the test data set. There's an asterisk at step 6. This is the model selected as the best model by the procedure since it is the model with the lowest summed residual sum of squares and that adding other variables to this model, actually increases it.
Finally, we can see that the training data ASE continues to decline as variables are added. This is to be expected as model complexity increases.
Coefficient Progression
Next figure shows the change in the regression coefficients at each step, and the vertical line represents the selected model. The plot shows the relative importance of the predictor selected at any step of the selection process, how the regression coefficients changed with the addition of a new predictor at each step as well as the steps at which each variable entered the model. For example, as also indicated in the summary table above, the customer’s income has the largest regression coefficient, followed by married then the age.
We can also see that the number of children and whether the customer has a mortgage or not are negatively associated with the PEP.
The lower plot shows how the chosen selection criterion, in this example CVPRESS, which is the residual sum of squares summed across all the cross-validation folds in the training set, changes as variables are added to the model.
Initially, it decreases rapidly and then levels off to a point in which adding more predictors doesn't lead to much production in the residual sum of squares.
Figure 6
Tumblr media
Fit criteria for pep
The figure plot shows at which step in the selection process different selection criteria would choose the best model. Interestingly, the other, criteria selected more complex models, and the criterion based on cross validation, possibly selecting an overfitted model.
Figure 7
Tumblr media
Progression of average square errors by Role for the PEP (res)
The next figure shows the change in the average or mean square error at each step in the process. As expected, the selected model was less accurate in predicting the PEP in the test data, but the test average squared error at each step was pretty close to the training average squared error especially for the first important variables.
Figure 8
Tumblr media
Analysis of variance
The figure below shows the R-Square (0.1256) and adjusted R-Square (0.1129) for the selected model and the mean square error for both the training (0.21)and test data (0.23).
Figure 9
Tumblr media
Parameter estimates
The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. So, we can see from the results that the two variables current_act and car variables were excluded as they are not associated with the PEP response variable. The other variables age, sex(gender), and income variables are most strongly associated with the response variable.
Tumblr media
Conclusion
Lasso regression is a supervised machine learning method often used in machine learning to select the subset of variables. It is a Shrinkage and Variable Selection method for linear regression models. Lasso Regression is useful because shrinking the regression coefficient can reduce variance without a substantial increase in bias. However, without human intervention, variable selection methods could produce models that make little sense or that turn out to be useless for accurately predicting a response variable in the real world. If machine learning, human intervention, and independent application are combined, we will get better results and therefore better predictions.
0 notes
chaolinchang · 3 months
Text
Running a Lasso Regression Analysis on diabetes
# Import necessary libraries
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
# Load the diabetes dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Define the Lasso regression model with cross-validation
lasso = LassoCV(cv=10, random_state=42)
# Fit the model
lasso.fit(X_scaled, y)
# Get the coefficients
lasso_coef = lasso.coef_
# Get the best alpha (regularization strength)
best_alpha = lasso.alpha_
# Print the results
print("Best alpha (regularization strength):", best_alpha)
print("Lasso coefficients:")
for feature, coef in zip(diabetes.feature_names, lasso_coef):
print(f"{feature}: {coef:.4f}")
# Summary of results
print("\nNon-zero coefficients indicate the variables that are most strongly associated with the response variable.")
Tumblr media
0 notes
deploy111 · 4 months
Text
Task
This week’s assignment involves running a lasso regression analysis. Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients variables are most strongly associated with the response variable. Explanatory variables can be either quantitative, categorical or both.
Your assignment is to run a lasso regression analysis using k-fold cross validation to identify a subset of predictors from a larger pool of predictor variables that best predicts a quantitative response variable.
Data
Dataset description: hourly rental data spanning two years.
Dataset can be found at Kaggle
Features:
yr - year
mnth - month
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weathersit - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
hum - relative humidity
windspeed (mph) - wind speed, miles per hour
windspeed (ms) - wind speed, metre per second
Target:
cnt - number of total rentals
Results
A lasso regression analysis was conducted to predict a number of total bikes rentals from a pool of 12 categorical and quantitative predictor variables that best predicted a quantitative response variable. Categorical predictors included weather condition and a series of 2 binary categorical variables for holiday and workingday to improve interpretability of the selected model with fewer predictors. Quantitative predictor variables include year, month, temperature, humidity and wind speed.
Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
Of the 12 predictor variables, 10 were retained in the selected model:
atemp: 63.56915200306693
holiday: -282.431748735072
hum: -12.815264427009353
mnth: 0.0
season: 381.77762475080044
temp: 58.035647703871234
weathersit: -514.6381162101678
weekday: 69.84812053893549
windspeed(mph): 0.0
windspeed(ms): -95.71090321577515
workingday: 36.15135752613271
yr: 2091.5182927517903
Train data R-square 0.7899877818517489 Test data R-square 0.8131871527614188
During the estimation process, year and season were most strongly associated with the number of total bikes rentals, followed by temperature and weekday. Holiday, humidity, weather condition and wind speed (ms) were negatively associated with the number of total bikes rentals.
In [1]:import pandas as pd import numpy as np import matplotlib.pylab as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LassoLarsCV from sklearn import preprocessing from sklearn.metrics import mean_squared_error import seaborn as sns %matplotlib inline rnd_state = 983
In [2]:data = pd.read_csv("data/bikes_rent.csv") data.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 731 entries, 0 to 730 Data columns (total 13 columns): season 731 non-null int64 yr 731 non-null int64 mnth 731 non-null int64 holiday 731 non-null int64 weekday 731 non-null int64 workingday 731 non-null int64 weathersit 731 non-null int64 temp 731 non-null float64 atemp 731 non-null float64 hum 731 non-null float64 windspeed(mph) 731 non-null float64 windspeed(ms) 731 non-null float64 cnt 731 non-null int64 dtypes: float64(5), int64(8) memory usage: 74.3 KB
In [3]:data.describe()
Out[3]:seasonyrmnthholidayweekdayworkingdayweathersittempatemphumwindspeed(mph)windspeed(ms)cntcount731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000731.000000mean2.4965800.5006846.5198360.0287282.9972640.6839951.39534920.31077623.71769962.78940612.7625765.7052204504.348837std1.1108070.5003423.4519130.1671552.0047870.4652330.5448947.5050918.14805914.2429105.1923572.3211251937.211452min1.0000000.0000001.0000000.0000000.0000000.0000001.0000002.4243463.9534800.0000001.5002440.67065022.00000025%2.0000000.0000004.0000000.0000001.0000000.0000001.00000013.82042416.89212552.0000009.0416504.0418643152.00000050%3.0000001.0000007.0000000.0000003.0000001.0000001.00000020.43165324.33665062.66670012.1253255.4203514548.00000075%3.0000001.00000010.0000000.0000005.0000001.0000002.00000026.87207730.43010073.02085015.6253716.9849675956.000000max4.0000001.00000012.0000001.0000006.0000001.0000003.00000035.32834742.04480097.25000034.00002115.1989378714.000000
In [4]:data.head()
Out[4]:seasonyrmnthholidayweekdayworkingdayweathersittempatemphumwindspeed(mph)windspeed(ms)cnt0101060214.11084718.1812580.583310.7498824.8054909851101000214.90259817.6869569.608716.6521137.443949801210101118.0509249.4702543.727316.6367037.4370601349310102118.20000010.6061059.043510.7398324.8009981562410103119.30523711.4635043.695712.5223005.5978101600
In [5]:data.dropna(inplace=True)
In [17]:fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(20, 10)) for idx, feature in enumerate(data.columns.values[:-1]): data.plot(feature, 'cnt', subplots=True, kind='scatter', ax=axes[int(idx / 4), idx % 4], c='#87486e');
The plot above shows that there is a linear dependence between temp, atemp and cnt features. The correlations below confirm that observation.
In [7]:data.iloc[:, :12].corrwith(data['cnt'])
Out[7]:season 0.406100 yr 0.566710 mnth 0.279977 holiday -0.068348 weekday 0.067443 workingday 0.061156 weathersit -0.297391 temp 0.627494 atemp 0.631066 hum -0.100659 windspeed(mph) -0.234545 windspeed(ms) -0.234545 dtype: float64
In [8]:plt.figure(figsize=(15, 5)) sns.heatmap(data[['temp', 'atemp', 'hum', 'windspeed(mph)', 'windspeed(ms)', 'cnt']].corr(), annot=True, fmt='1.4f');
There is a strong correlation between temp and atemp, as well as windspeed(mph) and windspeed(ms) features, due to the fact that they represent similar metrics in different measures. In further analysis two of those features must be dropped or applyed with penalty (L2 or Lasso regression).
In [9]:predictors = data.iloc[:, :12] target = data['cnt']
In [10]:(predictors_train, predictors_test, target_train, target_test) = train_test_split(predictors, target, test_size = .3, random_state = rnd_state)
In [11]:model = LassoLarsCV(cv=10, precompute=False).fit(predictors_train, target_train)
In [12]:dict(zip(predictors.columns, model.coef_))
Out[12]:{'atemp': 63.56915200306693, 'holiday': -282.431748735072, 'hum': -12.815264427009353, 'mnth': 0.0, 'season': 381.77762475080044, 'temp': 58.035647703871234, 'weathersit': -514.6381162101678, 'weekday': 69.84812053893549, 'windspeed(mph)': 0.0, 'windspeed(ms)': -95.71090321577515, 'workingday': 36.15135752613271, 'yr': 2091.5182927517903}
In [13]:log_alphas =-np.log10(model.alphas_) plt.figure(figsize=(10, 5)) for idx, feature in enumerate(predictors.columns): plt.plot(log_alphas, list(map(lambda r: r[idx], model.coef_path_.T)), label=feature) plt.legend(loc="upper right", bbox_to_anchor=(1.4, 0.95)) plt.xlabel("-log10(alpha)") plt.ylabel("Feature weight") plt.title("Lasso");
In [14]:log_cv_alphas =-np.log10(model.cv_alphas_) plt.figure(figsize=(10, 5)) plt.plot(log_cv_alphas, model.mse_path_, ':') plt.plot(log_cv_alphas, model.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2) plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV') plt.legend() plt.xlabel('-log10(alpha)') plt.ylabel('Mean squared error') plt.title('Mean squared error on each fold');
In [16]:rsquared_train = model.score(predictors_train, target_train) rsquared_test = model.score(predictors_test, target_test) print('Train data R-square', rsquared_train) print('Test data R-square', rsquared_test) Train data R-square 0.7899877818517489 Test data R-square 0.8131871527614188
0 notes
Text
Machine Learning for Data Analysis
Week 3: Running a Lasso Regression Analysis
Continuing on the machine learning analysis of internet use rate from the GapMinder dataset, I conducted a lasso regression analysis to identify a subset of variables from a pool of 10 quantitative predictor variables that best predicted a quantitative response variable measuring the internet use rates of the countries in the world. I have added several variables to my standard analysis that are not particularly interesting to my main question of how internet use rates of a country affects income in order to have more variables available for this lasso regression. The explanatory variables I have used in this model are income per person, employment rate, female employment rate, polity score, alcohol consumption, life expectancy, oil per person, electricity use per person, and urban rate. All variables have been normalized to have a mean of zero and standard deviation of one.
Load the data, convert all variables to numeric, and discard NA values
In [1]:'import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.cross_validation import train_test_split from sklearn.linear_model import LassoLarsCV data = pd.read_csv('c:/users/greg/desktop/gapminder.csv', low_memory=False) data['internetuserate'] = pd.to_numeric(data['internetuserate'], errors='coerce') data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce') data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce') data['femaleemployrate'] = pd.to_numeric(data['femaleemployrate'], errors='coerce') data['polityscore'] = pd.to_numeric(data['polityscore'], errors='coerce') data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce') data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce') data['oilperperson'] = pd.to_numeric(data['oilperperson'], errors='coerce') data['relectricperperson'] = pd.to_numeric(data['relectricperperson'], errors='coerce') data['urbanrate'] = pd.to_numeric(data['urbanrate'], errors='coerce') sub1 = data.copy() data_clean = sub1.dropna()
Select predictor variables and target variable as separate data sets
In [3]:predvar = data_clean[['incomeperperson','employrate','femaleemployrate','polityscore', 'alcconsumption', 'lifeexpectancy', 'oilperperson', 'relectricperperson', 'urbanrate']] target = data_clean.internetuserate
Standardize predictors to have mean = 0 and standard deviation = 1
In [4]:predictors=predvar.copy() from sklearn import preprocessing predictors['incomeperperson']=preprocessing.scale(predictors['incomeperperson'].astype('float64')) predictors['employrate']=preprocessing.scale(predictors['employrate'].astype('float64')) predictors['femaleemployrate']=preprocessing.scale(predictors['femaleemployrate'].astype('float64')) predictors['polityscore']=preprocessing.scale(predictors['polityscore'].astype('float64')) predictors['alcconsumption']=preprocessing.scale(predictors['alcconsumption'].astype('float64')) predictors['lifeexpectancy']=preprocessing.scale(predictors['lifeexpectancy'].astype('float64')) predictors['oilperperson']=preprocessing.scale(predictors['oilperperson'].astype('float64')) predictors['relectricperperson']=preprocessing.scale(predictors['relectricperperson'].astype('float64')) predictors['urbanrate']=preprocessing.scale(predictors['urbanrate'].astype('float64'))
Split data into train and test sets
In [6]:pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
Specify the lasso regression model
In [7]:model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
Print the regression coefficients
In [9]:dict(zip(predictors.columns, model.coef_))
Out[9]:{'alcconsumption': 6.2210718136158443, 'employrate': 0.0, 'femaleemployrate': 0.0, 'incomeperperson': 10.730391071065633, 'lifeexpectancy': 7.9415161171462634, 'oilperperson': 0.0, 'polityscore': 0.33239766774625268, 'relectricperperson': 3.3633566029800468, 'urbanrate': 1.1025066401058063}
Plot coefficient progression
In [12]:m_log_alphas =-np.log10(model.alphas_) ax = plt.gca() plt.plot(m_log_alphas, model.coef_path_.T) plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV') plt.ylabel('Regression Coefficients') plt.xlabel('-log(alpha)') plt.title('Regression Coefficients Progression for Lasso Paths') plt.show()
Tumblr media
Plot mean square error for each fold
In [13]:m_log_alphascv =-np.log10(model.cv_alphas_) plt.figure() plt.plot(m_log_alphascv, model.cv_mse_path_, ':') plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2) plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV') plt.legend() plt.xlabel('-log(alpha)') plt.ylabel('Mean squared error') plt.title('Mean squared error on each fold') plt.show()
Tumblr media
Print the mean squared error from training and test data
In [17]:from sklearn.metrics import mean_squared_error train_error = mean_squared_error(tar_train, model.predict(pred_train)) test_error = mean_squared_error(tar_test, model.predict(pred_test)) print ('training data MSE') print(train_error) print ('') print ('test data MSE') print(test_error) training data MSE 100.103936002 test data MSE 120.568970231
Print the r-squared from training and test data
In [18]:rsquared_train=model.score(pred_train,tar_train) rsquared_test=model.score(pred_test,tar_test) print ('training data R-square') print(rsquared_train) print ('') print ('test data R-square') print(rsquared_test) training data R-square 0.861344142378 test data R-square 0.776942580854
Data were randomly split into a training set that included 70% of the observations (N=42) and a test set that included 30% of the observations (N=18). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
Of the 10 predictor variables, 6 were retained in the model. During the estimation process, income per person and life expectancy were most strongly associated with internet use rate, followed by alcohol consumption and electricity use per person. The last two predictors were urban rate and polity score. All variables were positively correlated with internet use rate. These 6 variables accounted for 77.7% of the variance in the internet use rate response variable.
0 notes