melaniegonzaalez - Tumblr blog

melaniegonzaalez · 4 years ago

CODE:

/* No LIBNAME statement is needed: "WORK.IMPORTED" is a data set name, not a directory path, and the steps below read WORK.IMPORTED directly. */

data clust;

set imported;

IDNUM=_n_;

keep IDNUM S1Q6A SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1 S3EQ1 S3EQ2 NUMREL NONREL S1Q2E;

run;

ods graphics on;

proc surveyselect data=clust out=traintest seed = 138

samprate=0.7 method=srs outall;

run;

data clus_train;

set traintest;

if selected=1;

run;

data clus_test;

set traintest;

if selected=0;

run;

proc standard data=clus_train out=clustvar mean=0 std=1;

var SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1 S3EQ1 S3EQ2 NUMREL NONREL S1Q2E;

run;

%macro kmean(K);

proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300;

var SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1 S3EQ1 S3EQ2 NUMREL NONREL S1Q2E;

run;

%mend;

%kmean(1);

%kmean(2);

%kmean(3);

%kmean(4);

%kmean(5);

%kmean(6);

%kmean(7);

%kmean(8);

%kmean(9);

data clus1;

set cluststat1;

nclust=1;

if _type_='RSQ';

keep nclust over_all;

run;

data clus2;

set cluststat2;

nclust=2;

if _type_='RSQ';

keep nclust over_all;

run;

data clus3;

set cluststat3;

nclust=3;

if _type_='RSQ';

keep nclust over_all;

run;

data clus4;

set cluststat4;

nclust=4;

if _type_='RSQ';

keep nclust over_all;

run;

data clus5;

set cluststat5;

nclust=5;

if _type_='RSQ';

keep nclust over_all;

run;

data clus6;

set cluststat6;

nclust=6;

if _type_='RSQ';

keep nclust over_all;

run;

data clus7;

set cluststat7;

nclust=7;

if _type_='RSQ';

keep nclust over_all;

run;

data clus8;

set cluststat8;

nclust=8;

if _type_='RSQ';

keep nclust over_all;

run;

data clus9;

set cluststat9;

nclust=9;

if _type_='RSQ';

keep nclust over_all;

run;

data clusrsquare;

set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9;

run;

symbol1 color=blue interpol=join;

proc gplot data=clusrsquare;

plot over_all*nclust;

run;

proc candisc data=outdata4 out=clustcan;

class cluster;

var SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1 S3EQ1 S3EQ2 NUMREL NONREL S1Q2E;

run;

proc sgplot data=clustcan;

scatter y=can2 x=can1 / group=cluster;

run;

data school_data;

set clus_train;

keep IDNUM S1Q6A;

run;

proc sort data=outdata4;

by IDNUM;

run;

proc sort data=school_data;

by IDNUM;

run;

data merged;

merge outdata4 school_data;

by IDNUM;

run;

proc sort data=merged;

by cluster;

run;

proc means data=merged;

var S1Q6A;

by cluster;

run;

proc anova data=merged;

class cluster;

model S1Q6A = cluster;

means cluster/tukey;

run;

This k-means cluster analysis identifies underlying subgroups of respondents based on the similarity of their responses on 13 variables representing characteristics that could affect the highest grade or year of school completed. The clustering variables included the binary variables gender, ethnicity (Hispanic/Latino origin, American Indian, Asian, black, white), whether the person lived with at least 1 biological parent before age 18, alcohol consumption, whether either parent had problems with drugs, and thoughts about committing suicide, and the quantitative variables number of related persons in household, number of unrelated persons in household, and age when biological/adoptive parents stopped living together.

Data were randomly split into a training set containing 70% of the observations and a test set containing the remaining 30%. The clustering variables in the training set were standardized to a mean of 0 and a standard deviation of 1. A series of k-means cluster analyses was then run on the training data, specifying k = 1 to 9 clusters. The proportion of variance in the clustering variables accounted for by the clusters (R-square) was plotted for each of the nine solutions in an elbow curve to guide the choice of how many clusters to interpret. The elbow curve was inconclusive, suggesting that the 4-, 5- and 7-cluster solutions could each be interpreted. The results below are for the 4-cluster solution.

People in cluster 1 had a high number of related and unrelated persons in household, moderate alcohol consumption, and a low age when biological/adoptive parents stopped living together. People in cluster 2 had a moderate number of related persons in household but low alcohol consumption, a low number of unrelated persons in household, and a low age when biological/adoptive parents stopped living together. People in cluster 3 had low alcohol consumption, a low number of related and unrelated persons in household, and a low age when biological/adoptive parents stopped living together. Finally, people in cluster 4 had moderate alcohol consumption and a moderate age when biological/adoptive parents stopped living together, but a low number of related and unrelated persons in household. On the highest grade or year of school completed, people in cluster 3 scored highest and people in cluster 1 scored lowest.
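The nine PROC FASTCLUS runs and the nine DATA steps that pull out the R-square row all follow the same pattern, so they could be generated in a loop. A minimal sketch, assuming the standardized training set clustvar from the code above already exists; the macro name elbow, its parameter maxk and the macro variable clustvars are illustrative names, not part of the original program:

```sas
/* Illustrative list of the 13 clustering variables used in the post */
%let clustvars = SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1
                 S3EQ1 S3EQ2 NUMREL NONREL S1Q2E;

%macro elbow(maxk=9);
  %local k;
  %do k = 1 %to &maxk;
    /* k-means solution with &k clusters */
    proc fastclus data=clustvar out=outdata&k outstat=cluststat&k
                  maxclusters=&k maxiter=300;
      var &clustvars;
    run;

    /* keep only the overall R-square for this solution */
    data clus&k;
      set cluststat&k;
      nclust = &k;
      if _type_ = 'RSQ';
      keep nclust over_all;
    run;
  %end;

  /* stack the R-square rows for the elbow plot */
  data clusrsquare;
    set %do k = 1 %to &maxk; clus&k %end; ;
  run;
%mend elbow;

%elbow(maxk=9);
```

This is intended to leave the same outdata1-outdata9, cluststat1-cluststat9 and clusrsquare data sets that the PROC GPLOT and PROC CANDISC steps read.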

melaniegonzaalez · 4 years ago

CODE:

/* No LIBNAME statement is needed: "WORK.IMPORTED" is a data set name, not a directory path, and the steps below read WORK.IMPORTED directly. */

DATA new; set imported;

if SEX=1 then male=1;

if SEX=2 then male=0;

run;

ods graphics on;

proc surveyselect data=new out=traintest seed = 138

samprate=0.7 method=srs outall;

run;

proc glmselect data=traintest plots=all seed=138;

partition ROLE=selected(train='1' test='0');

model S1Q6A = SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1 S3EQ1 S3EQ2 NUMREL NONREL S1Q2E/selection=lar(choose=cv stop=none) cvmethod=random(10);

run;

This lasso regression analysis identifies which of 13 categorical and quantitative predictor variables best predict a quantitative response variable, the highest grade or year of school completed. The categorical predictors are gender, ethnicity (Hispanic/Latino origin, American Indian, Asian, black, white), whether the person lived with at least 1 biological parent before age 18, alcohol consumption, whether either parent had problems with drugs, and thoughts about committing suicide. The quantitative predictors are number of related persons in household, number of unrelated persons in household, and age when biological/adoptive parents stopped living together.

Data were randomly split into a training set that included 70% of the observations (N=4407) and a test set that included 30% of the observations (N=1904). The least angle regression algorithm with 10-fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
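The split sizes quoted above can be checked directly. A small sketch, assuming the traintest data set produced by PROC SURVEYSELECT with the OUTALL option, which adds the 0/1 flag Selected:

```sas
/* Count training (Selected=1) and test (Selected=0) observations */
proc freq data=traintest;
  tables selected;
run;
```

The row for Selected=1 gives the size of the training set and the row for Selected=0 the size of the test set.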

The variables most strongly associated with highest grade or year of school completed were, in order, alcohol consumption, Hispanic origin, number of related persons in household, number of unrelated persons in household, and age when biological/adoptive parents stopped living together. Of these, alcohol consumption, number of related persons in household, number of unrelated persons in household, and age when biological/adoptive parents stopped living together were negatively associated with the response, and Hispanic origin was positively associated.
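The code in this post requests selection=lar, which is least angle regression; PROC GLMSELECT also accepts selection=lasso as a separate method. A sketch of that variant, assuming the same traintest data set, partition and predictors as above. It was not run for this post, so its selected variables may differ from the results reported here:

```sas
/* Sketch only: same model as in the post, with SELECTION=LASSO
   in place of SELECTION=LAR */
proc glmselect data=traintest plots=all seed=138;
  partition role=selected(train='1' test='0');
  model S1Q6A = SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1
                S3EQ1 S3EQ2 NUMREL NONREL S1Q2E
        / selection=lasso(choose=cv stop=none) cvmethod=random(10);
run;
```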

melaniegonzaalez · 4 years ago

CODE:

/* No LIBNAME statement is needed: "WORK.IMPORTED" is a data set name, not a directory path, and the steps below read WORK.IMPORTED directly. */

DATA new; set imported;

PROC SORT; BY IDNUM;

PROC HPFOREST;

target DGSTATUS/level=nominal;

input SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1 S3AQ1A S3EQ1 S3EQ2 S1Q7A1 S1Q7A2 S1Q7A6 S1Q7A7/level=nominal;

input NUMREL NONREL S1Q6A S1Q2E S1Q10B/level=interval;

RUN;

This random forest analysis evaluates the importance of different explanatory variables for predicting a binary response variable, in this case drug consumption. The data come from the U.S. National Epidemiologic Survey on Alcohol and Related Conditions (NESARC).

The target variable, DGSTATUS, is 1 for people who have used drugs and 2 for people who have never used drugs. The categorical explanatory variables are gender, ethnicity (Hispanic/Latino origin, American Indian, Asian, black, white), whether the person lived with at least 1 biological parent before age 18, alcohol consumption, cigarette consumption, whether either parent had problems with drugs, thoughts about committing suicide, and working status (full time, part time, unemployed and looking for work, unemployed and not looking for work). The quantitative explanatory variables are number of related persons in household, number of unrelated persons in household, highest grade or year of school completed, age when biological/adoptive parents stopped living together, and total personal income.

The explanatory variables with the highest relative importance scores were cigarette consumption, alcohol consumption, working full time and whether their father ever had problems with drugs. The accuracy of the random forest was 78.8%.

melaniegonzaalez · 4 years ago

Peer-graded Assignment: Running a Classification Tree

CODE:

/* No LIBNAME statement is needed: "WORK.IMPORTED" is a data set name, not a directory path, and the steps below read WORK.IMPORTED directly. */

DATA new; set imported;

PROC SORT; BY IDNUM;

ods graphics on;

proc hpsplit seed=13815;

class DGSTATUS SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1 S3AQ1A S3EQ1 S3EQ2 S1Q7A1 S1Q7A2 S1Q7A6 S1Q7A7;

model DGSTATUS =SEX S1Q1C S1Q1D1 S1Q1D2 S1Q1D3 S1Q1D5 S1Q2A S2AQ1 S3AQ1A S3EQ1 S3EQ2 S1Q7A1 S1Q7A2 S1Q7A6 S1Q7A7 NUMREL NONREL S1Q6A S1Q2E S1Q10B;

grow entropy;

prune costcomplexity;

RUN;

This decision tree analyzes how different explanatory variables relate to a binary response variable, in this case drug consumption. The data come from the U.S. National Epidemiologic Survey on Alcohol and Related Conditions (NESARC). The criterion for growing the full tree is entropy, and the pruning method is cost complexity.

The target variable, DGSTATUS, is 1 for people who have used drugs and 2 for people who have never used drugs. The categorical explanatory variables are gender, ethnicity (Hispanic/Latino origin, American Indian, Asian, black, white), whether the person lived with at least 1 biological parent before age 18, alcohol consumption, cigarette consumption, whether either parent had problems with drugs, thoughts about committing suicide, and working status (full time, part time, unemployed and looking for work, unemployed and not looking for work). The quantitative explanatory variables are number of related persons in household, number of unrelated persons in household, highest grade or year of school completed, age when biological/adoptive parents stopped living together, and total personal income.

The first split is based on cigarette consumption: 78% of people who had not smoked more than 100 cigarettes had never used drugs, compared with 55% of people who had smoked. The people who did not smoke were then divided by alcohol consumption: around 26% of those who had drunk at least 1 alcoholic drink had used drugs, compared with only 5% of those who did not drink. Non-smokers who did drink alcohol were further split on whether their father ever had problems with drugs; 52% of those whose father did had used drugs.

Among the people who did smoke, the next split was on whether their father ever had problems with drugs: 71% of those whose father did had used drugs. Finally, smokers whose fathers did not have problems with drugs were split on alcohol consumption; of these, around 88% of those who did not drink had never used drugs.
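To give a feel for the entropy criterion named above, the impurity of the two nodes from the first split can be worked out by hand from the shares reported in this post (78% and 55% never used drugs). This is a sketch under the assumption of the usual two-class entropy formula with a base-2 logarithm; it is not output from PROC HPSPLIT:

```sas
/* Entropy of a two-class node: -(p*log2(p) + (1-p)*log2(1-p)),
   where p is the share of the node that never used drugs */
data _null_;
  do p = 0.78, 0.55;
    entropy = -(p*log2(p) + (1-p)*log2(1-p));
    put p= entropy= 6.4;   /* written to the SAS log */
  end;
run;
```

The values come out at roughly 0.76 for the non-smoker node and 0.99 for the smoker node. The non-smoker node is therefore the purer of the two, while the smoker node is close to an even mix.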
