vj-vzb
Vijay Singh
vj-vzb · 4 years ago
Data Analysis and Interpretation Capstone - Week 3 - Assignment
Analyses:
The distributions of the quantitative response variable ‘INCIDENCE OF TUBERCULOSIS’ and the 10 explanatory variables (five predictors for each of 2012 and 2013) were evaluated through descriptive statistics: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, maximum, and mode.
Scatter and box plots were examined, and linear regression was used to test associations between the predictor variables and the response variable, separately for the 2012 and 2013 data.
Lasso regression with the least angle regression selection algorithm was used to identify the subset of variables that best predicted the incidence of tuberculosis. The model was estimated on a training set consisting of a random sample of 61% of the observations (N=113), and a test data set contained the remaining 39% of the observations (N=71) (Table 2).
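The descriptive statistics listed in the Analyses can be sketched in a few lines. This is a minimal Python illustration, not the code used for the assignment; the function name `describe` and the sample values are made up for the example, and the quartiles use the "inclusive" interpolation rule, which may differ slightly from what other software reports.

```python
import statistics


def describe(values):
    """Return the summary statistics named in the Analyses section."""
    data = sorted(values)
    # Quartiles with the "inclusive" rule (an assumption; other rules exist).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
        "25%": q1,
        "median": median,
        "75%": q3,
        "max": data[-1],
        "mode": statistics.mode(data),
    }


# Hypothetical incidence values (per 100,000 people), for illustration only.
sample = [12.0, 35.0, 35.0, 80.0, 150.0, 210.0, 330.0]
summary = describe(sample)
```

Calling `describe` once per variable and year would give one row of a table like Table 3.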
Results
Descriptive Statistics
Table 3 below shows descriptive statistics for Incidence of Tuberculosis and the quantitative predictors. The average incidence of Tuberculosis (code:  x157) was 120.64 per 100,000 people in 2012 and 116.83 in 2013.
[Table 3: image not preserved]
Bivariate Analysis
Figure 1: Scatter plots of the association between the Incidence of Tuberculosis response variable and Improved Sanitation Facilities showed that TB incidence was lower in countries where a greater share of the population had access to improved sanitation facilities. Similarly, TB incidence was lower in countries where a greater share of the population had access to an improved water source.
The lower section of Figure 1 showed no significant association between TB incidence and Health Expenditure. The bottom-right chart showed that TB incidence was lower where Adjusted Net National Income Per Capita was higher.
[Figure 1: image not preserved]
Note: The charts above represent the data collected for year 2012.
[Figure 2: image not preserved]
Figure 2: The linear regression line (2013 data) for the association between Incidence of Tuberculosis and Improved Sanitation Facilities showed that TB incidence was lower in countries or regions with greater access to improved sanitation facilities.
[Figure 3: image not preserved]
Figure 3: The linear regression line (2013 data) for the association between Incidence of Tuberculosis and Improved Water Source showed that TB incidence was lower in countries or regions with greater access to improved water sources.
[Figure 4: image not preserved]
Figure 4: The linear regression line (2013 data) for the association between Incidence of Tuberculosis and Health Expenditure showed that TB incidence was not significantly lower in countries or regions with higher health expenditure.
[Figure 5: image not preserved]
Figure 5: The linear regression line (2013 data) for the association between Incidence of Tuberculosis and Adjusted Net Income showed that TB incidence was lower in countries or regions with higher adjusted net income.
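For readers who want to see how a regression line like those in Figures 2 to 5 is obtained, here is a minimal Python sketch of ordinary least squares with one predictor. It is not the code behind the figures; `fit_line` and the five data points are hypothetical and chosen only to show a negative slope.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values: % of population with improved sanitation vs. TB incidence.
sanitation_pct = [20.0, 40.0, 60.0, 80.0, 100.0]
tb_incidence = [300.0, 240.0, 150.0, 90.0, 20.0]
slope, intercept = fit_line(sanitation_pct, tb_incidence)
```

A negative slope corresponds to the pattern described in the captions: incidence falls as access improves.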
Lasso Regression Analysis
Figure 6 (2012 data) shows that one predictor was retained in the selected model: only the Improved Sanitation Facilities predictor was selected. See the vertical line in the figure marked “Selected Step”.
[Figure 6: image not preserved]
Figure 7 (2013 data) likewise shows that one predictor was retained in the selected model: only the Improved Sanitation Facilities predictor was selected.
[Figure 7: image not preserved]
Improved Sanitation Facilities in a country or region was most strongly associated with Tuberculosis Incidence, followed by Improved Water Source, Net National Income and Health Expenditure (Table 4).
[Table 4: image not preserved]
The 2013 data confirmed the same pattern: Improved Sanitation Facilities in a country or region was most strongly associated with Tuberculosis Incidence, followed by Improved Water Source, Net National Income and Health Expenditure (Table 5).
[Table 5: image not preserved]
vj-vzb · 4 years ago
Data Analysis and Interpretation Capstone - Week 2 - Assignment
Methods
Sample:
This World Bank capstone data set is a subset of data extracted from the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional, and global estimates. The data set consists of over 80 variables on N=248 countries for the years 2012 and 2013. All variables have valid observations for a minimum of 190 countries. The data set could be used, for example, for cross-sectional analysis of one year’s data or for predicting 2013 outcomes from 2012 data.
Data Set Name – WID.WORLDBANK
Observations: 248
Created: March 3rd, 2016
 Measures:
Variables Description:
1. x157_2012: INCIDENCE OF TUBERCULOSIS (PER 100,000 PEOPLE), year 2012
2. x157_2013: INCIDENCE OF TUBERCULOSIS (PER 100,000 PEOPLE), year 2013
3. x155_2012: IMPROVED SANITATION FACILITIES (% OF POPULATION WITH ACCESS), year 2012
4. x155_2013: IMPROVED SANITATION FACILITIES (% OF POPULATION WITH ACCESS), year 2013
5. x156_2012: IMPROVED WATER SOURCE (% OF POPULATION WITH ACCESS), year 2012
6. x156_2013: IMPROVED WATER SOURCE (% OF POPULATION WITH ACCESS), year 2013
7. x150_2012: HEALTH EXPENDITURE, TOTAL (% OF GDP), year 2012
8. x150_2013: HEALTH EXPENDITURE, TOTAL (% OF GDP), year 2013
9. x219_2012: POPULATION DENSITY (PEOPLE PER SQ. KM OF LAND AREA), year 2012
10. x219_2013: POPULATION DENSITY (PEOPLE PER SQ. KM OF LAND AREA), year 2013
11. x11_2012: ADJUSTED NET NATIONAL INCOME PER CAPITA (CURRENT US$), year 2012
12. x11_2013: ADJUSTED NET NATIONAL INCOME PER CAPITA (CURRENT US$), year 2013
INCIDENCE OF TUBERCULOSIS is the response variable, measured for each country/region for both 2012 and 2013. The total number of tuberculosis cases was divided by the total population of the country/region and multiplied by 100,000.
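The per-100,000 scaling described above is simple arithmetic. A minimal Python sketch follows; the function name and the case and population counts are hypothetical, not values from the World Bank data.

```python
def incidence_per_100k(cases, population):
    """Cases divided by population, scaled to a rate per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * 100_000


# Hypothetical country: 6,000 new TB cases in a population of 5 million.
rate = incidence_per_100k(cases=6_000, population=5_000_000)
```

With these made-up numbers the rate comes out at about 120 per 100,000 people.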
Quantitative predictors for 2012 and 2013 included the following:
1. IMPROVED SANITATION FACILITIES (% OF POPULATION WITH ACCESS)
2. IMPROVED WATER SOURCE (% OF POPULATION WITH ACCESS)
3. HEALTH EXPENDITURE, TOTAL (% OF GDP)
4. POPULATION DENSITY (PEOPLE PER SQ. KM OF LAND AREA)
5. ADJUSTED NET NATIONAL INCOME PER CAPITA (CURRENT US$)
Per capita income (PCI) or average income measures the average income earned per person in a given area (city, region, country, etc.) in a specified year. It is calculated by dividing the area's total income by its total population.
Adjusted net national income is GNI minus consumption of fixed capital and natural resources depletion. Adjusted net national income is calculated by subtracting from GNI a charge for the consumption of fixed capital (a calculation that yields net national income) and for the depletion of natural resources.
 Analyses:
The distributions of the quantitative response variable ‘INCIDENCE OF TUBERCULOSIS’ and the 10 explanatory variables were evaluated through descriptive statistics: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, maximum, and mode.
Scatter and box plots were examined, and Analysis of Variance (ANOVA) was used to test bivariate associations between the predictor variables and the response variable.
Lasso regression with the least angle regression selection algorithm was used to identify the subset of variables that best predicted the incidence of tuberculosis. The model was estimated on a training set consisting of a random sample of 70% of the observations (N=174), and a test data set contained the remaining 30% of the observations (N=74).
All predictor variables were standardized to have a mean of 0 and a standard deviation of 1 before the lasso regression analysis. Cross validation used k-fold with 10 folds. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables. Predictive accuracy was assessed by determining the mean squared error of the model estimated on the training data when applied to the observations in the test data set.
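The two preparation steps named above, standardizing predictors and forming 10 cross-validation folds, can be sketched as follows. This is an illustrative Python version under stated assumptions, not the procedure's internals: `standardize` and `k_fold_indices` are hypothetical helper names, the seed is arbitrary, and the standard deviation uses the n - 1 denominator.

```python
import random
import statistics


def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def k_fold_indices(n_obs, k, seed=123):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


z = standardize([2.0, 4.0, 6.0, 8.0])
# 174 training observations split into 10 folds, as in the text.
folds = k_fold_indices(n_obs=174, k=10)
```

Each fold is held out once while the model is fit on the other nine, and the held-out squared errors are averaged to give the cross-validation mean squared error.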
vj-vzb · 4 years ago
Data Analysis and Interpretation Capstone - Week 1 - Assignment
Report Title:
Incidence Of Tuberculosis Is Associated with Country’s Improved Sanitation Facilities and Total Health Expenditure
Introduction to the Research Question
Tuberculosis (TB) is an infectious disease caused by the bacillus Mycobacterium tuberculosis. It typically affects the lungs (pulmonary TB) but can affect other sites as well (extrapulmonary TB).  Tuberculosis (TB) remains a major global health problem. It causes ill-health among millions of people each year and ranks as the second leading cause of death from an infectious disease worldwide, after the human immunodeficiency virus (HIV).
Tuberculosis (TB) often has catastrophic economic effects on both the individuals suffering from the disease and their households. The objective of this paper is to assess the incidence, intensity and determinants of TB in relation to a country’s Improved Sanitation Facilities and Total Health Expenditure.
Tuberculosis (TB) has significant economic impacts in many countries and may hamper national development. Tuberculosis is most prevalent among the most economically productive sector of the population. The disease can therefore cause enormous economic and social disruption by reducing both labor supply and productivity. The purpose of this study is to help minimize TB incidence by directing funds to the most effective programs, which in turn should improve labor supply and productivity.
Also, protecting people from financial risk associated with ill health is a desirable objective of health policy worldwide. Such risk can be quantified in terms of catastrophic health expenditures. Catastrophic health expenditures are defined as out-of-pocket expenditure for health care that exceeds a specified proportion of household income, with the consequence that the household may have to sacrifice the consumption of other goods and services necessary for their well-being. Catastrophic health expenditures do not necessarily mean high health care costs. Relatively small expenditures for common illnesses may have serious financial implications for poor households.
vj-vzb · 4 years ago
Machine Learning for Data Analysis - Week4 Assignment
K-Means Cluster
Code:
libname mydata "/courses/d1406ae5ba27fe300" access=readonly; data clust; set mydata.treeaddhealth;
* create a unique identifier to merge cluster assignment variable with the main data set; idnum=_n_;
keep idnum alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1 parpres paractv famconct gpa1;
* delete observations with missing data; if cmiss(of _all_) then delete; run;
ods graphics on;
* Split data randomly into test and training data; proc surveyselect data=clust out=traintest seed = 123 samprate=0.7 method=srs outall; run;  
data clus_train; set traintest; if selected=1; run;
data clus_test; set traintest; if selected=0; run;
* standardize the clustering variables to have a mean of 0 and standard deviation of 1; proc standard data=clus_train out=clustvar mean=0 std=1; var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1 parpres paractv famconct; run;
%macro kmean(K);
proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300; var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1 parpres paractv famconct; run;
%mend;
%kmean(1); %kmean(2); %kmean(3); %kmean(4); %kmean(5); %kmean(6); %kmean(7); %kmean(8); %kmean(9);
* extract r-square values from each cluster solution and then merge them to plot elbow curve;
data clus1; set cluststat1; nclust=1; if _type_='RSQ'; keep nclust over_all; run;
data clus2; set cluststat2; nclust=2; if _type_='RSQ'; keep nclust over_all; run;
data clus3; set cluststat3; nclust=3; if _type_='RSQ'; keep nclust over_all; run;
data clus4; set cluststat4; nclust=4; if _type_='RSQ'; keep nclust over_all; run;
data clus5; set cluststat5; nclust=5; if _type_='RSQ'; keep nclust over_all; run;
data clus6; set cluststat6; nclust=6; if _type_='RSQ'; keep nclust over_all; run;
data clus7; set cluststat7; nclust=7; if _type_='RSQ'; keep nclust over_all; run;
data clus8; set cluststat8; nclust=8; if _type_='RSQ'; keep nclust over_all; run;
data clus9; set cluststat9; nclust=9; if _type_='RSQ'; keep nclust over_all; run;
data clusrsquare; set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9; run;
* plot elbow curve using r-square values; symbol1 color=blue interpol=join; proc gplot data=clusrsquare; plot over_all*nclust; run;
* plot clusters for 4 cluster solution; proc candisc data=outdata4 out=clustcan; class cluster; var alcevr1 marever1 alcprobs1 deviant1 viol1 dep1 esteem1 schconn1 parpres paractv famconct; run;
proc sgplot data=clustcan; scatter y=can2 x=can1 / group=cluster; run;
* validate clusters on GPA; * first merge clustering variable and assignment data with GPA data; data gpa_data; set clus_train; keep idnum gpa1; run;
proc sort data=outdata4; by idnum; run;
proc sort data=gpa_data; by idnum; run;
data merged; merge outdata4 gpa_data; by idnum; run;
proc sort data=merged; by cluster; run;
proc means data=merged; var gpa1; by cluster; run;
proc anova data=merged; class cluster; model gpa1 = cluster; means cluster/tukey; run;
Results / Observations / Comments:
A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based on their similarity of responses on 11 variables that represent characteristics that could have an impact on school achievement. Clustering variables included two binary variables measuring whether or not the adolescent had ever used alcohol or marijuana, as well as quantitative variables measuring alcohol problems, a scale measuring engaging in deviant behaviors (such as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school), and scales measuring violence, depression, self-esteem, parental presence, parental activities, family connectedness, and school connectedness. 
The elbow curve was inconclusive, suggesting that the 2, 4 and 8-cluster solutions might be interpreted. The results below are for an interpretation of the 4-cluster solution.
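The idea behind the elbow curve can be sketched outside SAS: run k-means for several values of k and record the overall R-square (one minus within-cluster sum of squares over total sum of squares). The Python below is a toy illustration of Lloyd's algorithm, not a reimplementation of proc fastclus; the six 2-D points and the starting centroids are made up.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [sq_dist(p, c) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def kmeans(points, start, max_iter=300):
    """Lloyd's algorithm from the given starting centroids."""
    centroids = list(start)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Empty clusters keep their previous centroid.
        updated = [mean_point(m) if m else c for m, c in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def r_square(points, centroids, clusters):
    """1 - (within-cluster sum of squares / total sum of squares)."""
    grand = mean_point(points)
    total = sum(sq_dist(p, grand) for p in points)
    within = sum(sq_dist(p, c) for c, members in zip(centroids, clusters) for p in members)
    return 1 - within / total


# Hypothetical data: two obvious groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
starts = {1: [points[0]], 2: [points[0], points[3]], 3: [points[0], points[3], points[5]]}
r2_by_k = {}
for k, start in starts.items():
    centroids, clusters = kmeans(points, start)
    r2_by_k[k] = r_square(points, centroids, clusters)
```

In this toy case R-square jumps between k=1 and k=2 and barely improves at k=3, which is the "elbow" one looks for. With real data the bend is often less clear, as it was here.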
Canonical discriminant analysis was used to reduce the 11 clustering variables to a few canonical variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (Figure 2) indicated that the observations in clusters 1 and 4 were densely packed with relatively low within-cluster variance, and did not overlap very much with the other clusters. Cluster 2 was generally distinct, but the observations had greater spread, suggesting higher within-cluster variance. Observations in cluster 3 were spread out more than those in the other clusters, showing high within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 4 clusters.
[Figure 2: scatterplot of the first two canonical variables by cluster; image not preserved]
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on grade point average (GPA). A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on GPA (F(3, 3197)=82.28, p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on GPA, with the exception that clusters 1 and 2 were not significantly different from each other. Adolescents in cluster 4 had the highest GPA (mean=2.99, sd=0.71), and cluster 3 had the lowest GPA (mean=2.36, sd=0.78).
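The F statistic reported above comes from a one-way ANOVA, which compares between-cluster and within-cluster variation. A minimal Python sketch of that calculation follows; `one_way_anova` is a hypothetical helper and the GPA values are invented, so the output does not reproduce F=82.28. The reported degrees of freedom (3, 3197) are consistent with 4 clusters and 3,201 observations.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical GPA values for three clusters.
gpa_by_cluster = [[2.0, 2.5, 3.0], [2.5, 3.0, 3.5], [3.0, 3.5, 4.0]]
f_stat, df_between, df_within = one_way_anova(gpa_by_cluster)
```

A large F relative to its degrees of freedom gives a small p-value; the Tukey comparisons then identify which pairs of clusters differ.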
vj-vzb · 4 years ago
Machine Learning for Data Analysis - Week3 Assignment
Code:
libname mydata "/courses/d1406ae5ba27fe300" access=readonly;
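data new; set mydata.treeaddhealth; ****  create male and non-white explanatory variables ; * nonwhite is set to 0 otherwise so the cmiss() check below does not drop White and Black respondents; if white=0 and black = 0 then nonwhite=1; else nonwhite=0; if bio_sex=1 then male=1; if bio_sex=2 then male=0;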
* delete observations with missing data; if cmiss(of _all_) then delete; run;
ods graphics on;
* Split data randomly into test (25%) and training(75%) data; proc surveyselect data=new out=traintest seed = 123 samprate=0.75 method=srs outall; run;  
* lasso multiple regression with lars algorithm k=8 fold validation; proc glmselect data=traintest plots=all seed=123;     partition ROLE=selected(train='1' test='0');     model schconn1 = male white black nonwhite alcevr1 marever1 cocever1     inhever1 cigavail passist expel1 age alcprobs1 deviant1 viol1 dep1 esteem1       famconct gpa1/selection=lar(choose=cv stop=none) cvmethod=random(8); run;
Results / Comments:
In this assignment, the data set was randomly split into a training data set consisting of 75% of the total observations and a test data set consisting of the other 25%.
A lasso regression analysis was conducted to identify a subset of variables from a pool of 19 categorical and quantitative predictor variables that best predicted a quantitative response variable measuring school connectedness in adolescents. Categorical predictors included gender and a series of 3 binary categorical variables for race and ethnicity (White, Black, nonwhite) to improve interpretability of the selected model with fewer predictors.
The least angle regression algorithm with k=8 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
Of the 19 predictor variables, 16 were retained in the selected model. During the estimation process, depression was among the predictors most strongly associated with school connectedness, followed by engaging in violent behavior and GPA.
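To see why a lasso drops some predictors entirely, here is a small Python sketch. Note that it uses coordinate descent with soft-thresholding, a different algorithm from the least angle regression used by proc glmselect, and that the two predictor columns, the response, and the penalty value are all hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; exactly zero inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(columns, y, penalty, n_iter=200):
    """Minimize (1/2n)*||y - Xb||^2 + penalty*||b||_1.

    columns: list of centered predictor columns; y: centered response.
    """
    n = len(y)
    coefs = [0.0] * len(columns)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Residual with predictor j's own contribution left out.
            partial = [
                y[i] - sum(coefs[k] * columns[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            rho = sum(col[i] * partial[i] for i in range(n)) / n
            scale = sum(v * v for v in col) / n
            coefs[j] = soft_threshold(rho, penalty) / scale
    return coefs


# Hypothetical centered predictors; y depends strongly on x1 and weakly on x2.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [1.0, -2.0, 0.0, 2.0, -1.0]
y = [3.0 * a + 0.1 * b for a, b in zip(x1, x2)]

unpenalized = lasso_coordinate_descent([x1, x2], y, penalty=0.0)
penalized = lasso_coordinate_descent([x1, x2], y, penalty=0.5)
```

With the penalty switched on, the weak predictor's coefficient is set exactly to zero while the strong one is only shrunk. That is the mechanism by which some of the 19 candidates end up excluded from the selected model.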
vj-vzb · 4 years ago
Machine Learning for Data Analysis - Week2 Assignment
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.treeaddhealth; PROC SORT; BY AID;
PROC HPFOREST; /* Indicating TREG1 is Categorical response variable */ /* In dataset response value 0 being converted to 2. Therefore we have two values 1 or 2 */ target TREG1/level=nominal;
/* input statement with Categorical explanatory variables */ input alcevr1 MARever1 cocever1 inhever1 Cigavail PASSIST EXPEL1 BIO_SEX HISPANIC WHITE BLACK /level=nominal;
/* input statement with Quantitative explanatory variables */ input schconn1 GPA1 age DEVIANT1 VIOL1 DEP1 ESTEEM1 /level=interval;
RUN;
Result / Observation / Comments:
Random forests are predictive models that allow for a data-driven exploration of many explanatory variables in predicting a response or target variable. Random forests provide importance scores for each explanatory variable and also make it possible to evaluate any increase in correct classification as smaller or larger numbers of trees are grown.
The following explanatory variables were included as possible contributors to a random forest evaluating regular smoking, i.e. my response variable TREG1:
Alcohol use, marijuana use, cocaine use, inhalant use, availability of cigarettes, either parent on public assistance, experience with being expelled from school, gender, Hispanic, White, Black
self-esteem, grade point average, age, deviance, violence, depression, school connectedness
Because no individual tree is actually interpreted, this can be considered the main weakness of random forests: the results are less satisfying to interpret. Instead, the forest of trees is used to rank the importance of variables in predicting the target. It provides a sense of the most important predictive variables but not necessarily of their relationships to one another.
The explanatory variables with the highest relative importance scores were marijuana use, deviance, White ethnicity, Black ethnicity and availability of cigarettes.
vj-vzb · 4 years ago
Machine Learning for Data Analysis - Week1 Assignment
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.treeaddhealth; PROC SORT; BY AID;
ods graphics on; proc hpsplit seed=15535; class TREG1 BIO_SEX HISPANIC WHITE BLACK NAMERICAN ASIAN   alcevr1 marever1 cocever1 inhever1 Cigavail; model TREG1 =AGE BIO_SEX HISPANIC WHITE BLACK NAMERICAN ASIAN alcevr1 ALCPROBS1  marever1 cocever1 inhever1 DEVIANT1 VIOL1 DEP1 ESTEEM1 PARPRES PARACTV  FAMCONCT schconn1 Cigavail PASSIST EXPEL1 GPA1; grow entropy; prune costcomplexity;
RUN;
Result / Observation / Comments:
The seed option in hpsplit was set to 15535 for the cross-validation process. The model uses both categorical (race, gender, etc.) and quantitative (e.g. GPA) explanatory variables.
The tree had 252 leaves before pruning and 20 leaves after pruning. The number of observations read from the data set was 6,508, while the number of observations used was only 4,575.
A vertical reference line is drawn for the tree with the number of leaves that has the lowest cross-validated average squared error (ASE), in this case the 20-leaf tree. The horizontal reference line represents the average squared error plus one standard error for this complexity parameter.
The pruning plot chose a model with 10 split levels and 20 leaves.
The full model correctly classifies 42% of those who have smoked regularly (one minus the error rate of .58) and 96% of those who have not.
The figure above is the receiver operating characteristic (ROC) curve, which plots sensitivity (the true positive rate) against 1 minus specificity (the false positive rate).
Variables such as school connectedness, alcohol problems, and age have importance scores that are relatively similar to that of grade point average, which was selected as a split in the model above.
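The 42% and 96% figures quoted above are the model's sensitivity and specificity. As a rough Python sketch, they can be computed from confusion-matrix counts like this; the counts are hypothetical, chosen only so the rates match the percentages in the text, and are not the actual cell counts from the hpsplit output.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts per 100 regular smokers and 100 non-smokers.
sens, spec = sensitivity_specificity(tp=42, fn=58, tn=96, fp=4)
false_positive_rate = 1 - spec  # the x-axis of a ROC curve
```

Each point on a ROC curve is one (false positive rate, sensitivity) pair obtained at a particular classification cutoff.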
vj-vzb · 4 years ago
Regression Modeling in Practice - Assignment Week-4
LOGISTIC REGRESSION::
In this assignment, we investigate whether there is an association between ethnicity/race and access to a routine physical examination, which is critical for students' overall health. The null hypothesis is that there is no difference in physical examination by ethnicity/race. We also introduce a second explanatory variable, “Born in United States - H1GI11”, to check for evidence of confounding of the association between the primary explanatory variable “Ethnicity Race - H1GI9” and the response variable “Physical Examination - H1HS1”.
Our alternate hypothesis is that there is a relationship between ethnicity/race and routine physical examination of the students.
The categories and response codes for each variable are as follows:
H1GI9 = "RACE"  - aggregated response by Interviewer (Explanatory Variable)  
1 White    
2 Black or African American    
3 American Indian or Native American    
4 Asian or Pacific Islander    
5 Other    
6 refused    
8 don’t know
H1HS1 = In the past year have you had a routine physical examination? (Response Variable)  
0  No    
1  Yes    
6  Refused    
8  Don’t know  
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly; DATA new; set mydata.addhealth_pds;
LABEL H1GI9 = "Ethnicity RACE" /* - aggregated response by Interviewer "  */      H1GI11 = "Born in The United States"      H1HS1 = "Physical Examination";
IF H1GI9 = 8 OR H1GI9 = 6 THEN H1GI9 = .;  /* Set 'don't know' and 'refused' responses to missing */
IF H1GI11 = 7 then H1GI11 = 1;
IF H1HS1 LE 1;  /* Keep only rows where the response is No (0) or Yes (1) */
PROC SORT; by BIO_SEX; PROC FREQ; TABLES H1HS1*H1GI9/CHISQ; BY BIO_SEX;
Proc logistic descending; model H1HS1=H1GI9; 
run;
* adding "Born in United State" and “Bio_Sex”; Proc logistic descending; model H1HS1=H1GI9 H1GI11 BIO_SEX; run;
Results:
The odds ratio estimate is 0.854, with a 95% confidence interval of 0.813 to 0.897. The p-value is less than .0001, so the null hypothesis is rejected.
-- Below, “Born in United States” and “BIO_SEX” are added as additional explanatory variables in the logistic regression procedure.
Statistical results:
Odds ratios - Ethnicity Race = 0.876, Born in USA = 1.438, Sex = 0.936
p-values - Ethnicity Race < .001, Born in USA = .001, Sex = .2165
95% confidence intervals for the odds ratios -
Ethnicity Race = 0.832 - 0.923; Born in USA = 1.159 - 1.784; Sex = 0.844 - 1.039
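Odds ratios and their confidence intervals are obtained by exponentiating the logistic regression coefficient and its interval endpoints. A minimal Python sketch follows. The standard error of 0.11 is an assumption back-calculated from the reported "Born in USA" interval, not a value shown in the output, and `odds_ratio_with_ci` is a hypothetical helper.

```python
import math


def odds_ratio_with_ci(coefficient, std_error, z=1.96):
    """Exponentiate a log-odds coefficient and its Wald interval endpoints."""
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z * std_error)
    upper = math.exp(coefficient + z * std_error)
    return odds_ratio, lower, upper


# Coefficient chosen so that exp(coefficient) equals the reported odds ratio 1.438;
# the standard error 0.11 is assumed for illustration.
born_usa_or, born_usa_lower, born_usa_upper = odds_ratio_with_ci(math.log(1.438), 0.11)
```

An interval that excludes 1 (as for Born in USA) indicates a significant association, while one that includes 1 (as for Sex, 0.844 to 1.039) does not.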
Summary/Comments:
Skipping the routine physical examination can harm students' health and, in turn, their academic progress. This study examines whether having a routine physical examination is associated with ethnicity/race, sex, and immigration status (born in the USA or outside).
Based on the logistic regression results above, students born in the United States were more likely to have had a routine physical examination than those born outside the country (OR = 1.438, p = .001). Sex was not significantly associated with having an examination (p = .2165), while ethnicity/race remained statistically significant after adjustment (OR = 0.876, p < .001).
vj-vzb · 4 years ago
Regression Modeling in Practice - Assignment Week-3
Code:
* scatterplot with linear regression line lifeexpectancy response variable; proc sgplot;  reg x=urbanrate y=lifeexpectancy / lineattrs=(color=blue thickness=2) clm;  yaxis label="Life Expectancy at Birth (Years)";  xaxis label="Urbanization Rate"; run; * scatterplot with linear and quadratic regression line; proc sgplot;  reg x=urbanrate y=lifeexpectancy / lineattrs=(color=blue thickness=2) degree=1 clm;  reg x=urbanrate y=lifeexpectancy / lineattrs=(color=green thickness=2) degree=2 clm;  yaxis label="Life Expectancy at Birth (Years)";  xaxis label="Urbanization Rate"; run;
* centering quantitative explanatory variables; data new2; set new; if urbanrate ne . and lifeexpectancy ne . and internetuserate ne .; urbanrate_c=urbanrate-56.8410778; internetuserate_c=internetuserate-34.2204688; run; proc means; var urbanrate internetuserate; run;
PROC glm; model lifeexpectancy=urbanrate_c/solution clparm; run;
* polynomial regression model; PROC glm; model lifeexpectancy=urbanrate_c urbanrate_c*urbanrate_c/solution clparm; run;
* multiple regression adding internet use rate;
PROC glm; model lifeexpectancy=urbanrate_c urbanrate_c*urbanrate_c internetuserate_c/solution clparm; run;
* request regression diagnostic plots; PROC glm PLOTS(unpack)=all; model lifeexpectancy=urbanrate_c urbanrate_c*urbanrate_c internetuserate_c/solution clparm; output residual=res student=stdres out=results; run;
* plot of standardized residuals for each observation; proc gplot; label stdres="Standardized Residual" country="Country"; plot stdres*country/vref=0; run;
* using proc reg to get a partial regression plot; * calculate quadratic terms; data partial; set new2; urbanrate2=urbanrate_c*urbanrate_c; run;
*partial regression plot; PROC reg plots=partial; model lifeexpectancy=urbanrate urbanrate2 internetuserate/partial; run;
-------------------------------------------------------------------------------------------------
Results:
Multiple regression analysis:
In multiple regression we can continue to add variables to the model in order to evaluate multiple predictors of a quantitative response variable.
When evaluating the independent associations of several predictor variables with the life expectancy of a newborn, internet use rate within the country and urban rate are positively and significantly associated. Armed forces rate and gender are not.
In a multiple regression, a positive parameter estimate for an explanatory variable with a p-value of less than .05 means there is a significant positive association between the explanatory variable and the response variable, after controlling for the other variables in the model.
The regression coefficients that we get from the analyses on our sample are only estimates of the true population parameters.
The results below show the linear association among life expectancy, urban rate, and internet use rate from the Gapminder data set.
The p-value is less than .0001, so the null hypothesis is rejected.
a) q-q plot
Above q-q plot shows that the residuals are generally following a straight line, but deviate somewhat at the lower and higher quantiles. This indicates that our residuals do not follow perfect normal distribution. This could mean that the curvilinear association that we observed in our scatter plot may not be fully estimated by the quadratic urban rate term. There may be other explanatory variables that we might consider including in our model.
For urbanrate_c (urban rate centered at its mean) the p-value is 0.0025, and for internet use rate the p-value is less than .0001. The 95% confidence interval for the urban rate coefficient is .0286 to .1318, and for the internet use rate coefficient it is .1812 to .2698.
Both the linear coefficient and the quadratic coefficient (beta coefficients) are positive, which indicates that the fitted curve is increasing at the mean urban rate and bends upward.
b) Standard Residuals
There are no observations that are three or more standard deviations from the mean therefore we do not have any extreme outliers.  
c) Leverage Plot
d) Summary:
Distribution of the residuals:
Model Fit:
Influential observations:
Outliers: Looking at the graph, all standardized residuals are within 3 standard deviations, so no observation has a significant negative impact on the fit.
The intercept is the value of the response variable when all the explanatory variables are held constant at a value of zero. Because we centered our two explanatory variables so that the mean of each was equal to zero, the intercept is the life expectancy at the mean of urban rate and internet use rate. So the life expectancy of a newborn, when urban rate and internet use rate are at their means, is 70 years.
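The effect of centering on the intercept can be checked with a small Python sketch. It uses a single centered predictor and no quadratic term, so it is simpler than the model above, and the four urban rate and life expectancy values are hypothetical. In this simple case the intercept equals the mean of the response.

```python
def center(values):
    """Subtract the mean so the centered variable has mean zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical urban rates (%) and life expectancies (years).
urbanrate = [20.0, 40.0, 60.0, 80.0]
lifeexpectancy = [60.0, 66.0, 72.0, 80.0]

urbanrate_c = center(urbanrate)
slope, intercept = fit_line(urbanrate_c, lifeexpectancy)
mean_life = sum(lifeexpectancy) / len(lifeexpectancy)
```

Centering leaves the slope unchanged and only moves the point at which the intercept is read off, from an urban rate of zero to the average urban rate.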
vj-vzb · 4 years ago
Regression Modeling in Practice - Assignment Week-2
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly; DATA new; set mydata.gapminder;
* scatterplot with linear regression line;
proc sgplot;  reg x=urbanrate y=oilperperson / lineattrs=(color=blue thickness=2);  title "Scatterplot for the Association Between Urban Rate and Oil Consumption";  yaxis label="Oil Consumption Rate";  xaxis label="Urbanization Rate"; run; title;
* basic linear regression; PROC glm; model oilperperson=urbanrate/solution; run;
Results:
Oil consumption Mean = 1.475659
Summary :
oilperperson - oil consumption per capita (tonnes per year per person)
urbanrate - Urban population refers to people living in urban areas as defined by national statistical offices.
There is one outlier in our data set, in the extreme top-right corner of the scatterplot.
Regression coefficient value = 106.3707 with a p-value less than .0001; a p-value of .0045 was obtained when taking account of the mean value in the projected calculation.
vj-vzb · 4 years ago
Regression Modeling in Practice - Assignment Week-1
Sample:
Interviews were conducted September 7-10, 2020 among a random national sample of 1,311 registered voters (RV). Landline (276) and cellphone (1,035) telephone numbers were randomly selected for inclusion in the survey using a probability proportionate to size method, which means phone numbers for each state are proportional to the number of voters in each state.
 Procedure:
The Fox News Poll is conducted under the joint direction of Beacon Research (D) (formerly known as Anderson Robbins Research) and Shaw & Company Research (R). Fieldwork conducted by Braun Research, Inc. of Princeton, NJ. Fox News Polls before 2011 were conducted by Opinion Dynamics Corporation. It was conducted by telephone (landline and cellphone) with live interviewers September 7-10, 2020 among a random national sample of 1,311 registered voters and 1,191 likely voters. Results based on the full sample and the likely voter sample have a margin of sampling error of plus or minus 2.5 percentage points. All results are for release 9:00AM/ET Sunday, September 13, 2020.
 Measures:
Explanatory variables consist of 42 questions related to economics, national security, and social life. The response categories are Strongly favorable, Somewhat favorable, Somewhat unfavorable, Strongly unfavorable, (Can't say), and Never heard of.
The response scale is categorical: each respondent chooses exactly one option.
Results based on the full sample have a margin of sampling error of ± 2.5 percentage points.
The measure of presidential approval rating was drawn from responses collected from both parties' strong supporters, independents, registered voters, and likely voters. Data was collected over a one-week period, and each week's data was compared to observe and determine the direction of the presidential candidates' approval ratings.
vj-vzb · 5 years ago
Data Analysis Tools - Week 4 Assignment (Testing a Potential Moderator)
Run an ANOVA, Chi-Square Test or correlation coefficient that includes a moderator.
Code:
/* Code Using ANOVA */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.addhealth_pds;
LABEL H1GI9 = "Ethnicity RACE" /* aggregated response by interviewer */
      H1GI11 = "Born in The United States"
      H1ED11 = "Grade in English or Language Arts"
      H1ED13 = "Grade in History or Social Studies" /* same codes and categories as above */
      H1ED2 = "Number of times skipped school";
IF H1GI11 = 7 then H1GI11 = 1;
IF H1GI11 LE 1; /* Drop "do not know" responses */
PROC SORT; by BIO_SEX;
PROC ANOVA; CLASS H1GI11; MODEL H1ED2=H1GI11; MEANS H1GI11; BY BIO_SEX;
RUN;
/* Code Using Chi-Square */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.addhealth_pds;
LABEL H1GI9 = "Ethnicity RACE" /* aggregated response by interviewer */
      H1GI11 = "Born in The United States"
      H1HS1 = "Physical Examination";
IF H1GI9 = 8 OR H1GI9 = 6 THEN H1GI9 = .; /* Set "do not know" and refused to missing */
IF H1GI11 = 7 then H1GI11 = 1;
IF H1HS1 LE 1; /* Drop "do not know" and refused responses */
PROC SORT; by BIO_SEX;
PROC FREQ; TABLES H1HS1*H1GI9/CHISQ; BY BIO_SEX;
PROC GCHART; VBAR H1GI9/discrete type=mean SUMVAR=H1HS1;
RUN;
/* Code Using Pearson Correlation */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new2; set mydata.gapminder;
/* Missing income is tested first: SAS treats a missing value as smaller than any number */
IF incomeperperson eq . THEN incomegrp=.;
ELSE IF incomeperperson LE 750 THEN incomegrp=1;
ELSE IF incomeperperson LE 2500 THEN incomegrp=2;
ELSE incomegrp=3;
IF incomegrp NE .;
PROC SORT; by incomegrp;
PROC CORR; VAR lifeexpectancy alcconsumption; BY incomegrp;
RUN;
Results for Review:
ANOVA:
Tumblr media Tumblr media Tumblr media Tumblr media
Summary / Comments:
Moderation occurs when the relationship between two variables depends on a third variable. In this case, the third variable is referred to as the moderating variable, or simply the moderator. In the example above, we test whether sex (male or female) moderates the association between ethnicity and the number of times a student skips school. As we know, skipping school ultimately affects the student's overall academic progress.
Chi-Square:
Tumblr media Tumblr media Tumblr media
Summary / Comments:
Above, we evaluate a third variable as a potential moderator in the context of the chi-square test of independence.
The p-value for SEX-W1=2 (Male) students is considerably lower than for female students, which suggests stronger evidence of a relationship between physical examination and ethnicity among male students.
Pearson Correlation:
Tumblr media Tumblr media
Summary / Comments:
Above, we explore whether the correlation between alcohol consumption and life expectancy differs between countries with different income levels.
We created a third, categorical variable called income group. For this new variable, the quantitative income per person variable was categorized as follows:
high income country - value of 3
moderate income country - value of 2
low income country - value of 1
When we examine the correlation coefficients between alcconsumption (alcohol consumption) and life expectancy for each of the income groups, we find the following:
For the low income group, the correlation between alcohol consumption and life expectancy is negative, at -0.10103, and not significant, with a p-value of 0.4716.
For the moderate income countries, the association between alcohol consumption and life expectancy is 0.4762, with a non-significant p-value of 0.11443.
And finally, among high income countries, the correlation coefficient is 0.4011, again with a p-value of 0.09396, suggesting that the association between alcohol consumption and life expectancy is somewhat stronger for high income countries than for low income countries, although it does not reach significance at the .05 level in any of the three groups.
vj-vzb · 5 years ago
Data Analysis Tools - Week 3 Assignment (Correlation Coefficient )
Code:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new2; set mydata.gapminder;
/* Missing income is tested first: SAS treats a missing value as smaller than any number */
IF incomeperperson eq . THEN incomelevel=.;
ELSE IF incomeperperson LE 750 THEN incomelevel=1;
ELSE IF incomeperperson LE 2500 THEN incomelevel=2;
ELSE IF incomeperperson LE 9400 THEN incomelevel=3;
ELSE incomelevel=4; /* above 9400 */
PROC SORT; by COUNTRY;
PROC CORR; VAR urbanrate incomeperperson oilperperson;
RUN;
Result for Review:
Tumblr media
Observations and Summary :
Above, the cells of the Pearson correlation coefficient table where two variables of interest intersect show the correlation coefficient of interest and the associated p-value.
oilperperson - oil Consumption per capita (tonnes per year and person)
urbanrate- Urban population refers to people living in urban areas as defined by national statistical offices
incomeperperson - 2010 Gross Domestic Product per capita in constant 2000 US$.
For the association between urbanrate and oilperperson, the correlation coefficient is approximately 0.62 with a p-value of 0.0001. This means that the relationship is statistically significant.
For the association between incomeperperson and oilperperson, the correlation coefficient is approximately 0.61 and also has a significant p-value.
The association between oilperperson and incomeperperson is strong and positive. The association between oilperperson and urbanrate is also positive and slightly stronger at 0.62. Both are statistically significant. That is, for both associations, it's highly unlikely that a relationship of this magnitude would be due to chance alone.
Post hoc tests are not necessary when conducting a Pearson correlation.
In addition, r squared is the fraction of the variability of one variable that can be predicted by the other. If we square our correlation coefficient of 0.62, we get about 0.38. This means that urban rate can predict about 38% of the variability we see in oil consumption per person.
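The r squared arithmetic for the urbanrate and oilperperson correlation can be checked with a short DATA step. This is a minimal sketch that uses the rounded coefficient of 0.62 quoted above; the variable names `r` and `r_squared` are only illustrative, and the unrounded coefficient from PROC CORR would give a slightly different value.

```sas
/* Sketch: square the rounded correlation coefficient to get r squared */
DATA _null_;
  r = 0.62;             /* rounded Pearson r for urbanrate and oilperperson */
  r_squared = r**2;     /* fraction of variability explained */
  PUT r= r_squared=;    /* writes both values to the SAS log */
RUN;
```

Since 0.62 squared is 0.3844, the rounded coefficient corresponds to roughly 38% of the variability.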
vj-vzb · 5 years ago
Assignment - Data Analysis Tools - Week 2 (Chi Square Test)
In this assignment, we investigate whether access to a routine physical examination, which we know is critical for a student's overall health, differs by ethnicity. Our null hypothesis is that there is no relationship between ethnicity and having a routine physical examination.
Our alternate hypothesis is that there is a relationship between ethnicity and routine physical examination of the students.
Code For Review:
Tumblr media Tumblr media
Result for Review:
Tumblr media
Following are the Categories for Each Variables and responses
H1GI9 = "RACE"  - aggregated response by Interviewer (Explanatory Variable)   
1 White    
2 Black or African American    
3 American Indian or Native American    
4 Asian or Pacific Islander    
5 Other    
6 refused    
8 don’t know 
H1HS1 = In the past year have you had a routine physical examination? (Response Variable)  
0  No    
1  Yes    
6  Refused     
8  Don’t know  
The chi-square test of independence measures how far the data are from what is claimed in the null hypothesis. The further the data are from the null hypothesis, the more evidence the data present against it. A p-value < .0001 clearly supports the alternate hypothesis that ethnicity and physical examination are related. But we still need to run post hoc tests.
My explanatory variable has five categories. So I know that not all are equal, but I don't know which are different and which are not.
Post hoc approach: the Bonferroni adjustment controls the overall type 1 error rate. For the 10 paired comparisons that we plan to make to better understand the association between ethnicity and physical examination, our adjusted p-value is .005 (.05 divided by 10). Rather than evaluating significance at the p = .05 level, we adjust the p-value to make it more difficult to reject the null hypothesis.
To determine which groups are different from the others, we need to perform a post hoc test. Conducting post hoc comparisons between pairs of rates with the adjusted p-value protects against rejecting the null hypothesis when the null hypothesis is true. We will then be much better able to describe which population rates are different from the others.
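The Bonferroni arithmetic and one of the ten paired comparisons can be sketched in SAS as follows. This is an illustration under assumptions rather than the program shown in the screenshots: it assumes a cleaned data set named `new` that already holds H1GI9 and H1HS1 with the refused and don't know codes removed, and the data set name `comparison1` is hypothetical.

```sas
/* Sketch: Bonferroni-adjusted significance level for 5 groups */
DATA _null_;
  alpha = 0.05;
  groups = 5;
  comparisons = COMB(groups, 2);          /* number of distinct pairs of groups */
  adjusted_alpha = alpha / comparisons;   /* Bonferroni-adjusted level */
  PUT comparisons= adjusted_alpha=;
RUN;

/* One paired comparison: White (1) versus Black or African American (2). */
/* Assumes the cleaned data set NEW already exists.                        */
DATA comparison1;
  SET new;
  IF H1GI9 = 1 OR H1GI9 = 2;   /* keep only the two groups being compared */
RUN;

PROC FREQ DATA=comparison1;
  TABLES H1HS1*H1GI9 / CHISQ;
RUN;
```

The same subset-and-test pattern would be repeated for each of the remaining pairs, and each chi-square p-value compared against the adjusted level instead of .05.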
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
  Summary:
All 10 pairs were compared using the chi-square method as follows:
Tumblr media
Where 
1 White    
2 Black or African American    
3 American Indian or Native American    
4 Asian or Pacific Islander    
5 Other    
We can see that the p-values are significant for certain pairs of groups. We therefore conclude that there is a relationship between the ethnicity of the students and whether they had a physical exam in the past year.
vj-vzb · 5 years ago
Assignment - Data Analysis Tools - Week 1
Code for Review:
Tumblr media
Results for Review: 
Tumblr media Tumblr media Tumblr media
Summary / Comments: We explore the hypothesis that there is a relationship between the sex of the students and the number of times they skip school.
The analysis of variance above was run between a categorical explanatory variable and a quantitative response variable. It gives a p-value of .0011, indicating that we can reject the null hypothesis and that there is a relationship. Group 1 (male) has a higher mean value than female students.
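Because the code for this post is only available as a screenshot, here is a minimal sketch of how such an ANOVA could be set up in SAS. It is an assumption-laden illustration, not the original program: it assumes the sex variable is BIO_SEX with codes 1 and 2, and that valid skip counts in H1ED2 run from 0 to 99.

```sas
/* Sketch: one-way ANOVA of number of times skipped school by sex */
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new;
  SET mydata.addhealth_pds;
  IF H1ED2 LE 99;          /* assumed: values above 99 are refused, skip or unknown codes */
  IF BIO_SEX IN (1, 2);    /* assumed: 1 and 2 are the two sex codes */
RUN;

PROC ANOVA DATA=new;
  CLASS BIO_SEX;           /* categorical explanatory variable */
  MODEL H1ED2 = BIO_SEX;   /* quantitative response variable */
  MEANS BIO_SEX;           /* group means for interpretation */
RUN;
QUIT;
```

With only two groups, no post hoc test is needed; the group means from the MEANS statement show the direction of the difference.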
Tumblr media Tumblr media
 Summary / Comments:
In this section, the quantitative response variable, i.e. the number of times a student skips school, is categorized as follows:
SKIPGROUP=1 - if did not skip or skipped one time only
SKIPGROUP=2 - if skipped between 2 and 5 times
SKIPGROUP=3 - if skipped between 6 and 10 times
SKIPGROUP=4 - if skipped between 11 and 20 times
SKIPGROUP=5 - if skipped school 21-30 times
SKIPGROUP=6 - if skipped school more than 30 times
The Duncan post hoc procedure was run, and it shows a significant mean difference between the 5th group, the 4th group, and the rest of the groups, as highlighted by the red and blue bars above.
vj-vzb · 5 years ago
Assignment - Data Management and Visualization - Week 4
Code and Logic:
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.addhealth_pds;
LABEL H1GI9 = "Ethnicity RACE" /* aggregated response by interviewer */
      H1GI11 = "Born in The United States"
      H1ED11 = "Grade in English or Language Arts"
      H1ED13 = "Grade in History or Social Studies" /* same codes and categories as above */
      H1ED2 = "Number of times skipped school";
IF H1GI9 = 8 OR H1GI9 = 6 THEN H1GI9 = .; /* Set "do not know" and refused to missing */
IF H1GI11 = 7 then H1GI11 = 1;
IF H1ED11 LE 4; /* Keep graded responses only */
IF H1ED13 LE 4; /* Keep graded responses only */
IF H1ED2 = 3 THEN H1ED2 = .; /* Set the value 3 to missing */
IF H1ED2 LE 99; /* Remove rows where response is refused, legitimate skip or "do not know" */
IF H1ED2 = . THEN SKIPGROUP=.; /* A missing count would otherwise satisfy LE 5 */
ELSE IF H1ED2 = 0 THEN SKIPGROUP=1; /* Never skipped school */
ELSE IF H1ED2 LE 5 THEN SKIPGROUP=2; /* Skipped school 1-5 times */
ELSE IF H1ED2 LE 10 THEN SKIPGROUP=3; /* Skipped school 6-10 times */
ELSE IF H1ED2 LE 20 THEN SKIPGROUP=4; /* Skipped school 11-20 times */
ELSE IF H1ED2 LE 30 THEN SKIPGROUP=5; /* Skipped school 21-30 times */
ELSE SKIPGROUP=6; /* Skipped school more than 30 times */
PROC SORT; by AID;
PROC FREQ; TABLES H1GI9 H1GI11 H1ED11 H1ED13 H1ED2;
PROC GCHART; VBAR H1GI9/Discrete type=PCT; /* Multiple categorical variable example */
PROC GCHART; VBAR H1GI11/Discrete type=PCT width=30; /* Categorical variable example */
PROC GCHART; VBAR H1ED11/Discrete type=PCT; /* Multiple categorical variable example */
PROC GCHART; VBAR H1ED13/Discrete type=PCT; /* Multiple categorical variable example */
PROC GCHART; VBAR H1ED2/discrete type=PCT SUMVAR=SKIPGROUP;
PROC GCHART; VBAR SKIPGROUP/discrete type=PCT;
PROC UNIVARIATE; VAR H1ED2;
RUN;
/* second program */
DATA new2; set mydata.gapminder;
IF incomeperperson eq . THEN incomegroup=.;
ELSE IF incomeperperson LE 744.239 THEN incomegroup=1;
ELSE IF incomeperperson LE 2553.496 THEN incomegroup=2;
ELSE IF incomeperperson LE 9425.236 THEN incomegroup=3;
ELSE incomegroup=4; /* above 9425.236 */
PROC SORT; by COUNTRY;
PROC FREQ; TABLES incomegroup;
PROC UNIVARIATE; VAR urbanrate internetuserate;
PROC GPLOT; PLOT internetuserate*urbanrate;
RUN;
Results:
Tumblr media
Summary:
This exercise determines whether there is any relationship between race, being born in the United States, and grades in English or language arts and in history or social studies.
In addition, we are also trying to study the relationship between academic grade and the number of times a student skips school. Number of times skipped school (H1ED2) is a quantitative variable with values ranging from 0 to 99. This range is further divided into SKIPGROUP categories to show in the graphs below.
Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media Tumblr media
Summary: 
Internet usage rate and urban rate are, on average, positively associated with each other, with some exceptions. The standard deviation of 28% is quite high and may be due to income level.
Tumblr media
vj-vzb · 5 years ago
Assignment - Data Management and Visualization - Week 3
STEP 1: First Code
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.addhealth_pds;
LABEL H1GI9 = "RACE" /* aggregated response by interviewer */
      H1GI11 = "Born in The United States"
      H1ED11 = "Grade in English or Language Arts"
      H1ED13 = "Grade in History or Social Studies";
IF H1GI9 = 8 OR H1GI9 = 6 THEN H1GI9 = .; /* Set "do not know" and refused to missing */
IF H1GI11 = 7 then H1GI11 = 1; /* Living at current address since birth is treated as response code 1 (born in the United States) */
IF H1ED11 LE 4; /* Remove rows where response is not graded */
IF H1ED13 LE 4; /* Remove rows where response is not graded */
PROC SORT; by AID;
PROC FREQ; TABLES H1GI9 H1GI11 H1ED11 H1ED13;
RUN;
/* This program prepares the data set to analyze grades in English or Language Arts (H1ED11) and grades in History or Social Studies (H1ED13) for children who were born and live in the United States or came from outside the country. */
STEP 1: Second Code
LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;
DATA new; set mydata.addhealth_pds;
IF H1GI4 GE 6 then H1GI4=.;
IF H1GI6A GE 6 then H1GI6A=.;
IF H1GI6B GE 6 then H1GI6B=.;
IF H1GI6C GE 6 then H1GI6C=.;
IF H1GI6D GE 6 then H1GI6D=.;
NUMETHNIC=SUM(of H1GI4 H1GI6A H1GI6B H1GI6C H1GI6D);
IF NUMETHNIC GE 2 THEN ETHNICITY=1; /* Multiple race */
ELSE IF H1GI4=1 THEN ETHNICITY=2; /* Hispanic or Latino ethnicity */
ELSE IF H1GI6A=1 THEN ETHNICITY=3; /* African American ethnicity */
ELSE IF H1GI6B=1 THEN ETHNICITY=4; /* American Indian or Native American */
ELSE IF H1GI6C=1 THEN ETHNICITY=5; /* Asian or Pacific Islander */
ELSE IF H1GI6D=1 THEN ETHNICITY=6; /* White ethnicity */
PROC SORT; by AID;
PROC PRINT; VAR H1GI4 H1GI6A H1GI6B H1GI6C H1GI6D NUMETHNIC ETHNICITY;
PROC FREQ; TABLES H1GI4 H1GI6A H1GI6B H1GI6C H1GI6D NUMETHNIC ETHNICITY;
RUN;
/* This program creates a new variable for multiple-race ethnicity by aggregating the responses for each ethnicity. It uses PROC PRINT to verify the result. */
STEP 2: Results 
Tumblr media
Result 2:
Tumblr media Tumblr media