greentine-blog1 - Tumblr blog

greentine-blog1 · 8 years ago

Course 3 Week 4 Homework – Logistic Regression

The Null hypothesis for this assignment is:

There is NO relationship between whether or not a person votes (binary categorical explanatory variable) and whether that person participates in professional or community groups (binary categorical response variable). In addition, being ethnically Black (a second binary categorical explanatory variable) is not associated with participation in these types of groups.

Results of just relating Voting status to whether a person participates in these groups:

· p value of .0046 – statistically significant relationship

· Odds Ratio is 1.96, indicating that the odds of Participating in groups are 1.96 times higher among those who Voted than among those who did Not Vote. The 95% confidence interval for the odds ratio is 1.23 to 3.13; because this interval excludes 1, it agrees with the significant p value.
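As a check on these figures, the odds ratio and its confidence limits can be recomputed from the VOTED estimate (0.6744), its standard error (0.2381), and the intercept (-1.2217) reported in the SAS output. This is only a worked sketch; the 1.96 multiplier assumes the usual normal-approximation (Wald) interval.

```latex
\begin{align*}
\text{OR} &= e^{0.6744} \approx 1.963 \\
\text{95\% CI} &= e^{\,0.6744 \pm 1.96 \times 0.2381}
  = \left(e^{0.2077},\ e^{1.1411}\right) \approx (1.231,\ 3.130) \\
P(\text{participated} \mid \text{did not vote}) &= \frac{1}{1 + e^{1.2217}} \approx 0.228 \\
P(\text{participated} \mid \text{voted}) &= \frac{1}{1 + e^{-(-1.2217 + 0.6744)}}
  = \frac{1}{1 + e^{0.5473}} \approx 0.367
\end{align*}
```

The two probabilities differ by a factor of about 1.6, while the corresponding odds (0.228/0.772 ≈ 0.295 versus 0.367/0.633 ≈ 0.579) differ by the factor of 1.96. This is why an odds ratio is best read as a statement about odds rather than probabilities.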

Analysis of Maximum Likelihood Estimates

| Parameter | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|
| Intercept | -1.2217 | 0.2150 | 32.2765 | <.0001 |
| VOTED | 0.6744 | 0.2381 | 8.0230 | 0.0046 |

Odds Ratio Estimates

| Effect | Point Estimate | 95% Wald Lower Limit | 95% Wald Upper Limit |
|---|---|---|---|
| VOTED | 1.963 | 1.231 | 3.130 |

| Parameter | Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|
| Intercept | 0.3414634146 | 0.08548164 | 3.99 | <.0001 |
| VOTED | 0.2944589155 | 0.09740946 | 3.02 | 0.0026 |

After Ethnicity-black was added to the analysis:

Voting status p value remained < .05 (result = .0091). This indicates that Voting is still significantly associated with Participating in groups and Ethnicity-black is NOT a confounding variable.

As also shown in the results below, p value of Ethnicity-black is .032, indicating significant association with Participation in groups.

Odds Ratios for both Voting status and Ethnicity-black are > 1, so the odds of Participating in groups are higher with these factors. Specifically:

· The odds of Participating in groups are 1.87 times higher among those who Voted than among those who did Not Vote, after controlling for Ethnicity-Black status. The 95% confidence interval for this odds ratio is 1.17 to 2.99.

· The odds of Participating in groups are 1.50 times higher among those who are Ethnically-Black than among those who are not, after controlling for Voting status. The 95% confidence interval for this odds ratio is 1.04 to 2.17.

THEREFORE: we can REJECT the null hypothesis and conclude that Voting and being Ethnically black are each associated with higher odds of participating in professional or community groups.
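The code for this step is not shown in the post, so the following is only a sketch of how output like that shown in this post could be requested. The ETHNICBLK coding is an assumption (it supposes PPETHM = 2 identifies Black, Non-Hispanic respondents, which should be checked against the codebook), and `new2` is a hypothetical data set name. PARTICIPATED and VOTED are the variables derived in the Week 2 data step.

```sas
/* Sketch only: ETHNICBLK is not defined anywhere in the posted code.
   Assumption: PPETHM = 2 means Black, Non-Hispanic (verify in the codebook). */
DATA new2;                         /* hypothetical data set name */
  SET new;
  IF PPETHM = . THEN ETHNICBLK = .;
  ELSE ETHNICBLK = (PPETHM = 2);   /* 1 = Black, 0 = all other categories */
RUN;

/* DESCENDING models the probability that PARTICIPATED = 1 */
PROC LOGISTIC DATA=new2 DESCENDING;
  MODEL PARTICIPATED = VOTED;            /* Voting status alone */
RUN;

PROC LOGISTIC DATA=new2 DESCENDING;
  MODEL PARTICIPATED = VOTED ETHNICBLK;  /* adds the potential confounder */
RUN;
```

Running the two models separately makes it easy to compare the VOTED estimate before and after ETHNICBLK is added, which is the comparison used here to judge confounding.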

The LOGISTIC Procedure

Model Information

Data Set: WORK.NEW

Response Variable: PARTICIPATED

Number of Response Levels: 2

Model: binary logit

Optimization Technique: Fisher's scoring

Number of Observations Read: 553

Number of Observations Used: 535

Response Profile

| Ordered Value | PARTICIPATED | Total Frequency |
|---|---|---|
| 1 | 1 | 179 |
| 2 | 0 | 356 |

Probability modeled is PARTICIPATED=1.

Note: 18 observations were deleted due to missing values for the response or explanatory variables.

Model Convergence Status: Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

| Criterion | Intercept Only | Intercept and Covariates |
|---|---|---|
| AIC | 683.991 | 674.740 |
| SC | 688.273 | 687.587 |
| -2 Log L | 681.991 | 668.740 |

Testing Global Null Hypothesis: BETA=0

| Test | Chi-Square | Pr > ChiSq |
|---|---|---|
| Likelihood Ratio | 13.2505 | 0.0013 |
| Score | 12.7559 | 0.0017 |
| Wald | 12.4411 | 0.0020 |

Analysis of Maximum Likelihood Estimates

| Parameter | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|
| Intercept | -1.4065 | 0.2343 | 36.0362 | <.0001 |
| VOTED | 0.6255 | 0.2399 | 6.7986 | 0.0091 |
| ETHNICBLK | 0.4036 | 0.1882 | 4.6005 | 0.0320 |

Odds Ratio Estimates

| Effect | Point Estimate | 95% Wald Lower Limit | 95% Wald Upper Limit |
|---|---|---|---|
| VOTED | 1.869 | 1.168 | 2.991 |
| ETHNICBLK | 1.497 | 1.035 | 2.165 |

Association of Predicted Probabilities and Observed Responses

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Percent Concordant | 41.4 | Somers' D | 0.163 |
| Percent Discordant | 25.1 | Gamma | 0.244 |
| Percent Tied | 33.4 | Tau-a | 0.073 |
| Pairs | 63724 | c | 0.581 |


Course 3 Week 3 Homework – Multiple Regression: Summary

The Null hypothesis for this assignment is:

There is NO relationship between whether or not a person votes (binary categorical explanatory variable) and the number of professional or community groups that person participates in (quantitative response variable). I will also examine whether being ethnically Black (binary categorical explanatory variable) is a confounding variable.

Original results of just relating Voting status to number of groups a person is in: p value of .0026 & linear equation (as shown by Estimates) of NBR_GROUPS_IN = .29*VOTED + .34

| Parameter | Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|
| Intercept | 0.3414634146 | 0.08548164 | 3.99 | <.0001 |
| VOTED | 0.2944589155 | 0.09740946 | 3.02 | 0.0026 |

After Ethnicity-black was added to the analysis:

Voting status p value remained < .05 (result = .0084). This indicates that Voting is still significantly associated with Number of groups and Ethnicity-black is NOT a confounding variable. The coefficient is .26.

As also shown in the results below, Ethnicity-black has a p value of .0007, indicating significant association with Number of groups a person is in, and its coefficient is .28. THEREFORE: we can REJECT the null hypothesis and conclude that Voting and being Ethnically black are associated with the Number of groups one participates in.

The linear equation for this is: NBR_GROUPS_IN = .26*VOTED + .28*ETHNICBLK + .22.
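To make the equation concrete, here is a short worked sketch that plugs the four possible combinations of the two binary predictors into it, using the rounded coefficients quoted in the text.

```latex
\begin{align*}
\widehat{\text{NBR\_GROUPS\_IN}} &= 0.22 + 0.26\,\text{VOTED} + 0.28\,\text{ETHNICBLK} \\
\text{VOTED}=0,\ \text{ETHNICBLK}=0 &:\quad 0.22 \\
\text{VOTED}=1,\ \text{ETHNICBLK}=0 &:\quad 0.22 + 0.26 = 0.48 \\
\text{VOTED}=0,\ \text{ETHNICBLK}=1 &:\quad 0.22 + 0.28 = 0.50 \\
\text{VOTED}=1,\ \text{ETHNICBLK}=1 &:\quad 0.22 + 0.26 + 0.28 = 0.76
\end{align*}
```

Every predicted value is below 1 group, which is consistent with most respondents participating in no groups at all.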

It is noted that the R Squared value is only .04, meaning these variables explain only about 4% of the variability in the number of groups, so there are likely variables missing from this model.

· I tried substituting Gender and also Have Children for ETHNICBLK. These both showed p value > .05 and did not cause VOTED p value to increase beyond .05, so these variables are neither significant to this model nor confounders.

· Comment on other complexities with this data will be discussed in blog entry on regression diagnostic plots.
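The R Squared value can be recovered from the sums of squares in the GLM output for this model (Model = 18.4819468, Error = 468.7778663, Corrected Total = 487.2598131). A quick worked check:

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}}
     = \frac{18.4819468}{487.2598131} \approx 0.0379 \\
    &= 1 - \frac{SS_{\text{Error}}}{SS_{\text{Total}}}
     = 1 - \frac{468.7778663}{487.2598131} \approx 0.0379
\end{align*}
```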

The GLM Procedure

Number of Observations Read: 553

Number of Observations Used: 535

Dependent Variable: NBR_GROUPS_IN

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 2 | 18.4819468 | 9.2409734 | 10.49 | <.0001 |
| Error | 532 | 468.7778663 | 0.8811614 | | |
| Corrected Total | 534 | 487.2598131 | | | |

Fit Statistics

| R-Square | Coeff Var | Root MSE | NBR_GROUPS_IN Mean |
|---|---|---|---|
| 0.037930 | 165.1992 | 0.938702 | 0.568224 |

Type I Model ANOVA

| Source | DF | Type I SS | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| VOTED | 1 | 8.21292698 | 8.21292698 | 9.32 | 0.0024 |
| ETHNICBLK | 1 | 10.26901980 | 10.26901980 | 11.65 | 0.0007 |

Type III Model ANOVA

| Source | DF | Type III SS | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| VOTED | 1 | 6.16684962 | 6.16684962 | 7.00 | 0.0084 |
| ETHNICBLK | 1 | 10.26901980 | 10.26901980 | 11.65 | 0.0007 |

Solution

| Parameter | Estimate | Standard Error | t Value | Pr > \|t\| | 95% Lower Confidence Limit | 95% Upper Confidence Limit |
|---|---|---|---|---|---|---|
| Intercept | 0.2210017570 | 0.09170099 | 2.41 | 0.0163 | 0.0408613010 | 0.4011422131 |
| VOTED | 0.2568187849 | 0.09707845 | 2.65 | 0.0084 | 0.0661146584 | 0.4475229115 |
| ETHNICBLK | 0.2795619602 | 0.08189197 | 3.41 | 0.0007 | 0.1186906628 | 0.4404332575 |

COURSE3 WEEK3 MULTIPLE REGRESSION PLOTS:

QQ Plot: My understanding of this chart is that it compares the quantiles of the residuals with the quantiles expected from a normal distribution. With 67% of the data values = 0, it makes sense that so many points sit at the left of this plot, since the data is heavily skewed right. The other data values (1-5) have a more normal distribution.

Standardized residuals plot: Evidence of a challenged model: there are 5 outliers above 3; 3% have an absolute value above 2.5; 6% have an absolute value above 2. (For normally distributed residuals, roughly 5% would be expected beyond 2 and about 1% beyond 2.5.)

Residual plot for explanatory variable VOTED: doesn't appear to be meaningful for this binary categorical variable.

Leverage plot: There are several outliers. However, they do not have high leverage. Nor are there any non-outliers with high leverage.
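The code behind these plots is not included in the post. A minimal sketch of one way to request them, and to count the share of large standardized residuals quoted above, might look like the following. The data set names `diag` and `diag_flags` and the residual variable names are hypothetical, and ETHNICBLK is assumed to already exist in `new`.

```sas
ODS GRAPHICS ON;

/* PLOTS(UNPACK)=ALL asks for each diagnostic plot as its own graph;
   the OUTPUT statement saves residuals for further checks.          */
PROC GLM DATA=new PLOTS(UNPACK)=ALL;
  MODEL NBR_GROUPS_IN = VOTED ETHNICBLK / SOLUTION CLPARM;
  OUTPUT OUT=diag RESIDUAL=res STUDENT=stdres;  /* hypothetical names */
RUN;
QUIT;

/* Flag observations whose standardized residual exceeds 2 or 2.5
   in absolute value; rows without a residual are dropped.         */
DATA diag_flags;
  SET diag;
  IF stdres NE .;
  abs_gt_2  = (ABS(stdres) > 2);
  abs_gt_25 = (ABS(stdres) > 2.5);
RUN;

/* The mean of a 0/1 flag is the proportion of flagged observations */
PROC MEANS DATA=diag_flags MEAN;
  VAR abs_gt_2 abs_gt_25;
RUN;
```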

Course 3 Week 2 Homework – Basics of Linear Regression - Code and SAS results

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new;
  SET mydata.oll_pds (KEEP = CASEID W2_CASEID2 W1_L1_A W1_L1_C W1_L2_1 W1_L2_2 W1_L2_3
                             W1_L2_5 W1_L3 W2_QB1A W2_QB3 PPAGE PPAGECAT PPETHM W1_D1
                             W1_N1A W1_P17 W1_P17A PPHHSIZE PPGENDER);

  LABEL W1_L1_A="Participation in Professional Association in last 12 months"
        W1_L1_C="Participation in Cultural Organization in last 12 months"
        W1_L2_1="Participation in NAACP in last 12 months"
        W1_L2_2="Participation in National Urban League in last 12 months"
        W1_L2_3="Participation in Southern Christian Leadership Conference in last 12 months"
        W1_L2_5="Participation in Occupy Wall Street Movement in last 12 months"
        W1_L3="Participation in community group in last 12 months"
        W2_QB1A="Participation in election on Nov 6 2012"
        W2_QB3="Regularity of Voting"
        PPAGE="Age"
        PPAGECAT="Age Category"
        PPETHM="Ethnic Category"
        W1_D1="Rating of Barack Obama"
        W1_P17="Have Children"
        W1_P17A="Number of children"
        PPGENDER="Gender";

  /* used to view original data and validate accuracy of later program steps */
  /*proc print;*/

  /* Data management - coding out missing data for those variables where need no other form of reassignment */
  IF W2_QB3 = -1 THEN W2_QB3 = .;               /* -1 = missing */
  IF W1_D1 = -1 OR W1_D1 = 998 THEN W1_D1 = .;  /* -1 = missing and 998 = refused */
  IF W1_P17A = -1 THEN W1_P17A = .;             /* -1 = missing */
  /* Disabled: W1_B2 is not in the KEEP list, so this only created an empty variable */
  /* IF W1_B2 = -1 THEN W1_B2 = .; */

  /* Data management - Create secondary variable to aggregate voting participation to show only
     whether did vote or not or missing. Set missing and not sure code values to missing. */
  IF W2_QB1A >= 2 AND W2_QB1A <= 5 THEN VOTED = 1;       /* 2 through 5 describe different methods of voting; 1 = yes I voted */
  ELSE IF W2_QB1A = -1 OR W2_QB1A = 6 THEN W2_QB1A = .;  /* -1 = missing, 6 = not sure; VOTED stays missing */
  ELSE IF W2_QB1A = 1 THEN VOTED = 0;                    /* 1 = Did not vote; transformed to zero, the code that usually means "no" */

  IF W1_P17 = -1 THEN W1_P17 = .;       /* -1 = missing */
  ELSE IF W1_P17 = 2 THEN W1_P17 = 0;   /* 2 = No; transformed to zero, the code that usually means "no" */

  /* Data management - creating a secondary variable and reassigning code values to populate new
     variable NBR_GROUPS_IN. This will be used to understand context of group participation and
     whether it is a large enough sample, by understanding how many groups a person participates in.
     This field needs each survey result to be a simple yes/no (1/0). Summing the number of
     "participates" values for each question is not the same value as the number of distinct people
     who participate, as some participate in multiple organizations. */
  /* Also these steps reassign missing data. Setting to '.' for original variable - and setting to
     '0' for use in NBR_GROUPS_IN variable - because setting to Missing prevents NBR_GROUPS_IN value
     from being calculated */
  IF W1_L1_A = 1 OR W1_L1_A = 2 THEN DO;  /* 1 = participated more than twice, 2 = participated once or twice */
    W1_L1_A = 1;                          /* transformed to 1 (yes I participated) */
    W1_L1_A_MissEqNo = 1;
  END;
  ELSE IF W1_L1_A = -1 THEN DO;           /* -1 = refused */
    W1_L1_A = .;
    W1_L1_A_MissEqNo = 0;                 /* transformed to 0 = no I didn’t participate */
  END;
  ELSE DO;
    W1_L1_A = 0;                          /* transformed to 0 = no I didn’t participate */
    W1_L1_A_MissEqNo = 0;
  END;

  IF W1_L1_C = 1 OR W1_L1_C = 2 THEN DO;  /* 1 = participated more than twice, 2 = participated once or twice */
    W1_L1_C = 1;                          /* transformed to 1 (yes I participated) */
    W1_L1_C_MissEqNo = 1;
  END;
  ELSE IF W1_L1_C = -1 THEN DO;           /* -1 = refused */
    W1_L1_C = .;
    W1_L1_C_MissEqNo = 0;                 /* transformed to 0 = no I didn’t participate */
  END;
  ELSE DO;
    W1_L1_C = 0;                          /* transformed to 0 = no I didn’t participate */
    W1_L1_C_MissEqNo = 0;
  END;

  IF W1_L3 = 1 THEN DO;
    W1_L3 = 1;                            /* included for consistency with other variables’ management */
    W1_L3_MissEqNo = 1;                   /* transformed to 1 (yes I participated) */
  END;
  ELSE IF W1_L3 = -1 THEN DO;             /* -1 = refused */
    W1_L3 = .;
    W1_L3_MissEqNo = 0;                   /* transformed to 0 = no I didn’t participate */
  END;
  ELSE IF W1_L3 = 2 THEN DO;              /* 2 = no */
    W1_L3 = 0;                            /* transformed 2 to 0 to have consistent meaning with other No values */
    W1_L3_MissEqNo = 0;
  END;

  NBR_GROUPS_IN = W1_L1_A_MissEqNo + W1_L1_C_MissEqNo + W1_L2_1 + W1_L2_2 + W1_L2_3 + W1_L2_5 + W1_L3_MissEqNo;

  /* Bin Participation in the various activities into one variable */
  IF W1_L1_A = 1 OR W1_L1_A = 2 THEN PARTICIPATED = 1;       /* Professional Association */
  ELSE IF W1_L1_C = 1 OR W1_L1_C = 2 THEN PARTICIPATED = 1;  /* Cultural Organization */
  ELSE IF W1_L2_1 = 1 THEN PARTICIPATED = 1;                 /* NAACP */
  ELSE IF W1_L2_2 = 1 THEN PARTICIPATED = 1;                 /* National Urban League */
  ELSE IF W1_L2_3 = 1 THEN PARTICIPATED = 1;                 /* Southern Christian Leadership Conference */
  ELSE IF W1_L2_5 = 1 THEN PARTICIPATED = 1;                 /* Occupy Wall Street Movement */
  ELSE IF W1_L3 = 1 THEN PARTICIPATED = 1;                   /* community group */
  ELSE PARTICIPATED = 0;                                     /* Set remainder to 0 - meaning No participation */

  /* Transform Gender into only 0 & 1 values for regression. Male is originally = 1 in data, so no
     transform needed. Converted Female (original value of 2) to 0 */
  IF PPGENDER = 2 THEN PPGENDER = 0;

  /* Since Voting metrics are essential to testing hypothesis, have selected for observations where
     Voting metrics are populated (this is indicated by Wave 2 participation number - column
     W2_CASEID2 - because those questions occurred in Wave 2 testing) */
  IF W2_CASEID2 > 0;

  /* Narrowing hypothesis to age range of 18 to 44 because older ages typically do vote and are
     community engaged so wish to select for those who may not be voting or may not be community
     engaged - or both */
  IF PPAGE GE 18 AND PPAGE LE 44;
RUN;

PROC SORT; BY CASEID; RUN;

PROC FREQ; TABLES VOTED; RUN;

PROC FREQ; TABLES NBR_GROUPS_IN; RUN;

PROC GLM; MODEL NBR_GROUPS_IN = VOTED; RUN; QUIT;

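RESULTS:

The FREQ Procedure

| VOTED | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| 0 | 123 | 22.99 | 123 | 22.99 |
| 1 | 412 | 77.01 | 535 | 100.00 |

Frequency Missing = 18

The FREQ Procedure

| NBR_GROUPS_IN | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| 0 | 369 | 66.73 | 369 | 66.73 |
| 1 | 98 | 17.72 | 467 | 84.45 |
| 2 | 50 | 9.04 | 517 | 93.49 |
| 3 | 30 | 5.42 | 547 | 98.92 |
| 4 | 4 | 0.72 | 551 | 99.64 |
| 5 | 2 | 0.36 | 553 | 100.00 |

The GLM Procedure

Number of Observations Read: 553

Number of Observations Used: 535

Dependent Variable: NBR_GROUPS_IN

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 1 | 8.2129270 | 8.2129270 | 9.14 | 0.0026 |
| Error | 533 | 479.0468861 | 0.8987746 | | |
| Corrected Total | 534 | 487.2598131 | | | |

| R-Square | Coeff Var | Root MSE | NBR_GROUPS_IN Mean |
|---|---|---|---|
| 0.016855 | 166.8421 | 0.948037 | 0.568224 |

| Source | DF | Type I SS | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| VOTED | 1 | 8.21292698 | 8.21292698 | 9.14 | 0.0026 |

| Source | DF | Type III SS | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| VOTED | 1 | 8.21292698 | 8.21292698 | 9.14 | 0.0026 |

| Parameter | Estimate | Standard Error | t Value | Pr > \|t\| |
|---|---|---|---|---|
| Intercept | 0.3414634146 | 0.08548164 | 3.99 | <.0001 |
| VOTED | 0.2944589155 | 0.09740946 | 3.02 | 0.0026 |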

Course 3 Week 2 Homework – Basics of Linear Regression - Summary

Adapting my original hypothesis to this assignment's requirements, the Null hypothesis this model is testing is: There is NO relationship between whether or not a person votes (binary categorical explanatory variable) and the number of professional or community groups that person participates in (quantitative response variable).

Variable values:

· VOTED (whether person voted in November 2012 US federal election): 0 = No, 1 = Yes

· NBR_GROUPS_IN: 0-5

Summary of results:

P value: .0026. This value is below the alpha threshold of .05, indicating we can REJECT the null hypothesis and conclude that there is a relationship between voting and the number of groups participated in.

Beta1 value: The model provides the beta value (VOTED estimate in results table) of .29, showing a Positive association of Voting with the number of groups participated in.

· Beta1 value is also known as the regression coefficient “m” in the linear equation y=mx + b

· Beta0 value is the intercept “b” in the linear equation y=mx + b. For this model that value is .34
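As a short worked sketch, plugging the two possible values of VOTED into the fitted line (using the unrounded estimates from the results table) gives the predicted number of groups for each category. With a single 0/1 explanatory variable, these predictions are simply the average number of groups among non-voters and voters.

```latex
\begin{align*}
\hat{y} &= \beta_0 + \beta_1\,\text{VOTED} = 0.3415 + 0.2945\,\text{VOTED} \\
\text{VOTED}=0 &:\quad \hat{y} = 0.3415 \\
\text{VOTED}=1 &:\quad \hat{y} = 0.3415 + 0.2945 = 0.6360
\end{align*}
```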

Course 3 Regression, Week 1 Intro to Regression Homework

BACKGROUND ON DATA USED FOR STUDY

METHODS:

1 & 2. Sample & Data Collection Procedures:

Data is from the Outlook on Life Surveys, conducted by GfK Knowledge Networks on behalf of the University of California Irvine. Data is made available by the Inter-university Consortium for Political and Social Research (ICPSR). http://www.icpsr.umich.edu/icpsrweb/content/membership/index.html.

The two instances of this survey were fielded between August and December 2012 to a sample drawn from GfK’s web panel, which is designed to be representative of the United States population. Panel members are randomly recruited through probability-based sampling, and households are provided with access to the Internet and hardware if needed. Random-digit dialing and address-based sampling methodologies are used. The target population was non-institutionalized adults 18 years of age and older.

A total of 2294 respondents participated during Wave 1 survey of the Outlook on Life and 1601 were interviewed during Wave 2.

The focus of my research question is whether younger adults’ participation in community and professional groups is related to their voting habits. Because the voting participation question was asked only in the Wave 2 survey, only those who participated in Wave 2 were included in this research (therefore, up to 1601 participants).

Participants were further restricted to those who were 44 years old or younger; this resulted in a sample of 553 participants in the research.

The ethnic composition was White (n= 192, 35%), African American (n=297, 54%), Other-Non-Hispanic & 2 or more races (n =24, 4%), Hispanic (n=40, 7%). The gender composition was male (n=260, 47%) and female (n=293, 53%).

3. Measures – including variables used & data management performed:

Variables used:

· Participation in 7 non-partisan professional or community groups within the last 12 months - Categorical data

o Explanatory variables

o Categorical data

o Values for 2 groups (choices: frequency of participation and refusal to answer): -1, 1, 2, 3, 4

o Values for 4 groups (choices: yes/no): 0, 1

o Value for 1 group (choices: yes/no/refusal to answer): -1, 1, 2

· Participation in election on 11/6/2012

o Response variable

o Categorical data

o Values -1, 1-6 (choices: various methods of voting, didn’t vote, refusal to answer)

· Age

o Selection criteria for study

o Quantitative data

o Values 18-81

Data Management:

· Missing – set all variables with a data value meaning No Answer to Missing

· Participation in community or professional group values:

o 2 groups’ variables that listed frequency in addition to participation (variables W1_L1_A & W1_L1_C): transformed frequency values (values 1 and 2) to 1 and values meaning no participation or belonging (3 and 4) to 0

o 1 group’s variable that captured participation with different values from other questions (variable W1_L3): transformed Yes & No values from 1 and 2, respectively, to align with values used for the other variables (1 and 0)

· PARTICIPATED variable: derived a variable which consolidates participation across all 7 groups/types of groups into 1 variable. Rule: if a person participated 1 or more times in any group, set value to 1 (Yes); else value = 0 (No)

· MissEqNo variables: To be able to sum the number of groups the person participated in, each value needed to be only 0 (No Participation) or 1 (Participation at any frequency). Therefore, created derived variables for the 3 groups which had Missing values coded as -1, setting those Missing values to 0

· Selection criteria:

o Respondents with Age less than or equal to 44

o Data records with a Wave2 CaseID

Course 2 Data Analysis Tools: Week 4 Exploring Statistical Interactions

Null Hypothesis: That for 18 – 44 year olds, whether or not they have children DOES NOT moderate the relationship between whether they participate in community or professional groups and their likelihood to vote.

Alternate Hypothesis: That for 18 – 44 year olds, whether or not they have children DOES moderate the relationship between whether they participate in community or professional groups and their likelihood to vote.

SAS Program:

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new;

SET mydata.oll_pds (KEEP = CASEID W2_CASEID2 W1_L1_A W1_L1_C W1_L2_1 W1_L2_2 W1_L2_3 W1_L2_5 W1_L3 W2_QB1A W2_QB3 PPAGE PPAGECAT PPETHM W1_D1 W1_N1A W1_P17 W1_P17A PPHHSIZE);

LABEL

W1_P17 = "Have Children";

(showing only needed portions of code – excludes all data management and selection criteria)

/* Week 4's assignment: PARTICIPATED to VOTED, influenced by Having Children */

PROC FREQ; TABLES VOTED*PARTICIPATED/CHISQ;

PROC GCHART; VBAR PARTICIPATED/DISCRETE TYPE=MEAN SUMVAR=VOTED;

PROC SORT; BY W1_P17;

PROC FREQ; TABLES VOTED*PARTICIPATED/CHISQ; BY W1_P17;

Results Summary and Analysis:

ChiSquared result on only the 2 variables (PARTICIPATED & VOTED)

· ChiSquared value is substantial at 8.2 (with 1 degree of freedom, the threshold for significance at the .05 level is 3.84). This indicates a notable difference between expected and observed results, and the association of the variables is significant.

· P value is .0042, which is far less than the significance threshold of .05. A difference this large would be unlikely to appear by chance if the two variables were unrelated, so the relationship is statistically significant.

· Conclusion on the 2 variables: With a high Chi Squared and low p value, there is a statistically significant association between Participation in community and professional groups and Voting, although the Phi coefficient of .12 in the output indicates the association is modest in size.
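For a 2x2 table, the Chi Squared statistic can be reproduced by hand from the four cell counts. The sketch below uses the counts in the VOTED by PARTICIPATED table in the output; the two non-voter counts (95 and 28) are not printed in this post and are inferred here from the row and column totals (356 - 261 and 179 - 151).

```latex
\begin{align*}
\chi^2 &= \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)},
  \qquad a = 95,\ b = 28,\ c = 261,\ d = 151,\ n = 535 \\
       &= \frac{535 \times (95 \times 151 - 28 \times 261)^2}{123 \times 412 \times 356 \times 179}
        = \frac{535 \times 7037^2}{3{,}229{,}277{,}424} \approx 8.204
\end{align*}
```

This matches the Chi-Square value of 8.2040 reported by PROC FREQ.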

ChiSquared result regarding moderation of Having Children on Participating in groups and Voting

· For those without children (Have Children = 0): ChiSquared value is not large (3.01) and p value does not cross significance threshold of .05 (p is .082)

· For those with children (Have Children =1): ChiSquared value is large (5.31) and p value does cross significance threshold of .05 (p is .021)

Conclusion:

· Null hypothesis should be rejected and Alternate hypothesis accepted: whether a person has children DOES moderate the relationship between Participation in community groups and Voting

o This is shown by the fact that the ChiSquared and p values differ between the 2 Have Children values (0-No/1-Yes). That is, the association is statistically significant for those Having children and is not significant for those Not Having children.

OUTPUT from FREQ procedures:

Table VOTED * PARTICIPATED (WITHOUT MODERATOR)

Table of VOTED by PARTICIPATED (each cell shows Frequency / Percent / Row Pct / Col Pct)

| VOTED | PARTICIPATED = 0 | PARTICIPATED = 1 | Total |
|---|---|---|---|
| 0 | 95 / 17.76 / 77.24 / 26.69 | 28 / 5.23 / 22.76 / 15.64 | 123 / 22.99 |
| 1 | 261 / 48.79 / 63.35 / 73.31 | 151 / 28.22 / 36.65 / 84.36 | 412 / 77.01 |
| Total | 356 / 66.54 | 179 / 33.46 | 535 / 100.00 |

Frequency Missing = 18

Chi-Square Tests

| Statistic | Value | Prob |
|---|---|---|
| Chi-Square | 8.2040 | 0.0042 |
| Likelihood Ratio Chi-Square | 8.6083 | 0.0033 |
| Continuity Adj. Chi-Square | 7.5921 | 0.0059 |
| Mantel-Haenszel Chi-Square | 8.1886 | 0.0042 |
| Phi Coefficient | 0.1238 | |
| Contingency Coefficient | 0.1229 | |
| Cramer's V | 0.1238 | |

Fisher's Exact Test

| Statistic | Value |
|---|---|
| Cell (1,1) Frequency (F) | 95 |
| Left-sided Pr <= F | 0.9988 |
| Right-sided Pr >= F | 0.0025 |
| Table Probability (P) | 0.0013 |
| Two-sided Pr <= P | 0.0045 |

Effective Sample Size = 535

Frequency Missing = 18

RESULTS WITH MODERATOR “Have Children”

Have Children=0

Table VOTED * PARTICIPATED (WITH MODERATOR)

Table of VOTED by PARTICIPATED (each cell shows Frequency / Percent / Row Pct / Col Pct)

| VOTED | PARTICIPATED = 0 | PARTICIPATED = 1 | Total |
|---|---|---|---|
| 0 | 48 / 16.90 / 71.64 / 26.97 | 19 / 6.69 / 28.36 / 17.92 | 67 / 23.59 |
| 1 | 130 / 45.77 / 59.91 / 73.03 | 87 / 30.63 / 40.09 / 82.08 | 217 / 76.41 |
| Total | 178 / 62.68 | 106 / 37.32 | 284 / 100.00 |

Frequency Missing = 9

Chi-Square Tests

| Statistic | Value | Prob |
|---|---|---|
| Chi-Square | 3.0131 | 0.0826 |
| Likelihood Ratio Chi-Square | 3.1000 | 0.0783 |
| Continuity Adj. Chi-Square | 2.5324 | 0.1115 |
| Mantel-Haenszel Chi-Square | 3.0025 | 0.0831 |
| Phi Coefficient | 0.1030 | |
| Contingency Coefficient | 0.1025 | |
| Cramer's V | 0.1030 | |

Fisher's Exact Test

| Statistic | Value |
|---|---|
| Cell (1,1) Frequency (F) | 48 |
| Left-sided Pr <= F | 0.9714 |
| Right-sided Pr >= F | 0.0544 |
| Table Probability (P) | 0.0258 |
| Two-sided Pr <= P | 0.0855 |

Effective Sample Size = 284

Frequency Missing = 9

Have Children=1

Table VOTED * PARTICIPATED (WITH MODERATOR)

Table of VOTED by PARTICIPATED (each cell shows Frequency / Percent / Row Pct / Col Pct)

| VOTED | PARTICIPATED = 0 | PARTICIPATED = 1 | Total |
|---|---|---|---|
| 0 | 44 / 17.96 / 83.02 / 25.58 | 9 / 3.67 / 16.98 / 12.33 | 53 / 21.63 |
| 1 | 128 / 52.24 / 66.67 / 74.42 | 64 / 26.12 / 33.33 / 87.67 | 192 / 78.37 |
| Total | 172 / 70.20 | 73 / 29.80 | 245 / 100.00 |

Frequency Missing = 4

Chi-Square Tests

| Statistic | Value | Prob |
|---|---|---|
| Chi-Square | 5.3094 | 0.0212 |
| Likelihood Ratio Chi-Square | 5.7577 | 0.0164 |
| Continuity Adj. Chi-Square | 4.5564 | 0.0328 |
| Mantel-Haenszel Chi-Square | 5.2877 | 0.0215 |
| Phi Coefficient | 0.1472 | |
| Contingency Coefficient | 0.1456 | |
| Cramer's V | 0.1472 | |

Fisher's Exact Test

| Statistic | Value |
|---|---|
| Cell (1,1) Frequency (F) | 44 |
| Left-sided Pr <= F | 0.9949 |
| Right-sided Pr >= F | 0.0140 |
| Table Probability (P) | 0.0089 |
| Two-sided Pr <= P | 0.0267 |

Effective Sample Size = 245

Frequency Missing = 4

Course 2 Data Analysis Tools: Week 3 Pearson Correlation Coefficient

Null Hypothesis: That for 18 – 44 year olds, how many children they have DOES NOT have a relationship to their age.

Alternate Hypothesis: That for 18 – 44 year olds, how many children they have DOES have a relationship to their age.

NOTE about homework and variables selected:

Outlook on life didn’t have a lot of quantitative variables, so simply chose 2 that could display use of the Pearson test.

SAS Program:

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new;

SET mydata.oll_pds (KEEP = CASEID W2_CASEID2 W1_L1_A W1_L1_C W1_L2_1 W1_L2_2

W1_L2_3 W1_L2_5 W1_L3 W2_QB1A W2_QB3 PPAGE PPAGECAT PPETHM

W1_D1 W1_N1A W1_P17A PPHHSIZE);

LABEL

PPAGE="Age"

W1_P17A = "Number of children";

(showing only needed portions of code – excludes all data management and selection criteria)

PROC CORR; VAR PPAGE W1_P17A;

OUTPUT from CORR procedure:

2 Variables: PPAGE W1_P17A

Simple Statistics

| Variable | N | Mean | Std Dev | Sum | Minimum | Maximum | Label |
|---|---|---|---|---|---|---|---|
| PPAGE | 553 | 31.64195 | 8.26652 | 17498 | 18.00000 | 44.00000 | Age |
| W1_P17A | 249 | 2.25301 | 1.21659 | 561.00000 | 1.00000 | 9.00000 | Number of children |

Pearson Correlation Coefficients (each cell shows r / Prob > |r| under H0: Rho=0 / Number of Observations)

| | PPAGE | W1_P17A |
|---|---|---|
| PPAGE (Age) | 1.00000 / 553 | 0.30393 / <.0001 / 249 |
| W1_P17A (Number of children) | 0.30393 / <.0001 / 249 | 1.00000 / 249 |

Results Summary and Analysis:

PEARSON (r) CORRELATION COEFFICIENT

· P value is <.0001, which is far less than the significance threshold of .05. A correlation this large would be very unlikely to appear by chance if age and number of children were unrelated, so the relationship is statistically significant

· Coefficient (r) is .304. This shows a weak positive linear relationship (as it is closer to 0 than 1)

· r2 indicates what proportion of the variability in one variable is described by variation in the second variable (a.k.a. Coefficient of Determination).

o rsquared => .304 * .304 = .09 => only about 9% of the variability in number of children is accounted for by age, so there is low ability to form predictions
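The p value and the r value answer different questions, which a short worked sketch can show. Using r = 0.30393 and n = 249 from the CORR output, and the standard t test for a correlation coefficient (the test formula is supplied here as background, not taken from the post):

```latex
\begin{align*}
r^2 &= 0.30393^2 \approx 0.0924 \\
t   &= \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
     = \frac{0.30393 \times \sqrt{247}}{\sqrt{1 - 0.0924}}
     \approx \frac{4.777}{0.9527} \approx 5.01
\end{align*}
```

A t value of about 5 on 247 degrees of freedom corresponds to p < .0001, so even a correlation that explains only about 9% of the variability is clearly distinguishable from zero at this sample size.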

Conclusion:

· Null hypothesis should be rejected: how many children a person has DOES have a statistically significant relationship to their age, but that relationship is weak.

o The r and r2 values indicate a weak relationship

o The p value shows statistical significance, while the magnitude of the effect is small (r2) and the relationship is weak (r)

o Because the p and r values seemed to point to conflicting conclusions, I looked for an answer in the Course Forum on this situation. The p value addresses whether a relationship exists; r and r2 address how strong it is

Course 2 Data Analysis Tools: Week 2 Chi Square test of Independence

Null Hypothesis: That for 18 – 44 year olds, whether or not they voted DOES NOT vary by number of community or professional groups they participated in.

Alternate Hypothesis: That for 18 – 44 year olds, whether or not they voted varies by number of community or professional groups they participated in.

NOTE about homework and variables selected:

It appears that the assignment was to be applied on a situation with:

1. Categorical variable with more than 2 levels. My dataset and subject area have two: Number of community groups participated in & age category/group.

2. Chi Squared test was statistically significant (p < .05).

However, neither of these variables had a statistically significant relationship (p < .05) with the Voted variable. Because this is the question I am working with, I completed the assignment using Number of groups participated in, as it is the best I have.

Program code and all charts and statistics are below the Conclusion. Results are in this and a 2nd tumblr post because the output was too long to fit in one post.

Conclusion:

CHI SQUARED test (initial test on all categories/values)

· P value is .0587, close to the threshold of .05 but not below it.

· From this result, this relationship is not significant and the Null hypothesis that there is no relationship between number of groups involved in and voting cannot be rejected.

Posthoc CHI SQUARED TESTS

· Bonferroni adjusted p value threshold for rejecting the null hypothesis is .003 (.05 divided by the 15 pairwise comparisons)

· P values across the 15 pairs range from .0167 to .9677; none falls below the adjusted threshold

· The lowest p values are between 0 groups involved in and 2 and 3 groups involved in (.0167 and .0521, respectively).

· From this result, no pairwise difference is significant and the Null hypothesis that there is no relationship between number of groups involved in and voting cannot be rejected.
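The adjusted threshold follows from the number of categories being compared. As a brief worked sketch, with the 6 possible values of Number of groups (0 through 5):

```latex
\begin{align*}
\text{number of pairs} &= \binom{6}{2} = \frac{6 \times 5}{2} = 15 \\
\alpha_{\text{adjusted}} &= \frac{0.05}{15} \approx 0.0033
\end{align*}
```

The smallest pairwise p value (.0167) is about five times larger than this threshold, so none of the 15 comparisons is significant after adjustment.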

SAS Program:

LIBNAME mydata "/courses/d1406ae5ba27fe300 " access=readonly;

DATA new;

SET mydata.oll_pds (KEEP = CASEID W2_CASEID2 W1_L1_A W1_L1_C W1_L2_1 W1_L2_2

W1_L2_3 W1_L2_5 W1_L3 W2_QB1A W2_QB3 PPAGE PPAGECAT PPETHM);

LABEL W1_L1_A="Participation in Professional Association in last 12 months"

W1_L1_C="Participation in Cultural Organization in last 12 months"

W1_L2_1="Participation in NAACP in last 12 months"

W1_L2_2="Participation in National Urban League in last 12 months"

W1_L2_3="Participation in Southern Christian Leadership Conference in last 12 months"

W1_L2_5="Participation in Occupy Wall Street Movement in last 12 months"

W1_L3="Participation in community group in last 12 months"

W2_QB1A="Participation in election on Nov 6 2012"

W2_QB3="Regularity of Voting"

PPAGE="Age"

PPAGECAT="Age Category"

PPETHM="Ethnic Category";

(omitting from this listing the data management and selection code)

/*FREQ statement is Response Var*Explanatory Var/CHISQ */