courserall-da-w2
Data Analysis - Week 2
Is there a relation between education level and family unemployment?
This analysis is based on the ‘Outlook on Life’ surveys (OOL.csv file), which provide answers to a wide range of survey questions.
The explanatory variable is the categorical variable:
PPEDUCAT: Education Category, assuming these values:
1 Less than high school
2 High school
3 Some college
4 Bachelor's degree or higher
The Chi-square analysis is performed between the explanatory variable and the response variable:
W1_P11: Is anyone in your household currently unemployed?
which can assume these values:
-1 did not answer
1 Yes
2 No
The goal is to understand whether there is a relation between the education category and the presence of at least one unemployed person in the household.
The null hypothesis states that there is no relation,
while the alternative hypothesis states that there is a relation.
First of all, I excluded from the analysis the non-meaningful data (W1_P11 = -1, ‘did not answer’)
and recoded the remaining values as:
0 = No
1 = Yes
so that the crosstab results are ordered in a more readable way.
Then I performed the Chi-square test, which gives these results:
chi-square value, p value, expected counts
(128.229823336798, 1.3017582137131253e-27, 3, array([[127.20430108, 418.30645161, 406.68682796, 412.80241935],
[ 80.79569892, 265.69354839, 258.31317204, 262.19758065]]))
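The tuple above packs four values together. Unpacking them by name makes the output much easier to read; a minimal sketch, using the observed counts from the crosstab reported in the results below:

```python
import numpy as np
import scipy.stats

# Observed counts from the crosstab: rows = UNENPLOYED (0 = No, 1 = Yes),
# columns = PPEDUCAT categories 1-4.
observed = np.array([[84, 353, 414, 514],
                     [124, 331, 251, 161]])

# chi2_contingency returns (statistic, p-value, degrees of freedom, expected counts)
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, dof = {dof}")
```

With a 2x4 table the degrees of freedom are (2-1) x (4-1) = 3, matching the third element of the tuple.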
The small p-value (less than 0.05) leads me to reject the null hypothesis,
and the catplot graph also seems to show that as the percentage of households with anyone unemployed rises, the education category is lower.
But this is not sufficient:
since the explanatory variable is not binary (it can assume 4 values), a Bonferroni post-hoc test is needed.
To do this, a chi-square test must be performed for every pair of the 4 categories, which means 6 different comparisons.
During this process, to protect against type-1 error, the significance threshold is adjusted this way:
P/C = 0.05 / 6 comparisons ≈ 0.008
So, to reject the null hypothesis in a given comparison, its p-value must be < 0.008.
The results show that all 6 comparison p-values are less than 0.008,
although the first comparison, between categories 1 and 2, is quite close: its p-value = 0.005846220357609504.
The other 5 comparisons are much smaller.
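The six pairwise tests can also be generated compactly with itertools.combinations instead of writing each one out by hand; a sketch using the observed counts from the crosstab (scipy applies the Yates continuity correction by default on 2x2 tables, which is what the per-pair results below reflect):

```python
from itertools import combinations
import numpy as np
import scipy.stats

# Observed counts: rows = UNENPLOYED (0 = No, 1 = Yes), columns = PPEDUCAT 1-4.
counts = np.array([[84, 353, 414, 514],
                   [124, 331, 251, 161]])
categories = [1, 2, 3, 4]
alpha_adj = 0.05 / 6  # Bonferroni-adjusted threshold, ~0.008

# Test every pair of education categories on its 2x2 sub-table.
for (i, a), (j, b) in combinations(enumerate(categories), 2):
    sub = counts[:, [i, j]]
    chi2, p, dof, _ = scipy.stats.chi2_contingency(sub)
    print(f"{a} vs {b}: chi2 = {chi2:.3f}, p = {p:.3e}, "
          f"reject H0: {p < alpha_adj}")
```

This reproduces the six comparisons reported below (e.g. 1 vs 2 gives chi2 ≈ 7.597, p ≈ 0.00585) without six copies of the same recode/crosstab/test boilerplate.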
In summary, the education category seems to be related to the presence of anyone unemployed in the household.
================================================================================================
Python code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Apr 18 10:23:46 2021
@author: lorenzo
(based on code created on Wed Aug 26 18:17:30 2015 by ldierker)
"""
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('/Users/lorenzo/Documents/Coursera/02_Data_Analysis_Tools/files/Week_1/ool_pds.csv', low_memory=False)
# PPEDUCAT: Education (Categorical)
"""
1 Less than high school
2 High school
3 Some college
4 Bachelor's degree or higher
"""
print("PPEDUCAT: Education (Categorical)")
ct1 = data.groupby('PPEDUCAT').size()
print (ct1)
# W1_P11: Is anyone in your household currently unemployed?
"""
1 Yes
2 No
"""
print("W1_P11: Is anyone in your household currently unemployed?")
ct1 = data.groupby('W1_P11').size()
print (ct1)
# exclude non-responses (W1_P11 = -1, 'did not answer')
sub1=data[(data['W1_P11']>=0)]
sub2= sub1.copy()
#recoding values
recode1 = {1: 1, 2: 0}
sub2['UNENPLOYED']= sub2['W1_P11'].map(recode1)
# Education Category VS anyone in the HH unemployed:
# contingency table of observed counts
print(" ")
print("Crosstab: ")
ct1=pandas.crosstab(sub2['UNENPLOYED'], sub2['PPEDUCAT'])
print (ct1)
# column percentages
print(" ")
print("column percentages:")
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
# chi-square
print(" ")
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
# set variable types
sub2["PPEDUCAT"] = sub2["PPEDUCAT"].astype('category')
sub2['UNENPLOYED'] = pandas.to_numeric(sub2['UNENPLOYED'], errors='coerce')
# graph percent of anyone unemployed in the HH for each education category
# ci=None suppresses confidence-interval bars (newer seaborn versions use errorbar=None)
seaborn.catplot(x="PPEDUCAT", y="UNENPLOYED", data=sub2, kind="bar", ci=None)
plt.xlabel('Education Category')
plt.ylabel('Proportion anyone unemployed in the HH')
# Bonferroni post-hoc test
print(" ")
print("Bonferroni post-hoc test")
# COMPARE 1 VS 2
print(" ")
print ("compare 1 vs 2")
recode2 = {1: 1, 2: 2}
sub2['COMP1v2']= sub2['PPEDUCAT'].map(recode2)
# crosstab
ct2=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP1v2'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
# COMPARE 1 VS 3
print(" ")
print ("compare 1 vs 3")
recode3 = {1: 1, 3: 3}
sub2['COMP1v3']= sub2['PPEDUCAT'].map(recode3)
# crosstab
ct3=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP1v3'])
print (ct3)
# column percentages
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
# COMPARE 1 VS 4
print(" ")
print ("compare 1 vs 4")
recode4 = {1: 1, 4: 4}
sub2['COMP1v4']= sub2['PPEDUCAT'].map(recode4)
# crosstab
ct4=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP1v4'])
print (ct4)
# column percentages
colsum=ct4.sum(axis=0)
colpct=ct4/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs4= scipy.stats.chi2_contingency(ct4)
print (cs4)
# COMPARE 2 VS 3
print(" ")
print ("compare 2 vs 3")
recode23 = {2: 2, 3: 3}
sub2['COMP2v3']= sub2['PPEDUCAT'].map(recode23)
# crosstab
ct23=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP2v3'])
print (ct23)
# column percentages
colsum=ct23.sum(axis=0)
colpct=ct23/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs23= scipy.stats.chi2_contingency(ct23)
print (cs23)
# COMPARE 2 VS 4
print(" ")
print ("compare 2 vs 4")
recode24 = {2: 2, 4: 4}
sub2['COMP2v4']= sub2['PPEDUCAT'].map(recode24)
# crosstab
ct24=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP2v4'])
print (ct24)
# column percentages
colsum=ct24.sum(axis=0)
colpct=ct24/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs24= scipy.stats.chi2_contingency(ct24)
print (cs24)
# COMPARE 3 VS 4
print(" ")
print ("compare 3 vs 4")
recode34 = {3: 3, 4: 4}
sub2['COMP3v4']= sub2['PPEDUCAT'].map(recode34)
# crosstab
ct34=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP3v4'])
print (ct34)
# column percentages
colsum=ct34.sum(axis=0)
colpct=ct34/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs34= scipy.stats.chi2_contingency(ct34)
print (cs34)
================================================================================================
Results:
runfile('/Users/lorenzo/Documents/Coursera/02_Data_Analysis_Tools/files/Week_2/WEEK_2_OOP_Submission_02.py', wdir='/Users/lorenzo/Documents/Coursera/02_Data_Analysis_Tools/files/Week_2')
PPEDUCAT: Education (Categorical)
PPEDUCAT
1 219
2 700
3 682
4 693
dtype: int64
W1_P11: Is anyone in your household currently unemployed?
W1_P11
-1 62
1 867
2 1365
dtype: int64
Crosstab:
PPEDUCAT 1 2 3 4
UNENPLOYED
0 84 353 414 514
1 124 331 251 161
column percentages:
PPEDUCAT 1 2 3 4
UNENPLOYED
0 0.403846 0.516082 0.622556 0.761481
1 0.596154 0.483918 0.377444 0.238519
chi-square value, p value, expected counts
(128.229823336798, 1.3017582137131253e-27, 3, array([[127.20430108, 418.30645161, 406.68682796, 412.80241935],
[ 80.79569892, 265.69354839, 258.31317204, 262.19758065]]))
[Figure: bar plot of the proportion of households with anyone unemployed, by education category]
Bonferroni post-hoc test
compare 1 vs 2
COMP1v2 1.0 2.0
UNENPLOYED
0 84 353
1 124 331
COMP1v2 1.0 2.0
UNENPLOYED
0 0.403846 0.516082
1 0.596154 0.483918
chi-square value, p value, expected counts
(7.597101736033438, 0.005846220357609504, 1, array([[101.90134529, 335.09865471],
[106.09865471, 348.90134529]]))
compare 1 vs 3
COMP1v3 1.0 3.0
UNENPLOYED
0 84 414
1 124 251
COMP1v3 1.0 3.0
UNENPLOYED
0 0.403846 0.622556
1 0.596154 0.377444
chi-square value, p value, expected counts
(30.04366055462451, 4.224272521731211e-08, 1, array([[118.65292096, 379.34707904],
[ 89.34707904, 285.65292096]]))
compare 1 vs 4
COMP1v4 1.0 4.0
UNENPLOYED
0 84 514
1 124 161
COMP1v4 1.0 4.0
UNENPLOYED
0 0.403846 0.761481
1 0.596154 0.238519
chi-square value, p value, expected counts
(91.4095475375738, 1.1681149415262415e-21, 1, array([[140.86523216, 457.13476784],
[ 67.13476784, 217.86523216]]))
compare 2 vs 3
COMP2v3 2.0 3.0
UNENPLOYED
0 353 414
1 331 251
COMP2v3 2.0 3.0
UNENPLOYED
0 0.516082 0.622556
1 0.483918 0.377444
chi-square value, p value, expected counts
(15.152379365146949, 9.917325539812312e-05, 1, array([[388.90140845, 378.09859155],
[295.09859155, 286.90140845]]))
compare 2 vs 4
COMP2v4 2.0 4.0
UNENPLOYED
0 353 514
1 331 161
COMP2v4 2.0 4.0
UNENPLOYED
0 0.516082 0.761481
1 0.483918 0.238519
chi-square value, p value, expected counts
(87.52215384135604, 8.334106892462309e-21, 1, array([[436.37086093, 430.62913907],
[247.62913907, 244.37086093]]))
compare 3 vs 4
COMP3v4 3.0 4.0
UNENPLOYED
0 414 514
1 251 161
COMP3v4 3.0 4.0
UNENPLOYED
0 0.622556 0.761481
1 0.377444 0.238519
chi-square value, p value, expected counts
(29.714178536304683, 5.006728481369606e-08, 1, array([[460.53731343, 467.46268657],
[204.46268657, 207.53731343]]))