courserall-da-w2
Data Analysis - Week 2
Is there a relation between education level and family unemployment?
This analysis is based on the ‘Outlook on Life’ surveys (OOL.csv file), which provide answers to a wide range of survey questions.
The explanatory variable is the categorical variable:
PPEDUCAT: Education Category, assuming these values:
1 Less than high school
2 High school
3 Some college
4 Bachelor's degree or higher
The Chi-square analysis is performed between the explanatory variable and the response variable:
W1_P11: Is anyone in your household currently unemployed?
which can assume these values:
-1 did not answer
1 Yes
2 No
The goal is to understand whether there is a relation between the education category and the presence of at least one unemployed person in the household.
The null hypothesis states that there is no relation,
while the alternative hypothesis states that there is a relation.
First of all, I excluded from the analysis the non-meaningful data (W1_P11 = -1, ‘did not answer’)
and recoded the remaining values as:
0 = No
1 = Yes
so that the crosstab results are ordered in a more readable way.
Then I performed the Chi-square test, which gives these results:
chi-square value, p value, expected counts
(128.229823336798, 1.3017582137131253e-27, 3, array([[127.20430108, 418.30645161, 406.68682796, 412.80241935],
[ 80.79569892, 265.69354839, 258.31317204, 262.19758065]]))
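The tuple above packs four values together. Unpacking them by name makes the output much easier to read; a minimal sketch, using the observed counts from the crosstab reported in the results below:

```python
import numpy as np
import scipy.stats

# Observed counts from the crosstab: rows = UNENPLOYED (0 = No, 1 = Yes),
# columns = PPEDUCAT categories 1-4.
observed = np.array([[84, 353, 414, 514],
                     [124, 331, 251, 161]])

# chi2_contingency returns (statistic, p-value, degrees of freedom, expected counts)
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, dof = {dof}")
```

With a 2x4 table the degrees of freedom are (2-1) x (4-1) = 3, matching the third element of the tuple.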
The small p-value (less than 0.05) leads me to reject the null hypothesis,
and the catplot graph also seems to show that as the percentage of households with anyone unemployed rises, the education category is lower.
But this is not sufficient:
since the explanatory variable is not binary (it can assume 4 values), a Bonferroni post-hoc test is needed.
To do this, a chi-square test must be performed for every pair of the 4 categories, which means 6 different comparisons.
During this process, to protect against type-1 error, the significance threshold is adjusted this way:
P/C = 0.05 / 6 comparisons ≈ 0.008
So, to reject the null hypothesis in a given comparison, its p-value must be < 0.008.
The results show that all 6 comparison p-values are less than 0.008,
although the first comparison, between categories 1 and 2, is quite close: its p-value = 0.005846220357609504.
The other 5 comparisons are much smaller.
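The six pairwise tests can also be generated compactly with itertools.combinations instead of writing each one out by hand; a sketch using the observed counts from the crosstab (scipy applies the Yates continuity correction by default on 2x2 tables, which is what the per-pair results below reflect):

```python
from itertools import combinations
import numpy as np
import scipy.stats

# Observed counts: rows = UNENPLOYED (0 = No, 1 = Yes), columns = PPEDUCAT 1-4.
counts = np.array([[84, 353, 414, 514],
                   [124, 331, 251, 161]])
categories = [1, 2, 3, 4]
alpha_adj = 0.05 / 6  # Bonferroni-adjusted threshold, ~0.008

# Test every pair of education categories on its 2x2 sub-table.
for (i, a), (j, b) in combinations(enumerate(categories), 2):
    sub = counts[:, [i, j]]
    chi2, p, dof, _ = scipy.stats.chi2_contingency(sub)
    print(f"{a} vs {b}: chi2 = {chi2:.3f}, p = {p:.3e}, "
          f"reject H0: {p < alpha_adj}")
```

This reproduces the six comparisons reported below (e.g. 1 vs 2 gives chi2 ≈ 7.597, p ≈ 0.00585) without six copies of the same recode/crosstab/test boilerplate.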
In summary, the education category seems to be related to the presence of anyone unemployed in the household.
================================================================================================
Python code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sun Apr 18 10:23:46 2021
@author: lorenzo
(based on code created on Wed Aug 26 18:17:30 2015 by ldierker)
"""
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('/Users/lorenzo/Documents/Coursera/02_Data_Analysis_Tools/files/Week_1/ool_pds.csv', low_memory=False)
# PPEDUCAT: Education (Categorical)
"""
1 Less than high school
2 High school
3 Some college
4 Bachelor's degree or higher
"""
print("PPEDUCAT: Education (Categorical)")
ct1 = data.groupby('PPEDUCAT').size()
print (ct1)
# W1_P11: Is anyone in your household currently unemployed?
"""
1 Yes
2 No
"""
print("W1_P11: Is anyone in your household currently unemployed?")
ct1 = data.groupby('W1_P11').size()
print (ct1)
# exclude non-responses (W1_P11 = -1, 'did not answer')
sub1=data[(data['W1_P11']>=0)]
sub2= sub1.copy()
#recoding values
recode1 = {1: 1, 2: 0}
sub2['UNENPLOYED']= sub2['W1_P11'].map(recode1)
# Education Category VS anyone in the HH unemployed:
# contingency table of observed counts
print(" ")
print("Crosstab: ")
ct1=pandas.crosstab(sub2['UNENPLOYED'], sub2['PPEDUCAT'])
print (ct1)
# column percentages
print(" ")
print("column percentages:")
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
# chi-square
print(" ")
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
# set variable types
sub2["PPEDUCAT"] = sub2["PPEDUCAT"].astype('category')
sub2['UNENPLOYED'] = pandas.to_numeric(sub2['UNENPLOYED'], errors='coerce')
# graph percent of anyone unemployed in the HH for each education category
# ci=None suppresses confidence-interval bars (newer seaborn versions use errorbar=None)
seaborn.catplot(x="PPEDUCAT", y="UNENPLOYED", data=sub2, kind="bar", ci=None)
plt.xlabel('Education Category')
plt.ylabel('Proportion anyone unemployed in the HH')
# Bonferroni post-hoc test
print(" ")
print("Bonferroni post-hoc test")
# COMPARE 1 VS 2
print(" ")
print ("compare 1 vs 2")
recode2 = {1: 1, 2: 2}
sub2['COMP1v2']= sub2['PPEDUCAT'].map(recode2)
# crosstab
ct2=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP1v2'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
# COMPARE 1 VS 3
print(" ")
print ("compare 1 vs 3")
recode3 = {1: 1, 3: 3}
sub2['COMP1v3']= sub2['PPEDUCAT'].map(recode3)
# crosstab
ct3=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP1v3'])
print (ct3)
# column percentages
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
# COMPARE 1 VS 4
print(" ")
print ("compare 1 vs 4")
recode4 = {1: 1, 4: 4}
sub2['COMP1v4']= sub2['PPEDUCAT'].map(recode4)
# crosstab
ct4=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP1v4'])
print (ct4)
# column percentages
colsum=ct4.sum(axis=0)
colpct=ct4/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs4= scipy.stats.chi2_contingency(ct4)
print (cs4)
# COMPARE 2 VS 3
print(" ")
print ("compare 2 vs 3")
recode23 = {2: 2, 3: 3}
sub2['COMP2v3']= sub2['PPEDUCAT'].map(recode23)
# crosstab
ct23=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP2v3'])
print (ct23)
# column percentages
colsum=ct23.sum(axis=0)
colpct=ct23/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs23= scipy.stats.chi2_contingency(ct23)
print (cs23)
# COMPARE 2 VS 4
print(" ")
print ("compare 2 vs 4")
recode24 = {2: 2, 4: 4}
sub2['COMP2v4']= sub2['PPEDUCAT'].map(recode24)
# crosstab
ct24=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP2v4'])
print (ct24)
# column percentages
colsum=ct24.sum(axis=0)
colpct=ct24/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs24= scipy.stats.chi2_contingency(ct24)
print (cs24)
# COMPARE 3 VS 4
print(" ")
print ("compare 3 vs 4")
recode34 = {3: 3, 4: 4}
sub2['COMP3v4']= sub2['PPEDUCAT'].map(recode34)
# crosstab
ct34=pandas.crosstab(sub2['UNENPLOYED'], sub2['COMP3v4'])
print (ct34)
# column percentages
colsum=ct34.sum(axis=0)
colpct=ct34/colsum
print(colpct)
# Chi-square
print ('chi-square value, p value, expected counts')
cs34= scipy.stats.chi2_contingency(ct34)
print (cs34)
================================================================================================
Results:
runfile('/Users/lorenzo/Documents/Coursera/02_Data_Analysis_Tools/files/Week_2/WEEK_2_OOP_Submission_02.py', wdir='/Users/lorenzo/Documents/Coursera/02_Data_Analysis_Tools/files/Week_2')
PPEDUCAT: Education (Categorical)
PPEDUCAT
1 219
2 700
3 682
4 693
dtype: int64
W1_P11: Is anyone in your household currently unemployed?
W1_P11
-1 62
1 867
2 1365
dtype: int64
Crosstab:
PPEDUCAT 1 2 3 4
UNENPLOYED
0 84 353 414 514
1 124 331 251 161
column percentages:
PPEDUCAT 1 2 3 4
UNENPLOYED
0 0.403846 0.516082 0.622556 0.761481
1 0.596154 0.483918 0.377444 0.238519
chi-square value, p value, expected counts
(128.229823336798, 1.3017582137131253e-27, 3, array([[127.20430108, 418.30645161, 406.68682796, 412.80241935],
[ 80.79569892, 265.69354839, 258.31317204, 262.19758065]]))
[Figure: bar plot of the proportion of households with anyone unemployed, by education category]
Bonferroni post-hoc test
compare 1 vs 2
COMP1v2 1.0 2.0
UNENPLOYED
0 84 353
1 124 331
COMP1v2 1.0 2.0
UNENPLOYED
0 0.403846 0.516082
1 0.596154 0.483918
chi-square value, p value, expected counts
(7.597101736033438, 0.005846220357609504, 1, array([[101.90134529, 335.09865471],
[106.09865471, 348.90134529]]))
compare 1 vs 3
COMP1v3 1.0 3.0
UNENPLOYED
0 84 414
1 124 251
COMP1v3 1.0 3.0
UNENPLOYED
0 0.403846 0.622556
1 0.596154 0.377444
chi-square value, p value, expected counts
(30.04366055462451, 4.224272521731211e-08, 1, array([[118.65292096, 379.34707904],
[ 89.34707904, 285.65292096]]))
compare 1 vs 4
COMP1v4 1.0 4.0
UNENPLOYED
0 84 514
1 124 161
COMP1v4 1.0 4.0
UNENPLOYED
0 0.403846 0.761481
1 0.596154 0.238519
chi-square value, p value, expected counts
(91.4095475375738, 1.1681149415262415e-21, 1, array([[140.86523216, 457.13476784],
[ 67.13476784, 217.86523216]]))
compare 2 vs 3
COMP2v3 2.0 3.0
UNENPLOYED
0 353 414
1 331 251
COMP2v3 2.0 3.0
UNENPLOYED
0 0.516082 0.622556
1 0.483918 0.377444
chi-square value, p value, expected counts
(15.152379365146949, 9.917325539812312e-05, 1, array([[388.90140845, 378.09859155],
[295.09859155, 286.90140845]]))
compare 2 vs 4
COMP2v4 2.0 4.0
UNENPLOYED
0 353 514
1 331 161
COMP2v4 2.0 4.0
UNENPLOYED
0 0.516082 0.761481
1 0.483918 0.238519
chi-square value, p value, expected counts
(87.52215384135604, 8.334106892462309e-21, 1, array([[436.37086093, 430.62913907],
[247.62913907, 244.37086093]]))
compare 3 vs 4
COMP3v4 3.0 4.0
UNENPLOYED
0 414 514
1 251 161
COMP3v4 3.0 4.0
UNENPLOYED
0 0.622556 0.761481
1 0.377444 0.238519
chi-square value, p value, expected counts
(29.714178536304683, 5.006728481369606e-08, 1, array([[460.53731343, 467.46268657],
[204.46268657, 207.53731343]]))