Reports on Data Analysis Coursework
Coursera Blog
Project Four:  Testing a Potential Moderator
For project 4, I chose the GapMinder data set, which looks at the relationships between several variables of interest across all countries in the world.  I wanted to see whether countries with a higher internet use rate tended to have a lower HIV rate, and to what extent this effect was moderated by polity.  I found a strong visible correlation, as shown in my scatter plot, and the Pearson correlation analysis confirmed that it was statistically significant.

Next, I considered whether this relationship was modified by polity.  Polity is a quantitative measure of democracy that ranges from -10 to 10.  I converted it to a categorical variable (LOW, MID, HIGH) and then performed separate correlation tests within each group.  I found that HIV rate and internet use rate still had a significant correlation in high polity countries (low p-value), but not in mid or low polity countries.

These findings raise an interesting question.  Since there are more high polity countries, is there really a difference in effect based on polity, or is this the result of limited statistical power due to fewer observations in the low and mid polity groups?  The correlation is negative for all three polity groups, so I wonder whether the correlations differ significantly across groups.  I'm not sure the methods I have learned so far would be sufficient to test this, but a sketch of one standard approach appears after the output below.

The scatterplots also show an interesting phenomenon: there are almost no HIV rates above 50% in low or mid polity countries.  Further tests would be required to tell whether this difference is significant.
CODE
import pandas
import numpy
import scipy.stats
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
# convert to numeric; blank strings and other non-numeric values are
# recoded to python missing (NaN) by errors='coerce'
data['hivrate'] = pandas.to_numeric(data['hivrate'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data_clean=data.dropna()
# scatter plot: internet use rate on the x axis (log scale), HIV rate on the y axis
plt.plot(data_clean['internetuserate'], data_clean['hivrate'], 'o')
plt.xlabel("Internet Use Rate")
plt.ylabel("HIV Rate")
plt.xscale('log')
plt.title("HIV Rate vs Internet Use Rate")
plt.savefig("data_clean.pdf")
plt.clf()
print('association between hivrate and internetuserate')
print(scipy.stats.pearsonr(data_clean['hivrate'], data_clean['internetuserate']))
# group polity scores into LOW / MID / HIGH categories
recode1 = {}
for idx in range(-10, -3):
    recode1[idx] = 'LOW_polity'
for idx in range(-3, 4):
    recode1[idx] = 'MID_polity'
for idx in range(4, 10 + 1):
    recode1[idx] = 'HIGH_polity'
# alternative: data_clean.apply(lambda row: recode1[row['polityscore']], axis=1)
data_clean['polityGroup'] = data_clean['polityscore'].map(recode1)
print(data_clean['polityGroup'])
# subset by polity group and repeat the scatter plot and correlation test
for group in ['LOW', 'MID', 'HIGH']:
    sub = data_clean[data_clean['polityGroup'] == '%s_polity' % group]

    plt.plot(sub['internetuserate'], sub['hivrate'], 'o')
    plt.xlabel("Internet Use Rate")
    plt.ylabel("HIV Rate")
    plt.xscale('log')
    plt.title("HIV Rate vs Internet Use Rate for %s polity countries" % group)
    plt.savefig("group_%s.pdf" % group)
    plt.clf()

    print('association between hivrate and internetuserate for %s polity countries' % group)
    print(scipy.stats.pearsonr(sub['hivrate'], sub['internetuserate']))
OUTPUT
association between hivrate and internetuserate
(-0.34486521888032518, 4.2017570567076531e-05)
association between hivrate and internetuserate for LOW polity countries
(-0.31916290244378037, 0.15847170186437021)
association between hivrate and internetuserate for MID polity countries
(-0.14079247309123549, 0.53200179148983429)
association between hivrate and internetuserate for HIGH polity countries
(-0.38371766591794482, 0.00015950868986257328)
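Comparing two correlations directly was not covered in the course, but one standard approach is Fisher's r-to-z transformation. Below is a minimal sketch of how the HIGH and LOW polity correlations from the output above could be compared; the compare_correlations helper is my own, and the group sizes n1 and n2 are hypothetical placeholders (the real counts would come from len(sub) in the loop above).

import numpy
import scipy.stats

def compare_correlations(r1, n1, r2, n2):
    # Fisher z-transform each correlation coefficient
    z1 = numpy.arctanh(r1)
    z2 = numpy.arctanh(r2)
    # standard error of the difference between the two z values
    se = numpy.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # two-sided p-value from the standard normal distribution
    p = 2 * scipy.stats.norm.sf(abs(z))
    return z, p

# r values from the output above; n1=95 and n2=21 are hypothetical group sizes
print(compare_correlations(-0.3837, 95, -0.3192, 21))

If this test failed to reject, the apparent difference between groups could simply reflect sampling variation and the smaller sizes of the low and mid polity groups.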
[Figures: scatter plots of HIV rate vs internet use rate, overall and for the LOW, MID, and HIGH polity groups]
Project Three:  Run a Pearson Correlation Test
For project 3, I chose the GapMinder data set, which looks at the relationships between several variables of interest across all countries in the world.  I wanted to see if countries with a higher mean income per person tended to have a higher life expectancy.  My scatter plot showed that this was true: there was a strong linear correlation between life expectancy and the log of income per person.  A Pearson correlation test confirmed this with r = 0.60, which corresponds to an r-squared of 0.36; that is, 36% of the variance in one variable is explained by the other.  The correlation was statistically significant, with a p-value of about 1e-18, so I rejected the null hypothesis that knowing a country's income tells us nothing about its life expectancy.
CODE
import pandas
import numpy
import scipy.stats
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
# convert to numeric; blank strings are recoded to python missing (NaN)
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data_clean=data.dropna()
# scatter plot: income per person on the x axis (log scale), life expectancy on the y axis
plt.plot(data_clean['incomeperperson'], data_clean['lifeexpectancy'], 'o')
plt.xlabel("Income per Person")
plt.ylabel("Life Expectancy")
plt.xscale('log')
plt.title("Life Expectancy vs Income Per Person")
plt.savefig("scatterplot.pdf")
print('association between lifeexpectancy and incomeperperson')
print(scipy.stats.pearsonr(data_clean['lifeexpectancy'], data_clean['incomeperperson']))
OUTPUT
association between lifeexpectancy and incomeperperson
(0.60151634019643985, 1.065341893502591e-18)
(the first value is r, the second is the p-value)
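Note that scipy.stats.pearsonr returns r, not r-squared; squaring the first return value gives the share of variance explained. A small follow-up using the same variables:

r, p = scipy.stats.pearsonr(data_clean['lifeexpectancy'], data_clean['incomeperperson'])
# r**2 is the proportion of variance in one variable explained by the other
print('r-squared = %.4f' % (r ** 2))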
[Figure: scatter plot of life expectancy vs income per person, log-scaled x axis]
Project Two:  Run a Chi-Square Test of Independence
For project 2, I chose the NESARC data set, which looks at the relationships between various health conditions.  In particular, I wanted to see whether people who had occurrences of unusually elevated mood (S5Q1: 1+ week period of excitement/elation that seemed not normal self) were also prone to lowered mood (S4AQ1: 2-week period when felt sad, blue, depressed, or down most of the time).  To test this, I ran a chi-square test of independence.  The null hypothesis H_0 was that the frequency with which people reported a prolonged low mood did not differ depending on whether they also reported a prolonged elevated mood.

The results show more than twice the chance of having had a prolonged low mood given a prolonged unusually elevated mood (64% vs 28% in the column percentages below).  The chi-square value was 1601, which was so high that the p-value was reported as exactly 0 (either the value was estimated as 0, or the true value was too small to display).

Since there were only 2 categories in each variable, there was no need to perform a post-hoc analysis.  If one variable had more categories, I would have compared each pair of categories in that variable against the other variable and adjusted the p-values with the Bonferroni correction to determine which pairs had significantly different effects; a sketch of that procedure appears after the output below.
CODE
import pandas
import numpy
import scipy.stats
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# S4AQ1: ever had 2-week period when felt sad, blue, depressed, or down
# most of the time (1=YES, 2=NO, 9=UNKNOWN)
data['S4AQ1'] = pandas.to_numeric(data['S4AQ1'], errors='coerce')

# S5Q1: had 1+ week period of excitement/elation that seemed not normal
# self (1=YES, 2=NO, 9=UNKNOWN)
data['S5Q1'] = pandas.to_numeric(data['S5Q1'], errors='coerce')

# recode unknown (9) to python missing (NaN)
data['S4AQ1'] = data['S4AQ1'].replace(9, numpy.nan)
data['S5Q1'] = data['S5Q1'].replace(9, numpy.nan)
# contingency table of observed counts
ct1 = pandas.crosstab(data['S4AQ1'], data['S5Q1'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square test of independence
print('chi-square value, p value, degrees of freedom, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
OUTPUT
S5Q1        1      2
S4AQ1
1        1786  10853
2        1018  28231

S5Q1            1         2
S4AQ1
1        0.636947  0.277684
2        0.363053  0.722316
chi-square value, p value, degrees of freedom, expected counts
(1601.0553150105552, 0.0, 1, array([[   846.05987395,  11792.94012605],
       [  1957.94012605,  27291.05987395]]))
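As mentioned above, no post-hoc analysis was needed with only two categories per variable. For the many-category case, here is a minimal sketch of the Bonferroni-corrected pairwise procedure, reusing the pandas and scipy.stats imports from the code above; the posthoc_chi2 helper is hypothetical, not part of the course materials.

import itertools

def posthoc_chi2(data, rowvar, colvar, categories):
    # run a chi-square test for every pair of colvar categories,
    # judging each against a Bonferroni-adjusted significance level
    pairs = list(itertools.combinations(categories, 2))
    alpha = 0.05 / len(pairs)
    for a, b in pairs:
        sub = data[data[colvar].isin([a, b])]
        ct = pandas.crosstab(sub[rowvar], sub[colvar])
        chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
        print('%s vs %s: chi2 = %.2f, p = %.4g, significant at %.4g: %s'
              % (a, b, chi2, p, alpha, p < alpha))

# with only two categories this reduces to the single test above
posthoc_chi2(data, 'S4AQ1', 'S5Q1', [1, 2])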
Data Analysis Tools Week 1 Project
For project 1, I chose the GapMinder data set, which looks at the relationships between several variables of interest across all countries in the world.  In particular, I wanted to focus on the relationship between polity score, a measure of democracy, and female employment rate.  Polity score is not a categorical variable, so I created a new variable, "positive polity score."  To do this, I expanded on the mapping operator from the example code, creating a mapping of negative polity scores to 0 and non-negative polity scores to 1.  I ran my code (below) to test the hypothesis that whether or not a country has a positive polity score has a significant effect on female employment rate.  The null hypothesis was that it does not, that is, that female employment rate is not affected by whether or not a country has a positive polity score.  I computed the F-statistic and obtained a p-value greater than 0.05, so I could not reject the null hypothesis.

Code is below:

import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
data = pandas.read_csv('gapminder.csv', low_memory=False)
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')

# map negative polity scores to '0' and non-negative scores to '1'
recode1 = {}
for idx in range(-10, 0):
    recode1[str(idx)] = '0'
for idx in range(0, 10 + 1):
    recode1[str(idx)] = '1'
data['positivepolityscore'] = data['polityscore'].map(recode1)
model1 = smf.ols(formula='femaleemployrate ~ C(positivepolityscore)', data=data)
results1 = model1.fit()
print(results1.summary())
Printed results (the relevant p-value is Prob (F-statistic)):
                            OLS Regression Results
==============================================================================
Dep. Variable:       femaleemployrate   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.3181
Date:                Mon, 05 Apr 2021   Prob (F-statistic):              0.574
Time:                        17:07:40   Log-Likelihood:                -648.72
No. Observations:                 158   AIC:                             1301.
Df Residuals:                     156   BIC:                             1308.
Df Model:                           1
===============================================================================================
                                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------
Intercept                     47.0304      2.179     21.582      0.000      42.726      51.335
C(positivepolityscore)[T.1]    1.4597      2.588      0.564      0.574      -3.653       6.572
==============================================================================
Omnibus:                        0.317   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.853   Jarque-Bera (JB):                0.155
Skew:                           0.071   Prob(JB):                        0.925
Kurtosis:                       3.056   Cond. No.                         3.47
==============================================================================
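One loose end: the statsmodels.stats.multicomp import is never used above, because with only two groups the F-test alone answers the question. If polity score were recoded into more than two categories, a post hoc test would be needed to identify which group means differ. A minimal sketch using Tukey's HSD, assuming the data and imports above (with the two-level variable it just repeats the single comparison; it only becomes informative with three or more levels):

sub = data[['femaleemployrate', 'positivepolityscore']].dropna()
mc = multi.MultiComparison(sub['femaleemployrate'], sub['positivepolityscore'])
# pairwise mean comparisons with Tukey's honestly-significant-difference correction
print(mc.tukeyhsd().summary())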