Reports on Data Analysis Coursework
Coursera Blog
Project Four:  Testing a Potential Moderator
For project 4, I chose the GapMinder data set, which looks at the relationships between several variables of interest across all countries in the world.  I wanted to see whether countries with a higher internet use rate tended to have a lower HIV rate, and to what extent this effect was moderated by polity.  I found a strong visible correlation, as shown in my scatter plot, and the Pearson correlation analysis confirmed that it was statistically significant.

Next, I considered whether this relationship was modified by polity.  Polity is a quantitative measure of democracy that ranges from -10 to 10.  I converted it to a categorical variable (LOW, MID, HIGH) and then performed separate correlation tests within each group.  I found that HIV rate and internet use rate still had a significant correlation in high polity countries (low p-value), but not in mid or low polity countries.

These findings raise an interesting question.  Since there are more high polity countries, is there really a difference in effect based on polity, or is this the result of limited statistical power due to fewer observations in the low and mid polity groups?  The correlation is negative for all three polity groups, so I wonder whether the correlations differ significantly across groups.  I'm not sure the methods I have learned so far would be sufficient to test this, but a sketch of one standard approach appears after the output below.

The scatterplots also show an interesting phenomenon: there are almost no HIV rates above 50% in low or mid polity countries.  Further tests would be required to tell whether this difference is significant.
CODE
import pandas
import numpy
import scipy.stats
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
# convert to numeric; blank strings and other non-numeric values are
# recoded to python missing (NaN) by errors='coerce'
data['hivrate'] = pandas.to_numeric(data['hivrate'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data_clean=data.dropna()
# scatter plot: internet use rate on the x axis (log scale), HIV rate on the y axis
plt.plot(data_clean['internetuserate'], data_clean['hivrate'], 'o')
plt.xlabel("Internet Use Rate")
plt.ylabel("HIV Rate")
plt.xscale('log')
plt.title("HIV Rate vs Internet Use Rate")
plt.savefig("data_clean.pdf")
plt.clf()
print('association between hivrate and internetuserate')
print(scipy.stats.pearsonr(data_clean['hivrate'], data_clean['internetuserate']))
# group polity scores into LOW / MID / HIGH categories
recode1 = {}
for idx in range(-10, -3):
    recode1[idx] = 'LOW_polity'
for idx in range(-3, 4):
    recode1[idx] = 'MID_polity'
for idx in range(4, 10 + 1):
    recode1[idx] = 'HIGH_polity'
# alternative: data_clean.apply(lambda row: recode1[row['polityscore']], axis=1)
data_clean['polityGroup'] = data_clean['polityscore'].map(recode1)
print(data_clean['polityGroup'])
# subset by polity group and repeat the scatter plot and correlation test
for group in ['LOW', 'MID', 'HIGH']:
    sub = data_clean[data_clean['polityGroup'] == '%s_polity' % group]

    plt.plot(sub['internetuserate'], sub['hivrate'], 'o')
    plt.xlabel("Internet Use Rate")
    plt.ylabel("HIV Rate")
    plt.xscale('log')
    plt.title("HIV Rate vs Internet Use Rate for %s polity countries" % group)
    plt.savefig("group_%s.pdf" % group)
    plt.clf()

    print('association between hivrate and internetuserate for %s polity countries' % group)
    print(scipy.stats.pearsonr(sub['hivrate'], sub['internetuserate']))
OUTPUT
association between hivrate and internetuserate
(-0.34486521888032518, 4.2017570567076531e-05)
association between hivrate and internetuserate for LOW polity countries
(-0.31916290244378037, 0.15847170186437021)
association between hivrate and internetuserate for MID polity countries
(-0.14079247309123549, 0.53200179148983429)
association between hivrate and internetuserate for HIGH polity countries
(-0.38371766591794482, 0.00015950868986257328)
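Comparing two correlations directly was not covered in the course, but one standard approach is Fisher's r-to-z transformation. Below is a minimal sketch of how the HIGH and LOW polity correlations from the output above could be compared; the compare_correlations helper is my own, and the group sizes n1 and n2 are hypothetical placeholders (the real counts would come from len(sub) in the loop above).

import numpy
import scipy.stats

def compare_correlations(r1, n1, r2, n2):
    # Fisher z-transform each correlation coefficient
    z1 = numpy.arctanh(r1)
    z2 = numpy.arctanh(r2)
    # standard error of the difference between the two z values
    se = numpy.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # two-sided p-value from the standard normal distribution
    p = 2 * scipy.stats.norm.sf(abs(z))
    return z, p

# r values from the output above; n1=95 and n2=21 are hypothetical group sizes
print(compare_correlations(-0.3837, 95, -0.3192, 21))

If this test failed to reject, the apparent difference between groups could simply reflect sampling variation and the smaller sizes of the low and mid polity groups.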
[Figures: scatter plots of HIV rate vs internet use rate, overall and for the LOW, MID, and HIGH polity groups]
Project Three:  Run a Pearson Correlation Test
For project 3, I chose the GapMinder data set, which looks at the relationships between several variables of interest across all countries in the world.  I wanted to see if countries with a higher mean income per person tended to have a higher life expectancy.  My scatter plot showed that this was true: there was a strong linear correlation between life expectancy and the log of income per person.  A Pearson correlation test confirmed this with r = 0.60, which corresponds to an r-squared of 0.36; that is, 36% of the variance in one variable is explained by the other.  The correlation was statistically significant, with a p-value of about 1e-18, so I rejected the null hypothesis that knowing a country's income tells us nothing about its life expectancy.
CODE
import pandas
import numpy
import scipy.stats
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
# convert to numeric; blank strings are recoded to python missing (NaN)
data['lifeexpectancy'] = pandas.to_numeric(data['lifeexpectancy'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data_clean=data.dropna()
# scatter plot: income per person on the x axis (log scale), life expectancy on the y axis
plt.plot(data_clean['incomeperperson'], data_clean['lifeexpectancy'], 'o')
plt.xlabel("Income per Person")
plt.ylabel("Life Expectancy")
plt.xscale('log')
plt.title("Life Expectancy vs Income Per Person")
plt.savefig("scatterplot.pdf")
print('association between lifeexpectancy and incomeperperson')
print(scipy.stats.pearsonr(data_clean['lifeexpectancy'], data_clean['incomeperperson']))
OUTPUT
association between lifeexpectancy and incomeperperson
(0.60151634019643985, 1.065341893502591e-18)
(the first value is r, the second is the p-value)
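Note that scipy.stats.pearsonr returns r, not r-squared; squaring the first return value gives the share of variance explained. A small follow-up using the same variables:

r, p = scipy.stats.pearsonr(data_clean['lifeexpectancy'], data_clean['incomeperperson'])
# r**2 is the proportion of variance in one variable explained by the other
print('r-squared = %.4f' % (r ** 2))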
[Figure: scatter plot of life expectancy vs income per person, log-scaled x axis]
Project Two:  Run a Chi-Square Test of Independence
For project 2, I chose the NESARC data set, which looks at the relationships between various health conditions.  In particular, I wanted to see whether people who had occurrences of unusually elevated mood (S5Q1: 1+ week period of excitement/elation that seemed not normal self) were also prone to lowered mood (S4AQ1: 2-week period when felt sad, blue, depressed, or down most of the time).  To test this, I ran a chi-square test of independence.  The null hypothesis H_0 was that the frequency with which people reported a prolonged low mood did not differ depending on whether they also reported a prolonged elevated mood.

The results show more than twice the chance of having had a prolonged low mood given a prolonged unusually elevated mood (64% vs 28% in the column percentages below).  The chi-square value was 1601, which was so high that the p-value was reported as exactly 0 (either the value was estimated as 0, or the true value was too small to display).

Since there were only 2 categories in each variable, there was no need to perform a post-hoc analysis.  If one variable had more categories, I would have compared each pair of categories in that variable against the other variable and adjusted the p-values with the Bonferroni correction to determine which pairs had significantly different effects; a sketch of that procedure appears after the output below.
CODE
import pandas
import numpy
import scipy.stats
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
# S4AQ1: ever had 2-week period when felt sad, blue, depressed, or down
# most of the time (1=YES, 2=NO, 9=UNKNOWN)
data['S4AQ1'] = pandas.to_numeric(data['S4AQ1'], errors='coerce')

# S5Q1: had 1+ week period of excitement/elation that seemed not normal
# self (1=YES, 2=NO, 9=UNKNOWN)
data['S5Q1'] = pandas.to_numeric(data['S5Q1'], errors='coerce')

# recode unknown (9) to python missing (NaN)
data['S4AQ1'] = data['S4AQ1'].replace(9, numpy.nan)
data['S5Q1'] = data['S5Q1'].replace(9, numpy.nan)
# contingency table of observed counts
ct1 = pandas.crosstab(data['S4AQ1'], data['S5Q1'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square test of independence
print('chi-square value, p value, degrees of freedom, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)
OUTPUT
S5Q1        1      2
S4AQ1
1        1786  10853
2        1018  28231

S5Q1            1         2
S4AQ1
1        0.636947  0.277684
2        0.363053  0.722316
chi-square value, p value, degrees of freedom, expected counts
(1601.0553150105552, 0.0, 1, array([[   846.05987395,  11792.94012605],
       [  1957.94012605,  27291.05987395]]))
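As mentioned above, no post-hoc analysis was needed with only two categories per variable. For the many-category case, here is a minimal sketch of the Bonferroni-corrected pairwise procedure, reusing the pandas and scipy.stats imports from the code above; the posthoc_chi2 helper is hypothetical, not part of the course materials.

import itertools

def posthoc_chi2(data, rowvar, colvar, categories):
    # run a chi-square test for every pair of colvar categories,
    # judging each against a Bonferroni-adjusted significance level
    pairs = list(itertools.combinations(categories, 2))
    alpha = 0.05 / len(pairs)
    for a, b in pairs:
        sub = data[data[colvar].isin([a, b])]
        ct = pandas.crosstab(sub[rowvar], sub[colvar])
        chi2, p, dof, expected = scipy.stats.chi2_contingency(ct)
        print('%s vs %s: chi2 = %.2f, p = %.4g, significant at %.4g: %s'
              % (a, b, chi2, p, alpha, p < alpha))

# with only two categories this reduces to the single test above
posthoc_chi2(data, 'S4AQ1', 'S5Q1', [1, 2])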
Data Analysis Tools Week 1 Project
For project 1, I chose the GapMinder data set, which looks at the relationships between several variables of interest across all countries in the world.  In particular, I wanted to focus on the relationship between polity score, a measure of democracy, and female employment rate.  Polity score is not a categorical variable, so I created a new variable, "positive polity score."  To do this, I expanded on the mapping operator from the example code, creating a mapping of negative polity scores to 0 and non-negative polity scores to 1.  I ran my code (below) to test the hypothesis that whether or not a country has a positive polity score has a significant effect on female employment rate.  The null hypothesis was that it does not, that is, that female employment rate is not affected by whether or not a country has a positive polity score.  I computed the F-statistic and obtained a p-value greater than 0.05, so I could not reject the null hypothesis.

Code is below:

import numpy
import pandas
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
data = pandas.read_csv('gapminder.csv', low_memory=False)
data['femaleemployrate'] = pandas.to_numeric(data['femaleemployrate'], errors='coerce')

# map negative polity scores to '0' and non-negative scores to '1'
recode1 = {}
for idx in range(-10, 0):
    recode1[str(idx)] = '0'
for idx in range(0, 10 + 1):
    recode1[str(idx)] = '1'
data['positivepolityscore'] = data['polityscore'].map(recode1)
model1 = smf.ols(formula='femaleemployrate ~ C(positivepolityscore)', data=data)
results1 = model1.fit()
print(results1.summary())
Printed results (the relevant p-value is Prob (F-statistic)):
                            OLS Regression Results
==============================================================================
Dep. Variable:       femaleemployrate   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.3181
Date:                Mon, 05 Apr 2021   Prob (F-statistic):              0.574
Time:                        17:07:40   Log-Likelihood:                -648.72
No. Observations:                 158   AIC:                             1301.
Df Residuals:                     156   BIC:                             1308.
Df Model:                           1
===============================================================================================
                                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------
Intercept                     47.0304      2.179     21.582      0.000      42.726      51.335
C(positivepolityscore)[T.1]    1.4597      2.588      0.564      0.574      -3.653       6.572
==============================================================================
Omnibus:                        0.317   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.853   Jarque-Bera (JB):                0.155
Skew:                           0.071   Prob(JB):                        0.925
Kurtosis:                       3.056   Cond. No.                         3.47
==============================================================================
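One loose end: the statsmodels.stats.multicomp import is never used above, because with only two groups the F-test alone answers the question. If polity score were recoded into more than two categories, a post hoc test would be needed to identify which group means differ. A minimal sketch using Tukey's HSD, assuming the data and imports above (with the two-level variable it just repeats the single comparison; it only becomes informative with three or more levels):

sub = data[['femaleemployrate', 'positivepolityscore']].dropna()
mc = multi.MultiComparison(sub['femaleemployrate'], sub['positivepolityscore'])
# pairwise mean comparisons with Tukey's honestly-significant-difference correction
print(mc.tukeyhsd().summary())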