sizzlinglightfestival-blog1
Running a k-means Cluster Analysis
A k-means cluster analysis was conducted to identify underlying subgroups of countries based on their similarity across 8 variables that represent country-level characteristics. The clustering variables were 'incomeperperson', 'urbanrate', 'alcconsumption', 'breastcancerper100th', 'hivrate', 'suicideper100th', 'employrate' and 'lifeexpectancy'. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
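As a minimal sketch of the standardization step, the snippet below converts one variable to z-scores using only the standard library. The five urban-rate values are hypothetical and the helper name `standardize` is an illustration, not the code used for the analysis.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population SD (divides by n)
    return [(x - m) / s for x in values]

# Hypothetical urban-rate values for five countries (not from the Gapminder file).
urbanrate = [24.0, 46.0, 65.0, 88.0, 27.0]
z = standardize(urbanrate)

print([round(v, 3) for v in z])
print(round(mean(z), 10), round(pstdev(z), 10))  # mean and SD after scaling
```

Putting every variable on the same scale this way keeps a variable measured in dollars from dominating the Euclidean distances just because its numbers are larger.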
A series of k-means cluster analyses was conducted on the training data specifying k = 1 to 9 clusters, using Euclidean distance. The variance in the clustering variables was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret. The elbow curve was inconclusive, suggesting that the 2-, 3- and 4-cluster solutions might each be interpreted. The results below are an interpretation of the 4-cluster solution.
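The elbow procedure can be sketched as follows. This is a plain-Python version of Lloyd's k-means algorithm run on hypothetical two-dimensional points, not the library routine or the 8-variable data used for the post. It assumes the elbow measure is the mean Euclidean distance from each observation to its nearest centroid.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, mean distance to nearest centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    avg_dist = sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)
    return centroids, avg_dist

# Hypothetical data: three blobs of 30 points each.
rng = random.Random(42)
blob_centres = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
          for cx, cy in blob_centres for _ in range(30)]

# One k-means run per k = 1..9; these values would be plotted as the elbow curve.
elbow = {k: kmeans(points, k)[1] for k in range(1, 10)}
for k, d in elbow.items():
    print(k, round(d, 3))
```

The "elbow" is the value of k after which the average distance stops dropping sharply. When no such bend stands out, as reported here, several candidate solutions have to be compared.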
Canonical discriminant analysis was used to reduce the 8 clustering variables to 2 canonical variables that accounted for most of the variance in the clustering variables. A scatterplot of the two canonical variables by cluster indicated that the observations in clusters 1 and 4 were densely packed, with relatively low within-cluster variance, and did not overlap very much with the other clusters. Observations in clusters 2 and 3 were more spread out, showing high within-cluster variance. This plot suggests that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the solutions with fewer than 4 clusters.
The means on the clustering variables showed that, compared to the other clusters, countries in cluster 3 had the highest levels of income, urban rate and life expectancy. Cluster 2 had the highest HIV rate and the lowest values on the other clustering variables.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on life expectancy. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on life expectancy (p=5.10e-86), and the Tukey post hoc comparisons showed significant differences between clusters on life expectancy.
Running a Lasso Regression Analysis
A lasso regression analysis was conducted to identify a subset of variables from a pool of 7 quantitative predictor variables that best predicted a quantitative response variable measuring life expectancy. The predictor variables were 'incomeperperson', 'urbanrate', 'alcconsumption', 'breastcancerper100th', 'hivrate', 'suicideper100th' and 'employrate'. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with 10-fold cross-validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
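The splitting scheme described above can be sketched with row indices alone. The helper names, the random seed and the sample size of 171 rows are assumptions made for illustration. The actual model fitting (least angle regression) is left out.

```python
import random

def train_test_split_indices(n, test_fraction=0.3, seed=123):
    """Shuffle row indices and cut them into training and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_folds(indices, k=10):
    """Deal the training indices into k folds of near-equal size."""
    return [indices[i::k] for i in range(k)]

train, test = train_test_split_indices(171)   # 171 rows is an assumed sample size
folds = k_folds(train, k=10)

for i, held_out in enumerate(folds):
    fit_rows = [r for r in train if r not in held_out]
    # A model would be fitted on fit_rows and scored on held_out here.
    print(i, len(fit_rows), len(held_out))
```

Each training row is held out exactly once. The squared errors on the held-out folds are averaged at every step of the lasso path, and the test set is used only for the final check.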
Of the 7 predictor variables, 6 were retained in the selected model. During the estimation process, income, urban rate and HIV rate were most strongly associated with life expectancy. These 6 variables accounted for 70% of the variance in the life expectancy response variable.
Running a Random Forest
Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating life expectancy: 'incomeperperson', 'urbanrate', 'alcconsumption', 'breastcancerper100th', 'hivrate', 'suicideper100th' and 'employrate'.
The explanatory variables with the highest relative importance scores were income per person and urban rate. The accuracy of the random forest was 93%. Growing multiple trees rather than a single tree added much to the overall accuracy of the model, suggesting that interpretation of a random forest may be appropriate.
Running a Classification Tree
Confusion matrix and accuracy score:
[[23  5]
 [ 4 25]]
0.842105263158
The response variable, life expectancy, is quantitative, so it was binned into two groups: above-average life expectancy (1) and below-average life expectancy (0).
The following explanatory variables were included as possible contributors to a classification tree model evaluating life expectancy: 'incomeperperson', 'urbanrate', 'alcconsumption', 'breastcancerper100th', 'hivrate', 'suicideper100th' and 'employrate'.
Income per person was the first variable to separate the sample into two subgroups. Countries with an income of less than 1274.51 were more likely to have short life expectancy than countries with an income per person of more than 1274.51.
Of the countries with an income of less than 1274.51, a further subdivision was made on income again. Countries with an income of less than 830.9976 were more likely to have short life expectancy.
Of the countries with an income of more than 1274.51, a further subdivision was made on HIV rate. Countries with an HIV rate of less than 4.05 were more likely to have long life expectancy.
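Reading the rows of the confusion matrix as the true classes, the total model classified 84% of the sample correctly: 86% of the countries with above-average life expectancy (sensitivity, 25 of 29) and 82% of the countries with below-average life expectancy (specificity, 23 of 28).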
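The three rates can be recomputed directly from the confusion matrix. This sketch assumes the matrix is laid out with true classes in rows and predicted classes in columns, which is the scikit-learn convention. If the arguments were passed the other way round, the two class-wise rates would be read down the columns instead.

```python
# Confusion matrix from the post, read as rows = true class, columns = predicted class
# (this reading is an assumption).
matrix = [[23, 5],
          [4, 25]]

tn, fp = matrix[0]   # true class 0: below-average life expectancy
fn, tp = matrix[1]   # true class 1: above-average life expectancy
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of above-average countries classified correctly
specificity = tn / (tn + fp)   # share of below-average countries classified correctly

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```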
Test a Logistic Regression Model
Both the explanatory variables and the response variable were binned into two groups based on their mean values, so each variable has two groups: above average (1) and below average (0).
In a model that included income per person, urban rate and alcohol consumption, the odds of having above-average life expectancy were more than 23 times higher for countries with above-average income per person than for countries with below-average income per person (OR=23.51, 95% CI=3.01-183.76, p=.0003). Urban rate was also significantly associated with life expectancy: countries with above-average urban rate were more likely to have above-average life expectancy (OR=4.56, 95% CI=2.10-9.91, p=.0001). No confounding variable was identified. The p value for alcohol consumption was 0.048, which means alcohol consumption is also associated with the life expectancy group.
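An odds ratio and its confidence interval are the exponentiated logistic regression coefficient and its interval. The sketch below works backwards from the figures reported for urban rate to the log-odds coefficient and the standard error they imply. It assumes a normal-approximation 95% interval (multiplier 1.96), and the variable names are illustrative.

```python
import math

# Reported for urban rate in the post.
odds_ratio, ci_low, ci_high = 4.56, 2.10, 9.91

beta = math.log(odds_ratio)                                # log-odds coefficient
se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)   # SE implied by the 95% CI
z = beta / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))             # normal approximation

print(round(beta, 3), round(se, 3), round(z, 2), p_two_sided)

# Going the other way: coefficient and SE back to the odds ratio and its interval.
print(round(math.exp(beta), 2),
      round(math.exp(beta - 1.96 * se), 2),
      round(math.exp(beta + 1.96 * se), 2))
```

Because the interval is symmetric on the log scale, it looks lopsided on the odds-ratio scale. A wide interval such as the one reported for income reflects a large standard error on the coefficient.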
Test a Multiple Regression Model
After adjusting for urban rate and alcohol consumption, income per person (Beta=0.0003, p=.0001) was significantly and positively associated with life expectancy. Urban rate was also significantly associated with life expectancy, such that countries with a higher urban rate have longer life expectancy (Beta=0.1619, p=.0001).
The Q-Q plot shows that the residuals do not follow a perfect normal distribution, meaning that there may be other explanatory variables that need to be considered.
The residual plot shows that the residuals are noticeably larger in absolute value at lower values of income per person. This model does not predict life expectancy well for countries with lower income per person, which suggests a curvilinear association.
The partial residual plot shows that the residuals are spread out in a random pattern around the partial regression line. The association between income per person and life expectancy is weak.
The leverage plot shows that the outliers have very small leverage, so they do not have undue influence on the estimation of the regression model.
Test a Basic Linear Regression Model
Income per person was centred; the mean of the centred variable is -1.1006952924865552e-12, which is zero up to floating-point rounding.
The results of the linear regression model indicated that income per person (Beta=0.0006, p=1.07e-18) was significantly and positively associated with life expectancy.
The R-squared value is 0.362, which means that only 36.2% of the variability in life expectancy is explained by this regression model.
Writing About Your Data
Sample
The sample comes from Gapminder, which aggregates data on GDP per capita, life expectancy, total employment rate and estimated HIV prevalence for a total of 215 areas.
The data set included 192 UN members and 24 of the 51 other entities listed in Wikipedia's “List of countries”. The data analytic sample for this study included Gross National Income per capita and life expectancy for all 215 areas.
Procedure
GapMinder relies on data reporting. The data was downloaded in 2010 from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau's International Database, the United Nations Statistics Division, the World Bank, the Human Mortality Database and the UN Population Division's World Population Prospects.
Measures
For Gross National Income per capita, data for 206 areas was collected from the United Nations Statistics Division. For the areas missing from the United Nations Statistics Division data, figures came from Landsbanki Foroya and the World Bank. The variable measures Gross Domestic Product per capita and ranges from $103 to $105147. Inflation, but not differences in the cost of living between countries, has been taken into account. For the current analysis, it was binned into 5 categories based on a quantile split.
For life expectancy, the two main sources were the Human Mortality Database and the UN Population Division's World Population Prospects. The variable measures life expectancy at birth in years, that is, the average number of years a newborn child would live if current mortality patterns were to stay the same, and ranges from 47.8 to 83.4. For the current analysis, it was binned into 5 categories based on a quantile split.
Testing a Potential Moderator
RESULT:
                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.658
Model:                            OLS   Adj. R-squared:                  0.649
Method:                 Least Squares   F-statistic:                     79.72
Date:                Thu, 01 Jun 2017   Prob (F-statistic):           1.28e-37
Time:                        20:58:12   Log-Likelihood:                -539.52
No. Observations:                 171   AIC:                             1089.
Df Residuals:                     166   BIC:                             1105.
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            56.2721      0.974     57.793      0.000        54.350    58.194
C(INCOMEGRP)[T.2]    10.4927      1.387      7.565      0.000         7.754    13.231
C(INCOMEGRP)[T.3]    15.2263      1.387     10.977      0.000        12.488    17.965
C(INCOMEGRP)[T.4]    16.6018      1.387     11.969      0.000        13.863    19.340
C(INCOMEGRP)[T.5]    23.6333      1.387     17.038      0.000        20.895    26.372
==============================================================================
Omnibus:                       54.461   Durbin-Watson:                   2.095
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              134.477
Skew:                          -1.361   Prob(JB):                     6.29e-30
Kurtosis:                       6.386   Cond. No.                         5.77
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for life expectancy by income group
           lifeexpectancy
INCOMEGRP
1               56.272057
2               66.764765
3               71.498353
4               72.873824
5               79.905382

                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.611
Model:                            OLS   Adj. R-squared:                  0.592
Method:                 Least Squares   F-statistic:                     31.78
Date:                Thu, 01 Jun 2017   Prob (F-statistic):           6.50e-16
Time:                        20:58:12   Log-Likelihood:                -272.79
No. Observations:                  86   AIC:                             555.6
Df Residuals:                      81   BIC:                             567.8
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            57.0086      1.124     50.723      0.000        54.772    59.245
C(INCOMEGRP)[T.2]     9.5140      1.654      5.751      0.000         6.222    12.806
C(INCOMEGRP)[T.3]    13.8798      1.829      7.590      0.000        10.241    17.518
C(INCOMEGRP)[T.4]    18.3197      2.191      8.362      0.000        13.960    22.679
C(INCOMEGRP)[T.5]    20.6624      2.513      8.222      0.000        15.662    25.663
==============================================================================
Omnibus:                       13.403   Durbin-Watson:                   2.020
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               17.061
Skew:                          -0.739   Prob(JB):                     0.000197
Kurtosis:                       4.605   Cond. No.                         5.08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.631
Model:                            OLS   Adj. R-squared:                  0.601
Method:                 Least Squares   F-statistic:                     20.96
Date:                Thu, 01 Jun 2017   Prob (F-statistic):           4.03e-10
Time:                        20:58:12   Log-Likelihood:                -173.58
No. Observations:                  54   AIC:                             357.2
Df Residuals:                      49   BIC:                             367.1
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            53.4570      2.828     18.906      0.000        47.775    59.139
C(INCOMEGRP)[T.2]    11.0237      3.829      2.879      0.006         3.330    18.717
C(INCOMEGRP)[T.3]    19.8335      3.365      5.893      0.000        13.070    26.597
C(INCOMEGRP)[T.4]    16.4817      3.265      5.048      0.000         9.920    23.043
C(INCOMEGRP)[T.5]    27.2495      3.239      8.412      0.000        20.740    33.759
==============================================================================
Omnibus:                       20.199   Durbin-Watson:                   1.968
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               27.066
Skew:                          -1.391   Prob(JB):                     1.33e-06
Kurtosis:                       5.072   Cond. No.                         8.48
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.916
Model:                            OLS   Adj. R-squared:                  0.903
Method:                 Least Squares   F-statistic:                     70.57
Date:                Thu, 01 Jun 2017   Prob (F-statistic):           1.41e-13
Time:                        20:58:12   Log-Likelihood:                -66.555
No. Observations:                  31   AIC:                             143.1
Df Residuals:                      26   BIC:                             150.3
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            52.9975      1.599     33.145      0.000        49.711    56.284
C(INCOMEGRP)[T.2]    18.6460      1.958      9.521      0.000        14.621    22.671
C(INCOMEGRP)[T.3]    16.2735      1.892      8.601      0.000        12.385    20.162
C(INCOMEGRP)[T.4]    22.0411      1.768     12.469      0.000        18.407    25.675
C(INCOMEGRP)[T.5]    27.1645      1.738     15.627      0.000        23.591    30.738
==============================================================================
Omnibus:                        9.369   Durbin-Watson:                   1.593
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                9.927
Skew:                          -0.772   Prob(JB):                      0.00699
Kurtosis:                       5.303   Cond. No.                         10.2
==============================================================================
When examining the association between average income and life expectancy, an Analysis of Variance (ANOVA) revealed that, among countries with different levels of GDP per capita, those with higher GDP have longer life expectancy, with group means of 56.27, 66.76, 71.50, 72.87 and 79.91 (F=79.72, p=1.28e-37).
Alcohol consumption was then examined as a potential moderator. The p values are very small within each alcohol consumption group, and the association runs in the same direction in each group. We can say that alcohol consumption does not moderate the relationship between life expectancy and income level.
RESULT
association between income per person and life expectancy (0.60151634019643963, 1.0653418935026235e-18)
The correlation coefficient is 0.60 with a very small p value, which means the association is positive and statistically significant.
r² is 0.36, which means that 36% of the variability in life expectancy can be explained by income per person.
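The link between r and the share of explained variability is a single squaring step. The sketch below includes a small Pearson correlation helper run on hypothetical toy values and then squares the coefficient reported above. The function name and the toy numbers are illustrative only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values (not the Gapminder data).
income = [500.0, 1500.0, 4000.0, 12000.0, 30000.0]
life = [52.0, 61.0, 70.0, 74.0, 80.0]
r_toy = pearson_r(income, life)
print(round(r_toy, 3), round(r_toy ** 2, 3))

# Coefficient reported in the post, squared to get the explained share.
r_reported = 0.60151634019643963
print(round(r_reported ** 2, 3))
```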
Running a Chi-Square Test of Independence
RESULT
INCOMEGRP   1   2   3   4   5
LIFEGRP
1          25   6   1   3   0
2          10  17   6   1   0
3           0   5  18  10   1
4           0   6   9  16   3
5           0   0   0   4  30
INCOMEGRP         1         2         3         4         5
LIFEGRP
1          0.714286  0.176471  0.029412  0.088235  0.000000
2          0.285714  0.500000  0.176471  0.029412  0.000000
3          0.000000  0.147059  0.529412  0.294118  0.029412
4          0.000000  0.176471  0.264706  0.470588  0.088235
5          0.000000  0.000000  0.000000  0.117647  0.882353
chi-square value, p value, expected counts
(244.01884753901561, 8.7043813094460167e-43, 16, array([[ 7.16374269, 6.95906433, 6.95906433, 6.95906433, 6.95906433],
       [ 6.95906433, 6.76023392, 6.76023392, 6.76023392, 6.76023392],
       [ 6.95906433, 6.76023392, 6.76023392, 6.76023392, 6.76023392],
       [ 6.95906433, 6.76023392, 6.76023392, 6.76023392, 6.76023392],
       [ 6.95906433, 6.76023392, 6.76023392, 6.76023392, 6.76023392]]))

Comparison of dict_keys([1, 2])
COMP     1.0  2.0
LIFEGRP
1         25    6
2         10   17
3          0    5
4          0    6
COMP          1.0       2.0
LIFEGRP
1        0.714286  0.176471
2        0.285714  0.500000
3        0.000000  0.147059
4        0.000000  0.176471
chi-square value, p value, expected counts
(24.450618957260325, 2.011326232682634e-05, 3, array([[ 15.72463768, 15.27536232], [ 13.69565217, 13.30434783], [ 2.53623188, 2.46376812], [ 3.04347826, 2.95652174]]))

Comparison of dict_keys([1, 3])
COMP     1.0  3.0
LIFEGRP
1         25    1
2         10    6
3          0   18
4          0    9
COMP          1.0       3.0
LIFEGRP
1        0.714286  0.029412
2        0.285714  0.176471
3        0.000000  0.529412
4        0.000000  0.264706
chi-square value, p value, expected counts
(50.149886877828045, 7.4230108342485379e-11, 3, array([[ 13.1884058 , 12.8115942 ], [ 8.11594203, 7.88405797], [ 9.13043478, 8.86956522], [ 4.56521739, 4.43478261]]))

Comparison of dict_keys([1, 4])
COMP     1.0  4.0
LIFEGRP
1         25    3
2         10    1
3          0   10
4          0   16
5          0    4
COMP          1.0       4.0
LIFEGRP
1        0.714286  0.088235
2        0.285714  0.029412
3        0.000000  0.294118
4        0.000000  0.470588
5        0.000000  0.117647
chi-square value, p value, expected counts
(54.646335807050093, 3.853370589093216e-11, 4, array([[ 14.20289855, 13.79710145], [ 5.57971014, 5.42028986], [ 5.07246377, 4.92753623], [ 8.11594203, 7.88405797], [ 2.02898551, 1.97101449]]))

Comparison of dict_keys([1, 5])
COMP     1.0  5.0
LIFEGRP
1         25    0
2         10    0
3          0    1
4          0    3
5          0   30
COMP          1.0       5.0
LIFEGRP
1        0.714286  0.000000
2        0.285714  0.000000
3        0.000000  0.029412
4        0.000000  0.088235
5        0.000000  0.882353
chi-square value, p value, expected counts
(69.0, 3.6903599414292823e-14, 4, array([[ 12.68115942, 12.31884058], [ 5.07246377, 4.92753623], [ 0.50724638, 0.49275362], [ 1.52173913, 1.47826087], [ 15.2173913 , 14.7826087 ]]))

Comparison of dict_keys([2, 3])
COMP     2.0  3.0
LIFEGRP
1          6    1
2         17    6
3          5   18
4          6    9
COMP          2.0       3.0
LIFEGRP
1        0.176471  0.029412
2        0.500000  0.176471
3        0.147059  0.529412
4        0.176471  0.264706
chi-square value, p value, expected counts
(16.780124223602485, 0.00078427148742470207, 3, array([[ 3.5, 3.5], [ 11.5, 11.5], [ 11.5, 11.5], [ 7.5, 7.5]]))

Comparison of dict_keys([2, 4])
COMP     2.0  4.0
LIFEGRP
1          6    3
2         17    1
3          5   10
4          6   16
5          0    4
COMP          2.0       4.0
LIFEGRP
1        0.176471  0.088235
2        0.500000  0.029412
3        0.147059  0.294118
4        0.176471  0.470588
5        0.000000  0.117647
chi-square value, p value, expected counts
(25.434343434343432, 4.1140263895507357e-05, 4, array([[ 4.5, 4.5], [ 9. , 9. ], [ 7.5, 7.5], [ 11. , 11. ], [ 2. , 2. ]]))

Comparison of dict_keys([2, 5])
COMP     2.0  5.0
LIFEGRP
1          6    0
2         17    0
3          5    1
4          6    3
5          0   30
COMP          2.0       5.0
LIFEGRP
1        0.176471  0.000000
2        0.500000  0.000000
3        0.147059  0.029412
4        0.176471  0.088235
5        0.000000  0.882353
chi-square value, p value, expected counts
(56.666666666666671, 1.453286023334173e-11, 4, array([[ 3. , 3. ], [ 8.5, 8.5], [ 3. , 3. ], [ 4.5, 4.5], [ 15. , 15. ]]))

Comparison of dict_keys([3, 4])
COMP     3.0  4.0
LIFEGRP
1          1    3
2          6    1
3         18   10
4          9   16
5          0    4
COMP          3.0       4.0
LIFEGRP
1        0.029412  0.088235
2        0.176471  0.029412
3        0.529412  0.294118
4        0.264706  0.470588
5        0.000000  0.117647
chi-square value, p value, expected counts
(12.817142857142857, 0.01220470436944464, 4, array([[ 2. , 2. ], [ 3.5, 3.5], [ 14. , 14. ], [ 12.5, 12.5], [ 2. , 2. ]]))

Comparison of dict_keys([3, 5])
COMP     3.0  5.0
LIFEGRP
1          1    0
2          6    0
3         18    1
4          9    3
5          0   30
COMP          3.0       5.0
LIFEGRP
1        0.029412  0.000000
2        0.176471  0.000000
3        0.529412  0.029412
4        0.264706  0.088235
5        0.000000  0.882353
chi-square value, p value, expected counts
(55.210526315789473, 2.9351647730436071e-11, 4, array([[ 0.5, 0.5], [ 3. , 3. ], [ 9.5, 9.5], [ 6. , 6. ], [ 15. , 15. ]]))

Comparison of dict_keys([4, 5])
COMP     4.0  5.0
LIFEGRP
1          3    0
2          1    0
3         10    1
4         16    3
5          4   30
COMP          4.0       5.0
LIFEGRP
1        0.088235  0.000000
2        0.029412  0.000000
3        0.294118  0.029412
4        0.470588  0.088235
5        0.117647  0.882353
chi-square value, p value, expected counts
(40.140726146918098, 4.0478469801893014e-08, 4, array([[ 1.5, 1.5], [ 0.5, 0.5], [ 5.5, 5.5], [ 9.5, 9.5], [ 17. , 17. ]]))
Model Interpretation for Chi-Square Tests:
When examining the association between income level and life expectancy, a chi-square test of independence revealed that, among countries with different levels of GDP per capita, those with higher GDP were more likely to have longer life expectancy (p=8.7043813094460167e-43).
Model Interpretation for post hoc Chi-Square Test results:
A chi-square test of independence revealed that income level and life expectancy were significantly associated, X2=244, p=8.7043813094460167e-43. Post hoc comparisons of the life expectancy distributions by income category revealed that every income group has a significantly longer life expectancy than the groups below it, except groups 3 and 4, which are considered statistically similar: their p value of 0.0122 exceeds the Bonferroni-adjusted threshold of 0.05/10 = 0.005 for the ten pairwise comparisons.
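The overall statistic and the post hoc decision can be retraced by hand. The sketch below recomputes the chi-square value from the observed counts printed above and applies a Bonferroni adjustment to the ten pairwise p values from the output. Using Bonferroni is an assumption about the correction intended. The p values are copied from the output, not recomputed.

```python
# Observed counts from the output above: rows are LIFEGRP 1-5, columns are INCOMEGRP 1-5.
observed = [
    [25, 6, 1, 3, 0],
    [10, 17, 6, 1, 0],
    [0, 5, 18, 10, 1],
    [0, 6, 9, 16, 3],
    [0, 0, 0, 4, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 4), dof)

# Pairwise p values copied from the post hoc output (income group pairs).
pairwise_p = {
    (1, 2): 2.011326232682634e-05, (1, 3): 7.4230108342485379e-11,
    (1, 4): 3.853370589093216e-11, (1, 5): 3.6903599414292823e-14,
    (2, 3): 0.00078427148742470207, (2, 4): 4.1140263895507357e-05,
    (2, 5): 1.453286023334173e-11, (3, 4): 0.01220470436944464,
    (3, 5): 2.9351647730436071e-11, (4, 5): 4.0478469801893014e-08,
}
alpha_adjusted = 0.05 / len(pairwise_p)   # Bonferroni: 0.05 / 10 comparisons
not_significant = [pair for pair, p in pairwise_p.items() if p >= alpha_adjusted]
print(alpha_adjusted, not_significant)
```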
Running an analysis of variance
RESULT
                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.658
Model:                            OLS   Adj. R-squared:                  0.649
Method:                 Least Squares   F-statistic:                     79.72
Date:                Wed, 31 May 2017   Prob (F-statistic):           1.28e-37
Time:                        21:24:32   Log-Likelihood:                -539.52
No. Observations:                 171   AIC:                             1089.
Df Residuals:                     166   BIC:                             1105.
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            56.2721      0.974     57.793      0.000        54.350    58.194
C(INCOMEGRP)[T.2]    10.4927      1.387      7.565      0.000         7.754    13.231
C(INCOMEGRP)[T.3]    15.2263      1.387     10.977      0.000        12.488    17.965
C(INCOMEGRP)[T.4]    16.6018      1.387     11.969      0.000        13.863    19.340
C(INCOMEGRP)[T.5]    23.6333      1.387     17.038      0.000        20.895    26.372
==============================================================================
Omnibus:                       54.461   Durbin-Watson:                   2.095
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              134.477
Skew:                          -1.361   Prob(JB):                     6.29e-30
Kurtosis:                       6.386   Cond. No.                         5.77
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for life expectancy by income group
           lifeexpectancy
INCOMEGRP
1               56.272057
2               66.764765
3               71.498353
4               72.873824
5               79.905382

standard deviations for life expectancy by income group
           lifeexpectancy
INCOMEGRP
1                6.269807
2                6.677341
3                5.062343
4                7.169844
5                2.188812

Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower    upper  reject
---------------------------------------------
  1      2    10.4927   6.6671  14.3183  True
  1      3    15.2263  11.4007  19.0519  True
  1      4    16.6018  12.7761  20.4274  True
  1      5    23.6333  19.8077  27.459   True
  2      3     4.7336   0.8803   8.5868  True
  2      4     6.1091   2.2558   9.9623  True
  2      5    13.1406   9.2874  16.9939  True
  3      4     1.3755  -2.4778   5.2287 False
  3      5     8.407    4.5538  12.2603  True
  4      5     7.0316   3.1783  10.8848  True
---------------------------------------------
Model Interpretation for ANOVA:
When examining the association between average income and life expectancy, an Analysis of Variance (ANOVA) revealed that, among countries with different levels of GDP per capita, those with higher GDP have longer life expectancy, with group means of 56.27, 66.76, 71.50, 72.87 and 79.91 (F=79.72, p=1.28e-37).
Model Interpretation for post hoc ANOVA results:
ANOVA revealed that income level and life expectancy were significantly associated, F=79.72, p=1.28e-37. Post hoc comparisons of mean life expectancy by income category revealed that every income group has a significantly longer life expectancy than the groups below it, except groups 3 and 4, which are statistically similar.
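The regression table and the table of group means carry the same information. With a categorical predictor coded against a reference group, the intercept is the mean of group 1 and each `C(INCOMEGRP)[T.k]` coefficient is the difference between group k's mean and that reference mean. A short sketch using the means printed above:

```python
# Group means of life expectancy by income group, copied from the output above.
group_means = {1: 56.272057, 2: 66.764765, 3: 71.498353, 4: 72.873824, 5: 79.905382}

intercept = group_means[1]   # the reference group is income group 1
coefs = {
    f"C(INCOMEGRP)[T.{g}]": m - intercept
    for g, m in group_means.items() if g != 1
}

print("Intercept", round(intercept, 4))
for name, value in coefs.items():
    print(name, round(value, 4))

# The Tukey 'meandiff' column is the same kind of difference, for every pair of groups.
print(round(group_means[4] - group_means[3], 4))
```

The small gap between groups 3 and 4 is the one pair whose Tukey interval spans zero, which is why that comparison is not rejected.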
Creating graphs for your data
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt

data = pd.read_csv('gapminder.csv', low_memory=False)

#plt.rcParams['figure.figsize'] = (10,5)

# setting variables you will be working with to numeric
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')

# drop every row that has a missing value in any column
data.dropna(inplace=True)

# life expectancy in years -> 5 ordered groups
def LIFE(lifeexpectancy):
    if lifeexpectancy < 45:
        return 1
    if lifeexpectancy >= 45 and lifeexpectancy < 55:
        return 2
    if lifeexpectancy >= 55 and lifeexpectancy < 65:
        return 3
    if lifeexpectancy >= 65 and lifeexpectancy < 75:
        return 4
    if lifeexpectancy >= 75:
        return 5

data['LIFEGRP'] = data['lifeexpectancy'].apply(LIFE)
data['LIFEGRP'] = data['LIFEGRP'].astype('category')

# income per person in dollars -> 4 ordered groups
def INCOME(income):
    if income < 3000:
        return 1
    if income >= 3000 and income < 7000:
        return 2
    if income >= 7000 and income < 10000:
        return 3
    if income >= 10000:
        return 4

data['INCOMEGRP'] = data['incomeperperson'].apply(INCOME)
data['INCOMEGRP'] = data['INCOMEGRP'].astype('category')

# alcohol consumption in litres per person -> 5 ordered groups
def ALCOHOL(alcohol):
    if alcohol < 3:
        return 1
    if alcohol >= 3 and alcohol < 6:
        return 2
    if alcohol >= 6 and alcohol < 9:
        return 3
    if alcohol >= 9 and alcohol < 12:
        return 4
    if alcohol >= 12:
        return 5

data['ALCGRP'] = data['alcconsumption'].apply(ALCOHOL)
data['ALCGRP'] = data['ALCGRP'].astype('category')

# univariate count plots
plt.figure(figsize=(10,5))
seaborn.countplot(x="LIFEGRP", data=data)
plt.xlabel('Life Expectancy')
plt.ylabel('Frequency')

plt.figure(figsize=(10,5))
seaborn.countplot(x="INCOMEGRP", data=data)
plt.xlabel('INCOME')
plt.ylabel('Frequency')

# scatterplot with fitted regression line Q->Q
plt.figure(figsize=(10,5))
scat4 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", data=data)
plt.xlabel('Income per Person')
plt.ylabel('Life expectancy')

# income quintiles
data['INCOMEGRP5'] = pd.qcut(data.incomeperperson, 5,
                             labels=["1=20th%tile", "2=40%tile", "3=60%tile", "4=80%tile", "5=100%tile"])

# bivariate box plot C->Q (catplot and height are the current seaborn names for factorplot and size)
seaborn.catplot(x='INCOMEGRP5', y='lifeexpectancy', data=data, kind="box", height=6, aspect=1.5)
plt.xlabel('income group')
plt.ylabel('life expectancy')
The life expectancy count plot is unimodal, with its highest peak at 65 to 75 years. It is skewed to the left, as the frequencies are higher in the higher age ranges.
The income count plot is bimodal, with its highest peak at the lowest category, which is less than $3000 per person.
The scatterplot shows each country's life expectancy against its GDP per capita. There seems to be a positive linear association between these two variables, but it is not clear-cut. We then plot life expectancy for each income group and can see that life expectancy does increase as income gets higher.
Making Data Management and Decisions
import numpy as np
import pandas as pd

data = pd.read_csv('gapminder.csv', low_memory=False)

# setting variables you will be working with to numeric
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')

# drop every row that has a missing value in any column
data.dropna(inplace=True)

# life expectancy in years -> 5 ordered groups
def LIFE(lifeexpectancy):
    if lifeexpectancy < 45:
        return 1
    if lifeexpectancy >= 45 and lifeexpectancy < 55:
        return 2
    if lifeexpectancy >= 55 and lifeexpectancy < 65:
        return 3
    if lifeexpectancy >= 65 and lifeexpectancy < 75:
        return 4
    if lifeexpectancy >= 75:
        return 5

data['LIFEGRP'] = data['lifeexpectancy'].apply(LIFE)
data['LIFEGRP'] = data['LIFEGRP'].astype('category')

# income per person in dollars -> 4 ordered groups
def INCOME(income):
    if income < 3000:
        return 1
    if income >= 3000 and income < 7000:
        return 2
    if income >= 7000 and income < 10000:
        return 3
    if income >= 10000:
        return 4

data['INCOMEGRP'] = data['incomeperperson'].apply(INCOME)
data['INCOMEGRP'] = data['INCOMEGRP'].astype('category')

# alcohol consumption in litres per person -> 5 ordered groups
def ALCOHOL(alcohol):
    if alcohol < 3:
        return 1
    if alcohol >= 3 and alcohol < 6:
        return 2
    if alcohol >= 6 and alcohol < 9:
        return 3
    if alcohol >= 9 and alcohol < 12:
        return 4
    if alcohol >= 12:
        return 5

data['ALCGRP'] = data['alcconsumption'].apply(ALCOHOL)
data['ALCGRP'] = data['ALCGRP'].astype('category')

# normalize=True prints proportions, as shown in the result below
print('Income per person - 4 categories - quartiles')
c1 = data['INCOMEGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c1)

print('Life expectancy - 4 categories - quartiles')
c2 = data['LIFEGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c2)

print('Alcohol consumption - 4 categories - quartiles')
c3 = data['ALCGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c3)
RESULT:
Income per person - 4 categories - quartiles
1    0.573099
2    0.169591
3    0.040936
4    0.216374
Name: INCOMEGRP, dtype: float64
Life expectancy - 4 categories - quartiles
2    0.128655
3    0.146199
4    0.415205
5    0.309942
Name: LIFEGRP, dtype: float64
Alcohol consumption - 4 categories - quartiles
1    0.257310
2    0.245614
3    0.181287
4    0.134503
5    0.181287
Name: ALCGRP, dtype: float64
Rows with NA values were dropped, and the variables were grouped into categories.
57% of the countries have a GDP of less than $3000 per person.
In around 70% of the countries, people have a life expectancy of 65 years or more.
Half of the countries have an alcohol consumption level of less than 6 litres per person.
Running Your First Program
import numpy as np
import pandas as pd

data = pd.read_csv('gapminder.csv', low_memory=False)

# setting variables you will be working with to numeric
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')

# life expectancy in years -> 5 ordered groups
def LIFE(lifeexpectancy):
    if lifeexpectancy < 45:
        return 1
    if lifeexpectancy >= 45 and lifeexpectancy < 55:
        return 2
    if lifeexpectancy >= 55 and lifeexpectancy < 65:
        return 3
    if lifeexpectancy >= 65 and lifeexpectancy < 75:
        return 4
    if lifeexpectancy >= 75:
        return 5

data['LIFEGRP'] = data['lifeexpectancy'].apply(LIFE)
data['LIFEGRP'] = data['LIFEGRP'].astype('category')

# income per person in dollars -> 4 ordered groups
def INCOME(income):
    if income < 3000:
        return 1
    if income >= 3000 and income < 7000:
        return 2
    if income >= 7000 and income < 10000:
        return 3
    if income >= 10000:
        return 4

data['INCOMEGRP'] = data['incomeperperson'].apply(INCOME)
data['INCOMEGRP'] = data['INCOMEGRP'].astype('category')

# alcohol consumption in litres per person -> 5 ordered groups
def ALCOHOL(alcohol):
    if alcohol < 3:
        return 1
    if alcohol >= 3 and alcohol < 6:
        return 2
    if alcohol >= 6 and alcohol < 9:
        return 3
    if alcohol >= 9 and alcohol < 12:
        return 4
    if alcohol >= 12:
        return 5

data['ALCGRP'] = data['alcconsumption'].apply(ALCOHOL)
data['ALCGRP'] = data['ALCGRP'].astype('category')

print('Income per person - 4 categories - quartiles')
c1 = data['INCOMEGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c1)

print('Life expectancy - 4 categories - quartiles')
c2 = data['LIFEGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c2)

print('Alcohol consumption - 4 categories - quartiles')
c3 = data['ALCGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c3)
RESULT:
Income per person - 4 categories - quantiles
1.0    0.478873
2.0    0.150235
3.0    0.042254
4.0    0.220657
NaN    0.107981
Name: INCOMEGRP, dtype: float64
Life expectancy - 4 categories - quantiles
2.0    0.112676
3.0    0.122066
4.0    0.356808
5.0    0.305164
NaN    0.103286
Name: LIFEGRP, dtype: float64
Alcohol consumption - 4 categories - quantiles
1.0    0.234742
2.0    0.206573
3.0    0.164319
4.0    0.122066
5.0    0.150235
NaN    0.122066
Name: ALCGRP, dtype: float64
There are no categorical variables in the Gapminder dataset. Therefore, some variables were grouped into new categorical variables.
Almost half of the countries have a GDP in level 1, which is less than $3000 per person. About 10% of the countries have no record of their GDP.
In around 65% of the countries, people have a life expectancy of 65 years or more. About 10% of the countries have no record of their population's life expectancy.
About 12% of the countries have no record of alcohol consumption.
Getting Your Research Project Started
After looking through the codebook for the GapMinder study, I have decided that I am particularly interested in life expectancy. So for now I will include the variable "lifeexpectancy" in my personal codebook.
While life expectancy is a good starting point, I need to determine what it is about life expectancy that I am interested in. I decide that I am most interested in exploring the association between level of GDP and life expectancy. I add to my codebook "incomeperperson", reflecting GDP levels. The research question regarding these two variables is "Is life expectancy associated with the level of Gross Domestic Product per capita?".
Many theories have been proposed to explain how income inequalities in living standards are associated with health differences within countries. Researchers claim that life expectancy is one of the indicators of economic development (Income distribution and life expectancy: a critical appraisal. K. Judge, BMJ. 1995 Nov 11; 311(7015): 1282–1287). Citizens of rich countries can expect to live for many decades more than those of poor countries. Although they also mention other factors that affect life expectancy, such as dietary influences and cultural factors, these factors are beyond the scope of this research. We will focus on finding and quantifying the relationship between income levels and life expectancy. The hypothesis stated in this analysis is that higher income leads to longer life expectancy.