Running a k-means Cluster Analysis
A k-means cluster analysis was conducted to identify underlying subgroups of countries based on their similarity on 8 variables describing country-level characteristics. Clustering variables included 'incomeperperson', 'urbanrate', 'alcconsumption', 'breastcancerper100th', 'hivrate', 'suicideper100th', 'employrate' and 'lifeexpectancy'. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
A series of k-means cluster analyses was conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret. The elbow curve was inconclusive, suggesting that the 2-, 3- and 4-cluster solutions might be interpreted. The results below are for the 4-cluster solution.
Canonical discriminant analysis was used to reduce the 8 clustering variables to 2 canonical variables that accounted for most of the variance in the clustering variables. A scatterplot of the two canonical variables by cluster indicated that the observations in clusters 1 and 4 were densely packed, with relatively low within-cluster variance, and did not overlap very much with the other clusters. Observations in clusters 2 and 3 were more spread out than those in the other clusters, showing high within-cluster variance. This plot suggests that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 4 clusters.
The means on the clustering variables showed that, compared to the other clusters, countries in cluster 3 had the highest levels of income, urban rate and life expectancy. Cluster 2 had the highest HIV rate and the lowest means on the other clustering variables.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on life expectancy. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on life expectancy (p=5.10e-86). The Tukey post hoc comparisons showed significant differences between clusters on life expectancy.
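The standardization step can be sketched in a few lines. This is a minimal illustration using only the standard library; the `urban_rate` values are hypothetical and the population standard deviation is an assumption (a library scaler may use a different convention).

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population SD (assumed convention)
    return [(x - m) / s for x in values]

# Hypothetical urban-rate values for five countries (illustrative only).
urban_rate = [30.0, 45.0, 60.0, 75.0, 90.0]
z_scores = standardize(urban_rate)

print([round(z, 3) for z in z_scores])
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```

After this transformation every clustering variable contributes on the same scale, so no single variable dominates the Euclidean distances.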
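The idea behind the elbow curve can be sketched with a small, dependency-free k-means. This is not the code used for the post: the toy `points`, the deterministic choice of starting centroids and the use of average distance to the nearest centroid as the plotted quantity are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; starting centroids are k evenly spaced input points."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids

def average_distance(points, centroids):
    """Mean distance from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

# Hypothetical standardized observations forming three loose groups.
points = [(-2.0, -2.1), (-1.8, -1.9), (-2.2, -2.0),
          (0.0, 0.1), (0.2, -0.1), (-0.1, 0.0),
          (2.0, 2.1), (1.9, 1.8), (2.1, 2.0)]

elbow = {k: average_distance(points, kmeans(points, k)) for k in range(1, 6)}
for k, distance in elbow.items():
    print(k, round(distance, 3))
```

Plotting these values against k gives the elbow curve; the bend marks the point where adding clusters stops reducing the distance by much.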
Running a Lasso Regression Analysis
A lasso regression analysis was conducted to identify a subset of variables from a pool of 7 quantitative predictor variables that best predicted a quantitative response variable measuring life expectancy. Predictor variables include 'incomeperperson', 'urbanrate','alcconsumption', 'breastcancerper100th', 'hivrate', 'suicideper100th', 'employrate'. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
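The 70/30 split and the 10-fold cross-validation scheme can be sketched as follows. The helper names `train_test_split` and `k_fold_indices` are hypothetical standard-library stand-ins written for this illustration, and the 100 integer "rows" are placeholder observations.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and cut off the test fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_rows, k=10):
    """Yield (training, validation) index lists; each row is validated exactly once."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation

rows = list(range(100))  # stand-in for 100 observations (hypothetical)
train, test = train_test_split(rows)
folds = list(k_fold_indices(len(train), k=10))

print(len(train), len(test), len(folds))
```

The model is fitted on each training part and scored on the matching validation part; the mean squared error averaged over the folds is what guides the choice of predictors, while the held-out test set is only used at the end.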
Of the 7 predictor variables, 6 were retained in the selected model. During the estimation process, income, urban rate and HIV rate were most strongly associated with life expectancy. These 6 variables accounted for 70% of the variance in the life expectancy response variable.
Running a Random Forest
Random forest analysis was performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable. The following explanatory variables were included as possible contributors to a random forest evaluating life expectancy: 'incomeperperson', 'urbanrate', 'alcconsumption', 'breastcancerper100th', 'hivrate', 'suicideper100th', 'employrate'.
The explanatory variables with the highest relative importance scores were income per person and urban rate. The accuracy of the random forest was 93%. Growing multiple trees rather than a single tree added much to the overall accuracy of the model, suggesting that interpreting a random forest rather than a single tree may be appropriate.
Running a Classification Tree
Confusion matrix: [[23  5] [ 4 25]]
Accuracy: 0.842105263158
The response variable, life expectancy, is quantitative, so it was binned into two groups: above-average life expectancy (1) and below-average life expectancy (0).
The following explanatory variables were included as possible contributors to a classification tree model evaluating life expectancy: 'incomeperperson', 'urbanrate', 'alcconsumption', 'breastcancerper100th', 'hivrate', 'suicideper100th', 'employrate'.
Income per person was the first variable to separate the sample into two subgroups. Countries with an income per person below 1274.51 were more likely to have below-average life expectancy than countries with an income per person above 1274.51.
Among the countries with an income per person below 1274.51, a further subdivision was made on income again. Countries with an income per person below 830.9976 were more likely to have below-average life expectancy.
Among the countries with an income per person above 1274.51, a further subdivision was made on HIV rate. Countries with an HIV rate below 4.05 were more likely to have above-average life expectancy.
The total model classified 84% of the sample correctly: 85% of countries with above-average life expectancy (sensitivity) and 83% of countries with below-average life expectancy (specificity).
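The binning step can be sketched as below. The `life_expectancy` values are hypothetical, and treating a value exactly equal to the mean as "below average" is an assumption.

```python
from statistics import mean

def bin_at_mean(values):
    """Code each value as 1 (above the mean) or 0 (at or below the mean)."""
    m = mean(values)
    return [1 if v > m else 0 for v in values]

# Hypothetical life expectancies in years (illustrative only).
life_expectancy = [48.7, 57.4, 66.0, 73.1, 76.9, 81.0]
groups = bin_at_mean(life_expectancy)

print(round(mean(life_expectancy), 2), groups)
```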
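The summary rates can be recomputed from the confusion matrix printed at the top of the post. The sketch below assumes rows hold the actual classes and columns the predicted classes, with class 0 first; the per-class rates depend on that layout, so under the transposed convention they would differ slightly, while the overall accuracy is the same either way.

```python
# Confusion matrix from the post; assumed layout: rows = actual, columns = predicted.
confusion = [[23, 5],
             [4, 25]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all countries classified correctly
sensitivity = tp / (tp + fn)      # share of actual class-1 countries found
specificity = tn / (tn + fp)      # share of actual class-0 countries found

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```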
Test a Logistic Regression Model
Both the explanatory variables and the response variable were binned into two groups based on their mean values, so each variable has two groups: above average (1) and below average (0).
After adjusting for potential confounding factors (urban rate and alcohol consumption), the odds of having above-average life expectancy were more than 23 times higher for countries with above-average income per person than for countries with below-average income per person (OR=23.51, 95% CI = 3.01-183.76, p=.0003). Urban rate was also significantly associated with life expectancy, such that countries with above-average urban rate were more likely to have above-average life expectancy (OR=4.56, 95% CI=2.10-9.91, p=.0001). No confounding variable was identified. The p value for alcohol consumption was 0.048, which means alcohol consumption was also significantly associated with the life expectancy group.
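The link between a logistic regression coefficient, its odds ratio and the confidence interval can be illustrated with the urban rate figures. This sketch works backwards from the reported OR and interval, assuming the usual Wald interval exp(beta ± 1.96·SE) and a two-sided normal p value; the coefficient and standard error it recovers are therefore approximations, not values taken from the model output.

```python
import math

# Reported for urban rate in the post.
odds_ratio, ci_low, ci_high = 4.56, 2.10, 9.91

beta = math.log(odds_ratio)                                      # log-odds coefficient
std_err = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)    # implied standard error
z = beta / std_err
p_value = math.erfc(abs(z) / math.sqrt(2))                       # two-sided normal p value

# Rebuilding the interval from beta and SE should give back the reported bounds.
rebuilt_ci = (math.exp(beta - 1.96 * std_err), math.exp(beta + 1.96 * std_err))

print(round(beta, 4), round(std_err, 4), round(z, 2))
print(p_value, [round(b, 2) for b in rebuilt_ci])
```

An odds ratio above 1 with an interval that excludes 1 corresponds to a positive coefficient that differs significantly from zero.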
Test a Multiple Regression Model
After adjusting for potential confounding factors (urban rate and alcohol consumption), income per person (Beta=0.0003, p=.0001) was significantly and positively associated with life expectancy. Urban rate was also significantly associated with life expectancy, such that countries with a higher urban rate had longer life expectancy (Beta=0.1619, p=.0001).
The Q-Q plot shows that the residuals do not follow a perfectly normal distribution, meaning that there may be other explanatory variables that need to be considered.
The residual plot shows that the absolute values of the residuals are noticeably larger at lower values of income per person. The model does not predict life expectancy well for countries with lower income per person, which suggests a curvilinear association.
The partial residual plot shows that the residuals are spread in a random pattern around the partial regression line. The association between income per person and life expectancy is weak.
The leverage plot shows that the outliers have very small leverage, so they do not have undue influence on the estimation of the regression model.
Test a Basic Linear Regression Model
Income per person was centred by subtracting its mean; the centred variable has a mean of -1.1006952924865552e-12, which is effectively zero.
The results of the linear regression model indicated that income per person (Beta=0.0006, p=1.07e-18) was significantly and positively associated with life expectancy.
The R squared value is 0.362, which means only 36.2% of the variability in life expectancy is explained by this regression model.
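Centring can be sketched as follows. The `income` values are hypothetical; the point is only that the mean of a centred variable is zero up to floating-point rounding, which is why a value such as -1.1e-12 shows up instead of an exact 0.

```python
from statistics import fmean

# Hypothetical income-per-person values (illustrative only).
income = [103.78, 558.06, 2557.43, 8614.12, 39972.35]

centre = fmean(income)
income_centred = [x - centre for x in income]
residual_mean = fmean(income_centred)

print(centre)
print(residual_mean)  # zero, or a tiny rounding remainder
```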
Writing About Your Data
Sample
The sample comes from Gapminder, which aggregates data on GDP per capita, life expectancy, total employment rate and estimated HIV prevalence for a total of 215 areas.
The data set includes 192 UN members and 24 of the 51 other entities listed in Wikipedia's “List of countries”. The data analytic sample for this study included Gross National Income per capita and life expectancy for all 215 areas.
Procedure
Gapminder collected the data through data reporting. The data were downloaded in 2010 from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau's International Database, the United Nations Statistics Division, the World Bank, the Human Mortality Database and the UN Population Division's World Population Prospects.
Measures
Gross National Income per capita was collected from the United Nations Statistics Division for 206 areas. For the areas missing from the United Nations Statistics Division data, figures came from Landsbanki Foroya and the World Bank. The variable measures Gross Domestic Product per capita, ranging from $103 to $105147. Inflation, but not differences in the cost of living between countries, has been taken into account. For the current analysis, it was binned into 5 categories based on a quantile split.
For life expectancy, the two main sources were two large datasets: the Human Mortality Database and the UN Population Division's World Population Prospects. The variable measures life expectancy at birth (years), the average number of years a newborn child would live if current mortality patterns were to stay the same, and ranges from 47.8 to 83.4. For the current analysis, it was binned into 5 categories based on a quantile split.
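A quantile split into 5 categories can be sketched with the standard library. The helper `quintile_groups` is hypothetical, the `life` values are invented apart from the reported minimum and maximum, and the "inclusive" interpolation method with right-closed bins is an assumption about how the cut points are placed.

```python
from bisect import bisect_left
from statistics import quantiles

def quintile_groups(values):
    """Assign each value to group 1-5 using the 20/40/60/80th percentile cut points."""
    cuts = quantiles(values, n=5, method="inclusive")
    # bisect_left puts a value equal to a cut point into the lower group.
    return [bisect_left(cuts, v) + 1 for v in values]

# Hypothetical life expectancies spanning the reported range of 47.8 to 83.4.
life = [47.8, 52.0, 58.3, 61.5, 66.1, 69.4, 72.9, 75.6, 79.8, 83.4]
groups = quintile_groups(life)

print(groups)
```

Each group ends up with roughly one fifth of the observations, which keeps the category sizes balanced for the later ANOVA and chi-square tests.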
Testing a Potential Moderator
RESULT:
                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.658
Model:                            OLS   Adj. R-squared:                  0.649
Method:                 Least Squares   F-statistic:                     79.72
Date:                Thu, 01 Jun 2017   Prob (F-statistic):           1.28e-37
Time:                        20:58:12   Log-Likelihood:                -539.52
No. Observations:                 171   AIC:                             1089.
Df Residuals:                     166   BIC:                             1105.
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            56.2721      0.974     57.793      0.000        54.350    58.194
C(INCOMEGRP)[T.2]    10.4927      1.387      7.565      0.000         7.754    13.231
C(INCOMEGRP)[T.3]    15.2263      1.387     10.977      0.000        12.488    17.965
C(INCOMEGRP)[T.4]    16.6018      1.387     11.969      0.000        13.863    19.340
C(INCOMEGRP)[T.5]    23.6333      1.387     17.038      0.000        20.895    26.372
==============================================================================
Omnibus:                       54.461   Durbin-Watson:                   2.095
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              134.477
Skew:                          -1.361   Prob(JB):                     6.29e-30
Kurtosis:                       6.386   Cond. No.                         5.77
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for life expectancy by income group
           lifeexpectancy
INCOMEGRP
1               56.272057
2               66.764765
3               71.498353
4               72.873824
5               79.905382

                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.611
Model:                            OLS   Adj. R-squared:                  0.592
Method:                 Least Squares   F-statistic:                     31.78
Date:                Thu, 01 Jun 2017   Prob (F-statistic):           6.50e-16
Time:                        20:58:12   Log-Likelihood:                -272.79
No. Observations:                  86   AIC:                             555.6
Df Residuals:                      81   BIC:                             567.8
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            57.0086      1.124     50.723      0.000        54.772    59.245
C(INCOMEGRP)[T.2]     9.5140      1.654      5.751      0.000         6.222    12.806
C(INCOMEGRP)[T.3]    13.8798      1.829      7.590      0.000        10.241    17.518
C(INCOMEGRP)[T.4]    18.3197      2.191      8.362      0.000        13.960    22.679
C(INCOMEGRP)[T.5]    20.6624      2.513      8.222      0.000        15.662    25.663
==============================================================================
Omnibus:                       13.403   Durbin-Watson:                   2.020
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               17.061
Skew:                          -0.739   Prob(JB):                     0.000197
Kurtosis:                       4.605   Cond. No.                         5.08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.631
Model:                            OLS   Adj. R-squared:                  0.601
Method:                 Least Squares   F-statistic:                     20.96
Date:                Thu, 01 Jun 2017   Prob (F-statistic):           4.03e-10
Time:                        20:58:12   Log-Likelihood:                -173.58
No. Observations:                  54   AIC:                             357.2
Df Residuals:                      49   BIC:                             367.1
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            53.4570      2.828     18.906      0.000        47.775    59.139
C(INCOMEGRP)[T.2]    11.0237      3.829      2.879      0.006         3.330    18.717
C(INCOMEGRP)[T.3]    19.8335      3.365      5.893      0.000        13.070    26.597
C(INCOMEGRP)[T.4]    16.4817      3.265      5.048      0.000         9.920    23.043
C(INCOMEGRP)[T.5]    27.2495      3.239      8.412      0.000        20.740    33.759
==============================================================================
Omnibus:                       20.199   Durbin-Watson:                   1.968
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               27.066
Skew:                          -1.391   Prob(JB):                     1.33e-06
Kurtosis:                       5.072   Cond. No.                         8.48
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.916
Model:                            OLS   Adj. R-squared:                  0.903
Method:                 Least Squares   F-statistic:                     70.57
Date:                Thu, 01 Jun 2017   Prob (F-statistic):           1.41e-13
Time:                        20:58:12   Log-Likelihood:                -66.555
No. Observations:                  31   AIC:                             143.1
Df Residuals:                      26   BIC:                             150.3
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            52.9975      1.599     33.145      0.000        49.711    56.284
C(INCOMEGRP)[T.2]    18.6460      1.958      9.521      0.000        14.621    22.671
C(INCOMEGRP)[T.3]    16.2735      1.892      8.601      0.000        12.385    20.162
C(INCOMEGRP)[T.4]    22.0411      1.768     12.469      0.000        18.407    25.675
C(INCOMEGRP)[T.5]    27.1645      1.738     15.627      0.000        23.591    30.738
==============================================================================
Omnibus:                        9.369   Durbin-Watson:                   1.593
Prob(Omnibus):                  0.009   Jarque-Bera (JB):                9.927
Skew:                          -0.772   Prob(JB):                      0.00699
Kurtosis:                       5.303   Cond. No.                         10.2
==============================================================================
When examining the association between average income and life expectancy, an Analysis of Variance (ANOVA) revealed that among countries with different levels of GDP per capita, those with higher GDP had longer life expectancy, with group means of 56.27, 66.76, 71.50, 72.87 and 79.91 (p=1.28e-37).
Alcohol consumption was then examined as a potential moderator. However, the p values were very small within each alcohol consumption group, and the direction of the association was the same in every group. We can therefore say that alcohol consumption does not moderate the relationship between life expectancy and income level.
RESULT
association between income per person and life expectancy (0.60151634019643963, 1.0653418935026235e-18)
The correlation coefficient is 0.60 with a very small p value, which means the association is positive and statistically significant.
r squared is 0.36, which means 36% of the variability in life expectancy can be explained by income per person.
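The step from the correlation coefficient to the share of explained variability is just squaring r. A short check using the coefficient printed above:

```python
# Pearson r as printed in the result above.
r = 0.60151634019643963

r_squared = r ** 2
print(round(r_squared, 3))
```

For a regression with a single explanatory variable this is the same quantity as the model's R squared, which is why it agrees with the 0.362 reported for the basic linear regression of life expectancy on income per person.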
Running a Chi-Square Test of Independence
RESULT
INCOMEGRP   1   2   3   4   5
LIFEGRP
1          25   6   1   3   0
2          10  17   6   1   0
3           0   5  18  10   1
4           0   6   9  16   3
5           0   0   0   4  30

INCOMEGRP         1         2         3         4         5
LIFEGRP
1          0.714286  0.176471  0.029412  0.088235  0.000000
2          0.285714  0.500000  0.176471  0.029412  0.000000
3          0.000000  0.147059  0.529412  0.294118  0.029412
4          0.000000  0.176471  0.264706  0.470588  0.088235
5          0.000000  0.000000  0.000000  0.117647  0.882353

chi-square value, p value, expected counts
(244.01884753901561, 8.7043813094460167e-43, 16, array([[ 7.16374269,  6.95906433,  6.95906433,  6.95906433,  6.95906433],
       [ 6.95906433,  6.76023392,  6.76023392,  6.76023392,  6.76023392],
       [ 6.95906433,  6.76023392,  6.76023392,  6.76023392,  6.76023392],
       [ 6.95906433,  6.76023392,  6.76023392,  6.76023392,  6.76023392],
       [ 6.95906433,  6.76023392,  6.76023392,  6.76023392,  6.76023392]]))

Comparison of dict_keys([1, 2])
COMP     1.0  2.0
LIFEGRP
1         25    6
2         10   17
3          0    5
4          0    6

COMP          1.0       2.0
LIFEGRP
1        0.714286  0.176471
2        0.285714  0.500000
3        0.000000  0.147059
4        0.000000  0.176471

chi-square value, p value, expected counts
(24.450618957260325, 2.011326232682634e-05, 3, array([[ 15.72463768,  15.27536232],
       [ 13.69565217,  13.30434783],
       [  2.53623188,   2.46376812],
       [  3.04347826,   2.95652174]]))

Comparison of dict_keys([1, 3])
COMP     1.0  3.0
LIFEGRP
1         25    1
2         10    6
3          0   18
4          0    9

COMP          1.0       3.0
LIFEGRP
1        0.714286  0.029412
2        0.285714  0.176471
3        0.000000  0.529412
4        0.000000  0.264706

chi-square value, p value, expected counts
(50.149886877828045, 7.4230108342485379e-11, 3, array([[ 13.1884058 ,  12.8115942 ],
       [  8.11594203,   7.88405797],
       [  9.13043478,   8.86956522],
       [  4.56521739,   4.43478261]]))

Comparison of dict_keys([1, 4])
COMP     1.0  4.0
LIFEGRP
1         25    3
2         10    1
3          0   10
4          0   16
5          0    4

COMP          1.0       4.0
LIFEGRP
1        0.714286  0.088235
2        0.285714  0.029412
3        0.000000  0.294118
4        0.000000  0.470588
5        0.000000  0.117647

chi-square value, p value, expected counts
(54.646335807050093, 3.853370589093216e-11, 4, array([[ 14.20289855,  13.79710145],
       [  5.57971014,   5.42028986],
       [  5.07246377,   4.92753623],
       [  8.11594203,   7.88405797],
       [  2.02898551,   1.97101449]]))

Comparison of dict_keys([1, 5])
COMP     1.0  5.0
LIFEGRP
1         25    0
2         10    0
3          0    1
4          0    3
5          0   30

COMP          1.0       5.0
LIFEGRP
1        0.714286  0.000000
2        0.285714  0.000000
3        0.000000  0.029412
4        0.000000  0.088235
5        0.000000  0.882353

chi-square value, p value, expected counts
(69.0, 3.6903599414292823e-14, 4, array([[ 12.68115942,  12.31884058],
       [  5.07246377,   4.92753623],
       [  0.50724638,   0.49275362],
       [  1.52173913,   1.47826087],
       [ 15.2173913 ,  14.7826087 ]]))

Comparison of dict_keys([2, 3])
COMP     2.0  3.0
LIFEGRP
1          6    1
2         17    6
3          5   18
4          6    9

COMP          2.0       3.0
LIFEGRP
1        0.176471  0.029412
2        0.500000  0.176471
3        0.147059  0.529412
4        0.176471  0.264706

chi-square value, p value, expected counts
(16.780124223602485, 0.00078427148742470207, 3, array([[  3.5,   3.5],
       [ 11.5,  11.5],
       [ 11.5,  11.5],
       [  7.5,   7.5]]))

Comparison of dict_keys([2, 4])
COMP     2.0  4.0
LIFEGRP
1          6    3
2         17    1
3          5   10
4          6   16
5          0    4

COMP          2.0       4.0
LIFEGRP
1        0.176471  0.088235
2        0.500000  0.029412
3        0.147059  0.294118
4        0.176471  0.470588
5        0.000000  0.117647

chi-square value, p value, expected counts
(25.434343434343432, 4.1140263895507357e-05, 4, array([[  4.5,   4.5],
       [  9. ,   9. ],
       [  7.5,   7.5],
       [ 11. ,  11. ],
       [  2. ,   2. ]]))

Comparison of dict_keys([2, 5])
COMP     2.0  5.0
LIFEGRP
1          6    0
2         17    0
3          5    1
4          6    3
5          0   30

COMP          2.0       5.0
LIFEGRP
1        0.176471  0.000000
2        0.500000  0.000000
3        0.147059  0.029412
4        0.176471  0.088235
5        0.000000  0.882353

chi-square value, p value, expected counts
(56.666666666666671, 1.453286023334173e-11, 4, array([[  3. ,   3. ],
       [  8.5,   8.5],
       [  3. ,   3. ],
       [  4.5,   4.5],
       [ 15. ,  15. ]]))

Comparison of dict_keys([3, 4])
COMP     3.0  4.0
LIFEGRP
1          1    3
2          6    1
3         18   10
4          9   16
5          0    4

COMP          3.0       4.0
LIFEGRP
1        0.029412  0.088235
2        0.176471  0.029412
3        0.529412  0.294118
4        0.264706  0.470588
5        0.000000  0.117647

chi-square value, p value, expected counts
(12.817142857142857, 0.01220470436944464, 4, array([[  2. ,   2. ],
       [  3.5,   3.5],
       [ 14. ,  14. ],
       [ 12.5,  12.5],
       [  2. ,   2. ]]))

Comparison of dict_keys([3, 5])
COMP     3.0  5.0
LIFEGRP
1          1    0
2          6    0
3         18    1
4          9    3
5          0   30

COMP          3.0       5.0
LIFEGRP
1        0.029412  0.000000
2        0.176471  0.000000
3        0.529412  0.029412
4        0.264706  0.088235
5        0.000000  0.882353

chi-square value, p value, expected counts
(55.210526315789473, 2.9351647730436071e-11, 4, array([[  0.5,   0.5],
       [  3. ,   3. ],
       [  9.5,   9.5],
       [  6. ,   6. ],
       [ 15. ,  15. ]]))

Comparison of dict_keys([4, 5])
COMP     4.0  5.0
LIFEGRP
1          3    0
2          1    0
3         10    1
4         16    3
5          4   30

COMP          4.0       5.0
LIFEGRP
1        0.088235  0.000000
2        0.029412  0.000000
3        0.294118  0.029412
4        0.470588  0.088235
5        0.117647  0.882353

chi-square value, p value, expected counts
(40.140726146918098, 4.0478469801893014e-08, 4, array([[  1.5,   1.5],
       [  0.5,   0.5],
       [  5.5,   5.5],
       [  9.5,   9.5],
       [ 17. ,  17. ]]))
Model Interpretation for Chi-Square Tests:
When examining the association between income level and life expectancy, a chi-square test of independence revealed that among countries with different levels of GDP per capita, those with higher GDP were more likely to have longer life expectancy (X2 = 244.02, df = 16, p = 8.70e-43).
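How the chi-square statistic is built from observed and expected counts can be sketched by hand. The function `chi_square` below is a hypothetical standard-library helper, not the routine that produced the output above; it applies no continuity correction. It is run on the overall table and on the income group 2 versus 3 comparison copied from the results.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Rows are life expectancy groups, columns are income groups (from the output above).
overall = [[25, 6, 1, 3, 0],
           [10, 17, 6, 1, 0],
           [0, 5, 18, 10, 1],
           [0, 6, 9, 16, 3],
           [0, 0, 0, 4, 30]]
groups_2_vs_3 = [[6, 1], [17, 6], [5, 18], [6, 9]]

stat_all, dof_all = chi_square(overall)
stat_23, dof_23 = chi_square(groups_2_vs_3)
print(stat_all, dof_all)
print(stat_23, dof_23)
```

The larger the gap between observed and expected counts, the larger the statistic, so a table dominated by its diagonal like this one yields a very large value.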
Model Interpretation for post hoc Chi-Square Test results:
A chi-square test of independence revealed that income level and life expectancy were significantly associated, X2 = 244.02, p = 8.70e-43. Post hoc pairwise comparisons of the life expectancy distributions by income category revealed that every income group differs significantly from the other groups, with higher income groups having longer life expectancy, except groups 3 and 4 (p = 0.0122), which are considered statistically similar because this p value does not fall below a Bonferroni-adjusted threshold of 0.05/10 = 0.005 for the ten comparisons.
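A common way to judge the ten pairwise tests is a Bonferroni adjustment, which divides the 0.05 significance level by the number of comparisons. Assuming that is the adjustment intended here, the check below applies it to the pairwise p values copied from the output above.

```python
# Pairwise p values from the post hoc chi-square output, keyed by income group pair.
p_values = {
    (1, 2): 2.011326232682634e-05,
    (1, 3): 7.4230108342485379e-11,
    (1, 4): 3.853370589093216e-11,
    (1, 5): 3.6903599414292823e-14,
    (2, 3): 0.00078427148742470207,
    (2, 4): 4.1140263895507357e-05,
    (2, 5): 1.453286023334173e-11,
    (3, 4): 0.01220470436944464,
    (3, 5): 2.9351647730436071e-11,
    (4, 5): 4.0478469801893014e-08,
}

threshold = 0.05 / len(p_values)  # Bonferroni-adjusted significance level
not_significant = [pair for pair, p in p_values.items() if p >= threshold]

print(threshold, not_significant)
```

Without the adjustment the group 3 versus 4 comparison would pass the usual 0.05 level, which is why the correction matters for the conclusion.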
Running an analysis of variance
RESULT
                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.658
Model:                            OLS   Adj. R-squared:                  0.649
Method:                 Least Squares   F-statistic:                     79.72
Date:                Wed, 31 May 2017   Prob (F-statistic):           1.28e-37
Time:                        21:24:32   Log-Likelihood:                -539.52
No. Observations:                 171   AIC:                             1089.
Df Residuals:                     166   BIC:                             1105.
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            56.2721      0.974     57.793      0.000        54.350    58.194
C(INCOMEGRP)[T.2]    10.4927      1.387      7.565      0.000         7.754    13.231
C(INCOMEGRP)[T.3]    15.2263      1.387     10.977      0.000        12.488    17.965
C(INCOMEGRP)[T.4]    16.6018      1.387     11.969      0.000        13.863    19.340
C(INCOMEGRP)[T.5]    23.6333      1.387     17.038      0.000        20.895    26.372
==============================================================================
Omnibus:                       54.461   Durbin-Watson:                   2.095
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              134.477
Skew:                          -1.361   Prob(JB):                     6.29e-30
Kurtosis:                       6.386   Cond. No.                         5.77
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for life expectancy by income group
           lifeexpectancy
INCOMEGRP
1               56.272057
2               66.764765
3               71.498353
4               72.873824
5               79.905382

standard deviations for life expectancy by income group
           lifeexpectancy
INCOMEGRP
1                6.269807
2                6.677341
3                5.062343
4                7.169844
5                2.188812

Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
  1      2    10.4927   6.6671 14.3183  True
  1      3    15.2263  11.4007 19.0519  True
  1      4    16.6018  12.7761 20.4274  True
  1      5    23.6333  19.8077  27.459  True
  2      3     4.7336   0.8803  8.5868  True
  2      4     6.1091   2.2558  9.9623  True
  2      5    13.1406   9.2874 16.9939  True
  3      4     1.3755  -2.4778  5.2287 False
  3      5     8.407    4.5538 12.2603  True
  4      5     7.0316   3.1783 10.8848  True
---------------------------------------------
Model Interpretation for ANOVA:
When examining the association between average income and life expectancy, an Analysis of Variance (ANOVA) revealed that among countries with different levels of GDP per capita, those with higher GDP had longer life expectancy, with group means of 56.27, 66.76, 71.50, 72.87 and 79.91 (F=79.72, p=1.28e-37).
Model Interpretation for post hoc ANOVA results:
ANOVA revealed that income level and life expectancy were significantly associated, F=79.72, p=1.28e-37. Post hoc comparisons of mean life expectancy by income category revealed that every income group has a significantly longer life expectancy than each lower group, except groups 3 and 4, which are statistically similar.
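The regression table and the group means describe the same thing. With treatment coding, the intercept is the mean of the reference group (income group 1) and each coefficient is the difference between a group's mean and that reference. The check below uses only numbers copied from the output above; the variable names are illustrative.

```python
# Values copied from the OLS table and the group means in the output above.
intercept = 56.2721
coefficients = {2: 10.4927, 3: 15.2263, 4: 16.6018, 5: 23.6333}
reported_means = {1: 56.272057, 2: 66.764765, 3: 71.498353, 4: 72.873824, 5: 79.905382}

# Rebuild each group mean as intercept + coefficient.
rebuilt_means = {1: intercept}
rebuilt_means.update({group: intercept + coef for group, coef in coefficients.items()})

# Tukey HSD interval for group 3 vs group 4: a span that contains 0 means
# the two means are not significantly different.
tukey_3_vs_4 = (-2.4778, 5.2287)
interval_contains_zero = tukey_3_vs_4[0] < 0 < tukey_3_vs_4[1]

for group in sorted(reported_means):
    print(group, round(rebuilt_means[group], 4), reported_means[group])
print(interval_contains_zero)
```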
Creating graphs for your data
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt

data = pd.read_csv('gapminder.csv', low_memory=False)

#plt.rcParams['figure.figsize'] = (10,5)

#setting variables you will be working with to numeric
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')

data.dropna(inplace=True)

def LIFE(lifeexpectancy):
    if lifeexpectancy < 45:
        return 1
    if lifeexpectancy >= 45 and lifeexpectancy < 55:
        return 2
    if lifeexpectancy >= 55 and lifeexpectancy < 65:
        return 3
    if lifeexpectancy >= 65 and lifeexpectancy < 75:
        return 4
    if lifeexpectancy >= 75:
        return 5

data['LIFEGRP'] = data['lifeexpectancy'].apply(LIFE)
data['LIFEGRP'] = data['LIFEGRP'].astype('category')

def INCOME(income):
    if income < 3000:
        return 1
    if income >= 3000 and income < 7000:
        return 2
    if income >= 7000 and income < 10000:
        return 3
    if income >= 10000:
        return 4

data['INCOMEGRP'] = data['incomeperperson'].apply(INCOME)
data['INCOMEGRP'] = data['INCOMEGRP'].astype('category')

def ALCOHOL(alcohol):
    if alcohol < 3:
        return 1
    if alcohol >= 3 and alcohol < 6:
        return 2
    if alcohol >= 6 and alcohol < 9:
        return 3
    if alcohol >= 9 and alcohol < 12:
        return 4
    if alcohol >= 12:
        return 5

data['ALCGRP'] = data['alcconsumption'].apply(ALCOHOL)
data['ALCGRP'] = data['ALCGRP'].astype('category')
# univariate count plots for the two grouped variables
plt.figure(figsize=(10,5))
seaborn.countplot(x="LIFEGRP", data=data)
plt.xlabel('Life Expectancy')
plt.ylabel('Frequency')

plt.figure(figsize=(10,5))
seaborn.countplot(x="INCOMEGRP", data=data)
plt.xlabel('INCOME')
plt.ylabel('Frequency')

# scatterplot with a fitted regression line Q->Q
plt.figure(figsize=(10,5))
scat4 = seaborn.regplot(x="incomeperperson", y="lifeexpectancy", data=data)
plt.xlabel('Income per Person')
plt.ylabel('Life expectancy')

# split income into five equal-sized groups (quintiles)
data['INCOMEGRP5'] = pd.qcut(data.incomeperperson, 5, labels=["1=20th%tile","2=40%tile","3=60%tile","4=80%tile", "5=100%tile"])

# bivariate box plot C->Q
# (factorplot was renamed catplot, and size renamed height, in seaborn 0.9;
#  catplot creates its own figure, so no plt.figure call is needed here)
seaborn.catplot(x='INCOMEGRP5', y='lifeexpectancy', data=data, kind="box", height=6, aspect=1.5)
plt.xlabel('income group')
plt.ylabel('life expectancy')
[Figure: count plot of life expectancy group (LIFEGRP)]
This graph is unimodal, with its highest peak in the 65 to 75 years category. It is skewed to the left: most countries fall in the higher life expectancy categories, with a tail toward the lower ones.
[Figure: count plot of income group (INCOMEGRP)]
This graph is bimodal, with its highest peak at the lowest category, which is less than $3000 per person, and a second, smaller peak at the highest category.
[Figure: scatterplot of life expectancy against income per person, with regression line]
The graph above plots the life expectancy of each country against its GDP per capita. There appears to be a positive relationship between the two variables, but it is not clearly linear. We then plot life expectancy for each income quintile group and can see that life expectancy does increase as income gets higher.
[Figure: box plot of life expectancy by income quintile group (INCOMEGRP5)]
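One way to put a number on how linear a scatterplot like the one above looks is the Pearson correlation coefficient. The following is a small sketch, not part of the original analysis: pearson_r is a hypothetical helper written with the standard library only, and the x and y values are invented toy numbers rather than the Gapminder columns.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


# Toy data (invented): a positive but imperfect linear relationship
toy_x = [1, 2, 3, 4]
toy_y = [1, 3, 2, 4]
r = pearson_r(toy_x, toy_y)
print(r)
```

Values of r near 1 or -1 indicate a strong linear relationship. A curved pattern can still give a fairly high r, so the coefficient complements the plot rather than replacing it.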
Making Data Management and Decisions
import numpy as np
import pandas as pd

data = pd.read_csv('gapminder.csv', low_memory=False)

#setting variables you will be working with to numeric
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')

# drop every row that has a missing value in any column
data.dropna(inplace=True)

def LIFE(lifeexpectancy):
    if lifeexpectancy < 45:
        return 1
    if lifeexpectancy >= 45 and lifeexpectancy < 55:
        return 2
    if lifeexpectancy >= 55 and lifeexpectancy < 65:
        return 3
    if lifeexpectancy >= 65 and lifeexpectancy < 75:
        return 4
    if lifeexpectancy >= 75:
        return 5

data['LIFEGRP'] = data['lifeexpectancy'].apply(LIFE)
data['LIFEGRP'] = data['LIFEGRP'].astype('category')

def INCOME(income):
    if income < 3000:
        return 1
    if income >= 3000 and income < 7000:
        return 2
    if income >= 7000 and income < 10000:
        return 3
    if income >= 10000:
        return 4

data['INCOMEGRP'] = data['incomeperperson'].apply(INCOME)
data['INCOMEGRP'] = data['INCOMEGRP'].astype('category')

def ALCOHOL(alcohol):
    if alcohol < 3:
        return 1
    if alcohol >= 3 and alcohol < 6:
        return 2
    if alcohol >= 6 and alcohol < 9:
        return 3
    if alcohol >= 9 and alcohol < 12:
        return 4
    if alcohol >= 12:
        return 5

data['ALCGRP'] = data['alcconsumption'].apply(ALCOHOL)
data['ALCGRP'] = data['ALCGRP'].astype('category')

# normalize=True reports proportions rather than raw counts, matching the output below
print('Income per person - 4 categories - quartiles')
c1 = data['INCOMEGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c1)

print('Life expectancy - 4 categories - quartiles')
c2 = data['LIFEGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c2)

print('Alcohol consumption - 4 categories - quartiles')
c3 = data['ALCGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c3)
RESULT:
Income per person - 4 categories - quartiles
1    0.573099
2    0.169591
3    0.040936
4    0.216374
Name: INCOMEGRP, dtype: float64
Life expectancy - 4 categories - quartiles
2    0.128655
3    0.146199
4    0.415205
5    0.309942
Name: LIFEGRP, dtype: float64
Alcohol consumption - 4 categories - quartiles
1    0.257310
2    0.245614
3    0.181287
4    0.134503
5    0.181287
Name: ALCGRP, dtype: float64
Rows with NA values are dropped, and the variables are grouped into categories.
About 57% of the countries have a GDP of less than $3000 per person.
In around 70% of the countries, people have a life expectancy of 65 years or more.
About half of the countries have an alcohol consumption level of less than 6 litres per person.
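The hand-written grouping functions above could also be expressed with pandas.cut. The sketch below is an alternative, not the code used for the results in this post: it assumes the same cut points as the LIFE function, uses right=False so each bin includes its lower edge and excludes its upper edge (matching the >= and < tests), and runs on a few invented values rather than the Gapminder file. The names life_bins, life_labels and toy_life are hypothetical.

```python
import pandas as pd

# Bin edges assumed from the LIFE function: <45, 45-55, 55-65, 65-75, >=75
life_bins = [float("-inf"), 45, 55, 65, 75, float("inf")]
life_labels = [1, 2, 3, 4, 5]

# Toy values (invented) chosen to sit on or near the bin edges
toy_life = pd.Series([44.9, 45.0, 64.9, 65.0, 75.0])

# right=False makes each interval closed on the left and open on the right
life_group = pd.cut(toy_life, bins=life_bins, labels=life_labels, right=False)
print(life_group.tolist())

# Proportion of observations in each group
print(life_group.value_counts(sort=False, normalize=True))
```

The result is already a categorical column, so no separate astype('category') call is needed, and missing input values stay missing instead of being assigned to a group.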
Running Your First Program
import numpy as np
import pandas as pd

data = pd.read_csv('gapminder.csv', low_memory=False)

#setting variables you will be working with to numeric
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')

def LIFE(lifeexpectancy):
    if lifeexpectancy < 45:
        return 1
    if lifeexpectancy >= 45 and lifeexpectancy < 55:
        return 2
    if lifeexpectancy >= 55 and lifeexpectancy < 65:
        return 3
    if lifeexpectancy >= 65 and lifeexpectancy < 75:
        return 4
    if lifeexpectancy >= 75:
        return 5

data['LIFEGRP'] = data['lifeexpectancy'].apply(LIFE)
data['LIFEGRP'] = data['LIFEGRP'].astype('category')

def INCOME(income):
    if income < 3000:
        return 1
    if income >= 3000 and income < 7000:
        return 2
    if income >= 7000 and income < 10000:
        return 3
    if income >= 10000:
        return 4

data['INCOMEGRP'] = data['incomeperperson'].apply(INCOME)
data['INCOMEGRP'] = data['INCOMEGRP'].astype('category')

def ALCOHOL(alcohol):
    if alcohol < 3:
        return 1
    if alcohol >= 3 and alcohol < 6:
        return 2
    if alcohol >= 6 and alcohol < 9:
        return 3
    if alcohol >= 9 and alcohol < 12:
        return 4
    if alcohol >= 12:
        return 5

data['ALCGRP'] = data['alcconsumption'].apply(ALCOHOL)
data['ALCGRP'] = data['ALCGRP'].astype('category')

print('Income per person - 4 categories - quartiles')
c1 = data['INCOMEGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c1)

print('Life expectancy - 4 categories - quartiles')
c2 = data['LIFEGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c2)

print('Alcohol consumption - 4 categories - quartiles')
c3 = data['ALCGRP'].value_counts(sort=False, dropna=False, normalize=True)
print(c3)
RESULT:
Income per person - 4 categories - quantiles
1.0    0.478873
2.0    0.150235
3.0    0.042254
4.0    0.220657
NaN     0.107981
Name: INCOMEGRP, dtype: float64
Life expectancy - 4 categories - quantiles
2.0    0.112676
3.0    0.122066
4.0    0.356808
5.0    0.305164
NaN     0.103286
Name: LIFEGRP, dtype: float64
Alcohol consumption - 4 categories - quantiles
1.0    0.234742
2.0    0.206573
3.0    0.164319
4.0    0.122066
5.0    0.150235
NaN     0.122066
Name: ALCGRP, dtype: float64
None of the variables chosen from the Gapminder dataset is categorical. Therefore, they are grouped into new categorical variables.
Almost half of the countries (48%) fall in GDP level 1, which is less than $3000 per person. About 10% of the countries have no record of their GDP.
In around 65% of the countries, people have a life expectancy of 65 years or more. About 10% of the countries have no record of their population's life expectancy.
About 12% of the countries have no record of alcohol consumption.
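Because the missing records are counted as their own category here, every share is computed over all countries, including those with no data. The sketch below illustrates how the shares shift once missing values are excluded. It is a standard-library illustration on invented toy groups, not the Gapminder data, and proportions is a hypothetical helper in which None stands in for a missing value.

```python
from collections import Counter


def proportions(values, dropna=False):
    """Share of each category; None marks a missing value (hypothetical helper)."""
    if dropna:
        # Exclude missing values before counting, so shares sum to 1 over known values
        values = [v for v in values if v is not None]
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Toy income groups (invented): ten countries, two of them with no record
toy_groups = [1, 1, 1, 1, 2, 2, 4, 4, None, None]

with_missing = proportions(toy_groups)
without_missing = proportions(toy_groups, dropna=True)
print(with_missing)
print(without_missing)
```

In the toy data, group 1 holds 40% of all countries but 50% of the countries with a known value, so it matters which denominator a reported percentage uses.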
Getting Your Research Project Started
After looking through the codebook for the GapMinder study, I have decided that I am particularly interested in life expectancy. So for now I will include the variable "lifeexpectancy" in my personal codebook.
While life expectancy is a good starting point, I need to determine what it is about life expectancy that I am interested in. I decide that I am most interested in exploring the association between level of GDP and life expectancy. I add to my codebook "incomeperperson", reflecting GDP levels. The research question regarding these two variables is: "Is life expectancy associated with the level of Gross Domestic Product per capita?"
Many theories have been proposed to explain how income inequalities in living standards are associated with health differences within countries. Researchers claim that life expectancy is one of the indicators of economic development (K. Judge, "Income distribution and life expectancy: a critical appraisal", BMJ. 1995 Nov 11; 311(7015): 1282–1287). Citizens of rich countries can expect to live for many decades more than those of poor countries. Although the researchers also mention other factors that affect life expectancy, such as dietary influences and cultural factors, these are beyond the scope of this research. We will focus on finding and quantifying the relationship between income level and life expectancy. The hypothesis stated in this analysis is that higher income leads to longer life expectancy.