milindpkshirsagar - Tumblr blog

milindpkshirsagar · 4 years ago


Peer-graded Assignment: Running a k-means Cluster Analysis

1) Python script

2) Output of the Script

3) Interpretation

A k-means cluster analysis was conducted to identify underlying subgroups of countries based on the similarity of their values on 13 variables that represent characteristics that could have an impact on life expectancy ('Alcohol consumption', 'Armed forces rate', 'breast cancer per 100 persons', 'CO2 emissions', 'Female employment rate', 'HIV rate', 'Oil consumption per person', 'Polity score', 'Electricity use per person', 'Suicide rate per 100 persons', 'Total employment rate', 'Urbanization rate', and 'Income per person'). All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.

The data were not split into training and test sets, because the assignment does not require it and only 56 data points were available in the given data set. A series of k-means cluster analyses was therefore conducted on the full data set, specifying k = 1 to 9 clusters and using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
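The elbow procedure described above can be sketched as follows. This is a minimal illustration and not the script used for the assignment: it runs a plain NumPy implementation of Lloyd's k-means on synthetic data of the same shape (56 rows, 13 variables), and the helper names `standardize`, `assign`, `kmeans` and `r_square` are made up for the example.

```python
import numpy as np

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def assign(X, centroids):
    """Label each row with the index of its nearest centroid (Euclidean distance)."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, rng, n_iter=100):
    """Plain Lloyd's algorithm started from k randomly chosen rows."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centroids)
        # Recompute each centroid; keep the old one if a cluster became empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(X, centroids), centroids

def r_square(X, labels, centroids):
    """Share of the total variance accounted for by the clusters."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    within_ss = ((X - centroids[labels]) ** 2).sum()
    return 1.0 - within_ss / total_ss

rng = np.random.default_rng(0)
# Synthetic stand-in for 56 countries x 13 variables (NOT the gapminder data)
centers = rng.normal(scale=4.0, size=(3, 13))
raw = np.vstack([c + rng.normal(size=(19, 13)) for c in centers])[:56]
X = standardize(raw)

r2_by_k = {}
for k in range(1, 10):
    # Keep the best of several random starts for each k
    fits = [kmeans(X, k, rng) for _ in range(10)]
    r2_by_k[k] = max(r_square(X, labels, cents) for labels, cents in fits)

for k, r2 in r2_by_k.items():
    print(f"k={k}: r-square={r2:.3f}")
```

Plotting `r2_by_k` against k gives the elbow curve; the bend, where adding a cluster stops adding much r-square, suggests how many clusters to interpret.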

The elbow curve was inconclusive, suggesting that the 2- and 5-cluster solutions might be interpreted. The results here are for an interpretation of the 3-cluster solution.

Canonical discriminant analysis was used to reduce the 13 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, and did not overlap very much with the other clusters.

The means on the clustering variables showed that, compared to the other clusters, countries in cluster 1 had higher levels on the clustering variables. In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on 'life expectancy'. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on 'life expectancy' (F(2, 56)=14.36, p<.0001). The Tukey post hoc comparisons showed that all the clusters differed significantly from each other on 'life expectancy'.


Peer-graded Assignment: Running a Lasso Regression Analysis

1) Python script

2) Output of the Script

3) Interpretation of results

A lasso regression analysis was conducted to identify a subset of variables from a pool of 13 quantitative predictor variables ('Alcohol consumption', 'Armed forces rate', 'breast cancer per 100 persons', 'CO2 emissions', 'Female employment rate', 'HIV rate', 'Oil consumption per person', 'Polity score', 'Electricity use per person', 'Suicide rate per 100 persons', 'Total employment rate', 'Urbanization rate', and 'Income per person') that best predicted the quantitative response variable 'Life expectancy'. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

The data set of 56 countries was split into 'training' and 'testing' data sets in the ratio 70:30. The least angle regression algorithm with k=10-fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
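As a rough illustration of the 70:30 split described above, the sketch below shuffles row positions with NumPy and cuts them into two index arrays. The function name `train_test_indices` and the seed are assumptions made for the example; the assignment script itself is not reproduced here and may have used a library helper instead.

```python
import numpy as np

def train_test_indices(n_rows, test_share=0.3, seed=123):
    """Shuffle row positions and cut them into train and test index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_share))
    return shuffled[n_test:], shuffled[:n_test]

# 56 countries, as in the post
train_idx, test_idx = train_test_indices(56)
print(len(train_idx), len(test_idx))
```

With only 56 countries the test set holds 17 rows, so any test-set statistic is estimated from very few observations and can vary noticeably from one random split to another.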

From the output and the first figure (change in the regression coefficients) it can be seen that only 5 predictor variables kept non-zero regression coefficients and were therefore retained ('HIV rate': -2.165, 'Polity score': 0.33, 'Suicide per 100 persons': -0.374, 'Urbanization rate': 1.135 and 'Income per person': 2.99). The predictors removed by the lasso regression (those whose coefficients were shrunk to zero) are 'Alcohol consumption', 'Armed forces rate', 'breast cancer per 100 persons', 'CO2 emissions', 'Female employment rate', 'Oil consumption per person', 'Electricity use per person' and 'Total employment rate'.

The predictor variable with the strongest association with ‘life expectancy’ is ‘Income per person’, with a positive regression coefficient of 2.99. The other predictors with a positive association with ‘life expectancy’ are ‘Urbanization rate’ and ‘Polity score’. A plausible explanation is that higher income brings better access to medical facilities, which can raise ‘life expectancy’. Similarly, a high polity score (democracy score) may go with fewer conflicts and stronger protection of human values, which can lengthen life spans. The negative associations are intuitive: ‘HIV rate’ and ‘Suicide rate’ would be expected to lower ‘life expectancy’.

These 5 predictor variables accounted for 84.29% of the variance in the response variable ‘life expectancy’ in the training data set and 52.18% in the test data set.
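For reference, the "variance accounted for" figure is the R-squared statistic, 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below shows that calculation on made-up numbers; `r_squared` is an illustrative helper, not a function from the assignment script.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical life expectancy values, for illustration only
y = np.array([60.0, 70.0, 80.0])
perfect = r_squared(y, y)                      # predictions equal the data
baseline = r_squared(y, np.full(3, y.mean()))  # always predicting the mean
print(perfect, baseline)
```

A drop in R-squared from the training set to the test set, as reported above, is the usual sign that part of the training fit does not generalize.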


Peer-graded Assignment: Running a Random Forest

1) Python script

2) Output of the script

3) Interpretation

Random forest analysis was performed to evaluate the importance of 13 explanatory variables ('Alcohol consumption', 'Armed forces rate', 'breast cancer per 100 persons', 'CO2 emissions', 'Female employment rate', 'HIV rate', 'Oil consumption per person', 'Polity score', 'Electricity use per person', 'Suicide rate per 100 persons', 'Total employment rate', 'Urbanization rate', and 'Income per person') in predicting a binary, categorical response variable ('Life expectancy').

The explanatory variables with the highest relative importance scores for predicting 'Life expectancy' were ‘Income per person’ (0.14059446), ‘HIV rate’ (0.13162605), ‘Oil consumption per person’ (0.12815323) and ‘Urbanization rate’ (0.10183941). The accuracy of the random forest was 95.65%. Growing multiple trees rather than a single tree improved the accuracy up to about 9 trees; increasing the number of trees beyond 9 did not add anything to the accuracy.


Peer-graded Assignment: Running a Classification Tree

1) Python script

2) Output of the Python script

3) Decision tree

4) Interpretation of the Output

The data are taken from the GapMinder data set. The purpose is to categorize ‘life expectancy’ in a particular country based on ‘Income per person’ and ‘Urbanization rate’. The quantitative data for all three variables were converted to two categories (‘0’ and ‘1’).

Decision tree analysis was performed to test nonlinear relationships among two binary, categorical explanatory variables (‘Income per person’ and ‘Urbanization rate’) and a binary, categorical response variable (‘Life expectancy’). The data set of 175 countries was split into ‘training’ and ‘testing’ data sets in the ratio 60:40. The accuracy score for predicting the test data set was estimated to be 85.7%.

‘Urbanization rate’ was the first variable to separate the sample into two subgroups. The second separator was ‘Income per person’.

It can be seen from the decision tree that there are 65 countries (in the given data set) with urbanization rates below 50%, and 38 of these have higher life expectancy (above the average of 70 years). Among the countries with less than 50% urbanization, 28 countries in the lower income group have higher life expectancy, while only 10 countries in the higher income group belong to the higher ‘life expectancy’ group.

Of the countries with a higher urbanization rate (40 in total), 32 belong to the higher income group, and 31 of these have higher ‘life expectancy’. Of the countries with higher urbanization and lower income (8 in total), 7 have higher ‘life expectancy’.
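The counts quoted from the tree can be turned into shares with a few lines of arithmetic. The snippet below only restates the numbers given in the text; the dictionary layout and labels are illustrative.

```python
# (countries with higher life expectancy, countries in the node), as quoted above
nodes = {
    "urbanization < 50% (all incomes)": (38, 65),
    "higher urbanization, higher income": (31, 32),
    "higher urbanization, lower income": (7, 8),
}

shares = {name: high / total for name, (high, total) in nodes.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# The two urbanization branches together hold 65 + 40 = 105 countries,
# which is 60% of 175 and so matches the size of the training set.
training_size = 65 + 40
print(training_size)
```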


Peer-graded Assignment: Test a Logistic Regression Model

1) The summary of results

Null hypothesis: The life expectancy for a country is not associated with its urbanization rate.

After adjusting for a potential confounding factor (per capita income), the odds of having higher life expectancy were about 0 to 6% higher for countries with a higher urbanization rate than for countries with a lower urbanization rate (OR=1.03, 95% CI = 1.05-1.1, p=.01). Per capita income was also significantly associated with life expectancy, such that higher-income countries had higher life expectancy than lower-income countries (OR=1.0006, 95% CI = 1.0003-1.000877, p=0.0001).

The results allow the null hypothesis to be rejected, so it is likely that the life expectancy of a given country is associated with its urbanization rate. However, income per person is found to be a major confounder between these two variables: as income goes up, urbanization and life expectancy usually go up as well.

2) Python Script
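Odds ratios and their confidence intervals such as those quoted in the summary are obtained by exponentiating the logistic regression coefficient and the ends of its confidence interval. The sketch below shows the calculation with a hypothetical coefficient and standard error (not the values from the model output); `odds_ratio_with_ci` is an illustrative helper.

```python
import math

def odds_ratio_with_ci(beta, std_err, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a logit coefficient."""
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )

# Hypothetical coefficient and standard error, chosen only for illustration
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(beta=0.03, std_err=0.012)
print(f"OR={odds_ratio:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

By construction the odds ratio always lies inside its own interval, which gives a quick way to check reported values against the model output.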

3) The output from logistic regression model


Peer-graded Assignment: Test a Multiple Regression Model

1) Python program

2) Output of the Python program and Discussion

The regression analysis with a linear term shows a weak positive association between the explanatory variable ‘Polity score’ and the response variable ‘Life expectancy’ (R2=0.088, p=0.0001). The association can be seen in the plot as well.

The regression analysis with a quadratic term shows a moderate positive association between the explanatory variable ‘Polity score’ and the response variable ‘Life expectancy’ (R2=0.375, p<0.0001). The association can be seen in the plot as well.

The multiple regression analysis shows a good positive association between the explanatory variables ‘Polity score’ and ‘Income per person’ and the response variable ‘Life expectancy’ (R2=0.466, p<0.0001).

3) Regression diagnostic plots

All the regression diagnostic plots show some imperfection and deviation from normality. However, the standardized residuals are fairly evenly distributed on both sides of zero if the outlier is neglected.


Peer-graded Assignment: Test a Basic Linear Regression Model

1) Python Program

2) Mean for my centered explanatory variable (Income per person) =-3.934761288722879e-12 (almost zero)

3) Description of results

The scatter plot shows a moderate positive linear association between ‘Income per person’ and ‘Life expectancy’ (R2=0.362, p<0.001).

The mathematical expression can be written as:

Life expectancy = 70.4376 + 0.006 * Income per person (centered)
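Centering explains why the intercept is easy to read: when the explanatory variable has mean zero, the fitted intercept of a simple linear regression equals the mean of the response. The sketch below demonstrates this on synthetic data (not the gapminder values), using `numpy.polyfit` rather than the statsmodels call a course script would typically use.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic income and life expectancy values, for illustration only
income = rng.uniform(500, 40000, size=50)
life = 60 + 0.0005 * income + rng.normal(scale=3, size=50)

# Center the explanatory variable by subtracting its mean
income_centered = income - income.mean()

# Degree-1 fit returns [slope, intercept]
slope, intercept = np.polyfit(income_centered, life, deg=1)

print("mean of centered income is ~0:", abs(income_centered.mean()) < 1e-6)
print("intercept equals mean response:", np.isclose(intercept, life.mean()))
```

Read this way, the intercept of 70.4376 in the expression above is the expected life expectancy for a country with average income per person.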

4) Output of the Program


Measures

Per capita income was taken from World Bank statistics for the year 2010. Inflation, but not differences in the cost of living between countries, has been taken into account. Electricity consumption per person per year was taken from the IEA (International Energy Agency). The data give 2008 residential electricity consumption per person, in kilowatt-hours (kWh), and 2010 oil consumption per capita (tonnes per person per year). From these two parameters the net energy consumption per capita can be calculated in a common unit such as TOE (tonnes of oil equivalent), to investigate any association between energy consumption and income across countries.
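One possible way to combine the two measures into TOE is sketched below. It uses the standard IEA definition of 1 toe = 41.868 GJ = 11,630 kWh and assumes, as a simplification that the post does not state, that one tonne of oil consumed counts as one tonne of oil equivalent. The function name `net_energy_toe` is illustrative.

```python
KWH_PER_TOE = 11_630.0  # 1 toe = 41.868 GJ = 11,630 kWh (IEA definition)

def net_energy_toe(oil_tonnes_per_person, electricity_kwh_per_person):
    """Combine oil and electricity use into tonnes of oil equivalent per person.

    Assumption: one tonne of oil is counted as one tonne of oil equivalent.
    """
    return oil_tonnes_per_person + electricity_kwh_per_person / KWH_PER_TOE

# Hypothetical country: 1 tonne of oil and 11,630 kWh of electricity per person
example = net_energy_toe(1.0, 11_630.0)
print(example)
```

Adding the two figures may double count energy wherever electricity is generated by burning oil, so the sum is better read as a rough index than as an exact energy balance.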


Procedure

The different data sets were collected by leading organizations in their fields, such as the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank. The income data are fact-based statistics generated by the individual countries. GDP is calculated using any of three methods (the supply or production method, the income method, and the demand or expenditure method), and by definition the value of GDP should be identical irrespective of the method used. Energy consumption per person is calculated from the data published by the IEA (International Energy Agency) for per-person electricity and oil consumption.


Sample

Research question

Is there a statistically significant association between ‘Per capita Income’ and ‘Energy consumption per person’ across different countries?

Sample

The sample is taken from the “GapMinder” data set. GapMinder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank. Since its conception in 2005, GapMinder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. GapMinder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, giving 215 areas in total.


Testing a Potential Moderator (Income level) between ‘Polity Score’ and ‘Life expectancy’

1) Python script

2) Output of the Script and Interpretation

A)

Association between Polity score and Life Expectancy (for all income levels)

(0.3020631677809377, 0.00013354062808472477)

Across 155 countries at all income levels, a moderate association between ‘Polity score’ and ‘Life expectancy’ is observed from the scatter plot and the correlation coefficient r=0.30, with a very low p-value of 0.00013. However, this could be misleading in the presence of a potential moderator such as the income level of the countries.

B)

Association Between Polity Score and Life expectancy for LOW income countries

(0.07964002358397154, 0.43808688761650616)

For the 97 countries with LOW income level, no association between ‘Polity score’ and ‘Life expectancy’ is observed from the scatter plot. The correlation coefficient r=0.08 also shows a very weak correlation, with a very high p-value of 0.43. Hence it can be concluded that there is no association between ‘Polity score’ and ‘Life expectancy’ for LOW income level countries.

C)

Association Between Polity Score and Life expectancy for MIDDLE income countries

(0.21433792146661593, 0.2553824762097742)

For the 30 countries with MIDDLE income level, a very weak association between ‘Polity score’ and ‘Life expectancy’ is observed from the scatter plot. The correlation coefficient is r=0.21, with a high p-value of 0.25. Hence it can be concluded that there is no significant association between ‘Polity score’ and ‘Life expectancy’ for MIDDLE income level countries.

D)

Association Between Polity Score and Life expectancy for HIGH income countries

(0.6636797458117805, 0.00011804616038541663)

For the 28 countries with HIGH income level, a strong association between ‘Polity score’ and ‘Life expectancy’ is observed from the scatter plot. The correlation coefficient is r=0.66, which is considerably higher than in the other groups, with a very low p-value of 0.0001. Hence it can be concluded that there is a statistically significant association between ‘Polity score’ and ‘Life expectancy’ for HIGH income level countries.

Taken together, the analysis shows that “Income level” acts as a major moderator of the association between “Polity score” and “Life expectancy”.
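The subgroup comparison above can be sketched as a loop over income levels that computes a Pearson correlation within each group. The data below are synthetic, and the income cut-offs are an assumption borrowed from the World Bank thresholds used in the other posts on this blog; the original script for this post is not shown, so it may have differed.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 150
# Synthetic stand-in for the gapminder variables (NOT the real data)
df = pd.DataFrame({
    "income": rng.uniform(200, 40000, size=n),
    "polity": rng.integers(-10, 11, size=n).astype(float),
})
df["life"] = 65 + rng.normal(scale=5, size=n)

# Assumed cut-offs for LOW / MIDDLE / HIGH income levels
df["income_level"] = pd.cut(
    df["income"], bins=[0, 3995, 12375, 40000], labels=["LOW", "MIDDLE", "HIGH"]
)

results = {}
for level, group in df.groupby("income_level", observed=True):
    # Correlation between polity score and life expectancy within one income level
    r, p = pearsonr(group["polity"], group["life"])
    results[level] = (len(group), r, p)
    print(f"{level}: n={len(group)}, r={r:.2f}, p={p:.3f}")
```

A moderator is suggested when the within-group correlations differ clearly from one level to another, as they do in the results reported above.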


Peer-graded Assignment: Correlation Coefficients for dependency of 3 different variables on ‘Per capita Income’ of a country

1) Python script

2) Output of Python script

Association between Per capita Income and Electricity use (0.6536076842568538, 4.5973586072830653e-17)

Association Between Per capita Income and Life expectancy (0.6235764361228903, 2.9488174511561413e-15)

Association Between Per capita Income and Female employment rate (0.13500896170853055, 0.1271406725277556)

3) Interpretation of the results

An effort was made to study the dependence of three different variables, i.e. 1) Electricity use per person, 2) Life expectancy and 3) Female employment rate, on the explanatory variable ‘Per capita Income’ for 129 countries.

Null hypothesis: The dependent variable has no association with the explanatory variable.

Alternate hypothesis: The dependent variable has an association with the explanatory variable.

1) Association between ‘Per capita Income’ and ‘Electricity use per person’

The plot shows a good positive correlation between income and electricity use (r=0.65, p=4.6e-17), though there are a few outliers.

2) Association between ‘Per capita Income’ and Life expectancy

The plot shows a good positive correlation between income and life expectancy (r=0.62, p=2.95e-15). However, for very low income countries the life expectancy is even lower than that modeled by the fitted straight line.

3) Association between ‘Per capita Income’ and Female employment rate

The plot shows no significant correlation between income and female employment rate (r=0.135, p=0.127). This might be because in agriculture-based economies women are employed at similar rates, but that does not translate into a higher income.


Running a Chi-Square Test of Independence

1) Python Script

# -*- coding: utf-8 -*-

"""

@author: M. P. kshirsagar

"""

import numpy as np

import pandas as pd

import statsmodels.formula.api as smf

import statsmodels.stats.multicomp as multi

import scipy.stats  # chi2_contingency lives in the scipy.stats submodule

import seaborn

import matplotlib.pyplot as plt

data = pd.read_csv('gapminder.csv', low_memory=False)

data.columns

type(data)

print("Total numbe of rows")

print (len(data)) #number of observations (rows)

print("Total numbe of columns")

print (len(data.columns)) # number of variables (columns)

#%%

#setting variables to numeric

data['income'] = pd.to_numeric(data['incomeperperson'],errors='coerce')

data['polity'] = pd.to_numeric(data['polityscore'],errors='coerce')

#subset data for those countries with both variable values available

#(dropping missing value rows)

data= data.dropna()

#Retaining only required columns

data = data[['income','polity']]

#Converting quantitative variables to categorical variables

data['Income_Group'] = pd.cut(data['income'],bins=[0,3995,12375,40000], labels=[1,2,3])

data['Polity_Group'] = pd.cut(data['polity'],bins=[-10,-6,5,10], labels=['Auto','Ano', 'Demo'])

data

#%%

# contingency table of observed counts

ct1=pd.crosstab(data['Polity_Group'], data['Income_Group'])

print (ct1)

# column percentages

colsum=ct1.sum(axis=0)

colpct=ct1/colsum

print(colpct)

# chi-square

print ('chi-square value, p value, expected counts')

cs1= scipy.stats.chi2_contingency(ct1)

print (cs1)

# set variable types

data['Polity_Group'] = data['Polity_Group'].astype('category')

# new code for setting variables to numeric:

data['Income_Group'] = pd.to_numeric(data['Income_Group'], errors='coerce')

# bar chart of the mean income group within each polity group

# (catplot replaces factorplot, which was removed from recent seaborn releases)

seaborn.catplot(x='Polity_Group',y='Income_Group', data=data, kind="bar", ci=None)

plt.xlabel('Polity Score')

plt.ylabel('Income level')

#%%

#make a copy of my new subsetted data

data1 = data.copy()

#recoding values for Income_Group into a new variable

recode2 = {1: 1, 2:2}

data1['COMP1v2']= data1['Income_Group'].map(recode2)

# contingency table of observed counts

ct2=pd.crosstab(data1['Polity_Group'], data1['COMP1v2'])

print (ct2)

# column percentages

colsum=ct2.sum(axis=0)

colpct=ct2/colsum

print(colpct)

print ('chi-square value, p value, expected counts')

cs2= scipy.stats.chi2_contingency(ct2)

print (cs2)

#%%

#make a copy of my new subsetted data

data1 = data.copy()

#recoding values for Income_Group into a new variable

recode2 = {1: 1, 3:3}

data1['COMP1v2']= data1['Income_Group'].map(recode2)

# contingency table of observed counts

ct2=pd.crosstab(data1['Polity_Group'], data1['COMP1v2'])

print (ct2)

# column percentages

colsum=ct2.sum(axis=0)

colpct=ct2/colsum

print(colpct)

print ('chi-square value, p value, expected counts')

cs2= scipy.stats.chi2_contingency(ct2)

print (cs2)

#%%

#make a copy of my new subsetted data

data1 = data.copy()

#recoding values for Income_Group into a new variable

recode2 = {2:2, 3:3}

data1['COMP1v2']= data1['Income_Group'].map(recode2)

# contingency table of observed counts

ct2=pd.crosstab(data1['Polity_Group'], data1['COMP1v2'])

print (ct2)

# column percentages

colsum=ct2.sum(axis=0)

colpct=ct2/colsum

print(colpct)

print ('chi-square value, p value, expected counts')

cs2= scipy.stats.chi2_contingency(ct2)

print (cs2)

2) Output of the Program

Total numbe of rows
213
Total numbe of columns
16

Income_Group   1   2   3
Polity_Group
Auto          13   3   2
Ano           42   3   1
Demo          42  23  24

Income_Group         1         2         3
Polity_Group
Auto          0.134021  0.103448  0.074074
Ano           0.432990  0.103448  0.037037
Demo          0.432990  0.793103  0.888889

chi-square value, p value, expected counts
(26.644429815415215, 2.345555777530802e-05, 4, array([[11.41176471, 3.41176471, 3.17647059],
       [29.16339869, 8.71895425, 8.11764706],
       [56.4248366 , 16.86928105, 15.70588235]]))

COMP1v2       1.0  2.0
Polity_Group
Auto           13    3
Ano            42    3
Demo           42   23

COMP1v2            1.0       2.0
Polity_Group
Auto          0.134021  0.103448
Ano           0.432990  0.103448
Demo          0.432990  0.793103

chi-square value, p value, expected counts
(12.565113894282039, 0.0018686165268392632, 2, array([[12.31746032, 3.68253968],
       [34.64285714, 10.35714286],
       [50.03968254, 14.96031746]]))

COMP1v2       1.0  3.0
Polity_Group
Auto           13    2
Ano            42    1
Demo           42   24

COMP1v2            1.0       3.0
Polity_Group
Auto          0.134021  0.074074
Ano           0.432990  0.037037
Demo          0.432990  0.888889

chi-square value, p value, expected counts
(18.423976142253135, 9.983536575606077e-05, 2, array([[11.73387097, 3.26612903],
       [33.63709677, 9.36290323],
       [51.62903226, 14.37096774]]))

COMP1v2       2.0  3.0
Polity_Group
Auto            3    2
Ano             3    1
Demo           23   24

COMP1v2            2.0       3.0
Polity_Group
Auto          0.103448  0.074074
Ano           0.103448  0.037037
Demo          0.793103  0.888889

chi-square value, p value, expected counts
(1.1513165403114045, 0.5623345788741289, 2, array([[ 2.58928571, 2.41071429],
       [ 2.07142857, 1.92857143],
       [24.33928571, 22.66071429]]))

3) Interpretation of results

The data were taken from the ‘gapminder’ data set and grouped by per capita income and by polity score. The basic question investigated was: is there any association between the ‘Democracy score’ and the ‘Income level’ of different countries? As required by the assignment, the quantitative data for ‘Per capita income’ and ‘Democracy score’ were converted to categorical variables by using the "New World Bank country classifications by income level: 2019-2020" (for income levels) and the “Polity data series” (for ‘Democracy score’).

Interpretation for Chi-Square Tests:

When examining the association between the 'Income level' of a country (quantitative response converted to a categorical response) and its 'Polity score' (quantitative explanatory variable converted to a categorical explanatory variable), a chi-square test of independence revealed that, among my sample of 155 countries, those with a high 'polity score' (Democracy) were more likely to have higher income levels than those with a low 'polity score' (Autocracy), X2=26.64, 4 df, p=2.345e-05.

The df, or degrees of freedom, for a chi-square test of independence is (number of rows - 1) x (number of columns - 1). Here both the polity score and the income level have 3 levels, so df = (3-1) x (3-1) = 4, which is the value reported in the program output.
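The chi-square statistic and the degrees of freedom in the program output can be reproduced by hand from the observed counts. The sketch below recomputes them with NumPy from the 3 x 3 table printed above; it is a check on the arithmetic, not part of the original script.

```python
import numpy as np

# Observed counts from the output: rows Auto, Ano, Demo; columns income group 1, 2, 3
observed = np.array([
    [13, 3, 2],
    [42, 3, 1],
    [42, 23, 24],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count in each cell under independence: row total * column total / n
expected = row_totals * col_totals / n

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(round(chi2, 2), dof)  # 26.64 4
```

The third element of the tuple printed by `chi2_contingency` is this same degrees-of-freedom value, 4 for the full table and 2 for each 3 x 2 post hoc table.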

Interpretation for post hoc Chi-Square Test results:

Post hoc comparisons of ‘income levels’ by ‘polity score’ revealed that countries with higher ‘polity scores’ had significantly higher ‘income levels’ in the comparisons between income levels 1 (low) & 2 (middle) and 1 (low) & 3 (high). However, no statistically significant difference was observed in the ‘income levels’ based on ‘polity scores’ between countries with income levels 2 (middle) & 3 (high).
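When several pairwise chi-square tests are run, a common precaution is a Bonferroni adjustment, which divides the significance level by the number of comparisons. The post does not state whether an adjustment was applied, so the sketch below is only a check, using the three p-values from the output, that the conclusions hold under it.

```python
# p-values taken from the three post hoc chi-square outputs above
p_values = {
    "low vs middle": 0.0018686165268392632,
    "low vs high": 9.983536575606077e-05,
    "middle vs high": 0.5623345788741289,
}

alpha = 0.05
# Bonferroni adjustment: divide alpha by the number of comparisons
adjusted_alpha = alpha / len(p_values)

decisions = {pair: p < adjusted_alpha for pair, p in p_values.items()}
for pair, significant in decisions.items():
    print(f"{pair}: p={p_values[pair]:.4g}, significant={significant}")
```

With an adjusted threshold of about 0.017, the low vs middle and low vs high comparisons remain significant and middle vs high does not, which agrees with the interpretation above.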


Association between ‘Per capita income’ and ‘Life expectancy’

1) Python script

"""

@author: M. P. kshirsagar

"""

import numpy as np

import pandas as pd

import statsmodels.formula.api as smf

import statsmodels.stats.multicomp as multi

data = pd.read_csv('gapminder.csv', low_memory=False)

data.columns

type(data)

print("Total numbe of rows")

print (len(data)) #number of observations (rows)

print("Total numbe of columns")

print (len(data.columns)) # number of variables (columns)

#%%

#setting variables to numeric

data['income'] = pd.to_numeric(data['incomeperperson'],errors='coerce')

data['life'] = pd.to_numeric(data['lifeexpectancy'],errors='coerce')

#subset data for those countries with both variable values available

#(dropping missing value rows)

data= data.dropna()

#Retaining only required columns

data = data[['income','life']]

#Converting quantitative variable to categorical variable

data['Group'] = pd.cut(data['income'],bins=[0,1026,3995,12375,40000], labels=[1,2,3,4])

# using ols function for calculating the F-statistic and associated p value

model1 = smf.ols(formula='life ~ C(Group)', data=data)

results1 = model1.fit()

print (results1.summary())

#%%

#Grouping data by income groups as per the "New World Bank country classifications by income level: 2019-2020"

categorized=data.groupby(pd.cut(data['income'], bins=[0,1026,3995,12375,40000]))['life'].agg(['mean', 'std', 'size'])

#adding category column as per the "New World Bank country classifications by income level: 2019-2020"

categorized['Group'] = ['Low income', 'Lower-middle income', 'Upper-middle income', 'High income']

cols= ['Group','mean','std','size']

categorized=categorized[cols]

categorized=categorized.dropna()

print(categorized)

#%%

#Using a multiple comparison test to compare mean life expectancy in different income groups

from statsmodels.stats.multicomp import (pairwise_tukeyhsd, MultiComparison)

data=data.dropna()

mod = MultiComparison(data['life'], data['Group'])

print(mod.tukeyhsd())

2) Output of the Program

Total numbe of rows
213
Total numbe of columns
16

                            OLS Regression Results
==============================================================================
Dep. Variable:                   life   R-squared:                       0.623
Model:                            OLS   Adj. R-squared:                  0.617
Method:                 Least Squares   F-statistic:                     94.24
Date:                Sat, 19 Sep 2020   Prob (F-statistic):           4.84e-36
Time:                        12:54:33   Log-Likelihood:                -560.52
No. Observations:                 175   AIC:                             1129.
Df Residuals:                     171   BIC:                             1142.
Df Model:                           3
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        59.1590      0.827     71.507      0.000      57.526      60.792
C(Group)[T.2]    11.2720      1.176      9.588      0.000       8.951      13.593
C(Group)[T.3]    14.5697      1.301     11.200      0.000      12.002      17.137
C(Group)[T.4]    21.0527      1.323     15.908      0.000      18.440      23.665
==============================================================================
Omnibus:                       43.476   Durbin-Watson:                   1.925
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               93.762
Skew:                          -1.115   Prob(JB):                     4.36e-21
Kurtosis:                       5.809   Cond. No.                         4.46
==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

income (US$)                  Group       mean       std  size
(0, 1026]                Low income  59.159000  7.421381    53
(1026, 3995]    Lower-middle income  70.430981  6.097336    52
(3995, 12375]   Upper-middle income  73.728722  6.156826    36
(12375, 40000]          High income  80.211706  1.878559    34

Multiple Comparison of Means - Tukey HSD, FWER=0.05
======================================================
group1 group2 meandiff  p-adj    lower    upper reject
------------------------------------------------------
     1      2   11.272  0.001   8.2217  14.3223   True
     1      3  14.5697  0.001  11.1946  17.9449   True
     1      4  21.0527  0.001   17.619  24.4864   True
     2      3   3.2977 0.0597  -0.0905    6.686  False
     2      4   9.7807  0.001   6.3341  13.2273   True
     3      4    6.483  0.001   2.7458  10.2202   True
------------------------------------------------------

3) Interpretation

The data were taken from the ‘gapminder’ data set and grouped by per capita income. The basic question investigated was: is there any association between ‘per capita income’ and ‘life expectancy’ in different countries? As required by the assignment, the quantitative income data were converted to a categorical variable by using the "New World Bank country classifications by income level: 2019-2020".

It can be seen from the regression output that there is a strong association between ‘per capita income’ and ‘life expectancy’ in different countries. An Analysis of Variance (ANOVA) revealed that ‘low income’ countries have lower life expectancy (Mean=59.159000, s.d. ±7.421381) than ‘high income’ countries (Mean=80.211706, s.d. ±1.878559). From the ANOVA results, the null hypothesis of ‘no difference’ in ‘life expectancy’ between ‘per capita income’ groups can be rejected, with F_stat=94.24 and p=4.84e-36.

Model Interpretation for post hoc ANOVA results:

Post hoc comparisons of mean ‘life expectancy’ revealed that, except for the pair of group 2 (Lower-middle income countries) and group 3 (Upper-middle income countries), all pairs of groups have statistically different mean life expectancy.

Hence it is concluded that there is an association between ‘per capita income category’ and ‘life expectancy’ for different countries.


Peer-graded Assignment: Creating graphs for your data

1) Program

"""
@author: M. P. kshirsagar
"""
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

data_o = pd.read_csv('gapminder.csv', low_memory=False)
type(data_o)
print("Total numbe of rows")
print(len(data_o))  # number of observations (rows)
print("Total numbe of columns")
print(len(data_o.columns))  # number of variables (columns)
data_o.head()

#%%
# setting variables to numeric
data_o['income'] = pd.to_numeric(data_o['incomeperperson'], errors='coerce')
data_o['oil'] = pd.to_numeric(data_o['oilperperson'], errors='coerce')
data_o['electricity'] = pd.to_numeric(data_o['relectricperperson'], errors='coerce')
data_o

# Retaining only required columns
data = data_o[['country', 'income', 'oil', 'electricity']]
data.head()
# make a copy of data
data1 = data.copy()

#%%
# count of missing values
print("\n")
print('counts for income with number of missing requested')
c2 = data1['income'].value_counts(sort=False, dropna=False)
print(c2)

print("\n")
print('counts for oil consumption with number of missing requested')
c2 = data1['oil'].value_counts(sort=False, dropna=False)
print(c2)

print("\n")
print('counts for electricity consumption with number of missing requested')
c2 = data1['electricity'].value_counts(sort=False, dropna=False)
print(c2)

# Replacing zero with NaN
data1 = data1.replace(0, np.nan)

# subset data for those countries with all three variable values available
# (dropping missing value rows)
data2 = data1.dropna()

# Exploratory data analysis
data2.describe()

# calculate frequency in bins

# The univariate distribution of Income
plt.figure(figsize=(20, 10))  # Change plot size
sb.distplot(data2['income'], kde=False)
plt.xlabel('Income (US$/year/person)')
plt.title('Per capita Income')
plt.show()

#%%
# The univariate distribution of Oil consumption
plt.figure(figsize=(20, 10))  # Change plot size
sb.distplot(data2['oil'], kde=False)
plt.xlabel('Oil consumption (TOE/year/person)')
plt.title('Oil consumption')
plt.show()

#%%
# The univariate distribution of Electricity consumption
plt.figure(figsize=(20, 10))  # Change plot size
sb.distplot(data2['electricity'], kde=False)
plt.xlabel('Electricity consumption (kWh/year/person)')
plt.title('Electricity consumption')
plt.show()

#%%
# basic scatterplot for Oil consumption and Income
plt.scatter(data2['oil'], data2['income'])
plt.xlabel('Oil consumption (TOE/year/person)')
plt.ylabel('Income (US$/year/person)')
plt.title('Scatterplot for the Association Between Per capita Oil consumption and Per capita Income')
plt.show()

# basic scatterplot for Electricity consumption and Income
plt.scatter(data2['electricity'], data2['income'])
plt.xlabel('Electricity consumption (kWh/year/person)')
plt.ylabel('Income (US$/year/person)')
plt.title('Scatterplot for the Association Between Per capita Electricity consumption and Per capita Income')
plt.show()

#%%

2) Plots and Discussion

2.1) Uni-variate distribution for oil consumption:

The distribution of per capita oil consumption is highly right-skewed, which shows a wide disparity in oil consumption between countries. The data point on the extreme right corresponds to Singapore, which looks like an outlier. Most countries sit on the left-hand side, with very low per capita oil consumption.

2.2) Uni-variate distribution for Electricity consumption

The distribution of per capita electricity consumption is again highly right-skewed, which shows a wide disparity in electricity consumption between countries. The data point on the extreme right corresponds to the UAE, which looks like an outlier. This might be because of its very low population and the non-availability of other energy resources. Most countries sit on the left-hand side, with very low per capita electricity consumption (just like oil).

2.3) Uni-variate distribution for per capita Income

The distribution of per capita income is again highly right-skewed, which shows a wide disparity in income between countries. The data point on the extreme right corresponds to Norway. However, the income distribution is less skewed than those of oil and electricity consumption. This might be because factors other than energy consumption also contribute to higher income. Even so, most countries are again located on the left-hand side, with very low per capita income (just like oil and electricity).
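The right skew described in 2.1 to 2.3 can also be put into numbers. This is only a minimal sketch on made-up values (not the gapminder columns): pandas' `Series.skew()` is positive when the tail is on the right, and in that case the mean is pulled above the median by the few large values.

```python
import pandas as pd

# Hypothetical per capita values (NOT the gapminder data): many small
# values and one very large one, mimicking the shape described above.
toy = pd.Series([0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.5, 2.5, 12.0])

skewness = toy.skew()        # sample skewness; positive means a long right tail
mean_value = toy.mean()      # pulled upwards by the extreme value
median_value = toy.median()  # barely affected by the extreme value

print(f"skewness = {skewness:.2f}")
print(f"mean = {mean_value:.2f}, median = {median_value:.2f}")
```

With the real data, the same calls could be applied to `data2['oil']`, `data2['electricity']` and `data2['income']` to compare how strongly each variable is skewed.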

2.4) Bi-variate plots:

The scatter-plot shows a positive correlation between oil consumption and income across countries. If we set aside Singapore (with its very high oil consumption), a straight line fits the remaining data points well. Hence, as a preliminary conclusion, there appears to be a strong association between a country's oil consumption and its income.

The scatter-plot shows a positive correlation between electricity consumption and income across countries. If we set aside the UAE (with its very high electricity consumption), the rest of the data can be represented by a straight line that fits most of the points. Hence, as a preliminary conclusion, there appears to be a strong association between a country's electricity consumption and its income.

Collectively, we can conclude that energy consumption is an indicator of a country's income.
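The effect of setting aside a single extreme country can be illustrated with the Pearson correlation coefficient. The numbers below are invented for illustration and are not the gapminder values; the last point plays the role of one very high consumer.

```python
import numpy as np

# Hypothetical (consumption, income) pairs, NOT the gapminder values.
# The last point stands in for a single extreme consumer.
consumption = np.array([0.2, 0.5, 0.8, 1.0, 1.5, 2.0, 2.5, 12.0])
income = np.array([1000, 4000, 7000, 9000, 15000, 21000, 26000, 32000])

# Pearson correlation using every point
r_all = np.corrcoef(consumption, income)[0, 1]

# Pearson correlation after dropping the extreme last point
r_trimmed = np.corrcoef(consumption[:-1], income[:-1])[0, 1]

print(f"r with the extreme point:    {r_all:.3f}")
print(f"r without the extreme point: {r_trimmed:.3f}")
```

In this toy case the coefficient is noticeably higher without the extreme point, which is the numerical counterpart of "a straight line fits the remaining points". Whether the same holds for the real data would have to be checked on `data2`.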


milindpkshirsagar · 4 years ago


Peer-graded Assignment: Making Data Management Decisions

1) Program

""" @author: M. P. kshirsagar """ import pandas as pd import numpy as np data_o = pd.read_csv('gapminder.csv', low_memory=False) type(data_o) print("Total numbe of rows") print (len(data_o)) #number of observations (rows) print("Total numbe of columns") print (len(data_o.columns)) # number of variables (columns) data_o.head() #%% #setting variables to numeric data_o['income'] = pd.to_numeric(data_o['incomeperperson'],errors='coerce') data_o['oil'] = pd.to_numeric(data_o['oilperperson'],errors='coerce') data_o['electricity'] = pd.to_numeric(data_o['relectricperperson'],errors='coerce') data_o #Retaining only required columns data = data_o[['country','income','oil','electricity']] data.head() #make a copy of data data1 = data.copy() #%% # count of missing values print("\n") print ('counts for income with number of missing requested') c2 = data1['income'].value_counts(sort=False, dropna=False) print(c2) print("\n") print ('counts for oil consumption with number of missing requested') c2 = data1['oil'].value_counts(sort=False, dropna=False) print(c2) print("\n") print ('counts for electricity consumption with number of missing requested') c2 = data1['electricity'].value_counts(sort=False, dropna=False) print(c2) # Replacing zero with NaN data1 = data1.replace(0, np.nan) #subset data for those countries with all three variable values available #(dropping missing value rows) data2 = data1.dropna() data2 #%% # freqeuncy disributions using the 'bygroup' function print("\n") print('Frequency Distribution of Per capita Income ( US$/year/person)') inc= data2['income'].value_counts(bins=[0,10000,20000,30000,40000,50000,60000,70000 print(inc) print("\n") print('Frequency Distribution of Oil consumption per capita (tons/year/person)') oel= data2['oil'].value_counts(bins=[0,2,3,4,5,6,7,8,9,10,11,12,13], sort=False) print (oel) print("\n") print('Frequency Distribution of Residential electricity consumption, per capita (kWh/elec= 
data2['electricity'].value_counts(bins=[0,1000,2000,3000,4000,5000,6000,7000, print (elec) #%%

2) Output

Python 3.8.3 (default, Jul 2 2020, 17:28:51) [MSC v.1916 32 bit (Intel)]

Type "copyright", "credits" or "license" for more information.

IPython 7.16.1 -- An enhanced Interactive Python.

Total numbe of rows

213

Total numbe of columns

16

counts for income with number of missing requested

NaN 23

8614.120219 1

39972.352768 1

279.180453 1

161.317137 1

..

377.421113 1

2344.896916 1

25306.187193 1

4180.765821 1

25575.352623 1

Name: income, Length: 191, dtype: int64

counts for oil consumption with number of missing requested

NaN 150

1.938654 1

0.726250 1

0.732817 1

1.567527 1

...

0.858962 1

0.394489 1

0.032281 1

0.420095 1

0.812369 1

Name: oil, Length: 64, dtype: int64

counts for electricity consumption with number of missing requested

NaN 77

0.000000 5

1920.962215 1

2826.044873 1

55.794744 1

..

7432.130852 1

351.166594 1

97.246492 1

9.192395 1

1259.392457 1

Name: electricity, Length: 133, dtype: int64

Frequency Distribution of Per capita Income ( US$/year/person)

(-0.001, 5000.0] 24

(5000.0, 10000.0] 11

(10000.0, 15000.0] 4

(15000.0, 20000.0] 3

(20000.0, 25000.0] 2

(25000.0, 30000.0] 8

(30000.0, 35000.0] 4

(35000.0, 40000.0] 5

Name: income, dtype: int64

Frequency Distribution of Oil consumption per capita (tons/year/person)

(-0.001, 2.0] 51

(2.0, 3.0] 5

(3.0, 4.0] 1

(4.0, 5.0] 3

(5.0, 6.0] 0

(6.0, 7.0] 0

(7.0, 8.0] 0

(8.0, 9.0] 0

(9.0, 10.0] 0

(10.0, 11.0] 0

(11.0, 12.0] 0

(12.0, 13.0] 1

Name: oil, dtype: int64

Frequency Distribution of Residential electricity consumption, per capita (kWh/year/person)

(-0.001, 1000.0] 32

(1000.0, 2000.0] 14

(2000.0, 3000.0] 7

(3000.0, 4000.0] 1

(4000.0, 5000.0] 5

(5000.0, 6000.0] 0

(6000.0, 7000.0] 0

(7000.0, 8000.0] 1

(8000.0, 9000.0] 0

(9000.0, 10000.0] 0

(10000.0, 11000.0] 0

(11000.0, 12000.0] 1

Name: electricity, dtype: int64

3) Discussion

Missing values:

In the first variable ‘income per capita’ 23 missing values are reported.

In the second variable ‘oil consumption per capita’ 150 missing values are reported.

In the third variable ‘electricity consumption per capita’ 77 missing values are reported. In addition, 5 entries with a value of zero are reported.
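The missing and zero counts above were read off the `value_counts` output. A more direct way to get them, sketched here on a small made-up frame rather than `gapminder.csv`, is to count `NaN` and zero entries per column and then apply the same `replace(0, np.nan)` and `dropna()` steps used in the program.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the gapminder columns used above.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "income": [500.0, np.nan, 12000.0, 30000.0, 800.0],
    "oil": [0.1, 0.4, np.nan, 2.5, np.nan],
    "electricity": [0.0, 150.0, 900.0, 7000.0, np.nan],
})

numeric_cols = ["income", "oil", "electricity"]

# Number of missing (NaN) entries per column
missing_counts = toy[numeric_cols].isna().sum()

# Number of entries that are exactly zero per column
zero_counts = (toy[numeric_cols] == 0).sum()

# Treat zeros as missing, then keep only rows with all values present
cleaned = toy.replace(0, np.nan).dropna()

print(missing_counts)
print(zero_counts)
print(cleaned)
```

Only the rows with all three variables present (and non-zero) survive, which is why `data2` is so much smaller than the original 213 rows.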

Comments on distribution:

The frequency distribution of per capita income is a top-heavy table: most of the countries have a very low per capita income (less than US$10000/year/person).

A similar trend is also observed in energy consumption per capita (both oil and electricity).

This suggests a possible positive correlation between per capita energy consumption and income across countries.
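Similar-looking frequency tables only hint at a relationship, so the variables would still need to be compared pair by pair. Because all three are strongly skewed, one option is a rank (Spearman-style) correlation, computed here as the Pearson correlation of the ranks. This is a sketch on invented numbers, not a result for the gapminder data.

```python
import pandas as pd

# Hypothetical values for seven countries (NOT the gapminder data).
toy = pd.DataFrame({
    "income": [400, 900, 2500, 6000, 11000, 27000, 38000],
    "oil": [0.05, 0.10, 0.30, 0.70, 1.20, 2.40, 4.80],
    "electricity": [40, 120, 300, 900, 1500, 4500, 4000],
})

# Replace each value by its rank within its column, then correlate the
# ranks. This is less sensitive to a few extreme values than a plain
# Pearson correlation on the raw numbers.
rank_corr = toy.rank().corr()

print(rank_corr.round(3))
```

For the actual data, the same two calls could be applied to `data2[['income', 'oil', 'electricity']]`.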


milindpkshirsagar · 4 years ago


Peer-graded Assignment: Running Your First Program

1) My program

""" @author: M. P. kshirsagar """ import pandas as pd data = pd.read_csv('gapminder.csv', low_memory=False) type(data) print("Total numbe of rows") print (len(data)) #number of observations (rows) print("Total numbe of columns") print (len(data.columns)) # number of variables (columns) data.head()

#%% #setting variables to numeric data['income'] = pd.to_numeric(data['incomeperperson'],errors='coerce') data['oil'] = pd.to_numeric(data['oilperperson'],errors='coerce') data['electricity'] = pd.to_numeric(data['relectricperperson'],errors='coerce')

#%% # freqeuncy disributions using the 'bygroup' function print('Frequency Distribution of Per capita Income ( US$/year/person)') ic= data['income'].value_counts(bins=[0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,110000], sort=False) print(ic)

print('Frequency Distribution of Oil consumption per capita (tons/year/person)') oc= data['oil'].value_counts(bins=[0,2,3,4,5,6,7,8,9,10,11,12,13], sort=False) print (oc)

print('Frequency Distribution of Residential electricity consumption, per capita (kWh/year/person)') ec= data['electricity'].value_counts(bins=[0,1000,2000,3000,4000,5000,6000,7000,8000,9000,10000,11000,12000],sort=False) print (ec)

#%%

2) the output that displays three of my variables as frequency tables

Python 3.8.3 (default, Jul 2 2020, 17:28:51) [MSC v.1916 32 bit (Intel)]

Type "copyright", "credits" or "license" for more information.

IPython 7.16.1 -- An enhanced Interactive Python.

In [1]: runfile('C:/Users/Admin/Desktop/PS_DS_ML_AI/Coursera/Data Management and Visualization/Resources/Assignment 1.py', wdir='C:/Users/Admin/Desktop/PS_DS_ML_AI/Coursera/Data Management and Visualization/Resources')

Total numbe of rows

213

Total numbe of columns

16

Frequency Distribution of Per capita Income ( US$/year/person)

(-0.001, 10000.0] 143

(10000.0, 20000.0] 17

(20000.0, 30000.0] 14

(30000.0, 40000.0] 12

(40000.0, 50000.0] 0

(50000.0, 60000.0] 1

(60000.0, 70000.0] 1

(70000.0, 80000.0] 0

(80000.0, 90000.0] 1

(90000.0, 100000.0] 0

(100000.0, 110000.0] 1

Name: income, dtype: int64

Frequency Distribution of Oil consumption per capita (tons/year/person)

(-0.001, 2.0] 51

(2.0, 3.0] 6

(3.0, 4.0] 1

(4.0, 5.0] 3

(5.0, 6.0] 0

(6.0, 7.0] 1

(7.0, 8.0] 0

(8.0, 9.0] 0

(9.0, 10.0] 0

(10.0, 11.0] 0

(11.0, 12.0] 0

(12.0, 13.0] 1

Name: oil, dtype: int64

Frequency Distribution of Residential electricity consumption, per capita (kWh/year/person)

(-0.001, 1000.0] 91

(1000.0, 2000.0] 22

(2000.0, 3000.0] 12

(3000.0, 4000.0] 2

(4000.0, 5000.0] 5

(5000.0, 6000.0] 0

(6000.0, 7000.0] 0

(7000.0, 8000.0] 2

(8000.0, 9000.0] 1

(9000.0, 10000.0] 0

(10000.0, 11000.0] 0

(11000.0, 12000.0] 1

Name: electricity, dtype: int64

3) a few sentences describing my frequency distributions

The income disparity is clearly visible in the ‘Frequency Distribution of Per capita Income (US$/year/person)’ table. The majority of nations have a per capita income of less than 10000 US$/year/person.

At first glance, the disparities in income, oil consumption and electricity consumption seem to be related: all three variables follow the same top-heavy frequency table.

There is a possibility of a positive correlation between oil and electricity consumption and per capita GDP.
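How large the "majority" is can be worked out from the income frequency table in the output above. The counts in this sketch are copied from that table; the variable names are only illustrative.

```python
import pandas as pd

# Upper edge of each income bin: 10000, 20000, ..., 110000
upper_edges = list(range(10000, 120000, 10000))

# Counts copied from the income frequency table in the output above
counts = [143, 17, 14, 12, 0, 1, 1, 0, 1, 0, 1]

table = pd.Series(counts, index=upper_edges, name="countries")

# Share of countries in each bin, and the running (cumulative) share
shares = table / table.sum()
cumulative = shares.cumsum()

print(f"countries with income data: {table.sum()}")
print(f"share below 10000 US$: {shares.loc[10000]:.1%}")
print(cumulative.round(3))
```

The counts add up to 190 countries with income data, and 143 of them (roughly three quarters) fall in the lowest bin. In the program itself, passing `normalize=True` to `value_counts` would return these shares directly.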
