milindpkshirsagar
milindpkshirsagar · 4 years ago
Peer-graded Assignment: Running a k-means Cluster Analysis
1) Python script
[Images: screenshots of the Python script]
2) Output of the Script
[Images: screenshots of the script output]
3) Interpretation
A k-means cluster analysis was conducted to identify underlying subgroups of countries based on the similarity of their values on 13 variables that could have an impact on life expectancy (‘Alcohol consumption’, ‘Armed forces rate’, ‘Breast cancer per 100 persons’, ‘CO2 emissions’, ‘Female employment rate’, ‘HIV rate’, ‘Oil consumption per person’, ‘Polity score’, ‘Electricity use per person’, ‘Suicide rate per 100 persons’, ‘Total employment rate’, ‘Urbanization rate’, and ‘Income per person’). All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
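The standardization step can be sketched as follows. This is a minimal illustration using only the standard library; the `standardize` helper and the urbanization values are hypothetical and are not taken from the original script.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (ddof = 0)
    return [(x - m) / s for x in values]

# Hypothetical urbanization rates (percent) for five countries
urban_rate = [24.0, 37.5, 56.0, 68.2, 91.3]
z = standardize(urban_rate)
print([round(v, 3) for v in z])
```

After this step every variable is on the same scale, so a variable measured in large units (such as income) does not dominate the Euclidean distances used by k-means.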
The data were not split into training and test sets, because the assignment did not require it and only 56 data points were available in the data set. A series of k-means cluster analyses was conducted on the full data set, specifying k = 1 to 9 clusters and using Euclidean distance. The variance in the clustering variables accounted for by the clusters (R-square) was plotted for each of the nine cluster solutions in an elbow curve, to guide the choice of the number of clusters to interpret.
The elbow curve was inconclusive, suggesting that the 2- and 5-cluster solutions might be interpreted. The results here are for an interpretation of the 3-cluster solution.
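An elbow curve of this kind can be sketched roughly as below. The data are synthetic stand-ins (56 rows, 13 columns), and the sketch plots the average distance to the nearest centroid rather than R-square, which is a substitution on my part; it assumes scikit-learn and SciPy are installed and is not the original script.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the clustering variables: three loose groups
centers = rng.normal(0, 3, size=(3, 13))
X = np.vstack([c + rng.normal(0, 1, size=(n, 13))
               for c, n in zip(centers, (19, 19, 18))])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column

mean_dist = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Euclidean distance from each observation to its nearest centroid
    nearest = cdist(X, model.cluster_centers_, "euclidean").min(axis=1)
    mean_dist.append(nearest.mean())

for k, d in zip(range(1, 10), mean_dist):
    print(k, round(d, 3))
```

Plotting `mean_dist` against k gives the elbow curve; the "elbow" is the value of k after which adding clusters reduces the distance only slightly.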
Canonical discriminant analysis was used to reduce the 13 clustering variables to a few canonical variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster indicated that the observations in clusters 1 and 2 were densely packed, with relatively low within-cluster variance, and did not overlap much with the other clusters.
The means on the clustering variables showed that, compared to the other clusters, countries in cluster 1 had higher levels on the clustering variables. To externally validate the clusters, an analysis of variance (ANOVA) was conducted to test for significant differences between the clusters on ‘Life expectancy’. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on ‘Life expectancy’ (F(2, 56)=14.36, p<.0001). The Tukey post hoc comparisons showed that all the clusters differed significantly from each other on ‘Life expectancy’.
Peer-graded Assignment: Running a Lasso Regression Analysis
1) Python script
[Images: screenshots of the Python script]
2) Output of the Script
[Images: screenshots of the script output]
3) Interpretation of results
A lasso regression analysis was conducted to identify a subset of variables, from a pool of 13 quantitative predictor variables (‘Alcohol consumption’, ‘Armed forces rate’, ‘Breast cancer per 100 persons’, ‘CO2 emissions’, ‘Female employment rate’, ‘HIV rate’, ‘Oil consumption per person’, ‘Polity score’, ‘Electricity use per person’, ‘Suicide rate per 100 persons’, ‘Total employment rate’, ‘Urbanization rate’, and ‘Income per person’), that best predicted the quantitative response variable ‘Life expectancy’. All predictor variables were standardized to have a mean of zero and a standard deviation of one.
The data set of 56 countries was split into ‘training’ and ‘testing’ data sets in the ratio 70:30. The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model on the training set, and the model was validated on the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
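The procedure described above can be sketched as follows. This is an illustration on synthetic data, assuming scikit-learn is available; the random seeds, the simulated coefficients, and the variable names are hypothetical, and the original script may differ.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 56, 13

# Synthetic standardized predictors; only three of them drive the response
X = rng.normal(size=(n, p))
y = 70 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(size=n)

# 70:30 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# Lasso fitted by least angle regression with 10-fold cross validation
model = LassoLarsCV(cv=10).fit(X_train, y_train)

kept = np.flatnonzero(model.coef_)  # predictors with non-zero coefficients
print("retained predictor columns:", kept)
print("R-squared (train):", round(model.score(X_train, y_train), 3))
print("R-squared (test): ", round(model.score(X_test, y_test), 3))
```

Predictors whose coefficients are shrunk exactly to zero are dropped from the model, which is how the lasso performs variable selection.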
From the output and the first figure (change in the regression coefficients), only 5 predictor variables were retained with non-zero regression coefficients (‘HIV rate’: -2.165, ‘Polity score’: 0.33, ‘Suicide rate per 100 persons’: -0.374, ‘Urbanization rate’: 1.135, and ‘Income per person’: 2.99). The 8 predictors removed by the lasso regression (those whose coefficients were shrunk to zero) are ‘Alcohol consumption’, ‘Armed forces rate’, ‘Breast cancer per 100 persons’, ‘CO2 emissions’, ‘Female employment rate’, ‘Oil consumption per person’, ‘Electricity use per person’, and ‘Total employment rate’.
The predictor most strongly associated with ‘Life expectancy’ is ‘Income per person’, with a positive regression coefficient of 2.99. The other predictors positively associated with ‘Life expectancy’ are ‘Urbanization rate’ and ‘Polity score’. A plausible reading is that higher income brings better access to medical facilities and so a higher life expectancy. Similarly, a high polity score (democracy score) may mean fewer conflicts and greater respect for human values, which can lengthen life spans. As expected, ‘HIV rate’ and ‘Suicide rate’ are negatively associated with ‘Life expectancy’.
These 5 predictor variables accounted for 84.29% of the variance in ‘Life expectancy’ in the training data set and 52.18% in the test data set.
Peer-graded Assignment: Running a Random Forest
1) Python script
[Images: screenshots of the Python script]
2) Output of the script
[Images: screenshots of the script output]
3) Interpretation
Random forest analysis was performed to evaluate the importance of 13 explanatory variables (‘Alcohol consumption’, ‘Armed forces rate’, ‘Breast cancer per 100 persons’, ‘CO2 emissions’, ‘Female employment rate’, ‘HIV rate’, ‘Oil consumption per person’, ‘Polity score’, ‘Electricity use per person’, ‘Suicide rate per 100 persons’, ‘Total employment rate’, ‘Urbanization rate’, and ‘Income per person’) in predicting a binary, categorical response variable (‘Life expectancy’).
The explanatory variables with the highest relative importance scores for predicting ‘Life expectancy’ were ‘Income per person’ (0.14059446), ‘HIV rate’ (0.13162605), ‘Oil consumption per person’ (0.12815323) and ‘Urbanization rate’ (0.10183941). The accuracy of the random forest was 95.65%, reached by growing multiple trees (about 9) rather than a single tree. Increasing the number of trees beyond 9 did not improve the accuracy.
Peer-graded Assignment: Running a Classification Tree
1) Python script
[Image: screenshot of the Python script]
2) Output of the Python script
[Images: screenshots of the script output]
3) Decision tree
[Image: decision tree diagram]
4) Interpretation of the Output
The data are taken from the Gapminder data set. The purpose is to classify ‘Life expectancy’ in a country based on ‘Income per person’ and ‘Urbanization rate’. The quantitative data for all three variables were converted to two categories (‘0’ and ‘1’).
Decision tree analysis was performed to test nonlinear relationships among two binary, categorical explanatory variables (‘Income per person’ and ‘Urbanization rate’) and a binary, categorical response variable (‘Life expectancy’). The data set of 175 countries was split into ‘training’ and ‘testing’ data sets in the ratio 60:40. The accuracy score for predictions on the test data set was 85.7%.
‘Urbanization rate’ was the first variable to separate the sample into two subgroups. The second separator was ‘Income per person’.
The decision tree shows that 65 countries (in the given data set) have urbanization rates below 50%, and 38 of these have higher life expectancy (above the average of 70 years). Among the countries with urbanization rates below 50%, 28 in the lower income group have higher life expectancy, whereas only 10 in the higher income group do.
Of the 40 countries with higher urbanization rates, 32 belong to the higher income group, and 31 of these have higher ‘Life expectancy’. Of the 8 countries with higher urbanization and lower income, 7 have higher ‘Life expectancy’.
Peer-graded Assignment: Test a Logistic Regression Model
1) The summary of results
Null hypothesis: The life expectancy of a country is not associated with its urbanization rate.
After adjusting for a potential confounding factor (per capita income), the odds of having higher life expectancy were about 0 to 6% higher for countries with a higher urbanization rate than for countries with a lower urbanization rate (OR=1.03, 95% CI=1.05-1.1, p=.01). Per capita income was also significantly associated with life expectancy, such that higher-income countries had higher life expectancy than lower-income countries (OR=1.0006, 95% CI=1.0003-1.000877, p=0.0001).
These results reject the null hypothesis, so it is likely that the life expectancy of a country is associated with its urbanization rate. However, income per person is found to be a major confounder between these two variables: as income goes up, urbanization and life expectancy usually go up as well.
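For readers unfamiliar with how odds ratios are read off a logistic regression, the sketch below shows the usual conversion from a logit coefficient to an odds ratio and a 95% confidence interval. The coefficient and standard error are hypothetical values chosen for illustration, not the fitted values from this model, and the helper names are my own.

```python
import math

def odds_ratio(coef, std_err, z=1.96):
    """Convert a logit coefficient to (OR, lower CI bound, upper CI bound)."""
    return (math.exp(coef),
            math.exp(coef - z * std_err),
            math.exp(coef + z * std_err))

def percent_change_in_odds(or_value):
    """Percent change in the odds per one-unit increase in the predictor."""
    return (or_value - 1.0) * 100.0

# Hypothetical coefficient and standard error, for illustration only
or_, lower, upper = odds_ratio(coef=0.03, std_err=0.012)
print(round(or_, 3), round(lower, 3), round(upper, 3))

# An odds ratio of 1.03 corresponds to odds about 3% higher per unit
print(round(percent_change_in_odds(1.03), 1))
```

Because the interval is built symmetrically around the coefficient on the log scale, the lower and upper bounds computed this way always bracket the odds ratio itself.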
2) Python script
[Image: screenshot of the Python script]
3) The output from logistic regression model
[Images: screenshots of the logistic regression output]
Peer-graded Assignment: Test a Multiple Regression Model
1) Python program
[Image: screenshot of the Python program]
2) Output of the Python program and Discussion
[Images: linear regression output and scatter plot]
The linear regression analysis shows a weak positive association between the explanatory variable ‘Polity score’ and the response variable ‘Life expectancy’ (R²=0.088, p=0.0001). The association can be seen in the plot as well.
[Images: quadratic regression output and scatter plot]
The quadratic regression analysis shows a moderate positive association between the explanatory variable ‘Polity score’ and the response variable ‘Life expectancy’ (R²=0.375, p<0.0001). The association can be seen in the plot above as well.
[Image: multiple regression output]
The multiple regression analysis shows a stronger positive association between the explanatory variables ‘Polity score’ and ‘Income per person’ and the response variable ‘Life expectancy’ (R²=0.466, p<0.0001).
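The comparison between a linear and a quadratic fit can be illustrated with a small sketch. It uses NumPy only and synthetic data (the simulated polity scores and coefficients are hypothetical), so the R² values it prints are not the ones reported here.

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance in y explained by the fitted values."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)

# Synthetic stand-in: polity score on -10..10 with a curved response
polity = rng.integers(-10, 11, size=150).astype(float)
life = 65 + 0.3 * polity + 0.08 * polity ** 2 + rng.normal(scale=5, size=150)

x = polity - polity.mean()  # centre the explanatory variable
linear_fit = np.polyval(np.polyfit(x, life, 1), x)
quadratic_fit = np.polyval(np.polyfit(x, life, 2), x)

r2_linear = r_squared(life, linear_fit)
r2_quadratic = r_squared(life, quadratic_fit)
print(round(r2_linear, 3), round(r2_quadratic, 3))
```

Adding a squared term can never lower R² on the same data, so a clear jump (as between 0.088 and 0.375 above) is what signals that the relationship is curved.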
3) Regression diagnostic plots
[Images: regression diagnostic plots]
All the regression diagnostic plots show some imperfection and deviation from normality. However, the standardized residuals are fairly evenly distributed on both sides of zero if the outlier is ignored.
Peer-graded Assignment: Test a Basic Linear Regression Model
1) Python Program
[Image: screenshot of the Python program]
2) Mean of the centered explanatory variable (Income per person) = -3.934761288722879e-12 (almost zero)
3) Description of results
The scatter plot shows a moderate positive linear association between ‘Income per person’ and ‘Life expectancy’ (R²=0.362, p<0.001).
The mathematical expression can be written as:
Life expectancy = 70.4376 + 0.006 * Income per person (centered)
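The role of centering in this equation can be sketched as below. The intercept and slope are the reported values; the income figures and the helper name are hypothetical and serve only to show the mechanics.

```python
from statistics import mean

INTERCEPT = 70.4376  # reported intercept
SLOPE = 0.006        # reported slope per unit of centered income

def predict_life_expectancy(income, mean_income):
    """Apply the fitted line to an income value after centering it."""
    return INTERCEPT + SLOPE * (income - mean_income)

# Hypothetical incomes, used only to demonstrate centering
incomes = [500.0, 2500.0, 9000.0, 20000.0]
m = mean(incomes)
centered = [x - m for x in incomes]

print(abs(mean(centered)) < 1e-9)       # centered values average to zero
print(predict_life_expectancy(m, m))    # prediction at the mean income
```

Because the explanatory variable is centered, the intercept is the predicted life expectancy for a country with exactly the mean income, which makes it directly interpretable.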
[Image: scatter plot of income per person against life expectancy]
4) Output of the Program
[Image: screenshot of the program output]
Measures
Per capita income was taken from World Bank statistics for the year 2010. Inflation, but not differences in the cost of living between countries, has been taken into account. Electricity consumption per person per year was taken from the IEA (International Energy Agency). The data give 2008 residential electricity consumption per person, in kilowatt-hours (kWh), and 2010 oil consumption per capita (tonnes per person per year). From these two parameters, net energy consumption per capita can be calculated in a common unit such as tonnes of oil equivalent (TOE), to investigate any association between energy consumption and income across countries.
Procedure
The data sets were collected by leading organizations in their fields, such as the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank. The income data are fact-based statistics generated by individual countries. GDP is calculated by any of three methods (the supply or production method, the income method, and the demand or expenditure method), and by definition its value should be identical irrespective of the method used. Energy consumption per person is calculated from the data published by the IEA (International Energy Agency) for per-person electricity and oil consumption.
Sample
Research question
Is there a statistically significant association between ‘Per capita Income’ and ‘Energy consumption per person’ across different countries?
Sample
The sample is taken from the Gapminder data set. Gapminder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank. Since its conception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, giving 215 areas in total.
Testing a Potential Moderator (Income level) between ‘Polity Score’ and  ‘Life expectancy’
1) Python script
[Image: screenshot of the Python script]
2) Output of the Script and Interpretation
A)
Association between Polity score and Life Expectancy (for all income levels)
(0.3020631677809377, 0.00013354062808472477)
[Image: scatter plot of polity score against life expectancy, all income levels]
Across 155 countries at all income levels, a moderate association between ‘Polity score’ and ‘Life expectancy’ is observed in the scatter plot, with a correlation coefficient of r=0.3 and a very low p value (p=0.00013). However, this could be misleading in the presence of a potential moderator such as the income level of the countries.
B)
Association Between Polity Score and Life expectancy for LOW income countries
(0.07964002358397154, 0.43808688761650616)
[Image: scatter plot of polity score against life expectancy, LOW income countries]
For the 97 countries with LOW income level, no association between ‘Polity score’ and ‘Life expectancy’ is observed in the scatter plot. The correlation coefficient, r=0.08, is very weak, with a very high p value (p=0.43). Hence it can be concluded that there is no association between ‘Polity score’ and ‘Life expectancy’ for LOW income countries.
C)
Association Between Polity Score and Life expectancy for MIDDLE income countries
(0.21433792146661593, 0.2553824762097742)
[Image: scatter plot of polity score against life expectancy, MIDDLE income countries]
For the 30 countries with MIDDLE income level, a very weak association between ‘Polity score’ and ‘Life expectancy’ is observed in the scatter plot. The correlation coefficient is r=0.21, with a high p value (p=0.25). Hence it can be concluded that there is no significant association between ‘Polity score’ and ‘Life expectancy’ for MIDDLE income countries.
D) 
Association Between Polity Score and Life expectancy for HIGH income countries
(0.6636797458117805, 0.00011804616038541663)
[Image: scatter plot of polity score against life expectancy, HIGH income countries]
For the 28 countries with HIGH income level, a strong association between ‘Polity score’ and ‘Life expectancy’ is observed in the scatter plot. The correlation coefficient, r=0.66, is considerably higher, with a very low p value (p=0.0001). Hence it can be concluded that there is a statistically significant association between ‘Polity score’ and ‘Life expectancy’ for HIGH income countries.
Taken together, the analysis indicates that income level acts as a major moderator of the association between ‘Polity score’ and ‘Life expectancy’.
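Testing for a moderator in this way amounts to computing the correlation separately within each level of the third variable. A minimal sketch on synthetic data follows, assuming NumPy and SciPy are installed; the group sizes mirror the ones reported, but the simulated slopes and noise are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)

# Synthetic stand-in: the polity/life-expectancy slope differs by income group
groups = {}
for name, n, slope in (("low", 97, 0.0), ("middle", 30, 0.1), ("high", 28, 0.6)):
    polity = rng.integers(-10, 11, size=n).astype(float)
    life = 70 + slope * polity + rng.normal(scale=3, size=n)
    groups[name] = (polity, life)

# Pearson correlation and p value computed within each income group
results = {}
for name, (polity, life) in groups.items():
    r, p = pearsonr(polity, life)
    results[name] = (r, p)
    print(f"{name:>6}: r = {r:.2f}, p = {p:.4f}, n = {len(polity)}")
```

If the correlation is clearly different from one group to the next, as in the results above, the grouping variable is moderating the association.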
Peer-graded Assignment: Correlation Coefficients for dependency of 3 different variables on ‘Per capita Income’ of a country
1) Python script
[Image: screenshot of the Python script]
2) Output of Python script
[Image: plot]
Association between Per capita Income and Electricity use (0.6536076842568538, 4.5973586072830653e-17)
[Image: plot]
Association between Per capita Income and Life expectancy (0.6235764361228903, 2.9488174511561413e-15)
[Image: plot]
Association between Per capita Income and Female employment rate (0.13500896170853055, 0.1271406725277556)
3) Interpretation of the results
An effort was made to study the dependence of three variables, 1) Electricity use per person, 2) Life expectancy, and 3) Female employment rate, on the explanatory variable ‘Per capita Income’ for 129 countries.
Null hypothesis: The dependent variable has no association with the explanatory variable.
Alternate hypothesis: The dependent variable has an association with the explanatory variable.
1) Association between ‘Per capita Income’ and ‘Electricity use per person’
The plot shows a good positive correlation between income and electricity use (r=0.65, p=4.6e-17), although there are a few outliers.
2) Association between ‘Per capita Income’ and ‘Life expectancy’
The plot shows a good positive correlation between income and life expectancy (r=0.62, p=2.94e-15). However, for very low income countries, life expectancy is even lower than that predicted by the fitted straight line.
3) Association between ‘Per capita Income’ and ‘Female employment rate’
The plot shows no significant correlation between income and female employment rate (r=0.135, p=0.127). This might be because, in agriculture-based economies, women are employed at similar rates but this does not translate into higher income.
Running a Chi-Square Test of Independence
1) Python Script
# -*- coding: utf-8 -*-
"""
@author: M. P. kshirsagar
"""
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
import scipy.stats  # import the submodule explicitly so scipy.stats.chi2_contingency resolves
import seaborn
import matplotlib.pyplot as plt
data = pd.read_csv('gapminder.csv', low_memory=False)
data.columns
type(data)
print("Total numbe of rows")
print (len(data)) #number of observations (rows)
print("Total numbe of columns")
print (len(data.columns)) # number of variables (columns)
#%%
#setting variables to numeric
data['income'] = pd.to_numeric(data['incomeperperson'],errors='coerce')
data['polity'] = pd.to_numeric(data['polityscore'],errors='coerce')
#subset data for those countries with both variable values available
#(dropping missing value rows)
data= data.dropna()
#Retaining only required columns
data = data[['income','polity']]
#Converting quantitative variables to categorical variables
data['Income_Group'] = pd.cut(data['income'],bins=[0,3995,12375,40000], labels=[1,2,3])
data['Polity_Group'] = pd.cut(data['polity'],bins=[-10,-6,5,10], labels=['Auto','Ano', 'Demo'])
data
#%%
# contingency table of observed counts
ct1=pd.crosstab(data['Polity_Group'], data['Income_Group'])
print (ct1)
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
# set variable types 
data['Polity_Group'] = data['Polity_Group'].astype('category')
# new code for setting variables to numeric:
data['Income_Group'] = pd.to_numeric(data['Income_Group'], errors='coerce')
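# bar chart of the mean income level within each polity group
# (catplot replaces factorplot, which newer seaborn releases no longer provide;
# on seaborn older than 0.12 pass ci=None instead of errorbar=None)
seaborn.catplot(x='Polity_Group', y='Income_Group', data=data, kind="bar", errorbar=None)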
plt.xlabel('Polity Score')
plt.ylabel('Income level')
#%%
#make a copy of my new subsetted data
data1 = data.copy()
#recoding values for Income_Group into a new variable
recode2 = {1: 1, 2:2}
data1['COMP1v2']= data1['Income_Group'].map(recode2)
# contingency table of observed counts
ct2=pd.crosstab(data1['Polity_Group'], data1['COMP1v2'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
#%%
#make a copy of my new subsetted data
data1 = data.copy()
#recoding values for Income_Group into a new variable (groups 1 and 3; the column name COMP1v2 is reused)
recode2 = {1: 1, 3:3}
data1['COMP1v2']= data1['Income_Group'].map(recode2)
# contingency table of observed counts
ct2=pd.crosstab(data1['Polity_Group'], data1['COMP1v2'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
#%%
#make a copy of my new subsetted data
data1 = data.copy()
#recoding values for Income_Group into a new variable (groups 2 and 3; the column name COMP1v2 is reused)
recode2 = {2:2, 3:3}
data1['COMP1v2']= data1['Income_Group'].map(recode2)
# contingency table of observed counts
ct2=pd.crosstab(data1['Polity_Group'], data1['COMP1v2'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
2) Output of the Program
Total numbe of rows
213
Total numbe of columns
16

Income_Group   1   2   3
Polity_Group
Auto          13   3   2
Ano           42   3   1
Demo          42  23  24

Income_Group         1         2         3
Polity_Group
Auto          0.134021  0.103448  0.074074
Ano           0.432990  0.103448  0.037037
Demo          0.432990  0.793103  0.888889

chi-square value, p value, expected counts
(26.644429815415215, 2.345555777530802e-05, 4, array([[11.41176471,  3.41176471,  3.17647059],
       [29.16339869,  8.71895425,  8.11764706],
       [56.4248366 , 16.86928105, 15.70588235]]))

[Image: bar chart of income level by polity score]

COMP1v2       1.0  2.0
Polity_Group
Auto           13    3
Ano            42    3
Demo           42   23

COMP1v2            1.0       2.0
Polity_Group
Auto          0.134021  0.103448
Ano           0.432990  0.103448
Demo          0.432990  0.793103

chi-square value, p value, expected counts
(12.565113894282039, 0.0018686165268392632, 2, array([[12.31746032,  3.68253968],
       [34.64285714, 10.35714286],
       [50.03968254, 14.96031746]]))

COMP1v2       1.0  3.0
Polity_Group
Auto           13    2
Ano            42    1
Demo           42   24

COMP1v2            1.0       3.0
Polity_Group
Auto          0.134021  0.074074
Ano           0.432990  0.037037
Demo          0.432990  0.888889

chi-square value, p value, expected counts
(18.423976142253135, 9.983536575606077e-05, 2, array([[11.73387097,  3.26612903],
       [33.63709677,  9.36290323],
       [51.62903226, 14.37096774]]))

COMP1v2       2.0  3.0
Polity_Group
Auto            3    2
Ano             3    1
Demo           23   24

COMP1v2            2.0       3.0
Polity_Group
Auto          0.103448  0.074074
Ano           0.103448  0.037037
Demo          0.793103  0.888889

chi-square value, p value, expected counts
(1.1513165403114045, 0.5623345788741289, 2, array([[ 2.58928571,  2.41071429],
       [ 2.07142857,  1.92857143],
       [24.33928571, 22.66071429]]))
3) Interpretation of results
The data were taken from the Gapminder data set and grouped by per capita income and by polity score. The basic question investigated was: ‘Is there any association between the ‘Democracy score’ and the ‘Income level’ of different countries?’ As required by the assignment, the quantitative data for ‘Per capita income’ and ‘Democracy score’ were converted to categorical variables using the “New World Bank country classifications by income level: 2019-2020” (for income levels) and the “Polity data series” (for ‘Democracy score’).
Interpretation for Chi-Square Tests: 
When examining the association between the ‘Income level’ of a country (a quantitative response converted to a categorical response) and the ‘Polity score’ of that country (a quantitative explanatory variable converted to a categorical explanatory variable), a chi-square test of independence revealed that, among my sample of 155 countries, those with a high ‘polity score’ (Democracy) were more likely to have higher income levels than those with a low ‘polity score’ (Autocracy), with X2=26.64, 4 df, p=2.345e-05.
The degrees of freedom (df) for a chi-square test of independence are (number of rows - 1) × (number of columns - 1). Here both polity score and income level have 3 levels, so df = (3-1) × (3-1) = 4, as shown in the program output.
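The chi-square statistic and its degrees of freedom can be checked by hand from the observed counts in the program output. The sketch below uses plain Python and the 3×3 contingency table printed above; the variable names are my own.

```python
# Observed counts from the program output (rows: polity group, columns: income group)
observed = [
    [13, 3, 2],    # Auto
    [42, 3, 1],    # Ano
    [42, 23, 24],  # Demo
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print(round(chi_square, 2), dof)
```

The statistic and the degrees of freedom agree with the values returned by `scipy.stats.chi2_contingency` in the output (26.64 with 4 degrees of freedom).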
Interpretation for post hoc Chi-Square Test results:
Post hoc comparisons of ‘income levels’ by ‘polity score’ revealed that higher ‘income levels’ were observed among countries with a higher ‘polity score’ in the comparisons between income levels 1 (low) and 2 (middle) and between 1 (low) and 3 (high). However, no statistically significant difference by ‘polity score’ was observed between income levels 2 (middle) and 3 (high).
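Running three pairwise tests inflates the family-wise error rate. One common guard, which the write-up does not state explicitly, is the Bonferroni adjustment: divide the significance level by the number of comparisons. The sketch below applies it to the three p values from the program output; the dictionary labels are my own.

```python
ALPHA = 0.05

# p values of the three pairwise chi-square tests, copied from the output
p_values = {
    "low vs middle": 0.0018686165268392632,
    "low vs high": 9.983536575606077e-05,
    "middle vs high": 0.5623345788741289,
}

# Bonferroni adjustment: one shared alpha split across all comparisons
adjusted_alpha = ALPHA / len(p_values)

significant = {pair: p < adjusted_alpha for pair, p in p_values.items()}

print(round(adjusted_alpha, 4))
for pair, flag in significant.items():
    print(pair, flag)
```

Under this assumption the conclusions are unchanged: the low vs middle and low vs high comparisons remain significant, and middle vs high does not.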
Association between ‘Per capita income’ and ‘Life expectancy’
1) Python script
"""
@author: M. P. kshirsagar
"""
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
data = pd.read_csv('gapminder.csv', low_memory=False)
data.columns
type(data)
print("Total numbe of rows")
print (len(data)) #number of observations (rows)
print("Total numbe of columns")
print (len(data.columns)) # number of variables (columns)
#%%
#setting variables to numeric
data['income'] = pd.to_numeric(data['incomeperperson'],errors='coerce')
data['life'] = pd.to_numeric(data['lifeexpectancy'],errors='coerce')
#subset data for those countries with both variable values available
#(dropping missing value rows)
data= data.dropna()
#Retaining only required columns
data = data[['income','life']]
#Converting quantitative variable to categorical variable
data['Group'] = pd.cut(data['income'],bins=[0,1026,3995,12375,40000], labels=[1,2,3,4])
# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='life ~ C(Group)', data=data)
results1 = model1.fit()
print (results1.summary())
#%%
#Grouping data by income groups as per the "New World Bank country classifications by income level: 2019-2020"
categorized=data.groupby(pd.cut(data['income'], bins=[0,1026,3995,12375,40000]))['life'].agg(['mean', 'std', 'size'])
#adding category column as per the "New World Bank country classifications by income level: 2019-2020"
categorized['Group'] = ['Low income', 'Lower-middle income', 'Upper-middle income', 'High income']
cols= ['Group','mean','std','size']
categorized=categorized[cols]
categorized=categorized.dropna()
print(categorized)
#%%
#Using multiple comparison test to compare mean life expectancy in different income groups
from statsmodels.stats.multicomp import (pairwise_tukeyhsd, MultiComparison)
data=data.dropna()
mod = MultiComparison(data['life'], data['Group'])
print(mod.tukeyhsd())
2) Output of the Program
Total numbe of rows
213
Total numbe of columns
16
                            OLS Regression Results
==============================================================================
Dep. Variable:                   life   R-squared:                       0.623
Model:                            OLS   Adj. R-squared:                  0.617
Method:                 Least Squares   F-statistic:                     94.24
Date:                Sat, 19 Sep 2020   Prob (F-statistic):           4.84e-36
Time:                        12:54:33   Log-Likelihood:                -560.52
No. Observations:                 175   AIC:                             1129.
Df Residuals:                     171   BIC:                             1142.
Df Model:                           3
Covariance Type:            nonrobust
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        59.1590      0.827     71.507      0.000      57.526      60.792
C(Group)[T.2]    11.2720      1.176      9.588      0.000       8.951      13.593
C(Group)[T.3]    14.5697      1.301     11.200      0.000      12.002      17.137
C(Group)[T.4]    21.0527      1.323     15.908      0.000      18.440      23.665
==============================================================================
Omnibus:                       43.476   Durbin-Watson:                   1.925
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               93.762
Skew:                          -1.115   Prob(JB):                     4.36e-21
Kurtosis:                       5.809   Cond. No.                         4.46
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

income (US$)                   Group       mean       std  size
(0, 1026]                 Low income  59.159000  7.421381    53
(1026, 3995]     Lower-middle income  70.430981  6.097336    52
(3995, 12375]    Upper-middle income  73.728722  6.156826    36
(12375, 40000]           High income  80.211706  1.878559    34

Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
group1 group2 meandiff  p-adj   lower    upper reject
----------------------------------------------------
     1      2   11.272  0.001  8.2217  14.3223   True
     1      3  14.5697  0.001 11.1946  17.9449   True
     1      4  21.0527  0.001  17.619  24.4864   True
     2      3   3.2977 0.0597 -0.0905    6.686  False
     2      4   9.7807  0.001  6.3341  13.2273   True
     3      4    6.483  0.001  2.7458  10.2202   True
----------------------------------------------------
3) Interpretation
The data were taken from the Gapminder data set and grouped by per capita income. The basic question investigated was: ‘Is there any association between ‘per capita income’ and ‘Life expectancy’ in different countries?’ As required by the assignment, the quantitative income data were converted to a categorical variable using the “New World Bank country classifications by income level: 2019-2020”.
The regression output shows a strong association between ‘per capita income’ and ‘Life expectancy’ across countries. An analysis of variance (ANOVA) revealed that ‘low income’ countries have lower life expectancy (mean = 59.159000, s.d. ± 7.421381) than ‘high income’ countries (mean = 80.211706, s.d. ± 1.878559). From the ANOVA results, the null hypothesis of no difference in ‘life expectancy’ between ‘per capita income’ groups can be rejected, with F(3, 171) = 94.24 and p = 4.84e-36.
Model Interpretation for post hoc ANOVA results:
Post hoc comparisons of mean ‘life expectancy’ (Tukey HSD) revealed that all pairs of groups have statistically different mean life expectancy, except for group 2 (Lower-middle income countries) and group 3 (Upper-middle income countries), for which the difference is not significant (p-adj = 0.0597).
Hence it is concluded that there exists a correlation between ‘per capita income category’ and the ‘life expectancy’ for different countries.
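To make the F statistic reported above less of a black box, here is a minimal sketch of how a one-way ANOVA F value is computed from group data. The function name `one_way_f` and the toy numbers are assumptions for illustration; the assignment itself presumably relied on a statistics library rather than hand-written code.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up values for three hypothetical groups.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(toy_groups)
print(round(f_stat, 2))
```

A large F means the group means differ by much more than the scatter within each group would explain, which is the situation described for the income groups.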
milindpkshirsagar · 4 years ago
Peer-graded Assignment: Creating graphs for your data
1) Program
"""
@author: M. P. kshirsagar
"""
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

data_o = pd.read_csv('gapminder.csv', low_memory=False)
type(data_o)
print("Total number of rows")
print(len(data_o))  # number of observations (rows)
print("Total number of columns")
print(len(data_o.columns))  # number of variables (columns)
data_o.head()

#%%
# setting variables to numeric
data_o['income'] = pd.to_numeric(data_o['incomeperperson'], errors='coerce')
data_o['oil'] = pd.to_numeric(data_o['oilperperson'], errors='coerce')
data_o['electricity'] = pd.to_numeric(data_o['relectricperperson'], errors='coerce')
data_o

# Retaining only required columns
data = data_o[['country', 'income', 'oil', 'electricity']]
data.head()
# make a copy of data
data1 = data.copy()

#%%
# count of missing values
print("\n")
print('counts for income with number of missing requested')
c2 = data1['income'].value_counts(sort=False, dropna=False)
print(c2)

print("\n")
print('counts for oil consumption with number of missing requested')
c2 = data1['oil'].value_counts(sort=False, dropna=False)
print(c2)

print("\n")
print('counts for electricity consumption with number of missing requested')
c2 = data1['electricity'].value_counts(sort=False, dropna=False)
print(c2)

# Replacing zero with NaN
data1 = data1.replace(0, np.nan)

# subset data for those countries with all three variable values available
# (dropping missing value rows)
data2 = data1.dropna()

# Exploratory data analysis
data2.describe()

# calculate frequency in bins

# The univariate distribution for Income
plt.figure(figsize=(20, 10))  # Change plot size
sb.distplot(data2['income'], kde=False)
plt.xlabel('Income (US$/year/person)')
plt.title('Per capita Income')
plt.show()

#%%
# The univariate distribution of Oil consumption
plt.figure(figsize=(20, 10))  # Change plot size
sb.distplot(data2['oil'], kde=False)
plt.xlabel('Oil consumption (TOE/year/person)')
plt.title('Oil consumption')
plt.show()

#%%
# The univariate distribution of Electricity consumption
plt.figure(figsize=(20, 10))  # Change plot size
sb.distplot(data2['electricity'], kde=False)
plt.xlabel('Electricity consumption (kWh/year/person)')
plt.title('Electricity consumption')
plt.show()

#%%
# basic scatterplot for Oil consumption and Income
plt.scatter(data2['oil'], data2['income'])
plt.xlabel('Oil consumption (TOE/year/person)')
plt.ylabel('Income (US$/year/person)')
plt.title('Scatterplot for the Association Between Per capita Oil consumption and Per capita Income')
plt.show()

# basic scatterplot for Electricity consumption and Income
plt.scatter(data2['electricity'], data2['income'])
plt.xlabel('Electricity consumption (kWh/year/person)')
plt.ylabel('Income (US$/year/person)')
plt.title('Scatterplot for the Association Between Per capita Electricity consumption and Per capita Income')
plt.show()
#%%
2) Plots and Discussion
2.1) Uni-variate distribution for oil consumption:
[Figure: histogram of per capita oil consumption]
The distribution of per capita oil consumption is highly right-skewed, which shows a wide disparity in oil consumption between countries. The data point on the extreme right corresponds to Singapore, which appears to be an outlier. Most countries are located on the left-hand side, with very low per capita oil consumption.
2.2) Uni-variate distribution for Electricity consumption
[Figure: histogram of per capita electricity consumption]
The distribution of per capita electricity consumption is again highly right-skewed, which shows a wide disparity in electricity consumption between countries. The data point on the extreme right corresponds to the UAE, which appears to be an outlier. This might be because of its very low population and the non-availability of other energy resources. Most countries are located on the left-hand side, with very low per capita electricity consumption (just like oil).
2.3) Uni-variate distribution for per capita Income
[Figure: histogram of per capita income]
The distribution of per capita income is again highly right-skewed, which shows a wide disparity in income between countries. The data point on the extreme right corresponds to Norway. However, the income distribution is less skewed than those of oil and electricity consumption. This might be because factors other than energy consumption also contribute to higher income. Nevertheless, most countries are again located on the left-hand side, with very low per capita income (just like oil and electricity).
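The "more skewed" and "less skewed" comparisons above are made by eye from the histograms. One way to put a number on them is the sample skewness, available in pandas as `skew()`. The sketch below uses made-up columns rather than the gapminder variables, so the values only illustrate how to read the result.

```python
import pandas as pd

# Made-up values: one symmetric column and one with a single very large value.
toy = pd.DataFrame({
    'symmetric': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'right_skewed': [1.0, 1.0, 2.0, 2.0, 3.0, 20.0],
})

# DataFrame.skew() returns the sample skewness of each column.
# Values near 0 indicate symmetry; positive values indicate a longer right tail.
skewness = toy.skew()
print(skewness)
```

Applied to the `income`, `oil` and `electricity` columns of `data2`, the same call would allow the three distributions to be compared directly.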
2.4) Bi-variate plots:
[Figure: scatterplot for the association between per capita oil consumption and per capita income]
The scatter-plot shows a positive correlation between oil consumption and income across countries. If we neglect Singapore (with very high oil consumption), a straight line fits most of the data points. Hence, as a preliminary conclusion, there is a strong association between the oil consumption and the income of a country.
[Figure: scatterplot for the association between per capita electricity consumption and per capita income]
The scatter-plot shows a positive correlation between electricity consumption and income across countries. If we neglect the UAE (with very high electricity consumption), the rest of the data can be represented by a straight line that fits most of the data points. Hence, as a preliminary conclusion, there is a strong association between the electricity consumption and the income of a country.
Collectively, we can conclude that energy consumption is an indicator of the income of a country.
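The remark about neglecting a single extreme country can be checked numerically by computing the Pearson correlation with and without that point. The sketch below uses made-up numbers (the last pair plays the role of the outlier) and `numpy.corrcoef`; it is not a result from the gapminder data.

```python
import numpy as np

# Made-up (consumption, income) pairs; the last point acts as an outlier.
consumption = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
income = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 55.0])

# Pearson correlation with all points included.
r_all = np.corrcoef(consumption, income)[0, 1]
# Pearson correlation after dropping the last (outlying) point.
r_trimmed = np.corrcoef(consumption[:-1], income[:-1])[0, 1]

print(round(r_all, 3), round(r_trimmed, 3))
```

In this toy case the correlation is noticeably higher once the outlying point is removed, which is the effect described for Singapore and the UAE.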
milindpkshirsagar · 4 years ago
Peer-graded Assignment: Making Data Management Decisions
1) Program
"""
@author: M. P. kshirsagar
"""
import pandas as pd
import numpy as np

data_o = pd.read_csv('gapminder.csv', low_memory=False)
type(data_o)
print("Total numbe of rows")
print (len(data_o)) #number of observations (rows)
print("Total numbe of columns")
print (len(data_o.columns)) # number of variables (columns)
data_o.head()

#%%
#setting variables to numeric
data_o['income'] = pd.to_numeric(data_o['incomeperperson'],errors='coerce')
data_o['oil'] = pd.to_numeric(data_o['oilperperson'],errors='coerce')
data_o['electricity'] = pd.to_numeric(data_o['relectricperperson'],errors='coerce')
data_o

#Retaining only required columns
data = data_o[['country','income','oil','electricity']]
data.head()
#make a copy of data
data1 = data.copy()

#%%
# count of missing values
print("\n")
print ('counts for income with number of missing requested')
c2 = data1['income'].value_counts(sort=False, dropna=False)
print(c2)
print("\n")
print ('counts for oil consumption with number of missing requested')
c2 = data1['oil'].value_counts(sort=False, dropna=False)
print(c2)
print("\n")
print ('counts for electricity consumption with number of missing requested')
c2 = data1['electricity'].value_counts(sort=False, dropna=False)
print(c2)

# Replacing zero with NaN
data1 = data1.replace(0, np.nan)

#subset data for those countries with all three variable values available
#(dropping missing value rows)
data2 = data1.dropna()
data2

#%%
# frequency distributions using value_counts with bins
print("\n")
print('Frequency Distribution of Per capita Income ( US$/year/person)')
inc= data2['income'].value_counts(bins=[0,10000,20000,30000,40000,50000,60000,70000
print(inc)
print("\n")
print('Frequency Distribution of Oil consumption per capita (tons/year/person)')
oel= data2['oil'].value_counts(bins=[0,2,3,4,5,6,7,8,9,10,11,12,13], sort=False)
print (oel)
print("\n")
print('Frequency Distribution of Residential electricity consumption, per capita (kWh/
elec= data2['electricity'].value_counts(bins=[0,1000,2000,3000,4000,5000,6000,7000,
print (elec)
#%%
2) Output
Python 3.8.3 (default, Jul 2 2020, 17:28:51) [MSC v.1916 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.
IPython 7.16.1 -- An enhanced Interactive Python.
Total numbe of rows
213
Total numbe of columns
16
counts for income with number of missing requested
NaN 23
8614.120219 1
39972.352768 1
279.180453 1
161.317137 1
..
377.421113 1
2344.896916 1
25306.187193 1
4180.765821 1
25575.352623 1
Name: income, Length: 191, dtype: int64
counts for oil consumption with number of missing requested
NaN 150
1.938654 1
0.726250 1
0.732817 1
1.567527 1
...
0.858962 1
0.394489 1
0.032281 1
0.420095 1
0.812369 1
Name: oil, Length: 64, dtype: int64
counts for electricity consumption with number of missing requested
NaN 77
0.000000 5
1920.962215 1
2826.044873 1
55.794744 1
..
7432.130852 1
351.166594 1
97.246492 1
9.192395 1
1259.392457 1
Name: electricity, Length: 133, dtype: int64
Frequency Distribution of Per capita Income ( US$/year/person)
(-0.001, 5000.0]       24
(5000.0, 10000.0]     11
(10000.0, 15000.0]    4
(15000.0, 20000.0]   3
(20000.0, 25000.0]   2
(25000.0, 30000.0]   8
(30000.0, 35000.0]   4
(35000.0, 40000.0]   5
Name: income, dtype: int64
Frequency Distribution of Oil consumption per capita (tons/year/person)
(-0.001, 2.0] 51
(2.0, 3.0] 5
(3.0, 4.0] 1
(4.0, 5.0] 3
(5.0, 6.0] 0
(6.0, 7.0] 0
(7.0, 8.0] 0
(8.0, 9.0] 0
(9.0, 10.0] 0
(10.0, 11.0] 0
(11.0, 12.0] 0
(12.0, 13.0] 1
Name: oil, dtype: int64
Frequency Distribution of Residential electricity consumption, per capita (kWh/year/person)
(-0.001, 1000.0] 32
(1000.0, 2000.0] 14
(2000.0, 3000.0] 7
(3000.0, 4000.0] 1
(4000.0, 5000.0] 5
(5000.0, 6000.0] 0
(6000.0, 7000.0] 0
(7000.0, 8000.0] 1
(8000.0, 9000.0] 0
(9000.0, 10000.0] 0
(10000.0, 11000.0] 0
(11000.0, 12000.0] 1
Name: electricity, dtype: int64
3) Discussion
Missing values:
In the first variable ‘income per capita’ 23 missing values are reported.
In the second variable ‘oil consumption per capita’ 150 missing values are reported.
In the third variable ‘electricity consumption per capita’ 77 missing values are reported. In addition, 5 entries with a value of ‘zero’ are reported.
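The way missing values and zero entries are counted, and how the program treats zeros as missing via `replace(0, np.nan)`, can be reproduced on a small made-up column. The values below are illustrative and are not the gapminder data.

```python
import numpy as np
import pandas as pd

# Made-up column mixing real values, zeros and one missing entry.
electricity = pd.Series([0.0, 1920.96, np.nan, 55.79, 0.0])

n_missing = electricity.isna().sum()   # entries that are already NaN
n_zero = (electricity == 0).sum()      # entries recorded as zero

# Treat zeros as missing, as the program above does with replace(0, np.nan).
cleaned = electricity.replace(0, np.nan)
n_missing_after = cleaned.isna().sum()

print(n_missing, n_zero, n_missing_after)
```

After the replacement, `dropna()` removes the former zero entries together with the originally missing ones, which is why the subset used for the frequency tables is smaller than the raw data.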
Comments on distribution:
The frequency distribution of per capita income is top-heavy: most countries have a very low per capita income (less than US$10000/year/person).
A similar trend is also observed in energy consumption per capita (oil and electricity both). 
This indicates a possible positive correlation between per capita energy consumption and income of a particular country.
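The frequency tables in the output come from `value_counts` with explicit bin edges. A minimal sketch on made-up incomes shows how the bins are formed, including why the first interval starts at -0.001 rather than 0: pandas lowers the first edge slightly so that values equal to it are included.

```python
import pandas as pd

# Made-up per-capita incomes (US$/year/person).
income = pd.Series([0.0, 500.0, 3000.0, 9000.0, 15000.0])

# Explicit bin edges; sort=False keeps the bins in ascending order
# instead of ordering them by count.
freq = income.value_counts(bins=[0, 10000, 20000], sort=False)
print(freq)

# Share of observations that fall into the lowest bin.
share_lowest = freq.iloc[0] / freq.sum()
print(share_lowest)
```

A top-heavy table is one where this share is large, as is the case for the income, oil and electricity tables above.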
milindpkshirsagar · 4 years ago
Peer-graded Assignment: Running Your First Program
1) My program 
"""
@author: M. P. kshirsagar
"""
import pandas as pd

data = pd.read_csv('gapminder.csv', low_memory=False)
type(data)
print("Total numbe of rows")
print (len(data)) #number of observations (rows)
print("Total numbe of columns")
print (len(data.columns)) # number of variables (columns)
data.head()

#%%
#setting variables to numeric
data['income'] = pd.to_numeric(data['incomeperperson'],errors='coerce')
data['oil'] = pd.to_numeric(data['oilperperson'],errors='coerce')
data['electricity'] = pd.to_numeric(data['relectricperperson'],errors='coerce')

#%%
# frequency distributions using value_counts with bins
print('Frequency Distribution of Per capita Income ( US$/year/person)')
ic= data['income'].value_counts(bins=[0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,110000], sort=False)
print(ic)

print('Frequency Distribution of Oil consumption per capita (tons/year/person)')
oc= data['oil'].value_counts(bins=[0,2,3,4,5,6,7,8,9,10,11,12,13], sort=False)
print (oc)

print('Frequency Distribution of Residential electricity consumption, per capita (kWh/year/person)')
ec= data['electricity'].value_counts(bins=[0,1000,2000,3000,4000,5000,6000,7000,8000,9000,10000,11000,12000],sort=False)
print (ec)
#%%
2) the output that displays three of my variables as frequency tables
Python 3.8.3 (default, Jul  2 2020, 17:28:51) [MSC v.1916 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.
IPython 7.16.1 -- An enhanced Interactive Python.
In [1]: runfile('C:/Users/Admin/Desktop/PS_DS_ML_AI/Coursera/Data Management and Visualization/Resources/Assignment 1.py', wdir='C:/Users/Admin/Desktop/PS_DS_ML_AI/Coursera/Data Management and Visualization/Resources')
Total numbe of rows
213
Total numbe of columns
16
Frequency Distribution of Per capita Income ( US$/year/person)
(-0.001, 10000.0]       143
(10000.0, 20000.0]       17
(20000.0, 30000.0]       14
(30000.0, 40000.0]       12
(40000.0, 50000.0]        0
(50000.0, 60000.0]        1
(60000.0, 70000.0]        1
(70000.0, 80000.0]        0
(80000.0, 90000.0]        1
(90000.0, 100000.0]       0
(100000.0, 110000.0]      1
Name: income, dtype: int64
Frequency Distribution of Oil consumption per capita (tons/year/person)
(-0.001, 2.0]    51
(2.0, 3.0]        6
(3.0, 4.0]        1
(4.0, 5.0]        3
(5.0, 6.0]        0
(6.0, 7.0]        1
(7.0, 8.0]        0
(8.0, 9.0]        0
(9.0, 10.0]       0
(10.0, 11.0]      0
(11.0, 12.0]      0
(12.0, 13.0]      1
Name: oil, dtype: int64
Frequency Distribution of Residential electricity consumption, per capita (kWh/year/person)
(-0.001, 1000.0]      91
(1000.0, 2000.0]      22
(2000.0, 3000.0]      12
(3000.0, 4000.0]       2
(4000.0, 5000.0]       5
(5000.0, 6000.0]       0
(6000.0, 7000.0]       0
(7000.0, 8000.0]       2
(8000.0, 9000.0]       1
(9000.0, 10000.0]      0
(10000.0, 11000.0]     0
(11000.0, 12000.0]     1
Name: electricity, dtype: int64
3) a few sentences describing my frequency distributions 
The income disparity is clearly visible in the ‘Frequency Distribution of Per capita Income ( US$/year/person)’. As one can see, the majority of nations have a per capita income of less than 10000 US$/year/person.
At first glance, the disparities in income, oil consumption and electricity consumption seem to be correlated. All three variables follow the same top-heavy frequency distribution.
There is a possibility of a positive correlation between oil and electricity consumption and per capita GDP.
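The statement that the majority of nations fall below 10000 US$/year/person can be made concrete with a few lines of arithmetic on the income table printed above. The counts are copied from that output; nothing else is assumed.

```python
# Counts copied from the income frequency table above (11 bins of width 10000).
income_counts = [143, 17, 14, 12, 0, 1, 1, 0, 1, 0, 1]

total = sum(income_counts)                # countries counted in the table
share_lowest = income_counts[0] / total   # fraction in the lowest bin

print(total, f"{share_lowest:.1%}")
```

The table covers 190 of the 213 rows in the data set, and about 75.3% of those countries are in the lowest income bin.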