panjinkhoma-blog · 7 years ago

Data Analysis Tools Week 4

ANOVA

I chose to work with the gapminder dataset

Python Code

import numpy
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import seaborn
import matplotlib.pyplot as plt

# Load gapminder dataset and replace blank values with NaN
data = pd.read_csv("gapminder.csv", low_memory=False, na_values=" ")

# Bug fix for display formats to avoid run time errors
pd.set_option("display.float_format", lambda x: "%f" % x)

# Show number of rows and columns
print(len(data))          # number of observations (rows)
print(len(data.columns))  # number of variables (columns)

# Convert the variables used below to numeric
data["incomeperperson"] = pd.to_numeric(data["incomeperperson"], errors="coerce")
data["lifeexpectancy"] = pd.to_numeric(data["lifeexpectancy"], errors="coerce")
data["co2emissions"] = pd.to_numeric(data["co2emissions"], errors="coerce")

# To run ANOVA I take life expectancy (grouped) as the independent or X variable
# and income per person as the dependent variable
# Create a new variable by splitting lifeexpectancy into 2 groups (25-55, 55-100)

data["life"] = pd.cut(data.lifeexpectancy, [25, 55, 100], labels=["short", "long"])

print("Counts (Frequencies) for life")
c1 = data["life"].value_counts().sort_index(ascending=True)
print(c1)

model1 = smf.ols(formula='incomeperperson ~ C(life)', data=data).fit()
print(model1.summary())

sub1 = data[['incomeperperson', 'life']].dropna()

print("means for income per person by lifeexpectancy short vs. long")
m1 = sub1.groupby('life').mean()
print(m1)

print("standard deviation for mean income per person by life short vs. long")
st1 = sub1.groupby('life').std()
print(st1)

# bivariate bar graph
seaborn.factorplot(x="life", y="incomeperperson", data=data, kind="bar", ci=None)
plt.xlabel('life Expectancy')
plt.ylabel('Mean Income Per Person')

# I want to moderate for CO2 emissions and see whether it has an effect on the
# relationship between life expectancy and income per person
# Convert co2emissions to low and high (132000 - 5000000000, 5000000000 - 340000000000)

data["co2"] = pd.cut(data.co2emissions, [132000, 5000000000, 340000000000], labels=["low", "high"])

sub2 = data[['incomeperperson', 'life', 'co2']].dropna()

sub3 = data[(data['co2'] == 'low')]
sub4 = data[(data['co2'] == 'high')]

print('association between life expectancy and income per person for those in low carbon dioxide emmission countries')
model2 = smf.ols(formula='incomeperperson ~ C(life)', data=sub3).fit()
print(model2.summary())

print('association between life expectancy and income Per Person for those in high carbon dioxide emmission countries')
model3 = smf.ols(formula='incomeperperson ~ C(life)', data=sub4).fit()
print(model3.summary())

print("means for incomeperperson by life short vs. long for low")
m3 = sub3.groupby('life').mean()
print(m3)
print("Means for incomeperperson by life short vs. long for high")
m4 = sub4.groupby('life').mean()
print(m4)
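The two pd.cut calls in the code above depend on how pandas treats bin edges: by default each bin is open on the left and closed on the right, so a value equal to the lowest edge, or above the highest one, is left ungrouped (NaN). The sketch below uses made-up emission values (an assumption, not rows from gapminder.csv) with the same edges as the co2 split to show the effect, and include_lowest=True as one possible way to keep the lowest edge.

```python
import pandas as pd

# Illustrative values only (assumed, not taken from gapminder.csv)
values = pd.Series([132000.0, 1.0e9, 5.0e9, 3.4e11, 4.0e11])
bins = [132000, 5000000000, 340000000000]

# Default behaviour: bins are (132000, 5e9] and (5e9, 3.4e11]
default_cut = pd.cut(values, bins, labels=["low", "high"])

# include_lowest=True makes the first bin closed on the left as well
inclusive_cut = pd.cut(values, bins, labels=["low", "high"], include_lowest=True)

print(pd.DataFrame({"value": values,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```

Counting the NaN entries of a grouped column (for example with isna().sum()) is a quick check that no country was dropped by the binning step.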

Output

213
16
Counts (Frequencies) for life
short     24
long     167
Name: life, dtype: int64
                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.044
Method:                 Least Squares   F-statistic:                     8.989
Date:                Sun, 29 Apr 2018   Prob (F-statistic):            0.00311
Time:                        18:21:20   Log-Likelihood:                -1875.5
No. Observations:                 176   AIC:                             3755.
Df Residuals:                     174   BIC:                             3761.
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        1148.3985   2203.227      0.521      0.603   -3200.092    5496.889
C(life)[T.long]  7061.7667   2355.349      2.998      0.003    2413.035    1.17e+04
==============================================================================
Omnibus:                       69.319   Durbin-Watson:                   1.685
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              155.486
Skew:                           1.827   Prob(JB):                     1.72e-34
Kurtosis:                       5.803   Cond. No.                         5.49
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for income per person by lifeexpectancy short vs. long
       incomeperperson
life
short      1148.398518
long       8210.165256
standard deviation for mean income per person by life short vs. long
       incomeperperson
life
short      2010.119040
long      10995.264363
association between life expectancy and income per person for those in low carbon dioxide emmission countries
                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.043
Model:                            OLS   Adj. R-squared:                  0.036
Method:                 Least Squares   F-statistic:                     6.321
Date:                Sun, 29 Apr 2018   Prob (F-statistic):             0.0131
Time:                        18:21:20   Log-Likelihood:                -1517.1
No. Observations:                 143   AIC:                             3038.
Df Residuals:                     141   BIC:                             3044.
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        1051.1692   2207.143      0.476      0.635   -3312.201    5414.539
C(life)[T.long]  5983.3357   2379.830      2.514      0.013    1278.575    1.07e+04
==============================================================================
Omnibus:                       79.735   Durbin-Watson:                   1.695
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              262.292
Skew:                           2.269   Prob(JB):                     1.11e-57
Kurtosis:                       7.841   Cond. No.                         5.17
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
association between life expectancy and income Per Person for those in high carbon dioxide emmission countries
                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.028
Model:                            OLS   Adj. R-squared:                 -0.011
Method:                 Least Squares   F-statistic:                    0.7260
Date:                Sun, 29 Apr 2018   Prob (F-statistic):              0.402
Time:                        18:21:20   Log-Likelihood:                -290.77
No. Observations:                  27   AIC:                             585.5
Df Residuals:                      25   BIC:                             588.1
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        3745.6499    1.2e+04      0.313      0.757   -2.09e+04    2.84e+04
C(life)[T.long]  1.038e+04   1.22e+04      0.852      0.402   -1.47e+04    3.55e+04
==============================================================================
Omnibus:                        2.909   Durbin-Watson:                   1.900
Prob(Omnibus):                  0.233   Jarque-Bera (JB):                2.251
Skew:                           0.566   Prob(JB):                        0.324
Kurtosis:                       2.151   Cond. No.                         10.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
means for incomeperperson by life short vs. long for low
       incomeperperson  alcconsumption  armedforcesrate  breastcancerper100th  \
life
short      1051.169165        4.981364         0.565333             19.681818
long       7034.504879        6.269350         1.630469             37.226271

          co2emissions  femaleemployrate  hivrate  internetuserate  \
life
short 191539166.666667         55.709091 7.135000         6.072571
long  722409047.263682         46.801626 0.925051        34.148252

       lifeexpectancy  oilperperson  polityscore  relectricperperson  \
life
short       50.879000           nan     1.545455          161.251591
long        71.651843      1.596282     3.292453         1177.558105

       suicideper100th  employrate  urbanrate
life
short        11.583359   65.122727  34.001818
long          9.264804   58.337398  56.234656
Means for incomeperperson by life short vs. long for high
       incomeperperson  alcconsumption  armedforcesrate  breastcancerper100th  \
life
short      3745.649852       10.160000         0.331863             35.000000
long      14125.192877        9.603077         1.426218             51.355556

            co2emissions  femaleemployrate   hivrate  internetuserate  \
life
short 14609848000.000000         34.299999 17.800000        12.334893
long  32802046172.839485         45.340740  0.301250        53.845072

       lifeexpectancy  oilperperson  polityscore  relectricperperson  \
life
short       52.797000      0.504659     9.000000          920.137600
long        76.212889      1.340801     5.814815         1437.614364

       suicideper100th  employrate  urbanrate
life
short        15.714571   41.099998  60.740000
long         10.859063   56.462963  73.145185

Comments

Is income per person associated with life expectancy for those countries with low or high co2 emission?

Results show that income per person is associated with life expectancy for countries with low CO2 emissions (F = 6.321, p = 0.0131).

However, the association is not statistically significant for countries with high CO2 emissions (F = 0.7260, p = 0.402). Note that this subgroup contains only 27 observations.
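Because the explanatory variable has only two levels (short vs. long), the F-statistic of each one-way ANOVA above is the square of the t statistic on the C(life)[T.long] coefficient; in the first summary, 2.998 squared is about 8.99, in line with F = 8.989. The sketch below checks that identity on synthetic data, using scipy.stats.f_oneway and the pooled-variance scipy.stats.ttest_ind. The group sizes, means and spreads are assumptions chosen only to loosely resemble the output, not the gapminder values themselves.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic income values for two life expectancy groups (assumed numbers)
short = rng.normal(loc=1000.0, scale=2000.0, size=24)
long_ = rng.normal(loc=8000.0, scale=11000.0, size=150)

# One-way ANOVA with two groups
f_stat, p_anova = stats.f_oneway(short, long_)

# Two-sample t test with pooled variance (equal_var=True is the default)
t_stat, p_ttest = stats.ttest_ind(short, long_)

print("F = %.4f, t squared = %.4f" % (f_stat, t_stat ** 2))
print("p (ANOVA) = %.6f, p (t test) = %.6f" % (p_anova, p_ttest))
```

With three or more groups the two tests no longer coincide, and a post hoc comparison would be needed to see which groups differ.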

CHI SQUARE

I chose to work with the gapminder dataset

Python Code

import pandas as pd
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
import itertools
from scipy import stats

data = pd.read_csv("gapminder.csv", low_memory=False, na_values=" ")

# Bug fix for display formats to avoid run time errors
pd.set_option("display.float_format", lambda x: "%f" % x)

# setting variables I will be working with to numeric
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['suicideper100th'] = pd.to_numeric(data['suicideper100th'], errors='coerce')
data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')

# Data management for Alcohol Consumption

data["alcconsumption_GRPS"] = pd.cut(data.alcconsumption, bins=[0, 12, 24], labels=["low", "high"])

print("Counts (Frequencies) for Alcohol Consumption_GRPS")
c1 = data["alcconsumption_GRPS"].value_counts(sort=False, dropna=True)
print(c1)

# Data management for Suicide Per 100th

data["suicideper100th_GRPS"] = pd.cut(data.suicideper100th, bins=[0, 8, 38], labels=["0", "1"])

print("Counts (Frequencies) for suicideper100th_GRPS")
c2 = data["suicideper100th_GRPS"].value_counts(sort=False, dropna=True)
print(c2)

# contingency table of observed counts
ct1 = pd.crosstab(data['suicideper100th_GRPS'], data['alcconsumption_GRPS'])
print(ct1)

sub1 = data.copy()

# Create dataframe containing "alcconsumption_GRPS" and "suicideper100th_GRPS", where
# alcconsumption is modified to be 'low' and 'high'
sub2 = sub1[['alcconsumption_GRPS', 'suicideper100th_GRPS', 'employrate']].dropna()

# contingency table of observed counts
ct1 = pd.crosstab(sub2['alcconsumption_GRPS'], sub2['suicideper100th_GRPS'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

# set variable types
sub2["alcconsumption_GRPS"] = sub2["alcconsumption_GRPS"].astype('category')
sub2['suicideper100th_GRPS'] = pd.to_numeric(sub2['suicideper100th_GRPS'], errors='coerce')

# graph percent with alcohol consumption within each suicide frequency group
seaborn.factorplot(x="alcconsumption_GRPS", y="suicideper100th_GRPS", data=sub2, kind="bar", ci=None)
plt.xlabel('Levels of Drinking')
plt.ylabel('Proportion suicide')

# I want to moderate for employrate and see whether it has an effect on the
# relationship between suicide and alcohol consumption
# Convert employrate to low and high (32-58, 58-83)

print("Describe Employment Rate")
desc1 = data["employrate"].describe()
print(desc1)

sub2["employrate_GRPS"] = pd.cut(data.employrate, bins=[32, 58, 83], labels=["0", "1"])
sub2['employrate_GRPS'] = pd.to_numeric(sub2['employrate_GRPS'], errors='coerce')

print("Counts (Frequencies) for employrate")
c3 = sub2["employrate_GRPS"].value_counts(sort=False, dropna=True)
print(c3)

sub3 = sub2[(sub2['employrate_GRPS'] == 0)]
sub4 = sub2[(sub2['employrate_GRPS'] == 1)]

print('association between level of alcohol consumption and suicide rate for those countries with low employment rate')
# contingency table of observed counts
ct2 = pd.crosstab(sub3['alcconsumption_GRPS'], sub3['suicideper100th_GRPS'])
print(ct2)

# column percentages for the low employment rate table
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

print('association between level of alcohol consumption and suicide rate for those countries with high employment rate')
# contingency table of observed counts
ct2 = pd.crosstab(sub4['alcconsumption_GRPS'], sub4['suicideper100th_GRPS'])
print(ct2)

# column percentages for the high employment rate table
colsum = ct2.sum(axis=0)
colpct = ct2 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs2 = scipy.stats.chi2_contingency(ct2)
print(cs2)

# graph percent with alcohol consumption within each suicide frequency group
seaborn.factorplot(x="alcconsumption_GRPS", y="suicideper100th_GRPS", data=sub3, kind="bar", ci=None)
plt.xlabel('Levels of Drinking')
plt.ylabel('Proportion suicide')
plt.title('association between level of drinking and suicide rate for countries WITH low employment rate')

# graph percent with alcohol consumption within each suicide frequency group
seaborn.factorplot(x="alcconsumption_GRPS", y="suicideper100th_GRPS", data=sub4, kind="bar", ci=None)
plt.xlabel('Levels of Drinking')
plt.ylabel('Proportion suicide')
plt.title('association between level of drinking and suicide rate for countries WITH high employment rate')

Output

Counts (Frequencies) for Alcohol Consumption_GRPS
low     155
high     32
Name: alcconsumption_GRPS, dtype: int64
Counts (Frequencies) for suicideper100th_GRPS
0     88
1    103
Name: suicideper100th_GRPS, dtype: int64
alcconsumption_GRPS   low  high
suicideper100th_GRPS
0                      81     4
1                      72    28
suicideper100th_GRPS   0   1
alcconsumption_GRPS
low                   67  70
high                   4  25
suicideper100th_GRPS        0        1
alcconsumption_GRPS
low                  0.943662 0.736842
high                 0.056338 0.263158
chi-square value, p value, expected counts
(10.66289761300192, 0.0010930602193584638, 1, array([[ 58.59638554,  78.40361446],
       [ 12.40361446,  16.59638554]]))
Describe Employment Rate
count   178.000000
mean     58.635955
std      10.519454
min      32.000000
25%      51.225000
50%      58.699999
75%      64.975000
max      83.199997
Name: employrate, dtype: float64
Counts (Frequencies) for employrate
0.000000    74
1.000000    90
Name: employrate_GRPS, dtype: int64
association between level of alcohol consumption and suicide rate for those countries with low employment rate
suicideper100th_GRPS   0   1
alcconsumption_GRPS
low                   32  20
high                   2  20
suicideper100th_GRPS        0        1
alcconsumption_GRPS
low                  0.943662 0.736842
high                 0.056338 0.263158
chi-square value, p value, expected counts
(15.075911404771702, 0.00010327277630028857, 1, array([[ 23.89189189,  28.10810811],
       [ 10.10810811,  11.89189189]]))
association between level of alcohol consumption and suicide rate for those countries with high employment rate
suicideper100th_GRPS   0   1
alcconsumption_GRPS
low                   35  49
high                   2   4
suicideper100th_GRPS        0        1
alcconsumption_GRPS
low                  0.943662 0.736842
high                 0.056338 0.263158
chi-square value, p value, expected counts
(0.0008195527063451476, 0.97716141526886258, 1, array([[ 34.53333333,  49.46666667],
       [  2.46666667,   3.53333333]]))
Out[64]: Text(0.5,1,'association between level of drinking and suicide rate for countries WITH high em,ployment rate')

Comments

Does employment rate affect the relationship between alcohol consumption and suicide rate?

The relationship between alcohol consumption and suicide rate is significant for countries with a low employment rate (chi-square = 15.08, p < 0.001).

However, it is not significant for countries with a high employment rate (chi-square = 0.0008, p = 0.977).
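The recorded output prints the same column percentages (0.943662, 0.736842, ...) under all three tables, which is what happens when the percentages are computed from the full-sample table rather than from each stratified one. As a rough sketch, the observed counts printed in the output can be re-entered by hand to get per-stratum column percentages and to re-run scipy.stats.chi2_contingency. The dictionary keys and variable names here are my own, and the counts are copied from the output rather than recomputed from gapminder.csv.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the output above.
# Rows: alcohol consumption (low, high); columns: suicide group (0, 1)
tables = {
    "all countries": np.array([[67, 70], [4, 25]]),
    "low employment rate": np.array([[32, 20], [2, 20]]),
    "high employment rate": np.array([[35, 49], [2, 4]]),
}

results = {}
for name, observed in tables.items():
    # Column percentages: each column divided by its own total
    col_pct = observed / observed.sum(axis=0)
    # For 2x2 tables SciPy applies Yates' continuity correction by default
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    results[name] = {"col_pct": col_pct, "chi2": chi2, "p": p,
                     "min_expected": expected.min()}
    print(name)
    print(col_pct.round(6))
    print("chi-square = %.4f, p = %.5f, smallest expected count = %.2f"
          % (chi2, p, expected.min()))
```

Two cautions apply to the high employment rate table: its smallest expected count is below 5, so the chi-square approximation is weak there, and its statistic may differ slightly between SciPy versions because of how the continuity correction is handled when observed and expected counts are less than 0.5 apart.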

PEARSON CORRELATION

I chose to work with the gapminder dataset

Python Code

import pandas as pd
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt

data = pd.read_csv("gapminder.csv", low_memory=False, na_values=" ")

# Bug fix for display formats to avoid run time errors
pd.set_option("display.float_format", lambda x: "%f" % x)

data_clean=data.dropna()

# Pearson correlation for association between alcohol consumption and suicide rate
print('association between alcconsumption and suicideper100th')
print(scipy.stats.pearsonr(data_clean['alcconsumption'], data_clean['suicideper100th']))

print("Describe Employment Rate")
desc1 = data["employrate"].describe()
print(desc1)

# Group countries by employment rate using the quartiles printed above.
# Rates in (58.699, 64.975] match no branch, so those rows get a missing group.
def employgrp(row):
    if row['employrate'] <= 51.225:
        return 1
    elif row['employrate'] <= 58.699:
        return 2
    elif row['employrate'] > 64.975:
        return 3

data_clean['employgrp'] = data_clean.apply(lambda row: employgrp(row), axis=1)

chk1 = data_clean['employgrp'].value_counts(sort=False, dropna=False)
print(chk1)

sub1 = data_clean[(data_clean['employgrp'] == 1)]
sub2 = data_clean[(data_clean['employgrp'] == 2)]
sub3 = data_clean[(data_clean['employgrp'] == 3)]

print('association between alcconsumption and suicideper100th for LOW employrate countries')
print(scipy.stats.pearsonr(sub1['alcconsumption'], sub1['suicideper100th']))
print(' ')
print('association between alcconsumption and suicideper100th for MIDDLE employrate countries')
print(scipy.stats.pearsonr(sub2['alcconsumption'], sub2['suicideper100th']))
print(' ')
print('association between alcconsumption and suicideper100th for HIGH employrate countries')
print(scipy.stats.pearsonr(sub3['alcconsumption'], sub3['suicideper100th']))

scat1 = seaborn.regplot(x="alcconsumption", y="suicideper100th", data=sub1)
plt.xlabel('alcconsumption')
plt.ylabel('suicideper100th')
plt.title('Scatterplot for the Association Between alcohol consumption and suicide per 100th for LOW employrate countries')
print(scat1)
plt.show()

scat2 = seaborn.regplot(x="alcconsumption", y="suicideper100th", data=sub2)
plt.xlabel('alcconsumption')
plt.ylabel('suicideper100th')
plt.title('Scatterplot for the Association Between alcohol consumption and suicide per 100th for MEDIUM employrate countries')
print(scat2)
plt.show()

scat3 = seaborn.regplot(x="alcconsumption", y="suicideper100th", data=sub3)
plt.xlabel('alcconsumption')
plt.ylabel('suicideper100th')
plt.title('Scatterplot for the Association Between alcohol consumption and suicide per 100th for HIGH employrate countries')
print(scat3)
plt.show()

Output

association between alcconsumption and suicideper100th
(0.45834250546091254, 0.00038178766966525383)
Describe Employment Rate
count   178.000000
mean     58.635955
std      10.519454
min      32.000000
25%      51.225000
50%      58.699999
75%      64.975000
max      83.199997
Name: employrate, dtype: float64
1.000000    14
2.000000    15
nan         20
3.000000     7
Name: employgrp, dtype: int64
association between alcconsumption and suicideper100th for LOW employrate countries
(0.56573500213060901, 0.034974841840160878)

association between alcconsumption and suicideper100th for MIDDLE employrate countries
(0.43336812208699116, 0.10658773388707161)

association between alcconsumption and suicideper100th for HIGH employrate countries
(0.089398986518181067, 0.84883731927600881)
__main__:37: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
AxesSubplot(0.125,0.125;0.775x0.755)

Comments

Does employment rate moderate the relationship between alcohol consumption and suicide rate?

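For low employment countries, the relationship is significant (r = 0.566, p = 0.035).

For medium and high employment countries, the relationship is not significant (r = 0.433, p = 0.107 and r = 0.089, p = 0.849, respectively). The groups are small (14, 15 and 7 countries), and 20 countries were not assigned to any group.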
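The group counts in the output include 20 rows labelled nan. That is consistent with the grouping function, whose branches cover rates up to 58.699 and above 64.975 but nothing in between. The sketch below reproduces the gap on a few made-up rates and shows one possible alternative using pd.cut with edges that span the whole range. Treating everything between the first and third quartile as the middle group is my assumption about the intent, and the example rates are not gapminder rows.

```python
import pandas as pd

def employgrp_as_posted(rate):
    """Same thresholds as the posted function, applied to a single rate."""
    if rate <= 51.225:
        return 1
    elif rate <= 58.699:
        return 2
    elif rate > 64.975:
        return 3
    return None  # rates in (58.699, 64.975] fall through every branch

# Illustrative employment rates (assumed values)
rates = pd.Series([40.0, 55.0, 60.0, 64.0, 70.0])
posted_groups = [employgrp_as_posted(r) for r in rates]
print(posted_groups)

# Hypothetical alternative: three bins covering min to max of employrate,
# split at the first and third quartiles from the describe() output
edges = [32.0, 51.225, 64.975, 83.2]
covering = pd.cut(rates, bins=edges, labels=["low", "middle", "high"],
                  include_lowest=True)
print(covering.tolist())
```

Whichever cut points are chosen, checking value_counts(dropna=False) for missing groups before running the subgroup correlations helps confirm that every country was assigned.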

pedro-couto-blr-blog · 8 years ago

Regression Modelling in Practice - Week 3 Assignment E

Final Output

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.800
Model:                            OLS   Adj. R-squared:                  0.795
Method:                 Least Squares   F-statistic:                     148.4
Date:                Sun, 05 Mar 2017   Prob (F-statistic):           9.70e-51
Time:                        22:54:36   Log-Likelihood:                -1508.8
No. Observations:                 153   AIC:                             3028.
Df Residuals:                     148   BIC:                             3043.
Df Model:                           4
Covariance Type:            nonrobust
=============================================================================================
                                coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                  2257.0700    619.225      3.645      0.000      1033.405  3480.735
internetuserate_c           105.5975     29.360      3.597      0.000        47.579   163.617
I(internetuserate_c ** 2)     6.4900      0.626     10.367      0.000         5.253     7.727
urbanrate_c                  88.3966     25.135      3.517      0.001        38.726   138.067
lifeexpectancy_c            183.6493     69.398      2.646      0.009        46.509   320.789
==============================================================================
Omnibus:                       34.477   Durbin-Watson:                   2.198
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              110.667
Skew:                           0.812   Prob(JB):                     9.31e-25
Kurtosis:                       6.837   Cond. No.                     1.81e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.81e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
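The R-squared and Adj. R-squared lines in the summary are linked by the usual adjustment for the number of predictors, which is useful when comparing models with different numbers of terms. A small sketch of that standard formula follows, fed with the rounded values printed in the final output (R-squared 0.800, 153 observations, 4 model degrees of freedom). The function name is my own and the result is only as precise as those rounded inputs.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the final output above
final_model = adjusted_r2(0.800, n_obs=153, n_predictors=4)
print(round(final_model, 3))
```

The adjustment grows with the number of predictors, so a term that barely raises R-squared can lower the adjusted value.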

Complete Code

import numpy
import pandas
import scipy.stats
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm

#Format errors
pandas.set_option('display.float_format', lambda x:'%f'%x)

#Remove data set limitations
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)

#gapminder csv
data = pandas.read_csv('/Users/carol_novo/Desktop/Data_Analysis/Data_Management/Python/gapminder.csv', low_memory=False)

#Lower case columns names
data.columns = map(str.lower, data.columns)

#Force data conversion to numeric
data["internetuserate"] = pandas.to_numeric(data["internetuserate"], errors="coerce")
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors="coerce")
data["armedforcesrate"] = pandas.to_numeric(data["armedforcesrate"], errors="coerce")
data["co2emissions"] = pandas.to_numeric(data["co2emissions"], errors="coerce")
data["femaleemployrate"] = pandas.to_numeric(data["femaleemployrate"], errors="coerce")
data["urbanrate"] = pandas.to_numeric(data["urbanrate"], errors="coerce")
data["lifeexpectancy"] = pandas.to_numeric(data["lifeexpectancy"], errors="coerce")

#Removing outliers
#data=data[(data['incomeperperson']<=60000)]
data=data[(data['co2emissions']<=3.0e11)]

#Clean the dataset
data=data.dropna()

#Centering the explanatory variables
data['internetuserate_c'] = (data['internetuserate'] - data['internetuserate'].mean())
data['armedforcesrate_c'] = (data['armedforcesrate'] - data['armedforcesrate'].mean())
data['co2emissions_c'] = (data['co2emissions'] - data['co2emissions'].mean())
data['femaleemployrate_c'] = (data['femaleemployrate'] - data['femaleemployrate'].mean())
data['urbanrate_c'] = (data['urbanrate'] - data['urbanrate'].mean())
data['lifeexpectancy_c'] = (data['lifeexpectancy'] - data['lifeexpectancy'].mean())

#Linear regression
reg = smf.ols('incomeperperson ~ internetuserate_c + armedforcesrate_c + co2emissions_c + femaleemployrate_c + urbanrate_c + lifeexpectancy_c', data=data).fit()
print (reg.summary())

#Linear regression
reg = smf.ols('incomeperperson ~ internetuserate_c', data=data).fit()
print (reg.summary())

reg = smf.ols('incomeperperson ~ femaleemployrate_c', data=data).fit()
print (reg.summary())

reg = smf.ols('incomeperperson ~ urbanrate_c', data=data).fit()
print (reg.summary())

reg = smf.ols('incomeperperson ~ internetuserate_c + urbanrate_c', data=data).fit()
print (reg.summary())

reg = smf.ols('incomeperperson ~ internetuserate_c + femaleemployrate_c + urbanrate_c', data=data).fit()
print (reg.summary())

#Linear regression
reg = smf.ols('incomeperperson ~ internetuserate_c + I(internetuserate_c**2)', data=data).fit()
print (reg.summary())

reg = smf.ols('incomeperperson ~ urbanrate_c + I(urbanrate_c**2)', data=data).fit()
print (reg.summary())

#Linear regression
reg = smf.ols('incomeperperson ~ internetuserate_c + I(internetuserate_c**2) + armedforcesrate_c + co2emissions_c + femaleemployrate_c + urbanrate_c + I(urbanrate_c**2) + lifeexpectancy_c', data=data).fit()
print (reg.summary())

#Linear regression
reg = smf.ols('incomeperperson ~ internetuserate_c + I(internetuserate_c**2) + armedforcesrate_c + co2emissions_c + femaleemployrate_c + urbanrate_c + lifeexpectancy_c', data=data).fit()
print (reg.summary())

reg = smf.ols('incomeperperson ~ lifeexpectancy_c', data=data).fit()
print (reg.summary())

#Linear regression
reg = smf.ols('incomeperperson ~ internetuserate_c + I(internetuserate_c**2) + urbanrate_c', data=data).fit()
print (reg.summary())

#Linear regression
reg2 = smf.ols('incomeperperson ~ internetuserate_c + I(internetuserate_c**2) + urbanrate_c + lifeexpectancy_c', data=data).fit()
print (reg2.summary())

#q-q plot
#fig1 = sm.qqplot(reg2.resid, line='r')

#Standard residuals
#stdres = pandas.DataFrame(reg.resid_pearson)
#fig2 = plt.plot(stdres, 'o', ls='None')
#l = plt.axhline(y=0,color='r')

#fig3 = sm.graphics.plot_regress_exog(reg2, "internetuserate_c", fig=plt.figure())
#fig3 = sm.graphics.plot_regress_exog(reg2, "urbanrate_c", fig=plt.figure())
#fig3 = sm.graphics.plot_regress_exog(reg2, "lifeexpectancy_c", fig=plt.figure())

fig4 = sm.graphics.influence_plot(reg2,size=8)
print(fig4)
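Centering the explanatory variables, as the code above does, leaves each slope unchanged but turns the intercept into the mean of the response whenever the model contains only centered linear terms. That is consistent with the intercept 7314.7640 appearing in every summary that has no squared term. The sketch below illustrates this on synthetic data with numpy.polyfit instead of statsmodels; the data, seed and coefficients are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictor and response (assumed values, not gapminder data)
x = rng.uniform(0.0, 100.0, size=200)
y = 50.0 + 3.0 * x + rng.normal(0.0, 5.0, size=200)

# Center the predictor by subtracting its mean
x_c = x - x.mean()

# Degree-1 least squares fits; polyfit returns [slope, intercept]
slope_raw, intercept_raw = np.polyfit(x, y, 1)
slope_c, intercept_c = np.polyfit(x_c, y, 1)

print("slope raw / centered:", slope_raw, slope_c)
print("intercept raw:", intercept_raw)
print("intercept centered:", intercept_c, "mean of y:", y.mean())
```

Once a squared term such as I(internetuserate_c**2) is added, the squared column no longer has mean zero, so the intercept moves away from the response mean, as the later summaries show.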

Complete Output

OLS Regression Results ============================================================================== Dep. Variable: incomeperperson R-squared: 0.674 Model: OLS Adj. R-squared: 0.660 Method: Least Squares F-statistic: 50.28 Date: Sun, 05 Mar 2017 Prob (F-statistic): 3.82e-33 Time: 22:54:36 Log-Likelihood: -1546.3 No. Observations: 153 AIC: 3107. Df Residuals: 146 BIC: 3128. Df Model: 6 Covariance Type: nonrobust ====================================================================================== coef std err t P>|t| [95.0% Conf. Int.] -------------------------------------------------------------------------------------- Intercept 7314.7640 490.815 14.903 0.000 6344.745 8284.783 internetuserate_c 267.2068 31.679 8.435 0.000 204.598 329.816 armedforcesrate_c 153.8284 352.855 0.436 0.664 -543.535 851.192 co2emissions_c 4.374e-08 4.27e-08 1.026 0.307 -4.06e-08 1.28e-07 femaleemployrate_c 100.3340 38.488 2.607 0.010 24.268 176.400 urbanrate_c 73.7339 33.636 2.192 0.030 7.258 140.209 lifeexpectancy_c -24.4754 87.947 -0.278 0.781 -198.290 149.339 ============================================================================== Omnibus: 35.385 Durbin-Watson: 2.386 Prob(Omnibus): 0.000 Jarque-Bera (JB): 94.933 Skew: 0.908 Prob(JB): 2.43e-21 Kurtosis: 6.405 Cond. No. 1.21e+10 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.21e+10. This might indicate that there are strong multicollinearity or other numerical problems. OLS Regression Results ============================================================================== Dep. Variable: incomeperperson R-squared: 0.650 Model: OLS Adj. R-squared: 0.648 Method: Least Squares F-statistic: 280.5 Date: Sun, 05 Mar 2017 Prob (F-statistic): 3.01e-36 Time: 22:54:36 Log-Likelihood: -1551.7 No. Observations: 153 AIC: 3107. Df Residuals: 151 BIC: 3114. Df Model: 1 Covariance Type: nonrobust ===================================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------------- Intercept 7314.7640 499.968 14.630 0.000 6326.928 8302.600 internetuserate_c 299.9326 17.910 16.747 0.000 264.547 335.319 ============================================================================== Omnibus: 26.923 Durbin-Watson: 2.515 Prob(Omnibus): 0.000 Jarque-Bera (JB): 60.502 Skew: 0.740 Prob(JB): 7.28e-14 Kurtosis: 5.702 Cond. No. 27.9 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. OLS Regression Results ============================================================================== Dep. Variable: incomeperperson R-squared: 0.000 Model: OLS Adj. R-squared: -0.007 Method: Least Squares F-statistic: 0.01802 Date: Sun, 05 Mar 2017 Prob (F-statistic): 0.893 Time: 22:54:36 Log-Likelihood: -1632.1 No. Observations: 153 AIC: 3268. Df Residuals: 151 BIC: 3274. Df Model: 1 Covariance Type: nonrobust ====================================================================================== coef std err t P>|t| [95.0% Conf. Int.] -------------------------------------------------------------------------------------- Intercept 7314.7640 845.079 8.656 0.000 5645.057 8984.471 femaleemployrate_c 7.6083 56.672 0.134 0.893 -104.365 119.582 ============================================================================== Omnibus: 64.884 Durbin-Watson: 1.838 Prob(Omnibus): 0.000 Jarque-Bera (JB): 150.022 Skew: 1.893 Prob(JB): 2.65e-33 Kurtosis: 6.034 Cond. No. 14.9 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. OLS Regression Results ============================================================================== Dep. Variable: incomeperperson R-squared: 0.356 Model: OLS Adj. R-squared: 0.352 Method: Least Squares F-statistic: 83.40 Date: Sun, 05 Mar 2017 Prob (F-statistic): 4.09e-16 Time: 22:54:36 Log-Likelihood: -1598.4 No. Observations: 153 AIC: 3201. Df Residuals: 151 BIC: 3207. Df Model: 1 Covariance Type: nonrobust =============================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------- Intercept 7314.7640 678.313 10.784 0.000 5974.554 8654.974 urbanrate_c 281.6570 30.841 9.133 0.000 220.721 342.593 ============================================================================== Omnibus: 57.158 Durbin-Watson: 2.190 Prob(Omnibus): 0.000 Jarque-Bera (JB): 138.162 Skew: 1.591 Prob(JB): 9.97e-31 Kurtosis: 6.399 Cond. No. 22.0 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. OLS Regression Results ============================================================================== Dep. Variable: incomeperperson R-squared: 0.654 Model: OLS Adj. R-squared: 0.649 Method: Least Squares F-statistic: 141.6 Date: Sun, 05 Mar 2017 Prob (F-statistic): 2.84e-35 Time: 22:54:36 Log-Likelihood: -1550.9 No. Observations: 153 AIC: 3108. Df Residuals: 150 BIC: 3117. Df Model: 2 Covariance Type: nonrobust ===================================================================================== coef std err t P>|t| [95.0% Conf. Int.] ------------------------------------------------------------------------------------- Intercept 7314.7640 498.953 14.660 0.000 6328.880 8300.648 internetuserate_c 278.5970 24.522 11.361 0.000 230.143 327.050 urbanrate_c 39.5535 31.125 1.271 0.206 -21.947 101.054 ============================================================================== Omnibus: 29.993 Durbin-Watson: 2.524 Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.057 Skew: 0.819 Prob(JB): 1.01e-15 Kurtosis: 5.855 Cond. No. 32.8 ==============================================================================

Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. OLS Regression Results ============================================================================== Dep. Variable: incomeperperson R-squared: 0.671 Model: OLS Adj. R-squared: 0.665 Method: Least Squares F-statistic: 101.4 Date: Sun, 05 Mar 2017 Prob (F-statistic): 8.12e-36 Time: 22:54:36 Log-Likelihood: -1547.0 No. Observations: 153 AIC: 3102. Df Residuals: 149 BIC: 3114. Df Model: 3 Covariance Type: nonrobust ====================================================================================== coef std err t P>|t| [95.0% Conf. Int.] -------------------------------------------------------------------------------------- Intercept 7314.7640 487.794 14.996 0.000 6350.877 ��8278.651 internetuserate_c 266.0643 24.383 10.912 0.000 217.884 314.245 femaleemployrate_c 99.8275 35.424 2.818 0.005 29.829 169.826 urbanrate_c 73.6839 32.751 2.250 0.026 8.968 138.400 ============================================================================== Omnibus: 33.811 Durbin-Watson: 2.406 Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.434 Skew: 0.884 Prob(JB): 1.70e-19 Kurtosis: 6.230 Cond. No. 33.0 ==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.768
Model:                            OLS   Adj. R-squared:                  0.765
Method:                 Least Squares   F-statistic:                     247.8
Date:                Sun, 05 Mar 2017   Prob (F-statistic):           2.86e-48
Time:                        22:54:36   Log-Likelihood:                -1520.4
No. Observations:                 153   AIC:                             3047.
Df Residuals:                     150   BIC:                             3056.
Df Model:                           2
Covariance Type:            nonrobust
=======================================================================================
                                 coef    std err        t   P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept                   3197.6448    624.638    5.119   0.000   1963.419   4431.871
internetuserate_c            220.4029     17.251   12.776   0.000    186.316    254.490
I(internetuserate_c ** 2)      5.2831      0.606    8.716   0.000      4.085      6.481
==============================================================================
Omnibus:                       25.313   Durbin-Watson:                   2.216
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               78.932
Skew:                           0.563   Prob(JB):                     7.25e-18
Kurtosis:                       6.334   Cond. No.                     1.70e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.7e+03. This might indicate that there are strong multicollinearity or other numerical problems.

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.411
Model:                            OLS   Adj. R-squared:                  0.403
Method:                 Least Squares   F-statistic:                     52.28
Date:                Sun, 05 Mar 2017   Prob (F-statistic):           5.94e-18
Time:                        22:54:36   Log-Likelihood:                -1591.6
No. Observations:                 153   AIC:                             3189.
Df Residuals:                     150   BIC:                             3198.
Df Model:                           2
Covariance Type:            nonrobust
=======================================================================================
                                 coef    std err        t   P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept                   5015.2143    895.524    5.600   0.000   3245.745   6784.684
urbanrate_c                  302.5706     30.119   10.046   0.000    243.058    362.083
I(urbanrate_c ** 2)            4.7538      1.271    3.739   0.000      2.242      7.266
==============================================================================
Omnibus:                       66.361   Durbin-Watson:                   2.128
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              213.352
Skew:                           1.707   Prob(JB):                     4.69e-47
Kurtosis:                       7.671   Cond. No.                         978.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.810
Model:                            OLS   Adj. R-squared:                  0.799
Method:                 Least Squares   F-statistic:                     76.74
Date:                Sun, 05 Mar 2017   Prob (F-statistic):           4.26e-48
Time:                        22:54:36   Log-Likelihood:                -1505.0
No. Observations:                 153   AIC:                             3028.
Df Residuals:                     144   BIC:                             3055.
Df Model:                           8
Covariance Type:            nonrobust
=======================================================================================
                                 coef    std err        t   P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept                   1781.1438    686.353    2.595   0.010    424.516   3137.772
internetuserate_c            105.9222     29.498    3.591   0.000     47.617    164.228
I(internetuserate_c ** 2)      6.0576      0.653    9.282   0.000      4.768      7.348
armedforcesrate_c            101.0620    279.771    0.361   0.718   -451.927    654.051
co2emissions_c               3.96e-08   3.28e-08    1.208   0.229  -2.52e-08   1.04e-07
femaleemployrate_c            15.3682     31.671    0.485   0.628    -47.232     77.968
urbanrate_c                   99.2216     26.057    3.808   0.000     47.718    150.725
I(urbanrate_c ** 2)            1.6805      0.813    2.066   0.041      0.073      3.288
lifeexpectancy_c             176.8532     70.725    2.501   0.014     37.061    316.646
==============================================================================
Omnibus:                       43.451   Durbin-Watson:                   2.128
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              149.642
Skew:                           1.027   Prob(JB):                     3.20e-33
Kurtosis:                       7.388   Cond. No.                     2.20e+10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.2e+10. This might indicate that there are strong multicollinearity or other numerical problems.

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.804
Model:                            OLS   Adj. R-squared:                  0.795
Method:                 Least Squares   F-statistic:                     85.17
Date:                Sun, 05 Mar 2017   Prob (F-statistic):           3.56e-48
Time:                        22:54:36   Log-Likelihood:                -1507.2
No. Observations:                 153   AIC:                             3030.
Df Residuals:                     145   BIC:                             3055.
Df Model:                           7
Covariance Type:            nonrobust
=======================================================================================
                                 coef    std err        t   P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept                   2371.6534    631.003    3.759   0.000   1124.501   3618.805
internetuserate_c            102.4478     29.780    3.440   0.001     43.588    161.307
I(internetuserate_c ** 2)      6.3430      0.645    9.834   0.000      5.068      7.618
armedforcesrate_c            241.8809    274.383    0.882   0.379   -300.425    784.187
co2emissions_c              3.872e-08   3.32e-08    1.168   0.245  -2.68e-08   1.04e-07
femaleemployrate_c            34.2900     30.657    1.118   0.265    -26.303     94.883
urbanrate_c                   93.9412     26.222    3.583   0.000     42.114    145.768
lifeexpectancy_c             181.3226     71.484    2.537   0.012     40.038    322.607
==============================================================================
Omnibus:                       37.123   Durbin-Watson:                   2.185
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              119.950
Skew:                           0.882   Prob(JB):                     8.98e-27
Kurtosis:                       6.963   Cond. No.                     2.00e+10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2e+10. This might indicate that there are strong multicollinearity or other numerical problems.

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.369
Model:                            OLS   Adj. R-squared:                  0.365
Method:                 Least Squares   F-statistic:                     88.27
Date:                Sun, 05 Mar 2017   Prob (F-statistic):           8.50e-17
Time:                        22:54:36   Log-Likelihood:                -1596.8
No. Observations:                 153   AIC:                             3198.
Df Residuals:                     151   BIC:                             3204.
Df Model:                           1
Covariance Type:            nonrobust
=======================================================================================
                                 coef    std err        t   P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept                   7314.7640    671.373   10.895   0.000   5988.265   8641.263
lifeexpectancy_c             653.7658     69.583    9.395   0.000    516.283    791.249
==============================================================================
Omnibus:                       51.449   Durbin-Watson:                   2.142
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              108.214
Skew:                           1.501   Prob(JB):                     3.17e-24
Kurtosis:                       5.822   Cond. No.                         9.65
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.791
Model:                            OLS   Adj. R-squared:                  0.787
Method:                 Least Squares   F-statistic:                     188.0
Date:                Sun, 05 Mar 2017   Prob (F-statistic):           1.97e-50
Time:                        22:54:36   Log-Likelihood:                -1512.3
No. Observations:                 153   AIC:                             3033.
Df Residuals:                     149   BIC:                             3045.
Df Model:                           3
Covariance Type:            nonrobust
=======================================================================================
                                 coef    std err        t   P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept                   2719.1890    605.941    4.488   0.000   1521.841   3916.537
internetuserate_c            156.0236     22.782    6.849   0.000    111.007    201.041
I(internetuserate_c ** 2)      5.8970      0.596    9.891   0.000      4.719      7.075
urbanrate_c                  102.2168     25.077    4.076   0.000     52.664    151.770
==============================================================================
Omnibus:                       32.888   Durbin-Watson:                   2.255
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              110.159
Skew:                           0.752   Prob(JB):                     1.20e-24
Kurtosis:                       6.876   Cond. No.                     1.73e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.73e+03. This might indicate that there are strong multicollinearity or other numerical problems.

                            OLS Regression Results
==============================================================================
Dep. Variable:        incomeperperson   R-squared:                       0.800
Model:                            OLS   Adj. R-squared:                  0.795
Method:                 Least Squares   F-statistic:                     148.4
Date:                Sun, 05 Mar 2017   Prob (F-statistic):           9.70e-51
Time:                        22:54:36   Log-Likelihood:                -1508.8
No. Observations:                 153   AIC:                             3028.
Df Residuals:                     148   BIC:                             3043.
Df Model:                           4
Covariance Type:            nonrobust
=======================================================================================
                                 coef    std err        t   P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept                   2257.0700    619.225    3.645   0.000   1033.405   3480.735
internetuserate_c            105.5975     29.360    3.597   0.000     47.579    163.617
I(internetuserate_c ** 2)      6.4900      0.626   10.367   0.000      5.253      7.727
urbanrate_c                   88.3966     25.135    3.517   0.001     38.726    138.067
lifeexpectancy_c             183.6493     69.398    2.646   0.009     46.509    320.789
==============================================================================
Omnibus:                       34.477   Durbin-Watson:                   2.198
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              110.667
Skew:                           0.812   Prob(JB):                     9.31e-25
Kurtosis:                       6.837   Cond. No.                     1.81e+03
==============================================================================
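
A note on the condition-number warnings: the values 2.2e+10 and 2e+10 appear only for the two models that include co2emissions_c, whose coefficient is on the order of 1e-08. That pattern suggests the warning may reflect the raw scale of that variable rather than strong multicollinearity, although the output alone cannot confirm it. The sketch below uses synthetic data (not the gapminder values) and only NumPy; the variable names and the chosen spreads are assumptions made for illustration. It computes the 2-norm condition number of a design matrix before and after rescaling one column.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 153  # same number of observations as the outputs above; the data itself is synthetic

# Hypothetical stand-ins for two centered, independent predictors on very different scales.
internet_c = rng.normal(0.0, 28.0, n)   # assumed percentage-point scale
co2_c = rng.normal(0.0, 3.0e10, n)      # assumed raw-emissions scale


def design_condition_number(*columns):
    """Return the 2-norm condition number of the design matrix [1, columns...]."""
    X = np.column_stack([np.ones(len(columns[0])), *columns])
    return np.linalg.cond(X)


cond_raw = design_condition_number(internet_c, co2_c)
cond_rescaled = design_condition_number(internet_c, co2_c / co2_c.std())

print(f"co2 on raw scale:     {cond_raw:.3g}")
print(f"co2 divided by its SD: {cond_rescaled:.3g}")
```

The two predictors here are generated independently, so any large condition number in the first case comes from scale alone. Rescaling a predictor by a constant changes its coefficient by the same factor but leaves R-squared and the t statistics unchanged, so refitting with co2emissions expressed in larger units would be one way to test this explanation on the real data.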
