jcvallier
Data Analysis - Coursera
19 posts
Specialization Blog 
jcvallier · 7 years ago
Text
Data Analysis Capstone - Week 3
This week I am discussing my analysis of Tanzanian water wells.
Analysis
After fitting the random forest algorithm, the following 10 variables were ranked by importance using scikit-learn's feature importance function (use the feature table to reference each variable name and description); see the chart below. The top four variables were: (9) gps_height, (5) quantity, (2) lga, and (3) construction_year.
Tumblr media
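Below is a minimal sketch of how such a ranking can be pulled from a fitted scikit-learn model; the names rf, X_train, and y_train are placeholders for illustration, not the exact objects used in my notebook.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# assumes X_train (features) and y_train (well status labels) already exist
rf = RandomForestClassifier(n_estimators=125, random_state=111)
rf.fit(X_train, y_train)

# pair each feature with its importance and list the top ten
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))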
Gps_height variable:
The following is a density plot of gps_height. The target classes of functional, functional needs repair, and non functional all have similar plots, with a spike at zero and another near 1,500 meters (units assumed).
This indicates that gps_height alone is not sufficient to distinguish between the target classes. Given that gps_height is the most important feature, there is probably a strong relationship with other variables in the dataset.
Tumblr media
Quantity variable:
For the quantity variable, there are more non functional dry wells than functional or functional needs repair dry wells. It is also of note that "enough" is the largest quantity count within each of the functional, functional needs repair, and non functional classes. It is important to note that a well is roughly twice as likely to be functional as non functional when its quantity is "enough".
Tumblr media
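As a rough illustration, counts like these can be tabulated with a crosstab; this sketch assumes the training data sits in a DataFrame named df with columns quantity and status_group, which are placeholders for the actual names.

import pandas as pd

# assumed column names for illustration
counts = pd.crosstab(df['quantity'], df['status_group'])
print(counts)
# row-wise proportions make the "enough" vs "dry" comparison easier to read
print(counts.div(counts.sum(axis=1), axis=0))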
Lga variable:
Lga is a geographical territory. No single region had a disproportionate share of non functional wells among the top ten; see the table below.
Tumblr media
Construction_year variable:
The range of dates runs from 1960 through 2013 (excluding 20,710 null values). This variable reveals that the vast majority of wells were set up in the last decade. It is interesting to see that there are more significant spikes in functional needs repair and non functional wells around 1980.
Tumblr media
jcvallier · 7 years ago
Text
Data Analysis - Capstone Week 2
Methods
Sample:
Data was collected from 2002 to 2013 by Taarifa and the Tanzanian Ministry of Water on water pumps throughout Tanzania. The data collection was designed to understand which waterpoints will fail, so that maintenance operations can be improved and clean, potable water can be kept available to local communities. The wells are classified into three categories: functional, functional needs repair, and not functional.
Variables:
The following variables were selected to train and test the model.
Tumblr media
The following are the target variables, where 'not functional' is the key target to be predicted for feature importance.
Tumblr media
Statistical Analyses:
Training & Test Split with cross validation
After the key variables were selected above, the remaining dataset was split into a training set and a test set.  The dataset training to test split was 75% and 25%, respectively.
A 10-fold cross-validation was run on 8 features with the sequential feature selection greedy algorithm. This was done to observe which features contributed to the overall accuracy score.
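A minimal sketch of this step, assuming the mlxtend implementation of sequential feature selection; the names X_train and y_train are placeholders.

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.ensemble import RandomForestClassifier

# greedy forward selection of 8 features, scored with 10-fold cross-validation
sfs = SFS(RandomForestClassifier(n_estimators=125, random_state=111),
          k_features=8, forward=True, floating=False,
          scoring='accuracy', cv=10)
sfs = sfs.fit(X_train, y_train)
print(sfs.k_feature_names_)   # features kept
print(sfs.k_score_)           # cross-validated accuracy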
Random Forest Algorithm Explained:
Random forest is an ensemble approach built from decision trees. The ensemble method takes a group of weak learners that together form a strong learner.
A decision tree, which in ensemble terms corresponds to our weak learner, takes the data entered at the top and traverses it down the tree, bucketing it into smaller and smaller sets until the target cannot be split any further.
In the example below, the tree advises us, based upon weather conditions, whether to play ball. For example, if the outlook is sunny and the humidity is less than or equal to 70, then it's probably OK to play. See the diagram below.
Tumblr media
As shown in the diagram below, a random forest is a collection of decision trees where the data are randomly sampled into subsets and a decision tree is grown for each random split.
Tumblr media
Once the trees have been grown, their predictions are combined: a majority vote (the mode) is used for a categorical target and the mean for a quantitative one. Thanks to the ensemble method and its built-in randomness, the result is a stronger predictive model that is less prone to overfitting the data.
See the example below, where each decision tree is shown as a gray curve. Each gray curve (a weak learner) is a fair approximation to the underlying data. The red curve (the ensemble “strong learner”) can be seen to be a much better approximation to the underlying data.
Tumblr media
Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way, where features which produce large values for this score are ranked as more important than features which produce small values.
This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels. Growing unbiased trees or training on a balanced data set can help with this problem.
How random forest is applied to the dataset:
When setting the parameters of the random forest, 125 trees were selected for a robust, generalized score. Additionally, the class weight was set to 'balanced' because the key target of 'not functional' was under-represented in the training set; if left unadjusted, extra weight in prediction would effectively be given to the other categories of 'functional' and 'functional needs repair'.
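A minimal sketch of this configuration with scikit-learn; the variable names are placeholders.

from sklearn.ensemble import RandomForestClassifier

# 125 trees; 'balanced' reweights classes inversely to their frequency,
# so the under-represented 'not functional' class is not drowned out
rf = RandomForestClassifier(n_estimators=125, class_weight='balanced',
                            random_state=111)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))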
Source: http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metric
jcvallier · 7 years ago
Text
Data Analysis - Capstone Week 1
In preparation for writing a data analysis paper for this capstone, this week we are asked to provide (1) a project title, (2) a research question with my motivation for answering it, and (3) potential implications of answering the research question.
(1) Project Title:  Identifying the best predictors which may cause water pumps to fail in Tanzania using a random forest machine learning algorithm.
(2) Research Question:  Can a random forest machine learning algorithm predict which water pumps will fail in Tanzania, in order to improve maintenance operations that provide clean, potable water for local communities? I plan on using this method to identify the most likely contributing factors.
(3) Implications for research question: Answering this question could lead to an improved maintenance schedule that decreases the number of water pump failures and provides more clean, potable water for local communities.
jcvallier · 7 years ago
Text
Machine Learning for Data Analysis - Week 4
This week we focused on the K-means unsupervised learning algorithm. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. These clusters are not labeled, so the user needs to examine the related variables closely to understand each cluster's characteristics.
I did this assignment on the red wine quality dataset found in the UCI Machine Learning Repository.
I didn’t have any N/A values to manage:
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
Looking at the elbow method for selecting the number of clusters k, somewhere between 2 and 4 looks optimal. I ran a Tukey HSD test for 3 and 4 clusters, with the best results for 3 clusters.
Tumblr media
Clustering variable means by cluster
              index  fixed acidity  volatile acidity  citric acid
cluster
0        852.964981      -0.613198          0.523980    -0.766575
1        735.110787       0.972455         -0.724979     1.044456
2        710.416031      -0.000537          0.037230     0.174129

         residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide
cluster
0             -0.213666  -0.157534            -0.246837             -0.351623
1              0.006391   0.270782            -0.473441             -0.497666
2              0.489412   0.020018             1.046380              1.276995

          density        pH  sulphates   alcohol   quality
cluster
0       -0.384480  0.590419  -0.315140 -0.005503 -0.187766
1        0.368134 -0.734856   0.576452  0.413372  0.570021
2        0.385615 -0.199118  -0.121968 -0.521363 -0.447425
Here cluster 1 has the best quality, with the highest alcohol, sulphates, and fixed acidity. Cluster 0 is close to the mean quality and has the second highest alcohol (near average), though its other characteristics vary. Cluster 2 has the lowest quality and the lowest alcohol, and also varies in its other characteristics.
Running a post hoc test, we can see that the null hypothesis is rejected for every pair of clusters, and that the means and standard deviations vary.
means for alcohol by cluster
         alcohol_y
cluster
0        -0.060487
1         0.422731
2        -0.513167
standard deviations for alcohol by cluster
         alcohol_y
cluster
0         0.949251
1         1.045772
2         0.728444
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
 0      1     0.4832   0.3003  0.6661   True
 0      2    -0.4527  -0.6481 -0.2573   True
 1      2    -0.9359  -1.1472 -0.7246   True
---------------------------------------------
In summary, it looks like I was able to determine three categories of wine: high quality, middle quality, and low quality using K-means analysis.  This allowed me to understand the characteristics of each category and to perhaps predict whether a bottle of wine will be high quality.  Here is my code:
"""
@author: jcvall
"""
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

#%%
"""
Data Management
"""
#%cd /Users/jcvall/Google Drive/Education/Python & Data Science/Mac - Python
data = pd.read_csv("winequality-red.csv", sep=";")

# subset clustering variables
data.info()
# preprocess dataframe
clustervar = data
# standardize clustering variables to have mean=0 and sd=1
clustervar['fixed acidity'] = preprocessing.scale(clustervar['fixed acidity'].astype('float64'))
clustervar['volatile acidity'] = preprocessing.scale(clustervar['volatile acidity'].astype('float64'))
clustervar['citric acid'] = preprocessing.scale(clustervar['citric acid'].astype('float64'))
clustervar['residual sugar'] = preprocessing.scale(clustervar['residual sugar'].astype('float64'))
clustervar['chlorides'] = preprocessing.scale(clustervar['chlorides'].astype('float64'))
clustervar['free sulfur dioxide'] = preprocessing.scale(clustervar['free sulfur dioxide'].astype('float64'))
clustervar['total sulfur dioxide'] = preprocessing.scale(clustervar['total sulfur dioxide'].astype('float64'))
clustervar['density'] = preprocessing.scale(clustervar['density'].astype('float64'))
clustervar['pH'] = preprocessing.scale(clustervar['pH'].astype('float64'))
clustervar['sulphates'] = preprocessing.scale(clustervar['sulphates'].astype('float64'))
clustervar['alcohol'] = preprocessing.scale(clustervar['alcohol'].astype('float64'))
clustervar['quality'] = preprocessing.scale(clustervar['quality'].astype('float64'))
# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=111)

# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
#%%
""" Plot average distance from observations from the cluster centroid to use the Elbow Method to identify number of clusters to choose """
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
# Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)
# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
#%% """ BEGIN multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster """ # create a unique identifier variable from the index for the # cluster training data to merge with the cluster assignment variable clus_train.reset_index(level=0, inplace=True) # create a list that has the new index variable cluslist=list(clus_train['index']) # create a list of cluster assignments labels=list(model3.labels_) # combine index variable list with cluster assignment list into a dictionary newlist=dict(zip(cluslist, labels)) newlist # convert newlist dictionary to a dataframe newclus=DataFrame.from_dict(newlist, orient='index') newclus # rename the cluster assignment column newclus.columns = ['cluster']
# now do the same for the cluster assignment variable # create a unique identifier variable from the index for the # cluster assignment dataframe # to merge with cluster training data newclus.reset_index(level=0, inplace=True) # merge the cluster assignment dataframe with the cluster training variable dataframe # by the index variable merged_train=pd.merge(clus_train, newclus, on='index') merged_train.head(n=100) # cluster frequencies merged_train.cluster.value_counts()
#%%
""" END multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster """
# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
#%%
# validate clusters in training data by examining cluster differences in alcohol using ANOVA
# first have to merge alcohol with clustering variables and cluster assignment data
alcohol_data = data['alcohol']
# split alcohol data into train and test sets
alcohol_train, alcohol_test = train_test_split(alcohol_data, test_size=.3, random_state=123)
alcohol_train1 = pd.DataFrame(alcohol_train)
alcohol_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(alcohol_train1, merged_train, on='index')
sub1 = merged_train_all[['alcohol_y', 'cluster']]

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

alcoholmod = smf.ols(formula='alcohol_y ~ C(cluster)', data=sub1).fit()
print(alcoholmod.summary())

print('means for alcohol by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for alcohol by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['alcohol_y'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
jcvallier · 7 years ago
Text
Machine Learning for Data Analysis - Week 3
This week we studied lasso regression, a regression algorithm that selects the best variables for better predictability. Lasso regression achieves this by forcing the sum of the absolute values of the regression coefficients to be less than a fixed value, which forces certain coefficients to zero, effectively choosing a simpler model that does not include those coefficients.
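Put as a formula, and in roughly the form scikit-learn uses, the model chooses coefficients b to minimize (1/2n) * sum_i (y_i - x_i·b)^2 + alpha * sum_j |b_j|; a larger penalty alpha pushes more coefficients exactly to zero, and in LassoLarsCV alpha is chosen by cross-validation.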
I used the red wine data set where I am predicting the highest quality of wine, based on wine characteristics.
Tumblr media Tumblr media
As you can see above, the mean squared error is similar across the folds, with the exception of one fold with a very low MSE. The training and test MSE are within 0.02 of each other, and the test R-square is also within 0.02 of the training. The lasso regression model has an R-square of 0.34 on the test set, which means these variables can explain approximately 34% of the variability in wine quality.
training data MSE
0.41434260704
test data MSE
0.432483541099
training data R-square
0.362115137182
test data R-square
0.340279344086
In summary, the model shows us that there is a relationship between sulphates, alcohol, and red wine quality, with other factors playing a lesser role.
Code:
#from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
import os

#Load the dataset
os.chdir('C:/Users/10040207/myproject/Upload')
data = pd.read_csv("winequality-red.csv", sep=';')
#%%
#select predictor variables and target variable as separate data sets
target = data['quality']
data.drop('quality', axis=1, inplace=True)
predvar = data
#%%
# standardize predictors to have mean=0 and sd=1
predictors = predvar
from sklearn import preprocessing
predictors['fixed acidity'] = preprocessing.scale(predictors['fixed acidity'].astype('float64'))
predictors['volatile acidity'] = preprocessing.scale(predictors['volatile acidity'].astype('float64'))
predictors['citric acid'] = preprocessing.scale(predictors['citric acid'].astype('float64'))
predictors['residual sugar'] = preprocessing.scale(predictors['residual sugar'].astype('float64'))
predictors['chlorides'] = preprocessing.scale(predictors['chlorides'].astype('float64'))
predictors['free sulfur dioxide'] = preprocessing.scale(predictors['free sulfur dioxide'].astype('float64'))
predictors['total sulfur dioxide'] = preprocessing.scale(predictors['total sulfur dioxide'].astype('float64'))
predictors['density'] = preprocessing.scale(predictors['density'].astype('float64'))
predictors['pH'] = preprocessing.scale(predictors['pH'].astype('float64'))
predictors['alcohol'] = preprocessing.scale(predictors['alcohol'].astype('float64'))
#%%
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=111)

# specify the lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
#%%
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))

#%%
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
#%%
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
#%%

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
jcvallier · 7 years ago
Text
Machine Learning for Data Analysis - Week 2
I continued with the red wine dataset this week to see if random forest can do a better job than decision trees. I was able to get a marginal improvement in accuracy, from 0.978 for the decision tree to 0.987 for the random forest. I used the quality variable as the target, where 1 is the highest rating and 0 is all other ratings. The goal was to develop a random forest model that can pick the best quality wine. The random forest feature rankings are as follows (the higher the number, the more important the feature).
Feature                 Importance   Rank
fixed acidity           0.065315     (9)
volatile acidity        0.109250     (3)
citric acid             0.104638     (4)
residual sugar          0.0977009    (5)
chlorides               0.059328     (10)
free sulfur dioxide     0.053690     (11)
total sulfur dioxide    0.072195     (7)
density                 0.067850     (8)
pH                      0.093891     (6)
sulphates               0.113916     (2)
alcohol                 0.162228     (1)
Summary -
The five most important features are alcohol, sulphates, volatile acidity, citric acid, and residual sugar. The confusion matrix is as follows:
Confusion Matrix:
[[632   0]
 [  7   1]]
The model correctly classified 632 observations as class 0 and 1 observation as class 1. Seven were classified incorrectly: the model predicted 0 when the true class was 1. Overall, the model may need more examples of the highest-quality class, but it seems to do a good job.
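For reference, a minimal sketch of how such a matrix and the accuracy can be produced, assuming a fitted classifier and the test split named as in the code below.

from sklearn.metrics import confusion_matrix, accuracy_score

predictions = classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))   # rows = true class, columns = predicted class
print(accuracy_score(y_test, predictions))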
Tumblr media
Here is the code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
%matplotlib inline

# load the data and define the target/features as in the Week 1 post
wine = pd.read_csv('winequality-red.csv', sep=";")
wine['Highest_Q'] = wine['quality'].apply(lambda x: 0 if x < 8 else 1)

print(wine.dtypes)
print(wine.describe())

y = wine['Highest_Q']
wine.drop('quality', axis=1, inplace=True)
wine.drop('Highest_Q', axis=1, inplace=True)
X = wine
#Split into training and testing sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4,random_state=111)
col = X.columns
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(X_train, y_train)
# display the relative importance of each attribute
features = model.feature_importances_
print(features)
pd.DataFrame(features,index=col,columns=['Features Rank'])
""" Running a different number of trees and see the effect  of that on the accuracy of the prediction """
trees=range(25) accuracy=np.zeros(25)
for idx in range(len(trees)):    classifier=RandomForestClassifier(n_estimators=idx + 1)    classifier=classifier.fit(X_train,y_train)    predictions=classifier.predict(X_test)    accuracy[idx]=sklearn.metrics.accuracy_score(y_test, predictions)    plt.cla() plt.plot(trees, accuracy)
jcvallier · 7 years ago
Text
Machine Learning for Data Analysis
I went to the UCI website to get an additional dataset for the assignment. I chose the red wine quality data. I created a new variable where there is a 1 if the wine is the highest quality (8) and a 0 if it is below 8.
My variables are as such:
fixed acidity           X1
volatile acidity        X2
citric acid             X3
residual sugar          X4
chlorides               X5
free sulfur dioxide     X6
total sulfur dioxide    X7
density                 X8
pH                      X9
sulphates               X10
alcohol                 X11
quality                 X12
Highest_Q               X13
Decision Tree: 
Tumblr media
My Code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
wine = pd.read_csv('winequality-red.csv',sep=";")
wine['Highest_Q'] = wine['quality'].apply(lambda x: 0 if x < 8 else 1)
wine['quality'] = wine['quality'].astype('float64')
wine['Highest_Q'] = wine['Highest_Q'].astype('float64')
y = wine['Highest_Q']
wine.drop('quality', axis=1, inplace=True)
wine.drop('Highest_Q', axis=1, inplace=True)
X = wine
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4,random_state=111)
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
sklearn.metrics.confusion_matrix(y_test,predictions)
sklearn.metrics.accuracy_score(y_test,predictions)
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(classifier, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Summary:
My decision tree gives the key elements of what makes a great wine. For example, if alcohol is less than 12.45, fixed acidity is less than .36, and citric acid is less than 2.05, then the wine is classified as an 8 (highest quality) 35 percent of the time.
jcvallier · 7 years ago
Text
Regression Modeling - Week 4
This week I tested whether certain variables were important in predicting developed countries. I used life expectancy, employment rate, female employment rate, and income per person.
I first checked to see if there were any confounding relationships among the variables. It seems that income per person is a confounder for life expectancy and female employment rate: the p-value for female employment rate went from 0.05 to 0.15, and life expectancy went from 0.00 to 0.37, after adding income per person. See below.
Logit Regression Results before income per person variable added
==============================================================================
Dep. Variable:           DevelCountry   No. Observations:                  166
Model:                          Logit   Df Residuals:                      162
Method:                           MLE   Df Model:                            3
Date:                Wed, 18 Oct 2017   Pseudo R-squ.:                  0.4786
Time:                        18:43:56   Log-Likelihood:                -39.274
converged:                       True   LL-Null:                       -75.327
                                        LLR p-value:                 1.511e-15
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept          -29.9497      7.088     -4.225      0.000     -43.842     -16.058
employrate          -0.1755      0.084     -2.085      0.037      -0.340      -0.011
femaleemployrate     0.1314      0.068      1.939      0.052      -0.001       0.264
lifeexpectancy       0.4229      0.092      4.594      0.000       0.243       0.603
====================================================================================

Logit Regression Results after income per person added
==============================================================================
Dep. Variable:           DevelCountry   No. Observations:                  166
Model:                          Logit   Df Residuals:                      161
Method:                           MLE   Df Model:                            4
Date:                Wed, 18 Oct 2017   Pseudo R-squ.:                  0.5935
Time:                        18:41:47   Log-Likelihood:                -30.624
converged:                       True   LL-Null:                       -75.327
                                        LLR p-value:                 1.760e-18
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           -1.6280      6.810     -0.239      0.811     -14.975      11.719
employrate          -0.2039      0.072     -2.838      0.005      -0.345      -0.063
femaleemployrate     0.0752      0.052      1.442      0.149      -0.027       0.177
lifeexpectancy       0.0751      0.084      0.890      0.374      -0.090       0.240
incomeperperson      0.0002   5.17e-05      3.680      0.000    8.89e-05       0.000
====================================================================================
After removing female employment rate and life expectancy, I get the following results. 
Logit Regression Results
==============================================================================
Dep. Variable:           DevelCountry   No. Observations:                  166
Model:                          Logit   Df Residuals:                      163
Method:                           MLE   Df Model:                            2
Date:                Wed, 18 Oct 2017   Pseudo R-squ.:                  0.5684
Time:                        19:06:14   Log-Likelihood:                -32.509
converged:                       True   LL-Null:                       -75.327
                                        LLR p-value:                 2.538e-19
===================================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           3.4351      2.232      1.539      0.124      -0.940       7.810
employrate         -0.1433      0.048     -2.984      0.003      -0.237      -0.049
incomeperperson     0.0002   4.16e-05      5.624      0.000       0.000       0.000
===================================================================================
Odds Ratio
Intercept          31.034682
employrate          0.866494
incomeperperson     1.000234
dtype: float64
                 Lower CI     Upper CI  Odds Ratio
Intercept        0.390663  2465.424885   31.034682
employrate       0.788658     0.952011    0.866494
incomeperperson  1.000153     1.000316    1.000234
The odds ratio for income per person is essentially 1 per dollar of income (OR=1.00, 95% CI=1.00-1.00, p<0.001), so while statistically significant, the effect of each additional dollar is negligible. Employment rate had an odds ratio of less than one, so a lower employment rate is slightly more probable in developed countries than in non-developed countries (OR=0.87, 95% CI=0.79-0.95, p=0.003).
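A minimal sketch of how a model like this can be fit and the odds ratios computed with statsmodels; the DataFrame name df1 is an assumption for illustration.

import numpy as np
import statsmodels.formula.api as smf

logit_model = smf.logit(formula='DevelCountry ~ employrate + incomeperperson',
                        data=df1).fit()
print(logit_model.summary())

# odds ratios with 95% confidence intervals
params = logit_model.params
conf = logit_model.conf_int()
conf['Odds Ratio'] = params
conf.columns = ['Lower CI', 'Upper CI', 'Odds Ratio']
print(np.exp(conf))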
jcvallier · 7 years ago
Text
Regression Modeling - Week 3
For week three, I ran a multiple regression to determine whether female employment rate and developed-country status can predict life expectancy. Though I knew these variables would not have a strong r-square, I wanted to see what I could learn from this relationship.
OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.295
Model:                            OLS   Adj. R-squared:                  0.286
Method:                 Least Squares   F-statistic:                     34.07
Date:                Sat, 14 Oct 2017   Prob (F-statistic):           4.35e-13
Time:                        23:58:47   Log-Likelihood:                -587.22
No. Observations:                 166   AIC:                             1180.
Df Residuals:                     163   BIC:                             1190.
Df Model:                           2
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                67.4903      0.715     94.434      0.000      66.079      68.902
C(DevelCountry)[T.1]     12.1850      1.741      7.001      0.000       8.748      15.622
norm_femaleemployrate    -0.2029      0.045     -4.538      0.000      -0.291      -0.115
==============================================================================
Omnibus:                       18.222   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               20.716
Skew:                          -0.836   Prob(JB):                     3.17e-05
Kurtosis:                       3.445   Cond. No.                         39.6
==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Both variables, developed country and female employment rate, are significant with p-values of 0.00.
The coefficients give a developed country a 12.19-year increase in life expectancy, while each percentage point of female employment decreases life expectancy by 0.20 years. The regression equation is y = 67.49 + 12.19(x1) - 0.20(x2).
The null hypothesis is rejected, and we can say there is a relationship between developed-country status, female employment rate, and life expectancy, with an r-square of 0.29.
There are no confounding variables in my analysis.
Tumblr media Tumblr media Tumblr media
There is a non-linear relationship at the outer quartiles. There is a poor linear fit in the residuals, and an outlier (almost 4 standard deviations from the mean) should be removed and the analysis re-run.
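A minimal sketch of the diagnostic plots referenced above, assuming the fitted OLS results object is named model (a placeholder name).

import statsmodels.api as sm
import matplotlib.pyplot as plt

# Q-Q plot of the residuals to check normality
sm.qqplot(model.resid, line='r')
plt.show()

# standardized residuals to spot outliers beyond +/- 2 or 3
stdres = model.resid_pearson
plt.plot(stdres, 'o', ls='None')
plt.axhline(y=0, color='r')
plt.ylabel('Standardized residual')
plt.xlabel('Observation number')
plt.show()

# leverage plot to identify influential observations
sm.graphics.influence_plot(model, size=8)
plt.show()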
jcvallier · 7 years ago
Text
Regression Modeling - Week 3
For this week I used the Gapminder dataset variables life expectancy and employment rate. I used a regression analysis to determine if employment rate is associated with life expectancy.
I adjusted the employment rate variable by subtracting the mean of approximately 58.99 from each observation. This centered the variable so that its mean is zero.
The regression analysis gave me a p-value of essentially zero and a formula of y = -0.30x + 69.55. The r-squared is 0.10.
This means that every 1 point increase in employment rate is associated with a 0.30-year decrease in life expectancy.
This model has a negative trend showing employment rate to be significant, but I would not consider it a robust model, as the r-squared is a weak 0.10 and the residuals are not linear.
Program and Output:
import warnings
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
warnings.filterwarnings("ignore")

df2 = df1[['lifeexpectancy','employrate']]
df2['norm_employrate'] = df2['employrate'] - df2['employrate'].mean()
df2['norm_employrate'].mean()
df2['trans_employrate'] = np.exp(df2['employrate'])
modelname = smf.ols(formula='lifeexpectancy ~ norm_employrate', data=df2).fit()
print(modelname.summary())

print('average life exp:', df2['lifeexpectancy'].mean())
print('average employ rate:', df2['employrate'].mean())
sns.regplot(x='employrate', y='lifeexpectancy', data=df2)

                            OLS Regression Results
==============================================================================
Dep. Variable:         lifeexpectancy   R-squared:                       0.099
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     17.94
Date:                Sun, 08 Oct 2017   Prob (F-statistic):           3.79e-05
Time:                        15:24:51   Log-Likelihood:                -607.59
No. Observations:                 166   AIC:                             1219.
Df Residuals:                     164   BIC:                             1225.
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          69.5456      0.734     94.694      0.000      68.095      70.996
norm_employrate    -0.3021      0.071     -4.235      0.000      -0.443      -0.161
==============================================================================
Omnibus:                       11.183   Durbin-Watson:                   1.772
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               12.051
Skew:                          -0.639   Prob(JB):                      0.00242
Kurtosis:                       2.669   Cond. No.                         10.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

average life exp: 69.54557228915664
average employ rate: 58.9879518766265
average norm employ rate: 9.16000876220852e-15
Tumblr media
jcvallier · 7 years ago
Text
Regression - Week 1
Source: Gapminder.com
 Sample
We have decided to include the following entities:
192 UN members (as of April 2008)
51 other entities listed in the “List of countries” in Wikipedia (2008-05-13). These include the Vatican, dependent territories, special entities and disputed territories. We have excluded the two “sub-dependencies” Ascension Island and Tristan da Cunha, although they are listed by Wikipedia.
4 French overseas territories (Guadeloupe, Martinique, Reunion and French Guyana), although they are considered an integral part of France
10 former states
2 ad-hoc areas: “Serbia excluding Kosovo” and “the Channel Islands”. The latter is the collective name of the two dependent territories Guernsey and Jersey.
In September 2012 we added the new UN state “South Sudan” to our country list. The new number of UN members is hence 193. This gives a total of 193+51+4+10+2=260 countries and territories. Our goal is to have data for all these entities for at least two indicators (one of them being population), from 1800 onwards. However, most indicators will only have data for a selection of these entities.
Procedure
The main purpose of this observational data is to produce graphical presentations that display the magnitude of health and wealth disparities in the contemporary world. Therefore, we have also included rough estimates for countries and territories for which reliable data was not available for 2006. These estimates can only be taken as an indication of the order of magnitude for the indicator. Furthermore, we have not been able to make sure that every single observation is based on the best estimates available. Hence we discourage the use of this data set for statistical analysis and advise those who require more exact data to investigate the available data more carefully and look for additional sources, when appropriate.
Variables
The main analysis performed used income per person to predict employment rate. Income per person comes from 2010 Gross Domestic Product per capita in constant 2000 US$; inflation, but not differences in the cost of living between countries, has been taken into account. Employment rate is the 2007 total employees age 15+ (% of population): the percentage of the total population, age above 15, that was employed during the given year.
jcvallier · 7 years ago
Text
Data Analysis Tools- Week 4
This week we have learned about moderators and how they act as a third variable that affects the strength of the relationship between a dependent and independent variable.  
This week I picked income per person to explain female employment rate as my two variables.  I also chose developed countries and non-developed countries as my moderator.
When looking at income per person and female employment rate across all countries, the r-square is essentially 0.00.
Tumblr media
Now, I introduce the moderator of developed countries to find if income per person can explain the female employment rate in this subgroup.
I found that in developed countries, there is a positive relationship between higher income per person and higher female employment rate. This is reflected in the higher r square score of 0.37.
Tumblr media
I found that in non-developed countries, there is a weak negative relationship between income per person and female employment rate, reflected in a much lower r-square of about 0.01.
Tumblr media
Code:
#import libraries
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
import seaborn as sns

# import data
data = pd.read_csv('GapminderData.csv')

# select the variables of interest
df1 = data[['country','lifeexpectancy','employrate','femaleemployrate','DevelCountry','incomeperperson']]
# new variable  - earningpower (lifeexpectancy * incomeperperson) - this is not an average lifetime income but an indication of earning power potential over ones lifetime
df1['earningpower'] = df1['lifeexpectancy'] * df1['incomeperperson']
# new variable - unemployment (100 - 'employrate') - this is the percentage of the population who does not hold a job
df1['unemployment'] = 100 - df1['employrate']
# new variable - femaleunemployment (100 - 'femaleemployrate') - this is the percentage of females who does not hold a job
df1['femaleunemployment'] = 100 - df1['femaleemployrate']
# how many NAs do we have in employrate
print(df1['employrate'].value_counts(dropna=False))
# drop countries that does not have employrate variable
df1 = df1[pd.notnull(df1['employrate'])]
# did we remove the NAs in employrate variable?
print(df1['employrate'].value_counts(dropna=False))
# how many NAs do we have in lifeexpectancy
print(df1['lifeexpectancy'].value_counts(dropna=False))
# drop contries that have NAs in lifeexpenctancy
df1 = df1[pd.notnull(df1['lifeexpectancy'])]
# did we remove the NAs in lifeexpectancy
print(df1['lifeexpectancy'].value_counts(dropna=False))
reg = sns.regplot(x='incomeperperson',y='femaleemployrate',data = df1)
fig = reg.get_figure()
fig.savefig('ploy.png')

# run scipy stats pearson correlation (R correlation, P Value)
x = df1['incomeperperson']
y = df1['femaleemployrate']
pearsonr(x,y)
0.02836*0.02836
df2 = df1[df1['DevelCountry']==1]
df3 = df1[df1['DevelCountry']==0]

reg1 = sns.regplot(x='incomeperperson', y='femaleemployrate', data=df2)

fig = reg1.get_figure()
fig.savefig('ploy1.png')
x = df2['incomeperperson']
y = df2['femaleemployrate']
pearsonr(x,y)
0.60528 * 0.60528
reg2 = sns.regplot(x='incomeperperson',y='femaleemployrate',data = df3)
fig = reg2.get_figure()
fig.savefig('ploy2.png')

x = df3['incomeperperson']
y = df3['femaleemployrate']
pearsonr(x,y)
-0.11386 * -0.11386
jcvallier · 7 years ago
Text
Data Analysis Tools - Week 3
This week I used life expectancy to explain the employment rate in developed countries vs non-developed countries.
My Code:
#import libraries
import pandas as pd
from scipy.stats import pearsonr
import seaborn as sns

#import data
data = pd.read_csv('GapminderData.csv')

#Visualization of life expectancy and employment rate
sns.regplot(x='lifeexpectancy', y='employrate', data=data)

# run scipy stats pearson correlation (R correlation, P Value)
x = data['lifeexpectancy']
y = data['employrate']
pearsonr(x,y)
# life expectancy to explain employment rate in developed countries
dev = data[data['DevelCountry'] == 1]
sns.regplot(x='lifeexpectancy', y='employrate', data=dev)
Tumblr media
# run scipy stats pearson correlation (R correlation, P Value)
x = dev['lifeexpectancy']
y = dev['employrate']
pearsonr(x,y)
rsquare = (-.3140)*(-.3140)
# life expectancy to explain employment rate in developed countries
dev = data[data['DevelCountry'] == 1]
regfig2 = sns.regplot(x='lifeexpectancy',y='employrate', data=dev)
Tumblr media
# run scipy stats pearson correlation (R correlation, P Value)
x = dev['lifeexpectancy']
y = dev['employrate']
pearsonr(x,y)
rsquare = (.4872)*(.4872)
# life expectancy to explain employment rate in non-developed countries
nondev = data[data['DevelCountry'] == 0]
regfig3 = sns.regplot(x='lifeexpectancy', y='employrate', data=nondev)
Tumblr media
# run scipy stats pearson correlation (R correlation, P Value)
x = nondev['lifeexpectancy']
y = nondev['employrate']
pearsonr(x,y)
rsquare = (-.3648)*(-.3648)
## Summary of Correlation (Developed vs. Non-Developed Countries)
Looking at all countries, the correlation between life expectancy and employment rate is negative, -0.31, with a p-value of less than .05. This means there is a significant relationship where employment rate decreases as life expectancy increases. The r-square of .099 means life expectancy can explain approximately 10% of the variance.
Then, looking at developed countries, the correlation between life expectancy and employment rate is positive, 0.49, with a p-value of less than .05. This means there is a significant relationship where employment rate increases as life expectancy increases. The r-square of .237 means life expectancy can explain approximately 24% of the variance.
Finally, looking at non-developed countries, the correlation between life expectancy and employment rate is negative, -0.36, with a p-value of less than .05. This means there is a significant relationship where employment rate decreases as life expectancy increases. The r-square of 0.133 means life expectancy can explain approximately 13% of the variance.
jcvallier · 7 years ago
Text
Data Analysis Tools - Week 2
Model Interpretation for Chi-Square Tests:
For this test I needed to create two new variables, perc_group and perc_lifeexp. perc_group splits the income per person variable into three groups (by percentile: 0-33%, 34-66%, 67-100%), and perc_lifeexp splits the life expectancy variable into the same three percentile groups.
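A minimal sketch of one way to build such tertile groups with pandas; the column names follow the Gapminder variables used here, and the specific label coding is an assumption.

import pandas as pd

# cut each variable into three equal-sized percentile groups (tertiles);
# lowest tertile coded 3 and highest coded 1, matching the coding in the earlier posts
df1['perc_group'] = pd.qcut(df1['incomeperperson'], 3, labels=[3, 2, 1])
df1['perc_lifeexp'] = pd.qcut(df1['lifeexpectancy'], 3, labels=[3, 2, 1])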
When examining the association between life expectancy and income per person, a chi-square test of independence revealed that higher income percentiles have longer life spans. The top income percentile has a life expectancy of 76-83 years 83% of the time, the middle income percentile has a life expectancy of 68-75 years 63% of the time, and the bottom income percentile has a life expectancy of 47-67 years 78% of the time. X2 = 136.9, 4 df, p = 1.285e-28.
Model Interpretation for post hoc Chi-Square Test results:
A chi-square test of independence revealed that the null hypothesis can be rejected for the association between life expectancy and income per person.
Post hoc comparisons use a Bonferroni-adjusted significance level of .05/3 ≈ .02. Comparing this with the p-value of 1.285e-28, the post hoc evaluation confirms we are correct in rejecting the null hypothesis.
perc_lifeexp   1   2   3
perc_group
1             46  10   1
2              8  35  11
3              1  11  43

perc_lifeexp         1         2         3
perc_group
1             0.836364  0.178571  0.018182
2             0.145455  0.625000  0.200000
3             0.018182  0.196429  0.781818

Chi Square value, p value, expected counts
(136.92436043046089, 1.2852905779637279e-28, 4, array([[ 18.88554217,  19.22891566,  18.88554217],
       [ 17.89156627,  18.21686747,  17.89156627],
       [ 18.22289157,  18.55421687,  18.22289157]]))
Code
import pandas as pd
# import data
data = pd.read_csv('GapminderData.csv')

# select the variables of interest
df1 = data[['country','lifeexpectancy','employrate','femaleemployrate','DevelCountry','incomeperperson','perc_group','perc_lifeexp']]
# new variable  - earningpower (lifeexpectancy * incomeperperson) - this is not an average lifetime income but an indication of earning power potential over ones lifetime
df1['earningpower'] = df1['lifeexpectancy'] * df1['incomeperperson']
# new variable - unemployment (100 - 'employrate') - this is the percentage of the population who does not hold a job
df1['unemployment'] = 100 - df1['employrate']
# new variable - femaleunemployment (100 - 'femaleemployrate') - this is the percentage of females who does not hold a job
df1['femaleunemployment'] = 100 - df1['femaleemployrate']
# how many NAs do we have in employrate
print(df1['employrate'].value_counts(dropna=False))
# drop countries that does not have employrate variable
df1 = df1[pd.notnull(df1['employrate'])]
# did we remove the NAs in employrate variable?
print(df1['employrate'].value_counts(dropna=False))
# how many NAs do we have in lifeexpectancy
print(df1['lifeexpectancy'].value_counts(dropna=False))
# drop contries that have NAs in lifeexpenctancy
df1 = df1[pd.notnull(df1['lifeexpectancy'])]
# did we remove the NAs in lifeexpectancy
print(df1['lifeexpectancy'].value_counts(dropna=False))
df1.dropna(inplace=True)
# export the cleaned data to Excel (requires an xlwings worksheet object named sht3)
# sht3.range('A1').expand('table').value = df1
print(df1['lifeexpectancy'].max())
print(df1['lifeexpectancy'].min())
CROSS = pd.crosstab(df1['perc_group'],df1['perc_lifeexp'])
print(CROSS)
#%%
colsum = CROSS.sum(axis=0)
colpct = CROSS/colsum
print(colpct)
#%%
import scipy.stats

print('Chi Square value, p value, expected counts')
chi1 = scipy.stats.chi2_contingency(CROSS)
print(chi1)
jcvallier · 7 years ago
Text
Data Analysis Tools - Week 1
(gapminder dataset)
Model Interpretation for ANOVA:
Using the income per person variable, I coded the bottom 1/3, middle 1/3, and top 1/3 percentiles as the values 3, 2, and 1, respectively. This variable is the income percentile group, or perc_group.
When examining the association between employment rate and income percentile group, an Analysis of Variance (ANOVA) shows a significant difference between the top, middle, and bottom groups: the top percentile has Mean=57.34% / s.d. ±7.51%, the middle percentile has Mean=55.22% / s.d. ±9.46%, and the bottom percentile has Mean=64.39% / s.d. ±11.53%, with F=13.66 and p=3.27e-06. Since the p-value is less than 0.05, there is a significant difference between the means and we reject the null hypothesis.
The degrees of freedom can be found in the OLS table as the DF model and DF residuals 2 and 163, respectively. 
OLS Regression Results
==============================================================================
Dep. Variable:             employrate   R-squared:                       0.144
Model:                            OLS   Adj. R-squared:                  0.133
Method:                 Least Squares   F-statistic:                     13.66
Date:                Mon, 04 Sep 2017   Prob (F-statistic):           3.27e-06
Time:                        23:47:31   Log-Likelihood:                -609.75
No. Observations:                 166   AIC:                             1226.
Df Residuals:                     163   BIC:                             1235.
Df Model:                           2
Covariance Type:            nonrobust
======================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept             57.3386      1.274     45.020      0.000        54.824    59.854
C(perc_group)[T.2]    -2.1145      1.826     -1.158      0.249        -5.720     1.491
C(perc_group)[T.3]     7.0541      1.817      3.881      0.000         3.465    10.643
==============================================================================
Omnibus:                        0.415   Durbin-Watson:                   1.951
Prob(Omnibus):                  0.813   Jarque-Bera (JB):                0.460
Skew:                          -0.118   Prob(JB):                        0.794
Kurtosis:                       2.894   Cond. No.                         3.68
==============================================================================
Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Model Interpretation of post hoc ANOVA results:
The ANOVA p-value was significant, but to determine which of the means are significantly different I ran a post hoc test using Tukey's Honest Significant Difference.
Multiple Comparison of Means - Tukey HSD, FWER=0.05
============================================
group1 group2 meandiff  lower    upper  reject
--------------------------------------------
 1      2    -2.1145   -6.434   2.2049  False
 1      3     7.0541   2.7549  11.3534   True
 2      3     9.1687   4.8112  13.5261   True
--------------------------------------------
When compared with employment rate, there is a difference between top and bottom income percentiles.  There is also a difference between middle and bottom income percentiles.   
Related Code:
import pandas as pd
data = pd.read_csv('GapminderData.csv')
# select the variables of interest
df1 = data[['country','lifeexpectancy','employrate','femaleemployrate','DevelCountry','incomeperperson','perc_group']]
# new variable  - earningpower (lifeexpectancy * incomeperperson) - this is not an average lifetime income but an indication of earning power potential over ones lifetime
df1['earningpower'] = df1['lifeexpectancy'] * df1['incomeperperson']
# new variable - unemployment (100 - 'employrate') - this is the percentage of the population who does not hold a job
df1['unemployment'] = 100 - df1['employrate']
# new variable - femaleunemployment (100 - 'femaleemployrate') - this is the percentage of females who does not hold a job
df1['femaleunemployment'] = 100 - df1['femaleemployrate']
# how many NAs do we have in employrate
print(df1['employrate'].value_counts(dropna=False))
# drop countries that does not have employrate variable
df1 = df1[pd.notnull(df1['employrate'])]
# did we remove the NAs in employrate variable?
print(df1['employrate'].value_counts(dropna=False))
# how many NAs do we have in lifeexpectancy
print(df1['lifeexpectancy'].value_counts(dropna=False))
# drop contries that have NAs in lifeexpenctancy
df1 = df1[pd.notnull(df1['lifeexpectancy'])]
# did we remove the NAs in lifeexpectancy
print(df1['lifeexpectancy'].value_counts(dropna=False))
df1.dropna(inplace=True)
#%%
df4 = df1[['employrate','perc_group']]
#%%
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

model = smf.ols(formula='employrate ~ C(perc_group)', data=df4)
results = model.fit()
print(results.summary())
#%%
df5 = df4.groupby('perc_group').mean()
df6 = df4.groupby('perc_group').std()
print(df5,df6)
#%%
mc = multi.MultiComparison(df4['employrate'], df4['perc_group'])
result2 = mc.tukeyhsd()
print(result2.summary())
jcvallier · 7 years ago
Text
Data Management and Visualization - Week 4
I have selected the following variables to visualize (python: .describe() function):
       lifeexpectancy  employrate  femaleemployrate  DevelCountry
count      166.000000  166.000000        166.000000    166.000000
mean        69.545572   58.987952         47.963855      0.168675
std          9.936253   10.327332         14.618138      0.375597
min         47.794000   34.900002         12.400000      0.000000
25%         62.725000   51.575001         39.250001      0.000000
50%         73.129000   58.850000         48.450001      0.000000
75%         76.844750   64.975000         56.150001      0.000000
max         83.394000   83.199997         83.300003      1.000000

       incomeperperson  earningpower  unemployment  femaleunemployment
count       166.000000  1.660000e+02    166.000000          166.000000
mean       7604.543086  5.943488e+05     41.012048           52.036145
std       10812.642619  8.776830e+05     10.327332           14.618138
min         103.775857  5.022440e+03     16.800003           16.699997
25%         609.437746  3.884985e+04     35.025000           43.849999
50%        2507.859649  1.723875e+05     41.150000           51.549999
75%        9209.272098  7.000834e+05     48.424999           60.749999
max       52301.587180  4.182192e+06     65.099998           87.600000
Tumblr media
I needed to remove the NAs to perform the histogram analysis.  
The first graph shows DevelCountry ('developed country'), a categorical variable coded 1 or 0. Of the 166 countries, 28 (16.87%) are developed.
Earningpower and incomeperperson have essentially the same distribution shape. Over 126 of the 166 countries have an income per person of $10,000 or less. The highest income per person belongs to Luxembourg, at $52,301.
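These figures can be spot-checked directly from the data; a small sketch, assuming df1 is the NA-free managed DataFrame summarized above:
# count countries at or below $10,000 income per person, assuming df1 is the managed DataFrame
low_income = (df1['incomeperperson'] <= 10000).sum()
print(low_income, 'of', len(df1), 'countries at or below $10,000 per person')
# country with the highest income per person
print(df1.loc[df1['incomeperperson'].idxmax(), ['country', 'incomeperperson']])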
Employment rate and unemployment rate are complements of each other (they sum to 100%). The average employment rate is 59.0% with a standard deviation of 10.3%.
Female employment rate and female unemployment rate are likewise complements. The average female employment rate is 48.0% with a standard deviation of 14.6%.
Life expectancy average is 69.5 years with a standard deviation of 9.9 years.
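A rough way to reproduce the univariate charts is sketched below; this assumes df1 is the NA-free managed DataFrame summarized above and that matplotlib is available (the styling of the original images is not reproduced).
import matplotlib.pyplot as plt
# histograms of the managed variables, assuming df1 is the NA-free DataFrame above
cols = ['DevelCountry', 'incomeperperson', 'earningpower', 'employrate',
        'unemployment', 'femaleemployrate', 'femaleunemployment', 'lifeexpectancy']
df1[cols].hist(bins=20, figsize=(12, 10))
plt.tight_layout()
plt.show()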
Tumblr media Tumblr media Tumblr media Tumblr media
The first bivariate graph is a scatter plot of income per person vs. life expectancy. It looks like an “L” on its back: many countries have similar (low) incomes but widely different life expectancies. The cluster toward the higher end of the plot corresponds to the developed countries, which center around 80 years of life expectancy and much higher income per person. You can confirm this in the second graph, which shows developed countries only. The one exception in that panel is South Africa, which has both low income and low life expectancy.
The third and fourth graphs plot unemployment against life expectancy. The points are widely scattered, with no clear relationship between the two variables.
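A sketch of the bivariate plots follows; which four panels the original images show is an assumption here (the fourth is taken to be female unemployment vs. life expectancy), and df1 is again the NA-free managed DataFrame.
import matplotlib.pyplot as plt
# assumed panel layout; df1 is the managed DataFrame from above
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].scatter(df1['incomeperperson'], df1['lifeexpectancy'])
axes[0, 0].set_title('Income per person vs. life expectancy (all countries)')
developed = df1[df1['DevelCountry'] == 1]
axes[0, 1].scatter(developed['incomeperperson'], developed['lifeexpectancy'])
axes[0, 1].set_title('Income per person vs. life expectancy (developed only)')
axes[1, 0].scatter(df1['unemployment'], df1['lifeexpectancy'])
axes[1, 0].set_title('Unemployment vs. life expectancy')
axes[1, 1].scatter(df1['femaleunemployment'], df1['lifeexpectancy'])
axes[1, 1].set_title('Female unemployment vs. life expectancy')
plt.tight_layout()
plt.show()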
0 notes
jcvallier · 7 years ago
Text
Data Management and Visualization - Week 3
I added three new variables for my week 3 assignment: earningpower, unemployment, and femaleunemployment.
Earningpower was added to represent potential earning power over one's lifetime, not an average of lifetime earnings. The unemployment rate and female unemployment rate were created as more intuitive ways of viewing the data.
Since my analysis is on how unemployment is associated with life expectancy, I managed these two variables by dropping any NAs, as I could not (and would not want to) infer missing values by country. I left the NAs for income per person and earning power, as these are secondary rather than primary variables; when I use these columns in a calculation, I will manage them accordingly at that time.
Here is my code:
# import libraries
import pandas as pd
import numpy as np
# import data
data = pd.read_csv('GapminderData.csv')
# select the variables of interest
df1 = data[['country','lifeexpectancy','employrate','femaleemployrate','DevelCountry','incomeperperson']]
# new variable - earningpower (lifeexpectancy * incomeperperson) - this is not an average lifetime income but an indication of earning power potential over one's lifetime
df1['earningpower'] = df1['lifeexpectancy'] * df1['incomeperperson']
# new variable - unemployment (100 - 'employrate') - this is the percentage of the population who do not hold a job
df1['unemployment'] = 100 - df1['employrate']
# new variable - femaleunemployment (100 - 'femaleemployrate') - this is the percentage of females who do not hold a job
df1['femaleunemployment'] = 100 - df1['femaleemployrate']
# how many NAs do we have in employrate
print(df1['employrate'].value_counts(dropna=False))
# drop countries that do not have the employrate variable
df1 = df1[pd.notnull(df1['employrate'])]
# did we remove the NAs in employrate variable?
print(df1['employrate'].value_counts(dropna=False))
# how many NAs do we have in lifeexpectancy
print(df1['lifeexpectancy'].value_counts(dropna=False))
# drop countries that have NAs in lifeexpectancy
df1 = df1[pd.notnull(df1['lifeexpectancy'])]
# did we remove the NAs in lifeexpectancy
print(df1['lifeexpectancy'].value_counts(dropna=False))
# completed managed dataset
print(df1)
Here is my output:
NaN           35
59.900002      3
47.299999      3
58.900002      3
...
Name: employrate, Length: 140, dtype: int64

55.900002     3
59.900002     3
47.299999     3
...
Name: employrate, Length: 139, dtype: int64

NaN        2
73.979     2
72.974     2
...
Name: lifeexpectancy, Length: 175, dtype: int64

72.974     2
73.979     2
50.239     1
...
Name: lifeexpectancy, Length: 174, dtype: int64

                 country  lifeexpectancy  employrate  femaleemployrate  ...  femaleunemployment
0            Afghanistan          48.673   55.700001         25.600000  ...           74.400000
1                Albania          76.918   51.400002         42.099998  ...           57.900002
2                Algeria          73.131   50.500000         31.700001  ...           68.299999
..                   ...             ...         ...               ...  ...                 ...
210          Yemen, Rep.          65.493   39.000000         20.299999  ...           79.700001
211               Zambia          49.025   61.000000         53.500000  ...           46.500000
212             Zimbabwe          51.384   66.800003         58.099998  ...           41.900002

[176 rows x 9 columns]
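As a final check that the drops behaved as described (only the secondary variables still carry missing values), a short sketch assuming df1 is the managed DataFrame from the listing above:
# confirm the row count and the remaining NAs, assuming df1 is the managed DataFrame above
print(df1.shape)           # (176, 9) in the run shown above
print(df1.isnull().sum())  # incomeperperson and earningpower may still contain NaN; the primary variables should not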
0 notes