Assignment 4
A k-means cluster analysis was performed on the Gapminder data used in the previous assignments.  Groups of countries were identified based on multiple quantitative inputs such as income per person, alcohol consumption, HIV rate and urban rate, all of which were standardized to zero mean and unit standard deviation.
The data was split randomly into a training set of 74 countries (70% of the 107 remaining after the same data cleaning as in previous assignments), with the remaining 33 left as a test set.  Cluster analyses for k = 1 to 9 produced the “elbow curve” below.  The value k = 3 was chosen (somewhat arbitrarily) as a reasonable trade-off between the number of clusters and the usefulness of the result.  It might also have been reasonable to select k = 2, given the marked decrease in slope there, which would divide countries into two levels (perhaps industrialized and non-industrialized), or k = 4, as Hans Rosling essentially suggests in his book “Factfulness”, where he divides development into four levels as a function of income per capita.  Something to keep in mind is that the culling of the data may have had the unfortunate effect of removing smaller or more remote countries, making this clustering less representative than it would be had all nations been considered rather than just those with complete Gapminder data.
[Figure: elbow curve of average within-cluster distance vs. number of clusters, k = 1 to 9]
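For reference, the elbow curve can also be computed more compactly with KMeans's inertia_ attribute. This is only a sketch, not the exact code used below (which averages unsquared euclidean distances, so the scale differs but the elbow shape is comparable), and the elbow_curve helper name is hypothetical:

# Sketch: elbow curve via KMeans.inertia_ (total squared distance of
# samples to their nearest cluster center), averaged per observation.
from sklearn.cluster import KMeans

def elbow_curve(X, max_k=9, seed=123):
    """Average within-cluster dispersion for k = 1..max_k."""
    return [KMeans(n_clusters=k, random_state=seed).fit(X).inertia_ / len(X)
            for k in range(1, max_k + 1)]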
A scatterplot of the 3-cluster solution seemed to support this choice, with three distinct groupings visible in the graph.
[Figure: scatterplot of the first two canonical variables for the 3-cluster solution, colored by cluster]
The three clusters split quite clearly into development levels (clusters 1, 0, 2, in ascending order of development).  Factors like CO2 emissions rose along this progression, with a huge jump from cluster 0 to cluster 2, as expected; HIV rate fell with ascending development; and income per person increased along the same progression.
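As a rough check of that ordering, the clusters can be sorted by their mean standardized income. This is a sketch assuming the merged_train DataFrame constructed in the full code below:

# Sketch: order clusters by mean standardized income per person,
# using the merged_train DataFrame built in the full code below.
means = merged_train.groupby('cluster').mean()
order = means['INCOMEPERPERSON'].sort_values().index.tolist()
print(order)  # for this run: [1, 0, 2], i.e. ascending development level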
Finally, an analysis of variance (ANOVA) was conducted to validate the clustering against an additional variable from the original Gapminder data set: life expectancy.  It followed the same trend, ascending in the same cluster order as income per person and CO2 emissions, as expected.  The highest-income group was also the most tightly clustered, perhaps indicating a sort of ceiling in development.
cluster    LIFEEXPECTANCY (mean)    (std)
0          72.747027                (6.426278)
1          63.802792                (8.949490)
2          80.559923                (1.450612)
The full code used in this analysis is provided below for reference:
# -*- coding: utf-8 -*-
"""
Created on Mon Jan 18 19:51:29 2016

@author: jrose01
modified by Russ 2019-05-17
"""
#%%
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
# note: sklearn.cross_validation was removed in scikit-learn 0.20;
# on newer versions import train_test_split from sklearn.model_selection
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
import os
""" Data Management """ if os.path.exists("Y:\Data\Course_Work\VCC_School_of_Data_Science\Machine_Learning_Data_Analysis"):    os.chdir("Y:\Data\Course_Work\VCC_School_of_Data_Science\Machine_Learning_Data_Analysis") elif os.path.exists('C:\\Users\\Russ\\Work\\VCCSDS\\Intro_Data_Science_Python'):    os.chdir('C:\\Users\\Russ\\Work\\VCCSDS\\Intro_Data_Science_Python')
gapminder_data = pd.read_csv("gapminder.csv") gapminder_data.drop(['country','oilperperson','armedforcesrate'],axis=1,inplace=True) #gapminder_data.set_index('country',inplace=True) gapminder_data.replace(to_replace=' ',value=np.NaN,inplace=True) # missing data is a space right now
#upper-case all DataFrame column names gapminder_data.columns = map(str.upper, gapminder_data.columns)
# Data Management data_clean = gapminder_data.dropna()
# subset clustering variables cluster=data_clean.copy() cluster.drop(['LIFEEXPECTANCY'],axis=1,inplace=True) cluster.describe()
# standardize clustering variables to have mean=0 and sd=1 for metric in cluster.columns:    cluster[metric]=preprocessing.scale(cluster[metric].astype('float64'))
# split data into train and test sets clus_train, clus_test = train_test_split(cluster, test_size=.3, random_state=123)
# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    # average distance of each observation to its nearest cluster center
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

"""
Plot average distance from observations to the cluster centroid
to use the Elbow Method to identify the number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
#%%
# Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# plot clusters on the first two principal components
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
""" BEGIN multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster """ # create a unique identifier variable from the index for the # cluster training data to merge with the cluster assignment variable clus_train.reset_index(level=0, inplace=True) # create a list that has the new index variable cluslist=list(clus_train['index']) # create a list of cluster assignments labels=list(model3.labels_) # combine index variable list with cluster assignment list into a dictionary newlist=dict(zip(cluslist, labels)) newlist # convert newlist dictionary to a dataframe newclus=DataFrame.from_dict(newlist, orient='index') newclus # rename the cluster assignment column newclus.columns = ['cluster']
# now do the same for the cluster assignment variable # create a unique identifier variable from the index for the # cluster assignment dataframe # to merge with cluster training data newclus.reset_index(level=0, inplace=True) # merge the cluster assignment dataframe with the cluster training variable dataframe # by the index variable merged_train=pd.merge(clus_train, newclus, on='index') merged_train.head(n=100) # cluster frequencies merged_train.cluster.value_counts()
""" END multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster """
# FINALLY calculate clustering variable means by cluster clustergrp = merged_train.groupby('cluster').mean() print ("Clustering variable means by cluster") print(clustergrp)
# validate clusters in training data by examining cluster differences in
# life expectancy using ANOVA; first merge LIFEEXPECTANCY with the
# clustering variables and cluster assignment data
lfe_data = data_clean['LIFEEXPECTANCY'].astype('float64')

# split life expectancy data into train and test sets
lfe_train, lfe_test = train_test_split(lfe_data, test_size=.3, random_state=123)
lfe_train1 = pd.DataFrame(lfe_train)
lfe_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(lfe_train1, merged_train, on='index')
sub1 = merged_train_all[['LIFEEXPECTANCY', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# ANOVA: does mean life expectancy differ across clusters?
lfemod = smf.ols(formula='LIFEEXPECTANCY ~ C(cluster)', data=sub1).fit()
print(lfemod.summary())

print('means for LIFEEXPECTANCY by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for LIFEEXPECTANCY by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

# post hoc Tukey HSD test for pairwise cluster differences
mc1 = multi.MultiComparison(sub1['LIFEEXPECTANCY'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Assignment 3
Based on the data set from Gapminder that aggregates statistics for a number of countries, 12 quantitative variables were taken as candidate predictors of life expectancy in a lasso regression analysis.
As with the previous assignments, the "oil per person" variable was removed due to lack of data, and other rows (country entries) with blank values were pruned from the data set in order to run the regression.  This may have had the unfortunate effect of disproportionately excluding smaller or less developed countries where not all categories could be measured.
The remaining predictors included factors like income per person, urban rate and internet use rate (gathered from World Bank data), as well as HIV rate from the UNAIDS database and female employment rate from the International Labour Organization.  All values used in the analysis were quantitative.  They were all standardized to zero mean and unit standard deviation, so that the lasso penalty weights all predictors equally.
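A quick way to confirm that kind of standardization is to check each scaled column directly. This is a minimal self-contained sketch with made-up toy data; note that preprocessing.scale divides by the population standard deviation (ddof=0):

# Sketch: verify each scaled column has mean ~0 and (population) sd ~1.
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0], 'y': [10.0, 20.0, 40.0, 80.0]})
scaled = df.apply(lambda col: preprocessing.scale(col.astype('float64')))
assert np.allclose(scaled.mean(), 0)
assert np.allclose(scaled.std(ddof=0), 1)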
107 countries were left after the data pruning.  74 of them (70%) were retained as training data and 33 (30%) were set aside as test data.  As in the example assignment, 10-fold cross-validation was used with the least angle regression (LARS) algorithm to fit a lasso regression model.  Of the 12 predictor variables, 7 were retained, with the following regression coefficients listed in descending order of magnitude:
HIVRATE has a coefficient of -3.6974851453693844
INTERNETUSERATE has a coefficient of 2.4136772162632414
INCOMEPERPERSON has a coefficient of 1.7322692274664198
EMPLOYRATE has a coefficient of -1.0766171430677287
URBANRATE has a coefficient of 0.8233909146021535
POLITYSCORE has a coefficient of 0.3457772667579592
RELECTRICPERPERSON has a coefficient of 0.02265736143873757
The remaining predictors, ALCCONSUMPTION, BREASTCANCERPER100TH, CO2EMISSIONS, FEMALEEMPLOYRATE and SUICIDEPER100TH, had their coefficients shrunk to zero by the lasso penalty and were dropped from the model.
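That split between retained and dropped predictors can be read directly off the fitted model's coefficients. This is a sketch assuming the model and predictors objects constructed in the full code below:

# Sketch: lasso zeroes out the coefficients of dropped predictors.
coefs = dict(zip(predictors.columns, model.coef_))
retained = sorted(((k, v) for k, v in coefs.items() if v != 0),
                  key=lambda kv: abs(kv[1]), reverse=True)
dropped = [k for k, v in coefs.items() if v == 0]
print(retained)  # the 7 predictors listed above, by magnitude
print(dropped)   # the 5 predictors shrunk to zero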
HIV rate is, as expected, strongly negatively associated with life expectancy, while internet use rate (a plausible proxy for development level) and income per person are both positively associated with it.
The resulting R-squared values for the training and test sets are 0.77 and 0.65 respectively, indicating a somewhat better fit on the training set (as expected).  The cross-validated mean squared error, averaged across the folds, converged to about 21, as indicated in the figure below.
[Figure: mean squared error on each cross-validation fold vs. -log(alpha), with the average across folds converging to about 21]
The full code used to generate the graph is provided as a reference below:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 14 16:26:46 2015

@author: jrose01
Modified by Russ 2019-05-08
"""
#%%
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
# note: sklearn.cross_validation was removed in scikit-learn 0.20;
# on newer versions import train_test_split from sklearn.model_selection
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
if os.path.exists("Y:\\Data\\Course_Work\\VCC_School_of_Data_Science\\Machine_Learning_Data_Analysis"):
    os.chdir("Y:\\Data\\Course_Work\\VCC_School_of_Data_Science\\Machine_Learning_Data_Analysis")
elif os.path.exists('C:\\Users\\Russ\\Work\\VCCSDS\\Intro_Data_Science_Python'):
    os.chdir('C:\\Users\\Russ\\Work\\VCCSDS\\Intro_Data_Science_Python')
# Load the dataset
gapminder_data = pd.read_csv("gapminder.csv")
gapminder_data.set_index('country', inplace=True)
gapminder_data.replace(to_replace=' ', value=np.NaN, inplace=True)  # missing data is a space right now
gapminder_data.drop(['oilperperson','armedforcesrate'], axis=1, inplace=True)

# upper-case all DataFrame column names
gapminder_data.columns = map(str.upper, gapminder_data.columns)

# Data Management: drop rows with missing values
data_clean = gapminder_data.dropna()

# select the target variable
target = data_clean['LIFEEXPECTANCY'].astype(dtype=np.float64)

#%%
# standardize predictors to have mean=0 and sd=1
predictors = data_clean.drop('LIFEEXPECTANCY', axis=1).copy()  # pandas DataFrame
from sklearn import preprocessing
for metric in predictors.columns:
    predictors[metric] = preprocessing.scale(predictors[metric].astype('float64'))
#%%
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.3, random_state=523)

# specify the lasso regression model, with 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
# print variable names and regression coefficients
print(dict(zip(predictors.columns, model.coef_)))

#%%
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
# (on newer scikit-learn, cv_mse_path_ is named mse_path_)
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
Assignment 2
A random forest was generated using the same Gapminder data as the previous assignment.  The questions to be addressed were which parameters are most strongly associated with life expectancy in this data set, and to what degree growing more trees in the forest helped.
Little improvement in accuracy was seen from increasing the number of trees in the forest, and this result was consistent from run to run.  The progression as a function of the number of trees is shown below:
[Figure: prediction accuracy on the test data vs. number of trees in the random forest]
The parameters most strongly associated with the result were (unsurprisingly) income per person and residential electricity use per person, and (surprisingly) the breast cancer rate per 100,000.  The full list of importance scores is included below:
incomeperperson - 0.3031136492034781
alcconsumption - 0.0171179450924944
breastcancerper100th - 0.14321355626569626
co2emissions - 0.02997916832579566
femaleemployrate - 0.033194948133106125
hivrate - 0.026379916171211486
internetuserate - 0.09698119380271526
polityscore - 0.06473411589000137
relectricperperson - 0.1718086730378602
suicideper100th - 0.02919669340767935
employrate - 0.02023633209289948
urbanrate - 0.06404380857706224
Surprisingly, metrics like alcohol consumption, HIV rate and suicide rate had lower importance scores.
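Since the importance scores come back in column order, ranking them makes the comparison above easier to read. This is a sketch assuming the fitted ExtraTreesClassifier (model) and the predictors DataFrame from the code below:

# Sketch: sort the importance scores in descending order.
ranked = sorted(zip(predictors.columns, model.feature_importances_),
                key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print('{} - {:.4f}'.format(name, score))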
The full code used to generate the results is given below.
""" Random Forest Code (https://www.coursera.org/learn/machine-learning-data-analysis/) Week 2, adapted by Russ 2019-05-03 """
from pandas import Series, DataFrame import pandas as pd import numpy as np import os import matplotlib.pylab as plt from sklearn.cross_validation import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import classification_report import sklearn.metrics # Feature Importance from sklearn import datasets from sklearn.ensemble import ExtraTreesClassifier
os.chdir("Y:\Data\Course_Work\VCC_School_of_Data_Science\Machine_Learning_Data_Analysis")
# Load the dataset
gapminder_data = pd.read_csv("gapminder.csv")
gapminder_data.set_index('country', inplace=True)
gapminder_data.replace(to_replace=' ', value=np.NaN, inplace=True)  # missing data is a space right now
gapminder_data.drop(['oilperperson','armedforcesrate'], axis=1, inplace=True)

data_clean = gapminder_data.dropna()
# Split into training and testing sets
predictors = data_clean.drop('lifeexpectancy', axis=1)  # pandas DataFrame
targets = data_clean['lifeexpectancy'].astype(dtype=np.float64) > 75  # pandas Series (True/False)

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

# inspect the shapes of the resulting sets (useful interactively)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
# Build model on training data
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
""" Running a different number of trees and see the effect of that on the accuracy of the prediction """
trees=range(25) accuracy=np.zeros(25)
for idx in range(len(trees)):   classifier=RandomForestClassifier(n_estimators=idx + 1)   classifier=classifier.fit(pred_train,tar_train)   predictions=classifier.predict(pred_test)   accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla() plt.plot(trees, accuracy) plt.gca().set_title('Random Forest for Life Expectancy >75') plt.gca().set_xlabel('Number of Trees in Analysis') plt.gca().set_ylabel('Accuracy of Prediction on Test Data') plt.savefig('Life_Expectancy_Above_75') #print(data_clean.drop('lifeexpectancy',axis=1).columns) for subject, importance in zip(data_clean.drop('lifeexpectancy',axis=1).columns,model.feature_importances_):    print(''.join([str(subject),' - ',str(importance)]))    #print()
Assignment 1
[Figure: decision tree for predicting life expectancy above 75 years]
A decision tree was generated based on data from Gapminder, attempting to predict whether a country would have a life expectancy of greater than 75 years from a number of statistics included in the data set provided in the course materials.
First, the data was culled in two ways.  The “Oil per capita” column was removed, as a significant number of the entries were missing and it wasn’t clear whether they were zeroes or missing data, so zero-filling them may not have been justified.  Next, rows with missing data were removed.  This has the unfortunate effect of filtering out smaller and more remote countries for which data acquisition is presumably more difficult, such as the Marshall Islands or Saint Vincent and the Grenadines.  However, because so much data is missing for these countries, they cannot be used for this task with our data set.
Next, the remaining categories were used to generate a decision tree in Python.  The categories (self-explanatory, listed by column name) are as follows:
['incomeperperson', 'alcconsumption', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate']
Repeated runs of the program demonstrated that incomeperperson was consistently the primary separator.  Some of the other splitting categories were a bit surprising and perhaps just artifacts of the chosen data set rather than good predictors.  In this particular run, an 80/20 train/test split was done on a pool of 107 countries.  The confusion matrix consisted of 12 true positives, 5 true negatives, 1 false negative and 4 false positives, for a 77% classification accuracy.  This isn’t a whole lot better than just guessing (64% of the countries fall in one category).
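As a quick sanity check, the 77% figure follows directly from the confusion-matrix counts reported above:

# Recompute the reported accuracy from the confusion-matrix counts above.
tp, tn, fn, fp = 12, 5, 1, 4      # values from this particular run
total = tp + tn + fn + fp         # 22 test countries (20% of 107, rounded up)
print((tp + tn) / total)          # 17 / 22, roughly 0.77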
The full code is included below as a reference:
""" Created on Sun Dec 13 21:12:54 2015 adapted by Russ 2019-05-03
@author: ldierker """
# -*- coding: utf-8 -*-
from pandas import Series, DataFrame import pandas as pd import numpy as np import os import matplotlib.pylab as plt from sklearn.cross_validation import train_test_split from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import classification_report import sklearn.metrics
os.chdir("Y:\\Data\\Course_Work\\VCC_School_of_Data_Science\\Machine_Learning_Data_Analysis")

"""
Data Engineering and Analysis
"""
# Load the dataset
gapminder_data = pd.read_csv("gapminder.csv")
gapminder_data.set_index('country', inplace=True)
gapminder_data.replace(to_replace=' ', value=np.NaN, inplace=True)  # missing data is a space right now
gapminder_data.drop(['oilperperson','armedforcesrate'], axis=1, inplace=True)

data_clean = gapminder_data.dropna()

# inspect the cleaned data (useful interactively)
data_clean.dtypes
data_clean.describe()

"""
Modeling and Prediction
"""
# Split into training and testing sets
predictors = data_clean.drop('lifeexpectancy', axis=1)  # pandas DataFrame
targets = data_clean['lifeexpectancy'].astype(dtype=np.float64) > 75  # pandas Series (True/False)

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.2)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

# Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# Display the decision tree
from sklearn import tree
from io import StringIO
import pydotplus

out = StringIO()
tree.export_graphviz(classifier, out_file=out, feature_names=list(predictors.columns))
graph = pydotplus.graph_from_dot_data(out.getvalue())
graph.write_png('Assignment1_moren75_testFeatuernames.png')
Introduction
Just seeing if this platform works so I can pass my class.