Day 3 of The Breadline Challenge
I am personally not finding it too hard to live on £2.50 a day, and yet people take on a dismayed look whenever I mention the challenge.
So what makes me the ideal candidate to undertake this kind of initiative?
Well, simply: my complete unfussiness coupled with my tolerance for food boredom.
When I was at university I developed the ability to survive on more or less two kinds of vegetables per week. I would go to a large supermarket, choose the biggest bag of the cheapest range of the cheapest kind of vegetable (for example, 2kg of Sainsbury's Basics carrots) and use it in a range of different dishes for the entire week.
Mind you, of course I don't love eating carrots 7 days in a row, and of course I feel bad for shopping at a supermarket without considering its despicable behaviour towards its suppliers... but I still do it when I get into my extreme money-saving mode.
It is not a healthy lifestyle either, but it is better than falling into the 'fruit and veg free' shopping behaviour most people who cannot afford fruit and veg are forced into. When you can get a bag of white buns for less than £1 and a single apple costs 50p, what are you going to do? You are going to consume loads of cheap, low-quality, processed food.
In the photo: the portions of chicken and celery soup I got out of a whole chicken (reduced price £2.35).
Day 2 of the Breadline Challenge
Today I want to talk about restaurants, for the simple reason that my boyfriend’s flatmate brought home a pile of sausages and a tray of focaccia from the place where he works.... so I am sorted protein-wise.
I don’t know much about food waste in the hospitality industry and I will not pretend I do. But I do know that food waste is happening, and that restaurant employees often prefer to throw away leftovers rather than run the risk of getting into trouble if they donate them or take them home.
I would love to find out more about how we can prevent food waste in this sector, but here the waste is so widespread and consumption needs to be so prompt that I have not heard of any game-changing solution.
A couple of years ago a friend of mine told me about Winnow Solutions, a technology that helps track food waste in restaurants, but since then it is the only tech firm I have heard of getting involved in this area.
But I am not sure the solution really lies in finding good ways to monitor waste. I once met a girl who told me that the bakery she worked for would bake bread all day long just to make the window shelves look full and attractive... When our marketing strategies involve creating waste, 'hard' solutions to food waste can only be implemented alongside a movement to change our own mindset.
The Breadline Challenge
I joined FoodCycle over two years ago after spending two months in Senegal, where the idea of throwing away food cannot even be conceived.
And yet FoodCycle does so much more than simply fight food waste! Our weekly meals bring communities together, alleviate social isolation, introduce like- and unlike-minded people to each other and, above all, reduce food poverty.
The UK bins around £17 billion worth of food a year, and yet in 2017 The Trussell Trust’s foodbank network distributed 586,907 three-day emergency food supplies to people in crisis.
I want to raise awareness of these two apparently contradictory issues, food poverty and food waste, by living on £2.50 worth of food per day for one week.
I will be posting (hopefully) daily updates on how the challenge is going here, and if you find me entertainingly crazy enough you can donate here:
https://uk.virginmoneygiving.com/MariaGiorda
Capstone Project - Preliminary Results
Happy Sunday :)
Today I will report the initial results I got from running bivariate analysis between my dependent variable (GDP per capita growth) and each of the independent variables considered (urbanisation rate, value added by agriculture, value of imports and value of exports). As all variables are quantitative, I used the Pearson correlation coefficient as the statistical test.
Here is the output I got for each bivariate association:
Association between growth in urban rate and GDP per capita growth:
r = -0.2923, p = 0.0303
Association between growth in value added by agriculture and GDP per capita growth:
r = 0.4768, p = 0.0002
Association between growth in value of imports and GDP per capita growth:
r = 0.1412, p = 0.3038
Association between growth in value of exports and GDP per capita growth:
r = -0.1636, p = 0.2326
The only significant relationship is that between GDP per capita growth and value added by agriculture. This can also be seen intuitively by looking at the scatterplots for each relationship: the one showing the relationship between agriculture and growth in GDP is the only one with a clear upward trend.
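For reference, a bivariate Pearson test like the ones above can be run with scipy.stats; the arrays below are hypothetical stand-ins for two of the indicators, not the actual World Bank data:

```python
import numpy as np
from scipy.stats import pearsonr

# hypothetical arrays standing in for two of the indicators above
urban_growth = np.array([1.2, 0.8, 2.5, 3.1, 0.5, 1.9, 2.2, 0.9])
gdp_growth = np.array([2.1, 1.0, 3.4, 4.0, 0.7, 2.6, 3.0, 1.1])

# pearsonr returns (correlation coefficient, two-sided p-value),
# the same tuple format shown in the outputs above
r, p = pearsonr(urban_growth, gdp_growth)
print(r, p)
```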
Capstone Project - Draft Methods
Happy Sunday!
I am using the data set provided by the World Bank, which reports information on 163 indicators of economic development for 248 countries.
The variables analysed are the following:
Dependent variable: GDP per capita growth (annual)
Independent variables 1:
- Urban population growth (% total)
- Agriculture, value added (annual % growth)
Independent variables 2:
- Imports of goods and services (% GDP)
- Exports of goods and services (% GDP)
I will first test whether there is a significant relationship between the dependent variable and each of the independent variables individually, and then two by two, as my objective is to identify which combination of the independent variables is significantly and positively associated with growth in GDP per capita.
Therefore, in the first part of my analysis I will apply 8 bivariate statistical tests (ANOVA and Pearson correlation), both testing the relationship between the quantitative response variable and the predictors left as quantitative variables, and testing the same relationship with the predictors converted into categorical explanatory variables.
Depending on the results, I will then add different moderators to strengthen my model and perform a regression analysis on my data set.
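Converting a quantitative predictor into a categorical one for the ANOVA side of those tests can be sketched as follows; the data frame, column names and bin labels here are hypothetical stand-ins, not the actual World Bank variables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(1)
# hypothetical stand-in data for two of the indicators
df = pd.DataFrame({
    'gdp_growth': rng.normal(2, 1, 120),
    'agri_growth': rng.normal(3, 2, 120),
})

# bin the quantitative predictor into a 3-level categorical variable
df['agri_cat'] = pd.qcut(df['agri_growth'], q=3, labels=['low', 'medium', 'high'])

# ANOVA: quantitative response, categorical explanatory variable
model = smf.ols('gdp_growth ~ C(agri_cat)', data=df).fit()
print(model.f_pvalue)
```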
Capstone Project - Introduction to Research Question
The association between growth in GDP per capita and indicators of urbanisation and economic self-sufficiency.
The purpose of my research is multi-fold:
1. My first objective is to investigate the association between growth in GDP per capita and two independent variables, annual growth in urban population and annual growth in value added by agriculture, both individually and jointly. The outcome of the first part of the analysis would be to identify which combination of annual growth in urban population and annual growth in value added by agriculture leads to the highest growth in GDP per capita.
2. My second objective is to investigate the association between GDP per capita and two independent variables, import and export of goods and services, both individually and jointly. The outcome of the second part of the analysis would be to identify which combination of imports and exports leads to the highest growth in GDP per capita.
Ultimately, through my investigation I want to understand how the above-mentioned 4 independent variables are jointly associated with growth in per capita GDP.
As someone deeply interested in Strategies for Sustainable Development, especially those involving the concepts of Short Supply Chains and Local Food Systems, I would like to understand what balance needs to be struck between agricultural and urban development.
I live in London and volunteer on a weekly basis for an organisation fighting food poverty and food waste. I believe that further research and investment in food security and self-sufficiency through the development of rural-urban linkages and peri-agriculture can help modern societies improve people’s well-being.
k-means clustering
Hello!
This week I am running a k-means cluster analysis in Python. The objective of the clustering is to understand which 'variable clusters' my target audiences, students with high vs low self-esteem, belong to.
I am going to use the Wave One, Add Health Survey data.
My response/target variable is self-esteem, which is defined based on the answer to the question: compared to people your age, how intelligent are you?
Target variable: a categorical 5-level variable, H1SE4, that I recoded into a binary categorical variable renamed SMART.
The candidate explanatory variables include race/ethnicity, relationship with parents, academic performance and leisure activities.
Explanatory Variables – Categorical
WHITE = H1GI6A 0=no 1=yes
BLACK = H1GI6B 0=no 1=yes
AMERICAN INDIAN = H1GI6C 0=no 1=yes
ASIAN = H1GI6D 0=no 1=yes
RELATIONSHIP WITH PARENTS = H1WP17F (have you had a personal talk with your parents in the past 7 days?) 0=no 1=yes
PERFORMANCE AT SCHOOL = H1ED3 (have you ever skipped a grade?) 0=no 1=yes
ARTS AND CRAFTS ACTIVITIES = H1DA2 (how many times in the past week did you do arts and crafts activities?) 0=none 1=1-2times 2=3-4times 3=5 or more times
TEAM SPORT = H1DA5 (how many times in the past week did you play a team sport?) 0=none 1=1-2times 2=3-4times 3=5 or more times
EXERCISE = H1DA6 (how many times in the past week did you exercise?) 0=none 1=1-2times 2=3-4times 3=5 or more times
TIME WITH FRIENDS = H1DA7 (how many times in the past week did you hang out with friends?) 0=none 1=1-2times 2=3-4times 3=5 or more times
TELEVISION = H1DA8 (how many hours a week do you watch television?) Range from 0 to 99
VIDEO GAMES = H1DA10 (how many hours a week do you play videogames?) Range from 0 to 99
SCRIPT 1
#CLUSTERING
#from pandas import Series, DataFrame
import pandas
import numpy
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  #sklearn.cross_validation is deprecated
from sklearn import preprocessing
from sklearn.cluster import KMeans
#load dataset
PATH = "D:\\AddHealth_Coursera"
data = pandas.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#Data management
#setting variables I will be working with to numeric
data['H1WP17F'] = pandas.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pandas.to_numeric(data['H1SE4'], errors='coerce')
data['H1GI6A'] = pandas.to_numeric(data['H1GI6A'], errors='coerce')
data['H1GI6B'] = pandas.to_numeric(data['H1GI6B'], errors='coerce')
data['H1GI6C'] = pandas.to_numeric(data['H1GI6C'], errors='coerce')
data['H1GI6D'] = pandas.to_numeric(data['H1GI6D'], errors='coerce')
data['H1ED3'] = pandas.to_numeric(data['H1ED3'], errors='coerce')
data['H1DA2'] = pandas.to_numeric(data['H1DA2'], errors='coerce')
data['H1DA5'] = pandas.to_numeric(data['H1DA5'], errors='coerce')
data['H1DA6'] = pandas.to_numeric(data['H1DA6'], errors='coerce')
data['H1DA7'] = pandas.to_numeric(data['H1DA7'], errors='coerce')
data['H1DA8'] = pandas.to_numeric(data['H1DA8'], errors='coerce')
data['H1DA10'] = pandas.to_numeric(data['H1DA10'], errors='coerce')
#data management section
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
c26 = sub1['H1WP17F'].value_counts(sort=False, dropna=False)
print(c26)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
c27 = sub1['H1SE4'].value_counts(sort=False, dropna=False)
print(c27)
sub1['H1GI6A'] = data['H1GI6A'].replace(6, numpy.nan)
sub1['H1GI6A'] = data['H1GI6A'].replace(8, numpy.nan)
c28 = sub1['H1GI6A'].value_counts(sort=False, dropna=False)
print(c28)
sub1['H1GI6B'] = data['H1GI6B'].replace(6, numpy.nan)
sub1['H1GI6B'] = data['H1GI6B'].replace(8, numpy.nan)
c29 = sub1['H1GI6B'].value_counts(sort=False, dropna=False)
print(c29)
sub1['H1GI6C'] = data['H1GI6C'].replace(6, numpy.nan)
sub1['H1GI6C'] = data['H1GI6C'].replace(8, numpy.nan)
c30 = sub1['H1GI6C'].value_counts(sort=False, dropna=False)
print(c30)
sub1['H1GI6D'] = data['H1GI6D'].replace(6, numpy.nan)
sub1['H1GI6D'] = data['H1GI6D'].replace(8, numpy.nan)
c31 = sub1['H1GI6D'].value_counts(sort=False, dropna=False)
print(c31)
sub1['H1ED3'] = data['H1ED3'].replace(6, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(8, numpy.nan)
c32 = sub1['H1ED3'].value_counts(sort=False, dropna=False)
print(c32)
sub1['H1DA2'] = data['H1DA2'].replace(6, numpy.nan)
sub1['H1DA2'] = data['H1DA2'].replace(8, numpy.nan)
c33 = sub1['H1DA2'].value_counts(sort=False, dropna=False)
print(c33)
sub1['H1DA5'] = data['H1DA5'].replace(6, numpy.nan)
sub1['H1DA5'] = data['H1DA5'].replace(8, numpy.nan)
c34 = sub1['H1DA5'].value_counts(sort=False, dropna=False)
print(c34)
sub1['H1DA6'] = data['H1DA6'].replace(6, numpy.nan)
sub1['H1DA6'] = data['H1DA6'].replace(8, numpy.nan)
c35 = sub1['H1DA6'].value_counts(sort=False, dropna=False)
print(c35)
sub1['H1DA7'] = data['H1DA7'].replace(6, numpy.nan)
sub1['H1DA7'] = data['H1DA7'].replace(8, numpy.nan)
c36 = sub1['H1DA7'].value_counts(sort=False, dropna=False)
print(c36)
sub1['H1DA8'] = data['H1DA8'].replace(996, numpy.nan)
sub1['H1DA8'] = data['H1DA8'].replace(998, numpy.nan)
c37 = sub1['H1DA8'].value_counts(sort=False, dropna=False)
print(c37)
sub1['H1DA10'] = data['H1DA10'].replace(996, numpy.nan)
sub1['H1DA10'] = data['H1DA10'].replace(998, numpy.nan)
c38 = sub1['H1DA10'].value_counts(sort=False, dropna=False)
print(c38)
#collate 5-level response variable into a 2-level one
def SMART(row):
if row['H1SE4'] <= 3 :
return 0
elif row['H1SE4'] > 3 :
return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
d2 = sub1.groupby('SMART').size()
print(d2)
#convert_objects is deprecated; use pandas.to_numeric instead
sub1['SMART'] = pandas.to_numeric(sub1['SMART'], errors='coerce')
#create a clean data frame that drops all NAs
data_clean = sub1.dropna()
#subset clustering variables
cluster = data_clean[['SMART','H1WP17F','H1GI6A','H1GI6B','H1GI6C','H1ED3','H1DA5','H1DA2','H1DA6','H1DA7','H1DA8','H1DA10']]
cluster.describe()
#standardize clustering variables to have mean=0 and sd=1
clustervar = cluster.copy()
clustervar['SMART'] = preprocessing.scale(clustervar['SMART'].astype('float64'))
clustervar['H1WP17F'] = preprocessing.scale(clustervar['H1WP17F'].astype('float64'))
clustervar['H1GI6A'] = preprocessing.scale(clustervar['H1GI6A'].astype('float64'))
clustervar['H1GI6B'] = preprocessing.scale(clustervar['H1GI6B'].astype('float64'))
clustervar['H1GI6C'] = preprocessing.scale(clustervar['H1GI6C'].astype('float64'))
clustervar['H1ED3'] = preprocessing.scale(clustervar['H1ED3'].astype('float64'))
clustervar['H1DA2'] = preprocessing.scale(clustervar['H1DA2'].astype('float64'))
clustervar['H1DA5'] = preprocessing.scale(clustervar['H1DA5'].astype('float64'))
clustervar['H1DA6'] = preprocessing.scale(clustervar['H1DA6'].astype('float64'))
clustervar['H1DA7'] = preprocessing.scale(clustervar['H1DA7'].astype('float64'))
clustervar['H1DA8'] = preprocessing.scale(clustervar['H1DA8'].astype('float64'))
clustervar['H1DA10'] = preprocessing.scale(clustervar['H1DA10'].astype('float64'))
#split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
#k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(numpy.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])
#plot average distance from observations to the cluster centroids and use the Elbow Method to identify the number of clusters to choose
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow method')
OUTPUT 1
The plot shows the average minimum distance of the observations from the cluster centroids for each number of clusters. We can see the average distance decreasing as the number of clusters increases.
Since the goal of cluster analysis is to minimize the distance between observations and their assigned clusters, we want to choose the fewest number of clusters that still provides a low average distance. What we're looking for in this plot is a bend in the elbow that shows where the average distance might be leveling off, such that adding more clusters doesn't decrease the average distance by much. There appear to be bends in the line at 2, 3 and 6 clusters, but it's not very clear.
SCRIPT 2
#Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)
#plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of canonical variables for 3 clusters')
plt.show()
OUTPUT 2
Here is the scatter plot. Two of the clusters are densely packed, which would suggest high within-cluster correlation and low within-cluster variance, and there is a good deal of overlap between them.
The 3-cluster solution shows better separation, but the observations are more spread out, indicating lower within-cluster correlation and higher within-cluster variance.
SCRIPT 3
#BEGIN multiple steps to merge cluster assignment with clustering variables to examine cluster variable mean by cluster
#create a unique identifier variable from the index to merge training data with cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
#create a list that has the new index variable
cluslist = list(clus_train['index'])
#create a list of cluster assignments
labels = list(model3.labels_)
#combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
newlist
#convert newlist dictionary to a dataframe
newclus = pandas.DataFrame.from_dict(newlist, orient='index')
newclus
#now the same for the cluster assignment variable
#rename the cluster assignment column
newclus.columns = ['cluster']
#create a unique identifier variable from the index for the cluster assignment dataframe to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
#merge the cluster assignment dataframe with the cluster training variable dataframe by the index variable
merged_train = pandas.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
#cluster frequencies
merged_train.cluster.value_counts()
#END multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster
#calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
OUTPUT 3
Clustering variable means by cluster
               index     SMART   H1WP17F    H1GI6A    H1GI6B    H1GI6C
cluster
0        3317.145035 -0.019567  0.012916  0.715419 -0.572201 -0.047231
1        2809.145455 -0.009700 -0.050558 -0.437236  0.238518  0.111593
2        3202.828512  0.021929 -0.016642 -1.326468  1.045616  0.088115

            H1ED3     H1DA5     H1DA2     H1DA6     H1DA7     H1DA8    H1DA10
cluster
0       -0.158047 -0.006461  0.040623 -0.019016  0.056149 -0.129638 -0.055092
1        6.327226 -0.069347  0.101848  0.016019 -0.001742 -0.091051  0.083528
2       -0.158047 -0.001304 -0.079557  0.089569 -0.074709  0.226008  0.085216
Adolescents in the third cluster, cluster 2, had the highest likelihood of having high self-esteem (highest SMART mean). The ones in cluster 0 had the lowest, yet they were also the ones with the best parent relationships (highest H1WP17F mean).
Strangely enough, my dataset did not have a GPA column; however, the code below is what I would have used to validate the clusters in the training data by examining cluster differences in GPA using ANOVA.
CODE 4:
#validate clusters by examining cluster differences in GPA
gpa_data = data_clean['GPA1']
#split GPA data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1 = pandas.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pandas.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())
print('means for GPA by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)
print('standard deviations for GPA by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)
mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Lasso Regression
Hello!!
This week I am working with lasso regression in order to understand which variables are most strongly associated with my target audience, students with high vs low self-esteem.
I am going to use the Wave One, Add Health Survey data.
My response/target variable is self-esteem, which is defined based on the answer to the question: compared to people your age, how intelligent are you?
Target variable: a categorical 5-level variable, H1SE4, that I recoded into a binary categorical variable renamed SMART.
The candidate explanatory variables include race/ethnicity, relationship with parents and academic performance.
Explanatory Variables – Categorical
WHITE = H1GI6A 0=no 1=yes
BLACK = H1GI6B 0=no 1=yes
AMERICAN INDIAN = H1GI6C 0=no 1=yes
ASIAN = H1GI6D 0=no 1=yes
RELATIONSHIP WITH PARENTS = H1WP17F (have you had a personal talk with your parents in the past 7 days?) 0=no 1=yes
PERFORMANCE AT SCHOOL = H1ED3 (have you ever skipped a grade?) 0=no 1=yes
I am going to test a lasso regression model using Python.
CODE
#from pandas import Series, DataFrame
import pandas
import numpy
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  #sklearn.cross_validation is deprecated
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import sklearn.metrics
import io
from sklearn import tree
from sklearn.linear_model import LassoLarsCV
#load dataset
PATH = "D:\\AddHealth_Coursera"
data = pandas.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#Data management
#setting variables I will be working with to numeric
data['H1WP17F'] = pandas.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pandas.to_numeric(data['H1SE4'], errors='coerce')
data['H1GI6A'] = pandas.to_numeric(data['H1GI6A'], errors='coerce')
data['H1GI6B'] = pandas.to_numeric(data['H1GI6B'], errors='coerce')
data['H1GI6C'] = pandas.to_numeric(data['H1GI6C'], errors='coerce')
data['H1GI6D'] = pandas.to_numeric(data['H1GI6D'], errors='coerce')
data['H1ED3'] = pandas.to_numeric(data['H1ED3'], errors='coerce')
#data management section
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
c26 = sub1['H1WP17F'].value_counts(sort=False, dropna=False)
print(c26)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
c27 = sub1['H1SE4'].value_counts(sort=False, dropna=False)
print(c27)
sub1['H1GI6A'] = data['H1GI6A'].replace(6, numpy.nan)
sub1['H1GI6A'] = data['H1GI6A'].replace(8, numpy.nan)
c28 = sub1['H1GI6A'].value_counts(sort=False, dropna=False)
print(c28)
sub1['H1GI6B'] = data['H1GI6B'].replace(6, numpy.nan)
sub1['H1GI6B'] = data['H1GI6B'].replace(8, numpy.nan)
c29 = sub1['H1GI6B'].value_counts(sort=False, dropna=False)
print(c29)
sub1['H1GI6C'] = data['H1GI6C'].replace(6, numpy.nan)
sub1['H1GI6C'] = data['H1GI6C'].replace(8, numpy.nan)
c30 = sub1['H1GI6C'].value_counts(sort=False, dropna=False)
print(c30)
sub1['H1GI6D'] = data['H1GI6D'].replace(6, numpy.nan)
sub1['H1GI6D'] = data['H1GI6D'].replace(8, numpy.nan)
c31 = sub1['H1GI6D'].value_counts(sort=False, dropna=False)
print(c31)
sub1['H1ED3'] = data['H1ED3'].replace(6, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(8, numpy.nan)
c32 = sub1['H1ED3'].value_counts(sort=False, dropna=False)
print(c32)
#collate 5-level response variable into a 2-level one
def SMART(row):
if row['H1SE4'] <= 3 :
return 0
elif row['H1SE4'] > 3 :
return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
d2 = sub1.groupby('SMART').size()
print(d2)
#convert_objects is deprecated; use pandas.to_numeric instead
sub1['SMART'] = pandas.to_numeric(sub1['SMART'], errors='coerce')
#create a clean data frame that drops all NAs
data_clean = sub1.dropna()
#Next I will create 2 data frames
#1 - Dataframe including only the predictor variables
predvar= data_clean[['H1GI6A', 'H1GI6B', 'H1GI6C', 'H1GI6D', 'H1WP17F', 'H1ED3']]
#2 - Dataframe including only self-esteem response variable
target = data_clean.SMART
#standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
from sklearn import preprocessing
predictors['H1GI6A'] = preprocessing.scale(predictors['H1GI6A'].astype('float64'))
predictors['H1GI6B'] = preprocessing.scale(predictors['H1GI6B'].astype('float64'))
predictors['H1GI6C'] = preprocessing.scale(predictors['H1GI6C'].astype('float64'))
predictors['H1GI6D'] = preprocessing.scale(predictors['H1GI6D'].astype('float64'))
predictors['H1ED3'] = preprocessing.scale(predictors['H1ED3'].astype('float64'))
predictors['H1WP17F'] = preprocessing.scale(predictors['H1WP17F'].astype('float64'))
#split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
#specify the Lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
#print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
OUTPUT 1
Out[3]:
{'H1ED3': -0.00036269558362801561,
 'H1GI6A': 0.033411357878996566,
 'H1GI6B': 0.060746780876457317,
 'H1GI6C': -0.0078545600538546621,
 'H1GI6D': 0.003423440826595485,
 'H1WP17F': 0.01843944582393145}
None of my predictors has a regression coefficient equal to zero, which means that all 6 variables were retained in the final model. H1GI6B (black ethnicity) and H1GI6A (white ethnicity) were the variables most strongly associated with self-esteem.
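For contrast, here is a tiny synthetic sketch (hypothetical data, not the Add Health file) of the selection behaviour described above: the lasso penalty shrinks the coefficient of an uninformative predictor towards zero while keeping the informative one large.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n = 200
x_signal = rng.normal(size=n)   # predictor truly related to the response
x_noise = rng.normal(size=n)    # unrelated predictor
y = 2.0 * x_signal + rng.normal(scale=0.5, size=n)

X = np.column_stack([x_signal, x_noise])
model = Lasso(alpha=0.3).fit(X, y)

# the penalty shrinks the uninformative coefficient to (or very near) zero
# while the informative one stays large
print(model.coef_)
```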
CODE 2
#plot coefficient progression
m_log_alphas = -numpy.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-numpy.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
OUTPUT 2
The plot shows the relative importance of each predictor at every step of the selection process: how the regression coefficients changed as new predictors were added to the model, and the point at which each predictor entered the model. The dark green line has the largest regression coefficient (black ethnicity), and it was the first predictor to enter the model.
CODE 3
#plot mean squared error for each fold
m_log_alphascv = -numpy.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-numpy.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
OUTPUT 3
We can see that there is variability across the individual cross-validation folds in the training data set, but apart from 3 cases, the change in mean squared error as predictors enter the model follows quite a similar pattern: MSE decreases at the start and then levels off, as adding more predictors to the model does not lead to much further reduction in the mean squared error.
CODE 4
#MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('Training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
#R-squared from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
OUTPUT 4
Training data MSE
0.243877161312
test data MSE
0.245976658434
training data R-square
0.0105579580738
test data R-square
0.00393038470188
The test mean square error is very close to the training mean square error, suggesting that prediction accuracy was pretty stable across the two data sets.
The R-squared values are 0.011 and 0.004, indicating that the selected model explained only about 1% and 0.4% of the variance in self-esteem for the training and test sets respectively.
Building a random forest with Python
Hello!!!
This week I am generating a random forest with Python.
I am going to use the Wave One, Add Health Survey data.
My response/target variable is self-esteem, which is defined based on the answer to the question: compared to people your age, how intelligent are you?
Target variable: a categorical 5-level variable, H1SE4, that I recoded into a binary categorical variable renamed SMART.
The candidate explanatory variables include race/ethnicity, relationship with parents and academic performance.
Explanatory Variables – Categorical:
WHITE = H1GI6A 0=no 1=yes
BLACK = H1GI6B 0=no 1=yes
AMERICAN INDIAN = H1GI6C 0=no 1=yes
ASIAN = H1GI6D 0=no 1=yes
RELATIONSHIP WITH PARENTS = H1WP17F (have you had a personal talk with your parents in the past 7 days?) 0=no 1=yes
PERFORMANCE AT SCHOOL = H1ED3 (have you ever skipped a grade?) 0=no 1=yes
CODE – Part 1
#week 2 - import relevant libraries
import pandas
import numpy
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  #sklearn.cross_validation is deprecated
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import sklearn.metrics
import io
from sklearn import tree
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
PATH = "D:\\AddHealth_Coursera"
data = pandas.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric
data['H1WP17F'] = pandas.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pandas.to_numeric(data['H1SE4'], errors='coerce')
data['H1GI6A'] = pandas.to_numeric(data['H1GI6A'], errors='coerce')
data['H1GI6B'] = pandas.to_numeric(data['H1GI6B'], errors='coerce')
data['H1GI6C'] = pandas.to_numeric(data['H1GI6C'], errors='coerce')
data['H1GI6D'] = pandas.to_numeric(data['H1GI6D'], errors='coerce')
data['H1ED3'] = pandas.to_numeric(data['H1ED3'], errors='coerce')
#data management section
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
c26 = sub1['H1WP17F'].value_counts(sort=False, dropna=False)
print(c26)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
c27 = sub1['H1SE4'].value_counts(sort=False, dropna=False)
print(c27)
sub1['H1GI6A'] = data['H1GI6A'].replace(6, numpy.nan)
sub1['H1GI6A'] = data['H1GI6A'].replace(8, numpy.nan)
c28 = sub1['H1GI6A'].value_counts(sort=False, dropna=False)
print(c28)
sub1['H1GI6B'] = data['H1GI6B'].replace(6, numpy.nan)
sub1['H1GI6B'] = data['H1GI6B'].replace(8, numpy.nan)
c29 = sub1['H1GI6B'].value_counts(sort=False, dropna=False)
print(c29)
sub1['H1GI6C'] = data['H1GI6C'].replace(6, numpy.nan)
sub1['H1GI6C'] = data['H1GI6C'].replace(8, numpy.nan)
c30 = sub1['H1GI6C'].value_counts(sort=False, dropna=False)
print(c30)
sub1['H1GI6D'] = data['H1GI6D'].replace(6, numpy.nan)
sub1['H1GI6D'] = data['H1GI6D'].replace(8, numpy.nan)
c31 = sub1['H1GI6D'].value_counts(sort=False, dropna=False)
print(c31)
sub1['H1ED3'] = data['H1ED3'].replace(6, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(8, numpy.nan)
c32 = sub1['H1ED3'].value_counts(sort=False, dropna=False)
print(c32)
#collate 5-level response variable into a 2-level one
def SMART(row):
    if row['H1SE4'] <= 3:
        return 0
    elif row['H1SE4'] > 3:
        return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
d2 = sub1.groupby('SMART').size()
print(d2)
sub1['SMART'] = pandas.to_numeric(sub1['SMART'], errors='coerce')
#create a clean data frame that drops all NAs
data_clean = sub1.dropna()
data_clean.dtypes
data_clean.describe()
#set explanatory and response variables and set training sample
predictors = data_clean[['H1GI6A', 'H1GI6B', 'H1GI6C', 'H1GI6D', 'H1WP17F', 'H1ED3']]
target = data_clean.SMART
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
pred_train.shape
print(pred_train.shape)
pred_test.shape
print(pred_test.shape)
tar_train.shape
print(tar_train.shape)
tar_test.shape
print(tar_test.shape)
#build model on training data
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier (n_estimators=6)
classifier=classifier.fit(pred_train, tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score (tar_test, predictions)
OUTPUT confusion matrix and accuracy score:
(3655, 6)
(2437, 6)
(3655,)
(2437,)
Out[1]: 0.55642183011899882
For the confusion matrix, the true negatives and the true positives sit on the diagonal, while the off-diagonal cells hold the false positives and false negatives. The overall accuracy for the forest is about 0.56, so 56% of the individuals were classified correctly as students with high self-esteem or students with low self-esteem.
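As a quick sanity check, accuracy is just the sum of the diagonal of the confusion matrix divided by the total count. A minimal sketch with hypothetical cell counts (the actual matrix values are not pasted above):

```python
import numpy as np

# hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class
cm = np.array([[150,  900],    # true low self-esteem
               [180, 1207]])   # true high self-esteem

# accuracy = correctly classified (diagonal) over all test observations
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # ≈ 0.557, i.e. roughly the 0.56 reported above
```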
CODE - PART 2:
#fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
#display the relative importance of each attribute
print(model.feature_importances_)
OUTPUT:
The variables are listed in the order they have been named earlier in the code so: 'H1GI6A' (white), 'H1GI6B' (black), 'H1GI6C' (american indian), 'H1GI6D' (asian), 'H1WP17F' (relationship with parents), 'H1ED3' (academic performance).
[ 0.24377588 0.32035186 0.0894001 0.12651909 0.14493537 0.07501769]
As we can see, the variable with the highest importance score is black ethnicity (0.32) and the variable with the lowest importance score is academic performance (0.075).
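To avoid misreading the ordering, the importances can be paired with the predictor names and ranked. A small sketch using the values printed above:

```python
predictor_names = ['H1GI6A', 'H1GI6B', 'H1GI6C', 'H1GI6D', 'H1WP17F', 'H1ED3']
importances = [0.24377588, 0.32035186, 0.0894001, 0.12651909, 0.14493537, 0.07501769]

# sort predictors by importance, highest first
ranked = sorted(zip(predictor_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```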
CODE - PART 3:
The third part of the code grows random forests with 1 to 6 trees and records the accuracy score for each forest size.
#impact of tree size on prediction accuracy
trees=range(6)
accuracy=numpy.zeros(6)
for idx in range(len(trees)):
    classifier=RandomForestClassifier(n_estimators=idx + 1)
    classifier=classifier.fit(pred_train,tar_train)
    predictions=classifier.predict(pred_test)
    accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)
OUTPUT:
With only one tree the accuracy is about 56.4%, and it barely changes (around 56.3%) as the number of trees increases. Since the accuracy scores are so close to each other, it may be appropriate to interpret a single decision tree for these data.
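The accuracy-versus-trees relationship can be visualised with matplotlib. Since the AddHealth data cannot be reproduced here, this sketch uses a synthetic classification dataset as a stand-in:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for the AddHealth predictors and target
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

trees = range(1, 7)
accuracy = np.zeros(len(trees))
for idx, n in enumerate(trees):
    clf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    accuracy[idx] = accuracy_score(y_test, clf.predict(X_test))

plt.plot(list(trees), accuracy)
plt.xlabel("number of trees")
plt.ylabel("accuracy")
plt.savefig("trees_vs_accuracy.png")
```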
Machine Learning.1 - Decision Tree
Hello!!
So for this week I am building a classification tree grown with the data from the AddHealth data set.
My response/target variable is self-esteem, defined based on the answer to the question "Compared with people your age, how intelligent are you?"
Target Variable: a categorical multi-level variable, H1SE4, which I recoded into a binary categorical variable renamed SMART.
The candidate explanatory variables include four ethnicity indicators and relationship with parents.
Explanatory Variables – Categorical:
WHITE = H1GI6A 0=no 1=yes
BLACK = H1GI6B 0=no 1=yes
AMERICAN INDIAN = H1GI6C 0=no 1=yes
ASIAN = H1GI6D 0=no 1=yes
RELATIONSHIP WITH PARENTS = H1WP17F (have you had a personal talk with your parents in the past 7 days?) 0=no 1=yes
CODE
#week 1 - import relevant libraries
import pandas
import numpy
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import sklearn.metrics
import io
from sklearn import tree
PATH = "D:\\AddHealth_Coursera"
data = pandas.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric
data['H1WP17F'] = pandas.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pandas.to_numeric(data['H1SE4'], errors='coerce')
data['H1GI6A'] = pandas.to_numeric(data['H1GI6A'], errors='coerce')
data['H1GI6B'] = pandas.to_numeric(data['H1GI6B'], errors='coerce')
data['H1GI6C'] = pandas.to_numeric(data['H1GI6C'], errors='coerce')
data['H1GI6D'] = pandas.to_numeric(data['H1GI6D'], errors='coerce')
sub1=data
#data management
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
c26 = sub1['H1WP17F'].value_counts(sort=False, dropna=False)
print(c26)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
c27 = sub1['H1SE4'].value_counts(sort=False, dropna=False)
print(c27)
sub1['H1GI6A'] = data['H1GI6A'].replace(6, numpy.nan)
sub1['H1GI6A'] = data['H1GI6A'].replace(8, numpy.nan)
c28 = sub1['H1GI6A'].value_counts(sort=False, dropna=False)
print(c28)
sub1['H1GI6B'] = data['H1GI6B'].replace(6, numpy.nan)
sub1['H1GI6B'] = data['H1GI6B'].replace(8, numpy.nan)
c29 = sub1['H1GI6B'].value_counts(sort=False, dropna=False)
print(c29)
sub1['H1GI6C'] = data['H1GI6C'].replace(6, numpy.nan)
sub1['H1GI6C'] = data['H1GI6C'].replace(8, numpy.nan)
c30 = sub1['H1GI6C'].value_counts(sort=False, dropna=False)
print(c30)
sub1['H1GI6D'] = data['H1GI6D'].replace(6, numpy.nan)
sub1['H1GI6D'] = data['H1GI6D'].replace(8, numpy.nan)
c31 = sub1['H1GI6D'].value_counts(sort=False, dropna=False)
print(c31)
#collate 5-level response variable into a 2-level one
def SMART(row):
    if row['H1SE4'] <= 3:
        return 0
    elif row['H1SE4'] > 3:
        return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
d2 = sub1.groupby('SMART').size()
print(d2)
#create a clean data frame that drops all NAs
sub1['SMART'] = pandas.to_numeric(sub1['SMART'], errors='coerce')
data_clean = sub1.dropna()
data_clean.dtypes
data_clean.describe()
#split into training and testing sets
predictors = data_clean[['H1GI6A', 'H1GI6B', 'H1GI6C', 'H1GI6D', 'H1WP17F']]
target = data_clean.SMART
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
pred_train.shape
print(pred_train.shape)
pred_test.shape
print(pred_test.shape)
tar_train.shape
print(tar_train.shape)
tar_test.shape
print(tar_test.shape)
#building model on training data
classifier=tree.DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score (tar_test, predictions)
OUTPUT confusion matrix and accuracy score:
The diagonal entries, 119 and 1261, show the number of true negatives and the number of true positives for self-esteem, respectively.
I also looked at the accuracy score, which is approximately 0.58, suggesting that the decision tree model classified about 58% of the sample correctly as students with either high or low self-esteem.
#displaying decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
Img=(Image(graph.create_png()))
with open('picture_out1.png', 'wb') as f:
    f.write(graph.create_png())
Output:
(3656, 5)
(2438, 5)
(3656,)
(2438,)
Out[2]: 0.57629204265791634
Decision Tree:
The resulting tree starts with a split on X[1], our second explanatory variable, black ethnicity. If the value is less than 0.5 (i.e. not black ethnicity), the observations move to the left side of the split; these include 2773 individuals out of the initial training sample of 3656.
From this node, another split is made on X[4], relationship with parents: among the individuals not of black ethnicity, 1095 have not had a recent personal talk with their parents.
Following the terminal nodes of the tree, we can see that among those who are not of black ethnicity and have not had a personal talk with their parents, 13 have low self-esteem and the others high.
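The split structure described above can also be read programmatically from the fitted classifier's tree_ attribute instead of from the rendered image. A minimal sketch on a synthetic dataset (not the AddHealth data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

# synthetic stand-in: 5 predictors, binary target
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

t = clf.tree_
# node 0 is the root; feature/threshold describe the first split
print("root splits on X[%d] at %.2f" % (t.feature[0], t.threshold[0]))
print("samples reaching each node:", list(t.n_node_samples))
```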
Logistic regression - Background more influential than relationship with parents in adolescents’ perception of self-esteem
I am going to test a logistic regression using the following variables from the AddHealth codebook.
1. Primary Explanatory variable :
A two level categorical variable = (coded as H1WP17F) = In the past week have you had a talk with your parents about a personal problem you are having?
With 0=No and 1=Yes
2. Response variable :
A 6-level categorical variable = (coded as H1SE4) = Compared with other people your age, how intelligent are you? With 1=moderately below average, 2=slightly below average, 3=average, 4=slightly above average, 5=moderately above average, 6=extremely above average
Because this response variable has more than 2 categories I have to apply some data management.
Confounding Variables:
A. A two level categorical variable = (coded as H1RR1) = Have you been in a romantic relationship in the last 18 months?
With 0=No and 1=Yes
B. A two level categorical variable = (coded as H1ED3) = Have you ever skipped a grade?
With 0=No and 1=Yes
C. A two level categorical variable = (coded as H1GI4) = Are you of Hispanic or Latino origin?
With 0=No and 1=Yes
So, here is the code:
import pandas as pds
import numpy
import statsmodels.formula.api as smf
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pds.to_numeric(data['H1SE4'], errors='coerce')
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(6, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(8, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(9, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(6, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(8, numpy.nan)
sub1['H1GI4'] = data['H1GI4'].replace(6, numpy.nan)
sub1['H1GI4'] = data['H1GI4'].replace(8, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(3, numpy.nan)
def SMART(row):
    if row['H1SE4'] == 1:
        return 0
    elif row['H1SE4'] == 2:
        return 0
    else:
        return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
#logistic regression
lreg1 = smf.logit(formula = 'SMART ~ H1WP17F', data = sub1).fit()
print (lreg1.summary())
Optimization terminated successfully.
Current function value: 0.230118
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: SMART No. Observations: 6123
Model: Logit Df Residuals: 6121
Method: MLE Df Model: 1
Date: Thu, 23 Mar 2017 Pseudo R-squ.: 0.001131
Time: 22:53:45 Log-Likelihood: -1409.0
converged: True LL-Null: -1410.6
LLR p-value: 0.07404
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.6569 0.066 40.283 0.000 2.528 2.786
H1WP17F 0.1984 0.112 1.771 0.077 -0.021 0.418
==============================================================================
While the earlier regression showed a significant association, collapsing my response variable into 2 levels has made the association non-significant: p > 0.05 (p = 0.077).
There is no statistically significant association between having personal conversations with parents and having high feelings of self-esteem.
I am nevertheless going to calculate the odds ratios.
#odds ratios
print ('Odds Ratios')
print (numpy.exp(lreg1.params))
Odds Ratios
Intercept 14.252033
H1WP17F 1.219464
Adolescents who have personal conversations with their parents have 1.22 times the odds of high self-esteem (though, as noted above, this is not statistically significant).
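The odds ratio is simply the exponential of the logit coefficient, so the 1.22 (and its confidence interval) can be checked by hand from the regression table:

```python
import numpy as np

coef = 0.1984                      # H1WP17F coefficient from the table above
odds_ratio = np.exp(coef)
ci_lower, ci_upper = np.exp(-0.021), np.exp(0.418)  # exponentiated CI bounds
print(round(odds_ratio, 3), round(ci_lower, 3), round(ci_upper, 3))
```

To rounding, this reproduces the Odds Ratios and confidence interval output above.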
Confidence Intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
Lower CI Upper CI OR
Intercept 12.523753 16.218815 14.252033
H1WP17F 0.979081 1.518867 1.219464
Let’s add confounding variables one by one
A. H1RR1 (romantic relationship in last 18 months)
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.6136 0.083 31.364 0.000 2.450 2.777
H1WP17F 0.1900 0.114 1.669 0.095 -0.033 0.413
H1RR1 0.1055 0.107 0.983 0.326 -0.105 0.316
==============================================================================
Odds Ratios
Intercept 13.647537
H1WP17F 1.209201
H1RR1 1.111267
dtype: float64
Lower CI Upper CI OR
Intercept 11.591067 16.068863 13.647537
H1WP17F 0.967442 1.511374 1.209201
H1RR1 0.900408 1.371506 1.111267
No confounding effect is associated with having been (or still being) in a romantic relationship.
B. H1ED3 – Having skipped a grade
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.6146 0.084 31.259 0.000 2.451 2.779
H1WP17F 0.1901 0.114 1.670 0.095 -0.033 0.413
H1ED3 0.1032 0.108 0.953 0.341 -0.109 0.316
==============================================================================
Odds Ratios
Intercept 13.662013
H1WP17F 1.209350
H1ED3 1.108762
dtype: float64
Lower CI Upper CI OR
Intercept 11.596229 16.095803 13.662013
H1WP17F 0.967569 1.511548 1.209350
H1ED3 0.896639 1.371069 1.108762
No confounding effect is associated with having skipped a grade.
C. H1GI4 – Hispanic or Latino origin
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.7602 0.071 38.685 0.000 2.620 2.900
H1WP17F 0.1730 0.113 1.537 0.124 -0.048 0.393
H1GI4 -0.5985 0.141 -4.251 0.000 -0.874 -0.323
==============================================================================
Odds Ratios
Intercept 15.802544
H1WP17F 1.188843
H1GI4 0.549660
dtype: float64
Lower CI Upper CI OR
Intercept 13.740246 18.174376 15.802544
H1WP17F 0.953576 1.482155 1.188843
H1GI4 0.417130 0.724297 0.549660
As we can see from p < 0.05, there is a significant NEGATIVE association between feelings of self-esteem and coming from a Latino background.
This third confounding variable is actually the only one significantly associated with the response variable. As we can see from the odds ratio (0.55), an adolescent of Latino background is about 1.8 times less likely to consider him/herself smart.
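For an odds ratio below 1, the "times less likely" reading is its reciprocal:

```python
or_latino = 0.549660               # odds ratio for H1GI4 from the output above

# reciprocal: how many times lower the odds are for the Latino group
times_less_likely = 1 / or_latino
print(round(times_less_likely, 2))  # ≈ 1.82
```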
Multiple regression
Happy Sunday!
I am going to investigate the relationship between the following variables:
1. Primary Explanatory variable :
A two level categorical variable = (coded as H1WP17F) = In the past week have you had a talk with your mom about a personal problem you are having?
With 0=No and 1=Yes
2. Response variable :
A 6-level categorical variable = (coded as H1SE4) = Compared with other people your age, how intelligent are you? With 1=moderately below average, 2=slightly below average, 3=average, 4=slightly above average, 5=moderately above average, 6=extremely above average
NB: In my code I centred this variable and renamed it 'intelligence'.
3. Confounding variable :
A two level categorical variable = (coded as H1RR1) = Have you been in a romantic relationship in the last 18 months?
With 0=No and 1=Yes
import pandas as pds
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pds.to_numeric(data['H1SE4'], errors='coerce')
data['H1RR1'] = pds.to_numeric(data['H1RR1'], errors='coerce')
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(6, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(8, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(9, numpy.nan)
#centre quantitative response variable for regression analysis
sub1['intelligence'] = (sub1['H1SE4'] - sub1['H1SE4'].mean())
print(sub1['intelligence'].mean())
reg2 = smf.ols('intelligence ~ H1WP17F + H1RR1', data=sub1).fit()
print (reg2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: intelligence R-squared: 0.004
Model: OLS Adj. R-squared: 0.004
Method: Least Squares F-statistic: 11.80
Date: Sun, 19 Mar 2017 Prob (F-statistic): 7.65e-06
Time: 09:15:46 Log-Likelihood: -9164.9
No. Observations: 6085 AIC: 1.834e+04
Df Residuals: 6082 BIC: 1.836e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -0.0697 0.023 -3.067 0.002 -0.114 -0.025
H1WP17F 0.1152 0.029 3.977 0.000 0.058 0.172
H1RR1 0.0632 0.028 2.229 0.026 0.008 0.119
==============================================================================
Omnibus: 249.948 Durbin-Watson: 1.963
Prob(Omnibus): 0.000 Jarque-Bera (JB): 111.966
Skew: 0.083 Prob(JB): 4.86e-25
Kurtosis: 2.357 Cond. No. 2.95
==============================================================================
For both explanatory variables (primary and confounding) the p values are less than 0.05, and both t values are positive, indicating that having been in a romantic relationship and talking with your parents are associated with a higher perception of intelligence.
We can conclude that both explanatory variables are significantly associated with perception of intelligence, after taking out the part of the association accounted for by the other.
Let us try adding an additional potential confounding variable to the model:
A two level categorical variable = (coded as H1ED3) = Have you ever skipped a grade?
With 0=No and 1=Yes
OLS Regression Results
==============================================================================
Dep. Variable: intelligence R-squared: 0.004
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 8.117
Date: Sun, 19 Mar 2017 Prob (F-statistic): 2.16e-05
Time: 09:39:38 Log-Likelihood: -9158.7
No. Observations: 6083 AIC: 1.833e+04
Df Residuals: 6079 BIC: 1.835e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -0.0705 0.023 -3.095 0.002 -0.115 -0.026
H1WP17F 0.1152 0.029 3.977 0.000 0.058 0.172
H1RR1 0.0621 0.028 2.189 0.029 0.006 0.118
H1ED3 0.0883 0.091 0.972 0.331 -0.090 0.266
==============================================================================
Omnibus: 251.891 Durbin-Watson: 1.964
Prob(Omnibus): 0.000 Jarque-Bera (JB): 112.525
Skew: 0.083 Prob(JB): 3.68e-25
Kurtosis: 2.355 Cond. No. 8.08
==============================================================================
Interestingly having skipped a grade in school is not significantly associated with someone’s perception of intelligence.
As you can see, the confidence interval for H1ED3 includes zero.
Also, as the Adj. R-squared value shows, romantic relationship history and communication with parents explain only around 0.3% of the variance in a student's perception of intelligence.
What is the residual variability?
Let us try with a q-q plot
This indicates that our residuals do not follow a perfectly normal distribution. I should include more explanatory variables in my model.
Now let’s look at the standardized residuals
Most residuals fall between -3 and 2, and a few fall outside that range, so we do have some extreme outliers. I believe the fit of the model is quite poor; I should definitely include more explanatory variables.
The leverage plot shows that, again, there are several observations with higher-than-average leverage, but these are not outliers.
What this tells me is that the explanatory variables I took into consideration so far have a significant relationship with perception of intelligence but are not enough to build a robust predictive model of the response variable.
My first regression model
Bonjour!
This week I want to test a basic linear regression model.
I am going to investigate the relationship between the following variables:
1. Explanatory variable :
A two level categorical variable = (coded as H1WP17F) = In the past week have you had a talk with your mom about a personal problem you are having?
With 0=No and 1=Yes
2. Response variable :
A 4 level categorical variable = (coded as H1FS8) = how often do you feel hopeful about the future?
With 0=Never, 1=Sometimes, 2=A lot of the time, 3=Most of the time or all of the time
The objective is to test how a good parental relationship affects a youngster's feelings of self-esteem and self-efficacy.
import pandas as pds
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1FS8'] = pds.to_numeric(data['H1FS8'], errors='coerce')
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
sub1['H1FS8'] = data['H1FS8'].replace(6, numpy.nan)
sub1['H1FS8'] = data['H1FS8'].replace(8, numpy.nan)
print ('OLS regressions model for association parental support and hope towards the future')
reg1 = smf.ols ('H1FS8 ~ H1WP17F', data=sub1).fit()
print (reg1.summary())
OLS regressions model for association parental support and hope towards the future
OLS Regression Results
==============================================================================
Dep. Variable: H1FS8 R-squared: 0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.1034
Date: Sat, 11 Mar 2017 Prob (F-statistic): 0.748
Time: 10:29:36 Log-Likelihood: -9151.4
No. Observations: 6479 AIC: 1.831e+04
Df Residuals: 6477 BIC: 1.832e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 1.8506 0.014 135.341 0.000 1.824 1.877
H1WP17F -0.0025 0.008 -0.322 0.748 -0.018 0.013
==============================================================================
Omnibus: 439.909 Durbin-Watson: 2.003
Prob(Omnibus): 0.000 Jarque-Bera (JB): 239.490
Skew: -0.316 Prob(JB): 9.90e-53
Kurtosis: 2.302 Cond. No. 2.11
==============================================================================
The association is not significant (p = 0.748), so let us try another response variable more closely related to self-esteem.
Response variable :
A 6-level categorical variable = (coded as H1SE4) = Compared with other people your age, how intelligent are you? With 1=moderately below average, 2=slightly below average, 3=average, 4=slightly above average, 5=moderately above average, 6=extremely above average
import pandas as pds
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pds.to_numeric(data['H1SE4'], errors='coerce')
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
print ('OLS regression model for association between parental support and perceived intelligence')
reg1 = smf.ols ('H1SE4 ~ H1WP17F', data=sub1).fit()
print (reg1.summary())
OLS regression model for association between parental support and perceived intelligence
OLS Regression Results
==============================================================================
Dep. Variable: H1SE4 R-squared: 0.005
Model: OLS Adj. R-squared: 0.005
Method: Least Squares F-statistic: 34.36
Date: Sat, 11 Mar 2017 Prob (F-statistic): 4.81e-09
Time: 10:36:21 Log-Likelihood: -15806.
No. Observations: 6478 AIC: 3.162e+04
Df Residuals: 6476 BIC: 3.163e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 3.8502 0.038 100.752 0.000 3.775 3.925
H1WP17F 0.1264 0.022 5.862 0.000 0.084 0.169
==============================================================================
Omnibus: 16272.765 Durbin-Watson: 1.995
Prob(Omnibus): 0.000 Jarque-Bera (JB): 222208638.051
Skew: 27.575 Prob(JB): 0.00
Kurtosis: 908.653 Cond. No. 2.12
==============================================================================
This time the relationship is significant:
P=4.81e-09
Equation:
Feelings of self-esteem compared to others = 3.85 + 0.126*parental talk
However, looking at the R-squared value, I can say that talking with your parents predicts only a minimal part of the variation in self-esteem perceptions (R-squared = 0.005).
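Plugging the two levels of the explanatory variable into the fitted equation gives the predicted means for each group:

```python
# coefficients from the fitted equation above
intercept, slope = 3.8502, 0.1264

pred_no_talk = intercept + slope * 0   # H1WP17F = 0 (no personal talk)
pred_talk = intercept + slope * 1      # H1WP17F = 1 (had a personal talk)
print(round(pred_no_talk, 4), round(pred_talk, 4))  # 3.8502 3.9766
```

The predicted difference between the two groups is only about 0.13 on a 6-point scale, which is consistent with the tiny R-squared.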
Writing about your data - Week 1
The Sample is drawn from The National Longitudinal Study of Adolescent Health (AddHealth) – Wave 1 .
The study used the following steps to build the sample:
- 80 high schools were selected on the basis of region, urbanicity, school type, ethnic mix and size.
- From September 1994 to April 1995, in-school questionnaires were administered to these schools’ students. The in-school questionnaire investigated the school context, friendship networks, school activities, future expectations, health conditions.
- From the students who answered the in-school questionnaires, a nationally representative sample of around 15,000 American adolescents in grades 7-12 was selected on the basis of sex, grade, ethnicity, disability and family roster. These students were then administered a 90-minute in-home interview.
To protect confidentiality, no paper questionnaires were used. Instead, all data were recorded on laptop computers. For less sensitive topics, the interviewer read the questions aloud and entered the respondent's answers. For more sensitive topics, the respondent listened through earphones to pre-recorded questions and entered the answers directly. In addition to maintaining data security, this minimized the potential for interviewer or parental influence.
The in-home interview investigated family composition and dynamics, romantic expectations, and aspirations.
The purpose of the study was to specifically investigate the relationship between family roster, school context and larger context and likelihood of students to engage in risky sex behaviour.
My research question investigates the correlation between parenting style and feelings of independence and self-esteem. The variables analysed can be found in the following sections of the AddHealth Codebook.
- Section 9 – Self-efficacy
- Section 10 – Feelings scale
- Section 16 – Relations with Parents
- Section 18 – Personality and Family
Sections 9 and 10 included a number of variables describing the respondents' feelings of self-esteem (how they see themselves) and self-efficacy (feelings of autonomy). These variables constituted my response variables.
In Sections 16 and 18, respondents were asked to assess their relationship with their parents: how supportive their family is, how independent their parents want them to be, and so on. These constituted my explanatory variables.
Respondents’ feelings and perceptions are assessed using a Likert scale. When responding to a Likert item, respondents specify their level of agreement or disagreement on a symmetric agree-disagree scale for a series of statements.
Scales of this kind can produce distorted responses for a number of reasons:
- respondents may avoid expressing extreme opinions
- respondents may tend to agree with statements as they are expressed
Given that the AddHealth questionnaire was mostly administered in a safe environment (the respondent's home) and students were left alone to submit their answers on a computer, I believe the risk of running into the above limitations was largely avoided.
However, I also decided to look at a number of variables measuring similar feelings and perceptions, precisely because I wanted to avoid reaching conclusions from only one set of correlations.
When managing my data, I also grouped together 'strongly agree' and 'agree' responses, and 'strongly disagree' and 'disagree' responses, in order to account for central tendency bias.
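The grouping of agreement levels can be done with a simple mapping in pandas. A sketch assuming a hypothetical 1-5 Likert coding (1=strongly agree ... 5=strongly disagree), which may not match the codebook's actual values:

```python
import pandas as pd

# hypothetical Likert answers on a 1-5 scale
responses = pd.Series([1, 2, 3, 4, 5, 2, 4])

# collapse: strongly agree/agree -> 'agree'; disagree/strongly disagree -> 'disagree'
collapsed = responses.map({1: 'agree', 2: 'agree',
                           3: 'neutral',
                           4: 'disagree', 5: 'disagree'})
print(collapsed.value_counts())
```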
Testing a potential moderator
This week I decided to measure the linear relationship between two variables part of the AddHealth dataset including in the analysis a third variable, a moderator.
Variable 1 = Independent variable (x)
H1WP8 - On how many of the past 7 days were your parents in the room with you while you ate your evening meal?
Variable 2 = Dependent variable (y)
H1ID6 - How much would you like to have a romantic relationship in the next year?
Moderator
H1WP17F - Have you had a talk with your parents about a personal problem you are having?
I already found out last week that there is a WEAK negative relationship between x and y. I now want to investigate whether this is still the case when I split observations in 2 groups.
Sub 1: students that recently spoke of their personal problems with their parents
Sub 2: students that have not recently spoken of their personal problems with their parents
Script
import pandas as pds
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP8'] = pds.to_numeric(data['H1WP8'], errors='coerce')
data['H1ID6'] = pds.to_numeric(data['H1ID6'], errors='coerce')
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1WP8'] = data['H1WP8'].replace(96, numpy.nan)
data['H1WP8'] = data['H1WP8'].replace(97, numpy.nan)
data['H1WP8'] = data['H1WP8'].replace(98, numpy.nan)
data['H1ID6'] = data['H1ID6'].replace(7, numpy.nan)
data['H1ID6'] = data['H1ID6'].replace(8, numpy.nan)
data['H1ID6'] = data['H1ID6'].replace(9, numpy.nan)
data['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
data['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
data['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
data_clean = data.dropna()
print (scipy.stats.pearsonr(data_clean['H1WP8'], data_clean['H1ID6']))
sub1=data_clean[(data_clean['H1WP17F']== 0)]
sub2=data_clean[(data_clean['H1WP17F']== 1)]
print ('association between weekly meals and romantic expectations for students not talking of personal problems with parents')
print (scipy.stats.pearsonr(sub1['H1WP8'], sub1['H1ID6']))
print (' ')
print ('association between weekly meals and romantic expectations for students talking of personal problems with parents')
print (scipy.stats.pearsonr(sub2['H1WP8'], sub2['H1ID6']))
Output
association between weekly meals and romantic expectations for students not talking of personal problems with parents
(-0.10417957730826818, 1.9417985672438483e-10)
association between weekly meals and romantic expectations for students talking of personal problems with parents
(-0.10564892211849905, 2.9501572305228157e-07)
According to the results of the correlation analysis, there is a weak negative relationship between the number of weekly family meals and romantic expectations in both groups. The correlation is very weak (around -0.1 in each subgroup); however, it is statistically significant (p lower than 0.05).
I expected the relationship between the two variables to be weaker for subgroup 2, the students who talk about personal problems with their parents, but, according to the analysis, students who have recently had personal chats with their parents are even less likely, if only marginally, to desire being in a romantic relationship.
Below are the scatterplots for sub 1 (students who have NOT had a personal talk with their parents recently) and sub 2 (students who have had a personal talk with their parents recently).
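Since the two subgroup correlations are so close, a formal way to check whether they actually differ is a Fisher r-to-z test. This is not part of the course script, just a sketch; the sample sizes n1 and n2 below are placeholders, not the real AddHealth subgroup counts:

```python
import numpy as np
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher transformation of each r
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of the difference
    z = (z1 - z2) / se
    p = 2 * (1 - norm.cdf(abs(z)))                 # two-tailed p-value
    return z, p

# correlations from the two subgroups above; n1 and n2 are placeholder sample sizes
z, p = compare_correlations(-0.1042, 3500, -0.1056, 2300)
print(z, p)
```

With correlations this similar, the test unsurprisingly gives a p-value far above 0.05, i.e. no evidence that the moderator changes the strength of the association.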
0 notes
Text
week 3. correlation coefficient.
Hola!
This week I decided to measure the linear relationship between two variables from the AddHealth dataset.
Variable 1 = Independent variable (x)
H1WP8 - On how many of the past 7 days were your parents in the room with you while you ate your evening meal?
Variable 2 = Dependent variable (y)
H1ID6 - How much would you like to have a romantic relationship in the next year?
#Here is the script I ran to get the scatterplot
import pandas as pds
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv(PATH + "\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP8'] = pds.to_numeric(data['H1WP8'], errors='coerce')
data['H1ID6'] = pds.to_numeric(data['H1ID6'], errors='coerce')
#recoding the missing-data codes as NaN:
data['H1WP8'] = data['H1WP8'].replace([96, 97, 98], numpy.nan)
data['H1ID6'] = data['H1ID6'].replace([7, 8, 9], numpy.nan)
scat1 = seaborn.regplot(x='H1WP8', y='H1ID6', fit_reg=True, data=data)
plt.xlabel('H1WP8')
plt.ylabel('H1ID6')
plt.title('Scatterplot for the Association between weekly meals and romantic expectations')
plt.show()
#Here is my output
I am aware of the fact that, as both of my variables take discrete values, the scatterplot is not that useful.
Looking at the graph, I would expect my Pearson correlation coefficient to be negative.
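One way to make a scatterplot of two discrete variables more readable is to add jitter, which seaborn's regplot supports directly via its x_jitter and y_jitter parameters. A quick sketch with made-up data standing in for the AddHealth columns:

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn

# Made-up discrete data standing in for H1WP8 (0-7) and H1ID6 (1-5)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'H1WP8': rng.integers(0, 8, 200),
                     'H1ID6': rng.integers(1, 6, 200)})

# x_jitter/y_jitter nudge overlapping points apart so point density is visible
seaborn.regplot(x='H1WP8', y='H1ID6', data=demo, fit_reg=True,
                x_jitter=0.25, y_jitter=0.25)
plt.title('Jittered scatterplot (demo data)')
plt.savefig('jitter_demo.png')
```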
#Here is the script I ran to calculate the coefficient
data_clean = data.dropna(subset=['H1WP8', 'H1ID6'])
print ('Association between weekly meals and romantic expectations')
print (scipy.stats.pearsonr(data_clean['H1WP8'], data_clean['H1ID6']))
#And here is the output
Association between weekly meals and romantic expectations
(-0.10415897700852748, 1.1554106741087854e-16)
The second value in the brackets, the p-value, tells me that the correlation is significant.
However, the r value (-0.104) tells me that the negative linear relationship between the number of family meals in the past week and romantic expectations is quite weak.
In other words, when we square r we get 0.010849092491462957, which means that only about 1% of the variability in romantic expectations can be explained by the number of family meals one has in a week.
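That figure is just the square of the coefficient from the output above:

```python
r = -0.10415897700852748  # Pearson r from the output above
r_squared = r ** 2
print(r_squared)          # about 0.0108, i.e. roughly 1% of the variability
```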
Thank goodness! Otherwise it would appear that the more you eat together with your child, the less likely they are to want a romantic relationship!
0 notes