Day 3 of The Breadline Challenge
I am personally not finding it too hard to live on £2.50 a day, and yet people take on a dismayed look whenever I mention the challenge.
So what makes me the ideal candidate to undertake this kind of initiative?
Well, simply: my complete unfussiness coupled with my tolerance for food boredom.
When I was at university I developed the ability to survive on more or less two kinds of vegetables per week. I would go to a large supermarket, choose the biggest bag of the cheapest range of the cheapest kind of vegetable (for example, 2kg of Sainsbury's Basics carrots) and use it in a range of different dishes for the entire week.
Mind you, of course I don't love eating carrots 7 days in a row, and of course I feel bad for shopping at a supermarket without considering its despicable behaviour towards its suppliers... but I still do it when I get into my extreme money-saving mode.
It is not a healthy lifestyle either, but it is better than falling into the 'fruit and veg free' shopping behaviour most people who cannot afford fruit and veg are forced into. When you can get a bag of white buns for less than £1 and a single apple costs 50p, what are you going to do? You are going to consume loads of cheap, low-quality, processed food.
In the photo: the portions of chicken and celery soup I got out of a whole chicken (reduced price £2.35).
Day 2 of the Breadline Challenge
Today I want to talk about restaurants, for the simple reason that my boyfriend’s flatmate brought home a pile of sausages and a tray of focaccia from the place where he works.... so I am sorted protein-wise.
I don’t know much about food waste in the hospitality industry and I will not pretend I do. But I do know that food waste is happening, and that restaurant employees often prefer to throw away leftovers rather than run the risk of getting into trouble if they donate them or take them home.
I would love to find out more about how we can prevent food waste in this sector, but here the waste is so widespread and consumption needs to be so prompt that I have not heard of any game-changing solution.
A couple of years ago a friend of mine told me about Winnow Solutions, a technology that helps track food waste in restaurants, but since then it is the only tech firm I have heard of getting involved in this area.
But I am not sure the solution really lies in finding good ways to monitor waste. I once met a girl who told me that the bakery she worked for would bake bread all day long just to make the window shelves look full and attractive... When our marketing strategies involve creating waste, 'hard' solutions to food waste can only be implemented alongside a movement to change our own mindset.
The Breadline Challenge
I joined FoodCycle over two years ago after spending two months in Senegal, where the idea of throwing away food cannot even be conceived.
And yet FoodCycle does so much more than simply fight food waste! Our weekly meals bring communities together, alleviate social isolation, introduce like- and unlike-minded people to each other and, above all, reduce food poverty.
The UK bins around £17 billion worth of food a year, and yet in 2017 The Trussell Trust’s foodbank network distributed 586,907 three-day emergency food supplies to people in crisis.
I want to raise awareness of these two apparently contradictory issues, food poverty and food waste, by living on £2.50 worth of food per day for one week.
I will be posting (hopefully) daily updates on how the challenge is going here, and if you find me entertainingly crazy enough you can donate here:
https://uk.virginmoneygiving.com/MariaGiorda
Capstone Project - Preliminary Results
Happy Sunday :)
Today I will report the initial results I got from running bivariate analysis between my dependent variable (GDP per capita growth) and each of the independent variables considered (urbanisation rate, value added by agriculture, value of imports and value of exports). As all variables are quantitative, I used the Pearson correlation coefficient as the statistical test.
Here is the output I got for each bivariate association:
Association between growth in urban rate and GDP per capita growth:
r = -0.2923, p = 0.0303
Association between growth in value added by agriculture and GDP per capita growth:
r = 0.4768, p = 0.0002
Association between growth in value of imports and GDP per capita growth:
r = 0.1412, p = 0.3038
Association between growth in value of exports and GDP per capita growth:
r = -0.1636, p = 0.2326
The only significant relationship is that between GDP per capita growth and value added by agriculture. This can also be seen intuitively by looking at the scatterplots for each relationship: the one showing the relationship between agriculture and growth in GDP is the only one with a clear upward trend.
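For reference, a bivariate Pearson test like the ones above can be run with scipy.stats; the arrays below are hypothetical stand-ins for two of the indicators, not the actual World Bank data:

```python
import numpy as np
from scipy.stats import pearsonr

# hypothetical arrays standing in for two of the indicators above
urban_growth = np.array([1.2, 0.8, 2.5, 3.1, 0.5, 1.9, 2.2, 0.9])
gdp_growth = np.array([2.1, 1.0, 3.4, 4.0, 0.7, 2.6, 3.0, 1.1])

# pearsonr returns (correlation coefficient, two-sided p-value),
# the same tuple format shown in the outputs above
r, p = pearsonr(urban_growth, gdp_growth)
print(r, p)
```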
Capstone Project - Draft Methods
Happy Sunday!
I am using the data set provided by the World Bank, which reports information on 163 indicators of economic development for 248 countries.
The variables analysed are the following:
Dependent variable: GDP per capita growth (annual)
Independent variables 1:
- Urban population growth (% total)
- Agriculture, value added (annual % growth)
Independent variables 2:
- Imports of goods and services (% GDP)
- Exports of goods and services (% GDP)
I will first test whether there is a significant relationship between the dependent variable and each of the independent variables individually, and then two by two, as my objective is to identify which combination of the independent variables is significantly and positively associated with growth in GDP per capita.
Therefore, in the first part of my analysis I will apply 8 bivariate statistical tests (ANOVA and Pearson correlation), both testing the relationship between the quantitative response variable and the predictors left as quantitative variables, and testing the same relationship with the predictors converted into categorical explanatory variables.
Depending on the results, I will then add different moderators to strengthen my model and perform a regression analysis on my data set.
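Converting a quantitative predictor into a categorical one for the ANOVA side of those tests can be sketched as follows; the data frame, column names and bin labels here are hypothetical stand-ins, not the actual World Bank variables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(1)
# hypothetical stand-in data for two of the indicators
df = pd.DataFrame({
    'gdp_growth': rng.normal(2, 1, 120),
    'agri_growth': rng.normal(3, 2, 120),
})

# bin the quantitative predictor into a 3-level categorical variable
df['agri_cat'] = pd.qcut(df['agri_growth'], q=3, labels=['low', 'medium', 'high'])

# ANOVA: quantitative response, categorical explanatory variable
model = smf.ols('gdp_growth ~ C(agri_cat)', data=df).fit()
print(model.f_pvalue)
```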
Capstone Project - Introduction to Research Question
The association between growth in GDP per capita and indicators of urbanisation and economic self-sufficiency.
The purpose of my research is multi-fold:
1. My first objective is to investigate the association between growth in GDP per capita and two independent variables, annual growth in urban population and annual growth in value added by agriculture, both individually and jointly. The outcome of the first part of the analysis would be to identify which combination of annual growth in urban population and annual growth in value added by agriculture leads to the highest growth in GDP per capita.
2. My second objective is to investigate the association between GDP per capita and two independent variables, import and export of goods and services, both individually and jointly. The outcome of the second part of the analysis would be to identify which combination of imports and exports leads to the highest growth in GDP per capita.
Ultimately, through my investigation I want to understand how the above-mentioned 4 independent variables are jointly associated with growth in per capita GDP.
As someone deeply interested in Strategies for Sustainable Development, especially those involving the concepts of Short Supply Chains and Local Food Systems, I would like to understand what balance needs to be struck between agricultural and urban development.
I live in London and volunteer on a weekly basis for an organisation fighting food poverty and food waste. I believe that further research and investment in food security and self-sufficiency through the development of rural-urban linkages and peri-agriculture can help modern societies improve people’s well-being.
k-means clustering
Hello!
This week I am running a k-means cluster analysis in Python. The objective of the clustering is to understand which 'variable clusters' my target audiences, students with high vs low self-esteem, belong to.
I am going to use the Wave One, Add Health Survey data.
My response/target variable is self-esteem, which is defined based on the answer to the question: compared to people your age, how intelligent are you?
Target variable: a categorical 5-level variable, H1SE4, that I recoded into a binary categorical variable renamed SMART.
The candidate explanatory variables include race/ethnicity, relationship with parents, academic performance and leisure activities.
Explanatory Variables – Categorical
WHITE = H1GI6A 0=no 1=yes
BLACK = H1GI6B 0=no 1=yes
AMERICAN INDIAN = H1GI6C 0=no 1=yes
ASIAN = H1GI6D 0=no 1=yes
RELATIONSHIP WITH PARENTS = H1WP17F (have you had a personal talk with your parents in the past 7 days?) 0=no 1=yes
PERFORMANCE AT SCHOOL = H1ED3 (have you ever skipped a grade?) 0=no 1=yes
ARTS AND CRAFTS ACTIVITIES = H1DA2 (how many times in the past week did you do arts and crafts activities?) 0=none 1=1-2times 2=3-4times 3=5 or more times
TEAM SPORT = H1DA5 (how many times in the past week did you play a team sport?) 0=none 1=1-2times 2=3-4times 3=5 or more times
EXERCISE = H1DA6 (how many times in the past week did you exercise?) 0=none 1=1-2times 2=3-4times 3=5 or more times
TIME WITH FRIENDS = H1DA7 (how many times in the past week did you hang out with friends?) 0=none 1=1-2times 2=3-4times 3=5 or more times
TELEVISION = H1DA8 (how many hours a week do you watch television?) Range from 0 to 99
VIDEO GAMES = H1DA10 (how many hours a week do you play videogames?) Range from 0 to 99
SCRIPT 1
#CLUSTERING
#from pandas import Series, DataFrame
import pandas
import numpy
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  #sklearn.cross_validation is deprecated
from sklearn import preprocessing
from sklearn.cluster import KMeans
#load dataset
PATH = "D:\\AddHealth_Coursera"
data = pandas.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#Data management
#setting variables I will be working with to numeric
data['H1WP17F'] = pandas.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pandas.to_numeric(data['H1SE4'], errors='coerce')
data['H1GI6A'] = pandas.to_numeric(data['H1GI6A'], errors='coerce')
data['H1GI6B'] = pandas.to_numeric(data['H1GI6B'], errors='coerce')
data['H1GI6C'] = pandas.to_numeric(data['H1GI6C'], errors='coerce')
data['H1GI6D'] = pandas.to_numeric(data['H1GI6D'], errors='coerce')
data['H1ED3'] = pandas.to_numeric(data['H1ED3'], errors='coerce')
data['H1DA2'] = pandas.to_numeric(data['H1DA2'], errors='coerce')
data['H1DA5'] = pandas.to_numeric(data['H1DA5'], errors='coerce')
data['H1DA6'] = pandas.to_numeric(data['H1DA6'], errors='coerce')
data['H1DA7'] = pandas.to_numeric(data['H1DA7'], errors='coerce')
data['H1DA8'] = pandas.to_numeric(data['H1DA8'], errors='coerce')
data['H1DA10'] = pandas.to_numeric(data['H1DA10'], errors='coerce')
#data management section
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
c26 = sub1['H1WP17F'].value_counts(sort=False, dropna=False)
print(c26)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
c27 = sub1['H1SE4'].value_counts(sort=False, dropna=False)
print(c27)
sub1['H1GI6A'] = data['H1GI6A'].replace(6, numpy.nan)
sub1['H1GI6A'] = data['H1GI6A'].replace(8, numpy.nan)
c28 = sub1['H1GI6A'].value_counts(sort=False, dropna=False)
print(c28)
sub1['H1GI6B'] = data['H1GI6B'].replace(6, numpy.nan)
sub1['H1GI6B'] = data['H1GI6B'].replace(8, numpy.nan)
c29 = sub1['H1GI6B'].value_counts(sort=False, dropna=False)
print(c29)
sub1['H1GI6C'] = data['H1GI6C'].replace(6, numpy.nan)
sub1['H1GI6C'] = data['H1GI6C'].replace(8, numpy.nan)
c30 = sub1['H1GI6C'].value_counts(sort=False, dropna=False)
print(c30)
sub1['H1GI6D'] = data['H1GI6D'].replace(6, numpy.nan)
sub1['H1GI6D'] = data['H1GI6D'].replace(8, numpy.nan)
c31 = sub1['H1GI6D'].value_counts(sort=False, dropna=False)
print(c31)
sub1['H1ED3'] = data['H1ED3'].replace(6, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(8, numpy.nan)
c32 = sub1['H1ED3'].value_counts(sort=False, dropna=False)
print(c32)
sub1['H1DA2'] = data['H1DA2'].replace(6, numpy.nan)
sub1['H1DA2'] = data['H1DA2'].replace(8, numpy.nan)
c33 = sub1['H1DA2'].value_counts(sort=False, dropna=False)
print(c33)
sub1['H1DA5'] = data['H1DA5'].replace(6, numpy.nan)
sub1['H1DA5'] = data['H1DA5'].replace(8, numpy.nan)
c34 = sub1['H1DA5'].value_counts(sort=False, dropna=False)
print(c34)
sub1['H1DA6'] = data['H1DA6'].replace(6, numpy.nan)
sub1['H1DA6'] = data['H1DA6'].replace(8, numpy.nan)
c35 = sub1['H1DA6'].value_counts(sort=False, dropna=False)
print(c35)
sub1['H1DA7'] = data['H1DA7'].replace(6, numpy.nan)
sub1['H1DA7'] = data['H1DA7'].replace(8, numpy.nan)
c36 = sub1['H1DA7'].value_counts(sort=False, dropna=False)
print(c36)
sub1['H1DA8'] = data['H1DA8'].replace(996, numpy.nan)
sub1['H1DA8'] = data['H1DA8'].replace(998, numpy.nan)
c37 = sub1['H1DA8'].value_counts(sort=False, dropna=False)
print(c37)
sub1['H1DA10'] = data['H1DA10'].replace(996, numpy.nan)
sub1['H1DA10'] = data['H1DA10'].replace(998, numpy.nan)
c38 = sub1['H1DA10'].value_counts(sort=False, dropna=False)
print(c38)
#collate 5-level response variable into a 2-level one
def SMART(row):
if row['H1SE4'] <= 3 :
return 0
elif row['H1SE4'] > 3 :
return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
d2 = sub1.groupby('SMART').size()
print(d2)
#convert_objects is deprecated; use pandas.to_numeric instead
sub1['SMART'] = pandas.to_numeric(sub1['SMART'], errors='coerce')
#create a clean data frame that drops all NAs
data_clean = sub1.dropna()
#subset clustering variables
cluster = data_clean[['SMART','H1WP17F','H1GI6A','H1GI6B','H1GI6C','H1ED3','H1DA5','H1DA2','H1DA6','H1DA7','H1DA8','H1DA10']]
cluster.describe()
#standardize clustering variables to have mean=0 and sd=1
clustervar = cluster.copy()
clustervar['SMART'] = preprocessing.scale(clustervar['SMART'].astype('float64'))
clustervar['H1WP17F'] = preprocessing.scale(clustervar['H1WP17F'].astype('float64'))
clustervar['H1GI6A'] = preprocessing.scale(clustervar['H1GI6A'].astype('float64'))
clustervar['H1GI6B'] = preprocessing.scale(clustervar['H1GI6B'].astype('float64'))
clustervar['H1GI6C'] = preprocessing.scale(clustervar['H1GI6C'].astype('float64'))
clustervar['H1ED3'] = preprocessing.scale(clustervar['H1ED3'].astype('float64'))
clustervar['H1DA2'] = preprocessing.scale(clustervar['H1DA2'].astype('float64'))
clustervar['H1DA5'] = preprocessing.scale(clustervar['H1DA5'].astype('float64'))
clustervar['H1DA6'] = preprocessing.scale(clustervar['H1DA6'].astype('float64'))
clustervar['H1DA7'] = preprocessing.scale(clustervar['H1DA7'].astype('float64'))
clustervar['H1DA8'] = preprocessing.scale(clustervar['H1DA8'].astype('float64'))
clustervar['H1DA10'] = preprocessing.scale(clustervar['H1DA10'].astype('float64'))
#split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
#k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(numpy.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])
#plot average distance from observations to the cluster centroids and use the Elbow Method to identify the number of clusters to choose
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow method')
OUTPUT 1
The plot shows the average minimum distance of the observations from the cluster centroids for each number of clusters. We can see the average distance decreasing as the number of clusters increases.
Since the goal of cluster analysis is to minimize the distance between observations and their assigned clusters, we want to choose the fewest number of clusters that still provides a low average distance. What we're looking for in this plot is a bend in the elbow that shows where the average distance might be leveling off, such that adding more clusters doesn't decrease the average distance by much. There appear to be bends in the line at 2, 3 and 6 clusters, but it's not very clear.
SCRIPT 2
#Interpret 3 cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)
#plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of canonical variables for 3 clusters')
plt.show()
OUTPUT 2
Here is the scatter plot. Two of the clusters are densely packed, which would suggest high within-cluster correlation and low within-cluster variance, and there is a good deal of overlap between them.
The 3-cluster solution shows better separation, but the observations are more spread out, indicating lower within-cluster correlation and higher within-cluster variance.
SCRIPT 3
#BEGIN multiple steps to merge cluster assignment with clustering variables to examine cluster variable mean by cluster
#create a unique identifier variable from the index to merge training data with cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
#create a list that has the new index variable
cluslist = list(clus_train['index'])
#create a list of cluster assignments
labels = list(model3.labels_)
#combine index variable list with cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
newlist
#convert newlist dictionary to a dataframe
newclus = pandas.DataFrame.from_dict(newlist, orient='index')
newclus
#now the same for the cluster assignment variable
#rename the cluster assignment column
newclus.columns = ['cluster']
#create a unique identifier variable from the index for the cluster assignment dataframe to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
#merge the cluster assignment dataframe with the cluster training variable dataframe by the index variable
merged_train = pandas.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
#cluster frequencies
merged_train.cluster.value_counts()
#END multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster
#calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
OUTPUT 3
Clustering variable means by cluster
               index     SMART   H1WP17F    H1GI6A    H1GI6B    H1GI6C
cluster
0        3317.145035 -0.019567  0.012916  0.715419 -0.572201 -0.047231
1        2809.145455 -0.009700 -0.050558 -0.437236  0.238518  0.111593
2        3202.828512  0.021929 -0.016642 -1.326468  1.045616  0.088115

            H1ED3     H1DA5     H1DA2     H1DA6     H1DA7     H1DA8    H1DA10
cluster
0       -0.158047 -0.006461  0.040623 -0.019016  0.056149 -0.129638 -0.055092
1        6.327226 -0.069347  0.101848  0.016019 -0.001742 -0.091051  0.083528
2       -0.158047 -0.001304 -0.079557  0.089569 -0.074709  0.226008  0.085216
Adolescents in the third cluster, cluster 2, had the highest likelihood of having high self-esteem (highest SMART mean). The ones in cluster 0 had the lowest, yet they were also the ones with the best parent relationships (highest H1WP17F mean).
Strangely enough, my dataset did not have a GPA column; however, the code below is what I would have used to validate the clusters in the training data by examining cluster differences in GPA using ANOVA.
CODE 4:
#validate clusters by examining cluster differences in GPA
gpa_data = data_clean['GPA1']
#split GPA data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1 = pandas.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pandas.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())
print('means for GPA by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)
print('standard deviations for GPA by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)
mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Lasso Regression
Hello!!
This week I am working with lasso regression in order to understand which variables are most strongly associated with my target audience, students with high vs low self-esteem.
I am going to use the Wave One, Add Health Survey data.
My response/target variable is self-esteem, which is defined based on the answer to the question: compared to people your age, how intelligent are you?
Target variable: a categorical 5-level variable, H1SE4, that I recoded into a binary categorical variable renamed SMART.
The candidate explanatory variables include race/ethnicity, relationship with parents and academic performance.
Explanatory Variables – Categorical
WHITE = H1GI6A 0=no 1=yes
BLACK = H1GI6B 0=no 1=yes
AMERICAN INDIAN = H1GI6C 0=no 1=yes
ASIAN = H1GI6D 0=no 1=yes
RELATIONSHIP WITH PARENTS = H1WP17F (have you had a personal talk with your parents in the past 7 days?) 0=no 1=yes
PERFORMANCE AT SCHOOL = H1ED3 (have you ever skipped a grade?) 0=no 1=yes
I am going to test a lasso regression model using Python.
CODE
#from pandas import Series, DataFrame
import pandas
import numpy
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  #sklearn.cross_validation is deprecated
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import sklearn.metrics
import io
from sklearn import tree
from sklearn.linear_model import LassoLarsCV
#load dataset
PATH = "D:\\AddHealth_Coursera"
data = pandas.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#Data management
#setting variables I will be working with to numeric
data['H1WP17F'] = pandas.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pandas.to_numeric(data['H1SE4'], errors='coerce')
data['H1GI6A'] = pandas.to_numeric(data['H1GI6A'], errors='coerce')
data['H1GI6B'] = pandas.to_numeric(data['H1GI6B'], errors='coerce')
data['H1GI6C'] = pandas.to_numeric(data['H1GI6C'], errors='coerce')
data['H1GI6D'] = pandas.to_numeric(data['H1GI6D'], errors='coerce')
data['H1ED3'] = pandas.to_numeric(data['H1ED3'], errors='coerce')
#data management section
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
c26 = sub1['H1WP17F'].value_counts(sort=False, dropna=False)
print(c26)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
c27 = sub1['H1SE4'].value_counts(sort=False, dropna=False)
print(c27)
sub1['H1GI6A'] = data['H1GI6A'].replace(6, numpy.nan)
sub1['H1GI6A'] = data['H1GI6A'].replace(8, numpy.nan)
c28 = sub1['H1GI6A'].value_counts(sort=False, dropna=False)
print(c28)
sub1['H1GI6B'] = data['H1GI6B'].replace(6, numpy.nan)
sub1['H1GI6B'] = data['H1GI6B'].replace(8, numpy.nan)
c29 = sub1['H1GI6B'].value_counts(sort=False, dropna=False)
print(c29)
sub1['H1GI6C'] = data['H1GI6C'].replace(6, numpy.nan)
sub1['H1GI6C'] = data['H1GI6C'].replace(8, numpy.nan)
c30 = sub1['H1GI6C'].value_counts(sort=False, dropna=False)
print(c30)
sub1['H1GI6D'] = data['H1GI6D'].replace(6, numpy.nan)
sub1['H1GI6D'] = data['H1GI6D'].replace(8, numpy.nan)
c31 = sub1['H1GI6D'].value_counts(sort=False, dropna=False)
print(c31)
sub1['H1ED3'] = data['H1ED3'].replace(6, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(8, numpy.nan)
c32 = sub1['H1ED3'].value_counts(sort=False, dropna=False)
print(c32)
#collate 5-level response variable into a 2-level one
def SMART(row):
if row['H1SE4'] <= 3 :
return 0
elif row['H1SE4'] > 3 :
return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
d2 = sub1.groupby('SMART').size()
print(d2)
#convert_objects is deprecated; use pandas.to_numeric instead
sub1['SMART'] = pandas.to_numeric(sub1['SMART'], errors='coerce')
#create a clean data frame that drops all NAs
data_clean = sub1.dropna()
#Next I will create 2 data frames
#1 - Dataframe including only the predictor variables
predvar= data_clean[['H1GI6A', 'H1GI6B', 'H1GI6C', 'H1GI6D', 'H1WP17F', 'H1ED3']]
#2 - Dataframe including only self-esteem response variable
target = data_clean.SMART
#standardize predictors to have mean=0 and sd=1
predictors = predvar.copy()
from sklearn import preprocessing
predictors['H1GI6A'] = preprocessing.scale(predictors['H1GI6A'].astype('float64'))
predictors['H1GI6B'] = preprocessing.scale(predictors['H1GI6B'].astype('float64'))
predictors['H1GI6C'] = preprocessing.scale(predictors['H1GI6C'].astype('float64'))
predictors['H1GI6D'] = preprocessing.scale(predictors['H1GI6D'].astype('float64'))
predictors['H1ED3'] = preprocessing.scale(predictors['H1ED3'].astype('float64'))
predictors['H1WP17F'] = preprocessing.scale(predictors['H1WP17F'].astype('float64'))
#split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
#specify the Lasso regression model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
#print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
OUTPUT 1
Out[3]:
{'H1ED3': -0.00036269558362801561,
 'H1GI6A': 0.033411357878996566,
 'H1GI6B': 0.060746780876457317,
 'H1GI6C': -0.0078545600538546621,
 'H1GI6D': 0.003423440826595485,
 'H1WP17F': 0.01843944582393145}
None of my predictors has a regression coefficient equal to zero, which means that all 6 variables were retained in the final model. H1GI6B (black ethnicity) and H1GI6A (white ethnicity) were the variables most strongly associated with self-esteem.
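For contrast, here is a tiny synthetic sketch (hypothetical data, not the Add Health file) of the selection behaviour described above: the lasso penalty shrinks the coefficient of an uninformative predictor towards zero while keeping the informative one large.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n = 200
x_signal = rng.normal(size=n)   # predictor truly related to the response
x_noise = rng.normal(size=n)    # unrelated predictor
y = 2.0 * x_signal + rng.normal(scale=0.5, size=n)

X = np.column_stack([x_signal, x_noise])
model = Lasso(alpha=0.3).fit(X, y)

# the penalty shrinks the uninformative coefficient to (or very near) zero
# while the informative one stays large
print(model.coef_)
```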
CODE 2
#plot coefficient progression
m_log_alphas = -numpy.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-numpy.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
OUTPUT 2
The plot shows the relative importance of each predictor at every step of the selection process: how the regression coefficients changed as new predictors were added to the model, and the point at which each predictor entered the model. The dark green line has the largest regression coefficient (black ethnicity), and it was the first predictor to enter the model.
CODE 3
#plot mean squared error for each fold
m_log_alphascv = -numpy.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-numpy.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
OUTPUT 3
We can see that there is variability across the individual cross-validation folds in the training data set, but apart from 3 cases, the change in mean squared error as predictors enter the model follows quite a similar pattern: MSE decreases at the start and then levels off, as adding more predictors to the model does not lead to much further reduction in the mean squared error.
CODE 4
#MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('Training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
#R-squared from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
OUTPUT 4
Training data MSE
0.243877161312
test data MSE
0.245976658434
training data R-square
0.0105579580738
test data R-square
0.00393038470188
The test mean square error is very close to the training mean square error, suggesting that prediction accuracy was pretty stable across the two data sets.
The R-squared values are 0.011 and 0.004, indicating that the selected model explained only about 1% and 0.4% of the variance in self-esteem for the training and test sets respectively.
Building a random forest with Python
Hello!!!
This week I am generating a random forest with Python.
I am going to use the Wave One, Add Health Survey data.
My response/target variable is self-esteem, which is defined based on the answer to the question: compared to people your age, how intelligent are you?
Target variable: a categorical 5-level variable, H1SE4, that I recoded into a binary categorical variable renamed SMART.
The candidate explanatory variables include race/ethnicity, relationship with parents and academic performance.
Explanatory Variables – Categorical:
WHITE = H1GI6A 0=no 1=yes
BLACK = H1GI6B 0=no 1=yes
AMERICAN INDIAN = H1GI6C 0=no 1=yes
ASIAN = H1GI6D 0=no 1=yes
RELATIONSHIP WITH PARENTS = H1WP17F (have you had a personal talk with your parents in the past 7 days?) 0=no 1=yes
PERFORMANCE AT SCHOOL = H1ED3 (have you ever skipped a grade?) 0=no 1=yes
CODE – Part 1
#week 2 - import relevant libraries
import pandas
import numpy
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  #sklearn.cross_validation is deprecated
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import sklearn.metrics
import io
from sklearn import tree
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
PATH = "D:\\AddHealth_Coursera"
data = pandas.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric
data['H1WP17F'] = pandas.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pandas.to_numeric(data['H1SE4'], errors='coerce')
data['H1GI6A'] = pandas.to_numeric(data['H1GI6A'], errors='coerce')
data['H1GI6B'] = pandas.to_numeric(data['H1GI6B'], errors='coerce')
data['H1GI6C'] = pandas.to_numeric(data['H1GI6C'], errors='coerce')
data['H1GI6D'] = pandas.to_numeric(data['H1GI6D'], errors='coerce')
data['H1ED3'] = pandas.to_numeric(data['H1ED3'], errors='coerce')
#data management section
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
c26 = sub1['H1WP17F'].value_counts(sort=False, dropna=False)
print(c26)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
c27 = sub1['H1SE4'].value_counts(sort=False, dropna=False)
print(c27)
sub1['H1GI6A'] = data['H1GI6A'].replace(6, numpy.nan)
sub1['H1GI6A'] = data['H1GI6A'].replace(8, numpy.nan)
c28 = sub1['H1GI6A'].value_counts(sort=False, dropna=False)
print(c28)
sub1['H1GI6B'] = data['H1GI6B'].replace(6, numpy.nan)
sub1['H1GI6B'] = data['H1GI6B'].replace(8, numpy.nan)
c29 = sub1['H1GI6B'].value_counts(sort=False, dropna=False)
print(c29)
sub1['H1GI6C'] = data['H1GI6C'].replace(6, numpy.nan)
sub1['H1GI6C'] = data['H1GI6C'].replace(8, numpy.nan)
c30 = sub1['H1GI6C'].value_counts(sort=False, dropna=False)
print(c30)
sub1['H1GI6D'] = data['H1GI6D'].replace(6, numpy.nan)
sub1['H1GI6D'] = data['H1GI6D'].replace(8, numpy.nan)
c31 = sub1['H1GI6D'].value_counts(sort=False, dropna=False)
print(c31)
sub1['H1ED3'] = data['H1ED3'].replace(6, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(8, numpy.nan)
c32 = sub1['H1ED3'].value_counts(sort=False, dropna=False)
print(c32)
#collate 5-level response variable into a 2-level one
def SMART(row):
    if row['H1SE4'] <= 3:
        return 0
    elif row['H1SE4'] > 3:
        return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
d2 = sub1.groupby('SMART').size()
print(d2)
sub1['SMART'] = pandas.to_numeric(sub1['SMART'], errors='coerce')
#create a clean data frame that drops all NAs
data_clean = sub1.dropna()
data_clean.dtypes
data_clean.describe()
#set explanatory and response variables and set training sample
predictors = data_clean[['H1GI6A', 'H1GI6B', 'H1GI6C', 'H1GI6D', 'H1WP17F', 'H1ED3']]
target = data_clean.SMART
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
pred_train.shape
print(pred_train.shape)
pred_test.shape
print(pred_test.shape)
tar_train.shape
print(tar_train.shape)
tar_test.shape
print(tar_test.shape)
#build model on training data
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier (n_estimators=6)
classifier=classifier.fit(pred_train, tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score (tar_test, predictions)
OUTPUT confusion matrix and accuracy score:
(3655, 6)
(2437, 6)
(3655,)
(2437,)
Out[1]: 0.55642183011899882
For the confusion matrix, the true negatives and the true positives sit on the diagonal, while the off-diagonal cells hold the false positives and false negatives. The overall accuracy for the forest is about 0.56, so 56% of the individuals were classified correctly as students with high self-esteem or students with low self-esteem.
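As a quick sanity check, accuracy is just the sum of the diagonal of the confusion matrix divided by the total count. A minimal sketch with hypothetical cell counts (the actual matrix values are not pasted above):

```python
import numpy as np

# hypothetical 2x2 confusion matrix: rows = true class, columns = predicted class
cm = np.array([[150,  900],    # true low self-esteem
               [180, 1207]])   # true high self-esteem

# accuracy = correctly classified (diagonal) over all test observations
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # ≈ 0.557, i.e. roughly the 0.56 reported above
```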
CODE - PART 2:
#fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
#display the relative importance of each attribute
print(model.feature_importances_)
OUTPUT:
The variables are listed in the order they have been named earlier in the code so: 'H1GI6A' (white), 'H1GI6B' (black), 'H1GI6C' (american indian), 'H1GI6D' (asian), 'H1WP17F' (relationship with parents), 'H1ED3' (academic performance).
[ 0.24377588 0.32035186 0.0894001 0.12651909 0.14493537 0.07501769]
As we can see, the variable with the highest importance score is black ethnicity (0.32) and the variable with the lowest importance score is academic performance (0.075).
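To avoid misreading the ordering, the importances can be paired with the predictor names and ranked. A small sketch using the values printed above:

```python
predictor_names = ['H1GI6A', 'H1GI6B', 'H1GI6C', 'H1GI6D', 'H1WP17F', 'H1ED3']
importances = [0.24377588, 0.32035186, 0.0894001, 0.12651909, 0.14493537, 0.07501769]

# sort predictors by importance, highest first
ranked = sorted(zip(predictor_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```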
CODE - PART 3:
The third part of the code grows random forests with 1 to 6 trees and records the accuracy score for each forest size.
#impact of tree size on prediction accuracy
trees=range(6)
accuracy=numpy.zeros(6)
for idx in range(len(trees)):
    classifier=RandomForestClassifier(n_estimators=idx + 1)
    classifier=classifier.fit(pred_train,tar_train)
    predictions=classifier.predict(pred_test)
    accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)
OUTPUT:
With only one tree the accuracy is about 56.4%, and it barely changes (around 56.3%) as the number of trees increases. Since the accuracy scores are so close to each other, it may be appropriate to interpret a single decision tree for these data.
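The accuracy-versus-trees relationship can be visualised with matplotlib. Since the AddHealth data cannot be reproduced here, this sketch uses a synthetic classification dataset as a stand-in:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for the AddHealth predictors and target
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

trees = range(1, 7)
accuracy = np.zeros(len(trees))
for idx, n in enumerate(trees):
    clf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    accuracy[idx] = accuracy_score(y_test, clf.predict(X_test))

plt.plot(list(trees), accuracy)
plt.xlabel("number of trees")
plt.ylabel("accuracy")
plt.savefig("trees_vs_accuracy.png")
```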
Machine Learning.1 - Decision Tree
Hello!!
So for this week I am building a classification tree grown with the data from the AddHealth data set.
My response/target variable is self-esteem, defined based on the answer to the question "Compared with people your age, how intelligent are you?"
Target Variable: a categorical multi-level variable, H1SE4, which I recoded into a binary categorical variable renamed SMART.
The candidate explanatory variables include four ethnicity indicators and relationship with parents.
Explanatory Variables – Categorical:
WHITE = H1GI6A 0=no 1=yes
BLACK = H1GI6B 0=no 1=yes
AMERICAN INDIAN = H1GI6C 0=no 1=yes
ASIAN = H1GI6D 0=no 1=yes
RELATIONSHIP WITH PARENTS = H1WP17F (have you had a personal talk with your parents in the past 7 days?) 0=no 1=yes
CODE
#week 1 - import relevant libraries
import pandas
import numpy
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import sklearn.metrics
import io
from sklearn import tree
PATH = "D:\\AddHealth_Coursera"
data = pandas.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric
data['H1WP17F'] = pandas.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pandas.to_numeric(data['H1SE4'], errors='coerce')
data['H1GI6A'] = pandas.to_numeric(data['H1GI6A'], errors='coerce')
data['H1GI6B'] = pandas.to_numeric(data['H1GI6B'], errors='coerce')
data['H1GI6C'] = pandas.to_numeric(data['H1GI6C'], errors='coerce')
data['H1GI6D'] = pandas.to_numeric(data['H1GI6D'], errors='coerce')
sub1=data
#data management
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
c26 = sub1['H1WP17F'].value_counts(sort=False, dropna=False)
print(c26)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
c27 = sub1['H1SE4'].value_counts(sort=False, dropna=False)
print(c27)
sub1['H1GI6A'] = data['H1GI6A'].replace(6, numpy.nan)
sub1['H1GI6A'] = data['H1GI6A'].replace(8, numpy.nan)
c28 = sub1['H1GI6A'].value_counts(sort=False, dropna=False)
print(c28)
sub1['H1GI6B'] = data['H1GI6B'].replace(6, numpy.nan)
sub1['H1GI6B'] = data['H1GI6B'].replace(8, numpy.nan)
c29 = sub1['H1GI6B'].value_counts(sort=False, dropna=False)
print(c29)
sub1['H1GI6C'] = data['H1GI6C'].replace(6, numpy.nan)
sub1['H1GI6C'] = data['H1GI6C'].replace(8, numpy.nan)
c30 = sub1['H1GI6C'].value_counts(sort=False, dropna=False)
print(c30)
sub1['H1GI6D'] = data['H1GI6D'].replace(6, numpy.nan)
sub1['H1GI6D'] = data['H1GI6D'].replace(8, numpy.nan)
c31 = sub1['H1GI6D'].value_counts(sort=False, dropna=False)
print(c31)
#collate 5-level response variable into a 2-level one
def SMART(row):
    if row['H1SE4'] <= 3:
        return 0
    elif row['H1SE4'] > 3:
        return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
d2 = sub1.groupby('SMART').size()
print(d2)
#create a clean data frame that drops all NAs
sub1['SMART'] = pandas.to_numeric(sub1['SMART'], errors='coerce')
data_clean = sub1.dropna()
data_clean.dtypes
data_clean.describe()
#split into training and testing sets
predictors = data_clean[['H1GI6A', 'H1GI6B', 'H1GI6C', 'H1GI6D', 'H1WP17F']]
target = data_clean.SMART
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4)
pred_train.shape
print(pred_train.shape)
pred_test.shape
print(pred_test.shape)
tar_train.shape
print(tar_train.shape)
tar_test.shape
print(tar_test.shape)
#building model on training data
classifier=tree.DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score (tar_test, predictions)
OUTPUT confusion matrix and accuracy score:
The diagonal entries, 119 and 1261, show the number of true negatives and the number of true positives for self-esteem, respectively.
I also looked at the accuracy score, which is approximately 0.58, suggesting that the decision tree model classified about 58% of the sample correctly as students with either high or low self-esteem.
#displaying decision tree
from sklearn import tree
from io import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
Img=(Image(graph.create_png()))
with open('picture_out1.png', 'wb') as f:
    f.write(graph.create_png())
Output:
(3656, 5)
(2438, 5)
(3656,)
(2438,)
Out[2]: 0.57629204265791634
Decision Tree:
The resulting tree starts with a split on X[1], our second explanatory variable, black ethnicity. If the value is less than 0.5 (i.e. not black ethnicity), the observations move to the left side of the split; these include 2773 individuals out of the initial training sample of 3656.
From this node, another split is made on X[4], relationship with parents: among the individuals not of black ethnicity, 1095 have not had a recent personal talk with their parents.
Following the terminal nodes of the tree, we can see that among those who are not of black ethnicity and have not had a personal talk with their parents, 13 have low self-esteem and the others high.
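The split structure described above can also be read programmatically from the fitted classifier's tree_ attribute instead of from the rendered image. A minimal sketch on a synthetic dataset (not the AddHealth data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

# synthetic stand-in: 5 predictors, binary target
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

t = clf.tree_
# node 0 is the root; feature/threshold describe the first split
print("root splits on X[%d] at %.2f" % (t.feature[0], t.threshold[0]))
print("samples reaching each node:", list(t.n_node_samples))
```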
Logistic regression - Background more influential than relationship with parents in adolescents’ perception of self-esteem
I am going to test a logistic regression using the following variables from the AddHealth codebook.
1. Primary Explanatory variable :
A two level categorical variable = (coded as H1WP17F) = In the past week have you had a talk with your parents about a personal problem you are having?
With 0=No and 1=Yes
2. Response variable :
A 6-level categorical variable = (coded as H1SE4) = Compared with other people your age, how intelligent are you? With 1=moderately below average, 2=slightly below average, 3=average, 4=slightly above average, 5=moderately above average, 6=extremely above average
Because this response variable has more than 2 categories I have to apply some data management.
Confounding Variables:
A. A two level categorical variable = (coded as H1RR1) = Have you been in a romantic relationship in the last 18 months?
With 0=No and 1=Yes
B. A two level categorical variable = (coded as H1ED3) = Have you ever skipped a grade?
With 0=No and 1=Yes
C. A two level categorical variable = (coded as H1GI4) = Are you of Hispanic or Latino origin?
With 0=No and 1=Yes
So, here is the code:
import pandas as pds
import numpy
import statsmodels.formula.api as smf
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pds.to_numeric(data['H1SE4'], errors='coerce')
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(6, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(8, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(9, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(6, numpy.nan)
sub1['H1ED3'] = data['H1ED3'].replace(8, numpy.nan)
sub1['H1GI4'] = data['H1GI4'].replace(6, numpy.nan)
sub1['H1GI4'] = data['H1GI4'].replace(8, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(3, numpy.nan)
def SMART(row):
    if row['H1SE4'] == 1:
        return 0
    elif row['H1SE4'] == 2:
        return 0
    else:
        return 1
sub1['SMART'] = sub1.apply(lambda row: SMART(row), axis=1)
#logistic regression
lreg1 = smf.logit(formula = 'SMART ~ H1WP17F', data = sub1).fit()
print (lreg1.summary())
Optimization terminated successfully.
Current function value: 0.230118
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: SMART No. Observations: 6123
Model: Logit Df Residuals: 6121
Method: MLE Df Model: 1
Date: Thu, 23 Mar 2017 Pseudo R-squ.: 0.001131
Time: 22:53:45 Log-Likelihood: -1409.0
converged: True LL-Null: -1410.6
LLR p-value: 0.07404
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.6569 0.066 40.283 0.000 2.528 2.786
H1WP17F 0.1984 0.112 1.771 0.077 -0.021 0.418
==============================================================================
While the earlier regression showed a significant association, collapsing my response variable into 2 levels has made the association non-significant: p > 0.05 (p = 0.077).
There is no statistically significant association between having personal conversations with parents and having high feelings of self-esteem.
I am nevertheless going to calculate the odds ratios.
#odds ratios
print ('Odds Ratios')
print (numpy.exp(lreg1.params))
Odds Ratios
Intercept 14.252033
H1WP17F 1.219464
Adolescents who have personal conversations with their parents have 1.22 times the odds of high self-esteem (though, as noted above, this is not statistically significant).
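The odds ratio is simply the exponential of the logit coefficient, so the 1.22 (and its confidence interval) can be checked by hand from the regression table:

```python
import numpy as np

coef = 0.1984                      # H1WP17F coefficient from the table above
odds_ratio = np.exp(coef)
ci_lower, ci_upper = np.exp(-0.021), np.exp(0.418)  # exponentiated CI bounds
print(round(odds_ratio, 3), round(ci_lower, 3), round(ci_upper, 3))
```

To rounding, this reproduces the Odds Ratios and confidence interval output above.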
Confidence Intervals
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
print (numpy.exp(conf))
Lower CI Upper CI OR
Intercept 12.523753 16.218815 14.252033
H1WP17F 0.979081 1.518867 1.219464
Let’s add confounding variables one by one
A. H1RR1 (romantic relationship in last 18 months)
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.6136 0.083 31.364 0.000 2.450 2.777
H1WP17F 0.1900 0.114 1.669 0.095 -0.033 0.413
H1RR1 0.1055 0.107 0.983 0.326 -0.105 0.316
==============================================================================
Odds Ratios
Intercept 13.647537
H1WP17F 1.209201
H1RR1 1.111267
dtype: float64
Lower CI Upper CI OR
Intercept 11.591067 16.068863 13.647537
H1WP17F 0.967442 1.511374 1.209201
H1RR1 0.900408 1.371506 1.111267
No confounding effect is associated with having been (or still being) in a romantic relationship.
B. H1ED3 – Having skipped a grade
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.6146 0.084 31.259 0.000 2.451 2.779
H1WP17F 0.1901 0.114 1.670 0.095 -0.033 0.413
H1ED3 0.1032 0.108 0.953 0.341 -0.109 0.316
==============================================================================
Odds Ratios
Intercept 13.662013
H1WP17F 1.209350
H1ED3 1.108762
dtype: float64
Lower CI Upper CI OR
Intercept 11.596229 16.095803 13.662013
H1WP17F 0.967569 1.511548 1.209350
H1ED3 0.896639 1.371069 1.108762
No confounding effect is associated with having skipped a grade.
C. H1GI4 – Hispanic or Latino origin
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.7602 0.071 38.685 0.000 2.620 2.900
H1WP17F 0.1730 0.113 1.537 0.124 -0.048 0.393
H1GI4 -0.5985 0.141 -4.251 0.000 -0.874 -0.323
==============================================================================
Odds Ratios
Intercept 15.802544
H1WP17F 1.188843
H1GI4 0.549660
dtype: float64
Lower CI Upper CI OR
Intercept 13.740246 18.174376 15.802544
H1WP17F 0.953576 1.482155 1.188843
H1GI4 0.417130 0.724297 0.549660
As we can see from p < 0.05, there is a significant NEGATIVE association between feelings of self-esteem and coming from a Latino background.
This third confounding variable is actually the only one significantly associated with the response variable. As we can see from the odds ratio (0.55), an adolescent of Latino background is about 1.8 times less likely to consider him/herself smart.
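For an odds ratio below 1, the "times less likely" reading is its reciprocal:

```python
or_latino = 0.549660               # odds ratio for H1GI4 from the output above

# reciprocal: how many times lower the odds are for the Latino group
times_less_likely = 1 / or_latino
print(round(times_less_likely, 2))  # ≈ 1.82
```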
Multiple regression
Happy Sunday!
I am going to investigate the relationship between the following variables:
1. Primary Explanatory variable :
A two level categorical variable = (coded as H1WP17F) = In the past week have you had a talk with your mom about a personal problem you are having?
With 0=No and 1=Yes
2. Response variable :
A 6-level categorical variable = (coded as H1SE4) = Compared with other people your age, how intelligent are you? With 1=moderately below average, 2=slightly below average, 3=average, 4=slightly above average, 5=moderately above average, 6=extremely above average
NB: In my code I centred this variable and renamed it 'intelligence'.
3. Confounding variable :
A two level categorical variable = (coded as H1RR1) = Have you been in a romantic relationship in the last 18 months?
With 0=No and 1=Yes
import pandas as pds
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pds.to_numeric(data['H1SE4'], errors='coerce')
data['H1RR1'] = pds.to_numeric(data['H1RR1'], errors='coerce')
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(6, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(8, numpy.nan)
sub1['H1RR1'] = data['H1RR1'].replace(9, numpy.nan)
#centre quantitative response variable for regression analysis
sub1['intelligence'] = (sub1['H1SE4'] - sub1['H1SE4'].mean())
print(sub1['intelligence'].mean())
reg2 = smf.ols('intelligence ~ H1WP17F + H1RR1', data=sub1).fit()
print (reg2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: intelligence R-squared: 0.004
Model: OLS Adj. R-squared: 0.004
Method: Least Squares F-statistic: 11.80
Date: Sun, 19 Mar 2017 Prob (F-statistic): 7.65e-06
Time: 09:15:46 Log-Likelihood: -9164.9
No. Observations: 6085 AIC: 1.834e+04
Df Residuals: 6082 BIC: 1.836e+04
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -0.0697 0.023 -3.067 0.002 -0.114 -0.025
H1WP17F 0.1152 0.029 3.977 0.000 0.058 0.172
H1RR1 0.0632 0.028 2.229 0.026 0.008 0.119
==============================================================================
Omnibus: 249.948 Durbin-Watson: 1.963
Prob(Omnibus): 0.000 Jarque-Bera (JB): 111.966
Skew: 0.083 Prob(JB): 4.86e-25
Kurtosis: 2.357 Cond. No. 2.95
==============================================================================
For both explanatory variables (primary and confounding) the p values are less than 0.05, and both t values are positive, indicating that having been in a romantic relationship and talking with your parents are associated with a higher perception of intelligence.
We can conclude that both explanatory variables are significantly associated with perception of intelligence, after taking out the part of the association accounted for by the other.
Let us try adding an additional potential confounding variable to the model:
A two level categorical variable = (coded as H1ED3) = Have you ever skipped a grade?
With 0=No and 1=Yes
OLS Regression Results
==============================================================================
Dep. Variable: intelligence R-squared: 0.004
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 8.117
Date: Sun, 19 Mar 2017 Prob (F-statistic): 2.16e-05
Time: 09:39:38 Log-Likelihood: -9158.7
No. Observations: 6083 AIC: 1.833e+04
Df Residuals: 6079 BIC: 1.835e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -0.0705 0.023 -3.095 0.002 -0.115 -0.026
H1WP17F 0.1152 0.029 3.977 0.000 0.058 0.172
H1RR1 0.0621 0.028 2.189 0.029 0.006 0.118
H1ED3 0.0883 0.091 0.972 0.331 -0.090 0.266
==============================================================================
Omnibus: 251.891 Durbin-Watson: 1.964
Prob(Omnibus): 0.000 Jarque-Bera (JB): 112.525
Skew: 0.083 Prob(JB): 3.68e-25
Kurtosis: 2.355 Cond. No. 8.08
==============================================================================
Interestingly having skipped a grade in school is not significantly associated with someone’s perception of intelligence.
As you can see, the confidence interval for H1ED3 includes zero.
Also, as the Adj. R-squared value shows, romantic relationship history and communication with parents explain only around 0.3% of the variance in a student's perception of intelligence.
What is the residual variability?
Let us try with a q-q plot
This indicates that our residuals do not follow a perfectly normal distribution. I should include more explanatory variables in my model.
Now let’s look at the standardized residuals
Most residuals fall between -3 and 2, and a few fall outside that range, so we do have some extreme outliers. I believe the fit of the model is quite poor; I should definitely include more explanatory variables.
The leverage plot shows that, again, there are several observations with higher-than-average leverage, but these are not outliers.
What this tells me is that the explanatory variables I took into consideration so far have a significant relationship with perception of intelligence but are not enough to build a robust predictive model of the response variable.
My first regression model
Bonjour!
This week I want to test a basic linear regression model.
I am going to investigate the relationship between the following variables:
1. Explanatory variable :
A two level categorical variable = (coded as H1WP17F) = In the past week have you had a talk with your mom about a personal problem you are having?
With 0=No and 1=Yes
2. Response variable :
A 4 level categorical variable = (coded as H1FS8) = how often do you feel hopeful about the future?
With 0=Never, 1=Sometimes, 2=A lot of the time, 3=Most of the time or all of the time
The objective is to test how a good parental relationship affects a youngster's feelings of self-esteem and self-efficacy.
import pandas as pds
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1FS8'] = pds.to_numeric(data['H1FS8'], errors='coerce')
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
sub1['H1FS8'] = data['H1FS8'].replace(6, numpy.nan)
sub1['H1FS8'] = data['H1FS8'].replace(8, numpy.nan)
print ('OLS regressions model for association parental support and hope towards the future')
reg1 = smf.ols ('H1FS8 ~ H1WP17F', data=sub1).fit()
print (reg1.summary())
OLS regressions model for association parental support and hope towards the future
OLS Regression Results
==============================================================================
Dep. Variable: H1FS8 R-squared: 0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.1034
Date: Sat, 11 Mar 2017 Prob (F-statistic): 0.748
Time: 10:29:36 Log-Likelihood: -9151.4
No. Observations: 6479 AIC: 1.831e+04
Df Residuals: 6477 BIC: 1.832e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 1.8506 0.014 135.341 0.000 1.824 1.877
H1WP17F -0.0025 0.008 -0.322 0.748 -0.018 0.013
==============================================================================
Omnibus: 439.909 Durbin-Watson: 2.003
Prob(Omnibus): 0.000 Jarque-Bera (JB): 239.490
Skew: -0.316 Prob(JB): 9.90e-53
Kurtosis: 2.302 Cond. No. 2.11
==============================================================================
The association is not significant (p = 0.748), so let us try another response variable more closely related to self-esteem.
Response variable :
A 6-level categorical variable = (coded as H1SE4) = Compared with other people your age, how intelligent are you? With 1=moderately below average, 2=slightly below average, 3=average, 4=slightly above average, 5=moderately above average, 6=extremely above average
import pandas as pds
import numpy
import statsmodels.api as sm
import statsmodels.formula.api as smf
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1SE4'] = pds.to_numeric(data['H1SE4'], errors='coerce')
sub1=data
sub1['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
sub1['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(96, numpy.nan)
sub1['H1SE4'] = data['H1SE4'].replace(98, numpy.nan)
print ('OLS regression model for association between parental support and perceived intelligence')
reg1 = smf.ols ('H1SE4 ~ H1WP17F', data=sub1).fit()
print (reg1.summary())
OLS regression model for association between parental support and perceived intelligence
OLS Regression Results
==============================================================================
Dep. Variable: H1SE4 R-squared: 0.005
Model: OLS Adj. R-squared: 0.005
Method: Least Squares F-statistic: 34.36
Date: Sat, 11 Mar 2017 Prob (F-statistic): 4.81e-09
Time: 10:36:21 Log-Likelihood: -15806.
No. Observations: 6478 AIC: 3.162e+04
Df Residuals: 6476 BIC: 3.163e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 3.8502 0.038 100.752 0.000 3.775 3.925
H1WP17F 0.1264 0.022 5.862 0.000 0.084 0.169
==============================================================================
Omnibus: 16272.765 Durbin-Watson: 1.995
Prob(Omnibus): 0.000 Jarque-Bera (JB): 222208638.051
Skew: 27.575 Prob(JB): 0.00
Kurtosis: 908.653 Cond. No. 2.12
==============================================================================
This time the relationship is significant:
P=4.81e-09
Equation:
Feelings of self-esteem compared to others = 3.85 + 0.126*parental talk
However, looking at the R-squared value, I can say that talking with your parents predicts only a minimal part of the variation in self-esteem perceptions (R-squared = 0.005).
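Plugging the two levels of the explanatory variable into the fitted equation gives the predicted means for each group:

```python
# coefficients from the fitted equation above
intercept, slope = 3.8502, 0.1264

pred_no_talk = intercept + slope * 0   # H1WP17F = 0 (no personal talk)
pred_talk = intercept + slope * 1      # H1WP17F = 1 (had a personal talk)
print(round(pred_no_talk, 4), round(pred_talk, 4))  # 3.8502 3.9766
```

The predicted difference between the two groups is only about 0.13 on a 6-point scale, which is consistent with the tiny R-squared.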
Writing about your data - Week 1
The Sample is drawn from The National Longitudinal Study of Adolescent Health (AddHealth) – Wave 1 .
The study used the following steps to build the sample:
- 80 high schools were selected on the basis of region, urbanicity, school type, ethnic mix and size.
- From September 1994 to April 1995, in-school questionnaires were administered to these schools’ students. The in-school questionnaire investigated the school context, friendship networks, school activities, future expectations, health conditions.
- From the students who answered the in-school questionnaires, a nationally representative sample of around 15,000 American adolescents in grades 7-12 was selected on the basis of sex, grade, ethnicity, disability and family roster. These students were then administered a 90-minute in-home interview.
To protect confidentiality, no paper questionnaires were used. Instead, all data were recorded on laptop computers. For less sensitive topics, the interviewer read the questions aloud and entered the respondent's answers. For more sensitive topics, the respondent listened through earphones to pre-recorded questions and entered the answers directly. In addition to maintaining data security, this minimized the potential for interviewer or parental influence.
The in-home interview investigated family composition and dynamics, romantic expectations, and aspirations.
The purpose of the study was to specifically investigate the relationship between family roster, school context and larger context and likelihood of students to engage in risky sex behaviour.
My research question investigates the correlation between parenting style and feelings of independence and self-esteem. The variables analysed can be found in the following sections of the AddHealth Codebook.
- Section 9 – Self-efficacy
- Section 10 – Feelings scale
- Section 16 – Relations with Parents
- Section 18 – Personality and Family
Sections 9 and 10 included a number of variables describing the respondents' feelings of self-esteem (how they see themselves) and self-efficacy (feelings of autonomy). These variables constituted my response variables.
In Sections 16 and 18, respondents were asked to assess their relationship with their parents: how supportive their family is, how independent their parents want them to be, and so on. These constituted my explanatory variables.
Respondents’ feelings and perceptions are assessed using a Likert scale. When responding to a Likert item, respondents specify their level of agreement or disagreement on a symmetric agree-disagree scale for a series of statements.
Scales of this kind can produce distorted responses for a number of reasons:
- respondents may avoid expressing extreme opinions
- respondents may tend to agree with statements as they are expressed
Given that the AddHealth questionnaire was mostly administered in a safe environment (the respondent's home) and students were left alone to submit their answers on a computer, I believe the risk of running into the above limitations was largely avoided.
However, I also decided to look at a number of variables measuring similar feelings and perceptions, precisely because I wanted to avoid reaching conclusions from only one set of correlations.
When managing my data, I also grouped together 'strongly agree' and 'agree' responses, and 'strongly disagree' and 'disagree' responses, in order to account for central tendency bias.
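The grouping of agreement levels can be done with a simple mapping in pandas. A sketch assuming a hypothetical 1-5 Likert coding (1=strongly agree ... 5=strongly disagree), which may not match the codebook's actual values:

```python
import pandas as pd

# hypothetical Likert answers on a 1-5 scale
responses = pd.Series([1, 2, 3, 4, 5, 2, 4])

# collapse: strongly agree/agree -> 'agree'; disagree/strongly disagree -> 'disagree'
collapsed = responses.map({1: 'agree', 2: 'agree',
                           3: 'neutral',
                           4: 'disagree', 5: 'disagree'})
print(collapsed.value_counts())
```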
Testing a potential moderator
This week I decided to measure the linear relationship between two variables part of the AddHealth dataset including in the analysis a third variable, a moderator.
Variable 1 = Independent variable (x)
H1WP8 - On how many of the past 7 days were your parents in the room with you while you ate your evening meal?
Variable 2 = Dependent variable (y)
H1ID6 - How much would you like to have a romantic relationship in the next year?
Moderator
H1WP17F - Have you had a talk with your parents about a personal problem you are having?
I already found out last week that there is a WEAK negative relationship between x and y. I now want to investigate whether this is still the case when I split observations in 2 groups.
Sub 1: students that recently spoke of their personal problems with their parents
Sub 2: students that have not recently spoken of their personal problems with their parents
Script
import pandas as pds
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv("D:\\AddHealth_Coursera\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP8'] = pds.to_numeric(data['H1WP8'], errors='coerce')
data['H1ID6'] = pds.to_numeric(data['H1ID6'], errors='coerce')
data['H1WP17F'] = pds.to_numeric(data['H1WP17F'], errors='coerce')
data['H1WP8'] = data['H1WP8'].replace(96, numpy.nan)
data['H1WP8'] = data['H1WP8'].replace(97, numpy.nan)
data['H1WP8'] = data['H1WP8'].replace(98, numpy.nan)
data['H1ID6'] = data['H1ID6'].replace(7, numpy.nan)
data['H1ID6'] = data['H1ID6'].replace(8, numpy.nan)
data['H1ID6'] = data['H1ID6'].replace(9, numpy.nan)
data['H1WP17F'] = data['H1WP17F'].replace(6, numpy.nan)
data['H1WP17F'] = data['H1WP17F'].replace(7, numpy.nan)
data['H1WP17F'] = data['H1WP17F'].replace(8, numpy.nan)
data_clean = data.dropna()
print (scipy.stats.pearsonr(data_clean['H1WP8'], data_clean['H1ID6']))
sub1=data_clean[(data_clean['H1WP17F']== 0)]
sub2=data_clean[(data_clean['H1WP17F']== 1)]
print ('association between weekly meals and romantic expectations for students not talking of personal problems with parents')
print (scipy.stats.pearsonr(sub1['H1WP8'], sub1['H1ID6']))
print (' ')
print ('association between weekly meals and romantic expectations for students talking of personal problems with parents')
print (scipy.stats.pearsonr(sub2['H1WP8'], sub2['H1ID6']))
Output
association between weekly meals and romantic expectations for students not talking of personal problems with parents
(-0.10417957730826818, 1.9417985672438483e-10)
association between weekly meals and romantic expectations for students talking of personal problems with parents
(-0.10564892211849905, 2.9501572305228157e-07)
According to the results of the correlation analysis, there is a weak negative relationship between the number of weekly family meals and romantic expectations in both groups. The correlation is very weak (around -0.1 in each subgroup); however, it is statistically significant (p lower than 0.05).
I expected the relationship between the two variables to be weaker for subgroup 2, the students who talk about personal problems with their parents, but, according to the analysis, students who have recently had personal chats with their parents are even less likely, if only marginally, to desire being in a romantic relationship.
Below are the scatterplots for sub 1 (students who have NOT had a personal talk with their parents recently) and sub 2 (students who have had a personal talk with their parents recently).
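Since the two subgroup correlations are so close, a formal way to check whether they actually differ is a Fisher r-to-z test. This is not part of the course script, just a sketch; the sample sizes n1 and n2 below are placeholders, not the real AddHealth subgroup counts:

```python
import numpy as np
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher transformation of each r
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of the difference
    z = (z1 - z2) / se
    p = 2 * (1 - norm.cdf(abs(z)))                 # two-tailed p-value
    return z, p

# correlations from the two subgroups above; n1 and n2 are placeholder sample sizes
z, p = compare_correlations(-0.1042, 3500, -0.1056, 2300)
print(z, p)
```

With correlations this similar, the test unsurprisingly gives a p-value far above 0.05, i.e. no evidence that the moderator changes the strength of the association.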
0 notes
Text
week 3. correlation coefficient.
Hola!
This week I decided to measure the linear relationship between two variables from the AddHealth dataset.
Variable 1 = Independent variable (x)
H1WP8 - On how many of the past 7 days were your parents in the room with you while you ate your evening meal?
Variable 2 = Dependent variable (y)
H1ID6 - How much would you like to have a romantic relationship in the next year?
#Here is the script I ran to get the scatterplot
import pandas as pds
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt
PATH = "D:\\AddHealth_Coursera"
data = pds.read_csv(PATH + "\\addhealth_pds.csv", low_memory=False)
#setting variables I will be working with to numeric:
data['H1WP8'] = pds.to_numeric(data['H1WP8'], errors='coerce')
data['H1ID6'] = pds.to_numeric(data['H1ID6'], errors='coerce')
#recoding the missing-data codes as NaN:
data['H1WP8'] = data['H1WP8'].replace([96, 97, 98], numpy.nan)
data['H1ID6'] = data['H1ID6'].replace([7, 8, 9], numpy.nan)
scat1 = seaborn.regplot(x='H1WP8', y='H1ID6', fit_reg=True, data=data)
plt.xlabel('H1WP8')
plt.ylabel('H1ID6')
plt.title('Scatterplot for the Association between weekly meals and romantic expectations')
plt.show()
#Here is my output
I am aware of the fact that, as both of my variables take discrete values, the scatterplot is not that useful.
Looking at the graph, I would expect my Pearson correlation coefficient to be negative.
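One way to make a scatterplot of two discrete variables more readable is to add jitter, which seaborn's regplot supports directly via its x_jitter and y_jitter parameters. A quick sketch with made-up data standing in for the AddHealth columns:

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn

# Made-up discrete data standing in for H1WP8 (0-7) and H1ID6 (1-5)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'H1WP8': rng.integers(0, 8, 200),
                     'H1ID6': rng.integers(1, 6, 200)})

# x_jitter/y_jitter nudge overlapping points apart so point density is visible
seaborn.regplot(x='H1WP8', y='H1ID6', data=demo, fit_reg=True,
                x_jitter=0.25, y_jitter=0.25)
plt.title('Jittered scatterplot (demo data)')
plt.savefig('jitter_demo.png')
```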
#Here is the script I ran to calculate the coefficient
data_clean = data.dropna(subset=['H1WP8', 'H1ID6'])
print ('Association between weekly meals and romantic expectations')
print (scipy.stats.pearsonr(data_clean['H1WP8'], data_clean['H1ID6']))
#And here is the output
Association between weekly meals and romantic expectations
(-0.10415897700852748, 1.1554106741087854e-16)
The second value in the brackets, the p-value, tells me that the correlation is significant.
However, the r value (-0.104) tells me that the negative linear relationship between the number of family meals in the past week and romantic expectations is quite weak.
In other words, when we square r we get 0.010849092491462957, which means that only about 1% of the variability in romantic expectations can be explained by the number of family meals one has in a week.
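That figure is just the square of the coefficient from the output above:

```python
r = -0.10415897700852748  # Pearson r from the output above
r_squared = r ** 2
print(r_squared)          # about 0.0108, i.e. roughly 1% of the variability
```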
Thank goodness! Otherwise it would appear that the more you eat together with your child, the less likely they are to want a romantic relationship!
0 notes