datavvisualization
Data Visualization
8 posts
Machine Learning for Data Analysis 4
Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn import preprocessing
from sklearn.cluster import KMeans

plt.rcParams['figure.figsize'] = (15, 5)

# Load the dataset
loans = pd.read_csv("./LendingClub.csv", low_memory=False)
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: 1 if x == 0 else 0)
loans.drop('bad_loans', axis=1, inplace=True)

# Select features to handle.
# This time we ignore the 'grade' and 'sub_grade' predictors,
# assuming those are themselves a way of "clustering" the loans.
predictors = ['short_emp',             # one year or less of employment
              'emp_length_num',        # number of years of employment
              'home_ownership',        # home ownership status: own, mortgage or rent
              'dti',                   # debt-to-income ratio
              'purpose',               # the purpose of the loan
              'term',                  # the term of the loan
              'last_delinq_none',      # has the borrower had a delinquency
              'last_major_derog_none', # has the borrower had a 90-day or worse rating
              'revol_util',            # percent of available credit being used
              'total_rec_late_fee',    # total late fees received to date
             ]
target = 'safe_loans'                  # prediction target (y) (1 means safe, 0 is risky)
ignored = ['grade',                    # grade of the loan
           'sub_grade',                # sub-grade of the loan
          ]

# Extract the predictor and target columns
loans = loans[predictors + [target]]

# Delete rows where any of the data are missing
loans = loans.dropna()

# Convert categorical text variables into numerical ones
categorical = ['home_ownership', 'purpose', 'term']
for attr in categorical:
    attributes_list = list(set(loans[attr]))
    loans[attr] = [attributes_list.index(idx) for idx in loans[attr]]

print(loans.describe().T)

# MODELING AND PREDICTION
# Standardize clustering variables to have mean = 0 and sd = 1
for attr in predictors:
    loans[attr] = preprocessing.scale(loans[attr].astype('float64'))

# Split the data into train and test sets
clus_train, clus_test = train_test_split(loans[predictors], test_size=.3, random_state=123)
print('clus_train.shape', clus_train.shape)
print('clus_test.shape',  clus_test.shape)

# K-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k).fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

# Plot the average distance from the observations to their cluster centroid
# and use the elbow method to identify the number of clusters to choose
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()

# Interpret the 5-cluster solution
model = KMeans(n_clusters=5)
model.fit(clus_train)
clusassign = model.predict(clus_train)

# Plot the clusters on the first two principal components
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 5 Clusters')
plt.show()

# BEGIN multiple steps to merge the cluster assignment with the clustering
# variables in order to examine cluster variable means by cluster.
# Create a unique identifier variable from the index of the cluster training
# data to merge with the cluster assignment variable.
clus_train.reset_index(level=0, inplace=True)
# Create a list holding the new index variable
cluslist = list(clus_train['index'])
# Create a list of cluster assignments
labels = list(model.labels_)
# Combine the index list and the cluster assignment list into a dictionary
newlist = dict(zip(cluslist, labels))
# Convert the newlist dictionary to a DataFrame
newclus = DataFrame.from_dict(newlist, orient='index')
# Rename the cluster assignment column
newclus.columns = ['cluster']
# Create a unique identifier variable from the index of the cluster assignment
# DataFrame to merge with the cluster training data
newclus.reset_index(level=0, inplace=True)
# Merge the cluster assignment DataFrame with the cluster training data
# on the index variable
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# Cluster frequencies
merged_train.cluster.value_counts()
# END multiple steps to merge cluster assignment with clustering variables

# FINALLY calculate the clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)

# Validate the clusters in the training data by examining cluster differences
# in SAFE_LOANS using ANOVA. First merge SAFE_LOANS with the clustering
# variables and the cluster assignment data.
gpa_data = loans['safe_loans']
# Split the safe_loans data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1 = pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['safe_loans', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='safe_loans ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

print('Means for SAFE_LOANS by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('Standard deviations for SAFE_LOANS by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['safe_loans'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Output:
[Image: summary statistics table from loans.describe().T]
-----------------------------------
clus_train.shape (85824, 10)
clus_test.shape (36783, 10)
-----------------------------------
[Image: elbow plot of average distance vs. number of clusters]
-----------------------------------
[Image: scatterplot of canonical variables for 5 clusters]
-----------------------------------
1    26830
2    25138
4    13664
0    10525
3     9667
Name: cluster, dtype: int64
-----------------------------------
[Image: clustering variable means by cluster]
-----------------------------------
[Image: OLS regression summary for safe_loans ~ C(cluster)]
----------------------------------- 
Means for SAFE_LOANS by cluster
         safe_loans
cluster
0          0.788504
1          0.869027
2          0.818641
3          0.818248
4          0.691745
-----------------------------------
Standard deviations for SAFE_LOANS by cluster
         safe_loans
cluster
0          0.408389
1          0.337377
2          0.385323
3          0.385660
4          0.461790
-----------------------------------
[Image: Tukey HSD multiple comparison summary]
Machine Learning for Data Analysis 3
Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.linear_model import LassoLarsCV

# Load the dataset
loans = pd.read_csv("./LendingClub.csv", low_memory=False)

# Recode the target column: 1 means safe, 0 means risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: 1 if x == 0 else 0)
loans.drop('bad_loans', axis=1, inplace=True)

# Select features to handle
predictors = ['grade',                 # grade of the loan
              'sub_grade',             # sub-grade of the loan
              'short_emp',             # one year or less of employment
              'emp_length_num',        # number of years of employment
              'home_ownership',        # home ownership status: own, mortgage or rent
              'dti',                   # debt-to-income ratio
              'purpose',               # the purpose of the loan
              'term',                  # the term of the loan
              'last_delinq_none',      # has the borrower had a delinquency
              'last_major_derog_none', # has the borrower had a 90-day or worse rating
              'revol_util',            # percent of available credit being used
              'total_rec_late_fee',    # total late fees received to date
             ]
target = 'safe_loans'                  # prediction target (y) (1 means safe, 0 is risky)

# Extract the predictor and target columns
loans = loans[predictors + [target]]

# Delete rows where any of the data are missing
data_clean = loans.dropna()

# Convert categorical variables into binary (dummy) variables
data_clean = pd.get_dummies(data_clean, prefix_sep='=')

# Describe the current dataset
print(data_clean.describe().T)

# Extract the new feature names
features = data_clean.columns.values
features = features[features != target]

# Modeling and prediction
predvar = data_clean[features]
predictors = predvar.copy()
target = data_clean.safe_loans

# Standardize predictors to have mean = 0 and sd = 1
from sklearn import preprocessing
for attr in predictors.columns.values:
    predictors[attr] = preprocessing.scale(predictors[attr].astype('float64'))

# Split into training and testing sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
                                                              test_size=.4,
                                                              random_state=123)
print('pred_train.shape', pred_train.shape)
print('pred_test.shape',  pred_test.shape)
print('tar_train.shape',  tar_train.shape)
print('tar_test.shape',   tar_test.shape)

# Specify the lasso regression model with 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# Print variable names and regression coefficients
print(pd.DataFrame([dict(zip(predictors.columns, model.coef_))], index=['coef']).T)

# Plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# Plot the mean squared error for each fold
# (cv_mse_path_ was renamed mse_path_ in recent scikit-learn versions)
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

# MSE on the training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square on the training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
Explanation:
LendingClub.csv is a dataset taken from LendingClub (https://www.lendingclub.com/), a peer-to-peer lending company that directly connects borrowers with potential lenders/investors.
The target (label) column of the dataset that we are interested in is called “bad_loans”. In this column, 1 means a risky (bad) loan and 0 means a safe loan. To make this more intuitive, we reassign the target so that 1 means a safe loan and 0 means a risky (bad) loan, and store it in a new column called “safe_loans”.
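As a quick sanity check on this recoding (a minimal sketch; the file and column names are those used in the code above), one can print the class balance of the new target:

import pandas as pd

loans = pd.read_csv("./LendingClub.csv", low_memory=False)
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: 1 if x == 0 else 0)

# Fraction of safe (1) vs. risky (0) loans in the raw data
print(loans['safe_loans'].value_counts(normalize=True))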
Output:
[Image: summary statistics table from data_clean.describe().T]
pred_train.shape (73564, 67)
pred_test.shape (49043, 67)
tar_train.shape (73564,)
tar_test.shape (49043,)
[Images: regression coefficients table, coefficient progression plot, mean squared error per fold plot]
training data MSE
0.141354906717
test data MSE
0.140656085708
training data R-square
0.0799940399148
test data R-square
0.0772929635462
Machine Learning for Data Analysis 2
Code:
import pandas as pd
import matplotlib.pylab as plt
import sklearn.metrics
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import LabelEncoder

credit = pd.read_csv("credit.txt", sep="\t")
credit = credit.dropna()
targets = LabelEncoder().fit_transform(credit['default'])
predictors = credit.loc[:, credit.columns != 'default']  # .ix is deprecated, use .loc

# Recode categorical variables as numeric variables
predictors.dtypes
for i in range(0, len(predictors.dtypes)):
    if predictors.dtypes[i] != 'int64':
        predictors[predictors.columns[i]] = LabelEncoder().fit_transform(predictors[predictors.columns[i]])

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3)

# Build the model on the training data
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

# Make predictions on the testing data
predictions = classifier.predict(pred_test)

# Calculate accuracy
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

# Fit an extra-trees model to the training data
model = ExtraTreesClassifier().fit(pred_train, tar_train)

# Display the relative importance of each attribute
print(pd.Series(model.feature_importances_, index=predictors.columns).sort_values(ascending=False))

# Run different numbers of trees and see the effect on prediction accuracy
ntree = [50, 150, 250, 350, 450, 550, 650, 750, 850, 950, 1000]
accuracy = []
for idx in range(len(ntree)):
    classifier = RandomForestClassifier(n_estimators=ntree[idx])
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy.append(sklearn.metrics.accuracy_score(tar_test, predictions))
pd.Series(accuracy, index=ntree).sort_values(ascending=False)
plt.plot(ntree, accuracy)
plt.show()
Explanation:
In the above procedure, we first build a random forest with 25 decision trees. This gives us 74% accuracy on the testing data.
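The 74% figure can be read straight off the confusion matrix in the output below (a minimal arithmetic check; the matrix values are from this run):

# Confusion matrix from the output below: rows are actual classes, columns are predicted
# [[189, 32],
#  [ 46, 33]]
correct = 189 + 33               # diagonal: correctly classified cases
total = 189 + 32 + 46 + 33       # all test cases
print(correct / total)           # 0.74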
We also explore the importance of the 16 explanatory variables. The three most important ones are checking_balance, amount, and months_loan_duration, which differ slightly from those obtained in R.
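For instance, the three top predictors quoted above can be pulled out directly from the fitted extra-trees model (a sketch reusing model and predictors from the code above):

import pandas as pd

importances = pd.Series(model.feature_importances_, index=predictors.columns)
print(importances.nlargest(3))  # expected: checking_balance, amount, months_loan_duration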
We finally run the random forest with different numbers of decision trees. The results show that we obtain the highest accuracy, 76%, with either 850 or 250 trees. We would choose 250, since it requires less computation time.
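The computation-time argument can be made concrete by timing the two candidate forests (a sketch reusing pred_train and tar_train from the code above; absolute timings will vary by machine):

import time
from sklearn.ensemble import RandomForestClassifier

for n in (250, 850):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=n).fit(pred_train, tar_train)
    # training time grows roughly linearly with the number of trees
    print(n, 'trees:', round(time.perf_counter() - start, 2), 's')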
Output:
array([[189,  32],
       [ 46,  33]])
0.73999999999999999
checking_balance        0.133015
amount                  0.109541
months_loan_duration    0.096196
age                     0.086818
employment_duration     0.064515
credit_history          0.064045
percent_of_income       0.063428
purpose                 0.063158
savings_balance         0.055704
years_at_residence      0.052617
job                     0.045315
existing_loans_count    0.039384
other_credit            0.038604
housing                 0.035119
phone                   0.030843
dependents              0.021698
dtype: float64
850     0.760000
250     0.760000
1000    0.756667
950     0.756667
750     0.756667
650     0.756667
450     0.756667
350     0.756667
550     0.743333
50      0.740000
150     0.736667
dtype: float64
[Image: accuracy vs. number of trees]
Machine Learning for Data Analysis 1
import pandas as pd
import sklearn.metrics
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
from io import StringIO
from IPython.display import Image
import pydotplus

credit = pd.read_csv("credit.csv")
credit = credit.dropna()
targets = LabelEncoder().fit_transform(credit['default'])
predictors = credit.loc[:, credit.columns != 'default']  # .ix is deprecated, use .loc

# Recode categorical variables as numeric variables
predictors.dtypes
for i in range(0, len(predictors.dtypes)):
    if predictors.dtypes[i] != 'int64':
        predictors[predictors.columns[i]] = LabelEncoder().fit_transform(predictors[predictors.columns[i]])

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

# Build the model on the training data
classifier = DecisionTreeClassifier().fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)

# Display the decision tree
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
Output:
array([[220,  66],
       [ 58,  56]])
0.68999999999999995
[Image: exported decision tree graph]
Explanation:
We get an accuracy of 68%, but the resulting decision tree is far too complex.
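One common remedy, not used in the code above, is to cap the depth of the tree; a minimal sketch assuming the same pred_train/pred_test split from the code above:

from sklearn.tree import DecisionTreeClassifier
import sklearn.metrics

# A shallower tree is far easier to read and display, usually at a small cost in accuracy
pruned = DecisionTreeClassifier(max_depth=4).fit(pred_train, tar_train)
print(sklearn.metrics.accuracy_score(tar_test, pruned.predict(pred_test)))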
Data Visualization & Management
Final
Code:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt

# Load the gapminder dataset
data = pd.read_csv('gapminder.csv', low_memory=False)

# Lower-case all DataFrame column names
data.columns = map(str.lower, data.columns)

# Bug fix for display formats to avoid run-time errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# Suppress the 'space' values
data = data.replace([r'\s+'], np.nan, regex=True)

# Set the variables to be numeric
data['suicideper100th'] = pd.to_numeric(data['suicideper100th'])
data['breastcancerper100th'] = pd.to_numeric(data['breastcancerper100th'])
data['hivrate'] = pd.to_numeric(data['hivrate'])
data['employrate'] = pd.to_numeric(data['employrate'])

# Subset the data for a high suicide rate, based on the summary statistics
sub = data[(data['suicideper100th'] > 12)]
sub_copy = sub.copy()

# Univariate graph of the breast cancer rate for people with a high suicide rate
plt.figure(1)
sb.distplot(sub_copy["breastcancerper100th"].dropna(), kde=False)
plt.xlabel('Breast cancer rate')
plt.ylabel('Frequency')
plt.title('Breast cancer rate for people with a high suicide rate')

# Univariate graph of the HIV rate for people with a high suicide rate
plt.figure(2)
sb.distplot(sub_copy["hivrate"].dropna(), kde=False)
plt.xlabel('HIV rate')
plt.ylabel('Frequency')
plt.title('HIV rate for people with a high suicide rate')

# Univariate graph of the employment rate for people with a high suicide rate
plt.figure(3)
sb.distplot(sub_copy["employrate"].dropna(), kde=False)
plt.xlabel('Employment rate')
plt.ylabel('Frequency')
plt.title('Employment rate for people with a high suicide rate')

# Bivariate graph of the association of the breast cancer rate with the HIV rate
# for people with a high suicide rate
plt.figure(4)
sb.regplot(x="hivrate", y="breastcancerper100th", fit_reg=False, data=sub_copy)
plt.xlabel('HIV Rate')
plt.ylabel('Breast cancer rate')
plt.title('Breast cancer rate vs. HIV rate for people with a high suicide rate')

Output and explanation:
[Image: histogram of breast cancer rate]
This graph is unimodal and skewed to the right; the highest peak is in the 0-20 range of the breast cancer rate.
[Image: histogram of HIV rate]
This graph is unimodal and skewed to the right as well; the highest peak is at an HIV rate of 0-1%. So having HIV does not seem to influence the suicide rate.
[Image: histogram of employment rate]
This graph is unimodal and more or less symmetric; the highest peak is at 55-60%, around the median of the employment rate.
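The shapes described above can be checked numerically with pandas' sample skewness (a small sketch reusing sub_copy from the code above; positive values indicate right skew, values near zero a roughly symmetric distribution):

for col in ('breastcancerper100th', 'hivrate', 'employrate'):
    print(col, sub_copy[col].dropna().skew())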
[Image: scatterplot of breast cancer rate vs. HIV rate]
This graph plots the breast cancer rate against the HIV rate for people with a high suicide rate. It shows that observations with high breast cancer rates tend to have low HIV rates; the two rarely occur together.
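That visual impression can be quantified with a simple correlation on the same subset (a sketch; a weak or negative coefficient would support the reading above):

print(sub_copy[['hivrate', 'breastcancerper100th']].corr())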
Data Visualization & Management
2nd code
Code:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

data = pd.read_csv('addhealth_pds.csv', low_memory=False)

# Bug fix for display formats to avoid run-time errors
pd.set_option('display.float_format', lambda x: '%f' % x)
pd.set_option('mode.chained_assignment', None)

# Creation of the subsets (.copy() avoids modifying the original DataFrame)
mother = data[['H1WP9', 'H1WP10', 'H1WP17H', 'H1WP17I', 'H1WP17J']].copy()
father = data[['H1WP13', 'H1WP14', 'H1WP18H', 'H1WP18I', 'H1WP18J']].copy()

# Convert data to numeric values
# Mother-child relationship
mother['H1WP9'] = pd.to_numeric(mother['H1WP9'], downcast='integer')  # How close the child feels to his mother figure
mother['H1WP10'] = pd.to_numeric(mother['H1WP10'])    # How much the child thinks she cares about him
mother['H1WP17H'] = pd.to_numeric(mother['H1WP17H'])  # Did the mother figure talk about school
mother['H1WP17I'] = pd.to_numeric(mother['H1WP17I'])  # Did the mother figure work on a school project
mother['H1WP17J'] = pd.to_numeric(mother['H1WP17J'])  # Did the mother figure talk about other things the child does at school

# Father-child relationship
father['H1WP13'] = pd.to_numeric(father['H1WP13'])    # How close the child feels to his father figure
father['H1WP14'] = pd.to_numeric(father['H1WP14'])    # How much the child thinks he cares about him
father['H1WP18H'] = pd.to_numeric(father['H1WP18H'])  # Did the father figure talk about school
father['H1WP18I'] = pd.to_numeric(father['H1WP18I'])  # Did the father figure work on a school project
father['H1WP18J'] = pd.to_numeric(father['H1WP18J'])  # Did the father figure talk about other things the child does at school

# Replace the values 'refused', 'legitimate skip', "don't know", and 'not applicable' with NaN,
# then map the remaining codes to their labels
mother['H1WP9'] = mother['H1WP9'].replace([6, 7, 8], np.nan)
mother['H1WP9'] = mother['H1WP9'].replace([1, 2, 3, 4, 5],
                                          ['not at all', 'very little', 'somewhat', 'quite a bit', 'very much'])

mother['H1WP10'] = mother['H1WP10'].replace([6, 7, 8, 9], np.nan)
mother['H1WP10'] = mother['H1WP10'].replace([1, 2, 3, 4, 5],
                                            ['not at all', 'very little', 'somewhat', 'quite a bit', 'very much'])

mother['H1WP17H'] = mother['H1WP17H'].replace([6, 7, 8, 9], np.nan)
mother['H1WP17H'] = mother['H1WP17H'].replace([0, 1], ['no', 'yes'])

mother['H1WP17I'] = mother['H1WP17I'].replace([6, 7, 8, 9], np.nan)
mother['H1WP17I'] = mother['H1WP17I'].replace([0, 1], ['no', 'yes'])

mother['H1WP17J'] = mother['H1WP17J'].replace([6, 7, 8, 9], np.nan)
mother['H1WP17J'] = mother['H1WP17J'].replace([0, 1], ['no', 'yes'])

# ---------------------------------------------------

father['H1WP13'] = father['H1WP13'].replace([6, 7, 8], np.nan)
father['H1WP13'] = father['H1WP13'].replace([1, 2, 3, 4, 5],
                                            ['not at all', 'very little', 'somewhat', 'quite a bit', 'very much'])

father['H1WP14'] = father['H1WP14'].replace([6, 7, 8, 9], np.nan)
father['H1WP14'] = father['H1WP14'].replace([1, 2, 3, 4, 5],
                                            ['not at all', 'very little', 'somewhat', 'quite a bit', 'very much'])

father['H1WP18H'] = father['H1WP18H'].replace([6, 7, 8, 9], np.nan)
father['H1WP18H'] = father['H1WP18H'].replace([0, 1], ['no', 'yes'])

father['H1WP18I'] = father['H1WP18I'].replace([6, 7, 8, 9], np.nan)
father['H1WP18I'] = father['H1WP18I'].replace([0, 1], ['no', 'yes'])

father['H1WP18J'] = father['H1WP18J'].replace([6, 7, 8, 9], np.nan)
father['H1WP18J'] = father['H1WP18J'].replace([0, 1], ['no', 'yes'])

print("Mother-child relationship")
# How close the child feels to his mother figure
print("Counts of how close the child feels to his mother figure")
moc1 = mother['H1WP9'].value_counts(sort=False, dropna=False)
print(moc1)
print("Percentage of how close the child feels to his mother figure")
mop1 = mother['H1WP9'].value_counts(sort=False, dropna=False, normalize=True)
print(mop1, '\n')

# How much the child thinks she cares about him
print("Counts of how much the child thinks she cares about him")
moc2 = mother['H1WP10'].value_counts(sort=False, dropna=False)
print(moc2)
print("Percentage of how much the child thinks she cares about him")
mop2 = mother['H1WP10'].value_counts(sort=False, dropna=False, normalize=True)
print(mop2, '\n')

# Whether the mother figure talked about school
print("Counts of whether the mother figure talked about school")
moc3 = mother['H1WP17H'].value_counts(sort=False, dropna=False)
print(moc3)
print("Percentage of whether the mother figure talked about school")
mop3 = mother['H1WP17H'].value_counts(sort=False, dropna=False, normalize=True)
print(mop3, '\n')

# Whether the mother figure worked on a school project
print("Counts of whether the mother figure worked on a school project")
moc4 = mother['H1WP17I'].value_counts(sort=False, dropna=False)
print(moc4)
print("Percentage of whether the mother figure worked on a school project")
mop4 = mother['H1WP17I'].value_counts(sort=False, dropna=False, normalize=True)
print(mop4, '\n')

# Whether the mother figure talked about other things the child does at school
print("Counts of whether the mother figure talked about other things the child does at school")
moc5 = mother['H1WP17J'].value_counts(sort=False, dropna=False)
print(moc5)
print("Percentage of whether the mother figure talked about other things the child does at school")
mop5 = mother['H1WP17J'].value_counts(sort=False, dropna=False, normalize=True)
print(mop5, '\n')

# ---------------------------------------------------

print("Father-child relationship")
# How close the child feels to his father figure
print("Counts of how close the child feels to his father figure")
fac1 = father['H1WP13'].value_counts(sort=False, dropna=False)
print(fac1)
print("Percentage of how close the child feels to his father figure")
fap1 = father['H1WP13'].value_counts(sort=False, dropna=False, normalize=True)
print(fap1, '\n')

# How much the child thinks he cares about him
print("Counts of how much the child thinks he cares about him")
fac2 = father['H1WP14'].value_counts(sort=False, dropna=False)
print(fac2)
print("Percentage of how much the child thinks he cares about him")
fap2 = father['H1WP14'].value_counts(sort=False, dropna=False, normalize=True)
print(fap2, '\n')

# Whether the father figure talked about school
print("Counts of whether the father figure talked about school")
fac3 = father['H1WP18H'].value_counts(sort=False, dropna=False)
print(fac3)
print("Percentage of whether the father figure talked about school")
fap3 = father['H1WP18H'].value_counts(sort=False, dropna=False, normalize=True)
print(fap3, '\n')

# Whether the father figure worked on a school project
print("Counts of whether the father figure worked on a school project")
fac4 = father['H1WP18I'].value_counts(sort=False, dropna=False)
print(fac4)
print("Percentage of whether the father figure worked on a school project")
fap4 = father['H1WP18I'].value_counts(sort=False, dropna=False, normalize=True)
print(fap4, '\n')

# Whether the father figure talked about other things the child does at school
print("Counts of whether the father figure talked about other things the child does at school")
fac5 = father['H1WP18J'].value_counts(sort=False, dropna=False)
print(fac5)
print("Percentage of whether the father figure talked about other things the child does at school")
fap5 = father['H1WP18J'].value_counts(sort=False, dropna=False, normalize=True)
print(fap5, '\n')
Output:
Mother-child relationship
Counts of how close the child feels to his mother figure
NaN             375
very little     156
quite a bit    1229
not at all       25
somewhat        480
very much      4239
Name: H1WP9, dtype: int64
Percentage of how close the child feels to his mother figure
NaN           0.057657
very little   0.023985
quite a bit   0.188961
not at all    0.003844
somewhat      0.073801
very much     0.651753
Name: H1WP9, dtype: float64

Counts of how much the child thinks she cares about him
NaN             374
very little      39
quite a bit     445
not at all       15
somewhat        127
very much      5504
Name: H1WP10, dtype: int64
Percentage of how much the child thinks she cares about him
NaN           0.057503
very little   0.005996
quite a bit   0.068419
not at all    0.002306
somewhat      0.019526
very much     0.846248
Name: H1WP10, dtype: float64

Counts of whether the mother figure talked about school
NaN     381
yes    3851
no     2272
Name: H1WP17H, dtype: int64
Percentage of whether the mother figure talked about school
NaN   0.058579
yes   0.592097
no    0.349323
Name: H1WP17H, dtype: float64

Counts of whether the mother figure worked on a school project
NaN     381
yes     807
no     5316
Name: H1WP17I, dtype: int64
Percentage of whether the mother figure worked on a school project
NaN   0.058579
yes   0.124077
no    0.817343
Name: H1WP17I, dtype: float64

Counts of whether the mother figure talked about other things the child does at school
NaN     381
yes    3184
no     2939
Name: H1WP17J, dtype: int64
Percentage of whether the mother figure talked about other things the child does at school
NaN   0.058579
yes   0.489545
no    0.451876
Name: H1WP17J, dtype: float64

Father-child relationship
Counts of how close the child feels to his father figure
NaN            1957
very little     184
quite a bit    1211
not at all       75
somewhat        610
very much      2467
Name: H1WP13, dtype: int64
Percentage of how close the child feels to his father figure
NaN           0.300892
very little   0.028290
quite a bit   0.186193
not at all    0.011531
somewhat      0.093788
very much     0.379305
Name: H1WP13, dtype: float64

Counts of how much the child thinks he cares about him
NaN            1957
very little      65
quite a bit     535
not at all       15
somewhat        180
very much      3752
Name: H1WP14, dtype: int64
Percentage of how much the child thinks he cares about him
NaN           0.300892
very little   0.009994
quite a bit   0.082257
not at all    0.002306
somewhat      0.027675
very much     0.576876
Name: H1WP14, dtype: float64

Counts of whether the father figure talked about school
NaN    1962
yes    2357
no     2185
Name: H1WP18H, dtype: int64
Percentage of whether the father figure talked about school
NaN   0.301661
yes   0.362392
no    0.335947
Name: H1WP18H, dtype: float64

Counts of whether the father figure worked on a school project
NaN    1962
yes     508
no     4034
Name: H1WP18I, dtype: int64
Percentage of whether the father figure worked on a school project
NaN   0.301661
yes   0.078106
no    0.620234
Name: H1WP18I, dtype: float64

Counts of whether the father figure talked about other things the child does at school
NaN    1962
yes    1988
no     2554
Name: H1WP18J, dtype: int64
Percentage of whether the father figure talked about other things the child does at school
NaN   0.301661
yes   0.305658
no    0.392681
Name: H1WP18J, dtype: float64
Explanation:
I added the values for the father-child relationship.
I subdivided my data into two datasets: one for the mother, one for the father.
I changed the numerical values to their true meaning.
I displayed those values as counts and percentages.
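The recoding and reporting steps above repeat the same pattern for every column; a hypothetical helper (the function name and constants are mine, not part of the assignment) could condense them:

import numpy as np

LIKERT = {1: 'not at all', 2: 'very little', 3: 'somewhat',
          4: 'quite a bit', 5: 'very much'}
YESNO = {0: 'no', 1: 'yes'}

def recode_and_report(frame, col, mapping, missing, label):
    """Replace missing-data codes with NaN, map codes to labels, print counts and percentages."""
    frame[col] = frame[col].replace(missing, np.nan).replace(mapping)
    print('Counts of', label)
    print(frame[col].value_counts(sort=False, dropna=False))
    print('Percentage of', label)
    print(frame[col].value_counts(sort=False, dropna=False, normalize=True), '\n')

# Example call, reusing the mother DataFrame from the code above
recode_and_report(mother, 'H1WP9', LIKERT, [6, 7, 8],
                  'how close the child feels to his mother figure')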
Data Visualization & Management
1st code
Code:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

# Load the dataset
csv = pd.read_csv('addhealth_pds.csv', low_memory=False)

# Bug fix for display formats to avoid run-time errors
pd.set_option('display.float_format', lambda x: '%f' % x)

# Exclude the values 'refused', 'legitimate skip', "don't know", and 'not applicable'
sub1 = csv[(csv['H1WP9'] != 6) & (csv['H1WP9'] != 7) & (csv['H1WP9'] != 8) & (csv['H1WP9'] != 9)]
sub2 = csv[(csv['H1WP10'] != 6) & (csv['H1WP10'] != 7) & (csv['H1WP10'] != 8) & (csv['H1WP10'] != 9)]
sub3 = csv[(csv['H1WP17D'] != 6) & (csv['H1WP17D'] != 7) & (csv['H1WP17D'] != 8) & (csv['H1WP17D'] != 9)]

# Convert data to numeric values
# Mother-child relationship
mo_num1 = pd.to_numeric(sub1['H1WP9'])    # How close the child feels to his mother figure
mo_num2 = pd.to_numeric(sub2['H1WP10'])   # How much the child thinks she cares about him
mo_num3 = pd.to_numeric(sub3['H1WP17D'])  # Did the mother figure talk about a personal relationship or an outside-of-school event

# How close the child feels to his mother figure
print('How close does the child feel to his mother figure')
moc1 = mo_num1.value_counts(sort=True)                  # By count
print(moc1)
mop1 = mo_num1.value_counts(sort=True, normalize=True)  # By percentage
print(mop1, '\n')

# How much the child thinks she cares about him
print('How much does the child think she cares about him')
moc2 = mo_num2.value_counts(sort=True)                  # By count
print(moc2)
mop2 = mo_num2.value_counts(sort=True, normalize=True)  # By percentage
print(mop2, '\n')

# Whether the mother figure talked about a personal relationship or an outside-of-school event
print('Did the mother figure talk about personal relationship or outside of school event')
moc3 = mo_num3.value_counts(sort=True)                  # By count
print(moc3)
mop3 = mo_num3.value_counts(sort=True, normalize=True)  # By percentage
print(mop3, '\n')
Output:
How close does the child feel to his mother figure
5    4239
4    1229
3     480
2     156
1      25
Name: H1WP9, dtype: int64
5   0.691630
4   0.200522
3   0.078316
2   0.025453
1   0.004079
Name: H1WP9, dtype: float64

How much does the child think she cares about him
5    5504
4     445
3     127
2      39
1      15
Name: H1WP10, dtype: int64
5   0.897879
4   0.072594
3   0.020718
2   0.006362
1   0.002447
Name: H1WP10, dtype: float64

Did the mother figure talk about personal relationship or outside of school event
0    3228
1    2895
Name: H1WP17D, dtype: int64
0   0.527193
1   0.472807
Name: H1WP17D, dtype: float64
Explanation:
For the 1st code I decided to compute counts for the first three variables I want to use. Because this code is pretty simple, there is nothing special to conclude from it, but we can see the values in the output.
We observe, with a clear majority, that the adolescents feel close to their mother figure (69%) and think that she cares about them (89%).
As for whether the mother talked about non-school-related things, the output is more balanced, with about 53% answering no and 47% yes.
Data Visualization Subject
I chose to work with the AddHealth codebook. After reviewing it and some thinking, I decided to focus on the well/ill-being of adolescents regarding their school participation and grades.
The main reason I am interested in this topic is that when a kid does not do well in school, the blame is usually put on the parents or the teacher. There are no data regarding the teacher aspect, so I will focus on the parents, and especially on how the relationship with the parents affects school grades.
For my second topic I chose to look at the friends circle and its effect on the school performance of adolescents. To be clear: does a child surrounded by loving friends do better at school than one who does not have any?
My questions are:
- How does the relationship between parents and child affect the child's school performance?
- How does the social circle of a child affect his school performance?
For me the first question is pretty obvious. The education of a child depends on the school system but also on the parents. Intuitively, if the parents do not stress to their child the importance of listening in school, there is a low probability that the child will. And according to the research I found, there are direct and indirect correlations between the two.
As for how friendship affects school performance, I did not find any free research online, but I think the influence of friends affects the decisions adolescents make, maybe even more than parents do, because at this stage of development humans usually listen more to the opinions and criticism of their peers than to those of their parents.
Sources:
The following source uses the parents' involvement to look for a connection with the adolescents' well-being in school on different aspects, which the authors label behavioral, emotional and cognitive.
They reached the conclusion that a child is most likely to do well in school if the parents encourage him to, and also if the child is himself invested in it.
https://www.tandfonline.com/doi/pdf/10.1080/19404476.2008.11462053?needAccess=true