ML for Data Analysis - Week 1 Assignment
Figure: Age - Decision Tree for Regular Smoking (Run 2)
* The Run 1 decision tree is not shown here because it was grown on all 24 explanatory variables, and its splits did not isolate a specific nonlinear relationship among the variables.
Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. 
All possible separations (for categorical variables) or cut points (for quantitative variables) are tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree, and a cost-complexity algorithm was used for pruning the full tree into a final subtree.
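The assignment code further below calls DecisionTreeClassifier() with its defaults (Gini criterion, no pruning); the following is only a minimal sketch of how the entropy criterion and cost-complexity pruning can be requested in scikit-learn, reusing the pred_train/pred_test/tar_train/tar_test names created later in the code.

from sklearn.tree import DecisionTreeClassifier

# Grow the full tree with the entropy "goodness of split" criterion
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
full_tree = full_tree.fit(pred_train, tar_train)

# Cost-complexity pruning: larger ccp_alpha values prune the full tree into
# smaller subtrees; the path lists the candidate alpha values. The choice of
# alpha below is purely illustrative; in practice it would be tuned, e.g. by
# cross-validation.
path = full_tree.cost_complexity_pruning_path(pred_train, tar_train)
pruned_tree = DecisionTreeClassifier(criterion='entropy',
                                     ccp_alpha=path.ccp_alphas[-2],
                                     random_state=0)
pruned_tree = pruned_tree.fit(pred_train, tar_train)
print(pruned_tree.score(pred_test, tar_test))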
For Run 1, all 24 explanatory variables were included as possible contributors to a classification tree model evaluating regular smoking (the response variable): age, gender, race/ethnicity (Hispanic, White, Black, Native American, and Asian), alcohol use, marijuana use, cocaine use, inhalant use, availability of cigarettes in the home, whether or not either parent was on public assistance, any experience with being expelled from school, alcohol problems, deviance, violence, depression, self-esteem, parental presence, parental activities, family connectedness, school connectedness, and grade point average.
The deviance score was the first variable to separate the sample into two subgroups. Adolescents with a deviance score greater than 0.112 (range 0 to 2.8; M = 0.13, SD = 0.209) were more likely to have experimented with smoking than adolescents not meeting this cutoff (18.6% vs. 11.2%).
Of the adolescents with deviance scores less than or equal to 0.112, a further subdivision was made on the dichotomous variable of alcohol use without supervision. Adolescents who reported having used alcohol without supervision were more likely to have experimented with smoking, while adolescents with a deviance score less than or equal to 0.112 who had never drunk alcohol were less likely to have experimented with smoking.
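The split variables and cut points described above can be read off a fitted scikit-learn tree as text; a small sketch, assuming the classifier and features objects defined in the Run 1 code below:

from sklearn.tree import export_text

# Each line of the text dump names the variable and cut point used at a split
# (e.g. "DEVIANT1 <= 0.11") and the leaves show the predicted class
print(export_text(classifier, feature_names=list(features.columns)))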
The total model of Run 1 classified 79% of the sample correctly; the confusion matrix is below.
[[988, 154], [116, 115]]
For Run 2, the following variables were used: age, alcohol use, marijuana use, cocaine use, availability of cigarettes in the home, and grade point average.
The total model of Run 2 classified 78% of the sample correctly; the confusion matrix is below.
[[1308, 211], [186, 125]]
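As a quick check, the reported accuracy can be recovered from the confusion matrix (scikit-learn puts the actual classes on the rows and the predicted classes on the columns, so the diagonal holds the correct classifications):

cm = [[1308, 211], [186, 125]]
accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)
print(round(accuracy, 2))   # (1308 + 125) / 1830 = 0.78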
The code is included below.
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 30, 17:15:34 2020
@author: Jayateerth Kulkarni

Python code for Classification using Decision Trees
Week 1 Assignment - ML for Data Analysis
"""

"""
OBJECTIVE - Decision tree analysis to test nonlinear relationships among a
list of explanatory variables (categorical & quantitative) to classify a
categorical response variable.
"""
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt

# sklearn.cross_validation, used in earlier course material, is no longer
# available, so train_test_split is imported from sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
os.chdir("D:\ANALYTICS (DS, DAS ETC.)\R_Python\Python")
#Load the dataset
raw_data = pd.read_csv("tree_addhealth.csv")

# dropna() drops rows with missing values from the data frame
data_clean = raw_data.dropna()

# dtypes shows the data type of each column
data_clean.dtypes

# describe() shows a summary of the dataset
data_clean.describe()
""" RUN-1 - with all variables""" #Split into training and testing sets
features = data_clean[['BIO_SEX', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN',
                       'age', 'ALCEVR1', 'ALCPROBS1', 'marever1', 'cocever1', 'inhever1',
                       'cigavail', 'DEP1', 'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1',
                       'SCHCONN1', 'GPA1', 'EXPEL1', 'FAMCONCT', 'PARACTV', 'PARPRES']]
response = data_clean.TREG1
pred_train, pred_test, tar_train, tar_test = train_test_split(features, response, test_size=.3)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
# Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
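# Optional: classification_report (imported above but otherwise unused) adds
# per-class precision and recall to the overall accuracy reported above
print(classification_report(tar_test, predictions))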
# Displaying the decision tree
from sklearn import tree
# from StringIO import StringIO   # Python 2 only; StringIO now lives in io
from io import StringIO
from IPython.display import Image

out = StringIO()
tree.export_graphviz(classifier, out_file=out)

# if the pydotplus module is not found, install it with:
#   conda install -c conda-forge pydotplus
import pydotplus
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
graph.write_pdf("graph.pdf")
""" RUN-2 - with limited variables""" #Split into training and testing sets
features2 = data_clean[['age','ALCEVR1','marever1','cocever1', 'cigavail','GPA1']]
response2 = data_clean.TREG1
pred_train, pred_test, tar_train, tar_test = train_test_split(features2, response2, test_size=.4)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
# Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test, predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
# Displaying the decision tree for Run 2 (tree, StringIO, Image and pydotplus
# were already imported for Run 1 above)
out2 = StringIO()
tree.export_graphviz(classifier, out_file=out2)
graph2 = pydotplus.graph_from_dot_data(out2.getvalue())
Image(graph2.create_png())
graph2.write_pdf("graph2.pdf")   # separate file name so the Run 1 graph.pdf is not overwritten