House Price Classification (1-6) using Lasso Regression
House Price Classification
The objective is to classify the price of a house with the given parameters into the following price classes:
1 = <$250,000
2 = $250,000-$500,000
3 = $500,000-$750,000
4 = $750,000-$1,000,000
5 = $1,000,000-$1,250,000
6 = >$1,250,000
and identify the various parameters responsible for the pricing.
The data is for various parameters (predictors) like number of bedrooms, sqft living area, number of bathrooms,
whether the house has a view or not etc
(from online source) is fed into the classification regression model.
Target variable is the price_class of the house.
Each variable was standardised to have a mean of 0 and a standard deviation of 1 while pre-processing.
A k=10 fold cross-validation technique was used to test the model.
Code
#########################
#from pandas import Series, DataFrame
import os
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
os.chdir("A:\ML Coursera")
#Load the dataset
data = pd.read_csv("home_data4.csv")
#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
# Data Management
data_clean = data.dropna()
#select predictor variables and target variable as separate data sets
predvar= data_clean[['BEDROOMS', 'BATHROOMS', 'SQFT_LIVING',
'SQFT_LOT', 'FLOORS', 'WATERFRONT',
'VIEW', 'CONDITION', 'GRADE',
'SQFT_ABOVE', 'SQFT_BASEMENT', 'YR_BUILT'
]]
target = data_clean.PRICE_CLASS
# standardize predictors to have mean=0 and sd=1
predictors=predvar.copy()
from sklearn import preprocessing
predictors['BEDROOMS']=preprocessing.scale(predictors['BEDROOMS'].astype('float64'))
predictors['BATHROOMS']=preprocessing.scale(predictors['BATHROOMS'].astype('float64'))
predictors['SQFT_LIVING']=preprocessing.scale(predictors['SQFT_LIVING'].astype('float64'))
predictors['SQFT_LOT']=preprocessing.scale(predictors['SQFT_LOT'].astype('float64'))
predictors['FLOORS']=preprocessing.scale(predictors['FLOORS'].astype('float64'))
predictors['WATERFRONT']=preprocessing.scale(predictors['WATERFRONT'].astype('float64'))
predictors['VIEW']=preprocessing.scale(predictors['VIEW'].astype('float64'))
predictors['CONDITION']=preprocessing.scale(predictors['CONDITION'].astype('float64'))
predictors['GRADE']=preprocessing.scale(predictors['GRADE'].astype('float64'))
predictors['SQFT_ABOVE']=preprocessing.scale(predictors['SQFT_ABOVE'].astype('float64'))
predictors['SQFT_BASEMENT']=preprocessing.scale(predictors['SQFT_BASEMENT'].astype('float64'))
predictors['YR_BUILT']=preprocessing.scale(predictors['YR_BUILT'].astype('float64'))
predictors.head()
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
test_size=.3, random_state=123)
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)
# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)
#########################
Output:
{'BEDROOMS': -0.051115370744155196,
'BATHROOMS': 0.11210926212354395,
'SQFT_LIVING': 0.32651387620252414,
'SQFT_LOT': -0.00865938319608584,
'FLOORS': 0.08635931610792122,
'WATERFRONT': 0.059669258580336657,
'VIEW': 0.11322348584018982,
'CONDITION': 0.0541793720936317,
'GRADE': 0.5706466464368843,
'SQFT_ABOVE': 0.0,
'SQFT_BASEMENT': 0.02533279918612259,
'YR_BUILT': -0.3439055814902168}
training data MSE
0.4905109604220459
test data MSE
0.48521148751529647
training data R-square
0.6247653438966141
test data R-square
0.6385205011430025
Interpretation:
As we can see from the model output, some of the parameters' coefficients has been shrunk to very low values (like Sqft_above and Sqft_lot)
indicating that these are not particularly influential in the pricing of the house.
Sqft_living, Grade (describing the quality of the material) and the year built have the highest coefficients indicating that these are the most
influential parameters. This can be seen in the plot of regression coefficients progression for Lasso paths.
The training R2 and the test R2 are 0.624 and 0.638 respectively, which are decent but could use some improvement. Further variables could be added to the dataset
that are likely to affect the pricing of the house. The consistency of the R2 in both cases shows that there is not over-fitting in the model.
0 notes
Suicide Rate - K Means Cluster Analysis
The objective of the exercise is to form clusters, using K Means cluster analysis,for different regions using certain demographic and other stats like income per person, urbanisation rate,
female employment rate etc (Data from an Online source) and then use the clusters to analyse the suicide rate of the region.
############Code####################
# -*- coding: utf-8 -*-
"""
Created on Sun Mar 24 10:02:19 2019
@author: Sandeep
"""
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
import os
os.chdir("A:/ML Coursera/Machine Learning for Data Analysis")
"""
Data Management
"""
data = pd.read_csv("Suicide_rate2.csv")
#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
# Data Management
data_clean = data.dropna()
# subset clustering variables
cluster=data_clean[['INCOMEPERPERSON', 'ALCCONSUMPTION',
'ARMEDFORCESRATE', 'BREASTCANCERPER100TH',
'CO2EMISSIONS', 'FEMALEEMPLOYRATE', 'INTERNETUSERATE',
'LIFEEXPECTANCY', 'URBANRATE']]
cluster.describe()
# standardize clustering variables to have mean=0 and sd=1
clustervar=cluster.copy()
#clustervar['ALCEVR1']=preprocessing.scale(clustervar['ALCEVR1'].astype('float64'))
#clustervar['SUICIDEPER100TH']=preprocessing.scale(clustervar['SUICIDEPER100TH'].astype('float64'))
clustervar['INCOMEPERPERSON']=preprocessing.scale(clustervar['INCOMEPERPERSON'].astype('float64'))
clustervar['ALCCONSUMPTION']=preprocessing.scale(clustervar['ALCCONSUMPTION'].astype('float64'))
clustervar['ARMEDFORCESRATE']=preprocessing.scale(clustervar['ARMEDFORCESRATE'].astype('float64'))
clustervar['BREASTCANCERPER100TH']=preprocessing.scale(clustervar['BREASTCANCERPER100TH'].astype('float64'))
clustervar['CO2EMISSIONS']=preprocessing.scale(clustervar['CO2EMISSIONS'].astype('float64'))
clustervar['FEMALEEMPLOYRATE']=preprocessing.scale(clustervar['FEMALEEMPLOYRATE'].astype('float64'))
clustervar['INTERNETUSERATE']=preprocessing.scale(clustervar['INTERNETUSERATE'].astype('float64'))
clustervar['LIFEEXPECTANCY']=preprocessing.scale(clustervar['LIFEEXPECTANCY'].astype('float64'))
clustervar['URBANRATE']=preprocessing.scale(clustervar['URBANRATE'].astype('float64'))
# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]
for k in clusters:
model=KMeans(n_clusters=k)
model.fit(clus_train)
clusassign=model.predict(clus_train)
meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
/ clus_train.shape[0])
"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
# Interpret 4 cluster solution
model4=KMeans(n_clusters=4)
model4.fit(clus_train)
clusassign=model4.predict(clus_train)
# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model4.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 4 Clusters')
plt.show()
# Interpret 2 cluster solution
model2=KMeans(n_clusters=2)
model2.fit(clus_train)
clusassign=model2.predict(clus_train)
# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model2.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 2 Clusters')
plt.show()
#We proceed with the 4-cluster model after looking at the plots (attached)
"""
BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist=list(clus_train['index'])
# create a list of cluster assignments
labels=list(model4.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist=dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']
# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the
# cluster assignment dataframe
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()
"""
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)
# validate clusters in training data by examining cluster differences in GPA using ANOVA
# first have to merge SUICIDEPER100TH with clustering variables and cluster assignment data
suicide_data=data_clean['SUICIDEPER100TH']
# split suicide data into train and test sets
suicide_train, suicide_test = train_test_split(suicide_data, test_size=.3, random_state=123)
suicide_train1=pd.DataFrame(suicide_train)
suicide_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(suicide_train1, merged_train, on='index')
#sub1 = merged_train_all[['SUICIDEPER100TH', 'cluster']].dropna()
sub1 = merged_train_all[['SUICIDEPER100TH', 'cluster']].dropna()
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
suicidemod = smf.ols(formula='SUICIDEPER100TH ~ C(cluster)', data=sub1).fit()
print (suicidemod.summary())
print ('means for suicide by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)
print ('standard deviations for suicide by cluster')
m2= sub1.groupby('cluster').std()
print (m2)
mc1 = multi.MultiComparison(sub1['SUICIDEPER100TH'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
#############Code End#################
Results:
Clustering variable means by cluster
index INCOMEPERPERSON ... LIFEEXPECTANCY URBANRATE
cluster ...
0 89.888889 -0.240612 ... 0.227263 0.135969
1 67.942857 -0.228249 ... 0.475045 0.395141
2 82.750000 -0.611545 ... -1.098212 -0.883506
3 76.000000 1.923554 ... 1.159696 0.871168
print (suicidemod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SUICIDEPER100TH R-squared: 0.063
Model: OLS Adj. R-squared: 0.036
Method: Least Squares F-statistic: 2.315
Date: Sun, 24 Mar 2019 Prob (F-statistic): 0.0802
Time: 12:24:10 Log-Likelihood: -342.48
No. Observations: 107 AIC: 693.0
Df Residuals: 103 BIC: 703.6
Df Model: 3
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 6.5936 1.427 4.620 0.000 3.763 9.424
C(cluster)[T.1] 4.3384 1.756 2.470 0.015 0.856 7.821
C(cluster)[T.2] 3.3438 1.718 1.946 0.054 -0.064 6.752
C(cluster)[T.3] 4.5218 2.158 2.096 0.039 0.243 8.801
==============================================================================
Omnibus: 34.985 Durbin-Watson: 2.146
Prob(Omnibus): 0.000 Jarque-Bera (JB): 66.414
Skew: 1.358 Prob(JB): 3.79e-15
Kurtosis: 5.742 Cond. No. 5.93
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print ('means for suicide by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)
means for suicide by cluster
SUICIDEPER100TH
cluster
0 6.593638
1 10.932016
2 9.937488
3 11.115407
print ('standard deviations for suicide by cluster')
m2= sub1.groupby('cluster').std()
print (m2)
standard deviations for suicide by cluster
SUICIDEPER100TH
cluster
0 6.835408
1 8.263957
2 3.310233
3 4.226248
mc1 = multi.MultiComparison(sub1['SUICIDEPER100TH'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff lower upper reject
---------------------------------------------
0 1 4.3384 -0.2479 8.9247 False
0 2 3.3438 -1.144 7.8317 False
0 3 4.5218 -1.1129 10.1565 False
1 2 -0.9945 -4.6544 2.6653 False
1 3 0.1834 -4.8169 5.1837 False
2 3 1.1779 -3.7323 6.0881 False
---------------------------------------------
Interpretation:
A k-means clustering algorithm was run as explained. Using the elbow curve method we used a four cluster solution to interpret and analyse the results. Tukey test was used to compare the mean for the four clusters created and ANOVA was used for comparing the variance.
The results show that both mean and std deviation are significantly different for the suicide rate for the four different clusters.
0 notes
House Price Classification - Lasso
House Price Classification
The objective is to classify the price of a house with the given parameters as 0(Price <$500,000)
and 1(Price>=$500000) and identify the various parameters responsible for the pricing.
The data is for various parameters (predictors) like number of bedrooms, sqft living area, number of bathrooms,
whether the house has a view or not etc
(from online source) is fed into the classification regression model.
Target variable is the price_class of the house.
Code
#########################
#from pandas import Series, DataFrame
import os
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
os.chdir("A:\ML Coursera")
#Load the dataset
data = pd.read_csv("home_data3.csv")
#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)
# Data Management
data_clean = data.dropna()
#select predictor variables and target variable as separate data sets
predvar= data_clean[['BEDROOMS', 'BATHROOMS', 'SQFT_LIVING',
'SQFT_LOT', 'FLOORS', 'WATERFRONT', 'VIEW',
'CONDITION', 'GRADE', 'SQFT_ABOVE',
'SQFT_BASEMENT', 'YR_BUILT', 'YR_RENOVATED'
]]
target = data_clean.PRICE_CLASS
# standardize predictors to have mean=0 and sd=1
predictors=predvar.copy()
from sklearn import preprocessing
predictors['BEDROOMS']=preprocessing.scale(predictors['BEDROOMS'].astype('float64'))
predictors['BATHROOMS']=preprocessing.scale(predictors['BATHROOMS'].astype('float64'))
predictors['SQFT_LIVING']=preprocessing.scale(predictors['SQFT_LIVING'].astype('float64'))
predictors['SQFT_LOT']=preprocessing.scale(predictors['SQFT_LOT'].astype('float64'))
predictors['FLOORS']=preprocessing.scale(predictors['FLOORS'].astype('float64'))
predictors['WATERFRONT']=preprocessing.scale(predictors['WATERFRONT'].astype('float64'))
predictors['VIEW']=preprocessing.scale(predictors['VIEW'].astype('float64'))
predictors['CONDITION']=preprocessing.scale(predictors['CONDITION'].astype('float64'))
predictors['GRADE']=preprocessing.scale(predictors['GRADE'].astype('float64'))
predictors['SQFT_ABOVE']=preprocessing.scale(predictors['SQFT_ABOVE'].astype('float64'))
predictors['SQFT_BASEMENT']=preprocessing.scale(predictors['SQFT_BASEMENT'].astype('float64'))
predictors['YR_BUILT']=preprocessing.scale(predictors['YR_BUILT'].astype('float64'))
predictors['YR_RENOVATED']=preprocessing.scale(predictors['YR_RENOVATED'].astype('float64'))
predictors.head()
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target,
test_size=.3, random_state=123)
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)
# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)
#########################
Output:
{'BEDROOMS': -0.00042543529151601797,
'BATHROOMS': 0.02735333928831517,
'SQFT_LIVING': 0.08572220222236561,
'SQFT_LOT': 0.007998695399789518,
'FLOORS': 0.04744203874354188,
'WATERFRONT': 0.0015738192746130087,
'VIEW': 0.0171052057946378,
'CONDITION': 0.022601962456814437,
'GRADE': 0.22293970415978565,
'SQFT_ABOVE': 0.0,
'SQFT_BASEMENT': 0.014998665998908652,
'YR_BUILT': -0.14130408055369761,
'YR_RENOVATED': -0.003021277379032161}
training data MSE
0.14746511823540823
test data MSE
0.14729803797546054
training data R-square
0.39345776492926426
test data R-square
0.39651086088327553
The test R-square value is around 40%, which is quite low.
0 notes
Random Forest - House Price Classification
The objective is to classify the price of a house with the given parameters as Low (Price <$500,000)
and High (Price>=$500000) and identify the various parameters responsible for the pricing.
The data is for various parameters (predictors) like number of bedrooms, sqft living area, number of bathrooms,
whether the house has a view or not etc
(from online source) is fed into the random forest algorithm.
Target variable is the price_class of the house.
Code
#########################
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
os.chdir("A:\ML Coursera")
#Load the dataset
AH_data = pd.read_csv("home_data2.csv")
data_clean = AH_data.dropna()
data_clean.dtypes
data_clean.describe()
#Split into training and testing sets
predictors = data_clean[['bedrooms', 'bathrooms', 'sqft_living',
'sqft_lot', 'floors', 'waterfront', 'view' ,
'sqft_above', 'sqft_basement', 'yr_built']]
targets = data_clean.price_class
targets.head()
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3)
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape
#Build model on training data
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10)
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
"""
Running a different number of trees and see the effect
of that on the accuracy of the prediction
"""
trees=range(25)
accuracy=np.zeros(25)
for idx in range(len(trees)):
classifier=RandomForestClassifier(n_estimators=idx + 1)
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees, accuracy)
###############
Output Summary:
Accuracy: 0.7830043183220234
Feature OOB Gini
'bedrooms', 0.05573633
'bathrooms', 0.08098398
'sqft_living', 0.20783781
'sqft_lot', 0.14737421
'floors', 0.04302802
'waterfront', 0.00146581
'view' 0.0441065
'sqft_above', 0.18545935
'sqft_basement', 0.0822214
'yr_built' 0.15178659
sklearn.metrics.confusion_matrix(tar_test,predictions)
array([[2045, 702],
[ 659, 3078]], dtype=int64)
The accuracy of predictions as per the matrix is 79%.
0 notes
House Price Classification
House Price Classification
The objective is to classify the price of a house with the given parameters as Low (Price <$500,000)
and High (Price>=$500000).
The data is for various parameters (predictors) like number of bedrooms, sqft living area, number of bathrooms, whether the house has a view or not etc
(from online source) is fed into the decision tree.
Target variable is the price_class of the house.
#Code
###############################
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
import pydotplus
os.chdir("A:\ML Coursera")
#Load the dataset
home_data = pd.read_csv("home_data2.csv")
data_clean = home_data.dropna()
"""
Modeling and Prediction
"""
#Split into training and testing sets
predictors = data_clean[['bedrooms', 'bathrooms',
'sqft_living', 'sqft_lot', 'floors', 'waterfront'
, 'view', 'sqft_above', 'sqft_basement', 'yr_built']]
targets = data_clean.price_class
targets.shape
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)
predictions=classifier.predict(pred_test)
sklearn.metrics.confusion_matrix(tar_test,predictions)
sklearn.metrics.accuracy_score(tar_test, predictions)
#Displaying the decision tree
from sklearn import tree
#from StringIO import StringIO
from io import StringIO
#from StringIO import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
out = StringIO()
tree.export_graphviz(classifier, out_file=dotfile)
pydotplus.graph_from_dot_data(out.getvalue()).write_png(“dtree1.png”)
graph=pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
from IPython.display import Image
Image(filename = 'tree_limited.png')
#################################
Output:
Confusion Matrix:
array([[2511, 1174],
[1230, 3731]], dtype=int64)
sklearn.metrics.accuracy_score(tar_test, predictions)
Out[8]: 0.7219523479065464
This shows that the accuracy of classification is 72%, which is pretty decent.
Summary:
Hence, the model was able to predict whether the house is low-priced or high-priced with an accuracy of 72%. An improvement in accuracy can be aimed for
if we try to add more variables related to the same to the dataset and observe how they influence the model.
0 notes
Get busy living or get busy dying
Shawshank Redemption (1994)
0 notes
Getting comfortable with Machine Learning
Having used Machine Learning tools quite rarely in the past, I am trying to learn the concepts a bit more formally and with a better structure. I am comfortable with programming, although a little new to python. But have completed a course before, online.
1 note
·
View note