razan-n-athamneh · 2 years

Machine Learning for Data Analysis — Week 4 Assignment

This covers my work submitted for the fourth week’s assignment of the Machine Learning for Data Analysis course. The goal of this assignment was to practice k-means cluster analysis.

For this analysis I used the Boston House Price dataset, freely available from Machine Learning Mastery. This dataset involves predicting the price of a house in thousands of dollars given details of the house and its neighborhood. The variables present in the dataset are:

CRIM: per capita crime rate by town.

ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS: proportion of nonretail business acres per town.

CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

NOX: nitric oxides concentration (parts per 10 million).

RM: average number of rooms per dwelling.

AGE: proportion of owner-occupied units built prior to 1940.

DIS: weighted distances to five Boston employment centers.

RAD: index of accessibility to radial highways.

TAX: full-value property-tax rate per $10,000.

PTRATIO: pupil-teacher ratio by town.

B: 1000(Bk - 0.63)² where Bk is the proportion of blacks by town.

LSTAT: % lower status of the population.

MEDV: Median value of owner-occupied homes in $1000s.

For my analysis I used SAS, as that is the software I’m most familiar with and use most in my day-to-day role. Following the examples in the course, I began by cleaning the data and removing all observations with missing values. I then split the data into training and test sets via simple random sampling (70% training), and standardised the clustering variables in the training set to a mean of 0 and a standard deviation of 1. Next, I created a kmean macro around the fastclus procedure, which accepts the number of clusters k as a parameter, and ran it for k = 1 to 9. Finally, I plotted the elbow curve of the r-squared values for each value of k, which yielded the following:

The plot showed noticeable bends at 2, 5 and 7 clusters. For the rest of this analysis, I focused on the k-means solution with five clusters.
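The r-squared value on the elbow curve can be read as the share of total variance explained by the cluster assignment. Assuming the usual definition (the notation below is introduced only for illustration and does not appear in the SAS output), for a solution with k clusters:

$$
R^2_k = 1 - \frac{\sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \bar{x}_j \rVert^2}{\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}
$$

Here $C_j$ is the set of observations in cluster $j$, $\bar{x}_j$ is its centroid, and $\bar{x}$ is the overall mean of the standardised variables. With k = 1 the numerator equals the denominator, so the curve starts at 0. It rises as clusters are added, and a bend marks the point where extra clusters stop explaining much additional variance.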

Using canonical discriminant analysis, I reduced the dimensions of the dataset so that the result could be plotted. Below are the results of the reduction procedure:

Using this, we can plot the first two canonical variables as a scatterplot:

This shows that observations within clusters one and three are densely packed, so within-cluster variance is low. Observations in cluster two are more spread out, but the cluster is relatively distinct. Cluster four is very spread out, and cluster five contains only a handful of observations. This suggests that the best clustering solution may contain fewer than five clusters.

Below are the results of the fastclus procedure for the five-cluster solution, which were created by the macro described earlier:

As an example, clusters one and three share similar levels of crime rate, but differ in the proportion of land zoned for lots over 25,000 sq.ft., which is higher for cluster one.

Now let’s see how this impacts house prices using the ANOVA procedure and producing boxplots. Note that I’ve excluded the fifth cluster from this analysis, as there were so few observations in that cluster:

We can see that clusters two and four contain, on average, the less costly houses, whereas clusters one and three contain the more expensive ones.
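The boxplots can also be drawn directly with sgplot instead of relying on the ODS graphics output of the ANOVA procedure. This is a minimal sketch, assuming the merged dataset built in the code listing (with the variables cluster and MEDV); the axis label is my own wording.

```sas
/* Sketch: side-by-side boxplots of median home value by cluster.   */
/* Assumes a dataset named merged with variables cluster and MEDV.  */
proc sgplot data=merged;
  /* drop the sparse fifth cluster; no effect if merged holds only four clusters */
  where cluster ne 5;
  vbox MEDV / category=cluster;
  yaxis label="Median home value ($1000s)";
run;
```

The where statement filters the rows for this plot only, so the merged dataset itself is left unchanged.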

Code used:

    data clust;
      set housingnew;
      * create a unique identifier to merge cluster assignment variable with the main data set;
      idnum=_n_;
      keep idnum CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV;
      * delete observations with missing data;
      if cmiss(of _all_) then delete;
    run;

    ods graphics on;

    * Split data randomly into test and training data;
    proc surveyselect data=clust out=traintest seed=123 samprate=0.7 method=srs outall;
    run;

    data clus_train; set traintest; if selected=1; run;
    data clus_test;  set traintest; if selected=0; run;

    * standardize the clustering variables to have a mean of 0 and standard deviation of 1;
    proc standard data=clus_train out=clustvar mean=0 std=1;
      var CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT;
    run;

    * run k-means for a given number of clusters K;
    %macro kmean(K);
      proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters=&K. maxiter=300;
        var CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT;
      run;
    %mend;

    %kmean(1); %kmean(2); %kmean(3); %kmean(4); %kmean(5);
    %kmean(6); %kmean(7); %kmean(8); %kmean(9);

    * extract r-square values from each cluster solution and then merge them to plot elbow curve;
    data clus1; set cluststat1; nclust=1; if _type_='RSQ'; keep nclust over_all; run;
    data clus2; set cluststat2; nclust=2; if _type_='RSQ'; keep nclust over_all; run;
    data clus3; set cluststat3; nclust=3; if _type_='RSQ'; keep nclust over_all; run;
    data clus4; set cluststat4; nclust=4; if _type_='RSQ'; keep nclust over_all; run;
    data clus5; set cluststat5; nclust=5; if _type_='RSQ'; keep nclust over_all; run;
    data clus6; set cluststat6; nclust=6; if _type_='RSQ'; keep nclust over_all; run;
    data clus7; set cluststat7; nclust=7; if _type_='RSQ'; keep nclust over_all; run;
    data clus8; set cluststat8; nclust=8; if _type_='RSQ'; keep nclust over_all; run;
    data clus9; set cluststat9; nclust=9; if _type_='RSQ'; keep nclust over_all; run;

    data clusrsquare;
      set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9;
    run;

    * plot elbow curve using r-square values;
    symbol1 color=blue interpol=join;
    proc gplot data=clusrsquare;
      plot over_all*nclust;
    run;

    * further examine cluster solution for the number of clusters suggested by the elbow curve;

    * plot clusters for 5 cluster solution;
    proc candisc data=outdata5 out=clustcan;
      class cluster;
      var CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT;
    run;

    proc sgplot data=clustcan;
      scatter y=can2 x=can1 / group=cluster;
    run;

    * merge MEDV with the cluster assignments by idnum (note: outdata4 is the four-cluster solution);
    data MEDV_data; set clus_train; keep idnum MEDV; run;

    proc sort data=outdata4; by idnum; run;
    proc sort data=MEDV_data; by idnum; run;

    data merged; merge outdata4 MEDV_data; by idnum; run;

    proc sort data=merged; by cluster; run;

    * summary statistics of MEDV within each cluster;
    proc means data=merged;
      var MEDV;
      by cluster;
    run;

    * one-way ANOVA of MEDV across clusters with Tukey post hoc comparisons;
    proc anova data=merged;
      class cluster;
      model MEDV = cluster;
      means cluster/tukey;
    run;
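The nine near-identical fastclus calls and data steps above could be collapsed into a single macro loop. This is a sketch only, assuming the same clustvar dataset and variable list; the macro name elbow and its maxk parameter are hypothetical names introduced here, not part of the course code.

```sas
/* Sketch: run k-means for k = 1..maxk and stack the overall r-square values. */
/* elbow and maxk are illustrative names; clustvar is the standardised data.  */
%macro elbow(maxk);
  %local k;
  %do k = 1 %to &maxk.;
    /* k-means for this value of k, with the same options as the kmean macro */
    proc fastclus data=clustvar out=outdata&k. outstat=cluststat&k.
                  maxclusters=&k. maxiter=300;
      var CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT;
    run;

    /* keep only the r-square row and tag it with the number of clusters */
    data clus&k.;
      set cluststat&k.;
      nclust = &k.;
      if _type_ = 'RSQ';
      keep nclust over_all;
    run;
  %end;

  /* stack the per-k rows into one dataset for the elbow plot */
  data clusrsquare;
    set %do k = 1 %to &maxk.; clus&k. %end; ;
  run;
%mend elbow;

%elbow(9);
```

The call %elbow(9) should leave the same outdata1 to outdata9 and clusrsquare datasets that the rest of the listing expects, so the gplot and candisc steps can follow unchanged.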
