#Running a k-means Cluster Analysis
apoorva-week4 · 1 year
Text
Week 4: Peer-graded Assignment: Running a k-means Cluster Analysis
This assignment is for the Coursera course "Machine Learning for Data Analysis" by Wesleyan University.
It covers "Week 4: Peer-graded Assignment: Running a k-means Cluster Analysis".
I am working on k-means Cluster Analysis in Python.
1. Syntax used to run k-means Cluster Analysis
A k-means cluster analysis was conducted to identify underlying subgroups of real machine parameters based on their similarity of responses on 19 variables that represent characteristics that could have an impact on product yield loss. Clustering variables included only quantitative variables measuring different machine parameters. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1. Data were randomly split into a training set that included 70% of the observations (N=116) and a test set that included 30% of the observations (N=50). A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
2. Code used to run k-means Cluster Analysis
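Since the original code was shared only as screenshots, the following is a minimal Python sketch of the workflow described above. The file name and the 19 machine-parameter column names are placeholders, and the elbow statistic here is the average distance to the nearest centroid rather than the r-square reported in the write-up.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# hypothetical file; the first 19 numeric columns stand in for the real clustering variables
data = pd.read_csv('machine_parameters.csv').dropna()
clustervar = scale(data.iloc[:, :19].astype('float64'))   # standardize to mean 0, sd 1

# 70% training / 30% test split
clus_train, clus_test = train_test_split(clustervar, test_size=0.3, random_state=123)

# k-means for k = 1..9 with Euclidean distance, tracking mean distance to the nearest centroid
ks = range(1, 10)
meandist = []
for k in ks:
    model = KMeans(n_clusters=k, random_state=123).fit(clus_train)
    meandist.append(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1).mean())

# elbow curve used to choose the number of clusters to interpret
plt.plot(ks, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance to centroid')
plt.title('Selecting k with the elbow method')
plt.show()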
3. Corresponding Output
Figure 1. Elbow curve of r-square values for the nine cluster solutions
Figure 2. Plot of the first two canonical variables for the clustering variables by cluster.
4. Interpretation
For Figure 1: The elbow curve was inconclusive, suggesting that the 2, 4 and 8-cluster solutions might be interpreted. All three were tested, yielding [F-statistic and Prob (F-statistic)] of [0.5298, 0.469], [6.242, 0.000725] and [3.73, 0.00156], respectively. The results below are for an interpretation of the 4-cluster solution (highest F-statistic and lowest Prob).
Canonical discriminant analysis was used to reduce the 19 clustering variables down to a few variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster indicated that the observations in cluster 1 were densely packed with relatively low within-cluster variance and did not overlap very much with the other clusters. Cluster 2 was generally distinct, but the observations had greater spread, suggesting higher within-cluster variance. Observations in clusters 3 and 4 were spread out more than the other clusters, showing high within-cluster variance. The results of this plot suggest that the best cluster solution may have fewer than 4 clusters, so it will be especially important to also evaluate the cluster solutions with fewer than 4 clusters.
For Figure 2: In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on product failure rates (BINS_SUM).
A Tukey test was used for post hoc comparisons between the clusters. Results indicated some significant differences between the clusters on BINS_SUM (F(3, 85) = 6.242, p < .0001). The Tukey post hoc comparisons showed significant differences on BINS_SUM between cluster 4 and clusters 1 and 3, while the remaining cluster pairs did not differ significantly from each other. Samples in cluster 4 had the lowest BINS_SUM (mean=60.28, sd=10.89), and cluster 1 had the highest BINS_SUM (mean=76.35, sd=13.08).
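The code for this validation step was likewise shared only as screenshots; a rough Python sketch follows, assuming a dataframe merged_train that already holds the BINS_SUM outcome and a 'cluster' assignment for each training observation (the dataframe and the 'cluster' column name are assumptions, BINS_SUM is taken from the write-up).

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# ANOVA: does BINS_SUM differ across the k-means clusters?
bins_mod = smf.ols(formula='BINS_SUM ~ C(cluster)', data=merged_train).fit()
print(bins_mod.summary())                                   # F-statistic and Prob (F-statistic)

# means and standard deviations of BINS_SUM by cluster
print(merged_train.groupby('cluster')['BINS_SUM'].agg(['mean', 'std']))

# Tukey HSD post hoc comparisons between clusters
tukey = multi.MultiComparison(merged_train['BINS_SUM'], merged_train['cluster']).tukeyhsd()
print(tukey.summary())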
shihab1992 · 2 years
Text
Week 4: Running a k-means Cluster Analysis
import pandas
import statistics
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
from sklearn import preprocessing
from sklearn.cluster import KMeans

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

# load the data
data = pandas.read_csv('../separatedData.csv')

# convert to numeric format
data["breastCancer100th"] = pandas.to_numeric(data["breastCancer100th"], errors='coerce')
data["meanSugarPerson"] = pandas.to_numeric(data["meanSugarPerson"], errors='coerce')
data["meanFoodPerson"] = pandas.to_numeric(data["meanFoodPerson"], errors='coerce')
data["meanCholesterol"] = pandas.to_numeric(data["meanCholesterol"], errors='coerce')

# listwise deletion of missing values
sub1 = data[['breastCancer100th', 'meanFoodPerson', 'meanCholesterol', 'meanSugarPerson']].dropna()

# subset the clustering variables
cluster = sub1[['meanSugarPerson', 'meanFoodPerson', 'meanCholesterol']]

# standardize predictors to have mean=0 and sd=1
clustervar = cluster.copy()
clustervar['meanSugarPerson'] = preprocessing.scale(clustervar['meanSugarPerson'].astype('float64'))
clustervar['meanFoodPerson'] = preprocessing.scale(clustervar['meanFoodPerson'].astype('float64'))
clustervar['meanCholesterol'] = preprocessing.scale(clustervar['meanCholesterol'].astype('float64'))

# split data into train and test sets - Train = 70%, Test = 30%
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
To run the k-means cluster analysis we must standardize the predictors to have mean = 0 and standard deviation = 1. After that, we run nine analyses on the data, starting with one cluster and adding one cluster per run.

# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])

"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
monuonrise · 2 years
Text
Running a k-means Cluster Analysis:
Machine Learning for Data Analysis
Week 4: Running a k-means Cluster Analysis
A k-means cluster analysis was conducted to identify underlying subgroups of countries based on their similarity of responses on 7 variables that represent characteristics that could have an impact on internet use rates. Clustering variables included quantitative variables measuring income per person, employment rate, female employment rate, polity score, alcohol consumption, life expectancy, and urban rate. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.
Because the GapMinder dataset which I am using is relatively small (N < 250), I have not split the data into test and training sets. A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (r-square) was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
Load the data, set the variables to numeric, and clean the data of NA values
In [1]:
'''
Code for Peer-graded Assignments: Running a k-means Cluster Analysis
Course: Data Management and Visualization
Specialization: Data Analysis and Interpretation
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('c:/users/greg/desktop/gapminder.csv', low_memory=False)

data['internetuserate'] = pd.to_numeric(data['internetuserate'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')
data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')
data['femaleemployrate'] = pd.to_numeric(data['femaleemployrate'], errors='coerce')
data['polityscore'] = pd.to_numeric(data['polityscore'], errors='coerce')
data['alcconsumption'] = pd.to_numeric(data['alcconsumption'], errors='coerce')
data['lifeexpectancy'] = pd.to_numeric(data['lifeexpectancy'], errors='coerce')
data['urbanrate'] = pd.to_numeric(data['urbanrate'], errors='coerce')

sub1 = data.copy()
data_clean = sub1.dropna()
Subset the clustering variables
In [2]:
cluster = data_clean[['incomeperperson', 'employrate', 'femaleemployrate', 'polityscore', 'alcconsumption', 'lifeexpectancy', 'urbanrate']]
cluster.describe()

Out[2]:
       incomeperperson  employrate  femaleemployrate  polityscore  alcconsumption  lifeexpectancy   urbanrate
count       150.000000  150.000000        150.000000   150.000000      150.000000      150.000000  150.000000
mean       6790.695858   59.261333         48.100667     3.893333        6.821733       68.981987   55.073200
std        9861.868327   10.380465         14.780999     6.248916        5.121911        9.908796   22.558074
min         103.775857   34.900002         12.400000   -10.000000        0.050000       48.132000   10.400000
25%         592.269592   52.199999         39.599998    -1.750000        2.562500       62.467500   36.415000
50%        2231.334855   58.900002         48.549999     7.000000        6.000000       72.558500   57.230000
75%        7222.637721   65.000000         55.725000     9.000000       10.057500       76.069750   71.565000
max       39972.352768   83.199997         83.300003    10.000000       23.010000       83.394000  100.000000
Standardize the clustering variables to have mean = 0 and standard deviation = 1
In [3]:
clustervar = cluster.copy()
clustervar['incomeperperson'] = preprocessing.scale(clustervar['incomeperperson'].astype('float64'))
clustervar['employrate'] = preprocessing.scale(clustervar['employrate'].astype('float64'))
clustervar['femaleemployrate'] = preprocessing.scale(clustervar['femaleemployrate'].astype('float64'))
clustervar['polityscore'] = preprocessing.scale(clustervar['polityscore'].astype('float64'))
clustervar['alcconsumption'] = preprocessing.scale(clustervar['alcconsumption'].astype('float64'))
clustervar['lifeexpectancy'] = preprocessing.scale(clustervar['lifeexpectancy'].astype('float64'))
clustervar['urbanrate'] = preprocessing.scale(clustervar['urbanrate'].astype('float64'))
Split the data into train and test sets
In [4]:clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
Perform k-means cluster analysis for 1-9 clusters
In [5]:
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1)) / clus_train.shape[0])
Plot average distance from observations from the cluster centroid to use the Elbow Method to identify number of clusters to choose
In [6]:
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
Interpret the 4-cluster solution
In [7]:
model3 = KMeans(n_clusters=4)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)
Plot the clusters
In [8]:
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plt.figure()
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 4 Clusters')
plt.show()
Begin multiple steps to merge cluster assignment with clustering variables to examine cluster variable means by cluster.
Create a unique identifier variable from the index for the cluster training data to merge with the cluster assignment variable.
In [9]:clus_train.reset_index(level=0, inplace=True)
Create a list that has the new index variable
In [10]:cluslist = list(clus_train['index'])
Create a list of cluster assignments
In [11]:labels = list(model3.labels_)
Combine index variable list with cluster assignment list into a dictionary
In [12]:
newlist = dict(zip(cluslist, labels))
print(newlist)

{2: 1, 4: 2, 6: 0, 10: 0, 11: 3, 14: 2, 16: 3, 17: 0, 19: 2, 22: 2, 24: 3, 27: 3, 28: 2, 29: 2, 31: 2, 32: 0, 35: 2, 37: 3, 38: 2, 39: 3, 42: 2, 45: 2, 47: 1, 53: 3, 54: 3, 55: 1, 56: 3, 58: 2, 59: 3, 63: 0, 64: 0, 66: 3, 67: 2, 68: 3, 69: 0, 70: 2, 72: 3, 77: 3, 78: 2, 79: 2, 80: 3, 84: 3, 88: 1, 89: 1, 90: 0, 91: 0, 92: 0, 93: 3, 94: 0, 95: 1, 97: 2, 100: 0, 102: 2, 103: 2, 104: 3, 105: 1, 106: 2, 107: 2, 108: 1, 113: 3, 114: 2, 115: 2, 116: 3, 123: 3, 126: 3, 128: 3, 131: 2, 133: 3, 135: 2, 136: 0, 139: 0, 140: 3, 141: 2, 142: 3, 144: 0, 145: 1, 148: 3, 149: 2, 150: 3, 151: 3, 152: 3, 153: 3, 154: 3, 158: 3, 159: 3, 160: 2, 173: 0, 175: 3, 178: 3, 179: 0, 180: 3, 183: 2, 184: 0, 186: 1, 188: 2, 194: 3, 196: 1, 197: 2, 200: 3, 201: 1, 205: 2, 208: 2, 210: 1, 211: 2, 212: 2}
Convert newlist dictionary to a dataframe
In [13]:
newclus = pd.DataFrame.from_dict(newlist, orient='index')
newclus

Out[13]: DataFrame of cluster assignments keyed by the observation index (the same values as the dictionary above; display truncated)
105 rows × 1 columns
Rename the cluster assignment column
In [14]:newclus.columns = ['cluster']
Repeat previous steps for the cluster assignment variable
Create a unique identifier variable from the index for the cluster assignment dataframe to merge with cluster training data
In [15]:newclus.reset_index(level=0, inplace=True)
Merge the cluster assignment dataframe with the cluster training variable dataframe by the index variable
In [16]:
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)

Out[16]: first 100 rows of merged_train — the index, the standardized clustering variables, and the cluster assignment for each training observation (display truncated)
100 rows × 9 columns
Cluster frequencies
In [17]:merged_train.cluster.value_counts()
Out[17]:
3    39
2    35
0    18
1    13
Name: cluster, dtype: int64
Calculate clustering variable means by cluster
In [18]:
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
clustergrp

Clustering variable means by cluster

Out[18]:
              index  incomeperperson  employrate  femaleemployrate  polityscore  alcconsumption  lifeexpectancy  urbanrate
cluster
0         93.500000         1.846611   -0.196021          0.101022     0.811026        0.678541        1.195696   1.078462
1        117.461538        -0.154556   -1.117490         -1.645378    -1.069767       -1.082728        0.439557   0.508658
2        100.657143        -0.628227    0.855152          0.873487    -0.583841       -0.506473       -1.034933  -0.896385
3        107.512821        -0.284648   -0.424778         -0.200033     0.531755        0.614616        0.230201   0.164805
Validate clusters in training data by examining cluster differences in internetuserate using ANOVA. First, merge internetuserate with clustering variables and cluster assignment data
In [19]:internetuserate_data = data_clean['internetuserate']
Split internetuserate data into train and test sets
In [20]:
internetuserate_train, internetuserate_test = train_test_split(internetuserate_data, test_size=.3, random_state=123)
internetuserate_train1 = pd.DataFrame(internetuserate_train)
internetuserate_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(internetuserate_train1, merged_train, on='index')
sub5 = merged_train_all[['internetuserate', 'cluster']].dropna()

In [21]:
internetuserate_mod = smf.ols(formula='internetuserate ~ C(cluster)', data=sub5).fit()
internetuserate_mod.summary()

Out[21]:

OLS Regression Results
Dep. Variable:     internetuserate     R-squared:           0.679
Model:             OLS                 Adj. R-squared:      0.669
Method:            Least Squares       F-statistic:         71.17
Date:              Thu, 12 Jan 2017    Prob (F-statistic):  8.18e-25
Time:              20:59:17            Log-Likelihood:      -436.84
No. Observations:  105                 AIC:                 881.7
Df Residuals:      101                 BIC:                 892.3
Df Model:          3
Covariance Type:   nonrobust

                     coef    std err        t     P>|t|    [95.0% Conf. Int.]
Intercept         75.2068      3.727   20.177     0.000     67.813   82.601
C(cluster)[T.1]  -46.9517      5.756   -8.157     0.000    -58.370  -35.534
C(cluster)[T.2]  -66.5668      4.587  -14.513     0.000    -75.666  -57.468
C(cluster)[T.3]  -39.4860      4.506   -8.763     0.000    -48.425  -30.547

Omnibus:        5.290    Durbin-Watson:     1.727
Prob(Omnibus):  0.071    Jarque-Bera (JB):  4.908
Skew:           0.387    Prob(JB):          0.0859
Kurtosis:       3.722    Cond. No.          5.90
Means for internetuserate by cluster
In [22]:
m1 = sub5.groupby('cluster').mean()
m1

Out[22]:
         internetuserate
cluster
0              75.206753
1              28.255018
2               8.639961
3              35.720760
Standard deviations for internetuserate by cluster
In [23]:
m2 = sub5.groupby('cluster').std()
m2

Out[23]:
         internetuserate
cluster
0              14.093018
1              21.757752
2               8.399554
3              19.057835
In [24]:
mc1 = multi.MultiComparison(sub5['internetuserate'], sub5['cluster'])
res1 = mc1.tukeyhsd()
res1.summary()

Out[24]:

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1  group2  meandiff     lower      upper   reject
     0       1  -46.9517  -61.9887   -31.9148     True
     0       2  -66.5668  -78.5495   -54.5841     True
     0       3  -39.4860  -51.2581   -27.7139     True
     1       2  -19.6151  -33.0335    -6.1966     True
     1       3    7.4657   -5.7650    20.6965    False
     2       3   27.0808   17.4617    36.6999     True
The elbow curve was inconclusive, suggesting that the 2, 4, 6, and 8-cluster solutions might be interpreted. The results above are for an interpretation of the 4-cluster solution.
In order to externally validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on internet use rate. A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on internet use rate (F=71.17, p<.0001). The Tukey post hoc comparisons showed significant differences between clusters on internet use rate, with the exception that clusters 1 and 3 were not significantly different from each other. Countries in cluster 0 had the highest internet use rate (mean=75.2, sd=14.1), and cluster 2 had the lowest internet use rate (mean=8.64, sd=8.40).
Text
Data gathering. Relevant data for an analytics application is identified and assembled. The data may be located in different source systems, a data warehouse or a data lake, an increasingly common repository in big data environments that contain a mix of structured and unstructured data. External data sources may also be used. Wherever the data comes from, a data scientist often moves it to a data lake for the remaining steps in the process.
Data preparation. This stage includes a set of steps to get the data ready to be mined. It starts with data exploration, profiling and pre-processing, followed by data cleansing work to fix errors and other data quality issues. Data transformation is also done to make data sets consistent, unless a data scientist is looking to analyze unfiltered raw data for a particular application.
Mining the data. Once the data is prepared, a data scientist chooses the appropriate data mining technique and then implements one or more algorithms to do the mining. In machine learning applications, the algorithms typically must be trained on sample data sets to look for the information being sought before they're run against the full set of data.
Data analysis and interpretation. The data mining results are used to create analytical models that can help drive decision-making and other business actions. The data scientist or another member of a data science team also must communicate the findings to business executives and users, often through data visualization and the use of data storytelling techniques.
Types of data mining techniques
Various techniques can be used to mine data for different data science applications. Pattern recognition is a common data mining use case that's enabled by multiple techniques, as is anomaly detection, which aims to identify outlier values in data sets. Popular data mining techniques include the following types:
Association rule mining. In data mining, association rules are if-then statements that identify relationships between data elements. Support and confidence criteria are used to assess the relationships -- support measures how frequently the related elements appear together in a data set, while confidence is the proportion of cases containing the "if" part in which the "then" part also holds (a short worked sketch follows after this list).
Classification. This approach assigns the elements in data sets to different categories defined as part of the data mining process. Decision trees, Naive Bayes classifiers, k-nearest neighbor and logistic regression are some examples of classification methods.
Clustering. In this case, data elements that share particular characteristics are grouped together into clusters as part of data mining applications. Examples include k-means clustering, hierarchical clustering and Gaussian mixture models.
Regression. This is another way to find relationships in data sets, by calculating predicted data values based on a set of variables. Linear regression and multivariate regression are examples. Decision trees and some other classification methods can be used to do regressions, too.
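As a concrete illustration of the support and confidence criteria mentioned above, here is a small self-contained Python sketch; the transactions and item names are purely made up for illustration.

# Toy illustration of support and confidence for the rule "if bread then milk"
transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk'},
    {'bread', 'milk'},
]

n = len(transactions)
has_a = sum('bread' in t for t in transactions)                  # transactions containing the "if" item
has_a_and_b = sum({'bread', 'milk'} <= t for t in transactions)  # transactions containing both items

support = has_a_and_b / n         # how often the two items appear together
confidence = has_a_and_b / has_a  # how often the rule holds when the "if" item appears

print(f'support = {support:.2f}, confidence = {confidence:.2f}')  # 0.60 and 0.75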
Data mining companies follow this procedure.
coding02 · 1 year
Text
Running a k-means Cluster Analysis
Load the necessary libraries
library(dplyr)   # For data manipulation
library(ggplot2) # For data visualization
library(cluster) # For clustering analysis
Load your data set
data <- read.csv("your_data_file.csv")
Select your clustering variables
clustering_vars <- data %>% select(var1, var2, var3)
Normalize the clustering variables (optional)
clustering_vars_norm <- scale(clustering_vars)
Choose the number of clusters (k)
k <- 3
Run the k-means clustering analysis
set.seed(123) # For reproducibility
kmeans_results <- kmeans(clustering_vars_norm, centers = k)
View the cluster assignments for each observation
cluster_assignments <- kmeans_results$cluster
Visualize the clusters using scatterplots (optional)
ggplot(data, aes(x = var1, y = var2, color = factor(cluster_assignments))) + geom_point() + labs(color = "Cluster") + theme_minimal()
ggplot(data, aes(x = var1, y = var3, color = factor(cluster_assignments))) + geom_point() + labs(color = "Cluster") + theme_minimal()
ggplot(data, aes(x = var2, y = var3, color = factor(cluster_assignments))) + geom_point() + labs(color = "Cluster") + theme_minimal()
View the cluster centers (centroids)
kmeans_results$centers
In this example, we first load the necessary libraries and our data set. We then select our clustering variables (var1, var2, and var3) and normalize them (if desired). We choose the number of clusters (k) to be 3, and run the k-means clustering analysis using the kmeans() function. We also set a seed for reproducibility purposes. We then view the cluster assignments for each observation, and optionally visualize the clusters using scatterplots. Finally, we view the cluster centers (centroids).
The output of this analysis includes the cluster assignments for each observation (stored in the cluster_assignments variable), the cluster centers (stored in the kmeans_results$centers object), and the scatterplots (if created). The cluster assignments and centers can be used to further analyze and interpret the clusters.
mscdscourseraweek4 · 1 month
Text
Machine Learning for Data Analysis-Week4-Running a K-Means Cluster Analysis:
Based on the elbow chart, we see 4 clusters as the optimum number of clusters. We can probably assign these clusters to our dataset and check how the countries are grouped accordingly; a short sketch of this step follows below. This might help understand how the rest of the features are grouped based on these clusters.
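A minimal sketch of that step, under the assumption that X_scaled is the standardized feature matrix and df is the country-level dataframe used above (both names are placeholders, since the original code was posted as screenshots):

from sklearn.cluster import KMeans

# fit the 4-cluster solution suggested by the elbow chart and attach the labels
kmeans4 = KMeans(n_clusters=4, random_state=42).fit(X_scaled)
df['cluster'] = kmeans4.labels_

# per-cluster means of the original variables give the cluster profiles discussed below
profiles = df.groupby('cluster').mean(numeric_only=True)
print(profiles.round(2))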
Cluster Profiles:

Cluster 0:
Health Expenditure: Moderate, with an average of $188.46 per capita.
Access to Electricity: Fairly low at 70.99%.
Sanitation and Water Access: Moderate access to improved sanitation (57.49%) and water sources (87.89%).
Fertility Rate: Relatively high at 3.16 births per woman.
Mortality Rate (Under 5): Moderate at 46.17 per 1,000.
Fixed Broadband Subscriptions: Low at 1.81 per 100 people.
Survival to Age 65 (Female): Moderate survival rate at 73.42%.
Rural Population: Majority rural with 55.62% of the population.
GDP per Capita: Low to moderate at $3,273.65.
Life Expectancy: Moderate at 67.71 years.

Cluster 1:
Health Expenditure: Relatively high, averaging $697.40 per capita.
Access to Electricity: Very high at 98.62%.
Sanitation and Water Access: High access to improved sanitation (89.54%) and water sources (95.90%).
Fertility Rate: Lower at 1.98 births per woman.
Mortality Rate (Under 5): Low at 14.50 per 1,000.
Fixed Broadband Subscriptions: Moderate at 11.96 per 100 people.
Survival to Age 65 (Female): High survival rate at 85.53%.
Rural Population: Lower proportion of rural population at 36.55%.
GDP per Capita: Higher at $10,972.77.
Life Expectancy: Relatively high at 74.68 years.

Cluster 2:
Health Expenditure: Low, with an average of $79.03 per capita.
Access to Electricity: Very low at 28.86%.
Sanitation and Water Access: Low access to improved sanitation (28.59%) and water sources (64.85%).
Fertility Rate: Very high at 5.12 births per woman.
Mortality Rate (Under 5): High at 88.65 per 1,000.
Fixed Broadband Subscriptions: Very low at 0.13 per 100 people.
Survival to Age 65 (Female): Low survival rate at 57.54%.
Rural Population: Higher proportion of rural population at 66.50%.
GDP per Capita: Low at $1,708.19.
Life Expectancy: Lower at 58.04 years.

Cluster 3:
Health Expenditure: Very high, averaging $4,843.75 per capita.
Access to Electricity: Almost universal at 99.92%.
Sanitation and Water Access: Nearly universal access to improved sanitation (98.79%) and water sources (99.69%).
Fertility Rate: Low at 1.73 births per woman.
Mortality Rate (Under 5): Very low at 4.56 per 1,000.
Fixed Broadband Subscriptions: High at 30.32 per 100 people.
Survival to Age 65 (Female): Very high survival rate at 91.76%.
Rural Population: Very low proportion of rural population at 17.03%.
GDP per Capita: Very high at $50,487.41.
Life Expectancy: Very high at 81.02 years.
Key Insights:
Economic and Social Development: Clusters 1 and 3 represent more economically developed groups with high health expenditure, access to infrastructure, and longer life expectancies. Cluster 3, in particular, represents the highest economic development and best health outcomes.
Health and Mortality: Cluster 2, with the lowest economic indicators, exhibits the highest fertility rates and child mortality, along with the lowest life expectancy, indicating a significant need for improvement in healthcare and living conditions.
Rural vs. Urban: Clusters 0 and 2 have a higher proportion of rural populations, which correlates with lower access to services and poorer health outcomes, while Clusters 1 and 3 are more urbanized with better access to healthcare and higher life expectancy.
ANOVA Test Summary:
The ANOVA test explored the relationship between the clusters and life expectancy. Here’s what the results indicate:
R-squared (0.039): The model explains 3.9% of the variance in life expectancy, indicating that while clusters are statistically significant (p-value = 0.008), they only explain a small portion of the variation in life expectancy.
Coefficient (1.639): On average, being in a higher cluster (which typically reflects better socio-economic conditions) is associated with a 1.639-year increase in life expectancy.
F-statistic (7.134): This value indicates that the model is statistically significant.
Insights:
Health Expenditure & Life Expectancy: There’s a clear positive correlation between health expenditure per capita and life expectancy. Clusters with higher health expenditure (e.g., Cluster 3) have the highest life expectancy.
Basic Infrastructure: Access to electricity, sanitation, and water is strongly associated with higher life expectancy. Clusters with better infrastructure (e.g., Clusters 1 and 3) show higher life expectancy.
Socio-Economic Indicators: GDP per capita and access to broadband also correlate with better health outcomes and longer life expectancy.
Child Mortality: Lower under-5 mortality rates are observed in clusters with higher life expectancy, highlighting the importance of child health in overall life expectancy.
The findings suggest that significant disparities in life expectancy are associated with differences in healthcare expenditure, infrastructure, and socio-economic development. Clusters with better resources and infrastructure consistently exhibit higher life expectancy, reinforcing the critical role of these factors in population health outcomes.
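A sketch of the ANOVA summarized above: because a single slope (1.639) is reported, the cluster label appears to have been entered as a numeric predictor; df, life_expectancy and cluster are placeholder names rather than the original ones.

import statsmodels.formula.api as smf

# regression of life expectancy on the cluster label (treated as numeric here)
anova_mod = smf.ols('life_expectancy ~ cluster', data=df).fit()
print(anova_mod.summary())   # reports the R-squared, coefficient and F-statistic quoted above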
Tukey HSD Test Summary:
The Tukey HSD (Honestly Significant Difference) test was conducted to compare the mean life expectancy between each pair of clusters. The results indicate significant differences between all pairs of clusters, as all p-values are 0.0, and the null hypothesis is rejected for each comparison. Here’s a summary of the findings:
Cluster 0 vs. Cluster 1: Mean Difference: 6.97 years. Interpretation: Life expectancy in Cluster 1 is significantly higher than in Cluster 0 by approximately 7 years.
Cluster 0 vs. Cluster 2: Mean Difference: -9.67 years. Interpretation: Life expectancy in Cluster 2 is significantly lower than in Cluster 0 by about 9.67 years.
Cluster 0 vs. Cluster 3: Mean Difference: 13.31 years. Interpretation: Life expectancy in Cluster 3 is significantly higher than in Cluster 0 by around 13.31 years.
Cluster 1 vs. Cluster 2: Mean Difference: -16.65 years. Interpretation: Life expectancy in Cluster 2 is significantly lower than in Cluster 1 by about 16.65 years.
Cluster 1 vs. Cluster 3: Mean Difference: 6.33 years. Interpretation: Life expectancy in Cluster 3 is significantly higher than in Cluster 1 by approximately 6.33 years.
Cluster 2 vs. Cluster 3: Mean Difference: 22.98 years. Interpretation: Life expectancy in Cluster 3 is significantly higher than in Cluster 2 by about 22.98 years.
Insights:
Significant Differences: All pairwise comparisons between the clusters show significant differences in life expectancy, indicating that each cluster represents a distinct group with a unique profile of life expectancy.
Cluster 3 Dominance: Cluster 3, which is characterized by high health expenditure, nearly universal access to infrastructure, and a high GDP per capita, has the highest life expectancy. It significantly outperforms all other clusters.
Cluster 2 Underperformance: Cluster 2, marked by low health expenditure, poor infrastructure, and high fertility and mortality rates, has the lowest life expectancy, significantly trailing all other clusters.
Middle Ground: Clusters 0 and 1 fall in between the extremes, with Cluster 1 showing a moderately high life expectancy and Cluster 0 showing moderate to low life expectancy.
This analysis reinforces the earlier findings that socio-economic factors, health expenditure, and access to essential services are crucial determinants of life expectancy. The significant differences between clusters highlight the importance of targeted policy interventions to address the disparities in life expectancy.
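The pairwise comparisons above can be reproduced with a Tukey HSD test; here is a sketch using the same placeholder names as in the ANOVA sketch.

import statsmodels.stats.multicomp as multi

# Tukey HSD: pairwise differences in mean life expectancy between clusters
tukey = multi.MultiComparison(df['life_expectancy'], df['cluster']).tukeyhsd()
print(tukey.summary())       # mean differences and reject decisions for each cluster pair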
dataanalyst75 · 2 months
Text
Running a k-means Cluster Analysis: How the presence of adolescent troubles and dependencies can have an impact on symptoms of negative emotionality
To assess the impact of adolescent troubles and dependencies on depression, a k-means cluster analysis is carried out to detect underlying subgroups of individuals based on their similarity of responses on a wide array of variables - representing typical adolescent problems - related to symptoms of negative emotionality.
Fifteen noteworthy parameters – taken from the Wave 1 dataset of the Add Health survey, which includes youth in grades 7 through 12 – have been used as clustering variables, reflecting the similarity of responses on each item. They are split into binary and quantitative variables as follows, and have been standardized via SAS to a mean of 0 and a standard deviation of 1:
5 binary variables dichotomising
whether or not the adolescents had cigarettes easily available at home and smoked regularly,
Poverty status, based on whether the adolescent’s mother or father is currently on public assistance.
whether adolescents had ever drunk alcohol without permission, i.e. when they were not with their parents or other adults,
whether adolescents had ever consumed inhalants and marijuana.
10 quantitative variables containing
age as population demographic factor,
a Grade Point Average calculated on a 4.0 scale according to the adolescent’s reported performance at school,
scales rating
exhibitions of violence,
self-esteem,
parental presence (whether a parent is at home when the adolescent leaves for school, comes back from school and goes to sleep) and parent–child activities,
family connectedness assessing the adolescent’s relationship with and feelings toward parents and family members,
school connectedness assessing the adolescent’s relationship with teachers and peers and perceptions of school atmosphere,
the engagement in deviant behaviours (such as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school).
Data have been randomly split into a training dataset containing 70% of the observations (N=2,242) and a test dataset including the remaining 30% (N=961).
A sequence of k-means cluster analyses have been performed on the training data setting k = 1-9 clusters, applying Euclidean distance. The variance (r-square) in the clustering variables which has been accounted for by the clusters has been displayed for each of the 9 cluster solutions in an elbow curve to get some insights into the selection of the number of clusters to decipher. 
The elbow curve is however questionable, indicating that the 2, 4, 5, 6 and 8-cluster solutions could be interpreted. The results below are for an interpretation of the 4-cluster solution.
Canonical discriminant analysis has been applied to reduce the 15 clustering variables to a small number of variables that accounted for most of the variance in the clustering variables.
A scatterplot of the first 2 canonical variables by cluster shows that
the observations in clusters 3 and 1 are densely packed (3 slightly more than 1) and the variance within these clusters is relatively low, i.e. the observations within each of these clusters are fairly highly correlated with each other. These clusters did not overlap so much with the other clusters 4 and 2.
a different story holds for clusters 4 and 2:
with regard to cluster 4, it is by and large distinct with the exception of some observations next to clusters 1 and 3 but, although there is some indication of a cluster, the observations are spread out more than in clusters 1 and 3. This implies there is less correlation between the observations in this cluster, i.e. within-cluster variance is relatively high.
The situation becomes worse for cluster 2: observations in cluster 2 are spread out more than in the other clusters, indicating the highest within-cluster variance.
In a nutshell, the best cluster solution could have fewer than 4 clusters, meaning that it would be especially important to further asses the cluster solutions with lower than 4 clusters. 
Looking at the output for the 4-cluster solution and, in particular, at the cluster means table to assess the patterns of means on the clustering variables for each cluster,
compared to adolescents in the other clusters
adolescents in cluster 1 have
a relatively low likelihood of smoking cigarettes and marijuana, having cigarettes available at home, being in a poverty status and drinking alcohol,
the lowest levels of deviant and violent behaviours, self-esteem and parental involvement in activities,
fairly low levels of alcohol problems, school connectedness, parental presence, family connectedness and grade point average
adolescents in cluster 2, have
the highest likelihood of having used marijuana, being in a poverty status, drinking alcohol, of smoking without permission,
the highest levels of deviant and violent behaviours,
the lowest levels of school connectedness and performance, parental presence and in activities, and family connectedness,
a moderate likelihood of smoking regularly, second only to cluster 4,
a relatively low level of self-esteem (the lowest value is found in cluster 1).
In a nutshell, cluster 2 clearly includes the most troubled adolescents.
adolescents in cluster 3,
are the least likely to smoke marijuana, to smoke cigarettes regularly or without permission, to be poor, and to have drunk alcohol,
the lowest engagement in deviant and violent behaviours,
the lowest number of alcohol problems,
the best self-esteem,
the best levels of school connectedness and performance, parental presence and involvement in activities, and family connectedness.
In a nutshell, cluster 3 clearly includes the least troubled adolescents.
Adolescents in cluster 4,
are the most likely to smoke regularly,
have a moderate likelihood of smoking without permission, drinking alcohol, second only to cluster 2,
are the least likely to be in a poor status
have relatively low levels of engagement in deviant and violent behaviours, numbers of alcohol problems,
have moderate levels of self-esteem, school connectedness, parental presence and involvement in activities,
have relatively low levels of school performance.
To assess how the clusters differ with regard to depression, an Analysis of Variance (ANOVA) is performed to test for significant differences between clusters and depression, validating the clusters externally. As expected, the box plot depicting the mean depression by cluster shows that,
cluster 3 – including the least troubled individuals - has the lowest levels of depression compared to other clusters, while
cluster 2 – characterized by the most troubled adolescents - have the highest levels of depression.
Finally, the clusters differ significantly in mean DEP1 from each other (F(3, 319) = 185.36, p < .0001), as shown by the results of the Tukey test. Clusters 1 and 4 show the lowest difference (2.0455) in comparison to the other cluster comparisons.
Adolescents in cluster 4 present the highest levels of depression (mean= 14.07, sd=8.64), and cluster 3 the lowest level (mean=6.29, sd=4.81).
Below is the SAS syntax used to run the present k-means cluster analysis, along with extracts of the output returned by the SAS program that depict the results of running the code.
PROC IMPORT DATAFILE ='/home/u63783903/my_courses/tree_addhealth.csv' OUT = imported REPLACE; RUN;
/* DATA MANAGEMENT */ DATA clust; set imported;
/* lib name statement and data step to call in the Wave 1 dataset of the Add Health survey including youth in grade 7 through 12 for the purpose of Running a K-means cluster analysis */
/* a unique identifier is being assigned to each observation to merge the cluster assignment variable back with the main dataset later on */ idnum=_n_;
/* creation of a dataset that includes only clustering variables and the depression quantitative variable to be used to externally validate the clusters */ keep idnum dep1 treg1 cigavail passist alcevr1 marever1 age deviant1 viol1 alcprobs1 esteem1 schconn1 parpres paractv famconct gpa1;
/* delete observations with missing data on the clustering variables (for every variable in the dataset) */ if cmiss(of _all_) then delete; run;
/* turn on ODS graphics so that plots are produced */ ods graphics on;
/* Split data randomly into test and training data */
proc surveyselect data=clust out=traintest seed=123 samprate=0.7 method=srs outall; run;
/* Surveyselect procedure to randomly split present dataset into a training data set consisting of 70% of the total observations in the dataset and a test dataset consisting of the other 30% of the observations; Specification of the name of the managed dataset as clust; Specification of the name of the randomly split output dataset as traintest; Seed option to specify a random number to ensure that the data are randomly split the same way if the code is being run again;
samprate command to split the input dataset; 70% of the observations are designated as training observations; The remaining 30% are designated as test observations; Specification of the data being split using simple random sampling; outall option to include both the training and test observations in a single output dataset having a new variable called selected;
The selected variable indicates whether an observation belongs to the training dataset or the test dataset */
/* Training set observations are being coded 1 on the selected variable; test observations are being coded zero on the selected variable */ data clus_train; set traintest; if selected=1; run; data clus_test; set traintest; if selected=0; run;
/* standardize the clustering variables to have a mean of 0 and standard deviation of 1 */
/* due to the fact that in cluster analysis variables with large values contribute more to the distance calculations, a proc standard procedure is being used for standardization of the clustering variables to get a mean of 0 and standard deviation of 1; Variables measured on different scales should be standardized prior to clustering so that the solution is not being driven by variables measured on larger scales */
proc standard data=clus_train out=clustvar mean=0 std=1;
var treg1 cigavail passist alcevr1 marever1 age deviant1 viol1 alcprobs1 esteem1 schconn1 parpres paractv famconct gpa1;
run;
/* provided the name of clus_train to the training dataset with the unstandardized clustering variables; generated a dataset named clustvar including the standardized clustering variables; standardized the clustering variables to get a mean of 0 and a standard deviation of 1;
list of the clustering variables to be standardized;
a series of cluster analyses for a range of values for the number of clusters is being run as the number of clusters actually existing in the population is not known: a macro named kmean is being used to automate the process */
%macro kmean(K); /* The %macro statement indicates that the code is part of a SAS macro. The macro is called kmean, and K indicates that the macro runs the procedure code for a number of different values of K, whose value will be specified later */
/*The fastclus procedure is being used to perform the K means cluster analysis */ proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300; var treg1 cigavail passist alcevr1 marever1 age deviant1 viol1 alcprobs1 esteem1 schconn1 parpres paractv famconct gpa1; run; /*The fastclus procedure uses clustvar - the standardized training data - as input; An output dataset named outdata&K is being created for a range of values of K: this dataset contains a variable for cluster assignment for each observation and the distance of each observation from the cluster centroid. A numeric value to the name of the data set is being added; An output dataset for the cluster analysis statistics for range of values of K is being created; the cluster analysis is being run and the maximum number of clusters for a range of values of K is being specified; up to 300 iterations are being used to find the cluster solution;
List of the standardized clustering variables */
%mend; /* stop running the macro */
%kmean(1); %kmean(2); %kmean(3); %kmean(4); %kmean(5); %kmean(6); %kmean(7); %kmean(8); %kmean(9); /* Print of the output and creation of the output datasets for K from 1 to 9 clusters by typing %kmean, which is the name of the macro and, in parentheses, the value of K;
extract r-square values from each cluster solution and then merge them to plot elbow curve: an elbow plot is being created by plotting the r squared values for each of the k equals 1 to 9 cluster solutions, to determine how many clusters to retain and interpret: To do this, the r squared value from the output (for each of the 1 to 9 cluster solutions) */ data clus1; *Creation of a dataset named clus1; set cluststat1; *Usage of the cluster analysis to a statistics dataset for K=1 to create the dataset; nclust=1;
if _type_='RSQ'; *Selection of r-square statistics;
keep nclust over_all; *keeping the nclust variable and the variable label over_all: the latter containing the actual r-square value; run;
data clus2; set cluststat2; nclust=2;
if _type_='RSQ';
keep nclust over_all; run;
data clus3; set cluststat3; nclust=3;
if _type_='RSQ';
keep nclust over_all; run;
data clus4; set cluststat4; nclust=4;
if _type_='RSQ';
keep nclust over_all; run;
data clus5; set cluststat5; nclust=5;
if _type_='RSQ';
keep nclust over_all; run;
data clus6; set cluststat6; nclust=6;
if _type_='RSQ';
keep nclust over_all; run;
data clus7; set cluststat7; nclust=7;
if _type_='RSQ';
keep nclust over_all; run;
data clus8; set cluststat8; nclust=8;
if _type_='RSQ';
keep nclust over_all; run;
data clus9; set cluststat9; nclust=9;
if _type_='RSQ';
keep nclust over_all; run;
data clusrsquare; *keeping the nclust variable and the variable label over_all: the latter containing the actual r-square value;
set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9; *clusrsquare as the name of the new dataset got by adding together the 9 rsquare data; run;
* plot elbow curve using r-square values with the gplot procedure;
symbol1 color=blue interpol=join;
* display parameters for the plot: the r-square is plotted in blue, and each of the plotted r-square values is connected with a line;
proc gplot data=clusrsquare; plot over_all*nclust; run;
* Plot of the variable holding the r-square values (over_all) on the y-axis against the variable holding the number of clusters (nclust) on the x-axis;
* further examine the cluster solution for the number of clusters suggested by the elbow curve to assess whether or not the clusters overlap with each other;
* plot clusters for the 4-cluster solution;
* A canonical discriminant analysis is being used as a data reduction technique capable of creating a smaller number of variables that are linear combinations of the 15 clustering variables;
proc candisc data=outdata4 out=clustcan;
*a candisc procedure is being applied to create the canonical variables from the cluster analysis output dataset having the cluster assignment variable created when the cluster analysis is being run;
*outdata4 is the name of the dataset for the four cluster solution;
*The out=clustcan option tells SAS to output a data set called clustcan that includes the canonical variables that are estimated by the canonical discriminant analysis;
*a dataset named clustcan including the canonical variables is being output: the canonical variables are being estimated by the canonical discriminant analysis;
class cluster; *a cluster assignment variable named cluster is being specified as a categorical variable because it has 4 categories;
var treg1 cigavail passist alcevr1 marever1 age deviant1 viol1 alcprobs1 esteem1 schconn1 parpres paractv famconct gpa1; run; *creation of a smaller number of variables; *the clustering variables are being listed;
proc sgplot data=clustcan; scatter y=can2 x=can1 / group=cluster; *plot of the first 2 canonical variables using the sgplot procedure; run; *linear combination of clustering variables;
validate clusters on dep1;
first merge clustering variable and assignment data with dep1 data;
/* Extraction of the dep1 variables from the training data set and merge of it with the dataset including the cluster assignment variable. */
data dep1_data; set clus_train; keep idnum dep1; run; *The new variables named canonical variables are being ordered in terms of the proportion of variance in the clustering variables that is accounted for by each of the canonical variables;
/* Sort both datasets by the unique identifier, ID num being used to link the datasets */ proc sort data=outdata4; by idnum; run;
proc sort data=dep1_data; by idnum; run; *Majority of the variants in the clustering variable being accounted for by the first couple of canonical variables;
data merged; merge outdata4 dep1_data; by idnum; run;
proc sort data=merged; by cluster; run;
proc means data=merged; var dep1; by cluster; run; /* Merge of the datasets by the unique identifier into a dataset called merged */
/* Anova procedure to test whether there are significant differences between clusters and dep1 */
proc anova data=merged;
class cluster; /* class statement to indicate that the cluster membership variable is categorical */
model dep1 = cluster; /* The model statement specifies the model with dep1 as the response variable and cluster as the explanatory variable */
means cluster/tukey; /* Because the categorical cluster variable has 4 categories, a Tukey test to evaluate post hoc comparisons between the clusters is being requested */
run;
ramanidevi16 · 3 months
Text
Run K-means Cluster Analysis
To run a k-means cluster analysis, you'll use a programming language like Python with appropriate libraries. Here’s a guide to help you complete this assignment:

### Step 1: Prepare Your Data
Ensure your data is ready for analysis, including the clustering variables.

### Step 2: Import Necessary Libraries
For this example, I’ll use Python and the `scikit-learn` library.

#### Python
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
```

### Step 3: Load and Standardize Your Data
```python
# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Select the clustering variables
X = data[['var1', 'var2', 'var3', ...]]  # replace with your actual variable names

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

### Step 4: Determine the Optimal Number of Clusters
Use the Elbow method to find the optimal number of clusters.
```python
# Determine the optimal number of clusters using the Elbow method
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plot the Elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()
```

### Step 5: Train the k-means Model
Choose the number of clusters based on the Elbow plot and train the k-means model.
```python
# Train the k-means model with the optimal number of clusters
optimal_clusters = 3  # replace with the optimal number you identified
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
kmeans.fit(X_scaled)

# Get the cluster labels
labels = kmeans.labels_
data['Cluster'] = labels
```

### Step 6: Visualize the Clusters
Use a pairplot or other visualizations to see the clustering results.
```python
# Visualize the clusters
sns.pairplot(data, hue='Cluster', vars=['var1', 'var2', 'var3', ...])  # replace with your actual variable names
plt.show()
```

### Interpretation
After running the above code, you'll have the output from your model, including the optimal number of clusters, the cluster labels for each observation, and a visualization of the clusters. Here’s an example of how you might interpret the results:
- **Optimal Number of Clusters**: The Elbow method helps determine the number of clusters where the inertia begins to plateau, indicating an optimal number of clusters.
- **Cluster Labels**: Each observation in the dataset is assigned a cluster label, indicating the subgroup it belongs to based on the similarity of responses on the clustering variables.
- **Cluster Visualization**: The pairplot (or other visualizations) shows the distribution of observations within each cluster, helping to understand the patterns and similarities among the clusters.

### Blog Entry Submission
For your blog entry, include:
- The code used to run the k-means cluster analysis (as shown above).
- Screenshots or text of the output (Elbow plot, cluster labels, and cluster visualization).
- A brief interpretation of the results.

If your dataset is small and you decide not to split it into training and test sets, provide a rationale for this decision in your summary. Ensure the content is clear and understandable for peers who may not be experts in the field. This will help them effectively assess your work.
sanaablog1 · 3 months
Text
Assignment 4:
K-means Cluster Analysis in the banking system
Introduction:
A personal equity plan (PEP) was an investment plan introduced in the United Kingdom that encouraged people over the age of 18 to invest in British companies. Participants could invest in shares, authorized unit trusts, or investment trusts and receive both income and capital gains free of tax. The PEP was designed to encourage investment by individuals. Banks engage in data analysis related to Personal Equity Plans (PEPs) for various reasons. They use it to assess the risk associated with these investment plans. By examining historical performance, market trends, and individual investor behavior, banks can make informed decisions about offering PEPs to their clients.
In general, banks analyze PEP-related data to make informed investment decisions, comply with regulations, and tailor their offerings to customer needs. The goal is to provide equitable opportunities for investors while managing risks effectively. 
SAS Code
proc import out=mylib.mydata datafile='/home/u63879373/bank1.csv' dbms=CSV replace;
proc print data=mylib.mydata;
run;
/********************************************************************
DATA MANAGEMENT
*********************************************************************/
data new_clust; set mylib.mydata;
* create a unique identifier to merge cluster assignment variable with
the main data set;
idnum=_n_;
keep idnum age sex region income married children car save_act current_act mortgage pep;
* delete observations with missing data;
 if cmiss(of _all_) then delete;
 run;
ods graphics on;
* Split data randomly into test and training data;
proc surveyselect data=new_clust out=traintest seed = 123
 samprate=0.7 method=srs outall;
run;  
data clus_train;
set traintest;
if selected=1;
run;
data clus_test;
set traintest;
if selected=0;
run;
* standardize the clustering variables to have a mean of 0 and standard deviation of 1;
proc standard data=clus_train out=clustvar mean=0 std=1;
var age sex region income married children car save_act current_act mortgage;
run;
%macro kmean(K);
proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300;
var age sex region income married children car save_act current_act mortgage;
run;
%mend;
%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);
* extract r-square values from each cluster solution and then merge them to plot elbow curve;
data clus1;
set cluststat1;
nclust=1;
if _type_='RSQ';
keep nclust over_all;
run;
data clus2;
set cluststat2;
nclust=2;
if _type_='RSQ';
keep nclust over_all;
run;
data clus3;
set cluststat3;
nclust=3;
if _type_='RSQ';
keep nclust over_all;
run;
data clus4;
set cluststat4;
nclust=4;
if _type_='RSQ';
keep nclust over_all;
run;
data clus5;
set cluststat5;
nclust=5;
if _type_='RSQ';
keep nclust over_all;
run;
data clus6;
set cluststat6;
nclust=6;
if _type_='RSQ';
keep nclust over_all;
run;
data clus7;
set cluststat7;
nclust=7;
if _type_='RSQ';
keep nclust over_all;
run;
data clus8;
set cluststat8;
nclust=8;
if _type_='RSQ';
keep nclust over_all;
run;
data clus9;
set cluststat9;
nclust=9;
if _type_='RSQ';
keep nclust over_all;
run;
data clusrsquare;
set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9;
run;
* plot elbow curve using r-square values;
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
 plot over_all*nclust;
 run;
/*****************************************************************************
Number of clusters suggested by the elbow curve
*****************************************************************************/
* the proposed numbers are: 2, 5, 7, and 8;
* plot clusters for 5 cluster solution;
proc candisc data=outdata5 out=clustcan;
class cluster;
var age sex region income married children car save_act current_act mortgage;
run;
proc sgplot data=clustcan;
scatter y=can2 x=can1 / group=cluster;
run;
* validate clusters on PEP, the categorical target variable;
* first merge clustering variable and assignment data with PEP data;
data pep_data;
set clus_train;
keep idnum pep;
run;
proc sort data=outdata5;
by idnum;
run;
proc sort data=pep_data;
by idnum;
run;
data merged;
merge outdata5 pep_data;
by idnum;
run;
proc sort data=merged;
by cluster;
run;
proc means data=merged;
var pep;
by cluster;
run;
proc anova data=merged;
class cluster;
model pep = cluster;
means cluster/tukey;
run;
Dataset
The dataset I used in this assignment contains information about customers of a bank. The analysis will help the bank identify which of the following features can affect a client's PEP decision: age, sex, region, income, married, children, car, save_act, current_act and mortgage.
Id: a unique identification number,
age: age of customer in years (numeric),
income: income of customer (numeric)
sex: 0 for MALE / 1 for FEMALE
married: is the customer married (1 for YES/ 0 for NO)
children: number of children (numeric)
car: does the customer own a car (1 for YES/ 0 for NO)
save_act: does the customer have a saving account (1 for YES/ 0 for NO)
current_act: does the customer have a current account (1 for YES/ 0 for NO)
mortgage: does the customer have a mortgage (1 for YES/ 0 for NO)
Figure1: dataset
Tumblr media
K-means Clustering Algorithm
Cluster analysis is an unsupervised learning method used to group or cluster observations into subsets based on the similarity of responses on multiple variables. Observations that have similar response patterns are grouped together to form clusters. The goal is to partition the observations in a data set into a smaller set of clusters and each observation belongs to only one cluster.
Cluster analysis is used in this assignment to develop individual customer profiles for the bank, separating the customers who are likely to benefit from a PEP from those who are not.
With cluster analysis, we want customers within clusters to be more similar to each other than they are to observations in other clusters.
K-means Cluster Analysis
I used the libname statement to reference my dataset. All the data are numeric (continuous or binary 0/1 indicators, depending on the feature description).
We first create a dataset that includes only the clustering variables and the PEP variable, which will be used to externally validate the clusters. Then we assign each observation a unique identifier so that we can merge the cluster assignment variable back with the main dataset later on.
data new_clust; set mylib.mydata;
idnum=_n_; * create a unique identifier to merge cluster assignment variable with the main data set;
keep idnum age sex region income married children car save_act current_act mortgage pep;
The SURVEYSELECT procedure
ODS graphics is turned on so that SAS produces plots. The data set is randomly split into a training data set consisting of 70% of the total observations and a test data set consisting of the other 30%. The data= option specifies the name of the managed dataset, new_clust, as shown in the figure below.
Figure 2
Tumblr media
Statistics for variables
proc surveyselect data=new_clust out=traintest seed = 123
 samprate=0.7 method=srs outall;
run;  
method = srs specifies that the data are to be split using simple random sampling.
out=traintest includes both the training and test observations in a single output data set that has a new variable called selected.
The selected variable is 1 when an observation belongs to the training data set and 0 when an observation belongs to the test data set.
In cluster analysis, variables with large values contribute more to the distance calculations, so variables measured on different scales should be standardized prior to clustering; otherwise the solution is driven by the variables measured on larger scales. We use the following code to standardize the clustering variables to have a mean of zero and a standard deviation of one.
proc standard data=clus_train out=clustvar mean=0 std=1;
var age sex region income married children car save_act current_act mortgage; run;
Figure 3
Tumblr media
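For readers following the course in Python rather than SAS, a minimal sketch of the same split-and-standardize step is shown below. It assumes the bank1.csv file from the proc import step and lowercase column names matching the data dictionary above; it is an approximation, not the code used for this analysis.

```python
# Rough Python equivalent of the SAS data split and standardization (sketch only).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('bank1.csv')   # same file as in the proc import step (path assumed)
data = data.dropna()              # drop observations with missing data

cluster_vars = ['age', 'sex', 'region', 'income', 'married', 'children',
                'car', 'save_act', 'current_act', 'mortgage']

# 70% training / 30% test split, mirroring proc surveyselect with samprate=0.7
clus_train, clus_test = train_test_split(data, train_size=0.7, random_state=123)

# standardize the clustering variables to mean 0 and standard deviation 1
scaler = StandardScaler()
clustvar = scaler.fit_transform(clus_train[cluster_vars])
```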
The Elbow curve
%macro kmean(K); indicates that the code is part of a SAS macro called kmean, and the K in parentheses indicates that the macro will run the procedure for a number of different values of k. The output is printed, and an output data set is created for each of k = 1 to 9 clusters.
Figure4
Tumblr media
To view the R-squared values for each of k = 1 to 9, the elbow plot is drawn as shown in Figure 5. We start with k = 1, which has an R-squared of zero because there is no clustering yet. The 2-cluster solution accounts for about 13% of the variance, and the R-square value increases as more clusters are specified. We are looking for the bend in the elbow that shows where the R-square value might be leveling off. The graph shows bends at 2, 5, 7, and 8 clusters, so to figure out which solution is best we should further examine the results for the 2-, 5-, 7-, and 8-cluster solutions.
Figure 5
Tumblr media
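Note that the elbow curve here is based on r-square (the variance in the clustering variables explained by the clusters) rather than inertia. A minimal Python sketch of the same idea, assuming the clustvar array from the earlier Python sketch, could look like this:

```python
# Sketch: r-square elbow curve, where r-square = 1 - (within-cluster SS / total SS).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

total_ss = ((clustvar - clustvar.mean(axis=0)) ** 2).sum()
rsquare = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, random_state=123, n_init=10).fit(clustvar)
    rsquare.append(1 - km.inertia_ / total_ss)   # inertia_ is the within-cluster SS

plt.plot(list(ks), rsquare, 'bo-')
plt.xlabel('Number of clusters')
plt.ylabel('R-square (variance explained)')
plt.show()
```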
The Canonical discriminate analysis
We should further examine the results for the 2-, 5-, 7-, and 8-cluster solutions to see whether the clusters overlap, whether the patterns of means on the clustering variables are unique and meaningful, and whether there are significant differences between the clusters on our external validation variable, PEP. We will interpret the results for the 5-cluster solution.
Since we have 10 clustering variables, we cannot plot a scatter chart to see whether or not the clusters overlap with each other in terms of their location in the 10-dimensional space. For this, we use canonical discriminant analysis, a data reduction technique that creates a smaller number of variables that are linear combinations of the 10 clustering variables. Usually, the majority of the variance in the clustering variables is accounted for by the first couple of canonical variables, and those are the variables we can plot.
Figure 6
Tumblr media
Results in Figure 6 show that the 10 variables are now reduced to 4 canonical variables that can be used to visualize the location of the clusters in a two or three dimensional space as shown in Figure 7.
Figure 7
Tumblr media
What this shows is that the observations in cluster 5 are a little more spread out, indicating less correlation among the observations and higher within-cluster variance. Clusters 2 and 3 are relatively distinct, with the exception that some of the observations are closer to each other, indicating some overlap between these clusters. The same applies to clusters 2 and 4. However, cluster 1 is all over the place: there is some indication of a cluster, but the observations are spread out more than in the other clusters. This means that the within-cluster variance is high and there is less correlation between the observations in this cluster, so we don't really know what will happen with that cluster. The best cluster solution may therefore have fewer than 5 clusters.
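SAS's proc candisc performs the canonical discriminant analysis directly. For a roughly comparable two-dimensional view in Python, linear discriminant analysis with the k-means labels as classes can be used; this is only an approximation of the SAS output, and it again assumes the clustvar array from the earlier sketch.

```python
# Sketch: approximate analogue of the proc candisc scatter plot.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

km5 = KMeans(n_clusters=5, random_state=123, n_init=10).fit(clustvar)
lda = LinearDiscriminantAnalysis(n_components=2)
canon = lda.fit_transform(clustvar, km5.labels_)   # first two discriminant variables

plt.scatter(canon[:, 0], canon[:, 1], c=km5.labels_, cmap='tab10', s=15)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.show()
```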
Cluster means table
For the rest of the analysis we take k=5. We examine the cluster means table, shown in Figure 8, to look at the patterns of means on the clustering variables for each cluster.
Figure 8
Tumblr media
The means on the clustering variables show that, compared to the other clusters, customers in clusters 2 and 5 have relatively good incomes, current and saving accounts, and no mortgage, while customers in clusters 1 and 3 have low incomes, are married, and have fewer children. Cluster 4 includes customers with a low income, a low rate of saving accounts, and a mortgage.
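A quick way to reproduce a cluster means table in Python, assuming the clus_train, cluster_vars and km5 objects from the sketches above (here on the original, unstandardized scale):

```python
# Sketch: means of the clustering variables, by k-means cluster.
means = clus_train[cluster_vars].copy()
means['cluster'] = km5.labels_          # labels are in the same row order as clus_train
print(means.groupby('cluster').mean())
```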
Fit criteria for pep
The last part of this analysis shows how the clusters differ on PEP. We first extract the PEP variable from the training data set, then sort both data sets by the unique identifier, idnum, which we use to link them, and finally merge the PEP data with the data set that includes the cluster assignment variable.
The graph in Figure 9 below shows the mean PEP by cluster. As described in the means table above, customers in clusters 1 and 3, who have low incomes, are married, and have fewer children, take a PEP at a rate of about 60%, while customers in clusters 2 and 5, who have relatively good incomes, current and saving accounts, and no mortgage, take a PEP at a rate of about 40%. Customers in cluster 4, with a low income, a low rate of saving accounts, and a mortgage, take a PEP at a rate of about 25%, which is logical as they already have a mortgage.
Figure 9
Tumblr media
The tukey test
The anova procedure is used to test whether there are significant differences between the clusters on PEP, as follows:
proc anova data=merged;
class cluster;
model pep = cluster;
means cluster/tukey;
run;
The tukey test shows that the clusters differed significantly in mean PEP as shown in figure 10, with the exception of clusters 2 and 3, which did not differ significantly from each other.
Figure 10
Tumblr media
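For reference, the same external validation step can be sketched in Python with statsmodels, again assuming the clus_train and km5 objects from the earlier sketches and a lowercase pep column:

```python
# Sketch: mean PEP by cluster and Tukey post hoc comparisons between clusters.
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

merged = pd.DataFrame({'pep': clus_train['pep'].to_numpy(),
                       'cluster': km5.labels_})
print(merged.groupby('cluster')['pep'].mean())            # mean PEP rate per cluster
print(pairwise_tukeyhsd(merged['pep'], merged['cluster']))
```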
Conclusion
K-means cluster analysis is an unsupervised learning method used to group observations into subsets based on the similarity of their responses on multiple variables. It is a useful machine learning method that can be applied in any field. However, k-means does not tell us the correct number of clusters, and figuring out the number that represents the true number of clusters in the population is fairly subjective. Results can also change depending on which observations are randomly chosen as the initial centroids. K-means further assumes that the underlying clusters in the population are spherical, distinct, and of approximately equal size, and as a result it tends to identify clusters with these characteristics; it won't work as well if clusters are elongated or unequal in size.
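The sensitivity to the initial centroids mentioned above is easy to demonstrate; a small sketch (assuming the clustvar array from the earlier Python sketches) runs k-means with a single initialization and different seeds and compares the resulting within-cluster sums of squares:

```python
# Sketch: k-means results depend on the randomly chosen initial centroids.
from sklearn.cluster import KMeans

for seed in (0, 1, 2):
    km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(clustvar)
    print(seed, round(km.inertia_, 2))   # n_init > 1 would keep the best of several runs
```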
0 notes
melolon · 3 months
Text
Running a k-means Cluster Analysis
Tumblr media
The number of clusters for k-means cluster analysis: 3 clusters
Adolescents in cluster 1 show the highest levels of school connectedness, self-esteem, and family connectedness and the lowest levels of depression and problems with alcohol and marijuana, while the other two clusters show the opposite tendency.
There is a significant difference in GPA between cluster 1 and the other two clusters.
Tumblr media Tumblr media
0 notes
chaolinchang · 3 months
Text
Running a k-means Cluster Analysis on IRIS
# Import necessary libraries
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.cluster import KMeans
# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
# Define the number of clusters
k = 3
# Initialize the k-means model
kmeans = KMeans(n_clusters=k, random_state=42)
# Fit the model to the data
kmeans.fit(X)
# Get the cluster centers
cluster_centers = kmeans.cluster_centers_
# Get the labels for each data point
labels = kmeans.labels_
# Print the cluster centers
print("Cluster Centers:")
print(cluster_centers)
# Print the labels
print("\nLabels:")
print(labels)
Tumblr media
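Since the Iris data come with known species labels, one optional check (not part of the code above) is to compare the k-means labels with the true labels using the adjusted Rand index:
# Optional check (sketch): agreement between the k-means clusters and the true species
from sklearn.metrics import adjusted_rand_score
print("Adjusted Rand index:", adjusted_rand_score(iris.target, labels))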
0 notes