K-means Clustering

K-means Clustering with University Data
K-means Clustering is an unsupervised algorithm used to categorise data into groups. The algorithm works as follows:
- Decide the number of clusters, i.e. K. How to calculate the ideal value of K is covered further down
- Set K centroids, which will be treated as the centres of the K clusters. Note, a centroid may or may not be a data point. The centroids can be assigned randomly.
- Measure the distance between each data point and each centroid, and assign the data point to the nearest centroid
- Move each centroid to the mean point of all the data points assigned to it
- Repeat steps 3 and 4 for a fixed number of iterations, which will cause the data points to realign with the new centroid positions. You can exit early if an iteration does not cause any realignment of the data points (see the sketch below)
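To make these steps concrete, here is a minimal NumPy sketch of the loop. The function kmeans_sketch, its defaults, and the toy data are our own illustration, not a library API; in practice you would use scikit-learn's KMeans, as shown later in this post.
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    # X is an (n_samples, n_features) array; the empty-cluster edge case
    # is ignored for brevity
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Early exit: if no centroid moved, no point will be reassigned
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated blobs should split cleanly into 2 clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)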
You can view this video explaining the above algorithm pictorially.
Note that if the features of the dataset are on different scales, standardise the data so that the distance calculations are uniform and no single feature carries undue weight.
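For example, a quick sketch with scikit-learn's StandardScaler; the toy matrix X here is our own illustration:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two features on very different scales (illustrative)
X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])

# After scaling, each column has zero mean and unit variance,
# so neither feature dominates the distance calculation
X_scaled = StandardScaler().fit_transform(X)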
Some of the use cases of clustering are:
- Anomaly detection: for example, in anti-money-laundering software, users are grouped by income pattern, and each user's spending pattern is then compared against their group to flag anomalies
- Recommendation engines: users are grouped based on their purchase history (collaborative filtering), or products are grouped based on a feature set (content-based filtering)
The clustering can be of two types:
- Hard clustering: Each data point is aligned with only one cluster
- Soft or fuzzy clustering: there is some overlap between clusters, so some data points belong to multiple clusters, often with a membership probability for each (see the sketch below)
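As a sketch of soft clustering, scikit-learn's GaussianMixture returns a membership probability for each cluster rather than a single hard label. The toy data here is our own illustration:
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two loose blobs with some overlap (illustrative)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1.5, (100, 2)), rng.normal(3, 1.5, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=42).fit(X)
# Each row gives the probability of belonging to each of the 2 clusters,
# rather than a single hard assignment
memberships = gm.predict_proba(X)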
Now let’s analyse the algorithm with a university dataset as an example.
Objective: Use KMeans Clustering to cluster Universities into two groups, Private and Public.
Data Source: The publicly available College dataset found in the ISLR package will be used.
Note: the data also has labels, but they won’t be used, as this is an unsupervised learning algorithm; we will, however, refer to them to analyse how well the algorithm has performed.
Data
The data has 777 records with the following 18 variables:
- Private A factor with levels No and Yes indicating private or public university
- Apps Number of applications received
- Accept Number of applications accepted
- Enroll Number of new students enrolled
- Top10perc Pct. new students from top 10% of H.S. class
- Top25perc Pct. new students from top 25% of H.S. class
- F.Undergrad Number of full-time undergraduates
- P.Undergrad Number of part-time undergraduates
- Outstate Out-of-state tuition
- Room.Board Room and board costs
- Books Estimated book costs
- Personal Estimated personal spending
- PhD Pct. of faculty with Ph.D.’s
- Terminal Pct. of faculty with terminal degree
- S.F.Ratio Student/faculty ratio
- perc.alumni Pct. alumni who donate
- Expend Instructional expenditure per student
- Grad.Rate Graduation rate
Import Libraries
Import the libraries you usually use for data analysis.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Get the Data
Read in the College_Data file using read_csv. Figure out how to set the first column as the index.
df = pd.read_csv('College_Data',index_col=0)
df.head()
Check the info() and describe() methods on the data.
df.info()
df.describe()
EDA
Scatterplot of Grad.Rate versus Room.Board with hue by the Private column.
sns.lmplot(data=df,x='Room.Board',y='Grad.Rate',hue='Private',palette='coolwarm',height=10,aspect=1,fit_reg=False)
Private universities have higher room and board costs, and their overall graduation rate is better than that of public universities. One interesting point to note is that one of the universities has a graduation rate of over 100%.
Scatterplot of F.Undergrad versus Outstate with hue by the Private column.
sns.lmplot(data=df,x='Outstate',y='F.Undergrad',hue='Private',palette='coolwarm',height=10,aspect=1,fit_reg=False)
The number of full-time undergraduates is much higher in public universities, but out-of-state tuition is significantly higher in private universities.
Histogram of out-of-state tuition (Outstate).
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="Private",palette='coolwarm',height=6,aspect=2)
g = g.map(plt.hist,'Outstate',bins=20,alpha=0.7,edgecolor='black', linewidth=1)
The histogram reconfirms the earlier observation that out-of-state tuition is much higher at private universities.
Grad.Rate histogram.
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="Private",palette='coolwarm',height=6,aspect=2)
g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7,edgecolor='black', linewidth=1)
This also reconfirms the higher graduation rate at private universities, and shows one private university with a graduation rate of over 100%.
Let’s find the name of the university.
df[df['Grad.Rate'] > 100]
Set the graduation rate to 100 for ‘Cazenovia College’, as a rate of 118% doesn’t make sense.
df.loc['Cazenovia College','Grad.Rate'] = 100
df.loc['Cazenovia College','Grad.Rate']
K Means Cluster Creation
Now it is time to create the Cluster labels!
Import KMeans from SciKit Learn.
from sklearn.cluster import KMeans
Create an instance of a K Means model with 2 clusters.
kmeans = KMeans(n_clusters=2)
Fit the model to all the data except for the Private label. (We fit on the raw features here for simplicity; as noted earlier, standardising features of different scales often improves distance-based clustering.)
kmeans.fit(df.drop('Private',axis=1))
Check the cluster center vectors.
kmeans.cluster_centers_
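The raw array can be hard to read. One option, a sketch rather than part of the original exercise, is to wrap the centers in a DataFrame with the feature names:
# Label each center's coordinates with the feature names used in the fit
centers = pd.DataFrame(kmeans.cluster_centers_,
                       columns=df.drop('Private', axis=1).columns)
centers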
Evaluation
Since this is a practice dataset, we have the true labels, which are generally not available in clustering problems. We will use the target labels to evaluate our model.
First, create a new column for df called ‘Cluster’, which is 1 for a private school and 0 for a public school.
def converter(cluster):
    # Map the 'Private' label to a numeric cluster id
    if cluster == 'Yes':
        return 1
    else:
        return 0
df['Cluster'] = df['Private'].apply(converter)
df.head()
Confusion matrix and classification report
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(df['Cluster'],kmeans.labels_))
print(classification_report(df['Cluster'],kmeans.labels_))
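Keep in mind that KMeans assigns the labels 0 and 1 arbitrarily, so the fitted clusters may come out flipped relative to our ‘Cluster’ column. If the confusion matrix looks inverted, a quick sketch to flip the predicted labels before comparing:
# Flip 0 <-> 1 and compare again (only needed if the matrix above looks inverted)
flipped = 1 - kmeans.labels_
print(confusion_matrix(df['Cluster'], flipped))
print(classification_report(df['Cluster'], flipped))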
Finding the optimal number of clusters
In the given problem we know that the dataset has only 2 clusters, i.e. public and private universities. There will be situations where the number of clusters is not known in advance. In such cases you can use one of the following methods to determine the ideal number:
- Elbow method
- Silhouette score analysis
- Dendrogram method
Refer to the link for an explanation of each of these 3 methods.
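As a complement to the elbow method used below, here is a minimal silhouette-analysis sketch on the same features (silhouette scores range from -1 to 1; higher is better):
from sklearn.metrics import silhouette_score

# Drop both label columns so they don't leak into the features
X = df.drop(['Private', 'Cluster'], axis=1)
for k in range(2, 10):
    km = KMeans(n_clusters=k).fit(X)
    print(k, silhouette_score(X, km.labels_))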
We will use the elbow method on our dataset, computing the within-cluster sum of squared errors (inertia) for a range of cluster counts.
cluster_range = range(1, 10)
cluster_errors = []

for num_clusters in cluster_range:
    kmeans_index = KMeans(n_clusters=num_clusters)
    # Drop both label columns ('Cluster' was added above) so they don't leak into the features
    kmeans_index.fit(df.drop(['Private', 'Cluster'], axis=1))
    # inertia_ is the within-cluster sum of squared distances to the centroids
    cluster_errors.append(kmeans_index.inertia_)

clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors})
clusters_df[0:10]

plt.figure(figsize=(12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker="o")
Elbow method conclusion
If you look at the above figure, there is a clear bend of the arm (the ‘elbow’) at a cluster count of 3, after which the slope tapers off with each subsequent cluster count on the X-axis. In this problem we have domain knowledge that the dataset contains only 2 clusters; for a generic dataset without such knowledge, the elbow would suggest 3 as the ideal cluster count.
One important step after categorising the data into clusters is to map the clusters back to the business domain, which means describing the properties of each cluster. For example, in the above dataset we know the clusters correspond to private and public universities. In real life the categorisation may not be that clear, so you will have to provide a more descriptive definition.
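For example, a quick way to profile the clusters (a sketch, not part of the original exercise) is to group the features by the fitted labels and compare the means:
# Mean of every feature within each fitted cluster, for describing the clusters
features = df.drop(['Private', 'Cluster'], axis=1)
features.groupby(kmeans.labels_).mean()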
The main objective is high intra-cluster similarity and low inter-cluster similarity: data points within the same cluster should have more in common with each other, while having differentiating factors from data points in other clusters.
Further Reading:
Refer to the blog for another clustering technique, Hierarchical Agglomerative Clustering, and a comparison of its results with K-means.
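For a quick taste before reading the blog, a minimal sketch with scikit-learn's AgglomerativeClustering on the same features:
from sklearn.cluster import AgglomerativeClustering

# Hierarchical agglomerative clustering with 2 clusters on the label-free features
agg = AgglomerativeClustering(n_clusters=2)
agg_labels = agg.fit_predict(df.drop(['Private', 'Cluster'], axis=1))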