Agglomerative Clustering

Agglomerative and K-means Clustering on US Crime Data
Objective: Group US states based on crime data using the K-means clustering algorithm, then compare the results with the hierarchical agglomerative clustering algorithm.
Data Source: a crime dataset covering the 50 US states.
The Data
This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
- Murder (numeric): murder arrests per 100,000
- Assault (numeric): assault arrests per 100,000
- UrbanPop (numeric): percent urban population
- Rape (numeric): rape arrests per 100,000
Import Libraries
Import the commonly used libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Read in the data file using read_csv. Set the states column as the index.
df = pd.read_csv('crime_data.csv', index_col=0)
df['Total'] = df['Murder'] + df['Assault'] + df['Rape']
df.head()
sns.lmplot(data=df,x='UrbanPop',y='Total')
Although the points are scattered, there is a clear upward trend: total reported crime tends to rise with urbanization.
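The visual trend can be backed up with a correlation coefficient. A minimal sketch; the arrays below are illustrative stand-ins, and with the real frame you would pass df['UrbanPop'] and df['Total'] instead:

```python
import numpy as np

# Hypothetical sample values standing in for df['UrbanPop'] and df['Total'];
# Pearson's r quantifies the upward trend that lmplot shows.
urban_pop = np.array([32, 45, 58, 66, 80, 91])
total_crime = np.array([60, 110, 150, 210, 250, 310])

r = np.corrcoef(urban_pop, total_crime)[0, 1]
print(f"Pearson r = {r:.2f}")
```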
sns.heatmap(df.drop('Total',axis=1).corr(),annot=True)
Inference: Assault and Murder are highly correlated crimes.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df[['Murder','Rape','Assault']])
scaled_data = scaler.transform(df[['Murder','Rape','Assault']])
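As a quick sanity check on what the scaler does: each transformed column ends up with mean 0 and unit standard deviation, so no single crime rate dominates the Euclidean distances the clustering uses. The numbers below are illustrative, not the real dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows (Murder, Rape, Assault) standing in for the real frame.
X = np.array([[13.2, 21.2, 236.0],
              [10.0, 44.5, 263.0],
              [ 8.1, 31.0, 294.0],
              [ 8.8, 19.5, 190.0]])

# After scaling, every column has mean ~0 and standard deviation ~1.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```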
Import KMeans from SciKit Learn.
from sklearn.cluster import KMeans
cluster_range = range(1, 10)
cluster_errors = []
for num_clusters in cluster_range:
    kmeans_index = KMeans(n_clusters=num_clusters)
    kmeans_index.fit(scaled_data)
    cluster_errors.append(kmeans_index.inertia_)
clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors})
plt.figure(figsize=(12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker="o")
In the figure above, the elbow bends sharply at a cluster count of 2, and the next noticeable bend is at 4, although the slope changes very little between 2 and 4. Let's cluster the data into 4 groups.
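The "bend" can also be quantified as the percentage drop in inertia from one cluster count to the next. A sketch on synthetic blobs (an assumption for illustration; with the real data you would pass scaled_data instead of X):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with four well-separated blobs, standing in for scaled_data.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

# Inertia (within-cluster sum of squares) for each candidate k.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Percentage drop in inertia when moving from k to k+1; a large drop
# followed by small ones marks the elbow.
drops = [100 * (a - b) / a for a, b in zip(inertias, inertias[1:])]
for k, d in enumerate(drops, start=2):
    print(f"k={k}: inertia drops {d:.1f}%")
```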
kmeans = KMeans(n_clusters=4)
kmeans.fit(scaled_data)
df['Cluster'] = kmeans.labels_
df.head()
df.groupby('Cluster').mean()
Two of the clusters have very similar means, so merge them, reducing four groups to three:
df['Cluster'] = df['Cluster'].replace([2, 3], [1, 2])
df.groupby('Cluster').mean()
Cluster Classification
- 2: Low crime zone states
- 0: Medium crime zone states
- 1: High crime zone states
df['Cluster'] = df['Cluster'].replace([0, 1, 2], ['Medium', 'High', 'Low'])
df.head()
Hierarchical Agglomerative Clustering
Now let's try to cluster the data using the hierarchical agglomerative clustering approach.
from sklearn.cluster import AgglomerativeClustering

ward = AgglomerativeClustering(n_clusters=4, linkage='ward').fit(scaled_data)
df['Cluster_H'] = ward.labels_
df.groupby('Cluster_H').mean()
Clusters 0 and 3 have similar means, so merge them, and categorize the groups as high, medium, and low crime-rate zones.
df['Cluster_H'] = df['Cluster_H'].replace([0, 1, 2, 3], ['High', 'Medium', 'Low', 'High'])
The results from KMeans and hierarchical clustering are quite similar, differing in only a few states; refer to the table towards the end of the page.
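One way to build such a comparison table is pd.crosstab on the two label columns. A sketch on synthetic blobs (an assumption for illustration; with the real frame you would cross-tabulate df['Cluster'] against df['Cluster_H']):

```python
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs standing in for the scaled crime data.
X, _ = make_blobs(n_samples=60, centers=[(0, 0), (6, 0), (0, 6)],
                  cluster_std=0.8, random_state=7)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

# Each cell counts points assigned to a given (KMeans, Ward) label pair;
# near-diagonal mass (after relabelling) means the methods agree.
ct = pd.crosstab(pd.Series(km_labels, name='KMeans'),
                 pd.Series(hc_labels, name='Ward'))
print(ct)
```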
Now let's run hierarchical agglomerative clustering with a cluster count of 3.
ward_c3 = AgglomerativeClustering(n_clusters=3, linkage='ward').fit(scaled_data)
df['Cluster_H_c3'] = ward_c3.labels_
df.groupby('Cluster_H_c3').mean()
df['Cluster_H_c3'] = df['Cluster_H_c3'].replace([0, 1, 2], ['High', 'Medium', 'Low'])
df.sort_values('Cluster')
NOTE: With only 3 clusters requested, the algorithm performed the same merge we did manually when reducing from 4 clusters to 3. So even though the elbow plot showed a marked bend at 4, the two clusters with hardly any separation were merged automatically.
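This automatic merging follows from how agglomerative clustering works: it builds a single merge tree bottom-up, and the 3-cluster solution is just a higher cut of the same tree as the 4-cluster one. A sketch using SciPy's linkage/fcluster on synthetic data (an assumption for illustration; substitute scaled_data for X on the real data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four synthetic blobs standing in for scaled_data.
rng = np.random.default_rng(0)
X = np.vstack([np.array(c) + rng.normal(scale=0.3, size=(20, 2))
               for c in [(0, 0), (4, 0), (0, 4), (4, 4)]])

# Build the full Ward merge tree once, then cut it at two depths.
Z = linkage(X, method='ward')
labels4 = fcluster(Z, t=4, criterion='maxclust')
labels3 = fcluster(Z, t=3, criterion='maxclust')

# Cutting at 3 merges exactly two of the 4-cluster groups; every
# 4-cluster group sits entirely inside one 3-cluster group.
print(len(set(labels4)), len(set(labels3)))
```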