Agglomerative Clustering

Agglomerative and K-means Clustering on US Crime Data

Objective: Group US states by their crime data using the K-means clustering algorithm, then compare the results with the hierarchical agglomerative clustering algorithm.

Data Source: A crime dataset covering the 50 US states.

The Data

This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas.

  • Murder numeric Murder arrests (per 100,000)
  • Assault numeric Assault arrests (per 100,000)
  • UrbanPop numeric Percent urban population
  • Rape numeric Rape arrests (per 100,000)

Import Libraries

Import the commonly used libraries.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Get the Data

Read in the data file using read_csv and add a Total column that sums the Murder, Assault, and Rape rates.

In [2]:
df = pd.read_csv('crime_data.csv')
df['Total'] = df['Murder'] + df['Assault'] + df['Rape']
df.head()
Out[2]:
State Murder Assault UrbanPop Rape Total
0 Alabama 13.2 236 58 21.2 270.4
1 Alaska 10.0 263 48 44.5 317.5
2 Arizona 8.1 294 80 31.0 333.1
3 Arkansas 8.8 190 50 19.5 218.3
4 California 9.0 276 91 40.6 325.6
Calling df.describe() or df.info() on the dataframe shows that there is no missing data.
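
One way to run the missing-data check is sketched below. Because crime_data.csv is not bundled here, the block rebuilds only the five rows printed in Out[2]; the names sample, missing_per_column and total_missing are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five rows shown in Out[2]; the real file has 50 rows.
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "Murder": [13.2, 10.0, 8.1, 8.8, 9.0],
    "Assault": [236, 263, 294, 190, 276],
    "UrbanPop": [58, 48, 80, 50, 91],
    "Rape": [21.2, 44.5, 31.0, 19.5, 40.6],
})

# Count missing values per column, then overall
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)

# describe() reports a 'count' row; it equals the row count when nothing is missing
print(sample.describe())
```

On the full dataframe the same two calls would be df.isnull().sum() and df.describe().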
Exploratory Data Analysis
In [7]:
sns.lmplot(data=df,x='UrbanPop',y='Total')
Out[7]:

Although the points are widely scattered, we can see an upward trend in reported crimes as urbanization increases.

In [8]:
# Drop the non-numeric State column as well, so corr() only sees numeric data
sns.heatmap(df.drop(['State','Total'],axis=1).corr(),annot=True)
Out[8]: 
Crimes Correlation Heatmap

Inference: Assault and Murder are highly correlated crimes.

Data Standardization
In [9]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df[['Murder','Rape','Assault']])
scaled_data = scaler.transform(df[['Murder','Rape','Assault']])
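
To make the scaling step concrete, here is a minimal sketch of the arithmetic that StandardScaler applies with its default settings: subtract each column's mean and divide by its population standard deviation. It uses only the five rows printed in Out[2], so the numbers differ from those of the full 50-state data; X and Z are illustrative names.

```python
import numpy as np

# Murder, Rape, Assault for the five states shown in Out[2]
X = np.array([
    [13.2, 21.2, 236],
    [10.0, 44.5, 263],
    [8.1, 31.0, 294],
    [8.8, 19.5, 190],
    [9.0, 40.6, 276],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # ddof=0 (population standard deviation)

# z-score: every column ends up with mean 0 and standard deviation 1
Z = (X - mean) / std

print(Z.round(3))
```

Without this step, Assault (values in the hundreds) would dominate the Euclidean distances that both clustering algorithms rely on.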
K Means Cluster Creation

Import KMeans from SciKit Learn.

In [14]:
from sklearn.cluster import KMeans
Finding the optimal number of clusters using Elbow method
In [15]:
cluster_range = range(1, 10)
cluster_errors = []
for num_clusters in cluster_range:
    # Fit K-means for each candidate cluster count and record the inertia
    # (within-cluster sum of squared distances)
    kmeans_index = KMeans(n_clusters=num_clusters)
    kmeans_index.fit(scaled_data)
    cluster_errors.append(kmeans_index.inertia_)

clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors})
In [18]:
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
Out[18]:
Elbow Method Chart

Elbow method conclusion

In the figure above, there is a clear bend at a cluster count of 2, and the next significant bend is at 4, although the slope barely changes between 2 and 4. Let's cluster the data into 4 groups.
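
Reading an elbow by eye is subjective, so it can help to print how much the inertia falls with each added cluster. The sketch below does that on 12 states copied from the table at the end of the page; it is not the full 50-state data, so the percentages will not match the chart above. The n_init and random_state values are assumptions added only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Murder, Rape, Assault for 12 states taken from the final table
X = np.array([
    [13.2, 21.2, 236],  # Alabama
    [10.0, 44.5, 263],  # Alaska
    [8.1, 31.0, 294],   # Arizona
    [8.8, 19.5, 190],   # Arkansas
    [9.0, 40.6, 276],   # California
    [15.4, 31.9, 335],  # Florida
    [2.2, 11.3, 56],    # Iowa
    [2.1, 7.8, 83],     # Maine
    [2.2, 11.2, 48],    # Vermont
    [7.3, 21.4, 120],   # Ohio
    [6.0, 18.0, 115],   # Kansas
    [4.9, 29.3, 159],   # Oregon
])
scaled = StandardScaler().fit_transform(X)

errors = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    errors.append(model.inertia_)

# Percentage of the previous inertia removed by adding one more cluster
drops = [100 * (errors[i - 1] - errors[i]) / errors[i - 1]
         for i in range(1, len(errors))]
for k, drop in zip(range(2, 7), drops):
    print(f"k={k}: inertia falls by {drop:.1f}%")
```

A cluster count after which the drops become small is a candidate elbow. This is a rule of thumb rather than a formal test.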

Create an instance of a K Means model with 4 clusters and fit the scaled data.
In [19]:
kmeans = KMeans(n_clusters=4)
kmeans.fit(scaled_data)
Out[19]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
In [23]:
df['Cluster'] = kmeans.labels_
df.head()
Out[23]:
State Murder Assault UrbanPop Rape Total Cluster
0 Alabama 13.2 236 58 21.2 270.4 1
1 Alaska 10.0 263 48 44.5 317.5 2
2 Arizona 8.1 294 80 31.0 333.1 2
3 Arkansas 8.8 190 50 19.5 218.3 0
4 California 9.0 276 91 40.6 325.6 2
In [26]:
# numeric_only=True skips the text State column when averaging
df.groupby('Cluster').mean(numeric_only=True)
Out[26]:
Murder Assault UrbanPop Rape Total
Cluster
0 6.588235 145.764706 68.235294 20.076471 172.429412
1 13.633333 258.166667 64.666667 23.925000 295.725000
2 10.100000 261.285714 74.571429 38.285714 309.671429
3 3.078571 80.928571 58.500000 11.800000 95.807143
Groups 1 and 2 have very similar crime rates (except for reported rapes), so they will be merged into one cluster.
In [29]:
# Merge cluster 2 into cluster 1 and renumber cluster 3 as 2
df['Cluster'] = df['Cluster'].replace({2: 1, 3: 2})
df.groupby('Cluster').mean(numeric_only=True)
Out[29]:
Murder Assault UrbanPop Rape Total
Cluster
0 6.588235 145.764706 68.235294 20.076471 172.429412
1 12.331579 259.315789 68.315789 29.215789 300.863158
2 3.078571 80.928571 58.500000 11.800000 95.807143

Cluster Classification

  • 2: Low crime zone states
  • 0: Medium crime zone states
  • 1: High crime zone states
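
K-means numbers its clusters arbitrarily, so the mapping above can change from run to run when random_state is not fixed. A hedged alternative is to rank the clusters by their mean Total and name them from that ranking. The sketch below uses six states whose merged cluster numbers follow from the tables on this page; df_demo and zone_names are hypothetical names.

```python
import pandas as pd

# Total and merged cluster number for six states (values from this page)
df_demo = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arkansas", "Ohio", "Iowa", "Maine"],
    "Total": [270.4, 317.5, 218.3, 148.7, 69.5, 92.9],
    "Cluster": [1, 1, 0, 0, 2, 2],
})

# Order cluster ids from lowest to highest mean Total
ordered_ids = df_demo.groupby("Cluster")["Total"].mean().sort_values().index

# Pair the ordered ids with zone names
zone_names = {int(cid): name
              for cid, name in zip(ordered_ids, ["Low", "Medium", "High"])}

df_demo["Zone"] = df_demo["Cluster"].map(zone_names)
print(zone_names)
print(df_demo)
```

With this approach the labels stay correct even if a rerun assigns different numbers to the same groups.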
In [40]:
df['Cluster'] = df['Cluster'].replace({0: 'Medium', 1: 'High', 2: 'Low'})
df.head()
Out[40]:
State Murder Assault UrbanPop Rape Total Cluster
0 Alabama 13.2 236 58 21.2 270.4 High
1 Alaska 10.0 263 48 44.5 317.5 High
2 Arizona 8.1 294 80 31.0 333.1 High
3 Arkansas 8.8 190 50 19.5 218.3 Medium
4 California 9.0 276 91 40.6 325.6 High

Hierarchical Agglomerative Clustering

Now let's try to cluster our data using the hierarchical agglomerative clustering approach.

In [36]:
from sklearn.cluster import AgglomerativeClustering
ward = AgglomerativeClustering(n_clusters=4, linkage='ward').fit(scaled_data)
df['Cluster_H'] = ward.labels_
df.groupby('Cluster_H').mean(numeric_only=True)
Out[36]:
Murder Assault UrbanPop Rape Total Cluster
Cluster_H
0 12.762500 256.875000 66.875000 25.843750 295.481250 0.937500
1 6.105263 138.000000 69.578947 18.847368 162.952632 0.315789
2 2.736364 73.727273 53.363636 10.927273 87.390909 2.000000
3 9.775000 248.750000 74.500000 42.450000 300.975000 1.000000

Clusters 0 and 3 have similar values, so we merge them and label the resulting clusters as high, medium, and low crime rate zones.

In [43]:
df['Cluster_H'] = df['Cluster_H'].replace({0: 'High', 1: 'Medium', 2: 'Low', 3: 'High'})

The results from K-means and hierarchical clustering are quite similar, differing for only a few states. Refer to the table towards the end of the page.
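
A quick way to see where the two methods disagree is to filter the rows whose labels differ and to cross-tabulate the two label columns. The sketch below uses eight rows copied from the table at the end of the page, with the column names used there (Cluster_Kmeans, Cluster_H); the variable names labels, disagree and agreement are illustrative.

```python
import pandas as pd

# Eight states and their labels, copied from the final comparison table
labels = pd.DataFrame({
    "State": ["Alabama", "Texas", "Iowa", "Ohio",
              "Nebraska", "Rhode Island", "Hawaii", "Missouri"],
    "Cluster_Kmeans": ["High", "High", "Low", "Medium",
                       "Low", "Low", "Low", "Medium"],
    "Cluster_H": ["High", "High", "Low", "Medium",
                  "Medium", "Medium", "Medium", "High"],
})

# States where K-means and hierarchical clustering give different zones
disagree = labels[labels["Cluster_Kmeans"] != labels["Cluster_H"]]
print(disagree)

# Contingency table: off-diagonal cells count the disagreements
agreement = pd.crosstab(labels["Cluster_Kmeans"], labels["Cluster_H"])
print(agreement)
```

On the full table the same filter would list every state that the two methods place in different zones.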

Now using Hierarchical Agglomerative Clustering with a cluster count of 3

In [58]:
ward_c3 = AgglomerativeClustering(n_clusters=3, linkage='ward').fit(scaled_data)
df['Cluster_H_c3'] = ward_c3.labels_
df.groupby('Cluster_H_c3').mean(numeric_only=True)
Out[58]:
Murder Assault UrbanPop Rape Total
Cluster_H_c3
0 12.165000 255.250000 68.400000 29.165000 296.580000
1 6.105263 138.000000 69.578947 18.847368 162.952632
2 2.736364 73.727273 53.363636 10.927273 87.390909
In [61]:
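df['Cluster_H_c3'] = df['Cluster_H_c3'].replace({0: 'High', 1: 'Medium', 2: 'Low'})
# Rename the K-means label column to match the comparison table below
df = df.rename(columns={'Cluster': 'Cluster_Kmeans'})
df.sort_values('Cluster_Kmeans')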
Out[61]:
State Murder Assault UrbanPop Rape Total Cluster_Kmeans Cluster_H Cluster_H_c3
0 Alabama 13.2 236 58 21.2 270.4 High High High
21 Michigan 12.1 255 74 35.1 302.2 High High High
42 Texas 12.7 201 80 25.5 239.2 High High High
19 Maryland 11.3 300 67 27.8 339.1 High High High
41 Tennessee 13.2 188 59 26.9 228.1 High High High
17 Louisiana 15.4 249 66 22.2 286.6 High High High
27 Nevada 12.2 252 81 46.0 310.2 High High High
12 Illinois 10.4 249 83 24.0 283.4 High High High
31 New York 11.1 254 86 26.1 291.2 High High High
30 New Mexico 11.4 285 70 32.1 328.5 High High High
8 Florida 15.4 335 80 31.9 382.3 High High High
32 North Carolina 13.0 337 45 16.1 366.1 High High High
5 Colorado 7.9 204 78 38.7 250.6 High High High
4 California 9.0 276 91 40.6 325.6 High High High
39 South Carolina 14.4 279 48 22.5 315.9 High High High
2 Arizona 8.1 294 80 31.0 333.1 High High High
1 Alaska 10.0 263 48 44.5 317.5 High High High
9 Georgia 17.4 211 60 25.8 254.2 High High High
23 Mississippi 16.1 259 44 17.1 292.2 High High High
48 Wisconsin 2.6 53 66 10.8 66.4 Low Low Low
40 South Dakota 3.8 86 45 12.8 102.6 Low Low Low
28 New Hampshire 2.1 57 56 9.5 68.6 Low Low Low
33 North Dakota 0.8 45 44 7.3 53.1 Low Low Low
26 Nebraska 4.3 102 62 16.5 122.8 Low Medium Medium
18 Maine 2.1 83 51 7.8 92.9 Low Low Low
38 Rhode Island 3.4 174 87 8.3 185.7 Low Medium Medium
44 Vermont 2.2 48 32 11.2 61.4 Low Low Low
14 Iowa 2.2 56 57 11.3 69.5 Low Low Low
11 Idaho 2.6 120 54 14.2 136.8 Low Low Low
10 Hawaii 5.3 46 83 20.2 71.5 Low Medium Medium
6 Connecticut 3.3 110 77 11.1 124.4 Low Low Low
47 West Virginia 5.7 81 39 9.3 96.0 Low Low Low
22 Minnesota 2.7 72 66 14.9 89.6 Low Low Low
43 Utah 3.2 120 80 22.9 146.1 Medium Medium Medium
45 Virginia 8.5 156 63 20.7 185.2 Medium Medium Medium
46 Washington 4.0 145 73 26.2 175.2 Medium Medium Medium
37 Pennsylvania 6.3 106 72 14.9 127.2 Medium Medium Medium
24 Missouri 9.0 178 70 28.2 215.2 Medium High High
35 Oklahoma 6.6 151 68 20.0 177.6 Medium Medium Medium
34 Ohio 7.3 120 75 21.4 148.7 Medium Medium Medium
29 New Jersey 7.4 159 89 18.8 185.2 Medium Medium Medium
25 Montana 6.0 109 53 16.4 131.4 Medium Medium Medium
20 Massachusetts 4.4 149 85 16.3 169.7 Medium Medium Medium
16 Kentucky 9.7 109 52 16.3 135.0 Medium Medium Medium
15 Kansas 6.0 115 66 18.0 139.0 Medium Medium Medium
13 Indiana 7.2 113 65 21.0 141.2 Medium Medium Medium
7 Delaware 5.9 238 72 15.8 259.7 Medium Medium Medium
3 Arkansas 8.8 190 50 19.5 218.3 Medium Medium Medium
36 Oregon 4.9 159 67 29.3 193.2 Medium Medium Medium
49 Wyoming 6.8 161 60 15.6 183.4 Medium Medium Medium

NOTE: Although we asked for only 3 clusters this time, the algorithm produced the same grouping that we obtained manually when reducing the clusters from 4 to 3. So even though the elbow plot showed a noticeable bend at 4, the two groups that hardly differed from each other were merged automatically.
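
The reason for this behaviour is that agglomerative clustering builds a single merge tree, so the 3-cluster solution is the 4-cluster solution with its two closest groups joined. The sketch below checks that nesting on 12 states copied from the table above. It is a small sample, so the groups themselves need not match the full 50-state result; labels4, labels3 and parents are illustrative names.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Murder, Rape, Assault for 12 states taken from the table above
X = np.array([
    [13.2, 21.2, 236],  # Alabama
    [10.0, 44.5, 263],  # Alaska
    [8.1, 31.0, 294],   # Arizona
    [8.8, 19.5, 190],   # Arkansas
    [9.0, 40.6, 276],   # California
    [15.4, 31.9, 335],  # Florida
    [2.2, 11.3, 56],    # Iowa
    [2.1, 7.8, 83],     # Maine
    [2.2, 11.2, 48],    # Vermont
    [7.3, 21.4, 120],   # Ohio
    [6.0, 18.0, 115],   # Kansas
    [4.9, 29.3, 159],   # Oregon
])
scaled = StandardScaler().fit_transform(X)

labels4 = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(scaled)
labels3 = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(scaled)

# For each 4-cluster group, collect the 3-cluster labels its members received
parents = {int(c): {int(p) for p in labels3[labels4 == c]}
           for c in np.unique(labels4)}

# Nested means every 4-cluster group falls inside exactly one 3-cluster group
nested = all(len(p) == 1 for p in parents.values())
print(parents)
print("3-cluster solution is a merge of the 4-cluster solution:", nested)
```

K-means gives no such guarantee: refitting with a different cluster count can move states between groups rather than simply merging two of them.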