Agglomerative Clustering

Agglomerative and K-means Clustering on US Crime Data
Objective: Group US states based on crime data using the K-means clustering algorithm, then compare the results with the hierarchical agglomerative clustering algorithm.
Data Source: a crime dataset covering the 50 US states.
The Data
This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
- Murder (numeric): murder arrests per 100,000
- Assault (numeric): assault arrests per 100,000
- UrbanPop (numeric): percent urban population
- Rape (numeric): rape arrests per 100,000
Import Libraries
Import the commonly used libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Read in the data file using read_csv. Set the states column as the index.
df = pd.read_csv('crime_data.csv', index_col=0)
df['Total'] = df['Murder'] + df['Assault'] + df['Rape']
df.head()
sns.lmplot(data=df,x='UrbanPop',y='Total')
Although the points are scattered, there is a clear upward trend: total reported crime tends to rise with urbanization.
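The visual trend can be backed up with a correlation coefficient. A minimal sketch; the arrays below are illustrative stand-ins, and with the real frame you would pass df['UrbanPop'] and df['Total'] instead:

```python
import numpy as np

# Hypothetical sample values standing in for df['UrbanPop'] and df['Total'];
# Pearson's r quantifies the upward trend that lmplot shows.
urban_pop = np.array([32, 45, 58, 66, 80, 91])
total_crime = np.array([60, 110, 150, 210, 250, 310])

r = np.corrcoef(urban_pop, total_crime)[0, 1]
print(f"Pearson r = {r:.2f}")
```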
sns.heatmap(df.drop('Total',axis=1).corr(),annot=True)
Inference: Assault and Murder are highly correlated crimes.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df[['Murder','Rape','Assault']])
scaled_data = scaler.transform(df[['Murder','Rape','Assault']])
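As a quick sanity check on what the scaler does: each transformed column ends up with mean 0 and unit standard deviation, so no single crime rate dominates the Euclidean distances the clustering uses. The numbers below are illustrative, not the real dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows (Murder, Rape, Assault) standing in for the real frame.
X = np.array([[13.2, 21.2, 236.0],
              [10.0, 44.5, 263.0],
              [ 8.1, 31.0, 294.0],
              [ 8.8, 19.5, 190.0]])

# After scaling, every column has mean ~0 and standard deviation ~1.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```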
Import KMeans from SciKit Learn.
from sklearn.cluster import KMeans
cluster_range = range(1, 10)
cluster_errors = []
for num_clusters in cluster_range:
    kmeans_index = KMeans(n_clusters=num_clusters)
    kmeans_index.fit(scaled_data)
    cluster_errors.append(kmeans_index.inertia_)
clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors})
plt.figure(figsize=(12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker="o")
In the figure above, the elbow bends sharply at a cluster count of 2, and the next noticeable bend is at 4, although the slope changes very little between 2 and 4. Let's cluster the data into 4 groups.
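The "bend" can also be quantified as the percentage drop in inertia from one cluster count to the next. A sketch on synthetic blobs (an assumption for illustration; with the real data you would pass scaled_data instead of X):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with four well-separated blobs, standing in for scaled_data.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

# Inertia (within-cluster sum of squares) for each candidate k.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Percentage drop in inertia when moving from k to k+1; a large drop
# followed by small ones marks the elbow.
drops = [100 * (a - b) / a for a, b in zip(inertias, inertias[1:])]
for k, d in enumerate(drops, start=2):
    print(f"k={k}: inertia drops {d:.1f}%")
```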
kmeans = KMeans(n_clusters=4)
kmeans.fit(scaled_data)
df['Cluster'] = kmeans.labels_
df.head()
df.groupby('Cluster').mean()
Two of the clusters have very similar means, so merge them, reducing four groups to three:
df['Cluster'] = df['Cluster'].replace([2, 3], [1, 2])
df.groupby('Cluster').mean()
Cluster Classification
- 2: Low crime zone states
- 0: Medium crime zone states
- 1: High crime zone states
df['Cluster'] = df['Cluster'].replace([0, 1, 2], ['Medium', 'High', 'Low'])
df.head()
Hierarchical Agglomerative Clustering
Now let's try to cluster the data using the hierarchical agglomerative clustering approach.
from sklearn.cluster import AgglomerativeClustering

ward = AgglomerativeClustering(n_clusters=4, linkage='ward').fit(scaled_data)
df['Cluster_H'] = ward.labels_
df.groupby('Cluster_H').mean()
Clusters 0 and 3 have similar means, so merge them, and categorize the groups as high, medium, and low crime-rate zones.
df['Cluster_H'] = df['Cluster_H'].replace([0, 1, 2, 3], ['High', 'Medium', 'Low', 'High'])
The results from KMeans and hierarchical clustering are quite similar, differing in only a few states; refer to the table towards the end of the page.
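One way to build such a comparison table is pd.crosstab on the two label columns. A sketch on synthetic blobs (an assumption for illustration; with the real frame you would cross-tabulate df['Cluster'] against df['Cluster_H']):

```python
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Three well-separated synthetic blobs standing in for the scaled crime data.
X, _ = make_blobs(n_samples=60, centers=[(0, 0), (6, 0), (0, 6)],
                  cluster_std=0.8, random_state=7)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

# Each cell counts points assigned to a given (KMeans, Ward) label pair;
# near-diagonal mass (after relabelling) means the methods agree.
ct = pd.crosstab(pd.Series(km_labels, name='KMeans'),
                 pd.Series(hc_labels, name='Ward'))
print(ct)
```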
Now let's run hierarchical agglomerative clustering with a cluster count of 3.
ward_c3 = AgglomerativeClustering(n_clusters=3, linkage='ward').fit(scaled_data)
df['Cluster_H_c3'] = ward_c3.labels_
df.groupby('Cluster_H_c3').mean()
df['Cluster_H_c3'] = df['Cluster_H_c3'].replace([0, 1, 2], ['High', 'Medium', 'Low'])
df.sort_values('Cluster')
NOTE: With only 3 clusters requested, the algorithm performed the same merge we did manually when reducing from 4 clusters to 3. So even though the elbow plot showed a marked bend at 4, the two clusters with hardly any separation were merged automatically.
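This automatic merging follows from how agglomerative clustering works: it builds a single merge tree bottom-up, and the 3-cluster solution is just a higher cut of the same tree as the 4-cluster one. A sketch using SciPy's linkage/fcluster on synthetic data (an assumption for illustration; substitute scaled_data for X on the real data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four synthetic blobs standing in for scaled_data.
rng = np.random.default_rng(0)
X = np.vstack([np.array(c) + rng.normal(scale=0.3, size=(20, 2))
               for c in [(0, 0), (4, 0), (0, 4), (4, 4)]])

# Build the full Ward merge tree once, then cut it at two depths.
Z = linkage(X, method='ward')
labels4 = fcluster(Z, t=4, criterion='maxclust')
labels3 = fcluster(Z, t=3, criterion='maxclust')

# Cutting at 3 merges exactly two of the 4-cluster groups; every
# 4-cluster group sits entirely inside one 3-cluster group.
print(len(set(labels4)), len(set(labels3)))
```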