Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique: it reduces the number of features (dimensions) while retaining as much of the original variance of the data set as possible. The resulting principal components are ordered by the amount of variance they explain, so the first component captures the most variance.
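As a quick standalone illustration (separate from the walkthrough below, using made-up random data), fitting a PCA and printing explained_variance_ratio_ shows that the components are indeed ordered from most to least variance:
import numpy as np
from sklearn.decomposition import PCA
# Hypothetical random data: 100 samples, 5 features
rng = np.random.RandomState(0)
X_demo = rng.randn(100, 5)
demo_pca = PCA().fit(X_demo)
# The explained variance ratios are always sorted in descending order
print(demo_pca.explained_variance_ratio_)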
Data: We will use the same Breast Cancer dataset we used in Support Vector Machines, which is provided with the sklearn datasets module. Refer to the SVM file for a description of the data.
Pros and Cons of PCA
Advantages
- Improves performance: You could be working with hundreds of features, which can take a long time to process. PCA reduces the overall number of features
- Removes correlated features: Saves you the effort of resolving multicollinearity in your data
- Makes visualization feasible: You can reduce the dataset to two features and plot them as a 2-dimensional chart
Disadvantages
- Impacts model interpretability: It is difficult to explain the outcome of a model based on PCA features compared to the original feature set
- Possibility of loss of variance: We need to select an adequate number of principal components to keep the loss of variance acceptably small (see the sketch after this list)
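As a rough sketch of how that last point is usually handled (an assumed illustration, not part of the original walkthrough; X_any is a stand-in for any standardized feature matrix and the 95% threshold is only illustrative), the cumulative explained variance can guide the choice of n_components:
import numpy as np
from sklearn.decomposition import PCA
# X_any stands in for any standardized feature matrix
rng = np.random.RandomState(0)
X_any = rng.randn(200, 10)
cum_var = np.cumsum(PCA().fit(X_any).explained_variance_ratio_)
# Smallest number of components that explains at least 95% of the variance
n_keep = int(np.argmax(cum_var >= 0.95)) + 1
print(n_keep, cum_var)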
Now let's look at the practical application of PCA.
Load Data
Import the required libraries.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
Load the cancer dataset.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
Have a look at the type of information available with the cancer dataset.
cancer.keys()
Out:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Create a dataframe from the dataset.
df = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
Take a look at the data from the dataset.
df.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | … | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | … | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | … | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | … | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | … | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | … | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
Data Standardization
Before applying PCA, we need to scale our data so that each feature has zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
Out:
StandardScaler(copy=True, with_mean=True, with_std=True)
Standardize the data.
scaled_data = scaler.transform(df)
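As a quick sanity check (an added step, not in the original), the scaled columns should now have a mean of roughly 0 and a standard deviation of roughly 1:
print(scaled_data.mean(axis=0).round(2))  # approximately all zeros
print(scaled_data.std(axis=0).round(2))   # approximately all ones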
Applying PCA dimensionality reduction
The steps involved in PCA dimensionality reduction are:
- instantiate a PCA object
- find the principal components using the fit() method
- apply the rotation and dimensionality reduction by calling transform()
We can limit the number of components retained by passing the n_components parameter when creating the PCA object.
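As a side note (a hedged sketch, not one of the original steps), scikit-learn also accepts a float between 0 and 1 for n_components; it then keeps the smallest number of components needed to explain that fraction of the variance. The 0.95 threshold and the variable names below are illustrative:
from sklearn.decomposition import PCA
# Keep enough components to explain 95% of the variance (illustrative threshold)
pca_95 = PCA(n_components=0.95, svd_solver='full')
x_pca_95 = pca_95.fit_transform(scaled_data)
print(x_pca_95.shape)  # number of retained components depends on the data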
Now apply the above steps to the standardized data.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
Out:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
Now transform the data onto its first 2 principal components.
Note: We have reduced to only 2 dimensions so that we can easily plot the components in 2-dimensional space. In the real world you might have to work with more components.
x_pca = pca.transform(scaled_data)
scaled_data.shape
Out:
(569, 30)
So the input dataset had 30 columns. Now check the number of columns after PCA dimensionality reduction.
x_pca.shape
Out:
(569, 2)
As we can see, the number of dimensions has been reduced to only two columns.
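To see how much of the original variance these two components actually retain (an added check, not in the original), inspect explained_variance_ratio_:
# Fraction of total variance captured by each of the two components
print(pca.explained_variance_ratio_)
# Combined share of variance retained after the reduction
print(pca.explained_variance_ratio_.sum())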
PCA Visualization
Plotting the two dimensions.
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=cancer['target'],cmap='plasma')
plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')
Out: [scatter plot of the first two principal components, colored by target class]
Using these two components we can separate the two classes.
Interpreting the components
A drawback of dimensionality reduction is that it is hard to interpret what these components represent, since each component corresponds to a combination of the original features.
pca.components_
Each row represents a principal component, and each column corresponds to an original feature. We can visualize the relationship with a heatmap:
df_comp = pd.DataFrame(pca.components_,columns=cancer['feature_names'])
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma')
This heatmap shows how strongly each original feature is weighted in each of the two principal components.
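To make the components a bit easier to read (an assumed follow-up, not in the original), we can also list the original features with the largest absolute weights in each component:
# For each principal component, show the five most strongly weighted features
for i in range(df_comp.shape[0]):
    top_features = df_comp.iloc[i].abs().sort_values(ascending=False).head(5)
    print(f"Principal component {i + 1}:")
    print(top_features, "\n")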