Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique: it reduces the number of features (dimensions) while retaining as much of the original variance of the data set as possible. The resulting principal components are ordered by the amount of variance they explain, so the first component captures the most variance.
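As a quick standalone illustration (separate from the walkthrough below, using made-up random data), fitting a PCA and printing explained_variance_ratio_ shows that the components are indeed ordered from most to least variance:
import numpy as np
from sklearn.decomposition import PCA
# Hypothetical random data: 100 samples, 5 features
rng = np.random.RandomState(0)
X_demo = rng.randn(100, 5)
demo_pca = PCA().fit(X_demo)
# The explained variance ratios are always sorted in descending order
print(demo_pca.explained_variance_ratio_)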
Data: We will use the same Breast Cancer dataset we used in Support Vector Machines, which is provided with the sklearn datasets module. Refer to the SVM file for a description of the data.
Pros and Cons of PCA
Advantages
- Improves performance: You could be working with hundreds of features, which can take a long time to process. PCA reduces the overall number of features
- Removes correlated features: Saves you the effort of resolving multicollinearity in your data
- Makes visualization feasible: You can reduce the dataset to two features and plot them as a 2-dimensional chart
Disadvantages
- Impacts model interpretability: It is difficult to explain the outcome of a model based on PCA features compared to the original feature set
- Possibility of loss of variance: We need to select an adequate number of principal components to keep the loss of variance acceptably small (see the sketch after this list)
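As a rough sketch of how that last point is usually handled (an assumed illustration, not part of the original walkthrough; X_any is a stand-in for any standardized feature matrix and the 95% threshold is only illustrative), the cumulative explained variance can guide the choice of n_components:
import numpy as np
from sklearn.decomposition import PCA
# X_any stands in for any standardized feature matrix
rng = np.random.RandomState(0)
X_any = rng.randn(200, 10)
cum_var = np.cumsum(PCA().fit(X_any).explained_variance_ratio_)
# Smallest number of components that explains at least 95% of the variance
n_keep = int(np.argmax(cum_var >= 0.95)) + 1
print(n_keep, cum_var)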
Now let's look at the practical application of PCA.
Load Data
Import the required libraries.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
Load the cancer dataset.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
Have a look at the type of information available with the cancer dataset.
cancer.keys()
Out:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Create a dataframe from the dataset.
df = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
Take a look at the data from the dataset.
df.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | … | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | … | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | … | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | … | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | … | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | … | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
Data Standardization
Before applying PCA, we need to scale our data so that each feature has zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
Out:
StandardScaler(copy=True, with_mean=True, with_std=True)
Standardize the data.
scaled_data = scaler.transform(df)
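As a quick sanity check (an added step, not in the original), the scaled columns should now have a mean of roughly 0 and a standard deviation of roughly 1:
print(scaled_data.mean(axis=0).round(2))  # approximately all zeros
print(scaled_data.std(axis=0).round(2))   # approximately all ones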
Applying PCA dimensionality reduction
The steps involved in PCA dimensionality reduction are:
- instantiate a PCA object
- find the principal components using the fit() method
- apply the rotation and dimensionality reduction by calling transform()
We can limit the number of components retained by passing the n_components parameter when creating the PCA object.
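As a side note (a hedged sketch, not one of the original steps), scikit-learn also accepts a float between 0 and 1 for n_components; it then keeps the smallest number of components needed to explain that fraction of the variance. The 0.95 threshold and the variable names below are illustrative:
from sklearn.decomposition import PCA
# Keep enough components to explain 95% of the variance (illustrative threshold)
pca_95 = PCA(n_components=0.95, svd_solver='full')
x_pca_95 = pca_95.fit_transform(scaled_data)
print(x_pca_95.shape)  # number of retained components depends on the data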
Now apply the above steps to the standardized data.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
Out:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
Now transform the data onto its first 2 principal components.
Note: We have reduced to only 2 dimensions so that we can easily plot the components in 2-dimensional space. In the real world you might have to work with more components.
x_pca = pca.transform(scaled_data)
scaled_data.shape
Out:
(569, 30)
So the input dataset had 30 columns. Now check the number of columns after PCA dimensionality reduction.
x_pca.shape
Out:
(569, 2)
As we can see, the number of dimensions has been reduced to only two columns.
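To see how much of the original variance these two components actually retain (an added check, not in the original), inspect explained_variance_ratio_:
# Fraction of total variance captured by each of the two components
print(pca.explained_variance_ratio_)
# Combined share of variance retained after the reduction
print(pca.explained_variance_ratio_.sum())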
PCA Visualization
Plotting the two dimensions.
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=cancer['target'],cmap='plasma')
plt.xlabel('First principal component')
plt.ylabel('Second Principal Component')
Out: [scatter plot of the first two principal components, colored by target class]
Using these two components we can separate the two classes.
Interpreting the components
A drawback of dimensionality reduction is that it is hard to interpret what these components represent, since each component corresponds to a combination of the original features.
pca.components_
Each row represents a principal component, and each column corresponds to an original feature. We can visualize the relationship with a heatmap:
df_comp = pd.DataFrame(pca.components_,columns=cancer['feature_names'])
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma')
This heatmap shows how strongly each original feature is weighted in each of the two principal components.
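To make the components a bit easier to read (an assumed follow-up, not in the original), we can also list the original features with the largest absolute weights in each component:
# For each principal component, show the five most strongly weighted features
for i in range(df_comp.shape[0]):
    top_features = df_comp.iloc[i].abs().sort_values(ascending=False).head(5)
    print(f"Principal component {i + 1}:")
    print(top_features, "\n")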