Hi! If you are reading this, you are either:

• not a data scientist but want to understand why PCA should be used
• a data scientist but want to explain to a colleague why PCA should be used

I recently realised that explaining Principal Component Analysis (PCA) to non-scientists is a universally common experience that all data scientists eventually do.

I will explain when PCA is needed, what PCA does, an example with PCA, what benefits PCA brings, and how to interpret the results of PCA.

I will not detail how PCA works, how to use PCA, or whether to use PCA or other methods, but I link nice resources in the FAQ.

## Why is PCA needed?

PCA reduces the number of dimensions of the dataset.

As an analogy, like how shadows are a 2D representation of 3D objects, PCA finds the best "angle" to cast the "shadow" of the dataset. Credits to Evan Qu.

Thus, PCA is needed if you:

• have a very high-dimensional dataset (e.g. ≥100 features)
• cannot afford a large amount of compute (minutes vs hours vs days)
• want to make sense of many features (e.g. to select important features)

By applying PCA, you:

• save time (during model training and model iteration)
• save space (in memory and model parameters)
• trade-off a little bit of performance

## What does PCA do?

In a sentence:

PCA produces new features as a weighted sum of the original features in the dataset, by preserving variance in the original features.

What is a weighted sum?

$X = w_1\cdot x_1 + \dotsb + w_n\cdot x_n$

$$X$$ is a sum of values $$x_1,\dotsc, x_n$$, each weighted by $$w_1,\dotsc, w_n$$ respectively. Likewise, PCA produces new features, e.g. $$X_1, X_2,\dotsc$$, as weighted sums of original features $$x_1,\dotsc, x_n$$.

Why preserve variance?

Preserving variance in the original features retains information about the spread of features in the data, usually resulting in good representations of original features for machine learning.

## Show me an example!

Let’s walk through PCA with the Boston Housing dataset:

Given a dataset of 37 numerical features of houses, you want to predict its sale price. However, dealing with so many features can be (imagine) expensive in compute and difficult to interpret.

Code
Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea ... SalePrice
0 1 60 65.0 8450 7 5 2003 2003 196.0 ... 208500
1 2 20 80.0 9600 6 8 1976 1976 0.0 ... 181500
2 3 60 68.0 11250 7 5 2001 2002 162.0 ... 223500
3 4 70 60.0 9550 7 5 1915 1970 0.0 ... 140000
4 5 60 84.0 14260 8 5 2000 2000 350.0 ... 250000

5 rows x 38 columns

Because PCA reduces the number of dimensions of the dataset, model training can take a shorter time without sacrificing too much performance.

Let’s run PCA on the 37 features:

Code
Original features (normalized):
[-0.646  0.136  0.171 -0.252  1.297 -0.523  1.221  1.129  0.026  0.221
-0.283 -0.675 -0.531 -0.856  1.734 -0.106  0.784  1.175 -0.233  0.79
1.238  0.175 -0.202  0.9   -0.943  1.214  0.211  0.271 -0.746  1.613
-0.363 -0.114 -0.285 -0.074 -0.146  2.113  0.879]

PCA features:
[ 2.141 -1.273]


## What benefits does PCA bring? At what trade-off?

Let’s train 2 linear regression models:

1. Trained on the original 37-dimensional features
2. Trained on the 2-dimensional features produced by PCA

Even though PCA reduces the number of dimensions from 37 to merely 2, there is a small decrease in test error of only -0.02 (RMSE of log of sale price as evaluated on Kaggle).

Code
Original Features (37 dimensions)
Train error: 0.15
Test error: 0.1348

PCA features (2 dimensions)
Train error: 0.1935
Test error: 0.1562


#### Training Time: A Major Speed-up

With PCA, because number of features is drastically reduced (from 37 to 2), training time taken is also reduced from 4.85ms to 584µs, a 8300X speedup!

Code (Linear regression with original features)
4.85 ms ± 276 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Code (Linear regression with PCA features)
584 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


#### Memory Space: A Significant Reduction

Because the no. of dimensions decreased by 18.5X, so did memory space used, from 265Kb to 14Kb.

Code
Memory size of data w/o PCA: 265216 bytes
Memory size of data w/ PCA: 14336 bytes


## How to interpret PCA results

Recall what PCA does:

PCA produces new features as a weighted sum of original features, by preserving variance in the original features.

PCA produces new features, $$X_1, X_2$$ as a weighted sum of the original features, $$x_1, x_2, x_3$$.

In other words, we can get $$X_1$$ and $$X_2$$ using $$PC_1$$ and $$PC_2$$, by computing a weighted sum of $$[x_1, x_2, x_3]$$.

$X_1 = w_{11}x_1 + w_{12}x_2 + w_{13}x_3$ $X_2 = w_{21}x_1 + w_{22}x_2 + w_{23}x_3$

Compare weighted sums with PCA features:

Code
PC1 weighted sum: 1.515
PC2 weighted sum: 0.968
PCA features: [1.515 0.968]


Inspecting principal component 1, the highest weighted feature is OverallQual. This means that OverallQual has the highest variance of all features in the dataset, and is likely a useful feature for sale price prediction.

Sort features by the 1st principal component’s weights:

Code

## FAQ

How does PCA preserve variance? WIP
When should I not use PCA? WIP
What are principal components? \begin{align} \begin{bmatrix}X_1\\X_2\end{bmatrix} =& \begin{bmatrix}PC_1\\PC_2\end{bmatrix} \begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\\ =& \begin{bmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23} \end{bmatrix} \begin{bmatrix}x_1\\x_2\\x_3\end{bmatrix}\\ =& \begin{bmatrix} w_{11}x_1 + w_{12}x_2 + w_{13}x_3\\ w_{21}x_1 + w_{22}x_2 + w_{23}x_3 \end{bmatrix} \end{align}