# PCA Explained Simply

Hi! If you are reading this, you are either:

- not a data scientist but want to understand why PCA should be used
- a data scientist but want to explain to a colleague why PCA should be used

I recently realised that explaining Principal Component Analysis (PCA) to non-scientists is something almost every data scientist eventually has to do.

I will cover when PCA is needed, what PCA does, an example with PCA, what benefits PCA brings, and how to interpret the results of PCA.

I will **not** detail how PCA works, how to use PCA, or whether to use PCA over other methods, but I link to some helpful resources in the FAQ.

If you have questions not addressed in the FAQ, please leave a comment and I will answer you.

## Why is PCA needed?

PCA reduces the number of dimensions of the dataset.

Thus, PCA is needed if you:

- have a very high-dimensional dataset (e.g. ≥100 features)
- cannot afford a large amount of compute (minutes vs hours vs days)
- want to make sense of many features (e.g. to select important features)

By applying PCA, you:

- save time (during model training and model iteration)
- save space (in memory and model parameters)
- trade off a little bit of performance

## What does PCA do?

In a sentence:

PCA produces new features as a **weighted sum** of the original features in the dataset, by **preserving variance** in the original features.

What is a **weighted sum**?

\(X\) is a sum of values \(x_1,\dotsc, x_n\), each weighted by \(w_1,\dotsc, w_n\) respectively. Likewise, PCA produces new features, e.g. \(X_1, X_2,\dotsc\), as weighted sums of original features \(x_1,\dotsc, x_n\).
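
Written out, that is:

\[X = w_1 x_1 + w_2 x_2 + \dotsb + w_n x_n\]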

Why **preserve variance**?

Preserving variance in the original features retains information about the spread of features in the data, usually resulting in good representations of original features for machine learning.
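
If you want to see how much variance is actually preserved, scikit-learn exposes it directly. Here is a minimal sketch (the random data and variable names are just for illustration, not from this post):

```
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 100 samples with 10 numerical features
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))

# PCA is sensitive to scale, so standardize the features first
data_scaled = StandardScaler().fit_transform(data)

# Keep the first 2 principal components
pca = PCA(n_components=2).fit(data_scaled)

# Fraction of the original variance each new feature preserves
print(pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())
```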

## Show me an example!

Let’s walk through PCA with the Kaggle House Prices dataset:

Given a dataset with 37 numerical features describing each house, you want to predict its sale price. However, imagine that dealing with so many features is expensive in compute and difficult to interpret.

## Code

|   | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | ... | SalePrice |
|---|----|------------|-------------|---------|-------------|-------------|-----------|--------------|------------|-----|-----------|
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | ... | 208500 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | ... | 181500 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | ... | 223500 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | ... | 140000 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | ... | 250000 |

5 rows × 38 columns
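
The preview above is the kind of output you get from `pandas`. A minimal sketch of how it might be produced, assuming the Kaggle House Prices `train.csv` is available locally (the file path and cleanup steps are my assumptions, not necessarily what this post used):

```
import pandas as pd

# Load the housing data (path is an assumption)
df = pd.read_csv("train.csv")

# Keep the numerical columns: 37 features plus the SalePrice target
df = df.select_dtypes(include="number")
print(df.head())

# Split into features and target; PCA cannot handle missing values
y = df["SalePrice"]
X = df.drop(columns=["SalePrice"]).fillna(0)
```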

Because PCA reduces the number of dimensions of the dataset, model training takes less time without sacrificing too much performance.

Let’s run PCA on the 37 features:

## Code

```
Original features (normalized):
[-0.646 0.136 0.171 -0.252 1.297 -0.523 1.221 1.129 0.026 0.221
-0.283 -0.675 -0.531 -0.856 1.734 -0.106 0.784 1.175 -0.233 0.79
1.238 0.175 -0.202 0.9 -0.943 1.214 0.211 0.271 -0.746 1.613
-0.363 -0.114 -0.285 -0.074 -0.146 2.113 0.879]
PCA features:
[ 2.141 -1.273]
```
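
A sketch of how the output above could be produced with scikit-learn, continuing from the loading sketch (the exact preprocessing in the original notebook may differ):

```
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Normalize the 37 features, then project each house onto 2 PCA features
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Original features (normalized):")
print(X_scaled[0].round(3))
print("PCA features:")
print(X_pca[0].round(3))
```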

## What benefits does PCA bring? At what trade-off?

Let’s train 2 linear regression models:

- Trained on the original 37-dimensional features
- Trained on the 2-dimensional features produced by PCA

#### Performance: A Tiny Trade-off

Even though PCA reduces the number of dimensions from 37 to merely 2, test error increases by only about **0.02** (RMSE of the log of sale price, as evaluated on Kaggle).

## Code

```
Original Features (37 dimensions)
Train error: 0.15
Test error: 0.1348
PCA features (2 dimensions)
Train error: 0.1935
Test error: 0.1562
```
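
A rough sketch of how the two models might be trained and scored, assuming `X_scaled`, `X_pca`, and `y` from the earlier sketches (the post's exact train/test split and Kaggle submission are not reproduced here):

```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# The Kaggle metric is RMSE on the log of the sale price
log_y = np.log(y)

for name, features in [("Original (37 dims)", X_scaled), ("PCA (2 dims)", X_pca)]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, log_y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name}: test RMSE of log(SalePrice) = {rmse:.4f}")
```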

#### Training Time: A Major Speed-up

With PCA, because the number of features is drastically reduced (from 37 to 2), training time also drops from 4.85 ms to 584 µs, roughly an **8X** speedup!

## Code (Linear regression with original features)

```
4.85 ms ± 276 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

## Code (Linear regression with PCA features)

```
584 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
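
Those numbers look like output from IPython's `%timeit` magic. A plain-Python sketch of the same measurement using the standard `timeit` module (assuming the variables from the sketches above; absolute times will differ by machine):

```
import timeit
from sklearn.linear_model import LinearRegression

# Average time for one model fit on each feature set
for name, features in [("Original (37 dims)", X_scaled), ("PCA (2 dims)", X_pca)]:
    seconds = timeit.timeit(
        lambda: LinearRegression().fit(features, log_y), number=100
    ) / 100
    print(f"{name}: {seconds * 1e3:.3f} ms per fit")
```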

#### Memory Space: A Significant Reduction

Because the number of dimensions decreased by 18.5X, so did the memory used, **from roughly 265 KB to 14 KB**.

## Code

```
Memory size of data w/o PCA: 265216 bytes
Memory size of data w/ PCA: 14336 bytes
```
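
These sizes can be read straight off the arrays (assuming `X_scaled` and `X_pca` are the NumPy arrays from the sketches above):

```
# Total bytes occupied by each feature matrix
print("Memory size of data w/o PCA:", X_scaled.nbytes, "bytes")
print("Memory size of data w/ PCA: ", X_pca.nbytes, "bytes")
```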

## How to interpret PCA results

Recall what PCA does:

PCA produces new features as a **weighted sum** of original features, by **preserving variance** in the original features.

For illustration, suppose PCA produces two new features, \(X_1\) and \(X_2\), as weighted sums of three original features, \(x_1, x_2, x_3\).

In other words, we can get \(X_1\) and \(X_2\) from the weight vectors \(PC_1\) and \(PC_2\) (the principal components), by computing a weighted sum of \([x_1, x_2, x_3]\).

\[X_1 = w_{11}x_1 + w_{12}x_2 + w_{13}x_3\]

\[X_2 = w_{21}x_1 + w_{22}x_2 + w_{23}x_3\]

Compare the weighted sums with the PCA features:

## Code

```
PC1 weighted sum: 1.515
PC2 weighted sum: 0.968
PCA features: [1.515 0.968]
```
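
The weights live in `pca.components_` (one row of weights per principal component), so the check above is just two dot products. A sketch, continuing from the PCA fit earlier (the row picked is illustrative):

```
import numpy as np

row = X_scaled[0]                            # one house's 37 normalized features
pc1_weights, pc2_weights = pca.components_   # shape (2, 37): one weight vector per component

print("PC1 weighted sum:", np.dot(pc1_weights, row).round(3))
print("PC2 weighted sum:", np.dot(pc2_weights, row).round(3))
print("PCA features:", X_pca[0].round(3))
```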

Inspecting principal component 1, the highest-weighted feature is `OverallQual`. This means that `OverallQual` contributes the most to the direction of greatest variance in the dataset, and is likely a useful feature for sale price prediction.

Sort features by the 1st principal component’s weights:
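
For example, a sketch that ranks the original features by their weight in the first principal component (feature names come from the columns of `X` in the loading sketch):

```
import pandas as pd

# Weight of each original feature in principal component 1
pc1 = pd.Series(pca.components_[0], index=X.columns)

# Features with the largest absolute weights contribute most to PC1
print(pc1.reindex(pc1.abs().sort_values(ascending=False).index).head(10))
```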