Unsupervised dimensionality reduction.
Assumption: PCA assumes linear correlations between variables; it cannot capture non-linear relationships. The eigenvectors of the covariance matrix are the principal components, and the eigenvalues represent the variance carried by each component.
Process
- Standardize the data (PCA is sensitive to the scale of each feature)
- Calculate the covariance matrix
- Calculate its eigenvectors and eigenvalues — the eigenvectors are the principal components
- Project the data onto the top components
Choose the number of components so that 95-99% of the variance is retained.
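The steps above can be sketched with plain NumPy (the data matrix `X` here is synthetic, just for illustration):

```python
import numpy as np

# Hypothetical dataset: 200 samples, 5 features, one pair strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 1. Standardize (PCA is sensitive to feature scale)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh, since the covariance matrix is symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort components by descending eigenvalue (variance explained)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Project onto the top-k principal components
k = 2
X_reduced = X_std @ eigenvectors[:, :k]
print(X_reduced.shape)  # → (200, 2)
```

In practice `sklearn.decomposition.PCA` does the same thing (via SVD, which is numerically more stable than forming the covariance matrix explicitly).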
Drawbacks
- Computationally expensive
- Information is always lost
- Explainability becomes much more difficult
Questions
Optimal number of components?
\[k = \min\left\{n: \frac{\sum_{i=1}^n \lambda_i}{\sum_{i=1}^N \lambda_i} \geq \text{threshold}\right\}\]
Use a scree plot (x-axis: component number, y-axis: variance explained) and look for the "elbow" where the curve flattens.
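The formula amounts to taking the smallest `k` whose cumulative explained-variance ratio crosses the threshold. A minimal sketch, assuming a hypothetical spectrum of eigenvalues already sorted in descending order:

```python
import numpy as np

# Hypothetical eigenvalues, sorted descending (total variance = 8.0)
eigenvalues = np.array([4.2, 2.1, 0.9, 0.5, 0.2, 0.1])

# Cumulative fraction of total variance explained by the first n components
explained = np.cumsum(eigenvalues) / eigenvalues.sum()

threshold = 0.95
# argmax returns the first index where the condition holds; +1 for 1-based k
k = int(np.argmax(explained >= threshold)) + 1
print(k)  # → 4  (first 4 components explain 7.7/8.0 = 96.25% of variance)
```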