Principal Component Analysis (PCA): Mathematical Overview with Implementation in Python

Ifeoma Veronica Nwabufo
Oct 30, 2021

1. What is Principal Component Analysis?

Principal Component Analysis is a dimensionality-reduction tool that finds a representation of the data in a dimension lower than that of the original data, while retaining as much of the information as possible.

Here, we try to capture the maximum amount of variance in the data in a lower dimension. It is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the dataset comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.

Mathematically, PCA is simply a change-of-basis operation. The question posed is: given a data matrix X expressed in some basis, can we find another basis, formed from linear combinations of the original one, that re-expresses the data optimally? To achieve this, we seek uncorrelated variables, each of which captures a distinct piece of the information that describes the data.
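To make the variance-maximization idea concrete, here is a minimal sketch on synthetic data (not the tutorial's dataset): the variance of the data projected onto the first principal component is larger than the variance of any other one-dimensional projection.

# minimal sketch: compare variances of projections onto different unit directions
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], size=500)

def projected_variance(direction):
    direction = direction / np.linalg.norm(direction)  # normalize to a unit vector
    return np.var(data @ direction)

print(projected_variance(np.array([1.0, 0.0])))  # variance along the x-axis
print(projected_variance(np.array([1.0, 1.0])))  # variance along a diagonal
# the top eigenvector of the covariance matrix maximizes the projected variance
eigvals, eigvecs = np.linalg.eigh(np.cov(data.T))
print(projected_variance(eigvecs[:, -1]))        # the largest of the three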

2. Principal Component Analysis: Mathematical View

2.1 Assumptions

We assume that the data follow a statistical distribution, that the relationships between variables are linear, and that directions of large variance carry the important structure in the data.

2.2 Obtaining the PCA Step by Step

We consider an n x p matrix X with rows x₁, x₂, . . . , xₙ, where each row xᵢ is an observation (an individual from a measurement) and each of the p columns of X corresponds to a variable.

For this tutorial, we use the iris dataset which you can download here: https://www.kaggle.com/uciml/iris

# importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

1. Reading the Data
X = pd.read_csv('Iris.csv')
X = X.drop(columns=['Id', 'Species'])  # keep only the four numeric measurements

2. Normalization

In this step, we subtract the mean X̄ of X from X and divide by the standard deviation σ, to ensure that the variables contribute equally to the analysis and that the result is not dominated by variables with large values.

X_scaled = (X - X.mean()) / X.std()
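As a quick sanity check (assuming X_scaled from the step above), every column should now have mean approximately 0 and standard deviation 1:

print(X_scaled.mean().round(6))  # ~0 for each variable
print(X_scaled.std().round(6))   # 1 for each variable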

3. The Covariance Matrix

We then compute the covariance matrix C. This is a square, symmetric, positive semi-definite matrix, which guarantees that its eigenvalues are nonnegative.

Hence, the covariance matrix C of the standardized data is defined as C = X_scaledᵀ X_scaled / (n − 1). Since the data are standardized, C coincides with the correlation matrix, so we can compute it as:

covariance_mat = X_scaled.corr()  # equals the covariance matrix of the standardized data

which is a square, symmetric p x p matrix: each diagonal element of C is the variance of one variable, while the off-diagonal elements are the covariances between pairs of variables.
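To make the definition concrete, the same matrix can be computed by hand from the standardized data; this short check assumes X_scaled and covariance_mat from the previous steps:

# covariance of the standardized data, computed explicitly: C = Xᵀ X / (n - 1)
n = len(X_scaled)
C_manual = (X_scaled.T @ X_scaled) / (n - 1)
print(np.allclose(C_manual, covariance_mat))  # True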

4. Compute the eigenvalues and eigenvectors to determine the principal components using Singular Value Decomposition.

So, from the covariance matrix, we perform SVD to obtain the eigenvalues and eigenvectors as:

# SVD of the symmetric matrix C: the columns of eigen_vectors are the eigenvectors
eigen_vectors, eigen_values, eigen_vectors_inv = np.linalg.svd(covariance_mat)

Since C is symmetric, its SVD coincides with its eigendecomposition C = V A V⁻¹, where the columns of V are the eigenvectors, V⁻¹ = Vᵀ is the inverse of the orthogonal matrix V, and A is a diagonal matrix whose diagonal entries are the eigenvalues. The eigenvalues represent the amount of variance captured along each direction, while the eigenvectors give the directions of the new, uncorrelated features.
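Because C is symmetric, a dedicated symmetric eigensolver gives the same result; the short cross-check below assumes covariance_mat and the SVD outputs from above:

# np.linalg.eigh returns eigenvalues in ascending order, so we reverse them
eigh_values, eigh_vectors = np.linalg.eigh(covariance_mat)
print(np.allclose(eigh_values[::-1], eigen_values))  # True: same eigenvalues
# reconstruct C from its eigendecomposition: C = V A Vᵀ
C_reconstructed = eigen_vectors @ np.diag(eigen_values) @ eigen_vectors.T
print(np.allclose(C_reconstructed, covariance_mat))  # True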

Hence, we can now arrange the eigenvectors in descending order of their eigenvalues (np.linalg.svd already returns them in this order, but sorting explicitly makes the intent clear).

order = eigen_values.argsort()[::-1]
eig_values = eigen_values[order]
eig_vectors = eigen_vectors[:, order]  # the eigenvectors are the columns of V

It is important to keep the components that capture most of the variance. Hence, we compute the proportion of variance contributed by each eigenvalue.

row = ["Component " + str(i+1) for i in range(len(eigen_values))]
col = ["std deviation", "variance(%)", "cumulative variance(%)"]
total_variance = sum(eigen_values)
variance_table = pd.DataFrame(
    [[i**0.5, np.round(i*100/total_variance, 2), np.round(j*100/total_variance, 2)]
     for i, j in zip(eigen_values, np.cumsum(eigen_values))],
    columns=col, index=row)
variance_table
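Since matplotlib is already imported, a scree plot offers a quick visual aid for choosing the number of components; this sketch assumes eigen_values and total_variance from the code above:

# scree plot: explained and cumulative variance per component
explained = eigen_values * 100 / total_variance
components = range(1, len(explained) + 1)
plt.bar(components, explained)
plt.plot(components, np.cumsum(explained), 'o-', color='red')
plt.xlabel('Principal component')
plt.ylabel('Explained variance (%)')
plt.show()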

5. Change-of-Basis (Projection into a New Dimension)

We then project the standardized data onto the top eigenvectors, obtaining a new matrix whose columns are linear combinations of the columns of the old matrix X, with weights given by the eigenvectors in V.

n_components = 2
eigenvector_subset = eig_vectors[:, :n_components]  # p x 2 matrix of the top two eigenvectors
X_prime = np.dot(X_scaled, eigenvector_subset)      # n x 2 projected data

which is an n x n_components matrix; here, n x 2.
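To visualize the result, the two components can be plotted against each other. The sketch below re-reads the Species column of the Kaggle CSV (dropped earlier) to color the points:

# scatter plot of the data in the new two-dimensional basis
species = pd.read_csv('Iris.csv')['Species']
for name in species.unique():
    mask = (species == name).to_numpy()
    plt.scatter(X_prime[mask, 0], X_prime[mask, 1], label=name)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.show()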

For the full implementation in Python, check out the notebook on GitHub here

Do leave a comment if you found this interesting. If you have questions, please leave them in the comments section.

You can follow me on Twitter @ifeoma_nwabufo and on LinkedIn Ifeoma Veronica Nwabufo.
