Using PCA to visualize your data

A Python example showing how to visualize data with multiple dimensions in order to understand problems with your model.

Guy Barash
5 min read · Jul 26, 2020

Introduction

You’ve finished gathering the data, and your samples’ labels are balanced. Everything should be clean and simple. So you use an off-the-shelf classifier, but the predictions are all wrong!
Maybe one class is over-represented in the predictions, maybe the accuracy is not what you expected.

In this blog you’ll learn how to use PCA to visualize your data and get hints about the underlying distribution and separability of your data.

With Python code!

Let’s start with the relevant imports. These are the libraries we’re going to use.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from matplotlib import pyplot
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

Generating synthetic data for demonstration

To get a better intuition on the subject, we’ll create our own data.

In this data-set we’ll have 3 classes.

Each class will have:
- 300 samples
- 10 features
- a label
- features drawn from a normal distribution with a fixed mean and standard deviation

sample_per_class = 300
features = 10
below_center = 0 #@param {type:"slider", min:0, max:2, step:1}
below_std = 2 #@param {type:"slider", min:0, max:5, step:1}
bingo_center = 3
bingo_std = 2 #@param {type:"slider", min:0, max:5, step:1}
above_center = 6 #@param {type:"slider", min:4, max:10, step:1}
above_std = 3 #@param {type:"slider", min:0, max:5, step:1}

The Classes:

Below (label “0”): Each feature is a number drawn from a normal distribution centered at 0 with a standard deviation of 2.

Bingo (label “1”): Each feature is a number drawn from a normal distribution centered at 3 with a standard deviation of 2.

Above (label “2”): Each feature is a number drawn from a normal distribution centered at 6 with a standard deviation of 3.

10 features per sample, 300 samples per class.

Let’s generate the data:

p_below = np.clip(np.random.normal(loc=below_center, scale=below_std, size=(sample_per_class, features)).astype(int),
                  a_min=0, a_max=8)
p_bingo = np.clip(np.random.normal(loc=bingo_center, scale=bingo_std, size=(sample_per_class, features)).astype(int),
                  a_min=0, a_max=8)
p_above = np.clip(np.random.normal(loc=above_center, scale=above_std, size=(sample_per_class, features)).astype(int),
                  a_min=0, a_max=8)

data_cols = [f'game_{idx + 1}' for idx in range(features)]
label_col = ['label']
df = pd.DataFrame(index=range(sample_per_class * 3), columns=data_cols + label_col)
df[data_cols] = np.concatenate([p_below, p_bingo, p_above])
for idx_class in range(3):
    start_idx = sample_per_class * idx_class
    end_idx = (sample_per_class * (idx_class + 1)) - 1  # .loc slicing is inclusive
    df.loc[start_idx:end_idx, label_col] = idx_class

X = df[data_cols].to_numpy(dtype=float)
y = df[label_col].to_numpy(dtype=float).ravel()  # flatten to a 1-D label vector

Our data should look something like this:

example data
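If you want to inspect it yourself, a quick peek at one row from each block reproduces this view (the row indices below are just a convenient way to land in the Below, Bingo and Above blocks):

# Show one sample from each class: rows 0, 300 and 600 fall in the
# Below, Bingo and Above blocks respectively.
display(df.iloc[[0, 300, 600]])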

This data makes sense, and it should be pretty easy to classify. But when you run it through a generic classifier, class 0 (Below) and class 1 (Bingo) have too many mistakes between them. What could be the reason?
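The post doesn’t show which classifier was used; as an illustrative sketch (not the original experiment), a plain scikit-learn LogisticRegression is enough to see the confusion between the two lower classes:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Train a generic classifier and inspect the confusion matrix.
# Rows are true labels (0, 1, 2), columns are predicted labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))

Most of the off-diagonal errors should sit between rows 0 and 1, which is exactly the problem described above.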

Applying PCA to reduce dimensions

We’ll take the data and reduce it to 2 dimensions, which will make it possible to plot and understand. We’ll lose some of the information, but the relations between samples will mostly remain.

We take the data with N dimensions (10 in our case) and reduce it to 2 dimensions, like so:

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_df = pd.DataFrame(columns=['C1', 'C2'], data=X_pca)
pca_df['label'] = y

Now the data should look something like this:

example data, reduced to 2 dimensions

Notice that the labels are kept exactly as they are, but the features have changed completely.
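If you want to quantify how much information the projection keeps, scikit-learn’s PCA exposes the explained variance ratio of each component (the exact numbers will vary with the randomly generated data):

# Fraction of the total variance captured by each of the two components,
# and the total variance retained by the 2-D projection.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())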

Visualizing

Now we want to see the data. We’ll use matplotlib to visualize it.

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('C1', fontsize=15)
ax.set_ylabel('C2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
targets = [0, 1, 2]
colors = ['r', 'g', 'b']
labels = {0: "Below 3", 1: "Bingo", 2: "Above 3"}
for target, color in zip(targets, colors):
    target_indices = pca_df['label'] == target
    ax.scatter(pca_df.loc[target_indices, 'C1'],
               pca_df.loc[target_indices, 'C2'],
               c=color,
               s=50,
               label=labels[target])
ax.legend()
ax.grid()

And we’ll get:

2D visualization of the data

When the data is projected onto a 2-D image, it is easier to pick up crucial information about it. A key insight from the figure above is that the blue class, “Above 3”, will be easy to separate from the rest of the data, while separating the remaining classes will probably need additional work.

Additional examples

Let’s play with the parameters and see what else the visualization can reveal.

“Above_3" mean: 5, std: 1. “Bingo” mean: 3, std: 1. “Below_3” mean: 0, std: 1

The data in this image is very well separated, and should be a piece of cake for any classifier.

“Above_3” mean: 4, std: 2. “Bingo” mean: 3, std: 1. “Below_3” mean: 2, std: 3

This data is a mess; most simple classifiers won’t be able to reach good results on it. It’s not about the parameters of the learning, it’s about the data.

“Above_3” mean: 5, std: 1. “Bingo” mean: 3, std: 4. “Below_3” mean: 0, std: 1

This data seems good, but “Bingo” will probably be over-represented in the model’s predictions. In many models, everything that is not spot-on in the center of the blue or the red cluster will be classified as green. If that is not the desired behavior, then under-sampling the green class may be a good idea.
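The figures above were produced by re-running the generation and plotting steps with different means and standard deviations. A small helper along these lines (the name pca_preview and its defaults are mine, not part of the original notebook) makes it easy to reproduce them:

def pca_preview(centers, stds, samples=300, features=10):
    # Generate one class per (center, std) pair, clipped to 0..8 as above,
    # project everything to 2 dimensions with PCA and scatter-plot the result.
    parts = [np.clip(np.random.normal(loc=c, scale=s, size=(samples, features)).astype(int),
                     a_min=0, a_max=8)
             for c, s in zip(centers, stds)]
    X_all = np.concatenate(parts)
    y_all = np.repeat(np.arange(len(parts)), samples)
    X_2d = PCA(n_components=2).fit_transform(X_all)
    fig, ax = plt.subplots(figsize=(8, 8))
    for target, color in zip(np.arange(len(parts)), ['r', 'g', 'b']):
        mask = y_all == target
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1], c=color, s=50, label=f'class {target}')
    ax.legend()
    ax.grid()

# For example, the well-separated case above:
pca_preview(centers=[0, 3, 5], stds=[1, 1, 1])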

Conclusion

In this article we have shown how to project tabular data into a 2-dimensional space using PCA, and how to visualize the result.

This visualization can give you good intuition about how your data “looks” and which directions you’ll need to consider when building the right model.

The Code

A complete notebook is available here:
