Interpretating Biplot

A tool for exploring multivariate data

statistics
machine learning
data science
Author

TheRimalaya

Published

March 18, 2017

Modified

October 6, 2024

A biplot is a powerful graphical tool that represents data in two dimensions, where both the observations and variables are represented. Biplots are particularly useful for multivariate data, allowing users to examine relationships between variables and identify patterns.

Consider a data matrix \(\mathbf{X}_{n\times p}\) with \(p\) variables and \(n\) observations. To explore the data further with biplot, principal component analysis (PCA) is used here.

\[ \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_p \end{pmatrix}^T = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \end{pmatrix} \] where \(\mathbf{x}_{i} = \begin{pmatrix}x_{1i} & \ldots & x_{ni}\end{pmatrix}\) is the ith variable.

Principal component analysis (PCA) compresses the variance of the data matrix to create a new set of orthogonal (linearly independent) variables \(\mathbf{Z} = \begin{pmatrix}\mathbf{z}_1 & \mathbf{z}_2 & \ldots & \mathbf{z}_q\end{pmatrix}\), where \(q = \min(n, p)\), often termed as principal components (PC) or scores. In PCA, the most variation is captured by the first component and rest in the subsequent components in decreasing order. In other words, the first principal component (\(\mathbf{z}_1\)) captures the highest variation and second principal component (\(\mathbf{z}_2\)) captures the maximum of remaing variation and so on.

\[\begin{aligned} \mathbf{z}_{1} &= w_{11}\mathbf{x}_{1} + \ldots + w_{1p}\mathbf{x}_{p} \\ \mathbf{z}_{2} &= w_{21}\mathbf{x}_{1} + \ldots + w_{2p}\mathbf{x}_{p} \\ \vdots \\ \mathbf{z}_{q} &= w_{q1}\mathbf{x}_{1} + \ldots + w_{qp}\mathbf{x}_{p} \\ \end{aligned} \tag{1}\]

We can use eigenvalue decomposition or singular value decompostion for this purpose.

These principle components are created using linear combination of the original variables. For example, \(\mathbf{z}_1 = w_{11}\mathbf{x}_1 + \ldots + w_{1p}\mathbf{x}_p\) and similarly for \(\mathbf{z}_2, \ldots, \mathbf{z}_q\). Here we can estimate weights \(w_{ij}\), where \(i = 1 \ldots q\) and \(j = 1 \ldots p\) using eigenvalue decomposition or singular value (SVD). These weights are also refered as loading or the matrix of these weights as rotation matrix.

Biplot plots the scores from two principal components together with loadings (weight) for each variables in the same plot. The following example uses USArrests data from datasets package in R. The dataset contains the number of arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973.

Two functions prcomp and princomp in R performs the principal component analysis. The function prcomp uses SVD on data matrix \(\mathbf{X}\) while princomp uses eigenvalue decomposition on covariance or correlation matrix of the data matrix. We will use prcomp for our example.

Dataset: USArrest

Code
USArrests <- as_tidytable(USArrests, .keep_rownames = "State")
head(USArrests)
# A tidytable: 6 × 5
  State      Murder Assault UrbanPop  Rape
  <chr>       <dbl>   <int>    <int> <dbl>
1 Alabama      13.2     236       58  21.2
2 Alaska       10       263       48  44.5
3 Arizona       8.1     294       80  31  
4 Arkansas      8.8     190       50  19.5
5 California    9       276       91  40.6
6 Colorado      7.9     204       78  38.7

Principal component analysis

Code
pca <- prcomp(USArrests[, -1], scale. = TRUE)
str(pca)
List of 5
 $ sdev    : num [1:4] 1.575 0.995 0.597 0.416
 $ rotation: num [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 -0.418 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
  .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
 $ center  : Named num [1:4] 7.79 170.76 65.54 21.23
  ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ scale   : Named num [1:4] 4.36 83.34 14.47 9.37
  ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ x       : num [1:50, 1:4] -0.976 -1.931 -1.745 0.14 -2.499 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
 - attr(*, "class")= chr "prcomp"
Code
score_df <- as_tidytable(pca$x) %>% 
  mutate(State = USArrests[, State])

loading_df <- as_tidytable(
  pca$rotation, 
  .keep_rownames = "Variable"
)

ggplot(score_df, aes(PC1, PC2)) +
  geom_hline(
    yintercept = 0, 
    color = "royalblue", 
    linetype = "dashed"
  ) +
  geom_vline(
    xintercept = 0, 
    color = "royalblue", 
    linetype = "dashed"
  ) +
  geom_segment(
    data = loading_df,
    aes(
      x = 0, xend = PC1 * 2,
      y = 0, yend = PC2 * 2
    ),
    color = "firebrick",
    arrow = arrow(length = unit(2, "mm"))
  ) +
  geom_text(
    data = loading_df,
    aes(
      x = PC1 * 2.6, 
      y = PC2 * 2.6, 
      label = Variable
    ),
    color = "firebrick",
    size = rel(5)
  ) +
  geom_text(
    aes(label = State), 
    check_overlap = TRUE,
    color = "gray50"
  ) +
  theme_minimal(base_size = 18) +
  theme(
    panel.background = element_rect(color = "gray50")
  )

Code
summary(pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
Code
var_df <- tidytable(
  PC = paste0("PC", 1:length(pca$sdev)),
  Variance = pca$sdev^2
)
ggplot(var_df, aes(PC, Variance)) +
  geom_line(group = 1) +
  geom_point(
    stroke = 1, size = 3, 
    shape = 21, fill = "whitesmoke"
  ) +
  theme_minimal(base_size = 18) +
  theme(
    panel.background = element_rect(color = "gray20")
  )

Here we see that more than 87% of variation in \(\mathbf{X}\) was captured by first two principal components (65% by the first and 25% by the second component).

Explore data using biplot

Now lets compare the scores from the first two components, observations and variables in the data matrix. From the USArrests data, lets look at the top five states with highest and lowest arrest for each of these crimes.

Code
USArrests %>% arrange(-Murder) %>% top_n()
# A tidytable: 5 × 5
  State          Murder Assault UrbanPop  Rape
  <chr>           <dbl>   <int>    <int> <dbl>
1 Georgia          17.4     211       60  25.8
2 Mississippi      16.1     259       44  17.1
3 Florida          15.4     335       80  31.9
4 Louisiana        15.4     249       66  22.2
5 South Carolina   14.4     279       48  22.5
Code
USArrests %>% arrange(-Assault) %>% top_n()
# A tidytable: 5 × 5
  State          Murder Assault UrbanPop  Rape
  <chr>           <dbl>   <int>    <int> <dbl>
1 North Carolina   13       337       45  16.1
2 Florida          15.4     335       80  31.9
3 Maryland         11.3     300       67  27.8
4 Arizona           8.1     294       80  31  
5 New Mexico       11.4     285       70  32.1
Code
USArrests %>% arrange(-UrbanPop) %>% top_n()
# A tidytable: 5 × 5
  State         Murder Assault UrbanPop  Rape
  <chr>          <dbl>   <int>    <int> <dbl>
1 California       9       276       91  40.6
2 New Jersey       7.4     159       89  18.8
3 Rhode Island     3.4     174       87   8.3
4 New York        11.1     254       86  26.1
5 Massachusetts    4.4     149       85  16.3
Code
USArrests %>% arrange(-Rape) %>% top_n()
# A tidytable: 5 × 5
  State      Murder Assault UrbanPop  Rape
  <chr>       <dbl>   <int>    <int> <dbl>
1 Nevada       12.2     252       81  46  
2 Alaska       10       263       48  44.5
3 California    9       276       91  40.6
4 Colorado      7.9     204       78  38.7
5 Michigan     12.1     255       74  35.1
Code
USArrests %>% arrange(Murder) %>% top_n()
# A tidytable: 5 × 5
  State         Murder Assault UrbanPop  Rape
  <chr>          <dbl>   <int>    <int> <dbl>
1 North Dakota     0.8      45       44   7.3
2 Maine            2.1      83       51   7.8
3 New Hampshire    2.1      57       56   9.5
4 Iowa             2.2      56       57  11.3
5 Vermont          2.2      48       32  11.2
Code
USArrests %>% arrange(Assault) %>% top_n()
# A tidytable: 5 × 5
  State        Murder Assault UrbanPop  Rape
  <chr>         <dbl>   <int>    <int> <dbl>
1 North Dakota    0.8      45       44   7.3
2 Hawaii          5.3      46       83  20.2
3 Vermont         2.2      48       32  11.2
4 Wisconsin       2.6      53       66  10.8
5 Iowa            2.2      56       57  11.3
Code
USArrests %>% arrange(UrbanPop) %>% top_n()
# A tidytable: 5 × 5
  State          Murder Assault UrbanPop  Rape
  <chr>           <dbl>   <int>    <int> <dbl>
1 Vermont           2.2      48       32  11.2
2 West Virginia     5.7      81       39   9.3
3 Mississippi      16.1     259       44  17.1
4 North Dakota      0.8      45       44   7.3
5 North Carolina   13       337       45  16.1
Code
USArrests %>% arrange(Rape) %>% top_n()
# A tidytable: 5 × 5
  State         Murder Assault UrbanPop  Rape
  <chr>          <dbl>   <int>    <int> <dbl>
1 North Dakota     0.8      45       44   7.3
2 Maine            2.1      83       51   7.8
3 Rhode Island     3.4     174       87   8.3
4 West Virginia    5.7      81       39   9.3
5 New Hampshire    2.1      57       56   9.5

Comparing these highest and lowest arrests with the biplot, we can see a pattern. The weights corresponding to PC1 for all the variables are negative and are directed towards states like Florida, Nevada, and California. These states have the highest number of arrests for all of these crimes where as states that are in the oppositve direction like Iowa, North Dakota, and Vermont have the lowest arrest.

Similarly, UrbanPop have the highest weights corresponding to PC2 so the states in that direction such as California, Hawaii, and New Jersey have highest arrest related to UrbanPop. The states in the opposite direction, i.e. with negative PC2 scores such as Mississippi, North Carolina, Vermont, and South Carolina have the lowest arrest related to UrbanPop.

The weights for all variables are negative and towards states like . So in our data these states must have the highest arrests in all these crimes where as states like New Dakota, Vermont, and Iowa have the lowest arrests.