Visualisation of high-dimensional datasets

High-dimensional datasets are now very common in biomedical research. Think of gene expression analyses, where the number of measured variables (e.g. 20 000 genes) exceeds the number of samples (e.g. 100) many times over.

Here I have collected some good resources on the topic.

Principal component analysis

Definition: Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (Wikipedia).
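To make the definition concrete, here is a minimal sketch in Python with NumPy (my own choice of language and library, not something the sources prescribe). It centres the data and takes a singular value decomposition, which is one common way to obtain the orthogonal transformation. The `pca` helper, the toy "expression" matrix and the two sample groups are all hypothetical and only serve as an illustration.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X (samples x variables) onto the leading principal components."""
    Xc = X - X.mean(axis=0)                      # centre every variable
    # SVD of the centred data: the rows of Vt are orthonormal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal axes
    scores = Xc @ components.T                   # sample coordinates on those axes
    explained = S ** 2 / np.sum(S ** 2)          # fraction of variance per component
    return scores, components, explained[:n_components]

# Hypothetical toy data: 100 samples, 2 000 "genes" (far more variables than samples).
rng = np.random.default_rng(0)
n_samples, n_genes = 100, 2000
labels = np.repeat([0, 1], n_samples // 2)       # two made-up sample groups
X = rng.normal(size=(n_samples, n_genes))
X[labels == 1, :200] += 3.0                      # group 1 differs in the first 200 genes

scores, components, explained = pca(X, n_components=2)
print(scores.shape, explained)
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the usual PCA scatter plot. In this toy setting the first component should mostly reflect the difference between the two groups, because that is where the largest share of the variance lies.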

Screenshot from Setosa, see link below.

I can recommend reading this straightforward paper: Ringnér, M. (2008). What is principal component analysis? Nature Biotechnology, 26(3), 303–304. http://doi.org/10.1038/nbt0308-303

Figure from Ringnér, see citation above.

Continue reading on PCA

t-SNE

Definition: t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets (Laurens van der Maaten's website).
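For readers who want to see the moving parts, here is a bare-bones sketch of the idea in Python with NumPy. It is a simplification under my own assumptions, not the reference implementation: it uses plain gradient descent with momentum and leaves out refinements such as early exaggeration and adaptive gains. All function names, parameter values and the toy data are hypothetical. For real analyses an established implementation (for example `sklearn.manifold.TSNE` in scikit-learn) is the better choice.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

def joint_affinities(X, perplexity=10.0, n_steps=50):
    """Symmetrised Gaussian affinities p_ij; each point gets its own kernel
    width, found by bisection so that its row has the requested perplexity."""
    n = X.shape[0]
    D = squared_distances(X)
    P = np.zeros((n, n))
    for i in range(n):
        d = np.delete(D[i], i)
        d = d - d.min()                       # shift for numerical stability
        lo, hi, beta = 0.0, np.inf, 1.0       # beta = 1 / (2 * sigma^2)
        for _ in range(n_steps):
            p = np.exp(-beta * d)
            p /= p.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            if entropy > np.log(perplexity):  # kernel too wide: increase beta
                lo = beta
                beta = beta * 2.0 if hi == np.inf else (lo + hi) / 2.0
            else:                             # kernel too narrow: decrease beta
                hi = beta
                beta = (lo + hi) / 2.0
        P[i, np.arange(n) != i] = p           # p from the last bisection step
    return (P + P.T) / (2.0 * n)

def tsne(X, n_components=2, perplexity=10.0, n_iter=500,
         learning_rate=10.0, momentum=0.8, seed=0):
    """Simplified t-SNE: minimise KL(P || Q) by gradient descent with momentum."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    P = np.maximum(joint_affinities(X, perplexity), 1e-12)
    Y = rng.normal(scale=1e-4, size=(n, n_components))
    velocity = np.zeros_like(Y)
    kl_history = []
    for _ in range(n_iter):
        W = 1.0 / (1.0 + squared_distances(Y))      # Student-t kernel (1 d.o.f.)
        np.fill_diagonal(W, 0.0)
        Q = np.maximum(W / W.sum(), 1e-12)
        kl_history.append(np.sum(P * np.log(P / Q)))
        M = (P - Q) * W
        grad = 4.0 * (np.diag(M.sum(axis=1)) - M) @ Y
        velocity = momentum * velocity - learning_rate * grad
        Y = Y + velocity
        Y -= Y.mean(axis=0)                          # keep the map centred
    return Y, kl_history

# Hypothetical toy data: three well-separated groups of 20 points in 10 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
centres = 10.0 * np.eye(3, 10)
X = centres[labels] + rng.normal(size=(60, 10))

Y, kl_history = tsne(X)
print(Y.shape, kl_history[0], kl_history[-1])
```

A scatter plot of the two columns of `Y`, coloured by `labels`, is the kind of map shown in t-SNE figures. Note that distances between far-apart clusters in such a map should not be over-interpreted, since the method mainly aims to preserve local neighbourhoods.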

The following paper in Nature Immunology makes elegant use of this technique to describe and discriminate murine immune cells, based on a mass cytometry assay with antibodies against 38 surface markers.

Becher, B., Schlitzer, A., Chen, J., Mair, F., Sumatoh, H. R., Teng, K. W. W., et al. (2014). High-dimensional analysis of the murine myeloid cell system. Nature Immunology, 15(12), 1181–1189. http://doi.org/10.1038/ni.3006

Figure 2 from Becher et al., see citation above.

tSNE analysis objectively delineates myeloid cell subsets of lung, spleen and bone marrow. (a–c) tSNE composite dimensions (dim.) 1 and 2 for cells derived from lung (a), spleen (b) and bone marrow (c). Left, cells grouped according to biased (traditional) definitions with gating strategies similar to that described in Figure 1. Remaining cells (gray) are those unaccounted for by these definitions; predominant unidentified clusters are indicated with arrows. Right, cells are grouped according to automatic (unbiased) cluster designation. Predominant clusters (frequencies >1%) including those that correspond to subsets not accounted for by traditional gating are labeled with cluster numbers. Average frequencies (± s.e.m., n = 3 mice) as percentage of total CD45+CD90−CD19−CD3− population are shown for each subset. Alv., alveolar; inter., interstitial; rem., remaining.