Data reduction for spectral clustering to analyze high throughput flow cytometry data

doi:10.1186/1471-2105-11-403

BMC Bioinformatics. 2010 Jul 28;11:403.

Habil Zare¹, Parisa Shooshtari, Arvind Gupta, Ryan R Brinkman

PMID: 20667133
PMCID: PMC2923634
DOI: 10.1186/1471-2105-11-403

Habil Zare et al. BMC Bioinformatics. 2010.

Affiliation

¹ Terry Fox Laboratory, BC Cancer Agency, 675 W 10th Ave, Vancouver, BC, Canada.

Abstract

Background: Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering in particular has proven to be a powerful tool amenable to many applications. However, it cannot be applied directly to large datasets because of time and memory limitations. To address this issue, we have modified spectral clustering by adding an information-preserving sampling procedure and applying a post-processing stage. We call this entire algorithm SamSPECTRAL.

Results: We tested our algorithm on flow cytometry data as an example of large, multidimensional data containing potentially hundreds of thousands of data points (i.e., "events" in flow cytometry, typically corresponding to cells). Compared to two state-of-the-art model-based flow cytometry clustering methods, SamSPECTRAL demonstrates significant advantages in properly identifying populations with non-elliptical shapes, low-density populations close to dense ones, minor subpopulations of a major population, and rare populations.

Conclusions: This work is the first successful attempt to apply spectral methodology to flow cytometry data. An implementation of our algorithm as an R package is freely available through Bioconductor.

Figures

**Figure 1**
**Data reduction scheme**. (a) Running spectral clustering is impractical on data that contains thousands of points. (b) Faithful sampling picks up a reasonable subset of points such that running spectral clustering is possible on them. However, all information about the local density is lost by considering only these sample points. (c) We assign weights to the edges of the graph; the edges between the nodes in denser regions are weighted considerably higher. The information about the local density is retrieved in this way.

**Figure 2**
**Faithful sampling**. (a) Original data from telomere data set before sampling. (b) The distribution of representatives is almost uniform in the space after faithful sampling.
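
The caption does not spell out the sampling procedure, so the following is only a minimal Python sketch of one way such density-flattening sampling could work. It assumes a greedy scheme in which each randomly chosen representative absorbs every not-yet-registered point within a radius `h`. The function name `faithful_sampling`, the Euclidean distance, and the parameter `h` are assumptions made for illustration; this is not the API of the SamSPECTRAL package.

```python
import numpy as np

def faithful_sampling(points, h, seed=0):
    """Greedy sampling sketch (assumed scheme, not the package API).

    Repeatedly pick a random unregistered point as a representative and
    register every unregistered point within Euclidean distance h of it
    as a member of that representative's community.
    """
    rng = np.random.default_rng(seed)
    community = np.full(len(points), -1)   # -1 marks "not yet registered"
    representatives = []
    while True:
        unregistered = np.flatnonzero(community == -1)
        if unregistered.size == 0:
            break
        rep = rng.choice(unregistered)
        dist = np.linalg.norm(points[unregistered] - points[rep], axis=1)
        members = unregistered[dist <= h]  # always includes rep itself
        community[members] = len(representatives)
        representatives.append(rep)
    return np.array(representatives), community

# Toy data (hypothetical): a dense blob plus sparse uniform background.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, (2000, 2)),
                  rng.uniform(-3.0, 3.0, (200, 2))])
reps, community = faithful_sampling(data, h=0.5)
community_sizes = np.bincount(community)   # large where the data are dense
```

Under this scheme no two representatives lie within `h` of each other, which is why they spread almost uniformly, while the community sizes keep a record of the local density that the representatives alone no longer show.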

**Figure 3**
**Defining the similarity between two communities and identifying the number of clusters**. (a) We define the similarity between two communities c and c' as the sum of pairwise similarities between the members of c and the members of c'. (b) This figure shows the largest eigenvalues of a sample from the stem cell dataset. The number of clusters is estimated from the knee point of the eigenvalue curve. This point is defined as the intersection of the fitted regression line and the line y = 1. The horizontal coordinate of the knee point estimates the number of spectral clusters.
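
Both definitions in this caption can be illustrated with a short Python sketch. The caption does not state the pairwise similarity function or which eigenvalues the regression line is fitted to, so the Gaussian kernel with width `sigma` and the `fit_from` index below are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def pairwise_similarity(a, b, sigma=1.0):
    """Gaussian similarity between every row of a and every row of b.
    The kernel choice is an assumption; the caption does not specify it."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def community_similarity(c, c_prime, sigma=1.0):
    """Similarity of two communities: the sum of all pairwise
    similarities between members of c and members of c_prime."""
    return pairwise_similarity(c, c_prime, sigma).sum()

def knee_point(eigenvalues, fit_from):
    """Fit a line to the eigenvalues from position fit_from onward
    (an assumed choice) and return the x-coordinate where that line
    crosses y = 1. Eigenvalues are indexed 1, 2, 3, ... on the x-axis."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    x = np.arange(1, len(eigenvalues) + 1)
    slope, intercept = np.polyfit(x[fit_from:], eigenvalues[fit_from:], 1)
    return (1.0 - intercept) / slope

# Two small communities in 2-D (made-up coordinates).
c = np.array([[0.0, 0.0], [1.0, 0.0]])
c_prime = np.array([[0.0, 0.0]])
weight = community_similarity(c, c_prime)

# Idealized spectrum: four eigenvalues equal to 1, then a linear decay.
spectrum = np.concatenate([np.ones(4), 1.0 - 0.05 * np.arange(1, 11)])
estimated_clusters = knee_point(spectrum, fit_from=4)
```

Because the community similarity is a sum over all member pairs, communities with many members (that is, representatives in dense regions) receive larger edge weights, which is how density information can re-enter the graph after sampling.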

**Figure 4**
**Comparative clustering of the telomere dataset**. (a-c) Proper identification of overlapping populations. Although the two populations shown by red and blue contours overlap in all bivariate plots of this 3-dimensional sample, SamSPECTRAL can properly distinguish them by considering multiple parameters simultaneously. (d) SamSPECTRAL can also identify two major subpopulations of granulocytes correctly, as verified by expert analysis. (e) flowMerge does not distinguish between two populations of interest, and (f) FLAME improperly splits the same sample into several clusters.

**Figure 5**
**Comparative clustering of dead cells (PI positive) and live cells (PI negative) in the viability data**. (a) SamSPECTRAL could distinguish between dead cells (blue) and live cells (red) properly. (b) flowMerge identified dead cells correctly, but split live cells into two clusters. (c) FLAME did not distinguish between these two populations.

**Figure 6**
**Comparative clustering of the GvHD dataset**. (Left) Identification of non-elliptical shaped populations. (a) SamSPECTRAL could properly identify the red, non-elliptical population, while (b) flowMerge mixed this population with the one below it. (c) FLAME produced satisfactory results in identifying this population. (Right) Identification of low density populations close to dense populations. (d) SamSPECTRAL and (e) flowMerge could identify the low density population shown in red at the centre of the figure correctly, while (f) FLAME merged this population with the other ones surrounding it.

**Figure 7**
**Comparative identification of a low density population surrounded by much denser populations in the stem cell data set**. (a-c) SamSPECTRAL correctly identified the blue, low density population, while (d-f) flowMerge merged it into the yellow, high density population. (g-i) FLAME merged it into the red population. (j-l) The outcome of our modified MCL was similar to that obtained by SamSPECTRAL using classic spectral clustering. This shows that SamSPECTRAL is extensible by substituting classic spectral clustering with other clustering algorithms for weighted graphs.

**Figure 8**
**Rare population in the stem cell data set**. (a-c) This is a typical sample from the stem cell data set that contains a rare population. In these three dimensional plots, the red dots represent the cells that are positive for all three markers. Only 23/9721 (0.24%) events belong to this population in this sample. SamSPECTRAL could properly identify the rare population in 27/34 (79.4%) samples from the stem cell data set.

**Figure 9**
**Performance of SamSPECTRAL on synthetic data**. (a) This synthetic two-dimensional dataset consists of one normal distribution with 30,000 points, four normal distributions with 300 points each, and uniform background noise with 4,000 points. (b) Around 3,000 sample points are picked by faithful sampling. These are distributed almost uniformly in the space; therefore, almost all information about density would be lost if one considered only the sample points. (c) The final outcome of SamSPECTRAL confirms that the information about density could be retrieved by properly assigning weights to the edges of the graph. The high density cluster is shown in red and the surrounding sparser clusters are shown in yellow, light blue, green and black.
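
A dataset with the composition described in panel (a) can be generated along the following lines. Only the point counts come from the caption; the cluster centres, standard deviations, and noise range are hypothetical values picked for illustration, as are the function name and the label encoding.

```python
import numpy as np

def make_synthetic(seed=0):
    """Synthetic 2-D data with the point counts given in the caption.
    Centres, spreads, and the noise box are assumed, not from the paper."""
    rng = np.random.default_rng(seed)
    # One high-density normal population with 30,000 points.
    dense = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(30_000, 2))
    # Four sparser normal populations with 300 points each.
    centres = [(-4.0, -4.0), (-4.0, 4.0), (4.0, -4.0), (4.0, 4.0)]
    sparse = [rng.normal(loc=c, scale=0.5, size=(300, 2)) for c in centres]
    # Uniform background noise with 4,000 points.
    noise = rng.uniform(-8.0, 8.0, size=(4_000, 2))
    points = np.vstack([dense, *sparse, noise])
    labels = np.concatenate([
        np.zeros(30_000, dtype=int),       # 0: dense population
        np.repeat(np.arange(1, 5), 300),   # 1-4: sparse populations
        np.full(4_000, -1),                # -1: background noise
    ])
    return points, labels

points, labels = make_synthetic()
```

The 100:1 ratio between the dense population and each sparse one is what makes the example demanding: any sampling step that is proportional to density would leave the sparse populations with very few representatives.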

**Figure 10**
**Comparing uniform sampling with faithful sampling**. Directly applying classical spectral clustering is not efficient on this sample of the stem cell dataset, which contains 48,000 cytometry events in 3 dimensions. (a) Although only 2,115 data points were selected by faithful sampling, each population has a considerable number of representatives among the selected points. (b) 3,000 points were selected by uniform sampling. The low density population in the middle of the plot is represented by only 55 sample points, which causes it to be incorrectly merged with a high density one (d). (c) The result of SamSPECTRAL on the original data is satisfactory because the low density red population and the other high density populations are identified properly.

Cited by

flowDensity: reproducing manual gating of flow cytometry data by automated density-based cell population identification.
Malek M, Taghiyar MJ, Chong L, Finak G, Gottardo R, Brinkman RR. Bioinformatics. 2015 Feb 15;31(4):606-7. doi: 10.1093/bioinformatics/btu677. Epub 2014 Oct 16. PMID: 25378466.
Automated identification of maximal differential cell populations in flow cytometry data.
Yue A, Chauve C, Libbrecht MW, Brinkman RR. Cytometry A. 2022 Feb;101(2):177-184. doi: 10.1002/cyto.a.24503. Epub 2021 Oct 22. PMID: 34559446.
Competitive SWIFT cluster templates enhance detection of aging changes.
Rebhahn JA, Roumanes DR, Qi Y, Khan A, Thakar J, Rosenberg A, Lee FE, Quataert SA, Sharma G, Mosmann TR. Cytometry A. 2016 Jan;89(1):59-70. doi: 10.1002/cyto.a.22740. Epub 2015 Oct 6. PMID: 26441030.
Critical assessment of automated flow cytometry data analysis techniques.
Aghaeepour N, Finak G; FlowCAP Consortium; DREAM Consortium; Hoos H, Mosmann TR, Brinkman R, Gottardo R, Scheuermann RH. Nat Methods. 2013 Mar;10(3):228-38. doi: 10.1038/nmeth.2365. Epub 2013 Feb 10. PMID: 23396282.
The race to understand immunopathology in COVID-19: Perspectives on the impact of quantitative approaches to understand within-host interactions.
Gazeau S, Deng X, Ooi HK, Mostefai F, Hussin J, Heffernan J, Jenner AL, Craig M. Immunoinformatics (Amst). 2023 Mar;9:100021. doi: 10.1016/j.immuno.2023.100021. Epub 2023 Jan 8. PMID: 36643886. Review.
