Nucleic Acids Res. 2010 Jan;38(3):e17.
doi: 10.1093/nar/gkp942. Epub 2009 Nov 18.

A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data

Nuno L Barbosa-Morais et al. Nucleic Acids Res. 2010 Jan.

Abstract

Illumina BeadArrays are among the most popular and reliable platforms for gene expression profiling. However, little external scrutiny has been given to the design, selection and annotation of BeadArray probes, which is a fundamental issue in data quality and interpretation. Here we present a pipeline for the complete genomic and transcriptomic re-annotation of Illumina probe sequences, also applicable to other platforms, with its output available through a Web interface and incorporated into Bioconductor packages. We have identified several problems with the design of individual probes and we show the benefits of probe re-annotation on the analysis of BeadArray gene expression data sets. We discuss the importance of aspects such as probe coverage of individual transcripts, alternative messenger RNA splicing, single-nucleotide polymorphisms, repeat sequences, RNA degradation biases and probes targeting genomic regions with no known transcription. We conclude that many of the Illumina probes have unreliable original annotation and that our re-annotation allows analyses to focus on the good quality probes, which form the majority, and also to expand the scope of biological information that can be extracted.
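As a rough illustration of what focusing an analysis on the good quality probes can look like in practice, the sketch below filters an expression table by a per-probe quality label. The label names, the probe identifiers and the `filter_probes` helper are hypothetical placeholders chosen for this example; they are not the output format of the pipeline described in the abstract.

```python
from typing import Dict, Iterable, Set


def filter_probes(
    expression: Dict[str, float],
    quality: Dict[str, str],
    keep: Iterable[str] = ("perfect", "good"),
) -> Dict[str, float]:
    """Return only the probes whose quality label is in `keep`.

    Probes with no entry in `quality` are dropped, on the assumption
    that an unannotated probe should not be trusted.
    """
    allowed: Set[str] = {label.lower() for label in keep}
    return {
        probe: value
        for probe, value in expression.items()
        if quality.get(probe, "").lower() in allowed
    }


# Toy values, made up for illustration only.
expression = {"probe_a": 8.1, "probe_b": 6.4, "probe_c": 7.9, "probe_d": 5.2}
quality = {"probe_a": "perfect", "probe_b": "bad", "probe_c": "good"}

retained = filter_probes(expression, quality)
print(sorted(retained))
```

Dropping unannotated probes is a conservative design choice for this sketch; an analysis interested in regions with no known transcription would instead keep them under their own label.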

Figures

Figure 1.
Annotation pipeline. Schematic of the computational pipeline flow; see the Methods section for details.
Figure 2.
Impact of annotation on expression analysis. (A) Box plots of the average expression ranks of probes, calculated across the GEO arrays, for each annotation category (box widths proportional to the number of probes in the respective category); (B) proportions of probes retained by each filtering method applied to the MAQC V1 data set (y-axis) as a function of the chosen cut-off (x-axis); (C) proportions of probes of each category (y-axis) found in a gene list of certain length arising from a differential expression analysis of the MAQC V1 data (x-axis). For panels (B) and (C), the colour code is the same as used in (A); dashed lines are associated with quality grade categories and solid lines with coding zones.
Figure 3.
Association between genotype and measured expression. (A) The sequence we examine is in the BCKDHB gene on chromosome 6, which contains a SNP (rs7740958—T/C) at position 7. The Illumina probe targeting this sequence (GI_34101271-I) contains a C at this location. There is also a probe (GI_34101266-A) targeting a constitutive splice junction of BCKDHB and matching no known SNPs. [Figure built based on UCSC Genome Browser graphics (26).] (B) Box plots of the log2 expression ratios in the Japanese HapMap population according to the rs7740958 genotype.
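The probe and SNP relationship in Figure 3 can be sketched as a simple sequence comparison: a probe that carries one allele matches transcripts with that allele perfectly and mismatches the other allele at the SNP position. The 20-base sequence below is synthetic and is not the real BCKDHB probe; `mismatch_positions` and `substitute` are hypothetical helpers, and both sequences are written on the same strand for simplicity.

```python
from typing import List


def mismatch_positions(probe: str, target: str) -> List[int]:
    """Return the 1-based positions at which probe and target differ."""
    if len(probe) != len(target):
        raise ValueError("probe and target must have the same length")
    pairs = zip(probe.upper(), target.upper())
    return [i + 1 for i, (p, t) in enumerate(pairs) if p != t]


def substitute(seq: str, position: int, base: str) -> str:
    """Return a copy of seq with the base at a 1-based position replaced."""
    return seq[: position - 1] + base + seq[position:]


# Synthetic sequence for illustration; it carries a C at position 7.
probe = "ACGTGACTTGCAAGGTCCAT"
target_c_allele = probe                       # transcript with the C allele
target_t_allele = substitute(probe, 7, "T")   # transcript with the T allele

print(mismatch_positions(probe, target_c_allele))
print(mismatch_positions(probe, target_t_allele))
```

A single mismatch of this kind is one plausible reason for measured expression to vary with genotype rather than with transcript abundance.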
Figure 4.
Effect of repetitive sequences on gene expression measurements. Volcano plot (y-axis: empirical Bayes log-odds of differential expression; x-axis: log2 fold change in expression) comparing the expression of human transcripts in livers between Tc1 mice carrying a human chromosome 21 and their wild-type litter-mates (Tc0). Blue dots depict probes targeting genes on human chromosome 21 and red dots probes comprising human RepeatMasker sequences. The dashed line is an arbitrary cut-off for differential expression based on the assumption that there are no human sequences transcribed in the wild-type mouse and, therefore, no human transcript should be overexpressed in Tc0. The P-values (Fisher’s exact test) show that the numbers of differentially expressed transcripts from human chromosome 21 (blue) and comprising human repeats (red) are both highly significant.
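For readers unfamiliar with the test named in the Figure 4 caption, the following is a minimal sketch of a one-sided Fisher's exact test on a 2x2 table, written with the standard library only. Whether the authors used a one-sided or two-sided test is not stated here, so the one-sided form is an assumption, and the counts in the usage lines are invented purely to show the call.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows: probes in the category of interest / all other probes.
    Columns: called differentially expressed / not called.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b            # probes in the category of interest
    col1 = a + c            # probes called differentially expressed
    total = a + b + c + d
    denominator = comb(total, col1)
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator


# Invented counts: 12 of 40 repeat-containing probes called, versus
# 30 of 960 remaining probes.
p_value = fisher_exact_greater(12, 28, 30, 930)
print(f"one-sided p = {p_value:.3g}")
```

The sum runs over every table at least as extreme as the observed one while keeping the row and column totals fixed, which is what makes the test exact rather than approximate.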
Figure 5.
Importance of alternative splicing on the interpretation of gene expression data. Schematics of the structure of the human CAST gene, with exons depicted by numbered boxes in light blue and all annotated alternative splicing events represented by the associated thin black dashed lines. The three upper tracks represent the logged expression levels for all the probes targeting the CAST locus for each of the three human Illumina WG platforms. Bars are positioned according to the relative position of the respective probes in the locus. Bar height is proportional to the average log gene expression in the MAQC data: red for brain samples and blue for reference samples. The coloured dashed lines indicate the respective average expression levels of probes targeting CAST for each sample type. The short coloured full lines in the middle of the V2 track indicate the expression level of CAST estimated by the original Illumina gene summarization procedure for each sample type; black stars identify V2 probes targeting CAST according to the original Illumina annotation. Black arrows identify probes mapping to the reverse strand. The three lower tracks represent the corresponding log-ratios. This figure shows that a gene-centric analysis indicating underexpression of the CAST gene in brain (dashed blue line) might be biased by a brain-specific skipping of exon 30 and/or an alternative first exon.

Cited by

  • Deconstructing Intratumoral Heterogeneity through Multiomic and Multiscale Analysis of Serial Sections.
    Schupp PG, Shelton SJ, Brody DJ, Eliscu R, Johnson BE, Mazor T, Kelley KW, Potts MB, McDermott MW, Huang EJ, Lim DA, Pieper RO, Berger MS, Costello JF, Phillips JJ, Oldham MC. Cancers (Basel). 2024 Jul 1;16(13):2429. doi: 10.3390/cancers16132429. PMID: 39001492 Free PMC article.
  • Graphical modeling of gene expression in monocytes suggests molecular mechanisms explaining increased atherosclerosis in smokers.
    Verdugo RA, Zeller T, Rotival M, Wild PS, Münzel T, Lackner KJ, Weidmann H, Ninio E, Trégouët DA, Cambien F, Blankenberg S, Tiret L. PLoS One. 2013;8(1):e50888. doi: 10.1371/journal.pone.0050888. Epub 2013 Jan 23. PMID: 23372645 Free PMC article.
  • Lymphocyte Invasion in IC10/Basal-Like Breast Tumors Is Associated with Wild-Type TP53.
    Quigley D, Silwal-Pandit L, Dannenfelser R, Langerød A, Vollan HK, Vaske C, Siegel JU, Troyanskaya O, Chin SF, Caldas C, Balmain A, Børresen-Dale AL, Kristensen V. Mol Cancer Res. 2015 Mar;13(3):493-501. doi: 10.1158/1541-7786.MCR-14-0387. Epub 2014 Oct 28. PMID: 25351767 Free PMC article.
  • Haptoglobin promoter polymorphism rs5472 as a prognostic biomarker for peptide vaccine efficacy in castration-resistant prostate cancer patients.
    Araki H, Pang X, Komatsu N, Soejima M, Miyata N, Takaki M, Muta S, Sasada T, Noguchi M, Koda Y, Itoh K, Kuhara S, Tashiro K. Cancer Immunol Immunother. 2015 Dec;64(12):1565-73. doi: 10.1007/s00262-015-1756-7. Epub 2015 Oct 1. PMID: 26428930 Free PMC article.
  • Novel loci for childhood body mass index and shared heritability with adult cardiometabolic traits.
    Vogelezang S, Bradfield JP, Ahluwalia TS, Curtin JA, Lakka TA, Grarup N, Scholz M, van der Most PJ, Monnereau C, Stergiakouli E, Heiskala A, Horikoshi M, Fedko IO, Vilor-Tejedor N, Cousminer DL, Standl M, Wang CA, Viikari J, Geller F, Íñiguez C, Pitkänen N, Chesi A, Bacelis J, Yengo L, Torrent M, Ntalla I, Helgeland Ø, Selzam S, Vonk JM, Zafarmand MH, Heude B, Farooqi IS, Alyass A, Beaumont RN, Have CT, Rzehak P, Bilbao JR, Schnurr TM, Barroso I, Bønnelykke K, Beilin LJ, Carstensen L, Charles MA, Chawes B, Clément K, Closa-Monasterolo R, Custovic A, Eriksson JG, Escribano J, Groen-Blokhuis M, Grote V, Gruszfeld D, Hakonarson H, Hansen T, Hattersley AT, Hollensted M, Hottenga JJ, Hyppönen E, Johansson S, Joro R, Kähönen M, Karhunen V, Kiess W, Knight BA, Koletzko B, Kühnapfel A, Landgraf K, Langhendries JP, Lehtimäki T, Leinonen JT, Li A, Lindi V, Lowry E, Bustamante M, Medina-Gomez C, Melbye M, Michaelsen KF, Morgen CS, Mori TA, Nielsen TRH, Niinikoski H, Oldehinkel AJ, Pahkala K, Panoutsopoulou K, Pedersen O, Pennell CE, Power C, Reijneveld SA, Rivadeneira F, Simpson A, Sly PD, Stokholm J, Teo KK, Thiering E, Timpson NJ, Uitterlinden AG, van Beijsterveldt CEM, van Schaik BDC, Vau… PLoS Genet. 2020 Oct 12;16(10):e1008718. doi: 10.1371/journal.pgen.1008718. eCollection 2020 Oct. PMID: 33045005 Free PMC article.
