BMC Bioinformatics. 2007 Jan 9:8:6.
doi: 10.1186/1471-2105-8-6.

Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices

Roger A Craig et al. BMC Bioinformatics.

Abstract

Background: Protein-protein interactions are critical for cellular functions. Recently developed computational approaches for predicting protein-protein interactions utilize co-evolutionary information of the interacting partners, e.g., correlations between distance matrices, where each matrix stores the pairwise distances between a protein and its orthologs from a group of reference genomes.
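The correlation between two distance matrices mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes both matrices are symmetric, indexed by the same ordered set of reference genomes, and compared through their strict upper triangles. The function names and the toy matrices are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (norm_x * norm_y)


def matrix_correlation(dist_a, dist_b):
    """Correlate two distance matrices defined over the same genomes."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Toy data (assumed): protein B's distances are exactly twice protein A's,
# so the two matrices are perfectly correlated.
dist_a = [[0.0, 1.0, 4.0], [1.0, 0.0, 3.0], [4.0, 3.0, 0.0]]
dist_b = [[0.0, 2.0, 8.0], [2.0, 0.0, 6.0], [8.0, 6.0, 0.0]]
score = matrix_correlation(dist_a, dist_b)
```

A high coefficient is read as evidence of co-evolution between the two proteins; every matrix element contributes with equal weight in this plain form.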

Results: We propose a novel, simple method that accounts for some of the intra-matrix correlations to improve prediction accuracy. Specifically, the phylogenetic species tree of the reference genomes is used as a guide tree for hierarchical clustering of the orthologous proteins. The distances between these clusters, derived from the original pairwise distance matrix using the Neighbor Joining algorithm, form intermediate distance matrices, which are then transformed and concatenated into a super phylogenetic vector. A support vector machine is trained and tested on pairs of proteins whose interactions are known, with each pair represented by super phylogenetic vectors. In cross-validation experiments, our method achieves an ROC score of 0.8446, a significant improvement over the 0.6587 obtained using Pearson correlations.
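To make the reported metric concrete, the sketch below computes an ROC score from prediction scores and known labels. It assumes that "ROC score" means the area under the ROC curve, computed here through the equivalent pairwise ranking statistic; the function name and the toy scores are hypothetical and unrelated to the paper's data.

```python
def roc_score(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (positive, negative) pairs in which the
    positive example is scored higher, counting ties as one half.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (assumed): two interacting pairs (label 1) and two
# non-interacting pairs (label 0).
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = roc_score(toy_scores, toy_labels)
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the two values quoted above.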

Conclusion: We have shown that the phylogenetic tree can be used as a guide to extract intra-matrix correlations in the distance matrices of orthologous proteins, where these correlations are represented as intermediate distance matrices of the ancestral orthologous proteins. Both the unsupervised and supervised learning paradigms benefit from the explicit inclusion of these intermediate distance matrices, and particularly so in the latter case, which offers a better balance between sensitivity and specificity in the prediction of protein-protein interactions.


Figures

Figure 1
Geometric interpretation of subtracting phylogenetic background. Panel A corresponds to Eq(2), where the background vector |u16s> is subtracted from a phylogenetic vector |v>; the resulting vector |ε> is guaranteed to be orthogonal to |u16s>. Panel B corresponds to Eq(2'), where the background vector |r> is subtracted from a phylogenetic vector |p>; the resulting vector |p'> may still have residual components along the direction of |r>. Panel C shows that the resulting vector |p'> may become nearly orthogonal to |r> when the length of the vector |r> is properly rescaled.
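The geometry in this caption can be illustrated numerically. Eq(2) and Eq(2') are not reproduced on this page, so the sketch assumes that Eq(2) removes the projection of the vector onto the background, that Eq(2') is a plain element-wise subtraction, and that the rescaling in Panel C multiplies the background by a scalar. All function names and vectors are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def subtract_projection(v, u):
    """Panel A (assumed form): remove the component of v along u.

    The residual is orthogonal to u by construction.
    """
    scale = dot(v, u) / dot(u, u)
    return [vi - scale * ui for vi, ui in zip(v, u)]


def subtract_scaled(p, r, alpha=1.0):
    """Panels B and C (assumed form): subtract alpha * r from p.

    alpha = 1 is the plain subtraction; other values rescale the background.
    """
    return [pi - alpha * ri for pi, ri in zip(p, r)]


v = [3.0, 4.0]           # toy phylogenetic vector
background = [1.0, 0.0]  # toy background vector

residual_a = subtract_projection(v, background)         # orthogonal to background
residual_b = subtract_scaled(v, background)             # keeps a component along it
residual_c = subtract_scaled(v, background, alpha=3.0)  # rescaled background
```

In this toy case the plain subtraction leaves an inner product of 2 with the background, while the rescaled subtraction reproduces the orthogonal residual of the projection.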
Figure 2
Illustration of how the elements of the distance matrices correspond to distances between leaves on the phylogenetic trees. In matrix A, element (i, j) corresponds to a pair of neighboring genomes, whereas element (i', j') corresponds to a pair of genomes that are distantly positioned in the protein A tree, which can be reconstructed from matrix A using standard methods such as the neighbor-joining algorithm. Likewise, elements (i, j) and (i', j') in matrix B correspond to the respective pairs of genomes in the protein B tree. When comparing two proteins A and B by calculating the Pearson correlation coefficient between the two corresponding matrices, the elements (i, j) and (i', j') should be weighted according to their "importance", as dictated by their positions in the trees. Although the two protein trees shown here have the same topology and differ only in branch lengths, in more complicated cases the tree topologies can also differ. In this study, however, the indices of the two matrices are mapped to the same tree, the species tree. The justification for using the species tree, and its effect, are explained in the text.
Figure 3
Schematic illustration of the TreeSec method for deriving a super phylogenetic vector for a given protein from the distance matrix of its orthologous proteins A, B, ..., H. Section 1 across the tree leads to four clusters of the orthologous proteins: α = {A, B}, β = {C}, δ = {D, E, F}, and γ = {G, H}. The distances among these clusters are calculated by Eqs. (4–6), resulting in an intermediate matrix. The procedure is repeated for all sections, producing more intermediate matrices. The upper triangles of the matrices are transformed into vectors and concatenated (denoted by the symbol ⊕) into a super phylogenetic vector. In this way, the phylogenetic vector is extended with extra "bits" that encode the topological information of the protein tree with reference to the species tree.
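The construction in this caption can be sketched as below. Eqs. (4–6) are not reproduced on this page, so the inter-cluster distance here is a stand-in (the mean of all pairwise distances between cluster members) and not the paper's formula. The cluster memberships per section, the ordering of the concatenation, and all names are likewise assumptions for illustration.

```python
def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def cluster_distance(dist, cluster_a, cluster_b):
    """Stand-in for Eqs. (4-6): mean pairwise distance between two clusters."""
    total = sum(dist[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def intermediate_matrix(dist, clusters):
    """Distance matrix among the clusters produced by one tree section."""
    k = len(clusters)
    matrix = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a + 1, k):
            d = cluster_distance(dist, clusters[a], clusters[b])
            matrix[a][b] = matrix[b][a] = d
    return matrix


def super_phylogenetic_vector(dist, sections):
    """Concatenate the original upper triangle with one per section."""
    vector = upper_triangle(dist)
    for clusters in sections:
        vector.extend(upper_triangle(intermediate_matrix(dist, clusters)))
    return vector


# Toy input (assumed): four orthologs indexed 0..3 and two sections of a
# guide tree, each given as a list of clusters of leaf indices.
dist = [
    [0.0, 2.0, 6.0, 8.0],
    [2.0, 0.0, 6.0, 8.0],
    [6.0, 6.0, 0.0, 4.0],
    [8.0, 8.0, 4.0, 0.0],
]
sections = [
    [[0, 1], [2], [3]],  # section yielding three clusters
    [[0, 1], [2, 3]],    # deeper section yielding two clusters
]
vector = super_phylogenetic_vector(dist, sections)
```

Here the six original distances are followed by three entries from the first section and one from the second, so the vector grows from length 6 to length 10.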
Figure 4
ROC curves for predictions using unsupervised learning with correlation coefficients and supervised learning with an SVM with a Gaussian kernel.
